2026-05-22 · scaling, infra

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang

Key claim

Tune dense once, transfer to all MoE configurations.

Complete-muE is a framework for transferring hyperparameters from dense models to Mixture-of-Experts (MoE) models. Its key result is that hyperparameters tuned once on a single dense model transfer effectively to all MoE configurations. Optimization stays stable across architectures and convergence is significantly faster, without an extensive hyperparameter search for each configuration.

Novelty
8.0/10

Complete-muE extends hyperparameter transfer to MoE setups, a setting in which existing methods do not transfer.

Reliability
7.5/10

The methodology is solid, and extensive experiments confirm the results.

Deep reliability assessment

The methodology supports the claim that hyperparameters tuned on a dense model transfer effectively to various MoE configurations with minor adjustments. The paper acknowledges, however, that the hyperparameter optima may drift because of non-strict SDE behavior, so the transferred values may not be universally optimal without further tuning.

Reproducibility

The paper does not explicitly mention open-source code or a dataset, so reproducibility depends on the detailed methodology it provides.

Discussion questions

  1. How does the assumption that hyperparameter drift is minor hold across different types of tasks and datasets?
  2. What are the practical implications for builders in terms of computational resources saved by using this transfer method?
  3. What specific experimental results or conditions would falsify the claim that Complete-muE provides near-optimal transfer across all MoE configurations?

Key figure

Figure 1 illustrates how Complete-muE transfers hyperparameters across dense FFN, dense MoE, and sparse MoE architectures, and shows that existing methods fail to transfer in these scenarios.
