🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

refactor: split nemotron_h into Bamba (dense) + GraniteMoeHybrid (MoE)

A Hub survey shows only two nemotron_h layer patterns in use:
- Dense (M,-,*): structurally identical to Bamba (Mamba2+FFN or Attn+FFN per layer)
- MoE (M,E,*): new arch, inherits from GraniteMoeHybrid
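The two families can be told apart from the layer pattern string alone. A minimal sketch, assuming a one-character-per-layer legend (M = Mamba2, * = attention, - = FFN, E = MoE) that the commit implies but does not spell out:

```python
def classify_pattern(pattern: str) -> str:
    """Map a hybrid layer pattern (one char per layer) to an arch family.

    Assumed legend: M = Mamba2 mixer, * = attention, - = dense FFN,
    E = MoE layer. Hypothetical helper, not part of the actual change.
    """
    if "E" in pattern:
        return "moe"    # (M,E,*) -> new arch inheriting GraniteMoeHybrid
    return "dense"      # (M,-,*) -> structurally identical to Bamba
```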

Changes:
- Replace NemotronHBlock+MIXER_TYPES dispatch with two explicit decoder layer
  classes (NemotronHMambaDecoderLayer, NemotronHAttentionDecoderLayer), both
  inheriting from GraniteMoeHybridDecoderLayer per transformers tenets
  ("different stages warrant explicit classes, not codepaths")
- "moe" layers are now mamba+MoE (two-stage, matching GraniteMoeHybrid) instead
  of pure-MoE single-dispatch layers
- Add NemotronHDenseConfig(PreTrainedConfig) with model_type="nemotron_h_dense"
  routing to BambaForCausalLM in auto-config (draft; weight converter needed)
- Add WeightRenaming entries in conversion_mapping.py for hub checkpoint compat
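The explicit-class dispatch in the first bullet might look roughly like this. The two class names come from the commit; the internals and the `build_layers` helper are placeholders for illustration, not the real transformers code:

```python
class NemotronHMambaDecoderLayer:
    """Placeholder for the Mamba2 layer (the real class inherits
    GraniteMoeHybridDecoderLayer)."""
    def __init__(self, config, layer_idx):
        self.layer_idx = layer_idx

class NemotronHAttentionDecoderLayer:
    """Placeholder for the attention layer, same inheritance."""
    def __init__(self, config, layer_idx):
        self.layer_idx = layer_idx

def build_layers(pattern, config=None):
    # One explicit class per layer kind, chosen up front from the
    # pattern string, instead of a single NemotronHBlock branching on a
    # MIXER_TYPES dict at runtime.
    cls_for = {
        "M": NemotronHMambaDecoderLayer,
        "*": NemotronHAttentionDecoderLayer,
    }
    return [cls_for[ch](config, i)
            for i, ch in enumerate(pattern) if ch in cls_for]
```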

Ref: https://github.com/huggingface/transformers/pull/44763#pullrequestreview-4026759334

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Arthur committed
b44745c3801f2aaae18bcda459d5fa0bcf15d05f
Parent: 0a2c4f9