refactor: split nemotron_h into Bamba (dense) + GraniteMoeHybrid (MoE)
A Hub survey shows that only two nemotron_h layer patterns exist:
- Dense (M,-,*): structurally identical to Bamba (Mamba2+FFN or Attn+FFN per layer)
- MoE (M,E,*): a new architecture, inheriting from GraniteMoeHybrid
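The two families can be distinguished from the layer-pattern string alone. A minimal sketch (the helper name and the character meanings are assumptions for illustration, not code from this PR):

```python
# Hypothetical helper: classify a nemotron_h layer pattern into the two
# families the Hub survey found. Assumed pattern characters:
#   M = Mamba2 mixer, * = attention, - = dense FFN, E = MoE block.
def classify_pattern(pattern: str) -> str:
    dense_chars = {"M", "-", "*"}
    chars = set(pattern)
    if chars <= dense_chars:
        return "dense"  # structurally identical to Bamba
    if chars <= dense_chars | {"E"}:
        return "moe"    # maps onto GraniteMoeHybrid
    extra = sorted(chars - dense_chars - {"E"})
    raise ValueError(f"unknown layer types: {extra}")
```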
Changes:
- Replace NemotronHBlock+MIXER_TYPES dispatch with two explicit decoder layer
classes (NemotronHMambaDecoderLayer, NemotronHAttentionDecoderLayer), both
inheriting from GraniteMoeHybridDecoderLayer per transformers tenets
("different stages warrant explicit classes, not codepaths")
- "moe" layers are now mamba+MoE (two-stage, matching GraniteMoeHybrid) instead
of pure-MoE single-dispatch layers
- Add NemotronHDenseConfig(PreTrainedConfig) with model_type="nemotron_h_dense"
routing to BambaForCausalLM in auto-config (draft; weight converter needed)
- Add WeightRenaming entries in conversion_mapping.py for hub checkpoint compat
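The dispatch change above can be sketched as follows. This is illustrative only (plain classes stand in for the real `nn.Module` subclasses, and the mapping shown is an assumption, not the PR's code):

```python
# Instead of one NemotronHBlock branching on a MIXER_TYPES string at
# runtime, each pattern character resolves to an explicit layer class
# once, at model construction.
class NemotronHMambaDecoderLayer:
    """Mamba2 mixer stage, then FFN (or MoE) stage."""

class NemotronHAttentionDecoderLayer:
    """Self-attention stage, then FFN (or MoE) stage."""

LAYER_FOR_CHAR = {
    "M": NemotronHMambaDecoderLayer,
    "*": NemotronHAttentionDecoderLayer,
}

def layer_classes(pattern: str) -> list:
    # Assumption for this sketch: "-" (FFN) and "E" (MoE) select the
    # second stage of a two-stage layer rather than standalone layers,
    # matching the GraniteMoeHybrid structure described above.
    return [LAYER_FOR_CHAR[c] for c in pattern if c in LAYER_FOR_CHAR]
```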
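A renaming entry maps old hub checkpoint keys onto the new module paths. A hedged sketch of the idea (the actual WeightRenaming entries live in conversion_mapping.py; both key patterns below are assumptions):

```python
import re

# Illustrative key renames: assumed nemotron_h-style prefixes rewritten
# to Bamba/GraniteMoeHybrid-style ones. Not the real mapping table.
RENAMES = [
    (re.compile(r"^backbone\."), "model."),  # assumed top-level prefix rename
    (re.compile(r"\.mixer\."), ".mamba."),   # assumed per-layer module rename
]

def rename_key(key: str) -> str:
    for pat, repl in RENAMES:
        key = pat.sub(repl, key)
    return key
```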
Ref: https://github.com/huggingface/transformers/pull/44763#pullrequestreview-4026759334
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Committed by Arthur (b44745c3801f2aaae18bcda459d5fa0bcf15d05f, parent 0a2c4f9)