🤗 Transformers: the model-definition framework for state-of-the-art machine learning models for text, vision, audio, and multimodal tasks, supporting both inference and training.
🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722)
* introducing a tensor parallel test mixin to catch TP-related errors
* Remove test file for tensor parallel functionality
* Refactor dense and MoE test scripts for parallel execution and improved GPU management
  - Updated `run_dense_tests.sh` and `run_moe_tests.sh` to support parallel execution of tests using available GPU pairs.
  - Changed variable names for clarity, replacing `NUM_GPUS` with `GPUS_PER_TEST`.
  - Enhanced output messages to reflect the number of parallel test slots and GPU usage.
  - Implemented logic to handle skipped tests and updated result reporting to include skipped counts.
  - Removed `TensorParallelTesterMixin` from `CausalLMModelTest` and integrated it into `ModelTesterMixin` for better structure in test classes.
* restore
* add all-reduce for EP
* fix init and bias sharding
* fix finalize weight init
* add full stack tracing
* fix
* add report to run tests
* okay, big improvement here
* the only case the shard index should be used is when we are actually collecting for mergeModuleList
* more fixes
* fix EP forward for GPT-OSS
* add a test that triggers the weight converter or only dynamic loading
* Update test scripts to use new tensor parallel test keyword
  - Modified `run_dense_tests.sh` and `run_moe_tests.sh` to change the pytest keyword from "test_tensor_parallel" to "test_tp_" for improved test targeting.
  - Cleaned up comments and removed unused code in `test_tensor_parallel_mixin.py` to streamline the testing process and enhance readability.
* cleaning + find_port + remove comments
* revert some shit
* when you are stupid sometimes you really need a brain :) :) :) :)
* fix TP
* OK, GPT-OSS is fixed now
* try to fix perms
* test only causal LM
* attempt to fix
* am I a doomer and AI is not that bad?
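The `find_port` cleanup above (and the later "fix port conflict in test" entry) points at a common pattern when several `torch.distributed` test slots run in parallel: each run needs its own rendezvous port. A minimal sketch of such a helper, using only the standard library (`find_free_port` is a hypothetical name, not necessarily the function added in this PR):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0.

    Useful as a MASTER_PORT when launching several distributed
    process groups in parallel test slots, so runs don't collide.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> the OS picks a free port
        return s.getsockname()[1]
```

Note the usual caveat: the port can be taken by another process between discovery and use, so callers typically retry on a bind failure rather than assume the port stays free.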
* fix * it "passes" but the output is shit * style my man * outputs are gonna be giberish but at least the forward pass "works" * dtyle * fix mixtral * okay shape fixes * tensor idx is only for groupped gemm / EP * fix gate_up shard * fix :) * revert some EP changes that are breaking other stuff * style * fix solar open tp * trigger test on deepseek v3 * fix glm4_moe tp * fix glm4 moe lite tensor parallel * fix longcat and glm4_moe_lite by all reducing gradients of k_rot * fix ernie4_5_moe * fix qwen3 by all reduce grads of q_norm * fix deepseek v3 tp (need a constant dropout other different RNG + all_reduce backward for K rotary) * Rename ReplicatedInTP to ReplicatedWithGradAllReduce and update references in tensor_parallel.py * fix minimax_m2 * fix deepseek v2 for TP * fix minimax * fix qwen3_next for TP * fix dots1 tp * fix flex_olmo TP * fix qwen3 tp dense * fix exaone4 tp * fix gemma3 tp * fix apterus TP * fix seed_oss tp by setting 0 to dropout * fix gemma3n for TP * dropout set to 0 for test + gradient slicing depending on fused weights or not * make fixup + glm4 important fix on tp plan to avoid assigning wrong TP plan * linting * remove shell scripts * make test tensor parallel triggering the CI * fix ci * fix ci * mark it as ep_plan * add @require_torch_multi_accelerator * fix CI * undo pr merge tensor parallel * revert core model loading file * revert modeling_utils file * small fix in modeling_utils * Update tensor parallel test configurations to enable tests by default and standardize seed values for reproducibility. 
* linting
* Reorganize imports in modeling_utils.py to maintain consistency
* fix qwen3_5_moe TP
* fix glm_moe_dsa TP
* fix qwen3_5 TP
* Add training_overfit_steps parameter to Gemma3nTextModelTest
* fix 16-byte alignment
* Add WeightConverter for gate_up_proj and down_proj with 16-byte alignment in checkpoint mapping
* Add solar_open mapping with WeightConverter for gate_up_proj and down_proj, ensuring 16-byte alignment
* Update hub metadata (#43892)
* update
* reorder
* Add MlaKvAProjParallel layer for MLA attention and update TP plans
  - Introduced MlaKvAProjParallel class to handle kv_a_proj_with_mqa in tensor parallelism.
  - Updated prepare_module_tp methods to accept a model parameter for better integration.
  - Adjusted base_model_tp_plan in various configurations to include mla_kv_a_proj.
  - Removed redundant all_reduce_backward calls from DeepseekV2 and DeepseekV3 attention implementations.
* fix doc
* force 16-byte alignment
* fix slice tensor
* more doc
* better abstraction for zero experts
* linting
* refactor
* remove redundancy in tests
* simplify
* revert
* fix gemma2
* fix
* make tests work only on CPU
* linting
* skip tests for run_slow
* cleaning
* cleaning
* enhance doc on dynamic weight loading
* pass a config instead of the model for TP
* more doc in tensor parallel for MlaKvAProjParallel
* use -1 instead of self.num_heads; this way, when TP is used, it can infer the local num_heads size
* fix modular glm_moe_dsa
* collect all gradient failure tests before stopping at the first one
* generate more max new tokens for tensor parallel tests as the models are small

Co-authored-by: Arthur <[email protected]>

* compare generated tokens for tensor parallel tests
* use the config attr as much as possible
* add TP + quantized tests
* raise an error if the attr does not exist, to say it should be added to the auto mapping
* update doc
* install torchao for TP + quantization tests
* update doc
* update doc
* update doc
* update doc
* update doc
* update doc
* partially fix TP + quantization generation
* partially fix TP + quantize
* skip some TP + quantized tests for now
* guard torchao import for test_training_ci
* Update src/transformers/models/longcat_flash/modular_longcat_flash.py

Co-authored-by: Arthur <[email protected]>

* move file
* fix linting
* fix linting
* fix port conflict in test

---------

Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Raushan Turganbay <[email protected]>
Co-authored-by: Arthur <[email protected]>
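Several entries above enforce 16-byte alignment when fused weights such as gate_up_proj and down_proj are sharded across ranks. The arithmetic behind such a constraint can be sketched as follows (`align_up` and `shard_offsets` are hypothetical helpers, not the WeightConverter API itself; the sketch only shows how rounding each shard's byte size up to a 16-byte multiple keeps every rank's shard starting on an aligned boundary):

```python
ALIGNMENT = 16  # bytes; alignment some fused/quantized kernels require

def align_up(nbytes: int, alignment: int = ALIGNMENT) -> int:
    """Round a byte count up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment

def shard_offsets(numel: int, itemsize: int, tp_size: int):
    """Byte offset of each rank's shard, each on an aligned boundary.

    Illustrative only: split `numel` elements of `itemsize` bytes over
    `tp_size` ranks, padding each shard so the next one stays aligned.
    """
    per_rank = numel // tp_size
    stride = align_up(per_rank * itemsize)  # padded shard size in bytes
    return [r * stride for r in range(tp_size)]

# A bf16 (2-byte) weight of 1000 elements over 4 ranks: 500 raw bytes
# per shard are padded to 512, so every offset is 16-byte aligned.
offsets = shard_offsets(numel=1000, itemsize=2, tp_size=4)
assert all(o % ALIGNMENT == 0 for o in offsets)
```

Without the padding, rank 1's shard would start at byte 500, which is not a multiple of 16 and would trip any kernel that assumes aligned loads.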
Ferdinand Mom committed
f49c720f52a08ad68f9f1d299cf65e7125d2e359
Parent: 5c1c72b
Committed by GitHub <[email protected]>
on 3/4/2026, 8:57:50 AM