fix(fp8): route ModelMixin through hook-based path to survive partial load (#9231)
Diffusers' enable_layerwise_casting() installs a LayerwiseCastingHook that (a) only casts dtype in pre_forward, not device, and (b) replaces Linear.forward with an instance-level wrapper that calls the original Linear.forward captured before the hook was installed.

ModelCache.put() later runs apply_custom_layers_to_model, which constructs a new CustomLinear sharing the original Linear's __dict__, so the diffusers wrapper carries over and routes calls to the captured original forward, silently bypassing CustomLinear.forward and its cast_to_device autocast.

With partial loading (e.g. FLUX.2 Klein 9B on a constrained GPU), some Linear weights stay on CPU. The diffusers pre_forward only casts dtype, so F.linear then sees input on cuda:0 and weight on cpu and raises "Expected all tensors to be on the same device".

Route every nn.Module, including ModelMixin, through _apply_fp8_to_nn_module, which uses register_forward_pre_hook / register_forward_hook(always_call=True). nn.Module._call_impl dispatches these around forward without replacing it, so CustomLinear.forward is still reached and cast_to_device moves the weight to the input device.

Lose diffusers' _disable_peft_input_autocast in the process, which is irrelevant: InvokeAI patches LoRAs through CustomLinear's _patches_and_weights, not PEFT BaseTunerLayer.

Add regression test that asserts the ModelMixin branch calls _apply_fp8_to_nn_module and not enable_layerwise_casting.

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
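
The dispatch problem described above can be reproduced without torch. The following is a minimal sketch, not the InvokeAI or diffusers code: the Linear and CustomLinear classes, wrap_forward, adopt, and the pre_hooks list are all hypothetical stand-ins that only mimic instance-level forward replacement, __dict__ sharing, and pre-hook dispatch.

```python
class Linear:
    """Toy stand-in for torch.nn.Linear (hypothetical; no torch required)."""

    def __init__(self):
        self.pre_hooks = []

    def forward(self, x):
        return f"Linear.forward({x})"

    def __call__(self, x):
        # Loosely mimics nn.Module._call_impl: run the pre-hooks first,
        # then look up `forward` through normal attribute resolution.
        for hook in self.pre_hooks:
            x = hook(self, x)
        return self.forward(x)


class CustomLinear(Linear):
    """Toy stand-in for a subclass whose forward must not be bypassed."""

    def forward(self, x):
        return f"CustomLinear.forward({x})"


def wrap_forward(module):
    """Replace forward on the instance, capturing the current bound method."""
    original = module.forward  # bound to Linear.forward at capture time

    def wrapper(x):
        return original(x)

    module.forward = wrapper  # lands in module.__dict__


def adopt(cls, module):
    """Build a `cls` instance that shares `module`'s __dict__ (assumed
    analogue of constructing a custom layer from an existing one)."""
    new = cls.__new__(cls)
    new.__dict__ = module.__dict__
    return new


# Wrapper approach: the instance attribute shadows CustomLinear.forward.
wrapped = Linear()
wrap_forward(wrapped)
result_wrapped = adopt(CustomLinear, wrapped)("x")

# Hook approach: forward is never replaced, so the subclass method runs.
hooked = Linear()
hooked.pre_hooks.append(lambda mod, x: f"cast({x})")
result_hooked = adopt(CustomLinear, hooked)("x")

print(result_wrapped)  # Linear.forward(x)
print(result_hooked)   # CustomLinear.forward(cast(x))
```

In this sketch the wrapper survives the __dict__ hand-off because an instance attribute takes precedence over a plain method defined on the class, whereas the hook list leaves the forward lookup untouched.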
Alexander Eichhorn committed
811103e1c3ede247b9bda19e0b3cab8ccd989b2b
Parent: e5dac65
Committed by GitHub <noreply@github.com>
on 5/26/2026, 8:26:25 PM