fix(fp8): route ModelMixin through hook-based path to survive partial load (#9231)

Diffusers' enable_layerwise_casting() installs a LayerwiseCastingHook that
(a) only casts dtype in pre_forward, not device, and (b) replaces Linear.forward
with an instance-level wrapper that calls the original Linear.forward captured
before the hook was installed. ModelCache.put() later runs
apply_custom_layers_to_model, which constructs a new CustomLinear sharing the
original Linear's __dict__ — so the diffusers wrapper carries over and routes
calls to the captured original forward, silently bypassing CustomLinear.forward
and its cast_to_device autocast.
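
The bypass described above can be reproduced without torch. The sketch below is
a simplified illustration, not the diffusers or InvokeAI code: Linear,
CustomLinear, install_wrapper and wrap_sharing_dict are hypothetical stand-ins,
and the only assumption taken from this message is that the new object shares
the original instance's __dict__.

```python
class Linear:
    """Stand-in for nn.Linear: __call__ dispatches to self.forward."""

    def forward(self, x):
        return f"Linear.forward({x})"

    def __call__(self, x):
        return self.forward(x)


class CustomLinear(Linear):
    """Stand-in for the custom layer whose forward should always run."""

    def forward(self, x):
        return f"CustomLinear.forward({x})"


def install_wrapper(module):
    """Mimic an instance-level forward replacement (hypothetical helper)."""
    original_forward = module.forward  # bound to the original Linear instance

    def wrapper(x):
        return original_forward(x)

    # Stored in module.__dict__, so it shadows any class-level forward.
    module.forward = wrapper


def wrap_sharing_dict(module):
    """Build a CustomLinear that shares the original instance's __dict__."""
    custom = CustomLinear.__new__(CustomLinear)
    custom.__dict__ = module.__dict__
    return custom


hooked = Linear()
install_wrapper(hooked)
bypassed = wrap_sharing_dict(hooked)
bypassed_result = bypassed("x")  # wrapper came along in the shared __dict__

clean_result = wrap_sharing_dict(Linear())("x")  # no wrapper installed
```

Because attribute lookup checks the instance __dict__ before the class for
plain functions, bypassed_result is "Linear.forward(x)" while clean_result is
"CustomLinear.forward(x)".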

With partial loading (e.g. FLUX.2 Klein 9B on a constrained GPU), some Linear
weights stay on CPU. The diffusers pre_forward only casts dtype, so F.linear
then sees input on cuda:0 and weight on cpu and raises
"Expected all tensors to be on the same device".

Route every nn.Module — including ModelMixin — through _apply_fp8_to_nn_module,
which uses register_forward_pre_hook / register_forward_hook(always_call=True).
nn.Module._call_impl dispatches these around forward without replacing it, so
CustomLinear.forward is still reached and cast_to_device moves the weight to
the input device. This drops diffusers' _disable_peft_input_autocast, which is
irrelevant here: InvokeAI patches LoRAs through CustomLinear's
_patches_and_weights, not PEFT's BaseTunerLayer.

Add a regression test asserting that the ModelMixin branch calls
_apply_fp8_to_nn_module and not enable_layerwise_casting.

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

Alexander Eichhorn committed 23d ago

811103e1c3ede247b9bda19e0b3cab8ccd989b2b

Parent: e5dac65

Committed by GitHub <noreply@github.com> on 5/26/2026, 8:26:25 PM