COMMITS
October 26, 2025
[Cute,Bwd,Sm100] Implement cluster
Tri Dao committed
[Cute,Bwd,Sm100] Remove delay_tma_store option
Tri Dao committed
[Cute] Change utils.view_transpose back
Tri Dao committed
[Cute,Bwd,Sm100] Reduce sync
Tri Dao committed
[Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr
Tri Dao committed
[Cute,Bwd,Sm100] Tune registers
Tri Dao committed
[Cute] Add store_shared_remote_fp32x4 util function
Tri Dao committed
October 25, 2025
[Cute,Bwd] Enable bwd benchmarks
Tri Dao committed
[Cute,Bwd,Sm100] Enable bwd tests
Tri Dao committed
[Cute,Bwd,Sm100] Causal mask
Tri Dao committed
[Cute,Bwd,Sm100] Simplify layouts in compute_loop
Tri Dao committed
[Cute,Bwd,Sm100] Make postprocessing work, add interface
Tri Dao committed
October 24, 2025
[Cute,Sm100] In gemm ptx, add to base smem_address instead
Tri Dao committed
[Cute,Fwd,Sm100] Fix interface w score mod to get it to run
Tri Dao committed
Fix FA3 segfault with custom CUDA streams in ABI stable build (#1957)
Kevin Wang committed
[CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (#1961)
Reuben Stern committed
October 22, 2025
[Cute,Bwd,Sm100] More cleanup
Tri Dao committed
October 21, 2025
[Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx
Tri Dao committed
cutlass v4.3.0 (#1952)
Johnny committed
Block Sparsity and Flex Attention mask mod support (#1942)
Reuben Stern committed
[CuteDSL] Fix hash function for cute.jit decorator (#1953)
Driss Guessous committed
Fix hopper cuda 13 build (#1949)
Kevin Wang committed
[Cute,Bwd,Sm100] Add option for delay tma store
Tri Dao committed
[Cute,Bwd,Sm100] Hardcode dS_stage = 1
Tri Dao committed
[Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables
Tri Dao committed
[Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1
Tri Dao committed
[Cute,Bwd,Sm100] Clean up compute fn
Tri Dao committed
October 20, 2025
[Cute,Bwd,Sm100] Try gemm_ptx
Tri Dao committed
[Cute,Bwd,Sm100] sdQaccum doesn't need swizzle
Tri Dao committed
[Cute,Bwd,Sm100] All compute warps wait for lse_barrier
Tri Dao committed