CUDA: use tensor cores for MMQ (#7676)

* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early

Johannes Gäßler committed 2y ago

1f0dabda8d5c131f9d4632aa41de74317cdd61fb

Parent: af4ae50

Committed by GitHub <noreply@github.com> on 6/10/2024, 9:45:13 AM