Hi there, I was reading the docs and saw:

> Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.

But I'm curious: would it make sense to set `-DSD_FLASH_ATTN=ON` for the Mac, Linux, and other non-CUBLAS builds in the CI matrix? For example:
- build: "noavx"
defines: "-DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx2"
defines: "-DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx"
defines: "-DGGML_AVX2=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx512"
defines: "-DGGML_AVX512=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "cuda12"
Thanks!