## This is a fork of qwopqwop200's repository meant for stable usage in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
This package uses [import redirection](https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/blob/main/gptq_for_llama/__init__.py) to allow for easier integration with existing projects.

[Oobabooga's fork](https://github.com/oobabooga/GPTQ-for-LLaMa) is used by default when a compatible GPU is detected.
[qwopqwop200's 'cuda' branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) is used for GPUs older than Pascal.
AMD-compatible conversions of both are available courtesy of [WapaMario63](https://github.com/WapaMario63): [GPTQ-for-LLaMa-ROCm](https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm).
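For reference, here is a rough sketch of the Pascal check described above, using PyTorch. The function below is purely illustrative and is not the package's actual selection code:

```python
import torch

def pick_gptq_branch() -> str:
    """Illustrative only: mirrors the rule described above, not the package's real logic."""
    if not torch.cuda.is_available():
        return "new"  # assumption: fall back to the default fork when no CUDA device is present
    major, _minor = torch.cuda.get_device_capability(0)
    # Pascal GPUs report compute capability 6.x; older cards use the 'cuda' branch.
    return "new" if major >= 6 else "old"
```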
Python modules can be imported as if they were in the main package, and the appropriate version will be selected:
```python
import gptq_for_llama.llama_inference_offload
from gptq_for_llama.modelutils import find_layers
from gptq_for_llama.quant import make_quant
```
This can be overridden by setting the `QUANT_CUDA_OVERRIDE` environment variable to either `old` or `new` before importing.
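For example, a minimal sketch of forcing the older kernels from Python, assuming the variable is read when the package is first imported:

```python
import os

# Must be set before gptq_for_llama is imported for the first time.
os.environ["QUANT_CUDA_OVERRIDE"] = "old"

import gptq_for_llama.llama_inference_offload
```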
There is also an experimental function for switching versions on the fly:
```python
from gptq_for_llama import switch_gptq
# Switch to the 'new' version, then import submodules as usual.
switch_gptq('new')
import gptq_for_llama.llama_inference_offload
```
Limited testing showed reliable swapping of versions. However, this may not work when swapping models repeatedly.
# GPTQ-for-LLaMA
4-bit quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323)