Commit 789779b

add llm evaluate for language modeling (#1350)
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: YIYANGCAI <yiyang.cai@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
1 parent 0a06448 commit 789779b


41 files changed: +868 additions, −3713 deletions

README.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ In particular, the tool provides the key features, typical examples, and open co

 * Support a wide range of Intel hardware such as [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Xeon CPU Max Series](https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html), [Intel Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html), and [Intel Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) with extensive testing; support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

-* Validate popular LLMs such as LLama2, [LLama](examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static), [MPT](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/README.md), [Falcon](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/language-modeling/quantization/README.md), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies
+* Validate popular LLMs such as [LLama2](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Falcon](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies

 * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)
docs/source/smooth_quant.md

Lines changed: 1 addition & 1 deletion

@@ -373,7 +373,7 @@ A list of models that achieved a <1% accuracy drop is shown below.
 Please note that for models with asterisk(*), we have set all add ops to FP32 during quantization step to achieve desirable results.
 ## Example

-User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) on how to use smooth quant.
+User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to use smooth quant.

 ```python
 recipes = {
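
For context, the `recipes` dictionary shown in that snippet is passed to `PostTrainingQuantConfig`. A minimal sketch of enabling smooth quant through that API; the alpha value and the `model`/`calib_dataloader` objects are illustrative placeholders:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Enable smooth quant; alpha balances how much activation-outlier
# scaling is migrated into the weights.
conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # illustrative value
    }
)

# `model` and `calib_dataloader` are assumed to be a loaded FP32 torch
# model and a calibration dataloader prepared by the caller.
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```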

examples/.config/model_params_pytorch.json

Lines changed: 70 additions & 42 deletions

@@ -450,20 +450,83 @@
       "main_script": "run_clm.py",
       "batch_size": 8
     },
-    "gpt_j_wikitext_weight_only":{
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_weight_only",
-      "dataset_location": "",
-      "input_model": "/tf_dataset2/models/pytorch/gpt-j-6B",
-      "main_script": "run_clm.py",
-      "batch_size": 8
-    },
     "gpt_neox":{
       "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/fx",
       "dataset_location": "/tf_dataset/pytorch/glue_data_new/oscar",
       "input_model": "/tf_dataset2/models/huggingface/gpt-neox-japanese-2.7b",
       "main_script": "run_clm.py",
       "batch_size": 8
     },
+    "opt_125m_woq_awq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_woq_gptq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_woq_teq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_ipex":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "bloom_560m_ipex_sq": {
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "bigscience/bloom-560m",
+      "batch_size": 1,
+      "main_script": "run_clm_no_trainer.py"
+    },
+    "llama2_7b_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "gpt_j_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "gpt_j_woq_rtn":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "falcon_7b_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
     "xlm-roberta-base_MRPC": {
       "model_src_dir": "nlp/huggingface_models/text-classification/quantization/ptq_static/fx",
       "dataset_location": "",

@@ -583,41 +646,6 @@
       "main_script": "run_glue.py",
       "batch_size": 64
     },
-    "bloom-560m_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "bigscience/bloom-560m",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "bloom-176b_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "bigscience/bloom",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "opt-125m_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "facebook/opt-125m",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "opt-6.7b_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "facebook/opt-6.7b",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "gpt-j-6B_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "EleutherAI/gpt-j-6B",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
     "wide_resnet101_2_fx": {
       "model_src_dir": "oob_models/gen-efficientnet-pytorch",
       "dataset_location": "/tf_dataset/pytorch/ImageNet/raw",

examples/README.md

Lines changed: 4 additions & 4 deletions

@@ -664,13 +664,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
   <td>EleutherAI/gpt-j-6B</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>EleutherAI/gpt-j-6B</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Weight Only Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_weight_only">weight_only</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">weight_only</a></td>
 </tr>
 <tr>
   <td>abeja/gpt-neox-japanese-2.7b</td>

@@ -682,13 +682,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
   <td>bigscience/bloom</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>facebook/opt</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>SD Diffusion</td>
(new file)

Lines changed: 175 additions & 0 deletions
Step-by-Step
============

This document provides step-by-step instructions for running large language models (LLMs) on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script `run_clm_no_trainer.py` currently supports quantization of `GPTJ`, `OPT`, `LLaMA2`, `BLOOM`, and `Falcon`, and validates last-word-prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git); more models are being added.
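
For reference, last-word-prediction accuracy can also be computed programmatically through the harness pinned in `requirements.txt`; a minimal sketch, where the task list and batch size are illustrative:

```python
from lm_eval import evaluator

# Evaluate a Hugging Face causal LM on lambada_openai; the "hf-causal"
# model type and model_args format follow the harness at the pinned commit.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"]["lambada_openai"])
```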

# Prerequisite
## 1. Create Environment
```bash
# Installation
pip install -r requirements.txt
```

# Run

Here is how to run the scripts:

**Causal Language Modeling (CLM)**

`run_clm_no_trainer.py` quantizes large language models using the [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) dataset for calibration, and validates accuracy on `lambada_openai`, `piqa`, `winogrande`, `hellaswag`, and other datasets provided by lm_eval. Example commands are shown below.
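
As an illustration, loading the calibration corpus is a one-liner with the `datasets` library; the tokenization step below is a hypothetical preprocessing sketch, not code taken from this example:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Calibration corpus used by the example commands below.
calib_data = load_dataset("NeelNanda/pile-10k", split="train")

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

def tokenize(batch):
    # Fixed-length truncation; the real script's preprocessing may differ.
    return tokenizer(batch["text"], truncation=True, max_length=512)

calib_data = calib_data.map(tokenize, batched=True)
```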

### GPT-J-6b

#### Quantization
```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --quantize \
  --sq \
  --alpha 1.0 \
  --output_dir "saved_results" \
  --ipex
```

**Notes**: Smooth quantization here is based on torch.jit. Without past key values in `example_inputs`, the quantized model cannot be used for text generation. For the text-generation task, please refer to [this link](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization).

```bash
# "--approach weight_only" is used to enable weight-only quantization.
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --quantize \
  --approach weight_only \
  --woq_bits 4 \
  --woq_group_size 128 \
  --woq_scheme asym \
  --woq_algo RTN \
  --woq_enable_mse_search \
  --output_dir "saved_results"
```
**Notes**: Weight-only quantization based on fake quantization is currently supported in preview, covering the RTN, GPTQ [1], AWQ [2], and TEQ algorithms. For more details, please refer to [this link](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md).
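
For orientation, the `--woq_*` flags above map onto a weight-only `PostTrainingQuantConfig`; a minimal sketch, with the catch-all op pattern assumed for illustration:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# 4-bit asymmetric RTN with group size 128, mirroring the
# --woq_bits/--woq_group_size/--woq_scheme/--woq_algo flags above.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched op types
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "RTN",
            },
        },
    },
)

# `model` is assumed to be a loaded FP32 Hugging Face causal LM.
q_model = quantization.fit(model, conf)
```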

#### Accuracy with lm_eval
```bash
# INT8 accuracy (the "--int8" and "--output_dir" flags load the saved int8 model)
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
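
Under the hood, passing `--int8` with `--output_dir` reloads the saved int8 checkpoint; a minimal sketch of doing the same directly, assuming the `saved_results` directory produced by the quantization command above:

```python
import transformers
from neural_compressor.utils.pytorch import load

# Rebuild the FP32 architecture, then restore the int8 state saved by
# the quantization run into "saved_results".
fp32_model = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
int8_model = load("saved_results", fp32_model)
```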

### OPT-1.3b/2.7b/6.7b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model facebook/opt-2.7b \
  --quantize \
  --sq \
  --alpha 0.5 \
  --ipex \
  --output_dir "saved_results" \
  --int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model facebook/opt-2.7b \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### LLaMA2-7b/13b/30b
> Note: LLaMA2 requires IPEX >= 2.1 to get better accuracy.
#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model meta-llama/Llama-2-7b-hf \
  --quantize \
  --sq \
  --alpha 0.8 \
  --ipex \
  --output_dir "saved_results" \
  --int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model meta-llama/Llama-2-7b-hf \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### BLOOM
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
  --model bigscience/bloom-560m \
  --quantize \
  --ipex \
  --sq \
  --alpha 0.5 \
  --output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model bigscience/bloom-560m \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### Falcon-7b
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
  --model tiiuae/falcon-7b-instruct \
  --quantize \
  --sq \
  --alpha 0.5 \
  --output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model tiiuae/falcon-7b-instruct \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
(new file)

Lines changed: 13 additions & 0 deletions

accelerate
protobuf
sentencepiece != 0.1.92
datasets >= 1.1.3
torch >= 1.10
transformers
pytest
wandb
einops
neural-compressor
intel-extension-for-transformers
git+https://github.com/EleutherAI/lm-evaluation-harness.git@83dbfbf6070324f3e5872f63e49d49ff7ef4c9b3
git+https://github.com/huggingface/peft.git@6c44096c7b8d55a2ecf24be9bc68393467e1584a
(new file)

Lines changed: 13 additions & 0 deletions

python examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py \
  --model facebook/opt-125m \
  --dataset NeelNanda/pile-10k \
  --seed 0 \
  --quantize \
  --approach weight_only \
  --woq_algo GPTQ \
  --woq_bits 4 \
  --woq_group_size 128 \
  --gptq_pad_max_length 2048 \
  --gptq_use_max_length \
  --gptq_gpu \
  --gptq_debug
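
For orientation, the `--gptq_*` flags above roughly correspond to GPTQ-specific arguments in a weight-only `PostTrainingQuantConfig`; in the sketch below the `gptq_args` recipe keys are assumptions inferred from the CLI flags, not an API confirmed by this commit:

```python
from neural_compressor import PostTrainingQuantConfig

# Sketch only: the "gptq_args" keys mirror --gptq_pad_max_length and
# --gptq_use_max_length and are assumed rather than confirmed.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {"bits": 4, "group_size": 128, "algorithm": "GPTQ"},
        },
    },
    recipes={
        "gptq_args": {"pad_max_length": 2048, "use_max_length": True},
    },
)
```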
