Commit 789779b

add llm evaluate for language modeling (#1350)
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: YIYANGCAI <yiyang.cai@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
1 parent 0a06448 commit 789779b


41 files changed: +868 additions, −3713 deletions

README.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ In particular, the tool provides the key features, typical examples, and open co

 * Support a wide range of Intel hardware such as [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Xeon CPU Max Series](https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html), [Intel Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/flex-series.html), and [Intel Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) with extensive testing; support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

-* Validate popular LLMs such as LLama2, [LLama](examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static), [MPT](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/README.md), [Falcon](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/language-modeling/quantization/README.md), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies
+* Validate popular LLMs such as [LLama2](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Falcon](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [GPT-J](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [Bloom](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), [OPT](/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm), and more than 10,000 broad models such as [Stable Diffusion](/examples/pytorch/nlp/huggingface_models/text-to-image/quantization), [BERT-Large](/examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx), and [ResNet50](/examples/pytorch/image_recognition/torchvision_models/quantization/ptq/cpu/fx) from popular model hubs such as [Hugging Face](https://huggingface.co/), [Torch Vision](https://pytorch.org/vision/stable/index.html), and [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging zero-code optimization solution [Neural Coder](/neural_coder#what-do-we-offer) and automatic [accuracy-driven](/docs/source/design.md#workflow) quantization strategies

 * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)
docs/source/smooth_quant.md

Lines changed: 1 addition & 1 deletion

@@ -373,7 +373,7 @@ A list of models that achieved a <1% accuracy drop is shown below.
 Please note that for models with asterisk(*), we have set all add ops to FP32 during quantization step to achieve desirable results.
 ## Example

-User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant/README.md) on how to use smooth quant.
+User could refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm) on how to use smooth quant.

 ```python
 recipes = {
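
For context, the `recipes` dictionary shown in that snippet is passed to `PostTrainingQuantConfig`. A minimal sketch of enabling smooth quant through that API; the alpha value and the `model`/`calib_dataloader` objects are illustrative placeholders:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Enable smooth quant; alpha balances how much activation-outlier
# scaling is migrated into the weights.
conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # illustrative value
    }
)

# `model` and `calib_dataloader` are assumed to be a loaded FP32 torch
# model and a calibration dataloader prepared by the caller.
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```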

examples/.config/model_params_pytorch.json

Lines changed: 70 additions & 42 deletions

@@ -450,20 +450,83 @@
       "main_script": "run_clm.py",
       "batch_size": 8
     },
-    "gpt_j_wikitext_weight_only":{
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_weight_only",
-      "dataset_location": "",
-      "input_model": "/tf_dataset2/models/pytorch/gpt-j-6B",
-      "main_script": "run_clm.py",
-      "batch_size": 8
-    },
     "gpt_neox":{
       "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/fx",
       "dataset_location": "/tf_dataset/pytorch/glue_data_new/oscar",
       "input_model": "/tf_dataset2/models/huggingface/gpt-neox-japanese-2.7b",
       "main_script": "run_clm.py",
       "batch_size": 8
     },
+    "opt_125m_woq_awq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_woq_gptq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_woq_teq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_ipex":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "opt_125m_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 8
+    },
+    "bloom_560m_ipex_sq": {
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "bigscience/bloom-560m",
+      "batch_size": 1,
+      "main_script": "run_clm_no_trainer.py"
+    },
+    "llama2_7b_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "gpt_j_ipex_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "gpt_j_woq_rtn":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
+    "falcon_7b_sq":{
+      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/llm",
+      "dataset_location": "",
+      "input_model": "",
+      "main_script": "run_clm_no_trainer.py",
+      "batch_size": 1
+    },
     "xlm-roberta-base_MRPC": {
       "model_src_dir": "nlp/huggingface_models/text-classification/quantization/ptq_static/fx",
       "dataset_location": "",

@@ -583,41 +646,6 @@
       "main_script": "run_glue.py",
       "batch_size": 64
     },
-    "bloom-560m_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "bigscience/bloom-560m",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "bloom-176b_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "bigscience/bloom",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "opt-125m_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "facebook/opt-125m",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "opt-6.7b_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "facebook/opt-6.7b",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
-    "gpt-j-6B_sq": {
-      "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant",
-      "dataset_location": "",
-      "input_model": "EleutherAI/gpt-j-6B",
-      "batch_size": 1,
-      "main_script": "eval_lambada.py"
-    },
     "wide_resnet101_2_fx": {
       "model_src_dir": "oob_models/gen-efficientnet-pytorch",
       "dataset_location": "/tf_dataset/pytorch/ImageNet/raw",

examples/README.md

Lines changed: 4 additions & 4 deletions

@@ -664,13 +664,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
   <td>EleutherAI/gpt-j-6B</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/fx">fx</a> / <a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>EleutherAI/gpt-j-6B</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Weight Only Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_weight_only">weight_only</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">weight_only</a></td>
 </tr>
 <tr>
   <td>abeja/gpt-neox-japanese-2.7b</td>

@@ -682,13 +682,13 @@ Intel® Neural Compressor validated examples with multiple compression technique
   <td>bigscience/bloom</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>facebook/opt</td>
   <td>Natural Language Processing</td>
   <td>Post-Training Static Quantization</td>
-  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_static/ipex/smooth_quant">smooth quant</a></td>
+  <td><a href="./pytorch/nlp/huggingface_models/language-modeling/quantization/llm">smooth quant</a></td>
 </tr>
 <tr>
   <td>SD Diffusion</td>
(new file)

Lines changed: 175 additions & 0 deletions
Step-by-Step
============

This document provides step-by-step instructions for running large language models (LLMs) on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.

The script `run_clm_no_trainer.py` currently supports quantization of `GPTJ`, `OPT`, `LLaMA2`, `BLOOM`, and `Falcon`, and validates last-word-prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git); more models are being added.
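
For reference, last-word-prediction accuracy can also be computed programmatically through the harness pinned in `requirements.txt`; a minimal sketch, where the task list and batch size are illustrative:

```python
from lm_eval import evaluator

# Evaluate a Hugging Face causal LM on lambada_openai; the "hf-causal"
# model type and model_args format follow the harness at the pinned commit.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"]["lambada_openai"])
```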

# Prerequisite
## 1. Create Environment
```bash
# Installation
pip install -r requirements.txt
```

# Run

Here is how to run the scripts:

**Causal Language Modeling (CLM)**

`run_clm_no_trainer.py` quantizes large language models using the [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) dataset for calibration, and validates accuracy on `lambada_openai`, `piqa`, `winogrande`, `hellaswag`, and other datasets provided by lm_eval. Example commands are shown below.
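
As an illustration, loading the calibration corpus is a one-liner with the `datasets` library; the tokenization step below is a hypothetical preprocessing sketch, not code taken from this example:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Calibration corpus used by the example commands below.
calib_data = load_dataset("NeelNanda/pile-10k", split="train")

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

def tokenize(batch):
    # Fixed-length truncation; the real script's preprocessing may differ.
    return tokenizer(batch["text"], truncation=True, max_length=512)

calib_data = calib_data.map(tokenize, batched=True)
```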

### GPT-J-6b

#### Quantization
```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --quantize \
  --sq \
  --alpha 1.0 \
  --output_dir "saved_results" \
  --ipex
```

**Notes**: Smooth quantization here is based on torch.jit. Without past key values in `example_inputs`, the quantized model cannot be used for text generation. For the text-generation task, please refer to [this link](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization).

```bash
# "--approach weight_only" is used to enable weight-only quantization.
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --quantize \
  --approach weight_only \
  --woq_bits 4 \
  --woq_group_size 128 \
  --woq_scheme asym \
  --woq_algo RTN \
  --woq_enable_mse_search \
  --output_dir "saved_results"
```
**Notes**: Weight-only quantization based on fake quantization is currently supported in preview, covering the RTN, GPTQ [1], AWQ [2], and TEQ algorithms. For more details, please refer to [this link](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md).
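
For orientation, the `--woq_*` flags above map onto a weight-only `PostTrainingQuantConfig`; a minimal sketch, with the catch-all op pattern assumed for illustration:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# 4-bit asymmetric RTN with group size 128, mirroring the
# --woq_bits/--woq_group_size/--woq_scheme/--woq_algo flags above.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched op types
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "RTN",
            },
        },
    },
)

# `model` is assumed to be a loaded FP32 Hugging Face causal LM.
q_model = quantization.fit(model, conf)
```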

#### Accuracy with lm_eval
```bash
# INT8 accuracy (the "--int8" and "--output_dir" flags load the saved int8 model)
python run_clm_no_trainer.py \
  --model EleutherAI/gpt-j-6B \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```
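
Under the hood, passing `--int8` with `--output_dir` reloads the saved int8 checkpoint; a minimal sketch of doing the same directly, assuming the `saved_results` directory produced by the quantization command above:

```python
import transformers
from neural_compressor.utils.pytorch import load

# Rebuild the FP32 architecture, then restore the int8 state saved by
# the quantization run into "saved_results".
fp32_model = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
int8_model = load("saved_results", fp32_model)
```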

### OPT-1.3b/2.7b/6.7b

#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model facebook/opt-2.7b \
  --quantize \
  --sq \
  --alpha 0.5 \
  --ipex \
  --output_dir "saved_results" \
  --int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model facebook/opt-2.7b \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### LLaMA2-7b/13b/30b
> Note: LLaMA2 requires IPEX >= 2.1 to get better accuracy.
#### Quantization

```bash
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
  --model meta-llama/Llama-2-7b-hf \
  --quantize \
  --sq \
  --alpha 0.8 \
  --ipex \
  --output_dir "saved_results" \
  --int8_bf16_mixed
```

#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model meta-llama/Llama-2-7b-hf \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### BLOOM
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
  --model bigscience/bloom-560m \
  --quantize \
  --ipex \
  --sq \
  --alpha 0.5 \
  --output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model bigscience/bloom-560m \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --ipex \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

### Falcon-7b
#### Quantization
```bash
# "--sq" is used to enable smooth quant
python run_clm_no_trainer.py \
  --model tiiuae/falcon-7b-instruct \
  --quantize \
  --sq \
  --alpha 0.5 \
  --output_dir "saved_results"
```
#### Accuracy with lm_eval
```bash
python run_clm_no_trainer.py \
  --model tiiuae/falcon-7b-instruct \
  --accuracy \
  --batch_size 112 \
  --tasks "lambada_openai" \
  --int8 \
  --output_dir "saved_results" # load int8 model
# To validate the FP32 model, remove "--int8" and "--output_dir".
```

[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
(new file)

Lines changed: 13 additions & 0 deletions

accelerate
protobuf
sentencepiece != 0.1.92
datasets >= 1.1.3
torch >= 1.10
transformers
pytest
wandb
einops
neural-compressor
intel-extension-for-transformers
git+https://github.com/EleutherAI/lm-evaluation-harness.git@83dbfbf6070324f3e5872f63e49d49ff7ef4c9b3
git+https://github.com/huggingface/peft.git@6c44096c7b8d55a2ecf24be9bc68393467e1584a
(new file)

Lines changed: 13 additions & 0 deletions

python examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py \
  --model facebook/opt-125m \
  --dataset NeelNanda/pile-10k \
  --seed 0 \
  --quantize \
  --approach weight_only \
  --woq_algo GPTQ \
  --woq_bits 4 \
  --woq_group_size 128 \
  --gptq_pad_max_length 2048 \
  --gptq_use_max_length \
  --gptq_gpu \
  --gptq_debug
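
For orientation, the `--gptq_*` flags above roughly correspond to GPTQ-specific arguments in a weight-only `PostTrainingQuantConfig`; in the sketch below the `gptq_args` recipe keys are assumptions inferred from the CLI flags, not an API confirmed by this commit:

```python
from neural_compressor import PostTrainingQuantConfig

# Sketch only: the "gptq_args" keys mirror --gptq_pad_max_length and
# --gptq_use_max_length and are assumed rather than confirmed.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {"bits": 4, "group_size": 128, "algorithm": "GPTQ"},
        },
    },
    recipes={
        "gptq_args": {"pad_max_length": 2048, "use_max_length": True},
    },
)
```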
