Commit aad03d5

xin3he and xinhe3 authored
Enhance example for HPU performance (#2043)
* Enhance example for HPU performance
* Update run_clm_no_trainer.py
* Remove wikitext to avoid OOM for llama2-7b at bs=8

Signed-off-by: xinhe3 <xinhe3@habana.ai>
Co-authored-by: xinhe3 <xinhe3@habana.ai>
1 parent c186708 · commit aad03d5

File tree

2 files changed: +20 −10 lines


examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/README.md

Lines changed: 9 additions & 9 deletions
@@ -55,22 +55,23 @@ python run_clm_no_trainer.py \
 ```
 ### Evaluation
 
+> Note: `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false` is an experimental flag that yields better performance for uint4; it will be removed in a future release.
+
 ```bash
 # original model
 python run_clm_no_trainer.py \
     --model meta-llama/Llama-2-7b-hf \
     --accuracy \
     --batch_size 8 \
-    --tasks "lambada_openai,wikitext" \
-    --output_dir saved_results
+    --tasks "lambada_openai"
 
 # quantized model
-python run_clm_no_trainer.py \
+SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_clm_no_trainer.py \
     --model meta-llama/Llama-2-7b-hf \
-    --load \
     --accuracy \
     --batch_size 8 \
-    --tasks "lambada_openai,wikitext" \
+    --tasks "lambada_openai" \
+    --load \
     --output_dir saved_results
 ```
 
@@ -81,15 +82,14 @@ python run_clm_no_trainer.py \
 python run_clm_no_trainer.py \
     --model meta-llama/Llama-2-7b-hf \
     --performance \
-    --batch_size 8 \
-    --output_dir saved_results
+    --batch_size 8
 
 # quantized model
-python run_clm_no_trainer.py \
+SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_clm_no_trainer.py \
     --model meta-llama/Llama-2-7b-hf \
-    --load \
     --performance \
     --batch_size 8 \
+    --load \
     --output_dir saved_results
 ```

examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py

Lines changed: 11 additions & 1 deletion
@@ -14,6 +14,10 @@
 from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
 from neural_compressor.torch.utils import is_hpex_available
 
+if is_hpex_available():
+    import habana_frameworks.torch.core as htcore  # pylint: disable=E0401
+    htcore.hpu_set_inference_env()
+
 parser = argparse.ArgumentParser()
 parser.add_argument(
     "--model", nargs="?", default="EleutherAI/gpt-j-6b"
@@ -44,7 +48,7 @@
                     help="Pad input ids to max length.")
 parser.add_argument("--calib_iters", default=512, type=int,
                     help="calibration iters.")
-parser.add_argument("--tasks", default="lambada_openai,hellaswag,winogrande,piqa,wikitext",
+parser.add_argument("--tasks", default="lambada_openai,hellaswag,winogrande,piqa",
                     type=str, help="tasks for accuracy validation")
 parser.add_argument("--peft_model_id", type=str, default=None, help="model_name_or_path of peft model")
 # ============WeightOnly configs===============
@@ -501,6 +505,12 @@ def run_fn_for_gptq(model, dataloader_for_calibration, *args):
 user_model, tokenizer = get_user_model()
 
 
+if is_hpex_available():
+    from habana_frameworks.torch.hpu import wrap_in_hpu_graph
+    user_model = user_model.to(torch.bfloat16)
+    wrap_in_hpu_graph(user_model, max_graphs=10)
+
+
 if args.accuracy:
     user_model.eval()
     from neural_compressor.evaluation.lm_eval import evaluate, LMEvalParser
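Read together, the two Python hunks implement a common HPU inference recipe: call `htcore.hpu_set_inference_env()` before the model is instantiated, then cast the loaded model to bfloat16 and wrap it in HPU graphs to cut per-step host launch overhead. Below is a minimal standalone sketch of that recipe, not part of this commit: the model name, the `.to("hpu")` device moves, the reassignment of `wrap_in_hpu_graph`'s return value, and the smoke-test forward pass are illustrative assumptions, and it presumes a Gaudi machine with the Habana PyTorch stack installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.utils import is_hpex_available

if is_hpex_available():
    import habana_frameworks.torch.core as htcore  # pylint: disable=E0401
    # As in the diff: prepare the HPU backend for inference *before*
    # any model is created.
    htcore.hpu_set_inference_env()

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if is_hpex_available():
    from habana_frameworks.torch.hpu import wrap_in_hpu_graph
    # bf16 is the fast native dtype on Gaudi; wrap_in_hpu_graph captures and
    # replays computation graphs, with max_graphs bounding the graph cache.
    model = model.to(torch.bfloat16).to("hpu")  # device move is an assumption
    model = wrap_in_hpu_graph(model, max_graphs=10)

model.eval()
inputs = tokenizer("Habana Gaudi is", return_tensors="pt")
if is_hpex_available():
    inputs = {k: v.to("hpu") for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits  # single forward pass as a smoke test
print(logits.shape)
```

Graph capture happens on the first call for a given input shape, so the usual practice is to run a warm-up iteration before taking performance numbers.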

0 commit comments
