
Commit adaa6bb

doc: specify device_map argument in the examples (#1621)
SUMMARY:
This PR implements the change proposed in #1620. Specifying the `device_map` argument in the `from_pretrained` method is more coherent with the description published in the example:
```
The model is first loaded onto the cpu, as indicated through the use of None for the device_map argument in the from_pretrained method when loading the model.
```
TEST PLAN:
Documentation change only.

---------

Signed-off-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Soren Dreano <soren@numind.ai>
1 parent 2c41df7 commit adaa6bb
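The semantics the commit message relies on can be sketched as follows. `resolve_device_map` is a hypothetical helper written only to illustrate the behavior; it is not the actual `transformers` implementation of `from_pretrained`.

```python
# Illustrative sketch only: a hypothetical helper showing the placement
# semantics that `device_map` controls. NOT the transformers implementation.
def resolve_device_map(device_map):
    """Mimic how `device_map` steers initial placement in `from_pretrained`.

    * None   -> the whole model is materialized on the CPU; no dispatch.
    * "auto" -> placement is delegated to an accelerate-style planner.
    * dict   -> an explicit module-to-device mapping is used as given.
    """
    if device_map is None:
        # Everything on CPU, matching the README sentence quoted above.
        return {"": "cpu"}
    if device_map == "auto":
        # Placeholder standing in for automatic multi-device placement.
        return "delegate-to-planner"
    return device_map


# Passing None is therefore an explicit "load on CPU first" request,
# which the updated examples now spell out instead of leaving implicit.
print(resolve_device_map(None))
```

In other words, the doc change makes the code match the prose: the example text already claimed the model starts on the CPU because of `device_map=None`, so the call site now passes that argument explicitly.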

File tree

2 files changed: +7 -3 lines changed


examples/big_models_with_sequential_onloading/README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -18,7 +18,7 @@ The Llama 3.3 70b is larger than 80 GB, surpassing the size of 1 A100. However,
 
 ```python
 model_id = "meta-llama/Llama-3.3-70B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map=None)
 ```
 
 The model is first loaded onto the `cpu`, as indicated through the use of `None` for the `device_map` argument in the `from_pretrained` method when loading the model.
@@ -42,4 +42,4 @@ output = model.generate(**sample, max_new_tokens=100)
 print(tokenizer.decode(output[0]))
 ```
 
-Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.
+Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.
````

examples/big_models_with_sequential_onloading/llama3.3_70b.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -8,7 +8,11 @@
 
 # Select model and load it.
 model_id = "meta-llama/Llama-3.3-70B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype="auto",
+    device_map=None,
+)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Select calibration dataset.
```

0 commit comments
