@@ -123,23 +123,67 @@ Our github contains many useful docs on working with different aspects of PyTorc
 
## Getting Started
 
+ Below are guides for two modes:
+
+ - Single process: one Python interpreter controlling a single GPU/TPU at a time
+ - Multi process: N Python interpreters are launched, corresponding to the N GPU/TPUs
+   found on the system
+
+ Another mode is SPMD, where one Python interpreter controls all N GPU/TPUs found on
+ the system. Multi processing is more complex and is not compatible with SPMD. This
+ tutorial does not dive into SPMD. For more on that, check our
+ [SPMD guide](https://github.com/pytorch/xla/blob/master/docs/source/perf/spmd_basic.md).
+
+ ### Simple single process
+
+ To update your existing training loop, make the following changes:
+
+ ```diff
+ + import torch_xla
+
+ def train(model, training_data, ...):
+   ...
+   for inputs, labels in train_loader:
+ +     with torch_xla.step():
+ +       inputs, labels = inputs.to('xla'), labels.to('xla')
+       optimizer.zero_grad()
+       outputs = model(inputs)
+       loss = loss_fn(outputs, labels)
+       loss.backward()
+       optimizer.step()
+
+ +   torch_xla.sync()
+   ...
+
+ if __name__ == '__main__':
+   ...
+ +   # Move the model parameters to your XLA device
+ +   model.to('xla')
+   train(model, training_data, ...)
+   ...
+ ```
+
+ The changes above should get your model to train on the TPU.
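+
+ For reference, here is the full loop with those changes applied, as a runnable
+ sketch. The toy model, optimizer, and dataset below are hypothetical placeholders,
+ not part of the original example:
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import DataLoader, TensorDataset
+
+ import torch_xla
+
+ # Hypothetical toy model and data, just to make the sketch runnable.
+ model = nn.Linear(10, 2)
+
+ # Move the model parameters to the XLA device before creating the optimizer.
+ model.to('xla')
+
+ dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
+ train_loader = DataLoader(dataset, batch_size=8)
+
+ loss_fn = nn.CrossEntropyLoss()
+ optimizer = optim.SGD(model.parameters(), lr=0.01)
+
+ for inputs, labels in train_loader:
+     # step() delimits one training step; the pending lazy operations are
+     # materialized when the context exits.
+     with torch_xla.step():
+         inputs, labels = inputs.to('xla'), labels.to('xla')
+         optimizer.zero_grad()
+         outputs = model(inputs)
+         loss = loss_fn(outputs, labels)
+         loss.backward()
+         optimizer.step()
+
+ torch_xla.sync()
+ ```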
+
+ ### Multi processing
+
To update your existing training loop, make the following changes:
 
```diff
- import torch.multiprocessing as mp
- + import torch_xla as xla
+ + import torch_xla
+ import torch_xla.core.xla_model as xm
 
def _mp_fn(index):
  ...
 
+   # Move the model parameters to your XLA device
- +   model.to(xla.device())
+ +   model.to(torch_xla.device())
 
  for inputs, labels in train_loader:
- +     with xla.step():
+ +     with torch_xla.step():
+       # Transfer data to the XLA device. This happens asynchronously.
- +       inputs, labels = inputs.to(xla.device()), labels.to(xla.device())
+ +       inputs, labels = inputs.to(torch_xla.device()), labels.to(torch_xla.device())
      optimizer.zero_grad()
      outputs = model(inputs)
      loss = loss_fn(outputs, labels)
@@ -150,8 +194,8 @@ To update your existing training loop, make the following changes:
 
if __name__ == '__main__':
-   mp.spawn(_mp_fn, args=(), nprocs=world_size)
- +   # xla.launch automatically selects the correct world size
- +   xla.launch(_mp_fn, args=())
+ +   # torch_xla.launch automatically selects the correct world size
+ +   torch_xla.launch(_mp_fn, args=())
```
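+
+ Applied end to end, the multi process version looks like the following runnable
+ sketch. The toy model and dataset are hypothetical placeholders, not part of the
+ original example:
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import DataLoader, TensorDataset
+
+ import torch_xla
+
+
+ def _mp_fn(index):
+     # Each of the N launched processes runs this function with its own index.
+     device = torch_xla.device()
+
+     # Hypothetical toy model and data, just to make the sketch runnable.
+     model = nn.Linear(10, 2).to(device)
+     dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
+     train_loader = DataLoader(dataset, batch_size=8)
+
+     loss_fn = nn.CrossEntropyLoss()
+     optimizer = optim.SGD(model.parameters(), lr=0.01)
+
+     for inputs, labels in train_loader:
+         with torch_xla.step():
+             # Transfer data to the XLA device. This happens asynchronously.
+             inputs, labels = inputs.to(device), labels.to(device)
+             optimizer.zero_grad()
+             outputs = model(inputs)
+             loss = loss_fn(outputs, labels)
+             loss.backward()
+             optimizer.step()
+
+
+ if __name__ == '__main__':
+     # torch_xla.launch automatically selects the correct world size.
+     torch_xla.launch(_mp_fn, args=())
+ ```
+
+ As in the diff above, each process optimizes its copy of the model independently;
+ see the `DistributedDataParallel` changes below to synchronize gradients across
+ processes.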
 
If you're using `DistributedDataParallel`, make the following changes: