Updated note #1470

Merged 1 commit on May 4, 2025
@@ -8,6 +8,9 @@ The distributed training workload is assigned to a project and is affected by th

To learn more about the distributed training workload type in Run:ai and determine whether it is the most suitable workload type for your goals, see [Workload types](../../overviews/workload-types.md).

+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

![](../../img/training-workload.png)

## Creating a distributed training workload

@@ -4,6 +4,9 @@ This article provides a step-by-step walkthrough for running a PyTorch distribut

Distributed training splits the training of a model across multiple processors. Each processor is called a worker. Workers run in parallel to speed up model training, and a master coordinates the workers.

+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
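
For orientation, the following is a minimal sketch of the worker/master pattern described above; it is not the training script used in this quick start. It assumes the standard PyTorch environment-variable initialization (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`), which the PyTorch training operator typically injects into each pod; the model, sizes, and hyperparameters are arbitrary placeholders.

```python
# Minimal DistributedDataParallel sketch (illustrative only).
# Every worker runs this same script; rank 0 acts as the master.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")

    # LOCAL_RANK is set by torchrun/the operator; default to 0 for a single GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda()
    ddp_model = DDP(model)  # gradients are averaged across all workers

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(10):
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(32, 10).cuda()).sum()
        loss.backward()  # triggers the cross-worker all-reduce
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```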

## Prerequisites

Before you start, make sure:
docs/admin/runai-setup/self-hosted/k8s/backend.md (2 changes: 1 addition & 1 deletion)

@@ -68,7 +68,7 @@ The Run:ai control plane chart includes multiple sub-charts of third-party compo

### PostgreSQL

-If you have opted to connect to an [external PostgreSQL database](../../self-hosted-installation/installation/cp-system-requirements.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
+If you have opted to connect to an [external PostgreSQL database](prerequisites.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:

1. Disable PostgreSQL deployment - `postgresql.enabled`
2. Run:ai connection details - `global.postgresql.auth`
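
In a Helm values file, these overrides might look like the following sketch. Only `postgresql.enabled` and the `global.postgresql.auth` section come from the parameters above; the individual keys under `auth` and all values are placeholders, so consult the chart's values file for the authoritative names.

```yaml
# Hedged sketch of external-PostgreSQL overrides; the host, credentials,
# and key names under `auth` are placeholders, not authoritative chart values.
postgresql:
  enabled: false               # disable the bundled PostgreSQL sub-chart
global:
  postgresql:
    auth:
      host: postgres.example.com
      port: 5432
      username: runai
      password: "<your-password>"
      database: backend
```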
docs/admin/runai-setup/self-hosted/ocp/backend.md (2 changes: 1 addition & 1 deletion)

@@ -66,7 +66,7 @@ The Run:ai control plane chart includes multiple sub-charts of third-party compo

### PostgreSQL

-If you have opted to connect to an [external PostgreSQL database](../../self-hosted-installation/installation/cp-system-requirements.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
+If you have opted to connect to an [external PostgreSQL database](./prerequisites.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:

1. Disable PostgreSQL deployment - `postgresql.enabled`
2. Run:ai connection details - `global.postgresql.auth`
docs/platform-admin/workloads/overviews/workload-types.md (4 changes: 4 additions & 0 deletions)

@@ -40,8 +40,12 @@ As models mature and the need for more robust data processing and model training

Training tasks demand high memory, compute power, and storage. Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project's deserved quota. If you do not want your training workload to be preempted, make sure to specify a number of GPUs that is within your project's quota.


See [Standard training](../../../Researcher/workloads/training/standard-training/trainings-v2.md) and [Distributed training](../../../Researcher/workloads/training/distributed-training/distributed-training.md) to learn more about how to submit a training workload via the Run:ai UI. For quick starts, see [Run your first standard training](../../../Researcher/workloads/training/standard-training/quickstart-standard-training.md) and [Run your first distributed training](../../../Researcher/workloads/training/distributed-training/quickstart-distributed-training.md).

+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

## Inference: deploying and serving models

Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.