diff --git a/docs/Researcher/workloads/training/distributed-training/distributed-training.md b/docs/Researcher/workloads/training/distributed-training/distributed-training.md
index e23e9381fe..5aad68d90f 100644
--- a/docs/Researcher/workloads/training/distributed-training/distributed-training.md
+++ b/docs/Researcher/workloads/training/distributed-training/distributed-training.md
@@ -8,6 +8,9 @@ The distributed training workload is assigned to a project and is affected by th
 
 To learn more about the distributed training workload type in Run:ai and determine that it is the most suitable workload type for your goals, see [Workload types](../../overviews/workload-types.md).
 
+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
+
 ![](../../img/training-workload.png)
 
 ## Creating a distributed training workload
diff --git a/docs/Researcher/workloads/training/distributed-training/quickstart-distributed-training.md b/docs/Researcher/workloads/training/distributed-training/quickstart-distributed-training.md
index b6205520fc..88eccf757b 100644
--- a/docs/Researcher/workloads/training/distributed-training/quickstart-distributed-training.md
+++ b/docs/Researcher/workloads/training/distributed-training/quickstart-distributed-training.md
@@ -4,6 +4,9 @@ This article provides a step-by-step walkthrough for running a PyTorch distribut
 
 Distributed training is the ability to split the training of a model among multiple processors. Each processor is called a worker. Worker nodes work in parallel to speed up model training. There is also a master which coordinates the workers.
 
+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
+
 ## Prerequisites
 
 Before you start, make sure:
diff --git a/docs/admin/runai-setup/self-hosted/k8s/backend.md b/docs/admin/runai-setup/self-hosted/k8s/backend.md
index 8b9fde2967..9b0aa93231 100644
--- a/docs/admin/runai-setup/self-hosted/k8s/backend.md
+++ b/docs/admin/runai-setup/self-hosted/k8s/backend.md
@@ -68,7 +68,7 @@ The Run:ai control plane chart includes multiple sub-charts of third-party compo
 
 ### PostgreSQL
 
-If you have opted to connect to an [external PostgreSQL database](../../self-hosted-installation/installation/cp-system-requirements.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
+If you have opted to connect to an [external PostgreSQL database](prerequisites.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
 
 1. Disable PostgreSQL deployment - `postgresql.enabled`
 2. Run:ai connection details - `global.postgresql.auth`
diff --git a/docs/admin/runai-setup/self-hosted/ocp/backend.md b/docs/admin/runai-setup/self-hosted/ocp/backend.md
index 5e2e7e217c..3473657d88 100644
--- a/docs/admin/runai-setup/self-hosted/ocp/backend.md
+++ b/docs/admin/runai-setup/self-hosted/ocp/backend.md
@@ -66,7 +66,7 @@ The Run:ai control plane chart includes multiple sub-charts of third-party compo
 
 ### PostgreSQL
 
-If you have opted to connect to an [external PostgreSQL database](../../self-hosted-installation/installation/cp-system-requirements.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
+If you have opted to connect to an [external PostgreSQL database](./prerequisites.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
 
 1. Disable PostgreSQL deployment - `postgresql.enabled`
 2. Run:ai connection details - `global.postgresql.auth`
diff --git a/docs/platform-admin/workloads/overviews/workload-types.md b/docs/platform-admin/workloads/overviews/workload-types.md
index fa1a3b1f48..79fa415375 100644
--- a/docs/platform-admin/workloads/overviews/workload-types.md
+++ b/docs/platform-admin/workloads/overviews/workload-types.md
@@ -40,8 +40,12 @@ As models mature and the need for more robust data processing and model training
 
 Training tasks demand high memory, compute power, and storage. Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project’s deserved quota. Make sure that if you wish your training workload not to be preempted, specify the number of GPU’s that are in your quota.
 
+
 See [Standard training](../../../Researcher/workloads/training/standard-training/trainings-v2.md) and [Distributed training](../../../Researcher/workloads/training/distributed-training/distributed-training.md) to learn more about how to submit a training workload via the Run:ai UI. For quick starts, see [Run your first standard training](../../../Researcher/workloads/training/standard-training/quickstart-standard-training.md) and [Run your first distributed training](../../../Researcher/workloads/training/distributed-training/quickstart-distributed-training.md).
 
+!!! Note
+    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
+
 ## Inference: deploying and serving models
 
 Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.
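Reviewer note: both `backend.md` hunks name the same two parameters for an external database. A hedged sketch of how they might appear in a Helm values file — only `postgresql.enabled` and `global.postgresql.auth` come from the doc; the `auth` sub-keys shown follow common Bitnami sub-chart conventions and are assumptions, to be verified against the chart's values schema:

```yaml
# Sketch only: disable the bundled PostgreSQL and point Run:ai at an
# external database. Sub-keys under "auth" are assumed, not confirmed.
postgresql:
  enabled: false            # 1. disable the in-cluster PostgreSQL deployment
global:
  postgresql:
    auth:                   # 2. Run:ai connection details
      username: runai
      password: "<db-password>"
      database: runai
```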
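Reviewer note: the quickstart text in this diff describes the master/worker pattern — workers process shards of the batch in parallel while a master coordinates them. A minimal framework-free sketch of that coordination step (plain Python standing in for `torch.distributed`; `worker_gradient` and `master_step` are illustrative names, not Run:ai or PyTorch APIs):

```python
# Conceptual sketch of data-parallel training: each "worker" computes a
# gradient on its own shard of the batch; the "master" averages them and
# applies one update, mimicking an all-reduce. Illustrative only --
# real distributed jobs use torch.distributed / torchrun.

def worker_gradient(shard, weight):
    """One worker: mean gradient of (w*x - y)^2 w.r.t. w on its shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def master_step(shards, weight, lr=0.05):
    """Master: gather per-worker gradients, average, apply one update."""
    grads = [worker_gradient(s, weight) for s in shards]  # parallel in practice
    avg = sum(grads) / len(grads)                         # the "all-reduce"
    return weight - lr * avg

# Toy data from y = 3x, split across two workers (two "nodes").
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = master_step(shards, w)
print(round(w, 3))  # converges toward 3.0
```

Splitting the shards across nodes rather than across GPUs in one node is exactly the distinction the added `!!! Note` admonitions draw.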