Added self-hosted bcm install #1456
Open: SherinDaher-Runai wants to merge 25 commits into v2.19 from Self-hosted-install-BCM. Changes shown from 18 of 25 commits.
Commits:

- a7d30fa Added self-hosted bcm install (SherinDaher-Runai)
- fbb3ab6 a (ozRunAI)
- 7245ad6 Updated requirements (SherinDaher-Runai)
- 35b999a Updates (SherinDaher-Runai)
- 2efcde9 Updated screenshot (SherinDaher-Runai)
- 37af33c Updated (SherinDaher-Runai)
- 417cdee Update system-requirements.md (SherinDaher-Runai)
- b6adb7b Updates (SherinDaher-Runai)
- c5d8b65 Update system-requirements.md (SherinDaher-Runai)
- 631c3ed Updates (SherinDaher-Runai)
- 1d8eadc Update system-requirements.md (SherinDaher-Runai)
- bc08906 Update system-requirements.md (SherinDaher-Runai)
- 790a0d1 A (ozRunAI)
- a977c12 Updates (SherinDaher-Runai)
- 33277b0 Updated (SherinDaher-Runai)
- be29cbd Updated yaml files (SherinDaher-Runai)
- 42bfd31 a (ozRunAI)
- fbcae90 a (ozRunAI)
- d30ac3b Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
- 3be3084 Update docs/admin/runai-setup/self-hosted/bcm/install-control-plane.md (ozRunAI)
- 8fe5190 Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
- d1fd0be Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
- c48f750 Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
- 30ff87d Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
- a2b6a35 Update docs/admin/runai-setup/self-hosted/bcm/system-requirements.md (ozRunAI)
```yaml
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-ingress
  namespace: metallb-system
spec:
  ipAddressPools:
  - ingress-pool
  nodeSelectors:
  - matchLabels:
      node-role.kubernetes.io/runai-system: "true"
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.0.250-192.168.0.251 # Example range of two IP addresses
  autoAssign: false
  serviceAllocation:
    priority: 50
    namespaces:
    - ingress-nginx
    - knative-serving
```
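Because `autoAssign` is false and `serviceAllocation` limits the pool to the `ingress-nginx` and `knative-serving` namespaces, a Service must request an address from this pool explicitly. A hypothetical example (the Service name and selector are illustrative, not part of this PR):

```yaml
# Hypothetical LoadBalancer Service in one of the allowed namespaces,
# requesting an address from ingress-pool via the MetalLB annotation.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller   # illustrative name
  namespace: ingress-nginx
  annotations:
    metallb.universe.tf/address-pool: ingress-pool
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - port: 443
    targetPort: 443
```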
**docs/admin/runai-setup/self-hosted/bcm/files/networkoperator.txt** (24 additions)
```yaml
deployCR: true
nfd:
  enabled: true
ofedDriver:
  deploy: false
psp:
  enabled: false
rdmaSharedDevicePlugin:
  deploy: false
secondaryNetwork:
  cniPlugins:
    deploy: true
  deploy: true
  ipamPlugin:
    deploy: false
  multus:
    deploy: true
  nvIpam:
    deploy: true
sriovDevicePlugin:
  deploy: false
sriovNetworkOperator:
  enabled: true
```
# Install the Cluster

## System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

* Test these requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking
* Look at additional components installed and analyze their relevance to a successful installation

For more information, see [preinstall diagnostics](https://github.com/run-ai/preinstall-diagnostics). To run the preinstall diagnostics tool, [download](https://runai.jfrog.io/ui/native/pd-cli-prod/preinstall-diagnostics-cli/) the latest version, and run:

```bash
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN}
# If the diagnostics image is hosted in a private registry, also pass:
#   --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
#   --image ${PRIVATE_REGISTRY_IMAGE_URL}
```
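The `${CONTROL_PLANE_FQDN}` and `${CLUSTER_FQDN}` variables referenced by the command above must be set beforehand. A minimal sketch with placeholder values (the domain names are illustrative, not defaults):

```shell
# Placeholder values; replace with the FQDNs planned for your
# control plane and cluster (see the system requirements page).
export CONTROL_PLANE_FQDN="runai.example.com"
export CLUSTER_FQDN="runai-cluster.example.com"
echo "diagnostics will target ${CONTROL_PLANE_FQDN} / ${CLUSTER_FQDN}"
```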
## Helm

NVIDIA Run:ai requires [Helm](https://helm.sh/) 3.14 or later. To install Helm, see [Installing Helm](https://helm.sh/docs/intro/install/).

## Permissions

A Kubernetes user with the `cluster-admin` role is required to ensure a successful installation. For more information, see [Using RBAC authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).

## Installation

Follow the steps below to add a new cluster.

!!! Note
    When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Select **Same as control plane**
6. Click **Continue**
**Installing NVIDIA Run:ai Cluster**

The next steps walk through the NVIDIA Run:ai cluster installation:

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster
2. Append `--set global.customCA.enabled=true` to the Helm installation command
3. Click **DONE**

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

!!! Tip
    Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles](https://docs.run.ai/v2.19/admin/config/access-roles/).

!!! Note
    To customize the installation based on your environment, see [Customize cluster installation](../../cluster-setup/customize-cluster-install.md).

## Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenarios below.

### Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

```bash
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh
```

### Cluster Status

If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to **Connected**, check the cluster [troubleshooting scenarios](../../troubleshooting/troubleshooting.md#cluster-health).
**docs/admin/runai-setup/self-hosted/bcm/install-control-plane.md** (43 additions)
# Install the Control Plane

Installing the NVIDIA Run:ai control plane requires Internet connectivity.

## System and Network Requirements

Before installing the NVIDIA Run:ai control plane, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

## Permissions

As part of the installation, you will be required to install the NVIDIA Run:ai control plane [Helm chart](https://helm.sh/). The Helm charts require Kubernetes administrator permissions. You can review the exact objects that are created by the charts using the `--dry-run` flag on both Helm charts.

## Installation

Run the following command. Replace `global.domain=<DOMAIN>` with the domain obtained [here](./system-requirements.md#fully-qualified-domain-name-fqdn):
```bash
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
  --version "<VERSION>" \
  --set global.customCA.enabled=true \
  --set global.domain=<DOMAIN>
```

Expected output:

```
Release "runai-backend" does not exist. Installing it now.
NAME: runai-backend
LAST DEPLOYED: Mon Dec 30 17:30:19 2024
NAMESPACE: runai-backend
STATUS: deployed
REVISION: 1
```
!!! Note
    To install a specific version, add `--version <VERSION>` to the install command. You can find available versions by running `helm search repo -l runai-backend`.

## Connect to NVIDIA Run:ai User Interface

1. Open your browser and go to: `https://<DOMAIN>`.
2. Log in using the default credentials:

    * User: `test@run.ai`
    * Password: `Abcd!234`

You will be prompted to change the password.
**docs/admin/runai-setup/self-hosted/bcm/network-requirements.md** (64 additions)
@@ -0,0 +1,64 @@ | ||
# Network requirements | ||
|
||
The following network requirements are for the NVIDIA Run:ai components installation and usage. | ||
|
||
## Installation | ||
|
||
### Inbound rules | ||
|
||
| Name | Description | Source | Destination | Port | | ||
| --------------------------- | ---------------- | ------- | -------------------------- | ---- | | ||
| Installation via BCM | SSH Access | Installer Machine | NVIDIA Base Command Manager headnodes | 22 | | ||
|
||
### Outbound rules | ||
| Name | Description | Source | Destination | Port | | ||
| --------------------------- | ---------------- | ------- | -------------------------- | ---- | | ||
| Container Registry | Pull NVIDIA Run:ai images | All kubernetes nodes | runai.jfrog.io | 443 | | ||
| Helm repository | NVIDIA Run:ai Helm repository for installation | Installer machine | runai.jfrog.io | 443 | | ||
|
||
The NVIDIA Run:ai installation has [software requirements](system-requirements.md) that require additional components to be installed on the cluster. This article includes simple installation examples which can be used optionally and require the following cluster outbound ports to be open: | ||
|
||
| Name | Description | Source | Destination | Port | | ||
| -------------------------- | ------------------------------------------ | -------------------- | --------------- | ---- | | ||
| Kubernetes Registry | Ingress Nginx image repository | All kubernetes nodes | registry.k8s.io | 443 | | ||
| Google Container Registry | GPU Operator, and Knative image repository | All kubernetes nodes | gcr.io | 443 | | ||
| Red Hat Container Registry | Prometheus Operator image repository | All kubernetes nodes | quay.io | 443 | | ||
| Docker Hub Registry | Training Operator image repository | All kubernetes nodes | docker.io | 443 | | ||
## External access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

!!! Note
    Ensure the inbound and outbound rules are correctly applied to your firewall.

### Inbound rules

To allow your organization's NVIDIA Run:ai users to interact with the cluster using the [NVIDIA Run:ai Command-line interface](../../reference/cli/runai/), or access specific UI features, certain inbound ports need to be open:

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| NVIDIA Run:ai control plane | HTTPS entrypoint | 0.0.0.0 | NVIDIA Run:ai system nodes | 443 |
| NVIDIA Run:ai cluster | HTTPS entrypoint | RFC1918 private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) | NVIDIA Run:ai system nodes | 443 |
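As a quick sanity check when writing firewall rules, the RFC1918 source ranges listed for cluster inbound traffic can be matched with a small shell helper. This is a sketch for illustration only, not part of the product:

```shell
# Sketch: classify an IPv4 address against the RFC1918 private ranges
# used as the cluster's allowed inbound sources (simple prefix match).
is_rfc1918() {
  case "$1" in
    10.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    192.168.*) return 0 ;;
    *) return 1 ;;
  esac
}

is_rfc1918 192.168.0.250 && echo "private (allowed)" || echo "public (blocked)"
# → private (allowed)
```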
### Outbound rules

!!! Note
    Outbound rules apply to the NVIDIA Run:ai cluster component only. If the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

For the NVIDIA Run:ai cluster installation and usage, certain **outbound** ports must be open:

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| Cluster sync | Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |
| Metric store | Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |

## Internal network

Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.
# Next Steps

## Restrict System Node Scheduling (Post-Installation)

After installation, you can configure NVIDIA Run:ai to enforce stricter scheduling rules that ensure system components and workloads are assigned to the correct nodes. The following flags are set using the `runaiconfig`. See [Advanced Cluster Configurations](../../../config/advanced-cluster-config.md) for more details.

1. Set `global.nodeAffinity.restrictRunaiSystem=true`. This ensures that NVIDIA Run:ai system components are scheduled only on nodes labeled as system nodes.

2. Set `global.nodeAffinity.restrictScheduling=true`. This prevents pure CPU workloads from being scheduled on GPU nodes.
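Taken together, the two flags above correspond to a `runaiconfig` fragment along these lines. This is a sketch; the key names and nesting are an assumption based on the flag paths above, so verify them against the Advanced Cluster Configurations page before applying:

```yaml
# Sketch of the relevant runaiconfig section; key placement follows
# the flag names above and should be verified before use.
spec:
  global:
    nodeAffinity:
      restrictRunaiSystem: true
      restrictScheduling: true
```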
# Preparations

You should receive a token from NVIDIA Run:ai customer support. The following command uses the token to create the secret that provides access to the NVIDIA Run:ai container registry:

```bash
kubectl create secret docker-registry runai-reg-creds \
  --docker-server=https://runai.jfrog.io \
  --docker-username=self-hosted-image-puller-prod \
  --docker-password=$TOKEN \
  --docker-email=support@run.ai \
  --namespace=runai-backend
```
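For reference, the secret created above stores a `kubernetes.io/dockerconfigjson` payload. The sketch below reproduces that payload locally so you can see what the cluster will use to authenticate against the registry (the token value is a placeholder):

```shell
# Build the dockerconfigjson auth entry the secret will contain.
# "example-token" is a placeholder for the token from Run:ai support.
TOKEN="example-token"
AUTH=$(printf '%s:%s' "self-hosted-image-puller-prod" "$TOKEN" | base64)
printf '{"auths":{"https://runai.jfrog.io":{"auth":"%s"}}}\n' "$AUTH"
```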
Review comment:

> This will fail for airgapped superpods, as there is no access to external public chart repositories. Consider adding connected/airgapped options, the same options we have for the self-hosted control plane installation. If the superpod is connected, we need to pull the chart first with the following: