# Added self-hosted bcm install #1456
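This PR adds two MetalLB manifests (the file paths are not shown in this view): an `L2Advertisement` restricted to NVIDIA Run:ai system nodes, and an `IPAddressPool` that reserves a single ingress IP for the `ingress-nginx` namespace.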
```yaml
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-ingress
  namespace: metallb-system
spec:
  ipAddressPools:
  - ingress-pool
  nodeSelectors:
  - matchLabels:
      node-role.kubernetes.io/runai-system: "true"
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool
  namespace: metallb-system
spec:
  addresses:
  - <RESERVED IP>/32
  autoAssign: false
  serviceAllocation:
    priority: 50
    namespaces:
    - ingress-nginx
```
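The next file appears to be a Helm values file for the NVIDIA Network Operator (again, the file path is not shown in this view): NFD and the SR-IOV network operator are enabled, the secondary network stack (Multus, CNI plugins, NV-IPAM) is deployed, and the OFED driver and RDMA shared device plugin are left disabled.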
```yaml
deployCR: true
nfd:
  enabled: true
ofedDriver:
  deploy: false
psp:
  enabled: false
rdmaSharedDevicePlugin:
  deploy: false
secondaryNetwork:
  cniPlugins:
    deploy: true
  deploy: true
  ipamPlugin:
    deploy: false
  multus:
    deploy: true
  nvIpam:
    deploy: true
sriovDevicePlugin:
  deploy: false
sriovNetworkOperator:
  enabled: true
```
# Install the Cluster

## System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

* Test the above requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking
* Look at additional components installed and analyze their relevance to a successful installation

For more information, see [preinstall diagnostics](https://github.com/run-ai/preinstall-diagnostics). To run the preinstall diagnostics tool, [download](https://runai.jfrog.io/ui/native/pd-cli-prod/preinstall-diagnostics-cli/) the latest version, and run:

```bash
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN}

# If the diagnostics image is hosted in a private registry, also pass:
#   --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
#   --image ${PRIVATE_REGISTRY_IMAGE_URL}
```
## Helm

NVIDIA Run:ai requires [Helm](https://helm.sh/) 3.14 or later. To install Helm, see [Installing Helm](https://helm.sh/docs/intro/install/).
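A quick way to confirm the installed version meets the 3.14 requirement (an optional check, not part of the original instructions):

```bash
# Prints the client version, e.g. v3.14.2+g...
helm version --short
```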
## Permissions

A Kubernetes user with the `cluster-admin` role is required to ensure a successful installation. For more information, see [Using RBAC authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
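Before starting, you can optionally verify that your current context holds the required permissions (a quick sanity check, not from the original page):

```bash
# Prints "yes" when the current user can perform any action in any namespace,
# which is effectively what the cluster-admin role grants
kubectl auth can-i '*' '*' --all-namespaces
```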
## Installation

Follow the steps below to add a new cluster.

!!! Note
    When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Select **Same as control plane**
6. Click **Continue**
**Installing NVIDIA Run:ai Cluster**

The next section presents the NVIDIA Run:ai cluster installation steps.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster
2. Append `--set global.customCA.enabled=true` to the Helm installation command, as illustrated in the sketch after this list
3. Click **DONE**
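The exact Helm command, including the cluster credentials and version, is generated for you in the UI; the sketch below only illustrates where the appended flag lands. The chart reference and the `controlPlane.url`/`cluster.uid` values are assumptions for illustration, not taken from this page:

```bash
# Hypothetical command; copy the real one from the NVIDIA Run:ai UI.
helm upgrade -i runai-cluster runai-cluster/runai-cluster \
  -n runai --create-namespace \
  --set controlPlane.url=<CONTROL_PLANE_FQDN> \
  --set cluster.uid=<CLUSTER_UID> \
  --set global.customCA.enabled=true
```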
The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

!!! Tip
    Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles](https://docs.run.ai/v2.19/admin/config/access-roles/).
!!! Note
    To customize the installation based on your environment, see [Customize cluster installation](../../cluster-setup/customize-cluster-install.md).

## Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenarios below.

### Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

```bash
# Download the log-collection script and execute it; review it first if preferred
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
```

### Cluster Status

If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to **Connected**, check the cluster [troubleshooting scenarios](../../troubleshooting/troubleshooting.md#cluster-health).
# Install the Control Plane

Installing the NVIDIA Run:ai control plane requires Internet connectivity.

## System and Network Requirements

Before installing the NVIDIA Run:ai control plane, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

## Permissions

As part of the installation, you will be required to install the NVIDIA Run:ai control plane [Helm chart](https://helm.sh/). The Helm charts require Kubernetes administrator permissions. You can review the exact objects that are created by the charts using the `--dry-run` flag on both Helm charts.
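For example, to preview the objects the control-plane chart would create without installing anything (assuming the `runai-backend` repo alias configured later on this page):

```bash
# Renders the chart and prints the manifests instead of applying them
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --dry-run
```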
## Installation

Run the following command. Replace `global.domain=<DOMAIN>` with the FQDN obtained [here](./system-requirements.md#fully-qualified-domain-name-fqdn):
> **Reviewer note:** This will fail for air-gapped SuperPODs, as there is no access to external public chart repositories. If the SuperPOD is connected, we need to pull the chart first with the following:
>
> ```bash
> helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
> helm repo update
> ```

```bash
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
  --version "<VERSION>" \
  --set global.customCA.enabled=true \
  --set global.domain=<DOMAIN>
```

Example output:

```
Release "runai-backend" does not exist. Installing it now.
NAME: runai-backend
LAST DEPLOYED: Mon Dec 30 17:30:19 2024
NAMESPACE: runai-backend
STATUS: deployed
REVISION: 1
```
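After the release is deployed, a quick optional check (my suggestion, not from the original page) confirms the control-plane pods are starting:

```bash
# All pods in the runai-backend namespace should eventually reach Running or Completed
kubectl get pods -n runai-backend
```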
!!! Note
    To install a specific version, add `--version <VERSION>` to the install command. You can find available versions by running `helm search repo -l runai-backend`.

## Connect to NVIDIA Run:ai User Interface

1. Open your browser and go to `https://<DOMAIN>`.
2. Log in using the default credentials:

    * User: `test@run.ai`
    * Password: `Abcd!234`

You will be prompted to change the password.
# Network requirements

The following network requirements apply to the installation and usage of the NVIDIA Run:ai components.

## Installation

### Inbound rules

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| Installation via BCM | SSH access | Installer machine | NVIDIA Base Command Manager head nodes | 22 |

### Outbound rules

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| Container Registry | Pull NVIDIA Run:ai images | All Kubernetes nodes | runai.jfrog.io | 443 |
| Helm repository | NVIDIA Run:ai Helm repository for installation | Installer machine | runai.jfrog.io | 443 |

The NVIDIA Run:ai installation has [software requirements](system-requirements.md) that require additional components to be installed on the cluster. This article includes optional installation examples, which require the following cluster outbound ports to be open:
| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| Kubernetes Registry | Ingress Nginx image repository | All Kubernetes nodes | registry.k8s.io | 443 |
| Google Container Registry | GPU Operator and Knative image repository | All Kubernetes nodes | gcr.io | 443 |
| Red Hat Container Registry | Prometheus Operator image repository | All Kubernetes nodes | quay.io | 443 |
| Docker Hub Registry | Training Operator image repository | All Kubernetes nodes | docker.io | 443 |

## External access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

!!! Note
    Ensure the inbound and outbound rules are correctly applied to your firewall.
### Inbound rules

To allow your organization's NVIDIA Run:ai users to interact with the cluster using the [NVIDIA Run:ai Command-line interface](../../reference/cli/runai/) or access specific UI features, certain inbound ports need to be open:

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| NVIDIA Run:ai control plane | HTTPS entrypoint | 0.0.0.0 | NVIDIA Run:ai system nodes | 443 |
| NVIDIA Run:ai cluster | HTTPS entrypoint | RFC1918 private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) | NVIDIA Run:ai system nodes | 443 |
### Outbound rules

!!! Note
    Outbound rules apply to the NVIDIA Run:ai cluster component only. If the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

For the NVIDIA Run:ai cluster installation and usage, certain **outbound** ports must be open:

| Name | Description | Source | Destination | Port |
| ---- | ----------- | ------ | ----------- | ---- |
| Cluster sync | Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |
| Metric store | Push NVIDIA Run:ai cluster metrics to the NVIDIA Run:ai control plane's metric store | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |

## Internal network

Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.
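As an optional spot check (my own suggestion; the hosts come from the outbound tables above), you can verify from a cluster node that the registry endpoints are reachable on port 443:

```bash
# Prints an HTTP status code per host; 000 indicates the host is unreachable
for host in runai.jfrog.io registry.k8s.io gcr.io quay.io docker.io; do
  printf '%s: ' "$host"
  curl -sS --max-time 5 -o /dev/null -w '%{http_code}\n' "https://$host" || true
done
```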
# Next Steps

## Restrict System Node Scheduling (Post-Installation)

After installation, you can configure NVIDIA Run:ai to enforce stricter scheduling rules that ensure system components and workloads are assigned to the correct nodes. The following flags are set using the `runaiconfig`; see [Advanced Cluster Configurations](../../../config/advanced-cluster-config.md) for more details, and the sketch after this list for one way to apply them.

1. Set `global.nodeAffinity.restrictRunaiSystem=true`. This ensures that NVIDIA Run:ai system components are scheduled only on nodes labeled as system nodes.
2. Set `global.nodeAffinity.restrictScheduling=true`. This prevents pure CPU workloads from being scheduled on GPU nodes.
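A minimal sketch of applying both flags by patching the `runaiconfig` resource. The resource name and namespace below (`runai` for both) are assumptions based on typical NVIDIA Run:ai deployments, as is the exact key casing; adjust to match your environment and the Advanced Cluster Configurations page:

```bash
# Merge-patch the runaiconfig custom resource with both scheduling restrictions
kubectl patch runaiconfig runai -n runai --type merge \
  -p '{"spec":{"global":{"nodeAffinity":{"restrictRunaiSystem":true,"restrictScheduling":true}}}}'
```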
# Preparations

You should receive a token from NVIDIA Run:ai customer support. The following command uses it to create a secret that grants access to the NVIDIA Run:ai container registry:

```bash
# Create the target namespace first if it does not exist yet
kubectl create namespace runai-backend

kubectl create secret docker-registry runai-reg-creds \
  --docker-server=https://runai.jfrog.io \
  --docker-username=self-hosted-image-puller-prod \
  --docker-password=$TOKEN \
  --docker-email=support@run.ai \
  --namespace=runai-backend
```
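To verify that the secret exists (an optional check, not from the original page):

```bash
kubectl get secret runai-reg-creds -n runai-backend
```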
> **Reviewer note:** You can select the EXACT IP you want for each service:
>
> ```bash
> kubectl -n kourier-system patch svc kourier \
>   --type='merge' \
>   -p '{"spec": {"type": "LoadBalancer", "loadBalancerIP": "192.168.0.250"}}'
>
> kubectl -n ingress-nginx patch svc ingress-nginx-controller \
>   --type='merge' \
>   -p '{"spec": {"type": "LoadBalancer", "loadBalancerIP": "192.168.0.251"}}'
> ```