harishvs/imex_cpu

NVIDIA IMEX Multi-Node Setup (NO GPU Mode Testing)

This Terraform project sets up AWS infrastructure for testing the NVIDIA IMEX (Internode Memory Exchange) service on non-GB200 EC2 instances using the --nogpu command-line flag.

Overview

This project is specifically designed for testing IMEX communication and configuration without requiring GB200 hardware or actual GPUs. It creates:

  • VPC with public and private subnets
  • 4 g5g.16xlarge EC2 instances running Ubuntu 22.04 with NVIDIA drivers
  • Security groups for IMEX communication
  • IAM roles for Session Manager access
  • IMEX service configured in NO GPU mode for testing purposes

NVIDIA IMEX Documentation

For detailed information about NVIDIA IMEX service configuration and management, refer to the official documentation:

NVIDIA IMEX Service for NVLink Networks - Getting Started Guide

Key Components

  • VPC Module: Uses the official AWS VPC module for clean configuration
  • EC2 Module: Custom module for launching GPU instances with IMEX support
  • Security Groups: Configured for IMEX peer communication on port 50000
  • User Data: Installs NVIDIA drivers, Docker, AWS CLI, and configures IMEX

Usage

  1. Configure your AWS credentials
  2. Update variables.tf with your desired values
  3. Create S3 backend for Terraform state management:
    ./create-s3-backend.sh
  4. Run terraform init
  5. Run terraform plan
  6. Run terraform apply
  7. After successful deployment, configure NVIDIA IMEX domains:
    # Configure Domain 1 (instances 1 and 2):
    ./configure-nvidia-imex-domain.sh 1 <instance-id-1> <instance-id-2>
    
    # Configure Domain 2 (instances 3 and 4):
    ./configure-nvidia-imex-domain.sh 2 <instance-id-3> <instance-id-4>
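
Step 7 pairs instances into domains by hand. As a minimal sketch (not part of the repository), the hypothetical helper below prints one configure command per consecutive pair of instance IDs, so the pairing can be reviewed before anything runs. The function name print_domain_commands and the pairing rule (IDs 1 and 2 form domain 1, IDs 3 and 4 form domain 2, and so on) are assumptions for illustration.

```shell
# Hypothetical helper; not taken from the repository.
# Prints one configure command per consecutive pair of instance IDs.
# A trailing unpaired ID is ignored.
print_domain_commands() {
  domain=1
  while [ "$#" -ge 2 ]; do
    printf './configure-nvidia-imex-domain.sh %s %s %s\n' "$domain" "$1" "$2"
    shift 2
    domain=$((domain + 1))
  done
}

# Example: print_domain_commands <instance-id-1> <instance-id-2> <instance-id-3> <instance-id-4>
```

Printing the commands instead of executing them keeps the helper side-effect free; the output can be pasted into a terminal once it looks right.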

What the Configuration Script Does

The project uses a single script, configure-nvidia-imex-domain.sh, to configure each NVIDIA IMEX domain independently:

configure-nvidia-imex-domain.sh

Configures any domain by accepting the domain number and instance IDs as parameters:

  1. Retrieves Private IPs: Gets the private IP addresses of the specified instances using AWS CLI
  2. Creates Node Configuration: Creates /etc/nvidia-imex/nodes_config.cfg on each instance containing the private IPs of both nodes in the domain
  3. Starts IMEX Service: Launches the NVIDIA IMEX service in NO GPU mode (--nogpu flag) for testing
  4. Restarts Service: Restarts the service to ensure it picks up the new configuration
  5. Displays Status: Shows the IMEX node status and connectivity table using nvidia-imex-ctl
  6. Logs Output: Displays service logs and process status for troubleshooting
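
Steps 1 and 2 can be sketched roughly as follows. This is an illustration, not the contents of configure-nvidia-imex-domain.sh: the function names get_private_ip and render_nodes_config are hypothetical, and the sketch assumes nodes_config.cfg holds one IP address per line.

```shell
# Hypothetical helpers; not taken from configure-nvidia-imex-domain.sh.

# Look up the private IP of one instance with the AWS CLI.
get_private_ip() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' \
    --output text
}

# Print the node list, one IP per line (assumed nodes_config.cfg layout).
render_nodes_config() {
  for ip in "$@"; do
    printf '%s\n' "$ip"
  done
}

# Example (requires AWS credentials; instance IDs are placeholders):
#   ip1=$(get_private_ip "<instance-id-1>")
#   ip2=$(get_private_ip "<instance-id-2>")
#   render_nodes_config "$ip1" "$ip2"
```

Both nodes in a domain would receive the same rendered list, which is what lets each node find its peer.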

The script uses AWS Systems Manager (SSM) to execute commands remotely on the specified instances, ensuring each domain is configured independently with its own node list for IMEX communication.
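
Remote execution through SSM might look like the sketch below. It assumes the standard AWS-RunShellScript document and is not the repository's actual code; the function name run_on_instance and the AWS_CLI override variable are hypothetical and exist only to allow a dry run.

```shell
# Hypothetical helper; not taken from configure-nvidia-imex-domain.sh.
# AWS_CLI can be overridden (for example AWS_CLI=echo) for a dry run.

# Send one shell command to one instance through SSM Run Command
# and print the resulting command ID.
# Note: the command string must not contain double quotes.
run_on_instance() {
  "${AWS_CLI:-aws}" ssm send-command \
    --instance-ids "$1" \
    --document-name "AWS-RunShellScript" \
    --parameters "{\"commands\":[\"$2\"]}" \
    --query 'Command.CommandId' \
    --output text
}

# Example (requires AWS credentials and an SSM-managed instance):
#   run_on_instance "<instance-id-1>" "systemctl restart nvidia-imex"
```

send-command returns before the remote command finishes, so the output has to be fetched afterwards, for example with aws ssm get-command-invocation and the printed command ID.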

Usage: ./configure-nvidia-imex-domain.sh <domain-number> <instance-id-1> <instance-id-2>

What the EC2 User Data Script Does (modules/ec2/main.tf)

When each EC2 instance launches, the user data script automatically performs the following setup:

System Updates & Base Packages

  • Updates the Ubuntu system packages
  • Installs essential tools: curl, wget, git, unzip, jq, htop, tree
  • Enables and starts the AWS SSM agent for Session Manager access

Docker Installation

  • Adds Docker's official GPG key and repository
  • Installs Docker Engine, CLI, and Compose
  • Starts and enables Docker service
  • Adds the ubuntu user to the docker group

AWS CLI v2 Installation

  • Downloads and installs the latest AWS CLI v2 for ARM64
  • Removes installation files after setup

NVIDIA IMEX Installation

  • Adds NVIDIA CUDA repository for Ubuntu 22.04 ARM64
  • Adds Ubuntu jammy-proposed repository for latest packages
  • Updates package sources to use Ubuntu ports for ARM64 compatibility
  • Installs NVIDIA IMEX package version 570.133.20 to match the g5g driver version

Verification & Finalization

  • Verifies Docker and AWS CLI installations
  • Checks for NVIDIA drivers (pre-installed on the Deep Learning AMI)
  • Creates a welcome message displayed on login

The script ensures each instance is ready for IMEX testing with all necessary software pre-installed and configured.
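
To confirm that the user data script finished, a quick check on the instance can list any expected tool that is missing from PATH. This is a minimal sketch; the function name missing_tools is hypothetical, and the tool list is taken from the names mentioned in this README.

```shell
# Hypothetical helper; not taken from the user data script.
# Prints the name of every listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools named in this README; no output means all of them were found.
missing_tools curl wget git unzip jq htop tree docker aws nvidia-imex-ctl
```

Run it from a Session Manager shell; any name it prints points at the installation step that likely failed.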

IMEX Configuration (NO GPU Mode)

The instances are configured to run NVIDIA IMEX service in NO GPU mode for testing purposes:

  • Service mode: --nogpu flag enabled for testing without GB200 hardware
  • Default config file location: /etc/nvidia-imex/config.cfg
  • Node configuration file: /etc/nvidia-imex/nodes_config.cfg
  • Service name: nvidia-imex
  • Communication port: 50000
  • Purpose: Test IMEX communication and configuration on standard EC2 instances
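
Before starting the service, the node configuration file can be sanity-checked. The sketch below assumes the file contains exactly one IPv4 address per line with no comments or blank lines; the function name validate_nodes_file is hypothetical, and the check covers format only, not reachability on port 50000.

```shell
# Hypothetical helper; assumes one IPv4 address per line, nothing else.
# Returns 0 if the file is non-empty and every line looks like a dotted quad.
validate_nodes_file() {
  # Fail if the file is missing or empty.
  [ -s "$1" ] || return 1
  # grep -v selects lines that do NOT match; any such line is invalid.
  ! grep -Evq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' "$1"
}

# Example:
#   validate_nodes_file /etc/nvidia-imex/nodes_config.cfg && echo "format OK"
```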

Troubleshooting

For IMEX service issues, check:

  • Service status: sudo systemctl status nvidia-imex
  • Service logs:
    sudo journalctl -u nvidia-imex
    cat /var/log/nvidia-imex.log
  • IMEX control tool: nvidia-imex-ctl
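
The checks above can be gathered into one pass. The sketch below only prints the commands so they can be reviewed first; the function name imex_diag_commands is hypothetical, and the -N option to nvidia-imex-ctl (node status) is an assumption that should be confirmed against the tool's help output for the installed version.

```shell
# Hypothetical helper; prints the diagnostic commands, one per line.
imex_diag_commands() {
  cat <<'EOF'
sudo systemctl status nvidia-imex --no-pager
sudo journalctl -u nvidia-imex --no-pager -n 100
tail -n 100 /var/log/nvidia-imex.log
nvidia-imex-ctl -N
EOF
}

# Review the list, then run it on the instance with:
#   imex_diag_commands | sh -x
```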
