This Terraform project sets up AWS infrastructure for testing the NVIDIA IMEX (Internode Memory Exchange) service on non-GB200 EC2 instances using the --nogpu command line flag.
This project is specifically designed for testing IMEX communication and configuration without requiring GB200 hardware or actual GPUs. It creates:
- VPC with public and private subnets
- Four g5g.16xlarge EC2 instances running Ubuntu 22.04 with NVIDIA drivers
- Security groups for IMEX communication
- IAM roles for Session Manager access
- IMEX service configured in NO GPU mode for testing purposes
For detailed information about NVIDIA IMEX service configuration and management, refer to the official documentation:
NVIDIA IMEX Service for NVLink Networks - Getting Started Guide
- VPC Module: Uses the official AWS VPC module for clean configuration
- EC2 Module: Custom module for launching GPU instances with IMEX support
- Security Groups: Configured for IMEX peer communication on port 50000
- User Data: Installs NVIDIA drivers, Docker, AWS CLI, and configures IMEX
- Configure your AWS credentials
- Update variables.tf with your desired values
- Create S3 backend for Terraform state management: ./create-s3-backend.sh
- Run terraform init
- Run terraform plan
- Run terraform apply
- After successful deployment, configure NVIDIA IMEX domains:
    # Configure Domain 1 (instances 1 and 2):
    ./configure-nvidia-imex-domain.sh 1 <instance-id-1> <instance-id-2>
    # Configure Domain 2 (instances 3 and 4):
    ./configure-nvidia-imex-domain.sh 2 <instance-id-3> <instance-id-4>
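The domain configuration step needs the four instance IDs. One way to look them up is sketched below with the AWS CLI. It assumes the instances carry a Name tag matching some pattern; the function name and the pattern are hypothetical, so substitute whatever tag your variables.tf assigns.

```bash
#!/usr/bin/env bash
# List the IDs of running instances whose Name tag matches a pattern.
# ASSUMPTION: the instances are tagged with a Name that matches the pattern;
# this project's actual tag values are not documented here.
list_instance_ids() {
  local name_pattern="$1"
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=${name_pattern}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Example (hypothetical tag pattern):
# list_instance_ids 'imex-test-*'
```

The IDs printed by this function can then be passed to configure-nvidia-imex-domain.sh two at a time, one pair per domain.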
The project uses a single unified script, configure-nvidia-imex-domain.sh, to configure NVIDIA IMEX in independent domains.
It configures any domain by accepting the domain number and instance IDs as parameters:
- Retrieves Private IPs: Gets the private IP addresses of the specified instances using AWS CLI
- Creates Node Configuration: Creates /etc/nvidia-imex/nodes_config.cfg on each instance containing the private IPs of both nodes in the domain
- Starts IMEX Service: Launches the NVIDIA IMEX service in NO GPU mode (--nogpu flag) for testing
- Restarts Service: Restarts the service to ensure it picks up the new configuration
- Displays Status: Shows the IMEX node status and connectivity table using nvidia-imex-ctl
- Logs Output: Displays service logs and process status for troubleshooting
The script uses AWS Systems Manager (SSM) to execute commands remotely on the specified instances, ensuring each domain is configured independently with its own node list for IMEX communication.
Usage: ./configure-nvidia-imex-domain.sh <domain-number> <instance-id-1> <instance-id-2>
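The flow described above (look up private IPs, write the node list, restart the service over SSM) can be sketched roughly as follows. This is not the project's script, only a minimal illustration under stated assumptions: the function names are hypothetical, the node file is assumed to hold one IP per line, and the sketch does not reproduce how the real script passes the --nogpu flag.

```bash
#!/usr/bin/env bash
# Print the private IP of one instance.
get_private_ip() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' \
    --output text
}

# Build the JSON parameters for the AWS-RunShellScript document.
# ASSUMPTION: nodes_config.cfg holds one IP address per line.
# SSM Run Command executes as root on Linux, so no sudo is used here.
build_ssm_parameters() {
  local cfg="/etc/nvidia-imex/nodes_config.cfg" ip1="$1" ip2="$2"
  printf '{"commands":["echo %s > %s","echo %s >> %s","systemctl restart nvidia-imex"]}' \
    "$ip1" "$cfg" "$ip2" "$cfg"
}

# Write the node list on both instances and restart the service.
# Prints the SSM command ID so the result can be inspected later.
configure_domain() {
  local id1="$1" id2="$2" ip1 ip2
  ip1="$(get_private_ip "$id1")"
  ip2="$(get_private_ip "$id2")"
  aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --instance-ids "$id1" "$id2" \
    --parameters "$(build_ssm_parameters "$ip1" "$ip2")" \
    --query 'Command.CommandId' \
    --output text
}
```

Because both instances receive the same two-line node list, each domain only ever learns about its own peers, which is what keeps the domains independent.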
When each EC2 instance launches, the user data script automatically performs the following setup:
- Updates the Ubuntu system packages
- Installs essential tools: curl, wget, git, unzip, jq, htop, tree
- Enables and starts the AWS SSM agent for Session Manager access
- Adds Docker's official GPG key and repository
- Installs Docker Engine, CLI, and Compose
- Starts and enables Docker service
- Adds the ubuntu user to the docker group
- Downloads and installs the latest AWS CLI v2 for ARM64
- Removes installation files after setup
- Adds NVIDIA CUDA repository for Ubuntu 22.04 ARM64
- Adds Ubuntu jammy-proposed repository for latest packages
- Updates package sources to use Ubuntu ports for ARM64 compatibility
- Installs NVIDIA IMEX package version 570.133.20 to match the g5g driver version
- Verifies Docker and AWS CLI installations
- Checks for NVIDIA drivers (pre-installed on Deep Learning AMI)
- Creates a welcome message displayed on login
The script ensures each instance is ready for IMEX testing with all necessary software pre-installed and configured.
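To confirm that the user data script finished on an instance, a quick check like the one below can be run from a Session Manager shell. It is a minimal sketch, and the helper name missing_tools is hypothetical; it simply reports which of the expected commands are not on PATH.

```bash
#!/usr/bin/env bash
# Print the name of every listed command that is not found on PATH.
# Prints nothing when all commands are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the tools the user data script is described as installing.
# missing_tools curl wget git unzip jq htop tree docker aws nvidia-imex-ctl
```

An empty result suggests the installation steps completed; any name printed points to the step worth re-checking in the instance's boot logs.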
The instances are configured to run NVIDIA IMEX service in NO GPU mode for testing purposes:
- Service mode: --nogpu flag enabled for testing without GB200 hardware
- Default config file location: /etc/nvidia-imex/config.cfg
- Node configuration file: /etc/nvidia-imex/nodes_config.cfg
- Service name: nvidia-imex
- Communication port: 50000
- Purpose: Test IMEX communication and configuration on standard EC2 instances
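To check which values an instance is actually using, the settings can be read back from the config file. The helper below is a sketch that assumes config.cfg uses KEY=VALUE lines; the key names in the example (SERVER_PORT, IMEX_NODE_CONFIG_FILE) are assumptions and should be verified against the file on your instance.

```bash
#!/usr/bin/env bash
# Print the value of KEY from a KEY=VALUE style config file.
# Commented-out lines are ignored; if a key appears twice, the last wins.
cfg_get() {
  local key="$1" file="$2"
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example (key names are assumptions, not confirmed by this project):
# cfg_get SERVER_PORT /etc/nvidia-imex/config.cfg
# cfg_get IMEX_NODE_CONFIG_FILE /etc/nvidia-imex/config.cfg
```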
For IMEX service issues, check:
- Service status: sudo systemctl status nvidia-imex
- Service logs: sudo journalctl -u nvidia-imex and cat /var/log/nvidia-imex.log
- IMEX control tool: nvidia-imex-ctl
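The checks above can be gathered in one pass with a small wrapper, sketched below. The function names are hypothetical, and the -N option passed to nvidia-imex-ctl is an assumption about that tool's node-status flag; confirm it against the tool's help output before relying on it.

```bash
#!/usr/bin/env bash
# Run one diagnostic command under a labeled header.
# A failing command is reported but does not stop the remaining checks.
run_check() {
  local label="$1"; shift
  printf '=== %s ===\n' "$label"
  "$@" 2>&1 || printf '(command failed with exit code %s)\n' "$?"
}

# Collect the IMEX diagnostics listed in this section.
collect_imex_diagnostics() {
  run_check "service status" sudo systemctl status nvidia-imex --no-pager
  run_check "recent journal" sudo journalctl -u nvidia-imex --no-pager -n 50
  run_check "log file tail" tail -n 50 /var/log/nvidia-imex.log
  # ASSUMPTION: -N prints the node status table; verify the flag locally.
  run_check "node status" nvidia-imex-ctl -N
}

# Example: collect_imex_diagnostics > imex-diagnostics.txt
```

Keeping each check independent means a missing log file or a stopped service still leaves the other sections of the report intact.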