
Adding GPU E2E Test #171


Closed
wants to merge 8 commits into from
130 changes: 130 additions & 0 deletions .github/workflows/gpu-e2e-test.yml
@@ -0,0 +1,130 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

name: Run GPU E2E Test
env:
  TERRAFORM_AWS_ASSUME_ROLE: ${{ secrets.TERRAFORM_AWS_ASSUME_ROLE }}

on:
  workflow_dispatch:
    inputs:
      addon_name:
        required: true
        type: string
        default: "amazon-cloudwatch-observability"
        description: "EKS addon name"
      addon_version:
        required: true
        type: string
        default: "v1.1.0-eksbuild.1"
        description: "EKS addon version"
      run_in_beta:
        required: true
        type: boolean
        default: true
        description: "Run in EKS Addon Beta environment"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref_name }}
  cancel-in-progress: true

permissions:
  id-token: write
  contents: read

jobs:
  GenerateTestMatrix:
    name: 'GenerateTestMatrix'
    runs-on: ubuntu-latest
    outputs:
      eks_addon_matrix: ${{ steps.set-matrix.outputs.eks_addon_matrix }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ env.TERRAFORM_AWS_ASSUME_ROLE }}
          aws-region: us-west-2

      - name: Generate matrix
        id: set-matrix
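        run: |
          # The ::set-output workflow command is deprecated; write the step output to $GITHUB_OUTPUT instead.
          echo "eks_addon_matrix=$(echo $(cat integration-tests/generator/k8s_versions_matrix.json))" >> "$GITHUB_OUTPUT"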

      - name: Echo test plan matrix
        run: |
          echo "eks_addon_matrix: ${{ steps.set-matrix.outputs.eks_addon_matrix }}"
          echo "Addon name ${{ github.event.inputs.addon_name }}, addon version ${{ github.event.inputs.addon_version }} "

  GPUE2ETest:
    needs: [GenerateTestMatrix]
    name: GPUE2ETest
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        arrays: ${{ fromJson(needs.GenerateTestMatrix.outputs.eks_addon_matrix) }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ env.TERRAFORM_AWS_ASSUME_ROLE }}
          aws-region: us-west-2

      - name: Confirm EKS Version Support
        run: |
          if [[
            $(go list -m k8s.io/client-go | cut -d ' ' -f 2 | cut -d '.' -f 2) -lt $( echo ${{ matrix.arrays.k8sVersion }} | cut -d '.' -f 2)
            || $(go list -m k8s.io/apimachinery | cut -d ' ' -f 2 | cut -d '.' -f 2) -lt $( echo ${{ matrix.arrays.k8sVersion }} | cut -d '.' -f 2)
            || $(go list -m k8s.io/component-base | cut -d ' ' -f 2 | cut -d '.' -f 2) -lt $( echo ${{ matrix.arrays.k8sVersion }} | cut -d '.' -f 2)
            || $(go list -m k8s.io/kubectl | cut -d ' ' -f 2 | cut -d '.' -f 2) -lt $( echo ${{ matrix.arrays.k8sVersion }} | cut -d '.' -f 2)
          ]]; then
            echo k8s.io/client-go $(go list -m k8s.io/client-go) is less than ${{ matrix.arrays.k8sVersion }}
            echo or k8s.io/apimachinery $(go list -m k8s.io/apimachinery) is less than ${{ matrix.arrays.k8sVersion }}
            echo or k8s.io/component-base $(go list -m k8s.io/component-base) is less than ${{ matrix.arrays.k8sVersion }}
            echo or k8s.io/kubectl $(go list -m k8s.io/kubectl) is less than ${{ matrix.arrays.k8sVersion }}, fail test
            echo "please run go get -u && go mod tidy"
            exit 1;
          fi

      - name: Verify Terraform version
        run: terraform --version

      - name: Terraform apply
        uses: nick-fields/retry@v2
        with:
          max_attempts: 1
          timeout_minutes: 60 # EKS takes about 20 minutes to spin up a cluster and service on the cluster
          retry_wait_seconds: 5
          command: |
            cd integration-tests/terraform/gpu

            terraform init
            if terraform apply -var="beta=${{ github.event.inputs.run_in_beta }}" -var="addon_name=${{ github.event.inputs.addon_name }}" -var="addon_version=${{ github.event.inputs.addon_version }}" -var="k8s_version=${{ matrix.arrays.k8sVersion }}" --auto-approve; then
              terraform destroy -var="beta=${{ github.event.inputs.run_in_beta }}" -auto-approve
            else
              terraform destroy -var="beta=${{ github.event.inputs.run_in_beta }}" -auto-approve && exit 1
            fi

      - name: Terraform destroy
        if: ${{ cancelled() || failure() }}
        uses: nick-fields/retry@v2
        with:
          max_attempts: 3
          timeout_minutes: 8
          retry_wait_seconds: 5
          command: |
            cd integration-tests/terraform/gpu

            terraform destroy -var="beta=${{ github.event.inputs.run_in_beta }}" --auto-approve
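
Note on the "Confirm EKS Version Support" step: the shell pipeline takes the second dot-separated field of each module version (for example `28` from `v0.28.3`) and fails the job when it is lower than the minor version of the cluster under test. The same comparison can be sketched in Go as follows. The function names `minorVersion` and `supportsCluster` and the sample versions are hypothetical and are not part of this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minorVersion returns the second dot-separated field of a version string
// such as "v0.28.3" or "1.28", parsed as an integer.
func minorVersion(version string) (int, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 {
		return 0, fmt.Errorf("version %q has no minor component", version)
	}
	return strconv.Atoi(parts[1])
}

// supportsCluster reports whether the module's minor version is at least
// the cluster's minor version (hypothetical helper, for illustration only).
func supportsCluster(moduleVersion, k8sVersion string) (bool, error) {
	moduleMinor, err := minorVersion(moduleVersion)
	if err != nil {
		return false, err
	}
	clusterMinor, err := minorVersion(k8sVersion)
	if err != nil {
		return false, err
	}
	return moduleMinor >= clusterMinor, nil
}

func main() {
	// Sample versions only; the workflow reads them from go list -m and the test matrix.
	for _, module := range []string{"v0.28.3", "v0.29.0"} {
		ok, err := supportsCluster(module, "1.29")
		fmt.Println(module, ok, err)
	}
}
```

Unlike the shell version, this sketch reports a malformed version string as an error instead of letting the comparison fail silently.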

79 changes: 79 additions & 0 deletions integration-tests/gpu/gpu_test.go
@@ -0,0 +1,79 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

//go:build !windows

package emf

import (
	"fmt"
	"log"
	"testing"

	"github.com/stretchr/testify/suite"

	"github.com/aws/amazon-cloudwatch-agent-test/environment"
	"github.com/aws/amazon-cloudwatch-agent-test/environment/computetype"
	"github.com/aws/amazon-cloudwatch-agent-test/test/metric/dimension"
	"github.com/aws/amazon-cloudwatch-agent-test/test/status"
	"github.com/aws/amazon-cloudwatch-agent-test/test/test_runner"
)

type GPUTestSuite struct {
Contributor


As we discussed, you don't need a test suite for this e2e test. The purpose of a test suite is to group similar tests so they can all be executed together, and no other tests will be added to this one. Other accelerated compute instance types (e.g. Trainium) will each get their own test, with a different instance type in the Terraform file, which will execute a different test.

	suite.Suite
	test_runner.TestSuite
}

func (suite *GPUTestSuite) SetupSuite() {
	fmt.Println(">>>> Starting GPU Container Insights TestSuite")
}

func (suite *GPUTestSuite) TearDownSuite() {
	suite.Result.Print()
	fmt.Println(">>>> Finished GPU Container Insights TestSuite")
}

func init() {
	environment.RegisterEnvironmentMetaDataFlags()
}

var (
	eksTestRunners []*test_runner.EKSTestRunner
)

func getEksTestRunners(env *environment.MetaData) []*test_runner.EKSTestRunner {
	if eksTestRunners == nil {
		factory := dimension.GetDimensionFactory(*env)

		eksTestRunners = []*test_runner.EKSTestRunner{
			{
				Runner: &NvidiaTestRunner{test_runner.BaseTestRunner{DimensionFactory: factory}, "EKS_GPU_NVIDIA", env},
				Env:    *env,
			},
		}
	}
	return eksTestRunners
}

func (suite *GPUTestSuite) TestAllInSuite() {
	env := environment.GetEnvironmentMetaData()
	switch env.ComputeType {
	case computetype.EKS:
		log.Println("Environment compute type is EKS")
		for _, testRunner := range getEksTestRunners(env) {
			testRunner.Run(suite, env)
		}
	default:
		return
	}

	suite.Assert().Equal(status.SUCCESSFUL, suite.Result.GetStatus(), "GPU Container Test Suite Failed")
}

func (suite *GPUTestSuite) AddToSuiteResult(r status.TestGroupResult) {
	suite.Result.TestGroupResults = append(suite.Result.TestGroupResults, r)
}

func TestGPUSuite(t *testing.T) {
	suite.Run(t, new(GPUTestSuite))
}
118 changes: 118 additions & 0 deletions integration-tests/gpu/nvidia_test.go
@@ -0,0 +1,118 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

//go:build !windows

package emf

import (
	"time"

	"github.com/aws/amazon-cloudwatch-agent-test/environment"
	"github.com/aws/amazon-cloudwatch-agent-test/test/metric"
	"github.com/aws/amazon-cloudwatch-agent-test/test/status"
	"github.com/aws/amazon-cloudwatch-agent-test/test/test_runner"
)

const (
	gpuMetricIndicator = "_gpu_"

	containerMemTotal   = "container_gpu_memory_total"
	containerMemUsed    = "container_gpu_memory_used"
	containerPower      = "container_gpu_power_draw"
	containerTemp       = "container_gpu_temperature"
	containerUtil       = "container_gpu_utilization"
	containerMemUtil    = "container_gpu_memory_utilization"
	podMemTotal         = "pod_gpu_memory_total"
	podMemUsed          = "pod_gpu_memory_used"
	podPower            = "pod_gpu_power_draw"
	podTemp             = "pod_gpu_temperature"
	podUtil             = "pod_gpu_utilization"
	podMemUtil          = "pod_gpu_memory_utilization"
	nodeMemTotal        = "node_gpu_memory_total"
	nodeMemUsed         = "node_gpu_memory_used"
	nodePower           = "node_gpu_power_draw"
	nodeTemp            = "node_gpu_temperature"
	nodeUtil            = "node_gpu_utilization"
	nodeMemUtil         = "node_gpu_memory_utilization"
	nodeCountTotal      = "node_gpu_total"
	nodeCountRequest    = "node_gpu_request"
	nodeCountLimit      = "node_gpu_limit"
	clusterCountTotal   = "cluster_gpu_total"
	clusterCountRequest = "cluster_gpu_request"
)

var expectedDimsToMetrics = map[string][]string{
	"ClusterName": {
		containerMemTotal, containerMemUsed, containerPower, containerTemp, containerUtil, containerMemUtil,
		podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
		nodeMemTotal, nodeMemUsed, nodePower, nodeTemp, nodeUtil, nodeMemUtil,
		//nodeCountTotal, nodeCountRequest, nodeCountLimit,
		//clusterCountTotal, clusterCountRequest,
	},
	"ClusterName-Namespace": {
		podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
	},
	//"ClusterName-Namespace-Service": {
	//	podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
	//},
	"ClusterName-Namespace-PodName": {
		podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
	},
	"ClusterName-ContainerName-Namespace-PodName": {
		containerMemTotal, containerMemUsed, containerPower, containerTemp, containerUtil, containerMemUtil,
	},
	"ClusterName-ContainerName-FullPodName-Namespace-PodName": {
		containerMemTotal, containerMemUsed, containerPower, containerTemp, containerUtil, containerMemUtil,
	},
	"ClusterName-ContainerName-FullPodName-GpuDevice-Namespace-PodName": {
		containerMemTotal, containerMemUsed, containerPower, containerTemp, containerUtil, containerMemUtil,
	},
	"ClusterName-FullPodName-Namespace-PodName": {
		podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
	},
	"ClusterName-FullPodName-GpuDevice-Namespace-PodName": {
		podMemTotal, podMemUsed, podPower, podTemp, podUtil, podMemUtil,
	},
	"ClusterName-InstanceId-NodeName": {
		nodeMemTotal, nodeMemUsed, nodePower, nodeTemp, nodeUtil, nodeMemUtil,
		//nodeCountTotal, nodeCountRequest, nodeCountLimit,
	},
	"ClusterName-GpuDevice-InstanceId-InstanceType-NodeName": {
		nodeMemTotal, nodeMemUsed, nodePower, nodeTemp, nodeUtil, nodeMemUtil,
	},
}

type NvidiaTestRunner struct {
	test_runner.BaseTestRunner
	testName string
	env      *environment.MetaData
}

var _ test_runner.ITestRunner = (*NvidiaTestRunner)(nil)

func (t *NvidiaTestRunner) Validate() status.TestGroupResult {
	var testResults []status.TestResult
	testResults = append(testResults, metric.ValidateMetrics(t.env, gpuMetricIndicator, expectedDimsToMetrics)...)
	testResults = append(testResults, metric.ValidateLogs(t.env))
	return status.TestGroupResult{
		Name:        t.GetTestName(),
		TestResults: testResults,
	}
}

func (t *NvidiaTestRunner) GetTestName() string {
	return t.testName
}

func (t *NvidiaTestRunner) GetAgentConfigFileName() string {
	return ""
}

func (t *NvidiaTestRunner) GetAgentRunDuration() time.Duration {
	return 3 * time.Minute
}

func (t *NvidiaTestRunner) GetMeasuredMetrics() []string {
	return nil
}
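
Note on `expectedDimsToMetrics`: each key in the map reads as a set of dimension names joined with `-` in alphabetical order, and each value lists the metrics expected under that dimension set. The sketch below illustrates that reading and a simple "which expected metrics are missing" comparison. It is an assumption about how such a map could be checked, not the implementation of `metric.ValidateMetrics`; `dimensionKey` and `missingMetrics` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dimensionKey joins dimension names in sorted order, so
// ["PodName", "ClusterName"] becomes "ClusterName-PodName".
// Assumption: this mirrors how the map keys above appear to be formed.
func dimensionKey(names []string) string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return strings.Join(sorted, "-")
}

// missingMetrics returns, per dimension key, the expected metric names
// that do not appear in observed (hypothetical helper).
func missingMetrics(expected, observed map[string][]string) map[string][]string {
	missing := make(map[string][]string)
	for key, want := range expected {
		seen := make(map[string]bool, len(observed[key]))
		for _, name := range observed[key] {
			seen[name] = true
		}
		for _, name := range want {
			if !seen[name] {
				missing[key] = append(missing[key], name)
			}
		}
	}
	return missing
}

func main() {
	expected := map[string][]string{
		"ClusterName-Namespace": {"pod_gpu_utilization", "pod_gpu_temperature"},
	}
	// Sample observation only; a real test would list metrics from CloudWatch.
	observed := map[string][]string{
		dimensionKey([]string{"Namespace", "ClusterName"}): {"pod_gpu_utilization"},
	}
	fmt.Println(missingMetrics(expected, observed))
}
```

Sorting the names before joining keeps the lookup independent of the order in which dimensions are returned.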
7 changes: 7 additions & 0 deletions integration-tests/terraform/basic_components/main.tf
@@ -1,10 +1,15 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT


module "common" {
  source = "../common"
}

data "aws_iam_instance_profile" "cwagent_instance_profile" {
  name = module.common.cwa_iam_instance_profile
}

data "aws_iam_role" "cwagent_iam_role" {
  name = module.common.cwa_iam_role
}
@@ -23,3 +28,5 @@ data "aws_subnets" "public_subnet_ids" {
data "aws_security_group" "security_group" {
  name = module.common.vpc_security_group
}


8 changes: 8 additions & 0 deletions integration-tests/terraform/basic_components/output.tf
@@ -1,6 +1,10 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

output "vpc_id" {
  value = data.aws_vpc.vpc.id
}

output "security_group" {
  value = data.aws_security_group.security_group.id
}
@@ -12,3 +16,7 @@ output "public_subnet_ids" {
output "role_arn" {
  value = data.aws_iam_role.cwagent_iam_role.arn
}

output "instance_profile" {
  value = data.aws_iam_instance_profile.cwagent_instance_profile.name
}
3 changes: 3 additions & 0 deletions integration-tests/terraform/basic_components/variables.tf
@@ -0,0 +1,3 @@
variable "region" {
  default = "us-west-2"
}