Possible performance degradation on ALinux2 when using ParallelCluster 2.11.0 and custom AMIs from 2.6.0 to 2.11.0

Enrico Usai edited this page Jul 8, 2021 · 11 revisions

The issue

The performance of tightly coupled / MPI workloads on clusters running the Amazon Linux 2 operating system may be impacted when CloudWatch logging is enabled.

Our preliminary analysis has found that this is likely related to CloudWatch Agent version 1.247348.0b251302. You can check which version you have installed by running the command: yum list amazon-cloudwatch-agent
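As a minimal sketch, the check can also be scripted. The helper names is_affected_version and installed_agent_version are hypothetical, and the sketch assumes that the RPM VERSION field reports the agent version in the same form as the version string above; confirm this against the output of yum list on your instance.

```shell
#!/bin/bash
# Version identified by the preliminary analysis described above.
AFFECTED_VERSION="1.247348.0b251302"

# Hypothetical helper: succeeds when the given version string
# matches the affected version exactly.
is_affected_version() {
    [ "$1" = "${AFFECTED_VERSION}" ]
}

# Hypothetical helper: prints the installed agent version.
# Assumption: the RPM VERSION field uses the same format as AFFECTED_VERSION.
installed_agent_version() {
    rpm -q --queryformat '%{VERSION}' amazon-cloudwatch-agent
}
```

On a cluster node, the two helpers could be combined as: if is_affected_version "$(installed_agent_version)"; then echo "affected version installed"; fi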

This performance issue may affect workloads differently depending on cluster size and applications used.

The workaround

There are multiple options to work around the issue.

Option 1: Create the cluster with CloudWatch logging disabled

Create a cluster with the following configuration:

[cluster yourcluster]
cw_log_settings = custom-cw
...

[cw_log custom-cw]
enable = false

With this configuration, CloudWatch logging and the CloudWatch Agent service are disabled on the cluster, which avoids the possible performance degradation.

Option 2: Create a custom AMI with a downgraded CloudWatch Agent

  1. Follow the official documentation to modify an existing ParallelCluster AMI
  2. As part of the AMI customization step, connect to the instance and run the following command: sudo yum downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2
  3. Complete the steps to create a custom AMI
  4. Create a cluster using the generated AMI with the custom_ami parameter.

Option 3: Downgrade the CloudWatch Agent within individual jobs

Example:

#!/bin/bash
#SBATCH --job-name=yourjob
# add your options

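# Downgrade the CloudWatch Agent on every node allocated to the job
for i in $(scontrol show hostnames "$SLURM_JOB_NODELIST")
do
    ssh "$i" "sudo systemctl stop amazon-cloudwatch-agent.service"
    # -y answers the yum confirmation prompt, since the job runs non-interactively
    ssh "$i" "sudo yum downgrade -y amazon-cloudwatch-agent-1.247347.4-1.amzn2"
    ssh "$i" "sudo systemctl start amazon-cloudwatch-agent.service"
done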

# start your application
sleep 100
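
The per-node steps could also be guarded so that nodes that are not on the affected version are left untouched, and sent over a single ssh connection per node. The following is a sketch under stated assumptions, not a tested recipe: build_remote_command is a hypothetical helper, the RPM VERSION field is assumed to match the affected version string exactly, and -y is passed to yum so that it does not wait for confirmation.

```shell
#!/bin/bash
AFFECTED_VERSION="1.247348.0b251302"
TARGET_PACKAGE="amazon-cloudwatch-agent-1.247347.4-1.amzn2"

# Hypothetical helper: prints the command to run on each node.
# The escaped \$( ... ) is evaluated on the remote node, not locally.
# Assumption: the RPM VERSION field matches AFFECTED_VERSION exactly.
build_remote_command() {
    cat <<EOF
if [ "\$(rpm -q --queryformat '%{VERSION}' amazon-cloudwatch-agent)" = "${AFFECTED_VERSION}" ]; then
    sudo systemctl stop amazon-cloudwatch-agent.service
    sudo yum downgrade -y ${TARGET_PACKAGE}
    sudo systemctl start amazon-cloudwatch-agent.service
fi
EOF
}
```

Inside the job script, the loop body would then reduce to one call per node: ssh "$i" "$(build_remote_command)"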