This project provides a simple way to monitor GPU usage on remote servers and receive notifications through a Feishu bot. It connects to servers via SSH, retrieves GPU statistics using nvidia-smi, and sends formatted messages to a specified Feishu group.
- Remote Monitoring: Connects to multiple servers to gather GPU data.
- Two Report Types:
  - Overall GPU Status: Reports utilization and memory usage for all GPUs on a server.
  - User-Specific Processes: Lists all GPU processes for a specific user.
- Feishu Integration: Sends alerts and reports directly to a Feishu chat.
- Configurable: All settings, including server IPs, credentials, and Feishu bot details, are managed through a config.yaml file.
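The overall GPU status report can be illustrated with a small parsing sketch. It assumes the statistics are fetched with the `nvidia-smi --query-gpu` CSV query shown in `QUERY_CMD` and that the raw text has already been read back over SSH; the project may use a different query or parser. The names `GpuStat`, `parse_gpu_csv`, and `format_report` are hypothetical and not taken from the repository.

```python
from dataclasses import dataclass

# These are real nvidia-smi flags, but whether the project runs this exact
# query is an assumption made for illustration.
QUERY_CMD = (
    "nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total "
    "--format=csv,noheader,nounits"
)


@dataclass
class GpuStat:
    index: int
    utilization: int  # percent
    mem_used: int     # MiB
    mem_total: int    # MiB


def parse_gpu_csv(text: str) -> list[GpuStat]:
    """Parse 'index, util, used, total' lines into GpuStat records."""
    stats = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(GpuStat(index, util, used, total))
    return stats


def format_report(host: str, stats: list[GpuStat]) -> str:
    """Render one server's GPUs in the same shape as the sample message."""
    lines = [f"GPU for {host}"]
    for s in stats:
        lines.append(f"{s.index}: {s.utilization}%, Mem={s.mem_used}/{s.mem_total} MiB")
    return "\n".join(lines)


# Hypothetical raw output for a two-GPU server.
sample = "0, 0, 21, 24576\n1, 100, 7132, 24576\n"
print(format_report("db93", parse_gpu_csv(sample)))
```

This sketch does not handle fields that nvidia-smi reports as unavailable (for example `[N/A]`); a production parser would need to guard the `int` conversion.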
- Create a config.yaml file in the root directory.
- Add the server details, user credentials, and Feishu bot information. See the example below for the required structure.
Tip: You need a Feishu webhook URL for the bot; see this blog for more info.
Create a config.yaml file with the following structure:
# list of IPs to be monitored
ip_list:
- "192.168.1.101"
- "192.168.1.102"
# ssh port
port: 22
# ssh password
password: "your_ssh_password"
# ssh username
user: "your_ssh_username"
# feishu bot webhook url
feishu_url: "your_feishu_webhook_url"
# feishu keyword
feishu_keyword: "your_feishu_keyword"
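Before starting the monitor, it can help to check that the loaded configuration has every key from the structure above. The following is a minimal sketch, assuming the file has already been parsed into a dictionary (for instance with PyYAML's `yaml.safe_load`); `REQUIRED_KEYS` and `validate_config` are hypothetical helpers, not part of the project.

```python
# Expected keys and their types, taken from the example config.yaml above.
REQUIRED_KEYS = {
    "ip_list": list,
    "port": int,
    "password": str,
    "user": str,
    "feishu_url": str,
    "feishu_keyword": str,
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in cfg:
            problems.append(f"missing key: {key}")
        elif not isinstance(cfg[key], expected):
            problems.append(
                f"{key} should be {expected.__name__}, got {type(cfg[key]).__name__}"
            )
    # An empty server list would leave nothing to monitor.
    if isinstance(cfg.get("ip_list"), list) and not cfg["ip_list"]:
        problems.append("ip_list is empty")
    return problems


# Dictionary equivalent of the example config.yaml.
example = {
    "ip_list": ["192.168.1.101", "192.168.1.102"],
    "port": 22,
    "password": "your_ssh_password",
    "user": "your_ssh_username",
    "feishu_url": "your_feishu_webhook_url",
    "feishu_keyword": "your_feishu_keyword",
}
print(validate_config(example))
```

Returning a list of problems rather than raising on the first one lets all configuration mistakes be reported in a single pass.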
# running on tmux is recommended
python -m src.run
All logs are written to the gpu_log file.
{
  "NVIDIA GeForce RTX 3090": [
    {
      "id": "db94",
      "pid": 577961,
      "process_name": "python",
      "used_memory": "448 MiB"
    },
    {
      "id": "db95",
      "pid": 148218,
      "process_name": "python",
      "used_memory": "448 MiB"
    },
    {
      "id": "db19",
      "pid": 2871586,
      "process_name": "python",
      "used_memory": "450 MiB"
    },
    {
      "id": "db21",
      "pid": 3103532,
      "process_name": "python",
      "used_memory": "450 MiB"
    }
  ]
}
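A structure like the one above could be produced by grouping per-process records by GPU model. The sketch below assumes that the `id` field identifies the server a process runs on (it matches the server names in the monitoring message) and that each record already carries its GPU name; the input record layout and `group_by_gpu_model` are hypothetical, not the project's actual code.

```python
import json
from collections import defaultdict


def group_by_gpu_model(records: list[dict]) -> dict:
    """Group flat process records into {gpu_name: [process entries]}."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["gpu_name"]].append(
            {
                "id": rec["host"],  # assumed to be the server identifier
                "pid": rec["pid"],
                "process_name": rec["process_name"],
                "used_memory": rec["used_memory"],
            }
        )
    return dict(grouped)


# Hypothetical records collected from two servers.
records = [
    {"gpu_name": "NVIDIA GeForce RTX 3090", "host": "db94", "pid": 577961,
     "process_name": "python", "used_memory": "448 MiB"},
    {"gpu_name": "NVIDIA GeForce RTX 3090", "host": "db95", "pid": 148218,
     "process_name": "python", "used_memory": "448 MiB"},
]
print(json.dumps(group_by_gpu_model(records), indent=2))
```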
【Monitoring】
GPU for db93
0: 0%, Mem=21/24576 MiB
1: 0%, Mem=7132/24576 MiB
2: 0%, Mem=8664/24576 MiB
3: 0%, Mem=20804/24576 MiB
4: 0%, Mem=20804/24576 MiB
5: 0%, Mem=1/24576 MiB
6: 0%, Mem=1/24576 MiB
7: 0%, Mem=1/24576 MiB
GPU for db94
0: 100%, Mem=17248/24576 MiB
1: 0%, Mem=10468/24576 MiB
2: 0%, Mem=20814/24576 MiB
3: 0%, Mem=13020/24576 MiB
4: 0%, Mem=9492/24576 MiB
5: 100%, Mem=12128/24576 MiB
6: 0%, Mem=0/24576 MiB
7: 0%, Mem=0/24576 MiB
GPU for db95
0: 3%, Mem=454/24576 MiB
1: 0%, Mem=17288/24576 MiB
2: 0%, Mem=16256/24576 MiB
3: 0%, Mem=4/24576 MiB
4: 0%, Mem=1/24576 MiB
5: 0%, Mem=1/24576 MiB
6: 0%, Mem=1/24576 MiB
7: 0%, Mem=1/24576 MiB
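A message like the one above could be delivered to a Feishu custom bot as a plain text payload. This is a minimal sketch, assuming the bot uses Feishu's keyword check and that the bracketed prefix (`【Monitoring】`) is how the configured `feishu_keyword` is included in the text; that prefix convention is inferred from the sample, and `build_text_payload` and `send_to_feishu` are hypothetical names.

```python
import json
import urllib.request


def build_text_payload(keyword: str, body: str) -> dict:
    """Build a Feishu text message whose first line carries the keyword."""
    text = f"【{keyword}】\n{body}"
    return {"msg_type": "text", "content": {"text": text}}


def send_to_feishu(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status code."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Only the payload is built here; sending requires a real webhook URL.
payload = build_text_payload("Monitoring", "GPU for db93\n0: 0%, Mem=21/24576 MiB")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

To send the message, call `send_to_feishu(cfg["feishu_url"], payload)` with the webhook URL from config.yaml. If the text does not contain a keyword configured on the bot, Feishu rejects the message, so the keyword is kept as part of the text itself.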