xiyuanyang-code/Feishu-GPU-Auto-Monitoring

Auto GPU Monitoring with Feishu Bot

This project provides a simple way to monitor GPU usage on remote servers and receive notifications through a Feishu bot. It connects to servers via SSH, retrieves GPU statistics using nvidia-smi, and sends formatted messages to a specified Feishu group.
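As a rough illustration of the nvidia-smi step, the sketch below parses the CSV output of a `--query-gpu` call and formats it like the report in the Demo section. The query string and the helper names `parse_gpu_csv` and `format_report` are assumptions made for this example, not code from the repository, and the SSH transport is left out.

```python
from typing import Dict, List

# Query a monitoring script might run on each server (assumed, not taken from the repo).
NVIDIA_SMI_QUERY = (
    "nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total "
    "--format=csv,noheader,nounits"
)


def parse_gpu_csv(output: str) -> List[Dict[str, int]]:
    """Parse the CSV lines printed by NVIDIA_SMI_QUERY into one dict per GPU.

    Assumes every field is numeric; GPUs that report "[N/A]" would need extra handling.
    """
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        index, util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "util": util, "mem_used": used, "mem_total": total})
    return gpus


def format_report(host: str, gpus: List[Dict[str, int]]) -> str:
    """Format one host's GPUs in the same layout as the demo output."""
    lines = [f"GPU for {host}"]
    for gpu in gpus:
        lines.append(
            f"{gpu['index']}: {gpu['util']}%, Mem={gpu['mem_used']}/{gpu['mem_total']} MiB"
        )
    return "\n".join(lines)


# Hypothetical nvidia-smi output for two GPUs.
sample = "0, 100, 17248, 24576\n1, 0, 10468, 24576\n"
print(format_report("db94", parse_gpu_csv(sample)))
```

Keeping parsing separate from the SSH call makes the formatting logic easy to check against captured output without touching a real server.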

Features

  • Remote Monitoring: Connects to multiple servers to gather GPU data.
  • Two Report Types:
    1. Overall GPU Status: Reports utilization and memory usage for all GPUs on a server.
    2. User-Specific Processes: Lists all GPU processes for a specific user.
  • Feishu Integration: Sends alerts and reports directly to a Feishu chat.
  • Configurable: All settings, including server IPs, credentials, and Feishu bot details, are managed through a config.yaml file.
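The Feishu side can be sketched with the standard library alone. The example below builds a text message for a custom bot webhook and prepends the configured keyword when the text does not already contain it, since a keyword-protected bot rejects messages without it. The function names `build_text_payload` and `send_to_feishu` are hypothetical and not taken from the repository.

```python
import json
import urllib.request


def build_text_payload(text: str, keyword: str = "") -> dict:
    """Build a Feishu custom-bot text message, prefixing the keyword if it is missing."""
    if keyword and keyword not in text:
        text = f"{keyword}\n{text}"
    return {"msg_type": "text", "content": {"text": text}}


def send_to_feishu(webhook_url: str, payload: dict, timeout: float = 10.0) -> str:
    """POST the payload as JSON to the webhook and return the raw response body."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Build (but do not send) a message; "Monitoring" stands in for feishu_keyword.
payload = build_text_payload("GPU for db94\n0: 100%, Mem=17248/24576 MiB", keyword="Monitoring")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

To actually deliver the message, call `send_to_feishu(feishu_url, payload)` with the webhook URL from config.yaml.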

Installation

  • Create a config.yaml file in the root directory.
  • Add the server details, user credentials, and Feishu bot information. See the example below for the required structure.

Tip

You need a Feishu webhook URL for the bot; see this blog for more info.

The config.yaml file has the following structure:

# list of server IPs to monitor
ip_list:
  - "192.168.1.101"
  - "192.168.1.102"
# ssh port
port: 22
# ssh password
password: "your_ssh_password"
# ssh username
user: "your_ssh_username"

# feishu bot webhook url
feishu_url: "your_feishu_webhook_url"
# feishu keyword
feishu_keyword: "your_feishu_keyword"
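A quick sanity check on the loaded config can catch a missing key before the first SSH connection is attempted. The sketch below assumes the file is read with PyYAML's `yaml.safe_load`; the names `REQUIRED_KEYS`, `validate_config`, and `load_config` are illustrative and do not come from the repository.

```python
# Expected keys and types, mirroring the example config.yaml above.
REQUIRED_KEYS = {
    "ip_list": list,
    "port": int,
    "password": str,
    "user": str,
    "feishu_url": str,
    "feishu_keyword": str,
}


def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    if isinstance(config.get("ip_list"), list) and not config["ip_list"]:
        problems.append("ip_list is empty")
    return problems


def load_config(path: str = "config.yaml") -> dict:
    """Load and validate config.yaml; requires PyYAML (pip install pyyaml)."""
    import yaml  # imported lazily so validate_config works without PyYAML

    with open(path, encoding="utf-8") as handle:
        config = yaml.safe_load(handle) or {}
    problems = validate_config(config)
    if problems:
        raise ValueError("invalid config.yaml: " + "; ".join(problems))
    return config


# Validate an in-memory config equivalent to the example file.
example = {
    "ip_list": ["192.168.1.101", "192.168.1.102"],
    "port": 22,
    "password": "your_ssh_password",
    "user": "your_ssh_username",
    "feishu_url": "your_feishu_webhook_url",
    "feishu_keyword": "your_feishu_keyword",
}
print(validate_config(example))
```

Because the file holds an SSH password in plain text, it is worth keeping config.yaml out of version control.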

Usage

# running inside tmux is recommended
python -m src.run

All logs are written to gpu_log.

Demo

{
    "NVIDIA GeForce RTX 3090": [
        {
            "id": "db94",
            "pid": 577961,
            "process_name": "python",
            "used_memory": "448 MiB"
        },
        {
            "id": "db95",
            "pid": 148218,
            "process_name": "python",
            "used_memory": "448 MiB"
        },
        {
            "id": "db19",
            "pid": 2871586,
            "process_name": "python",
            "used_memory": "450 MiB"
        },
        {
            "id": "db21",
            "pid": 3103532,
            "process_name": "python",
            "used_memory": "450 MiB"
        }
    ]
}

【Monitoring】
GPU for db93
0: 0%, Mem=21/24576 MiB
1: 0%, Mem=7132/24576 MiB
2: 0%, Mem=8664/24576 MiB
3: 0%, Mem=20804/24576 MiB
4: 0%, Mem=20804/24576 MiB
5: 0%, Mem=1/24576 MiB
6: 0%, Mem=1/24576 MiB
7: 0%, Mem=1/24576 MiB


GPU for db94
0: 100%, Mem=17248/24576 MiB
1: 0%, Mem=10468/24576 MiB
2: 0%, Mem=20814/24576 MiB
3: 0%, Mem=13020/24576 MiB
4: 0%, Mem=9492/24576 MiB
5: 100%, Mem=12128/24576 MiB
6: 0%, Mem=0/24576 MiB
7: 0%, Mem=0/24576 MiB

GPU for db95
0: 3%, Mem=454/24576 MiB
1: 0%, Mem=17288/24576 MiB
2: 0%, Mem=16256/24576 MiB
3: 0%, Mem=4/24576 MiB
4: 0%, Mem=1/24576 MiB
5: 0%, Mem=1/24576 MiB
6: 0%, Mem=1/24576 MiB
7: 0%, Mem=1/24576 MiB
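
The JSON report at the top of the demo groups processes by GPU model and tags each entry with its host. One way such a structure could be assembled from `nvidia-smi --query-compute-apps` output is sketched below. The query string and the function name `group_processes` are assumptions for illustration. Filtering by user is not covered here, since it would need an extra lookup from PID to process owner on each server.

```python
import json
from collections import defaultdict

# Per-process query a script might run on each host (assumed, not taken from the repo).
# process_name comes last so that a comma inside the name cannot break the split.
PROCESS_QUERY = (
    "nvidia-smi --query-compute-apps=gpu_name,pid,used_memory,process_name "
    "--format=csv,noheader"
)


def group_processes(outputs_by_host: dict) -> dict:
    """Group per-host PROCESS_QUERY output into {gpu_name: [process records]}."""
    grouped = defaultdict(list)
    for host, output in outputs_by_host.items():
        for line in output.strip().splitlines():
            if not line.strip():
                continue
            gpu_name, pid, used_memory, process_name = (
                field.strip() for field in line.split(",", 3)
            )
            grouped[gpu_name].append(
                {
                    "id": host,
                    "pid": int(pid),
                    "process_name": process_name,
                    "used_memory": used_memory,
                }
            )
    return dict(grouped)


# Hypothetical output captured from two hosts.
outputs = {
    "db94": "NVIDIA GeForce RTX 3090, 577961, 448 MiB, python\n",
    "db95": "NVIDIA GeForce RTX 3090, 148218, 448 MiB, python\n",
}
print(json.dumps(group_processes(outputs), indent=4))
```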
