System-Telemetry-Agent

System Telemetry Agent is a tool built for Microsoft as part of Trinity SwEng 2024. Our product will allow a server engineer to monitor a group of servers live and assess a network’s health. It will provide a clean dashboard interface for users to understand the overall state of servers on a network and monitor each node individually.

Monitoring is a common problem in cloud computing. When a system scales horizontally across several nodes, it becomes crucial to identify faults quickly and reliably, and to react to critical conditions by sending alerts or re-allocating resources. For example, if a node runs out of memory, the system should be able to detect the source of the outage without someone manually tracing its effects back through the network. Tools like Prometheus, a key component in our solution, achieve this by regularly pulling metrics from target endpoints and triggering alerts when a node exposes critical metrics.
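To make the pull-and-alert idea concrete, here is a minimal sketch in Python. It is not how Prometheus itself is implemented. It only illustrates checking scraped text against a threshold. The function names, the metric name `ram_utilization_percent`, and the node names are assumptions made for this example, and the parser handles only simple `name value` lines (no timestamps).

```python
from __future__ import annotations


def parse_samples(text: str) -> dict[str, float]:
    """Parse simple 'name value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so the whole left-hand side is the key.
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def nodes_in_alert(scraped: dict[str, str], metric: str, threshold: float) -> list[str]:
    """Return the nodes whose value for `metric` is above `threshold`."""
    return sorted(
        node
        for node, text in scraped.items()
        if parse_samples(text).get(metric, 0.0) > threshold
    )


# Hypothetical scrape results, keyed by node name.
example_scrape = {
    "node-a": "# TYPE ram_utilization_percent gauge\nram_utilization_percent 97.5\n",
    "node-b": "ram_utilization_percent 41.0\n",
}
```

Calling `nodes_in_alert(example_scrape, "ram_utilization_percent", 90.0)` singles out the node that is close to running out of memory, which is the kind of check an alerting rule would express declaratively.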

Functional Requirements

In order to meet this vision, the project will have to address the following set of functional requirements:

Develop a monitoring agent that runs on a target machine and collects system metrics
The agent should support collection of the following metrics:
- CPU: temperature, frequency, utilization (%)
- RAM: utilization (%)
- Disk / Storage (OS drive): capacity (used vs. unused)
- Networking: in traffic (Mb/s), out traffic (Mb/s)
The agent should run on Linux machines
The monitoring process should be lightweight to avoid an 'observer effect'
The metrics are to be displayed to the end user in real-time on an interactive dashboard
The system should scale to support N machines
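As an illustration of the collection requirements above, the following sketch derives a few of the listed metrics on Linux using only the Python standard library. It is an assumption-laden example, not the project's actual agent: the function names are hypothetical, it reads the standard `/proc/stat` and `/proc/meminfo` files, and it leaves out CPU temperature and frequency.

```python
from __future__ import annotations

import shutil
import time


def ram_utilization(meminfo: str) -> float:
    """RAM utilization (%) computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # first token is the number
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total


def _cpu_times(stat: str) -> tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    # Fields: user nice system idle iowait irq softirq steal
    values = [int(v) for v in stat.splitlines()[0].split()[1:9]]
    return values[3] + values[4], sum(values)


def cpu_utilization(before: str, after: str) -> float:
    """CPU utilization (%) between two snapshots of /proc/stat."""
    idle0, total0 = _cpu_times(before)
    idle1, total1 = _cpu_times(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1 - (idle1 - idle0) / elapsed)


def megabits_per_second(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Traffic rate in Mb/s from two byte-counter readings."""
    return (bytes_after - bytes_before) * 8 / (seconds * 1_000_000)


def collect_once(interval: float = 1.0) -> dict[str, float]:
    """Take one sample of CPU, RAM and OS-drive usage (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    disk = shutil.disk_usage("/")
    return {
        "cpu_utilization_percent": cpu_utilization(before, after),
        "ram_utilization_percent": ram_utilization(meminfo),
        "disk_used_bytes": float(disk.used),
        "disk_free_bytes": float(disk.free),
    }
```

The parsing functions take text rather than file paths, so they can be exercised without a Linux host. Reading a handful of small `/proc` files per sample also keeps the overhead low, in line with the 'observer effect' requirement.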

System Architecture

The following diagram shows the components of the product we are aiming for. The goal is to have our Python agent run on N Ubuntu-based virtual machines, all hosted on Microsoft Azure. Each machine exposes its own metrics, which are pulled, queued in Azure Service Bus, and stored in a single Prometheus server instance (also hosted in the cloud).
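The "metrics are exposed on each individual machine" step could look roughly like the sketch below, which serves values in the Prometheus text exposition format over HTTP using only the standard library. The metric name, the port, and the `read_samples` placeholder are assumptions for illustration. The project's actual exporter may be structured differently, and the official `prometheus_client` package is the more usual choice. The Azure Service Bus step is not covered here.

```python
from __future__ import annotations

from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples: dict[str, float]) -> str:
    """Render samples as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def read_samples() -> dict[str, float]:
    """Placeholder: a real agent would return freshly collected values here."""
    return {"agent_ram_utilization_percent": 0.0}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the conventional /metrics path is served.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` on a machine and pointing a scraper at `http://<host>:8000/metrics` would return the rendered text. Each of the N machines would run the same endpoint.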

We plan to reach this architecture incrementally. First, we will simulate the Azure setup locally using a set of Ubuntu containers orchestrated by Docker. Then we will test the agent on one hosted VM, and introduce the service bus once we are comfortable moving to N machines.

The Team

Team Leads:

Dmitry Kryukov - 3rd Year

Kostiantyn Ohorodnyk - 3rd Year

Massimiliano Romagnoli - 3rd Year

Liam Zone - 3rd Year

Qiming Nie - 3rd Year

Frontend:

Ayomide Oyelakun - 2nd Year

Binli Wang - 2nd Year

Backend:

Leila Adil - 2nd Year

Cindy Ariyo - 2nd Year

Victor Dalessandris - 2nd Year

Claire McCooey - 2nd Year
