ZJUSCT/grafana.clusters.zjusct.io

grafana.clusters.zjusct.io

Chinese README | English README

Observability - ZJUSCT OpenDocs

This repository contains configuration files for the ZJUSCT observability system.

Demo

| Dashboard | Screenshot |
| --- | --- |
| pdu | pdu.jpeg |
| container | container.jpeg |
| netflow | netflow.jpeg |
| trace | trace.jpeg |
| cluster | cluster.jpeg |
| hostmetrics | hostmetrics.jpeg |
| syslog | syslog.jpeg |
| mirror | mirror.jpeg |

Todo List

Technology Stack

We ❤️ Open Source

| Layer | Components |
| --- | --- |
| Data Collection | OpenTelemetry Collector |
| Data Storage | ClickHouse, InfluxDB, Prometheus |
| Data Analysis, Visualization and Alerting | Grafana |

The overall design follows the KISS (Keep It Simple, Stupid) principle: interactions between layers are kept simple so that the system stays easy to maintain.

Data State

For ease of operations, the system state should be determined entirely by the configuration files in this repository, and the Docker containers themselves should be stateless. Data that must persist is stored locally using Docker Volumes.

  • Configuration files: Simple text files that can be managed using Git, stored in this repository. With these configuration files, we only need to clone the repository and run docker compose up to quickly deploy the entire system, ready to use out of the box.

    Some services store configurations in databases, such as Grafana, which uses built-in SQLite 3 to store configurations, users, dashboards, and other data. Nevertheless, it provides Provisioning functionality to initialize various configurations via configuration files. InfluxDB is more extreme, as its automatic token generation mechanism prevents configuration files from completely determining its state.

  • Database: Service databases need persistent storage. Docker officially recommends using Volumes to store write-intensive data like databases.

    Use volumes for write-heavy workloads: Volumes provide the best and most predictable performance for write-heavy workloads. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. Volumes have other benefits, such as allowing you to share data among containers and persisting even when no running container is using them.

    To work out of the box, every Volume in compose.yml uses a relative path, and the Git repository keeps the (empty) database folder structure.
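
    The relative-path layout described above might look roughly like the fragment below. This is a minimal sketch, not the repository's actual compose.yml: the service selection and the ./database/... subdirectory names are assumptions, and only config/grafana/provisioning is a path named in this README.

```yaml
# compose.yml (sketch; service names and ./database/* paths are hypothetical)
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    volumes:
      # Relative host path, resolved against the directory that holds compose.yml.
      # /var/lib/clickhouse is the data directory of the ClickHouse image.
      - ./database/clickhouse:/var/lib/clickhouse

  grafana:
    image: grafana/grafana
    volumes:
      # Configuration comes from the repository and is mounted read-only.
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
      # Grafana's built-in SQLite database lives under /var/lib/grafana.
      - ./database/grafana:/var/lib/grafana
```

    Because every host path starts with ./, a fresh clone can run docker compose up without editing any absolute paths.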

Security

Four access scopes are distinguished:

| Area | Trust Level | Exposed Services |
| --- | --- | --- |
| Docker internal, host machine | Communication managed by Docker Engine, fully trusted | All |
| Cluster internal | Good security status, no TLS encryption required | Services with no authentication or weak authentication, such as syslog, snmp |
| Campus network | TLS encryption required, authentication required | Only otel-collector and Grafana |
| Public network | Blocked | None |

Authentication tokens are stored in the cluster's Vaultwarden instance. compose.yml passes them to services as environment variables; the get_credential.sh script generates the .env file that Docker Compose reads. Do not commit the .env file to the Git repository.
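
As an illustration of how a token can flow from .env into a container, here is a minimal sketch. GF_SECURITY_ADMIN_PASSWORD is a real Grafana setting, but whether this repository uses that particular variable is an assumption; the variable name that get_credential.sh actually writes may differ.

```yaml
# compose.yml (sketch; the variable name is an assumption)
services:
  grafana:
    image: grafana/grafana
    environment:
      # Docker Compose substitutes ${...} from the .env file next to compose.yml.
      # The ":?" form makes Compose stop with this message if the variable is
      # unset or empty, instead of starting with an empty password.
      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD:?run get_credential.sh first}
```

Keeping only the variable reference in compose.yml means the file can be committed safely while the secret itself stays in the untracked .env file.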

Details

OpenTelemetry

The Collector is deployed in the Agent + Gateway pattern: agents focus on collecting data, while most transformation and processing logic lives in the gateway. Resource attributes identify the entity that produced the data and are attached by Collectors at different levels. Arranging data sources into a hierarchy through resource attributes makes analysis in Grafana easier.

flowchart TD
 subgraph s1["cloud.region"]
    n8["Infrastructure"]
    n1(["gateway(cluster)"])
    subgraph s3["host.name"]
     n7(["agent"])
     n6["service.name"]
    end
    subgraph s2["host.name"]
     n5["service.name"]
     n4(["agent"])
     n2["service.name"]
    end
 end
 n3(["gateway(final)"])
 n5 --> n4
 n2 --> n4
 n6 --> n7
 n7 --> n1
 n4 --> n1
 n1 --> n3
 n8 --> n1
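
The agent-to-gateway hand-off in the flowchart could be wired up roughly as follows. This is a sketch under stated assumptions, not the repository's configuration: the file names and both endpoint addresses are hypothetical, and the hostmetrics receiver stands in for whatever an agent actually collects. The two documents are separate Collector config files, shown together only for brevity.

```yaml
# agent.yaml (hypothetical file name): runs on each node, collects and forwards.
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
exporters:
  otlp:
    endpoint: cluster-gateway.example:4317   # hypothetical address
    tls:
      insecure: true   # plaintext; the Security section allows this inside the cluster
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
---
# gateway.yaml (hypothetical file name): cluster gateway, receives from agents.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    endpoint: final-gateway.example:4317   # hypothetical address; TLS is on by default
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The agent has no processors at all, which matches the goal of keeping edge nodes light. Authentication toward the final gateway is omitted here.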

Below are resource attributes and their sources in the ZJUSCT observability system, based on Semantic Conventions 1.28.0:

  • Node agent

    • Node's own resource attributes

      | Resource Attribute | Source | Notes |
      | --- | --- | --- |
      | Node host.name and system os.* | resourcedetector | Automatically added in the pipeline |
    • Services running on nodes can be roughly divided into two categories: processes and containers.

    | Resource Attribute | Source | Notes |
    | --- | --- | --- |
    | Process and runtime process.* | Not mandatory | - |
    | Service service.name | journaldreceiver | Using an Operator to extract the SYSLOG_IDENTIFIER field |
    | Service service.name | filelogreceiver | Added with an Operator, generally by hand (files rarely contain service names) |
    | Container container.name | dockerstatsreceiver | Automatically added |
    | Container container.name | filelogreceiver | Using an Operator to extract fields from Docker JSON logs; requires modifying Docker's daemon.json |
  • Cluster gateway

    | Resource Attribute | Source | Notes |
    | --- | --- | --- |
    | Cluster cloud.region | Manually added in the pipeline | OTel currently defines no true "cluster" resource attribute, so we temporarily use the cloud service attributes cloud.* instead. In simple cross-cluster deployments without a dedicated cluster gateway, cloud.region must be added by the agent. |
    | Device device.* | - | We primarily use device attributes (device.id, device.type) to represent infrastructure (routers, switches, smart PDUs, etc.), distinct from nodes. |
    | Device device.* | syslogreceiver | Using an Operator to extract the hostname field. |
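
On the agent side, the attribute sources listed above might translate into Collector configuration like the sketch below. It is illustrative only: the log path, the example-app service name, and the gateway address are hypothetical, and the resourcedetection processor with the system detector is assumed to be what the table calls "resourcedetector".

```yaml
# Agent config (sketch; paths, names, and endpoint are hypothetical)
receivers:
  journald:
    operators:
      # Copy the journal's SYSLOG_IDENTIFIER field into service.name.
      - type: copy
        from: body.SYSLOG_IDENTIFIER
        to: resource["service.name"]
  filelog:
    include: [/var/log/example-app/*.log]
    operators:
      # Plain log files rarely carry a service name, so it is set by hand.
      - type: add
        field: resource["service.name"]
        value: example-app
processors:
  resourcedetection:
    # The system detector fills in host.name and os.* attributes.
    detectors: [system]
exporters:
  otlp:
    endpoint: cluster-gateway.example:4317
    tls:
      insecure: true
service:
  pipelines:
    logs:
      receivers: [journald, filelog]
      processors: [resourcedetection]
      exporters: [otlp]
```

Each receiver attaches service.name itself, while the shared resourcedetection processor adds the node-level attributes once for the whole pipeline.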

Apart from attaching the resource attributes above and basic format parsing such as JSON, agents should do as little processing as possible. This simplifies deployment (most changes happen in the gateway) and keeps agents lightweight, reducing resource consumption at the edge.
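
On the gateway, the manually added cluster attribute could be expressed with the resource processor, as in this sketch. The processor instance name resource/cluster, the value example-cluster, and the final gateway address are all hypothetical.

```yaml
# Cluster gateway config (sketch; names, value, and endpoint are hypothetical)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  resource/cluster:
    attributes:
      # "insert" only sets the key when it is missing, so a cloud.region
      # already attached by an agent is left untouched.
      - key: cloud.region
        value: example-cluster
        action: insert
exporters:
  otlp:
    endpoint: final-gateway.example:4317
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [resource/cluster]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [resource/cluster]
      exporters: [otlp]
```

Using insert rather than upsert is a design choice for this sketch: it keeps the gateway compatible with the cross-cluster case where agents set cloud.region themselves.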

Grafana

Based on the data sources, resource attributes, and OpenTelemetry semantic conventions described above, we have created a series of Grafana dashboards, stored in the config/grafana/provisioning/dashboards directory:

  • zjusct/single: Panel collections designed for a single type of data (such as a specific Receiver and storage backend combination), which can be easily incorporated into dashboards.
  • zjusct/combined: Dashboards combining multiple data sources, for more specific application scenarios.
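
For context, Grafana loads dashboards from disk through a provider file placed in the provisioning directory. The sketch below shows what such a file could look like; the provider name is hypothetical and the container path assumes Grafana's default provisioning location, so the repository's actual file may differ.

```yaml
# Dashboard provider file (sketch; provider name is hypothetical)
apiVersion: 1
providers:
  - name: zjusct
    type: file
    options:
      # Path inside the container where the dashboards directory is mounted.
      path: /etc/grafana/provisioning/dashboards
      # Use the directory layout on disk to organize dashboards into folders.
      foldersFromFilesStructure: true
```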

Since Grafana data is not persistent, back up the dashboards before removing the Grafana Docker container. We use esnet/gdg (Grafana Dashboard Manager) for batch dashboard backups.

Many sites offer public dashboards, and we've referenced some of their designs:

Code Style

  • Use the EditorConfig plugin. You can refer to the .editorconfig file.
  • Format SQL using the VS Code SQLTools plugin.
