grafana.clusters.zjusct.io

This repository contains configuration files for the ZJUSCT observability system.

Demo

[Demo screenshots: pdu, container, netflow, trace, cluster, hostmetrics, syslog, mirror]

Todo List

To be done

- Grafana Provision Alerts

Pending

- InfluxDB Exporter
- Journald attribute processing: waiting on open-telemetry/opentelemetry-collector-contrib issue #7298 ("journald - Consider parsing more known fields from logs").

Technology Stack

We ❤️ Open Source

Layer	Components
Data Collection	OpenTelemetry Collector
Data Storage	ClickHouse, InfluxDB, Prometheus
Data Analysis, Visualization and Alerting	Grafana

The overall design follows the KISS (Keep It Simple, Stupid) principle: it keeps interactions between layers simple and makes the system easier to maintain.

Data State

For ease of operations, the system state should be fully determined by the configuration files in this repository, with the containers themselves kept stateless. Data that must persist is stored locally using Docker Volumes.

Configuration files: Simple text files that can be managed using Git, stored in this repository. With these configuration files, we only need to clone the repository and run docker compose up to quickly deploy the entire system, ready to use out of the box.

Some services store configuration in a database. Grafana, for example, uses a built-in SQLite 3 database for configuration, users, dashboards, and other data, but its Provisioning feature can initialize these settings from configuration files. InfluxDB is a harder case: its automatic token generation prevents configuration files from fully determining its state.

Database: Service databases need persistent storage. Docker officially recommends using Volumes to store write-intensive data like databases.

Use volumes for write-heavy workloads: Volumes provide the best and most predictable performance for write-heavy workloads. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. Volumes have other benefits, such as allowing you to share data among containers and persisting even when no running container is using them.

To work out of the box, the Volumes in compose.yml all use relative paths, and the Git repository keeps the empty folder structure of the database directory.
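
As a rough illustration of this layout, a compose.yml fragment might mount directories relative to the repository root. The service names, images, and host-side paths below are assumptions for illustration and are not copied from this repository's compose.yml; the container-side paths are the default data and provisioning locations of the official images.

```yaml
# Hypothetical fragment; service names and host paths are illustrative.
services:
  grafana:
    image: grafana/grafana
    volumes:
      # Provisioning files tracked in Git, mounted read-only.
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
      # Persistent state (the SQLite database), kept out of Git.
      - ./database/grafana:/var/lib/grafana
  clickhouse:
    image: clickhouse/clickhouse-server
    volumes:
      # ClickHouse data directory, persisted under the repository's database folder.
      - ./database/clickhouse:/var/lib/clickhouse
```

Because every host path starts with `./`, the same file works from any clone location without editing.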

Security

Four access scopes are distinguished:

Area	Trust Level	Exposed Services
Docker internal, host machine	Communication managed by Docker Engine, fully trusted	All
Cluster internal	Considered secure, no TLS encryption required	Services with no authentication or weak authentication, such as syslog and SNMP
Campus network	TLS encryption required, authentication required	Only otel-collector and Grafana
Public network	Blocked	None

Authentication tokens are hosted in the cluster's VaultWarden. They are set as environment variables in compose.yml, with the .env file generated using the get_credential.sh script and read by Docker Compose. The .env file should not be committed to the Git repository.
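
A minimal sketch of how a credential could flow from the .env file into a container, assuming a variable named `GRAFANA_ADMIN_PASSWORD` (a hypothetical name; the variables actually written by get_credential.sh may differ). `GF_SECURITY_ADMIN_PASSWORD` is Grafana's environment override for the admin password.

```yaml
# Hypothetical fragment; the variable name GRAFANA_ADMIN_PASSWORD is an assumption.
services:
  grafana:
    image: grafana/grafana
    environment:
      # Docker Compose substitutes ${...} from the .env file next to compose.yml.
      # The ":?" form aborts with the given message if the variable is unset or empty.
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?run get_credential.sh first}"
```

Failing fast on a missing variable avoids silently starting a service with an empty credential.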

Details

OpenTelemetry

The Collector is deployed in the Agent + Gateway mode. In this mode, agents focus on data collection, while most transformation and processing logic is handled by the gateway. Resource Attributes identify the entities that generate data and are attached by Collectors at different levels. Organizing data sources into a hierarchy through resource attributes makes the data easier to analyze in Grafana.

flowchart TD
    subgraph s1["cloud.region"]
        n8["Infrastructure"]
        n1(["gateway(cluster)"])
        subgraph s3["host.name"]
            n7(["agent"])
            n6["service.name"]
        end
        subgraph s2["host.name"]
            n5["service.name"]
            n4(["agent"])
            n2["service.name"]
        end
    end
    n3(["gateway(final)"])
    n5 --> n4
    n2 --> n4
    n6 --> n7
    n7 --> n1
    n4 --> n1
    n1 --> n3
    n8 --> n1
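
The agent side of this topology can be sketched as a small Collector configuration that collects locally and forwards everything to the cluster gateway over OTLP. This is an illustrative sketch, not the repository's actual agent config: the endpoint `gateway.example.internal:4317` is a placeholder, and the choice of scrapers is an assumption.

```yaml
# Hypothetical agent configuration; the gateway endpoint is a placeholder.
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:

processors:
  batch:

exporters:
  otlp:
    endpoint: gateway.example.internal:4317
    tls:
      # Plaintext OTLP; only appropriate inside the trusted cluster network.
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [batch]
      exporters: [otlp]
```

Keeping the agent pipeline this short leaves heavier transformation to the gateway.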

Below are resource attributes and their sources in the ZJUSCT observability system, based on Semantic Conventions 1.28.0:

Node agent

Node's own resource attributes

Resource Attribute	Source	Notes
Node `host.name` and system `os.*`	`resourcedetectionprocessor`	Automatically added in the pipeline
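
A minimal sketch of the node-level detection, assuming the `system` detector of the resource detection processor is what populates these attributes here. It is a fragment: the `journald` receiver and `otlp` exporter named in the pipeline are assumed to be defined elsewhere in the same file.

```yaml
# Fragment of a hypothetical agent configuration.
processors:
  resourcedetection:
    # The system detector fills in host.name and os.type from the local machine.
    detectors: [system]
    system:
      # Use the OS-reported hostname.
      hostname_sources: ["os"]

service:
  pipelines:
    logs:
      receivers: [journald]            # assumed to be defined elsewhere
      processors: [resourcedetection]
      exporters: [otlp]                # assumed to be defined elsewhere
```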

Services running on nodes can be roughly divided into two categories: processes and containers.

Resource Attribute	Source	Notes
Process and runtime `process.*`	Not mandatory	-
Service `service.name`	`journaldreceiver`	Extracted from the `SYSLOG_IDENTIFIER` field with an operator
	`filelogreceiver`	Added with an operator, usually as a manually specified value, since log files rarely contain service names
Container `container.name`	`dockerstatsreceiver`	Automatically added
	`filelogreceiver`	Extracted from Docker JSON logs with an operator; requires changes to Docker's `daemon.json`
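
The two `service.name` rows above can be sketched with log operators as follows. This is an assumption-laden sketch rather than the repository's configuration: the log path and the service name `example-app` are hypothetical, and the journald receiver is assumed to deliver journal fields as a map in the log body.

```yaml
# Fragment of a hypothetical agent configuration.
receivers:
  journald:
    operators:
      # Copy the journal's SYSLOG_IDENTIFIER field into the service.name resource attribute.
      - type: copy
        from: body.SYSLOG_IDENTIFIER
        to: resource["service.name"]

  filelog:
    include: [/var/log/example/app.log]   # hypothetical path
    operators:
      # Plain log files rarely name their service, so set it by hand.
      - type: add
        field: resource["service.name"]
        value: example-app                # hypothetical service name
```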

Cluster gateway

Resource Attribute	Source	Notes
Cluster `cloud.region`	Manually added in the pipeline	OTel does not currently define a true "cluster" resource attribute, so we use the cloud attributes `cloud.*` as a temporary substitute. In simple cross-cluster deployments without a dedicated cluster gateway, `cloud.region` must be added at the agent instead.
Device `device.*`	-	We use device attributes mainly to represent infrastructure (routers, switches, smart PDUs, etc.), as distinct from nodes. Attributes used: `device.id`, `device.type`.
	`syslogreceiver`	Extracted from the `hostname` field with an operator.
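
A sketch of the gateway-side attributes from this table, under stated assumptions: the cluster name `example-cluster` is a placeholder, the listen address and syslog protocol are illustrative, and mapping the syslog hostname to `device.id` specifically is an assumption (the table only says the `hostname` field is extracted).

```yaml
# Fragment of a hypothetical cluster gateway configuration.
receivers:
  syslog:
    udp:
      listen_address: "0.0.0.0:514"
    protocol: rfc3164
    operators:
      # The syslog parser stores the sender's hostname in attributes.hostname.
      - type: move
        from: attributes.hostname
        to: resource["device.id"]   # target attribute is an assumption

processors:
  resource:
    attributes:
      # Stamp everything passing through this gateway with its cluster.
      - key: cloud.region
        value: example-cluster      # hypothetical cluster name
        action: upsert
```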

Apart from attaching the resource attributes above and parsing basic formats such as JSON, agents should do as little processing as possible. This simplifies deployment (most changes happen in the gateway) and keeps agents lightweight, reducing resource consumption at the edge.

Grafana

Based on the data sources, resource attributes, and OpenTelemetry semantic conventions described above, we have created a series of Grafana dashboards, stored in the config/grafana/provisioning/dashboards directory:

zjusct/single: Panel collections designed for a single type of data (such as a specific Receiver and storage backend combination), which can be easily incorporated into dashboards.
zjusct/combined: Dashboards combining multiple data sources, for more specific application scenarios.
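
For reference, a file-based dashboard provider in Grafana's provisioning format could look like the sketch below. The provider name is hypothetical, and the container-side path assumes config/grafana/provisioning is mounted at Grafana's default /etc/grafana/provisioning; the repository's actual provider file may differ.

```yaml
# Hypothetical dashboard provider file under provisioning/dashboards.
apiVersion: 1
providers:
  - name: zjusct                 # hypothetical provider name
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30    # how often Grafana rescans the directory
    options:
      # Container-side path; assumes the provisioning directory is mounted here.
      path: /etc/grafana/provisioning/dashboards
      # Mirror subdirectories such as zjusct/single and zjusct/combined as Grafana folders.
      foldersFromFilesStructure: true
```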

Since Grafana data is not persistent, back up dashboards before removing the Grafana Docker container. We use esnet/gdg (Grafana Dashboard Manager) for batch dashboard backups.

Many sites offer public dashboards, and we've referenced some of their designs:

Sentry Software

Code Style

Use the EditorConfig plugin. You can refer to the .editorconfig file.
Format SQL using the VSCode SQLTools plugin.

