What motivated this proposal?
We're running JetStream clusters in production and discovered that critical lag metrics are missing from prometheus-nats-exporter.
For example, one of our replicas fell 1000+ messages behind, but we couldn't detect it because these metrics aren't exposed.
The NATS server already provides this data via /jsz?streams=true, but the exporter does not surface it as Prometheus metrics.
What is the proposed change?
Export the replica and source lag metrics that the NATS server already provides:
- Replica lag: replicas[].lag - how many operations behind each replica is
- Replica status: replicas[].current - whether the replica is up to date
- Source lag: sources[].lag - messages behind for mirrored/sourced streams
- Reserved resources: reserved_memory and reserved_storage
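For reference, a trimmed and abridged sketch (hypothetical values) of where these fields appear in a /jsz?streams=true response; the exact field placement here is our reading of the endpoint and should be verified against the server version in use:

```json
{
  "reserved_memory": 536870912,
  "reserved_storage": 1073741824,
  "account_details": [
    {
      "name": "A",
      "stream_detail": [
        {
          "name": "KV_metadata",
          "cluster": {
            "name": "c1",
            "leader": "node-1",
            "replicas": [
              { "name": "node-0", "current": false, "lag": 1899 }
            ]
          },
          "sources": [
            { "name": "upstream-0", "lag": 0 }
          ]
        }
      ]
    }
  ]
}
```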
These would become Prometheus metrics like:
nats_stream_replica_lag{stream="KV_metadata",replica="node-0"} 1899
nats_stream_replica_current{stream="KV_metadata",replica="node-0"} 0
nats_stream_source_lag{stream="my_stream",source="upstream-0"} 0
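As a rough sketch of how a collector could map the jsz data onto the metric names above (the function name and the assumed JSON layout are ours, not existing exporter code; real implementation would live in the exporter's Go collectors):

```python
def render_jsz_metrics(jsz: dict) -> list[str]:
    """Translate a /jsz?streams=true payload into Prometheus exposition lines.

    Assumes replicas live under account_details[].stream_detail[].cluster.replicas
    and sources under stream_detail[].sources -- verify against your server version.
    """
    lines = []
    # Reserved resources are reported once per server.
    for key in ("reserved_memory", "reserved_storage"):
        if key in jsz:
            lines.append(f"nats_jetstream_{key} {jsz[key]}")
    for acct in jsz.get("account_details", []):
        for stream in acct.get("stream_detail", []):
            name = stream["name"]
            for r in (stream.get("cluster") or {}).get("replicas") or []:
                lines.append(
                    f'nats_stream_replica_lag{{stream="{name}",replica="{r["name"]}"}} {r.get("lag", 0)}'
                )
                # Expose the boolean "current" flag as 0/1 for alerting.
                lines.append(
                    f'nats_stream_replica_current{{stream="{name}",replica="{r["name"]}"}} {int(bool(r.get("current")))}'
                )
            for s in stream.get("sources") or []:
                lines.append(
                    f'nats_stream_source_lag{{stream="{name}",source="{s["name"]}"}} {s.get("lag", 0)}'
                )
    return lines
```

This would let a simple alert rule like `nats_stream_replica_lag > 500 for 5m` catch the scenario described above before it becomes an outage.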
Who benefits from this change?
Users running JetStream in production, including:
- Multi-node clusters that need replication monitoring
- Cross-region/cross-cluster mirrors and sources
- Kubernetes operators managing NATS clusters
Currently, these users must write custom monitoring or discover issues only after failures occur.
What alternatives have you evaluated?
- Custom exporter: writing our own tool to poll /jsz?streams=true duplicates work the exporter should do
- CLI scripting: running nats stream info in loops does not scale
- Manual checks: reactive rather than proactive monitoring