
Commit 922a15d

Update metrics doc
1 parent 1553498 commit 922a15d

5 files changed

+181
-71
lines changed

docs/reference/metrics.md

Lines changed: 157 additions & 40 deletions
@@ -5,78 +5,195 @@ sidebar_position: 70
55

66
Quickwit exposes key metrics in the [Prometheus](https://prometheus.io/) format on the `/metrics` endpoint. You can use any front-end that supports Prometheus to examine the behavior of Quickwit visually.
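As a rough illustration of what a client sees on `/metrics`, the following Python sketch parses a small payload in the Prometheus text exposition format. The sample payload and the full metric names in it are assumptions (they suppose the namespace and metric name from the tables below are joined with an underscore), and the parser is a simplified sketch that does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload; the real endpoint returns many more series.
SAMPLE = '''\
# HELP quickwit_cache_cache_hits_total Number of cache hits by component
# TYPE quickwit_cache_cache_hits_total counter
quickwit_cache_cache_hits_total{component_name="fastfields"} 42
quickwit_memory_allocated_bytes 1048576
'''


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        # Labels become a sorted tuple of (key, value) pairs so they can be dict keys.
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


if __name__ == "__main__":
    for (name, labels), value in parse_metrics(SAMPLE).items():
        print(name, dict(labels), value)
```

In practice a Prometheus server or an existing client library would do this parsing; the sketch only shows how names, labels, and values are laid out.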
77

8+
:::tip
9+
10+
Workloads with a large number of indexes generate high cardinality metrics for the label `index`. Set the environment variable `QW_DISABLE_PER_INDEX_METRICS=true` to disable that label if this is problematic for your metrics database.
11+
12+
:::
13+
814
## Cache Metrics
915

10-
Currently Quickwit exposes metrics for three caches: `fastfields`, `shortlived`, `splitfooter`. These metrics share the same structure.
16+
Quickwit exposes several metrics for each cache. The cache type is given by the `component_name` label, whose values are `fastfields`, `shortlived`, `splitfooter`, `fd`, `partial_request`, and `searcher_split`.
1117

12-
| Namespace | Metric Name | Description | Type |
13-
| --------- | ----------- | ----------- | ---- |
14-
| `quickwit_cache_{cache_name}` | `in_cache_count` | Count of {cache_name} in cache | `gauge` |
15-
| `quickwit_cache_{cache_name}` | `in_cache_num_bytes` | Number of {cache_name} bytes in cache | `gauge` |
16-
| `quickwit_cache_{cache_name}` | `cache_hit_total` | Number of {cache_name} cache hits | `counter` |
17-
| `quickwit_cache_{cache_name}` | `cache_hits_bytes` | Number of {cache_name} cache hits in bytes | `counter` |
18-
| `quickwit_cache_{cache_name}` | `cache_miss_total` | Number of {cache_name} cache hits | `counter` |
18+
| Namespace | Metric Name | Description | Labels | Type |
19+
| --------- | ----------- | ----------- | ------ | ---- |
20+
| `quickwit_cache` | `in_cache_count` | Count of entries in cache by component | [`component_name`] | `gauge` |
21+
| `quickwit_cache` | `in_cache_num_bytes` | Number of bytes in cache by component | [`component_name`] | `gauge` |
22+
| `quickwit_cache` | `cache_hits_total` | Number of cache hits by component | [`component_name`] | `counter` |
23+
| `quickwit_cache` | `cache_hits_bytes` | Number of cache hits in bytes by component | [`component_name`] | `counter` |
24+
| `quickwit_cache` | `cache_misses_total` | Number of cache misses by component | [`component_name`] | `counter` |
25+
| `quickwit_cache` | `cache_evict_total` | Number of cache entries evicted by component | [`component_name`] | `counter` |
26+
| `quickwit_cache` | `cache_evict_bytes` | Number of bytes evicted from cache by component | [`component_name`] | `counter` |
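
The hit and miss counters above are typically combined into a hit ratio per `component_name`. The sketch below assumes the counter values have already been collected into plain dictionaries keyed by component; the function names are hypothetical and not part of Quickwit.

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def hit_ratios_by_component(hits_by_component, misses_by_component):
    """Compute the hit ratio for every component seen in either mapping."""
    components = set(hits_by_component) | set(misses_by_component)
    return {
        component: cache_hit_ratio(
            hits_by_component.get(component, 0),
            misses_by_component.get(component, 0),
        )
        for component in sorted(components)
    }
```

Because both inputs are counters, a ratio over a recent time window is computed from the increase of each counter over that window rather than from the raw totals.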
1927

20-
## CLI Metrics
28+
## Cluster Metrics
29+
30+
Cluster metrics help track the behavior of the Chitchat protocol.
31+
32+
Note: the cluster protocol uses gRPC to catch up on large deltas in its state. Those calls are monitored as [gRPC metrics](#grpc-metrics).
2133

2234
| Namespace | Metric Name | Description | Type |
2335
| --------- | ----------- | ----------- | ---- |
24-
| `quickwit` | `allocated_num_bytes` | Number of bytes allocated memory, as reported by jemalloc. | `gauge` |
36+
| `quickwit_cluster` | `live_nodes` | The number of live nodes observed locally | `gauge` |
37+
| `quickwit_cluster` | `ready_nodes` | The number of ready nodes observed locally | `gauge` |
38+
| `quickwit_cluster` | `zombie_nodes` | The number of zombie nodes observed locally | `gauge` |
39+
| `quickwit_cluster` | `dead_nodes` | The number of dead nodes observed locally | `gauge` |
40+
| `quickwit_cluster` | `cluster_state_size_bytes` | The size of the cluster state in bytes | `gauge` |
41+
| `quickwit_cluster` | `node_state_size_bytes` | The size of the node state in bytes | `gauge` |
42+
| `quickwit_cluster` | `node_state_keys` | The number of keys in the node state | `gauge` |
43+
| `quickwit_cluster` | `gossip_recv_messages_total` | Total number of gossip messages received | `counter` |
44+
| `quickwit_cluster` | `gossip_recv_bytes_total` | Total amount of gossip data received in bytes | `counter` |
45+
| `quickwit_cluster` | `gossip_sent_messages_total` | Total number of gossip messages sent | `counter` |
46+
| `quickwit_cluster` | `gossip_sent_bytes_total` | Total amount of gossip data sent in bytes | `counter` |
47+
| `quickwit_cluster` | `grpc_gossip_rounds_total` | Total number of gRPC gossip rounds performed with peer nodes | `counter` |
48+
49+
## Control Plane Metrics
2550

26-
## Common Metrics
51+
| Namespace | Metric Name | Description | Labels | Type |
52+
| --------- | ----------- | ----------- | ------ | ---- |
53+
| `quickwit_control_plane` | `indexes_total` | Number of indexes | | `gauge` |
54+
| `quickwit_control_plane` | `restart_total` | Number of control plane restarts | | `counter` |
55+
| `quickwit_control_plane` | `schedule_total` | Number of control plane schedule operations | | `counter` |
56+
| `quickwit_control_plane` | `apply_total` | Number of control plane apply plan operations | | `counter` |
57+
| `quickwit_control_plane` | `metastore_error_aborted` | Number of aborted metastore transactions (do not trigger a control plane restart) | | `counter` |
58+
| `quickwit_control_plane` | `metastore_error_maybe_executed` | Number of metastore transactions with an uncertain outcome (do trigger a control plane restart) | | `counter` |
59+
| `quickwit_control_plane` | `open_shards_total` | Number of open shards per source | [`index_id`] | `gauge` |
60+
| `quickwit_control_plane` | `shards` | Number of (remote/local) shards in the indexing plan | [`locality`] | `gauge` |
61+
62+
## gRPC Metrics
63+
64+
The following subsystems expose gRPC metrics: `cluster`, `control_plane`, `indexing`, `ingest`, `metastore`.
2765

2866
| Namespace | Metric Name | Description | Labels | Type |
2967
| --------- | ----------- | ----------- | ------ | ---- |
30-
| `quickwit` | `write_bytes`| Number of bytes written by a given component in [`indexer`, `merger`, `deleter`, `split_downloader_{merge,delete}`] | [`index`, `component`] | `counter` |
68+
| `quickwit_{subsystem}` | `grpc_requests_total` | Total number of gRPC requests processed | [`kind`, `rpc`, `status`] | `counter` |
69+
| `quickwit_{subsystem}` | `grpc_requests_in_flight` | Number of gRPC requests in-flight | [`kind`, `rpc`] | `gauge` |
70+
| `quickwit_{subsystem}` | `grpc_request_duration_seconds` | Duration of request in seconds | [`kind`, `rpc`, `status`] | `histogram` |
71+
| `quickwit_grpc` | `circuit_break_total` | Circuit breaker counter | | `counter` |
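
Histogram metrics such as `grpc_request_duration_seconds` are exposed in the Prometheus format as cumulative buckets, each tagged with an upper bound in the `le` label. The sketch below estimates a quantile from such buckets by linear interpolation, in the spirit of PromQL's `histogram_quantile`. The bucket bounds and counts in the usage example are made up, and the function is a simplified illustration rather than a reimplementation of PromQL.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations recorded
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate inside the +Inf bucket.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical cumulative buckets: (le, count).
    example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(estimate_quantile(0.95, example))
```

The estimate is only as precise as the bucket layout: every observation in a bucket is assumed to be uniformly distributed between its bounds.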
3172

3273
## Indexing Metrics
3374

3475
| Namespace | Metric Name | Description | Labels | Type |
3576
| --------- | ----------- | ----------- | ------ | ---- |
36-
| `quickwit_indexing` | `processed_docs_total`| Number of processed docs by index, source and processed status in [`valid`, `schema_error`, `parse_error`, `transform_error`] | [`index`, `source`, `docs_processed_status`] | `counter` |
37-
| `quickwit_indexing` | `processed_bytes`| Number of processed bytes by index, source and processed status in [`valid`, `schema_error`, `parse_error`, `transform_error`] | [`index`, `source`, `docs_processed_status`] | `counter` |
38-
| `quickwit_indexing` | `available_concurrent_upload_permits`| Number of available concurrent upload permits by component in [`merger`, `indexer`] | [`component`] | `gauge` |
39-
| `quickwit_indexing` | `ongoing_merge_operations`| Number of available concurrent upload permits by component in [`merger`, `indexer`]. | [`index`, `source`] | `gauge` |
77+
| `quickwit_indexing` | `processed_docs_total` | Number of processed docs by index and processed status | [`index`, `docs_processed_status`] | `counter` |
78+
| `quickwit_indexing` | `processed_bytes` | Number of bytes of processed documents by index and processed status | [`index`, `docs_processed_status`] | `counter` |
79+
| `quickwit_indexing` | `backpressure_micros` | Amount of time spent in backpressure (in micros) | [`actor_name`] | `counter` |
80+
| `quickwit_indexing` | `concurrent_upload_available_permits_num` | Number of available concurrent upload permits by component | [`component`] | `gauge` |
81+
| `quickwit_indexing` | `split_builders` | Number of existing index writer instances | | `gauge` |
82+
| `quickwit_indexing` | `ongoing_merge_operations` | Number of ongoing merge operations | | `gauge` |
83+
| `quickwit_indexing` | `pending_merge_operations` | Number of pending merge operations | | `gauge` |
84+
| `quickwit_indexing` | `pending_merge_bytes` | Number of pending merge bytes | | `gauge` |
85+
| `quickwit_indexing` | `kafka_rebalance_total` | Number of Kafka rebalances | | `counter` |
4086

4187
## Ingest Metrics
4288

43-
| Namespace | Metric Name | Description | Type |
44-
| --------- | ----------- | ----------- | ---- |
45-
| `quickwit_ingest` | `ingested_num_bytes` | Total size of the docs ingested in bytes | `counter` |
46-
| `quickwit_ingest` | `ingested_num_docs` | Number of docs received to be ingested | `counter` |
47-
| `quickwit_ingest` | `queue_count` | Number of queues currently active | `counter` |
89+
| Namespace | Metric Name | Description | Labels | Type |
90+
| --------- | ----------- | ----------- | ------ | ---- |
91+
| `quickwit_ingest` | `docs_total` | Total number of docs ingested, measured in ingester's leader | [`validity`] | `counter` |
92+
| `quickwit_ingest` | `docs_bytes_total` | Total size of the docs ingested in bytes, measured in ingester's leader | [`validity`] | `counter` |
93+
| `quickwit_ingest` | `ingest_result_total` | Number of ingest requests by result | [`result`] | `counter` |
94+
| `quickwit_ingest` | `reset_shards_operations_total` | Total number of reset shards operations performed | [`status`] | `counter` |
95+
| `quickwit_ingest` | `shards` | Number of shards hosted by the ingester | [`state`] | `gauge` |
96+
| `quickwit_ingest` | `shard_lt_throughput_mib` | Shard long term throughput as reported through Chitchat | | `histogram` |
97+
| `quickwit_ingest` | `shard_st_throughput_mib` | Shard short term throughput as reported through Chitchat | | `histogram` |
98+
| `quickwit_ingest` | `wal_acquire_lock_requests_in_flight` | Number of acquire lock requests in-flight | [`operation`, `type`] | `gauge` |
99+
| `quickwit_ingest` | `wal_acquire_lock_request_duration_secs` | Duration of acquire lock requests in seconds | [`operation`, `type`] | `histogram` |
100+
| `quickwit_ingest` | `wal_disk_used_bytes` | WAL disk space used in bytes | | `gauge` |
101+
| `quickwit_ingest` | `wal_memory_used_bytes` | WAL memory used in bytes | | `gauge` |
102+
<!-- uncomment when replication is released
103+
| `quickwit_ingest` | `replicated_num_bytes_total` | Total size in bytes of the replicated docs | | `counter` |
104+
| `quickwit_ingest` | `replicated_num_docs_total` | Total number of docs replicated | | `counter` |
105+
-->
106+
107+
Note that the legacy ingest (V1) only records the `docs_total` and `docs_bytes_total` metrics. Its `validity` label is always set to `valid` because the legacy ingest does not parse documents at ingest time. Invalid documents are discarded asynchronously in the indexing pipeline's doc processor.
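
Counters such as `docs_total` and `docs_bytes_total` only increase while the process is running, so throughput is derived from the difference between two scrapes. The sketch below shows one way to do that, treating a decrease as a counter reset (for example after a restart). The function name and the sample values are hypothetical.

```python
def counter_rate(prev_value, curr_value, elapsed_secs):
    """Return the per-second increase of a counter between two scrapes."""
    if elapsed_secs <= 0:
        raise ValueError("elapsed_secs must be positive")
    if curr_value < prev_value:
        # The counter went backwards: assume it was reset to zero in between.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / elapsed_secs


if __name__ == "__main__":
    # Two hypothetical scrapes of a bytes counter taken 30 seconds apart.
    print(counter_rate(1000, 4000, 30))
```

A Prometheus server applies the same idea with the `rate()` function, so this helper is mainly useful for ad hoc scripts.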
108+
109+
## Janitor Metrics
48110

49-
## Metastore Metrics
111+
| Namespace | Metric Name | Description | Labels | Type |
112+
| --------- | ----------- | ----------- | ------ | ---- |
113+
| `quickwit_janitor` | `ongoing_num_delete_operations_total` | Number of ongoing delete operations per index | [`index`] | `gauge` |
114+
| `quickwit_janitor` | `gc_deleted_splits_total` | Total number of splits deleted by the garbage collector | [`result`] | `counter` |
115+
| `quickwit_janitor` | `gc_deleted_bytes_total` | Total number of bytes deleted by the garbage collector | | `counter` |
116+
| `quickwit_janitor` | `gc_runs_total` | Total number of garbage collector executions | [`result`] | `counter` |
117+
| `quickwit_janitor` | `gc_seconds_total` | Total time spent running the garbage collector | | `counter` |
118+
119+
## Jaeger Metrics
120+
121+
| Namespace | Metric Name | Description | Labels | Type |
122+
| --------- | ----------- | ----------- | ------ | ---- |
123+
| `quickwit_jaeger` | `requests_total` | Number of requests | [`operation`, `index`] | `counter` |
124+
| `quickwit_jaeger` | `request_errors_total` | Number of failed requests | [`operation`, `index`] | `counter` |
125+
| `quickwit_jaeger` | `request_duration_seconds` | Duration of requests | [`operation`, `index`, `error`] | `histogram` |
126+
| `quickwit_jaeger` | `fetched_traces_total` | Number of traces retrieved from storage | [`operation`, `index`] | `counter` |
127+
| `quickwit_jaeger` | `fetched_spans_total` | Number of spans retrieved from storage | [`operation`, `index`] | `counter` |
128+
| `quickwit_jaeger` | `transferred_bytes_total` | Number of bytes transferred | [`operation`, `index`] | `counter` |
50129

51-
All metastore methods are monitored by the 3 metrics:
130+
## Memory Metrics
52131

53132
| Namespace | Metric Name | Description | Labels | Type |
54133
| --------- | ----------- | ----------- | ------ | ---- |
55-
| `quickwit_metastore` | `requests_total` | Number of requests | [`operation`, `index`] | `counter` |
56-
| `quickwit_metastore` | `request_errors_total` | Number of failed requests | [`operation`, `index`] | `counter` |
57-
| `quickwit_metastore` | `request_duration_seconds` | Duration of requests | [`operation`, `index`, `error`] | `histogram` |
134+
| `quickwit_memory` | `active_bytes` | Total number of bytes in active pages allocated by the application, as reported by jemalloc `stats.active` | | `gauge` |
135+
| `quickwit_memory` | `allocated_bytes` | Total number of bytes allocated by the application, as reported by jemalloc `stats.allocated` | | `gauge` |
136+
| `quickwit_memory` | `resident_bytes` | Total number of bytes in physically resident data pages mapped by the allocator, as reported by jemalloc `stats.resident` | | `gauge` |
137+
| `quickwit_memory` | `in_flight_data_bytes` | Amount of data in-flight in various buffers in bytes | [`component`] | `gauge` |
58138

59-
Examples of operation names: `create_index`, `index_metadata`, `delete_index`, `stage_splits`, `publish_splits`, `list_splits`, `add_source`, ...
139+
## Metastore Metrics
60140

61-
## Rest API Metrics
141+
| Namespace | Metric Name | Description | Labels | Type |
142+
| --------- | ----------- | ----------- | ------ | ---- |
143+
| `quickwit_metastore` | `acquire_connections` | Number of connections being acquired (PostgreSQL only) | | `gauge` |
144+
| `quickwit_metastore` | `active_connections` | Number of active (used + idle) connections (PostgreSQL only) | | `gauge` |
145+
| `quickwit_metastore` | `idle_connections` | Number of idle connections (PostgreSQL only) | | `gauge` |
62146

63-
| Namespace | Metric Name | Description | Type |
64-
| --------- | ----------- | ----------- | ---- |
65-
| `quickwit` | `http_requests_total` | Total number of HTTP requests received | `counter` |
147+
## OTLP Metrics
148+
149+
| Namespace | Metric Name | Description | Labels | Type |
150+
| --------- | ----------- | ----------- | ------ | ---- |
151+
| `quickwit_otlp` | `requests_total` | Number of requests | [`service`, `index`, `transport`, `format`] | `counter` |
152+
| `quickwit_otlp` | `request_errors_total` | Number of failed requests | [`service`, `index`, `transport`, `format`] | `counter` |
153+
| `quickwit_otlp` | `request_duration_seconds` | Duration of requests | [`service`, `index`, `transport`, `format`, `error`] | `histogram` |
154+
| `quickwit_otlp` | `ingested_log_records_total` | Number of log records ingested | [`service`, `index`, `transport`, `format`] | `counter` |
155+
| `quickwit_otlp` | `ingested_spans_total` | Number of spans ingested | [`service`, `index`, `transport`, `format`] | `counter` |
156+
| `quickwit_otlp` | `ingested_bytes_total` | Number of bytes ingested | [`service`, `index`, `transport`, `format`] | `counter` |
157+
158+
## REST API Metrics
159+
160+
| Namespace | Metric Name | Description | Labels | Type |
161+
| --------- | ----------- | ----------- | ------ | ---- |
162+
| `quickwit` | `http_requests_total` | Total number of HTTP requests processed | [`method`, `status_code`] | `counter` |
163+
| `quickwit` | `request_duration_secs` | Response time in seconds | [`method`, `status_code`] | `histogram` |
164+
| `quickwit` | `ongoing_requests` | Number of ongoing requests on specific endpoint groups | [`endpoint_group`] | `gauge` |
165+
| `quickwit` | `pending_requests` | Number of pending requests on specific endpoint groups | [`endpoint_group`] | `gauge` |
66166

67167
## Search Metrics
68168

69-
| Namespace | Metric Name | Description | Type |
70-
| --------- | ----------- | ----------- | ---- |
71-
| `quickwit_search` | `leaf_searches_splits_total` | Number of leaf searches (count of splits) started | `counter` |
72-
| `quickwit_search` | `leaf_search_split_duration_secs` | Number of seconds required to run a leaf search over a single split. The timer starts after the semaphore is obtained | `histogram` |
73-
| `quickwit_search` | `active_search_threads_count` | Number of threads in use in the CPU thread pool | `gauge` |
169+
| Namespace | Metric Name | Description | Labels | Type |
170+
| --------- | ----------- | ----------- | ------ | ---- |
171+
| `quickwit_search` | `root_search_requests_total` | Total number of root search gRPC requests processed | [`status`] | `counter` |
172+
| `quickwit_search` | `root_search_request_duration_seconds` | Duration of root search gRPC requests in seconds | [`status`] | `histogram` |
173+
| `quickwit_search` | `root_search_targeted_splits` | Number of splits targeted per root search gRPC request | [`status`] | `histogram` |
174+
| `quickwit_search` | `leaf_search_requests_total` | Total number of leaf search gRPC requests processed | [`status`] | `counter` |
175+
| `quickwit_search` | `leaf_search_request_duration_seconds` | Duration of leaf search gRPC requests in seconds | [`status`] | `histogram` |
176+
| `quickwit_search` | `leaf_search_targeted_splits` | Number of splits targeted per leaf search gRPC request | [`status`] | `histogram` |
177+
| `quickwit_search` | `leaf_searches_splits_total` | Number of leaf searches (count of splits) started | | `counter` |
178+
| `quickwit_search` | `leaf_search_split_duration_secs` | Number of seconds required to run a leaf search over a single split. The timer starts after the semaphore is obtained | | `histogram` |
179+
| `quickwit_search` | `leaf_search_single_split_tasks` | Number of single split search tasks pending or ongoing | [`status`] | `gauge` |
180+
| `quickwit_search` | `leaf_search_single_split_warmup_num_bytes` | Size of the short lived cache for a single split once the warmup is done | | `histogram` |
181+
| `quickwit_search` | `job_assigned_total` | Number of jobs assigned to searchers, per affinity rank | [`affinity`] | `counter` |
182+
| `quickwit_search` | `searcher_local_kv_store_size_bytes` | Size of the searcher kv store in bytes. This store is used to cache scroll contexts | | `gauge` |
74183

75184
## Storage Metrics
76185

186+
| Namespace | Metric Name | Description | Labels | Type |
187+
| --------- | ----------- | ----------- | ------ | ---- |
188+
| `quickwit_storage` | `get_slice_timeout_outcome` | Outcome of `get_slice` operations. `success_after_1_timeout` means the operation succeeded after a retry caused by a timeout | [`outcome`] | `counter` |
189+
| `quickwit_storage` | `object_storage_requests_total` | Number of requests to the object store, by action and status. Requests are recorded when the response headers are returned | [`action`, `status`] | `counter` |
190+
| `quickwit_storage` | `object_storage_request_duration` | Durations until the response headers are returned from the object store, by action and status | [`action`, `status`] | `histogram` |
191+
| `quickwit_storage` | `object_storage_download_num_bytes` | Amount of data downloaded from object storage | [`status`] | `counter` |
192+
| `quickwit_storage` | `object_storage_download_errors` | Number of download requests that received successful response headers but failed during download | [`status`] | `counter` |
193+
| `quickwit_storage` | `object_storage_upload_num_bytes` | Amount of data uploaded to object storage. The value recorded for failed and aborted uploads is the full payload size | [`status`] | `counter` |
194+
195+
## CLI Metrics
196+
77197
| Namespace | Metric Name | Description | Type |
78198
| --------- | ----------- | ----------- | ---- |
79-
| `quickwit_storage` | `object_storage_gets_total` | Number of objects fetched | `counter` |
80-
| `quickwit_storage` | `object_storage_puts_total` | Number of objects uploaded. May differ from object_storage_requests_parts due to multipart upload | `counter` |
81-
| `quickwit_storage` | `object_storage_puts_parts` | Number of object parts uploaded | `counter` |
82-
| `quickwit_storage` | `object_storage_download_num_bytes` | Amount of data downloaded from an object storage | `counter` |
199+
| `quickwit_cli` | `thread_unpark_duration_microseconds` | Duration for which a thread of the main tokio runtime is unparked | `histogram` |
