Commit 46306a8

refactor and monitor the health of etcd with metrics endpoint

rossi Pan committed
1 parent e8e296f

19 files changed, +1183 -182 lines changed

README.md

Lines changed: 65 additions & 50 deletions
@@ -83,30 +83,45 @@ $ kubectl apply -f zabbix-agent-daemonset.yaml

 | Zabbix Item Name | Zabbix Item Key |
 | ------------ | ----------- |
-| **etcd node: health**| healthz|
-| **etcd node: receive requests**| v2/stats/self:recvAppendRequestCnt |
-| **etcd node: send requests**| v2/stats/self:sendAppendRequestCnt |
-| **etcd node: state**| v2/stats/self:state |
-| **etcd node: expires**| v2/stats/store:expireCount |
-| **etcd node: gets fail**| v2/stats/store:getsFail |
-| **etcd node: gets success**| v2/stats/store:getsSuccess |
-| **etcd node: watchers**| v2/stats/store:watchers |
-| **etcd cluster: sets fail**| v2/stats/store:setsFail |
-| **etcd cluster: sets success**| v2/stats/store:setsSuccess |
-| **etcd cluster: update fail**| v2/stats/store:updateFail |
-| **etcd cluster: update success**| v2/stats/store:updateSuccess |
-| **etcd cluster: compare and delete fail**| v2/stats/store:compareAndDeleteFail |
-| **etcd cluster: compare and delete success**| v2/stats/store:compareAndDeleteSuccess |
-| **etcd cluster: compare and swap fail**| v2/stats/store:compareAndSwapFail |
-| **etcd cluster: compare and swap success**| v2/stats/store:compareAndSwapSuccess |
-| **etcd cluster: create fail**| v2/stats/store:createFail |
-| **etcd cluster: create success**| v2/stats/store:createSuccess |
-| **etcd cluster: delete fail**| v2/stats/store:deleteFail |
-| **etcd cluster: delete success**| v2/stats/store:deleteSuccess |
-| **ETCD MEMBERS**| v2/members |
-| **etcd follower: {#MEMBER NAME} failed raft requests**| v2/stats/leader:followers/{#MEMBER ID}/counts/fail |
-| **etcd follower: {#MEMBER NAME} successful raft requests**| v2/stats/leader:followers/{#MEMBER ID}/counts/success |
-| **etcd follower: {#MEMBER NAME} latency to leader**| v2/stats/leader:followers/{#MEMBER ID}/latency/current |
+| **etcd node: health**| etcd.stats["health:health"]|
+| **etcd node: receive requests**| etcd.stats["v2/stats/self:recvAppendRequestCnt"] |
+| **etcd node: send requests**| etcd.stats["v2/stats/self:sendAppendRequestCnt"] |
+| **etcd node: state**| etcd.stats["v2/stats/self:state"] |
+| **etcd node: expires**| etcd.stats["v2/stats/store:expireCount"] |
+| **etcd node: gets fail**| etcd.stats["v2/stats/store:getsFail"] |
+| **etcd node: gets success**| etcd.stats["v2/stats/store:getsSuccess"] |
+| **etcd node: watchers**| etcd.stats["v2/stats/store:watchers"] |
+| **etcd cluster: sets fail**| etcd.stats["v2/stats/store:setsFail"] |
+| **etcd cluster: sets success**| etcd.stats["v2/stats/store:setsSuccess"] |
+| **etcd cluster: update fail**| etcd.stats["v2/stats/store:updateFail"] |
+| **etcd cluster: update success**| etcd.stats["v2/stats/store:updateSuccess"] |
+| **etcd cluster: compare and delete fail**| etcd.stats["v2/stats/store:compareAndDeleteFail"] |
+| **etcd cluster: compare and delete success**| etcd.stats["v2/stats/store:compareAndDeleteSuccess"] |
+| **etcd cluster: compare and swap fail**| etcd.stats["v2/stats/store:compareAndSwapFail"] |
+| **etcd cluster: compare and swap success**| etcd.stats["v2/stats/store:compareAndSwapSuccess"] |
+| **etcd cluster: create fail**| etcd.stats["v2/stats/store:createFail"] |
+| **etcd cluster: create success**| etcd.stats["v2/stats/store:createSuccess"] |
+| **etcd cluster: delete fail**| etcd.stats["v2/stats/store:deleteFail"] |
+| **etcd cluster: delete success**| etcd.stats["v2/stats/store:deleteSuccess"] |
+| **ETCD MEMBERS**| etcd.member.discovery |
+| **etcd follower: {#MEMBER NAME} failed raft requests**| etcd.stats["v2/stats/leader:followers/{#ID}/counts/fail"] |
+| **etcd follower: {#MEMBER NAME} successful raft requests**| etcd.stats["v2/stats/leader:followers/{#ID}/counts/success"] |
+| **etcd follower: {#MEMBER NAME} latency to leader**| etcd.stats["v2/stats/leader:followers/{#ID}/latency/current"] |
+| **The number of leader changes seen**| etcd.metrics[counter,etcd_server_leader_changes_seen_total] |
+| **The total number of failed proposals seen**| etcd.metrics[counter,etcd_server_proposals_failed_total] |
+| **Whether or not a leader exists. 1 is existence, 0 is not**| etcd.metrics[gauge,etcd_server_has_leader] |
+| **The total number of consensus proposals applied in last 5 minutes**| etcd.metrics[gauge,etcd_server_proposals_applied_total] |
+| **The total number of consensus proposals committed in last 5 minutes**| etcd.metrics[gauge,etcd_server_proposals_committed_total] |
+| **The current number of pending proposals to commit**| etcd.metrics[gauge,etcd_server_proposals_pending] |
+| **Maximum number of open file descriptors**| etcd.metrics[gauge,process_max_fds] |
+| **Number of open file descriptors**| etcd.metrics[gauge,process_open_fds] |
+| **etcd_disk_backend_commit_duration_seconds_count in last 5 minutes**| etcd.metrics[histogram,etcd_disk_backend_commit_duration_seconds_count] |
+| **etcd_disk_backend_commit_duration_seconds_sum in last 5 minutes**| etcd.metrics[histogram,etcd_disk_backend_commit_duration_seconds_sum] |
+| **The latency distributions of commit called by backend in last 5 minutes**| last("etcd.metrics[histogram,etcd_disk_backend_commit_duration_seconds_sum]",0)/last("etcd.metrics[histogram,etcd_disk_backend_commit_duration_seconds_count]",0) |
+| **etcd_disk_wal_fsync_duration_seconds_count in last 5 minutes**| etcd.metrics[histogram,etcd_disk_wal_fsync_duration_seconds_count] |
+| **etcd_disk_wal_fsync_duration_seconds_sum in last 5 minutes**| etcd.metrics[histogram,etcd_disk_wal_fsync_duration_seconds_sum] |
+| **The latency distributions of fsync called by wal in last 5 minutes**| last("etcd.metrics[histogram,etcd_disk_wal_fsync_duration_seconds_sum]",0)/last("etcd.metrics[histogram,etcd_disk_wal_fsync_duration_seconds_count]",0) |
+


 ### Kubernetes apiserver/controller/scheduler
### Kubernetes apiserver/controller/scheduler
@@ -119,34 +134,34 @@ $ kubectl apply -f zabbix-agent-daemonset.yaml
 | **apiserver_request_count: error_rate (verb=PATCH)**| apiserver_request_error_rate[PATCH]|
 | **apiserver_request_count: error_rate (verb=POST)**| apiserver_request_error_rate[POST]|
 | **apiserver_request_count: error_rate (verb=PUT)**| apiserver_request_error_rate[PUT]|
-| **apiserver_request_count: verb=DELETE, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,DELETE:error_count]|
-| **apiserver_request_count: verb=DELETE, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,DELETE:total_count]|
-| **apiserver_request_count: verb=GET, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,GET:error_count]|
-| **apiserver_request_count: verb=GET, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,GET:total_count]|
-| **apiserver_request_count: verb=LIST, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,LIST:error_count]|
-| **apiserver_request_count: verb=POST, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,LIST:total_count]|
-| **apiserver_request_count: verb=PATCH, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PATCH:error_count]|
-| **apiserver_request_count: verb=PATCH, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PATCH:total_count]|
-| **apiserver_request_count: verb=POST, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,POST:error_count]|
-| **apiserver_request_count: verb=POST, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,POST:total_count]|
-| **apiserver_request_count: verb=PUT, metrics=error_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PUT:error_count]|
-| **apiserver_request_count: verb=PUT, metrics=total_count**| metrics_exporter[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PUT:total_count]|
-| **apiserver_request_latencies: DELETE**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,DELETE]|
-| **apiserver_request_latencies: GET**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,GET]|
-| **apiserver_request_latencies: LIST**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,LIST]|
-| **apiserver_request_latencies: PATCH**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,PATCH]|
-| **apiserver_request_latencies: POST**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,POST]|
-| **apiserver_request_latencies: PUT**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,PUT]|
-| **apiserver_request_latencies: POST**| metrics_exporter[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,POST]|
-| **apiserver: healthz**| metrics_exporter[https://{HOST.IP}:443/healthz,healthz]|
-| **kube-scheduler: healthz**| metrics_exporter[http://{HOST.IP}:10251/healthz,healthz]|
-| **kube-scheduler: current leader**| metrics_exporter[https://{HOST.IP}:443,get_leader,kube-scheduler]|
-| **kube-controller-manager: healthz**| metrics_exporter[http://{HOST.IP}:10252/healthz,healthz]|
-| **kube-controller-manager: current leader**| metrics_exporter[https://{HOST.IP}:443,get_leader,kube-controller-manager]|
+| **apiserver_request_count: verb=DELETE, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,DELETE:error_count]|
+| **apiserver_request_count: verb=DELETE, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,DELETE:total_count]|
+| **apiserver_request_count: verb=GET, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,GET:error_count]|
+| **apiserver_request_count: verb=GET, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,GET:total_count]|
+| **apiserver_request_count: verb=LIST, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,LIST:error_count]|
+| **apiserver_request_count: verb=LIST, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,LIST:total_count]|
+| **apiserver_request_count: verb=PATCH, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PATCH:error_count]|
+| **apiserver_request_count: verb=PATCH, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PATCH:total_count]|
+| **apiserver_request_count: verb=POST, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,POST:error_count]|
+| **apiserver_request_count: verb=POST, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,POST:total_count]|
+| **apiserver_request_count: verb=PUT, metrics=error_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PUT:error_count]|
+| **apiserver_request_count: verb=PUT, metrics=total_count**| kube.metrics[https://{HOST.IP}:443/metrics,counter,apiserver_request_count,PUT:total_count]|
+| **apiserver_request_latencies: DELETE**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,DELETE]|
+| **apiserver_request_latencies: GET**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,GET]|
+| **apiserver_request_latencies: LIST**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,LIST]|
+| **apiserver_request_latencies: PATCH**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,PATCH]|
+| **apiserver_request_latencies: POST**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,POST]|
+| **apiserver_request_latencies: PUT**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,PUT]|
+| **apiserver_request_latencies: POST**| kube.metrics[https://{HOST.IP}:443/metrics,summary,apiserver_request_latencies_summary,POST]|
+| **apiserver: healthz**| kube.metrics[https://{HOST.IP}:443/healthz,healthz]|
+| **kube-scheduler: healthz**| kube.metrics[http://{HOST.IP}:10251/healthz,healthz]|
+| **kube-scheduler: current leader**| kube.metrics[https://{HOST.IP}:443,get_leader,kube-scheduler]|
+| **kube-controller-manager: healthz**| kube.metrics[http://{HOST.IP}:10252/healthz,healthz]|
+| **kube-controller-manager: current leader**| kube.metrics[https://{HOST.IP}:443,get_leader,kube-controller-manager]|


 ### Kubelet
 | Zabbix Item Name | Zabbix Item Key |
 | ------------ | ----------- |
-| **kubelet: healthz**| metrics_exporter[https://{HOST.IP}:10250/healthz,healthz]|
-| **KUBELET_RUNNING_POD_COUNT**| metrics_exporter[https://{HOST.IP}:10250/metrics,gauge,kubelet_running_pod_count]|
+| **kubelet: healthz**| kube.metrics[https://{HOST.IP}:10250/healthz,healthz]|
+| **KUBELET_RUNNING_POD_COUNT**| kube.metrics[https://{HOST.IP}:10250/metrics,gauge,kubelet_running_pod_count]|
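Note on the calculated latency items in the etcd table: each one divides a histogram's `_sum` by a `_count`, which gives a mean duration per observation only when both operands belong to the same histogram and cover the same window. The sketch below illustrates that arithmetic in plain Python. The function name `mean_latency` and the sample values are hypothetical and are not part of this commit or of the Zabbix template.

```python
def mean_latency(duration_sum, observation_count):
    """Return mean seconds per observation, or None when nothing was observed."""
    if observation_count <= 0:
        # Avoid a division by zero when the histogram has no observations yet.
        return None
    return duration_sum / float(observation_count)


# Hypothetical values: 0.5 s spent across 200 backend commits.
print(mean_latency(0.5, 200))
# No observations in the window.
print(mean_latency(0.0, 0))
```

Returning `None` for an empty window is one possible choice; a template could equally report 0 or mark the item unsupported.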

etc/zabbix/exporter/__init__.py

Whitespace-only changes.

etc/zabbix/exporter/etcd/__init__.py

Whitespace-only changes.
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+#!/usr/bin/env python
+"""
+Monitoring the health of etcd with metrics api.
+
+Examples:
+    $ ./etcd-metrics.py -t gauge -q etcd_server_has_leader
+    $ ./etcd-metrics.py -t counter -q etcd_server_leader_changes_seen_total
+    $ ./etcd-metrics.py -t gauge -q process_max_fds
+    $ ./etcd-metrics.py -t gauge -q process_open_fds
+    $ ./etcd-metrics.py -t counter -q etcd_server_proposals_failed_total
+    $ ./etcd-metrics.py -t gauge -q etcd_server_proposals_committed_total
+    $ ./etcd-metrics.py -t gauge -q etcd_server_proposals_applied_total
+    $ ./etcd-metrics.py -t gauge -q etcd_server_proposals_pending
+    $ ./etcd-metrics.py -t histogram -q etcd_disk_backend_commit_duration_seconds_sum
+    $ ./etcd-metrics.py -t histogram -q etcd_disk_backend_commit_duration_seconds_count
+    $ ./etcd-metrics.py -t histogram -q etcd_disk_wal_fsync_duration_seconds_sum
+    $ ./etcd-metrics.py -t histogram -q etcd_disk_wal_fsync_duration_seconds_count
+"""
+import json
+import os, sys
+import urllib2
+import urllib2_ssl
+import time
+import StringIO
+import argparse
+import ConfigParser
+from base64 import b16encode
+from sys import exit, stderr
+
+# this lets the script import parent modules when executed directly
+sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))
+from prometheus_client.parser import text_string_to_metric_families
+
+stats_cache_file_tmpl = '/tmp/zbx_etcd_stats_{url}.txt'
+rootfs_path = '/rootfs'
+etcd_config_file = rootfs_path + '/etc/etcd-environment'
+
+config = StringIO.StringIO()
+config.write('[dummysection]\n')
+config.write(open(etcd_config_file).read())
+config.seek(0, os.SEEK_SET)
+cp = ConfigParser.ConfigParser()
+cp.readfp(config)
+node_url = cp.get('dummysection', 'ETCD_ADVERTISE_CLIENT_URLS') + '/metrics'
+key_file = rootfs_path + '/etc/ssl/certs/etcd-client-key.pem'
+cert_file = rootfs_path + '/etc/ssl/certs/etcd-client.pem'
+ca_certs = rootfs_path + '/etc/ssl/certs/etcd-trusted-ca.pem'
+
+def connect(timeout=60):
+    '''Get the specified stats from the etcd (or from cached data).'''
+
+    # generate path for cache file
+    cache_file = stats_cache_file_tmpl.format(url=b16encode(node_url))
+
+    # get the age of the cache file
+    if os.path.exists(cache_file):
+        cache_age = int(time.time() - os.path.getmtime(cache_file))
+    else:
+        cache_age = timeout
+
+    # read stats from cache if it's still valid
+    if cache_age < timeout:
+        with open(cache_file, 'r') as c:
+            raw = c.read()
+
+    # otherwise, get fresh stats from the etcd server
+    else:
+        try:
+            opener = urllib2.build_opener(urllib2_ssl.HTTPSHandler(
+                key_file=key_file,
+                cert_file=cert_file,
+                ca_certs=ca_certs))
+            raw = opener.open('%s' % (node_url)).read()
+        except (urllib2.URLError, ValueError) as e:
+            if getattr(e, 'code', None) == 403:  # only HTTPError carries a status code
+                raw = e.read()
+            else:
+                print >> stderr, '%s (%s)' % (e, node_url)
+                return None
+
+        try:
+            # save the contents to cache_file
+            cache_file_tmp = open(cache_file + '.tmp', "w")
+            cache_file_tmp.write(raw)
+            cache_file_tmp.flush()
+            cache_file_tmp.close()
+            os.rename(cache_file + '.tmp', cache_file)
+        except:
+            pass
+
+    # finally return the raw (unparsed) response
+    try:
+        response = raw
+    except Exception as e:
+        print >> stderr, e
+        return None
+
+    return response
+
+def gauge(query_label_name):
+    metrics = connect()
+    value = None  # returned when the fetch fails or the sample is absent
+    for family in text_string_to_metric_families(metrics or ''):
+        for sample in family.samples:
+            item = "{0}".format(*sample)
+            if item == query_label_name:
+                value = "{2}".format(*sample)
+                break
+
+    return value
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Fetch etcd server metric')
+    parser.add_argument('-t', dest='query_type', action='store', help='[gauge|counter|histogram]', required=True)
+    parser.add_argument('-q', dest='query_label_name', action='store', required=True)
+
+    args = parser.parse_args()
+    query_type = args.query_type
+    query_label_name = args.query_label_name
+    result = None  # stays None for an unsupported query type
+    if query_type == 'gauge' or query_type == 'histogram':
+        result = gauge(query_label_name=query_label_name)
+    elif query_type == 'counter':
+        # Make counter metric name not have _total internally.
+        # With OpenMetrics the _total is a suffix on a sample
+        # for a counter, so the convention that Counters should end
+        # in total is now enforced. If an existing counter is
+        # missing the _total, it'll now appear on the /metrics.
+        # https://github.com/prometheus/client_python/commit/a4dd93bcc6a0422e10cfa585048d1813909c6786
+        if not query_label_name.endswith('_total'):
+            query_label_name = query_label_name + '_total'
+
+        result = gauge(query_label_name=query_label_name)
+
+    if result is not None:
+        print result
+    else:
+        print "ZBX_NOTSUPPORTED"
+        exit(1)
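
Note on the lookup above: the script delegates parsing to `prometheus_client`'s `text_string_to_metric_families` and compares each sample name against the query, appending `_total` for counters. The sketch below reproduces that lookup with the standard library only, so it can be tried without etcd or `prometheus_client`. It is a deliberately simplified stand-in, not the library's parser: it assumes unlabelled samples of the form `name value`, and the helper names `normalize_counter_name` and `find_sample` as well as the sample payload are hypothetical.

```python
def normalize_counter_name(name):
    """Append the _total suffix that counter samples carry, if it is missing."""
    return name if name.endswith('_total') else name + '_total'


def find_sample(exposition_text, sample_name):
    """Return the value of the first unlabelled sample with this name, or None."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comment lines
        parts = line.split()
        if len(parts) >= 2 and parts[0] == sample_name:
            return float(parts[1])
    return None


# Hypothetical payload in the Prometheus text exposition format.
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
"""

print(find_sample(SAMPLE, 'etcd_server_has_leader'))
print(find_sample(SAMPLE, normalize_counter_name('etcd_server_leader_changes_seen')))
print(find_sample(SAMPLE, 'no_such_metric'))
```

Returning `None` for a missing sample mirrors how the script falls back to printing `ZBX_NOTSUPPORTED`.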

etc/zabbix/etcd-stats/etcd-stats.py renamed to etc/zabbix/exporter/etcd/etcd-stats.py

Lines changed: 0 additions & 1 deletion
@@ -18,7 +18,6 @@
 import argparse
 import ConfigParser
 from base64 import b16encode
-from optparse import OptionParser
 from sys import exit, stderr

 stats_cache_file_tmpl = '/tmp/zbx_etcd_stats_{type}_{url}.txt'
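
Note on `stats_cache_file_tmpl`: the etcd scripts in this commit keep fetched payloads in a file under `/tmp` and reuse it while its modification time is recent enough, writing fresh data to a `.tmp` file and renaming it into place. The sketch below shows that pattern in isolation, under the assumption that the fetch is an arbitrary callable. The names `read_cached` and `fake_fetch` and the demo path are hypothetical and do not appear in the commit.

```python
import os
import tempfile
import time


def read_cached(cache_file, max_age, fetch):
    """Return cached text if it is younger than max_age seconds, else refetch."""
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, 'r') as handle:
                return handle.read()

    fresh = fetch()
    tmp_path = cache_file + '.tmp'
    with open(tmp_path, 'w') as handle:
        handle.write(fresh)
    # Rename so readers never see a half-written cache file
    # (atomic on POSIX when both paths are on the same filesystem).
    os.rename(tmp_path, cache_file)
    return fresh


calls = []


def fake_fetch():
    """Stand-in for the HTTPS request to the etcd metrics endpoint."""
    calls.append(1)
    return 'etcd_server_has_leader 1\n'


cache_path = os.path.join(tempfile.mkdtemp(), 'zbx_etcd_stats_demo.txt')
first = read_cached(cache_path, 60, fake_fetch)
second = read_cached(cache_path, 60, fake_fetch)
print(len(calls))
```

The second call is served from the cache file, so the fetch callable runs only once within the 60-second window.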

etc/zabbix/exporter/kubernetes/__init__.py

Whitespace-only changes.
