Commit ff0e925

[Metric] Fix prometheus metric backend (#3124)
1 parent cd22c4c commit ff0e925

7 files changed, +215 -34 lines changed

docs/source/development/metrics.rst

Lines changed: 89 additions & 2 deletions
@@ -59,8 +59,95 @@ Mars metrics support three different backends:
 * ``prometheus`` is an open-source systems monitoring and alerting toolkit.
 * ``ray`` is a metric backend which just runs on ray engine.
 
-We can choose a metric backend by configuring ``metrics.backend`` in
-``mars/deploy/oscar/base_config.yml`` or its descendant files.
+Console
+````````````````
+
+The default metric backend is ``console``. It just logs the value when log level
+is ``debug``.
+
+Prometheus
+````````````````
+
+Firstly, we should download Prometheus. For details, please refer to
+`Prometheus Getting Started
+<https://prometheus.io/docs/prometheus/latest/getting_started/>`_.
+
+Secondly, we can new a Mars session by configuring Prometheus backend as follows:
+
+.. code-block:: python
+
+    In [1]: import mars
+
+    In [2]: session = mars.new_session(
+       ...:     n_worker=1,
+       ...:     n_cpu=2,
+       ...:     web=True,
+       ...:     config={"metrics.backend": "prometheus"}
+       ...: )
+    Finished startup prometheus http server and port is 15768
+    Finished startup prometheus http server and port is 44303
+    Finished startup prometheus http server and port is 63391
+    Finished startup prometheus http server and port is 13722
+    Web service started at http://0.0.0.0:15518
+
+Thirdly, we should config Prometheus, more configurations please refer to
+`Prometheus Configuration
+<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_.
+
+.. code-block:: yaml
+
+    scrape_configs:
+      - job_name: 'mars'
+
+        scrape_interval: 5s
+
+        static_configs:
+          - targets: ['localhost:15768', 'localhost:44303', 'localhost:63391', 'localhost:13722']
+
+
+Then start Prometheus:
+
+.. code-block:: shell
+
+    $ prometheus --config.file=promconfig.yaml
+    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:296 msg="no time or size retention was set so using the default time retention" duration=15d
+    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=non-git, revision=non-git)"
+    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:333 build_context="(go=go1.13.1, user=brew@Mojave.local, date=20191018-01:13:04)"
+    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:334 host_details=(darwin)
+    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:335 fd_limits="(soft=256, hard=unlimited)"
+    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:336 vm_limits="(soft=unlimited, hard=unlimited)"
+    level=info ts=2022-06-07T13:05:01.487Z caller=main.go:657 msg="Starting TSDB ..."
+    level=info ts=2022-06-07T13:05:01.488Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090
+    level=info ts=2022-06-07T13:05:01.494Z caller=head.go:514 component=tsdb msg="replaying WAL, this may take awhile"
+    level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
+    level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
+    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:672 fs_type=1a
+    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:673 msg="TSDB started"
+    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:743 msg="Loading configuration file" filename=promconfig_mars.yaml
+    level=info ts=2022-06-07T13:05:01.501Z caller=main.go:771 msg="Completed loading of configuration file" filename=promconfig_mars.yaml
+    level=info ts=2022-06-07T13:05:01.501Z caller=main.go:626 msg="Server is ready to receive web requests."
+
+Fourthly, run a Mars task:
+
+.. code-block:: python
+
+    In [3]: import numpy as np
+
+    In [4]: import mars.dataframe as md
+
+    In [5]: df1 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
+       ...:                    columns=list('ABCD'), chunk_size=5)
+       ...: df2 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
+       ...:                    columns=list('ABCD'), chunk_size=5)
+       ...:
+       ...: r = md.merge(df1, df2, on='A').execute()
+
+Finally, we can check metrics in Prometheus web http://localhost:9090.
+
+Ray
+````````````````
+
+We could config ``metrics.backend`` when creating a Ray cluster or new a session.
 
 Metrics Naming Convention
 ------------------
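
Reviewer note: the documented workflow asks the reader to copy each logged port into the ``targets`` list by hand. A minimal sketch of automating that step is given here, assuming the startup output has been captured as text; the helper name `scrape_targets` and the `host` default are illustrative and not part of Mars.

```python
import re
from typing import List

# Message format taken from the session output shown in the docs change above.
PORT_LOG_RE = re.compile(
    r"Finished startup prometheus http server and port is (\d+)"
)


def scrape_targets(log_text: str, host: str = "localhost") -> List[str]:
    """Return "host:port" strings for every endpoint announced in the log."""
    return [f"{host}:{port}" for port in PORT_LOG_RE.findall(log_text)]


startup_log = """\
Finished startup prometheus http server and port is 15768
Finished startup prometheus http server and port is 44303
Finished startup prometheus http server and port is 63391
Finished startup prometheus http server and port is 13722
Web service started at http://0.0.0.0:15518
"""

# Prints the same four targets that the docs list under static_configs.
print(scrape_targets(startup_log))
```

Because the ports are picked at startup, the list has to be regenerated whenever the session is recreated, unless a fixed ``port`` is supplied in the metrics config.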

docs/source/locale/zh_CN/LC_MESSAGES/development/metrics.po

Lines changed: 75 additions & 21 deletions
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: mars 0.9.0rc2+18.g21929ced5\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2022-04-24 12:19+0800\n"
+"POT-Creation-Date: 2022-06-08 14:41+0800\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language-Team: LANGUAGE <LL@li.org>\n"
@@ -53,8 +53,8 @@ msgstr "``Meter`` 是一组事件发生的速率。 我们可以将其用作 qps
 
 #: ../../source/development/metrics.rst:16
 msgid ""
-"``Histogram`` is a type of statistics which records the average value of"
-" a window data."
+"``Histogram`` is a type of statistics which records the average value of "
+"a window data."
 msgstr "``Histogram`` 是一种统计类型,它记录窗口数据的平均值。"
 
 #: ../../source/development/metrics.rst:18
@@ -66,8 +66,9 @@ msgid ""
 "**Note**: If ``tag_keys`` is declared, ``tags`` must be specified when "
 "invoking ``record`` method and tags' keys must be consistent with "
 "``tag_keys``."
-msgstr "**注意**:如果声明了 ``tag_keys``,调用 ``record`` 方法时必须指定 ``tags`` "
-"参数,并且 ``tags`` 的 keys 必须跟 ``tag_keys`` 保持一致。"
+msgstr ""
+"**注意**:如果声明了 ``tag_keys``,调用 ``record`` 方法时必须指定 ``tags`` 参数,并且 ``tags`` 的"
+" keys 必须跟 ``tag_keys`` 保持一致。"
 
 #: ../../source/development/metrics.rst:54
 msgid "Three different Backends"
@@ -89,40 +90,93 @@ msgstr "``prometheus`` 一个开源系统监控和报警工具包。"
 msgid "``ray`` is a metric backend which just runs on ray engine."
 msgstr "``ray`` 是一种运行在 ray 引擎上的 metric 后端。"
 
-#: ../../source/development/metrics.rst:62
+#: ../../source/development/metrics.rst:63
+msgid "Console"
+msgstr ""
+
+#: ../../source/development/metrics.rst:65
+msgid ""
+"The default metric backend is ``console``. It just logs the value when "
+"log level is ``debug``."
+msgstr "默认的 metric 后端是 ``console``. 它只是在日志级别为 ``debug`` 时打印出 metric 的值。"
+
+#: ../../source/development/metrics.rst:69
+msgid "Prometheus"
+msgstr ""
+
+#: ../../source/development/metrics.rst:71
+msgid ""
+"Firstly, we should download Prometheus. For details, please refer to "
+"`Prometheus Getting Started "
+"<https://prometheus.io/docs/prometheus/latest/getting_started/>`_."
+msgstr ""
+"首先,我们需要下载 Prometheus。具体的可以参考 `Prometheus Getting Started "
+"<https://prometheus.io/docs/prometheus/latest/getting_started/>`_."
+
+#: ../../source/development/metrics.rst:75
 msgid ""
-"We can choose a metric backend by configuring ``metrics.backend`` in "
-"``mars/deploy/oscar/base_config.yml`` or its descendant files."
-msgstr "我们可以通过配置 ``mars/deploy/oscar/base_config.yml`` 或它的继承文件中的 "
-"``metrics.backend`` 来选择一种 metric 后端。"
+"Secondly, we can new a Mars session by configuring Prometheus backend as "
+"follows:"
+msgstr "其次,我们可以如下配置 Prometheus 后端来启动一个 Mars session:"
 
-#: ../../source/development/metrics.rst:66
+#: ../../source/development/metrics.rst:93
+msgid ""
+"Thirdly, we should config Prometheus, more configurations please refer to"
+" `Prometheus Configuration "
+"<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_."
+msgstr ""
+"第三,我们要配置 Prometheus,更多的配置可以参考 `Prometheus Configuration "
+"<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_."
+
+#: ../../source/development/metrics.rst:108
+msgid "Then start Prometheus:"
+msgstr "接着,启动 Prometheus:"
+
+#: ../../source/development/metrics.rst:130
+msgid "Fourthly, run a Mars task:"
+msgstr "第四,执行一个 Mars task:"
+
+#: ../../source/development/metrics.rst:145
+msgid "Finally, we can check metrics in Prometheus web http://localhost:9090."
+msgstr "最后,我们可以在 Prometheus 的网页端 http://localhost:9090 查看 metrics。"
+
+#: ../../source/development/metrics.rst:148
+msgid "Ray"
+msgstr ""
+
+#: ../../source/development/metrics.rst:150
+msgid ""
+"We could config ``metrics.backend`` when creating a Ray cluster or new a "
+"session."
+msgstr "我们可以在创建 Ray cluster 时或新建 session 时配置 ``metrics.backend``。"
+
+#: ../../source/development/metrics.rst:153
 msgid "Metrics Naming Convention"
 msgstr "Metrics 命名约定"
 
-#: ../../source/development/metrics.rst:68
+#: ../../source/development/metrics.rst:155
 msgid "We propose a naming convention for metrics as follows:"
 msgstr "我们提出一种如下的 metrics 命名约定:"
 
-#: ../../source/development/metrics.rst:70
+#: ../../source/development/metrics.rst:157
 msgid "``namespace.[component].metric_name[_units]``"
 msgstr ""
 
-#: ../../source/development/metrics.rst:72
+#: ../../source/development/metrics.rst:159
 msgid "``namespace`` could be ``mars``."
 msgstr "``namespace`` 可以是 ``mars``。"
 
-#: ../../source/development/metrics.rst:73
-msgid "``component`` could be `supervisor`, `worker` or `band` etc, and can be "
+#: ../../source/development/metrics.rst:160
+msgid ""
+"``component`` could be `supervisor`, `worker` or `band` etc, and can be "
 "omitted."
 msgstr "``component`` 可以是 `supervisor`,`worker` 或 `band` 等等,也可以省略这个参数。"
 
-#: ../../source/development/metrics.rst:74
+#: ../../source/development/metrics.rst:161
 msgid ""
 "``units`` is the metric unit which may be seconds when recording time, or"
 " ``_count`` when metric type is ``Counter``, ``_number`` when metric type"
 " is ``Gauge`` if there is no suitable unit."
-msgstr "``units`` 是 metric 的单位,当记录的是时间时,可以用 seconds,当没有合适的单位"
-"时,``Counter`` 类型的 metric 可以用 ``_count``,``Gauge`` 类型的 metric 可以用 "
-"``_number``。"
-
+msgstr ""
+"``units`` 是 metric 的单位,当记录的是时间时,可以用 seconds,当没有合适的单位时,``Counter`` 类型的 "
+"metric 可以用 ``_count``,``Gauge`` 类型的 metric 可以用 ``_number``。"

mars/metrics/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -13,5 +13,5 @@
 # limitations under the License.
 
 from .api import Metrics
-from .api import init_metrics
+from .api import init_metrics, shutdown_metrics
 from .api import record_time_cost_percentile, Percentile

mars/metrics/api.py

Lines changed: 18 additions & 1 deletion
@@ -26,6 +26,7 @@
 
 logger = logging.getLogger(__name__)
 
+_init = False
 _metric_backend = "console"
 _backends_cls = {
     "console": console_metric,
@@ -35,6 +36,10 @@
 
 
 def init_metrics(backend="console", config: Dict[str, Any] = None):
+    global _init
+    if _init is True:
+        return
+
     backend = backend or "console"
     if backend not in _backends_cls:
         raise NotImplementedError(f"Do not support metric backend {backend}")
@@ -43,17 +48,29 @@ def init_metrics(backend="console", config: Dict[str, Any] = None):
     if _metric_backend == "prometheus":
         try:
             from prometheus_client import start_http_server
+            from ..utils import get_next_port
 
             port = config.get("port", 0) if config else 0
+            port = port or get_next_port()
             start_http_server(port)
-            logger.info("Finished startup prometheus http server and port is %d", port)
+            logger.warning(
+                "Finished startup prometheus http server and port is %d", port
+            )
         except ImportError:
             logger.warning(
                 "Failed to start prometheus http server because there is no prometheus_client"
             )
+    _init = True
     logger.info("Finished initialize the metrics with backend %s", _metric_backend)
 
 
+def shutdown_metrics():
+    global _metric_backend
+    _metric_backend = "console"
+    global _init
+    _init = False
+
+
 class Metrics:
     """
     A factory to generate different types of metrics.
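
Reviewer note: the change to `init_metrics` combines two ideas, a module-level guard that makes repeated initialization a no-op and a fallback that resolves a concrete port when none is configured, so that the logged port is presumably the one actually bound. The sketch below illustrates that pattern with the standard library only. `pick_free_port`, `init_once` and `shutdown` are hypothetical stand-ins and are not the Mars functions; in particular `pick_free_port` only approximates what `get_next_port` is assumed to do.

```python
import socket
from typing import Any, Dict, Optional

# Hypothetical module-level guard, mirroring the role of `_init` in the diff.
_initialized = False


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port (assumed stand-in for get_next_port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def init_once(config: Optional[Dict[str, Any]] = None) -> Optional[int]:
    """Return the resolved port on the first call and None on repeated calls."""
    global _initialized
    if _initialized:
        return None
    # A configured port wins; 0 or a missing key falls back to a free port.
    port = (config or {}).get("port", 0) or pick_free_port()
    _initialized = True
    return port


def shutdown() -> None:
    """Reset the guard so a later init_once() call takes effect again."""
    global _initialized
    _initialized = False
```

One caveat of this sketch: the port is released before any server binds it, so another process could claim it in between. Whether the real `get_next_port` has the same window is not visible in this diff.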

mars/metrics/backends/prometheus/prometheus_metric.py

Lines changed: 20 additions & 5 deletions
@@ -12,6 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import os
+import socket
+
 from typing import Optional, Dict
 
 from ....utils import lazy_import
@@ -28,16 +31,28 @@
 
 class PrometheusMetricMixin(AbstractMetric):
     def _init(self):
-        self._metric = (
-            pc.Gauge(self._name, self._description, self._tag_keys) if pc else None
+        # Prometheus metric name must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*`
+        # `.` is a common character in metrics, so here replace it with `:`
+        self._name = self._name.replace(".", ":")
+        self._tag_keys = self._tag_keys + (
+            "host",
+            "pid",
         )
+        self._tags = {"host": socket.gethostname(), "pid": os.getpid()}
+        try:
+            self._metric = (
+                pc.Gauge(self._name, self._description, self._tag_keys) if pc else None
+            )
+        except ValueError:  # pragma: no cover
+            self._metric = None
 
     def _record(self, value=1, tags: Optional[Dict[str, str]] = None):
         if self._metric:
-            if tags:
-                self._metric.labels(**tags).set(value)
+            if tags is not None:
+                tags.update(self._tags)
             else:
-                self._metric.set(value)
+                tags = self._tags
+            self._metric.labels(**tags).set(value)
 
 
 class Counter(PrometheusMetricMixin, AbstractCounter):
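
Reviewer note: two behaviours in this file can be checked in isolation, the dot-to-colon renaming and the merging of per-process `host`/`pid` labels into the caller's tags. The sketch below is a standalone illustration and does not use `prometheus_client`. The helper names and the sample metric name `mars.band.slot_number` are hypothetical. Unlike `tags.update(self._tags)` in the diff, `merge_tags` copies the dict first, so the caller's dictionary is left unchanged.

```python
import os
import re
import socket
from typing import Dict, Optional

# Regex quoted in the diff's own comment for valid Prometheus metric names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def sanitize_name(name: str) -> str:
    """Replace dots with colons, as the commit does."""
    return name.replace(".", ":")


def default_tags() -> Dict[str, str]:
    """Per-process labels; the pid is stringified here for uniformity."""
    return {"host": socket.gethostname(), "pid": str(os.getpid())}


def merge_tags(
    tags: Optional[Dict[str, str]], defaults: Dict[str, str]
) -> Dict[str, str]:
    """Combine caller tags with defaults without modifying the caller's dict."""
    merged = dict(tags or {})
    merged.update(defaults)  # defaults win on key clashes, as in the diff
    return merged


name = sanitize_name("mars.band.slot_number")  # hypothetical metric name
assert METRIC_NAME_RE.fullmatch(name) is not None
labels = merge_tags({"service": "mars"}, default_tags())
```

The Prometheus naming guidance reserves colons for recording rules, so replacing `.` with `_` may be worth considering as an alternative; that is a design suggestion and not something this commit does.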

mars/metrics/backends/prometheus/tests/test_prometheus_metric.py

Lines changed: 5 additions & 4 deletions
@@ -54,7 +54,8 @@ def test_counter(start_prometheus_http_server):
     c = Counter("test_counter", "A test counter", ("service", "tenant"))
     assert c.name == "test_counter"
     assert c.description == "A test counter"
-    assert c.tag_keys == ("service", "tenant")
+    assert set(["host", "pid"]).issubset(set(c.tag_keys))
+    assert set(["service", "tenant"]).issubset(set(c.tag_keys))
     assert c.type == "counter"
     c.record(1, {"service": "mars", "tenant": "test"})
     verify_metric("test_counter", 1.0)
@@ -66,7 +67,7 @@ def test_gauge(start_prometheus_http_server):
     g = Gauge("test_gauge", "A test gauge")
     assert g.name == "test_gauge"
     assert g.description == "A test gauge"
-    assert g.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(g.tag_keys))
     assert g.type == "gauge"
     g.record(0.1)
     verify_metric("test_gauge", 0.1)
@@ -78,7 +79,7 @@ def test_meter(start_prometheus_http_server):
     m = Meter("test_meter")
     assert m.name == "test_meter"
     assert m.description == ""
-    assert m.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(m.tag_keys))
     assert m.type == "meter"
     num = 3
     while num > 0:
@@ -92,7 +93,7 @@ def test_histogram(start_prometheus_http_server):
     h = Histogram("test_histogram")
     assert h.name == "test_histogram"
     assert h.description == ""
-    assert h.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(h.tag_keys))
     assert h.type == "histogram"
     num = 3
     while num > 0:
