Thanks, @tmenjo!
@whynowy Thank you for your comment and for creating the issues! On reflection, I agree it would be nice to reuse the existing metrics. In my opinion, a good metric set has at least the following characteristics:
Based on this, I made two figures that describe what each metric means and the relationships between the metrics. One is for the stream map mode, and the other is for the batch and unary modes. "Before" shows the current status and "After" shows my proposal.
I'm interested in performance measurement of Numaflow. I read numaflow.numaproj.io and the Go code for the latency metrics, and found that some metrics seem to lack design documentation. I also found metrics whose variables are defined but whose measurement code is currently not implemented, perhaps having been removed at some point in the history.
Before making patches, I'd like to discuss the design of the metrics here, especially the latency metrics in the Map UDF Vertex. I propose the following three points of discussion. Could you please let me know your opinion?
Point 1: exposing forwarder_udf_processing_time regardless of map mode, or not
Currently, forwarder_udf_processing_time (UDFProcessingTime) is exposed only in the Map UDF's stream mode, not in the batch or unary modes.
numaflow/pkg/udf/forward/forward.go, line 400 (commit 6962837)
numaflow/pkg/udf/forward/forward.go, lines 431 to 479 (commit 6962837)
I'd say it should be exposed in all three modes; see the sketch below.
Incidentally, forwarder_udf_processing_time is not exposed in the Reduce UDF either, but I think that should be discussed separately because the concurrent processing inside the Reduce UDF looks quite different from that of the Map UDF.
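To make the proposal concrete, here is a minimal sketch (not Numaflow's actual code) of how the batch and unary paths could observe the same histogram that stream mode already observes. The udfProcessingTime variable, the label names, the Message type, and the applyUDFWithMetrics helper are all hypothetical names chosen for illustration; the point is simply that the timer wraps only the UDF invocation.

```go
// Sketch only, not Numaflow's implementation: udfProcessingTime, Message and
// applyUDFWithMetrics are hypothetical names used for illustration.
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Message is a stand-in for Numaflow's ISB message type.
type Message struct{ Payload []byte }

// udfProcessingTime mirrors the idea of forwarder_udf_processing_time.
var udfProcessingTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "forwarder",
	Name:      "udf_processing_time",
	Help:      "Processing time of the UDF call (microseconds)",
}, []string{"vertex", "pipeline"})

func init() {
	prometheus.MustRegister(udfProcessingTime)
}

// applyUDFWithMetrics wraps a batch/unary UDF call so the histogram is
// observed in those modes too, just as stream mode already observes it.
func applyUDFWithMetrics(vertex, pipeline string, msgs []Message,
	applyUDF func([]Message) ([]Message, error)) ([]Message, error) {
	start := time.Now()
	results, err := applyUDF(msgs)
	// Observe even on error so failed invocations remain visible.
	udfProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return results, err
}
```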
Point 2: unifying the meaning of forwarder_udf_processing_time, or not
In the Map UDF's stream mode, forwarder_udf_processing_time includes the latency of applying the UDF AND writing to the downstream ISBs. This is probably inevitable given the mode's streaming concurrency, but it is counterintuitive with respect to the metric name.
numaflow/pkg/udf/forward/forward.go, line 451 (commit 6962837)
(writes to all ISBs, included in lines 431 to 479 above)
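The sketch below illustrates why the stream-mode number ends up including write latency; it reuses the hypothetical udfProcessingTime and Message names from the Point 1 sketch (plus the standard "context" package), and streamFn and writeToISBs are stand-ins, not real Numaflow APIs.

```go
// Sketch only: streamFn and writeToISBs are hypothetical stand-ins for the
// streaming map call and the ISB writer.
func applyStreamUDF(ctx context.Context, vertex, pipeline string, msg Message,
	streamFn func(context.Context, Message) (<-chan Message, error),
	writeToISBs func(context.Context, Message) error) error {
	start := time.Now()
	out, err := streamFn(ctx, msg) // responses arrive incrementally
	if err != nil {
		return err
	}
	for resp := range out {
		// Each response is forwarded as soon as it arrives, so UDF time and
		// ISB write time are interleaved; a single timer around this loop
		// cannot exclude the write latency.
		if err := writeToISBs(ctx, resp); err != nil {
			return err
		}
	}
	udfProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return nil
}
```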
In the batch or unary map mode, we could expose it as the duration from the start of sending requests to the RPC server until the end of receiving responses, which matches the intuitive reading of the metric name. But if we did the same in stream map mode, the streaming nature of that mode would be lost.
Which way should we go? Or are there any other options?
Point 3: per-partition or per-batch forwarder_write_processing_time
In the Map UDF, forwarder_write_processing_time measures the latency of each write to each partition, even though one batch may cause multiple writes to multiple partitions. However, every other latency metric in the Map UDF, such as forwarder_read_processing_time, is measured on a per-batch basis.
forwarder_write_processing_time (WriteProcessingTime) in the Map UDF:
numaflow/pkg/udf/forward/forward.go, lines 578 to 640 (commit 6962837), measured per partition
forwarder_read_processing_time (ReadProcessingTime) in the Map UDF:
numaflow/pkg/udf/forward/forward.go, lines 195 to 260 (commit 6962837), measured per batch
I'd say forwarder_write_processing_time should also be a per-batch metric, that is, a latency that covers all writes to all partitions.
Note that in the Source and Sink Vertices, forwarder_write_processing_time already appears to be exposed as a per-batch metric.
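As a rough illustration of the per-batch proposal (again with hypothetical names: writeProcessingTime mirrors forwarder_write_processing_time, writeToPartition stands in for the per-partition ISB write, and Message comes from the Point 1 sketch), a single observation would cover the whole loop over partitions instead of one observation per partition:

```go
// Sketch only: one observation per batch, covering the writes to every
// partition, consistent with how forwarder_read_processing_time is measured.
var writeProcessingTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "forwarder",
	Name:      "write_processing_time",
	Help:      "Processing time of ISB writes per batch (microseconds)",
}, []string{"vertex", "pipeline"})

func writeBatch(vertex, pipeline string, perPartition map[string][]Message,
	writeToPartition func(partition string, msgs []Message) error) error {
	start := time.Now() // started once per batch, not once per partition
	for partition, msgs := range perPartition {
		if err := writeToPartition(partition, msgs); err != nil {
			return err
		}
	}
	writeProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return nil
}
```

Whether the partition writes run sequentially or concurrently, the per-batch duration would then line up with the per-batch read metric.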