Thanks, @tmenjo!
@whynowy Thank you for your comment and for creating the issues! On reflection, I agree it would be nice to reuse the existing metrics. In my opinion, a good metric set has at least the following characteristics:
Based on this, I made two figures that describe what each metric means and the relationships between the metrics. One is for the stream map mode, and the other is for the batch and unary modes. "Before" shows the current status and "After" shows my proposal.
I'm interested in performance measurement of Numaflow. I read numaflow.numaproj.io and the Go code for the latency metrics, and found that some metrics seem to lack design documentation. I also found metrics whose variables are defined but whose measurement code is currently not implemented, perhaps having been removed at some point in the history.
Before making patches, I'd like to discuss the design of the metrics here, especially the latency metrics in the Map UDF Vertex. I propose the following three points of discussion. Could you please let me know your opinion?
Point 1: exposing forwarder_udf_processing_time regardless of map mode, or not
Currently, forwarder_udf_processing_time (UDFProcessingTime) is exposed only in the Map UDF's stream mode, not in the batch or unary modes.
numaflow/pkg/udf/forward/forward.go, line 400 (commit 6962837)
numaflow/pkg/udf/forward/forward.go, lines 431 to 479 (commit 6962837)
I'd say it should be exposed in all three modes; see the sketch below.
Incidentally, forwarder_udf_processing_time is not exposed in the Reduce UDF either, but I think that should be discussed separately because the concurrent processing inside the Reduce UDF looks quite different from that of the Map UDF.
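To make the proposal concrete, here is a minimal sketch (not Numaflow's actual code) of how the batch and unary paths could observe the same histogram that stream mode already observes. The udfProcessingTime variable, the label names, the Message type, and the applyUDFWithMetrics helper are all hypothetical names chosen for illustration; the point is simply that the timer wraps only the UDF invocation.

```go
// Sketch only, not Numaflow's implementation: udfProcessingTime, Message and
// applyUDFWithMetrics are hypothetical names used for illustration.
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Message is a stand-in for Numaflow's ISB message type.
type Message struct{ Payload []byte }

// udfProcessingTime mirrors the idea of forwarder_udf_processing_time.
var udfProcessingTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "forwarder",
	Name:      "udf_processing_time",
	Help:      "Processing time of the UDF call (microseconds)",
}, []string{"vertex", "pipeline"})

func init() {
	prometheus.MustRegister(udfProcessingTime)
}

// applyUDFWithMetrics wraps a batch/unary UDF call so the histogram is
// observed in those modes too, just as stream mode already observes it.
func applyUDFWithMetrics(vertex, pipeline string, msgs []Message,
	applyUDF func([]Message) ([]Message, error)) ([]Message, error) {
	start := time.Now()
	results, err := applyUDF(msgs)
	// Observe even on error so failed invocations remain visible.
	udfProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return results, err
}
```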
Point 2: unifying the meaning of forwarder_udf_processing_time, or not
In the Map UDF's stream mode, forwarder_udf_processing_time includes the latency of applying the UDF AND writing to the downstream ISBs. This is probably inevitable given the mode's streaming concurrency, but it is counterintuitive with respect to the metric name.
numaflow/pkg/udf/forward/forward.go, line 451 (commit 6962837)
(writes to all ISBs, included in lines 431 to 479 above)
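The sketch below illustrates why the stream-mode number ends up including write latency; it reuses the hypothetical udfProcessingTime and Message names from the Point 1 sketch (plus the standard "context" package), and streamFn and writeToISBs are stand-ins, not real Numaflow APIs.

```go
// Sketch only: streamFn and writeToISBs are hypothetical stand-ins for the
// streaming map call and the ISB writer.
func applyStreamUDF(ctx context.Context, vertex, pipeline string, msg Message,
	streamFn func(context.Context, Message) (<-chan Message, error),
	writeToISBs func(context.Context, Message) error) error {
	start := time.Now()
	out, err := streamFn(ctx, msg) // responses arrive incrementally
	if err != nil {
		return err
	}
	for resp := range out {
		// Each response is forwarded as soon as it arrives, so UDF time and
		// ISB write time are interleaved; a single timer around this loop
		// cannot exclude the write latency.
		if err := writeToISBs(ctx, resp); err != nil {
			return err
		}
	}
	udfProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return nil
}
```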
In the batch or unary map mode, we could expose it as the duration from the start of sending requests to the RPC server until the end of receiving responses, which matches the intuitive reading of the metric name. But if we did the same in stream map mode, the streaming nature of that mode would be lost.
Which way should we go? Or are there any other options?
Point 3: per-partition or per-batch forwarder_write_processing_time
In the Map UDF, forwarder_write_processing_time measures the latency of each write to each partition, even though one batch may cause multiple writes to multiple partitions. However, every other latency metric in the Map UDF, such as forwarder_read_processing_time, is measured on a per-batch basis.
forwarder_write_processing_time (WriteProcessingTime) in the Map UDF:
numaflow/pkg/udf/forward/forward.go, lines 578 to 640 (commit 6962837), measured per partition
forwarder_read_processing_time (ReadProcessingTime) in the Map UDF:
numaflow/pkg/udf/forward/forward.go, lines 195 to 260 (commit 6962837), measured per batch
I'd say forwarder_write_processing_time should also be a per-batch metric, that is, a latency that covers all writes to all partitions.
Note that in the Source and Sink Vertices, forwarder_write_processing_time already appears to be exposed as a per-batch metric.
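As a rough illustration of the per-batch proposal (again with hypothetical names: writeProcessingTime mirrors forwarder_write_processing_time, writeToPartition stands in for the per-partition ISB write, and Message comes from the Point 1 sketch), a single observation would cover the whole loop over partitions instead of one observation per partition:

```go
// Sketch only: one observation per batch, covering the writes to every
// partition, consistent with how forwarder_read_processing_time is measured.
var writeProcessingTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "forwarder",
	Name:      "write_processing_time",
	Help:      "Processing time of ISB writes per batch (microseconds)",
}, []string{"vertex", "pipeline"})

func writeBatch(vertex, pipeline string, perPartition map[string][]Message,
	writeToPartition func(partition string, msgs []Message) error) error {
	start := time.Now() // started once per batch, not once per partition
	for partition, msgs := range perPartition {
		if err := writeToPartition(partition, msgs); err != nil {
			return err
		}
	}
	writeProcessingTime.WithLabelValues(vertex, pipeline).
		Observe(float64(time.Since(start).Microseconds()))
	return nil
}
```

Whether the partition writes run sequentially or concurrently, the per-batch duration would then line up with the per-batch read metric.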