Conversation


@VihasMakwana VihasMakwana commented Oct 16, 2025

Background

The collector core marks telemetry data as read-only after it has been consumed by the fanout consumer.
Ref: https://github.com/open-telemetry/opentelemetry-collector/blob/58b71431b36f62a7797a02d661547584cf15222a/internal/fanoutconsumer/metrics.go#L67-L69

This ensures that exporters and other consumers cannot mutate the same data instance after it has been forwarded downstream.

However, in the prometheusremotewrite receiver we currently cache the entire ResourceMetrics object, which remains tied to the original pmetric.Metrics instance. When a subsequent request reuses this cached object and appends new samples, it mutates a read-only object and panics with `invalid access to shared data`.
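
To make the failure mode concrete, here is a minimal, self-contained sketch (not the receiver's code) of a cached pdata handle panicking after the core marks the batch read-only:

```go
package main

import "go.opentelemetry.io/collector/pdata/pmetric"

func main() {
	md := pmetric.NewMetrics()
	rm := md.ResourceMetrics().AppendEmpty() // imagine this handle is cached
	rm.Resource().Attributes().PutStr("service.name", "demo")

	// The collector core does this after the fanout consumer forwards md.
	md.MarkReadOnly()

	// Reusing the cached handle on a later request now panics:
	//   panic: invalid access to shared data
	rm.ScopeMetrics().AppendEmpty()
}
```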

Fix

Instead of caching the full ResourceMetrics object, we should:

  • Cache only the attributes (like job/instance info).
  • Group time series by job/instance within each request and create fresh ResourceMetrics objects per request.

Hence, there should be two caches:

  • An attributes cache:
    • Long-lived; stores resource attributes keyed by job/instance.
  • A resource-metrics cache:
    • Exists only for the duration of a single request.
    • Time series are grouped by job/instance within the request and appended to the same per-request resource (see the sketch below).
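
A minimal sketch of this two-cache layout, with illustrative names (attrCache, resourceFor); the real receiver may use an LRU for the long-lived cache:

```go
package prwreceiver // hypothetical package name

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// attrCache is the long-lived cache: resource attributes keyed by a
// job/instance identity, detached from any pmetric.Metrics instance.
type attrCache map[string]pcommon.Map

// resourceFor returns the request-scoped ResourceMetrics for a target,
// creating a fresh one (and copying cached attributes into it) on first
// use. perReq lives only for the duration of a single request.
func resourceFor(md pmetric.Metrics, perReq map[string]pmetric.ResourceMetrics,
	cache attrCache, target string) pmetric.ResourceMetrics {
	if rm, ok := perReq[target]; ok {
		return rm
	}
	rm := md.ResourceMetrics().AppendEmpty()
	if attrs, ok := cache[target]; ok {
		attrs.CopyTo(rm.Resource().Attributes()) // copy, never share handles
	}
	perReq[target] = rm
	return rm
}
```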

If the target_info metric and the normal metrics arrive in different requests, there are two cases:

  1. If target_info arrives first:
    • We cache the associated labels in a detached pcommon.Map and use it to populate the resource attributes of subsequent telemetry (see the sketch after this list).
  2. If normal metrics arrive first:
    • This case is trickier because we can't predict when (or whether) the corresponding target_info will arrive.
    • The safest approach is to forward the telemetry immediately without resource attributes and clearly document this behavior.
    • As a future improvement, we could explore a decoupled architecture where telemetry is briefly buffered before being sent, giving the target_info data time to arrive and populate the resource attributes.
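
A sketch of case 1, building on the attrCache above; cacheTargetInfo and its label filtering are hypothetical, not the receiver's exact rules:

```go
// cacheTargetInfo stores target_info labels as a detached pcommon.Map so
// later requests can populate resource attributes from it.
func cacheTargetInfo(cache attrCache, target string, labels map[string]string) {
	attrs := pcommon.NewMap()
	for k, v := range labels {
		// job/instance identify the target and are mapped to service.*
		// resource attributes separately, per the Prometheus-to-OTLP spec.
		if k == "__name__" || k == "job" || k == "instance" {
			continue
		}
		attrs.PutStr(k, v)
	}
	cache[target] = attrs
}
```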

How to Reproduce

Refer to #41347 (comment).

Fixes #41347

@VihasMakwana VihasMakwana changed the title [receiver/prometheusremotewrite] - Fix the caching [receiver/prometheusremotewrite] - Fix panic caused by mutating read-only ResourceMetrics Oct 23, 2025
@VihasMakwana VihasMakwana marked this pull request as ready for review October 23, 2025 06:16
@VihasMakwana VihasMakwana requested review from a team, ArthurSens and dashpole as code owners October 23, 2025 06:16
@github-actions github-actions bot requested a review from perebaj October 23, 2025 06:17
w.WriteHeader(http.StatusNoContent)

obsrecvCtx := prw.obsrecv.StartMetricsOp(req.Context())
if m.MetricCount() == 0 {
@VihasMakwana (author) commented:

There's no need to send empty telemetry IMO. Let me know if any of the codeowners feel otherwise.
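
For context, a plausible shape for the early return under discussion (the body, format string, and point count are assumptions, not the PR's actual code):

```go
if m.MetricCount() == 0 {
	// Nothing to forward: record the empty op and return instead of
	// pushing an empty pmetric.Metrics downstream.
	prw.obsrecv.EndMetricsOp(obsrecvCtx, "prometheusremotewrite", 0, nil)
	return
}
```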


perebaj commented Oct 23, 2025

Hey @VihasMakwana, thanks for the effort here!

We have been discussing the best way to solve this issue at our bi-weekly SIG meeting. I think we have finally found a solution that doesn't need to deal with mutexes and changes the minimum amount of code!

This version can be found here.

I just need to discuss it with Sens and finish some local tests. Then we can merge it!

@VihasMakwana (author)

@perebaj Hi! Thank you for mentioning your PR. I'll review it, but it looks good to go.
I wasn't aware of it when I created mine 😅
