Skip to content

Conversation

@celian-garcia
Copy link
Member

@celian-garcia celian-garcia commented Sep 29, 2025

Description

This PR revives the previously non-functional #40344, aiming to enhance performance by running metrics collection across Azure subscriptions in parallel. This significantly reduces latency when dealing with multiple subscriptions.

Compared to the original PR, this version addresses a critical concurrency issue:
fatal error: concurrent map read and map write
Reference: https://victoriametrics.com/blog/go-sync-map/

To resolve this, I evaluated multiple approaches and found that using the concurrent-map library yielded the best results. Details of the evaluation and benchmarks are documented in concurrency_bench_report.md.

Link to tracking issue

Fixes #39417

Testing

During testing, a race condition was discovered in the Azure SDK related to the fallback behavior of the Cloud option in the NewClient function:

// NewClient creates a client that accesses Azure Monitor metrics data.
// Client should be used for performing metrics queries on multiple monitored resources in the same region.
// A credential with authorization at the subscription level is required when using this client.
//
// endpoint - The regional endpoint to use, for example https://eastus.metrics.monitor.azure.com.
// The region should match the region of the requested resources. For global resources, the region should be 'global'.
func NewClient(endpoint string, credential azcore.TokenCredential, options *ClientOptions) (*Client, error) {
	if options == nil {
		options = &ClientOptions{}
	}
	if reflect.ValueOf(options.Cloud).IsZero() {
		options.Cloud = cloud.AzurePublic // <-- HERE
	}
	c, ok := options.Cloud.Services[ServiceName]
	if !ok || c.Audience == "" {
		return nil, errors.New("provided Cloud field is missing Azure Monitor Metrics configuration")
	}

To prevent this, our implementation explicitly sets the Cloud option in all cases, ensuring deterministic behavior and avoiding the race.

Documentation

A new markdown file (concurrency_bench_report.md) has been added to document:

  • The rationale behind choosing concurrent-map
  • Benchmark results comparing different implementations
  • Notes for future contributors who may want to explore alternative concurrency strategies

@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/fix-39417 branch 9 times, most recently from db21fd1 to f3d47c2 Compare October 2, 2025 09:55
@celian-garcia celian-garcia marked this pull request as ready for review October 2, 2025 11:37
@celian-garcia celian-garcia requested a review from a team as a code owner October 2, 2025 11:37
@celian-garcia celian-garcia requested a review from atoulme October 2, 2025 11:37
@celian-garcia
Copy link
Member Author

celian-garcia commented Oct 2, 2025

Sorry for this failing test. It's flaky and I thought I fixed it. I will have to check twice.
=> EDIT: fixed. See PR description

@celian-garcia celian-garcia marked this pull request as draft October 2, 2025 13:32
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/fix-39417 branch from 43a2807 to e4c4c7e Compare October 3, 2025 10:42
@celian-garcia celian-garcia marked this pull request as ready for review October 3, 2025 11:04
@atoulme
Copy link
Contributor

atoulme commented Oct 7, 2025

Please address the conflict.

… Batch API mode

Signed-off-by: Célian Garcia <celian.garcia@amadeus.com>
Signed-off-by: Celian GARCIA <celian.garcia@amadeus.com>
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/fix-39417 branch from ab513fa to 480dd4a Compare October 7, 2025 08:07
@celian-garcia
Copy link
Member Author

@atoulme, conflicts resolved 🙂

Co-authored-by: Moritz Wiesinger <moritz.wiesinger@dynatrace.com>
@atoulme atoulme merged commit 4f10361 into open-telemetry:main Oct 10, 2025
185 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 10, 2025
ChrsMark pushed a commit to ChrsMark/opentelemetry-collector-contrib that referenced this pull request Oct 20, 2025
… Batch API mode (open-telemetry#43047)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description

This PR revives the previously non-functional open-telemetry#40344, aiming to enhance
performance by running metrics collection across Azure subscriptions in
parallel. This significantly reduces latency when dealing with multiple
subscriptions.

Compared to the original PR, this version addresses a critical
concurrency issue:
```fatal error: concurrent map read and map write```
Reference: https://victoriametrics.com/blog/go-sync-map/

To resolve this, I evaluated multiple approaches and found that using the concurrent-map library yielded the best results. Details of the evaluation and benchmarks are documented in concurrency_bench_report.md.


<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue
Fixes open-telemetry#39417

<!--Describe what testing was performed and which tests were added.-->
#### Testing

During testing, a race condition was discovered in the Azure SDK related to the fallback behavior of the Cloud option in the NewClient function:
```go
// NewClient creates a client that accesses Azure Monitor metrics data.
// Client should be used for performing metrics queries on multiple
monitored resources in the same region.
// A credential with authorization at the subscription level is required
when using this client.
//
// endpoint - The regional endpoint to use, for example
https://eastus.metrics.monitor.azure.com.
// The region should match the region of the requested resources. For
global resources, the region should be 'global'.
func NewClient(endpoint string, credential azcore.TokenCredential,
options *ClientOptions) (*Client, error) {
	if options == nil {
		options = &ClientOptions{}
	}
	if reflect.ValueOf(options.Cloud).IsZero() {
		options.Cloud = cloud.AzurePublic // <-- HERE
	}
	c, ok := options.Cloud.Services[ServiceName]
	if !ok || c.Audience == "" {
return nil, errors.New("provided Cloud field is missing Azure Monitor
Metrics configuration")
	}

```
To prevent this, our implementation explicitly sets the Cloud option in all cases, ensuring deterministic behavior and avoiding the race.

<!--Describe the documentation added.-->
#### Documentation

A new markdown file (concurrency_bench_report.md) has been added to document:

- The rationale behind choosing concurrent-map
- Benchmark results comparing different implementations
- Notes for future contributors who may want to explore alternative concurrency strategies
<!--Please delete paragraphs that you did not use before submitting.-->

---------

Signed-off-by: Célian Garcia <celian.garcia@amadeus.com>
Signed-off-by: Celian GARCIA <celian.garcia@amadeus.com>
Co-authored-by: Moritz Wiesinger <moritz.wiesinger@dynatrace.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[receiver/azuremonitor] use_batch_api should parallelize the calls by subscription

6 participants