
Increase SQL Query Histogram Range Beyond 10ms for Latency Tracking #2467

@aryanmehrotra

Description


The current 10 ms upper bound on GoFr’s SQL query histogram buckets is great for ultra-fast query visibility (for apps serving millions of requests per minute), but it compresses all slower queries into the same top bucket, which makes real-world latency analysis difficult, especially in production workloads where 100 ms–5 s queries are normal for analytical or transactional DBs.


⚙️ Why 10 ms was used initially

  • It was designed to detect micro-latency regressions in high-throughput services (like internal GoFr benchmarks).
  • Early use cases focused on microservices hitting in-memory or well-indexed SQL queries where anything >10 ms was an anomaly.
  • It optimized for Prometheus cardinality — fewer histogram buckets = smaller metrics footprint.

Why that’s limiting in real workloads

  • Queries taking 20 ms, 200 ms, or 2 s all fall into the same top bucket, so all resolution above 10 ms is lost (see the sketch after this list).
  • It becomes impossible to differentiate between “slightly slow” and “critical” queries.
  • Query latency can’t be correlated effectively with API request latency.
  • Alerting thresholds based on P95 or P99 query latency become misleadingly low, because quantiles estimated from the histogram are capped at the top bucket.
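
To make the resolution loss concrete, here is a minimal sketch (the exact bucket boundaries below are illustrative; only the 0.01 s cap matters) showing that 20 ms, 200 ms, and 2 s observations all land in the same implicit +Inf bucket:

package main

import (
    "fmt"
    "sort"
)

// Illustrative 10 ms-capped bucket boundaries, in seconds (GoFr’s exact current
// boundaries may differ; only the 0.01 s upper bound matters for this point).
var cappedBuckets = []float64{0.001, 0.0025, 0.005, 0.0075, 0.01}

// bucketIndex returns the index of the first bucket whose upper bound (le)
// covers the observation; len(buckets) means the implicit +Inf bucket.
func bucketIndex(buckets []float64, seconds float64) int {
    return sort.SearchFloat64s(buckets, seconds)
}

func main() {
    for _, d := range []float64{0.02, 0.2, 2.0} {
        fmt.Printf("%5.2fs -> bucket index %d of %d (+Inf)\n",
            d, bucketIndex(cappedBuckets, d), len(cappedBuckets))
    }
    // All three observations fall in the +Inf bucket, so quantiles computed
    // from the histogram cannot tell a 20 ms query from a 2 s one.
}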

Recommended Fix

Adopt a wider histogram range with roughly logarithmic bucket spacing, so each order of magnitude of latency keeps useful resolution.

Example ideal bucket set for SQL histograms:

[]float64{
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
}

This covers:

  • Millisecond-level detail for fast queries (down to 1 ms)
  • Smooth spread for mid-range queries
  • Full visibility up to 60 seconds for extreme cases
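
For reference, a hedged sketch of how a bucket set like this could be registered as a histogram using prometheus/client_golang directly; the metric and label names here are illustrative, and GoFr’s internal metric registration may differ:

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// sqlQueryDuration uses the wider bucket set; metric and label names are illustrative only.
var sqlQueryDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "sql_query_duration_seconds",
        Help: "SQL query duration in seconds.",
        Buckets: []float64{
            0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
        },
    },
    []string{"database", "operation"},
)

func init() {
    prometheus.MustRegister(sqlQueryDuration)
}

// ObserveQuery records a query duration, e.g. from a wrapper around db.QueryContext.
func ObserveQuery(database, operation string, start time.Time) {
    sqlQueryDuration.WithLabelValues(database, operation).Observe(time.Since(start).Seconds())
}

A call site would capture start := time.Now() before running the query and call ObserveQuery("orders_db", "SELECT", start) once it returns.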

🧩 Dynamic Adjustment (optional)

Instead of hardcoding buckets, expose them via configuration — for example:

app.Metrics.SetHistogramBuckets("sql_query_duration_seconds", []float64{
    0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60,
})

You could tie this to:

  • Datasource type (MySQL, Postgres, BigQuery, etc.)
  • Request timeout (e.g., if REQUEST_TIMEOUT=30s, the upper bound could be 30 s; see the sketch after this list)
  • Environment (dev → short, prod → wide)
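
A minimal sketch of the timeout-driven option, assuming REQUEST_TIMEOUT is read from the environment as a Go duration string (GoFr’s own config parsing may differ); bucketsForTimeout is a hypothetical helper, not an existing GoFr API:

package main

import (
    "fmt"
    "os"
    "time"
)

// Wide default bucket set from the configuration example above, in seconds.
var defaultSQLBuckets = []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60}

// bucketsForTimeout drops buckets above the request timeout so the histogram's
// upper bound never exceeds what the application is willing to wait for.
func bucketsForTimeout(timeout time.Duration) []float64 {
    limit := timeout.Seconds()
    out := make([]float64, 0, len(defaultSQLBuckets))
    for _, b := range defaultSQLBuckets {
        if b <= limit {
            out = append(out, b)
        }
    }
    return out
}

func main() {
    // REQUEST_TIMEOUT is parsed here as a Go duration string (assumption only).
    timeout := 30 * time.Second // fallback when REQUEST_TIMEOUT is unset or invalid
    if v := os.Getenv("REQUEST_TIMEOUT"); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            timeout = d
        }
    }
    fmt.Println(bucketsForTimeout(timeout)) // with REQUEST_TIMEOUT=30s: caps at 30
}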

📊 Impact

  • Slightly higher Prometheus storage footprint (more buckets)

  • Far more actionable insights:

    • 95th percentile query time per DB
    • Breakdown of fast vs slow queries
    • Easier detection of query regressions after deployments
