The current 10 ms upper bound on GoFr's SQL query histogram buckets works well for ultra-fast query visibility (apps serving millions of requests per minute), but it compresses every slower query into the same top bucket, which makes real-world latency analysis difficult. In production workloads, queries of 100 ms to 5 s are normal for analytical or transactional databases.
⚙️ Why 10 ms was used initially
- It was designed to detect micro-latency regressions in high-throughput services (like internal GoFr benchmarks).
- Early use cases focused on microservices hitting in-memory or well-indexed SQL queries where anything >10 ms was an anomaly.
- It optimized for Prometheus cardinality — fewer histogram buckets = smaller metrics footprint.
❌ Why that’s limiting in real workloads
- Queries taking 20 ms, 200 ms, or 2 s all fall into the same bucket — you lose resolution.
- Makes it impossible to differentiate between “slightly slow” and “critical” queries.
- You can’t correlate query latency with API request latency effectively.
- Makes alerting thresholds (like P95 or P99 query latency) misleadingly small.
✅ Recommended Fix
Adopt a wider histogram range with logarithmic or percentile-style bucket spacing.
An example bucket set for SQL histograms:

```go
[]float64{
	0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
}
```

This covers:
- Microsecond-level detail for fast queries
- Smooth spread for mid-range queries
- Full visibility up to 60 seconds for extreme cases
🧩 Dynamic Adjustment (Could be an option)
Instead of hardcoding buckets, expose them via configuration, for example:

```go
app.Metrics.SetHistogramBuckets("sql_query_duration_seconds", []float64{
	0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60,
})
```

You could tie this to:
- Datasource type (MySQL, Postgres, BigQuery, etc.)
- Request timeout (e.g., if `REQUEST_TIMEOUT=30s`, the upper bound could be 30 s)
- Environment (dev → short, prod → wide)
📊 Impact
- Slightly higher Prometheus storage footprint (more buckets)
- Far more actionable insights:
- 95th percentile query time per DB
- Breakdown of fast vs slow queries
- Easier detection of query regressions after deployments
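Finer buckets directly improve those percentile estimates, because `histogram_quantile` locates the bucket containing the target rank and interpolates linearly inside it; the wider the bucket, the cruder the estimate. A simplified sketch of that interpolation (the function name and input shape are illustrative, not Prometheus's actual code):

```go
package main

import "fmt"

// quantile estimates a quantile from per-bucket cumulative counts the way
// histogram_quantile does: find the first bucket whose cumulative count
// reaches the target rank, then interpolate linearly within that bucket.
// le holds bucket upper bounds in seconds; cum holds cumulative counts.
func quantile(q float64, le []float64, cum []float64) float64 {
	rank := q * cum[len(cum)-1]
	for i, c := range cum {
		if c >= rank {
			lower, prev := 0.0, 0.0
			if i > 0 {
				lower, prev = le[i-1], cum[i-1]
			}
			return lower + (le[i]-lower)*(rank-prev)/(c-prev)
		}
	}
	return le[len(le)-1]
}

func main() {
	le := []float64{0.01, 0.05, 0.1, 0.5, 1}   // bucket bounds in seconds
	cum := []float64{50, 80, 90, 99, 100}      // cumulative observations
	// p95 falls in the 0.1–0.5 s bucket, so the estimate interpolates there.
	fmt.Printf("estimated p95 ≈ %.3fs\n", quantile(0.95, le, cum)) // prints estimated p95 ≈ 0.322s
}
```

With only a 10 ms top bucket, everything slower would sit in the overflow bucket and the same calculation could do no better than reporting the top bound.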