---
# Valid statuses: draft | proposed | rejected | accepted | superseded
status: draft
author: Michael Beemer
created: 2025-09-10
updated: 2025-09-10
---

# High-Precision Fractional Bucketing for Sub-Percent Traffic Allocation

This ADR proposes enhancing the fractional operation to support high-precision traffic allocation down to 0.001% granularity by increasing the internal bucket count from 100 to 100,000 while maintaining the existing weight-based API.

## Background

The current fractional operation in flagd uses a 100-bucket system that maps hash values to percentages in the range [0, 100].
This approach works well for most use cases but has significant limitations in high-throughput environments where precise sub-percent traffic allocation is required.

Currently, the smallest allocation possible is 1%, which is insufficient for:

- Gradual rollouts in ultra-high-traffic systems where 1% could represent millions of users
- A/B testing scenarios requiring precise control over small experimental groups
- Canary deployments where operators need to start with very small traffic percentages (e.g., 0.1% or 0.01%)

The current implementation in `fractional.go` calculates bucket assignment using:

```go
bucket := hashRatio * 100 // in range [0, 100]
```

This limits granularity to 1% increments, so sub-percent targets such as 0.1% or 0.01% cannot be expressed at all.
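As a rough illustration of that limit, the sketch below computes the traffic share a rollout can actually receive when a requested percentage is rounded down to whole buckets. The `effectivePercent` helper is hypothetical and only models the rounding, not flagd's evaluation path.

```go
package main

import "math"

// effectivePercent returns the traffic share that n equally sized buckets can
// express for a requested percentage, rounding down to whole buckets.
// Hypothetical helper used only to illustrate bucket granularity.
func effectivePercent(requested float64, n int) float64 {
	buckets := math.Floor(requested * float64(n) / 100)
	return buckets * 100 / float64(n)
}
```

For a 0.5% canary, `effectivePercent(0.5, 100)` is 0 (nobody is selected), while `effectivePercent(0.5, 100000)` is 0.5. A 0.001% target rounds to 0 at 10,000 buckets and is only representable at 100,000.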

## Requirements

- Support traffic allocation precision down to 0.001% (3 decimal places)
- Maintain backwards compatibility with existing weight-based API
- Preserve deterministic bucketing behavior (same hash input always produces same bucket)
- Ensure consistent bucket assignment across different programming languages
- Support weight values up to a reasonable maximum that works across multiple languages
- Maintain current performance characteristics
- Prevent users from being moved between buckets when only distribution percentages change
- Guarantee that any variant with weight > 0 receives some traffic allocation
- Handle edge cases gracefully without silent failures
- Validate weight configurations and provide clear error messages for invalid inputs

## Considered Options

- **Option 1: 10,000 buckets (0.01% precision)** - 1 in every 10,000 users, better but still not sufficient for many high-throughput use cases
- **Option 2: 100,000 buckets (0.001% precision)** - 1 in every 100,000 users, meets most high-precision needs
- **Option 3: 1,000,000 buckets (0.0001% precision)** - 1 in every 1,000,000 users, likely overkill and could impact performance

## Proposal

Implement a 100,000-bucket system that provides 0.001% precision while maintaining the existing integer weight-based API.

### API changes

No API changes are required. The existing fractional operation syntax remains unchanged:

```json
"fractional": [
  { "cat": [{ "var": "$flagd.flagKey" }, { "var": "email" }] },
  ["red", 50],
  ["blue", 30],
  ["green", 20]
]
```
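Under this proposal, those same weights would be laid out over 100,000 buckets as contiguous ranges in configuration order. A minimal sketch of that layout, assuming a hypothetical `weightedVariant` type and ignoring the minimum-allocation and excess adjustments described in this ADR:

```go
package main

const bucketCount = 100000

// weightedVariant is a hypothetical stand-in for flagd's internal variant type.
type weightedVariant struct {
	name   string
	weight int
}

// bucketRange is a half-open interval [start, end) of bucket indices.
type bucketRange struct {
	name       string
	start, end int
}

// bucketRanges gives each variant, in configuration order, a contiguous range
// whose size is its proportional share of bucketCount (rounded down).
// Assumes non-negative weights.
func bucketRanges(variants []weightedVariant) []bucketRange {
	total := 0
	for _, v := range variants {
		total += v.weight
	}
	if total <= 0 {
		return nil
	}
	ranges := make([]bucketRange, 0, len(variants))
	start := 0
	for _, v := range variants {
		size := int(int64(v.weight) * bucketCount / int64(total))
		ranges = append(ranges, bucketRange{name: v.name, start: start, end: start + size})
		start += size
	}
	return ranges
}
```

With weights 50/30/20 this yields `red` for buckets [0, 50000), `blue` for [50000, 80000), and `green` for [80000, 100000), so a user hashed into bucket 79,999 gets `blue` and one in bucket 80,000 gets `green`.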

### Implementation Changes

1. **Bucket Count**: Change from 100 to 100,000 buckets by modifying bucket calculation from `hashRatio * 100` to `hashRatio * 100000`
2. **Minimum Allocation Guarantee**: Any variant with weight > 0 receives at least 1 bucket (0.001%)
3. **Excess Bucket Handling**: Remove excess buckets from the largest allocations first to maintain exactly 100,000 total buckets
4. **Weight Sum Validation**: Reject configurations whose total weight exceeds the maximum allowed weight sum
5. **Maximum Weight Sum**: Use language-specific maximum 32-bit signed integer constants for cross-platform compatibility

### Minimum Allocation Guarantee

To prevent silent configuration failures, any variant with a positive weight will receive at least 0.001% allocation (1 bucket), even if the calculated percentage would round to zero. This ensures predictable behavior where positive weights always result in some traffic allocation.

**Example**: Configuration `["variant-a", 1], ["variant-b", 1000000]`

- Without guarantee: variant-a gets 0% (never selected)
- With guarantee: variant-a gets 0.001%, variant-b gets 99.999%
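The guarantee amounts to "proportional share rounded down, floored at one bucket for any positive weight". The `allocate` function below is a hypothetical, simplified sketch of that step only; it does not trim any excess the floor may create.

```go
package main

const bucketCount = 100000

// allocate returns a bucket count per weight: the proportional share of
// bucketCount rounded down, but at least 1 bucket for any positive weight.
// Zero and negative weights receive no buckets.
func allocate(weights []int) []int {
	var total int64
	for _, w := range weights {
		if w > 0 {
			total += int64(w)
		}
	}
	out := make([]int, len(weights))
	if total == 0 {
		return out
	}
	for i, w := range weights {
		if w <= 0 {
			continue
		}
		share := int(int64(w) * bucketCount / total)
		if share < 1 {
			share = 1 // minimum allocation guarantee
		}
		out[i] = share
	}
	return out
}
```

For the example configuration, `allocate([]int{1, 1000000})` gives 1 and 99,999 buckets, i.e. 0.001% and 99.999%.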

### Excess Bucket Management

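When minimum allocations push the total above 100,000 buckets, the excess is removed sequentially from the variants with the largest allocations, starting with the largest, until exactly 100,000 buckets remain. No variant with a positive weight is reduced below 1 bucket.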
This approach:

- Maintains the minimum guarantee for small variants
- Has minimal impact on large variants (small relative reduction)
- Preserves deterministic behavior
- Prevents bucket count overflow
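Excess only appears when several minimum allocations are forced at once. For example, weights `1`, `1`, and `1000000` first yield 1, 1, and 99,999 buckets (100,001 in total), so one bucket has to come back from the largest variant. A minimal sketch of that trimming step, using a hypothetical `allocation` type and assuming every input allocation came from a positive weight:

```go
package main

import "sort"

const bucketCount = 100000

// allocation pairs a variant name with its bucket count (hypothetical type).
type allocation struct {
	variant string
	buckets int
}

// trimExcess removes buckets from the largest allocations first (ties broken
// by variant name) until the total equals bucketCount, never taking an
// allocation below 1 bucket.
func trimExcess(allocs []allocation) {
	total := 0
	for _, a := range allocs {
		total += a.buckets
	}
	excess := total - bucketCount
	if excess <= 0 {
		return
	}
	order := make([]int, len(allocs))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(i, j int) bool {
		a, b := allocs[order[i]], allocs[order[j]]
		if a.buckets == b.buckets {
			return a.variant < b.variant
		}
		return a.buckets > b.buckets
	})
	for _, idx := range order {
		if excess == 0 {
			break
		}
		take := allocs[idx].buckets - 1 // keep at least one bucket
		if take <= 0 {
			continue
		}
		if take > excess {
			take = excess
		}
		allocs[idx].buckets -= take
		excess -= take
	}
}
```

Applied to the example, the two small variants keep 1 bucket each and the large one drops from 99,999 to 99,998, a relative change of about 0.001%.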

### Weight Sum Validation

When the total weight sum exceeds the maximum allowed value (the largest 32-bit signed integer, 2,147,483,647), the fractional evaluation will return a validation error with a clear message.
This prevents integer overflow issues and provides immediate feedback to users about invalid configurations.

```go
import (
	"fmt"
	"math"
)

// validateWeightSum rejects configurations whose total weight exceeds math.MaxInt32.
// The running total is kept in an int64 so the check itself cannot overflow.
func validateWeightSum(variants []fractionalEvaluationVariant) error {
	var totalWeight int64
	for _, variant := range variants {
		totalWeight += int64(variant.weight)
		if totalWeight > math.MaxInt32 {
			return fmt.Errorf("total weight sum %d exceeds maximum allowed value %d",
				totalWeight, math.MaxInt32)
		}
	}
	return nil
}
```

Implementations should prefer built-in language constants (e.g., `math.MaxInt32` in Go, `Integer.MAX_VALUE` in Java, `int.MaxValue` in C#) rather than hardcoded values to ensure maintainability and clarity.

### Edge Case Handling

The implementation addresses several edge cases:

1. **All weights are 0**: Returns empty string (maintains current behavior)
2. **Negative weights**: Treated as 0 (maintains current validation behavior)
3. **Single variant**: Receives all 100,000 buckets regardless of weight value
4. **Empty variants**: Returns error (maintains current validation behavior)
5. **Weight sum overflow**: Returns validation error with clear message
6. **Multiple variants with minimum allocation**: Excess buckets are removed sequentially from the largest allocations, starting with the largest
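Rounding each proportional share down can also leave the total short of 100,000 buckets: weights `1` and `2` give 33,333 + 66,666 = 99,999. Any shortfall has to be assigned to some variant, otherwise the uncovered buckets match no variant and evaluation falls through to an empty result. A hypothetical helper that measures the shortfall before minimum allocations are applied:

```go
package main

const bucketCount = 100000

// roundingDeficit returns how many of the bucketCount buckets are left
// unassigned when each positive weight's proportional share is rounded down.
// Hypothetical helper; it ignores the one-bucket minimum guarantee.
func roundingDeficit(weights []int) int {
	var total int64
	for _, w := range weights {
		if w > 0 {
			total += int64(w)
		}
	}
	if total == 0 {
		return 0
	}
	assigned := 0
	for _, w := range weights {
		if w > 0 {
			assigned += int(int64(w) * bucketCount / total)
		}
	}
	return bucketCount - assigned
}
```

Evenly divisible weights such as 50/30/20 produce no deficit, while seven equal weights leave 5 buckets unassigned (7 × 14,285 = 99,995).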

### Maximum Weight Considerations

To ensure cross-language compatibility, we establish a maximum total weight sum equal to the maximum 32-bit signed integer value (2,147,483,647). This limit:

- Works reliably across all target languages (Go, Java, .NET, JavaScript, Python)
- Provides more than sufficient range for any practical use case
- Prevents integer overflow issues in 32-bit signed integer systems
- Allows for extremely fine-grained control (individual weights can be 1 out of 2+ billion)
- Uses language-native constants for better maintainability

### Code Changes

The following shows how the core logic in `fractional.go` would be modified.

```go
// Imports used by the code below; min and max are built-ins since Go 1.21.
import (
	"fmt"
	"math"
	"sort"
)

const bucketCount = 100000

// bucketAllocation represents the number of buckets allocated to a variant
type bucketAllocation struct {
	variant string
	buckets int
}

func (fe *Fractional) Evaluate(values, data any) any {
	valueToDistribute, feDistributions, err := parseFractionalEvaluationData(values, data)
	if err != nil {
		fe.Logger.Warn(fmt.Sprintf("parse fractional evaluation data: %v", err))
		return nil
	}

	if err := validateWeightSum(feDistributions.weightedVariants); err != nil {
		fe.Logger.Warn(fmt.Sprintf("weight validation failed: %v", err))
		return nil
	}

	return distributeValue(valueToDistribute, feDistributions)
}

// validateWeightSum rejects configurations whose total weight exceeds math.MaxInt32.
// The running total is kept in an int64 so the check itself cannot overflow.
func validateWeightSum(variants []fractionalEvaluationVariant) error {
	var totalWeight int64
	for _, variant := range variants {
		totalWeight += int64(variant.weight)
		if totalWeight > math.MaxInt32 {
			return fmt.Errorf("total weight sum %d exceeds maximum allowed value %d",
				totalWeight, math.MaxInt32)
		}
	}
	return nil
}

// calculateBucketAllocations converts weights into bucket counts that sum to exactly bucketCount.
func calculateBucketAllocations(variants []fractionalEvaluationVariant, totalWeight int) []bucketAllocation {
	allocations := make([]bucketAllocation, len(variants))
	totalAllocated := 0

	// Calculate initial allocations
	for i, variant := range variants {
		if variant.weight <= 0 {
			// Zero and negative weights receive no buckets
			allocations[i] = bucketAllocation{variant: variant.variant, buckets: 0}
		} else {
			// Calculate proportional allocation (rounded down)
			proportional := int((int64(variant.weight) * bucketCount) / int64(totalWeight))
			// Ensure minimum allocation of 1 bucket for any positive weight
			buckets := max(1, proportional)
			allocations[i] = bucketAllocation{variant: variant.variant, buckets: buckets}
		}
		totalAllocated += allocations[i].buckets
	}

	// Handle a rounding deficit: rounding proportional shares down can leave fewer
	// than bucketCount buckets assigned, so give the remainder to the largest
	// allocation (first in configuration order on ties) so every bucket maps to a variant
	if deficit := bucketCount - totalAllocated; deficit > 0 {
		largest := -1
		for i := range allocations {
			if allocations[i].buckets > 0 && (largest < 0 || allocations[i].buckets > allocations[largest].buckets) {
				largest = i
			}
		}
		if largest >= 0 {
			allocations[largest].buckets += deficit
			totalAllocated += deficit
		}
	}

	// Handle excess buckets by removing them from the largest allocations first
	excess := totalAllocated - bucketCount
	if excess > 0 {
		// Sort indices by bucket count (descending) to find the largest allocations
		indices := make([]int, len(allocations))
		for i := range indices {
			indices[i] = i
		}
		sort.Slice(indices, func(i, j int) bool {
			if allocations[indices[i]].buckets == allocations[indices[j]].buckets {
				return allocations[indices[i]].variant < allocations[indices[j]].variant // Tie-break by variant name
			}
			return allocations[indices[i]].buckets > allocations[indices[j]].buckets
		})

		// Remove excess starting from the largest allocation, respecting the minimum guarantee
		for _, idx := range indices {
			if excess <= 0 {
				break
			}

			// Don't reduce below 1 bucket if the original weight > 0
			minAllowed := 0
			if variants[idx].weight > 0 {
				minAllowed = 1
			}

			canRemove := allocations[idx].buckets - minAllowed
			toRemove := min(excess, canRemove)
			allocations[idx].buckets -= toRemove
			excess -= toRemove
		}
	}

	return allocations
}
```

The distribution logic in `distributeValue` would then be replaced with:

```go
func distributeValue(value string, feDistribution *fractionalEvaluationDistribution) string {
	if feDistribution.totalWeight == 0 {
		return ""
	}

	allocations := calculateBucketAllocations(feDistribution.weightedVariants, feDistribution.totalWeight)

	hashValue := int32(murmur3.StringSum32(value))
	hashRatio := math.Abs(float64(hashValue)) / math.MaxInt32
	bucket := int(hashRatio * bucketCount)
	// hashRatio reaches 1.0 for math.MaxInt32 and slightly exceeds it for
	// math.MinInt32, so clamp to keep bucket in range [0, bucketCount)
	if bucket >= bucketCount {
		bucket = bucketCount - 1
	}

	// Walk the cumulative allocations to find the variant that owns the bucket
	currentBucket := 0
	for _, allocation := range allocations {
		currentBucket += allocation.buckets
		if bucket < currentBucket {
			return allocation.variant
		}
	}

	return ""
}
```

### Consequences

- Good, because it enables precise traffic control for high-throughput environments
- Good, because it matches industry-standard precision offered by leading vendors
- Good, because it maintains API backwards compatibility
- Good, because integer weights remain simple to understand and configure
- Good, because it prevents silent configuration failures through minimum allocation guarantee
- Good, because excess handling is predictable and fair
- Good, because weight validation provides clear error messages for invalid configurations
- Bad, because it represents a behavioral breaking change for existing configurations
- Bad, because it slightly increases memory usage for bucket calculations
- Bad, because actual percentages may differ slightly from configured weights due to minimum allocations

### Implementation Plan

1. Update flagd-testbed with comprehensive test cases for high-precision fractional bucketing across all evaluation modes
2. Implement core logic in flagd to support 100,000-bucket system with minimum allocation guarantee and excess handling
3. Update flagd providers to ensure consistent behavior and testing across language implementations
4. Documentation updates, migration guides, and example configurations to demonstrate the new precision capabilities