A comprehensive study of the "Balmer Peak" phenomenon in binary search algorithms, where introducing controlled randomness (epsilon) can improve performance on certain array structures.
This project implements and analyzes Epsilon Drunk Binary Search, a modified binary search algorithm that uses `0.5 + ε` instead of `0.5` for the midpoint calculation, where ε is a small random offset. The research explores how different array structures (characterized by their Gini coefficient) respond to different levels of search "drunkness."
- Standard binary search: uses the exact midpoint `mid = (low + high) / 2`
- Drunk binary search: uses the perturbed midpoint `mid = low + (high - low) * (0.5 + ε)`
- Epsilon (ε): random offset in the range `[-ε_max, +ε_max]`
  - Negative ε = "undershoot" (bias toward lower indices)
  - Positive ε = "overshoot" (bias toward higher indices)
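Concretely, the two midpoint rules differ only in the split ratio. A minimal sketch (the helper names `sober_mid` and `drunk_mid` are illustrative, not functions from `main.py`):

```python
import random

def sober_mid(low: int, high: int) -> int:
    """Standard binary search midpoint: exact 0.5 split."""
    return (low + high) // 2

def drunk_mid(low: int, high: int, epsilon_max: float) -> int:
    """Perturbed midpoint: split at 0.5 + epsilon, with epsilon ~ U(-epsilon_max, +epsilon_max)."""
    epsilon = random.uniform(-epsilon_max, epsilon_max)
    return low + int((high - low) * (0.5 + epsilon))
```

With `epsilon_max = 0` the two rules agree, so the standard search is the special case ε = 0.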
The Gini coefficient is a statistical measure of inequality originally used in economics to measure income distribution. In our context, we apply it to measure the inequality in gap distributions between consecutive array elements.
- Sober arrays: perfectly even spacing between elements (Gini = 0.000)
- Drunk arrays: uneven gap distributions with some large and small gaps (higher Gini values)
- Mathematical definition: for gaps $g_1, g_2, \ldots, g_n$ sorted in ascending order:

  $$\text{Gini} = \frac{2 \sum_{i=1}^{n} i \cdot g_i}{n \sum_{i=1}^{n} g_i} - \frac{n+1}{n}$$

- Why we use it: the Gini coefficient provides a single, intuitive metric (0-1 scale) to characterize how "uneven" or "drunk" an array's structure is, allowing us to systematically study performance across different data distributions.
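The formula can be computed directly from the sorted gaps. A minimal sketch (`gini_of_gaps` is an illustrative helper, not part of `main.py`):

```python
from typing import List

def gini_of_gaps(arr: List[int]) -> float:
    """Gini coefficient of the gaps between consecutive elements of a sorted array."""
    gaps = sorted(arr[i + 1] - arr[i] for i in range(len(arr) - 1))
    n = len(gaps)
    total = sum(gaps)
    if n == 0 or total == 0:
        return 0.0
    # Gini = 2 * sum(i * g_i) / (n * sum(g_i)) - (n + 1) / n, with 1-based i over sorted gaps
    weighted = sum((i + 1) * g for i, g in enumerate(gaps))
    return 2 * weighted / (n * total) - (n + 1) / n
```

For example, a "sober" array like `[0, 10, 20, 30]` has equal gaps and a Gini of 0.0, while `[0, 1, 2, 100]` concentrates most of its span in one gap and scores much higher.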
The Balmer Peak (usually spelled "Ballmer Peak" in programming folklore, after Steve Ballmer) is the tongue-in-cheek idea that there is an optimal level of alcohol consumption that maximizes programming productivity: too little and you're overly cautious, too much and you make mistakes.
We apply this concept metaphorically to binary search:
- Too sober (ε = 0): Deterministic but potentially suboptimal on uneven data
- Optimal drunkness (small ε): Just enough randomness to improve performance
- Too drunk (large ε): Excessive randomness hurts performance
Our research seeks to find the "Balmer Peak" epsilon value that minimizes average comparisons for each array structure.
```
epsilon-drunk-binary-search/
├── main.py                            # Core analysis and grid search
├── plot_results.py                    # Visualization and plotting
├── requirements.txt                   # Python dependencies
├── README.md                          # This file
├── epsilon_drunk_search_results.csv   # Generated results
└── plots/                             # Generated visualizations
    ├── array_vs_method_drunkness_correlation.png
    ├── epsilon_drunk_search_summary.png
    └── balmer_peak_*.png              # Individual scenario plots
```
```bash
pip install -r requirements.txt

# Generate comprehensive grid search data
python3 main.py

# Generate all plots and visualizations
python3 plot_results.py
```
Modify `main.py` to adjust:
- Gini levels: change `gini_levels = [0.0, 0.05, 0.1, ...]`
- Epsilon range: modify `epsilon_values = np.arange(-0.3, 0.31, 0.02)`
- Array size: adjust `arr_size = 5000`
- Trials per combination: change `num_trials = 100`
```python
def generate_array_with_gini(size: int, target_gini: float) -> List[int]:
    """Generate array targeting specific Gini coefficient for gap distribution."""
    # Uses a deterministic gap distribution to achieve the target inequality
    ...
```
```python
def drunk_binary_search(arr: List[int], target: int, epsilon_range: float) -> Tuple[int, int]:
    """Binary search with random epsilon offset from 0.5 midpoint."""
    # Inside the search loop: draw a fresh epsilon and clamp the split ratio
    epsilon = random.uniform(-epsilon_range, epsilon_range)
    drunk_ratio = max(0.1, min(0.9, 0.5 + epsilon))
    mid = int(low + (high - low) * drunk_ratio)
    ...
```
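The excerpt above shows only the perturbation step; a self-contained sketch of the full loop, reconstructed from that excerpt (hedged: this is an approximation, not code copied from `main.py`):

```python
import random
from typing import List, Tuple

def drunk_binary_search(arr: List[int], target: int, epsilon_range: float) -> Tuple[int, int]:
    """Binary search with a random epsilon offset from the 0.5 midpoint.

    Returns (index, comparisons), or (-1, comparisons) if target is absent.
    """
    low, high = 0, len(arr) - 1
    comparisons = 0
    while low <= high:
        # Re-draw epsilon each iteration; clamp the split ratio away from the endpoints
        epsilon = random.uniform(-epsilon_range, epsilon_range)
        drunk_ratio = max(0.1, min(0.9, 0.5 + epsilon))
        mid = int(low + (high - low) * drunk_ratio)
        comparisons += 1
        if arr[mid] == target:
            return mid, comparisons
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, comparisons
```

The clamp to `[0.1, 0.9]` guarantees the probe always lands strictly inside the current range, so the loop makes progress and terminates just like the standard algorithm.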
- Statistical testing: t-tests for significance
- Confidence intervals: 95% CI for performance metrics
- Grid search: Systematic exploration of parameter space
- Trials per combination: 100 independent searches
- Array size: 5,000 elements per test array
- Target selection: Random targets uniformly distributed across array indices
- Statistical tests: Two-tailed t-tests comparing drunk vs. regular binary search
- Primary metric: Average number of comparisons
- Confidence intervals: 95% CI using t-distribution
- Standard deviation: Sample standard deviation across trials
- Statistical significance: p < 0.05 threshold
- Effect size: Percentage improvement over regular binary search
```
Regular binary search: 11.72 ± 1.33 comparisons
Best drunk search:     11.14 ± 1.66 comparisons
95% CI: [10.81, 11.47]
Improvement: 4.95%
Statistical significance: YES (p = 0.0070)
```
- 12 Array Types: Gini coefficients from 0.000 to 0.600
- 31 Search Strategies: Epsilon from -0.30 to +0.30 in steps of 0.02
- 372 Total Combinations: 100 trials each = 37,200 individual searches
- Total data points: 372 statistical measurements with full CI analysis
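The parameter-space arithmetic can be checked directly (the Gini list here is a placeholder; `main.py` defines the 12 actual Gini targets):

```python
from itertools import product

import numpy as np

epsilon_values = np.arange(-0.30, 0.31, 0.02)  # 31 search strategies
gini_levels = list(range(12))                  # placeholder for the 12 Gini targets
num_trials = 100

combinations = list(product(gini_levels, epsilon_values))
total_searches = len(combinations) * num_trials  # 372 combinations * 100 trials
```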
Example showing the Balmer Peak for arrays with Gini coefficient 0.2. The plot demonstrates how performance varies across different epsilon values, with the optimal "drunk" level clearly visible as the minimum point.
The main result: a 31×12 heatmap showing all 372 parameter combinations. Each cell represents the average performance (number of comparisons) for a specific array Gini coefficient (x-axis) and search epsilon level (y-axis). Darker colors indicate better performance.
- Array Structure Matters: Different Gini coefficients favor different epsilon strategies
- Optimal Zones: Performance peaks around ε ≈ ±0.04 to ±0.08
- Non-linear Relationship: Complex interaction between array structure and search strategy
- Statistical Significance: Multiple combinations show measurable improvements (p < 0.05)
- Performance Range: 0.1-5% improvements in comparison count for optimal combinations
Based on comprehensive grid search analysis across all array structures, we recommend ε = -0.040 for general-purpose applications. This value provides the best average performance across diverse data distributions.
```python
from typing import List, Tuple

def recommended_drunk_binary_search(arr: List[int], target: int) -> Tuple[int, int]:
    """
    Binary search with recommended epsilon for optimal average performance.

    Uses ε = -0.040 (slight undershoot bias) for best general-purpose results.
    """
    low, high = 0, len(arr) - 1
    comparisons = 0
    while low <= high:
        # Use 0.46 instead of 0.5 (0.5 - 0.04 = 0.46)
        mid = int(low + (high - low) * 0.46)
        comparisons += 1
        if arr[mid] == target:
            return mid, comparisons
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, comparisons
```
- Unknown data structures: When you don't know the Gini coefficient of your arrays
- General-purpose libraries: For implementing in shared codebases
- Conservative improvement: Modest but reliable 0.5-2% performance gains
- Low-risk adoption: Minimal implementation changes with proven benefits
- `numpy>=1.21.0` - Numerical computations
- `scipy>=1.7.0` - Statistical analysis
- `matplotlib>=3.5.0` - Plotting
- `pandas>=1.3.0` - Data manipulation
- `scikit-learn>=1.0.0` - Regression analysis
- `tqdm>=4.64.0` - Progress bars
This research is inspired by:
- Randomized algorithms: How controlled randomness can improve deterministic algorithms
- Balmer Peak hypothesis: Optimal performance with slight impairment
- Binary search variants: Adaptations for non-uniform data distributions
- Performance analysis: Statistical methods for algorithm comparison
- Clustered data: Arrays with uneven element distribution
- Skewed searches: When targets aren't uniformly distributed
- Cache effects: Slight randomness may improve memory access patterns
- Uniform data: Perfectly spaced arrays (Gini β 0)
- Large epsilon: Excessive randomness hurts performance
- Small arrays: Overhead outweighs benefits
- Real-world datasets: Test on actual data distributions
- Cache analysis: Measure memory hierarchy effects
- Adaptive epsilon: Dynamic adjustment based on array characteristics
- Multi-dimensional search: Extension to higher-dimensional spaces
- Theoretical analysis: Mathematical modeling of performance bounds
MIT License
Copyright (c) 2024 Epsilon Drunk Binary Search Analysis
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you use this work in research, please cite:
```bibtex
@software{epsilon_drunk_binary_search,
  title  = {Epsilon Drunk Binary Search Analysis},
  author = {Souradeep Nanda},
  year   = {2025},
  url    = {https://github.com/Ghost---Shadow/epsilon-drunk-binary-search},
  note   = {A study of randomized binary search performance across array structures}
}
```
GitHub Repository: https://github.com/Ghost---Shadow/epsilon-drunk-binary-search
"In code, as in life, sometimes a little randomness leads to better outcomes."