Skip to content

A comprehensive study of the "Balmer Peak" phenomenon in binary search algorithms, where introducing controlled randomness (epsilon) can improve performance on certain array structures.

License

Notifications You must be signed in to change notification settings

Ghost---Shadow/epsilon-drunk-binary-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

A comprehensive study of the "Balmer Peak" phenomenon in binary search algorithms, where introducing controlled randomness (epsilon) can improve performance on certain array structures.

Overview

This project implements and analyzes Epsilon Drunk Binary Search, a modified binary search algorithm that uses 0.5 + Ξ΅ instead of 0.5 for the midpoint calculation, where Ξ΅ is a small random offset. The research explores how different array structures (characterized by their Gini coefficient) respond to different levels of search "drunkness."

Key Concepts

Epsilon Drunk Binary Search

  • Standard binary search: Uses exact midpoint mid = (low + high) / 2
  • Drunk binary search: Uses perturbed midpoint mid = low + (high - low) * (0.5 + Ξ΅)
  • Epsilon (Ξ΅): Random offset in range [-Ξ΅_max, +Ξ΅_max]
    • Negative Ξ΅ = "undershoot" (bias toward lower indices)
    • Positive Ξ΅ = "overshoot" (bias toward higher indices)

Array Drunkness (Gini Coefficient)

The Gini coefficient is a statistical measure of inequality originally used in economics to measure income distribution. In our context, we apply it to measure the inequality in gap distributions between consecutive array elements.

  • Sober arrays: Perfectly even spacing between elements (Gini = 0.000)

  • Drunk arrays: Uneven gap distributions with some large and small gaps (higher Gini values)

  • Mathematical definition: For gaps $g_1, g_2, \ldots, g_n$:

    $$\text{Gini} = \frac{2 \sum_{i=1}^{n} i \cdot g_i}{n \sum_{i=1}^{n} g_i} - \frac{n+1}{n}$$

  • Why we use it: The Gini coefficient provides a single, intuitive metric (0-1 scale) to characterize how "uneven" or "drunk" an array's structure is, allowing us to systematically study performance across different data distributions.

Balmer Peak

The Balmer Peak is a programming folklore concept suggesting that there's an optimal level of alcohol consumption that maximizes programming productivity - too little and you're overly cautious, too much and you make mistakes.

We apply this concept metaphorically to binary search:

  • Too sober (Ξ΅ = 0): Deterministic but potentially suboptimal on uneven data
  • Optimal drunkness (small Ξ΅): Just enough randomness to improve performance
  • Too drunk (large Ξ΅): Excessive randomness hurts performance

Our research seeks to find the "Balmer Peak" epsilon value that minimizes average comparisons for each array structure.

Project Structure

epsilon-drunk-binary-search/
β”œβ”€β”€ main.py                 # Core analysis and grid search
β”œβ”€β”€ plot_results.py         # Visualization and plotting
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ README.md              # This file
β”œβ”€β”€ epsilon_drunk_search_results.csv  # Generated results
└── plots/                 # Generated visualizations
    β”œβ”€β”€ array_vs_method_drunkness_correlation.png
    β”œβ”€β”€ epsilon_drunk_search_summary.png
    └── balmer_peak_*.png  # Individual scenario plots

Usage

Installation

pip install -r requirements.txt

Running the Analysis

# Generate comprehensive grid search data
python3 main.py

# Generate all plots and visualizations
python3 plot_results.py

Customization

Modify main.py to adjust:

  • Gini levels: Change gini_levels = [0.0, 0.05, 0.1, ...]
  • Epsilon range: Modify epsilon_values = np.arange(-0.3, 0.31, 0.02)
  • Array size: Adjust arr_size = 5000
  • Trials per combination: Change num_trials = 100

Key Algorithms

Array Generation

def generate_array_with_gini(size: int, target_gini: float) -> List[int]:
    """Generate array targeting specific Gini coefficient for gap distribution."""
    # Uses deterministic gap distribution to achieve target inequality

Drunk Binary Search

def drunk_binary_search(arr: List[int], target: int, epsilon_range: float) -> Tuple[int, int]:
    """Binary search with random epsilon offset from 0.5 midpoint."""
    epsilon = random.uniform(-epsilon_range, epsilon_range)
    drunk_ratio = max(0.1, min(0.9, 0.5 + epsilon))
    mid = int(low + (high - low) * drunk_ratio)

Performance Analysis

  • Statistical testing: t-tests for significance
  • Confidence intervals: 95% CI for performance metrics
  • Grid search: Systematic exploration of parameter space

Statistical Methodology (For Nerds πŸ€“)

Experimental Design

  • Trials per combination: 100 independent searches
  • Array size: 5,000 elements per test array
  • Target selection: Random targets uniformly distributed across array indices
  • Statistical tests: Two-tailed t-tests comparing drunk vs. regular binary search

Performance Metrics

  • Primary metric: Average number of comparisons
  • Confidence intervals: 95% CI using t-distribution
  • Standard deviation: Sample standard deviation across trials
  • Statistical significance: p < 0.05 threshold
  • Effect size: Percentage improvement over regular binary search

Example Statistical Output

Regular binary search: 11.72 Β± 1.33 comparisons
Best drunk search: 11.14 Β± 1.66 comparisons  
95% CI: [10.81, 11.47]
Improvement: 4.95%
Statistical significance: YES (p = 0.0070)

Results

Grid Search Dimensions

  • 12 Array Types: Gini coefficients from 0.000 to 0.600
  • 31 Search Strategies: Epsilon from -0.30 to +0.30 in steps of 0.02
  • 372 Total Combinations: 100 trials each = 37,200 individual searches
  • Total data points: 372 statistical measurements with full CI analysis

Sample Performance Plot

Balmer Peak Analysis

Example showing the Balmer Peak for arrays with Gini coefficient 0.2. The plot demonstrates how performance varies across different epsilon values, with the optimal "drunk" level clearly visible as the minimum point.

Comprehensive Heatmap

Full Grid Search Heatmap

The main result: a 31Γ—12 heatmap showing all 372 parameter combinations. Each cell represents the average performance (number of comparisons) for a specific Array Gini coefficient (x-axis) and Search epsilon level (y-axis). Darker colors indicate better performance.

Key Findings

  1. Array Structure Matters: Different Gini coefficients favor different epsilon strategies
  2. Optimal Zones: Performance peaks around Ξ΅ β‰ˆ Β±0.04 to Β±0.08
  3. Non-linear Relationship: Complex interaction between array structure and search strategy
  4. Statistical Significance: Multiple combinations show measurable improvements (p < 0.05)
  5. Performance Range: 0.1-5% improvements in comparison count for optimal combinations

Recommended Implementation

Universal Epsilon Value: Ξ΅ = -0.040

Based on comprehensive grid search analysis across all array structures, we recommend Ξ΅ = -0.040 for general-purpose applications. This value provides the best average performance across diverse data distributions.

Implementation

def recommended_drunk_binary_search(arr: List[int], target: int) -> Tuple[int, int]:
    """
    Binary search with recommended epsilon for optimal average performance.
    Uses Ξ΅ = -0.040 (slight undershoot bias) for best general-purpose results.
    """
    low, high = 0, len(arr) - 1
    comparisons = 0
    
    while low <= high:
        # Use 0.46 instead of 0.5 (0.5 - 0.04 = 0.46)
        mid = int(low + (high - low) * 0.46)
        comparisons += 1
        
        if arr[mid] == target:
            return mid, comparisons
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    
    return -1, comparisons

When to Use This Approach

  • Unknown data structures: When you don't know the Gini coefficient of your arrays
  • General-purpose libraries: For implementing in shared codebases
  • Conservative improvement: Modest but reliable 0.5-2% performance gains
  • Low-risk adoption: Minimal implementation changes with proven benefits

Dependencies

  • numpy>=1.21.0 - Numerical computations
  • scipy>=1.7.0 - Statistical analysis
  • matplotlib>=3.5.0 - Plotting
  • pandas>=1.3.0 - Data manipulation
  • scikit-learn>=1.0.0 - Regression analysis
  • tqdm>=4.64.0 - Progress bars

Scientific Background

This research is inspired by:

  • Randomized algorithms: How controlled randomness can improve deterministic algorithms
  • Balmer Peak hypothesis: Optimal performance with slight impairment
  • Binary search variants: Adaptations for non-uniform data distributions
  • Performance analysis: Statistical methods for algorithm comparison

Interpretation

When Drunk Search Helps

  • Clustered data: Arrays with uneven element distribution
  • Skewed searches: When targets aren't uniformly distributed
  • Cache effects: Slight randomness may improve memory access patterns

When Standard Search is Better

  • Uniform data: Perfectly spaced arrays (Gini β‰ˆ 0)
  • Large epsilon: Excessive randomness hurts performance
  • Small arrays: Overhead outweighs benefits

Future Work

  • Real-world datasets: Test on actual data distributions
  • Cache analysis: Measure memory hierarchy effects
  • Adaptive epsilon: Dynamic adjustment based on array characteristics
  • Multi-dimensional search: Extension to higher-dimensional spaces
  • Theoretical analysis: Mathematical modeling of performance bounds

License

MIT License

Copyright (c) 2024 Epsilon Drunk Binary Search Analysis

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Citation

If you use this work in research, please cite:

@software{epsilon_drunk_binary_search,
  title = {Epsilon Drunk Binary Search Analysis},
  author = {Souradeep Nanda},
  year = {2025},
  url = {https://github.com/Ghost---Shadow/epsilon-drunk-binary-search},
  note = {A study of randomized binary search performance across array structures}
}

GitHub Repository: https://github.com/Ghost---Shadow/epsilon-drunk-binary-search


"In code, as in life, sometimes a little randomness leads to better outcomes."

About

A comprehensive study of the "Balmer Peak" phenomenon in binary search algorithms, where introducing controlled randomness (epsilon) can improve performance on certain array structures.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages