
Mmap-sync Performance Analysis

Overview

This project evaluates the mmap-sync Rust library under a simulated production workload with 1 writer thread and 12 reader threads, measuring latency metrics for shared memory operations.


Questions & Methodology

1. Latency Instrumentation

Writer:

  • Instant::now() captures timestamps before and after synchronizer.write().
  • Measures exclusive lock acquisition + data serialization + mmap flush.

Readers:

  • Records time from detecting a version change (via synchronizer.version()) to completing synchronizer.read().
  • Each reader thread stores measurements locally to avoid cross-thread contention.
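The measurement pattern above can be sketched with the standard library alone. This is a simplified illustration, not the project's actual code: the closure stands in for the real `synchronizer.write()` (or `synchronizer.read()`) call.

```rust
use std::time::Instant;

/// Time one operation and push the latency (ns) into a preallocated buffer.
/// The closure stands in for the real `synchronizer.write()` / `read()` call.
fn record_latency<F: FnMut()>(mut op: F, samples: &mut Vec<u64>) {
    let start = Instant::now();
    op();
    // Store raw nanoseconds; no formatting or I/O on the hot path.
    samples.push(start.elapsed().as_nanos() as u64);
}

fn main() {
    // Preallocate so the hot loop never reallocates.
    let mut samples: Vec<u64> = Vec::with_capacity(100_000);
    for _ in 0..1_000 {
        record_latency(|| std::hint::black_box(()), &mut samples);
    }
    assert_eq!(samples.len(), 1_000);
}
```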

2. Minimizing Instrumentation Overhead

  • Batched Writes: Latencies stored in thread-local Vec buffers, flushed to CSV post-test.
  • No Formatting During Test: Avoid println! or string operations during measurements.
  • CPU Pinning: Writer/readers pinned to separate CPU cores via core_affinity to reduce OS scheduler noise.
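A minimal sketch of the batched-write idea: latencies accumulate in an in-memory buffer during the run and are rendered to CSV only afterwards, so no string formatting or I/O touches the hot path. The `reader_0.csv` filename is hypothetical.

```rust
/// Render a latency buffer (nanoseconds) as CSV, done once after the test
/// so no string formatting happens during measurement.
fn to_csv(samples: &[u64]) -> String {
    let mut out = String::with_capacity(samples.len() * 12);
    out.push_str("latency_ns\n");
    for s in samples {
        out.push_str(&s.to_string());
        out.push('\n');
    }
    out
}

fn main() {
    let csv = to_csv(&[120, 95, 300]);
    assert_eq!(csv, "latency_ns\n120\n95\n300\n");
    // After the measurement loop, each thread would flush its buffer, e.g.:
    // std::fs::write("reader_0.csv", &csv).unwrap(); // filename hypothetical
}
```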

3. Realistic Distribution Models

Writer (Market Data):

  • Poisson Process: Exponential inter-arrival times (λ = 10,000/sec) to mimic bursty financial data feeds.
  • Pareto Distribution: Bursty writes with long-tail latency spikes (simulates market opening events).

Readers:

  • Continuous Polling: Readers check version() in a tight loop, simulating low-latency trading systems that prioritize freshness over CPU efficiency.
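The Poisson writer's inter-arrival times can be drawn by inverse-transform sampling of the exponential distribution. A std-only sketch follows; a tiny LCG replaces the `rand` crate here purely to keep the example self-contained, and is not what the harness necessarily uses.

```rust
/// Sample an exponential inter-arrival time (seconds) for a Poisson process
/// with rate `lambda` events/sec, via inverse-transform sampling.
/// `u` must be uniform in (0, 1].
fn exp_interval(u: f64, lambda: f64) -> f64 {
    -u.ln() / lambda
}

fn main() {
    // Minimal LCG (MMIX constants) standing in for a real RNG.
    let mut state: u64 = 0x1234_5678;
    let mut next_u = || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 53 bits; clamp away exact zero so ln() is defined.
        ((state >> 11) as f64 / (1u64 << 53) as f64).max(1e-12)
    };

    let lambda = 10_000.0; // λ = 10,000 writes/sec → mean gap of 100 µs
    let n = 100_000;
    let mut sum = 0.0;
    for _ in 0..n {
        sum += exp_interval(next_u(), lambda);
    }
    let mean = sum / n as f64;
    // Sample mean should sit near 1/λ = 100 µs.
    assert!((mean - 1e-4).abs() < 1e-5);
}
```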

4. Stress Testing Distribution

  • Writer: Constant maximum rate (no delays between writes) to saturate the system.
  • Readers: Unthrottled polling to create contention.
  • Why?: This worst-case scenario tests:
    • Lock fairness between writer and readers
    • Memory bandwidth limits
    • OS scheduler behavior under load
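The unthrottled reader loop can be sketched with std primitives. An `AtomicU64` stands in for `synchronizer.version()` and `black_box` for the actual read, so this illustrates only the polling-and-timing structure, not the mmap-sync API itself.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

/// Spin on a version counter and time each detection-to-read interval.
/// `AtomicU64` stands in for `synchronizer.version()`.
fn reader_loop(version: Arc<AtomicU64>, iterations: usize) -> Vec<u64> {
    let mut latencies = Vec::with_capacity(iterations);
    let mut last = version.load(Ordering::Acquire);
    while latencies.len() < iterations {
        let v = version.load(Ordering::Acquire);
        if v != last {
            let start = Instant::now();
            std::hint::black_box(v); // stand-in for `synchronizer.read()`
            latencies.push(start.elapsed().as_nanos() as u64);
            last = v;
        }
    }
    latencies
}

fn main() {
    let version = Arc::new(AtomicU64::new(0));
    let done = Arc::new(AtomicBool::new(false));
    let writer = {
        let (version, done) = (Arc::clone(&version), Arc::clone(&done));
        thread::spawn(move || {
            // Unthrottled writer: bump the version as fast as possible.
            while !done.load(Ordering::Relaxed) {
                version.fetch_add(1, Ordering::Release);
            }
        })
    };
    let latencies = reader_loop(Arc::clone(&version), 10);
    done.store(true, Ordering::Relaxed);
    writer.join().unwrap();
    assert_eq!(latencies.len(), 10);
}
```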

5. Expected Latency Results

  • Baseline Expectation:
    • Writer: 20–50 µs (aligns with Cloudflare's 10–30 µs + overhead)
    • Readers: 50–150 µs (12 readers creating contention)
  • Key Variables:
    • NUMA node locality between writer/readers
    • CPU cache thrashing from frequent writes
    • Mmap flush granularity (page vs. byte-level)
    • Distribution mode (poisson/stress/pareto)

Implementation Highlights

Data Structures

```rust
// Zero-copy deserialization with rkyv
#[derive(Archive, Deserialize, Serialize, Debug)]
#[archive_attr(derive(CheckBytes))]
pub struct BidAsk {
    pub side: [u8; 4],      // "buy" or "sell"
    pub exchange: [u8; 10], // Up to 10 chars
    pub symbol: [u8; 8],    // e.g. "BTC-USD"
    pub price: f64,
    pub size: f64,
    pub timestamp: f64,
}

#[derive(Archive, Deserialize, Serialize, Debug)]
#[archive_attr(derive(CheckBytes))]
pub struct BestBidAsk {
    pub best_bid: BidAsk,
    pub best_offer: BidAsk,
}
```
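Populating the fixed-size `[u8; N]` fields above needs a zero-padding helper. `pack` below is a hypothetical utility sketched for illustration, not part of the project:

```rust
/// Copy an ASCII string into a fixed-size, zero-padded byte array,
/// matching the `[u8; N]` fields of the rkyv structs above.
/// Truncates if `s` is longer than N bytes.
fn pack<const N: usize>(s: &str) -> [u8; N] {
    let mut buf = [0u8; N];
    let bytes = s.as_bytes();
    let len = bytes.len().min(N);
    buf[..len].copy_from_slice(&bytes[..len]);
    buf
}

fn main() {
    let symbol: [u8; 8] = pack("BTC-USD");
    assert_eq!(&symbol, b"BTC-USD\0"); // 7 chars + 1 zero pad
    let side: [u8; 4] = pack("buy");
    assert_eq!(&side, b"buy\0");
}
```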

Workflow

  1. Writer Thread

    • Generates random BestBidAsk every 100 µs (avg)
    • Uses mmap-sync's Synchronizer in single-writer mode for exclusive access
    • Records time to write + flush
  2. Reader Threads

    • Continuously poll version()
    • On version change, read + deserialize data
    • Record time from detection to completed read
  3. Post-Processing

    • Merge per-thread CSV files
    • Compute statistics (mean, p95, etc.) using HDR histograms
    • Generate latency-over-time plot
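The percentile step can be illustrated with a nearest-rank computation over a sorted sample. The real pipeline uses HDR histograms; this sorted-slice version is a simplified stand-in:

```rust
/// Nearest-rank percentile over a latency sample (µs). Sorts in place;
/// a simplified stand-in for the HDR-histogram computation.
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    let mut lat: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    assert_eq!(percentile(&mut lat, 95.0), 95.0);
    assert_eq!(percentile(&mut lat, 99.0), 99.0);
    let mean = lat.iter().sum::<f64>() / lat.len() as f64;
    assert_eq!(mean, 50.5);
}
```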

Results & Analysis

Run locally (Mac M1 Max, 6 reader threads):

```bash
cargo r -r -- --readers 6 --mode poisson
```

Example output:

```text
Spawning 6 reader threads
Collected 58365 writer samples
Collected 350000 reader samples
=== Writer Latency (µs) ===
Count:    58365
Min:        14.2 us
Max:      5664.8 us
Mean:       33.1 us
Median:     29.9 us
p95:        53.9 us
p99:       109.6 us

=== Reader Latency (µs) ===
Count:    350000
Min:         0.0 us
Max:      14876.7 us
Mean:        0.7 us
Median:      0.2 us
p95:         0.4 us
p99:         3.8 us
Plot saved to latency_plot.png
```

Test Environment

AWS EC2 c6in.8xlarge Instance:
- CPU: Intel Xeon 8375C (Ice Lake) - 16 physical cores/32 threads
- Memory: 64GB DDR4 with tmpfs mount
- OS: Amazon Linux 2 (Kernel 5.10)
- Test Configuration: 10-second duration, 12 reader threads

Full Metrics by Distribution Mode

Poisson (Realistic Market Flow)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 99,994 | 1,199,112 |
| Min | 0.4 µs | 0.1 µs |
| Max | 183.6 µs | 36.8 µs |
| Mean | 0.5 µs | 0.7 µs |
| Median | 0.4 µs | 0.5 µs |
| p95 | 0.6 µs | 1.9 µs |
| p99 | 1.3 µs | 2.3 µs |

Throughput: 9,999 writes/sec | 119,911 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*

Pareto (Bursty Events)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 99,989 | 1,199,798 |
| Min | 0.4 µs | 0.1 µs |
| Max | 298.0 µs | 42.3 µs |
| Mean | 0.4 µs | 0.7 µs |
| Median | 0.4 µs | 0.5 µs |
| p95 | 0.5 µs | 1.9 µs |
| p99 | 0.7 µs | 2.2 µs |

Throughput: 9,999 writes/sec | 119,979 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*

Stress (Maximum Contention)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 541,524 | 6,225,731 |
| Min | 0.4 µs | 0.0 µs |
| Max | 257.0 µs | 37.6 µs |
| Mean | 18.2 µs | 0.9 µs |
| Median | 1.8 µs | 0.6 µs |
| p95 | 58.4 µs | 2.2 µs |
| p99 | 59.0 µs | 3.0 µs |

Throughput: 54,152 writes/sec | 622,573 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*


Key Analysis

  1. Writer Performance Characteristics

    • Normal Operation (Poisson/Pareto):

      • Consistent sub-0.5µs median latency across all realistic scenarios
      • Tight p95-p99 spread (0.2-0.7µs) demonstrates predictable behavior
      • Maximum latencies under 300µs even during burst scenarios
    • Stress Mode:

      • Maintained 1.8µs median despite 54k writes/sec throughput
      • p99 latency stable at 59µs showing effective contention management
      • 257µs max latency demonstrates bounded worst-case behavior
  2. Reader Consistency

    • Sub-microsecond median latency across all test scenarios
    • p99 below 3µs even under maximum contention
    • Maximum observed latency under 43µs across all modes
  3. Infrastructure Impact

    • tmpfs: Reduced writer max latency by 3.9x (1443µs vs 5664µs on Mac)

Performance Thresholds (10s Test, 12 Readers)

| Scenario | Healthy Range | Warning Threshold |
| --- | --- | --- |
| Writer Median Latency | <2 µs | ≥2 µs |
| Writer p99 Latency | <60 µs | ≥60 µs |
| Reader p95 Latency | <3 µs | ≥5 µs |
| Write Throughput | <55k/s | >60k/s |
| Read Throughput | <650k/s | >700k/s |

*Calculated from 10s stress-test totals: 541,524 writes / 10s ≈ 54k writes/sec, 6.23M reads / 10s ≈ 623k reads/sec*

Conclusion

The mmap-sync library demonstrates:

  • Predictable low latency: 0.4-0.7µs median for both roles in normal operation
  • Graceful degradation: Writer p99 rises from 1.3µs (Poisson) to 59µs (Stress) yet stays bounded while throughput grows 5.4x
  • Horizontal scalability: 12 readers add <1µs to median read latency
  • Production readiness: Sustains 54k writes/sec with sub-60µs p99 latency

How to Run

```bash
# Clone and build
git clone git@github.com:tombelieber/rust_mmap_sync_latency.git
cd rust_mmap_sync_latency
cargo build --release

# Run with 12 readers for 60s (Poisson mode)
cargo r -r -- --readers 12 --duration 60

# Stress test mode (max writes)
cargo r -r -- --readers 12 --duration 60 --mode stress
```

Branches for Test Results

We maintain three dedicated branches to store benchmark outputs (CSV, plots, etc.):

# pareto
pareto_latencies.zip
pareto_latency_plot.png_combined.png
pareto_latency_plot.png_reader.png
pareto_latency_plot.png_writer.png
pareto_output.txt

# poisson
poisson_latencies.zip
poisson_latency_plot.png_combined.png
poisson_latency_plot.png_reader.png
poisson_latency_plot.png_writer.png
poisson_output.txt

# stress
stress_latencies.zip
stress_latency_plot.png_combined.png
stress_latency_plot.png_reader.png
stress_latency_plot.png_writer.png
stress_output.txt

These branches exist only for historical data and reference, keeping main development clean. To review past test artifacts or reproduce a particular scenario, check out the corresponding branch.

