
Mmap-sync Performance Analysis

Overview

This project evaluates the mmap-sync Rust library under a simulated production workload with 1 writer thread and 12 reader threads, measuring latency metrics for shared memory operations.


Questions & Methodology

1. Latency Instrumentation

Writer:

  • Instant::now() captures timestamps before and after synchronizer.write().
  • Measures exclusive lock acquisition + data serialization + mmap flush.

Readers:

  • Records time from detecting a version change (via synchronizer.version()) to completing synchronizer.read().
  • Each reader thread stores measurements locally to avoid cross-thread contention.
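The measurement pattern above can be sketched with the standard library alone. This is a simplified illustration, not the project's actual code: the closure stands in for the real `synchronizer.write()` (or `synchronizer.read()`) call.

```rust
use std::time::Instant;

/// Time one operation and push the latency (ns) into a preallocated buffer.
/// The closure stands in for the real `synchronizer.write()` / `read()` call.
fn record_latency<F: FnMut()>(mut op: F, samples: &mut Vec<u64>) {
    let start = Instant::now();
    op();
    // Store raw nanoseconds; no formatting or I/O on the hot path.
    samples.push(start.elapsed().as_nanos() as u64);
}

fn main() {
    // Preallocate so the hot loop never reallocates.
    let mut samples: Vec<u64> = Vec::with_capacity(100_000);
    for _ in 0..1_000 {
        record_latency(|| std::hint::black_box(()), &mut samples);
    }
    assert_eq!(samples.len(), 1_000);
}
```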

2. Minimizing Instrumentation Overhead

  • Batched Writes: Latencies stored in thread-local Vec buffers, flushed to CSV post-test.
  • No Formatting During Test: Avoid println! or string operations during measurements.
  • CPU Pinning: Writer/readers pinned to separate CPU cores via core_affinity to reduce OS scheduler noise.
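A minimal sketch of the batched-write idea: latencies accumulate in an in-memory buffer during the run and are rendered to CSV only afterwards, so no string formatting or I/O touches the hot path. The `reader_0.csv` filename is hypothetical.

```rust
/// Render a latency buffer (nanoseconds) as CSV, done once after the test
/// so no string formatting happens during measurement.
fn to_csv(samples: &[u64]) -> String {
    let mut out = String::with_capacity(samples.len() * 12);
    out.push_str("latency_ns\n");
    for s in samples {
        out.push_str(&s.to_string());
        out.push('\n');
    }
    out
}

fn main() {
    let csv = to_csv(&[120, 95, 300]);
    assert_eq!(csv, "latency_ns\n120\n95\n300\n");
    // After the measurement loop, each thread would flush its buffer, e.g.:
    // std::fs::write("reader_0.csv", &csv).unwrap(); // filename hypothetical
}
```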

3. Realistic Distribution Models

Writer (Market Data):

  • Poisson Process: Exponential inter-arrival times (λ = 10,000/sec) to mimic bursty financial data feeds.
  • Pareto Distribution: Bursty writes with long-tail latency spikes (simulates market opening events).

Readers:

  • Continuous Polling: Readers check version() in a tight loop, simulating low-latency trading systems that prioritize freshness over CPU efficiency.
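The Poisson writer's inter-arrival times can be drawn by inverse-transform sampling of the exponential distribution. A std-only sketch follows; a tiny LCG replaces the `rand` crate here purely to keep the example self-contained, and is not what the harness necessarily uses.

```rust
/// Sample an exponential inter-arrival time (seconds) for a Poisson process
/// with rate `lambda` events/sec, via inverse-transform sampling.
/// `u` must be uniform in (0, 1].
fn exp_interval(u: f64, lambda: f64) -> f64 {
    -u.ln() / lambda
}

fn main() {
    // Minimal LCG (MMIX constants) standing in for a real RNG.
    let mut state: u64 = 0x1234_5678;
    let mut next_u = || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 53 bits; clamp away exact zero so ln() is defined.
        ((state >> 11) as f64 / (1u64 << 53) as f64).max(1e-12)
    };

    let lambda = 10_000.0; // λ = 10,000 writes/sec → mean gap of 100 µs
    let n = 100_000;
    let mut sum = 0.0;
    for _ in 0..n {
        sum += exp_interval(next_u(), lambda);
    }
    let mean = sum / n as f64;
    // Sample mean should sit near 1/λ = 100 µs.
    assert!((mean - 1e-4).abs() < 1e-5);
}
```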

4. Stress Testing Distribution

  • Writer: Constant maximum rate (no delays between writes) to saturate the system.
  • Readers: Unthrottled polling to create contention.
  • Why?: This worst-case scenario tests:
    • Lock fairness between writer and readers
    • Memory bandwidth limits
    • OS scheduler behavior under load
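The unthrottled reader loop can be sketched with std primitives. An `AtomicU64` stands in for `synchronizer.version()` and `black_box` for the actual read, so this illustrates only the polling-and-timing structure, not the mmap-sync API itself.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

/// Spin on a version counter and time each detection-to-read interval.
/// `AtomicU64` stands in for `synchronizer.version()`.
fn reader_loop(version: Arc<AtomicU64>, iterations: usize) -> Vec<u64> {
    let mut latencies = Vec::with_capacity(iterations);
    let mut last = version.load(Ordering::Acquire);
    while latencies.len() < iterations {
        let v = version.load(Ordering::Acquire);
        if v != last {
            let start = Instant::now();
            std::hint::black_box(v); // stand-in for `synchronizer.read()`
            latencies.push(start.elapsed().as_nanos() as u64);
            last = v;
        }
    }
    latencies
}

fn main() {
    let version = Arc::new(AtomicU64::new(0));
    let done = Arc::new(AtomicBool::new(false));
    let writer = {
        let (version, done) = (Arc::clone(&version), Arc::clone(&done));
        thread::spawn(move || {
            // Unthrottled writer: bump the version as fast as possible.
            while !done.load(Ordering::Relaxed) {
                version.fetch_add(1, Ordering::Release);
            }
        })
    };
    let latencies = reader_loop(Arc::clone(&version), 10);
    done.store(true, Ordering::Relaxed);
    writer.join().unwrap();
    assert_eq!(latencies.len(), 10);
}
```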

5. Expected Latency Results

  • Baseline Expectation:
    • Writer: 20–50 µs (aligns with Cloudflare's 10–30 µs + overhead)
    • Readers: 50–150 µs (12 readers creating contention)
  • Key Variables:
    • NUMA node locality between writer/readers
    • CPU cache thrashing from frequent writes
    • Mmap flush granularity (page vs. byte-level)
    • Distribution mode (poisson/stress/pareto)

Implementation Highlights

Data Structures

```rust
// Zero-copy deserialization with rkyv
#[derive(Archive, Deserialize, Serialize, Debug)]
#[archive_attr(derive(CheckBytes))]
pub struct BidAsk {
    pub side: [u8; 4],      // "buy" or "sell"
    pub exchange: [u8; 10], // Up to 10 chars
    pub symbol: [u8; 8],    // e.g. "BTC-USD"
    pub price: f64,
    pub size: f64,
    pub timestamp: f64,
}

#[derive(Archive, Deserialize, Serialize, Debug)]
#[archive_attr(derive(CheckBytes))]
pub struct BestBidAsk {
    pub best_bid: BidAsk,
    pub best_offer: BidAsk,
}
```
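Populating the fixed-size `[u8; N]` fields above needs a zero-padding helper. `pack` below is a hypothetical utility sketched for illustration, not part of the project:

```rust
/// Copy an ASCII string into a fixed-size, zero-padded byte array,
/// matching the `[u8; N]` fields of the rkyv structs above.
/// Truncates if `s` is longer than N bytes.
fn pack<const N: usize>(s: &str) -> [u8; N] {
    let mut buf = [0u8; N];
    let bytes = s.as_bytes();
    let len = bytes.len().min(N);
    buf[..len].copy_from_slice(&bytes[..len]);
    buf
}

fn main() {
    let symbol: [u8; 8] = pack("BTC-USD");
    assert_eq!(&symbol, b"BTC-USD\0"); // 7 chars + 1 zero pad
    let side: [u8; 4] = pack("buy");
    assert_eq!(&side, b"buy\0");
}
```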

Workflow

  1. Writer Thread

    • Generates random BestBidAsk every 100 µs (avg)
    • Uses mmap-sync's Synchronizer in single-writer mode for exclusive access
    • Records time to write + flush
  2. Reader Threads

    • Continuously poll version()
    • On version change, read + deserialize data
    • Record time from detection to completed read
  3. Post-Processing

    • Merge per-thread CSV files
    • Compute statistics (mean, p95, etc.) using HDR histograms
    • Generate latency-over-time plot
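The percentile step can be illustrated with a nearest-rank computation over a sorted sample. The real pipeline uses HDR histograms; this sorted-slice version is a simplified stand-in:

```rust
/// Nearest-rank percentile over a latency sample (µs). Sorts in place;
/// a simplified stand-in for the HDR-histogram computation.
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    let mut lat: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    assert_eq!(percentile(&mut lat, 95.0), 95.0);
    assert_eq!(percentile(&mut lat, 99.0), 99.0);
    let mean = lat.iter().sum::<f64>() / lat.len() as f64;
    assert_eq!(mean, 50.5);
}
```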

Results & Analysis

Run locally (Mac M1 Max, 6 reader threads):

```bash
cargo r -r -- --readers 6 --mode poisson
```

Example output:

```text
Spawning 6 reader threads
Collected 58365 writer samples
Collected 350000 reader samples
=== Writer Latency (µs) ===
Count:    58365
Min:        14.2 us
Max:      5664.8 us
Mean:       33.1 us
Median:     29.9 us
p95:        53.9 us
p99:       109.6 us

=== Reader Latency (µs) ===
Count:    350000
Min:         0.0 us
Max:      14876.7 us
Mean:        0.7 us
Median:      0.2 us
p95:         0.4 us
p99:         3.8 us
Plot saved to latency_plot.png
```

Test Environment

AWS EC2 c6in.8xlarge Instance:
- CPU: Intel Xeon 8375C (Ice Lake) - 16 physical cores/32 threads
- Memory: 64GB DDR4 with tmpfs mount
- OS: Amazon Linux 2 (Kernel 5.10)
- Test Configuration: 10-second duration, 12 reader threads

Full Metrics by Distribution Mode

Poisson (Realistic Market Flow)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 99,994 | 1,199,112 |
| Min | 0.4 µs | 0.1 µs |
| Max | 183.6 µs | 36.8 µs |
| Mean | 0.5 µs | 0.7 µs |
| Median | 0.4 µs | 0.5 µs |
| p95 | 0.6 µs | 1.9 µs |
| p99 | 1.3 µs | 2.3 µs |

Throughput: 9,999 writes/sec | 119,911 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*

Pareto (Bursty Events)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 99,989 | 1,199,798 |
| Min | 0.4 µs | 0.1 µs |
| Max | 298.0 µs | 42.3 µs |
| Mean | 0.4 µs | 0.7 µs |
| Median | 0.4 µs | 0.5 µs |
| p95 | 0.5 µs | 1.9 µs |
| p99 | 0.7 µs | 2.2 µs |

Throughput: 9,999 writes/sec | 119,979 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*

Stress (Maximum Contention)

| Metric | Writer | Reader |
| --- | --- | --- |
| Count | 541,524 | 6,225,731 |
| Min | 0.4 µs | 0.0 µs |
| Max | 257.0 µs | 37.6 µs |
| Mean | 18.2 µs | 0.9 µs |
| Median | 1.8 µs | 0.6 µs |
| p95 | 58.4 µs | 2.2 µs |
| p99 | 59.0 µs | 3.0 µs |

Throughput: 54,152 writes/sec | 622,573 reads/sec

Latency Distribution Comparison

*(Plots: reader and writer latency distributions)*


Key Analysis

  1. Writer Performance Characteristics

    • Normal Operation (Poisson/Pareto):

      • Consistent sub-0.5µs median latency across all realistic scenarios
      • Tight p95-p99 spread (0.2-0.7µs) demonstrates predictable behavior
      • Maximum latencies under 300µs even during burst scenarios
    • Stress Mode:

      • Maintained 1.8µs median despite 54k writes/sec throughput
      • p99 latency stable at 59µs showing effective contention management
      • 257µs max latency demonstrates bounded worst-case behavior
  2. Reader Consistency

    • Sub-microsecond median latency across all test scenarios
    • p99 below 3µs even under maximum contention
    • Maximum observed latency under 43µs across all modes
  3. Infrastructure Impact

    • tmpfs: Reduced writer max latency by 3.9x (1443µs vs 5664µs on Mac)

Performance Thresholds (10s Test, 12 Readers)

| Scenario | Healthy Range | Warning Threshold |
| --- | --- | --- |
| Writer Median Latency | <2 µs | ≥2 µs |
| Writer p99 Latency | <60 µs | ≥60 µs |
| Reader p95 Latency | <3 µs | ≥5 µs |
| Write Throughput | <55k/s | >60k/s |
| Read Throughput | <650k/s | >700k/s |

*Calculated from 10s stress-test totals: 541,524 writes / 10s ≈ 54k writes/sec, 6.23M reads / 10s ≈ 623k reads/sec*

Conclusion

The mmap-sync library demonstrates:

  • Predictable low latency: 0.4-0.7µs median for both roles in normal operation
  • Graceful degradation: Writer p99 rises from 1.3µs (Poisson) to 59µs (Stress) yet stays bounded while throughput grows 5.4x
  • Horizontal scalability: 12 readers add <1µs to median read latency
  • Production readiness: Sustains 54k writes/sec with sub-60µs p99 latency

How to Run

```bash
# Clone and build
git clone git@github.com:tombelieber/rust_mmap_sync_latency.git
cd rust_mmap_sync_latency
cargo build --release

# Run with 12 readers for 60s (Poisson mode)
cargo r -r -- --readers 12 --duration 60

# Stress test mode (max writes)
cargo r -r -- --readers 12 --duration 60 --mode stress
```

Branches for Test Results

We maintain three dedicated branches to store benchmark outputs (CSV, plots, etc.):

# pareto
pareto_latencies.zip
pareto_latency_plot.png_combined.png
pareto_latency_plot.png_reader.png
pareto_latency_plot.png_writer.png
pareto_output.txt

# poisson
poisson_latencies.zip
poisson_latency_plot.png_combined.png
poisson_latency_plot.png_reader.png
poisson_latency_plot.png_writer.png
poisson_output.txt

# stress
stress_latencies.zip
stress_latency_plot.png_combined.png
stress_latency_plot.png_reader.png
stress_latency_plot.png_writer.png
stress_output.txt

These branches exist only for historical data and reference, keeping main development clean. To review past test artifacts or reproduce a particular scenario, check out the corresponding branch.

