Hi!
Recently I have been evaluating optimizations like Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to those tests, all of these optimizations can deliver better performance for many applications, so I think trying to enable them for Symbolicator could be a good idea.
I have already done some benchmarks and want to share my results here. Hopefully they will be helpful.
Test environment
- Fedora 39
- Linux kernel 6.5.12
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler: Rustc 1.74
- Symbolicator version: the latest for now from the `master` branch, at commit ac127975e4649dc2f40b177cb98556e307aca26e
- Disabled Turbo boost (for better results consistency across runs)
Benchmark
For benchmarking purposes, I used this WRK-based scenario for Minidump. As a minidump, I use this Linux dump. The WRK command is the same for all benchmarks and for the PGO/PLO training phases:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
```

Before each WRK benchmark, I run `cargo run -p process-event -- ../tests/fixtures/linux.dmp` once, as recommended.
All PGO and PLO optimizations are done with cargo-pgo (I highly recommend this tool). For the PLO phase I use LLVM BOLT.
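For reference, the cargo-pgo workflow I followed looks roughly like this. This is a sketch based on the cargo-pgo README; the exact subcommands and flags may differ between versions, and the training workload is the WRK scenario described above:

```shell
# 1. Build the binary with PGO instrumentation
#    (needs the llvm-tools-preview rustup component).
cargo pgo build

# 2. Run the training workload against the instrumented binary
#    (the WRK minidump scenario), then stop it so profiles get dumped.

# 3. Rebuild, optimized with the collected PGO profiles.
cargo pgo optimize

# 4. Optionally add BOLT on top of PGO: build a BOLT-instrumented,
#    PGO-optimized binary...
cargo pgo bolt build --with-pgo

# 5. ...run the training workload again, then produce the final
#    PGO + BOLT optimized binary.
cargo pgo bolt optimize --with-pgo
```

Steps 2 and 5 are where the "tricky moment with PGO dumps" described below matters: the profile only gets written if the process can flush it on shutdown.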
LTO is enabled by the following changes to the `release` profile in the root `Cargo.toml` file:

```toml
[profile.release]
# For release builds, we do want line-only debug information to be able to symbolicate panic stack traces.
debug = 1
codegen-units = 1
lto = true
```
For all benchmarks, binaries are stripped with the `strip` tool.
All tests are done on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course). The results are reproducible, at least on my machine.
Tricky moment with PGO dumps
For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess this is related to custom signal handling somewhere in the code, so I modified Symbolicator slightly to dump the PGO profile manually. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to `main.rs`:
```rust
use signal_hook::{consts::SIGINT, iterator::Signals};
use std::thread;

extern "C" {
    // Provided by the compiler-rt profiling runtime; this symbol only
    // exists in PGO-instrumented builds.
    fn __llvm_profile_write_file();
}

fn main() {
    // Dump the PGO profile manually on Ctrl+C, since Symbolicator's own
    // signal handling prevents the default at-exit profile dump.
    let mut signals = Signals::new(&[SIGINT]).unwrap();
    thread::spawn(move || {
        for sig in signals.forever() {
            println!("Received signal {:?}", sig);
            unsafe { __llvm_profile_write_file() };
            std::process::exit(0);
        }
    });

    match cli::execute() {
        Ok(()) => std::process::exit(0),
        Err(error) => {
            logging::ensure_log_error(&error);
            std::process::exit(1);
        }
    }
}
```
I use the `signal_hook` dependency. Please note that the `__llvm_profile_write_file` symbol is linked into the program only when it is built with PGO instrumentation (this is done automatically by the Rustc compiler). Because of this, you need to disable or comment out this code during the PGO optimization phase, otherwise you get a link error.
There is probably a better way to implement this logic, but for the purposes of these tests it's good enough.
Results
Here I post the benchmark results for the following Symbolicator configurations:
- Release build (the default build)
- Release + LTO
- Release + LTO + PGO optimized
- Release + LTO + PGO optimized + PLO optimized (with LLVM BOLT)
Release:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.60ms   10.37ms  41.83ms   57.72%
    Req/Sec    269.77     45.57   434.00     67.50%
  80627 requests in 30.02s, 470.12MB read
Requests/sec:   2685.82
Transfer/sec:     15.66MB
```
Release + LTO:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.09ms    8.95ms  35.94ms   57.91%
    Req/Sec    312.10     43.03   440.00     70.03%
  93266 requests in 30.03s, 543.81MB read
Requests/sec:   3106.16
Transfer/sec:     18.11MB
```
Release + LTO + PGO optimized:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.71ms    7.63ms  31.15ms   57.75%
    Req/Sec    366.18     51.08   545.00     67.53%
  109422 requests in 30.02s, 638.01MB read
Requests/sec:   3644.55
Transfer/sec:     21.25MB
```
Release + LTO + PGO optimized + PLO optimized:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.82ms    7.14ms  30.18ms   57.86%
    Req/Sec    391.64     58.28   555.00     62.60%
  117034 requests in 30.02s, 682.40MB read
Requests/sec:   3897.93
Transfer/sec:     22.73MB
```
According to the tests above, I see measurable improvements from each step: roughly +16% requests/sec from LTO, a further +17% from PGO, and another +7% from PLO with LLVM BOLT, for about +45% overall compared to the default release build.
Additionally, below are the results for the PGO and PLO instrumentation phases, so you can estimate Symbolicator's slowdown during instrumentation.
Release + LTO + PGO instrumentation:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.10ms   10.63ms  42.48ms   57.75%
    Req/Sec    262.82     44.15   454.00     68.50%
  78545 requests in 30.02s, 457.98MB read
Requests/sec:   2616.14
Transfer/sec:     15.25MB
```
Release + LTO + PGO optimized + PLO instrumentation:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.72ms   12.66ms  50.54ms   57.68%
    Req/Sec    220.82     40.15   380.00     66.73%
  66012 requests in 30.02s, 384.90MB read
Requests/sec:   2198.81
Transfer/sec:     12.82MB
```
Further steps
I can suggest the following action points:
- Perform more PGO benchmarks on Symbolicator. If they show improvements, add a note to the documentation about the possible performance gains from building Symbolicator with PGO.
- Providing an easier way (e.g. a build option) to build Symbolicator with PGO would help end users and maintainers optimize Symbolicator for their own workloads.
- Optimize the pre-built Symbolicator binaries.
- Evaluate other Sentry products for LTO/PGO/PLO applicability.
Here are some examples of how PGO optimization is integrated in other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
Here are some examples of how PGO is covered in other projects' documentation:
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
- GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- Clang:
- tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md
Regarding LLVM BOLT integration, I have the following examples:
- Rustc:
- CPython: GitHub PR
- YDB: GitHub comment
- Clang:
- LDC: GitHub comment
- HHVM, Proxygen and others: Facebook paper
- NodeJS: Blog
- Chromium: Blog
- MySQL, MongoDB, memcached, Verilator: Paper