Hi!
Recently I have been evaluating optimizations like Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to those tests, all of these optimizations can deliver better performance for many applications, so I think trying to enable them for Symbolicator could be a good idea.
I have already done some benchmarks and want to share my results here. Hopefully they will be helpful.
Test environment
- Fedora 39
- Linux kernel 6.5.12
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler: Rustc 1.74
- Symbolicator version: the latest for now from the `master` branch, at commit ac127975e4649dc2f40b177cb98556e307aca26e
- Disabled Turbo boost (for better results consistency across runs)
Benchmark
For benchmarking purposes, I used this WRK-based scenario for Minidump. As a minidump, I use this Linux dump. The WRK command is the same for all benchmarks and for the PGO/PLO training phases:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
```

Before each WRK benchmark, I run `cargo run -p process-event -- ../tests/fixtures/linux.dmp` once, as recommended.
All PGO and PLO optimizations are done with cargo-pgo (I highly recommend this tool). For the PLO phase I use LLVM BOLT.
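For reference, the cargo-pgo workflow I followed looks roughly like this. This is a sketch based on the cargo-pgo README; the exact subcommands and flags may differ between versions, and the training workload is the WRK scenario described above:

```shell
# 1. Build the binary with PGO instrumentation
#    (needs the llvm-tools-preview rustup component).
cargo pgo build

# 2. Run the training workload against the instrumented binary
#    (the WRK minidump scenario), then stop it so profiles get dumped.

# 3. Rebuild, optimized with the collected PGO profiles.
cargo pgo optimize

# 4. Optionally add BOLT on top of PGO: build a BOLT-instrumented,
#    PGO-optimized binary...
cargo pgo bolt build --with-pgo

# 5. ...run the training workload again, then produce the final
#    PGO + BOLT optimized binary.
cargo pgo bolt optimize --with-pgo
```

Steps 2 and 5 are where the "tricky moment with PGO dumps" described below matters: the profile only gets written if the process can flush it on shutdown.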
LTO is enabled by the following changes to the `release` profile in the root `Cargo.toml` file:

```toml
[profile.release]
# For release builds, we do want line-only debug information to be able to symbolicate panic stack traces.
debug = 1
codegen-units = 1
lto = true
```
For all benchmarks, binaries are stripped with the `strip` tool.
All tests are done on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course). The results are reproducible, at least on my machine.
Tricky moment with PGO dumps
For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess this is related to custom signal handling somewhere in the code, so I modified Symbolicator slightly to dump the PGO profile manually. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to `main.rs`:
```rust
use signal_hook::{consts::SIGINT, iterator::Signals};
use std::thread;

extern "C" {
    // Provided by the compiler-rt profiling runtime; this symbol only
    // exists in PGO-instrumented builds.
    fn __llvm_profile_write_file();
}

fn main() {
    // Dump the PGO profile manually on Ctrl+C, since Symbolicator's own
    // signal handling prevents the default at-exit profile dump.
    let mut signals = Signals::new(&[SIGINT]).unwrap();
    thread::spawn(move || {
        for sig in signals.forever() {
            println!("Received signal {:?}", sig);
            unsafe { __llvm_profile_write_file() };
            std::process::exit(0);
        }
    });

    match cli::execute() {
        Ok(()) => std::process::exit(0),
        Err(error) => {
            logging::ensure_log_error(&error);
            std::process::exit(1);
        }
    }
}
```
I use the `signal_hook` dependency. Please note that the `__llvm_profile_write_file` symbol is linked into the program only when it is built with PGO instrumentation (this is done automatically by the Rustc compiler). Because of this, you need to disable or comment out this code during the PGO optimization phase, otherwise you get a link error.
There is probably a better way to implement this logic, but for the purposes of these tests it's good enough.
Results
Here I post the benchmark results for the following Symbolicator configurations:
- Release build (the default build)
- Release + LTO
- Release + LTO + PGO optimized
- Release + LTO + PGO optimized + PLO optimized (with LLVM BOLT)
Release:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.60ms   10.37ms  41.83ms   57.72%
    Req/Sec    269.77     45.57   434.00     67.50%
  80627 requests in 30.02s, 470.12MB read
Requests/sec:   2685.82
Transfer/sec:     15.66MB
```
Release + LTO:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.09ms    8.95ms  35.94ms   57.91%
    Req/Sec    312.10     43.03   440.00     70.03%
  93266 requests in 30.03s, 543.81MB read
Requests/sec:   3106.16
Transfer/sec:     18.11MB
```
Release + LTO + PGO optimized:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.71ms    7.63ms  31.15ms   57.75%
    Req/Sec    366.18     51.08   545.00     67.53%
  109422 requests in 30.02s, 638.01MB read
Requests/sec:   3644.55
Transfer/sec:     21.25MB
```
Release + LTO + PGO optimized + PLO optimized:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.82ms    7.14ms  30.18ms   57.86%
    Req/Sec    391.64     58.28   555.00     62.60%
  117034 requests in 30.02s, 682.40MB read
Requests/sec:   3897.93
Transfer/sec:     22.73MB
```
According to the tests above, I see measurable improvements from each step: roughly +16% requests/sec from LTO, a further +17% from PGO, and another +7% from PLO with LLVM BOLT, for about +45% overall compared to the default release build.
Additionally, below are the results for the PGO and PLO instrumentation phases, so you can estimate Symbolicator's slowdown during instrumentation.
Release + LTO + PGO instrumentation:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.10ms   10.63ms  42.48ms   57.75%
    Req/Sec    262.82     44.15   454.00     68.50%
  78545 requests in 30.02s, 457.98MB read
Requests/sec:   2616.14
Transfer/sec:     15.25MB
```
Release + LTO + PGO optimized + PLO instrumentation:
```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.72ms   12.66ms  50.54ms   57.68%
    Req/Sec    220.82     40.15   380.00     66.73%
  66012 requests in 30.02s, 384.90MB read
Requests/sec:   2198.81
Transfer/sec:     12.82MB
```
Further steps
I can suggest the following action points:
- Perform more PGO benchmarks on Symbolicator. If they show improvements, add a note to the documentation about the possible performance gains from building Symbolicator with PGO.
- Providing an easier way (e.g. a build option) to build Symbolicator with PGO would help end users and maintainers optimize Symbolicator for their own workloads.
- Optimize the pre-built Symbolicator binaries.
- Evaluate other Sentry products for LTO/PGO/PLO applicability.
Here are some examples of how PGO optimization is integrated in other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
Here are some examples of how PGO is covered in other projects' documentation:
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
- GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- Clang:
- tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md
Regarding LLVM BOLT integration, I have the following examples:
- Rustc:
- CPython: GitHub PR
- YDB: GitHub comment
- Clang:
- LDC: GitHub comment
- HHVM, Proxygen and others: Facebook paper
- NodeJS: Blog
- Chromium: Blog
- MySQL, MongoDB, memcached, Verilator: Paper