Evaluate using LTO, Profile-Guided Optimization (PGO) and Post Link Optimization (PLO) #1334

@zamazan4ik

Hi!

Recently I evaluated Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to those tests, these optimizations improve performance in many cases for many applications, so I think enabling them for Symbolicator is worth trying.

I already did some benchmarks and want to share my results here. Hopefully, they will be helpful.

Test environment

  • Fedora 39
  • Linux kernel 6.5.12
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler: rustc 1.74
  • Symbolicator version: the master branch at commit ac127975e4649dc2f40b177cb98556e307aca26e (the latest at the time of testing)
  • Turbo Boost disabled (for more consistent results across runs)

Benchmark

For benchmarking, I used this WRK-based scenario for minidumps. As the minidump, I used this Linux dump. The WRK command is the same for all benchmarks and for the PGO/PLO training phases:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump

Before each WRK benchmark, I ran cargo run -p process-event -- ../tests/fixtures/linux.dmp once, as recommended.

All PGO and PLO optimizations are done with cargo-pgo (I highly recommend this tool). For the PLO phase, I use LLVM BOLT.

LTO is enabled by the following changes to the Release profile in the root Cargo.toml file:

[profile.release]
# For release builds, we do want line-only debug information to be able to symbolicate panic stack traces.
debug = 1
codegen-units = 1
lto = true

For all benchmarks, the binaries are stripped with the strip tool.

All tests were run on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course). The results are reproducible, at least on my machine.

Tricky moment with PGO dumps

For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess it is related to custom signal handling somewhere in the code. So I modified Symbolicator slightly to dump the PGO profile to disk manually. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to main.rs:

use signal_hook::{consts::SIGINT, iterator::Signals};
use std::{os::raw::c_int, thread};

extern "C" {
    // Provided by the LLVM profiling runtime, which is linked in only for
    // PGO-instrumented builds. Returns 0 on success.
    fn __llvm_profile_write_file() -> c_int;
}

fn main() {
    let mut signals = Signals::new(&[SIGINT]).unwrap();

    // Handle SIGINT on a separate thread: dump the PGO profile, then exit.
    thread::spawn(move || {
        for sig in signals.forever() {
            println!("Received signal {:?}", sig);
            unsafe { __llvm_profile_write_file(); }
            std::process::exit(0);
        }
    });

    match cli::execute() {
        Ok(()) => std::process::exit(0),
        Err(error) => {
            logging::ensure_log_error(&error);
            std::process::exit(1);
        }
    }
}

This code depends on the signal-hook crate. Please note that the __llvm_profile_write_file symbol is linked into the program only when it is built with PGO instrumentation (rustc does this automatically). Because of this, you need to disable or comment out this code during the PGO optimization phase; otherwise you get a link error.

I think there should be a better way to implement this logic, but for testing purposes it is good enough.
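
One possible way to avoid commenting the code out between phases is to gate the profile dump behind a Cargo feature. The sketch below assumes a hypothetical feature named pgo-instrument (it does not exist in Symbolicator) that would be enabled only for the instrumented build. Without the feature, the extern declaration is compiled out and the helper becomes a no-op, so the optimized build should link without the profiling runtime.

```rust
// Sketch only: `pgo-instrument` is a hypothetical Cargo feature, not an
// existing Symbolicator option. Enable it only for the instrumented build.

#[cfg(feature = "pgo-instrument")]
extern "C" {
    // Provided by the LLVM profiling runtime in instrumented builds.
    // Returns 0 on success. (Edition 2024 requires `unsafe extern "C"`.)
    fn __llvm_profile_write_file() -> std::os::raw::c_int;
}

/// Writes the PGO profile to disk; returns `true` if the write succeeded.
#[cfg(feature = "pgo-instrument")]
pub fn dump_pgo_profile() -> bool {
    // SAFETY: the symbol exists whenever the binary is built with PGO
    // instrumentation, which is the only case this feature is meant for.
    unsafe { __llvm_profile_write_file() == 0 }
}

/// Without the feature there is no profiling runtime, so nothing is written.
#[cfg(not(feature = "pgo-instrument"))]
pub fn dump_pgo_profile() -> bool {
    false
}
```

The SIGINT handler would then call dump_pgo_profile() instead of the raw symbol, and --features pgo-instrument would be passed only when building the instrumented binary.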

Results

Here I post the benchmark results for the following Symbolicator configurations:

  • Release build (the default build)
  • Release + LTO
  • Release + LTO + PGO optimized
  • Release + LTO + PGO optimized + PLO optimized (with LLVM BOLT)

Release:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.60ms   10.37ms  41.83ms   57.72%
    Req/Sec   269.77     45.57   434.00     67.50%
  80627 requests in 30.02s, 470.12MB read
Requests/sec:   2685.82
Transfer/sec:     15.66MB

Release + LTO:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.09ms    8.95ms  35.94ms   57.91%
    Req/Sec   312.10     43.03   440.00     70.03%
  93266 requests in 30.03s, 543.81MB read
Requests/sec:   3106.16
Transfer/sec:     18.11MB

Release + LTO + PGO optimized:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.71ms    7.63ms  31.15ms   57.75%
    Req/Sec   366.18     51.08   545.00     67.53%
  109422 requests in 30.02s, 638.01MB read
Requests/sec:   3644.55
Transfer/sec:     21.25MB

Release + LTO + PGO optimized + PLO optimized:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.82ms    7.14ms  30.18ms   57.86%
    Req/Sec   391.64     58.28   555.00     62.60%
  117034 requests in 30.02s, 682.40MB read
Requests/sec:   3897.93
Transfer/sec:     22.73MB

According to the tests above, throughput improves measurably over the default Release build at each step: roughly 16% with LTO, 36% with LTO + PGO, and 45% with LTO + PGO + PLO (LLVM BOLT).

Additionally, below are the results for the PGO and PLO instrumentation phases, so you can estimate the Symbolicator slowdown during instrumentation.
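
To put numbers on this, the relative throughput gain over the Release build can be computed from the Requests/sec lines above. The snippet is only a small illustration: the function and constant names are mine, and the figures are copied from the WRK runs in this issue.

```rust
/// Relative change of `value` versus `baseline`, in percent.
pub fn relative_change_percent(baseline: f64, value: f64) -> f64 {
    (value - baseline) / baseline * 100.0
}

/// Requests/sec of the default Release build (from the WRK run above).
pub const RELEASE_RPS: f64 = 2685.82;

/// Requests/sec of the optimized builds (from the WRK runs above).
pub const OPTIMIZED_RUNS: [(&str, f64); 3] = [
    ("Release + LTO", 3106.16),
    ("Release + LTO + PGO", 3644.55),
    ("Release + LTO + PGO + PLO", 3897.93),
];

/// Builds one summary line per optimized configuration.
pub fn summary_lines() -> Vec<String> {
    OPTIMIZED_RUNS
        .iter()
        .map(|(name, rps)| {
            let gain = relative_change_percent(RELEASE_RPS, *rps);
            format!("{name}: {gain:+.1}% vs Release")
        })
        .collect()
}
```

Printing summary_lines() gives gains of about +15.7%, +35.7% and +45.1% for the three configurations.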

Release + LTO + PGO instrumentation:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.10ms   10.63ms  42.48ms   57.75%
    Req/Sec   262.82     44.15   454.00     68.50%
  78545 requests in 30.02s, 457.98MB read
Requests/sec:   2616.14
Transfer/sec:     15.25MB

Release + LTO + PGO optimized + PLO instrumentation:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.72ms   12.66ms  50.54ms   57.68%
    Req/Sec   220.82     40.15   380.00     66.73%
  66012 requests in 30.02s, 384.90MB read
Requests/sec:   2198.81
Transfer/sec:     12.82MB
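
For anyone repeating these runs, the comparison can be automated by pulling the Requests/sec value out of the WRK output. The helper names below are mine. The reference build for each instrumented binary is also my assumption: Release + LTO for PGO instrumentation, and Release + LTO + PGO optimized for PLO instrumentation.

```rust
/// Extracts the value of the `Requests/sec:` summary line from WRK output.
pub fn parse_requests_per_sec(wrk_output: &str) -> Option<f64> {
    wrk_output
        .lines()
        .find_map(|line| line.trim().strip_prefix("Requests/sec:"))
        .and_then(|value| value.trim().parse::<f64>().ok())
}

/// Throughput lost by the instrumented build relative to its reference
/// build, in percent.
pub fn slowdown_percent(reference_rps: f64, instrumented_rps: f64) -> f64 {
    (reference_rps - instrumented_rps) / reference_rps * 100.0
}
```

With the figures above, slowdown_percent(3106.16, 2616.14) is about 15.8% for PGO instrumentation and slowdown_percent(3644.55, 2198.81) is about 39.7% for PLO instrumentation.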

Further steps

I can suggest the following action points:

  • Perform more PGO benchmarks on Symbolicator. If they show improvements, add a note to the documentation about the possible performance gains from PGO.
  • Provide an easier way (e.g. a build option) to build Symbolicator with PGO. This would help end users and maintainers optimize Symbolicator for their own workloads.
  • Optimize the pre-built Symbolicator binaries.
  • Evaluate other Sentry products for LTO/PGO/PLO applicability.

Here are some examples of how PGO optimization is integrated in other projects:

I have some examples of how PGO information looks in the documentation:

Regarding LLVM BOLT integration, I have the following examples:
