Skip to content

Commit 1d1c06f

Browse files
authored
Add basic support for profiling Pulley (bytecodealliance#10034)
* Add basic support for profiling Pulley This commit adds basic support for profiling the Pulley interpreter. This is partially achievable previously through the use of native profilers, but the downside of that approach is that you can find hot instructions but it's not clear in what context the hot instructions are being executed nor what functions are hot. The goal of this profiler is to show pulley bytecode and time spent in bytecode itself to better understand the shape of code around a hot instruction to identify new macro opcodes for example. The general structure of this new profiler is: * There is a compile-time feature for Pulley which is off-by-default where, when enabled, Pulley will record its current program counter into an `AtomicUsize` before each instruction. * When the CLI has `--profile pulley` Wasmtime will spawn a sampling thread in the same process which will periodically read from this `AtomicUsize` to record where the program is currently executing. * The Pulley profiler additionally records all bytecode through the use of the `ProfilingAgent` trait to ensure that the recording has access to all bytecode as well. * Samples are taken throughout the process and emitted to a `pulley-$pid.data` file. This file is then interpreted and printed by an "example" program `profiler-html.rs` in the `pulley/examples` directory. The end result is that hot functions of Pulley bytecode can be seen and instructions are annotated with how frequently they were executed. This enables finding hot loops and understanding more about the whole loop, bytecodes that were selected, and such. * Add missing source file * Check the profile-pulley feature in CI * Miscellaneous fixes for CI * Fix type-checking of `become` on nightly Rust * Fix more misc CI issues * Fix dispatch in tail loop * Update test expectations * Review comments * Fix a feature combo
1 parent 51c82b8 commit 1d1c06f

File tree

19 files changed

+909
-23
lines changed

19 files changed

+909
-23
lines changed

.github/workflows/main.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -372,6 +372,7 @@ jobs:
372372
-p wasmtime --no-default-features --features runtime,threads
373373
-p wasmtime --no-default-features --features cranelift,threads
374374
-p wasmtime --features incremental-cache
375+
-p wasmtime --features profile-pulley
375376
-p wasmtime --all-features
376377
377378
- name: wasmtime-fiber

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -440,6 +440,7 @@ winch = ["wasmtime/winch"]
440440
wmemcheck = ["wasmtime/wmemcheck"]
441441
trace-log = ["wasmtime/trace-log"]
442442
memory-protection-keys = ["wasmtime-cli-flags/memory-protection-keys"]
443+
profile-pulley = ["wasmtime/profile-pulley"]
443444

444445
# This feature, when enabled, will statically compile out all logging statements
445446
# throughout Wasmtime and its dependencies.

crates/wasmtime/Cargo.toml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -366,3 +366,8 @@ custom-virtual-memory = []
366366

367367
# Same as `custom-virtual-memory` above, but for custom signal-handling APIs.
368368
custom-native-signals = []
369+
370+
# Off-by-default support to profile the Pulley interpreter. This has a
371+
# performance hit, even when not profiling, so it's disabled by default at
372+
# compile time.
373+
profile-pulley = ['pulley', 'profiling', 'pulley-interpreter/profile']

crates/wasmtime/src/config.rs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2272,6 +2272,7 @@ impl Config {
22722272
ProfilingStrategy::JitDump => profiling_agent::new_jitdump()?,
22732273
ProfilingStrategy::VTune => profiling_agent::new_vtune()?,
22742274
ProfilingStrategy::None => profiling_agent::new_null(),
2275+
ProfilingStrategy::Pulley => profiling_agent::new_pulley()?,
22752276
})
22762277
}
22772278

@@ -2799,6 +2800,11 @@ pub enum ProfilingStrategy {
27992800

28002801
/// Collect profiling info using the "ittapi", used with `VTune` on Linux.
28012802
VTune,
2803+
2804+
/// Support for profiling Pulley, Wasmtime's interpreter. Note that enabling
2805+
/// this at runtime requires enabling the `profile-pulley` Cargo feature at
2806+
/// compile time.
2807+
Pulley,
28022808
}
28032809

28042810
/// Select how wasm backtrace detailed information is handled.

crates/wasmtime/src/profiling_agent.rs

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -51,10 +51,26 @@ cfg_if::cfg_if! {
5151
}
5252
}
5353

54+
cfg_if::cfg_if! {
55+
if #[cfg(feature = "profile-pulley")] {
56+
mod pulley;
57+
pub use pulley::new as new_pulley;
58+
} else {
59+
pub fn new_pulley() -> Result<Box<dyn ProfilingAgent>> {
60+
bail!("pulley profiling support disabled at compile time.");
61+
}
62+
}
63+
}
64+
5465
/// Common interface for profiling tools.
5566
pub trait ProfilingAgent: Send + Sync + 'static {
5667
fn register_function(&self, name: &str, code: &[u8]);
5768

69+
#[cfg(all(feature = "runtime", feature = "pulley"))]
70+
fn register_interpreter(&self, interp: &crate::vm::Interpreter) {
71+
let _ = interp;
72+
}
73+
5874
fn register_module(&self, code: &[u8], custom_name: &dyn Fn(usize) -> Option<String>) {
5975
use object::{File, Object as _, ObjectSection, ObjectSymbol, SectionKind, SymbolKind};
6076

@@ -63,7 +79,10 @@ pub trait ProfilingAgent: Send + Sync + 'static {
6379
Err(_) => return,
6480
};
6581

66-
let text = match image.sections().find(|s| s.kind() == SectionKind::Text) {
82+
let text = match image
83+
.sections()
84+
.find(|s| s.kind() == SectionKind::Text || s.name() == Ok(".text"))
85+
{
6786
Some(section) => match section.data() {
6887
Ok(data) => data,
6988
Err(_) => return,
Lines changed: 263 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,263 @@
1+
//! Basic support for emitting a `*.data` file which contains samples of pulley
2+
//! bytecode.
3+
//!
4+
//! Pulley is Wasmtime's interpreter and native profilers are not good at
5+
//! profiling bytecode interpreters because they show hot bytecode instructions
6+
//! but we're instead often interested in the shape of the bytecode itself
7+
//! around the hot instruction, for example to identify new macro-instructions
8+
//! to add to Pulley. This module serves as a means of collecting data from
9+
//! Pulley being executed in-process and serializing it to a file.
10+
//!
11+
//! The file collected here is populated by a sampling thread in-process. This
12+
//! sampling thread only collects the current program counter of any interpeters
13+
//! in the process. This does not collect stack traces at all. That means that
14+
//! this profiler is only suitable for looking at "self time" and is not
15+
//! suitable for getting a broader picture of what's going on (e.g. why
16+
//! something was called in the first place).
17+
//!
18+
//! The general design of this profiler is:
19+
//!
20+
//! * Support for this all requires a `pulley-profile` feature at compile-time
21+
//! as it's generally a perf hit to the interpreter loop.
22+
//! * Each Pulley interpreter updates an `AtomicUsize` before all instructions
23+
//! with the current PC that it's executing.
24+
//! * This module spawns a "sampling thread" which will, at some frequency,
25+
//! collect all the PCs of all interpreters in the process.
26+
//! * Once enough samples have been collected they're flushed out to a data file
27+
//! on a second thread, the "recording thread".
28+
//!
29+
//! The hope is that the sampling thread stays as steady as possible in its
30+
//! sampling rate while not hitting OOM conditions in the process or anything
31+
//! like that. The `*.data` file that's emitted is intended to be processed by
32+
//! example code in the `pulley-interpreter` crate or `pulley/examples/*.rs` in
33+
//! the Wasmtime repository.
34+
35+
use crate::prelude::*;
36+
use crate::profiling_agent::ProfilingAgent;
37+
use crate::vm::Interpreter;
38+
use pulley_interpreter::profile::{ExecutingPc, Recorder, Samples};
39+
use std::mem;
40+
use std::sync::mpsc;
41+
use std::sync::{Arc, Condvar, Mutex};
42+
use std::thread::{self, JoinHandle};
43+
use std::time::{Duration, Instant};
44+
45+
/// Implementation of `ProfilingAgent` from the Wasmtime crate.
46+
struct PulleyAgent {
47+
state: Arc<State>,
48+
49+
/// Handle to the thread performing periodic sampling. This is joined on
50+
/// `Drop` of this structure so it's not a daemon thread permanently.
51+
sampling_thread: Option<JoinHandle<()>>,
52+
53+
/// Same as the sampling thread above, but for recording data to the
54+
/// filesystem.
55+
recording_thread: Option<JoinHandle<()>>,
56+
}
57+
58+
struct State {
59+
/// Protected state about the recorder, or the file being created. This is
60+
/// accessed both from the "recording thread" as well as `Engine` threads to
61+
/// register new pulley bytecode.
62+
recorder: Mutex<Recorder>,
63+
64+
/// Protected state about sampling interpreters. This is accessed both from
65+
/// the "sampling thread" primarily but is additionally accessed from
66+
/// `Engine` threads to register new interpreters coming online.
67+
sampling: Mutex<SamplingState>,
68+
69+
/// Condition variable which is signaled when sampling should cease and
70+
/// exit. This is coupled with `Drop for PulleyAgent`.
71+
sampling_done: Condvar,
72+
73+
/// The frequency at which samples are collected. Defaults to 1000 but can
74+
/// be configured with the `PULLEY_SAMPLING_FREQ` environment variable.
75+
sampling_freq: u32,
76+
77+
/// Number of samples to buffer before flushing them to a file. Defaults to
78+
/// 20000 but can be configured with the `PULLEY_SAMPLING_FLUSH_AMT`
79+
/// environment variable.
80+
sampling_flush_amt: u32,
81+
}
82+
83+
/// State protected by a mutex in `State` above related to sampling.
84+
#[derive(Default)]
85+
struct SamplingState {
86+
/// All interpreters known to be executing. This is a list of
87+
/// pointers-to-the-current-PC which is updated whenever the interpreter
88+
/// executes an instruction.
89+
interpreters: Vec<ExecutingPc>,
90+
91+
/// Current list of samples that have been collected.
92+
samples: Samples,
93+
}
94+
95+
pub fn new() -> Result<Box<dyn ProfilingAgent>> {
96+
let pid = std::process::id();
97+
let filename = format!("./pulley-{pid}.data");
98+
let mut agent = PulleyAgent {
99+
state: Arc::new(State {
100+
recorder: Mutex::new(Recorder::new(&filename)?),
101+
sampling: Default::default(),
102+
sampling_done: Condvar::new(),
103+
sampling_freq: std::env::var("PULLEY_SAMPLING_FREQ")
104+
.ok()
105+
.and_then(|s| s.parse::<u32>().ok())
106+
.unwrap_or(1_000),
107+
sampling_flush_amt: std::env::var("PULLEY_SAMPLING_FLUSH_AMT")
108+
.ok()
109+
.and_then(|s| s.parse::<u32>().ok())
110+
.unwrap_or(20_000),
111+
}),
112+
sampling_thread: None,
113+
recording_thread: None,
114+
};
115+
116+
let (tx, rx) = mpsc::channel();
117+
let state = agent.state.clone();
118+
agent.sampling_thread = Some(thread::spawn(move || sampling_thread(&state, tx)));
119+
let state = agent.state.clone();
120+
agent.recording_thread = Some(thread::spawn(move || recording_thread(&state, rx)));
121+
122+
Ok(Box::new(agent))
123+
}
124+
125+
impl ProfilingAgent for PulleyAgent {
126+
/// New functions are registered with `Recorder` to record the exact
127+
/// bytecode so disassembly is available during profile analysis.
128+
///
129+
/// Note that this also provides the native address that code is loaded at
130+
/// so samples know what code it's within.
131+
fn register_function(&self, name: &str, code: &[u8]) {
132+
self.state
133+
.recorder
134+
.lock()
135+
.unwrap()
136+
.add_function(name, code)
137+
.expect("failed to register pulley function");
138+
}
139+
140+
/// Registers a new interpreter coming online. Interpreters, with
141+
/// `pulley-profile` enabled, store a shadow program counter updated on each
142+
/// instruction which we can read from a different thread.
143+
fn register_interpreter(&self, interpreter: &Interpreter) {
144+
let pc = interpreter.pulley().executing_pc();
145+
self.state
146+
.sampling
147+
.lock()
148+
.unwrap()
149+
.interpreters
150+
.push(pc.clone());
151+
}
152+
}
153+
154+
/// Execution of the thread responsible for sampling interpreters.
155+
///
156+
/// This thread has a few tasks:
157+
///
158+
/// * Needs to sample, at `state.sampling_freq`, the state of all known
159+
/// interpreters. Ideally this sampling is as steady as possible.
160+
/// * Needs to clean up interpeters which have been destroyed as there's
161+
/// otherwise no hook for doing so.
162+
/// * Needs to send batches of samples to the recording thread to get written to
163+
/// the filesystem.
164+
fn sampling_thread(state: &State, to_record: mpsc::Sender<Samples>) {
165+
// Calculate the `Duration` between each sample which will be in
166+
// nanoseconds. This duration is then used to create an `Instant` in time
167+
// where we'll be collecting the next sample.
168+
let between_ticks = Duration::new(0, 1_000_000_000 / state.sampling_freq);
169+
let start = Instant::now();
170+
let mut next_sample = start + between_ticks;
171+
172+
// Helper closure to send off a batch of samples to the recording thread.
173+
// Note that recording is done off-thread to ensure that the filesystem I/O
174+
// interferes as little as possible with the sampling rate here.
175+
let record = |sampling: &mut SamplingState| {
176+
if sampling.samples.num_samples() == 0 {
177+
return;
178+
}
179+
let samples = mem::take(&mut sampling.samples);
180+
to_record.send(samples).unwrap();
181+
};
182+
183+
let mut sampling = state.sampling.lock().unwrap();
184+
185+
loop {
186+
// Calculate the duration, from this current moment in time, to when the
187+
// next sample is supposed to be taken. If the next sampling time is in
188+
// the past then this won't sleep but will still check the condvar.
189+
let dur = next_sample
190+
.checked_duration_since(Instant::now())
191+
.unwrap_or(Duration::new(0, 0));
192+
193+
// Wait on `state.sampling_done`, but with the timeout we've calculated.
194+
// If this times out that means that the next sample can proceed.
195+
// Otherwise if this did not time out then it means that sampling should
196+
// cease as the profiler is being destroyed.
197+
let (guard, result) = state.sampling_done.wait_timeout(sampling, dur).unwrap();
198+
sampling = guard;
199+
if !result.timed_out() {
200+
break;
201+
}
202+
203+
// Now that we've decided to take a sample increment the next sample
204+
// time by our interval. Once we're done sampling below we'll then sleep
205+
// again up to this time.
206+
next_sample += between_ticks;
207+
208+
// Sample the state of all interpreters known. This first starts by
209+
// discarding any interpreters that are offline. Samples without a PC
210+
// are additionally discarded as it means the interpreter is inactive.
211+
//
212+
// Once enough samples have been collected they're flushed to the
213+
// recording thread.
214+
let SamplingState {
215+
interpreters,
216+
samples,
217+
} = &mut *sampling;
218+
interpreters.retain(|a| !a.is_done());
219+
for interpreter in interpreters.iter() {
220+
if let Some(pc) = interpreter.get() {
221+
samples.append(pc);
222+
}
223+
}
224+
if samples.num_samples() > state.sampling_flush_amt {
225+
record(&mut sampling);
226+
}
227+
}
228+
229+
// Send any final samples to the recording thread after the loop has exited.
230+
record(&mut sampling);
231+
}
232+
233+
/// Helper thread responsible for writing samples to the filesystem.
234+
///
235+
/// This receives samples over `to_record` and then performs the filesystem I/O
236+
/// necessary to write them out. This thread completes once `to_record` is
237+
/// closed, or when the sampling thread completes. At that time all data in the
238+
/// recorder is flushed out as well.
239+
fn recording_thread(state: &State, to_record: mpsc::Receiver<Samples>) {
240+
for mut samples in to_record {
241+
state
242+
.recorder
243+
.lock()
244+
.unwrap()
245+
.add_samples(&mut samples)
246+
.expect("failed to write samples");
247+
}
248+
249+
state.recorder.lock().unwrap().flush().unwrap();
250+
}
251+
252+
impl Drop for PulleyAgent {
253+
fn drop(&mut self) {
254+
// First notify the sampling thread that it's time to shut down and
255+
// wait for it to exit.
256+
self.state.sampling_done.notify_one();
257+
self.sampling_thread.take().unwrap().join().unwrap();
258+
259+
// Wait on the recording thread as well which should terminate once
260+
// `sampling_thread` has terminated as well.
261+
self.recording_thread.take().unwrap().join().unwrap();
262+
}
263+
}

crates/wasmtime/src/runtime/vm/interpreter.rs

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,16 +19,22 @@ pub struct Interpreter {
1919
impl Interpreter {
2020
/// Creates a new interpreter ready to interpret code.
2121
pub fn new(engine: &Engine) -> Interpreter {
22-
Interpreter {
22+
let ret = Interpreter {
2323
pulley: Box::new(Vm::with_stack(vec![0; engine.config().max_wasm_stack])),
24-
}
24+
};
25+
engine.profiler().register_interpreter(&ret);
26+
ret
2527
}
2628

2729
/// Returns the `InterpreterRef` structure which can be used to actually
2830
/// execute interpreted code.
2931
pub fn as_interpreter_ref(&mut self) -> InterpreterRef<'_> {
3032
InterpreterRef(&mut self.pulley)
3133
}
34+
35+
pub fn pulley(&self) -> &Vm {
36+
&self.pulley
37+
}
3238
}
3339

3440
/// Wrapper around `&mut pulley_interpreter::Vm` to enable compiling this to a

0 commit comments

Comments
 (0)