
Artifact for the Submission of "Paralegal" to OSDI 2025

Welcome to the software artifact for the 2025 submission of the static Rust privacy and security analyzer "Paralegal" (preprint). This repository ties together the analyzer, benchmark suite and plotting utilities for reproducing the graphs shown in the paper.

This repository is intended for artifact evaluators. If you are interested in using the software (or benchmarker), you should instead visit the standalone repos for the "Paralegal" analyzer or the benchmarker, which have the latest updates.

If you are receiving this artifact as a docker container (for example from Zenodo) you should start with Running the Docker Image.

If you are instead intending to use this repository from source, go to the Setup and Installation section before cloning the repo.

If you want to learn more about the organization of this repository see Organization.

Platform Compatibility

The analyzer has been tested on Ubuntu and OSX, but should in theory run on any Linux and possibly Windows too, though no guarantees are made.

The benchmarker relies on UNIX features (process groups) and can only run on a platform that supports them, e.g. Linux and OSX.

The docker image is built for the x86_64 architecture and will (in its entirety) only work on an x86_64 machine. The analyzer and benchmarker seem to run fine even on ARM (tested on OSX), but some of the case study dependencies do not compile.

Running the Docker Image

You will require bzip2 and docker on the command line. The following assumes the file is named paralegal-osdi25-docker.tar.bz2.

$ bunzip2 -kc paralegal-osdi25-docker.tar.bz2 | docker image load

Verify it has been loaded

$ docker image list
REPOSITORY                          TAG               IMAGE ID       CREATED         SIZE
paralegal                           osdi25-artifact   627f3552e43f   2 hours ago     11.5GB

Then you can start and connect to the image with the following commands. Note that the --mount option and the mkdir are used here to establish a way to transfer data from the container to your machine; you will use this later to view the plots.

mkdir output
docker run -ti --mount type=bind,src="$(pwd)/output",dst=/home/aec/artifact/output paralegal:osdi25-artifact

Setup and Installation

If this artifact was provided to you as a docker container, you may skip this section.

Installation commands for packages in this section are going to assume you are using Ubuntu Linux. Adjust accordingly for your OS.

This artifact leverages git submodules so you must clone with --recursive or run git submodule update --init --recursive after cloning.

$ git clone git@github.com:brownsys/paralegal-osdi-2025-artifact --recursive
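
If you have already cloned without --recursive, you can fetch the submodules afterwards (this assumes the default clone directory name):

$ cd paralegal-osdi-2025-artifact
$ git submodule update --init --recursive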

You should have rustup installed on your system to handle Rust toolchain management. This is necessary because the analyzer builds against a specific Rust nightly version (see paralegal/rust-toolchain.toml). For Unix you can run

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Sets the environment
$ . "$HOME/.cargo/env"

Rustup will install the requisite toolchains automatically. You can download the one necessary for the analyzer and verify that the rustup installation was successful by running

$ (cd paralegal && rustc --version)
...
rustc 1.74.0-nightly (58eefc33a 2023-08-24)

Compiling (and analyzing) the case study applications also requires that you have git, pkg-config, libssl, lz4, and clang installed.

sudo apt install clang libclang-dev liblz4-dev libssl-dev pkg-config git

You will also require python 3 and pip to plot the graphs. Please follow the installation instructions for your platform.

sudo apt install python3 python3-pip

To install the dependencies for the plotting script, use the following command. You may also wish to use a virtual Python environment (see the example below).

$ python3 -m pip install -r plotting/requirements.txt
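
For example, to install into a dedicated virtual environment instead (a sketch; the environment name .venv is arbitrary, and on Ubuntu you may first need sudo apt install python3-venv):

$ python3 -m venv .venv
$ . .venv/bin/activate
$ python3 -m pip install -r plotting/requirements.txt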

If you wish to reproduce our CodeQL results you must install CodeQL too. The results in our paper were obtained with 2.19.3. Download the bundle for your platform from the releases page and unzip the archive.

wget https://github.com/github/codeql-action/releases/download/codeql-bundle-v2.19.3/codeql-bundle-linux64.tar.gz
tar -xf codeql-bundle-linux64.tar.gz

You need to ensure the executable can be found by our comparison benchmark runner. You may add it to your path (command below) or provide the path directly to the runner via the --codeql-command flag.

export PATH="$PATH:$(realpath codeql)"
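
You can verify that the executable is found, for example with:

codeql version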

In addition, running CodeQL requires that you have a C++ compiler such as gcc or clang, as well as cmake, installed.

sudo apt install cmake

This next set of instructions only applies if you are operating on Linux.

Retrieve the system root for the Rust toolchain; example output is shown below:

(cd paralegal && rustc --print sysroot)
/home/justus/.rustup/toolchains/nightly-2023-08-25-x86_64-unknown-linux-gnu

You must edit any benchmark configuration (e.g. paralegal-bench/bconf/bench-config.toml) you run. Specifically it will have an entry like

[app-config.hyperswitch.env]
LD_LIBRARY_PATH = "TOOLCHAIN/lib"

Replace the TOOLCHAIN part with the sysroot path you retrieved.

[app-config.hyperswitch.env]
LD_LIBRARY_PATH = "/home/justus/.rustup/toolchains/nightly-2023-08-25-x86_64-unknown-linux-gnu/lib"

For the curious: this is a workaround to deal with cases where a build script calls rustc to get metadata information. This actually ends up calling our analyzer again (which wraps around rustc), which dynamically links against libLLVM, but for some reason in this nested call the linker cannot find this library, so we provide the path here.

Step-by-step for Reproducing Results

You can run all performance benchmarks with

$ (cd paralegal-bench && cargo run --bin griswold --locked --release -- bconf/bench-config.toml)

This will take about 90 minutes.

Performance benchmarks are controlled by the configuration (e.g. bconf/bench-config.toml).

In the paper we run these experiments with 10 repetitions to ensure stable results. This option is commented out in the configuration provided to you, as the numbers tend to be similar even for just one run and this makes the overall runtime almost 10 times lower. If you so desire, you may uncomment the option at the top of the file.

The output from the benchmark is written to paralegal-bench/results. For more information on the format see Performance Benchmarker Output Format. Once the benchmarks have finished, generate the plots by running plotting/plot.py. If you are on the docker image you will likely want to redirect the output to the mounted output directory as follows:

$ plotting/plot.py -o output

You will find the created plots in the directory you set with -o or, if unset, in the plotting directory. For explanations of each graph/table, see Plots. The plotting script also supports a number of command line parameters; run it with --help to learn more. You will find the reference plots we generated from our experiment run in plotting/reference. On docker you may wish to copy the reference plots to output as well for viewing, with cp -r plotting/reference output/reference.

To check our CodeQL comparison, you run a coordination program similar to griswold. For us this run took about 10 minutes.

$ cd codeql-experimentation
$ (cd runner && cargo build --release --locked)
$ runner/target/release/runner --keep-intermediates eval-config.toml

If you did not add the codeql command to your PATH, you should invoke the runner as

$ runner/target/release/runner --keep-intermediates eval-config.toml --codeql-path /path/to/codeql/executable

The runner will fail if any build or codeql analysis fails. It will also issue an "error" log message for every policy where the output diverges from the expected output, stating which policy failed, what the expected output was, and what the actual output was.

The actual outputs from CodeQL depend on the C compiler and platform so it is possible to observe spurious failures of the runner. In such cases inspect the outputs manually to verify that the failure is genuine. The Explanations of Expected Outputs section explains each policy and expected results in detail and should be helpful for this.

Plots

General Note: Since the acceptance of the paper we've made improvements to Paralegal that make PDG construction substantially faster. As a result, Paralegal will now by default analyze all dependencies of a crate, whereas for the submission version we restricted the set of crates analyzed for every application and approximated the rest using Paralegal's modular approximation. As a result of these improvements, we are changing the evaluation setup for the final paper to (a) illustrate the cost of analyzing only the crates in the workspace vs. all crates; (b) use the "all crates" configuration, which has better soundness as it analyzes more code, in many experiments; and (c) illustrate that Paralegal's adaptive inlining optimization is critical for performance when running over all dependencies.

This section explains the changes we made to each plot or claim made in the paper, and how to check our results. We provide the new plots that we will put in the final paper, as well as the old versions of the plots or a way to produce analogous results. Some of the numbers have changed since the submission version of Paralegal (always for the better, since PDG construction is now faster); we note and explain this where relevant.

ide_ci_plot(.png)

Workspace vs all crates plot

The right-hand side of this graph is an updated version of Figure 9 in the paper. The left-hand side is new and represents the mentioned "all crates" configuration. We will change the description in the paper to explain that we expect this "all crates" setup to be the one users run in CI, where soundness is the chief concern and slightly longer runtimes are acceptable. We believe the "Workspace Crates Only" setup is more suited for interactive use, especially when only run on single controllers (see per_controller_plot).

per_controller_plot(.png)

Per Controller Runtime Plot

This plot is an updated version of Figure 10 in the paper. Runtimes have gotten a bit lower, due to optimizations. This uses the "Workspace Crates Only" setup and shows combined runtime (PDG generation and policy checking).

k_depth_plot(.png)

Adaptive Depth Impact Plot

This plot is new and set to replace Figure 11 in the paper. The purpose of this plot remains the same. It quantifies the impact of using our marker-guided "adaptive inlining" optimization. The graph compares total runtime (PDG generation + policy check) for Paralegal when using "adaptive inlining" vs a fixed-k limit. Both approaches limit the depth of function call stacks that will be represented in the PDG. Both cases use the "all crates" setup.

Each application uses a different, optimal fixed k. This corresponds to having an oracle which selects a k such that the graph is minimal but includes all policy-relevant code. This ensures soundness, low false positives, and the best possible performance achievable with a fixed-k analyzer. In general a fixed-k approach does not have such a precise oracle. It is intended as the best-faith representation of a fixed-k approach and to show that even in this idealized case, "adaptive inlining" still offers substantial improvements.

We concretely use our adaptive approximation setup as the oracle, picking for each application the largest k to which it explored the graph.

One might also suggest using k = infinity, but as you can infer from the timeouts even in this case, most of those runs would not terminate. Hence we pick a smaller, optimal k.

Note on the Plume result: We run both the buggy and the fixed version of the app. The buggy one finished in the time shown; the fixed one timed out, hence both the bar and the timeout marker.

old_adaptive_plot(.png)

Prior Adaptive Depth plot

This plot is a recreation of Figure 11 from the paper with the new numbers/optimizations. It is provided for transparency; it will be removed from the paper.

The setup is similar to k_depth_plot, except that here k = infinity and we only run on the "Workspace Crates", massively reducing the amount of code that needs to be analyzed, at the cost of soundness.

This graph shows that "Adaptive Inlining" is still generally faster, though the gains are less pronounced because we have made the part of the system where adaptive inlining saves work much more efficient. However, as the k_depth_plot shows, adaptive inlining is crucial when analyzing across all dependencies.

dependency_times(.txt)

Table

The graphs shown so far do not include the time needed to compile the dependencies before running the PDG generation. This is only needed once (for each dependency change). In the paper we plan to add a sentence to Section 7.4 about the overhead Paralegal introduces to this dependency compilation, because we persist the MIR bodies for each dependency in addition to the compilation. The claim will be similar to:

As a rustc plugin Paralegal must compile the dependencies first before it can generate a PDG. In addition Paralegal dumps the MIR code for each function in the dependencies on top of compilation. This creates at most a 4% overhead over just compiling the dependencies.

The substantiating column in the table is "Dump Time % Share", which is calculated by dividing "Dep. Dump CPU Time" (CPU time spent persisting MIR) by "Dep. Comp. CPU Time" (CPU time spent on compiling dependencies); for example, 4 CPU-seconds of dumping on top of 100 CPU-seconds of dependency compilation yields a 4% share.

"Wall Clock Dependency Time" is provided for convenience, showing actual time used, when dependency compilation is parallelized.

atomic_data_locs(.txt)

Table

This table shows the calculated numbers we quote in Section 7.3. In the second table we quote the numbers from the "LoC Workspace" column, as this represents the lines of code that could cause a marker change. We plan to add the "Mean #LoC changed" statistic from "LoC Including Deps" to the paper with something like

On average, Paralegal’s analysis touches 907 lines (min: 411, max: 1196). Within the 936 functional commits, 66 commits modify one or more analyzed lines (60 lines change on average). These numbers reflect only the lines of code in the workspace, since only those can require adjustments to the markers. Over the entire code base Paralegal’s analysis actually touches 22k lines; 84 commits perform changes to those, with a mean of 1165 LoC changed per commit.

CodeQL Comparison

The codeql-experimentation directory contains the code for our comparison with CodeQL.

The runner program provides a convenient way to run the specific examples we mention in the paper. Effectively it runs a sequence of shell (build) commands, then compares the output from codeql to the examples in expected. Similar to griswold, each run of this tool creates a new output directory results/<timestamp>. There you'll find stdout and stderr from all invoked commands. Furthermore, if you run it with --keep-temporaries you can see the captured codeql output in results/<timestamp>/tmp.

If you want to run the examples manually, do the following. Take the "Plume" example, which also includes the third-party library "plib"; we use the "plume-data-deletion.ql" policy.

cd cpp/plume
# Build the third-party library plib first
cd plib
mkdir build
cd build
cmake ..
cmake --build .
# Return to cpp/plume
cd ../..
# Create the CodeQL database for the application and run the policy query
codeql database create --overwrite qdb --language cpp
codeql query run -d qdb ../../real-world-policies/plume-data-deletion.ql

And now compare that output to expected/plume-data-deletion.ql.

The actual outputs from CodeQL depend on the C compiler and platform so it is possible to observe spurious failures of the runner or divergence in the output. In such cases inspect the outputs manually to verify that the failure is genuine. The following section explains each policy and expected results in detail and should be helpful for this.

Explanations of Expected Outputs

  • Plume Data Deletion Policy: This policy matches flows from user data to deletion sinks. We tried it for three types of user data, and as such, if the output contains three rows fetching different types of data and three calls to an appropriate delete sink, we can call this policy succeeding.
  • mCaptcha Limited Collection Policy: The policy checks for the presence of a function call that authorizes the collection of data and a control flow influence of its return value on the collection. If the CodeQL output contains that function call, then this policy recognizes that the code does not contain a violation.
  • mCaptcha Deletion Policy: The output is a tuple showing where the private data is retrieved in the delete controller and where the delete method is called on it. Since we only test one type of private data, if the CodeQL output contains at least one row, we can call the policy succeeding.
  • Lemmy Instance Policy: The policy checks whether there exists a retrieval of data connected to an "instance" without a prior check that the requesting user is banned or deleted from that instance. If the CodeQL output is empty, it would recognize our code as non-violating. The actual output contains two rows, demonstrating that it works for the simple case (the "perform" entrypoint) but not for the async ("perform_async") entrypoint. It also shows the call in line 118, because we have not yet implemented checking only function bodies reachable from the entrypoint (that one is our fault).
  • Atomic Safe Writes Policy: This policy fragment represents the premise of the full policy, finding a write to a specific type of data. The CodeQL output should show this data flow, but it doesn't. This means a full policy would be vacuously succeeding (false negative).
  • Websubmit Deletion Policy: The output is a tuple showing where the private data is retrieved in the delete controller and where the delete method is called on it. We test four types of private data and therefore if the CodeQL output contains four rows with those types mentioned it would be succeeding. The output we observe is empty.
  • Websubmit Disclosure Policy: This policy fragment represents the premise of the full policy, finding sensitive data that is written to an external sink. The CodeQL output should show at least one flow but doesn't. This means a full policy would be vacuously succeeding (false negative).
  • Websubmit Scoped Storage Policy: This auxiliary policy to data deletion wants to check that user data is associated with an appropriate identifier in storage. The policy as written tests the premise that sensitive data is being stored. The CodeQL output should show this flow but doesn't. This means a full policy would be vacuously succeeding (false negative).
  • Freedit Expiration Policy: This policy fragment is part of the conclusion of the full policy, checking that the output from a call to retrieve the time (system_clock::now) reaches a conditional check to guard the deletion of expired data. The CodeQL output should show this flow but doesn't. This means a full policy would be vacuously failing (false positive).
  • Freedit Storage Policy: This is an auxiliary policy to the expiration check that asserts that user records must be stored with a timestamp. The file being run here is a policy fragment that tests the premise of the policy, finding the storage calls for user records. The CodeQL output should show at least one such call but doesn't. This means a full policy would be vacuously succeeding (false negative).

Organization

Make sure you've fetched the submodules by cloning with --recursive or by running git submodule update --init --recursive.

The artifact is organized as follows:

  • The paralegal directory contains the source code for the analyzer. The most important subdirectories are

    • crates/paralegal-flow is the PDG constructor. It wraps around cargo and rustc, functioning as a compiler plugin. It can be installed via cargo install --locked --path crates/paralegal-flow and run via cargo paralegal-flow in the directory of a target rust project to analyze. It also accepts a --help flag to show command line configuration options.
    • crates/paralegal-policy contains the low-level Rust API for policy enforcement, which is used by the compiler (paralegal-compiler).
    • crates/paralegal is the annotation library that a target rust project should link against to add inline annotations, e.g. #[paralegal::analyze] or #[paralegal::marker(...)].

    For more documentation on the analyzer see the online documentation.

  • The paralegal-bench directory contains the performance benchmark coordinator and source code for our use cases.

    • bconf contains configuration files for various benchmark runs. The most important one is bench-config.toml, which is used to produce all results we present in the paper.

    • griswold is the benchmark driver/coordinator. You can run it via cargo run --bin griswold. It takes as input a run configuration (for example bconf/bench-config.toml). It writes outputs to the results directory. See the --help option or the Performance Benchmarker Output Format section for more details on the structure of the results.

    • case-studies contains a copy of the source code for our case studies. These correspond to the commits listed in Case Study Versions, with some small changes applied, as explained in the paper appendix.

    • paralegal-compiler contains the source code for the compiler of the high-level policy language. Can be run via cargo run -p paralegal-compiler --release.

      • The policies used in the paper and corresponding to our case studies are in the policies subdirectory.
  • The plotting directory contains python scripts to plot the graphs. Install the dependencies with python3 -m pip install -r plotting/requirements.txt.

  • The codeql-experimentation directory contains our C++ translations of applications and the CodeQL translations of our policies.

    • cpp: C++ source code for examples we tried with CodeQL, including the source for our application translations
    • runner: the coordination program for replicating our CodeQL findings
    • eval-config.toml: an index file specifying which policies should be run on which application; input to the runner
    • real-world-policies: the ql source code for our policy translations
    • expected: CodeQL output as we observed it
    • policy-coding: raw data of how we labeled sections in the CodeQL policies

Performance Benchmarker Output Format

This section explains the output structure created by the griswold benchmark coordinator.

Results are written to the --result-path argument (defaults to "results"). Each time you call griswold it creates a new set of directories. They all have the format <timestamp>-<purpose> with the following purposes:

  • logs: stdout and stderr from the PDG generation (called compile) and combined output from the policy

  • pp: the source code that was involved in the analysis, i.e. the lines of code actually visited. This is used in the roll-forward experiment.

  • run: everything considered result data, which consists of the following files (see the inspection example after this list):

    • results.csv: incrementally written statistics and results for each run.

    • controllers.csv: incrementally written statistics about individual controllers.

      Multiple such statistics are written for a single run. The run_id field tells you which run each row belongs to.

    • sys.toml: information about the system that this experiment was run on.

    • bench-config.toml: a copy of the configuration that was input to this run
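
For example, to glance at the per-run statistics of the most recent run (a sketch; this assumes you ran griswold from the paralegal-bench directory as above, and you must substitute the actual timestamped directory name):

ls -t paralegal-bench/results
column -s, -t < paralegal-bench/results/<timestamp>-run/results.csv | less -S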

Case Study Versions

Application Source Repo Commit
Atomic Data https://github.com/atomicdata-dev/atomic-server 46a503adbfc52678c97e52f6e8cfaf541aa6492d
Contile https://github.com/mozilla-services/contile 2bdf4072925d199c1b75d1f76f656c34325e5e8a
Freedit https://github.com/freedit-org/freedit b25e3781a8f46506046bb92f9bfc1ce85a9c9b6d
Hyperswitch https://github.com/juspay/hyperswitch 6428d07f983026245159de4147b62bc0fc018165
Lemmy https://github.com/LemmyNet/lemmy b78826c2c80567192b4e2ce5f8714a094299be04
mCaptcha https://github.com/mCaptcha/mCaptcha/ 9922c233220751e9611b42414ecd82fcde757480
Plume https://github.com/Plume-org/Plume 613ccbcd945618cce624725aee3e2997cbf6de38
Websubmit https://github.com/JustusAdam/beavered-websubmit 4b4713845494b14e92b01c05ec32e7d429eb524f
