kimchi: Improve memory usage and parallelism of expr.evaluations #3127


Open: wants to merge 30 commits from fizzixnerd/expr-bench into master

Conversation

Contributor

@Fizzixnerd Fizzixnerd commented Apr 1, 2025

I will be rewriting history in the coming days to clean it up, so that it makes sense from a "story" perspective when reviewing this PR.

I have a couple of checkboxes for reviewers. If they are for you, please feel free to edit and check them (if GitHub allows you to do that); otherwise, leave a comment specifically mentioning the checkbox you are approving and I will check it for you.

Goals

Reduce the memory usage of expr.evaluations without negatively impacting proving speed.

Testing

Runtime

cargo bench expr and cargo test expr are your go-to commands. The second ensures the original implementation matches the new one by calling e.evaluations_iter(&env).collect::<Vec<_>>() on the new iterator interface and comparing the result to the original e.evaluations(&env), using the main Kimchi expression. I have only observed memory usage from top so far, but I will add a proper massif analysis in the coming days, as per @marcbeunardeau88's request.

The first may take ~30-60 minutes to do a full run of 100 samples for each version; if you don't have time for that, a rough estimate can be had by running a single sample and dividing. I am including screenshots of my runs below.
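For concreteness, here is a minimal sketch of the equivalence check that cargo test expr performs, as described above (`setup_expr_and_env` is a hypothetical stand-in for the test's actual setup of the expression and environment; the real test may be structured differently):

```rust
#[test]
fn evaluations_iter_matches_evaluations() {
    // Hypothetical setup helper: builds the main Kimchi expression `e`
    // and its evaluation environment `env`, as the real test does.
    let (e, env) = setup_expr_and_env();

    // Old implementation: materializes all evaluations at once.
    let evals_old = e.evaluations(&env);

    // New implementation: yields evaluations lazily through an iterator,
    // collected here only so the two results can be compared.
    let evals_new = e.evaluations_iter(&env).collect::<Vec<_>>();

    assert_eq!(evals_old, evals_new);
}
```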

Memory

Memory is a little trickier. First, run cargo bench --bench expr; at the start it will print the name of the binary it is going to run, something like target/release/deps/expr-19c2d0431de8290d. Ctrl-C to stop it, then run valgrind --tool=massif target/release/deps/expr-19c2d0431de8290d expr_evals_vec for the old implementation. This should finish in a couple of minutes.

Then run valgrind --tool=massif target/release/deps/expr-19c2d0431de8290d expr_evals_par for the new one. This will take roughly 10x-15x the time of the old implementation, because massif suppresses parallelism.

Runtime Results

For the iterator-based runs, runtime-wise, on a circa-2022 10-core (hyperthreaded) Core i9 laptop:

[screenshot: criterion runtime results, new iterator-based implementation]

(If you do plan to benchmark, close Steam especially. I had it open at first, and those are the red dots near the start at the bottom. Oops! I reran, and the blue dots are the ones used for analysis.)

For the original implementation, on the same computer:

[screenshot: criterion runtime results, original implementation]

Runtime Analysis

The mean runtime, wallclock-wise, of the new implementation is comparable to the old.

CPU-time-wise, it is enormously in favour of the old implementation.

The new implementation saturates the cores of my machine; the old one does not. This has three important implications: runtime could get worse on less capable machines; specific WASM testing for rayon (cc our WASM expert @hattyhattington17) is required before merging into o1js; and runtime could get better on more capable machines.
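For intuition on why core count dominates here, the following sketch shows the general shape of a rayon-style per-row parallel evaluation. It is an illustration consistent with the saturation behaviour described above, not the PR's actual code; all names and signatures are hypothetical:

```rust
use rayon::prelude::*;

// Evaluating an expression independently at each row of the domain is
// embarrassingly parallel, so a parallel map over rows will happily
// saturate every available core.
fn evaluate_rows<F, E>(
    domain_size: usize,
    env: &E,
    eval_at_row: impl Fn(&E, usize) -> F + Sync,
) -> Vec<F>
where
    F: Send,
    E: Sync,
{
    (0..domain_size)
        .into_par_iter()
        .map(|row| eval_at_row(env, row))
        .collect()
}
```

On machines with fewer cores, rayon simply sizes its thread pool down, which is why wallclock time on less capable hardware is the number to watch.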

I recommend someone check the performance characteristics across different CPUs (Macs, etc.) more carefully than I have been able to. Specifically, if you have fewer than 10 cores, an M1+ Mac, or an ARM-based computer, I would be interested in hearing from you. If the reviewers desire, I can test these changes on an old server I have with 72 CPU cores, and on my modern desktop (which I think has around 12).


  • @mrmr1993 would like someone (other than @Fizzixnerd) to test on a Mac and less capable computers before shipping this in o1js
  • @marcbeunardeau88 would like someone (other than @Fizzixnerd) to test on a Mac and less capable computers before shipping this in o1js

Memory Results

MASSIF RESULTS

I used massif-visualizer to make these graphs.

First, the old implementation:

[massif graph: old implementation]

Next the new:

[massif graph: new implementation]

Note that the new implementation took a lot longer to run, seemingly because massif prevents parallelism.

OLD TOP RESULTS

Peak memory usage from top for the old implementation was ~11-12% of my memory (32 GB). This number started low and grew until the evaluations were returned. Peak memory usage from top for the new implementation was ~1.6% of my memory, and stayed constant throughout the calculation. (As a cross-check: ~11-12% of 32 GB is ~3.5-3.8 GB and 1.6% is ~0.5 GB, consistent with the massif peaks below.)

For comparison, Discord uses ~1.8% for me!


Memory Analysis

MASSIF RESULTS

peak old: 3.6 GB
peak new: 0.53 GB

new / old = 0.147

Therefore: peak usage is approximately 15% of what it was originally.


OLD TOP ANALYSIS

Peak memory usage is enormously in favour of the new implementation, using just ~13% of the memory of the original, and without massive spikes in allocation.

Contributor Author

@Fizzixnerd Fizzixnerd left a comment


Initial self-review.

@@ -582,11 +582,11 @@ pub enum FeatureFlag {

impl FeatureFlag {
    fn is_enabled(&self) -> bool {
-        todo!("Handle features")
+        true
Contributor Author


@mrmr1993 I don't think I ever got an answer about what to do here. Currently, this just unconditionally enables all features (since I assume the kimchi expression is correct). An inspection of the IfFeature branch of value_/value indicates this shouldn't be broken, but I don't know whether to revert this or not.
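To make that reasoning concrete, here is a toy model of the IfFeature semantics in question (hypothetical types, not kimchi's actual definitions): with is_enabled hard-coded to true, the enabled branch is always the one evaluated, so nothing breaks as long as every feature really is enabled for the expression being evaluated.

```rust
// Toy model (hypothetical; not kimchi's actual types) of the IfFeature
// branch discussed above.
enum Expr {
    Constant(u64),
    IfFeature {
        enabled: bool, // stands in for FeatureFlag::is_enabled()
        on: Box<Expr>,
        off: Box<Expr>,
    },
}

fn value(e: &Expr) -> u64 {
    match e {
        Expr::Constant(c) => *c,
        // With `enabled` unconditionally true, only `on` is ever evaluated.
        Expr::IfFeature { enabled, on, off } => {
            if *enabled { value(on) } else { value(off) }
        }
    }
}
```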

Comment on lines 272 to 273
assert_eq!(evals1.len(), evals2.len());
assert_eq!(evals1[0], evals2[0]);
Contributor Author


Sanity checks.


assert_eq!(evals1.len(), evals2.len());
assert_eq!(evals1[0], evals2[0]);
assert_eq!(evals1, evals2);
Contributor Author


The real deal.

@Fizzixnerd Fizzixnerd force-pushed the fizzixnerd/expr-bench branch from 7b36bcc to 01dab68 on April 2, 2025 at 14:54
@Fizzixnerd Fizzixnerd marked this pull request as ready for review April 2, 2025 14:59
@Fizzixnerd
Contributor Author

Just gonna fix up CI.

@Fizzixnerd
Contributor Author

The most recent commit removed the tests, since they were now essentially checking that x == x.

Contributor

@Copilot Copilot AI left a comment


Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

kimchi/src/circuits/constraints.rs:278

  • The word 'Iterativelly' is misspelled; consider correcting it to 'Iteratively'.
/// 2. Iterativelly invoke any desired number of steps: `public(), lookup(), runtime(), precomputations()`

@@ -1198,25 +1198,37 @@ fn value<
    env: &Environment,
    cache: &HashMap<CacheId, Evaluations<F, D<F>>>,
    row: usize,
+   inferred_domain: Option<Domain>,
Contributor


The name is a bit confusing, as when this is None the domain is inferred by the function. How about something like res_domain or final_domain?

@dannywillems
Member

I would recommend keeping this PR unmerged until we merge the various commits fixing the o1js CI. @Trivo25 and @querolita are on it.

Member

@dannywillems dannywillems left a comment


Requesting changes for now, since 0fa4e82 was added recently, and to avoid merging it by mistake after @marcbeunardeau88's approval; we do not request a new set of reviews after a new commit has been added. I will take a deeper look within the day.

@@ -0,0 +1,80 @@
use criterion::{criterion_group, criterion_main, Criterion};
Member


This file, together with the change in kimchi/Cargo.toml, can be extracted from this PR and gathered into a single commit; that would provide a good starting point for the changes you are trying to introduce in this patch. A reviewer could then go over the subsequent commits and gradually use the commands you recommend to confirm your claims. In addition, it follows the engineering practices we try to enforce. For instance, this practice has been followed in this PR, and having atomic commits that compile and pass the whole CI ended up being very useful when we had to revert commits.

It seems you started doing this in the first commit, but there are comments that you remove later. Would you mind trying to squash the different commits touching this file to get a clean first commit?
