kimchi: Improve memory usage and parallelism of expr.evaluations
#3127
base: master
Conversation
Initial self-review.
```diff
@@ -582,11 +582,11 @@ pub enum FeatureFlag {

 impl FeatureFlag {
     fn is_enabled(&self) -> bool {
-        todo!("Handle features")
+        true
```
@mrmr1993 I don't think I ever got an answer about what to do here. Currently, this just unconditionally enables all features (since I assume the kimchi expression is correct). An inspection of the `IfFeature` branch of `value_`/`value` indicates this shouldn't be broken, but I don't know whether to revert this or leave it.
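For readers following along, a minimal sketch (illustrative names only, not the actual kimchi types) of why unconditionally returning `true` makes every `IfFeature`-style node take its enabled branch:

```rust
// Hypothetical analogue of the change in the diff above; the variant names
// and `if_feature` helper are illustrative, not the real kimchi definitions.
#[derive(Clone, Copy)]
enum FeatureFlag {
    RangeCheck,
    ForeignFieldAdd,
}

impl FeatureFlag {
    // Mirrors the diff: instead of `todo!("Handle features")`, every
    // feature is unconditionally reported as enabled.
    fn is_enabled(&self) -> bool {
        true
    }
}

// Sketch of how an `IfFeature`-style node would pick a branch.
fn if_feature<T>(flag: FeatureFlag, enabled: T, disabled: T) -> T {
    if flag.is_enabled() {
        enabled
    } else {
        disabled
    }
}

fn main() {
    // With `is_enabled` always true, the "enabled" branch is always taken.
    assert_eq!(if_feature(FeatureFlag::RangeCheck, 1, 0), 1);
    assert_eq!(if_feature(FeatureFlag::ForeignFieldAdd, 7, 0), 7);
    println!("enabled branch always selected");
}
```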
kimchi/tests/test_expr.rs
Outdated
```rust
assert_eq!(evals1.len(), evals2.len());
assert_eq!(evals1[0], evals2[0]);
```
Sanity checks.
kimchi/tests/test_expr.rs
Outdated
```rust
assert_eq!(evals1.len(), evals2.len());
assert_eq!(evals1[0], evals2[0]);
assert_eq!(evals1, evals2);
```
The real deal.
Force-pushed from dcbc4f6 to 7b36bcc
Force-pushed from 7b36bcc to 01dab68
Just gonna fix up CI.
The most recent commit removed the tests, since it was now checking that
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
kimchi/src/circuits/constraints.rs:278
- The word 'Iterativelly' is misspelled; consider correcting it to 'Iteratively'.
/// 2. Iterativelly invoke any desired number of steps: `public(), lookup(), runtime(), precomputations()`
kimchi/src/circuits/expr.rs
Outdated
```diff
@@ -1198,25 +1198,37 @@ fn value<
     env: &Environment,
     cache: &HashMap<CacheId, Evaluations<F, D<F>>>,
     row: usize,
+    inferred_domain: Option<Domain>,
```
The name is a bit confusing, as when this is `None` the domain is inferred by the function.
How about something like `res_domain` or `final_domain`?
I would recommend keeping this PR unmerged until we merge the different commits fixing the CI of o1js. @Trivo25 and @querolita are on it.
Requesting changes for now, as 0fa4e82 was added recently, and to avoid merging it by mistake after @marcbeunardeau88's approval; we do not request a new set of reviews after a new commit has been added. I will give it a deeper look within the day.
```diff
@@ -0,0 +1,80 @@
+use criterion::{criterion_group, criterion_main, Criterion};
```
This file, in addition to the change in `kimchi/Cargo.toml`, can be extracted from this PR, gathered in a single commit, and used as a good starting point for the changes you are trying to introduce in this patch.
A reviewer could then go through the subsequent commits and gradually use the commands you recommend to confirm your claims.
In addition to that, it follows the engineering practices we try to enforce. For instance, this practice has been followed in this PR, and it ended up being very useful to have atomic commits that compile and pass the whole CI when we had to revert commits.
It seems you started doing this in the first commit, but there are comments that you remove later. Would you mind squashing the different commits touching this file to get a clean first commit?
I will be rewriting history in the coming days to ensure it's not so terrible and makes sense from a "story" perspective when reviewing this PR.
I have a couple of checkboxes for reviewers. If they are for you, please feel free to edit and check them (if GH allows you to); otherwise leave a comment specifically mentioning the checkbox you are approving and I will do so.
Goals
Reduce memory usage of `expr.evaluations` without negatively impacting proving speed.

Testing
Runtime
`cargo bench expr` and `cargo test expr` are your go-to commands. The second ensures the original implementation matches the new one by calling `e.evaluations_iter(&env).collect::<Vec<_>>()` on the new iterator interface and comparing it to the original `e.evaluations(&env)`, using the main Kimchi expression. I have only observed memory usage from `top` at the moment, but I will add a proper `massif` analysis in the coming days, as per @marcbeunardeau88's request.

The first may take ~30-60 minutes to do a full run of 100 samples for each version, but a rough figure can be had by running a single sample and scaling, if you don't have time for that. I am including screenshots below of my runs.
Memory
Memory is a little trickier. First, run `cargo bench --bench expr`; it will print the name of the binary it is going to run at the start, something like `target/release/deps/expr-19c2d0431de8290d`. Ctrl-C to stop it, then run `valgrind --tool=massif target/release/deps/expr-19c2d0431de8290d expr_evals_vec` for the old implementation. This should run in a couple of minutes. Run `valgrind --tool=massif target/release/deps/expr-19c2d0431de8290d expr_evals_par` for the new. This will take roughly 10x-15x the time of the original implementation, due to massif suppressing parallelism.

Runtime Results
For the iterator-based runs, runtime-wise, on a circa-2022 10-core (with hyperthreading) Core i9 laptop:
(If you do plan to benchmark, close Steam especially. I had it open at first, and those are the red dots near the start at the bottom. Oops! I reran, and we use the blue dots for analysis.)
For the original implementation, on the same computer:
Runtime Analysis
The mean runtime, wall-clock-wise, of the new implementation is comparable to the old.
CPU-time-wise, it is enormously in favour of the old implementation.
The CPU time saturates the cores of my machine in the new implementation but does not in the old. This implies three important things: runtime could get worse on less capable machines; specific WASM testing for `rayon` (cc our WASM expert @hattyhattington17) is required before merging into o1js; and runtime could get better on more capable machines.

I recommend someone check the performance characteristics across different CPUs (Macs, etc.) more carefully than I have been able to. Specifically, if you have fewer than 10 cores, an M1+ Mac, or an ARM-based computer, I would be interested in hearing from you. If the reviewers desire, I can test these changes on an old server I have with 72 CPU cores, and on my modern desktop (which I think has 12 or so).
Memory Results
`massif` stuff (@Fizzixnerd)

MASSIF RESULTS

I used `massif-visualizer` to make these graphs.

First, the old implementation:

Next, the new:

Note that the new implementation took a lot longer to run, seemingly because massif prevents parallelism.
OLD TOP RESULTS
Peak memory usage from `top` for the old implementation was ~11-12% of my memory (32GB). This number started low and grew until the evaluations were returned. Peak memory usage from `top` for the new implementation was ~1.6% of my memory. This number is constant throughout the calculation.

For comparison, Discord uses ~1.8% for me!
Memory Analysis
MASSIF RESULTS
peak old: 3.6 GB
peak new: 0.53 GB
new / old = 0.147
Therefore: peak usage is approximately 15% of what it was originally.
OLD TOP ANALYSIS
Peak memory usage is enormously in favour of the new implementation, using just ~13% of the memory of the original, and without massive spikes in allocation.
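As a quick arithmetic cross-check (my own arithmetic, not from the PR tooling), the `top` percentages on a 32GB machine line up with the massif peaks:

```rust
// Cross-checking the top-based and massif-based numbers reported above.
fn main() {
    let total_gb = 32.0_f64;

    // top: old ~11-12% of 32GB (midpoint 11.5%), new ~1.6% of 32GB.
    let top_old_gb = 0.115 * total_gb; // = 3.68 GB
    let top_new_gb = 0.016 * total_gb; // = 0.512 GB

    // massif: 3.6 GB old, 0.53 GB new.
    let massif_old_gb = 3.6_f64;
    let massif_new_gb = 0.53_f64;

    // The two measurement methods roughly agree.
    assert!((top_old_gb - massif_old_gb).abs() < 0.2);
    assert!((top_new_gb - massif_new_gb).abs() < 0.05);

    // And the massif ratio matches the ~15% claim.
    let ratio = massif_new_gb / massif_old_gb;
    assert!((ratio - 0.147).abs() < 0.001);
    println!("ratio = {:.3}", ratio);
}
```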