
Finer grained precompile native code cache (part 1) #58592


Open · xal-0 wants to merge 10 commits into master

Conversation

@xal-0 (Member) commented May 30, 2025

Overview

This pull request adds a new mode for precompiling sysimages and pkgimages that caches the results of compiling each LLVM module, reducing the time spent emitting native code for CodeInstances that generate identical LLVM IR. For now, it works only when using the ahead-of-time compiler, but the approach is also valid for JITed code.

Usage

Set JULIA_NATIVE_CACHE=<dir> to look for and store compiled objects in <dir>. When JULIA_IMAGE_TIMINGS=1 is also set, the cache hit rate will be printed, like:

[...]
cache hits: 260/261 (99%)
added 2520 B to cache
[...]

Internals

Normally, jl_emit_native emits every CodeInstance into a separate LLVM module before combining everything into a single module. When the fine-grained cache is enabled, the modules are serialized to bitcode separately. The cache key for each module is computed from the hash of the serialized bitcode and the LLVM version, and the compiled object file is the value.
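A minimal sketch of that key computation, assuming LLVM's SHA256 helper; the function name, hash choice, and hex encoding here are illustrative rather than what the PR necessarily uses:

```cpp
// Hypothetical sketch: key = hash(serialized bitcode || LLVM version).
// Mixing in the version ensures a compiler upgrade never reuses stale objects.
#include <llvm/ADT/SmallString.h>
#include <llvm/ADT/StringExtras.h>       // toHex
#include <llvm/Bitcode/BitcodeWriter.h>
#include <llvm/Config/llvm-config.h>     // LLVM_VERSION_STRING
#include <llvm/IR/Module.h>
#include <llvm/Support/SHA256.h>
#include <llvm/Support/raw_ostream.h>
#include <string>

static std::string cache_key_for(llvm::Module &M)
{
    // Serialize the per-CodeInstance module to bitcode in memory.
    llvm::SmallString<0> bitcode;
    llvm::raw_svector_ostream os(bitcode);
    llvm::WriteBitcodeToFile(M, os);

    // Hash the bitcode together with the LLVM version string.
    llvm::SHA256 hasher;
    hasher.update(bitcode);
    hasher.update(LLVM_VERSION_STRING);
    return llvm::toHex(hasher.final());
}
```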

The partitionModule pass is not run; instead, multi-threaded compilation is achieved by having JULIA_IMAGE_THREADS worker threads pull from the queue of serialized modules.
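Roughly, the worker model looks like the following sketch; SerializedModule, compile_or_fetch, and the plain mutex-guarded deque are stand-ins for the PR's actual machinery:

```cpp
// Hypothetical sketch of JULIA_IMAGE_THREADS workers draining a queue of
// serialized modules; compile_or_fetch stands in for the cache lookup plus
// codegen path and is not a real function in this PR.
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct SerializedModule { std::string bitcode; };

static void compile_or_fetch(const SerializedModule &mod)
{
    // Cache hit: reuse the stored object file; miss: compile and insert.
}

static void compile_all(std::deque<SerializedModule> queue, unsigned nthreads)
{
    std::mutex queue_lock;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nthreads; i++) {
        workers.emplace_back([&] {
            for (;;) {
                SerializedModule mod;
                {
                    // Take the next serialized module, or stop when empty.
                    std::lock_guard<std::mutex> guard{queue_lock};
                    if (queue.empty())
                        return;
                    mod = std::move(queue.front());
                    queue.pop_front();
                }
                compile_or_fetch(mod);
            }
        });
    }
    for (std::thread &t : workers)
        t.join();
}
```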

When the fine-grained cache is used, we generate a single jl_image_shard_t table for the entire image. The gvar_offsets are resolved by the linker.

Currently, the cache uses LLVM's FileCache, a thread-safe key-value store that uses one file per key and write-and-rename to implement atomic updates. It is a convenient choice for development because the contents can easily be inspected with objdump, but the long-term plan is to switch to a more appropriate database, be it LLVMCAS once it is merged, or SQLite.
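The write-and-rename step is what makes concurrent updates safe: each entry is written to a temporary file and then renamed to its final key-derived path, so readers observe either a complete object or no entry at all. A sketch of that idea using LLVM's filesystem helpers (put_entry and the path layout are hypothetical, not FileCache internals):

```cpp
// Hypothetical sketch of an atomic cache insert: write to a temp file,
// then rename into place so readers never see a partially written entry.
#include <llvm/ADT/SmallString.h>
#include <llvm/ADT/Twine.h>
#include <llvm/Support/FileSystem.h>
#include <llvm/Support/Path.h>
#include <llvm/Support/raw_ostream.h>

static std::error_code put_entry(llvm::StringRef dir, llvm::StringRef key,
                                 llvm::StringRef object)
{
    // Write the compiled object to a unique temporary file in the same
    // directory (rename is only atomic within one filesystem).
    llvm::SmallString<128> tmp;
    int fd;
    if (std::error_code ec =
            llvm::sys::fs::createUniqueFile(dir + "/tmp-%%%%%%", fd, tmp))
        return ec;
    {
        llvm::raw_fd_ostream os(fd, /*shouldClose=*/true);
        os << object;
    }
    // Atomically move the finished file to its key-derived name.
    llvm::SmallString<128> dest(dir);
    llvm::sys::path::append(dest, key);
    return llvm::sys::fs::rename(tmp, dest);
}
```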

Current limitations

  • The multiversioning pass requires a combined module. If it is requested, the native code cache will be disabled.
  • Only .o outputs are cached. The fine-grained cache cannot be used if --output-bc, --output-unopt-bc or --output-asm are specified.
  • It is unclear how much of a (compile time) performance impact using separate modules will have. Once we understand how to maximize cache hits, I'd like to use a heuristic to emit code into shared modules to mitigate this.
  • Cache entries do not currently expire.
  • The cache directory hits file system bottlenecks very quickly on Windows, and storing one file per key wastes considerable space.
  • The APIs intended for external use also require the use of a single module (jl_get_llvm_module_impl, jl_get_llvm_function_impl, jl_emit_native_impl with llvmmod set).
  • The cache hit rate is predictably quite low because of how we generate names during codegen. To improve this, we'll need to change how naming works and delay uniquing until link time.

Plan

  • Support compiling split modules, emitting a single shard table (this PR)
  • Generate names that are predictable and only unique within one module; rename while linking.
  • Find a good heuristic to generate fewer modules
  • Switch to a better KV store
  • Cache eviction
  • Support multiversioning

@xal-0 added the compiler:codegen (Generation of LLVM IR and native code), compiler:precompilation (Precompilation of modules), and feature (Indicates new feature / enhancement requests) labels on May 30, 2025
@vchuravy (Member) commented Jun 1, 2025

This is fantastic! I have long wanted to explore more fine-grained compilation caching using approaches like llvm-cas. From a cursory look, this seems similar to how GPUCompiler's on-disk cache works (and maybe there is room for unifying both approaches).

My big picture questions would be:

  1. How is data handled?
  2. How is provenance handled? E.g., does this rely on the uniqueness of CodeInstances? IIRC that uniqueness is only guaranteed during precompilation.
  3. Cache collision/invalidation?

x-ref: https://github.com/JuliaGPU/GPUCompiler.jl/blob/c3ba85b62daeb572b2faa800013094b21f94a4f7/src/execution.jl#L167

Another concern is the many small files, which often perform very badly, so something along the lines of llvm-cas or SQLite as blob storage might be good.

@xal-0 (Member, Author) commented Jun 2, 2025

It's a little different from the GPUCompiler approach, since it makes no attempt to predict whether a CodeInstance will ultimately emit the same code. Instead, we emit LLVM IR as normal and compute the ModuleHash, which we do anyway to move LLVM modules between threads.

The idea with this PR is to do the minimum thing that will give a correct result, then work to get more cache hits (starting with globalUniqueGeneratedNames), so we can evaluate whether this type of cache is the right approach at all.

llvm-cas possibly being merged soon is the reason I have so far avoided pulling in an embeddable key-value store, though SQLite might be the better choice if we find other uses for having it around.

So far, there is no invalidation, pending a switch to a real KV store.

@IanButterworth (Member) commented

> [WIP] Precompile native code cache
> This is the first iteration of a cache for native code. ...

This confusingly sounds like something we already have. It might be good to make it clearer in the title and top comment how this differs from the current situation.

@gbaraldi changed the title [WIP] Precompile native code cache → [WIP] Finer grained precompile native code cache on Jun 3, 2025
@StefanKarpinski (Member) commented

> 2. How is provenance handled? E.g., does this rely on the uniqueness of CodeInstances? IIRC that uniqueness is only guaranteed during precompilation.

It sounds like that isn't an issue for this approach, but it could be possible to have a cache keyed by CodeInstances that is only used during precompilation, where uniqueness is guaranteed. Yes, precompilation already generates cached code, but we can have multiple layers of caching, and this could be used to speed up precompilation.

@xal-0 changed the title [WIP] Finer grained precompile native code cache → Finer grained precompile native code cache (part 1) on Jun 9, 2025
@xal-0 (Member, Author) commented Jun 9, 2025

Reverted two commits that were an attempt at getting better cache hit rates with the simplest possible change, in favour of doing a link-time rename step.

@xal-0 marked this pull request as ready for review June 9, 2025 21:53
src/codegen.cpp Outdated
Comment on lines 1579 to 1590
```cpp
static StringMap<uint64_t> fresh_name_map;
static std::mutex fresh_name_lock;

// Append a monotonically increasing per-name counter so that repeated base
// names remain globally unique across modules.
static void freshen_name(std::string &name)
{
    uint64_t n;
    {
        std::lock_guard guard{fresh_name_lock};
        n = fresh_name_map[name]++;
    }
    raw_string_ostream(name) << "_" << n;
}
```
A reviewer (Member) commented:
I am worried that this map could grow big. What issue are you trying to solve here? Just that the order of generation causes different names?

@xal-0 (Member, Author) replied:
Yes, this was an attempt at getting reasonable hit rates with the minimum amount of work; it turned out not to be worth it. The new plan is to abandon globally unique names at this stage and resolve things after code generation, but before linking.
