
avoidable cache misses from the current handling of per-builder filesystems #1255

@mjguzik

Description


... at least for tmpfs; I have not checked what happens with ZFS.

A per-builder fs structure is perfectly fine and encouraged; it's the way it is currently done that is the problem.

The commonly executed binaries are literally megabytes in size. With compiles running in parallel, this keeps busting CPU caches. This would not be happening if the underlying vnodes were the same.

NUMA awareness is an important consideration here as well, and it is a factor on the official builders too.

The core idea for handling this is a shared tmpfs mount with the different jails on it, and with files hardlinked.

An unsuspecting user might think nullfs would do a great job here, but that's not true due to its overhead.

Suppose the system has 2 NUMA nodes (numbered 0 and 1) and the builders are hanging out in /poudriere/builders.

In that case:
cpuset -l <cpus-from-domain-0> mount -t tmpfs tmpfs /poudriere/builders/node0
cpuset -l <cpus-from-domain-0> mkdir /poudriere/builders/node0/basefs
Here unpack the base system into the basefs dir. Still ignore /usr/share, /usr/tests and whatever else is applicable; these can be null-mounted from a machine-wide place.
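For the unpacking step, a minimal sketch could look like the following; the base.txz location is an assumption (adjust to wherever your distribution sets live), and anything beyond what tar -p preserves is omitted:

cpuset -l <cpus-from-domain-0> tar -xpf /usr/freebsd-dist/base.txz -C /poudriere/builders/node0/basefs
# drop the trees that will be provided machine-wide instead
rm -rf /poudriere/builders/node0/basefs/usr/share /poudriere/builders/node0/basefs/usr/tests

The dropped trees can later be null-mounted read-only from the host into each builder root (or wherever fits your layout) once the empty mount points exist, e.g. mount -t nullfs -o ro /usr/share /poudriere/builders/node0/builder0/usr/share.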

cpuset -l <cpus-from-domain-0> mkdir /poudriere/builders/node0/builder0
Here recreate the directory tree as seen in basefs with mkdir for every directory.
Now create hardlinks to all files (e.g., ln basefs/bin/sh builder0/bin/sh).
Symlinks have to be recreated exactly as they are found in basefs (e.g., a symlink to "../crap" has to remain a symlink to "../crap").
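A minimal sketch of the tree replication, assuming the node0 paths from above and that base-system file names contain no whitespace (directory mode/ownership fixups are left out):

cd /poudriere/builders/node0/basefs
# recreate the directory tree
find . -type d | while read -r d; do mkdir -p "/poudriere/builders/node0/builder0/$d"; done
# hardlink every regular file so all builders end up on the same vnodes
find . -type f | while read -r f; do ln "$f" "/poudriere/builders/node0/builder0/$f"; done
# recreate symlinks with their original (often relative) targets
find . -type l | while read -r s; do ln -s "$(readlink "$s")" "/poudriere/builders/node0/builder0/$s"; done

Something like find . | cpio -pdlm /poudriere/builders/node0/builder0 should achieve the same in one pass (-l links regular files instead of copying them).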

Repeat for all domains and all builders.

Finally, set the schg flag on all files in basefs and on all directories (modulo etc and similar) in the builders to prevent ports from messing with them.
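A hedged sketch of the flag-setting step, assuming etc, var and tmp are the writable exceptions (adjust the exclusion list to taste):

# make the shared files immutable; the hardlinks in the builders share the same
# inodes, so they pick up the flag automatically
find /poudriere/builders/node0/basefs -type f -exec chflags schg {} +
# lock down the builder's directory skeleton, leaving writable locations alone
B=/poudriere/builders/node0/builder0
find "$B" -type d ! -path "$B/etc*" ! -path "$B/var*" ! -path "$B/tmp*" -exec chflags schg {} +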

I stress cpuset in order to maintain kernel memory locality when creating the data structures to be used by a given builder.
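Repeating the whole procedure per domain and per builder under the right cpuset could then look roughly like this; the CPU lists, the builder count and the setup_basefs.sh/populate_builder.sh helpers are placeholders standing in for the steps above:

for node in 0 1; do
    case $node in
    0) cpus=<cpus-from-domain-0> ;;
    1) cpus=<cpus-from-domain-1> ;;
    esac
    cpuset -l "$cpus" mount -t tmpfs tmpfs "/poudriere/builders/node$node"
    cpuset -l "$cpus" sh setup_basefs.sh "node$node"                     # unpack step above
    b=0
    while [ "$b" -lt 27 ]; do
        cpuset -l "$cpus" sh populate_builder.sh "node$node" "builder$b" # mkdir/hardlink step above
        b=$((b + 1))
    done
done

The intent, as noted above, is that the kernel data structures backing a given builder get allocated on the domain whose CPUs will use them.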

Et voilà: a jail-private view of the filesystem with kernel-level data being shared.

In a simple test building hello world in a loop with clang, I get an over-6% win from doing this instead of completely separate worlds. The win would be higher if it were not for lock contention in the kernel, which I'm going to look into.
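For reference, the kind of micro-benchmark described might look like the sketch below (the source file and iteration count are made up), with one such loop running in each builder at the same time:

cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { printf("hello, world\n"); return 0; }
EOF
i=0
while [ "$i" -lt 1000 ]; do
    cc -o hello hello.c    # cc is clang on FreeBSD
    i=$((i + 1))
done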

Here are memory throughput stats from pcm when issuing 54 builds in parallel.

54 separate tmpfs mounts:

|-- System Read Throughput(MB/s): 36412.07 --|
|-- System Write Throughput(MB/s): 15735.59 --|
|-- System Memory Throughput(MB/s): 52147.66 --|

Shared:

|-- System Read Throughput(MB/s): 10309.42 --|
|-- System Write Throughput(MB/s): 12702.96 --|
|-- System Memory Throughput(MB/s): 23012.38 --|

As you can see, read throughput dropped over 3x, and that's while getting more work done.
