
avoidable cache misses from the current handling of per-builder filesystems #1255

@mjguzik

Description


... at least for tmpfs; I have not checked what happens with ZFS.

A per-builder fs structure is perfectly fine and encouraged; it's the way it is currently done that is the problem.

The commonly executed binaries are literally megabytes in size. With compiles running in parallel, this keeps busting CPU caches. This would not be happening if the underlying vnodes were the same.

NUMA awareness is an important consideration here as well, and it is a factor on the official builders too.

The core idea for handling this is a shared tmpfs mount with the different jails on it, and with files hardlinked.

An unsuspecting user might think nullfs would do a great job here, but that's not true due to its overhead.

Suppose the system has 2 NUMA nodes (numbered 0 and 1) and the builders are hanging out in /poudriere/builders.

In that case:
cpuset -l <cpus-from-domain-0> mount -t tmpfs tmpfs /poudriere/builders/node0
cpuset -l <cpus-from-domain-0> mkdir /poudriere/builders/node0/basefs
Here unpack the base system into the basefs dir. Still ignore /usr/share, /usr/tests and whatever else is applicable; these can be null-mounted from a machine-wide place.
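For the unpacking step, a minimal sketch could look like the following; the base.txz location is an assumption (adjust to wherever your distribution sets live), and anything beyond what tar -p preserves is omitted:

cpuset -l <cpus-from-domain-0> tar -xpf /usr/freebsd-dist/base.txz -C /poudriere/builders/node0/basefs
# drop the trees that will be provided machine-wide instead
rm -rf /poudriere/builders/node0/basefs/usr/share /poudriere/builders/node0/basefs/usr/tests

The dropped trees can later be null-mounted read-only from the host into each builder root (or wherever fits your layout) once the empty mount points exist, e.g. mount -t nullfs -o ro /usr/share /poudriere/builders/node0/builder0/usr/share.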

cpuset -l <cpus-from-domain-0> mkdir /poudriere/builders/node0/builder0
Here recreate the directory tree as seen in basefs with mkdir for every directory.
Now create hardlinks to all files (e.g., ln basefs/bin/sh builder0/bin/sh).
Symlinks have to be recreated exactly as they are found in basefs (e.g., a symlink to "../crap" has to remain a symlink to "../crap").
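A minimal sketch of the tree replication, assuming the node0 paths from above and that base-system file names contain no whitespace (directory mode/ownership fixups are left out):

cd /poudriere/builders/node0/basefs
# recreate the directory tree
find . -type d | while read -r d; do mkdir -p "/poudriere/builders/node0/builder0/$d"; done
# hardlink every regular file so all builders end up on the same vnodes
find . -type f | while read -r f; do ln "$f" "/poudriere/builders/node0/builder0/$f"; done
# recreate symlinks with their original (often relative) targets
find . -type l | while read -r s; do ln -s "$(readlink "$s")" "/poudriere/builders/node0/builder0/$s"; done

Something like find . | cpio -pdlm /poudriere/builders/node0/builder0 should achieve the same in one pass (-l links regular files instead of copying them).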

Repeat for all domains and all builders.

Finally, set the schg flag on all files in basefs and on all directories (modulo etc and similar) in the builders to prevent ports from messing with them.
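A hedged sketch of the flag-setting step, assuming etc, var and tmp are the writable exceptions (adjust the exclusion list to taste):

# make the shared files immutable; the hardlinks in the builders share the same
# inodes, so they pick up the flag automatically
find /poudriere/builders/node0/basefs -type f -exec chflags schg {} +
# lock down the builder's directory skeleton, leaving writable locations alone
B=/poudriere/builders/node0/builder0
find "$B" -type d ! -path "$B/etc*" ! -path "$B/var*" ! -path "$B/tmp*" -exec chflags schg {} +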

I stress cpuset in order to maintain kernel memory locality when creating the data structures to be used by a given builder.
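Repeating the whole procedure per domain and per builder under the right cpuset could then look roughly like this; the CPU lists, the builder count and the setup_basefs.sh/populate_builder.sh helpers are placeholders standing in for the steps above:

for node in 0 1; do
    case $node in
    0) cpus=<cpus-from-domain-0> ;;
    1) cpus=<cpus-from-domain-1> ;;
    esac
    cpuset -l "$cpus" mount -t tmpfs tmpfs "/poudriere/builders/node$node"
    cpuset -l "$cpus" sh setup_basefs.sh "node$node"                     # unpack step above
    b=0
    while [ "$b" -lt 27 ]; do
        cpuset -l "$cpus" sh populate_builder.sh "node$node" "builder$b" # mkdir/hardlink step above
        b=$((b + 1))
    done
done

The intent, as noted above, is that the kernel data structures backing a given builder get allocated on the domain whose CPUs will use them.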

Et voilà: a jail-private view of the filesystem with kernel-level data being shared.

In a simple test building hello world in a loop with clang, I get an over-6% win from doing this instead of completely separate worlds. The win would be higher if it were not for lock contention in the kernel, which I'm going to look into.
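For reference, the kind of micro-benchmark described might look like the sketch below (the source file and iteration count are made up), with one such loop running in each builder at the same time:

cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { printf("hello, world\n"); return 0; }
EOF
i=0
while [ "$i" -lt 1000 ]; do
    cc -o hello hello.c    # cc is clang on FreeBSD
    i=$((i + 1))
done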

Here are memory throughput stats from pcm when issuing 54 builds in parallel.

54 separate tmpfs mounts:

|-- System Read Throughput(MB/s): 36412.07 --|
|-- System Write Throughput(MB/s): 15735.59 --|
|-- System Memory Throughput(MB/s): 52147.66 --|

Shared:

|-- System Read Throughput(MB/s): 10309.42 --|
|-- System Write Throughput(MB/s): 12702.96 --|
|-- System Memory Throughput(MB/s): 23012.38 --|

As you can see, read throughput dropped over 3x, and that's while getting more work done.
