Description
We are running some workloads using crun, which started occasionally failing after upgrading to crun version 1.16.1. In a significant fraction of these workloads, we started seeing the error "corrupted size vs. prev_size in fastbins". I'm not very familiar with this error, but some basic searching indicates that it comes from glibc's malloc consistency checks and most likely means the heap was corrupted by an invalid memory access.
Notably, the issue doesn't reproduce when running crun via podman. Our usage of crun is a bit unusual: we are not invoking it via podman and we also don't run conmon, because we are trying to reduce overhead as much as possible. Instead, we generate an OCI container spec and then directly invoke crun. We tried to make the generated container spec match podman's as closely as possible.
To try and find the exact commit where this issue was introduced, I did a git bisect between 1.16 (a known good version) and 1.16.1 (the known bad version), building statically linked crun with the nix method outlined in the README. The bisect revealed that this behavior started happening in commit 72b4eea.
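For anyone who wants to repeat this kind of search, here is a minimal, self-contained sketch of the bisect technique. It is not the exact procedure used above: it builds a throwaway repository in which the fourth of five commits introduces a "bug", and lets `git bisect run` find it. The repository, the tag names `c1` to `c5`, and the `status` file are all hypothetical stand-ins.

```shell
#!/bin/sh
# Self-contained demo of `git bisect run` on a throwaway repository.
# Everything here (tags c1..c5, the `status` file) is made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "bisect-demo"
git config user.email "bisect-demo@example.invalid"

# Create five commits; commits 4 and 5 contain the "bug" (status says bad).
for i in 1 2 3 4 5; do
    if [ "$i" -ge 4 ]; then echo bad > status; else echo good > status; fi
    echo "$i" > counter
    git add status counter
    git commit -q -m "commit $i"
    git tag "c$i"
done

expected=$(git rev-parse c4)

# Mark the endpoints: c5 is known bad, c1 is known good.
git bisect start c5 c1 >/dev/null

# The test command must exit 0 for a good commit and 1 for a bad one.
git bisect run sh -c 'grep -q good status' >/dev/null

# After a successful run, refs/bisect/bad points at the first bad commit.
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad commit: $(git log -1 --format=%s "$first_bad")"
```

In a real crun checkout, the endpoints would be the 1.16 and 1.16.1 releases (assuming they are tagged under those names), and the test command would have to build crun and run the workload. Because the failure here is only occasional, such a test would need to repeat the workload several times before declaring a commit good.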
Unfortunately, I don't have a minimal repro yet. It's difficult to find one in this case because these are customer workloads where we don't have access to the source code. What I do know is that the executable for this workload appears to be nodejs 22. Unfortunately, strace proved unhelpful, because the issue doesn't reproduce under strace. Even worse, if I wrap the workload with any wrapper process whatsoever (even just a simple sh -c 'exec <executable> <args...>'), the issue doesn't reproduce.
I am continuing to investigate to see if I can find a minimal repro, and I am also planning to dig into that commit to see exactly what it changed that might be causing this bug. I thought I would file a report anyway to raise an early warning, and so that other people can more easily find this issue if they start seeing this behavior too (it took me a long time to track down this commit as being the culprit).