-
Notifications
You must be signed in to change notification settings - Fork 377
linux: optimize masked paths with shared empty directory #1859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideThis PR optimizes masked directory mounts by introducing a single shared empty directory (using O_TMPFILE or a temporary mkdtemp directory) that is bind-mounted over each masked path, falling back to per-path tmpfs mounts on failure. It extends the container’s private data to cache the shared directory, implements helper functions to manage and mount it, and integrates the new logic into the existing masked-path workflow. Sequence diagram for masked directory mount process with shared empty directorysequenceDiagram
participant Container
participant MaskedDirManager
participant Kernel
Container->>MaskedDirManager: Request masked path mount
MaskedDirManager->>MaskedDirManager: get_shared_empty_dir()
alt Shared empty dir available
MaskedDirManager->>Kernel: Bind mount shared empty dir
Kernel-->>MaskedDirManager: Success or Failure
alt Bind mount fails
MaskedDirManager->>Kernel: Mount tmpfs over path
Kernel-->>MaskedDirManager: Success
end
else Shared empty dir not available
MaskedDirManager->>Kernel: Mount tmpfs over path
Kernel-->>MaskedDirManager: Success
end
MaskedDirManager-->>Container: Mount complete
Class diagram for updated private_data_s structure and masked directory handlingclassDiagram
class private_data_s {
char *external_descriptors
int rootfsfd
int shared_empty_dir_fd
proc_fd_path_t shared_empty_dir_path
...
}
class libcrun_container_s {
private_data_s *private_data
cleanup_private_data
...
}
private_data_s <.. libcrun_container_s : contains
class MaskedDirManager {
+get_shared_empty_dir(container, err)
+mount_masked_dir(container, pathfd, rel_path, err)
}
MaskedDirManager ..> private_data_s : uses
MaskedDirManager ..> libcrun_container_s : uses
Flow diagram for masked directory mount optimizationflowchart TD
A[Start masked path mount] --> B{Is shared empty dir available?}
B -- Yes --> C[Bind mount shared empty dir over masked path]
C --> D[Success]
B -- No or bind mount fails --> E[Mount tmpfs over masked path]
E --> D[Success]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
3ada96c
to
7eea972
Compare
Ephemeral COPR build failed. @containers/packit-build please check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/libcrun/linux.c:1099` </location>
<code_context>
+ if (prefix == NULL || *prefix != '/')
+ prefix = "/tmp";
+
+#ifdef O_TMPFILE
+ fd = open (prefix, O_TMPFILE | O_DIRECTORY | O_RDONLY | O_CLOEXEC, 0700);
+ if (fd >= 0)
+ {
+ private_data->shared_empty_dir_fd = fd;
+ get_proc_self_fd_path (private_data->shared_empty_dir_path, fd);
+ fd = -1; /* Don't auto-close on cleanup */
+ return private_data->shared_empty_dir_fd;
+ }
+
+ /* Fallback if O_TMPFILE isn't supported or fails unexpectedly */
+ if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL)
+ return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix);
+#endif
</code_context>
<issue_to_address>
Error handling for open with O_TMPFILE may miss some edge cases.
Review whether additional errno values, such as ENOSPC or EACCES, should also trigger the fallback instead of being treated as fatal errors.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
/* Fallback if O_TMPFILE isn't supported or fails unexpectedly */
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL)
return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix);
=======
/* Fallback if O_TMPFILE isn't supported or fails unexpectedly */
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL && errno != ENOSPC && errno != EACCES)
return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix);
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `src/libcrun/linux.c:1116` </location>
<code_context>
+
+ /* If O_TMPFILE not available, create directory using crun's standard pattern */
+ char template[PATH_MAX];
+ if (snprintf (template, sizeof (template), "%s/crun.XXXXXX", prefix) < 0)
+ return crun_make_error (err, errno, "snprintf");
+
</code_context>
<issue_to_address>
snprintf return value check may not be sufficient for buffer overflow.
If snprintf returns a value >= sizeof(template), truncation has occurred. Please add a check for this to prevent buffer overflow risks.
</issue_to_address>
### Comment 3
<location> `src/libcrun/linux.c:1144` </location>
<code_context>
+ int ret;
+
+ libcrun_error_t tmp_err = NULL;
+ int shared_empty_fd = get_shared_empty_dir (container, &tmp_err);
+ if (shared_empty_fd >= 0)
+ {
</code_context>
<issue_to_address>
Error handling for get_shared_empty_dir may leak file descriptors.
If get_shared_empty_dir fails after opening a directory, make sure any opened file descriptors are properly closed. Check the cleanup logic for edge cases.
</issue_to_address>
### Comment 4
<location> `src/libcrun/linux.c:1074` </location>
<code_context>
return false;
}
+/* Get or create a shared empty directory for masked paths optimization.
+ * This avoids creating individual tmpfs mounts for each masked directory.
+ */
</code_context>
<issue_to_address>
Consider merging the two new routines into a single helper function to simplify masked directory handling.
```suggestion
The two new routines (`get_shared_empty_dir` + `mount_masked_dir`) can be collapsed into one simpler helper. You only need to open the “anonymous” dir once (using O_TMPFILE or mkdtemp) and then bind-mount it via `/proc/self/fd/<fd>`. You can drop the extra proc-path cache and just `snprintf()` it on each mount.
Example:
```c
/* open anon directory once, fallback from O_TMPFILE -> mkdtemp */
static int open_anon_dir(const char *prefix, libcrun_error_t *err) {
int fd;
#ifdef O_TMPFILE
fd = open(prefix, O_TMPFILE|O_DIRECTORY|O_CLOEXEC, 0700);
if (fd >= 0) return fd;
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL)
return crun_make_error(err, errno, "open O_TMPFILE `%s`", prefix);
#endif
char tmp[PATH_MAX];
if (snprintf(tmp, sizeof(tmp), "%s/crun.XXXXXX", prefix) < 0)
return crun_make_error(err, errno, "snprintf");
if (!mkdtemp(tmp))
return crun_make_error(err, errno, "mkdtemp `%s`", tmp);
fd = open(tmp, O_DIRECTORY|O_CLOEXEC);
rmdir(tmp); /* cleanup on host fs */
if (fd < 0)
return crun_make_error(err, errno, "open `%s`", tmp);
return fd;
}
/* single helper for masked dirs */
static int mount_masked_dir(libcrun_container_t *ctr,
int pathfd, const char *rel, libcrun_error_t *err)
{
struct private_data_s *p = get_private_data(ctr);
if (p->empty_fd < 0) {
const char *dir = getenv("_LIBCONTAINER_STATEDIR");
if (!dir || *dir!='/') dir = "/tmp";
p->empty_fd = open_anon_dir(dir, err);
if (p->empty_fd < 0) return -1;
}
/* mount via /proc/self/fd */
char src[32];
snprintf(src, sizeof(src), "/proc/self/fd/%d", p->empty_fd);
if (do_mount(ctr, src, pathfd, rel,
NULL, MS_BIND|MS_RDONLY, NULL, LABEL_MOUNT, err) == 0)
return 0;
libcrun_warning("bind empty-dir `%s` failed, falling back to tmpfs: %s",
rel, err ? err->msg : "errno");
crun_error_release(&err);
/* fallback to tmpfs */
return do_mount(ctr, "tmpfs", pathfd, rel,
"tmpfs", MS_RDONLY, "size=0k", LABEL_MOUNT, err);
}
```
- Drop `get_shared_empty_dir` and its proc-path cache.
- Store only one `empty_fd` in `private_data_s`.
- Inline the fallback logic in one function.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
src/libcrun/linux.c
Outdated
/* Fallback if O_TMPFILE isn't supported or fails unexpectedly */ | ||
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL) | ||
return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Error handling for open with O_TMPFILE may miss some edge cases.
Review whether additional errno values, such as ENOSPC or EACCES, should also trigger the fallback instead of being treated as fatal errors.
/* Fallback if O_TMPFILE isn't supported or fails unexpectedly */ | |
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL) | |
return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix); | |
/* Fallback if O_TMPFILE isn't supported or fails unexpectedly */ | |
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL && errno != ENOSPC && errno != EACCES) | |
return crun_make_error (err, errno, "open O_TMPFILE `%s`", prefix); |
src/libcrun/linux.c
Outdated
|
||
/* If O_TMPFILE not available, create directory using crun's standard pattern */ | ||
char template[PATH_MAX]; | ||
if (snprintf (template, sizeof (template), "%s/crun.XXXXXX", prefix) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): snprintf return value check may not be sufficient for buffer overflow.
If snprintf returns a value >= sizeof(template), truncation has occurred. Please add a check for this to prevent buffer overflow risks.
src/libcrun/linux.c
Outdated
int ret; | ||
|
||
libcrun_error_t tmp_err = NULL; | ||
int shared_empty_fd = get_shared_empty_dir (container, &tmp_err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Error handling for get_shared_empty_dir may leak file descriptors.
If get_shared_empty_dir fails after opening a directory, make sure any opened file descriptors are properly closed. Check the cleanup logic for edge cases.
src/libcrun/linux.c
Outdated
return false; | ||
} | ||
|
||
/* Get or create a shared empty directory for masked paths optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider merging the two new routines into a single helper function to simplify masked directory handling.
/* Get or create a shared empty directory for masked paths optimization. | |
The two new routines (`get_shared_empty_dir` + `mount_masked_dir`) can be collapsed into one simpler helper. You only need to open the “anonymous” dir once (using O_TMPFILE or mkdtemp) and then bind-mount it via `/proc/self/fd/<fd>`. You can drop the extra proc-path cache and just `snprintf()` it on each mount. | |
Example: | |
```c | |
/* open anon directory once, fallback from O_TMPFILE -> mkdtemp */ | |
static int open_anon_dir(const char *prefix, libcrun_error_t *err) { | |
int fd; | |
#ifdef O_TMPFILE | |
fd = open(prefix, O_TMPFILE|O_DIRECTORY|O_CLOEXEC, 0700); | |
if (fd >= 0) return fd; | |
if (errno != EOPNOTSUPP && errno != EISDIR && errno != EINVAL) | |
return crun_make_error(err, errno, "open O_TMPFILE `%s`", prefix); | |
#endif | |
char tmp[PATH_MAX]; | |
if (snprintf(tmp, sizeof(tmp), "%s/crun.XXXXXX", prefix) < 0) | |
return crun_make_error(err, errno, "snprintf"); | |
if (!mkdtemp(tmp)) | |
return crun_make_error(err, errno, "mkdtemp `%s`", tmp); | |
fd = open(tmp, O_DIRECTORY|O_CLOEXEC); | |
rmdir(tmp); /* cleanup on host fs */ | |
if (fd < 0) | |
return crun_make_error(err, errno, "open `%s`", tmp); | |
return fd; | |
} | |
/* single helper for masked dirs */ | |
static int mount_masked_dir(libcrun_container_t *ctr, | |
int pathfd, const char *rel, libcrun_error_t *err) | |
{ | |
struct private_data_s *p = get_private_data(ctr); | |
if (p->empty_fd < 0) { | |
const char *dir = getenv("_LIBCONTAINER_STATEDIR"); | |
if (!dir || *dir!='/') dir = "/tmp"; | |
p->empty_fd = open_anon_dir(dir, err); | |
if (p->empty_fd < 0) return -1; | |
} | |
/* mount via /proc/self/fd */ | |
char src[32]; | |
snprintf(src, sizeof(src), "/proc/self/fd/%d", p->empty_fd); | |
if (do_mount(ctr, src, pathfd, rel, | |
NULL, MS_BIND|MS_RDONLY, NULL, LABEL_MOUNT, err) == 0) | |
return 0; | |
libcrun_warning("bind empty-dir `%s` failed, falling back to tmpfs: %s", | |
rel, err ? err->msg : "errno"); | |
crun_error_release(&err); | |
/* fallback to tmpfs */ | |
return do_mount(ctr, "tmpfs", pathfd, rel, | |
"tmpfs", MS_RDONLY, "size=0k", LABEL_MOUNT, err); | |
} |
- Drop
get_shared_empty_dir
and its proc-path cache. - Store only one
empty_fd
inprivate_data_s
. - Inline the fallback logic in one function.
TMT tests failed. @containers/packit-build please check. |
you can create a directory under |
7eea972
to
cccb1f9
Compare
cccb1f9
to
29d0aef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit, otherwise codewise LGTM
29d0aef
to
980741b
Compare
src/libcrun/linux.c
Outdated
if (LIKELY (ret >= 0)) | ||
return ret; | ||
|
||
if (tmp_err != NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the if statement
if (tmp_err != NULL)
is unnecessary?
I created an issue related to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, @sohankunkerkar please drop the check as it is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just reading tmpfs(5)
and found out that size=0
, which we're currently using, means "no limit". Perhaps if we change it to be > 0, this will fix the issue of high kernel memory use?
Something like nr_blocks=1,nr_inodes=1
?
wouldn't the bind mount be always cheaper than the tmpfs though? If we leave the fallback though, I think we need to issue above. @sohankunkerkar can you please add a new patch to this PR? |
980741b
to
41bf172
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments inline, waiting for @kolyshkin to weight in if he is happy with the change
src/libcrun/linux.c
Outdated
if (LIKELY (ret >= 0)) | ||
return ret; | ||
|
||
if (tmp_err != NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, @sohankunkerkar please drop the check as it is not necessary.
41bf172
to
a4ee278
Compare
Replace "size=0k" (unlimited growth) with explicit block and inode limits for tmpfs mounts used in masked directory paths. This prevents excessive kernel memory consumption under high container density. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
a4ee278
to
2d79e32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I wonder if there are measurements re how much memory it saves?
@sohankunkerkar guess you also need to remove |
2d79e32
to
096377b
Compare
src/libcrun/linux.c
Outdated
return ret; | ||
|
||
/* Apply SELinux label using container's mount label */ | ||
if (container->container_def->linux && container->container_def->linux->mount_label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CNV team found an SELinux issue while testing the previous patch:
audit: type=1400 audit(1756323551.103:5179): avc: denied { read } for pid=142874
comm="virt-launcher-m" name=".empty-directory" dev="tmpfs" ino=7201
scontext=system_u:system_r:container_t:s0:c139,c767
tcontext=system_u:object_r:container_var_run_t:s0 tclass=dir permissive=0
I added this code to address the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used this test to confirm the SELinux context for the directory.
@test "crun shared directory has correct SELinux context" {
cat > "$CRIO_CONFIG_DIR/01-crun-test.conf" <<EOF
[crio.runtime.runtimes.crun-test]
runtime_path = "$CRUN_BINARY"
runtime_type = ""
runtime_root = "/run/crun-test"
monitor_path = "/usr/bin/conmon"
EOF
start_crio
# Create container config with thermal_throttle masked paths
jq '.linux.security_context.masked_paths = ["/sys/devices/system/cpu/cpu0/thermal_throttle"] | del(.linux.security_context.namespace_options.pid)' \
"$TESTDATA/container_redis.json" > "$TESTDIR/container_thermal.json"
# Create sandbox config without PID namespace sharing (avoid conflict with drop_infra_ctr=true)
jq 'del(.linux.security_context.namespace_options.pid)' \
"$TESTDATA/sandbox_config.json" > "$TESTDIR/sandbox_no_pid.json"
local ctr_id
ctr_id=$(crictl run --runtime crun-test "$TESTDIR/container_thermal.json" "$TESTDIR/sandbox_no_pid.json")
# Wait for container to start
sleep 2
sudo ls -ldZ /run/crun-test/.empty-directory 2>/dev/null || echo "Directory not found"
local selinux_context
selinux_context=$(sudo ls -ldZ /run/crun-test/.empty-directory 2>/dev/null | awk '{print $5}' | head -1)
echo "DEBUG: SELinux context found: '$selinux_context'"
# Should have container_file_t context (accessible to containers) not container_var_run_t
[[ "$selinux_context" =~ container_file_t ]]
[[ ! "$selinux_context" =~ container_var_run_t ]]
# Test mountinfo to verify bind mounts instead of tmpfs
echo "DEBUG: Checking mountinfo for thermal_throttle paths..."
crictl exec "$ctr_id" sh -c "cat /proc/self/mountinfo | grep thermal_throttle | head -3"
crictl stop "$ctr_id"
crictl rm "$ctr_id"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be this?
const char *shared_context = "system_u:object_r:container_file_t:s0";
ret = setxattr (empty_dir_path, "security.selinux", shared_context, strlen (shared_context), 0);
if (ret < 0 && errno != EOPNOTSUPP)
{
libcrun_warning ("Failed to set shared SELinux context on %s: %s", empty_dir_path, strerror (errno));
}
I suspect each container gets unique MCS categories for isolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not set the xattr at all. It will be changed for each mount. It must have a static value (the default it gets might be good) and the selinux policy should allow that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about containers/container-selinux#393?
@giuseppe could you PTAL? |
…timization Problem: crun PR #1859 (containers/crun#1859) optimizes masked paths by using a shared empty directory instead of individual tmpfs mounts. However, containers cannot access this shared directory due to SELinux policy: avc: denied { read } for name=".empty-directory" scontext=container_t:s0:c139,c767 tcontext=container_var_run_t:s0 Without this policy, the optimization falls back to individual tmpfs mounts, negating the performance benefits.
…timization Problem: crun PR #1859 (containers/crun#1859) optimizes masked paths by using a shared empty directory instead of individual tmpfs mounts. However, containers cannot access this shared directory due to SELinux policy: avc: denied { read } for name=".empty-directory" scontext=container_t:s0:c139,c767 tcontext=container_var_run_t:s0 Without this policy, the optimization falls back to individual tmpfs mounts, negating the performance benefits. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
096377b
to
2a605bb
Compare
Optimize masked path handling by bind-mounting a shared empty directory (via cached /proc/self/fd) instead of creating per-path tmpfs mounts. This reduces kernel memory and mount syscall overhead under high container density. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
2a605bb
to
4004e5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/packit retest-failed |
Account lsm5 has no write access nor is author of PR! |
@giuseppe mind giving me write access ? Alternatively, please feel free to run that packit command. |
@lsm5 can you try again? |
/packit retest-failed |
…timization Problem: crun PR #1859 (containers/crun#1859) optimizes masked paths by using a shared empty directory instead of individual tmpfs mounts. However, containers cannot access this shared directory due to SELinux policy: avc: denied { read } for name=".empty-directory" scontext=container_t:s0:c139,c767 tcontext=container_var_run_t:s0 Without this policy, the optimization falls back to individual tmpfs mounts, negating the performance benefits. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Replace individual tmpfs mounts with bind mounts to a shared empty directory for masked directory paths. This reduces kernel memory usage under high container density.
This is how the
mountinfo
will look like:Summary by Sourcery
Optimize masked directory mounts by using a shared empty directory bind mount to reduce per-mount tmpfs memory overhead under high container density.
New Features:
Enhancements: