Commit b5683a3

Merge tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfd updates from Christian Brauner:

 - Until now pidfds could only be created for thread-group leaders but
   not for threads. There was no technical reason for this. We simply
   had no users that needed support for this. Now we do have users that
   need support for this.

   This introduces a new PIDFD_THREAD flag for pidfd_open(). If that
   flag is set pidfd_open() creates a pidfd that refers to a specific
   thread.

   In addition, we now allow clone() and clone3() to be called with
   CLONE_PIDFD | CLONE_THREAD which wasn't possible before.

   A pidfd that refers to an individual thread differs from a pidfd
   that refers to a thread-group leader:

    (1) Pidfds are pollable. A task may poll a pidfd and get notified
        when the task has exited.

        For thread-group leader pidfds the polling task is woken if the
        thread-group is empty. In other words, if the thread-group
        leader task exits when there are still threads alive in its
        thread-group the polling task will not be woken when the
        thread-group leader exits but rather when the last thread in
        the thread-group exits.

        For thread-specific pidfds the polling task is woken if the
        thread exits.

    (2) Passing a thread-group leader pidfd to pidfd_send_signal() will
        generate thread-group directed signals like kill(2) does.

        Passing a thread-specific pidfd to pidfd_send_signal() will
        generate thread-specific signals like tgkill(2) does.

        The default scope of the signal is thus determined by the type
        of the pidfd.

        Since use-cases exist where the default scope of the provided
        pidfd needs to be overridden the following flags are added to
        pidfd_send_signal():

         - PIDFD_SIGNAL_THREAD
           Send a thread-specific signal.

         - PIDFD_SIGNAL_THREAD_GROUP
           Send a thread-group directed signal.

         - PIDFD_SIGNAL_PROCESS_GROUP
           Send a process-group directed signal.

        The scope change will only work if the struct pid is actually
        used for this scope.

        For example, in order to send a thread-group directed signal
        the provided pidfd must be used as a thread-group leader and
        similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be
        used as a process group leader.

 - Move pidfds from the anonymous inode infrastructure to a tiny pseudo
   filesystem. This will unblock further work that we weren't able to
   do simply because of the very justified limitations of anonymous
   inodes.

   Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds
   to become useful for the first time. They can now be compared by
   inode numbers, which are unique for the system lifetime.

   Instead of stashing struct pid in file->private_data we can now
   stash it in inode->i_private. This makes it possible to introduce
   concepts that operate on a process once all file descriptors have
   been closed. A concrete example is kill-on-last-close. Another
   side-effect is that file->private_data is now freed up for per-file
   options for pidfds.

   Now, each struct pid will refer to a different inode but the same
   struct pid will refer to the same inode if it's opened multiple
   times, in contrast to now where each struct pid refers to the same
   inode.

   The tiny pseudo filesystem is not visible anywhere in userspace
   exactly like e.g., pipefs and sockfs. There's no lookup, there's no
   complex inode operations, nothing. Dentries and inodes are always
   deleted when the last pidfd is closed.

   We allocate a new inode and dentry for each struct pid and we reuse
   that inode and dentry for all pidfds that refer to the same struct
   pid. The code is entirely optional and fairly small. If it's not
   selected we fall back to anonymous inodes. Heavily inspired by nsfs.

   The dentry and inode allocation mechanism is moved into generic
   infrastructure that is now shared between nsfs and pidfs. The
   path_from_stashed() helper must be provided with a stashing
   location, an inode number, a mount, and the private data that is
   supposed to be used and it will provide a path that can be passed to
   dentry_open().

   The helper will try to retrieve an existing dentry from the provided
   stashing location. If a valid dentry is found it is reused. If not a
   new one is allocated and we try to stash it in the provided
   location. If this fails we retry until we either find an existing
   dentry or the newly allocated dentry could be stashed. Subsequent
   openers of the same namespace or task are then able to reuse it.

 - Currently it is only possible to get notified when a task has
   exited, i.e., become a zombie and userspace gets notified with
   EPOLLIN. We now also support waiting until the task has been reaped,
   notifying userspace with EPOLLHUP.

 - Ensure that ESRCH is reported for getfd if a task is exiting instead
   of the confusing EBADF.

 - Various smaller cleanups to pidfd functions.

* tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
  libfs: improve path_from_stashed()
  libfs: add stashed_dentry_prune()
  libfs: improve path_from_stashed() helper
  pidfs: convert to path_from_stashed() helper
  nsfs: convert to path_from_stashed() helper
  libfs: add path_from_stashed()
  pidfd: add pidfs
  pidfd: move struct pidfd_fops
  pidfd: allow to override signal scope in pidfd_send_signal()
  pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
  signal: fill in si_code in prepare_kill_siginfo()
  selftests: add ESRCH tests for pidfd_getfd()
  pidfd: getfd should always report ESRCH if a task is exiting
  pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
  pidfd: exit: kill the no longer used thread_group_exited()
  pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
  pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
  pidfd: kill the no longer needed do_notify_pidfd() in de_thread()
  pidfd_poll: report POLLHUP when pid_task() == NULL
  pidfd: implement PIDFD_THREAD flag for pidfd_open()
  ...
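
The signal-scope rules in the pull message can be sketched as a small userspace model. This is not the kernel implementation: `resolve_signal_scope()` and `enum signal_scope` are hypothetical names, the flag values are assumed to mirror `include/uapi/linux/pidfd.h`, and rejecting more than one scope flag with `-EINVAL` is an assumption of this sketch. It also does not model the additional rule that the struct pid must actually be used as a thread-group or process-group leader.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values; check <linux/pidfd.h> on your system. */
#ifndef PIDFD_SIGNAL_THREAD
#define PIDFD_SIGNAL_THREAD		(1U << 0)
#define PIDFD_SIGNAL_THREAD_GROUP	(1U << 1)
#define PIDFD_SIGNAL_PROCESS_GROUP	(1U << 2)
#endif

#define SCOPE_FLAGS_ALL (PIDFD_SIGNAL_THREAD | PIDFD_SIGNAL_THREAD_GROUP | \
			 PIDFD_SIGNAL_PROCESS_GROUP)

/* Hypothetical type for this sketch only. */
enum signal_scope {
	SCOPE_THREAD,		/* like tgkill(2) */
	SCOPE_THREAD_GROUP,	/* like kill(2) */
	SCOPE_PROCESS_GROUP,
};

/*
 * Hypothetical helper: pick the signal scope from the pidfd type and the
 * flags passed to pidfd_send_signal(). Returns 0 or a negative errno.
 */
static int resolve_signal_scope(bool pidfd_is_thread, unsigned int flags,
				enum signal_scope *scope)
{
	assert(scope != NULL);

	if (flags & ~SCOPE_FLAGS_ALL)
		return -EINVAL;

	switch (flags) {
	case 0:
		/* No override: the type of the pidfd decides. */
		*scope = pidfd_is_thread ? SCOPE_THREAD : SCOPE_THREAD_GROUP;
		return 0;
	case PIDFD_SIGNAL_THREAD:
		*scope = SCOPE_THREAD;
		return 0;
	case PIDFD_SIGNAL_THREAD_GROUP:
		*scope = SCOPE_THREAD_GROUP;
		return 0;
	case PIDFD_SIGNAL_PROCESS_GROUP:
		*scope = SCOPE_PROCESS_GROUP;
		return 0;
	default:
		/* Assumption: combining scope flags is rejected. */
		return -EINVAL;
	}
}
```

With no flags, a thread-group leader pidfd resolves to `SCOPE_THREAD_GROUP` and a `PIDFD_THREAD` pidfd to `SCOPE_THREAD`; passing one of the three flags overrides that default.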
2 parents 54126fa + e9c5263 commit b5683a3

21 files changed, +686 -298 lines changed

fs/Kconfig

Lines changed: 7 additions & 0 deletions
@@ -173,6 +173,13 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PID
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	default y
+	help
+	  Pidfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM

fs/Makefile

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ obj-y := open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o mnt_idmapping.o remap_range.o
+		kernel_read_file.o mnt_idmapping.o remap_range.o pidfs.o
 
 obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
 obj-$(CONFIG_PROC_FS) += proc_namespace.o

fs/exec.c

Lines changed: 0 additions & 1 deletion
@@ -1158,7 +1158,6 @@ static int de_thread(struct task_struct *tsk)
 
 		BUG_ON(leader->exit_state != EXIT_ZOMBIE);
 		leader->exit_state = EXIT_DEAD;
-
 		/*
 		 * We are going to release_task()->ptrace_unlink() silently,
 		 * the tracer can sleep in do_wait(). EXIT_DEAD guarantees

fs/internal.h

Lines changed: 7 additions & 0 deletions
@@ -310,3 +310,10 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+struct stashed_operations {
+	void (*put_data)(void *data);
+	void (*init_inode)(struct inode *inode, void *data);
+};
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, void *data, struct path *path);
+void stashed_dentry_prune(struct dentry *dentry);

fs/libfs.c

Lines changed: 142 additions & 0 deletions
@@ -23,6 +23,7 @@
 #include <linux/fsnotify.h>
 #include <linux/unicode.h>
 #include <linux/fscrypt.h>
+#include <linux/pidfs.h>
 
 #include <linux/uaccess.h>
 
@@ -1985,3 +1986,144 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
 	return ts;
 }
 EXPORT_SYMBOL(simple_inode_init_ts);
+
+static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
+{
+	struct dentry *dentry;
+
+	guard(rcu)();
+	dentry = READ_ONCE(stashed);
+	if (!dentry)
+		return NULL;
+	if (!lockref_get_not_dead(&dentry->d_lockref))
+		return NULL;
+	return dentry;
+}
+
+static struct dentry *prepare_anon_dentry(struct dentry **stashed,
+					  unsigned long ino,
+					  struct super_block *sb,
+					  void *data)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+	const struct stashed_operations *sops = sb->s_fs_info;
+
+	dentry = d_alloc_anon(sb);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	inode = new_inode_pseudo(sb);
+	if (!inode) {
+		dput(dentry);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	inode->i_ino = ino;
+	inode->i_flags |= S_IMMUTABLE;
+	inode->i_mode = S_IFREG;
+	simple_inode_init_ts(inode);
+	sops->init_inode(inode, data);
+
+	/* Notice when this is changed. */
+	WARN_ON_ONCE(!S_ISREG(inode->i_mode));
+	WARN_ON_ONCE(!IS_IMMUTABLE(inode));
+
+	/* Store address of location where dentry's supposed to be stashed. */
+	dentry->d_fsdata = stashed;
+
+	/* @data is now owned by the fs */
+	d_instantiate(dentry, inode);
+	return dentry;
+}
+
+static struct dentry *stash_dentry(struct dentry **stashed,
+				   struct dentry *dentry)
+{
+	guard(rcu)();
+	for (;;) {
+		struct dentry *old;
+
+		/* Assume any old dentry was cleared out. */
+		old = cmpxchg(stashed, NULL, dentry);
+		if (likely(!old))
+			return dentry;
+
+		/* Check if somebody else installed a reusable dentry. */
+		if (lockref_get_not_dead(&old->d_lockref))
+			return old;
+
+		/* There's an old dead dentry there, try to take it over. */
+		if (likely(try_cmpxchg(stashed, &old, dentry)))
+			return dentry;
+	}
+}
+
+/**
+ * path_from_stashed - create path from stashed or new dentry
+ * @stashed: where to retrieve or stash dentry
+ * @ino: inode number to use
+ * @mnt: mnt of the filesystems to use
+ * @data: data to store in inode->i_private
+ * @path: path to create
+ *
+ * The function tries to retrieve a stashed dentry from @stashed. If the dentry
+ * is still valid then it will be reused. If the dentry isn't able the function
+ * will allocate a new dentry and inode. It will then check again whether it
+ * can reuse an existing dentry in case one has been added in the meantime or
+ * update @stashed with the newly added dentry.
+ *
+ * Special-purpose helper for nsfs and pidfs.
+ *
+ * Return: On success zero and on failure a negative error is returned.
+ */
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, void *data, struct path *path)
+{
+	struct dentry *dentry;
+	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;
+
+	/* See if dentry can be reused. */
+	path->dentry = get_stashed_dentry(*stashed);
+	if (path->dentry) {
+		sops->put_data(data);
+		goto out_path;
+	}
+
+	/* Allocate a new dentry. */
+	dentry = prepare_anon_dentry(stashed, ino, mnt->mnt_sb, data);
+	if (IS_ERR(dentry)) {
+		sops->put_data(data);
+		return PTR_ERR(dentry);
+	}
+
+	/* Added a new dentry. @data is now owned by the filesystem. */
+	path->dentry = stash_dentry(stashed, dentry);
+	if (path->dentry != dentry)
+		dput(dentry);
+
+out_path:
+	WARN_ON_ONCE(path->dentry->d_fsdata != stashed);
+	WARN_ON_ONCE(d_inode(path->dentry)->i_private != data);
+	path->mnt = mntget(mnt);
+	return 0;
+}
+
+void stashed_dentry_prune(struct dentry *dentry)
+{
+	struct dentry **stashed = dentry->d_fsdata;
+	struct inode *inode = d_inode(dentry);
+
+	if (WARN_ON_ONCE(!stashed))
+		return;
+
+	if (!inode)
+		return;
+
+	/*
+	 * Only replace our own @dentry as someone else might've
+	 * already cleared out @dentry and stashed their own
+	 * dentry in there.
+	 */
+	cmpxchg(stashed, dentry, NULL);
+}
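
The retry loop in stash_dentry() and the conditional clear in stashed_dentry_prune() can be modelled in userspace with C11 atomics. This is a rough sketch, not kernel code: `struct node`, `node_get_not_dead()`, `stash_node()` and `prune_node()` are hypothetical stand-ins, a negative count is used to mean "dead", and the RCU protection that keeps a dead dentry's memory valid in the kernel is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dentry: count < 0 means "dead". */
struct node {
	atomic_int count;
};

/* Rough model of lockref_get_not_dead(): take a reference unless dead. */
static bool node_get_not_dead(struct node *n)
{
	int old;

	assert(n != NULL);
	old = atomic_load(&n->count);
	while (old >= 0) {
		/* On failure @old is reloaded and the liveness check repeats. */
		if (atomic_compare_exchange_weak(&n->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Model of stash_dentry(): returns the node the caller ends up using. */
static struct node *stash_node(_Atomic(struct node *) *slot,
			       struct node *fresh)
{
	for (;;) {
		struct node *old = NULL;

		/* Assume the slot is empty and try to install ours. */
		if (atomic_compare_exchange_strong(slot, &old, fresh))
			return fresh;

		/* Somebody else installed a live node: reuse it. */
		if (node_get_not_dead(old))
			return old;

		/* A dead node is still stashed: try to take the slot over. */
		if (atomic_compare_exchange_strong(slot, &old, fresh))
			return fresh;
	}
}

/* Model of stashed_dentry_prune(): clear the slot only if it holds @n. */
static void prune_node(_Atomic(struct node *) *slot, struct node *n)
{
	struct node *expected = n;

	atomic_compare_exchange_strong(slot, &expected, (struct node *)NULL);
}
```

The compare-and-swap in `prune_node()` is what makes a late prune harmless: if another opener has already replaced the dead entry, the slot no longer matches and is left alone.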

fs/nsfs.c

Lines changed: 39 additions & 82 deletions
@@ -27,26 +27,17 @@ static const struct file_operations ns_file_operations = {
 static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
 {
 	struct inode *inode = d_inode(dentry);
-	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;
+	struct ns_common *ns = inode->i_private;
+	const struct proc_ns_operations *ns_ops = ns->ops;
 
 	return dynamic_dname(buffer, buflen, "%s:[%lu]",
 		ns_ops->name, inode->i_ino);
 }
 
-static void ns_prune_dentry(struct dentry *dentry)
-{
-	struct inode *inode = d_inode(dentry);
-	if (inode) {
-		struct ns_common *ns = inode->i_private;
-		atomic_long_set(&ns->stashed, 0);
-	}
-}
-
-const struct dentry_operations ns_dentry_operations =
-{
-	.d_prune = ns_prune_dentry,
+const struct dentry_operations ns_dentry_operations = {
 	.d_delete = always_delete_dentry,
 	.d_dname = ns_dname,
+	.d_prune = stashed_dentry_prune,
 };
 
 static void nsfs_evict(struct inode *inode)
@@ -56,67 +47,16 @@ static void nsfs_evict(struct inode *inode)
 	ns->ops->put(ns);
 }
 
-static int __ns_get_path(struct path *path, struct ns_common *ns)
-{
-	struct vfsmount *mnt = nsfs_mnt;
-	struct dentry *dentry;
-	struct inode *inode;
-	unsigned long d;
-
-	rcu_read_lock();
-	d = atomic_long_read(&ns->stashed);
-	if (!d)
-		goto slow;
-	dentry = (struct dentry *)d;
-	if (!lockref_get_not_dead(&dentry->d_lockref))
-		goto slow;
-	rcu_read_unlock();
-	ns->ops->put(ns);
-got_it:
-	path->mnt = mntget(mnt);
-	path->dentry = dentry;
-	return 0;
-slow:
-	rcu_read_unlock();
-	inode = new_inode_pseudo(mnt->mnt_sb);
-	if (!inode) {
-		ns->ops->put(ns);
-		return -ENOMEM;
-	}
-	inode->i_ino = ns->inum;
-	simple_inode_init_ts(inode);
-	inode->i_flags |= S_IMMUTABLE;
-	inode->i_mode = S_IFREG | S_IRUGO;
-	inode->i_fop = &ns_file_operations;
-	inode->i_private = ns;
-
-	dentry = d_make_root(inode); /* not the normal use, but... */
-	if (!dentry)
-		return -ENOMEM;
-	dentry->d_fsdata = (void *)ns->ops;
-	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
-	if (d) {
-		d_delete(dentry); /* make sure ->d_prune() does nothing */
-		dput(dentry);
-		cpu_relax();
-		return -EAGAIN;
-	}
-	goto got_it;
-}
-
 int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 		     void *private_data)
 {
-	int ret;
+	struct ns_common *ns;
 
-	do {
-		struct ns_common *ns = ns_get_cb(private_data);
-		if (!ns)
-			return -ENOENT;
-		ret = __ns_get_path(path, ns);
-	} while (ret == -EAGAIN);
+	ns = ns_get_cb(private_data);
+	if (!ns)
+		return -ENOENT;
 
-	return ret;
+	return path_from_stashed(&ns->stashed, ns->inum, nsfs_mnt, ns, path);
 }
 
 struct ns_get_path_task_args {
@@ -146,6 +86,7 @@ int open_related_ns(struct ns_common *ns,
 		   struct ns_common *(*get_ns)(struct ns_common *ns))
 {
 	struct path path = {};
+	struct ns_common *relative;
 	struct file *f;
 	int err;
 	int fd;
@@ -154,19 +95,15 @@ int open_related_ns(struct ns_common *ns,
 	if (fd < 0)
 		return fd;
 
-	do {
-		struct ns_common *relative;
-
-		relative = get_ns(ns);
-		if (IS_ERR(relative)) {
-			put_unused_fd(fd);
-			return PTR_ERR(relative);
-		}
-
-		err = __ns_get_path(&path, relative);
-	} while (err == -EAGAIN);
+	relative = get_ns(ns);
+	if (IS_ERR(relative)) {
+		put_unused_fd(fd);
+		return PTR_ERR(relative);
+	}
 
-	if (err) {
+	err = path_from_stashed(&relative->stashed, relative->inum, nsfs_mnt,
+				relative, &path);
+	if (err < 0) {
 		put_unused_fd(fd);
 		return err;
 	}
@@ -249,7 +186,8 @@ bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
 static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
 {
 	struct inode *inode = d_inode(dentry);
-	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;
+	const struct ns_common *ns = inode->i_private;
+	const struct proc_ns_operations *ns_ops = ns->ops;
 
 	seq_printf(seq, "%s:[%lu]", ns_ops->name, inode->i_ino);
 	return 0;
@@ -261,13 +199,32 @@ static const struct super_operations nsfs_ops = {
 	.show_path = nsfs_show_path,
 };
 
+static void nsfs_init_inode(struct inode *inode, void *data)
+{
+	inode->i_private = data;
+	inode->i_mode |= S_IRUGO;
+	inode->i_fop = &ns_file_operations;
+}
+
+static void nsfs_put_data(void *data)
+{
+	struct ns_common *ns = data;
+	ns->ops->put(ns);
+}
+
+static const struct stashed_operations nsfs_stashed_ops = {
+	.init_inode = nsfs_init_inode,
+	.put_data = nsfs_put_data,
+};
+
 static int nsfs_init_fs_context(struct fs_context *fc)
 {
 	struct pseudo_fs_context *ctx = init_pseudo(fc, NSFS_MAGIC);
 	if (!ctx)
 		return -ENOMEM;
 	ctx->ops = &nsfs_ops;
 	ctx->dops = &ns_dentry_operations;
+	fc->s_fs_info = (void *)&nsfs_stashed_ops;
 	return 0;
 }
