Skip to content

Commit 1c5976e

Browse files
Christian Braunerkees
authored andcommitted
binfmt_misc: cleanup on filesystem umount
Currently, registering a new binary type pins the binfmt_misc filesystem. Specifically, this means that as long as there is at least one binary type registered the binfmt_misc filesystem survives all umounts, i.e. the superblock is not destroyed. Meaning that a umount followed by another mount will end up with the same superblock and the same binary type handlers. This is a behavior we tend to discourage for any new filesystems (apart from a few special filesystems such as e.g. configfs or debugfs). A umount operation without the filesystem being pinned - by e.g. someone holding a file descriptor to an open file - should usually result in the destruction of the superblock and all associated resources. This makes introspection easier and leads to clearly defined, simple and clean semantics. An administrator can rely on the fact that a umount will guarantee a clean slate making it possible to reinitialize a filesystem. Right now all binary types would need to be explicitly deleted before that can happen. This allows us to remove the heavy-handed calls to simple_pin_fs() and simple_release_fs() when creating and deleting binary types. This in turn allows us to replace the current brittle pinning mechanism abusing dget() which has caused a range of bugs judging from prior fixes in [2] and [3]. The additional dget() in load_misc_binary() pins the dentry but only does so for the sake to prevent ->evict_inode() from freeing the node when a user removes the binary type and kill_node() is run. Which would mean ->interpreter and ->interp_file would be freed causing a UAF. This isn't really nicely documented nor is it very clean because it relies on simple_pin_fs() pinning the filesystem as long as at least one binary type exists. Otherwise it would cause load_misc_binary() to hold on to a dentry belonging to a superblock that has been shutdown. Replace that implicit pinning with a clean and simple per-node refcount and get rid of the ugly dget() pinning. A similar mechanism exists for e.g. binderfs (cf. [4]). All the cleanup work can now be done in ->evict_inode(). In a follow-up patch we will make it possible to use binfmt_misc in sandboxes. We will use the cleaner semantics where a umount for the filesystem will cause the superblock and all resources to be deallocated. In preparation for this apply the same semantics to the initial binfmt_misc mount. Note, that this is a user-visible change and as such a uapi change but one that we can reasonably risk. We've discussed this in earlier versions of this patchset (cf. [1]). The main user and provider of binfmt_misc is systemd. Systemd provides binfmt_misc via autofs since it is configurable as a kernel module and is used by a few exotic packages and users. As such a binfmt_misc mount is triggered when /proc/sys/fs/binfmt_misc is accessed and is only provided on demand. Other autofs on demand filesystems include EFI ESP which systemd umounts if the mountpoint stays idle for a certain amount of time. This doesn't apply to the binfmt_misc autofs mount which isn't touched once it is mounted meaning this change can't accidently wipe binary type handlers without someone having explicitly unmounted binfmt_misc. After speaking to systemd folks they don't expect this change to affect them. In line with our general policy, if we see a regression for systemd or other users with this change we will switch back to the old behavior for the initial binfmt_misc mount and have binary types pin the filesystem again. But while we touch this code let's take the chance and let's improve on the status quo. [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu [2]: commit 43a4f26 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()" [3]: commit 83f9182 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()") [4]: commit f0fe2c0 ("binder: prevent UAF for binderfs devices II") Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1) Cc: Sargun Dhillon <sargun@sargun.me> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Henning Schild <henning.schild@siemens.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Laurent Vivier <laurent@vivier.eu> Cc: linux-fsdevel@vger.kernel.org Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> --- /* v2 */ - Christian Brauner <christian.brauner@ubuntu.com>: - Add more comments that explain what's going on. - Rename functions while changing them to better reflect what they are doing to make the code easier to understand. - In the first version when a specific binary type handler was removed either through a write to the entry's file or all binary type handlers were removed by a write to the binfmt_misc mount's status file all cleanup work happened during inode eviction. That includes removal of the relevant entries from entry list. While that works fine I disliked that model after thinking about it for a bit. Because it means that there was a window were someone has already removed a or all binary handlers but they could still be safely reached from load_misc_binary() when it has managed to take the read_lock() on the entries list while inode eviction was already happening. Again, that perfectly benign but it's cleaner to remove the binary handler from the list immediately meaning that ones the write to then entry's file or the binfmt_misc status file returns the binary type cannot be executed anymore. That gives stronger guarantees to the user.
1 parent 553e41d commit 1c5976e

File tree

1 file changed

+168
-48
lines changed

1 file changed

+168
-48
lines changed

fs/binfmt_misc.c

Lines changed: 168 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -60,12 +60,11 @@ typedef struct {
6060
char *name;
6161
struct dentry *dentry;
6262
struct file *interp_file;
63+
refcount_t users; /* sync removal with load_misc_binary() */
6364
} Node;
6465

6566
static DEFINE_RWLOCK(entries_lock);
6667
static struct file_system_type bm_fs_type;
67-
static struct vfsmount *bm_mnt;
68-
static int entry_count;
6968

7069
/*
7170
* Max length of the register string. Determined by:
@@ -82,19 +81,23 @@ static int entry_count;
8281
*/
8382
#define MAX_REGISTER_LENGTH 1920
8483

85-
/*
86-
* Check if we support the binfmt
87-
* if we do, return the node, else NULL
88-
* locking is done in load_misc_binary
84+
/**
85+
* search_binfmt_handler - search for a binary handler for @bprm
86+
* @misc: handle to binfmt_misc instance
87+
* @bprm: binary for which we are looking for a handler
88+
*
89+
* Search for a binary type handler for @bprm in the list of registered binary
90+
* type handlers.
91+
*
92+
* Return: binary type list entry on success, NULL on failure
8993
*/
90-
static Node *check_file(struct linux_binprm *bprm)
94+
static Node *search_binfmt_handler(struct linux_binprm *bprm)
9195
{
9296
char *p = strrchr(bprm->interp, '.');
93-
struct list_head *l;
97+
Node *e;
9498

9599
/* Walk all the registered handlers. */
96-
list_for_each(l, &entries) {
97-
Node *e = list_entry(l, Node, list);
100+
list_for_each_entry(e, &entries, list) {
98101
char *s;
99102
int j;
100103

@@ -123,9 +126,49 @@ static Node *check_file(struct linux_binprm *bprm)
123126
if (j == e->size)
124127
return e;
125128
}
129+
126130
return NULL;
127131
}
128132

133+
/**
134+
* get_binfmt_handler - try to find a binary type handler
135+
* @misc: handle to binfmt_misc instance
136+
* @bprm: binary for which we are looking for a handler
137+
*
138+
* Try to find a binfmt handler for the binary type. If one is found take a
139+
* reference to protect against removal via bm_{entry,status}_write().
140+
*
141+
* Return: binary type list entry on success, NULL on failure
142+
*/
143+
static Node *get_binfmt_handler(struct linux_binprm *bprm)
144+
{
145+
Node *e;
146+
147+
read_lock(&entries_lock);
148+
e = search_binfmt_handler(bprm);
149+
if (e)
150+
refcount_inc(&e->users);
151+
read_unlock(&entries_lock);
152+
return e;
153+
}
154+
155+
/**
156+
* put_binfmt_handler - put binary handler node
157+
* @e: node to put
158+
*
159+
* Free node syncing with load_misc_binary() and defer final free to
160+
* load_misc_binary() in case it is using the binary type handler we were
161+
* requested to remove.
162+
*/
163+
static void put_binfmt_handler(Node *e)
164+
{
165+
if (refcount_dec_and_test(&e->users)) {
166+
if (e->flags & MISC_FMT_OPEN_FILE)
167+
filp_close(e->interp_file, NULL);
168+
kfree(e);
169+
}
170+
}
171+
129172
/*
130173
* the loader itself
131174
*/
@@ -139,12 +182,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
139182
if (!enabled)
140183
return retval;
141184

142-
/* to keep locking time low, we copy the interpreter string */
143-
read_lock(&entries_lock);
144-
fmt = check_file(bprm);
145-
if (fmt)
146-
dget(fmt->dentry);
147-
read_unlock(&entries_lock);
185+
fmt = get_binfmt_handler(bprm);
148186
if (!fmt)
149187
return retval;
150188

@@ -198,7 +236,16 @@ static int load_misc_binary(struct linux_binprm *bprm)
198236

199237
retval = 0;
200238
ret:
201-
dput(fmt->dentry);
239+
240+
/*
241+
* If we actually put the node here all concurrent calls to
242+
* load_misc_binary() will have finished. We also know
243+
* that for the refcount to be zero ->evict_inode() must have removed
244+
* the node to be deleted from the list. All that is left for us is to
245+
* close and free.
246+
*/
247+
put_binfmt_handler(fmt);
248+
202249
return retval;
203250
}
204251

@@ -552,30 +599,90 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
552599
return inode;
553600
}
554601

602+
/**
603+
* bm_evict_inode - cleanup data associated with @inode
604+
* @inode: inode to which the data is attached
605+
*
606+
* Cleanup the binary type handler data associated with @inode if a binary type
607+
* entry is removed or the filesystem is unmounted and the super block is
608+
* shutdown.
609+
*
610+
* If the ->evict call was not caused by a super block shutdown but by a write
611+
* to remove the entry or all entries via bm_{entry,status}_write() the entry
612+
* will have already been removed from the list. We keep the list_empty() check
613+
* to make that explicit.
614+
*/
555615
static void bm_evict_inode(struct inode *inode)
556616
{
557617
Node *e = inode->i_private;
558618

559-
if (e && e->flags & MISC_FMT_OPEN_FILE)
560-
filp_close(e->interp_file, NULL);
561-
562619
clear_inode(inode);
563-
kfree(e);
620+
621+
if (e) {
622+
write_lock(&entries_lock);
623+
if (!list_empty(&e->list))
624+
list_del_init(&e->list);
625+
write_unlock(&entries_lock);
626+
put_binfmt_handler(e);
627+
}
564628
}
565629

566-
static void kill_node(Node *e)
630+
/**
631+
* unlink_binfmt_dentry - remove the dentry for the binary type handler
632+
* @dentry: dentry associated with the binary type handler
633+
*
634+
* Do the actual filesystem work to remove a dentry for a registered binary
635+
* type handler. Since binfmt_misc only allows simple files to be created
636+
* directly under the root dentry of the filesystem we ensure that we are
637+
* indeed passed a dentry directly beneath the root dentry, that the inode
638+
* associated with the root dentry is locked, and that it is a regular file we
639+
* are asked to remove.
640+
*/
641+
static void unlink_binfmt_dentry(struct dentry *dentry)
567642
{
568-
struct dentry *dentry;
643+
struct dentry *parent = dentry->d_parent;
644+
struct inode *inode, *parent_inode;
645+
646+
/* All entries are immediate descendants of the root dentry. */
647+
if (WARN_ON_ONCE(dentry->d_sb->s_root != parent))
648+
return;
569649

650+
/* We only expect to be called on regular files. */
651+
inode = d_inode(dentry);
652+
if (WARN_ON_ONCE(!S_ISREG(inode->i_mode)))
653+
return;
654+
655+
/* The parent inode must be locked. */
656+
parent_inode = d_inode(parent);
657+
if (WARN_ON_ONCE(!inode_is_locked(parent_inode)))
658+
return;
659+
660+
if (simple_positive(dentry)) {
661+
dget(dentry);
662+
simple_unlink(parent_inode, dentry);
663+
d_delete(dentry);
664+
dput(dentry);
665+
}
666+
}
667+
668+
/**
669+
* remove_binfmt_handler - remove a binary type handler
670+
* @misc: handle to binfmt_misc instance
671+
* @e: binary type handler to remove
672+
*
673+
* Remove a binary type handler from the list of binary type handlers and
674+
* remove its associated dentry. This is called from
675+
* binfmt_{entry,status}_write(). In the future, we might want to think about
676+
* adding a proper ->unlink() method to binfmt_misc instead of forcing caller's
677+
* to use writes to files in order to delete binary type handlers. But it has
678+
* worked for so long that it's not a pressing issue.
679+
*/
680+
static void remove_binfmt_handler(Node *e)
681+
{
570682
write_lock(&entries_lock);
571683
list_del_init(&e->list);
572684
write_unlock(&entries_lock);
573-
574-
dentry = e->dentry;
575-
drop_nlink(d_inode(dentry));
576-
d_drop(dentry);
577-
dput(dentry);
578-
simple_release_fs(&bm_mnt, &entry_count);
685+
unlink_binfmt_dentry(e->dentry);
579686
}
580687

581688
/* /<entry> */
@@ -602,8 +709,8 @@ bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
602709
static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
603710
size_t count, loff_t *ppos)
604711
{
605-
struct dentry *root;
606-
Node *e = file_inode(file)->i_private;
712+
struct inode *inode = file_inode(file);
713+
Node *e = inode->i_private;
607714
int res = parse_command(buffer, count);
608715

609716
switch (res) {
@@ -617,13 +724,22 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
617724
break;
618725
case 3:
619726
/* Delete this handler. */
620-
root = file_inode(file)->i_sb->s_root;
621-
inode_lock(d_inode(root));
727+
inode = d_inode(inode->i_sb->s_root);
728+
inode_lock(inode);
622729

730+
/*
731+
* In order to add new element or remove elements from the list
732+
* via bm_{entry,register,status}_write() inode_lock() on the
733+
* root inode must be held.
734+
* The lock is exclusive ensuring that the list can't be
735+
* modified. Only load_misc_binary() can access but does so
736+
* read-only. So we only need to take the write lock when we
737+
* actually remove the entry from the list.
738+
*/
623739
if (!list_empty(&e->list))
624-
kill_node(e);
740+
remove_binfmt_handler(e);
625741

626-
inode_unlock(d_inode(root));
742+
inode_unlock(inode);
627743
break;
628744
default:
629745
return res;
@@ -682,13 +798,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
682798
if (!inode)
683799
goto out2;
684800

685-
err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
686-
if (err) {
687-
iput(inode);
688-
inode = NULL;
689-
goto out2;
690-
}
691-
801+
refcount_set(&e->users, 1);
692802
e->dentry = dget(dentry);
693803
inode->i_private = e;
694804
inode->i_fop = &bm_entry_operations;
@@ -732,7 +842,8 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
732842
size_t count, loff_t *ppos)
733843
{
734844
int res = parse_command(buffer, count);
735-
struct dentry *root;
845+
Node *e, *next;
846+
struct inode *inode;
736847

737848
switch (res) {
738849
case 1:
@@ -745,13 +856,22 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
745856
break;
746857
case 3:
747858
/* Delete all handlers. */
748-
root = file_inode(file)->i_sb->s_root;
749-
inode_lock(d_inode(root));
859+
inode = d_inode(file_inode(file)->i_sb->s_root);
860+
inode_lock(inode);
750861

751-
while (!list_empty(&entries))
752-
kill_node(list_first_entry(&entries, Node, list));
862+
/*
863+
* In order to add new element or remove elements from the list
864+
* via bm_{entry,register,status}_write() inode_lock() on the
865+
* root inode must be held.
866+
* The lock is exclusive ensuring that the list can't be
867+
* modified. Only load_misc_binary() can access but does so
868+
* read-only. So we only need to take the write lock when we
869+
* actually remove the entry from the list.
870+
*/
871+
list_for_each_entry_safe(e, next, &entries, list)
872+
remove_binfmt_handler(e);
753873

754-
inode_unlock(d_inode(root));
874+
inode_unlock(inode);
755875
break;
756876
default:
757877
return res;

0 commit comments

Comments
 (0)