Commit 9f8413c
Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Yafang Shao added the task_get_cgroup1() helper to enable a similar
   BPF helper so that BPF progs can be more useful on cgroup1
   hierarchies. While cgroup1 is mostly in maintenance mode, this
   addition is very small while having outsized usefulness for users
   who are still on cgroup1. Yafang also optimized root cgroup list
   access by making it RCU protected in the process.

 - Waiman Long optimized rstat operation, leading to substantially
   lower and more consistent lock hold time while flushing the
   hierarchical statistics. As the lock can be acquired briefly in
   various hot paths, this reduction has cascading benefits.

 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs which are allocated to isolated partitions are now
   excluded from running unbound work items, and the cpu_is_isolated()
   test, which is used by vmstat and memcg to reduce interference, now
   includes cpuset-isolated CPUs. While it isn't there yet, the hope is
   to eventually reach parity with the isolation level provided by the
   `isolcpus` boot param, but in a dynamic manner.

* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Move rcu_head up near the top of cgroup_root
  cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
  cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
  cgroup/rstat: Optimize cgroup_rstat_updated_list()
  cgroup: Fix documentation for cpu.idle
  cgroup/cpuset: Expose cpuset.cpus.isolated
  workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
  cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
  cgroup/cpuset: Keep track of CPUs in isolated partitions
  selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  selftests: cgroup: Fixes a typo in a comment
  cgroup: Add a new helper for cgroup1 hierarchy
  cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
  cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
  cgroup: Make operations on the cgroup root_list RCU safe
  cgroup: Remove unnecessary list_empty()
2 parents bfe8eb3 + a7fb042

14 files changed: +708 −283 lines

Documentation/admin-guide/cgroup-v2.rst
Lines changed: 27 additions & 6 deletions

@@ -1093,7 +1093,11 @@ All time durations are in microseconds.
 	A read-write single value file which exists on non-root
 	cgroups. The default is "100".
 
-	The weight in the range [1, 10000].
+	For non idle groups (cpu.idle = 0), the weight is in the
+	range [1, 10000].
+
+	If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
+	then the weight will show as a 0.
 
   cpu.weight.nice
 	A read-write single value file which exists on non-root
@@ -1157,6 +1161,16 @@ All time durations are in microseconds.
 	values similar to the sched_setattr(2). This maximum utilization
 	value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.idle
+	A read-write single value file which exists on non-root cgroups.
+	The default is 0.
+
+	This is the cgroup analog of the per-task SCHED_IDLE sched policy.
+	Setting this value to a 1 will make the scheduling policy of the
+	cgroup SCHED_IDLE. The threads inside the cgroup will retain their
+	own relative priorities, but the cgroup itself will be treated as
+	very low priority relative to its peers.
+
 
 
 Memory
@@ -2316,6 +2330,13 @@ Cpuset Interface Files
 	treated to have an implicit value of "cpuset.cpus" in the
 	formation of local partition.
 
+  cpuset.cpus.isolated
+	A read-only and root cgroup only multiple values file.
+
+	This file shows the set of all isolated CPUs used in existing
+	isolated partitions. It will be empty if no isolated partition
+	is created.
+
   cpuset.cpus.partition
 	A read-write single value file which exists on non-root
 	cpuset-enabled cgroups. This flag is owned by the parent cgroup
@@ -2358,11 +2379,11 @@ Cpuset Interface Files
 	partition or scheduling domain. The set of exclusive CPUs is
 	determined by the value of its "cpuset.cpus.exclusive.effective".
 
-	When set to "isolated", the CPUs in that partition will
-	be in an isolated state without any load balancing from the
-	scheduler. Tasks placed in such a partition with multiple
-	CPUs should be carefully distributed and bound to each of the
-	individual CPUs for optimal performance.
+	When set to "isolated", the CPUs in that partition will be in
+	an isolated state without any load balancing from the scheduler
+	and excluded from the unbound workqueues. Tasks placed in such
+	a partition with multiple CPUs should be carefully distributed
+	and bound to each of the individual CPUs for optimal performance.
 
 	A partition root ("root" or "isolated") can be in one of the
 	two possible states - valid or invalid. An invalid partition
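
The new cpuset.cpus.isolated file is plain text at the cgroup2 root, so it can be read like any other cgroup interface file. A minimal reader sketch, assuming cgroup2 is mounted at /sys/fs/cgroup (the mountpoint and this program are illustrative, not part of the commit):

#include <stdio.h>

int main(void)
{
	/* assumed cgroup2 mountpoint; the file is empty when no
	 * isolated partition exists */
	FILE *f = fopen("/sys/fs/cgroup/cpuset.cpus.isolated", "r");
	char buf[256];

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("isolated CPUs: %s", buf);	/* e.g. "2-3,7" */
	fclose(f);
	return 0;
}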

include/linux/cgroup-defs.h
Lines changed: 18 additions & 3 deletions

@@ -496,6 +496,20 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * Add padding to separate the read mostly rstat_cpu and
+	 * rstat_css_list into a different cacheline from the following
+	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 */
+	CACHELINE_PADDING(_pad_);
+
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup *rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
@@ -548,6 +562,10 @@ struct cgroup_root {
 	/* Unique id for this hierarchy. */
 	int hierarchy_id;
 
+	/* A list running through the active hierarchies */
+	struct list_head root_list;
+	struct rcu_head rcu;	/* Must be near the top */
+
 	/*
 	 * The root cgroup. The containing cgroup_root will be destroyed on its
 	 * release. cgrp->ancestors[0] will be used overflowing into the
@@ -561,9 +579,6 @@
 	/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
 	atomic_t nr_cgrps;
 
-	/* A list running through the active hierarchies */
-	struct list_head root_list;
-
 	/* Hierarchy-specific flags */
 	unsigned int flags;
 
include/linux/cgroup.h
Lines changed: 3 additions & 1 deletion

@@ -69,6 +69,7 @@ struct css_task_iter {
 extern struct file_system_type cgroup_fs_type;
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
+extern spinlock_t css_set_lock;
 
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
 #include <linux/cgroup_subsys.h>
@@ -386,7 +387,6 @@ static inline void cgroup_unlock(void)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
 		rcu_read_lock_sched_held() ||				\
@@ -853,4 +853,6 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 
 #endif /* CONFIG_CGROUP_BPF */
 
+struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);
+
 #endif /* _LINUX_CGROUP_H */

include/linux/cpuset.h
Lines changed: 6 additions & 0 deletions

@@ -77,6 +77,7 @@ extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -207,6 +208,11 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
 	return false;
 }
 
+static inline bool cpuset_cpu_is_isolated(int cpu)
+{
+	return false;
+}
+
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;

include/linux/sched/isolation.h
Lines changed: 3 additions & 1 deletion

@@ -2,6 +2,7 @@
 #define _LINUX_SCHED_ISOLATION_H
 
 #include <linux/cpumask.h>
+#include <linux/cpuset.h>
 #include <linux/init.h>
 #include <linux/tick.h>
 
@@ -67,7 +68,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
 static inline bool cpu_is_isolated(int cpu)
 {
 	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
-	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
+	       cpuset_cpu_is_isolated(cpu);
 }
 
 #endif /* _LINUX_SCHED_ISOLATION_H */
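
After this change, cpu_is_isolated() reports a CPU as isolated if any of three sources says so: the HK_TYPE_DOMAIN mask, the HK_TYPE_TICK mask, or a cpuset isolated partition. A standalone model of the combined predicate, with stub arrays standing in for the kernel helpers (purely illustrative):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for housekeeping_test_cpu() and cpuset_cpu_is_isolated() */
static bool hk_domain[4]  = { true,  true,  true, true };
static bool hk_tick[4]    = { true,  true,  true, true };
static bool cpuset_iso[4] = { false, false, true, false };	/* cpu2 in an isolated partition */

static bool cpu_is_isolated(int cpu)
{
	return !hk_domain[cpu] || !hk_tick[cpu] || cpuset_iso[cpu];
}

int main(void)
{
	for (int cpu = 0; cpu < 4; cpu++)
		printf("cpu%d: %s\n", cpu,
		       cpu_is_isolated(cpu) ? "isolated" : "housekeeping");
	return 0;
}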

include/linux/workqueue.h
Lines changed: 1 addition & 1 deletion

@@ -491,7 +491,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void);
 void free_workqueue_attrs(struct workqueue_attrs *attrs);
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs);
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
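
The new export is consumed by cpuset when isolated partitions change, handing the workqueue code a mask of CPUs that unbound work must avoid. A hedged kernel-context sketch of such a caller (the isolated_cpus mask and surrounding error handling are assumptions; see "cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask" for the real call site):

	/* kernel-context fragment, not a standalone program */
	cpumask_var_t excl;
	int ret;

	if (!alloc_cpumask_var(&excl, GFP_KERNEL))
		return -ENOMEM;
	cpumask_copy(excl, isolated_cpus);	/* CPUs in isolated partitions (assumed mask) */
	ret = workqueue_unbound_exclude_cpumask(excl);	/* removed from wq_unbound_cpumask */
	free_cpumask_var(excl);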

kernel/cgroup/cgroup-internal.h
Lines changed: 2 additions & 2 deletions

@@ -164,13 +164,13 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)					\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;
 
 /* iterate across the hierarchies */
 #define for_each_root(root)						\
-	list_for_each_entry((root), &cgroup_roots, root_list)
+	list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
+				lockdep_is_held(&cgroup_mutex))
 
 /**
  * for_each_subsys - iterate all enabled cgroup subsystems

kernel/cgroup/cgroup-v1.c
Lines changed: 34 additions & 0 deletions

@@ -1262,6 +1262,40 @@ int cgroup1_get_tree(struct fs_context *fc)
 	return ret;
 }
 
+/**
+ * task_get_cgroup1 - Acquires the associated cgroup of a task within a
+ * specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its
+ * hierarchy ID.
+ * @tsk: The target task
+ * @hierarchy_id: The ID of a cgroup1 hierarchy
+ *
+ * On success, the cgroup is returned. On failure, ERR_PTR is returned.
+ * We limit it to cgroup1 only.
+ */
+struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id)
+{
+	struct cgroup *cgrp = ERR_PTR(-ENOENT);
+	struct cgroup_root *root;
+	unsigned long flags;
+
+	rcu_read_lock();
+	for_each_root(root) {
+		/* cgroup1 only*/
+		if (root == &cgrp_dfl_root)
+			continue;
+		if (root->hierarchy_id != hierarchy_id)
+			continue;
+		spin_lock_irqsave(&css_set_lock, flags);
+		cgrp = task_cgroup_from_root(tsk, root);
+		if (!cgrp || !cgroup_tryget(cgrp))
+			cgrp = ERR_PTR(-ENOENT);
+		spin_unlock_irqrestore(&css_set_lock, flags);
+		break;
+	}
+	rcu_read_unlock();
+	return cgrp;
+}
+
 static int __init cgroup1_wq_init(void)
 {
 	/*
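
On success the helper returns the cgroup with a reference taken via cgroup_tryget(), so every successful call must be paired with cgroup_put(). A hedged kernel-context usage sketch (the surrounding function and hierarchy_id value are assumptions):

	struct cgroup *cgrp;

	cgrp = task_get_cgroup1(current, hierarchy_id);
	if (IS_ERR(cgrp))
		return PTR_ERR(cgrp);	/* -ENOENT: no such hierarchy, or root gone */
	/* ... inspect cgrp, e.g. cgroup_id(cgrp) or its ancestry ... */
	cgroup_put(cgrp);	/* drop the reference cgroup_tryget() took */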

kernel/cgroup/cgroup.c
Lines changed: 30 additions & 15 deletions

@@ -1315,7 +1315,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root)
 
 void cgroup_free_root(struct cgroup_root *root)
 {
-	kfree(root);
+	kfree_rcu(root, rcu);
 }
 
 static void cgroup_destroy_root(struct cgroup_root *root)
@@ -1347,10 +1347,9 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	spin_unlock_irq(&css_set_lock);
 
-	if (!list_empty(&root->root_list)) {
-		list_del(&root->root_list);
-		cgroup_root_count--;
-	}
+	WARN_ON_ONCE(list_empty(&root->root_list));
+	list_del_rcu(&root->root_list);
+	cgroup_root_count--;
 
 	if (!have_favordynmods)
 		cgroup_favor_dynmods(root, false);
@@ -1390,7 +1389,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 		}
 	}
 
-	BUG_ON(!res_cgroup);
+	/*
+	 * If cgroup_mutex is not held, the cgrp_cset_link will be freed
+	 * before we remove the cgroup root from the root_list. Consequently,
+	 * when accessing a cgroup root, the cset_link may have already been
+	 * freed, resulting in a NULL res_cgroup. However, by holding the
+	 * cgroup_mutex, we ensure that res_cgroup can't be NULL.
+	 * If we don't hold cgroup_mutex in the caller, we must do the NULL
+	 * check.
+	 */
 	return res_cgroup;
 }
 
@@ -1413,6 +1420,11 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 
 	rcu_read_unlock();
 
+	/*
+	 * The namespace_sem is held by current, so the root cgroup can't
+	 * be umounted. Therefore, we can ensure that the res is non-NULL.
+	 */
+	WARN_ON_ONCE(!res);
 	return res;
 }
 
@@ -1449,15 +1461,16 @@ static struct cgroup *current_cgns_cgroup_dfl(void)
 static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 					    struct cgroup_root *root)
 {
-	lockdep_assert_held(&cgroup_mutex);
 	lockdep_assert_held(&css_set_lock);
 
 	return __cset_cgroup_from_root(cset, root);
 }
 
 /*
  * Return the cgroup for "task" from the given hierarchy. Must be
- * called with cgroup_mutex and css_set_lock held.
+ * called with css_set_lock held to prevent task's groups from being modified.
+ * Must be called with either cgroup_mutex or rcu read lock to prevent the
+ * cgroup root from being destroyed.
  */
 struct cgroup *task_cgroup_from_root(struct task_struct *task,
 				     struct cgroup_root *root)
@@ -2032,7 +2045,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 	struct cgroup_root *root = ctx->root;
 	struct cgroup *cgrp = &root->cgrp;
 
-	INIT_LIST_HEAD(&root->root_list);
+	INIT_LIST_HEAD_RCU(&root->root_list);
 	atomic_set(&root->nr_cgrps, 1);
 	cgrp->root = root;
 	init_cgroup_housekeeping(cgrp);
@@ -2115,7 +2128,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	 * care of subsystems' refcounts, which are explicitly dropped in
 	 * the failure exit path.
	 */
-	list_add(&root->root_list, &cgroup_roots);
+	list_add_rcu(&root->root_list, &cgroup_roots);
 	cgroup_root_count++;
 
 	/*
@@ -6265,7 +6278,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;
 
-	cgroup_lock();
+	rcu_read_lock();
 	spin_lock_irq(&css_set_lock);
 
 	for_each_root(root) {
@@ -6276,6 +6289,11 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
 			continue;
 
+		cgrp = task_cgroup_from_root(tsk, root);
+		/* The root has already been unmounted. */
+		if (!cgrp)
+			continue;
+
 		seq_printf(m, "%d:", root->hierarchy_id);
 		if (root != &cgrp_dfl_root)
 			for_each_subsys(ss, ssid)
@@ -6286,9 +6304,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%sname=%s", count ? "," : "",
 				   root->name);
 		seq_putc(m, ':');
-
-		cgrp = task_cgroup_from_root(tsk, root);
-
 		/*
 		 * On traditional hierarchies, all zombie tasks show up as
 		 * belonging to the root cgroup. On the default hierarchy,
@@ -6320,7 +6335,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	retval = 0;
 out_unlock:
 	spin_unlock_irq(&css_set_lock);
-	cgroup_unlock();
+	rcu_read_unlock();
 	kfree(buf);
 out:
 	return retval;
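
The net effect is a relaxed locking rule for task_cgroup_from_root(): css_set_lock still pins the task's css_set links, but either cgroup_mutex or an RCU read-side section is now enough to keep the root alive, and a NULL return signals a concurrently unmounted root. A condensed sketch of the pattern proc_cgroup_show() now follows (kernel-context fragment mirroring the diff above, not a standalone program):

	rcu_read_lock();
	spin_lock_irq(&css_set_lock);
	for_each_root(root) {
		struct cgroup *cgrp = task_cgroup_from_root(tsk, root);

		if (!cgrp)	/* root already unmounted; RCU kept it from being freed */
			continue;
		/* ... emit one "hierarchy-id:controllers:path" line ... */
	}
	spin_unlock_irq(&css_set_lock);
	rcu_read_unlock();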
