author     Linus Torvalds <torvalds@linux-foundation.org>  2013-04-29 22:07:40 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2013-04-29 22:07:40 -0400
commit     46d9be3e5eb01f71fc02653755d970247174b400 (patch)
tree       01534c9ebfa5f52a7133e34354d2831fe6704f15 /kernel
parent     ce8aa48929449b491149b6c87861ac69cb797a42 (diff)
parent     cece95dfe5aa56ba99e51b4746230ff0b8542abd (diff)
Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "A lot of activity on the workqueue side this time.  The changes achieve
  the following.

  - WQ_UNBOUND workqueues - the workqueues which are not bound to any
    specific CPU - are updated to be able to interface with multiple
    backend worker pools.  This involved a lot of churning but the end
    result seems actually neater as unbound workqueues are now a lot
    closer to per-cpu ones.

  - The ability to interface with multiple backend worker pools is used
    to implement unbound workqueues with custom attributes.  Currently
    the supported attributes are the nice level and CPU affinity.  It may
    be expanded to include cgroup association in the future.  The
    attributes can be specified either by calling apply_workqueue_attrs()
    or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question
    is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes.  When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools alone.

    This allows converting custom worker pool implementations which want
    worker attribute tuning to use workqueues.  The writeback pool is
    already converted in the block tree and a couple of others, including
    the btrfs io workers, are likely to follow.

  - WQ_UNBOUND's ability to bind to multiple worker pools is also used to
    make it NUMA-aware.  Because there's no association between the work
    item issuer and the specific worker assigned to execute it, using an
    unbound workqueue used to lead to unnecessary cross-node bouncing;
    autonuma couldn't help either, as it requires tasks to have implicit
    node affinity while workers are assigned randomly.

    After these changes, an unbound workqueue binds to multiple
    NUMA-affine worker pools so that queued work items are executed on
    the same node.  This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity: encrypting data across different
    nodes can add noticeable overhead, doing it per-cpu was too limiting
    for certain cases, and IO throughput could be bottlenecked by one CPU
    being fully occupied while others had idle cycles.

  While the new features required a lot of changes, including
  restructuring the locking, they didn't complicate the execution paths
  much.  The unbound workqueue handling is now closer to the per-cpu one,
  and the new features are implemented by simply associating a workqueue
  with different sets of backend worker pools without changing the queue,
  execution or flush paths.

  As such, even though the amount of change is very high, I feel
  relatively safe in that it isn't likely to cause subtle issues with
  basic correctness of work item execution and handling.  If something is
  wrong, it's likely to show up as work items being associated with
  worker pools that have the wrong attributes, or as an oops while
  workqueue attributes are being changed or during CPU hotplug.

  While this creates more backend worker pools, it doesn't add too many
  more workers unless, of course, there are many workqueues with unique
  combinations of attributes.  Assuming everything else is the same, NUMA
  awareness costs an extra worker pool per NUMA node with online CPUs.

  There are also a couple of things which are being routed outside the
  workqueue tree.

  - The block tree pulled in workqueue for-3.10 so that the writeback
    worker pool can be converted to an unbound workqueue with sysfs
    control exposed.  This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning the nice level and CPU affinity via
    sysfs.

  - The conversion to workqueue means that there's no longer a 1:1
    association between writeback and a specific worker, which makes the
    writeback folks unhappy as they want to be able to tell which
    filesystem caused a problem from a backtrace on systems with many
    filesystems mounted.  This is resolved by allowing work items to set
    a debug info string which is printed when the task is dumped.  As
    this change involves unifying the implementations of dump_stack() and
    friends in arch code, it's being routed through Andrew's -mm tree."

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
  workqueue: use kmem_cache_free() instead of kfree()
  workqueue: avoid false negative WARN_ON() in destroy_workqueue()
  workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  workqueue: implement NUMA affinity for unbound workqueues
  workqueue: introduce put_pwq_unlocked()
  workqueue: introduce numa_pwq_tbl_install()
  workqueue: use NUMA-aware allocation for pool_workqueues
  workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  workqueue: map an unbound workqueues to multiple per-node pool_workqueues
  workqueue: move hot fields of workqueue_struct to the end
  workqueue: make workqueue->name[] fixed len
  workqueue: add workqueue->unbound_attrs
  workqueue: determine NUMA node of workers accourding to the allowed cpumask
  workqueue: drop 'H' from kworker names of unbound worker pools
  workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
  workqueue: fix memory leak in apply_workqueue_attrs()
  workqueue: fix unbound workqueue attrs hashing / comparison
  workqueue: fix race condition in unbound workqueue free path
  workqueue: remove pwq_lock which is no longer used
  ...
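Editor's note - as a rough illustration of the attribute interface described in the log above (not part of this merge, only a sketch against the 3.10-era in-kernel API, where alloc_workqueue_attrs() still takes a gfp_t and apply_workqueue_attrs() is available to built-in code), an unbound workqueue could be given a custom nice level and cpumask along these lines; error handling is abbreviated:

    #include <linux/workqueue.h>
    #include <linux/cpumask.h>

    /* hypothetical init fragment, names are placeholders */
    static struct workqueue_struct *example_wq;

    static int __init example_init(void)
    {
            struct workqueue_attrs *attrs;
            int ret;

            /* WQ_SYSFS also exposes the knobs under /sys/bus/workqueue/ */
            example_wq = alloc_workqueue("example_unbound",
                                         WQ_UNBOUND | WQ_SYSFS, 0);
            if (!example_wq)
                    return -ENOMEM;

            attrs = alloc_workqueue_attrs(GFP_KERNEL);  /* gfp arg as of 3.10 */
            if (!attrs) {
                    destroy_workqueue(example_wq);
                    return -ENOMEM;
            }

            attrs->nice = -5;                            /* worker nice level */
            cpumask_copy(attrs->cpumask, cpumask_of(0)); /* confine workers to CPU 0 */

            /* rebind the wq to a backend pool with matching attributes */
            ret = apply_workqueue_attrs(example_wq, attrs);
            free_workqueue_attrs(attrs);
            if (ret)
                    destroy_workqueue(example_wq);
            return ret;
    }

Applying attributes rebinds the workqueue to a worker pool keyed by those attributes; work items already executing stay on their previous pools, as the log explains.  The NUMA affinity described above is on by default and can be disabled system-wide with the workqueue.disable_numa boot parameter, which shows up later in this diff as module_param_named(disable_numa, ...).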
Diffstat (limited to 'kernel')
-rw-r--r--   kernel/cgroup.c                 4
-rw-r--r--   kernel/cpuset.c                16
-rw-r--r--   kernel/kthread.c                2
-rw-r--r--   kernel/sched/core.c             9
-rw-r--r--   kernel/workqueue.c           2828
-rw-r--r--   kernel/workqueue_internal.h     9
6 files changed, 1985 insertions(+), 883 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dfaf50d4705e..1f628bc039f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2224,11 +2224,11 @@ retry_find_task:
 	tsk = tsk->group_leader;
 
 	/*
-	 * Workqueue threads may acquire PF_THREAD_BOUND and become
+	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
 	 * trapped in a cpuset, or RT worker may be born in a cgroup
 	 * with no rt_runtime allocated.  Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_THREAD_BOUND)) {
+	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		rcu_read_unlock();
 		goto out_unlock_cgroup;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 334d983a36b2..027a6f65f2ad 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1388,16 +1388,16 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 
 	cgroup_taskset_for_each(task, cgrp, tset) {
 		/*
-		 * Kthreads bound to specific cpus cannot be moved to a new
-		 * cpuset; we cannot change their cpu affinity and
-		 * isolating such threads by their set of allowed nodes is
-		 * unnecessary.  Thus, cpusets are not applicable for such
-		 * threads.  This prevents checking for success of
-		 * set_cpus_allowed_ptr() on all attached tasks before
-		 * cpus_allowed may be changed.
+		 * Kthreads which disallow setaffinity shouldn't be moved
+		 * to a new cpuset; we don't want to change their cpu
+		 * affinity and isolating such threads by their set of
+		 * allowed nodes is unnecessary.  Thus, cpusets are not
+		 * applicable for such threads.  This prevents checking for
+		 * success of set_cpus_allowed_ptr() on all attached tasks
+		 * before cpus_allowed may be changed.
 		 */
 		ret = -EINVAL;
-		if (task->flags & PF_THREAD_BOUND)
+		if (task->flags & PF_NO_SETAFFINITY)
 			goto out_unlock;
 		ret = security_task_setscheduler(task);
 		if (ret)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 9b12d65186f7..16d8ddd268b1 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -278,7 +278,7 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
 	}
 	/* It's safe because the task is inactive. */
 	do_set_cpus_allowed(p, cpumask_of(cpu));
-	p->flags |= PF_THREAD_BOUND;
+	p->flags |= PF_NO_SETAFFINITY;
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42053547e0f5..d8285eb0cde6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4083,6 +4083,10 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	get_task_struct(p);
 	rcu_read_unlock();
 
+	if (p->flags & PF_NO_SETAFFINITY) {
+		retval = -EINVAL;
+		goto out_put_task;
+	}
 	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
 		retval = -ENOMEM;
 		goto out_put_task;
@@ -4730,11 +4734,6 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 		goto out;
 	}
 
-	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	do_set_cpus_allowed(p, new_mask);
 
 	/* Can the task run on the task's current CPU?  If so, we're done */
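Editor's note - taken together, the two sched/core.c hunks move the restriction from the in-kernel set_cpus_allowed_ptr() path to the sched_setaffinity() syscall path: the kernel may still rebind a PF_NO_SETAFFINITY kthread internally, while userspace is refused before any cpumask allocation or permission check.  A minimal userspace sketch (the kworker PID below is a placeholder, not from this commit) that is expected to fail with EINVAL after this change:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int main(void)
    {
            pid_t kworker_pid = 12345;   /* placeholder: look up a real kworker PID */
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(0, &set);

            /* expected: -1 with errno == EINVAL once PF_NO_SETAFFINITY is set */
            if (sched_setaffinity(kworker_pid, sizeof(set), &set) == -1)
                    printf("sched_setaffinity: %s\n", strerror(errno));
            return 0;
    }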
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b48cd597145d..154aa12af48e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,7 +41,11 @@
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
 #include <linux/idr.h>
+#include <linux/jhash.h>
 #include <linux/hashtable.h>
+#include <linux/rculist.h>
+#include <linux/nodemask.h>
+#include <linux/moduleparam.h>
 
 #include "workqueue_internal.h"
 
@@ -58,12 +62,11 @@ enum {
  * %WORKER_UNBOUND set and concurrency management disabled, and may
  * be executing on any CPU.  The pool behaves as an unbound one.
  *
- * Note that DISASSOCIATED can be flipped only while holding
- * assoc_mutex to avoid changing binding state while
+ * Note that DISASSOCIATED should be flipped only while holding
+ * manager_mutex to avoid changing binding state while
  * create_worker() is in progress.
  */
 	POOL_MANAGE_WORKERS	= 1 << 0,	/* need to manage workers */
-	POOL_MANAGING_WORKERS	= 1 << 1,	/* managing workers */
 	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
 	POOL_FREEZING		= 1 << 3,	/* freeze in progress */
 
@@ -74,12 +77,14 @@ enum {
 	WORKER_PREP		= 1 << 3,	/* preparing to run works */
 	WORKER_CPU_INTENSIVE	= 1 << 6,	/* cpu intensive */
 	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
+	WORKER_REBOUND		= 1 << 8,	/* worker was rebound */
 
-	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_UNBOUND |
-				  WORKER_CPU_INTENSIVE,
+	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE |
+				  WORKER_UNBOUND | WORKER_REBOUND,
 
 	NR_STD_WORKER_POOLS	= 2,		/* # standard pools per cpu */
 
+	UNBOUND_POOL_HASH_ORDER	= 6,		/* hashed by pool->attrs */
 	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */
 
 	MAX_IDLE_WORKERS_RATIO	= 4,		/* 1/4 of busy can be idle */
@@ -97,6 +102,8 @@ enum {
 	 */
 	RESCUER_NICE_LEVEL	= -20,
 	HIGHPRI_NICE_LEVEL	= -20,
+
+	WQ_NAME_LEN		= 24,
 };
 
 /*
@@ -115,16 +122,26 @@ enum {
  * cpu or grabbing pool->lock is enough for read access.  If
  * POOL_DISASSOCIATED is set, it's identical to L.
  *
- * F: wq->flush_mutex protected.
+ * MG: pool->manager_mutex and pool->lock protected.  Writes require both
+ *     locks.  Reads can happen under either lock.
+ *
+ * PL: wq_pool_mutex protected.
+ *
+ * PR: wq_pool_mutex protected for writes.  Sched-RCU protected for reads.
+ *
+ * WQ: wq->mutex protected.
  *
- * W: workqueue_lock protected.
+ * WR: wq->mutex protected for writes.  Sched-RCU protected for reads.
+ *
+ * MD: wq_mayday_lock protected.
  */
 
 /* struct worker is defined in workqueue_internal.h */
 
 struct worker_pool {
 	spinlock_t		lock;		/* the pool lock */
-	unsigned int		cpu;		/* I: the associated cpu */
+	int			cpu;		/* I: the associated cpu */
+	int			node;		/* I: the associated node ID */
 	int			id;		/* I: pool ID */
 	unsigned int		flags;		/* X: flags */
 
@@ -138,12 +155,18 @@ struct worker_pool {
 	struct timer_list	idle_timer;	/* L: worker idle timeout */
 	struct timer_list	mayday_timer;	/* L: SOS timer for workers */
 
-	/* workers are chained either in busy_hash or idle_list */
+	/* a workers is either on busy_hash or idle_list, or the manager */
 	DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
 						/* L: hash of busy workers */
 
-	struct mutex		assoc_mutex;	/* protect POOL_DISASSOCIATED */
-	struct ida		worker_ida;	/* L: for worker IDs */
+	/* see manage_workers() for details on the two manager mutexes */
+	struct mutex		manager_arb;	/* manager arbitration */
+	struct mutex		manager_mutex;	/* manager exclusion */
+	struct idr		worker_idr;	/* MG: worker IDs and iteration */
+
+	struct workqueue_attrs	*attrs;		/* I: worker attributes */
+	struct hlist_node	hash_node;	/* PL: unbound_pool_hash node */
+	int			refcnt;		/* PL: refcnt for unbound pools */
 
 	/*
 	 * The current concurrency level.  As it's likely to be accessed
@@ -151,6 +174,12 @@ struct worker_pool {
 	 * cacheline.
 	 */
 	atomic_t		nr_running ____cacheline_aligned_in_smp;
+
+	/*
+	 * Destruction of pool is sched-RCU protected to allow dereferences
+	 * from get_work_pool().
+	 */
+	struct rcu_head		rcu;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -164,75 +193,107 @@ struct pool_workqueue {
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
 	int			flush_color;	/* L: flushing color */
+	int			refcnt;		/* L: reference count */
 	int			nr_in_flight[WORK_NR_COLORS];
 						/* L: nr of in_flight works */
 	int			nr_active;	/* L: nr of active works */
 	int			max_active;	/* L: max active works */
 	struct list_head	delayed_works;	/* L: delayed works */
-};
+	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
+	struct list_head	mayday_node;	/* MD: node on wq->maydays */
+
+	/*
+	 * Release of unbound pwq is punted to system_wq.  See put_pwq()
+	 * and pwq_unbound_release_workfn() for details.  pool_workqueue
+	 * itself is also sched-RCU protected so that the first pwq can be
+	 * determined without grabbing wq->mutex.
+	 */
+	struct work_struct	unbound_release_work;
+	struct rcu_head		rcu;
+} __aligned(1 << WORK_STRUCT_FLAG_BITS);
 
 /*
  * Structure used to wait for workqueue flush.
  */
 struct wq_flusher {
-	struct list_head	list;		/* F: list of flushers */
-	int			flush_color;	/* F: flush color waiting for */
+	struct list_head	list;		/* WQ: list of flushers */
+	int			flush_color;	/* WQ: flush color waiting for */
 	struct completion	done;		/* flush completion */
 };
 
-/*
- * All cpumasks are assumed to be always set on UP and thus can't be
- * used to determine whether there's something to be done.
- */
-#ifdef CONFIG_SMP
-typedef cpumask_var_t mayday_mask_t;
-#define mayday_test_and_set_cpu(cpu, mask)	\
-	cpumask_test_and_set_cpu((cpu), (mask))
-#define mayday_clear_cpu(cpu, mask)		cpumask_clear_cpu((cpu), (mask))
-#define for_each_mayday_cpu(cpu, mask)		for_each_cpu((cpu), (mask))
-#define alloc_mayday_mask(maskp, gfp)		zalloc_cpumask_var((maskp), (gfp))
-#define free_mayday_mask(mask)			free_cpumask_var((mask))
-#else
-typedef unsigned long mayday_mask_t;
-#define mayday_test_and_set_cpu(cpu, mask)	test_and_set_bit(0, &(mask))
-#define mayday_clear_cpu(cpu, mask)		clear_bit(0, &(mask))
-#define for_each_mayday_cpu(cpu, mask)		if ((cpu) = 0, (mask))
-#define alloc_mayday_mask(maskp, gfp)		true
-#define free_mayday_mask(mask)			do { } while (0)
-#endif
+struct wq_device;
 
 /*
- * The externally visible workqueue abstraction is an array of
- * per-CPU workqueues:
+ * The externally visible workqueue.  It relays the issued work items to
+ * the appropriate worker_pool through its pool_workqueues.
  */
 struct workqueue_struct {
-	unsigned int		flags;		/* W: WQ_* flags */
-	union {
-		struct pool_workqueue __percpu	*pcpu;
-		struct pool_workqueue		*single;
-		unsigned long			v;
-	} pool_wq;				/* I: pwq's */
-	struct list_head	list;		/* W: list of all workqueues */
-
-	struct mutex		flush_mutex;	/* protects wq flushing */
-	int			work_color;	/* F: current work color */
-	int			flush_color;	/* F: current flush color */
+	struct list_head	pwqs;		/* WR: all pwqs of this wq */
+	struct list_head	list;		/* PL: list of all workqueues */
+
+	struct mutex		mutex;		/* protects this wq */
+	int			work_color;	/* WQ: current work color */
+	int			flush_color;	/* WQ: current flush color */
 	atomic_t		nr_pwqs_to_flush; /* flush in progress */
-	struct wq_flusher	*first_flusher;	/* F: first flusher */
-	struct list_head	flusher_queue;	/* F: flush waiters */
-	struct list_head	flusher_overflow; /* F: flush overflow list */
+	struct wq_flusher	*first_flusher;	/* WQ: first flusher */
+	struct list_head	flusher_queue;	/* WQ: flush waiters */
+	struct list_head	flusher_overflow; /* WQ: flush overflow list */
 
-	mayday_mask_t		mayday_mask;	/* cpus requesting rescue */
+	struct list_head	maydays;	/* MD: pwqs requesting rescue */
 	struct worker		*rescuer;	/* I: rescue worker */
 
-	int			nr_drainers;	/* W: drain in progress */
-	int			saved_max_active; /* W: saved pwq max_active */
+	int			nr_drainers;	/* WQ: drain in progress */
+	int			saved_max_active; /* WQ: saved pwq max_active */
+
+	struct workqueue_attrs	*unbound_attrs;	/* WQ: only for unbound wqs */
+	struct pool_workqueue	*dfl_pwq;	/* WQ: only for unbound wqs */
+
+#ifdef CONFIG_SYSFS
+	struct wq_device	*wq_dev;	/* I: for sysfs interface */
+#endif
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
 #endif
-	char			name[];		/* I: workqueue name */
+	char			name[WQ_NAME_LEN]; /* I: workqueue name */
+
+	/* hot fields used during command issue, aligned to cacheline */
+	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
+	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
 };
 
+static struct kmem_cache *pwq_cache;
+
+static int wq_numa_tbl_len;		/* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+					/* possible CPUs of each node */
+
+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
+static bool wq_numa_enabled;		/* unbound NUMA affinity enabled */
+
+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
+static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
+static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
+
+static LIST_HEAD(workqueues);		/* PL: list of all workqueues */
+static bool workqueue_freezing;		/* PL: have wqs started freezing? */
+
+/* the per-cpu worker pools */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
+				     cpu_worker_pools);
+
+static DEFINE_IDR(worker_pool_idr);	/* PR: idr of all pools */
+
+/* PL: hash of all unbound pools keyed by pool->attrs */
+static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
+
+/* I: attributes used when instantiating standard unbound pools on demand */
+static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
+
 struct workqueue_struct *system_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_wq);
 struct workqueue_struct *system_highpri_wq __read_mostly;
@@ -244,64 +305,87 @@ EXPORT_SYMBOL_GPL(system_unbound_wq);
 struct workqueue_struct *system_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_freezable_wq);
 
+static int worker_thread(void *__worker);
+static void copy_workqueue_attrs(struct workqueue_attrs *to,
+				 const struct workqueue_attrs *from);
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/workqueue.h>
 
-#define for_each_std_worker_pool(pool, cpu)				\
-	for ((pool) = &std_worker_pools(cpu)[0];			\
-	     (pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
+#define assert_rcu_or_pool_mutex()					\
+	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
+			   lockdep_is_held(&wq_pool_mutex),		\
+			   "sched RCU or wq_pool_mutex should be held")
 
-#define for_each_busy_worker(worker, i, pool)				\
-	hash_for_each(pool->busy_hash, i, worker, hentry)
+#define assert_rcu_or_wq_mutex(wq)					\
+	rcu_lockdep_assert(rcu_read_lock_sched_held() ||		\
+			   lockdep_is_held(&wq->mutex),			\
+			   "sched RCU or wq->mutex should be held")
 
-static inline int __next_wq_cpu(int cpu, const struct cpumask *mask,
-				unsigned int sw)
-{
-	if (cpu < nr_cpu_ids) {
-		if (sw & 1) {
-			cpu = cpumask_next(cpu, mask);
-			if (cpu < nr_cpu_ids)
-				return cpu;
-		}
-		if (sw & 2)
-			return WORK_CPU_UNBOUND;
-	}
-	return WORK_CPU_END;
-}
+#ifdef CONFIG_LOCKDEP
+#define assert_manager_or_pool_lock(pool)				\
+	WARN_ONCE(debug_locks &&					\
+		  !lockdep_is_held(&(pool)->manager_mutex) &&		\
+		  !lockdep_is_held(&(pool)->lock),			\
+		  "pool->manager_mutex or ->lock should be held")
+#else
+#define assert_manager_or_pool_lock(pool)	do { } while (0)
+#endif
 
-static inline int __next_pwq_cpu(int cpu, const struct cpumask *mask,
-				 struct workqueue_struct *wq)
-{
-	return __next_wq_cpu(cpu, mask, !(wq->flags & WQ_UNBOUND) ? 1 : 2);
-}
+#define for_each_cpu_worker_pool(pool, cpu)				\
+	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
+	     (pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
+	     (pool)++)
 
-/*
- * CPU iterators
+/**
+ * for_each_pool - iterate through all worker_pools in the system
+ * @pool: iteration cursor
+ * @pi: integer used for iteration
  *
- * An extra cpu number is defined using an invalid cpu number
- * (WORK_CPU_UNBOUND) to host workqueues which are not bound to any
- * specific CPU.  The following iterators are similar to for_each_*_cpu()
- * iterators but also considers the unbound CPU.
+ * This must be called either with wq_pool_mutex held or sched RCU read
+ * locked.  If the pool needs to be used beyond the locking in effect, the
+ * caller is responsible for guaranteeing that the pool stays online.
  *
- * for_each_wq_cpu()		: possible CPUs + WORK_CPU_UNBOUND
- * for_each_online_wq_cpu()	: online CPUs + WORK_CPU_UNBOUND
- * for_each_pwq_cpu()		: possible CPUs for bound workqueues,
- *				  WORK_CPU_UNBOUND for unbound workqueues
+ * The if/else clause exists only for the lockdep assertion and can be
+ * ignored.
  */
-#define for_each_wq_cpu(cpu)						\
-	for ((cpu) = __next_wq_cpu(-1, cpu_possible_mask, 3);		\
-	     (cpu) < WORK_CPU_END;					\
-	     (cpu) = __next_wq_cpu((cpu), cpu_possible_mask, 3))
+#define for_each_pool(pool, pi)						\
+	idr_for_each_entry(&worker_pool_idr, pool, pi)			\
+		if (({ assert_rcu_or_pool_mutex(); false; })) { }	\
+		else
 
-#define for_each_online_wq_cpu(cpu)					\
-	for ((cpu) = __next_wq_cpu(-1, cpu_online_mask, 3);		\
-	     (cpu) < WORK_CPU_END;					\
-	     (cpu) = __next_wq_cpu((cpu), cpu_online_mask, 3))
+/**
+ * for_each_pool_worker - iterate through all workers of a worker_pool
+ * @worker: iteration cursor
+ * @wi: integer used for iteration
+ * @pool: worker_pool to iterate workers of
+ *
+ * This must be called with either @pool->manager_mutex or ->lock held.
+ *
+ * The if/else clause exists only for the lockdep assertion and can be
+ * ignored.
+ */
+#define for_each_pool_worker(worker, wi, pool)				\
+	idr_for_each_entry(&(pool)->worker_idr, (worker), (wi))		\
+		if (({ assert_manager_or_pool_lock((pool)); false; })) { } \
+		else
 
-#define for_each_pwq_cpu(cpu, wq)					\
-	for ((cpu) = __next_pwq_cpu(-1, cpu_possible_mask, (wq));	\
-	     (cpu) < WORK_CPU_END;					\
-	     (cpu) = __next_pwq_cpu((cpu), cpu_possible_mask, (wq)))
+/**
+ * for_each_pwq - iterate through all pool_workqueues of the specified workqueue
+ * @pwq: iteration cursor
+ * @wq: the target workqueue
+ *
+ * This must be called either with wq->mutex held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ *
+ * The if/else clause exists only for the lockdep assertion and can be
+ * ignored.
+ */
+#define for_each_pwq(pwq, wq)						\
+	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node)		\
+		if (({ assert_rcu_or_wq_mutex(wq); false; })) { }	\
+		else
 
 #ifdef CONFIG_DEBUG_OBJECTS_WORK
 
@@ -419,77 +503,35 @@ static inline void debug_work_activate(struct work_struct *work) { }
 static inline void debug_work_deactivate(struct work_struct *work) { }
 #endif
 
-/* Serializes the accesses to the list of workqueues. */
-static DEFINE_SPINLOCK(workqueue_lock);
-static LIST_HEAD(workqueues);
-static bool workqueue_freezing;		/* W: have wqs started freezing? */
-
-/*
- * The CPU and unbound standard worker pools.  The unbound ones have
- * POOL_DISASSOCIATED set, and their workers have WORKER_UNBOUND set.
- */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
-				     cpu_std_worker_pools);
-static struct worker_pool unbound_std_worker_pools[NR_STD_WORKER_POOLS];
-
-/* idr of all pools */
-static DEFINE_MUTEX(worker_pool_idr_mutex);
-static DEFINE_IDR(worker_pool_idr);
-
-static int worker_thread(void *__worker);
-
-static struct worker_pool *std_worker_pools(int cpu)
-{
-	if (cpu != WORK_CPU_UNBOUND)
-		return per_cpu(cpu_std_worker_pools, cpu);
-	else
-		return unbound_std_worker_pools;
-}
-
-static int std_worker_pool_pri(struct worker_pool *pool)
-{
-	return pool - std_worker_pools(pool->cpu);
-}
-
 /* allocate ID and assign it to @pool */
 static int worker_pool_assign_id(struct worker_pool *pool)
 {
 	int ret;
 
-	mutex_lock(&worker_pool_idr_mutex);
+	lockdep_assert_held(&wq_pool_mutex);
+
 	ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
-	if (ret >= 0)
+	if (ret >= 0) {
 		pool->id = ret;
-	mutex_unlock(&worker_pool_idr_mutex);
-
-	return ret < 0 ? ret : 0;
+		return 0;
+	}
+	return ret;
 }
 
-/*
- * Lookup worker_pool by id.  The idr currently is built during boot and
- * never modified.  Don't worry about locking for now.
+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
  */
-static struct worker_pool *worker_pool_by_id(int pool_id)
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+						  int node)
 {
-	return idr_find(&worker_pool_idr, pool_id);
-}
-
-static struct worker_pool *get_std_worker_pool(int cpu, bool highpri)
-{
-	struct worker_pool *pools = std_worker_pools(cpu);
-
-	return &pools[highpri];
-}
-
-static struct pool_workqueue *get_pwq(unsigned int cpu,
-				      struct workqueue_struct *wq)
-{
-	if (!(wq->flags & WQ_UNBOUND)) {
-		if (likely(cpu < nr_cpu_ids))
-			return per_cpu_ptr(wq->pool_wq.pcpu, cpu);
-	} else if (likely(cpu == WORK_CPU_UNBOUND))
-		return wq->pool_wq.single;
-	return NULL;
+	assert_rcu_or_wq_mutex(wq);
+	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
 }
 
 static unsigned int work_color_to_flags(int color)
@@ -531,7 +573,7 @@ static int work_next_color(int color)
 static inline void set_work_data(struct work_struct *work, unsigned long data,
 				 unsigned long flags)
 {
-	BUG_ON(!work_pending(work));
+	WARN_ON_ONCE(!work_pending(work));
 	atomic_long_set(&work->data, data | flags | work_static(work));
 }
 
@@ -583,13 +625,23 @@ static struct pool_workqueue *get_work_pwq(struct work_struct *work)
  * @work: the work item of interest
  *
  * Return the worker_pool @work was last associated with.  %NULL if none.
+ *
+ * Pools are created and destroyed under wq_pool_mutex, and allows read
+ * access under sched-RCU read lock.  As such, this function should be
+ * called under wq_pool_mutex or with preemption disabled.
+ *
+ * All fields of the returned pool are accessible as long as the above
+ * mentioned locking is in effect.  If the returned pool needs to be used
+ * beyond the critical section, the caller is responsible for ensuring the
+ * returned pool is and stays online.
  */
 static struct worker_pool *get_work_pool(struct work_struct *work)
 {
 	unsigned long data = atomic_long_read(&work->data);
-	struct worker_pool *pool;
 	int pool_id;
 
+	assert_rcu_or_pool_mutex();
+
 	if (data & WORK_STRUCT_PWQ)
 		return ((struct pool_workqueue *)
 			(data & WORK_STRUCT_WQ_DATA_MASK))->pool;
@@ -598,9 +650,7 @@ static struct worker_pool *get_work_pool(struct work_struct *work)
 	if (pool_id == WORK_OFFQ_POOL_NONE)
 		return NULL;
 
-	pool = worker_pool_by_id(pool_id);
-	WARN_ON_ONCE(!pool);
-	return pool;
+	return idr_find(&worker_pool_idr, pool_id);
 }
 
 /**
@@ -689,7 +739,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-	bool managing = pool->flags & POOL_MANAGING_WORKERS;
+	bool managing = mutex_is_locked(&pool->manager_arb);
 	int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
 	int nr_busy = pool->nr_workers - nr_idle;
 
@@ -744,7 +794,7 @@ static void wake_up_worker(struct worker_pool *pool)
  * CONTEXT:
  * spin_lock_irq(rq->lock)
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task, int cpu)
 {
 	struct worker *worker = kthread_data(task);
 
@@ -769,8 +819,7 @@ void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
  * RETURNS:
  * Worker task on @cpu to wake up, %NULL if none.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu)
+struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu)
 {
 	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
 	struct worker_pool *pool;
@@ -786,7 +835,8 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
 	pool = worker->pool;
 
 	/* this can only happen on the local cpu */
-	BUG_ON(cpu != raw_smp_processor_id());
+	if (WARN_ON_ONCE(cpu != raw_smp_processor_id()))
+		return NULL;
 
 	/*
 	 * The counterpart of the following dec_and_test, implied mb,
@@ -891,13 +941,12 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
  * recycled work item as currently executing and make it wait until the
  * current execution finishes, introducing an unwanted dependency.
  *
- * This function checks the work item address, work function and workqueue
- * to avoid false positives.  Note that this isn't complete as one may
- * construct a work function which can introduce dependency onto itself
- * through a recycled work item.  Well, if somebody wants to shoot oneself
- * in the foot that badly, there's only so much we can do, and if such
- * deadlock actually occurs, it should be easy to locate the culprit work
- * function.
+ * This function checks the work item address and work function to avoid
+ * false positives.  Note that this isn't complete as one may construct a
+ * work function which can introduce dependency onto itself through a
+ * recycled work item.  Well, if somebody wants to shoot oneself in the
+ * foot that badly, there's only so much we can do, and if such deadlock
+ * actually occurs, it should be easy to locate the culprit work function.
  *
  * CONTEXT:
  * spin_lock_irq(pool->lock).
@@ -961,6 +1010,64 @@ static void move_linked_works(struct work_struct *work, struct list_head *head,
 	*nextp = n;
 }
 
+/**
+ * get_pwq - get an extra reference on the specified pool_workqueue
+ * @pwq: pool_workqueue to get
+ *
+ * Obtain an extra reference on @pwq.  The caller should guarantee that
+ * @pwq has positive refcnt and be holding the matching pool->lock.
+ */
+static void get_pwq(struct pool_workqueue *pwq)
+{
+	lockdep_assert_held(&pwq->pool->lock);
+	WARN_ON_ONCE(pwq->refcnt <= 0);
+	pwq->refcnt++;
+}
+
+/**
+ * put_pwq - put a pool_workqueue reference
+ * @pwq: pool_workqueue to put
+ *
+ * Drop a reference of @pwq.  If its refcnt reaches zero, schedule its
+ * destruction.  The caller should be holding the matching pool->lock.
+ */
+static void put_pwq(struct pool_workqueue *pwq)
+{
+	lockdep_assert_held(&pwq->pool->lock);
+	if (likely(--pwq->refcnt))
+		return;
+	if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND)))
+		return;
+	/*
+	 * @pwq can't be released under pool->lock, bounce to
+	 * pwq_unbound_release_workfn().  This never recurses on the same
+	 * pool->lock as this path is taken only for unbound workqueues and
+	 * the release work item is scheduled on a per-cpu workqueue.  To
+	 * avoid lockdep warning, unbound pool->locks are given lockdep
+	 * subclass of 1 in get_unbound_pool().
+	 */
+	schedule_work(&pwq->unbound_release_work);
+}
+
+/**
+ * put_pwq_unlocked - put_pwq() with surrounding pool lock/unlock
+ * @pwq: pool_workqueue to put (can be %NULL)
+ *
+ * put_pwq() with locking.  This function also allows %NULL @pwq.
+ */
+static void put_pwq_unlocked(struct pool_workqueue *pwq)
+{
+	if (pwq) {
+		/*
+		 * As both pwqs and pools are sched-RCU protected, the
+		 * following lock operations are safe.
+		 */
+		spin_lock_irq(&pwq->pool->lock);
+		put_pwq(pwq);
+		spin_unlock_irq(&pwq->pool->lock);
+	}
+}
+
 static void pwq_activate_delayed_work(struct work_struct *work)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
@@ -992,9 +1099,9 @@ static void pwq_activate_first_delayed(struct pool_workqueue *pwq)
  */
 static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
 {
-	/* ignore uncolored works */
+	/* uncolored work items don't participate in flushing or nr_active */
 	if (color == WORK_NO_COLOR)
-		return;
+		goto out_put;
 
 	pwq->nr_in_flight[color]--;
 
@@ -1007,11 +1114,11 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
 
 	/* is flush in progress and are we at the flushing tip? */
 	if (likely(pwq->flush_color != color))
-		return;
+		goto out_put;
 
 	/* are there still in-flight works? */
 	if (pwq->nr_in_flight[color])
-		return;
+		goto out_put;
 
 	/* this pwq is done, clear flush_color */
 	pwq->flush_color = -1;
@@ -1022,6 +1129,8 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
 	 */
 	if (atomic_dec_and_test(&pwq->wq->nr_pwqs_to_flush))
 		complete(&pwq->wq->first_flusher->done);
+out_put:
+	put_pwq(pwq);
 }
 
 /**
@@ -1144,11 +1253,12 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 	/* we own @work, set data and link */
 	set_work_pwq(work, pwq, extra_flags);
 	list_add_tail(&work->entry, head);
+	get_pwq(pwq);
 
 	/*
-	 * Ensure either worker_sched_deactivated() sees the above
-	 * list_add_tail() or we see zero nr_running to avoid workers
-	 * lying around lazily while there are works to be processed.
+	 * Ensure either wq_worker_sleeping() sees the above
+	 * list_add_tail() or we see zero nr_running to avoid workers lying
+	 * around lazily while there are works to be processed.
 	 */
 	smp_mb();
 
@@ -1172,10 +1282,11 @@ static bool is_chained_work(struct workqueue_struct *wq)
 	return worker && worker->current_pwq->wq == wq;
 }
 
-static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
+static void __queue_work(int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct pool_workqueue *pwq;
+	struct worker_pool *last_pool;
 	struct list_head *worklist;
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
@@ -1191,48 +1302,62 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	debug_work_activate(work);
 
 	/* if dying, only works from the same workqueue are allowed */
-	if (unlikely(wq->flags & WQ_DRAINING) &&
+	if (unlikely(wq->flags & __WQ_DRAINING) &&
 	    WARN_ON_ONCE(!is_chained_work(wq)))
 		return;
+retry:
+	if (req_cpu == WORK_CPU_UNBOUND)
+		cpu = raw_smp_processor_id();
 
-	/* determine the pwq to use */
-	if (!(wq->flags & WQ_UNBOUND)) {
-		struct worker_pool *last_pool;
-
-		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
-
-		/*
-		 * It's multi cpu.  If @work was previously on a different
-		 * cpu, it might still be running there, in which case the
-		 * work needs to be queued on that cpu to guarantee
-		 * non-reentrancy.
-		 */
-		pwq = get_pwq(cpu, wq);
-		last_pool = get_work_pool(work);
+	/* pwq which will be used unless @work is executing elsewhere */
+	if (!(wq->flags & WQ_UNBOUND))
+		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
+	else
+		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
-		if (last_pool && last_pool != pwq->pool) {
-			struct worker *worker;
+	/*
+	 * If @work was previously on a different pool, it might still be
+	 * running there, in which case the work needs to be queued on that
+	 * pool to guarantee non-reentrancy.
+	 */
+	last_pool = get_work_pool(work);
+	if (last_pool && last_pool != pwq->pool) {
+		struct worker *worker;
 
 		spin_lock(&last_pool->lock);
 
 		worker = find_worker_executing_work(last_pool, work);
 
 		if (worker && worker->current_pwq->wq == wq) {
-			pwq = get_pwq(last_pool->cpu, wq);
-		} else {
-			/* meh... not running there, queue here */
-			spin_unlock(&last_pool->lock);
-			spin_lock(&pwq->pool->lock);
-		}
+			pwq = worker->current_pwq;
 		} else {
+			/* meh... not running there, queue here */
+			spin_unlock(&last_pool->lock);
 			spin_lock(&pwq->pool->lock);
 		}
 	} else {
-		pwq = get_pwq(WORK_CPU_UNBOUND, wq);
 		spin_lock(&pwq->pool->lock);
 	}
 
+	/*
+	 * pwq is determined and locked.  For unbound pools, we could have
+	 * raced with pwq release and it could already be dead.  If its
+	 * refcnt is zero, repeat pwq selection.  Note that pwqs never die
+	 * without another pwq replacing it in the numa_pwq_tbl or while
+	 * work items are executing on it, so the retrying is guaranteed to
+	 * make forward-progress.
+	 */
+	if (unlikely(!pwq->refcnt)) {
+		if (wq->flags & WQ_UNBOUND) {
+			spin_unlock(&pwq->pool->lock);
+			cpu_relax();
+			goto retry;
+		}
+		/* oops */
+		WARN_ONCE(true, "workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt",
+			  wq->name, cpu);
+	}
+
 	/* pwq determined, queue */
 	trace_workqueue_queue_work(req_cpu, pwq, work);
 
@@ -1287,22 +1412,6 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_work_on);
 
-/**
- * queue_work - queue work on a workqueue
- * @wq: workqueue to use
- * @work: work to queue
- *
- * Returns %false if @work was already on a queue, %true otherwise.
- *
- * We queue the work to the CPU on which it was submitted, but if the CPU dies
- * it can be processed by another CPU.
- */
-bool queue_work(struct workqueue_struct *wq, struct work_struct *work)
-{
-	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
-}
-EXPORT_SYMBOL_GPL(queue_work);
-
 void delayed_work_timer_fn(unsigned long __data)
 {
 	struct delayed_work *dwork = (struct delayed_work *)__data;
@@ -1378,21 +1487,6 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
 /**
- * queue_delayed_work - queue work on a workqueue after delay
- * @wq: workqueue to use
- * @dwork: delayable work to queue
- * @delay: number of jiffies to wait before queueing
- *
- * Equivalent to queue_delayed_work_on() but tries to use the local CPU.
- */
-bool queue_delayed_work(struct workqueue_struct *wq,
-			struct delayed_work *dwork, unsigned long delay)
-{
-	return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
-}
-EXPORT_SYMBOL_GPL(queue_delayed_work);
-
-/**
  * mod_delayed_work_on - modify delay of or queue a delayed work on specific CPU
  * @cpu: CPU number to execute work on
  * @wq: workqueue to use
@@ -1431,21 +1525,6 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 EXPORT_SYMBOL_GPL(mod_delayed_work_on);
 
 /**
- * mod_delayed_work - modify delay of or queue a delayed work
- * @wq: workqueue to use
- * @dwork: work to queue
- * @delay: number of jiffies to wait before queueing
- *
- * mod_delayed_work_on() on local CPU.
- */
-bool mod_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork,
-		      unsigned long delay)
-{
-	return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
-}
-EXPORT_SYMBOL_GPL(mod_delayed_work);
-
-/**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
  *
@@ -1459,9 +1538,10 @@ static void worker_enter_idle(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	BUG_ON(worker->flags & WORKER_IDLE);
-	BUG_ON(!list_empty(&worker->entry) &&
-	       (worker->hentry.next || worker->hentry.pprev));
+	if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
+	    WARN_ON_ONCE(!list_empty(&worker->entry) &&
+			 (worker->hentry.next || worker->hentry.pprev)))
+		return;
 
 	/* can't use worker_set_flags(), also called from start_worker() */
 	worker->flags |= WORKER_IDLE;
@@ -1498,22 +1578,25 @@ static void worker_leave_idle(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	BUG_ON(!(worker->flags & WORKER_IDLE));
+	if (WARN_ON_ONCE(!(worker->flags & WORKER_IDLE)))
+		return;
 	worker_clr_flags(worker, WORKER_IDLE);
 	pool->nr_idle--;
 	list_del_init(&worker->entry);
 }
 
 /**
- * worker_maybe_bind_and_lock - bind worker to its cpu if possible and lock pool
- * @worker: self
+ * worker_maybe_bind_and_lock - try to bind %current to worker_pool and lock it
+ * @pool: target worker_pool
+ *
+ * Bind %current to the cpu of @pool if it is associated and lock @pool.
  *
  * Works which are scheduled while the cpu is online must at least be
  * scheduled to a worker which is bound to the cpu so that if they are
  * flushed from cpu callbacks while cpu is going down, they are
  * guaranteed to execute on the cpu.
  *
- * This function is to be used by rogue workers and rescuers to bind
+ * This function is to be used by unbound workers and rescuers to bind
  * themselves to the target cpu and may race with cpu going down or
  * coming online.  kthread_bind() can't be used because it may put the
  * worker to already dead cpu and set_cpus_allowed_ptr() can't be used
@@ -1534,12 +1617,9 @@ static void worker_leave_idle(struct worker *worker)
  * %true if the associated pool is online (@worker is successfully
  * bound), %false if offline.
  */
-static bool worker_maybe_bind_and_lock(struct worker *worker)
+static bool worker_maybe_bind_and_lock(struct worker_pool *pool)
 __acquires(&pool->lock)
 {
-	struct worker_pool *pool = worker->pool;
-	struct task_struct *task = worker->task;
-
 	while (true) {
 		/*
 		 * The following call may fail, succeed or succeed
@@ -1548,14 +1628,13 @@ __acquires(&pool->lock)
 		 * against POOL_DISASSOCIATED.
 		 */
 		if (!(pool->flags & POOL_DISASSOCIATED))
-			set_cpus_allowed_ptr(task, get_cpu_mask(pool->cpu));
+			set_cpus_allowed_ptr(current, pool->attrs->cpumask);
 
 		spin_lock_irq(&pool->lock);
 		if (pool->flags & POOL_DISASSOCIATED)
 			return false;
-		if (task_cpu(task) == pool->cpu &&
-		    cpumask_equal(&current->cpus_allowed,
-				  get_cpu_mask(pool->cpu)))
+		if (task_cpu(current) == pool->cpu &&
+		    cpumask_equal(&current->cpus_allowed, pool->attrs->cpumask))
 			return true;
 		spin_unlock_irq(&pool->lock);
 
@@ -1570,108 +1649,6 @@ __acquires(&pool->lock)
 	}
 }
 
-/*
- * Rebind an idle @worker to its CPU.  worker_thread() will test
- * list_empty(@worker->entry) before leaving idle and call this function.
- */
-static void idle_worker_rebind(struct worker *worker)
-{
-	/* CPU may go down again inbetween, clear UNBOUND only on success */
-	if (worker_maybe_bind_and_lock(worker))
-		worker_clr_flags(worker, WORKER_UNBOUND);
-
-	/* rebind complete, become available again */
-	list_add(&worker->entry, &worker->pool->idle_list);
-	spin_unlock_irq(&worker->pool->lock);
-}
-
-/*
- * Function for @worker->rebind.work used to rebind unbound busy workers to
- * the associated cpu which is coming back online.  This is scheduled by
- * cpu up but can race with other cpu hotplug operations and may be
- * executed twice without intervening cpu down.
- */
-static void busy_worker_rebind_fn(struct work_struct *work)
-{
-	struct worker *worker = container_of(work, struct worker, rebind_work);
-
-	if (worker_maybe_bind_and_lock(worker))
-		worker_clr_flags(worker, WORKER_UNBOUND);
-
-	spin_unlock_irq(&worker->pool->lock);
-}
-
-/**
- * rebind_workers - rebind all workers of a pool to the associated CPU
- * @pool: pool of interest
- *
- * @pool->cpu is coming online.  Rebind all workers to the CPU.  Rebinding
- * is different for idle and busy ones.
- *
- * Idle ones will be removed from the idle_list and woken up.  They will
- * add themselves back after completing rebind.  This ensures that the
- * idle_list doesn't contain any unbound workers when re-bound busy workers
- * try to perform local wake-ups for concurrency management.
- *
- * Busy workers can rebind after they finish their current work items.
- * Queueing the rebind work item at the head of the scheduled list is
- * enough.  Note that nr_running will be properly bumped as busy workers
- * rebind.
- *
- * On return, all non-manager workers are scheduled for rebind - see
- * manage_workers() for the manager special case.  Any idle worker
- * including the manager will not appear on @idle_list until rebind is
- * complete, making local wake-ups safe.
- */
-static void rebind_workers(struct worker_pool *pool)
-{
-	struct worker *worker, *n;
-	int i;
-
-	lockdep_assert_held(&pool->assoc_mutex);
-	lockdep_assert_held(&pool->lock);
-
-	/* dequeue and kick idle ones */
-	list_for_each_entry_safe(worker, n, &pool->idle_list, entry) {
-		/*
-		 * idle workers should be off @pool->idle_list until rebind
-		 * is complete to avoid receiving premature local wake-ups.
-		 */
-		list_del_init(&worker->entry);
-
-		/*
-		 * worker_thread() will see the above dequeuing and call
-		 * idle_worker_rebind().
-		 */
-		wake_up_process(worker->task);
-	}
-
-	/* rebind busy workers */
-	for_each_busy_worker(worker, i, pool) {
-		struct work_struct *rebind_work = &worker->rebind_work;
-		struct workqueue_struct *wq;
-
-		if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
-				     work_data_bits(rebind_work)))
-			continue;
-
-		debug_work_activate(rebind_work);
-
-		/*
-		 * wq doesn't really matter but let's keep @worker->pool
-		 * and @pwq->pool consistent for sanity.
-		 */
-		if (std_worker_pool_pri(worker->pool))
-			wq = system_highpri_wq;
-		else
-			wq = system_wq;
-
-		insert_work(get_pwq(pool->cpu, wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
-	}
-}
-
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1680,7 +1657,6 @@ static struct worker *alloc_worker(void)
1680 if (worker) { 1657 if (worker) {
1681 INIT_LIST_HEAD(&worker->entry); 1658 INIT_LIST_HEAD(&worker->entry);
1682 INIT_LIST_HEAD(&worker->scheduled); 1659 INIT_LIST_HEAD(&worker->scheduled);
1683 INIT_WORK(&worker->rebind_work, busy_worker_rebind_fn);
1684 /* on creation a worker is in !idle && prep state */ 1660 /* on creation a worker is in !idle && prep state */
1685 worker->flags = WORKER_PREP; 1661 worker->flags = WORKER_PREP;
1686 } 1662 }
@@ -1703,18 +1679,25 @@ static struct worker *alloc_worker(void)
1703 */ 1679 */
1704static struct worker *create_worker(struct worker_pool *pool) 1680static struct worker *create_worker(struct worker_pool *pool)
1705{ 1681{
1706 const char *pri = std_worker_pool_pri(pool) ? "H" : "";
1707 struct worker *worker = NULL; 1682 struct worker *worker = NULL;
1708 int id = -1; 1683 int id = -1;
1684 char id_buf[16];
1685
1686 lockdep_assert_held(&pool->manager_mutex);
1709 1687
1688 /*
1689 * ID is needed to determine kthread name. Allocate ID first
1690 * without installing the pointer.
1691 */
1692 idr_preload(GFP_KERNEL);
1710 spin_lock_irq(&pool->lock); 1693 spin_lock_irq(&pool->lock);
1711 while (ida_get_new(&pool->worker_ida, &id)) { 1694
1712 spin_unlock_irq(&pool->lock); 1695 id = idr_alloc(&pool->worker_idr, NULL, 0, 0, GFP_NOWAIT);
1713 if (!ida_pre_get(&pool->worker_ida, GFP_KERNEL)) 1696
1714 goto fail;
1715 spin_lock_irq(&pool->lock);
1716 }
1717 spin_unlock_irq(&pool->lock); 1697 spin_unlock_irq(&pool->lock);
1698 idr_preload_end();
1699 if (id < 0)
1700 goto fail;
1718 1701
1719 worker = alloc_worker(); 1702 worker = alloc_worker();
1720 if (!worker) 1703 if (!worker)
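
The ID allocation above replaces the old ida retry loop with the two-step idr pattern: reserve an ID that maps to NULL while holding pool->lock (idr_preload() makes the GFP_NOWAIT allocation reliable), and only publish the worker pointer once the kthread has been created. Condensed from the hunk above into a rough sketch (declarations and error paths trimmed; not the exact function body):

    int id;

    /* step 1: reserve an ID; lookups see NULL until the worker is committed */
    idr_preload(GFP_KERNEL);                    /* preallocate outside pool->lock */
    spin_lock_irq(&pool->lock);
    id = idr_alloc(&pool->worker_idr, NULL, 0, 0, GFP_NOWAIT);
    spin_unlock_irq(&pool->lock);
    idr_preload_end();
    if (id < 0)
        goto fail;

    /* ... allocate the worker struct, create and configure its kthread ... */

    /* step 2: creation succeeded, publish the pointer under pool->lock */
    spin_lock_irq(&pool->lock);
    idr_replace(&pool->worker_idr, worker, id);
    spin_unlock_irq(&pool->lock);
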
@@ -1723,40 +1706,46 @@ static struct worker *create_worker(struct worker_pool *pool)
1723 worker->pool = pool; 1706 worker->pool = pool;
1724 worker->id = id; 1707 worker->id = id;
1725 1708
1726 if (pool->cpu != WORK_CPU_UNBOUND) 1709 if (pool->cpu >= 0)
1727 worker->task = kthread_create_on_node(worker_thread, 1710 snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
1728 worker, cpu_to_node(pool->cpu), 1711 pool->attrs->nice < 0 ? "H" : "");
1729 "kworker/%u:%d%s", pool->cpu, id, pri);
1730 else 1712 else
1731 worker->task = kthread_create(worker_thread, worker, 1713 snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
1732 "kworker/u:%d%s", id, pri); 1714
1715 worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
1716 "kworker/%s", id_buf);
1733 if (IS_ERR(worker->task)) 1717 if (IS_ERR(worker->task))
1734 goto fail; 1718 goto fail;
1735 1719
1736 if (std_worker_pool_pri(pool)) 1720 /*
1737 set_user_nice(worker->task, HIGHPRI_NICE_LEVEL); 1721 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
1722 * online CPUs. It'll be re-applied when any of the CPUs come up.
1723 */
1724 set_user_nice(worker->task, pool->attrs->nice);
1725 set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
1726
1727 /* prevent userland from meddling with cpumask of workqueue workers */
1728 worker->task->flags |= PF_NO_SETAFFINITY;
1738 1729
1739 /* 1730 /*
1740 * Determine CPU binding of the new worker depending on 1731 * The caller is responsible for ensuring %POOL_DISASSOCIATED
1741 * %POOL_DISASSOCIATED. The caller is responsible for ensuring the 1732 * remains stable across this function. See the comments above the
1742 * flag remains stable across this function. See the comments 1733 * flag definition for details.
1743 * above the flag definition for details.
1744 *
1745 * As an unbound worker may later become a regular one if CPU comes
1746 * online, make sure every worker has %PF_THREAD_BOUND set.
1747 */ 1734 */
1748 if (!(pool->flags & POOL_DISASSOCIATED)) { 1735 if (pool->flags & POOL_DISASSOCIATED)
1749 kthread_bind(worker->task, pool->cpu);
1750 } else {
1751 worker->task->flags |= PF_THREAD_BOUND;
1752 worker->flags |= WORKER_UNBOUND; 1736 worker->flags |= WORKER_UNBOUND;
1753 } 1737
1738 /* successful, commit the pointer to idr */
1739 spin_lock_irq(&pool->lock);
1740 idr_replace(&pool->worker_idr, worker, worker->id);
1741 spin_unlock_irq(&pool->lock);
1754 1742
1755 return worker; 1743 return worker;
1744
1756fail: 1745fail:
1757 if (id >= 0) { 1746 if (id >= 0) {
1758 spin_lock_irq(&pool->lock); 1747 spin_lock_irq(&pool->lock);
1759 ida_remove(&pool->worker_ida, id); 1748 idr_remove(&pool->worker_idr, id);
1760 spin_unlock_irq(&pool->lock); 1749 spin_unlock_irq(&pool->lock);
1761 } 1750 }
1762 kfree(worker); 1751 kfree(worker);
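
For reference, the format strings above give per-cpu workers names such as kworker/2:1, with an H suffix when the pool's nice level is negative (highpri), while unbound workers become kworker/u5:0, where 5 is the pool ID rather than a CPU number. A tiny standalone C program reproducing those formats (the sample values are made up):

    #include <stdio.h>

    int main(void)
    {
        char id_buf[16];
        int cpu = 2, worker_id = 1, nice = -20;     /* sample per-cpu highpri worker */
        int pool_id = 5, unbound_id = 0;            /* sample unbound pool worker */

        /* per-cpu pool: "<cpu>:<worker id>" plus "H" for highpri (negative nice) */
        snprintf(id_buf, sizeof(id_buf), "%d:%d%s", cpu, worker_id,
                 nice < 0 ? "H" : "");
        printf("kworker/%s\n", id_buf);             /* kworker/2:1H */

        /* unbound pool: "u<pool id>:<worker id>" */
        snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool_id, unbound_id);
        printf("kworker/%s\n", id_buf);             /* kworker/u5:0 */
        return 0;
    }
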
@@ -1781,6 +1770,30 @@ static void start_worker(struct worker *worker)
1781} 1770}
1782 1771
1783/** 1772/**
1773 * create_and_start_worker - create and start a worker for a pool
1774 * @pool: the target pool
1775 *
1776 * Grab the managership of @pool and create and start a new worker for it.
1777 */
1778static int create_and_start_worker(struct worker_pool *pool)
1779{
1780 struct worker *worker;
1781
1782 mutex_lock(&pool->manager_mutex);
1783
1784 worker = create_worker(pool);
1785 if (worker) {
1786 spin_lock_irq(&pool->lock);
1787 start_worker(worker);
1788 spin_unlock_irq(&pool->lock);
1789 }
1790
1791 mutex_unlock(&pool->manager_mutex);
1792
1793 return worker ? 0 : -ENOMEM;
1794}
1795
1796/**
1784 * destroy_worker - destroy a workqueue worker 1797 * destroy_worker - destroy a workqueue worker
1785 * @worker: worker to be destroyed 1798 * @worker: worker to be destroyed
1786 * 1799 *
@@ -1792,11 +1805,14 @@ static void start_worker(struct worker *worker)
1792static void destroy_worker(struct worker *worker) 1805static void destroy_worker(struct worker *worker)
1793{ 1806{
1794 struct worker_pool *pool = worker->pool; 1807 struct worker_pool *pool = worker->pool;
1795 int id = worker->id; 1808
1809 lockdep_assert_held(&pool->manager_mutex);
1810 lockdep_assert_held(&pool->lock);
1796 1811
1797 /* sanity check frenzy */ 1812 /* sanity check frenzy */
1798 BUG_ON(worker->current_work); 1813 if (WARN_ON(worker->current_work) ||
1799 BUG_ON(!list_empty(&worker->scheduled)); 1814 WARN_ON(!list_empty(&worker->scheduled)))
1815 return;
1800 1816
1801 if (worker->flags & WORKER_STARTED) 1817 if (worker->flags & WORKER_STARTED)
1802 pool->nr_workers--; 1818 pool->nr_workers--;
@@ -1806,13 +1822,14 @@ static void destroy_worker(struct worker *worker)
1806 list_del_init(&worker->entry); 1822 list_del_init(&worker->entry);
1807 worker->flags |= WORKER_DIE; 1823 worker->flags |= WORKER_DIE;
1808 1824
1825 idr_remove(&pool->worker_idr, worker->id);
1826
1809 spin_unlock_irq(&pool->lock); 1827 spin_unlock_irq(&pool->lock);
1810 1828
1811 kthread_stop(worker->task); 1829 kthread_stop(worker->task);
1812 kfree(worker); 1830 kfree(worker);
1813 1831
1814 spin_lock_irq(&pool->lock); 1832 spin_lock_irq(&pool->lock);
1815 ida_remove(&pool->worker_ida, id);
1816} 1833}
1817 1834
1818static void idle_worker_timeout(unsigned long __pool) 1835static void idle_worker_timeout(unsigned long __pool)
@@ -1841,23 +1858,21 @@ static void idle_worker_timeout(unsigned long __pool)
1841 spin_unlock_irq(&pool->lock); 1858 spin_unlock_irq(&pool->lock);
1842} 1859}
1843 1860
1844static bool send_mayday(struct work_struct *work) 1861static void send_mayday(struct work_struct *work)
1845{ 1862{
1846 struct pool_workqueue *pwq = get_work_pwq(work); 1863 struct pool_workqueue *pwq = get_work_pwq(work);
1847 struct workqueue_struct *wq = pwq->wq; 1864 struct workqueue_struct *wq = pwq->wq;
1848 unsigned int cpu;
1849 1865
1850 if (!(wq->flags & WQ_RESCUER)) 1866 lockdep_assert_held(&wq_mayday_lock);
1851 return false; 1867
1868 if (!wq->rescuer)
1869 return;
1852 1870
1853 /* mayday mayday mayday */ 1871 /* mayday mayday mayday */
1854 cpu = pwq->pool->cpu; 1872 if (list_empty(&pwq->mayday_node)) {
1855 /* WORK_CPU_UNBOUND can't be set in cpumask, use cpu 0 instead */ 1873 list_add_tail(&pwq->mayday_node, &wq->maydays);
1856 if (cpu == WORK_CPU_UNBOUND)
1857 cpu = 0;
1858 if (!mayday_test_and_set_cpu(cpu, wq->mayday_mask))
1859 wake_up_process(wq->rescuer->task); 1874 wake_up_process(wq->rescuer->task);
1860 return true; 1875 }
1861} 1876}
1862 1877
1863static void pool_mayday_timeout(unsigned long __pool) 1878static void pool_mayday_timeout(unsigned long __pool)
@@ -1865,7 +1880,8 @@ static void pool_mayday_timeout(unsigned long __pool)
1865 struct worker_pool *pool = (void *)__pool; 1880 struct worker_pool *pool = (void *)__pool;
1866 struct work_struct *work; 1881 struct work_struct *work;
1867 1882
1868 spin_lock_irq(&pool->lock); 1883 spin_lock_irq(&wq_mayday_lock); /* for wq->maydays */
1884 spin_lock(&pool->lock);
1869 1885
1870 if (need_to_create_worker(pool)) { 1886 if (need_to_create_worker(pool)) {
1871 /* 1887 /*
@@ -1878,7 +1894,8 @@ static void pool_mayday_timeout(unsigned long __pool)
1878 send_mayday(work); 1894 send_mayday(work);
1879 } 1895 }
1880 1896
1881 spin_unlock_irq(&pool->lock); 1897 spin_unlock(&pool->lock);
1898 spin_unlock_irq(&wq_mayday_lock);
1882 1899
1883 mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL); 1900 mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL);
1884} 1901}
@@ -1893,8 +1910,8 @@ static void pool_mayday_timeout(unsigned long __pool)
1893 * sent to all rescuers with works scheduled on @pool to resolve 1910 * sent to all rescuers with works scheduled on @pool to resolve
1894 * possible allocation deadlock. 1911 * possible allocation deadlock.
1895 * 1912 *
1896 * On return, need_to_create_worker() is guaranteed to be false and 1913 * On return, need_to_create_worker() is guaranteed to be %false and
1897 * may_start_working() true. 1914 * may_start_working() %true.
1898 * 1915 *
1899 * LOCKING: 1916 * LOCKING:
1900 * spin_lock_irq(pool->lock) which may be released and regrabbed 1917 * spin_lock_irq(pool->lock) which may be released and regrabbed
@@ -1902,7 +1919,7 @@ static void pool_mayday_timeout(unsigned long __pool)
1902 * manager. 1919 * manager.
1903 * 1920 *
1904 * RETURNS: 1921 * RETURNS:
1905 * false if no action was taken and pool->lock stayed locked, true 1922 * %false if no action was taken and pool->lock stayed locked, %true
1906 * otherwise. 1923 * otherwise.
1907 */ 1924 */
1908static bool maybe_create_worker(struct worker_pool *pool) 1925static bool maybe_create_worker(struct worker_pool *pool)
@@ -1925,7 +1942,8 @@ restart:
1925 del_timer_sync(&pool->mayday_timer); 1942 del_timer_sync(&pool->mayday_timer);
1926 spin_lock_irq(&pool->lock); 1943 spin_lock_irq(&pool->lock);
1927 start_worker(worker); 1944 start_worker(worker);
1928 BUG_ON(need_to_create_worker(pool)); 1945 if (WARN_ON_ONCE(need_to_create_worker(pool)))
1946 goto restart;
1929 return true; 1947 return true;
1930 } 1948 }
1931 1949
@@ -1958,7 +1976,7 @@ restart:
1958 * multiple times. Called only from manager. 1976 * multiple times. Called only from manager.
1959 * 1977 *
1960 * RETURNS: 1978 * RETURNS:
1961 * false if no action was taken and pool->lock stayed locked, true 1979 * %false if no action was taken and pool->lock stayed locked, %true
1962 * otherwise. 1980 * otherwise.
1963 */ 1981 */
1964static bool maybe_destroy_workers(struct worker_pool *pool) 1982static bool maybe_destroy_workers(struct worker_pool *pool)
@@ -2009,42 +2027,37 @@ static bool manage_workers(struct worker *worker)
2009 struct worker_pool *pool = worker->pool; 2027 struct worker_pool *pool = worker->pool;
2010 bool ret = false; 2028 bool ret = false;
2011 2029
2012 if (pool->flags & POOL_MANAGING_WORKERS) 2030 /*
2031 * Managership is governed by two mutexes - manager_arb and
2032 * manager_mutex. manager_arb handles arbitration of manager role.
2033 * Anyone who successfully grabs manager_arb wins the arbitration
2034 * and becomes the manager. mutex_trylock() on pool->manager_arb
2035 * failure while holding pool->lock reliably indicates that someone
2036 * else is managing the pool and the worker which failed trylock
2037 * can proceed to executing work items. This means that anyone
2038 * grabbing manager_arb is responsible for actually performing
2039 * manager duties. If manager_arb is grabbed and released without
2040 * actual management, the pool may stall indefinitely.
2041 *
2042 * manager_mutex is used for exclusion of actual management
2043 * operations. The holder of manager_mutex can be sure that none
2044 * of management operations, including creation and destruction of
2045 * workers, won't take place until the mutex is released. Because
2046 * manager_mutex doesn't interfere with manager role arbitration,
2047 * it is guaranteed that the pool's management, while it may be
2048 * delayed, won't be disturbed by someone else grabbing
2049 * manager_mutex.
2050 */
2051 if (!mutex_trylock(&pool->manager_arb))
2013 return ret; 2052 return ret;
2014 2053
2015 pool->flags |= POOL_MANAGING_WORKERS;
2016
2017 /* 2054 /*
2018 * To simplify both worker management and CPU hotplug, hold off 2055 * With manager arbitration won, manager_mutex would be free in
2019 * management while hotplug is in progress. CPU hotplug path can't 2056 * most cases. trylock first without dropping @pool->lock.
2020 * grab %POOL_MANAGING_WORKERS to achieve this because that can
2021 * lead to idle worker depletion (all become busy thinking someone
2022 * else is managing) which in turn can result in deadlock under
2023 * extreme circumstances. Use @pool->assoc_mutex to synchronize
2024 * manager against CPU hotplug.
2025 *
2026 * assoc_mutex would always be free unless CPU hotplug is in
2027 * progress. trylock first without dropping @pool->lock.
2028 */ 2057 */
2029 if (unlikely(!mutex_trylock(&pool->assoc_mutex))) { 2058 if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
2030 spin_unlock_irq(&pool->lock); 2059 spin_unlock_irq(&pool->lock);
2031 mutex_lock(&pool->assoc_mutex); 2060 mutex_lock(&pool->manager_mutex);
2032 /*
2033 * CPU hotplug could have happened while we were waiting
2034 * for assoc_mutex. Hotplug itself can't handle us
2035 * because manager isn't either on idle or busy list, and
2036 * @pool's state and ours could have deviated.
2037 *
2038 * As hotplug is now excluded via assoc_mutex, we can
2039 * simply try to bind. It will succeed or fail depending
2040 * on @pool's current state. Try it and adjust
2041 * %WORKER_UNBOUND accordingly.
2042 */
2043 if (worker_maybe_bind_and_lock(worker))
2044 worker->flags &= ~WORKER_UNBOUND;
2045 else
2046 worker->flags |= WORKER_UNBOUND;
2047
2048 ret = true; 2061 ret = true;
2049 } 2062 }
2050 2063
@@ -2057,8 +2070,8 @@ static bool manage_workers(struct worker *worker)
2057 ret |= maybe_destroy_workers(pool); 2070 ret |= maybe_destroy_workers(pool);
2058 ret |= maybe_create_worker(pool); 2071 ret |= maybe_create_worker(pool);
2059 2072
2060 pool->flags &= ~POOL_MANAGING_WORKERS; 2073 mutex_unlock(&pool->manager_mutex);
2061 mutex_unlock(&pool->assoc_mutex); 2074 mutex_unlock(&pool->manager_arb);
2062 return ret; 2075 return ret;
2063} 2076}
2064 2077
@@ -2212,11 +2225,11 @@ static void process_scheduled_works(struct worker *worker)
2212 * worker_thread - the worker thread function 2225 * worker_thread - the worker thread function
2213 * @__worker: self 2226 * @__worker: self
2214 * 2227 *
2215 * The worker thread function. There are NR_CPU_WORKER_POOLS dynamic pools 2228 * The worker thread function. All workers belong to a worker_pool -
2216 * of these per each cpu. These workers process all works regardless of 2229 * either a per-cpu one or dynamic unbound one. These workers process all
2217 * their specific target workqueue. The only exception is works which 2230 * work items regardless of their specific target workqueue. The only
2218 * belong to workqueues with a rescuer which will be explained in 2231 * exception is work items which belong to workqueues with a rescuer which
2219 * rescuer_thread(). 2232 * will be explained in rescuer_thread().
2220 */ 2233 */
2221static int worker_thread(void *__worker) 2234static int worker_thread(void *__worker)
2222{ 2235{
@@ -2228,19 +2241,12 @@ static int worker_thread(void *__worker)
2228woke_up: 2241woke_up:
2229 spin_lock_irq(&pool->lock); 2242 spin_lock_irq(&pool->lock);
2230 2243
2231 /* we are off idle list if destruction or rebind is requested */ 2244 /* am I supposed to die? */
2232 if (unlikely(list_empty(&worker->entry))) { 2245 if (unlikely(worker->flags & WORKER_DIE)) {
2233 spin_unlock_irq(&pool->lock); 2246 spin_unlock_irq(&pool->lock);
2234 2247 WARN_ON_ONCE(!list_empty(&worker->entry));
2235 /* if DIE is set, destruction is requested */ 2248 worker->task->flags &= ~PF_WQ_WORKER;
2236 if (worker->flags & WORKER_DIE) { 2249 return 0;
2237 worker->task->flags &= ~PF_WQ_WORKER;
2238 return 0;
2239 }
2240
2241 /* otherwise, rebind */
2242 idle_worker_rebind(worker);
2243 goto woke_up;
2244 } 2250 }
2245 2251
2246 worker_leave_idle(worker); 2252 worker_leave_idle(worker);
@@ -2258,14 +2264,16 @@ recheck:
2258 * preparing to process a work or actually processing it. 2264 * preparing to process a work or actually processing it.
2259 * Make sure nobody diddled with it while I was sleeping. 2265 * Make sure nobody diddled with it while I was sleeping.
2260 */ 2266 */
2261 BUG_ON(!list_empty(&worker->scheduled)); 2267 WARN_ON_ONCE(!list_empty(&worker->scheduled));
2262 2268
2263 /* 2269 /*
2264 * When control reaches this point, we're guaranteed to have 2270 * Finish PREP stage. We're guaranteed to have at least one idle
2265 * at least one idle worker or that someone else has already 2271 * worker or that someone else has already assumed the manager
2266 * assumed the manager role. 2272 * role. This is where @worker starts participating in concurrency
2273 * management if applicable and concurrency management is restored
2274 * after being rebound. See rebind_workers() for details.
2267 */ 2275 */
2268 worker_clr_flags(worker, WORKER_PREP); 2276 worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
2269 2277
2270 do { 2278 do {
2271 struct work_struct *work = 2279 struct work_struct *work =
@@ -2307,7 +2315,7 @@ sleep:
2307 * @__rescuer: self 2315 * @__rescuer: self
2308 * 2316 *
2309 * Workqueue rescuer thread function. There's one rescuer for each 2317 * Workqueue rescuer thread function. There's one rescuer for each
2310 * workqueue which has WQ_RESCUER set. 2318 * workqueue which has WQ_MEM_RECLAIM set.
2311 * 2319 *
2312 * Regular work processing on a pool may block trying to create a new 2320 * Regular work processing on a pool may block trying to create a new
2313 * worker which uses GFP_KERNEL allocation which has slight chance of 2321 * worker which uses GFP_KERNEL allocation which has slight chance of
@@ -2326,8 +2334,6 @@ static int rescuer_thread(void *__rescuer)
2326 struct worker *rescuer = __rescuer; 2334 struct worker *rescuer = __rescuer;
2327 struct workqueue_struct *wq = rescuer->rescue_wq; 2335 struct workqueue_struct *wq = rescuer->rescue_wq;
2328 struct list_head *scheduled = &rescuer->scheduled; 2336 struct list_head *scheduled = &rescuer->scheduled;
2329 bool is_unbound = wq->flags & WQ_UNBOUND;
2330 unsigned int cpu;
2331 2337
2332 set_user_nice(current, RESCUER_NICE_LEVEL); 2338 set_user_nice(current, RESCUER_NICE_LEVEL);
2333 2339
@@ -2345,28 +2351,29 @@ repeat:
2345 return 0; 2351 return 0;
2346 } 2352 }
2347 2353
2348 /* 2354 /* see whether any pwq is asking for help */
2349 * See whether any cpu is asking for help. Unbounded 2355 spin_lock_irq(&wq_mayday_lock);
2350 * workqueues use cpu 0 in mayday_mask for CPU_UNBOUND. 2356
2351 */ 2357 while (!list_empty(&wq->maydays)) {
2352 for_each_mayday_cpu(cpu, wq->mayday_mask) { 2358 struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
2353 unsigned int tcpu = is_unbound ? WORK_CPU_UNBOUND : cpu; 2359 struct pool_workqueue, mayday_node);
2354 struct pool_workqueue *pwq = get_pwq(tcpu, wq);
2355 struct worker_pool *pool = pwq->pool; 2360 struct worker_pool *pool = pwq->pool;
2356 struct work_struct *work, *n; 2361 struct work_struct *work, *n;
2357 2362
2358 __set_current_state(TASK_RUNNING); 2363 __set_current_state(TASK_RUNNING);
2359 mayday_clear_cpu(cpu, wq->mayday_mask); 2364 list_del_init(&pwq->mayday_node);
2365
2366 spin_unlock_irq(&wq_mayday_lock);
2360 2367
2361 /* migrate to the target cpu if possible */ 2368 /* migrate to the target cpu if possible */
2369 worker_maybe_bind_and_lock(pool);
2362 rescuer->pool = pool; 2370 rescuer->pool = pool;
2363 worker_maybe_bind_and_lock(rescuer);
2364 2371
2365 /* 2372 /*
2366 * Slurp in all works issued via this workqueue and 2373 * Slurp in all works issued via this workqueue and
2367 * process'em. 2374 * process'em.
2368 */ 2375 */
2369 BUG_ON(!list_empty(&rescuer->scheduled)); 2376 WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
2370 list_for_each_entry_safe(work, n, &pool->worklist, entry) 2377 list_for_each_entry_safe(work, n, &pool->worklist, entry)
2371 if (get_work_pwq(work) == pwq) 2378 if (get_work_pwq(work) == pwq)
2372 move_linked_works(work, scheduled, &n); 2379 move_linked_works(work, scheduled, &n);
@@ -2381,9 +2388,13 @@ repeat:
2381 if (keep_working(pool)) 2388 if (keep_working(pool))
2382 wake_up_worker(pool); 2389 wake_up_worker(pool);
2383 2390
2384 spin_unlock_irq(&pool->lock); 2391 rescuer->pool = NULL;
2392 spin_unlock(&pool->lock);
2393 spin_lock(&wq_mayday_lock);
2385 } 2394 }
2386 2395
2396 spin_unlock_irq(&wq_mayday_lock);
2397
2387 /* rescuers should never participate in concurrency management */ 2398 /* rescuers should never participate in concurrency management */
2388 WARN_ON_ONCE(!(rescuer->flags & WORKER_NOT_RUNNING)); 2399 WARN_ON_ONCE(!(rescuer->flags & WORKER_NOT_RUNNING));
2389 schedule(); 2400 schedule();
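
The rescuer changes above drop the per-cpu mayday bitmap in favour of a plain list: pwqs needing help sit on wq->maydays under wq_mayday_lock, and the rescuer pops one entry at a time, releasing the list lock while it processes that pwq. A generic userspace sketch of the same pop-under-lock, work-outside-the-lock loop (the request type and names here are hypothetical, not the kernel structures):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdio.h>

    struct request {
        struct request *next;
        void (*process)(struct request *);
    };

    static pthread_mutex_t mayday_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct request *maydays;             /* pending help requests */

    static void rescuer_drain(void)
    {
        pthread_mutex_lock(&mayday_lock);
        while (maydays) {
            /* detach the first entry while holding the list lock */
            struct request *req = maydays;
            maydays = req->next;
            pthread_mutex_unlock(&mayday_lock);

            /* do the potentially slow work without holding the list lock */
            req->process(req);

            pthread_mutex_lock(&mayday_lock);
        }
        pthread_mutex_unlock(&mayday_lock);
    }

    static void say_done(struct request *req)
    {
        (void)req;
        printf("processed one mayday request\n");
    }

    int main(void)
    {
        struct request req = { .next = NULL, .process = say_done };

        maydays = &req;
        rescuer_drain();
        return 0;
    }
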
@@ -2487,7 +2498,7 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
2487 * advanced to @work_color. 2498 * advanced to @work_color.
2488 * 2499 *
2489 * CONTEXT: 2500 * CONTEXT:
2490 * mutex_lock(wq->flush_mutex). 2501 * mutex_lock(wq->mutex).
2491 * 2502 *
2492 * RETURNS: 2503 * RETURNS:
2493 * %true if @flush_color >= 0 and there's something to flush. %false 2504 * %true if @flush_color >= 0 and there's something to flush. %false
@@ -2497,21 +2508,20 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
2497 int flush_color, int work_color) 2508 int flush_color, int work_color)
2498{ 2509{
2499 bool wait = false; 2510 bool wait = false;
2500 unsigned int cpu; 2511 struct pool_workqueue *pwq;
2501 2512
2502 if (flush_color >= 0) { 2513 if (flush_color >= 0) {
2503 BUG_ON(atomic_read(&wq->nr_pwqs_to_flush)); 2514 WARN_ON_ONCE(atomic_read(&wq->nr_pwqs_to_flush));
2504 atomic_set(&wq->nr_pwqs_to_flush, 1); 2515 atomic_set(&wq->nr_pwqs_to_flush, 1);
2505 } 2516 }
2506 2517
2507 for_each_pwq_cpu(cpu, wq) { 2518 for_each_pwq(pwq, wq) {
2508 struct pool_workqueue *pwq = get_pwq(cpu, wq);
2509 struct worker_pool *pool = pwq->pool; 2519 struct worker_pool *pool = pwq->pool;
2510 2520
2511 spin_lock_irq(&pool->lock); 2521 spin_lock_irq(&pool->lock);
2512 2522
2513 if (flush_color >= 0) { 2523 if (flush_color >= 0) {
2514 BUG_ON(pwq->flush_color != -1); 2524 WARN_ON_ONCE(pwq->flush_color != -1);
2515 2525
2516 if (pwq->nr_in_flight[flush_color]) { 2526 if (pwq->nr_in_flight[flush_color]) {
2517 pwq->flush_color = flush_color; 2527 pwq->flush_color = flush_color;
@@ -2521,7 +2531,7 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
2521 } 2531 }
2522 2532
2523 if (work_color >= 0) { 2533 if (work_color >= 0) {
2524 BUG_ON(work_color != work_next_color(pwq->work_color)); 2534 WARN_ON_ONCE(work_color != work_next_color(pwq->work_color));
2525 pwq->work_color = work_color; 2535 pwq->work_color = work_color;
2526 } 2536 }
2527 2537
@@ -2538,11 +2548,8 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
2538 * flush_workqueue - ensure that any scheduled work has run to completion. 2548 * flush_workqueue - ensure that any scheduled work has run to completion.
2539 * @wq: workqueue to flush 2549 * @wq: workqueue to flush
2540 * 2550 *
2541 * Forces execution of the workqueue and blocks until its completion. 2551 * This function sleeps until all work items which were queued on entry
2542 * This is typically used in driver shutdown handlers. 2552 * have finished execution, but it is not livelocked by new incoming ones.
2543 *
2544 * We sleep until all works which were queued on entry have been handled,
2545 * but we are not livelocked by new incoming ones.
2546 */ 2553 */
2547void flush_workqueue(struct workqueue_struct *wq) 2554void flush_workqueue(struct workqueue_struct *wq)
2548{ 2555{
@@ -2556,7 +2563,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2556 lock_map_acquire(&wq->lockdep_map); 2563 lock_map_acquire(&wq->lockdep_map);
2557 lock_map_release(&wq->lockdep_map); 2564 lock_map_release(&wq->lockdep_map);
2558 2565
2559 mutex_lock(&wq->flush_mutex); 2566 mutex_lock(&wq->mutex);
2560 2567
2561 /* 2568 /*
2562 * Start-to-wait phase 2569 * Start-to-wait phase
@@ -2569,13 +2576,13 @@ void flush_workqueue(struct workqueue_struct *wq)
2569 * becomes our flush_color and work_color is advanced 2576 * becomes our flush_color and work_color is advanced
2570 * by one. 2577 * by one.
2571 */ 2578 */
2572 BUG_ON(!list_empty(&wq->flusher_overflow)); 2579 WARN_ON_ONCE(!list_empty(&wq->flusher_overflow));
2573 this_flusher.flush_color = wq->work_color; 2580 this_flusher.flush_color = wq->work_color;
2574 wq->work_color = next_color; 2581 wq->work_color = next_color;
2575 2582
2576 if (!wq->first_flusher) { 2583 if (!wq->first_flusher) {
2577 /* no flush in progress, become the first flusher */ 2584 /* no flush in progress, become the first flusher */
2578 BUG_ON(wq->flush_color != this_flusher.flush_color); 2585 WARN_ON_ONCE(wq->flush_color != this_flusher.flush_color);
2579 2586
2580 wq->first_flusher = &this_flusher; 2587 wq->first_flusher = &this_flusher;
2581 2588
@@ -2588,7 +2595,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2588 } 2595 }
2589 } else { 2596 } else {
2590 /* wait in queue */ 2597 /* wait in queue */
2591 BUG_ON(wq->flush_color == this_flusher.flush_color); 2598 WARN_ON_ONCE(wq->flush_color == this_flusher.flush_color);
2592 list_add_tail(&this_flusher.list, &wq->flusher_queue); 2599 list_add_tail(&this_flusher.list, &wq->flusher_queue);
2593 flush_workqueue_prep_pwqs(wq, -1, wq->work_color); 2600 flush_workqueue_prep_pwqs(wq, -1, wq->work_color);
2594 } 2601 }
@@ -2601,7 +2608,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2601 list_add_tail(&this_flusher.list, &wq->flusher_overflow); 2608 list_add_tail(&this_flusher.list, &wq->flusher_overflow);
2602 } 2609 }
2603 2610
2604 mutex_unlock(&wq->flush_mutex); 2611 mutex_unlock(&wq->mutex);
2605 2612
2606 wait_for_completion(&this_flusher.done); 2613 wait_for_completion(&this_flusher.done);
2607 2614
@@ -2614,7 +2621,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2614 if (wq->first_flusher != &this_flusher) 2621 if (wq->first_flusher != &this_flusher)
2615 return; 2622 return;
2616 2623
2617 mutex_lock(&wq->flush_mutex); 2624 mutex_lock(&wq->mutex);
2618 2625
2619 /* we might have raced, check again with mutex held */ 2626 /* we might have raced, check again with mutex held */
2620 if (wq->first_flusher != &this_flusher) 2627 if (wq->first_flusher != &this_flusher)
@@ -2622,8 +2629,8 @@ void flush_workqueue(struct workqueue_struct *wq)
2622 2629
2623 wq->first_flusher = NULL; 2630 wq->first_flusher = NULL;
2624 2631
2625 BUG_ON(!list_empty(&this_flusher.list)); 2632 WARN_ON_ONCE(!list_empty(&this_flusher.list));
2626 BUG_ON(wq->flush_color != this_flusher.flush_color); 2633 WARN_ON_ONCE(wq->flush_color != this_flusher.flush_color);
2627 2634
2628 while (true) { 2635 while (true) {
2629 struct wq_flusher *next, *tmp; 2636 struct wq_flusher *next, *tmp;
@@ -2636,8 +2643,8 @@ void flush_workqueue(struct workqueue_struct *wq)
2636 complete(&next->done); 2643 complete(&next->done);
2637 } 2644 }
2638 2645
2639 BUG_ON(!list_empty(&wq->flusher_overflow) && 2646 WARN_ON_ONCE(!list_empty(&wq->flusher_overflow) &&
2640 wq->flush_color != work_next_color(wq->work_color)); 2647 wq->flush_color != work_next_color(wq->work_color));
2641 2648
2642 /* this flush_color is finished, advance by one */ 2649 /* this flush_color is finished, advance by one */
2643 wq->flush_color = work_next_color(wq->flush_color); 2650 wq->flush_color = work_next_color(wq->flush_color);
@@ -2661,7 +2668,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2661 } 2668 }
2662 2669
2663 if (list_empty(&wq->flusher_queue)) { 2670 if (list_empty(&wq->flusher_queue)) {
2664 BUG_ON(wq->flush_color != wq->work_color); 2671 WARN_ON_ONCE(wq->flush_color != wq->work_color);
2665 break; 2672 break;
2666 } 2673 }
2667 2674
@@ -2669,8 +2676,8 @@ void flush_workqueue(struct workqueue_struct *wq)
2669 * Need to flush more colors. Make the next flusher 2676 * Need to flush more colors. Make the next flusher
2670 * the new first flusher and arm pwqs. 2677 * the new first flusher and arm pwqs.
2671 */ 2678 */
2672 BUG_ON(wq->flush_color == wq->work_color); 2679 WARN_ON_ONCE(wq->flush_color == wq->work_color);
2673 BUG_ON(wq->flush_color != next->flush_color); 2680 WARN_ON_ONCE(wq->flush_color != next->flush_color);
2674 2681
2675 list_del_init(&next->list); 2682 list_del_init(&next->list);
2676 wq->first_flusher = next; 2683 wq->first_flusher = next;
@@ -2686,7 +2693,7 @@ void flush_workqueue(struct workqueue_struct *wq)
2686 } 2693 }
2687 2694
2688out_unlock: 2695out_unlock:
2689 mutex_unlock(&wq->flush_mutex); 2696 mutex_unlock(&wq->mutex);
2690} 2697}
2691EXPORT_SYMBOL_GPL(flush_workqueue); 2698EXPORT_SYMBOL_GPL(flush_workqueue);
2692 2699
@@ -2704,22 +2711,23 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
2704void drain_workqueue(struct workqueue_struct *wq) 2711void drain_workqueue(struct workqueue_struct *wq)
2705{ 2712{
2706 unsigned int flush_cnt = 0; 2713 unsigned int flush_cnt = 0;
2707 unsigned int cpu; 2714 struct pool_workqueue *pwq;
2708 2715
2709 /* 2716 /*
2710 * __queue_work() needs to test whether there are drainers, is much 2717 * __queue_work() needs to test whether there are drainers, is much
2711 * hotter than drain_workqueue() and already looks at @wq->flags. 2718 * hotter than drain_workqueue() and already looks at @wq->flags.
2712 * Use WQ_DRAINING so that queue doesn't have to check nr_drainers. 2719 * Use __WQ_DRAINING so that queue doesn't have to check nr_drainers.
2713 */ 2720 */
2714 spin_lock(&workqueue_lock); 2721 mutex_lock(&wq->mutex);
2715 if (!wq->nr_drainers++) 2722 if (!wq->nr_drainers++)
2716 wq->flags |= WQ_DRAINING; 2723 wq->flags |= __WQ_DRAINING;
2717 spin_unlock(&workqueue_lock); 2724 mutex_unlock(&wq->mutex);
2718reflush: 2725reflush:
2719 flush_workqueue(wq); 2726 flush_workqueue(wq);
2720 2727
2721 for_each_pwq_cpu(cpu, wq) { 2728 mutex_lock(&wq->mutex);
2722 struct pool_workqueue *pwq = get_pwq(cpu, wq); 2729
2730 for_each_pwq(pwq, wq) {
2723 bool drained; 2731 bool drained;
2724 2732
2725 spin_lock_irq(&pwq->pool->lock); 2733 spin_lock_irq(&pwq->pool->lock);
@@ -2731,15 +2739,16 @@ reflush:
2731 2739
2732 if (++flush_cnt == 10 || 2740 if (++flush_cnt == 10 ||
2733 (flush_cnt % 100 == 0 && flush_cnt <= 1000)) 2741 (flush_cnt % 100 == 0 && flush_cnt <= 1000))
2734 pr_warn("workqueue %s: flush on destruction isn't complete after %u tries\n", 2742 pr_warn("workqueue %s: drain_workqueue() isn't complete after %u tries\n",
2735 wq->name, flush_cnt); 2743 wq->name, flush_cnt);
2744
2745 mutex_unlock(&wq->mutex);
2736 goto reflush; 2746 goto reflush;
2737 } 2747 }
2738 2748
2739 spin_lock(&workqueue_lock);
2740 if (!--wq->nr_drainers) 2749 if (!--wq->nr_drainers)
2741 wq->flags &= ~WQ_DRAINING; 2750 wq->flags &= ~__WQ_DRAINING;
2742 spin_unlock(&workqueue_lock); 2751 mutex_unlock(&wq->mutex);
2743} 2752}
2744EXPORT_SYMBOL_GPL(drain_workqueue); 2753EXPORT_SYMBOL_GPL(drain_workqueue);
2745 2754
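The retry loop above warns on the 10th flush attempt and then on every 100th attempt up to the 1000th, after which it stays quiet. A trivial standalone check of that throttling condition:

    #include <stdio.h>

    int main(void)
    {
        unsigned int flush_cnt;

        for (flush_cnt = 1; flush_cnt <= 1200; flush_cnt++) {
            if (flush_cnt == 10 ||
                (flush_cnt % 100 == 0 && flush_cnt <= 1000))
                printf("would warn on try %u\n", flush_cnt);
        }
        return 0;       /* prints 10, 100, 200, ..., 1000 */
    }
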
@@ -2750,11 +2759,15 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
2750 struct pool_workqueue *pwq; 2759 struct pool_workqueue *pwq;
2751 2760
2752 might_sleep(); 2761 might_sleep();
2762
2763 local_irq_disable();
2753 pool = get_work_pool(work); 2764 pool = get_work_pool(work);
2754 if (!pool) 2765 if (!pool) {
2766 local_irq_enable();
2755 return false; 2767 return false;
2768 }
2756 2769
2757 spin_lock_irq(&pool->lock); 2770 spin_lock(&pool->lock);
2758 /* see the comment in try_to_grab_pending() with the same code */ 2771 /* see the comment in try_to_grab_pending() with the same code */
2759 pwq = get_work_pwq(work); 2772 pwq = get_work_pwq(work);
2760 if (pwq) { 2773 if (pwq) {
@@ -2776,7 +2789,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
2776 * flusher is not running on the same workqueue by verifying write 2789 * flusher is not running on the same workqueue by verifying write
2777 * access. 2790 * access.
2778 */ 2791 */
2779 if (pwq->wq->saved_max_active == 1 || pwq->wq->flags & WQ_RESCUER) 2792 if (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)
2780 lock_map_acquire(&pwq->wq->lockdep_map); 2793 lock_map_acquire(&pwq->wq->lockdep_map);
2781 else 2794 else
2782 lock_map_acquire_read(&pwq->wq->lockdep_map); 2795 lock_map_acquire_read(&pwq->wq->lockdep_map);
@@ -2933,66 +2946,6 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
2933EXPORT_SYMBOL(cancel_delayed_work_sync); 2946EXPORT_SYMBOL(cancel_delayed_work_sync);
2934 2947
2935/** 2948/**
2936 * schedule_work_on - put work task on a specific cpu
2937 * @cpu: cpu to put the work task on
2938 * @work: job to be done
2939 *
2940 * This puts a job on a specific cpu
2941 */
2942bool schedule_work_on(int cpu, struct work_struct *work)
2943{
2944 return queue_work_on(cpu, system_wq, work);
2945}
2946EXPORT_SYMBOL(schedule_work_on);
2947
2948/**
2949 * schedule_work - put work task in global workqueue
2950 * @work: job to be done
2951 *
2952 * Returns %false if @work was already on the kernel-global workqueue and
2953 * %true otherwise.
2954 *
2955 * This puts a job in the kernel-global workqueue if it was not already
2956 * queued and leaves it in the same position on the kernel-global
2957 * workqueue otherwise.
2958 */
2959bool schedule_work(struct work_struct *work)
2960{
2961 return queue_work(system_wq, work);
2962}
2963EXPORT_SYMBOL(schedule_work);
2964
2965/**
2966 * schedule_delayed_work_on - queue work in global workqueue on CPU after delay
2967 * @cpu: cpu to use
2968 * @dwork: job to be done
2969 * @delay: number of jiffies to wait
2970 *
2971 * After waiting for a given time this puts a job in the kernel-global
2972 * workqueue on the specified CPU.
2973 */
2974bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
2975 unsigned long delay)
2976{
2977 return queue_delayed_work_on(cpu, system_wq, dwork, delay);
2978}
2979EXPORT_SYMBOL(schedule_delayed_work_on);
2980
2981/**
2982 * schedule_delayed_work - put work task in global workqueue after delay
2983 * @dwork: job to be done
2984 * @delay: number of jiffies to wait or 0 for immediate execution
2985 *
2986 * After waiting for a given time this puts a job in the kernel-global
2987 * workqueue.
2988 */
2989bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay)
2990{
2991 return queue_delayed_work(system_wq, dwork, delay);
2992}
2993EXPORT_SYMBOL(schedule_delayed_work);
2994
2995/**
2996 * schedule_on_each_cpu - execute a function synchronously on each online CPU 2949 * schedule_on_each_cpu - execute a function synchronously on each online CPU
2997 * @func: the function to call 2950 * @func: the function to call
2998 * 2951 *
@@ -3085,51 +3038,1025 @@ int execute_in_process_context(work_func_t fn, struct execute_work *ew)
3085} 3038}
3086EXPORT_SYMBOL_GPL(execute_in_process_context); 3039EXPORT_SYMBOL_GPL(execute_in_process_context);
3087 3040
3088int keventd_up(void) 3041#ifdef CONFIG_SYSFS
3042/*
3043 * Workqueues with WQ_SYSFS flag set are visible to userland via
3044 * /sys/bus/workqueue/devices/WQ_NAME. All visible workqueues have the
3045 * following attributes.
3046 *
3047 * per_cpu RO bool : whether the workqueue is per-cpu or unbound
3048 * max_active RW int : maximum number of in-flight work items
3049 *
3050 * Unbound workqueues have the following extra attributes.
3051 *
3052 * id RO int : the associated pool ID
3053 * nice RW int : nice value of the workers
3054 * cpumask RW mask : bitmask of allowed CPUs for the workers
3055 */
3056struct wq_device {
3057 struct workqueue_struct *wq;
3058 struct device dev;
3059};
3060
3061static struct workqueue_struct *dev_to_wq(struct device *dev)
3062{
3063 struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
3064
3065 return wq_dev->wq;
3066}
3067
3068static ssize_t wq_per_cpu_show(struct device *dev,
3069 struct device_attribute *attr, char *buf)
3070{
3071 struct workqueue_struct *wq = dev_to_wq(dev);
3072
3073 return scnprintf(buf, PAGE_SIZE, "%d\n", (bool)!(wq->flags & WQ_UNBOUND));
3074}
3075
3076static ssize_t wq_max_active_show(struct device *dev,
3077 struct device_attribute *attr, char *buf)
3078{
3079 struct workqueue_struct *wq = dev_to_wq(dev);
3080
3081 return scnprintf(buf, PAGE_SIZE, "%d\n", wq->saved_max_active);
3082}
3083
3084static ssize_t wq_max_active_store(struct device *dev,
3085 struct device_attribute *attr,
3086 const char *buf, size_t count)
3087{
3088 struct workqueue_struct *wq = dev_to_wq(dev);
3089 int val;
3090
3091 if (sscanf(buf, "%d", &val) != 1 || val <= 0)
3092 return -EINVAL;
3093
3094 workqueue_set_max_active(wq, val);
3095 return count;
3096}
3097
3098static struct device_attribute wq_sysfs_attrs[] = {
3099 __ATTR(per_cpu, 0444, wq_per_cpu_show, NULL),
3100 __ATTR(max_active, 0644, wq_max_active_show, wq_max_active_store),
3101 __ATTR_NULL,
3102};
3103
3104static ssize_t wq_pool_ids_show(struct device *dev,
3105 struct device_attribute *attr, char *buf)
3106{
3107 struct workqueue_struct *wq = dev_to_wq(dev);
3108 const char *delim = "";
3109 int node, written = 0;
3110
3111 rcu_read_lock_sched();
3112 for_each_node(node) {
3113 written += scnprintf(buf + written, PAGE_SIZE - written,
3114 "%s%d:%d", delim, node,
3115 unbound_pwq_by_node(wq, node)->pool->id);
3116 delim = " ";
3117 }
3118 written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
3119 rcu_read_unlock_sched();
3120
3121 return written;
3122}
3123
3124static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
3125 char *buf)
3126{
3127 struct workqueue_struct *wq = dev_to_wq(dev);
3128 int written;
3129
3130 mutex_lock(&wq->mutex);
3131 written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
3132 mutex_unlock(&wq->mutex);
3133
3134 return written;
3135}
3136
3137/* prepare workqueue_attrs for sysfs store operations */
3138static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
3139{
3140 struct workqueue_attrs *attrs;
3141
3142 attrs = alloc_workqueue_attrs(GFP_KERNEL);
3143 if (!attrs)
3144 return NULL;
3145
3146 mutex_lock(&wq->mutex);
3147 copy_workqueue_attrs(attrs, wq->unbound_attrs);
3148 mutex_unlock(&wq->mutex);
3149 return attrs;
3150}
3151
3152static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
3153 const char *buf, size_t count)
3154{
3155 struct workqueue_struct *wq = dev_to_wq(dev);
3156 struct workqueue_attrs *attrs;
3157 int ret;
3158
3159 attrs = wq_sysfs_prep_attrs(wq);
3160 if (!attrs)
3161 return -ENOMEM;
3162
3163 if (sscanf(buf, "%d", &attrs->nice) == 1 &&
3164 attrs->nice >= -20 && attrs->nice <= 19)
3165 ret = apply_workqueue_attrs(wq, attrs);
3166 else
3167 ret = -EINVAL;
3168
3169 free_workqueue_attrs(attrs);
3170 return ret ?: count;
3171}
3172
3173static ssize_t wq_cpumask_show(struct device *dev,
3174 struct device_attribute *attr, char *buf)
3089{ 3175{
3090 return system_wq != NULL; 3176 struct workqueue_struct *wq = dev_to_wq(dev);
3177 int written;
3178
3179 mutex_lock(&wq->mutex);
3180 written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
3181 mutex_unlock(&wq->mutex);
3182
3183 written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
3184 return written;
3091} 3185}
3092 3186
3093static int alloc_pwqs(struct workqueue_struct *wq) 3187static ssize_t wq_cpumask_store(struct device *dev,
3188 struct device_attribute *attr,
3189 const char *buf, size_t count)
3094{ 3190{
3191 struct workqueue_struct *wq = dev_to_wq(dev);
3192 struct workqueue_attrs *attrs;
3193 int ret;
3194
3195 attrs = wq_sysfs_prep_attrs(wq);
3196 if (!attrs)
3197 return -ENOMEM;
3198
3199 ret = cpumask_parse(buf, attrs->cpumask);
3200 if (!ret)
3201 ret = apply_workqueue_attrs(wq, attrs);
3202
3203 free_workqueue_attrs(attrs);
3204 return ret ?: count;
3205}
3206
3207static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
3208 char *buf)
3209{
3210 struct workqueue_struct *wq = dev_to_wq(dev);
3211 int written;
3212
3213 mutex_lock(&wq->mutex);
3214 written = scnprintf(buf, PAGE_SIZE, "%d\n",
3215 !wq->unbound_attrs->no_numa);
3216 mutex_unlock(&wq->mutex);
3217
3218 return written;
3219}
3220
3221static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
3222 const char *buf, size_t count)
3223{
3224 struct workqueue_struct *wq = dev_to_wq(dev);
3225 struct workqueue_attrs *attrs;
3226 int v, ret;
3227
3228 attrs = wq_sysfs_prep_attrs(wq);
3229 if (!attrs)
3230 return -ENOMEM;
3231
3232 ret = -EINVAL;
3233 if (sscanf(buf, "%d", &v) == 1) {
3234 attrs->no_numa = !v;
3235 ret = apply_workqueue_attrs(wq, attrs);
3236 }
3237
3238 free_workqueue_attrs(attrs);
3239 return ret ?: count;
3240}
3241
3242static struct device_attribute wq_sysfs_unbound_attrs[] = {
3243 __ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
3244 __ATTR(nice, 0644, wq_nice_show, wq_nice_store),
3245 __ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
3246 __ATTR(numa, 0644, wq_numa_show, wq_numa_store),
3247 __ATTR_NULL,
3248};
3249
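Taken together, these attributes let userspace retune an exposed workqueue with nothing more than file I/O. A hedged illustration follows; it assumes a workqueue created with WQ_SYSFS that shows up under the /sys/bus/workqueue/devices path documented above, and "example_wq" is a placeholder name:

    #include <stdio.h>

    int main(void)
    {
        const char *base = "/sys/bus/workqueue/devices/example_wq";
        char path[256];
        int max_active;
        FILE *f;

        /* read the current max_active */
        snprintf(path, sizeof(path), "%s/max_active", base);
        f = fopen(path, "r");
        if (f && fscanf(f, "%d", &max_active) == 1)
            printf("max_active = %d\n", max_active);
        if (f)
            fclose(f);

        /* lower the nice value of the backing workers (unbound workqueues only) */
        snprintf(path, sizeof(path), "%s/nice", base);
        f = fopen(path, "w");
        if (f) {
            fprintf(f, "-5\n");
            fclose(f);
        }
        return 0;
    }
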
3250static struct bus_type wq_subsys = {
3251 .name = "workqueue",
3252 .dev_attrs = wq_sysfs_attrs,
3253};
3254
3255static int __init wq_sysfs_init(void)
3256{
3257 return subsys_virtual_register(&wq_subsys, NULL);
3258}
3259core_initcall(wq_sysfs_init);
3260
3261static void wq_device_release(struct device *dev)
3262{
3263 struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
3264
3265 kfree(wq_dev);
3266}
3267
3268/**
3269 * workqueue_sysfs_register - make a workqueue visible in sysfs
3270 * @wq: the workqueue to register
3271 *
3272 * Expose @wq in sysfs under /sys/bus/workqueue/devices.
3273 * alloc_workqueue*() automatically calls this function if WQ_SYSFS is set
3274 * which is the preferred method.
3275 *
3276 * Workqueue user should use this function directly iff it wants to apply
3277 * workqueue_attrs before making the workqueue visible in sysfs; otherwise,
3278 * apply_workqueue_attrs() may race against userland updating the
3279 * attributes.
3280 *
3281 * Returns 0 on success, -errno on failure.
3282 */
3283int workqueue_sysfs_register(struct workqueue_struct *wq)
3284{
3285 struct wq_device *wq_dev;
3286 int ret;
3287
3095 /* 3288 /*
3096 * pwqs are forced aligned according to WORK_STRUCT_FLAG_BITS. 3289 * Adjusting max_active or creating new pwqs by applying
3097 * Make sure that the alignment isn't lower than that of 3290 * attributes breaks ordering guarantee. Disallow exposing ordered
3098 * unsigned long long. 3291 * workqueues.
3099 */ 3292 */
3100 const size_t size = sizeof(struct pool_workqueue); 3293 if (WARN_ON(wq->flags & __WQ_ORDERED))
3101 const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS, 3294 return -EINVAL;
3102 __alignof__(unsigned long long));
3103 3295
3104 if (!(wq->flags & WQ_UNBOUND)) 3296 wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
3105 wq->pool_wq.pcpu = __alloc_percpu(size, align); 3297 if (!wq_dev)
3106 else { 3298 return -ENOMEM;
3107 void *ptr; 3299
3300 wq_dev->wq = wq;
3301 wq_dev->dev.bus = &wq_subsys;
3302 wq_dev->dev.init_name = wq->name;
3303 wq_dev->dev.release = wq_device_release;
3304
3305 /*
3306 * unbound_attrs are created separately. Suppress uevent until
3307 * everything is ready.
3308 */
3309 dev_set_uevent_suppress(&wq_dev->dev, true);
3310
3311 ret = device_register(&wq_dev->dev);
3312 if (ret) {
3313 kfree(wq_dev);
3314 wq->wq_dev = NULL;
3315 return ret;
3316 }
3317
3318 if (wq->flags & WQ_UNBOUND) {
3319 struct device_attribute *attr;
3320
3321 for (attr = wq_sysfs_unbound_attrs; attr->attr.name; attr++) {
3322 ret = device_create_file(&wq_dev->dev, attr);
3323 if (ret) {
3324 device_unregister(&wq_dev->dev);
3325 wq->wq_dev = NULL;
3326 return ret;
3327 }
3328 }
3329 }
3330
3331 kobject_uevent(&wq_dev->dev.kobj, KOBJ_ADD);
3332 return 0;
3333}
3334
3335/**
3336 * workqueue_sysfs_unregister - undo workqueue_sysfs_register()
3337 * @wq: the workqueue to unregister
3338 *
3339 * If @wq is registered to sysfs by workqueue_sysfs_register(), unregister.
3340 */
3341static void workqueue_sysfs_unregister(struct workqueue_struct *wq)
3342{
3343 struct wq_device *wq_dev = wq->wq_dev;
3344
3345 if (!wq->wq_dev)
3346 return;
3347
3348 wq->wq_dev = NULL;
3349 device_unregister(&wq_dev->dev);
3350}
3351#else /* CONFIG_SYSFS */
3352static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { }
3353#endif /* CONFIG_SYSFS */
3354
3355/**
3356 * free_workqueue_attrs - free a workqueue_attrs
3357 * @attrs: workqueue_attrs to free
3358 *
3359 * Undo alloc_workqueue_attrs().
3360 */
3361void free_workqueue_attrs(struct workqueue_attrs *attrs)
3362{
3363 if (attrs) {
3364 free_cpumask_var(attrs->cpumask);
3365 kfree(attrs);
3366 }
3367}
3368
3369/**
3370 * alloc_workqueue_attrs - allocate a workqueue_attrs
3371 * @gfp_mask: allocation mask to use
3372 *
3373 * Allocate a new workqueue_attrs, initialize with default settings and
3374 * return it. Returns NULL on failure.
3375 */
3376struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
3377{
3378 struct workqueue_attrs *attrs;
3379
3380 attrs = kzalloc(sizeof(*attrs), gfp_mask);
3381 if (!attrs)
3382 goto fail;
3383 if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
3384 goto fail;
3385
3386 cpumask_copy(attrs->cpumask, cpu_possible_mask);
3387 return attrs;
3388fail:
3389 free_workqueue_attrs(attrs);
3390 return NULL;
3391}
3392
3393static void copy_workqueue_attrs(struct workqueue_attrs *to,
3394 const struct workqueue_attrs *from)
3395{
3396 to->nice = from->nice;
3397 cpumask_copy(to->cpumask, from->cpumask);
3398}
3399
3400/* hash value of the content of @attr */
3401static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
3402{
3403 u32 hash = 0;
3404
3405 hash = jhash_1word(attrs->nice, hash);
3406 hash = jhash(cpumask_bits(attrs->cpumask),
3407 BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
3408 return hash;
3409}
3410
3411/* content equality test */
3412static bool wqattrs_equal(const struct workqueue_attrs *a,
3413 const struct workqueue_attrs *b)
3414{
3415 if (a->nice != b->nice)
3416 return false;
3417 if (!cpumask_equal(a->cpumask, b->cpumask))
3418 return false;
3419 return true;
3420}
3421
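These two helpers are what make backend pools shareable: unbound pools are kept in a hash table keyed by wqattrs_hash() and matched with wqattrs_equal(), so any two workqueues asking for the same nice level and cpumask end up on the same pool. A toy userspace version of the idea, with a 64-bit integer standing in for the cpumask and a simple mix function instead of jhash (everything here is illustrative only):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct attrs {
        int nice;                   /* worker nice level */
        uint64_t cpumask;           /* toy stand-in for a real cpumask */
    };

    /* toy hash, just enough to bucket attrs; the kernel uses jhash instead */
    static uint32_t attrs_hash(const struct attrs *a)
    {
        uint64_t h = (uint64_t)(uint32_t)a->nice * 0x9e3779b97f4a7c15ULL;

        h ^= a->cpumask + (h >> 29);
        return (uint32_t)(h ^ (h >> 32));
    }

    /* equality decides sharing; the hash only narrows the search */
    static bool attrs_equal(const struct attrs *a, const struct attrs *b)
    {
        return a->nice == b->nice && a->cpumask == b->cpumask;
    }

    int main(void)
    {
        struct attrs a = { .nice = 0, .cpumask = 0xf };
        struct attrs b = { .nice = 0, .cpumask = 0xf };

        printf("hash a=%u b=%u equal=%d\n",
               (unsigned)attrs_hash(&a), (unsigned)attrs_hash(&b),
               attrs_equal(&a, &b));
        return 0;
    }
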
3422/**
3423 * init_worker_pool - initialize a newly zalloc'd worker_pool
3424 * @pool: worker_pool to initialize
3425 *
3426 * Initialize a newly zalloc'd @pool. It also allocates @pool->attrs.
3427 * Returns 0 on success, -errno on failure. Even on failure, all fields
3428 * inside @pool proper are initialized and put_unbound_pool() can be called
3429 * on @pool safely to release it.
3430 */
3431static int init_worker_pool(struct worker_pool *pool)
3432{
3433 spin_lock_init(&pool->lock);
3434 pool->id = -1;
3435 pool->cpu = -1;
3436 pool->node = NUMA_NO_NODE;
3437 pool->flags |= POOL_DISASSOCIATED;
3438 INIT_LIST_HEAD(&pool->worklist);
3439 INIT_LIST_HEAD(&pool->idle_list);
3440 hash_init(pool->busy_hash);
3441
3442 init_timer_deferrable(&pool->idle_timer);
3443 pool->idle_timer.function = idle_worker_timeout;
3444 pool->idle_timer.data = (unsigned long)pool;
3445
3446 setup_timer(&pool->mayday_timer, pool_mayday_timeout,
3447 (unsigned long)pool);
3448
3449 mutex_init(&pool->manager_arb);
3450 mutex_init(&pool->manager_mutex);
3451 idr_init(&pool->worker_idr);
3452
3453 INIT_HLIST_NODE(&pool->hash_node);
3454 pool->refcnt = 1;
3455
3456 /* shouldn't fail above this point */
3457 pool->attrs = alloc_workqueue_attrs(GFP_KERNEL);
3458 if (!pool->attrs)
3459 return -ENOMEM;
3460 return 0;
3461}
3462
3463static void rcu_free_pool(struct rcu_head *rcu)
3464{
3465 struct worker_pool *pool = container_of(rcu, struct worker_pool, rcu);
3466
3467 idr_destroy(&pool->worker_idr);
3468 free_workqueue_attrs(pool->attrs);
3469 kfree(pool);
3470}
3471
3472/**
3473 * put_unbound_pool - put a worker_pool
3474 * @pool: worker_pool to put
3475 *
3476 * Put @pool. If its refcnt reaches zero, it gets destroyed in sched-RCU
3477 * safe manner. get_unbound_pool() calls this function on its failure path
3478 * and this function should be able to release pools which went through,
3479 * successfully or not, init_worker_pool().
3480 *
3481 * Should be called with wq_pool_mutex held.
3482 */
3483static void put_unbound_pool(struct worker_pool *pool)
3484{
3485 struct worker *worker;
3486
3487 lockdep_assert_held(&wq_pool_mutex);
3488
3489 if (--pool->refcnt)
3490 return;
3491
3492 /* sanity checks */
3493 if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)) ||
3494 WARN_ON(!list_empty(&pool->worklist)))
3495 return;
3496
3497 /* release id and unhash */
3498 if (pool->id >= 0)
3499 idr_remove(&worker_pool_idr, pool->id);
3500 hash_del(&pool->hash_node);
3501
3502 /*
3503 * Become the manager and destroy all workers. Grabbing
3504 * manager_arb prevents @pool's workers from blocking on
3505 * manager_mutex.
3506 */
3507 mutex_lock(&pool->manager_arb);
3508 mutex_lock(&pool->manager_mutex);
3509 spin_lock_irq(&pool->lock);
3510
3511 while ((worker = first_worker(pool)))
3512 destroy_worker(worker);
3513 WARN_ON(pool->nr_workers || pool->nr_idle);
3514
3515 spin_unlock_irq(&pool->lock);
3516 mutex_unlock(&pool->manager_mutex);
3517 mutex_unlock(&pool->manager_arb);
3518
3519 /* shut down the timers */
3520 del_timer_sync(&pool->idle_timer);
3521 del_timer_sync(&pool->mayday_timer);
3522
3523 /* sched-RCU protected to allow dereferences from get_work_pool() */
3524 call_rcu_sched(&pool->rcu, rcu_free_pool);
3525}
3526
3527/**
3528 * get_unbound_pool - get a worker_pool with the specified attributes
3529 * @attrs: the attributes of the worker_pool to get
3530 *
3531 * Obtain a worker_pool which has the same attributes as @attrs, bump the
3532 * reference count and return it. If there already is a matching
3533 * worker_pool, it will be used; otherwise, this function attempts to
3534 * create a new one. On failure, returns NULL.
3535 *
3536 * Should be called with wq_pool_mutex held.
3537 */
3538static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
3539{
3540 u32 hash = wqattrs_hash(attrs);
3541 struct worker_pool *pool;
3542 int node;
3543
3544 lockdep_assert_held(&wq_pool_mutex);
3545
3546 /* do we already have a matching pool? */
3547 hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
3548 if (wqattrs_equal(pool->attrs, attrs)) {
3549 pool->refcnt++;
3550 goto out_unlock;
3551 }
3552 }
3553
3554 /* nope, create a new one */
3555 pool = kzalloc(sizeof(*pool), GFP_KERNEL);
3556 if (!pool || init_worker_pool(pool) < 0)
3557 goto fail;
3558
3559 if (workqueue_freezing)
3560 pool->flags |= POOL_FREEZING;
3561
3562 lockdep_set_subclass(&pool->lock, 1); /* see put_pwq() */
3563 copy_workqueue_attrs(pool->attrs, attrs);
3564
3565 /* if cpumask is contained inside a NUMA node, we belong to that node */
3566 if (wq_numa_enabled) {
3567 for_each_node(node) {
3568 if (cpumask_subset(pool->attrs->cpumask,
3569 wq_numa_possible_cpumask[node])) {
3570 pool->node = node;
3571 break;
3572 }
3573 }
3574 }
3575
3576 if (worker_pool_assign_id(pool) < 0)
3577 goto fail;
3578
3579 /* create and start the initial worker */
3580 if (create_and_start_worker(pool) < 0)
3581 goto fail;
3582
3583 /* install */
3584 hash_add(unbound_pool_hash, &pool->hash_node, hash);
3585out_unlock:
3586 return pool;
3587fail:
3588 if (pool)
3589 put_unbound_pool(pool);
3590 return NULL;
3591}
3592
3593static void rcu_free_pwq(struct rcu_head *rcu)
3594{
3595 kmem_cache_free(pwq_cache,
3596 container_of(rcu, struct pool_workqueue, rcu));
3597}
3598
3599/*
3600 * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt
3601 * and needs to be destroyed.
3602 */
3603static void pwq_unbound_release_workfn(struct work_struct *work)
3604{
3605 struct pool_workqueue *pwq = container_of(work, struct pool_workqueue,
3606 unbound_release_work);
3607 struct workqueue_struct *wq = pwq->wq;
3608 struct worker_pool *pool = pwq->pool;
3609 bool is_last;
3610
3611 if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
3612 return;
3613
3614 /*
3615 * Unlink @pwq. Synchronization against wq->mutex isn't strictly
3616 * necessary on release but do it anyway. It's easier to verify
3617 * and consistent with the linking path.
3618 */
3619 mutex_lock(&wq->mutex);
3620 list_del_rcu(&pwq->pwqs_node);
3621 is_last = list_empty(&wq->pwqs);
3622 mutex_unlock(&wq->mutex);
3623
3624 mutex_lock(&wq_pool_mutex);
3625 put_unbound_pool(pool);
3626 mutex_unlock(&wq_pool_mutex);
3627
3628 call_rcu_sched(&pwq->rcu, rcu_free_pwq);
3629
3630 /*
3631 * If we're the last pwq going away, @wq is already dead and no one
3632 * is gonna access it anymore. Free it.
3633 */
3634 if (is_last) {
3635 free_workqueue_attrs(wq->unbound_attrs);
3636 kfree(wq);
3637 }
3638}
3639
3640/**
3641 * pwq_adjust_max_active - update a pwq's max_active to the current setting
3642 * @pwq: target pool_workqueue
3643 *
3644 * If @pwq isn't freezing, set @pwq->max_active to the associated
3645 * workqueue's saved_max_active and activate delayed work items
3646 * accordingly. If @pwq is freezing, clear @pwq->max_active to zero.
3647 */
3648static void pwq_adjust_max_active(struct pool_workqueue *pwq)
3649{
3650 struct workqueue_struct *wq = pwq->wq;
3651 bool freezable = wq->flags & WQ_FREEZABLE;
3652
3653 /* for @wq->saved_max_active */
3654 lockdep_assert_held(&wq->mutex);
3655
3656 /* fast exit for non-freezable wqs */
3657 if (!freezable && pwq->max_active == wq->saved_max_active)
3658 return;
3659
3660 spin_lock_irq(&pwq->pool->lock);
3661
3662 if (!freezable || !(pwq->pool->flags & POOL_FREEZING)) {
3663 pwq->max_active = wq->saved_max_active;
3664
3665 while (!list_empty(&pwq->delayed_works) &&
3666 pwq->nr_active < pwq->max_active)
3667 pwq_activate_first_delayed(pwq);
3108 3668
3109 /* 3669 /*
3110 * Allocate enough room to align pwq and put an extra 3670 * Need to kick a worker after thawed or an unbound wq's
3111 * pointer at the end pointing back to the originally 3671 * max_active is bumped. It's a slow path. Do it always.
3112 * allocated pointer which will be used for free.
3113 */ 3672 */
3114 ptr = kzalloc(size + align + sizeof(void *), GFP_KERNEL); 3673 wake_up_worker(pwq->pool);
3115 if (ptr) { 3674 } else {
3116 wq->pool_wq.single = PTR_ALIGN(ptr, align); 3675 pwq->max_active = 0;
3117 *(void **)(wq->pool_wq.single + 1) = ptr; 3676 }
3677
3678 spin_unlock_irq(&pwq->pool->lock);
3679}
3680
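pwq_adjust_max_active() is also where the max_active throttle is re-evaluated: when the limit rises (or the pwq thaws), items parked on the delayed list are activated until nr_active reaches the new limit. A stripped-down userspace model of that bookkeeping (plain arrays instead of list_heads, and completion accounting is omitted; purely illustrative):

    #include <stdio.h>

    #define MAX_ITEMS 16

    static int delayed[MAX_ITEMS];      /* queued but not yet allowed to run */
    static int d_head, d_tail;
    static int nr_active;               /* items currently "executing" */
    static int max_active = 1;

    static void run_item(int item)
    {
        nr_active++;
        printf("running item %d (nr_active=%d)\n", item, nr_active);
    }

    static void queue_item(int item)
    {
        if (nr_active < max_active)
            run_item(item);             /* room available: run immediately */
        else
            delayed[d_tail++] = item;   /* over the limit: park it */
    }

    /* analogous to pwq_adjust_max_active(): activate delayed items up to the new limit */
    static void adjust_max_active(int new_max)
    {
        max_active = new_max;
        while (d_head < d_tail && nr_active < max_active)
            run_item(delayed[d_head++]);
    }

    int main(void)
    {
        int i;

        for (i = 0; i < 4; i++)
            queue_item(i);              /* item 0 runs, items 1-3 are delayed */
        adjust_max_active(3);           /* items 1 and 2 get activated */
        return 0;
    }
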
3681/* initialize newly alloced @pwq which is associated with @wq and @pool */
3682static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
3683 struct worker_pool *pool)
3684{
3685 BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
3686
3687 memset(pwq, 0, sizeof(*pwq));
3688
3689 pwq->pool = pool;
3690 pwq->wq = wq;
3691 pwq->flush_color = -1;
3692 pwq->refcnt = 1;
3693 INIT_LIST_HEAD(&pwq->delayed_works);
3694 INIT_LIST_HEAD(&pwq->pwqs_node);
3695 INIT_LIST_HEAD(&pwq->mayday_node);
3696 INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
3697}
3698
3699/* sync @pwq with the current state of its associated wq and link it */
3700static void link_pwq(struct pool_workqueue *pwq)
3701{
3702 struct workqueue_struct *wq = pwq->wq;
3703
3704 lockdep_assert_held(&wq->mutex);
3705
3706 /* may be called multiple times, ignore if already linked */
3707 if (!list_empty(&pwq->pwqs_node))
3708 return;
3709
3710 /*
3711 * Set the matching work_color. This is synchronized with
3712 * wq->mutex to avoid confusing flush_workqueue().
3713 */
3714 pwq->work_color = wq->work_color;
3715
3716 /* sync max_active to the current setting */
3717 pwq_adjust_max_active(pwq);
3718
3719 /* link in @pwq */
3720 list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
3721}
3722
3723/* obtain a pool matching @attrs and create a pwq associating the pool and @wq */
3724static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
3725 const struct workqueue_attrs *attrs)
3726{
3727 struct worker_pool *pool;
3728 struct pool_workqueue *pwq;
3729
3730 lockdep_assert_held(&wq_pool_mutex);
3731
3732 pool = get_unbound_pool(attrs);
3733 if (!pool)
3734 return NULL;
3735
3736 pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
3737 if (!pwq) {
3738 put_unbound_pool(pool);
3739 return NULL;
3740 }
3741
3742 init_pwq(pwq, wq, pool);
3743 return pwq;
3744}
3745
3746/* undo alloc_unbound_pwq(), used only in the error path */
3747static void free_unbound_pwq(struct pool_workqueue *pwq)
3748{
3749 lockdep_assert_held(&wq_pool_mutex);
3750
3751 if (pwq) {
3752 put_unbound_pool(pwq->pool);
3753 kmem_cache_free(pwq_cache, pwq);
3754 }
3755}
3756
3757/**
3758 * wq_calc_node_cpumask - calculate a wq_attrs' cpumask for the specified node
3759 * @attrs: the wq_attrs of interest
3760 * @node: the target NUMA node
3761 * @cpu_going_down: if >= 0, the CPU to consider as offline
3762 * @cpumask: outarg, the resulting cpumask
3763 *
3764 * Calculate the cpumask a workqueue with @attrs should use on @node. If
3765 * @cpu_going_down is >= 0, that cpu is considered offline during
3766 * calculation. The result is stored in @cpumask. This function returns
3767 * %true if the resulting @cpumask is different from @attrs->cpumask,
3768 * %false if equal.
3769 *
3770 * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
3771 * enabled and @node has online CPUs requested by @attrs, the returned
3772 * cpumask is the intersection of the possible CPUs of @node and
3773 * @attrs->cpumask.
3774 *
3775 * The caller is responsible for ensuring that the cpumask of @node stays
3776 * stable.
3777 */
3778static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
3779 int cpu_going_down, cpumask_t *cpumask)
3780{
3781 if (!wq_numa_enabled || attrs->no_numa)
3782 goto use_dfl;
3783
3784 /* does @node have any online CPUs @attrs wants? */
3785 cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
3786 if (cpu_going_down >= 0)
3787 cpumask_clear_cpu(cpu_going_down, cpumask);
3788
3789 if (cpumask_empty(cpumask))
3790 goto use_dfl;
3791
3792 /* yes, return possible CPUs in @node that @attrs wants */
3793 cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
3794 return !cpumask_equal(cpumask, attrs->cpumask);
3795
3796use_dfl:
3797 cpumask_copy(cpumask, attrs->cpumask);
3798 return false;
3799}
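
To make the fallback logic above concrete, here is a small userspace toy (not kernel code; the eight-CPU, two-node topology and all names are invented) that mirrors the same decision with plain bit masks:

#include <stdbool.h>
#include <stdio.h>

/* toy topology: 8 CPUs, node 0 = CPUs 0-3, node 1 = CPUs 4-7 */
static const unsigned node_possible[2] = { 0x0f, 0xf0 };
static const unsigned node_online[2]   = { 0x0f, 0xf0 };

/* mirrors wq_calc_node_cpumask(): true if *out differs from attrs_mask */
static bool calc_node_mask(unsigned attrs_mask, int node, int cpu_down,
                           unsigned *out)
{
        unsigned m = node_online[node] & attrs_mask;

        if (cpu_down >= 0)
                m &= ~(1u << cpu_down);
        if (!m) {                       /* no usable CPU on this node: use default */
                *out = attrs_mask;
                return false;
        }
        *out = node_possible[node] & attrs_mask;
        return *out != attrs_mask;
}

int main(void)
{
        unsigned out;
        bool diff;

        /* attrs want CPUs 2-5: node 0 gets {2,3}, node 1 gets {4,5} */
        diff = calc_node_mask(0x3c, 0, -1, &out);
        printf("node0: %d mask %#x\n", diff, out);
        diff = calc_node_mask(0x3c, 1, -1, &out);
        printf("node1: %d mask %#x\n", diff, out);
        /* attrs want CPUs 0-1 only: node 1 falls back to the default mask */
        diff = calc_node_mask(0x03, 1, -1, &out);
        printf("node1 fallback: %d mask %#x\n", diff, out);
        return 0;
}
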
3800
3801/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
3802static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
3803 int node,
3804 struct pool_workqueue *pwq)
3805{
3806 struct pool_workqueue *old_pwq;
3807
3808 lockdep_assert_held(&wq->mutex);
3809
3810 /* link_pwq() can handle duplicate calls */
3811 link_pwq(pwq);
3812
3813 old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
3814 rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
3815 return old_pwq;
3816}
3817
3818/**
3819 * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
3820 * @wq: the target workqueue
3821 * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
3822 *
3823 * Apply @attrs to an unbound workqueue @wq. Unless disabled, on NUMA
3824 * machines, this function maps a separate pwq to each NUMA node with
3825 * possible CPUs in @attrs->cpumask so that work items are affine to the
3826 * NUMA node they were issued on. Older pwqs are released as in-flight work
3827 * items finish. Note that a work item which repeatedly requeues itself
3828 * back-to-back will stay on its current pwq.
3829 *
3830 * Performs GFP_KERNEL allocations. Returns 0 on success and -errno on
3831 * failure.
3832 */
3833int apply_workqueue_attrs(struct workqueue_struct *wq,
3834 const struct workqueue_attrs *attrs)
3835{
3836 struct workqueue_attrs *new_attrs, *tmp_attrs;
3837 struct pool_workqueue **pwq_tbl, *dfl_pwq;
3838 int node, ret;
3839
3840 /* only unbound workqueues can change attributes */
3841 if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
3842 return -EINVAL;
3843
3844 /* creating multiple pwqs breaks ordering guarantee */
3845 if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
3846 return -EINVAL;
3847
3848 pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
3849 new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
3850 tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
3851 if (!pwq_tbl || !new_attrs || !tmp_attrs)
3852 goto enomem;
3853
3854 /* make a copy of @attrs and sanitize it */
3855 copy_workqueue_attrs(new_attrs, attrs);
3856 cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
3857
3858 /*
3859 * We may create multiple pwqs with differing cpumasks. Make a
3860 * copy of @new_attrs which will be modified and used to obtain
3861 * pools.
3862 */
3863 copy_workqueue_attrs(tmp_attrs, new_attrs);
3864
3865 /*
3866 * CPUs should stay stable across pwq creations and installations.
3867 * Pin CPUs, determine the target cpumask for each node and create
3868 * pwqs accordingly.
3869 */
3870 get_online_cpus();
3871
3872 mutex_lock(&wq_pool_mutex);
3873
3874 /*
3875 * If something goes wrong during CPU up/down, we'll fall back to
3876 * the default pwq covering the whole @attrs->cpumask. Always create
3877 * it even if we don't use it immediately.
3878 */
3879 dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
3880 if (!dfl_pwq)
3881 goto enomem_pwq;
3882
3883 for_each_node(node) {
3884 if (wq_calc_node_cpumask(attrs, node, -1, tmp_attrs->cpumask)) {
3885 pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
3886 if (!pwq_tbl[node])
3887 goto enomem_pwq;
3888 } else {
3889 dfl_pwq->refcnt++;
3890 pwq_tbl[node] = dfl_pwq;
3891 }
3892 }
3893
3894 mutex_unlock(&wq_pool_mutex);
3895
3896 /* all pwqs have been created successfully, let's install them */
3118 }
3119 }
3120
3121 /* just in case, make sure it's actually aligned */
3122 BUG_ON(!IS_ALIGNED(wq->pool_wq.v, align));
3123 return wq->pool_wq.v ? 0 : -ENOMEM;
3897 mutex_lock(&wq->mutex);
3898
3899 copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
3900
3901 /* save the previous pwq and install the new one */
3902 for_each_node(node)
3903 pwq_tbl[node] = numa_pwq_tbl_install(wq, node, pwq_tbl[node]);
3904
3905 /* @dfl_pwq might not have been used, ensure it's linked */
3906 link_pwq(dfl_pwq);
3907 swap(wq->dfl_pwq, dfl_pwq);
3908
3909 mutex_unlock(&wq->mutex);
3910
3911 /* put the old pwqs */
3912 for_each_node(node)
3913 put_pwq_unlocked(pwq_tbl[node]);
3914 put_pwq_unlocked(dfl_pwq);
3915
3916 put_online_cpus();
3917 ret = 0;
3918 /* fall through */
3919out_free:
3920 free_workqueue_attrs(tmp_attrs);
3921 free_workqueue_attrs(new_attrs);
3922 kfree(pwq_tbl);
3923 return ret;
3924
3925enomem_pwq:
3926 free_unbound_pwq(dfl_pwq);
3927 for_each_node(node)
3928 if (pwq_tbl && pwq_tbl[node] != dfl_pwq)
3929 free_unbound_pwq(pwq_tbl[node]);
3930 mutex_unlock(&wq_pool_mutex);
3931 put_online_cpus();
3932enomem:
3933 ret = -ENOMEM;
3934 goto out_free;
3124} 3935}
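
As a usage sketch (not part of this patch; the workqueue name, nice level and node choice are invented), a caller that wants an unbound workqueue confined to NUMA node 0 could apply attributes roughly like this:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;     /* hypothetical */

static int example_setup(void)
{
        struct workqueue_attrs *attrs;
        int ret;

        example_wq = alloc_workqueue("example_unbound", WQ_UNBOUND, 0);
        if (!example_wq)
                return -ENOMEM;

        attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (!attrs) {
                ret = -ENOMEM;
                goto out_destroy;
        }

        attrs->nice = -5;
        cpumask_copy(attrs->cpumask, cpumask_of_node(0));

        ret = apply_workqueue_attrs(example_wq, attrs);
        free_workqueue_attrs(attrs);    /* @attrs is copied, safe to free */
        if (!ret)
                return 0;
out_destroy:
        destroy_workqueue(example_wq);
        return ret;
}
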
3125 3936
3126static void free_pwqs(struct workqueue_struct *wq) 3937/**
3938 * wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot[un]plug
3939 * @wq: the target workqueue
3940 * @cpu: the CPU coming up or going down
3941 * @online: whether @cpu is coming up or going down
3942 *
3943 * This function is to be called from %CPU_DOWN_PREPARE, %CPU_ONLINE and
3944 * %CPU_DOWN_FAILED. @cpu is being hot[un]plugged, update NUMA affinity of
3945 * @wq accordingly.
3946 *
3947 * If NUMA affinity can't be adjusted due to memory allocation failure, it
3948 * falls back to @wq->dfl_pwq which may not be optimal but is always
3949 * correct.
3950 *
3951 * Note that when the last allowed CPU of a NUMA node goes offline for a
3952 * workqueue with a cpumask spanning multiple nodes, the workers which were
3953 * already executing the work items for the workqueue will lose their CPU
3954 * affinity and may execute on any CPU. This is similar to how per-cpu
3955 * workqueues behave on CPU_DOWN. If a workqueue user wants strict
3956 * affinity, it's the user's responsibility to flush the work item from
3957 * CPU_DOWN_PREPARE.
3958 */
3959static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
3960 bool online)
3127{ 3961{
3128 if (!(wq->flags & WQ_UNBOUND)) 3962 int node = cpu_to_node(cpu);
3129 free_percpu(wq->pool_wq.pcpu); 3963 int cpu_off = online ? -1 : cpu;
3130 else if (wq->pool_wq.single) { 3964 struct pool_workqueue *old_pwq = NULL, *pwq;
3131 /* the pointer to free is stored right after the pwq */ 3965 struct workqueue_attrs *target_attrs;
3132 kfree(*(void **)(wq->pool_wq.single + 1)); 3966 cpumask_t *cpumask;
3967
3968 lockdep_assert_held(&wq_pool_mutex);
3969
3970 if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
3971 return;
3972
3973 /*
3974 * We don't want to alloc/free wq_attrs for each wq for each CPU.
3975 * Let's use a preallocated one. The following buf is protected by
3976 * CPU hotplug exclusion.
3977 */
3978 target_attrs = wq_update_unbound_numa_attrs_buf;
3979 cpumask = target_attrs->cpumask;
3980
3981 mutex_lock(&wq->mutex);
3982 if (wq->unbound_attrs->no_numa)
3983 goto out_unlock;
3984
3985 copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
3986 pwq = unbound_pwq_by_node(wq, node);
3987
3988 /*
3989 * Let's determine what needs to be done. If the target cpumask is
3990 * different from wq's, we need to compare it to @pwq's and create
3991 * a new one if they don't match. If the target cpumask equals
3992 * wq's, the default pwq should be used. If @pwq is already the
3993 * default one, nothing to do; otherwise, install the default one.
3994 */
3995 if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, cpumask)) {
3996 if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
3997 goto out_unlock;
3998 } else {
3999 if (pwq == wq->dfl_pwq)
4000 goto out_unlock;
4001 else
4002 goto use_dfl_pwq;
4003 }
4004
4005 mutex_unlock(&wq->mutex);
4006
4007 /* create a new pwq */
4008 pwq = alloc_unbound_pwq(wq, target_attrs);
4009 if (!pwq) {
4010 pr_warning("workqueue: allocation failed while updating NUMA affinity of \"%s\"\n",
4011 wq->name);
4012 goto out_unlock;
4013 }
4014
4015 /*
4016 * Install the new pwq. As this function is called only from CPU
4017 * hotplug callbacks and applying a new attrs is wrapped with
4018 * get/put_online_cpus(), @wq->unbound_attrs couldn't have changed
4019 * in between.
4020 */
4021 mutex_lock(&wq->mutex);
4022 old_pwq = numa_pwq_tbl_install(wq, node, pwq);
4023 goto out_unlock;
4024
4025use_dfl_pwq:
4026 spin_lock_irq(&wq->dfl_pwq->pool->lock);
4027 get_pwq(wq->dfl_pwq);
4028 spin_unlock_irq(&wq->dfl_pwq->pool->lock);
4029 old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);
4030out_unlock:
4031 mutex_unlock(&wq->mutex);
4032 put_pwq_unlocked(old_pwq);
4033}
4034
4035static int alloc_and_link_pwqs(struct workqueue_struct *wq)
4036{
4037 bool highpri = wq->flags & WQ_HIGHPRI;
4038 int cpu;
4039
4040 if (!(wq->flags & WQ_UNBOUND)) {
4041 wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
4042 if (!wq->cpu_pwqs)
4043 return -ENOMEM;
4044
4045 for_each_possible_cpu(cpu) {
4046 struct pool_workqueue *pwq =
4047 per_cpu_ptr(wq->cpu_pwqs, cpu);
4048 struct worker_pool *cpu_pools =
4049 per_cpu(cpu_worker_pools, cpu);
4050
4051 init_pwq(pwq, wq, &cpu_pools[highpri]);
4052
4053 mutex_lock(&wq->mutex);
4054 link_pwq(pwq);
4055 mutex_unlock(&wq->mutex);
4056 }
4057 return 0;
4058 } else {
4059 return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
3133 } 4060 }
3134} 4061}
3135 4062
@@ -3151,30 +4078,28 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
3151 struct lock_class_key *key, 4078 struct lock_class_key *key,
3152 const char *lock_name, ...) 4079 const char *lock_name, ...)
3153{ 4080{
3154 va_list args, args1; 4081 size_t tbl_size = 0;
4082 va_list args;
3155 struct workqueue_struct *wq; 4083 struct workqueue_struct *wq;
3156 unsigned int cpu; 4084 struct pool_workqueue *pwq;
3157 size_t namelen;
3158 4085
3159 /* determine namelen, allocate wq and format name */ 4086 /* allocate wq and format name */
3160 va_start(args, lock_name); 4087 if (flags & WQ_UNBOUND)
3161 va_copy(args1, args); 4088 tbl_size = wq_numa_tbl_len * sizeof(wq->numa_pwq_tbl[0]);
3162 namelen = vsnprintf(NULL, 0, fmt, args) + 1;
3163 4089
3164 wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL); 4090 wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
3165 if (!wq) 4091 if (!wq)
3166 goto err; 4092 return NULL;
3167 4093
3168 vsnprintf(wq->name, namelen, fmt, args1); 4094 if (flags & WQ_UNBOUND) {
3169 va_end(args); 4095 wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
3170 va_end(args1); 4096 if (!wq->unbound_attrs)
4097 goto err_free_wq;
4098 }
3171 4099
3172 /* 4100 va_start(args, lock_name);
3173 * Workqueues which may be used during memory reclaim should 4101 vsnprintf(wq->name, sizeof(wq->name), fmt, args);
3174 * have a rescuer to guarantee forward progress. 4102 va_end(args);
3175 */
3176 if (flags & WQ_MEM_RECLAIM)
3177 flags |= WQ_RESCUER;
3178 4103
3179 max_active = max_active ?: WQ_DFL_ACTIVE; 4104 max_active = max_active ?: WQ_DFL_ACTIVE;
3180 max_active = wq_clamp_max_active(max_active, flags, wq->name); 4105 max_active = wq_clamp_max_active(max_active, flags, wq->name);
@@ -3182,71 +4107,70 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
3182 /* init wq */ 4107 /* init wq */
3183 wq->flags = flags; 4108 wq->flags = flags;
3184 wq->saved_max_active = max_active; 4109 wq->saved_max_active = max_active;
3185 mutex_init(&wq->flush_mutex); 4110 mutex_init(&wq->mutex);
3186 atomic_set(&wq->nr_pwqs_to_flush, 0); 4111 atomic_set(&wq->nr_pwqs_to_flush, 0);
4112 INIT_LIST_HEAD(&wq->pwqs);
3187 INIT_LIST_HEAD(&wq->flusher_queue); 4113 INIT_LIST_HEAD(&wq->flusher_queue);
3188 INIT_LIST_HEAD(&wq->flusher_overflow); 4114 INIT_LIST_HEAD(&wq->flusher_overflow);
4115 INIT_LIST_HEAD(&wq->maydays);
3189 4116
3190 lockdep_init_map(&wq->lockdep_map, lock_name, key, 0); 4117 lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
3191 INIT_LIST_HEAD(&wq->list); 4118 INIT_LIST_HEAD(&wq->list);
3192 4119
3193 if (alloc_pwqs(wq) < 0) 4120 if (alloc_and_link_pwqs(wq) < 0)
3194 goto err; 4121 goto err_free_wq;
3195
3196 for_each_pwq_cpu(cpu, wq) {
3197 struct pool_workqueue *pwq = get_pwq(cpu, wq);
3198 4122
3199 BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK); 4123 /*
3200 pwq->pool = get_std_worker_pool(cpu, flags & WQ_HIGHPRI); 4124 * Workqueues which may be used during memory reclaim should
3201 pwq->wq = wq; 4125 * have a rescuer to guarantee forward progress.
3202 pwq->flush_color = -1; 4126 */
3203 pwq->max_active = max_active; 4127 if (flags & WQ_MEM_RECLAIM) {
3204 INIT_LIST_HEAD(&pwq->delayed_works);
3205 }
3206
3207 if (flags & WQ_RESCUER) {
3208 struct worker *rescuer; 4128 struct worker *rescuer;
3209 4129
3210 if (!alloc_mayday_mask(&wq->mayday_mask, GFP_KERNEL)) 4130 rescuer = alloc_worker();
3211 goto err;
3212
3213 wq->rescuer = rescuer = alloc_worker();
3214 if (!rescuer) 4131 if (!rescuer)
3215 goto err; 4132 goto err_destroy;
3216 4133
3217 rescuer->rescue_wq = wq; 4134 rescuer->rescue_wq = wq;
3218 rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", 4135 rescuer->task = kthread_create(rescuer_thread, rescuer, "%s",
3219 wq->name); 4136 wq->name);
3220 if (IS_ERR(rescuer->task)) 4137 if (IS_ERR(rescuer->task)) {
3221 goto err; 4138 kfree(rescuer);
4139 goto err_destroy;
4140 }
3222 4141
3223 rescuer->task->flags |= PF_THREAD_BOUND; 4142 wq->rescuer = rescuer;
4143 rescuer->task->flags |= PF_NO_SETAFFINITY;
3224 wake_up_process(rescuer->task); 4144 wake_up_process(rescuer->task);
3225 } 4145 }
3226 4146
4147 if ((wq->flags & WQ_SYSFS) && workqueue_sysfs_register(wq))
4148 goto err_destroy;
4149
3227 /* 4150 /*
3228 * workqueue_lock protects global freeze state and workqueues 4151 * wq_pool_mutex protects global freeze state and workqueues list.
3229 * list. Grab it, set max_active accordingly and add the new 4152 * Grab it, adjust max_active and add the new @wq to workqueues
3230 * workqueue to workqueues list. 4153 * list.
3231 */ 4154 */
3232 spin_lock(&workqueue_lock); 4155 mutex_lock(&wq_pool_mutex);
3233 4156
3234 if (workqueue_freezing && wq->flags & WQ_FREEZABLE) 4157 mutex_lock(&wq->mutex);
3235 for_each_pwq_cpu(cpu, wq) 4158 for_each_pwq(pwq, wq)
3236 get_pwq(cpu, wq)->max_active = 0; 4159 pwq_adjust_max_active(pwq);
4160 mutex_unlock(&wq->mutex);
3237 4161
3238 list_add(&wq->list, &workqueues); 4162 list_add(&wq->list, &workqueues);
3239 4163
3240 spin_unlock(&workqueue_lock); 4164 mutex_unlock(&wq_pool_mutex);
3241 4165
3242 return wq; 4166 return wq;
3243err: 4167
3244 if (wq) { 4168err_free_wq:
3245 free_pwqs(wq); 4169 free_workqueue_attrs(wq->unbound_attrs);
3246 free_mayday_mask(wq->mayday_mask); 4170 kfree(wq);
3247 kfree(wq->rescuer); 4171 return NULL;
3248 kfree(wq); 4172err_destroy:
3249 } 4173 destroy_workqueue(wq);
3250 return NULL; 4174 return NULL;
3251} 4175}
3252EXPORT_SYMBOL_GPL(__alloc_workqueue_key); 4176EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
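
Most callers reach this constructor through the alloc_workqueue() wrapper. A hedged sketch with invented names, showing how WQ_MEM_RECLAIM requests the rescuer thread created above:

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;     /* hypothetical */
static struct work_struct reclaim_work;

static void reclaim_fn(struct work_struct *work)
{
        /* forward progress is guaranteed by the rescuer under memory pressure */
}

static int reclaim_init(void)
{
        reclaim_wq = alloc_workqueue("example_reclaim",
                                     WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
        if (!reclaim_wq)
                return -ENOMEM;

        INIT_WORK(&reclaim_work, reclaim_fn);
        queue_work(reclaim_wq, &reclaim_work);
        return 0;
}

static void reclaim_exit(void)
{
        destroy_workqueue(reclaim_wq);  /* drains pending work first */
}
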
@@ -3259,60 +4183,78 @@ EXPORT_SYMBOL_GPL(__alloc_workqueue_key);
3259 */ 4183 */
3260void destroy_workqueue(struct workqueue_struct *wq) 4184void destroy_workqueue(struct workqueue_struct *wq)
3261{ 4185{
3262 unsigned int cpu; 4186 struct pool_workqueue *pwq;
4187 int node;
3263 4188
3264 /* drain it before proceeding with destruction */ 4189 /* drain it before proceeding with destruction */
3265 drain_workqueue(wq); 4190 drain_workqueue(wq);
3266 4191
4192 /* sanity checks */
4193 mutex_lock(&wq->mutex);
4194 for_each_pwq(pwq, wq) {
4195 int i;
4196
4197 for (i = 0; i < WORK_NR_COLORS; i++) {
4198 if (WARN_ON(pwq->nr_in_flight[i])) {
4199 mutex_unlock(&wq->mutex);
4200 return;
4201 }
4202 }
4203
4204 if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
4205 WARN_ON(pwq->nr_active) ||
4206 WARN_ON(!list_empty(&pwq->delayed_works))) {
4207 mutex_unlock(&wq->mutex);
4208 return;
4209 }
4210 }
4211 mutex_unlock(&wq->mutex);
4212
3267 /* 4213 /*
3268 * wq list is used to freeze wq, remove from list after 4214 * wq list is used to freeze wq, remove from list after
3269 * flushing is complete in case freeze races us. 4215 * flushing is complete in case freeze races us.
3270 */ 4216 */
3271 spin_lock(&workqueue_lock); 4217 mutex_lock(&wq_pool_mutex);
3272 list_del(&wq->list); 4218 list_del_init(&wq->list);
3273 spin_unlock(&workqueue_lock); 4219 mutex_unlock(&wq_pool_mutex);
3274 4220
3275 /* sanity check */ 4221 workqueue_sysfs_unregister(wq);
3276 for_each_pwq_cpu(cpu, wq) {
3277 struct pool_workqueue *pwq = get_pwq(cpu, wq);
3278 int i;
3279 4222
3280 for (i = 0; i < WORK_NR_COLORS; i++) 4223 if (wq->rescuer) {
3281 BUG_ON(pwq->nr_in_flight[i]);
3282 BUG_ON(pwq->nr_active);
3283 BUG_ON(!list_empty(&pwq->delayed_works));
3284 }
3285
3286 if (wq->flags & WQ_RESCUER) {
3287 kthread_stop(wq->rescuer->task); 4224 kthread_stop(wq->rescuer->task);
3288 free_mayday_mask(wq->mayday_mask);
3289 kfree(wq->rescuer); 4225 kfree(wq->rescuer);
4226 wq->rescuer = NULL;
3290 } 4227 }
3291 4228
3292 free_pwqs(wq); 4229 if (!(wq->flags & WQ_UNBOUND)) {
3293 kfree(wq); 4230 /*
3294} 4231 * The base ref is never dropped on per-cpu pwqs. Directly
3295EXPORT_SYMBOL_GPL(destroy_workqueue); 4232 * free the pwqs and wq.
3296 4233 */
3297/** 4234 free_percpu(wq->cpu_pwqs);
3298 * pwq_set_max_active - adjust max_active of a pwq 4235 kfree(wq);
3299 * @pwq: target pool_workqueue 4236 } else {
3300 * @max_active: new max_active value. 4237 /*
3301 * 4238 * We're the sole accessor of @wq at this point. Directly
3302 * Set @pwq->max_active to @max_active and activate delayed works if 4239 * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
3303 * increased. 4240 * @wq will be freed when the last pwq is released.
3304 * 4241 */
3305 * CONTEXT: 4242 for_each_node(node) {
3306 * spin_lock_irq(pool->lock). 4243 pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
3307 */ 4244 RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
3308static void pwq_set_max_active(struct pool_workqueue *pwq, int max_active) 4245 put_pwq_unlocked(pwq);
3309{ 4246 }
3310 pwq->max_active = max_active;
3311 4247
3312 while (!list_empty(&pwq->delayed_works) && 4248 /*
3313 pwq->nr_active < pwq->max_active) 4249 * Put dfl_pwq. @wq may be freed any time after dfl_pwq is
3314 pwq_activate_first_delayed(pwq); 4250 * put. Don't access it afterwards.
4251 */
4252 pwq = wq->dfl_pwq;
4253 wq->dfl_pwq = NULL;
4254 put_pwq_unlocked(pwq);
4255 }
3315} 4256}
4257EXPORT_SYMBOL_GPL(destroy_workqueue);
3316 4258
3317/** 4259/**
3318 * workqueue_set_max_active - adjust max_active of a workqueue 4260 * workqueue_set_max_active - adjust max_active of a workqueue
@@ -3326,30 +4268,37 @@ static void pwq_set_max_active(struct pool_workqueue *pwq, int max_active)
3326 */ 4268 */
3327void workqueue_set_max_active(struct workqueue_struct *wq, int max_active) 4269void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
3328{ 4270{
3329 unsigned int cpu; 4271 struct pool_workqueue *pwq;
4272
4273 /* disallow meddling with max_active for ordered workqueues */
4274 if (WARN_ON(wq->flags & __WQ_ORDERED))
4275 return;
3330 4276
3331 max_active = wq_clamp_max_active(max_active, wq->flags, wq->name); 4277 max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);
3332 4278
3333 spin_lock(&workqueue_lock); 4279 mutex_lock(&wq->mutex);
3334 4280
3335 wq->saved_max_active = max_active; 4281 wq->saved_max_active = max_active;
3336 4282
3337 for_each_pwq_cpu(cpu, wq) { 4283 for_each_pwq(pwq, wq)
3338 struct pool_workqueue *pwq = get_pwq(cpu, wq); 4284 pwq_adjust_max_active(pwq);
3339 struct worker_pool *pool = pwq->pool;
3340
3341 spin_lock_irq(&pool->lock);
3342 4285
3343 if (!(wq->flags & WQ_FREEZABLE) || 4286 mutex_unlock(&wq->mutex);
3344 !(pool->flags & POOL_FREEZING)) 4287}
3345 pwq_set_max_active(pwq, max_active); 4288EXPORT_SYMBOL_GPL(workqueue_set_max_active);
3346 4289
3347 spin_unlock_irq(&pool->lock); 4290/**
3348 } 4291 * current_is_workqueue_rescuer - is %current workqueue rescuer?
4292 *
4293 * Determine whether %current is a workqueue rescuer. Can be used from
4294 * work functions to determine whether it's being run off the rescuer task.
4295 */
4296bool current_is_workqueue_rescuer(void)
4297{
4298 struct worker *worker = current_wq_worker();
3349 4299
3350 spin_unlock(&workqueue_lock); 4300 return worker && worker->rescue_wq;
3351} 4301}
3352EXPORT_SYMBOL_GPL(workqueue_set_max_active);
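
current_is_workqueue_rescuer() exists mainly so a work function can behave more conservatively when it is being run off the rescuer. A hypothetical example (the allocation and its size are made up):

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

static void example_work_fn(struct work_struct *work)
{
        /* don't wait for reclaim when running off the rescuer */
        gfp_t gfp = current_is_workqueue_rescuer() ? GFP_NOWAIT : GFP_KERNEL;
        void *buf = kmalloc(4096, gfp);

        if (!buf)
                return;         /* better to retry later than block the rescuer */
        /* ... */
        kfree(buf);
}
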
3353 4302
3354/** 4303/**
3355 * workqueue_congested - test whether a workqueue is congested 4304 * workqueue_congested - test whether a workqueue is congested
@@ -3363,11 +4312,22 @@ EXPORT_SYMBOL_GPL(workqueue_set_max_active);
3363 * RETURNS: 4312 * RETURNS:
3364 * %true if congested, %false otherwise. 4313 * %true if congested, %false otherwise.
3365 */ 4314 */
3366bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq) 4315bool workqueue_congested(int cpu, struct workqueue_struct *wq)
3367{ 4316{
3368 struct pool_workqueue *pwq = get_pwq(cpu, wq); 4317 struct pool_workqueue *pwq;
4318 bool ret;
3369 4319
3370 return !list_empty(&pwq->delayed_works); 4320 rcu_read_lock_sched();
4321
4322 if (!(wq->flags & WQ_UNBOUND))
4323 pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
4324 else
4325 pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
4326
4327 ret = !list_empty(&pwq->delayed_works);
4328 rcu_read_unlock_sched();
4329
4330 return ret;
3371} 4331}
3372EXPORT_SYMBOL_GPL(workqueue_congested); 4332EXPORT_SYMBOL_GPL(workqueue_congested);
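
workqueue_congested() is only a hint for producers. A hypothetical helper (names invented) that keeps work CPU-local unless that CPU's pwq already has delayed work queued:

#include <linux/workqueue.h>

static bool example_queue_on(int cpu, struct workqueue_struct *wq,
                             struct work_struct *work)
{
        if (workqueue_congested(cpu, wq))
                return queue_work(wq, work);    /* let the workqueue pick */
        return queue_work_on(cpu, wq, work);    /* stay on @cpu */
}
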
3373 4333
@@ -3384,19 +4344,22 @@ EXPORT_SYMBOL_GPL(workqueue_congested);
3384 */ 4344 */
3385unsigned int work_busy(struct work_struct *work) 4345unsigned int work_busy(struct work_struct *work)
3386{ 4346{
3387 struct worker_pool *pool = get_work_pool(work); 4347 struct worker_pool *pool;
3388 unsigned long flags; 4348 unsigned long flags;
3389 unsigned int ret = 0; 4349 unsigned int ret = 0;
3390 4350
3391 if (work_pending(work)) 4351 if (work_pending(work))
3392 ret |= WORK_BUSY_PENDING; 4352 ret |= WORK_BUSY_PENDING;
3393 4353
4354 local_irq_save(flags);
4355 pool = get_work_pool(work);
3394 if (pool) { 4356 if (pool) {
3395 spin_lock_irqsave(&pool->lock, flags); 4357 spin_lock(&pool->lock);
3396 if (find_worker_executing_work(pool, work)) 4358 if (find_worker_executing_work(pool, work))
3397 ret |= WORK_BUSY_RUNNING; 4359 ret |= WORK_BUSY_RUNNING;
3398 spin_unlock_irqrestore(&pool->lock, flags); 4360 spin_unlock(&pool->lock);
3399 } 4361 }
4362 local_irq_restore(flags);
3400 4363
3401 return ret; 4364 return ret;
3402} 4365}
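
work_busy() returns a bitmask. A small hypothetical debugging helper showing how WORK_BUSY_PENDING and WORK_BUSY_RUNNING are consumed:

#include <linux/printk.h>
#include <linux/workqueue.h>

static void example_report(struct work_struct *work)
{
        unsigned int busy = work_busy(work);

        pr_info("work %p:%s%s%s\n", work,
                (busy & WORK_BUSY_PENDING) ? " pending" : "",
                (busy & WORK_BUSY_RUNNING) ? " running" : "",
                busy ? "" : " idle");
}
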
@@ -3422,31 +4385,28 @@ static void wq_unbind_fn(struct work_struct *work)
3422 int cpu = smp_processor_id(); 4385 int cpu = smp_processor_id();
3423 struct worker_pool *pool; 4386 struct worker_pool *pool;
3424 struct worker *worker; 4387 struct worker *worker;
3425 int i; 4388 int wi;
3426 4389
3427 for_each_std_worker_pool(pool, cpu) { 4390 for_each_cpu_worker_pool(pool, cpu) {
3428 BUG_ON(cpu != smp_processor_id()); 4391 WARN_ON_ONCE(cpu != smp_processor_id());
3429 4392
3430 mutex_lock(&pool->assoc_mutex); 4393 mutex_lock(&pool->manager_mutex);
3431 spin_lock_irq(&pool->lock); 4394 spin_lock_irq(&pool->lock);
3432 4395
3433 /* 4396 /*
3434 * We've claimed all manager positions. Make all workers 4397 * We've blocked all manager operations. Make all workers
3435 * unbound and set DISASSOCIATED. Before this, all workers 4398 * unbound and set DISASSOCIATED. Before this, all workers
3436 * except for the ones which are still executing works from 4399 * except for the ones which are still executing works from
3437 * before the last CPU down must be on the cpu. After 4400 * before the last CPU down must be on the cpu. After
3438 * this, they may become diasporas. 4401 * this, they may become diasporas.
3439 */ 4402 */
3440 list_for_each_entry(worker, &pool->idle_list, entry) 4403 for_each_pool_worker(worker, wi, pool)
3441 worker->flags |= WORKER_UNBOUND;
3442
3443 for_each_busy_worker(worker, i, pool)
3444 worker->flags |= WORKER_UNBOUND; 4404 worker->flags |= WORKER_UNBOUND;
3445 4405
3446 pool->flags |= POOL_DISASSOCIATED; 4406 pool->flags |= POOL_DISASSOCIATED;
3447 4407
3448 spin_unlock_irq(&pool->lock); 4408 spin_unlock_irq(&pool->lock);
3449 mutex_unlock(&pool->assoc_mutex); 4409 mutex_unlock(&pool->manager_mutex);
3450 4410
3451 /* 4411 /*
3452 * Call schedule() so that we cross rq->lock and thus can 4412 * Call schedule() so that we cross rq->lock and thus can
@@ -3477,6 +4437,103 @@ static void wq_unbind_fn(struct work_struct *work)
3477 } 4437 }
3478} 4438}
3479 4439
4440/**
4441 * rebind_workers - rebind all workers of a pool to the associated CPU
4442 * @pool: pool of interest
4443 *
4444 * @pool->cpu is coming online. Rebind all workers to the CPU.
4445 */
4446static void rebind_workers(struct worker_pool *pool)
4447{
4448 struct worker *worker;
4449 int wi;
4450
4451 lockdep_assert_held(&pool->manager_mutex);
4452
4453 /*
4454 * Restore CPU affinity of all workers. As all idle workers should
4455 * be on the run-queue of the associated CPU before any local
4456 * wake-ups for concurrency management happen, restore CPU affinity
4457 * of all workers first and then clear UNBOUND. As we're called
4458 * from CPU_ONLINE, the following shouldn't fail.
4459 */
4460 for_each_pool_worker(worker, wi, pool)
4461 WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
4462 pool->attrs->cpumask) < 0);
4463
4464 spin_lock_irq(&pool->lock);
4465
4466 for_each_pool_worker(worker, wi, pool) {
4467 unsigned int worker_flags = worker->flags;
4468
4469 /*
4470 * A bound idle worker should actually be on the runqueue
4471 * of the associated CPU for local wake-ups targeting it to
4472 * work. Kick all idle workers so that they migrate to the
4473 * associated CPU. Doing this in the same loop as
4474 * replacing UNBOUND with REBOUND is safe as no worker will
4475 * be bound before @pool->lock is released.
4476 */
4477 if (worker_flags & WORKER_IDLE)
4478 wake_up_process(worker->task);
4479
4480 /*
4481 * We want to clear UNBOUND but can't directly call
4482 * worker_clr_flags() or adjust nr_running. Atomically
4483 * replace UNBOUND with another NOT_RUNNING flag REBOUND.
4484 * @worker will clear REBOUND using worker_clr_flags() when
4485 * it initiates the next execution cycle thus restoring
4486 * concurrency management. Note that when or whether
4487 * @worker clears REBOUND doesn't affect correctness.
4488 *
4489 * ACCESS_ONCE() is necessary because @worker->flags may be
4490 * tested without holding any lock in
4491 * wq_worker_waking_up(). Without it, NOT_RUNNING test may
4492 * fail incorrectly leading to premature concurrency
4493 * management operations.
4494 */
4495 WARN_ON_ONCE(!(worker_flags & WORKER_UNBOUND));
4496 worker_flags |= WORKER_REBOUND;
4497 worker_flags &= ~WORKER_UNBOUND;
4498 ACCESS_ONCE(worker->flags) = worker_flags;
4499 }
4500
4501 spin_unlock_irq(&pool->lock);
4502}
4503
4504/**
4505 * restore_unbound_workers_cpumask - restore cpumask of unbound workers
4506 * @pool: unbound pool of interest
4507 * @cpu: the CPU which is coming up
4508 *
4509 * An unbound pool may end up with a cpumask which doesn't have any online
4510 * CPUs. When a worker of such a pool gets scheduled, the scheduler resets
4511 * its cpus_allowed. If @cpu is in @pool's cpumask which didn't have any
4512 * online CPU before, cpus_allowed of all its workers should be restored.
4513 */
4514static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
4515{
4516 static cpumask_t cpumask;
4517 struct worker *worker;
4518 int wi;
4519
4520 lockdep_assert_held(&pool->manager_mutex);
4521
4522 /* is @cpu allowed for @pool? */
4523 if (!cpumask_test_cpu(cpu, pool->attrs->cpumask))
4524 return;
4525
4526 /* is @cpu the only online CPU? */
4527 cpumask_and(&cpumask, pool->attrs->cpumask, cpu_online_mask);
4528 if (cpumask_weight(&cpumask) != 1)
4529 return;
4530
4531 /* as we're called from CPU_ONLINE, the following shouldn't fail */
4532 for_each_pool_worker(worker, wi, pool)
4533 WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
4534 pool->attrs->cpumask) < 0);
4535}
4536
3480/* 4537/*
3481 * Workqueues should be brought up before normal priority CPU notifiers. 4538 * Workqueues should be brought up before normal priority CPU notifiers.
3482 * This will be registered high priority CPU notifier. 4539 * This will be registered high priority CPU notifier.
@@ -3485,39 +4542,46 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
3485 unsigned long action, 4542 unsigned long action,
3486 void *hcpu) 4543 void *hcpu)
3487{ 4544{
3488 unsigned int cpu = (unsigned long)hcpu; 4545 int cpu = (unsigned long)hcpu;
3489 struct worker_pool *pool; 4546 struct worker_pool *pool;
4547 struct workqueue_struct *wq;
4548 int pi;
3490 4549
3491 switch (action & ~CPU_TASKS_FROZEN) { 4550 switch (action & ~CPU_TASKS_FROZEN) {
3492 case CPU_UP_PREPARE: 4551 case CPU_UP_PREPARE:
3493 for_each_std_worker_pool(pool, cpu) { 4552 for_each_cpu_worker_pool(pool, cpu) {
3494 struct worker *worker;
3495
3496 if (pool->nr_workers) 4553 if (pool->nr_workers)
3497 continue; 4554 continue;
3498 4555 if (create_and_start_worker(pool) < 0)
3499 worker = create_worker(pool);
3500 if (!worker)
3501 return NOTIFY_BAD; 4556 return NOTIFY_BAD;
3502
3503 spin_lock_irq(&pool->lock);
3504 start_worker(worker);
3505 spin_unlock_irq(&pool->lock);
3506 } 4557 }
3507 break; 4558 break;
3508 4559
3509 case CPU_DOWN_FAILED: 4560 case CPU_DOWN_FAILED:
3510 case CPU_ONLINE: 4561 case CPU_ONLINE:
3511 for_each_std_worker_pool(pool, cpu) { 4562 mutex_lock(&wq_pool_mutex);
3512 mutex_lock(&pool->assoc_mutex);
3513 spin_lock_irq(&pool->lock);
3514 4563
3515 pool->flags &= ~POOL_DISASSOCIATED; 4564 for_each_pool(pool, pi) {
3516 rebind_workers(pool); 4565 mutex_lock(&pool->manager_mutex);
4566
4567 if (pool->cpu == cpu) {
4568 spin_lock_irq(&pool->lock);
4569 pool->flags &= ~POOL_DISASSOCIATED;
4570 spin_unlock_irq(&pool->lock);
4571
4572 rebind_workers(pool);
4573 } else if (pool->cpu < 0) {
4574 restore_unbound_workers_cpumask(pool, cpu);
4575 }
3517 4576
3518 spin_unlock_irq(&pool->lock); 4577 mutex_unlock(&pool->manager_mutex);
3519 mutex_unlock(&pool->assoc_mutex);
3520 } 4578 }
4579
4580 /* update NUMA affinity of unbound workqueues */
4581 list_for_each_entry(wq, &workqueues, list)
4582 wq_update_unbound_numa(wq, cpu, true);
4583
4584 mutex_unlock(&wq_pool_mutex);
3521 break; 4585 break;
3522 } 4586 }
3523 return NOTIFY_OK; 4587 return NOTIFY_OK;
@@ -3531,14 +4595,23 @@ static int __cpuinit workqueue_cpu_down_callback(struct notifier_block *nfb,
3531 unsigned long action, 4595 unsigned long action,
3532 void *hcpu) 4596 void *hcpu)
3533{ 4597{
3534 unsigned int cpu = (unsigned long)hcpu; 4598 int cpu = (unsigned long)hcpu;
3535 struct work_struct unbind_work; 4599 struct work_struct unbind_work;
4600 struct workqueue_struct *wq;
3536 4601
3537 switch (action & ~CPU_TASKS_FROZEN) { 4602 switch (action & ~CPU_TASKS_FROZEN) {
3538 case CPU_DOWN_PREPARE: 4603 case CPU_DOWN_PREPARE:
3539 /* unbinding should happen on the local CPU */ 4604 /* unbinding per-cpu workers should happen on the local CPU */
3540 INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn); 4605 INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
3541 queue_work_on(cpu, system_highpri_wq, &unbind_work); 4606 queue_work_on(cpu, system_highpri_wq, &unbind_work);
4607
4608 /* update NUMA affinity of unbound workqueues */
4609 mutex_lock(&wq_pool_mutex);
4610 list_for_each_entry(wq, &workqueues, list)
4611 wq_update_unbound_numa(wq, cpu, false);
4612 mutex_unlock(&wq_pool_mutex);
4613
4614 /* wait for per-cpu unbinding to finish */
3542 flush_work(&unbind_work); 4615 flush_work(&unbind_work);
3543 break; 4616 break;
3544 } 4617 }
@@ -3571,7 +4644,7 @@ static void work_for_cpu_fn(struct work_struct *work)
3571 * It is up to the caller to ensure that the cpu doesn't go offline. 4644 * It is up to the caller to ensure that the cpu doesn't go offline.
3572 * The caller must not hold any locks which would prevent @fn from completing. 4645 * The caller must not hold any locks which would prevent @fn from completing.
3573 */ 4646 */
3574long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg) 4647long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
3575{ 4648{
3576 struct work_for_cpu wfc = { .fn = fn, .arg = arg }; 4649 struct work_for_cpu wfc = { .fn = fn, .arg = arg };
3577 4650
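
A hedged sketch of the work_on_cpu() interface whose prototype changes here; the probe function and the CPU number are invented, and the caller pins CPU hotplug as the comment above requires:

#include <linux/cpu.h>
#include <linux/errno.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static long example_probe_fn(void *arg)
{
        int *where = arg;

        *where = raw_smp_processor_id();        /* runs on the requested CPU */
        return 0;
}

static long example_probe(void)
{
        int where = -1;
        long ret = -ENODEV;

        get_online_cpus();                      /* keep CPU 2 from going offline */
        if (cpu_online(2))
                ret = work_on_cpu(2, example_probe_fn, &where);
        put_online_cpus();
        return ret;
}
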
@@ -3589,44 +4662,40 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
3589 * freeze_workqueues_begin - begin freezing workqueues 4662 * freeze_workqueues_begin - begin freezing workqueues
3590 * 4663 *
3591 * Start freezing workqueues. After this function returns, all freezable 4664 * Start freezing workqueues. After this function returns, all freezable
3592 * workqueues will queue new works to their frozen_works list instead of 4665 * workqueues will queue new works to their delayed_works list instead of
3593 * pool->worklist. 4666 * pool->worklist.
3594 * 4667 *
3595 * CONTEXT: 4668 * CONTEXT:
3596 * Grabs and releases workqueue_lock and pool->lock's. 4669 * Grabs and releases wq_pool_mutex, wq->mutex and pool->lock's.
3597 */ 4670 */
3598void freeze_workqueues_begin(void) 4671void freeze_workqueues_begin(void)
3599{ 4672{
3600 unsigned int cpu; 4673 struct worker_pool *pool;
4674 struct workqueue_struct *wq;
4675 struct pool_workqueue *pwq;
4676 int pi;
3601 4677
3602 spin_lock(&workqueue_lock); 4678 mutex_lock(&wq_pool_mutex);
3603 4679
3604 BUG_ON(workqueue_freezing); 4680 WARN_ON_ONCE(workqueue_freezing);
3605 workqueue_freezing = true; 4681 workqueue_freezing = true;
3606 4682
3607 for_each_wq_cpu(cpu) { 4683 /* set FREEZING */
3608 struct worker_pool *pool; 4684 for_each_pool(pool, pi) {
3609 struct workqueue_struct *wq; 4685 spin_lock_irq(&pool->lock);
3610 4686 WARN_ON_ONCE(pool->flags & POOL_FREEZING);
3611 for_each_std_worker_pool(pool, cpu) { 4687 pool->flags |= POOL_FREEZING;
3612 spin_lock_irq(&pool->lock); 4688 spin_unlock_irq(&pool->lock);
3613 4689 }
3614 WARN_ON_ONCE(pool->flags & POOL_FREEZING);
3615 pool->flags |= POOL_FREEZING;
3616
3617 list_for_each_entry(wq, &workqueues, list) {
3618 struct pool_workqueue *pwq = get_pwq(cpu, wq);
3619
3620 if (pwq && pwq->pool == pool &&
3621 (wq->flags & WQ_FREEZABLE))
3622 pwq->max_active = 0;
3623 }
3624 4690
3625 spin_unlock_irq(&pool->lock); 4691 list_for_each_entry(wq, &workqueues, list) {
3626 } 4692 mutex_lock(&wq->mutex);
4693 for_each_pwq(pwq, wq)
4694 pwq_adjust_max_active(pwq);
4695 mutex_unlock(&wq->mutex);
3627 } 4696 }
3628 4697
3629 spin_unlock(&workqueue_lock); 4698 mutex_unlock(&wq_pool_mutex);
3630} 4699}
3631 4700
3632/** 4701/**
@@ -3636,7 +4705,7 @@ void freeze_workqueues_begin(void)
3636 * between freeze_workqueues_begin() and thaw_workqueues(). 4705 * between freeze_workqueues_begin() and thaw_workqueues().
3637 * 4706 *
3638 * CONTEXT: 4707 * CONTEXT:
3639 * Grabs and releases workqueue_lock. 4708 * Grabs and releases wq_pool_mutex.
3640 * 4709 *
3641 * RETURNS: 4710 * RETURNS:
3642 * %true if some freezable workqueues are still busy. %false if freezing 4711 * %true if some freezable workqueues are still busy. %false if freezing
@@ -3644,34 +4713,34 @@ void freeze_workqueues_begin(void)
3644 */ 4713 */
3645bool freeze_workqueues_busy(void) 4714bool freeze_workqueues_busy(void)
3646{ 4715{
3647 unsigned int cpu;
3648 bool busy = false; 4716 bool busy = false;
4717 struct workqueue_struct *wq;
4718 struct pool_workqueue *pwq;
3649 4719
3650 spin_lock(&workqueue_lock); 4720 mutex_lock(&wq_pool_mutex);
3651 4721
3652 BUG_ON(!workqueue_freezing); 4722 WARN_ON_ONCE(!workqueue_freezing);
3653 4723
3654 for_each_wq_cpu(cpu) { 4724 list_for_each_entry(wq, &workqueues, list) {
3655 struct workqueue_struct *wq; 4725 if (!(wq->flags & WQ_FREEZABLE))
4726 continue;
3656 /* 4727 /*
3657 * nr_active is monotonically decreasing. It's safe 4728 * nr_active is monotonically decreasing. It's safe
3658 * to peek without lock. 4729 * to peek without lock.
3659 */ 4730 */
3660 list_for_each_entry(wq, &workqueues, list) { 4731 rcu_read_lock_sched();
3661 struct pool_workqueue *pwq = get_pwq(cpu, wq); 4732 for_each_pwq(pwq, wq) {
3662 4733 WARN_ON_ONCE(pwq->nr_active < 0);
3663 if (!pwq || !(wq->flags & WQ_FREEZABLE))
3664 continue;
3665
3666 BUG_ON(pwq->nr_active < 0);
3667 if (pwq->nr_active) { 4734 if (pwq->nr_active) {
3668 busy = true; 4735 busy = true;
4736 rcu_read_unlock_sched();
3669 goto out_unlock; 4737 goto out_unlock;
3670 } 4738 }
3671 } 4739 }
4740 rcu_read_unlock_sched();
3672 } 4741 }
3673out_unlock: 4742out_unlock:
3674 spin_unlock(&workqueue_lock); 4743 mutex_unlock(&wq_pool_mutex);
3675 return busy; 4744 return busy;
3676} 4745}
3677 4746
@@ -3682,104 +4751,141 @@ out_unlock:
3682 * frozen works are transferred to their respective pool worklists. 4751 * frozen works are transferred to their respective pool worklists.
3683 * 4752 *
3684 * CONTEXT: 4753 * CONTEXT:
3685 * Grabs and releases workqueue_lock and pool->lock's. 4754 * Grabs and releases wq_pool_mutex, wq->mutex and pool->lock's.
3686 */ 4755 */
3687void thaw_workqueues(void) 4756void thaw_workqueues(void)
3688{ 4757{
3689 unsigned int cpu; 4758 struct workqueue_struct *wq;
4759 struct pool_workqueue *pwq;
4760 struct worker_pool *pool;
4761 int pi;
3690 4762
3691 spin_lock(&workqueue_lock); 4763 mutex_lock(&wq_pool_mutex);
3692 4764
3693 if (!workqueue_freezing) 4765 if (!workqueue_freezing)
3694 goto out_unlock; 4766 goto out_unlock;
3695 4767
3696 for_each_wq_cpu(cpu) { 4768 /* clear FREEZING */
3697 struct worker_pool *pool; 4769 for_each_pool(pool, pi) {
3698 struct workqueue_struct *wq; 4770 spin_lock_irq(&pool->lock);
4771 WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
4772 pool->flags &= ~POOL_FREEZING;
4773 spin_unlock_irq(&pool->lock);
4774 }
3699 4775
3700 for_each_std_worker_pool(pool, cpu) { 4776 /* restore max_active and repopulate worklist */
3701 spin_lock_irq(&pool->lock); 4777 list_for_each_entry(wq, &workqueues, list) {
4778 mutex_lock(&wq->mutex);
4779 for_each_pwq(pwq, wq)
4780 pwq_adjust_max_active(pwq);
4781 mutex_unlock(&wq->mutex);
4782 }
3702 4783
3703 WARN_ON_ONCE(!(pool->flags & POOL_FREEZING)); 4784 workqueue_freezing = false;
3704 pool->flags &= ~POOL_FREEZING; 4785out_unlock:
4786 mutex_unlock(&wq_pool_mutex);
4787}
4788#endif /* CONFIG_FREEZER */
3705 4789
3706 list_for_each_entry(wq, &workqueues, list) { 4790static void __init wq_numa_init(void)
3707 struct pool_workqueue *pwq = get_pwq(cpu, wq); 4791{
4792 cpumask_var_t *tbl;
4793 int node, cpu;
3708 4794
3709 if (!pwq || pwq->pool != pool || 4795 /* determine NUMA pwq table len - highest node id + 1 */
3710 !(wq->flags & WQ_FREEZABLE)) 4796 for_each_node(node)
3711 continue; 4797 wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
3712 4798
3713 /* restore max_active and repopulate worklist */ 4799 if (num_possible_nodes() <= 1)
3714 pwq_set_max_active(pwq, wq->saved_max_active); 4800 return;
3715 }
3716 4801
3717 wake_up_worker(pool); 4802 if (wq_disable_numa) {
4803 pr_info("workqueue: NUMA affinity support disabled\n");
4804 return;
4805 }
4806
4807 wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs(GFP_KERNEL);
4808 BUG_ON(!wq_update_unbound_numa_attrs_buf);
3718 4809
3719 spin_unlock_irq(&pool->lock); 4810 /*
 4811 * We want the mask of possible CPUs for each node, which isn't readily
 4812 * available. Build the masks from cpu_to_node(), which should have been
4813 * fully initialized by now.
4814 */
4815 tbl = kzalloc(wq_numa_tbl_len * sizeof(tbl[0]), GFP_KERNEL);
4816 BUG_ON(!tbl);
4817
4818 for_each_node(node)
4819 BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node));
4820
4821 for_each_possible_cpu(cpu) {
4822 node = cpu_to_node(cpu);
4823 if (WARN_ON(node == NUMA_NO_NODE)) {
4824 pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
4825 /* happens iff arch is bonkers, let's just proceed */
4826 return;
3720 } 4827 }
4828 cpumask_set_cpu(cpu, tbl[node]);
3721 } 4829 }
3722 4830
3723 workqueue_freezing = false; 4831 wq_numa_possible_cpumask = tbl;
3724out_unlock: 4832 wq_numa_enabled = true;
3725 spin_unlock(&workqueue_lock);
3726} 4833}
3727#endif /* CONFIG_FREEZER */
3728 4834
3729static int __init init_workqueues(void) 4835static int __init init_workqueues(void)
3730{ 4836{
3731 unsigned int cpu; 4837 int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
4838 int i, cpu;
3732 4839
3733 /* make sure we have enough bits for OFFQ pool ID */ 4840 /* make sure we have enough bits for OFFQ pool ID */
3734 BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) < 4841 BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
3735 WORK_CPU_END * NR_STD_WORKER_POOLS); 4842 WORK_CPU_END * NR_STD_WORKER_POOLS);
3736 4843
4844 WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
4845
4846 pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
4847
3737 cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP); 4848 cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
3738 hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN); 4849 hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
3739 4850
4851 wq_numa_init();
4852
3740 /* initialize CPU pools */ 4853 /* initialize CPU pools */
3741 for_each_wq_cpu(cpu) { 4854 for_each_possible_cpu(cpu) {
3742 struct worker_pool *pool; 4855 struct worker_pool *pool;
3743 4856
3744 for_each_std_worker_pool(pool, cpu) { 4857 i = 0;
3745 spin_lock_init(&pool->lock); 4858 for_each_cpu_worker_pool(pool, cpu) {
4859 BUG_ON(init_worker_pool(pool));
3746 pool->cpu = cpu; 4860 pool->cpu = cpu;
3747 pool->flags |= POOL_DISASSOCIATED; 4861 cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
3748 INIT_LIST_HEAD(&pool->worklist); 4862 pool->attrs->nice = std_nice[i++];
3749 INIT_LIST_HEAD(&pool->idle_list); 4863 pool->node = cpu_to_node(cpu);
3750 hash_init(pool->busy_hash);
3751
3752 init_timer_deferrable(&pool->idle_timer);
3753 pool->idle_timer.function = idle_worker_timeout;
3754 pool->idle_timer.data = (unsigned long)pool;
3755
3756 setup_timer(&pool->mayday_timer, pool_mayday_timeout,
3757 (unsigned long)pool);
3758
3759 mutex_init(&pool->assoc_mutex);
3760 ida_init(&pool->worker_ida);
3761 4864
3762 /* alloc pool ID */ 4865 /* alloc pool ID */
4866 mutex_lock(&wq_pool_mutex);
3763 BUG_ON(worker_pool_assign_id(pool)); 4867 BUG_ON(worker_pool_assign_id(pool));
4868 mutex_unlock(&wq_pool_mutex);
3764 } 4869 }
3765 } 4870 }
3766 4871
3767 /* create the initial worker */ 4872 /* create the initial worker */
3768 for_each_online_wq_cpu(cpu) { 4873 for_each_online_cpu(cpu) {
3769 struct worker_pool *pool; 4874 struct worker_pool *pool;
3770 4875
3771 for_each_std_worker_pool(pool, cpu) { 4876 for_each_cpu_worker_pool(pool, cpu) {
3772 struct worker *worker; 4877 pool->flags &= ~POOL_DISASSOCIATED;
4878 BUG_ON(create_and_start_worker(pool) < 0);
4879 }
4880 }
3773 4881
3774 if (cpu != WORK_CPU_UNBOUND) 4882 /* create default unbound wq attrs */
3775 pool->flags &= ~POOL_DISASSOCIATED; 4883 for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
4884 struct workqueue_attrs *attrs;
3776 4885
3777 worker = create_worker(pool); 4886 BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
3778 BUG_ON(!worker); 4887 attrs->nice = std_nice[i];
3779 spin_lock_irq(&pool->lock); 4888 unbound_std_wq_attrs[i] = attrs;
3780 start_worker(worker);
3781 spin_unlock_irq(&pool->lock);
3782 }
3783 } 4889 }
3784 4890
3785 system_wq = alloc_workqueue("events", 0, 0); 4891 system_wq = alloc_workqueue("events", 0, 0);
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 07650264ec15..84ab6e1dc6fb 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -32,14 +32,12 @@ struct worker {
32 struct list_head scheduled; /* L: scheduled works */ 32 struct list_head scheduled; /* L: scheduled works */
33 struct task_struct *task; /* I: worker task */ 33 struct task_struct *task; /* I: worker task */
34 struct worker_pool *pool; /* I: the associated pool */ 34 struct worker_pool *pool; /* I: the associated pool */
35 /* L: for rescuers */
35 /* 64 bytes boundary on 64bit, 32 on 32bit */ 36 /* 64 bytes boundary on 64bit, 32 on 32bit */
36 unsigned long last_active; /* L: last active timestamp */ 37 unsigned long last_active; /* L: last active timestamp */
37 unsigned int flags; /* X: flags */ 38 unsigned int flags; /* X: flags */
38 int id; /* I: worker id */ 39 int id; /* I: worker id */
39 40
40 /* for rebinding worker to CPU */
41 struct work_struct rebind_work; /* L: for busy worker */
42
43 /* used only by rescuers to point to the target workqueue */ 41 /* used only by rescuers to point to the target workqueue */
44 struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */ 42 struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */
45}; 43};
@@ -58,8 +56,7 @@ static inline struct worker *current_wq_worker(void)
58 * Scheduler hooks for concurrency managed workqueue. Only to be used from 56 * Scheduler hooks for concurrency managed workqueue. Only to be used from
59 * sched.c and workqueue.c. 57 * sched.c and workqueue.c.
60 */ 58 */
61void wq_worker_waking_up(struct task_struct *task, unsigned int cpu); 59void wq_worker_waking_up(struct task_struct *task, int cpu);
62struct task_struct *wq_worker_sleeping(struct task_struct *task, 60struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu);
63 unsigned int cpu);
64 61
65#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */ 62#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */