 Documentation/cpusets.txt | 178 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 153 insertions(+), 25 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index e2d9afc30d2d..ef83b2dd0cb3 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -14,7 +14,11 @@ CONTENTS:
   1.1 What are cpusets ?
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
-  1.4 How do I use cpusets ?
+  1.4 What are exclusive cpusets ?
+  1.5 What does notify_on_release do ?
+  1.6 What is a marker_pid ?
+  1.7 What is memory_pressure ?
+  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -49,29 +53,6 @@ its cpus_allowed vector, and the kernel page allocator will not
 allocate a page on a node that is not allowed in the requesting tasks
 mems_allowed vector.
 
-If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
-ancestor or descendent, may share any of the same CPUs or Memory Nodes.
-A cpuset that is cpu exclusive has a sched domain associated with it.
-The sched domain consists of all cpus in the current cpuset that are not
-part of any exclusive child cpusets.
-This ensures that the scheduler load balacing code only balances
-against the cpus that are in the sched domain as defined above and not
-all of the cpus in the system. This removes any overhead due to
-load balancing code trying to pull tasks outside of the cpu exclusive
-cpuset only to be prevented by the tasks' cpus_allowed mask.
-
-A cpuset that is mem_exclusive restricts kernel allocations for
-page, buffer and other data commonly shared by the kernel across
-multiple users. All cpusets, whether mem_exclusive or not, restrict
-allocations of memory for user space. This enables configuring a
-system so that several independent jobs can share common kernel
-data, such as file system pages, while isolating each jobs user
-allocation in its own cpuset. To do this, construct a large
-mem_exclusive cpuset to hold all the jobs, and construct child,
-non-mem_exclusive cpusets for each individual job. Only a small
-amount of typical kernel memory, such as requests from interrupt
-handlers, is allowed to be taken outside even a mem_exclusive cpuset.
-
 User level code may create and destroy cpusets by name in the cpuset
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@@ -196,6 +177,12 @@ containing the following files describing that cpuset:
  - cpu_exclusive flag: is cpu placement exclusive?
  - mem_exclusive flag: is memory placement exclusive?
  - tasks: list of tasks (by pid) attached to that cpuset
+ - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
+ - marker_pid: pid of user task in co-ordinated operation sequence
+ - memory_pressure: measure of how much paging pressure in cpuset
+
+In addition, only the root cpuset has the following file:
+ - memory_pressure_enabled flag: compute memory_pressure?
 
 New cpusets are created using the mkdir system call or shell
 command. The properties of a cpuset, such as its flags, allowed
@@ -229,7 +216,148 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
 to represent the cpuset hierarchy provides for a familiar permission
 and name space for cpusets, with a minimum of additional kernel code.
 
-1.4 How do I use cpusets ?
+
+1.4 What are exclusive cpusets ?
+--------------------------------
+
+If a cpuset is cpu or mem exclusive, no other cpuset, other than
+a direct ancestor or descendent, may share any of the same CPUs or
+Memory Nodes.
+
+A cpuset that is cpu_exclusive has a scheduler (sched) domain
+associated with it. The sched domain consists of all CPUs in the
+current cpuset that are not part of any exclusive child cpusets.
+This ensures that the scheduler load balancing code only balances
+against the CPUs that are in the sched domain as defined above and
+not all of the CPUs in the system. This removes any overhead due to
+load balancing code trying to pull tasks outside of the cpu_exclusive
+cpuset only to be prevented by the tasks' cpus_allowed mask.
+
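+For example, the following sketch (assuming the cpuset file system
+is mounted at /dev/cpuset, with a hypothetical child cpuset "big")
+gives "big" its own sched domain:
+
+    echo 1 > /dev/cpuset/big/cpu_exclusive
+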
+A cpuset that is mem_exclusive restricts kernel allocations for
+page, buffer and other data commonly shared by the kernel across
+multiple users. All cpusets, whether mem_exclusive or not, restrict
+allocations of memory for user space. This enables configuring a
+system so that several independent jobs can share common kernel data,
+such as file system pages, while isolating each job's user allocation
+in its own cpuset. To do this, construct a large mem_exclusive cpuset
+to hold all the jobs, and construct child, non-mem_exclusive cpusets
+for each individual job. Only a small amount of typical kernel memory,
+such as requests from interrupt handlers, is allowed to be taken
+outside even a mem_exclusive cpuset.
+
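+A sketch of that construction, with hypothetical cpuset names and
+resources, again assuming the cpuset file system is mounted at
+/dev/cpuset:
+
+    cd /dev/cpuset
+    mkdir jobs                    # one large cpuset to hold all jobs
+    echo 8-15 > jobs/cpus         # resources here are illustrative
+    echo 1-3 > jobs/mems
+    echo 1 > jobs/mem_exclusive   # jobs share kernel data below here
+    mkdir jobs/job1 jobs/job2     # per-job, not mem_exclusive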
+
+1.5 What does notify_on_release do ?
+------------------------------------
+
+If the notify_on_release flag is enabled (1) in a cpuset, then
+whenever the last task in the cpuset leaves (exits or attaches to
+some other cpuset) and the last child cpuset of that cpuset is
+removed, the kernel runs the command /sbin/cpuset_release_agent,
+supplying the pathname (relative to the mount point of the cpuset
+file system) of the abandoned cpuset. This enables automatic removal
+of abandoned cpusets. The default value of notify_on_release in the
+root cpuset at system boot is disabled (0). The default value of
+other cpusets at creation is the current value of their parent's
+notify_on_release setting.
+
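+A minimal sketch of such a release agent, assuming that removing the
+abandoned cpuset is the desired action and that the cpuset file
+system is mounted at /dev/cpuset:
+
+    #!/bin/sh
+    # $1 is the path of the released cpuset, relative to the
+    # mount point of the cpuset file system.
+    rmdir /dev/cpuset/"$1"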
+
+1.6 What is a marker_pid ?
+--------------------------
+
+The marker_pid helps manage cpuset changes safely from user space.
+
+The interface presented to user space for cpusets uses system wide
+numbering of CPUs and Memory Nodes. It is the responsibility of
+user level code, presumably in a library, to present cpuset-relative
+numbering to applications when that would be more useful to them.
+
+However, if a task is moved to a different cpuset, or if the 'cpus' or
+'mems' of a cpuset are changed, then such library code needs a way to
+detect that its cpuset-relative numbering has changed, when expressed
+using system wide numbering.
+
+The kernel cannot safely allow user code to lock kernel resources.
+The kernel could deliver out-of-band notice of cpuset changes by
+such mechanisms as signals or usermodehelper callbacks, however
+these cannot be synchronously delivered to library code linked into
+applications without intruding on the IPC mechanisms available to
+the app. The kernel could require user level code to do all the work,
+tracking the cpuset state before and during changes, to verify no
+unexpected change occurred, but this becomes an onerous task.
+
+The "marker_pid" cpuset field provides a simple way to make this task
+less onerous on user library code. A task writes its pid to a cpuset's
+"marker_pid" at the start of a sequence of queries and updates, and
+checks as it goes that the cpuset's marker_pid doesn't change. The
+pread(2) system call does a seek and read in a single call. If the
+marker_pid changes, the user code should retry the required sequence
+of operations.
+
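+For example, a shell sketch of such a sequence, for a hypothetical
+cpuset /dev/cpuset/myjob:
+
+    cs=/dev/cpuset/myjob
+    while :
+    do
+        echo $$ > $cs/marker_pid       # mark start of sequence
+        cpus=`cat $cs/cpus`            # the queries of interest ...
+        mems=`cat $cs/mems`
+        # if our mark survived, the values read are consistent
+        test "`cat $cs/marker_pid`" = "$$" && break
+    done
+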
+Anytime that a task modifies the "cpus" or "mems" of a cpuset, unless
+its pid is in the cpuset's marker_pid field, the kernel zeros this
+field.
+
+The above was inspired by the load linked and store conditional
+(ll/sc) instructions in the MIPS II instruction set.
+
+
+1.7 What is memory_pressure ?
+-----------------------------
+
+The memory_pressure of a cpuset provides a simple per-cpuset metric of
+the rate at which the tasks in a cpuset are attempting to free up
+in-use memory on the nodes of the cpuset, in order to satisfy
+additional memory requests.
+
+This enables batch managers monitoring jobs running in dedicated
+cpusets to efficiently detect what level of memory pressure that job
+is causing.
+
+This is useful both on tightly managed systems running a wide mix of
+submitted jobs, which may choose to terminate or re-prioritize jobs
+that are trying to use more memory than allowed on the nodes assigned
+to them, and with tightly coupled, long running, massively parallel
+scientific computing jobs that will dramatically fail to meet required
+performance goals if they start to use more memory than allowed to
+them.
+
+This mechanism provides a very economical way for the batch manager
+to monitor a cpuset for signs of memory pressure. It's up to the
+batch manager or other user code to decide what to do about it and
+take action.
+
+==> Unless this feature is enabled by writing "1" to the special file
+    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
+    code of __alloc_pages() for this metric reduces to simply noticing
+    that the cpuset_memory_pressure_enabled flag is zero. So only
+    systems that enable this feature will compute the metric.
+
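+For example, a system administrator can enable the metric at run time
+with:
+
+    echo 1 > /dev/cpuset/memory_pressure_enabled
+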
+Why a per-cpuset, running average:
+
+    Because this meter is per-cpuset, rather than per-task or mm,
+    the system load imposed by a batch scheduler monitoring this
+    metric is sharply reduced on large systems, because a scan of
+    the tasklist can be avoided on each set of queries.
+
+    Because this meter is a running average, instead of an
+    accumulating counter, a batch scheduler can detect memory
+    pressure with a single read, instead of having to read and
+    accumulate results for a period of time.
+
+    Because this meter is per-cpuset rather than per-task or mm,
+    the batch scheduler can obtain the key information, memory
+    pressure in a cpuset, with a single read, rather than having to
+    query and accumulate results over all the (dynamically changing)
+    set of tasks in the cpuset.
+
+A per-cpuset simple digital filter (requires a spinlock and 3 words
+of data per-cpuset) is kept, and updated by any task attached to that
+cpuset, if it enters the synchronous (direct) page reclaim code.
+
+A per-cpuset file provides an integer number representing the recent
+(half-life of 10 seconds) rate of direct page reclaims caused by
+the tasks in the cpuset, in units of reclaims attempted per second,
+times 1000.
+
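+For example, a reading can be taken and scaled back to attempted
+reclaims per second like this (cpuset name hypothetical):
+
+    p=`cat /dev/cpuset/myjob/memory_pressure`
+    # p is reclaims/sec times 1000, so e.g. p == 500 means the tasks
+    # in myjob recently averaged one reclaim attempt every 2 seconds
+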
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel