Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/DocBook/kernel-api.tmpl 2
-rw-r--r-- Documentation/IPMI.txt 25
-rw-r--r-- Documentation/accounting/cgroupstats.txt 27
-rw-r--r-- Documentation/atomic_ops.txt 14
-rw-r--r-- Documentation/cachetlb.txt 27
-rw-r--r-- Documentation/cgroups.txt 545
-rw-r--r-- Documentation/cpu-hotplug.txt 4
-rw-r--r-- Documentation/cpusets.txt 226
-rw-r--r-- Documentation/feature-removal-schedule.txt 42
-rw-r--r-- Documentation/input/input-programming.txt 15
-rw-r--r-- Documentation/kdump/kdump.txt 26
-rw-r--r-- Documentation/kernel-parameters.txt 17
-rw-r--r-- Documentation/markers.txt 81
-rw-r--r-- Documentation/memory-barriers.txt 14
-rw-r--r-- Documentation/mips/00-INDEX 2
-rw-r--r-- Documentation/mips/time.README 173
-rw-r--r-- Documentation/parport-lowlevel.txt 29
-rw-r--r-- Documentation/power/basic-pm-debugging.txt 4
-rw-r--r-- Documentation/power/freezing-of-tasks.txt 44
-rw-r--r-- Documentation/power/interface.txt 2
-rw-r--r-- Documentation/sound/oss/es1371 64
-rw-r--r-- Documentation/thinkpad-acpi.txt 25
22 files changed, 1019 insertions, 389 deletions
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index d3290c46af51..aa38cc5692a0 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -46,7 +46,7 @@
 
   <sect1><title>Atomic and pointer manipulation</title>
 !Iinclude/asm-x86/atomic_32.h
-!Iinclude/asm-x86/unaligned_32.h
+!Iinclude/asm-x86/unaligned.h
   </sect1>
 
   <sect1><title>Delaying, scheduling, and timer routines</title>
diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt
index 24dc3fcf1594..bc38283379f0 100644
--- a/Documentation/IPMI.txt
+++ b/Documentation/IPMI.txt
@@ -441,17 +441,20 @@ ACPI, and if none of those then a KCS device at the spec-specified
 0xca2. If you want to turn this off, set the "trydefaults" option to
 false.
 
-If you have high-res timers compiled into the kernel, the driver will
-use them to provide much better performance. Note that if you do not
-have high-res timers enabled in the kernel and you don't have
-interrupts enabled, the driver will run VERY slowly. Don't blame me,
+If your IPMI interface does not support interrupts and is a KCS or
+SMIC interface, the IPMI driver will start a kernel thread for the
+interface to help speed things up. This is a low-priority kernel
+thread that constantly polls the IPMI driver while an IPMI operation
+is in progress. The force_kipmid module parameter will allow the user to
+force this thread on or off. If you force it off and don't have
+interrupts, the driver will run VERY slowly. Don't blame me,
 these interfaces suck.
 
 The driver supports a hot add and remove of interfaces. This way,
 interfaces can be added or removed after the kernel is up and running.
-This is done using /sys/modules/ipmi_si/hotmod, which is a write-only
-parameter. You write a string to this interface. The string has the
-format:
+This is done using /sys/modules/ipmi_si/parameters/hotmod, which is a
+write-only parameter. You write a string to this interface. The string
+has the format:
    <op1>[:op2[:op3...]]
 The "op"s are:
    add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]
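
To illustrate the op format shown in the hunk above, here is a minimal C sketch that builds a single op string. The helper name format_hotmod_op is hypothetical and not part of the driver. It covers only the four mandatory fields, and printing the address in hex is an assumption about what the parameter parser accepts. Writing the resulting string to the hotmod parameter file is left out.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of ipmi_si): formats one op as
 *   add|remove,kcs|bt|smic,mem|i/o,<address>
 * Optional <opt> fields are not handled in this sketch.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_hotmod_op(char *buf, size_t len, const char *action,
                     const char *si_type, const char *addr_space,
                     unsigned long address)
{
    int n = snprintf(buf, len, "%s,%s,%s,0x%lx",
                     action, si_type, addr_space, address);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Several ops built this way could then be joined with ':' to form the full <op1>[:op2[:op3...]] string.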
@@ -581,9 +584,11 @@ The watchdog will panic and start a 120 second reset timeout if it
 gets a pre-action. During a panic or a reboot, the watchdog will
 start a 120 timer if it is running to make sure the reboot occurs.
 
-Note that if you use the NMI preaction for the watchdog, you MUST
-NOT use nmi watchdog mode 1. If you use the NMI watchdog, you
-must use mode 2.
+Note that if you use the NMI preaction for the watchdog, you MUST NOT
+use the nmi watchdog. There is no reasonable way to tell if an NMI
+comes from the IPMI controller, so it must assume that if it gets an
+otherwise unhandled NMI, it must be from IPMI and it will panic
+immediately.
 
 Once you open the watchdog timer, you must write a 'V' character to the
 device to close it, or the timer will not stop. This is a new semantic
diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt
new file mode 100644
index 000000000000..eda40fd39cad
--- /dev/null
+++ b/Documentation/accounting/cgroupstats.txt
@@ -0,0 +1,27 @@
1Control Groupstats is inspired by the discussion at
2http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
3suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
4
5Per cgroup statistics infrastructure re-uses code from the taskstats
6interface. A new set of cgroup operations are registered with commands
7and attributes specific to cgroups. It should be very easy to
8extend per cgroup statistics, by adding members to the cgroupstats
9structure.
10
11The current model for cgroupstats is a pull model; a push model (to post
12statistics on interesting events) should be very easy to add. Currently
13user space requests statistics by passing the cgroup path.
14Statistics about the state of all the tasks in the cgroup are returned to
15user space.
16
17NOTE: We currently rely on delay accounting for extracting information
18about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
19information will not be available.
20
21To extract cgroup statistics, a utility very similar to getdelays.c
22has been developed; sample output of the utility is shown below:
23
24~/balbir/cgroupstats # ./getdelays -C "/cgroup/a"
25sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
26~/balbir/cgroupstats # ./getdelays -C "/cgroup"
27sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
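
A script consuming this output might want the counters as numbers. The following C sketch parses one output line in the format shown above. The struct and function names are hypothetical and unrelated to the kernel's cgroupstats structure, and the sketch assumes the utility keeps exactly this line layout.

```c
#include <stdio.h>

/* Hypothetical container for the five counters printed by the utility. */
struct model_cgroup_counts {
    unsigned long sleeping, blocked, running, stopped, uninterruptible;
};

/* Returns 0 if all five counters were parsed, -1 otherwise. */
int parse_cgroup_counts(const char *line, struct model_cgroup_counts *out)
{
    int n = sscanf(line,
                   "sleeping %lu, blocked %lu, running %lu, "
                   "stopped %lu, uninterruptible %lu",
                   &out->sleeping, &out->blocked, &out->running,
                   &out->stopped, &out->uninterruptible);

    return n == 5 ? 0 : -1;
}
```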
diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index d46306fea230..f20c10c2858f 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -418,6 +418,20 @@ brothers:
  */
  smp_mb__after_clear_bit();
 
+There are two special bitops with lock barrier semantics (acquire/release,
+same as spinlocks). These operate in the same way as their non-_lock/unlock
+postfixed variants, except that they are to provide acquire/release semantics,
+respectively. This means they can be used for bit_spin_trylock and
+bit_spin_unlock type operations without specifying any more barriers.
+
+        int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
+        void clear_bit_unlock(unsigned long nr, unsigned long *addr);
+        void __clear_bit_unlock(unsigned long nr, unsigned long *addr);
+
+The __clear_bit_unlock version is non-atomic, however it still implements
+unlock barrier semantics. This can be useful if the lock itself is protecting
+the other bits in the word.
+
 Finally, there are non-atomic versions of the bitmask operations
 provided. They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
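
The acquire/release behaviour of the lock bitops added in this hunk can be approximated in user space with C11 atomics. The sketch below is only a model: the model_ prefixed names are hypothetical, atomic_ulong stands in for the kernel's unsigned long bitmap, and the real implementations are architecture specific.

```c
#include <stdatomic.h>

/*
 * User-space model only. Bit numbers must be smaller than the width
 * of unsigned long in this sketch.
 * Returns the previous value of the bit (0 means the lock was taken).
 */
int model_test_and_set_bit_lock(unsigned long nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    /* Acquire: later accesses cannot be reordered before this. */
    return (atomic_fetch_or_explicit(addr, mask,
                                     memory_order_acquire) & mask) != 0;
}

void model_clear_bit_unlock(unsigned long nr, atomic_ulong *addr)
{
    /* Release: earlier accesses cannot be reordered after this. */
    atomic_fetch_and_explicit(addr, ~(1UL << nr), memory_order_release);
}
```

A trylock then reduces to checking that model_test_and_set_bit_lock() returned 0, with no further barriers, which is the usage the paragraph describes.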
diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt
index 552cabac0608..da42ab414c48 100644
--- a/Documentation/cachetlb.txt
+++ b/Documentation/cachetlb.txt
@@ -87,30 +87,7 @@ changes occur:
 
         This is used primarily during fault processing.
 
-5) void flush_tlb_pgtables(struct mm_struct *mm,
-                           unsigned long start, unsigned long end)
-
-        The software page tables for address space 'mm' for virtual
-        addresses in the range 'start' to 'end-1' are being torn down.
-
-        Some platforms cache the lowest level of the software page tables
-        in a linear virtually mapped array, to make TLB miss processing
-        more efficient. On such platforms, since the TLB is caching the
-        software page table structure, it needs to be flushed when parts
-        of the software page table tree are unlinked/freed.
-
-        Sparc64 is one example of a platform which does this.
-
-        Usually, when munmap()'ing an area of user virtual address
-        space, the kernel leaves the page table parts around and just
-        marks the individual pte's as invalid. However, if very large
-        portions of the address space are unmapped, the kernel frees up
-        those portions of the software page tables to prevent potential
-        excessive kernel memory usage caused by erratic mmap/mmunmap
-        sequences. It is at these times that flush_tlb_pgtables will
-        be invoked.
-
-6) void update_mmu_cache(struct vm_area_struct *vma,
+5) void update_mmu_cache(struct vm_area_struct *vma,
                          unsigned long address, pte_t pte)
 
         At the end of every page fault, this routine is invoked to
@@ -123,7 +100,7 @@ changes occur:
         translations for software managed TLB configurations.
         The sparc64 port currently does this.
 
-7) void tlb_migrate_finish(struct mm_struct *mm)
+6) void tlb_migrate_finish(struct mm_struct *mm)
 
         This interface is called at the end of an explicit
         process migration. This interface provides a hook
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt
new file mode 100644
index 000000000000..98a26f81fa75
--- /dev/null
+++ b/Documentation/cgroups.txt
@@ -0,0 +1,545 @@
1 CGROUPS
2 -------
3
4Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
5
6Original copyright statements from cpusets.txt:
7Portions Copyright (C) 2004 BULL SA.
8Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
9Modified by Paul Jackson <pj@sgi.com>
10Modified by Christoph Lameter <clameter@sgi.com>
11
12CONTENTS:
13=========
14
151. Control Groups
16 1.1 What are cgroups ?
17 1.2 Why are cgroups needed ?
18 1.3 How are cgroups implemented ?
19 1.4 What does notify_on_release do ?
20 1.5 How do I use cgroups ?
212. Usage Examples and Syntax
22 2.1 Basic Usage
23 2.2 Attaching processes
243. Kernel API
25 3.1 Overview
26 3.2 Synchronization
27 3.3 Subsystem API
284. Questions
29
301. Control Groups
31=================
32
331.1 What are cgroups ?
34----------------------
35
36Control Groups provide a mechanism for aggregating/partitioning sets of
37tasks, and all their future children, into hierarchical groups with
38specialized behaviour.
39
40Definitions:
41
42A *cgroup* associates a set of tasks with a set of parameters for one
43or more subsystems.
44
45A *subsystem* is a module that makes use of the task grouping
46facilities provided by cgroups to treat groups of tasks in
47particular ways. A subsystem is typically a "resource controller" that
48schedules a resource or applies per-cgroup limits, but it may be
49anything that wants to act on a group of processes, e.g. a
50virtualization subsystem.
51
52A *hierarchy* is a set of cgroups arranged in a tree, such that
53every task in the system is in exactly one of the cgroups in the
54hierarchy, and a set of subsystems; each subsystem has system-specific
55state attached to each cgroup in the hierarchy. Each hierarchy has
56an instance of the cgroup virtual filesystem associated with it.
57
58At any one time there may be multiple active hierarchies of task
59cgroups. Each hierarchy is a partition of all tasks in the system.
60
61User level code may create and destroy cgroups by name in an
62instance of the cgroup virtual file system, specify and query to
63which cgroup a task is assigned, and list the task pids assigned to
64a cgroup. Those creations and assignments only affect the hierarchy
65associated with that instance of the cgroup file system.
66
67On their own, the only use for cgroups is for simple job
68tracking. The intention is that other subsystems hook into the generic
69cgroup support to provide new attributes for cgroups, such as
70accounting/limiting the resources which processes in a cgroup can
71access. For example, cpusets (see Documentation/cpusets.txt) allows
72you to associate a set of CPUs and a set of memory nodes with the
73tasks in each cgroup.
74
751.2 Why are cgroups needed ?
76----------------------------
77
78There are multiple efforts to provide process aggregations in the
79Linux kernel, mainly for resource tracking purposes. Such efforts
80include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
81namespaces. These all require the basic notion of a
82grouping/partitioning of processes, with newly forked processes ending
83in the same group (cgroup) as their parent process.
84
85The kernel cgroup patch provides the minimum essential kernel
86mechanisms required to efficiently implement such groups. It has
87minimal impact on the system fast paths, and provides hooks for
88specific subsystems such as cpusets to provide additional behaviour as
89desired.
90
91Multiple hierarchy support is provided to allow for situations where
92the division of tasks into cgroups is distinctly different for
93different subsystems - having parallel hierarchies allows each
94hierarchy to be a natural division of tasks, without having to handle
95complex combinations of tasks that would be present if several
96unrelated subsystems needed to be forced into the same tree of
97cgroups.
98
99At one extreme, each resource controller or subsystem could be in a
100separate hierarchy; at the other extreme, all subsystems
101would be attached to the same hierarchy.
102
103As an example of a scenario (originally proposed by vatsa@in.ibm.com)
104that can benefit from multiple hierarchies, consider a large
105university server with various users - students, professors, system
106tasks etc. The resource planning for this server could be along the
107following lines:
108
109 CPU : Top cpuset
110 / \
111 CPUSet1 CPUSet2
112 | |
113 (Profs) (Students)
114
115 In addition (system tasks) are attached to topcpuset (so
116 that they can run anywhere) with a limit of 20%
117
118 Memory : Professors (50%), students (30%), system (20%)
119
120 Disk : Prof (50%), students (30%), system (20%)
121
122 Network : WWW browsing (20%), Network File System (60%), others (20%)
123 / \
124 Prof (15%) students (5%)
125
126Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
127into NFS network class.
128
129At the same time firefox/lynx will share an appropriate CPU/Memory class
130depending on who launched it (prof/student).
131
132With the ability to classify tasks differently for different resources
133(by putting those resource subsystems in different hierarchies) then
134the admin can easily set up a script which receives exec notifications
135and depending on who is launching the browser he can
136
137 # echo browser_pid > /mnt/<restype>/<userclass>/tasks
138
139With only a single hierarchy, he now would potentially have to create
140a separate cgroup for every browser launched and associate it with
141the appropriate network and other resource classes. This may lead to
142proliferation of such cgroups.
143
144Also let's say that the administrator would like to give enhanced network
145access temporarily to a student's browser (since it is night and the user
146wants to do online gaming :) OR give one of the student's simulation
147apps enhanced CPU power.
148
149With the ability to write pids directly to resource classes, it's just a
150matter of:
151
152 # echo pid > /mnt/network/<new_class>/tasks
153 (after some time)
154 # echo pid > /mnt/network/<orig_class>/tasks
155
156Without this ability, he would have to split the cgroup into
157multiple separate ones and then associate the new cgroups with the
158new resource classes.
159
160
161
1621.3 How are cgroups implemented ?
163---------------------------------
164
165Control Groups extends the kernel as follows:
166
167 - Each task in the system has a reference-counted pointer to a
168 css_set.
169
170 - A css_set contains a set of reference-counted pointers to
171 cgroup_subsys_state objects, one for each cgroup subsystem
172 registered in the system. There is no direct link from a task to
173 the cgroup of which it's a member in each hierarchy, but this
174 can be determined by following pointers through the
175 cgroup_subsys_state objects. This is because accessing the
176 subsystem state is something that's expected to happen frequently
177 and in performance-critical code, whereas operations that require a
178 task's actual cgroup assignments (in particular, moving between
179 cgroups) are less common. A linked list runs through the cg_list
180 field of each task_struct using the css_set, anchored at
181 css_set->tasks.
182
183 - A cgroup hierarchy filesystem can be mounted for browsing and
184 manipulation from user space.
185
186 - You can list all the tasks (by pid) attached to any cgroup.
187
188The implementation of cgroups requires a few, simple hooks
189into the rest of the kernel, none in performance critical paths:
190
191 - in init/main.c, to initialize the root cgroups and initial
192 css_set at system boot.
193
194 - in fork and exit, to attach and detach a task from its css_set.
195
196In addition a new file system, of type "cgroup" may be mounted, to
197enable browsing and modifying the cgroups presently known to the
198kernel. When mounting a cgroup hierarchy, you may specify a
199comma-separated list of subsystems to mount as the filesystem mount
200options. By default, mounting the cgroup filesystem attempts to
201mount a hierarchy containing all registered subsystems.
202
203If an active hierarchy with exactly the same set of subsystems already
204exists, it will be reused for the new mount. If no existing hierarchy
205matches, and any of the requested subsystems are in use in an existing
206hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
207is activated, associated with the requested subsystems.
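
The three mount outcomes described above (reuse, -EBUSY, new hierarchy) can be summarized in a small user-space model. This is only a sketch: representing each hierarchy as a bitmask of subsystems and the function name model_pick_hierarchy are assumptions made for illustration, not the kernel's data structures.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Model of the mount-time decision. Each active hierarchy is a bitmask
 * of the subsystems bound to it (an assumption of this sketch).
 * Returns the index of an existing hierarchy to reuse, `count` if a new
 * hierarchy should be activated, or -EBUSY on a partial overlap.
 */
int model_pick_hierarchy(const unsigned long *active, size_t count,
                         unsigned long requested)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (active[i] == requested)
            return (int)i;          /* exact match: reuse it */

    for (i = 0; i < count; i++)
        if (active[i] & requested)
            return -EBUSY;          /* a requested subsystem is in use */

    return (int)count;              /* nothing in use: new hierarchy */
}
```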
208
209It's not currently possible to bind a new subsystem to an active
210cgroup hierarchy, or to unbind a subsystem from an active cgroup
211hierarchy. This may be possible in future, but is fraught with nasty
212error-recovery issues.
213
214When a cgroup filesystem is unmounted, if there are any
215child cgroups created below the top-level cgroup, that hierarchy
216will remain active even though unmounted; if there are no
217child cgroups then the hierarchy will be deactivated.
218
219No new system calls are added for cgroups - all support for
220querying and modifying cgroups is via this cgroup file system.
221
222Each task under /proc has an added file named 'cgroup' displaying,
223for each active hierarchy, the subsystem names and the cgroup name
224as the path relative to the root of the cgroup file system.
225
226Each cgroup is represented by a directory in the cgroup file system
227containing the following files describing that cgroup:
228
229 - tasks: list of tasks (by pid) attached to that cgroup
230 - notify_on_release flag: run /sbin/cgroup_release_agent on exit?
231
232Other subsystems such as cpusets may add additional files in each
233cgroup dir
234
235New cgroups are created using the mkdir system call or shell
236command. The properties of a cgroup, such as its flags, are
237modified by writing to the appropriate file in that cgroup's
238directory, as listed above.
239
240The named hierarchical structure of nested cgroups allows partitioning
241a large system into nested, dynamically changeable, "soft-partitions".
242
243The attachment of each task, automatically inherited at fork by any
244children of that task, to a cgroup allows organizing the work load
245on a system into related sets of tasks. A task may be re-attached to
246any other cgroup, if allowed by the permissions on the necessary
247cgroup file system directories.
248
249When a task is moved from one cgroup to another, it gets a new
250css_set pointer - if there's an already existing css_set with the
251desired collection of cgroups then that group is reused, else a new
252css_set is allocated. Note that the current implementation uses a
253linear search to locate an appropriate existing css_set, so isn't
254very efficient. A future version will use a hash table for better
255performance.
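
The reuse-or-allocate step described above can be sketched as a linear search over a table of sets. This is a user-space model under stated assumptions: the model_ names, the fixed-size table, and the use of opaque pointers for subsystem state are all hypothetical and do not mirror kernel/cgroup.c.

```c
#include <stddef.h>

#define MODEL_SUBSYS_COUNT 3
#define MODEL_MAX_SETS 16

/* One pointer per subsystem, plus a reference count. */
struct model_css_set {
    int refcount;
    const void *subsys[MODEL_SUBSYS_COUNT];
};

static struct model_css_set model_sets[MODEL_MAX_SETS];
static size_t model_nsets;

static int model_same_states(const struct model_css_set *cg,
                             const void *const *wanted)
{
    for (size_t i = 0; i < MODEL_SUBSYS_COUNT; i++)
        if (cg->subsys[i] != wanted[i])
            return 0;
    return 1;
}

/* Linear search for a matching set; allocate a new one if none exists. */
struct model_css_set *model_find_css_set(const void *const *wanted)
{
    struct model_css_set *cg;

    for (size_t i = 0; i < model_nsets; i++) {
        if (model_same_states(&model_sets[i], wanted)) {
            model_sets[i].refcount++;   /* reuse the existing set */
            return &model_sets[i];
        }
    }
    if (model_nsets == MODEL_MAX_SETS)
        return NULL;                    /* table full in this model */
    cg = &model_sets[model_nsets++];
    for (size_t i = 0; i < MODEL_SUBSYS_COUNT; i++)
        cg->subsys[i] = wanted[i];
    cg->refcount = 1;
    return cg;
}
```

The cost of the lookup grows with the number of distinct sets, which is why the text anticipates replacing the search with a hash table.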
256
257To allow access from a cgroup to the css_sets (and hence tasks)
258that comprise it, a set of cg_cgroup_link objects form a lattice;
259each cg_cgroup_link is linked into a list of cg_cgroup_links for
260a single cgroup on its cont_link_list field, and a list of
261cg_cgroup_links for a single css_set on its cg_link_list.
262
263Thus the set of tasks in a cgroup can be listed by iterating over
264each css_set that references the cgroup, and sub-iterating over
265each css_set's task set.
266
267The use of a Linux virtual file system (vfs) to represent the
268cgroup hierarchy provides for a familiar permission and name space
269for cgroups, with a minimum of additional kernel code.
270
2711.4 What does notify_on_release do ?
272------------------------------------
273
274*** notify_on_release is disabled in the current patch set. It will be
275*** reactivated in a future patch in a less-intrusive manner
276
277If the notify_on_release flag is enabled (1) in a cgroup, then
278whenever the last task in the cgroup leaves (exits or attaches to
279some other cgroup) and the last child cgroup of that cgroup
280is removed, then the kernel runs the command specified by the contents
281of the "release_agent" file in that hierarchy's root directory,
282supplying the pathname (relative to the mount point of the cgroup
283file system) of the abandoned cgroup. This enables automatic
284removal of abandoned cgroups. The default value of
285notify_on_release in the root cgroup at system boot is disabled
286(0). The default value of other cgroups at creation is the current
287value of their parent's notify_on_release setting. The default value of
288a cgroup hierarchy's release_agent path is empty.
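
The conditions above can be written down as a small predicate. The sketch below is a user-space model with hypothetical names and fields. It captures only the rule as stated in this section (flag set, no tasks, no child cgroups, flag inherited from the parent at creation) and is meant to be evaluated at the moment a task leaves or a child cgroup is removed.

```c
#include <stdbool.h>

/* Hypothetical per-cgroup bookkeeping for this model. */
struct model_cgroup {
    bool notify_on_release;
    unsigned int nr_tasks;
    unsigned int nr_children;
};

/* New cgroups start with the parent's current flag value. */
struct model_cgroup model_cgroup_create(const struct model_cgroup *parent)
{
    struct model_cgroup child = { parent->notify_on_release, 0, 0 };

    return child;
}

/* True when the release agent would be run for this cgroup. */
bool model_should_run_release_agent(const struct model_cgroup *cg)
{
    return cg->notify_on_release &&
           cg->nr_tasks == 0 && cg->nr_children == 0;
}
```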
289
2901.5 How do I use cgroups ?
291--------------------------
292
293To start a new job that is to be contained within a cgroup, using
294the "cpuset" cgroup subsystem, the steps are something like:
295
296 1) mkdir /dev/cgroup
297 2) mount -t cgroup -ocpuset cpuset /dev/cgroup
298 3) Create the new cgroup by doing mkdir's and write's (or echo's) in
299 the /dev/cgroup virtual file system.
300 4) Start a task that will be the "founding father" of the new job.
301 5) Attach that task to the new cgroup by writing its pid to the
302 /dev/cgroup tasks file for that cgroup.
303 6) fork, exec or clone the job tasks from this founding father task.
304
305For example, the following sequence of commands will set up a cgroup
306named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
307and then start a subshell 'sh' in that cgroup:
308
309 mount -t cgroup cpuset -ocpuset /dev/cgroup
310 cd /dev/cgroup
311 mkdir Charlie
312 cd Charlie
313 /bin/echo 2-3 > cpus
314 /bin/echo 1 > mems
315 /bin/echo $$ > tasks
316 sh
317 # The subshell 'sh' is now running in cgroup Charlie
318 # The next line should display '/Charlie'
319 cat /proc/self/cgroup
320
3212. Usage Examples and Syntax
322============================
323
3242.1 Basic Usage
325---------------
326
327Creating, modifying, using the cgroups can be done through the cgroup
328virtual filesystem.
329
330To mount a cgroup hierarchy with all available subsystems, type:
331# mount -t cgroup xxx /dev/cgroup
332
333The "xxx" is not interpreted by the cgroup code, but will appear in
334/proc/mounts so may be any useful identifying string that you like.
335
336To mount a cgroup hierarchy with just the cpuset and numtasks
337subsystems, type:
338# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
339
340To change the set of subsystems bound to a mounted hierarchy, just
341remount with different options:
342
343# mount -o remount,cpuset,ns /dev/cgroup
344
345Note that changing the set of subsystems is currently only supported
346when the hierarchy consists of a single (root) cgroup. Supporting
347the ability to arbitrarily bind/unbind subsystems from an existing
348cgroup hierarchy is intended to be implemented in the future.
349
350Then under /dev/cgroup you can find a tree that corresponds to the
351tree of the cgroups in the system. For instance, /dev/cgroup
352is the cgroup that holds the whole system.
353
354If you want to create a new cgroup under /dev/cgroup:
355# cd /dev/cgroup
356# mkdir my_cgroup
357
358Now you want to do something with this cgroup.
359# cd my_cgroup
360
361In this directory you can find several files:
362# ls
363notify_on_release release_agent tasks
364(plus whatever files are added by the attached subsystems)
365
366Now attach your shell to this cgroup:
367# /bin/echo $$ > tasks
368
369You can also create cgroups inside your cgroup by using mkdir in this
370directory.
371# mkdir my_sub_cs
372
373To remove a cgroup, just use rmdir:
374# rmdir my_sub_cs
375
376This will fail if the cgroup is in use (has cgroups inside, or
377has processes attached, or is held alive by other subsystem-specific
378reference).
379
3802.2 Attaching processes
381-----------------------
382
383# /bin/echo PID > tasks
384
385Note that it is PID, not PIDs. You can only attach ONE task at a time.
386If you have several tasks to attach, you have to do it one after another:
387
388# /bin/echo PID1 > tasks
389# /bin/echo PID2 > tasks
390 ...
391# /bin/echo PIDn > tasks
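
A program can do the same thing as the /bin/echo commands above by issuing one write() per pid and checking the result of each call. The following C sketch shows one way to do that. The function name attach_pid is hypothetical, and the caller is assumed to pass the path of an existing tasks file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Writes exactly one pid to the given tasks file with a single write()
 * call. Returns 0 on success or a negative errno value.
 */
int attach_pid(const char *tasks_path, pid_t pid)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    int err = 0;
    ssize_t n;
    int fd;

    fd = open(tasks_path, O_WRONLY);
    if (fd < 0)
        return -errno;

    n = write(fd, buf, (size_t)len);
    if (n < 0)
        err = -errno;       /* the attach was rejected */
    else if (n != len)
        err = -EIO;         /* short write: treat as failure */

    if (close(fd) < 0 && err == 0)
        err = -errno;
    return err;
}
```

Attaching several tasks then means calling attach_pid() once per pid, so that each attach gets its own error code.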
392
3933. Kernel API
394=============
395
3963.1 Overview
397------------
398
399Each kernel subsystem that wants to hook into the generic cgroup
400system needs to create a cgroup_subsys object. This contains
401various methods, which are callbacks from the cgroup system, along
402with a subsystem id which will be assigned by the cgroup system.
403
404Other fields in the cgroup_subsys object include:
405
406- subsys_id: a unique array index for the subsystem, indicating which
407 entry in cgroup->subsys[] this subsystem should be
408 managing. Initialized by cgroup_register_subsys(); prior to this
409 it should be initialized to -1
410
411- hierarchy: an index indicating which hierarchy, if any, this
412 subsystem is currently attached to. If this is -1, then the
413 subsystem is not attached to any hierarchy, and all tasks should be
414 considered to be members of the subsystem's top_cgroup. It should
415 be initialized to -1.
416
417- name: should be initialized to a unique subsystem name prior to
418 calling cgroup_register_subsystem. Should be no longer than
419 MAX_CGROUP_TYPE_NAMELEN
420
421Each cgroup object created by the system has an array of pointers,
422indexed by subsystem id; this pointer is entirely managed by the
423subsystem; the generic cgroup code will never touch this pointer.
424
4253.2 Synchronization
426-------------------
427
428There is a global mutex, cgroup_mutex, used by the cgroup
429system. This should be taken by anything that wants to modify a
430cgroup. It may also be taken to prevent cgroups from being
431modified, but more specific locks may be more appropriate in that
432situation.
433
434See kernel/cgroup.c for more details.
435
436Subsystems can take/release the cgroup_mutex via the functions
437cgroup_lock()/cgroup_unlock(), and can
438take/release the callback_mutex via the functions
439cgroup_lock()/cgroup_unlock().
440
441Accessing a task's cgroup pointer may be done in the following ways:
442- while holding cgroup_mutex
443- while holding the task's alloc_lock (via task_lock())
444- inside an rcu_read_lock() section via rcu_dereference()
445
4463.3 Subsystem API
447-----------------
448
449Each subsystem should:
450
451- add an entry in linux/cgroup_subsys.h
452- define a cgroup_subsys object called <name>_subsys
453
454Each subsystem may export the following methods. The only mandatory
455methods are create/destroy. Any others that are null are presumed to
456be successful no-ops.
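
The rule that a null method counts as a successful no-op can be pictured with a table of function pointers. The sketch below is a user-space model: the model_ types and functions are hypothetical stand-ins, not the kernel's cgroup_subsys, and only the can_attach method is modelled.

```c
#include <errno.h>
#include <stddef.h>

struct model_cgroup;    /* opaque in this sketch */
struct model_task;      /* opaque in this sketch */

/* Hypothetical stand-in for a subsystem's method table. */
struct model_subsys {
    const char *name;
    int (*can_attach)(struct model_subsys *ss, struct model_cgroup *cont,
                      struct model_task *task);
};

/* Example callback: a subsystem that refuses every attach. */
int model_deny_all(struct model_subsys *ss, struct model_cgroup *cont,
                   struct model_task *task)
{
    (void)ss; (void)cont; (void)task;
    return -EPERM;
}

/* A NULL can_attach is treated as a successful no-op (returns 0). */
int model_call_can_attach(struct model_subsys *ss,
                          struct model_cgroup *cont,
                          struct model_task *task)
{
    return ss->can_attach ? ss->can_attach(ss, cont, task) : 0;
}
```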
457
458struct cgroup_subsys_state *create(struct cgroup *cont)
459LL=cgroup_mutex
460
461Called to create a subsystem state object for a cgroup. The
462subsystem should allocate its subsystem state object for the passed
463cgroup, returning a pointer to the new object on success or a
464negative error code. On success, the subsystem pointer should point to
465a structure of type cgroup_subsys_state (typically embedded in a
466larger subsystem-specific object), which will be initialized by the
467cgroup system. Note that this will be called at initialization to
468create the root subsystem state for this subsystem; this case can be
469identified by the passed cgroup object having a NULL parent (since
470it's the root of the hierarchy) and may be an appropriate place for
471initialization code.
472
473void destroy(struct cgroup *cont)
474LL=cgroup_mutex
475
476The cgroup system is about to destroy the passed cgroup; the
477subsystem should do any necessary cleanup
478
479int can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
480 struct task_struct *task)
481LL=cgroup_mutex
482
483Called prior to moving a task into a cgroup; if the subsystem
484returns an error, this will abort the attach operation. If a NULL
485task is passed, then a successful result indicates that *any*
486unspecified task can be moved into the cgroup. Note that this isn't
487called on a fork. If this method returns 0 (success) then this should
488remain valid while the caller holds cgroup_mutex.
489
490void attach(struct cgroup_subsys *ss, struct cgroup *cont,
491 struct cgroup *old_cont, struct task_struct *task)
492LL=cgroup_mutex
493
494
495Called after the task has been attached to the cgroup, to allow any
496post-attachment activity that requires memory allocations or blocking.
497
498void fork(struct cgroup_subsys *ss, struct task_struct *task)
499LL=callback_mutex, maybe read_lock(tasklist_lock)
500
501Called when a task is forked into a cgroup. Also called during
502registration for all existing tasks.
503
504void exit(struct cgroup_subsys *ss, struct task_struct *task)
505LL=callback_mutex
506
507Called during task exit
508
509int populate(struct cgroup_subsys *ss, struct cgroup *cont)
510LL=none
511
512Called after creation of a cgroup to allow a subsystem to populate
513the cgroup directory with file entries. The subsystem should make
514calls to cgroup_add_file() with objects of type cftype (see
515include/linux/cgroup.h for details). Note that although this
516method can return an error code, the error code is currently not
517always handled well.
518
519void post_clone(struct cgroup_subsys *ss, struct cgroup *cont)
520
521Called at the end of cgroup_clone() to do any parameter
522initialization which might be required before a task could attach. For
523example in cpusets, no task may attach before 'cpus' and 'mems' are set
524up.
525
526void bind(struct cgroup_subsys *ss, struct cgroup *root)
527LL=callback_mutex
528
529Called when a cgroup subsystem is rebound to a different hierarchy
530and root cgroup. Currently this will only involve movement between
531the default hierarchy (which never has sub-cgroups) and a hierarchy
532that is being created/destroyed (and hence has no sub-cgroups).
533
5344. Questions
535============
536
537Q: what's up with this '/bin/echo' ?
538A: bash's builtin 'echo' command does not check calls to write() against
539 errors. If you use it in the cgroup file system, you won't be
540 able to tell whether a command succeeded or failed.
541
542Q: When I attach processes, only the first one on the line really gets attached!
543A: We can only return one error code per call to write(). So you should
544 put only ONE pid in each write().
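
The two answers above can be illustrated with a small userspace helper. This is a minimal sketch, not part of the cgroup code: the function name attach_one_pid() is hypothetical, and the path of the "tasks" file is supplied by the caller. It writes a single pid per write() call and checks the result, which is what bash's builtin echo does not do.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly one pid to a cgroup "tasks" file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int attach_one_pid(const char *tasks_path, pid_t pid)
{
	char buf[32];
	int len, fd, err = 0;
	ssize_t written;

	len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	assert(len > 0 && (size_t)len < sizeof(buf));

	fd = open(tasks_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* One pid per write(): only one error code can come back per call. */
	written = write(fd, buf, (size_t)len);
	if (written < 0)
		err = -errno;
	else if (written != len)
		err = -EIO;

	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}
```

To attach several tasks, call the helper once per pid and handle each return value separately.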
545
diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index b6d24c22274b..a741f658a3c9 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -220,7 +220,9 @@ A: The following happen, listed in no particular order :-)
220 CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the 220 CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the
221 CPU is being offlined while tasks are frozen due to a suspend operation in 221 CPU is being offlined while tasks are frozen due to a suspend operation in
222 progress 222 progress
223- All process is migrated away from this outgoing CPU to a new CPU 223- All processes are migrated away from this outgoing CPU to new CPUs.
224 The new CPU is chosen from each process' current cpuset, which may be
225 a subset of all online CPUs.
224- All interrupts targeted to this CPU is migrated to a new CPU 226- All interrupts targeted to this CPU is migrated to a new CPU
225- timers/bottom half/task lets are also migrated to a new CPU 227- timers/bottom half/task lets are also migrated to a new CPU
226- Once all services are migrated, kernel calls an arch specific routine 228- Once all services are migrated, kernel calls an arch specific routine
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ec9de6917f01..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
7Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. 7Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
8Modified by Paul Jackson <pj@sgi.com> 8Modified by Paul Jackson <pj@sgi.com>
9Modified by Christoph Lameter <clameter@sgi.com> 9Modified by Christoph Lameter <clameter@sgi.com>
10Modified by Paul Menage <menage@google.com>
10 11
11CONTENTS: 12CONTENTS:
12========= 13=========
@@ -16,9 +17,9 @@ CONTENTS:
16 1.2 Why are cpusets needed ? 17 1.2 Why are cpusets needed ?
17 1.3 How are cpusets implemented ? 18 1.3 How are cpusets implemented ?
18 1.4 What are exclusive cpusets ? 19 1.4 What are exclusive cpusets ?
19 1.5 What does notify_on_release do ? 20 1.5 What is memory_pressure ?
20 1.6 What is memory_pressure ? 21 1.6 What is memory spread ?
21 1.7 What is memory spread ? 22 1.7 What is sched_load_balance ?
22 1.8 How do I use cpusets ? 23 1.8 How do I use cpusets ?
232. Usage Examples and Syntax 242. Usage Examples and Syntax
24 2.1 Basic Usage 25 2.1 Basic Usage
@@ -44,18 +45,19 @@ hierarchy visible in a virtual file system. These are the essential
44hooks, beyond what is already present, required to manage dynamic 45hooks, beyond what is already present, required to manage dynamic
45job placement on large systems. 46job placement on large systems.
46 47
47Each task has a pointer to a cpuset. Multiple tasks may reference 48Cpusets use the generic cgroup subsystem described in
48the same cpuset. Requests by a task, using the sched_setaffinity(2) 49Documentation/cgroups.txt.
49system call to include CPUs in its CPU affinity mask, and using the 50
50mbind(2) and set_mempolicy(2) system calls to include Memory Nodes 51Requests by a task, using the sched_setaffinity(2) system call to
51in its memory policy, are both filtered through that tasks cpuset, 52include CPUs in its CPU affinity mask, and using the mbind(2) and
52filtering out any CPUs or Memory Nodes not in that cpuset. The 53set_mempolicy(2) system calls to include Memory Nodes in its memory
53scheduler will not schedule a task on a CPU that is not allowed in 54policy, are both filtered through that tasks cpuset, filtering out any
54its cpus_allowed vector, and the kernel page allocator will not 55CPUs or Memory Nodes not in that cpuset. The scheduler will not
55allocate a page on a node that is not allowed in the requesting tasks 56schedule a task on a CPU that is not allowed in its cpus_allowed
56mems_allowed vector. 57vector, and the kernel page allocator will not allocate a page on a
57 58node that is not allowed in the requesting tasks mems_allowed vector.
58User level code may create and destroy cpusets by name in the cpuset 59
60User level code may create and destroy cpusets by name in the cgroup
59virtual file system, manage the attributes and permissions of these 61virtual file system, manage the attributes and permissions of these
60cpusets and which CPUs and Memory Nodes are assigned to each cpuset, 62cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
61specify and query to which cpuset a task is assigned, and list the 63specify and query to which cpuset a task is assigned, and list the
@@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows:
115 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the 117 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
116 kernel. 118 kernel.
117 - Each task in the system is attached to a cpuset, via a pointer 119 - Each task in the system is attached to a cpuset, via a pointer
118 in the task structure to a reference counted cpuset structure. 120 in the task structure to a reference counted cgroup structure.
119 - Calls to sched_setaffinity are filtered to just those CPUs 121 - Calls to sched_setaffinity are filtered to just those CPUs
120 allowed in that tasks cpuset. 122 allowed in that tasks cpuset.
121 - Calls to mbind and set_mempolicy are filtered to just 123 - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths:
145 - in page_alloc.c, to restrict memory to allowed nodes. 147 - in page_alloc.c, to restrict memory to allowed nodes.
146 - in vmscan.c, to restrict page recovery to the current cpuset. 148 - in vmscan.c, to restrict page recovery to the current cpuset.
147 149
148In addition a new file system, of type "cpuset" may be mounted, 150You should mount the "cgroup" filesystem type in order to enable
149typically at /dev/cpuset, to enable browsing and modifying the cpusets 151browsing and modifying the cpusets presently known to the kernel. No
150presently known to the kernel. No new system calls are added for 152new system calls are added for cpusets - all support for querying and
151cpusets - all support for querying and modifying cpusets is via 153modifying cpusets is via this cpuset file system.
152this cpuset file system.
153
154Each task under /proc has an added file named 'cpuset', displaying
155the cpuset name, as the path relative to the root of the cpuset file
156system.
157 154
158The /proc/<pid>/status file for each task has two added lines, 155The /proc/<pid>/status file for each task has two added lines,
159displaying the tasks cpus_allowed (on which CPUs it may be scheduled) 156displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -163,16 +160,15 @@ in the format seen in the following example:
163 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff 160 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
164 Mems_allowed: ffffffff,ffffffff 161 Mems_allowed: ffffffff,ffffffff
165 162
166Each cpuset is represented by a directory in the cpuset file system 163Each cpuset is represented by a directory in the cgroup file system
167containing the following files describing that cpuset: 164containing (on top of the standard cgroup files) the following
165files describing that cpuset:
168 166
169 - cpus: list of CPUs in that cpuset 167 - cpus: list of CPUs in that cpuset
170 - mems: list of Memory Nodes in that cpuset 168 - mems: list of Memory Nodes in that cpuset
171 - memory_migrate flag: if set, move pages to cpusets nodes 169 - memory_migrate flag: if set, move pages to cpusets nodes
172 - cpu_exclusive flag: is cpu placement exclusive? 170 - cpu_exclusive flag: is cpu placement exclusive?
173 - mem_exclusive flag: is memory placement exclusive? 171 - mem_exclusive flag: is memory placement exclusive?
174 - tasks: list of tasks (by pid) attached to that cpuset
175 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
176 - memory_pressure: measure of how much paging pressure in cpuset 172 - memory_pressure: measure of how much paging pressure in cpuset
177 173
178In addition, the root cpuset only has the following file: 174In addition, the root cpuset only has the following file:
@@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken
237outside even a mem_exclusive cpuset. 233outside even a mem_exclusive cpuset.
238 234
239 235
2401.5 What does notify_on_release do ? 2361.5 What is memory_pressure ?
241------------------------------------
242
243If the notify_on_release flag is enabled (1) in a cpuset, then whenever
244the last task in the cpuset leaves (exits or attaches to some other
245cpuset) and the last child cpuset of that cpuset is removed, then
246the kernel runs the command /sbin/cpuset_release_agent, supplying the
247pathname (relative to the mount point of the cpuset file system) of the
248abandoned cpuset. This enables automatic removal of abandoned cpusets.
249The default value of notify_on_release in the root cpuset at system
250boot is disabled (0). The default value of other cpusets at creation
251is the current value of their parents notify_on_release setting.
252
253
2541.6 What is memory_pressure ?
255----------------------------- 237-----------------------------
256The memory_pressure of a cpuset provides a simple per-cpuset metric 238The memory_pressure of a cpuset provides a simple per-cpuset metric
257of the rate that the tasks in a cpuset are attempting to free up in 239of the rate that the tasks in a cpuset are attempting to free up in
@@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
308times 1000. 290times 1000.
309 291
310 292
3111.7 What is memory spread ? 2931.6 What is memory spread ?
312--------------------------- 294---------------------------
313There are two boolean flag files per cpuset that control where the 295There are two boolean flag files per cpuset that control where the
314kernel allocates pages for the file system buffers and related in 296kernel allocates pages for the file system buffers and related in
@@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the
378data set, the memory allocation across the nodes in the jobs cpuset 360data set, the memory allocation across the nodes in the jobs cpuset
379can become very uneven. 361can become very uneven.
380 362
3631.7 What is sched_load_balance ?
364--------------------------------
365
366The kernel scheduler (kernel/sched.c) automatically load balances
367tasks. If one CPU is underutilized, kernel code running on that
368CPU will look for tasks on other more overloaded CPUs and move those
369tasks to itself, within the constraints of such placement mechanisms
370as cpusets and sched_setaffinity.
371
372The algorithmic cost of load balancing and its impact on key shared
373kernel data structures such as the task list increases more than
374linearly with the number of CPUs being balanced. So the scheduler
375has support to partition the systems CPUs into a number of sched
376domains such that it only load balances within each sched domain.
377Each sched domain covers some subset of the CPUs in the system;
378no two sched domains overlap; some CPUs might not be in any sched
379domain and hence won't be load balanced.
380
381Put simply, it costs less to balance between two smaller sched domains
382than one big one, but doing so means that overloads in one of the
383two domains won't be load balanced to the other one.
384
385By default, there is one sched domain covering all CPUs, except those
386marked isolated using the kernel boot time "isolcpus=" argument.
387
388This default load balancing across all CPUs is not well suited for
389the following two situations:
390 1) On large systems, load balancing across many CPUs is expensive.
391 If the system is managed using cpusets to place independent jobs
392 on separate sets of CPUs, full load balancing is unnecessary.
393 2) Systems supporting realtime on some CPUs need to minimize
394 system overhead on those CPUs, including avoiding task load
395 balancing if that is not needed.
396
397When the per-cpuset flag "sched_load_balance" is enabled (the default
398setting), it requests that all the CPUs in that cpusets allowed 'cpus'
399be contained in a single sched domain, ensuring that load balancing
400can move a task (not otherwise pinned, as by sched_setaffinity)
401from any CPU in that cpuset to any other.
402
403When the per-cpuset flag "sched_load_balance" is disabled, then the
404scheduler will avoid load balancing across the CPUs in that cpuset,
405--except-- in so far as is necessary because some overlapping cpuset
406has "sched_load_balance" enabled.
407
408So, for example, if the top cpuset has the flag "sched_load_balance"
409enabled, then the scheduler will have one sched domain covering all
410CPUs, and the setting of the "sched_load_balance" flag in any other
411cpusets won't matter, as we're already fully load balancing.
412
413Therefore in the above two situations, the top cpuset flag
414"sched_load_balance" should be disabled, and only some of the smaller,
415child cpusets have this flag enabled.
416
417When doing this, you don't usually want to leave any unpinned tasks in
418the top cpuset that might use non-trivial amounts of CPU, as such tasks
419may be artificially constrained to some subset of CPUs, depending on
420the particulars of this flag setting in descendent cpusets. Even if
421such a task could use spare CPU cycles in some other CPUs, the kernel
422scheduler might not consider the possibility of load balancing that
423task to that underused CPU.
424
425Of course, tasks pinned to a particular CPU can be left in a cpuset
426that disables "sched_load_balance" as those tasks aren't going anywhere
427else anyway.
428
429There is an impedance mismatch here, between cpusets and sched domains.
430Cpusets are hierarchical and nest. Sched domains are flat; they don't
431overlap and each CPU is in at most one sched domain.
432
433It is necessary for sched domains to be flat because load balancing
434across partially overlapping sets of CPUs would risk unstable dynamics
435that would be beyond our understanding. So if each of two partially
436overlapping cpusets enables the flag 'sched_load_balance', then we
437form a single sched domain that is a superset of both. We won't move
438a task to a CPU outside its cpuset, but the scheduler load balancing
439code might waste some compute cycles considering that possibility.
440
441This mismatch is why there is not a simple one-to-one relation
442between which cpusets have the flag "sched_load_balance" enabled,
443and the sched domain configuration. If a cpuset enables the flag, it
444will get balancing across all its CPUs, but if it disables the flag,
445it will only be assured of no load balancing if no other overlapping
446cpuset enables the flag.
447
448If two cpusets have partially overlapping 'cpus' allowed, and only
449one of them has this flag enabled, then the other may find its
450tasks only partially load balanced, just on the overlapping CPUs.
451This is just the general case of the top_cpuset example given a few
452paragraphs above. In the general case, as in the top cpuset case,
453don't leave tasks that might use non-trivial amounts of CPU in
454such partially load balanced cpusets, as they may be artificially
455constrained to some subset of the CPUs allowed to them, for lack of
456load balancing to the other CPUs.
457
4581.7.1 sched_load_balance implementation details.
459------------------------------------------------
460
461The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
462to most cpuset flags.) When enabled for a cpuset, the kernel will
463ensure that it can load balance across all the CPUs in that cpuset
464(makes sure that all the CPUs in the cpus_allowed of that cpuset are
465in the same sched domain.)
466
467If two overlapping cpusets both have 'sched_load_balance' enabled,
468then they will be (must be) both in the same sched domain.
469
470If, as is the default, the top cpuset has 'sched_load_balance' enabled,
471then by the above that means there is a single sched domain covering
472the whole system, regardless of any other cpuset settings.
473
474The kernel commits to user space that it will avoid load balancing
475where it can. It will pick as fine a granularity partition of sched
476domains as it can while still providing load balancing for any set
477of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
478
479The internal kernel cpuset to scheduler interface passes from the
480cpuset code to the scheduler code a partition of the load balanced
481CPUs in the system. This partition is a set of subsets (represented
482as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
483the CPUs that must be load balanced.
484
485Whenever the 'sched_load_balance' flag changes, or CPUs come or go
486from a cpuset with this flag enabled, or a cpuset with this flag
487enabled is removed, the cpuset code builds a new such partition and
488passes it to the scheduler sched domain setup code, to have the sched
489domains rebuilt as necessary.
490
491This partition exactly defines what sched domains the scheduler should
492setup - one sched domain for each element (cpumask_t) in the partition.
493
494The scheduler remembers the currently active sched domain partitions.
495When the scheduler routine partition_sched_domains() is invoked from
496the cpuset code to update these sched domains, it compares the new
497partition requested with the current, and updates its sched domains,
498removing the old and adding the new, for each change.
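
The partition described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel's code: CPU sets are modelled as 64-bit masks rather than cpumask_t, and build_partition() is a hypothetical name. The input is the 'cpus' mask of every cpuset that has 'sched_load_balance' enabled. Overlapping masks are merged until the remaining masks are pairwise disjoint, and each remaining mask stands for one sched domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge overlapping CPU masks in place until the survivors are pairwise
 * disjoint. Returns how many masks remain at the front of the array.
 */
static size_t build_partition(uint64_t *masks, size_t n)
{
	size_t i, j;
	int merged;

	assert(masks != NULL || n == 0);

	/* Drop empty masks: a cpuset with no CPUs needs no sched domain. */
	for (i = 0; i < n; ) {
		if (masks[i] == 0)
			masks[i] = masks[--n];
		else
			i++;
	}

	/* Repeat until a full pass finds no overlap (merging is transitive). */
	do {
		merged = 0;
		for (i = 0; i < n; i++) {
			for (j = i + 1; j < n; ) {
				if (masks[i] & masks[j]) {
					masks[i] |= masks[j];
					masks[j] = masks[n - 1];
					n--;
					merged = 1;
				} else {
					j++;
				}
			}
		}
	} while (merged);

	return n;
}
```

For example, masks for CPUs {0,1}, {1,2} and {4,5} collapse into two domains, {0,1,2} and {4,5}: the first two cpusets overlap on CPU 1 and therefore must share a sched domain.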
381 499
3821.8 How do I use cpusets ? 5001.8 How do I use cpusets ?
383-------------------------- 501--------------------------
@@ -469,7 +587,7 @@ than stress the kernel.
469To start a new job that is to be contained within a cpuset, the steps are: 587To start a new job that is to be contained within a cpuset, the steps are:
470 588
471 1) mkdir /dev/cpuset 589 1) mkdir /dev/cpuset
472 2) mount -t cpuset none /dev/cpuset 590 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
473 3) Create the new cpuset by doing mkdir's and write's (or echo's) in 591 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
474 the /dev/cpuset virtual file system. 592 the /dev/cpuset virtual file system.
475 4) Start a task that will be the "founding father" of the new job. 593 4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset
481named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, 599named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
482and then start a subshell 'sh' in that cpuset: 600and then start a subshell 'sh' in that cpuset:
483 601
484 mount -t cpuset none /dev/cpuset 602 mount -t cgroup -ocpuset cpuset /dev/cpuset
485 cd /dev/cpuset 603 cd /dev/cpuset
486 mkdir Charlie 604 mkdir Charlie
487 cd Charlie 605 cd Charlie
@@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
513virtual filesystem. 631virtual filesystem.
514 632
515To mount it, type: 633To mount it, type:
516# mount -t cpuset none /dev/cpuset 634# mount -t cgroup -o cpuset cpuset /dev/cpuset
517 635
518Then under /dev/cpuset you can find a tree that corresponds to the 636Then under /dev/cpuset you can find a tree that corresponds to the
519tree of the cpusets in the system. For instance, /dev/cpuset 637tree of the cpusets in the system. For instance, /dev/cpuset
@@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir:
556This will fail if the cpuset is in use (has cpusets inside, or has 674This will fail if the cpuset is in use (has cpusets inside, or has
557processes attached). 675processes attached).
558 676
677Note that for legacy reasons, the "cpuset" filesystem exists as a
678wrapper around the cgroup filesystem.
679
680The command
681
682mount -t cpuset X /dev/cpuset
683
684is equivalent to
685
686mount -t cgroup -ocpuset X /dev/cpuset
687echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
688
5592.2 Adding/removing cpus 6892.2 Adding/removing cpus
560------------------------ 690------------------------
561 691
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 280ec06573e6..6b0f963f5379 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -82,6 +82,41 @@ Who: Dominik Brodowski <linux@brodo.de>
82 82
83--------------------------- 83---------------------------
84 84
85What: sys_sysctl
86When: September 2010
87Option: CONFIG_SYSCTL_SYSCALL
88Why: The same information is available in a more convenient form
89 /proc/sys, and none of the sysctl variables appear to be
90 important performance wise.
91
92 Binary sysctls are a long standing source of subtle kernel
93 bugs and security issues.
94
95 When I looked several months ago all I could find after
96 searching several distributions were 5 user space programs and
97 glibc (which falls back to /proc/sys) using this syscall.
98
99 The man page for sysctl(2) documents it as unusable for user
100 space programs.
101
102 sysctl(2) is not generally ABI compatible to a 32bit user
103 space application on a 64bit and a 32bit kernel.
104
105 For the last several months the policy has been no new binary
106 sysctls and no one has put forward an argument to use them.
107
108 Binary sysctl issues seem to keep appearing, so
109 properly deprecating them (with a warning to user space) and a
110 2 year grace warning period will mean eventually we can kill
111 them and end the pain.
112
113 In the mean time individual binary sysctls can be dealt with
114 in a piecewise fashion.
115
116Who: Eric Biederman <ebiederm@xmission.com>
117
118---------------------------
119
85What: a.out interpreter support for ELF executables 120What: a.out interpreter support for ELF executables
86When: 2.6.25 121When: 2.6.25
87Files: fs/binfmt_elf.c 122Files: fs/binfmt_elf.c
@@ -184,13 +219,6 @@ Who: Jean Delvare <khali@linux-fr.org>,
184 219
185--------------------------- 220---------------------------
186 221
187What: drivers depending on OBSOLETE_OSS
188When: options in 2.6.22, code in 2.6.24
189Why: OSS drivers with ALSA replacements
190Who: Adrian Bunk <bunk@stusta.de>
191
192---------------------------
193
194What: ACPI procfs interface 222What: ACPI procfs interface
195When: July 2008 223When: July 2008
196Why: ACPI sysfs conversion should be finished by January 2008. 224Why: ACPI sysfs conversion should be finished by January 2008.
diff --git a/Documentation/input/input-programming.txt b/Documentation/input/input-programming.txt
index d9d523099bb7..4d932dc66098 100644
--- a/Documentation/input/input-programming.txt
+++ b/Documentation/input/input-programming.txt
@@ -42,8 +42,8 @@ static int __init button_init(void)
42 goto err_free_irq; 42 goto err_free_irq;
43 } 43 }
44 44
45 button_dev->evbit[0] = BIT(EV_KEY); 45 button_dev->evbit[0] = BIT_MASK(EV_KEY);
46 button_dev->keybit[LONG(BTN_0)] = BIT(BTN_0); 46 button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);
47 47
48 error = input_register_device(button_dev); 48 error = input_register_device(button_dev);
49 if (error) { 49 if (error) {
@@ -217,14 +217,15 @@ If you don't need absfuzz and absflat, you can set them to zero, which mean
217that the thing is precise and always returns to exactly the center position 217that the thing is precise and always returns to exactly the center position
218(if it has any). 218(if it has any).
219 219
2201.4 NBITS(), LONG(), BIT() 2201.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK()
221~~~~~~~~~~~~~~~~~~~~~~~~~~ 221~~~~~~~~~~~~~~~~~~~~~~~~~~
222 222
223These three macros from input.h help some bitfield computations: 223These three macros from bitops.h help some bitfield computations:
224 224
225 NBITS(x) - returns the length of a bitfield array in longs for x bits 225 BITS_TO_LONGS(x) - returns the length of a bitfield array in longs for
226 LONG(x) - returns the index in the array in longs for bit x 226 x bits
227 BIT(x) - returns the index in a long for bit x 227 BIT_WORD(x) - returns the index in the array in longs for bit x
 228 BIT_MASK(x) - returns the mask, within that long, for bit x
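
How the three macros fit together can be shown with userspace stand-ins. The definitions below are written for this sketch and are meant to mirror the behaviour of the bitops.h macros; the helper names set_bit_in() and test_bit_in() are hypothetical, as is any bit number passed to them.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins for the bitops.h helpers (assumed equivalents). */
#define BITS_PER_LONG		(sizeof(long) * CHAR_BIT)
#define BITS_TO_LONGS(nr)	(((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
#define BIT_MASK(nr)		(1UL << ((nr) % BITS_PER_LONG))

/* Set bit nr in a bitmap made of longs. */
static void set_bit_in(unsigned long *bitmap, unsigned int nr)
{
	assert(bitmap != NULL);
	/* BIT_WORD picks the long, BIT_MASK picks the bit inside it. */
	bitmap[BIT_WORD(nr)] |= BIT_MASK(nr);
}

/* Return 1 if bit nr is set in the bitmap, 0 otherwise. */
static int test_bit_in(const unsigned long *bitmap, unsigned int nr)
{
	assert(bitmap != NULL);
	return (bitmap[BIT_WORD(nr)] & BIT_MASK(nr)) != 0;
}
```

A bitmap able to hold N bits is then declared as unsigned long map[BITS_TO_LONGS(N)], which is the pattern the keybit[] assignment in the button example relies on.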
228 229
2291.5 The id* and name fields 2301.5 The id* and name fields
230~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 231~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 1b37b28cc234..d0ac72cc19ff 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -231,6 +231,32 @@ Dump-capture kernel config options (Arch Dependent, ia64)
231 any space below the alignment point will be wasted. 231 any space below the alignment point will be wasted.
232 232
233 233
234Extended crashkernel syntax
235===========================
236
237While the "crashkernel=size[@offset]" syntax is sufficient for most
238configurations, sometimes it's handy to have the reserved memory dependent
239on the value of System RAM -- that's mostly for distributors that pre-setup
240the kernel command line to avoid an unbootable system after some memory has
241been removed from the machine.
242
243The syntax is:
244
245 crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
246 range=start-[end]
247
248For example:
249
250 crashkernel=512M-2G:64M,2G-:128M
251
252This would mean:
253
254 1) if the RAM is smaller than 512M, then don't reserve anything
255 (this is the "rescue" case)
256 2) if the RAM size is between 512M and 2G, then reserve 64M
257 3) if the RAM size is larger than 2G, then reserve 128M
258
259
234Boot into System Kernel 260Boot into System Kernel
235======================= 261=======================
236 262
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 98cf90f2631d..0a3fed445249 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -479,6 +479,16 @@ and is between 256 and 4096 characters. It is defined in the file
479 UART at the specified I/O port or MMIO address. 479 UART at the specified I/O port or MMIO address.
480 The options are the same as for ttyS, above. 480 The options are the same as for ttyS, above.
481 481
482 no_console_suspend
483 [HW] Never suspend the console
484 Disable suspending of consoles during suspend and
485 hibernate operations. Once disabled, debugging
486 messages can reach various consoles while the rest
487 of the system is being put to sleep (ie, while
488 debugging driver suspend/resume hooks). This may
489 not work reliably with all consoles, but is known
490 to work with serial and VGA consoles.
491
482 cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver 492 cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
483 Format: 493 Format:
484 <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] 494 <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
@@ -487,6 +497,13 @@ and is between 256 and 4096 characters. It is defined in the file
487 [KNL] Reserve a chunk of physical memory to 497 [KNL] Reserve a chunk of physical memory to
488 hold a kernel to switch to with kexec on panic. 498 hold a kernel to switch to with kexec on panic.
489 499
500 crashkernel=range1:size1[,range2:size2,...][@offset]
501 [KNL] Same as above, but depends on the memory
502 in the running system. The syntax of range is
503 start-[end] where start and end are both
504 a memory unit (amount[KMG]). See also
 505 Documentation/kdump/kdump.txt for an example.
506
490 cs4232= [HW,OSS] 507 cs4232= [HW,OSS]
491 Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq> 508 Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq>
492 509
diff --git a/Documentation/markers.txt b/Documentation/markers.txt
new file mode 100644
index 000000000000..295a71bc301e
--- /dev/null
+++ b/Documentation/markers.txt
@@ -0,0 +1,81 @@
1 Using the Linux Kernel Markers
2
3 Mathieu Desnoyers
4
5
6This document introduces Linux Kernel Markers and their use. It provides
7examples of how to insert markers in the kernel and connect probe functions to
8them and provides some examples of probe functions.
9
10
11* Purpose of markers
12
13A marker placed in code provides a hook to call a function (probe) that you can
14provide at runtime. A marker can be "on" (a probe is connected to it) or "off"
15(no probe is attached). When a marker is "off" it has no effect, except for
16adding a tiny time penalty (checking a condition for a branch) and space
17penalty (adding a few bytes for the function call at the end of the
18instrumented function and adding a data structure in a separate section). When a
19marker is "on", the function you provide is called each time the marker is
20executed, in the execution context of the caller. When the function provided
21ends its execution, it returns to the caller (continuing from the marker site).
22
23You can put markers at important locations in the code. Markers are
24lightweight hooks that can pass an arbitrary number of parameters,
25described in a printk-like format string, to the attached probe function.
26
27They can be used for tracing and performance accounting.
28
29
30* Usage
31
32In order to use the macro trace_mark, you should include linux/marker.h.
33
34#include <linux/marker.h>
35
36And,
37
38trace_mark(subsystem_event, "%d %s", someint, somestring);
39Where :
40- subsystem_event is an identifier unique to your event
41 - subsystem is the name of your subsystem.
42 - event is the name of the event to mark.
43- "%d %s" is the formatted string for the serializer.
44- someint is an integer.
45- somestring is a char pointer.
46
47Connecting a function (probe) to a marker is done by providing a probe (function
48to call) for the specific marker through marker_probe_register() and can be
49activated by calling marker_arm(). Marker deactivation can be done by calling
50marker_disarm() as many times as marker_arm() has been called. Removing a probe
51is done through marker_probe_unregister(); it will disarm the probe and make
52sure there is no caller left using the probe when it returns. Probe removal is
53preempt-safe because preemption is disabled around the probe call. See the
54"Probe example" section below for a sample probe module.
55
56The marker mechanism supports inserting multiple instances of the same marker.
57Markers can be put in inline functions, inlined static functions, and
58unrolled loops as well as regular functions.
59
60The naming scheme "subsystem_event" is suggested here as a convention intended
61to limit collisions. Marker names are global to the kernel: they are considered
62as being the same whether they are in the core kernel image or in modules.
63If markers with the same name have conflicting format strings, the markers
64detected to have a different format string are not armed, and a printk warning
65identifying the inconsistency is output:
66
67"Format mismatch for probe probe_name (format), marker (format)"
68
69
70* Probe / marker example
71
72See the examples provided in samples/markers/src.
73
74Compile them with your kernel.
75
76Run, as root:
77modprobe marker-example (insmod order is not important)
78modprobe probe-example
79cat /proc/marker-example (returns an expected error)
80rmmod marker-example probe-example
81dmesg
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 650657c54733..4e17beba2379 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1479,7 +1479,8 @@ kernel.
1479 1479
1480Any atomic operation that modifies some state in memory and returns information 1480Any atomic operation that modifies some state in memory and returns information
1481about the state (old or new) implies an SMP-conditional general memory barrier 1481about the state (old or new) implies an SMP-conditional general memory barrier
1482(smp_mb()) on each side of the actual operation. These include: 1482(smp_mb()) on each side of the actual operation (with the exception of
1483explicit lock operations, described later). These include:
1483 1484
1484 xchg(); 1485 xchg();
1485 cmpxchg(); 1486 cmpxchg();
@@ -1536,10 +1537,19 @@ If they're used for constructing a lock of some description, then they probably
1536do need memory barriers as a lock primitive generally has to do things in a 1537do need memory barriers as a lock primitive generally has to do things in a
1537specific order. 1538specific order.
1538 1539
1539
1540Basically, each usage case has to be carefully considered as to whether memory 1540Basically, each usage case has to be carefully considered as to whether memory
1541barriers are needed or not. 1541barriers are needed or not.
1542 1542
1543The following operations are special locking primitives:
1544
1545 test_and_set_bit_lock();
1546 clear_bit_unlock();
1547 __clear_bit_unlock();
1548
1549These implement LOCK-class and UNLOCK-class operations. These should be used in
1550preference to other operations when implementing locking primitives, because
1551their implementations can be optimised on many architectures.
1552
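As a rough userspace analogy of the LOCK-class and UNLOCK-class operations
listed above, the sketch below builds a one-bit lock from C11 atomics, using
acquire ordering to take the lock and release ordering to drop it. The
model_*() helpers are hypothetical and are not the kernel's implementations;
mapping LOCK/UNLOCK onto C11 acquire/release is an assumption made for
illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogy only: these helpers are hypothetical and are not the
 * kernel's test_and_set_bit_lock() or clear_bit_unlock().
 */

/* Set bit nr with acquire ordering and return the bit's previous value. */
static bool model_test_and_set_bit_lock(unsigned int nr, atomic_ulong *word)
{
	unsigned long mask;

	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	mask = 1UL << nr;
	return (atomic_fetch_or_explicit(word, mask,
					 memory_order_acquire) & mask) != 0;
}

/* Clear bit nr with release ordering. */
static void model_clear_bit_unlock(unsigned int nr, atomic_ulong *word)
{
	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	atomic_fetch_and_explicit(word, ~(1UL << nr), memory_order_release);
}

/* Spin until the bit was previously clear, i.e. until the caller owns it. */
static void model_bit_spin_lock(unsigned int nr, atomic_ulong *word)
{
	while (model_test_and_set_bit_lock(nr, word))
		; /* busy-wait; a real lock would relax the CPU or sleep */
}
```

Accesses made between model_bit_spin_lock() and model_clear_bit_unlock() are
then ordered after the acquire and before the release, and neither helper
needs a full barrier on both sides of the atomic operation.
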
1543[!] Note that special memory barrier primitives are available for these 1553[!] Note that special memory barrier primitives are available for these
1544situations because on some CPUs the atomic instructions used imply full memory 1554situations because on some CPUs the atomic instructions used imply full memory
1545barriers, and so barrier instructions are superfluous in conjunction with them, 1555barriers, and so barrier instructions are superfluous in conjunction with them,
diff --git a/Documentation/mips/00-INDEX b/Documentation/mips/00-INDEX
index 9df8a2eac7b4..3f13bf8043d2 100644
--- a/Documentation/mips/00-INDEX
+++ b/Documentation/mips/00-INDEX
@@ -4,5 +4,3 @@ AU1xxx_IDE.README
4 - README for MIPS AU1XXX IDE driver. 4 - README for MIPS AU1XXX IDE driver.
5GT64120.README 5GT64120.README
6 - README for dir with info on MIPS boards using GT-64120 or GT-64120A. 6 - README for dir with info on MIPS boards using GT-64120 or GT-64120A.
7time.README
8 - README for MIPS time services.
diff --git a/Documentation/mips/time.README b/Documentation/mips/time.README
deleted file mode 100644
index a4ce603ed3b3..000000000000
--- a/Documentation/mips/time.README
+++ /dev/null
@@ -1,173 +0,0 @@
1README for MIPS time services
2
3Jun Sun
4jsun@mvista.com or jsun@junsun.net
5
6
7ABOUT
8-----
9This file describes the new arch/mips/kernel/time.c, related files and the
10services they provide.
11
12If you are short on patience and just want to know how to use time.c for a
13new board or convert an existing board, go to the last section.
14
15
16FILES, COMPATIBILITY AND CONFIGS
17---------------------------------
18
19The old arch/mips/kernel/time.c is renamed to old-time.c.
20
21A new time.c is put there, together with include/asm-mips/time.h.
22
23Two config variables are introduced, CONFIG_OLD_TIME_C and CONFIG_NEW_TIME_C.
24So we allow boards using
25
26 1) old time.c (CONFIG_OLD_TIME_C)
27 2) new time.c (CONFIG_NEW_TIME_C)
28 3) neither (their own private time.c)
29
30However, it is expected every board will move to the new time.c in the near
31future.
32
33
34WHAT THE NEW CODE PROVIDES?
35---------------------------
36
37The new time code provides the following services:
38
39 a) Implements functions required by Linux common code:
40 time_init
41
42 b) provides an abstraction of RTC and null RTC implementation as default.
43 extern unsigned long (*rtc_get_time)(void);
44 extern int (*rtc_set_time)(unsigned long);
45
46 c) high-level and low-level timer interrupt routines where the timer
47 interrupt source may or may not be the CPU timer. The high-level
48 routine is dispatched through do_IRQ() while the low-level is
49 dispatched in assembly code (usually int-handler.S)
50
51
52WHAT THE NEW CODE REQUIRES?
53---------------------------
54
55For the new code to work properly, each board implementation needs to supply
56the following functions or values:
57
58 a) board_time_init - a function pointer. Invoked at the beginning of
59 time_init(). It is optional.
60 1. (optional) set up RTC routines
61 2. (optional) calibrate and set the mips_hpt_frequency
62
63 b) plat_timer_setup - a function pointer. Invoked at the end of time_init()
64 1. (optional) over-ride any decisions made in time_init()
65 2. set up the irqaction for timer interrupt.
66 3. enable the timer interrupt
67
68 c) (optional) board-specific RTC routines.
69
70 d) (optional) mips_hpt_frequency - It must be defined if the board
71 is using CPU counter for timer interrupt.
72
73
74PORTING GUIDE
75-------------
76
77Step 1: decide how you like to implement the time services.
78
79 a) does this board have an RTC? If yes, implement the two RTC funcs.
80
81 b) does the CPU have counter/compare registers?
82
83 If the answer is no, you need a timer to provide the timer interrupt
84 at 100 HZ speed.
85
86 c) The following sub steps assume your CPU has counter register.
87 Do you plan to use the CPU counter register as the timer interrupt
88 or use an external timer?
89
90 In order to use CPU counter register as the timer interrupt source, you
91 must know the counter speed (mips_hpt_frequency). It is usually the
92 same as the CPU speed or an integral divisor of it.
93
94 d) decide on whether you want to use high-level or low-level timer
95 interrupt routines. The low-level one is presumably faster, but should
96 not make too much difference.
97
98
99Step 2: the machine setup() function
100
101 If you supply board_time_init(), set the function pointer.
102
103
104Step 3: implement rtc routines, board_time_init() and plat_timer_setup()
105 if needed.
106
107 board_time_init() -
108 a) (optional) set up RTC routines,
109 b) (optional) calibrate and set the mips_hpt_frequency
110 (only needed if you intended to use cpu counter as timer interrupt
111 source)
112
113 plat_timer_setup() -
114 a) (optional) over-write any choices made above by time_init().
115 b) machine specific code should setup the timer irqaction.
116 c) enable the timer interrupt
117
118
119 If the RTC chip is a common chip, I suggest the routines are put under
120 arch/mips/lib. For example, for DS1386 chip, one would create
121 rtc-ds1386.c under arch/mips/lib directory. Add the following line to
122 the arch/mips/lib/Makefile:
123
124 obj-$(CONFIG_DDB5476) += rtc-ds1386.o
125
126Step 4: if you are using low-level timer interrupt, change your interrupt
127 dispatching code to check for timer interrupt and jump to
128 ll_timer_interrupt() directly if one is detected.
129
130Step 5: Modify arch/mips/config.in and add CONFIG_NEW_TIME_C to your machine.
131 Modify the appropriate defconfig if applicable.
132
133Final notes:
134
135For some tricky cases, you may need to add your own wrapper functions
136for some of the functions in time.c.
137
138For example, you may define your own timer interrupt routine, which does
139some of its own processing and then calls timer_interrupt().
140
141You can also over-ride any of the built-in functions (RTC routines
142and/or timer interrupt routine).
143
144
145PORTING NOTES FOR SMP
146----------------------
147
148If you have an SMP box, things are slightly more complicated.
149
150The time service running every jiffy is logically divided into two parts:
151
152 1) the one for the whole system (defined in timer_interrupt())
153 2) the one that should run for each CPU (defined in local_timer_interrupt())
154
155You need to decide on your timer interrupt sources.
156
157 case 1) - whole system has only one timer interrupt delivered to one CPU
158
159 In this case, you set up timer interrupt as in UP systems. In addition,
160 you need to set emulate_local_timer_interrupt to 1 so that other
161 CPUs get to call local_timer_interrupt().
162
163 THIS IS CURRENTLY NOT IMPLEMENTED. However, it is rather easy to write
164 one should such a need arise. You simply make an IPI call.
165
166 case 2) - each CPU has a separate timer interrupt
167
168 In this case, you need to set up IRQ such that each of them will
169 call local_timer_interrupt(). In addition, you need to arrange
170 one and only one of them to call timer_interrupt().
171
172 You can also do the low-level version of those interrupt routines,
173 following similar dispatching routes described above.
diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt
index 8f2302415eff..265fcdcb8e5f 100644
--- a/Documentation/parport-lowlevel.txt
+++ b/Documentation/parport-lowlevel.txt
@@ -25,7 +25,6 @@ Global functions:
25 parport_open 25 parport_open
26 parport_close 26 parport_close
27 parport_device_id 27 parport_device_id
28 parport_device_num
29 parport_device_coords 28 parport_device_coords
30 parport_find_class 29 parport_find_class
31 parport_find_device 30 parport_find_device
@@ -735,7 +734,7 @@ NULL is returned.
735 734
736SEE ALSO 735SEE ALSO
737 736
738parport_register_device, parport_device_num 737parport_register_device
739 738
740parport_close - unregister device for particular device number 739parport_close - unregister device for particular device number
741------------- 740-------------
@@ -787,29 +786,7 @@ Many devices have ill-formed IEEE 1284 Device IDs.
787 786
788SEE ALSO 787SEE ALSO
789 788
790parport_find_class, parport_find_device, parport_device_num 789parport_find_class, parport_find_device
791
792parport_device_num - convert device coordinates to device number
793------------------
794
795SYNOPSIS
796
797#include <linux/parport.h>
798
799int parport_device_num (int parport, int mux, int daisy);
800
801DESCRIPTION
802
803Convert between device coordinates (port, multiplexor, daisy chain
804address) and device number (zero-based).
805
806RETURN VALUE
807
808Device number, or -1 if no device at given coordinates.
809
810SEE ALSO
811
812parport_device_coords, parport_open, parport_device_id
813 790
814parport_device_coords - convert device number to device coordinates 791parport_device_coords - convert device number to device coordinates
815------------------ 792------------------
@@ -833,7 +810,7 @@ Zero on success, in which case the coordinates are (*parport, *mux,
833 810
834SEE ALSO 811SEE ALSO
835 812
836parport_device_num, parport_open, parport_device_id 813parport_open, parport_device_id
837 814
838parport_find_class - find a device by its class 815parport_find_class - find a device by its class
839------------------ 816------------------
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index 1a85e2b964dc..57aef2f6e0de 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -78,8 +78,8 @@ c) Advanced debugging
78In case the STD does not work on your system even in the minimal configuration 78In case the STD does not work on your system even in the minimal configuration
79and compiling more drivers as modules is not practical or some modules cannot 79and compiling more drivers as modules is not practical or some modules cannot
80be unloaded, you can use one of the more advanced debugging techniques to find 80be unloaded, you can use one of the more advanced debugging techniques to find
81the problem. First, if there is a serial port in your box, you can set the 81the problem. First, if there is a serial port in your box, you can boot the
82CONFIG_DISABLE_CONSOLE_SUSPEND kernel configuration option and try to log kernel 82kernel with the 'no_console_suspend' parameter and try to log kernel
83messages using the serial console. This may provide you with some information 83messages using the serial console. This may provide you with some information
84about the reasons of the suspend (resume) failure. Alternatively, it may be 84about the reasons of the suspend (resume) failure. Alternatively, it may be
85possible to use a FireWire port for debugging with firescope 85possible to use a FireWire port for debugging with firescope
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 04dc1cf9d215..38b57248fd61 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -19,12 +19,13 @@ we only consider hibernation, but the description also applies to suspend).
19Namely, as the first step of the hibernation procedure the function 19Namely, as the first step of the hibernation procedure the function
20freeze_processes() (defined in kernel/power/process.c) is called. It executes 20freeze_processes() (defined in kernel/power/process.c) is called. It executes
21try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and 21try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
22sends a fake signal to each of them. A task that receives such a signal and has 22either wakes them up, if they are kernel threads, or sends fake signals to them,
23TIF_FREEZE set, should react to it by calling the refrigerator() function 23if they are user space processes. A task that has TIF_FREEZE set, should react
24(defined in kernel/power/process.c), which sets the task's PF_FROZEN flag, 24to it by calling the function called refrigerator() (defined in
25changes its state to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is 25kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state
26cleared for it. Then, we say that the task is 'frozen' and therefore the set of 26to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
27functions handling this mechanism is called 'the freezer' (these functions are 27Then, we say that the task is 'frozen' and therefore the set of functions
28handling this mechanism is referred to as 'the freezer' (these functions are
28defined in kernel/power/process.c and include/linux/freezer.h). User space 29defined in kernel/power/process.c and include/linux/freezer.h). User space
29processes are generally frozen before kernel threads. 30processes are generally frozen before kernel threads.
30 31
@@ -35,21 +36,27 @@ task enter refrigerator() if the flag is set.
35 36
36For user space processes try_to_freeze() is called automatically from the 37For user space processes try_to_freeze() is called automatically from the
37signal-handling code, but the freezable kernel threads need to call it 38signal-handling code, but the freezable kernel threads need to call it
38explicitly in suitable places. The code to do this may look like the following: 39explicitly in suitable places or use the wait_event_freezable() or
40wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
41that combine interruptible sleep with checking if TIF_FREEZE is set and calling
42try_to_freeze(). The main loop of a freezable kernel thread may look like the
43following one:
39 44
45 set_freezable();
40 do { 46 do {
41 hub_events(); 47 hub_events();
42 wait_event_interruptible(khubd_wait, 48 wait_event_freezable(khubd_wait,
43 !list_empty(&hub_event_list)); 49 !list_empty(&hub_event_list) ||
44 try_to_freeze(); 50 kthread_should_stop());
45 } while (!signal_pending(current)); 51 } while (!kthread_should_stop() || !list_empty(&hub_event_list));
46 52
47(from drivers/usb/core/hub.c::hub_thread()). 53(from drivers/usb/core/hub.c::hub_thread()).
48 54
49If a freezable kernel thread fails to call try_to_freeze() after the freezer has 55If a freezable kernel thread fails to call try_to_freeze() after the freezer has
50set TIF_FREEZE for it, the freezing of tasks will fail and the entire 56set TIF_FREEZE for it, the freezing of tasks will fail and the entire
51hibernation operation will be cancelled. For this reason, freezable kernel 57hibernation operation will be cancelled. For this reason, freezable kernel
52threads must call try_to_freeze() somewhere. 58threads must call try_to_freeze() somewhere or use one of the
59wait_event_freezable() and wait_event_freezable_timeout() macros.
53 60
54After the system memory state has been restored from a hibernation image and 61After the system memory state has been restored from a hibernation image and
55devices have been reinitialized, the function thaw_processes() is called in 62devices have been reinitialized, the function thaw_processes() is called in
@@ -81,7 +88,16 @@ hibernation image has been created and before the system is finally powered off.
81The majority of these are user space processes, but if any of the kernel threads 88The majority of these are user space processes, but if any of the kernel threads
82may cause something like this to happen, they have to be freezable. 89may cause something like this to happen, they have to be freezable.
83 90
842. The second reason is to prevent user space processes and some kernel threads 912. Next, to create the hibernation image we need to free a sufficient amount of
92memory (approximately 50% of available RAM) and we need to do that before
93devices are deactivated, because we generally need them for swapping out. Then,
94after the memory for the image has been freed, we don't want tasks to allocate
95additional memory and we prevent them from doing that by freezing them earlier.
96[Of course, this also means that device drivers should not allocate substantial
97amounts of memory from their .suspend() callbacks before hibernation, but this
98is a separate issue.]
99
1003. The third reason is to prevent user space processes and some kernel threads
85from interfering with the suspending and resuming of devices. A user space 101from interfering with the suspending and resuming of devices. A user space
86process running on a second CPU while we are suspending devices may, for 102process running on a second CPU while we are suspending devices may, for
87example, be troublesome and without the freezing of tasks we would need some 103example, be troublesome and without the freezing of tasks we would need some
@@ -111,7 +127,7 @@ frozen before the driver's .suspend() callback is executed and it will be
111thawed after the driver's .resume() callback has run, so it won't be accessing 127thawed after the driver's .resume() callback has run, so it won't be accessing
112the device while it's suspended. 128the device while it's suspended.
113 129
1143. Another reason for freezing tasks is to prevent user space processes from 1304. Another reason for freezing tasks is to prevent user space processes from
115realizing that hibernation (or suspend) operation takes place. Ideally, user 131realizing that hibernation (or suspend) operation takes place. Ideally, user
116space processes should not notice that such a system-wide operation has occurred 132space processes should not notice that such a system-wide operation has occurred
117and should continue running without any problems after the restore (or resume 133and should continue running without any problems after the restore (or resume
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index fd5192a8fa8a..e67211fe0ee2 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -20,7 +20,7 @@ states.
20/sys/power/disk controls the operating mode of the suspend-to-disk 20/sys/power/disk controls the operating mode of the suspend-to-disk
21mechanism. Suspend-to-disk can be handled in several ways. We have a 21mechanism. Suspend-to-disk can be handled in several ways. We have a
22few options for putting the system to sleep - using the platform driver 22few options for putting the system to sleep - using the platform driver
23(e.g. ACPI or other pm_ops), powering off the system or rebooting the 23(e.g. ACPI or other suspend_ops), powering off the system or rebooting the
24system (for testing). 24system (for testing).
25 25
26Additionally, /sys/power/disk can be used to turn on one of the two testing 26Additionally, /sys/power/disk can be used to turn on one of the two testing
diff --git a/Documentation/sound/oss/es1371 b/Documentation/sound/oss/es1371
deleted file mode 100644
index c3151266771c..000000000000
--- a/Documentation/sound/oss/es1371
+++ /dev/null
@@ -1,64 +0,0 @@
1/proc/sound, /dev/sndstat
2-------------------------
3
4/proc/sound and /dev/sndstat are not supported by the
5driver. To find out whether the driver succeeded loading,
6check the kernel log (dmesg).
7
8
9ALaw/uLaw sample formats
10------------------------
11
12This driver does not support the ALaw/uLaw sample formats.
13ALaw is the default mode when opening a sound device
14using OSS/Free. The reason for the lack of support is
15that the hardware does not support these formats, and adding
16conversion routines to the kernel would lead to very ugly
17code in the presence of the mmap interface to the driver.
18And since xquake uses mmap, mmap is considered important :-)
19and no sane application uses ALaw/uLaw these days anyway.
20In short, playing a Sun .au file as follows:
21
22cat my_file.au > /dev/dsp
23
24does not work. Instead, you may use the play script from
25Chris Bagwell's sox-12.14 package (available from the URL
26below) to play many different audio file formats.
27The script automatically determines the audio format
28and performs audio conversions if necessary.
29http://home.sprynet.com/sprynet/cbagwell/projects.html
30
31
32Blocking vs. nonblocking IO
33---------------------------
34
35Unlike OSS/Free this driver honours the O_NONBLOCK file flag
36not only during open, but also during read and write.
37This is an effort to make the sound driver interface more
38regular. Timidity has problems with this; a patch
39is available from http://www.ife.ee.ethz.ch/~sailer/linux/pciaudio.html.
40(Timidity patched will also run on OSS/Free).
41
42
43MIDI UART
44---------
45
46The driver supports a simple MIDI UART interface, with
47no ioctl's supported.
48
49
50MIDI synthesizer
51----------------
52
53This soundcard does not have any hardware MIDI synthesizer;
54MIDI synthesis has to be done in software. To allow this
55the driver/soundcard supports two PCM (/dev/dsp) interfaces.
56
57There is a freely available software package that allows
58MIDI file playback on this soundcard called Timidity.
59See http://www.cgs.fi/~tt/timidity/.
60
61
62
63Thomas Sailer
64t.sailer@alumni.ethz.ch
diff --git a/Documentation/thinkpad-acpi.txt b/Documentation/thinkpad-acpi.txt
index 60953d6c919d..3b95bbacc775 100644
--- a/Documentation/thinkpad-acpi.txt
+++ b/Documentation/thinkpad-acpi.txt
@@ -105,10 +105,15 @@ The version of thinkpad-acpi's sysfs interface is exported by the driver
105as a driver attribute (see below). 105as a driver attribute (see below).
106 106
107Sysfs driver attributes are on the driver's sysfs attribute space, 107Sysfs driver attributes are on the driver's sysfs attribute space,
108for 2.6.20 this is /sys/bus/platform/drivers/thinkpad_acpi/. 108for 2.6.23 this is /sys/bus/platform/drivers/thinkpad_acpi/ and
109/sys/bus/platform/drivers/thinkpad_hwmon/
109 110
110Sysfs device attributes are on the driver's sysfs attribute space, 111Sysfs device attributes are on the thinkpad_acpi device sysfs attribute
111for 2.6.20 this is /sys/devices/platform/thinkpad_acpi/. 112space, for 2.6.23 this is /sys/devices/platform/thinkpad_acpi/.
113
114Sysfs device attributes for the sensors and fan are on the
115thinkpad_hwmon device's sysfs attribute space, but you should locate it
116looking for a hwmon device with the name attribute of "thinkpad".
112 117
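A minimal sketch of locating the device by name follows. It assumes that the
hwmon class directory is /sys/class/hwmon and that each entry exposes its
name either as "name" or as "device/name"; both are assumptions about the
sysfs layout, not statements from this document. find_hwmon_by_name() and
read_line() are hypothetical helper names.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of path into buf, without the trailing newline. */
static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Search class_dir (assumed to be /sys/class/hwmon) for an entry whose name
 * attribute equals wanted. On success, write the entry's path to out and
 * return 0; otherwise return -1.
 */
static int find_hwmon_by_name(const char *class_dir, const char *wanted,
			      char *out, size_t len)
{
	/* Candidate locations of the name attribute (an assumption). */
	static const char *const attrs[] = { "name", "device/name" };
	char path[1024], name[64];
	struct dirent *de;
	int found = -1, n;
	DIR *d;

	assert(class_dir && wanted && out && len > 0);
	d = opendir(class_dir);
	if (!d)
		return -1;
	while (found != 0 && (de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue; /* skip ".", ".." and hidden entries */
		for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
			snprintf(path, sizeof(path), "%s/%s/%s",
				 class_dir, de->d_name, attrs[i]);
			if (read_line(path, name, sizeof(name)) != 0 ||
			    strcmp(name, wanted) != 0)
				continue;
			n = snprintf(out, len, "%s/%s", class_dir, de->d_name);
			if (n >= 0 && (size_t)n < len)
				found = 0; /* path fits in the caller's buffer */
			break;
		}
	}
	closedir(d);
	return found;
}
```

A caller would pass "/sys/class/hwmon" and "thinkpad", then read attributes
such as temp*_input or fan1_input relative to the returned path (or to its
device subdirectory, depending on where the attributes are exposed).
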
113Driver version 118Driver version
114-------------- 119--------------
@@ -766,7 +771,7 @@ Temperature sensors
766------------------- 771-------------------
767 772
768procfs: /proc/acpi/ibm/thermal 773procfs: /proc/acpi/ibm/thermal
769sysfs device attributes: (hwmon) temp*_input 774sysfs device attributes: (hwmon "thinkpad") temp*_input
770 775
771Most ThinkPads include six or more separate temperature sensors but only 776Most ThinkPads include six or more separate temperature sensors but only
772expose the CPU temperature through the standard ACPI methods. This 777expose the CPU temperature through the standard ACPI methods. This
@@ -989,7 +994,9 @@ Fan control and monitoring: fan speed, fan enable/disable
989--------------------------------------------------------- 994---------------------------------------------------------
990 995
991procfs: /proc/acpi/ibm/fan 996procfs: /proc/acpi/ibm/fan
992sysfs device attributes: (hwmon) fan_input, pwm1, pwm1_enable 997sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1,
998 pwm1_enable
999sysfs hwmon driver attributes: fan_watchdog
993 1000
994NOTE NOTE NOTE: fan control operations are disabled by default for 1001NOTE NOTE NOTE: fan control operations are disabled by default for
995safety reasons. To enable them, the module parameter "fan_control=1" 1002safety reasons. To enable them, the module parameter "fan_control=1"
@@ -1131,7 +1138,7 @@ hwmon device attribute fan1_input:
1131 which can take up to two minutes. May return rubbish on older 1138 which can take up to two minutes. May return rubbish on older
1132 ThinkPads. 1139 ThinkPads.
1133 1140
1134driver attribute fan_watchdog: 1141hwmon driver attribute fan_watchdog:
1135 Fan safety watchdog timer interval, in seconds. Minimum is 1142 Fan safety watchdog timer interval, in seconds. Minimum is
1136 1 second, maximum is 120 seconds. 0 disables the watchdog. 1143 1 second, maximum is 120 seconds. 0 disables the watchdog.
1137 1144
@@ -1233,3 +1240,9 @@ Sysfs interface changelog:
1233 layer, the radio switch generates input event EV_RADIO, 1240 layer, the radio switch generates input event EV_RADIO,
1234 and the driver enables hot key handling by default in 1241 and the driver enables hot key handling by default in
1235 the firmware. 1242 the firmware.
1243
12440x020000: ABI fix: added a separate hwmon platform device and
1245 driver, which must be located by name (thinkpad)
1246 and the hwmon class for libsensors4 (lm-sensors 3)
1247 compatibility. Moved all hwmon attributes to this
1248 new platform device.