Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/DocBook/kernel-api.tmpl         |   2
-rw-r--r--  Documentation/accounting/cgroupstats.txt      |  27
-rw-r--r--  Documentation/cachetlb.txt                    |  27
-rw-r--r--  Documentation/cgroups.txt                     | 545
-rw-r--r--  Documentation/cpu-hotplug.txt                 |   4
-rw-r--r--  Documentation/cpusets.txt                     | 226
-rw-r--r--  Documentation/device-mapper/dm-uevent.txt     |  97
-rw-r--r--  Documentation/input/input-programming.txt     |  15
-rw-r--r--  Documentation/kbuild/kconfig-language.txt     |  14
-rw-r--r--  Documentation/kbuild/makefiles.txt            |  22
-rw-r--r--  Documentation/kdump/kdump.txt                 |  26
-rw-r--r--  Documentation/kernel-parameters.txt           |  13
-rw-r--r--  Documentation/markers.txt                     |  81
-rw-r--r--  Documentation/mips/00-INDEX                   |   2
-rw-r--r--  Documentation/mips/time.README                | 173
-rw-r--r--  Documentation/thinkpad-acpi.txt               |  25
16 files changed, 1033 insertions, 266 deletions
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index d3290c46af51..aa38cc5692a0 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -46,7 +46,7 @@
46 46
47 <sect1><title>Atomic and pointer manipulation</title> 47 <sect1><title>Atomic and pointer manipulation</title>
48!Iinclude/asm-x86/atomic_32.h 48!Iinclude/asm-x86/atomic_32.h
49!Iinclude/asm-x86/unaligned_32.h 49!Iinclude/asm-x86/unaligned.h
50 </sect1> 50 </sect1>
51 51
52 <sect1><title>Delaying, scheduling, and timer routines</title> 52 <sect1><title>Delaying, scheduling, and timer routines</title>
diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt
new file mode 100644
index 000000000000..eda40fd39cad
--- /dev/null
+++ b/Documentation/accounting/cgroupstats.txt
@@ -0,0 +1,27 @@
1Control Groupstats is inspired by the discussion at
2http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
3suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
4
5The per-cgroup statistics infrastructure reuses code from the taskstats
6interface. A new set of cgroup operations is registered, with commands
7and attributes specific to cgroups. Per-cgroup statistics can be
8extended by adding members to the cgroupstats
9structure.
10
11The current model for cgroupstats is a pull model; a push model (to post
12statistics on interesting events) should be easy to add. Currently,
13user space requests statistics by passing the cgroup path.
14Statistics about the state of all the tasks in the cgroup are returned to
15user space.
16
17NOTE: We currently rely on delay accounting for extracting information
18about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
19information will not be available.
20
21To extract cgroup statistics, a utility very similar to getdelays.c
22has been developed. Sample output of the utility is shown below:
23
24~/balbir/cgroupstats # ./getdelays -C "/cgroup/a"
25sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
26~/balbir/cgroupstats # ./getdelays -C "/cgroup"
27sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
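
The sample output above has a fixed shape, so a monitoring tool could pick
the five counters out of each line with sscanf(). The following is a
minimal sketch, assuming the line format stays exactly as shown; struct
cgroup_counts and parse_cgroupstats_line() are hypothetical names and are
not part of getdelays.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the five counters printed by getdelays -C. */
struct cgroup_counts {
	unsigned long sleeping;
	unsigned long blocked;
	unsigned long running;
	unsigned long stopped;
	unsigned long uninterruptible;
};

/*
 * Parse one output line of the form shown above.
 * Returns 0 on success, -1 if the line does not match that form.
 */
int parse_cgroupstats_line(const char *line, struct cgroup_counts *out)
{
	int n;

	assert(line != NULL && out != NULL);

	n = sscanf(line,
		   "sleeping %lu, blocked %lu, running %lu, "
		   "stopped %lu, uninterruptible %lu",
		   &out->sleeping, &out->blocked, &out->running,
		   &out->stopped, &out->uninterruptible);

	return n == 5 ? 0 : -1;
}
```

A caller would read the utility's output line by line and skip any line
for which the function returns -1.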
diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt
index 552cabac0608..da42ab414c48 100644
--- a/Documentation/cachetlb.txt
+++ b/Documentation/cachetlb.txt
@@ -87,30 +87,7 @@ changes occur:
87 87
88 This is used primarily during fault processing. 88 This is used primarily during fault processing.
89 89
905) void flush_tlb_pgtables(struct mm_struct *mm, 905) void update_mmu_cache(struct vm_area_struct *vma,
91 unsigned long start, unsigned long end)
92
93 The software page tables for address space 'mm' for virtual
94 addresses in the range 'start' to 'end-1' are being torn down.
95
96 Some platforms cache the lowest level of the software page tables
97 in a linear virtually mapped array, to make TLB miss processing
98 more efficient. On such platforms, since the TLB is caching the
99 software page table structure, it needs to be flushed when parts
100 of the software page table tree are unlinked/freed.
101
102 Sparc64 is one example of a platform which does this.
103
104 Usually, when munmap()'ing an area of user virtual address
105 space, the kernel leaves the page table parts around and just
106 marks the individual pte's as invalid. However, if very large
107 portions of the address space are unmapped, the kernel frees up
108 those portions of the software page tables to prevent potential
109 excessive kernel memory usage caused by erratic mmap/mmunmap
110 sequences. It is at these times that flush_tlb_pgtables will
111 be invoked.
112
1136) void update_mmu_cache(struct vm_area_struct *vma,
114 unsigned long address, pte_t pte) 91 unsigned long address, pte_t pte)
115 92
116 At the end of every page fault, this routine is invoked to 93 At the end of every page fault, this routine is invoked to
@@ -123,7 +100,7 @@ changes occur:
123 translations for software managed TLB configurations. 100 translations for software managed TLB configurations.
124 The sparc64 port currently does this. 101 The sparc64 port currently does this.
125 102
1267) void tlb_migrate_finish(struct mm_struct *mm) 1036) void tlb_migrate_finish(struct mm_struct *mm)
127 104
128 This interface is called at the end of an explicit 105 This interface is called at the end of an explicit
129 process migration. This interface provides a hook 106 process migration. This interface provides a hook
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt
new file mode 100644
index 000000000000..98a26f81fa75
--- /dev/null
+++ b/Documentation/cgroups.txt
@@ -0,0 +1,545 @@
1 CGROUPS
2 -------
3
4Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
5
6Original copyright statements from cpusets.txt:
7Portions Copyright (C) 2004 BULL SA.
8Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
9Modified by Paul Jackson <pj@sgi.com>
10Modified by Christoph Lameter <clameter@sgi.com>
11
12CONTENTS:
13=========
14
151. Control Groups
16 1.1 What are cgroups ?
17 1.2 Why are cgroups needed ?
18 1.3 How are cgroups implemented ?
19 1.4 What does notify_on_release do ?
20 1.5 How do I use cgroups ?
212. Usage Examples and Syntax
22 2.1 Basic Usage
23 2.2 Attaching processes
243. Kernel API
25 3.1 Overview
26 3.2 Synchronization
27 3.3 Subsystem API
284. Questions
29
301. Control Groups
31=================
32
331.1 What are cgroups ?
34----------------------
35
36Control Groups provide a mechanism for aggregating/partitioning sets of
37tasks, and all their future children, into hierarchical groups with
38specialized behaviour.
39
40Definitions:
41
42A *cgroup* associates a set of tasks with a set of parameters for one
43or more subsystems.
44
45A *subsystem* is a module that makes use of the task grouping
46facilities provided by cgroups to treat groups of tasks in
47particular ways. A subsystem is typically a "resource controller" that
48schedules a resource or applies per-cgroup limits, but it may be
49anything that wants to act on a group of processes, e.g. a
50virtualization subsystem.
51
52A *hierarchy* is a set of cgroups arranged in a tree, such that
53every task in the system is in exactly one of the cgroups in the
54hierarchy, and a set of subsystems; each subsystem has system-specific
55state attached to each cgroup in the hierarchy. Each hierarchy has
56an instance of the cgroup virtual filesystem associated with it.
57
58At any one time there may be multiple active hierarchies of task
59cgroups. Each hierarchy is a partition of all tasks in the system.
60
61User level code may create and destroy cgroups by name in an
62instance of the cgroup virtual file system, specify and query to
63which cgroup a task is assigned, and list the task pids assigned to
64a cgroup. Those creations and assignments only affect the hierarchy
65associated with that instance of the cgroup file system.
66
67On their own, the only use for cgroups is for simple job
68tracking. The intention is that other subsystems hook into the generic
69cgroup support to provide new attributes for cgroups, such as
70accounting/limiting the resources which processes in a cgroup can
71access. For example, cpusets (see Documentation/cpusets.txt) allows
72you to associate a set of CPUs and a set of memory nodes with the
73tasks in each cgroup.
74
751.2 Why are cgroups needed ?
76----------------------------
77
78There are multiple efforts to provide process aggregations in the
79Linux kernel, mainly for resource tracking purposes. Such efforts
80include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
81namespaces. These all require the basic notion of a
82grouping/partitioning of processes, with newly forked processes ending
83in the same group (cgroup) as their parent process.
84
85The kernel cgroup patch provides the minimum essential kernel
86mechanisms required to efficiently implement such groups. It has
87minimal impact on the system fast paths, and provides hooks for
88specific subsystems such as cpusets to provide additional behaviour as
89desired.
90
91Multiple hierarchy support is provided to allow for situations where
92the division of tasks into cgroups is distinctly different for
93different subsystems - having parallel hierarchies allows each
94hierarchy to be a natural division of tasks, without having to handle
95complex combinations of tasks that would be present if several
96unrelated subsystems needed to be forced into the same tree of
97cgroups.
98
99At one extreme, each resource controller or subsystem could be in a
100separate hierarchy; at the other extreme, all subsystems
101would be attached to the same hierarchy.
102
103As an example of a scenario (originally proposed by vatsa@in.ibm.com)
104that can benefit from multiple hierarchies, consider a large
105university server with various users - students, professors, system
106tasks etc. The resource planning for this server could be along the
107following lines:
108
109 CPU : Top cpuset
110 / \
111 CPUSet1 CPUSet2
112 | |
113 (Profs) (Students)
114
115 In addition (system tasks) are attached to topcpuset (so
116 that they can run anywhere) with a limit of 20%
117
118 Memory : Professors (50%), students (30%), system (20%)
119
120 Disk : Prof (50%), students (30%), system (20%)
121
122 Network : WWW browsing (20%), Network File System (60%), others (20%)
123 / \
124 Prof (15%) students (5%)
125
126Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
127into NFS network class.
128
129At the same time firefox/lynx will share an appropriate CPU/Memory class
130depending on who launched it (prof/student).
131
132With the ability to classify tasks differently for different resources
133(by putting those resource subsystems in different hierarchies) then
134the admin can easily set up a script which receives exec notifications
135and depending on who is launching the browser he can
136
137 # echo browser_pid > /mnt/<restype>/<userclass>/tasks
138
139With only a single hierarchy, he would potentially have to create
140a separate cgroup for every browser launched and associate it with
141the appropriate network and other resource classes. This may lead to
142a proliferation of such cgroups.
143
144Also, let's say that the administrator would like to give enhanced network
145access temporarily to a student's browser (since it is night and the user
146wants to do online gaming :)) or give one of the student's simulation
147apps enhanced CPU power.
148
149With the ability to write pids directly to resource classes, it's just a
150matter of:
151
152 # echo pid > /mnt/network/<new_class>/tasks
153 (after some time)
154 # echo pid > /mnt/network/<orig_class>/tasks
155
156Without this ability, he would have to split the cgroup into
157multiple separate ones and then associate the new cgroups with the
158new resource classes.
159
160
161
1621.3 How are cgroups implemented ?
163---------------------------------
164
165Control Groups extends the kernel as follows:
166
167 - Each task in the system has a reference-counted pointer to a
168 css_set.
169
170 - A css_set contains a set of reference-counted pointers to
171 cgroup_subsys_state objects, one for each cgroup subsystem
172 registered in the system. There is no direct link from a task to
173 the cgroup of which it's a member in each hierarchy, but this
174 can be determined by following pointers through the
175 cgroup_subsys_state objects. This is because accessing the
176 subsystem state is something that's expected to happen frequently
177 and in performance-critical code, whereas operations that require a
178 task's actual cgroup assignments (in particular, moving between
179 cgroups) are less common. A linked list runs through the cg_list
180 field of each task_struct using the css_set, anchored at
181 css_set->tasks.
182
183 - A cgroup hierarchy filesystem can be mounted for browsing and
184 manipulation from user space.
185
186 - You can list all the tasks (by pid) attached to any cgroup.
187
188The implementation of cgroups requires a few, simple hooks
189into the rest of the kernel, none in performance critical paths:
190
191 - in init/main.c, to initialize the root cgroups and initial
192 css_set at system boot.
193
194 - in fork and exit, to attach and detach a task from its css_set.
195
196In addition a new file system, of type "cgroup" may be mounted, to
197enable browsing and modifying the cgroups presently known to the
198kernel. When mounting a cgroup hierarchy, you may specify a
199comma-separated list of subsystems to mount as the filesystem mount
200options. By default, mounting the cgroup filesystem attempts to
201mount a hierarchy containing all registered subsystems.
202
203If an active hierarchy with exactly the same set of subsystems already
204exists, it will be reused for the new mount. If no existing hierarchy
205matches, and any of the requested subsystems are in use in an existing
206hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
207is activated, associated with the requested subsystems.
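
The three mount outcomes described above (reuse, -EBUSY, new hierarchy) can
be modelled in a few lines. This is an illustrative user-space sketch, not
the kernel's code: each active hierarchy is reduced to a bitmask of the
subsystems bound to it, and cgroup_mount_decision(), MOUNT_REUSE and
MOUNT_NEW are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcomes of a successful mount request. */
enum mount_result { MOUNT_REUSE, MOUNT_NEW };

/*
 * Decide what a mount request for the subsystem set 'wanted' does, given
 * the subsystem masks of the currently active hierarchies.
 * Returns MOUNT_REUSE, MOUNT_NEW, or -EBUSY.
 */
int cgroup_mount_decision(const unsigned long *active, size_t n,
			  unsigned long wanted)
{
	unsigned long in_use = 0;
	size_t i;

	assert(active != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (active[i] == wanted)
			return MOUNT_REUSE;	/* exact match: reuse it */
		in_use |= active[i];
	}

	if (in_use & wanted)
		return -EBUSY;	/* a requested subsystem is already bound */

	return MOUNT_NEW;	/* activate a new hierarchy */
}
```

The exact-match test runs over all hierarchies before the overlap test, so
a request that matches one hierarchy exactly is reused even if the scan
passes other hierarchies first.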
208
209It's not currently possible to bind a new subsystem to an active
210cgroup hierarchy, or to unbind a subsystem from an active cgroup
211hierarchy. This may be possible in future, but is fraught with nasty
212error-recovery issues.
213
214When a cgroup filesystem is unmounted, if there are any
215child cgroups created below the top-level cgroup, that hierarchy
216will remain active even though unmounted; if there are no
217child cgroups then the hierarchy will be deactivated.
218
219No new system calls are added for cgroups - all support for
220querying and modifying cgroups is via this cgroup file system.
221
222Each task under /proc has an added file named 'cgroup' displaying,
223for each active hierarchy, the subsystem names and the cgroup name
224as the path relative to the root of the cgroup file system.
225
226Each cgroup is represented by a directory in the cgroup file system
227containing the following files describing that cgroup:
228
229 - tasks: list of tasks (by pid) attached to that cgroup
230 - notify_on_release flag: run /sbin/cgroup_release_agent on exit?
231
232Other subsystems such as cpusets may add additional files in each
233cgroup directory.
234
235New cgroups are created using the mkdir system call or shell
236command. The properties of a cgroup, such as its flags, are
237modified by writing to the appropriate file in that cgroup's
238directory, as listed above.
239
240The named hierarchical structure of nested cgroups allows partitioning
241a large system into nested, dynamically changeable, "soft-partitions".
242
243The attachment of each task, automatically inherited at fork by any
244children of that task, to a cgroup allows organizing the work load
245on a system into related sets of tasks. A task may be re-attached to
246any other cgroup, if allowed by the permissions on the necessary
247cgroup file system directories.
248
249When a task is moved from one cgroup to another, it gets a new
250css_set pointer - if there's an already existing css_set with the
251desired collection of cgroups then that group is reused, else a new
252css_set is allocated. Note that the current implementation uses a
253linear search to locate an appropriate existing css_set, so isn't
254very efficient. A future version will use a hash table for better
255performance.
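
The reuse-or-allocate step with its linear search can be sketched in user
space as follows. This is a simplified model, not the kernel
implementation: struct css_set_model, NSUBSYS and find_or_create_css_set()
are hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NSUBSYS 4	/* hypothetical number of registered subsystems */

/* Simplified stand-in for the kernel's css_set; not the real layout. */
struct css_set_model {
	const void *subsys[NSUBSYS];	/* one state pointer per subsystem */
	int refcount;
	struct css_set_model *next;	/* list of all css_sets */
};

/*
 * Return a css_set whose subsystem pointers equal 'want', reusing an
 * existing one when a linear scan finds it and allocating otherwise.
 * The caller gains one reference. Returns NULL if allocation fails.
 */
struct css_set_model *find_or_create_css_set(struct css_set_model **head,
					      const void *const want[NSUBSYS])
{
	struct css_set_model *cg;

	assert(head != NULL && want != NULL);

	for (cg = *head; cg; cg = cg->next) {
		if (!memcmp(cg->subsys, want, sizeof(cg->subsys))) {
			cg->refcount++;		/* reuse the existing set */
			return cg;
		}
	}

	cg = calloc(1, sizeof(*cg));
	if (!cg)
		return NULL;
	memcpy(cg->subsys, want, sizeof(cg->subsys));
	cg->refcount = 1;
	cg->next = *head;
	*head = cg;
	return cg;
}
```

The scan costs time proportional to the number of css_sets, which is the
inefficiency the text mentions; a hash keyed on the pointer array would
replace the loop.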
256
257To allow access from a cgroup to the css_sets (and hence tasks)
258that comprise it, a set of cg_cgroup_link objects form a lattice;
259each cg_cgroup_link is linked into a list of cg_cgroup_links for
260a single cgroup on its cont_link_list field, and a list of
261cg_cgroup_links for a single css_set on its cg_link_list.
262
263Thus the set of tasks in a cgroup can be listed by iterating over
264each css_set that references the cgroup, and sub-iterating over
265each css_set's task set.
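
The two-level iteration can be pictured with the following user-space
sketch. The struct names are simplified stand-ins chosen for this example
(they are not the kernel's definitions), and the lists are plain singly
linked lists rather than the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel structures. */
struct task_view {
	int pid;
	struct task_view *next;		/* next task using the same css_set */
};

struct css_set_view {
	struct task_view *tasks;	/* tasks sharing this css_set */
};

struct cg_link_view {
	struct css_set_view *cg;	/* css_set that references the cgroup */
	struct cg_link_view *next;	/* next link for the same cgroup */
};

struct cgroup_view {
	struct cg_link_view *links;	/* one link per referencing css_set */
};

/*
 * Visit every task in 'cgrp': the outer loop walks the css_sets that
 * reference the cgroup, the inner loop walks each css_set's tasks.
 * 'fn' may be NULL. Returns the number of tasks visited.
 */
size_t cgroup_for_each_task(const struct cgroup_view *cgrp,
			    void (*fn)(const struct task_view *, void *),
			    void *arg)
{
	const struct cg_link_view *link;
	const struct task_view *t;
	size_t count = 0;

	assert(cgrp != NULL);

	for (link = cgrp->links; link; link = link->next) {
		for (t = link->cg->tasks; t; t = t->next) {
			if (fn)
				fn(t, arg);
			count++;
		}
	}
	return count;
}
```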
266
267The use of a Linux virtual file system (vfs) to represent the
268cgroup hierarchy provides for a familiar permission and name space
269for cgroups, with a minimum of additional kernel code.
270
2711.4 What does notify_on_release do ?
272------------------------------------
273
274*** notify_on_release is disabled in the current patch set. It will be
275*** reactivated in a future patch in a less-intrusive manner
276
277If the notify_on_release flag is enabled (1) in a cgroup, then
278whenever the last task in the cgroup leaves (exits or attaches to
279some other cgroup) and the last child cgroup of that cgroup
280is removed, then the kernel runs the command specified by the contents
281of the "release_agent" file in that hierarchy's root directory,
282supplying the pathname (relative to the mount point of the cgroup
283file system) of the abandoned cgroup. This enables automatic
284removal of abandoned cgroups. The default value of
285notify_on_release in the root cgroup at system boot is disabled
286(0). The default value of other cgroups at creation is the current
287value of their parent's notify_on_release setting. The default value of
288a cgroup hierarchy's release_agent path is empty.
289
2901.5 How do I use cgroups ?
291--------------------------
292
293To start a new job that is to be contained within a cgroup, using
294the "cpuset" cgroup subsystem, the steps are something like:
295
296 1) mkdir /dev/cgroup
297 2) mount -t cgroup -ocpuset cpuset /dev/cgroup
298 3) Create the new cgroup by doing mkdir's and write's (or echo's) in
299 the /dev/cgroup virtual file system.
300 4) Start a task that will be the "founding father" of the new job.
301 5) Attach that task to the new cgroup by writing its pid to the
302 /dev/cgroup tasks file for that cgroup.
303 6) fork, exec or clone the job tasks from this founding father task.
304
305For example, the following sequence of commands will set up a cgroup
306named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
307and then start a subshell 'sh' in that cgroup:
308
309 mount -t cgroup cpuset -ocpuset /dev/cgroup
310 cd /dev/cgroup
311 mkdir Charlie
312 cd Charlie
313 /bin/echo 2-3 > cpus
314 /bin/echo 1 > mems
315 /bin/echo $$ > tasks
316 sh
317 # The subshell 'sh' is now running in cgroup Charlie
318 # The next line should display '/Charlie'
319 cat /proc/self/cgroup
320
3212. Usage Examples and Syntax
322============================
323
3242.1 Basic Usage
325---------------
326
327Creating, modifying, and using cgroups can be done through the cgroup
328virtual filesystem.
329
330To mount a cgroup hierarchy with all available subsystems, type:
331# mount -t cgroup xxx /dev/cgroup
332
333The "xxx" is not interpreted by the cgroup code, but will appear in
334/proc/mounts so may be any useful identifying string that you like.
335
336To mount a cgroup hierarchy with just the cpuset and numtasks
337subsystems, type:
338# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
339
340To change the set of subsystems bound to a mounted hierarchy, just
341remount with different options:
342
343# mount -o remount,cpuset,ns /dev/cgroup
344
345Note that changing the set of subsystems is currently only supported
346when the hierarchy consists of a single (root) cgroup. Supporting
347the ability to arbitrarily bind/unbind subsystems from an existing
348cgroup hierarchy is intended to be implemented in the future.
349
350Then under /dev/cgroup you can find a tree that corresponds to the
351tree of the cgroups in the system. For instance, /dev/cgroup
352is the cgroup that holds the whole system.
353
354If you want to create a new cgroup under /dev/cgroup:
355# cd /dev/cgroup
356# mkdir my_cgroup
357
358Now you want to do something with this cgroup.
359# cd my_cgroup
360
361In this directory you can find several files:
362# ls
363notify_on_release release_agent tasks
364(plus whatever files are added by the attached subsystems)
365
366Now attach your shell to this cgroup:
367# /bin/echo $$ > tasks
368
369You can also create cgroups inside your cgroup by using mkdir in this
370directory.
371# mkdir my_sub_cs
372
373To remove a cgroup, just use rmdir:
374# rmdir my_sub_cs
375
376This will fail if the cgroup is in use (has cgroups inside, or
377has processes attached, or is held alive by other subsystem-specific
378reference).
379
3802.2 Attaching processes
381-----------------------
382
383# /bin/echo PID > tasks
384
385Note that it is PID, not PIDs. You can only attach ONE task at a time.
386If you have several tasks to attach, you have to do it one after another:
387
388# /bin/echo PID1 > tasks
389# /bin/echo PID2 > tasks
390 ...
391# /bin/echo PIDn > tasks
392
3933. Kernel API
394=============
395
3963.1 Overview
397------------
398
399Each kernel subsystem that wants to hook into the generic cgroup
400system needs to create a cgroup_subsys object. This contains
401various methods, which are callbacks from the cgroup system, along
402with a subsystem id which will be assigned by the cgroup system.
403
404Other fields in the cgroup_subsys object include:
405
406- subsys_id: a unique array index for the subsystem, indicating which
407 entry in cgroup->subsys[] this subsystem should be
408 managing. Initialized by cgroup_register_subsys(); prior to this
409 it should be initialized to -1
410
411- hierarchy: an index indicating which hierarchy, if any, this
412 subsystem is currently attached to. If this is -1, then the
413 subsystem is not attached to any hierarchy, and all tasks should be
414 considered to be members of the subsystem's top_cgroup. It should
415 be initialized to -1.
416
417- name: should be initialized to a unique subsystem name prior to
418 calling cgroup_register_subsystem. Should be no longer than
419 MAX_CGROUP_TYPE_NAMELEN
420
421Each cgroup object created by the system has an array of pointers,
422indexed by subsystem id; each pointer is entirely managed by its
423subsystem; the generic cgroup code will never touch it.
424
4253.2 Synchronization
426-------------------
427
428There is a global mutex, cgroup_mutex, used by the cgroup
429system. This should be taken by anything that wants to modify a
430cgroup. It may also be taken to prevent cgroups from being
431modified, but more specific locks may be more appropriate in that
432situation.
433
434See kernel/cgroup.c for more details.
435
436Subsystems can take/release the cgroup_mutex via the functions
437cgroup_lock()/cgroup_unlock(), and can
438take/release the callback_mutex via the functions
439cgroup_lock()/cgroup_unlock().
440
441Accessing a task's cgroup pointer may be done in the following ways:
442- while holding cgroup_mutex
443- while holding the task's alloc_lock (via task_lock())
444- inside an rcu_read_lock() section via rcu_dereference()
445
4463.3 Subsystem API
447-----------------
448
449Each subsystem should:
450
451- add an entry in linux/cgroup_subsys.h
452- define a cgroup_subsys object called <name>_subsys
453
454Each subsystem may export the following methods. The only mandatory
455methods are create/destroy. Any others that are null are presumed to
456be successful no-ops.
457
458struct cgroup_subsys_state *create(struct cgroup *cont)
459LL=cgroup_mutex
460
461Called to create a subsystem state object for a cgroup. The
462subsystem should allocate its subsystem state object for the passed
463cgroup, returning a pointer to the new object on success or a
464negative error code. On success, the subsystem pointer should point to
465a structure of type cgroup_subsys_state (typically embedded in a
466larger subsystem-specific object), which will be initialized by the
467cgroup system. Note that this will be called at initialization to
468create the root subsystem state for this subsystem; this case can be
469identified by the passed cgroup object having a NULL parent (since
470it's the root of the hierarchy) and may be an appropriate place for
471initialization code.
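
The "embedded in a larger subsystem-specific object" pattern can be shown
outside the kernel with stand-in types. This is only a sketch under stated
assumptions: the two structs below are minimal mocks of struct cgroup and
struct cgroup_subsys_state, struct demo_state and the demo_* functions are
hypothetical, and demo_destroy() takes the state pointer directly because
the mock cgroup has no subsys[] array to look it up in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal mocks so the sketch compiles outside the kernel. */
struct cgroup { struct cgroup *parent; };
struct cgroup_subsys_state { struct cgroup *cgroup; };

/* Hypothetical subsystem-specific object embedding the generic state. */
struct demo_state {
	struct cgroup_subsys_state css;
	int is_root;
	long limit;
};

/* Recover the enclosing demo_state from its embedded css. */
struct demo_state *css_to_demo(struct cgroup_subsys_state *css)
{
	return (struct demo_state *)((char *)css -
				     offsetof(struct demo_state, css));
}

/*
 * Shape of a create() method: allocate the per-cgroup state and hand
 * back the embedded css. This mock returns NULL on allocation failure;
 * the kernel method reports failure with a negative error code instead.
 */
struct cgroup_subsys_state *demo_create(struct cgroup *cont)
{
	struct demo_state *st;

	assert(cont != NULL);

	st = calloc(1, sizeof(*st));
	if (!st)
		return NULL;

	/* A NULL parent marks the root: a spot for one-time setup. */
	st->is_root = (cont->parent == NULL);
	return &st->css;
}

/* Counterpart of destroy(): free the whole enclosing object. */
void demo_destroy(struct cgroup_subsys_state *css)
{
	free(css_to_demo(css));
}
```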
472
473void destroy(struct cgroup *cont)
474LL=cgroup_mutex
475
476The cgroup system is about to destroy the passed cgroup; the
477subsystem should do any necessary cleanup.
478
479int can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
480 struct task_struct *task)
481LL=cgroup_mutex
482
483Called prior to moving a task into a cgroup; if the subsystem
484returns an error, this will abort the attach operation. If a NULL
485task is passed, then a successful result indicates that *any*
486unspecified task can be moved into the cgroup. Note that this isn't
487called on a fork. If this method returns 0 (success) then this should
488remain valid while the caller holds cgroup_mutex.
489
490void attach(struct cgroup_subsys *ss, struct cgroup *cont,
491 struct cgroup *old_cont, struct task_struct *task)
492LL=cgroup_mutex
493
494
495Called after the task has been attached to the cgroup, to allow any
496post-attachment activity that requires memory allocations or blocking.
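
The interplay of can_attach() and attach(), including the rule that NULL
methods count as successful no-ops, can be modelled in user space. This is
a simplified sketch, not the kernel's attach path: the structs are cut-down
stand-ins, a cgroup is reduced to an integer id, and move_task() and
refuse_init() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct task_model {
	int pid;
	int group;	/* stand-in for the task's current cgroup */
};

/* Cut-down method table; NULL entries are successful no-ops. */
struct subsys_model {
	int  (*can_attach)(const struct task_model *task, int new_group);
	void (*attach)(const struct task_model *task, int new_group,
		       int old_group);
};

/* Example veto: refuse to move pid 1 anywhere. */
int refuse_init(const struct task_model *task, int new_group)
{
	(void)new_group;
	return task->pid == 1 ? -EPERM : 0;
}

/*
 * Move 'task' to 'new_group'. Every subsystem may veto first; only if
 * none does is the task moved and the attach() methods run.
 * Returns 0 on success or the first can_attach() error.
 */
int move_task(struct subsys_model *const *ss, size_t n,
	      struct task_model *task, int new_group)
{
	size_t i;
	int old_group, ret;

	assert(ss != NULL && task != NULL);

	for (i = 0; i < n; i++) {
		if (!ss[i]->can_attach)
			continue;	/* NULL method: presumed success */
		ret = ss[i]->can_attach(task, new_group);
		if (ret)
			return ret;	/* abort; the task has not moved */
	}

	old_group = task->group;
	task->group = new_group;

	for (i = 0; i < n; i++)
		if (ss[i]->attach)
			ss[i]->attach(task, new_group, old_group);

	return 0;
}
```

Keeping the veto pass separate from the move means a refused attach leaves
the task exactly where it was.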
497
498void fork(struct cgroup_subsys *ss, struct task_struct *task)
499LL=callback_mutex, maybe read_lock(tasklist_lock)
500
501Called when a task is forked into a cgroup. Also called during
502registration for all existing tasks.
503
504void exit(struct cgroup_subsys *ss, struct task_struct *task)
505LL=callback_mutex
506
507Called during task exit.
508
509int populate(struct cgroup_subsys *ss, struct cgroup *cont)
510LL=none
511
512Called after creation of a cgroup to allow a subsystem to populate
513the cgroup directory with file entries. The subsystem should make
514calls to cgroup_add_file() with objects of type cftype (see
515include/linux/cgroup.h for details). Note that although this
516method can return an error code, the error code is currently not
517always handled well.
518
519void post_clone(struct cgroup_subsys *ss, struct cgroup *cont)
520
521Called at the end of cgroup_clone() to do any parameter
522initialization which might be required before a task could attach. For
523example in cpusets, no task may attach before 'cpus' and 'mems' are set
524up.
525
526void bind(struct cgroup_subsys *ss, struct cgroup *root)
527LL=callback_mutex
528
529Called when a cgroup subsystem is rebound to a different hierarchy
530and root cgroup. Currently this will only involve movement between
531the default hierarchy (which never has sub-cgroups) and a hierarchy
532that is being created/destroyed (and hence has no sub-cgroups).
533
5344. Questions
535============
536
537Q: what's up with this '/bin/echo' ?
538A: bash's builtin 'echo' command does not check calls to write() against
539 errors. If you use it in the cgroup file system, you won't be
540 able to tell whether a command succeeded or failed.
541
542Q: When I attach processes, only the first one on the line really gets attached!
543A: We can only return one error code per call to write(). So you should
544 put only ONE pid in each write.
545
diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index b6d24c22274b..a741f658a3c9 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -220,7 +220,9 @@ A: The following happen, listed in no particular order :-)
220 CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the 220 CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the
221 CPU is being offlined while tasks are frozen due to a suspend operation in 221 CPU is being offlined while tasks are frozen due to a suspend operation in
222 progress 222 progress
223- All process is migrated away from this outgoing CPU to a new CPU 223- All processes are migrated away from this outgoing CPU to new CPUs.
224 The new CPU is chosen from each process' current cpuset, which may be
225 a subset of all online CPUs.
224- All interrupts targeted to this CPU is migrated to a new CPU 226- All interrupts targeted to this CPU is migrated to a new CPU
225- timers/bottom half/task lets are also migrated to a new CPU 227- timers/bottom half/task lets are also migrated to a new CPU
226- Once all services are migrated, kernel calls an arch specific routine 228- Once all services are migrated, kernel calls an arch specific routine
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ec9de6917f01..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
7Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. 7Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
8Modified by Paul Jackson <pj@sgi.com> 8Modified by Paul Jackson <pj@sgi.com>
9Modified by Christoph Lameter <clameter@sgi.com> 9Modified by Christoph Lameter <clameter@sgi.com>
10Modified by Paul Menage <menage@google.com>
10 11
11CONTENTS: 12CONTENTS:
12========= 13=========
@@ -16,9 +17,9 @@ CONTENTS:
16 1.2 Why are cpusets needed ? 17 1.2 Why are cpusets needed ?
17 1.3 How are cpusets implemented ? 18 1.3 How are cpusets implemented ?
18 1.4 What are exclusive cpusets ? 19 1.4 What are exclusive cpusets ?
19 1.5 What does notify_on_release do ? 20 1.5 What is memory_pressure ?
20 1.6 What is memory_pressure ? 21 1.6 What is memory spread ?
21 1.7 What is memory spread ? 22 1.7 What is sched_load_balance ?
22 1.8 How do I use cpusets ? 23 1.8 How do I use cpusets ?
232. Usage Examples and Syntax 242. Usage Examples and Syntax
24 2.1 Basic Usage 25 2.1 Basic Usage
@@ -44,18 +45,19 @@ hierarchy visible in a virtual file system. These are the essential
44hooks, beyond what is already present, required to manage dynamic 45hooks, beyond what is already present, required to manage dynamic
45job placement on large systems. 46job placement on large systems.
46 47
47Each task has a pointer to a cpuset. Multiple tasks may reference 48Cpusets use the generic cgroup subsystem described in
48the same cpuset. Requests by a task, using the sched_setaffinity(2) 49Documentation/cgroups.txt.
49system call to include CPUs in its CPU affinity mask, and using the 50
50mbind(2) and set_mempolicy(2) system calls to include Memory Nodes 51Requests by a task, using the sched_setaffinity(2) system call to
51in its memory policy, are both filtered through that tasks cpuset, 52include CPUs in its CPU affinity mask, and using the mbind(2) and
52filtering out any CPUs or Memory Nodes not in that cpuset. The 53set_mempolicy(2) system calls to include Memory Nodes in its memory
53scheduler will not schedule a task on a CPU that is not allowed in 54policy, are both filtered through that tasks cpuset, filtering out any
54its cpus_allowed vector, and the kernel page allocator will not 55CPUs or Memory Nodes not in that cpuset. The scheduler will not
55allocate a page on a node that is not allowed in the requesting tasks 56schedule a task on a CPU that is not allowed in its cpus_allowed
56mems_allowed vector. 57vector, and the kernel page allocator will not allocate a page on a
57 58node that is not allowed in the requesting tasks mems_allowed vector.
58User level code may create and destroy cpusets by name in the cpuset 59
60User level code may create and destroy cpusets by name in the cgroup
59virtual file system, manage the attributes and permissions of these 61virtual file system, manage the attributes and permissions of these
60cpusets and which CPUs and Memory Nodes are assigned to each cpuset, 62cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
61specify and query to which cpuset a task is assigned, and list the 63specify and query to which cpuset a task is assigned, and list the
@@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows:
115 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the 117 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
116 kernel. 118 kernel.
117 - Each task in the system is attached to a cpuset, via a pointer 119 - Each task in the system is attached to a cpuset, via a pointer
118 in the task structure to a reference counted cpuset structure. 120 in the task structure to a reference counted cgroup structure.
119 - Calls to sched_setaffinity are filtered to just those CPUs 121 - Calls to sched_setaffinity are filtered to just those CPUs
120 allowed in that tasks cpuset. 122 allowed in that tasks cpuset.
121 - Calls to mbind and set_mempolicy are filtered to just 123 - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths:
145 - in page_alloc.c, to restrict memory to allowed nodes. 147 - in page_alloc.c, to restrict memory to allowed nodes.
146 - in vmscan.c, to restrict page recovery to the current cpuset. 148 - in vmscan.c, to restrict page recovery to the current cpuset.
147 149
148In addition a new file system, of type "cpuset" may be mounted, 150You should mount the "cgroup" filesystem type in order to enable
149typically at /dev/cpuset, to enable browsing and modifying the cpusets 151browsing and modifying the cpusets presently known to the kernel. No
150presently known to the kernel. No new system calls are added for 152new system calls are added for cpusets - all support for querying and
151cpusets - all support for querying and modifying cpusets is via 153modifying cpusets is via this cpuset file system.
152this cpuset file system.
153
154Each task under /proc has an added file named 'cpuset', displaying
155the cpuset name, as the path relative to the root of the cpuset file
156system.
157 154
158The /proc/<pid>/status file for each task has two added lines, 155The /proc/<pid>/status file for each task has two added lines,
159displaying the tasks cpus_allowed (on which CPUs it may be scheduled) 156displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -163,16 +160,15 @@ in the format seen in the following example:
163 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff 160 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
164 Mems_allowed: ffffffff,ffffffff 161 Mems_allowed: ffffffff,ffffffff
165 162
166Each cpuset is represented by a directory in the cpuset file system 163Each cpuset is represented by a directory in the cgroup file system
167containing the following files describing that cpuset: 164containing (on top of the standard cgroup files) the following
165files describing that cpuset:
168 166
169 - cpus: list of CPUs in that cpuset 167 - cpus: list of CPUs in that cpuset
170 - mems: list of Memory Nodes in that cpuset 168 - mems: list of Memory Nodes in that cpuset
171 - memory_migrate flag: if set, move pages to cpusets nodes 169 - memory_migrate flag: if set, move pages to cpusets nodes
172 - cpu_exclusive flag: is cpu placement exclusive? 170 - cpu_exclusive flag: is cpu placement exclusive?
173 - mem_exclusive flag: is memory placement exclusive? 171 - mem_exclusive flag: is memory placement exclusive?
174 - tasks: list of tasks (by pid) attached to that cpuset
175 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
176 - memory_pressure: measure of how much paging pressure in cpuset 172 - memory_pressure: measure of how much paging pressure in cpuset
177 173
178In addition, the root cpuset only has the following file: 174In addition, the root cpuset only has the following file:
@@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken
237outside even a mem_exclusive cpuset. 233outside even a mem_exclusive cpuset.
238 234
239 235
2401.5 What does notify_on_release do ? 2361.5 What is memory_pressure ?
241------------------------------------
242
243If the notify_on_release flag is enabled (1) in a cpuset, then whenever
244the last task in the cpuset leaves (exits or attaches to some other
245cpuset) and the last child cpuset of that cpuset is removed, then
246the kernel runs the command /sbin/cpuset_release_agent, supplying the
247pathname (relative to the mount point of the cpuset file system) of the
248abandoned cpuset. This enables automatic removal of abandoned cpusets.
249The default value of notify_on_release in the root cpuset at system
250boot is disabled (0). The default value of other cpusets at creation
251is the current value of their parents notify_on_release setting.
252
253
2541.6 What is memory_pressure ?
255----------------------------- 237-----------------------------
256The memory_pressure of a cpuset provides a simple per-cpuset metric 238The memory_pressure of a cpuset provides a simple per-cpuset metric
257of the rate that the tasks in a cpuset are attempting to free up in 239of the rate that the tasks in a cpuset are attempting to free up in
@@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
308times 1000. 290times 1000.
309 291
310 292
3111.7 What is memory spread ? 2931.6 What is memory spread ?
312--------------------------- 294---------------------------
313There are two boolean flag files per cpuset that control where the 295There are two boolean flag files per cpuset that control where the
314kernel allocates pages for the file system buffers and related in 296kernel allocates pages for the file system buffers and related in
@@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the
378data set, the memory allocation across the nodes in the jobs cpuset 360data set, the memory allocation across the nodes in the jobs cpuset
379can become very uneven. 361can become very uneven.
380 362
3631.7 What is sched_load_balance ?
364--------------------------------
365
366The kernel scheduler (kernel/sched.c) automatically load balances
367tasks. If one CPU is underutilized, kernel code running on that
368CPU will look for tasks on other more overloaded CPUs and move those
369tasks to itself, within the constraints of such placement mechanisms
370as cpusets and sched_setaffinity.
371
372The algorithmic cost of load balancing and its impact on key shared
373kernel data structures such as the task list increases more than
374linearly with the number of CPUs being balanced. So the scheduler
375has support to partition the systems CPUs into a number of sched
376domains such that it only load balances within each sched domain.
377Each sched domain covers some subset of the CPUs in the system;
378no two sched domains overlap; some CPUs might not be in any sched
379domain and hence won't be load balanced.
380
381Put simply, it costs less to balance between two smaller sched domains
382than one big one, but doing so means that overloads in one of the
383two domains won't be load balanced to the other one.
384
385By default, there is one sched domain covering all CPUs, except those
386marked isolated using the kernel boot time "isolcpus=" argument.
387
388This default load balancing across all CPUs is not well suited for
389the following two situations:
390 1) On large systems, load balancing across many CPUs is expensive.
391 If the system is managed using cpusets to place independent jobs
392 on separate sets of CPUs, full load balancing is unnecessary.
393 2) Systems supporting realtime on some CPUs need to minimize
394 system overhead on those CPUs, including avoiding task load
395 balancing if that is not needed.
396
397When the per-cpuset flag "sched_load_balance" is enabled (the default
398setting), it requests that all the CPUs in that cpusets allowed 'cpus'
399be contained in a single sched domain, ensuring that load balancing
400can move a task (not otherwise pinned, as by sched_setaffinity)
401from any CPU in that cpuset to any other.
402
403When the per-cpuset flag "sched_load_balance" is disabled, then the
404scheduler will avoid load balancing across the CPUs in that cpuset,
405--except-- in so far as is necessary because some overlapping cpuset
406has "sched_load_balance" enabled.
407
408So, for example, if the top cpuset has the flag "sched_load_balance"
409enabled, then the scheduler will have one sched domain covering all
410CPUs, and the setting of the "sched_load_balance" flag in any other
411cpusets won't matter, as we're already fully load balancing.
412
413Therefore in the above two situations, the top cpuset flag
414"sched_load_balance" should be disabled, and only some of the smaller,
415child cpusets have this flag enabled.
416
417When doing this, you don't usually want to leave any unpinned tasks in
418the top cpuset that might use non-trivial amounts of CPU, as such tasks
419may be artificially constrained to some subset of CPUs, depending on
420the particulars of this flag setting in descendent cpusets. Even if
421such a task could use spare CPU cycles in some other CPUs, the kernel
422scheduler might not consider the possibility of load balancing that
423task to that underused CPU.
424
425Of course, tasks pinned to a particular CPU can be left in a cpuset
426that disables "sched_load_balance" as those tasks aren't going anywhere
427else anyway.
428
429There is an impedance mismatch here, between cpusets and sched domains.
430Cpusets are hierarchical and nest. Sched domains are flat; they don't
431overlap and each CPU is in at most one sched domain.
432
433It is necessary for sched domains to be flat because load balancing
434across partially overlapping sets of CPUs would risk unstable dynamics
435that would be beyond our understanding. So if each of two partially
436overlapping cpusets enables the flag 'sched_load_balance', then we
437form a single sched domain that is a superset of both. We won't move
438a task to a CPU outside its cpuset, but the scheduler load balancing
439code might waste some compute cycles considering that possibility.
440
441This mismatch is why there is not a simple one-to-one relation
442between which cpusets have the flag "sched_load_balance" enabled,
443and the sched domain configuration. If a cpuset enables the flag, it
444will get balancing across all its CPUs, but if it disables the flag,
445it will only be assured of no load balancing if no other overlapping
446cpuset enables the flag.
447
448If two cpusets have partially overlapping 'cpus' allowed, and only
449one of them has this flag enabled, then the other may find its
450tasks only partially load balanced, just on the overlapping CPUs.
451This is just the general case of the top_cpuset example given a few
452paragraphs above. In the general case, as in the top cpuset case,
453don't leave tasks that might use non-trivial amounts of CPU in
454such partially load balanced cpusets, as they may be artificially
455constrained to some subset of the CPUs allowed to them, for lack of
456load balancing to the other CPUs.
457
4581.7.1 sched_load_balance implementation details.
459------------------------------------------------
460
461The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
462to most cpuset flags.) When enabled for a cpuset, the kernel will
463ensure that it can load balance across all the CPUs in that cpuset
464(makes sure that all the CPUs in the cpus_allowed of that cpuset are
465in the same sched domain.)
466
467If two overlapping cpusets both have 'sched_load_balance' enabled,
468then they will be (must be) both in the same sched domain.
469
470If, as is the default, the top cpuset has 'sched_load_balance' enabled,
471then by the above that means there is a single sched domain covering
472the whole system, regardless of any other cpuset settings.
473
474The kernel commits to user space that it will avoid load balancing
475where it can. It will pick as fine a granularity partition of sched
476domains as it can while still providing load balancing for any set
477of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
478
479The internal kernel cpuset to scheduler interface passes from the
480cpuset code to the scheduler code a partition of the load balanced
481CPUs in the system. This partition is a set of subsets (represented
482as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
483the CPUs that must be load balanced.
484
485Whenever the 'sched_load_balance' flag changes, or CPUs come or go
486from a cpuset with this flag enabled, or a cpuset with this flag
487enabled is removed, the cpuset code builds a new such partition and
488passes it to the scheduler sched domain setup code, to have the sched
489domains rebuilt as necessary.
490
491This partition exactly defines what sched domains the scheduler should
492setup - one sched domain for each element (cpumask_t) in the partition.
493
494The scheduler remembers the currently active sched domain partitions.
495When the scheduler routine partition_sched_domains() is invoked from
496the cpuset code to update these sched domains, it compares the new
497partition requested with the current, and updates its sched domains,
498removing the old and adding the new, for each change.
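
The partition described above can be illustrated with a short user-space
sketch. This is not the kernel's code: the function name build_partition and
the use of Python sets in place of cpumask_t are assumptions made purely for
illustration. It merges the 'cpus' of every cpuset that has
sched_load_balance enabled until the resulting sets are pairwise disjoint,
which is the property the text requires of the partition.

```python
def build_partition(balanced_cpusets):
    """Merge overlapping CPU sets into pairwise disjoint sched domains.

    balanced_cpusets: iterable of CPU-number sets, one per cpuset that
    has sched_load_balance enabled (hypothetical input format).
    """
    domains = []
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue
        remaining = []
        for dom in domains:
            if dom & merged:
                # Overlapping cpusets must share one sched domain.
                merged |= dom
            else:
                remaining.append(dom)
        remaining.append(merged)
        domains = remaining
    # Sorted lists make the result easy to compare and print.
    return sorted(sorted(dom) for dom in domains)
```

With the top cpuset's flag enabled, every CPU ends up in one domain no matter
what the child cpusets request; with it disabled, disjoint children each get
their own domain.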
381 499
3821.8 How do I use cpusets ? 5001.8 How do I use cpusets ?
383-------------------------- 501--------------------------
@@ -469,7 +587,7 @@ than stress the kernel.
469To start a new job that is to be contained within a cpuset, the steps are: 587To start a new job that is to be contained within a cpuset, the steps are:
470 588
471 1) mkdir /dev/cpuset 589 1) mkdir /dev/cpuset
472 2) mount -t cpuset none /dev/cpuset 590 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
473 3) Create the new cpuset by doing mkdir's and write's (or echo's) in 591 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
474 the /dev/cpuset virtual file system. 592 the /dev/cpuset virtual file system.
475 4) Start a task that will be the "founding father" of the new job. 593 4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset
481named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, 599named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
482and then start a subshell 'sh' in that cpuset: 600and then start a subshell 'sh' in that cpuset:
483 601
484 mount -t cpuset none /dev/cpuset 602 mount -t cgroup -ocpuset cpuset /dev/cpuset
485 cd /dev/cpuset 603 cd /dev/cpuset
486 mkdir Charlie 604 mkdir Charlie
487 cd Charlie 605 cd Charlie
@@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
513virtual filesystem. 631virtual filesystem.
514 632
515To mount it, type: 633To mount it, type:
516# mount -t cpuset none /dev/cpuset 634# mount -t cgroup -o cpuset cpuset /dev/cpuset
517 635
518Then under /dev/cpuset you can find a tree that corresponds to the 636Then under /dev/cpuset you can find a tree that corresponds to the
519tree of the cpusets in the system. For instance, /dev/cpuset 637tree of the cpusets in the system. For instance, /dev/cpuset
@@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir:
556This will fail if the cpuset is in use (has cpusets inside, or has 674This will fail if the cpuset is in use (has cpusets inside, or has
557processes attached). 675processes attached).
558 676
677Note that for legacy reasons, the "cpuset" filesystem exists as a
678wrapper around the cgroup filesystem.
679
680The command
681
682mount -t cpuset X /dev/cpuset
683
684is equivalent to
685
686mount -t cgroup -ocpuset X /dev/cpuset
687echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
688
5592.2 Adding/removing cpus 6892.2 Adding/removing cpus
560------------------------ 690------------------------
561 691
diff --git a/Documentation/device-mapper/dm-uevent.txt b/Documentation/device-mapper/dm-uevent.txt
new file mode 100644
index 000000000000..07edbd85c714
--- /dev/null
+++ b/Documentation/device-mapper/dm-uevent.txt
@@ -0,0 +1,97 @@
1The device-mapper uevent code adds the capability to device-mapper to create
2and send kobject uevents (uevents). Previously device-mapper events were only
3available through the ioctl interface. The advantage of the uevents interface
4is that the event contains environment attributes providing increased context for
5the event, avoiding the need to query the state of the device-mapper device after
6the event is received.
7
8There are two functions currently for device-mapper events. The first function
9listed creates the event and the second function sends the event(s).
10
11void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
12 const char *path, unsigned nr_valid_paths)
13
14void dm_send_uevents(struct list_head *events, struct kobject *kobj)
15
16
17The variables added to the uevent environment are:
18
19Variable Name: DM_TARGET
20Uevent Action(s): KOBJ_CHANGE
21Type: string
22Description:
23Value: Name of device-mapper target that generated the event.
24
25Variable Name: DM_ACTION
26Uevent Action(s): KOBJ_CHANGE
27Type: string
28Description:
29Value: Device-mapper specific action that caused the uevent action.
30 PATH_FAILED - A path has failed.
31 PATH_REINSTATED - A path has been reinstated.
32
33Variable Name: DM_SEQNUM
34Uevent Action(s): KOBJ_CHANGE
35Type: unsigned integer
36Description: A sequence number for this specific device-mapper device.
37Value: Valid unsigned integer range.
38
39Variable Name: DM_PATH
40Uevent Action(s): KOBJ_CHANGE
41Type: string
42Description: Major and minor number of the path device pertaining to this
43event.
44Value: Path name in the form of "Major:Minor"
45
46Variable Name: DM_NR_VALID_PATHS
47Uevent Action(s): KOBJ_CHANGE
48Type: unsigned integer
49Description:
50Value: Valid unsigned integer range.
51
52Variable Name: DM_NAME
53Uevent Action(s): KOBJ_CHANGE
54Type: string
55Description: Name of the device-mapper device.
56Value: Name
57
58Variable Name: DM_UUID
59Uevent Action(s): KOBJ_CHANGE
60Type: string
61Description: UUID of the device-mapper device.
62Value: UUID. (Empty string if there isn't one.)
63
64An example of the uevents generated as captured by udevmonitor is shown
65below.
66
671.) Path failure.
68UEVENT[1192521009.711215] change@/block/dm-3
69ACTION=change
70DEVPATH=/block/dm-3
71SUBSYSTEM=block
72DM_TARGET=multipath
73DM_ACTION=PATH_FAILED
74DM_SEQNUM=1
75DM_PATH=8:32
76DM_NR_VALID_PATHS=0
77DM_NAME=mpath2
78DM_UUID=mpath-35333333000002328
79MINOR=3
80MAJOR=253
81SEQNUM=1130
82
832.) Path reinstate.
84UEVENT[1192521132.989927] change@/block/dm-3
85ACTION=change
86DEVPATH=/block/dm-3
87SUBSYSTEM=block
88DM_TARGET=multipath
89DM_ACTION=PATH_REINSTATED
90DM_SEQNUM=2
91DM_PATH=8:32
92DM_NR_VALID_PATHS=1
93DM_NAME=mpath2
94DM_UUID=mpath-35333333000002328
95MINOR=3
96MAJOR=253
97SEQNUM=1131
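
As a rough illustration of how a user-space consumer might use these
variables, the sketch below parses one captured event into a dictionary and
extracts the path details. The helper names parse_uevent and
describe_path_event are hypothetical, and the input is assumed to be the
plain KEY=VALUE text shown in the captures above.

```python
def parse_uevent(lines):
    """Collect the KEY=VALUE lines of one captured uevent into a dict."""
    env = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines without '=', such as the "UEVENT[...]" header
            env[key] = value
    return env


def describe_path_event(env):
    """Return (action, major, minor, valid_paths) for a dm path event."""
    major, minor = (int(part) for part in env["DM_PATH"].split(":"))
    return env["DM_ACTION"], major, minor, int(env["DM_NR_VALID_PATHS"])


# Subset of the "Path failure" capture shown above.
captured = [
    "UEVENT[1192521009.711215] change@/block/dm-3",
    "ACTION=change",
    "DM_TARGET=multipath",
    "DM_ACTION=PATH_FAILED",
    "DM_SEQNUM=1",
    "DM_PATH=8:32",
    "DM_NR_VALID_PATHS=0",
    "DM_NAME=mpath2",
]
event = parse_uevent(captured)
```

Because the event already carries DM_PATH and DM_NR_VALID_PATHS, the consumer
does not need a follow-up ioctl to learn which path changed state.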
diff --git a/Documentation/input/input-programming.txt b/Documentation/input/input-programming.txt
index d9d523099bb7..4d932dc66098 100644
--- a/Documentation/input/input-programming.txt
+++ b/Documentation/input/input-programming.txt
@@ -42,8 +42,8 @@ static int __init button_init(void)
42 goto err_free_irq; 42 goto err_free_irq;
43 } 43 }
44 44
45 button_dev->evbit[0] = BIT(EV_KEY); 45 button_dev->evbit[0] = BIT_MASK(EV_KEY);
46 button_dev->keybit[LONG(BTN_0)] = BIT(BTN_0); 46 button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);
47 47
48 error = input_register_device(button_dev); 48 error = input_register_device(button_dev);
49 if (error) { 49 if (error) {
@@ -217,14 +217,15 @@ If you don't need absfuzz and absflat, you can set them to zero, which mean
217that the thing is precise and always returns to exactly the center position 217that the thing is precise and always returns to exactly the center position
218(if it has any). 218(if it has any).
219 219
2201.4 NBITS(), LONG(), BIT() 2201.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK()
221~~~~~~~~~~~~~~~~~~~~~~~~~~ 221~~~~~~~~~~~~~~~~~~~~~~~~~~
222 222
223These three macros from input.h help some bitfield computations: 223These three macros from bitops.h help some bitfield computations:
224 224
225 NBITS(x) - returns the length of a bitfield array in longs for x bits 225 BITS_TO_LONGS(x) - returns the length of a bitfield array in longs for
226 LONG(x) - returns the index in the array in longs for bit x 226 x bits
227 BIT(x) - returns the index in a long for bit x 227 BIT_WORD(x) - returns the index in the array in longs for bit x
228 BIT_MASK(x) - returns the bit mask within a long for bit x
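
The arithmetic behind the three macros can be modelled in a few lines. This
is a sketch in Python rather than the kernel's C, and BITS_PER_LONG = 64 is
an assumption (it is 32 on 32-bit architectures); the lower-case function
names are illustrative stand-ins for the macros.

```python
BITS_PER_LONG = 64  # assumption: 64-bit longs


def bits_to_longs(nbits):
    """Number of longs needed to hold nbits bits (rounds up)."""
    return (nbits + BITS_PER_LONG - 1) // BITS_PER_LONG


def bit_word(nr):
    """Index of the long that contains bit nr."""
    return nr // BITS_PER_LONG


def bit_mask(nr):
    """Mask selecting bit nr inside its long."""
    return 1 << (nr % BITS_PER_LONG)


def set_bit_in_array(nr, words):
    """Mirror of 'keybit[BIT_WORD(nr)] = BIT_MASK(nr)', OR-ing the bit in."""
    words[bit_word(nr)] |= bit_mask(nr)
    return words
```

For example, with 64-bit longs bit 70 lives in word 1 at position 6, so
bit_word(70) is 1 and bit_mask(70) is 64.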
228 229
2291.5 The id* and name fields 2301.5 The id* and name fields
230~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 231~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt
index fe8b0c4892cf..616043a6da99 100644
--- a/Documentation/kbuild/kconfig-language.txt
+++ b/Documentation/kbuild/kconfig-language.txt
@@ -77,7 +77,12 @@ applicable everywhere (see syntax).
77 Optionally, dependencies only for this default value can be added with 77 Optionally, dependencies only for this default value can be added with
78 "if". 78 "if".
79 79
80- dependencies: "depends on"/"requires" <expr> 80- type definition + default value:
81 "def_bool"/"def_tristate" <expr> ["if" <expr>]
82 This is a shorthand notation for a type definition plus a value.
83 Optionally dependencies for this default value can be added with "if".
84
85- dependencies: "depends on" <expr>
81 This defines a dependency for this menu entry. If multiple 86 This defines a dependency for this menu entry. If multiple
82 dependencies are defined, they are connected with '&&'. Dependencies 87 dependencies are defined, they are connected with '&&'. Dependencies
83 are applied to all other options within this menu entry (which also 88 are applied to all other options within this menu entry (which also
@@ -289,3 +294,10 @@ source:
289 "source" <prompt> 294 "source" <prompt>
290 295
291This reads the specified configuration file. This file is always parsed. 296This reads the specified configuration file. This file is always parsed.
297
298mainmenu:
299
300 "mainmenu" <prompt>
301
302This sets the config program's title bar if the config program chooses
303to use it.
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index f099b814d383..6166e2d7da76 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -518,6 +518,28 @@ more details, with real examples.
518 In this example for a specific GCC version the build will error out explaining 518 In this example for a specific GCC version the build will error out explaining
519 to the user why it stops. 519 to the user why it stops.
520 520
521 cc-cross-prefix
522 cc-cross-prefix is used to check if there exists a $(CC) in path with
523 one of the listed prefixes. The first prefix where there exists a
524 prefix$(CC) in the PATH is returned - and if no prefix$(CC) is found
525 then nothing is returned.
526 Additional prefixes are separated by a single space in the
527 call of cc-cross-prefix.
528 This functionality is useful for architecture Makefiles that try
529 to set CROSS_COMPILE to well-known values but may have several
530 values to select between.
531 It is recommended only to try to set CROSS_COMPILE if it is a cross
532 build (host arch is different from target arch). And if CROSS_COMPILE
533 is already set then leave it with the old value.
534
535 Example:
536 #arch/m68k/Makefile
537 ifneq ($(SUBARCH),$(ARCH))
538 ifeq ($(CROSS_COMPILE),)
539 CROSS_COMPILE := $(call cc-cross-prefix, m68k-linux-gnu-)
540 endif
541 endif
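
The lookup that cc-cross-prefix performs can be sketched outside of make as
follows. This is only a model of the behaviour described above, not the
kbuild implementation: the function name cc_cross_prefix, the default
compiler name "gcc" and the injectable 'which' parameter are all assumptions
made for illustration.

```python
import shutil


def cc_cross_prefix(prefixes, cc="gcc", which=shutil.which):
    """Return the first prefix for which prefix+cc is found in PATH.

    prefixes: space-separated candidate prefixes, tried in order.
    Returns an empty string when none of the candidates is found.
    """
    for prefix in prefixes.split():
        if which(prefix + cc):
            return prefix
    return ""
```

Passing a custom 'which' callable makes the function easy to exercise without
having any cross toolchain installed.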
542
521=== 4 Host Program support 543=== 4 Host Program support
522 544
523Kbuild supports building executables on the host for use during the 545Kbuild supports building executables on the host for use during the
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 1b37b28cc234..d0ac72cc19ff 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -231,6 +231,32 @@ Dump-capture kernel config options (Arch Dependent, ia64)
231 any space below the alignment point will be wasted. 231 any space below the alignment point will be wasted.
232 232
233 233
234Extended crashkernel syntax
235===========================
236
237While the "crashkernel=size[@offset]" syntax is sufficient for most
238configurations, sometimes it's handy to have the reserved memory dependent
239on the value of System RAM -- that's mostly for distributors that pre-setup
240the kernel command line to avoid an unbootable system after some memory has
241been removed from the machine.
242
243The syntax is:
244
245 crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
246 range=start-[end]
247
248For example:
249
250 crashkernel=512M-2G:64M,2G-:128M
251
252This would mean:
253
254 1) if the RAM is smaller than 512M, then don't reserve anything
255 (this is the "rescue" case)
256 2) if the RAM size is between 512M and 2G, then reserve 64M
257 3) if the RAM size is larger than 2G, then reserve 128M
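
The selection rule can be sketched in user space as follows. This is not the
kernel's parser: the names parse_size and crashkernel_size are hypothetical,
the optional @offset is simply ignored, and treating each range as including
its start and excluding its end is an assumption, since the text above does
not say how exact boundary values are handled.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert 'amount[KMG]' into a byte count."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def crashkernel_size(spec, system_ram):
    """Return the bytes to reserve for a given amount of system RAM."""
    ranges = spec.split("@", 1)[0]  # the optional @offset is ignored here
    for entry in ranges.split(","):
        mem_range, size = entry.split(":")
        start, _, end = mem_range.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None  # "2G-" is open-ended
        # Assumption: start is inclusive, end is exclusive.
        if system_ram >= low and (high is None or system_ram < high):
            return parse_size(size)
    return 0  # no range matched: reserve nothing
```

For the example above, a machine with 1G of RAM falls into the first range
and gets 64M reserved, while a machine with 256M matches no range and
reserves nothing.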
258
259
234Boot into System Kernel 260Boot into System Kernel
235======================= 261=======================
236 262
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 189df0bcab99..7bf6bd2f530b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -431,8 +431,10 @@ and is between 256 and 4096 characters. It is defined in the file
431 over the 8254 in addition to over the IO-APIC. The 431 over the 8254 in addition to over the IO-APIC. The
432 kernel tries to set a sensible default. 432 kernel tries to set a sensible default.
433 433
434 hpet= [X86-32,HPET] option to disable HPET and use PIT. 434 hpet= [X86-32,HPET] option to control HPET usage
435 Format: disable 435 Format: { enable (default) | disable | force }
436 disable: disable HPET and use PIT instead
437 force: allow forced enabling of undocumented chips (ICH4, VIA)
436 438
437 com20020= [HW,NET] ARCnet - COM20020 chipset 439 com20020= [HW,NET] ARCnet - COM20020 chipset
438 Format: 440 Format:
@@ -497,6 +499,13 @@ and is between 256 and 4096 characters. It is defined in the file
497 [KNL] Reserve a chunk of physical memory to 499 [KNL] Reserve a chunk of physical memory to
498 hold a kernel to switch to with kexec on panic. 500 hold a kernel to switch to with kexec on panic.
499 501
502 crashkernel=range1:size1[,range2:size2,...][@offset]
503 [KNL] Same as above, but depends on the memory
504 in the running system. The syntax of range is
505 start-[end] where start and end are both
506 a memory unit (amount[KMG]). See also
507 Documentation/kdump/kdump.txt for an example.
508
500 cs4232= [HW,OSS] 509 cs4232= [HW,OSS]
501 Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq> 510 Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq>
502 511
diff --git a/Documentation/markers.txt b/Documentation/markers.txt
new file mode 100644
index 000000000000..295a71bc301e
--- /dev/null
+++ b/Documentation/markers.txt
@@ -0,0 +1,81 @@
1 Using the Linux Kernel Markers
2
3 Mathieu Desnoyers
4
5
6This document introduces Linux Kernel Markers and their use. It provides
7examples of how to insert markers in the kernel and connect probe functions to
8them and provides some examples of probe functions.
9
10
11* Purpose of markers
12
13A marker placed in code provides a hook to call a function (probe) that you can
14provide at runtime. A marker can be "on" (a probe is connected to it) or "off"
15(no probe is attached). When a marker is "off" it has no effect, except for
16adding a tiny time penalty (checking a condition for a branch) and space
17penalty (adding a few bytes for the function call at the end of the
18instrumented function and a data structure in a separate section). When a
19marker is "on", the function you provide is called each time the marker is
20executed, in the execution context of the caller. When the function provided
21ends its execution, it returns to the caller (continuing from the marker site).
22
23You can put markers at important locations in the code. Markers are
24lightweight hooks that can pass an arbitrary number of parameters,
25described in a printk-like format string, to the attached probe function.
26
27They can be used for tracing and performance accounting.
28
29
30* Usage
31
32In order to use the macro trace_mark, you should include linux/marker.h.
33
34#include <linux/marker.h>
35
36And,
37
38trace_mark(subsystem_event, "%d %s", someint, somestring);
39Where :
40- subsystem_event is an identifier unique to your event
41 - subsystem is the name of your subsystem.
42 - event is the name of the event to mark.
43- "%d %s" is the formatted string for the serializer.
44- someint is an integer.
45- somestring is a char pointer.
46
47Connecting a function (probe) to a marker is done by providing a probe (function
48to call) for the specific marker through marker_probe_register() and can be
49activated by calling marker_arm(). Marker deactivation can be done by calling
50marker_disarm() as many times as marker_arm() has been called. Removing a probe
51is done through marker_probe_unregister(); it will disarm the probe and make
52sure there is no caller left using the probe when it returns. Probe removal is
53preempt-safe because preemption is disabled around the probe call. See the
54"Probe example" section below for a sample probe module.
55
56The marker mechanism supports inserting multiple instances of the same marker.
57Markers can be put in inline functions, inlined static functions, and
58unrolled loops as well as regular functions.
59
60The naming scheme "subsystem_event" is suggested here as a convention intended
61to limit collisions. Marker names are global to the kernel: they are considered
62as being the same whether they are in the core kernel image or in modules.
63Conflicting format strings for markers with the same name are detected: such
64markers will not be armed, and a printk warning which identifies the
65inconsistency is output:
66
67"Format mismatch for probe probe_name (format), marker (format)"
68
69
70* Probe / marker example
71
72See the example provided in samples/markers/src
73
74Compile them with your kernel.
75
76Run, as root :
77modprobe marker-example (insmod order is not important)
78modprobe probe-example
79cat /proc/marker-example (returns an expected error)
80rmmod marker-example probe-example
81dmesg
diff --git a/Documentation/mips/00-INDEX b/Documentation/mips/00-INDEX
index 9df8a2eac7b4..3f13bf8043d2 100644
--- a/Documentation/mips/00-INDEX
+++ b/Documentation/mips/00-INDEX
@@ -4,5 +4,3 @@ AU1xxx_IDE.README
4 - README for MIPS AU1XXX IDE driver. 4 - README for MIPS AU1XXX IDE driver.
5GT64120.README 5GT64120.README
6 - README for dir with info on MIPS boards using GT-64120 or GT-64120A. 6 - README for dir with info on MIPS boards using GT-64120 or GT-64120A.
7time.README
8 - README for MIPS time services.
diff --git a/Documentation/mips/time.README b/Documentation/mips/time.README
deleted file mode 100644
index a4ce603ed3b3..000000000000
--- a/Documentation/mips/time.README
+++ /dev/null
@@ -1,173 +0,0 @@
1README for MIPS time services
2
3Jun Sun
4jsun@mvista.com or jsun@junsun.net
5
6
7ABOUT
8-----
9This file describes the new arch/mips/kernel/time.c, related files and the
10services they provide.
11
12If you are short in patience and just want to know how to use time.c for a
13new board or convert an existing board, go to the last section.
14
15
16FILES, COMPATIBILITY AND CONFIGS
17---------------------------------
18
19The old arch/mips/kernel/time.c is renamed to old-time.c.
20
21A new time.c is put there, together with include/asm-mips/time.h.
22
23Two config variables are introduced, CONFIG_OLD_TIME_C and CONFIG_NEW_TIME_C.
24So we allow boards using
25
26 1) old time.c (CONFIG_OLD_TIME_C)
27 2) new time.c (CONFIG_NEW_TIME_C)
28 3) neither (their own private time.c)
29
30However, it is expected every board will move to the new time.c in the near
31future.
32
33
34WHAT THE NEW CODE PROVIDES?
35---------------------------
36
37The new time code provides the following services:
38
39 a) Implements functions required by Linux common code:
40 time_init
41
42 b) provides an abstraction of RTC and null RTC implementation as default.
43 extern unsigned long (*rtc_get_time)(void);
44 extern int (*rtc_set_time)(unsigned long);
45
46 c) high-level and low-level timer interrupt routines where the timer
47 interrupt source may or may not be the CPU timer. The high-level
48 routine is dispatched through do_IRQ() while the low-level is
49 dispatched in assembly code (usually int-handler.S)
50
51
52WHAT THE NEW CODE REQUIRES?
53---------------------------
54
55For the new code to work properly, each board implementation needs to supply
56the following functions or values:
57
58 a) board_time_init - a function pointer. Invoked at the beginning of
59 time_init(). It is optional.
60 1. (optional) set up RTC routines
61 2. (optional) calibrate and set the mips_hpt_frequency
62
63 b) plat_timer_setup - a function pointer. Invoked at the end of time_init()
64 1. (optional) over-ride any decisions made in time_init()
65 2. set up the irqaction for timer interrupt.
66 3. enable the timer interrupt
67
68 c) (optional) board-specific RTC routines.
69
70 d) (optional) mips_hpt_frequency - It must be defined if the board
71 is using the CPU counter for the timer interrupt.
72
73
74PORTING GUIDE
75-------------
76
77Step 1: decide how you like to implement the time services.
78
79 a) does this board have an RTC? If yes, implement the two RTC funcs.
80
81 b) does the CPU have counter/compare registers?
82
83 If the answer is no, you need a timer to provide the timer interrupt
84 at 100 Hz.
85
86 c) The following sub-steps assume your CPU has a counter register.
87 Do you plan to use the CPU counter register as the timer interrupt
88 or use an external timer?
89
90 In order to use CPU counter register as the timer interrupt source, you
91 must know the counter speed (mips_hpt_frequency). It is usually the
92 same as the CPU speed or an integral divisor of it.
93
94 d) decide on whether you want to use high-level or low-level timer
95 interrupt routines. The low-level one is presumably faster, but should
96 not make too much difference.
97
98
99Step 2: the machine setup() function
100
101 If you supply board_time_init(), set the function pointer.
102
103
104Step 3: implement rtc routines, board_time_init() and plat_timer_setup()
105 if needed.
106
107 board_time_init() -
108 a) (optional) set up RTC routines,
109 b) (optional) calibrate and set the mips_hpt_frequency
110 (only needed if you intended to use cpu counter as timer interrupt
111 source)
112
113 plat_timer_setup() -
114 a) (optional) over-write any choices made above by time_init().
115 b) machine specific code should setup the timer irqaction.
116 c) enable the timer interrupt
117
118
119 If the RTC chip is a common chip, I suggest the routines are put under
120 arch/mips/lib. For example, for the DS1386 chip, one would create
121 rtc-ds1386.c under arch/mips/lib directory. Add the following line to
122 the arch/mips/lib/Makefile:
123
124 obj-$(CONFIG_DDB5476) += rtc-ds1386.o
125
126Step 4: if you are using low-level timer interrupt, change your interrupt
127 dispatching code to check for timer interrupt and jump to
128 ll_timer_interrupt() directly if one is detected.
129
130Step 5: Modify arch/mips/config.in and add CONFIG_NEW_TIME_C to your machine.
131 Modify the appropriate defconfig if applicable.
132
133Final notes:
134
135For some tricky cases, you may need to add your own wrapper functions
136for some of the functions in time.c.
137
138For example, you may define your own timer interrupt routine, which does
139some of its own processing and then calls timer_interrupt().
140
141You can also over-ride any of the built-in functions (RTC routines
142and/or timer interrupt routine).
143
144
145PORTING NOTES FOR SMP
146----------------------
147
148If you have an SMP box, things are slightly more complicated.
149
150The time service running every jiffy is logically divided into two parts:
151
152 1) the one for the whole system (defined in timer_interrupt())
153 2) the one that should run for each CPU (defined in local_timer_interrupt())
154
155You need to decide on your timer interrupt sources.
156
157 case 1) - whole system has only one timer interrupt delivered to one CPU
158
159 In this case, you set up timer interrupt as in UP systems. In addition,
160 you need to set emulate_local_timer_interrupt to 1 so that other
161 CPUs get to call local_timer_interrupt().
162
163 THIS IS CURRENTLY NOT IMPLEMENTED. However, it is rather easy to write
164 one should such a need arise. You simply make an IPI call.
165
166 case 2) - each CPU has a separate timer interrupt
167
168 In this case, you need to set up IRQ such that each of them will
169 call local_timer_interrupt(). In addition, you need to arrange
170 one and only one of them to call timer_interrupt().
171
172 You can also do the low-level version of those interrupt routines,
173 following similar dispatching routes described above.
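The hooks named in the porting guide above can be pictured with a small
user-space C sketch. It is not kernel code and makes several assumptions:
time_init_sketch(), the demo_* functions and timer_irq_enabled are
hypothetical, plat_timer_setup() is simplified to take no arguments, and the
100 MHz counter frequency is an invented value. Only the names rtc_get_time,
rtc_set_time, board_time_init, plat_timer_setup and mips_hpt_frequency, and
the order in which time_init() invokes the hooks, come from the README.

```c
#include <stddef.h>

/* Null RTC implementation, used by default when a board supplies none. */
static unsigned long null_rtc_get_time(void) { return 0; }
static int null_rtc_set_time(unsigned long sec) { (void)sec; return 0; }

unsigned long (*rtc_get_time)(void) = null_rtc_get_time;
int (*rtc_set_time)(unsigned long) = null_rtc_set_time;

unsigned int mips_hpt_frequency; /* CPU counter ticks per second */
void (*board_time_init)(void);   /* optional hook, NULL if unused */
int timer_irq_enabled;           /* hypothetical stand-in for IRQ state */

/* Hypothetical board-specific RTC routines (Step 3). */
static unsigned long demo_rtc_seconds = 1000;
static unsigned long demo_rtc_get_time(void) { return demo_rtc_seconds; }
static int demo_rtc_set_time(unsigned long sec)
{
	demo_rtc_seconds = sec;
	return 0;
}

/* Step 3: set up the RTC routines and the counter frequency. */
static void demo_board_time_init(void)
{
	rtc_get_time = demo_rtc_get_time;
	rtc_set_time = demo_rtc_set_time;
	mips_hpt_frequency = 100000000; /* assumed 100 MHz counter */
}

/* Step 3: set up and enable the timer interrupt (simplified). */
void plat_timer_setup(void)
{
	timer_irq_enabled = 1;
}

/* Step 2: the machine setup function installs the optional hook. */
void demo_machine_setup(void)
{
	board_time_init = demo_board_time_init;
}

/* Order of calls only: board hook first, platform timer setup last. */
unsigned long time_init_sketch(void)
{
	unsigned long now;

	if (board_time_init)
		board_time_init();
	now = rtc_get_time();
	plat_timer_setup();
	return now;
}
```

A board that never sets board_time_init keeps the null RTC, so rtc_get_time()
returns 0 in this sketch; after demo_machine_setup(), time_init_sketch()
reads the time from the board's own routine.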
diff --git a/Documentation/thinkpad-acpi.txt b/Documentation/thinkpad-acpi.txt
index 60953d6c919d..3b95bbacc775 100644
--- a/Documentation/thinkpad-acpi.txt
+++ b/Documentation/thinkpad-acpi.txt
@@ -105,10 +105,15 @@ The version of thinkpad-acpi's sysfs interface is exported by the driver
105as a driver attribute (see below). 105as a driver attribute (see below).
106 106
107Sysfs driver attributes are on the driver's sysfs attribute space, 107Sysfs driver attributes are on the driver's sysfs attribute space,
108for 2.6.20 this is /sys/bus/platform/drivers/thinkpad_acpi/. 108for 2.6.23 this is /sys/bus/platform/drivers/thinkpad_acpi/ and
109/sys/bus/platform/drivers/thinkpad_hwmon/
109 110
110Sysfs device attributes are on the driver's sysfs attribute space, 111Sysfs device attributes are on the thinkpad_acpi device sysfs attribute
111for 2.6.20 this is /sys/devices/platform/thinkpad_acpi/. 112space, for 2.6.23 this is /sys/devices/platform/thinkpad_acpi/.
113
114Sysfs device attributes for the sensors and fan are on the
115thinkpad_hwmon device's sysfs attribute space, but you should locate it
116looking for a hwmon device with the name attribute of "thinkpad".
112 117
113Driver version 118Driver version
114-------------- 119--------------
@@ -766,7 +771,7 @@ Temperature sensors
766------------------- 771-------------------
767 772
768procfs: /proc/acpi/ibm/thermal 773procfs: /proc/acpi/ibm/thermal
769sysfs device attributes: (hwmon) temp*_input 774sysfs device attributes: (hwmon "thinkpad") temp*_input
770 775
771Most ThinkPads include six or more separate temperature sensors but only 776Most ThinkPads include six or more separate temperature sensors but only
772expose the CPU temperature through the standard ACPI methods. This 777expose the CPU temperature through the standard ACPI methods. This
@@ -989,7 +994,9 @@ Fan control and monitoring: fan speed, fan enable/disable
989--------------------------------------------------------- 994---------------------------------------------------------
990 995
991procfs: /proc/acpi/ibm/fan 996procfs: /proc/acpi/ibm/fan
992sysfs device attributes: (hwmon) fan_input, pwm1, pwm1_enable 997sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1,
998 pwm1_enable
999sysfs hwmon driver attributes: fan_watchdog
993 1000
994NOTE NOTE NOTE: fan control operations are disabled by default for 1001NOTE NOTE NOTE: fan control operations are disabled by default for
995safety reasons. To enable them, the module parameter "fan_control=1" 1002safety reasons. To enable them, the module parameter "fan_control=1"
@@ -1131,7 +1138,7 @@ hwmon device attribute fan1_input:
1131 which can take up to two minutes. May return rubbish on older 1138 which can take up to two minutes. May return rubbish on older
1132 ThinkPads. 1139 ThinkPads.
1133 1140
1134driver attribute fan_watchdog: 1141hwmon driver attribute fan_watchdog:
1135 Fan safety watchdog timer interval, in seconds. Minimum is 1142 Fan safety watchdog timer interval, in seconds. Minimum is
1136 1 second, maximum is 120 seconds. 0 disables the watchdog. 1143 1 second, maximum is 120 seconds. 0 disables the watchdog.
1137 1144
@@ -1233,3 +1240,9 @@ Sysfs interface changelog:
1233 layer, the radio switch generates input event EV_RADIO, 1240 layer, the radio switch generates input event EV_RADIO,
1234 and the driver enables hot key handling by default in 1241 and the driver enables hot key handling by default in
1235 the firmware. 1242 the firmware.
1243
12440x020000: ABI fix: added a separate hwmon platform device and
1245 driver, which must be located by name (thinkpad)
1246 and the hwmon class for libsensors4 (lm-sensors 3)
1247 compatibility. Moved all hwmon attributes to this
1248 new platform device.
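Since the hwmon device must be located by name rather than by a fixed path, a
user-space lookup might look like the following C sketch. It rests on
assumptions: find_hwmon_by_name() is a hypothetical helper, and it expects the
"name" attribute directly inside each hwmon class entry. Depending on the
kernel version the attribute may instead live under the entry's device/
subdirectory, so treat the path layout as illustrative.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan class_dir (normally "/sys/class/hwmon") for an entry whose "name"
 * attribute equals wanted, for example "thinkpad". On success the entry's
 * path is copied into out and 0 is returned; otherwise -1 is returned.
 */
int find_hwmon_by_name(const char *class_dir, const char *wanted,
		       char *out, size_t out_len)
{
	DIR *dir = opendir(class_dir);
	struct dirent *ent;
	int found = -1;

	if (!dir)
		return -1;

	while (found != 0 && (ent = readdir(dir)) != NULL) {
		char path[512];
		char name[64];
		FILE *f;

		/* Only look at hwmonN entries. */
		if (strncmp(ent->d_name, "hwmon", 5) != 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s/name",
			 class_dir, ent->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;

		if (fgets(name, sizeof(name), f)) {
			name[strcspn(name, "\n")] = '\0'; /* drop newline */
			if (strcmp(name, wanted) == 0) {
				snprintf(out, out_len, "%s/%s",
					 class_dir, ent->d_name);
				found = 0;
			}
		}
		fclose(f);
	}
	closedir(dir);
	return found;
}
```

A caller would pass "/sys/class/hwmon" and "thinkpad", then read attributes
such as temp*_input, fan1_input or pwm1 relative to the returned path.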