22 files changed, 1019 insertions, 389 deletions
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index d3290c46af51..aa38cc5692a0 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -46,7 +46,7 @@
     <sect1><title>Atomic and pointer manipulation</title>
 !Iinclude/asm-x86/atomic_32.h
-!Iinclude/asm-x86/unaligned_32.h
+!Iinclude/asm-x86/unaligned.h
     </sect1>
     <sect1><title>Delaying, scheduling, and timer routines</title>
diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt
index 24dc3fcf1594..bc38283379f0 100644
--- a/Documentation/IPMI.txt
+++ b/Documentation/IPMI.txt
@@ -441,17 +441,20 @@ ACPI, and if none of those then a KCS device at the spec-specified
 0xca2.  If you want to turn this off, set the "trydefaults" option to
 false.
-If you have high-res timers compiled into the kernel, the driver will
+If your IPMI interface does not support interrupts and is a KCS or
-use them to provide much better performance.  Note that if you do not
+SMIC interface, the IPMI driver will start a kernel thread for the
-have high-res timers enabled in the kernel and you don't have
+interface to help speed things up.  This is a low-priority kernel
-interrupts enabled, the driver will run VERY slowly.  Don't blame me,
+thread that constantly polls the IPMI driver while an IPMI operation
+is in progress.  The force_kipmid module parameter will allow the user to
+force this thread on or off.  If you force it off and don't have
+interrupts, the driver will run VERY slowly.  Don't blame me,
 these interfaces suck.
 The driver supports a hot add and remove of interfaces.  This way,
 interfaces can be added or removed after the kernel is up and running.
-This is done using /sys/modules/ipmi_si/hotmod, which is a write-only
+This is done using /sys/module/ipmi_si/parameters/hotmod, which is a
-parameter.  You write a string to this interface.  The string has the
+write-only parameter.  You write a string to this interface.  The string
-format:
+has the format:
   <op1>[:op2[:op3...]]
 The "op"s are:
   add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]
@@ -581,9 +584,11 @@ The watchdog will panic and start a 120 second reset timeout if it
 gets a pre-action.  During a panic or a reboot, the watchdog will
 start a 120 timer if it is running to make sure the reboot occurs.
-Note that if you use the NMI preaction for the watchdog, you MUST
+Note that if you use the NMI preaction for the watchdog, you MUST NOT
-NOT use nmi watchdog mode 1.  If you use the NMI watchdog, you
+use the nmi watchdog.  There is no reasonable way to tell if an NMI
-must use mode 2.
+comes from the IPMI controller, so the driver must assume that if it gets an
+otherwise unhandled NMI, it must be from IPMI and it will panic
+immediately.
 Once you open the watchdog timer, you must write a 'V' character to the
 device to close it, or the timer will not stop.  This is a new semantic
diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.txt
new file mode 100644
index 000000000000..eda40fd39cad
--- /dev/null
+++ b/Documentation/accounting/cgroupstats.txt
@@ -0,0 +1,27 @@
+Control Groupstats is inspired by the discussion at
+http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
+suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
+Per cgroup statistics infrastructure re-uses code from the taskstats
+interface. A new set of cgroup operations is registered with commands
+and attributes specific to cgroups. It should be very easy to
+extend per cgroup statistics, by adding members to the cgroupstats
+structure.
+The current model for cgroupstats is pull; a push model (to post
+statistics on interesting events) should be very easy to add. Currently
+user space requests statistics by passing the cgroup path.
+Statistics about the state of all the tasks in the cgroup are returned to
+user space.
+NOTE: We currently rely on delay accounting for extracting information
+about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
+information will not be available.
+To extract cgroup statistics, a utility very similar to getdelays.c
+has been developed; sample output of the utility is shown below:
+~/balbir/cgroupstats # ./getdelays  -C "/cgroup/a"
+sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+~/balbir/cgroupstats # ./getdelays  -C "/cgroup"
+sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index d46306fea230..f20c10c2858f 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -418,6 +418,20 @@ brothers:
         */
         smp_mb__after_clear_bit();
+There are two special bitops with lock barrier semantics (acquire/release,
+same as spinlocks). These operate in the same way as their non-_lock/unlock
+postfixed variants, except that they provide acquire and release semantics,
+respectively. This means they can be used for bit_spin_trylock and
+bit_spin_unlock type operations without specifying any more barriers.
+        int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
+        void clear_bit_unlock(unsigned long nr, unsigned long *addr);
+        void __clear_bit_unlock(unsigned long nr, unsigned long *addr);
+The __clear_bit_unlock version is non-atomic; however, it still implements
+unlock barrier semantics. This can be useful if the lock itself is protecting
+the other bits in the word.
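The acquire/release pairing described above can be sketched in user space. This is only an analogy built on C11 <stdatomic.h>, not the kernel implementation; the names sketch_test_and_set_bit_lock() and sketch_clear_bit_unlock() are hypothetical and exist only for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space analogue of test_and_set_bit_lock():
 * atomically set bit 'nr' and return its previous value.  The
 * acquire ordering keeps later accesses from being reordered
 * before the point where the lock bit is taken.
 */
static bool sketch_test_and_set_bit_lock(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or_explicit(addr, mask,
					 memory_order_acquire) & mask) != 0;
}

/*
 * Hypothetical analogue of clear_bit_unlock(): atomically clear
 * bit 'nr'.  The release ordering makes earlier accesses visible
 * before the lock bit is seen as clear.
 */
static void sketch_clear_bit_unlock(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	atomic_fetch_and_explicit(addr, ~mask, memory_order_release);
}
```

A trylock succeeds when the first function returns false (the bit was previously clear); the critical section is then closed by the second function, with no further barriers needed.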
 Finally, there are non-atomic versions of the bitmask operations
 provided.  They are used in contexts where some other higher-level SMP
 locking scheme is being used to protect the bitmask, and thus less
diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt
index 552cabac0608..da42ab414c48 100644
--- a/Documentation/cachetlb.txt
+++ b/Documentation/cachetlb.txt
@@ -87,30 +87,7 @@ changes occur:
        This is used primarily during fault processing.
-5) void flush_tlb_pgtables(struct mm_struct *mm,
+5) void update_mmu_cache(struct vm_area_struct *vma,
-                           unsigned long start, unsigned long end)
-   The software page tables for address space 'mm' for virtual
-   addresses in the range 'start' to 'end-1' are being torn down.
-   Some platforms cache the lowest level of the software page tables
-   in a linear virtually mapped array, to make TLB miss processing
-   more efficient.  On such platforms, since the TLB is caching the
-   software page table structure, it needs to be flushed when parts
-   of the software page table tree are unlinked/freed.
-   Sparc64 is one example of a platform which does this.
-   Usually, when munmap()'ing an area of user virtual address
-   space, the kernel leaves the page table parts around and just
-   marks the individual pte's as invalid.  However, if very large
-   portions of the address space are unmapped, the kernel frees up
-   those portions of the software page tables to prevent potential
-   excessive kernel memory usage caused by erratic mmap/mmunmap
-   sequences.  It is at these times that flush_tlb_pgtables will
-   be invoked.
-6) void update_mmu_cache(struct vm_area_struct *vma,
                         unsigned long address, pte_t pte)
        At the end of every page fault, this routine is invoked to
@@ -123,7 +100,7 @@ changes occur:
        translations for software managed TLB configurations.
        The sparc64 port currently does this.
-7) void tlb_migrate_finish(struct mm_struct *mm)
+6) void tlb_migrate_finish(struct mm_struct *mm)
        This interface is called at the end of an explicit
        process migration. This interface provides a hook
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt
new file mode 100644
index 000000000000..98a26f81fa75
--- /dev/null
+++ b/Documentation/cgroups.txt
@@ -0,0 +1,545 @@
+                                CGROUPS
+                                -------
+Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
+Original copyright statements from cpusets.txt:
+Portions Copyright (C) 2004 BULL SA.
+Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+Modified by Paul Jackson <pj@sgi.com>
+Modified by Christoph Lameter <clameter@sgi.com>
+CONTENTS:
+=========
+1. Control Groups
+  1.1 What are cgroups ?
+  1.2 Why are cgroups needed ?
+  1.3 How are cgroups implemented ?
+  1.4 What does notify_on_release do ?
+  1.5 How do I use cgroups ?
+2. Usage Examples and Syntax
+  2.1 Basic Usage
+  2.2 Attaching processes
+3. Kernel API
+  3.1 Overview
+  3.2 Synchronization
+  3.3 Subsystem API
+4. Questions
+1. Control Groups
+=================
+1.1 What are cgroups ?
+----------------------
+Control Groups provide a mechanism for aggregating/partitioning sets of
+tasks, and all their future children, into hierarchical groups with
+specialized behaviour.
+Definitions:
+A *cgroup* associates a set of tasks with a set of parameters for one
+or more subsystems.
+A *subsystem* is a module that makes use of the task grouping
+facilities provided by cgroups to treat groups of tasks in
+particular ways. A subsystem is typically a "resource controller" that
+schedules a resource or applies per-cgroup limits, but it may be
+anything that wants to act on a group of processes, e.g. a
+virtualization subsystem.
+A *hierarchy* is a set of cgroups arranged in a tree, such that
+every task in the system is in exactly one of the cgroups in the
+hierarchy, and a set of subsystems; each subsystem has system-specific
+state attached to each cgroup in the hierarchy.  Each hierarchy has
+an instance of the cgroup virtual filesystem associated with it.
+At any one time there may be multiple active hierarchies of task
+cgroups. Each hierarchy is a partition of all tasks in the system.
+User level code may create and destroy cgroups by name in an
+instance of the cgroup virtual file system, specify and query to
+which cgroup a task is assigned, and list the task pids assigned to
+a cgroup. Those creations and assignments only affect the hierarchy
+associated with that instance of the cgroup file system.
+On their own, the only use for cgroups is for simple job
+tracking. The intention is that other subsystems hook into the generic
+cgroup support to provide new attributes for cgroups, such as
+accounting/limiting the resources which processes in a cgroup can
+access. For example, cpusets (see Documentation/cpusets.txt) allows
+you to associate a set of CPUs and a set of memory nodes with the
+tasks in each cgroup.
+1.2 Why are cgroups needed ?
+----------------------------
+There are multiple efforts to provide process aggregations in the
+Linux kernel, mainly for resource tracking purposes. Such efforts
+include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
+namespaces. These all require the basic notion of a
+grouping/partitioning of processes, with newly forked processes ending
+in the same group (cgroup) as their parent process.
+The kernel cgroup patch provides the minimum essential kernel
+mechanisms required to efficiently implement such groups. It has
+minimal impact on the system fast paths, and provides hooks for
+specific subsystems such as cpusets to provide additional behaviour as
+desired.
+Multiple hierarchy support is provided to allow for situations where
+the division of tasks into cgroups is distinctly different for
+different subsystems - having parallel hierarchies allows each
+hierarchy to be a natural division of tasks, without having to handle
+complex combinations of tasks that would be present if several
+unrelated subsystems needed to be forced into the same tree of
+cgroups.
+At one extreme, each resource controller or subsystem could be in a
+separate hierarchy; at the other extreme, all subsystems
+would be attached to the same hierarchy.
+As an example of a scenario (originally proposed by vatsa@in.ibm.com)
+that can benefit from multiple hierarchies, consider a large
+university server with various users - students, professors, system
+tasks etc. The resource planning for this server could be along the
+following lines:
+       CPU :           Top cpuset
+                       /       \
+               CPUSet1         CPUSet2
+                  |              |
+               (Profs)         (Students)
+               In addition (system tasks) are attached to topcpuset (so
+               that they can run anywhere) with a limit of 20%
+       Memory : Professors (50%), students (30%), system (20%)
+       Disk : Prof (50%), students (30%), system (20%)
+       Network : WWW browsing (20%), Network File System (60%), others (20%)
+                               / \
+                       Prof (15%) students (5%)
+Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+into NFS network class.
+At the same time firefox/lynx will share an appropriate CPU/Memory class
+depending on who launched it (prof/student).
+With the ability to classify tasks differently for different resources
+(by putting those resource subsystems in different hierarchies) then
+the admin can easily set up a script which receives exec notifications
+and depending on who is launching the browser he can
+       # echo browser_pid > /mnt/<restype>/<userclass>/tasks
+With only a single hierarchy, he now would potentially have to create
+a separate cgroup for every browser launched and associate it with
+the appropriate network and other resource classes.  This may lead to
+proliferation of such cgroups.
+Also, let's say that the administrator would like to give enhanced network
+access temporarily to a student's browser (since it is night and the user
+wants to do online gaming :) or give one of the students' simulation
+apps enhanced CPU power.
+With the ability to write pids directly to resource classes, it's just a
+matter of:
+       # echo pid > /mnt/network/<new_class>/tasks
+       (after some time)
+       # echo pid > /mnt/network/<orig_class>/tasks
+Without this ability, he would have to split the cgroup into
+multiple separate ones and then associate the new cgroups with the
+new resource classes.
+1.3 How are cgroups implemented ?
+---------------------------------
+Control Groups extends the kernel as follows:
+ - Each task in the system has a reference-counted pointer to a
+   css_set.
+ - A css_set contains a set of reference-counted pointers to
+   cgroup_subsys_state objects, one for each cgroup subsystem
+   registered in the system. There is no direct link from a task to
+   the cgroup of which it's a member in each hierarchy, but this
+   can be determined by following pointers through the
+   cgroup_subsys_state objects. This is because accessing the
+   subsystem state is something that's expected to happen frequently
+   and in performance-critical code, whereas operations that require a
+   task's actual cgroup assignments (in particular, moving between
+   cgroups) are less common. A linked list runs through the cg_list
+   field of each task_struct using the css_set, anchored at
+   css_set->tasks.
+ - A cgroup hierarchy filesystem can be mounted for browsing and
+   manipulation from user space.
+ - You can list all the tasks (by pid) attached to any cgroup.
+The implementation of cgroups requires a few, simple hooks
+into the rest of the kernel, none in performance critical paths:
+ - in init/main.c, to initialize the root cgroups and initial
+   css_set at system boot.
+ - in fork and exit, to attach and detach a task from its css_set.
+In addition a new file system, of type "cgroup" may be mounted, to
+enable browsing and modifying the cgroups presently known to the
+kernel.  When mounting a cgroup hierarchy, you may specify a
+comma-separated list of subsystems to mount as the filesystem mount
+options.  By default, mounting the cgroup filesystem attempts to
+mount a hierarchy containing all registered subsystems.
+If an active hierarchy with exactly the same set of subsystems already
+exists, it will be reused for the new mount. If no existing hierarchy
+matches, and any of the requested subsystems are in use in an existing
+hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
+is activated, associated with the requested subsystems.
+It's not currently possible to bind a new subsystem to an active
+cgroup hierarchy, or to unbind a subsystem from an active cgroup
+hierarchy. This may be possible in future, but is fraught with nasty
+error-recovery issues.
+When a cgroup filesystem is unmounted, if there are any
+child cgroups created below the top-level cgroup, that hierarchy
+will remain active even though unmounted; if there are no
+child cgroups then the hierarchy will be deactivated.
+No new system calls are added for cgroups - all support for
+querying and modifying cgroups is via this cgroup file system.
+Each task under /proc has an added file named 'cgroup' displaying,
+for each active hierarchy, the subsystem names and the cgroup name
+as the path relative to the root of the cgroup file system.
+Each cgroup is represented by a directory in the cgroup file system
+containing the following files describing that cgroup:
+ - tasks: list of tasks (by pid) attached to that cgroup
+ - notify_on_release flag: run /sbin/cgroup_release_agent on exit?
+Other subsystems such as cpusets may add additional files in each
+cgroup directory.
+New cgroups are created using the mkdir system call or shell
+command.  The properties of a cgroup, such as its flags, are
+modified by writing to the appropriate file in that cgroup's
+directory, as listed above.
+The named hierarchical structure of nested cgroups allows partitioning
+a large system into nested, dynamically changeable, "soft-partitions".
+The attachment of each task, automatically inherited at fork by any
+children of that task, to a cgroup allows organizing the work load
+on a system into related sets of tasks.  A task may be re-attached to
+any other cgroup, if allowed by the permissions on the necessary
+cgroup file system directories.
+When a task is moved from one cgroup to another, it gets a new
+css_set pointer - if there's an already existing css_set with the
+desired collection of cgroups then that css_set is reused, else a new
+css_set is allocated. Note that the current implementation uses a
+linear search to locate an appropriate existing css_set, so isn't
+very efficient. A future version will use a hash table for better
+performance.
+To allow access from a cgroup to the css_sets (and hence tasks)
+that comprise it, a set of cg_cgroup_link objects form a lattice;
+each cg_cgroup_link is linked into a list of cg_cgroup_links for
+a single cgroup on its cont_link_list field, and a list of
+cg_cgroup_links for a single css_set on its cg_link_list.
+Thus the set of tasks in a cgroup can be listed by iterating over
+each css_set that references the cgroup, and sub-iterating over
+each css_set's task set.
+The use of a Linux virtual file system (vfs) to represent the
+cgroup hierarchy provides for a familiar permission and name space
+for cgroups, with a minimum of additional kernel code.
+1.4 What does notify_on_release do ?
+------------------------------------
+*** notify_on_release is disabled in the current patch set. It will be
+*** reactivated in a future patch in a less-intrusive manner
+If the notify_on_release flag is enabled (1) in a cgroup, then
+whenever the last task in the cgroup leaves (exits or attaches to
+some other cgroup) and the last child cgroup of that cgroup
+is removed, then the kernel runs the command specified by the contents
+of the "release_agent" file in that hierarchy's root directory,
+supplying the pathname (relative to the mount point of the cgroup
+file system) of the abandoned cgroup.  This enables automatic
+removal of abandoned cgroups.  The default value of
+notify_on_release in the root cgroup at system boot is disabled
+(0).  The default value of other cgroups at creation is the current
+value of their parent's notify_on_release setting. The default value of
+a cgroup hierarchy's release_agent path is empty.
+1.5 How do I use cgroups ?
+--------------------------
+To start a new job that is to be contained within a cgroup, using
+the "cpuset" cgroup subsystem, the steps are something like:
+ 1) mkdir /dev/cgroup
+ 2) mount -t cgroup -ocpuset cpuset /dev/cgroup
+ 3) Create the new cgroup by doing mkdir's and write's (or echo's) in
+    the /dev/cgroup virtual file system.
+ 4) Start a task that will be the "founding father" of the new job.
+ 5) Attach that task to the new cgroup by writing its pid to the
+    /dev/cgroup tasks file for that cgroup.
+ 6) fork, exec or clone the job tasks from this founding father task.
+For example, the following sequence of commands will set up a cgroup
+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
+and then start a subshell 'sh' in that cgroup:
+  mount -t cgroup cpuset -ocpuset /dev/cgroup
+  cd /dev/cgroup
+  mkdir Charlie
+  cd Charlie
+  /bin/echo 2-3 > cpus
+  /bin/echo 1 > mems
+  /bin/echo $$ > tasks
+  sh
+  # The subshell 'sh' is now running in cgroup Charlie
+  # The next line should display '/Charlie'
+  cat /proc/self/cgroup
+2. Usage Examples and Syntax
+============================
+2.1 Basic Usage
+---------------
+Creating, modifying, using the cgroups can be done through the cgroup
+virtual filesystem.
+To mount a cgroup hierarchy with all available subsystems, type:
+# mount -t cgroup xxx /dev/cgroup
+The "xxx" is not interpreted by the cgroup code, but will appear in
+/proc/mounts so may be any useful identifying string that you like.
+To mount a cgroup hierarchy with just the cpuset and numtasks
+subsystems, type:
+# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+To change the set of subsystems bound to a mounted hierarchy, just
+remount with different options:
+# mount -o remount,cpuset,ns  /dev/cgroup
+Note that changing the set of subsystems is currently only supported
+when the hierarchy consists of a single (root) cgroup. Supporting
+the ability to arbitrarily bind/unbind subsystems from an existing
+cgroup hierarchy is intended to be implemented in the future.
+Then under /dev/cgroup you can find a tree that corresponds to the
+tree of the cgroups in the system. For instance, /dev/cgroup
+is the cgroup that holds the whole system.
+If you want to create a new cgroup under /dev/cgroup:
+# cd /dev/cgroup
+# mkdir my_cgroup
+Now you want to do something with this cgroup.
+# cd my_cgroup
+In this directory you can find several files:
+# ls
+notify_on_release release_agent tasks
+(plus whatever files are added by the attached subsystems)
+Now attach your shell to this cgroup:
+# /bin/echo $$ > tasks
+You can also create cgroups inside your cgroup by using mkdir in this
+directory.
+# mkdir my_sub_cs
+To remove a cgroup, just use rmdir:
+# rmdir my_sub_cs
+This will fail if the cgroup is in use (has cgroups inside, or
+has processes attached, or is held alive by other subsystem-specific
+reference).
+2.2 Attaching processes
+-----------------------
+# /bin/echo PID > tasks
+Note that it is PID, not PIDs. You can only attach ONE task at a time.
+If you have several tasks to attach, you have to do it one after another:
+# /bin/echo PID1 > tasks
+# /bin/echo PID2 > tasks
+        ...
+# /bin/echo PIDn > tasks
+3. Kernel API
+=============
+3.1 Overview
+------------
+Each kernel subsystem that wants to hook into the generic cgroup
+system needs to create a cgroup_subsys object. This contains
+various methods, which are callbacks from the cgroup system, along
+with a subsystem id which will be assigned by the cgroup system.
+Other fields in the cgroup_subsys object include:
+- subsys_id: a unique array index for the subsystem, indicating which
+  entry in cgroup->subsys[] this subsystem should be
+  managing. Initialized by cgroup_register_subsys(); prior to this
+  it should be initialized to -1
+- hierarchy: an index indicating which hierarchy, if any, this
+  subsystem is currently attached to. If this is -1, then the
+  subsystem is not attached to any hierarchy, and all tasks should be
+  considered to be members of the subsystem's top_cgroup. It should
+  be initialized to -1.
+- name: should be initialized to a unique subsystem name prior to
+  calling cgroup_register_subsystem. Should be no longer than
+  MAX_CGROUP_TYPE_NAMELEN
+Each cgroup object created by the system has an array of pointers,
+indexed by subsystem id; this pointer is entirely managed by the
+subsystem; the generic cgroup code will never touch this pointer.
+3.2 Synchronization
+-------------------
+There is a global mutex, cgroup_mutex, used by the cgroup
+system. This should be taken by anything that wants to modify a
+cgroup. It may also be taken to prevent cgroups from being
+modified, but more specific locks may be more appropriate in that
+situation.
+See kernel/cgroup.c for more details.
+Subsystems can take/release the cgroup_mutex via the functions
+cgroup_lock()/cgroup_unlock(), and can
+take/release the callback_mutex via the functions
+cgroup_lock()/cgroup_unlock().
+Accessing a task's cgroup pointer may be done in the following ways:
+- while holding cgroup_mutex
+- while holding the task's alloc_lock (via task_lock())
+- inside an rcu_read_lock() section via rcu_dereference()
+3.3 Subsystem API
+-----------------
+Each subsystem should:
+- add an entry in linux/cgroup_subsys.h
+- define a cgroup_subsys object called <name>_subsys
+Each subsystem may export the following methods. The only mandatory
+methods are create/destroy. Any others that are null are presumed to
+be successful no-ops.
+struct cgroup_subsys_state *create(struct cgroup *cont)
+LL=cgroup_mutex
+Called to create a subsystem state object for a cgroup. The
+subsystem should allocate its subsystem state object for the passed
+cgroup, returning a pointer to the new object on success or a
+negative error code. On success, the subsystem pointer should point to
+a structure of type cgroup_subsys_state (typically embedded in a
+larger subsystem-specific object), which will be initialized by the
+cgroup system. Note that this will be called at initialization to
+create the root subsystem state for this subsystem; this case can be
+identified by the passed cgroup object having a NULL parent (since
+it's the root of the hierarchy) and may be an appropriate place for
+initialization code.
+void destroy(struct cgroup *cont)
+LL=cgroup_mutex
+The cgroup system is about to destroy the passed cgroup; the
+subsystem should do any necessary cleanup.
+int can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+               struct task_struct *task)
+LL=cgroup_mutex
+Called prior to moving a task into a cgroup; if the subsystem
+returns an error, this will abort the attach operation.  If a NULL
+task is passed, then a successful result indicates that *any*
+unspecified task can be moved into the cgroup. Note that this isn't
+called on a fork. If this method returns 0 (success) then this should
+remain valid while the caller holds cgroup_mutex.
+void attach(struct cgroup_subsys *ss, struct cgroup *cont,
+            struct cgroup *old_cont, struct task_struct *task)
+LL=cgroup_mutex
+Called after the task has been attached to the cgroup, to allow any
+post-attachment activity that requires memory allocations or blocking.
+void fork(struct cgroup_subsys *ss, struct task_struct *task)
+LL=callback_mutex, maybe read_lock(tasklist_lock)
+Called when a task is forked into a cgroup. Also called during
+registration for all existing tasks.
+void exit(struct cgroup_subsys *ss, struct task_struct *task)
+LL=callback_mutex
+Called during task exit.
+int populate(struct cgroup_subsys *ss, struct cgroup *cont)
+LL=none
+Called after creation of a cgroup to allow a subsystem to populate
+the cgroup directory with file entries.  The subsystem should make
+calls to cgroup_add_file() with objects of type cftype (see
+include/linux/cgroup.h for details).  Note that although this
+method can return an error code, the error code is currently not
+always handled well.
+void post_clone(struct cgroup_subsys *ss, struct cgroup *cont)
+Called at the end of cgroup_clone() to do any parameter
+initialization which might be required before a task could attach.  For
+example in cpusets, no task may attach before 'cpus' and 'mems' are set
+up.
+void bind(struct cgroup_subsys *ss, struct cgroup *root)
+LL=callback_mutex
+Called when a cgroup subsystem is rebound to a different hierarchy
+and root cgroup. Currently this will only involve movement between
+the default hierarchy (which never has sub-cgroups) and a hierarchy
+that is being created/destroyed (and hence has no sub-cgroups).
+4. Questions
+============
+Q: what's up with this '/bin/echo' ?
+A: bash's builtin 'echo' command does not check calls to write() for
+   errors. If you use it in the cgroup file system, you won't be
+   able to tell whether a command succeeded or failed.
+Q: When I attach processes, only the first one on the line really gets attached !
+A: We can only return one error code per call to write(). So you should also
+   put only ONE pid.
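The two answers above (check the result of write(), and write only one pid per call) can be sketched in C. This is a minimal illustration under stated assumptions, not code from the patch: attach_pid() is a hypothetical helper name, and the path passed in is assumed to be the "tasks" file of an already created cgroup.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly one pid to a cgroup "tasks"
 * file and report any error from open(), write() or close().
 * Returns 0 on success or a negative errno value on failure.
 */
static int attach_pid(const char *tasks_path, pid_t pid)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	int fd, err = 0;
	ssize_t n;

	if (len < 0 || (size_t)len >= sizeof(buf))
		return -EINVAL;

	fd = open(tasks_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* One pid per write(), so one error code maps to one pid. */
	n = write(fd, buf, (size_t)len);
	if (n < 0)
		err = -errno;
	else if (n != len)
		err = -EIO;

	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}
```

To attach several tasks, a caller would invoke the helper once per pid and inspect each return value separately.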
diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index b6d24c22274b..a741f658a3c9 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -220,7 +220,9 @@ A: The following happen, listed in no particular order :-)
  CPU_DOWN_PREPARE or CPU_DOWN_PREPARE_FROZEN, depending on whether or not the
  CPU is being offlined while tasks are frozen due to a suspend operation in
  progress
- All process is migrated away from this outgoing CPU to a new CPU
+- All processes are migrated away from this outgoing CPU to new CPUs.
+  The new CPU is chosen from each process' current cpuset, which may be
+  a subset of all online CPUs.
 - All interrupts targeted to this CPU is migrated to a new CPU
 - timers/bottom half/task lets are also migrated to a new CPU
 - Once all services are migrated, kernel calls an arch specific routine
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ec9de6917f01..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
+Modified by Paul Menage <menage@google.com>
 CONTENTS:
 =========
@@ -16,9 +17,9 @@ CONTENTS:
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
-  1.5 What does notify_on_release do ?
+  1.5 What is memory_pressure ?
-  1.6 What is memory_pressure ?
+  1.6 What is memory spread ?
-  1.7 What is memory spread ?
+  1.7 What is sched_load_balance ?
  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
  2.1 Basic Usage
@@ -44,18 +45,19 @@ hierarchy visible in a virtual file system.  These are the essential
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
-Each task has a pointer to a cpuset.  Multiple tasks may reference
+Cpusets use the generic cgroup subsystem described in
-the same cpuset.  Requests by a task, using the sched_setaffinity(2)
+Documentation/cgroups.txt.
-system call to include CPUs in its CPU affinity mask, and using the
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
+Requests by a task, using the sched_setaffinity(2) system call to
-in its memory policy, are both filtered through that tasks cpuset,
+include CPUs in its CPU affinity mask, and using the mbind(2) and
-filtering out any CPUs or Memory Nodes not in that cpuset.  The
+set_mempolicy(2) system calls to include Memory Nodes in its memory
-scheduler will not schedule a task on a CPU that is not allowed in
+policy, are both filtered through that task's cpuset, filtering out any
-its cpus_allowed vector, and the kernel page allocator will not
+CPUs or Memory Nodes not in that cpuset.  The scheduler will not
-allocate a page on a node that is not allowed in the requesting tasks
+schedule a task on a CPU that is not allowed in its cpus_allowed
-mems_allowed vector.
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting task's mems_allowed vector.
-User level code may create and destroy cpusets by name in the cpuset
+User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
 specify and query to which cpuset a task is assigned, and list the
@@ -115,7 +117,7 @@ Cpusets extends these two mechanisms as follows:
 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cpuset structure.
+   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that tasks cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +147,10 @@ into the rest of the kernel, none in performance critical paths:
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.
-In addition a new file system, of type "cpuset" may be mounted,
+You should mount the "cgroup" filesystem type in order to enable
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
+browsing and modifying the cpusets presently known to the kernel.  No
-presently known to the kernel.  No new system calls are added for
+new system calls are added for cpusets - all support for querying and
-cpusets - all support for querying and modifying cpusets is via
+modifying cpusets is via this cpuset file system.
-this cpuset file system.
-Each task under /proc has an added file named 'cpuset', displaying
-the cpuset name, as the path relative to the root of the cpuset file
-system.
 The /proc/<pid>/status file for each task has two added lines,
 displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -163,16 +160,15 @@ in the format seen in the following example:
  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Mems_allowed:   ffffffff,ffffffff
-Each cpuset is represented by a directory in the cpuset file system
+Each cpuset is represented by a directory in the cgroup file system
-containing the following files describing that cpuset:
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - memory_migrate flag: if set, move pages to cpusets nodes
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
- - tasks: list of tasks (by pid) attached to that cpuset
- - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
 - memory_pressure: measure of how much paging pressure in cpuset
 In addition, the root cpuset only has the following file:
@@ -237,21 +233,7 @@ such as requests from interrupt handlers, is allowed to be taken
 outside even a mem_exclusive cpuset.
-1.5 What does notify_on_release do ?
+1.5 What is memory_pressure ?
------------------------------------
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
-the last task in the cpuset leaves (exits or attaches to some other
-cpuset) and the last child cpuset of that cpuset is removed, then
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
-pathname (relative to the mount point of the cpuset file system) of the
-abandoned cpuset.  This enables automatic removal of abandoned cpusets.
-The default value of notify_on_release in the root cpuset at system
-boot is disabled (0).  The default value of other cpusets at creation
-is the current value of their parents notify_on_release setting.
-1.6 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
@@ -308,7 +290,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
 times 1000.
-1.7 What is memory spread ?
+1.6 What is memory spread ?
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
@@ -378,6 +360,142 @@ policy, especially for jobs that might have one thread reading in the
 data set, the memory allocation across the nodes in the jobs cpuset
 can become very uneven.
+1.7 What is sched_load_balance ?
+--------------------------------
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks.  If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced.  So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+When the per-cpuset flag "sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets have this flag enabled.
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendent cpusets.  Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both.  We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration.  If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above.  In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.)  When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+The kernel commits to user space that it will avoid load balancing
+where it can.  It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system. This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (cpumask_t) in the partition.
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
 1.8 How do I use cpusets ?
 --------------------------
@@ -469,7 +587,7 @@ than stress the kernel.
 To start a new job that is to be contained within a cpuset, the steps are:
 1) mkdir /dev/cpuset
- 2) mount -t cpuset none /dev/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +599,7 @@ For example, the following sequence of commands will setup a cpuset
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:
-  mount -t cpuset none /dev/cpuset
+  mount -t cgroup -ocpuset cpuset /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
@@ -513,7 +631,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
 virtual filesystem.
 To mount it, type:
-# mount -t cpuset none /dev/cpuset
+# mount -t cgroup -o cpuset cpuset /dev/cpuset
 Then under /dev/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system. For instance, /dev/cpuset
@@ -556,6 +674,18 @@ To remove a cpuset, just use rmdir:
 This will fail if the cpuset is in use (has cpusets inside, or has
 processes attached).
+Note that for legacy reasons, the "cpuset" filesystem exists as a
+wrapper around the cgroup filesystem.
+The command
+mount -t cpuset X /dev/cpuset
+is equivalent to
+mount -t cgroup -ocpuset X /dev/cpuset
+echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
 2.2 Adding/removing cpus
 ------------------------
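The partition rule described in section 1.7.1 of the cpusets.txt hunk above (overlapping cpusets with 'sched_load_balance' enabled end up in one sched domain, and the resulting domains are pairwise disjoint) can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel's cpuset code: a single unsigned long stands in for a cpumask_t, and ex_build_partition() is a hypothetical name.

```c
#include <stddef.h>

/*
 * Userspace sketch only: one unsigned long stands in for a cpumask_t,
 * so at most sizeof(unsigned long) * 8 CPUs are modelled.
 *
 * cpus[i]    - CPUs allowed in (hypothetical) cpuset i
 * balance[i] - nonzero if cpuset i has sched_load_balance enabled
 * doms[]     - output with room for n masks; receives the partition
 *
 * Returns the number of sched domains in the partition.
 */
size_t ex_build_partition(const unsigned long *cpus, const int *balance,
			  size_t n, unsigned long *doms)
{
	size_t ndoms = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long m = cpus[i];
		size_t j = 0;

		/* Cpusets with the flag disabled request no balancing. */
		if (!balance[i] || !m)
			continue;

		/* Absorb every existing domain that overlaps this cpuset. */
		while (j < ndoms) {
			if (doms[j] & m) {
				m |= doms[j];
				doms[j] = doms[--ndoms];
			} else {
				j++;
			}
		}
		doms[ndoms++] = m;
	}
	return ndoms;
}
```

Because the domains already collected are pairwise disjoint, one scan per cpuset is enough: a domain that did not overlap the cpuset before a merge cannot overlap it afterwards. If the top cpuset (all CPUs) has the flag enabled, every other mask is absorbed into it and a single domain results, matching the text above.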
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 280ec06573e6..6b0f963f5379 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -82,6 +82,41 @@ Who:	Dominik Brodowski <linux@brodo.de>
 ---------------------------
+What:   sys_sysctl
+When:   September 2010
+Option: CONFIG_SYSCTL_SYSCALL
+Why:    The same information is available in a more convenient form
+        from /proc/sys, and none of the sysctl variables appear to be
+        important performance wise.
+        Binary sysctls are a long standing source of subtle kernel
+        bugs and security issues.
+        When I looked several months ago all I could find after
+        searching several distributions were 5 user space programs and
+        glibc (which falls back to /proc/sys) using this syscall.
+        The man page for sysctl(2) documents it as unusable for user
+        space programs.
+        sysctl(2) is not generally ABI compatible for a 32bit user
+        space application between a 64bit and a 32bit kernel.
+        For the last several months the policy has been no new binary
+        sysctls and no one has put forward an argument to use them.
+        Binary sysctl issues seem to keep appearing, so properly
+        deprecating them (with a warning to user space) and a
+        2 year grace period will mean eventually we can kill
+        them and end the pain.
+        In the meantime individual binary sysctls can be dealt with
+        in a piecewise fashion.
+Who:    Eric Biederman <ebiederm@xmission.com>
+---------------------------
 What:  a.out interpreter support for ELF executables
 When:  2.6.25
 Files: fs/binfmt_elf.c
@@ -184,13 +219,6 @@ Who:	Jean Delvare <khali@linux-fr.org>,
 ---------------------------
-What:  drivers depending on OBSOLETE_OSS
-When:  options in 2.6.22, code in 2.6.24
-Why:   OSS drivers with ALSA replacements
-Who:   Adrian Bunk <bunk@stusta.de>
---------------------------
 What:   ACPI procfs interface
 When:   July 2008
 Why:    ACPI sysfs conversion should be finished by January 2008.
diff --git a/Documentation/input/input-programming.txt b/Documentation/input/input-programming.txt
index d9d523099bb7..4d932dc66098 100644
--- a/Documentation/input/input-programming.txt
+++ b/Documentation/input/input-programming.txt
@@ -42,8 +42,8 @@ static int __init button_init(void)
                goto err_free_irq;
        }
-        button_dev->evbit[0] = BIT(EV_KEY);
+        button_dev->evbit[0] = BIT_MASK(EV_KEY);
-        button_dev->keybit[LONG(BTN_0)] = BIT(BTN_0);
+        button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);
        error = input_register_device(button_dev);
        if (error) {
@@ -217,14 +217,15 @@ If you don't need absfuzz and absflat, you can set them to zero, which mean
 that the thing is precise and always returns to exactly the center position
 (if it has any).
-1.4 NBITS(), LONG(), BIT()
+1.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK()
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-These three macros from input.h help some bitfield computations:
+These three macros from bitops.h help some bitfield computations:
-        NBITS(x) - returns the length of a bitfield array in longs for x bits
+        BITS_TO_LONGS(x) - returns the length of a bitfield array in longs for
-        LONG(x)  - returns the index in the array in longs for bit x
+                           x bits
-        BIT(x)   - returns the index in a long for bit x
+        BIT_WORD(x)      - returns the index in the array in longs for bit x
+        BIT_MASK(x)      - returns the mask for bit x within its long
 1.5 The id* and name fields
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
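The three bitfield helpers listed in section 1.4 of the input-programming.txt hunk above can be tried out in userspace. The EX_ macros below are local stand-ins written for this sketch, assumed to mirror the usual bitops.h definitions; EX_KEY_MAX and EX_BTN_0 are hypothetical bit numbers chosen only for illustration.

```c
#include <limits.h>

/* Local stand-ins for the kernel macros; names are hypothetical. */
#define EX_BITS_PER_LONG	(CHAR_BIT * sizeof(long))
#define EX_BITS_TO_LONGS(nr)	(((nr) + EX_BITS_PER_LONG - 1) / EX_BITS_PER_LONG)
#define EX_BIT_WORD(nr)		((nr) / EX_BITS_PER_LONG)
#define EX_BIT_MASK(nr)		(1UL << ((nr) % EX_BITS_PER_LONG))

#define EX_KEY_MAX	0x1ff	/* hypothetical highest bit number */
#define EX_BTN_0	0x100	/* hypothetical bit number to set */

/* A bitfield large enough to hold bits 0..EX_KEY_MAX. */
static unsigned long ex_keybit[EX_BITS_TO_LONGS(EX_KEY_MAX + 1)];

/* Set bit nr: pick the long that holds it, then OR in its mask. */
static void ex_set_bit(unsigned int nr, unsigned long *map)
{
	map[EX_BIT_WORD(nr)] |= EX_BIT_MASK(nr);
}

/* Return 1 if bit nr is set in map, 0 otherwise. */
static int ex_test_bit(unsigned int nr, const unsigned long *map)
{
	return (map[EX_BIT_WORD(nr)] & EX_BIT_MASK(nr)) != 0;
}
```

Calling ex_set_bit(EX_BTN_0, ex_keybit) follows the same word-then-mask pattern as the keybit[BIT_WORD(BTN_0)] line in the driver example, except that it ORs the mask in rather than overwriting the whole long.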
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 1b37b28cc234..d0ac72cc19ff 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -231,6 +231,32 @@ Dump-capture kernel config options (Arch Dependent, ia64)
  any space below the alignment point will be wasted.
+Extended crashkernel syntax
+===========================
+While the "crashkernel=size[@offset]" syntax is sufficient for most
+configurations, sometimes it's handy to have the reserved memory dependent
+on the value of System RAM -- that's mostly for distributors that pre-setup
+the kernel command line to avoid an unbootable system after some memory has
+been removed from the machine.
+The syntax is:
+    crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
+    range=start-[end]
+For example:
+    crashkernel=512M-2G:64M,2G-:128M
+This would mean:
+    1) if the RAM is smaller than 512M, then don't reserve anything
+       (this is the "rescue" case)
+    2) if the RAM size is between 512M and 2G, then reserve 64M
+    3) if the RAM size is larger than 2G, then reserve 128M
 Boot into System Kernel
 =======================
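The range selection described in the "Extended crashkernel syntax" section of the kdump.txt hunk above can be sketched as a small userspace parser. This is an illustration, not the kernel's parser: ex_parse_mem() and ex_crash_size_for_ram() are hypothetical names, the optional @offset is ignored, and the sketch assumes each range includes its start and excludes its end.

```c
#include <stdlib.h>

/* Parse "amount[KMG]" at *s and advance *s past it. */
static unsigned long long ex_parse_mem(const char **s)
{
	char *end;
	unsigned long long v = strtoull(*s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*s = end;
	return v;
}

/*
 * Return the size to reserve for a system with 'ram' bytes, given a
 * spec such as "512M-2G:64M,2G-:128M".  Returns 0 if no range matches
 * or the spec is malformed.  Assumption: start <= ram < end.
 */
unsigned long long ex_crash_size_for_ram(const char *spec,
					 unsigned long long ram)
{
	const char *p = spec;

	while (*p && *p != '@') {
		unsigned long long start, end = ~0ULL, size;

		start = ex_parse_mem(&p);
		if (*p != '-')
			return 0;
		p++;
		if (*p != ':')		/* an omitted end means "no upper limit" */
			end = ex_parse_mem(&p);
		if (*p != ':')
			return 0;
		p++;
		size = ex_parse_mem(&p);
		if (ram >= start && ram < end)
			return size;
		if (*p != ',')
			break;
		p++;
	}
	return 0;
}
```

With the example string from the text, a machine with less than 512M gets nothing reserved, one with 1G gets 64M, and one with 4G gets 128M.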
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 98cf90f2631d..0a3fed445249 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -479,6 +479,16 @@ and is between 256 and 4096 characters. It is defined in the file
                        UART at the specified I/O port or MMIO address.
                        The options are the same as for ttyS, above.
+        no_console_suspend
+                        [HW] Never suspend the console
+                        Disable suspending of consoles during suspend and
+                        hibernate operations.  Once disabled, debugging
+                        messages can reach various consoles while the rest
+                        of the system is being put to sleep (ie, while
+                        debugging driver suspend/resume hooks).  This may
+                        not work reliably with all consoles, but is known
+                        to work with serial and VGA consoles.
        cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
                        Format:
                        <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
@@ -487,6 +497,13 @@ and is between 256 and 4096 characters. It is defined in the file
                        [KNL] Reserve a chunk of physical memory to
                        hold a kernel to switch to with kexec on panic.
+        crashkernel=range1:size1[,range2:size2,...][@offset]
+                        [KNL] Same as above, but depends on the memory
+                        in the running system. The syntax of range is
+                        start-[end] where start and end are both
+                        a memory unit (amount[KMG]). See also
+                        Documentation/kdump/kdump.txt for an example.
        cs4232=         [HW,OSS]
                        Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq>
diff --git a/Documentation/markers.txt b/Documentation/markers.txt
new file mode 100644
index 000000000000..295a71bc301e
--- /dev/null
+++ b/Documentation/markers.txt
@@ -0,0 +1,81 @@
+                     Using the Linux Kernel Markers
+                            Mathieu Desnoyers
+This document introduces Linux Kernel Markers and their use. It provides
+examples of how to insert markers in the kernel and connect probe functions to
+them and provides some examples of probe functions.
+* Purpose of markers
+A marker placed in code provides a hook to call a function (probe) that you can
+provide at runtime. A marker can be "on" (a probe is connected to it) or "off"
+(no probe is attached). When a marker is "off" it has no effect, except for
+adding a tiny time penalty (checking a condition for a branch) and space
+penalty (adding a few bytes for the function call at the end of the
+instrumented function and adding a data structure in a separate section).  When a
+marker is "on", the function you provide is called each time the marker is
+executed, in the execution context of the caller. When the function provided
+ends its execution, it returns to the caller (continuing from the marker site).
+You can put markers at important locations in the code. Markers are
+lightweight hooks that can pass an arbitrary number of parameters,
+described in a printk-like format string, to the attached probe function.
+They can be used for tracing and performance accounting.
+* Usage
+In order to use the macro trace_mark, you should include linux/marker.h.
+#include <linux/marker.h>
+And,
+trace_mark(subsystem_event, "%d %s", someint, somestring);
+Where :
+- subsystem_event is an identifier unique to your event
+    - subsystem is the name of your subsystem.
+    - event is the name of the event to mark.
+- "%d %s" is the formatted string for the serializer.
+- someint is an integer.
+- somestring is a char pointer.
+Connecting a function (probe) to a marker is done by providing a probe (function
+to call) for the specific marker through marker_probe_register() and can be
+activated by calling marker_arm(). Marker deactivation can be done by calling
+marker_disarm() as many times as marker_arm() has been called. Removing a probe
+is done through marker_probe_unregister(); it will disarm the probe and make
+sure there is no caller left using the probe when it returns. Probe removal is
+preempt-safe because preemption is disabled around the probe call. See the
+"Probe example" section below for a sample probe module.
+The marker mechanism supports inserting multiple instances of the same marker.
+Markers can be put in inline functions, inlined static functions, and
+unrolled loops as well as regular functions.
+The naming scheme "subsystem_event" is suggested here as a convention intended
+to limit collisions. Marker names are global to the kernel: they are considered
+as being the same whether they are in the core kernel image or in modules.
+Conflicting format strings for markers with the same name will cause the
+markers with a mismatched format string not to be armed, and will output
+a printk warning which identifies the inconsistency:
+"Format mismatch for probe probe_name (format), marker (format)"
+* Probe / marker example
+See the example provided in samples/markers/src
+Compile them with your kernel.
+Run, as root :
+modprobe marker-example (insmod order is not important)
+modprobe probe-example
+cat /proc/marker-example (returns an expected error)
+rmmod marker-example probe-example
+dmesg
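The "on"/"off" behaviour described in the markers.txt hunk above (a cheap branch when no probe is attached, a call into the probe with printk-like arguments when one is) can be modelled in userspace. This is a toy model for illustration only, not the kernel's marker implementation: every ex_ name is hypothetical, and arming is reduced to setting a flag.

```c
#include <stdarg.h>
#include <stddef.h>

/* A probe receives the marker's format string and its arguments. */
typedef void (*ex_probe_fn)(const char *fmt, va_list args);

struct ex_marker {
	const char *name;	/* "subsystem_event" style identifier */
	const char *format;	/* printk-like description of the arguments */
	ex_probe_fn probe;	/* NULL while no probe is registered */
	int armed;		/* nonzero while the marker is "on" */
};

/* Slow path: package the arguments and hand them to the probe. */
static void ex_marker_call(struct ex_marker *m, ...)
{
	va_list args;

	va_start(args, m);
	m->probe(m->format, args);
	va_end(args);
}

/* Fast path: when the marker is "off" only this test is executed. */
#define ex_trace_mark(m, ...)					\
	do {							\
		if ((m)->armed && (m)->probe)			\
			ex_marker_call((m), __VA_ARGS__);	\
	} while (0)

/* Sample probe for a "%d %s" marker: record the integer argument. */
static int ex_hits;
static int ex_last_int;

static void ex_probe(const char *fmt, va_list args)
{
	(void)fmt;
	ex_last_int = va_arg(args, int);
	ex_hits++;
}

static struct ex_marker ex_demo = {
	.name = "subsystem_event",
	.format = "%d %s",
};
```

While ex_demo.armed is zero, ex_trace_mark(&ex_demo, someint, somestring) does nothing beyond the branch; after setting ex_demo.probe = ex_probe and ex_demo.armed = 1, each call runs the probe in the caller's context and then continues from the marker site.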
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 650657c54733..4e17beba2379 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1479,7 +1479,8 @@ kernel.
 Any atomic operation that modifies some state in memory and returns information
 about the state (old or new) implies an SMP-conditional general memory barrier
-(smp_mb()) on each side of the actual operation.  These include:
+(smp_mb()) on each side of the actual operation (with the exception of
+explicit lock operations, described later).  These include:
        xchg();
        cmpxchg();
@@ -1536,10 +1537,19 @@ If they're used for constructing a lock of some description, then they probably
 do need memory barriers as a lock primitive generally has to do things in a
 specific order.
 Basically, each usage case has to be carefully considered as to whether memory
 barriers are needed or not.
+The following operations are special locking primitives:
+        test_and_set_bit_lock();
+        clear_bit_unlock();
+        __clear_bit_unlock();
+These implement LOCK-class and UNLOCK-class operations. These should be used in
+preference to other operations when implementing locking primitives, because
+their implementations can be optimised on many architectures.
 [!] Note that special memory barrier primitives are available for these
 situations because on some CPUs the atomic instructions used imply full memory
 barriers, and so barrier instructions are superfluous in conjunction with them,
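The LOCK-class and UNLOCK-class behaviour that the memory-barriers.txt hunk above attributes to test_and_set_bit_lock() and clear_bit_unlock() can be approximated in userspace with C11 atomics. This is an analogy under stated assumptions, not the kernel implementation: the ex_ functions are hypothetical, acquire ordering stands in for LOCK, release ordering stands in for UNLOCK, and nr is assumed to be smaller than the number of bits in an unsigned long.

```c
#include <stdatomic.h>

/*
 * Set bit nr and return its previous value.  Acquire ordering keeps
 * later accesses from being reordered before a successful lock.
 */
static inline int ex_test_and_set_bit_lock(unsigned int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;
	unsigned long old;

	old = atomic_fetch_or_explicit(word, mask, memory_order_acquire);
	return (old & mask) != 0;
}

/*
 * Clear bit nr.  Release ordering keeps earlier accesses from being
 * reordered after the unlock.
 */
static inline void ex_clear_bit_unlock(unsigned int nr, atomic_ulong *word)
{
	atomic_fetch_and_explicit(word, ~(1UL << nr), memory_order_release);
}
```

A caller would spin while ex_test_and_set_bit_lock() returns 1, run its critical section once it returns 0, and finish with ex_clear_bit_unlock(). Neither helper uses a full barrier on both sides, which is the optimisation opportunity the text refers to.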
diff --git a/Documentation/mips/00-INDEX b/Documentation/mips/00-INDEX
index 9df8a2eac7b4..3f13bf8043d2 100644
--- a/Documentation/mips/00-INDEX
+++ b/Documentation/mips/00-INDEX
@@ -4,5 +4,3 @@ AU1xxx_IDE.README
        - README for MIPS AU1XXX IDE driver.
 GT64120.README
        - README for dir with info on MIPS boards using GT-64120 or GT-64120A.
-time.README
-        - README for MIPS time services.
diff --git a/Documentation/mips/time.README b/Documentation/mips/time.README
deleted file mode 100644
index a4ce603ed3b3..000000000000
--- a/Documentation/mips/time.README
+++ /dev/null
@@ -1,173 +0,0 @@
-README for MIPS time services
-Jun Sun
-jsun@mvista.com or jsun@junsun.net
-ABOUT
-----
-This file describes the new arch/mips/kernel/time.c, related files and the 
-services they provide. 
-If you are short in patience and just want to know how to use time.c for a 
-new board or convert an existing board, go to the last section.
-FILES, COMPATABILITY AND CONFIGS
---------------------------------
-The old arch/mips/kernel/time.c is renamed to old-time.c.
-A new time.c is put there, together with include/asm-mips/time.h.
-Two configs variables are introduced, CONFIG_OLD_TIME_C and CONFIG_NEW_TIME_C.
-So we allow boards using 
-        1) old time.c (CONFIG_OLD_TIME_C)
-        2) new time.c (CONFIG_NEW_TIME_C)
-        3) neither (their own private time.c)
-However, it is expected every board will move to the new time.c in the near
-future.
-WHAT THE NEW CODE PROVIDES?
--------------------------- 
-The new time code provide the following services:
-  a) Implements functions required by Linux common code:
-        time_init
-  b) provides an abstraction of RTC and null RTC implementation as default.
-        extern unsigned long (*rtc_get_time)(void);
-        extern int (*rtc_set_time)(unsigned long);
-  c) high-level and low-level timer interrupt routines where the timer
-     interrupt source  may or may not be the CPU timer.  The high-level
-     routine is dispatched through do_IRQ() while the low-level is
-     dispatched in assemably code (usually int-handler.S)
-WHAT THE NEW CODE REQUIRES?
---------------------------
-For the new code to work properly, each board implementation needs to supply
-the following functions or values:
-  a) board_time_init - a function pointer.  Invoked at the beginnig of
-     time_init().  It is optional.
-        1. (optional) set up RTC routines
-        2. (optional) calibrate and set the mips_hpt_frequency
-  b) plat_timer_setup - a function pointer.  Invoked at the end of time_init()
-        1. (optional) over-ride any decisions made in time_init()
-        2. set up the irqaction for timer interrupt.
-        3. enable the timer interrupt
-  c) (optional) board-specific RTC routines.
-  d) (optional) mips_hpt_frequency - It must be definied if the board
-     is using CPU counter for timer interrupt.
-PORTING GUIDE
-------------
-Step 1: decide how you like to implement the time services.
-  a) does this board have a RTC?  If yes, implement the two RTC funcs.
-  b) does the CPU have counter/compare registers? 
-     If the answer is no, you need a timer to provide the timer interrupt
-     at 100 HZ speed.
-  c) The following sub steps assume your CPU has counter register.
-     Do you plan to use the CPU counter register as the timer interrupt
-     or use an exnternal timer?
-     In order to use CPU counter register as the timer interrupt source, you
-     must know the counter speed (mips_hpt_frequency).  It is usually the
-     same as the CPU speed or an integral divisor of it.
-  d) decide on whether you want to use high-level or low-level timer
-     interrupt routines.  The low-level one is presumably faster, but should
-     not make too mcuh difference.
-Step 2:  the machine setup() function
-  If you supply board_time_init(), set the function poointer.
-Step 3: implement rtc routines, board_time_init() and plat_timer_setup()
-  if needed.
-  board_time_init() -
-        a) (optional) set up RTC routines,
-        b) (optional) calibrate and set the mips_hpt_frequency
-            (only needed if you intended to use cpu counter as timer interrupt
-             source)
-  plat_timer_setup() -
-        a) (optional) over-write any choices made above by time_init().
-        b) machine specific code should setup the timer irqaction.
-        c) enable the timer interrupt
-  If the RTC chip is a common chip, I suggest the routines are put under
-  arch/mips/libs.  For example, for DS1386 chip, one would create
-  rtc-ds1386.c under arch/mips/lib directory.  Add the following line to
-  the arch/mips/lib/Makefile:
-        obj-$(CONFIG_DDB5476) += rtc-ds1386.o
-Step 4: if you are using low-level timer interrupt, change your interrupt
-  dispathcing code to check for timer interrupt and jump to 
-  ll_timer_interrupt() directly  if one is detected.
-Step 5: Modify arch/mips/config.in and add CONFIG_NEW_TIME_C to your machine.
-  Modify the appropriate defconfig if applicable.
-Final notes: 
-For some tricky cases, you may need to add your own wrapper functions 
-for some of the functions in time.c.  
-For example, you may define your own timer interrupt routine, which does
-some of its own processing and then calls timer_interrupt().
-You can also over-ride any of the built-in functions (RTC routines
-and/or timer interrupt routine).
-PORTING NOTES FOR SMP
----------------------
-If you have a SMP box, things are slightly more complicated.
-The time service running every jiffy is logically divided into two parts:
-  1) the one for the whole system  (defined in timer_interrupt())
-  2) the one that should run for each CPU (defined in local_timer_interrupt())
-You need to decide on your timer interrupt sources.
-  case 1) - whole system has only one timer interrupt delivered to one CPU
-        In this case, you set up timer interrupt as in UP systems.  In addtion,
-        you need to set emulate_local_timer_interrupt to 1 so that other
-        CPUs get to call local_timer_interrupt().
-        THIS IS CURRENTLY NOT IMPLEMNETED.  However, it is rather easy to write
-        one should such a need arise.  You simply make a IPI call.
-  case 2) - each CPU has a separate timer interrupt
-        In this case, you need to set up IRQ such that each of them will
-        call local_timer_interrupt().  In addition, you need to arrange
-        one and only one of them to call timer_interrupt().
-        You can also do the low-level version of those interrupt routines,
-        following similar dispatching routes described above.
diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt
index 8f2302415eff..265fcdcb8e5f 100644
--- a/Documentation/parport-lowlevel.txt
+++ b/Documentation/parport-lowlevel.txt
@@ -25,7 +25,6 @@ Global functions:
  parport_open
  parport_close
  parport_device_id
-  parport_device_num
  parport_device_coords
  parport_find_class
  parport_find_device
@@ -735,7 +734,7 @@ NULL is returned.
 SEE ALSO
-parport_register_device, parport_device_num
+parport_register_device
 parport_close - unregister device for particular device number
 -------------
@@ -787,29 +786,7 @@ Many devices have ill-formed IEEE 1284 Device IDs.
 SEE ALSO
-parport_find_class, parport_find_device, parport_device_num
+parport_find_class, parport_find_device
-parport_device_num - convert device coordinates to device number
------------------
-SYNOPSIS
-#include <linux/parport.h>
-int parport_device_num (int parport, int mux, int daisy);
-DESCRIPTION
-Convert between device coordinates (port, multiplexor, daisy chain
-address) and device number (zero-based).
-RETURN VALUE
-Device number, or -1 if no device at given coordinates.
-SEE ALSO
-parport_device_coords, parport_open, parport_device_id
 parport_device_coords - convert device number to device coordinates
 ------------------
@@ -833,7 +810,7 @@ Zero on success, in which case the coordinates are (*parport, *mux,
 SEE ALSO
-parport_device_num, parport_open, parport_device_id
+parport_open, parport_device_id
 parport_find_class - find a device by its class
 ------------------
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index 1a85e2b964dc..57aef2f6e0de 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -78,8 +78,8 @@ c) Advanced debugging
 In case the STD does not work on your system even in the minimal configuration
 and compiling more drivers as modules is not practical or some modules cannot
 be unloaded, you can use one of the more advanced debugging techniques to find
-the problem.  First, if there is a serial port in your box, you can set the
+the problem.  First, if there is a serial port in your box, you can boot the
-CONFIG_DISABLE_CONSOLE_SUSPEND kernel configuration option and try to log kernel
+kernel with the 'no_console_suspend' parameter and try to log kernel
 messages using the serial console.  This may provide you with some information
 about the reasons of the suspend (resume) failure.  Alternatively, it may be
 possible to use a FireWire port for debugging with firescope
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 04dc1cf9d215..38b57248fd61 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -19,12 +19,13 @@ we only consider hibernation, but the description also applies to suspend).
 Namely, as the first step of the hibernation procedure the function
 freeze_processes() (defined in kernel/power/process.c) is called.  It executes
 try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
-sends a fake signal to each of them.  A task that receives such a signal and has
+either wakes them up, if they are kernel threads, or sends fake signals to them,
-TIF_FREEZE set, should react to it by calling the refrigerator() function
+if they are user space processes.  A task that has TIF_FREEZE set should react
-(defined in kernel/power/process.c), which sets the task's PF_FROZEN flag,
+to it by calling the refrigerator() function (defined in
-changes its state to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is
+kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state
-cleared for it.  Then, we say that the task is 'frozen' and therefore the set of
+to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
-functions handling this mechanism is called 'the freezer' (these functions are
+Then, we say that the task is 'frozen' and therefore the set of functions
+handling this mechanism is referred to as 'the freezer' (these functions are
 defined in kernel/power/process.c and include/linux/freezer.h).  User space
 processes are generally frozen before kernel threads.
@@ -35,21 +36,27 @@ task enter refrigerator() if the flag is set.
 For user space processes try_to_freeze() is called automatically from the
 signal-handling code, but the freezable kernel threads need to call it
-explicitly in suitable places.  The code to do this may look like the following:
+explicitly in suitable places or use the wait_event_freezable() or
+wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
+that combine interruptible sleep with checking if TIF_FREEZE is set and calling
+try_to_freeze().  The main loop of a freezable kernel thread may look like the
+following one:
+        set_freezable();
        do {
                hub_events();
-                wait_event_interruptible(khubd_wait,
+                wait_event_freezable(khubd_wait,
-                                        !list_empty(&hub_event_list));
+                                !list_empty(&hub_event_list) ||
-                try_to_freeze();
+                                kthread_should_stop());
-        } while (!signal_pending(current));
+        } while (!kthread_should_stop() || !list_empty(&hub_event_list));
 (from drivers/usb/core/hub.c::hub_thread()).
 If a freezable kernel thread fails to call try_to_freeze() after the freezer has
 set TIF_FREEZE for it, the freezing of tasks will fail and the entire
 hibernation operation will be cancelled.  For this reason, freezable kernel
-threads must call try_to_freeze() somewhere.
+threads must call try_to_freeze() somewhere or use one of the
+wait_event_freezable() and wait_event_freezable_timeout() macros.
 After the system memory state has been restored from a hibernation image and
 devices have been reinitialized, the function thaw_processes() is called in
@@ -81,7 +88,16 @@ hibernation image has been created and before the system is finally powered off.
 The majority of these are user space processes, but if any of the kernel threads
 may cause something like this to happen, they have to be freezable.
-2. The second reason is to prevent user space processes and some kernel threads
+2. Next, to create the hibernation image we need to free a sufficient amount of
+memory (approximately 50% of available RAM) and we need to do that before
+devices are deactivated, because we generally need them for swapping out.  Then,
+after the memory for the image has been freed, we don't want tasks to allocate
+additional memory and we prevent them from doing that by freezing them earlier.
+[Of course, this also means that device drivers should not allocate substantial
+amounts of memory from their .suspend() callbacks before hibernation, but this
+is a separate issue.]
+3. The third reason is to prevent user space processes and some kernel threads
 from interfering with the suspending and resuming of devices.  A user space
 process running on a second CPU while we are suspending devices may, for
 example, be troublesome and without the freezing of tasks we would need some
@@ -111,7 +127,7 @@ frozen before the driver's .suspend() callback is executed and it will be
 thawed after the driver's .resume() callback has run, so it won't be accessing
 the device while it's suspended.
-3. Another reason for freezing tasks is to prevent user space processes from
+4. Another reason for freezing tasks is to prevent user space processes from
 realizing that hibernation (or suspend) operation takes place.  Ideally, user
 space processes should not notice that such a system-wide operation has occurred
 and should continue running without any problems after the restore (or resume
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index fd5192a8fa8a..e67211fe0ee2 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -20,7 +20,7 @@ states.
 /sys/power/disk controls the operating mode of the suspend-to-disk
 mechanism. Suspend-to-disk can be handled in several ways. We have a
 few options for putting the system to sleep - using the platform driver
-(e.g. ACPI or other pm_ops), powering off the system or rebooting the
+(e.g. ACPI or other suspend_ops), powering off the system or rebooting the
 system (for testing).
 Additionally, /sys/power/disk can be used to turn on one of the two testing
diff --git a/Documentation/sound/oss/es1371 b/Documentation/sound/oss/es1371
deleted file mode 100644
index c3151266771c..000000000000
--- a/Documentation/sound/oss/es1371
+++ /dev/null
@@ -1,64 +0,0 @@
-/proc/sound, /dev/sndstat
-------------------------
-/proc/sound and /dev/sndstat is not supported by the
-driver. To find out whether the driver succeeded loading,
-check the kernel log (dmesg).
-ALaw/uLaw sample formats
------------------------
-This driver does not support the ALaw/uLaw sample formats.
-ALaw is the default mode when opening a sound device
-using OSS/Free. The reason for the lack of support is
-that the hardware does not support these formats, and adding
-conversion routines to the kernel would lead to very ugly
-code in the presence of the mmap interface to the driver.
-And since xquake uses mmap, mmap is considered important :-)
-and no sane application uses ALaw/uLaw these days anyway.
-In short, playing a Sun .au file as follows:
-cat my_file.au > /dev/dsp
-does not work. Instead, you may use the play script from
-Chris Bagwell's sox-12.14 package (available from the URL
-below) to play many different audio file formats.
-The script automatically determines the audio format
-and does do audio conversions if necessary.
-http://home.sprynet.com/sprynet/cbagwell/projects.html
-Blocking vs. nonblocking IO
---------------------------
-Unlike OSS/Free this driver honours the O_NONBLOCK file flag
-not only during open, but also during read and write.
-This is an effort to make the sound driver interface more
-regular. Timidity has problems with this; a patch
-is available from http://www.ife.ee.ethz.ch/~sailer/linux/pciaudio.html.
-(Timidity patched will also run on OSS/Free).
-MIDI UART
---------
-The driver supports a simple MIDI UART interface, with
-no ioctl's supported.
-MIDI synthesizer
----------------
-This soundcard does not have any hardware MIDI synthesizer;
-MIDI synthesis has to be done in software. To allow this
-the driver/soundcard supports two PCM (/dev/dsp) interfaces.
-There is a freely available software package that allows
-MIDI file playback on this soundcard called Timidity.
-See http://www.cgs.fi/~tt/timidity/.
-Thomas Sailer
-t.sailer@alumni.ethz.ch
diff --git a/Documentation/thinkpad-acpi.txt b/Documentation/thinkpad-acpi.txt
index 60953d6c919d..3b95bbacc775 100644
--- a/Documentation/thinkpad-acpi.txt
+++ b/Documentation/thinkpad-acpi.txt
@@ -105,10 +105,15 @@ The version of thinkpad-acpi's sysfs interface is exported by the driver
 as a driver attribute (see below).
 Sysfs driver attributes are on the driver's sysfs attribute space,
-for 2.6.20 this is /sys/bus/platform/drivers/thinkpad_acpi/.
+for 2.6.23 this is /sys/bus/platform/drivers/thinkpad_acpi/ and
+/sys/bus/platform/drivers/thinkpad_hwmon/.
-Sysfs device attributes are on the driver's sysfs attribute space,
+Sysfs device attributes are on the thinkpad_acpi device sysfs attribute
-for 2.6.20 this is /sys/devices/platform/thinkpad_acpi/.
+space, for 2.6.23 this is /sys/devices/platform/thinkpad_acpi/.
+Sysfs device attributes for the sensors and fan are on the
+thinkpad_hwmon device's sysfs attribute space, but you should locate it
+by looking for a hwmon device whose name attribute is "thinkpad".
 Driver version
 --------------
@@ -766,7 +771,7 @@ Temperature sensors
 -------------------
 procfs: /proc/acpi/ibm/thermal
-sysfs device attributes: (hwmon) temp*_input
+sysfs device attributes: (hwmon "thinkpad") temp*_input
 Most ThinkPads include six or more separate temperature sensors but only
 expose the CPU temperature through the standard ACPI methods.  This
@@ -989,7 +994,9 @@ Fan control and monitoring: fan speed, fan enable/disable
 ---------------------------------------------------------
 procfs: /proc/acpi/ibm/fan
-sysfs device attributes: (hwmon) fan_input, pwm1, pwm1_enable
+sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1,
+                          pwm1_enable
+sysfs hwmon driver attributes: fan_watchdog
 NOTE NOTE NOTE: fan control operations are disabled by default for
 safety reasons.  To enable them, the module parameter "fan_control=1"
@@ -1131,7 +1138,7 @@ hwmon device attribute fan1_input:
        which can take up to two minutes.  May return rubbish on older
        ThinkPads.
-driver attribute fan_watchdog:
+hwmon driver attribute fan_watchdog:
        Fan safety watchdog timer interval, in seconds.  Minimum is
        1 second, maximum is 120 seconds.  0 disables the watchdog.
@@ -1233,3 +1240,9 @@ Sysfs interface changelog:
                layer, the radio switch generates input event EV_RADIO,
                and the driver enables hot key handling by default in
                the firmware.
+0x020000:       ABI fix: added a separate hwmon platform device and
+                driver, which must be located by name (thinkpad)
+                and the hwmon class for libsensors4 (lm-sensors 3)
+                compatibility.  Moved all hwmon attributes to this
+                new platform device.
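The thinkpad-acpi text above says the hwmon device should be located by its name attribute ("thinkpad") rather than by a fixed path. A minimal userspace sketch of that lookup follows. It is illustrative only and not part of the patch: the helper name find_hwmon_by_name() is hypothetical, and the /sys/class/hwmon root and the check of both hwmonN/name and hwmonN/device/name are assumptions about where the name attribute appears on a given kernel.

```python
from pathlib import Path
from typing import Optional


def find_hwmon_by_name(name: str,
                       hwmon_root: str = "/sys/class/hwmon") -> Optional[Path]:
    """Return the directory holding a hwmon 'name' attribute equal to `name`.

    Hypothetical helper.  Assumption: the name attribute lives either in
    the hwmon class device directory itself or in the directory behind
    its 'device' link, so both places are checked.
    """
    root = Path(hwmon_root)
    if not root.is_dir():
        return None
    for dev in sorted(root.glob("hwmon*")):
        for candidate in (dev, dev / "device"):
            try:
                found = (candidate / "name").read_text().strip()
            except OSError:
                # No readable name attribute here; try the next place.
                continue
            if found == name:
                return candidate
    return None


if __name__ == "__main__":
    base = find_hwmon_by_name("thinkpad")
    if base is None:
        print("no hwmon device named 'thinkpad' found")
    else:
        # temp*_input and fan1_input are the attributes named in the text.
        for attr in sorted(base.glob("temp*_input")) + [base / "fan1_input"]:
            try:
                print(attr.name, attr.read_text().strip())
            except OSError:
                print(attr.name, "unreadable")
```

Matching on the name attribute keeps the lookup independent of the hwmonN index, which can change between boots or as other sensor drivers load.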