-rw-r--r--  Documentation/cgroups/hugetlb.txt | 2
-rw-r--r--  Documentation/cgroups/memory.txt | 26
-rw-r--r--  Documentation/cgroups/resource_counter.txt | 197
-rw-r--r--  Documentation/devicetree/bindings/rtc/rtc-omap.txt | 9
-rw-r--r--  Documentation/devicetree/bindings/vendor-prefixes.txt | 1
-rw-r--r--  Documentation/kdump/kdump.txt | 7
-rw-r--r--  Documentation/kernel-parameters.txt | 3
-rw-r--r--  Documentation/sysctl/kernel.txt | 40
-rw-r--r--  MAINTAINERS | 6
-rw-r--r--  arch/arm/boot/dts/am335x-boneblack.dts | 4
-rw-r--r--  arch/arm/boot/dts/am33xx.dtsi | 2
-rw-r--r--  arch/arm64/include/asm/pgtable.h | 1
-rw-r--r--  arch/ia64/kernel/perfmon.c | 2
-rw-r--r--  arch/powerpc/include/asm/pgtable-ppc64.h | 1
-rw-r--r--  arch/powerpc/platforms/cell/spufs/inode.c | 4
-rw-r--r--  arch/sh/mm/numa.c | 2
-rw-r--r--  arch/sparc/include/asm/pgtable_64.h | 7
-rw-r--r--  arch/tile/kernel/early_printk.c | 19
-rw-r--r--  arch/tile/kernel/setup.c | 45
-rw-r--r--  arch/x86/include/asm/pgtable.h | 5
-rw-r--r--  drivers/base/Kconfig | 8
-rw-r--r--  drivers/rtc/Kconfig | 8
-rw-r--r--  drivers/rtc/interface.c | 21
-rw-r--r--  drivers/rtc/rtc-ab8500.c | 2
-rw-r--r--  drivers/rtc/rtc-ds1307.c | 127
-rw-r--r--  drivers/rtc/rtc-ds1374.c | 285
-rw-r--r--  drivers/rtc/rtc-isl12057.c | 83
-rw-r--r--  drivers/rtc/rtc-omap.c | 547
-rw-r--r--  drivers/rtc/rtc-pcf8563.c | 55
-rw-r--r--  drivers/rtc/rtc-sirfsoc.c | 66
-rw-r--r--  drivers/rtc/rtc-snvs.c | 39
-rw-r--r--  drivers/usb/storage/debug.c | 2
-rw-r--r--  fs/binfmt_elf.c | 40
-rw-r--r--  fs/binfmt_misc.c | 393
-rw-r--r--  fs/char_dev.c | 1
-rw-r--r--  fs/cifs/cifsacl.c | 2
-rw-r--r--  fs/cifs/cifssmb.c | 20
-rw-r--r--  fs/cifs/file.c | 4
-rw-r--r--  fs/cifs/sess.c | 2
-rw-r--r--  fs/cifs/smb2file.c | 4
-rw-r--r--  fs/cifs/smb2misc.c | 38
-rw-r--r--  fs/cifs/smb2ops.c | 2
-rw-r--r--  fs/cifs/smb2pdu.c | 2
-rw-r--r--  fs/cifs/smb2pdu.h | 28
-rw-r--r--  fs/file.c | 2
-rw-r--r--  fs/hfs/catalog.c | 14
-rw-r--r--  fs/ncpfs/ioctl.c | 1
-rw-r--r--  fs/nilfs2/file.c | 10
-rw-r--r--  fs/nilfs2/inode.c | 32
-rw-r--r--  fs/nilfs2/namei.c | 15
-rw-r--r--  fs/nilfs2/the_nilfs.c | 3
-rw-r--r--  fs/ocfs2/aops.c | 2
-rw-r--r--  fs/ocfs2/cluster/heartbeat.c | 4
-rw-r--r--  fs/ocfs2/cluster/tcp.c | 2
-rw-r--r--  fs/ocfs2/dir.c | 2
-rw-r--r--  fs/ocfs2/dlm/dlmdomain.c | 2
-rw-r--r--  fs/ocfs2/dlm/dlmmaster.c | 12
-rw-r--r--  fs/ocfs2/dlm/dlmrecovery.c | 18
-rw-r--r--  fs/ocfs2/dlmglue.c | 37
-rw-r--r--  fs/ocfs2/file.c | 4
-rw-r--r--  fs/ocfs2/inode.c | 3
-rw-r--r--  fs/ocfs2/move_extents.c | 3
-rw-r--r--  fs/ocfs2/ocfs2.h | 6
-rw-r--r--  fs/ocfs2/slot_map.c | 2
-rw-r--r--  fs/ocfs2/super.c | 3
-rw-r--r--  fs/ocfs2/xattr.c | 2
-rw-r--r--  fs/proc/array.c | 47
-rw-r--r--  fs/proc/base.c | 3
-rw-r--r--  fs/proc/generic.c | 163
-rw-r--r--  fs/proc/internal.h | 11
-rw-r--r--  fs/proc/proc_net.c | 1
-rw-r--r--  fs/proc/root.c | 1
-rw-r--r--  fs/proc/task_mmu.c | 104
-rw-r--r--  include/linux/cgroup.h | 26
-rw-r--r--  include/linux/compaction.h | 10
-rw-r--r--  include/linux/file.h | 1
-rw-r--r--  include/linux/gfp.h | 4
-rw-r--r--  include/linux/hugetlb.h | 3
-rw-r--r--  include/linux/hugetlb_cgroup.h | 1
-rw-r--r--  include/linux/kern_levels.h | 13
-rw-r--r--  include/linux/kernel.h | 1
-rw-r--r--  include/linux/memcontrol.h | 50
-rw-r--r--  include/linux/mm_types.h | 5
-rw-r--r--  include/linux/mmzone.h | 12
-rw-r--r--  include/linux/page_cgroup.h | 105
-rw-r--r--  include/linux/page_counter.h | 51
-rw-r--r--  include/linux/percpu-refcount.h | 47
-rw-r--r--  include/linux/printk.h | 1
-rw-r--r--  include/linux/ptrace.h | 2
-rw-r--r--  include/linux/res_counter.h | 223
-rw-r--r--  include/linux/slab.h | 4
-rw-r--r--  include/linux/swap_cgroup.h | 42
-rw-r--r--  include/net/sock.h | 26
-rw-r--r--  include/uapi/linux/sysctl.h | 1
-rw-r--r--  init/Kconfig | 56
-rw-r--r--  init/main.c | 14
-rw-r--r--  kernel/Makefile | 1
-rw-r--r--  kernel/exit.c | 245
-rw-r--r--  kernel/kmod.c | 43
-rw-r--r--  kernel/panic.c | 13
-rw-r--r--  kernel/pid.c | 2
-rw-r--r--  kernel/pid_namespace.c | 28
-rw-r--r--  kernel/printk/printk.c | 49
-rw-r--r--  kernel/ptrace.c | 23
-rw-r--r--  kernel/res_counter.c | 211
-rw-r--r--  kernel/sched/core.c | 4
-rw-r--r--  kernel/sysctl.c | 9
-rw-r--r--  kernel/sysctl_binary.c | 1
-rw-r--r--  lib/dma-debug.c | 43
-rw-r--r--  lib/dynamic_debug.c | 4
-rw-r--r--  lib/lcm.c | 8
-rw-r--r--  mm/Makefile | 4
-rw-r--r--  mm/cma.c | 14
-rw-r--r--  mm/compaction.c | 139
-rw-r--r--  mm/debug.c | 5
-rw-r--r--  mm/frontswap.c | 2
-rw-r--r--  mm/huge_memory.c | 1
-rw-r--r--  mm/hugetlb.c | 4
-rw-r--r--  mm/hugetlb_cgroup.c | 103
-rw-r--r--  mm/internal.h | 7
-rw-r--r--  mm/memcontrol.c | 1706
-rw-r--r--  mm/memory-failure.c | 4
-rw-r--r--  mm/memory_hotplug.c | 4
-rw-r--r--  mm/oom_kill.c | 4
-rw-r--r--  mm/page-writeback.c | 4
-rw-r--r--  mm/page_alloc.c | 137
-rw-r--r--  mm/page_cgroup.c | 530
-rw-r--r--  mm/page_counter.c | 192
-rw-r--r--  mm/page_isolation.c | 2
-rw-r--r--  mm/rmap.c | 4
-rw-r--r--  mm/slab.c | 23
-rw-r--r--  mm/slab.h | 8
-rw-r--r--  mm/slab_common.c | 40
-rw-r--r--  mm/slub.c | 21
-rw-r--r--  mm/swap_cgroup.c | 208
-rw-r--r--  mm/swap_state.c | 1
-rw-r--r--  mm/swapfile.c | 2
-rw-r--r--  mm/vmalloc.c | 3
-rw-r--r--  mm/vmscan.c | 18
-rw-r--r--  net/ipv4/tcp_memcontrol.c | 87
-rwxr-xr-x  scripts/checkpatch.pl | 263
-rwxr-xr-x  scripts/kernel-doc | 2
142 files changed, 3655 insertions, 3940 deletions
diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
index a9faaca1f029..106245c3aecc 100644
--- a/Documentation/cgroups/hugetlb.txt
+++ b/Documentation/cgroups/hugetlb.txt
@@ -29,7 +29,7 @@ Brief summary of control files
29 29
30 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage 30 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
31 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded 31 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
32 hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb 32 hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
33 hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit 33 hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
34 34
35For a system supporting two hugepage size (16M and 16G) the control 35For a system supporting two hugepage size (16M and 16G) the control
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..46b2b5080317 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -1,5 +1,10 @@
1Memory Resource Controller 1Memory Resource Controller
2 2
3NOTE: This document is hopelessly outdated and it asks for a complete
4 rewrite. It still contains a useful information so we are keeping it
5 here but make sure to check the current code if you need a deeper
6 understanding.
7
3NOTE: The Memory Resource Controller has generically been referred to as the 8NOTE: The Memory Resource Controller has generically been referred to as the
4 memory controller in this document. Do not confuse memory controller 9 memory controller in this document. Do not confuse memory controller
5 used here with the memory controller that is used in hardware. 10 used here with the memory controller that is used in hardware.
@@ -52,9 +57,9 @@ Brief summary of control files.
52 tasks # attach a task(thread) and show list of threads 57 tasks # attach a task(thread) and show list of threads
53 cgroup.procs # show list of processes 58 cgroup.procs # show list of processes
54 cgroup.event_control # an interface for event_fd() 59 cgroup.event_control # an interface for event_fd()
55 memory.usage_in_bytes # show current res_counter usage for memory 60 memory.usage_in_bytes # show current usage for memory
56 (See 5.5 for details) 61 (See 5.5 for details)
57 memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap 62 memory.memsw.usage_in_bytes # show current usage for memory+Swap
58 (See 5.5 for details) 63 (See 5.5 for details)
59 memory.limit_in_bytes # set/show limit of memory usage 64 memory.limit_in_bytes # set/show limit of memory usage
60 memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage 65 memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
@@ -116,16 +121,16 @@ The memory controller is the first controller developed.
116 121
1172.1. Design 1222.1. Design
118 123
119The core of the design is a counter called the res_counter. The res_counter 124The core of the design is a counter called the page_counter. The
120tracks the current memory usage and limit of the group of processes associated 125page_counter tracks the current memory usage and limit of the group of
121with the controller. Each cgroup has a memory controller specific data 126processes associated with the controller. Each cgroup has a memory controller
122structure (mem_cgroup) associated with it. 127specific data structure (mem_cgroup) associated with it.
123 128
1242.2. Accounting 1292.2. Accounting
125 130
126 +--------------------+ 131 +--------------------+
127 | mem_cgroup | 132 | mem_cgroup |
128 | (res_counter) | 133 | (page_counter) |
129 +--------------------+ 134 +--------------------+
130 / ^ \ 135 / ^ \
131 / | \ 136 / | \
@@ -352,9 +357,8 @@ set:
3520. Configuration 3570. Configuration
353 358
354a. Enable CONFIG_CGROUPS 359a. Enable CONFIG_CGROUPS
355b. Enable CONFIG_RESOURCE_COUNTERS 360b. Enable CONFIG_MEMCG
356c. Enable CONFIG_MEMCG 361c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
357d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
358d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 362d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
359 363
3601. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) 3641. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
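Note: the memory.txt update above documents the switch from res_counter to the
new page_counter as the core accounting primitive. As a rough sketch of the
charging pattern the new counter is built for (the function names follow the
include/linux/page_counter.h added by this series, but the helper and the
memcg field name are illustrative, not taken from the patch):

    #include <linux/page_counter.h>

    /* Illustrative helper, not from this patch. */
    static int my_charge(struct mem_cgroup *memcg, unsigned long nr_pages)
    {
            struct page_counter *fail;
            int ret;

            /* Charge nr_pages against the group and all of its ancestors. */
            ret = page_counter_try_charge(&memcg->memory, nr_pages, &fail);
            if (ret)
                    /* Limit hit; 'fail' names the counter that refused. */
                    return ret;
            return 0;
    }

    static void my_uncharge(struct mem_cgroup *memcg, unsigned long nr_pages)
    {
            /* Give the pages back once the resource is released. */
            page_counter_uncharge(&memcg->memory, nr_pages);
    }

Unlike the res_counter described in the (now deleted) resource_counter.txt
below, page_counter tracks pages rather than bytes and does not take a
spinlock on the charge path.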
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
deleted file mode 100644
index 762ca54eb929..000000000000
--- a/Documentation/cgroups/resource_counter.txt
+++ /dev/null
@@ -1,197 +0,0 @@
1
2 The Resource Counter
3
4The resource counter, declared at include/linux/res_counter.h,
5is supposed to facilitate the resource management by controllers
6by providing common stuff for accounting.
7
8This "stuff" includes the res_counter structure and routines
9to work with it.
10
11
12
131. Crucial parts of the res_counter structure
14
15 a. unsigned long long usage
16
17 The usage value shows the amount of a resource that is consumed
18 by a group at a given time. The units of measurement should be
19 determined by the controller that uses this counter. E.g. it can
20 be bytes, items or any other unit the controller operates on.
21
22 b. unsigned long long max_usage
23
24 The maximal value of the usage over time.
25
26 This value is useful when gathering statistical information about
27 the particular group, as it shows the actual resource requirements
28 for a particular group, not just some usage snapshot.
29
30 c. unsigned long long limit
31
32 The maximal allowed amount of resource to consume by the group. In
33 case the group requests for more resources, so that the usage value
34 would exceed the limit, the resource allocation is rejected (see
35 the next section).
36
37 d. unsigned long long failcnt
38
39 The failcnt stands for "failures counter". This is the number of
40 resource allocation attempts that failed.
41
42 c. spinlock_t lock
43
44 Protects changes of the above values.
45
46
47
482. Basic accounting routines
49
50 a. void res_counter_init(struct res_counter *rc,
51 struct res_counter *rc_parent)
52
53 Initializes the resource counter. As usual, should be the first
54 routine called for a new counter.
55
56 The struct res_counter *parent can be used to define a hierarchical
57 child -> parent relationship directly in the res_counter structure,
58 NULL can be used to define no relationship.
59
60 c. int res_counter_charge(struct res_counter *rc, unsigned long val,
61 struct res_counter **limit_fail_at)
62
63 When a resource is about to be allocated it has to be accounted
64 with the appropriate resource counter (controller should determine
65 which one to use on its own). This operation is called "charging".
66
67 This is not very important which operation - resource allocation
68 or charging - is performed first, but
69 * if the allocation is performed first, this may create a
70 temporary resource over-usage by the time resource counter is
71 charged;
72 * if the charging is performed first, then it should be uncharged
73 on error path (if the one is called).
74
75 If the charging fails and a hierarchical dependency exists, the
76 limit_fail_at parameter is set to the particular res_counter element
77 where the charging failed.
78
79 d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
80
81 When a resource is released (freed) it should be de-accounted
82 from the resource counter it was accounted to. This is called
83 "uncharging". The return value of this function indicate the amount
84 of charges still present in the counter.
85
86 The _locked routines imply that the res_counter->lock is taken.
87
88 e. u64 res_counter_uncharge_until
89 (struct res_counter *rc, struct res_counter *top,
90 unsigned long val)
91
92 Almost same as res_counter_uncharge() but propagation of uncharge
93 stops when rc == top. This is useful when kill a res_counter in
94 child cgroup.
95
96 2.1 Other accounting routines
97
98 There are more routines that may help you with common needs, like
99 checking whether the limit is reached or resetting the max_usage
100 value. They are all declared in include/linux/res_counter.h.
101
102
103
1043. Analyzing the resource counter registrations
105
106 a. If the failcnt value constantly grows, this means that the counter's
107 limit is too tight. Either the group is misbehaving and consumes too
108 many resources, or the configuration is not suitable for the group
109 and the limit should be increased.
110
111 b. The max_usage value can be used to quickly tune the group. One may
112 set the limits to maximal values and either load the container with
113 a common pattern or leave one for a while. After this the max_usage
114 value shows the amount of memory the container would require during
115 its common activity.
116
117 Setting the limit a bit above this value gives a pretty good
118 configuration that works in most of the cases.
119
120 c. If the max_usage is much less than the limit, but the failcnt value
121 is growing, then the group tries to allocate a big chunk of resource
122 at once.
123
124 d. If the max_usage is much less than the limit, but the failcnt value
125 is 0, then this group is given too high limit, that it does not
126 require. It is better to lower the limit a bit leaving more resource
127 for other groups.
128
129
130
1314. Communication with the control groups subsystem (cgroups)
132
133All the resource controllers that are using cgroups and resource counters
134should provide files (in the cgroup filesystem) to work with the resource
135counter fields. They are recommended to adhere to the following rules:
136
137 a. File names
138
139 Field name File name
140 ---------------------------------------------------
141 usage usage_in_<unit_of_measurement>
142 max_usage max_usage_in_<unit_of_measurement>
143 limit limit_in_<unit_of_measurement>
144 failcnt failcnt
145 lock no file :)
146
147 b. Reading from file should show the corresponding field value in the
148 appropriate format.
149
150 c. Writing to file
151
152 Field Expected behavior
153 ----------------------------------
154 usage prohibited
155 max_usage reset to usage
156 limit set the limit
157 failcnt reset to zero
158
159
160
1615. Usage example
162
163 a. Declare a task group (take a look at cgroups subsystem for this) and
164 fold a res_counter into it
165
166 struct my_group {
167 struct res_counter res;
168
169 <other fields>
170 }
171
172 b. Put hooks in resource allocation/release paths
173
174 int alloc_something(...)
175 {
176 if (res_counter_charge(res_counter_ptr, amount) < 0)
177 return -ENOMEM;
178
179 <allocate the resource and return to the caller>
180 }
181
182 void release_something(...)
183 {
184 res_counter_uncharge(res_counter_ptr, amount);
185
186 <release the resource>
187 }
188
189 In order to keep the usage value self-consistent, both the
190 "res_counter_ptr" and the "amount" in release_something() should be
191 the same as they were in the alloc_something() when the releasing
192 resource was allocated.
193
194 c. Provide the way to read res_counter values and set them (the cgroups
195 still can help with it).
196
197 c. Compile and run :)
diff --git a/Documentation/devicetree/bindings/rtc/rtc-omap.txt b/Documentation/devicetree/bindings/rtc/rtc-omap.txt
index 5a0f02d34d95..4ba4dbd34289 100644
--- a/Documentation/devicetree/bindings/rtc/rtc-omap.txt
+++ b/Documentation/devicetree/bindings/rtc/rtc-omap.txt
@@ -5,11 +5,17 @@ Required properties:
5 - "ti,da830-rtc" - for RTC IP used similar to that on DA8xx SoC family. 5 - "ti,da830-rtc" - for RTC IP used similar to that on DA8xx SoC family.
6 - "ti,am3352-rtc" - for RTC IP used similar to that on AM335x SoC family. 6 - "ti,am3352-rtc" - for RTC IP used similar to that on AM335x SoC family.
7 This RTC IP has special WAKE-EN Register to enable 7 This RTC IP has special WAKE-EN Register to enable
8 Wakeup generation for event Alarm. 8 Wakeup generation for event Alarm. It can also be
9 used to control an external PMIC via the
10 pmic_power_en pin.
9- reg: Address range of rtc register set 11- reg: Address range of rtc register set
10- interrupts: rtc timer, alarm interrupts in order 12- interrupts: rtc timer, alarm interrupts in order
11- interrupt-parent: phandle for the interrupt controller 13- interrupt-parent: phandle for the interrupt controller
12 14
15Optional properties:
16- system-power-controller: whether the rtc is controlling the system power
17 through pmic_power_en
18
13Example: 19Example:
14 20
15rtc@1c23000 { 21rtc@1c23000 {
@@ -18,4 +24,5 @@ rtc@1c23000 {
18 interrupts = <19 24 interrupts = <19
19 19>; 25 19>;
20 interrupt-parent = <&intc>; 26 interrupt-parent = <&intc>;
27 system-power-controller;
21}; 28};
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index 0d354625299c..2417cb0b493b 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -115,6 +115,7 @@ nxp NXP Semiconductors
115onnn ON Semiconductor Corp. 115onnn ON Semiconductor Corp.
116opencores OpenCores.org 116opencores OpenCores.org
117panasonic Panasonic Corporation 117panasonic Panasonic Corporation
118pericom Pericom Technology Inc.
118phytec PHYTEC Messtechnik GmbH 119phytec PHYTEC Messtechnik GmbH
119picochip Picochip Ltd 120picochip Picochip Ltd
120plathome Plat'Home Co., Ltd. 121plathome Plat'Home Co., Ltd.
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 6c0b9f27e465..bc4bd5a44b88 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
471 471
472 http://people.redhat.com/~anderson/ 472 http://people.redhat.com/~anderson/
473 473
474Trigger Kdump on WARN()
475=======================
476
477The kernel parameter, panic_on_warn, calls panic() in all WARN() paths. This
478will cause a kdump to occur at the panic() call. In cases where a user wants
479to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
480to achieve the same behaviour.
474 481
475Contact 482Contact
476======= 483=======
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 838f3776c924..d6eb3636fe5a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
2509 timeout < 0: reboot immediately 2509 timeout < 0: reboot immediately
2510 Format: <timeout> 2510 Format: <timeout>
2511 2511
2512 panic_on_warn panic() instead of WARN(). Useful to cause kdump
2513 on a WARN().
2514
2512 crash_kexec_post_notifiers 2515 crash_kexec_post_notifiers
2513 Run kdump after running panic-notifiers and dumping 2516 Run kdump after running panic-notifiers and dumping
2514 kmsg. This only for the users who doubt kdump always 2517 kmsg. This only for the users who doubt kdump always
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57baff5bdb80..b5d0c8501a18 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -54,8 +54,9 @@ show up in /proc/sys/kernel:
54- overflowuid 54- overflowuid
55- panic 55- panic
56- panic_on_oops 56- panic_on_oops
57- panic_on_unrecovered_nmi
58- panic_on_stackoverflow 57- panic_on_stackoverflow
58- panic_on_unrecovered_nmi
59- panic_on_warn
59- pid_max 60- pid_max
60- powersave-nap [ PPC only ] 61- powersave-nap [ PPC only ]
61- printk 62- printk
@@ -527,19 +528,6 @@ the recommended setting is 60.
527 528
528============================================================== 529==============================================================
529 530
530panic_on_unrecovered_nmi:
531
532The default Linux behaviour on an NMI of either memory or unknown is
533to continue operation. For many environments such as scientific
534computing it is preferable that the box is taken out and the error
535dealt with than an uncorrected parity/ECC error get propagated.
536
537A small number of systems do generate NMI's for bizarre random reasons
538such as power management so the default is off. That sysctl works like
539the existing panic controls already in that directory.
540
541==============================================================
542
543panic_on_oops: 531panic_on_oops:
544 532
545Controls the kernel's behaviour when an oops or BUG is encountered. 533Controls the kernel's behaviour when an oops or BUG is encountered.
@@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
563 551
564============================================================== 552==============================================================
565 553
554panic_on_unrecovered_nmi:
555
556The default Linux behaviour on an NMI of either memory or unknown is
557to continue operation. For many environments such as scientific
558computing it is preferable that the box is taken out and the error
559dealt with than an uncorrected parity/ECC error get propagated.
560
561A small number of systems do generate NMI's for bizarre random reasons
562such as power management so the default is off. That sysctl works like
563the existing panic controls already in that directory.
564
565==============================================================
566
567panic_on_warn:
568
569Calls panic() in the WARN() path when set to 1. This is useful to avoid
570a kernel rebuild when attempting to kdump at the location of a WARN().
571
5720: only WARN(), default behaviour.
573
5741: call panic() after printing out WARN() location.
575
576==============================================================
577
566perf_cpu_time_max_percent: 578perf_cpu_time_max_percent:
567 579
568Hints to the kernel how much CPU time it should be allowed to 580Hints to the kernel how much CPU time it should be allowed to
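Note: panic_on_warn, documented above, can be flipped at runtime through
procfs as well as on the kernel command line. A minimal userspace sketch,
equivalent to "echo 1 > /proc/sys/kernel/panic_on_warn" (requires root):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/panic_on_warn", "w");

            if (!f) {
                    perror("panic_on_warn");
                    return 1;
            }
            /* Any subsequent WARN() will now panic (and kdump, if configured). */
            fputs("1\n", f);
            return fclose(f) ? 1 : 0;
    }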
diff --git a/MAINTAINERS b/MAINTAINERS
index 1563a3b38960..079efaf1b5e7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2605,7 +2605,7 @@ L: cgroups@vger.kernel.org
2605L: linux-mm@kvack.org 2605L: linux-mm@kvack.org
2606S: Maintained 2606S: Maintained
2607F: mm/memcontrol.c 2607F: mm/memcontrol.c
2608F: mm/page_cgroup.c 2608F: mm/swap_cgroup.c
2609 2609
2610CORETEMP HARDWARE MONITORING DRIVER 2610CORETEMP HARDWARE MONITORING DRIVER
2611M: Fenghua Yu <fenghua.yu@intel.com> 2611M: Fenghua Yu <fenghua.yu@intel.com>
@@ -2722,7 +2722,7 @@ F: drivers/net/wireless/cw1200/
2722 2722
2723CX18 VIDEO4LINUX DRIVER 2723CX18 VIDEO4LINUX DRIVER
2724M: Andy Walls <awalls@md.metrocast.net> 2724M: Andy Walls <awalls@md.metrocast.net>
2725L: ivtv-devel@ivtvdriver.org (moderated for non-subscribers) 2725L: ivtv-devel@ivtvdriver.org (subscribers-only)
2726L: linux-media@vger.kernel.org 2726L: linux-media@vger.kernel.org
2727T: git git://linuxtv.org/media_tree.git 2727T: git git://linuxtv.org/media_tree.git
2728W: http://linuxtv.org 2728W: http://linuxtv.org
@@ -5208,7 +5208,7 @@ F: drivers/media/tuners/it913x*
5208 5208
5209IVTV VIDEO4LINUX DRIVER 5209IVTV VIDEO4LINUX DRIVER
5210M: Andy Walls <awalls@md.metrocast.net> 5210M: Andy Walls <awalls@md.metrocast.net>
5211L: ivtv-devel@ivtvdriver.org (moderated for non-subscribers) 5211L: ivtv-devel@ivtvdriver.org (subscribers-only)
5212L: linux-media@vger.kernel.org 5212L: linux-media@vger.kernel.org
5213T: git git://linuxtv.org/media_tree.git 5213T: git git://linuxtv.org/media_tree.git
5214W: http://www.ivtvdriver.org 5214W: http://www.ivtvdriver.org
diff --git a/arch/arm/boot/dts/am335x-boneblack.dts b/arch/arm/boot/dts/am335x-boneblack.dts
index 901739fcb85a..5c42d259fa68 100644
--- a/arch/arm/boot/dts/am335x-boneblack.dts
+++ b/arch/arm/boot/dts/am335x-boneblack.dts
@@ -80,3 +80,7 @@
80 status = "okay"; 80 status = "okay";
81 }; 81 };
82}; 82};
83
84&rtc {
85 system-power-controller;
86};
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index befe713b3e1b..acd37057bca9 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -435,7 +435,7 @@
435 }; 435 };
436 436
437 rtc: rtc@44e3e000 { 437 rtc: rtc@44e3e000 {
438 compatible = "ti,da830-rtc"; 438 compatible = "ti,am3352-rtc", "ti,da830-rtc";
439 reg = <0x44e3e000 0x1000>; 439 reg = <0x44e3e000 0x1000>;
440 interrupts = <75 440 interrupts = <75
441 76>; 441 76>;
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 41a43bf26492..df22314f57cf 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -279,6 +279,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
279#endif /* CONFIG_HAVE_RCU_TABLE_FREE */ 279#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
280#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 280#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
281 281
282#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
282#define pmd_young(pmd) pte_young(pmd_pte(pmd)) 283#define pmd_young(pmd) pte_young(pmd_pte(pmd))
283#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd))) 284#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
284#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd))) 285#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 5845ffea67c3..dc063fe6646a 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2662,7 +2662,7 @@ pfm_context_create(pfm_context_t *ctx, void *arg, int count, struct pt_regs *reg
2662 2662
2663 ret = -ENOMEM; 2663 ret = -ENOMEM;
2664 2664
2665 fd = get_unused_fd(); 2665 fd = get_unused_fd_flags(0);
2666 if (fd < 0) 2666 if (fd < 0)
2667 return fd; 2667 return fd;
2668 2668
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ae153c40ab7c..9b4b1904efae 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -467,6 +467,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
467} 467}
468 468
469#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd)) 469#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
470#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
470#define pmd_young(pmd) pte_young(pmd_pte(pmd)) 471#define pmd_young(pmd) pte_young(pmd_pte(pmd))
471#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd))) 472#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
472#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd))) 473#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 65d633f20d37..1a3429e1ccb5 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -301,7 +301,7 @@ static int spufs_context_open(struct path *path)
301 int ret; 301 int ret;
302 struct file *filp; 302 struct file *filp;
303 303
304 ret = get_unused_fd(); 304 ret = get_unused_fd_flags(0);
305 if (ret < 0) 305 if (ret < 0)
306 return ret; 306 return ret;
307 307
@@ -518,7 +518,7 @@ static int spufs_gang_open(struct path *path)
518 int ret; 518 int ret;
519 struct file *filp; 519 struct file *filp;
520 520
521 ret = get_unused_fd(); 521 ret = get_unused_fd_flags(0);
522 if (ret < 0) 522 if (ret < 0)
523 return ret; 523 return ret;
524 524
diff --git a/arch/sh/mm/numa.c b/arch/sh/mm/numa.c
index 3d85225b9e95..bce52ba66206 100644
--- a/arch/sh/mm/numa.c
+++ b/arch/sh/mm/numa.c
@@ -31,7 +31,7 @@ void __init setup_bootmem_node(int nid, unsigned long start, unsigned long end)
31 unsigned long bootmem_paddr; 31 unsigned long bootmem_paddr;
32 32
33 /* Don't allow bogus node assignment */ 33 /* Don't allow bogus node assignment */
34 BUG_ON(nid > MAX_NUMNODES || nid <= 0); 34 BUG_ON(nid >= MAX_NUMNODES || nid <= 0);
35 35
36 start_pfn = start >> PAGE_SHIFT; 36 start_pfn = start >> PAGE_SHIFT;
37 end_pfn = end >> PAGE_SHIFT; 37 end_pfn = end >> PAGE_SHIFT;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index bfeb626085ac..1ff9e7864168 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -667,6 +667,13 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
667} 667}
668 668
669#ifdef CONFIG_TRANSPARENT_HUGEPAGE 669#ifdef CONFIG_TRANSPARENT_HUGEPAGE
670static inline unsigned long pmd_dirty(pmd_t pmd)
671{
672 pte_t pte = __pte(pmd_val(pmd));
673
674 return pte_dirty(pte);
675}
676
670static inline unsigned long pmd_young(pmd_t pmd) 677static inline unsigned long pmd_young(pmd_t pmd)
671{ 678{
672 pte_t pte = __pte(pmd_val(pmd)); 679 pte_t pte = __pte(pmd_val(pmd));
diff --git a/arch/tile/kernel/early_printk.c b/arch/tile/kernel/early_printk.c
index b608e00e7f6d..aefb2c086726 100644
--- a/arch/tile/kernel/early_printk.c
+++ b/arch/tile/kernel/early_printk.c
@@ -43,13 +43,20 @@ static struct console early_hv_console = {
43 43
44void early_panic(const char *fmt, ...) 44void early_panic(const char *fmt, ...)
45{ 45{
46 va_list ap; 46 struct va_format vaf;
47 va_list args;
48
47 arch_local_irq_disable_all(); 49 arch_local_irq_disable_all();
48 va_start(ap, fmt); 50
49 early_printk("Kernel panic - not syncing: "); 51 va_start(args, fmt);
50 early_vprintk(fmt, ap); 52
51 early_printk("\n"); 53 vaf.fmt = fmt;
52 va_end(ap); 54 vaf.va = &args;
55
56 early_printk("Kernel panic - not syncing: %pV", &vaf);
57
58 va_end(args);
59
53 dump_stack(); 60 dump_stack();
54 hv_halt(); 61 hv_halt();
55} 62}
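Note: the early_printk.c change above folds three early_printk() calls into a
single one by using the kernel's %pV format extension together with struct
va_format, so the panic banner and the formatted message are emitted in one
call. The same pattern in a hypothetical driver (a sketch, not code from this
patch):

    #include <stdarg.h>
    #include <linux/printk.h>   /* struct va_format, %pV */

    static void mydrv_warn(const char *fmt, ...)
    {
            struct va_format vaf;
            va_list args;

            va_start(args, fmt);
            /* %pV expands the caller's format and arguments inside one printk. */
            vaf.fmt = fmt;
            vaf.va = &args;
            printk(KERN_WARNING "mydrv: %pV\n", &vaf);
            va_end(args);
    }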
diff --git a/arch/tile/kernel/setup.c b/arch/tile/kernel/setup.c
index b9736ded06f2..7f079bbfdf4c 100644
--- a/arch/tile/kernel/setup.c
+++ b/arch/tile/kernel/setup.c
@@ -534,11 +534,10 @@ static void __init setup_memory(void)
534 } 534 }
535 } 535 }
536 physpages -= dropped_pages; 536 physpages -= dropped_pages;
537 pr_warning("Only using %ldMB memory;" 537 pr_warn("Only using %ldMB memory - ignoring %ldMB\n",
538 " ignoring %ldMB.\n", 538 physpages >> (20 - PAGE_SHIFT),
539 physpages >> (20 - PAGE_SHIFT), 539 dropped_pages >> (20 - PAGE_SHIFT));
540 dropped_pages >> (20 - PAGE_SHIFT)); 540 pr_warn("Consider using a larger page size\n");
541 pr_warning("Consider using a larger page size.\n");
542 } 541 }
543#endif 542#endif
544 543
@@ -566,9 +565,8 @@ static void __init setup_memory(void)
566 565
567#ifndef __tilegx__ 566#ifndef __tilegx__
568 if (node_end_pfn[0] > MAXMEM_PFN) { 567 if (node_end_pfn[0] > MAXMEM_PFN) {
569 pr_warning("Only using %ldMB LOWMEM.\n", 568 pr_warn("Only using %ldMB LOWMEM\n", MAXMEM >> 20);
570 MAXMEM>>20); 569 pr_warn("Use a HIGHMEM enabled kernel\n");
571 pr_warning("Use a HIGHMEM enabled kernel.\n");
572 max_low_pfn = MAXMEM_PFN; 570 max_low_pfn = MAXMEM_PFN;
573 max_pfn = MAXMEM_PFN; 571 max_pfn = MAXMEM_PFN;
574 node_end_pfn[0] = MAXMEM_PFN; 572 node_end_pfn[0] = MAXMEM_PFN;
@@ -1112,8 +1110,8 @@ static void __init load_hv_initrd(void)
1112 fd = hv_fs_findfile((HV_VirtAddr) initramfs_file); 1110 fd = hv_fs_findfile((HV_VirtAddr) initramfs_file);
1113 if (fd == HV_ENOENT) { 1111 if (fd == HV_ENOENT) {
1114 if (set_initramfs_file) { 1112 if (set_initramfs_file) {
1115 pr_warning("No such hvfs initramfs file '%s'\n", 1113 pr_warn("No such hvfs initramfs file '%s'\n",
1116 initramfs_file); 1114 initramfs_file);
1117 return; 1115 return;
1118 } else { 1116 } else {
1119 /* Try old backwards-compatible name. */ 1117 /* Try old backwards-compatible name. */
@@ -1126,8 +1124,8 @@ static void __init load_hv_initrd(void)
1126 stat = hv_fs_fstat(fd); 1124 stat = hv_fs_fstat(fd);
1127 BUG_ON(stat.size < 0); 1125 BUG_ON(stat.size < 0);
1128 if (stat.flags & HV_FS_ISDIR) { 1126 if (stat.flags & HV_FS_ISDIR) {
1129 pr_warning("Ignoring hvfs file '%s': it's a directory.\n", 1127 pr_warn("Ignoring hvfs file '%s': it's a directory\n",
1130 initramfs_file); 1128 initramfs_file);
1131 return; 1129 return;
1132 } 1130 }
1133 initrd = alloc_bootmem_pages(stat.size); 1131 initrd = alloc_bootmem_pages(stat.size);
@@ -1185,9 +1183,8 @@ static void __init validate_hv(void)
1185 HV_Topology topology = hv_inquire_topology(); 1183 HV_Topology topology = hv_inquire_topology();
1186 BUG_ON(topology.coord.x != 0 || topology.coord.y != 0); 1184 BUG_ON(topology.coord.x != 0 || topology.coord.y != 0);
1187 if (topology.width != 1 || topology.height != 1) { 1185 if (topology.width != 1 || topology.height != 1) {
1188 pr_warning("Warning: booting UP kernel on %dx%d grid;" 1186 pr_warn("Warning: booting UP kernel on %dx%d grid; will ignore all but first tile\n",
1189 " will ignore all but first tile.\n", 1187 topology.width, topology.height);
1190 topology.width, topology.height);
1191 } 1188 }
1192#endif 1189#endif
1193 1190
@@ -1208,9 +1205,8 @@ static void __init validate_hv(void)
1208 * We use a struct cpumask for this, so it must be big enough. 1205 * We use a struct cpumask for this, so it must be big enough.
1209 */ 1206 */
1210 if ((smp_height * smp_width) > nr_cpu_ids) 1207 if ((smp_height * smp_width) > nr_cpu_ids)
1211 early_panic("Hypervisor %d x %d grid too big for Linux" 1208 early_panic("Hypervisor %d x %d grid too big for Linux NR_CPUS %d\n",
1212 " NR_CPUS %d\n", smp_height, smp_width, 1209 smp_height, smp_width, nr_cpu_ids);
1213 nr_cpu_ids);
1214#endif 1210#endif
1215 1211
1216 /* 1212 /*
@@ -1265,10 +1261,9 @@ static void __init validate_va(void)
1265 1261
1266 /* Kernel PCs must have their high bit set; see intvec.S. */ 1262 /* Kernel PCs must have their high bit set; see intvec.S. */
1267 if ((long)VMALLOC_START >= 0) 1263 if ((long)VMALLOC_START >= 0)
1268 early_panic( 1264 early_panic("Linux VMALLOC region below the 2GB line (%#lx)!\n"
1269 "Linux VMALLOC region below the 2GB line (%#lx)!\n" 1265 "Reconfigure the kernel with smaller VMALLOC_RESERVE\n",
1270 "Reconfigure the kernel with smaller VMALLOC_RESERVE.\n", 1266 VMALLOC_START);
1271 VMALLOC_START);
1272#endif 1267#endif
1273} 1268}
1274 1269
@@ -1395,7 +1390,7 @@ static void __init setup_cpu_maps(void)
1395 1390
1396static int __init dataplane(char *str) 1391static int __init dataplane(char *str)
1397{ 1392{
1398 pr_warning("WARNING: dataplane support disabled in this kernel\n"); 1393 pr_warn("WARNING: dataplane support disabled in this kernel\n");
1399 return 0; 1394 return 0;
1400} 1395}
1401 1396
@@ -1413,8 +1408,8 @@ void __init setup_arch(char **cmdline_p)
1413 len = hv_get_command_line((HV_VirtAddr) boot_command_line, 1408 len = hv_get_command_line((HV_VirtAddr) boot_command_line,
1414 COMMAND_LINE_SIZE); 1409 COMMAND_LINE_SIZE);
1415 if (boot_command_line[0]) 1410 if (boot_command_line[0])
1416 pr_warning("WARNING: ignoring dynamic command line \"%s\"\n", 1411 pr_warn("WARNING: ignoring dynamic command line \"%s\"\n",
1417 boot_command_line); 1412 boot_command_line);
1418 strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE); 1413 strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
1419#else 1414#else
1420 char *hv_cmdline; 1415 char *hv_cmdline;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c112ea63f40d..e8a5454acc99 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -100,6 +100,11 @@ static inline int pte_young(pte_t pte)
100 return pte_flags(pte) & _PAGE_ACCESSED; 100 return pte_flags(pte) & _PAGE_ACCESSED;
101} 101}
102 102
103static inline int pmd_dirty(pmd_t pmd)
104{
105 return pmd_flags(pmd) & _PAGE_DIRTY;
106}
107
103static inline int pmd_young(pmd_t pmd) 108static inline int pmd_young(pmd_t pmd)
104{ 109{
105 return pmd_flags(pmd) & _PAGE_ACCESSED; 110 return pmd_flags(pmd) & _PAGE_ACCESSED;
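Note: this hunk, together with the matching arm64, ppc64 and sparc64 hunks
above, adds a pmd_dirty() accessor that mirrors pte_dirty() for transparent
huge page mappings. A hypothetical caller (purely illustrative, not part of
this patch) could then test whether a huge mapping has been written to before
deciding to discard it:

    /* Illustrative only: true if the THP mapping may be dropped unwritten. */
    static bool thp_is_clean(pmd_t pmd)
    {
            return pmd_trans_huge(pmd) && !pmd_dirty(pmd);
    }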
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index df04227d00cf..98504ec99c7d 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -267,18 +267,24 @@ comment "Default contiguous memory area size:"
267config CMA_SIZE_MBYTES 267config CMA_SIZE_MBYTES
268 int "Size in Mega Bytes" 268 int "Size in Mega Bytes"
269 depends on !CMA_SIZE_SEL_PERCENTAGE 269 depends on !CMA_SIZE_SEL_PERCENTAGE
270 default 0 if X86
270 default 16 271 default 16
271 help 272 help
272 Defines the size (in MiB) of the default memory area for Contiguous 273 Defines the size (in MiB) of the default memory area for Contiguous
273 Memory Allocator. 274 Memory Allocator. If the size of 0 is selected, CMA is disabled by
275 default, but it can be enabled by passing cma=size[MG] to the kernel.
276
274 277
275config CMA_SIZE_PERCENTAGE 278config CMA_SIZE_PERCENTAGE
276 int "Percentage of total memory" 279 int "Percentage of total memory"
277 depends on !CMA_SIZE_SEL_MBYTES 280 depends on !CMA_SIZE_SEL_MBYTES
281 default 0 if X86
278 default 10 282 default 10
279 help 283 help
280 Defines the size of the default memory area for Contiguous Memory 284 Defines the size of the default memory area for Contiguous Memory
281 Allocator as a percentage of the total memory in the system. 285 Allocator as a percentage of the total memory in the system.
286 If 0 percent is selected, CMA is disabled by default, but it can be
287 enabled by passing cma=size[MG] to the kernel.
282 288
283choice 289choice
284 prompt "Selected region size" 290 prompt "Selected region size"
diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
index b682651b5307..4511ddc1ac31 100644
--- a/drivers/rtc/Kconfig
+++ b/drivers/rtc/Kconfig
@@ -192,6 +192,14 @@ config RTC_DRV_DS1374
192 This driver can also be built as a module. If so, the module 192 This driver can also be built as a module. If so, the module
193 will be called rtc-ds1374. 193 will be called rtc-ds1374.
194 194
195config RTC_DRV_DS1374_WDT
196 bool "Dallas/Maxim DS1374 watchdog timer"
197 depends on RTC_DRV_DS1374
198 help
199 If you say Y here you will get support for the
200 watchdog timer in the Dallas Semiconductor DS1374
201 real-time clock chips.
202
195config RTC_DRV_DS1672 203config RTC_DRV_DS1672
196 tristate "Dallas/Maxim DS1672" 204 tristate "Dallas/Maxim DS1672"
197 help 205 help
diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index 5b2717f5dafa..45bfc28ee3aa 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -30,6 +30,14 @@ static int __rtc_read_time(struct rtc_device *rtc, struct rtc_time *tm)
30 else { 30 else {
31 memset(tm, 0, sizeof(struct rtc_time)); 31 memset(tm, 0, sizeof(struct rtc_time));
32 err = rtc->ops->read_time(rtc->dev.parent, tm); 32 err = rtc->ops->read_time(rtc->dev.parent, tm);
33 if (err < 0) {
34 dev_err(&rtc->dev, "read_time: fail to read\n");
35 return err;
36 }
37
38 err = rtc_valid_tm(tm);
39 if (err < 0)
40 dev_err(&rtc->dev, "read_time: rtc_time isn't valid\n");
33 } 41 }
34 return err; 42 return err;
35} 43}
@@ -891,11 +899,24 @@ again:
891 if (next) { 899 if (next) {
892 struct rtc_wkalrm alarm; 900 struct rtc_wkalrm alarm;
893 int err; 901 int err;
902 int retry = 3;
903
894 alarm.time = rtc_ktime_to_tm(next->expires); 904 alarm.time = rtc_ktime_to_tm(next->expires);
895 alarm.enabled = 1; 905 alarm.enabled = 1;
906reprogram:
896 err = __rtc_set_alarm(rtc, &alarm); 907 err = __rtc_set_alarm(rtc, &alarm);
897 if (err == -ETIME) 908 if (err == -ETIME)
898 goto again; 909 goto again;
910 else if (err) {
911 if (retry-- > 0)
912 goto reprogram;
913
914 timer = container_of(next, struct rtc_timer, node);
915 timerqueue_del(&rtc->timerqueue, &timer->node);
916 timer->enabled = 0;
917 dev_err(&rtc->dev, "__rtc_set_alarm: err=%d\n", err);
918 goto again;
919 }
899 } else 920 } else
900 rtc_alarm_disable(rtc); 921 rtc_alarm_disable(rtc);
901 922
diff --git a/drivers/rtc/rtc-ab8500.c b/drivers/rtc/rtc-ab8500.c
index 727e2f5d14d9..866e0ef5122d 100644
--- a/drivers/rtc/rtc-ab8500.c
+++ b/drivers/rtc/rtc-ab8500.c
@@ -504,6 +504,8 @@ static int ab8500_rtc_probe(struct platform_device *pdev)
504 return err; 504 return err;
505 } 505 }
506 506
507 rtc->uie_unsupported = 1;
508
507 return 0; 509 return 0;
508} 510}
509 511
diff --git a/drivers/rtc/rtc-ds1307.c b/drivers/rtc/rtc-ds1307.c
index bb43cf703efc..4ffabb322a9a 100644
--- a/drivers/rtc/rtc-ds1307.c
+++ b/drivers/rtc/rtc-ds1307.c
@@ -35,7 +35,7 @@ enum ds_type {
35 ds_1388, 35 ds_1388,
36 ds_3231, 36 ds_3231,
37 m41t00, 37 m41t00,
38 mcp7941x, 38 mcp794xx,
39 rx_8025, 39 rx_8025,
40 last_ds_type /* always last */ 40 last_ds_type /* always last */
41 /* rs5c372 too? different address... */ 41 /* rs5c372 too? different address... */
@@ -46,7 +46,7 @@ enum ds_type {
46#define DS1307_REG_SECS 0x00 /* 00-59 */ 46#define DS1307_REG_SECS 0x00 /* 00-59 */
47# define DS1307_BIT_CH 0x80 47# define DS1307_BIT_CH 0x80
48# define DS1340_BIT_nEOSC 0x80 48# define DS1340_BIT_nEOSC 0x80
49# define MCP7941X_BIT_ST 0x80 49# define MCP794XX_BIT_ST 0x80
50#define DS1307_REG_MIN 0x01 /* 00-59 */ 50#define DS1307_REG_MIN 0x01 /* 00-59 */
51#define DS1307_REG_HOUR 0x02 /* 00-23, or 1-12{am,pm} */ 51#define DS1307_REG_HOUR 0x02 /* 00-23, or 1-12{am,pm} */
52# define DS1307_BIT_12HR 0x40 /* in REG_HOUR */ 52# define DS1307_BIT_12HR 0x40 /* in REG_HOUR */
@@ -54,7 +54,7 @@ enum ds_type {
54# define DS1340_BIT_CENTURY_EN 0x80 /* in REG_HOUR */ 54# define DS1340_BIT_CENTURY_EN 0x80 /* in REG_HOUR */
55# define DS1340_BIT_CENTURY 0x40 /* in REG_HOUR */ 55# define DS1340_BIT_CENTURY 0x40 /* in REG_HOUR */
56#define DS1307_REG_WDAY 0x03 /* 01-07 */ 56#define DS1307_REG_WDAY 0x03 /* 01-07 */
57# define MCP7941X_BIT_VBATEN 0x08 57# define MCP794XX_BIT_VBATEN 0x08
58#define DS1307_REG_MDAY 0x04 /* 01-31 */ 58#define DS1307_REG_MDAY 0x04 /* 01-31 */
59#define DS1307_REG_MONTH 0x05 /* 01-12 */ 59#define DS1307_REG_MONTH 0x05 /* 01-12 */
60# define DS1337_BIT_CENTURY 0x80 /* in REG_MONTH */ 60# define DS1337_BIT_CENTURY 0x80 /* in REG_MONTH */
@@ -159,7 +159,7 @@ static struct chip_desc chips[last_ds_type] = {
159 [ds_3231] = { 159 [ds_3231] = {
160 .alarm = 1, 160 .alarm = 1,
161 }, 161 },
162 [mcp7941x] = { 162 [mcp794xx] = {
163 .alarm = 1, 163 .alarm = 1,
164 /* this is battery backed SRAM */ 164 /* this is battery backed SRAM */
165 .nvram_offset = 0x20, 165 .nvram_offset = 0x20,
@@ -176,7 +176,8 @@ static const struct i2c_device_id ds1307_id[] = {
176 { "ds1340", ds_1340 }, 176 { "ds1340", ds_1340 },
177 { "ds3231", ds_3231 }, 177 { "ds3231", ds_3231 },
178 { "m41t00", m41t00 }, 178 { "m41t00", m41t00 },
179 { "mcp7941x", mcp7941x }, 179 { "mcp7940x", mcp794xx },
180 { "mcp7941x", mcp794xx },
180 { "pt7c4338", ds_1307 }, 181 { "pt7c4338", ds_1307 },
181 { "rx8025", rx_8025 }, 182 { "rx8025", rx_8025 },
182 { } 183 { }
@@ -439,14 +440,14 @@ static int ds1307_set_time(struct device *dev, struct rtc_time *t)
439 buf[DS1307_REG_HOUR] |= DS1340_BIT_CENTURY_EN 440 buf[DS1307_REG_HOUR] |= DS1340_BIT_CENTURY_EN
440 | DS1340_BIT_CENTURY; 441 | DS1340_BIT_CENTURY;
441 break; 442 break;
442 case mcp7941x: 443 case mcp794xx:
443 /* 444 /*
444 * these bits were cleared when preparing the date/time 445 * these bits were cleared when preparing the date/time
445 * values and need to be set again before writing the 446 * values and need to be set again before writing the
446 * buffer out to the device. 447 * buffer out to the device.
447 */ 448 */
448 buf[DS1307_REG_SECS] |= MCP7941X_BIT_ST; 449 buf[DS1307_REG_SECS] |= MCP794XX_BIT_ST;
449 buf[DS1307_REG_WDAY] |= MCP7941X_BIT_VBATEN; 450 buf[DS1307_REG_WDAY] |= MCP794XX_BIT_VBATEN;
450 break; 451 break;
451 default: 452 default:
452 break; 453 break;
@@ -614,26 +615,26 @@ static const struct rtc_class_ops ds13xx_rtc_ops = {
614/*----------------------------------------------------------------------*/ 615/*----------------------------------------------------------------------*/
615 616
616/* 617/*
617 * Alarm support for mcp7941x devices. 618 * Alarm support for mcp794xx devices.
618 */ 619 */
619 620
620#define MCP7941X_REG_CONTROL 0x07 621#define MCP794XX_REG_CONTROL 0x07
621# define MCP7941X_BIT_ALM0_EN 0x10 622# define MCP794XX_BIT_ALM0_EN 0x10
622# define MCP7941X_BIT_ALM1_EN 0x20 623# define MCP794XX_BIT_ALM1_EN 0x20
623#define MCP7941X_REG_ALARM0_BASE 0x0a 624#define MCP794XX_REG_ALARM0_BASE 0x0a
624#define MCP7941X_REG_ALARM0_CTRL 0x0d 625#define MCP794XX_REG_ALARM0_CTRL 0x0d
625#define MCP7941X_REG_ALARM1_BASE 0x11 626#define MCP794XX_REG_ALARM1_BASE 0x11
626#define MCP7941X_REG_ALARM1_CTRL 0x14 627#define MCP794XX_REG_ALARM1_CTRL 0x14
627# define MCP7941X_BIT_ALMX_IF (1 << 3) 628# define MCP794XX_BIT_ALMX_IF (1 << 3)
628# define MCP7941X_BIT_ALMX_C0 (1 << 4) 629# define MCP794XX_BIT_ALMX_C0 (1 << 4)
629# define MCP7941X_BIT_ALMX_C1 (1 << 5) 630# define MCP794XX_BIT_ALMX_C1 (1 << 5)
630# define MCP7941X_BIT_ALMX_C2 (1 << 6) 631# define MCP794XX_BIT_ALMX_C2 (1 << 6)
631# define MCP7941X_BIT_ALMX_POL (1 << 7) 632# define MCP794XX_BIT_ALMX_POL (1 << 7)
632# define MCP7941X_MSK_ALMX_MATCH (MCP7941X_BIT_ALMX_C0 | \ 633# define MCP794XX_MSK_ALMX_MATCH (MCP794XX_BIT_ALMX_C0 | \
633 MCP7941X_BIT_ALMX_C1 | \ 634 MCP794XX_BIT_ALMX_C1 | \
634 MCP7941X_BIT_ALMX_C2) 635 MCP794XX_BIT_ALMX_C2)
635 636
636static void mcp7941x_work(struct work_struct *work) 637static void mcp794xx_work(struct work_struct *work)
637{ 638{
638 struct ds1307 *ds1307 = container_of(work, struct ds1307, work); 639 struct ds1307 *ds1307 = container_of(work, struct ds1307, work);
639 struct i2c_client *client = ds1307->client; 640 struct i2c_client *client = ds1307->client;
@@ -642,22 +643,22 @@ static void mcp7941x_work(struct work_struct *work)
642 mutex_lock(&ds1307->rtc->ops_lock); 643 mutex_lock(&ds1307->rtc->ops_lock);
643 644
644 /* Check and clear alarm 0 interrupt flag. */ 645 /* Check and clear alarm 0 interrupt flag. */
645 reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_ALARM0_CTRL); 646 reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_ALARM0_CTRL);
646 if (reg < 0) 647 if (reg < 0)
647 goto out; 648 goto out;
648 if (!(reg & MCP7941X_BIT_ALMX_IF)) 649 if (!(reg & MCP794XX_BIT_ALMX_IF))
649 goto out; 650 goto out;
650 reg &= ~MCP7941X_BIT_ALMX_IF; 651 reg &= ~MCP794XX_BIT_ALMX_IF;
651 ret = i2c_smbus_write_byte_data(client, MCP7941X_REG_ALARM0_CTRL, reg); 652 ret = i2c_smbus_write_byte_data(client, MCP794XX_REG_ALARM0_CTRL, reg);
652 if (ret < 0) 653 if (ret < 0)
653 goto out; 654 goto out;
654 655
655 /* Disable alarm 0. */ 656 /* Disable alarm 0. */
656 reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_CONTROL); 657 reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_CONTROL);
657 if (reg < 0) 658 if (reg < 0)
658 goto out; 659 goto out;
659 reg &= ~MCP7941X_BIT_ALM0_EN; 660 reg &= ~MCP794XX_BIT_ALM0_EN;
660 ret = i2c_smbus_write_byte_data(client, MCP7941X_REG_CONTROL, reg); 661 ret = i2c_smbus_write_byte_data(client, MCP794XX_REG_CONTROL, reg);
661 if (ret < 0) 662 if (ret < 0)
662 goto out; 663 goto out;
663 664
@@ -669,7 +670,7 @@ out:
669 mutex_unlock(&ds1307->rtc->ops_lock); 670 mutex_unlock(&ds1307->rtc->ops_lock);
670} 671}
671 672
672static int mcp7941x_read_alarm(struct device *dev, struct rtc_wkalrm *t) 673static int mcp794xx_read_alarm(struct device *dev, struct rtc_wkalrm *t)
673{ 674{
674 struct i2c_client *client = to_i2c_client(dev); 675 struct i2c_client *client = to_i2c_client(dev);
675 struct ds1307 *ds1307 = i2c_get_clientdata(client); 676 struct ds1307 *ds1307 = i2c_get_clientdata(client);
@@ -680,11 +681,11 @@ static int mcp7941x_read_alarm(struct device *dev, struct rtc_wkalrm *t)
680 return -EINVAL; 681 return -EINVAL;
681 682
682 /* Read control and alarm 0 registers. */ 683 /* Read control and alarm 0 registers. */
683 ret = ds1307->read_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 684 ret = ds1307->read_block_data(client, MCP794XX_REG_CONTROL, 10, regs);
684 if (ret < 0) 685 if (ret < 0)
685 return ret; 686 return ret;
686 687
687 t->enabled = !!(regs[0] & MCP7941X_BIT_ALM0_EN); 688 t->enabled = !!(regs[0] & MCP794XX_BIT_ALM0_EN);
688 689
689 /* Report alarm 0 time assuming 24-hour and day-of-month modes. */ 690 /* Report alarm 0 time assuming 24-hour and day-of-month modes. */
690 t->time.tm_sec = bcd2bin(ds1307->regs[3] & 0x7f); 691 t->time.tm_sec = bcd2bin(ds1307->regs[3] & 0x7f);
@@ -701,14 +702,14 @@ static int mcp7941x_read_alarm(struct device *dev, struct rtc_wkalrm *t)
701 "enabled=%d polarity=%d irq=%d match=%d\n", __func__, 702 "enabled=%d polarity=%d irq=%d match=%d\n", __func__,
702 t->time.tm_sec, t->time.tm_min, t->time.tm_hour, 703 t->time.tm_sec, t->time.tm_min, t->time.tm_hour,
703 t->time.tm_wday, t->time.tm_mday, t->time.tm_mon, t->enabled, 704 t->time.tm_wday, t->time.tm_mday, t->time.tm_mon, t->enabled,
704 !!(ds1307->regs[6] & MCP7941X_BIT_ALMX_POL), 705 !!(ds1307->regs[6] & MCP794XX_BIT_ALMX_POL),
705 !!(ds1307->regs[6] & MCP7941X_BIT_ALMX_IF), 706 !!(ds1307->regs[6] & MCP794XX_BIT_ALMX_IF),
706 (ds1307->regs[6] & MCP7941X_MSK_ALMX_MATCH) >> 4); 707 (ds1307->regs[6] & MCP794XX_MSK_ALMX_MATCH) >> 4);
707 708
708 return 0; 709 return 0;
709} 710}
710 711
711static int mcp7941x_set_alarm(struct device *dev, struct rtc_wkalrm *t) 712static int mcp794xx_set_alarm(struct device *dev, struct rtc_wkalrm *t)
712{ 713{
713 struct i2c_client *client = to_i2c_client(dev); 714 struct i2c_client *client = to_i2c_client(dev);
714 struct ds1307 *ds1307 = i2c_get_clientdata(client); 715 struct ds1307 *ds1307 = i2c_get_clientdata(client);
@@ -725,7 +726,7 @@ static int mcp7941x_set_alarm(struct device *dev, struct rtc_wkalrm *t)
725 t->enabled, t->pending); 726 t->enabled, t->pending);
726 727
727 /* Read control and alarm 0 registers. */ 728 /* Read control and alarm 0 registers. */
728 ret = ds1307->read_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 729 ret = ds1307->read_block_data(client, MCP794XX_REG_CONTROL, 10, regs);
729 if (ret < 0) 730 if (ret < 0)
730 return ret; 731 return ret;
731 732
@@ -738,23 +739,23 @@ static int mcp7941x_set_alarm(struct device *dev, struct rtc_wkalrm *t)
738 regs[8] = bin2bcd(t->time.tm_mon) + 1; 739 regs[8] = bin2bcd(t->time.tm_mon) + 1;
739 740
740 /* Clear the alarm 0 interrupt flag. */ 741 /* Clear the alarm 0 interrupt flag. */
741 regs[6] &= ~MCP7941X_BIT_ALMX_IF; 742 regs[6] &= ~MCP794XX_BIT_ALMX_IF;
742 /* Set alarm match: second, minute, hour, day, date, month. */ 743 /* Set alarm match: second, minute, hour, day, date, month. */
743 regs[6] |= MCP7941X_MSK_ALMX_MATCH; 744 regs[6] |= MCP794XX_MSK_ALMX_MATCH;
744 745
745 if (t->enabled) 746 if (t->enabled)
746 regs[0] |= MCP7941X_BIT_ALM0_EN; 747 regs[0] |= MCP794XX_BIT_ALM0_EN;
747 else 748 else
748 regs[0] &= ~MCP7941X_BIT_ALM0_EN; 749 regs[0] &= ~MCP794XX_BIT_ALM0_EN;
749 750
750 ret = ds1307->write_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 751 ret = ds1307->write_block_data(client, MCP794XX_REG_CONTROL, 10, regs);
751 if (ret < 0) 752 if (ret < 0)
752 return ret; 753 return ret;
753 754
754 return 0; 755 return 0;
755} 756}
756 757
757static int mcp7941x_alarm_irq_enable(struct device *dev, unsigned int enabled) 758static int mcp794xx_alarm_irq_enable(struct device *dev, unsigned int enabled)
758{ 759{
759 struct i2c_client *client = to_i2c_client(dev); 760 struct i2c_client *client = to_i2c_client(dev);
760 struct ds1307 *ds1307 = i2c_get_clientdata(client); 761 struct ds1307 *ds1307 = i2c_get_clientdata(client);
@@ -763,24 +764,24 @@ static int mcp7941x_alarm_irq_enable(struct device *dev, unsigned int enabled)
763 if (!test_bit(HAS_ALARM, &ds1307->flags)) 764 if (!test_bit(HAS_ALARM, &ds1307->flags))
764 return -EINVAL; 765 return -EINVAL;
765 766
766 reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_CONTROL); 767 reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_CONTROL);
767 if (reg < 0) 768 if (reg < 0)
768 return reg; 769 return reg;
769 770
770 if (enabled) 771 if (enabled)
771 reg |= MCP7941X_BIT_ALM0_EN; 772 reg |= MCP794XX_BIT_ALM0_EN;
772 else 773 else
773 reg &= ~MCP7941X_BIT_ALM0_EN; 774 reg &= ~MCP794XX_BIT_ALM0_EN;
774 775
775 return i2c_smbus_write_byte_data(client, MCP7941X_REG_CONTROL, reg); 776 return i2c_smbus_write_byte_data(client, MCP794XX_REG_CONTROL, reg);
776} 777}
777 778
778static const struct rtc_class_ops mcp7941x_rtc_ops = { 779static const struct rtc_class_ops mcp794xx_rtc_ops = {
779 .read_time = ds1307_get_time, 780 .read_time = ds1307_get_time,
780 .set_time = ds1307_set_time, 781 .set_time = ds1307_set_time,
781 .read_alarm = mcp7941x_read_alarm, 782 .read_alarm = mcp794xx_read_alarm,
782 .set_alarm = mcp7941x_set_alarm, 783 .set_alarm = mcp794xx_set_alarm,
783 .alarm_irq_enable = mcp7941x_alarm_irq_enable, 784 .alarm_irq_enable = mcp794xx_alarm_irq_enable,
784}; 785};
785 786
786/*----------------------------------------------------------------------*/ 787/*----------------------------------------------------------------------*/
@@ -1049,10 +1050,10 @@ static int ds1307_probe(struct i2c_client *client,
1049 case ds_1388: 1050 case ds_1388:
1050 ds1307->offset = 1; /* Seconds starts at 1 */ 1051 ds1307->offset = 1; /* Seconds starts at 1 */
1051 break; 1052 break;
1052 case mcp7941x: 1053 case mcp794xx:
1053 rtc_ops = &mcp7941x_rtc_ops; 1054 rtc_ops = &mcp794xx_rtc_ops;
1054 if (ds1307->client->irq > 0 && chip->alarm) { 1055 if (ds1307->client->irq > 0 && chip->alarm) {
1055 INIT_WORK(&ds1307->work, mcp7941x_work); 1056 INIT_WORK(&ds1307->work, mcp794xx_work);
1056 want_irq = true; 1057 want_irq = true;
1057 } 1058 }
1058 break; 1059 break;
@@ -1117,18 +1118,18 @@ read_rtc:
1117 dev_warn(&client->dev, "SET TIME!\n"); 1118 dev_warn(&client->dev, "SET TIME!\n");
1118 } 1119 }
1119 break; 1120 break;
1120 case mcp7941x: 1121 case mcp794xx:
1121 /* make sure that the backup battery is enabled */ 1122 /* make sure that the backup battery is enabled */
1122 if (!(ds1307->regs[DS1307_REG_WDAY] & MCP7941X_BIT_VBATEN)) { 1123 if (!(ds1307->regs[DS1307_REG_WDAY] & MCP794XX_BIT_VBATEN)) {
1123 i2c_smbus_write_byte_data(client, DS1307_REG_WDAY, 1124 i2c_smbus_write_byte_data(client, DS1307_REG_WDAY,
1124 ds1307->regs[DS1307_REG_WDAY] 1125 ds1307->regs[DS1307_REG_WDAY]
1125 | MCP7941X_BIT_VBATEN); 1126 | MCP794XX_BIT_VBATEN);
1126 } 1127 }
1127 1128
1128 /* clock halted? turn it on, so clock can tick. */ 1129 /* clock halted? turn it on, so clock can tick. */
1129 if (!(tmp & MCP7941X_BIT_ST)) { 1130 if (!(tmp & MCP794XX_BIT_ST)) {
1130 i2c_smbus_write_byte_data(client, DS1307_REG_SECS, 1131 i2c_smbus_write_byte_data(client, DS1307_REG_SECS,
1131 MCP7941X_BIT_ST); 1132 MCP794XX_BIT_ST);
1132 dev_warn(&client->dev, "SET TIME!\n"); 1133 dev_warn(&client->dev, "SET TIME!\n");
1133 goto read_rtc; 1134 goto read_rtc;
1134 } 1135 }
diff --git a/drivers/rtc/rtc-ds1374.c b/drivers/rtc/rtc-ds1374.c
index 9e6e14fb53d7..8605fde394b2 100644
--- a/drivers/rtc/rtc-ds1374.c
+++ b/drivers/rtc/rtc-ds1374.c
@@ -4,6 +4,7 @@
4 * Based on code by Randy Vinson <rvinson@mvista.com>, 4 * Based on code by Randy Vinson <rvinson@mvista.com>,
5 * which was based on the m41t00.c by Mark Greer <mgreer@mvista.com>. 5 * which was based on the m41t00.c by Mark Greer <mgreer@mvista.com>.
6 * 6 *
7 * Copyright (C) 2014 Rose Technology
7 * Copyright (C) 2006-2007 Freescale Semiconductor 8 * Copyright (C) 2006-2007 Freescale Semiconductor
8 * 9 *
9 * 2005 (c) MontaVista Software, Inc. This file is licensed under 10 * 2005 (c) MontaVista Software, Inc. This file is licensed under
@@ -26,6 +27,13 @@
26#include <linux/workqueue.h> 27#include <linux/workqueue.h>
27#include <linux/slab.h> 28#include <linux/slab.h>
28#include <linux/pm.h> 29#include <linux/pm.h>
30#ifdef CONFIG_RTC_DRV_DS1374_WDT
31#include <linux/fs.h>
32#include <linux/ioctl.h>
33#include <linux/miscdevice.h>
34#include <linux/reboot.h>
35#include <linux/watchdog.h>
36#endif
29 37
30#define DS1374_REG_TOD0 0x00 /* Time of Day */ 38#define DS1374_REG_TOD0 0x00 /* Time of Day */
31#define DS1374_REG_TOD1 0x01 39#define DS1374_REG_TOD1 0x01
@@ -49,6 +57,14 @@ static const struct i2c_device_id ds1374_id[] = {
49}; 57};
50MODULE_DEVICE_TABLE(i2c, ds1374_id); 58MODULE_DEVICE_TABLE(i2c, ds1374_id);
51 59
60#ifdef CONFIG_OF
61static const struct of_device_id ds1374_of_match[] = {
62 { .compatible = "dallas,ds1374" },
63 { }
64};
65MODULE_DEVICE_TABLE(of, ds1374_of_match);
66#endif
67
52struct ds1374 { 68struct ds1374 {
53 struct i2c_client *client; 69 struct i2c_client *client;
54 struct rtc_device *rtc; 70 struct rtc_device *rtc;
@@ -162,6 +178,7 @@ static int ds1374_set_time(struct device *dev, struct rtc_time *time)
162 return ds1374_write_rtc(client, itime, DS1374_REG_TOD0, 4); 178 return ds1374_write_rtc(client, itime, DS1374_REG_TOD0, 4);
163} 179}
164 180
181#ifndef CONFIG_RTC_DRV_DS1374_WDT
165/* The ds1374 has a decrementer for an alarm, rather than a comparator. 182/* The ds1374 has a decrementer for an alarm, rather than a comparator.
166 * If the time of day is changed, then the alarm will need to be 183 * If the time of day is changed, then the alarm will need to be
167 * reset. 184 * reset.
@@ -263,6 +280,7 @@ out:
263 mutex_unlock(&ds1374->mutex); 280 mutex_unlock(&ds1374->mutex);
264 return ret; 281 return ret;
265} 282}
283#endif
266 284
267static irqreturn_t ds1374_irq(int irq, void *dev_id) 285static irqreturn_t ds1374_irq(int irq, void *dev_id)
268{ 286{
@@ -307,6 +325,7 @@ unlock:
307 mutex_unlock(&ds1374->mutex); 325 mutex_unlock(&ds1374->mutex);
308} 326}
309 327
328#ifndef CONFIG_RTC_DRV_DS1374_WDT
310static int ds1374_alarm_irq_enable(struct device *dev, unsigned int enabled) 329static int ds1374_alarm_irq_enable(struct device *dev, unsigned int enabled)
311{ 330{
312 struct i2c_client *client = to_i2c_client(dev); 331 struct i2c_client *client = to_i2c_client(dev);
@@ -331,15 +350,260 @@ out:
331 mutex_unlock(&ds1374->mutex); 350 mutex_unlock(&ds1374->mutex);
332 return ret; 351 return ret;
333} 352}
353#endif
334 354
335static const struct rtc_class_ops ds1374_rtc_ops = { 355static const struct rtc_class_ops ds1374_rtc_ops = {
336 .read_time = ds1374_read_time, 356 .read_time = ds1374_read_time,
337 .set_time = ds1374_set_time, 357 .set_time = ds1374_set_time,
358#ifndef CONFIG_RTC_DRV_DS1374_WDT
338 .read_alarm = ds1374_read_alarm, 359 .read_alarm = ds1374_read_alarm,
339 .set_alarm = ds1374_set_alarm, 360 .set_alarm = ds1374_set_alarm,
340 .alarm_irq_enable = ds1374_alarm_irq_enable, 361 .alarm_irq_enable = ds1374_alarm_irq_enable,
362#endif
363};
364
365#ifdef CONFIG_RTC_DRV_DS1374_WDT
366/*
367 *****************************************************************************
368 *
369 * Watchdog Driver
370 *
371 *****************************************************************************
372 */
373static struct i2c_client *save_client;
374/* Default margin */
375#define WD_TIMO 131762
376
377#define DRV_NAME "DS1374 Watchdog"
378
379static int wdt_margin = WD_TIMO;
380static unsigned long wdt_is_open;
381module_param(wdt_margin, int, 0);
382MODULE_PARM_DESC(wdt_margin, "Watchdog timeout in seconds (default 32s)");
383
384static const struct watchdog_info ds1374_wdt_info = {
385 .identity = "DS1374 WTD",
386 .options = WDIOF_SETTIMEOUT | WDIOF_KEEPALIVEPING |
387 WDIOF_MAGICCLOSE,
341}; 388};
342 389
390static int ds1374_wdt_settimeout(unsigned int timeout)
391{
392 int ret = -ENOIOCTLCMD;
393 int cr;
394
395 ret = cr = i2c_smbus_read_byte_data(save_client, DS1374_REG_CR);
396 if (ret < 0)
397 goto out;
398
399 /* Disable any existing watchdog/alarm before setting the new one */
400 cr &= ~DS1374_REG_CR_WACE;
401
402 ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr);
403 if (ret < 0)
404 goto out;
405
406 /* Set new watchdog time */
407 ret = ds1374_write_rtc(save_client, timeout, DS1374_REG_WDALM0, 3);
408 if (ret) {
409 pr_info("rtc-ds1374 - couldn't set new watchdog time\n");
410 goto out;
411 }
412
413 /* Enable watchdog timer */
414 cr |= DS1374_REG_CR_WACE | DS1374_REG_CR_WDALM;
415 cr &= ~DS1374_REG_CR_AIE;
416
417 ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr);
418 if (ret < 0)
419 goto out;
420
421 return 0;
422out:
423 return ret;
424}
425
426
427/*
428 * Reload the watchdog timer. (ie, pat the watchdog)
429 */
430static void ds1374_wdt_ping(void)
431{
432 u32 val;
433 int ret = 0;
434
435 ret = ds1374_read_rtc(save_client, &val, DS1374_REG_WDALM0, 3);
436 if (ret)
437 pr_info("WD TICK FAIL!!!!!!!!!! %i\n", ret);
438}
439
440static void ds1374_wdt_disable(void)
441{
442 int ret = -ENOIOCTLCMD;
443 int cr;
444
445 cr = i2c_smbus_read_byte_data(save_client, DS1374_REG_CR);
446 /* Disable watchdog timer */
447 cr &= ~DS1374_REG_CR_WACE;
448
449 ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr);
450}
451
452/*
453 * Watchdog device is opened, and watchdog starts running.
454 */
455static int ds1374_wdt_open(struct inode *inode, struct file *file)
456{
457 struct ds1374 *ds1374 = i2c_get_clientdata(save_client);
458
459 if (MINOR(inode->i_rdev) == WATCHDOG_MINOR) {
460 mutex_lock(&ds1374->mutex);
461 if (test_and_set_bit(0, &wdt_is_open)) {
462 mutex_unlock(&ds1374->mutex);
463 return -EBUSY;
464 }
465 /*
466 * Activate
467 */
468 wdt_is_open = 1;
469 mutex_unlock(&ds1374->mutex);
470 return nonseekable_open(inode, file);
471 }
472 return -ENODEV;
473}
474
475/*
476 * Close the watchdog device.
477 */
478static int ds1374_wdt_release(struct inode *inode, struct file *file)
479{
480 if (MINOR(inode->i_rdev) == WATCHDOG_MINOR)
481 clear_bit(0, &wdt_is_open);
482
483 return 0;
484}
485
486/*
487 * Pat the watchdog whenever device is written to.
488 */
489static ssize_t ds1374_wdt_write(struct file *file, const char __user *data,
490 size_t len, loff_t *ppos)
491{
492 if (len) {
493 ds1374_wdt_ping();
494 return 1;
495 }
496 return 0;
497}
498
499static ssize_t ds1374_wdt_read(struct file *file, char __user *data,
500 size_t len, loff_t *ppos)
501{
502 return 0;
503}
504
505/*
506 * Handle commands from user-space.
507 */
508static long ds1374_wdt_ioctl(struct file *file, unsigned int cmd,
509 unsigned long arg)
510{
511 int new_margin, options;
512
513 switch (cmd) {
514 case WDIOC_GETSUPPORT:
515 return copy_to_user((struct watchdog_info __user *)arg,
516 &ds1374_wdt_info, sizeof(ds1374_wdt_info)) ? -EFAULT : 0;
517
518 case WDIOC_GETSTATUS:
519 case WDIOC_GETBOOTSTATUS:
520 return put_user(0, (int __user *)arg);
521 case WDIOC_KEEPALIVE:
522 ds1374_wdt_ping();
523 return 0;
524 case WDIOC_SETTIMEOUT:
525 if (get_user(new_margin, (int __user *)arg))
526 return -EFAULT;
527
528 if (new_margin < 1 || new_margin > 16777216)
529 return -EINVAL;
530
531 wdt_margin = new_margin;
532 ds1374_wdt_settimeout(new_margin);
533 ds1374_wdt_ping();
534 /* fallthrough */
535 case WDIOC_GETTIMEOUT:
536 return put_user(wdt_margin, (int __user *)arg);
537 case WDIOC_SETOPTIONS:
538 if (copy_from_user(&options, (int __user *)arg, sizeof(int)))
539 return -EFAULT;
540
541 if (options & WDIOS_DISABLECARD) {
542 pr_info("rtc-ds1374: disable watchdog\n");
543 ds1374_wdt_disable();
544 }
545
546 if (options & WDIOS_ENABLECARD) {
547 pr_info("rtc-ds1374: enable watchdog\n");
548 ds1374_wdt_settimeout(wdt_margin);
549 ds1374_wdt_ping();
550 }
551
552 return -EINVAL;
553 }
554 return -ENOTTY;
555}
556
557static long ds1374_wdt_unlocked_ioctl(struct file *file, unsigned int cmd,
558 unsigned long arg)
559{
560 int ret;
561 struct ds1374 *ds1374 = i2c_get_clientdata(save_client);
562
563 mutex_lock(&ds1374->mutex);
564 ret = ds1374_wdt_ioctl(file, cmd, arg);
565 mutex_unlock(&ds1374->mutex);
566
567 return ret;
568}
569
570static int ds1374_wdt_notify_sys(struct notifier_block *this,
571 unsigned long code, void *unused)
572{
573 if (code == SYS_DOWN || code == SYS_HALT)
574 /* Disable Watchdog */
575 ds1374_wdt_disable();
576 return NOTIFY_DONE;
577}
578
579static const struct file_operations ds1374_wdt_fops = {
580 .owner = THIS_MODULE,
581 .read = ds1374_wdt_read,
582 .unlocked_ioctl = ds1374_wdt_unlocked_ioctl,
583 .write = ds1374_wdt_write,
584 .open = ds1374_wdt_open,
585 .release = ds1374_wdt_release,
586 .llseek = no_llseek,
587};
588
589static struct miscdevice ds1374_miscdev = {
590 .minor = WATCHDOG_MINOR,
591 .name = "watchdog",
592 .fops = &ds1374_wdt_fops,
593};
594
595static struct notifier_block ds1374_wdt_notifier = {
596 .notifier_call = ds1374_wdt_notify_sys,
597};
598
599#endif /*CONFIG_RTC_DRV_DS1374_WDT*/
600/*
601 *****************************************************************************
602 *
603 * Driver Interface
604 *
605 *****************************************************************************
606 */
343static int ds1374_probe(struct i2c_client *client, 607static int ds1374_probe(struct i2c_client *client,
344 const struct i2c_device_id *id) 608 const struct i2c_device_id *id)
345{ 609{
@@ -378,12 +642,33 @@ static int ds1374_probe(struct i2c_client *client,
378 return PTR_ERR(ds1374->rtc); 642 return PTR_ERR(ds1374->rtc);
379 } 643 }
380 644
645#ifdef CONFIG_RTC_DRV_DS1374_WDT
646 save_client = client;
647 ret = misc_register(&ds1374_miscdev);
648 if (ret)
649 return ret;
650 ret = register_reboot_notifier(&ds1374_wdt_notifier);
651 if (ret) {
652 misc_deregister(&ds1374_miscdev);
653 return ret;
654 }
655 ds1374_wdt_settimeout(131072);
656#endif
657
381 return 0; 658 return 0;
382} 659}
383 660
384static int ds1374_remove(struct i2c_client *client) 661static int ds1374_remove(struct i2c_client *client)
385{ 662{
386 struct ds1374 *ds1374 = i2c_get_clientdata(client); 663 struct ds1374 *ds1374 = i2c_get_clientdata(client);
664#ifdef CONFIG_RTC_DRV_DS1374_WDT
665 int res;
666
667 res = misc_deregister(&ds1374_miscdev);
668 if (!res)
669 ds1374_miscdev.parent = NULL;
670 unregister_reboot_notifier(&ds1374_wdt_notifier);
671#endif
387 672
388 if (client->irq > 0) { 673 if (client->irq > 0) {
389 mutex_lock(&ds1374->mutex); 674 mutex_lock(&ds1374->mutex);
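The new CONFIG_RTC_DRV_DS1374_WDT code exposes the chip's watchdog through the standard /dev/watchdog misc device, so a generic watchdog daemon can drive it. A minimal user-space sketch of that interface is below; note that this driver stores the WDIOC_SETTIMEOUT value straight into wdt_margin, so check the effective units against what the driver programs rather than assuming seconds.

/* Sketch of a /dev/watchdog client; error handling trimmed for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
        int timeout = 60;
        int fd = open("/dev/watchdog", O_WRONLY);

        if (fd < 0) {
                perror("open /dev/watchdog");
                return 1;
        }

        ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* program the margin */

        for (;;) {
                ioctl(fd, WDIOC_KEEPALIVE, 0);  /* pat the watchdog */
                sleep(timeout / 2);
        }
}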
diff --git a/drivers/rtc/rtc-isl12057.c b/drivers/rtc/rtc-isl12057.c
index 455b601d731d..6e1fcfb5d7e6 100644
--- a/drivers/rtc/rtc-isl12057.c
+++ b/drivers/rtc/rtc-isl12057.c
@@ -41,6 +41,7 @@
41#define ISL12057_REG_RTC_DW 0x03 /* Day of the Week */ 41#define ISL12057_REG_RTC_DW 0x03 /* Day of the Week */
42#define ISL12057_REG_RTC_DT 0x04 /* Date */ 42#define ISL12057_REG_RTC_DT 0x04 /* Date */
43#define ISL12057_REG_RTC_MO 0x05 /* Month */ 43#define ISL12057_REG_RTC_MO 0x05 /* Month */
44#define ISL12057_REG_RTC_MO_CEN BIT(7) /* Century bit */
44#define ISL12057_REG_RTC_YR 0x06 /* Year */ 45#define ISL12057_REG_RTC_YR 0x06 /* Year */
45#define ISL12057_RTC_SEC_LEN 7 46#define ISL12057_RTC_SEC_LEN 7
46 47
@@ -88,7 +89,7 @@ static void isl12057_rtc_regs_to_tm(struct rtc_time *tm, u8 *regs)
88 tm->tm_min = bcd2bin(regs[ISL12057_REG_RTC_MN]); 89 tm->tm_min = bcd2bin(regs[ISL12057_REG_RTC_MN]);
89 90
90 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_MIL) { /* AM/PM */ 91 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_MIL) { /* AM/PM */
91 tm->tm_hour = bcd2bin(regs[ISL12057_REG_RTC_HR] & 0x0f); 92 tm->tm_hour = bcd2bin(regs[ISL12057_REG_RTC_HR] & 0x1f);
92 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_PM) 93 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_PM)
93 tm->tm_hour += 12; 94 tm->tm_hour += 12;
94 } else { /* 24 hour mode */ 95 } else { /* 24 hour mode */
@@ -97,26 +98,37 @@ static void isl12057_rtc_regs_to_tm(struct rtc_time *tm, u8 *regs)
97 98
98 tm->tm_mday = bcd2bin(regs[ISL12057_REG_RTC_DT]); 99 tm->tm_mday = bcd2bin(regs[ISL12057_REG_RTC_DT]);
99 tm->tm_wday = bcd2bin(regs[ISL12057_REG_RTC_DW]) - 1; /* starts at 1 */ 100 tm->tm_wday = bcd2bin(regs[ISL12057_REG_RTC_DW]) - 1; /* starts at 1 */
100 tm->tm_mon = bcd2bin(regs[ISL12057_REG_RTC_MO]) - 1; /* starts at 1 */ 101 tm->tm_mon = bcd2bin(regs[ISL12057_REG_RTC_MO] & 0x1f) - 1; /* ditto */
101 tm->tm_year = bcd2bin(regs[ISL12057_REG_RTC_YR]) + 100; 102 tm->tm_year = bcd2bin(regs[ISL12057_REG_RTC_YR]) + 100;
103
104 /* Check if years register has overflown from 99 to 00 */
105 if (regs[ISL12057_REG_RTC_MO] & ISL12057_REG_RTC_MO_CEN)
106 tm->tm_year += 100;
102} 107}
103 108
104static int isl12057_rtc_tm_to_regs(u8 *regs, struct rtc_time *tm) 109static int isl12057_rtc_tm_to_regs(u8 *regs, struct rtc_time *tm)
105{ 110{
111 u8 century_bit;
112
106 /* 113 /*
107 * The clock has an 8 bit wide bcd-coded register for the year. 114 * The clock has an 8 bit wide bcd-coded register for the year.
115 * It also has a century bit encoded in MO flag which provides
116 * information about overflow of year register from 99 to 00.
108 * tm_year is an offset from 1900 and we are interested in the 117 * tm_year is an offset from 1900 and we are interested in the
109 * 2000-2099 range, so any value less than 100 is invalid. 118 * 2000-2199 range, so any value less than 100 or larger than
119 * 299 is invalid.
110 */ 120 */
111 if (tm->tm_year < 100) 121 if (tm->tm_year < 100 || tm->tm_year > 299)
112 return -EINVAL; 122 return -EINVAL;
113 123
124 century_bit = (tm->tm_year > 199) ? ISL12057_REG_RTC_MO_CEN : 0;
125
114 regs[ISL12057_REG_RTC_SC] = bin2bcd(tm->tm_sec); 126 regs[ISL12057_REG_RTC_SC] = bin2bcd(tm->tm_sec);
115 regs[ISL12057_REG_RTC_MN] = bin2bcd(tm->tm_min); 127 regs[ISL12057_REG_RTC_MN] = bin2bcd(tm->tm_min);
116 regs[ISL12057_REG_RTC_HR] = bin2bcd(tm->tm_hour); /* 24-hour format */ 128 regs[ISL12057_REG_RTC_HR] = bin2bcd(tm->tm_hour); /* 24-hour format */
117 regs[ISL12057_REG_RTC_DT] = bin2bcd(tm->tm_mday); 129 regs[ISL12057_REG_RTC_DT] = bin2bcd(tm->tm_mday);
118 regs[ISL12057_REG_RTC_MO] = bin2bcd(tm->tm_mon + 1); 130 regs[ISL12057_REG_RTC_MO] = bin2bcd(tm->tm_mon + 1) | century_bit;
119 regs[ISL12057_REG_RTC_YR] = bin2bcd(tm->tm_year - 100); 131 regs[ISL12057_REG_RTC_YR] = bin2bcd(tm->tm_year % 100);
120 regs[ISL12057_REG_RTC_DW] = bin2bcd(tm->tm_wday + 1); 132 regs[ISL12057_REG_RTC_DW] = bin2bcd(tm->tm_wday + 1);
121 133
122 return 0; 134 return 0;
@@ -152,17 +164,33 @@ static int isl12057_rtc_read_time(struct device *dev, struct rtc_time *tm)
152{ 164{
153 struct isl12057_rtc_data *data = dev_get_drvdata(dev); 165 struct isl12057_rtc_data *data = dev_get_drvdata(dev);
154 u8 regs[ISL12057_RTC_SEC_LEN]; 166 u8 regs[ISL12057_RTC_SEC_LEN];
167 unsigned int sr;
155 int ret; 168 int ret;
156 169
157 mutex_lock(&data->lock); 170 mutex_lock(&data->lock);
171 ret = regmap_read(data->regmap, ISL12057_REG_SR, &sr);
172 if (ret) {
173 dev_err(dev, "%s: unable to read oscillator status flag (%d)\n",
174 __func__, ret);
175 goto out;
176 } else {
177 if (sr & ISL12057_REG_SR_OSF) {
178 ret = -ENODATA;
179 goto out;
180 }
181 }
182
158 ret = regmap_bulk_read(data->regmap, ISL12057_REG_RTC_SC, regs, 183 ret = regmap_bulk_read(data->regmap, ISL12057_REG_RTC_SC, regs,
159 ISL12057_RTC_SEC_LEN); 184 ISL12057_RTC_SEC_LEN);
185 if (ret)
186 dev_err(dev, "%s: unable to read RTC time section (%d)\n",
187 __func__, ret);
188
189out:
160 mutex_unlock(&data->lock); 190 mutex_unlock(&data->lock);
161 191
162 if (ret) { 192 if (ret)
163 dev_err(dev, "%s: RTC read failed\n", __func__);
164 return ret; 193 return ret;
165 }
166 194
167 isl12057_rtc_regs_to_tm(tm, regs); 195 isl12057_rtc_regs_to_tm(tm, regs);
168 196
@@ -182,10 +210,24 @@ static int isl12057_rtc_set_time(struct device *dev, struct rtc_time *tm)
182 mutex_lock(&data->lock); 210 mutex_lock(&data->lock);
183 ret = regmap_bulk_write(data->regmap, ISL12057_REG_RTC_SC, regs, 211 ret = regmap_bulk_write(data->regmap, ISL12057_REG_RTC_SC, regs,
184 ISL12057_RTC_SEC_LEN); 212 ISL12057_RTC_SEC_LEN);
185 mutex_unlock(&data->lock); 213 if (ret) {
214 dev_err(dev, "%s: unable to write RTC time section (%d)\n",
215 __func__, ret);
216 goto out;
217 }
186 218
187 if (ret) 219 /*
188 dev_err(dev, "%s: RTC write failed\n", __func__); 220 * Now that RTC time has been updated, let's clear oscillator
221 * failure flag, if needed.
222 */
223 ret = regmap_update_bits(data->regmap, ISL12057_REG_SR,
224 ISL12057_REG_SR_OSF, 0);
225 if (ret < 0)
226 dev_err(dev, "%s: unable to clear osc. failure bit (%d)\n",
227 __func__, ret);
228
229out:
230 mutex_unlock(&data->lock);
189 231
190 return ret; 232 return ret;
191} 233}
@@ -203,15 +245,8 @@ static int isl12057_check_rtc_status(struct device *dev, struct regmap *regmap)
203 ret = regmap_update_bits(regmap, ISL12057_REG_INT, 245 ret = regmap_update_bits(regmap, ISL12057_REG_INT,
204 ISL12057_REG_INT_EOSC, 0); 246 ISL12057_REG_INT_EOSC, 0);
205 if (ret < 0) { 247 if (ret < 0) {
206 dev_err(dev, "Unable to enable oscillator\n"); 248 dev_err(dev, "%s: unable to enable oscillator (%d)\n",
207 return ret; 249 __func__, ret);
208 }
209
210 /* Clear oscillator failure bit if needed */
211 ret = regmap_update_bits(regmap, ISL12057_REG_SR,
212 ISL12057_REG_SR_OSF, 0);
213 if (ret < 0) {
214 dev_err(dev, "Unable to clear oscillator failure bit\n");
215 return ret; 250 return ret;
216 } 251 }
217 252
@@ -219,7 +254,8 @@ static int isl12057_check_rtc_status(struct device *dev, struct regmap *regmap)
219 ret = regmap_update_bits(regmap, ISL12057_REG_SR, 254 ret = regmap_update_bits(regmap, ISL12057_REG_SR,
220 ISL12057_REG_SR_A1F, 0); 255 ISL12057_REG_SR_A1F, 0);
221 if (ret < 0) { 256 if (ret < 0) {
222 dev_err(dev, "Unable to clear alarm bit\n"); 257 dev_err(dev, "%s: unable to clear alarm bit (%d)\n",
258 __func__, ret);
223 return ret; 259 return ret;
224 } 260 }
225 261
@@ -253,7 +289,8 @@ static int isl12057_probe(struct i2c_client *client,
253 regmap = devm_regmap_init_i2c(client, &isl12057_rtc_regmap_config); 289 regmap = devm_regmap_init_i2c(client, &isl12057_rtc_regmap_config);
254 if (IS_ERR(regmap)) { 290 if (IS_ERR(regmap)) {
255 ret = PTR_ERR(regmap); 291 ret = PTR_ERR(regmap);
256 dev_err(dev, "regmap allocation failed: %d\n", ret); 292 dev_err(dev, "%s: regmap allocation failed (%d)\n",
293 __func__, ret);
257 return ret; 294 return ret;
258 } 295 }
259 296
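The isl12057 changes extend the usable range from 2000-2099 to 2000-2199 by treating bit 7 of the month register as a century flag. The conversion reduces to the arithmetic below, shown as a standalone sketch; bcd2bin()/bin2bcd() here are local stand-ins for the kernel helpers the driver uses.

/* Sketch of the century-bit arithmetic; tm_year is the offset from 1900. */
#include <stdbool.h>

static unsigned int bcd2bin(unsigned char v) { return (v >> 4) * 10 + (v & 0x0f); }
static unsigned char bin2bcd(unsigned int v) { return ((v / 10) << 4) | (v % 10); }

static int regs_to_tm_year(unsigned char yr_bcd, bool century)
{
        int tm_year = bcd2bin(yr_bcd) + 100;    /* 20xx */

        if (century)
                tm_year += 100;                 /* year register wrapped: 21xx */
        return tm_year;
}

static int tm_year_to_regs(int tm_year, unsigned char *yr_bcd, bool *century)
{
        if (tm_year < 100 || tm_year > 299)
                return -1;                      /* outside 2000-2199 */

        *century = tm_year > 199;
        *yr_bcd = bin2bcd(tm_year % 100);
        return 0;
}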
diff --git a/drivers/rtc/rtc-omap.c b/drivers/rtc/rtc-omap.c
index 21142e6574a9..4f1c6ca97211 100644
--- a/drivers/rtc/rtc-omap.c
+++ b/drivers/rtc/rtc-omap.c
@@ -1,10 +1,11 @@
1/* 1/*
2 * TI OMAP1 Real Time Clock interface for Linux 2 * TI OMAP Real Time Clock interface for Linux
3 * 3 *
4 * Copyright (C) 2003 MontaVista Software, Inc. 4 * Copyright (C) 2003 MontaVista Software, Inc.
5 * Author: George G. Davis <gdavis@mvista.com> or <source@mvista.com> 5 * Author: George G. Davis <gdavis@mvista.com> or <source@mvista.com>
6 * 6 *
7 * Copyright (C) 2006 David Brownell (new RTC framework) 7 * Copyright (C) 2006 David Brownell (new RTC framework)
8 * Copyright (C) 2014 Johan Hovold <johan@kernel.org>
8 * 9 *
9 * This program is free software; you can redistribute it and/or 10 * This program is free software; you can redistribute it and/or
10 * modify it under the terms of the GNU General Public License 11 * modify it under the terms of the GNU General Public License
@@ -25,7 +26,8 @@
25#include <linux/pm_runtime.h> 26#include <linux/pm_runtime.h>
26#include <linux/io.h> 27#include <linux/io.h>
27 28
28/* The OMAP1 RTC is a year/month/day/hours/minutes/seconds BCD clock 29/*
30 * The OMAP RTC is a year/month/day/hours/minutes/seconds BCD clock
29 * with century-range alarm matching, driven by the 32kHz clock. 31 * with century-range alarm matching, driven by the 32kHz clock.
30 * 32 *
31 * The main user-visible ways it differs from PC RTCs are by omitting 33 * The main user-visible ways it differs from PC RTCs are by omitting
@@ -39,10 +41,6 @@
39 * the SoC). See the BOARD-SPECIFIC CUSTOMIZATION comment. 41 * the SoC). See the BOARD-SPECIFIC CUSTOMIZATION comment.
40 */ 42 */
41 43
42#define DRIVER_NAME "omap_rtc"
43
44#define OMAP_RTC_BASE 0xfffb4800
45
46/* RTC registers */ 44/* RTC registers */
47#define OMAP_RTC_SECONDS_REG 0x00 45#define OMAP_RTC_SECONDS_REG 0x00
48#define OMAP_RTC_MINUTES_REG 0x04 46#define OMAP_RTC_MINUTES_REG 0x04
@@ -72,6 +70,15 @@
72 70
73#define OMAP_RTC_IRQWAKEEN 0x7c 71#define OMAP_RTC_IRQWAKEEN 0x7c
74 72
73#define OMAP_RTC_ALARM2_SECONDS_REG 0x80
74#define OMAP_RTC_ALARM2_MINUTES_REG 0x84
75#define OMAP_RTC_ALARM2_HOURS_REG 0x88
76#define OMAP_RTC_ALARM2_DAYS_REG 0x8c
77#define OMAP_RTC_ALARM2_MONTHS_REG 0x90
78#define OMAP_RTC_ALARM2_YEARS_REG 0x94
79
80#define OMAP_RTC_PMIC_REG 0x98
81
75/* OMAP_RTC_CTRL_REG bit fields: */ 82/* OMAP_RTC_CTRL_REG bit fields: */
76#define OMAP_RTC_CTRL_SPLIT BIT(7) 83#define OMAP_RTC_CTRL_SPLIT BIT(7)
77#define OMAP_RTC_CTRL_DISABLE BIT(6) 84#define OMAP_RTC_CTRL_DISABLE BIT(6)
@@ -84,6 +91,7 @@
84 91
85/* OMAP_RTC_STATUS_REG bit fields: */ 92/* OMAP_RTC_STATUS_REG bit fields: */
86#define OMAP_RTC_STATUS_POWER_UP BIT(7) 93#define OMAP_RTC_STATUS_POWER_UP BIT(7)
94#define OMAP_RTC_STATUS_ALARM2 BIT(7)
87#define OMAP_RTC_STATUS_ALARM BIT(6) 95#define OMAP_RTC_STATUS_ALARM BIT(6)
88#define OMAP_RTC_STATUS_1D_EVENT BIT(5) 96#define OMAP_RTC_STATUS_1D_EVENT BIT(5)
89#define OMAP_RTC_STATUS_1H_EVENT BIT(4) 97#define OMAP_RTC_STATUS_1H_EVENT BIT(4)
@@ -93,6 +101,7 @@
93#define OMAP_RTC_STATUS_BUSY BIT(0) 101#define OMAP_RTC_STATUS_BUSY BIT(0)
94 102
95/* OMAP_RTC_INTERRUPTS_REG bit fields: */ 103/* OMAP_RTC_INTERRUPTS_REG bit fields: */
104#define OMAP_RTC_INTERRUPTS_IT_ALARM2 BIT(4)
96#define OMAP_RTC_INTERRUPTS_IT_ALARM BIT(3) 105#define OMAP_RTC_INTERRUPTS_IT_ALARM BIT(3)
97#define OMAP_RTC_INTERRUPTS_IT_TIMER BIT(2) 106#define OMAP_RTC_INTERRUPTS_IT_TIMER BIT(2)
98 107
@@ -102,61 +111,82 @@
102/* OMAP_RTC_IRQWAKEEN bit fields: */ 111/* OMAP_RTC_IRQWAKEEN bit fields: */
103#define OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN BIT(1) 112#define OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN BIT(1)
104 113
114/* OMAP_RTC_PMIC bit fields: */
115#define OMAP_RTC_PMIC_POWER_EN_EN BIT(16)
116
105/* OMAP_RTC_KICKER values */ 117/* OMAP_RTC_KICKER values */
106#define KICK0_VALUE 0x83e70b13 118#define KICK0_VALUE 0x83e70b13
107#define KICK1_VALUE 0x95a4f1e0 119#define KICK1_VALUE 0x95a4f1e0
108 120
109#define OMAP_RTC_HAS_KICKER BIT(0) 121struct omap_rtc_device_type {
110 122 bool has_32kclk_en;
111/* 123 bool has_kicker;
112 * Few RTC IP revisions has special WAKE-EN Register to enable Wakeup 124 bool has_irqwakeen;
113 * generation for event Alarm. 125 bool has_pmic_mode;
114 */ 126 bool has_power_up_reset;
115#define OMAP_RTC_HAS_IRQWAKEEN BIT(1) 127};
116 128
117/* 129struct omap_rtc {
118 * Some RTC IP revisions (like those in AM335x and DRA7x) need 130 struct rtc_device *rtc;
119 * the 32KHz clock to be explicitly enabled. 131 void __iomem *base;
120 */ 132 int irq_alarm;
121#define OMAP_RTC_HAS_32KCLK_EN BIT(2) 133 int irq_timer;
134 u8 interrupts_reg;
135 bool is_pmic_controller;
136 const struct omap_rtc_device_type *type;
137};
122 138
123static void __iomem *rtc_base; 139static inline u8 rtc_read(struct omap_rtc *rtc, unsigned int reg)
140{
141 return readb(rtc->base + reg);
142}
124 143
125#define rtc_read(addr) readb(rtc_base + (addr)) 144static inline u32 rtc_readl(struct omap_rtc *rtc, unsigned int reg)
126#define rtc_write(val, addr) writeb(val, rtc_base + (addr)) 145{
146 return readl(rtc->base + reg);
147}
127 148
128#define rtc_writel(val, addr) writel(val, rtc_base + (addr)) 149static inline void rtc_write(struct omap_rtc *rtc, unsigned int reg, u8 val)
150{
151 writeb(val, rtc->base + reg);
152}
129 153
154static inline void rtc_writel(struct omap_rtc *rtc, unsigned int reg, u32 val)
155{
156 writel(val, rtc->base + reg);
157}
130 158
131/* we rely on the rtc framework to handle locking (rtc->ops_lock), 159/*
160 * We rely on the rtc framework to handle locking (rtc->ops_lock),
132 * so the only other requirement is that register accesses which 161 * so the only other requirement is that register accesses which
133 * require BUSY to be clear are made with IRQs locally disabled 162 * require BUSY to be clear are made with IRQs locally disabled
134 */ 163 */
135static void rtc_wait_not_busy(void) 164static void rtc_wait_not_busy(struct omap_rtc *rtc)
136{ 165{
137 int count = 0; 166 int count;
138 u8 status; 167 u8 status;
139 168
140 /* BUSY may stay active for 1/32768 second (~30 usec) */ 169 /* BUSY may stay active for 1/32768 second (~30 usec) */
141 for (count = 0; count < 50; count++) { 170 for (count = 0; count < 50; count++) {
142 status = rtc_read(OMAP_RTC_STATUS_REG); 171 status = rtc_read(rtc, OMAP_RTC_STATUS_REG);
143 if ((status & (u8)OMAP_RTC_STATUS_BUSY) == 0) 172 if (!(status & OMAP_RTC_STATUS_BUSY))
144 break; 173 break;
145 udelay(1); 174 udelay(1);
146 } 175 }
147 /* now we have ~15 usec to read/write various registers */ 176 /* now we have ~15 usec to read/write various registers */
148} 177}
149 178
150static irqreturn_t rtc_irq(int irq, void *rtc) 179static irqreturn_t rtc_irq(int irq, void *dev_id)
151{ 180{
152 unsigned long events = 0; 181 struct omap_rtc *rtc = dev_id;
153 u8 irq_data; 182 unsigned long events = 0;
183 u8 irq_data;
154 184
155 irq_data = rtc_read(OMAP_RTC_STATUS_REG); 185 irq_data = rtc_read(rtc, OMAP_RTC_STATUS_REG);
156 186
157 /* alarm irq? */ 187 /* alarm irq? */
158 if (irq_data & OMAP_RTC_STATUS_ALARM) { 188 if (irq_data & OMAP_RTC_STATUS_ALARM) {
159 rtc_write(OMAP_RTC_STATUS_ALARM, OMAP_RTC_STATUS_REG); 189 rtc_write(rtc, OMAP_RTC_STATUS_REG, OMAP_RTC_STATUS_ALARM);
160 events |= RTC_IRQF | RTC_AF; 190 events |= RTC_IRQF | RTC_AF;
161 } 191 }
162 192
@@ -164,23 +194,21 @@ static irqreturn_t rtc_irq(int irq, void *rtc)
164 if (irq_data & OMAP_RTC_STATUS_1S_EVENT) 194 if (irq_data & OMAP_RTC_STATUS_1S_EVENT)
165 events |= RTC_IRQF | RTC_UF; 195 events |= RTC_IRQF | RTC_UF;
166 196
167 rtc_update_irq(rtc, 1, events); 197 rtc_update_irq(rtc->rtc, 1, events);
168 198
169 return IRQ_HANDLED; 199 return IRQ_HANDLED;
170} 200}
171 201
172static int omap_rtc_alarm_irq_enable(struct device *dev, unsigned int enabled) 202static int omap_rtc_alarm_irq_enable(struct device *dev, unsigned int enabled)
173{ 203{
204 struct omap_rtc *rtc = dev_get_drvdata(dev);
174 u8 reg, irqwake_reg = 0; 205 u8 reg, irqwake_reg = 0;
175 struct platform_device *pdev = to_platform_device(dev);
176 const struct platform_device_id *id_entry =
177 platform_get_device_id(pdev);
178 206
179 local_irq_disable(); 207 local_irq_disable();
180 rtc_wait_not_busy(); 208 rtc_wait_not_busy(rtc);
181 reg = rtc_read(OMAP_RTC_INTERRUPTS_REG); 209 reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
182 if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 210 if (rtc->type->has_irqwakeen)
183 irqwake_reg = rtc_read(OMAP_RTC_IRQWAKEEN); 211 irqwake_reg = rtc_read(rtc, OMAP_RTC_IRQWAKEEN);
184 212
185 if (enabled) { 213 if (enabled) {
186 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM; 214 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM;
@@ -189,10 +217,10 @@ static int omap_rtc_alarm_irq_enable(struct device *dev, unsigned int enabled)
189 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM; 217 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM;
190 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN; 218 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN;
191 } 219 }
192 rtc_wait_not_busy(); 220 rtc_wait_not_busy(rtc);
193 rtc_write(reg, OMAP_RTC_INTERRUPTS_REG); 221 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, reg);
194 if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 222 if (rtc->type->has_irqwakeen)
195 rtc_write(irqwake_reg, OMAP_RTC_IRQWAKEEN); 223 rtc_write(rtc, OMAP_RTC_IRQWAKEEN, irqwake_reg);
196 local_irq_enable(); 224 local_irq_enable();
197 225
198 return 0; 226 return 0;
@@ -230,39 +258,47 @@ static void bcd2tm(struct rtc_time *tm)
230 tm->tm_year = bcd2bin(tm->tm_year) + 100; 258 tm->tm_year = bcd2bin(tm->tm_year) + 100;
231} 259}
232 260
261static void omap_rtc_read_time_raw(struct omap_rtc *rtc, struct rtc_time *tm)
262{
263 tm->tm_sec = rtc_read(rtc, OMAP_RTC_SECONDS_REG);
264 tm->tm_min = rtc_read(rtc, OMAP_RTC_MINUTES_REG);
265 tm->tm_hour = rtc_read(rtc, OMAP_RTC_HOURS_REG);
266 tm->tm_mday = rtc_read(rtc, OMAP_RTC_DAYS_REG);
267 tm->tm_mon = rtc_read(rtc, OMAP_RTC_MONTHS_REG);
268 tm->tm_year = rtc_read(rtc, OMAP_RTC_YEARS_REG);
269}
233 270
234static int omap_rtc_read_time(struct device *dev, struct rtc_time *tm) 271static int omap_rtc_read_time(struct device *dev, struct rtc_time *tm)
235{ 272{
273 struct omap_rtc *rtc = dev_get_drvdata(dev);
274
236 /* we don't report wday/yday/isdst ... */ 275 /* we don't report wday/yday/isdst ... */
237 local_irq_disable(); 276 local_irq_disable();
238 rtc_wait_not_busy(); 277 rtc_wait_not_busy(rtc);
239 278 omap_rtc_read_time_raw(rtc, tm);
240 tm->tm_sec = rtc_read(OMAP_RTC_SECONDS_REG);
241 tm->tm_min = rtc_read(OMAP_RTC_MINUTES_REG);
242 tm->tm_hour = rtc_read(OMAP_RTC_HOURS_REG);
243 tm->tm_mday = rtc_read(OMAP_RTC_DAYS_REG);
244 tm->tm_mon = rtc_read(OMAP_RTC_MONTHS_REG);
245 tm->tm_year = rtc_read(OMAP_RTC_YEARS_REG);
246
247 local_irq_enable(); 279 local_irq_enable();
248 280
249 bcd2tm(tm); 281 bcd2tm(tm);
282
250 return 0; 283 return 0;
251} 284}
252 285
253static int omap_rtc_set_time(struct device *dev, struct rtc_time *tm) 286static int omap_rtc_set_time(struct device *dev, struct rtc_time *tm)
254{ 287{
288 struct omap_rtc *rtc = dev_get_drvdata(dev);
289
255 if (tm2bcd(tm) < 0) 290 if (tm2bcd(tm) < 0)
256 return -EINVAL; 291 return -EINVAL;
292
257 local_irq_disable(); 293 local_irq_disable();
258 rtc_wait_not_busy(); 294 rtc_wait_not_busy(rtc);
259 295
260 rtc_write(tm->tm_year, OMAP_RTC_YEARS_REG); 296 rtc_write(rtc, OMAP_RTC_YEARS_REG, tm->tm_year);
261 rtc_write(tm->tm_mon, OMAP_RTC_MONTHS_REG); 297 rtc_write(rtc, OMAP_RTC_MONTHS_REG, tm->tm_mon);
262 rtc_write(tm->tm_mday, OMAP_RTC_DAYS_REG); 298 rtc_write(rtc, OMAP_RTC_DAYS_REG, tm->tm_mday);
263 rtc_write(tm->tm_hour, OMAP_RTC_HOURS_REG); 299 rtc_write(rtc, OMAP_RTC_HOURS_REG, tm->tm_hour);
264 rtc_write(tm->tm_min, OMAP_RTC_MINUTES_REG); 300 rtc_write(rtc, OMAP_RTC_MINUTES_REG, tm->tm_min);
265 rtc_write(tm->tm_sec, OMAP_RTC_SECONDS_REG); 301 rtc_write(rtc, OMAP_RTC_SECONDS_REG, tm->tm_sec);
266 302
267 local_irq_enable(); 303 local_irq_enable();
268 304
@@ -271,48 +307,50 @@ static int omap_rtc_set_time(struct device *dev, struct rtc_time *tm)
271 307
272static int omap_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alm) 308static int omap_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alm)
273{ 309{
310 struct omap_rtc *rtc = dev_get_drvdata(dev);
311 u8 interrupts;
312
274 local_irq_disable(); 313 local_irq_disable();
275 rtc_wait_not_busy(); 314 rtc_wait_not_busy(rtc);
276 315
277 alm->time.tm_sec = rtc_read(OMAP_RTC_ALARM_SECONDS_REG); 316 alm->time.tm_sec = rtc_read(rtc, OMAP_RTC_ALARM_SECONDS_REG);
278 alm->time.tm_min = rtc_read(OMAP_RTC_ALARM_MINUTES_REG); 317 alm->time.tm_min = rtc_read(rtc, OMAP_RTC_ALARM_MINUTES_REG);
279 alm->time.tm_hour = rtc_read(OMAP_RTC_ALARM_HOURS_REG); 318 alm->time.tm_hour = rtc_read(rtc, OMAP_RTC_ALARM_HOURS_REG);
280 alm->time.tm_mday = rtc_read(OMAP_RTC_ALARM_DAYS_REG); 319 alm->time.tm_mday = rtc_read(rtc, OMAP_RTC_ALARM_DAYS_REG);
281 alm->time.tm_mon = rtc_read(OMAP_RTC_ALARM_MONTHS_REG); 320 alm->time.tm_mon = rtc_read(rtc, OMAP_RTC_ALARM_MONTHS_REG);
282 alm->time.tm_year = rtc_read(OMAP_RTC_ALARM_YEARS_REG); 321 alm->time.tm_year = rtc_read(rtc, OMAP_RTC_ALARM_YEARS_REG);
283 322
284 local_irq_enable(); 323 local_irq_enable();
285 324
286 bcd2tm(&alm->time); 325 bcd2tm(&alm->time);
287 alm->enabled = !!(rtc_read(OMAP_RTC_INTERRUPTS_REG) 326
288 & OMAP_RTC_INTERRUPTS_IT_ALARM); 327 interrupts = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
328 alm->enabled = !!(interrupts & OMAP_RTC_INTERRUPTS_IT_ALARM);
289 329
290 return 0; 330 return 0;
291} 331}
292 332
293static int omap_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm) 333static int omap_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm)
294{ 334{
335 struct omap_rtc *rtc = dev_get_drvdata(dev);
295 u8 reg, irqwake_reg = 0; 336 u8 reg, irqwake_reg = 0;
296 struct platform_device *pdev = to_platform_device(dev);
297 const struct platform_device_id *id_entry =
298 platform_get_device_id(pdev);
299 337
300 if (tm2bcd(&alm->time) < 0) 338 if (tm2bcd(&alm->time) < 0)
301 return -EINVAL; 339 return -EINVAL;
302 340
303 local_irq_disable(); 341 local_irq_disable();
304 rtc_wait_not_busy(); 342 rtc_wait_not_busy(rtc);
305 343
306 rtc_write(alm->time.tm_year, OMAP_RTC_ALARM_YEARS_REG); 344 rtc_write(rtc, OMAP_RTC_ALARM_YEARS_REG, alm->time.tm_year);
307 rtc_write(alm->time.tm_mon, OMAP_RTC_ALARM_MONTHS_REG); 345 rtc_write(rtc, OMAP_RTC_ALARM_MONTHS_REG, alm->time.tm_mon);
308 rtc_write(alm->time.tm_mday, OMAP_RTC_ALARM_DAYS_REG); 346 rtc_write(rtc, OMAP_RTC_ALARM_DAYS_REG, alm->time.tm_mday);
309 rtc_write(alm->time.tm_hour, OMAP_RTC_ALARM_HOURS_REG); 347 rtc_write(rtc, OMAP_RTC_ALARM_HOURS_REG, alm->time.tm_hour);
310 rtc_write(alm->time.tm_min, OMAP_RTC_ALARM_MINUTES_REG); 348 rtc_write(rtc, OMAP_RTC_ALARM_MINUTES_REG, alm->time.tm_min);
311 rtc_write(alm->time.tm_sec, OMAP_RTC_ALARM_SECONDS_REG); 349 rtc_write(rtc, OMAP_RTC_ALARM_SECONDS_REG, alm->time.tm_sec);
312 350
313 reg = rtc_read(OMAP_RTC_INTERRUPTS_REG); 351 reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
314 if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 352 if (rtc->type->has_irqwakeen)
315 irqwake_reg = rtc_read(OMAP_RTC_IRQWAKEEN); 353 irqwake_reg = rtc_read(rtc, OMAP_RTC_IRQWAKEEN);
316 354
317 if (alm->enabled) { 355 if (alm->enabled) {
318 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM; 356 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM;
@@ -321,15 +359,79 @@ static int omap_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm)
321 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM; 359 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM;
322 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN; 360 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN;
323 } 361 }
324 rtc_write(reg, OMAP_RTC_INTERRUPTS_REG); 362 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, reg);
325 if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 363 if (rtc->type->has_irqwakeen)
326 rtc_write(irqwake_reg, OMAP_RTC_IRQWAKEEN); 364 rtc_write(rtc, OMAP_RTC_IRQWAKEEN, irqwake_reg);
327 365
328 local_irq_enable(); 366 local_irq_enable();
329 367
330 return 0; 368 return 0;
331} 369}
332 370
371static struct omap_rtc *omap_rtc_power_off_rtc;
372
373/*
374 * omap_rtc_poweroff: RTC-controlled power off
375 *
376 * The RTC can be used to control an external PMIC via the pmic_power_en pin,
377 * which can be configured to transition to OFF on ALARM2 events.
378 *
379 * Notes:
380 * The two-second alarm offset is the shortest offset possible as the alarm
381 * registers must be set before the next timer update and the offset
382 * calculation is too heavy for everything to be done within a single access
383 * period (~15 us).
384 *
385 * Called with local interrupts disabled.
386 */
387static void omap_rtc_power_off(void)
388{
389 struct omap_rtc *rtc = omap_rtc_power_off_rtc;
390 struct rtc_time tm;
391 unsigned long now;
392 u32 val;
393
394 /* enable pmic_power_en control */
395 val = rtc_readl(rtc, OMAP_RTC_PMIC_REG);
396 rtc_writel(rtc, OMAP_RTC_PMIC_REG, val | OMAP_RTC_PMIC_POWER_EN_EN);
397
398 /* set alarm two seconds from now */
399 omap_rtc_read_time_raw(rtc, &tm);
400 bcd2tm(&tm);
401 rtc_tm_to_time(&tm, &now);
402 rtc_time_to_tm(now + 2, &tm);
403
404 if (tm2bcd(&tm) < 0) {
405 dev_err(&rtc->rtc->dev, "power off failed\n");
406 return;
407 }
408
409 rtc_wait_not_busy(rtc);
410
411 rtc_write(rtc, OMAP_RTC_ALARM2_SECONDS_REG, tm.tm_sec);
412 rtc_write(rtc, OMAP_RTC_ALARM2_MINUTES_REG, tm.tm_min);
413 rtc_write(rtc, OMAP_RTC_ALARM2_HOURS_REG, tm.tm_hour);
414 rtc_write(rtc, OMAP_RTC_ALARM2_DAYS_REG, tm.tm_mday);
415 rtc_write(rtc, OMAP_RTC_ALARM2_MONTHS_REG, tm.tm_mon);
416 rtc_write(rtc, OMAP_RTC_ALARM2_YEARS_REG, tm.tm_year);
417
418 /*
419 * enable ALARM2 interrupt
420 *
421 * NOTE: this fails on AM3352 if rtc_write (writeb) is used
422 */
423 val = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
424 rtc_writel(rtc, OMAP_RTC_INTERRUPTS_REG,
425 val | OMAP_RTC_INTERRUPTS_IT_ALARM2);
426
427 /*
428 * Wait for alarm to trigger (within two seconds) and external PMIC to
429 * power off the system. Add a 500 ms margin for external latencies
430 * (e.g. debounce circuits).
431 */
432 mdelay(2500);
433}
434
333static struct rtc_class_ops omap_rtc_ops = { 435static struct rtc_class_ops omap_rtc_ops = {
334 .read_time = omap_rtc_read_time, 436 .read_time = omap_rtc_read_time,
335 .set_time = omap_rtc_set_time, 437 .set_time = omap_rtc_set_time,
@@ -338,137 +440,140 @@ static struct rtc_class_ops omap_rtc_ops = {
338 .alarm_irq_enable = omap_rtc_alarm_irq_enable, 440 .alarm_irq_enable = omap_rtc_alarm_irq_enable,
339}; 441};
340 442
341static int omap_rtc_alarm; 443static const struct omap_rtc_device_type omap_rtc_default_type = {
342static int omap_rtc_timer; 444 .has_power_up_reset = true,
445};
343 446
344#define OMAP_RTC_DATA_AM3352_IDX 1 447static const struct omap_rtc_device_type omap_rtc_am3352_type = {
345#define OMAP_RTC_DATA_DA830_IDX 2 448 .has_32kclk_en = true,
449 .has_kicker = true,
450 .has_irqwakeen = true,
451 .has_pmic_mode = true,
452};
346 453
347static struct platform_device_id omap_rtc_devtype[] = { 454static const struct omap_rtc_device_type omap_rtc_da830_type = {
455 .has_kicker = true,
456};
457
458static const struct platform_device_id omap_rtc_id_table[] = {
348 { 459 {
349 .name = DRIVER_NAME, 460 .name = "omap_rtc",
350 }, 461 .driver_data = (kernel_ulong_t)&omap_rtc_default_type,
351 [OMAP_RTC_DATA_AM3352_IDX] = { 462 }, {
352 .name = "am3352-rtc", 463 .name = "am3352-rtc",
353 .driver_data = OMAP_RTC_HAS_KICKER | OMAP_RTC_HAS_IRQWAKEEN | 464 .driver_data = (kernel_ulong_t)&omap_rtc_am3352_type,
354 OMAP_RTC_HAS_32KCLK_EN, 465 }, {
355 },
356 [OMAP_RTC_DATA_DA830_IDX] = {
357 .name = "da830-rtc", 466 .name = "da830-rtc",
358 .driver_data = OMAP_RTC_HAS_KICKER, 467 .driver_data = (kernel_ulong_t)&omap_rtc_da830_type,
359 }, 468 }, {
360 {}, 469 /* sentinel */
470 }
361}; 471};
362MODULE_DEVICE_TABLE(platform, omap_rtc_devtype); 472MODULE_DEVICE_TABLE(platform, omap_rtc_id_table);
363 473
364static const struct of_device_id omap_rtc_of_match[] = { 474static const struct of_device_id omap_rtc_of_match[] = {
365 { .compatible = "ti,da830-rtc", 475 {
366 .data = &omap_rtc_devtype[OMAP_RTC_DATA_DA830_IDX], 476 .compatible = "ti,am3352-rtc",
367 }, 477 .data = &omap_rtc_am3352_type,
368 { .compatible = "ti,am3352-rtc", 478 }, {
369 .data = &omap_rtc_devtype[OMAP_RTC_DATA_AM3352_IDX], 479 .compatible = "ti,da830-rtc",
370 }, 480 .data = &omap_rtc_da830_type,
371 {}, 481 }, {
482 /* sentinel */
483 }
372}; 484};
373MODULE_DEVICE_TABLE(of, omap_rtc_of_match); 485MODULE_DEVICE_TABLE(of, omap_rtc_of_match);
374 486
375static int __init omap_rtc_probe(struct platform_device *pdev) 487static int __init omap_rtc_probe(struct platform_device *pdev)
376{ 488{
377 struct resource *res; 489 struct omap_rtc *rtc;
378 struct rtc_device *rtc; 490 struct resource *res;
379 u8 reg, new_ctrl; 491 u8 reg, mask, new_ctrl;
380 const struct platform_device_id *id_entry; 492 const struct platform_device_id *id_entry;
381 const struct of_device_id *of_id; 493 const struct of_device_id *of_id;
494 int ret;
382 495
383 of_id = of_match_device(omap_rtc_of_match, &pdev->dev); 496 rtc = devm_kzalloc(&pdev->dev, sizeof(*rtc), GFP_KERNEL);
384 if (of_id) 497 if (!rtc)
385 pdev->id_entry = of_id->data; 498 return -ENOMEM;
386 499
387 id_entry = platform_get_device_id(pdev); 500 of_id = of_match_device(omap_rtc_of_match, &pdev->dev);
388 if (!id_entry) { 501 if (of_id) {
389 dev_err(&pdev->dev, "no matching device entry\n"); 502 rtc->type = of_id->data;
390 return -ENODEV; 503 rtc->is_pmic_controller = rtc->type->has_pmic_mode &&
504 of_property_read_bool(pdev->dev.of_node,
505 "system-power-controller");
506 } else {
507 id_entry = platform_get_device_id(pdev);
508 rtc->type = (void *)id_entry->driver_data;
391 } 509 }
392 510
393 omap_rtc_timer = platform_get_irq(pdev, 0); 511 rtc->irq_timer = platform_get_irq(pdev, 0);
394 if (omap_rtc_timer <= 0) { 512 if (rtc->irq_timer <= 0)
395 pr_debug("%s: no update irq?\n", pdev->name);
396 return -ENOENT; 513 return -ENOENT;
397 }
398 514
399 omap_rtc_alarm = platform_get_irq(pdev, 1); 515 rtc->irq_alarm = platform_get_irq(pdev, 1);
400 if (omap_rtc_alarm <= 0) { 516 if (rtc->irq_alarm <= 0)
401 pr_debug("%s: no alarm irq?\n", pdev->name);
402 return -ENOENT; 517 return -ENOENT;
403 }
404 518
405 res = platform_get_resource(pdev, IORESOURCE_MEM, 0); 519 res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
406 rtc_base = devm_ioremap_resource(&pdev->dev, res); 520 rtc->base = devm_ioremap_resource(&pdev->dev, res);
407 if (IS_ERR(rtc_base)) 521 if (IS_ERR(rtc->base))
408 return PTR_ERR(rtc_base); 522 return PTR_ERR(rtc->base);
523
524 platform_set_drvdata(pdev, rtc);
409 525
410 /* Enable the clock/module so that we can access the registers */ 526 /* Enable the clock/module so that we can access the registers */
411 pm_runtime_enable(&pdev->dev); 527 pm_runtime_enable(&pdev->dev);
412 pm_runtime_get_sync(&pdev->dev); 528 pm_runtime_get_sync(&pdev->dev);
413 529
414 if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) { 530 if (rtc->type->has_kicker) {
415 rtc_writel(KICK0_VALUE, OMAP_RTC_KICK0_REG); 531 rtc_writel(rtc, OMAP_RTC_KICK0_REG, KICK0_VALUE);
416 rtc_writel(KICK1_VALUE, OMAP_RTC_KICK1_REG); 532 rtc_writel(rtc, OMAP_RTC_KICK1_REG, KICK1_VALUE);
417 }
418
419 rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
420 &omap_rtc_ops, THIS_MODULE);
421 if (IS_ERR(rtc)) {
422 pr_debug("%s: can't register RTC device, err %ld\n",
423 pdev->name, PTR_ERR(rtc));
424 goto fail0;
425 } 533 }
426 platform_set_drvdata(pdev, rtc);
427 534
428 /* clear pending irqs, and set 1/second periodic, 535 /*
429 * which we'll use instead of update irqs 536 * disable interrupts
537 *
538 * NOTE: ALARM2 is not cleared on AM3352 if rtc_write (writeb) is used
430 */ 539 */
431 rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 540 rtc_writel(rtc, OMAP_RTC_INTERRUPTS_REG, 0);
432 541
433 /* enable RTC functional clock */ 542 /* enable RTC functional clock */
434 if (id_entry->driver_data & OMAP_RTC_HAS_32KCLK_EN) 543 if (rtc->type->has_32kclk_en) {
435 rtc_writel(OMAP_RTC_OSC_32KCLK_EN, OMAP_RTC_OSC_REG); 544 reg = rtc_read(rtc, OMAP_RTC_OSC_REG);
545 rtc_writel(rtc, OMAP_RTC_OSC_REG,
546 reg | OMAP_RTC_OSC_32KCLK_EN);
547 }
436 548
437 /* clear old status */ 549 /* clear old status */
438 reg = rtc_read(OMAP_RTC_STATUS_REG); 550 reg = rtc_read(rtc, OMAP_RTC_STATUS_REG);
439 if (reg & (u8) OMAP_RTC_STATUS_POWER_UP) {
440 pr_info("%s: RTC power up reset detected\n",
441 pdev->name);
442 rtc_write(OMAP_RTC_STATUS_POWER_UP, OMAP_RTC_STATUS_REG);
443 }
444 if (reg & (u8) OMAP_RTC_STATUS_ALARM)
445 rtc_write(OMAP_RTC_STATUS_ALARM, OMAP_RTC_STATUS_REG);
446 551
447 /* handle periodic and alarm irqs */ 552 mask = OMAP_RTC_STATUS_ALARM;
448 if (devm_request_irq(&pdev->dev, omap_rtc_timer, rtc_irq, 0, 553
449 dev_name(&rtc->dev), rtc)) { 554 if (rtc->type->has_pmic_mode)
450 pr_debug("%s: RTC timer interrupt IRQ%d already claimed\n", 555 mask |= OMAP_RTC_STATUS_ALARM2;
451 pdev->name, omap_rtc_timer); 556
452 goto fail0; 557 if (rtc->type->has_power_up_reset) {
453 } 558 mask |= OMAP_RTC_STATUS_POWER_UP;
454 if ((omap_rtc_timer != omap_rtc_alarm) && 559 if (reg & OMAP_RTC_STATUS_POWER_UP)
455 (devm_request_irq(&pdev->dev, omap_rtc_alarm, rtc_irq, 0, 560 dev_info(&pdev->dev, "RTC power up reset detected\n");
456 dev_name(&rtc->dev), rtc))) {
457 pr_debug("%s: RTC alarm interrupt IRQ%d already claimed\n",
458 pdev->name, omap_rtc_alarm);
459 goto fail0;
460 } 561 }
461 562
563 if (reg & mask)
564 rtc_write(rtc, OMAP_RTC_STATUS_REG, reg & mask);
565
462 /* On boards with split power, RTC_ON_NOFF won't reset the RTC */ 566 /* On boards with split power, RTC_ON_NOFF won't reset the RTC */
463 reg = rtc_read(OMAP_RTC_CTRL_REG); 567 reg = rtc_read(rtc, OMAP_RTC_CTRL_REG);
464 if (reg & (u8) OMAP_RTC_CTRL_STOP) 568 if (reg & OMAP_RTC_CTRL_STOP)
465 pr_info("%s: already running\n", pdev->name); 569 dev_info(&pdev->dev, "already running\n");
466 570
467 /* force to 24 hour mode */ 571 /* force to 24 hour mode */
468 new_ctrl = reg & (OMAP_RTC_CTRL_SPLIT|OMAP_RTC_CTRL_AUTO_COMP); 572 new_ctrl = reg & (OMAP_RTC_CTRL_SPLIT | OMAP_RTC_CTRL_AUTO_COMP);
469 new_ctrl |= OMAP_RTC_CTRL_STOP; 573 new_ctrl |= OMAP_RTC_CTRL_STOP;
470 574
471 /* BOARD-SPECIFIC CUSTOMIZATION CAN GO HERE: 575 /*
576 * BOARD-SPECIFIC CUSTOMIZATION CAN GO HERE:
472 * 577 *
473 * - Device wake-up capability setting should come through chip 578 * - Device wake-up capability setting should come through chip
474 * init logic. OMAP1 boards should initialize the "wakeup capable" 579 * init logic. OMAP1 boards should initialize the "wakeup capable"
@@ -482,36 +587,70 @@ static int __init omap_rtc_probe(struct platform_device *pdev)
482 * is write-only, and always reads as zero...) 587 * is write-only, and always reads as zero...)
483 */ 588 */
484 589
590 if (new_ctrl & OMAP_RTC_CTRL_SPLIT)
591 dev_info(&pdev->dev, "split power mode\n");
592
593 if (reg != new_ctrl)
594 rtc_write(rtc, OMAP_RTC_CTRL_REG, new_ctrl);
595
485 device_init_wakeup(&pdev->dev, true); 596 device_init_wakeup(&pdev->dev, true);
486 597
487 if (new_ctrl & (u8) OMAP_RTC_CTRL_SPLIT) 598 rtc->rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
488 pr_info("%s: split power mode\n", pdev->name); 599 &omap_rtc_ops, THIS_MODULE);
600 if (IS_ERR(rtc->rtc)) {
601 ret = PTR_ERR(rtc->rtc);
602 goto err;
603 }
489 604
490 if (reg != new_ctrl) 605 /* handle periodic and alarm irqs */
491 rtc_write(new_ctrl, OMAP_RTC_CTRL_REG); 606 ret = devm_request_irq(&pdev->dev, rtc->irq_timer, rtc_irq, 0,
607 dev_name(&rtc->rtc->dev), rtc);
608 if (ret)
609 goto err;
610
611 if (rtc->irq_timer != rtc->irq_alarm) {
612 ret = devm_request_irq(&pdev->dev, rtc->irq_alarm, rtc_irq, 0,
613 dev_name(&rtc->rtc->dev), rtc);
614 if (ret)
615 goto err;
616 }
617
618 if (rtc->is_pmic_controller) {
619 if (!pm_power_off) {
620 omap_rtc_power_off_rtc = rtc;
621 pm_power_off = omap_rtc_power_off;
622 }
623 }
492 624
493 return 0; 625 return 0;
494 626
495fail0: 627err:
496 if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) 628 device_init_wakeup(&pdev->dev, false);
497 rtc_writel(0, OMAP_RTC_KICK0_REG); 629 if (rtc->type->has_kicker)
630 rtc_writel(rtc, OMAP_RTC_KICK0_REG, 0);
498 pm_runtime_put_sync(&pdev->dev); 631 pm_runtime_put_sync(&pdev->dev);
499 pm_runtime_disable(&pdev->dev); 632 pm_runtime_disable(&pdev->dev);
500 return -EIO; 633
634 return ret;
501} 635}
502 636
503static int __exit omap_rtc_remove(struct platform_device *pdev) 637static int __exit omap_rtc_remove(struct platform_device *pdev)
504{ 638{
505 const struct platform_device_id *id_entry = 639 struct omap_rtc *rtc = platform_get_drvdata(pdev);
506 platform_get_device_id(pdev); 640
641 if (pm_power_off == omap_rtc_power_off &&
642 omap_rtc_power_off_rtc == rtc) {
643 pm_power_off = NULL;
644 omap_rtc_power_off_rtc = NULL;
645 }
507 646
508 device_init_wakeup(&pdev->dev, 0); 647 device_init_wakeup(&pdev->dev, 0);
509 648
510 /* leave rtc running, but disable irqs */ 649 /* leave rtc running, but disable irqs */
511 rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 650 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, 0);
512 651
513 if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) 652 if (rtc->type->has_kicker)
514 rtc_writel(0, OMAP_RTC_KICK0_REG); 653 rtc_writel(rtc, OMAP_RTC_KICK0_REG, 0);
515 654
516 /* Disable the clock/module */ 655 /* Disable the clock/module */
517 pm_runtime_put_sync(&pdev->dev); 656 pm_runtime_put_sync(&pdev->dev);
@@ -521,20 +660,21 @@ static int __exit omap_rtc_remove(struct platform_device *pdev)
521} 660}
522 661
523#ifdef CONFIG_PM_SLEEP 662#ifdef CONFIG_PM_SLEEP
524static u8 irqstat;
525
526static int omap_rtc_suspend(struct device *dev) 663static int omap_rtc_suspend(struct device *dev)
527{ 664{
528 irqstat = rtc_read(OMAP_RTC_INTERRUPTS_REG); 665 struct omap_rtc *rtc = dev_get_drvdata(dev);
529 666
530 /* FIXME the RTC alarm is not currently acting as a wakeup event 667 rtc->interrupts_reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
668
669 /*
670 * FIXME: the RTC alarm is not currently acting as a wakeup event
531 * source on some platforms, and in fact this enable() call is just 671 * source on some platforms, and in fact this enable() call is just
532 * saving a flag that's never used... 672 * saving a flag that's never used...
533 */ 673 */
534 if (device_may_wakeup(dev)) 674 if (device_may_wakeup(dev))
535 enable_irq_wake(omap_rtc_alarm); 675 enable_irq_wake(rtc->irq_alarm);
536 else 676 else
537 rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 677 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, 0);
538 678
539 /* Disable the clock/module */ 679 /* Disable the clock/module */
540 pm_runtime_put_sync(dev); 680 pm_runtime_put_sync(dev);
@@ -544,13 +684,15 @@ static int omap_rtc_suspend(struct device *dev)
544 684
545static int omap_rtc_resume(struct device *dev) 685static int omap_rtc_resume(struct device *dev)
546{ 686{
687 struct omap_rtc *rtc = dev_get_drvdata(dev);
688
547 /* Enable the clock/module so that we can access the registers */ 689 /* Enable the clock/module so that we can access the registers */
548 pm_runtime_get_sync(dev); 690 pm_runtime_get_sync(dev);
549 691
550 if (device_may_wakeup(dev)) 692 if (device_may_wakeup(dev))
551 disable_irq_wake(omap_rtc_alarm); 693 disable_irq_wake(rtc->irq_alarm);
552 else 694 else
553 rtc_write(irqstat, OMAP_RTC_INTERRUPTS_REG); 695 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, rtc->interrupts_reg);
554 696
555 return 0; 697 return 0;
556} 698}
@@ -560,23 +702,32 @@ static SIMPLE_DEV_PM_OPS(omap_rtc_pm_ops, omap_rtc_suspend, omap_rtc_resume);
560 702
561static void omap_rtc_shutdown(struct platform_device *pdev) 703static void omap_rtc_shutdown(struct platform_device *pdev)
562{ 704{
563 rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 705 struct omap_rtc *rtc = platform_get_drvdata(pdev);
706 u8 mask;
707
708 /*
709 * Keep the ALARM interrupt enabled to allow the system to power up on
710 * alarm events.
711 */
712 mask = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG);
713 mask &= OMAP_RTC_INTERRUPTS_IT_ALARM;
714 rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, mask);
564} 715}
565 716
566MODULE_ALIAS("platform:omap_rtc");
567static struct platform_driver omap_rtc_driver = { 717static struct platform_driver omap_rtc_driver = {
568 .remove = __exit_p(omap_rtc_remove), 718 .remove = __exit_p(omap_rtc_remove),
569 .shutdown = omap_rtc_shutdown, 719 .shutdown = omap_rtc_shutdown,
570 .driver = { 720 .driver = {
571 .name = DRIVER_NAME, 721 .name = "omap_rtc",
572 .owner = THIS_MODULE, 722 .owner = THIS_MODULE,
573 .pm = &omap_rtc_pm_ops, 723 .pm = &omap_rtc_pm_ops,
574 .of_match_table = omap_rtc_of_match, 724 .of_match_table = omap_rtc_of_match,
575 }, 725 },
576 .id_table = omap_rtc_devtype, 726 .id_table = omap_rtc_id_table,
577}; 727};
578 728
579module_platform_driver_probe(omap_rtc_driver, omap_rtc_probe); 729module_platform_driver_probe(omap_rtc_driver, omap_rtc_probe);
580 730
731MODULE_ALIAS("platform:omap_rtc");
581MODULE_AUTHOR("George G. Davis (and others)"); 732MODULE_AUTHOR("George G. Davis (and others)");
582MODULE_LICENSE("GPL"); 733MODULE_LICENSE("GPL");
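Among other things, the rtc-omap rework lets the RTC act as a system power controller: probe installs a pm_power_off handler only when no other handler is registered, and remove releases it only if it still points at this driver. That handoff pattern, reduced to a sketch with placeholder names:

/* Sketch of the pm_power_off claim/release pattern; names are placeholders. */
#include <linux/pm.h>
#include <linux/platform_device.h>

static struct platform_device *poweroff_pdev;

static void my_power_off(void)
{
        /* program the hardware via poweroff_pdev, then wait for power loss */
}

static int my_probe(struct platform_device *pdev)
{
        if (!pm_power_off) {                    /* don't steal an existing handler */
                poweroff_pdev = pdev;
                pm_power_off = my_power_off;
        }
        return 0;
}

static int my_remove(struct platform_device *pdev)
{
        if (pm_power_off == my_power_off && poweroff_pdev == pdev) {
                pm_power_off = NULL;            /* release only if still ours */
                poweroff_pdev = NULL;
        }
        return 0;
}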
diff --git a/drivers/rtc/rtc-pcf8563.c b/drivers/rtc/rtc-pcf8563.c
index c2ef0a22ee94..96fb32e7d6f8 100644
--- a/drivers/rtc/rtc-pcf8563.c
+++ b/drivers/rtc/rtc-pcf8563.c
@@ -28,6 +28,7 @@
28#define PCF8563_REG_ST2 0x01 28#define PCF8563_REG_ST2 0x01
29#define PCF8563_BIT_AIE (1 << 1) 29#define PCF8563_BIT_AIE (1 << 1)
30#define PCF8563_BIT_AF (1 << 3) 30#define PCF8563_BIT_AF (1 << 3)
31#define PCF8563_BITS_ST2_N (7 << 5)
31 32
32#define PCF8563_REG_SC 0x02 /* datetime */ 33#define PCF8563_REG_SC 0x02 /* datetime */
33#define PCF8563_REG_MN 0x03 34#define PCF8563_REG_MN 0x03
@@ -41,6 +42,13 @@
41 42
42#define PCF8563_REG_CLKO 0x0D /* clock out */ 43#define PCF8563_REG_CLKO 0x0D /* clock out */
43#define PCF8563_REG_TMRC 0x0E /* timer control */ 44#define PCF8563_REG_TMRC 0x0E /* timer control */
45#define PCF8563_TMRC_ENABLE BIT(7)
46#define PCF8563_TMRC_4096 0
47#define PCF8563_TMRC_64 1
48#define PCF8563_TMRC_1 2
49#define PCF8563_TMRC_1_60 3
50#define PCF8563_TMRC_MASK 3
51
44#define PCF8563_REG_TMR 0x0F /* timer */ 52#define PCF8563_REG_TMR 0x0F /* timer */
45 53
46#define PCF8563_SC_LV 0x80 /* low voltage */ 54#define PCF8563_SC_LV 0x80 /* low voltage */
@@ -118,22 +126,21 @@ static int pcf8563_write_block_data(struct i2c_client *client,
118 126
119static int pcf8563_set_alarm_mode(struct i2c_client *client, bool on) 127static int pcf8563_set_alarm_mode(struct i2c_client *client, bool on)
120{ 128{
121 unsigned char buf[2]; 129 unsigned char buf;
122 int err; 130 int err;
123 131
124 err = pcf8563_read_block_data(client, PCF8563_REG_ST2, 1, buf + 1); 132 err = pcf8563_read_block_data(client, PCF8563_REG_ST2, 1, &buf);
125 if (err < 0) 133 if (err < 0)
126 return err; 134 return err;
127 135
128 if (on) 136 if (on)
129 buf[1] |= PCF8563_BIT_AIE; 137 buf |= PCF8563_BIT_AIE;
130 else 138 else
131 buf[1] &= ~PCF8563_BIT_AIE; 139 buf &= ~PCF8563_BIT_AIE;
132 140
133 buf[1] &= ~PCF8563_BIT_AF; 141 buf &= ~(PCF8563_BIT_AF | PCF8563_BITS_ST2_N);
134 buf[0] = PCF8563_REG_ST2;
135 142
136 err = pcf8563_write_block_data(client, PCF8563_REG_ST2, 1, buf + 1); 143 err = pcf8563_write_block_data(client, PCF8563_REG_ST2, 1, &buf);
137 if (err < 0) { 144 if (err < 0) {
138 dev_err(&client->dev, "%s: write error\n", __func__); 145 dev_err(&client->dev, "%s: write error\n", __func__);
139 return -EIO; 146 return -EIO;
@@ -336,8 +343,8 @@ static int pcf8563_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *tm)
336 __func__, buf[0], buf[1], buf[2], buf[3]); 343 __func__, buf[0], buf[1], buf[2], buf[3]);
337 344
338 tm->time.tm_min = bcd2bin(buf[0] & 0x7F); 345 tm->time.tm_min = bcd2bin(buf[0] & 0x7F);
339 tm->time.tm_hour = bcd2bin(buf[1] & 0x7F); 346 tm->time.tm_hour = bcd2bin(buf[1] & 0x3F);
340 tm->time.tm_mday = bcd2bin(buf[2] & 0x1F); 347 tm->time.tm_mday = bcd2bin(buf[2] & 0x3F);
341 tm->time.tm_wday = bcd2bin(buf[3] & 0x7); 348 tm->time.tm_wday = bcd2bin(buf[3] & 0x7);
342 tm->time.tm_mon = -1; 349 tm->time.tm_mon = -1;
343 tm->time.tm_year = -1; 350 tm->time.tm_year = -1;
@@ -361,6 +368,14 @@ static int pcf8563_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *tm)
361 struct i2c_client *client = to_i2c_client(dev); 368 struct i2c_client *client = to_i2c_client(dev);
362 unsigned char buf[4]; 369 unsigned char buf[4];
363 int err; 370 int err;
371 unsigned long alarm_time;
372
373 /* The alarm has no seconds, round up to nearest minute */
374 if (tm->time.tm_sec) {
375 rtc_tm_to_time(&tm->time, &alarm_time);
376 alarm_time += 60-tm->time.tm_sec;
377 rtc_time_to_tm(alarm_time, &tm->time);
378 }
364 379
365 dev_dbg(dev, "%s, min=%d hour=%d wday=%d mday=%d " 380 dev_dbg(dev, "%s, min=%d hour=%d wday=%d mday=%d "
366 "enabled=%d pending=%d\n", __func__, 381 "enabled=%d pending=%d\n", __func__,
@@ -381,6 +396,7 @@ static int pcf8563_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *tm)
381 396
382static int pcf8563_irq_enable(struct device *dev, unsigned int enabled) 397static int pcf8563_irq_enable(struct device *dev, unsigned int enabled)
383{ 398{
399 dev_dbg(dev, "%s: en=%d\n", __func__, enabled);
384 return pcf8563_set_alarm_mode(to_i2c_client(dev), !!enabled); 400 return pcf8563_set_alarm_mode(to_i2c_client(dev), !!enabled);
385} 401}
386 402
@@ -398,6 +414,8 @@ static int pcf8563_probe(struct i2c_client *client,
398{ 414{
399 struct pcf8563 *pcf8563; 415 struct pcf8563 *pcf8563;
400 int err; 416 int err;
417 unsigned char buf;
418 unsigned char alm_pending;
401 419
402 dev_dbg(&client->dev, "%s\n", __func__); 420 dev_dbg(&client->dev, "%s\n", __func__);
403 421
@@ -415,6 +433,22 @@ static int pcf8563_probe(struct i2c_client *client,
415 pcf8563->client = client; 433 pcf8563->client = client;
416 device_set_wakeup_capable(&client->dev, 1); 434 device_set_wakeup_capable(&client->dev, 1);
417 435
436 /* Set timer to lowest frequency to save power (ref Haoyu datasheet) */
437 buf = PCF8563_TMRC_1_60;
438 err = pcf8563_write_block_data(client, PCF8563_REG_TMRC, 1, &buf);
439 if (err < 0) {
440 dev_err(&client->dev, "%s: write error\n", __func__);
441 return err;
442 }
443
444 err = pcf8563_get_alarm_mode(client, NULL, &alm_pending);
445 if (err < 0) {
446 dev_err(&client->dev, "%s: read error\n", __func__);
447 return err;
448 }
449 if (alm_pending)
450 pcf8563_set_alarm_mode(client, 0);
451
418 pcf8563->rtc = devm_rtc_device_register(&client->dev, 452 pcf8563->rtc = devm_rtc_device_register(&client->dev,
419 pcf8563_driver.driver.name, 453 pcf8563_driver.driver.name,
420 &pcf8563_rtc_ops, THIS_MODULE); 454 &pcf8563_rtc_ops, THIS_MODULE);
@@ -435,6 +469,9 @@ static int pcf8563_probe(struct i2c_client *client,
435 469
436 } 470 }
437 471
472 /* the pcf8563 alarm only supports a minute accuracy */
473 pcf8563->rtc->uie_unsupported = 1;
474
438 return 0; 475 return 0;
439} 476}
440 477
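Because the PCF8563 alarm has no seconds field, the set_alarm path above rounds any requested time with a non-zero seconds value up to the next whole minute before programming the registers (and uie_unsupported is set to match). The rounding step, isolated as a sketch using the same rtc_tm_to_time()/rtc_time_to_tm() helpers the driver relies on:

#include <linux/rtc.h>

/* Round an alarm time up to the next whole minute (sketch). */
static void round_alarm_to_minute(struct rtc_time *tm)
{
        unsigned long t;

        if (!tm->tm_sec)
                return;                 /* already on a minute boundary */

        rtc_tm_to_time(tm, &t);         /* struct rtc_time -> seconds */
        t += 60 - tm->tm_sec;           /* jump to the next minute */
        rtc_time_to_tm(t, tm);          /* and back to broken-down time */
}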
diff --git a/drivers/rtc/rtc-sirfsoc.c b/drivers/rtc/rtc-sirfsoc.c
index 76e38007ba90..d2ac6688e5c7 100644
--- a/drivers/rtc/rtc-sirfsoc.c
+++ b/drivers/rtc/rtc-sirfsoc.c
@@ -47,6 +47,7 @@ struct sirfsoc_rtc_drv {
47 unsigned irq_wake; 47 unsigned irq_wake;
48 /* Overflow for every 8 years extra time */ 48 /* Overflow for every 8 years extra time */
49 u32 overflow_rtc; 49 u32 overflow_rtc;
50 spinlock_t lock;
50#ifdef CONFIG_PM 51#ifdef CONFIG_PM
51 u32 saved_counter; 52 u32 saved_counter;
52 u32 saved_overflow_rtc; 53 u32 saved_overflow_rtc;
@@ -61,7 +62,7 @@ static int sirfsoc_rtc_read_alarm(struct device *dev,
61 62
62 rtcdrv = dev_get_drvdata(dev); 63 rtcdrv = dev_get_drvdata(dev);
63 64
64 local_irq_disable(); 65 spin_lock_irq(&rtcdrv->lock);
65 66
66 rtc_count = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_CN); 67 rtc_count = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_CN);
67 68
@@ -84,7 +85,8 @@ static int sirfsoc_rtc_read_alarm(struct device *dev,
84 if (sirfsoc_rtc_iobrg_readl( 85 if (sirfsoc_rtc_iobrg_readl(
85 rtcdrv->rtc_base + RTC_STATUS) & SIRFSOC_RTC_AL0E) 86 rtcdrv->rtc_base + RTC_STATUS) & SIRFSOC_RTC_AL0E)
86 alrm->enabled = 1; 87 alrm->enabled = 1;
87 local_irq_enable(); 88
89 spin_unlock_irq(&rtcdrv->lock);
88 90
89 return 0; 91 return 0;
90} 92}
@@ -99,7 +101,7 @@ static int sirfsoc_rtc_set_alarm(struct device *dev,
99 if (alrm->enabled) { 101 if (alrm->enabled) {
100 rtc_tm_to_time(&(alrm->time), &rtc_alarm); 102 rtc_tm_to_time(&(alrm->time), &rtc_alarm);
101 103
102 local_irq_disable(); 104 spin_lock_irq(&rtcdrv->lock);
103 105
104 rtc_status_reg = sirfsoc_rtc_iobrg_readl( 106 rtc_status_reg = sirfsoc_rtc_iobrg_readl(
105 rtcdrv->rtc_base + RTC_STATUS); 107 rtcdrv->rtc_base + RTC_STATUS);
@@ -123,14 +125,15 @@ static int sirfsoc_rtc_set_alarm(struct device *dev,
123 rtc_status_reg |= SIRFSOC_RTC_AL0E; 125 rtc_status_reg |= SIRFSOC_RTC_AL0E;
124 sirfsoc_rtc_iobrg_writel( 126 sirfsoc_rtc_iobrg_writel(
125 rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS); 127 rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS);
126 local_irq_enable(); 128
129 spin_unlock_irq(&rtcdrv->lock);
127 } else { 130 } else {
128 /* 131 /*
129 * if this function was called with enabled=0 132 * if this function was called with enabled=0
130 * then it could mean that the application is 133 * then it could mean that the application is
131 * trying to cancel an ongoing alarm 134 * trying to cancel an ongoing alarm
132 */ 135 */
133 local_irq_disable(); 136 spin_lock_irq(&rtcdrv->lock);
134 137
135 rtc_status_reg = sirfsoc_rtc_iobrg_readl( 138 rtc_status_reg = sirfsoc_rtc_iobrg_readl(
136 rtcdrv->rtc_base + RTC_STATUS); 139 rtcdrv->rtc_base + RTC_STATUS);
@@ -146,7 +149,7 @@ static int sirfsoc_rtc_set_alarm(struct device *dev,
146 rtcdrv->rtc_base + RTC_STATUS); 149 rtcdrv->rtc_base + RTC_STATUS);
147 } 150 }
148 151
149 local_irq_enable(); 152 spin_unlock_irq(&rtcdrv->lock);
150 } 153 }
151 154
152 return 0; 155 return 0;
@@ -209,12 +212,38 @@ static int sirfsoc_rtc_ioctl(struct device *dev, unsigned int cmd,
209 } 212 }
210} 213}
211 214
215static int sirfsoc_rtc_alarm_irq_enable(struct device *dev,
216 unsigned int enabled)
217{
218 unsigned long rtc_status_reg = 0x0;
219 struct sirfsoc_rtc_drv *rtcdrv;
220
221 rtcdrv = dev_get_drvdata(dev);
222
223 spin_lock_irq(&rtcdrv->lock);
224
225 rtc_status_reg = sirfsoc_rtc_iobrg_readl(
226 rtcdrv->rtc_base + RTC_STATUS);
227 if (enabled)
228 rtc_status_reg |= SIRFSOC_RTC_AL0E;
229 else
230 rtc_status_reg &= ~SIRFSOC_RTC_AL0E;
231
232 sirfsoc_rtc_iobrg_writel(rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS);
233
234 spin_unlock_irq(&rtcdrv->lock);
235
236 return 0;
237
238}
239
212static const struct rtc_class_ops sirfsoc_rtc_ops = { 240static const struct rtc_class_ops sirfsoc_rtc_ops = {
213 .read_time = sirfsoc_rtc_read_time, 241 .read_time = sirfsoc_rtc_read_time,
214 .set_time = sirfsoc_rtc_set_time, 242 .set_time = sirfsoc_rtc_set_time,
215 .read_alarm = sirfsoc_rtc_read_alarm, 243 .read_alarm = sirfsoc_rtc_read_alarm,
216 .set_alarm = sirfsoc_rtc_set_alarm, 244 .set_alarm = sirfsoc_rtc_set_alarm,
217 .ioctl = sirfsoc_rtc_ioctl 245 .ioctl = sirfsoc_rtc_ioctl,
246 .alarm_irq_enable = sirfsoc_rtc_alarm_irq_enable
218}; 247};
219 248
220static irqreturn_t sirfsoc_rtc_irq_handler(int irq, void *pdata) 249static irqreturn_t sirfsoc_rtc_irq_handler(int irq, void *pdata)
@@ -223,6 +252,8 @@ static irqreturn_t sirfsoc_rtc_irq_handler(int irq, void *pdata)
223 unsigned long rtc_status_reg = 0x0; 252 unsigned long rtc_status_reg = 0x0;
224 unsigned long events = 0x0; 253 unsigned long events = 0x0;
225 254
255 spin_lock(&rtcdrv->lock);
256
226 rtc_status_reg = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_STATUS); 257 rtc_status_reg = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_STATUS);
227 /* this bit will be set ONLY if an alarm was active 258 /* this bit will be set ONLY if an alarm was active
228 * and it expired NOW 259 * and it expired NOW
@@ -240,6 +271,9 @@ static irqreturn_t sirfsoc_rtc_irq_handler(int irq, void *pdata)
240 rtc_status_reg &= ~(SIRFSOC_RTC_AL0E); 271 rtc_status_reg &= ~(SIRFSOC_RTC_AL0E);
241 } 272 }
242 sirfsoc_rtc_iobrg_writel(rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS); 273 sirfsoc_rtc_iobrg_writel(rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS);
274
275 spin_unlock(&rtcdrv->lock);
276
243 /* this should wake up any apps polling/waiting on the read 277 /* this should wake up any apps polling/waiting on the read
244 * after setting the alarm 278 * after setting the alarm
245 */ 279 */
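
Replacing local_irq_disable()/local_irq_enable() with a driver-owned spinlock is the usual fix here: disabling interrupts only affects the local CPU and does nothing to stop another CPU from entering the same RTC_STATUS read-modify-write sequence, whereas the spinlock serializes all of them. Process-context paths take the lock with spin_lock_irq(), which also masks local interrupts, so the handler, which takes the bare spin_lock(), can never deadlock against a lock holder on the same CPU. A generic sketch of the pattern; sirfsoc_dev, reg_rmw and demo_irq are illustrative names, and plain readl()/writel() stand in for the iobrg accessors:

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/spinlock.h>

    struct sirfsoc_dev {
            spinlock_t lock;                /* protects RTC_STATUS read-modify-write */
            void __iomem *base;
    };

    static void reg_rmw(struct sirfsoc_dev *d, u32 set)     /* process context */
    {
            spin_lock_irq(&d->lock);        /* serializes CPUs and masks local IRQs */
            writel(readl(d->base) | set, d->base);
            spin_unlock_irq(&d->lock);
    }

    static irqreturn_t demo_irq(int irq, void *data)        /* hard IRQ context */
    {
            struct sirfsoc_dev *d = data;

            spin_lock(&d->lock);            /* IRQs are already disabled here */
            writel(readl(d->base) & ~0x1, d->base);
            spin_unlock(&d->lock);
            return IRQ_HANDLED;
    }
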
@@ -267,6 +301,8 @@ static int sirfsoc_rtc_probe(struct platform_device *pdev)
267 if (rtcdrv == NULL) 301 if (rtcdrv == NULL)
268 return -ENOMEM; 302 return -ENOMEM;
269 303
304 spin_lock_init(&rtcdrv->lock);
305
270 err = of_property_read_u32(np, "reg", &rtcdrv->rtc_base); 306 err = of_property_read_u32(np, "reg", &rtcdrv->rtc_base);
271 if (err) { 307 if (err) {
272 dev_err(&pdev->dev, "unable to find base address of rtc node in dtb\n"); 308 dev_err(&pdev->dev, "unable to find base address of rtc node in dtb\n");
@@ -286,14 +322,6 @@ static int sirfsoc_rtc_probe(struct platform_device *pdev)
286 rtc_div = ((32768 / RTC_HZ) / 2) - 1; 322 rtc_div = ((32768 / RTC_HZ) / 2) - 1;
287 sirfsoc_rtc_iobrg_writel(rtc_div, rtcdrv->rtc_base + RTC_DIV); 323 sirfsoc_rtc_iobrg_writel(rtc_div, rtcdrv->rtc_base + RTC_DIV);
288 324
289 rtcdrv->rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
290 &sirfsoc_rtc_ops, THIS_MODULE);
291 if (IS_ERR(rtcdrv->rtc)) {
292 err = PTR_ERR(rtcdrv->rtc);
293 dev_err(&pdev->dev, "can't register RTC device\n");
294 return err;
295 }
296
297 /* 0x3 -> RTC_CLK */ 325 /* 0x3 -> RTC_CLK */
298 sirfsoc_rtc_iobrg_writel(SIRFSOC_RTC_CLK, 326 sirfsoc_rtc_iobrg_writel(SIRFSOC_RTC_CLK,
299 rtcdrv->rtc_base + RTC_CLOCK_SWITCH); 327 rtcdrv->rtc_base + RTC_CLOCK_SWITCH);
@@ -308,6 +336,14 @@ static int sirfsoc_rtc_probe(struct platform_device *pdev)
308 rtcdrv->overflow_rtc = 336 rtcdrv->overflow_rtc =
309 sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_SW_VALUE); 337 sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_SW_VALUE);
310 338
339 rtcdrv->rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
340 &sirfsoc_rtc_ops, THIS_MODULE);
341 if (IS_ERR(rtcdrv->rtc)) {
342 err = PTR_ERR(rtcdrv->rtc);
343 dev_err(&pdev->dev, "can't register RTC device\n");
344 return err;
345 }
346
311 rtcdrv->irq = platform_get_irq(pdev, 0); 347 rtcdrv->irq = platform_get_irq(pdev, 0);
312 err = devm_request_irq( 348 err = devm_request_irq(
313 &pdev->dev, 349 &pdev->dev,
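
Moving devm_rtc_device_register() after the divider, clock-switch and overflow-counter setup (and after the new spin_lock_init() earlier in probe) closes a window in which the RTC class device was already visible to the core and to user space while the hardware and the lock were still uninitialized. The safe ordering is the usual one: initialize software state, then hardware, then register, then accept interrupts. A skeletal probe illustrating that order; struct demo_rtc, demo_hw_init(), demo_rtc_ops and demo_irq() are assumed helpers, not part of the patch:

    static int demo_rtc_probe(struct platform_device *pdev)
    {
            struct demo_rtc *d;

            d = devm_kzalloc(&pdev->dev, sizeof(*d), GFP_KERNEL);
            if (!d)
                    return -ENOMEM;

            spin_lock_init(&d->lock);               /* 1. software state */
            demo_hw_init(d);                        /* 2. hardware state */

            d->rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
                                              &demo_rtc_ops, THIS_MODULE);
            if (IS_ERR(d->rtc))                     /* 3. expose to the RTC core */
                    return PTR_ERR(d->rtc);

            d->irq = platform_get_irq(pdev, 0);
            if (d->irq < 0)
                    return d->irq;
            return devm_request_irq(&pdev->dev, d->irq, demo_irq, 0,
                                    pdev->name, d); /* 4. finally accept IRQs */
    }
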
diff --git a/drivers/rtc/rtc-snvs.c b/drivers/rtc/rtc-snvs.c
index fa384fe28988..2cd8ffe5c698 100644
--- a/drivers/rtc/rtc-snvs.c
+++ b/drivers/rtc/rtc-snvs.c
@@ -17,6 +17,7 @@
17#include <linux/of_device.h> 17#include <linux/of_device.h>
18#include <linux/platform_device.h> 18#include <linux/platform_device.h>
19#include <linux/rtc.h> 19#include <linux/rtc.h>
20#include <linux/clk.h>
20 21
21/* These register offsets are relative to LP (Low Power) range */ 22/* These register offsets are relative to LP (Low Power) range */
22#define SNVS_LPCR 0x04 23#define SNVS_LPCR 0x04
@@ -39,6 +40,7 @@ struct snvs_rtc_data {
39 void __iomem *ioaddr; 40 void __iomem *ioaddr;
40 int irq; 41 int irq;
41 spinlock_t lock; 42 spinlock_t lock;
43 struct clk *clk;
42}; 44};
43 45
44static u32 rtc_read_lp_counter(void __iomem *ioaddr) 46static u32 rtc_read_lp_counter(void __iomem *ioaddr)
@@ -260,6 +262,18 @@ static int snvs_rtc_probe(struct platform_device *pdev)
260 if (data->irq < 0) 262 if (data->irq < 0)
261 return data->irq; 263 return data->irq;
262 264
265 data->clk = devm_clk_get(&pdev->dev, "snvs-rtc");
266 if (IS_ERR(data->clk)) {
267 data->clk = NULL;
268 } else {
269 ret = clk_prepare_enable(data->clk);
270 if (ret) {
271 dev_err(&pdev->dev,
272 "Could not prepare or enable the snvs clock\n");
273 return ret;
274 }
275 }
276
263 platform_set_drvdata(pdev, data); 277 platform_set_drvdata(pdev, data);
264 278
265 spin_lock_init(&data->lock); 279 spin_lock_init(&data->lock);
@@ -280,7 +294,7 @@ static int snvs_rtc_probe(struct platform_device *pdev)
280 if (ret) { 294 if (ret) {
281 dev_err(&pdev->dev, "failed to request irq %d: %d\n", 295 dev_err(&pdev->dev, "failed to request irq %d: %d\n",
282 data->irq, ret); 296 data->irq, ret);
283 return ret; 297 goto error_rtc_device_register;
284 } 298 }
285 299
286 data->rtc = devm_rtc_device_register(&pdev->dev, pdev->name, 300 data->rtc = devm_rtc_device_register(&pdev->dev, pdev->name,
@@ -288,10 +302,16 @@ static int snvs_rtc_probe(struct platform_device *pdev)
288 if (IS_ERR(data->rtc)) { 302 if (IS_ERR(data->rtc)) {
289 ret = PTR_ERR(data->rtc); 303 ret = PTR_ERR(data->rtc);
290 dev_err(&pdev->dev, "failed to register rtc: %d\n", ret); 304 dev_err(&pdev->dev, "failed to register rtc: %d\n", ret);
291 return ret; 305 goto error_rtc_device_register;
292 } 306 }
293 307
294 return 0; 308 return 0;
309
310error_rtc_device_register:
311 if (data->clk)
312 clk_disable_unprepare(data->clk);
313
314 return ret;
295} 315}
296 316
297#ifdef CONFIG_PM_SLEEP 317#ifdef CONFIG_PM_SLEEP
@@ -302,21 +322,34 @@ static int snvs_rtc_suspend(struct device *dev)
302 if (device_may_wakeup(dev)) 322 if (device_may_wakeup(dev))
303 enable_irq_wake(data->irq); 323 enable_irq_wake(data->irq);
304 324
325 if (data->clk)
326 clk_disable_unprepare(data->clk);
327
305 return 0; 328 return 0;
306} 329}
307 330
308static int snvs_rtc_resume(struct device *dev) 331static int snvs_rtc_resume(struct device *dev)
309{ 332{
310 struct snvs_rtc_data *data = dev_get_drvdata(dev); 333 struct snvs_rtc_data *data = dev_get_drvdata(dev);
334 int ret;
311 335
312 if (device_may_wakeup(dev)) 336 if (device_may_wakeup(dev))
313 disable_irq_wake(data->irq); 337 disable_irq_wake(data->irq);
314 338
339 if (data->clk) {
340 ret = clk_prepare_enable(data->clk);
341 if (ret)
342 return ret;
343 }
344
315 return 0; 345 return 0;
316} 346}
317#endif 347#endif
318 348
319static SIMPLE_DEV_PM_OPS(snvs_rtc_pm_ops, snvs_rtc_suspend, snvs_rtc_resume); 349static const struct dev_pm_ops snvs_rtc_pm_ops = {
350 .suspend_noirq = snvs_rtc_suspend,
351 .resume_noirq = snvs_rtc_resume,
352};
320 353
321static const struct of_device_id snvs_dt_ids[] = { 354static const struct of_device_id snvs_dt_ids[] = {
322 { .compatible = "fsl,sec-v4.0-mon-rtc-lp", }, 355 { .compatible = "fsl,sec-v4.0-mon-rtc-lp", },
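
The clock is treated as optional: when devm_clk_get() fails the driver simply runs without one, and otherwise every clk_prepare_enable() is mirrored by clk_disable_unprepare() on the probe error path and in suspend, with a matching re-enable in resume. Running the callbacks at the _noirq phase means the RTC interrupt is already masked by the time the clock is gated, presumably so the alarm handler cannot run against a clock-gated register block. A compressed sketch of the pairing, assuming the "snvs-rtc" clock name from the hunk above:

    data->clk = devm_clk_get(&pdev->dev, "snvs-rtc");
    if (IS_ERR(data->clk)) {
            data->clk = NULL;               /* clock is optional on this SoC */
    } else {
            ret = clk_prepare_enable(data->clk);
            if (ret)
                    return ret;             /* nothing to undo yet */
    }

    /* any later failure path, and suspend_noirq: */
    if (data->clk)
            clk_disable_unprepare(data->clk);

    /* resume_noirq: clk_prepare_enable(data->clk) again, propagating its return value */
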
diff --git a/drivers/usb/storage/debug.c b/drivers/usb/storage/debug.c
index 66a684a29938..2d81e1d8ee30 100644
--- a/drivers/usb/storage/debug.c
+++ b/drivers/usb/storage/debug.c
@@ -188,7 +188,7 @@ int usb_stor_dbg(const struct us_data *us, const char *fmt, ...)
188 188
189 va_start(args, fmt); 189 va_start(args, fmt);
190 190
191 r = dev_vprintk_emit(7, &us->pusb_dev->dev, fmt, args); 191 r = dev_vprintk_emit(LOGLEVEL_DEBUG, &us->pusb_dev->dev, fmt, args);
192 192
193 va_end(args); 193 va_end(args);
194 194
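
Replacing the bare 7 with LOGLEVEL_DEBUG keeps the call equivalent while making the severity self-documenting; the numeric LOGLEVEL_* constants mirror the syslog levels, 0 (EMERG) through 7 (DEBUG), and live in include/linux/kern_levels.h next to the KERN_* string prefixes. In context:

    #include <linux/kern_levels.h>

    /* LOGLEVEL_DEBUG evaluates to 7, matching the old literal */
    r = dev_vprintk_emit(LOGLEVEL_DEBUG, &us->pusb_dev->dev, fmt, args);
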
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index d8fc0605b9d2..3a6175fe10c0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1994,18 +1994,6 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
1994 shdr4extnum->sh_info = segs; 1994 shdr4extnum->sh_info = segs;
1995} 1995}
1996 1996
1997static size_t elf_core_vma_data_size(struct vm_area_struct *gate_vma,
1998 unsigned long mm_flags)
1999{
2000 struct vm_area_struct *vma;
2001 size_t size = 0;
2002
2003 for (vma = first_vma(current, gate_vma); vma != NULL;
2004 vma = next_vma(vma, gate_vma))
2005 size += vma_dump_size(vma, mm_flags);
2006 return size;
2007}
2008
2009/* 1997/*
2010 * Actual dumper 1998 * Actual dumper
2011 * 1999 *
@@ -2017,7 +2005,8 @@ static int elf_core_dump(struct coredump_params *cprm)
2017{ 2005{
2018 int has_dumped = 0; 2006 int has_dumped = 0;
2019 mm_segment_t fs; 2007 mm_segment_t fs;
2020 int segs; 2008 int segs, i;
2009 size_t vma_data_size = 0;
2021 struct vm_area_struct *vma, *gate_vma; 2010 struct vm_area_struct *vma, *gate_vma;
2022 struct elfhdr *elf = NULL; 2011 struct elfhdr *elf = NULL;
2023 loff_t offset = 0, dataoff; 2012 loff_t offset = 0, dataoff;
@@ -2026,6 +2015,7 @@ static int elf_core_dump(struct coredump_params *cprm)
2026 struct elf_shdr *shdr4extnum = NULL; 2015 struct elf_shdr *shdr4extnum = NULL;
2027 Elf_Half e_phnum; 2016 Elf_Half e_phnum;
2028 elf_addr_t e_shoff; 2017 elf_addr_t e_shoff;
2018 elf_addr_t *vma_filesz = NULL;
2029 2019
2030 /* 2020 /*
2031 * We no longer stop all VM operations. 2021 * We no longer stop all VM operations.
@@ -2093,7 +2083,20 @@ static int elf_core_dump(struct coredump_params *cprm)
2093 2083
2094 dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE); 2084 dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
2095 2085
2096 offset += elf_core_vma_data_size(gate_vma, cprm->mm_flags); 2086 vma_filesz = kmalloc_array(segs - 1, sizeof(*vma_filesz), GFP_KERNEL);
2087 if (!vma_filesz)
2088 goto end_coredump;
2089
2090 for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
2091 vma = next_vma(vma, gate_vma)) {
2092 unsigned long dump_size;
2093
2094 dump_size = vma_dump_size(vma, cprm->mm_flags);
2095 vma_filesz[i++] = dump_size;
2096 vma_data_size += dump_size;
2097 }
2098
2099 offset += vma_data_size;
2097 offset += elf_core_extra_data_size(); 2100 offset += elf_core_extra_data_size();
2098 e_shoff = offset; 2101 e_shoff = offset;
2099 2102
@@ -2113,7 +2116,7 @@ static int elf_core_dump(struct coredump_params *cprm)
2113 goto end_coredump; 2116 goto end_coredump;
2114 2117
2115 /* Write program headers for segments dump */ 2118 /* Write program headers for segments dump */
2116 for (vma = first_vma(current, gate_vma); vma != NULL; 2119 for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
2117 vma = next_vma(vma, gate_vma)) { 2120 vma = next_vma(vma, gate_vma)) {
2118 struct elf_phdr phdr; 2121 struct elf_phdr phdr;
2119 2122
@@ -2121,7 +2124,7 @@ static int elf_core_dump(struct coredump_params *cprm)
2121 phdr.p_offset = offset; 2124 phdr.p_offset = offset;
2122 phdr.p_vaddr = vma->vm_start; 2125 phdr.p_vaddr = vma->vm_start;
2123 phdr.p_paddr = 0; 2126 phdr.p_paddr = 0;
2124 phdr.p_filesz = vma_dump_size(vma, cprm->mm_flags); 2127 phdr.p_filesz = vma_filesz[i++];
2125 phdr.p_memsz = vma->vm_end - vma->vm_start; 2128 phdr.p_memsz = vma->vm_end - vma->vm_start;
2126 offset += phdr.p_filesz; 2129 offset += phdr.p_filesz;
2127 phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0; 2130 phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
@@ -2149,12 +2152,12 @@ static int elf_core_dump(struct coredump_params *cprm)
2149 if (!dump_skip(cprm, dataoff - cprm->written)) 2152 if (!dump_skip(cprm, dataoff - cprm->written))
2150 goto end_coredump; 2153 goto end_coredump;
2151 2154
2152 for (vma = first_vma(current, gate_vma); vma != NULL; 2155 for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
2153 vma = next_vma(vma, gate_vma)) { 2156 vma = next_vma(vma, gate_vma)) {
2154 unsigned long addr; 2157 unsigned long addr;
2155 unsigned long end; 2158 unsigned long end;
2156 2159
2157 end = vma->vm_start + vma_dump_size(vma, cprm->mm_flags); 2160 end = vma->vm_start + vma_filesz[i++];
2158 2161
2159 for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) { 2162 for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
2160 struct page *page; 2163 struct page *page;
@@ -2187,6 +2190,7 @@ end_coredump:
2187cleanup: 2190cleanup:
2188 free_note_info(&info); 2191 free_note_info(&info);
2189 kfree(shdr4extnum); 2192 kfree(shdr4extnum);
2193 kfree(vma_filesz);
2190 kfree(phdr4note); 2194 kfree(phdr4note);
2191 kfree(elf); 2195 kfree(elf);
2192out: 2196out:
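
The point of the new vma_filesz array is consistency: vma_dump_size() used to be evaluated three times per VMA (once to size the file, once per program header, once when writing the data), and if its answer changed in between, the program headers no longer described the data that followed. Computing each size exactly once and replaying it from the array removes that window, and kmalloc_array() refuses the allocation outright if the element count times the element size would overflow. A stripped-down view of the pattern, assuming an indexable list of regions purely for illustration:

    size_t total = 0;
    int i;
    unsigned long *sizes = kmalloc_array(nr_regions, sizeof(*sizes), GFP_KERNEL);

    if (!sizes)
            return -ENOMEM;

    for (i = 0; i < nr_regions; i++) {
            sizes[i] = region_dump_size(region[i]); /* evaluated exactly once */
            total += sizes[i];
    }

    /* header pass:  phdr.p_filesz = sizes[i];          */
    /* data pass:    end = region_start[i] + sizes[i];  */
    kfree(sizes);
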
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index fd8beb9657a2..70789e198dea 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -1,21 +1,14 @@
1/* 1/*
2 * binfmt_misc.c 2 * binfmt_misc.c
3 * 3 *
4 * Copyright (C) 1997 Richard Günther 4 * Copyright (C) 1997 Richard Günther
5 * 5 *
6 * binfmt_misc detects binaries via a magic or filename extension and invokes 6 * binfmt_misc detects binaries via a magic or filename extension and invokes
7 * a specified wrapper. This should obsolete binfmt_java, binfmt_em86 and 7 * a specified wrapper. See Documentation/binfmt_misc.txt for more details.
8 * binfmt_mz.
9 *
10 * 1997-04-25 first version
11 * [...]
12 * 1997-05-19 cleanup
13 * 1997-06-26 hpa: pass the real filename rather than argv[0]
14 * 1997-06-30 minor cleanup
15 * 1997-08-09 removed extension stripping, locking cleanup
16 * 2001-02-28 AV: rewritten into something that resembles C. Original didn't.
17 */ 8 */
18 9
10#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
11
19#include <linux/module.h> 12#include <linux/module.h>
20#include <linux/init.h> 13#include <linux/init.h>
21#include <linux/sched.h> 14#include <linux/sched.h>
@@ -30,8 +23,13 @@
30#include <linux/mount.h> 23#include <linux/mount.h>
31#include <linux/syscalls.h> 24#include <linux/syscalls.h>
32#include <linux/fs.h> 25#include <linux/fs.h>
26#include <linux/uaccess.h>
33 27
34#include <asm/uaccess.h> 28#ifdef DEBUG
29# define USE_DEBUG 1
30#else
31# define USE_DEBUG 0
32#endif
35 33
36enum { 34enum {
37 VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ 35 VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
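
USE_DEBUG turns the preprocessor DEBUG switch into an ordinary C constant, so the verbose hex dumps added further down can sit inside a plain if (USE_DEBUG): the compiler still parses and type-checks them in every configuration, then discards them as dead code when DEBUG is not defined, which is harder to keep clean with #ifdef around each call site. In sketch form:

    #ifdef DEBUG
    # define USE_DEBUG 1
    #else
    # define USE_DEBUG 0
    #endif

    if (USE_DEBUG)      /* constant-folded; eliminated entirely when 0 */
            print_hex_dump_bytes(KBUILD_MODNAME ": magic: ",
                                 DUMP_PREFIX_NONE, e->magic, e->size);
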
@@ -41,9 +39,9 @@ static LIST_HEAD(entries);
41static int enabled = 1; 39static int enabled = 1;
42 40
43enum {Enabled, Magic}; 41enum {Enabled, Magic};
44#define MISC_FMT_PRESERVE_ARGV0 (1<<31) 42#define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
45#define MISC_FMT_OPEN_BINARY (1<<30) 43#define MISC_FMT_OPEN_BINARY (1 << 30)
46#define MISC_FMT_CREDENTIALS (1<<29) 44#define MISC_FMT_CREDENTIALS (1 << 29)
47 45
48typedef struct { 46typedef struct {
49 struct list_head list; 47 struct list_head list;
@@ -87,20 +85,24 @@ static Node *check_file(struct linux_binprm *bprm)
87 char *p = strrchr(bprm->interp, '.'); 85 char *p = strrchr(bprm->interp, '.');
88 struct list_head *l; 86 struct list_head *l;
89 87
88 /* Walk all the registered handlers. */
90 list_for_each(l, &entries) { 89 list_for_each(l, &entries) {
91 Node *e = list_entry(l, Node, list); 90 Node *e = list_entry(l, Node, list);
92 char *s; 91 char *s;
93 int j; 92 int j;
94 93
94 /* Make sure this one is currently enabled. */
95 if (!test_bit(Enabled, &e->flags)) 95 if (!test_bit(Enabled, &e->flags))
96 continue; 96 continue;
97 97
98 /* Do matching based on extension if applicable. */
98 if (!test_bit(Magic, &e->flags)) { 99 if (!test_bit(Magic, &e->flags)) {
99 if (p && !strcmp(e->magic, p + 1)) 100 if (p && !strcmp(e->magic, p + 1))
100 return e; 101 return e;
101 continue; 102 continue;
102 } 103 }
103 104
105 /* Do matching based on magic & mask. */
104 s = bprm->buf + e->offset; 106 s = bprm->buf + e->offset;
105 if (e->mask) { 107 if (e->mask) {
106 for (j = 0; j < e->size; j++) 108 for (j = 0; j < e->size; j++)
@@ -123,7 +125,7 @@ static Node *check_file(struct linux_binprm *bprm)
123static int load_misc_binary(struct linux_binprm *bprm) 125static int load_misc_binary(struct linux_binprm *bprm)
124{ 126{
125 Node *fmt; 127 Node *fmt;
126 struct file * interp_file = NULL; 128 struct file *interp_file = NULL;
127 char iname[BINPRM_BUF_SIZE]; 129 char iname[BINPRM_BUF_SIZE];
128 const char *iname_addr = iname; 130 const char *iname_addr = iname;
129 int retval; 131 int retval;
@@ -131,7 +133,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
131 133
132 retval = -ENOEXEC; 134 retval = -ENOEXEC;
133 if (!enabled) 135 if (!enabled)
134 goto _ret; 136 goto ret;
135 137
136 /* to keep locking time low, we copy the interpreter string */ 138 /* to keep locking time low, we copy the interpreter string */
137 read_lock(&entries_lock); 139 read_lock(&entries_lock);
@@ -140,25 +142,26 @@ static int load_misc_binary(struct linux_binprm *bprm)
140 strlcpy(iname, fmt->interpreter, BINPRM_BUF_SIZE); 142 strlcpy(iname, fmt->interpreter, BINPRM_BUF_SIZE);
141 read_unlock(&entries_lock); 143 read_unlock(&entries_lock);
142 if (!fmt) 144 if (!fmt)
143 goto _ret; 145 goto ret;
144 146
145 if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) { 147 if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) {
146 retval = remove_arg_zero(bprm); 148 retval = remove_arg_zero(bprm);
147 if (retval) 149 if (retval)
148 goto _ret; 150 goto ret;
149 } 151 }
150 152
151 if (fmt->flags & MISC_FMT_OPEN_BINARY) { 153 if (fmt->flags & MISC_FMT_OPEN_BINARY) {
152 154
153 /* if the binary should be opened on behalf of the 155 /* if the binary should be opened on behalf of the
154 * interpreter than keep it open and assign descriptor 156 * interpreter than keep it open and assign descriptor
155 * to it */ 157 * to it
156 fd_binary = get_unused_fd(); 158 */
157 if (fd_binary < 0) { 159 fd_binary = get_unused_fd_flags(0);
158 retval = fd_binary; 160 if (fd_binary < 0) {
159 goto _ret; 161 retval = fd_binary;
160 } 162 goto ret;
161 fd_install(fd_binary, bprm->file); 163 }
164 fd_install(fd_binary, bprm->file);
162 165
163 /* if the binary is not readable than enforce mm->dumpable=0 166 /* if the binary is not readable than enforce mm->dumpable=0
164 regardless of the interpreter's permissions */ 167 regardless of the interpreter's permissions */
@@ -171,32 +174,32 @@ static int load_misc_binary(struct linux_binprm *bprm)
171 bprm->interp_flags |= BINPRM_FLAGS_EXECFD; 174 bprm->interp_flags |= BINPRM_FLAGS_EXECFD;
172 bprm->interp_data = fd_binary; 175 bprm->interp_data = fd_binary;
173 176
174 } else { 177 } else {
175 allow_write_access(bprm->file); 178 allow_write_access(bprm->file);
176 fput(bprm->file); 179 fput(bprm->file);
177 bprm->file = NULL; 180 bprm->file = NULL;
178 } 181 }
179 /* make argv[1] be the path to the binary */ 182 /* make argv[1] be the path to the binary */
180 retval = copy_strings_kernel (1, &bprm->interp, bprm); 183 retval = copy_strings_kernel(1, &bprm->interp, bprm);
181 if (retval < 0) 184 if (retval < 0)
182 goto _error; 185 goto error;
183 bprm->argc++; 186 bprm->argc++;
184 187
185 /* add the interp as argv[0] */ 188 /* add the interp as argv[0] */
186 retval = copy_strings_kernel (1, &iname_addr, bprm); 189 retval = copy_strings_kernel(1, &iname_addr, bprm);
187 if (retval < 0) 190 if (retval < 0)
188 goto _error; 191 goto error;
189 bprm->argc ++; 192 bprm->argc++;
190 193
191 /* Update interp in case binfmt_script needs it. */ 194 /* Update interp in case binfmt_script needs it. */
192 retval = bprm_change_interp(iname, bprm); 195 retval = bprm_change_interp(iname, bprm);
193 if (retval < 0) 196 if (retval < 0)
194 goto _error; 197 goto error;
195 198
196 interp_file = open_exec (iname); 199 interp_file = open_exec(iname);
197 retval = PTR_ERR (interp_file); 200 retval = PTR_ERR(interp_file);
198 if (IS_ERR (interp_file)) 201 if (IS_ERR(interp_file))
199 goto _error; 202 goto error;
200 203
201 bprm->file = interp_file; 204 bprm->file = interp_file;
202 if (fmt->flags & MISC_FMT_CREDENTIALS) { 205 if (fmt->flags & MISC_FMT_CREDENTIALS) {
@@ -207,23 +210,23 @@ static int load_misc_binary(struct linux_binprm *bprm)
207 memset(bprm->buf, 0, BINPRM_BUF_SIZE); 210 memset(bprm->buf, 0, BINPRM_BUF_SIZE);
208 retval = kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE); 211 retval = kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
209 } else 212 } else
210 retval = prepare_binprm (bprm); 213 retval = prepare_binprm(bprm);
211 214
212 if (retval < 0) 215 if (retval < 0)
213 goto _error; 216 goto error;
214 217
215 retval = search_binary_handler(bprm); 218 retval = search_binary_handler(bprm);
216 if (retval < 0) 219 if (retval < 0)
217 goto _error; 220 goto error;
218 221
219_ret: 222ret:
220 return retval; 223 return retval;
221_error: 224error:
222 if (fd_binary > 0) 225 if (fd_binary > 0)
223 sys_close(fd_binary); 226 sys_close(fd_binary);
224 bprm->interp_flags = 0; 227 bprm->interp_flags = 0;
225 bprm->interp_data = 0; 228 bprm->interp_data = 0;
226 goto _ret; 229 goto ret;
227} 230}
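
get_unused_fd() took no flags argument and was on its way out; get_unused_fd_flags(0) is the direct replacement. The surrounding pattern is the standard two-step descriptor publication: reserve a descriptor number, then make the struct file reachable through it with fd_install(); if a later step fails, the descriptor has to be closed again, which is why the error path above still calls sys_close(fd_binary). A minimal sketch of the two steps:

    int fd = get_unused_fd_flags(0);        /* reserve a descriptor number */

    if (fd < 0)
            return fd;                      /* e.g. -EMFILE */
    fd_install(fd, file);                   /* publish: fd now refers to file */
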
228 231
229/* Command parsers */ 232/* Command parsers */
@@ -250,36 +253,40 @@ static char *scanarg(char *s, char del)
250 return s; 253 return s;
251} 254}
252 255
253static char * check_special_flags (char * sfs, Node * e) 256static char *check_special_flags(char *sfs, Node *e)
254{ 257{
255 char * p = sfs; 258 char *p = sfs;
256 int cont = 1; 259 int cont = 1;
257 260
258 /* special flags */ 261 /* special flags */
259 while (cont) { 262 while (cont) {
260 switch (*p) { 263 switch (*p) {
261 case 'P': 264 case 'P':
262 p++; 265 pr_debug("register: flag: P (preserve argv0)\n");
263 e->flags |= MISC_FMT_PRESERVE_ARGV0; 266 p++;
264 break; 267 e->flags |= MISC_FMT_PRESERVE_ARGV0;
265 case 'O': 268 break;
266 p++; 269 case 'O':
267 e->flags |= MISC_FMT_OPEN_BINARY; 270 pr_debug("register: flag: O (open binary)\n");
268 break; 271 p++;
269 case 'C': 272 e->flags |= MISC_FMT_OPEN_BINARY;
270 p++; 273 break;
271 /* this flags also implies the 274 case 'C':
272 open-binary flag */ 275 pr_debug("register: flag: C (preserve creds)\n");
273 e->flags |= (MISC_FMT_CREDENTIALS | 276 p++;
274 MISC_FMT_OPEN_BINARY); 277 /* this flags also implies the
275 break; 278 open-binary flag */
276 default: 279 e->flags |= (MISC_FMT_CREDENTIALS |
277 cont = 0; 280 MISC_FMT_OPEN_BINARY);
281 break;
282 default:
283 cont = 0;
278 } 284 }
279 } 285 }
280 286
281 return p; 287 return p;
282} 288}
289
283/* 290/*
284 * This registers a new binary format, it recognises the syntax 291 * This registers a new binary format, it recognises the syntax
285 * ':name:type:offset:magic:mask:interpreter:flags' 292 * ':name:type:offset:magic:mask:interpreter:flags'
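
For reference, the register string parsed below is normally written to /proc/sys/fs/binfmt_misc/register from user space; the trailing flags field takes the P, O and C letters handled by check_special_flags() above. A small illustrative program (not part of this patch) installing a magic-based handler, using the classic Wine example from Documentation/binfmt_misc.txt; the interpreter path is whatever the system actually provides:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* :name:type:offset:magic:mask:interpreter:flags
             * type 'M' matches by magic; an empty offset defaults to 0,
             * an empty mask compares all magic bytes, and no flags are set. */
            static const char rule[] = ":DOSWin:M::MZ::/usr/bin/wine:";
            int fd = open("/proc/sys/fs/binfmt_misc/register", O_WRONLY);

            if (fd < 0 || write(fd, rule, strlen(rule)) != (ssize_t)strlen(rule)) {
                    perror("binfmt_misc register");
                    return 1;
            }
            close(fd);
            return 0;
    }
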
@@ -292,6 +299,8 @@ static Node *create_entry(const char __user *buffer, size_t count)
292 char *buf, *p; 299 char *buf, *p;
293 char del; 300 char del;
294 301
302 pr_debug("register: received %zu bytes\n", count);
303
295 /* some sanity checks */ 304 /* some sanity checks */
296 err = -EINVAL; 305 err = -EINVAL;
297 if ((count < 11) || (count > MAX_REGISTER_LENGTH)) 306 if ((count < 11) || (count > MAX_REGISTER_LENGTH))
@@ -299,7 +308,7 @@ static Node *create_entry(const char __user *buffer, size_t count)
299 308
300 err = -ENOMEM; 309 err = -ENOMEM;
301 memsize = sizeof(Node) + count + 8; 310 memsize = sizeof(Node) + count + 8;
302 e = kmalloc(memsize, GFP_USER); 311 e = kmalloc(memsize, GFP_KERNEL);
303 if (!e) 312 if (!e)
304 goto out; 313 goto out;
305 314
@@ -307,98 +316,175 @@ static Node *create_entry(const char __user *buffer, size_t count)
307 316
308 memset(e, 0, sizeof(Node)); 317 memset(e, 0, sizeof(Node));
309 if (copy_from_user(buf, buffer, count)) 318 if (copy_from_user(buf, buffer, count))
310 goto Efault; 319 goto efault;
311 320
312 del = *p++; /* delimeter */ 321 del = *p++; /* delimeter */
313 322
314 memset(buf+count, del, 8); 323 pr_debug("register: delim: %#x {%c}\n", del, del);
324
325 /* Pad the buffer with the delim to simplify parsing below. */
326 memset(buf + count, del, 8);
315 327
328 /* Parse the 'name' field. */
316 e->name = p; 329 e->name = p;
317 p = strchr(p, del); 330 p = strchr(p, del);
318 if (!p) 331 if (!p)
319 goto Einval; 332 goto einval;
320 *p++ = '\0'; 333 *p++ = '\0';
321 if (!e->name[0] || 334 if (!e->name[0] ||
322 !strcmp(e->name, ".") || 335 !strcmp(e->name, ".") ||
323 !strcmp(e->name, "..") || 336 !strcmp(e->name, "..") ||
324 strchr(e->name, '/')) 337 strchr(e->name, '/'))
325 goto Einval; 338 goto einval;
339
340 pr_debug("register: name: {%s}\n", e->name);
341
342 /* Parse the 'type' field. */
326 switch (*p++) { 343 switch (*p++) {
327 case 'E': e->flags = 1<<Enabled; break; 344 case 'E':
328 case 'M': e->flags = (1<<Enabled) | (1<<Magic); break; 345 pr_debug("register: type: E (extension)\n");
329 default: goto Einval; 346 e->flags = 1 << Enabled;
347 break;
348 case 'M':
349 pr_debug("register: type: M (magic)\n");
350 e->flags = (1 << Enabled) | (1 << Magic);
351 break;
352 default:
353 goto einval;
330 } 354 }
331 if (*p++ != del) 355 if (*p++ != del)
332 goto Einval; 356 goto einval;
357
333 if (test_bit(Magic, &e->flags)) { 358 if (test_bit(Magic, &e->flags)) {
334 char *s = strchr(p, del); 359 /* Handle the 'M' (magic) format. */
360 char *s;
361
362 /* Parse the 'offset' field. */
363 s = strchr(p, del);
335 if (!s) 364 if (!s)
336 goto Einval; 365 goto einval;
337 *s++ = '\0'; 366 *s++ = '\0';
338 e->offset = simple_strtoul(p, &p, 10); 367 e->offset = simple_strtoul(p, &p, 10);
339 if (*p++) 368 if (*p++)
340 goto Einval; 369 goto einval;
370 pr_debug("register: offset: %#x\n", e->offset);
371
372 /* Parse the 'magic' field. */
341 e->magic = p; 373 e->magic = p;
342 p = scanarg(p, del); 374 p = scanarg(p, del);
343 if (!p) 375 if (!p)
344 goto Einval; 376 goto einval;
345 p[-1] = '\0'; 377 p[-1] = '\0';
346 if (!e->magic[0]) 378 if (p == e->magic)
347 goto Einval; 379 goto einval;
380 if (USE_DEBUG)
381 print_hex_dump_bytes(
382 KBUILD_MODNAME ": register: magic[raw]: ",
383 DUMP_PREFIX_NONE, e->magic, p - e->magic);
384
385 /* Parse the 'mask' field. */
348 e->mask = p; 386 e->mask = p;
349 p = scanarg(p, del); 387 p = scanarg(p, del);
350 if (!p) 388 if (!p)
351 goto Einval; 389 goto einval;
352 p[-1] = '\0'; 390 p[-1] = '\0';
353 if (!e->mask[0]) 391 if (p == e->mask) {
354 e->mask = NULL; 392 e->mask = NULL;
393 pr_debug("register: mask[raw]: none\n");
394 } else if (USE_DEBUG)
395 print_hex_dump_bytes(
396 KBUILD_MODNAME ": register: mask[raw]: ",
397 DUMP_PREFIX_NONE, e->mask, p - e->mask);
398
399 /*
400 * Decode the magic & mask fields.
401 * Note: while we might have accepted embedded NUL bytes from
402 * above, the unescape helpers here will stop at the first one
403 * it encounters.
404 */
355 e->size = string_unescape_inplace(e->magic, UNESCAPE_HEX); 405 e->size = string_unescape_inplace(e->magic, UNESCAPE_HEX);
356 if (e->mask && 406 if (e->mask &&
357 string_unescape_inplace(e->mask, UNESCAPE_HEX) != e->size) 407 string_unescape_inplace(e->mask, UNESCAPE_HEX) != e->size)
358 goto Einval; 408 goto einval;
359 if (e->size + e->offset > BINPRM_BUF_SIZE) 409 if (e->size + e->offset > BINPRM_BUF_SIZE)
360 goto Einval; 410 goto einval;
411 pr_debug("register: magic/mask length: %i\n", e->size);
412 if (USE_DEBUG) {
413 print_hex_dump_bytes(
414 KBUILD_MODNAME ": register: magic[decoded]: ",
415 DUMP_PREFIX_NONE, e->magic, e->size);
416
417 if (e->mask) {
418 int i;
419 char *masked = kmalloc(e->size, GFP_KERNEL);
420
421 print_hex_dump_bytes(
422 KBUILD_MODNAME ": register: mask[decoded]: ",
423 DUMP_PREFIX_NONE, e->mask, e->size);
424
425 if (masked) {
426 for (i = 0; i < e->size; ++i)
427 masked[i] = e->magic[i] & e->mask[i];
428 print_hex_dump_bytes(
429 KBUILD_MODNAME ": register: magic[masked]: ",
430 DUMP_PREFIX_NONE, masked, e->size);
431
432 kfree(masked);
433 }
434 }
435 }
361 } else { 436 } else {
437 /* Handle the 'E' (extension) format. */
438
439 /* Skip the 'offset' field. */
362 p = strchr(p, del); 440 p = strchr(p, del);
363 if (!p) 441 if (!p)
364 goto Einval; 442 goto einval;
365 *p++ = '\0'; 443 *p++ = '\0';
444
445 /* Parse the 'magic' field. */
366 e->magic = p; 446 e->magic = p;
367 p = strchr(p, del); 447 p = strchr(p, del);
368 if (!p) 448 if (!p)
369 goto Einval; 449 goto einval;
370 *p++ = '\0'; 450 *p++ = '\0';
371 if (!e->magic[0] || strchr(e->magic, '/')) 451 if (!e->magic[0] || strchr(e->magic, '/'))
372 goto Einval; 452 goto einval;
453 pr_debug("register: extension: {%s}\n", e->magic);
454
455 /* Skip the 'mask' field. */
373 p = strchr(p, del); 456 p = strchr(p, del);
374 if (!p) 457 if (!p)
375 goto Einval; 458 goto einval;
376 *p++ = '\0'; 459 *p++ = '\0';
377 } 460 }
461
462 /* Parse the 'interpreter' field. */
378 e->interpreter = p; 463 e->interpreter = p;
379 p = strchr(p, del); 464 p = strchr(p, del);
380 if (!p) 465 if (!p)
381 goto Einval; 466 goto einval;
382 *p++ = '\0'; 467 *p++ = '\0';
383 if (!e->interpreter[0]) 468 if (!e->interpreter[0])
384 goto Einval; 469 goto einval;
385 470 pr_debug("register: interpreter: {%s}\n", e->interpreter);
386
387 p = check_special_flags (p, e);
388 471
472 /* Parse the 'flags' field. */
473 p = check_special_flags(p, e);
389 if (*p == '\n') 474 if (*p == '\n')
390 p++; 475 p++;
391 if (p != buf + count) 476 if (p != buf + count)
392 goto Einval; 477 goto einval;
478
393 return e; 479 return e;
394 480
395out: 481out:
396 return ERR_PTR(err); 482 return ERR_PTR(err);
397 483
398Efault: 484efault:
399 kfree(e); 485 kfree(e);
400 return ERR_PTR(-EFAULT); 486 return ERR_PTR(-EFAULT);
401Einval: 487einval:
402 kfree(e); 488 kfree(e);
403 return ERR_PTR(-EINVAL); 489 return ERR_PTR(-EINVAL);
404} 490}
@@ -417,7 +503,7 @@ static int parse_command(const char __user *buffer, size_t count)
417 return -EFAULT; 503 return -EFAULT;
418 if (!count) 504 if (!count)
419 return 0; 505 return 0;
420 if (s[count-1] == '\n') 506 if (s[count - 1] == '\n')
421 count--; 507 count--;
422 if (count == 1 && s[0] == '0') 508 if (count == 1 && s[0] == '0')
423 return 1; 509 return 1;
@@ -434,7 +520,7 @@ static void entry_status(Node *e, char *page)
434{ 520{
435 char *dp; 521 char *dp;
436 char *status = "disabled"; 522 char *status = "disabled";
437 const char * flags = "flags: "; 523 const char *flags = "flags: ";
438 524
439 if (test_bit(Enabled, &e->flags)) 525 if (test_bit(Enabled, &e->flags))
440 status = "enabled"; 526 status = "enabled";
@@ -448,19 +534,15 @@ static void entry_status(Node *e, char *page)
448 dp = page + strlen(page); 534 dp = page + strlen(page);
449 535
450 /* print the special flags */ 536 /* print the special flags */
451 sprintf (dp, "%s", flags); 537 sprintf(dp, "%s", flags);
452 dp += strlen (flags); 538 dp += strlen(flags);
453 if (e->flags & MISC_FMT_PRESERVE_ARGV0) { 539 if (e->flags & MISC_FMT_PRESERVE_ARGV0)
454 *dp ++ = 'P'; 540 *dp++ = 'P';
455 } 541 if (e->flags & MISC_FMT_OPEN_BINARY)
456 if (e->flags & MISC_FMT_OPEN_BINARY) { 542 *dp++ = 'O';
457 *dp ++ = 'O'; 543 if (e->flags & MISC_FMT_CREDENTIALS)
458 } 544 *dp++ = 'C';
459 if (e->flags & MISC_FMT_CREDENTIALS) { 545 *dp++ = '\n';
460 *dp ++ = 'C';
461 }
462 *dp ++ = '\n';
463
464 546
465 if (!test_bit(Magic, &e->flags)) { 547 if (!test_bit(Magic, &e->flags)) {
466 sprintf(dp, "extension .%s\n", e->magic); 548 sprintf(dp, "extension .%s\n", e->magic);
@@ -488,7 +570,7 @@ static void entry_status(Node *e, char *page)
488 570
489static struct inode *bm_get_inode(struct super_block *sb, int mode) 571static struct inode *bm_get_inode(struct super_block *sb, int mode)
490{ 572{
491 struct inode * inode = new_inode(sb); 573 struct inode *inode = new_inode(sb);
492 574
493 if (inode) { 575 if (inode) {
494 inode->i_ino = get_next_ino(); 576 inode->i_ino = get_next_ino();
@@ -528,13 +610,14 @@ static void kill_node(Node *e)
528/* /<entry> */ 610/* /<entry> */
529 611
530static ssize_t 612static ssize_t
531bm_entry_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos) 613bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
532{ 614{
533 Node *e = file_inode(file)->i_private; 615 Node *e = file_inode(file)->i_private;
534 ssize_t res; 616 ssize_t res;
535 char *page; 617 char *page;
536 618
537 if (!(page = (char*) __get_free_page(GFP_KERNEL))) 619 page = (char *) __get_free_page(GFP_KERNEL);
620 if (!page)
538 return -ENOMEM; 621 return -ENOMEM;
539 622
540 entry_status(e, page); 623 entry_status(e, page);
@@ -553,20 +636,28 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
553 int res = parse_command(buffer, count); 636 int res = parse_command(buffer, count);
554 637
555 switch (res) { 638 switch (res) {
556 case 1: clear_bit(Enabled, &e->flags); 639 case 1:
557 break; 640 /* Disable this handler. */
558 case 2: set_bit(Enabled, &e->flags); 641 clear_bit(Enabled, &e->flags);
559 break; 642 break;
560 case 3: root = dget(file->f_path.dentry->d_sb->s_root); 643 case 2:
561 mutex_lock(&root->d_inode->i_mutex); 644 /* Enable this handler. */
562 645 set_bit(Enabled, &e->flags);
563 kill_node(e); 646 break;
564 647 case 3:
565 mutex_unlock(&root->d_inode->i_mutex); 648 /* Delete this handler. */
566 dput(root); 649 root = dget(file->f_path.dentry->d_sb->s_root);
567 break; 650 mutex_lock(&root->d_inode->i_mutex);
568 default: return res; 651
652 kill_node(e);
653
654 mutex_unlock(&root->d_inode->i_mutex);
655 dput(root);
656 break;
657 default:
658 return res;
569 } 659 }
660
570 return count; 661 return count;
571} 662}
572 663
@@ -654,26 +745,36 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
654 return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); 745 return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
655} 746}
656 747
657static ssize_t bm_status_write(struct file * file, const char __user * buffer, 748static ssize_t bm_status_write(struct file *file, const char __user *buffer,
658 size_t count, loff_t *ppos) 749 size_t count, loff_t *ppos)
659{ 750{
660 int res = parse_command(buffer, count); 751 int res = parse_command(buffer, count);
661 struct dentry *root; 752 struct dentry *root;
662 753
663 switch (res) { 754 switch (res) {
664 case 1: enabled = 0; break; 755 case 1:
665 case 2: enabled = 1; break; 756 /* Disable all handlers. */
666 case 3: root = dget(file->f_path.dentry->d_sb->s_root); 757 enabled = 0;
667 mutex_lock(&root->d_inode->i_mutex); 758 break;
668 759 case 2:
669 while (!list_empty(&entries)) 760 /* Enable all handlers. */
670 kill_node(list_entry(entries.next, Node, list)); 761 enabled = 1;
671 762 break;
672 mutex_unlock(&root->d_inode->i_mutex); 763 case 3:
673 dput(root); 764 /* Delete all handlers. */
674 break; 765 root = dget(file->f_path.dentry->d_sb->s_root);
675 default: return res; 766 mutex_lock(&root->d_inode->i_mutex);
767
768 while (!list_empty(&entries))
769 kill_node(list_entry(entries.next, Node, list));
770
771 mutex_unlock(&root->d_inode->i_mutex);
772 dput(root);
773 break;
774 default:
775 return res;
676 } 776 }
777
677 return count; 778 return count;
678} 779}
679 780
@@ -690,14 +791,16 @@ static const struct super_operations s_ops = {
690 .evict_inode = bm_evict_inode, 791 .evict_inode = bm_evict_inode,
691}; 792};
692 793
693static int bm_fill_super(struct super_block * sb, void * data, int silent) 794static int bm_fill_super(struct super_block *sb, void *data, int silent)
694{ 795{
796 int err;
695 static struct tree_descr bm_files[] = { 797 static struct tree_descr bm_files[] = {
696 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO}, 798 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
697 [3] = {"register", &bm_register_operations, S_IWUSR}, 799 [3] = {"register", &bm_register_operations, S_IWUSR},
698 /* last one */ {""} 800 /* last one */ {""}
699 }; 801 };
700 int err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files); 802
803 err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
701 if (!err) 804 if (!err)
702 sb->s_op = &s_ops; 805 sb->s_op = &s_ops;
703 return err; 806 return err;
diff --git a/fs/char_dev.c b/fs/char_dev.c
index f77f7702fabe..67b2007f10fe 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -117,7 +117,6 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
117 goto out; 117 goto out;
118 } 118 }
119 major = i; 119 major = i;
120 ret = major;
121 } 120 }
122 121
123 cd->major = major; 122 cd->major = major;
diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
index 6d00c419cbae..1ea780bc6376 100644
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -38,7 +38,7 @@ static const struct cifs_sid sid_everyone = {
38 1, 1, {0, 0, 0, 0, 0, 1}, {0} }; 38 1, 1, {0, 0, 0, 0, 0, 1}, {0} };
39/* security id for Authenticated Users system group */ 39/* security id for Authenticated Users system group */
40static const struct cifs_sid sid_authusers = { 40static const struct cifs_sid sid_authusers = {
41 1, 1, {0, 0, 0, 0, 0, 5}, {__constant_cpu_to_le32(11)} }; 41 1, 1, {0, 0, 0, 0, 0, 5}, {cpu_to_le32(11)} };
42/* group users */ 42/* group users */
43static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} }; 43static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} };
44 44
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 61d00a6e398f..fa13d5e79f64 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -2477,14 +2477,14 @@ CIFSSMBPosixLock(const unsigned int xid, struct cifs_tcon *tcon,
2477 } 2477 }
2478 parm_data = (struct cifs_posix_lock *) 2478 parm_data = (struct cifs_posix_lock *)
2479 ((char *)&pSMBr->hdr.Protocol + data_offset); 2479 ((char *)&pSMBr->hdr.Protocol + data_offset);
2480 if (parm_data->lock_type == __constant_cpu_to_le16(CIFS_UNLCK)) 2480 if (parm_data->lock_type == cpu_to_le16(CIFS_UNLCK))
2481 pLockData->fl_type = F_UNLCK; 2481 pLockData->fl_type = F_UNLCK;
2482 else { 2482 else {
2483 if (parm_data->lock_type == 2483 if (parm_data->lock_type ==
2484 __constant_cpu_to_le16(CIFS_RDLCK)) 2484 cpu_to_le16(CIFS_RDLCK))
2485 pLockData->fl_type = F_RDLCK; 2485 pLockData->fl_type = F_RDLCK;
2486 else if (parm_data->lock_type == 2486 else if (parm_data->lock_type ==
2487 __constant_cpu_to_le16(CIFS_WRLCK)) 2487 cpu_to_le16(CIFS_WRLCK))
2488 pLockData->fl_type = F_WRLCK; 2488 pLockData->fl_type = F_WRLCK;
2489 2489
2490 pLockData->fl_start = le64_to_cpu(parm_data->start); 2490 pLockData->fl_start = le64_to_cpu(parm_data->start);
@@ -3276,25 +3276,25 @@ CIFSSMB_set_compression(const unsigned int xid, struct cifs_tcon *tcon,
3276 pSMB->compression_state = cpu_to_le16(COMPRESSION_FORMAT_DEFAULT); 3276 pSMB->compression_state = cpu_to_le16(COMPRESSION_FORMAT_DEFAULT);
3277 3277
3278 pSMB->TotalParameterCount = 0; 3278 pSMB->TotalParameterCount = 0;
3279 pSMB->TotalDataCount = __constant_cpu_to_le32(2); 3279 pSMB->TotalDataCount = cpu_to_le32(2);
3280 pSMB->MaxParameterCount = 0; 3280 pSMB->MaxParameterCount = 0;
3281 pSMB->MaxDataCount = 0; 3281 pSMB->MaxDataCount = 0;
3282 pSMB->MaxSetupCount = 4; 3282 pSMB->MaxSetupCount = 4;
3283 pSMB->Reserved = 0; 3283 pSMB->Reserved = 0;
3284 pSMB->ParameterOffset = 0; 3284 pSMB->ParameterOffset = 0;
3285 pSMB->DataCount = __constant_cpu_to_le32(2); 3285 pSMB->DataCount = cpu_to_le32(2);
3286 pSMB->DataOffset = 3286 pSMB->DataOffset =
3287 cpu_to_le32(offsetof(struct smb_com_transaction_compr_ioctl_req, 3287 cpu_to_le32(offsetof(struct smb_com_transaction_compr_ioctl_req,
3288 compression_state) - 4); /* 84 */ 3288 compression_state) - 4); /* 84 */
3289 pSMB->SetupCount = 4; 3289 pSMB->SetupCount = 4;
3290 pSMB->SubCommand = __constant_cpu_to_le16(NT_TRANSACT_IOCTL); 3290 pSMB->SubCommand = cpu_to_le16(NT_TRANSACT_IOCTL);
3291 pSMB->ParameterCount = 0; 3291 pSMB->ParameterCount = 0;
3292 pSMB->FunctionCode = __constant_cpu_to_le32(FSCTL_SET_COMPRESSION); 3292 pSMB->FunctionCode = cpu_to_le32(FSCTL_SET_COMPRESSION);
3293 pSMB->IsFsctl = 1; /* FSCTL */ 3293 pSMB->IsFsctl = 1; /* FSCTL */
3294 pSMB->IsRootFlag = 0; 3294 pSMB->IsRootFlag = 0;
3295 pSMB->Fid = fid; /* file handle always le */ 3295 pSMB->Fid = fid; /* file handle always le */
3296 /* 3 byte pad, followed by 2 byte compress state */ 3296 /* 3 byte pad, followed by 2 byte compress state */
3297 pSMB->ByteCount = __constant_cpu_to_le16(5); 3297 pSMB->ByteCount = cpu_to_le16(5);
3298 inc_rfc1001_len(pSMB, 5); 3298 inc_rfc1001_len(pSMB, 5);
3299 3299
3300 rc = SendReceive(xid, tcon->ses, (struct smb_hdr *) pSMB, 3300 rc = SendReceive(xid, tcon->ses, (struct smb_hdr *) pSMB,
@@ -3430,10 +3430,10 @@ static __u16 ACL_to_cifs_posix(char *parm_data, const char *pACL,
3430 cifs_acl->version = cpu_to_le16(1); 3430 cifs_acl->version = cpu_to_le16(1);
3431 if (acl_type == ACL_TYPE_ACCESS) { 3431 if (acl_type == ACL_TYPE_ACCESS) {
3432 cifs_acl->access_entry_count = cpu_to_le16(count); 3432 cifs_acl->access_entry_count = cpu_to_le16(count);
3433 cifs_acl->default_entry_count = __constant_cpu_to_le16(0xFFFF); 3433 cifs_acl->default_entry_count = cpu_to_le16(0xFFFF);
3434 } else if (acl_type == ACL_TYPE_DEFAULT) { 3434 } else if (acl_type == ACL_TYPE_DEFAULT) {
3435 cifs_acl->default_entry_count = cpu_to_le16(count); 3435 cifs_acl->default_entry_count = cpu_to_le16(count);
3436 cifs_acl->access_entry_count = __constant_cpu_to_le16(0xFFFF); 3436 cifs_acl->access_entry_count = cpu_to_le16(0xFFFF);
3437 } else { 3437 } else {
3438 cifs_dbg(FYI, "unknown ACL type %d\n", acl_type); 3438 cifs_dbg(FYI, "unknown ACL type %d\n", acl_type);
3439 return 0; 3439 return 0;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index d535e168a9d3..96b7e9b7706d 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1066,7 +1066,7 @@ cifs_push_mandatory_locks(struct cifsFileInfo *cfile)
1066 1066
1067 max_num = (max_buf - sizeof(struct smb_hdr)) / 1067 max_num = (max_buf - sizeof(struct smb_hdr)) /
1068 sizeof(LOCKING_ANDX_RANGE); 1068 sizeof(LOCKING_ANDX_RANGE);
1069 buf = kzalloc(max_num * sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1069 buf = kcalloc(max_num, sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL);
1070 if (!buf) { 1070 if (!buf) {
1071 free_xid(xid); 1071 free_xid(xid);
1072 return -ENOMEM; 1072 return -ENOMEM;
@@ -1401,7 +1401,7 @@ cifs_unlock_range(struct cifsFileInfo *cfile, struct file_lock *flock,
1401 1401
1402 max_num = (max_buf - sizeof(struct smb_hdr)) / 1402 max_num = (max_buf - sizeof(struct smb_hdr)) /
1403 sizeof(LOCKING_ANDX_RANGE); 1403 sizeof(LOCKING_ANDX_RANGE);
1404 buf = kzalloc(max_num * sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1404 buf = kcalloc(max_num, sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL);
1405 if (!buf) 1405 if (!buf)
1406 return -ENOMEM; 1406 return -ENOMEM;
1407 1407
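
Both call sites allocate max_num elements of LOCKING_ANDX_RANGE; spelling that as kcalloc() keeps the zeroing behaviour of kzalloc() but also makes the allocator reject the request if the element count times the element size would overflow, instead of quietly handing back a short buffer:

    /* same as kzalloc(max_num * sizeof(LOCKING_ANDX_RANGE), ...) minus the overflow risk */
    buf = kcalloc(max_num, sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL);
    if (!buf)
            return -ENOMEM;
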
diff --git a/fs/cifs/sess.c b/fs/cifs/sess.c
index 446cb7fb3f58..bce6fdcd5d48 100644
--- a/fs/cifs/sess.c
+++ b/fs/cifs/sess.c
@@ -46,7 +46,7 @@ static __u32 cifs_ssetup_hdr(struct cifs_ses *ses, SESSION_SETUP_ANDX *pSMB)
46 CIFSMaxBufSize + MAX_CIFS_HDR_SIZE - 4, 46 CIFSMaxBufSize + MAX_CIFS_HDR_SIZE - 4,
47 USHRT_MAX)); 47 USHRT_MAX));
48 pSMB->req.MaxMpxCount = cpu_to_le16(ses->server->maxReq); 48 pSMB->req.MaxMpxCount = cpu_to_le16(ses->server->maxReq);
49 pSMB->req.VcNumber = __constant_cpu_to_le16(1); 49 pSMB->req.VcNumber = cpu_to_le16(1);
50 50
51 /* Now no need to set SMBFLG_CASELESS or obsolete CANONICAL PATH */ 51 /* Now no need to set SMBFLG_CASELESS or obsolete CANONICAL PATH */
52 52
diff --git a/fs/cifs/smb2file.c b/fs/cifs/smb2file.c
index 45992944e238..7198eac5dddd 100644
--- a/fs/cifs/smb2file.c
+++ b/fs/cifs/smb2file.c
@@ -111,7 +111,7 @@ smb2_unlock_range(struct cifsFileInfo *cfile, struct file_lock *flock,
111 return -EINVAL; 111 return -EINVAL;
112 112
113 max_num = max_buf / sizeof(struct smb2_lock_element); 113 max_num = max_buf / sizeof(struct smb2_lock_element);
114 buf = kzalloc(max_num * sizeof(struct smb2_lock_element), GFP_KERNEL); 114 buf = kcalloc(max_num, sizeof(struct smb2_lock_element), GFP_KERNEL);
115 if (!buf) 115 if (!buf)
116 return -ENOMEM; 116 return -ENOMEM;
117 117
@@ -247,7 +247,7 @@ smb2_push_mandatory_locks(struct cifsFileInfo *cfile)
247 } 247 }
248 248
249 max_num = max_buf / sizeof(struct smb2_lock_element); 249 max_num = max_buf / sizeof(struct smb2_lock_element);
250 buf = kzalloc(max_num * sizeof(struct smb2_lock_element), GFP_KERNEL); 250 buf = kcalloc(max_num, sizeof(struct smb2_lock_element), GFP_KERNEL);
251 if (!buf) { 251 if (!buf) {
252 free_xid(xid); 252 free_xid(xid);
253 return -ENOMEM; 253 return -ENOMEM;
diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index 1a08a34838fc..f1cefc9763ed 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -67,27 +67,27 @@ check_smb2_hdr(struct smb2_hdr *hdr, __u64 mid)
67 * indexed by command in host byte order 67 * indexed by command in host byte order
68 */ 68 */
69static const __le16 smb2_rsp_struct_sizes[NUMBER_OF_SMB2_COMMANDS] = { 69static const __le16 smb2_rsp_struct_sizes[NUMBER_OF_SMB2_COMMANDS] = {
70 /* SMB2_NEGOTIATE */ __constant_cpu_to_le16(65), 70 /* SMB2_NEGOTIATE */ cpu_to_le16(65),
71 /* SMB2_SESSION_SETUP */ __constant_cpu_to_le16(9), 71 /* SMB2_SESSION_SETUP */ cpu_to_le16(9),
72 /* SMB2_LOGOFF */ __constant_cpu_to_le16(4), 72 /* SMB2_LOGOFF */ cpu_to_le16(4),
73 /* SMB2_TREE_CONNECT */ __constant_cpu_to_le16(16), 73 /* SMB2_TREE_CONNECT */ cpu_to_le16(16),
74 /* SMB2_TREE_DISCONNECT */ __constant_cpu_to_le16(4), 74 /* SMB2_TREE_DISCONNECT */ cpu_to_le16(4),
75 /* SMB2_CREATE */ __constant_cpu_to_le16(89), 75 /* SMB2_CREATE */ cpu_to_le16(89),
76 /* SMB2_CLOSE */ __constant_cpu_to_le16(60), 76 /* SMB2_CLOSE */ cpu_to_le16(60),
77 /* SMB2_FLUSH */ __constant_cpu_to_le16(4), 77 /* SMB2_FLUSH */ cpu_to_le16(4),
78 /* SMB2_READ */ __constant_cpu_to_le16(17), 78 /* SMB2_READ */ cpu_to_le16(17),
79 /* SMB2_WRITE */ __constant_cpu_to_le16(17), 79 /* SMB2_WRITE */ cpu_to_le16(17),
80 /* SMB2_LOCK */ __constant_cpu_to_le16(4), 80 /* SMB2_LOCK */ cpu_to_le16(4),
81 /* SMB2_IOCTL */ __constant_cpu_to_le16(49), 81 /* SMB2_IOCTL */ cpu_to_le16(49),
82 /* BB CHECK this ... not listed in documentation */ 82 /* BB CHECK this ... not listed in documentation */
83 /* SMB2_CANCEL */ __constant_cpu_to_le16(0), 83 /* SMB2_CANCEL */ cpu_to_le16(0),
84 /* SMB2_ECHO */ __constant_cpu_to_le16(4), 84 /* SMB2_ECHO */ cpu_to_le16(4),
85 /* SMB2_QUERY_DIRECTORY */ __constant_cpu_to_le16(9), 85 /* SMB2_QUERY_DIRECTORY */ cpu_to_le16(9),
86 /* SMB2_CHANGE_NOTIFY */ __constant_cpu_to_le16(9), 86 /* SMB2_CHANGE_NOTIFY */ cpu_to_le16(9),
87 /* SMB2_QUERY_INFO */ __constant_cpu_to_le16(9), 87 /* SMB2_QUERY_INFO */ cpu_to_le16(9),
88 /* SMB2_SET_INFO */ __constant_cpu_to_le16(2), 88 /* SMB2_SET_INFO */ cpu_to_le16(2),
89 /* BB FIXME can also be 44 for lease break */ 89 /* BB FIXME can also be 44 for lease break */
90 /* SMB2_OPLOCK_BREAK */ __constant_cpu_to_le16(24) 90 /* SMB2_OPLOCK_BREAK */ cpu_to_le16(24)
91}; 91};
92 92
93int 93int
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 568f323665c8..93fd0586f9ec 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -600,7 +600,7 @@ smb2_clone_range(const unsigned int xid,
600 goto cchunk_out; 600 goto cchunk_out;
601 601
602 /* For now array only one chunk long, will make more flexible later */ 602 /* For now array only one chunk long, will make more flexible later */
603 pcchunk->ChunkCount = __constant_cpu_to_le32(1); 603 pcchunk->ChunkCount = cpu_to_le32(1);
604 pcchunk->Reserved = 0; 604 pcchunk->Reserved = 0;
605 pcchunk->Reserved2 = 0; 605 pcchunk->Reserved2 = 0;
606 606
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 0ca7f6364754..3417340bf89e 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -1358,7 +1358,7 @@ SMB2_set_compression(const unsigned int xid, struct cifs_tcon *tcon,
1358 char *ret_data = NULL; 1358 char *ret_data = NULL;
1359 1359
1360 fsctl_input.CompressionState = 1360 fsctl_input.CompressionState =
1361 __constant_cpu_to_le16(COMPRESSION_FORMAT_DEFAULT); 1361 cpu_to_le16(COMPRESSION_FORMAT_DEFAULT);
1362 1362
1363 rc = SMB2_ioctl(xid, tcon, persistent_fid, volatile_fid, 1363 rc = SMB2_ioctl(xid, tcon, persistent_fid, volatile_fid,
1364 FSCTL_SET_COMPRESSION, true /* is_fsctl */, 1364 FSCTL_SET_COMPRESSION, true /* is_fsctl */,
diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h
index d84f46c5b2c5..ce858477002a 100644
--- a/fs/cifs/smb2pdu.h
+++ b/fs/cifs/smb2pdu.h
@@ -85,7 +85,7 @@
85/* BB FIXME - analyze following length BB */ 85/* BB FIXME - analyze following length BB */
86#define MAX_SMB2_HDR_SIZE 0x78 /* 4 len + 64 hdr + (2*24 wct) + 2 bct + 2 pad */ 86#define MAX_SMB2_HDR_SIZE 0x78 /* 4 len + 64 hdr + (2*24 wct) + 2 bct + 2 pad */
87 87
88#define SMB2_PROTO_NUMBER __constant_cpu_to_le32(0x424d53fe) 88#define SMB2_PROTO_NUMBER cpu_to_le32(0x424d53fe)
89 89
90/* 90/*
91 * SMB2 Header Definition 91 * SMB2 Header Definition
@@ -96,7 +96,7 @@
96 * 96 *
97 */ 97 */
98 98
99#define SMB2_HEADER_STRUCTURE_SIZE __constant_cpu_to_le16(64) 99#define SMB2_HEADER_STRUCTURE_SIZE cpu_to_le16(64)
100 100
101struct smb2_hdr { 101struct smb2_hdr {
102 __be32 smb2_buf_length; /* big endian on wire */ 102 __be32 smb2_buf_length; /* big endian on wire */
@@ -137,16 +137,16 @@ struct smb2_transform_hdr {
137} __packed; 137} __packed;
138 138
139/* Encryption Algorithms */ 139/* Encryption Algorithms */
140#define SMB2_ENCRYPTION_AES128_CCM __constant_cpu_to_le16(0x0001) 140#define SMB2_ENCRYPTION_AES128_CCM cpu_to_le16(0x0001)
141 141
142/* 142/*
143 * SMB2 flag definitions 143 * SMB2 flag definitions
144 */ 144 */
145#define SMB2_FLAGS_SERVER_TO_REDIR __constant_cpu_to_le32(0x00000001) 145#define SMB2_FLAGS_SERVER_TO_REDIR cpu_to_le32(0x00000001)
146#define SMB2_FLAGS_ASYNC_COMMAND __constant_cpu_to_le32(0x00000002) 146#define SMB2_FLAGS_ASYNC_COMMAND cpu_to_le32(0x00000002)
147#define SMB2_FLAGS_RELATED_OPERATIONS __constant_cpu_to_le32(0x00000004) 147#define SMB2_FLAGS_RELATED_OPERATIONS cpu_to_le32(0x00000004)
148#define SMB2_FLAGS_SIGNED __constant_cpu_to_le32(0x00000008) 148#define SMB2_FLAGS_SIGNED cpu_to_le32(0x00000008)
149#define SMB2_FLAGS_DFS_OPERATIONS __constant_cpu_to_le32(0x10000000) 149#define SMB2_FLAGS_DFS_OPERATIONS cpu_to_le32(0x10000000)
150 150
151/* 151/*
152 * Definitions for SMB2 Protocol Data Units (network frames) 152 * Definitions for SMB2 Protocol Data Units (network frames)
@@ -157,7 +157,7 @@ struct smb2_transform_hdr {
157 * 157 *
158 */ 158 */
159 159
160#define SMB2_ERROR_STRUCTURE_SIZE2 __constant_cpu_to_le16(9) 160#define SMB2_ERROR_STRUCTURE_SIZE2 cpu_to_le16(9)
161 161
162struct smb2_err_rsp { 162struct smb2_err_rsp {
163 struct smb2_hdr hdr; 163 struct smb2_hdr hdr;
@@ -502,12 +502,12 @@ struct create_context {
502#define SMB2_LEASE_HANDLE_CACHING_HE 0x02 502#define SMB2_LEASE_HANDLE_CACHING_HE 0x02
503#define SMB2_LEASE_WRITE_CACHING_HE 0x04 503#define SMB2_LEASE_WRITE_CACHING_HE 0x04
504 504
505#define SMB2_LEASE_NONE __constant_cpu_to_le32(0x00) 505#define SMB2_LEASE_NONE cpu_to_le32(0x00)
506#define SMB2_LEASE_READ_CACHING __constant_cpu_to_le32(0x01) 506#define SMB2_LEASE_READ_CACHING cpu_to_le32(0x01)
507#define SMB2_LEASE_HANDLE_CACHING __constant_cpu_to_le32(0x02) 507#define SMB2_LEASE_HANDLE_CACHING cpu_to_le32(0x02)
508#define SMB2_LEASE_WRITE_CACHING __constant_cpu_to_le32(0x04) 508#define SMB2_LEASE_WRITE_CACHING cpu_to_le32(0x04)
509 509
510#define SMB2_LEASE_FLAG_BREAK_IN_PROGRESS __constant_cpu_to_le32(0x02) 510#define SMB2_LEASE_FLAG_BREAK_IN_PROGRESS cpu_to_le32(0x02)
511 511
512#define SMB2_LEASE_KEY_SIZE 16 512#define SMB2_LEASE_KEY_SIZE 16
513 513
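
The __constant_cpu_to_le16/32 spellings date from when the plain macros could not be used where a compile-time constant was required; cpu_to_le16() and friends now detect constant arguments themselves (via __builtin_constant_p) and fold to a constant expression, so the plain form works in static initializers and runtime code alike, and the sparse __le16/__le32 annotations are unchanged. For instance:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    #define SMB2_PROTO_NUMBER       cpu_to_le32(0x424d53fe) /* still a constant expression */

    static const __le16 rsp_size = cpu_to_le16(65);         /* usable in a static initializer */
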
diff --git a/fs/file.c b/fs/file.c
index ab3eb6a88239..ee738ea028fa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -869,7 +869,7 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes)
869 struct file *file = fget_raw(fildes); 869 struct file *file = fget_raw(fildes);
870 870
871 if (file) { 871 if (file) {
872 ret = get_unused_fd(); 872 ret = get_unused_fd_flags(0);
873 if (ret >= 0) 873 if (ret >= 0)
874 fd_install(ret, file); 874 fd_install(ret, file);
875 else 875 else
diff --git a/fs/hfs/catalog.c b/fs/hfs/catalog.c
index ff0316b925a5..db458ee3a546 100644
--- a/fs/hfs/catalog.c
+++ b/fs/hfs/catalog.c
@@ -162,14 +162,16 @@ err2:
162 */ 162 */
163int hfs_cat_keycmp(const btree_key *key1, const btree_key *key2) 163int hfs_cat_keycmp(const btree_key *key1, const btree_key *key2)
164{ 164{
165 int retval; 165 __be32 k1p, k2p;
166 166
167 retval = be32_to_cpu(key1->cat.ParID) - be32_to_cpu(key2->cat.ParID); 167 k1p = key1->cat.ParID;
168 if (!retval) 168 k2p = key2->cat.ParID;
169 retval = hfs_strcmp(key1->cat.CName.name, key1->cat.CName.len,
170 key2->cat.CName.name, key2->cat.CName.len);
171 169
172 return retval; 170 if (k1p != k2p)
171 return be32_to_cpu(k1p) < be32_to_cpu(k2p) ? -1 : 1;
172
173 return hfs_strcmp(key1->cat.CName.name, key1->cat.CName.len,
174 key2->cat.CName.name, key2->cat.CName.len);
173} 175}
174 176
175/* Try to get a catalog entry for given catalog id */ 177/* Try to get a catalog entry for given catalog id */
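
The old keycmp subtracted two 32-bit parent IDs and returned the difference as an int, which truncates and can wrap: for IDs that differ by more than INT_MAX the sign of the result, and therefore the B-tree ordering, comes out wrong. The rewrite does an explicit three-way comparison instead. A worked example of the failure, with illustrative values:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /*
     * ParID(key1) = 0x00000001, ParID(key2) = 0x80000002:
     *   0x00000001 - 0x80000002 == 0x7fffffff (mod 2^32), positive as an int,
     * so the subtraction claims key1 > key2 although key1 < key2.
     */
    static int cat_parid_cmp(__be32 a, __be32 b)
    {
            if (a == b)
                    return 0;
            return be32_to_cpu(a) < be32_to_cpu(b) ? -1 : 1;
    }
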
diff --git a/fs/ncpfs/ioctl.c b/fs/ncpfs/ioctl.c
index d5659d96ee7f..cf7e043a9447 100644
--- a/fs/ncpfs/ioctl.c
+++ b/fs/ncpfs/ioctl.c
@@ -447,7 +447,6 @@ static long __ncp_ioctl(struct inode *inode, unsigned int cmd, unsigned long arg
447 result = -EIO; 447 result = -EIO;
448 } 448 }
449 } 449 }
450 result = 0;
451 } 450 }
452 mutex_unlock(&server->root_setup_lock); 451 mutex_unlock(&server->root_setup_lock);
453 452
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index e9e3325f29f3..3a03e0aea1fb 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -39,21 +39,15 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
39 */ 39 */
40 struct the_nilfs *nilfs; 40 struct the_nilfs *nilfs;
41 struct inode *inode = file->f_mapping->host; 41 struct inode *inode = file->f_mapping->host;
42 int err; 42 int err = 0;
43
44 err = filemap_write_and_wait_range(inode->i_mapping, start, end);
45 if (err)
46 return err;
47 mutex_lock(&inode->i_mutex);
48 43
49 if (nilfs_inode_dirty(inode)) { 44 if (nilfs_inode_dirty(inode)) {
50 if (datasync) 45 if (datasync)
51 err = nilfs_construct_dsync_segment(inode->i_sb, inode, 46 err = nilfs_construct_dsync_segment(inode->i_sb, inode,
52 0, LLONG_MAX); 47 start, end);
53 else 48 else
54 err = nilfs_construct_segment(inode->i_sb); 49 err = nilfs_construct_segment(inode->i_sb);
55 } 50 }
56 mutex_unlock(&inode->i_mutex);
57 51
58 nilfs = inode->i_sb->s_fs_info; 52 nilfs = inode->i_sb->s_fs_info;
59 if (!err) 53 if (!err)
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index e1fa69b341b9..8b5969538f39 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -49,6 +49,8 @@ struct nilfs_iget_args {
49 int for_gc; 49 int for_gc;
50}; 50};
51 51
52static int nilfs_iget_test(struct inode *inode, void *opaque);
53
52void nilfs_inode_add_blocks(struct inode *inode, int n) 54void nilfs_inode_add_blocks(struct inode *inode, int n)
53{ 55{
54 struct nilfs_root *root = NILFS_I(inode)->i_root; 56 struct nilfs_root *root = NILFS_I(inode)->i_root;
@@ -348,6 +350,17 @@ const struct address_space_operations nilfs_aops = {
348 .is_partially_uptodate = block_is_partially_uptodate, 350 .is_partially_uptodate = block_is_partially_uptodate,
349}; 351};
350 352
353static int nilfs_insert_inode_locked(struct inode *inode,
354 struct nilfs_root *root,
355 unsigned long ino)
356{
357 struct nilfs_iget_args args = {
358 .ino = ino, .root = root, .cno = 0, .for_gc = 0
359 };
360
361 return insert_inode_locked4(inode, ino, nilfs_iget_test, &args);
362}
363
351struct inode *nilfs_new_inode(struct inode *dir, umode_t mode) 364struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
352{ 365{
353 struct super_block *sb = dir->i_sb; 366 struct super_block *sb = dir->i_sb;
@@ -383,7 +396,7 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
383 if (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)) { 396 if (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)) {
384 err = nilfs_bmap_read(ii->i_bmap, NULL); 397 err = nilfs_bmap_read(ii->i_bmap, NULL);
385 if (err < 0) 398 if (err < 0)
386 goto failed_bmap; 399 goto failed_after_creation;
387 400
388 set_bit(NILFS_I_BMAP, &ii->i_state); 401 set_bit(NILFS_I_BMAP, &ii->i_state);
389 /* No lock is needed; iget() ensures it. */ 402 /* No lock is needed; iget() ensures it. */
@@ -399,21 +412,24 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
399 spin_lock(&nilfs->ns_next_gen_lock); 412 spin_lock(&nilfs->ns_next_gen_lock);
400 inode->i_generation = nilfs->ns_next_generation++; 413 inode->i_generation = nilfs->ns_next_generation++;
401 spin_unlock(&nilfs->ns_next_gen_lock); 414 spin_unlock(&nilfs->ns_next_gen_lock);
402 insert_inode_hash(inode); 415 if (nilfs_insert_inode_locked(inode, root, ino) < 0) {
416 err = -EIO;
417 goto failed_after_creation;
418 }
403 419
404 err = nilfs_init_acl(inode, dir); 420 err = nilfs_init_acl(inode, dir);
405 if (unlikely(err)) 421 if (unlikely(err))
406 goto failed_acl; /* never occur. When supporting 422 goto failed_after_creation; /* never occur. When supporting
407 nilfs_init_acl(), proper cancellation of 423 nilfs_init_acl(), proper cancellation of
408 above jobs should be considered */ 424 above jobs should be considered */
409 425
410 return inode; 426 return inode;
411 427
412 failed_acl: 428 failed_after_creation:
413 failed_bmap:
414 clear_nlink(inode); 429 clear_nlink(inode);
430 unlock_new_inode(inode);
415 iput(inode); /* raw_inode will be deleted through 431 iput(inode); /* raw_inode will be deleted through
416 generic_delete_inode() */ 432 nilfs_evict_inode() */
417 goto failed; 433 goto failed;
418 434
419 failed_ifile_create_inode: 435 failed_ifile_create_inode:
@@ -461,8 +477,8 @@ int nilfs_read_inode_common(struct inode *inode,
461 inode->i_atime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec); 477 inode->i_atime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
462 inode->i_ctime.tv_nsec = le32_to_cpu(raw_inode->i_ctime_nsec); 478 inode->i_ctime.tv_nsec = le32_to_cpu(raw_inode->i_ctime_nsec);
463 inode->i_mtime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec); 479 inode->i_mtime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
464 if (inode->i_nlink == 0 && inode->i_mode == 0) 480 if (inode->i_nlink == 0)
465 return -EINVAL; /* this inode is deleted */ 481 return -ESTALE; /* this inode is deleted */
466 482
467 inode->i_blocks = le64_to_cpu(raw_inode->i_blocks); 483 inode->i_blocks = le64_to_cpu(raw_inode->i_blocks);
468 ii->i_flags = le32_to_cpu(raw_inode->i_flags); 484 ii->i_flags = le32_to_cpu(raw_inode->i_flags);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 9de78f08989e..0f84b257932c 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -51,9 +51,11 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
51 int err = nilfs_add_link(dentry, inode); 51 int err = nilfs_add_link(dentry, inode);
52 if (!err) { 52 if (!err) {
53 d_instantiate(dentry, inode); 53 d_instantiate(dentry, inode);
54 unlock_new_inode(inode);
54 return 0; 55 return 0;
55 } 56 }
56 inode_dec_link_count(inode); 57 inode_dec_link_count(inode);
58 unlock_new_inode(inode);
57 iput(inode); 59 iput(inode);
58 return err; 60 return err;
59} 61}
@@ -182,6 +184,7 @@ out:
182out_fail: 184out_fail:
183 drop_nlink(inode); 185 drop_nlink(inode);
184 nilfs_mark_inode_dirty(inode); 186 nilfs_mark_inode_dirty(inode);
187 unlock_new_inode(inode);
185 iput(inode); 188 iput(inode);
186 goto out; 189 goto out;
187} 190}
@@ -201,11 +204,15 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
201 inode_inc_link_count(inode); 204 inode_inc_link_count(inode);
202 ihold(inode); 205 ihold(inode);
203 206
204 err = nilfs_add_nondir(dentry, inode); 207 err = nilfs_add_link(dentry, inode);
205 if (!err) 208 if (!err) {
209 d_instantiate(dentry, inode);
206 err = nilfs_transaction_commit(dir->i_sb); 210 err = nilfs_transaction_commit(dir->i_sb);
207 else 211 } else {
212 inode_dec_link_count(inode);
213 iput(inode);
208 nilfs_transaction_abort(dir->i_sb); 214 nilfs_transaction_abort(dir->i_sb);
215 }
209 216
210 return err; 217 return err;
211} 218}
@@ -243,6 +250,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
243 250
244 nilfs_mark_inode_dirty(inode); 251 nilfs_mark_inode_dirty(inode);
245 d_instantiate(dentry, inode); 252 d_instantiate(dentry, inode);
253 unlock_new_inode(inode);
246out: 254out:
247 if (!err) 255 if (!err)
248 err = nilfs_transaction_commit(dir->i_sb); 256 err = nilfs_transaction_commit(dir->i_sb);
@@ -255,6 +263,7 @@ out_fail:
255 drop_nlink(inode); 263 drop_nlink(inode);
256 drop_nlink(inode); 264 drop_nlink(inode);
257 nilfs_mark_inode_dirty(inode); 265 nilfs_mark_inode_dirty(inode);
266 unlock_new_inode(inode);
258 iput(inode); 267 iput(inode);
259out_dir: 268out_dir:
260 drop_nlink(dir); 269 drop_nlink(dir);
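
Note: the nilfs2 hunks above switch inode creation to the insert-locked pattern: the new inode is hashed with insert_inode_locked4() and stays in the I_NEW state until the caller publishes it, so every path (success or error) must end with unlock_new_inode(). A schematic kernel-style sketch of the create path this implies (illustrative only; the demo_* helpers are hypothetical, the VFS calls are the real ones used by the patch):

    static int demo_create(struct inode *dir, struct dentry *dentry, umode_t mode)
    {
            struct inode *inode;
            int err;

            inode = demo_new_inode(dir, mode);      /* returns a hashed, locked I_NEW inode */
            if (IS_ERR(inode))
                    return PTR_ERR(inode);

            err = demo_add_link(dentry, inode);     /* directory entry for the inode */
            if (err) {
                    inode_dec_link_count(inode);
                    unlock_new_inode(inode);        /* unlock before the final iput() */
                    iput(inode);
                    return err;
            }

            d_instantiate(dentry, inode);           /* publish the dentry first ...   */
            unlock_new_inode(inode);                /* ... then let lookups proceed */
            return 0;
    }
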
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 9da25fe9ea61..69bd801afb53 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -808,8 +808,7 @@ void nilfs_put_root(struct nilfs_root *root)
808 spin_lock(&nilfs->ns_cptree_lock); 808 spin_lock(&nilfs->ns_cptree_lock);
809 rb_erase(&root->rb_node, &nilfs->ns_cptree); 809 rb_erase(&root->rb_node, &nilfs->ns_cptree);
810 spin_unlock(&nilfs->ns_cptree_lock); 810 spin_unlock(&nilfs->ns_cptree_lock);
811 if (root->ifile) 811 iput(root->ifile);
812 iput(root->ifile);
813 812
814 kfree(root); 813 kfree(root);
815 } 814 }
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1ef547e49373..d9f222987f24 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1251,7 +1251,7 @@ static int ocfs2_write_cluster(struct address_space *mapping,
1251 ret = ocfs2_extent_map_get_blocks(inode, v_blkno, &p_blkno, NULL, 1251 ret = ocfs2_extent_map_get_blocks(inode, v_blkno, &p_blkno, NULL,
1252 NULL); 1252 NULL);
1253 if (ret < 0) { 1253 if (ret < 0) {
1254 ocfs2_error(inode->i_sb, "Corrupting extend for inode %llu, " 1254 mlog(ML_ERROR, "Get physical blkno failed for inode %llu, "
1255 "at logical block %llu", 1255 "at logical block %llu",
1256 (unsigned long long)OCFS2_I(inode)->ip_blkno, 1256 (unsigned long long)OCFS2_I(inode)->ip_blkno,
1257 (unsigned long long)v_blkno); 1257 (unsigned long long)v_blkno);
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index eb9d48746ab4..16eff45727ee 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1127,10 +1127,10 @@ static int o2hb_thread(void *data)
1127 elapsed_msec = o2hb_elapsed_msecs(&before_hb, &after_hb); 1127 elapsed_msec = o2hb_elapsed_msecs(&before_hb, &after_hb);
1128 1128
1129 mlog(ML_HEARTBEAT, 1129 mlog(ML_HEARTBEAT,
1130 "start = %lu.%lu, end = %lu.%lu, msec = %u\n", 1130 "start = %lu.%lu, end = %lu.%lu, msec = %u, ret = %d\n",
1131 before_hb.tv_sec, (unsigned long) before_hb.tv_usec, 1131 before_hb.tv_sec, (unsigned long) before_hb.tv_usec,
1132 after_hb.tv_sec, (unsigned long) after_hb.tv_usec, 1132 after_hb.tv_sec, (unsigned long) after_hb.tv_usec,
1133 elapsed_msec); 1133 elapsed_msec, ret);
1134 1134
1135 if (!kthread_should_stop() && 1135 if (!kthread_should_stop() &&
1136 elapsed_msec < reg->hr_timeout_ms) { 1136 elapsed_msec < reg->hr_timeout_ms) {
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index a96044004064..2e355e0f8335 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1736,7 +1736,7 @@ static void o2net_connect_expired(struct work_struct *work)
1736 o2net_idle_timeout() / 1000, 1736 o2net_idle_timeout() / 1000,
1737 o2net_idle_timeout() % 1000); 1737 o2net_idle_timeout() % 1000);
1738 1738
1739 o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); 1739 o2net_set_nn_state(nn, NULL, 0, 0);
1740 } 1740 }
1741 spin_unlock(&nn->nn_lock); 1741 spin_unlock(&nn->nn_lock);
1742} 1742}
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index c43d9b4a1ec0..79d56dc981bc 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -744,7 +744,7 @@ restart:
744 if (ocfs2_read_dir_block(dir, block, &bh, 0)) { 744 if (ocfs2_read_dir_block(dir, block, &bh, 0)) {
745 /* read error, skip block & hope for the best. 745 /* read error, skip block & hope for the best.
746 * ocfs2_read_dir_block() has released the bh. */ 746 * ocfs2_read_dir_block() has released the bh. */
747 ocfs2_error(dir->i_sb, "reading directory %llu, " 747 mlog(ML_ERROR, "reading directory %llu, "
748 "offset %lu\n", 748 "offset %lu\n",
749 (unsigned long long)OCFS2_I(dir)->ip_blkno, 749 (unsigned long long)OCFS2_I(dir)->ip_blkno,
750 block); 750 block);
diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 02d315fef432..50a59d2337b2 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -877,7 +877,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data,
877 * to be put in someone's domain map. 877 * to be put in someone's domain map.
878 * Also, explicitly disallow joining at certain troublesome 878 * Also, explicitly disallow joining at certain troublesome
879 * times (ie. during recovery). */ 879 * times (ie. during recovery). */
880 if (dlm && dlm->dlm_state != DLM_CTXT_LEAVING) { 880 if (dlm->dlm_state != DLM_CTXT_LEAVING) {
881 int bit = query->node_idx; 881 int bit = query->node_idx;
882 spin_lock(&dlm->spinlock); 882 spin_lock(&dlm->spinlock);
883 883
diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 215e41abf101..3689b3592042 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -1460,6 +1460,18 @@ way_up_top:
1460 1460
1461 /* take care of the easy cases up front */ 1461 /* take care of the easy cases up front */
1462 spin_lock(&res->spinlock); 1462 spin_lock(&res->spinlock);
1463
1464 /*
1465 * Right after dlm spinlock was released, dlm_thread could have
1466 * purged the lockres. Check if lockres got unhashed. If so
1467 * start over.
1468 */
1469 if (hlist_unhashed(&res->hash_node)) {
1470 spin_unlock(&res->spinlock);
1471 dlm_lockres_put(res);
1472 goto way_up_top;
1473 }
1474
1463 if (res->state & (DLM_LOCK_RES_RECOVERING| 1475 if (res->state & (DLM_LOCK_RES_RECOVERING|
1464 DLM_LOCK_RES_MIGRATING)) { 1476 DLM_LOCK_RES_MIGRATING)) {
1465 spin_unlock(&res->spinlock); 1477 spin_unlock(&res->spinlock);
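
Note: the dlmmaster.c change above is the revalidate-after-relock pattern: the lockres was found under dlm->spinlock, that lock was dropped, and by the time res->spinlock is taken the resource may already have been purged, so the handler checks hlist_unhashed() and restarts rather than operating on a dead object. A condensed kernel-style sketch of that shape (only hlist_unhashed() and the spinlock calls are real; the demo_* names are illustrative):

    static struct demo_res *demo_find_live(struct demo_table *table, const char *name)
    {
            struct demo_res *res;

    retry:
            spin_lock(&table->lock);
            res = demo_lookup(table, name);         /* takes a reference on success */
            spin_unlock(&table->lock);
            if (!res)
                    return NULL;

            spin_lock(&res->lock);
            if (hlist_unhashed(&res->hash_node)) {  /* purged between the two locks */
                    spin_unlock(&res->lock);
                    demo_put(res);                  /* drop the now-stale reference */
                    goto retry;
            }
            return res;                             /* returned locked and referenced */
    }
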
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 3365839d2971..79b5af5e6a7b 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1656,14 +1656,18 @@ int dlm_do_master_requery(struct dlm_ctxt *dlm, struct dlm_lock_resource *res,
1656 req.namelen = res->lockname.len; 1656 req.namelen = res->lockname.len;
1657 memcpy(req.name, res->lockname.name, res->lockname.len); 1657 memcpy(req.name, res->lockname.name, res->lockname.len);
1658 1658
1659resend:
1659 ret = o2net_send_message(DLM_MASTER_REQUERY_MSG, dlm->key, 1660 ret = o2net_send_message(DLM_MASTER_REQUERY_MSG, dlm->key,
1660 &req, sizeof(req), nodenum, &status); 1661 &req, sizeof(req), nodenum, &status);
1661 /* XXX: negative status not handled properly here. */
1662 if (ret < 0) 1662 if (ret < 0)
1663 mlog(ML_ERROR, "Error %d when sending message %u (key " 1663 mlog(ML_ERROR, "Error %d when sending message %u (key "
1664 "0x%x) to node %u\n", ret, DLM_MASTER_REQUERY_MSG, 1664 "0x%x) to node %u\n", ret, DLM_MASTER_REQUERY_MSG,
1665 dlm->key, nodenum); 1665 dlm->key, nodenum);
1666 else { 1666 else if (status == -ENOMEM) {
1667 mlog_errno(status);
1668 msleep(50);
1669 goto resend;
1670 } else {
1667 BUG_ON(status < 0); 1671 BUG_ON(status < 0);
1668 BUG_ON(status > DLM_LOCK_RES_OWNER_UNKNOWN); 1672 BUG_ON(status > DLM_LOCK_RES_OWNER_UNKNOWN);
1669 *real_master = (u8) (status & 0xff); 1673 *real_master = (u8) (status & 0xff);
@@ -1705,9 +1709,13 @@ int dlm_master_requery_handler(struct o2net_msg *msg, u32 len, void *data,
1705 int ret = dlm_dispatch_assert_master(dlm, res, 1709 int ret = dlm_dispatch_assert_master(dlm, res,
1706 0, 0, flags); 1710 0, 0, flags);
1707 if (ret < 0) { 1711 if (ret < 0) {
1708 mlog_errno(-ENOMEM); 1712 mlog_errno(ret);
1709 /* retry!? */ 1713 spin_unlock(&res->spinlock);
1710 BUG(); 1714 dlm_lockres_put(res);
1715 spin_unlock(&dlm->spinlock);
1716 dlm_put(dlm);
1717 /* sender will take care of this and retry */
1718 return ret;
1711 } else 1719 } else
1712 __dlm_lockres_grab_inflight_worker(dlm, res); 1720 __dlm_lockres_grab_inflight_worker(dlm, res);
1713 spin_unlock(&res->spinlock); 1721 spin_unlock(&res->spinlock);
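
Note: dlm_do_master_requery() now resends the network message when the remote node answered -ENOMEM instead of tripping the BUG_ON(status < 0). A user-space sketch of the same back-off-and-retry idea (bounded here for illustration, unlike the unbounded resend in the patch; demo_send() is hypothetical):

    #include <errno.h>
    #include <time.h>

    static int demo_send(int node, const void *req);   /* hypothetical transport call */

    static int send_with_retry(int node, const void *req, int max_tries)
    {
            struct timespec delay = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };
            int status;

            for (int i = 0; i < max_tries; i++) {
                    status = demo_send(node, req);
                    if (status != -ENOMEM)
                            return status;          /* success or a hard error */
                    nanosleep(&delay, NULL);        /* give the peer time to recover */
            }
            return -ENOMEM;
    }
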
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 37297c14f9a3..1c423af04c69 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -861,8 +861,13 @@ static inline void ocfs2_generic_handle_convert_action(struct ocfs2_lock_res *lo
861 * We set the OCFS2_LOCK_UPCONVERT_FINISHING flag before clearing 861 * We set the OCFS2_LOCK_UPCONVERT_FINISHING flag before clearing
862 * the OCFS2_LOCK_BUSY flag to prevent the dc thread from 862 * the OCFS2_LOCK_BUSY flag to prevent the dc thread from
863 * downconverting the lock before the upconvert has fully completed. 863 * downconverting the lock before the upconvert has fully completed.
864 * Do not prevent the dc thread from downconverting if NONBLOCK lock
865 * had already returned.
864 */ 866 */
865 lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING); 867 if (!(lockres->l_flags & OCFS2_LOCK_NONBLOCK_FINISHED))
868 lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
869 else
870 lockres_clear_flags(lockres, OCFS2_LOCK_NONBLOCK_FINISHED);
866 871
867 lockres_clear_flags(lockres, OCFS2_LOCK_BUSY); 872 lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
868} 873}
@@ -1324,13 +1329,12 @@ static void lockres_add_mask_waiter(struct ocfs2_lock_res *lockres,
1324 1329
1325/* returns 0 if the mw that was removed was already satisfied, -EBUSY 1330/* returns 0 if the mw that was removed was already satisfied, -EBUSY
1326 * if the mask still hadn't reached its goal */ 1331 * if the mask still hadn't reached its goal */
1327static int lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres, 1332static int __lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1328 struct ocfs2_mask_waiter *mw) 1333 struct ocfs2_mask_waiter *mw)
1329{ 1334{
1330 unsigned long flags;
1331 int ret = 0; 1335 int ret = 0;
1332 1336
1333 spin_lock_irqsave(&lockres->l_lock, flags); 1337 assert_spin_locked(&lockres->l_lock);
1334 if (!list_empty(&mw->mw_item)) { 1338 if (!list_empty(&mw->mw_item)) {
1335 if ((lockres->l_flags & mw->mw_mask) != mw->mw_goal) 1339 if ((lockres->l_flags & mw->mw_mask) != mw->mw_goal)
1336 ret = -EBUSY; 1340 ret = -EBUSY;
@@ -1338,6 +1342,18 @@ static int lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1338 list_del_init(&mw->mw_item); 1342 list_del_init(&mw->mw_item);
1339 init_completion(&mw->mw_complete); 1343 init_completion(&mw->mw_complete);
1340 } 1344 }
1345
1346 return ret;
1347}
1348
1349static int lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1350 struct ocfs2_mask_waiter *mw)
1351{
1352 unsigned long flags;
1353 int ret = 0;
1354
1355 spin_lock_irqsave(&lockres->l_lock, flags);
1356 ret = __lockres_remove_mask_waiter(lockres, mw);
1341 spin_unlock_irqrestore(&lockres->l_lock, flags); 1357 spin_unlock_irqrestore(&lockres->l_lock, flags);
1342 1358
1343 return ret; 1359 return ret;
@@ -1373,6 +1389,7 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb,
1373 unsigned long flags; 1389 unsigned long flags;
1374 unsigned int gen; 1390 unsigned int gen;
1375 int noqueue_attempted = 0; 1391 int noqueue_attempted = 0;
1392 int dlm_locked = 0;
1376 1393
1377 ocfs2_init_mask_waiter(&mw); 1394 ocfs2_init_mask_waiter(&mw);
1378 1395
@@ -1481,6 +1498,7 @@ again:
1481 ocfs2_recover_from_dlm_error(lockres, 1); 1498 ocfs2_recover_from_dlm_error(lockres, 1);
1482 goto out; 1499 goto out;
1483 } 1500 }
1501 dlm_locked = 1;
1484 1502
1485 mlog(0, "lock %s, successful return from ocfs2_dlm_lock\n", 1503 mlog(0, "lock %s, successful return from ocfs2_dlm_lock\n",
1486 lockres->l_name); 1504 lockres->l_name);
@@ -1514,10 +1532,17 @@ out:
1514 if (wait && arg_flags & OCFS2_LOCK_NONBLOCK && 1532 if (wait && arg_flags & OCFS2_LOCK_NONBLOCK &&
1515 mw.mw_mask & (OCFS2_LOCK_BUSY|OCFS2_LOCK_BLOCKED)) { 1533 mw.mw_mask & (OCFS2_LOCK_BUSY|OCFS2_LOCK_BLOCKED)) {
1516 wait = 0; 1534 wait = 0;
1517 if (lockres_remove_mask_waiter(lockres, &mw)) 1535 spin_lock_irqsave(&lockres->l_lock, flags);
1536 if (__lockres_remove_mask_waiter(lockres, &mw)) {
1537 if (dlm_locked)
1538 lockres_or_flags(lockres,
1539 OCFS2_LOCK_NONBLOCK_FINISHED);
1540 spin_unlock_irqrestore(&lockres->l_lock, flags);
1518 ret = -EAGAIN; 1541 ret = -EAGAIN;
1519 else 1542 } else {
1543 spin_unlock_irqrestore(&lockres->l_lock, flags);
1520 goto again; 1544 goto again;
1545 }
1521 } 1546 }
1522 if (wait) { 1547 if (wait) {
1523 ret = ocfs2_wait_for_mask(&mw); 1548 ret = ocfs2_wait_for_mask(&mw);
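
Note: the dlmglue.c refactor splits lockres_remove_mask_waiter() into a lock-taking wrapper and a __-prefixed helper that asserts the spinlock is already held (assert_spin_locked()), so __ocfs2_cluster_lock() can set OCFS2_LOCK_NONBLOCK_FINISHED inside the same critical section. A small user-space rendering of that convention using pthreads (demo types only):

    #include <pthread.h>

    struct counter {
            pthread_mutex_t lock;
            int value;
    };

    /* Caller must hold c->lock -- the userspace stand-in for assert_spin_locked(). */
    static int __counter_bump(struct counter *c)
    {
            c->value++;
            return c->value;
    }

    static int counter_bump(struct counter *c)
    {
            int v;

            pthread_mutex_lock(&c->lock);
            v = __counter_bump(c);
            pthread_mutex_unlock(&c->lock);
            return v;
    }

    /* A caller that must combine the bump with other work under one critical
     * section uses the __ variant directly, mirroring the patch. */
    static int counter_bump_and_flag(struct counter *c, int *flag)
    {
            int v;

            pthread_mutex_lock(&c->lock);
            v = __counter_bump(c);
            *flag = 1;                      /* done atomically with the bump */
            pthread_mutex_unlock(&c->lock);
            return v;
    }
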
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 324dc93ac896..69fb9f75b082 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2381,9 +2381,7 @@ out_dio:
2381 if (ret < 0) 2381 if (ret < 0)
2382 written = ret; 2382 written = ret;
2383 2383
2384 if (!ret && ((old_size != i_size_read(inode)) || 2384 if (!ret) {
2385 (old_clusters != OCFS2_I(inode)->ip_clusters) ||
2386 has_refcount)) {
2387 ret = jbd2_journal_force_commit(osb->journal->j_journal); 2385 ret = jbd2_journal_force_commit(osb->journal->j_journal);
2388 if (ret < 0) 2386 if (ret < 0)
2389 written = ret; 2387 written = ret;
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 437de7f768c6..c8b25de9efbb 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -540,8 +540,7 @@ bail:
540 if (status < 0) 540 if (status < 0)
541 make_bad_inode(inode); 541 make_bad_inode(inode);
542 542
543 if (args && bh) 543 brelse(bh);
544 brelse(bh);
545 544
546 return status; 545 return status;
547} 546}
diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 74caffeeee1d..56a768d06aa6 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -904,9 +904,6 @@ static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
904 struct buffer_head *di_bh = NULL; 904 struct buffer_head *di_bh = NULL;
905 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 905 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
906 906
907 if (!inode)
908 return -ENOENT;
909
910 if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb)) 907 if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
911 return -EROFS; 908 return -EROFS;
912 909
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index bbec539230fd..7d6b7d090452 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -144,6 +144,12 @@ enum ocfs2_unlock_action {
144 * before the upconvert 144 * before the upconvert
145 * has completed */ 145 * has completed */
146 146
147#define OCFS2_LOCK_NONBLOCK_FINISHED (0x00001000) /* NONBLOCK cluster
148 * lock has already
149 * returned, do not block
150 * dc thread from
151 * downconverting */
152
147struct ocfs2_lock_res_ops; 153struct ocfs2_lock_res_ops;
148 154
149typedef void (*ocfs2_lock_callback)(int status, unsigned long data); 155typedef void (*ocfs2_lock_callback)(int status, unsigned long data);
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index a88b2a4fcc85..d5493e361a38 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -306,7 +306,7 @@ int ocfs2_slot_to_node_num_locked(struct ocfs2_super *osb, int slot_num,
306 assert_spin_locked(&osb->osb_lock); 306 assert_spin_locked(&osb->osb_lock);
307 307
308 BUG_ON(slot_num < 0); 308 BUG_ON(slot_num < 0);
309 BUG_ON(slot_num > osb->max_slots); 309 BUG_ON(slot_num >= osb->max_slots);
310 310
311 if (!si->si_slots[slot_num].sl_valid) 311 if (!si->si_slots[slot_num].sl_valid)
312 return -ENOENT; 312 return -ENOENT;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0945814ddb7b..83723179e1ec 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1629,8 +1629,9 @@ static int __init ocfs2_init(void)
1629 1629
1630 ocfs2_debugfs_root = debugfs_create_dir("ocfs2", NULL); 1630 ocfs2_debugfs_root = debugfs_create_dir("ocfs2", NULL);
1631 if (!ocfs2_debugfs_root) { 1631 if (!ocfs2_debugfs_root) {
1632 status = -EFAULT; 1632 status = -ENOMEM;
1633 mlog(ML_ERROR, "Unable to create ocfs2 debugfs root.\n"); 1633 mlog(ML_ERROR, "Unable to create ocfs2 debugfs root.\n");
1634 goto out4;
1634 } 1635 }
1635 1636
1636 ocfs2_set_locking_protocol(); 1637 ocfs2_set_locking_protocol();
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 016f01df3825..662f8dee149f 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -1284,7 +1284,7 @@ int ocfs2_xattr_get_nolock(struct inode *inode,
1284 return -EOPNOTSUPP; 1284 return -EOPNOTSUPP;
1285 1285
1286 if (!(oi->ip_dyn_features & OCFS2_HAS_XATTR_FL)) 1286 if (!(oi->ip_dyn_features & OCFS2_HAS_XATTR_FL))
1287 ret = -ENODATA; 1287 return -ENODATA;
1288 1288
1289 xis.inode_bh = xbs.inode_bh = di_bh; 1289 xis.inode_bh = xbs.inode_bh = di_bh;
1290 di = (struct ocfs2_dinode *)di_bh->b_data; 1290 di = (struct ocfs2_dinode *)di_bh->b_data;
diff --git a/fs/proc/array.c b/fs/proc/array.c
index cd3653e4f35c..bd117d065b82 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -157,20 +157,29 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
157 struct user_namespace *user_ns = seq_user_ns(m); 157 struct user_namespace *user_ns = seq_user_ns(m);
158 struct group_info *group_info; 158 struct group_info *group_info;
159 int g; 159 int g;
160 struct fdtable *fdt = NULL; 160 struct task_struct *tracer;
161 const struct cred *cred; 161 const struct cred *cred;
162 pid_t ppid, tpid; 162 pid_t ppid, tpid = 0, tgid, ngid;
163 unsigned int max_fds = 0;
163 164
164 rcu_read_lock(); 165 rcu_read_lock();
165 ppid = pid_alive(p) ? 166 ppid = pid_alive(p) ?
166 task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0; 167 task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
167 tpid = 0; 168
168 if (pid_alive(p)) { 169 tracer = ptrace_parent(p);
169 struct task_struct *tracer = ptrace_parent(p); 170 if (tracer)
170 if (tracer) 171 tpid = task_pid_nr_ns(tracer, ns);
171 tpid = task_pid_nr_ns(tracer, ns); 172
172 } 173 tgid = task_tgid_nr_ns(p, ns);
174 ngid = task_numa_group_id(p);
173 cred = get_task_cred(p); 175 cred = get_task_cred(p);
176
177 task_lock(p);
178 if (p->files)
179 max_fds = files_fdtable(p->files)->max_fds;
180 task_unlock(p);
181 rcu_read_unlock();
182
174 seq_printf(m, 183 seq_printf(m,
175 "State:\t%s\n" 184 "State:\t%s\n"
176 "Tgid:\t%d\n" 185 "Tgid:\t%d\n"
@@ -179,12 +188,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
179 "PPid:\t%d\n" 188 "PPid:\t%d\n"
180 "TracerPid:\t%d\n" 189 "TracerPid:\t%d\n"
181 "Uid:\t%d\t%d\t%d\t%d\n" 190 "Uid:\t%d\t%d\t%d\t%d\n"
182 "Gid:\t%d\t%d\t%d\t%d\n", 191 "Gid:\t%d\t%d\t%d\t%d\n"
192 "FDSize:\t%d\nGroups:\t",
183 get_task_state(p), 193 get_task_state(p),
184 task_tgid_nr_ns(p, ns), 194 tgid, ngid, pid_nr_ns(pid, ns), ppid, tpid,
185 task_numa_group_id(p),
186 pid_nr_ns(pid, ns),
187 ppid, tpid,
188 from_kuid_munged(user_ns, cred->uid), 195 from_kuid_munged(user_ns, cred->uid),
189 from_kuid_munged(user_ns, cred->euid), 196 from_kuid_munged(user_ns, cred->euid),
190 from_kuid_munged(user_ns, cred->suid), 197 from_kuid_munged(user_ns, cred->suid),
@@ -192,20 +199,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
192 from_kgid_munged(user_ns, cred->gid), 199 from_kgid_munged(user_ns, cred->gid),
193 from_kgid_munged(user_ns, cred->egid), 200 from_kgid_munged(user_ns, cred->egid),
194 from_kgid_munged(user_ns, cred->sgid), 201 from_kgid_munged(user_ns, cred->sgid),
195 from_kgid_munged(user_ns, cred->fsgid)); 202 from_kgid_munged(user_ns, cred->fsgid),
196 203 max_fds);
197 task_lock(p);
198 if (p->files)
199 fdt = files_fdtable(p->files);
200 seq_printf(m,
201 "FDSize:\t%d\n"
202 "Groups:\t",
203 fdt ? fdt->max_fds : 0);
204 rcu_read_unlock();
205 204
206 group_info = cred->group_info; 205 group_info = cred->group_info;
207 task_unlock(p);
208
209 for (g = 0; g < group_info->ngroups; g++) 206 for (g = 0; g < group_info->ngroups; g++)
210 seq_printf(m, "%d ", 207 seq_printf(m, "%d ",
211 from_kgid_munged(user_ns, GROUP_AT(group_info, g))); 208 from_kgid_munged(user_ns, GROUP_AT(group_info, g)));
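
Note: task_state() now snapshots tgid, ngid, tracer pid and max_fds while the RCU read lock and task_lock() are held, drops both, and only then formats everything in a single seq_printf(); no formatting work remains inside the critical sections. A generic user-space sketch of the snapshot-then-format idea (a pthread mutex standing in for the kernel locks):

    #include <pthread.h>
    #include <stdio.h>

    struct task_info {
            pthread_mutex_t lock;
            int tgid, ppid, max_fds;
    };

    static void print_status(struct task_info *t, FILE *out)
    {
            int tgid, ppid, max_fds;

            /* Snapshot the fields under the lock... */
            pthread_mutex_lock(&t->lock);
            tgid = t->tgid;
            ppid = t->ppid;
            max_fds = t->max_fds;
            pthread_mutex_unlock(&t->lock);

            /* ...and do the (potentially slow) formatting after dropping it. */
            fprintf(out, "Tgid:\t%d\nPPid:\t%d\nFDSize:\t%d\n", tgid, ppid, max_fds);
    }
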
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 64891f3e41bd..590aeda5af12 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2618,6 +2618,9 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
2618 dput(dentry); 2618 dput(dentry);
2619 } 2619 }
2620 2620
2621 if (pid == tgid)
2622 return;
2623
2621 name.name = buf; 2624 name.name = buf;
2622 name.len = snprintf(buf, sizeof(buf), "%d", tgid); 2625 name.len = snprintf(buf, sizeof(buf), "%d", tgid);
2623 leader = d_hash_and_lookup(mnt->mnt_root, &name); 2626 leader = d_hash_and_lookup(mnt->mnt_root, &name);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 317b72641ebf..7fea13229f33 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -31,9 +31,73 @@ static DEFINE_SPINLOCK(proc_subdir_lock);
31 31
32static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de) 32static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
33{ 33{
34 if (de->namelen != len) 34 if (len < de->namelen)
35 return 0; 35 return -1;
36 return !memcmp(name, de->name, len); 36 if (len > de->namelen)
37 return 1;
38
39 return memcmp(name, de->name, len);
40}
41
42static struct proc_dir_entry *pde_subdir_first(struct proc_dir_entry *dir)
43{
44 return rb_entry_safe(rb_first(&dir->subdir), struct proc_dir_entry,
45 subdir_node);
46}
47
48static struct proc_dir_entry *pde_subdir_next(struct proc_dir_entry *dir)
49{
50 return rb_entry_safe(rb_next(&dir->subdir_node), struct proc_dir_entry,
51 subdir_node);
52}
53
54static struct proc_dir_entry *pde_subdir_find(struct proc_dir_entry *dir,
55 const char *name,
56 unsigned int len)
57{
58 struct rb_node *node = dir->subdir.rb_node;
59
60 while (node) {
61 struct proc_dir_entry *de = container_of(node,
62 struct proc_dir_entry,
63 subdir_node);
64 int result = proc_match(len, name, de);
65
66 if (result < 0)
67 node = node->rb_left;
68 else if (result > 0)
69 node = node->rb_right;
70 else
71 return de;
72 }
73 return NULL;
74}
75
76static bool pde_subdir_insert(struct proc_dir_entry *dir,
77 struct proc_dir_entry *de)
78{
79 struct rb_root *root = &dir->subdir;
80 struct rb_node **new = &root->rb_node, *parent = NULL;
81
82 /* Figure out where to put new node */
83 while (*new) {
84 struct proc_dir_entry *this =
85 container_of(*new, struct proc_dir_entry, subdir_node);
86 int result = proc_match(de->namelen, de->name, this);
87
88 parent = *new;
89 if (result < 0)
90 new = &(*new)->rb_left;
91 else if (result > 0)
92 new = &(*new)->rb_right;
93 else
94 return false;
95 }
96
97 /* Add new node and rebalance tree. */
98 rb_link_node(&de->subdir_node, parent, new);
99 rb_insert_color(&de->subdir_node, root);
100 return true;
37} 101}
38 102
39static int proc_notify_change(struct dentry *dentry, struct iattr *iattr) 103static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
@@ -92,10 +156,7 @@ static int __xlate_proc_name(const char *name, struct proc_dir_entry **ret,
92 break; 156 break;
93 157
94 len = next - cp; 158 len = next - cp;
95 for (de = de->subdir; de ; de = de->next) { 159 de = pde_subdir_find(de, cp, len);
96 if (proc_match(len, cp, de))
97 break;
98 }
99 if (!de) { 160 if (!de) {
100 WARN(1, "name '%s'\n", name); 161 WARN(1, "name '%s'\n", name);
101 return -ENOENT; 162 return -ENOENT;
@@ -183,19 +244,16 @@ struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *dir,
183 struct inode *inode; 244 struct inode *inode;
184 245
185 spin_lock(&proc_subdir_lock); 246 spin_lock(&proc_subdir_lock);
186 for (de = de->subdir; de ; de = de->next) { 247 de = pde_subdir_find(de, dentry->d_name.name, dentry->d_name.len);
187 if (de->namelen != dentry->d_name.len) 248 if (de) {
188 continue; 249 pde_get(de);
189 if (!memcmp(dentry->d_name.name, de->name, de->namelen)) { 250 spin_unlock(&proc_subdir_lock);
190 pde_get(de); 251 inode = proc_get_inode(dir->i_sb, de);
191 spin_unlock(&proc_subdir_lock); 252 if (!inode)
192 inode = proc_get_inode(dir->i_sb, de); 253 return ERR_PTR(-ENOMEM);
193 if (!inode) 254 d_set_d_op(dentry, &simple_dentry_operations);
194 return ERR_PTR(-ENOMEM); 255 d_add(dentry, inode);
195 d_set_d_op(dentry, &simple_dentry_operations); 256 return NULL;
196 d_add(dentry, inode);
197 return NULL;
198 }
199 } 257 }
200 spin_unlock(&proc_subdir_lock); 258 spin_unlock(&proc_subdir_lock);
201 return ERR_PTR(-ENOENT); 259 return ERR_PTR(-ENOENT);
@@ -225,7 +283,7 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
225 return 0; 283 return 0;
226 284
227 spin_lock(&proc_subdir_lock); 285 spin_lock(&proc_subdir_lock);
228 de = de->subdir; 286 de = pde_subdir_first(de);
229 i = ctx->pos - 2; 287 i = ctx->pos - 2;
230 for (;;) { 288 for (;;) {
231 if (!de) { 289 if (!de) {
@@ -234,7 +292,7 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
234 } 292 }
235 if (!i) 293 if (!i)
236 break; 294 break;
237 de = de->next; 295 de = pde_subdir_next(de);
238 i--; 296 i--;
239 } 297 }
240 298
@@ -249,7 +307,7 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
249 } 307 }
250 spin_lock(&proc_subdir_lock); 308 spin_lock(&proc_subdir_lock);
251 ctx->pos++; 309 ctx->pos++;
252 next = de->next; 310 next = pde_subdir_next(de);
253 pde_put(de); 311 pde_put(de);
254 de = next; 312 de = next;
255 } while (de); 313 } while (de);
@@ -286,9 +344,8 @@ static const struct inode_operations proc_dir_inode_operations = {
286 344
287static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp) 345static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
288{ 346{
289 struct proc_dir_entry *tmp;
290 int ret; 347 int ret;
291 348
292 ret = proc_alloc_inum(&dp->low_ino); 349 ret = proc_alloc_inum(&dp->low_ino);
293 if (ret) 350 if (ret)
294 return ret; 351 return ret;
@@ -304,21 +361,21 @@ static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp
304 dp->proc_iops = &proc_file_inode_operations; 361 dp->proc_iops = &proc_file_inode_operations;
305 } else { 362 } else {
306 WARN_ON(1); 363 WARN_ON(1);
364 proc_free_inum(dp->low_ino);
307 return -EINVAL; 365 return -EINVAL;
308 } 366 }
309 367
310 spin_lock(&proc_subdir_lock); 368 spin_lock(&proc_subdir_lock);
311
312 for (tmp = dir->subdir; tmp; tmp = tmp->next)
313 if (strcmp(tmp->name, dp->name) == 0) {
314 WARN(1, "proc_dir_entry '%s/%s' already registered\n",
315 dir->name, dp->name);
316 break;
317 }
318
319 dp->next = dir->subdir;
320 dp->parent = dir; 369 dp->parent = dir;
321 dir->subdir = dp; 370 if (pde_subdir_insert(dir, dp) == false) {
371 WARN(1, "proc_dir_entry '%s/%s' already registered\n",
372 dir->name, dp->name);
373 spin_unlock(&proc_subdir_lock);
374 if (S_ISDIR(dp->mode))
375 dir->nlink--;
376 proc_free_inum(dp->low_ino);
377 return -EEXIST;
378 }
322 spin_unlock(&proc_subdir_lock); 379 spin_unlock(&proc_subdir_lock);
323 380
324 return 0; 381 return 0;
@@ -354,6 +411,7 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
354 ent->namelen = qstr.len; 411 ent->namelen = qstr.len;
355 ent->mode = mode; 412 ent->mode = mode;
356 ent->nlink = nlink; 413 ent->nlink = nlink;
414 ent->subdir = RB_ROOT;
357 atomic_set(&ent->count, 1); 415 atomic_set(&ent->count, 1);
358 spin_lock_init(&ent->pde_unload_lock); 416 spin_lock_init(&ent->pde_unload_lock);
359 INIT_LIST_HEAD(&ent->pde_openers); 417 INIT_LIST_HEAD(&ent->pde_openers);
@@ -485,7 +543,6 @@ void pde_put(struct proc_dir_entry *pde)
485 */ 543 */
486void remove_proc_entry(const char *name, struct proc_dir_entry *parent) 544void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
487{ 545{
488 struct proc_dir_entry **p;
489 struct proc_dir_entry *de = NULL; 546 struct proc_dir_entry *de = NULL;
490 const char *fn = name; 547 const char *fn = name;
491 unsigned int len; 548 unsigned int len;
@@ -497,14 +554,9 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
497 } 554 }
498 len = strlen(fn); 555 len = strlen(fn);
499 556
500 for (p = &parent->subdir; *p; p=&(*p)->next ) { 557 de = pde_subdir_find(parent, fn, len);
501 if (proc_match(len, fn, *p)) { 558 if (de)
502 de = *p; 559 rb_erase(&de->subdir_node, &parent->subdir);
503 *p = de->next;
504 de->next = NULL;
505 break;
506 }
507 }
508 spin_unlock(&proc_subdir_lock); 560 spin_unlock(&proc_subdir_lock);
509 if (!de) { 561 if (!de) {
510 WARN(1, "name '%s'\n", name); 562 WARN(1, "name '%s'\n", name);
@@ -516,16 +568,15 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
516 if (S_ISDIR(de->mode)) 568 if (S_ISDIR(de->mode))
517 parent->nlink--; 569 parent->nlink--;
518 de->nlink = 0; 570 de->nlink = 0;
519 WARN(de->subdir, "%s: removing non-empty directory " 571 WARN(pde_subdir_first(de),
520 "'%s/%s', leaking at least '%s'\n", __func__, 572 "%s: removing non-empty directory '%s/%s', leaking at least '%s'\n",
521 de->parent->name, de->name, de->subdir->name); 573 __func__, de->parent->name, de->name, pde_subdir_first(de)->name);
522 pde_put(de); 574 pde_put(de);
523} 575}
524EXPORT_SYMBOL(remove_proc_entry); 576EXPORT_SYMBOL(remove_proc_entry);
525 577
526int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) 578int remove_proc_subtree(const char *name, struct proc_dir_entry *parent)
527{ 579{
528 struct proc_dir_entry **p;
529 struct proc_dir_entry *root = NULL, *de, *next; 580 struct proc_dir_entry *root = NULL, *de, *next;
530 const char *fn = name; 581 const char *fn = name;
531 unsigned int len; 582 unsigned int len;
@@ -537,24 +588,18 @@ int remove_proc_subtree(const char *name, struct proc_dir_entry *parent)
537 } 588 }
538 len = strlen(fn); 589 len = strlen(fn);
539 590
540 for (p = &parent->subdir; *p; p=&(*p)->next ) { 591 root = pde_subdir_find(parent, fn, len);
541 if (proc_match(len, fn, *p)) {
542 root = *p;
543 *p = root->next;
544 root->next = NULL;
545 break;
546 }
547 }
548 if (!root) { 592 if (!root) {
549 spin_unlock(&proc_subdir_lock); 593 spin_unlock(&proc_subdir_lock);
550 return -ENOENT; 594 return -ENOENT;
551 } 595 }
596 rb_erase(&root->subdir_node, &parent->subdir);
597
552 de = root; 598 de = root;
553 while (1) { 599 while (1) {
554 next = de->subdir; 600 next = pde_subdir_first(de);
555 if (next) { 601 if (next) {
556 de->subdir = next->next; 602 rb_erase(&next->subdir_node, &de->subdir);
557 next->next = NULL;
558 de = next; 603 de = next;
559 continue; 604 continue;
560 } 605 }
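
Note: proc_match() now returns a signed three-way result, ordering entries by name length first and falling back to memcmp() only when the lengths are equal; pde_subdir_find() and pde_subdir_insert() must agree on this total order for the subdir rbtree to stay consistent. A user-space illustration of that comparator, exercised with qsort() (demo types only, not the proc_dir_entry layout):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct demo_entry {
            const char *name;
            unsigned int namelen;
    };

    /* Length first, then byte order: cheap, and a valid total order as long as
     * every lookup and every insert uses exactly the same rule. */
    static int demo_match(const void *a, const void *b)
    {
            const struct demo_entry *x = a, *y = b;

            if (x->namelen < y->namelen)
                    return -1;
            if (x->namelen > y->namelen)
                    return 1;
            return memcmp(x->name, y->name, x->namelen);
    }

    int main(void)
    {
            struct demo_entry dir[] = {
                    { "sys", 3 }, { "net", 3 }, { "meminfo", 7 }, { "fs", 2 },
            };

            qsort(dir, 4, sizeof(dir[0]), demo_match);
            for (int i = 0; i < 4; i++)
                    printf("%s\n", dir[i].name);    /* fs, net, sys, meminfo */
            return 0;
    }
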
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa7a0ee182e1..7fb1a4869fd0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -24,10 +24,9 @@ struct mempolicy;
24 * tree) of these proc_dir_entries, so that we can dynamically 24 * tree) of these proc_dir_entries, so that we can dynamically
25 * add new files to /proc. 25 * add new files to /proc.
26 * 26 *
27 * The "next" pointer creates a linked list of one /proc directory, 27 * parent/subdir are used for the directory structure (every /proc file has a
28 * while parent/subdir create the directory structure (every 28 * parent, but "subdir" is empty for all non-directory entries).
29 * /proc file has a parent, but "subdir" is NULL for all 29 * subdir_node is used to build the rb tree "subdir" of the parent.
30 * non-directory entries).
31 */ 30 */
32struct proc_dir_entry { 31struct proc_dir_entry {
33 unsigned int low_ino; 32 unsigned int low_ino;
@@ -38,7 +37,9 @@ struct proc_dir_entry {
38 loff_t size; 37 loff_t size;
39 const struct inode_operations *proc_iops; 38 const struct inode_operations *proc_iops;
40 const struct file_operations *proc_fops; 39 const struct file_operations *proc_fops;
41 struct proc_dir_entry *next, *parent, *subdir; 40 struct proc_dir_entry *parent;
41 struct rb_root subdir;
42 struct rb_node subdir_node;
42 void *data; 43 void *data;
43 atomic_t count; /* use count */ 44 atomic_t count; /* use count */
44 atomic_t in_use; /* number of callers into module in progress; */ 45 atomic_t in_use; /* number of callers into module in progress; */
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a63af3e0a612..1bde894bc624 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -192,6 +192,7 @@ static __net_init int proc_net_ns_init(struct net *net)
192 if (!netd) 192 if (!netd)
193 goto out; 193 goto out;
194 194
195 netd->subdir = RB_ROOT;
195 netd->data = net; 196 netd->data = net;
196 netd->nlink = 2; 197 netd->nlink = 2;
197 netd->namelen = 3; 198 netd->namelen = 3;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 094e44d4a6be..e74ac9f1a2c0 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -251,6 +251,7 @@ struct proc_dir_entry proc_root = {
251 .proc_iops = &proc_root_inode_operations, 251 .proc_iops = &proc_root_inode_operations,
252 .proc_fops = &proc_root_operations, 252 .proc_fops = &proc_root_operations,
253 .parent = &proc_root, 253 .parent = &proc_root,
254 .subdir = RB_ROOT,
254 .name = "/proc", 255 .name = "/proc",
255}; 256};
256 257
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6734c6b66a6..246eae84b13b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -447,58 +447,91 @@ struct mem_size_stats {
447 u64 pss; 447 u64 pss;
448}; 448};
449 449
450static void smaps_account(struct mem_size_stats *mss, struct page *page,
451 unsigned long size, bool young, bool dirty)
452{
453 int mapcount;
454
455 if (PageAnon(page))
456 mss->anonymous += size;
450 457
451static void smaps_pte_entry(pte_t ptent, unsigned long addr, 458 mss->resident += size;
452 unsigned long ptent_size, struct mm_walk *walk) 459 /* Accumulate the size in pages that have been accessed. */
460 if (young || PageReferenced(page))
461 mss->referenced += size;
462 mapcount = page_mapcount(page);
463 if (mapcount >= 2) {
464 u64 pss_delta;
465
466 if (dirty || PageDirty(page))
467 mss->shared_dirty += size;
468 else
469 mss->shared_clean += size;
470 pss_delta = (u64)size << PSS_SHIFT;
471 do_div(pss_delta, mapcount);
472 mss->pss += pss_delta;
473 } else {
474 if (dirty || PageDirty(page))
475 mss->private_dirty += size;
476 else
477 mss->private_clean += size;
478 mss->pss += (u64)size << PSS_SHIFT;
479 }
480}
481
482static void smaps_pte_entry(pte_t *pte, unsigned long addr,
483 struct mm_walk *walk)
453{ 484{
454 struct mem_size_stats *mss = walk->private; 485 struct mem_size_stats *mss = walk->private;
455 struct vm_area_struct *vma = mss->vma; 486 struct vm_area_struct *vma = mss->vma;
456 pgoff_t pgoff = linear_page_index(vma, addr); 487 pgoff_t pgoff = linear_page_index(vma, addr);
457 struct page *page = NULL; 488 struct page *page = NULL;
458 int mapcount;
459 489
460 if (pte_present(ptent)) { 490 if (pte_present(*pte)) {
461 page = vm_normal_page(vma, addr, ptent); 491 page = vm_normal_page(vma, addr, *pte);
462 } else if (is_swap_pte(ptent)) { 492 } else if (is_swap_pte(*pte)) {
463 swp_entry_t swpent = pte_to_swp_entry(ptent); 493 swp_entry_t swpent = pte_to_swp_entry(*pte);
464 494
465 if (!non_swap_entry(swpent)) 495 if (!non_swap_entry(swpent))
466 mss->swap += ptent_size; 496 mss->swap += PAGE_SIZE;
467 else if (is_migration_entry(swpent)) 497 else if (is_migration_entry(swpent))
468 page = migration_entry_to_page(swpent); 498 page = migration_entry_to_page(swpent);
469 } else if (pte_file(ptent)) { 499 } else if (pte_file(*pte)) {
470 if (pte_to_pgoff(ptent) != pgoff) 500 if (pte_to_pgoff(*pte) != pgoff)
471 mss->nonlinear += ptent_size; 501 mss->nonlinear += PAGE_SIZE;
472 } 502 }
473 503
474 if (!page) 504 if (!page)
475 return; 505 return;
476 506
477 if (PageAnon(page))
478 mss->anonymous += ptent_size;
479
480 if (page->index != pgoff) 507 if (page->index != pgoff)
481 mss->nonlinear += ptent_size; 508 mss->nonlinear += PAGE_SIZE;
482 509
483 mss->resident += ptent_size; 510 smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
484 /* Accumulate the size in pages that have been accessed. */ 511}
485 if (pte_young(ptent) || PageReferenced(page)) 512
486 mss->referenced += ptent_size; 513#ifdef CONFIG_TRANSPARENT_HUGEPAGE
487 mapcount = page_mapcount(page); 514static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
488 if (mapcount >= 2) { 515 struct mm_walk *walk)
489 if (pte_dirty(ptent) || PageDirty(page)) 516{
490 mss->shared_dirty += ptent_size; 517 struct mem_size_stats *mss = walk->private;
491 else 518 struct vm_area_struct *vma = mss->vma;
492 mss->shared_clean += ptent_size; 519 struct page *page;
493 mss->pss += (ptent_size << PSS_SHIFT) / mapcount; 520
494 } else { 521 /* FOLL_DUMP will return -EFAULT on huge zero page */
495 if (pte_dirty(ptent) || PageDirty(page)) 522 page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
496 mss->private_dirty += ptent_size; 523 if (IS_ERR_OR_NULL(page))
497 else 524 return;
498 mss->private_clean += ptent_size; 525 mss->anonymous_thp += HPAGE_PMD_SIZE;
499 mss->pss += (ptent_size << PSS_SHIFT); 526 smaps_account(mss, page, HPAGE_PMD_SIZE,
500 } 527 pmd_young(*pmd), pmd_dirty(*pmd));
501} 528}
529#else
530static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
531 struct mm_walk *walk)
532{
533}
534#endif
502 535
503static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 536static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
504 struct mm_walk *walk) 537 struct mm_walk *walk)
@@ -509,9 +542,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
509 spinlock_t *ptl; 542 spinlock_t *ptl;
510 543
511 if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { 544 if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
512 smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk); 545 smaps_pmd_entry(pmd, addr, walk);
513 spin_unlock(ptl); 546 spin_unlock(ptl);
514 mss->anonymous_thp += HPAGE_PMD_SIZE;
515 return 0; 547 return 0;
516 } 548 }
517 549
@@ -524,7 +556,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
524 */ 556 */
525 pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); 557 pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
526 for (; addr != end; pte++, addr += PAGE_SIZE) 558 for (; addr != end; pte++, addr += PAGE_SIZE)
527 smaps_pte_entry(*pte, addr, PAGE_SIZE, walk); 559 smaps_pte_entry(pte, addr, walk);
528 pte_unmap_unlock(pte - 1, ptl); 560 pte_unmap_unlock(pte - 1, ptl);
529 cond_resched(); 561 cond_resched();
530 return 0; 562 return 0;
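
Note: smaps_account() folds the old per-PTE accounting into one helper shared by the PTE and PMD paths, and switches the proportional-set-size maths to an explicit u64 with do_div(), since a plain 64-bit divide is not available on every 32-bit architecture. PSS keeps PSS_SHIFT (12) bits of fraction, so a page shared by N tasks contributes size/N without the remainder being lost to integer truncation. A user-space rendition of the arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    #define DEMO_PSS_SHIFT 12       /* same fixed-point shift the kernel uses */

    /* One mapping's contribution to PSS, in 1/4096ths of a byte. */
    static uint64_t pss_contribution(unsigned long size, int mapcount)
    {
            uint64_t pss = (uint64_t)size << DEMO_PSS_SHIFT;

            if (mapcount > 1)
                    pss /= mapcount;        /* the kernel uses do_div() here */
            return pss;
    }

    int main(void)
    {
            /* A 4 KiB page mapped by three processes: each reports ~1365 bytes. */
            uint64_t pss = pss_contribution(4096, 3);

            printf("Pss: %llu bytes\n", (unsigned long long)(pss >> DEMO_PSS_SHIFT));
            return 0;
    }
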
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 27b0c9105da5..641e56494a92 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -113,6 +113,19 @@ static inline void css_get(struct cgroup_subsys_state *css)
113} 113}
114 114
115/** 115/**
116 * css_get_many - obtain references on the specified css
117 * @css: target css
118 * @n: number of references to get
119 *
120 * The caller must already have a reference.
121 */
122static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
123{
124 if (!(css->flags & CSS_NO_REF))
125 percpu_ref_get_many(&css->refcnt, n);
126}
127
128/**
116 * css_tryget - try to obtain a reference on the specified css 129 * css_tryget - try to obtain a reference on the specified css
117 * @css: target css 130 * @css: target css
118 * 131 *
@@ -159,6 +172,19 @@ static inline void css_put(struct cgroup_subsys_state *css)
159 percpu_ref_put(&css->refcnt); 172 percpu_ref_put(&css->refcnt);
160} 173}
161 174
175/**
176 * css_put_many - put css references
177 * @css: target css
178 * @n: number of references to put
179 *
180 * Put references obtained via css_get() and css_tryget_online().
181 */
182static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
183{
184 if (!(css->flags & CSS_NO_REF))
185 percpu_ref_put_many(&css->refcnt, n);
186}
187
162/* bits in struct cgroup flags field */ 188/* bits in struct cgroup flags field */
163enum { 189enum {
164 /* Control Group requires release notifications to userspace */ 190 /* Control Group requires release notifications to userspace */
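
Note: css_get_many()/css_put_many() above are thin wrappers around percpu_ref_get_many()/percpu_ref_put_many(): when a caller already knows it needs N references (for example when uncharging a batch of pages at once), taking or dropping them in one call is cheaper than looping N times. A user-space sketch of the batched interface shape with C11 atomics (not the percpu-ref implementation itself):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct demo_ref {
            atomic_long count;
    };

    static void demo_ref_get_many(struct demo_ref *r, long n)
    {
            atomic_fetch_add_explicit(&r->count, n, memory_order_relaxed);
    }

    /* Returns true when the last reference was dropped. */
    static bool demo_ref_put_many(struct demo_ref *r, long n)
    {
            return atomic_fetch_sub_explicit(&r->count, n,
                                             memory_order_acq_rel) == n;
    }
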
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 60bdf8dc02a3..3238ffa33f68 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -33,10 +33,11 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
33extern unsigned long try_to_compact_pages(struct zonelist *zonelist, 33extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
34 int order, gfp_t gfp_mask, nodemask_t *mask, 34 int order, gfp_t gfp_mask, nodemask_t *mask,
35 enum migrate_mode mode, int *contended, 35 enum migrate_mode mode, int *contended,
36 struct zone **candidate_zone); 36 int alloc_flags, int classzone_idx);
37extern void compact_pgdat(pg_data_t *pgdat, int order); 37extern void compact_pgdat(pg_data_t *pgdat, int order);
38extern void reset_isolation_suitable(pg_data_t *pgdat); 38extern void reset_isolation_suitable(pg_data_t *pgdat);
39extern unsigned long compaction_suitable(struct zone *zone, int order); 39extern unsigned long compaction_suitable(struct zone *zone, int order,
40 int alloc_flags, int classzone_idx);
40 41
41/* Do not skip compaction more than 64 times */ 42/* Do not skip compaction more than 64 times */
42#define COMPACT_MAX_DEFER_SHIFT 6 43#define COMPACT_MAX_DEFER_SHIFT 6
@@ -103,7 +104,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
103static inline unsigned long try_to_compact_pages(struct zonelist *zonelist, 104static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
104 int order, gfp_t gfp_mask, nodemask_t *nodemask, 105 int order, gfp_t gfp_mask, nodemask_t *nodemask,
105 enum migrate_mode mode, int *contended, 106 enum migrate_mode mode, int *contended,
106 struct zone **candidate_zone) 107 int alloc_flags, int classzone_idx)
107{ 108{
108 return COMPACT_CONTINUE; 109 return COMPACT_CONTINUE;
109} 110}
@@ -116,7 +117,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
116{ 117{
117} 118}
118 119
119static inline unsigned long compaction_suitable(struct zone *zone, int order) 120static inline unsigned long compaction_suitable(struct zone *zone, int order,
121 int alloc_flags, int classzone_idx)
120{ 122{
121 return COMPACT_SKIPPED; 123 return COMPACT_SKIPPED;
122} 124}
diff --git a/include/linux/file.h b/include/linux/file.h
index 4d69123377a2..f87d30882a24 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -66,7 +66,6 @@ extern void set_close_on_exec(unsigned int fd, int flag);
66extern bool get_close_on_exec(unsigned int fd); 66extern bool get_close_on_exec(unsigned int fd);
67extern void put_filp(struct file *); 67extern void put_filp(struct file *);
68extern int get_unused_fd_flags(unsigned flags); 68extern int get_unused_fd_flags(unsigned flags);
69#define get_unused_fd() get_unused_fd_flags(0)
70extern void put_unused_fd(unsigned int fd); 69extern void put_unused_fd(unsigned int fd);
71 70
72extern void fd_install(unsigned int fd, struct file *file); 71extern void fd_install(unsigned int fd, struct file *file);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 41b30fd4d041..07d2699cdb51 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -381,8 +381,8 @@ extern void free_kmem_pages(unsigned long addr, unsigned int order);
381 381
382void page_alloc_init(void); 382void page_alloc_init(void);
383void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); 383void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
384void drain_all_pages(void); 384void drain_all_pages(struct zone *zone);
385void drain_local_pages(void *dummy); 385void drain_local_pages(struct zone *zone);
386 386
387/* 387/*
388 * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what 388 * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6e6d338641fe..cdd149ca5cc0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -311,7 +311,8 @@ static inline struct hstate *hstate_sizelog(int page_size_log)
311{ 311{
312 if (!page_size_log) 312 if (!page_size_log)
313 return &default_hstate; 313 return &default_hstate;
314 return size_to_hstate(1 << page_size_log); 314
315 return size_to_hstate(1UL << page_size_log);
315} 316}
316 317
317static inline struct hstate *hstate_vma(struct vm_area_struct *vma) 318static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
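
Note: the hstate_sizelog() tweak above widens the shift. With page_size_log of 31 or more (e.g. 16 GiB huge pages), "1 << page_size_log" overflows or is undefined for a 32-bit int before size_to_hstate() ever sees the value, while "1UL << page_size_log" stays well defined on 64-bit kernels. A tiny illustration of why the literal's type matters (assumes a 64-bit unsigned long):

    #include <stdio.h>

    int main(void)
    {
            unsigned int log = 34;                  /* e.g. a 16 GiB huge page */
            unsigned long size = 1UL << log;        /* evaluated as unsigned long */

            /* "1 << log" would shift a 32-bit int by 34 bits, which is undefined
             * behaviour; with log == 31 it would already overflow to a negative
             * value. */
            printf("page size = %lu bytes\n", size);
            return 0;
    }
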
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0129f89cf98d..bcc853eccc85 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -16,7 +16,6 @@
16#define _LINUX_HUGETLB_CGROUP_H 16#define _LINUX_HUGETLB_CGROUP_H
17 17
18#include <linux/mmdebug.h> 18#include <linux/mmdebug.h>
19#include <linux/res_counter.h>
20 19
21struct hugetlb_cgroup; 20struct hugetlb_cgroup;
22/* 21/*
diff --git a/include/linux/kern_levels.h b/include/linux/kern_levels.h
index 866caaa9e2bb..c2ce155d83cc 100644
--- a/include/linux/kern_levels.h
+++ b/include/linux/kern_levels.h
@@ -22,4 +22,17 @@
22 */ 22 */
23#define KERN_CONT "" 23#define KERN_CONT ""
24 24
25/* integer equivalents of KERN_<LEVEL> */
26#define LOGLEVEL_SCHED -2 /* Deferred messages from sched code
27 * are set to this special level */
28#define LOGLEVEL_DEFAULT -1 /* default (or last) loglevel */
29#define LOGLEVEL_EMERG 0 /* system is unusable */
30#define LOGLEVEL_ALERT 1 /* action must be taken immediately */
31#define LOGLEVEL_CRIT 2 /* critical conditions */
32#define LOGLEVEL_ERR 3 /* error conditions */
33#define LOGLEVEL_WARNING 4 /* warning conditions */
34#define LOGLEVEL_NOTICE 5 /* normal but significant condition */
35#define LOGLEVEL_INFO 6 /* informational */
36#define LOGLEVEL_DEBUG 7 /* debug-level messages */
37
25#endif 38#endif
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 446d76a87ba1..233ea8107038 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -427,6 +427,7 @@ extern int panic_timeout;
427extern int panic_on_oops; 427extern int panic_on_oops;
428extern int panic_on_unrecovered_nmi; 428extern int panic_on_unrecovered_nmi;
429extern int panic_on_io_nmi; 429extern int panic_on_io_nmi;
430extern int panic_on_warn;
430extern int sysctl_panic_on_stackoverflow; 431extern int sysctl_panic_on_stackoverflow;
431/* 432/*
432 * Only to be used by arch init code. If the user over-wrote the default 433 * Only to be used by arch init code. If the user over-wrote the default
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6b75640ef5ab..6ea9f919e888 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,7 +25,6 @@
25#include <linux/jump_label.h> 25#include <linux/jump_label.h>
26 26
27struct mem_cgroup; 27struct mem_cgroup;
28struct page_cgroup;
29struct page; 28struct page;
30struct mm_struct; 29struct mm_struct;
31struct kmem_cache; 30struct kmem_cache;
@@ -68,10 +67,9 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
68struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); 67struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
69struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); 68struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
70 69
71bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg, 70bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
72 struct mem_cgroup *memcg); 71 struct mem_cgroup *root);
73bool task_in_mem_cgroup(struct task_struct *task, 72bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
74 const struct mem_cgroup *memcg);
75 73
76extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page); 74extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
77extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); 75extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -79,15 +77,16 @@ extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
79extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); 77extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
80extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css); 78extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
81 79
82static inline 80static inline bool mm_match_cgroup(struct mm_struct *mm,
83bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg) 81 struct mem_cgroup *memcg)
84{ 82{
85 struct mem_cgroup *task_memcg; 83 struct mem_cgroup *task_memcg;
86 bool match; 84 bool match = false;
87 85
88 rcu_read_lock(); 86 rcu_read_lock();
89 task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); 87 task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
90 match = __mem_cgroup_same_or_subtree(memcg, task_memcg); 88 if (task_memcg)
89 match = mem_cgroup_is_descendant(task_memcg, memcg);
91 rcu_read_unlock(); 90 rcu_read_unlock();
92 return match; 91 return match;
93} 92}
@@ -141,8 +140,8 @@ static inline bool mem_cgroup_disabled(void)
141 140
142struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, bool *locked, 141struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, bool *locked,
143 unsigned long *flags); 142 unsigned long *flags);
144void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked, 143void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool *locked,
145 unsigned long flags); 144 unsigned long *flags);
146void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, 145void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
147 enum mem_cgroup_stat_index idx, int val); 146 enum mem_cgroup_stat_index idx, int val);
148 147
@@ -174,10 +173,6 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
174void mem_cgroup_split_huge_fixup(struct page *head); 173void mem_cgroup_split_huge_fixup(struct page *head);
175#endif 174#endif
176 175
177#ifdef CONFIG_DEBUG_VM
178bool mem_cgroup_bad_page_check(struct page *page);
179void mem_cgroup_print_bad_page(struct page *page);
180#endif
181#else /* CONFIG_MEMCG */ 176#else /* CONFIG_MEMCG */
182struct mem_cgroup; 177struct mem_cgroup;
183 178
@@ -297,7 +292,7 @@ static inline struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page,
297} 292}
298 293
299static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, 294static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg,
300 bool locked, unsigned long flags) 295 bool *locked, unsigned long *flags)
301{ 296{
302} 297}
303 298
@@ -347,19 +342,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
347} 342}
348#endif /* CONFIG_MEMCG */ 343#endif /* CONFIG_MEMCG */
349 344
350#if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
351static inline bool
352mem_cgroup_bad_page_check(struct page *page)
353{
354 return false;
355}
356
357static inline void
358mem_cgroup_print_bad_page(struct page *page)
359{
360}
361#endif
362
363enum { 345enum {
364 UNDER_LIMIT, 346 UNDER_LIMIT,
365 SOFT_LIMIT, 347 SOFT_LIMIT,
@@ -447,9 +429,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
447 /* 429 /*
448 * __GFP_NOFAIL allocations will move on even if charging is not 430 * __GFP_NOFAIL allocations will move on even if charging is not
449 * possible. Therefore we don't even try, and have this allocation 431 * possible. Therefore we don't even try, and have this allocation
450 * unaccounted. We could in theory charge it with 432 * unaccounted. We could in theory charge it forcibly, but we hope
451 * res_counter_charge_nofail, but we hope those allocations are rare, 433 * those allocations are rare, and won't be worth the trouble.
452 * and won't be worth the trouble.
453 */ 434 */
454 if (gfp & __GFP_NOFAIL) 435 if (gfp & __GFP_NOFAIL)
455 return true; 436 return true;
@@ -467,8 +448,6 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
467 * memcg_kmem_uncharge_pages: uncharge pages from memcg 448 * memcg_kmem_uncharge_pages: uncharge pages from memcg
468 * @page: pointer to struct page being freed 449 * @page: pointer to struct page being freed
469 * @order: allocation order. 450 * @order: allocation order.
470 *
471 * there is no need to specify memcg here, since it is embedded in page_cgroup
472 */ 451 */
473static inline void 452static inline void
474memcg_kmem_uncharge_pages(struct page *page, int order) 453memcg_kmem_uncharge_pages(struct page *page, int order)
@@ -485,8 +464,7 @@ memcg_kmem_uncharge_pages(struct page *page, int order)
485 * 464 *
486 * Needs to be called after memcg_kmem_newpage_charge, regardless of success or 465 * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
487 * failure of the allocation. if @page is NULL, this function will revert the 466 * failure of the allocation. if @page is NULL, this function will revert the
488 * charges. Otherwise, it will commit the memcg given by @memcg to the 467 * charges. Otherwise, it will commit @page to @memcg.
489 * corresponding page_cgroup.
490 */ 468 */
491static inline void 469static inline void
492memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) 470memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
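
The page-stat interface above now hands the lock state back through pointers, so a caller would look roughly like the sketch below; the stat index is a placeholder, not something defined by this patch:

	bool locked;
	unsigned long flags;
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
	/* idx is some enum mem_cgroup_stat_index value; -1 decrements it */
	mem_cgroup_update_page_stat(memcg, idx, -1);
	mem_cgroup_end_page_stat(memcg, &locked, &flags);
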
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 004e9d17b47e..bf9f57529dcf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,7 @@
22#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) 22#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
23 23
24struct address_space; 24struct address_space;
25struct mem_cgroup;
25 26
26#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) 27#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
27#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \ 28#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
@@ -167,6 +168,10 @@ struct page {
167 struct page *first_page; /* Compound tail pages */ 168 struct page *first_page; /* Compound tail pages */
168 }; 169 };
169 170
171#ifdef CONFIG_MEMCG
172 struct mem_cgroup *mem_cgroup;
173#endif
174
170 /* 175 /*
171 * On machines where all RAM is mapped into kernel address space, 176 * On machines where all RAM is mapped into kernel address space,
172 * we can simply calculate the virtual address. On machines with 177 * we can simply calculate the virtual address. On machines with
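
With CONFIG_MEMCG enabled, this new mem_cgroup pointer in struct page takes over the role of the separate page_cgroup array removed later in this series: the owning memory cgroup is recorded directly in the page descriptor instead of being looked up through lookup_page_cgroup().
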
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ffe66e381c04..3879d7664dfc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -722,9 +722,6 @@ typedef struct pglist_data {
722 int nr_zones; 722 int nr_zones;
723#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ 723#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
724 struct page *node_mem_map; 724 struct page *node_mem_map;
725#ifdef CONFIG_MEMCG
726 struct page_cgroup *node_page_cgroup;
727#endif
728#endif 725#endif
729#ifndef CONFIG_NO_BOOTMEM 726#ifndef CONFIG_NO_BOOTMEM
730 struct bootmem_data *bdata; 727 struct bootmem_data *bdata;
@@ -1078,7 +1075,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
1078#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK) 1075#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)
1079 1076
1080struct page; 1077struct page;
1081struct page_cgroup;
1082struct mem_section { 1078struct mem_section {
1083 /* 1079 /*
1084 * This is, logically, a pointer to an array of struct 1080 * This is, logically, a pointer to an array of struct
@@ -1096,14 +1092,6 @@ struct mem_section {
1096 1092
1097 /* See declaration of similar field in struct zone */ 1093 /* See declaration of similar field in struct zone */
1098 unsigned long *pageblock_flags; 1094 unsigned long *pageblock_flags;
1099#ifdef CONFIG_MEMCG
1100 /*
1101 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
1102 * section. (see memcontrol.h/page_cgroup.h about this.)
1103 */
1104 struct page_cgroup *page_cgroup;
1105 unsigned long pad;
1106#endif
1107 /* 1095 /*
1108 * WARNING: mem_section must be a power-of-2 in size for the 1096 * WARNING: mem_section must be a power-of-2 in size for the
1109 * calculation and use of SECTION_ROOT_MASK to make sense. 1097 * calculation and use of SECTION_ROOT_MASK to make sense.
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
deleted file mode 100644
index 5c831f1eca79..000000000000
--- a/include/linux/page_cgroup.h
+++ /dev/null
@@ -1,105 +0,0 @@
1#ifndef __LINUX_PAGE_CGROUP_H
2#define __LINUX_PAGE_CGROUP_H
3
4enum {
5 /* flags for mem_cgroup */
6 PCG_USED = 0x01, /* This page is charged to a memcg */
7 PCG_MEM = 0x02, /* This page holds a memory charge */
8 PCG_MEMSW = 0x04, /* This page holds a memory+swap charge */
9};
10
11struct pglist_data;
12
13#ifdef CONFIG_MEMCG
14struct mem_cgroup;
15
16/*
17 * Page Cgroup can be considered as an extended mem_map.
18 * A page_cgroup page is associated with every page descriptor. The
19 * page_cgroup helps us identify information about the cgroup
20 * All page cgroups are allocated at boot or memory hotplug event,
21 * then the page cgroup for pfn always exists.
22 */
23struct page_cgroup {
24 unsigned long flags;
25 struct mem_cgroup *mem_cgroup;
26};
27
28extern void pgdat_page_cgroup_init(struct pglist_data *pgdat);
29
30#ifdef CONFIG_SPARSEMEM
31static inline void page_cgroup_init_flatmem(void)
32{
33}
34extern void page_cgroup_init(void);
35#else
36extern void page_cgroup_init_flatmem(void);
37static inline void page_cgroup_init(void)
38{
39}
40#endif
41
42struct page_cgroup *lookup_page_cgroup(struct page *page);
43
44static inline int PageCgroupUsed(struct page_cgroup *pc)
45{
46 return !!(pc->flags & PCG_USED);
47}
48#else /* !CONFIG_MEMCG */
49struct page_cgroup;
50
51static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
52{
53}
54
55static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
56{
57 return NULL;
58}
59
60static inline void page_cgroup_init(void)
61{
62}
63
64static inline void page_cgroup_init_flatmem(void)
65{
66}
67#endif /* CONFIG_MEMCG */
68
69#include <linux/swap.h>
70
71#ifdef CONFIG_MEMCG_SWAP
72extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
73 unsigned short old, unsigned short new);
74extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
75extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
76extern int swap_cgroup_swapon(int type, unsigned long max_pages);
77extern void swap_cgroup_swapoff(int type);
78#else
79
80static inline
81unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
82{
83 return 0;
84}
85
86static inline
87unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
88{
89 return 0;
90}
91
92static inline int
93swap_cgroup_swapon(int type, unsigned long max_pages)
94{
95 return 0;
96}
97
98static inline void swap_cgroup_swapoff(int type)
99{
100 return;
101}
102
103#endif /* CONFIG_MEMCG_SWAP */
104
105#endif /* __LINUX_PAGE_CGROUP_H */
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
new file mode 100644
index 000000000000..955421575d16
--- /dev/null
+++ b/include/linux/page_counter.h
@@ -0,0 +1,51 @@
1#ifndef _LINUX_PAGE_COUNTER_H
2#define _LINUX_PAGE_COUNTER_H
3
4#include <linux/atomic.h>
5#include <linux/kernel.h>
6#include <asm/page.h>
7
8struct page_counter {
9 atomic_long_t count;
10 unsigned long limit;
11 struct page_counter *parent;
12
13 /* legacy */
14 unsigned long watermark;
15 unsigned long failcnt;
16};
17
18#if BITS_PER_LONG == 32
19#define PAGE_COUNTER_MAX LONG_MAX
20#else
21#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
22#endif
23
24static inline void page_counter_init(struct page_counter *counter,
25 struct page_counter *parent)
26{
27 atomic_long_set(&counter->count, 0);
28 counter->limit = PAGE_COUNTER_MAX;
29 counter->parent = parent;
30}
31
32static inline unsigned long page_counter_read(struct page_counter *counter)
33{
34 return atomic_long_read(&counter->count);
35}
36
37void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
38void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
39int page_counter_try_charge(struct page_counter *counter,
40 unsigned long nr_pages,
41 struct page_counter **fail);
42void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
43int page_counter_limit(struct page_counter *counter, unsigned long limit);
44int page_counter_memparse(const char *buf, unsigned long *nr_pages);
45
46static inline void page_counter_reset_watermark(struct page_counter *counter)
47{
48 counter->watermark = page_counter_read(counter);
49}
50
51#endif /* _LINUX_PAGE_COUNTER_H */
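
A minimal sketch of the intended usage, assuming page_counter_try_charge() returns 0 on success and non-zero when a limit in the hierarchy would be exceeded, with *fail pointing at the counter that refused the charge; names and the limit value are illustrative:

	static int charge_example(struct page_counter *parent,
				  struct page_counter *child,
				  unsigned long nr_pages)
	{
		struct page_counter *fail;

		page_counter_init(parent, NULL);
		page_counter_init(child, parent);
		page_counter_limit(parent, 1024);	/* cap the hierarchy at 1024 pages */

		if (page_counter_try_charge(child, nr_pages, &fail))
			return -ENOMEM;	/* assumed: non-zero means a limit was hit */

		/* ... nr_pages are now accounted ... */
		page_counter_uncharge(child, nr_pages);
		return 0;
	}
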
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 51ce60c35f4c..530b249f7ea4 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -147,28 +147,42 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
147} 147}
148 148
149/** 149/**
150 * percpu_ref_get - increment a percpu refcount 150 * percpu_ref_get_many - increment a percpu refcount
151 * @ref: percpu_ref to get 151 * @ref: percpu_ref to get
152 * @nr: number of references to get
152 * 153 *
153 * Analagous to atomic_long_inc(). 154 * Analogous to atomic_long_add().
154 * 155 *
155 * This function is safe to call as long as @ref is between init and exit. 156 * This function is safe to call as long as @ref is between init and exit.
156 */ 157 */
157static inline void percpu_ref_get(struct percpu_ref *ref) 158static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr)
158{ 159{
159 unsigned long __percpu *percpu_count; 160 unsigned long __percpu *percpu_count;
160 161
161 rcu_read_lock_sched(); 162 rcu_read_lock_sched();
162 163
163 if (__ref_is_percpu(ref, &percpu_count)) 164 if (__ref_is_percpu(ref, &percpu_count))
164 this_cpu_inc(*percpu_count); 165 this_cpu_add(*percpu_count, nr);
165 else 166 else
166 atomic_long_inc(&ref->count); 167 atomic_long_add(nr, &ref->count);
167 168
168 rcu_read_unlock_sched(); 169 rcu_read_unlock_sched();
169} 170}
170 171
171/** 172/**
173 * percpu_ref_get - increment a percpu refcount
174 * @ref: percpu_ref to get
175 *
176 * Analagous to atomic_long_inc().
176 * Analogous to atomic_long_inc().
177 *
178 * This function is safe to call as long as @ref is between init and exit.
179 */
180static inline void percpu_ref_get(struct percpu_ref *ref)
181{
182 percpu_ref_get_many(ref, 1);
183}
184
185/**
172 * percpu_ref_tryget - try to increment a percpu refcount 186 * percpu_ref_tryget - try to increment a percpu refcount
173 * @ref: percpu_ref to try-get 187 * @ref: percpu_ref to try-get
174 * 188 *
@@ -231,29 +245,44 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
231} 245}
232 246
233/** 247/**
234 * percpu_ref_put - decrement a percpu refcount 248 * percpu_ref_put_many - decrement a percpu refcount
235 * @ref: percpu_ref to put 249 * @ref: percpu_ref to put
250 * @nr: number of references to put
236 * 251 *
237 * Decrement the refcount, and if 0, call the release function (which was passed 252 * Decrement the refcount, and if 0, call the release function (which was passed
238 * to percpu_ref_init()) 253 * to percpu_ref_init())
239 * 254 *
240 * This function is safe to call as long as @ref is between init and exit. 255 * This function is safe to call as long as @ref is between init and exit.
241 */ 256 */
242static inline void percpu_ref_put(struct percpu_ref *ref) 257static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
243{ 258{
244 unsigned long __percpu *percpu_count; 259 unsigned long __percpu *percpu_count;
245 260
246 rcu_read_lock_sched(); 261 rcu_read_lock_sched();
247 262
248 if (__ref_is_percpu(ref, &percpu_count)) 263 if (__ref_is_percpu(ref, &percpu_count))
249 this_cpu_dec(*percpu_count); 264 this_cpu_sub(*percpu_count, nr);
250 else if (unlikely(atomic_long_dec_and_test(&ref->count))) 265 else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
251 ref->release(ref); 266 ref->release(ref);
252 267
253 rcu_read_unlock_sched(); 268 rcu_read_unlock_sched();
254} 269}
255 270
256/** 271/**
272 * percpu_ref_put - decrement a percpu refcount
273 * @ref: percpu_ref to put
274 *
275 * Decrement the refcount, and if 0, call the release function (which was passed
276 * to percpu_ref_init())
277 *
278 * This function is safe to call as long as @ref is between init and exit.
279 */
280static inline void percpu_ref_put(struct percpu_ref *ref)
281{
282 percpu_ref_put_many(ref, 1);
283}
284
285/**
257 * percpu_ref_is_zero - test whether a percpu refcount reached zero 286 * percpu_ref_is_zero - test whether a percpu refcount reached zero
258 * @ref: percpu_ref to test 287 * @ref: percpu_ref to test
259 * 288 *
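
percpu_ref_get() and percpu_ref_put() are now thin wrappers around the batched variants, so a caller that pins a whole batch can take and drop the references in one operation, for example:

	/* take nr references at once instead of calling percpu_ref_get() in a loop */
	percpu_ref_get_many(ref, nr);
	/* ... hand the objects out ... */
	percpu_ref_put_many(ref, nr);	/* release() runs if the count drops to zero */
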
diff --git a/include/linux/printk.h b/include/linux/printk.h
index d78125f73ac4..3dd489f2dedc 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -118,7 +118,6 @@ int no_printk(const char *fmt, ...)
118#ifdef CONFIG_EARLY_PRINTK 118#ifdef CONFIG_EARLY_PRINTK
119extern asmlinkage __printf(1, 2) 119extern asmlinkage __printf(1, 2)
120void early_printk(const char *fmt, ...); 120void early_printk(const char *fmt, ...);
121void early_vprintk(const char *fmt, va_list ap);
122#else 121#else
123static inline __printf(1, 2) __cold 122static inline __printf(1, 2) __cold
124void early_printk(const char *s, ...) { } 123void early_printk(const char *s, ...) { }
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index cc79eff4a1ad..987a73a40ef8 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -52,7 +52,7 @@ extern void ptrace_notify(int exit_code);
52extern void __ptrace_link(struct task_struct *child, 52extern void __ptrace_link(struct task_struct *child,
53 struct task_struct *new_parent); 53 struct task_struct *new_parent);
54extern void __ptrace_unlink(struct task_struct *child); 54extern void __ptrace_unlink(struct task_struct *child);
55extern void exit_ptrace(struct task_struct *tracer); 55extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
56#define PTRACE_MODE_READ 0x01 56#define PTRACE_MODE_READ 0x01
57#define PTRACE_MODE_ATTACH 0x02 57#define PTRACE_MODE_ATTACH 0x02
58#define PTRACE_MODE_NOAUDIT 0x04 58#define PTRACE_MODE_NOAUDIT 0x04
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
deleted file mode 100644
index 56b7bc32db4f..000000000000
--- a/include/linux/res_counter.h
+++ /dev/null
@@ -1,223 +0,0 @@
1#ifndef __RES_COUNTER_H__
2#define __RES_COUNTER_H__
3
4/*
5 * Resource Counters
6 * Contain common data types and routines for resource accounting
7 *
8 * Copyright 2007 OpenVZ SWsoft Inc
9 *
10 * Author: Pavel Emelianov <xemul@openvz.org>
11 *
12 * See Documentation/cgroups/resource_counter.txt for more
13 * info about what this counter is.
14 */
15
16#include <linux/spinlock.h>
17#include <linux/errno.h>
18
19/*
20 * The core object. the cgroup that wishes to account for some
21 * resource may include this counter into its structures and use
22 * the helpers described beyond
23 */
24
25struct res_counter {
26 /*
27 * the current resource consumption level
28 */
29 unsigned long long usage;
30 /*
31 * the maximal value of the usage from the counter creation
32 */
33 unsigned long long max_usage;
34 /*
35 * the limit that usage cannot exceed
36 */
37 unsigned long long limit;
38 /*
39 * the limit that usage can be exceed
40 */
41 unsigned long long soft_limit;
42 /*
43 * the number of unsuccessful attempts to consume the resource
44 */
45 unsigned long long failcnt;
46 /*
47 * the lock to protect all of the above.
48 * the routines below consider this to be IRQ-safe
49 */
50 spinlock_t lock;
51 /*
52 * Parent counter, used for hierarchial resource accounting
53 */
54 struct res_counter *parent;
55};
56
57#define RES_COUNTER_MAX ULLONG_MAX
58
59/**
60 * Helpers to interact with userspace
61 * res_counter_read_u64() - returns the value of the specified member.
62 * res_counter_read/_write - put/get the specified fields from the
63 * res_counter struct to/from the user
64 *
65 * @counter: the counter in question
66 * @member: the field to work with (see RES_xxx below)
67 * @buf: the buffer to opeate on,...
68 * @nbytes: its size...
69 * @pos: and the offset.
70 */
71
72u64 res_counter_read_u64(struct res_counter *counter, int member);
73
74ssize_t res_counter_read(struct res_counter *counter, int member,
75 const char __user *buf, size_t nbytes, loff_t *pos,
76 int (*read_strategy)(unsigned long long val, char *s));
77
78int res_counter_memparse_write_strategy(const char *buf,
79 unsigned long long *res);
80
81/*
82 * the field descriptors. one for each member of res_counter
83 */
84
85enum {
86 RES_USAGE,
87 RES_MAX_USAGE,
88 RES_LIMIT,
89 RES_FAILCNT,
90 RES_SOFT_LIMIT,
91};
92
93/*
94 * helpers for accounting
95 */
96
97void res_counter_init(struct res_counter *counter, struct res_counter *parent);
98
99/*
100 * charge - try to consume more resource.
101 *
102 * @counter: the counter
103 * @val: the amount of the resource. each controller defines its own
104 * units, e.g. numbers, bytes, Kbytes, etc
105 *
106 * returns 0 on success and <0 if the counter->usage will exceed the
107 * counter->limit
108 *
109 * charge_nofail works the same, except that it charges the resource
110 * counter unconditionally, and returns < 0 if the after the current
111 * charge we are over limit.
112 */
113
114int __must_check res_counter_charge(struct res_counter *counter,
115 unsigned long val, struct res_counter **limit_fail_at);
116int res_counter_charge_nofail(struct res_counter *counter,
117 unsigned long val, struct res_counter **limit_fail_at);
118
119/*
120 * uncharge - tell that some portion of the resource is released
121 *
122 * @counter: the counter
123 * @val: the amount of the resource
124 *
125 * these calls check for usage underflow and show a warning on the console
126 *
127 * returns the total charges still present in @counter.
128 */
129
130u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
131
132u64 res_counter_uncharge_until(struct res_counter *counter,
133 struct res_counter *top,
134 unsigned long val);
135/**
136 * res_counter_margin - calculate chargeable space of a counter
137 * @cnt: the counter
138 *
139 * Returns the difference between the hard limit and the current usage
140 * of resource counter @cnt.
141 */
142static inline unsigned long long res_counter_margin(struct res_counter *cnt)
143{
144 unsigned long long margin;
145 unsigned long flags;
146
147 spin_lock_irqsave(&cnt->lock, flags);
148 if (cnt->limit > cnt->usage)
149 margin = cnt->limit - cnt->usage;
150 else
151 margin = 0;
152 spin_unlock_irqrestore(&cnt->lock, flags);
153 return margin;
154}
155
156/**
157 * Get the difference between the usage and the soft limit
158 * @cnt: The counter
159 *
160 * Returns 0 if usage is less than or equal to soft limit
161 * The difference between usage and soft limit, otherwise.
162 */
163static inline unsigned long long
164res_counter_soft_limit_excess(struct res_counter *cnt)
165{
166 unsigned long long excess;
167 unsigned long flags;
168
169 spin_lock_irqsave(&cnt->lock, flags);
170 if (cnt->usage <= cnt->soft_limit)
171 excess = 0;
172 else
173 excess = cnt->usage - cnt->soft_limit;
174 spin_unlock_irqrestore(&cnt->lock, flags);
175 return excess;
176}
177
178static inline void res_counter_reset_max(struct res_counter *cnt)
179{
180 unsigned long flags;
181
182 spin_lock_irqsave(&cnt->lock, flags);
183 cnt->max_usage = cnt->usage;
184 spin_unlock_irqrestore(&cnt->lock, flags);
185}
186
187static inline void res_counter_reset_failcnt(struct res_counter *cnt)
188{
189 unsigned long flags;
190
191 spin_lock_irqsave(&cnt->lock, flags);
192 cnt->failcnt = 0;
193 spin_unlock_irqrestore(&cnt->lock, flags);
194}
195
196static inline int res_counter_set_limit(struct res_counter *cnt,
197 unsigned long long limit)
198{
199 unsigned long flags;
200 int ret = -EBUSY;
201
202 spin_lock_irqsave(&cnt->lock, flags);
203 if (cnt->usage <= limit) {
204 cnt->limit = limit;
205 ret = 0;
206 }
207 spin_unlock_irqrestore(&cnt->lock, flags);
208 return ret;
209}
210
211static inline int
212res_counter_set_soft_limit(struct res_counter *cnt,
213 unsigned long long soft_limit)
214{
215 unsigned long flags;
216
217 spin_lock_irqsave(&cnt->lock, flags);
218 cnt->soft_limit = soft_limit;
219 spin_unlock_irqrestore(&cnt->lock, flags);
220 return 0;
221}
222
223#endif
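
For call sites converted away from this header, the rough correspondence to the page_counter API added above, as suggested by the conversions elsewhere in this series, is the following (res_counter accounted bytes, page_counter accounts pages):

	res_counter_charge(cnt, nr << PAGE_SHIFT, &fail)     ->  page_counter_try_charge(cnt, nr, &fail)
	res_counter_uncharge(cnt, nr << PAGE_SHIFT)          ->  page_counter_uncharge(cnt, nr)
	res_counter_read_u64(cnt, RES_USAGE) >> PAGE_SHIFT   ->  page_counter_read(cnt)
	res_counter_set_limit(cnt, limit)                    ->  page_counter_limit(cnt, limit)
	res_counter_reset_max(cnt)                           ->  page_counter_reset_watermark(cnt)
	res_counter_memparse_write_strategy(buf, &val)       ->  page_counter_memparse(buf, &nr_pages)
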
diff --git a/include/linux/slab.h b/include/linux/slab.h
index c265bec6a57d..8a2457d42fc8 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -513,10 +513,6 @@ struct memcg_cache_params {
513 513
514int memcg_update_all_caches(int num_memcgs); 514int memcg_update_all_caches(int num_memcgs);
515 515
516struct seq_file;
517int cache_show(struct kmem_cache *s, struct seq_file *m);
518void print_slabinfo_header(struct seq_file *m);
519
520/** 516/**
521 * kmalloc_array - allocate memory for an array. 517 * kmalloc_array - allocate memory for an array.
522 * @n: number of elements. 518 * @n: number of elements.
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
new file mode 100644
index 000000000000..145306bdc92f
--- /dev/null
+++ b/include/linux/swap_cgroup.h
@@ -0,0 +1,42 @@
1#ifndef __LINUX_SWAP_CGROUP_H
2#define __LINUX_SWAP_CGROUP_H
3
4#include <linux/swap.h>
5
6#ifdef CONFIG_MEMCG_SWAP
7
8extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
9 unsigned short old, unsigned short new);
10extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
11extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
12extern int swap_cgroup_swapon(int type, unsigned long max_pages);
13extern void swap_cgroup_swapoff(int type);
14
15#else
16
17static inline
18unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
19{
20 return 0;
21}
22
23static inline
24unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
25{
26 return 0;
27}
28
29static inline int
30swap_cgroup_swapon(int type, unsigned long max_pages)
31{
32 return 0;
33}
34
35static inline void swap_cgroup_swapoff(int type)
36{
37 return;
38}
39
40#endif /* CONFIG_MEMCG_SWAP */
41
42#endif /* __LINUX_SWAP_CGROUP_H */
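
These declarations move here unchanged from the removed page_cgroup.h; a typical charge path uses them along the lines of the sketch below, assuming swap_cgroup_record() returns the id previously stored for the entry (type, maxpages, entry and memcg_id are illustrative):

	unsigned short old, id;

	swap_cgroup_swapon(type, maxpages);		/* swapon: allocate the per-type id map */
	old = swap_cgroup_record(entry, memcg_id);	/* swap-out: store the owning memcg id */
	id  = lookup_swap_cgroup_id(entry);		/* swap-in/uncharge: look the owner up again */
	swap_cgroup_swapoff(type);			/* swapoff: free the map */
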
diff --git a/include/net/sock.h b/include/net/sock.h
index e6f235ebf6c9..7ff44e062a38 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,8 +54,8 @@
54#include <linux/security.h> 54#include <linux/security.h>
55#include <linux/slab.h> 55#include <linux/slab.h>
56#include <linux/uaccess.h> 56#include <linux/uaccess.h>
57#include <linux/page_counter.h>
57#include <linux/memcontrol.h> 58#include <linux/memcontrol.h>
58#include <linux/res_counter.h>
59#include <linux/static_key.h> 59#include <linux/static_key.h>
60#include <linux/aio.h> 60#include <linux/aio.h>
61#include <linux/sched.h> 61#include <linux/sched.h>
@@ -1062,7 +1062,7 @@ enum cg_proto_flags {
1062}; 1062};
1063 1063
1064struct cg_proto { 1064struct cg_proto {
1065 struct res_counter memory_allocated; /* Current allocated memory. */ 1065 struct page_counter memory_allocated; /* Current allocated memory. */
1066 struct percpu_counter sockets_allocated; /* Current number of sockets. */ 1066 struct percpu_counter sockets_allocated; /* Current number of sockets. */
1067 int memory_pressure; 1067 int memory_pressure;
1068 long sysctl_mem[3]; 1068 long sysctl_mem[3];
@@ -1214,34 +1214,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
1214 unsigned long amt, 1214 unsigned long amt,
1215 int *parent_status) 1215 int *parent_status)
1216{ 1216{
1217 struct res_counter *fail; 1217 page_counter_charge(&prot->memory_allocated, amt);
1218 int ret;
1219 1218
1220 ret = res_counter_charge_nofail(&prot->memory_allocated, 1219 if (page_counter_read(&prot->memory_allocated) >
1221 amt << PAGE_SHIFT, &fail); 1220 prot->memory_allocated.limit)
1222 if (ret < 0)
1223 *parent_status = OVER_LIMIT; 1221 *parent_status = OVER_LIMIT;
1224} 1222}
1225 1223
1226static inline void memcg_memory_allocated_sub(struct cg_proto *prot, 1224static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
1227 unsigned long amt) 1225 unsigned long amt)
1228{ 1226{
1229 res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT); 1227 page_counter_uncharge(&prot->memory_allocated, amt);
1230}
1231
1232static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
1233{
1234 u64 ret;
1235 ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
1236 return ret >> PAGE_SHIFT;
1237} 1228}
1238 1229
1239static inline long 1230static inline long
1240sk_memory_allocated(const struct sock *sk) 1231sk_memory_allocated(const struct sock *sk)
1241{ 1232{
1242 struct proto *prot = sk->sk_prot; 1233 struct proto *prot = sk->sk_prot;
1234
1243 if (mem_cgroup_sockets_enabled && sk->sk_cgrp) 1235 if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
1244 return memcg_memory_allocated_read(sk->sk_cgrp); 1236 return page_counter_read(&sk->sk_cgrp->memory_allocated);
1245 1237
1246 return atomic_long_read(prot->memory_allocated); 1238 return atomic_long_read(prot->memory_allocated);
1247} 1239}
@@ -1255,7 +1247,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
1255 memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); 1247 memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
1256 /* update the root cgroup regardless */ 1248 /* update the root cgroup regardless */
1257 atomic_long_add_return(amt, prot->memory_allocated); 1249 atomic_long_add_return(amt, prot->memory_allocated);
1258 return memcg_memory_allocated_read(sk->sk_cgrp); 1250 return page_counter_read(&sk->sk_cgrp->memory_allocated);
1259 } 1251 }
1260 1252
1261 return atomic_long_add_return(amt, prot->memory_allocated); 1253 return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 43aaba1cc037..0956373b56db 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
153 KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */ 153 KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
154 KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */ 154 KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
155 KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */ 155 KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
156 KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */
156}; 157};
157 158
158 159
diff --git a/init/Kconfig b/init/Kconfig
index 903505e66d1d..9afb971497f4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -893,14 +893,6 @@ config ARCH_SUPPORTS_INT128
893config ARCH_WANT_NUMA_VARIABLE_LOCALITY 893config ARCH_WANT_NUMA_VARIABLE_LOCALITY
894 bool 894 bool
895 895
896config NUMA_BALANCING_DEFAULT_ENABLED
897 bool "Automatically enable NUMA aware memory/task placement"
898 default y
899 depends on NUMA_BALANCING
900 help
901 If set, automatic NUMA balancing will be enabled if running on a NUMA
902 machine.
903
904config NUMA_BALANCING 896config NUMA_BALANCING
905 bool "Memory placement aware NUMA scheduler" 897 bool "Memory placement aware NUMA scheduler"
906 depends on ARCH_SUPPORTS_NUMA_BALANCING 898 depends on ARCH_SUPPORTS_NUMA_BALANCING
@@ -913,6 +905,14 @@ config NUMA_BALANCING
913 905
914 This system will be inactive on UMA systems. 906 This system will be inactive on UMA systems.
915 907
908config NUMA_BALANCING_DEFAULT_ENABLED
909 bool "Automatically enable NUMA aware memory/task placement"
910 default y
911 depends on NUMA_BALANCING
912 help
913 If set, automatic NUMA balancing will be enabled if running on a NUMA
914 machine.
915
916menuconfig CGROUPS 916menuconfig CGROUPS
917 boolean "Control Group support" 917 boolean "Control Group support"
918 select KERNFS 918 select KERNFS
@@ -972,32 +972,17 @@ config CGROUP_CPUACCT
972 Provides a simple Resource Controller for monitoring the 972 Provides a simple Resource Controller for monitoring the
973 total CPU consumed by the tasks in a cgroup. 973 total CPU consumed by the tasks in a cgroup.
974 974
975config RESOURCE_COUNTERS 975config PAGE_COUNTER
976 bool "Resource counters" 976 bool
977 help
978 This option enables controller independent resource accounting
979 infrastructure that works with cgroups.
980 977
981config MEMCG 978config MEMCG
982 bool "Memory Resource Controller for Control Groups" 979 bool "Memory Resource Controller for Control Groups"
983 depends on RESOURCE_COUNTERS 980 select PAGE_COUNTER
984 select EVENTFD 981 select EVENTFD
985 help 982 help
986 Provides a memory resource controller that manages both anonymous 983 Provides a memory resource controller that manages both anonymous
987 memory and page cache. (See Documentation/cgroups/memory.txt) 984 memory and page cache. (See Documentation/cgroups/memory.txt)
988 985
989 Note that setting this option increases fixed memory overhead
990 associated with each page of memory in the system. By this,
991 8(16)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory
992 usage tracking struct at boot. Total amount of this is printed out
993 at boot.
994
995 Only enable when you're ok with these trade offs and really
996 sure you need the memory resource controller. Even when you enable
997 this, you can set "cgroup_disable=memory" at your boot option to
998 disable memory resource controller and you can avoid overheads.
999 (and lose benefits of memory resource controller)
1000
1001config MEMCG_SWAP 986config MEMCG_SWAP
1002 bool "Memory Resource Controller Swap Extension" 987 bool "Memory Resource Controller Swap Extension"
1003 depends on MEMCG && SWAP 988 depends on MEMCG && SWAP
@@ -1048,7 +1033,8 @@ config MEMCG_KMEM
1048 1033
1049config CGROUP_HUGETLB 1034config CGROUP_HUGETLB
1050 bool "HugeTLB Resource Controller for Control Groups" 1035 bool "HugeTLB Resource Controller for Control Groups"
1051 depends on RESOURCE_COUNTERS && HUGETLB_PAGE 1036 depends on HUGETLB_PAGE
1037 select PAGE_COUNTER
1052 default n 1038 default n
1053 help 1039 help
1054 Provides a cgroup Resource Controller for HugeTLB pages. 1040 Provides a cgroup Resource Controller for HugeTLB pages.
@@ -1294,6 +1280,22 @@ source "usr/Kconfig"
1294 1280
1295endif 1281endif
1296 1282
1283config INIT_FALLBACK
1284 bool "Fall back to defaults if init= parameter is bad"
1285 default y
1286 help
1287 If enabled, the kernel will try the default init binaries if an
1288 explicit request from the init= parameter fails.
1289
1290 This can have unexpected effects. For example, booting
1291 with init=/sbin/kiosk_app will run /sbin/init or even /bin/sh
1292 if /sbin/kiosk_app cannot be executed.
1293
1294 The default value of Y is consistent with historical behavior.
1295 Selecting N is likely to be more appropriate for most uses,
1296 especially on kiosks and on kernels that are intended to be
1297 run under the control of a script.
1298
1297config CC_OPTIMIZE_FOR_SIZE 1299config CC_OPTIMIZE_FOR_SIZE
1298 bool "Optimize for size" 1300 bool "Optimize for size"
1299 help 1301 help
diff --git a/init/main.c b/init/main.c
index 321d0ceb26d3..ca380ec685de 100644
--- a/init/main.c
+++ b/init/main.c
@@ -51,7 +51,6 @@
51#include <linux/mempolicy.h> 51#include <linux/mempolicy.h>
52#include <linux/key.h> 52#include <linux/key.h>
53#include <linux/buffer_head.h> 53#include <linux/buffer_head.h>
54#include <linux/page_cgroup.h>
55#include <linux/debug_locks.h> 54#include <linux/debug_locks.h>
56#include <linux/debugobjects.h> 55#include <linux/debugobjects.h>
57#include <linux/lockdep.h> 56#include <linux/lockdep.h>
@@ -485,11 +484,6 @@ void __init __weak thread_info_cache_init(void)
485 */ 484 */
486static void __init mm_init(void) 485static void __init mm_init(void)
487{ 486{
488 /*
489 * page_cgroup requires contiguous pages,
490 * bigger than MAX_ORDER unless SPARSEMEM.
491 */
492 page_cgroup_init_flatmem();
493 mem_init(); 487 mem_init();
494 kmem_cache_init(); 488 kmem_cache_init();
495 percpu_init_late(); 489 percpu_init_late();
@@ -627,7 +621,6 @@ asmlinkage __visible void __init start_kernel(void)
627 initrd_start = 0; 621 initrd_start = 0;
628 } 622 }
629#endif 623#endif
630 page_cgroup_init();
631 debug_objects_mem_init(); 624 debug_objects_mem_init();
632 kmemleak_init(); 625 kmemleak_init();
633 setup_per_cpu_pageset(); 626 setup_per_cpu_pageset();
@@ -959,8 +952,13 @@ static int __ref kernel_init(void *unused)
959 ret = run_init_process(execute_command); 952 ret = run_init_process(execute_command);
960 if (!ret) 953 if (!ret)
961 return 0; 954 return 0;
955#ifndef CONFIG_INIT_FALLBACK
956 panic("Requested init %s failed (error %d).",
957 execute_command, ret);
958#else
962 pr_err("Failed to execute %s (error %d). Attempting defaults...\n", 959 pr_err("Failed to execute %s (error %d). Attempting defaults...\n",
963 execute_command, ret); 960 execute_command, ret);
961#endif
964 } 962 }
965 if (!try_to_run_init_process("/sbin/init") || 963 if (!try_to_run_init_process("/sbin/init") ||
966 !try_to_run_init_process("/etc/init") || 964 !try_to_run_init_process("/etc/init") ||
diff --git a/kernel/Makefile b/kernel/Makefile
index 17ea6d4a9a24..a59481a3fa6c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,7 +57,6 @@ obj-$(CONFIG_UTS_NS) += utsname.o
57obj-$(CONFIG_USER_NS) += user_namespace.o 57obj-$(CONFIG_USER_NS) += user_namespace.o
58obj-$(CONFIG_PID_NS) += pid_namespace.o 58obj-$(CONFIG_PID_NS) += pid_namespace.o
59obj-$(CONFIG_IKCONFIG) += configs.o 59obj-$(CONFIG_IKCONFIG) += configs.o
60obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
61obj-$(CONFIG_SMP) += stop_machine.o 60obj-$(CONFIG_SMP) += stop_machine.o
62obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o 61obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
63obj-$(CONFIG_AUDIT) += audit.o auditfilter.o 62obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/exit.c b/kernel/exit.c
index 232c4bc8bcc9..8714e5ded8b4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -118,13 +118,10 @@ static void __exit_signal(struct task_struct *tsk)
118 } 118 }
119 119
120 /* 120 /*
121 * Accumulate here the counters for all threads but the group leader 121 * Accumulate here the counters for all threads as they die. We could
122 * as they die, so they can be added into the process-wide totals 122 * skip the group leader because it is the last user of signal_struct,
123 * when those are taken. The group leader stays around as a zombie as 123 * but we want to avoid the race with thread_group_cputime() which can
124 * long as there are other threads. When it gets reaped, the exit.c 124 * see the empty ->thread_head list.
125 * code will add its counts into these totals. We won't ever get here
126 * for the group leader, since it will have been the last reference on
127 * the signal_struct.
128 */ 125 */
129 task_cputime(tsk, &utime, &stime); 126 task_cputime(tsk, &utime, &stime);
130 write_seqlock(&sig->stats_lock); 127 write_seqlock(&sig->stats_lock);
@@ -462,6 +459,44 @@ static void exit_mm(struct task_struct *tsk)
462 clear_thread_flag(TIF_MEMDIE); 459 clear_thread_flag(TIF_MEMDIE);
463} 460}
464 461
462static struct task_struct *find_alive_thread(struct task_struct *p)
463{
464 struct task_struct *t;
465
466 for_each_thread(p, t) {
467 if (!(t->flags & PF_EXITING))
468 return t;
469 }
470 return NULL;
471}
472
473static struct task_struct *find_child_reaper(struct task_struct *father)
474 __releases(&tasklist_lock)
475 __acquires(&tasklist_lock)
476{
477 struct pid_namespace *pid_ns = task_active_pid_ns(father);
478 struct task_struct *reaper = pid_ns->child_reaper;
479
480 if (likely(reaper != father))
481 return reaper;
482
483 reaper = find_alive_thread(father);
484 if (reaper) {
485 pid_ns->child_reaper = reaper;
486 return reaper;
487 }
488
489 write_unlock_irq(&tasklist_lock);
490 if (unlikely(pid_ns == &init_pid_ns)) {
491 panic("Attempted to kill init! exitcode=0x%08x\n",
492 father->signal->group_exit_code ?: father->exit_code);
493 }
494 zap_pid_ns_processes(pid_ns);
495 write_lock_irq(&tasklist_lock);
496
497 return father;
498}
499
465/* 500/*
466 * When we die, we re-parent all our children, and try to: 501 * When we die, we re-parent all our children, and try to:
467 * 1. give them to another thread in our thread group, if such a member exists 502 * 1. give them to another thread in our thread group, if such a member exists
@@ -469,58 +504,36 @@ static void exit_mm(struct task_struct *tsk)
469 * child_subreaper for its children (like a service manager) 504 * child_subreaper for its children (like a service manager)
470 * 3. give it to the init process (PID 1) in our pid namespace 505 * 3. give it to the init process (PID 1) in our pid namespace
471 */ 506 */
472static struct task_struct *find_new_reaper(struct task_struct *father) 507static struct task_struct *find_new_reaper(struct task_struct *father,
473 __releases(&tasklist_lock) 508 struct task_struct *child_reaper)
474 __acquires(&tasklist_lock)
475{ 509{
476 struct pid_namespace *pid_ns = task_active_pid_ns(father); 510 struct task_struct *thread, *reaper;
477 struct task_struct *thread;
478 511
479 thread = father; 512 thread = find_alive_thread(father);
480 while_each_thread(father, thread) { 513 if (thread)
481 if (thread->flags & PF_EXITING)
482 continue;
483 if (unlikely(pid_ns->child_reaper == father))
484 pid_ns->child_reaper = thread;
485 return thread; 514 return thread;
486 }
487
488 if (unlikely(pid_ns->child_reaper == father)) {
489 write_unlock_irq(&tasklist_lock);
490 if (unlikely(pid_ns == &init_pid_ns)) {
491 panic("Attempted to kill init! exitcode=0x%08x\n",
492 father->signal->group_exit_code ?:
493 father->exit_code);
494 }
495
496 zap_pid_ns_processes(pid_ns);
497 write_lock_irq(&tasklist_lock);
498 } else if (father->signal->has_child_subreaper) {
499 struct task_struct *reaper;
500 515
516 if (father->signal->has_child_subreaper) {
501 /* 517 /*
502 * Find the first ancestor marked as child_subreaper. 518 * Find the first ->is_child_subreaper ancestor in our pid_ns.
503 * Note that the code below checks same_thread_group(reaper, 519 * We start from father to ensure we can not look into another
504 * pid_ns->child_reaper). This is what we need to DTRT in a 520 * namespace, this is safe because all its threads are dead.
505 * PID namespace. However we still need the check above, see
506 * http://marc.info/?l=linux-kernel&m=131385460420380
507 */ 521 */
508 for (reaper = father->real_parent; 522 for (reaper = father;
509 reaper != &init_task; 523 !same_thread_group(reaper, child_reaper);
510 reaper = reaper->real_parent) { 524 reaper = reaper->real_parent) {
511 if (same_thread_group(reaper, pid_ns->child_reaper)) 525 /* call_usermodehelper() descendants need this check */
526 if (reaper == &init_task)
512 break; 527 break;
513 if (!reaper->signal->is_child_subreaper) 528 if (!reaper->signal->is_child_subreaper)
514 continue; 529 continue;
515 thread = reaper; 530 thread = find_alive_thread(reaper);
516 do { 531 if (thread)
517 if (!(thread->flags & PF_EXITING)) 532 return thread;
518 return reaper;
519 } while_each_thread(reaper, thread);
520 } 533 }
521 } 534 }
522 535
523 return pid_ns->child_reaper; 536 return child_reaper;
524} 537}
525 538
526/* 539/*
@@ -529,15 +542,7 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
529static void reparent_leader(struct task_struct *father, struct task_struct *p, 542static void reparent_leader(struct task_struct *father, struct task_struct *p,
530 struct list_head *dead) 543 struct list_head *dead)
531{ 544{
532 list_move_tail(&p->sibling, &p->real_parent->children); 545 if (unlikely(p->exit_state == EXIT_DEAD))
533
534 if (p->exit_state == EXIT_DEAD)
535 return;
536 /*
537 * If this is a threaded reparent there is no need to
538 * notify anyone anything has happened.
539 */
540 if (same_thread_group(p->real_parent, father))
541 return; 546 return;
542 547
543 /* We don't want people slaying init. */ 548 /* We don't want people slaying init. */
@@ -548,49 +553,53 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
548 p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) { 553 p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
549 if (do_notify_parent(p, p->exit_signal)) { 554 if (do_notify_parent(p, p->exit_signal)) {
550 p->exit_state = EXIT_DEAD; 555 p->exit_state = EXIT_DEAD;
551 list_move_tail(&p->sibling, dead); 556 list_add(&p->ptrace_entry, dead);
552 } 557 }
553 } 558 }
554 559
555 kill_orphaned_pgrp(p, father); 560 kill_orphaned_pgrp(p, father);
556} 561}
557 562
558static void forget_original_parent(struct task_struct *father) 563/*
564 * This does two things:
565 *
566 * A. Make init inherit all the child processes
567 * B. Check to see if any process groups have become orphaned
568 * as a result of our exiting, and if they have any stopped
569 * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
570 */
571static void forget_original_parent(struct task_struct *father,
572 struct list_head *dead)
559{ 573{
560 struct task_struct *p, *n, *reaper; 574 struct task_struct *p, *t, *reaper;
561 LIST_HEAD(dead_children);
562 575
563 write_lock_irq(&tasklist_lock); 576 if (unlikely(!list_empty(&father->ptraced)))
564 /* 577 exit_ptrace(father, dead);
565 * Note that exit_ptrace() and find_new_reaper() might
566 * drop tasklist_lock and reacquire it.
567 */
568 exit_ptrace(father);
569 reaper = find_new_reaper(father);
570 578
571 list_for_each_entry_safe(p, n, &father->children, sibling) { 579 /* Can drop and reacquire tasklist_lock */
572 struct task_struct *t = p; 580 reaper = find_child_reaper(father);
581 if (list_empty(&father->children))
582 return;
573 583
574 do { 584 reaper = find_new_reaper(father, reaper);
585 list_for_each_entry(p, &father->children, sibling) {
586 for_each_thread(p, t) {
575 t->real_parent = reaper; 587 t->real_parent = reaper;
576 if (t->parent == father) { 588 BUG_ON((!t->ptrace) != (t->parent == father));
577 BUG_ON(t->ptrace); 589 if (likely(!t->ptrace))
578 t->parent = t->real_parent; 590 t->parent = t->real_parent;
579 }
580 if (t->pdeath_signal) 591 if (t->pdeath_signal)
581 group_send_sig_info(t->pdeath_signal, 592 group_send_sig_info(t->pdeath_signal,
582 SEND_SIG_NOINFO, t); 593 SEND_SIG_NOINFO, t);
583 } while_each_thread(p, t); 594 }
584 reparent_leader(father, p, &dead_children); 595 /*
585 } 596 * If this is a threaded reparent there is no need to
586 write_unlock_irq(&tasklist_lock); 597 * notify anyone anything has happened.
587 598 */
588 BUG_ON(!list_empty(&father->children)); 599 if (!same_thread_group(reaper, father))
589 600 reparent_leader(father, p, dead);
590 list_for_each_entry_safe(p, n, &dead_children, sibling) {
591 list_del_init(&p->sibling);
592 release_task(p);
593 } 601 }
602 list_splice_tail_init(&father->children, &reaper->children);
594} 603}
595 604
596/* 605/*
@@ -600,18 +609,12 @@ static void forget_original_parent(struct task_struct *father)
600static void exit_notify(struct task_struct *tsk, int group_dead) 609static void exit_notify(struct task_struct *tsk, int group_dead)
601{ 610{
602 bool autoreap; 611 bool autoreap;
603 612 struct task_struct *p, *n;
604 /* 613 LIST_HEAD(dead);
605 * This does two things:
606 *
607 * A. Make init inherit all the child processes
608 * B. Check to see if any process groups have become orphaned
609 * as a result of our exiting, and if they have any stopped
610 * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
611 */
612 forget_original_parent(tsk);
613 614
614 write_lock_irq(&tasklist_lock); 615 write_lock_irq(&tasklist_lock);
616 forget_original_parent(tsk, &dead);
617
615 if (group_dead) 618 if (group_dead)
616 kill_orphaned_pgrp(tsk->group_leader, NULL); 619 kill_orphaned_pgrp(tsk->group_leader, NULL);
617 620
@@ -629,15 +632,18 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
629 } 632 }
630 633
631 tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; 634 tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
635 if (tsk->exit_state == EXIT_DEAD)
636 list_add(&tsk->ptrace_entry, &dead);
632 637
633 /* mt-exec, de_thread() is waiting for group leader */ 638 /* mt-exec, de_thread() is waiting for group leader */
634 if (unlikely(tsk->signal->notify_count < 0)) 639 if (unlikely(tsk->signal->notify_count < 0))
635 wake_up_process(tsk->signal->group_exit_task); 640 wake_up_process(tsk->signal->group_exit_task);
636 write_unlock_irq(&tasklist_lock); 641 write_unlock_irq(&tasklist_lock);
637 642
638 /* If the process is dead, release it - nobody will wait for it */ 643 list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
639 if (autoreap) 644 list_del_init(&p->ptrace_entry);
640 release_task(tsk); 645 release_task(p);
646 }
641} 647}
642 648
643#ifdef CONFIG_DEBUG_STACK_USAGE 649#ifdef CONFIG_DEBUG_STACK_USAGE
@@ -982,8 +988,7 @@ static int wait_noreap_copyout(struct wait_opts *wo, struct task_struct *p,
982 */ 988 */
983static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) 989static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
984{ 990{
985 unsigned long state; 991 int state, retval, status;
986 int retval, status, traced;
987 pid_t pid = task_pid_vnr(p); 992 pid_t pid = task_pid_vnr(p);
988 uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p)); 993 uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p));
989 struct siginfo __user *infop; 994 struct siginfo __user *infop;
@@ -1008,21 +1013,25 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
1008 } 1013 }
1009 return wait_noreap_copyout(wo, p, pid, uid, why, status); 1014 return wait_noreap_copyout(wo, p, pid, uid, why, status);
1010 } 1015 }
1011
1012 traced = ptrace_reparented(p);
1013 /* 1016 /*
1014 * Move the task's state to DEAD/TRACE, only one thread can do this. 1017 * Move the task's state to DEAD/TRACE, only one thread can do this.
1015 */ 1018 */
1016 state = traced && thread_group_leader(p) ? EXIT_TRACE : EXIT_DEAD; 1019 state = (ptrace_reparented(p) && thread_group_leader(p)) ?
1020 EXIT_TRACE : EXIT_DEAD;
1017 if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE) 1021 if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
1018 return 0; 1022 return 0;
1019 /* 1023 /*
1020 * It can be ptraced but not reparented, check 1024 * We own this thread, nobody else can reap it.
1021 * thread_group_leader() to filter out sub-threads.
1022 */ 1025 */
1023 if (likely(!traced) && thread_group_leader(p)) { 1026 read_unlock(&tasklist_lock);
1024 struct signal_struct *psig; 1027 sched_annotate_sleep();
1025 struct signal_struct *sig; 1028
1029 /*
1030 * Check thread_group_leader() to exclude the traced sub-threads.
1031 */
1032 if (state == EXIT_DEAD && thread_group_leader(p)) {
1033 struct signal_struct *sig = p->signal;
1034 struct signal_struct *psig = current->signal;
1026 unsigned long maxrss; 1035 unsigned long maxrss;
1027 cputime_t tgutime, tgstime; 1036 cputime_t tgutime, tgstime;
1028 1037
@@ -1034,21 +1043,20 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
1034 * accumulate in the parent's signal_struct c* fields. 1043 * accumulate in the parent's signal_struct c* fields.
1035 * 1044 *
1036 * We don't bother to take a lock here to protect these 1045 * We don't bother to take a lock here to protect these
1037 * p->signal fields, because they are only touched by 1046 * p->signal fields because the whole thread group is dead
1038 * __exit_signal, which runs with tasklist_lock 1047 * and nobody can change them.
1039 * write-locked anyway, and so is excluded here. We do 1048 *
1040 * need to protect the access to parent->signal fields, 1049 * psig->stats_lock also protects us from our sub-theads
1041 * as other threads in the parent group can be right 1050 * which can reap other children at the same time. Until
1042 * here reaping other children at the same time. 1051 * we change k_getrusage()-like users to rely on this lock
1052 * we have to take ->siglock as well.
1043 * 1053 *
1044 * We use thread_group_cputime_adjusted() to get times for 1054 * We use thread_group_cputime_adjusted() to get times for
1045 * the thread group, which consolidates times for all threads 1055 * the thread group, which consolidates times for all threads
1046 * in the group including the group leader. 1056 * in the group including the group leader.
1047 */ 1057 */
1048 thread_group_cputime_adjusted(p, &tgutime, &tgstime); 1058 thread_group_cputime_adjusted(p, &tgutime, &tgstime);
1049 spin_lock_irq(&p->real_parent->sighand->siglock); 1059 spin_lock_irq(&current->sighand->siglock);
1050 psig = p->real_parent->signal;
1051 sig = p->signal;
1052 write_seqlock(&psig->stats_lock); 1060 write_seqlock(&psig->stats_lock);
1053 psig->cutime += tgutime + sig->cutime; 1061 psig->cutime += tgutime + sig->cutime;
1054 psig->cstime += tgstime + sig->cstime; 1062 psig->cstime += tgstime + sig->cstime;
@@ -1073,16 +1081,9 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
1073 task_io_accounting_add(&psig->ioac, &p->ioac); 1081 task_io_accounting_add(&psig->ioac, &p->ioac);
1074 task_io_accounting_add(&psig->ioac, &sig->ioac); 1082 task_io_accounting_add(&psig->ioac, &sig->ioac);
1075 write_sequnlock(&psig->stats_lock); 1083 write_sequnlock(&psig->stats_lock);
1076 spin_unlock_irq(&p->real_parent->sighand->siglock); 1084 spin_unlock_irq(&current->sighand->siglock);
1077 } 1085 }
1078 1086
1079 /*
1080 * Now we are sure this task is interesting, and no other
1081 * thread can reap it because we its state == DEAD/TRACE.
1082 */
1083 read_unlock(&tasklist_lock);
1084 sched_annotate_sleep();
1085
1086 retval = wo->wo_rusage 1087 retval = wo->wo_rusage
1087 ? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0; 1088 ? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
1088 status = (p->signal->flags & SIGNAL_GROUP_EXIT) 1089 status = (p->signal->flags & SIGNAL_GROUP_EXIT)
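
In short, reparenting now happens in two stages: find_child_reaper() handles the case where the exiting task was its pid namespace's child_reaper, promoting a live thread or, failing that, calling zap_pid_ns_processes() (or panicking for init), and find_new_reaper() then walks the real_parent chain looking for a live thread or a child_subreaper ancestor. Children that become EXIT_DEAD are collected on a local list via ->ptrace_entry and released with release_task() only after exit_notify() has dropped tasklist_lock.
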
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 80f7a6d00519..2777f40a9c7b 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -47,13 +47,6 @@ extern int max_threads;
47 47
48static struct workqueue_struct *khelper_wq; 48static struct workqueue_struct *khelper_wq;
49 49
50/*
51 * kmod_thread_locker is used for deadlock avoidance. There is no explicit
52 * locking to protect this global - it is private to the singleton khelper
53 * thread and should only ever be modified by that thread.
54 */
55static const struct task_struct *kmod_thread_locker;
56
57#define CAP_BSET (void *)1 50#define CAP_BSET (void *)1
58#define CAP_PI (void *)2 51#define CAP_PI (void *)2
59 52
@@ -223,7 +216,6 @@ static void umh_complete(struct subprocess_info *sub_info)
223static int ____call_usermodehelper(void *data) 216static int ____call_usermodehelper(void *data)
224{ 217{
225 struct subprocess_info *sub_info = data; 218 struct subprocess_info *sub_info = data;
226 int wait = sub_info->wait & ~UMH_KILLABLE;
227 struct cred *new; 219 struct cred *new;
228 int retval; 220 int retval;
229 221
@@ -267,20 +259,13 @@ static int ____call_usermodehelper(void *data)
267out: 259out:
268 sub_info->retval = retval; 260 sub_info->retval = retval;
269 /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */ 261 /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */
270 if (wait != UMH_WAIT_PROC) 262 if (!(sub_info->wait & UMH_WAIT_PROC))
271 umh_complete(sub_info); 263 umh_complete(sub_info);
272 if (!retval) 264 if (!retval)
273 return 0; 265 return 0;
274 do_exit(0); 266 do_exit(0);
275} 267}
276 268
277static int call_helper(void *data)
278{
279 /* Worker thread started blocking khelper thread. */
280 kmod_thread_locker = current;
281 return ____call_usermodehelper(data);
282}
283
284/* Keventd can't block, but this (a child) can. */ 269/* Keventd can't block, but this (a child) can. */
285static int wait_for_helper(void *data) 270static int wait_for_helper(void *data)
286{ 271{
@@ -323,21 +308,14 @@ static void __call_usermodehelper(struct work_struct *work)
323{ 308{
324 struct subprocess_info *sub_info = 309 struct subprocess_info *sub_info =
325 container_of(work, struct subprocess_info, work); 310 container_of(work, struct subprocess_info, work);
326 int wait = sub_info->wait & ~UMH_KILLABLE;
327 pid_t pid; 311 pid_t pid;
328 312
329 /* CLONE_VFORK: wait until the usermode helper has execve'd 313 if (sub_info->wait & UMH_WAIT_PROC)
330 * successfully We need the data structures to stay around
331 * until that is done. */
332 if (wait == UMH_WAIT_PROC)
333 pid = kernel_thread(wait_for_helper, sub_info, 314 pid = kernel_thread(wait_for_helper, sub_info,
334 CLONE_FS | CLONE_FILES | SIGCHLD); 315 CLONE_FS | CLONE_FILES | SIGCHLD);
335 else { 316 else
336 pid = kernel_thread(call_helper, sub_info, 317 pid = kernel_thread(____call_usermodehelper, sub_info,
337 CLONE_VFORK | SIGCHLD); 318 SIGCHLD);
338 /* Worker thread stopped blocking khelper thread. */
339 kmod_thread_locker = NULL;
340 }
341 319
342 if (pid < 0) { 320 if (pid < 0) {
343 sub_info->retval = pid; 321 sub_info->retval = pid;
@@ -571,17 +549,6 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
571 goto out; 549 goto out;
572 } 550 }
573 /* 551 /*
574 * Worker thread must not wait for khelper thread at below
575 * wait_for_completion() if the thread was created with CLONE_VFORK
576 * flag, for khelper thread is already waiting for the thread at
577 * wait_for_completion() in do_fork().
578 */
579 if (wait != UMH_NO_WAIT && current == kmod_thread_locker) {
580 retval = -EBUSY;
581 goto out;
582 }
583
584 /*
585 * Set the completion pointer only if there is a waiter. 552 * Set the completion pointer only if there is a waiter.
586 * This makes it possible to use umh_complete to free 553 * This makes it possible to use umh_complete to free
587 * the data structure in case of UMH_NO_WAIT. 554 * the data structure in case of UMH_NO_WAIT.
diff --git a/kernel/panic.c b/kernel/panic.c
index cf80672b7924..4d8d6f906dec 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -33,6 +33,7 @@ static int pause_on_oops;
33static int pause_on_oops_flag; 33static int pause_on_oops_flag;
34static DEFINE_SPINLOCK(pause_on_oops_lock); 34static DEFINE_SPINLOCK(pause_on_oops_lock);
35static bool crash_kexec_post_notifiers; 35static bool crash_kexec_post_notifiers;
36int panic_on_warn __read_mostly;
36 37
37int panic_timeout = CONFIG_PANIC_TIMEOUT; 38int panic_timeout = CONFIG_PANIC_TIMEOUT;
38EXPORT_SYMBOL_GPL(panic_timeout); 39EXPORT_SYMBOL_GPL(panic_timeout);
@@ -428,6 +429,17 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
428 if (args) 429 if (args)
429 vprintk(args->fmt, args->args); 430 vprintk(args->fmt, args->args);
430 431
432 if (panic_on_warn) {
433 /*
434 * This thread may hit another WARN() in the panic path.
435 * Resetting this prevents additional WARN() from panicking the
436 * system on this thread. Other threads are blocked by the
437 * panic_mutex in panic().
438 */
439 panic_on_warn = 0;
440 panic("panic_on_warn set ...\n");
441 }
442
431 print_modules(); 443 print_modules();
432 dump_stack(); 444 dump_stack();
433 print_oops_end_marker(); 445 print_oops_end_marker();
@@ -485,6 +497,7 @@ EXPORT_SYMBOL(__stack_chk_fail);
485 497
486core_param(panic, panic_timeout, int, 0644); 498core_param(panic, panic_timeout, int, 0644);
487core_param(pause_on_oops, pause_on_oops, int, 0644); 499core_param(pause_on_oops, pause_on_oops, int, 0644);
500core_param(panic_on_warn, panic_on_warn, int, 0644);
488 501
489static int __init setup_crash_kexec_post_notifiers(char *s) 502static int __init setup_crash_kexec_post_notifiers(char *s)
490{ 503{
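
The new panic_on_warn knob clears itself before calling panic() so that a WARN() hit on the panic path cannot recurse; it can be enabled through the panic_on_warn boot parameter registered above or the kernel.panic_on_warn sysctl added further down. A small userspace sketch (plain C, not kernel code) of the same self-clearing guard:

#include <stdio.h>
#include <stdlib.h>

static int panic_on_warn_flag = 1;

static void fake_panic(const char *msg)
{
	fprintf(stderr, "panic: %s\n", msg);
	exit(1);
}

static void fake_warn(const char *msg)
{
	fprintf(stderr, "WARNING: %s\n", msg);
	if (panic_on_warn_flag) {
		panic_on_warn_flag = 0;   /* prevent re-entry if the panic path warns again */
		fake_panic("panic_on_warn set ...");
	}
}

int main(void)
{
	fake_warn("something looked wrong");
	return 0;
}
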
diff --git a/kernel/pid.c b/kernel/pid.c
index 9b9a26698144..82430c858d69 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -341,6 +341,8 @@ out:
341 341
342out_unlock: 342out_unlock:
343 spin_unlock_irq(&pidmap_lock); 343 spin_unlock_irq(&pidmap_lock);
344 put_pid_ns(ns);
345
344out_free: 346out_free:
345 while (++i <= ns->level) 347 while (++i <= ns->level)
346 free_pidmap(pid->numbers + i); 348 free_pidmap(pid->numbers + i);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8eb761b..bc6d6a89b6e6 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -190,7 +190,11 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
190 /* Don't allow any more processes into the pid namespace */ 190 /* Don't allow any more processes into the pid namespace */
191 disable_pid_allocation(pid_ns); 191 disable_pid_allocation(pid_ns);
192 192
193 /* Ignore SIGCHLD causing any terminated children to autoreap */ 193 /*
194 * Ignore SIGCHLD causing any terminated children to autoreap.
195 * This speeds up the namespace shutdown, plus see the comment
196 * below.
197 */
194 spin_lock_irq(&me->sighand->siglock); 198 spin_lock_irq(&me->sighand->siglock);
195 me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN; 199 me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
196 spin_unlock_irq(&me->sighand->siglock); 200 spin_unlock_irq(&me->sighand->siglock);
@@ -223,15 +227,31 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
223 } 227 }
224 read_unlock(&tasklist_lock); 228 read_unlock(&tasklist_lock);
225 229
226 /* Firstly reap the EXIT_ZOMBIE children we may have. */ 230 /*
231 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
232 * sys_wait4() will also block until our children traced from the
233 * parent namespace are detached and become EXIT_DEAD.
234 */
227 do { 235 do {
228 clear_thread_flag(TIF_SIGPENDING); 236 clear_thread_flag(TIF_SIGPENDING);
229 rc = sys_wait4(-1, NULL, __WALL, NULL); 237 rc = sys_wait4(-1, NULL, __WALL, NULL);
230 } while (rc != -ECHILD); 238 } while (rc != -ECHILD);
231 239
232 /* 240 /*
233 * sys_wait4() above can't reap the TASK_DEAD children. 241 * sys_wait4() above can't reap the EXIT_DEAD children but we do not
234 * Make sure they all go away, see free_pid(). 242 * really care, we could reparent them to the global init. We could
243 * exit and reap ->child_reaper even if it is not the last thread in
244 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
245 * pid_ns can not go away until proc_kill_sb() drops the reference.
246 *
247 * But this ns can also have other tasks injected by setns()+fork().
248 * Again, ignoring the user visible semantics we do not really need
249 * to wait until they are all reaped, but they can be reparented to
250 * us and thus we need to ensure that pid->child_reaper stays valid
251 * until they all go away. See free_pid()->wake_up_process().
252 *
253 * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
254 * if reparented.
235 */ 255 */
236 for (;;) { 256 for (;;) {
237 set_current_state(TASK_UNINTERRUPTIBLE); 257 set_current_state(TASK_UNINTERRUPTIBLE);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index c8755e7e1dba..ea27c019655a 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -62,9 +62,6 @@ int console_printk[4] = {
62 CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */ 62 CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */
63}; 63};
64 64
65/* Deferred messaged from sched code are marked by this special level */
66#define SCHED_MESSAGE_LOGLEVEL -2
67
68/* 65/*
69 * Low level drivers may need that to know if they can schedule in 66 * Low level drivers may need that to know if they can schedule in
70 * their unblank() callback or not. So let's export it. 67 * their unblank() callback or not. So let's export it.
@@ -1259,7 +1256,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
1259int do_syslog(int type, char __user *buf, int len, bool from_file) 1256int do_syslog(int type, char __user *buf, int len, bool from_file)
1260{ 1257{
1261 bool clear = false; 1258 bool clear = false;
1262 static int saved_console_loglevel = -1; 1259 static int saved_console_loglevel = LOGLEVEL_DEFAULT;
1263 int error; 1260 int error;
1264 1261
1265 error = check_syslog_permissions(type, from_file); 1262 error = check_syslog_permissions(type, from_file);
@@ -1316,15 +1313,15 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
1316 break; 1313 break;
1317 /* Disable logging to console */ 1314 /* Disable logging to console */
1318 case SYSLOG_ACTION_CONSOLE_OFF: 1315 case SYSLOG_ACTION_CONSOLE_OFF:
1319 if (saved_console_loglevel == -1) 1316 if (saved_console_loglevel == LOGLEVEL_DEFAULT)
1320 saved_console_loglevel = console_loglevel; 1317 saved_console_loglevel = console_loglevel;
1321 console_loglevel = minimum_console_loglevel; 1318 console_loglevel = minimum_console_loglevel;
1322 break; 1319 break;
1323 /* Enable logging to console */ 1320 /* Enable logging to console */
1324 case SYSLOG_ACTION_CONSOLE_ON: 1321 case SYSLOG_ACTION_CONSOLE_ON:
1325 if (saved_console_loglevel != -1) { 1322 if (saved_console_loglevel != LOGLEVEL_DEFAULT) {
1326 console_loglevel = saved_console_loglevel; 1323 console_loglevel = saved_console_loglevel;
1327 saved_console_loglevel = -1; 1324 saved_console_loglevel = LOGLEVEL_DEFAULT;
1328 } 1325 }
1329 break; 1326 break;
1330 /* Set level of messages printed to console */ 1327 /* Set level of messages printed to console */
@@ -1336,7 +1333,7 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
1336 len = minimum_console_loglevel; 1333 len = minimum_console_loglevel;
1337 console_loglevel = len; 1334 console_loglevel = len;
1338 /* Implicitly re-enable logging to console */ 1335 /* Implicitly re-enable logging to console */
1339 saved_console_loglevel = -1; 1336 saved_console_loglevel = LOGLEVEL_DEFAULT;
1340 error = 0; 1337 error = 0;
1341 break; 1338 break;
1342 /* Number of chars in the log buffer */ 1339 /* Number of chars in the log buffer */
@@ -1627,10 +1624,10 @@ asmlinkage int vprintk_emit(int facility, int level,
1627 int printed_len = 0; 1624 int printed_len = 0;
1628 bool in_sched = false; 1625 bool in_sched = false;
1629 /* cpu currently holding logbuf_lock in this function */ 1626 /* cpu currently holding logbuf_lock in this function */
1630 static volatile unsigned int logbuf_cpu = UINT_MAX; 1627 static unsigned int logbuf_cpu = UINT_MAX;
1631 1628
1632 if (level == SCHED_MESSAGE_LOGLEVEL) { 1629 if (level == LOGLEVEL_SCHED) {
1633 level = -1; 1630 level = LOGLEVEL_DEFAULT;
1634 in_sched = true; 1631 in_sched = true;
1635 } 1632 }
1636 1633
@@ -1695,8 +1692,9 @@ asmlinkage int vprintk_emit(int facility, int level,
1695 const char *end_of_header = printk_skip_level(text); 1692 const char *end_of_header = printk_skip_level(text);
1696 switch (kern_level) { 1693 switch (kern_level) {
1697 case '0' ... '7': 1694 case '0' ... '7':
1698 if (level == -1) 1695 if (level == LOGLEVEL_DEFAULT)
1699 level = kern_level - '0'; 1696 level = kern_level - '0';
1697 /* fallthrough */
1700 case 'd': /* KERN_DEFAULT */ 1698 case 'd': /* KERN_DEFAULT */
1701 lflags |= LOG_PREFIX; 1699 lflags |= LOG_PREFIX;
1702 } 1700 }
@@ -1710,7 +1708,7 @@ asmlinkage int vprintk_emit(int facility, int level,
1710 } 1708 }
1711 } 1709 }
1712 1710
1713 if (level == -1) 1711 if (level == LOGLEVEL_DEFAULT)
1714 level = default_message_loglevel; 1712 level = default_message_loglevel;
1715 1713
1716 if (dict) 1714 if (dict)
@@ -1788,7 +1786,7 @@ EXPORT_SYMBOL(vprintk_emit);
1788 1786
1789asmlinkage int vprintk(const char *fmt, va_list args) 1787asmlinkage int vprintk(const char *fmt, va_list args)
1790{ 1788{
1791 return vprintk_emit(0, -1, NULL, 0, fmt, args); 1789 return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
1792} 1790}
1793EXPORT_SYMBOL(vprintk); 1791EXPORT_SYMBOL(vprintk);
1794 1792
@@ -1842,7 +1840,7 @@ asmlinkage __visible int printk(const char *fmt, ...)
1842 } 1840 }
1843#endif 1841#endif
1844 va_start(args, fmt); 1842 va_start(args, fmt);
1845 r = vprintk_emit(0, -1, NULL, 0, fmt, args); 1843 r = vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
1846 va_end(args); 1844 va_end(args);
1847 1845
1848 return r; 1846 return r;
@@ -1881,23 +1879,20 @@ static size_t cont_print_text(char *text, size_t size) { return 0; }
1881#ifdef CONFIG_EARLY_PRINTK 1879#ifdef CONFIG_EARLY_PRINTK
1882struct console *early_console; 1880struct console *early_console;
1883 1881
1884void early_vprintk(const char *fmt, va_list ap)
1885{
1886 if (early_console) {
1887 char buf[512];
1888 int n = vscnprintf(buf, sizeof(buf), fmt, ap);
1889
1890 early_console->write(early_console, buf, n);
1891 }
1892}
1893
1894asmlinkage __visible void early_printk(const char *fmt, ...) 1882asmlinkage __visible void early_printk(const char *fmt, ...)
1895{ 1883{
1896 va_list ap; 1884 va_list ap;
1885 char buf[512];
1886 int n;
1887
1888 if (!early_console)
1889 return;
1897 1890
1898 va_start(ap, fmt); 1891 va_start(ap, fmt);
1899 early_vprintk(fmt, ap); 1892 n = vscnprintf(buf, sizeof(buf), fmt, ap);
1900 va_end(ap); 1893 va_end(ap);
1894
1895 early_console->write(early_console, buf, n);
1901} 1896}
1902#endif 1897#endif
1903 1898
@@ -2634,7 +2629,7 @@ int printk_deferred(const char *fmt, ...)
2634 2629
2635 preempt_disable(); 2630 preempt_disable();
2636 va_start(args, fmt); 2631 va_start(args, fmt);
2637 r = vprintk_emit(0, SCHED_MESSAGE_LOGLEVEL, NULL, 0, fmt, args); 2632 r = vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
2638 va_end(args); 2633 va_end(args);
2639 2634
2640 __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT); 2635 __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
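
The printk.c conversion above replaces the private SCHED_MESSAGE_LOGLEVEL and bare -1 sentinels with shared LOGLEVEL_* constants. An illustrative standalone sketch of the idea follows; the numeric values are assumptions for the example rather than quotes from kern_levels.h:

#include <stdio.h>

enum {
	LOGLEVEL_SCHED   = -2,  /* deferred messages from the scheduler path */
	LOGLEVEL_DEFAULT = -1,  /* no explicit level: fall back to the default */
	LOGLEVEL_EMERG   = 0,
	LOGLEVEL_DEBUG   = 7,
};

/* mirrors the vprintk_emit() fallback: map "no level given" onto the default */
static int resolve_level(int level, int default_level)
{
	return level == LOGLEVEL_DEFAULT ? default_level : level;
}

int main(void)
{
	printf("%d\n", resolve_level(LOGLEVEL_DEFAULT, 4)); /* -> 4 */
	printf("%d\n", resolve_level(LOGLEVEL_DEBUG, 4));   /* -> 7 */
	return 0;
}
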
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 54e75226c2c4..1eb9d90c3af9 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -485,36 +485,19 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
485 485
486/* 486/*
487 * Detach all tasks we were using ptrace on. Called with tasklist held 487 * Detach all tasks we were using ptrace on. Called with tasklist held
488 * for writing, and returns with it held too. But note it can release 488 * for writing.
489 * and reacquire the lock.
490 */ 489 */
491void exit_ptrace(struct task_struct *tracer) 490void exit_ptrace(struct task_struct *tracer, struct list_head *dead)
492 __releases(&tasklist_lock)
493 __acquires(&tasklist_lock)
494{ 491{
495 struct task_struct *p, *n; 492 struct task_struct *p, *n;
496 LIST_HEAD(ptrace_dead);
497
498 if (likely(list_empty(&tracer->ptraced)))
499 return;
500 493
501 list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) { 494 list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) {
502 if (unlikely(p->ptrace & PT_EXITKILL)) 495 if (unlikely(p->ptrace & PT_EXITKILL))
503 send_sig_info(SIGKILL, SEND_SIG_FORCED, p); 496 send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
504 497
505 if (__ptrace_detach(tracer, p)) 498 if (__ptrace_detach(tracer, p))
506 list_add(&p->ptrace_entry, &ptrace_dead); 499 list_add(&p->ptrace_entry, dead);
507 }
508
509 write_unlock_irq(&tasklist_lock);
510 BUG_ON(!list_empty(&tracer->ptraced));
511
512 list_for_each_entry_safe(p, n, &ptrace_dead, ptrace_entry) {
513 list_del_init(&p->ptrace_entry);
514 release_task(p);
515 } 500 }
516
517 write_lock_irq(&tasklist_lock);
518} 501}
519 502
520int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len) 503int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
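
exit_ptrace() now only collects the tracees that need releasing onto a caller-supplied list, leaving the release_task() calls to the caller after tasklist_lock is dropped. A generic pthreads sketch (not kernel code) of this collect-under-the-lock, release-after-unlock pattern:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int id; struct node *next; };

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *tracked;            /* protected by list_lock */

static void detach_all(struct node **dead)
{
	/* caller holds list_lock, like tasklist_lock in exit_ptrace() */
	while (tracked) {
		struct node *n = tracked;

		tracked = n->next;
		n->next = *dead;        /* defer the actual release */
		*dead = n;
	}
}

int main(void)
{
	struct node *dead = NULL;

	for (int i = 0; i < 3; i++) {
		struct node *n = malloc(sizeof(*n));

		n->id = i;
		n->next = tracked;
		tracked = n;
	}

	pthread_mutex_lock(&list_lock);
	detach_all(&dead);
	pthread_mutex_unlock(&list_lock);

	while (dead) {                  /* release outside the lock */
		struct node *n = dead;

		dead = n->next;
		printf("releasing %d\n", n->id);
		free(n);
	}
	return 0;
}
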
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
deleted file mode 100644
index e791130f85a7..000000000000
--- a/kernel/res_counter.c
+++ /dev/null
@@ -1,211 +0,0 @@
1/*
2 * resource cgroups
3 *
4 * Copyright 2007 OpenVZ SWsoft Inc
5 *
6 * Author: Pavel Emelianov <xemul@openvz.org>
7 *
8 */
9
10#include <linux/types.h>
11#include <linux/parser.h>
12#include <linux/fs.h>
13#include <linux/res_counter.h>
14#include <linux/uaccess.h>
15#include <linux/mm.h>
16
17void res_counter_init(struct res_counter *counter, struct res_counter *parent)
18{
19 spin_lock_init(&counter->lock);
20 counter->limit = RES_COUNTER_MAX;
21 counter->soft_limit = RES_COUNTER_MAX;
22 counter->parent = parent;
23}
24
25static u64 res_counter_uncharge_locked(struct res_counter *counter,
26 unsigned long val)
27{
28 if (WARN_ON(counter->usage < val))
29 val = counter->usage;
30
31 counter->usage -= val;
32 return counter->usage;
33}
34
35static int res_counter_charge_locked(struct res_counter *counter,
36 unsigned long val, bool force)
37{
38 int ret = 0;
39
40 if (counter->usage + val > counter->limit) {
41 counter->failcnt++;
42 ret = -ENOMEM;
43 if (!force)
44 return ret;
45 }
46
47 counter->usage += val;
48 if (counter->usage > counter->max_usage)
49 counter->max_usage = counter->usage;
50 return ret;
51}
52
53static int __res_counter_charge(struct res_counter *counter, unsigned long val,
54 struct res_counter **limit_fail_at, bool force)
55{
56 int ret, r;
57 unsigned long flags;
58 struct res_counter *c, *u;
59
60 r = ret = 0;
61 *limit_fail_at = NULL;
62 local_irq_save(flags);
63 for (c = counter; c != NULL; c = c->parent) {
64 spin_lock(&c->lock);
65 r = res_counter_charge_locked(c, val, force);
66 spin_unlock(&c->lock);
67 if (r < 0 && !ret) {
68 ret = r;
69 *limit_fail_at = c;
70 if (!force)
71 break;
72 }
73 }
74
75 if (ret < 0 && !force) {
76 for (u = counter; u != c; u = u->parent) {
77 spin_lock(&u->lock);
78 res_counter_uncharge_locked(u, val);
79 spin_unlock(&u->lock);
80 }
81 }
82 local_irq_restore(flags);
83
84 return ret;
85}
86
87int res_counter_charge(struct res_counter *counter, unsigned long val,
88 struct res_counter **limit_fail_at)
89{
90 return __res_counter_charge(counter, val, limit_fail_at, false);
91}
92
93int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
94 struct res_counter **limit_fail_at)
95{
96 return __res_counter_charge(counter, val, limit_fail_at, true);
97}
98
99u64 res_counter_uncharge_until(struct res_counter *counter,
100 struct res_counter *top,
101 unsigned long val)
102{
103 unsigned long flags;
104 struct res_counter *c;
105 u64 ret = 0;
106
107 local_irq_save(flags);
108 for (c = counter; c != top; c = c->parent) {
109 u64 r;
110 spin_lock(&c->lock);
111 r = res_counter_uncharge_locked(c, val);
112 if (c == counter)
113 ret = r;
114 spin_unlock(&c->lock);
115 }
116 local_irq_restore(flags);
117 return ret;
118}
119
120u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
121{
122 return res_counter_uncharge_until(counter, NULL, val);
123}
124
125static inline unsigned long long *
126res_counter_member(struct res_counter *counter, int member)
127{
128 switch (member) {
129 case RES_USAGE:
130 return &counter->usage;
131 case RES_MAX_USAGE:
132 return &counter->max_usage;
133 case RES_LIMIT:
134 return &counter->limit;
135 case RES_FAILCNT:
136 return &counter->failcnt;
137 case RES_SOFT_LIMIT:
138 return &counter->soft_limit;
139 };
140
141 BUG();
142 return NULL;
143}
144
145ssize_t res_counter_read(struct res_counter *counter, int member,
146 const char __user *userbuf, size_t nbytes, loff_t *pos,
147 int (*read_strategy)(unsigned long long val, char *st_buf))
148{
149 unsigned long long *val;
150 char buf[64], *s;
151
152 s = buf;
153 val = res_counter_member(counter, member);
154 if (read_strategy)
155 s += read_strategy(*val, s);
156 else
157 s += sprintf(s, "%llu\n", *val);
158 return simple_read_from_buffer((void __user *)userbuf, nbytes,
159 pos, buf, s - buf);
160}
161
162#if BITS_PER_LONG == 32
163u64 res_counter_read_u64(struct res_counter *counter, int member)
164{
165 unsigned long flags;
166 u64 ret;
167
168 spin_lock_irqsave(&counter->lock, flags);
169 ret = *res_counter_member(counter, member);
170 spin_unlock_irqrestore(&counter->lock, flags);
171
172 return ret;
173}
174#else
175u64 res_counter_read_u64(struct res_counter *counter, int member)
176{
177 return *res_counter_member(counter, member);
178}
179#endif
180
181int res_counter_memparse_write_strategy(const char *buf,
182 unsigned long long *resp)
183{
184 char *end;
185 unsigned long long res;
186
187 /* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
188 if (*buf == '-') {
189 int rc = kstrtoull(buf + 1, 10, &res);
190
191 if (rc)
192 return rc;
193 if (res != 1)
194 return -EINVAL;
195 *resp = RES_COUNTER_MAX;
196 return 0;
197 }
198
199 res = memparse(buf, &end);
200 if (*end != '\0')
201 return -EINVAL;
202
203 if (PAGE_ALIGN(res) >= res)
204 res = PAGE_ALIGN(res);
205 else
206 res = RES_COUNTER_MAX;
207
208 *resp = res;
209
210 return 0;
211}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bb398c0c5f08..b5797b78add6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,8 +4527,10 @@ void sched_show_task(struct task_struct *p)
4527#ifdef CONFIG_DEBUG_STACK_USAGE 4527#ifdef CONFIG_DEBUG_STACK_USAGE
4528 free = stack_not_used(p); 4528 free = stack_not_used(p);
4529#endif 4529#endif
4530 ppid = 0;
4530 rcu_read_lock(); 4531 rcu_read_lock();
4531 ppid = task_pid_nr(rcu_dereference(p->real_parent)); 4532 if (pid_alive(p))
4533 ppid = task_pid_nr(rcu_dereference(p->real_parent));
4532 rcu_read_unlock(); 4534 rcu_read_unlock();
4533 printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free, 4535 printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
4534 task_pid_nr(p), ppid, 4536 task_pid_nr(p), ppid,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511a1b7c..7c54ff79afd7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
1104 .proc_handler = proc_dointvec, 1104 .proc_handler = proc_dointvec,
1105 }, 1105 },
1106#endif 1106#endif
1107 {
1108 .procname = "panic_on_warn",
1109 .data = &panic_on_warn,
1110 .maxlen = sizeof(int),
1111 .mode = 0644,
1112 .proc_handler = proc_dointvec_minmax,
1113 .extra1 = &zero,
1114 .extra2 = &one,
1115 },
1107 { } 1116 { }
1108}; 1117};
1109 1118
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 9a4f750a2963..7e7746a42a62 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
137 { CTL_INT, KERN_COMPAT_LOG, "compat-log" }, 137 { CTL_INT, KERN_COMPAT_LOG, "compat-log" },
138 { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" }, 138 { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" },
139 { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" }, 139 { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" },
140 { CTL_INT, KERN_PANIC_ON_WARN, "panic_on_warn" },
140 {} 141 {}
141}; 142};
142 143
diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index add80cc02dbe..9722bd2dbc9b 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -102,6 +102,14 @@ static DEFINE_SPINLOCK(free_entries_lock);
102/* Global disable flag - will be set in case of an error */ 102/* Global disable flag - will be set in case of an error */
103static u32 global_disable __read_mostly; 103static u32 global_disable __read_mostly;
104 104
105/* Early initialization disable flag, set at the end of dma_debug_init */
106static bool dma_debug_initialized __read_mostly;
107
108static inline bool dma_debug_disabled(void)
109{
110 return global_disable || !dma_debug_initialized;
111}
112
105/* Global error count */ 113/* Global error count */
106static u32 error_count; 114static u32 error_count;
107 115
@@ -945,7 +953,7 @@ static int dma_debug_device_change(struct notifier_block *nb, unsigned long acti
945 struct dma_debug_entry *uninitialized_var(entry); 953 struct dma_debug_entry *uninitialized_var(entry);
946 int count; 954 int count;
947 955
948 if (global_disable) 956 if (dma_debug_disabled())
949 return 0; 957 return 0;
950 958
951 switch (action) { 959 switch (action) {
@@ -973,7 +981,7 @@ void dma_debug_add_bus(struct bus_type *bus)
973{ 981{
974 struct notifier_block *nb; 982 struct notifier_block *nb;
975 983
976 if (global_disable) 984 if (dma_debug_disabled())
977 return; 985 return;
978 986
979 nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL); 987 nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL);
@@ -994,6 +1002,9 @@ void dma_debug_init(u32 num_entries)
994{ 1002{
995 int i; 1003 int i;
996 1004
1005 /* Do not use dma_debug_initialized here, since we really want to be
1006 * called to set dma_debug_initialized
1007 */
997 if (global_disable) 1008 if (global_disable)
998 return; 1009 return;
999 1010
@@ -1021,6 +1032,8 @@ void dma_debug_init(u32 num_entries)
1021 1032
1022 nr_total_entries = num_free_entries; 1033 nr_total_entries = num_free_entries;
1023 1034
1035 dma_debug_initialized = true;
1036
1024 pr_info("DMA-API: debugging enabled by kernel config\n"); 1037 pr_info("DMA-API: debugging enabled by kernel config\n");
1025} 1038}
1026 1039
@@ -1243,7 +1256,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
1243{ 1256{
1244 struct dma_debug_entry *entry; 1257 struct dma_debug_entry *entry;
1245 1258
1246 if (unlikely(global_disable)) 1259 if (unlikely(dma_debug_disabled()))
1247 return; 1260 return;
1248 1261
1249 if (dma_mapping_error(dev, dma_addr)) 1262 if (dma_mapping_error(dev, dma_addr))
@@ -1283,7 +1296,7 @@ void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
1283 struct hash_bucket *bucket; 1296 struct hash_bucket *bucket;
1284 unsigned long flags; 1297 unsigned long flags;
1285 1298
1286 if (unlikely(global_disable)) 1299 if (unlikely(dma_debug_disabled()))
1287 return; 1300 return;
1288 1301
1289 ref.dev = dev; 1302 ref.dev = dev;
@@ -1325,7 +1338,7 @@ void debug_dma_unmap_page(struct device *dev, dma_addr_t addr,
1325 .direction = direction, 1338 .direction = direction,
1326 }; 1339 };
1327 1340
1328 if (unlikely(global_disable)) 1341 if (unlikely(dma_debug_disabled()))
1329 return; 1342 return;
1330 1343
1331 if (map_single) 1344 if (map_single)
@@ -1342,7 +1355,7 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
1342 struct scatterlist *s; 1355 struct scatterlist *s;
1343 int i; 1356 int i;
1344 1357
1345 if (unlikely(global_disable)) 1358 if (unlikely(dma_debug_disabled()))
1346 return; 1359 return;
1347 1360
1348 for_each_sg(sg, s, mapped_ents, i) { 1361 for_each_sg(sg, s, mapped_ents, i) {
@@ -1395,7 +1408,7 @@ void debug_dma_unmap_sg(struct device *dev, struct scatterlist *sglist,
1395 struct scatterlist *s; 1408 struct scatterlist *s;
1396 int mapped_ents = 0, i; 1409 int mapped_ents = 0, i;
1397 1410
1398 if (unlikely(global_disable)) 1411 if (unlikely(dma_debug_disabled()))
1399 return; 1412 return;
1400 1413
1401 for_each_sg(sglist, s, nelems, i) { 1414 for_each_sg(sglist, s, nelems, i) {
@@ -1427,7 +1440,7 @@ void debug_dma_alloc_coherent(struct device *dev, size_t size,
1427{ 1440{
1428 struct dma_debug_entry *entry; 1441 struct dma_debug_entry *entry;
1429 1442
1430 if (unlikely(global_disable)) 1443 if (unlikely(dma_debug_disabled()))
1431 return; 1444 return;
1432 1445
1433 if (unlikely(virt == NULL)) 1446 if (unlikely(virt == NULL))
@@ -1462,7 +1475,7 @@ void debug_dma_free_coherent(struct device *dev, size_t size,
1462 .direction = DMA_BIDIRECTIONAL, 1475 .direction = DMA_BIDIRECTIONAL,
1463 }; 1476 };
1464 1477
1465 if (unlikely(global_disable)) 1478 if (unlikely(dma_debug_disabled()))
1466 return; 1479 return;
1467 1480
1468 check_unmap(&ref); 1481 check_unmap(&ref);
@@ -1474,7 +1487,7 @@ void debug_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
1474{ 1487{
1475 struct dma_debug_entry ref; 1488 struct dma_debug_entry ref;
1476 1489
1477 if (unlikely(global_disable)) 1490 if (unlikely(dma_debug_disabled()))
1478 return; 1491 return;
1479 1492
1480 ref.type = dma_debug_single; 1493 ref.type = dma_debug_single;
@@ -1494,7 +1507,7 @@ void debug_dma_sync_single_for_device(struct device *dev,
1494{ 1507{
1495 struct dma_debug_entry ref; 1508 struct dma_debug_entry ref;
1496 1509
1497 if (unlikely(global_disable)) 1510 if (unlikely(dma_debug_disabled()))
1498 return; 1511 return;
1499 1512
1500 ref.type = dma_debug_single; 1513 ref.type = dma_debug_single;
@@ -1515,7 +1528,7 @@ void debug_dma_sync_single_range_for_cpu(struct device *dev,
1515{ 1528{
1516 struct dma_debug_entry ref; 1529 struct dma_debug_entry ref;
1517 1530
1518 if (unlikely(global_disable)) 1531 if (unlikely(dma_debug_disabled()))
1519 return; 1532 return;
1520 1533
1521 ref.type = dma_debug_single; 1534 ref.type = dma_debug_single;
@@ -1536,7 +1549,7 @@ void debug_dma_sync_single_range_for_device(struct device *dev,
1536{ 1549{
1537 struct dma_debug_entry ref; 1550 struct dma_debug_entry ref;
1538 1551
1539 if (unlikely(global_disable)) 1552 if (unlikely(dma_debug_disabled()))
1540 return; 1553 return;
1541 1554
1542 ref.type = dma_debug_single; 1555 ref.type = dma_debug_single;
@@ -1556,7 +1569,7 @@ void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
1556 struct scatterlist *s; 1569 struct scatterlist *s;
1557 int mapped_ents = 0, i; 1570 int mapped_ents = 0, i;
1558 1571
1559 if (unlikely(global_disable)) 1572 if (unlikely(dma_debug_disabled()))
1560 return; 1573 return;
1561 1574
1562 for_each_sg(sg, s, nelems, i) { 1575 for_each_sg(sg, s, nelems, i) {
@@ -1589,7 +1602,7 @@ void debug_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
1589 struct scatterlist *s; 1602 struct scatterlist *s;
1590 int mapped_ents = 0, i; 1603 int mapped_ents = 0, i;
1591 1604
1592 if (unlikely(global_disable)) 1605 if (unlikely(dma_debug_disabled()))
1593 return; 1606 return;
1594 1607
1595 for_each_sg(sg, s, nelems, i) { 1608 for_each_sg(sg, s, nelems, i) {
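
The dma_debug_disabled() conversion above folds the existing global_disable flag and a new "has init run yet" flag into one predicate, so every debug hook treats calls made before dma_debug_init() as disabled. A standalone sketch of the same gate, with hypothetical names:

#include <stdbool.h>
#include <stdio.h>

static bool global_disable;        /* set on error or by an explicit "off" switch */
static bool debug_initialized;     /* set at the end of init */

static inline bool debug_disabled(void)
{
	return global_disable || !debug_initialized;
}

static void debug_track_event(const char *what)
{
	if (debug_disabled())      /* early or disabled callers fall out here */
		return;
	printf("tracking: %s\n", what);
}

static void debug_init(void)
{
	if (global_disable)
		return;
	/* ... allocate hash tables, entries, etc. ... */
	debug_initialized = true;
}

int main(void)
{
	debug_track_event("before init");   /* silently ignored */
	debug_init();
	debug_track_event("after init");    /* now recorded */
	return 0;
}
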
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index dfba05521748..527799d44476 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -576,7 +576,7 @@ void __dynamic_dev_dbg(struct _ddebug *descriptor,
576 } else { 576 } else {
577 char buf[PREFIX_SIZE]; 577 char buf[PREFIX_SIZE];
578 578
579 dev_printk_emit(7, dev, "%s%s %s: %pV", 579 dev_printk_emit(LOGLEVEL_DEBUG, dev, "%s%s %s: %pV",
580 dynamic_emit_prefix(descriptor, buf), 580 dynamic_emit_prefix(descriptor, buf),
581 dev_driver_string(dev), dev_name(dev), 581 dev_driver_string(dev), dev_name(dev),
582 &vaf); 582 &vaf);
@@ -605,7 +605,7 @@ void __dynamic_netdev_dbg(struct _ddebug *descriptor,
605 if (dev && dev->dev.parent) { 605 if (dev && dev->dev.parent) {
606 char buf[PREFIX_SIZE]; 606 char buf[PREFIX_SIZE];
607 607
608 dev_printk_emit(7, dev->dev.parent, 608 dev_printk_emit(LOGLEVEL_DEBUG, dev->dev.parent,
609 "%s%s %s %s%s: %pV", 609 "%s%s %s %s%s: %pV",
610 dynamic_emit_prefix(descriptor, buf), 610 dynamic_emit_prefix(descriptor, buf),
611 dev_driver_string(dev->dev.parent), 611 dev_driver_string(dev->dev.parent),
diff --git a/lib/lcm.c b/lib/lcm.c
index b9c8de461e9e..51cc6b13cd52 100644
--- a/lib/lcm.c
+++ b/lib/lcm.c
@@ -7,10 +7,8 @@
7unsigned long lcm(unsigned long a, unsigned long b) 7unsigned long lcm(unsigned long a, unsigned long b)
8{ 8{
9 if (a && b) 9 if (a && b)
10 return (a * b) / gcd(a, b); 10 return (a / gcd(a, b)) * b;
11 else if (b) 11 else
12 return b; 12 return 0;
13
14 return a;
15} 13}
16EXPORT_SYMBOL_GPL(lcm); 14EXPORT_SYMBOL_GPL(lcm);
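
The lcm() rewrite divides by the gcd before multiplying, so the intermediate value never exceeds the final result, and it returns 0 when either argument is 0. A worked userspace example, assuming a 64-bit unsigned long:

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;
}

static unsigned long lcm(unsigned long a, unsigned long b)
{
	if (a && b)
		return (a / gcd(a, b)) * b;   /* divide first, then multiply */
	return 0;                             /* lcm with 0 defined as 0 here */
}

int main(void)
{
	/* 2^32 and 2^33: a*b would overflow 64 bits, (a/gcd)*b does not (LP64) */
	unsigned long a = 1UL << 32, b = 1UL << 33;

	printf("lcm = %lu\n", lcm(a, b));     /* 8589934592 */
	return 0;
}
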
diff --git a/mm/Makefile b/mm/Makefile
index 8405eb0023a9..b3c6ce932c64 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -55,7 +55,9 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
55obj-$(CONFIG_MIGRATION) += migrate.o 55obj-$(CONFIG_MIGRATION) += migrate.o
56obj-$(CONFIG_QUICKLIST) += quicklist.o 56obj-$(CONFIG_QUICKLIST) += quicklist.o
57obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o 57obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
58obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o 58obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
59obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
60obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o
59obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o 61obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
60obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o 62obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
61obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o 63obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/cma.c b/mm/cma.c
index fde706e1284f..8e9ec13d31db 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -215,9 +215,21 @@ int __init cma_declare_contiguous(phys_addr_t base,
215 bool fixed, struct cma **res_cma) 215 bool fixed, struct cma **res_cma)
216{ 216{
217 phys_addr_t memblock_end = memblock_end_of_DRAM(); 217 phys_addr_t memblock_end = memblock_end_of_DRAM();
218 phys_addr_t highmem_start = __pa(high_memory); 218 phys_addr_t highmem_start;
219 int ret = 0; 219 int ret = 0;
220 220
221#ifdef CONFIG_X86
222 /*
223 * high_memory isn't direct mapped memory so retrieving its physical
224 * address isn't appropriate. But it would be useful to check the
225 * physical address of the highmem boundary so it's justfiable to get
226 * the physical address from it. On x86 there is a validation check for
227 * this case, so the following workaround is needed to avoid it.
228 */
229 highmem_start = __pa_nodebug(high_memory);
230#else
231 highmem_start = __pa(high_memory);
232#endif
221 pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n", 233 pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n",
222 __func__, &size, &base, &limit, &alignment); 234 __func__, &size, &base, &limit, &alignment);
223 235
diff --git a/mm/compaction.c b/mm/compaction.c
index f9792ba3537c..546e571e9d60 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -41,15 +41,17 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
41static unsigned long release_freepages(struct list_head *freelist) 41static unsigned long release_freepages(struct list_head *freelist)
42{ 42{
43 struct page *page, *next; 43 struct page *page, *next;
44 unsigned long count = 0; 44 unsigned long high_pfn = 0;
45 45
46 list_for_each_entry_safe(page, next, freelist, lru) { 46 list_for_each_entry_safe(page, next, freelist, lru) {
47 unsigned long pfn = page_to_pfn(page);
47 list_del(&page->lru); 48 list_del(&page->lru);
48 __free_page(page); 49 __free_page(page);
49 count++; 50 if (pfn > high_pfn)
51 high_pfn = pfn;
50 } 52 }
51 53
52 return count; 54 return high_pfn;
53} 55}
54 56
55static void map_pages(struct list_head *list) 57static void map_pages(struct list_head *list)
@@ -195,16 +197,12 @@ static void update_pageblock_skip(struct compact_control *cc,
195 197
196 /* Update where async and sync compaction should restart */ 198 /* Update where async and sync compaction should restart */
197 if (migrate_scanner) { 199 if (migrate_scanner) {
198 if (cc->finished_update_migrate)
199 return;
200 if (pfn > zone->compact_cached_migrate_pfn[0]) 200 if (pfn > zone->compact_cached_migrate_pfn[0])
201 zone->compact_cached_migrate_pfn[0] = pfn; 201 zone->compact_cached_migrate_pfn[0] = pfn;
202 if (cc->mode != MIGRATE_ASYNC && 202 if (cc->mode != MIGRATE_ASYNC &&
203 pfn > zone->compact_cached_migrate_pfn[1]) 203 pfn > zone->compact_cached_migrate_pfn[1])
204 zone->compact_cached_migrate_pfn[1] = pfn; 204 zone->compact_cached_migrate_pfn[1] = pfn;
205 } else { 205 } else {
206 if (cc->finished_update_free)
207 return;
208 if (pfn < zone->compact_cached_free_pfn) 206 if (pfn < zone->compact_cached_free_pfn)
209 zone->compact_cached_free_pfn = pfn; 207 zone->compact_cached_free_pfn = pfn;
210 } 208 }
@@ -715,7 +713,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
715 del_page_from_lru_list(page, lruvec, page_lru(page)); 713 del_page_from_lru_list(page, lruvec, page_lru(page));
716 714
717isolate_success: 715isolate_success:
718 cc->finished_update_migrate = true;
719 list_add(&page->lru, migratelist); 716 list_add(&page->lru, migratelist);
720 cc->nr_migratepages++; 717 cc->nr_migratepages++;
721 nr_isolated++; 718 nr_isolated++;
@@ -889,15 +886,6 @@ static void isolate_freepages(struct compact_control *cc)
889 block_start_pfn - pageblock_nr_pages; 886 block_start_pfn - pageblock_nr_pages;
890 887
891 /* 888 /*
892 * Set a flag that we successfully isolated in this pageblock.
893 * In the next loop iteration, zone->compact_cached_free_pfn
894 * will not be updated and thus it will effectively contain the
895 * highest pageblock we isolated pages from.
896 */
897 if (isolated)
898 cc->finished_update_free = true;
899
900 /*
901 * isolate_freepages_block() might have aborted due to async 889 * isolate_freepages_block() might have aborted due to async
902 * compaction being contended 890 * compaction being contended
903 */ 891 */
@@ -1086,9 +1074,9 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
1086 1074
1087 /* Compaction run is not finished if the watermark is not met */ 1075 /* Compaction run is not finished if the watermark is not met */
1088 watermark = low_wmark_pages(zone); 1076 watermark = low_wmark_pages(zone);
1089 watermark += (1 << cc->order);
1090 1077
1091 if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) 1078 if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
1079 cc->alloc_flags))
1092 return COMPACT_CONTINUE; 1080 return COMPACT_CONTINUE;
1093 1081
1094 /* Direct compactor: Is a suitable page free? */ 1082 /* Direct compactor: Is a suitable page free? */
@@ -1114,7 +1102,8 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
1114 * COMPACT_PARTIAL - If the allocation would succeed without compaction 1102 * COMPACT_PARTIAL - If the allocation would succeed without compaction
1115 * COMPACT_CONTINUE - If compaction should run now 1103 * COMPACT_CONTINUE - If compaction should run now
1116 */ 1104 */
1117unsigned long compaction_suitable(struct zone *zone, int order) 1105unsigned long compaction_suitable(struct zone *zone, int order,
1106 int alloc_flags, int classzone_idx)
1118{ 1107{
1119 int fragindex; 1108 int fragindex;
1120 unsigned long watermark; 1109 unsigned long watermark;
@@ -1126,21 +1115,30 @@ unsigned long compaction_suitable(struct zone *zone, int order)
1126 if (order == -1) 1115 if (order == -1)
1127 return COMPACT_CONTINUE; 1116 return COMPACT_CONTINUE;
1128 1117
1118 watermark = low_wmark_pages(zone);
1119 /*
1120 * If watermarks for high-order allocation are already met, there
1121 * should be no need for compaction at all.
1122 */
1123 if (zone_watermark_ok(zone, order, watermark, classzone_idx,
1124 alloc_flags))
1125 return COMPACT_PARTIAL;
1126
1129 /* 1127 /*
1130 * Watermarks for order-0 must be met for compaction. Note the 2UL. 1128 * Watermarks for order-0 must be met for compaction. Note the 2UL.
1131 * This is because during migration, copies of pages need to be 1129 * This is because during migration, copies of pages need to be
1132 * allocated and for a short time, the footprint is higher 1130 * allocated and for a short time, the footprint is higher
1133 */ 1131 */
1134 watermark = low_wmark_pages(zone) + (2UL << order); 1132 watermark += (2UL << order);
1135 if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) 1133 if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
1136 return COMPACT_SKIPPED; 1134 return COMPACT_SKIPPED;
1137 1135
1138 /* 1136 /*
1139 * fragmentation index determines if allocation failures are due to 1137 * fragmentation index determines if allocation failures are due to
1140 * low memory or external fragmentation 1138 * low memory or external fragmentation
1141 * 1139 *
1142 * index of -1000 implies allocations might succeed depending on 1140 * index of -1000 would imply allocations might succeed depending on
1143 * watermarks 1141 * watermarks, but we already failed the high-order watermark check
1144 * index towards 0 implies failure is due to lack of memory 1142 * index towards 0 implies failure is due to lack of memory
1145 * index towards 1000 implies failure is due to fragmentation 1143 * index towards 1000 implies failure is due to fragmentation
1146 * 1144 *
@@ -1150,10 +1148,6 @@ unsigned long compaction_suitable(struct zone *zone, int order)
1150 if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold) 1148 if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
1151 return COMPACT_SKIPPED; 1149 return COMPACT_SKIPPED;
1152 1150
1153 if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark,
1154 0, 0))
1155 return COMPACT_PARTIAL;
1156
1157 return COMPACT_CONTINUE; 1151 return COMPACT_CONTINUE;
1158} 1152}
1159 1153
@@ -1164,8 +1158,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
1164 unsigned long end_pfn = zone_end_pfn(zone); 1158 unsigned long end_pfn = zone_end_pfn(zone);
1165 const int migratetype = gfpflags_to_migratetype(cc->gfp_mask); 1159 const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
1166 const bool sync = cc->mode != MIGRATE_ASYNC; 1160 const bool sync = cc->mode != MIGRATE_ASYNC;
1161 unsigned long last_migrated_pfn = 0;
1167 1162
1168 ret = compaction_suitable(zone, cc->order); 1163 ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
1164 cc->classzone_idx);
1169 switch (ret) { 1165 switch (ret) {
1170 case COMPACT_PARTIAL: 1166 case COMPACT_PARTIAL:
1171 case COMPACT_SKIPPED: 1167 case COMPACT_SKIPPED:
@@ -1208,6 +1204,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
1208 while ((ret = compact_finished(zone, cc, migratetype)) == 1204 while ((ret = compact_finished(zone, cc, migratetype)) ==
1209 COMPACT_CONTINUE) { 1205 COMPACT_CONTINUE) {
1210 int err; 1206 int err;
1207 unsigned long isolate_start_pfn = cc->migrate_pfn;
1211 1208
1212 switch (isolate_migratepages(zone, cc)) { 1209 switch (isolate_migratepages(zone, cc)) {
1213 case ISOLATE_ABORT: 1210 case ISOLATE_ABORT:
@@ -1216,7 +1213,12 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
1216 cc->nr_migratepages = 0; 1213 cc->nr_migratepages = 0;
1217 goto out; 1214 goto out;
1218 case ISOLATE_NONE: 1215 case ISOLATE_NONE:
1219 continue; 1216 /*
1217 * We haven't isolated and migrated anything, but
1218 * there might still be unflushed migrations from
1219 * previous cc->order aligned block.
1220 */
1221 goto check_drain;
1220 case ISOLATE_SUCCESS: 1222 case ISOLATE_SUCCESS:
1221 ; 1223 ;
1222 } 1224 }
@@ -1241,12 +1243,61 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
1241 goto out; 1243 goto out;
1242 } 1244 }
1243 } 1245 }
1246
1247 /*
1248 * Record where we could have freed pages by migration and not
1249 * yet flushed them to buddy allocator. We use the pfn that
1250 * isolate_migratepages() started from in this loop iteration
1251 * - this is the lowest page that could have been isolated and
1252 * then freed by migration.
1253 */
1254 if (!last_migrated_pfn)
1255 last_migrated_pfn = isolate_start_pfn;
1256
1257check_drain:
1258 /*
1259 * Has the migration scanner moved away from the previous
1260 * cc->order aligned block where we migrated from? If yes,
1261 * flush the pages that were freed, so that they can merge and
1262 * compact_finished() can detect immediately if allocation
1263 * would succeed.
1264 */
1265 if (cc->order > 0 && last_migrated_pfn) {
1266 int cpu;
1267 unsigned long current_block_start =
1268 cc->migrate_pfn & ~((1UL << cc->order) - 1);
1269
1270 if (last_migrated_pfn < current_block_start) {
1271 cpu = get_cpu();
1272 lru_add_drain_cpu(cpu);
1273 drain_local_pages(zone);
1274 put_cpu();
1275 /* No more flushing until we migrate again */
1276 last_migrated_pfn = 0;
1277 }
1278 }
1279
1244 } 1280 }
1245 1281
1246out: 1282out:
1247 /* Release free pages and check accounting */ 1283 /*
1248 cc->nr_freepages -= release_freepages(&cc->freepages); 1284 * Release free pages and update where the free scanner should restart,
1249 VM_BUG_ON(cc->nr_freepages != 0); 1285 * so we don't leave any returned pages behind in the next attempt.
1286 */
1287 if (cc->nr_freepages > 0) {
1288 unsigned long free_pfn = release_freepages(&cc->freepages);
1289
1290 cc->nr_freepages = 0;
1291 VM_BUG_ON(free_pfn == 0);
1292 /* The cached pfn is always the first in a pageblock */
1293 free_pfn &= ~(pageblock_nr_pages-1);
1294 /*
1295 * Only go back, not forward. The cached pfn might have been
1296 * already reset to zone end in compact_finished()
1297 */
1298 if (free_pfn > zone->compact_cached_free_pfn)
1299 zone->compact_cached_free_pfn = free_pfn;
1300 }
1250 1301
1251 trace_mm_compaction_end(ret); 1302 trace_mm_compaction_end(ret);
1252 1303
@@ -1254,7 +1305,8 @@ out:
1254} 1305}
1255 1306
1256static unsigned long compact_zone_order(struct zone *zone, int order, 1307static unsigned long compact_zone_order(struct zone *zone, int order,
1257 gfp_t gfp_mask, enum migrate_mode mode, int *contended) 1308 gfp_t gfp_mask, enum migrate_mode mode, int *contended,
1309 int alloc_flags, int classzone_idx)
1258{ 1310{
1259 unsigned long ret; 1311 unsigned long ret;
1260 struct compact_control cc = { 1312 struct compact_control cc = {
@@ -1264,6 +1316,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
1264 .gfp_mask = gfp_mask, 1316 .gfp_mask = gfp_mask,
1265 .zone = zone, 1317 .zone = zone,
1266 .mode = mode, 1318 .mode = mode,
1319 .alloc_flags = alloc_flags,
1320 .classzone_idx = classzone_idx,
1267 }; 1321 };
1268 INIT_LIST_HEAD(&cc.freepages); 1322 INIT_LIST_HEAD(&cc.freepages);
1269 INIT_LIST_HEAD(&cc.migratepages); 1323 INIT_LIST_HEAD(&cc.migratepages);
@@ -1288,14 +1342,13 @@ int sysctl_extfrag_threshold = 500;
1288 * @mode: The migration mode for async, sync light, or sync migration 1342 * @mode: The migration mode for async, sync light, or sync migration
1289 * @contended: Return value that determines if compaction was aborted due to 1343 * @contended: Return value that determines if compaction was aborted due to
1290 * need_resched() or lock contention 1344 * need_resched() or lock contention
1291 * @candidate_zone: Return the zone where we think allocation should succeed
1292 * 1345 *
1293 * This is the main entry point for direct page compaction. 1346 * This is the main entry point for direct page compaction.
1294 */ 1347 */
1295unsigned long try_to_compact_pages(struct zonelist *zonelist, 1348unsigned long try_to_compact_pages(struct zonelist *zonelist,
1296 int order, gfp_t gfp_mask, nodemask_t *nodemask, 1349 int order, gfp_t gfp_mask, nodemask_t *nodemask,
1297 enum migrate_mode mode, int *contended, 1350 enum migrate_mode mode, int *contended,
1298 struct zone **candidate_zone) 1351 int alloc_flags, int classzone_idx)
1299{ 1352{
1300 enum zone_type high_zoneidx = gfp_zone(gfp_mask); 1353 enum zone_type high_zoneidx = gfp_zone(gfp_mask);
1301 int may_enter_fs = gfp_mask & __GFP_FS; 1354 int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1303,7 +1356,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
1303 struct zoneref *z; 1356 struct zoneref *z;
1304 struct zone *zone; 1357 struct zone *zone;
1305 int rc = COMPACT_DEFERRED; 1358 int rc = COMPACT_DEFERRED;
1306 int alloc_flags = 0;
1307 int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ 1359 int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
1308 1360
1309 *contended = COMPACT_CONTENDED_NONE; 1361 *contended = COMPACT_CONTENDED_NONE;
@@ -1312,10 +1364,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
1312 if (!order || !may_enter_fs || !may_perform_io) 1364 if (!order || !may_enter_fs || !may_perform_io)
1313 return COMPACT_SKIPPED; 1365 return COMPACT_SKIPPED;
1314 1366
1315#ifdef CONFIG_CMA
1316 if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
1317 alloc_flags |= ALLOC_CMA;
1318#endif
1319 /* Compact each zone in the list */ 1367 /* Compact each zone in the list */
1320 for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, 1368 for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
1321 nodemask) { 1369 nodemask) {
@@ -1326,7 +1374,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
1326 continue; 1374 continue;
1327 1375
1328 status = compact_zone_order(zone, order, gfp_mask, mode, 1376 status = compact_zone_order(zone, order, gfp_mask, mode,
1329 &zone_contended); 1377 &zone_contended, alloc_flags, classzone_idx);
1330 rc = max(status, rc); 1378 rc = max(status, rc);
1331 /* 1379 /*
1332 * It takes at least one zone that wasn't lock contended 1380 * It takes at least one zone that wasn't lock contended
@@ -1335,9 +1383,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
1335 all_zones_contended &= zone_contended; 1383 all_zones_contended &= zone_contended;
1336 1384
1337 /* If a normal allocation would succeed, stop compacting */ 1385 /* If a normal allocation would succeed, stop compacting */
1338 if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 1386 if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
1339 alloc_flags)) { 1387 classzone_idx, alloc_flags)) {
1340 *candidate_zone = zone;
1341 /* 1388 /*
1342 * We think the allocation will succeed in this zone, 1389 * We think the allocation will succeed in this zone,
1343 * but it is not certain, hence the false. The caller 1390 * but it is not certain, hence the false. The caller
@@ -1359,7 +1406,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
1359 goto break_loop; 1406 goto break_loop;
1360 } 1407 }
1361 1408
1362 if (mode != MIGRATE_ASYNC) { 1409 if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
1363 /* 1410 /*
1364 * We think that allocation won't succeed in this zone 1411 * We think that allocation won't succeed in this zone
1365 * so we defer compaction there. If it ends up 1412 * so we defer compaction there. If it ends up
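
release_freepages() now reports the highest pfn it handed back to the buddy allocator instead of a count, which lets compact_zone() pull the cached free-scanner position back to that pageblock. A simplified userspace sketch of that contract, using a hypothetical fake_page list:

#include <stdio.h>
#include <stdlib.h>

struct fake_page { unsigned long pfn; struct fake_page *next; };

static unsigned long release_all(struct fake_page *list)
{
	unsigned long high_pfn = 0;

	while (list) {
		struct fake_page *p = list;

		list = p->next;
		if (p->pfn > high_pfn)
			high_pfn = p->pfn;   /* remember the highest pfn freed */
		free(p);                     /* "return it to the allocator" */
	}
	return high_pfn;
}

int main(void)
{
	struct fake_page *list = NULL;
	unsigned long pfns[] = { 4096, 1024, 16384 };

	for (int i = 0; i < 3; i++) {
		struct fake_page *p = malloc(sizeof(*p));

		p->pfn = pfns[i];
		p->next = list;
		list = p;
	}
	printf("restart free scanner at pfn %lu\n", release_all(list)); /* 16384 */
	return 0;
}
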
diff --git a/mm/debug.c b/mm/debug.c
index 5ce45c9a29b5..0e58f3211f89 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -95,7 +95,10 @@ void dump_page_badflags(struct page *page, const char *reason,
95 dump_flags(page->flags & badflags, 95 dump_flags(page->flags & badflags,
96 pageflag_names, ARRAY_SIZE(pageflag_names)); 96 pageflag_names, ARRAY_SIZE(pageflag_names));
97 } 97 }
98 mem_cgroup_print_bad_page(page); 98#ifdef CONFIG_MEMCG
99 if (page->mem_cgroup)
100 pr_alert("page->mem_cgroup:%p\n", page->mem_cgroup);
101#endif
99} 102}
100 103
101void dump_page(struct page *page, const char *reason) 104void dump_page(struct page *page, const char *reason)
diff --git a/mm/frontswap.c b/mm/frontswap.c
index f2a3571c6e22..8d82809eb085 100644
--- a/mm/frontswap.c
+++ b/mm/frontswap.c
@@ -182,7 +182,7 @@ void __frontswap_init(unsigned type, unsigned long *map)
182 if (frontswap_ops) 182 if (frontswap_ops)
183 frontswap_ops->init(type); 183 frontswap_ops->init(type);
184 else { 184 else {
185 BUG_ON(type > MAX_SWAPFILES); 185 BUG_ON(type >= MAX_SWAPFILES);
186 set_bit(type, need_init); 186 set_bit(type, need_init);
187 } 187 }
188} 188}
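
The frontswap fix tightens an off-by-one: with MAX_SWAPFILES slots the valid swap types are 0..MAX_SWAPFILES-1, so the sanity check must reject type == MAX_SWAPFILES as well. A tiny illustration with a userspace assert standing in for BUG_ON (the MAX_SWAPFILES value here is made up for the sketch):

#include <assert.h>

#define MAX_SWAPFILES 32   /* illustrative value only */

static unsigned long need_init[MAX_SWAPFILES];

static void mark_need_init(unsigned type)
{
	/* old check only rejected type > MAX_SWAPFILES, letting type == MAX through */
	assert(type < MAX_SWAPFILES);
	need_init[type] = 1;
}

int main(void)
{
	mark_need_init(0);
	mark_need_init(MAX_SWAPFILES - 1);
	/* mark_need_init(MAX_SWAPFILES); would trip the assert, as it should */
	return 0;
}
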
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de984159cf0b..5b2c6875fc38 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -784,7 +784,6 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
784 if (!pmd_none(*pmd)) 784 if (!pmd_none(*pmd))
785 return false; 785 return false;
786 entry = mk_pmd(zero_page, vma->vm_page_prot); 786 entry = mk_pmd(zero_page, vma->vm_page_prot);
787 entry = pmd_wrprotect(entry);
788 entry = pmd_mkhuge(entry); 787 entry = pmd_mkhuge(entry);
789 pgtable_trans_huge_deposit(mm, pmd, pgtable); 788 pgtable_trans_huge_deposit(mm, pmd, pgtable);
790 set_pmd_at(mm, haddr, pmd, entry); 789 set_pmd_at(mm, haddr, pmd, entry);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fd722769927..30cd96879152 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2638,8 +2638,9 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
2638 2638
2639 tlb_start_vma(tlb, vma); 2639 tlb_start_vma(tlb, vma);
2640 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2640 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
2641 address = start;
2641again: 2642again:
2642 for (address = start; address < end; address += sz) { 2643 for (; address < end; address += sz) {
2643 ptep = huge_pte_offset(mm, address); 2644 ptep = huge_pte_offset(mm, address);
2644 if (!ptep) 2645 if (!ptep)
2645 continue; 2646 continue;
@@ -2686,6 +2687,7 @@ again:
2686 page_remove_rmap(page); 2687 page_remove_rmap(page);
2687 force_flush = !__tlb_remove_page(tlb, page); 2688 force_flush = !__tlb_remove_page(tlb, page);
2688 if (force_flush) { 2689 if (force_flush) {
2690 address += sz;
2689 spin_unlock(ptl); 2691 spin_unlock(ptl);
2690 break; 2692 break;
2691 } 2693 }
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a67c26e0f360..037e1c00a5b7 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -14,6 +14,7 @@
14 */ 14 */
15 15
16#include <linux/cgroup.h> 16#include <linux/cgroup.h>
17#include <linux/page_counter.h>
17#include <linux/slab.h> 18#include <linux/slab.h>
18#include <linux/hugetlb.h> 19#include <linux/hugetlb.h>
19#include <linux/hugetlb_cgroup.h> 20#include <linux/hugetlb_cgroup.h>
@@ -23,7 +24,7 @@ struct hugetlb_cgroup {
23 /* 24 /*
24 * the counter to account for hugepages from hugetlb. 25 * the counter to account for hugepages from hugetlb.
25 */ 26 */
26 struct res_counter hugepage[HUGE_MAX_HSTATE]; 27 struct page_counter hugepage[HUGE_MAX_HSTATE];
27}; 28};
28 29
29#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) 30#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
60 int idx; 61 int idx;
61 62
62 for (idx = 0; idx < hugetlb_max_hstate; idx++) { 63 for (idx = 0; idx < hugetlb_max_hstate; idx++) {
63 if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0) 64 if (page_counter_read(&h_cg->hugepage[idx]))
64 return true; 65 return true;
65 } 66 }
66 return false; 67 return false;
@@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
79 80
80 if (parent_h_cgroup) { 81 if (parent_h_cgroup) {
81 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 82 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
82 res_counter_init(&h_cgroup->hugepage[idx], 83 page_counter_init(&h_cgroup->hugepage[idx],
83 &parent_h_cgroup->hugepage[idx]); 84 &parent_h_cgroup->hugepage[idx]);
84 } else { 85 } else {
85 root_h_cgroup = h_cgroup; 86 root_h_cgroup = h_cgroup;
86 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 87 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
87 res_counter_init(&h_cgroup->hugepage[idx], NULL); 88 page_counter_init(&h_cgroup->hugepage[idx], NULL);
88 } 89 }
89 return &h_cgroup->css; 90 return &h_cgroup->css;
90} 91}
@@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
108static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, 109static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
109 struct page *page) 110 struct page *page)
110{ 111{
111 int csize; 112 unsigned int nr_pages;
112 struct res_counter *counter; 113 struct page_counter *counter;
113 struct res_counter *fail_res;
114 struct hugetlb_cgroup *page_hcg; 114 struct hugetlb_cgroup *page_hcg;
115 struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); 115 struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
116 116
@@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
123 if (!page_hcg || page_hcg != h_cg) 123 if (!page_hcg || page_hcg != h_cg)
124 goto out; 124 goto out;
125 125
126 csize = PAGE_SIZE << compound_order(page); 126 nr_pages = 1 << compound_order(page);
127 if (!parent) { 127 if (!parent) {
128 parent = root_h_cgroup; 128 parent = root_h_cgroup;
129 /* root has no limit */ 129 /* root has no limit */
130 res_counter_charge_nofail(&parent->hugepage[idx], 130 page_counter_charge(&parent->hugepage[idx], nr_pages);
131 csize, &fail_res);
132 } 131 }
133 counter = &h_cg->hugepage[idx]; 132 counter = &h_cg->hugepage[idx];
134 res_counter_uncharge_until(counter, counter->parent, csize); 133 /* Take the pages off the local counter */
134 page_counter_cancel(counter, nr_pages);
135 135
136 set_hugetlb_cgroup(page, parent); 136 set_hugetlb_cgroup(page, parent);
137out: 137out:
@@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
166 struct hugetlb_cgroup **ptr) 166 struct hugetlb_cgroup **ptr)
167{ 167{
168 int ret = 0; 168 int ret = 0;
169 struct res_counter *fail_res; 169 struct page_counter *counter;
170 struct hugetlb_cgroup *h_cg = NULL; 170 struct hugetlb_cgroup *h_cg = NULL;
171 unsigned long csize = nr_pages * PAGE_SIZE;
172 171
173 if (hugetlb_cgroup_disabled()) 172 if (hugetlb_cgroup_disabled())
174 goto done; 173 goto done;
@@ -187,7 +186,7 @@ again:
187 } 186 }
188 rcu_read_unlock(); 187 rcu_read_unlock();
189 188
190 ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res); 189 ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
191 css_put(&h_cg->css); 190 css_put(&h_cg->css);
192done: 191done:
193 *ptr = h_cg; 192 *ptr = h_cg;
@@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
213 struct page *page) 212 struct page *page)
214{ 213{
215 struct hugetlb_cgroup *h_cg; 214 struct hugetlb_cgroup *h_cg;
216 unsigned long csize = nr_pages * PAGE_SIZE;
217 215
218 if (hugetlb_cgroup_disabled()) 216 if (hugetlb_cgroup_disabled())
219 return; 217 return;
@@ -222,61 +220,76 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
222 if (unlikely(!h_cg)) 220 if (unlikely(!h_cg))
223 return; 221 return;
224 set_hugetlb_cgroup(page, NULL); 222 set_hugetlb_cgroup(page, NULL);
225 res_counter_uncharge(&h_cg->hugepage[idx], csize); 223 page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
226 return; 224 return;
227} 225}
228 226
229void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, 227void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
230 struct hugetlb_cgroup *h_cg) 228 struct hugetlb_cgroup *h_cg)
231{ 229{
232 unsigned long csize = nr_pages * PAGE_SIZE;
233
234 if (hugetlb_cgroup_disabled() || !h_cg) 230 if (hugetlb_cgroup_disabled() || !h_cg)
235 return; 231 return;
236 232
237 if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) 233 if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
238 return; 234 return;
239 235
240 res_counter_uncharge(&h_cg->hugepage[idx], csize); 236 page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
241 return; 237 return;
242} 238}
243 239
240enum {
241 RES_USAGE,
242 RES_LIMIT,
243 RES_MAX_USAGE,
244 RES_FAILCNT,
245};
246
244static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, 247static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
245 struct cftype *cft) 248 struct cftype *cft)
246{ 249{
247 int idx, name; 250 struct page_counter *counter;
248 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); 251 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
249 252
250 idx = MEMFILE_IDX(cft->private); 253 counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
251 name = MEMFILE_ATTR(cft->private);
252 254
253 return res_counter_read_u64(&h_cg->hugepage[idx], name); 255 switch (MEMFILE_ATTR(cft->private)) {
256 case RES_USAGE:
257 return (u64)page_counter_read(counter) * PAGE_SIZE;
258 case RES_LIMIT:
259 return (u64)counter->limit * PAGE_SIZE;
260 case RES_MAX_USAGE:
261 return (u64)counter->watermark * PAGE_SIZE;
262 case RES_FAILCNT:
263 return counter->failcnt;
264 default:
265 BUG();
266 }
254} 267}
255 268
269static DEFINE_MUTEX(hugetlb_limit_mutex);
270
256static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, 271static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
257 char *buf, size_t nbytes, loff_t off) 272 char *buf, size_t nbytes, loff_t off)
258{ 273{
259 int idx, name, ret; 274 int ret, idx;
260 unsigned long long val; 275 unsigned long nr_pages;
261 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); 276 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
262 277
278 if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
279 return -EINVAL;
280
263 buf = strstrip(buf); 281 buf = strstrip(buf);
282 ret = page_counter_memparse(buf, &nr_pages);
283 if (ret)
284 return ret;
285
264 idx = MEMFILE_IDX(of_cft(of)->private); 286 idx = MEMFILE_IDX(of_cft(of)->private);
265 name = MEMFILE_ATTR(of_cft(of)->private);
266 287
267 switch (name) { 288 switch (MEMFILE_ATTR(of_cft(of)->private)) {
268 case RES_LIMIT: 289 case RES_LIMIT:
269 if (hugetlb_cgroup_is_root(h_cg)) { 290 mutex_lock(&hugetlb_limit_mutex);
270 /* Can't set limit on root */ 291 ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
271 ret = -EINVAL; 292 mutex_unlock(&hugetlb_limit_mutex);
272 break;
273 }
274 /* This function does all necessary parse...reuse it */
275 ret = res_counter_memparse_write_strategy(buf, &val);
276 if (ret)
277 break;
278 val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
279 ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
280 break; 293 break;
281 default: 294 default:
282 ret = -EINVAL; 295 ret = -EINVAL;
@@ -288,18 +301,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
288static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, 301static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
289 char *buf, size_t nbytes, loff_t off) 302 char *buf, size_t nbytes, loff_t off)
290{ 303{
291 int idx, name, ret = 0; 304 int ret = 0;
305 struct page_counter *counter;
292 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); 306 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
293 307
294 idx = MEMFILE_IDX(of_cft(of)->private); 308 counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
295 name = MEMFILE_ATTR(of_cft(of)->private);
296 309
297 switch (name) { 310 switch (MEMFILE_ATTR(of_cft(of)->private)) {
298 case RES_MAX_USAGE: 311 case RES_MAX_USAGE:
299 res_counter_reset_max(&h_cg->hugepage[idx]); 312 page_counter_reset_watermark(counter);
300 break; 313 break;
301 case RES_FAILCNT: 314 case RES_FAILCNT:
302 res_counter_reset_failcnt(&h_cg->hugepage[idx]); 315 counter->failcnt = 0;
303 break; 316 break;
304 default: 317 default:
305 ret = -EINVAL; 318 ret = -EINVAL;
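The hugetlb controller is converted above from the byte-based res_counter API to the page-based page_counter API: charge and uncharge paths pass nr_pages directly instead of a csize in bytes, the read side multiplies the page count back up by PAGE_SIZE for the u64 control files, and limit writes go through page_counter_memparse() and page_counter_limit() under the new hugetlb_limit_mutex. Below is a minimal, single-threaded userspace model of the usage/limit/watermark/failcnt bookkeeping those calls rely on; the names only mimic the kernel API, and the hard-coded 4096-byte page size is an assumption made for the example.

#include <stdio.h>

/*
 * Simplified model of a page counter: all quantities are in pages, and
 * conversion to bytes happens only when values are reported, as in
 * hugetlb_cgroup_read_u64() above. Not the kernel implementation:
 * no hierarchy, no atomics.
 */
struct page_counter_model {
        unsigned long count;            /* pages currently charged */
        unsigned long limit;            /* hard limit, in pages */
        unsigned long watermark;        /* highest count ever observed */
        unsigned long failcnt;          /* number of rejected charges */
};

static int try_charge(struct page_counter_model *c, unsigned long nr_pages)
{
        if (c->count + nr_pages > c->limit) {
                c->failcnt++;
                return -1;              /* the kernel returns -ENOMEM here */
        }
        c->count += nr_pages;
        if (c->count > c->watermark)
                c->watermark = c->count;
        return 0;
}

static void uncharge(struct page_counter_model *c, unsigned long nr_pages)
{
        c->count -= nr_pages;
}

int main(void)
{
        struct page_counter_model c = { .limit = 512 }; /* 512 x 4K = 2M limit */

        try_charge(&c, 512);            /* succeeds, watermark becomes 512 */
        uncharge(&c, 512);
        try_charge(&c, 1024);           /* rejected, failcnt becomes 1 */

        printf("usage=%lu limit=%lu max=%lu failcnt=%lu\n",
               c.count * 4096, c.limit * 4096,
               c.watermark * 4096, c.failcnt);
        return 0;
}

Keeping the counters in pages is what lets the hugetlb and memcg paths above drop the PAGE_SIZE scaling from their hot paths and confine it to the few places that present bytes to userspace.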
diff --git a/mm/internal.h b/mm/internal.h
index a4f90ba7068e..efad241f7014 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -161,13 +161,10 @@ struct compact_control {
161 unsigned long migrate_pfn; /* isolate_migratepages search base */ 161 unsigned long migrate_pfn; /* isolate_migratepages search base */
162 enum migrate_mode mode; /* Async or sync migration mode */ 162 enum migrate_mode mode; /* Async or sync migration mode */
163 bool ignore_skip_hint; /* Scan blocks even if marked skip */ 163 bool ignore_skip_hint; /* Scan blocks even if marked skip */
164 bool finished_update_free; /* True when the zone cached pfns are
165 * no longer being updated
166 */
167 bool finished_update_migrate;
168
169 int order; /* order a direct compactor needs */ 164 int order; /* order a direct compactor needs */
170 const gfp_t gfp_mask; /* gfp mask of a direct compactor */ 165 const gfp_t gfp_mask; /* gfp mask of a direct compactor */
166 const int alloc_flags; /* alloc flags of a direct compactor */
167 const int classzone_idx; /* zone index of a direct compactor */
171 struct zone *zone; 168 struct zone *zone;
172 int contended; /* Signal need_sched() or lock 169 int contended; /* Signal need_sched() or lock
173 * contention detected during 170 * contention detected during
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee48428cf8e3..85df503ec023 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,7 +25,7 @@
25 * GNU General Public License for more details. 25 * GNU General Public License for more details.
26 */ 26 */
27 27
28#include <linux/res_counter.h> 28#include <linux/page_counter.h>
29#include <linux/memcontrol.h> 29#include <linux/memcontrol.h>
30#include <linux/cgroup.h> 30#include <linux/cgroup.h>
31#include <linux/mm.h> 31#include <linux/mm.h>
@@ -51,7 +51,7 @@
51#include <linux/seq_file.h> 51#include <linux/seq_file.h>
52#include <linux/vmpressure.h> 52#include <linux/vmpressure.h>
53#include <linux/mm_inline.h> 53#include <linux/mm_inline.h>
54#include <linux/page_cgroup.h> 54#include <linux/swap_cgroup.h>
55#include <linux/cpu.h> 55#include <linux/cpu.h>
56#include <linux/oom.h> 56#include <linux/oom.h>
57#include <linux/lockdep.h> 57#include <linux/lockdep.h>
@@ -143,14 +143,8 @@ struct mem_cgroup_stat_cpu {
143 unsigned long targets[MEM_CGROUP_NTARGETS]; 143 unsigned long targets[MEM_CGROUP_NTARGETS];
144}; 144};
145 145
146struct mem_cgroup_reclaim_iter { 146struct reclaim_iter {
147 /* 147 struct mem_cgroup *position;
148 * last scanned hierarchy member. Valid only if last_dead_count
149 * matches memcg->dead_count of the hierarchy root group.
150 */
151 struct mem_cgroup *last_visited;
152 int last_dead_count;
153
154 /* scan generation, increased every round-trip */ 148 /* scan generation, increased every round-trip */
155 unsigned int generation; 149 unsigned int generation;
156}; 150};
@@ -162,10 +156,10 @@ struct mem_cgroup_per_zone {
162 struct lruvec lruvec; 156 struct lruvec lruvec;
163 unsigned long lru_size[NR_LRU_LISTS]; 157 unsigned long lru_size[NR_LRU_LISTS];
164 158
165 struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; 159 struct reclaim_iter iter[DEF_PRIORITY + 1];
166 160
167 struct rb_node tree_node; /* RB tree node */ 161 struct rb_node tree_node; /* RB tree node */
168 unsigned long long usage_in_excess;/* Set to the value by which */ 162 unsigned long usage_in_excess;/* Set to the value by which */
169 /* the soft limit is exceeded*/ 163 /* the soft limit is exceeded*/
170 bool on_tree; 164 bool on_tree;
171 struct mem_cgroup *memcg; /* Back pointer, we cannot */ 165 struct mem_cgroup *memcg; /* Back pointer, we cannot */
@@ -198,7 +192,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
198 192
199struct mem_cgroup_threshold { 193struct mem_cgroup_threshold {
200 struct eventfd_ctx *eventfd; 194 struct eventfd_ctx *eventfd;
201 u64 threshold; 195 unsigned long threshold;
202}; 196};
203 197
204/* For threshold */ 198/* For threshold */
@@ -284,10 +278,13 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
284 */ 278 */
285struct mem_cgroup { 279struct mem_cgroup {
286 struct cgroup_subsys_state css; 280 struct cgroup_subsys_state css;
287 /* 281
288 * the counter to account for memory usage 282 /* Accounted resources */
289 */ 283 struct page_counter memory;
290 struct res_counter res; 284 struct page_counter memsw;
285 struct page_counter kmem;
286
287 unsigned long soft_limit;
291 288
292 /* vmpressure notifications */ 289 /* vmpressure notifications */
293 struct vmpressure vmpressure; 290 struct vmpressure vmpressure;
@@ -296,15 +293,6 @@ struct mem_cgroup {
296 int initialized; 293 int initialized;
297 294
298 /* 295 /*
299 * the counter to account for mem+swap usage.
300 */
301 struct res_counter memsw;
302
303 /*
304 * the counter to account for kernel memory usage.
305 */
306 struct res_counter kmem;
307 /*
308 * Should the accounting and control be hierarchical, per subtree? 296 * Should the accounting and control be hierarchical, per subtree?
309 */ 297 */
310 bool use_hierarchy; 298 bool use_hierarchy;
@@ -352,7 +340,6 @@ struct mem_cgroup {
352 struct mem_cgroup_stat_cpu nocpu_base; 340 struct mem_cgroup_stat_cpu nocpu_base;
353 spinlock_t pcp_counter_lock; 341 spinlock_t pcp_counter_lock;
354 342
355 atomic_t dead_count;
356#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) 343#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
357 struct cg_proto tcp_mem; 344 struct cg_proto tcp_mem;
358#endif 345#endif
@@ -382,7 +369,6 @@ struct mem_cgroup {
382/* internal only representation about the status of kmem accounting. */ 369/* internal only representation about the status of kmem accounting. */
383enum { 370enum {
384 KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */ 371 KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
385 KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
386}; 372};
387 373
388#ifdef CONFIG_MEMCG_KMEM 374#ifdef CONFIG_MEMCG_KMEM
@@ -396,22 +382,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
396 return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags); 382 return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
397} 383}
398 384
399static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
400{
401 /*
402 * Our caller must use css_get() first, because memcg_uncharge_kmem()
403 * will call css_put() if it sees the memcg is dead.
404 */
405 smp_wmb();
406 if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
407 set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
408}
409
410static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
411{
412 return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
413 &memcg->kmem_account_flags);
414}
415#endif 385#endif
416 386
417/* Stuffs for move charges at task migration. */ 387/* Stuffs for move charges at task migration. */
@@ -650,7 +620,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
650 * This check can't live in kmem destruction function, 620 * This check can't live in kmem destruction function,
651 * since the charges will outlive the cgroup 621 * since the charges will outlive the cgroup
652 */ 622 */
653 WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0); 623 WARN_ON(page_counter_read(&memcg->kmem));
654} 624}
655#else 625#else
656static void disarm_kmem_keys(struct mem_cgroup *memcg) 626static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -664,8 +634,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
664 disarm_kmem_keys(memcg); 634 disarm_kmem_keys(memcg);
665} 635}
666 636
667static void drain_all_stock_async(struct mem_cgroup *memcg);
668
669static struct mem_cgroup_per_zone * 637static struct mem_cgroup_per_zone *
670mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) 638mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
671{ 639{
@@ -706,7 +674,7 @@ soft_limit_tree_from_page(struct page *page)
706 674
707static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz, 675static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
708 struct mem_cgroup_tree_per_zone *mctz, 676 struct mem_cgroup_tree_per_zone *mctz,
709 unsigned long long new_usage_in_excess) 677 unsigned long new_usage_in_excess)
710{ 678{
711 struct rb_node **p = &mctz->rb_root.rb_node; 679 struct rb_node **p = &mctz->rb_root.rb_node;
712 struct rb_node *parent = NULL; 680 struct rb_node *parent = NULL;
@@ -755,10 +723,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
755 spin_unlock_irqrestore(&mctz->lock, flags); 723 spin_unlock_irqrestore(&mctz->lock, flags);
756} 724}
757 725
726static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
727{
728 unsigned long nr_pages = page_counter_read(&memcg->memory);
729 unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
730 unsigned long excess = 0;
731
732 if (nr_pages > soft_limit)
733 excess = nr_pages - soft_limit;
734
735 return excess;
736}
758 737
759static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) 738static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
760{ 739{
761 unsigned long long excess; 740 unsigned long excess;
762 struct mem_cgroup_per_zone *mz; 741 struct mem_cgroup_per_zone *mz;
763 struct mem_cgroup_tree_per_zone *mctz; 742 struct mem_cgroup_tree_per_zone *mctz;
764 743
@@ -769,7 +748,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
769 */ 748 */
770 for (; memcg; memcg = parent_mem_cgroup(memcg)) { 749 for (; memcg; memcg = parent_mem_cgroup(memcg)) {
771 mz = mem_cgroup_page_zoneinfo(memcg, page); 750 mz = mem_cgroup_page_zoneinfo(memcg, page);
772 excess = res_counter_soft_limit_excess(&memcg->res); 751 excess = soft_limit_excess(memcg);
773 /* 752 /*
774 * We have to update the tree if mz is on RB-tree or 753 * We have to update the tree if mz is on RB-tree or
775 * mem is over its softlimit. 754 * mem is over its softlimit.
@@ -825,7 +804,7 @@ retry:
825 * position in the tree. 804 * position in the tree.
826 */ 805 */
827 __mem_cgroup_remove_exceeded(mz, mctz); 806 __mem_cgroup_remove_exceeded(mz, mctz);
828 if (!res_counter_soft_limit_excess(&mz->memcg->res) || 807 if (!soft_limit_excess(mz->memcg) ||
829 !css_tryget_online(&mz->memcg->css)) 808 !css_tryget_online(&mz->memcg->css))
830 goto retry; 809 goto retry;
831done: 810done:
@@ -1062,122 +1041,6 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
1062 return memcg; 1041 return memcg;
1063} 1042}
1064 1043
1065/*
1066 * Returns a next (in a pre-order walk) alive memcg (with elevated css
1067 * ref. count) or NULL if the whole root's subtree has been visited.
1068 *
1069 * helper function to be used by mem_cgroup_iter
1070 */
1071static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
1072 struct mem_cgroup *last_visited)
1073{
1074 struct cgroup_subsys_state *prev_css, *next_css;
1075
1076 prev_css = last_visited ? &last_visited->css : NULL;
1077skip_node:
1078 next_css = css_next_descendant_pre(prev_css, &root->css);
1079
1080 /*
1081 * Even if we found a group we have to make sure it is
1082 * alive. css && !memcg means that the groups should be
1083 * skipped and we should continue the tree walk.
1084 * last_visited css is safe to use because it is
1085 * protected by css_get and the tree walk is rcu safe.
1086 *
1087 * We do not take a reference on the root of the tree walk
1088 * because we might race with the root removal when it would
1089 * be the only node in the iterated hierarchy and mem_cgroup_iter
1090 * would end up in an endless loop because it expects that at
1091 * least one valid node will be returned. Root cannot disappear
1092 * because caller of the iterator should hold it already so
1093 * skipping css reference should be safe.
1094 */
1095 if (next_css) {
1096 struct mem_cgroup *memcg = mem_cgroup_from_css(next_css);
1097
1098 if (next_css == &root->css)
1099 return memcg;
1100
1101 if (css_tryget_online(next_css)) {
1102 /*
1103 * Make sure the memcg is initialized:
1104 * mem_cgroup_css_online() orders the the
1105 * initialization against setting the flag.
1106 */
1107 if (smp_load_acquire(&memcg->initialized))
1108 return memcg;
1109 css_put(next_css);
1110 }
1111
1112 prev_css = next_css;
1113 goto skip_node;
1114 }
1115
1116 return NULL;
1117}
1118
1119static void mem_cgroup_iter_invalidate(struct mem_cgroup *root)
1120{
1121 /*
1122 * When a group in the hierarchy below root is destroyed, the
1123 * hierarchy iterator can no longer be trusted since it might
1124 * have pointed to the destroyed group. Invalidate it.
1125 */
1126 atomic_inc(&root->dead_count);
1127}
1128
1129static struct mem_cgroup *
1130mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
1131 struct mem_cgroup *root,
1132 int *sequence)
1133{
1134 struct mem_cgroup *position = NULL;
1135 /*
1136 * A cgroup destruction happens in two stages: offlining and
1137 * release. They are separated by a RCU grace period.
1138 *
1139 * If the iterator is valid, we may still race with an
1140 * offlining. The RCU lock ensures the object won't be
1141 * released, tryget will fail if we lost the race.
1142 */
1143 *sequence = atomic_read(&root->dead_count);
1144 if (iter->last_dead_count == *sequence) {
1145 smp_rmb();
1146 position = iter->last_visited;
1147
1148 /*
1149 * We cannot take a reference to root because we might race
1150 * with root removal and returning NULL would end up in
1151 * an endless loop on the iterator user level when root
1152 * would be returned all the time.
1153 */
1154 if (position && position != root &&
1155 !css_tryget_online(&position->css))
1156 position = NULL;
1157 }
1158 return position;
1159}
1160
1161static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
1162 struct mem_cgroup *last_visited,
1163 struct mem_cgroup *new_position,
1164 struct mem_cgroup *root,
1165 int sequence)
1166{
1167 /* root reference counting symmetric to mem_cgroup_iter_load */
1168 if (last_visited && last_visited != root)
1169 css_put(&last_visited->css);
1170 /*
1171 * We store the sequence count from the time @last_visited was
1172 * loaded successfully instead of rereading it here so that we
1173 * don't lose destruction events in between. We could have
1174 * raced with the destruction of @new_position after all.
1175 */
1176 iter->last_visited = new_position;
1177 smp_wmb();
1178 iter->last_dead_count = sequence;
1179}
1180
1181/** 1044/**
1182 * mem_cgroup_iter - iterate over memory cgroup hierarchy 1045 * mem_cgroup_iter - iterate over memory cgroup hierarchy
1183 * @root: hierarchy root 1046 * @root: hierarchy root
@@ -1199,8 +1062,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
1199 struct mem_cgroup *prev, 1062 struct mem_cgroup *prev,
1200 struct mem_cgroup_reclaim_cookie *reclaim) 1063 struct mem_cgroup_reclaim_cookie *reclaim)
1201{ 1064{
1065 struct reclaim_iter *uninitialized_var(iter);
1066 struct cgroup_subsys_state *css = NULL;
1202 struct mem_cgroup *memcg = NULL; 1067 struct mem_cgroup *memcg = NULL;
1203 struct mem_cgroup *last_visited = NULL; 1068 struct mem_cgroup *pos = NULL;
1204 1069
1205 if (mem_cgroup_disabled()) 1070 if (mem_cgroup_disabled())
1206 return NULL; 1071 return NULL;
@@ -1209,50 +1074,101 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
1209 root = root_mem_cgroup; 1074 root = root_mem_cgroup;
1210 1075
1211 if (prev && !reclaim) 1076 if (prev && !reclaim)
1212 last_visited = prev; 1077 pos = prev;
1213 1078
1214 if (!root->use_hierarchy && root != root_mem_cgroup) { 1079 if (!root->use_hierarchy && root != root_mem_cgroup) {
1215 if (prev) 1080 if (prev)
1216 goto out_css_put; 1081 goto out;
1217 return root; 1082 return root;
1218 } 1083 }
1219 1084
1220 rcu_read_lock(); 1085 rcu_read_lock();
1221 while (!memcg) {
1222 struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
1223 int uninitialized_var(seq);
1224
1225 if (reclaim) {
1226 struct mem_cgroup_per_zone *mz;
1227
1228 mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
1229 iter = &mz->reclaim_iter[reclaim->priority];
1230 if (prev && reclaim->generation != iter->generation) {
1231 iter->last_visited = NULL;
1232 goto out_unlock;
1233 }
1234 1086
1235 last_visited = mem_cgroup_iter_load(iter, root, &seq); 1087 if (reclaim) {
1088 struct mem_cgroup_per_zone *mz;
1089
1090 mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
1091 iter = &mz->iter[reclaim->priority];
1092
1093 if (prev && reclaim->generation != iter->generation)
1094 goto out_unlock;
1095
1096 do {
1097 pos = ACCESS_ONCE(iter->position);
1098 /*
1099 * A racing update may change the position and
1100 * put the last reference, hence css_tryget(),
1101 * or retry to see the updated position.
1102 */
1103 } while (pos && !css_tryget(&pos->css));
1104 }
1105
1106 if (pos)
1107 css = &pos->css;
1108
1109 for (;;) {
1110 css = css_next_descendant_pre(css, &root->css);
1111 if (!css) {
1112 /*
1113 * Reclaimers share the hierarchy walk, and a
1114 * new one might jump in right at the end of
1115 * the hierarchy - make sure they see at least
1116 * one group and restart from the beginning.
1117 */
1118 if (!prev)
1119 continue;
1120 break;
1236 } 1121 }
1237 1122
1238 memcg = __mem_cgroup_iter_next(root, last_visited); 1123 /*
1124 * Verify the css and acquire a reference. The root
1125 * is provided by the caller, so we know it's alive
1126 * and kicking, and don't take an extra reference.
1127 */
1128 memcg = mem_cgroup_from_css(css);
1129
1130 if (css == &root->css)
1131 break;
1239 1132
1240 if (reclaim) { 1133 if (css_tryget(css)) {
1241 mem_cgroup_iter_update(iter, last_visited, memcg, root, 1134 /*
1242 seq); 1135 * Make sure the memcg is initialized:
1136 * mem_cgroup_css_online() orders the the
1137 * initialization against setting the flag.
1138 */
1139 if (smp_load_acquire(&memcg->initialized))
1140 break;
1243 1141
1244 if (!memcg) 1142 css_put(css);
1245 iter->generation++;
1246 else if (!prev && memcg)
1247 reclaim->generation = iter->generation;
1248 } 1143 }
1249 1144
1250 if (prev && !memcg) 1145 memcg = NULL;
1251 goto out_unlock; 1146 }
1147
1148 if (reclaim) {
1149 if (cmpxchg(&iter->position, pos, memcg) == pos) {
1150 if (memcg)
1151 css_get(&memcg->css);
1152 if (pos)
1153 css_put(&pos->css);
1154 }
1155
1156 /*
1157 * pairs with css_tryget when dereferencing iter->position
1158 * above.
1159 */
1160 if (pos)
1161 css_put(&pos->css);
1162
1163 if (!memcg)
1164 iter->generation++;
1165 else if (!prev)
1166 reclaim->generation = iter->generation;
1252 } 1167 }
1168
1253out_unlock: 1169out_unlock:
1254 rcu_read_unlock(); 1170 rcu_read_unlock();
1255out_css_put: 1171out:
1256 if (prev && prev != root) 1172 if (prev && prev != root)
1257 css_put(&prev->css); 1173 css_put(&prev->css);
1258 1174
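The rewritten mem_cgroup_iter() above replaces the dead_count/last_visited validation scheme with a simpler shared-cursor protocol: a reclaimer loads the per-zone position with ACCESS_ONCE(), pins it with css_tryget() (retrying if a racing update dropped the last reference), walks the hierarchy with css_next_descendant_pre(), and publishes the next position with a single cmpxchg() so that exactly one of several racing reclaimers transfers the cursor's reference from the old position to the new one. The sketch below models just that publish step in userspace with C11 atomics; struct node and its helpers are invented stand-ins for css reference counting, not kernel API.

#include <stdatomic.h>
#include <stdbool.h>

struct node {
        _Atomic long refcnt;            /* stands in for the css refcount */
};

static void node_get(struct node *n)
{
        atomic_fetch_add(&n->refcnt, 1);
}

static void node_put(struct node *n)
{
        atomic_fetch_sub(&n->refcnt, 1);
}

/* like css_tryget(): only succeeds while at least one reference remains */
static bool node_tryget(struct node *n)
{
        long old = atomic_load(&n->refcnt);

        while (old > 0)
                if (atomic_compare_exchange_weak(&n->refcnt, &old, old + 1))
                        return true;
        return false;
}

/* shared iterator cursor; the kernel keeps one per (zone, priority) */
static struct node *_Atomic cursor;

/* publish @next as the new position, as the reclaim branch above does */
static void publish(struct node *prev, struct node *next)
{
        struct node *expected = prev;

        if (atomic_compare_exchange_strong(&cursor, &expected, next)) {
                /* we won the race: the cursor's reference moves prev -> next */
                if (next)
                        node_get(next);
                if (prev)
                        node_put(prev);
        }
        /* drop the pin taken with node_tryget() when @prev was loaded */
        if (prev)
                node_put(prev);
}

int main(void)
{
        struct node a = { 1 }, b = { 1 };       /* one base reference each */

        atomic_store(&cursor, &a);
        node_get(&a);                           /* the cursor's reference on a */

        if (node_tryget(&a))                    /* a reclaimer pins the cursor, */
                publish(&a, &b);                /* walks on to b, publishes it  */

        return 0;       /* a ends with 1 ref, b with 2 (base + cursor) */
}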
@@ -1346,15 +1262,18 @@ out:
1346} 1262}
1347 1263
1348/** 1264/**
1349 * mem_cgroup_page_lruvec - return lruvec for adding an lru page 1265 * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
1350 * @page: the page 1266 * @page: the page
1351 * @zone: zone of the page 1267 * @zone: zone of the page
1268 *
1269 * This function is only safe when following the LRU page isolation
1270 * and putback protocol: the LRU lock must be held, and the page must
1271 * either be PageLRU() or the caller must have isolated/allocated it.
1352 */ 1272 */
1353struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone) 1273struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
1354{ 1274{
1355 struct mem_cgroup_per_zone *mz; 1275 struct mem_cgroup_per_zone *mz;
1356 struct mem_cgroup *memcg; 1276 struct mem_cgroup *memcg;
1357 struct page_cgroup *pc;
1358 struct lruvec *lruvec; 1277 struct lruvec *lruvec;
1359 1278
1360 if (mem_cgroup_disabled()) { 1279 if (mem_cgroup_disabled()) {
@@ -1362,20 +1281,13 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
1362 goto out; 1281 goto out;
1363 } 1282 }
1364 1283
1365 pc = lookup_page_cgroup(page); 1284 memcg = page->mem_cgroup;
1366 memcg = pc->mem_cgroup;
1367
1368 /* 1285 /*
1369 * Surreptitiously switch any uncharged offlist page to root: 1286 * Swapcache readahead pages are added to the LRU - and
1370 * an uncharged page off lru does nothing to secure 1287 * possibly migrated - before they are charged.
1371 * its former mem_cgroup from sudden removal.
1372 *
1373 * Our caller holds lru_lock, and PageCgroupUsed is updated
1374 * under page_cgroup lock: between them, they make all uses
1375 * of pc->mem_cgroup safe.
1376 */ 1288 */
1377 if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup) 1289 if (!memcg)
1378 pc->mem_cgroup = memcg = root_mem_cgroup; 1290 memcg = root_mem_cgroup;
1379 1291
1380 mz = mem_cgroup_page_zoneinfo(memcg, page); 1292 mz = mem_cgroup_page_zoneinfo(memcg, page);
1381 lruvec = &mz->lruvec; 1293 lruvec = &mz->lruvec;
@@ -1414,41 +1326,24 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
1414 VM_BUG_ON((long)(*lru_size) < 0); 1326 VM_BUG_ON((long)(*lru_size) < 0);
1415} 1327}
1416 1328
1417/* 1329bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)
1418 * Checks whether given mem is same or in the root_mem_cgroup's
1419 * hierarchy subtree
1420 */
1421bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
1422 struct mem_cgroup *memcg)
1423{ 1330{
1424 if (root_memcg == memcg) 1331 if (root == memcg)
1425 return true; 1332 return true;
1426 if (!root_memcg->use_hierarchy || !memcg) 1333 if (!root->use_hierarchy)
1427 return false; 1334 return false;
1428 return cgroup_is_descendant(memcg->css.cgroup, root_memcg->css.cgroup); 1335 return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
1429}
1430
1431static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
1432 struct mem_cgroup *memcg)
1433{
1434 bool ret;
1435
1436 rcu_read_lock();
1437 ret = __mem_cgroup_same_or_subtree(root_memcg, memcg);
1438 rcu_read_unlock();
1439 return ret;
1440} 1336}
1441 1337
1442bool task_in_mem_cgroup(struct task_struct *task, 1338bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
1443 const struct mem_cgroup *memcg)
1444{ 1339{
1445 struct mem_cgroup *curr = NULL; 1340 struct mem_cgroup *task_memcg;
1446 struct task_struct *p; 1341 struct task_struct *p;
1447 bool ret; 1342 bool ret;
1448 1343
1449 p = find_lock_task_mm(task); 1344 p = find_lock_task_mm(task);
1450 if (p) { 1345 if (p) {
1451 curr = get_mem_cgroup_from_mm(p->mm); 1346 task_memcg = get_mem_cgroup_from_mm(p->mm);
1452 task_unlock(p); 1347 task_unlock(p);
1453 } else { 1348 } else {
1454 /* 1349 /*
@@ -1457,19 +1352,12 @@ bool task_in_mem_cgroup(struct task_struct *task,
1457 * killed to prevent needlessly killing additional tasks. 1352 * killed to prevent needlessly killing additional tasks.
1458 */ 1353 */
1459 rcu_read_lock(); 1354 rcu_read_lock();
1460 curr = mem_cgroup_from_task(task); 1355 task_memcg = mem_cgroup_from_task(task);
1461 if (curr) 1356 css_get(&task_memcg->css);
1462 css_get(&curr->css);
1463 rcu_read_unlock(); 1357 rcu_read_unlock();
1464 } 1358 }
1465 /* 1359 ret = mem_cgroup_is_descendant(task_memcg, memcg);
1466 * We should check use_hierarchy of "memcg" not "curr". Because checking 1360 css_put(&task_memcg->css);
1467 * use_hierarchy of "curr" here make this function true if hierarchy is
1468 * enabled in "curr" and "curr" is a child of "memcg" in *cgroup*
1469 * hierarchy(even if use_hierarchy is disabled in "memcg").
1470 */
1471 ret = mem_cgroup_same_or_subtree(memcg, curr);
1472 css_put(&curr->css);
1473 return ret; 1361 return ret;
1474} 1362}
1475 1363
@@ -1492,7 +1380,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
1492 return inactive * inactive_ratio < active; 1380 return inactive * inactive_ratio < active;
1493} 1381}
1494 1382
1495#define mem_cgroup_from_res_counter(counter, member) \ 1383#define mem_cgroup_from_counter(counter, member) \
1496 container_of(counter, struct mem_cgroup, member) 1384 container_of(counter, struct mem_cgroup, member)
1497 1385
1498/** 1386/**
@@ -1504,12 +1392,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
1504 */ 1392 */
1505static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) 1393static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
1506{ 1394{
1507 unsigned long long margin; 1395 unsigned long margin = 0;
1396 unsigned long count;
1397 unsigned long limit;
1508 1398
1509 margin = res_counter_margin(&memcg->res); 1399 count = page_counter_read(&memcg->memory);
1510 if (do_swap_account) 1400 limit = ACCESS_ONCE(memcg->memory.limit);
1511 margin = min(margin, res_counter_margin(&memcg->memsw)); 1401 if (count < limit)
1512 return margin >> PAGE_SHIFT; 1402 margin = limit - count;
1403
1404 if (do_swap_account) {
1405 count = page_counter_read(&memcg->memsw);
1406 limit = ACCESS_ONCE(memcg->memsw.limit);
1407 if (count <= limit)
1408 margin = min(margin, limit - count);
1409 }
1410
1411 return margin;
1513} 1412}
1514 1413
1515int mem_cgroup_swappiness(struct mem_cgroup *memcg) 1414int mem_cgroup_swappiness(struct mem_cgroup *memcg)
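mem_cgroup_margin() above now works purely in pages: it computes the headroom below the limit for the memory counter and, when swap accounting is enabled, clamps it by the memsw headroom, rather than shifting a byte-based res_counter_margin() result by PAGE_SHIFT. The same arithmetic in a self-contained form; plain parameters stand in for the ACCESS_ONCE() reads of the counter fields.

#include <stdio.h>

static unsigned long headroom(unsigned long count, unsigned long limit)
{
        /* unsigned arithmetic: never report a negative margin */
        return count < limit ? limit - count : 0;
}

int main(void)
{
        unsigned long mem = headroom(300, 1024);        /* memory: 724 pages left */
        unsigned long memsw = headroom(900, 1024);      /* mem+swap: 124 pages left */
        unsigned long margin = mem < memsw ? mem : memsw;

        /* further charges are bounded by the tighter of the two counters */
        printf("margin = %lu pages\n", margin);         /* prints 124 */
        return 0;
}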
@@ -1522,37 +1421,6 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
1522} 1421}
1523 1422
1524/* 1423/*
1525 * memcg->moving_account is used for checking possibility that some thread is
1526 * calling move_account(). When a thread on CPU-A starts moving pages under
1527 * a memcg, other threads should check memcg->moving_account under
1528 * rcu_read_lock(), like this:
1529 *
1530 * CPU-A CPU-B
1531 * rcu_read_lock()
1532 * memcg->moving_account+1 if (memcg->mocing_account)
1533 * take heavy locks.
1534 * synchronize_rcu() update something.
1535 * rcu_read_unlock()
1536 * start move here.
1537 */
1538
1539static void mem_cgroup_start_move(struct mem_cgroup *memcg)
1540{
1541 atomic_inc(&memcg->moving_account);
1542 synchronize_rcu();
1543}
1544
1545static void mem_cgroup_end_move(struct mem_cgroup *memcg)
1546{
1547 /*
1548 * Now, mem_cgroup_clear_mc() may call this function with NULL.
1549 * We check NULL in callee rather than caller.
1550 */
1551 if (memcg)
1552 atomic_dec(&memcg->moving_account);
1553}
1554
1555/*
1556 * A routine for checking "mem" is under move_account() or not. 1424 * A routine for checking "mem" is under move_account() or not.
1557 * 1425 *
1558 * Checking a cgroup is mc.from or mc.to or under hierarchy of 1426 * Checking a cgroup is mc.from or mc.to or under hierarchy of
@@ -1574,8 +1442,8 @@ static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
1574 if (!from) 1442 if (!from)
1575 goto unlock; 1443 goto unlock;
1576 1444
1577 ret = mem_cgroup_same_or_subtree(memcg, from) 1445 ret = mem_cgroup_is_descendant(from, memcg) ||
1578 || mem_cgroup_same_or_subtree(memcg, to); 1446 mem_cgroup_is_descendant(to, memcg);
1579unlock: 1447unlock:
1580 spin_unlock(&mc.lock); 1448 spin_unlock(&mc.lock);
1581 return ret; 1449 return ret;
@@ -1597,23 +1465,6 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
1597 return false; 1465 return false;
1598} 1466}
1599 1467
1600/*
1601 * Take this lock when
1602 * - a code tries to modify page's memcg while it's USED.
1603 * - a code tries to modify page state accounting in a memcg.
1604 */
1605static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
1606 unsigned long *flags)
1607{
1608 spin_lock_irqsave(&memcg->move_lock, *flags);
1609}
1610
1611static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
1612 unsigned long *flags)
1613{
1614 spin_unlock_irqrestore(&memcg->move_lock, *flags);
1615}
1616
1617#define K(x) ((x) << (PAGE_SHIFT-10)) 1468#define K(x) ((x) << (PAGE_SHIFT-10))
1618/** 1469/**
1619 * mem_cgroup_print_oom_info: Print OOM information relevant to memory controller. 1470 * mem_cgroup_print_oom_info: Print OOM information relevant to memory controller.
@@ -1644,18 +1495,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
1644 1495
1645 rcu_read_unlock(); 1496 rcu_read_unlock();
1646 1497
1647 pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n", 1498 pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
1648 res_counter_read_u64(&memcg->res, RES_USAGE) >> 10, 1499 K((u64)page_counter_read(&memcg->memory)),
1649 res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10, 1500 K((u64)memcg->memory.limit), memcg->memory.failcnt);
1650 res_counter_read_u64(&memcg->res, RES_FAILCNT)); 1501 pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
1651 pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n", 1502 K((u64)page_counter_read(&memcg->memsw)),
1652 res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10, 1503 K((u64)memcg->memsw.limit), memcg->memsw.failcnt);
1653 res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10, 1504 pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
1654 res_counter_read_u64(&memcg->memsw, RES_FAILCNT)); 1505 K((u64)page_counter_read(&memcg->kmem)),
1655 pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n", 1506 K((u64)memcg->kmem.limit), memcg->kmem.failcnt);
1656 res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
1657 res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
1658 res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
1659 1507
1660 for_each_mem_cgroup_tree(iter, memcg) { 1508 for_each_mem_cgroup_tree(iter, memcg) {
1661 pr_info("Memory cgroup stats for "); 1509 pr_info("Memory cgroup stats for ");
@@ -1695,28 +1543,17 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
1695/* 1543/*
1696 * Return the memory (and swap, if configured) limit for a memcg. 1544 * Return the memory (and swap, if configured) limit for a memcg.
1697 */ 1545 */
1698static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) 1546static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
1699{ 1547{
1700 u64 limit; 1548 unsigned long limit;
1701
1702 limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
1703 1549
1704 /* 1550 limit = memcg->memory.limit;
1705 * Do not consider swap space if we cannot swap due to swappiness
1706 */
1707 if (mem_cgroup_swappiness(memcg)) { 1551 if (mem_cgroup_swappiness(memcg)) {
1708 u64 memsw; 1552 unsigned long memsw_limit;
1709 1553
1710 limit += total_swap_pages << PAGE_SHIFT; 1554 memsw_limit = memcg->memsw.limit;
1711 memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 1555 limit = min(limit + total_swap_pages, memsw_limit);
1712
1713 /*
1714 * If memsw is finite and limits the amount of swap space
1715 * available to this memcg, return that limit.
1716 */
1717 limit = min(limit, memsw);
1718 } 1556 }
1719
1720 return limit; 1557 return limit;
1721} 1558}
1722 1559
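mem_cgroup_get_limit() now returns the OOM footprint limit in pages, and the swap handling is condensed: when the group can swap at all, the effective limit is the memory limit plus all of swap, capped by the memsw limit. A small standalone version of that calculation follows; the example values are arbitrary and assume 4K pages.

#include <stdio.h>

/* effective OOM footprint limit, in pages, as computed above */
static unsigned long effective_limit(unsigned long memory_limit,
                                     unsigned long memsw_limit,
                                     unsigned long total_swap_pages,
                                     int swappiness)
{
        unsigned long limit = memory_limit;

        if (swappiness) {
                unsigned long with_swap = limit + total_swap_pages;

                limit = with_swap < memsw_limit ? with_swap : memsw_limit;
        }
        return limit;
}

int main(void)
{
        /* 1G memory limit, 1.5G mem+swap limit, 4G of swap, default swappiness */
        printf("%lu pages\n",
               effective_limit(262144, 393216, 1048576, 60));
        return 0;       /* prints 393216: capped by the memsw limit */
}

mem_cgroup_out_of_memory() then uses this page count directly for totalpages instead of shifting it down by PAGE_SHIFT, which is the change in the hunk that follows.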
@@ -1740,7 +1577,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
1740 } 1577 }
1741 1578
1742 check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); 1579 check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
1743 totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1; 1580 totalpages = mem_cgroup_get_limit(memcg) ? : 1;
1744 for_each_mem_cgroup_tree(iter, memcg) { 1581 for_each_mem_cgroup_tree(iter, memcg) {
1745 struct css_task_iter it; 1582 struct css_task_iter it;
1746 struct task_struct *task; 1583 struct task_struct *task;
@@ -1880,52 +1717,11 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
1880 memcg->last_scanned_node = node; 1717 memcg->last_scanned_node = node;
1881 return node; 1718 return node;
1882} 1719}
1883
1884/*
1885 * Check all nodes whether it contains reclaimable pages or not.
1886 * For quick scan, we make use of scan_nodes. This will allow us to skip
1887 * unused nodes. But scan_nodes is lazily updated and may not cotain
1888 * enough new information. We need to do double check.
1889 */
1890static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
1891{
1892 int nid;
1893
1894 /*
1895 * quick check...making use of scan_node.
1896 * We can skip unused nodes.
1897 */
1898 if (!nodes_empty(memcg->scan_nodes)) {
1899 for (nid = first_node(memcg->scan_nodes);
1900 nid < MAX_NUMNODES;
1901 nid = next_node(nid, memcg->scan_nodes)) {
1902
1903 if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
1904 return true;
1905 }
1906 }
1907 /*
1908 * Check rest of nodes.
1909 */
1910 for_each_node_state(nid, N_MEMORY) {
1911 if (node_isset(nid, memcg->scan_nodes))
1912 continue;
1913 if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
1914 return true;
1915 }
1916 return false;
1917}
1918
1919#else 1720#else
1920int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) 1721int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
1921{ 1722{
1922 return 0; 1723 return 0;
1923} 1724}
1924
1925static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
1926{
1927 return test_mem_cgroup_node_reclaimable(memcg, 0, noswap);
1928}
1929#endif 1725#endif
1930 1726
1931static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, 1727static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
@@ -1943,7 +1739,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
1943 .priority = 0, 1739 .priority = 0,
1944 }; 1740 };
1945 1741
1946 excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; 1742 excess = soft_limit_excess(root_memcg);
1947 1743
1948 while (1) { 1744 while (1) {
1949 victim = mem_cgroup_iter(root_memcg, victim, &reclaim); 1745 victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1969,12 +1765,10 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
1969 } 1765 }
1970 continue; 1766 continue;
1971 } 1767 }
1972 if (!mem_cgroup_reclaimable(victim, false))
1973 continue;
1974 total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, 1768 total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
1975 zone, &nr_scanned); 1769 zone, &nr_scanned);
1976 *total_scanned += nr_scanned; 1770 *total_scanned += nr_scanned;
1977 if (!res_counter_soft_limit_excess(&root_memcg->res)) 1771 if (!soft_limit_excess(root_memcg))
1978 break; 1772 break;
1979 } 1773 }
1980 mem_cgroup_iter_break(root_memcg, victim); 1774 mem_cgroup_iter_break(root_memcg, victim);
@@ -2081,12 +1875,8 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
2081 oom_wait_info = container_of(wait, struct oom_wait_info, wait); 1875 oom_wait_info = container_of(wait, struct oom_wait_info, wait);
2082 oom_wait_memcg = oom_wait_info->memcg; 1876 oom_wait_memcg = oom_wait_info->memcg;
2083 1877
2084 /* 1878 if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
2085 * Both of oom_wait_info->memcg and wake_memcg are stable under us. 1879 !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
2086 * Then we can use css_is_ancestor without taking care of RCU.
2087 */
2088 if (!mem_cgroup_same_or_subtree(oom_wait_memcg, wake_memcg)
2089 && !mem_cgroup_same_or_subtree(wake_memcg, oom_wait_memcg))
2090 return 0; 1880 return 0;
2091 return autoremove_wake_function(wait, mode, sync, arg); 1881 return autoremove_wake_function(wait, mode, sync, arg);
2092} 1882}
@@ -2228,26 +2018,23 @@ struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page,
2228 unsigned long *flags) 2018 unsigned long *flags)
2229{ 2019{
2230 struct mem_cgroup *memcg; 2020 struct mem_cgroup *memcg;
2231 struct page_cgroup *pc;
2232 2021
2233 rcu_read_lock(); 2022 rcu_read_lock();
2234 2023
2235 if (mem_cgroup_disabled()) 2024 if (mem_cgroup_disabled())
2236 return NULL; 2025 return NULL;
2237
2238 pc = lookup_page_cgroup(page);
2239again: 2026again:
2240 memcg = pc->mem_cgroup; 2027 memcg = page->mem_cgroup;
2241 if (unlikely(!memcg || !PageCgroupUsed(pc))) 2028 if (unlikely(!memcg))
2242 return NULL; 2029 return NULL;
2243 2030
2244 *locked = false; 2031 *locked = false;
2245 if (atomic_read(&memcg->moving_account) <= 0) 2032 if (atomic_read(&memcg->moving_account) <= 0)
2246 return memcg; 2033 return memcg;
2247 2034
2248 move_lock_mem_cgroup(memcg, flags); 2035 spin_lock_irqsave(&memcg->move_lock, *flags);
2249 if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { 2036 if (memcg != page->mem_cgroup) {
2250 move_unlock_mem_cgroup(memcg, flags); 2037 spin_unlock_irqrestore(&memcg->move_lock, *flags);
2251 goto again; 2038 goto again;
2252 } 2039 }
2253 *locked = true; 2040 *locked = true;
@@ -2261,11 +2048,11 @@ again:
2261 * @locked: value received from mem_cgroup_begin_page_stat() 2048 * @locked: value received from mem_cgroup_begin_page_stat()
2262 * @flags: value received from mem_cgroup_begin_page_stat() 2049 * @flags: value received from mem_cgroup_begin_page_stat()
2263 */ 2050 */
2264void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked, 2051void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool *locked,
2265 unsigned long flags) 2052 unsigned long *flags)
2266{ 2053{
2267 if (memcg && locked) 2054 if (memcg && *locked)
2268 move_unlock_mem_cgroup(memcg, &flags); 2055 spin_unlock_irqrestore(&memcg->move_lock, *flags);
2269 2056
2270 rcu_read_unlock(); 2057 rcu_read_unlock();
2271} 2058}
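The two page-stat hunks above drop lookup_page_cgroup() in favour of the new page->mem_cgroup pointer and open-code the move_lock handling: the reader loads page->mem_cgroup, takes the group's move_lock only while charge moving may be in flight, and then rechecks the pointer under the lock, retrying if the page was moved to another group in between; mem_cgroup_end_page_stat() correspondingly takes the locked and flags values by pointer so it can unlock with the saved state. The control flow, reduced to a compilable single-threaded sketch with a pthread mutex; the kernel additionally holds rcu_read_lock() across this and uses an IRQ-saving spinlock, which the sketch omits, and all names here are illustrative.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
        pthread_mutex_t move_lock;
        int moving_account;             /* nonzero while charges are being moved */
};

struct page_model {
        struct group *owner;            /* plays the role of page->mem_cgroup */
};

static struct group *begin_page_stat(struct page_model *page, bool *locked)
{
        struct group *g;

again:
        g = page->owner;                /* the kernel reads this under RCU */
        if (!g)
                return NULL;

        *locked = false;
        if (g->moving_account <= 0)
                return g;               /* fast path: nobody is moving charges */

        pthread_mutex_lock(&g->move_lock);
        if (g != page->owner) {
                /* the page moved to another group meanwhile: retry */
                pthread_mutex_unlock(&g->move_lock);
                goto again;
        }
        *locked = true;
        return g;
}

static void end_page_stat(struct group *g, bool *locked)
{
        if (g && *locked)
                pthread_mutex_unlock(&g->move_lock);
}

int main(void)
{
        struct group grp = { PTHREAD_MUTEX_INITIALIZER, 0 };
        struct page_model page = { &grp };
        bool locked = false;

        struct group *owner = begin_page_stat(&page, &locked);
        /* ... update per-group statistics for the page here ... */
        end_page_stat(owner, &locked);
        return 0;
}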
@@ -2316,33 +2103,32 @@ static DEFINE_MUTEX(percpu_charge_mutex);
2316static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2103static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
2317{ 2104{
2318 struct memcg_stock_pcp *stock; 2105 struct memcg_stock_pcp *stock;
2319 bool ret = true; 2106 bool ret = false;
2320 2107
2321 if (nr_pages > CHARGE_BATCH) 2108 if (nr_pages > CHARGE_BATCH)
2322 return false; 2109 return ret;
2323 2110
2324 stock = &get_cpu_var(memcg_stock); 2111 stock = &get_cpu_var(memcg_stock);
2325 if (memcg == stock->cached && stock->nr_pages >= nr_pages) 2112 if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
2326 stock->nr_pages -= nr_pages; 2113 stock->nr_pages -= nr_pages;
2327 else /* need to call res_counter_charge */ 2114 ret = true;
2328 ret = false; 2115 }
2329 put_cpu_var(memcg_stock); 2116 put_cpu_var(memcg_stock);
2330 return ret; 2117 return ret;
2331} 2118}
2332 2119
2333/* 2120/*
2334 * Returns stocks cached in percpu to res_counter and reset cached information. 2121 * Returns stocks cached in percpu and reset cached information.
2335 */ 2122 */
2336static void drain_stock(struct memcg_stock_pcp *stock) 2123static void drain_stock(struct memcg_stock_pcp *stock)
2337{ 2124{
2338 struct mem_cgroup *old = stock->cached; 2125 struct mem_cgroup *old = stock->cached;
2339 2126
2340 if (stock->nr_pages) { 2127 if (stock->nr_pages) {
2341 unsigned long bytes = stock->nr_pages * PAGE_SIZE; 2128 page_counter_uncharge(&old->memory, stock->nr_pages);
2342
2343 res_counter_uncharge(&old->res, bytes);
2344 if (do_swap_account) 2129 if (do_swap_account)
2345 res_counter_uncharge(&old->memsw, bytes); 2130 page_counter_uncharge(&old->memsw, stock->nr_pages);
2131 css_put_many(&old->css, stock->nr_pages);
2346 stock->nr_pages = 0; 2132 stock->nr_pages = 0;
2347 } 2133 }
2348 stock->cached = NULL; 2134 stock->cached = NULL;
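The per-cpu stock above keeps its role but now caches pre-charged pages rather than bytes: consume_stock() starts from ret = false and only succeeds when the local CPU's stock belongs to the charging memcg and holds enough pages, and drain_stock() gives leftover pages back through page_counter_uncharge() and drops the matching css references. Below is a single-"CPU" userspace model of the consume/refill cycle; the CHARGE_BATCH value is assumed for the example, and everything beyond the names taken from the diff is invented scaffolding.

#include <stdbool.h>
#include <stdio.h>

#define CHARGE_BATCH    32UL            /* assumed batch size for the example */

struct stock_model {
        int cached_memcg;               /* id of the group the stock belongs to */
        unsigned long nr_pages;         /* pre-charged pages available locally */
};

static struct stock_model stock;        /* one of these per CPU in the kernel */

static bool consume_stock(int memcg, unsigned long nr_pages)
{
        bool ret = false;

        if (nr_pages > CHARGE_BATCH)
                return ret;

        if (memcg == stock.cached_memcg && stock.nr_pages >= nr_pages) {
                stock.nr_pages -= nr_pages;
                ret = true;
        }
        return ret;
}

static void refill_stock(int memcg, unsigned long nr_pages)
{
        if (memcg != stock.cached_memcg) {
                /* drain_stock(): leftover pages go back to the page counters */
                stock.nr_pages = 0;
                stock.cached_memcg = memcg;
        }
        stock.nr_pages += nr_pages;
}

int main(void)
{
        refill_stock(1, CHARGE_BATCH - 1);      /* overcharge parked by try_charge() */
        printf("%d\n", consume_stock(1, 8));    /* 1: served from the local stock */
        printf("%d\n", consume_stock(2, 8));    /* 0: stock belongs to another group */
        printf("%lu\n", stock.nr_pages);        /* 23 pages still cached */
        return 0;
}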
@@ -2371,7 +2157,7 @@ static void __init memcg_stock_init(void)
2371} 2157}
2372 2158
2373/* 2159/*
2374 * Cache charges(val) which is from res_counter, to local per_cpu area. 2160 * Cache charges(val) to local per_cpu area.
2375 * This will be consumed by consume_stock() function, later. 2161 * This will be consumed by consume_stock() function, later.
2376 */ 2162 */
2377static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2163static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2388,13 +2174,15 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
2388 2174
2389/* 2175/*
2390 * Drains all per-CPU charge caches for given root_memcg resp. subtree 2176 * Drains all per-CPU charge caches for given root_memcg resp. subtree
2391 * of the hierarchy under it. sync flag says whether we should block 2177 * of the hierarchy under it.
2392 * until the work is done.
2393 */ 2178 */
2394static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync) 2179static void drain_all_stock(struct mem_cgroup *root_memcg)
2395{ 2180{
2396 int cpu, curcpu; 2181 int cpu, curcpu;
2397 2182
 2183 /* If someone's already draining, avoid running more workers. */
2184 if (!mutex_trylock(&percpu_charge_mutex))
2185 return;
2398 /* Notify other cpus that system-wide "drain" is running */ 2186 /* Notify other cpus that system-wide "drain" is running */
2399 get_online_cpus(); 2187 get_online_cpus();
2400 curcpu = get_cpu(); 2188 curcpu = get_cpu();
@@ -2405,7 +2193,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
2405 memcg = stock->cached; 2193 memcg = stock->cached;
2406 if (!memcg || !stock->nr_pages) 2194 if (!memcg || !stock->nr_pages)
2407 continue; 2195 continue;
2408 if (!mem_cgroup_same_or_subtree(root_memcg, memcg)) 2196 if (!mem_cgroup_is_descendant(memcg, root_memcg))
2409 continue; 2197 continue;
2410 if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { 2198 if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
2411 if (cpu == curcpu) 2199 if (cpu == curcpu)
@@ -2415,42 +2203,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
2415 } 2203 }
2416 } 2204 }
2417 put_cpu(); 2205 put_cpu();
2418
2419 if (!sync)
2420 goto out;
2421
2422 for_each_online_cpu(cpu) {
2423 struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
2424 if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
2425 flush_work(&stock->work);
2426 }
2427out:
2428 put_online_cpus(); 2206 put_online_cpus();
2429}
2430
2431/*
2432 * Tries to drain stocked charges in other cpus. This function is asynchronous
2433 * and just put a work per cpu for draining localy on each cpu. Caller can
2434 * expects some charges will be back to res_counter later but cannot wait for
2435 * it.
2436 */
2437static void drain_all_stock_async(struct mem_cgroup *root_memcg)
2438{
2439 /*
2440 * If someone calls draining, avoid adding more kworker runs.
2441 */
2442 if (!mutex_trylock(&percpu_charge_mutex))
2443 return;
2444 drain_all_stock(root_memcg, false);
2445 mutex_unlock(&percpu_charge_mutex);
2446}
2447
2448/* This is a synchronous drain interface. */
2449static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
2450{
2451 /* called when force_empty is called */
2452 mutex_lock(&percpu_charge_mutex);
2453 drain_all_stock(root_memcg, true);
2454 mutex_unlock(&percpu_charge_mutex); 2207 mutex_unlock(&percpu_charge_mutex);
2455} 2208}
2456 2209
@@ -2506,9 +2259,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
2506 unsigned int batch = max(CHARGE_BATCH, nr_pages); 2259 unsigned int batch = max(CHARGE_BATCH, nr_pages);
2507 int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 2260 int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
2508 struct mem_cgroup *mem_over_limit; 2261 struct mem_cgroup *mem_over_limit;
2509 struct res_counter *fail_res; 2262 struct page_counter *counter;
2510 unsigned long nr_reclaimed; 2263 unsigned long nr_reclaimed;
2511 unsigned long long size;
2512 bool may_swap = true; 2264 bool may_swap = true;
2513 bool drained = false; 2265 bool drained = false;
2514 int ret = 0; 2266 int ret = 0;
@@ -2519,16 +2271,15 @@ retry:
2519 if (consume_stock(memcg, nr_pages)) 2271 if (consume_stock(memcg, nr_pages))
2520 goto done; 2272 goto done;
2521 2273
2522 size = batch * PAGE_SIZE;
2523 if (!do_swap_account || 2274 if (!do_swap_account ||
2524 !res_counter_charge(&memcg->memsw, size, &fail_res)) { 2275 !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
2525 if (!res_counter_charge(&memcg->res, size, &fail_res)) 2276 if (!page_counter_try_charge(&memcg->memory, batch, &counter))
2526 goto done_restock; 2277 goto done_restock;
2527 if (do_swap_account) 2278 if (do_swap_account)
2528 res_counter_uncharge(&memcg->memsw, size); 2279 page_counter_uncharge(&memcg->memsw, batch);
2529 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); 2280 mem_over_limit = mem_cgroup_from_counter(counter, memory);
2530 } else { 2281 } else {
2531 mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); 2282 mem_over_limit = mem_cgroup_from_counter(counter, memsw);
2532 may_swap = false; 2283 may_swap = false;
2533 } 2284 }
2534 2285
@@ -2561,7 +2312,7 @@ retry:
2561 goto retry; 2312 goto retry;
2562 2313
2563 if (!drained) { 2314 if (!drained) {
2564 drain_all_stock_async(mem_over_limit); 2315 drain_all_stock(mem_over_limit);
2565 drained = true; 2316 drained = true;
2566 goto retry; 2317 goto retry;
2567 } 2318 }
@@ -2603,6 +2354,7 @@ bypass:
2603 return -EINTR; 2354 return -EINTR;
2604 2355
2605done_restock: 2356done_restock:
2357 css_get_many(&memcg->css, batch);
2606 if (batch > nr_pages) 2358 if (batch > nr_pages)
2607 refill_stock(memcg, batch - nr_pages); 2359 refill_stock(memcg, batch - nr_pages);
2608done: 2360done:
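In the new try_charge() above, charging is done in pages against the two counters in a fixed order: with swap accounting enabled memsw is tried first, then memory; if memory fails the memsw charge is rolled back, and whichever counter refused the charge is mapped back to its memcg with mem_cgroup_from_counter() to become the reclaim target (with swapping disabled for that reclaim when memsw was the limiter). On success, css_get_many() pins one css reference per charged page before any overcharge is parked in the per-cpu stock. The sketch below captures only the ordering and rollback; the counter model and return convention are illustrative, not the kernel's.

#include <stdbool.h>
#include <stddef.h>

struct counter_model {
        unsigned long count;
        unsigned long limit;
};

struct group_model {
        struct counter_model memory;    /* pages charged to memory */
        struct counter_model memsw;     /* pages charged to memory+swap */
};

static bool try_charge_counter(struct counter_model *c, unsigned long nr_pages)
{
        if (c->count + nr_pages > c->limit)
                return false;
        c->count += nr_pages;
        return true;
}

static void uncharge_counter(struct counter_model *c, unsigned long nr_pages)
{
        c->count -= nr_pages;
}

/*
 * Returns NULL on success, otherwise the counter that is over its limit,
 * mirroring how try_charge() picks mem_over_limit as the reclaim target.
 */
static struct counter_model *charge(struct group_model *g, unsigned long batch,
                                    bool do_swap_account)
{
        if (do_swap_account && !try_charge_counter(&g->memsw, batch))
                return &g->memsw;               /* reclaim, but don't swap */

        if (!try_charge_counter(&g->memory, batch)) {
                if (do_swap_account)
                        uncharge_counter(&g->memsw, batch);     /* roll back */
                return &g->memory;              /* reclaim, swapping allowed */
        }
        return NULL;                            /* both counters charged */
}

int main(void)
{
        struct group_model g = {
                .memory = { .limit = 100 },
                .memsw  = { .limit = 150 },
        };

        return charge(&g, 32, true) ? 1 : 0;    /* charges both counters, returns 0 */
}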
@@ -2611,32 +2363,14 @@ done:
2611 2363
2612static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) 2364static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
2613{ 2365{
2614 unsigned long bytes = nr_pages * PAGE_SIZE;
2615
2616 if (mem_cgroup_is_root(memcg)) 2366 if (mem_cgroup_is_root(memcg))
2617 return; 2367 return;
2618 2368
2619 res_counter_uncharge(&memcg->res, bytes); 2369 page_counter_uncharge(&memcg->memory, nr_pages);
2620 if (do_swap_account) 2370 if (do_swap_account)
2621 res_counter_uncharge(&memcg->memsw, bytes); 2371 page_counter_uncharge(&memcg->memsw, nr_pages);
2622}
2623
2624/*
2625 * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
2626 * This is useful when moving usage to parent cgroup.
2627 */
2628static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
2629 unsigned int nr_pages)
2630{
2631 unsigned long bytes = nr_pages * PAGE_SIZE;
2632
2633 if (mem_cgroup_is_root(memcg))
2634 return;
2635 2372
2636 res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); 2373 css_put_many(&memcg->css, nr_pages);
2637 if (do_swap_account)
2638 res_counter_uncharge_until(&memcg->memsw,
2639 memcg->memsw.parent, bytes);
2640} 2374}
2641 2375
2642/* 2376/*
@@ -2665,17 +2399,15 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
2665 */ 2399 */
2666struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) 2400struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
2667{ 2401{
2668 struct mem_cgroup *memcg = NULL; 2402 struct mem_cgroup *memcg;
2669 struct page_cgroup *pc;
2670 unsigned short id; 2403 unsigned short id;
2671 swp_entry_t ent; 2404 swp_entry_t ent;
2672 2405
2673 VM_BUG_ON_PAGE(!PageLocked(page), page); 2406 VM_BUG_ON_PAGE(!PageLocked(page), page);
2674 2407
2675 pc = lookup_page_cgroup(page); 2408 memcg = page->mem_cgroup;
2676 if (PageCgroupUsed(pc)) { 2409 if (memcg) {
2677 memcg = pc->mem_cgroup; 2410 if (!css_tryget_online(&memcg->css))
2678 if (memcg && !css_tryget_online(&memcg->css))
2679 memcg = NULL; 2411 memcg = NULL;
2680 } else if (PageSwapCache(page)) { 2412 } else if (PageSwapCache(page)) {
2681 ent.val = page_private(page); 2413 ent.val = page_private(page);
@@ -2723,14 +2455,9 @@ static void unlock_page_lru(struct page *page, int isolated)
2723static void commit_charge(struct page *page, struct mem_cgroup *memcg, 2455static void commit_charge(struct page *page, struct mem_cgroup *memcg,
2724 bool lrucare) 2456 bool lrucare)
2725{ 2457{
2726 struct page_cgroup *pc = lookup_page_cgroup(page);
2727 int isolated; 2458 int isolated;
2728 2459
2729 VM_BUG_ON_PAGE(PageCgroupUsed(pc), page); 2460 VM_BUG_ON_PAGE(page->mem_cgroup, page);
2730 /*
2731 * we don't need page_cgroup_lock about tail pages, becase they are not
2732 * accessed by any other context at this point.
2733 */
2734 2461
2735 /* 2462 /*
2736 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page 2463 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
@@ -2741,7 +2468,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
2741 2468
2742 /* 2469 /*
2743 * Nobody should be changing or seriously looking at 2470 * Nobody should be changing or seriously looking at
2744 * pc->mem_cgroup and pc->flags at this point: 2471 * page->mem_cgroup at this point:
2745 * 2472 *
2746 * - the page is uncharged 2473 * - the page is uncharged
2747 * 2474 *
@@ -2753,15 +2480,12 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
2753 * - a page cache insertion, a swapin fault, or a migration 2480 * - a page cache insertion, a swapin fault, or a migration
2754 * have the page locked 2481 * have the page locked
2755 */ 2482 */
2756 pc->mem_cgroup = memcg; 2483 page->mem_cgroup = memcg;
2757 pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
2758 2484
2759 if (lrucare) 2485 if (lrucare)
2760 unlock_page_lru(page, isolated); 2486 unlock_page_lru(page, isolated);
2761} 2487}
2762 2488
2763static DEFINE_MUTEX(set_limit_mutex);
2764
2765#ifdef CONFIG_MEMCG_KMEM 2489#ifdef CONFIG_MEMCG_KMEM
2766/* 2490/*
2767 * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or 2491 * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
@@ -2769,8 +2493,6 @@ static DEFINE_MUTEX(set_limit_mutex);
2769 */ 2493 */
2770static DEFINE_MUTEX(memcg_slab_mutex); 2494static DEFINE_MUTEX(memcg_slab_mutex);
2771 2495
2772static DEFINE_MUTEX(activate_kmem_mutex);
2773
2774/* 2496/*
2775 * This is a bit cumbersome, but it is rarely used and avoids a backpointer 2497 * This is a bit cumbersome, but it is rarely used and avoids a backpointer
2776 * in the memcg_cache_params struct. 2498 * in the memcg_cache_params struct.
@@ -2784,36 +2506,17 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
2784 return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg)); 2506 return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
2785} 2507}
2786 2508
2787#ifdef CONFIG_SLABINFO 2509static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
2788static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v) 2510 unsigned long nr_pages)
2789{
2790 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
2791 struct memcg_cache_params *params;
2792
2793 if (!memcg_kmem_is_active(memcg))
2794 return -EIO;
2795
2796 print_slabinfo_header(m);
2797
2798 mutex_lock(&memcg_slab_mutex);
2799 list_for_each_entry(params, &memcg->memcg_slab_caches, list)
2800 cache_show(memcg_params_to_cache(params), m);
2801 mutex_unlock(&memcg_slab_mutex);
2802
2803 return 0;
2804}
2805#endif
2806
2807static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
2808{ 2511{
2809 struct res_counter *fail_res; 2512 struct page_counter *counter;
2810 int ret = 0; 2513 int ret = 0;
2811 2514
2812 ret = res_counter_charge(&memcg->kmem, size, &fail_res); 2515 ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
2813 if (ret) 2516 if (ret < 0)
2814 return ret; 2517 return ret;
2815 2518
2816 ret = try_charge(memcg, gfp, size >> PAGE_SHIFT); 2519 ret = try_charge(memcg, gfp, nr_pages);
2817 if (ret == -EINTR) { 2520 if (ret == -EINTR) {
2818 /* 2521 /*
2819 * try_charge() chose to bypass to root due to OOM kill or 2522 * try_charge() chose to bypass to root due to OOM kill or
@@ -2830,37 +2533,27 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
2830 * when the allocation triggers should have been already 2533 * when the allocation triggers should have been already
2831 * directed to the root cgroup in memcontrol.h 2534 * directed to the root cgroup in memcontrol.h
2832 */ 2535 */
2833 res_counter_charge_nofail(&memcg->res, size, &fail_res); 2536 page_counter_charge(&memcg->memory, nr_pages);
2834 if (do_swap_account) 2537 if (do_swap_account)
2835 res_counter_charge_nofail(&memcg->memsw, size, 2538 page_counter_charge(&memcg->memsw, nr_pages);
2836 &fail_res); 2539 css_get_many(&memcg->css, nr_pages);
2837 ret = 0; 2540 ret = 0;
2838 } else if (ret) 2541 } else if (ret)
2839 res_counter_uncharge(&memcg->kmem, size); 2542 page_counter_uncharge(&memcg->kmem, nr_pages);
2840 2543
2841 return ret; 2544 return ret;
2842} 2545}
2843 2546
2844static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) 2547static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
2548 unsigned long nr_pages)
2845{ 2549{
2846 res_counter_uncharge(&memcg->res, size); 2550 page_counter_uncharge(&memcg->memory, nr_pages);
2847 if (do_swap_account) 2551 if (do_swap_account)
2848 res_counter_uncharge(&memcg->memsw, size); 2552 page_counter_uncharge(&memcg->memsw, nr_pages);
2849 2553
2850 /* Not down to 0 */ 2554 page_counter_uncharge(&memcg->kmem, nr_pages);
2851 if (res_counter_uncharge(&memcg->kmem, size))
2852 return;
2853 2555
2854 /* 2556 css_put_many(&memcg->css, nr_pages);
2855 * Releases a reference taken in kmem_cgroup_css_offline in case
2856 * this last uncharge is racing with the offlining code or it is
2857 * outliving the memcg existence.
2858 *
2859 * The memory barrier imposed by test&clear is paired with the
2860 * explicit one in memcg_kmem_mark_dead().
2861 */
2862 if (memcg_kmem_test_and_clear_dead(memcg))
2863 css_put(&memcg->css);
2864} 2557}
2865 2558
2866/* 2559/*
@@ -3124,19 +2817,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
3124 2817
3125int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order) 2818int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
3126{ 2819{
2820 unsigned int nr_pages = 1 << order;
3127 int res; 2821 int res;
3128 2822
3129 res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, 2823 res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
3130 PAGE_SIZE << order);
3131 if (!res) 2824 if (!res)
3132 atomic_add(1 << order, &cachep->memcg_params->nr_pages); 2825 atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
3133 return res; 2826 return res;
3134} 2827}
3135 2828
3136void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) 2829void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
3137{ 2830{
3138 memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order); 2831 unsigned int nr_pages = 1 << order;
3139 atomic_sub(1 << order, &cachep->memcg_params->nr_pages); 2832
2833 memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
2834 atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
3140} 2835}
3141 2836
3142/* 2837/*
@@ -3257,7 +2952,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
3257 return true; 2952 return true;
3258 } 2953 }
3259 2954
3260 ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order); 2955 ret = memcg_charge_kmem(memcg, gfp, 1 << order);
3261 if (!ret) 2956 if (!ret)
3262 *_memcg = memcg; 2957 *_memcg = memcg;
3263 2958
@@ -3268,46 +2963,27 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
3268void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, 2963void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
3269 int order) 2964 int order)
3270{ 2965{
3271 struct page_cgroup *pc;
3272
3273 VM_BUG_ON(mem_cgroup_is_root(memcg)); 2966 VM_BUG_ON(mem_cgroup_is_root(memcg));
3274 2967
3275 /* The page allocation failed. Revert */ 2968 /* The page allocation failed. Revert */
3276 if (!page) { 2969 if (!page) {
3277 memcg_uncharge_kmem(memcg, PAGE_SIZE << order); 2970 memcg_uncharge_kmem(memcg, 1 << order);
3278 return; 2971 return;
3279 } 2972 }
3280 /* 2973 page->mem_cgroup = memcg;
3281 * The page is freshly allocated and not visible to any
3282 * outside callers yet. Set up pc non-atomically.
3283 */
3284 pc = lookup_page_cgroup(page);
3285 pc->mem_cgroup = memcg;
3286 pc->flags = PCG_USED;
3287} 2974}
3288 2975
3289void __memcg_kmem_uncharge_pages(struct page *page, int order) 2976void __memcg_kmem_uncharge_pages(struct page *page, int order)
3290{ 2977{
3291 struct mem_cgroup *memcg = NULL; 2978 struct mem_cgroup *memcg = page->mem_cgroup;
3292 struct page_cgroup *pc;
3293
3294 2979
3295 pc = lookup_page_cgroup(page);
3296 if (!PageCgroupUsed(pc))
3297 return;
3298
3299 memcg = pc->mem_cgroup;
3300 pc->flags = 0;
3301
3302 /*
3303 * We trust that only if there is a memcg associated with the page, it
3304 * is a valid allocation
3305 */
3306 if (!memcg) 2980 if (!memcg)
3307 return; 2981 return;
3308 2982
3309 VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); 2983 VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
3310 memcg_uncharge_kmem(memcg, PAGE_SIZE << order); 2984
2985 memcg_uncharge_kmem(memcg, 1 << order);
2986 page->mem_cgroup = NULL;
3311} 2987}
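
The commit/uncharge pair above drops the page_cgroup lookup: ownership is recorded by pointing page->mem_cgroup at the charging group and cleared again on uncharge, with a NULL pointer taking over the role of the old !PageCgroupUsed() check. A user-space sketch of that ownership protocol, with struct page and struct mem_cgroup reduced to the single field the hunk touches:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct mem_cgroup { const char *name; };
struct page { struct mem_cgroup *mem_cgroup; };

/* record ownership once the allocation has succeeded */
static void kmem_commit_charge(struct page *page, struct mem_cgroup *memcg)
{
    page->mem_cgroup = memcg;
}

/* uncharge only pages that were actually committed to a group */
static void kmem_uncharge_page(struct page *page)
{
    struct mem_cgroup *memcg = page->mem_cgroup;

    if (!memcg)  /* never charged (e.g. readahead): nothing to undo */
        return;

    printf("uncharging page owned by %s\n", memcg->name);
    page->mem_cgroup = NULL;
}

int main(void)
{
    struct mem_cgroup grp = { .name = "A" };
    struct page charged = { 0 }, readahead = { 0 };

    kmem_commit_charge(&charged, &grp);
    kmem_uncharge_page(&charged);    /* prints once, then clears the owner */
    kmem_uncharge_page(&readahead);  /* no owner: silently ignored */
    assert(charged.mem_cgroup == NULL);
    return 0;
}
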
3312#else 2988#else
3313static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) 2989static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -3325,21 +3001,15 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
3325 */ 3001 */
3326void mem_cgroup_split_huge_fixup(struct page *head) 3002void mem_cgroup_split_huge_fixup(struct page *head)
3327{ 3003{
3328 struct page_cgroup *head_pc = lookup_page_cgroup(head);
3329 struct page_cgroup *pc;
3330 struct mem_cgroup *memcg;
3331 int i; 3004 int i;
3332 3005
3333 if (mem_cgroup_disabled()) 3006 if (mem_cgroup_disabled())
3334 return; 3007 return;
3335 3008
3336 memcg = head_pc->mem_cgroup; 3009 for (i = 1; i < HPAGE_PMD_NR; i++)
3337 for (i = 1; i < HPAGE_PMD_NR; i++) { 3010 head[i].mem_cgroup = head->mem_cgroup;
3338 pc = head_pc + i; 3011
3339 pc->mem_cgroup = memcg; 3012 __this_cpu_sub(head->mem_cgroup->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
3340 pc->flags = head_pc->flags;
3341 }
3342 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
3343 HPAGE_PMD_NR); 3013 HPAGE_PMD_NR);
3344} 3014}
3345#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 3015#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -3348,7 +3018,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)
3348 * mem_cgroup_move_account - move account of the page 3018 * mem_cgroup_move_account - move account of the page
3349 * @page: the page 3019 * @page: the page
3350 * @nr_pages: number of regular pages (>1 for huge pages) 3020 * @nr_pages: number of regular pages (>1 for huge pages)
3351 * @pc: page_cgroup of the page.
3352 * @from: mem_cgroup which the page is moved from. 3021 * @from: mem_cgroup which the page is moved from.
3353 * @to: mem_cgroup which the page is moved to. @from != @to. 3022 * @to: mem_cgroup which the page is moved to. @from != @to.
3354 * 3023 *
@@ -3361,7 +3030,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)
3361 */ 3030 */
3362static int mem_cgroup_move_account(struct page *page, 3031static int mem_cgroup_move_account(struct page *page,
3363 unsigned int nr_pages, 3032 unsigned int nr_pages,
3364 struct page_cgroup *pc,
3365 struct mem_cgroup *from, 3033 struct mem_cgroup *from,
3366 struct mem_cgroup *to) 3034 struct mem_cgroup *to)
3367{ 3035{
@@ -3381,7 +3049,7 @@ static int mem_cgroup_move_account(struct page *page,
3381 goto out; 3049 goto out;
3382 3050
3383 /* 3051 /*
3384 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup 3052 * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
3385 * of its source page while we change it: page migration takes 3053 * of its source page while we change it: page migration takes
3386 * both pages off the LRU, but page cache replacement doesn't. 3054 * both pages off the LRU, but page cache replacement doesn't.
3387 */ 3055 */
@@ -3389,10 +3057,10 @@ static int mem_cgroup_move_account(struct page *page,
3389 goto out; 3057 goto out;
3390 3058
3391 ret = -EINVAL; 3059 ret = -EINVAL;
3392 if (!PageCgroupUsed(pc) || pc->mem_cgroup != from) 3060 if (page->mem_cgroup != from)
3393 goto out_unlock; 3061 goto out_unlock;
3394 3062
3395 move_lock_mem_cgroup(from, &flags); 3063 spin_lock_irqsave(&from->move_lock, flags);
3396 3064
3397 if (!PageAnon(page) && page_mapped(page)) { 3065 if (!PageAnon(page) && page_mapped(page)) {
3398 __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], 3066 __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3409,14 +3077,15 @@ static int mem_cgroup_move_account(struct page *page,
3409 } 3077 }
3410 3078
3411 /* 3079 /*
3412 * It is safe to change pc->mem_cgroup here because the page 3080 * It is safe to change page->mem_cgroup here because the page
3413 * is referenced, charged, and isolated - we can't race with 3081 * is referenced, charged, and isolated - we can't race with
3414 * uncharging, charging, migration, or LRU putback. 3082 * uncharging, charging, migration, or LRU putback.
3415 */ 3083 */
3416 3084
3417 /* caller should have done css_get */ 3085 /* caller should have done css_get */
3418 pc->mem_cgroup = to; 3086 page->mem_cgroup = to;
3419 move_unlock_mem_cgroup(from, &flags); 3087 spin_unlock_irqrestore(&from->move_lock, flags);
3088
3420 ret = 0; 3089 ret = 0;
3421 3090
3422 local_irq_disable(); 3091 local_irq_disable();
@@ -3431,72 +3100,6 @@ out:
3431 return ret; 3100 return ret;
3432} 3101}
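
mem_cgroup_move_account() above now serializes against page-stat updates with a plain spinlock in the source group (from->move_lock) instead of the old move_lock_mem_cgroup() helpers, and re-checks page->mem_cgroup under that lock before reassigning it. A simplified single-step model of that protocol follows; a pthread mutex stands in for the irq-safe spinlock and one counter stands in for the per-cpu statistics:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup {
    pthread_mutex_t move_lock;
    long file_mapped;  /* stand-in for the per-cpu stat counters */
};
struct page { struct mem_cgroup *mem_cgroup; };

static bool move_account(struct page *page,
                         struct mem_cgroup *from, struct mem_cgroup *to)
{
    bool moved = false;

    pthread_mutex_lock(&from->move_lock);
    if (page->mem_cgroup == from) {  /* still owned by @from? */
        from->file_mapped--;
        to->file_mapped++;
        page->mem_cgroup = to;       /* caller holds a reference on @to */
        moved = true;
    }
    pthread_mutex_unlock(&from->move_lock);
    return moved;
}

int main(void)
{
    struct mem_cgroup a = { PTHREAD_MUTEX_INITIALIZER, 1 };
    struct mem_cgroup b = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct page page = { &a };

    printf("moved: %d\n", move_account(&page, &a, &b));        /* 1 */
    printf("moved again: %d\n", move_account(&page, &a, &b));  /* 0: lost the race */
    return 0;
}
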
3433 3102
3434/**
3435 * mem_cgroup_move_parent - moves page to the parent group
3436 * @page: the page to move
3437 * @pc: page_cgroup of the page
3438 * @child: page's cgroup
3439 *
3440 * move charges to its parent or the root cgroup if the group has no
3441 * parent (aka use_hierarchy==0).
3442 * Although this might fail (get_page_unless_zero, isolate_lru_page or
3443 * mem_cgroup_move_account fails) the failure is always temporary and
3444 * it signals a race with a page removal/uncharge or migration. In the
3445 * first case the page is on the way out and it will vanish from the LRU
3446 * on the next attempt and the call should be retried later.
3447 * Isolation from the LRU fails only if page has been isolated from
3448 * the LRU since we looked at it and that usually means either global
3449 * reclaim or migration going on. The page will either get back to the
3450 * LRU or vanish.
3451 * Finally mem_cgroup_move_account fails only if the page got uncharged
3452 * (!PageCgroupUsed) or moved to a different group. The page will
3453 * disappear in the next attempt.
3454 */
3455static int mem_cgroup_move_parent(struct page *page,
3456 struct page_cgroup *pc,
3457 struct mem_cgroup *child)
3458{
3459 struct mem_cgroup *parent;
3460 unsigned int nr_pages;
3461 unsigned long uninitialized_var(flags);
3462 int ret;
3463
3464 VM_BUG_ON(mem_cgroup_is_root(child));
3465
3466 ret = -EBUSY;
3467 if (!get_page_unless_zero(page))
3468 goto out;
3469 if (isolate_lru_page(page))
3470 goto put;
3471
3472 nr_pages = hpage_nr_pages(page);
3473
3474 parent = parent_mem_cgroup(child);
3475 /*
3476 * If no parent, move charges to root cgroup.
3477 */
3478 if (!parent)
3479 parent = root_mem_cgroup;
3480
3481 if (nr_pages > 1) {
3482 VM_BUG_ON_PAGE(!PageTransHuge(page), page);
3483 flags = compound_lock_irqsave(page);
3484 }
3485
3486 ret = mem_cgroup_move_account(page, nr_pages,
3487 pc, child, parent);
3488 if (!ret)
3489 __mem_cgroup_cancel_local_charge(child, nr_pages);
3490
3491 if (nr_pages > 1)
3492 compound_unlock_irqrestore(page, flags);
3493 putback_lru_page(page);
3494put:
3495 put_page(page);
3496out:
3497 return ret;
3498}
3499
3500#ifdef CONFIG_MEMCG_SWAP 3103#ifdef CONFIG_MEMCG_SWAP
3501static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, 3104static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
3502 bool charge) 3105 bool charge)
@@ -3516,7 +3119,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
3516 * 3119 *
3517 * Returns 0 on success, -EINVAL on failure. 3120 * Returns 0 on success, -EINVAL on failure.
3518 * 3121 *
3519 * The caller must have charged to @to, IOW, called res_counter_charge() about 3122 * The caller must have charged to @to, IOW, called page_counter_charge() about
3520 * both res and memsw, and called css_get(). 3123 * both res and memsw, and called css_get().
3521 */ 3124 */
3522static int mem_cgroup_move_swap_account(swp_entry_t entry, 3125static int mem_cgroup_move_swap_account(swp_entry_t entry,
@@ -3532,7 +3135,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
3532 mem_cgroup_swap_statistics(to, true); 3135 mem_cgroup_swap_statistics(to, true);
3533 /* 3136 /*
3534 * This function is only called from task migration context now. 3137 * This function is only called from task migration context now.
3535 * It postpones res_counter and refcount handling till the end 3138 * It postpones page_counter and refcount handling till the end
3536 * of task migration(mem_cgroup_clear_mc()) for performance 3139 * of task migration(mem_cgroup_clear_mc()) for performance
3537 * improvement. But we cannot postpone css_get(to) because if 3140 * improvement. But we cannot postpone css_get(to) because if
3538 * the process that has been moved to @to does swap-in, the 3141 * the process that has been moved to @to does swap-in, the
@@ -3554,96 +3157,57 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
3554} 3157}
3555#endif 3158#endif
3556 3159
3557#ifdef CONFIG_DEBUG_VM 3160static DEFINE_MUTEX(memcg_limit_mutex);
3558static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
3559{
3560 struct page_cgroup *pc;
3561
3562 pc = lookup_page_cgroup(page);
3563 /*
3564 * Can be NULL while feeding pages into the page allocator for
3565 * the first time, i.e. during boot or memory hotplug;
3566 * or when mem_cgroup_disabled().
3567 */
3568 if (likely(pc) && PageCgroupUsed(pc))
3569 return pc;
3570 return NULL;
3571}
3572
3573bool mem_cgroup_bad_page_check(struct page *page)
3574{
3575 if (mem_cgroup_disabled())
3576 return false;
3577
3578 return lookup_page_cgroup_used(page) != NULL;
3579}
3580
3581void mem_cgroup_print_bad_page(struct page *page)
3582{
3583 struct page_cgroup *pc;
3584
3585 pc = lookup_page_cgroup_used(page);
3586 if (pc) {
3587 pr_alert("pc:%p pc->flags:%lx pc->mem_cgroup:%p\n",
3588 pc, pc->flags, pc->mem_cgroup);
3589 }
3590}
3591#endif
3592 3161
3593static int mem_cgroup_resize_limit(struct mem_cgroup *memcg, 3162static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
3594 unsigned long long val) 3163 unsigned long limit)
3595{ 3164{
3165 unsigned long curusage;
3166 unsigned long oldusage;
3167 bool enlarge = false;
3596 int retry_count; 3168 int retry_count;
3597 int ret = 0; 3169 int ret;
3598 int children = mem_cgroup_count_children(memcg);
3599 u64 curusage, oldusage;
3600 int enlarge;
3601 3170
3602 /* 3171 /*
3603 * For keeping hierarchical_reclaim simple, how long we should retry 3172 * For keeping hierarchical_reclaim simple, how long we should retry
3604 * is depends on callers. We set our retry-count to be function 3173 * is depends on callers. We set our retry-count to be function
3605 * of # of children which we should visit in this loop. 3174 * of # of children which we should visit in this loop.
3606 */ 3175 */
3607 retry_count = MEM_CGROUP_RECLAIM_RETRIES * children; 3176 retry_count = MEM_CGROUP_RECLAIM_RETRIES *
3177 mem_cgroup_count_children(memcg);
3608 3178
3609 oldusage = res_counter_read_u64(&memcg->res, RES_USAGE); 3179 oldusage = page_counter_read(&memcg->memory);
3610 3180
3611 enlarge = 0; 3181 do {
3612 while (retry_count) {
3613 if (signal_pending(current)) { 3182 if (signal_pending(current)) {
3614 ret = -EINTR; 3183 ret = -EINTR;
3615 break; 3184 break;
3616 } 3185 }
3617 /* 3186
3618 * Rather than hide all in some function, I do this in 3187 mutex_lock(&memcg_limit_mutex);
3619 * open coded manner. You see what this really does. 3188 if (limit > memcg->memsw.limit) {
3620 * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 3189 mutex_unlock(&memcg_limit_mutex);
3621 */
3622 mutex_lock(&set_limit_mutex);
3623 if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
3624 ret = -EINVAL; 3190 ret = -EINVAL;
3625 mutex_unlock(&set_limit_mutex);
3626 break; 3191 break;
3627 } 3192 }
3628 3193 if (limit > memcg->memory.limit)
3629 if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val) 3194 enlarge = true;
3630 enlarge = 1; 3195 ret = page_counter_limit(&memcg->memory, limit);
3631 3196 mutex_unlock(&memcg_limit_mutex);
3632 ret = res_counter_set_limit(&memcg->res, val);
3633 mutex_unlock(&set_limit_mutex);
3634 3197
3635 if (!ret) 3198 if (!ret)
3636 break; 3199 break;
3637 3200
3638 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true); 3201 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
3639 3202
3640 curusage = res_counter_read_u64(&memcg->res, RES_USAGE); 3203 curusage = page_counter_read(&memcg->memory);
3641 /* Usage is reduced ? */ 3204 /* Usage is reduced ? */
3642 if (curusage >= oldusage) 3205 if (curusage >= oldusage)
3643 retry_count--; 3206 retry_count--;
3644 else 3207 else
3645 oldusage = curusage; 3208 oldusage = curusage;
3646 } 3209 } while (retry_count);
3210
3647 if (!ret && enlarge) 3211 if (!ret && enlarge)
3648 memcg_oom_recover(memcg); 3212 memcg_oom_recover(memcg);
3649 3213
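
mem_cgroup_resize_limit() above keeps the same shape after the conversion: read the usage, try to install the new limit under memcg_limit_mutex, reclaim if the counter is still above it, and only consume a retry when reclaim made no forward progress. A user-space model of that loop (the memsw-ordering check and the mutex are omitted; reclaim_some() is a stand-in for try_to_free_mem_cgroup_pages()):

#include <stdbool.h>
#include <stdio.h>

#define RECLAIM_RETRIES 5

static unsigned long usage = 1000;  /* pages currently charged */
static unsigned long limit = 2000;  /* current limit, in pages */

static bool set_limit(unsigned long new_limit)
{
    if (usage > new_limit)  /* mirrors page_counter_limit() refusing to shrink below usage */
        return false;
    limit = new_limit;
    return true;
}

static void reclaim_some(void)
{
    if (usage >= 100)
        usage -= 100;  /* pretend reclaim freed 100 pages */
}

static int resize_limit(unsigned long new_limit)
{
    int retries = RECLAIM_RETRIES;
    unsigned long old_usage = usage;

    do {
        if (set_limit(new_limit))
            return 0;
        reclaim_some();
        if (usage >= old_usage)  /* no progress: burn a retry */
            retries--;
        else
            old_usage = usage;
    } while (retries);

    return -1;  /* the kernel would report -EBUSY at this point */
}

int main(void)
{
    printf("shrink to 600: %d, usage now %lu\n", resize_limit(600), usage);
    printf("limit is now %lu\n", limit);
    return 0;
}
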
@@ -3651,52 +3215,53 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
3651} 3215}
3652 3216
3653static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, 3217static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
3654 unsigned long long val) 3218 unsigned long limit)
3655{ 3219{
3220 unsigned long curusage;
3221 unsigned long oldusage;
3222 bool enlarge = false;
3656 int retry_count; 3223 int retry_count;
3657 u64 oldusage, curusage; 3224 int ret;
3658 int children = mem_cgroup_count_children(memcg);
3659 int ret = -EBUSY;
3660 int enlarge = 0;
3661 3225
3662 /* see mem_cgroup_resize_res_limit */ 3226 /* see mem_cgroup_resize_res_limit */
3663 retry_count = children * MEM_CGROUP_RECLAIM_RETRIES; 3227 retry_count = MEM_CGROUP_RECLAIM_RETRIES *
3664 oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE); 3228 mem_cgroup_count_children(memcg);
3665 while (retry_count) { 3229
3230 oldusage = page_counter_read(&memcg->memsw);
3231
3232 do {
3666 if (signal_pending(current)) { 3233 if (signal_pending(current)) {
3667 ret = -EINTR; 3234 ret = -EINTR;
3668 break; 3235 break;
3669 } 3236 }
3670 /* 3237
3671 * Rather than hide all in some function, I do this in 3238 mutex_lock(&memcg_limit_mutex);
3672 * open coded manner. You see what this really does. 3239 if (limit < memcg->memory.limit) {
3673 * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 3240 mutex_unlock(&memcg_limit_mutex);
3674 */
3675 mutex_lock(&set_limit_mutex);
3676 if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
3677 ret = -EINVAL; 3241 ret = -EINVAL;
3678 mutex_unlock(&set_limit_mutex);
3679 break; 3242 break;
3680 } 3243 }
3681 if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) 3244 if (limit > memcg->memsw.limit)
3682 enlarge = 1; 3245 enlarge = true;
3683 ret = res_counter_set_limit(&memcg->memsw, val); 3246 ret = page_counter_limit(&memcg->memsw, limit);
3684 mutex_unlock(&set_limit_mutex); 3247 mutex_unlock(&memcg_limit_mutex);
3685 3248
3686 if (!ret) 3249 if (!ret)
3687 break; 3250 break;
3688 3251
3689 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false); 3252 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
3690 3253
3691 curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE); 3254 curusage = page_counter_read(&memcg->memsw);
3692 /* Usage is reduced ? */ 3255 /* Usage is reduced ? */
3693 if (curusage >= oldusage) 3256 if (curusage >= oldusage)
3694 retry_count--; 3257 retry_count--;
3695 else 3258 else
3696 oldusage = curusage; 3259 oldusage = curusage;
3697 } 3260 } while (retry_count);
3261
3698 if (!ret && enlarge) 3262 if (!ret && enlarge)
3699 memcg_oom_recover(memcg); 3263 memcg_oom_recover(memcg);
3264
3700 return ret; 3265 return ret;
3701} 3266}
3702 3267
@@ -3709,7 +3274,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
3709 unsigned long reclaimed; 3274 unsigned long reclaimed;
3710 int loop = 0; 3275 int loop = 0;
3711 struct mem_cgroup_tree_per_zone *mctz; 3276 struct mem_cgroup_tree_per_zone *mctz;
3712 unsigned long long excess; 3277 unsigned long excess;
3713 unsigned long nr_scanned; 3278 unsigned long nr_scanned;
3714 3279
3715 if (order > 0) 3280 if (order > 0)
@@ -3735,35 +3300,17 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
3735 nr_reclaimed += reclaimed; 3300 nr_reclaimed += reclaimed;
3736 *total_scanned += nr_scanned; 3301 *total_scanned += nr_scanned;
3737 spin_lock_irq(&mctz->lock); 3302 spin_lock_irq(&mctz->lock);
3303 __mem_cgroup_remove_exceeded(mz, mctz);
3738 3304
3739 /* 3305 /*
3740 * If we failed to reclaim anything from this memory cgroup 3306 * If we failed to reclaim anything from this memory cgroup
3741 * it is time to move on to the next cgroup 3307 * it is time to move on to the next cgroup
3742 */ 3308 */
3743 next_mz = NULL; 3309 next_mz = NULL;
3744 if (!reclaimed) { 3310 if (!reclaimed)
3745 do { 3311 next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
3746 /* 3312
3747 * Loop until we find yet another one. 3313 excess = soft_limit_excess(mz->memcg);
3748 *
3749 * By the time we get the soft_limit lock
3750 * again, someone might have added the
3751 * group back on the RB tree. Iterate to
3752 * make sure we get a different mem.
3753 * mem_cgroup_largest_soft_limit_node returns
3754 * NULL if no other cgroup is present on
3755 * the tree
3756 */
3757 next_mz =
3758 __mem_cgroup_largest_soft_limit_node(mctz);
3759 if (next_mz == mz)
3760 css_put(&next_mz->memcg->css);
3761 else /* next_mz == NULL or other memcg */
3762 break;
3763 } while (1);
3764 }
3765 __mem_cgroup_remove_exceeded(mz, mctz);
3766 excess = res_counter_soft_limit_excess(&mz->memcg->res);
3767 /* 3314 /*
3768 * One school of thought says that we should not add 3315 * One school of thought says that we should not add
3769 * back the node to the tree if reclaim returns 0. 3316 * back the node to the tree if reclaim returns 0.
@@ -3792,107 +3339,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
3792 return nr_reclaimed; 3339 return nr_reclaimed;
3793} 3340}
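
The soft-limit reclaim loop above re-sorts the per-zone tree by soft_limit_excess() instead of res_counter_soft_limit_excess(); presumably that helper simply reports how many pages a group sits above its soft limit, clamped at zero. A one-function sketch under that assumption:

#include <stdio.h>

/* pages over the soft limit, 0 when at or under it (assumed semantics) */
static unsigned long soft_limit_excess(unsigned long usage,
                                       unsigned long soft_limit)
{
    return usage > soft_limit ? usage - soft_limit : 0;
}

int main(void)
{
    printf("%lu\n", soft_limit_excess(1500, 1024));  /* 476 */
    printf("%lu\n", soft_limit_excess(800, 1024));   /* 0 */
    return 0;
}
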
3794 3341
3795/**
3796 * mem_cgroup_force_empty_list - clears LRU of a group
3797 * @memcg: group to clear
3798 * @node: NUMA node
3799 * @zid: zone id
3800 * @lru: lru to clear
3801 *
3802 * Traverse a specified page_cgroup list and try to drop them all. This doesn't
3803 * reclaim the pages page themselves - pages are moved to the parent (or root)
3804 * group.
3805 */
3806static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
3807 int node, int zid, enum lru_list lru)
3808{
3809 struct lruvec *lruvec;
3810 unsigned long flags;
3811 struct list_head *list;
3812 struct page *busy;
3813 struct zone *zone;
3814
3815 zone = &NODE_DATA(node)->node_zones[zid];
3816 lruvec = mem_cgroup_zone_lruvec(zone, memcg);
3817 list = &lruvec->lists[lru];
3818
3819 busy = NULL;
3820 do {
3821 struct page_cgroup *pc;
3822 struct page *page;
3823
3824 spin_lock_irqsave(&zone->lru_lock, flags);
3825 if (list_empty(list)) {
3826 spin_unlock_irqrestore(&zone->lru_lock, flags);
3827 break;
3828 }
3829 page = list_entry(list->prev, struct page, lru);
3830 if (busy == page) {
3831 list_move(&page->lru, list);
3832 busy = NULL;
3833 spin_unlock_irqrestore(&zone->lru_lock, flags);
3834 continue;
3835 }
3836 spin_unlock_irqrestore(&zone->lru_lock, flags);
3837
3838 pc = lookup_page_cgroup(page);
3839
3840 if (mem_cgroup_move_parent(page, pc, memcg)) {
3841 /* found lock contention or "pc" is obsolete. */
3842 busy = page;
3843 } else
3844 busy = NULL;
3845 cond_resched();
3846 } while (!list_empty(list));
3847}
3848
3849/*
3850 * make mem_cgroup's charge to be 0 if there is no task by moving
3851 * all the charges and pages to the parent.
3852 * This enables deleting this mem_cgroup.
3853 *
3854 * Caller is responsible for holding css reference on the memcg.
3855 */
3856static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
3857{
3858 int node, zid;
3859 u64 usage;
3860
3861 do {
3862 /* This is for making all *used* pages to be on LRU. */
3863 lru_add_drain_all();
3864 drain_all_stock_sync(memcg);
3865 mem_cgroup_start_move(memcg);
3866 for_each_node_state(node, N_MEMORY) {
3867 for (zid = 0; zid < MAX_NR_ZONES; zid++) {
3868 enum lru_list lru;
3869 for_each_lru(lru) {
3870 mem_cgroup_force_empty_list(memcg,
3871 node, zid, lru);
3872 }
3873 }
3874 }
3875 mem_cgroup_end_move(memcg);
3876 memcg_oom_recover(memcg);
3877 cond_resched();
3878
3879 /*
3880 * Kernel memory may not necessarily be trackable to a specific
3881 * process. So they are not migrated, and therefore we can't
3882 * expect their value to drop to 0 here.
3883 * Having res filled up with kmem only is enough.
3884 *
3885 * This is a safety check because mem_cgroup_force_empty_list
3886 * could have raced with mem_cgroup_replace_page_cache callers
3887 * so the lru seemed empty but the page could have been added
3888 * right after the check. RES_USAGE should be safe as we always
3889 * charge before adding to the LRU.
3890 */
3891 usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
3892 res_counter_read_u64(&memcg->kmem, RES_USAGE);
3893 } while (usage > 0);
3894}
3895
3896/* 3342/*
3897 * Test whether @memcg has children, dead or alive. Note that this 3343 * Test whether @memcg has children, dead or alive. Note that this
3898 * function doesn't care whether @memcg has use_hierarchy enabled and 3344 * function doesn't care whether @memcg has use_hierarchy enabled and
@@ -3930,7 +3376,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
3930 /* we call try-to-free pages for make this cgroup empty */ 3376 /* we call try-to-free pages for make this cgroup empty */
3931 lru_add_drain_all(); 3377 lru_add_drain_all();
3932 /* try to free all pages in this cgroup */ 3378 /* try to free all pages in this cgroup */
3933 while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) { 3379 while (nr_retries && page_counter_read(&memcg->memory)) {
3934 int progress; 3380 int progress;
3935 3381
3936 if (signal_pending(current)) 3382 if (signal_pending(current))
@@ -4001,8 +3447,8 @@ out:
4001 return retval; 3447 return retval;
4002} 3448}
4003 3449
4004static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg, 3450static unsigned long tree_stat(struct mem_cgroup *memcg,
4005 enum mem_cgroup_stat_index idx) 3451 enum mem_cgroup_stat_index idx)
4006{ 3452{
4007 struct mem_cgroup *iter; 3453 struct mem_cgroup *iter;
4008 long val = 0; 3454 long val = 0;
@@ -4020,55 +3466,71 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
4020{ 3466{
4021 u64 val; 3467 u64 val;
4022 3468
4023 if (!mem_cgroup_is_root(memcg)) { 3469 if (mem_cgroup_is_root(memcg)) {
3470 val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
3471 val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
3472 if (swap)
3473 val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
3474 } else {
4024 if (!swap) 3475 if (!swap)
4025 return res_counter_read_u64(&memcg->res, RES_USAGE); 3476 val = page_counter_read(&memcg->memory);
4026 else 3477 else
4027 return res_counter_read_u64(&memcg->memsw, RES_USAGE); 3478 val = page_counter_read(&memcg->memsw);
4028 } 3479 }
4029
4030 /*
4031 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
4032 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
4033 */
4034 val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
4035 val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
4036
4037 if (swap)
4038 val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
4039
4040 return val << PAGE_SHIFT; 3480 return val << PAGE_SHIFT;
4041} 3481}
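
mem_cgroup_usage() now works in pages either way: the root group derives usage from the hierarchy-wide cache/rss (and swap) statistics since its counters are not charged, every other group just reads its page counter, and the result is shifted into bytes for the u64 interface. A sketch of that split, assuming 4 KiB pages:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages */

struct group {
    bool is_root;
    unsigned long counter_pages;  /* page_counter_read() stand-in */
    unsigned long cache_pages;    /* tree-wide stat stand-ins */
    unsigned long rss_pages;
};

static unsigned long long usage_bytes(const struct group *g)
{
    unsigned long pages;

    if (g->is_root)  /* root: counters not maintained, sum the stats */
        pages = g->cache_pages + g->rss_pages;
    else             /* everyone else: read the counter */
        pages = g->counter_pages;

    return (unsigned long long)pages << PAGE_SHIFT;
}

int main(void)
{
    struct group root  = { .is_root = true, .cache_pages = 300, .rss_pages = 700 };
    struct group child = { .counter_pages = 256 };

    printf("root: %llu bytes\n", usage_bytes(&root));    /* 4096000 */
    printf("child: %llu bytes\n", usage_bytes(&child));  /* 1048576 */
    return 0;
}
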
4042 3482
3483enum {
3484 RES_USAGE,
3485 RES_LIMIT,
3486 RES_MAX_USAGE,
3487 RES_FAILCNT,
3488 RES_SOFT_LIMIT,
3489};
4043 3490
4044static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, 3491static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
4045 struct cftype *cft) 3492 struct cftype *cft)
4046{ 3493{
4047 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 3494 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
4048 enum res_type type = MEMFILE_TYPE(cft->private); 3495 struct page_counter *counter;
4049 int name = MEMFILE_ATTR(cft->private);
4050 3496
4051 switch (type) { 3497 switch (MEMFILE_TYPE(cft->private)) {
4052 case _MEM: 3498 case _MEM:
4053 if (name == RES_USAGE) 3499 counter = &memcg->memory;
4054 return mem_cgroup_usage(memcg, false); 3500 break;
4055 return res_counter_read_u64(&memcg->res, name);
4056 case _MEMSWAP: 3501 case _MEMSWAP:
4057 if (name == RES_USAGE) 3502 counter = &memcg->memsw;
4058 return mem_cgroup_usage(memcg, true); 3503 break;
4059 return res_counter_read_u64(&memcg->memsw, name);
4060 case _KMEM: 3504 case _KMEM:
4061 return res_counter_read_u64(&memcg->kmem, name); 3505 counter = &memcg->kmem;
4062 break; 3506 break;
4063 default: 3507 default:
4064 BUG(); 3508 BUG();
4065 } 3509 }
3510
3511 switch (MEMFILE_ATTR(cft->private)) {
3512 case RES_USAGE:
3513 if (counter == &memcg->memory)
3514 return mem_cgroup_usage(memcg, false);
3515 if (counter == &memcg->memsw)
3516 return mem_cgroup_usage(memcg, true);
3517 return (u64)page_counter_read(counter) * PAGE_SIZE;
3518 case RES_LIMIT:
3519 return (u64)counter->limit * PAGE_SIZE;
3520 case RES_MAX_USAGE:
3521 return (u64)counter->watermark * PAGE_SIZE;
3522 case RES_FAILCNT:
3523 return counter->failcnt;
3524 case RES_SOFT_LIMIT:
3525 return (u64)memcg->soft_limit * PAGE_SIZE;
3526 default:
3527 BUG();
3528 }
4066} 3529}
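
mem_cgroup_read_u64() above now dispatches in two steps: MEMFILE_TYPE() selects which page_counter backs the file and MEMFILE_ATTR() selects which of its fields to report. A sketch of that scheme, assuming the usual shift-by-16 packing for MEMFILE_PRIVATE() (the macro itself is not shown in this hunk):

#include <stdio.h>

/* assumed packing: high 16 bits = counter type, low 16 bits = attribute */
#define MEMFILE_PRIVATE(type, attr) ((type) << 16 | (attr))
#define MEMFILE_TYPE(val)           ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val)           ((val) & 0xffff)

enum { _MEM, _MEMSWAP, _KMEM };                              /* which counter */
enum { RES_USAGE, RES_LIMIT, RES_MAX_USAGE, RES_FAILCNT };   /* which field */

struct counter { unsigned long usage, limit, watermark, failcnt; };

static unsigned long read_file(const struct counter counters[3], int private)
{
    const struct counter *c = &counters[MEMFILE_TYPE(private)];

    switch (MEMFILE_ATTR(private)) {
    case RES_USAGE:     return c->usage;
    case RES_LIMIT:     return c->limit;
    case RES_MAX_USAGE: return c->watermark;
    case RES_FAILCNT:   return c->failcnt;
    }
    return 0;
}

int main(void)
{
    struct counter counters[3] = {
        [_MEM]     = { .usage = 100, .limit = 512, .watermark = 200 },
        [_MEMSWAP] = { .usage = 120, .limit = 1024 },
        [_KMEM]    = { .failcnt = 3 },
    };

    printf("memory.limit = %lu\n",
           read_file(counters, MEMFILE_PRIVATE(_MEM, RES_LIMIT)));
    printf("kmem.failcnt = %lu\n",
           read_file(counters, MEMFILE_PRIVATE(_KMEM, RES_FAILCNT)));
    return 0;
}
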
4067 3530
4068#ifdef CONFIG_MEMCG_KMEM 3531#ifdef CONFIG_MEMCG_KMEM
4069/* should be called with activate_kmem_mutex held */ 3532static int memcg_activate_kmem(struct mem_cgroup *memcg,
4070static int __memcg_activate_kmem(struct mem_cgroup *memcg, 3533 unsigned long nr_pages)
4071 unsigned long long limit)
4072{ 3534{
4073 int err = 0; 3535 int err = 0;
4074 int memcg_id; 3536 int memcg_id;
@@ -4115,7 +3577,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
4115 * We couldn't have accounted to this cgroup, because it hasn't got the 3577 * We couldn't have accounted to this cgroup, because it hasn't got the
4116 * active bit set yet, so this should succeed. 3578 * active bit set yet, so this should succeed.
4117 */ 3579 */
4118 err = res_counter_set_limit(&memcg->kmem, limit); 3580 err = page_counter_limit(&memcg->kmem, nr_pages);
4119 VM_BUG_ON(err); 3581 VM_BUG_ON(err);
4120 3582
4121 static_key_slow_inc(&memcg_kmem_enabled_key); 3583 static_key_slow_inc(&memcg_kmem_enabled_key);
@@ -4130,26 +3592,17 @@ out:
4130 return err; 3592 return err;
4131} 3593}
4132 3594
4133static int memcg_activate_kmem(struct mem_cgroup *memcg,
4134 unsigned long long limit)
4135{
4136 int ret;
4137
4138 mutex_lock(&activate_kmem_mutex);
4139 ret = __memcg_activate_kmem(memcg, limit);
4140 mutex_unlock(&activate_kmem_mutex);
4141 return ret;
4142}
4143
4144static int memcg_update_kmem_limit(struct mem_cgroup *memcg, 3595static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
4145 unsigned long long val) 3596 unsigned long limit)
4146{ 3597{
4147 int ret; 3598 int ret;
4148 3599
3600 mutex_lock(&memcg_limit_mutex);
4149 if (!memcg_kmem_is_active(memcg)) 3601 if (!memcg_kmem_is_active(memcg))
4150 ret = memcg_activate_kmem(memcg, val); 3602 ret = memcg_activate_kmem(memcg, limit);
4151 else 3603 else
4152 ret = res_counter_set_limit(&memcg->kmem, val); 3604 ret = page_counter_limit(&memcg->kmem, limit);
3605 mutex_unlock(&memcg_limit_mutex);
4153 return ret; 3606 return ret;
4154} 3607}
4155 3608
@@ -4161,19 +3614,19 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
4161 if (!parent) 3614 if (!parent)
4162 return 0; 3615 return 0;
4163 3616
4164 mutex_lock(&activate_kmem_mutex); 3617 mutex_lock(&memcg_limit_mutex);
4165 /* 3618 /*
4166 * If the parent cgroup is not kmem-active now, it cannot be activated 3619 * If the parent cgroup is not kmem-active now, it cannot be activated
4167 * after this point, because it has at least one child already. 3620 * after this point, because it has at least one child already.
4168 */ 3621 */
4169 if (memcg_kmem_is_active(parent)) 3622 if (memcg_kmem_is_active(parent))
4170 ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX); 3623 ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
4171 mutex_unlock(&activate_kmem_mutex); 3624 mutex_unlock(&memcg_limit_mutex);
4172 return ret; 3625 return ret;
4173} 3626}
4174#else 3627#else
4175static int memcg_update_kmem_limit(struct mem_cgroup *memcg, 3628static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
4176 unsigned long long val) 3629 unsigned long limit)
4177{ 3630{
4178 return -EINVAL; 3631 return -EINVAL;
4179} 3632}
@@ -4187,110 +3640,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
4187 char *buf, size_t nbytes, loff_t off) 3640 char *buf, size_t nbytes, loff_t off)
4188{ 3641{
4189 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 3642 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
4190 enum res_type type; 3643 unsigned long nr_pages;
4191 int name;
4192 unsigned long long val;
4193 int ret; 3644 int ret;
4194 3645
4195 buf = strstrip(buf); 3646 buf = strstrip(buf);
4196 type = MEMFILE_TYPE(of_cft(of)->private); 3647 ret = page_counter_memparse(buf, &nr_pages);
4197 name = MEMFILE_ATTR(of_cft(of)->private); 3648 if (ret)
3649 return ret;
4198 3650
4199 switch (name) { 3651 switch (MEMFILE_ATTR(of_cft(of)->private)) {
4200 case RES_LIMIT: 3652 case RES_LIMIT:
4201 if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */ 3653 if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
4202 ret = -EINVAL; 3654 ret = -EINVAL;
4203 break; 3655 break;
4204 } 3656 }
4205 /* This function does all necessary parse...reuse it */ 3657 switch (MEMFILE_TYPE(of_cft(of)->private)) {
4206 ret = res_counter_memparse_write_strategy(buf, &val); 3658 case _MEM:
4207 if (ret) 3659 ret = mem_cgroup_resize_limit(memcg, nr_pages);
4208 break; 3660 break;
4209 if (type == _MEM) 3661 case _MEMSWAP:
4210 ret = mem_cgroup_resize_limit(memcg, val); 3662 ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
4211 else if (type == _MEMSWAP)
4212 ret = mem_cgroup_resize_memsw_limit(memcg, val);
4213 else if (type == _KMEM)
4214 ret = memcg_update_kmem_limit(memcg, val);
4215 else
4216 return -EINVAL;
4217 break;
4218 case RES_SOFT_LIMIT:
4219 ret = res_counter_memparse_write_strategy(buf, &val);
4220 if (ret)
4221 break; 3663 break;
4222 /* 3664 case _KMEM:
4223 * For memsw, soft limits are hard to implement in terms 3665 ret = memcg_update_kmem_limit(memcg, nr_pages);
4224 * of semantics, for now, we support soft limits for 3666 break;
4225 * control without swap 3667 }
4226 */
4227 if (type == _MEM)
4228 ret = res_counter_set_soft_limit(&memcg->res, val);
4229 else
4230 ret = -EINVAL;
4231 break; 3668 break;
4232 default: 3669 case RES_SOFT_LIMIT:
4233 ret = -EINVAL; /* should be BUG() ? */ 3670 memcg->soft_limit = nr_pages;
3671 ret = 0;
4234 break; 3672 break;
4235 } 3673 }
4236 return ret ?: nbytes; 3674 return ret ?: nbytes;
4237} 3675}
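
mem_cgroup_write() above parses the user buffer into a page count once, via page_counter_memparse(), before dispatching on file type, replacing the per-case res_counter_memparse_write_strategy() calls. Below is a simplified user-space parser in the same spirit; the exact kernel semantics (accepted suffixes, "-1" meaning no limit, rounding) are assumptions of this sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT  12            /* assumed 4 KiB pages */
#define COUNTER_MAX (~0UL >> 1)   /* stand-in for PAGE_COUNTER_MAX */

/* Parse "-1" as "no limit", otherwise bytes with an optional K/M/G
 * suffix, and hand back a page count (rounded down). */
static int parse_limit(const char *buf, unsigned long *nr_pages)
{
    unsigned long bytes;
    char *end;

    if (!strcmp(buf, "-1")) {
        *nr_pages = COUNTER_MAX;
        return 0;
    }

    bytes = strtoul(buf, &end, 10);
    switch (*end) {
    case 'G': case 'g': bytes <<= 10;          /* fall through */
    case 'M': case 'm': bytes <<= 10;          /* fall through */
    case 'K': case 'k': bytes <<= 10; end++; break;
    }
    if (*end != '\0' && *end != '\n')
        return -1;  /* trailing junk */

    *nr_pages = bytes >> PAGE_SHIFT;
    return 0;
}

int main(void)
{
    const char *inputs[] = { "4096", "512K", "1G", "-1" };
    unsigned long nr_pages;

    for (int i = 0; i < 4; i++) {
        if (!parse_limit(inputs[i], &nr_pages))
            printf("%-6s -> %lu pages\n", inputs[i], nr_pages);
    }
    return 0;
}
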
4238 3676
4239static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
4240 unsigned long long *mem_limit, unsigned long long *memsw_limit)
4241{
4242 unsigned long long min_limit, min_memsw_limit, tmp;
4243
4244 min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
4245 min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
4246 if (!memcg->use_hierarchy)
4247 goto out;
4248
4249 while (memcg->css.parent) {
4250 memcg = mem_cgroup_from_css(memcg->css.parent);
4251 if (!memcg->use_hierarchy)
4252 break;
4253 tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
4254 min_limit = min(min_limit, tmp);
4255 tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
4256 min_memsw_limit = min(min_memsw_limit, tmp);
4257 }
4258out:
4259 *mem_limit = min_limit;
4260 *memsw_limit = min_memsw_limit;
4261}
4262
4263static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf, 3677static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
4264 size_t nbytes, loff_t off) 3678 size_t nbytes, loff_t off)
4265{ 3679{
4266 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 3680 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
4267 int name; 3681 struct page_counter *counter;
4268 enum res_type type;
4269 3682
4270 type = MEMFILE_TYPE(of_cft(of)->private); 3683 switch (MEMFILE_TYPE(of_cft(of)->private)) {
4271 name = MEMFILE_ATTR(of_cft(of)->private); 3684 case _MEM:
3685 counter = &memcg->memory;
3686 break;
3687 case _MEMSWAP:
3688 counter = &memcg->memsw;
3689 break;
3690 case _KMEM:
3691 counter = &memcg->kmem;
3692 break;
3693 default:
3694 BUG();
3695 }
4272 3696
4273 switch (name) { 3697 switch (MEMFILE_ATTR(of_cft(of)->private)) {
4274 case RES_MAX_USAGE: 3698 case RES_MAX_USAGE:
4275 if (type == _MEM) 3699 page_counter_reset_watermark(counter);
4276 res_counter_reset_max(&memcg->res);
4277 else if (type == _MEMSWAP)
4278 res_counter_reset_max(&memcg->memsw);
4279 else if (type == _KMEM)
4280 res_counter_reset_max(&memcg->kmem);
4281 else
4282 return -EINVAL;
4283 break; 3700 break;
4284 case RES_FAILCNT: 3701 case RES_FAILCNT:
4285 if (type == _MEM) 3702 counter->failcnt = 0;
4286 res_counter_reset_failcnt(&memcg->res);
4287 else if (type == _MEMSWAP)
4288 res_counter_reset_failcnt(&memcg->memsw);
4289 else if (type == _KMEM)
4290 res_counter_reset_failcnt(&memcg->kmem);
4291 else
4292 return -EINVAL;
4293 break; 3703 break;
3704 default:
3705 BUG();
4294 } 3706 }
4295 3707
4296 return nbytes; 3708 return nbytes;
@@ -4387,6 +3799,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
4387static int memcg_stat_show(struct seq_file *m, void *v) 3799static int memcg_stat_show(struct seq_file *m, void *v)
4388{ 3800{
4389 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 3801 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
3802 unsigned long memory, memsw;
4390 struct mem_cgroup *mi; 3803 struct mem_cgroup *mi;
4391 unsigned int i; 3804 unsigned int i;
4392 3805
@@ -4406,14 +3819,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
4406 mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE); 3819 mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
4407 3820
4408 /* Hierarchical information */ 3821 /* Hierarchical information */
4409 { 3822 memory = memsw = PAGE_COUNTER_MAX;
4410 unsigned long long limit, memsw_limit; 3823 for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
4411 memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit); 3824 memory = min(memory, mi->memory.limit);
4412 seq_printf(m, "hierarchical_memory_limit %llu\n", limit); 3825 memsw = min(memsw, mi->memsw.limit);
4413 if (do_swap_account)
4414 seq_printf(m, "hierarchical_memsw_limit %llu\n",
4415 memsw_limit);
4416 } 3826 }
3827 seq_printf(m, "hierarchical_memory_limit %llu\n",
3828 (u64)memory * PAGE_SIZE);
3829 if (do_swap_account)
3830 seq_printf(m, "hierarchical_memsw_limit %llu\n",
3831 (u64)memsw * PAGE_SIZE);
4417 3832
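
The hierarchical limits reported by memcg_stat_show() are now computed inline by walking to the root and taking the minimum limit seen, then printing it in bytes. A sketch of that walk over a parent-linked chain of groups (use_hierarchy special cases omitted):

#include <stdio.h>

#define PAGE_SHIFT  12            /* assumed 4 KiB pages */
#define COUNTER_MAX (~0UL >> 1)   /* "no limit" stand-in */

struct group {
    unsigned long memory_limit;   /* in pages */
    struct group *parent;
};

static unsigned long long hierarchical_limit_bytes(struct group *g)
{
    unsigned long limit = COUNTER_MAX;

    for (; g; g = g->parent)      /* effective limit = min over ancestors */
        if (g->memory_limit < limit)
            limit = g->memory_limit;

    return (unsigned long long)limit << PAGE_SHIFT;
}

int main(void)
{
    struct group root  = { COUNTER_MAX, NULL };
    struct group mid   = { 4096, &root };   /* 16 MiB */
    struct group child = { 8192, &mid };    /* 32 MiB, capped by its parent */

    printf("hierarchical_memory_limit %llu\n",
           hierarchical_limit_bytes(&child));  /* 16777216 */
    return 0;
}
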
4418 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { 3833 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
4419 long long val = 0; 3834 long long val = 0;
@@ -4497,7 +3912,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
4497static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) 3912static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
4498{ 3913{
4499 struct mem_cgroup_threshold_ary *t; 3914 struct mem_cgroup_threshold_ary *t;
4500 u64 usage; 3915 unsigned long usage;
4501 int i; 3916 int i;
4502 3917
4503 rcu_read_lock(); 3918 rcu_read_lock();
@@ -4596,10 +4011,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
4596{ 4011{
4597 struct mem_cgroup_thresholds *thresholds; 4012 struct mem_cgroup_thresholds *thresholds;
4598 struct mem_cgroup_threshold_ary *new; 4013 struct mem_cgroup_threshold_ary *new;
4599 u64 threshold, usage; 4014 unsigned long threshold;
4015 unsigned long usage;
4600 int i, size, ret; 4016 int i, size, ret;
4601 4017
4602 ret = res_counter_memparse_write_strategy(args, &threshold); 4018 ret = page_counter_memparse(args, &threshold);
4603 if (ret) 4019 if (ret)
4604 return ret; 4020 return ret;
4605 4021
@@ -4689,7 +4105,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
4689{ 4105{
4690 struct mem_cgroup_thresholds *thresholds; 4106 struct mem_cgroup_thresholds *thresholds;
4691 struct mem_cgroup_threshold_ary *new; 4107 struct mem_cgroup_threshold_ary *new;
4692 u64 usage; 4108 unsigned long usage;
4693 int i, j, size; 4109 int i, j, size;
4694 4110
4695 mutex_lock(&memcg->thresholds_lock); 4111 mutex_lock(&memcg->thresholds_lock);
@@ -4855,40 +4271,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
4855{ 4271{
4856 mem_cgroup_sockets_destroy(memcg); 4272 mem_cgroup_sockets_destroy(memcg);
4857} 4273}
4858
4859static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
4860{
4861 if (!memcg_kmem_is_active(memcg))
4862 return;
4863
4864 /*
4865 * kmem charges can outlive the cgroup. In the case of slab
4866 * pages, for instance, a page contain objects from various
4867 * processes. As we prevent from taking a reference for every
4868 * such allocation we have to be careful when doing uncharge
4869 * (see memcg_uncharge_kmem) and here during offlining.
4870 *
4871 * The idea is that only the _last_ uncharge which sees
4872 * the dead memcg will drop the last reference. An additional
4873 * reference is taken here before the group is marked dead
4874 * which is then paired with css_put during uncharge resp. here.
4875 *
4876 * Although this might sound strange as this path is called from
4877 * css_offline() when the reference might have dropped down to 0 and
4878 * shouldn't be incremented anymore (css_tryget_online() would
4879 * fail) we do not have other options because of the kmem
4880 * allocations lifetime.
4881 */
4882 css_get(&memcg->css);
4883
4884 memcg_kmem_mark_dead(memcg);
4885
4886 if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
4887 return;
4888
4889 if (memcg_kmem_test_and_clear_dead(memcg))
4890 css_put(&memcg->css);
4891}
4892#else 4274#else
4893static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 4275static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
4894{ 4276{
@@ -4898,10 +4280,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
4898static void memcg_destroy_kmem(struct mem_cgroup *memcg) 4280static void memcg_destroy_kmem(struct mem_cgroup *memcg)
4899{ 4281{
4900} 4282}
4901
4902static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
4903{
4904}
4905#endif 4283#endif
4906 4284
4907/* 4285/*
@@ -5228,7 +4606,10 @@ static struct cftype mem_cgroup_files[] = {
5228#ifdef CONFIG_SLABINFO 4606#ifdef CONFIG_SLABINFO
5229 { 4607 {
5230 .name = "kmem.slabinfo", 4608 .name = "kmem.slabinfo",
5231 .seq_show = mem_cgroup_slabinfo_read, 4609 .seq_start = slab_start,
4610 .seq_next = slab_next,
4611 .seq_stop = slab_stop,
4612 .seq_show = memcg_slab_show,
5232 }, 4613 },
5233#endif 4614#endif
5234#endif 4615#endif
@@ -5363,9 +4744,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
5363 */ 4744 */
5364struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) 4745struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
5365{ 4746{
5366 if (!memcg->res.parent) 4747 if (!memcg->memory.parent)
5367 return NULL; 4748 return NULL;
5368 return mem_cgroup_from_res_counter(memcg->res.parent, res); 4749 return mem_cgroup_from_counter(memcg->memory.parent, memory);
5369} 4750}
5370EXPORT_SYMBOL(parent_mem_cgroup); 4751EXPORT_SYMBOL(parent_mem_cgroup);
5371 4752
@@ -5410,9 +4791,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
5410 /* root ? */ 4791 /* root ? */
5411 if (parent_css == NULL) { 4792 if (parent_css == NULL) {
5412 root_mem_cgroup = memcg; 4793 root_mem_cgroup = memcg;
5413 res_counter_init(&memcg->res, NULL); 4794 page_counter_init(&memcg->memory, NULL);
5414 res_counter_init(&memcg->memsw, NULL); 4795 page_counter_init(&memcg->memsw, NULL);
5415 res_counter_init(&memcg->kmem, NULL); 4796 page_counter_init(&memcg->kmem, NULL);
5416 } 4797 }
5417 4798
5418 memcg->last_scanned_node = MAX_NUMNODES; 4799 memcg->last_scanned_node = MAX_NUMNODES;
@@ -5451,18 +4832,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
5451 memcg->swappiness = mem_cgroup_swappiness(parent); 4832 memcg->swappiness = mem_cgroup_swappiness(parent);
5452 4833
5453 if (parent->use_hierarchy) { 4834 if (parent->use_hierarchy) {
5454 res_counter_init(&memcg->res, &parent->res); 4835 page_counter_init(&memcg->memory, &parent->memory);
5455 res_counter_init(&memcg->memsw, &parent->memsw); 4836 page_counter_init(&memcg->memsw, &parent->memsw);
5456 res_counter_init(&memcg->kmem, &parent->kmem); 4837 page_counter_init(&memcg->kmem, &parent->kmem);
5457 4838
5458 /* 4839 /*
5459 * No need to take a reference to the parent because cgroup 4840 * No need to take a reference to the parent because cgroup
5460 * core guarantees its existence. 4841 * core guarantees its existence.
5461 */ 4842 */
5462 } else { 4843 } else {
5463 res_counter_init(&memcg->res, NULL); 4844 page_counter_init(&memcg->memory, NULL);
5464 res_counter_init(&memcg->memsw, NULL); 4845 page_counter_init(&memcg->memsw, NULL);
5465 res_counter_init(&memcg->kmem, NULL); 4846 page_counter_init(&memcg->kmem, NULL);
5466 /* 4847 /*
5467 * Deeper hierachy with use_hierarchy == false doesn't make 4848 * Deeper hierachy with use_hierarchy == false doesn't make
5468 * much sense so let cgroup subsystem know about this 4849 * much sense so let cgroup subsystem know about this
@@ -5487,29 +4868,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
5487 return 0; 4868 return 0;
5488} 4869}
5489 4870
5490/*
5491 * Announce all parents that a group from their hierarchy is gone.
5492 */
5493static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
5494{
5495 struct mem_cgroup *parent = memcg;
5496
5497 while ((parent = parent_mem_cgroup(parent)))
5498 mem_cgroup_iter_invalidate(parent);
5499
5500 /*
5501 * if the root memcg is not hierarchical we have to check it
5502 * explicitly.
5503 */
5504 if (!root_mem_cgroup->use_hierarchy)
5505 mem_cgroup_iter_invalidate(root_mem_cgroup);
5506}
5507
5508static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) 4871static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
5509{ 4872{
5510 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4873 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
5511 struct mem_cgroup_event *event, *tmp; 4874 struct mem_cgroup_event *event, *tmp;
5512 struct cgroup_subsys_state *iter;
5513 4875
5514 /* 4876 /*
5515 * Unregister events and notify userspace. 4877 * Unregister events and notify userspace.
@@ -5523,17 +4885,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
5523 } 4885 }
5524 spin_unlock(&memcg->event_list_lock); 4886 spin_unlock(&memcg->event_list_lock);
5525 4887
5526 kmem_cgroup_css_offline(memcg);
5527
5528 mem_cgroup_invalidate_reclaim_iterators(memcg);
5529
5530 /*
5531 * This requires that offlining is serialized. Right now that is
5532 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
5533 */
5534 css_for_each_descendant_post(iter, css)
5535 mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
5536
5537 memcg_unregister_all_caches(memcg); 4888 memcg_unregister_all_caches(memcg);
5538 vmpressure_cleanup(&memcg->vmpressure); 4889 vmpressure_cleanup(&memcg->vmpressure);
5539} 4890}
@@ -5541,42 +4892,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
5541static void mem_cgroup_css_free(struct cgroup_subsys_state *css) 4892static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
5542{ 4893{
5543 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4894 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
5544 /*
5545 * XXX: css_offline() would be where we should reparent all
5546 * memory to prepare the cgroup for destruction. However,
5547 * memcg does not do css_tryget_online() and res_counter charging
5548 * under the same RCU lock region, which means that charging
5549 * could race with offlining. Offlining only happens to
5550 * cgroups with no tasks in them but charges can show up
5551 * without any tasks from the swapin path when the target
5552 * memcg is looked up from the swapout record and not from the
5553 * current task as it usually is. A race like this can leak
5554 * charges and put pages with stale cgroup pointers into
5555 * circulation:
5556 *
5557 * #0 #1
5558 * lookup_swap_cgroup_id()
5559 * rcu_read_lock()
5560 * mem_cgroup_lookup()
5561 * css_tryget_online()
5562 * rcu_read_unlock()
5563 * disable css_tryget_online()
5564 * call_rcu()
5565 * offline_css()
5566 * reparent_charges()
5567 * res_counter_charge()
5568 * css_put()
5569 * css_free()
5570 * pc->mem_cgroup = dead memcg
5571 * add page to lru
5572 *
5573 * The bulk of the charges are still moved in offline_css() to
5574 * avoid pinning a lot of pages in case a long-term reference
5575 * like a swapout record is deferring the css_free() to long
5576 * after offlining. But this makes sure we catch any charges
5577 * made after offlining:
5578 */
5579 mem_cgroup_reparent_charges(memcg);
5580 4895
5581 memcg_destroy_kmem(memcg); 4896 memcg_destroy_kmem(memcg);
5582 __mem_cgroup_free(memcg); 4897 __mem_cgroup_free(memcg);
@@ -5599,10 +4914,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
5599{ 4914{
5600 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4915 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
5601 4916
5602 mem_cgroup_resize_limit(memcg, ULLONG_MAX); 4917 mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
5603 mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX); 4918 mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
5604 memcg_update_kmem_limit(memcg, ULLONG_MAX); 4919 memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
5605 res_counter_set_soft_limit(&memcg->res, ULLONG_MAX); 4920 memcg->soft_limit = 0;
5606} 4921}
5607 4922
5608#ifdef CONFIG_MMU 4923#ifdef CONFIG_MMU
@@ -5758,7 +5073,6 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
5758 unsigned long addr, pte_t ptent, union mc_target *target) 5073 unsigned long addr, pte_t ptent, union mc_target *target)
5759{ 5074{
5760 struct page *page = NULL; 5075 struct page *page = NULL;
5761 struct page_cgroup *pc;
5762 enum mc_target_type ret = MC_TARGET_NONE; 5076 enum mc_target_type ret = MC_TARGET_NONE;
5763 swp_entry_t ent = { .val = 0 }; 5077 swp_entry_t ent = { .val = 0 };
5764 5078
@@ -5772,13 +5086,12 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
5772 if (!page && !ent.val) 5086 if (!page && !ent.val)
5773 return ret; 5087 return ret;
5774 if (page) { 5088 if (page) {
5775 pc = lookup_page_cgroup(page);
5776 /* 5089 /*
5777 * Do only loose check w/o serialization. 5090 * Do only loose check w/o serialization.
5778 * mem_cgroup_move_account() checks the pc is valid or 5091 * mem_cgroup_move_account() checks the page is valid or
5779 * not under LRU exclusion. 5092 * not under LRU exclusion.
5780 */ 5093 */
5781 if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) { 5094 if (page->mem_cgroup == mc.from) {
5782 ret = MC_TARGET_PAGE; 5095 ret = MC_TARGET_PAGE;
5783 if (target) 5096 if (target)
5784 target->page = page; 5097 target->page = page;
@@ -5806,15 +5119,13 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
5806 unsigned long addr, pmd_t pmd, union mc_target *target) 5119 unsigned long addr, pmd_t pmd, union mc_target *target)
5807{ 5120{
5808 struct page *page = NULL; 5121 struct page *page = NULL;
5809 struct page_cgroup *pc;
5810 enum mc_target_type ret = MC_TARGET_NONE; 5122 enum mc_target_type ret = MC_TARGET_NONE;
5811 5123
5812 page = pmd_page(pmd); 5124 page = pmd_page(pmd);
5813 VM_BUG_ON_PAGE(!page || !PageHead(page), page); 5125 VM_BUG_ON_PAGE(!page || !PageHead(page), page);
5814 if (!move_anon()) 5126 if (!move_anon())
5815 return ret; 5127 return ret;
5816 pc = lookup_page_cgroup(page); 5128 if (page->mem_cgroup == mc.from) {
5817 if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
5818 ret = MC_TARGET_PAGE; 5129 ret = MC_TARGET_PAGE;
5819 if (target) { 5130 if (target) {
5820 get_page(page); 5131 get_page(page);
@@ -5897,7 +5208,6 @@ static void __mem_cgroup_clear_mc(void)
5897{ 5208{
5898 struct mem_cgroup *from = mc.from; 5209 struct mem_cgroup *from = mc.from;
5899 struct mem_cgroup *to = mc.to; 5210 struct mem_cgroup *to = mc.to;
5900 int i;
5901 5211
5902 /* we must uncharge all the leftover precharges from mc.to */ 5212 /* we must uncharge all the leftover precharges from mc.to */
5903 if (mc.precharge) { 5213 if (mc.precharge) {
@@ -5916,19 +5226,17 @@ static void __mem_cgroup_clear_mc(void)
5916 if (mc.moved_swap) { 5226 if (mc.moved_swap) {
5917 /* uncharge swap account from the old cgroup */ 5227 /* uncharge swap account from the old cgroup */
5918 if (!mem_cgroup_is_root(mc.from)) 5228 if (!mem_cgroup_is_root(mc.from))
5919 res_counter_uncharge(&mc.from->memsw, 5229 page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
5920 PAGE_SIZE * mc.moved_swap);
5921
5922 for (i = 0; i < mc.moved_swap; i++)
5923 css_put(&mc.from->css);
5924 5230
5925 /* 5231 /*
5926 * we charged both to->res and to->memsw, so we should 5232 * we charged both to->memory and to->memsw, so we
5927 * uncharge to->res. 5233 * should uncharge to->memory.
5928 */ 5234 */
5929 if (!mem_cgroup_is_root(mc.to)) 5235 if (!mem_cgroup_is_root(mc.to))
5930 res_counter_uncharge(&mc.to->res, 5236 page_counter_uncharge(&mc.to->memory, mc.moved_swap);
5931 PAGE_SIZE * mc.moved_swap); 5237
5238 css_put_many(&mc.from->css, mc.moved_swap);
5239
5932 /* we've already done css_get(mc.to) */ 5240 /* we've already done css_get(mc.to) */
5933 mc.moved_swap = 0; 5241 mc.moved_swap = 0;
5934 } 5242 }
@@ -5939,8 +5247,6 @@ static void __mem_cgroup_clear_mc(void)
5939 5247
5940static void mem_cgroup_clear_mc(void) 5248static void mem_cgroup_clear_mc(void)
5941{ 5249{
5942 struct mem_cgroup *from = mc.from;
5943
5944 /* 5250 /*
5945 * we must clear moving_task before waking up waiters at the end of 5251 * we must clear moving_task before waking up waiters at the end of
5946 * task migration. 5252 * task migration.
@@ -5951,7 +5257,6 @@ static void mem_cgroup_clear_mc(void)
5951 mc.from = NULL; 5257 mc.from = NULL;
5952 mc.to = NULL; 5258 mc.to = NULL;
5953 spin_unlock(&mc.lock); 5259 spin_unlock(&mc.lock);
5954 mem_cgroup_end_move(from);
5955} 5260}
5956 5261
5957static int mem_cgroup_can_attach(struct cgroup_subsys_state *css, 5262static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
@@ -5984,7 +5289,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
5984 VM_BUG_ON(mc.precharge); 5289 VM_BUG_ON(mc.precharge);
5985 VM_BUG_ON(mc.moved_charge); 5290 VM_BUG_ON(mc.moved_charge);
5986 VM_BUG_ON(mc.moved_swap); 5291 VM_BUG_ON(mc.moved_swap);
5987 mem_cgroup_start_move(from); 5292
5988 spin_lock(&mc.lock); 5293 spin_lock(&mc.lock);
5989 mc.from = from; 5294 mc.from = from;
5990 mc.to = memcg; 5295 mc.to = memcg;
@@ -6004,7 +5309,8 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
6004static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, 5309static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css,
6005 struct cgroup_taskset *tset) 5310 struct cgroup_taskset *tset)
6006{ 5311{
6007 mem_cgroup_clear_mc(); 5312 if (mc.to)
5313 mem_cgroup_clear_mc();
6008} 5314}
6009 5315
6010static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, 5316static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
@@ -6018,7 +5324,6 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
6018 enum mc_target_type target_type; 5324 enum mc_target_type target_type;
6019 union mc_target target; 5325 union mc_target target;
6020 struct page *page; 5326 struct page *page;
6021 struct page_cgroup *pc;
6022 5327
6023 /* 5328 /*
6024 * We don't take compound_lock() here but no race with splitting thp 5329 * We don't take compound_lock() here but no race with splitting thp
@@ -6039,9 +5344,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
6039 if (target_type == MC_TARGET_PAGE) { 5344 if (target_type == MC_TARGET_PAGE) {
6040 page = target.page; 5345 page = target.page;
6041 if (!isolate_lru_page(page)) { 5346 if (!isolate_lru_page(page)) {
6042 pc = lookup_page_cgroup(page);
6043 if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, 5347 if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
6044 pc, mc.from, mc.to)) { 5348 mc.from, mc.to)) {
6045 mc.precharge -= HPAGE_PMD_NR; 5349 mc.precharge -= HPAGE_PMD_NR;
6046 mc.moved_charge += HPAGE_PMD_NR; 5350 mc.moved_charge += HPAGE_PMD_NR;
6047 } 5351 }
@@ -6069,9 +5373,7 @@ retry:
6069 page = target.page; 5373 page = target.page;
6070 if (isolate_lru_page(page)) 5374 if (isolate_lru_page(page))
6071 goto put; 5375 goto put;
6072 pc = lookup_page_cgroup(page); 5376 if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
6073 if (!mem_cgroup_move_account(page, 1, pc,
6074 mc.from, mc.to)) {
6075 mc.precharge--; 5377 mc.precharge--;
6076 /* we uncharge from mc.from later. */ 5378 /* we uncharge from mc.from later. */
6077 mc.moved_charge++; 5379 mc.moved_charge++;
@@ -6115,6 +5417,13 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
6115 struct vm_area_struct *vma; 5417 struct vm_area_struct *vma;
6116 5418
6117 lru_add_drain_all(); 5419 lru_add_drain_all();
5420 /*
5421 * Signal mem_cgroup_begin_page_stat() to take the memcg's
5422 * move_lock while we're moving its pages to another memcg.
5423 * Then wait for already started RCU-only updates to finish.
5424 */
5425 atomic_inc(&mc.from->moving_account);
5426 synchronize_rcu();
6118retry: 5427retry:
6119 if (unlikely(!down_read_trylock(&mm->mmap_sem))) { 5428 if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
6120 /* 5429 /*
@@ -6147,6 +5456,7 @@ retry:
6147 break; 5456 break;
6148 } 5457 }
6149 up_read(&mm->mmap_sem); 5458 up_read(&mm->mmap_sem);
5459 atomic_dec(&mc.from->moving_account);
6150} 5460}
6151 5461
6152static void mem_cgroup_move_task(struct cgroup_subsys_state *css, 5462static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
@@ -6250,7 +5560,7 @@ static void __init enable_swap_cgroup(void)
6250 */ 5560 */
6251void mem_cgroup_swapout(struct page *page, swp_entry_t entry) 5561void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
6252{ 5562{
6253 struct page_cgroup *pc; 5563 struct mem_cgroup *memcg;
6254 unsigned short oldid; 5564 unsigned short oldid;
6255 5565
6256 VM_BUG_ON_PAGE(PageLRU(page), page); 5566 VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -6259,20 +5569,26 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
6259 if (!do_swap_account) 5569 if (!do_swap_account)
6260 return; 5570 return;
6261 5571
6262 pc = lookup_page_cgroup(page); 5572 memcg = page->mem_cgroup;
6263 5573
6264 /* Readahead page, never charged */ 5574 /* Readahead page, never charged */
6265 if (!PageCgroupUsed(pc)) 5575 if (!memcg)
6266 return; 5576 return;
6267 5577
6268 VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page); 5578 oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
6269
6270 oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
6271 VM_BUG_ON_PAGE(oldid, page); 5579 VM_BUG_ON_PAGE(oldid, page);
5580 mem_cgroup_swap_statistics(memcg, true);
5581
5582 page->mem_cgroup = NULL;
6272 5583
6273 pc->flags &= ~PCG_MEMSW; 5584 if (!mem_cgroup_is_root(memcg))
6274 css_get(&pc->mem_cgroup->css); 5585 page_counter_uncharge(&memcg->memory, 1);
6275 mem_cgroup_swap_statistics(pc->mem_cgroup, true); 5586
5587 /* XXX: caller holds IRQ-safe mapping->tree_lock */
5588 VM_BUG_ON(!irqs_disabled());
5589
5590 mem_cgroup_charge_statistics(memcg, page, -1);
5591 memcg_check_events(memcg, page);
6276} 5592}
6277 5593
6278/** 5594/**
@@ -6294,7 +5610,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
6294 memcg = mem_cgroup_lookup(id); 5610 memcg = mem_cgroup_lookup(id);
6295 if (memcg) { 5611 if (memcg) {
6296 if (!mem_cgroup_is_root(memcg)) 5612 if (!mem_cgroup_is_root(memcg))
6297 res_counter_uncharge(&memcg->memsw, PAGE_SIZE); 5613 page_counter_uncharge(&memcg->memsw, 1);
6298 mem_cgroup_swap_statistics(memcg, false); 5614 mem_cgroup_swap_statistics(memcg, false);
6299 css_put(&memcg->css); 5615 css_put(&memcg->css);
6300 } 5616 }
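Taken together, the two hunks above split swap accounting into a hand-off: mem_cgroup_swapout() parks the owning cgroup's id in the swap_cgroup map and drops the memory charge as the page leaves for swap, and mem_cgroup_uncharge_swap() later resolves that id back to a memcg and releases the memsw charge when the swap entry dies. Condensed sketch (statistics, css reference counting and the root-cgroup special case are as in the hunks):

	/* page -> swap slot 'entry' */
	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));	/* stash owner id */
	page->mem_cgroup = NULL;
	page_counter_uncharge(&memcg->memory, 1);	/* memsw stays charged for the slot */

	/* swap slot freed: the recorded id is read back and memsw uncharged */
	memcg = mem_cgroup_lookup(id);
	if (memcg)
		page_counter_uncharge(&memcg->memsw, 1);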
@@ -6330,7 +5646,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
6330 goto out; 5646 goto out;
6331 5647
6332 if (PageSwapCache(page)) { 5648 if (PageSwapCache(page)) {
6333 struct page_cgroup *pc = lookup_page_cgroup(page);
6334 /* 5649 /*
6335 * Every swap fault against a single page tries to charge the 5650 * Every swap fault against a single page tries to charge the
6336 * page, bail as early as possible. shmem_unuse() encounters 5651 * page, bail as early as possible. shmem_unuse() encounters
@@ -6338,7 +5653,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
6338 * the page lock, which serializes swap cache removal, which 5653 * the page lock, which serializes swap cache removal, which
6339 * in turn serializes uncharging. 5654 * in turn serializes uncharging.
6340 */ 5655 */
6341 if (PageCgroupUsed(pc)) 5656 if (page->mem_cgroup)
6342 goto out; 5657 goto out;
6343 } 5658 }
6344 5659
@@ -6452,19 +5767,16 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
6452} 5767}
6453 5768
6454static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, 5769static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
6455 unsigned long nr_mem, unsigned long nr_memsw,
6456 unsigned long nr_anon, unsigned long nr_file, 5770 unsigned long nr_anon, unsigned long nr_file,
6457 unsigned long nr_huge, struct page *dummy_page) 5771 unsigned long nr_huge, struct page *dummy_page)
6458{ 5772{
5773 unsigned long nr_pages = nr_anon + nr_file;
6459 unsigned long flags; 5774 unsigned long flags;
6460 5775
6461 if (!mem_cgroup_is_root(memcg)) { 5776 if (!mem_cgroup_is_root(memcg)) {
6462 if (nr_mem) 5777 page_counter_uncharge(&memcg->memory, nr_pages);
6463 res_counter_uncharge(&memcg->res, 5778 if (do_swap_account)
6464 nr_mem * PAGE_SIZE); 5779 page_counter_uncharge(&memcg->memsw, nr_pages);
6465 if (nr_memsw)
6466 res_counter_uncharge(&memcg->memsw,
6467 nr_memsw * PAGE_SIZE);
6468 memcg_oom_recover(memcg); 5780 memcg_oom_recover(memcg);
6469 } 5781 }
6470 5782
@@ -6473,27 +5785,27 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
6473 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file); 5785 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file);
6474 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge); 5786 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
6475 __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout); 5787 __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
6476 __this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file); 5788 __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
6477 memcg_check_events(memcg, dummy_page); 5789 memcg_check_events(memcg, dummy_page);
6478 local_irq_restore(flags); 5790 local_irq_restore(flags);
5791
5792 if (!mem_cgroup_is_root(memcg))
5793 css_put_many(&memcg->css, nr_pages);
6479} 5794}
6480 5795
6481static void uncharge_list(struct list_head *page_list) 5796static void uncharge_list(struct list_head *page_list)
6482{ 5797{
6483 struct mem_cgroup *memcg = NULL; 5798 struct mem_cgroup *memcg = NULL;
6484 unsigned long nr_memsw = 0;
6485 unsigned long nr_anon = 0; 5799 unsigned long nr_anon = 0;
6486 unsigned long nr_file = 0; 5800 unsigned long nr_file = 0;
6487 unsigned long nr_huge = 0; 5801 unsigned long nr_huge = 0;
6488 unsigned long pgpgout = 0; 5802 unsigned long pgpgout = 0;
6489 unsigned long nr_mem = 0;
6490 struct list_head *next; 5803 struct list_head *next;
6491 struct page *page; 5804 struct page *page;
6492 5805
6493 next = page_list->next; 5806 next = page_list->next;
6494 do { 5807 do {
6495 unsigned int nr_pages = 1; 5808 unsigned int nr_pages = 1;
6496 struct page_cgroup *pc;
6497 5809
6498 page = list_entry(next, struct page, lru); 5810 page = list_entry(next, struct page, lru);
6499 next = page->lru.next; 5811 next = page->lru.next;
@@ -6501,24 +5813,22 @@ static void uncharge_list(struct list_head *page_list)
6501 VM_BUG_ON_PAGE(PageLRU(page), page); 5813 VM_BUG_ON_PAGE(PageLRU(page), page);
6502 VM_BUG_ON_PAGE(page_count(page), page); 5814 VM_BUG_ON_PAGE(page_count(page), page);
6503 5815
6504 pc = lookup_page_cgroup(page); 5816 if (!page->mem_cgroup)
6505 if (!PageCgroupUsed(pc))
6506 continue; 5817 continue;
6507 5818
6508 /* 5819 /*
6509 * Nobody should be changing or seriously looking at 5820 * Nobody should be changing or seriously looking at
6510 * pc->mem_cgroup and pc->flags at this point, we have 5821 * page->mem_cgroup at this point, we have fully
6511 * fully exclusive access to the page. 5822 * exclusive access to the page.
6512 */ 5823 */
6513 5824
6514 if (memcg != pc->mem_cgroup) { 5825 if (memcg != page->mem_cgroup) {
6515 if (memcg) { 5826 if (memcg) {
6516 uncharge_batch(memcg, pgpgout, nr_mem, nr_memsw, 5827 uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
6517 nr_anon, nr_file, nr_huge, page); 5828 nr_huge, page);
6518 pgpgout = nr_mem = nr_memsw = 0; 5829 pgpgout = nr_anon = nr_file = nr_huge = 0;
6519 nr_anon = nr_file = nr_huge = 0;
6520 } 5830 }
6521 memcg = pc->mem_cgroup; 5831 memcg = page->mem_cgroup;
6522 } 5832 }
6523 5833
6524 if (PageTransHuge(page)) { 5834 if (PageTransHuge(page)) {
@@ -6532,18 +5842,14 @@ static void uncharge_list(struct list_head *page_list)
6532 else 5842 else
6533 nr_file += nr_pages; 5843 nr_file += nr_pages;
6534 5844
6535 if (pc->flags & PCG_MEM) 5845 page->mem_cgroup = NULL;
6536 nr_mem += nr_pages;
6537 if (pc->flags & PCG_MEMSW)
6538 nr_memsw += nr_pages;
6539 pc->flags = 0;
6540 5846
6541 pgpgout++; 5847 pgpgout++;
6542 } while (next != page_list); 5848 } while (next != page_list);
6543 5849
6544 if (memcg) 5850 if (memcg)
6545 uncharge_batch(memcg, pgpgout, nr_mem, nr_memsw, 5851 uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
6546 nr_anon, nr_file, nr_huge, page); 5852 nr_huge, page);
6547} 5853}
6548 5854
6549/** 5855/**
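With the PCG_MEM/PCG_MEMSW flags gone, uncharge_list() needs nothing but page->mem_cgroup to decide what to release, and it still batches runs of pages owned by the same cgroup into a single uncharge_batch() call. A hypothetical bulk-free caller (a sketch, not the exact release_pages() code) stays as simple as:

	LIST_HEAD(pages_to_free);

	/* collect pages whose last reference just dropped ... */
	list_add(&page->lru, &pages_to_free);

	mem_cgroup_uncharge_list(&pages_to_free);	/* one pass, batched per memcg */
	free_hot_cold_page_list(&pages_to_free, cold);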
@@ -6555,14 +5861,11 @@ static void uncharge_list(struct list_head *page_list)
6555 */ 5861 */
6556void mem_cgroup_uncharge(struct page *page) 5862void mem_cgroup_uncharge(struct page *page)
6557{ 5863{
6558 struct page_cgroup *pc;
6559
6560 if (mem_cgroup_disabled()) 5864 if (mem_cgroup_disabled())
6561 return; 5865 return;
6562 5866
6563 /* Don't touch page->lru of any random page, pre-check: */ 5867 /* Don't touch page->lru of any random page, pre-check: */
6564 pc = lookup_page_cgroup(page); 5868 if (!page->mem_cgroup)
6565 if (!PageCgroupUsed(pc))
6566 return; 5869 return;
6567 5870
6568 INIT_LIST_HEAD(&page->lru); 5871 INIT_LIST_HEAD(&page->lru);
@@ -6598,7 +5901,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
6598void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, 5901void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
6599 bool lrucare) 5902 bool lrucare)
6600{ 5903{
6601 struct page_cgroup *pc; 5904 struct mem_cgroup *memcg;
6602 int isolated; 5905 int isolated;
6603 5906
6604 VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); 5907 VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
@@ -6613,27 +5916,28 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
6613 return; 5916 return;
6614 5917
6615 /* Page cache replacement: new page already charged? */ 5918 /* Page cache replacement: new page already charged? */
6616 pc = lookup_page_cgroup(newpage); 5919 if (newpage->mem_cgroup)
6617 if (PageCgroupUsed(pc))
6618 return; 5920 return;
6619 5921
6620 /* Re-entrant migration: old page already uncharged? */ 5922 /*
6621 pc = lookup_page_cgroup(oldpage); 5923 * Swapcache readahead pages can get migrated before being
6622 if (!PageCgroupUsed(pc)) 5924 * charged, and migration from compaction can happen to an
5925 * uncharged page when the PFN walker finds a page that
5926 * reclaim just put back on the LRU but has not released yet.
5927 */
5928 memcg = oldpage->mem_cgroup;
5929 if (!memcg)
6623 return; 5930 return;
6624 5931
6625 VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
6626 VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
6627
6628 if (lrucare) 5932 if (lrucare)
6629 lock_page_lru(oldpage, &isolated); 5933 lock_page_lru(oldpage, &isolated);
6630 5934
6631 pc->flags = 0; 5935 oldpage->mem_cgroup = NULL;
6632 5936
6633 if (lrucare) 5937 if (lrucare)
6634 unlock_page_lru(oldpage, isolated); 5938 unlock_page_lru(oldpage, isolated);
6635 5939
6636 commit_charge(newpage, pc->mem_cgroup, lrucare); 5940 commit_charge(newpage, memcg, lrucare);
6637} 5941}
6638 5942
6639/* 5943/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b852b10ec76d..e5ee0ca7ae85 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -233,7 +233,7 @@ void shake_page(struct page *p, int access)
233 lru_add_drain_all(); 233 lru_add_drain_all();
234 if (PageLRU(p)) 234 if (PageLRU(p))
235 return; 235 return;
236 drain_all_pages(); 236 drain_all_pages(page_zone(p));
237 if (PageLRU(p) || is_free_buddy_page(p)) 237 if (PageLRU(p) || is_free_buddy_page(p))
238 return; 238 return;
239 } 239 }
@@ -1661,7 +1661,7 @@ static int __soft_offline_page(struct page *page, int flags)
1661 if (!is_free_buddy_page(page)) 1661 if (!is_free_buddy_page(page))
1662 lru_add_drain_all(); 1662 lru_add_drain_all();
1663 if (!is_free_buddy_page(page)) 1663 if (!is_free_buddy_page(page))
1664 drain_all_pages(); 1664 drain_all_pages(page_zone(page));
1665 SetPageHWPoison(page); 1665 SetPageHWPoison(page);
1666 if (!is_free_buddy_page(page)) 1666 if (!is_free_buddy_page(page))
1667 pr_info("soft offline: %#lx: page leaked\n", 1667 pr_info("soft offline: %#lx: page leaked\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807cb21e..9fab10795bea 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1725,7 +1725,7 @@ repeat:
1725 if (drain) { 1725 if (drain) {
1726 lru_add_drain_all(); 1726 lru_add_drain_all();
1727 cond_resched(); 1727 cond_resched();
1728 drain_all_pages(); 1728 drain_all_pages(zone);
1729 } 1729 }
1730 1730
1731 pfn = scan_movable_pages(start_pfn, end_pfn); 1731 pfn = scan_movable_pages(start_pfn, end_pfn);
@@ -1747,7 +1747,7 @@ repeat:
1747 lru_add_drain_all(); 1747 lru_add_drain_all();
1748 yield(); 1748 yield();
1749 /* drain pcp pages, this is synchronous. */ 1749 /* drain pcp pages, this is synchronous. */
1750 drain_all_pages(); 1750 drain_all_pages(zone);
1751 /* 1751 /*
1752 * dissolve free hugepages in the memory block before doing offlining 1752 * dissolve free hugepages in the memory block before doing offlining
1753 * actually in order to make hugetlbfs's object counting consistent. 1753 * actually in order to make hugetlbfs's object counting consistent.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..3b014d326151 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -119,7 +119,7 @@ found:
119 119
120/* return true if the task is not adequate as candidate victim task. */ 120/* return true if the task is not adequate as candidate victim task. */
121static bool oom_unkillable_task(struct task_struct *p, 121static bool oom_unkillable_task(struct task_struct *p,
122 const struct mem_cgroup *memcg, const nodemask_t *nodemask) 122 struct mem_cgroup *memcg, const nodemask_t *nodemask)
123{ 123{
124 if (is_global_init(p)) 124 if (is_global_init(p))
125 return true; 125 return true;
@@ -353,7 +353,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
353 * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, 353 * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes,
354 * swapents, oom_score_adj value, and name. 354 * swapents, oom_score_adj value, and name.
355 */ 355 */
356static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) 356static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
357{ 357{
358 struct task_struct *p; 358 struct task_struct *p;
359 struct task_struct *task; 359 struct task_struct *task;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..d5d81f5384d1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2357,7 +2357,7 @@ int test_clear_page_writeback(struct page *page)
2357 dec_zone_page_state(page, NR_WRITEBACK); 2357 dec_zone_page_state(page, NR_WRITEBACK);
2358 inc_zone_page_state(page, NR_WRITTEN); 2358 inc_zone_page_state(page, NR_WRITTEN);
2359 } 2359 }
2360 mem_cgroup_end_page_stat(memcg, locked, memcg_flags); 2360 mem_cgroup_end_page_stat(memcg, &locked, &memcg_flags);
2361 return ret; 2361 return ret;
2362} 2362}
2363 2363
@@ -2399,7 +2399,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
2399 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); 2399 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
2400 inc_zone_page_state(page, NR_WRITEBACK); 2400 inc_zone_page_state(page, NR_WRITEBACK);
2401 } 2401 }
2402 mem_cgroup_end_page_stat(memcg, locked, memcg_flags); 2402 mem_cgroup_end_page_stat(memcg, &locked, &memcg_flags);
2403 return ret; 2403 return ret;
2404 2404
2405} 2405}
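Both call sites now pass &locked and &flags back to mem_cgroup_end_page_stat(), so the end side can undo exactly what the begin side did: drop move_lock and restore interrupts only if the lock was actually taken. The writeback accounting pattern, condensed from the hunk above (mem_cgroup_begin_page_stat() is assumed to keep the matching pointer-based signature):

	struct mem_cgroup *memcg;
	unsigned long flags;
	bool locked;

	memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
	if (!TestSetPageWriteback(page))
		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
	mem_cgroup_end_page_stat(memcg, &locked, &flags);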
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 616a2c956b4b..a7198c065999 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,7 +48,6 @@
48#include <linux/backing-dev.h> 48#include <linux/backing-dev.h>
49#include <linux/fault-inject.h> 49#include <linux/fault-inject.h>
50#include <linux/page-isolation.h> 50#include <linux/page-isolation.h>
51#include <linux/page_cgroup.h>
52#include <linux/debugobjects.h> 51#include <linux/debugobjects.h>
53#include <linux/kmemleak.h> 52#include <linux/kmemleak.h>
54#include <linux/compaction.h> 53#include <linux/compaction.h>
@@ -641,8 +640,10 @@ static inline int free_pages_check(struct page *page)
641 bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"; 640 bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
642 bad_flags = PAGE_FLAGS_CHECK_AT_FREE; 641 bad_flags = PAGE_FLAGS_CHECK_AT_FREE;
643 } 642 }
644 if (unlikely(mem_cgroup_bad_page_check(page))) 643#ifdef CONFIG_MEMCG
645 bad_reason = "cgroup check failed"; 644 if (unlikely(page->mem_cgroup))
645 bad_reason = "page still charged to cgroup";
646#endif
646 if (unlikely(bad_reason)) { 647 if (unlikely(bad_reason)) {
647 bad_page(page, bad_reason, bad_flags); 648 bad_page(page, bad_reason, bad_flags);
648 return 1; 649 return 1;
@@ -741,6 +742,9 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
741 int i; 742 int i;
742 int bad = 0; 743 int bad = 0;
743 744
745 VM_BUG_ON_PAGE(PageTail(page), page);
746 VM_BUG_ON_PAGE(PageHead(page) && compound_order(page) != order, page);
747
744 trace_mm_page_free(page, order); 748 trace_mm_page_free(page, order);
745 kmemcheck_free_shadow(page, order); 749 kmemcheck_free_shadow(page, order);
746 750
@@ -898,8 +902,10 @@ static inline int check_new_page(struct page *page)
898 bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set"; 902 bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set";
899 bad_flags = PAGE_FLAGS_CHECK_AT_PREP; 903 bad_flags = PAGE_FLAGS_CHECK_AT_PREP;
900 } 904 }
901 if (unlikely(mem_cgroup_bad_page_check(page))) 905#ifdef CONFIG_MEMCG
902 bad_reason = "cgroup check failed"; 906 if (unlikely(page->mem_cgroup))
907 bad_reason = "page still charged to cgroup";
908#endif
903 if (unlikely(bad_reason)) { 909 if (unlikely(bad_reason)) {
904 bad_page(page, bad_reason, bad_flags); 910 bad_page(page, bad_reason, bad_flags);
905 return 1; 911 return 1;
@@ -1267,55 +1273,75 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
1267#endif 1273#endif
1268 1274
1269/* 1275/*
1270 * Drain pages of the indicated processor. 1276 * Drain pcplists of the indicated processor and zone.
1271 * 1277 *
1272 * The processor must either be the current processor and the 1278 * The processor must either be the current processor and the
1273 * thread pinned to the current processor or a processor that 1279 * thread pinned to the current processor or a processor that
1274 * is not online. 1280 * is not online.
1275 */ 1281 */
1276static void drain_pages(unsigned int cpu) 1282static void drain_pages_zone(unsigned int cpu, struct zone *zone)
1277{ 1283{
1278 unsigned long flags; 1284 unsigned long flags;
1279 struct zone *zone; 1285 struct per_cpu_pageset *pset;
1286 struct per_cpu_pages *pcp;
1280 1287
1281 for_each_populated_zone(zone) { 1288 local_irq_save(flags);
1282 struct per_cpu_pageset *pset; 1289 pset = per_cpu_ptr(zone->pageset, cpu);
1283 struct per_cpu_pages *pcp;
1284 1290
1285 local_irq_save(flags); 1291 pcp = &pset->pcp;
1286 pset = per_cpu_ptr(zone->pageset, cpu); 1292 if (pcp->count) {
1293 free_pcppages_bulk(zone, pcp->count, pcp);
1294 pcp->count = 0;
1295 }
1296 local_irq_restore(flags);
1297}
1287 1298
1288 pcp = &pset->pcp; 1299/*
1289 if (pcp->count) { 1300 * Drain pcplists of all zones on the indicated processor.
1290 free_pcppages_bulk(zone, pcp->count, pcp); 1301 *
1291 pcp->count = 0; 1302 * The processor must either be the current processor and the
1292 } 1303 * thread pinned to the current processor or a processor that
1293 local_irq_restore(flags); 1304 * is not online.
1305 */
1306static void drain_pages(unsigned int cpu)
1307{
1308 struct zone *zone;
1309
1310 for_each_populated_zone(zone) {
1311 drain_pages_zone(cpu, zone);
1294 } 1312 }
1295} 1313}
1296 1314
1297/* 1315/*
1298 * Spill all of this CPU's per-cpu pages back into the buddy allocator. 1316 * Spill all of this CPU's per-cpu pages back into the buddy allocator.
1317 *
1318 * The CPU has to be pinned. When zone parameter is non-NULL, spill just
1319 * the single zone's pages.
1299 */ 1320 */
1300void drain_local_pages(void *arg) 1321void drain_local_pages(struct zone *zone)
1301{ 1322{
1302 drain_pages(smp_processor_id()); 1323 int cpu = smp_processor_id();
1324
1325 if (zone)
1326 drain_pages_zone(cpu, zone);
1327 else
1328 drain_pages(cpu);
1303} 1329}
1304 1330
1305/* 1331/*
1306 * Spill all the per-cpu pages from all CPUs back into the buddy allocator. 1332 * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
1307 * 1333 *
1334 * When zone parameter is non-NULL, spill just the single zone's pages.
1335 *
1308 * Note that this code is protected against sending an IPI to an offline 1336 * Note that this code is protected against sending an IPI to an offline
1309 * CPU but does not guarantee sending an IPI to newly hotplugged CPUs: 1337 * CPU but does not guarantee sending an IPI to newly hotplugged CPUs:
1310 * on_each_cpu_mask() blocks hotplug and won't talk to offlined CPUs but 1338 * on_each_cpu_mask() blocks hotplug and won't talk to offlined CPUs but
1311 * nothing keeps CPUs from showing up after we populated the cpumask and 1339 * nothing keeps CPUs from showing up after we populated the cpumask and
1312 * before the call to on_each_cpu_mask(). 1340 * before the call to on_each_cpu_mask().
1313 */ 1341 */
1314void drain_all_pages(void) 1342void drain_all_pages(struct zone *zone)
1315{ 1343{
1316 int cpu; 1344 int cpu;
1317 struct per_cpu_pageset *pcp;
1318 struct zone *zone;
1319 1345
1320 /* 1346 /*
1321 * Allocate in the BSS so we wont require allocation in 1347 * Allocate in the BSS so we wont require allocation in
@@ -1330,20 +1356,31 @@ void drain_all_pages(void)
1330 * disables preemption as part of its processing 1356 * disables preemption as part of its processing
1331 */ 1357 */
1332 for_each_online_cpu(cpu) { 1358 for_each_online_cpu(cpu) {
1359 struct per_cpu_pageset *pcp;
1360 struct zone *z;
1333 bool has_pcps = false; 1361 bool has_pcps = false;
1334 for_each_populated_zone(zone) { 1362
1363 if (zone) {
1335 pcp = per_cpu_ptr(zone->pageset, cpu); 1364 pcp = per_cpu_ptr(zone->pageset, cpu);
1336 if (pcp->pcp.count) { 1365 if (pcp->pcp.count)
1337 has_pcps = true; 1366 has_pcps = true;
1338 break; 1367 } else {
1368 for_each_populated_zone(z) {
1369 pcp = per_cpu_ptr(z->pageset, cpu);
1370 if (pcp->pcp.count) {
1371 has_pcps = true;
1372 break;
1373 }
1339 } 1374 }
1340 } 1375 }
1376
1341 if (has_pcps) 1377 if (has_pcps)
1342 cpumask_set_cpu(cpu, &cpus_with_pcps); 1378 cpumask_set_cpu(cpu, &cpus_with_pcps);
1343 else 1379 else
1344 cpumask_clear_cpu(cpu, &cpus_with_pcps); 1380 cpumask_clear_cpu(cpu, &cpus_with_pcps);
1345 } 1381 }
1346 on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1); 1382 on_each_cpu_mask(&cpus_with_pcps, (smp_call_func_t) drain_local_pages,
1383 zone, 1);
1347} 1384}
1348 1385
1349#ifdef CONFIG_HIBERNATION 1386#ifdef CONFIG_HIBERNATION
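drain_local_pages() and drain_all_pages() now take a zone: a non-NULL zone drains only that zone's per-cpu lists, which is what single-zone callers such as memory offlining, CMA and page isolation want, while NULL keeps the old drain-everything behaviour used by direct reclaim. The (smp_call_func_t) cast above is what lets the typed drain_local_pages(struct zone *) still be driven through on_each_cpu_mask(). The two call styles, as used later in this series:

	drain_all_pages(zone);	/* e.g. before offlining or isolating pages of 'zone' */
	drain_all_pages(NULL);	/* direct reclaim: flush pcplists of every populated zone */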
@@ -1705,7 +1742,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
1705 unsigned long mark, int classzone_idx, int alloc_flags, 1742 unsigned long mark, int classzone_idx, int alloc_flags,
1706 long free_pages) 1743 long free_pages)
1707{ 1744{
1708 /* free_pages my go negative - that's OK */ 1745 /* free_pages may go negative - that's OK */
1709 long min = mark; 1746 long min = mark;
1710 int o; 1747 int o;
1711 long free_cma = 0; 1748 long free_cma = 0;
@@ -2296,7 +2333,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
2296 int classzone_idx, int migratetype, enum migrate_mode mode, 2333 int classzone_idx, int migratetype, enum migrate_mode mode,
2297 int *contended_compaction, bool *deferred_compaction) 2334 int *contended_compaction, bool *deferred_compaction)
2298{ 2335{
2299 struct zone *last_compact_zone = NULL;
2300 unsigned long compact_result; 2336 unsigned long compact_result;
2301 struct page *page; 2337 struct page *page;
2302 2338
@@ -2307,7 +2343,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
2307 compact_result = try_to_compact_pages(zonelist, order, gfp_mask, 2343 compact_result = try_to_compact_pages(zonelist, order, gfp_mask,
2308 nodemask, mode, 2344 nodemask, mode,
2309 contended_compaction, 2345 contended_compaction,
2310 &last_compact_zone); 2346 alloc_flags, classzone_idx);
2311 current->flags &= ~PF_MEMALLOC; 2347 current->flags &= ~PF_MEMALLOC;
2312 2348
2313 switch (compact_result) { 2349 switch (compact_result) {
@@ -2326,10 +2362,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
2326 */ 2362 */
2327 count_vm_event(COMPACTSTALL); 2363 count_vm_event(COMPACTSTALL);
2328 2364
2329 /* Page migration frees to the PCP lists but we want merging */
2330 drain_pages(get_cpu());
2331 put_cpu();
2332
2333 page = get_page_from_freelist(gfp_mask, nodemask, 2365 page = get_page_from_freelist(gfp_mask, nodemask,
2334 order, zonelist, high_zoneidx, 2366 order, zonelist, high_zoneidx,
2335 alloc_flags & ~ALLOC_NO_WATERMARKS, 2367 alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2345,14 +2377,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
2345 } 2377 }
2346 2378
2347 /* 2379 /*
2348 * last_compact_zone is where try_to_compact_pages thought allocation
2349 * should succeed, so it did not defer compaction. But here we know
2350 * that it didn't succeed, so we do the defer.
2351 */
2352 if (last_compact_zone && mode != MIGRATE_ASYNC)
2353 defer_compaction(last_compact_zone, order);
2354
2355 /*
2356 * It's bad if compaction run occurs and fails. The most likely reason 2380 * It's bad if compaction run occurs and fails. The most likely reason
2357 * is that pages exist, but not enough to satisfy watermarks. 2381 * is that pages exist, but not enough to satisfy watermarks.
2358 */ 2382 */
@@ -2433,7 +2457,7 @@ retry:
2433 * pages are pinned on the per-cpu lists. Drain them and try again 2457 * pages are pinned on the per-cpu lists. Drain them and try again
2434 */ 2458 */
2435 if (!page && !drained) { 2459 if (!page && !drained) {
2436 drain_all_pages(); 2460 drain_all_pages(NULL);
2437 drained = true; 2461 drained = true;
2438 goto retry; 2462 goto retry;
2439 } 2463 }
@@ -3893,14 +3917,14 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
3893 else 3917 else
3894 page_group_by_mobility_disabled = 0; 3918 page_group_by_mobility_disabled = 0;
3895 3919
3896 printk("Built %i zonelists in %s order, mobility grouping %s. " 3920 pr_info("Built %i zonelists in %s order, mobility grouping %s. "
3897 "Total pages: %ld\n", 3921 "Total pages: %ld\n",
3898 nr_online_nodes, 3922 nr_online_nodes,
3899 zonelist_order_name[current_zonelist_order], 3923 zonelist_order_name[current_zonelist_order],
3900 page_group_by_mobility_disabled ? "off" : "on", 3924 page_group_by_mobility_disabled ? "off" : "on",
3901 vm_total_pages); 3925 vm_total_pages);
3902#ifdef CONFIG_NUMA 3926#ifdef CONFIG_NUMA
3903 printk("Policy zone: %s\n", zone_names[policy_zone]); 3927 pr_info("Policy zone: %s\n", zone_names[policy_zone]);
3904#endif 3928#endif
3905} 3929}
3906 3930
@@ -4832,7 +4856,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
4832#endif 4856#endif
4833 init_waitqueue_head(&pgdat->kswapd_wait); 4857 init_waitqueue_head(&pgdat->kswapd_wait);
4834 init_waitqueue_head(&pgdat->pfmemalloc_wait); 4858 init_waitqueue_head(&pgdat->pfmemalloc_wait);
4835 pgdat_page_cgroup_init(pgdat);
4836 4859
4837 for (j = 0; j < MAX_NR_ZONES; j++) { 4860 for (j = 0; j < MAX_NR_ZONES; j++) {
4838 struct zone *zone = pgdat->node_zones + j; 4861 struct zone *zone = pgdat->node_zones + j;
@@ -5334,33 +5357,33 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
5334 find_zone_movable_pfns_for_nodes(); 5357 find_zone_movable_pfns_for_nodes();
5335 5358
5336 /* Print out the zone ranges */ 5359 /* Print out the zone ranges */
5337 printk("Zone ranges:\n"); 5360 pr_info("Zone ranges:\n");
5338 for (i = 0; i < MAX_NR_ZONES; i++) { 5361 for (i = 0; i < MAX_NR_ZONES; i++) {
5339 if (i == ZONE_MOVABLE) 5362 if (i == ZONE_MOVABLE)
5340 continue; 5363 continue;
5341 printk(KERN_CONT " %-8s ", zone_names[i]); 5364 pr_info(" %-8s ", zone_names[i]);
5342 if (arch_zone_lowest_possible_pfn[i] == 5365 if (arch_zone_lowest_possible_pfn[i] ==
5343 arch_zone_highest_possible_pfn[i]) 5366 arch_zone_highest_possible_pfn[i])
5344 printk(KERN_CONT "empty\n"); 5367 pr_cont("empty\n");
5345 else 5368 else
5346 printk(KERN_CONT "[mem %0#10lx-%0#10lx]\n", 5369 pr_cont("[mem %0#10lx-%0#10lx]\n",
5347 arch_zone_lowest_possible_pfn[i] << PAGE_SHIFT, 5370 arch_zone_lowest_possible_pfn[i] << PAGE_SHIFT,
5348 (arch_zone_highest_possible_pfn[i] 5371 (arch_zone_highest_possible_pfn[i]
5349 << PAGE_SHIFT) - 1); 5372 << PAGE_SHIFT) - 1);
5350 } 5373 }
5351 5374
5352 /* Print out the PFNs ZONE_MOVABLE begins at in each node */ 5375 /* Print out the PFNs ZONE_MOVABLE begins at in each node */
5353 printk("Movable zone start for each node\n"); 5376 pr_info("Movable zone start for each node\n");
5354 for (i = 0; i < MAX_NUMNODES; i++) { 5377 for (i = 0; i < MAX_NUMNODES; i++) {
5355 if (zone_movable_pfn[i]) 5378 if (zone_movable_pfn[i])
5356 printk(" Node %d: %#010lx\n", i, 5379 pr_info(" Node %d: %#010lx\n", i,
5357 zone_movable_pfn[i] << PAGE_SHIFT); 5380 zone_movable_pfn[i] << PAGE_SHIFT);
5358 } 5381 }
5359 5382
5360 /* Print out the early node map */ 5383 /* Print out the early node map */
5361 printk("Early memory node ranges\n"); 5384 pr_info("Early memory node ranges\n");
5362 for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) 5385 for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
5363 printk(" node %3d: [mem %#010lx-%#010lx]\n", nid, 5386 pr_info(" node %3d: [mem %#010lx-%#010lx]\n", nid,
5364 start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1); 5387 start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
5365 5388
5366 /* Initialise every node */ 5389 /* Initialise every node */
@@ -5496,7 +5519,7 @@ void __init mem_init_print_info(const char *str)
5496 5519
5497#undef adj_init_size 5520#undef adj_init_size
5498 5521
5499 printk("Memory: %luK/%luK available " 5522 pr_info("Memory: %luK/%luK available "
5500 "(%luK kernel code, %luK rwdata, %luK rodata, " 5523 "(%luK kernel code, %luK rwdata, %luK rodata, "
5501 "%luK init, %luK bss, %luK reserved" 5524 "%luK init, %luK bss, %luK reserved"
5502#ifdef CONFIG_HIGHMEM 5525#ifdef CONFIG_HIGHMEM
@@ -6385,7 +6408,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
6385 */ 6408 */
6386 6409
6387 lru_add_drain_all(); 6410 lru_add_drain_all();
6388 drain_all_pages(); 6411 drain_all_pages(cc.zone);
6389 6412
6390 order = 0; 6413 order = 0;
6391 outer_start = start; 6414 outer_start = start;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
deleted file mode 100644
index 5331c2bd85a2..000000000000
--- a/mm/page_cgroup.c
+++ /dev/null
@@ -1,530 +0,0 @@
1#include <linux/mm.h>
2#include <linux/mmzone.h>
3#include <linux/bootmem.h>
4#include <linux/bit_spinlock.h>
5#include <linux/page_cgroup.h>
6#include <linux/hash.h>
7#include <linux/slab.h>
8#include <linux/memory.h>
9#include <linux/vmalloc.h>
10#include <linux/cgroup.h>
11#include <linux/swapops.h>
12#include <linux/kmemleak.h>
13
14static unsigned long total_usage;
15
16#if !defined(CONFIG_SPARSEMEM)
17
18
19void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
20{
21 pgdat->node_page_cgroup = NULL;
22}
23
24struct page_cgroup *lookup_page_cgroup(struct page *page)
25{
26 unsigned long pfn = page_to_pfn(page);
27 unsigned long offset;
28 struct page_cgroup *base;
29
30 base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
31#ifdef CONFIG_DEBUG_VM
32 /*
33 * The sanity checks the page allocator does upon freeing a
34 * page can reach here before the page_cgroup arrays are
35 * allocated when feeding a range of pages to the allocator
36 * for the first time during bootup or memory hotplug.
37 */
38 if (unlikely(!base))
39 return NULL;
40#endif
41 offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
42 return base + offset;
43}
44
45static int __init alloc_node_page_cgroup(int nid)
46{
47 struct page_cgroup *base;
48 unsigned long table_size;
49 unsigned long nr_pages;
50
51 nr_pages = NODE_DATA(nid)->node_spanned_pages;
52 if (!nr_pages)
53 return 0;
54
55 table_size = sizeof(struct page_cgroup) * nr_pages;
56
57 base = memblock_virt_alloc_try_nid_nopanic(
58 table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
59 BOOTMEM_ALLOC_ACCESSIBLE, nid);
60 if (!base)
61 return -ENOMEM;
62 NODE_DATA(nid)->node_page_cgroup = base;
63 total_usage += table_size;
64 return 0;
65}
66
67void __init page_cgroup_init_flatmem(void)
68{
69
70 int nid, fail;
71
72 if (mem_cgroup_disabled())
73 return;
74
75 for_each_online_node(nid) {
76 fail = alloc_node_page_cgroup(nid);
77 if (fail)
78 goto fail;
79 }
80 printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
81 printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
82 " don't want memory cgroups\n");
83 return;
84fail:
85 printk(KERN_CRIT "allocation of page_cgroup failed.\n");
86 printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
87 panic("Out of memory");
88}
89
90#else /* CONFIG_FLAT_NODE_MEM_MAP */
91
92struct page_cgroup *lookup_page_cgroup(struct page *page)
93{
94 unsigned long pfn = page_to_pfn(page);
95 struct mem_section *section = __pfn_to_section(pfn);
96#ifdef CONFIG_DEBUG_VM
97 /*
98 * The sanity checks the page allocator does upon freeing a
99 * page can reach here before the page_cgroup arrays are
100 * allocated when feeding a range of pages to the allocator
101 * for the first time during bootup or memory hotplug.
102 */
103 if (!section->page_cgroup)
104 return NULL;
105#endif
106 return section->page_cgroup + pfn;
107}
108
109static void *__meminit alloc_page_cgroup(size_t size, int nid)
110{
111 gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
112 void *addr = NULL;
113
114 addr = alloc_pages_exact_nid(nid, size, flags);
115 if (addr) {
116 kmemleak_alloc(addr, size, 1, flags);
117 return addr;
118 }
119
120 if (node_state(nid, N_HIGH_MEMORY))
121 addr = vzalloc_node(size, nid);
122 else
123 addr = vzalloc(size);
124
125 return addr;
126}
127
128static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
129{
130 struct mem_section *section;
131 struct page_cgroup *base;
132 unsigned long table_size;
133
134 section = __pfn_to_section(pfn);
135
136 if (section->page_cgroup)
137 return 0;
138
139 table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
140 base = alloc_page_cgroup(table_size, nid);
141
142 /*
143 * The value stored in section->page_cgroup is (base - pfn)
144 * and it does not point to the memory block allocated above,
145 * causing kmemleak false positives.
146 */
147 kmemleak_not_leak(base);
148
149 if (!base) {
150 printk(KERN_ERR "page cgroup allocation failure\n");
151 return -ENOMEM;
152 }
153
154 /*
155 * The passed "pfn" may not be aligned to SECTION. For the calculation
156 * we need to apply a mask.
157 */
158 pfn &= PAGE_SECTION_MASK;
159 section->page_cgroup = base - pfn;
160 total_usage += table_size;
161 return 0;
162}
163#ifdef CONFIG_MEMORY_HOTPLUG
164static void free_page_cgroup(void *addr)
165{
166 if (is_vmalloc_addr(addr)) {
167 vfree(addr);
168 } else {
169 struct page *page = virt_to_page(addr);
170 size_t table_size =
171 sizeof(struct page_cgroup) * PAGES_PER_SECTION;
172
173 BUG_ON(PageReserved(page));
174 kmemleak_free(addr);
175 free_pages_exact(addr, table_size);
176 }
177}
178
179static void __free_page_cgroup(unsigned long pfn)
180{
181 struct mem_section *ms;
182 struct page_cgroup *base;
183
184 ms = __pfn_to_section(pfn);
185 if (!ms || !ms->page_cgroup)
186 return;
187 base = ms->page_cgroup + pfn;
188 free_page_cgroup(base);
189 ms->page_cgroup = NULL;
190}
191
192static int __meminit online_page_cgroup(unsigned long start_pfn,
193 unsigned long nr_pages,
194 int nid)
195{
196 unsigned long start, end, pfn;
197 int fail = 0;
198
199 start = SECTION_ALIGN_DOWN(start_pfn);
200 end = SECTION_ALIGN_UP(start_pfn + nr_pages);
201
202 if (nid == -1) {
203 /*
204 * In this case, "nid" already exists and contains valid memory.
205 * "start_pfn" passed to us is a pfn which is an arg for
206 * online__pages(), and start_pfn should exist.
207 */
208 nid = pfn_to_nid(start_pfn);
209 VM_BUG_ON(!node_state(nid, N_ONLINE));
210 }
211
212 for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
213 if (!pfn_present(pfn))
214 continue;
215 fail = init_section_page_cgroup(pfn, nid);
216 }
217 if (!fail)
218 return 0;
219
220 /* rollback */
221 for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
222 __free_page_cgroup(pfn);
223
224 return -ENOMEM;
225}
226
227static int __meminit offline_page_cgroup(unsigned long start_pfn,
228 unsigned long nr_pages, int nid)
229{
230 unsigned long start, end, pfn;
231
232 start = SECTION_ALIGN_DOWN(start_pfn);
233 end = SECTION_ALIGN_UP(start_pfn + nr_pages);
234
235 for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
236 __free_page_cgroup(pfn);
237 return 0;
238
239}
240
241static int __meminit page_cgroup_callback(struct notifier_block *self,
242 unsigned long action, void *arg)
243{
244 struct memory_notify *mn = arg;
245 int ret = 0;
246 switch (action) {
247 case MEM_GOING_ONLINE:
248 ret = online_page_cgroup(mn->start_pfn,
249 mn->nr_pages, mn->status_change_nid);
250 break;
251 case MEM_OFFLINE:
252 offline_page_cgroup(mn->start_pfn,
253 mn->nr_pages, mn->status_change_nid);
254 break;
255 case MEM_CANCEL_ONLINE:
256 offline_page_cgroup(mn->start_pfn,
257 mn->nr_pages, mn->status_change_nid);
258 break;
259 case MEM_GOING_OFFLINE:
260 break;
261 case MEM_ONLINE:
262 case MEM_CANCEL_OFFLINE:
263 break;
264 }
265
266 return notifier_from_errno(ret);
267}
268
269#endif
270
271void __init page_cgroup_init(void)
272{
273 unsigned long pfn;
274 int nid;
275
276 if (mem_cgroup_disabled())
277 return;
278
279 for_each_node_state(nid, N_MEMORY) {
280 unsigned long start_pfn, end_pfn;
281
282 start_pfn = node_start_pfn(nid);
283 end_pfn = node_end_pfn(nid);
284 /*
285 * start_pfn and end_pfn may not be aligned to SECTION and the
286 * page->flags of out of node pages are not initialized. So we
287 * scan [start_pfn, the biggest section's pfn < end_pfn) here.
288 */
289 for (pfn = start_pfn;
290 pfn < end_pfn;
291 pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
292
293 if (!pfn_valid(pfn))
294 continue;
295 /*
296 * Nodes's pfns can be overlapping.
297 * We know some arch can have a nodes layout such as
298 * -------------pfn-------------->
299 * N0 | N1 | N2 | N0 | N1 | N2|....
300 */
301 if (pfn_to_nid(pfn) != nid)
302 continue;
303 if (init_section_page_cgroup(pfn, nid))
304 goto oom;
305 }
306 }
307 hotplug_memory_notifier(page_cgroup_callback, 0);
308 printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
309 printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
310 "don't want memory cgroups\n");
311 return;
312oom:
313 printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
314 panic("Out of memory");
315}
316
317void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
318{
319 return;
320}
321
322#endif
323
324
325#ifdef CONFIG_MEMCG_SWAP
326
327static DEFINE_MUTEX(swap_cgroup_mutex);
328struct swap_cgroup_ctrl {
329 struct page **map;
330 unsigned long length;
331 spinlock_t lock;
332};
333
334static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
335
336struct swap_cgroup {
337 unsigned short id;
338};
339#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
340
341/*
342 * SwapCgroup implements "lookup" and "exchange" operations.
343 * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
344 * against SwapCache. At swap_free(), this is accessed directly from swap.
345 *
346 * This means,
347 * - we have no race in "exchange" when we're accessed via SwapCache because
348 * SwapCache(and its swp_entry) is under lock.
349 * - When called via swap_free(), there is no user of this entry and no race.
350 * Then, we don't need lock around "exchange".
351 *
352 * TODO: we can push these buffers out to HIGHMEM.
353 */
354
355/*
356 * allocate buffer for swap_cgroup.
357 */
358static int swap_cgroup_prepare(int type)
359{
360 struct page *page;
361 struct swap_cgroup_ctrl *ctrl;
362 unsigned long idx, max;
363
364 ctrl = &swap_cgroup_ctrl[type];
365
366 for (idx = 0; idx < ctrl->length; idx++) {
367 page = alloc_page(GFP_KERNEL | __GFP_ZERO);
368 if (!page)
369 goto not_enough_page;
370 ctrl->map[idx] = page;
371 }
372 return 0;
373not_enough_page:
374 max = idx;
375 for (idx = 0; idx < max; idx++)
376 __free_page(ctrl->map[idx]);
377
378 return -ENOMEM;
379}
380
381static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
382 struct swap_cgroup_ctrl **ctrlp)
383{
384 pgoff_t offset = swp_offset(ent);
385 struct swap_cgroup_ctrl *ctrl;
386 struct page *mappage;
387 struct swap_cgroup *sc;
388
389 ctrl = &swap_cgroup_ctrl[swp_type(ent)];
390 if (ctrlp)
391 *ctrlp = ctrl;
392
393 mappage = ctrl->map[offset / SC_PER_PAGE];
394 sc = page_address(mappage);
395 return sc + offset % SC_PER_PAGE;
396}
397
398/**
399 * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
400 * @ent: swap entry to be cmpxchged
401 * @old: old id
402 * @new: new id
403 *
404 * Returns old id at success, 0 at failure.
405 * (There is no mem_cgroup using 0 as its id)
406 */
407unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
408 unsigned short old, unsigned short new)
409{
410 struct swap_cgroup_ctrl *ctrl;
411 struct swap_cgroup *sc;
412 unsigned long flags;
413 unsigned short retval;
414
415 sc = lookup_swap_cgroup(ent, &ctrl);
416
417 spin_lock_irqsave(&ctrl->lock, flags);
418 retval = sc->id;
419 if (retval == old)
420 sc->id = new;
421 else
422 retval = 0;
423 spin_unlock_irqrestore(&ctrl->lock, flags);
424 return retval;
425}
426
427/**
428 * swap_cgroup_record - record mem_cgroup for this swp_entry.
429 * @ent: swap entry to be recorded into
430 * @id: mem_cgroup to be recorded
431 *
432 * Returns old value at success, 0 at failure.
433 * (Of course, old value can be 0.)
434 */
435unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
436{
437 struct swap_cgroup_ctrl *ctrl;
438 struct swap_cgroup *sc;
439 unsigned short old;
440 unsigned long flags;
441
442 sc = lookup_swap_cgroup(ent, &ctrl);
443
444 spin_lock_irqsave(&ctrl->lock, flags);
445 old = sc->id;
446 sc->id = id;
447 spin_unlock_irqrestore(&ctrl->lock, flags);
448
449 return old;
450}
451
452/**
453 * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
454 * @ent: swap entry to be looked up.
455 *
456 * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
457 */
458unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
459{
460 return lookup_swap_cgroup(ent, NULL)->id;
461}
462
463int swap_cgroup_swapon(int type, unsigned long max_pages)
464{
465 void *array;
466 unsigned long array_size;
467 unsigned long length;
468 struct swap_cgroup_ctrl *ctrl;
469
470 if (!do_swap_account)
471 return 0;
472
473 length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
474 array_size = length * sizeof(void *);
475
476 array = vzalloc(array_size);
477 if (!array)
478 goto nomem;
479
480 ctrl = &swap_cgroup_ctrl[type];
481 mutex_lock(&swap_cgroup_mutex);
482 ctrl->length = length;
483 ctrl->map = array;
484 spin_lock_init(&ctrl->lock);
485 if (swap_cgroup_prepare(type)) {
486 /* memory shortage */
487 ctrl->map = NULL;
488 ctrl->length = 0;
489 mutex_unlock(&swap_cgroup_mutex);
490 vfree(array);
491 goto nomem;
492 }
493 mutex_unlock(&swap_cgroup_mutex);
494
495 return 0;
496nomem:
497 printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
498 printk(KERN_INFO
499 "swap_cgroup can be disabled by swapaccount=0 boot option\n");
500 return -ENOMEM;
501}
502
503void swap_cgroup_swapoff(int type)
504{
505 struct page **map;
506 unsigned long i, length;
507 struct swap_cgroup_ctrl *ctrl;
508
509 if (!do_swap_account)
510 return;
511
512 mutex_lock(&swap_cgroup_mutex);
513 ctrl = &swap_cgroup_ctrl[type];
514 map = ctrl->map;
515 length = ctrl->length;
516 ctrl->map = NULL;
517 ctrl->length = 0;
518 mutex_unlock(&swap_cgroup_mutex);
519
520 if (map) {
521 for (i = 0; i < length; i++) {
522 struct page *page = map[i];
523 if (page)
524 __free_page(page);
525 }
526 vfree(map);
527 }
528}
529
530#endif
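For scale: struct swap_cgroup is a single unsigned short, so SC_PER_PAGE is PAGE_SIZE / sizeof(unsigned short) = 2048 entries per map page with 4 KiB pages. A 4 GiB swap device therefore has 1,048,576 slots and needs DIV_ROUND_UP(1048576, 2048) = 512 map pages, about 2 MiB, plus a 4 KiB array of page pointers, allocated at swapon and freed at swapoff. (This swap-ownership code is not going away; it reappears in the new mm/swap_cgroup.c below.)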
diff --git a/mm/page_counter.c b/mm/page_counter.c
new file mode 100644
index 000000000000..a009574fbba9
--- /dev/null
+++ b/mm/page_counter.c
@@ -0,0 +1,192 @@
1/*
2 * Lockless hierarchical page accounting & limiting
3 *
4 * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner
5 */
6
7#include <linux/page_counter.h>
8#include <linux/atomic.h>
9#include <linux/kernel.h>
10#include <linux/string.h>
11#include <linux/sched.h>
12#include <linux/bug.h>
13#include <asm/page.h>
14
15/**
16 * page_counter_cancel - take pages out of the local counter
17 * @counter: counter
18 * @nr_pages: number of pages to cancel
19 */
20void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
21{
22 long new;
23
24 new = atomic_long_sub_return(nr_pages, &counter->count);
25 /* More uncharges than charges? */
26 WARN_ON_ONCE(new < 0);
27}
28
29/**
30 * page_counter_charge - hierarchically charge pages
31 * @counter: counter
32 * @nr_pages: number of pages to charge
33 *
34 * NOTE: This does not consider any configured counter limits.
35 */
36void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
37{
38 struct page_counter *c;
39
40 for (c = counter; c; c = c->parent) {
41 long new;
42
43 new = atomic_long_add_return(nr_pages, &c->count);
44 /*
45 * This is indeed racy, but we can live with some
46 * inaccuracy in the watermark.
47 */
48 if (new > c->watermark)
49 c->watermark = new;
50 }
51}
52
53/**
54 * page_counter_try_charge - try to hierarchically charge pages
55 * @counter: counter
56 * @nr_pages: number of pages to charge
57 * @fail: points first counter to hit its limit, if any
58 *
59 * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
60 * its ancestors has hit its configured limit.
61 */
62int page_counter_try_charge(struct page_counter *counter,
63 unsigned long nr_pages,
64 struct page_counter **fail)
65{
66 struct page_counter *c;
67
68 for (c = counter; c; c = c->parent) {
69 long new;
70 /*
71 * Charge speculatively to avoid an expensive CAS. If
72 * a bigger charge fails, it might falsely lock out a
73 * racing smaller charge and send it into reclaim
74 * early, but the error is limited to the difference
75 * between the two sizes, which is less than 2M/4M in
76 * case of a THP locking out a regular page charge.
77 *
78 * The atomic_long_add_return() implies a full memory
79 * barrier between incrementing the count and reading
80 * the limit. When racing with page_counter_limit(),
81 * we either see the new limit or the setter sees the
82 * counter has changed and retries.
83 */
84 new = atomic_long_add_return(nr_pages, &c->count);
85 if (new > c->limit) {
86 atomic_long_sub(nr_pages, &c->count);
87 /*
88 * This is racy, but we can live with some
89 * inaccuracy in the failcnt.
90 */
91 c->failcnt++;
92 *fail = c;
93 goto failed;
94 }
95 /*
96 * Just like with failcnt, we can live with some
97 * inaccuracy in the watermark.
98 */
99 if (new > c->watermark)
100 c->watermark = new;
101 }
102 return 0;
103
104failed:
105 for (c = counter; c != *fail; c = c->parent)
106 page_counter_cancel(c, nr_pages);
107
108 return -ENOMEM;
109}
110
111/**
112 * page_counter_uncharge - hierarchically uncharge pages
113 * @counter: counter
114 * @nr_pages: number of pages to uncharge
115 */
116void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
117{
118 struct page_counter *c;
119
120 for (c = counter; c; c = c->parent)
121 page_counter_cancel(c, nr_pages);
122}
123
124/**
125 * page_counter_limit - limit the number of pages allowed
126 * @counter: counter
127 * @limit: limit to set
128 *
129 * Returns 0 on success, -EBUSY if the current number of pages on the
130 * counter already exceeds the specified limit.
131 *
132 * The caller must serialize invocations on the same counter.
133 */
134int page_counter_limit(struct page_counter *counter, unsigned long limit)
135{
136 for (;;) {
137 unsigned long old;
138 long count;
139
140 /*
141 * Update the limit while making sure that it's not
142 * below the concurrently-changing counter value.
143 *
144 * The xchg implies two full memory barriers before
145 * and after, so the read-swap-read is ordered and
146 * ensures coherency with page_counter_try_charge():
147 * that function modifies the count before checking
148 * the limit, so if it sees the old limit, we see the
149 * modified counter and retry.
150 */
151 count = atomic_long_read(&counter->count);
152
153 if (count > limit)
154 return -EBUSY;
155
156 old = xchg(&counter->limit, limit);
157
158 if (atomic_long_read(&counter->count) <= count)
159 return 0;
160
161 counter->limit = old;
162 cond_resched();
163 }
164}
165
166/**
167 * page_counter_memparse - memparse() for page counter limits
168 * @buf: string to parse
169 * @nr_pages: returns the result in number of pages
170 *
171 * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be
172 * limited to %PAGE_COUNTER_MAX.
173 */
174int page_counter_memparse(const char *buf, unsigned long *nr_pages)
175{
176 char unlimited[] = "-1";
177 char *end;
178 u64 bytes;
179
180 if (!strncmp(buf, unlimited, sizeof(unlimited))) {
181 *nr_pages = PAGE_COUNTER_MAX;
182 return 0;
183 }
184
185 bytes = memparse(buf, &end);
186 if (*end != '\0')
187 return -EINVAL;
188
189 *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
190
191 return 0;
192}
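page_counter replaces res_counter's spinlock-protected, byte-granular counters with lockless, page-granular atomics: charges propagate up the parent chain, try_charge backs out and reports the ancestor that hit its limit, and limit changes use xchg plus a re-check to stay coherent with concurrent charges. A charging site is expected to look roughly like this (a sketch based on the API above; the surrounding memcg plumbing is omitted):

	struct page_counter *fail;
	unsigned long limit;

	if (page_counter_try_charge(&memcg->memory, nr_pages, &fail)) {
		/* 'fail' is the ancestor whose limit was hit; reclaim or bail */
		return -ENOMEM;
	}
	/* ... use the pages ... */
	page_counter_uncharge(&memcg->memory, nr_pages);

	/* limits are stored in pages but configured as byte strings */
	if (!page_counter_memparse(buf, &limit))
		page_counter_limit(&memcg->memory, limit);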
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c8778f7e208e..72f5ac381ab3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -68,7 +68,7 @@ out:
68 68
69 spin_unlock_irqrestore(&zone->lock, flags); 69 spin_unlock_irqrestore(&zone->lock, flags);
70 if (!ret) 70 if (!ret)
71 drain_all_pages(); 71 drain_all_pages(zone);
72 return ret; 72 return ret;
73} 73}
74 74
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e4c7213210c..45eba36fd673 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1053,7 +1053,7 @@ void page_add_file_rmap(struct page *page)
1053 __inc_zone_page_state(page, NR_FILE_MAPPED); 1053 __inc_zone_page_state(page, NR_FILE_MAPPED);
1054 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); 1054 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
1055 } 1055 }
1056 mem_cgroup_end_page_stat(memcg, locked, flags); 1056 mem_cgroup_end_page_stat(memcg, &locked, &flags);
1057} 1057}
1058 1058
1059static void page_remove_file_rmap(struct page *page) 1059static void page_remove_file_rmap(struct page *page)
@@ -1083,7 +1083,7 @@ static void page_remove_file_rmap(struct page *page)
1083 if (unlikely(PageMlocked(page))) 1083 if (unlikely(PageMlocked(page)))
1084 clear_page_mlock(page); 1084 clear_page_mlock(page);
1085out: 1085out:
1086 mem_cgroup_end_page_stat(memcg, locked, flags); 1086 mem_cgroup_end_page_stat(memcg, &locked, &flags);
1087} 1087}
1088 1088
1089/** 1089/**
diff --git a/mm/slab.c b/mm/slab.c
index f34e053ec46e..79e15f0a2a6e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2590,7 +2590,10 @@ static int cache_grow(struct kmem_cache *cachep,
2590 * Be lazy and only check for valid flags here, keeping it out of the 2590 * Be lazy and only check for valid flags here, keeping it out of the
2591 * critical path in kmem_cache_alloc(). 2591 * critical path in kmem_cache_alloc().
2592 */ 2592 */
2593 BUG_ON(flags & GFP_SLAB_BUG_MASK); 2593 if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
2594 pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
2595 BUG();
2596 }
2594 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); 2597 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
2595 2598
2596 /* Take the node list lock to change the colour_next on this node */ 2599 /* Take the node list lock to change the colour_next on this node */
@@ -3580,11 +3583,11 @@ static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
3580 3583
3581 for_each_online_node(node) { 3584 for_each_online_node(node) {
3582 3585
3583 if (use_alien_caches) { 3586 if (use_alien_caches) {
3584 new_alien = alloc_alien_cache(node, cachep->limit, gfp); 3587 new_alien = alloc_alien_cache(node, cachep->limit, gfp);
3585 if (!new_alien) 3588 if (!new_alien)
3586 goto fail; 3589 goto fail;
3587 } 3590 }
3588 3591
3589 new_shared = NULL; 3592 new_shared = NULL;
3590 if (cachep->shared) { 3593 if (cachep->shared) {
@@ -4043,12 +4046,6 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
4043 4046
4044#ifdef CONFIG_DEBUG_SLAB_LEAK 4047#ifdef CONFIG_DEBUG_SLAB_LEAK
4045 4048
4046static void *leaks_start(struct seq_file *m, loff_t *pos)
4047{
4048 mutex_lock(&slab_mutex);
4049 return seq_list_start(&slab_caches, *pos);
4050}
4051
4052static inline int add_caller(unsigned long *n, unsigned long v) 4049static inline int add_caller(unsigned long *n, unsigned long v)
4053{ 4050{
4054 unsigned long *p; 4051 unsigned long *p;
@@ -4170,7 +4167,7 @@ static int leaks_show(struct seq_file *m, void *p)
4170} 4167}
4171 4168
4172static const struct seq_operations slabstats_op = { 4169static const struct seq_operations slabstats_op = {
4173 .start = leaks_start, 4170 .start = slab_start,
4174 .next = slab_next, 4171 .next = slab_next,
4175 .stop = slab_stop, 4172 .stop = slab_stop,
4176 .show = leaks_show, 4173 .show = leaks_show,
diff --git a/mm/slab.h b/mm/slab.h
index ab019e63e3c2..1cf4005482dd 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -209,15 +209,15 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
209 209
210 rcu_read_lock(); 210 rcu_read_lock();
211 params = rcu_dereference(s->memcg_params); 211 params = rcu_dereference(s->memcg_params);
212 cachep = params->memcg_caches[idx];
213 rcu_read_unlock();
214 212
215 /* 213 /*
216 * Make sure we will access the up-to-date value. The code updating 214 * Make sure we will access the up-to-date value. The code updating
217 * memcg_caches issues a write barrier to match this (see 215 * memcg_caches issues a write barrier to match this (see
218 * memcg_register_cache()). 216 * memcg_register_cache()).
219 */ 217 */
220 smp_read_barrier_depends(); 218 cachep = lockless_dereference(params->memcg_caches[idx]);
219 rcu_read_unlock();
220
221 return cachep; 221 return cachep;
222} 222}
223 223
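Replacing the bare read and trailing smp_read_barrier_depends() with lockless_dereference() puts the data-dependency barrier where it belongs: on the load of the pointer, before any dereference and before the RCU read-side unlock, rather than after both. The generic pattern, with the publisher side assumed (cf. the write barrier referenced in the comment):

	rcu_read_lock();
	p = lockless_dereference(gp);	/* load + data-dependency barrier */
	if (p)
		val = p->field;		/* dereference ordered after the load */
	rcu_read_unlock();

	/* publisher (assumed counterpart, e.g. memcg_register_cache()) */
	newp->field = value;
	smp_wmb();
	gp = newp;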
@@ -357,7 +357,9 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
357 357
358#endif 358#endif
359 359
360void *slab_start(struct seq_file *m, loff_t *pos);
360void *slab_next(struct seq_file *m, void *p, loff_t *pos); 361void *slab_next(struct seq_file *m, void *p, loff_t *pos);
361void slab_stop(struct seq_file *m, void *p); 362void slab_stop(struct seq_file *m, void *p);
363int memcg_slab_show(struct seq_file *m, void *p);
362 364
363#endif /* MM_SLAB_H */ 365#endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index dcdab81bd240..e03dd6f2a272 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -240,7 +240,7 @@ struct kmem_cache *find_mergeable(size_t size, size_t align,
240 size = ALIGN(size, align); 240 size = ALIGN(size, align);
241 flags = kmem_cache_flags(size, flags, name, NULL); 241 flags = kmem_cache_flags(size, flags, name, NULL);
242 242
243 list_for_each_entry(s, &slab_caches, list) { 243 list_for_each_entry_reverse(s, &slab_caches, list) {
244 if (slab_unmergeable(s)) 244 if (slab_unmergeable(s))
245 continue; 245 continue;
246 246
@@ -811,7 +811,7 @@ EXPORT_SYMBOL(kmalloc_order_trace);
811#define SLABINFO_RIGHTS S_IRUSR 811#define SLABINFO_RIGHTS S_IRUSR
812#endif 812#endif
813 813
814void print_slabinfo_header(struct seq_file *m) 814static void print_slabinfo_header(struct seq_file *m)
815{ 815{
816 /* 816 /*
817 * Output format version, so at least we can change it 817 * Output format version, so at least we can change it
@@ -834,14 +834,9 @@ void print_slabinfo_header(struct seq_file *m)
834 seq_putc(m, '\n'); 834 seq_putc(m, '\n');
835} 835}
836 836
837static void *s_start(struct seq_file *m, loff_t *pos) 837void *slab_start(struct seq_file *m, loff_t *pos)
838{ 838{
839 loff_t n = *pos;
840
841 mutex_lock(&slab_mutex); 839 mutex_lock(&slab_mutex);
842 if (!n)
843 print_slabinfo_header(m);
844
845 return seq_list_start(&slab_caches, *pos); 840 return seq_list_start(&slab_caches, *pos);
846} 841}
847 842
@@ -881,7 +876,7 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
881 } 876 }
882} 877}
883 878
884int cache_show(struct kmem_cache *s, struct seq_file *m) 879static void cache_show(struct kmem_cache *s, struct seq_file *m)
885{ 880{
886 struct slabinfo sinfo; 881 struct slabinfo sinfo;
887 882
@@ -900,17 +895,32 @@ int cache_show(struct kmem_cache *s, struct seq_file *m)
900 sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail); 895 sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
901 slabinfo_show_stats(m, s); 896 slabinfo_show_stats(m, s);
902 seq_putc(m, '\n'); 897 seq_putc(m, '\n');
898}
899
900static int slab_show(struct seq_file *m, void *p)
901{
902 struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
903
904 if (p == slab_caches.next)
905 print_slabinfo_header(m);
906 if (is_root_cache(s))
907 cache_show(s, m);
903 return 0; 908 return 0;
904} 909}
905 910
906static int s_show(struct seq_file *m, void *p) 911#ifdef CONFIG_MEMCG_KMEM
912int memcg_slab_show(struct seq_file *m, void *p)
907{ 913{
908 struct kmem_cache *s = list_entry(p, struct kmem_cache, list); 914 struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
915 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
909 916
910 if (!is_root_cache(s)) 917 if (p == slab_caches.next)
911 return 0; 918 print_slabinfo_header(m);
912 return cache_show(s, m); 919 if (!is_root_cache(s) && s->memcg_params->memcg == memcg)
920 cache_show(s, m);
921 return 0;
913} 922}
923#endif
914 924
915/* 925/*
916 * slabinfo_op - iterator that generates /proc/slabinfo 926 * slabinfo_op - iterator that generates /proc/slabinfo
@@ -926,10 +936,10 @@ static int s_show(struct seq_file *m, void *p)
926 * + further values on SMP and with statistics enabled 936 * + further values on SMP and with statistics enabled
927 */ 937 */
928static const struct seq_operations slabinfo_op = { 938static const struct seq_operations slabinfo_op = {
929 .start = s_start, 939 .start = slab_start,
930 .next = slab_next, 940 .next = slab_next,
931 .stop = slab_stop, 941 .stop = slab_stop,
932 .show = s_show, 942 .show = slab_show,
933}; 943};
934 944
935static int slabinfo_open(struct inode *inode, struct file *file) 945static int slabinfo_open(struct inode *inode, struct file *file)
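The slab_common.c change above moves the /proc/slabinfo header out of the seq_file start callback: slab_show() and the new memcg_slab_show() print it only when the element being shown is the first entry of slab_caches (the "p == slab_caches.next" test). A minimal userspace sketch of that header-on-first-element pattern follows; the struct, array, and function names are hypothetical stand-ins, not the kernel's seq_file API.

#include <stdio.h>

/* Hypothetical stand-in for a kmem_cache list entry. */
struct cache { const char *name; unsigned long active_objs; };

static struct cache caches[] = {
        { "kmalloc-64",  1024 },
        { "kmalloc-128",  512 },
};

/* Analogue of slab_show(): emit the header only for the first element,
 * so a restart at a non-zero position never repeats it. */
static int cache_show(const struct cache *c)
{
        if (c == &caches[0])
                printf("# name            <active_objs>\n");
        printf("%-17s %lu\n", c->name, c->active_objs);
        return 0;
}

int main(void)
{
        for (unsigned long i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
                cache_show(&caches[i]);
        return 0;
}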
diff --git a/mm/slub.c b/mm/slub.c
index ae7b9f1ad394..386bbed76e94 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -849,12 +849,12 @@ static int check_slab(struct kmem_cache *s, struct page *page)
849 maxobj = order_objects(compound_order(page), s->size, s->reserved); 849 maxobj = order_objects(compound_order(page), s->size, s->reserved);
850 if (page->objects > maxobj) { 850 if (page->objects > maxobj) {
851 slab_err(s, page, "objects %u > max %u", 851 slab_err(s, page, "objects %u > max %u",
852 s->name, page->objects, maxobj); 852 page->objects, maxobj);
853 return 0; 853 return 0;
854 } 854 }
855 if (page->inuse > page->objects) { 855 if (page->inuse > page->objects) {
856 slab_err(s, page, "inuse %u > max %u", 856 slab_err(s, page, "inuse %u > max %u",
857 s->name, page->inuse, page->objects); 857 page->inuse, page->objects);
858 return 0; 858 return 0;
859 } 859 }
860 /* Slab_pad_check fixes things up after itself */ 860 /* Slab_pad_check fixes things up after itself */
@@ -871,7 +871,7 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
871 int nr = 0; 871 int nr = 0;
872 void *fp; 872 void *fp;
873 void *object = NULL; 873 void *object = NULL;
874 unsigned long max_objects; 874 int max_objects;
875 875
876 fp = page->freelist; 876 fp = page->freelist;
877 while (fp && nr <= page->objects) { 877 while (fp && nr <= page->objects) {
@@ -1377,7 +1377,10 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
1377 int order; 1377 int order;
1378 int idx; 1378 int idx;
1379 1379
1380 BUG_ON(flags & GFP_SLAB_BUG_MASK); 1380 if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
1381 pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
1382 BUG();
1383 }
1381 1384
1382 page = allocate_slab(s, 1385 page = allocate_slab(s,
1383 flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node); 1386 flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
@@ -2554,7 +2557,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
2554 2557
2555 } else { /* Needs to be taken off a list */ 2558 } else { /* Needs to be taken off a list */
2556 2559
2557 n = get_node(s, page_to_nid(page)); 2560 n = get_node(s, page_to_nid(page));
2558 /* 2561 /*
2559 * Speculatively acquire the list_lock. 2562 * Speculatively acquire the list_lock.
2560 * If the cmpxchg does not succeed then we may 2563 * If the cmpxchg does not succeed then we may
@@ -2587,10 +2590,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
2587 * The list lock was not taken therefore no list 2590 * The list lock was not taken therefore no list
2588 * activity can be necessary. 2591 * activity can be necessary.
2589 */ 2592 */
2590 if (was_frozen) 2593 if (was_frozen)
2591 stat(s, FREE_FROZEN); 2594 stat(s, FREE_FROZEN);
2592 return; 2595 return;
2593 } 2596 }
2594 2597
2595 if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) 2598 if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
2596 goto slab_empty; 2599 goto slab_empty;
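In the mm/slub.c hunk above, the bare BUG_ON(flags & GFP_SLAB_BUG_MASK) becomes a branch that first logs the offending gfp bits with pr_emerg() and only then calls BUG(), so the bad flags reach the console before the machine stops. A rough userspace analogue of report-then-abort, with a placeholder mask value rather than the kernel's:

#include <stdio.h>
#include <stdlib.h>

#define SLAB_BUG_MASK 0xff000000u       /* placeholder, not the kernel's GFP_SLAB_BUG_MASK */

/* Print which forbidden bits were passed before giving up,
 * instead of failing without a clue. */
static void check_alloc_flags(unsigned int flags)
{
        if (flags & SLAB_BUG_MASK) {
                fprintf(stderr, "bad alloc flags: 0x%x\n", flags & SLAB_BUG_MASK);
                abort();        /* BUG() in the kernel patch */
        }
}

int main(void)
{
        check_alloc_flags(0x00000010u); /* fine */
        check_alloc_flags(0x01000000u); /* reports the stray bit, then aborts */
        return 0;
}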
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
new file mode 100644
index 000000000000..b5f7f24b8dd1
--- /dev/null
+++ b/mm/swap_cgroup.c
@@ -0,0 +1,208 @@
1#include <linux/swap_cgroup.h>
2#include <linux/vmalloc.h>
3#include <linux/mm.h>
4
5#include <linux/swapops.h> /* depends on mm.h include */
6
7static DEFINE_MUTEX(swap_cgroup_mutex);
8struct swap_cgroup_ctrl {
9 struct page **map;
10 unsigned long length;
11 spinlock_t lock;
12};
13
14static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
15
16struct swap_cgroup {
17 unsigned short id;
18};
19#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
20
21/*
22 * SwapCgroup implements "lookup" and "exchange" operations.
23 * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
24 * against SwapCache. At swap_free(), this is accessed directly from swap.
25 *
26 * This means,
27 * - we have no race in "exchange" when we're accessed via SwapCache because
28 * SwapCache(and its swp_entry) is under lock.
29 * - When called via swap_free(), there is no user of this entry and no race.
30 * Then, we don't need lock around "exchange".
31 *
32 * TODO: we can push these buffers out to HIGHMEM.
33 */
34
35/*
36 * allocate buffer for swap_cgroup.
37 */
38static int swap_cgroup_prepare(int type)
39{
40 struct page *page;
41 struct swap_cgroup_ctrl *ctrl;
42 unsigned long idx, max;
43
44 ctrl = &swap_cgroup_ctrl[type];
45
46 for (idx = 0; idx < ctrl->length; idx++) {
47 page = alloc_page(GFP_KERNEL | __GFP_ZERO);
48 if (!page)
49 goto not_enough_page;
50 ctrl->map[idx] = page;
51 }
52 return 0;
53not_enough_page:
54 max = idx;
55 for (idx = 0; idx < max; idx++)
56 __free_page(ctrl->map[idx]);
57
58 return -ENOMEM;
59}
60
61static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
62 struct swap_cgroup_ctrl **ctrlp)
63{
64 pgoff_t offset = swp_offset(ent);
65 struct swap_cgroup_ctrl *ctrl;
66 struct page *mappage;
67 struct swap_cgroup *sc;
68
69 ctrl = &swap_cgroup_ctrl[swp_type(ent)];
70 if (ctrlp)
71 *ctrlp = ctrl;
72
73 mappage = ctrl->map[offset / SC_PER_PAGE];
74 sc = page_address(mappage);
75 return sc + offset % SC_PER_PAGE;
76}
77
78/**
79 * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
80 * @ent: swap entry to be cmpxchged
81 * @old: old id
82 * @new: new id
83 *
84 * Returns old id at success, 0 at failure.
85 * (There is no mem_cgroup using 0 as its id)
86 */
87unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
88 unsigned short old, unsigned short new)
89{
90 struct swap_cgroup_ctrl *ctrl;
91 struct swap_cgroup *sc;
92 unsigned long flags;
93 unsigned short retval;
94
95 sc = lookup_swap_cgroup(ent, &ctrl);
96
97 spin_lock_irqsave(&ctrl->lock, flags);
98 retval = sc->id;
99 if (retval == old)
100 sc->id = new;
101 else
102 retval = 0;
103 spin_unlock_irqrestore(&ctrl->lock, flags);
104 return retval;
105}
106
107/**
108 * swap_cgroup_record - record mem_cgroup for this swp_entry.
109 * @ent: swap entry to be recorded into
110 * @id: mem_cgroup to be recorded
111 *
112 * Returns old value at success, 0 at failure.
113 * (Of course, old value can be 0.)
114 */
115unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
116{
117 struct swap_cgroup_ctrl *ctrl;
118 struct swap_cgroup *sc;
119 unsigned short old;
120 unsigned long flags;
121
122 sc = lookup_swap_cgroup(ent, &ctrl);
123
124 spin_lock_irqsave(&ctrl->lock, flags);
125 old = sc->id;
126 sc->id = id;
127 spin_unlock_irqrestore(&ctrl->lock, flags);
128
129 return old;
130}
131
132/**
133 * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
134 * @ent: swap entry to be looked up.
135 *
136 * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
137 */
138unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
139{
140 return lookup_swap_cgroup(ent, NULL)->id;
141}
142
143int swap_cgroup_swapon(int type, unsigned long max_pages)
144{
145 void *array;
146 unsigned long array_size;
147 unsigned long length;
148 struct swap_cgroup_ctrl *ctrl;
149
150 if (!do_swap_account)
151 return 0;
152
153 length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
154 array_size = length * sizeof(void *);
155
156 array = vzalloc(array_size);
157 if (!array)
158 goto nomem;
159
160 ctrl = &swap_cgroup_ctrl[type];
161 mutex_lock(&swap_cgroup_mutex);
162 ctrl->length = length;
163 ctrl->map = array;
164 spin_lock_init(&ctrl->lock);
165 if (swap_cgroup_prepare(type)) {
166 /* memory shortage */
167 ctrl->map = NULL;
168 ctrl->length = 0;
169 mutex_unlock(&swap_cgroup_mutex);
170 vfree(array);
171 goto nomem;
172 }
173 mutex_unlock(&swap_cgroup_mutex);
174
175 return 0;
176nomem:
177 printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
178 printk(KERN_INFO
179 "swap_cgroup can be disabled by swapaccount=0 boot option\n");
180 return -ENOMEM;
181}
182
183void swap_cgroup_swapoff(int type)
184{
185 struct page **map;
186 unsigned long i, length;
187 struct swap_cgroup_ctrl *ctrl;
188
189 if (!do_swap_account)
190 return;
191
192 mutex_lock(&swap_cgroup_mutex);
193 ctrl = &swap_cgroup_ctrl[type];
194 map = ctrl->map;
195 length = ctrl->length;
196 ctrl->map = NULL;
197 ctrl->length = 0;
198 mutex_unlock(&swap_cgroup_mutex);
199
200 if (map) {
201 for (i = 0; i < length; i++) {
202 struct page *page = map[i];
203 if (page)
204 __free_page(page);
205 }
206 vfree(map);
207 }
208}
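The new mm/swap_cgroup.c above stores one unsigned short cgroup id per swap slot in page-sized chunks: ctrl->map[] holds the chunk pointers, and lookup_swap_cgroup() picks a chunk with offset / SC_PER_PAGE and an entry within it with offset % SC_PER_PAGE. A small userspace sketch of that chunked-array layout; the type and function names are hypothetical, only the sizing and index arithmetic mirror the patch.

#include <stdlib.h>

#define CHUNK_BYTES 4096UL                                      /* stands in for PAGE_SIZE */
#define IDS_PER_CHUNK (CHUNK_BYTES / sizeof(unsigned short))    /* like SC_PER_PAGE */

struct id_map {
        unsigned short **chunks;        /* ctrl->map in the patch */
        unsigned long nchunks;          /* ctrl->length */
};

/* Analogue of swap_cgroup_swapon() + swap_cgroup_prepare(): size the
 * chunk table for max_slots entries and allocate zeroed chunks. */
static int id_map_init(struct id_map *m, unsigned long max_slots)
{
        m->nchunks = (max_slots + IDS_PER_CHUNK - 1) / IDS_PER_CHUNK;
        m->chunks = calloc(m->nchunks, sizeof(*m->chunks));
        if (!m->chunks)
                return -1;
        for (unsigned long i = 0; i < m->nchunks; i++) {
                m->chunks[i] = calloc(IDS_PER_CHUNK, sizeof(unsigned short));
                if (!m->chunks[i])
                        return -1;      /* the kernel version unwinds its allocations here */
        }
        return 0;
}

/* Analogue of lookup_swap_cgroup(): chunk index, then offset within the chunk. */
static unsigned short *id_map_lookup(struct id_map *m, unsigned long slot)
{
        return &m->chunks[slot / IDS_PER_CHUNK][slot % IDS_PER_CHUNK];
}

int main(void)
{
        struct id_map m;

        if (id_map_init(&m, 100000) == 0)
                *id_map_lookup(&m, 12345) = 7;  /* record an id, as swap_cgroup_record() would */
        return 0;
}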
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 154444918685..9711342987a0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,7 +17,6 @@
17#include <linux/blkdev.h> 17#include <linux/blkdev.h>
18#include <linux/pagevec.h> 18#include <linux/pagevec.h>
19#include <linux/migrate.h> 19#include <linux/migrate.h>
20#include <linux/page_cgroup.h>
21 20
22#include <asm/pgtable.h> 21#include <asm/pgtable.h>
23 22
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e0ac59..63f55ccb9b26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -38,7 +38,7 @@
38#include <asm/pgtable.h> 38#include <asm/pgtable.h>
39#include <asm/tlbflush.h> 39#include <asm/tlbflush.h>
40#include <linux/swapops.h> 40#include <linux/swapops.h>
41#include <linux/page_cgroup.h> 41#include <linux/swap_cgroup.h>
42 42
43static bool swap_count_continued(struct swap_info_struct *, pgoff_t, 43static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
44 unsigned char); 44 unsigned char);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 90520af7f186..8a18196fcdff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -463,8 +463,7 @@ overflow:
463 goto retry; 463 goto retry;
464 } 464 }
465 if (printk_ratelimit()) 465 if (printk_ratelimit())
466 printk(KERN_WARNING 466 pr_warn("vmap allocation for size %lu failed: "
467 "vmap allocation for size %lu failed: "
468 "use vmalloc=<size> to increase size.\n", size); 467 "use vmalloc=<size> to increase size.\n", size);
469 kfree(va); 468 kfree(va);
470 return ERR_PTR(-EBUSY); 469 return ERR_PTR(-EBUSY);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcb47074ae03..4636d9e822c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -260,8 +260,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
260 do_div(delta, lru_pages + 1); 260 do_div(delta, lru_pages + 1);
261 total_scan += delta; 261 total_scan += delta;
262 if (total_scan < 0) { 262 if (total_scan < 0) {
263 printk(KERN_ERR 263 pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
264 "shrink_slab: %pF negative objects to delete nr=%ld\n",
265 shrinker->scan_objects, total_scan); 264 shrinker->scan_objects, total_scan);
266 total_scan = freeable; 265 total_scan = freeable;
267 } 266 }
@@ -875,7 +874,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
875 * end of the LRU a second time. 874 * end of the LRU a second time.
876 */ 875 */
877 mapping = page_mapping(page); 876 mapping = page_mapping(page);
878 if ((mapping && bdi_write_congested(mapping->backing_dev_info)) || 877 if (((dirty || writeback) && mapping &&
878 bdi_write_congested(mapping->backing_dev_info)) ||
879 (writeback && PageReclaim(page))) 879 (writeback && PageReclaim(page)))
880 nr_congested++; 880 nr_congested++;
881 881
@@ -2249,7 +2249,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
2249 return true; 2249 return true;
2250 2250
2251 /* If compaction would go ahead or the allocation would succeed, stop */ 2251 /* If compaction would go ahead or the allocation would succeed, stop */
2252 switch (compaction_suitable(zone, sc->order)) { 2252 switch (compaction_suitable(zone, sc->order, 0, 0)) {
2253 case COMPACT_PARTIAL: 2253 case COMPACT_PARTIAL:
2254 case COMPACT_CONTINUE: 2254 case COMPACT_CONTINUE:
2255 return false; 2255 return false;
@@ -2346,7 +2346,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
2346 * If compaction is not ready to start and allocation is not likely 2346 * If compaction is not ready to start and allocation is not likely
2347 * to succeed without it, then keep reclaiming. 2347 * to succeed without it, then keep reclaiming.
2348 */ 2348 */
2349 if (compaction_suitable(zone, order) == COMPACT_SKIPPED) 2349 if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
2350 return false; 2350 return false;
2351 2351
2352 return watermark_ok; 2352 return watermark_ok;
@@ -2824,8 +2824,8 @@ static bool zone_balanced(struct zone *zone, int order,
2824 balance_gap, classzone_idx, 0)) 2824 balance_gap, classzone_idx, 0))
2825 return false; 2825 return false;
2826 2826
2827 if (IS_ENABLED(CONFIG_COMPACTION) && order && 2827 if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
2828 compaction_suitable(zone, order) == COMPACT_SKIPPED) 2828 order, 0, classzone_idx) == COMPACT_SKIPPED)
2829 return false; 2829 return false;
2830 2830
2831 return true; 2831 return true;
@@ -2952,8 +2952,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
2952 * from memory. Do not reclaim more than needed for compaction. 2952 * from memory. Do not reclaim more than needed for compaction.
2953 */ 2953 */
2954 if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && 2954 if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
2955 compaction_suitable(zone, sc->order) != 2955 compaction_suitable(zone, sc->order, 0, classzone_idx)
2956 COMPACT_SKIPPED) 2956 != COMPACT_SKIPPED)
2957 testorder = 0; 2957 testorder = 0;
2958 2958
2959 /* 2959 /*
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1d191357bf88..272327134a1b 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -9,13 +9,13 @@
9int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 9int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
10{ 10{
11 /* 11 /*
12 * The root cgroup does not use res_counters, but rather, 12 * The root cgroup does not use page_counters, but rather,
13 * rely on the data already collected by the network 13 * rely on the data already collected by the network
14 * subsystem 14 * subsystem
15 */ 15 */
16 struct res_counter *res_parent = NULL;
17 struct cg_proto *cg_proto, *parent_cg;
18 struct mem_cgroup *parent = parent_mem_cgroup(memcg); 16 struct mem_cgroup *parent = parent_mem_cgroup(memcg);
17 struct page_counter *counter_parent = NULL;
18 struct cg_proto *cg_proto, *parent_cg;
19 19
20 cg_proto = tcp_prot.proto_cgroup(memcg); 20 cg_proto = tcp_prot.proto_cgroup(memcg);
21 if (!cg_proto) 21 if (!cg_proto)
@@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
29 29
30 parent_cg = tcp_prot.proto_cgroup(parent); 30 parent_cg = tcp_prot.proto_cgroup(parent);
31 if (parent_cg) 31 if (parent_cg)
32 res_parent = &parent_cg->memory_allocated; 32 counter_parent = &parent_cg->memory_allocated;
33 33
34 res_counter_init(&cg_proto->memory_allocated, res_parent); 34 page_counter_init(&cg_proto->memory_allocated, counter_parent);
35 percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL); 35 percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
36 36
37 return 0; 37 return 0;
@@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
50} 50}
51EXPORT_SYMBOL(tcp_destroy_cgroup); 51EXPORT_SYMBOL(tcp_destroy_cgroup);
52 52
53static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) 53static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
54{ 54{
55 struct cg_proto *cg_proto; 55 struct cg_proto *cg_proto;
56 int i; 56 int i;
@@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
60 if (!cg_proto) 60 if (!cg_proto)
61 return -EINVAL; 61 return -EINVAL;
62 62
63 if (val > RES_COUNTER_MAX) 63 ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
64 val = RES_COUNTER_MAX;
65
66 ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
67 if (ret) 64 if (ret)
68 return ret; 65 return ret;
69 66
70 for (i = 0; i < 3; i++) 67 for (i = 0; i < 3; i++)
71 cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT, 68 cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
72 sysctl_tcp_mem[i]); 69 sysctl_tcp_mem[i]);
73 70
74 if (val == RES_COUNTER_MAX) 71 if (nr_pages == PAGE_COUNTER_MAX)
75 clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); 72 clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
76 else if (val != RES_COUNTER_MAX) { 73 else {
77 /* 74 /*
78 * The active bit needs to be written after the static_key 75 * The active bit needs to be written after the static_key
79 * update. This is what guarantees that the socket activation 76 * update. This is what guarantees that the socket activation
@@ -102,11 +99,20 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
102 return 0; 99 return 0;
103} 100}
104 101
102enum {
103 RES_USAGE,
104 RES_LIMIT,
105 RES_MAX_USAGE,
106 RES_FAILCNT,
107};
108
109static DEFINE_MUTEX(tcp_limit_mutex);
110
105static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, 111static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
106 char *buf, size_t nbytes, loff_t off) 112 char *buf, size_t nbytes, loff_t off)
107{ 113{
108 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 114 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
109 unsigned long long val; 115 unsigned long nr_pages;
110 int ret = 0; 116 int ret = 0;
111 117
112 buf = strstrip(buf); 118 buf = strstrip(buf);
@@ -114,10 +120,12 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
114 switch (of_cft(of)->private) { 120 switch (of_cft(of)->private) {
115 case RES_LIMIT: 121 case RES_LIMIT:
116 /* see memcontrol.c */ 122 /* see memcontrol.c */
117 ret = res_counter_memparse_write_strategy(buf, &val); 123 ret = page_counter_memparse(buf, &nr_pages);
118 if (ret) 124 if (ret)
119 break; 125 break;
120 ret = tcp_update_limit(memcg, val); 126 mutex_lock(&tcp_limit_mutex);
127 ret = tcp_update_limit(memcg, nr_pages);
128 mutex_unlock(&tcp_limit_mutex);
121 break; 129 break;
122 default: 130 default:
123 ret = -EINVAL; 131 ret = -EINVAL;
@@ -126,43 +134,36 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
126 return ret ?: nbytes; 134 return ret ?: nbytes;
127} 135}
128 136
129static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
130{
131 struct cg_proto *cg_proto;
132
133 cg_proto = tcp_prot.proto_cgroup(memcg);
134 if (!cg_proto)
135 return default_val;
136
137 return res_counter_read_u64(&cg_proto->memory_allocated, type);
138}
139
140static u64 tcp_read_usage(struct mem_cgroup *memcg)
141{
142 struct cg_proto *cg_proto;
143
144 cg_proto = tcp_prot.proto_cgroup(memcg);
145 if (!cg_proto)
146 return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
147
148 return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
149}
150
151static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) 137static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
152{ 138{
153 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 139 struct mem_cgroup *memcg = mem_cgroup_from_css(css);
140 struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
154 u64 val; 141 u64 val;
155 142
156 switch (cft->private) { 143 switch (cft->private) {
157 case RES_LIMIT: 144 case RES_LIMIT:
158 val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX); 145 if (!cg_proto)
146 return PAGE_COUNTER_MAX;
147 val = cg_proto->memory_allocated.limit;
148 val *= PAGE_SIZE;
159 break; 149 break;
160 case RES_USAGE: 150 case RES_USAGE:
161 val = tcp_read_usage(memcg); 151 if (!cg_proto)
152 val = atomic_long_read(&tcp_memory_allocated);
153 else
154 val = page_counter_read(&cg_proto->memory_allocated);
155 val *= PAGE_SIZE;
162 break; 156 break;
163 case RES_FAILCNT: 157 case RES_FAILCNT:
158 if (!cg_proto)
159 return 0;
160 val = cg_proto->memory_allocated.failcnt;
161 break;
164 case RES_MAX_USAGE: 162 case RES_MAX_USAGE:
165 val = tcp_read_stat(memcg, cft->private, 0); 163 if (!cg_proto)
164 return 0;
165 val = cg_proto->memory_allocated.watermark;
166 val *= PAGE_SIZE;
166 break; 167 break;
167 default: 168 default:
168 BUG(); 169 BUG();
@@ -183,10 +184,10 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
183 184
184 switch (of_cft(of)->private) { 185 switch (of_cft(of)->private) {
185 case RES_MAX_USAGE: 186 case RES_MAX_USAGE:
186 res_counter_reset_max(&cg_proto->memory_allocated); 187 page_counter_reset_watermark(&cg_proto->memory_allocated);
187 break; 188 break;
188 case RES_FAILCNT: 189 case RES_FAILCNT:
189 res_counter_reset_failcnt(&cg_proto->memory_allocated); 190 cg_proto->memory_allocated.failcnt = 0;
190 break; 191 break;
191 } 192 }
192 193
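After the res_counter to page_counter conversion above, tcp_cgroup_read() pulls limit, watermark and failcnt straight out of cg_proto->memory_allocated (and page_counter_read() for usage), scaling page counts to bytes with PAGE_SIZE. A rough sketch of that read side; the field names are copied from the hunk, everything else (the struct layout, the helper, the sample values) is hypothetical.

#include <stdio.h>

#define PAGE_SIZE 4096UL        /* placeholder for the architecture's page size */

/* Stand-in for struct page_counter, reduced to the fields the hunk touches. */
struct page_counter {
        long count;                     /* current usage, what page_counter_read() reports */
        unsigned long limit;            /* RES_LIMIT */
        unsigned long watermark;        /* RES_MAX_USAGE */
        unsigned long failcnt;          /* RES_FAILCNT */
};

static unsigned long long pages_to_bytes(unsigned long pages)
{
        return (unsigned long long)pages * PAGE_SIZE;
}

int main(void)
{
        struct page_counter c = { .count = 12, .limit = 64, .watermark = 20, .failcnt = 3 };

        printf("usage=%llu limit=%llu max_usage=%llu failcnt=%lu\n",
               pages_to_bytes(c.count), pages_to_bytes(c.limit),
               pages_to_bytes(c.watermark), c.failcnt);
        return 0;
}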
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 374abf443636..f0bb6d60c07b 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -7,10 +7,11 @@
7 7
8use strict; 8use strict;
9use POSIX; 9use POSIX;
10use File::Basename;
11use Cwd 'abs_path';
10 12
11my $P = $0; 13my $P = $0;
12$P =~ s@(.*)/@@g; 14my $D = dirname(abs_path($P));
13my $D = $1;
14 15
15my $V = '0.32'; 16my $V = '0.32';
16 17
@@ -438,26 +439,29 @@ our $allowed_asm_includes = qr{(?x:
438 439
439# Load common spelling mistakes and build regular expression list. 440# Load common spelling mistakes and build regular expression list.
440my $misspellings; 441my $misspellings;
441my @spelling_list;
442my %spelling_fix; 442my %spelling_fix;
443open(my $spelling, '<', $spelling_file)
444 or die "$P: Can't open $spelling_file for reading: $!\n";
445while (<$spelling>) {
446 my $line = $_;
447 443
448 $line =~ s/\s*\n?$//g; 444if (open(my $spelling, '<', $spelling_file)) {
449 $line =~ s/^\s*//g; 445 my @spelling_list;
446 while (<$spelling>) {
447 my $line = $_;
450 448
451 next if ($line =~ m/^\s*#/); 449 $line =~ s/\s*\n?$//g;
452 next if ($line =~ m/^\s*$/); 450 $line =~ s/^\s*//g;
453 451
454 my ($suspect, $fix) = split(/\|\|/, $line); 452 next if ($line =~ m/^\s*#/);
453 next if ($line =~ m/^\s*$/);
455 454
456 push(@spelling_list, $suspect); 455 my ($suspect, $fix) = split(/\|\|/, $line);
457 $spelling_fix{$suspect} = $fix; 456
457 push(@spelling_list, $suspect);
458 $spelling_fix{$suspect} = $fix;
459 }
460 close($spelling);
461 $misspellings = join("|", @spelling_list);
462} else {
463 warn "No typos will be found - file '$spelling_file': $!\n";
458} 464}
459close($spelling);
460$misspellings = join("|", @spelling_list);
461 465
462sub build_types { 466sub build_types {
463 my $mods = "(?x: \n" . join("|\n ", @modifierList) . "\n)"; 467 my $mods = "(?x: \n" . join("|\n ", @modifierList) . "\n)";
@@ -942,7 +946,7 @@ sub sanitise_line {
942sub get_quoted_string { 946sub get_quoted_string {
943 my ($line, $rawline) = @_; 947 my ($line, $rawline) = @_;
944 948
945 return "" if ($line !~ m/(\"[X]+\")/g); 949 return "" if ($line !~ m/(\"[X\t]+\")/g);
946 return substr($rawline, $-[0], $+[0] - $-[0]); 950 return substr($rawline, $-[0], $+[0] - $-[0]);
947} 951}
948 952
@@ -1843,6 +1847,7 @@ sub process {
1843 my $non_utf8_charset = 0; 1847 my $non_utf8_charset = 0;
1844 1848
1845 my $last_blank_line = 0; 1849 my $last_blank_line = 0;
1850 my $last_coalesced_string_linenr = -1;
1846 1851
1847 our @report = (); 1852 our @report = ();
1848 our $cnt_lines = 0; 1853 our $cnt_lines = 0;
@@ -2078,6 +2083,12 @@ sub process {
2078 $in_commit_log = 0; 2083 $in_commit_log = 0;
2079 } 2084 }
2080 2085
2086# Check if MAINTAINERS is being updated. If so, there's probably no need to
2087# emit the "does MAINTAINERS need updating?" message on file add/move/delete
2088 if ($line =~ /^\s*MAINTAINERS\s*\|/) {
2089 $reported_maintainer_file = 1;
2090 }
2091
2081# Check signature styles 2092# Check signature styles
2082 if (!$in_header_lines && 2093 if (!$in_header_lines &&
2083 $line =~ /^(\s*)([a-z0-9_-]+by:|$signature_tags)(\s*)(.*)/i) { 2094 $line =~ /^(\s*)([a-z0-9_-]+by:|$signature_tags)(\s*)(.*)/i) {
@@ -2246,7 +2257,7 @@ sub process {
2246 } 2257 }
2247 2258
2248# Check for various typo / spelling mistakes 2259# Check for various typo / spelling mistakes
2249 if ($in_commit_log || $line =~ /^\+/) { 2260 if (defined($misspellings) && ($in_commit_log || $line =~ /^\+/)) {
2250 while ($rawline =~ /(?:^|[^a-z@])($misspellings)(?:$|[^a-z@])/gi) { 2261 while ($rawline =~ /(?:^|[^a-z@])($misspellings)(?:$|[^a-z@])/gi) {
2251 my $typo = $1; 2262 my $typo = $1;
2252 my $typo_fix = $spelling_fix{lc($typo)}; 2263 my $typo_fix = $spelling_fix{lc($typo)};
@@ -2403,33 +2414,6 @@ sub process {
2403 "line over $max_line_length characters\n" . $herecurr); 2414 "line over $max_line_length characters\n" . $herecurr);
2404 } 2415 }
2405 2416
2406# Check for user-visible strings broken across lines, which breaks the ability
2407# to grep for the string. Make exceptions when the previous string ends in a
2408# newline (multiple lines in one string constant) or '\t', '\r', ';', or '{'
2409# (common in inline assembly) or is a octal \123 or hexadecimal \xaf value
2410 if ($line =~ /^\+\s*"/ &&
2411 $prevline =~ /"\s*$/ &&
2412 $prevrawline !~ /(?:\\(?:[ntr]|[0-7]{1,3}|x[0-9a-fA-F]{1,2})|;\s*|\{\s*)"\s*$/) {
2413 WARN("SPLIT_STRING",
2414 "quoted string split across lines\n" . $hereprev);
2415 }
2416
2417# check for missing a space in a string concatination
2418 if ($prevrawline =~ /[^\\]\w"$/ && $rawline =~ /^\+[\t ]+"\w/) {
2419 WARN('MISSING_SPACE',
2420 "break quoted strings at a space character\n" . $hereprev);
2421 }
2422
2423# check for spaces before a quoted newline
2424 if ($rawline =~ /^.*\".*\s\\n/) {
2425 if (WARN("QUOTED_WHITESPACE_BEFORE_NEWLINE",
2426 "unnecessary whitespace before a quoted newline\n" . $herecurr) &&
2427 $fix) {
2428 $fixed[$fixlinenr] =~ s/^(\+.*\".*)\s+\\n/$1\\n/;
2429 }
2430
2431 }
2432
2433# check for adding lines without a newline. 2417# check for adding lines without a newline.
2434 if ($line =~ /^\+/ && defined $lines[$linenr] && $lines[$linenr] =~ /^\\ No newline at end of file/) { 2418 if ($line =~ /^\+/ && defined $lines[$linenr] && $lines[$linenr] =~ /^\\ No newline at end of file/) {
2435 WARN("MISSING_EOF_NEWLINE", 2419 WARN("MISSING_EOF_NEWLINE",
@@ -2515,7 +2499,8 @@ sub process {
2515 } 2499 }
2516 } 2500 }
2517 2501
2518 if ($line =~ /^\+.*\(\s*$Type\s*\)[ \t]+(?!$Assignment|$Arithmetic|{)/) { 2502 if ($line =~ /^\+.*(\w+\s*)?\(\s*$Type\s*\)[ \t]+(?!$Assignment|$Arithmetic|[,;\({\[\<\>])/ &&
2503 (!defined($1) || $1 !~ /sizeof\s*/)) {
2519 if (CHK("SPACING", 2504 if (CHK("SPACING",
2520 "No space is necessary after a cast\n" . $herecurr) && 2505 "No space is necessary after a cast\n" . $herecurr) &&
2521 $fix) { 2506 $fix) {
@@ -3563,14 +3548,33 @@ sub process {
3563 } 3548 }
3564 } 3549 }
3565 3550
3566 # , must have a space on the right. 3551 # , must not have a space before and must have a space on the right.
3567 } elsif ($op eq ',') { 3552 } elsif ($op eq ',') {
3553 my $rtrim_before = 0;
3554 my $space_after = 0;
3555 if ($ctx =~ /Wx./) {
3556 if (ERROR("SPACING",
3557 "space prohibited before that '$op' $at\n" . $hereptr)) {
3558 $line_fixed = 1;
3559 $rtrim_before = 1;
3560 }
3561 }
3568 if ($ctx !~ /.x[WEC]/ && $cc !~ /^}/) { 3562 if ($ctx !~ /.x[WEC]/ && $cc !~ /^}/) {
3569 if (ERROR("SPACING", 3563 if (ERROR("SPACING",
3570 "space required after that '$op' $at\n" . $hereptr)) { 3564 "space required after that '$op' $at\n" . $hereptr)) {
3571 $good = $fix_elements[$n] . trim($fix_elements[$n + 1]) . " ";
3572 $line_fixed = 1; 3565 $line_fixed = 1;
3573 $last_after = $n; 3566 $last_after = $n;
3567 $space_after = 1;
3568 }
3569 }
3570 if ($rtrim_before || $space_after) {
3571 if ($rtrim_before) {
3572 $good = rtrim($fix_elements[$n]) . trim($fix_elements[$n + 1]);
3573 } else {
3574 $good = $fix_elements[$n] . trim($fix_elements[$n + 1]);
3575 }
3576 if ($space_after) {
3577 $good .= " ";
3574 } 3578 }
3575 } 3579 }
3576 3580
@@ -3814,9 +3818,27 @@ sub process {
3814# ie: &(foo->bar) should be &foo->bar and *(foo->bar) should be *foo->bar 3818# ie: &(foo->bar) should be &foo->bar and *(foo->bar) should be *foo->bar
3815 3819
3816 while ($line =~ /(?:[^&]&\s*|\*)\(\s*($Ident\s*(?:$Member\s*)+)\s*\)/g) { 3820 while ($line =~ /(?:[^&]&\s*|\*)\(\s*($Ident\s*(?:$Member\s*)+)\s*\)/g) {
3817 CHK("UNNECESSARY_PARENTHESES", 3821 my $var = $1;
3818 "Unnecessary parentheses around $1\n" . $herecurr); 3822 if (CHK("UNNECESSARY_PARENTHESES",
3819 } 3823 "Unnecessary parentheses around $var\n" . $herecurr) &&
3824 $fix) {
3825 $fixed[$fixlinenr] =~ s/\(\s*\Q$var\E\s*\)/$var/;
3826 }
3827 }
3828
3829# check for unnecessary parentheses around function pointer uses
3830# ie: (foo->bar)(); should be foo->bar();
3831# but not "if (foo->bar) (" to avoid some false positives
3832 if ($line =~ /(\bif\s*|)(\(\s*$Ident\s*(?:$Member\s*)+\))[ \t]*\(/ && $1 !~ /^if/) {
3833 my $var = $2;
3834 if (CHK("UNNECESSARY_PARENTHESES",
3835 "Unnecessary parentheses around function pointer $var\n" . $herecurr) &&
3836 $fix) {
3837 my $var2 = deparenthesize($var);
3838 $var2 =~ s/\s//g;
3839 $fixed[$fixlinenr] =~ s/\Q$var\E/$var2/;
3840 }
3841 }
3820 3842
3821#goto labels aren't indented, allow a single space however 3843#goto labels aren't indented, allow a single space however
3822 if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and 3844 if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and
@@ -4056,7 +4078,9 @@ sub process {
4056#Ignore Page<foo> variants 4078#Ignore Page<foo> variants
4057 $var !~ /^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ && 4079 $var !~ /^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ &&
4058#Ignore SI style variants like nS, mV and dB (ie: max_uV, regulator_min_uA_show) 4080#Ignore SI style variants like nS, mV and dB (ie: max_uV, regulator_min_uA_show)
4059 $var !~ /^(?:[a-z_]*?)_?[a-z][A-Z](?:_[a-z_]+)?$/) { 4081 $var !~ /^(?:[a-z_]*?)_?[a-z][A-Z](?:_[a-z_]+)?$/ &&
4082#Ignore some three character SI units explicitly, like MiB and KHz
4083 $var !~ /^(?:[a-z_]*?)_?(?:[KMGT]iB|[KMGT]?Hz)(?:_[a-z_]+)?$/) {
4060 while ($var =~ m{($Ident)}g) { 4084 while ($var =~ m{($Ident)}g) {
4061 my $word = $1; 4085 my $word = $1;
4062 next if ($word !~ /[A-Z][a-z]|[a-z][A-Z]/); 4086 next if ($word !~ /[A-Z][a-z]|[a-z][A-Z]/);
@@ -4408,12 +4432,85 @@ sub process {
4408 "Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr); 4432 "Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr);
4409 } 4433 }
4410 4434
4435# Check for user-visible strings broken across lines, which breaks the ability
4436# to grep for the string. Make exceptions when the previous string ends in a
4437# newline (multiple lines in one string constant) or '\t', '\r', ';', or '{'
4438# (common in inline assembly) or is a octal \123 or hexadecimal \xaf value
4439 if ($line =~ /^\+\s*"[X\t]*"/ &&
4440 $prevline =~ /"\s*$/ &&
4441 $prevrawline !~ /(?:\\(?:[ntr]|[0-7]{1,3}|x[0-9a-fA-F]{1,2})|;\s*|\{\s*)"\s*$/) {
4442 if (WARN("SPLIT_STRING",
4443 "quoted string split across lines\n" . $hereprev) &&
4444 $fix &&
4445 $prevrawline =~ /^\+.*"\s*$/ &&
4446 $last_coalesced_string_linenr != $linenr - 1) {
4447 my $extracted_string = get_quoted_string($line, $rawline);
4448 my $comma_close = "";
4449 if ($rawline =~ /\Q$extracted_string\E(\s*\)\s*;\s*$|\s*,\s*)/) {
4450 $comma_close = $1;
4451 }
4452
4453 fix_delete_line($fixlinenr - 1, $prevrawline);
4454 fix_delete_line($fixlinenr, $rawline);
4455 my $fixedline = $prevrawline;
4456 $fixedline =~ s/"\s*$//;
4457 $fixedline .= substr($extracted_string, 1) . trim($comma_close);
4458 fix_insert_line($fixlinenr - 1, $fixedline);
4459 $fixedline = $rawline;
4460 $fixedline =~ s/\Q$extracted_string\E\Q$comma_close\E//;
4461 if ($fixedline !~ /\+\s*$/) {
4462 fix_insert_line($fixlinenr, $fixedline);
4463 }
4464 $last_coalesced_string_linenr = $linenr;
4465 }
4466 }
4467
4468# check for missing a space in a string concatenation
4469 if ($prevrawline =~ /[^\\]\w"$/ && $rawline =~ /^\+[\t ]+"\w/) {
4470 WARN('MISSING_SPACE',
4471 "break quoted strings at a space character\n" . $hereprev);
4472 }
4473
4474# check for spaces before a quoted newline
4475 if ($rawline =~ /^.*\".*\s\\n/) {
4476 if (WARN("QUOTED_WHITESPACE_BEFORE_NEWLINE",
4477 "unnecessary whitespace before a quoted newline\n" . $herecurr) &&
4478 $fix) {
4479 $fixed[$fixlinenr] =~ s/^(\+.*\".*)\s+\\n/$1\\n/;
4480 }
4481
4482 }
4483
4411# concatenated string without spaces between elements 4484# concatenated string without spaces between elements
4412 if ($line =~ /"X+"[A-Z_]+/ || $line =~ /[A-Z_]+"X+"/) { 4485 if ($line =~ /"X+"[A-Z_]+/ || $line =~ /[A-Z_]+"X+"/) {
4413 CHK("CONCATENATED_STRING", 4486 CHK("CONCATENATED_STRING",
4414 "Concatenated strings should use spaces between elements\n" . $herecurr); 4487 "Concatenated strings should use spaces between elements\n" . $herecurr);
4415 } 4488 }
4416 4489
4490# uncoalesced string fragments
4491 if ($line =~ /"X*"\s*"/) {
4492 WARN("STRING_FRAGMENTS",
4493 "Consecutive strings are generally better as a single string\n" . $herecurr);
4494 }
4495
4496# check for %L{u,d,i} in strings
4497 my $string;
4498 while ($line =~ /(?:^|")([X\t]*)(?:"|$)/g) {
4499 $string = substr($rawline, $-[1], $+[1] - $-[1]);
4500 $string =~ s/%%/__/g;
4501 if ($string =~ /(?<!%)%L[udi]/) {
4502 WARN("PRINTF_L",
4503 "\%Ld/%Lu are not-standard C, use %lld/%llu\n" . $herecurr);
4504 last;
4505 }
4506 }
4507
4508# check for line continuations in quoted strings with odd counts of "
4509 if ($rawline =~ /\\$/ && $rawline =~ tr/"/"/ % 2) {
4510 WARN("LINE_CONTINUATIONS",
4511 "Avoid line continuations in quoted strings\n" . $herecurr);
4512 }
4513
4417# warn about #if 0 4514# warn about #if 0
4418 if ($line =~ /^.\s*\#\s*if\s+0\b/) { 4515 if ($line =~ /^.\s*\#\s*if\s+0\b/) {
4419 CHK("REDUNDANT_CODE", 4516 CHK("REDUNDANT_CODE",
@@ -4426,7 +4523,7 @@ sub process {
4426 my $expr = '\s*\(\s*' . quotemeta($1) . '\s*\)\s*;'; 4523 my $expr = '\s*\(\s*' . quotemeta($1) . '\s*\)\s*;';
4427 if ($line =~ /\b(kfree|usb_free_urb|debugfs_remove(?:_recursive)?)$expr/) { 4524 if ($line =~ /\b(kfree|usb_free_urb|debugfs_remove(?:_recursive)?)$expr/) {
4428 WARN('NEEDLESS_IF', 4525 WARN('NEEDLESS_IF',
4429 "$1(NULL) is safe this check is probably not required\n" . $hereprev); 4526 "$1(NULL) is safe and this check is probably not required\n" . $hereprev);
4430 } 4527 }
4431 } 4528 }
4432 4529
@@ -4458,6 +4555,28 @@ sub process {
4458 } 4555 }
4459 } 4556 }
4460 4557
4558# check for mask then right shift without a parentheses
4559 if ($^V && $^V ge 5.10.0 &&
4560 $line =~ /$LvalOrFunc\s*\&\s*($LvalOrFunc)\s*>>/ &&
4561 $4 !~ /^\&/) { # $LvalOrFunc may be &foo, ignore if so
4562 WARN("MASK_THEN_SHIFT",
4563 "Possible precedence defect with mask then right shift - may need parentheses\n" . $herecurr);
4564 }
4565
4566# check for pointer comparisons to NULL
4567 if ($^V && $^V ge 5.10.0) {
4568 while ($line =~ /\b$LvalOrFunc\s*(==|\!=)\s*NULL\b/g) {
4569 my $val = $1;
4570 my $equal = "!";
4571 $equal = "" if ($4 eq "!=");
4572 if (CHK("COMPARISON_TO_NULL",
4573 "Comparison to NULL could be written \"${equal}${val}\"\n" . $herecurr) &&
4574 $fix) {
4575 $fixed[$fixlinenr] =~ s/\b\Q$val\E\s*(?:==|\!=)\s*NULL\b/$equal$val/;
4576 }
4577 }
4578 }
4579
4461# check for bad placement of section $InitAttribute (e.g.: __initdata) 4580# check for bad placement of section $InitAttribute (e.g.: __initdata)
4462 if ($line =~ /(\b$InitAttribute\b)/) { 4581 if ($line =~ /(\b$InitAttribute\b)/) {
4463 my $attr = $1; 4582 my $attr = $1;
@@ -4652,6 +4771,15 @@ sub process {
4652 } 4771 }
4653 } 4772 }
4654 4773
4774# Check for __attribute__ weak, or __weak declarations (may have link issues)
4775 if ($^V && $^V ge 5.10.0 &&
4776 $line =~ /(?:$Declare|$DeclareMisordered)\s*$Ident\s*$balanced_parens\s*(?:$Attribute)?\s*;/ &&
4777 ($line =~ /\b__attribute__\s*\(\s*\(.*\bweak\b/ ||
4778 $line =~ /\b__weak\b/)) {
4779 ERROR("WEAK_DECLARATION",
4780 "Using weak declarations can have unintended link defects\n" . $herecurr);
4781 }
4782
4655# check for sizeof(&) 4783# check for sizeof(&)
4656 if ($line =~ /\bsizeof\s*\(\s*\&/) { 4784 if ($line =~ /\bsizeof\s*\(\s*\&/) {
4657 WARN("SIZEOF_ADDRESS", 4785 WARN("SIZEOF_ADDRESS",
@@ -4667,12 +4795,6 @@ sub process {
4667 } 4795 }
4668 } 4796 }
4669 4797
4670# check for line continuations in quoted strings with odd counts of "
4671 if ($rawline =~ /\\$/ && $rawline =~ tr/"/"/ % 2) {
4672 WARN("LINE_CONTINUATIONS",
4673 "Avoid line continuations in quoted strings\n" . $herecurr);
4674 }
4675
4676# check for struct spinlock declarations 4798# check for struct spinlock declarations
4677 if ($line =~ /^.\s*\bstruct\s+spinlock\s+\w+\s*;/) { 4799 if ($line =~ /^.\s*\bstruct\s+spinlock\s+\w+\s*;/) {
4678 WARN("USE_SPINLOCK_T", 4800 WARN("USE_SPINLOCK_T",
@@ -4908,6 +5030,17 @@ sub process {
4908 } 5030 }
4909 } 5031 }
4910 5032
5033# check for #defines like: 1 << <digit> that could be BIT(digit)
5034 if ($line =~ /#\s*define\s+\w+\s+\(?\s*1\s*([ulUL]*)\s*\<\<\s*(?:\d+|$Ident)\s*\)?/) {
5035 my $ull = "";
5036 $ull = "_ULL" if (defined($1) && $1 =~ /ll/i);
5037 if (CHK("BIT_MACRO",
5038 "Prefer using the BIT$ull macro\n" . $herecurr) &&
5039 $fix) {
5040 $fixed[$fixlinenr] =~ s/\(?\s*1\s*[ulUL]*\s*<<\s*(\d+|$Ident)\s*\)?/BIT${ull}($1)/;
5041 }
5042 }
5043
4911# check for case / default statements not preceded by break/fallthrough/switch 5044# check for case / default statements not preceded by break/fallthrough/switch
4912 if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) { 5045 if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) {
4913 my $has_break = 0; 5046 my $has_break = 0;
@@ -5071,18 +5204,6 @@ sub process {
5071 "#define of '$1' is wrong - use Kconfig variables or standard guards instead\n" . $herecurr); 5204 "#define of '$1' is wrong - use Kconfig variables or standard guards instead\n" . $herecurr);
5072 } 5205 }
5073 5206
5074# check for %L{u,d,i} in strings
5075 my $string;
5076 while ($line =~ /(?:^|")([X\t]*)(?:"|$)/g) {
5077 $string = substr($rawline, $-[1], $+[1] - $-[1]);
5078 $string =~ s/%%/__/g;
5079 if ($string =~ /(?<!%)%L[udi]/) {
5080 WARN("PRINTF_L",
5081 "\%Ld/%Lu are not-standard C, use %lld/%llu\n" . $herecurr);
5082 last;
5083 }
5084 }
5085
5086# whine mightly about in_atomic 5207# whine mightly about in_atomic
5087 if ($line =~ /\bin_atomic\s*\(/) { 5208 if ($line =~ /\bin_atomic\s*\(/) {
5088 if ($realfile =~ m@^drivers/@) { 5209 if ($realfile =~ m@^drivers/@) {
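Several of the new checkpatch.pl tests above key off specific C shapes. A hypothetical snippet that would trip four of them (BIT_MACRO, MASK_THEN_SHIFT, COMPARISON_TO_NULL, STRING_FRAGMENTS) when added by a patch:

#include <stddef.h>

/* BIT_MACRO: the new check suggests BIT(4) for 1-shifted-by-constant defines. */
#define MY_FLAG (1 << 4)

/* STRING_FRAGMENTS: adjacent literals are flagged as better written as one string. */
static const char greeting[] = "hello " "world";

int example(unsigned int status, unsigned int mask, const char *name)
{
        /* MASK_THEN_SHIFT: '>>' binds tighter than '&', so this is
         * status & (mask >> 2); parentheses would make the intent explicit. */
        unsigned int field = status & mask >> 2;

        if (name == NULL)       /* COMPARISON_TO_NULL: could be written "!name" */
                return -1;

        return (int)field;
}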
diff --git a/scripts/kernel-doc b/scripts/kernel-doc
index 70bea942b413..9922e66883a5 100755
--- a/scripts/kernel-doc
+++ b/scripts/kernel-doc
@@ -1753,7 +1753,7 @@ sub dump_struct($$) {
1753 # strip kmemcheck_bitfield_{begin,end}.*; 1753 # strip kmemcheck_bitfield_{begin,end}.*;
1754 $members =~ s/kmemcheck_bitfield_.*?;//gos; 1754 $members =~ s/kmemcheck_bitfield_.*?;//gos;
1755 # strip attributes 1755 # strip attributes
1756 $members =~ s/__aligned\s*\(.+\)//gos; 1756 $members =~ s/__aligned\s*\([^;]*\)//gos;
1757 1757
1758 create_parameterlist($members, ';', $file); 1758 create_parameterlist($members, ';', $file);
1759 check_sections($file, $declaration_name, "struct", $sectcheck, $struct_actual, $nested); 1759 check_sections($file, $declaration_name, "struct", $sectcheck, $struct_actual, $nested);
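The kernel-doc fix above makes the attribute-stripping substitution stop at the first ';' ("[^;]*") instead of matching greedily (".+"), so an __aligned() on one struct member can no longer swallow the members after it. A hypothetical struct illustrating the case the greedy pattern mishandled: with ".+\)" the strip could run from the first __aligned( all the way to the ')' of the function-pointer member, taking the declarations in between with it.

struct demo {
        unsigned int counter __aligned(8);      /* attribute kernel-doc strips */
        unsigned char buf[16];                  /* could vanish with the greedy match */
        void (*callback)(int value);            /* supplies a later ')' for ".+" to reach */
};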