author     Minchan Kim <minchan@kernel.org>          2019-09-25 19:49:08 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2019-09-25 20:51:41 -0400
commit     9c276cc65a58faf98be8e56962745ec99ab87636 (patch)
tree       34789d8c8a0b1556c06e7f15c3524f919ee67183
parent     ce18d171cb7368557e6498a3ce111d7d3dc03e4d (diff)
mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

- Background

The Android terminology for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster, so we try to make hot starts more likely than cold starts.

To increase hot starts, Android userspace manages the order in which apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd (the low memory killer daemon). These apps are likely to be killed by lmkd if the system has to reclaim memory; in that sense they are similar to entries in any other cache. They are kept alive for opportunistic performance improvements, but those improvements will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidates for it. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed (we measured the performance of swapping out vs. zapping the memory by killing a process; unsurprisingly, zapping is 10x faster even though we use zram, which is much faster than real storage). A kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was a new interface that allows userspace to proactively reclaim entire processes by leveraging platform information. This lets us bypass the inaccuracy of the kernel's LRUs for pages that are known to be cold from userspace, and avoid races with lmkd by reclaiming apps as soon as they enter the cached state. Additionally, it gives the platform many more chances to use its information to optimize memory efficiency.

To achieve this, the patchset introduces two new madvise options. MADV_COLD deactivates active pages, and MADV_PAGEOUT reclaims private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory. MADV_PAGEOUT is similar to MADV_DONTNEED in that it hints to the kernel that the memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in that it hints to the kernel that the memory region is not currently needed and should be reclaimed when memory pressure rises.

This patch (of 5):

When a process expects no accesses to a certain memory range, it can give a hint to the kernel that the pages can be reclaimed when memory pressure happens, but that the data should be preserved for future use. This can reduce workingset eviction and so end up increasing performance.

This patch introduces the new MADV_COLD hint to the madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint helps the kernel decide which pages to evict early during memory pressure.

It works on every LRU page, like MADV_[DONTNEED|FREE]. In other words, it moves

	active file page -> inactive file LRU
	active anon page -> inactive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the inactive file LRU, because MADV_COLD has slightly different semantics. MADV_FREE means it's okay to discard the page under memory pressure because its contents are *garbage*, so freeing such pages is almost zero overhead: we don't need to swap them out, and a later access causes just a minor fault. Thus, it makes sense to put those freeable pages on the inactive file LRU to compete with other used-once pages. It also makes sense from an implementation point of view, because the page is no longer swap-backed until it is re-dirtied. It even gives a bonus in that such pages can be reclaimed on a swapless system.

However, MADV_COLD does not mean the data is garbage, so reclaiming it eventually requires swap-out/in, which is a bigger cost. Since VM LRU aging is designed around a cost model, anonymous cold pages are better positioned on the inactive anon LRU list, not the file LRU. Furthermore, this helps avoid unnecessary scanning if the system doesn't have a swap device. Let's start with the simpler approach without adding complexity at this moment. Keep in mind, though, the caveat that workloads with a lot of page cache are likely to ignore MADV_COLD on anonymous memory because we rarely age the anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less recently accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages.

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
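For illustration only (not part of this patch): a userspace sketch of how a process that knows a buffer has gone cold might apply the new hint. The 64MB anonymous mapping and the fallback #define of MADV_COLD to 20 (the value added by this patch) are assumptions for the example; on kernels or headers without MADV_COLD, madvise() simply fails with EINVAL.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* value introduced by this patch; fallback for older headers */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* illustrative 64MB buffer */

	/* Anonymous private memory, the kind of pages MADV_COLD deactivates. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0xaa, len);	/* touch the pages so they land on the LRU */

	/*
	 * Hint that this range is not needed in the near future. Unlike
	 * MADV_DONTNEED, the contents are preserved; the kernel only moves
	 * the pages to the inactive LRU so they are reclaimed earlier under
	 * memory pressure.
	 */
	if (madvise(buf, len, MADV_COLD) != 0)
		perror("madvise(MADV_COLD)");	/* e.g. EINVAL without kernel support */

	/* The data is still intact and usable afterwards. */
	printf("first byte after MADV_COLD: 0x%02x\n", (unsigned char)buf[0]);

	munmap(buf, len);
	return 0;
}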
-rw-r--r--  arch/alpha/include/uapi/asm/mman.h      2
-rw-r--r--  arch/mips/include/uapi/asm/mman.h       2
-rw-r--r--  arch/parisc/include/uapi/asm/mman.h     2
-rw-r--r--  arch/xtensa/include/uapi/asm/mman.h     2
-rw-r--r--  include/linux/swap.h                    1
-rw-r--r--  include/uapi/asm-generic/mman-common.h  2
-rw-r--r--  mm/internal.h                           2
-rw-r--r--  mm/madvise.c                            179
-rw-r--r--  mm/oom_kill.c                           2
-rw-r--r--  mm/swap.c                               42
10 files changed, 232 insertions, 4 deletions
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index ac23379b7a87..f3258fbf03d0 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -68,6 +68,8 @@
 #define MADV_WIPEONFORK 18	/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19	/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLD	20	/* deactivate these pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c2b40969eb1f..00ad09fc5eb1 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -95,6 +95,8 @@
 #define MADV_WIPEONFORK 18	/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19	/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLD	20	/* deactivate these pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index c98162f494db..eb14e3a7b8f3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -48,6 +48,8 @@
 #define MADV_DONTFORK	10	/* don't inherit across fork */
 #define MADV_DOFORK	11	/* do inherit across fork */
 
+#define MADV_COLD	20	/* deactivate these pages */
+
 #define MADV_MERGEABLE	65	/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66	/* KSM may not merge identical pages */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index ebbb48842190..f926b00ff11f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 #define MADV_WIPEONFORK 18	/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19	/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLD	20	/* deactivate these pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 63b1f506ea67..23431faf0eb6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -67,6 +67,8 @@
 #define MADV_WIPEONFORK 18	/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19	/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLD	20	/* deactivate these pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/internal.h b/mm/internal.h
index e32390802fd3..0d5f720c75ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,7 +39,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 1f8a6fdc6878..e1aee62967c3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,7 @@
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
+#include <linux/page_idle.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
@@ -42,6 +43,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -289,6 +291,176 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (pmd_trans_huge(*pmd)) {
+		pmd_t orig_pmd;
+		unsigned long next = pmd_addr_end(addr, end);
+
+		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		orig_pmd = *pmd;
+		if (is_huge_zero_pmd(orig_pmd))
+			goto huge_unlock;
+
+		if (unlikely(!pmd_present(orig_pmd))) {
+			VM_BUG_ON(thp_migration_supported() &&
+					!is_pmd_migration_entry(orig_pmd));
+			goto huge_unlock;
+		}
+
+		page = pmd_page(orig_pmd);
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			if (page_mapcount(page) != 1)
+				goto huge_unlock;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (pmd_young(orig_pmd)) {
+			pmdp_invalidate(vma, addr, pmd);
+			orig_pmd = pmd_mkold(orig_pmd);
+
+			set_pmd_at(mm, addr, pmd, orig_pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		}
+
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+regular_page:
+#endif
+	tlb_change_page_size(tlb, PAGE_SIZE);
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
+	arch_enter_lazy_mmu_mode();
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		/*
+		 * Creating a THP page is expensive so split it only if we
+		 * are sure it's worth it. Split it if we are the only owner.
+		 */
+		if (PageTransCompound(page)) {
+			if (page_mapcount(page) != 1)
+				break;
+			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				break;
+			}
+			pte_unmap_unlock(orig_pte, ptl);
+			if (split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				break;
+			}
+			unlock_page(page);
+			put_page(page);
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte--;
+			addr -= PAGE_SIZE;
+			continue;
+		}
+
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+		if (pte_young(ptent)) {
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+			ptent = pte_mkold(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+
+		/*
+		 * We are deactivating a page to accelerate its reclaim.
+		 * The VM can't reclaim the page unless we clear PG_young.
+		 * As a side effect, this confuses idle-page tracking,
+		 * which will miss the recent reference history.
+		 */
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static const struct mm_walk_ops cold_walk_ops = {
+	.pmd_entry = madvise_cold_pte_range,
+};
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	tlb_start_vma(tlb, vma);
+	walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, NULL);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -493,7 +665,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -515,7 +687,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -669,6 +841,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -690,6 +864,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1d9496b4c43..71e3acea7817 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,7 +523,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			continue;
 
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 784dc1620620..38c3fa4308e2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,22 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +644,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +728,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);