author     Vineeth Remanan Pillai <vpillai@digitalocean.com>    2019-03-05 18:47:03 -0500
committer  Linus Torvalds <torvalds@linux-foundation.org>       2019-03-06 00:07:18 -0500
commit     b56a2d8af9147a4efe4011b60d93779c0461ca97
tree       1595323c4696df56df8018357c6c97a0aef14f7a    /mm/swapfile.c
parent     c5bf121e4350a933bd431385e6fcb72a898ecc68
mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and the commit message updated with the new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls (8 min, avg 80% cpu util).
Prototype.................about 9.0K calls (3 min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap entries
with find_next_to_unuse(), and remove them from the swap cache. If
find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed-in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.

Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
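To make the complexity change concrete, below is a minimal, self-contained C
sketch (illustration only, not kernel code and not part of this patch): a
single linear pass over a list of address spaces brings back every entry of
one swap type as the page tables are walked, instead of searching every
process once per swap entry. All names in the sketch (toy_mm, toy_unuse_mm,
toy_try_to_unuse, NPTES) are hypothetical.

    /*
     * Toy model of the linear-pass idea only; none of this is kernel code.
     */
    #include <stdio.h>

    #define NPTES 8

    struct toy_mm {
            int swap_type[NPTES];   /* -1 = resident page, >= 0 = swapped out */
            struct toy_mm *next;    /* stands in for the kernel's mmlist */
    };

    /* Walk one address space once, "swapping back in" every entry of @type. */
    static unsigned long toy_unuse_mm(struct toy_mm *mm, int type)
    {
            unsigned long unused = 0;
            int i;

            for (i = 0; i < NPTES; i++) {
                    if (mm->swap_type[i] == type) {
                            mm->swap_type[i] = -1;
                            unused++;
                    }
            }
            return unused;
    }

    /* Linear swapoff model: one pass over the mm list unuses the whole type. */
    static unsigned long toy_try_to_unuse(struct toy_mm *mmlist, int type)
    {
            struct toy_mm *mm;
            unsigned long total = 0;

            for (mm = mmlist; mm; mm = mm->next)
                    total += toy_unuse_mm(mm, type);
            return total;
    }

    int main(void)
    {
            struct toy_mm b = { { 1, -1, 0, 0, -1, 1, 0, -1 }, NULL };
            struct toy_mm a = { { 0, 0, -1, 1, 0, -1, 1, 0 }, &b };

            printf("unused %lu entries of type 0\n", toy_try_to_unuse(&a, 0));
            return 0;
    }

In the patch itself the per-mm walk is unuse_mm() -> unuse_vma() ->
unuse_p4d_range() -> unuse_pud_range() -> unuse_pmd_range() ->
unuse_pte_range(), and leftover entries are swept out of the swap cache
afterwards with find_next_to_unuse(); the sketch only models the single-pass
structure that replaces the quadratic search.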
Diffstat (limited to 'mm/swapfile.c')
-rw-r--r--    mm/swapfile.c    433
1 file changed, 163 insertions(+), 270 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dbac1d49469d..6de46984d59d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1799,44 +1799,77 @@ out_nolock:
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                        unsigned long addr, unsigned long end,
-                       swp_entry_t entry, struct page *page)
+                       unsigned int type, bool frontswap,
+                       unsigned long *fs_pages_to_unuse)
 {
-        pte_t swp_pte = swp_entry_to_pte(entry);
+        struct page *page;
+        swp_entry_t entry;
         pte_t *pte;
+        struct swap_info_struct *si;
+        unsigned long offset;
         int ret = 0;
+        volatile unsigned char *swap_map;
 
-        /*
-         * We don't actually need pte lock while scanning for swp_pte: since
-         * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
-         * page table while we're scanning; though it could get zapped, and on
-         * some architectures (e.g. x86_32 with PAE) we might catch a glimpse
-         * of unmatched parts which look like swp_pte, so unuse_pte must
-         * recheck under pte lock.  Scanning without pte lock lets it be
-         * preemptable whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
-         */
+        si = swap_info[type];
         pte = pte_offset_map(pmd, addr);
         do {
-                /*
-                 * swapoff spends a _lot_ of time in this loop!
-                 * Test inline before going to call unuse_pte.
-                 */
-                if (unlikely(pte_same_as_swp(*pte, swp_pte))) {
-                        pte_unmap(pte);
-                        ret = unuse_pte(vma, pmd, addr, entry, page);
-                        if (ret)
-                                goto out;
-                        pte = pte_offset_map(pmd, addr);
+                struct vm_fault vmf;
+
+                if (!is_swap_pte(*pte))
+                        continue;
+
+                entry = pte_to_swp_entry(*pte);
+                if (swp_type(entry) != type)
+                        continue;
+
+                offset = swp_offset(entry);
+                if (frontswap && !frontswap_test(si, offset))
+                        continue;
+
+                pte_unmap(pte);
+                swap_map = &si->swap_map[offset];
+                vmf.vma = vma;
+                vmf.address = addr;
+                vmf.pmd = pmd;
+                page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
+                if (!page) {
+                        if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
+                                goto try_next;
+                        return -ENOMEM;
+                }
+
+                lock_page(page);
+                wait_on_page_writeback(page);
+                ret = unuse_pte(vma, pmd, addr, entry, page);
+                if (ret < 0) {
+                        unlock_page(page);
+                        put_page(page);
+                        goto out;
+                }
+
+                try_to_free_swap(page);
+                unlock_page(page);
+                put_page(page);
+
+                if (*fs_pages_to_unuse && !--(*fs_pages_to_unuse)) {
+                        ret = FRONTSWAP_PAGES_UNUSED;
+                        goto out;
                 }
+try_next:
+                pte = pte_offset_map(pmd, addr);
         } while (pte++, addr += PAGE_SIZE, addr != end);
         pte_unmap(pte - 1);
+
+        ret = 0;
 out:
         return ret;
 }
 
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
                                 unsigned long addr, unsigned long end,
-                                swp_entry_t entry, struct page *page)
+                                unsigned int type, bool frontswap,
+                                unsigned long *fs_pages_to_unuse)
 {
         pmd_t *pmd;
         unsigned long next;
@@ -1848,7 +1881,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
                 next = pmd_addr_end(addr, end);
                 if (pmd_none_or_trans_huge_or_clear_bad(pmd))
                         continue;
-                ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
+                ret = unuse_pte_range(vma, pmd, addr, next, type,
+                                      frontswap, fs_pages_to_unuse);
                 if (ret)
                         return ret;
         } while (pmd++, addr = next, addr != end);
@@ -1857,7 +1891,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 
 static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
                                 unsigned long addr, unsigned long end,
-                                swp_entry_t entry, struct page *page)
+                                unsigned int type, bool frontswap,
+                                unsigned long *fs_pages_to_unuse)
 {
         pud_t *pud;
         unsigned long next;
@@ -1868,7 +1903,8 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
                 next = pud_addr_end(addr, end);
                 if (pud_none_or_clear_bad(pud))
                         continue;
-                ret = unuse_pmd_range(vma, pud, addr, next, entry, page);
+                ret = unuse_pmd_range(vma, pud, addr, next, type,
+                                      frontswap, fs_pages_to_unuse);
                 if (ret)
                         return ret;
         } while (pud++, addr = next, addr != end);
@@ -1877,7 +1913,8 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
                                 unsigned long addr, unsigned long end,
-                                swp_entry_t entry, struct page *page)
+                                unsigned int type, bool frontswap,
+                                unsigned long *fs_pages_to_unuse)
 {
         p4d_t *p4d;
         unsigned long next;
@@ -1888,78 +1925,66 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
                 next = p4d_addr_end(addr, end);
                 if (p4d_none_or_clear_bad(p4d))
                         continue;
-                ret = unuse_pud_range(vma, p4d, addr, next, entry, page);
+                ret = unuse_pud_range(vma, p4d, addr, next, type,
+                                      frontswap, fs_pages_to_unuse);
                 if (ret)
                         return ret;
         } while (p4d++, addr = next, addr != end);
         return 0;
 }
 
-static int unuse_vma(struct vm_area_struct *vma,
-                                swp_entry_t entry, struct page *page)
+static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
+                     bool frontswap, unsigned long *fs_pages_to_unuse)
 {
         pgd_t *pgd;
         unsigned long addr, end, next;
         int ret;
 
-        if (page_anon_vma(page)) {
-                addr = page_address_in_vma(page, vma);
-                if (addr == -EFAULT)
-                        return 0;
-                else
-                        end = addr + PAGE_SIZE;
-        } else {
-                addr = vma->vm_start;
-                end = vma->vm_end;
-        }
+        addr = vma->vm_start;
+        end = vma->vm_end;
 
         pgd = pgd_offset(vma->vm_mm, addr);
         do {
                 next = pgd_addr_end(addr, end);
                 if (pgd_none_or_clear_bad(pgd))
                         continue;
-                ret = unuse_p4d_range(vma, pgd, addr, next, entry, page);
+                ret = unuse_p4d_range(vma, pgd, addr, next, type,
+                                      frontswap, fs_pages_to_unuse);
                 if (ret)
                         return ret;
         } while (pgd++, addr = next, addr != end);
         return 0;
 }
 
-static int unuse_mm(struct mm_struct *mm,
-                                swp_entry_t entry, struct page *page)
+static int unuse_mm(struct mm_struct *mm, unsigned int type,
+                    bool frontswap, unsigned long *fs_pages_to_unuse)
 {
         struct vm_area_struct *vma;
         int ret = 0;
 
-        if (!down_read_trylock(&mm->mmap_sem)) {
-                /*
-                 * Activate page so shrink_inactive_list is unlikely to unmap
-                 * its ptes while lock is dropped, so swapoff can make progress.
-                 */
-                activate_page(page);
-                unlock_page(page);
-                down_read(&mm->mmap_sem);
-                lock_page(page);
-        }
+        down_read(&mm->mmap_sem);
         for (vma = mm->mmap; vma; vma = vma->vm_next) {
-                if (vma->anon_vma && (ret = unuse_vma(vma, entry, page)))
-                        break;
+                if (vma->anon_vma) {
+                        ret = unuse_vma(vma, type, frontswap,
+                                        fs_pages_to_unuse);
+                        if (ret)
+                                break;
+                }
                 cond_resched();
         }
         up_read(&mm->mmap_sem);
-        return (ret < 0)? ret: 0;
+        return ret;
 }
 
 /*
  * Scan swap_map (or frontswap_map if frontswap parameter is true)
- * from current position to next entry still in use.
- * Recycle to start on reaching the end, returning 0 when empty.
+ * from current position to next entry still in use. Return 0
+ * if there are no inuse entries after prev till end of the map.
  */
 static unsigned int find_next_to_unuse(struct swap_info_struct *si,
                                        unsigned int prev, bool frontswap)
 {
-        unsigned int max = si->max;
-        unsigned int i = prev;
+        unsigned int i;
         unsigned char count;
 
         /*
@@ -1968,20 +1993,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
          * hits are okay, and sys_swapoff() has already prevented new
          * allocations from this area (while holding swap_lock).
          */
-        for (;;) {
-                if (++i >= max) {
-                        if (!prev) {
-                                i = 0;
-                                break;
-                        }
-                        /*
-                         * No entries in use at top of swap_map,
-                         * loop back to start and recheck there.
-                         */
-                        max = prev + 1;
-                        prev = 0;
-                        i = 1;
-                }
+        for (i = prev + 1; i < si->max; i++) {
                 count = READ_ONCE(si->swap_map[i]);
                 if (count && swap_count(count) != SWAP_MAP_BAD)
                         if (!frontswap || frontswap_test(si, i))
@@ -1989,240 +2001,121 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
                 if ((i % LATENCY_LIMIT) == 0)
                         cond_resched();
         }
+
+        if (i == si->max)
+                i = 0;
+
         return i;
 }
 
 /*
- * We completely avoid races by reading each swap page in advance,
- * and then search for the process using it.  All the necessary
- * page table adjustments can then be made atomically.
- *
- * if the boolean frontswap is true, only unuse pages_to_unuse pages;
+ * If the boolean frontswap is true, only unuse pages_to_unuse pages;
  * pages_to_unuse==0 means all pages; ignored if frontswap is false
  */
+#define SWAP_UNUSE_MAX_TRIES 3
 int try_to_unuse(unsigned int type, bool frontswap,
                  unsigned long pages_to_unuse)
 {
+        struct mm_struct *prev_mm;
+        struct mm_struct *mm;
+        struct list_head *p;
+        int retval = 0;
         struct swap_info_struct *si = swap_info[type];
-        struct mm_struct *start_mm;
-        volatile unsigned char *swap_map; /* swap_map is accessed without
-                                           * locking. Mark it as volatile
-                                           * to prevent compiler doing
-                                           * something odd.
-                                           */
-        unsigned char swcount;
         struct page *page;
         swp_entry_t entry;
-        unsigned int i = 0;
-        int retval = 0;
+        unsigned int i;
+        int retries = 0;
 
-        /*
-         * When searching mms for an entry, a good strategy is to
-         * start at the first mm we freed the previous entry from
-         * (though actually we don't notice whether we or coincidence
-         * freed the entry).  Initialize this start_mm with a hold.
-         *
-         * A simpler strategy would be to start at the last mm we
-         * freed the previous entry from; but that would take less
-         * advantage of mmlist ordering, which clusters forked mms
-         * together, child after parent.  If we race with dup_mmap(), we
-         * prefer to resolve parent before child, lest we miss entries
-         * duplicated after we scanned child: using last mm would invert
-         * that.
-         */
-        start_mm = &init_mm;
-        mmget(&init_mm);
+        if (!si->inuse_pages)
+                return 0;
 
-        /*
-         * Keep on scanning until all entries have gone.  Usually,
-         * one pass through swap_map is enough, but not necessarily:
-         * there are races when an instance of an entry might be missed.
-         */
-        while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
+        if (!frontswap)
+                pages_to_unuse = 0;
+
+retry:
+        retval = shmem_unuse(type, frontswap, &pages_to_unuse);
+        if (retval)
+                goto out;
+
+        prev_mm = &init_mm;
+        mmget(prev_mm);
+
+        spin_lock(&mmlist_lock);
+        p = &init_mm.mmlist;
+        while ((p = p->next) != &init_mm.mmlist) {
                 if (signal_pending(current)) {
                         retval = -EINTR;
                         break;
                 }
 
-                /*
-                 * Get a page for the entry, using the existing swap
-                 * cache page if there is one.  Otherwise, get a clean
-                 * page and read the swap into it.
-                 */
-                swap_map = &si->swap_map[i];
-                entry = swp_entry(type, i);
-                page = read_swap_cache_async(entry,
-                                        GFP_HIGHUSER_MOVABLE, NULL, 0, false);
-                if (!page) {
-                        /*
-                         * Either swap_duplicate() failed because entry
-                         * has been freed independently, and will not be
-                         * reused since sys_swapoff() already disabled
-                         * allocation from here, or alloc_page() failed.
-                         */
-                        swcount = *swap_map;
-                        /*
-                         * We don't hold lock here, so the swap entry could be
-                         * SWAP_MAP_BAD (when the cluster is discarding).
-                         * Instead of fail out, We can just skip the swap
-                         * entry because swapoff will wait for discarding
-                         * finish anyway.
-                         */
-                        if (!swcount || swcount == SWAP_MAP_BAD)
-                                continue;
-                        retval = -ENOMEM;
-                        break;
-                }
+                mm = list_entry(p, struct mm_struct, mmlist);
+                if (!mmget_not_zero(mm))
+                        continue;
+                spin_unlock(&mmlist_lock);
+                mmput(prev_mm);
+                prev_mm = mm;
+                retval = unuse_mm(mm, type, frontswap, &pages_to_unuse);
 
-                /*
-                 * Don't hold on to start_mm if it looks like exiting.
-                 */
-                if (atomic_read(&start_mm->mm_users) == 1) {
-                        mmput(start_mm);
-                        start_mm = &init_mm;
-                        mmget(&init_mm);
+                if (retval) {
+                        mmput(prev_mm);
+                        goto out;
                 }
 
                 /*
-                 * Wait for and lock page.  When do_swap_page races with
-                 * try_to_unuse, do_swap_page can handle the fault much
-                 * faster than try_to_unuse can locate the entry.  This
-                 * apparently redundant "wait_on_page_locked" lets try_to_unuse
-                 * defer to do_swap_page in such a case - in some tests,
-                 * do_swap_page and try_to_unuse repeatedly compete.
-                 */
-                wait_on_page_locked(page);
-                wait_on_page_writeback(page);
-                lock_page(page);
-                wait_on_page_writeback(page);
-
-                /*
-                 * Remove all references to entry.
+                 * Make sure that we aren't completely killing
+                 * interactive performance.
                  */
-                swcount = *swap_map;
-                if (swap_count(swcount) == SWAP_MAP_SHMEM) {
-                        retval = shmem_unuse(entry, page);
-                        /* page has already been unlocked and released */
-                        if (retval < 0)
-                                break;
-                        continue;
-                }
-                if (swap_count(swcount) && start_mm != &init_mm)
-                        retval = unuse_mm(start_mm, entry, page);
-
-                if (swap_count(*swap_map)) {
-                        int set_start_mm = (*swap_map >= swcount);
-                        struct list_head *p = &start_mm->mmlist;
-                        struct mm_struct *new_start_mm = start_mm;
-                        struct mm_struct *prev_mm = start_mm;
-                        struct mm_struct *mm;
-
-                        mmget(new_start_mm);
-                        mmget(prev_mm);
-                        spin_lock(&mmlist_lock);
-                        while (swap_count(*swap_map) && !retval &&
-                                        (p = p->next) != &start_mm->mmlist) {
-                                mm = list_entry(p, struct mm_struct, mmlist);
-                                if (!mmget_not_zero(mm))
-                                        continue;
-                                spin_unlock(&mmlist_lock);
-                                mmput(prev_mm);
-                                prev_mm = mm;
+                cond_resched();
+                spin_lock(&mmlist_lock);
+        }
+        spin_unlock(&mmlist_lock);
 
-                                cond_resched();
+        mmput(prev_mm);
 
-                                swcount = *swap_map;
-                                if (!swap_count(swcount)) /* any usage ? */
-                                        ;
-                                else if (mm == &init_mm)
-                                        set_start_mm = 1;
-                                else
-                                        retval = unuse_mm(mm, entry, page);
-
-                                if (set_start_mm && *swap_map < swcount) {
-                                        mmput(new_start_mm);
-                                        mmget(mm);
-                                        new_start_mm = mm;
-                                        set_start_mm = 0;
-                                }
-                                spin_lock(&mmlist_lock);
-                        }
-                        spin_unlock(&mmlist_lock);
-                        mmput(prev_mm);
-                        mmput(start_mm);
-                        start_mm = new_start_mm;
-                }
-                if (retval) {
-                        unlock_page(page);
-                        put_page(page);
-                        break;
-                }
+        i = 0;
+        while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
 
-                /*
-                 * If a reference remains (rare), we would like to leave
-                 * the page in the swap cache; but try_to_unmap could
-                 * then re-duplicate the entry once we drop page lock,
-                 * so we might loop indefinitely; also, that page could
-                 * not be swapped out to other storage meanwhile.  So:
-                 * delete from cache even if there's another reference,
-                 * after ensuring that the data has been saved to disk -
-                 * since if the reference remains (rarer), it will be
-                 * read from disk into another page.  Splitting into two
-                 * pages would be incorrect if swap supported "shared
-                 * private" pages, but they are handled by tmpfs files.
-                 *
-                 * Given how unuse_vma() targets one particular offset
-                 * in an anon_vma, once the anon_vma has been determined,
-                 * this splitting happens to be just what is needed to
-                 * handle where KSM pages have been swapped out: re-reading
-                 * is unnecessarily slow, but we can fix that later on.
-                 */
-                if (swap_count(*swap_map) &&
-                     PageDirty(page) && PageSwapCache(page)) {
-                        struct writeback_control wbc = {
-                                .sync_mode = WB_SYNC_NONE,
-                        };
-
-                        swap_writepage(compound_head(page), &wbc);
-                        lock_page(page);
-                        wait_on_page_writeback(page);
-                }
+                entry = swp_entry(type, i);
+                page = find_get_page(swap_address_space(entry), i);
+                if (!page)
+                        continue;
 
                 /*
                  * It is conceivable that a racing task removed this page from
-                 * swap cache just before we acquired the page lock at the top,
-                 * or while we dropped it in unuse_mm().  The page might even
-                 * be back in swap cache on another swap area: that we must not
-                 * delete, since it may not have been written out to swap yet.
-                 */
-                if (PageSwapCache(page) &&
-                    likely(page_private(page) == entry.val) &&
-                    (!PageTransCompound(page) ||
-                     !swap_page_trans_huge_swapped(si, entry)))
-                        delete_from_swap_cache(compound_head(page));
-
-                /*
-                 * So we could skip searching mms once swap count went
-                 * to 1, we did not mark any present ptes as dirty: must
-                 * mark page dirty so shrink_page_list will preserve it.
+                 * swap cache just before we acquired the page lock. The page
+                 * might even be back in swap cache on another swap area. But
+                 * that is okay, try_to_free_swap() only removes stale pages.
                  */
-                SetPageDirty(page);
+                lock_page(page);
+                wait_on_page_writeback(page);
+                try_to_free_swap(page);
                 unlock_page(page);
                 put_page(page);
 
                 /*
-                 * Make sure that we aren't completely killing
-                 * interactive performance.
+                 * For frontswap, we just need to unuse pages_to_unuse, if
+                 * it was specified. Need not check frontswap again here as
+                 * we already zeroed out pages_to_unuse if not frontswap.
                  */
-                cond_resched();
-                if (frontswap && pages_to_unuse > 0) {
-                        if (!--pages_to_unuse)
-                                break;
-                }
+                if (pages_to_unuse && --pages_to_unuse == 0)
+                        goto out;
         }
 
-        mmput(start_mm);
-        return retval;
+        /*
+         * Lets check again to see if there are still swap entries in the map.
+         * If yes, we would need to do retry the unuse logic again.
+         * Under global memory pressure, swap entries can be reinserted back
+         * into process space after the mmlist loop above passes over them.
+         * Its not worth continuosuly retrying to unuse the swap in this case.
+         * So we try SWAP_UNUSE_MAX_TRIES times.
+         */
+        if (++retries >= SWAP_UNUSE_MAX_TRIES)
+                retval = -EBUSY;
+        else if (si->inuse_pages)
+                goto retry;
+
+out:
+        return (retval == FRONTSWAP_PAGES_UNUSED) ? 0 : retval;
 }
 
 /*