author	Huang Ying <ying.huang@intel.com>	2017-07-06 18:37:18 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 19:24:31 -0400
commit	38d8b4e6bdc872f07a3149309ab01719c96f3894 (patch)
tree	a4bdf8e41a90f49465829b98a46645af64b0103d /mm/swapfile.c
parent	9d85e15f1d552653c989dbecf051d8eea5937be8 (diff)
mm, THP, swap: delay splitting THP during swap out
Patch series "THP swap: Delay splitting THP during swapping out", v11. This patchset is to optimize the performance of Transparent Huge Page (THP) swap. Recently, the performance of the storage devices improved so fast that we cannot saturate the disk bandwidth with single logical CPU when do page swap out even on a high-end server machine. Because the performance of the storage device improved faster than that of single logical CPU. And it seems that the trend will not change in the near future. On the other hand, the THP becomes more and more popular because of increased memory size. So it becomes necessary to optimize THP swap performance. The advantages of the THP swap support include: - Batch the swap operations for the THP to reduce lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help improve the performance of the THP swap. - The THP swap space read/write will be 2M sequential IO. It is particularly helpful for the swap read, which are usually 4k random IO. This will improve the performance of the THP swap too. - It will help the memory fragmentation, especially when the THP is heavily used by the applications. The 2M continuous pages will be free up after THP swapping out. - It will improve the THP utilization on the system with the swap turned on. Because the speed for khugepaged to collapse the normal pages into the THP is quite slow. After the THP is split during the swapping out, it will take quite long time for the normal pages to collapse back into the THP after being swapped in. The high THP utilization helps the efficiency of the page based memory management too. There are some concerns regarding THP swap in, mainly because possible enlarged read/write IO size (for swap in/out) may put more overhead on the storage device. To deal with that, the THP swap in should be turned on only when necessary. For example, it can be selected via "always/never/madvise" logic, to be turned on globally, turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc. This patchset is the first step for the THP swap support. The plan is to delay splitting THP step by step, finally avoid splitting THP during the THP swapping out and swap out/in the THP as a whole. As the first step, in this patchset, the splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP and adding the THP into the swap cache. This will reduce lock acquiring/releasing for the locks used for the swap cache management. With the patchset, the swap out throughput improves 15.5% (from about 3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. This patch (of 5): In this patch, splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP (Transparent Huge Page) and adding the THP into the swap cache. This will batch the corresponding operation, thus improve THP swap out throughput. This is the first step for the THP swap optimization. The plan is to delay splitting the THP step by step and avoid splitting the THP finally. 
In this patch, one swap cluster is used to hold the contents of each THP that is swapped out, so the swap cluster size is changed to the THP size on the x86_64 architecture (512 pages). Other architectures that want this THP swap optimization need to select ARCH_USES_THP_SWAP_CLUSTER in their Kconfig. In effect, this enlarges the swap cluster size by a factor of two on x86_64, which may make it harder to find a free cluster when the swap space becomes fragmented; in theory this may reduce continuous swap space allocation and sequential writes. The performance test in 0day shows no regressions caused by this. In the future, more information about the swapped-out THP (such as the compound map count) will be recorded in the swap_cluster_info data structure.

The memory cgroup swap accounting functions are enhanced to support charging or uncharging a swap cluster backing a THP as a whole.

Swap cluster allocate/free functions are added to allocate/free a swap cluster for a THP. A fairly simple algorithm is used for swap cluster allocation: only the first swap device in the priority list is tried. If that fails, the caller falls back to allocating a single swap slot instead. This works well enough for normal cases. If the number of free swap clusters differs significantly among multiple swap devices, some THPs may be split earlier than necessary, for example when there is a big size difference among multiple swap devices.

The swap cache functions are enhanced to support adding/deleting a THP to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be improved in the future with a multi-order radix tree, but because we still split the THP soon during swap-out, that optimization doesn't make much sense for this first step.

The THP splitting functions are enhanced to support splitting a THP in the swap cache during swap-out. The page lock is held while allocating the swap cluster, adding the THP into the swap cache, and splitting the THP, so in code paths other than swap-out, if the THP needs to be split, PageSwapCache(THP) will always be false.

The swap cluster is only available for SSDs, so the THP swap optimization in this patchset has no effect for HDDs.

[ying.huang@intel.com: fix two issues in THP optimize patch]
  Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
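For reference, the cluster-size arithmetic described above can be checked with a minimal userspace sketch (plain C, not kernel code). It assumes the x86_64 values quoted in this message, 4 KiB base pages and HPAGE_PMD_NR == 512; the swap offset used is purely illustrative.

/*
 * Minimal userspace sketch of the enlarged swap cluster arithmetic.
 * The constants mirror the x86_64 values from the commit message;
 * the example offset is hypothetical.
 */
#include <stdio.h>

#define BASE_PAGE_SIZE   4096UL          /* x86_64 base page size (assumed) */
#define HPAGE_PMD_NR     512UL           /* sub-pages per THP on x86_64 */
#define SWAPFILE_CLUSTER HPAGE_PMD_NR    /* cluster size with CONFIG_THP_SWAP */

int main(void)
{
	unsigned long offset = 1536;                    /* hypothetical swap slot offset */
	unsigned long idx = offset / SWAPFILE_CLUSTER;  /* owning cluster index */

	/* One cluster now covers exactly one THP worth of swap slots. */
	printf("cluster: %lu slots = %lu KiB\n",
	       SWAPFILE_CLUSTER, SWAPFILE_CLUSTER * BASE_PAGE_SIZE / 1024);
	/* Mapping a swap offset to its cluster and the cluster's first slot. */
	printf("offset %lu -> cluster %lu, first slot %lu\n",
	       offset, idx, idx * SWAPFILE_CLUSTER);
	return 0;
}

The output (512 slots, 2048 KiB) matches the 2M sequential IO size mentioned above; when a whole cluster cannot be allocated, the patch falls back to single-slot allocation as described.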
Diffstat (limited to 'mm/swapfile.c')
-rw-r--r--	mm/swapfile.c	259
1 file changed, 190 insertions(+), 69 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f6cba1b6632..984f0dd94948 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -199,7 +199,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
@@ -374,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -394,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 		spin_lock(&si->lock);
 		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
-		unlock_cluster(ci);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -415,6 +424,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -426,11 +463,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -454,21 +488,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-		    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -558,6 +579,60 @@ new_cluster:
 	return found_free;
 }
 
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+			     unsigned int nr_entries)
+{
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
+			    unsigned int nr_entries)
+{
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	else
+		swap_slot_free_notify = NULL;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[])
@@ -676,18 +751,7 @@ checks:
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	swap_range_alloc(si, offset, 1);
 	si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
@@ -766,6 +830,52 @@ no_page:
 	return n_ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+
+	idx = cluster_list_first(&si->free_clusters);
+	offset = idx * SWAPFILE_CLUSTER;
+	ci = lock_cluster(si, offset);
+	alloc_cluster(si, idx);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	unlock_cluster(ci);
+	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	*slot = swp_entry(si->type, offset);
+
+	return 1;
+}
+
+static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	unlock_cluster(ci);
+	swap_range_free(si, offset, SWAPFILE_CLUSTER);
+}
+#else
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	VM_WARN_ON_ONCE(1);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -781,13 +891,17 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
+int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
 {
+	unsigned long nr_pages = cluster ? SWAPFILE_CLUSTER : 1;
 	struct swap_info_struct *si, *next;
 	long avail_pgs;
 	int n_ret = 0;
 
-	avail_pgs = atomic_long_read(&nr_swap_pages);
+	/* Only single cluster request supported */
+	WARN_ON_ONCE(n_goal > 1 && cluster);
+
+	avail_pgs = atomic_long_read(&nr_swap_pages) / nr_pages;
 	if (avail_pgs <= 0)
 		goto noswap;
 
@@ -797,7 +911,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	if (n_goal > avail_pgs)
 		n_goal = avail_pgs;
 
-	atomic_long_sub(n_goal, &nr_swap_pages);
+	atomic_long_sub(n_goal * nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -823,10 +937,13 @@ start_over:
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries);
+		if (cluster)
+			n_ret = swap_alloc_cluster(si, swp_entries);
+		else
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret)
+		if (n_ret || cluster)
 			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 			si->type);
@@ -852,7 +969,8 @@ nextsi:
 
 check_out:
 	if (n_ret < n_goal)
-		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
+		atomic_long_add((long)(n_goal - n_ret) * nr_pages,
+				&nr_swap_pages);
 noswap:
 	return n_ret;
 }
@@ -1008,32 +1126,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	mem_cgroup_uncharge_swap(entry);
-	if (offset < p->lowest_bit)
-		p->lowest_bit = offset;
-	if (offset > p->highest_bit) {
-		bool was_full = !p->highest_bit;
-
-		p->highest_bit = offset;
-		if (was_full && (p->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			WARN_ON(!plist_node_empty(&p->avail_list));
-			if (plist_node_empty(&p->avail_list))
-				plist_add(&p->avail_list,
-					  &swap_avail_head);
-			spin_unlock(&swap_avail_lock);
-		}
-	}
-	atomic_long_inc(&nr_swap_pages);
-	p->inuse_pages--;
-	frontswap_invalidate_page(p->type, offset);
-	if (p->flags & SWP_BLKDEV) {
-		struct gendisk *disk = p->bdev->bd_disk;
-
-		if (disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev,
-							offset);
-	}
+	mem_cgroup_uncharge_swap(entry, 1);
+	swap_range_free(p, offset, 1);
 }
 
 /*
@@ -1065,6 +1159,33 @@ void swapcache_free(swp_entry_t entry)
 	}
 }
 
+#ifdef CONFIG_THP_SWAP
+void swapcache_free_cluster(swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned char *map;
+	unsigned int i;
+
+	si = swap_info_get(entry);
+	if (!si)
+		return;
+
+	ci = lock_cluster(si, offset);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] = 0;
+	}
+	unlock_cluster(ci);
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_cluster(si, idx);
+	spin_unlock(&si->lock);
+}
+#endif /* CONFIG_THP_SWAP */
+
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	struct swap_info_struct *p, *prev;