author	Huang Ying <ying.huang@intel.com>	2017-07-06 18:37:18 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 19:24:31 -0400
commit	38d8b4e6bdc872f07a3149309ab01719c96f3894 (patch)
tree	a4bdf8e41a90f49465829b98a46645af64b0103d /mm/swapfile.c
parent	9d85e15f1d552653c989dbecf051d8eea5937be8 (diff)
mm, THP, swap: delay splitting THP during swap out
Patch series "THP swap: Delay splitting THP during swapping out", v11. This patchset is to optimize the performance of Transparent Huge Page (THP) swap. Recently, the performance of the storage devices improved so fast that we cannot saturate the disk bandwidth with single logical CPU when do page swap out even on a high-end server machine. Because the performance of the storage device improved faster than that of single logical CPU. And it seems that the trend will not change in the near future. On the other hand, the THP becomes more and more popular because of increased memory size. So it becomes necessary to optimize THP swap performance. The advantages of the THP swap support include: - Batch the swap operations for the THP to reduce lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help improve the performance of the THP swap. - The THP swap space read/write will be 2M sequential IO. It is particularly helpful for the swap read, which are usually 4k random IO. This will improve the performance of the THP swap too. - It will help the memory fragmentation, especially when the THP is heavily used by the applications. The 2M continuous pages will be free up after THP swapping out. - It will improve the THP utilization on the system with the swap turned on. Because the speed for khugepaged to collapse the normal pages into the THP is quite slow. After the THP is split during the swapping out, it will take quite long time for the normal pages to collapse back into the THP after being swapped in. The high THP utilization helps the efficiency of the page based memory management too. There are some concerns regarding THP swap in, mainly because possible enlarged read/write IO size (for swap in/out) may put more overhead on the storage device. To deal with that, the THP swap in should be turned on only when necessary. For example, it can be selected via "always/never/madvise" logic, to be turned on globally, turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc. This patchset is the first step for the THP swap support. The plan is to delay splitting THP step by step, finally avoid splitting THP during the THP swapping out and swap out/in the THP as a whole. As the first step, in this patchset, the splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP and adding the THP into the swap cache. This will reduce lock acquiring/releasing for the locks used for the swap cache management. With the patchset, the swap out throughput improves 15.5% (from about 3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. This patch (of 5): In this patch, splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP (Transparent Huge Page) and adding the THP into the swap cache. This will batch the corresponding operation, thus improve THP swap out throughput. This is the first step for the THP swap optimization. The plan is to delay splitting the THP step by step and avoid splitting the THP finally. 
In this patch, one swap cluster is used to hold the contents of each THP that is swapped out, so the swap cluster size is changed to the THP size on the x86_64 architecture (512 pages). Other architectures that want this THP swap optimization need to select ARCH_USES_THP_SWAP_CLUSTER in their Kconfig. In effect, this enlarges the swap cluster size by a factor of two on x86_64, which may make it harder to find a free cluster when the swap space becomes fragmented; in theory this may reduce continuous swap space allocation and sequential writes. The performance test in 0day shows no regressions caused by this. In the future, more information about the swapped-out THP (such as the compound map count) will be recorded in the swap_cluster_info data structure.

The memory cgroup swap accounting functions are enhanced to support charging or uncharging a swap cluster backing a THP as a whole.

Swap cluster allocate/free functions are added to allocate/free a swap cluster for a THP. A fairly simple algorithm is used for swap cluster allocation: only the first swap device in the priority list is tried. If that fails, the caller falls back to allocating a single swap slot instead. This works well enough for normal cases. If the number of free swap clusters differs significantly among multiple swap devices, some THPs may be split earlier than necessary, for example when there is a big size difference among multiple swap devices.

The swap cache functions are enhanced to support adding/deleting a THP to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be improved in the future with a multi-order radix tree, but because we still split the THP soon during swap-out, that optimization doesn't make much sense for this first step.

The THP splitting functions are enhanced to support splitting a THP in the swap cache during swap-out. The page lock is held while allocating the swap cluster, adding the THP into the swap cache, and splitting the THP, so in code paths other than swap-out, if the THP needs to be split, PageSwapCache(THP) will always be false.

The swap cluster is only available for SSDs, so the THP swap optimization in this patchset has no effect for HDDs.

[ying.huang@intel.com: fix two issues in THP optimize patch]
  Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
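For reference, the cluster-size arithmetic described above can be checked with a minimal userspace sketch (plain C, not kernel code). It assumes the x86_64 values quoted in this message, 4 KiB base pages and HPAGE_PMD_NR == 512; the swap offset used is purely illustrative.

/*
 * Minimal userspace sketch of the enlarged swap cluster arithmetic.
 * The constants mirror the x86_64 values from the commit message;
 * the example offset is hypothetical.
 */
#include <stdio.h>

#define BASE_PAGE_SIZE   4096UL          /* x86_64 base page size (assumed) */
#define HPAGE_PMD_NR     512UL           /* sub-pages per THP on x86_64 */
#define SWAPFILE_CLUSTER HPAGE_PMD_NR    /* cluster size with CONFIG_THP_SWAP */

int main(void)
{
	unsigned long offset = 1536;                    /* hypothetical swap slot offset */
	unsigned long idx = offset / SWAPFILE_CLUSTER;  /* owning cluster index */

	/* One cluster now covers exactly one THP worth of swap slots. */
	printf("cluster: %lu slots = %lu KiB\n",
	       SWAPFILE_CLUSTER, SWAPFILE_CLUSTER * BASE_PAGE_SIZE / 1024);
	/* Mapping a swap offset to its cluster and the cluster's first slot. */
	printf("offset %lu -> cluster %lu, first slot %lu\n",
	       offset, idx, idx * SWAPFILE_CLUSTER);
	return 0;
}

The output (512 slots, 2048 KiB) matches the 2M sequential IO size mentioned above; when a whole cluster cannot be allocated, the patch falls back to single-slot allocation as described.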
Diffstat (limited to 'mm/swapfile.c')
-rw-r--r--	mm/swapfile.c	259
1 file changed, 190 insertions(+), 69 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f6cba1b6632..984f0dd94948 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -199,7 +199,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
@@ -374,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -394,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 		spin_lock(&si->lock);
 		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
-		unlock_cluster(ci);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -415,6 +424,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -426,11 +463,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -454,21 +488,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-		    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -558,6 +579,60 @@ new_cluster:
 	return found_free;
 }
 
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+			     unsigned int nr_entries)
+{
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
+			    unsigned int nr_entries)
+{
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	else
+		swap_slot_free_notify = NULL;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[])
@@ -676,18 +751,7 @@ checks:
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	swap_range_alloc(si, offset, 1);
 	si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
@@ -766,6 +830,52 @@ no_page:
 	return n_ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+
+	idx = cluster_list_first(&si->free_clusters);
+	offset = idx * SWAPFILE_CLUSTER;
+	ci = lock_cluster(si, offset);
+	alloc_cluster(si, idx);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	unlock_cluster(ci);
+	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	*slot = swp_entry(si->type, offset);
+
+	return 1;
+}
+
+static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	unlock_cluster(ci);
+	swap_range_free(si, offset, SWAPFILE_CLUSTER);
+}
+#else
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	VM_WARN_ON_ONCE(1);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -781,13 +891,17 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
+int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
 {
+	unsigned long nr_pages = cluster ? SWAPFILE_CLUSTER : 1;
 	struct swap_info_struct *si, *next;
 	long avail_pgs;
 	int n_ret = 0;
 
-	avail_pgs = atomic_long_read(&nr_swap_pages);
+	/* Only single cluster request supported */
+	WARN_ON_ONCE(n_goal > 1 && cluster);
+
+	avail_pgs = atomic_long_read(&nr_swap_pages) / nr_pages;
 	if (avail_pgs <= 0)
 		goto noswap;
 
@@ -797,7 +911,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	if (n_goal > avail_pgs)
 		n_goal = avail_pgs;
 
-	atomic_long_sub(n_goal, &nr_swap_pages);
+	atomic_long_sub(n_goal * nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -823,10 +937,13 @@ start_over:
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries);
+		if (cluster)
+			n_ret = swap_alloc_cluster(si, swp_entries);
+		else
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret)
+		if (n_ret || cluster)
 			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 			si->type);
@@ -852,7 +969,8 @@ nextsi:
 
 check_out:
 	if (n_ret < n_goal)
-		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
+		atomic_long_add((long)(n_goal - n_ret) * nr_pages,
+				&nr_swap_pages);
 noswap:
 	return n_ret;
 }
@@ -1008,32 +1126,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	mem_cgroup_uncharge_swap(entry);
-	if (offset < p->lowest_bit)
-		p->lowest_bit = offset;
-	if (offset > p->highest_bit) {
-		bool was_full = !p->highest_bit;
-
-		p->highest_bit = offset;
-		if (was_full && (p->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			WARN_ON(!plist_node_empty(&p->avail_list));
-			if (plist_node_empty(&p->avail_list))
-				plist_add(&p->avail_list,
-					  &swap_avail_head);
-			spin_unlock(&swap_avail_lock);
-		}
-	}
-	atomic_long_inc(&nr_swap_pages);
-	p->inuse_pages--;
-	frontswap_invalidate_page(p->type, offset);
-	if (p->flags & SWP_BLKDEV) {
-		struct gendisk *disk = p->bdev->bd_disk;
-
-		if (disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev,
-							offset);
-	}
+	mem_cgroup_uncharge_swap(entry, 1);
+	swap_range_free(p, offset, 1);
 }
 
 /*
@@ -1065,6 +1159,33 @@ void swapcache_free(swp_entry_t entry)
 	}
 }
 
+#ifdef CONFIG_THP_SWAP
+void swapcache_free_cluster(swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned char *map;
+	unsigned int i;
+
+	si = swap_info_get(entry);
+	if (!si)
+		return;
+
+	ci = lock_cluster(si, offset);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] = 0;
+	}
+	unlock_cluster(ci);
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_cluster(si, idx);
+	spin_unlock(&si->lock);
+}
+#endif /* CONFIG_THP_SWAP */
+
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	struct swap_info_struct *p, *prev;