author     Huang Ying <ying.huang@intel.com>              2017-09-06 19:24:36 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org> 2017-09-06 20:27:29 -0400
commit     ec560175c0b6fce86994bdf036754d48122c5c87 (patch)
tree       7aacd0beae098c785452a8a8361e13e7ffe2bc73 /mm/memory.c
parent     c4fa63092f216737b60c789968371d9960a598e5 (diff)
mm, swap: VMA based swap readahead
Swap readahead is an important mechanism for reducing swap-in latency. Although a purely sequential memory access pattern isn't very common for anonymous memory, spatial locality is still considered valid.

In the original swap readahead implementation, consecutive blocks in the swap device are read ahead based on a global spatial locality estimation. But consecutive blocks in the swap device merely reflect the order of page reclaim and don't necessarily reflect the access pattern in virtual memory. Moreover, different tasks in the system may have different access patterns, which makes the global spatial locality estimation inaccurate.

In this patch, when a page fault occurs, the virtual pages near the fault address are read ahead instead of the swap slots near the faulting swap slot in the swap device. This avoids reading ahead unrelated swap slots. At the same time, swap readahead is changed to work per-VMA instead of globally, so that the different access patterns of different VMAs can be distinguished and different readahead policies applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effective algorithm for detecting spatial locality.

The tests and results are as follows.

Common test conditions
======================
Test machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device:  NVMe disk

Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case: 4 processes eat 50G of virtual memory space, repeating the sequential memory writes for 300 seconds. The first round of writes triggers swap-out; the following rounds trigger sequential swap-in and swap-out. At the same time, the vm-scalability random swap test case runs in the background: 8 processes eat 30G of virtual memory space, repeating random memory writes for 300 seconds. This triggers random swap-in in the background.

This is a combined workload with sequential and random memory accesses at the same time. The results (for the sequential workload) are as follows:

                        Base            Optimized
                        ----            ---------
throughput              345413 KB/s     414029 KB/s (+19.9%)
latency.average         97.14 us        61.06 us (-37.1%)
latency.50th            2 us            1 us
latency.60th            2 us            1 us
latency.70th            98 us           2 us
latency.80th            160 us          2 us
latency.90th            260 us          217 us
latency.95th            346 us          369 us
latency.99th            1.34 ms         1.09 ms
ra_hit%                 52.69%          99.98%

The original swap readahead algorithm is confused by the background random access workload, so its readahead hit rate is much lower. The VMA-based readahead algorithm works much better.

Linpack
=======
The test memory size is bigger than RAM, to trigger swapping.

                        Base            Optimized
                        ----            ---------
elapsed_time            393.49 s        329.88 s (-16.2%)
ra_hit%                 86.21%          98.82%

The Linpack score shows no visible change between the base and optimized kernels, but the elapsed time is reduced and the readahead hit rate improves, so the optimized kernel handles the startup and teardown stages better. And the absolute readahead hit rate is high, which shows that spatial locality is still valid in some practical workloads.
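As an illustration (not part of this patch, and not kernel code), the small userspace sketch below models why readahead by consecutive swap slots mispredicts once the reclaim order differs from the virtual access order. The page-table array, window size, and fault address are all made up for the example:

  /*
   * Illustrative userspace sketch only, not kernel code.  The "page
   * table" is a plain array mapping virtual page numbers to swap slots,
   * filled in a shuffled order to mimic reclaim order differing from
   * access order.  Window size and fault address are arbitrary.
   */
  #include <stdio.h>

  #define NR_PAGES   16
  #define RA_WINDOW   4

  /* virtual page number -> swap slot (pages were reclaimed in a shuffled order) */
  static const int swap_slot_of[NR_PAGES] = {
           5, 12,  3,  9,  0, 14,  7,  1,
          11,  4, 15,  2,  8, 13,  6, 10,
  };

  int main(void)
  {
          int fault_vpn  = 6;                     /* faulting virtual page */
          int fault_slot = swap_slot_of[fault_vpn];

          printf("fault: vpn=%d slot=%d\n", fault_vpn, fault_slot);

          /* original behaviour: read the slots next to the faulting slot */
          printf("slot-based readahead reads slots:");
          for (int s = fault_slot; s < fault_slot + RA_WINDOW; s++)
                  printf(" %d", s);
          printf("\n");

          /*
           * VMA-based behaviour: read the slots that back the
           * neighbouring virtual pages around the fault address.
           */
          printf("VMA-based readahead reads slots:");
          for (int v = fault_vpn; v < fault_vpn + RA_WINDOW && v < NR_PAGES; v++)
                  printf(" %d", swap_slot_of[v]);
          printf("\n");

          return 0;
  }

In this example the slot-based window reads slots 7, 8, 9 and 10, which back unrelated virtual pages, while the VMA-based window reads slots 7, 1, 11 and 4, i.e. exactly the pages adjacent to the fault in the virtual address space.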
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm/memory.c')
-rw-r--r--   mm/memory.c   23
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3dd8bb46391b..e87953775e3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2752,16 +2752,23 @@ EXPORT_SYMBOL(unmap_mapping_range);
 int do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct page *page, *swapcache;
+	struct page *page = NULL, *swapcache;
 	struct mem_cgroup *memcg;
+	struct vma_swap_readahead swap_ra;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
+	bool vma_readahead = swap_use_vma_readahead();
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	if (vma_readahead)
+		page = swap_readahead_detect(vmf, &swap_ra);
+	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
+		if (page)
+			put_page(page);
 		goto out;
+	}
 
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
@@ -2777,10 +2784,16 @@ int do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
-	page = lookup_swap_cache(entry);
+	if (!page)
+		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
+					 vmf->address);
 	if (!page) {
-		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
-				vmf->address);
+		if (vma_readahead)
+			page = do_swap_page_readahead(entry,
+				GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
+		else
+			page = swapin_readahead(entry,
+				GFP_HIGHUSER_MOVABLE, vma, vmf->address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte