author		Christoph Lameter <clameter@sgi.com>	2006-02-01 06:05:38 -0500
committer	Linus Torvalds <torvalds@g5.osdl.org>	2006-02-01 11:53:16 -0500
commit		a48d07afdf18212de22b959715b16793c5a6e57a (patch)
tree		36d5963c29ceb5c2f6df53036cef5c0d30383dbf
parent		b16664e44c54525be89dc07ad15a13b4eeec5634 (diff)
[PATCH] Direct Migration V9: migrate_pages() extension
Add direct migration support with fall back to swap.

Direct migration support on top of the swap based page migration facility.

This allows the direct migration of anonymous pages and the migration of
file backed pages by dropping the associated buffers (requires writeout).

Fall back to swap out if necessary.

The patch is based on lots of patches from the hotplug project but the code
was restructured, documented and simplified as much as possible.

Note that an additional patch that defines the migrate_page() method for
filesystems is necessary in order to avoid writeback for anonymous and file
backed pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-rw-r--r--	Documentation/vm/page_migration	| 129
-rw-r--r--	include/linux/rmap.h		|   4
-rw-r--r--	include/linux/swap.h		|   2
-rw-r--r--	mm/rmap.c			|  21
-rw-r--r--	mm/vmscan.c			| 226
5 files changed, 360 insertions(+), 22 deletions(-)
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
new file mode 100644
index 000000000000..c52820fcf500
--- /dev/null
+++ b/Documentation/vm/page_migration
@@ -0,0 +1,129 @@
Page migration
--------------

Page migration allows moving the physical location of pages between
nodes in a NUMA system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.

The main intent of page migration is to reduce the latency of memory access
by moving pages near to the processor where the process accessing that memory
is running.

Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy. The pages of a process can also be relocated
from another process using the sys_migrate_pages() function call. The
migrate_pages() system call takes two sets of nodes and moves the pages of a
process that are located on the from nodes to the destination nodes.

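As an illustration, here is a minimal userspace sketch of moving another
process's pages from node 0 to node 1 through sys_migrate_pages(). The
syscall has no dedicated glibc wrapper, so syscall(2) is used directly; the
node numbers and the single-word nodemasks are assumptions for the example,
and a kernel built with CONFIG_MIGRATION is required (libnuma's
numa_migrate_pages() wraps the same call):

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		/* PID to operate on; defaults to the current process */
		pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();

		/* One-word nodemasks: move everything from node 0 to node 1 */
		unsigned long old_nodes = 1UL << 0;
		unsigned long new_nodes = 1UL << 1;
		unsigned long maxnode = sizeof(old_nodes) * 8;

		/* On success, returns the number of pages that could
		   not be moved */
		long ret = syscall(SYS_migrate_pages, pid, maxnode,
				   &old_nodes, &new_nodes);
		if (ret < 0)
			perror("migrate_pages");
		else
			printf("%ld pages could not be moved\n", ret);
		return ret < 0 ? 1 : 0;
	}
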
Manual migration is very useful if, for example, the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
nearer to the new processor. At some point in the future we may have
some mechanism in the scheduler that will automatically move the pages.

Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset. This allows automatic
control over the locality of a process. If a task is moved to a new cpuset
then all its pages are moved with it so that the performance of the
process does not sink dramatically (as is the case today).

All migration techniques preserve the relative location of pages within a
group of nodes, so that the memory allocation pattern a process has generated
survives migration. This is necessary in order to preserve the memory
latencies: processes will run with similar performance after migration.

Page migration occurs in several steps. First comes a high level
description for those trying to use migrate_pages(), and then
a low level description of how the details work.

A. Use of migrate_pages()
-------------------------

1. Remove pages from the LRU.

   Lists of pages to be migrated are generated by scanning over
   pages and moving them into lists. This is done by
   calling isolate_lru_page() or __isolate_lru_page().
   Calling isolate_lru_page() increases the reference count of the page
   so that it cannot vanish under us.

2. Generate a list of newly allocated pages to move the contents
   of the first list to.

3. The migrate_pages() function is called, which attempts
   to do the migration. It returns the moved pages in the
   list specified as the third parameter and the failed
   migrations in the fourth parameter. The first parameter
   will contain the pages that could still be retried.

4. The leftover pages of various types are returned
   to the LRU using putback_lru_pages() or otherwise
   disposed of. The pages will still have the refcount as
   increased by isolate_lru_page()! A condensed sketch of the
   whole sequence follows this list.

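Putting the four steps together, a condensed kernel-side sketch (this is
not a verbatim caller from the patch series; the target_node choice, the
scan that finds candidate pages, and most error handling are omitted or
assumed):

	LIST_HEAD(pagelist);	/* pages isolated in step 1 */
	LIST_HEAD(newlist);	/* target pages allocated in step 2 */
	LIST_HEAD(moved);
	LIST_HEAD(failed);
	struct page *newpage;
	int nr_left;

	/* 1. Take each candidate page off the LRU; this also takes
	 *    a reference on the page. */
	if (isolate_lru_page(page))
		list_add_tail(&page->lru, &pagelist);

	/* 2. Allocate one new page per candidate on the target node. */
	newpage = alloc_pages_node(target_node, GFP_HIGHUSER, 0);
	if (newpage)
		list_add_tail(&newpage->lru, &newlist);

	/* 3. Migrate; pages that may still be retried remain on pagelist. */
	nr_left = migrate_pages(&pagelist, &newlist, &moved, &failed);

	/* 4. Put the survivors back on the LRU, dropping the references
	 *    taken in step 1. */
	putback_lru_pages(&pagelist);
	putback_lru_pages(&failed);
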
B. Operation of migrate_pages()
--------------------------------

migrate_pages() does several passes over its list of pages. A page is moved
if all references to it are removable at the time. The critical section,
steps 6 through 13, is sketched after the step list.

Steps:

1. Lock the page to be migrated.

2. Ensure that writeback is complete.

3. Make sure that the page has an assigned swap cache entry if
   it is an anonymous page. The swap cache reference is necessary
   to preserve the information contained in the page table maps.

4. Prep the new page that we want to move to. It is locked
   and set to not being uptodate so that all accesses to the new
   page immediately block while we are moving references.

5. All the page table references to the page are either dropped (file
   backed pages) or converted to swap references (anonymous pages). This
   should decrease the reference count.

6. The radix tree lock is taken.

7. The refcount of the page is examined and we back out if references
   remain; otherwise we know that we are the only one referencing this page.

8. The radix tree is checked and if it does not contain the pointer to this
   page then we back out.

9. The mapping is checked. If the mapping is gone then a truncate action may
   be in progress and we back out.

10. The new page is prepped with some settings from the old page so that
    accesses to the new page will be discovered to have the correct settings.

11. The radix tree is changed to point to the new page.

12. The reference count of the old page is dropped because the radix tree
    reference to it has been removed.

13. The radix tree lock is dropped.

14. The page contents are copied to the new page.

15. The remaining page flags are copied to the new page.

16. The old page flags are cleared to indicate that the page no longer
    carries any information.

17. Queued up writeback on the new page is triggered.

18. If swap ptes were generated for the page then they are removed again.

19. The locks are dropped from the old and new page.

20. The new page is moved to the LRU.

Christoph Lameter, December 19, 2005.

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9d6fbeef2104..0f1ea2d6ed86 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked);
-int try_to_unmap(struct page *);
+int try_to_unmap(struct page *, int ignore_refs);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
@@ -111,7 +111,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 #define anon_vma_link(vma)	do {} while (0)
 
 #define page_referenced(page,l) TestClearPageReferenced(page)
-#define try_to_unmap(page)	SWAP_FAIL
+#define try_to_unmap(page, refs) SWAP_FAIL
 
 #endif /* CONFIG_MMU */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e53fef7051e6..d359fc022433 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -191,6 +191,8 @@ static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p);
 extern int putback_lru_pages(struct list_head *l);
+extern int migrate_page(struct page *, struct page *);
+extern void migrate_page_copy(struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, struct list_head *t,
 		struct list_head *moved, struct list_head *failed);
 #else
diff --git a/mm/rmap.c b/mm/rmap.c
index d85a99d28c03..13fad5fcdf79 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -52,6 +52,7 @@
 #include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 
@@ -541,7 +542,8 @@ void page_remove_rmap(struct page *page)
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+				int ignore_refs)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -564,7 +566,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if ((vma->vm_flags & VM_LOCKED) ||
-			ptep_clear_flush_young(vma, address, pte)) {
+			(ptep_clear_flush_young(vma, address, pte)
+				&& !ignore_refs)) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -698,7 +701,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page)
+static int try_to_unmap_anon(struct page *page, int ignore_refs)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
@@ -709,7 +712,7 @@ static int try_to_unmap_anon(struct page *page)
 		return ret;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, ignore_refs);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			break;
 	}
@@ -726,7 +729,7 @@ static int try_to_unmap_anon(struct page *page)
  *
  * This function is only called from try_to_unmap for object-based pages.
  */
-static int try_to_unmap_file(struct page *page)
+static int try_to_unmap_file(struct page *page, int ignore_refs)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -740,7 +743,7 @@ static int try_to_unmap_file(struct page *page)
 
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma);
+		ret = try_to_unmap_one(page, vma, ignore_refs);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			goto out;
 	}
@@ -825,16 +828,16 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  */
-int try_to_unmap(struct page *page)
+int try_to_unmap(struct page *page, int ignore_refs)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page);
+		ret = try_to_unmap_anon(page, ignore_refs);
 	else
-		ret = try_to_unmap_file(page);
+		ret = try_to_unmap_file(page, ignore_refs);
 
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa4b80dbe3ad..8f326ce2b690 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -483,7 +483,7 @@ static int shrink_list(struct list_head *page_list, struct scan_control *sc)
 		if (!sc->may_swap)
 			goto keep_locked;
 
-		switch (try_to_unmap(page)) {
+		switch (try_to_unmap(page, 0)) {
 		case SWAP_FAIL:
 			goto activate_locked;
 		case SWAP_AGAIN:
@@ -623,7 +623,7 @@ static int swap_page(struct page *page)
 	struct address_space *mapping = page_mapping(page);
 
 	if (page_mapped(page) && mapping)
-		if (try_to_unmap(page) != SWAP_SUCCESS)
+		if (try_to_unmap(page, 0) != SWAP_SUCCESS)
 			goto unlock_retry;
 
 	if (PageDirty(page)) {
@@ -659,6 +659,154 @@ unlock_retry:
 retry:
 	return -EAGAIN;
 }
+
+/*
+ * Page migration was first developed in the context of the memory hotplug
+ * project. The main authors of the migration code are:
+ *
+ * IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
+ * Hirokazu Takahashi <taka@valinux.co.jp>
+ * Dave Hansen <haveblue@us.ibm.com>
+ * Christoph Lameter <clameter@sgi.com>
+ */
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+static int migrate_page_remove_references(struct page *newpage,
+				struct page *page, int nr_refs)
+{
+	struct address_space *mapping = page_mapping(page);
+	struct page **radix_pointer;
+
+	/*
+	 * Avoid doing any of the following work if the page count
+	 * indicates that the page is in use or truncate has removed
+	 * the page.
+	 */
+	if (!mapping || page_mapcount(page) + nr_refs != page_count(page))
+		return 1;
+
+	/*
+	 * Establish swap ptes for anonymous pages or destroy pte
+	 * maps for files.
+	 *
+	 * In order to reestablish file backed mappings the fault handlers
+	 * will take the radix tree_lock which may then be used to stop
+	 * processes from accessing this page until the new page is ready.
+	 *
+	 * A process accessing via a swap pte (an anonymous page) will take a
+	 * page_lock on the old page which will block the process until the
+	 * migration attempt is complete. At that time the PageSwapCache bit
+	 * will be examined. If the page was migrated then the PageSwapCache
+	 * bit will be clear and the operation to retrieve the page will be
+	 * retried which will find the new page in the radix tree. Then a new
+	 * direct mapping may be generated based on the radix tree contents.
+	 *
+	 * If the page was not migrated then the PageSwapCache bit
+	 * is still set and the operation may continue.
+	 */
+	try_to_unmap(page, 1);
+
+	/*
+	 * Give up if we were unable to remove all mappings.
+	 */
+	if (page_mapcount(page))
+		return 1;
+
+	write_lock_irq(&mapping->tree_lock);
+
+	radix_pointer = (struct page **)radix_tree_lookup_slot(
+						&mapping->page_tree,
+						page_index(page));
+
+	if (!page_mapping(page) || page_count(page) != nr_refs ||
+			*radix_pointer != page) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 1;
+	}
+
+	/*
+	 * Now we know that no one else is looking at the page.
+	 *
+	 * Certain minimal information about a page must be available
+	 * in order for other subsystems to properly handle the page if they
+	 * find it through the radix tree update before we are finished
+	 * copying the page.
+	 */
+	get_page(newpage);
+	newpage->index = page->index;
+	newpage->mapping = page->mapping;
+	if (PageSwapCache(page)) {
+		SetPageSwapCache(newpage);
+		set_page_private(newpage, page_private(page));
+	}
+
+	*radix_pointer = newpage;
+	__put_page(page);
+	write_unlock_irq(&mapping->tree_lock);
+
+	return 0;
+}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	copy_highpage(newpage, page);
+
+	if (PageError(page))
+		SetPageError(newpage);
+	if (PageReferenced(page))
+		SetPageReferenced(newpage);
+	if (PageUptodate(page))
+		SetPageUptodate(newpage);
+	if (PageActive(page))
+		SetPageActive(newpage);
+	if (PageChecked(page))
+		SetPageChecked(newpage);
+	if (PageMappedToDisk(page))
+		SetPageMappedToDisk(newpage);
+
+	if (PageDirty(page)) {
+		clear_page_dirty_for_io(page);
+		set_page_dirty(newpage);
+	}
+
+	ClearPageSwapCache(page);
+	ClearPageActive(page);
+	ClearPagePrivate(page);
+	set_page_private(page, 0);
+	page->mapping = NULL;
+
+	/*
+	 * If any waiters have accumulated on the new page then
+	 * wake them up.
+	 */
+	if (PageWriteback(newpage))
+		end_page_writeback(newpage);
+}
+
+/*
+ * Common logic to directly migrate a single page suitable for
+ * pages that do not use PagePrivate.
+ *
+ * Pages are locked upon entry and exit.
+ */
+int migrate_page(struct page *newpage, struct page *page)
+{
+	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
+
+	if (migrate_page_remove_references(newpage, page, 2))
+		return -EAGAIN;
+
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 /*
  * migrate_pages
  *
@@ -672,11 +820,6 @@ retry:
  * are movable anymore because t has become empty
  * or no retryable pages exist anymore.
  *
- * SIMPLIFIED VERSION: This implementation of migrate_pages
- * is only swapping out pages and never touches the second
- * list. The direct migration patchset
- * extends this function to avoid the use of swap.
- *
  * Return: Number of pages not migrated when "to" ran empty.
  */
 int migrate_pages(struct list_head *from, struct list_head *to,
@@ -697,6 +840,9 @@ redo:
 	retry = 0;
 
 	list_for_each_entry_safe(page, page2, from, lru) {
+		struct page *newpage = NULL;
+		struct address_space *mapping;
+
 		cond_resched();
 
 		rc = 0;
@@ -704,6 +850,9 @@ redo:
 			/* page was freed from under us. So we are done. */
 			goto next;
 
+		if (to && list_empty(to))
+			break;
+
 		/*
 		 * Skip locked pages during the first two passes to give the
 		 * functions holding the lock time to release the page. Later we
@@ -740,12 +889,64 @@ redo:
 			}
 		}
 
+		if (!to) {
+			rc = swap_page(page);
+			goto next;
+		}
+
+		newpage = lru_to_page(to);
+		lock_page(newpage);
+
 		/*
-		 * Page is properly locked and writeback is complete.
+		 * Pages are properly locked and writeback is complete.
 		 * Try to migrate the page.
 		 */
-		rc = swap_page(page);
-		goto next;
+		mapping = page_mapping(page);
+		if (!mapping)
+			goto unlock_both;
+
+		/*
+		 * Trigger writeout if page is dirty
+		 */
+		if (PageDirty(page)) {
+			switch (pageout(page, mapping)) {
+			case PAGE_KEEP:
+			case PAGE_ACTIVATE:
+				goto unlock_both;
+
+			case PAGE_SUCCESS:
+				unlock_page(newpage);
+				goto next;
+
+			case PAGE_CLEAN:
+				; /* try to migrate the page below */
+			}
+		}
+
+		/*
+		 * If we have no buffer or can release the buffer
+		 * then do a simple migration.
+		 */
+		if (!page_has_buffers(page) ||
+				try_to_release_page(page, GFP_KERNEL)) {
+			rc = migrate_page(newpage, page);
+			goto unlock_both;
+		}
+
+		/*
+		 * On early passes with mapped pages simply
+		 * retry. There may be a lock held for some
+		 * buffers that may go away. Later
+		 * swap them out.
+		 */
+		if (pass > 4) {
+			unlock_page(newpage);
+			newpage = NULL;
+			rc = swap_page(page);
+			goto next;
+		}
+
+unlock_both:
+		unlock_page(newpage);
 
 unlock_page:
 		unlock_page(page);
@@ -758,7 +959,10 @@ next:
 			list_move(&page->lru, failed);
 			nr_failed++;
 		} else {
-			/* Success */
+			if (newpage) {
+				/* Successful migration. Return page to LRU */
+				move_to_lru(newpage);
+			}
 			list_move(&page->lru, moved);
 		}
 	}