author     Hugh Dickins <hughd@google.com>  2018-12-28 03:36:14 -0500
committer  Linus Torvalds <torvalds@linux-foundation.org>  2018-12-28 15:11:48 -0500
commit     9a1ea439b16b92002e0a6fceebc5d1794906e297 (patch)
tree       101e06903ff748d910d5735066362c2c533b0a1e /mm/filemap.c
parent     f0c867d9588d9efc10d6a55009c9560336673369 (diff)
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all
along since 2006: but you cannot safely wait_on_page_locked() without
holding a reference to the page, and that extra reference is enough to
make migrate_page_move_mapping() fail with -EAGAIN, when a racing task
faults on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying
to migrate a popular page.  With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success
becomes much less likely (and the more the page was mapped, the worse
the overhead of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer
level retries forever (or until terminated from userspace), because a
heavy refault workload can trigger an endless loop of migration
failures.  wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2
where Kan Liang went on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2
and the thread ended up with the v4.14 commits:
2554db916586 ("sched/wait: Break up long wake list walk")
11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a
struct page when MEMORY_HOTREMOVE is configured, but also to protect
against reuse of the struct page going to someone who then holds the
page locked indefinitely, when the waiter can reasonably expect timely
unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it.  That does
mean that this case cannot go back through the loop: but that's fine
for the page migration case, and even if used more widely, is limited
by the "Stop walking if it's locked" optimization in
wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using a
"behavior" enum in place of the "lock" arg to wait_on_page_bit_common()
to implement it.  No interruptible or killable variant is needed yet,
but they might follow: I have a vague notion that reporting -EINTR
should take precedence over return from wait_on_page_bit_common()
without knowing the page state, so arrange it accordingly - but that
may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the
page, prior to calling put_and_wait_on_page_locked(): but now that it
is dropped before waiting, the chance of impeding page migration is
very much reduced.  Should we perhaps disable preemption across this?
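For context, the companion mm/migrate.c hunk of this commit (outside
the mm/filemap.c diffstat shown below) is where that brief reference is
taken and handed over.  An abridged sketch, not the verbatim upstream
text:

/* mm/migrate.c (abridged): wait for a migration entry's page to unlock */
void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
				spinlock_t *ptl)
{
	pte_t pte;
	swp_entry_t entry;
	struct page *page;

	spin_lock(ptl);
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		goto out;

	page = migration_entry_to_page(entry);

	/*
	 * page_count may already be zero once page cache replacement of
	 * page migration has started, so take the reference with
	 * get_page_unless_zero(); if that fails, just fault again.
	 */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	/* The reference is dropped before sleeping, so migration proceeds: */
	put_and_wait_on_page_locked(page);
	return;
out:
	pte_unmap_unlock(ptep, ptl);
}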
shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
survived a lot of testing before it showed up.  PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference:
in such a case, unlock_page() must be used.  Follow the suggestion from
Michal Hocko and just revert a978d6f52106 ("mm: unlockless reclaim")
now: that optimization predates PageWaiters, and won't buy much these
days; but we can reinstate it for the !PageWaiters case if anyone
notices.

It does raise the question: should vmscan.c's is_page_cache_freeable()
and __remove_mapping() now treat a PageWaiters page as if an extra
reference were held?  Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters
are not very common there: I noticed no difference when trying the
bigger change, and it's surely not needed while
put_and_wait_on_page_locked() is only used for page migration.

[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
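For reference, the wake side that DROP waiters (and the reverted
reclaim path) depend on is unlock_page(), which already routes through
PG_waiters.  Quoted from memory of the contemporaneous mm/filemap.c;
check the tree for the exact text:

void unlock_page(struct page *page)
{
	BUILD_BUG_ON(PG_waiters != 7);
	page = compound_head(page);
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	/* Clear PG_locked and, if PG_waiters was set, wake the queue: */
	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
		wake_up_page_bit(page, PG_locked);
}

A bare __ClearPageLocked() skips that PG_waiters check entirely, which
is why shrink_page_list() can no longer use it once a waiter may have
dropped its reference: the waiter would sleep with nobody to wake it.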
Diffstat (limited to 'mm/filemap.c')
-rw-r--r--  mm/filemap.c  87
1 file changed, 74 insertions(+), 13 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..d2df272152f5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
-	/* Stop walking if it's locked */
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
 	if (test_bit(key->bit_nr, &key->page->flags))
 		return -1;
 
@@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
 		wake_up_page_bit(page, bit);
 }
 
+/*
+ * A choice of three behaviors for wait_on_page_bit_common():
+ */
+enum behavior {
+	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
+			 * __lock_page() waiting on then setting PG_locked.
+			 */
+	SHARED,		/* Hold ref to page and check the bit when woken, like
+			 * wait_on_page_writeback() waiting on PG_writeback.
+			 */
+	DROP,		/* Drop ref to page before wait, no check when woken,
+			 * like put_and_wait_on_page_locked() on PG_locked.
+			 */
+};
+
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-	struct page *page, int bit_nr, int state, bool lock)
+	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
+	bool bit_is_set;
 	bool thrashing = false;
+	bool delayacct = false;
 	unsigned long pflags;
 	int ret = 0;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
-		if (!PageSwapBacked(page))
+		if (!PageSwapBacked(page)) {
 			delayacct_thrashing_start();
+			delayacct = true;
+		}
 		psi_memstall_enter(&pflags);
 		thrashing = true;
 	}
 
 	init_wait(wait);
-	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+	wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
@@ -1084,14 +1110,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 		spin_unlock_irq(&q->lock);
 
-		if (likely(test_bit(bit_nr, &page->flags))) {
+		bit_is_set = test_bit(bit_nr, &page->flags);
+		if (behavior == DROP)
+			put_page(page);
+
+		if (likely(bit_is_set))
 			io_schedule();
-		}
 
-		if (lock) {
+		if (behavior == EXCLUSIVE) {
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
-		} else {
+		} else if (behavior == SHARED) {
 			if (!test_bit(bit_nr, &page->flags))
 				break;
 		}
@@ -1100,12 +1129,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			ret = -EINTR;
 			break;
 		}
+
+		if (behavior == DROP) {
+			/*
+			 * We can no longer safely access page->flags:
+			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+			 * there is a risk of waiting forever on a page reused
+			 * for something that keeps it locked indefinitely.
+			 * But best check for -EINTR above before breaking.
+			 */
+			break;
+		}
 	}
 
 	finish_wait(q, wait);
 
 	if (thrashing) {
-		if (!PageSwapBacked(page))
+		if (delayacct)
 			delayacct_thrashing_end();
 		psi_memstall_leave(&pflags);
 	}
@@ -1124,18 +1164,37 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
 /**
+ * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
+ * @page: The page to wait for.
+ *
+ * The caller should hold a reference on @page.  They expect the page to
+ * become unlocked relatively soon, but do not wish to hold up migration
+ * (for example) by holding the reference while waiting for the page to
+ * come unlocked.  After this function returns, the caller should not
+ * dereference @page.
+ */
+void put_and_wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *q;
+
+	page = compound_head(page);
+	q = page_waitqueue(page);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
+}
+
+/**
  * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
  * @page: Page defining the wait queue of interest
  * @waiter: Waiter to add to the queue
@@ -1264,7 +1323,8 @@ void __lock_page(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
+				EXCLUSIVE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -1272,7 +1332,8 @@ int __lock_page_killable(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
+					EXCLUSIVE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
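To summarize the caller-visible change, a hypothetical before/after
pair (function names invented for this note, not taken from the
commit):

/* Old pattern: the reference is held across the wait, and that extra
 * reference can make migrate_page_move_mapping() fail with -EAGAIN.
 */
static void wait_pinned(struct page *page)
{
	get_page(page);
	wait_on_page_locked(page);
	put_page(page);		/* page may still be used before this */
}

/* New pattern: the reference is handed to the wait itself and dropped
 * before sleeping, so the page stays migratable.
 */
static void wait_unpinned(struct page *page)
{
	get_page(page);
	put_and_wait_on_page_locked(page);
	/* page must not be dereferenced from here on */
}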