author    Mel Gorman <mel@csn.ul.ie>                        2011-01-13 18:46:20 -0500
committer Linus Torvalds <torvalds@linux-foundation.org>    2011-01-13 20:32:37 -0500
commit    9950474883e027e6e728cbcff25f7f2bf0c96530 (patch)
tree      ecfdd3e68a25f1ef7822428c44f8375efbe9bc0c /mm/vmscan.c
parent    c585a2678d83ba8fb02fa6b197de0ac7d67377f1 (diff)
mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem:

   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory. Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM. In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory. We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the
prevailing theory became that kswapd was being too aggressive about
reclaiming the order-3 pages requested by SLUB.

There are two apparent problems here. First, on the target machine there
is a small Normal zone in comparison to DMA32. As kswapd tries to balance
all zones, it would continually try reclaiming for Normal even though
DMA32 was balanced enough for callers. Second, sleeping_prematurely()
does not use the same logic as balance_pgdat() when deciding whether to
sleep or not, which keeps kswapd artificially awake.

A number of tests were run, and the figures will look very different from
previous postings for a few reasons. One, the old figures forced my
network card to use GFP_ATOMIC in an attempt to replicate Simon's
problem. Second, I previously specified slub_min_order=3, again in an
attempt to reproduce Simon's problem. In this posting, I'm depending on
Simon to say whether his problem is fixed or not, and these figures are
to show the impact on the ordinary cases. Finally, the "vmscan" figures
are taken from /proc/vmstat instead of the tracepoints; there is less
information, but recording is less disruptive.

The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine,
and it is memory-intensive enough to keep kswapd awake.

POSTMARK
                                            traceonly          kanyzone
Transactions per second:               156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:         21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:      29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:        250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:       79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:        520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:       79.00 ( 0.00%)    76.00 (-3.95%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47

VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0         4
Direct reclaim pages scanned                     0       203
Direct reclaim pages reclaimed                   0       184
Kswapd pages scanned                        326631    322018
Kswapd pages reclaimed                      312632    309784
Kswapd low wmark quickly                         1         4
Kswapd high wmark quickly                      122       475
Kswapd skip congestion_wait                      1         0
Pages activated                             700040    705317
Pages deactivated                           212113    203922
Pages written                                 9875      6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%

proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355

Performance is mildly affected because kswapd is no longer doing as much
work, and the background memory-consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it is less
aggressive, and overall fewer pages were scanned and reclaimed.
Swap in/out is particularly reduced, again reflecting kswapd throwing out
fewer pages. The slight performance impact is unfortunate here, but it
looks like a direct result of kswapd being less aggressive. As the bug
report is about too many pages being freed by kswapd, it may have to be
accepted for now.

The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.

MICRO
                                          traceonly   kanyzone
User/Sys Time Running Test (seconds)          29.29      28.87
Total Elapsed Time (seconds)                 492.18     488.79

VMstat Reclaim Statistics: vmscan
Direct reclaims                                2128       1460
Direct reclaim pages scanned                2284822    1496067
Direct reclaim pages reclaimed               148919     110937
Kswapd pages scanned                       15450014   16202876
Kswapd pages reclaimed                      8503697    8537897
Kswapd low wmark quickly                       3100       3397
Kswapd high wmark quickly                      1860       7243
Kswapd skip congestion_wait                     708        801
Pages activated                                9635       9573
Pages deactivated                              1432       1271
Pages written                                   223       1130
Total pages scanned                        17734836   17698943
Total pages reclaimed                       8652616    8648834
%age total pages scanned/reclaimed           48.79%     48.87%
%age total pages scanned/written              0.00%      0.01%

proc vmstat: Faults
Major Faults                                    165        221
Minor Faults                                9655785    9656506
Page ins                                       3880       7228
Page outs                                  37692940   37480076
Swap ins                                          0         69
Swap outs                                        19         15

Again, fewer pages are scanned and reclaimed as expected, and this time
the test completed faster. Note that kswapd is hitting its watermarks
faster (low and high wmark quickly), which I expect is due to kswapd
reclaiming fewer pages.

I also ran fs-mark, iozone and sysbench, but there is nothing interesting
to report in the figures. Performance is not significantly changed and
the reclaim statistics look reasonable.

This patch:

When the allocator enters its slow path, kswapd is woken up to balance
the node. It continues working until all zones within the node are
balanced. For order-0 allocations, this makes perfect sense, but for
higher orders it can have unintended side-effects. If the zone sizes are
imbalanced, kswapd may reclaim heavily within a smaller zone, discarding
an excessive number of pages. The user-visible behaviour is that kswapd
is awake and reclaiming even though plenty of pages are free from a
suitable zone.

This patch alters the "balance" logic for high-order reclaim, allowing
kswapd to stop if any suitable zone becomes balanced, to reduce the
number of pages it reclaims from other zones. kswapd still tries to
ensure that order-0 watermarks for all zones are met before sleeping.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
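To make the new exit condition easier to follow before reading the diff
itself, below is a minimal, self-contained userspace sketch of the
decision balance_pgdat() now makes each balancing pass. struct zone_info,
node_balanced() and the meets_high_wmark field are illustrative
stand-ins, not the kernel's real types or API:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified stand-in for the kernel's per-zone state. */
    struct zone_info {
            const char *name;
            bool populated;
            bool meets_high_wmark;  /* high watermark met at the reclaim order */
    };

    /*
     * Models the new exit test:
     * - order-0:    every populated zone must meet its high watermark.
     * - high-order: any one zone at or below classzone_idx being
     *               balanced is enough to stop reclaiming.
     */
    static bool node_balanced(const struct zone_info *zones, int nr_zones,
                              int order, int classzone_idx)
    {
            bool all_zones_ok = true;
            bool any_zone_ok = false;

            for (int i = 0; i < nr_zones; i++) {
                    if (!zones[i].populated)
                            continue;
                    if (zones[i].meets_high_wmark) {
                            if (i <= classzone_idx)
                                    any_zone_ok = true;
                    } else {
                            all_zones_ok = false;
                    }
            }
            return all_zones_ok || (order && any_zone_ok);
    }

    int main(void)
    {
            /* Roughly Simon's machine: a large, balanced DMA32 zone and
             * a small Normal zone that stays below its watermark. */
            const struct zone_info zones[] = {
                    { "DMA",    false, false },     /* unpopulated */
                    { "DMA32",  true,  true  },
                    { "Normal", true,  false },
            };

            /* An order-3 request that may use zones up to Normal
             * (classzone_idx 2): kswapd may now stop, because DMA32 is
             * balanced enough for callers. */
            printf("order-3: %s\n",
                   node_balanced(zones, 3, 3, 2) ? "stop" : "keep reclaiming");

            /* order-0 still requires every populated zone to be balanced. */
            printf("order-0: %s\n",
                   node_balanced(zones, 3, 0, 2) ? "stop" : "keep reclaiming");
            return 0;
    }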
Diffstat (limited to 'mm/vmscan.c')
-rw-r--r--   mm/vmscan.c   68
1 file changed, 59 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7037cc8c60b6..3584067800e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2246,11 +2246,14 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+                                                       int classzone_idx)
 {
        int all_zones_ok;
+       int any_zone_ok;
        int priority;
        int i;
+       int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
        unsigned long total_scanned;
        struct reclaim_state *reclaim_state = current->reclaim_state;
        struct scan_control sc = {
@@ -2273,7 +2276,6 @@ loop_again:
        count_vm_event(PAGEOUTRUN);

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-               int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
                unsigned long lru_pages = 0;
                int has_under_min_watermark_zone = 0;

@@ -2282,6 +2284,7 @@ loop_again:
                        disable_swap_token();

                all_zones_ok = 1;
+               any_zone_ok = 0;

                /*
                 * Scan in the highmem->dma direction for the highest
@@ -2400,10 +2403,12 @@ loop_again:
                                 * speculatively avoid congestion waits
                                 */
                                zone_clear_flag(zone, ZONE_CONGESTED);
+                               if (i <= classzone_idx)
+                                       any_zone_ok = 1;
                        }

                }
-               if (all_zones_ok)
+               if (all_zones_ok || (order && any_zone_ok))
                        break;          /* kswapd: all done */
                /*
                 * OK, kswapd is getting into trouble. Take a nap, then take
@@ -2426,7 +2431,13 @@ loop_again:
                        break;
        }
out:
-       if (!all_zones_ok) {
+
+       /*
+        * order-0: All zones must meet high watermark for a balanced node
+        * high-order: Any zone below pgdat's classzone_idx must meet the high
+        * watermark for a balanced node
+        */
+       if (!(all_zones_ok || (order && any_zone_ok))) {
                cond_resched();

                try_to_freeze();
@@ -2451,6 +2462,36 @@ out:
                goto loop_again;
        }

+       /*
+        * If kswapd was reclaiming at a higher order, it has the option of
+        * sleeping without all zones being balanced. Before it does, it must
+        * ensure that the watermarks for order-0 on *all* zones are met and
+        * that the congestion flags are cleared. The congestion flag must
+        * be cleared as kswapd is the only mechanism that clears the flag
+        * and it is potentially going to sleep here.
+        */
+       if (order) {
+               for (i = 0; i <= end_zone; i++) {
+                       struct zone *zone = pgdat->node_zones + i;
+
+                       if (!populated_zone(zone))
+                               continue;
+
+                       if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+                               continue;
+
+                       /* Confirm the zone is balanced for order-0 */
+                       if (!zone_watermark_ok(zone, 0,
+                                       high_wmark_pages(zone), 0, 0)) {
+                               order = sc.order = 0;
+                               goto loop_again;
+                       }
+
+                       /* If balanced, clear the congested flag */
+                       zone_clear_flag(zone, ZONE_CONGESTED);
+               }
+       }
+
        return sc.nr_reclaimed;
 }

@@ -2514,6 +2555,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 static int kswapd(void *p)
 {
        unsigned long order;
+       int classzone_idx;
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;

@@ -2544,21 +2586,27 @@ static int kswapd(void *p)
        set_freezable();

        order = 0;
+       classzone_idx = MAX_NR_ZONES - 1;
        for ( ; ; ) {
                unsigned long new_order;
+               int new_classzone_idx;
                int ret;

                new_order = pgdat->kswapd_max_order;
+               new_classzone_idx = pgdat->classzone_idx;
                pgdat->kswapd_max_order = 0;
-               if (order < new_order) {
+               pgdat->classzone_idx = MAX_NR_ZONES - 1;
+               if (order < new_order || classzone_idx > new_classzone_idx) {
                        /*
                         * Don't sleep if someone wants a larger 'order'
-                        * allocation
+                        * allocation or has tighter zone constraints
                         */
                        order = new_order;
+                       classzone_idx = new_classzone_idx;
                } else {
                        kswapd_try_to_sleep(pgdat, order);
                        order = pgdat->kswapd_max_order;
+                       classzone_idx = pgdat->classzone_idx;
                }

                ret = try_to_freeze();
@@ -2571,7 +2619,7 @@ static int kswapd(void *p)
                 */
                if (!ret) {
                        trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-                       balance_pgdat(pgdat, order);
+                       balance_pgdat(pgdat, order, classzone_idx);
                }
        }
        return 0;
@@ -2580,7 +2628,7 @@ static int kswapd(void *p)
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
        pg_data_t *pgdat;

@@ -2590,8 +2638,10 @@ void wakeup_kswapd(struct zone *zone, int order)
        if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
                return;
        pgdat = zone->zone_pgdat;
-       if (pgdat->kswapd_max_order < order)
+       if (pgdat->kswapd_max_order < order) {
                pgdat->kswapd_max_order = order;
+               pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
+       }
        if (!waitqueue_active(&pgdat->kswapd_wait))
                return;
        if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
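Note that the diffstat above is limited to mm/vmscan.c, so the matching
updates to wakeup_kswapd() callers are not shown here. As a rough,
self-contained model of the waker-side bookkeeping in the final hunk,
where struct node_state and record_wakeup() are hypothetical stand-ins
rather than kernel code:

    #include <stdio.h>

    #define MAX_NR_ZONES 4

    /* Hypothetical stand-in for the pg_data_t fields touched above. */
    struct node_state {
            int kswapd_max_order;
            int classzone_idx;
    };

    /*
     * Mirrors the final hunk: kswapd services the largest requested
     * order, and classzone_idx is tightened (lowered) only when a
     * larger order arrives, exactly as in the patch.
     */
    static void record_wakeup(struct node_state *node, int order,
                              int classzone_idx)
    {
            if (node->kswapd_max_order < order) {
                    node->kswapd_max_order = order;
                    if (classzone_idx < node->classzone_idx)
                            node->classzone_idx = classzone_idx;
            }
    }

    int main(void)
    {
            struct node_state node = { 0, MAX_NR_ZONES - 1 };

            record_wakeup(&node, 3, 2); /* order-3, suitable up to Normal */
            record_wakeup(&node, 1, 1); /* order-1, smaller, leaves state alone */

            /* Prints: kswapd balances order 3, classzone_idx 2 */
            printf("kswapd balances order %d, classzone_idx %d\n",
                   node.kswapd_max_order, node.classzone_idx);
            return 0;
    }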