author Andrea Arcangeli <andrea@suse.de> 2007-10-16 04:25:42 -0400
committer Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-10-16 12:42:59 -0400
commit 4106f83a9f86afc423557d0d92ebf4b3f36728c1
tree b2da62d411e720b6053a80074c9fb8343ec17ccc
parent 6cb062296f73e74768cca2f3eaf90deac54de02d
make swappiness safer to use
Swappiness isn't a safe sysctl. Setting it to 0, for example, can hang a system. That's a corner case, but even setting it to 10 or lower can waste enormous amounts of cpu without making much progress. We have customers who want to use swappiness but they can't because of the current implementation (if you change it so the system stops swapping, it really stops swapping, and nothing works sanely anymore if you really had to swap something to make progress).

This patch from Kurt Garloff makes swappiness safer to use (no more huge cpu usage or hangs with low swappiness values).

I think prev_priority can also be nuked since it wastes 4 bytes per zone (that would be an incremental patch, but I'm waiting for nr_scan_[in]active to be nuked first for similar reasons). Clearly somebody at some point noticed how broken that thing was and had to add min(priority, prev_priority) to give it some reliability, but they didn't go the last mile and nuke prev_priority too. Calculating distress only as a function of the non-racy priority is correct and surely more than enough without having to add randomness into the equation.

The patch is tested on older kernels, but it compiles and it's quite simple so...

Overall I'm not very satisfied with the swappiness tweak, since it doesn't really do anything about the dirty pagecache that may be inactive. We need another kind of tweak that controls the inactive scan and tunes the can_writepage feature (not yet in mainline despite having been submitted a few times), not only the active one. That new tweak will tell the kernel how hard to scan the inactive list for pure clean pagecache (something the mainline kernel isn't capable of yet). We already have that feature working in all our enterprise kernels with a reasonable default tune; without it they can't even run a read-only backup with tar without triggering huge write I/O. I think it should also be available in mainline later.
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm')
-rw-r--r-- mm/vmscan.c 41
1 files changed, 41 insertions, 0 deletions
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb8ad3c6e483..bbd194630c5b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -932,6 +932,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		long mapped_ratio;
 		long distress;
 		long swap_tendency;
+		long imbalance;
 
 		if (zone_is_near_oom(zone))
 			goto force_reclaim_mapped;
@@ -967,6 +968,46 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
 
 		/*
+		 * If there's huge imbalance between active and inactive
+		 * (think active 100 times larger than inactive) we should
+		 * become more permissive, or the system will take too much
+		 * cpu before it start swapping during memory pressure.
+		 * Distress is about avoiding early-oom, this is about
+		 * making swappiness graceful despite setting it to low
+		 * values.
+		 *
+		 * Avoid div by zero with nr_inactive+1, and max resulting
+		 * value is vm_total_pages.
+		 */
+		imbalance = zone_page_state(zone, NR_ACTIVE);
+		imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
+
+		/*
+		 * Reduce the effect of imbalance if swappiness is low,
+		 * this means for a swappiness very low, the imbalance
+		 * must be much higher than 100 for this logic to make
+		 * the difference.
+		 *
+		 * Max temporary value is vm_total_pages*100.
+		 */
+		imbalance *= (vm_swappiness + 1);
+		imbalance /= 100;
+
+		/*
+		 * If not much of the ram is mapped, makes the imbalance
+		 * less relevant, it's high priority we refill the inactive
+		 * list with mapped pages only in presence of high ratio of
+		 * mapped pages.
+		 *
+		 * Max temporary value is vm_total_pages*100.
+		 */
+		imbalance *= mapped_ratio;
+		imbalance /= 100;
+
+		/* apply imbalance feedback to swap_tendency */
+		swap_tendency += imbalance;
+
+		/*
 		 * Now use this metric to decide whether to start moving mapped
 		 * memory onto the inactive list.
 		 */
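
To get a feel for the numbers, the arithmetic added by the hunk above can be replayed in user space. The following is only a sketch, not kernel code: the function names imbalance_feedback() and swap_tendency_with_feedback() and their parameters are made up for illustration, and the page counts are passed in as plain longs instead of being read with zone_page_state(). The patch itself reads the global vm_swappiness for the scaling step; here that is just a parameter.

```c
#include <stdio.h>

/*
 * Illustrative user-space replay of the imbalance arithmetic from the
 * patch. All names here are hypothetical; the kernel computes this
 * inline in shrink_active_list().
 */
long imbalance_feedback(long nr_active, long nr_inactive,
			long swappiness, long mapped_ratio)
{
	long imbalance;

	/* active/inactive ratio; the +1 avoids a division by zero */
	imbalance = nr_active / (nr_inactive + 1);

	/* scale by swappiness (0..100); the +1 keeps it non-zero at 0 */
	imbalance *= (swappiness + 1);
	imbalance /= 100;

	/* scale by the percentage of RAM that is mapped (0..100) */
	imbalance *= mapped_ratio;
	imbalance /= 100;

	return imbalance;
}

/* swap_tendency as computed just above the hunk, plus the feedback */
long swap_tendency_with_feedback(long mapped_ratio, long distress,
				 long swappiness, long imbalance)
{
	return mapped_ratio / 2 + distress + swappiness + imbalance;
}
```

As a worked case under these assumptions: with 1000000 active pages, 999 inactive pages and mapped_ratio 80, the raw ratio is 1000. With swappiness 0 the feedback is 1000 * 1 / 100 * 80 / 100 = 8, while with swappiness 60 it is 1000 * 61 / 100 * 80 / 100 = 488. Because every step is an integer division, a ratio below 100 contributes nothing at swappiness 0, which matches the comment that the imbalance must be much higher than 100 for very low swappiness.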