author Paul Jackson <pj@sgi.com> 2006-12-06 23:31:48 -0500
committer Linus Torvalds <torvalds@woody.osdl.org> 2006-12-07 11:39:20 -0500
commit 9276b1bc96a132f4068fdee00983c532f43d3a26
tree 04d64444cf6558632cfc7514b5437578b5e616af /mm
parent 89689ae7f95995723fbcd5c116c47933a3bb8b13
[PATCH] memory page_alloc zonelist caching speedup

Optimize the critical zonelist scanning for free pages in the kernel memory allocator by caching the zones that were found to be full recently, and skipping them.

Remembers the zones in a zonelist that were short of free memory in the last second. And it stashes a zone-to-node table in the zonelist struct, to optimize that conversion (minimize its cache footprint.)

Recent changes:

This differs in a significant way from a similar patch that I posted a week ago. Now, instead of having a nodemask_t of recently full nodes, I have a bitmask of recently full zones. This solves a problem that last week's patch had, which on systems with multiple zones per node (such as DMA zone) would take seeing any of these zones full as meaning that all zones on that node were full.

Also I changed names - from "zonelist faster" to "zonelist cache", as that seemed to better convey what we're doing here - caching some of the key zonelist state (for faster access.)

See below for some performance benchmark results. After all that discussion with David on why I didn't need them, I went and got some ;). I wanted to verify that I had not hurt the normal case of memory allocation noticeably. At least for my one little microbenchmark, I found (1) the normal case wasn't affected, and (2) workloads that forced scanning across multiple nodes for memory improved, using up to 10% fewer System CPU cycles and taking less elapsed clock time ('sys' and 'real'). Good. See details, below.

I didn't have the logic in get_page_from_freelist() for various full nodes and zone reclaim failures correct. That should be fixed up now - notice the new goto labels zonelist_scan, this_zone_full, and try_next_zone, in get_page_from_freelist().
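
The cache described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: the names zl_cache, zl_cache_refresh, zl_cache_mark_full, zl_cache_worth_trying, MAX_ZONES and STALE_TICKS are hypothetical, a single 64-bit word stands in for the kernel bitmap, and a plain tick counter stands in for jiffies and HZ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ZONES   64     /* hypothetical cap so one 64-bit word can hold the mask */
#define STALE_TICKS 1000   /* hypothetical stand-in for "one second" (1 * HZ) */

struct zl_cache {
	uint64_t fullzones;               /* bit i set: zone i was recently found full */
	unsigned short z_to_n[MAX_ZONES]; /* zone index -> node id, kept compact */
	unsigned long last_full_zap;      /* tick at which fullzones was last cleared */
};

/* Forget the "full" hints once they are older than STALE_TICKS. */
static void zl_cache_refresh(struct zl_cache *c, unsigned long now)
{
	if (now - c->last_full_zap > STALE_TICKS) {
		c->fullzones = 0;
		c->last_full_zap = now;
	}
}

/* Record that a zone had no free memory, so later scans can skip it. */
static void zl_cache_mark_full(struct zl_cache *c, int zone)
{
	assert(zone >= 0 && zone < MAX_ZONES);
	c->fullzones |= UINT64_C(1) << zone;
}

/* A zone is worth trying if its node is allowed and it is not marked full. */
static bool zl_cache_worth_trying(const struct zl_cache *c, int zone,
				  uint64_t allowed_nodes)
{
	assert(zone >= 0 && zone < MAX_ZONES);
	unsigned int node = c->z_to_n[zone];
	assert(node < 64);        /* allowed_nodes is a 64-bit mask in this sketch */

	bool allowed = (allowed_nodes >> node) & 1;
	bool full = (c->fullzones >> zone) & 1;
	return allowed && !full;
}
```

Because the mask is indexed by zone, not by node, marking one zone full leaves the other zones on the same node eligible, which is the difference from the earlier per-node version that the text describes.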

There are two reasons I pursued this alternative, over some earlier proposals that would have focused on optimizing the fake numa emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems) have seen real customer loads where the cost to scan the zonelist was a problem, due to many nodes being full of memory before we got to a node we could use. Or at least, I think we have. This was related to me by another engineer, based on experiences from some time past. So this is not guaranteed. Most likely, though.

    The following approach should help such real numa systems just as much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance, so that we could cache a fake numa node and optimize choosing it over equivalent distance fake nodes, while continuing to properly scan all real nodes in distance order, was going to require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay adding any work for this new mechanism until we've looked at the first zone in zonelist. I figured the odds of the first zone having the memory we needed were high enough that we should just look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while not tripping up the custom zonelists built by the mm/mempolicy.c code for MPOL_BIND. My usual wordy comments below explain this. Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently, with no locking. Multiple CPU cores on a node could hit and mangle this data.
   The theory is that this is just performance hint data, and the memory allocator will work just fine despite any such mangling. The fields at risk are the struct 'zonelist_cache' fields 'fullzones' (a bitmask) and 'last_full_zap' (unsigned long jiffies). It should all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before. All I've optimized is making the scan faster, not algorithmically shorter. It is now able to scan a compact array of 'unsigned short' in the case of many full nodes, so one cache line should cover quite a few nodes, rather than each node hitting another one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cache line's worth of zonelist. We seem, at least in the fake numa case, to have put the lie to that claim.

 - I pay no attention to the various watermarks and such in this performance hint. A node could be marked full for one watermark, and then skipped over when searching for a page using a different watermark. I think that's actually quite ok, as it will tend to slightly increase the spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple threads to touch a lot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of the total 96 GBytes on the system, and using 1, 19, 37, or 55 threads (on a 56 CPU system.) System, user and real (elapsed) timings were recorded for each run, shown in units of seconds, in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with this zonelist caching patch added. The table also shows the percentage by which the zonelist caching sys time improves on (is lower than) that of the stock *-mm kernel.

  number          2.6.18-mm3        zonelist-cache     delta (< 0 good)    percent
 GBs       N   -----------------   -----------------   -----------------   systime
 mem threads     sys  user  real     sys  user  real     sys  user  real    better
  12       1     153    24   177     151    24   176      -2     0    -1        1%
  12      19      99    22     8      99    22     8       0     0     0        0%
  12      37     111    25     6     112    25     6       1     0     0       -0%
  12      55     115    25     5     110    23     5      -5    -2     0        4%
  38       1     502    74   576     497    73   570      -5    -1    -6        0%
  38      19     426    78    48     373    76    39     -53    -2    -9       12%
  38      37     544    83    36     547    82    36       3    -1     0       -0%
  38      55     501    77    23     511    80    24      10     3     1       -1%
  64       1     917   125  1042     890   124  1014     -27    -1   -28        2%
  64      19    1118   138   119     965   141   103    -153     3   -16       13%
  64      37    1202   151    94    1136   150    81     -66    -1   -13        5%
  64      55    1118   141    61    1072   140    58     -46    -1    -3        4%
  90       1    1342   177  1519    1275   174  1450     -67    -3   -69        4%
  90      19    2392   199   192    2116   189   176    -276   -10   -16       11%
  90      37    3313   238   175    2972   225   145    -341   -13   -30       10%
  90      55    1948   210   104    1843   213   100    -105     3    -4        5%

Notes:

 1) This test ran a memory hog program that started a specified number N of threads, and had each thread allocate and touch 1/N'th of the total memory to be used in the test run in a single loop, writing a constant word to memory, one store every 4096 bytes. Watching this test during some earlier trial runs, I would see each of these threads sit down on one CPU and stay there, for the remainder of the pass, a different CPU for each thread.

 2) The 'real' column is not comparable to the 'sys' or 'user' columns. The 'real' column is seconds wall clock time elapsed, from beginning to end of that test pass. The 'sys' and 'user' columns are total CPU seconds spent on that test pass. For a 19 thread test run, for example, the sum of 'sys' and 'user' could be up to 19 times the number of 'real' elapsed wall clock seconds.

 3) Tests were run on a fresh, single-user boot, to minimize the amount of memory already in use at the start of the test, and to minimize the amount of background activity that might interfere.

 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

 5) Notice that the 'real' time gets large for the single thread runs, even though the measured 'sys' and 'user' times are modest. I'm not sure what that means - probably something to do with it being slow for one thread to be accessing memory a long way away. Perhaps the fake numa system, running ostensibly the same workload, would not show this substantial degradation of 'real' time for one thread on many nodes -- let's hope not.

 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs) ran quite efficiently, as one might expect. Each pair of threads needed to allocate and touch the memory on the node the two threads shared, a pleasantly parallelizable workload.

 7) The intermediate thread count passes, when asking for a lot of memory forcing them to go to a few neighboring nodes, improved the most with this zonelist caching patch.

Conclusions:

 * This zonelist cache patch probably makes little difference one way or the other for most workloads on real numa hardware, if those workloads avoid heavy off node allocations.

 * For memory intensive workloads requiring substantial off-node allocations on real numa hardware, this patch improves both kernel and elapsed timings up to ten percent.

 * For fake numa systems, I'm optimistic, but will have to leave that up to Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
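
The "percent systime better" column in the benchmark table is not defined by a formula in the message. The values appear consistent with the relative drop in sys time, truncated toward zero; that reading is inferred from the data, and the helper name systime_percent_better below is hypothetical.

```c
#include <assert.h>

/*
 * Percent by which new_sys improves on old_sys. C integer division
 * truncates toward zero, which seems to match the table's column.
 * This is an inferred formula, not one stated in the commit message.
 */
static int systime_percent_better(int old_sys, int new_sys)
{
	assert(old_sys > 0);
	return 100 * (old_sys - new_sys) / old_sys;
}
```

For the 38 GB, 19 thread row, sys time goes from 426 to 373 seconds, a drop of 53, and 100 * 53 / 426 truncates to 12. The table's "-0%" entries correspond to small regressions that truncate to zero; this sketch returns a plain 0 for them and does not reproduce the sign.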
Diffstat (limited to 'mm')
-rw-r--r-- mm/mempolicy.c    2
-rw-r--r-- mm/page_alloc.c 188
2 files changed, 183 insertions, 7 deletions
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 617fb31086ee..fb907236bbd8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -141,9 +141,11 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 	enum zone_type k;
 
 	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
+	max++;			/* space for zlcache_ptr (see mmzone.h) */
 	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
 	if (!zl)
 		return NULL;
+	zl->zlcache_ptr = NULL;
 	num = 0;
 	/* First put in the highest zones from all nodes, then all the next
 	   lower zones etc. Avoid empty zones because the memory allocator
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23bc5bcbdcf9..230771d3c6b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -918,6 +918,126 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return 1;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
+ * skip over zones that are not allowed by the cpuset, or that have
+ * been recently (in last second) found to be nearly full. See further
+ * comments in mmzone.h. Reduces cache footprint of zonelist scans
+ * that have to skip over alot of full or unallowed zones.
+ *
+ * If the zonelist cache is present in the passed in zonelist, then
+ * returns a pointer to the allowed node mask (either the current
+ * tasks mems_allowed, or node_online_map.)
+ *
+ * If the zonelist cache is not available for this zonelist, does
+ * nothing and returns NULL.
+ *
+ * If the fullzones BITMAP in the zonelist cache is stale (more than
+ * a second since last zap'd) then we zap it out (clear its bits.)
+ *
+ * We hold off even calling zlc_setup, until after we've checked the
+ * first zone in the zonelist, on the theory that most allocations will
+ * be satisfied from that first zone, so best to examine that zone as
+ * quickly as we can.
+ */
+static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
+{
+	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
+	nodemask_t *allowednodes;	/* zonelist_cache approximation */
+
+	zlc = zonelist->zlcache_ptr;
+	if (!zlc)
+		return NULL;
+
+	if (jiffies - zlc->last_full_zap > 1 * HZ) {
+		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+		zlc->last_full_zap = jiffies;
+	}
+
+	allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
+					&cpuset_current_mems_allowed :
+					&node_online_map;
+	return allowednodes;
+}
+
+/*
+ * Given 'z' scanning a zonelist, run a couple of quick checks to see
+ * if it is worth looking at further for free memory:
+ *  1) Check that the zone isn't thought to be full (doesn't have its
+ *     bit set in the zonelist_cache fullzones BITMAP).
+ *  2) Check that the zones node (obtained from the zonelist_cache
+ *     z_to_n[] mapping) is allowed in the passed in allowednodes mask.
+ * Return true (non-zero) if zone is worth looking at further, or
+ * else return false (zero) if it is not.
+ *
+ * This check -ignores- the distinction between various watermarks,
+ * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ... If a zone is
+ * found to be full for any variation of these watermarks, it will
+ * be considered full for up to one second by all requests, unless
+ * we are so low on memory on all allowed nodes that we are forced
+ * into the second scan of the zonelist.
+ *
+ * In the second scan we ignore this zonelist cache and exactly
+ * apply the watermarks to all zones, even it is slower to do so.
+ * We are low on memory in the second scan, and should leave no stone
+ * unturned looking for a free page.
+ */
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+						nodemask_t *allowednodes)
+{
+	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
+	int i;				/* index of *z in zonelist zones */
+	int n;				/* node that zone *z is on */
+
+	zlc = zonelist->zlcache_ptr;
+	if (!zlc)
+		return 1;
+
+	i = z - zonelist->zones;
+	n = zlc->z_to_n[i];
+
+	/* This zone is worth trying if it is allowed but not full */
+	return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
+}
+
+/*
+ * Given 'z' scanning a zonelist, set the corresponding bit in
+ * zlc->fullzones, so that subsequent attempts to allocate a page
+ * from that zone don't waste time re-examining it.
+ */
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
+	int i;				/* index of *z in zonelist zones */
+
+	zlc = zonelist->zlcache_ptr;
+	if (!zlc)
+		return;
+
+	i = z - zonelist->zones;
+
+	set_bit(i, zlc->fullzones);
+}
+
+#else	/* CONFIG_NUMA */
+
+static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
+{
+	return NULL;
+}
+
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+				nodemask_t *allowednodes)
+{
+	return 1;
+}
+
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+}
+#endif	/* CONFIG_NUMA */
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -926,23 +1046,32 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist, int alloc_flags)
 {
-	struct zone **z = zonelist->zones;
+	struct zone **z;
 	struct page *page = NULL;
-	int classzone_idx = zone_idx(*z);
+	int classzone_idx = zone_idx(zonelist->zones[0]);
 	struct zone *zone;
+	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
+	int zlc_active = 0;		/* set if using zonelist_cache */
+	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
+zonelist_scan:
 	/*
-	 * Go through the zonelist once, looking for a zone with enough free.
+	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
+	z = zonelist->zones;
+
 	do {
+		if (NUMA_BUILD && zlc_active &&
+			!zlc_zone_worth_trying(zonelist, z, allowednodes))
+				continue;
 		zone = *z;
 		if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
 			zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
 				break;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed(zone, gfp_mask))
-				continue;
+				goto try_next_zone;
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -956,15 +1085,30 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
 				    !zone_reclaim(zone, gfp_mask, order))
-					continue;
+					goto this_zone_full;
 			}
 		}
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page)
 			break;
-
+this_zone_full:
+		if (NUMA_BUILD)
+			zlc_mark_zone_full(zonelist, z);
+try_next_zone:
+		if (NUMA_BUILD && !did_zlc_setup) {
+			/* we do zlc_setup after the first zone is tried */
+			allowednodes = zlc_setup(zonelist, alloc_flags);
+			zlc_active = 1;
+			did_zlc_setup = 1;
+		}
 	} while (*(++z) != NULL);
+
+	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
+		/* Disable zlc cache for second zonelist scan */
+		zlc_active = 0;
+		goto zonelist_scan;
+	}
 	return page;
 }
 
@@ -1535,6 +1679,24 @@ static void __meminit build_zonelists(pg_data_t *pgdat)
 	}
 }
 
+/* Construct the zonelist performance cache - see further mmzone.h */
+static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+{
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zonelist *zonelist;
+		struct zonelist_cache *zlc;
+		struct zone **z;
+
+		zonelist = pgdat->node_zonelists + i;
+		zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+		for (z = zonelist->zones; *z; z++)
+			zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+	}
+}
+
 #else	/* CONFIG_NUMA */
 
 static void __meminit build_zonelists(pg_data_t *pgdat)
@@ -1572,14 +1734,26 @@ static void __meminit build_zonelists(pg_data_t *pgdat)
 	}
 }
 
+/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
+static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+{
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		pgdat->node_zonelists[i].zlcache_ptr = NULL;
+}
+
 #endif	/* CONFIG_NUMA */
 
 /* return values int ....just for stop_machine_run() */
 static int __meminit __build_all_zonelists(void *dummy)
 {
 	int nid;
-	for_each_online_node(nid)
+
+	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
+		build_zonelist_cache(NODE_DATA(nid));
+	}
 	return 0;
 }
 
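
The control flow that the page_alloc.c hunks add to get_page_from_freelist() can be modelled in user space roughly as below. This is a simplified sketch under stated assumptions, not the kernel function: toy_zonelist, toy_get_page, NZONES and the zones_examined counter are hypothetical, a boolean array stands in for the fullzones bitmap, and cpuset checks, watermarks, zone reclaim and the one-second expiry are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NZONES 4	/* hypothetical zonelist length */

struct toy_zonelist {
	int free_pages[NZONES];		/* stand-in for per-zone free memory */
	bool recently_full[NZONES];	/* stand-in for the fullzones bitmap */
	int zones_examined;		/* counts the expensive per-zone checks */
};

/* Returns the index of the zone that satisfied the request, or -1. */
static int toy_get_page(struct toy_zonelist *zl)
{
	bool cache_active = false;	/* plays the role of zlc_active */
	bool did_setup = false;		/* plays the role of did_zlc_setup */
	int i;

	assert(zl != NULL);
rescan:
	for (i = 0; i < NZONES; i++) {
		/* Cheap skip of zones recently found full (never used on zone 0
		 * during the first scan, because the cache is not yet active). */
		if (cache_active && zl->recently_full[i])
			continue;

		zl->zones_examined++;
		if (zl->free_pages[i] > 0) {
			zl->free_pages[i]--;
			return i;
		}

		zl->recently_full[i] = true;	/* like zlc_mark_zone_full() */
		if (!did_setup) {
			/* Enable the cache only after the first zone has failed. */
			cache_active = true;
			did_setup = true;
		}
	}

	if (cache_active) {
		/* Nothing found: rescan every zone with the cache disabled. */
		cache_active = false;
		goto rescan;
	}
	return -1;
}
```

With free pages only in the last of four zones, the first call examines all four zones, while later calls examine just the first zone and the last one. Once memory runs out everywhere, the failing call pays for one cached scan plus one full rescan before giving up, which mirrors the "leave no stone unturned" second pass in the patch.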