[PATCH] memory page_alloc zonelist caching speedup

Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second.  It also stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago.  Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last week's patch had: on systems
    with multiple zones per node (such as a DMA zone), it would
    take seeing any one of these zones full as meaning that all
    zones on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results.  After all that
    discussion with David on why I didn't need them, I went and got
    some ;).  I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably.  At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory used up to 10% fewer System CPU cycles and less elapsed
    clock time ('sys' and 'real').  Good.  See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct.  That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

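To make the summary above concrete, here is a minimal user-space sketch of
the idea, not the kernel code itself: a bitmask of recently full zones that
is cleared once it is more than a second old, plus a compact zone-to-node
table.  The names zl_cache, zlc_init, zlc_refresh, zlc_mark_full,
zlc_worth_trying, MAX_ZONES and TICKS_PER_SEC are invented for this
illustration, and a plain 64-bit word stands in for both the kernel bitmap
and nodemask_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_ZONES     64      /* hypothetical cap on zones per zonelist */
#define TICKS_PER_SEC 1000UL  /* hypothetical tick rate, stands in for HZ */

struct zl_cache {
    unsigned short z_to_n[MAX_ZONES]; /* zone index -> node id */
    uint64_t fullzones;               /* bit i set: zone i was recently full */
    unsigned long last_full_zap;      /* tick at which fullzones was last cleared */
};

static void zlc_init(struct zl_cache *zlc, unsigned long now)
{
    memset(zlc, 0, sizeof(*zlc));
    zlc->last_full_zap = now;
}

/* Clear the "full" hints once they are more than one second old. */
static void zlc_refresh(struct zl_cache *zlc, unsigned long now)
{
    /* Unsigned subtraction stays correct if the tick counter wraps. */
    if (now - zlc->last_full_zap > TICKS_PER_SEC) {
        zlc->fullzones = 0;
        zlc->last_full_zap = now;
    }
}

/* Remember that zone i was found short of free memory. */
static void zlc_mark_full(struct zl_cache *zlc, int i)
{
    assert(i >= 0 && i < MAX_ZONES);
    zlc->fullzones |= UINT64_C(1) << i;
}

/* A zone is worth trying if its node is allowed and it is not marked full. */
static bool zlc_worth_trying(const struct zl_cache *zlc, int i,
                             uint64_t allowed_nodes)
{
    assert(i >= 0 && i < MAX_ZONES);
    unsigned int n = zlc->z_to_n[i];
    assert(n < 64);  /* this sketch only models up to 64 nodes */
    bool allowed = (allowed_nodes >> n) & 1;
    bool full = (zlc->fullzones >> i) & 1;
    return allowed && !full;
}
```

Because the hint is kept per zone rather than per node, marking one zone
full leaves the other zones on the same node eligible.
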
There are two reasons I pursued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay
   adding any work for this new mechanism until we've looked at the
   first zone in zonelist.  I figured the odds of the first zone
   having the memory we needed were high enough that we should just
   look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.
   The theory is that this is just performance hint data, and the
   memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_cache' fields
   'fullzones' (a bitmask) and 'last_full_zap' (unsigned long jiffies).
   It should all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cache line's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.

 - I pay no attention to the various watermarks and such in this
   performance hint.  A node could be marked full for one watermark,
   and then skipped over when searching for a page using a different
   watermark.  I think that's actually quite ok, as it will tend to
   slightly increase the spreading of memory over other nodes, away
   from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch a lot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.)  System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added.  The table also shows the
percentage by which the zonelist caching kernel's sys time is
lower than that of the stock *-mm kernel.

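As a small worked example of the percentage column just described, the
sketch below computes how much lower the patched sys time is than the stock
sys time.  The function name sys_percent_better is made up for this
illustration, and truncating toward zero is an assumption: the text does
not say how the column was rounded, but truncation is consistent with the
reported figures.

```c
#include <assert.h>

/*
 * Percentage by which the patched kernel's sys time is lower than the
 * stock kernel's.  C integer division truncates toward zero; that
 * rounding choice is an assumption, not something the source states.
 */
static int sys_percent_better(int stock_sys, int patched_sys)
{
    assert(stock_sys > 0);
    return (stock_sys - patched_sys) * 100 / stock_sys;
}
```

For instance, the 90 GByte, 19 thread run went from 2392 to 2116 sys
seconds, which this gives as 11, and the 38 GByte, 55 thread run went from
501 to 511, which it gives as -1.
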
     number      2.6.18-mm3        zonelist-cache     delta (< 0 good)    percent
 GBs     N    -----------------   -----------------   -----------------   systime
 mem threads    sys  user  real     sys  user  real     sys  user  real    better
  12     1      153    24   177     151    24   176      -2     0    -1        1%
  12    19       99    22     8      99    22     8       0     0     0        0%
  12    37      111    25     6     112    25     6       1     0     0       -0%
  12    55      115    25     5     110    23     5      -5    -2     0        4%
  38     1      502    74   576     497    73   570      -5    -1    -6        0%
  38    19      426    78    48     373    76    39     -53    -2    -9       12%
  38    37      544    83    36     547    82    36       3    -1     0       -0%
  38    55      501    77    23     511    80    24      10     3     1       -1%
  64     1      917   125  1042     890   124  1014     -27    -1   -28        2%
  64    19     1118   138   119     965   141   103    -153     3   -16       13%
  64    37     1202   151    94    1136   150    81     -66    -1   -13        5%
  64    55     1118   141    61    1072   140    58     -46    -1    -3        4%
  90     1     1342   177  1519    1275   174  1450     -67    -3   -69        4%
  90    19     2392   199   192    2116   189   176    -276   -10   -16       11%
  90    37     3313   238   175    2972   225   145    -341   -13   -30       10%
  90    55     1948   210   104    1843   213   100    -105     3    -4        5%

Notes:
 1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.
 2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass.  The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass.  For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.
 3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.
 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
 5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest.  I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory a long way away.  Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes; let's hope not.
 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect.  Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallelizable workload.
 7) The intermediate thread count passes, when asking for a lot of memory
    forcing them to go to a few neighboring nodes, improved the most with this
    zonelist caching patch.

Conclusions:

 * This zonelist cache patch probably makes little difference one way or the
   other for most workloads on real numa hardware, if those workloads avoid
   heavy off node allocations.
 * For memory intensive workloads requiring substantial off-node allocations
   on real numa hardware, this patch improves both kernel and elapsed timings
   by up to ten percent.
 * For fake numa systems, I'm optimistic, but will have to leave that up to
   Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
author: Paul Jackson <pj@sgi.com> 2006-12-06 23:31:48 -0500
committer: Linus Torvalds <torvalds@woody.osdl.org> 2006-12-07 11:39:20 -0500
commit: 9276b1bc96a132f4068fdee00983c532f43d3a26 (patch)
tree: 04d64444cf6558632cfc7514b5437578b5e616af /mm
parent: 89689ae7f95995723fbcd5c116c47933a3bb8b13 (diff)
2 files changed, 183 insertions, 7 deletions
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 617fb31086ee..fb907236bbd8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -141,9 +141,11 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
        enum zone_type k;
        max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
+        max++;                  /* space for zlcache_ptr (see mmzone.h) */
        zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
        if (!zl)
                return NULL;
+        zl->zlcache_ptr = NULL;
        num = 0;
        /* First put in the highest zones from all nodes, then all the next 
           lower zones etc. Avoid empty zones because the memory allocator
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23bc5bcbdcf9..230771d3c6b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -918,6 +918,126 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
        return 1;
 }
+#ifdef CONFIG_NUMA
+/*
+ * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
+ * skip over zones that are not allowed by the cpuset, or that have
+ * been recently (in last second) found to be nearly full.  See further
+ * comments in mmzone.h.  Reduces cache footprint of zonelist scans
+ * that have to skip over alot of full or unallowed zones.
+ *
+ * If the zonelist cache is present in the passed in zonelist, then
+ * returns a pointer to the allowed node mask (either the current
+ * tasks mems_allowed, or node_online_map.)
+ *
+ * If the zonelist cache is not available for this zonelist, does
+ * nothing and returns NULL.
+ *
+ * If the fullzones BITMAP in the zonelist cache is stale (more than
+ * a second since last zap'd) then we zap it out (clear its bits.)
+ *
+ * We hold off even calling zlc_setup, until after we've checked the
+ * first zone in the zonelist, on the theory that most allocations will
+ * be satisfied from that first zone, so best to examine that zone as
+ * quickly as we can.
+ */
+static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
+{
+        struct zonelist_cache *zlc;     /* cached zonelist speedup info */
+        nodemask_t *allowednodes;       /* zonelist_cache approximation */
+        zlc = zonelist->zlcache_ptr;
+        if (!zlc)
+                return NULL;
+        if (jiffies - zlc->last_full_zap > 1 * HZ) {
+                bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+                zlc->last_full_zap = jiffies;
+        }
+        allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
+                                        &cpuset_current_mems_allowed :
+                                        &node_online_map;
+        return allowednodes;
+}
+/*
+ * Given 'z' scanning a zonelist, run a couple of quick checks to see
+ * if it is worth looking at further for free memory:
+ *  1) Check that the zone isn't thought to be full (doesn't have its
+ *     bit set in the zonelist_cache fullzones BITMAP).
+ *  2) Check that the zones node (obtained from the zonelist_cache
+ *     z_to_n[] mapping) is allowed in the passed in allowednodes mask.
+ * Return true (non-zero) if zone is worth looking at further, or
+ * else return false (zero) if it is not.
+ *
+ * This check -ignores- the distinction between various watermarks,
+ * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ...  If a zone is
+ * found to be full for any variation of these watermarks, it will
+ * be considered full for up to one second by all requests, unless
+ * we are so low on memory on all allowed nodes that we are forced
+ * into the second scan of the zonelist.
+ *
+ * In the second scan we ignore this zonelist cache and exactly
+ * apply the watermarks to all zones, even it is slower to do so.
+ * We are low on memory in the second scan, and should leave no stone
+ * unturned looking for a free page.
+ */
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+                                                nodemask_t *allowednodes)
+{
+        struct zonelist_cache *zlc;     /* cached zonelist speedup info */
+        int i;                          /* index of *z in zonelist zones */
+        int n;                          /* node that zone *z is on */
+        zlc = zonelist->zlcache_ptr;
+        if (!zlc)
+                return 1;
+        i = z - zonelist->zones;
+        n = zlc->z_to_n[i];
+        /* This zone is worth trying if it is allowed but not full */
+        return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
+}
+/*
+ * Given 'z' scanning a zonelist, set the corresponding bit in
+ * zlc->fullzones, so that subsequent attempts to allocate a page
+ * from that zone don't waste time re-examining it.
+ */
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+        struct zonelist_cache *zlc;     /* cached zonelist speedup info */
+        int i;                          /* index of *z in zonelist zones */
+        zlc = zonelist->zlcache_ptr;
+        if (!zlc)
+                return;
+        i = z - zonelist->zones;
+        set_bit(i, zlc->fullzones);
+}
+#else   /* CONFIG_NUMA */
+static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
+{
+        return NULL;
+}
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+                                nodemask_t *allowednodes)
+{
+        return 1;
+}
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+{
+}
+#endif  /* CONFIG_NUMA */
 /*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
@@ -926,23 +1046,32 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist, int alloc_flags)
 {
-        struct zone **z = zonelist->zones;
+        struct zone **z;
        struct page *page = NULL;
-        int classzone_idx = zone_idx(*z);
+        int classzone_idx = zone_idx(zonelist->zones[0]);
        struct zone *zone;
+        nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
+        int zlc_active = 0;             /* set if using zonelist_cache */
+        int did_zlc_setup = 0;          /* just call zlc_setup() one time */
+zonelist_scan:
        /*
-         * Go through the zonelist once, looking for a zone with enough free.
+         * Scan zonelist, looking for a zone with enough free.
         * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
         */
+        z = zonelist->zones;
        do {
+                if (NUMA_BUILD && zlc_active &&
+                        !zlc_zone_worth_trying(zonelist, z, allowednodes))
+                                continue;
                zone = *z;
                if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
                        zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
                                break;
                if ((alloc_flags & ALLOC_CPUSET) &&
                        !cpuset_zone_allowed(zone, gfp_mask))
-                                continue;
+                                goto try_next_zone;
                if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
                        unsigned long mark;
@@ -956,15 +1085,30 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                                    classzone_idx, alloc_flags)) {
                                if (!zone_reclaim_mode ||
                                    !zone_reclaim(zone, gfp_mask, order))
-                                        continue;
+                                        goto this_zone_full;
                        }
                }
                page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
                if (page)
                        break;
+this_zone_full:
+                if (NUMA_BUILD)
+                        zlc_mark_zone_full(zonelist, z);
+try_next_zone:
+                if (NUMA_BUILD && !did_zlc_setup) {
+                        /* we do zlc_setup after the first zone is tried */
+                        allowednodes = zlc_setup(zonelist, alloc_flags);
+                        zlc_active = 1;
+                        did_zlc_setup = 1;
+                }
        } while (*(++z) != NULL);
+        if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
+                /* Disable zlc cache for second zonelist scan */
+                zlc_active = 0;
+                goto zonelist_scan;
+        }
        return page;
 }
@@ -1535,6 +1679,24 @@ static void __meminit build_zonelists(pg_data_t *pgdat)
        }
 }
+/* Construct the zonelist performance cache - see further mmzone.h */
+static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+{
+        int i;
+        for (i = 0; i < MAX_NR_ZONES; i++) {
+                struct zonelist *zonelist;
+                struct zonelist_cache *zlc;
+                struct zone **z;
+                zonelist = pgdat->node_zonelists + i;
+                zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+                bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+                for (z = zonelist->zones; *z; z++)
+                        zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+        }
+}
 #else   /* CONFIG_NUMA */
 static void __meminit build_zonelists(pg_data_t *pgdat)
@@ -1572,14 +1734,26 @@ static void __meminit build_zonelists(pg_data_t *pgdat)
        }
 }
+/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
+static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+{
+        int i;
+        for (i = 0; i < MAX_NR_ZONES; i++)
+                pgdat->node_zonelists[i].zlcache_ptr = NULL;
+}
 #endif  /* CONFIG_NUMA */
 /* return values int ....just for stop_machine_run() */
 static int __meminit __build_all_zonelists(void *dummy)
 {
        int nid;
-        for_each_online_node(nid)
+        for_each_online_node(nid) {
                build_zonelists(NODE_DATA(nid));
+                build_zonelist_cache(NODE_DATA(nid));
+        }
        return 0;
 }
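
The control flow that this patch adds to get_page_from_freelist() can be
hard to follow in diff form, so here is a stand-alone sketch of it under
simplifying assumptions.  It is a toy model, not the kernel function: the
struct toy_zonelist, its fields, NZONES and toy_get_page() are invented for
this illustration, a zone counts as "full" when its page counter is zero,
and the cpuset check and the one-second zap of the full-zone bits are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NZONES 4  /* hypothetical zonelist length */

struct toy_zonelist {
    int free_pages[NZONES]; /* pages left in each zone */
    uint64_t fullzones;     /* bit i set: zone i was recently found full */
    int examined;           /* how many zones got the expensive check */
};

/* Returns the index of the zone a page was taken from, or -1 if none. */
static int toy_get_page(struct toy_zonelist *zl)
{
    bool zlc_active = false;    /* consult the cache only after zone 0 */
    bool did_zlc_setup = false; /* turn the cache on at most once */
    int found = -1;

    assert(zl);

zonelist_scan:
    for (int i = 0; i < NZONES; i++) {
        /* Cheap test first: skip zones the cache says are full. */
        if (zlc_active && ((zl->fullzones >> i) & 1))
            continue;

        zl->examined++;         /* stands in for the watermark checks */
        if (zl->free_pages[i] > 0) {
            zl->free_pages[i]--;
            found = i;
            break;
        }

        /* this_zone_full: remember it for later callers. */
        zl->fullzones |= UINT64_C(1) << i;

        /* try_next_zone: enable the cache once the first zone has failed. */
        if (!did_zlc_setup) {
            zlc_active = true;
            did_zlc_setup = true;
        }
    }

    /* Nothing found with the cache on: rescan every zone without it. */
    if (found < 0 && zlc_active) {
        zlc_active = false;
        goto zonelist_scan;
    }
    return found;
}
```

In this model the first zone is always examined directly, zones already
marked full are skipped on later calls, and a stale "full" bit costs only
one extra pass because the second scan ignores the cache.
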