author	Vlastimil Babka <vbabka@suse.cz>	2017-01-24 18:18:32 -0500
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-02-01 02:33:04 -0500
commit	ade7afe9dca6b13919f88abd38eefe32f22eaeb3 (patch)
tree	0a81138f31ff4ec6e24e1cf7698c0997486f92ab
parent	9b1a1ae9b5281e12893a34bcbd1686b5bcd2cd82 (diff)
mm, page_alloc: fix check for NULL preferred_zone
commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.

Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

This is v2 of my attempt to fix the recent report based on LTP cpuset stress test [1]. The intention is to go to stable 4.9 LTSS with this, as triggering repeated OOMs is not nice. That's why the patches try not to be too intrusive.

Unfortunately, while investigating I found that modifying the testcase to use per-VMA policies instead of per-task policies brings the OOMs back, but that seems to be a much older and harder-to-fix problem. I have posted an RFC [2], but I believe that fixing the recent regressions has a higher priority.

Longer term, we might try to think about how to fix the cpuset mess in a better and less error-prone way. I was, for example, very surprised to learn that cpuset updates change not only task->mems_allowed, but also the nodemask of mempolicies. Until now I expected the nodemask parameter to alloc_pages_nodemask() to be stable. I wonder why we then treat cpusets specially in get_page_from_freelist() and distinguish HARDWALL etc., when there's an unconditional intersection between mempolicy and cpuset. I would expect the nodemask adjustment to save overhead in g_p_f(), but that clearly doesn't happen in the current form. So we have both crazy complexity and overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

This patch (of 4):

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") we have a wrong check for a NULL preferred_zone, which can theoretically happen due to concurrent cpuset modification. We check the zoneref pointer, which is never NULL, when we should check the zone pointer. Also document this in the first_zones_zonelist() comment, per Michal Hocko.
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-rw-r--r--	include/linux/mmzone.h	6
-rw-r--r--	mm/page_alloc.c	2
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..f99c993dd500 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -972,12 +972,16 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
  * @zonelist - The zonelist to search for a suitable zone
  * @highest_zoneidx - The zone index of the highest zone to return
  * @nodes - An optional nodemask to filter the zonelist with
- * @zone - The first suitable zone found is returned via this parameter
+ * @return - Zoneref pointer for the first suitable zone found (see below)
  *
  * This function returns the first zone at or below a given zone index that is
  * within the allowed nodemask. The zoneref returned is a cursor that can be
  * used to iterate the zonelist with next_zones_zonelist by advancing it by
  * one before calling.
+ *
+ * When no eligible zone is found, zoneref->zone is NULL (zoneref itself is
+ * never NULL). This may happen either genuinely, or due to concurrent nodemask
+ * update due to cpuset modification.
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34ada718ef47..593a11d8bc6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3763,7 +3763,7 @@ retry_cpuset:
 	 */
 	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
 					ac.high_zoneidx, ac.nodemask);
-	if (!ac.preferred_zoneref) {
+	if (!ac.preferred_zoneref->zone) {
 		page = NULL;
 		goto no_zone;
 	}