aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* vmscan: merge duplicate code in shrink_active_list()Wu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "move pages to active list" and "move pages to inactive list" code blocks are mostly identical and can be served by a function. Thanks to Andrew Morton for pointing this out. Note that buffer_heads_over_limit check will also be carried out for re-activated pages, which is slightly different from pre-2.6.28 kernels. Also, Rik's "vmscan: evict use-once pages first" patch could totally stop scans of active file list when memory pressure is low. So the net effect could be, the number of buffer heads is now more likely to grow large. However that's fine according to Johannes' comments: I don't think that this could be harmful. We just preserve the buffer mappings of what we consider the working set and with low memory pressure, as you say, this set is not big. As to stripping of reactivated pages: the only pages we re-activate for now are those VM_EXEC mapped ones. Since we don't expect IO from or to these pages, removing the buffer mappings in case they grow too large should be okay, I guess. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: make mapped executable pages the first class citizenWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Protect referenced PROT_EXEC mapped pages from being deactivated. PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some currently running executables and their linked libraries, they shall really be cached aggressively to provide good user experiences. Thanks to Johannes Weiner for the advice to reuse the VMA walk in page_referenced() to get the PROT_EXEC bit. [more details] ( The consequences of this patch will have to be discussed together with Rik van Riel's recent patch "vmscan: evict use-once pages first". ) ( Some of the good points and insights are taken into this changelog. Thanks to all the involved people for the great LKML discussions. ) the problem =========== For a typical desktop, the most precious working set is composed of *actively accessed* (1) memory mapped executables (2) and their anonymous pages (3) and other files (4) and the dcache/icache/.. slabs while the least important data are (5) infrequently used or use-once files For a typical desktop, one major problem is busty and large amount of (5) use-once files flushing out the working set. Inside the working set, (4) dcache/icache have already been too sticky ;-) So we only have to care (2) anonymous and (1)(3) file pages. anonymous pages =============== Anonymous pages are effectively immune to the streaming IO attack, because we now have separate file/anon LRU lists. When the use-once files crowd into the file LRU, the list's "quality" is significantly lowered. Therefore the scan balance policy in get_scan_ratio() will choose to scan the (low quality) file LRU much more frequently than the anon LRU. file pages ========== Rik proposed to *not* scan the active file LRU when the inactive list grows larger than active list. This guarantees that when there are use-once streaming IO, and the working set is not too large(so that active_size < inactive_size), the active file LRU will *not* be scanned at all. So the not-too-large working set can be well protected. But there are also situations where the file working set is a bit large so that (active_size >= inactive_size), or the streaming IOs are not purely use-once. In these cases, the active list will be scanned slowly. Because the current shrink_active_list() policy is to deactivate active pages regardless of their referenced bits. The deactivated pages become susceptible to the streaming IO attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that the deactivated pages don't have enough time to get re-referenced. Because a user tend to switch between windows in intervals from seconds to minutes. This patch holds mapped executable pages in the active list as long as they are referenced during each full scan of the active list. Because the active list is normally scanned much slower, they get longer grace time (eg. 100s) for further references, which better matches the pace of user operations. Therefore this patch greatly prolongs the in-cache time of executable code, when there are moderate memory pressures. before patch: guaranteed to be cached if reference intervals < I after patch: guaranteed to be cached if reference intervals < I+A (except when randomly reclaimed by the lumpy reclaim) where A = time to fully scan the active file LRU I = time to fully scan the inactive file LRU Note that normally A >> I. side effects ============ This patch is safe in general, it restores the pre-2.6.28 mmap() behavior but in a much smaller and well targeted scope. One may worry about some one to abuse the PROT_EXEC heuristic. But as Andrew Morton stated, there are other tricks to getting that sort of boost. Another concern is the PROT_EXEC mapped pages growing large in rare cases, and therefore hurting reclaim efficiency. But a sane application targeted for large audience will never use PROT_EXEC for data mappings. If some home made application tries to abuse that bit, it shall be aware of the consequences. If it is abused to scale of 2/3 total memory, it gains nothing but overheads. benchmarks ========== 1) memory tight desktop 1.1) brief summary - clock time and major faults are reduced by 50%; - pswpin numbers are reduced to ~1/3. That means X desktop responsiveness is doubled under high memory/swap pressure. 1.2) test scenario - nfsroot gnome desktop with 512M physical memory - run some programs, and switch between the existing windows after starting each new program. 1.3) progress timing (seconds) before after programs 0.02 0.02 N xeyes 0.75 0.76 N firefox 2.02 1.88 N nautilus 3.36 3.17 N nautilus --browser 5.26 4.89 N gthumb 7.12 6.47 N gedit 9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf 13.58 12.55 N xterm 15.87 14.57 N mlterm 18.63 17.06 N gnome-terminal 21.16 18.90 N urxvt 26.24 23.48 N gnome-system-monitor 28.72 26.52 N gnome-help 32.15 29.65 N gnome-dictionary 39.66 36.12 N /usr/games/sol 43.16 39.27 N /usr/games/gnometris 48.65 42.56 N /usr/games/gnect 53.31 47.03 N /usr/games/gtali 58.60 52.05 N /usr/games/iagno 65.77 55.42 N /usr/games/gnotravex 70.76 61.47 N /usr/games/mahjongg 76.15 67.11 N /usr/games/gnome-sudoku 86.32 75.15 N /usr/games/glines 92.21 79.70 N /usr/games/glchess 103.79 88.48 N /usr/games/gnomine 113.84 96.51 N /usr/games/gnotski 124.40 102.19 N /usr/games/gnibbles 137.41 114.93 N /usr/games/gnobots2 155.53 125.02 N /usr/games/blackjack 179.85 135.11 N /usr/games/same-gnome 224.49 154.50 N /usr/bin/gnome-window-properties 248.44 162.09 N /usr/bin/gnome-default-applications-properties 282.62 173.29 N /usr/bin/gnome-at-properties 323.72 188.21 N /usr/bin/gnome-typing-monitor 363.99 199.93 N /usr/bin/gnome-at-visual 394.21 206.95 N /usr/bin/gnome-sound-properties 435.14 224.49 N /usr/bin/gnome-at-mobility 463.05 234.11 N /usr/bin/gnome-keybinding-properties 503.75 248.59 N /usr/bin/gnome-about-me 554.00 276.27 N /usr/bin/gnome-display-properties 615.48 304.39 N /usr/bin/gnome-network-preferences 693.03 342.01 N /usr/bin/gnome-mouse-properties 759.90 388.58 N /usr/bin/gnome-appearance-properties 937.90 508.47 N /usr/bin/gnome-control-center 1109.75 587.57 N /usr/bin/gnome-keyboard-properties 1399.05 758.16 N : oocalc 1524.64 830.03 N : oodraw 1684.31 900.03 N : ooimpress 1874.04 993.91 N : oomath 2115.12 1081.89 N : ooweb 2369.02 1161.99 N : oowriter Note that the last ": oo*" commands are actually commented out. 1.4) vmstat numbers (some relevant ones are marked with *) before after nr_free_pages 1293 3898 nr_inactive_anon 59956 53460 nr_active_anon 26815 30026 nr_inactive_file 2657 3218 nr_active_file 2019 2806 nr_unevictable 4 4 nr_mlock 4 4 nr_anon_pages 26706 27859 *nr_mapped 3542 4469 nr_file_pages 72232 67681 nr_dirty 1 0 nr_writeback 123 19 nr_slab_reclaimable 3375 3534 nr_slab_unreclaimable 11405 10665 nr_page_table_pages 8106 7864 nr_unstable 0 0 nr_bounce 0 0 *nr_vmscan_write 394776 230839 nr_writeback_temp 0 0 numa_hit 6843353 3318676 numa_miss 0 0 numa_foreign 0 0 numa_interleave 1719 1719 numa_local 6843353 3318676 numa_other 0 0 *pgpgin 5954683 2057175 *pgpgout 1578276 922744 *pswpin 1486615 512238 *pswpout 394568 230685 pgalloc_dma 277432 56602 pgalloc_dma32 6769477 3310348 pgalloc_normal 0 0 pgalloc_movable 0 0 pgfree 7048396 3371118 pgactivate 2036343 1471492 pgdeactivate 2189691 1612829 pgfault 3702176 3100702 *pgmajfault 452116 201343 pgrefill_dma 12185 7127 pgrefill_dma32 334384 653703 pgrefill_normal 0 0 pgrefill_movable 0 0 pgsteal_dma 74214 22179 pgsteal_dma32 3334164 1638029 pgsteal_normal 0 0 pgsteal_movable 0 0 pgscan_kswapd_dma 1081421 1216199 pgscan_kswapd_dma32 58979118 46002810 pgscan_kswapd_normal 0 0 pgscan_kswapd_movable 0 0 pgscan_direct_dma 2015438 1086109 pgscan_direct_dma32 55787823 36101597 pgscan_direct_normal 0 0 pgscan_direct_movable 0 0 pginodesteal 3461 7281 slabs_scanned 564864 527616 kswapd_steal 2889797 1448082 kswapd_inodesteal 14827 14835 pageoutrun 43459 21562 allocstall 9653 4032 pgrotated 384216 228631 1.5) free numbers at the end of the tests before patch: total used free shared buffers cached Mem: 474 467 7 0 0 236 -/+ buffers/cache: 230 243 Swap: 1023 418 605 after patch: total used free shared buffers cached Mem: 474 457 16 0 0 236 -/+ buffers/cache: 221 253 Swap: 1023 404 619 2) memory flushing in a file server 2.1) brief summary The number of major faults from 50 to 3 during 10% cache hot reads. That means this patch successfully stops major faults when the active file list is slowly scanned when there are partially cache hot streaming IO. 2.2) test scenario Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the pages will be activated: for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10 iotrace.rb --load pattern-hot-10 --play /b/sparse vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree and monitor /proc/vmstat during the time. The test box has 2G memory. I carried out tests on fresh booted console as well as X desktop, and fetched the vmstat numbers on (1) begin: shortly after the big read IO starts; (2) end: just before the big read IO stops; (3) restore: the big read IO stops and the zsh working set restored (4) restore X: after IO, switch back and forth between the urxvt and firefox windows to restore their working set. 2.3) console mode results nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree 2.6.29 VM_EXEC protection ON: begin: 2481 2237 8694 630 0 574299 end: 275 231976 233914 633 776271 20933042 restore: 370 232154 234524 691 777183 20958453 2.6.29 VM_EXEC protection ON (second run): begin: 2434 2237 8493 629 0 574195 end: 284 231970 233536 632 771918 20896129 restore: 399 232218 234789 690 774526 20957909 2.6.30-rc4-mm VM_EXEC protection OFF: begin: 2479 2344 9659 210 0 579643 end: 284 232010 234142 260 772776 20917184 restore: 379 232159 234371 301 774888 20967849 The above console numbers show that - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29. I'd attribute that improvement to the mmap readahead improvements :-) - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50. That's a huge improvement - which means with the VM_EXEC protection logic, active mmap pages is pretty safe even under partially cache hot streaming IO. - when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8 under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree) That roughly means the active mmap pages get 20.8 more chances to get re-referenced to stay in memory. - The absolute nr_mapped drops considerably to 1/9 during the big IO, and the dropped pages are mostly inactive ones. The patch has almost no impact in this aspect, that means it won't unnecessarily increase memory pressure. (In contrast, your 20% mmap protection ratio will keep them all, and therefore eliminate the extra 41 major faults to restore working set of zsh etc.) The iotrace.rb read throughput is 151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse which means the inactive list is rotated at the speed of 250MB/s, so a full scan of which takes about 3.5 seconds, while a full scan of active file list takes about 77 seconds. 2.4) X mode results We can reach roughly the same conclusions for X desktop: nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree 2.6.30-rc4-mm VM_EXEC protection ON: begin: 9740 8920 64075 561 0 678360 end: 768 218254 220029 565 798953 21057006 restore: 857 218543 220987 606 799462 21075710 restore X: 2414 218560 225344 797 799462 21080795 2.6.30-rc4-mm VM_EXEC protection OFF: begin: 9368 5035 26389 554 0 633391 end: 770 218449 221230 661 646472 17832500 restore: 1113 218466 220978 710 649881 17905235 restore X: 2687 218650 225484 947 802700 21083584 - the absolute nr_mapped drops considerably (to 1/13 of the original size) during the streaming IO. - the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393 during the whole process. Cc: Elladan <elladan@eskimo.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: report vm_flags in page_referenced()Wu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | Collect vma->vm_flags of the VMAs that actually referenced the page. This is preparing for more informed reclaim heuristics, eg. to protect executable file pages more aggressively. For now only the VM_EXEC bit will be used by the caller. Thanks to Johannes, Peter and Minchan for all the good tips. Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add a gfp-translate script to help understand page allocation failure ↵Mel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reports The page allocation failure messages include a line that looks like page allocation failure. order:1, mode:0x4020 The mode is easy to translate but irritating for the lazy and a bit error prone. This patch adds a very simple helper script gfp-translate for the mode: portion of the page allocation failure messages. An example usage looks like mel@machina:~/linux-2.6 $ scripts/gfp-translate 0x4020 Source: /home/mel/linux-2.6 Parsing: 0x4020 #define __GFP_HIGH (0x20) /* Should access emergency pools? */ #define __GFP_COMP (0x4000) /* Add compound page metadata */ The script is not a work of art but it has come in handy for me a few times so I thought I would share. [akpm@linux-foundation.org: clarify an error message] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Hellwig <hch@infradead.org> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm cleanup: shmem_file_setup: 'char *' -> 'const char *' for name argumentSergei Trofimovich2009-06-16
| | | | | | | | | | As function shmem_file_setup does not modify/allocate/free/pass given filename - mark it as const. Signed-off-by: Sergei Trofimovich <slyfox@inbox.ru> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove file argument from swap_readpage()Minchan Kim2009-06-16
| | | | | | | | | | | | The file argument resulted from address_space's readpage long time ago. We don't use it any more. Let's remove unnecessary argement. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove annotation of gfp_mask in add_to_swapMinchan Kim2009-06-16
| | | | | | | | | | Hugh removed add_to_swap's gfp_mask argument. (mm: remove gfp_mask from add_to_swap) So we have to remove annotation of gfp_mask of the function. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: clear N_HIGH_MEMORY map before we set it againYinghai Lu2009-06-16
| | | | | | | | | | | | | | | | | SRAT tables may contains nodes of very small size. The arch code may decide to not activate such a node. However, currently the early boot code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be active although these nodes have no present pages. For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too Signed-off-by: Yinghai Lu <Yinghai@kernel.org> Tested-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove __invalidate_mapping_pages variantMike Waychison2009-06-16
| | | | | | | | | | | | | | | | | Remove __invalidate_mapping_pages atomic variant now that its sole caller can sleep (fixed in eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb ("vfs: fix lock inversion in drop_pagecache_sb()")). This fixes softlockups that can occur while in the drop_caches path. Signed-off-by: Mike Waychison <mikew@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: invoke oom killer for __GFP_NOFAILDavid Rientjes2009-06-16
| | | | | | | | | | | | | | | | The oom killer must be invoked regardless of the order if the allocation is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free some memory. Cc: Nick Piggin <npiggin@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: avoid unnecessary mm locking and scanning for OOM_DISABLEDavid Rientjes2009-06-16
| | | | | | | | | | | | | | | | | | | | | This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: move oom_adj value from task_struct to mm_structDavid Rientjes2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: reuse unused swap entry if necessaryKAMEZAWA Hiroyuki2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | Presently we can know a swap entry is just used as SwapCache via swap_map, without looking up swap cache. Then, we have a chance to reuse swap-cache-only swap entries in get_swap_pages(). This patch tries to free swap-cache-only swap entries if swap is not enough. Note: We hit following path when swap_cluster code cannot find a free cluster. Then, vm_swap_full() is not only condition to allow the kernel to reclaim unused swap. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: modify swap_map and add SWAP_HAS_CACHE flagKAMEZAWA Hiroyuki2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a part of the patches for fixing memcg's swap accountinf leak. But, IMHO, not a bad patch even if no memcg. There are 2 kinds of references to swap. - reference from swap entry - reference from swap cache Then, - If there is swap cache && swap's refcnt is 1, there is only swap cache. (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL This counting logic have worked well for a long time. But considering that we cannot know there is a _real_ reference or not by swap_map[], current usage of counter is not very good. This patch adds a flag SWAP_HAS_CACHE and recored information that a swap entry has a cache or not. This will remove -1 magic used in swapfile.c and be a help to avoid unnecessary find_get_page(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add swap cache interface for swap referenceKAMEZAWA Hiroyuki2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | In a following patch, the usage of swap cache is recorded into swap_map. This patch is for necessary interface changes to do that. 2 interfaces: - swapcache_prepare() - swapcache_free() are added for allocating/freeing refcnt from swap-cache to existing swap entries. But implementation itself is not changed under this patch. At adding swapcache_free(), memcg's hook code is moved under swapcache_free(). This is better than using scattered hooks. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove CONFIG_UNEVICTABLE_LRU config optionKOSAKI Motohiro2009-06-16
| | | | | | | | | | | | | | | | Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: reset wmark_min and inactive ratio of zone when hotplug happensMinchan Kim2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | Solve two problems. Whenever memory hotplug sucessfully happens, zone->present_pages have to be changed. 1) Now memory hotplug calls setup_per_zone_wmark_min only when online_pages called, not offline_pages. It breaks balance. 2) If zone->present_pages is changed, we also have to change zone->inactive_ratio. That's because inactive_ratio depends on zone->present_pages. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: add inactive ratio calculation function of each zoneMinchan Kim2009-06-16
| | | | | | | | | | | | | | Factor the per-zone arithemetic inside setup_per_zone_inactive_ratio()'s loop into a a separate function, calculate_zone_inactive_ratio(). This function will be used in a later patch [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: clean up functions related to pages_minMinchan Kim2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | Change the names of two functions. It doesn't affect behavior. Presently, setup_per_zone_pages_min() changes low, high of zone as well as min. So a better name is setup_per_zone_wmarks(). That's because Mel changed zone->pages_[hig/low/min] to zone->watermark array in "page allocator: replace the watermark-related union in struct zone with a watermark[] array". * setup_per_zone_pages_min => setup_per_zone_wmarks Of course, we have to change init_per_zone_pages_min, too. There are not pages_min any more. * init_per_zone_pages_min => init_per_zone_wmark_min [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: use integer fields lookup for gfp_zone and check for errors ↵Christoph Lameter2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | in flags passed to the page allocator This simplifies the code in gfp_zone() and also keeps the ability of the compiler to use constant folding to get rid of gfp_zone processing. The lookup of the zone is done using a bitfield stored in an integer. So the code in gfp_zone is a simple extraction of bits from a constant bitfield. The compiler is generating a load of a constant into a register and then performs a shift and mask operation to get the zone from a gfp_t. No cachelines are touched and no branches have to be predicted by the compiler. We are doing some macro tricks here to convince the compiler to always do the constant folding if possible. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: check the argument of kunmap on architectures without highmemMatthew Wilcox2009-06-16
| | | | | | | | | | | | | If you're using a non-highmem architecture, passing an argument with the wrong type to kunmap() doesn't give you a warning because the ifdef doesn't check the type. Using a static inline function solves the problem nicely. Reported-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: prevent shrinking of active anon lru list in case of no swap space V3MinChan Kim2009-06-16
| | | | | | | | | | | | | | | | shrink_zone() can deactivate active anon pages even if we don't have a swap device. Many embedded products don't have a swap device. So the deactivation of anon pages is unnecessary. This patch prevents unnecessary deactivation of anon lru pages. But, it don't prevent aging of anon pages to swap out. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* migration: only migrate_prep() once per move_pages()Brice Goglin2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz). Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages() throughput by breaking it into chunks, but it also made migrate_prep() be called once per chunk (every 128pages or so) instead of once per move_pages(). This patch reverts to calling migrate_prep() only once per chunk as we did before 2.6.29. It is also a followup to commit 0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from under mmap_sem"). This improves migration throughput on the above machine from 600MB/s to 750MB/s. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, PM/Freezer: Disable OOM killer when tasks are frozenRafael J. Wysocki2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the following scenario appears to be possible in theory: * Tasks are frozen for hibernation or suspend. * Free pages are almost exhausted. * Certain piece of code in the suspend code path attempts to allocate some memory using GFP_KERNEL and allocation order less than or equal to PAGE_ALLOC_COSTLY_ORDER. * __alloc_pages_internal() cannot find a free page so it invokes the OOM killer. * The OOM killer attempts to kill a task, but the task is frozen, so it doesn't die immediately. * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries to find a free page and invokes the OOM killer. * No progress can be made. Although it is now hard to trigger during hibernation due to the memory shrinking carried out by the hibernation code, it is theoretically possible to trigger during suspend after the memory shrinking has been removed from that code path. Moreover, since memory allocations are going to be used for the hibernation memory shrinking, it will be even more likely to happen during hibernation. To prevent it from happening, introduce the oom_killer_disabled switch that will cause __alloc_pages_internal() to fail in the situations in which the OOM killer would have been called and make the freezer set this switch after tasks have been successfully frozen. [akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: madvise(): correct return codeNick Piggin2009-06-16
| | | | | | | | | | | | | | | | | The posix_madvise() function succeeds (and does nothing) when called with parameters (NULL, 0, -1); according to LSB tests, it should fail with EINVAL because -1 is not a valid flag. When called with a valid address and size, it correctly fails. So perform an initial check for valid flags first. Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-allocator: warn if __GFP_NOFAIL is used for a large allocationAndrew Morton2009-06-16
| | | | | | | | | | | | | | | | | __GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either). Attempting to use __GFP_NOFAIL for a higher-order allocation is even worse, so add a once-off runtime check for this to slap people around for even thinking about trying it. Cc: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* videobuf-dma-contig: zero copy USERPTR supportMagnus Damm2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | Since videobuf-dma-contig is designed to handle physically contiguous memory, this patch modifies the videobuf-dma-contig code to only accept a user space pointer to physically contiguous memory. For now only VM_PFNMAP vmas are supported, so forget hotplug. On SuperH Mobile we use this with our sh_mobile_ceu_camera driver together with various multimedia accelerator blocks that are exported to user space using UIO. The UIO kernel code exports physically contiguous memory to user space and lets the user space application mmap() this memory and pass a pointer using the USERPTR interface for V4L2 zero copy operation. With this approach we support zero copy capture, hardware scaling and various forms of hardware encoding and decoding. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce follow_pfn()Johannes Weiner2009-06-16
| | | | | | | | | | | | | | Analoguous to follow_phys(), add a helper that looks up the PFN at a user virtual address in an IO mapping or a raw PFN mapping. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use generic follow_pte() in follow_phys()Johannes Weiner2009-06-16
| | | | | | | | | | | Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce follow_pte()Johannes Weiner2009-06-16
| | | | | | | | | | | | | | A generic readonly page table lookup helper to map an address space and an address from it to a pte. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Magnus Damm <magnus.damm@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: setup_per_zone_inactive_ratio - fix comment and make it __initCyrill Gorcunov2009-06-16
| | | | | | | | | | | | The caller of setup_per_zone_inactive_ratio is an __init function. There is no need to keep the callee after it completed as well. Also fix a comment. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not neededCyrill Gorcunov2009-06-16
| | | | | | | | | int_sqrt() returns 0 if its argument is zero so call it if only needed. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: ZVC updates in shrink_active_list() can be done onceWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | This effectively lifts the unit of updates to nr_inactive_* and pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or MAX_ORDER_NR_PAGES=1024 for reclaim_zone(). Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: don't export nr_saved_scan in /proc/zoneinfoWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | The lru->nr_saved_scan's are not meaningful counters for even kernel developers. They typically are smaller than 32 and are always 0 for large lists. So remove them from /proc/zoneinfo. Hopefully this interface change won't break too many scripts. /proc/zoneinfo is too unstructured to be script friendly, and I wonder the affected scripts - if there are any - are still bleeding since the not long ago commit "vmscan: split LRU lists into anon & file sets", which also touched the "scanned" line :) If we are to re-export accumulated vmscan counts in the future, they can go to new lines in /proc/zoneinfo instead of the current form, or to /sys/devices/system/node/node0/meminfo? Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: cleanup the scan batching codeWu Fengguang2009-06-16
| | | | | | | | | | | | | | The vmscan batching logic is twisting. Move it into a standalone function nr_scan_try_batch() and document it. No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: evict use-once pages firstRik van Riel2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | When the file LRU lists are dominated by streaming IO pages, evict those pages first, before considering evicting other pages. This should be safe from deadlocks or performance problems because only three things can happen to an inactive file page: 1) referenced twice and promoted to the active list 2) evicted by the pageout code 3) under IO, after which it will get evicted or promoted The pages freed in this way can either be reused for streaming IO, or allocated for something else. If the pages are used for streaming IO, this pageout pattern continues. Otherwise, we will fall back to the normal pageout pattern. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Elladan <elladan@eskimo.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pagemap: add page-types toolWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add page-types, a handy tool for querying page flags. It will expand some of the overloaded flags: PG_slob_free = PG_private PG_slub_frozen = PG_active PG_slub_debug = PG_error PG_readahead = PG_reclaim and mask out obscure flags except in -raw mode: PG_reserved PG_mlocked PG_mappedtodisk PG_private PG_private_2 PG_owner_priv_1 PG_arch_1 PG_uncached PG_compound* for non hugeTLB pages [akpm@linux-foundation.org: fix warning] Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pagemap: document 9 more exported page flagsWu Fengguang2009-06-16
| | | | | | | | | | | | | Also add short descriptions for all of the 20 exported page flags. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pagemap: document clarificationsWu Fengguang2009-06-16
| | | | | | | | | | | | | | Some bit ranges were inclusive and some not. Fix them to be consistently inclusive. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: export more page flags in /proc/kpageflagsWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export all page flags faithfully in /proc/kpageflags. 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD (*) 16. KPF_COMPOUND_TAIL (*) 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON(TBD) hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address 32-39. more obscure flags for kernel developers (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. The accompanying page-types tool will handle the details like decoupling overloaded flags and hiding obscure flags to normal users. Thanks to KOSAKI and Andi for their valuable recommendations! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: kpagecount/kpageflags code cleanupWu Fengguang2009-06-16
| | | | | | | | | | | | | Move increments of pfn/out to bottom of the loop. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce PageHuge() for testing huge/gigantic pagesWu Fengguang2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A series of patches to enhance the /proc/pagemap interface and to add a userspace executable which can be used to present the pagemap data. Export 10 more flags to end users (and more for kernel developers): 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD (*) 16. KPF_COMPOUND_TAIL (*) 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. a simple demo of the page-types tool # ./page-types -h page-types [options] -r|--raw Raw mode, for kernel developers -a|--addr addr-spec Walk a range of pages -b|--bits bits-spec Walk pages with specified bits -l|--list Show page details in ranges -L|--list-each Show page details one by one -N|--no-summary Don't show summay info -h|--help Show this usage message addr-spec: N one page at offset N (unit: pages) N+M pages range from N to N+M-1 N,M pages range from N to M-1 N, pages range from N to end ,M pages range from 0 to M bits-spec: bit1,bit2 (flags & (bit1|bit2)) != 0 bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1 bit1,~bit2 (flags & (bit1|bit2)) == bit1 =bit1,bit2 flags == (bit1|bit2) bit-names: locked error referenced uptodate dirty lru active slab writeback reclaim buddy mmap anonymous swapcache swapbacked compound_head compound_tail huge unevictable hwpoison nopage reserved(r) mlocked(r) mappedtodisk(r) private(r) private_2(r) owner_private(r) arch(r) uncached(r) readahead(o) slob_free(o) slub_frozen(o) slub_debug(o) (r) raw mode bits (o) overloaded bits # ./page-types flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 487369 1903 _________________________________ 0x0000000000000014 5 0 __R_D____________________________ referenced,dirty 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000000000024 34 0 __R__l___________________________ referenced,lru 0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead 0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x0000000000000040 8344 32 ______A__________________________ active 0x0000000000000060 1 0 _____lA__________________________ lru,active 0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 503 1 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types -r flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 468002 1828 _________________________________ 0x0000000100000000 19102 74 _____________________r___________ reserved 0x0000000000008000 41 0 _______________H_________________ compound_head 0x0000000000010000 188 0 ________________T________________ compound_tail 0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head 0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private 0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead 0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk 0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead 0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk 0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private 0x0000000800000040 8124 31 ______A_________________P________ active,private 0x0000000000000040 219 0 ______A__________________________ active 0x0000000800000060 1 0 _____lA_________________P________ lru,active,private 0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk 0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private 0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk 0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private 0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private 0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 538 2 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types --raw --list --no-summary --bits reserved offset count flags 0 15 _____________________r___________ 31 4 _____________________r___________ 159 97 _____________________r___________ 4096 2067 _____________________r___________ 6752 2390 _____________________r___________ 9355 3 _____________________r___________ 9728 14526 _____________________r___________ This patch: Introduce PageHuge(), which identifies huge/gigantic pages by their dedicated compound destructor functions. Also move prep_compound_gigantic_page() to hugetlb.c and make __free_pages_ok() non-static. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated ↵Mel Gorman2009-06-16
| | | | | | | | | | | | | | | logic alloc_large_system_hash() has logic for freeing pages at the end of an excessively large power-of-two buffer that is a duplicate of what is in alloc_pages_exact(). This patch converts alloc_large_system_hash() to use alloc_pages_exact(). Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: sanity check order in the page allocator slow pathMel Gorman2009-06-16
| | | | | | | | | | | | | | Callers may speculatively call different allocators in order of preference trying to allocate a buffer of a given size. The order needed to allocate this may be larger than what the page allocator can normally handle. While the allocator mostly does the right thing, it should not direct reclaim or wakeup kswapd with a bogus order. This patch sanity checks the order in the slow path and returns NULL if it is too large. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: move free_page_mlock() to page_alloc.cKOSAKI Motohiro2009-06-16
| | | | | | | | | | | | | | Currently, free_page_mlock() is only called from page_alloc.c. Thus, we can move it to page_alloc.c. Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: slab: use nr_online_nodes to check for a NUMA platformMel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | SLAB currently avoids checking a bitmap repeatedly by checking once and storing a flag. When the addition of nr_online_nodes as a cheaper version of num_online_nodes(), this check can be replaced by nr_online_nodes. (Christoph did a patch that this is lifted almost verbatim from) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: use a pre-calculated value instead of num_online_nodes() in ↵Christoph Lameter2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | | | | fast paths num_online_nodes() is called in a number of places but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function and touches a number of cache lines. This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes(). [rientjes@google.com: do not override definition of node_set_online() with macro] Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: get the pageblock migratetype without disabling interruptsMel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | | | | Local interrupts are disabled when freeing pages to the PCP list. Part of that free checks what the migratetype of the pageblock the page is in but it checks this with interrupts disabled and interupts should never be disabled longer than necessary. This patch checks the pagetype with interrupts enabled with the impact that it is possible a page is freed to the wrong list when a pageblock changes type. As that block is now already considered mixed from an anti-fragmentation perspective, it's not of vital importance. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: update NR_FREE_PAGES only as necessaryMel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page freeing, it's updated once per page. This retouches cache lines more than necessary. Update the counters one per per-cpu bulk free. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: use allocation flags as an index to the zone watermarkMel Gorman2009-06-16
| | | | | | | | | | | | | | | | | | | | | | | ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>