path: root/include/linux/vmstat.h
Commit message (Author, Age)
* mm,vmacache: add debug data (Davidlohr Bueso, 2014-06-04)
Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache hit rate -- exported in /proc/vmstat. Any updates to the caching scheme need this kind of data, so having it available saves re-implementing the counting each time. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
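A minimal sketch of the kind of header hook this describes, with the helper name assumed for illustration: the counter only costs anything when the debug option is enabled.

#ifdef CONFIG_DEBUG_VM_VMACACHE
#define count_vm_vmacache_event(x) count_vm_event(x)	/* shows up in /proc/vmstat */
#else
#define count_vm_vmacache_event(x) do {} while (0)	/* compiles away entirely */
#endif

Callers in the vmacache lookup path would then bump a "find calls" event on every probe and a "find hits" event on success, so the hit rate can be read as the ratio of the two counters.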
* vmstat: use raw_cpu_ops to avoid false positives on preemption checks (Christoph Lameter, 2014-04-07)
vm counters are allowed to be racy. Use raw_cpu_ops to avoid the local_irq_disable overhead and to avoid preemption checks which will be added to the __this_cpu operations. [akpm@linux-foundation.org: Add comment. Again.] Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
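A sketch of the resulting style of counter update (the per-cpu vm_event_states variable is assumed to be declared elsewhere in vmstat.h):

/* vm event counters tolerate rare races, so use the raw_cpu_*() form,
 * which needs no local_irq_disable() and skips the preemption check
 * that __this_cpu_*() performs in debug builds. */
static inline void __count_vm_event(enum vm_event_item item)
{
	raw_cpu_inc(vm_event_states.event[item]);
}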
* mm: vmstat: fix UP zone state accounting (Johannes Weiner, 2014-04-03)
Summary: The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have been shown to benefit from caching in the past. We call the recently used list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the "perfect" trade-off between the ability to *protect* frequently used pages and the ability to *detect* frequently used pages. This means that working set changes bigger than half of cache memory go undetected and thrash indefinitely, whereas working sets bigger than half of cache memory are unprotected against used-once streams that don't even need caching.

This happens on file servers and media streaming servers, where the popular files and file sections change over time. Even though the individual files might be smaller than half of memory, concurrent access to many of them may still result in their inter-reference distance being greater than half of memory. It's also been reported as a problem on database workloads that switch back and forth between tables that are bigger than half of memory. In these cases the VM never recognizes the new working set and will for the remainder of the workload thrash disk data which could easily live in memory.

Historically, every reclaim scan of the inactive list also took a smaller number of pages from the tail of the active list and moved them to the head of the inactive list. This model gave established working sets more gracetime in the face of temporary use-once streams, but ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache.

This series solves the problem by maintaining a history of pages evicted from the inactive list, enabling the VM to detect frequently used pages regardless of inactive list size and facilitate working set transitions.

Tests: The reported database workload is easily demonstrated on an 8G machine with two 6G filesets. This fio workload operates on one set first, then switches to the other. The VM should obviously always cache the set that the workload is currently using.
This test is based on a problem encountered by Citus Data customers: http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data unpatched: db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92% real 27m15.541s user 0m19.059s sys 0m51.459s patched: db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40% real 6m8.630s user 0m14.714s sys 0m31.233s As can be seen, the unpatched kernel simply never adapts to the workingset change and db2 is stuck indefinitely with secondary storage speed. The patched kernel needs 2-3 iterations over db2 before it replaces db1 and reaches full memory speed. Given the unbounded negative affect of the existing VM behavior, these patches should be considered correctness fixes rather than performance optimizations. Another test resembles a fileserver or streaming server workload, where data in excess of memory size is accessed at different frequencies. There is very hot data accessed at a high frequency. Machines should be fitted so that the hot set of such a workload can be fully cached or all bets are off. Then there is a very big (compared to available memory) set of data that is used-once or at a very low frequency; this is what drives the inactive list and does not really benefit from caching. Lastly, there is a big set of warm data in between that is accessed at medium frequencies and benefits from caching the pages between the first and last streamer of each burst. unpatched: hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00% patched: hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18% In both kernels, the hot set is propagated to the active list and then served from cache. In both kernels, the beginning of the warm set is propagated to the active list as well, but in the unpatched case the active list eventually takes up half of memory and no new pages from the warm set get activated, despite repeated access, and despite most of the active list soon being stale. The patched kernel on the other hand detects the thrashing and manages to keep this cache window rolling through the data set. This frees up enough IO bandwidth that the cold set is served at full speed as well and disk utilization even drops by 20%. For reference, this same test was performed with the traditional demotion mechanism, where deactivation is coupled to inactive list reclaim. 
However, this had the same outcome as the unpatched kernel: while the warm set does indeed get activated continuously, it is forced out of the active list by inactive list pressure, which is dictated primarily by the unrelated cold set. The warm set is evicted before subsequent streamers can benefit from it, even though there would be enough space available to cache the pages of interest. Costs: Page reclaim used to shrink the radix trees but now the tree nodes are reused for shadow entries, where the cost depends heavily on the page cache access patterns. However, with workloads that maintain spatial or temporal locality, the shadow entries are either refaulted quickly or reclaimed along with the inode object itself. Workloads that will experience a memory cost increase are those that don't really benefit from caching in the first place. A more predictable alternative would be a fixed-cost separate pool of shadow entries, but this would incur relatively higher memory cost for well-behaved workloads at the benefit of cornercases. It would also make the shadow entry lookup more costly compared to storing them directly in the cache structure. Future: To simplify the merging process, this patch set is implementing thrash detection on a global per-zone level only for now, but the design is such that it can be extended to memory cgroups as well. All we need to do is store the unique cgroup ID along the node and zone identifier inside the eviction cookie to identify the lruvec. Right now we have a fixed ratio (50:50) between inactive and active list but we already have complaints about working sets exceeding half of memory being pushed out of the cache by simple streaming in the background. Ultimately, we want to adjust this ratio and allow for a much smaller inactive list. These patches are an essential step in this direction because they decouple the VMs ability to detect working set changes from the inactive list size. This would allow us to base the inactive list size on the combined readahead window size for example and potentially protect a much bigger working set. It's also a big step towards activating pages with a reuse distance larger than memory, as long as they are the most frequently used pages in the workload. This will require knowing more about the access frequency of active pages than what we measure right now, so it's also deferred in this series. Another possibility of having thrashing information would be to revisit the idea of local reclaim in the form of zero-config memory control groups. Instead of having allocating tasks go straight to global reclaim, they could try to reclaim the pages in the memcg they are part of first as long as the group is not thrashing. This would allow a user to drop e.g. a back-up job in an otherwise unconfigured memcg and it would only inflate (and possibly do global reclaim) until it has enough memory to do proper readahead. But once it reaches that point and stops thrashing it would just recycle its own used-once pages without kicking out the cache of any other tasks in the system more than necessary. This patch (of 10): Fengguang Wu's build testing spotted problems with inc_zone_state() and dec_zone_state() on UP configurations in out-of-tree patches. inc_zone_state() is declared but not defined, dec_zone_state() is missing entirely. Just like with *_zone_page_state(), they can be defined like their preemption-unsafe counterparts on UP. 
[akpm@linux-foundation.org: make it build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
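A sketch of the UP definitions implied by the last paragraph (the exact form in the patch may differ): with no other CPUs to race against, the "safe" names can simply fall through to the preemption-unsafe helpers.

#ifndef CONFIG_SMP	/* UP: no cross-CPU counter drift to guard against */
static inline void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
	__inc_zone_state(zone, item);
}

static inline void dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
	__dec_zone_state(zone, item);
}
#endif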
* Merge tag 'efi-urgent' into x86/urgent (H. Peter Anvin, 2014-02-07)
* Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * mm/page-writeback.c: do not count anon pages as dirtyable memory (Johannes Weiner, 2014-01-29)
The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we cannot consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in a relatively large portion of the cache pages being dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
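Conceptually, the change amounts to something like the following simplified sketch (the function name is illustrative, and highmem and per-zone reserves that the real code also handles are ignored):

/* Dirtyable memory = free pages + file-backed cache only; anonymous
 * pages are excluded because the VM will not swap them out to make
 * room for dirty page cache. */
static unsigned long sketch_global_dirtyable_memory(void)
{
	unsigned long x;

	x  = global_page_state(NR_FREE_PAGES);
	x += global_page_state(NR_INACTIVE_FILE);
	x += global_page_state(NR_ACTIVE_FILE);
	return x;
}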
* | mm, x86: Account for TLB flushes only when debugging (Mel Gorman, 2014-01-25)
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
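The shape of the change, as a sketch (macro names assumed): the TLB flush events only reach the vmstat machinery when the debug option is set, and the stubs still evaluate their arguments.

#ifdef CONFIG_DEBUG_TLBFLUSH
#define count_vm_tlb_event(x)	  count_vm_event(x)
#define count_vm_tlb_events(x, y) count_vm_events(x, y)
#else
#define count_vm_tlb_event(x)	  do {} while (0)
#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
#endif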
* mm: vmscan: fix do_try_to_free_pages() livelock (Lisa Du, 2013-09-11)
This patch is based on KOSAKI's work, with a little more description added; please refer to https://lkml.org/lkml/2012/6/14/74. I found that the system can enter a state where a zone has lots of free pages but only order-0 and order-1 pages, which means the zone is heavily fragmented. A high-order allocation can then cause long stalls in the direct reclaim path (e.g. 60 seconds), especially in a no-swap, no-compaction environment. This problem was observed on v3.4, but the issue still lives in the current tree. The reason is that do_try_to_free_pages() enters a livelock: kswapd goes to sleep if the zones have been fully scanned and are still not balanced, since kswapd sees little point in trying all over again (to avoid an infinite loop). Instead it drops the order from high-order to order-0, because kswapd considers order-0 the most important; see commit 73ce02e9 for detail. If the watermarks are ok, kswapd goes back to sleep and may leave zone->all_unreclaimable = 0, assuming high-order users can still perform direct reclaim if they wish. Direct reclaim keeps reclaiming for a high order which is not a COSTLY_ORDER, without invoking the OOM killer, until kswapd turns on zone->all_unreclaimable; this is to avoid a too-early oom-kill. So direct reclaim depends on kswapd to break this loop. In the worst case, direct reclaim may continue page reclaim forever while kswapd sleeps forever, until something like a watchdog detects the stall and finally kills the process, as described in http://thread.gmane.org/gmane.linux.kernel.mm/103737. We can't set zone->all_unreclaimable from the direct reclaim path because direct reclaim doesn't take any lock, so that approach is racy. Thus this patch removes the zone->all_unreclaimable field completely and recalculates the zone's reclaimable state every time. Note: we can't have direct reclaim look at zone->pages_scanned directly while kswapd continues to use zone->all_unreclaimable, because that is racy too; commit 929bea7c71 ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name") describes the details. [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <npiggin@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Neil Zhang <zhangwm@marvell.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
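A rough sketch of the recomputed check that replaces the flag (the multiplier is illustrative; the real heuristic lives in mm/vmscan.c):

/* Treat a zone as unreclaimable once it has been scanned many times
 * over its reclaimable pages without making progress, instead of
 * relying on the racy zone->all_unreclaimable flag. */
static bool zone_reclaimable(struct zone *zone)
{
	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}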
* vmstat: create separate function to fold per cpu diffs into local counters (Christoph Lameter, 2013-09-11)
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
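A sketch of the dedicated fold path described above, under the assumption that it runs only while the CPU in question is offline so its diffs cannot change underneath us:

void cpu_vm_stats_fold(int cpu)
{
	struct zone *zone;
	int i;

	for_each_populated_zone(zone) {
		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
			if (!p->vm_stat_diff[i])
				continue;
			/* Move the dead CPU's delta into the zone and
			 * global counters; no cmpxchg loop or IRQ
			 * disabling is needed since nobody else is
			 * writing this CPU's diff anymore. */
			atomic_long_add(p->vm_stat_diff[i], &zone->vm_stat[i]);
			atomic_long_add(p->vm_stat_diff[i], &vm_stat[i]);
			p->vm_stat_diff[i] = 0;
		}
	}
}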
* mm: remove CONFIG_HOTPLUG ifdefs (Yijing Wang, 2013-04-29)
CONFIG_HOTPLUG is going away as an option, so clean up the CONFIG_HOTPLUG ifdefs in mm files. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING (Mel Gorman, 2013-02-23)
The current definition of count_vm_numa_events() is wrong for !CONFIG_NUMA_BALANCING, as the following would miss the side-effect:

count_vm_numa_events(NUMA_FOO, bar++);

There are no such users of count_vm_numa_events() but this patch fixes it as it is a potential pitfall. Ideally both would be converted to static inline but NUMA_PTE_UPDATES is not defined if !CONFIG_NUMA_BALANCING and creating dummy constants just to have a static inline would be similarly clumsy. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
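The fix boils down to making the stub evaluate its second argument, roughly:

#ifdef CONFIG_NUMA_BALANCING
#define count_vm_numa_event(x)     count_vm_event(x)
#define count_vm_numa_events(x, y) count_vm_events(x, y)
#else
#define count_vm_numa_event(x) do {} while (0)
/* Evaluate y so that count_vm_numa_events(NUMA_FOO, bar++) keeps its
 * side effect even when NUMA balancing is compiled out. */
#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
#endif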
* mm: numa: Add pte updates, hinting and migration stats (Mel Gorman, 2012-12-11)
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model.

u = basic unit = sizeof(void *)
Ca = cost of struct page access = sizeof(struct page) / u
Cpte = cost of PTE access = Ca
Cupdate = cost of PTE update = (2 * Cpte) + (2 * Wlock)
    where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock
Cnumahint = cost of a minor page fault = some high constant e.g. 1000
Cpagerw = cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = cost of page isolation = Ca + Wi
    where Wi is a constant that should reflect the approximate cost of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
    where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates + Cnumahint * numa_hint_faults + Ci * numa_pages_migrated + Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like

benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the expectation would be that the number of remote numa hints would reduce to 0.

convergence = numa_hint_faults_local / numa_hint_faults
    where this is measured over the last N numa hints recorded. When the workload is fully converged the value is 1.

This can measure if the placement policy is converging and how fast it is doing it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>
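For a feel of the units (numbers invented for illustration, not taken from the patch): on 64-bit, u = 8 and sizeof(struct page) = 64, so Ca = Cpte = 8; with Wlock = 1, Cupdate = 2*8 + 2*1 = 18 basic units per PTE update. With Cnumahint = 1000, a run that records 10,000 numa_pte_updates and 2,000 numa_hint_faults has already accumulated 10,000*18 + 2,000*1000 = 2,180,000 units before the isolation and copy terms for numa_pages_migrated are added.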
* memory-hotplug: fix zone stat mismatch (Minchan Kim, 2012-10-09)
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaims any pages due to too_many_isolated()==true and loops forever. The cause is that when we do memory-hotadd after memory-remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still have values. In addition, when we offline all pages of the zone, we reset them in zone_pcp_reset() without draining, so we lose some zone stat items. Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cma: count free CMA pages (Bartlomiej Zolnierkiewicz, 2012-10-09)
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [akpm@linux-foundation.org: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove redundant initialization (Minchan Kim, 2012-07-31)
pg_data_t is zeroed before reaching free_area_init_core(), so remove the now unnecessary initializations. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* atomic: use <linux/atomic.h> (Arun Sharma, 2011-07-26)
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h>. Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move enum vm_event_item into a standalone header file (Andrew Morton, 2011-05-26)
enums are problematic because they cannot be forward-declared:

akpm2:/home/akpm> cat t.c
enum foo;
static inline void bar(enum foo f)
{
}
akpm2:/home/akpm> gcc -c t.c
t.c:4: error: parameter 1 ('f') has incomplete type

So move the enum's definition into a standalone header file which can be used wherever its definition is needed. Cc: Ying Han <yinghan@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur (KOSAKI Motohiro, 2011-05-25)
Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [akpm@linux-foundation.org: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: per-node vmstat: show proper vmstats (KOSAKI Motohiro, 2011-05-25)
commit 2ac390370a ("writeback: add /sys/devices/system/node/<node>/vmstat") added a vmstat entry, but strangely it only shows nr_written and nr_dirtied:

# cat /sys/devices/system/node/node20/vmstat
nr_written 0
nr_dirtied 0

Of course, that's not adequate. With this patch, the vmstat file shows all VM statistics, as /proc/vmstat does:

# cat /sys/devices/system/node/node0/vmstat
nr_free_pages 899224
nr_inactive_anon 201
nr_active_anon 17380
nr_inactive_file 31572
nr_active_file 28277
nr_unevictable 0
nr_mlock 0
nr_anon_pages 17321
nr_mapped 8640
nr_file_pages 60107
nr_dirty 33
nr_writeback 0
nr_slab_reclaimable 6850
nr_slab_unreclaimable 7604
nr_page_table_pages 3105
nr_kernel_stack 175
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 260
nr_dirtied 1050
nr_written 938
numa_hit 962872
numa_miss 0
numa_foreign 0
numa_interleave 8617
numa_local 962872
numa_other 0
nr_anon_transparent_hugepages 0

[akpm@linux-foundation.org: no externs in .c files] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Rubin <mrubin@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
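A sketch of the sysfs show routine this implies (names follow the /proc/vmstat conventions; the exact patch may differ):

static ssize_t node_read_vmstat(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	int nid = dev->id;
	int i, n = 0;

	/* One "name value" pair per line, same names and order as
	 * /proc/vmstat (vmstat_text[] supplies the names). */
	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
		n += sprintf(buf + n, "%s %lu\n",
			     vmstat_text[i], node_page_state(nid, i));
	return n;
}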
* mm: add VM counters for transparent hugepages (Andi Kleen, 2011-04-14)
| | | | | | | | | | | | | | | | | | | I found it difficult to make sense of transparent huge pages without having any counters for its actions. Add some counters to vmstat for allocation of transparent hugepages and fallback to smaller pages. Optional patch, but useful for development and understanding the system. Contains improvements from Andrea Arcangeli and Johannes Weiner [akpm@linux-foundation.org: coding-style fixes] [hannes@cmpxchg.org: fix vmstat_text[] entries] Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add __GFP_OTHER_NODE flag (Andi Kleen, 2011-03-22)
| | | | | | | | | | | | | | | | | | | | | | | | | Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in zone_statistics() that an allocation is on behalf of another thread. This way the local and remote counters can be still correct, even when background daemons like khugepaged are changing memory mappings. This only affects the accounting, but I think it's worth doing that right to avoid confusing users. I first tried to just pass down the right node, but this required a lot of changes to pass down this parameter and at least one addition of a 10th argument to a 9 argument function. Using the flag is a lot less intrusive. Open: should be also used for migration? [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmstat: use a single setter function and callback for adjusting percpu thresholds (Mel Gorman, 2011-01-13)
reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. [akpm@linux-foundation.org: readability tweak] [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
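The consolidated interface amounts to a single prototype plus a pressure-calculation callback (a sketch of the shape; the helper name in the usage comment is assumed):

/* One setter: the callback computes the per-cpu threshold wanted for
 * each zone of the pgdat, so kswapd can install a tight threshold on
 * wakeup and restore the normal one before going back to sleep. */
void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate_pressure)(struct zone *));

/* e.g. set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold); */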
* mm: page allocator: adjust the per-cpu counter threshold when memory is low (Mel Gorman, 2011-01-13)
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty in heavy memory pressure by a factor that depends on the workload and the machine but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state():

vanilla            11.6615%
disable-threshold   0.2584%

David said:
: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity. I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Nicolas Bareil <nico@chdir.org> Cc: David Rientjes <rientjes@google.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: <stable@kernel.org> [2.6.37.1, 2.6.36.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake (Christoph Lameter, 2010-09-09)
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the number of real free pages in buddy, the VM can allocate pages below the min watermark, at worst reducing the real number of pages to zero. Even if the OOM killer kills some victim for freeing memory, it may not free memory if the exit path requires a new page resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
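A sketch of the snapshot reader introduced here, matching the description above:

/* More accurate than zone_page_state(): fold in the per-cpu deltas
 * that have not been drained yet. More expensive, so it is only used
 * on the low-memory paths described above. */
static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
#endif
	return x < 0 ? 0 : x;
}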
* mm: compaction: direct compact when a high-order allocation fails (Mel Gorman, 2010-05-25)
| | | | | | | | | | | | | | | | | | | | | | | | | | | Ordinarily when a high-order allocation fails, direct reclaim is entered to free pages to satisfy the allocation. With this patch, it is determined if an allocation failed due to external fragmentation instead of low memory and if so, the calling process will compact until a suitable page is freed. Compaction by moving pages in memory is considerably cheaper than paging out to disk and works where there are locked pages or no swap. If compaction fails to free a page of a suitable size, then reclaim will still occur. Direct compaction returns as soon as possible. As each block is compacted, it is checked if a suitable page has been freed and if so, it returns. [akpm@linux-foundation.org: Fix build errors] [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: compaction: memory compaction core (Mel Gorman, 2010-05-25)
| | | | | | | | | | | | | | | | | | | | | | | | | This patch is the core of a mechanism which compacts memory in a zone by relocating movable pages towards the end of the zone. A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages. [aarcange@redhat.com: Fix unsafe optimisation] [mel@csn.ul.ie: do not schedule work on other CPUs for compaction] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'master' into percpu (Tejun Heo, 2010-01-04)
Conflicts:
	arch/powerpc/platforms/pseries/hvCall.S
	include/linux/percpu.h
| * vmscan: stop kswapd waiting on congestion when the min watermark is not being met (KOSAKI Motohiro, 2009-12-15)
If reclaim fails to make sufficient progress, the priority is raised. Once the priority is higher, kswapd starts waiting on congestion. However, if the zone is below the min watermark then kswapd needs to continue working without delay as there is a danger of an increased rate of GFP_ATOMIC allocation failure. This patch changes the conditions under which kswapd waits on congestion by only going to sleep if the min watermarks are being met. [mel@csn.ul.ie: add stats to track how relevant the logic is] [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * vmscan: have kswapd sleep for a short interval and double check it should be asleep (Mel Gorman, 2009-12-15)
After kswapd balances all zones in a pgdat, it goes to sleep. In the event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there is a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for a sufficient length of time. This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep. This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit in a very short time after kswapd going to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Frans Pop <elendil@planet.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | percpu: remove per_cpu__ prefix. (Rusty Russell, 2009-10-29)
Now that the return from alloc_percpu is compatible with the address of per-cpu vars, it makes sense to hand around the address of per-cpu variables. To make this sane, we remove the per_cpu__ prefix we created to stop people accidentally using these vars directly. Now we have sparse, we can use that (next patch). tj: * Updated to convert stuff which was missed by or added after the original patch. * Kill per_cpu_var() macro. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
* this_cpu: Use this_cpu ops for VM statistics (Christoph Lameter, 2009-10-03)
Using per cpu atomics for the vm statistics reduces their overhead. And in the case of x86 we are guaranteed that they will never race even in the lax form used for vm statistics. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* mm: count only reclaimable lru pages (Wu Fengguang, 2009-09-22)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | global_lru_pages() / zone_lru_pages() can be used in two ways: - to estimate max reclaimable pages in determine_dirtyable_memory() - to calculate the slab scan ratio When swap is full or not present, the anon lru lists are not reclaimable and also won't be scanned. So the anon pages shall not be counted in both usage scenarios. Also rename to _reclaimable_pages: now they are counting the possibly reclaimable lru pages. It can greatly (and correctly) increase the slab scan rate under high memory pressure (when most file pages have been reclaimed and swap is full/absent), thus reduce false OOM kills. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: David Howells <dhowells@redhat.com> Cc: "Li, Ming Chun" <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove __{add,sub}_zone_page_state() (KOSAKI Motohiro, 2009-09-22)
__add_zone_page_state() and __sub_zone_page_state() are unused. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: count the number of times zone_reclaim() scans and fails (Mel Gorman, 2009-06-16)
On NUMA machines, the administrator can configure zone_reclaim_mode, which is a more targeted form of direct reclaim. On machines with large NUMA distances for example, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but it is possible that the heuristic will fail and the CPU gets tied up scanning uselessly. Detecting the situation requires some guesswork and experimentation, so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during high CPU utilisation this counter is increasing rapidly, then the resolution to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0. [akpm@linux-foundation.org: name things consistently] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove CONFIG_UNEVICTABLE_LRU config option (KOSAKI Motohiro, 2009-06-16)
Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: move /proc/zoneinfo boilerplate to mm/vmstat.c (Alexey Dobriyan, 2008-10-23)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
* proc: move /proc/vmstat boilerplate to mm/vmstat.c (Alexey Dobriyan, 2008-10-23)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
* proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c (Alexey Dobriyan, 2008-10-23)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* proc: move /proc/buddyinfo boilerplate to mm/vmstat.c (Alexey Dobriyan, 2008-10-23)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* mlock: count attempts to free mlocked page (Lee Schermerhorn, 2008-10-20)
Allow free of mlock()ed pages. This shouldn't happen, but during development, it occasionally did. This patch allows us to survive that condition, while keeping the statistics and events correct for debug. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmstat: mlocked pages statistics (Nick Piggin, 2008-10-20)
| | | | | | | | | | | | | | | | | Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU). Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series. [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo] [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* unevictable lru: add event counting with statistics (Lee Schermerhorn, 2008-10-20)
| | | | | | | | | | | | | | | | | | | | | | | | Fix to unevictable-lru-page-statistics.patch Add unevictable lru infrastructure vm events to the statistics patch. Rename the "NORECL_" and "noreclaim_" symbols and text strings to "UNEVICTABLE_" and "unevictable_", respectively. Currently, both the infrastructure and the mlocked pages event are added by a single patch later in the series. This makes it difficult to add or rework the incremental patches. The events actually "belong" with the stats, so pull them up to here. Also, restore the event counting to putback_lru_page(). This was removed from previous patch in series where it was "misplaced". The actual events weren't defined that early. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: split LRU lists into anon & file sets (Rik van Riel, 2008-10-20)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmstat.c: proper externs (Adrian Bunk, 2008-07-24)
This patch adds proper extern declarations for five variables in include/linux/vmstat.h. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Subject: [PATCH] hugetlb: vmstat events for huge page allocations (Adam Litke, 2008-04-28)
| | | | | | | | | | | | | | | | | | | | | | Allocating huge pages directly from the buddy allocator is not guaranteed to succeed. Success depends on several factors (such as the amount of physical memory available and the level of fragmentation). With the addition of dynamic hugetlb pool resizing, allocations can occur much more frequently. For these reasons it is desirable to keep track of huge page allocation successes and failures. Add two new vmstat entries to track huge page allocations that succeed and fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE being enabled. [akpm@linux-foundation.org: reduced ifdeffery] Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Eric Munson <ebmunson@us.ibm.com> Tested-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remember what the preferred zone is for zone_statistics (Mel Gorman, 2008-04-28)
| | | | | | | | | | | | | | | | | | | | | On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* let __dec_zone_page_state use __dec_zone_state (Uwe Kleine-König, 2008-02-29)
This removes code duplication and makes __dec_zone_page_state look like __inc_zone_page_state. Signed-off-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
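The deduplication in question is essentially:

/* Resolve the zone once and defer to the zone-based helper, mirroring
 * what __inc_zone_page_state already does. */
static inline void __dec_zone_page_state(struct page *page,
					 enum zone_stat_item item)
{
	__dec_zone_state(page_zone(page), item);
}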
* Create the ZONE_MOVABLE zone (Mel Gorman, 2007-07-17)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. The patches may be applied with the list-based anti-fragmentation patches that groups pages together based on mobility. The size of the zone is determined by a kernelcore= parameter specified at boot-time. This specifies how much memory is usable by non-movable allocations and the remainder is used for ZONE_MOVABLE. Any range of pages within ZONE_MOVABLE can be released by migrating the pages or by reclaiming. When selecting a zone to take pages from for ZONE_MOVABLE, there are two things to consider. First, only memory from the highest populated zone is used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second, the amount of memory usable by the kernel will be spread evenly throughout NUMA nodes where possible. If the nodes are not of equal size, the amount of memory usable by the kernel on some nodes may be greater than others. By default, the zone is not as useful for hugetlb allocations because they are pinned and non-migratable (currently at least). A sysctl is provided that allows huge pages to be allocated from that zone. This means that the huge page pool can be resized to the size of ZONE_MOVABLE during the lifetime of the system assuming that pages are not mlocked. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Credit goes to Andy Whitcroft for catching a large variety of problems during review of the patches. This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added memory continues to be placed in their existing destination as there is no mechanism to redirect them to a specific zone. [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code] [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmstat: use our own timer eventsChristoph Lameter2007-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmstat is currently using the cache reaper to periodically bring the statistics up to date. The cache reaper only exists in SLUB as a way to provide compatibility with SLAB. This patch removes the vmstat calls from the slab allocators and provides its own handling. The advantage is also that we can use a different frequency for the updates. Refreshing vm stats is a pretty fast job, so we can run this every second and stagger it by only one tick. This will lead to some overlap in large systems. E.g. a system running at 250 HZ with 1024 processors will have 4 vm updates occurring at once. However, the vm stats update only accesses per-node information. It is only necessary to stagger the vm statistics updates per processor in each node. Vm counter updates occurring on distant nodes will not cause cacheline contention. We could implement an alternate approach that runs the first processor on each node at the start of the second and then each of the other processors on that node on a subsequent tick. That may be useful to keep a large amount of the second free of timer activity. Maybe the timer folks will have some feedback on this one? [jirislaby@gmail.com: add missing break] Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
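A minimal sketch of the staggering described above (helper names and signatures are assumptions based on the description, not a verbatim hunk): each CPU gets its own delayed work item, and the initial delay of HZ + cpu ticks spreads the once-per-second updates by one tick per processor.

    static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

    static void vmstat_update(struct work_struct *w)
    {
            /* Fold this CPU's per-cpu counter diffs into the zone counters,
             * then re-arm for roughly one second from now. */
            refresh_cpu_vm_stats(smp_processor_id());
            schedule_delayed_work(&__get_cpu_var(vmstat_work),
                                  sysctl_stat_interval);
    }

    static void start_cpu_timer(int cpu)
    {
            struct delayed_work *work = &per_cpu(vmstat_work, cpu);

            INIT_DELAYED_WORK(work, vmstat_update);
            /* HZ + cpu: each CPU fires one tick later than the previous one */
            schedule_delayed_work_on(cpu, work, HZ + cpu);
    }

Because the work item re-arms itself with the same interval, the per-CPU offsets established at start-up persist across later rounds of updates.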
* [PATCH] count_vm_events-warning-fixAndrew Morton2007-02-11
| | | | | | | | | | | | | | | - Prevent things like this: block/ll_rw_blk.c: In function 'submit_bio': block/ll_rw_blk.c:3222: warning: unused variable 'count'. Inlines are much preferable to macros. - Remove the unused get_cpu_vm_events() macro. Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
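Why the inline form silences the warning, sketched for the !CONFIG_VM_EVENT_COUNTERS case (illustrative, not the exact hunk): an empty macro makes the caller's local variable genuinely unused, while an empty inline still evaluates its arguments at the call site.

    #ifndef CONFIG_VM_EVENT_COUNTERS
    /* Old macro stub: 'count' in submit_bio() is never referenced -> warning.
     * #define count_vm_events(item, delta)    do { } while (0)
     */

    /* Inline stub: the call still uses 'item' and 'delta', so the caller's
     * local variable counts as used and gcc stays quiet. */
    static inline void count_vm_events(enum vm_event_item item, long delta)
    {
    }
    #endif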
* [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VMChristoph Lameter2007-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make ZONE_DMA optional in core code. - ifdef all code for ZONE_DMA and related definitions following the example for ZONE_DMA32 and ZONE_HIGHMEM. - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of 0. - Modify the VM statistics to work correctly without a DMA zone. - Modify slab to not create DMA slabs if there is no ZONE_DMA. [akpm@osdl.org: cleanup] [jdike@addtoit.com: build fix] [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
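The statistics side of making ZONE_DMA optional follows roughly this pattern in vmstat.h (a sketch with the surrounding counter items omitted): per-zone items are generated through a FOR_ALL_ZONES() macro whose DMA leg compiles away when CONFIG_ZONE_DMA is not set.

    #ifdef CONFIG_ZONE_DMA
    #define DMA_ZONE(xx) xx##_DMA,
    #else
    #define DMA_ZONE(xx)
    #endif

    #ifdef CONFIG_ZONE_DMA32
    #define DMA32_ZONE(xx) xx##_DMA32,
    #else
    #define DMA32_ZONE(xx)
    #endif

    #ifdef CONFIG_HIGHMEM
    #define HIGHMEM_ZONE(xx) , xx##_HIGH
    #else
    #define HIGHMEM_ZONE(xx)
    #endif

    #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx)

With all of DMA, DMA32 and HIGHMEM configured out, only the _NORMAL items remain, which is what lets ZONES_SHIFT drop to 0 as noted above.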