Commit message | Author | Age
* cpusets: randomize node rotor used in cpuset_mem_spread_node()Michal Hocko2011-07-26
[ This patch had already been accepted as commit 0ac0c0d0f837 but was later reverted (commit 35926ff5fba8) because it introduced an arch-specific __node_random which was defined only for x86, breaking the other architectures. This is a follow-up without any arch-specific code. Other than that there are no functional changes. ]

Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number of the cpuset.

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
[mhocko@suse.cz: Make it arch independent]
[akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Robin Holt <holt@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
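For illustration only, a minimal arch-independent sketch of the idea, assuming get_random_int() and the standard nodemask helpers; the function name is illustrative rather than the merged one:

    /* Pick a random node out of a cpuset's mems_allowed to seed the
     * mem_spread rotor, instead of always starting at node 0. */
    static int random_spread_node(const nodemask_t *allowed)
    {
        int w = nodes_weight(*allowed);
        int target, nid;

        if (!w)
            return NUMA_NO_NODE;

        target = get_random_int() % w;      /* which of the w allowed nodes */
        for_each_node_mask(nid, *allowed)
            if (target-- == 0)
                return nid;

        return first_node(*allowed);        /* not reached */
    }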
* memcg: get rid of percpu_charge_mutex lockMichal Hocko2011-07-26
percpu_charge_mutex protects against multiple simultaneous drainings of the per-cpu charge caches, because we might otherwise end up with too many work items. At least this was the case until commit 26fe61684449 ("memcg: fix percpu cached charge draining frequency"), when we introduced more targeted draining for the async mode.

Now that sync draining is targeted as well, we can safely remove the mutex because we will never send more work items than the current number of CPUs. FLUSHING_CACHED_CHARGE protects against sending the same work multiple times, and stock->nr_pages == 0 protects against pointlessly sending work when there is obviously nothing to be done. This is of course racy, but we can live with it as the race window is really small (we would have to see FLUSHING_CACHED_CHARGE cleared while nr_pages is still non-zero).

The only remaining place where we can race is the synchronous mode, where we rely on the FLUSHING_CACHED_CHARGE test, which might have been set by another drainer on the same group, but we should wait in that case as well.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
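A rough sketch of the guard described above; the memcg_stock_pcp field names follow the per-cpu stock code but should be treated as illustrative:

    /* Only queue drain work for CPUs that actually cache pages and are
     * not already being drained; no global mutex is needed for this. */
    int cpu;

    for_each_online_cpu(cpu) {
        struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

        if (!stock->nr_pages)
            continue;       /* nothing cached, skip the work item */
        if (test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
            continue;       /* a drain is already in flight */
        schedule_work_on(cpu, &stock->work);
    }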
* memcg: add mem_cgroup_same_or_subtree() helperMichal Hocko2011-07-26
We check whether two given groups are the same, or at least in the same subtree of a hierarchy, in several places. Let's add a helper for it to make the code easier to read.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
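A minimal sketch of what such a helper can look like, assuming the css_is_ancestor() cgroup primitive; treat the exact shape as illustrative:

    /* True if memcg is root_memcg itself or lives below it in the
     * use_hierarchy tree. */
    static bool mem_cgroup_same_or_subtree(struct mem_cgroup *root_memcg,
                                           struct mem_cgroup *memcg)
    {
        if (root_memcg == memcg)
            return true;
        if (!root_memcg->use_hierarchy)
            return false;
        return css_is_ancestor(&memcg->css, &root_memcg->css);
    }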
* memcg: unify sync and async per-cpu charge cache drainingMichal Hocko2011-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we have two ways how to drain per-CPU caches for charges. drain_all_stock_sync will synchronously drain all caches while drain_all_stock_async will asynchronously drain only those that refer to a given memory cgroup or its subtree in hierarchy. Targeted async draining has been introduced by 26fe6168 (memcg: fix percpu cached charge draining frequency) to reduce the cpu workers number. sync draining is currently triggered only from mem_cgroup_force_empty which is triggered only by userspace (mem_cgroup_force_empty_write) or when a cgroup is removed (mem_cgroup_pre_destroy). Although these are not usually frequent operations it still makes some sense to do targeted draining as well, especially if the box has many CPUs. This patch unifies both methods to use the single code (drain_all_stock) which relies on the original async implementation and just adds flush_work to wait on all caches that are still under work for the sync mode. We are using FLUSHING_CACHED_CHARGE bit check to prevent from waiting on a work that we haven't triggered. Please note that both sync and async functions are currently protected by percpu_charge_mutex so we cannot race with other drainers. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
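A short sketch of the sync side described here, reusing the async scheduling and then waiting only on caches that are actually in flight; the helper name is hypothetical:

    /* Called after the targeted schedule_work_on() pass: wait for every
     * work item that is still marked as flushing. */
    static void drain_all_stock_wait(void)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

            /* only wait on drains that we (or someone) actually queued */
            if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                flush_work(&stock->work);
        }
    }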
* memcg: do not try to drain per-cpu caches without pagesMichal Hocko2011-07-26
drain_all_stock_async tries to optimize the work done on the work queue by excluding any work for the current CPU, because it assumes that the context we are called from has already tried to charge from that cache and failed, so it must be empty already.

While the assumption is correct, we can optimize even further by checking the current number of pages in the cache. This also reduces work on other CPUs with an empty stock. For the current CPU we can simply call drain_local_stock rather than deferring it to the work queue.

[kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add memory.vmscan_statKAMEZAWA Hiroyuki2011-07-26
The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim in...") says it adds scanning stats to the memory.stat file. But it doesn't, because we considered we needed to reach a consensus for such new APIs.

This patch is a trial to add memory.scan_stat. This shows
  - the number of scanned pages (total, anon, file)
  - the number of rotated pages (total, anon, file)
  - the number of freed pages (total, anon, file)
  - the elapsed time (including sleep/pause time)
for both direct and soft reclaim.

The biggest difference from Ying's original one is that this file can be reset by a write, as in

  # echo 0 ...../memory.scan_stat

An example of the output is below. This is a result after "make -j 6" of a kernel under a 300M limit.

  [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
  [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
  scanned_pages_by_limit 9471864
  scanned_anon_pages_by_limit 6640629
  scanned_file_pages_by_limit 2831235
  rotated_pages_by_limit 4243974
  rotated_anon_pages_by_limit 3971968
  rotated_file_pages_by_limit 272006
  freed_pages_by_limit 2318492
  freed_anon_pages_by_limit 962052
  freed_file_pages_by_limit 1356440
  elapsed_ns_by_limit 351386416101
  scanned_pages_by_system 0
  scanned_anon_pages_by_system 0
  scanned_file_pages_by_system 0
  rotated_pages_by_system 0
  rotated_anon_pages_by_system 0
  rotated_file_pages_by_system 0
  freed_pages_by_system 0
  freed_anon_pages_by_system 0
  freed_file_pages_by_system 0
  elapsed_ns_by_system 0
  scanned_pages_by_limit_under_hierarchy 9471864
  scanned_anon_pages_by_limit_under_hierarchy 6640629
  scanned_file_pages_by_limit_under_hierarchy 2831235
  rotated_pages_by_limit_under_hierarchy 4243974
  rotated_anon_pages_by_limit_under_hierarchy 3971968
  rotated_file_pages_by_limit_under_hierarchy 272006
  freed_pages_by_limit_under_hierarchy 2318492
  freed_anon_pages_by_limit_under_hierarchy 962052
  freed_file_pages_by_limit_under_hierarchy 1356440
  elapsed_ns_by_limit_under_hierarchy 351386416101
  scanned_pages_by_system_under_hierarchy 0
  scanned_anon_pages_by_system_under_hierarchy 0
  scanned_file_pages_by_system_under_hierarchy 0
  rotated_pages_by_system_under_hierarchy 0
  rotated_anon_pages_by_system_under_hierarchy 0
  rotated_file_pages_by_system_under_hierarchy 0
  freed_pages_by_system_under_hierarchy 0
  freed_anon_pages_by_system_under_hierarchy 0
  freed_file_pages_by_system_under_hierarchy 0
  elapsed_ns_by_system_under_hierarchy 0

total_xxxx is for hierarchy management. This will be useful for further memcg development and needs to be developed before we do some complicated rework on LRU/softlimit management.

This patch adds a new struct memcg_scanrecord into the scan_control struct. sc->nr_scanned et al. are not designed for exporting information; for example, nr_scanned is reset frequently and incremented by 2 when scanning mapped pages. To avoid complexity, I added a new parameter to scan_control which is for exporting the scanning score.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Andrew Bresticker <abrestic@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
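A rough sketch of the record mentioned in the last paragraph; the exact fields are assumptions based on the stats listed above:

    /* Carried in scan_control so vmscan can report per-memcg scanning
     * scores back to memcontrol.c; index 0/1 distinguish anon/file. */
    struct memcg_scanrecord {
        struct mem_cgroup *mem;         /* scanned memory cgroup */
        struct mem_cgroup *root;        /* scan target hierarchy root */
        int context;                    /* scanning context (limit / system) */
        unsigned long nr_scanned[2];    /* number of scanned pages */
        unsigned long nr_rotated[2];    /* number of rotated pages */
        unsigned long nr_freed[2];      /* number of freed pages */
        unsigned long elapsed;          /* nsec of elapsed time */
    };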
* memcg: fix behavior of mem_cgroup_resize_limit()Daisuke Nishimura2011-07-26
Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to memsw.limit") introduced the "memsw_is_minimum" flag, which becomes true when mem_limit == memsw_limit. The flag is checked at the beginning of reclaim, and "noswap" is set if the flag is true, because using swap is meaningless in this case.

This works well in most cases, but when we try to shrink mem_limit, which is currently the same as memsw_limit, we might fail to shrink mem_limit because swap is not used.

This patch fixes this behavior by:
  - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
  - if it is set, not setting the "noswap" flag even if memsw_is_minimum is true.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
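In code form, the decision described above looks roughly like this; the flag and field names follow the message, but the helper itself is hypothetical:

    static bool reclaim_noswap(struct mem_cgroup *root_memcg,
                               unsigned long reclaim_options)
    {
        bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
        bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;

        /* When mem_limit == memsw_limit, swapping cannot lower usage,
         * so normally skip swap; but keep swap usable while shrinking
         * the limit, otherwise the resize can fail needlessly. */
        if (!shrink && root_memcg->memsw_is_minimum)
            noswap = true;

        return noswap;
    }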
* memcg: fix vmscan count in small memcgsKAMEZAWA Hiroyuki2011-07-26
Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets") fixes the memcg/kswapd behavior for small targets and prevents the vmscan priority from getting too high. But the implementation is too naive and adds another problem for small memcgs: it always forces a scan of 32 pages of file/anon and doesn't handle swappiness and other rotate info. It makes vmscan scan the anon LRU regardless of swappiness and makes reclaim worse. This patch fixes it by adjusting the scanning count with regard to swappiness et al.

At a test "cat 1G file under 300M limit" (swappiness=20):

before patch
  scanned_pages_by_limit 360919
  scanned_anon_pages_by_limit 180469
  scanned_file_pages_by_limit 180450
  rotated_pages_by_limit 31
  rotated_anon_pages_by_limit 25
  rotated_file_pages_by_limit 6
  freed_pages_by_limit 180458
  freed_anon_pages_by_limit 19
  freed_file_pages_by_limit 180439
  elapsed_ns_by_limit 429758872

after patch
  scanned_pages_by_limit 180674
  scanned_anon_pages_by_limit 24
  scanned_file_pages_by_limit 180650
  rotated_pages_by_limit 35
  rotated_anon_pages_by_limit 24
  rotated_file_pages_by_limit 11
  freed_pages_by_limit 180634
  freed_anon_pages_by_limit 0
  freed_file_pages_by_limit 180634
  elapsed_ns_by_limit 367119089
  scanned_pages_by_system 0

The number of scanned anon pages is decreased (as expected), and the elapsed time is reduced. With this patch, small memcgs will work better.

(*) Because the amount of file cache is much bigger than anon, reclaim_stat's rotate-scan counter makes scanning favour files more.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: change memcg_oom_mutex to spinlockMichal Hocko2011-07-26
memcg_oom_mutex is used to protect the memcg OOM path and the eventfd interface for oom_control. None of the critical sections which it protects sleep (eventfd_signal works from atomic context and the rest are simple linked-list resp. oom_lock atomic operations).

A mutex is also too heavyweight for those code paths because it triggers a lot of scheduling. It also makes convoying effects more visible when we have a large number of OOM kills, because we take the lock multiple times during mem_cgroup_handle_oom, so we have multiple places where many processes can sleep.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: make oom_lock 0 and 1 based rather than counterMichal Hocko2011-07-26
Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock counter which is incremented by mem_cgroup_oom_lock when we are about to handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if oom_lock > 1 to prevent multiple OOM kills at the same time. The counter is then decremented by mem_cgroup_oom_unlock called from the same function.

This works correctly, but it can lead to serious starvation when we have many processes triggering OOM and many CPUs available for them (I have tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other processes that are blocked on the mutex. While A releases the mutex and calls mem_cgroup_out_of_memory, the others will wake up (one after another), increase the counter and fall asleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it takes the mutex again, decreases oom_lock and wakes the other tasks (if releasing memory by somebody else - e.g. a killed process - hasn't done it yet).

A testcase would look like this (assume "malloc XXX" is a program allocating XXX megabytes of memory which touches all allocated pages in a tight loop):

  # swapoff SWAP_DEVICE
  # cgcreate -g memory:A
  # cgset -r memory.oom_control=0 A
  # cgset -r memory.limit_in_bytes= 200M
  # for i in `seq 100`
  # do
  #     cgexec -g memory:A malloc 10 &
  # done

The main problem here is that all processes still race for the mutex and there is no guarantee that the counter gets back to 0 for those that got back to mem_cgroup_handle_oom. In the end the whole convoy increases and decreases the counter, but we never get down to 1, which would enable killing, so nothing useful can be done. The time is basically unbounded because it highly depends on scheduling and on the ordering on the mutex (I have seen this taking hours...).

This patch replaces the counter by simple {un}lock semantics. As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to make sure that nobody else races with us, which is guaranteed by memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a subtree which is already locked:

  hierarchy:
          A
         / \
        B   \
       / \   \
      C   D   E

The B - C - D tree might already be locked. While we want to allow locking the E subtree, because OOM situations there cannot influence each other, we definitely do not want to allow locking A. Therefore we have to refuse the lock if any subtree is already locked, and clear the lock for all nodes that have been set up to the failure point.

On the other hand, we have to make sure that the rest of the world will recognize that a group is under OOM even though it doesn't hold the lock. Therefore we introduce an under_oom variable which is incremented and decremented for the whole subtree when we enter resp. leave mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need to be updated under memcg_oom_mutex because its users only check a single group and they use atomic operations for that.
This can be checked easily by the following test case:

  # cgcreate -g memory:A
  # cgset -r memory.use_hierarchy=1 A
  # cgset -r memory.oom_control=1 A
  # cgset -r memory.limit_in_bytes= 100M
  # cgset -r memory.memsw.limit_in_bytes= 100M
  # cgcreate -g memory:A/B
  # cgset -r memory.oom_control=1 A/B
  # cgset -r memory.limit_in_bytes=20M
  # cgset -r memory.memsw.limit_in_bytes=20M
  # cgexec -g memory:A/B malloc 30 &   #-> this will be blocked by OOM of group B
  # cgexec -g memory:A malloc 80 &     #-> this will be blocked by OOM of group A

While B gets oom_lock, A will not get it. Both of them go to sleep and wait for an external action. We can make the limit higher for A to enforce waking it up:

  # cgset -r memory.memsw.limit_in_bytes=300M A
  # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't hold oom_lock.

Finally, the unlock path is very easy because we always unlock only the subtree we have locked previously, while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
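A rough sketch of the subtree-lock-or-rollback idea, run under memcg_oom_mutex/lock; the hierarchy iterator and field names are illustrative, not the merged code:

    static bool mem_cgroup_oom_trylock(struct mem_cgroup *root)
    {
        struct mem_cgroup *iter, *failed = NULL;
        bool done = false;

        /* Try to take oom_lock on every group in the subtree. */
        for_each_mem_cgroup_tree(iter, root) {
            if (failed)
                continue;           /* keep walking, stop locking */
            if (iter->oom_lock)
                failed = iter;      /* part of the subtree already locked */
            else
                iter->oom_lock = true;
        }

        if (!failed)
            return true;

        /* Roll back everything we locked before hitting 'failed'. */
        for_each_mem_cgroup_tree(iter, root) {
            if (iter == failed)
                done = true;
            else if (!done)
                iter->oom_lock = false;
        }
        return false;
    }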
* memcg: consolidate memory cgroup lru stat functionsKAMEZAWA Hiroyuki2011-07-26
In mm/memcontrol.c there are many LRU stat functions, such as:

  mem_cgroup_zone_nr_lru_pages
  mem_cgroup_node_nr_file_lru_pages
  mem_cgroup_nr_file_lru_pages
  mem_cgroup_node_nr_anon_lru_pages
  mem_cgroup_nr_anon_lru_pages
  mem_cgroup_node_nr_unevictable_lru_pages
  mem_cgroup_nr_unevictable_lru_pages
  mem_cgroup_node_nr_lru_pages
  mem_cgroup_nr_lru_pages
  mem_cgroup_get_local_zonestat

Some of them are under #if MAX_NUMNODES > 1 and others are not. This seems bad.

This patch consolidates all of them into

  mem_cgroup_zone_nr_lru_pages()
  mem_cgroup_node_nr_lru_pages()
  mem_cgroup_nr_lru_pages()

For these functions, the "which LRU?" information is passed as a mask, for example:

  mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

I also added some macros such as ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON, for example:

  mem_cgroup_nr_lru_pages(mem, ALL_LRU)

BTW, considering the NUMA placement of the counters, this patch also seems better. Previously, when gathering all LRU information, we scanned in the order for_each_lru -> for_each_node -> for_each_zone, which means we touched cache lines on different nodes in turn. After this patch we scan for_each_node -> for_each_zone -> for_each_lru(mask), so we gather the information from the same cache line at once.

[akpm@linux-foundation.org: fix warnings, build error]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
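A minimal sketch of the consolidated accessor and the node -> zone -> lru(mask) walk described above; the loop body is an assumption, not the merged implementation:

    static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *mem,
                                                 unsigned int lru_mask)
    {
        unsigned long total = 0;
        enum lru_list lru;
        int nid, zid;

        /* node -> zone -> lru order keeps us on one node's counters */
        for_each_node_state(nid, N_HIGH_MEMORY)
            for (zid = 0; zid < MAX_NR_ZONES; zid++)
                for_each_lru(lru)
                    if (BIT(lru) & lru_mask)
                        total += mem_cgroup_zone_nr_lru_pages(mem, nid,
                                                              zid, BIT(lru));
        return total;
    }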
* memcg: export memory cgroup's swappiness with mem_cgroup_swappiness()KAMEZAWA Hiroyuki2011-07-26
Each memory cgroup has a 'swappiness' value which can be accessed by get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages(), where swappiness is passed as an argument and propagated via scan_control.

get_swappiness() is a static function, but some planned updates will need to get the swappiness from files other than memcontrol.c. This patch exports get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the swappiness argument from try_to_free... and drop swappiness from scan_control; only memcg uses it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
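For illustration, a sketch of the exported accessor; the root-cgroup fallback to the global vm_swappiness mirrors the existing static helper but is an assumption here:

    unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
    {
        struct cgroup *cgrp = memcg->css.cgroup;

        /* the root cgroup falls back to the global knob */
        if (cgrp->parent == NULL)
            return vm_swappiness;

        return memcg->swappiness;
    }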
* rtc: fix hrtimer deadlockThomas Gleixner2011-07-26
Ben reported a lockup related to rtc. The lockup happens due to:

  CPU0                                       CPU1
  rtc_irq_set_state()                        __run_hrtimer()
    spin_lock_irqsave(&rtc->irq_task_lock)     rtc_handle_legacy_irq();
    hrtimer_cancel()                             spin_lock(&rtc->irq_task_lock);
      while (callback_running);

So the running callback never finishes as it's blocked on rtc->irq_task_lock.

Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while waiting for the callback. Fix this for both rtc_irq_set_state() and rtc_irq_set_freq().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ben Greear <greearb@candelatech.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
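A rough sketch of the fix, assuming the rtc_device's pie_timer and irq_task_lock fields; the function name and retry structure are illustrative:

    static int rtc_update_pie(struct rtc_device *rtc, int enabled, ktime_t period)
    {
        unsigned long flags;

    retry:
        spin_lock_irqsave(&rtc->irq_task_lock, flags);
        if (!enabled) {
            if (hrtimer_try_to_cancel(&rtc->pie_timer) < 0) {
                /* the callback is running right now; drop the lock
                 * it needs and try again instead of spinning on it */
                spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
                cpu_relax();
                goto retry;
            }
        } else {
            hrtimer_start(&rtc->pie_timer, period, HRTIMER_MODE_REL);
        }
        spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
        return 0;
    }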
* rtc: limit frequencyThomas Gleixner2011-07-26
| | | | | | | | | | | | | | | | Due to the hrtimer self rearming mode a user can DoS the machine simply because it's starved by hrtimer events. The RTC hrtimer is self rearming. We really need to limit the frequency to something sensible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ben Greear <greearb@candelatech.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
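A minimal sketch of such a limit check; the 8192 Hz cap and the constant's name are assumptions (classic RTC hardware tops out there), not necessarily the merged values:

    #define RTC_MAX_FREQ    8192    /* assumed upper bound, in Hz */

    if (freq <= 0 || freq > RTC_MAX_FREQ)
        return -EINVAL;     /* refuse rates that would starve the machine */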
* rtc: handle errors correctly in rtc_irq_set_state()Thomas Gleixner2011-07-26
| | | | | | | | | | | | | | | | The code checks the correctness of the parameters, but unconditionally arms/disarms the hrtimer. The result is that a random task might arm/disarm rtc timer and surprise the real owner by either generating events or by stopping them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ben Greear <greearb@candelatech.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
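In code form, the idea is roughly the following sketch (not the merged diff); the wrapper name is hypothetical:

    static int rtc_irq_set_state_checked(struct rtc_device *rtc,
                                         struct rtc_task *task, int enabled)
    {
        int err = 0;
        unsigned long flags;

        spin_lock_irqsave(&rtc->irq_task_lock, flags);
        if (rtc->irq_task != NULL && task == NULL)
            err = -EBUSY;
        if (rtc->irq_task != task)
            err = -EACCES;
        if (err) {
            /* wrong owner or bad parameters: leave the hrtimer alone */
            spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
            return err;
        }

        /* only a validated owner arms or disarms rtc->pie_timer here */
        spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
        return 0;
    }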
* mn10300, exec: remove redundant set_fs(USER_DS)Mathias Krause2011-07-26
| | | | | | | | | | | The address limit is already set in flush_old_exec() so this set_fs(USER_DS) is redundant. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/base/power/opp.c: fix dev_opp initial valueJonghwan Choi2011-07-26
The initial value of dev_opp should be an ERR_PTR(), since IS_ERR() is used to check for errors.

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
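A minimal sketch of the pattern being fixed; the -ENODEV value is an assumption:

    /* initialize with an error pointer, not 0/NULL, so the later
     * IS_ERR() check really catches the "no OPP table found" case */
    struct device_opp *dev_opp = ERR_PTR(-ENODEV);

    /* ... the lookup loop may leave dev_opp untouched ... */

    if (IS_ERR(dev_opp))
        return PTR_ERR(dev_opp);    /* propagate as a negative errno */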
* frv, exec: remove redundant set_fs(USER_DS)Mathias Krause2011-07-26
| | | | | | | | | | | | The address limit is already set in flush_old_exec() so those calls to set_fs(USER_DS) are redundant. Also removed the dead code in flush_thread(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linusLinus Torvalds2011-07-26
* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
  MIPS: Close races in TLB modify handlers.
  MIPS: Add uasm UASM_i_SRL_SAFE macro.
  MIPS: RB532: Use hex_to_bin()
  MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
  MIPS: PowerTV: Provide cpu-feature-overrides.h
  MIPS: Remove pointless return statement from empty void functions.
  MIPS: Limit fixrange_init() to the FIXMAP region
  MIPS: Install handlers for software IRQs
  MIPS: Move FIXADDR_TOP into spaces.h
  MIPS: Add SYNC after cacheflush
  MIPS: pfn_valid() is broken on low memory HIGHMEM systems
  MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
  MIPS: topdown mmap support
  MIPS: Remove redundant addr_limit assignment on exec.
  MIPS: AR7: Replace __attribute__((__packed__)) with __packed
  MIPS: AR7: Remove 'space before tabs' in platform.c
  MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
  MIPS: AR7: Fix trailing semicolon bug in clock.c
  MAINTAINERS: Update MIPS entry.
  MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
  ...
| * MIPS: Close races in TLB modify handlers.David Daney2011-07-26
Page table entries are made invalid by writing a zero into the PTE slot in a page table. This creates a race condition with the TLB modify handlers when they are updating the PTE:

  CPU0                        CPU1
  Test for _PAGE_PRESENT      .
  .                           set to not _PAGE_PRESENT (zero)
  Set to _PAGE_VALID          .

So now the page-not-present value (zero) is suddenly valid and user space programs have access to physical page zero.

We close the race by putting the test for _PAGE_PRESENT and the setting of _PAGE_VALID into an atomic LL/SC section. This requires more registers than just K0 and K1 in the handlers, so we need to save some registers to a save area and then restore them when we are done. The save area is an array of cacheline-aligned structures that should not suffer cache line bouncing as they are CPU private.

[ralf@linux-mips.org: Fix !defined(CONFIG_MIPS_PGD_C0_CONTEXT) build error.]
Signed-off-by: David Daney <david.daney@cavium.com>
To: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/2577/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Add uasm UASM_i_SRL_SAFE macro.David Daney2011-07-26
| | | | | | | | | | | | | | | | | | | | This can be used from either 32-bit or 64-bit code to generate logical right shifts of any constant amount. Signed-off-by: David Daney <david.daney@cavium.com> To: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2576/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: RB532: Use hex_to_bin()Andy Shevchenko2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | Remove custom implementation of hex_to_bin(). Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org To: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/1580/ Acked-by: Florian Fainelli <florian@openwrt.org> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platformsShinya Kuribayashi2011-07-25
Enable cpu_has_clo_clz only when CONFIG_CPU_MIPS32 or CONFIG_CPU_MIPS64 is selected. This will optimize fls() and __fls() to use the CLZ instruction, and eventually ffs() and __ffs() as well.

Malta and MIPSSim are development platforms and need to take care of various processor configurations, release revisions and so on, even across different MIPS ISAs. For such platforms we have to be careful, for instance, with turning on cpu_has_mips{32,64}r[12] features. As for CLZ, all MIPS32/64 processors support it, regardless of release revision.

Signed-off-by: Shinya Kuribayashi <skuribay@pobox.com>
To: David VomLehn <dvomlehn@cisco.com>
To: macro@linux-mips.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/1453/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
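For illustration, with cpu_has_clo_clz known true at compile time, fls() can reduce to a single CLZ, roughly as in this sketch:

    static inline int fls(int x)
    {
        /* CLZ counts the leading zero bits of x; fls() is 32 minus that */
        __asm__("clz %0, %1" : "=r" (x) : "r" (x));
        return 32 - x;
    }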
| * MIPS: PowerTV: Provide cpu-feature-overrides.hDavid VomLehn2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This will optimize fls() and __fls() to use CLZ throughout the kernel, and any other optimizations that depend on constant cpu_has_* values will also be used. Signed-off-by: David VomLehn <dvomlehn@cisco.com> Signed-off-by: Shinya Kuribayashi <skuribay@pobox.com> To: David VomLehn <dvomlehn@cisco.com> To: macro@linux-mips.org Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/1452/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Remove pointless return statement from empty void functions.Ralf Baechle2011-07-25
| | | | | | | | | | | | | | | | Signed-off-by: Ralf Baechle <ralf@linux-mips.org> To: Sergei Shtylyov <sshtylyov@mvista.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2391/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Limit fixrange_init() to the FIXMAP regionKevin Cernekee2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | fixrange_init() allocates page tables for all addresses higher than FIXADDR_TOP. On processors that override the default FIXADDR_TOP address of 0xfffe_0000, this can consume up to 4 pages (1 page per 4MB) for pgd's that are never used. Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/1980/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Install handlers for software IRQsKevin Cernekee2011-07-25
| | | | | | | | | | | | | | | | | | | | | | BMIPS4350/4380/5000 CMT/SMT all use SW INT0/INT1 for inter-thread signaling. Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/1709/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Move FIXADDR_TOP into spaces.hKevin Cernekee2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory maps and addressing quirks are normally defined in <spaces.h>. There are already three targets that need to override FIXADDR_TOP, and others exist. This will be a cleaner approach than adding lots of ifdefs in fixmap.h . Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/1573/ Acked-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Add SYNC after cacheflushKevin Cernekee2011-07-25
On processors with deep write buffers, it is likely that many cycles will pass between a CACHE instruction and the time the data actually gets written out to DRAM. Add a SYNC instruction to ensure that the buffers get emptied before the flush functions return.

Actual problem seen in the wild:

1) dma_alloc_coherent() allocates cached memory
2) memset() is called to clear the new pages
3) dma_cache_wback_inv() is called to flush the zero data out to memory
4) dma_alloc_coherent() returns an uncached (kseg1) pointer to the freshly allocated pages
5) Caller writes data through the kseg1 pointer
6) Buffered writeback data finally gets flushed out to DRAM
7) Part of caller's data is inexplicably zeroed out

This patch adds SYNC between steps 3 and 4, which fixed the problem.

Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork:
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
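A minimal sketch of the change, assuming the arch's __sync() write-buffer barrier and a writeback-invalidate helper such as blast_dcache_range(); the function name is hypothetical:

    static void mips_dma_cache_wback_inv(unsigned long addr, unsigned long size)
    {
        /* push the dirty lines out of the D-cache ... */
        blast_dcache_range(addr, addr + size);
        /* ... then drain the CPU write buffer so the data really is in
         * DRAM before anyone reads it through an uncached mapping */
        __sync();
    }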
| * MIPS: pfn_valid() is broken on low memory HIGHMEM systemsKevin Cernekee2011-07-25
pfn_valid() compares the PFN to max_mapnr:

  __pfn >= min_low_pfn && __pfn < max_mapnr;

On HIGHMEM kernels, highend_pfn is used to set the value of max_mapnr. Unfortunately, highend_pfn is left at zero if the system does not actually have enough RAM to reach into the HIGHMEM range. This causes pfn_valid() to always return false, and when debug checks are enabled the kernel will fail catastrophically:

  Memory: 22432k/32768k available (2249k kernel code, 10336k reserved, 653k data, 1352k init, 0k highmem)
  NR_IRQS:128
  kfree_debugcheck: out of range ptr 81c02900h.
  Kernel bug detected[#1]:
  Cpu 0
  $ 0   : 00000000 10008400 00000034 00000000
  $ 4   : 8003e160 802a0000 8003e160 00000000
  $ 8   : 00000000 0000003e 00000747 00000747
  ...

On such a configuration, max_low_pfn should be used to set max_mapnr. This was seen on 2.6.34.

Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
To: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/1992/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
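In code form, the suggested fix is roughly this sketch of the mem_init()-time assignment; treat the exact location as an assumption:

    #ifdef CONFIG_HIGHMEM
        /* fall back to max_low_pfn when there is no actual highmem */
        max_mapnr = highend_pfn ? highend_pfn : max_low_pfn;
    #else
        max_mapnr = max_low_pfn;
    #endif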
| * MIPS: HIGHMEM DMA on noncoherent MIPS32 processorsDezhong Diao2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [v4: Patch applies to linux-queue.git with kmap_atomic patches: https://patchwork.kernel.org/patch/189932/ https://patchwork.kernel.org/patch/194552/ https://patchwork.kernel.org/patch/189912/ ] The MIPS DMA coherency functions do not work properly (i.e. kernel oops) when HIGHMEM pages are passed in as arguments. Use kmap_atomic() to temporarily map high pages for cache maintenance operations. Tested on a 2.6.36-rc7 1GB HIGHMEM SMP no-alias system. Signed-off-by: Dezhong Diao <dediao@cisco.com> Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Cc: Dezhong Diao <dediao@cisco.com> Cc: David Daney <ddaney@caviumnetworks.com> Cc: David VomLehn <dvomlehn@cisco.com> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/1695/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: topdown mmap supportJian Peng2011-07-25
This patch introduces topdown mmap support in the user process address space allocation policy.

Recently, we ran some large applications that use mmap heavily and led to OOM due to the inflexible mmap allocation policy on MIPS32. Since most other major archs have supported it for years, it is reasonable to follow the trend and reduce the pain of porting applications.

Due to cache aliasing concerns, arch_get_unmapped_area_topdown() and other helper functions are implemented in arch/mips/kernel/syscall.c.

Signed-off-by: Jian Peng <jipeng2005@gmail.com>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/2389/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Remove redundant addr_limit assignment on exec.Mathias Krause2011-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The address limit is already set in flush_old_exec() via set_fs(USER_DS) so this assignment is redundant. [ralf@linux-mips.org: also see dac853ae89043f1b7752875300faf614de43c74b for further explanation.] Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/2466/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: AR7: Replace __attribute__((__packed__)) with __packedFlorian Fainelli2011-07-25
Signed-off-by: Florian Fainelli <florian@openwrt.org>
To: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/2491/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: AR7: Remove 'space before tabs' in platform.cFlorian Fainelli2011-07-25
| | | | | | | | | | | | | | Signed-off-by: Florian Fainelli <florian@openwrt.org> To: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2490/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Lantiq: Add missing clk_enable and clk_disable functions.John Crispin2011-07-20
Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/2465/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: AR7: Fix trailing semicolon bug in clock.cFlorian Fainelli2011-07-20
| | | | | | | | | | | | | | Signed-off-by: Florian Fainelli <florian@openwrt.org> To: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2489/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MAINTAINERS: Update MIPS entry.Ralf Baechle2011-07-20
o Add entry for MIPS patchworks
o Reorder entries for readability.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definitionJonas Gorski2011-07-20
| | | | | | | | | | | | | | | | Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Cc: linux-mips@linux-mips.org Acked-by: Florian Fainelli <florian@openwrt.org> Patchwork: https://patchwork.linux-mips.org/patch/2461/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Netlogic: SMP fixes for XLR/XLS platform code.Jayachandran C2011-07-20
Fix a few issues in the Netlogic code:
 - Use handle_percpu_irq to handle per-cpu interrupts
 - Remove unused function nlm_common_ipi_handler()
 - Call scheduler_ipi() on SMP_RESCHEDULE_YOURSELF
 - Enable interrupts in nlm_smp_finish()

Signed-off-by: Jayachandran C <jayachandranc@netlogicmicro.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/2460/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: SB1250: Restore dropped irq_mask functionThomas Gleixner2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | Commit d6d5d5c4a (MIPS: Sibyte: Convert to new irq_chip functions) removed the mask function which breaks irq_shutdown(). Restore it. Reported-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2460/ Tested-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: MIPSsim: Fix uniprocessor build.Ralf Baechle2011-07-20
| | | | | | | | Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: ARC: Fix build of firmware library on uniprocessor.Ralf Baechle2011-07-20
| | | | | | | | Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: XLR, XLS: Move makefile bits to where they belong.Ralf Baechle2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch combines linux-mips.org patches 637d69600fb1773da56487271ec2a79c33d237ed [MIPS: Netlogic: Yank out crap.] and 5e3c263b9658a4b1c6c5577793e9347efb44854e [MIPS: XLR, XLS: Add Kbuild files for platform.] Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Jayachandran C <jayachandranc@netlogicmicro.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/2415/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Malta: Fix crash in SMP kernel on non-CMP systems.Ralf Baechle2011-07-20
Since 6be63bbbdab66b9185dc6f67c8b1bacb6f37f946 (lmo) rsp. af3a1f6f4813907e143f87030cde67a9971db533 (kernel.org) the Malta code no longer probes for the presence of GCMP if CMP is not configured. This means that the variable gcmp_present will be left at its default value of -1, which normally is meant to indicate that GCMP has not yet been mmapped. This non-zero value is now interpreted as GCMP being present, resulting in a write attempt to a GCMP register and thus a crash.

Reported, plus a build fix on top of my fix, by Rob Landley <rob@landley.net>.

Reported-by: Rob Landley <rob@landley.net>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Patchwork: https://patchwork.linux-mips.org/patch/2413/
| * MIPS: Wire up sendmmsg and renumber setns syscall.Ralf Baechle2011-07-20
| | | | | | | | | | | | | | | | | | | | Renumbering was necessary because I had already wired up setns(2) in the linux-mips.org tree in commit c3fce54644cabbb90700cc3acc040718a377f609 [MIPS: Wire up new sendmmsg syscall.] but the same syscall numbers were used by 7b21fddd087678a70ad64afc0f632e0f1071b092 [ns: Wire up the setns system call] resulting in a conflict. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: SMTC: Fix build.Ralf Baechle2011-07-20
| | | | | | | | | | | | | | | | | | | | Commit ee6114202e48dc282c6d04741e9c5d7171eb8bb8 (lmo) rsp. 1685f3b158a244d4f6e205e67c84483fffcb2d9f (kernel.org) ["MIPS: SMTC: Move declaration of smtc_init_secondary to <asm/smtc.h>."] didn't quite do that - it rather lost the declaration of smtc_init_secondary resulting in a build error. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: Malta SMTC: Fix build.Ralf Baechle2011-07-20
| | | | | | | | | | | | | | | | | | Commit a561b02a2577aec51277ba39c82bd192a79c0267 (lmo) rsp. 7c8d948f1633da5ff81e4f5b31ef237d74c40127 (kernel.org) ["MIPS: i8259: Convert to new irq_chip functions"] missed one location to modify resulting in build breakage. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: GT64120: Remove useless inclusion of clocksource.h.Ralf Baechle2011-07-20
| | | | | | | | Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| * MIPS: NILE4: Remove useless inclusion of GT64120 header.Ralf Baechle2011-07-20
| | | | | | | | Signed-off-by: Ralf Baechle <ralf@linux-mips.org>