aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* cpuset: rename @cont to @cgrpLi Zefan2013-06-13
| | | | | | | | | | | Cont is short for container. control group was named process container at first, but then people found container already has a meaning in linux kernel. Clean up the leftover variable name @cont. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: fix to migrate mm correctly in a corner caseLi Zefan2013-06-13
| | | | | | | | | | | | | | | | | | | | | | | Before moving tasks out of empty cpusets, update_tasks_nodemask() is called, which calls do_migrate_pages(xx, from, to). Then those tasks are moved to an ancestor, and do_migrate_pages() is called again. The first time: from = node_to_be_offlined, to = empty. The second time: from = empty, to = ancestor's nodemask. so looks like no pages will be migrated. Fix this by: - Don't call update_tasks_nodemask() on empty cpusets. - Pass cs->old_mems_allowed to do_migrate_pages(). v4: added comment in cpuset_hotplug_update_tasks() and rephased comment in cpuset_attach(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: allow to move tasks to empty cpusetsLi Zefan2013-06-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently some cpuset behaviors are not friendly when cpuset is co-mounted with other cgroup controllers. Now with this patchset if cpuset is mounted with sane_behavior option, it behaves differently: - Tasks will be kept in empty cpusets when hotplug happens and take masks of ancestors with non-empty cpus/mems, instead of being moved to an ancestor. - A task can be moved into an empty cpuset, and again it takes masks of ancestors, so the user can drop a task into a newly created cgroup without having to do anything for it. As tasks can reside in empy cpusets, here're some rules: - They can be moved to another cpuset, regardless it's empty or not. - Though it takes masks from ancestors, it takes other configs from the empty cpuset. - If the ancestors' masks are changed, those tasks will also be updated to take new masks. v2: add documentation in include/linux/cgroup.h Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: allow to keep tasks in empty cpusetsLi Zefan2013-06-13
| | | | | | | | | | | | | | | | | | | | | To achieve this: - We call update_tasks_cpumask/nodemask() for empty cpusets when hotplug happens, instead of moving tasks out of them. - When a cpuset's masks are changed by writing cpuset.cpus/mems, we also update tasks in child cpusets which are empty. v3: - do propagation work in one place for both hotplug and unplug v2: - drop rcu_read_lock before calling update_task_nodemask() and update_task_cpumask(), instead of using workqueue. - add documentation in include/linux/cgroup.h Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: introduce effective_{cpumask|nodemask}_cpuset()Li Zefan2013-06-13
| | | | | | | | | | | | | | effective_cpumask_cpuset() returns an ancestor cpuset which has non-empty cpumask. If a cpuset is empty and the tasks in it need to update their cpus_allowed, they take on the ancestor cpuset's cpumask. This currently won't change any behavior, but it will later allow us to keep tasks in empty cpusets. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: record old_mems_allowed in struct cpusetLi Zefan2013-06-13
| | | | | | | | | | | | | | | | | | When we update a cpuset's mems_allowed and thus update tasks' mems_allowed, it's required to pass the old mems_allowed and new mems_allowed to cpuset_migrate_mm(). Currently we save old mems_allowed in a temp local variable before changing cpuset->mems_allowed. This patch changes it by saving old mems_allowed in cpuset->old_mems_allowed. This currently won't change any behavior, but it will later allow us to keep tasks in empty cpusets. v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed" Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: remove async hotplug propagation workLi Zefan2013-06-09
| | | | | | | | As we can drop rcu read lock while iterating cgroup hierarchy, we don't have to do propagation asynchronously via workqueue. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: let hotplug propagation work wait for task attachingLi Zefan2013-06-09
| | | | | | | | | | | | | | | | | Instead of triggering propagation work in cpuset_attach(), we make hotplug propagation work wait until there's no task attaching in progress. IMO this is more robust. We won't see empty masks in cpuset_attach(). Also it's a preparation for removing propagation work. Without asynchronous propagation we can't call move_tasks_in_empty_cpuset() in cpuset_attach(), because otherwise we'll deadlock on cgroup_mutex. tj: typo fixes. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: re-structure update_cpumask() a bitLi Zefan2013-06-05
| | | | | | | | | | | | | | | Check if cpus_allowed is to be changed before calling validate_change(). This won't change any behavior, but later it will allow us to do this: # mkdir /cpuset/child # echo $$ > /cpuset/child/tasks /* empty cpuset */ # echo > /cpuset/child/cpuset.cpus /* do nothing, won't fail */ Without this patch, the last operation will fail. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: remove cpuset_test_cpumask()Li Zefan2013-06-05
| | | | | | | The test is done in set_cpus_allowed_ptr(), so it's redundant. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: remove unnecessary variable in cpuset_attach()Li Zefan2013-06-05
| | | | | | | We can just use oldcs->mems_allowed. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: cleanup guarantee_online_{cpus|mems}()Li Zefan2013-06-05
| | | | | | | | - We never pass a NULL @cs to these functions. - The top cpuset always has some online cpus/mems. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: remove redundant check in cpuset_cpus_allowed_fallback()Li Zefan2013-06-05
| | | | | | | task_cs() will never return NULL. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cgroup: clean up the cftype array for the base cgroup filesTejun Heo2013-06-05
| | | | | | | | | | | | | | | * Rename it from files[] (really?) to cgroup_base_files[]. * Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and used inconsistently. Just use "cgroup." directly. * Collect insane files at the end. Note that only the insane ones are missing "cgroup." prefix. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: mark "notify_on_release" and "release_agent" cgroup files insaneTejun Heo2013-06-05
| | | | | | | | | | | | | The empty cgroup notification mechanism currently implemented in cgroup is tragically outdated. Forking and execing userland process stopped being a viable notification mechanism more than a decade ago. We're gonna have a saner mechanism. Let's make it clear that this abomination is going away. Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: mark "tasks" cgroup file as insaneTejun Heo2013-06-05
| | | | | | | | | | | | | | | | | | | Some resources controlled by cgroup aren't per-task and cgroup core allowing threads of a single thread_group to be in different cgroups forced memcg do explicitly find the group leader and use it. This is gonna be nasty when transitioning to unified hierarchy and in general we don't want and won't support granularity finer than processes. Mark "tasks" with CFTYPE_INSANE. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org Cc: Vivek Goyal <vgoyal@redhat.com>
* cgroup: update iterators to use cgroup_next_sibling()Tejun Heo2013-05-23
| | | | | | | | | | | | | | | | | | | | | | | | This patch converts cgroup_for_each_child(), cgroup_next_descendant_pre/post() and thus cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling() instead of manually dereferencing ->sibling.next. The only reason the iterators couldn't allow dropping RCU read lock while iteration is in progress was because they couldn't determine the next sibling safely once RCU read lock is dropped. Using cgroup_next_sibling() removes that problem and enables all iterators to allow dropping RCU read lock in the middle. Comments are updated accordingly. This makes the iterators easier to use and will simplify controllers. Note that @cgroup argument is renamed to @cgrp in cgroup_for_each_child() because it conflicts with "struct cgroup" used in the new macro body. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz>
* cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()Tejun Heo2013-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there's no easy way to find out the next sibling cgroup unless it's known that the current cgroup is accessed from the parent's children list in a single RCU critical section. This in turn forces all iterators to require whole iteration to be enclosed in a single RCU critical section, which sometimes is too restrictive. This patch implements cgroup_next_sibling() which can reliably determine the next sibling regardless of the state of the current cgroup as long as it's accessible. It currently is impossible to determine the next sibling after dropping RCU read lock because the cgroup being iterated could be removed anytime and if RCU read lock is dropped, nothing guarantess its ->sibling.next pointer is accessible. A removed cgroup would continue to point to its next sibling for RCU accesses but stop receiving updates from the sibling. IOW, the next sibling could be removed and then complete its grace period while RCU read lock is dropped, making it unsafe to dereference ->sibling.next after dropping and re-acquiring RCU read lock. This can be solved by adding a way to traverse to the next sibling without dereferencing ->sibling.next. This patch adds a monotonically increasing cgroup serial number, cgroup->serial_nr, which guarantees that all cgroup->children lists are kept in increasing serial_nr order. A new function, cgroup_next_sibling(), is implemented, which, if CGRP_REMOVED is not set on the current cgroup, follows ->sibling.next; otherwise, traverses the parent's ->children list until it sees a sibling with higher ->serial_nr. This allows the function to always return the next sibling regardless of the state of the current cgroup without adding overhead in the fast path. Further patches will update the iterators to use cgroup_next_sibling() so that they allow dropping RCU read lock and blocking while iteration is in progress which in turn will be used to simplify controllers. v2: Typo fix as per Serge. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
* cgroup: make cgroup_is_removed() staticTejun Heo2013-05-23
| | | | | | | | | | cgroup_is_removed() no longer has external users and it shouldn't grow any - controllers should deal with cgroup_subsys_state on/offline state instead of cgroup removal state. Make it static. While at it, make it return bool. Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge branch 'for-3.10-fixes' into for-3.11Tejun Heo2013-05-23
|\ | | | | | | | | | | | | Merging to receive 7805d000db ("cgroup: fix a subtle bug in descendant pre-order walk") so that further iterator updates can build upon it. Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup: fix a subtle bug in descendant pre-order walkTejun Heo2013-05-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When cgroup_next_descendant_pre() initiates a walk, it checks whether the subtree root doesn't have any children and if not returns NULL. Later code assumes that the subtree isn't empty. This is broken because the subtree may become empty inbetween, which can lead to the traversal escaping the subtree by walking to the sibling of the subtree root. There's no reason to have the early exit path. Remove it along with the later assumption that the subtree isn't empty. This simplifies the code a bit and fixes the subtle bug. While at it, fix the comment of cgroup_for_each_descendant_pre() which was incorrectly referring to ->css_offline() instead of ->css_online(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org
| * cgroup: initialize xattr before calling d_instantiate()Li Zefan2013-05-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_create_file() calls d_instantiate(), which may decide to look at the xattrs on the file. Smack always does this and SELinux can be configured to do so. But cgroup_add_file() didn't initialize xattrs before calling cgroup_create_file(), which finally leads to dereferencing NULL dentry->d_fsdata. This bug has been there since cgroup xattr was introduced. Cc: <stable@vger.kernel.org> # 3.8.x Reported-by: Ivan Bulatovic <combuster@archlinux.us> Reported-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | cgroup: implement task_cgroup_path_from_hierarchy()Tejun Heo2013-05-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kdbus folks want a sane way to determine the cgroup path that a given task belongs to on a given hierarchy, which is a reasonble thing to expect from cgroup core. Implement task_cgroup_path_from_hierarchy(). v2: Dropped unnecessary NULL check on the return value of task_cgroup_from_root() as suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <greg@kroah.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <daniel@zonque.org>
* | cgroup: make hierarchy_id use cyclic idrTejun Heo2013-05-14
| | | | | | | | | | | | | | | | | | We want to be able to lookup a hierarchy from its id and cyclic allocation is a whole lot simpler with idr. Convert to idr and use idr_alloc_cyclc(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* | cgroup: drop hierarchy_id_lockTejun Heo2013-05-14
| | | | | | | | | | | | | | | | Now that hierarchy_id alloc / free are protected by the cgroup mutexes, there's no need for this separate lock. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* | cgroup: refactor hierarchy_id handlingTejun Heo2013-05-14
|/ | | | | | | | | | | | | | | | | | | | | | | We're planning to converting hierarchy_ida to an idr and use it to look up hierarchy from its id. As we want the mapping to happen atomically with cgroupfs_root registration, this patch refactors hierarchy_id init / exit so that ida operations happen inside cgroup_[root_]mutex. * s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or -errno like a normal function. * Move hierarchy_id initialization from cgroup_root_from_opts() into cgroup_mount() block where the root is confirmed to be used and being registered while holding both mutexes. * Split cgroup_drop_id() into cgroup_exit_root_id() and cgroup_free_root(), so that ID release can happen before dropping the mutexes in cgroup_kill_sb(). The latter expects hierarchy_id to be exited before being invoked. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* Merge tag 'trace-fixes-v3.10' of ↵Linus Torvalds2013-05-11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing/kprobes update from Steven Rostedt: "The majority of these changes are from Masami Hiramatsu bringing kprobes up to par with the latest changes to ftrace (multi buffering and the new function probes). He also discovered and fixed some bugs in doing so. When pulling in his patches, I also found a few minor bugs as well and fixed them. This also includes a compile fix for some archs that select the ring buffer but not tracing. I based this off of the last patch you took from me that fixed the merge conflict error, as that was the commit that had all the changes I needed for this set of changes." * tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/kprobes: Support soft-mode disabling tracing/kprobes: Support ftrace_event_file base multibuffer tracing/kprobes: Pass trace_probe directly from dispatcher tracing/kprobes: Increment probe hit-count even if it is used by perf tracing/kprobes: Use bool for retprobe checker ftrace: Fix function probe when more than one probe is added ftrace: Fix the output of enabled_functions debug file ftrace: Fix locking in register_ftrace_function_probe() tracing: Add helper function trace_create_new_event() to remove duplicate code tracing: Modify soft-mode only if there's no other referrer tracing: Indicate enabled soft-mode in enable file tracing/kprobes: Fix to increment return event probe hit-count ftrace: Cleanup regex_lock and ftrace_lock around hash updating ftrace, kprobes: Fix a deadlock on ftrace_regex_lock ftrace: Have ftrace_regex_write() return either read or error tracing: Return error if register_ftrace_function_probe() fails for event_enable_func() tracing: Don't succeed if event_enable_func did not register anything ring-buffer: Select IRQ_WORK
| * tracing/kprobes: Support soft-mode disablingMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support soft-mode disabling on kprobe-based dynamic events. Soft-disabling is just ignoring recording if the soft disabled flag is set. Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522 Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/kprobes: Support ftrace_event_file base multibufferMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support multi-buffer on kprobe-based dynamic events by using ftrace_event_file. Link: http://lkml.kernel.org/r/20130509054449.30398.88343.stgit@mhiramat-M0-7522 Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/kprobes: Pass trace_probe directly from dispatcherMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass the pointer of struct trace_probe directly from probe dispatcher to handlers. This removes redundant container_of macro uses. Same thing has already done in trace_uprobe. Link: http://lkml.kernel.org/r/20130509054441.30398.69112.stgit@mhiramat-M0-7522 Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/kprobes: Increment probe hit-count even if it is used by perfMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increment probe hit-count for profiling even if it is used by perf tool. Same thing has already done in trace_uprobe. Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522 Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/kprobes: Use bool for retprobe checkerMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use bool instead of int for kretprobe checker. Link: http://lkml.kernel.org/r/20130509054431.30398.38561.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Fix function probe when more than one probe is addedSteven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the first function probe is added and the function tracer is updated the functions are modified to call the probe. But when a second function is added, it updates the function records to have the second function also update, but it fails to update the actual function itself. This prevents the second (or third or forth and so on) probes from having their functions called. # echo vfs_symlink:enable_event:sched:sched_switch > set_ftrace_filter # echo vfs_unlink:enable_event:sched:sched_switch > set_ftrace_filter # cat trace # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | # touch /tmp/a # rm /tmp/a # cat trace # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | # ln -s /tmp/a # cat trace # tracer: nop # # entries-in-buffer/entries-written: 414/414 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | <idle>-0 [000] d..3 2847.923031: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=2786 next_prio=120 <...>-3114 [001] d..4 2847.923035: sched_switch: prev_comm=ln prev_pid=3114 prev_prio=120 prev_state=x ==> next_comm=swapper/1 next_pid=0 next_prio=120 bash-2786 [000] d..3 2847.923535: sched_switch: prev_comm=bash prev_pid=2786 prev_prio=120 prev_state=S ==> next_comm=kworker/0:1 next_pid=34 next_prio=120 kworker/0:1-34 [000] d..3 2847.923552: sched_switch: prev_comm=kworker/0:1 prev_pid=34 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 <idle>-0 [002] d..3 2847.923554: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=2783 next_prio=120 sshd-2783 [002] d..3 2847.923660: sched_switch: prev_comm=sshd prev_pid=2783 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120 Still need to update the functions even though the probe itself does not need to be registered again when added a new probe. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Fix the output of enabled_functions debug fileSteven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The enabled_functions debugfs file was created to be able to see what functions have been modified from nops to calling a tracer. The current method uses the counter in the function record. As when a ftrace_ops is registered to a function, its count increases. But that doesn't mean that the function is actively being traced. /proc/sys/kernel/ftrace_enabled can be set to zero which would disable it, as well as something can go wrong and we can think its enabled when only the counter is set. The record's FTRACE_FL_ENABLED flag is set or cleared when its function is modified. That is a much more accurate way of knowing what function is enabled or not. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Fix locking in register_ftrace_function_probe()Steven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | The iteration of the ftrace function list and the call to ftrace_match_record() need to be protected by the ftrace_lock. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing: Add helper function trace_create_new_event() to remove duplicate codeSteven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Both __trace_add_new_event() and __trace_early_add_new_event() do basically the same thing, except that __trace_add_new_event() does a little more. Instead of having duplicate code between the two functions, add a helper function trace_create_new_event() that both can use. This will help against having bugs fixed in one function but not the other. Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing: Modify soft-mode only if there's no other referrerMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify soft-mode flag only if no other soft-mode referrer (currently only the ftrace triggers) by using a reference counter in each ftrace_event_file. Without this fix, adding and removing several different enable/disable_event triggers on the same event clear soft-mode bit from the ftrace_event_file. This also happens with a typo of glob on setting triggers. e.g. # echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter # cat events/net/netif_rx/enable 0* # echo typo_func:enable_event:net:netif_rx > set_ftrace_filter # cat events/net/netif_rx/enable 0 # cat set_ftrace_filter #### all functions enabled #### vfs_symlink:enable_event:net:netif_rx:unlimited As above, we still have a trigger, but soft-mode is gone. Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: David Sharp <dhsharp@google.com> Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing: Indicate enabled soft-mode in enable fileMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Indicate enabled soft-mode event as "1*" in "enable" file for each event, because it can be soft-disabled when disable_event trigger is hit. Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing/kprobes: Fix to increment return event probe hit-countMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix to increment probe hit-count for function return event. Link: http://lkml.kernel.org/r/20130509054424.30398.34058.stgit@mhiramat-M0-7522 Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Cleanup regex_lock and ftrace_lock around hash updatingMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup regex_lock and ftrace_lock locking points around ftrace_ops hash update code. The new rule is that regex_lock protects ops->*_hash read-update-write code for each ftrace_ops. Usually, hash update is done by following sequence. 1. allocate a new local hash and copy the original hash. 2. update the local hash. 3. move(actually, copy) back the local hash to ftrace_ops. 4. update ftrace entries if needed. 5. release the local hash. This makes regex_lock protect #1-#4, and ftrace_lock to protect #3, #4 and adding and removing ftrace_ops from the ftrace_ops_list. The ftrace_lock protects #3 as well because the move functions update the entries too. Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace, kprobes: Fix a deadlock on ftrace_regex_lockMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a deadlock on ftrace_regex_lock which happens when setting an enable_event trigger on dynamic kprobe event as below. ---- sh-2.05b# echo p vfs_symlink > kprobe_events sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter ============================================= [ INFO: possible recursive locking detected ] 3.9.0+ #35 Not tainted --------------------------------------------- sh/72 is trying to acquire lock: (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0 but task is already holding lock: (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(ftrace_regex_lock); lock(ftrace_regex_lock); *** DEADLOCK *** ---- To fix that, this introduces a finer regex_lock for each ftrace_ops. ftrace_regex_lock is too big of a lock which protects all filter/notrace_hash operations, but it doesn't need to be a global lock after supporting multiple ftrace_ops because each ftrace_ops has its own filter/notrace_hash. Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> [ Added initialization flag and automate mutex initialization for non ftrace.c ftrace_probes. ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Have ftrace_regex_write() return either read or errorSteven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | | | | | | | | | As ftrace_regex_write() reads the result of ftrace_process_regex() which can sometimes return a positive number, only consider a failure if the return is negative. Otherwise, it will skip possible other registered probes and by returning a positive number that wasn't read, it will confuse the user processes doing the writing. Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing: Return error if register_ftrace_function_probe() fails for ↵Steven Rostedt (Red Hat)2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | event_enable_func() register_ftrace_function_probe() returns the number of functions it registered, which can be zero, it can also return a negative number if something went wrong. But event_enable_func() only checks for the case that it didn't register anything, it needs to also check for the case that something went wrong and return that error code as well. Added some comments about the code as well, to make it more understandable. Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * tracing: Don't succeed if event_enable_func did not register anythingMasami Hiramatsu2013-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return 0 instead of the number of activated ftrace function probes if event_enable_func succeeded and return an error code if it failed or did not register any functions. But it currently returns the number of registered functions and if it didn't register anything, it returns 0, but that is considered success. This also fixes the return value. As if it succeeds, it returns the number of functions that were enabled, which is returned back to the user in ftrace_regex_write (the write() return code). If only one function is enabled, then the return code of the write is one, and this can confuse the user program in thinking it only wrote 1 byte. Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522 Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@intel.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> [ Rewrote change log to reflect that this fixes two bugs - SR ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ring-buffer: Select IRQ_WORKSteven Rostedt (Red Hat)2013-05-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the wake up logic for waiters on the buffer has been moved from the tracing code to the ring buffer, it requires also adding IRQ_WORK as the wake up code is performed via irq_work. This fixes compile breakage when a user of the ring buffer is selected but tracing and irq_work are not. Link http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost Cc: Arnd Bergmann <arnd@arndb.de> Reported-by: Robert Richter <rric@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | Merge git://git.infradead.org/users/eparis/auditLinus Torvalds2013-05-11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit changes from Eric Paris: "Al used to send pull requests every couple of years but he told me to just start pushing them to you directly. Our touching outside of core audit code is pretty straight forward. A couple of interface changes which hit net/. A simple argument bug calling audit functions in namei.c and the removal of some assembly branch prediction code on ppc" * git://git.infradead.org/users/eparis/audit: (31 commits) audit: fix message spacing printing auid Revert "audit: move kaudit thread start from auditd registration to kaudit init" audit: vfs: fix audit_inode call in O_CREAT case of do_last audit: Make testing for a valid loginuid explicit. audit: fix event coverage of AUDIT_ANOM_LINK audit: use spin_lock in audit_receive_msg to process tty logging audit: do not needlessly take a lock in tty_audit_exit audit: do not needlessly take a spinlock in copy_signal audit: add an option to control logging of passwords with pam_tty_audit audit: use spin_lock_irqsave/restore in audit tty code helper for some session id stuff audit: use a consistent audit helper to log lsm information audit: push loginuid and sessionid processing down audit: stop pushing loginid, uid, sessionid as arguments audit: remove the old depricated kernel interface audit: make validity checking generic audit: allow checking the type of audit message in the user filter audit: fix build break when AUDIT_DEBUG == 2 audit: remove duplicate export of audit_enabled Audit: do not print error when LSMs disabled ...
| * | audit: fix message spacing printing auidEric Paris2013-05-08
| | | | | | | | | | | | | | | | | | | | | The helper function didn't include a leading space, so it was jammed against the previous text in the audit record. Signed-off-by: Eric Paris <eparis@redhat.com>
| * | Revert "audit: move kaudit thread start from auditd registration to kaudit init"Eric Paris2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 6ff5e45985c2fcb97947818f66d1eeaf9d6600b2. Conflicts: kernel/audit.c This patch was starting a kthread for all the time. Since the follow on patches that required it didn't get finished in 3.10 time, we shouldn't ship this change in 3.10. Signed-off-by: Eric Paris <eparis@redhat.com>
| * | audit: Make testing for a valid loginuid explicit.Eric W. Biederman2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | audit rule additions containing "-F auid!=4294967295" were failing with EINVAL because of a regression caused by e1760bd. Apparently some userland audit rule sets want to know if loginuid uid has been set and are using a test for auid != 4294967295 to determine that. In practice that is a horrible way to ask if a value has been set, because it relies on subtle implementation details and will break every time the uid implementation in the kernel changes. So add a clean way to test if the audit loginuid has been set, and silently convert the old idiom to the cleaner and more comprehensible new idiom. Cc: <stable@vger.kernel.org> # 3.7 Reported-By: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | audit: fix event coverage of AUDIT_ANOM_LINKEric Paris2013-04-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The userspace audit tools didn't like the existing formatting of the AUDIT_ANOM_LINK event. It needed to be expanded to emit an AUDIT_PATH event as well, so this implements the change. The bulk of the patch is moving code out of auditsc.c into audit.c and audit.h for general use. It expands audit_log_name to include an optional "struct path" argument for the simple case of just needing to report a pathname. This also makes audit_log_task_info available when syscall auditing is not enabled, since it is needed in either case for process details. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Steve Grubb <sgrubb@redhat.com>