author    Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
commit    6c8a53c9e6a151fffb07f8b4c34bd1e33dddd467 (patch)
tree      791caf826ef136c521a97b7878f226b6ba1c1d75 /kernel/trace
parent    e95e7f627062be5e6ce971ce873e6234c91ffc50 (diff)
parent    066450be419fa48007a9f29e19828f2a86198754 (diff)
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Core kernel changes:

   - One of the more interesting features in this cycle is the ability to
     attach eBPF programs (user-defined, sandboxed bytecode executed by the
     kernel) to kprobes.

     This allows user-defined instrumentation on a live kernel image that
     can never crash, hang or interfere with the kernel negatively.  (Right
     now it's limited to root-only, but in the future we might allow
     unprivileged use as well.)

     (Alexei Starovoitov)

   - Another non-trivial feature is per event clockid support: this allows,
     amongst other things, the selection of different clock sources for
     event timestamps traced via perf.

     This feature is sought by people who'd like to merge perf generated
     events with external events that were measured with different clocks:

       - cluster wide profiling

       - for system wide tracing with user-space events,

       - JIT profiling events

     etc.  Matching perf tooling support is added as well, available via
     the -k, --clockid <clockid> parameter to perf record et al.

     (Peter Zijlstra)

  Hardware enablement kernel changes:

   - x86 Intel Processor Trace (PT) support: which is a hardware tracer on
     steroids, available on Broadwell CPUs.

     The hardware trace stream is directly output into the user-space
     ring-buffer, using the 'AUX' data format extension that was added to
     the perf core to support hardware constraints such as the necessity
     to have the tracing buffer physically contiguous.

     This patch-set was developed for two years and this is the result.

     A simple way to make use of this is to use BTS tracing, the PT driver
     emulates BTS output - available via the 'intel_bts' PMU.  More
     explicit PT specific tooling support is in the works as well - will
     probably be ready by 4.2.

     (Alexander Shishkin, Peter Zijlstra)

   - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
     feature of Intel Xeon CPUs that allows the measurement and
     allocation/partitioning of caches to individual workloads.

     These kernel changes expose the measurement side as a new PMU driver,
     which exposes various QoS related PMU events.  (The partitioning
     change is work in progress and is planned to be merged as a cgroup
     extension.)

     (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
     Waskiewicz Jr)

   - x86 Intel Haswell LBR call stack support: this is a new Haswell
     feature that allows the hardware recording of call chains, plus
     tooling support.  To activate this feature you have to enable it via
     the new 'lbr' call-graph recording option:

        perf record --call-graph lbr
        perf report

     or:

        perf top --call-graph lbr

     This hardware feature is a lot faster than stack walk or dwarf based
     unwinding, but has some limitations:

       - It reuses the current LBR facility, so LBR call stack and branch
         record can not be enabled at the same time.

       - It is only available for user-space callchains.

     (Yan, Zheng)

   - x86 Intel Broadwell CPU support and various event constraints and
     event table fixes for earlier models.

     (Andi Kleen)

   - x86 Intel HT CPUs event scheduling workarounds.  This is a complex
     CPU bug affecting the SNB, IVB, HSW families that results in counter
     value corruption.  The mitigation code is automatically enabled and
     is transparent.
     (Maria Dimakopoulou, Stephane Eranian)

  The perf tooling side had a ton of changes in this cycle as well, so I'm
  only able to list the user visible changes here, in addition to the
  tooling changes outlined above:

  User visible changes affecting all tools:

   - Improve support of compressed kernel modules (Jiri Olsa)

   - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)

   - Bash completion for subcommands (Yunlong Song)

   - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)

   - Support missing -f to override perf.data file ownership. (Yunlong Song)

   - Show the first event with an invalid filter (David Ahern, Arnaldo
     Carvalho de Melo)

  User visible changes in individual tools:

  'perf data':
     New tool for converting perf.data to other formats, initially for the
     CTF (Common Trace Format) from LTTng (Jiri Olsa, Sebastian Siewior)

  'perf diff':
     Add --kallsyms option (David Ahern)

  'perf list':
     Allow listing events with 'tracepoint' prefix (Yunlong Song)

     Sort the output of the command (Yunlong Song)

  'perf kmem':
     Respect -i option (Jiri Olsa)

     Print big numbers using thousands' group (Namhyung Kim)

     Allow -v option (Namhyung Kim)

     Fix alignment of slab result table (Namhyung Kim)

  'perf probe':
     Support multiple probes on different binaries on the same command
     line (Masami Hiramatsu)

     Support unnamed union/structure members data collection. (Masami
     Hiramatsu)

     Check kprobes blacklist when adding new events. (Masami Hiramatsu)

  'perf record':
     Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

     Support recording running/enabled time (Andi Kleen)

  'perf sched':
     Improve the performance of 'perf sched replay' on high CPU core count
     machines (Yunlong Song)

  'perf report' and 'perf top':
     Allow annotating entries in callchains in the hists browser (Arnaldo
     Carvalho de Melo)

     Indicate which callchain entries are annotated in the TUI hists
     browser (Arnaldo Carvalho de Melo)

     Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

     Consider PERF_RECORD_ events with cpumode == 0 in 'perf top',
     removing one cause of long term memory usage buildup, i.e. not
     processing PERF_RECORD_EXIT events (Arnaldo Carvalho de Melo)

  'perf stat':
     Report unsupported events properly (Suzuki K. Poulose)

     Output running time and run/enabled ratio in CSV mode (Andi Kleen)

  'perf trace':
     Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de
     Melo)

     Only insert blank duration bracket when tracing syscalls (Arnaldo
     Carvalho de Melo)

     Filter out the trace pid when no threads are specified (Arnaldo
     Carvalho de Melo)

     Dump stack on segfaults (Arnaldo Carvalho de Melo)

     No need to explicitly enable evsels for workload started from perf,
     let it be enabled via perf_event_attr.enable_on_exec, removing some
     events that take place in the 'perf trace' before a workload is
     really started by it. (Arnaldo Carvalho de Melo)

     Allow mixing with tracepoints and suppressing plain syscalls.
     (Arnaldo Carvalho de Melo)

  There's also been a ton of infrastructure work done, such as the
  split-out of perf's build system into tools/build/ and other changes -
  see the shortlog and changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
  perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
  perf evlist: Fix type for references to data_head/tail
  perf probe: Check the orphaned -x option
  perf probe: Support multiple probes on different binaries
  perf buildid-list: Fix segfault when show DSOs with hits
  perf tools: Fix cross-endian analysis
  perf tools: Fix error path to do closedir() when synthesizing threads
  perf tools: Fix synthesizing fork_event.ppid for non-main thread
  perf tools: Add 'I' event modifier for exclude_idle bit
  perf report: Don't call map__kmap if map is NULL.
  perf tests: Fix attr tests
  perf probe: Fix ARM 32 building error
  perf tools: Merge all perf_event_attr print functions
  perf record: Add clockid parameter
  perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
  perf sched replay: Support using -f to override perf.data file ownership
  perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
  perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
  perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
  perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
  ...
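As an illustration (not part of the pull-request text): the "attach eBPF programs to kprobes" item above is what the kernel/trace diff below implements on the kernel side. A rough, hedged sketch of the user-space attach sequence might look like the C below; it assumes a BPF_PROG_TYPE_KPROBE program already loaded via the bpf(2) syscall (prog_fd), debugfs mounted at /sys/kernel/debug, and headers that carry the PERF_EVENT_IOC_SET_BPF ioctl introduced alongside these changes. Paths, names and error handling are illustrative only.

/* Illustrative sketch: create a kprobe, open a perf event on it, and attach
 * an already-loaded BPF program (prog_fd).  Error handling is minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int attach_prog_to_kprobe(int prog_fd, const char *kfunc)
{
        char path[256], buf[64];
        struct perf_event_attr attr = {};
        FILE *f;
        int id, efd;

        /* 1) create the kprobe through the tracing kprobe_events interface */
        f = fopen("/sys/kernel/debug/tracing/kprobe_events", "a");
        if (!f)
                return -1;
        fprintf(f, "p:kprobes/my_%s %s\n", kfunc, kfunc);
        fclose(f);

        /* 2) the kprobe shows up like a tracepoint; read its event id */
        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/tracing/events/kprobes/my_%s/id", kfunc);
        f = fopen(path, "r");
        if (!f || !fgets(buf, sizeof(buf), f))
                return -1;
        id = atoi(buf);
        fclose(f);

        /* 3) open a perf event on that id ... */
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.size = sizeof(attr);
        attr.config = id;
        attr.sample_type = PERF_SAMPLE_RAW;
        attr.sample_period = 1;
        attr.wakeup_events = 1;
        efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
                      -1 /* group_fd */, 0);
        if (efd < 0)
                return -1;

        /* 4) ... enable it and hand over the verified BPF program */
        ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
        return efd;
}

The per event clockid support mentioned above needs nothing comparable on the user side; it is reachable through the quoted -k/--clockid option of perf record.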
Diffstat (limited to 'kernel/trace')
-rw-r--r--  kernel/trace/Kconfig          8
-rw-r--r--  kernel/trace/Makefile         1
-rw-r--r--  kernel/trace/bpf_trace.c    222
-rw-r--r--  kernel/trace/trace_kprobe.c  10
-rw-r--r--  kernel/trace/trace_uprobe.c  10
5 files changed, 245 insertions, 6 deletions
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index fedbdd7d5d1e..3b9a48ae153a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -432,6 +432,14 @@ config UPROBE_EVENT
           This option is required if you plan to use perf-probe subcommand
           of perf tools on user space applications.
 
+config BPF_EVENTS
+        depends on BPF_SYSCALL
+        depends on KPROBE_EVENT
+        bool
+        default y
+        help
+          This allows the user to attach BPF programs to kprobe events.
+
 config PROBE_EVENTS
         def_bool n
 
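An aside on the hunk above: BPF_EVENTS carries no prompt string, so it cannot be selected by hand; it is turned on automatically whenever its dependencies are satisfied. A configuration that ends up building the new bpf_trace.o would therefore contain something like the following fragment (option names are taken from this hunk; prompts and defaults can differ between trees and architectures):

CONFIG_BPF_SYSCALL=y
CONFIG_KPROBE_EVENT=y
# promptless option from the hunk above, derived from the two entries above
CONFIG_BPF_EVENTS=y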
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 98f26588255e..9b1044e936a6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..2d56ce501632
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,222 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <linux/ctype.h>
+#include "trace.h"
+
+static DEFINE_PER_CPU(int, bpf_prog_active);
+
+/**
+ * trace_call_bpf - invoke BPF program
+ * @prog: BPF program
+ * @ctx: opaque context pointer
+ *
+ * kprobe handlers execute BPF programs via this helper.
+ * Can be used from static tracepoints in the future.
+ *
+ * Return: BPF programs always return an integer which is interpreted by
+ * kprobe handler as:
+ * 0 - return from kprobe (event is filtered out)
+ * 1 - store kprobe event into ring buffer
+ * Other values are reserved and currently alias to 1
+ */
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+        unsigned int ret;
+
+        if (in_nmi()) /* not supported yet */
+                return 1;
+
+        preempt_disable();
+
+        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
+                /*
+                 * since some bpf program is already running on this cpu,
+                 * don't call into another bpf program (same or different)
+                 * and don't send kprobe event into ring-buffer,
+                 * so return zero here
+                 */
+                ret = 0;
+                goto out;
+        }
+
+        rcu_read_lock();
+        ret = BPF_PROG_RUN(prog, ctx);
+        rcu_read_unlock();
+
+ out:
+        __this_cpu_dec(bpf_prog_active);
+        preempt_enable();
+
+        return ret;
+}
+EXPORT_SYMBOL_GPL(trace_call_bpf);
+
+static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+        void *dst = (void *) (long) r1;
+        int size = (int) r2;
+        void *unsafe_ptr = (void *) (long) r3;
+
+        return probe_kernel_read(dst, unsafe_ptr, size);
+}
+
+static const struct bpf_func_proto bpf_probe_read_proto = {
+        .func = bpf_probe_read,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+        .arg1_type = ARG_PTR_TO_STACK,
+        .arg2_type = ARG_CONST_STACK_SIZE,
+        .arg3_type = ARG_ANYTHING,
+};
+
+static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+        /* NMI safe access to clock monotonic */
+        return ktime_get_mono_fast_ns();
+}
+
+static const struct bpf_func_proto bpf_ktime_get_ns_proto = {
+        .func = bpf_ktime_get_ns,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+};
+
+/*
+ * limited trace_printk()
+ * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed
+ */
+static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+        char *fmt = (char *) (long) r1;
+        int mod[3] = {};
+        int fmt_cnt = 0;
+        int i;
+
+        /*
+         * bpf_check()->check_func_arg()->check_stack_boundary()
+         * guarantees that fmt points to bpf program stack,
+         * fmt_size bytes of it were initialized and fmt_size > 0
+         */
+        if (fmt[--fmt_size] != 0)
+                return -EINVAL;
+
+        /* check format string for allowed specifiers */
+        for (i = 0; i < fmt_size; i++) {
+                if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
+                        return -EINVAL;
+
+                if (fmt[i] != '%')
+                        continue;
+
+                if (fmt_cnt >= 3)
+                        return -EINVAL;
+
+                /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+                i++;
+                if (fmt[i] == 'l') {
+                        mod[fmt_cnt]++;
+                        i++;
+                } else if (fmt[i] == 'p') {
+                        mod[fmt_cnt]++;
+                        i++;
+                        if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0)
+                                return -EINVAL;
+                        fmt_cnt++;
+                        continue;
+                }
+
+                if (fmt[i] == 'l') {
+                        mod[fmt_cnt]++;
+                        i++;
+                }
+
+                if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
+                        return -EINVAL;
+                fmt_cnt++;
+        }
+
+        return __trace_printk(1/* fake ip will not be printed */, fmt,
+                              mod[0] == 2 ? r3 : mod[0] == 1 ? (long) r3 : (u32) r3,
+                              mod[1] == 2 ? r4 : mod[1] == 1 ? (long) r4 : (u32) r4,
+                              mod[2] == 2 ? r5 : mod[2] == 1 ? (long) r5 : (u32) r5);
+}
+
+static const struct bpf_func_proto bpf_trace_printk_proto = {
+        .func = bpf_trace_printk,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+        .arg1_type = ARG_PTR_TO_STACK,
+        .arg2_type = ARG_CONST_STACK_SIZE,
+};
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+        switch (func_id) {
+        case BPF_FUNC_map_lookup_elem:
+                return &bpf_map_lookup_elem_proto;
+        case BPF_FUNC_map_update_elem:
+                return &bpf_map_update_elem_proto;
+        case BPF_FUNC_map_delete_elem:
+                return &bpf_map_delete_elem_proto;
+        case BPF_FUNC_probe_read:
+                return &bpf_probe_read_proto;
+        case BPF_FUNC_ktime_get_ns:
+                return &bpf_ktime_get_ns_proto;
+
+        case BPF_FUNC_trace_printk:
+                /*
+                 * this program might be calling bpf_trace_printk,
+                 * so allocate per-cpu printk buffers
+                 */
+                trace_printk_init_buffers();
+
+                return &bpf_trace_printk_proto;
+        default:
+                return NULL;
+        }
+}
+
+/* bpf+kprobe programs can access fields of 'struct pt_regs' */
+static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+        /* check bounds */
+        if (off < 0 || off >= sizeof(struct pt_regs))
+                return false;
+
+        /* only read is allowed */
+        if (type != BPF_READ)
+                return false;
+
+        /* disallow misaligned access */
+        if (off % size != 0)
+                return false;
+
+        return true;
+}
+
+static struct bpf_verifier_ops kprobe_prog_ops = {
+        .get_func_proto = kprobe_prog_func_proto,
+        .is_valid_access = kprobe_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list kprobe_tl = {
+        .ops = &kprobe_prog_ops,
+        .type = BPF_PROG_TYPE_KPROBE,
+};
+
+static int __init register_kprobe_prog_ops(void)
+{
+        bpf_register_prog_type(&kprobe_tl);
+        return 0;
+}
+late_initcall(register_kprobe_prog_ops);
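For context (not part of the commit): the ops registered above mean a BPF_PROG_TYPE_KPROBE program receives the probed CPU's struct pt_regs as its context, may only read it, and may call only the helpers whitelisted in kprobe_prog_func_proto(). A rough sketch of such a program in restricted C follows; the SEC() convention, the "bpf_helpers.h" wrappers and the x86_64 register layout are assumptions borrowed from the samples/bpf style of this era, not something this file defines.

/* Sketch of a BPF_PROG_TYPE_KPROBE program (restricted C, compiled to BPF with LLVM). */
#include <linux/skbuff.h>
#include <linux/ptrace.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"        /* assumed: SEC() plus bpf_probe_read/bpf_trace_printk wrappers */

SEC("kprobe/kfree_skb")
int trace_kfree_skb(struct pt_regs *ctx)
{
        /* non-portable: on x86_64 the first argument of the probed function is in %rdi */
        struct sk_buff *skb = (struct sk_buff *)ctx->di;
        char fmt[] = "kfree_skb %p len %u\n";  /* only the whitelisted specifiers */
        unsigned int len = 0;

        /* bpf_probe_read() is the sanctioned way to follow kernel pointers */
        bpf_probe_read(&len, sizeof(len), &skb->len);
        bpf_trace_printk(fmt, sizeof(fmt), skb, len);

        /* 0 = filter this kprobe event out, 1 = let perf record it (see trace_call_bpf) */
        return 1;
}

char _license[] SEC("license") = "GPL";           /* the helpers used above are gpl_only */
u32 _version SEC("version") = LINUX_VERSION_CODE;

A loader would push this object through bpf(BPF_PROG_LOAD) and attach the resulting fd to a kprobe perf event, as in the user-space sketch shown after the pull-request text above.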
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 9ba3f43f580e..d0ce590f06e1 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1135,11 +1135,15 @@ static void
 kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 {
         struct ftrace_event_call *call = &tk->tp.call;
+        struct bpf_prog *prog = call->prog;
         struct kprobe_trace_entry_head *entry;
         struct hlist_head *head;
         int size, __size, dsize;
         int rctx;
 
+        if (prog && !trace_call_bpf(prog, regs))
+                return;
+
         head = this_cpu_ptr(call->perf_events);
         if (hlist_empty(head))
                 return;
@@ -1166,11 +1170,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
                     struct pt_regs *regs)
 {
         struct ftrace_event_call *call = &tk->tp.call;
+        struct bpf_prog *prog = call->prog;
         struct kretprobe_trace_entry_head *entry;
         struct hlist_head *head;
         int size, __size, dsize;
         int rctx;
 
+        if (prog && !trace_call_bpf(prog, regs))
+                return;
+
         head = this_cpu_ptr(call->perf_events);
         if (hlist_empty(head))
                 return;
@@ -1287,7 +1295,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
                 kfree(call->print_fmt);
                 return -ENODEV;
         }
-        call->flags = 0;
+        call->flags = TRACE_EVENT_FL_KPROBE;
         call->class->reg = kprobe_register;
         call->data = tk;
         ret = trace_add_event_call(call);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 74865465e0b7..d60fe62ec4fa 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1006,7 +1006,7 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
                 return true;
 
         list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
-                if (event->hw.tp_target->mm == mm)
+                if (event->hw.target->mm == mm)
                         return true;
         }
 
@@ -1016,7 +1016,7 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
 static inline bool
 uprobe_filter_event(struct trace_uprobe *tu, struct perf_event *event)
 {
-        return __uprobe_perf_filter(&tu->filter, event->hw.tp_target->mm);
+        return __uprobe_perf_filter(&tu->filter, event->hw.target->mm);
 }
 
 static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
@@ -1024,10 +1024,10 @@ static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
         bool done;
 
         write_lock(&tu->filter.rwlock);
-        if (event->hw.tp_target) {
+        if (event->hw.target) {
                 list_del(&event->hw.tp_list);
                 done = tu->filter.nr_systemwide ||
-                        (event->hw.tp_target->flags & PF_EXITING) ||
+                        (event->hw.target->flags & PF_EXITING) ||
                         uprobe_filter_event(tu, event);
         } else {
                 tu->filter.nr_systemwide--;
@@ -1047,7 +1047,7 @@ static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
         int err;
 
         write_lock(&tu->filter.rwlock);
-        if (event->hw.tp_target) {
+        if (event->hw.target) {
                 /*
                  * event->parent != NULL means copy_process(), we can avoid
                  * uprobe_apply(). current->mm must be probed and we can rely