author    Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
commit    6c8a53c9e6a151fffb07f8b4c34bd1e33dddd467 (patch)
tree      791caf826ef136c521a97b7878f226b6ba1c1d75 /kernel/trace
parent    e95e7f627062be5e6ce971ce873e6234c91ffc50 (diff)
parent    066450be419fa48007a9f29e19828f2a86198754 (diff)
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Core kernel changes:

   - One of the more interesting features in this cycle is the ability to
     attach eBPF programs (user-defined, sandboxed bytecode executed by the
     kernel) to kprobes.

     This allows user-defined instrumentation on a live kernel image that
     can never crash, hang or interfere with the kernel negatively.  (Right
     now it's limited to root-only, but in the future we might allow
     unprivileged use as well.)

     (Alexei Starovoitov)

   - Another non-trivial feature is per event clockid support: this allows,
     amongst other things, the selection of different clock sources for
     event timestamps traced via perf.

     This feature is sought by people who'd like to merge perf generated
     events with external events that were measured with different clocks:

       - cluster wide profiling

       - for system wide tracing with user-space events,

       - JIT profiling events

     etc.  Matching perf tooling support is added as well, available via
     the -k, --clockid <clockid> parameter to perf record et al.

     (Peter Zijlstra)

  Hardware enablement kernel changes:

   - x86 Intel Processor Trace (PT) support: which is a hardware tracer on
     steroids, available on Broadwell CPUs.

     The hardware trace stream is directly output into the user-space
     ring-buffer, using the 'AUX' data format extension that was added to
     the perf core to support hardware constraints such as the necessity
     to have the tracing buffer physically contiguous.

     This patch-set was developed for two years and this is the result.

     A simple way to make use of this is to use BTS tracing, the PT driver
     emulates BTS output - available via the 'intel_bts' PMU.  More
     explicit PT specific tooling support is in the works as well - will
     probably be ready by 4.2.

     (Alexander Shishkin, Peter Zijlstra)

   - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
     feature of Intel Xeon CPUs that allows the measurement and
     allocation/partitioning of caches to individual workloads.

     These kernel changes expose the measurement side as a new PMU driver,
     which exposes various QoS related PMU events.  (The partitioning
     change is work in progress and is planned to be merged as a cgroup
     extension.)

     (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
     Waskiewicz Jr)

   - x86 Intel Haswell LBR call stack support: this is a new Haswell
     feature that allows the hardware recording of call chains, plus
     tooling support.  To activate this feature you have to enable it via
     the new 'lbr' call-graph recording option:

        perf record --call-graph lbr
        perf report

     or:

        perf top --call-graph lbr

     This hardware feature is a lot faster than stack walk or dwarf based
     unwinding, but has some limitations:

       - It reuses the current LBR facility, so LBR call stack and branch
         record can not be enabled at the same time.

       - It is only available for user-space callchains.

     (Yan, Zheng)

   - x86 Intel Broadwell CPU support and various event constraints and
     event table fixes for earlier models.

     (Andi Kleen)

   - x86 Intel HT CPUs event scheduling workarounds.  This is a complex
     CPU bug affecting the SNB, IVB, HSW families that results in counter
     value corruption.  The mitigation code is automatically enabled and
     is transparent.
     (Maria Dimakopoulou, Stephane Eranian)

  The perf tooling side had a ton of changes in this cycle as well, so I'm
  only able to list the user visible changes here, in addition to the
  tooling changes outlined above:

  User visible changes affecting all tools:

   - Improve support of compressed kernel modules (Jiri Olsa)

   - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)

   - Bash completion for subcommands (Yunlong Song)

   - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)

   - Support missing -f to override perf.data file ownership. (Yunlong Song)

   - Show the first event with an invalid filter (David Ahern, Arnaldo
     Carvalho de Melo)

  User visible changes in individual tools:

  'perf data':
     New tool for converting perf.data to other formats, initially for the
     CTF (Common Trace Format) from LTTng (Jiri Olsa, Sebastian Siewior)

  'perf diff':
     Add --kallsyms option (David Ahern)

  'perf list':
     Allow listing events with 'tracepoint' prefix (Yunlong Song)

     Sort the output of the command (Yunlong Song)

  'perf kmem':
     Respect -i option (Jiri Olsa)

     Print big numbers using thousands' group (Namhyung Kim)

     Allow -v option (Namhyung Kim)

     Fix alignment of slab result table (Namhyung Kim)

  'perf probe':
     Support multiple probes on different binaries on the same command
     line (Masami Hiramatsu)

     Support unnamed union/structure members data collection. (Masami
     Hiramatsu)

     Check kprobes blacklist when adding new events. (Masami Hiramatsu)

  'perf record':
     Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

     Support recording running/enabled time (Andi Kleen)

  'perf sched':
     Improve the performance of 'perf sched replay' on high CPU core count
     machines (Yunlong Song)

  'perf report' and 'perf top':
     Allow annotating entries in callchains in the hists browser (Arnaldo
     Carvalho de Melo)

     Indicate which callchain entries are annotated in the TUI hists
     browser (Arnaldo Carvalho de Melo)

     Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

     Consider PERF_RECORD_ events with cpumode == 0 in 'perf top',
     removing one cause of long term memory usage buildup, i.e. not
     processing PERF_RECORD_EXIT events (Arnaldo Carvalho de Melo)

  'perf stat':
     Report unsupported events properly (Suzuki K. Poulose)

     Output running time and run/enabled ratio in CSV mode (Andi Kleen)

  'perf trace':
     Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de
     Melo)

     Only insert blank duration bracket when tracing syscalls (Arnaldo
     Carvalho de Melo)

     Filter out the trace pid when no threads are specified (Arnaldo
     Carvalho de Melo)

     Dump stack on segfaults (Arnaldo Carvalho de Melo)

     No need to explicitly enable evsels for workload started from perf,
     let it be enabled via perf_event_attr.enable_on_exec, removing some
     events that take place in the 'perf trace' before a workload is
     really started by it. (Arnaldo Carvalho de Melo)

     Allow mixing with tracepoints and suppressing plain syscalls.
     (Arnaldo Carvalho de Melo)

  There's also been a ton of infrastructure work done, such as the
  split-out of perf's build system into tools/build/ and other changes -
  see the shortlog and changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
  perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
  perf evlist: Fix type for references to data_head/tail
  perf probe: Check the orphaned -x option
  perf probe: Support multiple probes on different binaries
  perf buildid-list: Fix segfault when show DSOs with hits
  perf tools: Fix cross-endian analysis
  perf tools: Fix error path to do closedir() when synthesizing threads
  perf tools: Fix synthesizing fork_event.ppid for non-main thread
  perf tools: Add 'I' event modifier for exclude_idle bit
  perf report: Don't call map__kmap if map is NULL.
  perf tests: Fix attr tests
  perf probe: Fix ARM 32 building error
  perf tools: Merge all perf_event_attr print functions
  perf record: Add clockid parameter
  perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
  perf sched replay: Support using -f to override perf.data file ownership
  perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
  perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
  perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
  perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
  ...
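As an illustration (not part of the pull-request text): the "attach eBPF programs to kprobes" item above is what the kernel/trace diff below implements on the kernel side. A rough, hedged sketch of the user-space attach sequence might look like the C below; it assumes a BPF_PROG_TYPE_KPROBE program already loaded via the bpf(2) syscall (prog_fd), debugfs mounted at /sys/kernel/debug, and headers that carry the PERF_EVENT_IOC_SET_BPF ioctl introduced alongside these changes. Paths, names and error handling are illustrative only.

/* Illustrative sketch: create a kprobe, open a perf event on it, and attach
 * an already-loaded BPF program (prog_fd).  Error handling is minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int attach_prog_to_kprobe(int prog_fd, const char *kfunc)
{
        char path[256], buf[64];
        struct perf_event_attr attr = {};
        FILE *f;
        int id, efd;

        /* 1) create the kprobe through the tracing kprobe_events interface */
        f = fopen("/sys/kernel/debug/tracing/kprobe_events", "a");
        if (!f)
                return -1;
        fprintf(f, "p:kprobes/my_%s %s\n", kfunc, kfunc);
        fclose(f);

        /* 2) the kprobe shows up like a tracepoint; read its event id */
        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/tracing/events/kprobes/my_%s/id", kfunc);
        f = fopen(path, "r");
        if (!f || !fgets(buf, sizeof(buf), f))
                return -1;
        id = atoi(buf);
        fclose(f);

        /* 3) open a perf event on that id ... */
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.size = sizeof(attr);
        attr.config = id;
        attr.sample_type = PERF_SAMPLE_RAW;
        attr.sample_period = 1;
        attr.wakeup_events = 1;
        efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
                      -1 /* group_fd */, 0);
        if (efd < 0)
                return -1;

        /* 4) ... enable it and hand over the verified BPF program */
        ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
        return efd;
}

The per event clockid support mentioned above needs nothing comparable on the user side; it is reachable through the quoted -k/--clockid option of perf record.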
Diffstat (limited to 'kernel/trace')
-rw-r--r--  kernel/trace/Kconfig          8
-rw-r--r--  kernel/trace/Makefile         1
-rw-r--r--  kernel/trace/bpf_trace.c    222
-rw-r--r--  kernel/trace/trace_kprobe.c  10
-rw-r--r--  kernel/trace/trace_uprobe.c  10
5 files changed, 245 insertions, 6 deletions
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index fedbdd7d5d1e..3b9a48ae153a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -432,6 +432,14 @@ config UPROBE_EVENT
           This option is required if you plan to use perf-probe subcommand
           of perf tools on user space applications.
 
+config BPF_EVENTS
+        depends on BPF_SYSCALL
+        depends on KPROBE_EVENT
+        bool
+        default y
+        help
+          This allows the user to attach BPF programs to kprobe events.
+
 config PROBE_EVENTS
         def_bool n
 
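An aside on the hunk above: BPF_EVENTS carries no prompt string, so it cannot be selected by hand; it is turned on automatically whenever its dependencies are satisfied. A configuration that ends up building the new bpf_trace.o would therefore contain something like the following fragment (option names are taken from this hunk; prompts and defaults can differ between trees and architectures):

CONFIG_BPF_SYSCALL=y
CONFIG_KPROBE_EVENT=y
# promptless option from the hunk above, derived from the two entries above
CONFIG_BPF_EVENTS=y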
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 98f26588255e..9b1044e936a6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..2d56ce501632
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,222 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <linux/ctype.h>
+#include "trace.h"
+
+static DEFINE_PER_CPU(int, bpf_prog_active);
+
+/**
+ * trace_call_bpf - invoke BPF program
+ * @prog: BPF program
+ * @ctx: opaque context pointer
+ *
+ * kprobe handlers execute BPF programs via this helper.
+ * Can be used from static tracepoints in the future.
+ *
+ * Return: BPF programs always return an integer which is interpreted by
+ * kprobe handler as:
+ * 0 - return from kprobe (event is filtered out)
+ * 1 - store kprobe event into ring buffer
+ * Other values are reserved and currently alias to 1
+ */
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+        unsigned int ret;
+
+        if (in_nmi()) /* not supported yet */
+                return 1;
+
+        preempt_disable();
+
+        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
+                /*
+                 * since some bpf program is already running on this cpu,
+                 * don't call into another bpf program (same or different)
+                 * and don't send kprobe event into ring-buffer,
+                 * so return zero here
+                 */
+                ret = 0;
+                goto out;
+        }
+
+        rcu_read_lock();
+        ret = BPF_PROG_RUN(prog, ctx);
+        rcu_read_unlock();
+
+ out:
+        __this_cpu_dec(bpf_prog_active);
+        preempt_enable();
+
+        return ret;
+}
+EXPORT_SYMBOL_GPL(trace_call_bpf);
+
+static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+        void *dst = (void *) (long) r1;
+        int size = (int) r2;
+        void *unsafe_ptr = (void *) (long) r3;
+
+        return probe_kernel_read(dst, unsafe_ptr, size);
+}
+
+static const struct bpf_func_proto bpf_probe_read_proto = {
+        .func = bpf_probe_read,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+        .arg1_type = ARG_PTR_TO_STACK,
+        .arg2_type = ARG_CONST_STACK_SIZE,
+        .arg3_type = ARG_ANYTHING,
+};
+
+static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+        /* NMI safe access to clock monotonic */
+        return ktime_get_mono_fast_ns();
+}
+
+static const struct bpf_func_proto bpf_ktime_get_ns_proto = {
+        .func = bpf_ktime_get_ns,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+};
+
+/*
+ * limited trace_printk()
+ * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed
+ */
+static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+        char *fmt = (char *) (long) r1;
+        int mod[3] = {};
+        int fmt_cnt = 0;
+        int i;
+
+        /*
+         * bpf_check()->check_func_arg()->check_stack_boundary()
+         * guarantees that fmt points to bpf program stack,
+         * fmt_size bytes of it were initialized and fmt_size > 0
+         */
+        if (fmt[--fmt_size] != 0)
+                return -EINVAL;
+
+        /* check format string for allowed specifiers */
+        for (i = 0; i < fmt_size; i++) {
+                if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
+                        return -EINVAL;
+
+                if (fmt[i] != '%')
+                        continue;
+
+                if (fmt_cnt >= 3)
+                        return -EINVAL;
+
+                /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+                i++;
+                if (fmt[i] == 'l') {
+                        mod[fmt_cnt]++;
+                        i++;
+                } else if (fmt[i] == 'p') {
+                        mod[fmt_cnt]++;
+                        i++;
+                        if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0)
+                                return -EINVAL;
+                        fmt_cnt++;
+                        continue;
+                }
+
+                if (fmt[i] == 'l') {
+                        mod[fmt_cnt]++;
+                        i++;
+                }
+
+                if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
+                        return -EINVAL;
+                fmt_cnt++;
+        }
+
+        return __trace_printk(1/* fake ip will not be printed */, fmt,
+                              mod[0] == 2 ? r3 : mod[0] == 1 ? (long) r3 : (u32) r3,
+                              mod[1] == 2 ? r4 : mod[1] == 1 ? (long) r4 : (u32) r4,
+                              mod[2] == 2 ? r5 : mod[2] == 1 ? (long) r5 : (u32) r5);
+}
+
+static const struct bpf_func_proto bpf_trace_printk_proto = {
+        .func = bpf_trace_printk,
+        .gpl_only = true,
+        .ret_type = RET_INTEGER,
+        .arg1_type = ARG_PTR_TO_STACK,
+        .arg2_type = ARG_CONST_STACK_SIZE,
+};
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+        switch (func_id) {
+        case BPF_FUNC_map_lookup_elem:
+                return &bpf_map_lookup_elem_proto;
+        case BPF_FUNC_map_update_elem:
+                return &bpf_map_update_elem_proto;
+        case BPF_FUNC_map_delete_elem:
+                return &bpf_map_delete_elem_proto;
+        case BPF_FUNC_probe_read:
+                return &bpf_probe_read_proto;
+        case BPF_FUNC_ktime_get_ns:
+                return &bpf_ktime_get_ns_proto;
+
+        case BPF_FUNC_trace_printk:
+                /*
+                 * this program might be calling bpf_trace_printk,
+                 * so allocate per-cpu printk buffers
+                 */
+                trace_printk_init_buffers();
+
+                return &bpf_trace_printk_proto;
+        default:
+                return NULL;
+        }
+}
+
+/* bpf+kprobe programs can access fields of 'struct pt_regs' */
+static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+        /* check bounds */
+        if (off < 0 || off >= sizeof(struct pt_regs))
+                return false;
+
+        /* only read is allowed */
+        if (type != BPF_READ)
+                return false;
+
+        /* disallow misaligned access */
+        if (off % size != 0)
+                return false;
+
+        return true;
+}
+
+static struct bpf_verifier_ops kprobe_prog_ops = {
+        .get_func_proto = kprobe_prog_func_proto,
+        .is_valid_access = kprobe_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list kprobe_tl = {
+        .ops = &kprobe_prog_ops,
+        .type = BPF_PROG_TYPE_KPROBE,
+};
+
+static int __init register_kprobe_prog_ops(void)
+{
+        bpf_register_prog_type(&kprobe_tl);
+        return 0;
+}
+late_initcall(register_kprobe_prog_ops);
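For context (not part of the commit): the ops registered above mean a BPF_PROG_TYPE_KPROBE program receives the probed CPU's struct pt_regs as its context, may only read it, and may call only the helpers whitelisted in kprobe_prog_func_proto(). A rough sketch of such a program in restricted C follows; the SEC() convention, the "bpf_helpers.h" wrappers and the x86_64 register layout are assumptions borrowed from the samples/bpf style of this era, not something this file defines.

/* Sketch of a BPF_PROG_TYPE_KPROBE program (restricted C, compiled to BPF with LLVM). */
#include <linux/skbuff.h>
#include <linux/ptrace.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"        /* assumed: SEC() plus bpf_probe_read/bpf_trace_printk wrappers */

SEC("kprobe/kfree_skb")
int trace_kfree_skb(struct pt_regs *ctx)
{
        /* non-portable: on x86_64 the first argument of the probed function is in %rdi */
        struct sk_buff *skb = (struct sk_buff *)ctx->di;
        char fmt[] = "kfree_skb %p len %u\n";  /* only the whitelisted specifiers */
        unsigned int len = 0;

        /* bpf_probe_read() is the sanctioned way to follow kernel pointers */
        bpf_probe_read(&len, sizeof(len), &skb->len);
        bpf_trace_printk(fmt, sizeof(fmt), skb, len);

        /* 0 = filter this kprobe event out, 1 = let perf record it (see trace_call_bpf) */
        return 1;
}

char _license[] SEC("license") = "GPL";           /* the helpers used above are gpl_only */
u32 _version SEC("version") = LINUX_VERSION_CODE;

A loader would push this object through bpf(BPF_PROG_LOAD) and attach the resulting fd to a kprobe perf event, as in the user-space sketch shown after the pull-request text above.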
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 9ba3f43f580e..d0ce590f06e1 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1135,11 +1135,15 @@ static void
 kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 {
         struct ftrace_event_call *call = &tk->tp.call;
+        struct bpf_prog *prog = call->prog;
         struct kprobe_trace_entry_head *entry;
         struct hlist_head *head;
         int size, __size, dsize;
         int rctx;
 
+        if (prog && !trace_call_bpf(prog, regs))
+                return;
+
         head = this_cpu_ptr(call->perf_events);
         if (hlist_empty(head))
                 return;
@@ -1166,11 +1170,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
                     struct pt_regs *regs)
 {
         struct ftrace_event_call *call = &tk->tp.call;
+        struct bpf_prog *prog = call->prog;
         struct kretprobe_trace_entry_head *entry;
         struct hlist_head *head;
         int size, __size, dsize;
         int rctx;
 
+        if (prog && !trace_call_bpf(prog, regs))
+                return;
+
         head = this_cpu_ptr(call->perf_events);
         if (hlist_empty(head))
                 return;
@@ -1287,7 +1295,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
                 kfree(call->print_fmt);
                 return -ENODEV;
         }
-        call->flags = 0;
+        call->flags = TRACE_EVENT_FL_KPROBE;
         call->class->reg = kprobe_register;
         call->data = tk;
         ret = trace_add_event_call(call);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 74865465e0b7..d60fe62ec4fa 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1006,7 +1006,7 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
                 return true;
 
         list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
-                if (event->hw.tp_target->mm == mm)
+                if (event->hw.target->mm == mm)
                         return true;
         }
 
@@ -1016,7 +1016,7 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
 static inline bool
 uprobe_filter_event(struct trace_uprobe *tu, struct perf_event *event)
 {
-        return __uprobe_perf_filter(&tu->filter, event->hw.tp_target->mm);
+        return __uprobe_perf_filter(&tu->filter, event->hw.target->mm);
 }
 
 static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
@@ -1024,10 +1024,10 @@ static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
         bool done;
 
         write_lock(&tu->filter.rwlock);
-        if (event->hw.tp_target) {
+        if (event->hw.target) {
                 list_del(&event->hw.tp_list);
                 done = tu->filter.nr_systemwide ||
-                        (event->hw.tp_target->flags & PF_EXITING) ||
+                        (event->hw.target->flags & PF_EXITING) ||
                         uprobe_filter_event(tu, event);
         } else {
                 tu->filter.nr_systemwide--;
@@ -1047,7 +1047,7 @@ static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
         int err;
 
         write_lock(&tu->filter.rwlock);
-        if (event->hw.tp_target) {
+        if (event->hw.target) {
                 /*
                  * event->parent != NULL means copy_process(), we can avoid
                  * uprobe_apply(). current->mm must be probed and we can rely