net: filter: rework/optimize internal BPF interpreter's instruction set

This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

  BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty. The new BPF jump
  instructions have only one branch and fall-through otherwise,
  which fits the CPU branch predictor logic better. `perf stat`
  shows drastic difference for branch-misses between the old and
  new code.

2. Jump-threaded implementation of interpreter vs switch
   statement:

  Instead of single table-jump at the top of 'switch' statement,
  gcc will now generate multiple table-jump instructions, which
  helps CPU branch predictor logic.

Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
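
The two reasons given above (single-target jumps with fall-through, and jump-threaded dispatch instead of one 'switch') can be illustrated with a small user-space sketch. Everything in it is hypothetical and invented for illustration: the toy opcodes, `struct toy_insn`, `run_switch()` and `run_threaded()` are not the kernel's instruction encoding or interpreter. The `&&label` and `goto *` constructs are GCC extensions (Clang accepts them too), so this assumes one of those compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; not the kernel's instruction encoding. */
enum { OP_ADD_IMM, OP_JEQ_IMM, OP_RET, OP_COUNT };

struct toy_insn {
    uint8_t op;
    int16_t off;    /* jump offset, relative to the next instruction */
    int32_t imm;
};

/* Switch-based dispatch: every instruction goes through the same
 * indirect branch at the top of the loop. */
static int64_t run_switch(const struct toy_insn *insn, int64_t acc)
{
    for (;; insn++) {
        switch (insn->op) {
        case OP_ADD_IMM:
            acc += insn->imm;
            break;
        case OP_JEQ_IMM:
            /* One target only; otherwise fall through to the next insn. */
            if (acc == insn->imm)
                insn += insn->off;
            break;
        case OP_RET:
            return acc;
        default:
            return -1;
        }
    }
}

/* Jump-threaded dispatch: each handler ends with its own indirect
 * jump, so the CPU sees one predictable branch site per opcode. */
static int64_t run_threaded(const struct toy_insn *insn, int64_t acc)
{
    static const void *const jump_table[OP_COUNT] = {
        [OP_ADD_IMM] = &&do_add_imm,
        [OP_JEQ_IMM] = &&do_jeq_imm,
        [OP_RET]     = &&do_ret,
    };

#define DISPATCH() do { assert(insn->op < OP_COUNT); \
                        goto *jump_table[insn->op]; } while (0)
#define NEXT()     do { insn++; DISPATCH(); } while (0)

    DISPATCH();

do_add_imm:
    acc += insn->imm;
    NEXT();
do_jeq_imm:
    if (acc == insn->imm)
        insn += insn->off;
    NEXT();
do_ret:
    return acc;

#undef NEXT
#undef DISPATCH
}
```

Both functions are meant to return the same value for the same program; only the dispatch differs. Whether the threaded form is actually faster depends on the compiler and CPU, which is what the measurements in this commit message are about.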

In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):

  - Number of registers increase from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:

--x86_64--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF          90        101         192        202
new BPF          31         71          47         97
old BPF jit      12         34          17         44
new BPF jit     TBD

--i386--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF         107        136         227        252
new BPF          40        119          69        172

--arm32--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF         202        300         475        540
new BPF         180        270         330        470
old BPF jit      26        182          37        202
new BPF jit     TBD

Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);

  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

  'short filter' has 2 rules
  'large filter' has 200 rules

  'short filter' performance is slightly better on x86_64/i386/arm32
  'large filter' is much faster on x86_64 and i386 and shows no
                 difference on arm32

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
 73.38%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.70%  bench  libc-2.15.so       [.] syscall
  5.09%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.97%  bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
 66.20%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 16.75%  bench  libc-2.15.so       [.] syscall
  3.31%  bench  [kernel.kallsyms]  [k] system_call
  2.88%  bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
author: Alexei Starovoitov <ast@plumgrid.com> 2014-03-28 13:58:25 -0400
committer: David S. Miller <davem@davemloft.net> 2014-03-31 00:45:09 -0400
commit: bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8 (patch)
tree: 6ffb15296ce4cdc1f272e31bd43a5804b8da588c /kernel/seccomp.c
parent: 77e0114ae9ae08685c503772a57af21d299c6701 (diff)
1 files changed, 58 insertions, 61 deletions
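
The seccomp side of this change fills a `struct seccomp_data` once per syscall and then runs every attached filter against it, keeping the result with the lowest action value. The following user-space sketch shows that evaluation order together with the "32-bit aligned and in bounds" load rule. It is an illustration under assumptions, not the kernel code: all `toy_`/`TOY_` names are hypothetical, filters are modelled as plain C callbacks rather than BPF programs, and the struct layout and constant values are copied from what I understand the seccomp UAPI header of that era to define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror the UAPI struct seccomp_data; redefined here so the
 * sketch builds without kernel headers. */
struct toy_seccomp_data {
    int32_t  nr;
    uint32_t arch;
    uint64_t instruction_pointer;
    uint64_t args[6];
};

/* Assumed to match the SECCOMP_RET_* values; treat them as illustrative. */
#define TOY_RET_KILL   0x00000000U
#define TOY_RET_ERRNO  0x00050000U
#define TOY_RET_ALLOW  0x7fff0000U
#define TOY_RET_ACTION 0x7fff0000U

/* Load one 32-bit word at byte offset k, applying the same rule as the
 * filter check: reject offsets that are out of bounds or not 4-aligned.
 * Which half of a 64-bit field lands at which offset depends on the
 * host's endianness. */
static int toy_load_w(const struct toy_seccomp_data *sd, uint32_t k,
                      uint32_t *val)
{
    if (k >= sizeof(*sd) || (k & 3))
        return -1;
    memcpy(val, (const unsigned char *)sd + k, sizeof(*val));
    return 0;
}

typedef uint32_t (*toy_filter_fn)(const struct toy_seccomp_data *sd);

struct toy_filter {
    toy_filter_fn run;
    struct toy_filter *prev;    /* older filter in the chain */
};

/* Run every filter on the same pre-filled data; the lowest action wins. */
static uint32_t toy_run_filters(const struct toy_filter *newest,
                                const struct toy_seccomp_data *sd)
{
    const struct toy_filter *f;
    uint32_t ret = TOY_RET_ALLOW;

    if (!newest)                /* never fail open */
        return TOY_RET_KILL;

    for (f = newest; f; f = f->prev) {
        uint32_t cur = f->run(sd);

        if ((cur & TOY_RET_ACTION) < (ret & TOY_RET_ACTION))
            ret = cur;
    }
    return ret;
}

/* Two example filters (hypothetical). */
static uint32_t toy_allow_all(const struct toy_seccomp_data *sd)
{
    (void)sd;
    return TOY_RET_ALLOW;
}

static uint32_t toy_errno_on_199(const struct toy_seccomp_data *sd)
{
    uint32_t nr;

    if (toy_load_w(sd, offsetof(struct toy_seccomp_data, nr), &nr))
        return TOY_RET_KILL;
    return nr == 199 ? (TOY_RET_ERRNO | 1) : TOY_RET_ALLOW;
}
```

Chaining `toy_errno_on_199` in front of `toy_allow_all` and passing data with `nr` set to 199 is expected to yield the ERRNO result, because its action value is lower than ALLOW.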
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b7a10048a32c..4f18e754c23e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -55,60 +55,33 @@ struct seccomp_filter {
         atomic_t usage;
         struct seccomp_filter *prev;
         unsigned short len;  /* Instruction count */
-        struct sock_filter insns[];
+        struct sock_filter_int insnsi[];
 };

 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))

-/**
- * get_u32 - returns a u32 offset into data
- * @data: a unsigned 64 bit value
- * @index: 0 or 1 to return the first or second 32-bits
- *
- * This inline exists to hide the length of unsigned long.  If a 32-bit
- * unsigned long is passed in, it will be extended and the top 32-bits will be
- * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
- * properly returned.
- *
+/*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
  */
-static inline u32 get_u32(u64 data, int index)
+static void populate_seccomp_data(struct seccomp_data *sd)
 {
-        return ((u32 *)&data)[index];
-}
+        struct task_struct *task = current;
+        struct pt_regs *regs = task_pt_regs(task);

-/* Helper for bpf_load below. */
-#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
-/**
- * bpf_load: checks and returns a pointer to the requested offset
- * @off: offset into struct seccomp_data to load from
- *
- * Returns the requested 32-bits of data.
- * seccomp_check_filter() should assure that @off is 32-bit aligned
- * and not out of bounds.  Failure to do so is a BUG.
- */
-u32 seccomp_bpf_load(int off)
-{
-        struct pt_regs *regs = task_pt_regs(current);
-        if (off == BPF_DATA(nr))
-                return syscall_get_nr(current, regs);
-        if (off == BPF_DATA(arch))
-                return syscall_get_arch(current, regs);
-        if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
-                unsigned long value;
-                int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
-                int index = !!(off % sizeof(u64));
-                syscall_get_arguments(current, regs, arg, 1, &value);
-                return get_u32(value, index);
-        }
-        if (off == BPF_DATA(instruction_pointer))
-                return get_u32(KSTK_EIP(current), 0);
-        if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
-                return get_u32(KSTK_EIP(current), 1);
-        /* seccomp_check_filter should make this impossible. */
-        BUG();
+        sd->nr = syscall_get_nr(task, regs);
+        sd->arch = syscall_get_arch(task, regs);
+
+        /* Unroll syscall_get_args to help gcc on arm. */
+        syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
+        syscall_get_arguments(task, regs, 1, 1, (unsigned long *) &sd->args[1]);
+        syscall_get_arguments(task, regs, 2, 1, (unsigned long *) &sd->args[2]);
+        syscall_get_arguments(task, regs, 3, 1, (unsigned long *) &sd->args[3]);
+        syscall_get_arguments(task, regs, 4, 1, (unsigned long *) &sd->args[4]);
+        syscall_get_arguments(task, regs, 5, 1, (unsigned long *) &sd->args[5]);
+
+        sd->instruction_pointer = KSTK_EIP(task);
 }

 /**
@@ -133,17 +106,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
                switch (code) {
                case BPF_S_LD_W_ABS:
-                        ftest->code = BPF_S_ANC_SECCOMP_LD_W;
+                        ftest->code = BPF_LDX | BPF_W | BPF_ABS;
                        /* 32-bit aligned and not out of bounds. */
                        if (k >= sizeof(struct seccomp_data) || k & 3)
                                return -EINVAL;
                        continue;
                case BPF_S_LD_W_LEN:
-                        ftest->code = BPF_S_LD_IMM;
+                        ftest->code = BPF_LD | BPF_IMM;
                        ftest->k = sizeof(struct seccomp_data);
                        continue;
                case BPF_S_LDX_W_LEN:
-                        ftest->code = BPF_S_LDX_IMM;
+                        ftest->code = BPF_LDX | BPF_IMM;
                        ftest->k = sizeof(struct seccomp_data);
                        continue;
                /* Explicitly include allowed calls. */
@@ -185,6 +158,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
                case BPF_S_JMP_JGT_X:
                case BPF_S_JMP_JSET_K:
                case BPF_S_JMP_JSET_X:
+                        sk_decode_filter(ftest, ftest);
                        continue;
                default:
                        return -EINVAL;
@@ -202,18 +176,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 static u32 seccomp_run_filters(int syscall)
 {
        struct seccomp_filter *f;
+        struct seccomp_data sd;
        u32 ret = SECCOMP_RET_ALLOW;
        /* Ensure unexpected behavior doesn't result in failing open. */
        if (WARN_ON(current->seccomp.filter == NULL))
                return SECCOMP_RET_KILL;
+        populate_seccomp_data(&sd);
        /*
         * All filters in the list are evaluated and the lowest BPF return
         * value always takes priority (ignoring the DATA).
         */
        for (f = current->seccomp.filter; f; f = f->prev) {
-                u32 cur_ret = sk_run_filter(NULL, f->insns);
+                u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
                if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
                        ret = cur_ret;
        }
@@ -231,6 +208,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
        struct seccomp_filter *filter;
        unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
        unsigned long total_insns = fprog->len;
+        struct sock_filter *fp;
+        int new_len;
        long ret;
        if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
@@ -252,28 +231,43 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
                                      CAP_SYS_ADMIN) != 0)
                 return -EACCES;

-        /* Allocate a new seccomp_filter */
-        filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
-                         GFP_KERNEL|__GFP_NOWARN);
-        if (!filter)
+        fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
+        if (!fp)
                 return -ENOMEM;
-        atomic_set(&filter->usage, 1);
-        filter->len = fprog->len;

         /* Copy the instructions from fprog. */
         ret = -EFAULT;
-        if (copy_from_user(filter->insns, fprog->filter, fp_size))
-                goto fail;
+        if (copy_from_user(fp, fprog->filter, fp_size))
+                goto free_prog;

         /* Check and rewrite the fprog via the skb checker */
-        ret = sk_chk_filter(filter->insns, filter->len);
+        ret = sk_chk_filter(fp, fprog->len);
         if (ret)
-                goto fail;
+                goto free_prog;

         /* Check and rewrite the fprog for seccomp use */
-        ret = seccomp_check_filter(filter->insns, filter->len);
+        ret = seccomp_check_filter(fp, fprog->len);
+        if (ret)
+                goto free_prog;
+
+        /* Convert 'sock_filter' insns to 'sock_filter_int' insns */
+        ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
+        if (ret)
+                goto free_prog;
+
+        /* Allocate a new seccomp_filter */
+        filter = kzalloc(sizeof(struct seccomp_filter) +
+                         sizeof(struct sock_filter_int) * new_len,
+                         GFP_KERNEL|__GFP_NOWARN);
+        if (!filter)
+                goto free_prog;
+
+        ret = sk_convert_filter(fp, fprog->len, filter->insnsi, &new_len);
         if (ret)
-                goto fail;
+                goto free_filter;
+
+        atomic_set(&filter->usage, 1);
+        filter->len = new_len;

         /*
          * If there is an existing filter, make it the prev and don't drop its
@@ -282,8 +276,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
        filter->prev = current->seccomp.filter;
        current->seccomp.filter = filter;
        return 0;
-fail:
+free_filter:
        kfree(filter);
+free_prog:
+        kfree(fp);
        return ret;
 }
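
The attach path in the diff above calls the converter twice: once with a NULL output buffer to learn the converted length, then again after allocating a buffer of that size. A user-space sketch of this size-then-fill pattern with goto-based cleanup follows. It is not the kernel's `sk_convert_filter()`: `toy_convert()`, `toy_attach()`, the struct names and the expansion rule (an instruction with a non-zero `jf` becomes two output instructions, with no offset remapping) are hypothetical and exist only to make the output length differ from the input length. Setting `-ENOMEM` before the second allocation and freeing the temporary buffer on success are choices made for this sketch, not statements about the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct old_insn { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct new_insn { uint8_t code; int16_t off; int32_t imm; };

struct toy_filter {
    int len;
    struct new_insn insns[];    /* flexible array, sized at allocation */
};

#define TOY_JA 0x05             /* hypothetical "jump always" opcode */

/* With out == NULL only the resulting length is computed. */
static int toy_convert(const struct old_insn *in, int len,
                       struct new_insn *out, int *new_len)
{
    int i, n = 0;

    if (!in || !new_len || len <= 0)
        return -EINVAL;

    for (i = 0; i < len; i++) {
        if (out) {
            out[n].code = (uint8_t)in[i].code;
            out[n].off = in[i].jt;
            out[n].imm = (int32_t)in[i].k;
        }
        n++;
        if (in[i].jf) {         /* toy rule: needs one extra insn */
            if (out) {
                out[n].code = TOY_JA;
                out[n].off = in[i].jf;
                out[n].imm = 0;
            }
            n++;
        }
    }
    *new_len = n;
    return 0;
}

static int toy_attach(const struct old_insn *prog, int len,
                      struct toy_filter **res)
{
    struct toy_filter *filter;
    struct old_insn *fp;
    int new_len, ret;

    if (!prog || !res || len <= 0)
        return -EINVAL;

    fp = malloc((size_t)len * sizeof(*fp));
    if (!fp)
        return -ENOMEM;
    memcpy(fp, prog, (size_t)len * sizeof(*fp)); /* stands in for copy_from_user() */

    /* Pass 1: size only. */
    ret = toy_convert(fp, len, NULL, &new_len);
    if (ret)
        goto free_prog;

    ret = -ENOMEM;
    filter = calloc(1, sizeof(*filter) +
                       (size_t)new_len * sizeof(struct new_insn));
    if (!filter)
        goto free_prog;

    /* Pass 2: fill the buffer that was sized by pass 1. */
    ret = toy_convert(fp, len, filter->insns, &new_len);
    if (ret)
        goto free_filter;

    filter->len = new_len;
    *res = filter;
    free(fp);                   /* temporary copy no longer needed */
    return 0;

free_filter:
    free(filter);
free_prog:
    free(fp);
    return ret;
}
```

The caller owns the returned filter and releases it with `free()`. The two labels unwind in reverse order of acquisition, so each failure point jumps to the label that frees exactly what has been allocated so far.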