author | Will Drewry <wad@chromium.org> | 2012-04-12 17:48:04 -0400
committer | James Morris <james.l.morris@oracle.com> | 2012-04-13 21:13:22 -0400
commit | 8ac270d1e29f0428228ab2b9a8ae5e1ed4a5cd84
tree | 6deba4ed83da9ace758004b29d15aa0d2ec875a7 /Documentation/prctl
parent | c6cfbeb4029610c8c330c312dcf4d514cc067554
Documentation: prctl/seccomp_filter
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 and a semi-generic
example using a macro-based code generator.
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
v18: - added acked by
- update no new privs numbers
v17: - remove @compat note and add Pitfalls section for arch checking
(keescook@chromium.org)
v16: -
v15: -
v14: - rebase/nochanges
v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
v12: - comment on the ptrace_event use
- update arch support comment
- note the behavior of SECCOMP_RET_DATA when there are multiple filters
(keescook@chromium.org)
- lots of samples/ clean up incl 64-bit bpf-direct support
(markus@chromium.org)
- rebase to linux-next
v11: - overhaul return value language, updates (keescook@chromium.org)
- comment on do_exit(SIGSYS)
v10: - update for SIGSYS
- update for new seccomp_data layout
- update for ptrace option use
v9: - updated bpf-direct.c for SIGILL
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check (coreyb@linux.vnet.ibm.com)
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value (eparis@redhat.com)
- language cleanup (rdunlap@xenotime.net)
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples (corbet@lwn.net)
Signed-off-by: James Morris <james.l.morris@oracle.com>
Diffstat (limited to 'Documentation/prctl')
-rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 163 |
1 files changed, 163 insertions, 0 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 000000000000..597c3c581375
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,163 @@
1 | SECure COMPuting with filters | ||
2 | ============================= | ||
3 | |||
4 | Introduction | ||
5 | ------------ | ||
6 | |||
7 | A large number of system calls are exposed to every userland process | ||
8 | with many of them going unused for the entire lifetime of the process. | ||
9 | As system calls change and mature, bugs are found and eradicated. A | ||
10 | certain subset of userland applications benefits from a reduced set | ||
11 | of available system calls. The resulting set reduces the total kernel | ||
12 | surface exposed to the application. System call filtering is meant for | ||
13 | use with those applications. | ||
14 | |||
15 | Seccomp filtering provides a means for a process to specify a filter for | ||
16 | incoming system calls. The filter is expressed as a Berkeley Packet | ||
17 | Filter (BPF) program, as with socket filters, except that the data | ||
18 | operated on is related to the system call being made: system call | ||
19 | number and the system call arguments. This allows for expressive | ||
20 | filtering of system calls using a filter program language with a long | ||
21 | history of being exposed to userland and a straightforward data set. | ||
22 | |||
23 | Additionally, BPF makes it impossible for users of seccomp to fall prey | ||
24 | to time-of-check-time-of-use (TOCTOU) attacks that are common in system | ||
25 | call interposition frameworks. BPF programs may not dereference | ||
26 | pointers, which constrains all filters to solely evaluating the system | ||
27 | call arguments directly. | ||
28 | |||
29 | What it isn't | ||
30 | ------------- | ||
31 | |||
32 | System call filtering isn't a sandbox. It provides a clearly defined | ||
33 | mechanism for minimizing the exposed kernel surface. It is meant to be | ||
34 | a tool for sandbox developers to use. Beyond that, policy for logical | ||
35 | behavior and information flow should be managed with a combination of | ||
36 | other system hardening techniques and, potentially, an LSM of your | ||
37 | choosing. Expressive, dynamic filters provide further options down this | ||
38 | path (avoiding pathological sizes or selecting which of the multiplexed | ||
39 | system calls in socketcall() is allowed, for instance) which could be | ||
40 | construed, incorrectly, as a more complete sandboxing solution. | ||
41 | |||
42 | Usage | ||
43 | ----- | ||
44 | |||
45 | An additional seccomp mode is added and is enabled using the same | ||
46 | prctl(2) call as the strict seccomp. If the architecture has | ||
47 | CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: | ||
48 | |||
49 | PR_SET_SECCOMP: | ||
50 | Now takes an additional argument which specifies a new filter | ||
51 | using a BPF program. | ||
52 | The BPF program will be executed over struct seccomp_data | ||
53 | reflecting the system call number, arguments, and other | ||
54 | metadata. The BPF program must then return one of the | ||
55 | acceptable values to inform the kernel which action should be | ||
56 | taken. | ||
57 | |||
58 | Usage: | ||
59 | prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); | ||
60 | |||
61 | The 'prog' argument is a pointer to a struct sock_fprog which | ||
62 | will contain the filter program. If the program is invalid, the | ||
63 | call will return -1 and set errno to EINVAL. | ||
64 | |||
65 | If fork/clone and execve are allowed by 'prog', any child | ||
66 | processes will be constrained to the same filters and system | ||
67 | call ABI as the parent. | ||
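As a rough illustration of the struct seccomp_data that the filter program reads, the sketch below computes the byte offsets a BPF load instruction would use. It assumes the field layout exported by <linux/seccomp.h> (nr, arch, instruction_pointer, args[6]). The OFF_* names and the helper seccomp_arg_offset() are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Offsets a filter could pass to a BPF_LD | BPF_W | BPF_ABS load. */
enum {
	OFF_NR   = offsetof(struct seccomp_data, nr),
	OFF_ARCH = offsetof(struct seccomp_data, arch),
	OFF_IP   = offsetof(struct seccomp_data, instruction_pointer),
};

/*
 * Hypothetical helper: byte offset of the 32-bit word stored at the
 * lowest address of args[n]. BPF loads are 32 bits wide, so a 64-bit
 * argument is read as two words; which word holds the low half
 * depends on the byte order of the architecture.
 * Returns (size_t)-1 when n is out of range.
 */
static size_t seccomp_arg_offset(unsigned int n)
{
	if (n >= 6)
		return (size_t)-1;
	return offsetof(struct seccomp_data, args) + n * sizeof(uint64_t);
}
```

Because a filter can only load these fixed fields, a pointer argument is visible to it as a number and never as the memory it points to.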
68 | |||
69 | Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or | ||
70 | run with CAP_SYS_ADMIN privileges in its namespace. If neither is | ||
71 | true, the call will return -1 and set errno to EACCES. This requirement | ||
72 | ensures that filter programs cannot be applied to child processes with | ||
73 | greater privileges than the task that installed them. | ||
74 | |||
75 | Additionally, if prctl(2) is allowed by the attached filter, | ||
76 | additional filters may be layered on which will increase evaluation | ||
77 | time, but allow for further decreasing the attack surface during | ||
78 | execution of a process. | ||
79 | |||
80 | The PR_SET_SECCOMP call above returns 0 on success and -1 on error. | ||
81 | |||
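The calls described in this section can be sketched as a small wrapper. This is a minimal example and not the code shipped in samples/seccomp/; the function name install_seccomp_filter() is an assumption made for illustration, and the task is assumed to lack CAP_SYS_ADMIN, so PR_SET_NO_NEW_PRIVS is always requested first.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Sketch: attach a BPF filter to the calling task.
 * Returns 0 on success, or -1 with errno set by prctl(2).
 */
static int install_seccomp_filter(struct sock_filter *insns,
				  unsigned short len)
{
	struct sock_fprog prog = {
		.len = len,
		.filter = insns,
	};

	/* Required unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;

	/* An invalid program (for example, len == 0) fails with EINVAL. */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
		return -1;

	return 0;
}
```

Once a filter is attached it cannot be removed, so a caller would normally build and check the whole program before invoking the wrapper.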
82 | Return values | ||
83 | ------------- | ||
84 | A seccomp filter may return any of the following values. If multiple | ||
85 | filters exist, the return value for the evaluation of a given system | ||
86 | call will always use the highest-precedence value. (For example, | ||
87 | SECCOMP_RET_KILL will always take precedence.) | ||
88 | |||
89 | In precedence order, they are: | ||
90 | |||
91 | SECCOMP_RET_KILL: | ||
92 | Results in the task exiting immediately without executing the | ||
93 | system call. The exit status of the task (status & 0x7f) will | ||
94 | be SIGSYS, not SIGKILL. | ||
95 | |||
96 | SECCOMP_RET_TRAP: | ||
97 | Results in the kernel sending a SIGSYS signal to the triggering | ||
98 | task without executing the system call. The kernel will | ||
99 | roll back the register state to just before the system call | ||
100 | entry such that a signal handler in the task will be able to | ||
101 | inspect the ucontext_t->uc_mcontext registers and emulate | ||
102 | system call success or failure upon return from the signal | ||
103 | handler. | ||
104 | |||
105 | The SECCOMP_RET_DATA portion of the return value will be passed | ||
106 | as si_errno. | ||
107 | |||
108 | SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. | ||
109 | |||
110 | SECCOMP_RET_ERRNO: | ||
111 | Results in the lower 16 bits of the return value being passed | ||
112 | to userland as the errno without executing the system call. | ||
113 | |||
114 | SECCOMP_RET_TRACE: | ||
115 | When returned, this value will cause the kernel to attempt to | ||
116 | notify a ptrace()-based tracer prior to executing the system | ||
117 | call. If there is no tracer present, -ENOSYS is returned to | ||
118 | userland and the system call is not executed. | ||
119 | |||
120 | A tracer will be notified if it requests PTRACE_O_TRACESECCOMP | ||
121 | using ptrace(PTRACE_SETOPTIONS). The tracer will be notified | ||
122 | of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of | ||
123 | the BPF program return value will be available to the tracer | ||
124 | via PTRACE_GETEVENTMSG. | ||
125 | |||
126 | SECCOMP_RET_ALLOW: | ||
127 | Results in the system call being executed. | ||
128 | |||
129 | If multiple filters exist, the return value for the evaluation of a | ||
130 | given system call will always use the highest-precedence value. | ||
131 | |||
132 | Precedence is only determined using the SECCOMP_RET_ACTION mask. When | ||
133 | multiple filters return values of the same precedence, only the | ||
134 | SECCOMP_RET_DATA from the most recently installed filter will be | ||
135 | returned. | ||
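The precedence rule can be modelled in a few lines of userland C. This is a model for illustration only, not the kernel's implementation. It assumes the constant values from <linux/seccomp.h>, where a numerically lower action value means higher precedence (SECCOMP_RET_KILL is the lowest and SECCOMP_RET_ALLOW the highest), and it assumes the per-filter results are listed with the most recently installed filter first. The name seccomp_model_combine() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Model only: pick the effective return value from the results of
 * several filters. results[0] belongs to the newest filter.
 */
static uint32_t seccomp_model_combine(const uint32_t *results, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++) {
		/*
		 * Only the action bits decide precedence. A strictly
		 * lower action replaces the current value; on a tie the
		 * earlier entry (the newer filter) is kept, so its
		 * SECCOMP_RET_DATA is the one reported.
		 */
		if ((results[i] & SECCOMP_RET_ACTION) <
		    (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

For example, if the newest filter returns SECCOMP_RET_ERRNO with data 13 and an older one returns SECCOMP_RET_ERRNO with data 1, the model yields data 13.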
136 | |||
137 | Pitfalls | ||
138 | -------- | ||
139 | |||
140 | The biggest pitfall to avoid during use is filtering on system call | ||
141 | number without checking the architecture value. Why? On any | ||
142 | architecture that supports multiple system call invocation conventions, | ||
143 | the system call numbers may vary based on the specific invocation. If | ||
144 | the numbers in the different calling conventions overlap, then checks in | ||
145 | the filters may be abused. Always check the arch value! | ||
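A filter that follows this advice might begin with an architecture check before it looks at the system call number. The sketch below builds such a program with the BPF_STMT and BPF_JUMP macros from <linux/filter.h>. The function name build_arch_checked_filter(), the choice to kill on an unexpected architecture, and the single denied system call answered with EPERM are all assumptions made for the example. A caller would typically pass one of the AUDIT_ARCH_* constants from <linux/audit.h> as expected_arch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Sketch: write a 7-instruction filter into out[] and return the
 * instruction count, or 0 if out[] (capacity cap) is too small.
 */
static size_t build_arch_checked_filter(struct sock_filter *out, size_t cap,
					uint32_t expected_arch,
					uint32_t denied_nr)
{
	const struct sock_filter prog[] = {
		/* [0] A = seccomp_data.arch */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		/* [1] arch matches: skip [2]; otherwise fall through */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0),
		/* [2] unexpected calling convention: kill the task */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		/* [3] A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* [4] nr == denied_nr: go to [5]; otherwise skip to [6] */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, denied_nr, 0, 1),
		/* [5] fail the call; the low 16 bits carry the errno */
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		/* [6] everything else is allowed */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	size_t n = sizeof(prog) / sizeof(prog[0]);

	if (cap < n)
		return 0;
	memcpy(out, prog, sizeof(prog));
	return n;
}
```

Checking the arch field first means the later comparison on nr is only ever evaluated against the system call table the filter author had in mind.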
146 | |||
147 | Example | ||
148 | ------- | ||
149 | |||
150 | The samples/seccomp/ directory contains both an x86-specific example | ||
151 | and a more generic example of a higher level macro interface for BPF | ||
152 | program generation. | ||
153 | |||
154 | |||
155 | |||
156 | Adding architecture support | ||
157 | --------------------------- | ||
158 | |||
159 | See arch/Kconfig for the authoritative requirements. In general, if an | ||
160 | architecture supports both ptrace_event and seccomp, it will be able to | ||
161 | support seccomp filter with minor fixups: SIGSYS support and seccomp return | ||
162 | value checking. It then only needs to add CONFIG_HAVE_ARCH_SECCOMP_FILTER | ||
163 | to its arch-specific Kconfig. | ||