author | Kees Cook <keescook@chromium.org> | 2017-05-13 07:51:37 -0400 |
---|---|---|
committer | Jonathan Corbet <corbet@lwn.net> | 2017-05-18 12:30:01 -0400 |
commit | c061f33f35be0ccc80f4b8e0aea5dfd2ed7e01a3 (patch) | |
tree | 591f8da7a6af2c08d53897af619f7ba369d882de /Documentation/prctl | |
parent | 5e33994dca0e501336b52d8aec5327a9dec6430f (diff) |
doc: ReSTify seccomp_filter.txt
This updates seccomp_filter.txt for ReST markup, and moves it under the
user-space API index, since it describes how application authors can use
seccomp.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/prctl')
-rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 225 |
1 files changed, 0 insertions, 225 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt deleted file mode 100644 index 1e469ef75778..000000000000 --- a/Documentation/prctl/seccomp_filter.txt +++ /dev/null | |||
@@ -1,225 +0,0 @@ | |||
1 | SECure COMPuting with filters | ||
2 | ============================= | ||
3 | |||
4 | Introduction | ||
5 | ------------ | ||
6 | |||
7 | A large number of system calls are exposed to every userland process | ||
8 | with many of them going unused for the entire lifetime of the process. | ||
9 | As system calls change and mature, bugs are found and eradicated. A | ||
10 | certain subset of userland applications benefit from having a reduced set | ||
11 | of available system calls. The resulting set reduces the total kernel | ||
12 | surface exposed to the application. System call filtering is meant for | ||
13 | use with those applications. | ||
14 | |||
15 | Seccomp filtering provides a means for a process to specify a filter for | ||
16 | incoming system calls. The filter is expressed as a Berkeley Packet | ||
17 | Filter (BPF) program, as with socket filters, except that the data | ||
18 | operated on is related to the system call being made: system call | ||
19 | number and the system call arguments. This allows for expressive | ||
20 | filtering of system calls using a filter program language with a long | ||
21 | history of being exposed to userland and a straightforward data set. | ||
22 | |||
23 | Additionally, BPF makes it impossible for users of seccomp to fall prey | ||
24 | to time-of-check-time-of-use (TOCTOU) attacks that are common in system | ||
25 | call interposition frameworks. BPF programs may not dereference | ||
26 | pointers, which constrains all filters to solely evaluating the system | ||
27 | call arguments directly. | ||
28 | |||
29 | What it isn't | ||
30 | ------------- | ||
31 | |||
32 | System call filtering isn't a sandbox. It provides a clearly defined | ||
33 | mechanism for minimizing the exposed kernel surface. It is meant to be | ||
34 | a tool for sandbox developers to use. Beyond that, policy for logical | ||
35 | behavior and information flow should be managed with a combination of | ||
36 | other system hardening techniques and, potentially, an LSM of your | ||
37 | choosing. Expressive, dynamic filters provide further options down this | ||
38 | path (avoiding pathological sizes or selecting which of the multiplexed | ||
39 | system calls in socketcall() is allowed, for instance) which could be | ||
40 | construed, incorrectly, as a more complete sandboxing solution. | ||
41 | |||
42 | Usage | ||
43 | ----- | ||
44 | |||
45 | An additional seccomp mode is added and is enabled using the same | ||
46 | prctl(2) call as the strict seccomp. If the architecture has | ||
47 | CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: | ||
48 | |||
49 | PR_SET_SECCOMP: | ||
50 | Now takes an additional argument which specifies a new filter | ||
51 | using a BPF program. | ||
52 | The BPF program will be executed over struct seccomp_data | ||
53 | reflecting the system call number, arguments, and other | ||
54 | metadata. The BPF program must then return one of the | ||
55 | acceptable values to inform the kernel which action should be | ||
56 | taken. | ||
57 | |||
58 | Usage: | ||
59 | prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); | ||
60 | |||
61 | The 'prog' argument is a pointer to a struct sock_fprog which | ||
62 | will contain the filter program. If the program is invalid, the | ||
63 | call will return -1 and set errno to EINVAL. | ||
64 | |||
65 | If fork/clone and execve are allowed by @prog, any child | ||
66 | processes will be constrained to the same filters and system | ||
67 | call ABI as the parent. | ||
68 | |||
69 | Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or | ||
70 | run with CAP_SYS_ADMIN privileges in its namespace. If these are not | ||
71 | true, -EACCES will be returned. This requirement ensures that filter | ||
72 | programs cannot be applied to child processes with greater privileges | ||
73 | than the task that installed them. | ||
74 | |||
75 | Additionally, if prctl(2) is allowed by the attached filter, | ||
76 | additional filters may be layered on which will increase evaluation | ||
77 | time, but allow for further decreasing the attack surface during | ||
78 | execution of a process. | ||
79 | |||
80 | The above call returns 0 on success and non-zero on error. | ||
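The call sequence above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code from the kernel tree: the names allow_all, allow_all_prog and install_filter() are hypothetical, and the one-instruction program that allows every system call is only a placeholder for a real policy.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, BPF_STMT */
#include <linux/seccomp.h>  /* SECCOMP_MODE_FILTER, SECCOMP_RET_ALLOW */
#include <sys/prctl.h>      /* prctl(), PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP */

/* Placeholder policy (hypothetical): one instruction that allows
 * every system call. */
static struct sock_filter allow_all[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog allow_all_prog = {
	.len = sizeof(allow_all) / sizeof(allow_all[0]),
	.filter = allow_all,
};

/* Returns 0 on success, -1 with errno set on failure. */
int install_filter(const struct sock_fprog *prog)
{
	assert(prog != NULL && prog->len > 0);

	/* Needed unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	/* An invalid program makes this call fail with EINVAL. */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog) != 0)
		return -1;
	return 0;
}
```

Calling install_filter(&allow_all_prog) early in main() would attach the placeholder. A real filter would replace the single return instruction with checks on struct seccomp_data.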
81 | |||
82 | Return values | ||
83 | ------------- | ||
84 | A seccomp filter may return any of the following values. If multiple | ||
85 | filters exist, the return value for the evaluation of a given system | ||
86 | call will always use the highest-precedence value. (For example, | ||
87 | SECCOMP_RET_KILL will always take precedence.) | ||
88 | |||
89 | In precedence order, they are: | ||
90 | |||
91 | SECCOMP_RET_KILL: | ||
92 | Results in the task exiting immediately without executing the | ||
93 | system call. The exit status of the task (status & 0x7f) will | ||
94 | be SIGSYS, not SIGKILL. | ||
95 | |||
96 | SECCOMP_RET_TRAP: | ||
97 | Results in the kernel sending a SIGSYS signal to the triggering | ||
98 | task without executing the system call. siginfo->si_call_addr | ||
99 | will show the address of the system call instruction, and | ||
100 | siginfo->si_syscall and siginfo->si_arch will indicate which | ||
101 | syscall was attempted. The program counter will be as though | ||
102 | the syscall happened (i.e. it will not point to the syscall | ||
103 | instruction). The return value register will contain an arch- | ||
104 | dependent value -- if resuming execution, set it to something | ||
105 | sensible. (The architecture dependency is because replacing | ||
106 | it with -ENOSYS could overwrite some useful information.) | ||
107 | |||
108 | The SECCOMP_RET_DATA portion of the return value will be passed | ||
109 | as si_errno. | ||
110 | |||
111 | SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. | ||
112 | |||
113 | SECCOMP_RET_ERRNO: | ||
114 | Results in the lower 16 bits of the return value being passed | ||
115 | to userland as the errno without executing the system call. | ||
116 | |||
117 | SECCOMP_RET_TRACE: | ||
118 | When returned, this value will cause the kernel to attempt to | ||
119 | notify a ptrace()-based tracer prior to executing the system | ||
120 | call. If there is no tracer present, -ENOSYS is returned to | ||
121 | userland and the system call is not executed. | ||
122 | |||
123 | A tracer will be notified if it requests PTRACE_O_TRACESECCOMP | ||
124 | using ptrace(PTRACE_SETOPTIONS). The tracer will be notified | ||
125 | of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of | ||
126 | the BPF program return value will be available to the tracer | ||
127 | via PTRACE_GETEVENTMSG. | ||
128 | |||
129 | The tracer can skip the system call by changing the syscall number | ||
130 | to -1. Alternatively, the tracer can change the system call | ||
131 | requested by changing the system call to a valid syscall number. If | ||
132 | the tracer asks to skip the system call, then the system call will | ||
133 | appear to return the value that the tracer puts in the return value | ||
134 | register. | ||
135 | |||
136 | The seccomp check will not be run again after the tracer is | ||
137 | notified. (This means that seccomp-based sandboxes MUST NOT | ||
138 | allow use of ptrace, even of other sandboxed processes, without | ||
139 | extreme care; ptracers can use this mechanism to escape.) | ||
140 | |||
141 | SECCOMP_RET_ALLOW: | ||
142 | Results in the system call being executed. | ||
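
Two of the actions above hand control to user-space code: SECCOMP_RET_TRAP to a SIGSYS handler in the filtered task, and SECCOMP_RET_TRACE to a ptrace()-based tracer. The following C fragment is a rough sketch of both receiving ends. All function and variable names are hypothetical, the SYS_SECCOMP fallback definition is an assumption for libc headers that do not provide it, and error handling is left to the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for sigaction() and siginfo_t under strict -std= modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1	/* fallback for libc headers that lack it */
#endif

/* ---- SECCOMP_RET_TRAP: runs inside the filtered task ---- */

/* Hypothetical globals recording the most recent trapped call. */
volatile sig_atomic_t trapped_syscall = -1;
volatile sig_atomic_t trapped_data = -1;

void record_sigsys(int signo, siginfo_t *info, void *context)
{
	(void)context;
	if (signo != SIGSYS || info->si_code != SYS_SECCOMP)
		return;	/* SIGSYS that did not come from seccomp */
	trapped_syscall = info->si_syscall;
	trapped_data = info->si_errno;	/* SECCOMP_RET_DATA from the filter */
}

int install_sigsys_handler(void (*handler)(int, siginfo_t *, void *))
{
	struct sigaction sa;

	assert(handler != NULL);
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSYS, &sa, NULL);
}

/* ---- SECCOMP_RET_TRACE: runs inside the tracer ---- */

/* Ask for seccomp events on an attached, stopped tracee. */
long enable_seccomp_events(pid_t pid)
{
	return ptrace(PTRACE_SETOPTIONS, pid, NULL,
		      (void *)(unsigned long)PTRACE_O_TRACESECCOMP);
}

/* True if a waitpid() status reports a PTRACE_EVENT_SECCOMP stop. */
int is_seccomp_stop(int status)
{
	return WIFSTOPPED(status) &&
	       (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}

/* Fetch the SECCOMP_RET_DATA value after such a stop. */
long seccomp_event_data(pid_t pid, unsigned long *data)
{
	return ptrace(PTRACE_GETEVENTMSG, pid, NULL, data);
}
```

The handler only records values in sig_atomic_t variables, which keeps it async-signal-safe. Anything more elaborate, such as emulating the call, would need the architecture-specific register context.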
143 | |||
144 | If multiple filters exist, the return value for the evaluation of a | ||
145 | given system call will always use the highest-precedence value. | ||
146 | |||
147 | Precedence is only determined using the SECCOMP_RET_ACTION mask. When | ||
148 | multiple filters return values of the same precedence, only the | ||
149 | SECCOMP_RET_DATA from the most recently installed filter will be | ||
150 | returned. | ||
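
The precedence rule can be modelled in user space as below. This is a sketch that covers only the five actions listed in this document. combine_filter_results() is a hypothetical helper, not a kernel interface, and it assumes that results[0] comes from the most recently installed filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>  /* SECCOMP_RET_* constants and masks */

/*
 * results[0] is the value returned by the most recently installed
 * filter, results[n - 1] the value from the oldest one.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	assert(results != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/*
		 * Only the action bits are compared. A numerically lower
		 * action has higher precedence, and the strict comparison
		 * means a tie keeps the newer filter's value, data included.
		 */
		if ((results[i] & SECCOMP_RET_ACTION) <
		    (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

For instance, if the newest filter returns SECCOMP_RET_ERRNO with data 1 and an older one returns SECCOMP_RET_ERRNO with data 2, the combined result carries data 1.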
151 | |||
152 | Pitfalls | ||
153 | -------- | ||
154 | |||
155 | The biggest pitfall to avoid during use is filtering on system call | ||
156 | number without checking the architecture value. Why? On any | ||
157 | architecture that supports multiple system call invocation conventions, | ||
158 | the system call numbers may vary based on the specific invocation. If | ||
159 | the numbers in the different calling conventions overlap, then checks in | ||
160 | the filters may be abused. Always check the arch value! | ||
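
A filter prologue that follows this advice might look like the sketch below. It assumes, purely for illustration, a policy that targets x86-64 only. EXPECTED_ARCH, arch_checked_filter and arch_checked_prog() are hypothetical names, and the per-syscall checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

/* Assumption for illustration: the policy targets x86-64 only. */
#define EXPECTED_ARCH AUDIT_ARCH_X86_64

struct sock_filter arch_checked_filter[] = {
	/* Load seccomp_data.arch; skip the kill only if it matches. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EXPECTED_ARCH, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* Only now is it meaningful to interpret seccomp_data.nr. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* Per-syscall checks would go here. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Fill in a struct sock_fprog describing the filter above. */
void arch_checked_prog(struct sock_fprog *prog)
{
	assert(prog != NULL);
	prog->len = sizeof(arch_checked_filter) / sizeof(arch_checked_filter[0]);
	prog->filter = arch_checked_filter;
}
```

Placing the architecture test first means that every later comparison on the system call number is made against the numbering of a single, known calling convention.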
161 | |||
162 | Example | ||
163 | ------- | ||
164 | |||
165 | The samples/seccomp/ directory contains both an x86-specific example | ||
166 | and a more generic example of a higher level macro interface for BPF | ||
167 | program generation. | ||
168 | |||
169 | |||
170 | |||
171 | Adding architecture support | ||
172 | --------------------------- | ||
173 | |||
174 | See arch/Kconfig for the authoritative requirements. In general, if an | ||
175 | architecture supports both ptrace_event and seccomp, it will be able to | ||
176 | support seccomp filter with minor fixup: SIGSYS support and seccomp return | ||
177 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER | ||
178 | to its arch-specific Kconfig. | ||
179 | |||
180 | |||
181 | |||
182 | Caveats | ||
183 | ------- | ||
184 | |||
185 | The vDSO can cause some system calls to run entirely in userspace, | ||
186 | leading to surprises when you run programs on different machines that | ||
187 | fall back to real syscalls. To minimize these surprises on x86, make | ||
188 | sure you test with | ||
189 | /sys/devices/system/clocksource/clocksource0/current_clocksource set to | ||
190 | something like acpi_pm. | ||
191 | |||
192 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are | ||
193 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: | ||
194 | |||
195 | - A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to | ||
196 | the vsyscall entry for the given call and not the address after the | ||
197 | 'syscall' instruction. Any code which wants to restart the call | ||
198 | should be aware that (a) a ret instruction has been emulated and (b) | ||
199 | trying to resume the syscall will again trigger the standard vsyscall | ||
200 | emulation security checks, making resuming the syscall mostly | ||
201 | pointless. | ||
202 | |||
203 | - A return value of SECCOMP_RET_TRACE will signal the tracer as usual, | ||
204 | but the syscall may not be changed to another system call using the | ||
205 | orig_rax register. It may only be changed to -1 in order to skip the | ||
206 | currently emulated call. Any other change MAY terminate the process. | ||
207 | The rip value seen by the tracer will be the syscall entry address; | ||
208 | this is different from normal behavior. The tracer MUST NOT modify | ||
209 | rip or rsp. (Do not rely on other changes terminating the process. | ||
210 | They might work. For example, on some kernels, choosing a syscall | ||
211 | that only exists in future kernels will be correctly emulated (by | ||
212 | returning -ENOSYS).) | ||
213 | |||
214 | To detect this quirky behavior, check for (addr & ~0x0C00) == | ||
215 | 0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For | ||
216 | SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other | ||
217 | condition: future kernels may improve vsyscall emulation and current | ||
218 | kernels in vsyscall=native mode will behave differently, but the | ||
219 | instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these | ||
220 | cases. | ||
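
The address test can be written in C as in the sketch below. The helper names and the VSYSCALL_BASE macro are hypothetical. Note that the comparison needs explicit parentheses, because == binds more tightly than & in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VSYSCALL_BASE UINT64_C(0xFFFFFFFFFF600000)

/* addr is rip for SECCOMP_RET_TRACE, si_call_addr for SECCOMP_RET_TRAP. */
bool is_vsyscall_addr(uint64_t addr)
{
	/* Mask out bits 10 and 11, which select one of the four entries. */
	return (addr & ~UINT64_C(0x0C00)) == VSYSCALL_BASE;
}

/* Which of the four entries (0..3) the address refers to. */
unsigned int vsyscall_slot(uint64_t addr)
{
	assert(is_vsyscall_addr(addr));
	return (unsigned int)((addr >> 10) & 3);
}
```

A tracer or SIGSYS handler could call is_vsyscall_addr() first and fall back to its normal handling when the result is false.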
221 | |||
222 | Note that modern systems are unlikely to use vsyscalls at all -- they | ||
223 | are a legacy feature and they are considerably slower than standard | ||
224 | syscalls. New code will use the vDSO, and vDSO-issued system calls | ||
225 | are indistinguishable from normal system calls. | ||