author | Kees Cook <keescook@chromium.org> | 2017-05-13 07:51:37 -0400 |
---|---|---|
committer | Jonathan Corbet <corbet@lwn.net> | 2017-05-18 12:30:01 -0400 |
commit | c061f33f35be0ccc80f4b8e0aea5dfd2ed7e01a3 (patch) | |
tree | 591f8da7a6af2c08d53897af619f7ba369d882de | |
parent | 5e33994dca0e501336b52d8aec5327a9dec6430f (diff) |
doc: ReSTify seccomp_filter.txt
This updates seccomp_filter.txt for ReST markup, and moves it under the
user-space API index, since it describes how application authors can use
seccomp.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-rw-r--r-- | Documentation/userspace-api/index.rst | 1 | ||||
-rw-r--r-- | Documentation/userspace-api/seccomp_filter.rst (renamed from Documentation/prctl/seccomp_filter.txt) | 116 | ||||
-rw-r--r-- | MAINTAINERS | 1 |
3 files changed, 62 insertions, 56 deletions
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index a9d01b44a659..15ff12342db8 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst | |||
@@ -16,6 +16,7 @@ place where this information is gathered. | |||
16 | .. toctree:: | 16 | .. toctree:: |
17 | :maxdepth: 2 | 17 | :maxdepth: 2 |
18 | 18 | ||
19 | seccomp_filter | ||
19 | unshare | 20 | unshare |
20 | 21 | ||
21 | .. only:: subproject and html | 22 | .. only:: subproject and html |
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/userspace-api/seccomp_filter.rst index 1e469ef75778..f71eb5ef1f2d 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/userspace-api/seccomp_filter.rst | |||
@@ -1,8 +1,9 @@ | |||
1 | SECure COMPuting with filters | 1 | =========================================== |
2 | ============================= | 2 | Seccomp BPF (SECure COMPuting with filters) |
3 | =========================================== | ||
3 | 4 | ||
4 | Introduction | 5 | Introduction |
5 | ------------ | 6 | ============ |
6 | 7 | ||
7 | A large number of system calls are exposed to every userland process | 8 | A large number of system calls are exposed to every userland process |
8 | with many of them going unused for the entire lifetime of the process. | 9 | with many of them going unused for the entire lifetime of the process. |
@@ -27,7 +28,7 @@ pointers which constrains all filters to solely evaluating the system | |||
27 | call arguments directly. | 28 | call arguments directly. |
28 | 29 | ||
29 | What it isn't | 30 | What it isn't |
30 | ------------- | 31 | ============= |
31 | 32 | ||
32 | System call filtering isn't a sandbox. It provides a clearly defined | 33 | System call filtering isn't a sandbox. It provides a clearly defined |
33 | mechanism for minimizing the exposed kernel surface. It is meant to be | 34 | mechanism for minimizing the exposed kernel surface. It is meant to be |
@@ -40,13 +41,13 @@ system calls in socketcall() is allowed, for instance) which could be | |||
40 | construed, incorrectly, as a more complete sandboxing solution. | 41 | construed, incorrectly, as a more complete sandboxing solution. |
41 | 42 | ||
42 | Usage | 43 | Usage |
43 | ----- | 44 | ===== |
44 | 45 | ||
45 | An additional seccomp mode is added and is enabled using the same | 46 | An additional seccomp mode is added and is enabled using the same |
46 | prctl(2) call as the strict seccomp. If the architecture has | 47 | prctl(2) call as the strict seccomp. If the architecture has |
47 | CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: | 48 | ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below: |
48 | 49 | ||
49 | PR_SET_SECCOMP: | 50 | ``PR_SET_SECCOMP``: |
50 | Now takes an additional argument which specifies a new filter | 51 | Now takes an additional argument which specifies a new filter |
51 | using a BPF program. | 52 | using a BPF program. |
52 | The BPF program will be executed over struct seccomp_data | 53 | The BPF program will be executed over struct seccomp_data |
@@ -55,24 +56,25 @@ PR_SET_SECCOMP: | |||
55 | acceptable values to inform the kernel which action should be | 56 | acceptable values to inform the kernel which action should be |
56 | taken. | 57 | taken. |
57 | 58 | ||
58 | Usage: | 59 | Usage:: |
60 | |||
59 | prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); | 61 | prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); |
60 | 62 | ||
61 | The 'prog' argument is a pointer to a struct sock_fprog which | 63 | The 'prog' argument is a pointer to a struct sock_fprog which |
62 | will contain the filter program. If the program is invalid, the | 64 | will contain the filter program. If the program is invalid, the |
63 | call will return -1 and set errno to EINVAL. | 65 | call will return -1 and set errno to ``EINVAL``. |
64 | 66 | ||
65 | If fork/clone and execve are allowed by @prog, any child | 67 | If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child |
66 | processes will be constrained to the same filters and system | 68 | processes will be constrained to the same filters and system |
67 | call ABI as the parent. | 69 | call ABI as the parent. |
68 | 70 | ||
69 | Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or | 71 | Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or |
70 | run with CAP_SYS_ADMIN privileges in its namespace. If these are not | 72 | run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not |
71 | true, -EACCES will be returned. This requirement ensures that filter | 73 | true, ``-EACCES`` will be returned. This requirement ensures that filter |
72 | programs cannot be applied to child processes with greater privileges | 74 | programs cannot be applied to child processes with greater privileges |
73 | than the task that installed them. | 75 | than the task that installed them. |
74 | 76 | ||
75 | Additionally, if prctl(2) is allowed by the attached filter, | 77 | Additionally, if ``prctl(2)`` is allowed by the attached filter, |
76 | additional filters may be layered on which will increase evaluation | 78 | additional filters may be layered on which will increase evaluation |
77 | time, but allow for further decreasing the attack surface during | 79 | time, but allow for further decreasing the attack surface during |
78 | execution of a process. | 80 | execution of a process. |
@@ -80,51 +82,52 @@ PR_SET_SECCOMP: | |||
80 | The above call returns 0 on success and non-zero on error. | 82 | The above call returns 0 on success and non-zero on error. |
81 | 83 | ||
82 | Return values | 84 | Return values |
83 | ------------- | 85 | ============= |
86 | |||
84 | A seccomp filter may return any of the following values. If multiple | 87 | A seccomp filter may return any of the following values. If multiple |
85 | filters exist, the return value for the evaluation of a given system | 88 | filters exist, the return value for the evaluation of a given system |
86 | call will always use the highest precedent value. (For example, | 89 | call will always use the highest precedent value. (For example, |
87 | SECCOMP_RET_KILL will always take precedence.) | 90 | ``SECCOMP_RET_KILL`` will always take precedence.) |
88 | 91 | ||
89 | In precedence order, they are: | 92 | In precedence order, they are: |
90 | 93 | ||
91 | SECCOMP_RET_KILL: | 94 | ``SECCOMP_RET_KILL``: |
92 | Results in the task exiting immediately without executing the | 95 | Results in the task exiting immediately without executing the |
93 | system call. The exit status of the task (status & 0x7f) will | 96 | system call. The exit status of the task (``status & 0x7f``) will |
94 | be SIGSYS, not SIGKILL. | 97 | be ``SIGSYS``, not ``SIGKILL``. |
95 | 98 | ||
96 | SECCOMP_RET_TRAP: | 99 | ``SECCOMP_RET_TRAP``: |
97 | Results in the kernel sending a SIGSYS signal to the triggering | 100 | Results in the kernel sending a ``SIGSYS`` signal to the triggering |
98 | task without executing the system call. siginfo->si_call_addr | 101 | task without executing the system call. ``siginfo->si_call_addr`` |
99 | will show the address of the system call instruction, and | 102 | will show the address of the system call instruction, and |
100 | siginfo->si_syscall and siginfo->si_arch will indicate which | 103 | ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which |
101 | syscall was attempted. The program counter will be as though | 104 | syscall was attempted. The program counter will be as though |
102 | the syscall happened (i.e. it will not point to the syscall | 105 | the syscall happened (i.e. it will not point to the syscall |
103 | instruction). The return value register will contain an arch- | 106 | instruction). The return value register will contain an arch- |
104 | dependent value -- if resuming execution, set it to something | 107 | dependent value -- if resuming execution, set it to something |
105 | sensible. (The architecture dependency is because replacing | 108 | sensible. (The architecture dependency is because replacing |
106 | it with -ENOSYS could overwrite some useful information.) | 109 | it with ``-ENOSYS`` could overwrite some useful information.) |
107 | 110 | ||
108 | The SECCOMP_RET_DATA portion of the return value will be passed | 111 | The ``SECCOMP_RET_DATA`` portion of the return value will be passed |
109 | as si_errno. | 112 | as ``si_errno``. |
110 | 113 | ||
111 | SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. | 114 | ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``. |
112 | 115 | ||
113 | SECCOMP_RET_ERRNO: | 116 | ``SECCOMP_RET_ERRNO``: |
114 | Results in the lower 16-bits of the return value being passed | 117 | Results in the lower 16-bits of the return value being passed |
115 | to userland as the errno without executing the system call. | 118 | to userland as the errno without executing the system call. |
116 | 119 | ||
117 | SECCOMP_RET_TRACE: | 120 | ``SECCOMP_RET_TRACE``: |
118 | When returned, this value will cause the kernel to attempt to | 121 | When returned, this value will cause the kernel to attempt to |
119 | notify a ptrace()-based tracer prior to executing the system | 122 | notify a ``ptrace()``-based tracer prior to executing the system |
120 | call. If there is no tracer present, -ENOSYS is returned to | 123 | call. If there is no tracer present, ``-ENOSYS`` is returned to |
121 | userland and the system call is not executed. | 124 | userland and the system call is not executed. |
122 | 125 | ||
123 | A tracer will be notified if it requests PTRACE_O_TRACESECCOMP | 126 | A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP`` |
124 | using ptrace(PTRACE_SETOPTIONS). The tracer will be notified | 127 | using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified |
125 | of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of | 128 | of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of |
126 | the BPF program return value will be available to the tracer | 129 | the BPF program return value will be available to the tracer |
127 | via PTRACE_GETEVENTMSG. | 130 | via ``PTRACE_GETEVENTMSG``. |
128 | 131 | ||
129 | The tracer can skip the system call by changing the syscall number | 132 | The tracer can skip the system call by changing the syscall number |
130 | to -1. Alternatively, the tracer can change the system call | 133 | to -1. Alternatively, the tracer can change the system call |
@@ -138,19 +141,19 @@ SECCOMP_RET_TRACE: | |||
138 | allow use of ptrace, even of other sandboxed processes, without | 141 | allow use of ptrace, even of other sandboxed processes, without |
139 | extreme care; ptracers can use this mechanism to escape.) | 142 | extreme care; ptracers can use this mechanism to escape.) |
140 | 143 | ||
141 | SECCOMP_RET_ALLOW: | 144 | ``SECCOMP_RET_ALLOW``: |
142 | Results in the system call being executed. | 145 | Results in the system call being executed. |
143 | 146 | ||
144 | If multiple filters exist, the return value for the evaluation of a | 147 | If multiple filters exist, the return value for the evaluation of a |
145 | given system call will always use the highest precedent value. | 148 | given system call will always use the highest precedent value. |
146 | 149 | ||
147 | Precedence is only determined using the SECCOMP_RET_ACTION mask. When | 150 | Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When |
148 | multiple filters return values of the same precedence, only the | 151 | multiple filters return values of the same precedence, only the |
149 | SECCOMP_RET_DATA from the most recently installed filter will be | 152 | ``SECCOMP_RET_DATA`` from the most recently installed filter will be |
150 | returned. | 153 | returned. |
151 | 154 | ||
152 | Pitfalls | 155 | Pitfalls |
153 | -------- | 156 | ======== |
154 | 157 | ||
155 | The biggest pitfall to avoid during use is filtering on system call | 158 | The biggest pitfall to avoid during use is filtering on system call |
156 | number without checking the architecture value. Why? On any | 159 | number without checking the architecture value. Why? On any |
@@ -160,39 +163,40 @@ the numbers in the different calling conventions overlap, then checks in | |||
160 | the filters may be abused. Always check the arch value! | 163 | the filters may be abused. Always check the arch value! |
161 | 164 | ||
162 | Example | 165 | Example |
163 | ------- | 166 | ======= |
164 | 167 | ||
165 | The samples/seccomp/ directory contains both an x86-specific example | 168 | The ``samples/seccomp/`` directory contains both an x86-specific example |
166 | and a more generic example of a higher level macro interface for BPF | 169 | and a more generic example of a higher level macro interface for BPF |
167 | program generation. | 170 | program generation. |
168 | 171 | ||
169 | 172 | ||
170 | 173 | ||
171 | Adding architecture support | 174 | Adding architecture support |
172 | ----------------------- | 175 | =========================== |
173 | 176 | ||
174 | See arch/Kconfig for the authoritative requirements. In general, if an | 177 | See ``arch/Kconfig`` for the authoritative requirements. In general, if an |
175 | architecture supports both ptrace_event and seccomp, it will be able to | 178 | architecture supports both ptrace_event and seccomp, it will be able to |
176 | support seccomp filter with minor fixup: SIGSYS support and seccomp return | 179 | support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return |
177 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER | 180 | value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER`` |
178 | to its arch-specific Kconfig. | 181 | to its arch-specific Kconfig. |
179 | 182 | ||
180 | 183 | ||
181 | 184 | ||
182 | Caveats | 185 | Caveats |
183 | ------- | 186 | ======= |
184 | 187 | ||
185 | The vDSO can cause some system calls to run entirely in userspace, | 188 | The vDSO can cause some system calls to run entirely in userspace, |
186 | leading to surprises when you run programs on different machines that | 189 | leading to surprises when you run programs on different machines that |
187 | fall back to real syscalls. To minimize these surprises on x86, make | 190 | fall back to real syscalls. To minimize these surprises on x86, make |
188 | sure you test with | 191 | sure you test with |
189 | /sys/devices/system/clocksource/clocksource0/current_clocksource set to | 192 | ``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to |
190 | something like acpi_pm. | 193 | something like ``acpi_pm``. |
191 | 194 | ||
192 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are | 195 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are |
193 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: | 196 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will |
197 | honor seccomp, with a few oddities: | ||
194 | 198 | ||
195 | - A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to | 199 | - A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to |
196 | the vsyscall entry for the given call and not the address after the | 200 | the vsyscall entry for the given call and not the address after the |
197 | 'syscall' instruction. Any code which wants to restart the call | 201 | 'syscall' instruction. Any code which wants to restart the call |
198 | should be aware that (a) a ret instruction has been emulated and (b) | 202 | should be aware that (a) a ret instruction has been emulated and (b) |
@@ -200,7 +204,7 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom | |||
200 | emulation security checks, making resuming the syscall mostly | 204 | emulation security checks, making resuming the syscall mostly |
201 | pointless. | 205 | pointless. |
202 | 206 | ||
203 | - A return value of SECCOMP_RET_TRACE will signal the tracer as usual, | 207 | - A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual, |
204 | but the syscall may not be changed to another system call using the | 208 | but the syscall may not be changed to another system call using the |
205 | orig_rax register. It may only be changed to -1 in order to skip the | 209 | orig_rax register. It may only be changed to -1 in order to skip the |
206 | currently emulated call. Any other change MAY terminate the process. | 210 | currently emulated call. Any other change MAY terminate the process. |
@@ -209,14 +213,14 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom | |||
209 | rip or rsp. (Do not rely on other changes terminating the process. | 213 | rip or rsp. (Do not rely on other changes terminating the process. |
210 | They might work. For example, on some kernels, choosing a syscall | 214 | They might work. For example, on some kernels, choosing a syscall |
211 | that only exists in future kernels will be correctly emulated (by | 215 | that only exists in future kernels will be correctly emulated (by |
212 | returning -ENOSYS). | 216 | returning ``-ENOSYS``). |
213 | 217 | ||
214 | To detect this quirky behavior, check for addr & ~0x0C00 == | 218 | To detect this quirky behavior, check for ``addr & ~0x0C00 == |
215 | 0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For | 219 | 0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For |
216 | SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other | 220 | ``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other |
217 | condition: future kernels may improve vsyscall emulation and current | 221 | condition: future kernels may improve vsyscall emulation and current |
218 | kernels in vsyscall=native mode will behave differently, but the | 222 | kernels in vsyscall=native mode will behave differently, but the |
219 | instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these | 223 | instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these |
220 | cases. | 224 | cases. |
221 | 225 | ||
222 | Note that modern systems are unlikely to use vsyscalls at all -- they | 226 | Note that modern systems are unlikely to use vsyscalls at all -- they |
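The vsyscall detection rule quoted in the Caveats section, ``addr & ~0x0C00 == 0xFFFFFFFFFF600000``, needs explicit parentheses if it is copied into C, because ``==`` binds more tightly than ``&`` there. A minimal sketch of the check follows. The function and macro names are illustrative and do not come from the kernel sources; ``addr`` would be ``rip`` for ``SECCOMP_RET_TRACE`` or ``siginfo->si_call_addr`` for ``SECCOMP_RET_TRAP``, as the document states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants taken from the documented check; the names are hypothetical. */
#define VSYSCALL_BASE      0xFFFFFFFFFF600000ULL
#define VSYSCALL_SLOT_MASK 0x0C00ULL

/*
 * Returns true when addr is one of the four vsyscall entry addresses
 * 0xF...F600{0,4,8,C}00. The parentheses around the masking step are
 * required: without them C would evaluate the comparison first.
 */
bool is_vsyscall_addr(uint64_t addr)
{
    return (addr & ~VSYSCALL_SLOT_MASK) == VSYSCALL_BASE;
}
```

A tracer or ``SIGSYS`` handler could call this helper before deciding whether the quirks listed in the Caveats section apply.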
diff --git a/MAINTAINERS b/MAINTAINERS index f7d568b8f133..752916d1461c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS | |||
@@ -11492,6 +11492,7 @@ F: kernel/seccomp.c | |||
11492 | F: include/uapi/linux/seccomp.h | 11492 | F: include/uapi/linux/seccomp.h |
11493 | F: include/linux/seccomp.h | 11493 | F: include/linux/seccomp.h |
11494 | F: tools/testing/selftests/seccomp/* | 11494 | F: tools/testing/selftests/seccomp/* |
11495 | F: Documentation/userspace-api/seccomp_filter.rst | ||
11495 | K: \bsecure_computing | 11496 | K: \bsecure_computing |
11496 | K: \bTIF_SECCOMP\b | 11497 | K: \bTIF_SECCOMP\b |
11497 | 11498 | ||
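As a rough companion to the Usage and Pitfalls sections of the converted document, the sketch below builds a small classic-BPF filter that checks the architecture value before looking at the syscall number, then installs it with ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` followed by ``prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog)``. It is not the code from ``samples/seccomp/``. The policy (fail ``getpid`` with ``EPERM``, allow everything else) is a made-up illustration, the names ``example_filter`` and ``install_example_filter`` are hypothetical, and the architecture constant assumes an x86-64 target.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Hypothetical policy, assuming x86-64: deny getpid with EPERM. */
struct sock_filter example_filter[] = {
    /* Load the architecture value from struct seccomp_data. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* If it is the expected architecture, skip over the kill. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* Only now is it safe to interpret the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* getpid falls through to the errno return; others jump to allow. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns 0 on success, -1 with errno set on failure. */
int install_example_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(example_filter) /
                                sizeof(example_filter[0])),
        .filter = example_filter,
    };

    /* Required unless the task has CAP_SYS_ADMIN in its namespace. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once installed, the filter cannot be removed and is inherited by children, so a sketch like this is best tried in a throwaway child process.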