author: Kees Cook <keescook@chromium.org> 2017-05-13 07:51:37 -0400
committer: Jonathan Corbet <corbet@lwn.net> 2017-05-18 12:30:01 -0400
commit: c061f33f35be0ccc80f4b8e0aea5dfd2ed7e01a3 (patch)
tree: 591f8da7a6af2c08d53897af619f7ba369d882de
parent: 5e33994dca0e501336b52d8aec5327a9dec6430f (diff)
doc: ReSTify seccomp_filter.txt
This updates seccomp_filter.txt for ReST markup, and moves it under the user-space API index, since it describes how application authors can use seccomp.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-rw-r--r-- Documentation/userspace-api/index.rst 1
-rw-r--r-- Documentation/userspace-api/seccomp_filter.rst (renamed from Documentation/prctl/seccomp_filter.txt) 116
-rw-r--r-- MAINTAINERS 1
3 files changed, 62 insertions, 56 deletions
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a9d01b44a659..15ff12342db8 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -16,6 +16,7 @@ place where this information is gathered.
16.. toctree:: 16.. toctree::
17 :maxdepth: 2 17 :maxdepth: 2
18 18
19 seccomp_filter
19 unshare 20 unshare
20 21
21.. only:: subproject and html 22.. only:: subproject and html
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/userspace-api/seccomp_filter.rst
index 1e469ef75778..f71eb5ef1f2d 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -1,8 +1,9 @@
1 SECure COMPuting with filters 1===========================================
2 ============================= 2Seccomp BPF (SECure COMPuting with filters)
3===========================================
3 4
4Introduction 5Introduction
5------------ 6============
6 7
7A large number of system calls are exposed to every userland process 8A large number of system calls are exposed to every userland process
8with many of them going unused for the entire lifetime of the process. 9with many of them going unused for the entire lifetime of the process.
@@ -27,7 +28,7 @@ pointers which constrains all filters to solely evaluating the system
27call arguments directly. 28call arguments directly.
28 29
29What it isn't 30What it isn't
30------------- 31=============
31 32
32System call filtering isn't a sandbox. It provides a clearly defined 33System call filtering isn't a sandbox. It provides a clearly defined
33mechanism for minimizing the exposed kernel surface. It is meant to be 34mechanism for minimizing the exposed kernel surface. It is meant to be
@@ -40,13 +41,13 @@ system calls in socketcall() is allowed, for instance) which could be
40construed, incorrectly, as a more complete sandboxing solution. 41construed, incorrectly, as a more complete sandboxing solution.
41 42
42Usage 43Usage
43----- 44=====
44 45
45An additional seccomp mode is added and is enabled using the same 46An additional seccomp mode is added and is enabled using the same
46prctl(2) call as the strict seccomp. If the architecture has 47prctl(2) call as the strict seccomp. If the architecture has
47CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: 48``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:
48 49
49PR_SET_SECCOMP: 50``PR_SET_SECCOMP``:
50 Now takes an additional argument which specifies a new filter 51 Now takes an additional argument which specifies a new filter
51 using a BPF program. 52 using a BPF program.
52 The BPF program will be executed over struct seccomp_data 53 The BPF program will be executed over struct seccomp_data
@@ -55,24 +56,25 @@ PR_SET_SECCOMP:
55 acceptable values to inform the kernel which action should be 56 acceptable values to inform the kernel which action should be
56 taken. 57 taken.
57 58
58 Usage: 59 Usage::
60
59 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); 61 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
60 62
61 The 'prog' argument is a pointer to a struct sock_fprog which 63 The 'prog' argument is a pointer to a struct sock_fprog which
62 will contain the filter program. If the program is invalid, the 64 will contain the filter program. If the program is invalid, the
63 call will return -1 and set errno to EINVAL. 65 call will return -1 and set errno to ``EINVAL``.
64 66
65 If fork/clone and execve are allowed by @prog, any child 67 If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child
66 processes will be constrained to the same filters and system 68 processes will be constrained to the same filters and system
67 call ABI as the parent. 69 call ABI as the parent.
68 70
69 Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or 71 Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
70 run with CAP_SYS_ADMIN privileges in its namespace. If these are not 72 run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not
71 true, -EACCES will be returned. This requirement ensures that filter 73 true, ``-EACCES`` will be returned. This requirement ensures that filter
72 programs cannot be applied to child processes with greater privileges 74 programs cannot be applied to child processes with greater privileges
73 than the task that installed them. 75 than the task that installed them.
74 76
75 Additionally, if prctl(2) is allowed by the attached filter, 77 Additionally, if ``prctl(2)`` is allowed by the attached filter,
76 additional filters may be layered on which will increase evaluation 78 additional filters may be layered on which will increase evaluation
77 time, but allow for further decreasing the attack surface during 79 time, but allow for further decreasing the attack surface during
78 execution of a process. 80 execution of a process.
@@ -80,51 +82,52 @@ PR_SET_SECCOMP:
80The above call returns 0 on success and non-zero on error. 82The above call returns 0 on success and non-zero on error.
81 83
82Return values 84Return values
83------------- 85=============
86
84A seccomp filter may return any of the following values. If multiple 87A seccomp filter may return any of the following values. If multiple
85filters exist, the return value for the evaluation of a given system 88filters exist, the return value for the evaluation of a given system
86call will always use the highest precedent value. (For example, 89call will always use the highest precedent value. (For example,
87SECCOMP_RET_KILL will always take precedence.) 90``SECCOMP_RET_KILL`` will always take precedence.)
88 91
89In precedence order, they are: 92In precedence order, they are:
90 93
91SECCOMP_RET_KILL: 94``SECCOMP_RET_KILL``:
92 Results in the task exiting immediately without executing the 95 Results in the task exiting immediately without executing the
93 system call. The exit status of the task (status & 0x7f) will 96 system call. The exit status of the task (``status & 0x7f``) will
94 be SIGSYS, not SIGKILL. 97 be ``SIGSYS``, not ``SIGKILL``.
95 98
96SECCOMP_RET_TRAP: 99``SECCOMP_RET_TRAP``:
97 Results in the kernel sending a SIGSYS signal to the triggering 100 Results in the kernel sending a ``SIGSYS`` signal to the triggering
98 task without executing the system call. siginfo->si_call_addr 101 task without executing the system call. ``siginfo->si_call_addr``
99 will show the address of the system call instruction, and 102 will show the address of the system call instruction, and
100 siginfo->si_syscall and siginfo->si_arch will indicate which 103 ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
101 syscall was attempted. The program counter will be as though 104 syscall was attempted. The program counter will be as though
102 the syscall happened (i.e. it will not point to the syscall 105 the syscall happened (i.e. it will not point to the syscall
103 instruction). The return value register will contain an arch- 106 instruction). The return value register will contain an arch-
104 dependent value -- if resuming execution, set it to something 107 dependent value -- if resuming execution, set it to something
105 sensible. (The architecture dependency is because replacing 108 sensible. (The architecture dependency is because replacing
106 it with -ENOSYS could overwrite some useful information.) 109 it with ``-ENOSYS`` could overwrite some useful information.)
107 110
108 The SECCOMP_RET_DATA portion of the return value will be passed 111 The ``SECCOMP_RET_DATA`` portion of the return value will be passed
109 as si_errno. 112 as ``si_errno``.
110 113
111 SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. 114 ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.
112 115
113SECCOMP_RET_ERRNO: 116``SECCOMP_RET_ERRNO``:
114 Results in the lower 16-bits of the return value being passed 117 Results in the lower 16-bits of the return value being passed
115 to userland as the errno without executing the system call. 118 to userland as the errno without executing the system call.
116 119
117SECCOMP_RET_TRACE: 120``SECCOMP_RET_TRACE``:
118 When returned, this value will cause the kernel to attempt to 121 When returned, this value will cause the kernel to attempt to
119 notify a ptrace()-based tracer prior to executing the system 122 notify a ``ptrace()``-based tracer prior to executing the system
120 call. If there is no tracer present, -ENOSYS is returned to 123 call. If there is no tracer present, ``-ENOSYS`` is returned to
121 userland and the system call is not executed. 124 userland and the system call is not executed.
122 125
123 A tracer will be notified if it requests PTRACE_O_TRACESECCOMP 126 A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP``
124 using ptrace(PTRACE_SETOPTIONS). The tracer will be notified 127 using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
125 of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of 128 of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
126 the BPF program return value will be available to the tracer 129 the BPF program return value will be available to the tracer
127 via PTRACE_GETEVENTMSG. 130 via ``PTRACE_GETEVENTMSG``.
128 131
129 The tracer can skip the system call by changing the syscall number 132 The tracer can skip the system call by changing the syscall number
130 to -1. Alternatively, the tracer can change the system call 133 to -1. Alternatively, the tracer can change the system call
@@ -138,19 +141,19 @@ SECCOMP_RET_TRACE:
138 allow use of ptrace, even of other sandboxed processes, without 141 allow use of ptrace, even of other sandboxed processes, without
139 extreme care; ptracers can use this mechanism to escape.) 142 extreme care; ptracers can use this mechanism to escape.)
140 143
141SECCOMP_RET_ALLOW: 144``SECCOMP_RET_ALLOW``:
142 Results in the system call being executed. 145 Results in the system call being executed.
143 146
144If multiple filters exist, the return value for the evaluation of a 147If multiple filters exist, the return value for the evaluation of a
145given system call will always use the highest precedent value. 148given system call will always use the highest precedent value.
146 149
147Precedence is only determined using the SECCOMP_RET_ACTION mask. When 150Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
148multiple filters return values of the same precedence, only the 151multiple filters return values of the same precedence, only the
149SECCOMP_RET_DATA from the most recently installed filter will be 152``SECCOMP_RET_DATA`` from the most recently installed filter will be
150returned. 153returned.
151 154
152Pitfalls 155Pitfalls
153-------- 156========
154 157
155The biggest pitfall to avoid during use is filtering on system call 158The biggest pitfall to avoid during use is filtering on system call
156number without checking the architecture value. Why? On any 159number without checking the architecture value. Why? On any
@@ -160,39 +163,40 @@ the numbers in the different calling conventions overlap, then checks in
160the filters may be abused. Always check the arch value! 163the filters may be abused. Always check the arch value!
161 164
162Example 165Example
163------- 166=======
164 167
165The samples/seccomp/ directory contains both an x86-specific example 168The ``samples/seccomp/`` directory contains both an x86-specific example
166and a more generic example of a higher level macro interface for BPF 169and a more generic example of a higher level macro interface for BPF
167program generation. 170program generation.
168 171
169 172
170 173
171Adding architecture support 174Adding architecture support
172----------------------- 175===========================
173 176
174See arch/Kconfig for the authoritative requirements. In general, if an 177See ``arch/Kconfig`` for the authoritative requirements. In general, if an
175architecture supports both ptrace_event and seccomp, it will be able to 178architecture supports both ptrace_event and seccomp, it will be able to
176support seccomp filter with minor fixup: SIGSYS support and seccomp return 179support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
177value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER 180value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
178to its arch-specific Kconfig. 181to its arch-specific Kconfig.
179 182
180 183
181 184
182Caveats 185Caveats
183------- 186=======
184 187
185The vDSO can cause some system calls to run entirely in userspace, 188The vDSO can cause some system calls to run entirely in userspace,
186leading to surprises when you run programs on different machines that 189leading to surprises when you run programs on different machines that
187fall back to real syscalls. To minimize these surprises on x86, make 190fall back to real syscalls. To minimize these surprises on x86, make
188sure you test with 191sure you test with
189/sys/devices/system/clocksource/clocksource0/current_clocksource set to 192``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
190something like acpi_pm. 193something like ``acpi_pm``.
191 194
192On x86-64, vsyscall emulation is enabled by default. (vsyscalls are 195On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
193legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: 196legacy variants on vDSO calls.) Currently, emulated vsyscalls will
197honor seccomp, with a few oddities:
194 198
195- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to 199- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
196 the vsyscall entry for the given call and not the address after the 200 the vsyscall entry for the given call and not the address after the
197 'syscall' instruction. Any code which wants to restart the call 201 'syscall' instruction. Any code which wants to restart the call
198 should be aware that (a) a ret instruction has been emulated and (b) 202 should be aware that (a) a ret instruction has been emulated and (b)
@@ -200,7 +204,7 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
200 emulation security checks, making resuming the syscall mostly 204 emulation security checks, making resuming the syscall mostly
201 pointless. 205 pointless.
202 206
203- A return value of SECCOMP_RET_TRACE will signal the tracer as usual, 207- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
204 but the syscall may not be changed to another system call using the 208 but the syscall may not be changed to another system call using the
205 orig_rax register. It may only be changed to -1 in order to skip the 209 orig_rax register. It may only be changed to -1 in order to skip the
206 currently emulated call. Any other change MAY terminate the process. 210 currently emulated call. Any other change MAY terminate the process.
@@ -209,14 +213,14 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
209 rip or rsp. (Do not rely on other changes terminating the process. 213 rip or rsp. (Do not rely on other changes terminating the process.
210 They might work. For example, on some kernels, choosing a syscall 214 They might work. For example, on some kernels, choosing a syscall
211 that only exists in future kernels will be correctly emulated (by 215 that only exists in future kernels will be correctly emulated (by
212 returning -ENOSYS).) 216 returning ``-ENOSYS``).)
213 217
214To detect this quirky behavior, check for addr & ~0x0C00 == 218To detect this quirky behavior, check for ``addr & ~0x0C00 ==
2150xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For 2190xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For
216SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other 220``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other
217condition: future kernels may improve vsyscall emulation and current 221condition: future kernels may improve vsyscall emulation and current
218kernels in vsyscall=native mode will behave differently, but the 222kernels in vsyscall=native mode will behave differently, but the
219instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these 223instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
220cases. 224cases.
221 225
222Note that modern systems are unlikely to use vsyscalls at all -- they 226Note that modern systems are unlikely to use vsyscalls at all -- they
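The vsyscall detection check described in the Caveats hunk above can be sketched in C as follows. This is a minimal illustration, not code from the patch: the helper name `is_vsyscall_entry` and the two macro names are hypothetical. Note that the expression as written in the text, `addr & ~0x0C00 == 0xFFFFFFFFFF600000`, would be parsed by a C compiler as `addr & (~0x0C00 == ...)` because `==` binds tighter than `&`, so the sketch adds explicit parentheses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base address of the legacy x86-64 vsyscall page, as given in the text. */
#define VSYSCALL_BASE       0xFFFFFFFFFF600000ULL
/* Bits selecting one of the four entries at offsets 0x000, 0x400, 0x800, 0xC00. */
#define VSYSCALL_ENTRY_MASK 0x0C00ULL

/*
 * Hypothetical helper: returns true when addr is one of the four vsyscall
 * entry points. For SECCOMP_RET_TRACE the caller would pass rip; for
 * SECCOMP_RET_TRAP it would pass siginfo->si_call_addr.
 */
static bool is_vsyscall_entry(uint64_t addr)
{
    /* Parentheses are required: == has higher precedence than & in C. */
    return (addr & ~VSYSCALL_ENTRY_MASK) == VSYSCALL_BASE;
}
```

A tracer or SIGSYS handler could call this helper before deciding whether resuming or rewriting the call is meaningful.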
diff --git a/MAINTAINERS b/MAINTAINERS
index f7d568b8f133..752916d1461c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11492,6 +11492,7 @@ F: kernel/seccomp.c
11492F: include/uapi/linux/seccomp.h 11492F: include/uapi/linux/seccomp.h
11493F: include/linux/seccomp.h 11493F: include/linux/seccomp.h
11494F: tools/testing/selftests/seccomp/* 11494F: tools/testing/selftests/seccomp/*
11495F: Documentation/userspace-api/seccomp_filter.rst
11495K: \bsecure_computing 11496K: \bsecure_computing
11496K: \bTIF_SECCOMP\b 11497K: \bTIF_SECCOMP\b
11497 11498
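The precedence rule from the "Return values" section of seccomp_filter.rst in the diff above can be illustrated with a small C sketch. It is not the kernel's implementation: the function name `combine_filter_results` is hypothetical, and the constants are defined locally on the assumption that they match the values in `include/uapi/linux/seccomp.h` at the time of this commit, where a numerically smaller action has higher precedence.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror include/uapi/linux/seccomp.h as of this commit. */
#define SECCOMP_RET_KILL   0x00000000U
#define SECCOMP_RET_TRAP   0x00030000U
#define SECCOMP_RET_ERRNO  0x00050000U
#define SECCOMP_RET_TRACE  0x7ff00000U
#define SECCOMP_RET_ALLOW  0x7fff0000U
#define SECCOMP_RET_ACTION 0x7fff0000U  /* mask: action bits */
#define SECCOMP_RET_DATA   0x0000ffffU  /* mask: data bits */

/*
 * Hypothetical helper: results[0] is the value returned by the most
 * recently installed filter, results[n - 1] by the oldest one.
 * Only the SECCOMP_RET_ACTION bits decide precedence. On a tie the
 * strict comparison keeps the earlier entry, so the SECCOMP_RET_DATA
 * of the most recently installed filter is the one returned.
 */
static uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    if (n == 0)
        return SECCOMP_RET_ALLOW;  /* no filters: nothing restricts the call */

    uint32_t ret = results[0];
    for (size_t i = 1; i < n; i++) {
        if ((results[i] & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
            ret = results[i];
    }
    return ret;
}
```

For example, if the newest filter returns `SECCOMP_RET_ERRNO | 13` and an older one returns `SECCOMP_RET_ERRNO | 1`, the combined result carries the data value 13.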