Diffstat (limited to 'Documentation/userspace-api')
| -rw-r--r-- | Documentation/userspace-api/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/userspace-api/seccomp_filter.rst | 229 |
2 files changed, 230 insertions, 0 deletions
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a9d01b44a659..15ff12342db8 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
| @@ -16,6 +16,7 @@ place where this information is gathered. | |||
| 16 | .. toctree:: | 16 | .. toctree:: |
| 17 | :maxdepth: 2 | 17 | :maxdepth: 2 |
| 18 | 18 | ||
| 19 | seccomp_filter | ||
| 19 | unshare | 20 | unshare |
| 20 | 21 | ||
| 21 | .. only:: subproject and html | 22 | .. only:: subproject and html |
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
new file mode 100644
index 000000000000..f71eb5ef1f2d
--- /dev/null
+++ b/Documentation/userspace-api/seccomp_filter.rst
| @@ -0,0 +1,229 @@ | |||
| 1 | =========================================== | ||
| 2 | Seccomp BPF (SECure COMPuting with filters) | ||
| 3 | =========================================== | ||
| 4 | |||
| 5 | Introduction | ||
| 6 | ============ | ||
| 7 | |||
| 8 | A large number of system calls are exposed to every userland process | ||
| 9 | with many of them going unused for the entire lifetime of the process. | ||
| 10 | As system calls change and mature, bugs are found and eradicated. A | ||
| 11 | certain subset of userland applications benefit by having a reduced set | ||
| 12 | of available system calls. The resulting set reduces the total kernel | ||
| 13 | surface exposed to the application. System call filtering is meant for | ||
| 14 | use with those applications. | ||
| 15 | |||
| 16 | Seccomp filtering provides a means for a process to specify a filter for | ||
| 17 | incoming system calls. The filter is expressed as a Berkeley Packet | ||
| 18 | Filter (BPF) program, as with socket filters, except that the data | ||
| 19 | operated on is related to the system call being made: system call | ||
| 20 | number and the system call arguments. This allows for expressive | ||
| 21 | filtering of system calls using a filter program language with a long | ||
| 22 | history of being exposed to userland and a straightforward data set. | ||
| 23 | |||
| 24 | Additionally, BPF makes it impossible for users of seccomp to fall prey | ||
| 25 | to time-of-check-time-of-use (TOCTOU) attacks that are common in system | ||
| 26 | call interposition frameworks. BPF programs may not dereference | ||
| 27 | pointers, which constrains all filters to solely evaluating the system | ||
| 28 | call arguments directly. | ||
| 29 | |||
| 30 | What it isn't | ||
| 31 | ============= | ||
| 32 | |||
| 33 | System call filtering isn't a sandbox. It provides a clearly defined | ||
| 34 | mechanism for minimizing the exposed kernel surface. It is meant to be | ||
| 35 | a tool for sandbox developers to use. Beyond that, policy for logical | ||
| 36 | behavior and information flow should be managed with a combination of | ||
| 37 | other system hardening techniques and, potentially, an LSM of your | ||
| 38 | choosing. Expressive, dynamic filters provide further options down this | ||
| 39 | path (avoiding pathological sizes or selecting which of the multiplexed | ||
| 40 | system calls in socketcall() is allowed, for instance) which could be | ||
| 41 | construed, incorrectly, as a more complete sandboxing solution. | ||
| 42 | |||
| 43 | Usage | ||
| 44 | ===== | ||
| 45 | |||
| 46 | An additional seccomp mode is added and is enabled using the same | ||
| 47 | prctl(2) call as the strict seccomp. If the architecture has | ||
| 48 | ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below: | ||
| 49 | |||
| 50 | ``PR_SET_SECCOMP``: | ||
| 51 | Now takes an additional argument which specifies a new filter | ||
| 52 | using a BPF program. | ||
| 53 | The BPF program will be executed over struct seccomp_data | ||
| 54 | reflecting the system call number, arguments, and other | ||
| 55 | metadata. The BPF program must then return one of the | ||
| 56 | acceptable values to inform the kernel which action should be | ||
| 57 | taken. | ||
| 58 | |||
| 59 | Usage:: | ||
| 60 | |||
| 61 | prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); | ||
| 62 | |||
| 63 | The 'prog' argument is a pointer to a struct sock_fprog which | ||
| 64 | will contain the filter program. If the program is invalid, the | ||
| 65 | call will return -1 and set errno to ``EINVAL``. | ||
| 66 | |||
| 67 | If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child | ||
| 68 | processes will be constrained to the same filters and system | ||
| 69 | call ABI as the parent. | ||
| 70 | |||
| 71 | Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or | ||
| 72 | run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not | ||
| 73 | true, ``-EACCES`` will be returned. This requirement ensures that filter | ||
| 74 | programs cannot be applied to child processes with greater privileges | ||
| 75 | than the task that installed them. | ||
| 76 | |||
| 77 | Additionally, if ``prctl(2)`` is allowed by the attached filter, | ||
| 78 | additional filters may be layered on which will increase evaluation | ||
| 79 | time, but allow for further decreasing the attack surface during | ||
| 80 | execution of a process. | ||
| 81 | |||
| 82 | The above call returns 0 on success and non-zero on error. | ||
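
The call sequence described above can be sketched as follows. This is a minimal illustration and not code from the kernel tree: the names ``example_allow_all``, ``example_prog`` and ``example_install_filter`` are hypothetical, and the one-instruction program simply allows every system call so that installing it is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical one-instruction program: allow every system call. */
struct sock_filter example_allow_all[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* The struct sock_fprog that is passed as the 'prog' argument. */
struct sock_fprog example_prog = {
	.len = (unsigned short)(sizeof(example_allow_all) /
				sizeof(example_allow_all[0])),
	.filter = example_allow_all,
};

/* Returns 0 on success, -1 with errno set on failure. */
int example_install_filter(const struct sock_fprog *prog)
{
	assert(prog != NULL);

	/* Needed unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;

	/* An invalid program makes this call fail with EINVAL. */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog) != 0)
		return -1;

	return 0;
}
```

A caller would invoke ``example_install_filter(&example_prog)`` once, early in the life of the process, and inspect ``errno`` if it returns -1.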
| 83 | |||
| 84 | Return values | ||
| 85 | ============= | ||
| 86 | |||
| 87 | A seccomp filter may return any of the following values. If multiple | ||
| 88 | filters exist, the return value for the evaluation of a given system | ||
| 89 | call will always use the highest-precedence value. (For example, | ||
| 90 | ``SECCOMP_RET_KILL`` will always take precedence.) | ||
| 91 | |||
| 92 | In precedence order, they are: | ||
| 93 | |||
| 94 | ``SECCOMP_RET_KILL``: | ||
| 95 | Results in the task exiting immediately without executing the | ||
| 96 | system call. The exit status of the task (``status & 0x7f``) will | ||
| 97 | be ``SIGSYS``, not ``SIGKILL``. | ||
| 98 | |||
| 99 | ``SECCOMP_RET_TRAP``: | ||
| 100 | Results in the kernel sending a ``SIGSYS`` signal to the triggering | ||
| 101 | task without executing the system call. ``siginfo->si_call_addr`` | ||
| 102 | will show the address of the system call instruction, and | ||
| 103 | ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which | ||
| 104 | syscall was attempted. The program counter will be as though | ||
| 105 | the syscall happened (i.e. it will not point to the syscall | ||
| 106 | instruction). The return value register will contain an arch- | ||
| 107 | dependent value -- if resuming execution, set it to something | ||
| 108 | sensible. (The architecture dependency is because replacing | ||
| 109 | it with ``-ENOSYS`` could overwrite some useful information.) | ||
| 110 | |||
| 111 | The ``SECCOMP_RET_DATA`` portion of the return value will be passed | ||
| 112 | as ``si_errno``. | ||
| 113 | |||
| 114 | ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``. | ||
| 115 | |||
| 116 | ``SECCOMP_RET_ERRNO``: | ||
| 117 | Results in the lower 16-bits of the return value being passed | ||
| 118 | to userland as the errno without executing the system call. | ||
| 119 | |||
| 120 | ``SECCOMP_RET_TRACE``: | ||
| 121 | When returned, this value will cause the kernel to attempt to | ||
| 122 | notify a ``ptrace()``-based tracer prior to executing the system | ||
| 123 | call. If there is no tracer present, ``-ENOSYS`` is returned to | ||
| 124 | userland and the system call is not executed. | ||
| 125 | |||
| 126 | A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP`` | ||
| 127 | using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified | ||
| 128 | of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of | ||
| 129 | the BPF program return value will be available to the tracer | ||
| 130 | via ``PTRACE_GETEVENTMSG``. | ||
| 131 | |||
| 132 | The tracer can skip the system call by changing the syscall number | ||
| 133 | to -1. Alternatively, the tracer can change the system call | ||
| 134 | requested by changing the system call to a valid syscall number. If | ||
| 135 | the tracer asks to skip the system call, then the system call will | ||
| 136 | appear to return the value that the tracer puts in the return value | ||
| 137 | register. | ||
| 138 | |||
| 139 | The seccomp check will not be run again after the tracer is | ||
| 140 | notified. (This means that seccomp-based sandboxes MUST NOT | ||
| 141 | allow use of ptrace, even of other sandboxed processes, without | ||
| 142 | extreme care; ptracers can use this mechanism to escape.) | ||
| 143 | |||
| 144 | ``SECCOMP_RET_ALLOW``: | ||
| 145 | Results in the system call being executed. | ||
| 146 | |||
| 147 | If multiple filters exist, the return value for the evaluation of a | ||
| 148 | given system call will always use the highest-precedence value. | ||
| 149 | |||
| 150 | Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When | ||
| 151 | multiple filters return values of the same precedence, only the | ||
| 152 | ``SECCOMP_RET_DATA`` from the most recently installed filter will be | ||
| 153 | returned. | ||
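
The precedence rule can be modelled in plain C. The sketch below is an illustration of the rule as stated here, not the kernel's implementation. The ``EX_RET_*`` constants are local copies that are assumed to match ``<linux/seccomp.h>`` for the kernel version this document describes (check your own header), and ``example_combine`` is a hypothetical helper name. With these values, a numerically lower action has higher precedence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, redefined locally so the sketch compiles standalone. */
#define EX_RET_KILL   0x00000000U
#define EX_RET_TRAP   0x00030000U
#define EX_RET_ERRNO  0x00050000U
#define EX_RET_TRACE  0x7ff00000U
#define EX_RET_ALLOW  0x7fff0000U
#define EX_RET_ACTION 0xffff0000U	/* action part of a return value */
#define EX_RET_DATA   0x0000ffffU	/* data part of a return value */

/*
 * results[0] comes from the oldest filter and results[n - 1] from the
 * most recently installed one. Only the action bits are compared, so
 * on a tie the value (and data) of the newer filter is kept.
 */
uint32_t example_combine(const uint32_t *results, size_t n)
{
	assert(n > 0);

	uint32_t ret = results[n - 1];

	/* Walk from the second-newest filter back to the oldest. */
	for (size_t i = n - 1; i-- > 0; ) {
		if ((results[i] & EX_RET_ACTION) < (ret & EX_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

For instance, if three filters return ``EX_RET_ALLOW``, ``EX_RET_ERRNO | 1`` and ``EX_RET_ERRNO | 13`` (oldest first), the combined result is ``EX_RET_ERRNO | 13``: both errno results outrank the allow, and the newer filter supplies the data.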
| 154 | |||
| 155 | Pitfalls | ||
| 156 | ======== | ||
| 157 | |||
| 158 | The biggest pitfall to avoid during use is filtering on system call | ||
| 159 | number without checking the architecture value. Why? On any | ||
| 160 | architecture that supports multiple system call invocation conventions, | ||
| 161 | the system call numbers may vary based on the specific invocation. If | ||
| 162 | the numbers in the different calling conventions overlap, then checks in | ||
| 163 | the filters may be abused. Always check the arch value! | ||
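
A filter that follows this advice loads the architecture first and only then looks at the system call number. The sketch below is one possible layout, not the sample shipped in ``samples/seccomp/``. The names ``EXAMPLE_AUDIT_ARCH``, ``example_arch_checked_filter`` and ``example_arch_checked_prog`` are hypothetical, only x86-64 and arm64 builds are covered, and denying ``getpid`` with ``EPERM`` is an arbitrary choice made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption: the filter is built for, and runs on, the same arch. */
#if defined(__x86_64__)
#define EXAMPLE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define EXAMPLE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

struct sock_filter example_arch_checked_filter[] = {
	/* [0] Load seccomp_data.arch. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	/* [1] Expected arch: skip [2]. Anything else: fall into [2]. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EXAMPLE_AUDIT_ARCH, 1, 0),
	/* [2] Unexpected calling convention: kill the task. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* [3] Only now is the syscall number meaningful; load it. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	/* [4] getpid: continue at [5]. Anything else: jump to [6]. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	/* [5] Fail the call with EPERM without executing it. */
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* [6] Everything else is allowed. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog example_arch_checked_prog = {
	.len = (unsigned short)(sizeof(example_arch_checked_filter) /
				sizeof(example_arch_checked_filter[0])),
	.filter = example_arch_checked_filter,
};
```

Placing the architecture test at the top means every later comparison against a syscall number is known to use the numbering of the expected calling convention.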
| 164 | |||
| 165 | Example | ||
| 166 | ======= | ||
| 167 | |||
| 168 | The ``samples/seccomp/`` directory contains both an x86-specific example | ||
| 169 | and a more generic example of a higher level macro interface for BPF | ||
| 170 | program generation. | ||
| 171 | |||
| 172 | |||
| 173 | |||
| 174 | Adding architecture support | ||
| 175 | =========================== | ||
| 176 | |||
| 177 | See ``arch/Kconfig`` for the authoritative requirements. In general, if an | ||
| 178 | architecture supports both ptrace_event and seccomp, it will be able to | ||
| 179 | support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return | ||
| 180 | value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER`` | ||
| 181 | to its arch-specific Kconfig. | ||
| 182 | |||
| 183 | |||
| 184 | |||
| 185 | Caveats | ||
| 186 | ======= | ||
| 187 | |||
| 188 | The vDSO can cause some system calls to run entirely in userspace, | ||
| 189 | leading to surprises when you run programs on different machines that | ||
| 190 | fall back to real syscalls. To minimize these surprises on x86, make | ||
| 191 | sure you test with | ||
| 192 | ``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to | ||
| 193 | something like ``acpi_pm``. | ||
| 194 | |||
| 195 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are | ||
| 196 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will | ||
| 197 | honor seccomp, with a few oddities: | ||
| 198 | |||
| 199 | - A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to | ||
| 200 | the vsyscall entry for the given call and not the address after the | ||
| 201 | 'syscall' instruction. Any code which wants to restart the call | ||
| 202 | should be aware that (a) a ret instruction has been emulated and (b) | ||
| 203 | trying to resume the syscall will again trigger the standard vsyscall | ||
| 204 | emulation security checks, making resuming the syscall mostly | ||
| 205 | pointless. | ||
| 206 | |||
| 207 | - A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual, | ||
| 208 | but the syscall may not be changed to another system call using the | ||
| 209 | orig_rax register. It may only be changed to -1 in order to skip the | ||
| 210 | currently emulated call. Any other change MAY terminate the process. | ||
| 211 | The rip value seen by the tracer will be the syscall entry address; | ||
| 212 | this is different from normal behavior. The tracer MUST NOT modify | ||
| 213 | rip or rsp. (Do not rely on other changes terminating the process. | ||
| 214 | They might work. For example, on some kernels, choosing a syscall | ||
| 215 | that only exists in future kernels will be correctly emulated (by | ||
| 216 | returning ``-ENOSYS``).) | ||
| 217 | |||
| 218 | To detect this quirky behavior, check for ``(addr & ~0x0C00) == | ||
| 219 | 0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For | ||
| 220 | ``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other | ||
| 221 | condition: future kernels may improve vsyscall emulation and current | ||
| 222 | kernels in vsyscall=native mode will behave differently, but the | ||
| 223 | instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these | ||
| 224 | cases. | ||
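
Written out in C, the address test needs explicit parentheses, because ``==`` binds more tightly than ``&``. The helper below is a sketch with a hypothetical name, ``example_is_vsyscall_entry``. The caller is assumed to pass rip (for ``SECCOMP_RET_TRACE``) or ``siginfo->si_call_addr`` (for ``SECCOMP_RET_TRAP``) converted to a 64-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when addr is one of the entry addresses 0xF...F600{0,4,8,C}00.
 * Clearing bits 10 and 11 folds all four slots onto the page base.
 */
bool example_is_vsyscall_entry(uint64_t addr)
{
	return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

Any address that differs from the four slot addresses in a bit other than bits 10 and 11 is rejected, which matches the advice not to check any other condition.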
| 225 | |||
| 226 | Note that modern systems are unlikely to use vsyscalls at all -- they | ||
| 227 | are a legacy feature and they are considerably slower than standard | ||
| 228 | syscalls. New code will use the vDSO, and vDSO-issued system calls | ||
| 229 | are indistinguishable from normal system calls. | ||
