| author | Andy Lutomirski <luto@amacapital.net> | 2012-10-01 14:40:45 -0400 |
|---|---|---|
| committer | James Morris <james.l.morris@oracle.com> | 2012-10-02 07:14:29 -0400 |
| commit | 87b526d349b04c31d7b3a40b434eb3f825d22305 (patch) | |
| tree | 2aeec0465901c9623ef7f5b3eb451ea6ccce6ecc /Documentation | |
| parent | bf5308344527d015ac9a6d2bda4ad4d40fd7d943 (diff) | |
seccomp: Make syscall skipping and nr changes more consistent
This fixes two issues that could cause incompatibility between
kernel versions:
- If a tracer uses SECCOMP_RET_TRACE to select a syscall number
higher than the largest known syscall, emulate the unknown
vsyscall by returning -ENOSYS. (This is unlikely to make a
noticeable difference on x86-64 due to the way the system call
entry works.)
- On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
This updates the documentation accordingly.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 74 |
1 files changed, 68 insertions, 6 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 597c3c581375..1e469ef75778 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
| @@ -95,12 +95,15 @@ SECCOMP_RET_KILL: | |||
| 95 | 95 | ||
| 96 | SECCOMP_RET_TRAP: | 96 | SECCOMP_RET_TRAP: |
| 97 | Results in the kernel sending a SIGSYS signal to the triggering | 97 | Results in the kernel sending a SIGSYS signal to the triggering |
| 98 | task without executing the system call. The kernel will | 98 | task without executing the system call. siginfo->si_call_addr |
| 99 | rollback the register state to just before the system call | 99 | will show the address of the system call instruction, and |
| 100 | entry such that a signal handler in the task will be able to | 100 | siginfo->si_syscall and siginfo->si_arch will indicate which |
| 101 | inspect the ucontext_t->uc_mcontext registers and emulate | 101 | syscall was attempted. The program counter will be as though |
| 102 | system call success or failure upon return from the signal | 102 | the syscall happened (i.e. it will not point to the syscall |
| 103 | handler. | 103 | instruction). The return value register will contain an arch- |
| 104 | dependent value -- if resuming execution, set it to something | ||
| 105 | sensible. (The architecture dependency is because replacing | ||
| 106 | it with -ENOSYS could overwrite some useful information.) | ||
| 104 | 107 | ||
| 105 | The SECCOMP_RET_DATA portion of the return value will be passed | 108 | The SECCOMP_RET_DATA portion of the return value will be passed |
| 106 | as si_errno. | 109 | as si_errno. |
| @@ -123,6 +126,18 @@ SECCOMP_RET_TRACE: | |||
| 123 | the BPF program return value will be available to the tracer | 126 | the BPF program return value will be available to the tracer |
| 124 | via PTRACE_GETEVENTMSG. | 127 | via PTRACE_GETEVENTMSG. |
| 125 | 128 | ||
| 129 | The tracer can skip the system call by changing the syscall number | ||
| 130 | to -1. Alternatively, the tracer can change the system call | ||
| 131 | requested by changing the system call to a valid syscall number. If | ||
| 132 | the tracer asks to skip the system call, then the system call will | ||
| 133 | appear to return the value that the tracer puts in the return value | ||
| 134 | register. | ||
| 135 | |||
| 136 | The seccomp check will not be run again after the tracer is | ||
| 137 | notified. (This means that seccomp-based sandboxes MUST NOT | ||
| 138 | allow use of ptrace, even of other sandboxed processes, without | ||
| 139 | extreme care; ptracers can use this mechanism to escape.) | ||
| 140 | |||
| 126 | SECCOMP_RET_ALLOW: | 141 | SECCOMP_RET_ALLOW: |
| 127 | Results in the system call being executed. | 142 | Results in the system call being executed. |
| 128 | 143 | ||
| @@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to | |||
| 161 | support seccomp filter with minor fixup: SIGSYS support and seccomp return | 176 | support seccomp filter with minor fixup: SIGSYS support and seccomp return |
| 162 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER | 177 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER |
| 163 | to its arch-specific Kconfig. | 178 | to its arch-specific Kconfig. |
| 179 | |||
| 180 | |||
| 181 | |||
| 182 | Caveats | ||
| 183 | ------- | ||
| 184 | |||
| 185 | The vDSO can cause some system calls to run entirely in userspace, | ||
| 186 | leading to surprises when you run programs on different machines that | ||
| 187 | fall back to real syscalls. To minimize these surprises on x86, make | ||
| 188 | sure you test with | ||
| 189 | /sys/devices/system/clocksource/clocksource0/current_clocksource set to | ||
| 190 | something like acpi_pm. | ||
| 191 | |||
| 192 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are | ||
| 193 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: | ||
| 194 | |||
| 195 | - A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to | ||
| 196 | the vsyscall entry for the given call and not the address after the | ||
| 197 | 'syscall' instruction. Any code which wants to restart the call | ||
| 198 | should be aware that (a) a ret instruction has been emulated and (b) | ||
| 199 | trying to resume the syscall will again trigger the standard vsyscall | ||
| 200 | emulation security checks, making resuming the syscall mostly | ||
| 201 | pointless. | ||
| 202 | |||
| 203 | - A return value of SECCOMP_RET_TRACE will signal the tracer as usual, | ||
| 204 | but the syscall may not be changed to another system call using the | ||
| 205 | orig_rax register. It may only be changed to -1 in order to skip the | ||
| 206 | currently emulated call. Any other change MAY terminate the process. | ||
| 207 | The rip value seen by the tracer will be the syscall entry address; | ||
| 208 | this is different from normal behavior. The tracer MUST NOT modify | ||
| 209 | rip or rsp. (Do not rely on other changes terminating the process. | ||
| 210 | They might work. For example, on some kernels, choosing a syscall | ||
| 211 | that only exists in future kernels will be correctly emulated (by | ||
| 212 | returning -ENOSYS).) | ||
| 213 | |||
| 214 | To detect this quirky behavior, check for (addr & ~0x0C00) == | ||
| 215 | 0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For | ||
| 216 | SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other | ||
| 217 | condition: future kernels may improve vsyscall emulation and current | ||
| 218 | kernels in vsyscall=native mode will behave differently, but the | ||
| 219 | instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these | ||
| 220 | cases. | ||
| 221 | |||
| 222 | Note that modern systems are unlikely to use vsyscalls at all -- they | ||
| 223 | are a legacy feature and they are considerably slower than standard | ||
| 224 | syscalls. New code will use the vDSO, and vDSO-issued system calls | ||
| 225 | are indistinguishable from normal system calls. | ||
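
The vsyscall address check described under Caveats can be sketched in C as follows. This is a minimal illustration, not code from the patch: the names is_vsyscall_entry, VSYSCALL_BASE, and VSYSCALL_SLOT_MASK are hypothetical. One detail worth spelling out is that in C the == operator binds more tightly than &, so the masked address has to be parenthesized before the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base address of the legacy x86-64 vsyscall page, as quoted in the text. */
#define VSYSCALL_BASE UINT64_C(0xFFFFFFFFFF600000)

/*
 * The entries sit at offsets 0x000, 0x400, 0x800 and 0xC00, so bits 10
 * and 11 select the entry. Clearing them must leave exactly the base.
 */
#define VSYSCALL_SLOT_MASK UINT64_C(0x0C00)

/*
 * Hypothetical helper: returns true if addr is one of the four vsyscall
 * entry addresses. For SECCOMP_RET_TRACE the caller would pass rip; for
 * SECCOMP_RET_TRAP it would pass siginfo->si_call_addr cast to uint64_t.
 */
static bool is_vsyscall_entry(uint64_t addr)
{
    /* Parentheses are required: '==' has higher precedence than '&'. */
    return (addr & ~VSYSCALL_SLOT_MASK) == VSYSCALL_BASE;
}
```

A tracer or SIGSYS handler could call is_vsyscall_entry() first and apply the restrictions listed above (only change the syscall number to -1, leave rip and rsp alone) when it returns true.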
