aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorAndy Lutomirski <luto@amacapital.net>2012-10-01 14:40:45 -0400
committerJames Morris <james.l.morris@oracle.com>2012-10-02 07:14:29 -0400
commit87b526d349b04c31d7b3a40b434eb3f825d22305 (patch)
tree2aeec0465901c9623ef7f5b3eb451ea6ccce6ecc /Documentation
parentbf5308344527d015ac9a6d2bda4ad4d40fd7d943 (diff)
seccomp: Make syscall skipping and nr changes more consistent
This fixes two issues that could cause incompatibility between kernel versions: - If a tracer uses SECCOMP_RET_TRACE to select a syscall number higher than the largest known syscall, emulate the unknown vsyscall by returning -ENOSYS. (This is unlikely to make a noticeable difference on x86-64 due to the way the system call entry works.) - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy. This updates the documentation accordingly. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Will Drewry <wad@chromium.org> Signed-off-by: James Morris <james.l.morris@oracle.com>
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/prctl/seccomp_filter.txt74
1 files changed, 68 insertions, 6 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 597c3c581375..1e469ef75778 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -95,12 +95,15 @@ SECCOMP_RET_KILL:
95 95
96SECCOMP_RET_TRAP: 96SECCOMP_RET_TRAP:
97 Results in the kernel sending a SIGSYS signal to the triggering 97 Results in the kernel sending a SIGSYS signal to the triggering
98 task without executing the system call. The kernel will 98 task without executing the system call. siginfo->si_call_addr
99 rollback the register state to just before the system call 99 will show the address of the system call instruction, and
100 entry such that a signal handler in the task will be able to 100 siginfo->si_syscall and siginfo->si_arch will indicate which
101 inspect the ucontext_t->uc_mcontext registers and emulate 101 syscall was attempted. The program counter will be as though
102 system call success or failure upon return from the signal 102 the syscall happened (i.e. it will not point to the syscall
103 handler. 103 instruction). The return value register will contain an arch-
104 dependent value -- if resuming execution, set it to something
105 sensible. (The architecture dependency is because replacing
106 it with -ENOSYS could overwrite some useful information.)
104 107
105 The SECCOMP_RET_DATA portion of the return value will be passed 108 The SECCOMP_RET_DATA portion of the return value will be passed
106 as si_errno. 109 as si_errno.
@@ -123,6 +126,18 @@ SECCOMP_RET_TRACE:
123 the BPF program return value will be available to the tracer 126 the BPF program return value will be available to the tracer
124 via PTRACE_GETEVENTMSG. 127 via PTRACE_GETEVENTMSG.
125 128
129 The tracer can skip the system call by changing the syscall number
130 to -1. Alternatively, the tracer can change the system call
131 requested by changing the system call to a valid syscall number. If
132 the tracer asks to skip the system call, then the system call will
133 appear to return the value that the tracer puts in the return value
134 register.
135
136 The seccomp check will not be run again after the tracer is
137 notified. (This means that seccomp-based sandboxes MUST NOT
138 allow use of ptrace, even of other sandboxed processes, without
139 extreme care; ptracers can use this mechanism to escape.)
140
126SECCOMP_RET_ALLOW: 141SECCOMP_RET_ALLOW:
127 Results in the system call being executed. 142 Results in the system call being executed.
128 143
@@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to
161support seccomp filter with minor fixup: SIGSYS support and seccomp return 176support seccomp filter with minor fixup: SIGSYS support and seccomp return
162value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER 177value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
163to its arch-specific Kconfig. 178to its arch-specific Kconfig.
179
180
181
182Caveats
183-------
184
185The vDSO can cause some system calls to run entirely in userspace,
186leading to surprises when you run programs on different machines that
187fall back to real syscalls. To minimize these surprises on x86, make
188sure you test with
189/sys/devices/system/clocksource/clocksource0/current_clocksource set to
190something like acpi_pm.
191
192On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
193legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
194
195- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
196 the vsyscall entry for the given call and not the address after the
197 'syscall' instruction. Any code which wants to restart the call
198 should be aware that (a) a ret instruction has been emulated and (b)
199 trying to resume the syscall will again trigger the standard vsyscall
200 emulation security checks, making resuming the syscall mostly
201 pointless.
202
203- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
204 but the syscall may not be changed to another system call using the
205 orig_rax register. It may only be changed to -1 order to skip the
206 currently emulated call. Any other change MAY terminate the process.
207 The rip value seen by the tracer will be the syscall entry address;
208 this is different from normal behavior. The tracer MUST NOT modify
209 rip or rsp. (Do not rely on other changes terminating the process.
210 They might work. For example, on some kernels, choosing a syscall
211 that only exists in future kernels will be correctly emulated (by
212 returning -ENOSYS).
213
214To detect this quirky behavior, check for addr & ~0x0C00 ==
2150xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
216SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
217condition: future kernels may improve vsyscall emulation and current
218kernels in vsyscall=native mode will behave differently, but the
219instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
220cases.
221
222Note that modern systems are unlikely to use vsyscalls at all -- they
223are a legacy feature and they are considerably slower than standard
224syscalls. New code will use the vDSO, and vDSO-issued system calls
225are indistinguishable from normal system calls.