author | Andy Lutomirski <luto@amacapital.net> | 2012-10-01 14:40:45 -0400 |
---|---|---|
committer | James Morris <james.l.morris@oracle.com> | 2012-10-02 07:14:29 -0400 |
commit | 87b526d349b04c31d7b3a40b434eb3f825d22305 (patch) | |
tree | 2aeec0465901c9623ef7f5b3eb451ea6ccce6ecc /Documentation | |
parent | bf5308344527d015ac9a6d2bda4ad4d40fd7d943 (diff) |
seccomp: Make syscall skipping and nr changes more consistent
This fixes two issues that could cause incompatibility between
kernel versions:
- If a tracer uses SECCOMP_RET_TRACE to select a syscall number
higher than the largest known syscall, emulate the unknown
vsyscall by returning -ENOSYS. (This is unlikely to make a
noticeable difference on x86-64 due to the way the system call
entry works.)
- On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
This updates the documentation accordingly.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 74 |
1 files changed, 68 insertions, 6 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt index 597c3c581375..1e469ef75778 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/prctl/seccomp_filter.txt | |||
@@ -95,12 +95,15 @@ SECCOMP_RET_KILL: | |||
95 | 95 | ||
96 | SECCOMP_RET_TRAP: | 96 | SECCOMP_RET_TRAP: |
97 | Results in the kernel sending a SIGSYS signal to the triggering | 97 | Results in the kernel sending a SIGSYS signal to the triggering |
98 | task without executing the system call. The kernel will | 98 | task without executing the system call. siginfo->si_call_addr |
99 | rollback the register state to just before the system call | 99 | will show the address of the system call instruction, and |
100 | entry such that a signal handler in the task will be able to | 100 | siginfo->si_syscall and siginfo->si_arch will indicate which |
101 | inspect the ucontext_t->uc_mcontext registers and emulate | 101 | syscall was attempted. The program counter will be as though |
102 | system call success or failure upon return from the signal | 102 | the syscall happened (i.e. it will not point to the syscall |
103 | handler. | 103 | instruction). The return value register will contain an arch- |
104 | dependent value -- if resuming execution, set it to something | ||
105 | sensible. (The architecture dependency is because replacing | ||
106 | it with -ENOSYS could overwrite some useful information.) | ||
104 | 107 | ||
105 | The SECCOMP_RET_DATA portion of the return value will be passed | 108 | The SECCOMP_RET_DATA portion of the return value will be passed |
106 | as si_errno. | 109 | as si_errno. |
@@ -123,6 +126,18 @@ SECCOMP_RET_TRACE: | |||
123 | the BPF program return value will be available to the tracer | 126 | the BPF program return value will be available to the tracer |
124 | via PTRACE_GETEVENTMSG. | 127 | via PTRACE_GETEVENTMSG. |
125 | 128 | ||
129 | The tracer can skip the system call by changing the syscall number | ||
130 | to -1. Alternatively, the tracer can change the system call | ||
131 | requested by changing the system call to a valid syscall number. If | ||
132 | the tracer asks to skip the system call, then the system call will | ||
133 | appear to return the value that the tracer puts in the return value | ||
134 | register. | ||
135 | |||
136 | The seccomp check will not be run again after the tracer is | ||
137 | notified. (This means that seccomp-based sandboxes MUST NOT | ||
138 | allow use of ptrace, even of other sandboxed processes, without | ||
139 | extreme care; ptracers can use this mechanism to escape.) | ||
140 | |||
126 | SECCOMP_RET_ALLOW: | 141 | SECCOMP_RET_ALLOW: |
127 | Results in the system call being executed. | 142 | Results in the system call being executed. |
128 | 143 | ||
@@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to | |||
161 | support seccomp filter with minor fixup: SIGSYS support and seccomp return | 176 | support seccomp filter with minor fixup: SIGSYS support and seccomp return |
162 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER | 177 | value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER |
163 | to its arch-specific Kconfig. | 178 | to its arch-specific Kconfig. |
179 | |||
180 | |||
181 | |||
182 | Caveats | ||
183 | ------- | ||
184 | |||
185 | The vDSO can cause some system calls to run entirely in userspace, | ||
186 | leading to surprises when you run programs on different machines that | ||
187 | fall back to real syscalls. To minimize these surprises on x86, make | ||
188 | sure you test with | ||
189 | /sys/devices/system/clocksource/clocksource0/current_clocksource set to | ||
190 | something like acpi_pm. | ||
191 | |||
192 | On x86-64, vsyscall emulation is enabled by default. (vsyscalls are | ||
193 | legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: | ||
194 | |||
195 | - A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to | ||
196 | the vsyscall entry for the given call and not the address after the | ||
197 | 'syscall' instruction. Any code which wants to restart the call | ||
198 | should be aware that (a) a ret instruction has been emulated and (b) | ||
199 | trying to resume the syscall will again trigger the standard vsyscall | ||
200 | emulation security checks, making resuming the syscall mostly | ||
201 | pointless. | ||
202 | |||
203 | - A return value of SECCOMP_RET_TRACE will signal the tracer as usual, | ||
204 | but the syscall may not be changed to another system call using the | ||
205 | orig_rax register. It may only be changed to -1 in order to skip the | |
206 | currently emulated call. Any other change MAY terminate the process. | ||
207 | The rip value seen by the tracer will be the syscall entry address; | ||
208 | this is different from normal behavior. The tracer MUST NOT modify | ||
209 | rip or rsp. (Do not rely on other changes terminating the process. | ||
210 | They might work. For example, on some kernels, choosing a syscall | ||
211 | that only exists in future kernels will be correctly emulated (by | ||
212 | returning -ENOSYS).) | |
213 | |||
214 | To detect this quirky behavior, check for (addr & ~0x0C00) == | |
215 | 0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For | ||
216 | SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other | ||
217 | condition: future kernels may improve vsyscall emulation and current | ||
218 | kernels in vsyscall=native mode will behave differently, but the | ||
219 | instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these | ||
220 | cases. | ||
221 | |||
222 | Note that modern systems are unlikely to use vsyscalls at all -- they | ||
223 | are a legacy feature and they are considerably slower than standard | ||
224 | syscalls. New code will use the vDSO, and vDSO-issued system calls | ||
225 | are indistinguishable from normal system calls. | ||
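
The vsyscall detection check described in the Caveats text can be sketched in C as follows. This is a minimal illustration, not code from the patch: the helper names `is_vsyscall_entry` and `vsyscall_check_examples` and the macro names are hypothetical, and the constants are the ones quoted in the documentation. Note that in C, `==` binds more tightly than `&`, so the mask must be parenthesized explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Base address of the x86-64 vsyscall page, as quoted in the text. */
#define VSYSCALL_BASE      0xFFFFFFFFFF600000ULL
/* Bits that select one of the four entry slots (0x000, 0x400, 0x800, 0xC00). */
#define VSYSCALL_SLOT_MASK 0x0C00ULL

/*
 * Hypothetical helper: returns true if addr is one of the four vsyscall
 * entry addresses. For SECCOMP_RET_TRACE, pass the rip seen by the tracer;
 * for SECCOMP_RET_TRAP, pass siginfo->si_call_addr cast to uint64_t.
 */
static bool is_vsyscall_entry(uint64_t addr)
{
    /* Parentheses are required: "addr & ~mask == base" would compare first. */
    return (addr & ~VSYSCALL_SLOT_MASK) == VSYSCALL_BASE;
}

/* Small usage example exercising the four entry slots and two non-matches. */
static void vsyscall_check_examples(void)
{
    assert(is_vsyscall_entry(0xFFFFFFFFFF600000ULL));
    assert(is_vsyscall_entry(0xFFFFFFFFFF600400ULL));
    assert(is_vsyscall_entry(0xFFFFFFFFFF600800ULL));
    assert(is_vsyscall_entry(0xFFFFFFFFFF600C00ULL));
    assert(!is_vsyscall_entry(0xFFFFFFFFFF600001ULL)); /* not slot-aligned */
    assert(!is_vsyscall_entry(0x0000000000400000ULL)); /* ordinary user address */
}
```

As the text advises, a tracer or SIGSYS handler would test only this one condition and not rely on any other property of the emulated call.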