author: Tycho Andersen <tycho@tycho.ws> 2018-12-09 13:24:13 -0500
committer: Kees Cook <keescook@chromium.org> 2018-12-11 19:28:41 -0500
commit: 6a21cc50f0c7f87dae5259f6cfefe024412313f6
tree: 0312987667dc2b05e9f9cc33586fac101b542a9a /Documentation/userspace-api
parent: a5662e4d81c4d4b08140c625d0f3c50b15786252
seccomp: add a return code to trap to userspace
This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host.

As another example, containers cannot mount() in general since various filesystems assume a trusted image. However, if an orchestrator knows that e.g. a particular block device has not been exposed to a container for writing, it may want to allow the container to mount that block device (that is, handle the mount for it).

This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since upstart itself uses ptrace to monitor services while starting.

The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
CC: Christian Brauner <christian@brauner.io>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Diffstat (limited to 'Documentation/userspace-api')
-rw-r--r-- Documentation/userspace-api/seccomp_filter.rst | 84
1 files changed, 84 insertions, 0 deletions
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..b1b846d8a094 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@ In precedence order, they are:
    Results in the lower 16-bits of the return value being passed
    to userland as the errno without executing the system call.

``SECCOMP_RET_USER_NOTIF``:
    Results in a ``struct seccomp_notif`` message being sent on the
    userspace notification fd if one is attached; if none is attached,
    the syscall returns ``-ENOSYS``. See below for a discussion of how
    to handle user notifications.

``SECCOMP_RET_TRACE``:
    When returned, this value will cause the kernel to attempt to
    notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,85 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.

Userspace Notification
======================

The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
particular syscall to userspace to be handled. This may be useful for
applications like container managers, which wish to intercept particular
syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.

To acquire a notification fd, pass the ``SECCOMP_FILTER_FLAG_NEW_LISTENER``
flag to the ``seccomp()`` syscall:

.. code-block:: c

    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

On success, this returns a listener fd for the filter, which can then be
passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to
a particular filter, and not a particular task. So if this task then forks,
notifications from both tasks will appear on the same filter fd. Reads and
writes to/from a filter fd are also synchronized, so a filter fd can safely
have many readers.

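As a rough illustration of the call above, the following sketch builds a
four-instruction BPF program that returns ``SECCOMP_RET_USER_NOTIF`` for one
syscall number and allows everything else, then installs it with the listener
flag. The helper names ``build_notify_filter()`` and ``install_notify_filter()``
are hypothetical and not part of any kernel or libc API. The filter
deliberately omits the architecture check that a production filter should
perform.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* libc may not provide a seccomp() wrapper, so invoke the syscall directly. */
static int seccomp(unsigned int op, unsigned int flags, void *args)
{
	return (int)syscall(__NR_seccomp, op, flags, args);
}

/*
 * Hypothetical helper: fill insns[4] with a program that notifies
 * userspace for syscall number nr and allows every other syscall.
 * NOTE: no seccomp_data.arch check, for brevity only.
 */
static void build_notify_filter(struct sock_filter insns[4], int nr)
{
	struct sock_filter prog[4] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
		/* If it equals nr, fall through; otherwise skip one instruction. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	memcpy(insns, prog, sizeof(prog));
}

/* Hypothetical helper: returns a listener fd on success, -1 on error. */
int install_notify_filter(int nr)
{
	struct sock_filter insns[4];
	struct sock_fprog prog = { .len = 4, .filter = insns };

	build_notify_filter(insns, nr);

	/* Unprivileged tasks must set no_new_privs before installing a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;

	return seccomp(SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

A task would typically call ``install_notify_filter()`` once and then hand the
returned fd to whichever process is going to service the notifications.
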
The interface for a seccomp notification fd consists of three structures:

.. code-block:: c

    struct seccomp_notif_sizes {
        __u16 seccomp_notif;
        __u16 seccomp_notif_resp;
        __u16 seccomp_data;
    };

    struct seccomp_notif {
        __u64 id;
        __u32 pid;
        __u32 flags;
        struct seccomp_data data;
    };

    struct seccomp_notif_resp {
        __u64 id;
        __s64 val;
        __s32 error;
        __u32 flags;
    };

The ``struct seccomp_notif_sizes`` structure can be used to determine the size
of the various structures used in seccomp notifications. The size of ``struct
seccomp_data`` may change in the future, so code should use:

.. code-block:: c

    struct seccomp_notif_sizes sizes;
    seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);

to determine the size of the various structures to allocate. See
``samples/seccomp/user-trap.c`` for an example.

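The size query can be turned into an allocation step along the following
lines. This is only a sketch: ``struct notif_bufs`` and ``alloc_notif_bufs()``
are hypothetical names introduced here, and taking the larger of the
kernel-reported size and the compile-time ``sizeof`` is a defensive assumption
rather than something the interface requires.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the buffers a listener needs. */
struct notif_bufs {
	struct seccomp_notif_sizes sizes;
	struct seccomp_notif *req;
	struct seccomp_notif_resp *resp;
};

/* Hypothetical helper: returns 0 on success, -1 on error. */
static int alloc_notif_bufs(struct notif_bufs *b)
{
	size_t req_len, resp_len;

	memset(b, 0, sizeof(*b));

	/* Ask the running kernel how large its structures are. */
	if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &b->sizes) < 0)
		return -1;

	/* Never allocate less than the structures this code was built with. */
	req_len = b->sizes.seccomp_notif;
	if (req_len < sizeof(*b->req))
		req_len = sizeof(*b->req);
	resp_len = b->sizes.seccomp_notif_resp;
	if (resp_len < sizeof(*b->resp))
		resp_len = sizeof(*b->resp);

	/* calloc() hands back zeroed buffers. */
	b->req = calloc(1, req_len);
	b->resp = calloc(1, resp_len);
	if (!b->req || !b->resp) {
		free(b->req);
		free(b->resp);
		b->req = NULL;
		b->resp = NULL;
		return -1;
	}
	return 0;
}
```
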
Users can call ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` on a seccomp notification
fd (optionally after waiting with ``poll()`` for it to become readable) to
receive a ``struct seccomp_notif``, which contains four members: a
unique-per-filter ``id``, the ``pid`` of the task which triggered this request
(which may be 0 if the task is in a pid ns not visible from the listener's pid
namespace), a ``flags`` member which for now only has
``SECCOMP_NOTIF_FLAG_SIGNALED``, representing whether or not the notification
is a result of a non-fatal signal, and the ``data`` passed to seccomp.
Userspace can then make a decision based on this information about what to do,
and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
returned to the task that made the syscall. The ``id`` member of ``struct
seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.

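A single receive/respond round trip might look like the sketch below. The
policy in ``fill_response()`` (fail ``mount()`` with ``EPERM``, report success
with a return value of 0 for anything else) is purely hypothetical, as are the
function names. For brevity the sketch uses fixed-size stack structures rather
than buffers sized via ``SECCOMP_GET_NOTIF_SIZES``.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>

/* Hypothetical policy: build the response for one notification. */
static void fill_response(const struct seccomp_notif *req,
			  struct seccomp_notif_resp *resp)
{
	memset(resp, 0, sizeof(*resp));
	resp->id = req->id;		/* must echo the notification id */
	if (req->data.nr == __NR_mount)
		resp->error = -EPERM;	/* the task sees mount() fail with EPERM */
	else
		resp->val = 0;		/* the task sees a return value of 0 */
}

/* Hypothetical helper: handle one notification; 0 on success, -1 on error. */
static int handle_one(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	/* Start from a zeroed buffer; this is harmless and avoids stale data. */
	memset(&req, 0, sizeof(req));
	if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -1;

	fill_response(&req, &resp);

	if (ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
		return -1;
	return 0;
}
```

A listener would normally call ``handle_one()`` in a loop, possibly from
several threads, since reads and writes on the fd are synchronized.
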
It is worth noting that ``struct seccomp_data`` contains the values of register
arguments to the syscall, but does not contain pointers to memory. The task's
memory is accessible to suitably privileged tracers via ``ptrace()`` or
``/proc/pid/mem``. However, care should be taken to avoid the TOCTOU mentioned
above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.

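One way to follow this advice is to copy the argument out of the tracee once
and apply policy only to the private copy, roughly as sketched below. The
helper names, the fixed ``/mnt/`` prefix policy, and the use of ``pread()`` on
``/proc/<pid>/mem`` are illustrative assumptions. A real handler would take
``pid`` from ``struct seccomp_notif`` and the address from ``data.args[]``.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from address addr in task pid
 * into buf. Returns the number of bytes copied, or -1 on error.
 */
static ssize_t copy_from_task(pid_t pid, unsigned long long addr,
			      void *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	n = pread(fd, buf, len, (off_t)addr);
	close(fd);
	return n;
}

/*
 * Hypothetical policy check. It only ever inspects the tracer's private
 * copy, so later edits by the tracee cannot change the outcome.
 */
static int path_is_allowed(const char *copy, size_t len)
{
	static const char prefix[] = "/mnt/";
	size_t plen = sizeof(prefix) - 1;

	return len >= plen && memcmp(copy, prefix, plen) == 0;
}
```

The important property is ordering: ``copy_from_task()`` runs to completion
first, and the decision is then made on ``copy`` alone, never by re-reading the
tracee.
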
Sysctls
=======
