| author | Will Drewry <wad@chromium.org> | 2011-06-27 12:11:13 -0400 |
|---|---|---|
| committer | Leann Ogasawara <leann.ogasawara@canonical.com> | 2011-08-30 17:33:51 -0400 |
| commit | ba94ba4d8fc971b1b6a607bbb6885da79319d65a (patch) | |
| tree | a4f533c2caee3388d5e85472828721893f3fbe17 /Documentation | |
| parent | 2a4a5645a73cbcfc1d19673edd0200e24e350194 (diff) | |
UBUNTU: SAUCE: seccomp_filter: Document what seccomp_filter is and how it works.
Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
implemented presently, and what it may be used for. In addition,
the limitations and caveats of the proposed implementation are
included.
v10: fix to reflect mode==13 now.
v9: rebase on to bccaeafd7c117acee36e90d37c7e05c19be9e7bf
v8: -
v7: Add a caveat around fork behavior and execve
v6: -
v5: -
v4: rewording (courtesy kees.cook@canonical.com)
reflect support for event ids
add a small section on adding per-arch support
v3: a little more cleanup
v2: moved to prctl/
updated for the v2 syntax.
adds a note about compat behavior
Signed-off-by: Will Drewry <wad@chromium.org>
BUG=chromium-os:14496
TEST=I can readz.
Change-Id: I10945ea369757756b08834650e59d148b3e08aa2
Reviewed-on: http://gerrit.chromium.org/gerrit/3243
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
Tested-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 189 |
1 files changed, 189 insertions, 0 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 00000000000..5afb4787e6a
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
| @@ -0,0 +1,189 @@ | |||
| 1 | Seccomp filtering | ||
| 2 | ================= | ||
| 3 | |||
| 4 | Introduction | ||
| 5 | ------------ | ||
| 6 | |||
| 7 | A large number of system calls are exposed to every userland process | ||
| 8 | with many of them going unused for the entire lifetime of the process. | ||
| 9 | As system calls change and mature, bugs are found and eradicated. A | ||
| 10 | certain subset of userland applications benefits from having a reduced set | |
| 11 | of available system calls. The resulting set reduces the total kernel | ||
| 12 | surface exposed to the application. System call filtering is meant for | ||
| 13 | use with those applications. | ||
| 14 | |||
| 15 | The implementation currently leverages both the existing seccomp | ||
| 16 | infrastructure and the kernel tracing infrastructure. By centralizing | ||
| 17 | hooks for attack surface reduction in seccomp, it is possible to ensure | |
| 18 | attention to security that is less relevant in normal ftrace scenarios, | ||
| 19 | such as time-of-check, time-of-use attacks. However, ftrace provides a | ||
| 20 | rich, human-friendly environment for interfacing with system call | ||
| 21 | specific arguments. (As such, this requires FTRACE_SYSCALLS for any | ||
| 22 | introspective filtering support.) | ||
| 23 | |||
| 24 | |||
| 25 | What it isn't | ||
| 26 | ------------- | ||
| 27 | |||
| 28 | System call filtering isn't a sandbox. It provides a clearly defined | ||
| 29 | mechanism for minimizing the exposed kernel surface. Beyond that, | ||
| 30 | policy for logical behavior and information flow should be managed with | ||
| 31 | a combination of other system hardening techniques and, potentially, an | |
| 32 | LSM of your choosing. Expressive, dynamic filters based on the ftrace | |
| 33 | filter engine provide further options down this path (avoiding | ||
| 34 | pathological sizes or selecting which of the multiplexed system calls in | ||
| 35 | socketcall() is allowed, for instance) which could be construed, | ||
| 36 | incorrectly, as a more complete sandboxing solution. | ||
| 37 | |||
| 38 | |||
| 39 | Usage | ||
| 40 | ----- | ||
| 41 | |||
| 42 | An additional seccomp mode is exposed through mode '13'. | ||
| 43 | This mode depends on CONFIG_SECCOMP_FILTER. By default, it provides | ||
| 44 | only the most trivial filter support: "1" or cleared. However, if | |
| 45 | CONFIG_FTRACE_SYSCALLS is enabled, the ftrace filter engine may be used | ||
| 46 | for more expressive filters. | ||
| 47 | |||
| 48 | A collection of filters may be supplied via prctl, and the current set | ||
| 49 | of filters is exposed in /proc/<pid>/seccomp_filter. | ||
| 50 | |||
| 51 | Interacting with seccomp filters can be done through three new prctl calls | ||
| 52 | and one existing one. | ||
| 53 | |||
| 54 | PR_SET_SECCOMP: | ||
| 55 | A pre-existing option for enabling strict seccomp mode (1) or | ||
| 56 | filtering seccomp (13). | ||
| 57 | |||
| 58 | Usage: | ||
| 59 | prctl(PR_SET_SECCOMP, 1); /* strict */ | ||
| 60 | prctl(PR_SET_SECCOMP, 13); /* filters */ | ||
| 61 | |||
| 62 | PR_SET_SECCOMP_FILTER: | ||
| 63 | Allows the specification of a new filter for a given system | ||
| 64 | call, by number, and filter string. By default, the filter | ||
| 65 | string may only be "1". However, if CONFIG_FTRACE_SYSCALLS is | ||
| 66 | supported, the filter string may make use of the ftrace | ||
| 67 | filtering language's awareness of system call arguments. | ||
| 68 | |||
| 69 | In addition, the event id for the system call entry may be | ||
| 70 | specified in lieu of the system call number itself, as | ||
| 71 | determined by the 'type' argument. This allows for the future | ||
| 72 | addition of seccomp-based filtering on other registered, | ||
| 73 | relevant ftrace events. | ||
| 74 | |||
| 75 | All calls to PR_SET_SECCOMP_FILTER for a given system | ||
| 76 | call will append the supplied string to any existing filters. | ||
| 77 | Filter construction looks as follows: | ||
| 78 | (Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2 | ||
| 79 | ... + "fd != 2" => (fd == 1 || fd == 2) && fd != 2 | ||
| 80 | ... + "size < 100" => | ||
| 81 | ((fd == 1 || fd == 2) && fd != 2) && size < 100 | ||
| 82 | If there is no filter and the seccomp mode has already | ||
| 83 | transitioned to filtering, additions cannot be made. Filters | ||
| 84 | may only be added that reduce the available kernel surface. | ||
| 85 | |||
| 86 | Usage (per the construction example above): | ||
| 87 | unsigned long type = PR_SECCOMP_FILTER_SYSCALL; | ||
| 88 | prctl(PR_SET_SECCOMP_FILTER, type, __NR_write, | ||
| 89 | "fd == 1 || fd == 2"); | ||
| 90 | prctl(PR_SET_SECCOMP_FILTER, type, __NR_write, | ||
| 91 | "fd != 2"); | ||
| 92 | prctl(PR_SET_SECCOMP_FILTER, type, __NR_write, | ||
| 93 | "size < 100"); | ||
| 94 | |||
| 95 | The 'type' argument may be one of PR_SECCOMP_FILTER_SYSCALL or | ||
| 96 | PR_SECCOMP_FILTER_EVENT. | ||
| 97 | |||
| 98 | PR_CLEAR_SECCOMP_FILTER: | ||
| 99 | Removes all filter entries for a given system call number or | ||
| 100 | event id. When called prior to entering seccomp filtering mode, | ||
| 101 | it allows for new filters to be applied to the same system call. | ||
| 102 | After transition, however, it completely drops access to the | ||
| 103 | call. | ||
| 104 | |||
| 105 | Usage: | ||
| 106 | prctl(PR_CLEAR_SECCOMP_FILTER, | ||
| 107 | PR_SECCOMP_FILTER_SYSCALL, __NR_open); | ||
| 108 | |||
| 109 | PR_GET_SECCOMP_FILTER: | ||
| 110 | Returns the aggregated filter string for a system call into a | ||
| 111 | user-supplied buffer of a given length. | ||
| 112 | |||
| 113 | Usage: | ||
| 114 | prctl(PR_GET_SECCOMP_FILTER, | ||
| 115 | PR_SECCOMP_FILTER_SYSCALL, __NR_write, buf, | ||
| 116 | sizeof(buf)); | ||
| 117 | |||
| 118 | All of the above calls return 0 on success and non-zero on error. If | ||
| 119 | CONFIG_FTRACE_SYSCALLS is not supported and a rich filter was specified, | |
| 120 | the caller may check errno for ENOSYS. The same is true if | |
| 121 | specifying a filter by the event id fails to discover any relevant | |
| 122 | event entries. | ||
| 123 | |||
| 124 | |||
| 125 | Example | ||
| 126 | ------- | ||
| 127 | |||
| 128 | Assume a process would like to cleanly read and write to stdin/out/err | ||
| 129 | as well as access its filters after seccomp enforcement begins. This | ||
| 130 | may be done as follows: | ||
| 131 | |||
| 132 | int filter_syscall(int nr, const char *buf) { | |
| 133 | return prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL, | ||
| 134 | nr, buf); | ||
| 135 | } | ||
| 136 | |||
| 137 | filter_syscall(__NR_read, "fd == 0"); | ||
| 138 | filter_syscall(__NR_write, "fd == 1 || fd == 2"); | |
| 139 | filter_syscall(__NR_exit, "1"); | ||
| 140 | filter_syscall(__NR_prctl, "1"); | ||
| 141 | prctl(PR_SET_SECCOMP, 13); | ||
| 142 | |||
| 143 | /* Do stuff with fdset . . .*/ | ||
| 144 | |||
| 145 | /* Drop read access and keep only write access to fd 1. */ | ||
| 146 | prctl(PR_CLEAR_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL, __NR_read); | ||
| 147 | filter_syscall(__NR_write, "fd != 2"); | ||
| 148 | |||
| 149 | /* Perform any final processing . . . */ | ||
| 150 | syscall(__NR_exit, 0); | ||
| 151 | |||
| 152 | |||
| 153 | Caveats | ||
| 154 | ------- | ||
| 155 | |||
| 156 | - Avoid using a filter of "0" to disable a filter. Always favor calling | ||
| 157 | prctl(PR_CLEAR_SECCOMP_FILTER, ...). Otherwise the behavior may vary | ||
| 158 | depending on whether CONFIG_FTRACE_SYSCALLS support exists -- though an | |
| 159 | error will be returned if the support is missing. | ||
| 160 | |||
| 161 | - execve is always blocked. seccomp filters may not cross that boundary. | ||
| 162 | |||
| 163 | - Filters are inherited across fork/clone only when they are already | |
| 164 | active (i.e., PR_SET_SECCOMP has been set to 13), not prior to use. | |
| 165 | This stops the parent process from adding filters that may undermine | ||
| 166 | the child process security or create unexpected behavior after an | ||
| 167 | execve. | ||
| 168 | |||
| 169 | - Some platforms support a 32-bit userspace with 64-bit kernels. In | ||
| 170 | these cases (CONFIG_COMPAT), system call numbers may not match across | ||
| 171 | 64-bit and 32-bit system calls. When the first PR_SET_SECCOMP_FILTER | |
| 172 | is called, the in-memory filter state is annotated with whether the | |
| 173 | call has been made via the compat interface. All subsequent calls will | ||
| 174 | be checked for compat call mismatch. In the long run, it may make sense | ||
| 175 | to store compat and non-compat filters separately, but that is not | ||
| 176 | supported at present. Once one type of system call interface has been | ||
| 177 | used, it must continue to be used. | |
| 178 | |||
| 179 | |||
| 180 | Adding architecture support | ||
| 181 | --------------------------- | |
| 182 | |||
| 183 | Any platform with seccomp support should be able to support the bare | ||
| 184 | minimum of seccomp filter features. However, since seccomp_filter | ||
| 185 | requires that execve be blocked, it expects the architecture to expose a | ||
| 186 | __NR_seccomp_execve define that maps to the execve system call number. | ||
| 187 | On platforms where CONFIG_COMPAT applies, __NR_seccomp_execve_32 must | ||
| 188 | also be provided. Once those macros exist, "select HAVE_SECCOMP_FILTER" | ||
| 189 | support may be added to the architecture's Kconfig. | |
