author | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2016-09-21 12:14:35 -0400 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2016-10-24 06:12:35 -0400 |
commit | 12983bcd822f5c13a0f350cc97bc9fb781cae944 (patch) | |
tree | a70f0bde41c34eb0a4705cedd49d5474ed66fdfd /Documentation/adding-syscals.txt | |
parent | 3a61baddcec3e7873b49deb5804d3d6f39b92def (diff) |
Documentation/adding-syscalls.txt: convert it to ReST markup
Convert adding-syscalls.txt to ReST markup and add it to the
development-process book:
- add extra lines so that Sphinx parses paragraphs correctly;
- use quote blocks for examples;
- use monospaced font for dirs, function calls, etc;
- mark man pages using the right markup;
- add cross-reference to SubmittingPatches.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Diffstat (limited to 'Documentation/adding-syscals.txt')
-rw-r--r-- | Documentation/adding-syscals.txt | 542 |
1 files changed, 542 insertions, 0 deletions
diff --git a/Documentation/adding-syscals.txt b/Documentation/adding-syscals.txt
new file mode 100644
index 000000000000..f5b5b1aa51b3
--- /dev/null
+++ b/Documentation/adding-syscals.txt
@@ -0,0 +1,542 @@
1 | Adding a New System Call | ||
2 | ======================== | ||
3 | |||
4 | This document describes what's involved in adding a new system call to the | ||
5 | Linux kernel, over and above the normal submission advice in | ||
6 | :ref:`Documentation/SubmittingPatches <submittingpatches>`. | ||
7 | |||
8 | |||
9 | System Call Alternatives | ||
10 | ------------------------ | ||
11 | |||
12 | The first thing to consider when adding a new system call is whether one of | ||
13 | the alternatives might be suitable instead. Although system calls are the | ||
14 | most traditional and most obvious interaction points between userspace and the | ||
15 | kernel, there are other possibilities -- choose what fits best for your | ||
16 | interface. | ||
17 | |||
18 | - If the operations involved can be made to look like a filesystem-like | ||
19 | object, it may make more sense to create a new filesystem or device. This | ||
20 | also makes it easier to encapsulate the new functionality in a kernel module | ||
21 | rather than requiring it to be built into the main kernel. | ||
22 | |||
23 | - If the new functionality involves operations where the kernel notifies | ||
24 | userspace that something has happened, then returning a new file | ||
25 | descriptor for the relevant object allows userspace to use | ||
26 | ``poll``/``select``/``epoll`` to receive that notification. | ||
27 | - However, operations that don't map to | ||
28 | :manpage:`read(2)`/:manpage:`write(2)`-like operations | ||
29 | have to be implemented as :manpage:`ioctl(2)` requests, which can lead | ||
30 | to a somewhat opaque API. | ||
31 | |||
32 | - If you're just exposing runtime system information, a new node in sysfs | ||
33 | (see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may | ||
34 | be more appropriate. However, access to these mechanisms requires that the | ||
35 | relevant filesystem is mounted, which might not always be the case (e.g. | ||
36 | in a namespaced/sandboxed/chrooted environment). Avoid adding any API to | ||
37 | debugfs, as this is not considered a 'production' interface to userspace. | ||
38 | - If the operation is specific to a particular file or file descriptor, then | ||
39 | an additional :manpage:`fcntl(2)` command option may be more appropriate. However, | ||
40 | :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so | ||
41 | this option is best for when the new function is closely analogous to | ||
42 | existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple | ||
43 | (for example, getting/setting a simple flag related to a file descriptor). | ||
44 | - If the operation is specific to a particular task or process, then an | ||
45 | additional :manpage:`prctl(2)` command option may be more appropriate. As | ||
46 | with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so | ||
47 | is best reserved for near-analogs of existing ``prctl()`` commands or | ||
48 | getting/setting a simple flag related to a process. | ||
49 | |||
50 | |||
51 | Designing the API: Planning for Extension | ||
52 | ----------------------------------------- | ||
53 | |||
54 | A new system call forms part of the API of the kernel, and has to be supported | ||
55 | indefinitely. As such, it's a very good idea to explicitly discuss the | ||
56 | interface on the kernel mailing list, and it's important to plan for future | ||
57 | extensions of the interface. | ||
58 | |||
59 | (The syscall table is littered with historical examples where this wasn't done, | ||
60 | together with the corresponding follow-up system calls -- | ||
61 | ``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``, | ||
62 | ``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so | ||
63 | learn from the history of the kernel and plan for extensions from the start.) | ||
64 | |||
65 | For simpler system calls that only take a couple of arguments, the preferred | ||
66 | way to allow for future extensibility is to include a flags argument to the | ||
67 | system call. To make sure that userspace programs can safely use flags | ||
68 | between kernel versions, check whether the flags value holds any unknown | ||
69 | flags, and reject the system call (with ``EINVAL``) if it does:: | ||
70 | |||
71 | if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) | ||
72 | return -EINVAL; | ||
73 | |||
74 | (If no flags values are used yet, check that the flags argument is zero.) | ||
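
As a minimal userspace sketch of the check above, the validation can be wrapped in a helper that is easy to test. The flag values and the name `thing_check_flags()` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for an imaginary xyzzy() call. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/* Returns 0 if every set bit is a known flag, -EINVAL otherwise. */
static long thing_check_flags(unsigned int flags)
{
	if (flags & ~THING_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Because unknown bits are rejected today, a later kernel can give one of them a meaning without silently changing the behaviour of programs that happened to pass garbage in that bit.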
75 | |||
76 | For more sophisticated system calls that involve a larger number of arguments, | ||
77 | it's preferred to encapsulate the majority of the arguments into a structure | ||
78 | that is passed in by pointer. Such a structure can cope with future extension | ||
79 | by including a size argument in the structure:: | ||
80 | |||
81 | struct xyzzy_params { | ||
82 | u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */ | ||
83 | u32 param_1; | ||
84 | u64 param_2; | ||
85 | u64 param_3; | ||
86 | }; | ||
87 | |||
88 | As long as any subsequently added field, say ``param_4``, is designed so that a | ||
89 | zero value gives the previous behaviour, then this allows both directions of | ||
90 | version mismatch: | ||
91 | |||
92 | - To cope with a later userspace program calling an older kernel, the kernel | ||
93 | code should check that any memory beyond the size of the structure that it | ||
94 | expects is zero (effectively checking that ``param_4 == 0``). | ||
95 | - To cope with an older userspace program calling a newer kernel, the kernel | ||
96 | code can zero-extend a smaller instance of the structure (effectively | ||
97 | setting ``param_4 = 0``). | ||
98 | |||
99 | See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in | ||
100 | ``kernel/events/core.c``) for an example of this approach. | ||
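
The two mismatch cases can be modelled in userspace as follows. This is only a sketch of the idea, not the code in ``perf_copy_attr()``: the helper name `xyzzy_copy_params()` is hypothetical, `memcpy()` stands in for the kernel's user-copy helpers, and the choice of `-E2BIG` for non-zero trailing bytes is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The kernel's current view of the structure (hypothetical layout). */
struct xyzzy_params {
	uint32_t size;		/* set by the caller to its own sizeof() */
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/*
 * Userspace model of the copy-in step. 'src' stands in for the user
 * buffer; its first field says how large the caller believes the
 * structure to be.
 */
static int xyzzy_copy_params(struct xyzzy_params *dst, const void *src)
{
	const unsigned char *bytes = src;
	size_t ksize = sizeof(*dst);
	uint32_t usize;
	size_t i;

	assert(dst != NULL && src != NULL);
	memcpy(&usize, bytes, sizeof(usize));
	if (usize < sizeof(usize))
		return -EINVAL;

	/* Newer userspace: bytes the kernel does not know about must be zero. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older userspace: fields it did not supply read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, bytes, usize < ksize ? usize : ksize);
	return 0;
}
```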
101 | |||
102 | |||
103 | Designing the API: Other Considerations | ||
104 | --------------------------------------- | ||
105 | |||
106 | If your new system call allows userspace to refer to a kernel object, it | ||
107 | should use a file descriptor as the handle for that object -- don't invent a | ||
108 | new type of userspace object handle when the kernel already has mechanisms and | ||
109 | well-defined semantics for using file descriptors. | ||
110 | |||
111 | If your new :manpage:`xyzzy(2)` system call does return a new file descriptor, | ||
112 | then the flags argument should include a value that is equivalent to setting | ||
113 | ``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close | ||
114 | the timing window between ``xyzzy()`` and calling | ||
115 | ``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and | ||
116 | ``execve()`` in another thread could leak a descriptor to | ||
117 | the exec'ed program. (However, resist the temptation to re-use the actual value | ||
118 | of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a | ||
119 | numbering space of ``O_*`` flags that is fairly full.) | ||
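
The timing window can be seen from userspace with an existing call that already takes such a flag. The sketch below uses :manpage:`open(2)` with ``O_CLOEXEC`` as a stand-in for ``xyzzy()``; the helper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;
	return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* One step: the descriptor is close-on-exec from the moment it exists. */
static int open_cloexec_atomic(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Two steps: another thread may fork() and execve() between them. */
static int open_cloexec_racy(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	/* Window: the descriptor can leak into an exec'ed program here. */
	if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Both helpers end with the flag set; only the first does so without a window in which the descriptor can be inherited.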
120 | |||
121 | If your system call returns a new file descriptor, you should also consider | ||
122 | what it means to use the :manpage:`poll(2)` family of system calls on that file | ||
123 | descriptor. Making a file descriptor ready for reading or writing is the | ||
124 | normal way for the kernel to indicate to userspace that an event has | ||
125 | occurred on the corresponding kernel object. | ||
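
An existing descriptor type shows what readiness-as-notification looks like to a caller. The sketch below uses :manpage:`eventfd(2)` purely as an example of a pollable kernel object; the helper names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error. */
static int fd_readable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, 0);	/* timeout 0: do not block */

	if (n < 0)
		return -1;
	return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Post one event on an eventfd; returns 0 on success, -1 on error. */
static int post_event(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

A descriptor returned by a new system call would ideally behave the same way: not readable while nothing has happened, readable once the kernel object has an event to report.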
126 | |||
127 | If your new :manpage:`xyzzy(2)` system call involves a filename argument:: | ||
128 | |||
129 | int sys_xyzzy(const char __user *path, ..., unsigned int flags); | ||
130 | |||
131 | you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate:: | ||
132 | |||
133 | int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags); | ||
134 | |||
135 | This allows more flexibility for how userspace specifies the file in question; | ||
136 | in particular it allows userspace to request the functionality for an | ||
137 | already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively | ||
138 | giving an :manpage:`fxyzzy(3)` operation for free:: | ||
139 | |||
140 | - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...) | ||
141 | - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...) | ||
142 | |||
143 | (For more details on the rationale of the \*at() calls, see the | ||
144 | :manpage:`openat(2)` man page; for an example of AT_EMPTY_PATH, see the | ||
145 | :manpage:`fstatat(2)` man page.) | ||
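
The two equivalences can be checked with :manpage:`fstatat(2)`, which already follows this pattern. The sketch below is a userspace illustration; `same_file_both_ways()` is a hypothetical helper, and the fallback definition of ``AT_EMPTY_PATH`` assumes the Linux UAPI value.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_EMPTY_PATH */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux UAPI value, used if libc hides it */
#endif

/*
 * Stat 'path' two ways: by name relative to the current directory, and via
 * an already-open descriptor with an empty path. Returns 1 if both name the
 * same inode, 0 if they differ, -1 on error.
 */
static int same_file_both_ways(const char *path)
{
	struct stat by_name, by_fd;
	int fd, ret;

	assert(path != NULL);
	if (fstatat(AT_FDCWD, path, &by_name, 0) != 0)	/* like stat(path) */
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = fstatat(fd, "", &by_fd, AT_EMPTY_PATH);	/* like fstat(fd) */
	close(fd);
	if (ret != 0)
		return -1;

	return by_name.st_dev == by_fd.st_dev && by_name.st_ino == by_fd.st_ino;
}
```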
146 | |||
147 | If your new :manpage:`xyzzy(2)` system call involves a parameter describing an | ||
148 | offset within a file, make its type ``loff_t`` so that 64-bit offsets can be | ||
149 | supported even on 32-bit architectures. | ||
150 | |||
151 | If your new :manpage:`xyzzy(2)` system call involves privileged functionality, | ||
152 | it needs to be governed by the appropriate Linux capability bit (checked with | ||
153 | a call to ``capable()``), as described in the :manpage:`capabilities(7)` man | ||
154 | page. Choose an existing capability bit that governs related functionality, | ||
155 | but try to avoid combining lots of only vaguely related functions together | ||
156 | under the same bit, as this goes against capabilities' purpose of splitting | ||
157 | the power of root. In particular, avoid adding new uses of the already | ||
158 | overly-general ``CAP_SYS_ADMIN`` capability. | ||
159 | |||
160 | If your new :manpage:`xyzzy(2)` system call manipulates a process other than | ||
161 | the calling process, it should be restricted (using a call to | ||
162 | ``ptrace_may_access()``) so that only a calling process with the same | ||
163 | permissions as the target process, or with the necessary capabilities, can | ||
164 | manipulate the target process. | ||
165 | |||
166 | Finally, be aware that some non-x86 architectures have an easier time if | ||
167 | system call parameters that are explicitly 64-bit fall on odd-numbered | ||
168 | arguments (i.e. parameters 1, 3, 5), to allow use of contiguous pairs of 32-bit | |
169 | registers. (This concern does not apply if the arguments are part of a | ||
170 | structure that's passed in by pointer.) | ||
171 | |||
172 | |||
173 | Proposing the API | ||
174 | ----------------- | ||
175 | |||
176 | To make new system calls easy to review, it's best to divide up the patchset | ||
177 | into separate chunks. These should include at least the following items as | ||
178 | distinct commits (each of which is described further below): | ||
179 | |||
180 | - The core implementation of the system call, together with prototypes, | ||
181 | generic numbering, Kconfig changes and fallback stub implementation. | ||
182 | - Wiring up of the new system call for one particular architecture, usually | ||
183 | x86 (including all of x86_64, x86_32 and x32). | ||
184 | - A demonstration of the use of the new system call in userspace via a | ||
185 | selftest in ``tools/testing/selftests/``. | ||
186 | - A draft man-page for the new system call, either as plain text in the | ||
187 | cover letter, or as a patch to the (separate) man-pages repository. | ||
188 | |||
189 | New system call proposals, like any change to the kernel's API, should always | ||
190 | be cc'ed to linux-api@vger.kernel.org. | ||
191 | |||
192 | |||
193 | Generic System Call Implementation | ||
194 | ---------------------------------- | ||
195 | |||
196 | The main entry point for your new :manpage:`xyzzy(2)` system call will be called | ||
197 | ``sys_xyzzy()``, but you add this entry point with the appropriate | ||
198 | ``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the | ||
199 | number of arguments to the system call, and the macro takes the system call name | ||
200 | followed by the (type, name) pairs for the parameters as arguments. Using | ||
201 | this macro allows metadata about the new system call to be made available for | ||
202 | other tools. | ||
203 | |||
204 | The new entry point also needs a corresponding function prototype, in | ||
205 | ``include/linux/syscalls.h``, marked as asmlinkage to match the way that system | ||
206 | calls are invoked:: | ||
207 | |||
208 | asmlinkage long sys_xyzzy(...); | ||
209 | |||
210 | Some architectures (e.g. x86) have their own architecture-specific syscall | ||
211 | tables, but several other architectures share a generic syscall table. Add your | ||
212 | new system call to the generic list by adding an entry to the list in | ||
213 | ``include/uapi/asm-generic/unistd.h``:: | ||
214 | |||
215 | #define __NR_xyzzy 292 | ||
216 | __SYSCALL(__NR_xyzzy, sys_xyzzy) | ||
217 | |||
218 | Also update the ``__NR_syscalls`` count to reflect the additional system call, and | |
219 | note that if multiple new system calls are added in the same merge window, | ||
220 | your new syscall number may get adjusted to resolve conflicts. | ||
221 | |||
222 | The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each | ||
223 | system call, returning ``-ENOSYS``. Add your new system call here too:: | ||
224 | |||
225 | cond_syscall(sys_xyzzy); | ||
226 | |||
227 | Your new kernel functionality, and the system call that controls it, should | ||
228 | normally be optional, so add a ``CONFIG`` option (typically to | ||
229 | ``init/Kconfig``) for it. As usual for new ``CONFIG`` options: | ||
230 | |||
231 | - Include a description of the new functionality and system call controlled | ||
232 | by the option. | ||
233 | - Make the option depend on EXPERT if it should be hidden from normal users. | ||
234 | - Make any new source files implementing the function dependent on the CONFIG | ||
235 | option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.c``). | ||
236 | - Double check that the kernel still builds with the new CONFIG option turned | ||
237 | off. | ||
238 | |||
239 | To summarize, you need a commit that includes: | ||
240 | |||
241 | - ``CONFIG`` option for the new function, normally in ``init/Kconfig`` | ||
242 | - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point | ||
243 | - corresponding prototype in ``include/linux/syscalls.h`` | ||
244 | - generic table entry in ``include/uapi/asm-generic/unistd.h`` | ||
245 | - fallback stub in ``kernel/sys_ni.c`` | ||
246 | |||
247 | |||
248 | x86 System Call Implementation | ||
249 | ------------------------------ | ||
250 | |||
251 | To wire up your new system call for x86 platforms, you need to update the | ||
252 | master syscall tables. Assuming your new system call isn't special in some | ||
253 | way (see below), this involves a "common" entry (for x86_64 and x32) in | ||
254 | ``arch/x86/entry/syscalls/syscall_64.tbl``:: | |
255 | |||
256 | 333 common xyzzy sys_xyzzy | ||
257 | |||
258 | and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``:: | ||
259 | |||
260 | 380 i386 xyzzy sys_xyzzy | ||
261 | |||
262 | Again, these numbers are liable to be changed if there are conflicts in the | ||
263 | relevant merge window. | ||
264 | |||
265 | |||
266 | Compatibility System Calls (Generic) | ||
267 | ------------------------------------ | ||
268 | |||
269 | For most system calls the same 64-bit implementation can be invoked even when | ||
270 | the userspace program is itself 32-bit; even if the system call's parameters | ||
271 | include an explicit pointer, this is handled transparently. | ||
272 | |||
273 | However, there are a couple of situations where a compatibility layer is | ||
274 | needed to cope with size differences between 32-bit and 64-bit. | ||
275 | |||
276 | The first is if the 64-bit kernel also supports 32-bit userspace programs, and | ||
277 | so needs to parse areas of (``__user``) memory that could hold either 32-bit or | ||
278 | 64-bit values. In particular, this is needed whenever a system call argument | ||
279 | is: | ||
280 | |||
281 | - a pointer to a pointer | ||
282 | - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``) | ||
283 | - a pointer to a varying sized integral type (``time_t``, ``off_t``, | ||
284 | ``long``, ...) | ||
285 | - a pointer to a struct containing a varying sized integral type. | ||
286 | |||
287 | The second situation that requires a compatibility layer is if one of the | ||
288 | system call's arguments has a type that is explicitly 64-bit even on a 32-bit | ||
289 | architecture, for example ``loff_t`` or ``__u64``. In this case, a value that | ||
290 | arrives at a 64-bit kernel from a 32-bit application will be split into two | ||
291 | 32-bit values, which then need to be re-assembled in the compatibility layer. | ||
292 | |||
293 | (Note that a system call argument that's a pointer to an explicit 64-bit type | ||
294 | does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of | ||
295 | type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.) | ||
296 | |||
297 | The compatibility version of the system call is called ``compat_sys_xyzzy()``, | ||
298 | and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to | ||
299 | ``SYSCALL_DEFINEn()``. This version of the implementation runs as part of a 64-bit | |
300 | kernel, but expects to receive 32-bit parameter values and does whatever is | ||
301 | needed to deal with them. (Typically, the ``compat_sys_`` version converts the | ||
302 | values to 64-bit versions and either calls on to the ``sys_`` version, or both of | ||
303 | them call a common inner implementation function.) | ||
304 | |||
305 | The compat entry point also needs a corresponding function prototype, in | ||
306 | ``include/linux/compat.h``, marked as asmlinkage to match the way that system | ||
307 | calls are invoked:: | ||
308 | |||
309 | asmlinkage long compat_sys_xyzzy(...); | ||
310 | |||
311 | If the system call involves a structure that is laid out differently on 32-bit | ||
312 | and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h`` | |
313 | header file should also include a compat version of the structure (``struct | ||
314 | compat_xyzzy_args``) where each variable-size field has the appropriate | ||
315 | ``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The | ||
316 | ``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to | ||
317 | parse the arguments from a 32-bit invocation. | ||
318 | |||
319 | For example, if there are fields:: | ||
320 | |||
321 | struct xyzzy_args { | ||
322 | const char __user *ptr; | ||
323 | __kernel_long_t varying_val; | ||
324 | u64 fixed_val; | ||
325 | /* ... */ | ||
326 | }; | ||
327 | |||
328 | in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have:: | |
329 | |||
330 | struct compat_xyzzy_args { | ||
331 | compat_uptr_t ptr; | ||
332 | compat_long_t varying_val; | ||
333 | u64 fixed_val; | ||
334 | /* ... */ | ||
335 | }; | ||
336 | |||
337 | The generic system call list also needs adjusting to allow for the compat | ||
338 | version; the entry in ``include/uapi/asm-generic/unistd.h`` should use | ||
339 | ``__SC_COMP`` rather than ``__SYSCALL``:: | ||
340 | |||
341 | #define __NR_xyzzy 292 | ||
342 | __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy) | ||
343 | |||
344 | To summarize, you need: | ||
345 | |||
346 | - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point | ||
347 | - corresponding prototype in ``include/linux/compat.h`` | ||
348 | - (if needed) 32-bit mapping struct in ``include/linux/compat.h`` | ||
349 | - instance of ``__SC_COMP`` not ``__SYSCALL`` in | ||
350 | ``include/uapi/asm-generic/unistd.h`` | ||
351 | |||
352 | |||
353 | Compatibility System Calls (x86) | ||
354 | -------------------------------- | ||
355 | |||
356 | To wire up a system call with a compatibility version for the x86 architecture, | |
357 | the entries in the syscall tables need to be adjusted. | ||
358 | |||
359 | First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra | ||
360 | column to indicate that a 32-bit userspace program running on a 64-bit kernel | ||
361 | should hit the compat entry point:: | ||
362 | |||
363 | 380 i386 xyzzy sys_xyzzy compat_sys_xyzzy | ||
364 | |||
365 | Second, you need to figure out what should happen for the x32 ABI version of | ||
366 | the new system call. There's a choice here: the layout of the arguments | ||
367 | should either match the 64-bit version or the 32-bit version. | ||
368 | |||
369 | If there's a pointer-to-a-pointer involved, the decision is easy: x32 is | ||
370 | ILP32, so the layout should match the 32-bit version, and the entry in | ||
371 | ``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit | ||
372 | the compatibility wrapper:: | ||
373 | |||
374 | 333 64 xyzzy sys_xyzzy | ||
375 | ... | ||
376 | 555 x32 xyzzy compat_sys_xyzzy | ||
377 | |||
378 | If no pointers are involved, then it is preferable to re-use the 64-bit system | ||
379 | call for the x32 ABI (and consequently the entry in | ||
380 | ``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged). | |
381 | |||
382 | In either case, you should check that the types involved in your argument | ||
383 | layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or | ||
384 | 64-bit (-m64) equivalents. | ||
385 | |||
386 | |||
387 | System Calls Returning Elsewhere | ||
388 | -------------------------------- | ||
389 | |||
390 | For most system calls, once the system call is complete the user program | ||
391 | continues exactly where it left off -- at the next instruction, with the | ||
392 | stack the same and most of the registers the same as before the system call, | ||
393 | and with the same virtual memory space. | ||
394 | |||
395 | However, a few system calls do things differently. They might return to a | ||
396 | different location (``rt_sigreturn``) or change the memory space | ||
397 | (``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``) | ||
398 | of the program. | ||
399 | |||
400 | To allow for this, the kernel implementation of the system call may need to | ||
401 | save and restore additional registers to the kernel stack, allowing complete | ||
402 | control of where and how execution continues after the system call. | ||
403 | |||
404 | This is arch-specific, but typically involves defining assembly entry points | ||
405 | that save/restore additional registers and invoke the real system call entry | ||
406 | point. | ||
407 | |||
408 | For x86_64, this is implemented as a ``stub_xyzzy`` entry point in | ||
409 | ``arch/x86/entry/entry_64.S``, and the entry in the syscall table | ||
410 | (``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match:: | ||
411 | |||
412 | 333 common xyzzy stub_xyzzy | ||
413 | |||
414 | The equivalent for 32-bit programs running on a 64-bit kernel is normally | ||
415 | called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``, | ||
416 | with the corresponding syscall table adjustment in | ||
417 | ``arch/x86/entry/syscalls/syscall_32.tbl``:: | ||
418 | |||
419 | 380 i386 xyzzy sys_xyzzy stub32_xyzzy | ||
420 | |||
421 | If the system call needs a compatibility layer (as in the previous section) | ||
422 | then the ``stub32_`` version needs to call on to the ``compat_sys_`` version | ||
423 | of the system call rather than the native 64-bit version. Also, if the x32 ABI | ||
424 | implementation is not common with the x86_64 version, then its syscall | ||
425 | table will also need to invoke a stub that calls on to the ``compat_sys_`` | ||
426 | version. | ||
427 | |||
428 | For completeness, it's also nice to set up a mapping so that user-mode Linux | ||
429 | still works -- its syscall table will reference ``stub_xyzzy``, but the UML build | |
430 | doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML | |
431 | simulates registers etc). Fixing this is as simple as adding a #define to | ||
432 | ``arch/x86/um/sys_call_table_64.c``:: | ||
433 | |||
434 | #define stub_xyzzy sys_xyzzy | ||
435 | |||
436 | |||
437 | Other Details | ||
438 | ------------- | ||
439 | |||
440 | Most of the kernel treats system calls in a generic way, but there is the | ||
441 | occasional exception that may need updating for your particular system call. | ||
442 | |||
443 | The audit subsystem is one such special case; it includes (arch-specific) | ||
444 | functions that classify some special types of system call -- specifically | ||
445 | file open (``open``/``openat``), program execution (``execve``/``execveat``) or | |
446 | socket multiplexor (``socketcall``) operations. If your new system call is | ||
447 | analogous to one of these, then the audit system should be updated. | ||
448 | |||
449 | More generally, if there is an existing system call that is analogous to your | ||
450 | new system call, it's worth doing a kernel-wide grep for the existing system | ||
451 | call to check there are no other special cases. | ||
452 | |||
453 | |||
454 | Testing | ||
455 | ------- | ||
456 | |||
457 | A new system call should obviously be tested; it is also useful to provide | ||
458 | reviewers with a demonstration of how user space programs will use the system | ||
459 | call. A good way to combine these aims is to include a simple self-test | ||
460 | program in a new directory under ``tools/testing/selftests/``. | ||
461 | |||
462 | For a new system call, there will obviously be no libc wrapper function and so | ||
463 | the test will need to invoke it using ``syscall()``; also, if the system call | ||
464 | involves a new userspace-visible structure, the corresponding header will need | ||
465 | to be installed to compile the test. | ||
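
A selftest typically reaches the new call through a thin wrapper around ``syscall()``. The sketch below shows the shape of such a wrapper for a call with no arguments and exercises it with ``SYS_getpid``, since no number for ``xyzzy`` exists; `call_by_number()` is a hypothetical helper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke a system call that takes no arguments by number. Returns the
 * result, or -errno on failure (syscall() itself returns -1 and sets errno).
 */
static long call_by_number(long nr)
{
	long ret;

	assert(nr >= 0);
	errno = 0;
	ret = syscall(nr);
	return (ret == -1 && errno != 0) ? -errno : ret;
}
```

A test for a real new call would pass the number from the installed kernel headers and would probably want to treat a result of ``-ENOSYS`` as "skip" rather than "fail", so that it still runs cleanly on kernels built without the option.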
466 | |||
467 | Make sure the selftest runs successfully on all supported architectures. For | ||
468 | example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32) | ||
469 | and x32 (-mx32) ABI program. | ||
470 | |||
471 | For more extensive and thorough testing of new functionality, you should also | ||
472 | consider adding tests to the Linux Test Project, or to the xfstests project | ||
473 | for filesystem-related changes. | ||
474 | |||
475 | - https://linux-test-project.github.io/ | ||
476 | - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git | ||
477 | |||
478 | |||
479 | Man Page | ||
480 | -------- | ||
481 | |||
482 | All new system calls should come with a complete man page, ideally using groff | ||
483 | markup, but plain text will do. If groff is used, it's helpful to include a | ||
484 | pre-rendered ASCII version of the man page in the cover email for the | ||
485 | patchset, for the convenience of reviewers. | ||
486 | |||
487 | The man page should be cc'ed to linux-man@vger.kernel.org. | |
488 | For more details, see https://www.kernel.org/doc/man-pages/patches.html | ||
489 | |||
490 | References and Sources | ||
491 | ---------------------- | ||
492 | |||
493 | - LWN article from Michael Kerrisk on use of flags argument in system calls: | ||
494 | https://lwn.net/Articles/585415/ | ||
495 | - LWN article from Michael Kerrisk on how to handle unknown flags in a system | ||
496 | call: https://lwn.net/Articles/588444/ | ||
497 | - LWN article from Jake Edge describing constraints on 64-bit system call | ||
498 | arguments: https://lwn.net/Articles/311630/ | ||
499 | - Pair of LWN articles from David Drysdale that describe the system call | ||
500 | implementation paths in detail for v3.14: | ||
501 | |||
502 | - https://lwn.net/Articles/604287/ | ||
503 | - https://lwn.net/Articles/604515/ | ||
504 | |||
505 | - Architecture-specific requirements for system calls are discussed in the | ||
506 | :manpage:`syscall(2)` man-page: | ||
507 | http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES | ||
508 | - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``: | ||
509 | http://yarchive.net/comp/linux/ioctl.html | ||
510 | - "How to not invent kernel interfaces", Arnd Bergmann, | ||
511 | http://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf | ||
512 | - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN: | ||
513 | https://lwn.net/Articles/486306/ | ||
514 | - Recommendation from Andrew Morton that all related information for a new | ||
515 | system call should come in the same email thread: | ||
516 | https://lkml.org/lkml/2014/7/24/641 | ||
517 | - Recommendation from Michael Kerrisk that a new system call should come with | ||
518 | a man page: https://lkml.org/lkml/2014/6/13/309 | ||
519 | - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate | ||
520 | commit: https://lkml.org/lkml/2014/11/19/254 | ||
521 | - Suggestion from Greg Kroah-Hartman that it's good for new system calls to | ||
522 | come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710 | ||
523 | - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension: | ||
524 | https://lkml.org/lkml/2014/6/3/411 | ||
525 | - Suggestion from Ingo Molnar that system calls that involve multiple | ||
526 | arguments should encapsulate those arguments in a struct, which includes a | ||
527 | size field for future extensibility: https://lkml.org/lkml/2015/7/30/117 | ||
528 | - Numbering oddities arising from (re-)use of O_* numbering space flags: | ||
529 | |||
530 | - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness | ||
531 | check") | ||
532 | - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc | ||
533 | conflict") | ||
534 | - commit bb458c644a59 ("Safer ABI for O_TMPFILE") | ||
535 | |||
536 | - Discussion from Matthew Wilcox about restrictions on 64-bit arguments: | ||
537 | https://lkml.org/lkml/2008/12/12/187 | ||
538 | - Recommendation from Greg Kroah-Hartman that unknown flags should be | ||
539 | policed: https://lkml.org/lkml/2014/7/17/577 | ||
540 | - Recommendation from Linus Torvalds that x32 system calls should prefer | ||
541 | compatibility with 64-bit versions rather than 32-bit versions: | ||
542 | https://lkml.org/lkml/2011/8/31/244 | ||