| author | Jonathan Corbet <corbet@lwn.net> | 2017-04-02 17:18:32 -0400 |
|---|---|---|
| committer | Jonathan Corbet <corbet@lwn.net> | 2017-04-02 17:18:32 -0400 |
| commit | f504d47be5e8fa7ecf2bf660b18b42e6960c0eb2 (patch) | |
| tree | c80102f6f0ff516be6096b8019758873e65a82a9 /Documentation/userspace-api | |
| parent | 1d596dee3862bc14895ba15afe69c4e85c6d6124 (diff) | |
docs: Convert unshare.txt to RST and add to the user-space API manual
This is a straightforward conversion, without any real textual changes.
Since this document has seen no substantive changes since its addition in
2006, some such changes are probably warranted.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/userspace-api')
| -rw-r--r-- | Documentation/userspace-api/index.rst | 2 | ||||
| -rw-r--r-- | Documentation/userspace-api/unshare.rst | 332 |
2 files changed, 334 insertions, 0 deletions
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 6d98ea6c0d2d..a9d01b44a659 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst | |||
| @@ -16,6 +16,8 @@ place where this information is gathered. | |||
| 16 | .. toctree:: | 16 | .. toctree:: |
| 17 | :maxdepth: 2 | 17 | :maxdepth: 2 |
| 18 | 18 | ||
| 19 | unshare | ||
| 20 | |||
| 19 | .. only:: subproject and html | 21 | .. only:: subproject and html |
| 20 | 22 | ||
| 21 | Indices | 23 | Indices |
diff --git a/Documentation/userspace-api/unshare.rst b/Documentation/userspace-api/unshare.rst new file mode 100644 index 000000000000..737c192cf4e7 --- /dev/null +++ b/Documentation/userspace-api/unshare.rst | |||
| @@ -0,0 +1,332 @@ | |||
| 1 | unshare system call | ||
| 2 | =================== | ||
| 3 | |||
| 4 | This document describes the new system call, unshare(). The document | ||
| 5 | provides an overview of the feature, why it is needed, how it can | ||
| 6 | be used, its interface specification, design, implementation and | ||
| 7 | how it can be tested. | ||
| 8 | |||
| 9 | Change Log | ||
| 10 | ---------- | ||
| 11 | version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 | ||
| 12 | |||
| 13 | Contents | ||
| 14 | -------- | ||
| 15 | 1) Overview | ||
| 16 | 2) Benefits | ||
| 17 | 3) Cost | ||
| 18 | 4) Requirements | ||
| 19 | 5) Functional Specification | ||
| 20 | 6) High Level Design | ||
| 21 | 7) Low Level Design | ||
| 22 | 8) Test Specification | ||
| 23 | 9) Future Work | ||
| 24 | |||
| 25 | 1) Overview | ||
| 26 | ----------- | ||
| 27 | |||
| 28 | Most legacy operating system kernels support an abstraction of threads | ||
| 29 | as multiple execution contexts within a process. These kernels provide | ||
| 30 | special resources and mechanisms to maintain these "threads". The Linux | ||
| 31 | kernel, in a clever and simple manner, does not make a distinction | ||
| 32 | between processes and "threads". The kernel allows processes to share | ||
| 33 | resources and thus they can achieve legacy "threads" behavior without | ||
| 34 | requiring additional data structures and mechanisms in the kernel. The | ||
| 35 | power of implementing threads in this manner comes not only from | ||
| 36 | its simplicity but also from allowing application programmers to work | ||
| 37 | outside the confinement of all-or-nothing shared resources of legacy | ||
| 38 | threads. On Linux, at the time of thread creation using the clone system | ||
| 39 | call, applications can selectively choose which resources to share | ||
| 40 | between threads. | ||
| 41 | |||
| 42 | The unshare() system call adds a primitive to the Linux thread model that | ||
| 43 | allows threads to selectively 'unshare' any resources that were being | ||
| 44 | shared at the time of their creation. unshare() was conceptualized by | ||
| 45 | Al Viro in August of 2000, on the Linux-Kernel mailing list, as part | ||
| 46 | of the discussion on POSIX threads on Linux. unshare() augments the | ||
| 47 | usefulness of Linux threads for applications that would like to control | ||
| 48 | shared resources without creating a new process. unshare() is a natural | ||
| 49 | addition to the set of available primitives on Linux that implement | ||
| 50 | the concept of process/thread as a virtual machine. | ||
| 51 | |||
| 52 | 2) Benefits | ||
| 53 | ----------- | ||
| 54 | |||
| 55 | unshare() would be useful to large application frameworks such as PAM | ||
| 56 | where creating a new process to control sharing/unsharing of process | ||
| 57 | resources is not possible. Since namespaces are shared by default | ||
| 58 | when creating a new process using fork or clone, unshare() can benefit | ||
| 59 | even non-threaded applications if they have a need to disassociate | ||
| 60 | from the default shared namespace. The following are two use cases | ||
| 61 | where unshare() can be used. | ||
| 62 | |||
| 63 | 2.1 Per-security context namespaces | ||
| 64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 65 | |||
| 66 | unshare() can be used to implement polyinstantiated directories using | ||
| 67 | the kernel's per-process namespace mechanism. Polyinstantiated directories, | ||
| 68 | such as per-user and/or per-security context instances of /tmp, /var/tmp or | ||
| 69 | per-security context instances of a user's home directory, isolate user | ||
| 70 | processes when working with these directories. Using unshare(), a PAM | ||
| 71 | module can easily set up a private namespace for a user at login. | ||
| 72 | Polyinstantiated directories are required for Common Criteria certification | ||
| 73 | with the Labeled Security Protection Profile; however, with the availability | ||
| 74 | of the shared-tree feature in the Linux kernel, even regular Linux systems | ||
| 75 | can benefit from setting up private namespaces at login and | ||
| 76 | polyinstantiating /tmp, /var/tmp and other directories deemed | ||
| 77 | appropriate by system administrators. | ||
| 78 | |||
| 79 | 2.2 unsharing of virtual memory and/or open files | ||
| 80 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 81 | |||
| 82 | Consider a client/server application where the server is processing | ||
| 83 | client requests by creating processes that share resources such as | ||
| 84 | virtual memory and open files. Without unshare(), the server has to | ||
| 85 | decide what needs to be shared at the time of creating the process | ||
| 86 | which services the request. unshare() gives the server the ability to | ||
| 87 | disassociate parts of the context during the servicing of the | ||
| 88 | request. For large and complex middleware application frameworks, this | ||
| 89 | ability to unshare() after the process was created can be very | ||
| 90 | useful. | ||
| 91 | |||
| 92 | 3) Cost | ||
| 93 | ------- | ||
| 94 | |||
| 95 | In order to not duplicate code and to handle the fact that unshare() | ||
| 96 | works on an active task (as opposed to clone/fork working on a newly | ||
| 97 | allocated inactive task), unshare() had to make minor reorganizational | ||
| 98 | changes to the copy_* functions used by the clone/fork system calls. | ||
| 99 | There is a cost associated with altering existing, well tested and | ||
| 100 | stable code to implement a new feature that may not get exercised | ||
| 101 | extensively in the beginning. However, with proper design and code | ||
| 102 | review of the changes and creation of an unshare() test for the LTP, | ||
| 103 | the benefits of this new feature can exceed its cost. | ||
| 104 | |||
| 105 | 4) Requirements | ||
| 106 | --------------- | ||
| 107 | |||
| 108 | unshare() reverses sharing that was done using the clone(2) system call, | ||
| 109 | so unshare() should have an interface similar to that of clone(2). That is, | ||
| 110 | since flags in clone(int flags, void *stack) specify what should | ||
| 111 | be shared, similar flags in unshare(int flags) should specify | ||
| 112 | what should be unshared. Unfortunately, this may appear to invert | ||
| 113 | the meaning of the flags from the way they are used in clone(2). | ||
| 114 | However, there was no easy solution that was less confusing and that | ||
| 115 | allowed incremental context unsharing in the future without an ABI change. | ||
| 116 | |||
| 117 | The unshare() interface should accommodate the possible future addition of | ||
| 118 | new context flags without requiring a rebuild of old applications. | ||
| 119 | If and when new context flags are added, the unshare() design should allow | ||
| 120 | incremental unsharing of those resources on an as-needed basis. | ||
| 121 | |||
| 122 | 5) Functional Specification | ||
| 123 | --------------------------- | ||
| 124 | |||
| 125 | NAME | ||
| 126 | unshare - disassociate parts of the process execution context | ||
| 127 | |||
| 128 | SYNOPSIS | ||
| 129 | #include <sched.h> | ||
| 130 | |||
| 131 | int unshare(int flags); | ||
| 132 | |||
| 133 | DESCRIPTION | ||
| 134 | unshare() allows a process to disassociate parts of its execution | ||
| 135 | context that are currently being shared with other processes. Part | ||
| 136 | of the execution context, such as the namespace, is shared by default | ||
| 137 | when a new process is created using fork(2), while other parts, | ||
| 138 | such as the virtual memory, open file descriptors, etc., may be | ||
| 139 | shared by explicit request to share them when creating a process | ||
| 140 | using clone(2). | ||
| 141 | |||
| 142 | The main use of unshare() is to allow a process to control its | ||
| 143 | shared execution context without creating a new process. | ||
| 144 | |||
| 145 | The flags argument specifies one of the following constants, or the | ||
| 146 | bitwise OR of several of them. | ||
| 147 | |||
| 148 | CLONE_FS | ||
| 149 | If CLONE_FS is set, file system information of the caller | ||
| 150 | is disassociated from the shared file system information. | ||
| 151 | |||
| 152 | CLONE_FILES | ||
| 153 | If CLONE_FILES is set, the file descriptor table of the | ||
| 154 | caller is disassociated from the shared file descriptor | ||
| 155 | table. | ||
| 156 | |||
| 157 | CLONE_NEWNS | ||
| 158 | If CLONE_NEWNS is set, the namespace of the caller is | ||
| 159 | disassociated from the shared namespace. | ||
| 160 | |||
| 161 | CLONE_VM | ||
| 162 | If CLONE_VM is set, the virtual memory of the caller is | ||
| 163 | disassociated from the shared virtual memory. | ||
| 164 | |||
| 165 | RETURN VALUE | ||
| 166 | On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error. | ||
| 167 | |||
| 168 | ERRORS | ||
| 169 | EPERM CLONE_NEWNS was specified by a non-root process (process | ||
| 170 | without CAP_SYS_ADMIN). | ||
| 171 | |||
| 172 | ENOMEM Cannot allocate sufficient memory to copy parts of caller's | ||
| 173 | context that need to be unshared. | ||
| 174 | |||
| 175 | EINVAL Invalid flag was specified as an argument. | ||
| 176 | |||
| 177 | CONFORMING TO | ||
| 178 | The unshare() call is Linux-specific and should not be used | ||
| 179 | in programs intended to be portable. | ||
| 180 | |||
| 181 | SEE ALSO | ||
| 182 | clone(2), fork(2) | ||
| 183 | |||
| 184 | 6) High Level Design | ||
| 185 | -------------------- | ||
| 186 | |||
| 187 | Depending on the flags argument, the unshare() system call allocates | ||
| 188 | appropriate process context structures, populates them with values from | ||
| 189 | the current shared version, associates newly duplicated structures | ||
| 190 | with the current task structure and releases corresponding shared | ||
| 191 | versions. Helper functions of clone (copy_*) could not be used | ||
| 192 | directly by unshare() for the following two reasons. | ||
| 193 | |||
| 194 | 1) clone operates on a newly allocated not-yet-active task | ||
| 195 | structure, whereas unshare() operates on the current active | ||
| 196 | task. Therefore unshare() has to take the appropriate task_lock() | ||
| 197 | before associating newly duplicated context structures. | ||
| 198 | |||
| 199 | 2) unshare() has to allocate and duplicate all context structures | ||
| 200 | that are being unshared, before associating them with the | ||
| 201 | current task and releasing older shared structures. Failure | ||
| 202 | to do so will create race conditions and/or oops when trying | ||
| 203 | to back out due to an error. Consider the case of unsharing | ||
| 204 | both virtual memory and namespace. After successfully unsharing | ||
| 205 | vm, if the system call encounters an error while allocating | ||
| 206 | a new namespace structure, the error return code will have to | ||
| 207 | reverse the unsharing of vm. As part of the reversal the | ||
| 208 | system call will have to go back to the older, shared, vm | ||
| 209 | structure, which may not exist anymore. | ||
| 210 | |||
| 211 | Therefore code from copy_* functions that allocated and duplicated | ||
| 212 | current context structure was moved into new dup_* functions. Now, | ||
| 213 | copy_* functions call dup_* functions to allocate and duplicate | ||
| 214 | appropriate context structures and then associate them with the | ||
| 215 | task structure that is being constructed. The unshare() system call, on | ||
| 216 | the other hand, performs the following: | ||
| 217 | |||
| 218 | 1) Check flags to force missing, but implied, flags | ||
| 219 | |||
| 220 | 2) For each context structure, call the corresponding unshare() | ||
| 221 | helper function to allocate and duplicate a new context | ||
| 222 | structure, if the appropriate bit is set in the flags argument. | ||
| 223 | |||
| 224 | 3) If there is no error in allocation and duplication and there | ||
| 225 | are new context structures then lock the current task structure, | ||
| 226 | associate new context structures with the current task structure, | ||
| 227 | and release the lock on the current task structure. | ||
| 228 | |||
| 229 | 4) Appropriately release older, shared, context structures. | ||
| 230 | |||
| 231 | 7) Low Level Design | ||
| 232 | ------------------- | ||
| 233 | |||
| 234 | Implementation of unshare() can be grouped into the following four | ||
| 235 | items: | ||
| 236 | |||
| 237 | a) Reorganization of existing copy_* functions | ||
| 238 | |||
| 239 | b) unshare() system call service function | ||
| 240 | |||
| 241 | c) unshare() helper functions for each different process context | ||
| 242 | |||
| 243 | d) Registration of system call number for different architectures | ||
| 244 | |||
| 245 | 7.1) Reorganization of copy_* functions | ||
| 246 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 247 | |||
| 248 | Each copy function, such as copy_mm, copy_namespace, copy_files, | ||
| 249 | etc., had roughly two components. The first component allocated | ||
| 250 | and duplicated the appropriate structure and the second component | ||
| 251 | linked it to the task structure passed in as an argument to the copy | ||
| 252 | function. The first component was split into its own function. | ||
| 253 | These dup_* functions allocated and duplicated the appropriate | ||
| 254 | context structure. The reorganized copy_* functions invoked | ||
| 255 | their corresponding dup_* functions and then linked the newly | ||
| 256 | duplicated structures to the task structure with which the | ||
| 257 | copy function was called. | ||
| 258 | |||
| 259 | 7.2) unshare() system call service function | ||
| 260 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 261 | |||
| 262 | * Check flags | ||
| 263 | Force implied flags. If CLONE_THREAD is set, force CLONE_VM. | ||
| 264 | If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is | ||
| 265 | set and signals are also being shared, force CLONE_THREAD. If | ||
| 266 | CLONE_NEWNS is set, force CLONE_FS. | ||
| 267 | |||
| 268 | * For each context flag, invoke the corresponding unshare_* | ||
| 269 | helper routine with flags passed into the system call and a | ||
| 270 | reference to a pointer to the new unshared structure. | ||
| 271 | |||
| 272 | * If any new structures are created by unshare_* helper | ||
| 273 | functions, take the task_lock() on the current task, | ||
| 274 | modify appropriate context pointers, and release the | ||
| 275 | task lock. | ||
| 276 | |||
| 277 | * For all newly unshared structures, release the corresponding | ||
| 278 | older, shared, structures. | ||
| 279 | |||
| 280 | 7.3) unshare_* helper functions | ||
| 281 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 282 | |||
| 283 | For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, | ||
| 284 | and CLONE_THREAD, return -EINVAL since they are not implemented yet. | ||
| 285 | For others, check the flag value to see if the unsharing is | ||
| 286 | required for that structure. If it is, invoke the corresponding | ||
| 287 | dup_* function to allocate and duplicate the structure and return | ||
| 288 | a pointer to it. | ||
| 289 | |||
| 290 | 7.4) Finally | ||
| 291 | ~~~~~~~~~~~~ | ||
| 292 | |||
| 293 | Appropriately modify architecture specific code to register the | ||
| 294 | new system call. | ||
| 295 | |||
| 296 | 8) Test Specification | ||
| 297 | --------------------- | ||
| 298 | |||
| 299 | The test for unshare() should test the following: | ||
| 300 | |||
| 301 | 1) Valid flags: Test to check that clone flags for signal and | ||
| 302 | signal handlers, for which unsharing is not implemented | ||
| 303 | yet, return -EINVAL. | ||
| 304 | |||
| 305 | 2) Missing/implied flags: Test to make sure that unsharing the | ||
| 306 | namespace without specifying unsharing of the filesystem correctly | ||
| 307 | unshares both namespace and filesystem information. | ||
| 308 | |||
| 309 | 3) For each of the four supported kinds of unsharing (namespace, | ||
| 310 | filesystem, files and vm), verify that the system call correctly | ||
| 311 | unshares the appropriate structure. Verify that unsharing | ||
| 312 | them individually as well as in combination with each | ||
| 313 | other works as expected. | ||
| 314 | |||
| 315 | 4) Concurrent execution: Use shared memory segments and futex on | ||
| 316 | an address in the shm segment to synchronize execution of | ||
| 317 | about 10 threads. Have a couple of threads execute execve, | ||
| 318 | a couple _exit and the rest unshare with different combinations | ||
| 319 | of flags. Verify that unsharing is performed as expected and | ||
| 320 | that there are no oops or hangs. | ||
| 321 | |||
| 322 | 9) Future Work | ||
| 323 | -------------- | ||
| 324 | |||
| 325 | The current implementation of unshare() does not allow unsharing of | ||
| 326 | signals and signal handlers. Signals are complex to begin with and | ||
| 327 | to unshare signals and/or signal handlers of a currently running | ||
| 328 | process is even more complex. If in the future there is a specific | ||
| 329 | need to allow unsharing of signals and/or signal handlers, it can | ||
| 330 | be incrementally added to unshare() without affecting legacy | ||
| 331 | applications using unshare(). | ||
| 332 | |||
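
The "Check flags" step in section 7.2 of the document above can be illustrated with a small user-space sketch. It applies the four forcing rules in the order the text lists them. The function name `force_implied_flags` and the `signals_shared` parameter are hypothetical and chosen only for this illustration; they are not taken from the kernel source. The `CLONE_*` constants come from the kernel's exported `<linux/sched.h>` header.

```c
#include <linux/sched.h>   /* CLONE_* flag constants */

/*
 * User-space sketch of the "force implied flags" step (section 7.2).
 * force_implied_flags() and signals_shared are hypothetical names used
 * for illustration only; signals_shared stands for "signals are also
 * being shared" in the text.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* If CLONE_THREAD is set, force CLONE_VM. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* If CLONE_VM is set, force CLONE_SIGHAND. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* If CLONE_SIGHAND is set and signals are shared, force CLONE_THREAD. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* If CLONE_NEWNS is set, force CLONE_FS. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

For example, passing `CLONE_NEWNS` alone yields `CLONE_NEWNS | CLONE_FS`, which is the behaviour that test 2 in section 8 is meant to check.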

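The ordering argument in section 6 (duplicate every requested structure first, then associate the copies with the task, then release the old shared versions) can be sketched with toy user-space structures. Everything here is an assumption made for illustration: `struct ctx`, `struct task`, `WANT_FS`, `WANT_FILES`, `toy_unshare` and the `fail_files` switch that simulates an allocation failure are hypothetical and do not correspond to kernel types. The real implementation takes task_lock() around the commit step, which this sketch only marks with a comment.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a reference-counted kernel context structure. */
struct ctx {
    int refs;
    char data[32];
};

/* Toy stand-in for the task structure, with two shareable contexts. */
struct task {
    struct ctx *fs;
    struct ctx *files;
};

#define WANT_FS    0x1
#define WANT_FILES 0x2

/* Allocate a private copy of old; fail != 0 simulates an allocation error. */
static struct ctx *dup_ctx(const struct ctx *old, int fail)
{
    struct ctx *c;

    if (fail)
        return NULL;
    c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    memcpy(c->data, old->data, sizeof(c->data));
    c->refs = 1;
    return c;
}

/* Drop one reference; free the structure when the last one goes away. */
static void put_ctx(struct ctx *c)
{
    if (c && --c->refs == 0)
        free(c);
}

/* Returns 0 on success, -1 on failure with the task left untouched. */
int toy_unshare(struct task *t, int want, int fail_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL, *old;

    /* Step 1: duplicate everything requested; the task is not modified. */
    if (want & WANT_FS) {
        new_fs = dup_ctx(t->fs, 0);
        if (!new_fs)
            goto fail;
    }
    if (want & WANT_FILES) {
        new_files = dup_ctx(t->files, fail_files);
        if (!new_files)
            goto fail;
    }

    /* Step 2: commit by swapping pointers (the kernel holds task_lock()
     * here). Afterwards new_* point at the old, shared structures. */
    if (new_fs) {
        old = t->fs; t->fs = new_fs; new_fs = old;
    }
    if (new_files) {
        old = t->files; t->files = new_files; new_files = old;
    }

    /* Step 3: release the references to the old, shared structures. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;

fail:
    /* Nothing was committed, so only the fresh copies need freeing. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return -1;
}
```

Because nothing is attached to the task until every duplication has succeeded, the failure path never has to return to a shared structure that may already have been released, which is the hazard the vm/namespace example in section 6 describes.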