Diffstat (limited to 'Documentation/vm/locking')
-rw-r--r--   Documentation/vm/locking   131
1 files changed, 131 insertions, 0 deletions

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
new file mode 100644
index 000000000000..c3ef09ae3bb1
--- /dev/null
+++ b/Documentation/vm/locking
@@ -0,0 +1,131 @@
Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count increment and a matching mmdrop are
done in swap_out(). Page stealers hold kernel_lock to protect
against a number of races. The vma list of the victim mm is also
scanned by the stealer, and the page_table_lock is used to preserve
list sanity against the process adding to or deleting from the list.
This also guarantees the existence of the vma. Vma existence is not
guaranteed once try_to_swap_out() drops the page_table_lock. To
guarantee the existence of the underlying file structure, a get_file
is done before the swapout() method is invoked. The page passed into
swapout() is guaranteed not to be reused for a different purpose,
because the page reference count due to being present in the user's
pte is not released until after swapout() returns.

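A minimal sketch of these two existence guarantees, assuming the
2.4-era interfaces (atomic mm_count plus mmdrop, get_file/fput); the
function names are hypothetical and the real logic lives in
swap_out()/try_to_swap_out():

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/file.h>

/* Keep the victim mm from being freed while we scan it. */
static void pin_victim_mm_sketch(struct mm_struct *mm)
{
        atomic_inc(&mm->mm_count);              /* paired with mmdrop() */

        spin_lock(&mm->page_table_lock);        /* keeps the vma list sane */
        /* ... walk mm->mmap and try to unmap candidate ptes ... */
        spin_unlock(&mm->page_table_lock);

        mmdrop(mm);                             /* drop the mm_count reference */
}

/* Keep the backing file alive across the vma's swapout() method
 * (a file-backed vma is assumed here). */
static void swapout_with_file_pinned_sketch(struct vm_area_struct *vma,
                                            struct page *page)
{
        struct file *file = vma->vm_file;

        get_file(file);
        /* ... invoke the swapout() method for this vma/page ... */
        fput(file);
}
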
Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list*, must prevent
kswapd from looking at the chain.

The rules are (a sketch of the resulting lock usage follows the list):
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem).
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem), *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long-lived lock
   and the swapper just cannot sleep on it.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock; this is ok
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding the mmap_sem
   or the page_table_lock of mm A, you will not try to get either
   lock for mm B.

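As a rough illustration of rules 1-3, a minimal sketch against the
2.4-era mm_struct fields (mmap_sem, page_table_lock, mmap); the
function names are hypothetical and the bodies only mark where the
real work would go:

#include <linux/sched.h>
#include <linux/mm.h>

/* Rule 1: look but don't touch. */
static void scan_vmlist_sketch(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        down_read(&mm->mmap_sem);
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                ;       /* read vm_start, vm_end, vm_flags, ... */
        up_read(&mm->mmap_sem);
}

/* Rule 2: modify the list under mmap_sem (write) plus page_table_lock. */
static void modify_vmlist_sketch(struct mm_struct *mm)
{
        down_write(&mm->mmap_sem);
        spin_lock(&mm->page_table_lock);
        /* ... insert/remove vmas, adjust vm_start/vm_end ... */
        spin_unlock(&mm->page_table_lock);
        up_write(&mm->mmap_sem);
}

/* Rule 3: the swapper cannot sleep on mmap_sem, so it takes only the
 * page_table_lock while walking the vma list and the page tables. */
static void swapper_scan_sketch(struct mm_struct *mm)
{
        spin_lock(&mm->page_table_lock);
        /* ... pick ptes to unmap, as try_to_swap_out() does ... */
        spin_unlock(&mm->page_table_lock);
}
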
The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other
code that invokes find_vma with the mmap_sem held), but that is okay,
since it is only a hint. This can be fixed, if desired, by having
find_vma grab the page_table_lock, as sketched below.

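For concreteness, a hypothetical variant of find_vma() with the hint
update serialized by the page_table_lock might look like this (a
sketch only, not the actual mm/mmap.c code; the real find_vma may also
use the avl tree for mms with many vmas):

#include <linux/mm.h>

struct vm_area_struct *find_vma_locked_hint(struct mm_struct *mm,
                                            unsigned long addr)
{
        struct vm_area_struct *vma = mm->mmap_cache;

        /* Fast path: the (possibly stale) hint already matches. */
        if (vma && vma->vm_end > addr && vma->vm_start <= addr)
                return vma;

        /* Slow path: linear walk, then refresh the hint under the PTL. */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->vm_end > addr)
                        break;
        if (vma) {
                spin_lock(&mm->page_table_lock);
                mm->mmap_cache = vma;   /* serialized hint update */
                spin_unlock(&mm->page_table_lock);
        }
        return vma;
}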

The code paths that add/delete elements from the vmlist chain are:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vmas on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. E.g., vm_start is modified by
expand_stack(); even without the vmlist protection, it is hard to come
up with a destructive scenario in this case.

The page_table_lock nests with the inode i_mmap_lock and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages
after dropping c_spinlock. The page_table_lock also nests with the
pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for
memory with these locks held.

The page_table_lock is grabbed while holding the kernel_lock spinning
monitor.

The page_table_lock is a spin lock.

Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing
tlb_gather_mmu single-address-space optimization, so that
zap_page_range (from vmtruncate) does not miss sending IPIs to cloned
threads that might be spawned underneath it and go to user mode to
drag ptes into their TLBs.

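A sketch of that idea, assuming the 2.4-era fields (mm_users is an
atomic_t, page_table_lock a spinlock); this illustrates the principle
only, not the literal copy_mm/tlb_gather_mmu code, and the function
names are hypothetical:

#include <linux/sched.h>
#include <linux/mm.h>

/* Clone side (CLONE_VM): become a new user of the mm under the PTL. */
static void share_mm_sketch(struct mm_struct *mm)
{
        spin_lock(&mm->page_table_lock);
        atomic_inc(&mm->mm_users);      /* new clone is now accounted for */
        spin_unlock(&mm->page_table_lock);
}

/* Zap side: a single-user optimization is only safe if no clone can
 * sneak in while we look, hence the same lock. */
static int mm_is_single_user_sketch(struct mm_struct *mm)
{
        int single;

        spin_lock(&mm->page_table_lock);
        single = (atomic_read(&mm->mm_users) == 1);
        /* ... zap_page_range-style work, choosing whether to send ipi's ... */
        spin_unlock(&mm->page_table_lock);
        return single;
}
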
swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The number of free swaphandles is maintained in "nr_swap_pages". These two
together are protected by the swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt level.
The locking hierarchy is swap_list_lock -> swap_device_lock.

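A sketch of the hierarchy, assuming the swap_list_lock()/
swap_device_lock() wrappers of the 2.4-era linux/swap.h (the function
name is hypothetical):

#include <linux/swap.h>

/* Hierarchy: swap_list_lock first, then the per-device lock. */
static void touch_swap_device_sketch(struct swap_info_struct *si)
{
        swap_list_lock();       /* guards swap_list and nr_swap_pages */
        swap_device_lock(si);   /* guards si->swap_map, highest_bit, lowest_bit */
        /* ... allocate or free swaphandles on this device ... */
        swap_device_unlock(si);
        swap_list_unlock();
}
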
There is a race between (a) swap space deletion, or async readahead
swapins, deciding whether a swap handle is in use, i.e. worthy of being
read in from disk, and (b) an unmap -> swap_free making that handle
unused. To close it, the swap delete and readahead code grabs a
temporary reference on the swaphandle, which also prevents warning
messages from swap_duplicate <- read_swap_cache_async.

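A sketch of the temporary reference, assuming the 2.4-era
swap_duplicate()/swap_free() calls (the function name is hypothetical):

#include <linux/swap.h>

static void readahead_entry_sketch(swp_entry_t entry)
{
        /* Pin the handle so a racing unmap -> swap_free cannot make it
         * unused (and reusable) while we look at it. */
        if (!swap_duplicate(entry))
                return;                 /* already unused, nothing to read */

        /* ... read_swap_cache_async(entry) or a similar swapin ... */

        swap_free(entry);               /* drop the temporary reference */
}
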
Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not added (and hence lost) by being associated
with the same swaphandle.

Pages are guaranteed not to be removed from the scache if the page is
"shared": i.e., other processes hold a reference on the page or on the
associated swap handle. The only code that does not follow this rule is
shrink_mmap, which deletes pages from the swap cache if no process has a
reference on the page (multiple processes might still have references on
the corresponding swap handle, though). lookup_swap_cache() races with
shrink_mmap when establishing a reference on a scache page, so it must
check whether the page it located is still in the swap cache, or whether
shrink_mmap has deleted it. (This race exists because shrink_mmap looks
at the page ref count with pagecache_lock held, but then drops
pagecache_lock before deleting the page from the scache.)

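A simplified sketch of that re-check, assuming the 2.4-era
swapper_space address space and find_get_page() (the function name is
hypothetical; the real check lives inside lookup_swap_cache() itself):

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/pagemap.h>

static struct page *lookup_scache_sketch(swp_entry_t entry)
{
        struct page *page;

        /* Takes a reference on the page, under pagecache_lock. */
        page = find_get_page(&swapper_space, entry.val);
        if (!page)
                return NULL;

        /* shrink_mmap may already have decided to drop this page before
         * our reference was visible to it; re-check. */
        if (!PageSwapCache(page)) {
                page_cache_release(page);
                return NULL;
        }
        return page;
}
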
do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared" by looking at page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired
before calling is_page_shared (else processes might switch their
swap_count references to page_count references after the page count
has been snapshotted).

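A sketch of that rule, assuming the 2.4-era page lock primitives
(lock_page/UnlockPage); is_page_shared() is the helper named in the
text, and the wrapper name is hypothetical, so treat this as an
illustrative fragment rather than standalone code:

#include <linux/mm.h>
#include <linux/pagemap.h>

static int page_unshared_sketch(struct page *page)
{
        int shared;

        lock_page(page);        /* freezes swap_count <-> page_count transfers */
        shared = is_page_shared(page);
        UnlockPage(page);

        return !shared;
}
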
Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.