Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count inc and a mmdrop are done in swap_out().
Page stealers hold kernel_lock to protect against a bunch of races.
The vma list of the victim mm is also scanned by the stealer,
and the page_table_lock is used to preserve list sanity against the
process adding to or deleting from the list. This also guarantees
existence of the vma. Vma existence is not guaranteed once
try_to_swap_out() drops the page_table_lock. To guarantee the existence
of the underlying file structure, a get_file is done before the
swapout() method is invoked. The page passed into swapout() is
guaranteed not to be reused for a different purpose, because the page
reference count due to its presence in the user's pte is not released
until after swapout() returns.
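
As a sketch of those existence guarantees (purely illustrative, using
2.4-era names; this is not the actual swap_out() code path):

	struct file *file;

	atomic_inc(&mm->mm_count);		/* pin the victim mm */

	spin_lock(&mm->page_table_lock);	/* vma list stays sane */
	/* ... walk the vma list, pick a pte to try to unmap ... */
	file = vma->vm_file;
	if (file)
		get_file(file);			/* pin the file structure */
	spin_unlock(&mm->page_table_lock);	/* vma may now disappear */

	/* ... invoke the swapout() method; the page itself cannot be
	 * reused, since the pte's reference on it is dropped only
	 * after swapout() returns ... */

	if (file)
		fput(file);			/* drop the file pin */
	mmdrop(mm);				/* drop the mm_count pin */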

Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list*, must prevent
kswapd from looking at the chain.

The rules are (a short sketch follows the list):
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem).
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem), *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long-lived lock
   and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock; this is okay
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding the mmap_sem
   or page_table_lock of mm A, you will not try to get either
   lock for mm B.
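
A minimal sketch of rules 1 and 2 (illustrative only; the loop body
and the update are placeholders):

	struct vm_area_struct *vma;

	/* Rule 1: scan the vmlist -- look, but don't touch. */
	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* read vm_start/vm_end/vm_flags here */
	}
	up_read(&mm->mmap_sem);

	/* Rule 2: modify the vmlist -- write bias *AND* the
	 * page_table_lock, so the swapper (rule 3), which takes just
	 * the page_table_lock, never sees a half-updated chain. */
	down_write(&mm->mmap_sem);
	spin_lock(&mm->page_table_lock);
	/* insert/remove a vma, or update vm_start/vm_end/VM_LOCKED */
	spin_unlock(&mm->page_table_lock);
	up_write(&mm->mmap_sem);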

The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other
code that invokes find_vma with the mmap_sem held), but that is okay,
since it is only a hint. This can be fixed, if desired, by having
find_vma grab the page_table_lock.
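
A sketch of the usual lookup pattern (illustrative; `addr` stands for
any user address of interest):

	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);	/* rule 1: look but don't touch */
	vma = find_vma(mm, addr);	/* may update the mmap_cache hint */
	if (vma && vma->vm_start <= addr) {
		/* addr falls inside this vma */
	}
	up_read(&mm->mmap_sem);		/* vma may vanish once this drops */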

Code paths that add/delete elements from the vmlist chain are:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vmas on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. For example, vm_start is modified
by expand_stack(); it is hard to come up with a destructive scenario in
this case even without the vmlist protection.

The page_table_lock nests with the inode i_mmap_lock and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages
after dropping c_spinlock. The page_table_lock also nests with the
pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for
memory with these locks held.

The page_table_lock is grabbed while holding the kernel_lock spinning
monitor.

The page_table_lock is a spinlock.

Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
single address space optimization, so that the zap_page_range (from
vmtruncate) does not fail to send IPIs to cloned threads that might
be spawned underneath it and go to user mode to drag ptes into their
TLBs.

swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list"
header. The "swap_list" is used for the round-robin swaphandle
allocation strategy. The number of free swaphandles is maintained in
"nr_swap_pages". These two together are protected by the
swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt
level. The locking hierarchy is swap_list_lock -> swap_device_lock.
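
A sketch of the hierarchy (illustrative; the lock names are used here
as plain spinlock variables, `p` is the swap_info_struct of one device,
and the exact identifiers differ across kernel versions):

	spin_lock(&swap_list_lock);		/* outer: swap_list,
						 * nr_swap_pages */
	spin_lock(&p->swap_device_lock);	/* inner: p's swap_map,
						 * highest_bit, lowest_bit */
	/* ... allocate or free a swaphandle on swap device p ... */
	spin_unlock(&p->swap_device_lock);
	spin_unlock(&swap_list_lock);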

A race exists between swap space deletion or async readahead swapins
deciding whether a swap handle is in use (i.e. worth being read in
from disk) and an unmap -> swap_free making the handle unused. To
prevent it, the swap delete and readahead code grab a temporary
reference on the swaphandle; this prevents warning messages from
swap_duplicate, as called from read_swap_cache_async.
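
A sketch of that temporary-reference pattern (illustrative, 2.4-era
API; `entry` is the swaphandle being examined):

	if (!swap_duplicate(entry))	/* take a temp ref on the handle */
		return;			/* already freed; nothing to read in */

	/* ... decide whether to read the page in, start the I/O ... */

	swap_free(entry);		/* drop the temp ref */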

Swap cache locking
------------------
Pages are added into the swap cache with the kernel_lock held, to make
sure that multiple pages do not get added (and hence lost) by being
associated with the same swaphandle.
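
A sketch of that insertion (illustrative; add_to_swap_cache follows
the 2.4-era helper of the same name):

	lock_kernel();			/* kernel_lock serializes insertion,
					 * so two pages can never end up
					 * bound to the same swaphandle */
	add_to_swap_cache(page, entry);
	unlock_kernel();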

Pages are guaranteed not to be removed from the scache if the page is
"shared": i.e., other processes hold a reference on the page or the
associated swap handle. The only code that does not follow this rule
is shrink_mmap, which deletes pages from the swap cache if no process
has a reference on the page (multiple processes might have references
on the corresponding swap handle, though). lookup_swap_cache() races
with shrink_mmap when establishing a reference on a scache page, so it
must check whether the page it located is still in the swap cache, or
whether shrink_mmap deleted it. (This race exists because shrink_mmap
looks at the page ref count with the pagecache_lock held, but then
drops the pagecache_lock before deleting the page from the scache.)
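
A conceptual sketch of the re-check inside lookup_swap_cache (the
find_page_for_swap helper is hypothetical, standing in for the hash
lookup; page_cache_get/page_cache_release follow 2.4-era names):

	page = find_page_for_swap(entry);	/* hypothetical hash lookup */
	if (page) {
		page_cache_get(page);		/* establish our reference */
		if (!PageSwapCache(page)) {
			/* shrink_mmap got there first: the page is no
			 * longer in the scache, so drop it and miss */
			page_cache_release(page);
			page = NULL;
		}
	}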

do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired
before calling is_page_shared (else processes might switch their
swap_count refs to page_count refs after the page_count ref has been
snapshotted).
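
A sketch of that ordering (illustrative; is_page_shared follows the
helper named in this document):

	int shared;

	lock_page(page);		/* freeze swap_count <-> page_count
					 * ref transfers for this page */
	shared = is_page_shared(page);	/* page_count + swap_count sum is
					 * now stable */
	unlock_page(page);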

Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither the mmap_sem nor the page_table_lock.