author     Peter Zijlstra <a.p.zijlstra@chello.nl>          2010-10-26 17:21:54 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>   2010-10-26 19:52:08 -0400
commit     d65bfacb046f3df8aa11a9cb9b6e448f6171174d
tree       56e2debcf416665b115789d4484cb4f8d6b59908
parent     7a837d1bb7cb2bceb093ec639068626586a89234
mm: highmem documentation
Document outlining some of the highmem issues, started by me, edited by
David.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/vm/highmem.txt 162
1 files changed, 162 insertions, 0 deletions
diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt
new file mode 100644
index 000000000000..4324d24ffacd
--- /dev/null
+++ b/Documentation/vm/highmem.txt
@@ -0,0 +1,162 @@

====================
HIGH MEMORY HANDLING
====================

By: Peter Zijlstra <a.p.zijlstra@chello.nl>

Contents:

 (*) What is high memory?

 (*) Temporary virtual mappings.

 (*) Using kmap_atomic.

 (*) Cost of temporary mappings.

 (*) i386 PAE.


====================
WHAT IS HIGH MEMORY?
====================

High memory (highmem) is used when the size of physical memory approaches or
exceeds the maximum size of virtual memory. At that point it becomes
impossible for the kernel to keep all of the available physical memory mapped
at all times. This means the kernel needs to start using temporary mappings of
the pieces of physical memory that it wants to access.

The part of (physical) memory not covered by a permanent mapping is what we
refer to as 'highmem'. There are various architecture-dependent constraints on
where exactly that border lies.

In the i386 arch, for example, we choose to map the kernel into every process's
VM space so that we don't have to pay the full TLB invalidation costs for
kernel entry/exit. This means the available virtual memory space (4GiB on
i386) has to be divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for
userspace and the top 1GiB for kernel space:

        +--------+ 0xffffffff
        | Kernel |
        +--------+ 0xc0000000
        |        |
        | User   |
        |        |
        +--------+ 0x00000000

This means that the kernel can at most map 1GiB of physical memory at any one
time, but because we need virtual address space for other things - including
temporary maps to access the rest of the physical memory - the actual direct
map will typically be less (usually around 896MiB).

Other architectures that have mm-context-tagged TLBs can have separate kernel
and user maps. Some hardware (like some ARMs), however, has limited virtual
space when it uses mm context tags.


==========================
TEMPORARY VIRTUAL MAPPINGS
==========================

The kernel contains several ways of creating temporary mappings:

 (*) vmap(). This can be used to make a long-duration mapping of multiple
     physical pages into a contiguous virtual space. It needs global
     synchronization to unmap.

 (*) kmap(). This permits a short-duration mapping of a single page. It needs
     global synchronization, but is amortized somewhat. It is also prone to
     deadlocks when used in a nested fashion, and so it is not recommended for
     new code.

 (*) kmap_atomic(). This permits a very short-duration mapping of a single
     page. Since the mapping is restricted to the CPU that issued it, it
     performs well, but the issuing task is therefore required to stay on that
     CPU until it has finished, lest some other task displace its mappings.

     kmap_atomic() may also be used by interrupt contexts, since it does not
     sleep. The caller, in turn, may not sleep until after kunmap_atomic() is
     called.

     It may be assumed that k[un]map_atomic() won't fail.


=================
USING KMAP_ATOMIC
=================

When and where to use kmap_atomic() is straightforward. It is used when code
wants to access the contents of a page that might be allocated from high memory
(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
functions, and they can be used in a manner similar to the following:

        /* Find the page of interest. */
        struct page *page = find_get_page(mapping, offset);

        /* Gain access to the contents of that page. */
        void *vaddr = kmap_atomic(page);

        /* Do something to the contents of that page. */
        memset(vaddr, 0, PAGE_SIZE);

        /* Unmap that page. */
        kunmap_atomic(vaddr);

Note that the kunmap_atomic() call takes the result of the kmap_atomic() call,
not the argument.

If you need to map two pages because you want to copy from one page to
another, you need to keep the kmap_atomic calls strictly nested, like:

        vaddr1 = kmap_atomic(page1);
        vaddr2 = kmap_atomic(page2);

        memcpy(vaddr1, vaddr2, PAGE_SIZE);

        kunmap_atomic(vaddr2);
        kunmap_atomic(vaddr1);


==========================
COST OF TEMPORARY MAPPINGS
==========================

The cost of creating temporary mappings can be quite high. The arch has to
manipulate the kernel's page tables, the data TLB and/or the MMU's registers.

If CONFIG_HIGHMEM is not set, then the kernel will try to create a mapping
simply with a bit of arithmetic that will convert the page struct address into
a pointer to the page contents rather than juggling mappings about. In such a
case, the unmap operation may be a null operation.

If CONFIG_MMU is not set, then there can be no temporary mappings and no
highmem. In such a case, the arithmetic approach will also be used.


========
i386 PAE
========

The i386 arch, under some circumstances, will permit you to stick up to 64GiB
of RAM into your 32-bit machine. This has a number of consequences:

 (*) Linux needs a page-frame structure for each page in the system and the
     page-frames need to live in the permanent mapping, which means:

     (*) you can have 896M/sizeof(struct page) page-frames at most; with
         struct page being 32 bytes that would end up being something on the
         order of 112G worth of pages; the kernel, however, needs to store
         more than just page-frames in that memory...

 (*) PAE makes your page tables larger - which slows the system down, as more
     data has to be accessed to walk the page tables on TLB fills and the
     like. One advantage is that PAE has more PTE bits and can provide
     advanced features like NX and PAT.

The general recommendation is that you don't use more than 8GiB on a 32-bit
machine. Although more might work for you and your workload, you're pretty
much on your own: don't expect kernel developers to really care much if things
come apart.