author		Andrea Arcangeli <aarcange@redhat.com>		2011-01-13 18:46:52 -0500
committer	Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 20:32:42 -0500
commit		71e3aac0724ffe8918992d76acfe3aad7d8724a5 (patch)
tree		4ff96e1fc3e53bc9d25b859bf7e5bdbab8f1b25a /mm/huge_memory.c
parent		5c3240d92e29ae7bfb9cb58a9b37e80ab40894ff (diff)
thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed:

1) hugepages have to be swappable, or the guest physical memory remains locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing

3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal failures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guests not using O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest use this patch (though the guest will limit the additional speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way that all applications can benefit from without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed...

Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split a hugepage into 512 regular pages, and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*.
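[Editor's note: as a quick sanity check on the "512 regular pages" figure above, here is a minimal userspace sketch, not part of the patch. It assumes the usual x86-64 values of 4k base pages and 2M pmd-mapped hugepages and redefines the HPAGE_PMD_* names locally; in the kernel they derive from PMD_SHIFT/PAGE_SHIFT.]

============
#include <stdio.h>

/* Assumed x86-64 values. */
#define PAGE_SHIFT	12				/* 4k base pages */
#define HPAGE_PMD_SHIFT	21				/* 2M pmd-mapped hugepages */
#define HPAGE_PMD_ORDER	(HPAGE_PMD_SHIFT - PAGE_SHIFT)	/* 9 */
#define HPAGE_PMD_NR	(1UL << HPAGE_PMD_ORDER)	/* 512 subpages per hugepage */
#define HPAGE_PMD_SIZE	(1UL << HPAGE_PMD_SHIFT)	/* 2M per hugepage */

int main(void)
{
	/* 2M / 4k = 512: the number of regular pages a split produces. */
	printf("order %d -> %lu subpages, %lu bytes per hugepage\n",
	       HPAGE_PMD_ORDER, HPAGE_PMD_NR, HPAGE_PMD_SIZE);
	return 0;
}
============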
The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma), which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short, the very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental, and it'll just be a "harmless" addition later if this initial part is agreed upon. It should also be noted that locking-wise, replacing regular pages with hugepages is going to be very easy compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page), if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble, and we can try again later. collapse_huge_page will be similar to how KSM works, and madvise(MADV_HUGEPAGE) will work similarly to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values, "always", "madvise" and "never", which mean respectively that hugepages are always used, used only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls whether the hugepage allocation should defrag memory aggressively: "always", only inside "madvise" regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use the mmu notifier, like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would always result in a coherent page count for both tail and head. I think my locking solution, with a compound_lock taken only after the first_page is valid and the page is still a PageHead, should be safe, but it surely needs review from an SMP race point of view. In short there is no existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount, so I had to invent a new one (O_DIRECT loses knowledge of the mapping status by the time gup_fast returns, so...). And I didn't want to impact all gup/gup_fast users for now; maybe if we change the gup interface substantially we can avoid this locking. I admit I didn't think too much about it because changing the gup unpinning interface would be invasive.

If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task's mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage, so we can't avoid it to solve the race of put_page against split_huge_page_refcount and to achieve a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;).
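[Editor's note: to make the madvise(MADV_HUGEPAGE) side concrete from userland, here is a minimal sketch, not part of this patch. MADV_HUGEPAGE itself only arrives with the later madvise/collapse patches, so its numeric value is hard-coded here as an assumption, and a 2M x86-64 hugepage size is assumed for the alignment.]

============
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14		/* assumed value, provided by later THP patches */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed pmd-sized hugepage on x86-64 */

int main(void)
{
	size_t len = 64 * HPAGE_SIZE;
	/*
	 * Anonymous mapping; huge faults want hugepage-aligned virtual
	 * addresses, so over-allocate and align the start by hand.
	 */
	void *p = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *aligned;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	aligned = (void *)(((unsigned long)p + HPAGE_SIZE - 1) &
			   ~(HPAGE_SIZE - 1));

	/*
	 * Hint that this range should use transparent hugepages; with
	 * enabled=madvise this is what makes the region eligible.
	 */
	if (madvise(aligned, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(aligned, 0, len);	/* fault in, with huge pmds where possible */
	return 0;
}
============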
MMU notifier is handled transparently too, with the exception of the young bit on the pmd, which didn't have a range check; but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care about a range and there's just the pmd young bit to check in that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch copy_huge_page (and clear_huge_page too) to non-temporal stores. I also extensively researched ways to avoid this cache thrashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it: they add huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, and they still don't substantially improve the cache thrashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the thrashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they thrash less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't pay off substantially with today's hardware it will pay off even less in the future with larger l2 caches, and the prefault logic would bloat the VM a lot.

On embedded systems transparent_hugepage can be disabled with sysfs or, at boot, with the commandline parameter transparent_hugepage=never (or transparent_hugepage=madvise to restrict hugepages to madvise regions); that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE regions (both at page fault time, and, if enabled, with collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped in the pmd; archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevent mixing different page sizes in the same regions, so they will not fit in this framework, which requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way"; the point of transparent hugepages is not to have any reservation at all and to maximize the use of cache and hugepages at all times automatically.
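[Editor's note: for completeness, here is how userland can flip and query the global knob exposed by the sysfs code below. This is an editor's sketch, not part of the patch: it only assumes the /sys/kernel/mm/transparent_hugepage/enabled file described above and parses the bracketed word exactly as double_flag_show() prints it.]

============
#include <stdio.h>
#include <string.h>

#define THP_ENABLED "/sys/kernel/mm/transparent_hugepage/enabled"

/* Write "always", "madvise" or "never" into the knob. */
static int thp_set(const char *mode)
{
	FILE *f = fopen(THP_ENABLED, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs(mode, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}

/* Return the currently selected mode, i.e. the word shown in brackets. */
static int thp_get(char *mode, size_t len)
{
	char line[128], *start, *end;
	FILE *f = fopen(THP_ENABLED, "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	/* The file reads e.g. "always [madvise] never"; pick the bracketed word. */
	start = strchr(line, '[');
	end = start ? strchr(start, ']') : NULL;
	if (!start || !end || (size_t)(end - start) > len)
		return -1;
	*end = '\0';
	strcpy(mode, start + 1);
	return 0;
}

int main(int argc, char **argv)
{
	char mode[32];

	if (argc > 1 && thp_set(argv[1]))
		perror("thp_set");
	if (!thp_get(mode, sizeof(mode)))
		printf("transparent_hugepage: %s\n", mode);
	return 0;
}
============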
Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637

vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647

vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659

vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186

vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm/huge_memory.c')
-rw-r--r--	mm/huge_memory.c	901
1 file changed, 901 insertions, 0 deletions
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
index 000000000000..0c1e8f939f7c
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,901 @@
1/*
2 * Copyright (C) 2009 Red Hat, Inc.
3 *
4 * This work is licensed under the terms of the GNU GPL, version 2. See
5 * the COPYING file in the top-level directory.
6 */
7
8#include <linux/mm.h>
9#include <linux/sched.h>
10#include <linux/highmem.h>
11#include <linux/hugetlb.h>
12#include <linux/mmu_notifier.h>
13#include <linux/rmap.h>
14#include <linux/swap.h>
15#include <asm/tlb.h>
16#include <asm/pgalloc.h>
17#include "internal.h"
18
19unsigned long transparent_hugepage_flags __read_mostly =
20 (1<<TRANSPARENT_HUGEPAGE_FLAG);
21
22#ifdef CONFIG_SYSFS
23static ssize_t double_flag_show(struct kobject *kobj,
24 struct kobj_attribute *attr, char *buf,
25 enum transparent_hugepage_flag enabled,
26 enum transparent_hugepage_flag req_madv)
27{
28 if (test_bit(enabled, &transparent_hugepage_flags)) {
29 VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
30 return sprintf(buf, "[always] madvise never\n");
31 } else if (test_bit(req_madv, &transparent_hugepage_flags))
32 return sprintf(buf, "always [madvise] never\n");
33 else
34 return sprintf(buf, "always madvise [never]\n");
35}
36static ssize_t double_flag_store(struct kobject *kobj,
37 struct kobj_attribute *attr,
38 const char *buf, size_t count,
39 enum transparent_hugepage_flag enabled,
40 enum transparent_hugepage_flag req_madv)
41{
42 if (!memcmp("always", buf,
43 min(sizeof("always")-1, count))) {
44 set_bit(enabled, &transparent_hugepage_flags);
45 clear_bit(req_madv, &transparent_hugepage_flags);
46 } else if (!memcmp("madvise", buf,
47 min(sizeof("madvise")-1, count))) {
48 clear_bit(enabled, &transparent_hugepage_flags);
49 set_bit(req_madv, &transparent_hugepage_flags);
50 } else if (!memcmp("never", buf,
51 min(sizeof("never")-1, count))) {
52 clear_bit(enabled, &transparent_hugepage_flags);
53 clear_bit(req_madv, &transparent_hugepage_flags);
54 } else
55 return -EINVAL;
56
57 return count;
58}
59
60static ssize_t enabled_show(struct kobject *kobj,
61 struct kobj_attribute *attr, char *buf)
62{
63 return double_flag_show(kobj, attr, buf,
64 TRANSPARENT_HUGEPAGE_FLAG,
65 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
66}
67static ssize_t enabled_store(struct kobject *kobj,
68 struct kobj_attribute *attr,
69 const char *buf, size_t count)
70{
71 return double_flag_store(kobj, attr, buf, count,
72 TRANSPARENT_HUGEPAGE_FLAG,
73 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
74}
75static struct kobj_attribute enabled_attr =
76 __ATTR(enabled, 0644, enabled_show, enabled_store);
77
78static ssize_t single_flag_show(struct kobject *kobj,
79 struct kobj_attribute *attr, char *buf,
80 enum transparent_hugepage_flag flag)
81{
82 if (test_bit(flag, &transparent_hugepage_flags))
83 return sprintf(buf, "[yes] no\n");
84 else
85 return sprintf(buf, "yes [no]\n");
86}
87static ssize_t single_flag_store(struct kobject *kobj,
88 struct kobj_attribute *attr,
89 const char *buf, size_t count,
90 enum transparent_hugepage_flag flag)
91{
92 if (!memcmp("yes", buf,
93 min(sizeof("yes")-1, count))) {
94 set_bit(flag, &transparent_hugepage_flags);
95 } else if (!memcmp("no", buf,
96 min(sizeof("no")-1, count))) {
97 clear_bit(flag, &transparent_hugepage_flags);
98 } else
99 return -EINVAL;
100
101 return count;
102}
103
104/*
105 * Currently defrag only controls whether __GFP_WAIT is used for the
106 * allocation. A blind __GFP_REPEAT is too aggressive, it's never worth
107 * swapping tons of memory just to allocate one more hugepage.
108 */
109static ssize_t defrag_show(struct kobject *kobj,
110 struct kobj_attribute *attr, char *buf)
111{
112 return double_flag_show(kobj, attr, buf,
113 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
114 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
115}
116static ssize_t defrag_store(struct kobject *kobj,
117 struct kobj_attribute *attr,
118 const char *buf, size_t count)
119{
120 return double_flag_store(kobj, attr, buf, count,
121 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
122 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
123}
124static struct kobj_attribute defrag_attr =
125 __ATTR(defrag, 0644, defrag_show, defrag_store);
126
127#ifdef CONFIG_DEBUG_VM
128static ssize_t debug_cow_show(struct kobject *kobj,
129 struct kobj_attribute *attr, char *buf)
130{
131 return single_flag_show(kobj, attr, buf,
132 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
133}
134static ssize_t debug_cow_store(struct kobject *kobj,
135 struct kobj_attribute *attr,
136 const char *buf, size_t count)
137{
138 return single_flag_store(kobj, attr, buf, count,
139 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
140}
141static struct kobj_attribute debug_cow_attr =
142 __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
143#endif /* CONFIG_DEBUG_VM */
144
145static struct attribute *hugepage_attr[] = {
146 &enabled_attr.attr,
147 &defrag_attr.attr,
148#ifdef CONFIG_DEBUG_VM
149 &debug_cow_attr.attr,
150#endif
151 NULL,
152};
153
154static struct attribute_group hugepage_attr_group = {
155 .attrs = hugepage_attr,
156 .name = "transparent_hugepage",
157};
158#endif /* CONFIG_SYSFS */
159
160static int __init hugepage_init(void)
161{
162#ifdef CONFIG_SYSFS
163 int err;
164
165 err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
166 if (err)
167 printk(KERN_ERR "hugepage: register sysfs failed\n");
168#endif
169 return 0;
170}
171module_init(hugepage_init)
172
173static int __init setup_transparent_hugepage(char *str)
174{
175 int ret = 0;
176 if (!str)
177 goto out;
178 if (!strcmp(str, "always")) {
179 set_bit(TRANSPARENT_HUGEPAGE_FLAG,
180 &transparent_hugepage_flags);
181 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
182 &transparent_hugepage_flags);
183 ret = 1;
184 } else if (!strcmp(str, "madvise")) {
185 clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
186 &transparent_hugepage_flags);
187 set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
188 &transparent_hugepage_flags);
189 ret = 1;
190 } else if (!strcmp(str, "never")) {
191 clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
192 &transparent_hugepage_flags);
193 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
194 &transparent_hugepage_flags);
195 ret = 1;
196 }
197out:
198 if (!ret)
199 printk(KERN_WARNING
200 "transparent_hugepage= cannot parse, ignored\n");
201 return ret;
202}
203__setup("transparent_hugepage=", setup_transparent_hugepage);
204
205static void prepare_pmd_huge_pte(pgtable_t pgtable,
206 struct mm_struct *mm)
207{
208 assert_spin_locked(&mm->page_table_lock);
209
210 /* FIFO */
211 if (!mm->pmd_huge_pte)
212 INIT_LIST_HEAD(&pgtable->lru);
213 else
214 list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
215 mm->pmd_huge_pte = pgtable;
216}
217
218static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
219{
220 if (likely(vma->vm_flags & VM_WRITE))
221 pmd = pmd_mkwrite(pmd);
222 return pmd;
223}
224
225static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
226 struct vm_area_struct *vma,
227 unsigned long haddr, pmd_t *pmd,
228 struct page *page)
229{
230 int ret = 0;
231 pgtable_t pgtable;
232
233 VM_BUG_ON(!PageCompound(page));
234 pgtable = pte_alloc_one(mm, haddr);
235 if (unlikely(!pgtable)) {
236 put_page(page);
237 return VM_FAULT_OOM;
238 }
239
240 clear_huge_page(page, haddr, HPAGE_PMD_NR);
241 __SetPageUptodate(page);
242
243 spin_lock(&mm->page_table_lock);
244 if (unlikely(!pmd_none(*pmd))) {
245 spin_unlock(&mm->page_table_lock);
246 put_page(page);
247 pte_free(mm, pgtable);
248 } else {
249 pmd_t entry;
250 entry = mk_pmd(page, vma->vm_page_prot);
251 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
252 entry = pmd_mkhuge(entry);
253 /*
254 * The spinlocking to take the lru_lock inside
255 * page_add_new_anon_rmap() acts as a full memory
256 * barrier to be sure clear_huge_page writes become
257 * visible after the set_pmd_at() write.
258 */
259 page_add_new_anon_rmap(page, vma, haddr);
260 set_pmd_at(mm, haddr, pmd, entry);
261 prepare_pmd_huge_pte(pgtable, mm);
262 add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
263 spin_unlock(&mm->page_table_lock);
264 }
265
266 return ret;
267}
268
269static inline struct page *alloc_hugepage(int defrag)
270{
271 return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
272 HPAGE_PMD_ORDER);
273}
274
275int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
276 unsigned long address, pmd_t *pmd,
277 unsigned int flags)
278{
279 struct page *page;
280 unsigned long haddr = address & HPAGE_PMD_MASK;
281 pte_t *pte;
282
283 if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
284 if (unlikely(anon_vma_prepare(vma)))
285 return VM_FAULT_OOM;
286 page = alloc_hugepage(transparent_hugepage_defrag(vma));
287 if (unlikely(!page))
288 goto out;
289
290 return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
291 }
292out:
293 /*
294 * Use __pte_alloc instead of pte_alloc_map, because we can't
295 * run pte_offset_map on the pmd, if an huge pmd could
296 * materialize from under us from a different thread.
297 */
298 if (unlikely(__pte_alloc(mm, vma, pmd, address)))
299 return VM_FAULT_OOM;
300 /* if an huge pmd materialized from under us just retry later */
301 if (unlikely(pmd_trans_huge(*pmd)))
302 return 0;
303 /*
304 * A regular pmd is established and it can't morph into a huge pmd
305 * from under us anymore at this point because we hold the mmap_sem
306 * read mode and khugepaged takes it in write mode. So now it's
307 * safe to run pte_offset_map().
308 */
309 pte = pte_offset_map(pmd, address);
310 return handle_pte_fault(mm, vma, address, pte, pmd, flags);
311}
312
313int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
314 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
315 struct vm_area_struct *vma)
316{
317 struct page *src_page;
318 pmd_t pmd;
319 pgtable_t pgtable;
320 int ret;
321
322 ret = -ENOMEM;
323 pgtable = pte_alloc_one(dst_mm, addr);
324 if (unlikely(!pgtable))
325 goto out;
326
327 spin_lock(&dst_mm->page_table_lock);
328 spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
329
330 ret = -EAGAIN;
331 pmd = *src_pmd;
332 if (unlikely(!pmd_trans_huge(pmd))) {
333 pte_free(dst_mm, pgtable);
334 goto out_unlock;
335 }
336 if (unlikely(pmd_trans_splitting(pmd))) {
337 /* split huge page running from under us */
338 spin_unlock(&src_mm->page_table_lock);
339 spin_unlock(&dst_mm->page_table_lock);
340 pte_free(dst_mm, pgtable);
341
342 wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
343 goto out;
344 }
345 src_page = pmd_page(pmd);
346 VM_BUG_ON(!PageHead(src_page));
347 get_page(src_page);
348 page_dup_rmap(src_page);
349 add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
350
351 pmdp_set_wrprotect(src_mm, addr, src_pmd);
352 pmd = pmd_mkold(pmd_wrprotect(pmd));
353 set_pmd_at(dst_mm, addr, dst_pmd, pmd);
354 prepare_pmd_huge_pte(pgtable, dst_mm);
355
356 ret = 0;
357out_unlock:
358 spin_unlock(&src_mm->page_table_lock);
359 spin_unlock(&dst_mm->page_table_lock);
360out:
361 return ret;
362}
363
364/* no "address" argument so destroys page coloring of some arch */
365pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
366{
367 pgtable_t pgtable;
368
369 assert_spin_locked(&mm->page_table_lock);
370
371 /* FIFO */
372 pgtable = mm->pmd_huge_pte;
373 if (list_empty(&pgtable->lru))
374 mm->pmd_huge_pte = NULL;
375 else {
376 mm->pmd_huge_pte = list_entry(pgtable->lru.next,
377 struct page, lru);
378 list_del(&pgtable->lru);
379 }
380 return pgtable;
381}
382
383static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
384 struct vm_area_struct *vma,
385 unsigned long address,
386 pmd_t *pmd, pmd_t orig_pmd,
387 struct page *page,
388 unsigned long haddr)
389{
390 pgtable_t pgtable;
391 pmd_t _pmd;
392 int ret = 0, i;
393 struct page **pages;
394
395 pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
396 GFP_KERNEL);
397 if (unlikely(!pages)) {
398 ret |= VM_FAULT_OOM;
399 goto out;
400 }
401
402 for (i = 0; i < HPAGE_PMD_NR; i++) {
403 pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
404 vma, address);
405 if (unlikely(!pages[i])) {
406 while (--i >= 0)
407 put_page(pages[i]);
408 kfree(pages);
409 ret |= VM_FAULT_OOM;
410 goto out;
411 }
412 }
413
414 for (i = 0; i < HPAGE_PMD_NR; i++) {
415 copy_user_highpage(pages[i], page + i,
416 haddr + PAGE_SIZE*i, vma);
417 __SetPageUptodate(pages[i]);
418 cond_resched();
419 }
420
421 spin_lock(&mm->page_table_lock);
422 if (unlikely(!pmd_same(*pmd, orig_pmd)))
423 goto out_free_pages;
424 VM_BUG_ON(!PageHead(page));
425
426 pmdp_clear_flush_notify(vma, haddr, pmd);
427 /* leave pmd empty until pte is filled */
428
429 pgtable = get_pmd_huge_pte(mm);
430 pmd_populate(mm, &_pmd, pgtable);
431
432 for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
433 pte_t *pte, entry;
434 entry = mk_pte(pages[i], vma->vm_page_prot);
435 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
436 page_add_new_anon_rmap(pages[i], vma, haddr);
437 pte = pte_offset_map(&_pmd, haddr);
438 VM_BUG_ON(!pte_none(*pte));
439 set_pte_at(mm, haddr, pte, entry);
440 pte_unmap(pte);
441 }
442 kfree(pages);
443
444 mm->nr_ptes++;
445 smp_wmb(); /* make pte visible before pmd */
446 pmd_populate(mm, pmd, pgtable);
447 page_remove_rmap(page);
448 spin_unlock(&mm->page_table_lock);
449
450 ret |= VM_FAULT_WRITE;
451 put_page(page);
452
453out:
454 return ret;
455
456out_free_pages:
457 spin_unlock(&mm->page_table_lock);
458 for (i = 0; i < HPAGE_PMD_NR; i++)
459 put_page(pages[i]);
460 kfree(pages);
461 goto out;
462}
463
464int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
465 unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
466{
467 int ret = 0;
468 struct page *page, *new_page;
469 unsigned long haddr;
470
471 VM_BUG_ON(!vma->anon_vma);
472 spin_lock(&mm->page_table_lock);
473 if (unlikely(!pmd_same(*pmd, orig_pmd)))
474 goto out_unlock;
475
476 page = pmd_page(orig_pmd);
477 VM_BUG_ON(!PageCompound(page) || !PageHead(page));
478 haddr = address & HPAGE_PMD_MASK;
479 if (page_mapcount(page) == 1) {
480 pmd_t entry;
481 entry = pmd_mkyoung(orig_pmd);
482 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
483 if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
484 update_mmu_cache(vma, address, entry);
485 ret |= VM_FAULT_WRITE;
486 goto out_unlock;
487 }
488 get_page(page);
489 spin_unlock(&mm->page_table_lock);
490
491 if (transparent_hugepage_enabled(vma) &&
492 !transparent_hugepage_debug_cow())
493 new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
494 else
495 new_page = NULL;
496
497 if (unlikely(!new_page)) {
498 ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
499 pmd, orig_pmd, page, haddr);
500 put_page(page);
501 goto out;
502 }
503
504 copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
505 __SetPageUptodate(new_page);
506
507 spin_lock(&mm->page_table_lock);
508 put_page(page);
509 if (unlikely(!pmd_same(*pmd, orig_pmd)))
510 put_page(new_page);
511 else {
512 pmd_t entry;
513 VM_BUG_ON(!PageHead(page));
514 entry = mk_pmd(new_page, vma->vm_page_prot);
515 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
516 entry = pmd_mkhuge(entry);
517 pmdp_clear_flush_notify(vma, haddr, pmd);
518 page_add_new_anon_rmap(new_page, vma, haddr);
519 set_pmd_at(mm, haddr, pmd, entry);
520 update_mmu_cache(vma, address, entry);
521 page_remove_rmap(page);
522 put_page(page);
523 ret |= VM_FAULT_WRITE;
524 }
525out_unlock:
526 spin_unlock(&mm->page_table_lock);
527out:
528 return ret;
529}
530
531struct page *follow_trans_huge_pmd(struct mm_struct *mm,
532 unsigned long addr,
533 pmd_t *pmd,
534 unsigned int flags)
535{
536 struct page *page = NULL;
537
538 assert_spin_locked(&mm->page_table_lock);
539
540 if (flags & FOLL_WRITE && !pmd_write(*pmd))
541 goto out;
542
543 page = pmd_page(*pmd);
544 VM_BUG_ON(!PageHead(page));
545 if (flags & FOLL_TOUCH) {
546 pmd_t _pmd;
547 /*
548 * We should set the dirty bit only for FOLL_WRITE but
549 * for now the dirty bit in the pmd is meaningless.
550 * And if the dirty bit will become meaningful and
551 * we'll only set it with FOLL_WRITE, an atomic
552 * set_bit will be required on the pmd to set the
553 * young bit, instead of the current set_pmd_at.
554 */
555 _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
556 set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
557 }
558 page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
559 VM_BUG_ON(!PageCompound(page));
560 if (flags & FOLL_GET)
561 get_page(page);
562
563out:
564 return page;
565}
566
567int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
568 pmd_t *pmd)
569{
570 int ret = 0;
571
572 spin_lock(&tlb->mm->page_table_lock);
573 if (likely(pmd_trans_huge(*pmd))) {
574 if (unlikely(pmd_trans_splitting(*pmd))) {
575 spin_unlock(&tlb->mm->page_table_lock);
576 wait_split_huge_page(vma->anon_vma,
577 pmd);
578 } else {
579 struct page *page;
580 pgtable_t pgtable;
581 pgtable = get_pmd_huge_pte(tlb->mm);
582 page = pmd_page(*pmd);
583 pmd_clear(pmd);
584 page_remove_rmap(page);
585 VM_BUG_ON(page_mapcount(page) < 0);
586 add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
587 VM_BUG_ON(!PageHead(page));
588 spin_unlock(&tlb->mm->page_table_lock);
589 tlb_remove_page(tlb, page);
590 pte_free(tlb->mm, pgtable);
591 ret = 1;
592 }
593 } else
594 spin_unlock(&tlb->mm->page_table_lock);
595
596 return ret;
597}
598
599pmd_t *page_check_address_pmd(struct page *page,
600 struct mm_struct *mm,
601 unsigned long address,
602 enum page_check_address_pmd_flag flag)
603{
604 pgd_t *pgd;
605 pud_t *pud;
606 pmd_t *pmd, *ret = NULL;
607
608 if (address & ~HPAGE_PMD_MASK)
609 goto out;
610
611 pgd = pgd_offset(mm, address);
612 if (!pgd_present(*pgd))
613 goto out;
614
615 pud = pud_offset(pgd, address);
616 if (!pud_present(*pud))
617 goto out;
618
619 pmd = pmd_offset(pud, address);
620 if (pmd_none(*pmd))
621 goto out;
622 if (pmd_page(*pmd) != page)
623 goto out;
624 VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
625 pmd_trans_splitting(*pmd));
626 if (pmd_trans_huge(*pmd)) {
627 VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
628 !pmd_trans_splitting(*pmd));
629 ret = pmd;
630 }
631out:
632 return ret;
633}
634
635static int __split_huge_page_splitting(struct page *page,
636 struct vm_area_struct *vma,
637 unsigned long address)
638{
639 struct mm_struct *mm = vma->vm_mm;
640 pmd_t *pmd;
641 int ret = 0;
642
643 spin_lock(&mm->page_table_lock);
644 pmd = page_check_address_pmd(page, mm, address,
645 PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
646 if (pmd) {
647 /*
648 * We can't temporarily set the pmd to null in order
649 * to split it, the pmd must remain marked huge at all
650 * times or the VM won't take the pmd_trans_huge paths
651 * and it won't wait on the anon_vma->root->lock to
652 * serialize against split_huge_page*.
653 */
654 pmdp_splitting_flush_notify(vma, address, pmd);
655 ret = 1;
656 }
657 spin_unlock(&mm->page_table_lock);
658
659 return ret;
660}
661
662static void __split_huge_page_refcount(struct page *page)
663{
664 int i;
665 unsigned long head_index = page->index;
666 struct zone *zone = page_zone(page);
667
668 /* prevent PageLRU to go away from under us, and freeze lru stats */
669 spin_lock_irq(&zone->lru_lock);
670 compound_lock(page);
671
672 for (i = 1; i < HPAGE_PMD_NR; i++) {
673 struct page *page_tail = page + i;
674
675 /* tail_page->_count cannot change */
676 atomic_sub(atomic_read(&page_tail->_count), &page->_count);
677 BUG_ON(page_count(page) <= 0);
678 atomic_add(page_mapcount(page) + 1, &page_tail->_count);
679 BUG_ON(atomic_read(&page_tail->_count) <= 0);
680
681 /* after clearing PageTail the gup refcount can be released */
682 smp_mb();
683
684 page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
685 page_tail->flags |= (page->flags &
686 ((1L << PG_referenced) |
687 (1L << PG_swapbacked) |
688 (1L << PG_mlocked) |
689 (1L << PG_uptodate)));
690 page_tail->flags |= (1L << PG_dirty);
691
692 /*
693 * 1) clear PageTail before overwriting first_page
694 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
695 */
696 smp_wmb();
697
698 /*
699 * __split_huge_page_splitting() already set the
700 * splitting bit in all pmd that could map this
701 * hugepage, that will ensure no CPU can alter the
702 * mapcount on the head page. The mapcount is only
703 * accounted in the head page and it has to be
704 * transferred to all tail pages in the below code. So
705 * for this code to be safe, during the split the mapcount
706 * can't change. But that doesn't mean userland can't
707 * keep changing and reading the page contents while
708 * we transfer the mapcount, so the pmd splitting
709 * status is achieved setting a reserved bit in the
710 * pmd, not by clearing the present bit.
711 */
712 BUG_ON(page_mapcount(page_tail));
713 page_tail->_mapcount = page->_mapcount;
714
715 BUG_ON(page_tail->mapping);
716 page_tail->mapping = page->mapping;
717
718 page_tail->index = ++head_index;
719
720 BUG_ON(!PageAnon(page_tail));
721 BUG_ON(!PageUptodate(page_tail));
722 BUG_ON(!PageDirty(page_tail));
723 BUG_ON(!PageSwapBacked(page_tail));
724
725 lru_add_page_tail(zone, page, page_tail);
726 }
727
728 ClearPageCompound(page);
729 compound_unlock(page);
730 spin_unlock_irq(&zone->lru_lock);
731
732 for (i = 1; i < HPAGE_PMD_NR; i++) {
733 struct page *page_tail = page + i;
734 BUG_ON(page_count(page_tail) <= 0);
735 /*
736 * Tail pages may be freed if there wasn't any mapping
737 * like if add_to_swap() is running on a lru page that
738 * had its mapping zapped. And freeing these pages
739 * requires taking the lru_lock so we do the put_page
740 * of the tail pages after the split is complete.
741 */
742 put_page(page_tail);
743 }
744
745 /*
746 * Only the head page (now become a regular page) is required
747 * to be pinned by the caller.
748 */
749 BUG_ON(page_count(page) <= 0);
750}
751
752static int __split_huge_page_map(struct page *page,
753 struct vm_area_struct *vma,
754 unsigned long address)
755{
756 struct mm_struct *mm = vma->vm_mm;
757 pmd_t *pmd, _pmd;
758 int ret = 0, i;
759 pgtable_t pgtable;
760 unsigned long haddr;
761
762 spin_lock(&mm->page_table_lock);
763 pmd = page_check_address_pmd(page, mm, address,
764 PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
765 if (pmd) {
766 pgtable = get_pmd_huge_pte(mm);
767 pmd_populate(mm, &_pmd, pgtable);
768
769 for (i = 0, haddr = address; i < HPAGE_PMD_NR;
770 i++, haddr += PAGE_SIZE) {
771 pte_t *pte, entry;
772 BUG_ON(PageCompound(page+i));
773 entry = mk_pte(page + i, vma->vm_page_prot);
774 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
775 if (!pmd_write(*pmd))
776 entry = pte_wrprotect(entry);
777 else
778 BUG_ON(page_mapcount(page) != 1);
779 if (!pmd_young(*pmd))
780 entry = pte_mkold(entry);
781 pte = pte_offset_map(&_pmd, haddr);
782 BUG_ON(!pte_none(*pte));
783 set_pte_at(mm, haddr, pte, entry);
784 pte_unmap(pte);
785 }
786
787 mm->nr_ptes++;
788 smp_wmb(); /* make pte visible before pmd */
789 /*
790 * Up to this point the pmd is present and huge and
791 * userland has the whole access to the hugepage
792 * during the split (which happens in place). If we
793 * overwrite the pmd with the not-huge version
794 * pointing to the pte here (which of course we could
795 * if all CPUs were bug free), userland could trigger
796 * a small page size TLB miss on the small sized TLB
797 * while the hugepage TLB entry is still established
798 * in the huge TLB. Some CPU doesn't like that. See
799 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
800 * Erratum 383 on page 93. Intel should be safe but
801 * also warns that it's only safe if the permission
802 * and cache attributes of the two entries loaded in
803 * the two TLBs are identical (which should be the case
804 * here). But it is generally safer to never allow
805 * small and huge TLB entries for the same virtual
806 * address to be loaded simultaneously. So instead of
807 * doing "pmd_populate(); flush_tlb_range();" we first
808 * mark the current pmd notpresent (atomically because
809 * here the pmd_trans_huge and pmd_trans_splitting
810 * must remain set at all times on the pmd until the
811 * split is complete for this pmd), then we flush the
812 * SMP TLB and finally we write the non-huge version
813 * of the pmd entry with pmd_populate.
814 */
815 set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
816 flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
817 pmd_populate(mm, pmd, pgtable);
818 ret = 1;
819 }
820 spin_unlock(&mm->page_table_lock);
821
822 return ret;
823}
824
825/* must be called with anon_vma->root->lock held */
826static void __split_huge_page(struct page *page,
827 struct anon_vma *anon_vma)
828{
829 int mapcount, mapcount2;
830 struct anon_vma_chain *avc;
831
832 BUG_ON(!PageHead(page));
833 BUG_ON(PageTail(page));
834
835 mapcount = 0;
836 list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
837 struct vm_area_struct *vma = avc->vma;
838 unsigned long addr = vma_address(page, vma);
839 BUG_ON(is_vma_temporary_stack(vma));
840 if (addr == -EFAULT)
841 continue;
842 mapcount += __split_huge_page_splitting(page, vma, addr);
843 }
844 BUG_ON(mapcount != page_mapcount(page));
845
846 __split_huge_page_refcount(page);
847
848 mapcount2 = 0;
849 list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
850 struct vm_area_struct *vma = avc->vma;
851 unsigned long addr = vma_address(page, vma);
852 BUG_ON(is_vma_temporary_stack(vma));
853 if (addr == -EFAULT)
854 continue;
855 mapcount2 += __split_huge_page_map(page, vma, addr);
856 }
857 BUG_ON(mapcount != mapcount2);
858}
859
860int split_huge_page(struct page *page)
861{
862 struct anon_vma *anon_vma;
863 int ret = 1;
864
865 BUG_ON(!PageAnon(page));
866 anon_vma = page_lock_anon_vma(page);
867 if (!anon_vma)
868 goto out;
869 ret = 0;
870 if (!PageCompound(page))
871 goto out_unlock;
872
873 BUG_ON(!PageSwapBacked(page));
874 __split_huge_page(page, anon_vma);
875
876 BUG_ON(PageCompound(page));
877out_unlock:
878 page_unlock_anon_vma(anon_vma);
879out:
880 return ret;
881}
882
883void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
884{
885 struct page *page;
886
887 spin_lock(&mm->page_table_lock);
888 if (unlikely(!pmd_trans_huge(*pmd))) {
889 spin_unlock(&mm->page_table_lock);
890 return;
891 }
892 page = pmd_page(*pmd);
893 VM_BUG_ON(!page_count(page));
894 get_page(page);
895 spin_unlock(&mm->page_table_lock);
896
897 split_huge_page(page);
898
899 put_page(page);
900 BUG_ON(pmd_trans_huge(*pmd));
901}