diff options
author | Henry Nestler <henry.nestler@gmail.com> | 2008-05-12 09:44:39 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2008-06-12 15:26:07 -0400 |
commit | b29c701deacd5d24453127c37ed77ef851c53b8b (patch) | |
tree | 79aed16366b49d8b6d033b696b055b46fb8f8d02 | |
parent | 3703f39965a197ebd91743fc38d0f640606b8da3 (diff) |
x86: fix endless page faults in mount_block_root for Linux 2.6
Page faults in kernel address space between PAGE_OFFSET up to
VMALLOC_START should not try to map as vmalloc.
Fix rarely endless page faults inside mount_block_root for root
filesystem at boot time.
All 32bit kernels up to 2.6.25 can fail into this hole.
I can not present this under native linux kernel. I see, that the 64bit
has fixed the problem. I copied the same lines into 32bit part.
Recorded debugs are from coLinux kernel 2.6.22.18 (virtualisation):
http://www.henrynestler.com/colinux/testing/pfn-check-0.7.3/20080410-antinx/bug16-recursive-page-fault-endless.txt
The physicaly memory was trimmed down to 192MB to better catch the bug.
More memory gets the bug more rarely.
Details, how every x86 32bit system can fail:
Start from "mount_block_root",
http://lxr.linux.no/linux/init/do_mounts.c#L297
There the variable "fs_names" got one memory page with 4096 bytes.
Variable "p" walks through the existing file system types. The first
string is no problem.
But, with the second loop in mount_block_root the offset of "p" is not
at beginning of page, the offset is for example +9, if "reiserfs" is the
first in list.
Than calls do_mount_root, and lands in sys_mount.
Remember: Variable "type_page" contains now "fs_type+9" and not contains
a full page.
The sys_mount copies 4096 bytes with function "exact_copy_from_user()":
http://lxr.linux.no/linux/fs/namespace.c#L1540
Mostly exist pages after the buffer "fs_names+4096+9" and the page fault
handler was not called. No problem.
In the case, if the page after "fs_names+4096" is not mapped, the page
fault handler was called from http://lxr.linux.no/linux/fs/namespace.c#L1320
The do_page_fault gots an address 0xc03b4000.
It's kernel address, address >= TASK_SIZE, but not from vmalloc! It's
from "__getname()" alias "kmem_cache_alloc".
The "error_code" is 0. "vmalloc_fault" will be call:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L332
"vmalloc_fault" tryed to find the physical page for a non existing
virtual memory area. The macro "pte_present" in vmalloc_fault()
got a next page fault for 0xc0000ed0 at:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L282
No PTE exist for such virtual address. The page fault handler was trying
to sync the physical page for the PTE lockup.
This called vmalloc_fault() again for address 0xc000000, and that also
was not existing. The endless began...
In normal case the cpu would still loop with disabled interrrupts. Under
coLinux this was catched by a stack overflow inside printk debugs.
Signed-off-by: Henry Nestler <henry.nestler@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-rw-r--r-- | arch/x86/mm/fault.c | 5 |
1 files changed, 5 insertions, 0 deletions
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index fd7e1798c75a..8bcb6f40ccb6 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c | |||
@@ -497,6 +497,11 @@ static int vmalloc_fault(unsigned long address) | |||
497 | unsigned long pgd_paddr; | 497 | unsigned long pgd_paddr; |
498 | pmd_t *pmd_k; | 498 | pmd_t *pmd_k; |
499 | pte_t *pte_k; | 499 | pte_t *pte_k; |
500 | |||
501 | /* Make sure we are in vmalloc area */ | ||
502 | if (!(address >= VMALLOC_START && address < VMALLOC_END)) | ||
503 | return -1; | ||
504 | |||
500 | /* | 505 | /* |
501 | * Synchronize this task's top level page-table | 506 | * Synchronize this task's top level page-table |
502 | * with the 'reference' page table. | 507 | * with the 'reference' page table. |