path: root/arch/x86/xen
Commit message    Author    Age
...
* | | | | | | Merge branch 'stable/128gb.v5.1' into stable/for-linus-3.7  Konrad Rzeszutek Wilk  2012-09-12
    * stable/128gb.v5.1:
      xen/mmu: If the revector fails, don't attempt to revector anything else.
      xen/p2m: When revectoring deal with holes in the P2M array.
      xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
      xen/mmu: Remove from __ka space PMD entries for pagetables.
      xen/mmu: Copy and revector the P2M tree.
      xen/p2m: Add logic to revector a P2M tree to use __va leafs.
      xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
      xen/mmu: For 64-bit do not call xen_map_identity_early
      xen/mmu: use copy_page instead of memcpy.
      xen/mmu: Provide comments describing the _ka and _va aliasing issue
      xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
      Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
      xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
      xen/x86: Use memblock_reserve for sensitive areas.
      xen/p2m: Fix the comment describing the P2M tree.

    Conflicts:
      arch/x86/xen/mmu.c

    The pagetable_init is the old xen_pagetable_setup_done and
    xen_pagetable_setup_start rolled into one.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: If the revector fails, don't attempt to revector anything else.  Konrad Rzeszutek Wilk  2012-08-23
    If the P2M revectoring fails, we would try to continue on by cleaning
    the PMD for the L1 (PTE) page-tables. But xen_cleanhighmap is greedy
    and erases the PMD on both boundaries. Since the P2M array can share
    a PMD, we would wipe out part of the __ka space that is still used in
    the P2M tree to point to P2M leafs.

    Fix this by bypassing the rest of the revectoring and continuing on.
    If the revectoring fails, a WARN is printed so we can still
    troubleshoot it.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/p2m: When revectoring deal with holes in the P2M array.  Konrad Rzeszutek Wilk  2012-08-23
    When we free the PFNs and then subsequently populate them back during
    bootup:

    Freeing 20000-20200 pfn range: 512 pages freed
    1-1 mapping on 20000->20200
    Freeing 40000-40200 pfn range: 512 pages freed
    1-1 mapping on 40000->40200
    Freeing bad80-badf4 pfn range: 116 pages freed
    1-1 mapping on bad80->badf4
    Freeing badf6-bae7f pfn range: 137 pages freed
    1-1 mapping on badf6->bae7f
    Freeing bb000-100000 pfn range: 282624 pages freed
    1-1 mapping on bb000->100000
    Released 283999 pages of unused memory
    Set 283999 page(s) to 1-1 mapping
    Populating 1acb8a-1f20e9 pfn range: 283999 pages added

    we end up having the P2M array (that is, the one that was grafted on
    the P2M tree) filled with IDENTITY_FRAME or INVALID_P2M_ENTRY entries.

    The patch titled "xen/p2m: Reuse existing P2M leafs if they are filled
    with 1:1 PFNs or INVALID." recycles said slots and replaces the P2M
    tree leafs that point into &mfn_list[xx] with p2m_identity or
    p2m_missing, and re-uses the P2M array sections for other P2M tree
    leafs.

    For the above mentioned bootup excerpt, the PFNs at 0x20000->0x20200
    are going to be IDENTITY based: P2M[0][256][0] -> P2M[0][257][0] get
    turned into IDENTITY_FRAME. We can re-use that and replace P2M[0][256]
    to point to p2m_identity.

    The "old" page (the grafted P2M array provided by Xen) that was at
    P2M[0][256] gets put somewhere else, specifically at P2M[6][358],
    because when we populate back:

    Populating 1acb8a-1f20e9 pfn range: 283999 pages added

    we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with the
    new MFNs.

    That is all OK, except when we revector we assume that the PFN count
    would be the same in the grafted P2M array and in the newly allocated
    one. Since that is no longer the case, as we have holes in the P2M
    that point to p2m_missing or p2m_identity, we have to take that into
    account.

    [v2: Check for overflow]
    [v3: Move within the __va check]
    [v4: Fix the computation]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: Release just the MFN list, not MFN list and part of pagetables.  Konrad Rzeszutek Wilk  2012-08-23
    We call memblock_reserve for [start of mfn list] -> [PMD aligned end
    of mfn list] instead of [start of mfn list] -> [page aligned end of
    mfn list].

    This has the disastrous effect that if at bootup the end of mfn_list
    is not PMD aligned we end up returning to memblock parts of the
    region past the mfn_list array. And those parts are the PTE tables,
    with the disastrous effect of seeing this at bootup:

    Write protecting the kernel read-only data: 10240k
    Freeing unused kernel memory: 1860k freed
    Freeing unused kernel memory: 200k freed
    (XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
    ...
    (XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
    (XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0

    .. and so on.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: Remove from __ka space PMD entries for pagetables.  Konrad Rzeszutek Wilk  2012-08-23
    Please first read the description in "xen/mmu: Copy and revector the
    P2M tree."

    At this stage, the __ka address space (which is what the old P2M tree
    was using) is partially disassembled. The cleanup_highmap has removed
    the PMD entries from 0-16MB and anything past _brk_end up to the
    max_pfn_mapped (which is the end of the ramdisk).

    The xen_remove_p2m_tree and the code around it have ripped out the
    __ka for the old P2M array.

    Here we continue on, doing it for where the Xen page-tables were. It
    is safe to do so, as the page-tables are addressed using __va. For
    good measure we delete anything that is within MODULES_VADDR and up
    to the end of the PMD.

    At this point the __ka only contains PMD entries for the start of the
    kernel up to __brk.

    [v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: Copy and revector the P2M tree.  Konrad Rzeszutek Wilk  2012-08-23
    Please first read the description in the "xen/p2m: Add logic to
    revector a P2M tree to use __va leafs" patch.

    The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
    copies the contents of the old one into it, and returns the new one.

    At this stage, the __ka address space (which is what the old P2M tree
    was using) is partially disassembled. The cleanup_highmap has removed
    the PMD entries from 0-16MB and anything past _brk_end up to the
    max_pfn_mapped (which is the end of the ramdisk).

    We have revectored the P2M tree (and the one for save/restore as
    well) to use the new shiny __va addresses for the new MFNs. The
    xen_start_info has been taken care of already in
    'xen_setup_kernel_pagetable()' and xen_start_info->shared_info in
    'xen_setup_shared_info()', so we are free to roam and delete PMD
    entries - which is exactly what we are going to do. We rip out the
    __ka for the old P2M array.

    [v1: Fix smatch warnings]
    [v2: memset was doing 0 instead of 0xff]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/p2m: Add logic to revector a P2M tree to use __va leafs.  Konrad Rzeszutek Wilk  2012-08-23
    During bootup Xen supplies us with a P2M array. It sticks it right
    after the ramdisk, as can be seen with a 128GB PV guest (certain
    parts removed for clarity):

    xc_dom_build_image: called
    xc_dom_alloc_segment:   kernel      : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
    xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
    xc_dom_alloc_segment:   ramdisk     : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
    xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
    xc_dom_alloc_segment:   phys2mach   : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
    xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
    xc_dom_alloc_page   :   start info  : 0xffffffffa25c7000 (pfn 0x225c7)
    xc_dom_alloc_page   :   xenstore    : 0xffffffffa25c8000 (pfn 0x225c8)
    xc_dom_alloc_page   :   console     : 0xffffffffa25c9000 (pfn 0x225c9)
    nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
    nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
    nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
    nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
    xc_dom_alloc_segment:   page tables : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
    xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
    xc_dom_alloc_page   :   boot stack  : 0xffffffffa26e1000 (pfn 0x226e1)
    xc_dom_build_image  :   virt_alloc_end : 0xffffffffa26e2000
    xc_dom_build_image  :   virt_pgtab_end : 0xffffffffa2800000

    So the physical memory and virtual (using __START_KERNEL_map
    addresses) layout looks as so:

     phys                              __ka
    /------------\                    /-------------------\
    | 0          | empty              | 0xffffffff80000000|
    | ..         |                    | ..                |
    | 16MB       | <= kernel starts   | 0xffffffff81000000|
    | ..         |                    |                   |
    | 30MB       | <= kernel ends =>  | 0xffffffff81e43000|
    | ..         |  & ramdisk starts  | ..                |
    | 293MB      | <= ramdisk ends => | 0xffffffff925c7000|
    | ..         |  & P2M starts      | ..                |
    | ..         |                    | ..                |
    | 549MB      | <= P2M ends =>     | 0xffffffffa25c7000|
    | ..         |  start_info        | 0xffffffffa25c7000|
    | ..         |  xenstore          | 0xffffffffa25c8000|
    | ..         |  console           | 0xffffffffa25c9000|
    | 549MB      | <= page tables =>  | 0xffffffffa25ca000|
    | ..         |                    |                   |
    | 550MB      | <= PGT end =>      | 0xffffffffa26e1000|
    | ..         |  boot stack        |                   |
    \------------/                    \-------------------/

    As can be seen, the ramdisk, P2M and pagetables are taking a bit of
    __ka address space, which is a problem since MODULES_VADDR starts at
    0xffffffffa0000000 - and the P2M sits right in there! This results
    during bootup in the inability to load modules, with this error:

    ------------[ cut here ]------------
    WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
    Call Trace:
     [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
     [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
     [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
     [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
     [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
     [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
     [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
     [<ffffffff810c6186>] ? load_module+0x66/0x19c0
     [<ffffffff8105cadc>] module_alloc+0x5c/0x60
     [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
     [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
     [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
     [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
     [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
     [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
    ---[ end trace fd8f7704fdea0291 ]---
    vmalloc: allocation failure, allocated 16384 of 20480 bytes
    modprobe: page allocation failure: order:0, mode:0xd2

    Since the __va and __ka are 1:1 up to MODULES_VADDR and
    cleanup_highmap rids __ka of the ramdisk mapping, what we want to do
    is similar - get rid of the P2M in the __ka address space. There are
    two ways of fixing this:

     1) All P2M lookups, instead of using the __ka address, would use the
        __va address. This means we can safely erase from __ka space the
        PMD pointers that point to the PFNs for the P2M array and be OK.

     2) Allocate a new array, copy the existing P2M into it, revector the
        P2M tree to use that, and return the old P2M to the memory
        allocator. This has the advantage that it sets the stage for
        using the XEN_ELF_NOTE_INIT_P2M feature. That feature allows us
        to set the exact virtual address space we want for the P2M - and
        allows us to boot as initial domain on large machines.

    So we pick option 2).

    This patch only lays the groundwork in the P2M code. The patch that
    modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: Recycle the Xen provided L4, L3, and L2 pages  Konrad Rzeszutek Wilk  2012-08-23
    As we are not using them: we end up only using the L1 pagetables and
    grafting those to our page-tables.

    [v1: Per Stefano's suggestion squashed two commits]
    [v2: Per Stefano's suggestion simplified loop]
    [v3: Fix smatch warnings]
    [v4: Add more comments]
    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: For 64-bit do not call xen_map_identity_early  Konrad Rzeszutek Wilk  2012-08-23
    Because we do not need it. During startup Xen provides us with all
    the initial memory mapped that we need to function.

    The initial memory mapped is up to the bootstack, which means we can
    reference using __ka up to 4.f) (from xen/interface/xen.h):

    4. This is the order of bootstrap elements in the initial virtual
       region:
       a. relocated kernel image
       b. initial ram disk              [mod_start, mod_len]
       c. list of allocated page frames [mfn_list, nr_pages]
       d. start_info_t structure        [register ESI (x86)]
       e. bootstrap page tables         [pt_base, CR3 (x86)]
       f. bootstrap stack               [register ESP (x86)]
    (the initial ram disk may be omitted)

    [v1: More comments in git commit]
    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: use copy_page instead of memcpy.  Konrad Rzeszutek Wilk  2012-08-23
    After all, this is what it is there for.

    Acked-by: Jan Beulich <jbeulich@suse.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
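    As a rough illustration of the change (not the actual patch hunks), page-sized
    memcpy() calls are replaced with copy_page(), which copies exactly one page and
    can pick up an architecture-optimized implementation; the wrapper function name
    below is hypothetical:

        #include <linux/string.h>
        #include <asm/page.h>

        static void copy_one_page(void *dst, void *src)
        {
                copy_page(dst, src);            /* page-sized copy, arch-optimized */
                /* memcpy(dst, src, PAGE_SIZE);    equivalent, but misses that path */
        }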
| * | | | | | | xen/mmu: Provide comments describing the _ka and _va aliasing issue  Konrad Rzeszutek Wilk  2012-08-23
    Which is that the level2_kernel_pgt (__ka virtual addresses) and
    level2_ident_pgt (__va virtual addresses) contain the same PMD
    entries. So if you modify a PTE in __ka, it will be reflected in __va
    (and vice-versa).

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.  Konrad Rzeszutek Wilk  2012-08-23
    We don't need to return the new PGD - as we do not use it.

    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."  Konrad Rzeszutek Wilk  2012-08-23
    This reverts commit 806c312e50f122c47913145cf884f53dd09d9199 and
    commit 59b294403e9814e7c1154043567f0d71bac7a511.

    It also documents setup.c and why we want to do it that way, which is
    that we tried to make the memblock_reserve more selective so that it
    would be clear what region is reserved. Sadly we ran into the problem
    wherein on a 64-bit hypervisor with a 32-bit initial domain, the
    pt_base has the cr3 value which is not necessarily where the
    pagetable starts! As Jan put it:

    "Actually, the adjustment turns out to be correct: The page tables
    for a 32-on-64 dom0 get allocated in the order "first L1", "first
    L2", "first L3", so the offset to the page table base is indeed 2.
    When reading xen/include/public/xen.h's comment very strictly, this
    is not a violation (since there nothing is said that the first thing
    in the page table space is pointed to by pt_base; I admit that this
    seems to be implied though, namely do I think that it is implied that
    the page table space is the range [pt_base, pt_base + nt_pt_frames),
    whereas that range here indeed is [pt_base - 2, pt_base - 2 +
    nt_pt_frames), which - without a priori knowledge - the kernel would
    have difficulty to figure out)."

    - so let's just fall back to the easy way and reserve the whole
    region.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.  Konrad Rzeszutek Wilk  2012-08-21
    If a 64-bit hypervisor is booted with a 32-bit initial domain, the
    hypervisor deals with the initial domain as "compat" and does some
    extra adjustments (like pagetables are 4 bytes instead of 8). It also
    adjusts the xen_start_info->pt_base incorrectly.

    When booted with a 32-bit hypervisor (32-bit initial domain):
    ..
    (XEN)  Start info:    cf831000->cf83147c
    (XEN)  Page tables:   cf832000->cf8b5000
    ..
    [    0.000000] PT: cf832000 (f832000)
    [    0.000000] Reserving PT: f832000->f8b5000

    And with a 64-bit hypervisor:
    (XEN)  Start info:    00000000cf831000->00000000cf8314b4
    (XEN)  Page tables:   00000000cf832000->00000000cf8b6000

    [    0.000000] PT: cf834000 (f834000)
    [    0.000000] Reserving PT: f834000->f8b8000

    To deal with this, we keep track of the highest physical address we
    have reserved via memblock_reserve. If that address does not overlap
    with pt_base, we have a gap which we reserve.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/x86: Use memblock_reserve for sensitive areas.  Konrad Rzeszutek Wilk  2012-08-21
    instead of a big memblock_reserve. This way we can be more selective
    in freeing regions (and it also makes it easier to understand what is
    where).

    [v1: Move the auto_translate_physmap to proper line]
    [v2: Per Stefano suggestion add more comments]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | | | xen/p2m: Fix the comment describing the P2M tree.  Konrad Rzeszutek Wilk  2012-08-21
    It mixed up the p2m_mid_missing with p2m_missing. Also remove some
    extra spaces.

    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | | | | | Merge branch 'x86/platform' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into stable/for-linus-3.7  Konrad Rzeszutek Wilk  2012-09-12
    * 'x86/platform' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (9690 commits)
      x86: Document x86_init.paging.pagetable_init()
      x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
      x86: Move paging_init() call to x86_init.paging.pagetable_init()
      x86: Rename pagetable_setup_start() to pagetable_init()
      x86: Remove base argument from x86_init.paging.pagetable_setup_start
      Linux 3.6-rc5
      HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
      Remove user-triggerable BUG from mpol_to_str
      xen/pciback: Fix proper FLR steps.
      uml: fix compile error in deliver_alarm()
      dj: memory scribble in logi_dj
      Fix order of arguments to compat_put_time[spec|val]
      xen: Use correct masking in xen_swiotlb_alloc_coherent.
      xen: fix logical error in tlb flushing
      xen/p2m: Fix one-off error in checking the P2M tree directory.
      powerpc: Don't use __put_user() in patch_instruction
      powerpc: Make sure IPI handlers see data written by IPI senders
      powerpc: Restore correct DSCR in context switch
      powerpc: Fix DSCR inheritance in copy_thread()
      powerpc: Keep thread.dscr and thread.dscr_inherit in sync
      ...
| * | | | | x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()  Attilio Rao  2012-09-12
    At this stage x86_init.paging.pagetable_setup_done is only used in
    the XEN case. Move its content in the x86_init.paging.pagetable_init
    setup function and remove the now unused
    x86_init.paging.pagetable_setup_done remaining infrastructure.

    Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
    Acked-by: <konrad.wilk@oracle.com>
    Cc: <Ian.Campbell@citrix.com>
    Cc: <Stefano.Stabellini@eu.citrix.com>
    Cc: <xen-devel@lists.xensource.com>
    Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | x86: Move paging_init() call to x86_init.paging.pagetable_init()  Attilio Rao  2012-09-12
    Move the paging_init() call to the platform specific
    pagetable_init() function, so we can get rid of the extra
    pagetable_setup_done() function pointer.

    Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
    Acked-by: <konrad.wilk@oracle.com>
    Cc: <Ian.Campbell@citrix.com>
    Cc: <Stefano.Stabellini@eu.citrix.com>
    Cc: <xen-devel@lists.xensource.com>
    Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | x86: Rename pagetable_setup_start() to pagetable_init()  Attilio Rao  2012-09-12
    In preparation for unifying the pagetable_setup_start() and
    pagetable_setup_done() setup functions, rename appropriately all the
    infrastructure related to pagetable_setup_start().

    Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
    Acked-by: <konrad.wilk@oracle.com>
    Cc: <Ian.Campbell@citrix.com>
    Cc: <Stefano.Stabellini@eu.citrix.com>
    Cc: <xen-devel@lists.xensource.com>
    Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | x86: Remove base argument from x86_init.paging.pagetable_setup_start  Attilio Rao  2012-09-12
    We either use swapper_pg_dir or the argument is unused. Preparatory
    patch to simplify platform pagetable setup further.

    Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
    Acked-by: <konrad.wilk@oracle.com>
    Cc: <Ian.Campbell@citrix.com>
    Cc: <Stefano.Stabellini@eu.citrix.com>
    Cc: <xen-devel@lists.xensource.com>
    Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | xen: fix logical error in tlb flushing  Alex Shi  2012-09-05
    While TLB_FLUSH_ALL gets passed as the 'end' argument to
    flush_tlb_others(), the Xen code was made to check its 'start'
    parameter. That may result in an incorrect op.cmd of
    MMUEXT_INVLPG_MULTI instead of MMUEXT_TLB_FLUSH_MULTI, so some pages
    cannot be flushed from the TLB.

    This patch fixes this issue.

    Reported-by: Jan Beulich <jbeulich@suse.com>
    Signed-off-by: Alex Shi <alex.shi@intel.com>
    Acked-by: Jan Beulich <jbeulich@suse.com>
    Tested-by: Yongjie Ren <yongjie.ren@intel.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
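    A minimal sketch of the corrected check described above (assuming the
    mmuext_op based flush path in arch/x86/xen/mmu.c; the helper name is
    illustrative and this is not a verbatim copy of the patch):

        #include <asm/tlbflush.h>        /* TLB_FLUSH_ALL */
        #include <xen/interface/xen.h>   /* struct mmuext_op, MMUEXT_* */

        static void pick_flush_cmd(struct mmuext_op *op,
                                   unsigned long start, unsigned long end)
        {
                if (end != TLB_FLUSH_ALL && (end - start) <= PAGE_SIZE) {
                        op->cmd = MMUEXT_INVLPG_MULTI;    /* flush a single page */
                        op->arg1.linear_addr = start;
                } else {
                        op->cmd = MMUEXT_TLB_FLUSH_MULTI; /* flush the whole TLB */
                }
        }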
| * | | | Merge commit '4cb38750d49010ae72e718d46605ac9ba5a851b4' into stable/for-linus-3.6  Konrad Rzeszutek Wilk  2012-09-05
    * commit '4cb38750d49010ae72e718d46605ac9ba5a851b4': (6849 commits)
      bcma: fix invalid PMU chip control masks
      [libata] pata_cmd64x: whitespace cleanup
      libata-acpi: fix up for acpi_pm_device_sleep_state API
      sata_dwc_460ex: device tree may specify dma_channel
      ahci, trivial: fixed coding style issues related to braces
      ahci_platform: add hibernation callbacks
      libata-eh.c: local functions should not be exposed globally
      libata-transport.c: local functions should not be exposed globally
      sata_dwc_460ex: support hardreset
      ata: use module_pci_driver
      drivers/ata/pata_pcmcia.c: adjust suspicious bit operation
      pata_imx: Convert to clk_prepare_enable/clk_disable_unprepare
      ahci: Enable SB600 64bit DMA on MSI K9AGM2 (MS-7327) v2
      [libata] Prevent interface errors with Seagate FreeAgent GoFlex
      drivers/acpi/glue: revert accidental license-related 6b66d95895c bits
      libata-acpi: add missing inlines in libata.h
      i2c-omap: Add support for I2C_M_STOP message flag
      i2c: Fall back to emulated SMBus if the operation isn't supported natively
      i2c: Add SCCB support
      i2c-tiny-usb: Add support for the Robofuzz OSIF USB/I2C converter
      ...
| | * | | Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds  2012-07-26
    Pull x86/mm changes from Peter Anvin:
     "The big change here is the patchset by Alex Shi to use INVLPG to
      flush only the affected pages when we only need to flush a small
      page range. It also removes the special INVALIDATE_TLB_VECTOR
      interrupts (32 vectors!) and replace it with an ordinary IPI
      function call."

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code
    next to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/tlb: Fix build warning and crash when building for !SMP
      x86/tlb: do flush_tlb_kernel_range by 'invlpg'
      x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
      x86/tlb: enable tlb flush range support for x86
      mm/mmu_gather: enable tlb flush range in generic mmu_gather
      x86/tlb: add tlb_flushall_shift knob into debugfs
      x86/tlb: add tlb_flushall_shift for specific CPU
      x86/tlb: fall back to flush all when meet a THP large page
      x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
      x86/tlb_info: get last level TLB entry number of CPU
      x86: Add read_mostly declaration/definition to variables from smp.h
      x86: Define early read-mostly per-cpu macros
| | | * | | x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range  Alex Shi  2012-06-27
    x86 has no flush_tlb_range support at the instruction level.
    Currently flush_tlb_range is just implemented by flushing the whole
    TLB. That is not the best solution for all scenarios. In fact, if we
    just use 'invlpg' to flush a few lines from the TLB, we can get a
    performance gain from the remaining TLB lines on later accesses.

    But the 'invlpg' instruction costs much time. Its execution time can
    compete with cr3 rewriting, and even a bit more on an SNB CPU.

    So, on a CPU with 512 4KB TLB entries, the balance point is at:
      (512 - X) * 100ns (assumed TLB refill cost) =
               X (TLB flush entries) * 100ns (assumed invlpg cost)

    Here, X is 256, that is 1/2 of the 512 entries. But with the
    mysterious CPU pre-fetcher and page miss handler unit, the assumed
    TLB refill cost is far lower than 100ns in sequential access. And 2
    HT siblings in one core make memory access faster if they are
    accessing the same memory.

    So, in this patch, I just do the change when the number of target
    entries is less than 1/16 of the whole active TLB entries. Actually,
    I have no data to support the percentage '1/16', so any suggestions
    are welcomed.

    As to hugetlb, I guess that due to the smaller page table and fewer
    active TLB entries, I didn't see a benefit via my benchmark, so no
    optimizing for now.

    My micro benchmark shows that in ideal scenarios the reading
    performance improves by 70 percent, and in the worst scenario the
    reading/writing performance is similar to the unpatched 3.4-rc4
    kernel.

    Here is the reading data on my 2P * 4 cores * HT NHM EP machine, with
    THP set to 'always'. Multi-thread testing, where the '-t' parameter
    is the thread number:

                                       with patch   unpatched 3.4-rc4
    ./mprotect -t 1                        14ns                24ns
    ./mprotect -t 2                        13ns                22ns
    ./mprotect -t 4                        12ns                19ns
    ./mprotect -t 8                        14ns                16ns
    ./mprotect -t 16                       28ns                26ns
    ./mprotect -t 32                       54ns                51ns
    ./mprotect -t 128                     200ns               199ns

    Single process with sequential flushing and memory accessing:

                                       with patch   unpatched 3.4-rc4
    ./mprotect                              7ns                11ns
    ./mprotect -p 4096 -l 8 -n 10240       21ns                21ns

    [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
      has additional performance numbers. ]

    Signed-off-by: Alex Shi <alex.shi@intel.com>
    Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
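    A minimal sketch of the heuristic described above (the function name and the
    FLUSHALL_SHIFT constant are illustrative, not the kernel's; the related
    tunable in the merged series is the tlb_flushall_shift knob):

        #define FLUSHALL_SHIFT 4   /* 1/16 of the active TLB entries */

        static void flush_range_or_all(unsigned long start, unsigned long end,
                                       unsigned long tlb_entries)
        {
                unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

                if (nr_pages > (tlb_entries >> FLUSHALL_SHIFT)) {
                        local_flush_tlb();                /* rewrite CR3: flush everything */
                } else {
                        unsigned long addr;

                        for (addr = start; addr < end; addr += PAGE_SIZE)
                                __flush_tlb_single(addr); /* one invlpg per page */
                }
        }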
| | * | | | Merge tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen  Linus Torvalds  2012-07-24
    Pull Xen update from Konrad Rzeszutek Wilk:

     "Features:
      * Performance improvement to lower the amount of traps the
        hypervisor has to do for 32-bit guests. Mainly for setting PTE
        entries and updating TLS descriptors.
      * MCE polling driver to collect hypervisor MCE buffer and present
        them to /dev/mcelog.
      * Physical CPU online/offline support. When a privileged guest is
        booted it is presented with virtual CPUs, which might have a 1:1
        mapping to physical CPUs but usually don't. This provides a
        mechanism to offline/online physical CPUs.

      Bug-fixes for:
      * Coverity found fixes in the console and ACPI processor driver.
      * PVonHVM kexec fixes along with some cleanups.
      * Pages that fall within E820 gaps and non-RAM regions (and had
        been released to the hypervisor) would be populated back, but
        potentially in non-RAM regions."

    * tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
      xen: populate correct number of pages when across mem boundary (v2)
      xen PVonHVM: move shared_info to MMIO before kexec
      xen: simplify init_hvm_pv_info
      xen: remove cast from HYPERVISOR_shared_info assignment
      xen: enable platform-pci only in a Xen guest
      xen/pv-on-hvm kexec: shutdown watches from old kernel
      xen/x86: avoid updating TLS descriptors if they haven't changed
      xen/x86: add desc_equal() to compare GDT descriptors
      xen/mm: zero PTEs for non-present MFNs in the initial page table
      xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
      xen/hvc: Fix up checks when the info is allocated.
      xen/acpi: Fix potential memory leak.
      xen/mce: add .poll method for mcelog device driver
      xen/mce: schedule a workqueue to avoid sleep in atomic context
      xen/pcpu: Xen physical cpus online/offline sys interface
      xen/mce: Register native mce handler as vMCE bounce back point
      x86, MCE, AMD: Adjust initcall sequence for xen
      xen/mce: Add mcelog support for Xen platform
| | * \ \ \ \ Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds  2012-07-22
    Pull smp/hotplug changes from Ingo Molnar:
     "Various cleanups to the SMP hotplug code - a continuing effort of
      Thomas et al"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      smpboot: Remove leftover declaration
      smp: Remove num_booting_cpus()
      smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]()
      POWERPC: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
      SPARC: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
      ia64: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
      x86-smp-remove-call-to-ipi_call_lock-ipi_call_unlock
      tile: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
      S390: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
      parisc: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
      mn10300: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
      hexagon: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
| | | * | | | | x86-smp-remove-call-to-ipi_call_lock-ipi_call_unlock  Yong Zhang  2012-06-05
    ipi_call_lock/unlock() lock resp. unlock call_function.lock. This
    lock protects only the call_function data structure itself, but it's
    completely unrelated to cpu_online_mask. The mask to which the IPIs
    are sent is calculated before call_function.lock is taken in
    smp_call_function_many(), so the locking around set_cpu_online() is
    pointless and can be removed.

    [ tglx: Massaged changelog ]

    Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
    Cc: ralf@linux-mips.org
    Cc: sshtylyov@mvista.com
    Cc: david.daney@cavium.com
    Cc: nikunj@linux.vnet.ibm.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: axboe@kernel.dk
    Cc: peterz@infradead.org
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: Jeremy Fitzhardinge <jeremy@goop.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Link: http://lkml.kernel.org/r/1338275765-3217-7-git-send-email-yong.zhang0@gmail.com
    Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| | * | | | | Merge branch 'x86/cpu' into perf/core  Ingo Molnar  2012-07-05
    Merge this branch because we changed the wrmsr*_safe() API and
    there's a conflict.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | | * | | | x86, pvops: Remove hooks for {rd,wr}msr_safe_regs  Andre Przywara  2012-06-07
    There were paravirt_ops hooks for the full register set variant of
    {rd,wr}msr_safe which are actually not used by anyone anymore. Remove
    them to make the code cleaner and avoid silent breakages when the
    pvops members were uninitialized.

    This has been boot-tested natively and under Xen with PVOPS enabled
    and disabled on one machine.

    Signed-off-by: Andre Przywara <andre.przywara@amd.com>
    Link: http://lkml.kernel.org/r/1338562358-28182-2-git-send-email-bp@amd64.org
    Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
| * | | | | xen/p2m: Fix one-off error in checking the P2M tree directory.  Konrad Rzeszutek Wilk  2012-09-05
    We would traverse the full P2M top directory (from 0->MAX_DOMAIN_PAGES
    inclusive) when trying to figure out whether we can re-use some of
    the P2M middle leafs. Which meant that if the kernel was compiled
    with MAX_DOMAIN_PAGES=512 we would try to use the 512th entry.
    Fortunately for us the p2m_top_index has a check for this:

      BUG_ON(pfn >= MAX_P2M_PFN);

    which we hit, and saw this:

    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
    (XEN) ----[ Xen-4.1.2-OVM  x86_64  debug=n  Tainted:    C ]----
    (XEN) CPU:    0
    (XEN) RIP:    e033:[<ffffffff819cadeb>]
    (XEN) RFLAGS: 0000000000000212   EM: 1   CONTEXT: pv guest
    (XEN) rax: ffffffff81db5000   rbx: ffffffff81db4000   rcx: 0000000000000000
    (XEN) rdx: 0000000000480211   rsi: 0000000000000000   rdi: ffffffff81db4000
    (XEN) rbp: ffffffff81793db8   rsp: ffffffff81793d38   r8:  0000000008000000
    (XEN) r9:  4000000000000000   r10: 0000000000000000   r11: ffffffff81db7000
    (XEN) r12: 0000000000000ff8   r13: ffffffff81df1ff8   r14: ffffffff81db6000
    (XEN) r15: 0000000000000ff8   cr0: 000000008005003b   cr4: 00000000000026f0
    (XEN) cr3: 0000000661795000   cr2: 0000000000000000

    Fixes-Oracle-Bug: 14570662
    CC: stable@vger.kernel.org # only for v3.5
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.  Konrad Rzeszutek Wilk  2012-08-23
    When we are finished with returning PFNs to the hypervisor, then
    populating them back, and also marking the E820 MMIO and E820 gaps as
    IDENTITY_FRAMEs, we then call P2M to set areas that can be used for
    ballooning. We were off by one, and ended up over-writing a P2M entry
    that most likely was an IDENTITY_FRAME. For example:

    1-1 mapping on 40000->40200
    1-1 mapping on bc558->bc5ac
    1-1 mapping on bc5b4->bc8c5
    1-1 mapping on bc8c6->bcb7c
    1-1 mapping on bcd00->100000
    Released 614 pages of unused memory
    Set 277889 page(s) to 1-1 mapping
    Populating 40200-40466 pfn range: 614 pages added

    => here we set from 40466 up to bc559 the P2M tree to be
    INVALID_P2M_ENTRY. We should have done it up to bc558.

    The end result is that if anybody is trying to construct a PTE for
    PFN bc558 they end up with ~PAGE_PRESENT.

    CC: stable@vger.kernel.org
    Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.  Konrad Rzeszutek Wilk  2012-08-17
    If a P2M leaf is completely packed with INVALID_P2M_ENTRY or with 1:1
    PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf with
    either p2m_missing or p2m_identity respectively. The old page (which
    was created via extend_brk or was grafted on from the mfn_list) can
    be re-used for setting new PFNs.

    This also means we can remove git commit
    5bc6f9888db5739abfa0cae279b4b442e4db8049
    ("xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating
    back"), which tried to fix this, and make the amount that is required
    to be reserved much smaller.

    CC: stable@vger.kernel.org # for 3.5 only.
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
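    As an illustration of the "completely packed" test described above (a
    hypothetical helper, not the literal patch; it reuses the P2M code's
    IDENTITY_FRAME() and P2M_PER_PAGE definitions):

        static bool p2m_leaf_is_all_identity(unsigned long *leaf, unsigned long pfn)
        {
                unsigned int i;

                for (i = 0; i < P2M_PER_PAGE; i++)
                        if (leaf[i] != IDENTITY_FRAME(pfn + i))
                                return false;   /* mixed content: keep the leaf */

                return true;                    /* can be swapped for p2m_identity */
        }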
| * | | | | Revert "xen PVonHVM: move shared_info to MMIO before kexec"  Konrad Rzeszutek Wilk  2012-08-16
    This reverts commit 00e37bdb0113a98408de42db85be002f21dbffd3.

    During shutdown of PVHVM guests with more than 2 VCPUs on certain
    machines we can hit the race where the replaced shared_info is not
    replaced fast enough and the PV time clock retries reading the same
    area over and over without any success and is stuck in an infinite
    loop.

    Acked-by: Olaf Hering <olaf@aepfle.de>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.  Konrad Rzeszutek Wilk  2012-08-02
    When we release pages back during bootup:

    Freeing  9d-100 pfn range: 99 pages freed
    Freeing  9cf36-9d0d2 pfn range: 412 pages freed
    Freeing  9f6bd-9f6bf pfn range: 2 pages freed
    Freeing  9f714-9f7bf pfn range: 171 pages freed
    Freeing  9f7e0-9f7ff pfn range: 31 pages freed
    Freeing  9f800-100000 pfn range: 395264 pages freed
    Released 395979 pages of unused memory

    we then try to populate those pages back. In the P2M tree however the
    space for those leafs must be reserved - as such we use extend_brk.
    We reserve 8MB of _brk space, which means we can fit over 1048576
    PFNs - which is more than we should ever need.

    Without this, on certain compilations of the kernel we would hit:

    (XEN) domain_crash_sync called from entry.S
    (XEN) CPU:    0
    (XEN) RIP:    e033:[<ffffffff818aad3b>]
    (XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest
    (XEN) rax: ffffffff81a7c000   rbx: 000000000000003d   rcx: 0000000000001000
    (XEN) rdx: ffffffff81a7b000   rsi: 0000000000001000   rdi: 0000000000001000
    (XEN) rbp: ffffffff81801cd8   rsp: ffffffff81801c98   r8:  0000000000100000
    (XEN) r9:  ffffffff81a7a000   r10: 0000000000000001   r11: 0000000000000003
    (XEN) r12: 0000000000000004   r13: 0000000000000004   r14: 000000000000003d
    (XEN) r15: 00000000000001e8   cr0: 000000008005003b   cr4: 00000000000006f0
    (XEN) cr3: 0000000125803000   cr2: 0000000000000000
    (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
    (XEN) Guest stack trace from rsp=ffffffff81801c98:
    ..

    which is extend_brk hitting a BUG_ON.

    Interestingly enough, most of the time we are not going to hit this
    because the _brk space is quite large (v3.5):

     ffffffff81a25000 B __brk_base
     ffffffff81e43000 B __brk_limit

    = ~4MB.

    vs earlier kernels (with this back-ported), where the space is
    smaller:

     ffffffff81a25000 B __brk_base
     ffffffff81a7b000 B __brk_limit

    = 344 kBytes,

    where we would certainly hit this and hit extend_brk.

    Note that git commit c3d93f880197953f86ab90d9da4744e926b38e33
    ("xen: populate correct number of pages when across mem boundary
    (v2)") exposed this bug.

    [v1: Made it 8MB of _brk space instead of 4MB per Jan's suggestion]
    CC: stable@vger.kernel.org #only for 3.5
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen: populate correct number of pages when across mem boundary (v2)  zhenzhong.duan  2012-07-19
    When populating pages across a mem boundary at bootup, the populated
    page count isn't correct. This is because memory gets populated into
    a non-mem region and is then ignored.

    The pfn range is also wrongly aligned when the mem boundary isn't
    page aligned.

    For a dom0 booted with dom_mem=3368952K (0xcd9ff000-4k) the dmesg
    diff is:

     [    0.000000] Freeing  9e-100 pfn range: 98 pages freed
     [    0.000000] 1-1 mapping on 9e->100
     [    0.000000] 1-1 mapping on cd9ff->100000
     [    0.000000] Released 98 pages of unused memory
     [    0.000000] Set 206435 page(s) to 1-1 mapping
    -[    0.000000] Populating cd9fe-cda00 pfn range: 1 pages added
    +[    0.000000] Populating cd9fe-cd9ff pfn range: 1 pages added
    +[    0.000000] Populating 100000-100061 pfn range: 97 pages added
     [    0.000000] BIOS-provided physical RAM map:
     [    0.000000] Xen: 0000000000000000 - 000000000009e000 (usable)
     [    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
     [    0.000000] Xen: 0000000000100000 - 00000000cd9ff000 (usable)
     [    0.000000] Xen: 00000000cd9ffc00 - 00000000cda53c00 (ACPI NVS)
    ...
     [    0.000000] Xen: 0000000100000000 - 0000000100061000 (usable)
     [    0.000000] Xen: 0000000100061000 - 000000012c000000 (unusable)
    ...
     [    0.000000] MEMBLOCK configuration:
    ...
    -[    0.000000]  reserved[0x4]  [0x000000cd9ff000-0x000000cd9ffbff], 0xc00 bytes
    -[    0.000000]  reserved[0x5]  [0x00000100000000-0x00000100060fff], 0x61000 bytes

    Related xen memory layout:
    (XEN) Xen-e820 RAM map:
    (XEN)  0000000000000000 - 000000000009ec00 (usable)
    (XEN)  00000000000f0000 - 0000000000100000 (reserved)
    (XEN)  0000000000100000 - 00000000cd9ffc00 (usable)

    Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
    [v2: If xen_do_chunk fails (populate), abort this chunk and any others]
    Suggested by David, thanks.
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen PVonHVM: move shared_info to MMIO before kexec  Olaf Hering  2012-07-19
    Currently kexec in a PVonHVM guest fails with a triple fault because
    the new kernel overwrites the shared info page. The exact failure
    depends on the size of the kernel image. This patch moves the pfn
    from RAM into MMIO space before the kexec boot.

    The pfn containing the shared_info is located somewhere in RAM. This
    will cause trouble if the current kernel is doing a kexec boot into a
    new kernel. The new kernel (and its startup code) can not know where
    the pfn is, so it can not reserve the page. The hypervisor will
    continue to update the pfn, and as a result memory corruption occurs
    in the new kernel.

    One way to work around this issue is to allocate a page in the
    xen-platform pci device's BAR memory range. But pci init is done very
    late and the shared_info page is already in use very early to read
    the pvclock. So moving the pfn from RAM to MMIO is racy because some
    code paths on other vcpus could access the pfn during the small
    window when the old pfn is moved to the new pfn. There is even a
    small window where the old pfn is not backed by a mfn, and during
    that time all reads return -1.

    Because it is not known upfront where the MMIO region is located it
    can not be used right from the start in xen_hvm_init_shared_info. To
    minimise trouble the move of the pfn is done shortly before kexec.
    This does not eliminate the race because all vcpus are still online
    when the syscore_ops will be called. But hopefully there is no work
    pending at this point in time. Also the syscore_op is run last, which
    reduces the risk further.

    Signed-off-by: Olaf Hering <olaf@aepfle.de>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen: simplify init_hvm_pv_info  Olaf Hering  2012-07-19
    init_hvm_pv_info is called only in PVonHVM context, so move it into
    the ifdef. init_hvm_pv_info does not fail, so make it a void
    function. Remove the arguments from init_hvm_pv_info because they
    are not used by the caller.

    Signed-off-by: Olaf Hering <olaf@aepfle.de>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen: remove cast from HYPERVISOR_shared_info assignment  Olaf Hering  2012-07-19
    Both have type struct shared_info so no cast is needed.

    Signed-off-by: Olaf Hering <olaf@aepfle.de>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen/x86: avoid updating TLS descriptors if they haven't changed  David Vrabel  2012-07-19
    When switching tasks in a Xen PV guest, avoid updating the TLS
    descriptors if they haven't changed. This improves the speed of
    context switches by almost 10%, as much of the time the descriptors
    are the same or only one is different.

    The descriptors written into the GDT by Xen are modified from the
    values passed in the update_descriptor hypercall, so we keep shadow
    copies of the three TLS descriptors to compare against.

    lmbench3 test       Before    After     Improvement
    ---------------------------------------------------
    lat_ctx -s 32 24    7.19      6.52      9%
    lat_pipe            12.56     11.66     7%

    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen/x86: add desc_equal() to compare GDT descriptors  David Vrabel  2012-07-19
    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    [v1: Moving it to the Xen file]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
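    A sketch of such a comparison helper (assuming the a/b view of struct
    desc_struct available in 3.x kernels; the exact field layout varies by
    kernel version):

        static inline bool desc_equal(const struct desc_struct *d1,
                                      const struct desc_struct *d2)
        {
                /* two GDT descriptors are equal iff both 32-bit halves match */
                return d1->a == d2->a && d1->b == d2->b;
        }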
| * | | xen/mm: zero PTEs for non-present MFNs in the initial page table  David Vrabel  2012-07-19
    When constructing the initial page tables, if the MFN for a usable
    PFN is missing in the p2m then that frame is initially ballooned out.
    In this case, zero the PTE (as in decrease_reservation() in
    drivers/xen/balloon.c). This is obviously safer than having a valid
    PTE with an MFN of INVALID_P2M_ENTRY (~0).

    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
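    Roughly, the rule described above amounts to the following when a PTE for
    the initial page table is built (illustrative fragment, not the patch
    itself):

        unsigned long mfn = pfn_to_mfn(pfn);
        pte_t pte;

        if (mfn == INVALID_P2M_ENTRY)
                pte = __pte(0);                  /* frame ballooned out: not present */
        else
                pte = mfn_pte(mfn, PAGE_KERNEL); /* normal mapping */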
| * | | xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable  David Vrabel  2012-07-19
    In xen_set_pte(), if batching is unavailable (because the caller is
    in an interrupt context such as handling a page fault) it would fall
    back to using native_set_pte() and trapping and emulating the PTE
    write. On 32-bit guests this requires two traps for each PTE write
    (one for each dword of the PTE). Instead, do one mmu_update hypercall
    directly.

    During construction of the initial page tables, continue to use
    native_set_pte() because most of the PTEs being set are in writable
    and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
    using a hypercall for this is very expensive.

    This significantly improves page fault performance in 32-bit PV
    guests.

    lmbench3 test       Before     After       Improvement
    -------------------------------------------------------
    lat_pagefault       3.18 us    2.32 us     27%
    lat_proc fork       356 us     313.3 us    11%

    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
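    A simplified sketch of the direct-hypercall approach described above (the
    function name is illustrative and the call sites/error handling in
    arch/x86/xen/mmu.c are omitted):

        static void xen_set_pte_one_hypercall(pte_t *ptep, pte_t pteval)
        {
                struct mmu_update u;

                u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
                u.val = pte_val_ma(pteval);

                /* one hypercall updates both dwords of the PTE at once */
                if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
                        native_set_pte(ptep, pteval);   /* trap-and-emulate fallback */
        }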
| * | | xen/mce: Register native mce handler as vMCE bounce back point  Liu, Jinsong  2012-07-19
    When the Xen hypervisor injects a vMCE into a guest, use the native
    mce handler to handle it.

    Signed-off-by: Ke, Liping <liping.ke@intel.com>
    Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
    Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
    Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | xen/mce: Add mcelog support for Xen platform  Liu, Jinsong  2012-07-19
    When an MCA error occurs, it is handled by the Xen hypervisor first,
    and then the error information is sent to the initial domain for
    logging.

    This patch gets the error information from the Xen hypervisor and
    converts the Xen-format error into a Linux-format mcelog record. This
    logic is basically self-contained, not touching other kernel
    components. By using tools like the mcelog tool, users can read
    specific error information, like they did under native Linux.

    To test, follow the directions outlined in
    Documentation/acpi/apei/einj.txt.

    Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
    Signed-off-by: Ke, Liping <liping.ke@intel.com>
    Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
    Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
    Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | xen/mm: return more precise error from xen_remap_domain_range()  David Vrabel  2012-09-05
    Callers of xen_remap_domain_range() need to know if the remap failed
    because the frame is currently paged out, so that they can retry the
    remap later on. Return -ENOENT in this case.

    This assumes that the error codes returned by Xen are a subset of
    those used by the kernel. It is unclear if this is defined as part of
    the hypercall ABI.

    Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
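    A hypothetical caller-side use of the new error code (the retry policy and
    the local variable names are illustrative, not taken from the patch):

        int rc;

        do {
                rc = xen_remap_domain_mfn_range(vma, addr, mfn, nr, prot, domid);
                if (rc == -ENOENT)
                        cond_resched();   /* frame is paged out; try again later */
        } while (rc == -ENOENT);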
* | | xen/swiotlb: Fix compile warnings when using plain integer instead of NULL pointer.  Konrad Rzeszutek Wilk  2012-08-23
    arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
    arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | xen: allow privcmd for HVM guests  Stefano Stabellini  2012-08-23
    This patch removes the "return -ENOSYS" for auto_translated_physmap
    guests from privcmd_mmap, thus allowing ARM guests to issue privcmd
    mmap calls. However privcmd mmap calls are still going to fail for
    HVM and hybrid guests on x86 because the xen_remap_domain_mfn_range
    implementation is currently PV only.

    Changes in v2:
    - better commit message;
    - return -EINVAL from xen_remap_domain_mfn_range if
      auto_translated_physmap.

    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
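    The v2 change mentioned above boils down to a guard of this shape at the
    top of the PV-only remap path (a sketch, not the exact patch hunk):

        #include <xen/features.h>

        /* auto-translated (HVM/ARM) guests have no PV MFN remapping */
        if (xen_feature(XENFEAT_auto_translated_physmap))
                return -EINVAL;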
* | | xen/apic/xenbus/swiotlb/pcifront/grant/tmem: Make functions or variables static.  Konrad Rzeszutek Wilk  2012-08-21
    There is no need for those functions/variables to be visible. Make
    them static and also fix the compile warnings of this sort:

    drivers/xen/<some file>.c: warning: symbol '<blah>' was not declared. Should it be static?

    Some of them just require including the header file that declares the
    functions.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | xen/perf: Define .glob for the different hypercalls.  Konrad Rzeszutek Wilk  2012-07-30
    This allows us in perf to have this:

     99.67%  [kernel]  [k] xen_hypercall_sched_op
      0.11%  [kernel]  [k] xen_hypercall_xen_version

    instead of the boring, ever-encompassing:

     99.13%  [kernel]  [k] hypercall_page

    [v2: Use a macro to define the name and skip]
    [v3: Use balign per Jan's suggestion]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>