aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/mm
Commit message (Collapse)AuthorAge
...
| * | x86: Don't include linux/irq.h from asm/hardirq.hNicolai Stange2018-08-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/topology.h linux/smp.h asm/smp.h or linux/gfp.h linux/mmzone.h asm/mmzone.h asm/mmzone_64.h asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/irqdesc.h linux/kobject.h linux/sysfs.h linux/kernfs.h linux/idr.h linux/gfp.h and others. This causes compilation errors because of the header guards becoming effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both, asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/harirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that there are some other archs, like e.g. arm64, which don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86' asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. to their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | x86/speculation/l1tf: Protect PAE swap entries against L1TFVlastimil Babka2018-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for offset. The lower word is zeroed, thus systems with less than 4GB memory are safe. With 4GB to 128GB the swap type selects the memory locations vulnerable to L1TF; with even more memory, also the swap offfset influences the address. This might be a problem with 32bit PAE guests running on large 64bit hosts. By continuing to keep the whole swap entry in either high or low 32bit word of PTE we would limit the swap size too much. Thus this patch uses the whole PAE PTE with the same layout as the 64bit version does. The macros just become a bit tricky since they assume the arch-dependent swp_entry_t to be 32bit. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com>
| * | x86/speculation/l1tf: Extend 64bit swap file size limitVlastimil Babka2018-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch has limited swap file size so that large offsets cannot clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation. It assumed that offsets are encoded starting with bit 12, same as pfn. But on x86_64, offsets are encoded starting with bit 9. Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA and 256TB with 46bit MAX_PA. Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | x86/speculation/l1tf: Limit swap file size to MAX_PA/2Andi Kleen2018-06-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
| * | x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappingsAndi Kleen2018-06-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For L1TF PROT_NONE mappings are protected by inverting the PFN in the page table entry. This sets the high bits in the CPU's address space, thus making sure to point to not point an unmapped entry to valid cached memory. Some server system BIOSes put the MMIO mappings high up in the physical address space. If such an high mapping was mapped to unprivileged users they could attack low memory by setting such a mapping to PROT_NONE. This could happen through a special device driver which is not access protected. Normal /dev/mem is of course access protected. To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings. Valid page mappings are allowed because the system is then unsafe anyways. It's not expected that users commonly use PROT_NONE on MMIO. But to minimize any impact this is only enforced if the mapping actually refers to a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip the check for root. For mmaps this is straight forward and can be handled in vm_insert_pfn and in remap_pfn_range(). For mprotect it's a bit trickier. At the point where the actual PTEs are accessed a lot of state has been changed and it would be difficult to undo on an error. Since this is a uncommon case use a separate early page talk walk pass for MMIO PROT_NONE mappings that checks for this condition early. For non MMIO and non PROT_NONE there are no changes. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
* | | Merge branch 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2018-08-13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull x86 PTI updates from Thomas Gleixner: "The Speck brigade sadly provides yet another large set of patches destroying the perfomance which we carefully built and preserved - PTI support for 32bit PAE. The missing counter part to the 64bit PTI code implemented by Joerg. - A set of fixes for the Global Bit mechanics for non PCID CPUs which were setting the Global Bit too widely and therefore possibly exposing interesting memory needlessly. - Protection against userspace-userspace SpectreRSB - Support for the upcoming Enhanced IBRS mode, which is preferred over IBRS. Unfortunately we dont know the performance impact of this, but it's expected to be less horrible than the IBRS hammering. - Cleanups and simplifications" * 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) x86/mm/pti: Move user W+X check into pti_finalize() x86/relocs: Add __end_rodata_aligned to S_REL x86/mm/pti: Clone kernel-image on PTE level for 32 bit x86/mm/pti: Don't clear permissions in pti_clone_pmd() x86/mm/pti: Fix 32 bit PCID check x86/mm/init: Remove freed kernel image areas from alias mapping x86/mm/init: Add helper for freeing kernel image pages x86/mm/init: Pass unconverted symbol addresses to free_init_pages() mm: Allow non-direct-map arguments to free_reserved_area() x86/mm/pti: Clear Global bit more aggressively x86/speculation: Support Enhanced IBRS on future CPUs x86/speculation: Protect against userspace-userspace spectreRSB x86/kexec: Allocate 8k PGDs for PTI Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables" x86/mm: Remove in_nmi() warning from vmalloc_fault() x86/entry/32: Check for VM86 mode in slow-path check perf/core: Make sure the ring-buffer is mapped in all page-tables x86/pti: Check the return value of pti_user_pagetable_walk_pmd() x86/pti: Check the return value of pti_user_pagetable_walk_p4d() x86/entry/32: Add debug code to check entry/exit CR3 ...
| * | | x86/mm/pti: Move user W+X check into pti_finalize()Joerg Roedel2018-08-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The user page-table gets the updated kernel mappings in pti_finalize(), which runs after the RO+X permissions got applied to the kernel page-table in mark_readonly(). But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in mark_readonly() for insecure mappings. This causes false-positive warnings, because the user page-table did not get the updated mappings yet. Move the W+X check for the user page-table into pti_finalize() after it updated all required mappings. [ tglx: Folded !NX supported fix ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Clone kernel-image on PTE level for 32 bitJoerg Roedel2018-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32 bit the kernel sections are not huge-page aligned. When we clone them on PMD-level we unevitably map some areas that are normal kernel memory and may contain secrets to user-space. To prevent that we need to clone the kernel-image on PTE-level for 32 bit. Also make the page-table cloning code more general so that it can handle PMD and PTE level cloning. This can be generalized further in the future to also handle clones on the P4D-level. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1533637471-30953-4-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Don't clear permissions in pti_clone_pmd()Joerg Roedel2018-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function sets the global-bit on cloned PMD entries, which only makes sense when the permissions are identical between the user and the kernel page-table. Further, only write-permissions are cleared for entry-text and kernel-text sections, which are not writeable at the end of the boot process. The reason why this RW clearing exists is that in the early PTI implementations the cloned kernel areas were set up during early boot before the kernel text is set to read only and not touched afterwards. This is not longer true. The cloned areas are still set up early to get the entry code working for interrupts and other things, but after the kernel text has been set RO the clone is repeated which copies the RO PMD/PTEs over to the user visible clone. That means the initial clearing of the writable bit can be avoided. [ tglx: Amended changelog ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1533637471-30953-3-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Fix 32 bit PCID checkJoerg Roedel2018-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check uses the wrong operator and causes false positive warnings in the kernel log on some systems. Fixes: 5e8105950a8b3 ('x86/mm/pti: Add Warning when booting on a PCID capable CPU') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1533637471-30953-2-git-send-email-joro@8bytes.org
| * | | Merge branch 'x86/pti-urgent' into x86/ptiThomas Gleixner2018-08-06
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Integrate the PTI Global bit fixes which conflict with the 32bit PTI support. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| | * | | x86/mm/init: Remove freed kernel image areas from alias mappingDave Hansen2018-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel image is mapped into two places in the virtual address space (addresses without KASLR, of course): 1. The kernel direct map (0xffff880000000000) 2. The "high kernel map" (0xffffffff81000000) We actually execute out of #2. If we get the address of a kernel symbol, it points to #2, but almost all physical-to-virtual translations point to Parts of the "high kernel map" alias are mapped in the userspace page tables with the Global bit for performance reasons. The parts that we map to userspace do not (er, should not) have secrets. When PTI is enabled then the global bit is usually not set in the high mapping and just used to compensate for poor performance on systems which lack PCID. This is fine, except that some areas in the kernel image that are adjacent to the non-secret-containing areas are unused holes. We free these holes back into the normal page allocator and reuse them as normal kernel memory. The memory will, of course, get *used* via the normal map, but the alias mapping is kept. This otherwise unused alias mapping of the holes will, by default keep the Global bit, be mapped out to userspace, and be vulnerable to Meltdown. Remove the alias mapping of these pages entirely. This is likely to fracture the 2M page mapping the kernel image near these areas, but this should affect a minority of the area. The pageattr code changes *all* aliases mapping the physical pages that it operates on (by default). We only want to modify a single alias, so we need to tweak its behavior. This unmapping behavior is currently dependent on PTI being in place. Going forward, we should at least consider doing this for all configurations. Having an extra read-write alias for memory is not exactly ideal for debugging things like random memory corruption and this does undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page Frame Ownership (XPFO). Before this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd After this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte current_user-0xffffffff82488000-0xffffffff82600000 1504K pte current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd [ tglx: Do not unmap on 32bit as there is only one mapping ] Fixes: 0f561fce4d69 ("x86/pti: Enable global pages for shared areas") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
| | * | | x86/mm/init: Add helper for freeing kernel image pagesDave Hansen2018-08-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When chunks of the kernel image are freed, free_init_pages() is used directly. Consolidate the three sites that do this. Also update the string to give an incrementally better description of that memory versus what was there before. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@google.com Cc: aarcange@redhat.com Cc: jgross@suse.com Cc: jpoimboe@redhat.com Cc: gregkh@linuxfoundation.org Cc: peterz@infradead.org Cc: hughd@google.com Cc: torvalds@linux-foundation.org Cc: bp@alien8.de Cc: luto@kernel.org Cc: ak@linux.intel.com Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
| | * | | x86/mm/init: Pass unconverted symbol addresses to free_init_pages()Dave Hansen2018-08-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The x86 code has several places where it frees parts of kernel image: 1. Unused SMP alternative 2. __init code 3. The hole between text and rodata 4. The hole between rodata and data We call free_init_pages() to do this. Strangely, we convert the symbol addresses to kernel direct map addresses in some cases (#3, #4) but not others (#1, #2). The virt_to_page() and the other code in free_reserved_area() now works fine for for symbol addresses on x86, so don't bother converting the addresses to direct map addresses before freeing them. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@google.com Cc: aarcange@redhat.com Cc: jgross@suse.com Cc: jpoimboe@redhat.com Cc: gregkh@linuxfoundation.org Cc: peterz@infradead.org Cc: hughd@google.com Cc: torvalds@linux-foundation.org Cc: bp@alien8.de Cc: luto@kernel.org Cc: ak@linux.intel.com Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20180802225828.89B2D0E2@viggo.jf.intel.com
| | * | | x86/mm/pti: Clear Global bit more aggressivelyDave Hansen2018-08-05
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel image starts out with the Global bit set across the entire kernel image. The bit is cleared with set_memory_nonglobal() in the configurations with PCIDs where the performance benefits of the Global bit are not needed. However, this is fragile. It means that we are stuck opting *out* of the less-secure (Global bit set) configuration, which seems backwards. Let's start more secure (Global bit clear) and then let things opt back in if they want performance, or are truly mapping common data between kernel and userspace. This fixes a bug. Before this patch, there are areas that are unmapped from the user page tables (like like everything above 0xffffffff82600000 in the example below). These have the hallmark of being a wrong Global area: they are not identical in the 'current_kernel' and 'current_user' page table dumps. They are also read-write, which means they're much more likely to contain secrets. Before this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE GLB NX pmd current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW GLB NX pte current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE GLB NX pmd current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd After this patch: current_kernel:---[ High Kernel Mapping ]--- current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd current_user:---[ High Kernel Mapping ]--- current_user-0xffffffff80000000-0xffffffff81000000 16M pmd current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd Fixes: 0f561fce4d69 ("x86/pti: Enable global pages for shared areas") Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@google.com Cc: aarcange@redhat.com Cc: jgross@suse.com Cc: jpoimboe@redhat.com Cc: gregkh@linuxfoundation.org Cc: peterz@infradead.org Cc: torvalds@linux-foundation.org Cc: bp@alien8.de Cc: luto@kernel.org Cc: ak@linux.intel.com Cc: Kees Cook <keescook@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20180802225825.A100C071@viggo.jf.intel.com
| * | | x86/mm: Remove in_nmi() warning from vmalloc_fault()Joerg Roedel2018-07-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is perfectly okay to take page-faults, especially on the vmalloc area while executing an NMI handler. Remove the warning. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: David H. Gutteridge <dhgutteridge@sympatico.ca> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1532533683-5988-2-git-send-email-joro@8bytes.org
| * | | x86/pti: Check the return value of pti_user_pagetable_walk_pmd()Jiang Biao2018-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pti_user_pagetable_walk_pmd() can return NULL, so the return value should be checked to prevent a NULL pointer dereference. Add the check and a warning when the PMD allocation fails. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: hpa@zytor.com Cc: albcamus@gmail.com Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1532045192-49622-2-git-send-email-jiang.biao2@zte.com.cn
| * | | x86/pti: Check the return value of pti_user_pagetable_walk_p4d()Jiang Biao2018-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pti_user_pagetable_walk_p4d() can return NULL, so the return value should be checked to prevent a NULL pointer dereference. Add the check and a warning when the P4D allocation fails. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: hpa@zytor.com Cc: albcamus@gmail.com Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1532045192-49622-1-git-send-email-jiang.biao2@zte.com.cn
| * | | x86/mm/pti: Add Warning when booting on a PCID capable CPUJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Warn the user in case the performance can be significantly improved by switching to a 64-bit kernel. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-39-git-send-email-joro@8bytes.org
| * | | x86/ldt: Reserve address-space range on 32 bit for the LDTJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reserve 2MB/4MB of address-space for mapping the LDT to user-space on 32 bit PTI kernels. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-34-git-send-email-joro@8bytes.org
| * | | x86/pgtable/pae: Use separate kernel PMDs for user page-tableJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When PTI is enabled, separate kernel PMDs in the user page-table are required to map the per-process LDT for user-space. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-33-git-send-email-joro@8bytes.org
| * | | x86/mm/dump_pagetables: Define INIT_PGDJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define INIT_PGD to point to the correct initial page-table for 32 and 64 bit and use it where needed. This fixes the build on 32 bit with CONFIG_PAGE_TABLE_ISOLATION enabled. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-32-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Clone entry-text again in pti_finalize()Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mapping for entry-text might have changed in the kernel after it was cloned to the user page-table. Clone again to update the user page-table to bring the mapping in sync with the kernel again. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-31-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Introduce pti_finalize()Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new function to finalize the kernel mappings for the userspace page-table after all ro/nx protections have been applied to the kernel mappings. Also move the call to pti_clone_kernel_text() to that function so that it will run on 32 bit kernels too. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-30-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Keep permissions when cloning kernel text in pti_clone_kernel_text()Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mapping the kernel text area to user-space makes only sense if it has the same permissions as in the kernel page-table. If permissions are different this will cause a TLB reload when using the kernel page-table, which is as good as not mapping it at all. On 64-bit kernels this patch makes no difference, as the whole range cloned by pti_clone_kernel_text() is mapped RO anyway. On 32 bit there are writeable mappings in the range, so just keep the permissions as they are. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-29-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Make pti_clone_kernel_text() compile on 32 bitJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pti_clone_kernel_text() function references __end_rodata_hpage_align, which is only present on x86-64. This makes sense as the end of the rodata section is not huge-page aligned on 32 bit. Nevertheless a symbol is required for the function that points at the right address for both 32 and 64 bit. Introduce __end_rodata_aligned for that purpose and use it in pti_clone_kernel_text(). Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-28-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cloning on the P4D level would clone the complete kernel address space into the user-space page-tables for PAE kernels. Cloning on PMD level is fine for PAE and legacy paging. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-27-git-send-email-joro@8bytes.org
| * | | x86/mm/pti: Add an overflow check to pti_clone_pmds()Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The addr counter will overflow if the last PMD of the address space is cloned, resulting in an endless loop. Check for that and bail out of the loop when it happens. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-25-git-send-email-joro@8bytes.org
| * | | x86/pgtable/32: Allocate 8k page-tables when PTI is enabledJoerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocate a kernel and a user page-table root when PTI is enabled. Also allocate a full page per root for PAE because otherwise the bit to flip in CR3 to switch between them would be non-constant, which creates a lot of hassle. Keep that for a later optimization. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-18-git-send-email-joro@8bytes.org
| * | | x86/pgtable: Rename pti_set_user_pgd() to pti_set_user_pgtbl()Joerg Roedel2018-07-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The way page-table folding is implemented on 32 bit, these functions are not only setting, but also PUDs and even PMDs. Give the function a more generic name to reflect that. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-16-git-send-email-joro@8bytes.org
| * | | x86/pti: Make pti_set_kernel_image_nonglobal() staticJiang Biao2018-07-16
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pti_set_kernel_image_nonglobal() is only used in pti.c, make it static. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: luto@kernel.org Cc: hpa@zytor.com Cc: albcamus@gmail.com Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1531713820-24544-4-git-send-email-jiang.biao2@zte.com.cn
* | | Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2018-08-13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Thomas Gleixner: - Make lazy TLB mode even lazier to avoid pointless switch_mm() operations, which reduces CPU load by 1-2% for memcache workloads - Small cleanups and improvements all over the place * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove redundant check for kmem_cache_create() arm/asm/tlb.h: Fix build error implicit func declaration x86/mm/tlb: Make clear_asid_other() static x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off() x86/mm/tlb: Always use lazy TLB mode x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Restructure switch_mm_irqs_off() x86/mm/tlb: Leave lazy TLB mode at page table free time mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids x86/mm: Add TLB purge to free pmd/pte page interfaces ioremap: Update pgtable free interfaces with addr x86/mm: Disable ioremap free page handling on x86-PAE
| * | | x86/mm: Remove redundant check for kmem_cache_create()Chengguang Xu2018-08-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flag 'SLAB_PANIC' implies panic on failure, So there is no need to check the returned pointer for NULL. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/1528804132-154948-1-git-send-email-cgxu519@gmx.com
| * | | x86/mm/tlb: Make clear_asid_other() staticzhong jiang2018-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes the following sparse warning: arch/x86/mm/tlb.c:38:6: warning: symbol 'clear_asid_other' was not declared. Should it be static? Signed-off-by: zhong jiang <zhongjiang@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@linux.intel.com Cc: kirill.shutemov@linux.intel.com Cc: tim.c.chen@linux.intel.com Link: http://lkml.kernel.org/r/1532159732-22939-1-git-send-email-zhongjiang@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()Rik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Song Liu noticed switch_mm_irqs_off() taking a lot of CPU time in recent kernels,using 1.8% of a 48 CPU system during a netperf to localhost run. Digging into the profile, we noticed that cpumask_clear_cpu and cpumask_set_cpu together take about half of the CPU time taken by switch_mm_irqs_off(). However, the CPUs running netperf end up switching back and forth between netperf and the idle task, which does not require changes to the mm_cpumask. Furthermore, the init_mm cpumask ends up being the most heavily contended one in the system. Simply skipping changes to mm_cpumask(&init_mm) reduces overhead. Reported-and-tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-8-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Always use lazy TLB modeRik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except at page table freeing time, and idle CPUs will no longer get shootdown IPIs for things like mprotect and madvise, we can always use lazy TLB mode. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUsRik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPUs in !is_lazy have either received TLB flush IPIs earlier on during the munmap (when the user memory was unmapped), or have context switched and reloaded during that stage of the munmap. Page table free TLB flushes only need to be sent to CPUs in lazy TLB mode, which TLB contents might not yet be up to date yet. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-6-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Make lazy TLB mode lazierRik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-5-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Restructure switch_mm_irqs_off()Rik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move some code that will be needed for the lazy -> !lazy state transition when a lazy TLB CPU has gotten out of date. No functional changes, since the if (real_prev == next) branch always returns. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm/tlb: Leave lazy TLB mode at page table free timeRik van Riel2018-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andy discovered that speculative memory accesses while in lazy TLB mode can crash a system, when a CPU tries to dereference a speculative access using memory contents that used to be valid page table memory, but have since been reused for something else and point into la-la land. The latter problem can be prevented in two ways. The first is to always send a TLB shootdown IPI to CPUs in lazy TLB mode, while the second one is to only send the TLB shootdown at page table freeing time. The second should result in fewer IPIs, since operationgs like mprotect and madvise are very common with some workloads, but do not involve page table freeing. Also, on munmap, batching of page table freeing covers much larger ranges of virtual memory than the batching of unmapped user pages. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/mm: Add TLB purge to free pmd/pte page interfacesToshi Kani2018-07-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioremap() calls pud_free_pmd_page() / pmd_free_pte_page() when it creates a pud / pmd map. The following preconditions are met at their entry. - All pte entries for a target pud/pmd address range have been cleared. - System-wide TLB purges have been peformed for a target pud/pmd address range. The preconditions assure that there is no stale TLB entry for the range. Speculation may not cache TLB entries since it requires all levels of page entries, including ptes, to have P & A-bits set for an associated address. However, speculation may cache pud/pmd entries (paging-structure caches) when they have P-bit set. Add a system-wide TLB purge (INVLPG) to a single page after clearing pud/pmd entry's P-bit. SDM 4.10.4.1, Operation that Invalidate TLBs and Paging-Structure Caches, states that: INVLPG invalidates all paging-structure caches associated with the current PCID regardless of the liner addresses to which they correspond. Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: mhocko@suse.com Cc: akpm@linux-foundation.org Cc: hpa@zytor.com Cc: cpandya@codeaurora.org Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: Joerg Roedel <joro@8bytes.org> Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20180627141348.21777-4-toshi.kani@hpe.com
| * | | ioremap: Update pgtable free interfaces with addrChintan Pandya2018-07-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following kernel panic was observed on ARM64 platform due to a stale TLB entry. 1. ioremap with 4K size, a valid pte page table is set. 2. iounmap it, its pte entry is set to 0. 3. ioremap the same address with 2M size, update its pmd entry with a new value. 4. CPU may hit an exception because the old pmd entry is still in TLB, which leads to a kernel panic. Commit b6bdb7517c3d ("mm/vmalloc: add interfaces to free unmapped page table") has addressed this panic by falling to pte mappings in the above case on ARM64. To support pmd mappings in all cases, TLB purge needs to be performed in this case on ARM64. Add a new arg, 'addr', to pud_free_pmd_page() and pmd_free_pte_page() so that TLB purge can be added later in seprate patches. [toshi.kani@hpe.com: merge changes, rewrite patch description] Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: mhocko@suse.com Cc: akpm@linux-foundation.org Cc: hpa@zytor.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: Will Deacon <will.deacon@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20180627141348.21777-3-toshi.kani@hpe.com
| * | | x86/mm: Disable ioremap free page handling on x86-PAEToshi Kani2018-07-04
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioremap() supports pmd mappings on x86-PAE. However, kernel's pmd tables are not shared among processes on x86-PAE. Therefore, any update to sync'd pmd entries need re-syncing. Freeing a pte page also leads to a vmalloc fault and hits the BUG_ON in vmalloc_sync_one(). Disable free page handling on x86-PAE. pud_free_pmd_page() and pmd_free_pte_page() simply return 0 if a given pud/pmd entry is present. This assures that ioremap() does not update sync'd pmd entries at the cost of falling back to pte mappings. Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") Reported-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: mhocko@suse.com Cc: akpm@linux-foundation.org Cc: hpa@zytor.com Cc: cpandya@codeaurora.org Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20180627141348.21777-2-toshi.kani@hpe.com
* | | x86/numa_emulation: Introduce uniform split capabilityDan Williams2018-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current NUMA emulation capabilities for splitting System RAM by a fixed size or by a set number of nodes may result in some nodes being larger than others. The implementation prioritizes establishing a minimum usable memory size over satisfying the requested number of NUMA nodes. Introduce a uniform split capability that evenly partitions each physical NUMA node into N emulated nodes. For example numa=fake=3U creates 6 emulated nodes total on a system that has 2 physical nodes. This capability is useful for debugging and evaluating platform memory-side-cache capabilities as described by the ACPI HMAT (see 5.2.27.5 Memory Side Cache Information Structure in ACPI 6.2a) Compare numa=fake=6 that results in only 5 nodes being created against numa=fake=3U which takes the 2 physical nodes and evenly divides them. numa=fake=6 available: 5 nodes (0-4) node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 node 0 size: 2648 MB node 0 free: 2443 MB node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 1 size: 2672 MB node 1 free: 2442 MB node 2 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 node 2 size: 5291 MB node 2 free: 5278 MB node 3 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 3 size: 2677 MB node 3 free: 2665 MB node 4 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 4 size: 2676 MB node 4 free: 2663 MB node distances: node 0 1 2 3 4 0: 10 20 10 20 20 1: 20 10 20 10 10 2: 10 20 10 20 20 3: 20 10 20 10 10 4: 20 10 20 10 10 numa=fake=3U available: 6 nodes (0-5) node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 node 0 size: 2900 MB node 0 free: 2637 MB node 1 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 node 1 size: 3023 MB node 1 free: 3012 MB node 2 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 node 2 size: 2015 MB node 2 free: 2004 MB node 3 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 3 size: 2704 MB node 3 free: 2522 MB node 4 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 4 size: 2709 MB node 4 free: 2698 MB node 5 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 node 5 size: 2612 MB node 5 free: 2601 MB node distances: node 0 1 2 3 4 5 0: 10 10 10 20 20 20 1: 10 10 10 20 20 20 2: 10 10 10 20 20 20 3: 20 20 20 10 10 10 4: 20 20 20 10 10 10 5: 20 20 20 10 10 10 Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/153089328617.27680.14930758266174305832.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | x86/numa_emulation: Fix emulated-to-physical node mappingDan Williams2018-07-06
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without this change the distance table calculation for emulated nodes may use the wrong numa node and report an incorrect distance. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/153089328103.27680.14778434392225818887.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/mm: Clean up the printk()s in show_fault_oops()Dmitry Vyukov2018-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | - Remove 'nx_warning' and 'smep_warning', which are just pointless obfuscation. - Also convert to pr_crit(). Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180627090715.28076-1-dvyukov@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/mm: Get rid of KERN_CONT in show_fault_oops()Dmitry Vyukov2018-06-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KERN_CONT leads to split lines in kernel output and complicates useful changes to printk like printing context before each line. Only acceptable use of continuations is basically boot-time testing. Get rid of it. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180625123808.227417-1-dvyukov@gmail.com [ Removed unnecessary parentheses and prettified the printk statement. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'linus' into x86/urgentThomas Gleixner2018-06-22
|\| | | | | | | Required to queue a dependent fix.
| * treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAXStefan Agner2018-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With PHYS_ADDR_MAX there is now a type safe variant for all bits set. Make use of it. Patch created using a semantic patch as follows: // <smpl> @@ typedef phys_addr_t; @@ -(phys_addr_t)ULLONG_MAX +PHYS_ADDR_MAX // </smpl> Link: http://lkml.kernel.org/r/20180419214204.19322-1-stefan@agner.ch Signed-off-by: Stefan Agner <stefan@agner.ch> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: fix devmem_is_allowed() for sub-page System RAM intersectionsDan Williams2018-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hussam reports: I was poking around and for no real reason, I did cat /dev/mem and strings /dev/mem. Then I saw the following warning in dmesg. I saved it and rebooted immediately. memremap attempted on mixed range 0x000000000009c000 size: 0x1000 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11810 at kernel/memremap.c:98 memremap+0x104/0x170 [..] Call Trace: xlate_dev_mem_ptr+0x25/0x40 read_mem+0x89/0x1a0 __vfs_read+0x36/0x170 The memremap() implementation checks for attempts to remap System RAM with MEMREMAP_WB and instead redirects those mapping attempts to the linear map. However, that only works if the physical address range being remapped is page aligned. In low memory we have situations like the following: 00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved ...where System RAM intersects Reserved ranges on a sub-page page granularity. Given that devmem_is_allowed() special cases any attempt to map System RAM in the first 1MB of memory, replace page_is_ram() with the more precise region_intersects() to trap attempts to map disallowed ranges. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199999 Link: http://lkml.kernel.org/r/152856436164.18127.2847888121707136898.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 92281dee825f ("arch: introduce memremap()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Hussam Al-Tayeb <me@hussam.eu.org> Tested-by: Hussam Al-Tayeb <me@hussam.eu.org> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>