aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAge
...
| * i387: fix x86-64 preemption-unsafe user stack save/restoreLinus Torvalds2012-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/954576 commit 15d8791cae75dca27bfda8ecfe87dca9379d6bb0 upstream. Commit 5b1cbac37798 ("i387: make irq_fpu_usable() tests more robust") added a sanity check to the #NM handler to verify that we never cause the "Device Not Available" exception in kernel mode. However, that check actually pinpointed a (fundamental) race where we do cause that exception as part of the signal stack FPU state save/restore code. Because we use the floating point instructions themselves to save and restore state directly from user mode, we cannot do that atomically with testing the TS_USEDFPU bit: the user mode access itself may cause a page fault, which causes a task switch, which saves and restores the FP/MMX state from the kernel buffers. This kind of "recursive" FP state save is fine per se, but it means that when the signal stack save/restore gets restarted, it will now take the '#NM' exception we originally tried to avoid. With preemption this can happen even without the page fault - but because of the user access, we cannot just disable preemption around the save/restore instruction. There are various ways to solve this, including using the "enable/disable_page_fault()" helpers to not allow page faults at all during the sequence, and fall back to copying things by hand without the use of the native FP state save/restore instructions. However, the simplest thing to do is to just allow the #NM from kernel space, but fix the race in setting and clearing CR0.TS that this all exposed: the TS bit changes and the TS_USEDFPU bit absolutely have to be atomic wrt scheduling, so while the actual state save/restore can be interrupted and restarted, the act of actually clearing/setting CR0.TS and the TS_USEDFPU bit together must not. Instead of just adding random "preempt_disable/enable()" calls to what is already excessively ugly code, this introduces some helper functions that mostly mirror the "kernel_fpu_begin/end()" functionality, just for the user state instead. Those helper functions should probably eventually replace the other ad-hoc CR0.TS and TS_USEDFPU tests too, but I'll need to think about it some more: the task switching functionality in particular needs to expose the difference between the 'prev' and 'next' threads, while the new helper functions intentionally were written to only work with 'current'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * i387: fix sense of sanity checkLinus Torvalds2012-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/954576 commit c38e23456278e967f094b08247ffc3711b1029b2 upstream. The check for save_init_fpu() (introduced in commit 5b1cbac37798: "i387: make irq_fpu_usable() tests more robust") was the wrong way around, but I hadn't noticed, because my "tests" were bogus: the FPU exceptions are disabled by default, so even doing a divide by zero never actually triggers this code at all unless you do extra work to enable them. So if anybody did enable them, they'd get one spurious warning. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * i387: make irq_fpu_usable() tests more robustLinus Torvalds2012-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/954576 commit 5b1cbac37798805c1fee18c8cebe5c0a13975b17 upstream. Some code - especially the crypto layer - wants to use the x86 FP/MMX/AVX register set in what may be interrupt (typically softirq) context. That *can* be ok, but the tests for when it was ok were somewhat suspect. We cannot touch the thread-specific status bits either, so we'd better check that we're not going to try to save FP state or anything like that. Now, it may be that the TS bit is always cleared *before* we set the USEDFPU bit (and only set when we had already cleared the USEDFP before), so the TS bit test may actually have been sufficient, but it certainly was not obviously so. So this explicitly verifies that we will not touch the TS_USEDFPU bit, and adds a few related sanity-checks. Because it seems that somehow AES-NI is corrupting user FP state. The cause is not clear, and this patch doesn't fix it, but while debugging it I really wanted the code to be more obviously correct and robust. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * i387: math_state_restore() isn't called from asmLinus Torvalds2012-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/954576 commit be98c2cdb15ba26148cd2bd58a857d4f7759ed38 upstream. It was marked asmlinkage for some really old and stale legacy reasons. Fix that and the equally stale comment. Noticed when debugging the irq_fpu_usable() bugs. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * xen pvhvm: do not remap pirqs onto evtchns if !xen_have_vector_callbackStefano Stabellini2012-03-08
| | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/937915 commit 207d543f472c1ac9552df79838dc807cbcaa9740 upstream. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * net: bpf_jit: fix divide by 0 generationEric Dumazet2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/926309 [ Upstream commit d00a9dd21bdf7908b70866794c8313ee8a5abd5c ] Several problems fixed in this patch : 1) Target of the conditional jump in case a divide by 0 is performed by a bpf is wrong. 2) Must 'generate' the full function prologue/epilogue at pass=0, or else we can stop too early in pass=1 if the proglen doesnt change. (if the increase of prologue/epilogue equals decrease of all instructions length because some jumps are converted to near jumps) 3) Change the wrong length detection at the end of code generation to issue a more explicit message, no need for a full stack trace. Reported-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86/microcode_amd: Add support for CPU family specific container filesAndreas Herrmann2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/926309 commit 5b68edc91cdc972c46f76f85eded7ffddc3ff5c2 upstream. We've decided to provide CPU family specific container files (starting with CPU family 15h). E.g. for family 15h we have to load microcode_amd_fam15h.bin instead of microcode_amd.bin Rationale is that starting with family 15h patch size is larger than 2KB which was hard coded as maximum patch size in various microcode loaders (not just Linux). Container files which include patches larger than 2KB cause different kinds of trouble with such old patch loaders. Thus we have to ensure that the default container file provides only patches with size less than 2KB. Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/20120120164412.GD24508@alberich.amd.com [ documented the naming convention and tidied the code a bit. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86/uv: Fix uv_gpa_to_soc_phys_ram() shiftRuss Anderson2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/926309 commit 5a51467b146ab7948d2f6812892eac120a30529c upstream. uv_gpa_to_soc_phys_ram() was inadvertently ignoring the shift values. This fix takes the shift into account. Signed-off-by: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/20120119020753.GA7228@sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86/UV2: Fix BAU destination timeout initializationCliff Wickman2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit d059f9fa84a30e04279c6ff615e9e2cf3b260191 upstream. Move the call to enable_timeouts() forward so that BAU_MISC_CONTROL is initialized before using it in calculate_destination_timeout(). Fix the calculation of a BAU destination timeout for UV2 (in calculate_destination_timeout()). Signed-off-by: Cliff Wickman <cpw@sgi.com> Link: http://lkml.kernel.org/r/20120116211848.GB5767@sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * ACPI, x86: Use SRAT table rev to use 8bit or 32bit PXM fields (x86/x86-64)Kurt Garloff2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit cd298f60a2451a16e0f077404bf69b62ec868733 upstream. In SRAT v1, we had 8bit proximity domain (PXM) fields; SRAT v2 provides 32bits for these. The new fields were reserved before. According to the ACPI spec, the OS must disregrard reserved fields. x86/x86-64 was rather inconsistent prior to this patch; it used 8 bits for the pxm field in cpu_affinity, but 32 bits in mem_affinity. This patch makes it consistent: Either use 8 bits consistently (SRAT rev 1 or lower) or 32 bits (SRAT rev 2 or higher). cc: x86@kernel.org Signed-off-by: Kurt Garloff <kurt@garloff.de> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86, UV: Update Boot messages for SGI UV2 platformJack Steiner2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit da517a08ac5913cd80ce3507cddd00f2a091b13c upstream. SGI UV systems print a message during boot: UV: Found <num> blades Due to packaging changes, the blade count is not accurate for on the next generation of the platform. This patch corrects the count. Signed-off-by: Jack Steiner <steiner@sgi.com> Link: http://lkml.kernel.org/r/20120106191900.GA19772@sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86: Fix mmap random address rangeLudwig Nussel2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit 9af0c7a6fa860698d080481f24a342ba74b68982 upstream. On x86_32 casting the unsigned int result of get_random_int() to long may result in a negative value. On x86_32 the range of mmap_rnd() therefore was -255 to 255. The 32bit mode on x86_64 used 0 to 255 as intended. The bug was introduced by 675a081 ("x86: unify mmap_{32|64}.c") in January 2008. Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: harvey.harrison@gmail.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/201111152246.pAFMklOB028527@wpaz5.hot.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86/PCI: build amd_bus.o only when CONFIG_AMD_NB=yBjorn Helgaas2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit 5cf9a4e69c1ff0ccdd1d2b7404f95c0531355274 upstream. We only need amd_bus.o for AMD systems with PCI. arch/x86/pci/Makefile already depends on CONFIG_PCI=y, so this patch just adds the dependency on CONFIG_AMD_NB. Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86/PCI: Ignore CPU non-addressable _CRS reserved memory resourcesGary Hade2012-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/922799 commit ae5cd86455381282ece162966183d3f208c6fad7 upstream. This assures that a _CRS reserved host bridge window or window region is not used if it is not addressable by the CPU. The new code either trims the window to exclude the non-addressable portion or totally ignores the window if the entire window is non-addressable. The current code has been shown to be problematic with 32-bit non-PAE kernels on systems where _CRS reserves resources above 4GB. Signed-off-by: Gary Hade <garyhade@us.ibm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Thomas Renninger <trenn@novell.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86/PCI: amd: factor out MMCONFIG discoveryBjorn Helgaas2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/647043 This factors out the AMD native MMCONFIG discovery so we can use it outside amd_bus.c. amd_bus.c reads AMD MSRs so it can remove the MMCONFIG area from the PCI resources. We may also need the MMCONFIG information to work around BIOS defects in the ACPI MCFG table. Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: stable@kernel.org # 2.6.34+ Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Herton Krzesinski <herton.krzesinski@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * net: bpf_jit: fix an off-one bug in x86_64 cond jump targetMarkus Kötter2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/913373 [ Upstream commit a03ffcf873fe0f2565386ca8ef832144c42e67fa ] x86 jump instruction size is 2 or 5 bytes (near/long jump), not 2 or 6 bytes. In case a conditional jump is followed by a long jump, conditional jump target is one byte past the start of target instruction. Signed-off-by: Markus Kötter <nepenthesdev@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * KVM: x86: Prevent starting PIT timers in the absence of irqchip supportJan Kiszka2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | User space may create the PIT and forgets about setting up the irqchips. In that case, firing PIT IRQs will crash the host: BUG: unable to handle kernel NULL pointer dereference at 0000000000000128 IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm] ... Call Trace: [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm] [<ffffffff81071431>] process_one_work+0x111/0x4d0 [<ffffffff81071bb2>] worker_thread+0x152/0x340 [<ffffffff81075c8e>] kthread+0x7e/0x90 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10 Prevent this by checking the irqchip mode before starting a timer. We can't deny creating the PIT if the irqchips aren't set up yet as current user land expects this order to work. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> (cherry picked from commit 0924ab2cfa98b1ece26c033d696651fd62896c69) CVE-2011-4622 BugLink: http://bugs.launchpad.net/bugs/911303 Acked-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * xen: only limit memory map to maximum reservation for domain 0.Ian Campbell2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/907778 commit d3db728125c4470a2d061ac10fa7395e18237263 upstream. d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM" clamped the total amount of RAM to the current maximum reservation. This is correct for dom0 but is not correct for guest domains. In order to boot a guest "pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for future memory expansion the guest must derive max_pfn from the e820 provided by the toolstack and not the current maximum reservation (which can reflect only the current maximum, not the guest lifetime max). The existing algorithm already behaves this correctly if we do not artificially limit the maximum number of pages for the guest case. For a guest booted with maxmem=512, memory=128 this results in: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) -[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable) -[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable) +[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable) ... [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved) [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable) -[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000 +[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000 [ 0.000000] initial memory mapped : 0 - 027ff000 [ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096 -[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000 -[ 0.000000] 0000000000 - 0008100000 page 4k -[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000 +[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000 +[ 0.000000] 0000000000 - 0020800000 page 4k +[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000 [ 0.000000] xen: setting RW the range 27e8000 - 27ff000 [ 0.000000] 0MB HIGHMEM available. -[ 0.000000] 129MB LOWMEM available. -[ 0.000000] mapped low ram: 0 - 08100000 -[ 0.000000] low ram: 0 - 08100000 +[ 0.000000] 520MB LOWMEM available. +[ 0.000000] mapped low ram: 0 - 20800000 +[ 0.000000] low ram: 0 - 20800000 With this change "xl mem-set <domain> 512M" will successfully increase the guest RAM (by reducing the balloon). There is no change for dom0. Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * x86, hpet: Immediately disable HPET timer 1 if rtc irq is maskedMark Langsdorf2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/907778 commit 2ded6e6a94c98ea453a156748cb7fabaf39a76b9 upstream. When HPET is operating in RTC mode, the TN_ENABLE bit on timer1 controls whether the HPET or the RTC delivers interrupts to irq8. When the system goes into suspend, the RTC driver sends a signal to the HPET driver so that the HPET releases control of irq8, allowing the RTC to wake the system from suspend. The switchover is accomplished by a write to the HPET configuration registers which currently only occurs while servicing the HPET interrupt. On some systems, I have seen the system suspend before an HPET interrupt occurs, preventing the write to the HPET configuration register and leaving the HPET in control of the irq8. As the HPET is not active during suspend, it does not generate a wake signal and RTC alarms do not work. This patch forces the HPET driver to immediately transfer control of the irq8 channel to the RTC instead of waiting until the next interrupt event. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * thp: add compound tail page _mapcount when mappedYouquan Song2012-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/907778 commit b6999b19120931ede364fa3b685e698a61fed31d upstream. With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page failed to be used. The root cause is that 1GB page allocation calls gup_huge_pud() while 2M page calls gup_huge_pmd. If compound pages are used and the page is a tail page, gup_huge_pmd() increases _mapcount to record tail page are mapped while gup_huge_pud does not do that. So when the mapped page is relesed, it will result in kernel oops because the page is not marked mapped. This patch add tail process for compound page in 1GB huge page which keeps the same process as 2M page. Reproduce like: 1. Add grub boot option: hugepagesz=1G hugepages=8 2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages 3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages -net none -device pci-assign,host=07:00.1 kernel BUG at mm/swap.c:114! invalid opcode: 0000 [#1] SMP Call Trace: put_page+0x15/0x37 kvm_release_pfn_clean+0x31/0x36 kvm_iommu_put_pages+0x94/0xb1 kvm_iommu_unmap_memslots+0x80/0xb6 kvm_assign_device+0xba/0x117 kvm_vm_ioctl_assigned_device+0x301/0xa47 kvm_vm_ioctl+0x36c/0x3a2 do_vfs_ioctl+0x49e/0x4e4 sys_ioctl+0x5a/0x7c system_call_fastpath+0x16/0x1b RIP put_compound_page+0xd4/0x168 Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
| * oprofile, x86: Fix crash when unloading module (nmi timer mode)Robert Richter2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/902312 commit 97f7f8189fe54e3cfe324ef9ad35064f3d2d3bff upstream. If oprofile uses the nmi timer interrupt there is a crash while unloading the module. The bug can be triggered with oprofile build as module and kernel parameter nolapic set. This patch fixes this. oprofile: using NMI timer interrupt. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 PGD 42dbca067 PUD 41da6a067 PMD 0 Oops: 0002 [#1] PREEMPT SMP CPU 5 Modules linked in: oprofile(-) [last unloaded: oprofile] Pid: 2518, comm: modprobe Not tainted 3.1.0-rc7-00019-gb2fb49d #19 Advanced Micro Device Anaheim/Anaheim RIP: 0010:[<ffffffff8123c226>] [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 RSP: 0018:ffff88041ef71e98 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffffffa0017100 RCX: dead000000200200 RDX: 0000000000000000 RSI: dead000000100100 RDI: ffffffff8178c620 RBP: ffff88041ef71ea8 R08: 0000000000000001 R09: 0000000000000082 R10: 0000000000000000 R11: ffff88041ef71de8 R12: 0000000000000080 R13: fffffffffffffff5 R14: 0000000000000001 R15: 0000000000610210 FS: 00007fc902f20700(0000) GS:ffff88042fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 000000041cdb6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process modprobe (pid: 2518, threadinfo ffff88041ef70000, task ffff88041d348040) Stack: ffff88041ef71eb8 ffffffffa0017790 ffff88041ef71eb8 ffffffffa0013532 ffff88041ef71ec8 ffffffffa00132d6 ffff88041ef71ed8 ffffffffa00159b2 ffff88041ef71f78 ffffffff81073115 656c69666f72706f 0000000000610200 Call Trace: [<ffffffffa0013532>] op_nmi_exit+0x15/0x17 [oprofile] [<ffffffffa00132d6>] oprofile_arch_exit+0xe/0x10 [oprofile] [<ffffffffa00159b2>] oprofile_exit+0x1e/0x20 [oprofile] [<ffffffff81073115>] sys_delete_module+0x1c3/0x22f [<ffffffff811bf09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8148070b>] system_call_fastpath+0x16/0x1b Code: 20 c6 78 81 e8 c5 cc 23 00 48 8b 13 48 8b 43 08 48 be 00 01 10 00 00 00 ad de 48 b9 00 02 20 00 00 00 ad de 48 c7 c7 20 c6 78 81 89 42 08 48 89 10 48 89 33 48 89 4b 08 e8 a6 c0 23 00 5a 5b RIP [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58 RSP <ffff88041ef71e98> CR2: 0000000000000008 ---[ end trace 43a541a52956b7b0 ]--- Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * perf/x86: Fix PEBS instruction unwindPeter Zijlstra2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/902312 commit 57d1c0c03c6b48b2b96870d831b9ce6b917f53ac upstream. Masami spotted that we always try to decode the instruction stream as 64bit instructions when running a 64bit kernel, this doesn't work for ia32-compat proglets. Use TIF_IA32 to detect if we need to use the 32bit instruction decoder. Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86: Fix "Acer Aspire 1" reboot hangPeter Chubb2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/902312 commit 1ef03890969932e9359b9a4c658f7f87771910ac upstream. Looks like on some Acer Aspire 1s with older bioses, reboot via bios fails. It works on my machine, (with BIOS version 0.3310) but not on some others (BIOS version 0.3309). There's a log of problems at: https://bbs.archlinux.org/viewtopic.php?id=124136 This patch adds a different callback to the reboot quirk table, to allow rebooting via keybaord controller. Reported-by: Uroš Vampl <mobile.leecher@gmail.com> Tested-by: Vasily Khoruzhick <anarsoul@gmail.com> Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86/mpparse: Account for bus types other than ISA and PCIBjorn Helgaas2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/902312 commit 9e6866686bdf2dcf3aeb0838076237ede532dcc8 upstream. In commit f8924e770e04 ("x86: unify mp_bus_info"), the 32-bit and 64-bit versions of MP_bus_info were rearranged to match each other better. Unfortunately it introduced a regression: prior to that change we used to always set the mp_bus_not_pci bit, then clear it if we found a PCI bus. After it, we set mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave it alone otherwise. In the cases of ISA and PCI, there's not much difference. But ISA is not the only non-PCI bus, so it's better to always set mp_bus_not_pci and clear it only for PCI. Without this change, Dan's Dell PowerEdge 4200 panics on boot with a log indicating interrupt routing trouble unless the "noapic" option is supplied. With this change, the machine boots reliably without "noapic". Fixes http://bugs.debian.org/586494 Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Dan McGrath <troubledaemon@gmail.com> Cc: Alexey Starikovskiy <aystarik@gmail.com> [jrnieder@gmail.com: clarified commit message] Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * sched, x86: Avoid unnecessary overflow in sched_clockSalman Qazi2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/902312 commit 4cecf6d401a01d054afc1e5f605bcbfe553cb9b9 upstream. (Added the missing signed-off-by line) In hundreds of days, the __cycles_2_ns calculation in sched_clock has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing the final value to become zero. We can solve this without losing any precision. We can decompose TSC into quotient and remainder of division by the scale factor, and then use this to convert TSC into nanoseconds. Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111115221121.7262.88871.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86, kdump, ioapic: Reset remote-IRR in clear_IO_APICSuresh Siddha2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/901830 In the kdump scenario mentioned below, we can have a case where the device using level triggered interrupt will not generate any interrupts in the kdump kernel. 1. IO-APIC sends a level triggered interrupt to the CPU's local APIC. 2. Kernel crashed before the CPU services this interrupt, leaving the remote-IRR in the IO-APIC set. 3. kdump kernel boot sequence does clear_IO_APIC() as part of IO-APIC initialization. But this fails to reset remote-IRR bit of the IO-APIC RTE as the remote-IRR bit is read-only. 4. Device using that level triggered entry can't generate any more interrupts because of the remote-IRR bit. In clear_IO_APIC_pin(), check if the remote-IRR bit is set and if so do an explicit attempt to clear it (by doing EOI write on modern io-apic's and changing trigger mode to edge/level on older io-apic's). Also before doing the explicit EOI to the io-apic, ensure that the trigger mode is indeed set to level. This will enable the explicit EOI to the io-apic to reset the remote-IRR bit. Tested-by: Leonardo Chiquitto <lchiquitto@novell.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Fixes: https://bugzilla.novell.com/show_bug.cgi?id=701686 Cc: Rafael Wysocki <rjw@novell.com> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Thomas Renninger <trenn@suse.de> Cc: jbeulich@novell.com Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20110825190657.157502602@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> (cherry picked from commit 1e75b31d638d5242ca8e9771dfdcbd28a5f041df) Signed-off-by: Brad Figg <brad.figg@canonical.com> Acked-by: Herton Krzesinski <herton.krzesinski@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.Zhenzhong Duan2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/893741 commit 90d4f5534d14815bd94c10e8ceccc57287657ecc upstream. PVHVM running with more than 32 vcpus and pv_irq/pv_time enabled need VCPU placement to work, or else it will softlockup. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86, mrst: use a temporary variable for SFI irqMika Westerberg2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/893741 commit 153b19a3b9fd8b9478495b9ee1f93f6a77c564f9 upstream. SFI tables reside in RAM and should not be modified once they are written. Current code went to set pentry->irq to zero which causes subsequent reads to fail with invalid SFI table checksum. This will break kexec as the second kernel fails to validate SFI tables. To fix this we use temporary variable for irq number. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * sfi: table irq 0xFF means 'no interrupt'Kirill A. Shutemov2011-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/893741 commit a94cc4e6c0a26a7c8f79a432ab2c89534aa674d5 upstream. According to the SFI specification irq number 0xFF means device has no interrupt or interrupt attached via GPIO. Currently, we don't handle this special case and set irq field in *_board_info structs to 255. It leads to confusion in some drivers. Accelerometer driver tries to register interrupt 255, fails and prints "Cannot get IRQ" to dmesg. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * thp: share get_huge_page_tail()Andrea Arcangeli2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit b35a35b556f5e6b7993ad0baf20173e75c09ce8c upstream. This avoids duplicating the function in every arch gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * mm: thp: tail page refcounting fixAndrea Arcangeli2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream. Michel while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount(). He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that uses get_page_unless_zero() in SMP if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero(). So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount). While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. As opposed to other get_user_page users like secondary-MMU page fault to establish the shadow pagetables would never call any superflous get_page after get_user_page returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked by direct-io instead will now take the compound_lock but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP so very finegrined, so there's no risk of scalability issues with it. A simple direct-io benchmarks with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths. This new refcounting on page_tail->_mapcount in addition to avoiding new RCU critical sections will also allow the working set estimation code to work without any further complexity associated to the tail page refcounting with THP. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michel Lespinasse <walken@google.com> Reviewed-by: Michel Lespinasse <walken@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * iommu/amd: Fix wrong shift directionJoerg Roedel2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit fcd0861db1cf4e6ed99f60a815b7b72c2ed36ea4 upstream. The shift direction was wrong because the function takes a page number and i is the address is the loop. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatchesJan Beulich2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit 838312be46f3abfbdc175f81c3e54a857994476d upstream. These warnings (generally one per CPU) are a result of initializing x86_cpu_to_logical_apicid while apic_default is still in use, but the check in setup_local_APIC() being done when apic_bigsmp was already used as an override in default_setup_apic_routing(): Overriding APIC driver with bigsmp Enabling APIC mode: Physflat. Using 5 I/O APICs ------------[ cut here ]------------ WARNING: at .../arch/x86/kernel/apic/apic.c:1239 ... CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000 Booting Node 0, Processors #1 smpboot cpu 1: start_ip = 9e000 Initializing CPU#1 ------------[ cut here ]------------ WARNING: at .../arch/x86/kernel/apic/apic.c:1239 setup_local_APIC+0x137/0x46b() Hardware name: ... CPU1 logical APIC ID: 2 != 8 ... Fix this (for the time being, i.e. until x86_32_early_logical_apicid() will get removed again, as Tejun says ought to be possible) by overriding the previously stored values at the point where the APIC driver gets overridden. v2: Move this and the pre-existing override logic into arch/x86/kernel/apic/bigsmp_32.c. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86: Fix compilation bug in kprobes' twobyte_is_boostableJosh Stone2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit 315eb8a2a1b7f335d40ceeeb11b9e067475eb881 upstream. When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I noticed a warning about the asm operand for test_bit in kprobes' can_boost. I discovered that this caused only the first long of twobyte_is_boostable[] to be output. Jakub filed and fixed gcc PR50571 to correct the warning and this output issue. But to solve it for less current gcc, we can make kprobes' twobyte_is_boostable[] non-const, and it won't be optimized out. Before: CC arch/x86/kernel/kprobes.o In file included from include/linux/bitops.h:22:0, from include/linux/kernel.h:17, from [...]/arch/x86/include/asm/percpu.h:44, from [...]/arch/x86/include/asm/current.h:5, from [...]/arch/x86/include/asm/processor.h:15, from [...]/arch/x86/include/asm/atomic.h:6, from include/linux/atomic.h:4, from include/linux/mutex.h:18, from include/linux/notifier.h:13, from include/linux/kprobes.h:34, from arch/x86/kernel/kprobes.c:43: [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’: [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default] $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt 551: 0f a3 05 00 00 00 00 bt %eax,0x0 554: R_386_32 .rodata.cst4 $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o arch/x86/kernel/kprobes.o: file format elf32-i386 Contents of section .data: 0000 48000000 00000000 00000000 00000000 H............... Contents of section .rodata.cst4: 0000 4c030000 L... Only a single long of twobyte_is_boostable[] is in the object file. After, without the const on twobyte_is_boostable: $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt 551: 0f a3 05 20 00 00 00 bt %eax,0x20 554: R_386_32 .data $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o arch/x86/kernel/kprobes.o: file format elf32-i386 Contents of section .data: 0000 48000000 00000000 00000000 00000000 H............... 0010 00000000 00000000 00000000 00000000 ................ 0020 4c030000 0f000200 ffff0000 ffcff0c0 L............... 0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w Now all 32 bytes are output into .data instead. Signed-off-by: Josh Stone <jistone@redhat.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86: uv2: Workaround for UV2 Hub bug (system global address format)Jack Steiner2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/890952 commit 6a469e4665bc158599de55d64388861d0a9f10f4 upstream. This is a workaround for a UV2 hub bug that affects the format of system global addresses. The GRU API for UV2 was inadvertently broken by a hardware change. The format of the physical address used for TLB dropins and for addresses used with instructions running in unmapped mode has changed. This change was not documented and became apparent only when diags failed running on system simulators. For UV1, TLB and GRU instruction physical addresses are identical to socket physical addresses (although high NASID bits must be OR'ed into the address). For UV2, socket physical addresses need to be converted. The NODE portion of the physical address needs to be shifted so that the low bit is in bit 39 or bit 40, depending on an MMR value. It is not yet clear if this bug will be fixed in a silicon respin. If it is fixed, the hub revision will be incremented & the workaround disabled. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86: Fix S4 regressionTakashi Iwai2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/881420 commit 8548c84da2f47e71bbbe300f55edb768492575f7 upstream. Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4 regression since 2.6.39, namely the machine reboots occasionally at S4 resume. It doesn't happen always, overall rate is about 1/20. But, like other bugs, once when this happens, it continues to happen. This patch fixes the problem by essentially reverting the memory assignment in the older way. Signed-off-by: Takashi Iwai <tiwai@suse.de> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <yinghai.lu@oracle.com> [ We'll hopefully find the real fix, but that's too late for 3.1 now ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * x86/PCI: use host bridge _CRS info on ASUS M2V-MX SEPaul Menzel2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/881420 commit 29cf7a30f8a0ce4af2406d93d5a332099be26923 upstream. In summary, this DMI quirk uses the _CRS info by default for the ASUS M2V-MX SE by turning on `pci=use_crs` and is similar to the quirk added by commit 2491762cfb47 ("x86/PCI: use host bridge _CRS info on ASRock ALiveSATA2-GLAN") whose commit message should be read for further information. Since commit 3e3da00c01d0 ("x86/pci: AMD one chain system to use pci read out res") Linux gives the following oops: parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE] HDA Intel 0000:20:01.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 HDA Intel 0000:20:01.0: setting latency timer to 64 BUG: unable to handle kernel paging request at ffffc90011c08000 IP: [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] PGD 13781a067 PUD 13781b067 PMD 1300ba067 PTE 800000fd00000173 Oops: 0009 [#1] SMP last sysfs file: /sys/module/snd_pcm/initstate CPU 0 Modules linked in: snd_hda_intel(+) snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event tpm_tis tpm snd_seq tpm_bios psmouse parport_pc snd_timer snd_seq_device parport processor evdev snd i2c_viapro thermal_sys amd64_edac_mod k8temp i2c_core soundcore shpchp pcspkr serio_raw asus_atk0110 pci_hotplug edac_core button snd_page_alloc edac_mce_amd ext3 jbd mbcache sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod usbhid hid sg sd_mod crc_t10dif sr_mod cdrom ata_generic uhci_hcd sata_via pata_via libata ehci_hcd usbcore scsi_mod via_rhine mii nls_base [last unloaded: scsi_wait_scan] Pid: 1153, comm: work_for_cpu Not tainted 2.6.37-1-amd64 #1 M2V-MX SE/System Product Name RIP: 0010:[<ffffffffa0578402>] [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] RSP: 0018:ffff88013153fe50 EFLAGS: 00010286 RAX: ffffc90011c08000 RBX: ffff88013029ec00 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff88013341d000 R08: 0000000000000000 R09: 0000000000000040 R10: 0000000000000286 R11: 0000000000003731 R12: ffff88013029c400 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88013341d090 FS: 0000000000000000(0000) GS:ffff8800bfc00000(0000) knlGS:00000000f7610ab0 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffc90011c08000 CR3: 0000000132f57000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process work_for_cpu (pid: 1153, threadinfo ffff88013153e000, task ffff8801303c86c0) Stack: 0000000000000005 ffffffff8123ad65 00000000000136c0 ffff88013029c400 ffff8801303c8998 ffff88013341d000 ffff88013341d090 ffff8801322d9dc8 ffff88013341d208 0000000000000000 0000000000000000 ffffffff811ad232 Call Trace: [<ffffffff8123ad65>] ? __pm_runtime_set_status+0x162/0x186 [<ffffffff811ad232>] ? local_pci_probe+0x49/0x92 [<ffffffff8105afc5>] ? do_work_for_cpu+0x0/0x1b [<ffffffff8105afc5>] ? do_work_for_cpu+0x0/0x1b [<ffffffff8105afd0>] ? do_work_for_cpu+0xb/0x1b [<ffffffff8105fd3f>] ? kthread+0x7a/0x82 [<ffffffff8100a824>] ? kernel_thread_helper+0x4/0x10 [<ffffffff8105fcc5>] ? kthread+0x0/0x82 [<ffffffff8100a820>] ? kernel_thread_helper+0x0/0x10 Code: f4 01 00 00 ef 31 f6 48 89 df e8 29 dd ff ff 85 c0 0f 88 2b 03 00 00 48 89 ef e8 b4 39 c3 e0 8b 7b 40 e8 fc 9d b1 e0 48 8b 43 38 <66> 8b 10 66 89 14 24 8b 43 14 83 e8 03 83 f8 01 77 32 31 d2 be RIP [<ffffffffa0578402>] azx_probe+0x3ad/0x86b [snd_hda_intel] RSP <ffff88013153fe50> CR2: ffffc90011c08000 ---[ end trace 8d1f3ebc136437fd ]--- Trusting the ACPI _CRS information (`pci=use_crs`) fixes this problem. $ dmesg | grep -i crs # with the quirk PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug The match has to be against the DMI board entries though since the vendor entries are not populated. DMI: System manufacturer System Product Name/M2V-MX SE, BIOS 0304 10/30/2007 This quirk should be removed when `pci=use_crs` is enabled for machines from 2006 or earlier or some other solution is implemented. Using coreboot [1] with this board the problem does not exist but this quirk also does not affect it either. To be safe though the check is tightened to only take effect when the BIOS from American Megatrends is used. 15:13 < ruik> but coreboot does not need that 15:13 < ruik> because i have there only one root bus 15:13 < ruik> the audio is behind a bridge $ sudo dmidecode BIOS Information Vendor: American Megatrends Inc. Version: 0304 Release Date: 10/30/2007 [1] http://www.coreboot.org/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=30552 Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Signed-off-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| * UBUNTU: SAUCE: x86/paravirt: Partially revert "remove lazy mode in interrupts"Konrad Rzeszutek Wilk2011-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | which has git commit b8bcfe997e46150fedcc3f5b26b846400122fdd9. The unintended consequence of removing the flushing of MMU updates when doing kmap_atomic (or kunmap_atomic) is that we can hit a dereference bug when processing a "fork()" under a heavy loaded machine. Specifically we can hit: BUG: unable to handle kernel paging request at f573fc8c IP: [<c01abc54>] swap_count_continued+0x104/0x180 *pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1 EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3 EIP is at swap_count_continued+0x104/0x180 .. snip.. Call Trace: [<c01ac222>] ? __swap_duplicate+0xc2/0x160 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0 [<c01ac2e4>] ? swap_duplicate+0x14/0x40 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500 [<c01a0ca5>] ? copy_page_range+0x195/0x200 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0 [<c0132cf8>] ? dup_mm+0xa8/0x130 [<c013376a>] ? copy_process+0x98a/0xb30 [<c013395f>] ? do_fork+0x4f/0x280 [<c01573b3>] ? getnstimeofday+0x43/0x100 [<c010f770>] ? sys_clone+0x30/0x40 [<c06c048d>] ? ptregs_clone+0x15/0x48 [<c06bfb71>] ? syscall_call+0x7/0xb The problem looks that in copy_page_range we turn lazy mode on, and then in swap_entry_free we call swap_count_continued which ends up in: map = kmap_atomic(page, KM_USER0) + offset; and then later touches *map. Since we are running in batched mode (lazy) we don't actually set up the PTE mappings and the kmap_atomic is not done synchronously and ends up trying to dereference a page that has not been set. Looking at kmap_atomic_prot_pfn, it uses 'arch_flush_lazy_mmu_mode' and sprinkling that in kmap_atomic_prot and __kunmap_atomic makes the problem go away. CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: "H. Peter Anvin" <hpa@zytor.com> CC: x86@kernel.org CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: stable@kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> BugLink: http://bugs.launchpad.net/bugs/854050 (cherry-picked from ab67482036cee590753dd42b7f66aada97e6dcde linux-next) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Leann Ogasawara <leann.ogasawara@canonical.com> Acked-by: Andy Whitcroft <andy.whitcroft@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
| * perf, x86: Add model 45 SandyBridge supportYouquan Song2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit a34668f6beb4ab01e07683276d6a24bab6c175e0 upstream. Add support to Romely-EP SandyBridge. Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Anhua Xu <anhua.xu@intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1312264895-2010-1-git-send-email-youquan.song@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * xen/e820: if there is no dom0_mem=, don't tweak extra_pages.David Vrabel2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit e3b73c4a25e9a5705b4ef28b91676caf01f9bc9f upstream. The patch "xen: use maximum reservation to limit amount of usable RAM" (d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that do not use 'dom0_mem=' argument with: reserve RAM buffer: 000000133f2e2000 - 000000133fffffff (XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e (XEN) domain_crash_sync called from entry.S (XEN) Domain 0 (vcpu#0) crashed on cpu#0: ... The reason being that the last E820 entry is created using the 'extra_pages' (which is based on how many pages have been freed). The mentioned git commit sets the initial value of 'extra_pages' using a hypercall which returns the number of pages (if dom0_mem has been used) or -1 otherwise. If the later we return with MAX_DOMAIN_PAGES as basis for calculation: return min(max_pages, MAX_DOMAIN_PAGES); and use it: extra_limit = xen_get_max_pages(); if (extra_limit >= max_pfn) extra_pages = extra_limit - max_pfn; else extra_pages = 0; which means we end up with extra_pages = 128GB in PFNs (33554432) - 8GB in PFNs (2097152, on this specific box, can be larger or smaller), and then we add that value to the E820 making it: Xen: 00000000ff000000 - 0000000100000000 (reserved) Xen: 0000000100000000 - 000000133f2e2000 (usable) which is clearly wrong. It should look as so: Xen: 00000000ff000000 - 0000000100000000 (reserved) Xen: 0000000100000000 - 000000027fbda000 (usable) Naturally this problem does not present itself if dom0_mem=max:X is used. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * xen: use maximum reservation to limit amount of usable RAMDavid Vrabel2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11 upstream. Use the domain's maximum reservation to limit the amount of extra RAM for the memory balloon. This reduces the size of the pages tables and the amount of reserved low memory (which defaults to about 1/32 of the total RAM). On a system with 8 GiB of RAM with the domain limited to 1 GiB the kernel reports: Before: Memory: 627792k/4472000k available After: Memory: 549740k/11132224k available A increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved low memory is also reduced from 253 MiB to 32 MiB. The total additional usable RAM is 329 MiB. For dom0, this requires at patch to Xen ('x86: use 'dom0_mem' to limit the number of pages for dom0') (c/s 23790) Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * iommu/amd: Make sure iommu->need_sync contains correct valueJoerg Roedel2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit f1ca1512e765337a7c09eb875eedef8ea4e07654 upstream. The value is only set to true but never set back to false, which causes to many completion-wait commands to be sent to hardware. Fix it with this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * iommu/amd: Don't take domain->lock recursivlyJoerg Roedel2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit e33acde91140f1809952d1c135c36feb66a51887 upstream. The domain_flush_devices() function takes the domain->lock. But this function is only called from update_domain() which itself is already called unter the domain->lock. This causes a deadlock situation when the dma-address-space of a domain grows larger than 1GB. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * xen/smp: Warn user why they keel over - nosmp or noapic and what to use instead.Konrad Rzeszutek Wilk2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit ed467e69f16e6b480e2face7bc5963834d025f91 upstream. We have hit a couple of customer bugs where they would like to use those parameters to run an UP kernel - but both of those options turn of important sources of interrupt information so we end up not being able to boot. The correct way is to pass in 'dom0_max_vcpus=1' on the Xen hypervisor line and the kernel will patch itself to be a UP kernel. Fixes bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637308 Acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * xen: x86_32: do not enable iterrupts when returning from exception in ↵Igor Mammedov2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | interrupt context BugLink: http://bugs.launchpad.net/bugs/868628 commit d198d499148a0c64a41b3aba9e7dd43772832b91 upstream. If vmalloc page_fault happens inside of interrupt handler with interrupts disabled then on exit path from exception handler when there is no pending interrupts, the following code (arch/x86/xen/xen-asm_32.S:112): cmpw $0x0001, XEN_vcpu_info_pending(%eax) sete XEN_vcpu_info_mask(%eax) will enable interrupts even if they has been previously disabled according to eflags from the bounce frame (arch/x86/xen/xen-asm_32.S:99) testb $X86_EFLAGS_IF>>8, 8+1+ESP_OFFSET(%esp) setz XEN_vcpu_info_mask(%eax) Solution is in setting XEN_vcpu_info_mask only when it should be set according to cmpw $0x0001, XEN_vcpu_info_pending(%eax) but not clearing it if there isn't any pending events. Reproducer for bug is attached to RHBZ 707552 Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86, perf: Check that current->mm is alive before getting user callchainAndrey Vagin2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/868628 commit 20afc60f892d285fde179ead4b24e6a7938c2f1b upstream. An event may occur when an mm is already released. I added an event in dequeue_entity() and caught a panic with the following backtrace: [ 434.421110] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [ 434.421258] IP: [<ffffffff810464ac>] __get_user_pages_fast+0x9c/0x120 ... [ 434.421258] Call Trace: [ 434.421258] [<ffffffff8101ae81>] copy_from_user_nmi+0x51/0xf0 [ 434.421258] [<ffffffff8109a0d5>] ? sched_clock_local+0x25/0x90 [ 434.421258] [<ffffffff8101b048>] perf_callchain_user+0x128/0x170 [ 434.421258] [<ffffffff811154cd>] ? __perf_event_header__init_id+0xed/0x100 [ 434.421258] [<ffffffff81116690>] perf_prepare_sample+0x200/0x280 [ 434.421258] [<ffffffff81118da8>] __perf_event_overflow+0x1b8/0x290 [ 434.421258] [<ffffffff81065240>] ? tg_shares_up+0x0/0x670 [ 434.421258] [<ffffffff8104fe1a>] ? walk_tg_tree+0x6a/0xb0 [ 434.421258] [<ffffffff81118f44>] perf_swevent_overflow+0xc4/0xf0 [ 434.421258] [<ffffffff81119150>] do_perf_sw_event+0x1e0/0x250 [ 434.421258] [<ffffffff81119204>] perf_tp_event+0x44/0x70 [ 434.421258] [<ffffffff8105701f>] ftrace_profile_sched_block+0xdf/0x110 [ 434.421258] [<ffffffff8106121d>] dequeue_entity+0x2ad/0x2d0 [ 434.421258] [<ffffffff810614ec>] dequeue_task_fair+0x1c/0x60 [ 434.421258] [<ffffffff8105818a>] dequeue_task+0x9a/0xb0 [ 434.421258] [<ffffffff810581e2>] deactivate_task+0x42/0xe0 [ 434.421258] [<ffffffff814bc019>] thread_return+0x191/0x808 [ 434.421258] [<ffffffff81098a44>] ? switch_task_namespaces+0x24/0x60 [ 434.421258] [<ffffffff8106f4c4>] do_exit+0x464/0x910 [ 434.421258] [<ffffffff8106f9c8>] do_group_exit+0x58/0xd0 [ 434.421258] [<ffffffff8106fa57>] sys_exit_group+0x17/0x20 [ 434.421258] [<ffffffff8100b202>] system_call_fastpath+0x16/0x1b Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314693156-24131-1-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * x86-32, amd: Move va_align definition to unbreak 32-bit buildBorislav Petkov2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/862583 hpa reported that dfb09f9b7ab03fd367740e541a5caf830ed56726 breaks 32-bit builds with the following error message: /home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:437: undefined reference to `va_align' /home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:436: undefined reference to `va_align' This is due to the fact that va_align is a global in a 64-bit only compilation unit. Move it to mmap.c where it is visible to both subarches. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1312633899-1131-1-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> (cherry picked from git://tesla.tglx.de/git/linux-2.6-tip x86/cpu commit 9387f774d61b01ab71bade85e6d0bfab0b3419bd) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Leann Ogasawara <leann.ogasawara@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com>
| * x86, amd: Move BSP code to cpu_dev helperBorislav Petkov2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/862583 Move code which is run once on the BSP during boot into the cpu_dev helper. [ hpa: removed bogus cpu_has -> static_cpu_has conversion ] Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/20110805180409.GC26217@aftab Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> (cherry picked from git://tesla.tglx.de/git/linux-2.6-tip x86/cpu commit 8fa8b035085e7320c15875c1f6b03b290ca2dd66) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Leann Ogasawara <leann.ogasawara@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com>
| * x86: Add a BSP cpu_dev helperBorislav Petkov2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/862583 Add a function ptr to struct cpu_dev which is destined to be run only once on the BSP during boot. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/20110805180116.GB26217@aftab Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> (cherry picked from git://tesla.tglx.de/git/linux-2.6-tip x86/cpu commit a110b5ec7371592eac856ac5c22dc7b518952d44) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Leann Ogasawara <leann.ogasawara@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com>
| * x86, amd: Avoid cache aliasing penalties on AMD family 15hBorislav Petkov2011-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BugLink: http://bugs.launchpad.net/bugs/862583 This patch provides performance tuning for the "Bulldozer" CPU. With its shared instruction cache there is a chance of generating an excessive number of cache cross-invalidates when running specific workloads on the cores of a compute module. This excessive amount of cross-invalidations can be observed if cache lines backed by shared physical memory alias in bits [14:12] of their virtual addresses, as those bits are used for the index generation. This patch addresses the issue by clearing all the bits in the [14:12] slice of the file mapping's virtual address at generation time, thus forcing those bits the same for all mappings of a single shared library across processes and, in doing so, avoids instruction cache aliases. It also adds the command line option "align_va_addr=(32|64|on|off)" with which virtual address alignment can be enabled for 32-bit or 64-bit x86 individually, or both, or be completely disabled. This change leaves virtual region address allocation on other families and/or vendors unaffected. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1312550110-24160-2-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> (cherry picked from git://tesla.tglx.de/git/linux-2.6-tip x86/cpu commit dfb09f9b7ab03fd367740e541a5caf830ed56726) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Leann Ogasawara <leann.ogasawara@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com>