path: root/arch
Commit message (Collapse)AuthorAge
* perf/x86: Revamp PEBS event selectionAndi Kleen2014-08-13
The basic idea is that it does not make sense to list all PEBS events individually. The list is very long, sometimes outdated, and the hardware doesn't need it. If an event does not support PEBS it will just not count; there is no security issue.

We only need to list events that do something special, like supporting load or store addresses.

This vastly simplifies the PEBS event selection. It also speeds up the scheduling because the scheduler doesn't have to walk as many constraints.

Bugs fixed:
- We do not allow setting forbidden flags with PEBS anymore (SDM 18.9.4), except for the special cycle event. This is done using a new constraint macro that also matches on the event flags.
- Correct DataLA and load/store/na flags reporting on Haswell [Requires a followon patch]
- We did not allow all PEBS events on Haswell: We were missing some valid subevents in d1-d2 (MEM_LOAD_UOPS_RETIRED.*, MEM_LOAD_UOPS_RETIRED_L3_HIT_RETIRED.*)

This includes the changes proposed by Stephane earlier and obsoletes his patchkit (except for some changes on pre Sandy Bridge/Silvermont CPUs).

I only did Sandy Bridge and Silvermont and later so far, mostly because these are the parts where I could directly confirm the hardware behavior with hardware architects. Also I do not believe the older CPUs have any missing events in their PEBS list, so there's no pressing need to change them.

I did not implement the flag proposed by Peter to allow setting forbidden flags. If really needed this could be implemented on top of this patch.

v2: Fix broken store events on SNB/IVB (Stephane Eranian)
v3: More fixes. Rename some arguments (Stephane Eranian)
v4: List most Haswell events individually again to report memory operation type correctly. Add new flags to describe load/store/na for datala. Update description.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1407785233-32193-2-git-send-email-eranian@google.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Mark Davies <junk@eslaf.co.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
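The idea of a constraint that "also matches on the event flags" can be illustrated with a small sketch. This is not the kernel macro itself: the helper name `constraint_matches`, the simplified `ALL_FLAGS` set, and the example event encoding are assumptions made for illustration. Only the bit positions of the edge, any, inv and cmask fields follow the architectural event-select layout.

```python
# Fields of the x86 event-select register (architectural layout).
EVENT_MASK = 0x00FF        # event select, bits 0-7
UMASK_MASK = 0xFF00        # unit mask, bits 8-15
EDGE = 1 << 18
ANY = 1 << 21
INV = 1 << 23
CMASK_FIELD = 0xFF << 24   # counter mask, bits 24-31

# Simplified flag set for this sketch; the kernel's set contains more bits.
ALL_FLAGS = EDGE | ANY | INV | CMASK_FIELD


def constraint_matches(config: int, code: int, cmask: int) -> bool:
    """Hypothetical helper: the masked config must equal the constraint code."""
    return (config & cmask) == code


# Example encoding (event 0xd0, umask 0x81), used here purely as sample data.
LOAD_EVENT = 0x81D0

plain_mask = EVENT_MASK | UMASK_MASK   # compares event and umask only
flags_mask = plain_mask | ALL_FLAGS    # additionally compares the flag bits

clean = LOAD_EVENT
with_inv = LOAD_EVENT | INV            # same event, forbidden flag set

print(constraint_matches(with_inv, LOAD_EVENT, plain_mask))  # flags ignored
print(constraint_matches(with_inv, LOAD_EVENT, flags_mask))  # flags compared
print(constraint_matches(clean, LOAD_EVENT, flags_mask))
```

With the flag bits included in the comparison mask, an event that sets a forbidden flag no longer matches any PEBS constraint, which is one way to read the first bug fix listed above.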
* perf/x86: Fix :pp without LBRAndi Kleen2014-08-13
This fixes a side effect of Kan's earlier patch to probe the LBRs at boot time. Normally when the LBRs are disabled cycles:pp is disabled too. So for example cycles:pp doesn't work.

However this is not needed with PEBSv2 and later (Haswell) because it does not need LBRs to correct the IP-off-by-one.

So add an extra check for PEBSv2 that also allows :pp

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: kan.liang@intel.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1407456534-15747-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
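The check described above can be sketched as follows. The function and parameter names are hypothetical, and the sketch only models the decision (LBRs present, or PEBS format 2 and later); it is not the kernel code.

```python
def max_precise_level(pebs_active: bool, lbr_nr: int, pebs_format: int) -> int:
    """Highest supported precise level: 0 = none, 1 = :p, 2 = :pp.

    Hypothetical model of the check in the commit message above.
    """
    precise = 0
    if pebs_active:
        precise += 1  # plain PEBS is enough for :p
        # :pp needs an exact IP. Older PEBS uses the LBRs to correct the
        # IP-off-by-one; PEBSv2 and later does not need that fixup.
        if lbr_nr > 0 or pebs_format >= 2:
            precise += 1
    return precise


# LBRs disabled at boot, old PEBS format: only :p is available.
print(max_precise_level(True, lbr_nr=0, pebs_format=1))
# LBRs disabled, PEBSv2 (Haswell): :pp is allowed again.
print(max_precise_level(True, lbr_nr=0, pebs_format=2))
```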
* perf/x86: Use extended offcore mask on HaswellAndi Kleen2014-08-13
HSW-EP has a larger offcore mask than the client Haswell CPUs. It is the same mask as on Sandy/IvyBridge-EP. All of Haswell was using the client mask, so some bits were missing. On the client parts some bits were also missing compared to Sandy/IvyBridge, in particular the bits to match on a L4 cache hit.

The Haswell core in both client and server incarnations accepts the same bits (but some are nops), so we can use the same mask.

So use the snbep extended mask, which is a superset of the client and the server, for all of Haswell.

This allows specifying a number of extra offcore events, like for example for HSW-EP.

    % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fffc00100,name=offcore_response_pf_l3_rfo_l3_miss_any_response/ true

which were <not supported> before.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: eranian@google.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/1406840722-25416-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
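A rough sketch of why a wider valid-bits mask turns a previously rejected `offcore_rsp` value into an accepted one. The two mask constants are assumptions (they are not given in the commit message and should be checked against the kernel source), and the helper names are hypothetical. The `offcore_rsp` value is the one from the example command above.

```python
# ASSUMED mask values, for illustration only; verify against the kernel tree.
CLIENT_MASK = 0x3F807F8FFF     # assumed client offcore valid-bits mask
EXTENDED_MASK = 0x3FFFFF8FFF   # assumed snbep extended valid-bits mask


def offcore_rsp_accepted(value: int, valid_mask: int) -> bool:
    """A value is accepted only if it sets no bit outside the valid mask."""
    return (value & ~valid_mask) == 0


def is_superset(big: int, small: int) -> bool:
    """True when every bit of `small` is also present in `big`."""
    return (small & ~big) == 0


OFFCORE_RSP = 0x3FFFC00100  # value from the perf stat example above

print(offcore_rsp_accepted(OFFCORE_RSP, CLIENT_MASK))
print(offcore_rsp_accepted(OFFCORE_RSP, EXTENDED_MASK))
print(is_superset(EXTENDED_MASK, CLIENT_MASK))
```

Under these assumed masks the example value sets bits that only the extended mask covers, which would match the "<not supported>" behavior described before the change.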
* perf/x86/uncore: Fix coccinelle warningsFengguang Wu2014-08-13
arch/x86/kernel/cpu/perf_event_intel_uncore_nhmex.c:961:2-3: Unneeded semicolon
arch/x86/kernel/cpu/perf_event_intel_uncore_nhmex.c:1100:2-3: Unneeded semicolon
arch/x86/kernel/cpu/perf_event_intel_uncore_nhmex.c:1138:2-3: Unneeded semicolon

Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/n/tip-ovfvr4nbqjo7nzc16y2lpjy9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86/uncore: move NHM-EX/WSM-EX specific code to seperate fileYan, Zheng2014-08-13
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1406704935-27708-4-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86/uncore: Move SNB/IVB-EP specific code to seperate fileYan, Zheng2014-08-13
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1406704935-27708-3-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86/uncore: Move NHM/SNB/IVB specific code to seperate fileYan, Zheng2014-08-13
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Stephane Eranian <eranian@google.com> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1406704935-27708-2-git-send-email-zheng.z.yan@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86/uncore: Declare some functions and variablesYan, Zheng2014-08-13
Prepare for moving hardware specific code to separate files.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: eranian@google.com
Cc: andi@firstfloor.org
Link: http://lkml.kernel.org/r/1406704935-27708-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86/intel: Update Intel modelsPeter Zijlstra2014-08-13
| | | | | | | | | | The model number descriptions got a bit messy, clean them up. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-oo3xclxdoy8s7ubssn929vaj@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'x86-vdso-for-linus' of ↵Linus Torvalds2014-08-04
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 vdso updates from Ingo Molnar:
 "Further simplifications and improvements to the VDSO code, by Andy Lutomirski"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/vsyscall: Fix warn_bad_vsyscall log output
  x86/vdso: Set VM_MAYREAD for the vvar vma
  x86, vdso: Get rid of the fake section mechanism
  x86, vdso: Move the vvar area before the vdso text
| * x86_64/vsyscall: Fix warn_bad_vsyscall log outputAndy Lutomirski2014-07-25
This commit in Linux 3.6:

    commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655
    Author: Joe Perches <joe@perches.com>
    Date:   Mon May 21 19:50:07 2012 -0700

        x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>

caused warn_bad_vsyscall to output garbage in the middle of the line. Revert the bad part of it.

The printk in question isn't actually bare; the level is "%s".

The bug this fixes is purely cosmetic; backports are optional.

Cc: <stable@vger.kernel.org> # v3.6+
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * x86/vdso: Set VM_MAYREAD for the vvar vmaAndy Lutomirski2014-07-25
| | | | | | | | | | | | | | | | | | | | | | | | The VVAR area can, obviously, be read; that is kind of the point. AFAIK this has no effect whatsoever unless x86 suddenly turns into a nommu architecture. Nonetheless, not setting it is suspicious. Reported-by: Nathan Lynch <Nathan_Lynch@mentor.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/e4c8bf4bc2725bda22c4a4b7d0c82adcd8f8d9b8.1406330779.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * x86, vdso: Get rid of the fake section mechanismAndy Lutomirski2014-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that we can tolerate extra things dangling off the end of the vdso image, we can strip the vdso the old fashioned way rather than using an overcomplicated custom stripping algorithm. This is a partial reversion of: 6f121e5 x86, vdso: Reimplement vdso.so preparation in build-time C Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/50e01ed6dcc0575d20afd782f9fe98d5ee3e2d8a.1405040914.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * x86, vdso: Move the vvar area before the vdso textAndy Lutomirski2014-07-11
Putting the vvar area after the vdso text is rather complicated: it only works if the total length of the vdso text mapping is known at vdso link time, and the linker doesn't allow symbol addresses to depend on the sizes of non-allocatable data after the PT_LOAD segment.

Moving the vvar area before the vdso text will allow us to safely map non-allocatable data after the vdso text, which is a nice simplification.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/156c78c0d93144ff1055a66493783b9e56813983.1405040914.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | Merge branch 'x86-uv-for-linus' of ↵Linus Torvalds2014-08-04
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 UV TLB update from Ingo Molnar:
 "UV TLB shootdown logic updates for version of the UV architecture"

* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/uv: Update the UV3 TLB shootdown logic
| * | x86/uv: Update the UV3 TLB shootdown logicCliff Wickman2014-06-05
Update of TLB shootdown code for UV3.

Kernel function native_flush_tlb_others() calls uv_flush_tlb_others() on UV to invalidate tlb page definitions on remote cpus. The UV systems have a hardware 'broadcast assist unit' which can be used to broadcast shootdown messages to all cpu's of selected nodes.

The behavior of the BAU has changed only slightly with UV3:
- UV3 is recognized with is_uv3_hub().
- UV2 functions and structures (uv2_xxx) are in most cases simply renamed to uv2_3_xxx.
- Some UV2 error workarounds are not needed for UV3. (see uv_bau_message_interrupt and enable_timeouts)

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Link: http://lkml.kernel.org/r/E1WkgWh-0001yJ-3K@eag09.americas.sgi.com
[ Removed a few linebreak uglies. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'x86-ras-for-linus' of ↵Linus Torvalds2014-08-04
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Ingo Molnar:
 "The main changes in this cycle are:

   - RAS tracing/events infrastructure, by Gong Chen.

   - Various generalizations of the APEI code to make it available to non-x86 architectures, by Tomasz Nowicki"

* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ras: Fix build warnings in <linux/aer.h>
  acpi, apei, ghes: Factor out ioremap virtual memory for IRQ and NMI context.
  acpi, apei, ghes: Make NMI error notification to be GHES architecture extension.
  apei, mce: Factor out APEI architecture specific MCE calls.
  RAS, extlog: Adjust init flow
  trace, eMCA: Add a knob to adjust where to save event log
  trace, RAS: Add eMCA trace event interface
  RAS, debugfs: Add debugfs interface for RAS subsystem
  CPER: Adjust code flow of some functions
  x86, MCE: Robustify mcheck_init_device
  trace, AER: Move trace into unified interface
  trace, RAS: Add basic RAS trace event
  x86, MCE: Kill CPU_POST_DEAD
| * \ \ Merge tag 'please-pull-apei' into x86/rasH. Peter Anvin2014-07-30
APEI is currently implemented so that it depends on x86 hardware. The primary dependency is that GHES uses the x86 NMI for hardware error notification and MCE for memory error handling. These patches remove that dependency.

Other APEI features such as error reporting via external IRQ, error serialization, or error injection, do not require changes to use them on non-x86 architectures.

The following patch set eliminates the APEI Kconfig x86 dependency by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context

NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.

Note, these patches introduce no functional changes for x86. The NMI notification feature is hard selected for x86. Architectures that want to use this feature should also provide NMI code infrastructure.
| | * | | acpi, apei, ghes: Factor out ioremap virtual memory for IRQ and NMI context.Tomasz Nowicki2014-07-22
GHES currently maps two pages with atomic_ioremap. From now on, NMI support is architecture dependent, so there is no need to allocate an NMI page for platforms without NMI support.

To make it possible to not use a second page, swap the existing page order so that the IRQ context page is first, and the optional NMI context page is second. Then, use HAVE_ACPI_APEI_NMI to decide how many pages are to be allocated.

Signed-off-by: Tomasz Nowicki <tomasz.nowicki@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
| | * | | acpi, apei, ghes: Make NMI error notification to be GHES architecture extension.Tomasz Nowicki2014-07-22
Currently APEI depends on the x86 architecture. This is because the NMI hardware error notification of GHES is currently supported by x86 only. However, many other APEI features can still be used perfectly well by other architectures.

This commit adds two symbols:
1. HAVE_ACPI_APEI for those archs which support APEI.
2. HAVE_ACPI_APEI_NMI which is used for NMI code isolation in ghes.c file. NMI related data and functions are grouped so they can be wrapped inside one #ifdef section. Appropriate function stubs are provided for !NMI case.

Note there are no functional changes for x86 due to the hard selected HAVE_ACPI_APEI and HAVE_ACPI_APEI_NMI symbols.

Signed-off-by: Tomasz Nowicki <tomasz.nowicki@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
| | * | | apei, mce: Factor out APEI architecture specific MCE calls.Tomasz Nowicki2014-07-22
This commit abstracts MCE calls and provides weak default implementations for those architectures which do not need arch specific actions. Each platform that wants to do additional architectural actions should provide the desired function definition. This lets us avoid wrapping code in #ifdef in generic code, and it also keeps new platforms from having to introduce dummy stub functions.

Initially, there are two APEI arch-specific calls:
- arch_apei_enable_cmcff()
- arch_apei_report_mem_error()

Both interact with the MCE driver for the X86 architecture.

Signed-off-by: Tomasz Nowicki <tomasz.nowicki@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
| * | | | x86, MCE: Robustify mcheck_init_deviceBorislav Petkov2014-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BorisO reports that misc_register() fails often on xen. The current code unregisters the CPU hotplug notifier in that case. If then a CPU is offlined and onlined back again, we end up with a second timer running on that CPU, leading to soft lockups and system hangs. So let's leave the hotcpu notifier always registered - even if mce_device_create failed for some cores and never unreg it so that we can deal with the timer handling accordingly. Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1403274493-1371-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Borislav Petkov <bp@suse.de>
| * | | | x86, MCE: Kill CPU_POST_DEADBorislav Petkov2014-06-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In conjunction with cleaning up CPU hotplug, we want to get rid of CPU_POST_DEAD. Kill this instance here and rediscover CMCI banks at the end of CPU_DEAD. Link: http://lkml.kernel.org/r/http://lkml.kernel.org/r/1400750624-19238-1-git-send-email-bp@alien8.de Signed-off-by: Borislav Petkov <bp@suse.de>
* | | | | Merge branch 'x86-platform-for-linus' of ↵Linus Torvalds2014-08-04
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 platform updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Intel SOC driver updates, by Aubrey Li.

   - TS5500 platform updates, by Vivien Didelot"

* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pmc_atom: Silence shift wrapping warnings in pmc_sleep_tmr_show()
  x86/pmc_atom: Expose PMC device state and platform sleep state
  x86/pmc_atom: Eisable a few S0ix wake up events for S0ix residency
  x86/platform: New Intel Atom SOC power management controller driver
  x86/platform/ts5500: Add support for TS-5400 boards
  x86/platform/ts5500: Add a 'name' sysfs attribute
  x86/platform/ts5500: Use the DEVICE_ATTR_RO() macro
| * | | | | x86/pmc_atom: Silence shift wrapping warnings in pmc_sleep_tmr_show()Dan Carpenter2014-08-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I don't know if we really need 64 bits here but these variables are declared as u64 and it can't hurt to cast this so we prevent any shift wrapping. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Aubrey Li <aubrey.li@linux.intel.com> Link: http://lkml.kernel.org/r/20140801082715.GE28869@mwanda Signed-off-by: H. Peter Anvin <hpa@zytor.com>
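The shift wrapping that this cast guards against can be sketched in Python by emulating C integer widths. The register value and the shift amount below are hypothetical; the helpers only model what happens when a 32-bit value is shifted before versus after being widened to 64 bits.

```python
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF


def shift_as_u32(value: int, shift: int) -> int:
    """Emulate C `u32 << shift`: bits pushed past bit 31 are lost."""
    return ((value & U32_MASK) << shift) & U32_MASK


def shift_as_u64(value: int, shift: int) -> int:
    """Emulate C `(u64)value << shift`: widen first, then shift."""
    return ((value & U32_MASK) << shift) & U64_MASK


reg = 0xF0000001  # hypothetical 32-bit register reading
shift = 5         # hypothetical shift amount

print(hex(shift_as_u32(reg, shift)))  # high bits wrapped away
print(hex(shift_as_u64(reg, shift)))  # full value preserved
```

When the result is stored in a u64 anyway, widening before the shift costs nothing and keeps the high bits.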
| * | | | | x86/pmc_atom: Expose PMC device state and platform sleep stateLi, Aubrey2014-07-25
Add the following interfaces to expose PMC device state and sleep state residency via debugfs:

    /sys/kernel/debugfs/pmc_atom/dev_state
    /sys/kernel/debugfs/pmc_atom/sleep_state

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FF59.8000600@linux.intel.com
Signed-off-by: Kasagar, Srinidhi <srinidhi.kasagar@intel.com>
Reviewed-by: Rudramuni, Vishwesh M <vishwesh.m.rudramuni@intel.com>
Reviewed-by: Joe Perches <joe@perches.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | x86/pmc_atom: Eisable a few S0ix wake up events for S0ix residencyLi, Aubrey2014-07-25
Disable PMC S0IX_WAKE_EN events coming from:
- LPC block (unused)
- GPIO_SUS ored dedicated IRQs (must be disabled as per PMC programming rule)
- GPIOSCORE ored dedicated IRQs (must be disabled as per PMC programming rule)
- GPIO_SUS shared IRQ (not necessary since the IOAPIC_DS wake event will still work)
- GPIO_SCORE shared IRQ (not necessary since the IOAPIC_DS wake event will still work)

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FF22.5080403@linux.intel.com
Signed-off-by: Olivier Leveque <olivier.leveque@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | x86/platform: New Intel Atom SOC power management controller driverLi, Aubrey2014-07-25
The Power Management Controller (PMC) controls many of the power management features present in the Atom SoC. This driver provides a native power off function via PMC PCI IO port.

On some ACPI hardware-reduced platforms (e.g. ASUS-T100), ACPI sleep registers are not valid so that (*pm_power_off)() is not hooked by acpi_power_off(). The power off function in this driver is installed only when pm_power_off is NULL.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FEEA.3010805@linux.intel.com
Signed-off-by: Lejun Zhu <lejun.zhu@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | x86/platform/ts5500: Add support for TS-5400 boardsVivien Didelot2014-07-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch extends the TS-5500 platform driver to support the similar Technologic Systems TS-5400 Single Board Computer: http://wiki.embeddedarm.com/wiki/TS-5400 Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Savoir-faire Linux Inc. <kernel@savoirfairelinux.com> Link: http://lkml.kernel.org/r/1404860269-11837-4-git-send-email-vivien.didelot@savoirfairelinux.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/platform/ts5500: Add a 'name' sysfs attributeVivien Didelot2014-07-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new "name" attribute to the TS5500 sysfs group, to clarify which supported board model it is. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Savoir-faire Linux Inc. <kernel@savoirfairelinux.com> Link: http://lkml.kernel.org/r/1404860269-11837-3-git-send-email-vivien.didelot@savoirfairelinux.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | x86/platform/ts5500: Use the DEVICE_ATTR_RO() macroVivien Didelot2014-07-16
Use the DEVICE_ATTR_RO() helper macro to simplify the declaration of read-only sysfs attributes in the TS5500 code.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Savoir-faire Linux Inc. <kernel@savoirfairelinux.com>
Link: http://lkml.kernel.org/r/1404860269-11837-2-git-send-email-vivien.didelot@savoirfairelinux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2014-08-04
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm changes from Ingo Molnar:
 "The main change in this cycle is the rework of the TLB range flushing code, to simplify, fix and consolidate the code. By Dave Hansen"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Set TLB flush tunable to sane value (33)
  x86/mm: New tunable for single vs full TLB flush
  x86/mm: Add tracepoints for TLB flushes
  x86/mm: Unify remote INVLPG code
  x86/mm: Fix missed global TLB flush stat
  x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
  x86/mm: Clean up the TLB flushing code
  x86/smep: Be more informative when signalling an SMEP fault
| * | | | | | x86/mm: Set TLB flush tunable to sane value (33)Dave Hansen2014-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has been run through Intel's LKP tests across a wide range of modern sytems and workloads and it wasn't shown to make a measurable performance difference positive or negative. Now that we have some shiny new tracepoints, we can actually figure out what the heck is going on. During a kernel compile, 60% of the flush_tlb_mm_range() calls are for a single page. 
It breaks down like this:

      size  percent  percent<=
         V        V          V
    GLOBAL:   2.20%    2.20%  avg cycles:  2283
         1:  56.92%   59.12%  avg cycles:  1276
         2:  13.78%   72.90%  avg cycles:  1505
         3:   8.26%   81.16%  avg cycles:  1880
         4:   7.41%   88.58%  avg cycles:  2447
         5:   1.73%   90.31%  avg cycles:  2358
         6:   1.32%   91.63%  avg cycles:  2563
         7:   1.14%   92.77%  avg cycles:  2862
         8:   0.62%   93.39%  avg cycles:  3542
         9:   0.08%   93.47%  avg cycles:  3289
        10:   0.43%   93.90%  avg cycles:  3570
        11:   0.20%   94.10%  avg cycles:  3767
        12:   0.08%   94.18%  avg cycles:  3996
        13:   0.03%   94.20%  avg cycles:  4077
        14:   0.02%   94.23%  avg cycles:  4836
        15:   0.04%   94.26%  avg cycles:  5699
        16:   0.06%   94.32%  avg cycles:  5041
        17:   0.57%   94.89%  avg cycles:  5473
        18:   0.02%   94.91%  avg cycles:  5396
        19:   0.03%   94.95%  avg cycles:  5296
        20:   0.02%   94.96%  avg cycles:  6749
        21:   0.18%   95.14%  avg cycles:  6225
        22:   0.01%   95.15%  avg cycles:  6393
        23:   0.01%   95.16%  avg cycles:  6861
        24:   0.12%   95.28%  avg cycles:  6912
        25:   0.05%   95.32%  avg cycles:  7190
        26:   0.01%   95.33%  avg cycles:  7793
        27:   0.01%   95.34%  avg cycles:  7833
        28:   0.01%   95.35%  avg cycles:  8253
        29:   0.08%   95.42%  avg cycles:  8024
        30:   0.03%   95.45%  avg cycles:  9670
        31:   0.01%   95.46%  avg cycles:  8949
        32:   0.01%   95.46%  avg cycles:  9350
        33:   3.11%   98.57%  avg cycles:  8534
        34:   0.02%   98.60%  avg cycles: 10977
        35:   0.02%   98.62%  avg cycles: 11400

We get into diminishing returns pretty quickly. On pre-IvyBridge CPUs, we used to set the limit at 8 pages, and it was set at 128 on IvyBridge. That 128 number looks pretty silly considering that less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of the TLB. Good idea, but it's error-prone, needs maintenance (which it didn't get up to now), and probably would not matter much in practice.

Setting it to 33 means that we cover the mallopt M_TRIM_THRESHOLD, which is the most universally common size to do flushes.

That's the short version. Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the tracing. It probably accounts for a couple hundred cycles in each of these tests, but it should be fairly _even_ across all of them. The smallest delta between the tracepoints I have ever seen is 335 cycles. This is one reason the cycles/page cost goes down in general as the flushes get larger. The true cost is nearer to 100 cycles.
2. A full flush is more expensive than a single invlpg, but not by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85 invlpg operations. But the odds are that the TLB can actually be filled up faster than that, because TLB misses that are close in time also tend to leverage the same caches.
5. ~98% of flushes are <=33 pages. There are a lot of flushes of 33 pages, probably because libc's M_TRIM_THRESHOLD is set to 128k (32 pages).
6. I've found no consistent data to support changing the IvyBridge vs. SandyBridge tunable by a factor of 16.

I used the performance counters on this hardware (IvyBridge i5-3320M) to figure out the tlb miss costs:

    ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970  dtlb_load_misses_walk_duration    [57.13%]
       169,856,353  dtlb_load_misses_walk_completed   [57.15%]
       708,832,859  dtlb_store_misses_walk_duration   [57.17%]
        19,346,823  dtlb_store_misses_walk_completed  [57.17%]
     2,779,687,402  itlb_misses_walk_duration         [57.15%]
        82,241,148  itlb_misses_walk_completed        [57.13%]
           770,717  itlb_itlb_flush                   [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. On a SandyBridge system with more cores and larger caches, those are dtlb=13.4ns and itlb=9.5ns.
    cat perf.stat.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are:

    itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:

    https://lkml.org/lkml/2012/6/12/119

say that a flush and a refill are about 100ns. Being generous, that is over by a factor of 6 on the refill side, although it is fairly close on the cost of an invlpg. An increase of a single invlpg operation seems to lengthen the flush range operation by about 200 cycles. Here is one example of the data collected for flushing 10 and 11 pages (full data are below):

    10:  0.43%  93.90%  avg cycles:  3570  cycles/page:  357  samples:  4714
    11:  0.20%  94.10%  avg cycles:  3767  cycles/page:  342  samples:  2145

How to generate this table:

    echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
    echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
    echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
    echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output into this script:

    http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to 150 pages.
Only data points with >= 50 samples were printed:

    Flush      % of   %<= this
    in pages  flushes     size
    ------------------------------------------------------------------------------
      -1:   2.20%   2.20%  avg cycles:  2283  cycles/page: xxxx  samples:  23960
       1:  56.92%  59.12%  avg cycles:  1276  cycles/page: 1276  samples: 620895
       2:  13.78%  72.90%  avg cycles:  1505  cycles/page:  752  samples: 150335
       3:   8.26%  81.16%  avg cycles:  1880  cycles/page:  626  samples:  90131
       4:   7.41%  88.58%  avg cycles:  2447  cycles/page:  611  samples:  80877
       5:   1.73%  90.31%  avg cycles:  2358  cycles/page:  471  samples:  18885
       6:   1.32%  91.63%  avg cycles:  2563  cycles/page:  427  samples:  14397
       7:   1.14%  92.77%  avg cycles:  2862  cycles/page:  408  samples:  12441
       8:   0.62%  93.39%  avg cycles:  3542  cycles/page:  442  samples:   6721
       9:   0.08%  93.47%  avg cycles:  3289  cycles/page:  365  samples:    917
      10:   0.43%  93.90%  avg cycles:  3570  cycles/page:  357  samples:   4714
      11:   0.20%  94.10%  avg cycles:  3767  cycles/page:  342  samples:   2145
      12:   0.08%  94.18%  avg cycles:  3996  cycles/page:  333  samples:    864
      13:   0.03%  94.20%  avg cycles:  4077  cycles/page:  313  samples:    289
      14:   0.02%  94.23%  avg cycles:  4836  cycles/page:  345  samples:    236
      15:   0.04%  94.26%  avg cycles:  5699  cycles/page:  379  samples:    390
      16:   0.06%  94.32%  avg cycles:  5041  cycles/page:  315  samples:    643
      17:   0.57%  94.89%  avg cycles:  5473  cycles/page:  321  samples:   6229
      18:   0.02%  94.91%  avg cycles:  5396  cycles/page:  299  samples:    224
      19:   0.03%  94.95%  avg cycles:  5296  cycles/page:  278  samples:    367
      20:   0.02%  94.96%  avg cycles:  6749  cycles/page:  337  samples:    185
      21:   0.18%  95.14%  avg cycles:  6225  cycles/page:  296  samples:   1964
      22:   0.01%  95.15%  avg cycles:  6393  cycles/page:  290  samples:     83
      23:   0.01%  95.16%  avg cycles:  6861  cycles/page:  298  samples:     61
      24:   0.12%  95.28%  avg cycles:  6912  cycles/page:  288  samples:   1307
      25:   0.05%  95.32%  avg cycles:  7190  cycles/page:  287  samples:    533
      26:   0.01%  95.33%  avg cycles:  7793  cycles/page:  299  samples:     94
      27:   0.01%  95.34%  avg cycles:  7833  cycles/page:  290  samples:     66
      28:   0.01%  95.35%  avg cycles:  8253  cycles/page:  294  samples:     73
      29:   0.08%  95.42%  avg cycles:  8024  cycles/page:  276  samples:    846
      30:   0.03%  95.45%  avg cycles:  9670  cycles/page:  322  samples:    296
      31:   0.01%  95.46%  avg cycles:  8949  cycles/page:  288  samples:     79
      32:   0.01%  95.46%  avg cycles:  9350  cycles/page:  292  samples:     60
      33:   3.11%  98.57%  avg cycles:  8534  cycles/page:  258  samples:  33936
      34:   0.02%  98.60%  avg cycles: 10977  cycles/page:  322  samples:    268
      35:   0.02%  98.62%  avg cycles: 11400  cycles/page:  325  samples:    177
      36:   0.01%  98.63%  avg cycles: 11504  cycles/page:  319  samples:    161
      37:   0.02%  98.65%  avg cycles: 11596  cycles/page:  313  samples:    182
      38:   0.02%  98.66%  avg cycles: 11850  cycles/page:  311  samples:    195
      39:   0.01%  98.68%  avg cycles: 12158  cycles/page:  311  samples:    128
      40:   0.01%  98.68%  avg cycles: 11626  cycles/page:  290  samples:     78
      41:   0.04%  98.73%  avg cycles: 11435  cycles/page:  278  samples:    477
      42:   0.01%  98.73%  avg cycles: 12571  cycles/page:  299  samples:     74
      43:   0.01%  98.74%  avg cycles: 12562  cycles/page:  292  samples:     78
      44:   0.01%  98.75%  avg cycles: 12991  cycles/page:  295  samples:    108
      45:   0.01%  98.76%  avg cycles: 13169  cycles/page:  292  samples:     78
      46:   0.02%  98.78%  avg cycles: 12891  cycles/page:  280  samples:    261
      47:   0.01%  98.79%  avg cycles: 13099  cycles/page:  278  samples:     67
      48:   0.01%  98.80%  avg cycles: 13851  cycles/page:  288  samples:     77
      49:   0.01%  98.80%  avg cycles: 13749  cycles/page:  280  samples:     66
      50:   0.01%  98.81%  avg cycles: 13949  cycles/page:  278  samples:     73
      52:   0.00%  98.82%  avg cycles: 14243  cycles/page:  273  samples:     52
      54:   0.01%  98.83%  avg cycles: 15312  cycles/page:  283  samples:     87
      55:   0.01%  98.84%  avg cycles: 15197  cycles/page:  276  samples:    109
      56:   0.02%  98.86%  avg cycles: 15234  cycles/page:  272  samples:    208
      57:   0.00%  98.86%  avg cycles: 14888  cycles/page:  261  samples:     53
      58:   0.01%  98.87%  avg cycles: 15037  cycles/page:  259  samples:     59
      59:   0.01%  98.87%  avg cycles: 15752  cycles/page:  266  samples:     63
      62:   0.00%  98.89%  avg cycles: 16222  cycles/page:  261  samples:     54
      64:   0.02%  98.91%  avg cycles: 17179  cycles/page:  268  samples:    248
      65:   0.12%  99.03%  avg cycles: 18762  cycles/page:  288  samples:   1324
      85:   0.00%  99.10%  avg cycles: 21649  cycles/page:  254  samples:     50
     127:   0.01%  99.18%  avg cycles: 32397  cycles/page:  255  samples:     75
     128:   0.13%  99.31%  avg cycles: 31711  cycles/page:  247  samples:   1466
     129:   0.18%  99.49%  avg cycles: 33017  cycles/page:  255  samples:   1927
     181:   0.33%  99.84%  avg cycles:  2489  cycles/page:   13  samples:   3547
     256:   0.05%  99.91%  avg cycles:  2305  cycles/page:    9  samples:    550
     512:   0.03%  99.95%  avg cycles:  2133  cycles/page:    4  samples:    304
    1512:   0.01%  99.99%  avg cycles:  3038  cycles/page:    2  samples:     65

Here are the tlb counters during a 10-second slice of a kernel compile for a SandyBridge system. It's better than IvyBridge, but probably due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282  dtlb_load_misses_walk_duration
       250,711,333  dtlb_load_misses_walk_completed
     1,212,395,865  dtlb_store_misses_walk_duration
        31,615,772  dtlb_store_misses_walk_completed
     5,091,010,274  itlb_misses_walk_duration
       163,193,511  itlb_misses_walk_completed
         1,321,980  itlb_itlb_flush

      10.008045158 seconds time elapsed

    # cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb ns/miss: ", icyc/imiss/3.3, " dtlb ns/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }'
    itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
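The per-miss figures quoted in the message can be re-derived from the IvyBridge counter totals. The following is a minimal sketch of that arithmetic, not code from the patch: the helper names are hypothetical, and the 2.6 GHz clock used for the nanosecond conversion is an assumption about the i5-3320M that the message itself does not state.

```c
#include <assert.h>

/* Counter totals copied from the IvyBridge perf output quoted in the message. */
static const double DTLB_WALK_CYCLES = 7720030970.0 + 708832859.0; /* load + store walk_duration */
static const double DTLB_WALKS       = 169856353.0 + 19346823.0;   /* load + store walk_completed */
static const double ITLB_WALK_CYCLES = 2779687402.0;               /* itlb walk_duration */
static const double ITLB_WALKS       = 82241148.0;                 /* itlb walk_completed */

/* Assumed core clock in GHz; not stated in the commit message. */
static const double ASSUMED_GHZ = 2.6;

/* Average page-walk cost in cycles per completed walk (hypothetical helper). */
static double cycles_per_miss(double walk_cycles, double walks)
{
	assert(walks > 0.0);
	return walk_cycles / walks;
}

/* Convert a cycle count to nanoseconds at the given clock (hypothetical helper). */
static double cycles_to_ns(double cycles, double ghz)
{
	assert(ghz > 0.0);
	return cycles / ghz;
}

/* Estimated cycles to refill every entry of a TLB, one miss per entry. */
static double refill_cycles(unsigned int entries, double cyc_per_miss)
{
	return (double)entries * cyc_per_miss;
}
```

Under these assumptions, `cycles_per_miss(DTLB_WALK_CYCLES, DTLB_WALKS)` comes out a little under 45 cycles and `refill_cycles(512, ...)` lands in the low 22,000s, in line with the rounded numbers in the message.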
| * | | | | | x86/mm: New tunable for single vs full TLB flush (Dave Hansen, 2014-07-31)
Most of the logic here is in the documentation file. Please take a look at it.

I know we've come full-circle here back to a tunable, but this new one is *WAY* simpler. I challenge anyone to describe in one sentence how the old one worked. Here's the way the new one works:

    If we are flushing more pages than the ceiling, we use the full flush, otherwise we use per-page flushes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
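The one-sentence rule in this message can be illustrated in userspace C. This is only a sketch under stated assumptions, not the kernel's implementation: all `sketch_*` and `SKETCH_*` names are hypothetical, 4KB base pages are assumed, and partial pages are rounded up for simplicity. The ceiling of 33 is the value the series settles on.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12      /* assumption: 4KB base pages */
#define SKETCH_FLUSH_ALL  (~0UL)  /* hypothetical "flush everything" marker */

/* Hypothetical stand-in for the tunable ceiling. */
static unsigned long sketch_flush_ceiling = 33;

/* Number of base pages covered by [start, end), rounding a partial page up. */
static unsigned long sketch_pages_in_range(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return (end - start + (1UL << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
}

/* True when a full flush is preferred over per-page invalidation. */
static bool sketch_want_full_flush(unsigned long start, unsigned long end)
{
	if (end == SKETCH_FLUSH_ALL)
		return true;
	return sketch_pages_in_range(start, end) > sketch_flush_ceiling;
}
```

With the ceiling at 33, a 33-page range is still flushed page by page, while a 34-page range falls over to the full flush.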
| * | | | | | x86/mm: Add tracepoints for TLB flushes (Dave Hansen, 2014-07-31)
We don't have any good way to figure out what kinds of flushes are being attempted. Right now, we can try to use the vm counters, but those only tell us what we actually did with the hardware (one-by-one vs full) and don't tell us what was actually _requested_.

This allows us to select out "interesting" TLB flushes that we might want to optimize (like the ranged ones) and ignore the ones that we have very little control over (the ones at context switch).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | | x86/mm: Unify remote INVLPG code (Dave Hansen, 2014-07-31)
There are currently three paths through the remote flush code:

1. full invalidation
2. single page invalidation using invlpg
3. ranged invalidation using invlpg

This combines 2 and 3 into a single path by treating the single-page case as a range whose end is start plus a single page. This makes placement of our tracepoint easier.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | | x86/mm: Fix missed global TLB flush stat (Dave Hansen, 2014-07-31)
If we take the

    if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
        local_flush_tlb();
        goto out;
    }

path out of flush_tlb_mm_range(), we will have flushed the tlb, but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the way out of the function so that we always take a single path when doing a full tlb flush.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
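The accounting bug described here is a common shape: an early exit performs the work but skips the counter. The sketch below contrasts the two control flows in plain C. It is an illustration only; the struct, the function names and the `force_full` and `ceiling` parameters are hypothetical and do not mirror the kernel's actual signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters standing in for the kernel's vm event stats. */
struct sketch_stats {
	unsigned long full_flushes;
	unsigned long page_flushes;
};

/* Before: the forced-full path flushes and leaves without counting. */
static void sketch_flush_before(struct sketch_stats *st, bool force_full,
				unsigned long npages, unsigned long ceiling)
{
	assert(st != NULL);
	if (force_full)
		return;		/* full flush happens here, stat never bumped */

	if (npages > ceiling)
		st->full_flushes++;
	else
		st->page_flushes += npages;
}

/* After: every full flush funnels through one counted path. */
static void sketch_flush_after(struct sketch_stats *st, bool force_full,
			       unsigned long npages, unsigned long ceiling)
{
	assert(st != NULL);
	if (force_full || npages > ceiling) {
		st->full_flushes++;
		return;
	}
	st->page_flushes += npages;
}
```

Folding both conditions into a single branch means a later addition, such as a tracepoint, only has to be placed once.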
| * | | | | | x86/mm: Rip out complicated, out-of-date, buggy TLB flushing (Dave Hansen, 2014-07-31)
I think the flush_tlb_mm_range() code that tries to tune the flush sizes based on the CPU needs to get ripped out for several reasons:

1. It is obviously buggy. It uses mm->total_vm to judge the task's footprint in the TLB. It should certainly be using some measure of RSS, *NOT* ->total_vm, since only resident memory can populate the TLB.
2. Haswell and several other CPUs are missing from the intel_tlb_flushall_shift_set() function. Thus, it has been demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:

       [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
       [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
       [ 0.037444] tlb_flushall_shift: 6

   Which leads it to never use invlpg.
4. The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches)
5. I cannot reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit.

Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch.

This also removes the code which attempts to predict whether we are flushing data or instructions. We expect instruction flushes to be relatively rare and not worth tuning for explicitly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | | x86/mm: Clean up the TLB flushing code (Dave Hansen, 2014-07-31)
The

    if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when it ends up at two different indentation levels. This eliminates one of the copy-n-paste versions. It also gives us a unified exit point for each path through this function. We need this in a minute for our tracepoint.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | | | x86/smep: Be more informative when signalling an SMEP fault (Jiri Kosina, 2014-06-11)
If a page fault is triggered by SMEP, it can't easily be distinguished from any other oops-causing page fault, which might lead to quite some confusion when trying to understand the reason for the oops.

Print an explanatory message in case the fault happened during instruction fetch for a _PAGE_USER page which is present and executable on SMEP-enabled CPUs.

This is consistent with what we are doing for NX already; in addition to immediately seeing from the oops what might be happening, it can even easily give a good indication to sysadmins who are carefully monitoring their kernel logs that someone might be trying to pwn them.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1406102248490.1321@pobox.suse.cz
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | | | | | | Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2014-08-04)
|\ \ \ \ \ \ \
Pull EFI changes from Ingo Molnar:
 "Main changes in this cycle are:

   - arm64 efi stub fixes, preservation of FP/SIMD registers across firmware calls, and conversion of the EFI stub code into a static library - Ard Biesheuvel
   - Xen EFI support - Daniel Kiper
   - Support for autoloading the efivars driver - Lee, Chun-Yi
   - Use the PE/COFF headers in the x86 EFI boot stub to request that the stub be loaded with CONFIG_PHYSICAL_ALIGN alignment - Michael Brown
   - Consolidate all the x86 EFI quirks into one file - Saurabh Tangri
   - Additional error logging in x86 EFI boot stub - Ulf Winkelvos
   - Support loading initrd above 4G in EFI boot stub - Yinghai Lu
   - EFI reboot patches for ACPI hardware reduced platforms"

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  efi/arm64: Handle missing virtual mapping for UEFI System Table
  arch/x86/xen: Silence compiler warnings
  xen: Silence compiler warnings
  x86/efi: Request desired alignment via the PE/COFF headers
  x86/efi: Add better error logging to EFI boot stub
  efi: Autoload efivars
  efi: Update stale locking comment for struct efivars
  arch/x86: Remove efi_set_rtc_mmss()
  arch/x86: Replace plain strings with constants
  xen: Put EFI machinery in place
  xen: Define EFI related stuff
  arch/x86: Remove redundant set_bit(EFI_MEMMAP) call
  arch/x86: Remove redundant set_bit(EFI_SYSTEM_TABLES) call
  efi: Introduce EFI_PARAVIRT flag
  arch/x86: Do not access EFI memory map if it is not available
  efi: Use early_mem*() instead of early_io*()
  arch/ia64: Define early_memunmap()
  x86/reboot: Add EFI reboot quirk for ACPI Hardware Reduced flag
  efi/reboot: Allow powering off machines using EFI
  efi/reboot: Add generic wrapper around EfiResetSystem()
  ...
| * | | | | | | efi/arm64: Handle missing virtual mapping for UEFI System Table (Ard Biesheuvel, 2014-07-18)
If we cannot resolve the virtual address of the UEFI System Table, its physical offset must be missing from the virtual memory map, and there is really no point in proceeding with installing the virtual memory map and the runtime services dispatch table. So back out gracefully.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Mark Salter <msalter@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | arch/x86/xen: Silence compiler warnings (Daniel Kiper, 2014-07-18)
The compiler complains in the following way when an x86 32-bit kernel with Xen support is built:

    CC      arch/x86/xen/enlighten.o
    arch/x86/xen/enlighten.c: In function ‘xen_start_kernel’:
    arch/x86/xen/enlighten.c:1726:3: warning: right shift count >= width of type [enabled by default]

That line contains the following EFI initialization code:

    boot_params.efi_info.efi_systab_hi = (__u32)(__pa(efi_systab_xen) >> 32);

There is no issue if an x86 64-bit kernel is built. However, the 32-bit case generates a warning (even though that code will not be executed, because Xen does not work on 32-bit EFI platforms) due to __pa() returning an unsigned long, which is 32 bits wide there. So move the whole EFI initialization into a separate function and build it conditionally to avoid the above-mentioned warning on the x86 32-bit architecture.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
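The warning arises because shifting a 32-bit unsigned long right by 32 is not a valid shift in C. The commit avoids it by building the code only where it applies. A different, generic way to take the upper half of an address-sized value without the warning is to widen before shifting, sketched here in standalone C. The `sketch_split` helper is hypothetical and is not what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split an address-sized value into 32-bit halves.
 * Widening to 64 bits first keeps the shift count below the type width,
 * so the expression is valid whether unsigned long is 32 or 64 bits.
 */
static void sketch_split(unsigned long v, uint32_t *lo, uint32_t *hi)
{
	assert(lo != NULL && hi != NULL);
	*lo = (uint32_t)((uint64_t)v & 0xffffffffu);
	*hi = (uint32_t)((uint64_t)v >> 32);
}
```

On a target where unsigned long is 32 bits the high half is always zero, which matches the observation in the message that the value could never be meaningful there.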
| * | | | | | | x86/efi: Request desired alignment via the PE/COFF headers (Michael Brown, 2014-07-18)
The EFI boot stub goes to great pains to relocate the kernel image to an appropriately aligned address, as indicated by the ->kernel_alignment field in the bzImage header. However, for the PE stub entry case, we can request that the EFI PE/COFF loader do the work for us.

Fix by exposing the desired alignment via the SectionAlignment field in the PE/COFF headers. Despite its name, this field provides an overall alignment requirement for the loaded file. (Naturally, the FileAlignment field describes the alignment for individual sections.)

There is no way in the PE/COFF headers to express the concept of min_alignment; we therefore do not expose the minimum (as opposed to preferred) alignment.

Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | x86/efi: Add better error logging to EFI boot stub (Ulf Winkelvos, 2014-07-18)
Hopefully this will enable us to better debug: https://bugzilla.kernel.org/show_bug.cgi?id=68761

Signed-off-by: Ulf Winkelvos <ulf@winkelvos.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | arch/x86: Remove efi_set_rtc_mmss() (Daniel Kiper, 2014-07-18)
efi_set_rtc_mmss() is never used to set the RTC due to bugs found on many EFI platforms. The RTC is set directly by mach_set_rtc_mmss() instead. Hence, remove the unused efi_set_rtc_mmss() function.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | arch/x86: Replace plain strings with constants (Daniel Kiper, 2014-07-18)
We've got constants, so let's use them instead of hard-coded values.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | xen: Put EFI machinery in place (Daniel Kiper, 2014-07-18)
This patch enables EFI usage under Xen dom0. The standard EFI Linux kernel infrastructure cannot be used because it requires direct access to EFI data and code. However, in the dom0 case this is not possible, because the EFI data and code are fully owned and controlled by the Xen hypervisor. In this case all calls from dom0 to EFI must be requested via a special hypercall, which in turn executes the relevant EFI code on behalf of dom0.

When the dom0 kernel boots, it checks for EFI availability on the machine. If EFI is detected, an artificial EFI system table is filled in. Native EFI calls are replaced by functions which mimic them by calling the relevant hypercall. Later, a pointer to the EFI system table is passed to the standard EFI machinery, which continues EFI subsystem initialization taking into account that there is no direct access to EFI boot services, runtime, tables, structures, etc. After that the system runs as usual.

This patch is based on work by Jan Beulich and Tang Liang.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | arch/x86: Remove redundant set_bit(EFI_MEMMAP) call (Daniel Kiper, 2014-07-18)
Remove redundant set_bit(EFI_MEMMAP, &efi.flags) call. It is executed earlier in efi_memmap_init().

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
| * | | | | | | arch/x86: Remove redundant set_bit(EFI_SYSTEM_TABLES) call (Daniel Kiper, 2014-07-18)
Remove redundant set_bit(EFI_SYSTEM_TABLES, &efi.flags) call. It is executed earlier in efi_systab_init().

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>