author    Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>  2015-04-14 17:37:47 -0400
commit    6c8a53c9e6a151fffb07f8b4c34bd1e33dddd467 (patch)
tree      791caf826ef136c521a97b7878f226b6ba1c1d75 /arch/x86
parent    e95e7f627062be5e6ce971ce873e6234c91ffc50 (diff)
parent    066450be419fa48007a9f29e19828f2a86198754 (diff)
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Core kernel changes:

   - One of the more interesting features in this cycle is the ability
     to attach eBPF programs (user-defined, sandboxed bytecode executed
     by the kernel) to kprobes.  This allows user-defined
     instrumentation on a live kernel image that can never crash, hang
     or interfere with the kernel negatively.  (Right now it's limited
     to root-only, but in the future we might allow unprivileged use as
     well.)  (Alexei Starovoitov)

   - Another non-trivial feature is per event clockid support: this
     allows, amongst other things, the selection of different clock
     sources for event timestamps traced via perf.  This feature is
     sought by people who'd like to merge perf generated events with
     external events that were measured with different clocks:

       - cluster wide profiling

       - for system wide tracing with user-space events,

       - JIT profiling events

     etc.  Matching perf tooling support is added as well, available
     via the -k, --clockid <clockid> parameter to perf record et al.
     (Peter Zijlstra)

  Hardware enablement kernel changes:

   - x86 Intel Processor Trace (PT) support: which is a hardware tracer
     on steroids, available on Broadwell CPUs.

     The hardware trace stream is directly output into the user-space
     ring-buffer, using the 'AUX' data format extension that was added
     to the perf core to support hardware constraints such as the
     necessity to have the tracing buffer physically contiguous.

     This patch-set was developed for two years and this is the result.
     A simple way to make use of this is to use BTS tracing, the PT
     driver emulates BTS output - available via the 'intel_bts' PMU.
     More explicit PT specific tooling support is in the works as
     well - will probably be ready by 4.2.  (Alexander Shishkin,
     Peter Zijlstra)

   - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
     feature of Intel Xeon CPUs that allows the measurement and
     allocation/partitioning of caches to individual workloads.

     These kernel changes expose the measurement side as a new PMU
     driver, which exposes various QoS related PMU events.  (The
     partitioning change is work in progress and is planned to be
     merged as a cgroup extension.)  (Matt Fleming, Peter Zijlstra;
     CPU feature detection by Peter P Waskiewicz Jr)

   - x86 Intel Haswell LBR call stack support: this is a new Haswell
     feature that allows the hardware recording of call chains, plus
     tooling support.  To activate this feature you have to enable it
     via the new 'lbr' call-graph recording option:

        perf record --call-graph lbr
        perf report

     or:

        perf top --call-graph lbr

     This hardware feature is a lot faster than stack walk or dwarf
     based unwinding, but has some limitations:

       - It reuses the current LBR facility, so LBR call stack and
         branch record can not be enabled at the same time.

       - It is only available for user-space callchains.

     (Yan, Zheng)

   - x86 Intel Broadwell CPU support and various event constraints and
     event table fixes for earlier models.  (Andi Kleen)

   - x86 Intel HT CPUs event scheduling workarounds.  This is a complex
     CPU bug affecting the SNB,IVB,HSW families that results in counter
     value corruption.  The mitigation code is automatically enabled
     and is transparent.  (Maria Dimakopoulou, Stephane Eranian)

  The perf tooling side had a ton of changes in this cycle as well, so
  I'm only able to list the user visible changes here, in addition to
  the tooling changes outlined above:

  User visible changes affecting all tools:

   - Improve support of compressed kernel modules (Jiri Olsa)
   - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
   - Bash completion for subcommands (Yunlong Song)
   - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
   - Support missing -f to override perf.data file ownership. (Yunlong Song)
   - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)

  User visible changes in individual tools:

  'perf data':
   - New tool for converting perf.data to other formats, initially for
     the CTF (Common Trace Format) from LTTng (Jiri Olsa, Sebastian Siewior)

  'perf diff':
   - Add --kallsyms option (David Ahern)

  'perf list':
   - Allow listing events with 'tracepoint' prefix (Yunlong Song)
   - Sort the output of the command (Yunlong Song)

  'perf kmem':
   - Respect -i option (Jiri Olsa)
   - Print big numbers using thousands' group (Namhyung Kim)
   - Allow -v option (Namhyung Kim)
   - Fix alignment of slab result table (Namhyung Kim)

  'perf probe':
   - Support multiple probes on different binaries on the same command
     line (Masami Hiramatsu)
   - Support unnamed union/structure members data collection. (Masami Hiramatsu)
   - Check kprobes blacklist when adding new events. (Masami Hiramatsu)

  'perf record':
   - Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)
   - Support recording running/enabled time (Andi Kleen)

  'perf sched':
   - Improve the performance of 'perf sched replay' on high CPU core
     count machines (Yunlong Song)

  'perf report' and 'perf top':
   - Allow annotating entries in callchains in the hists browser
     (Arnaldo Carvalho de Melo)
   - Indicate which callchain entries are annotated in the TUI hists
     browser (Arnaldo Carvalho de Melo)
   - Add pid/tid filtering to 'report' and 'script' commands (David Ahern)
   - Consider PERF_RECORD_ events with cpumode == 0 in 'perf top',
     removing one cause of long term memory usage buildup, i.e. not
     processing PERF_RECORD_EXIT events (Arnaldo Carvalho de Melo)

  'perf stat':
   - Report unsupported events properly (Suzuki K. Poulose)
   - Output running time and run/enabled ratio in CSV mode (Andi Kleen)

  'perf trace':
   - Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)
   - Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)
   - Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)
   - Dump stack on segfaults (Arnaldo Carvalho de Melo)
   - No need to explicitly enable evsels for workload started from
     perf, let it be enabled via perf_event_attr.enable_on_exec,
     removing some events that take place in the 'perf trace' before a
     workload is really started by it. (Arnaldo Carvalho de Melo)
   - Allow mixing with tracepoints and suppressing plain syscalls.
     (Arnaldo Carvalho de Melo)

  There's also been a ton of infrastructure work done, such as the
  split-out of perf's build system into tools/build/ and other
  changes - see the shortlog and changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
  perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
  perf evlist: Fix type for references to data_head/tail
  perf probe: Check the orphaned -x option
  perf probe: Support multiple probes on different binaries
  perf buildid-list: Fix segfault when show DSOs with hits
  perf tools: Fix cross-endian analysis
  perf tools: Fix error path to do closedir() when synthesizing threads
  perf tools: Fix synthesizing fork_event.ppid for non-main thread
  perf tools: Add 'I' event modifier for exclude_idle bit
  perf report: Don't call map__kmap if map is NULL.
  perf tests: Fix attr tests
  perf probe: Fix ARM 32 building error
  perf tools: Merge all perf_event_attr print functions
  perf record: Add clockid parameter
  perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
  perf sched replay: Support using -f to override perf.data file ownership
  perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
  perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
  perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
  perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
  ...
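As a rough illustration of the per-event clockid support described above, the following is a minimal userspace sketch (not part of the merged code) that opens a software event whose timestamps use CLOCK_MONOTONIC_RAW instead of the default perf clock. It assumes a 4.1+ linux/perf_event.h that exposes the use_clockid and clockid fields of struct perf_event_attr:

/* Illustrative sketch only - not taken from this merge. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.use_clockid = 1;			/* ask for a specific clock... */
	attr.clockid = CLOCK_MONOTONIC_RAW;	/* ...for event timestamps */

	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	printf("event opened with CLOCK_MONOTONIC_RAW timestamps, fd=%ld\n", fd);
	close(fd);
	return 0;
}

From the tooling side, the equivalent request would be something along the lines of "perf record -k monotonic_raw ...", per the -k/--clockid option mentioned in the changelog above.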
Diffstat (limited to 'arch/x86')
-rw-r--r--  arch/x86/include/asm/cpufeature.h                    |   10
-rw-r--r--  arch/x86/include/asm/processor.h                     |    3
-rw-r--r--  arch/x86/include/uapi/asm/msr-index.h                |   18
-rw-r--r--  arch/x86/kernel/cpu/Makefile                         |    3
-rw-r--r--  arch/x86/kernel/cpu/common.c                         |   39
-rw-r--r--  arch/x86/kernel/cpu/intel_pt.h                       |  131
-rw-r--r--  arch/x86/kernel/cpu/perf_event.c                     |  205
-rw-r--r--  arch/x86/kernel/cpu/perf_event.h                     |  167
-rw-r--r--  arch/x86/kernel/cpu/perf_event_amd.c                 |    9
-rw-r--r--  arch/x86/kernel/cpu/perf_event_amd_ibs.c             |   12
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel.c               |  908
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_bts.c           |  525
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_cqm.c           | 1379
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_ds.c            |   31
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_lbr.c           |  321
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_pt.c            | 1103
-rw-r--r--  arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c  |    3
-rw-r--r--  arch/x86/kernel/cpu/scattered.c                      |    1
-rw-r--r--  arch/x86/kernel/kprobes/core.c                       |    9
19 files changed, 4642 insertions, 235 deletions
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 854c04b3c9c2..7ee9b94d9921 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS 11 /* N 32-bit words worth of info */
+#define NCAPINTS 13 /* N 32-bit words worth of info */
 #define NBUGINTS 1 /* N 32-bit bug flags */
 
 /*
@@ -195,6 +195,7 @@
 #define X86_FEATURE_HWP_ACT_WINDOW ( 7*32+12) /* Intel HWP_ACT_WINDOW */
 #define X86_FEATURE_HWP_EPP ( 7*32+13) /* Intel HWP_EPP */
 #define X86_FEATURE_HWP_PKG_REQ ( 7*32+14) /* Intel HWP_PKG_REQ */
+#define X86_FEATURE_INTEL_PT ( 7*32+15) /* Intel Processor Trace */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -226,6 +227,7 @@
 #define X86_FEATURE_ERMS ( 9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 #define X86_FEATURE_INVPCID ( 9*32+10) /* Invalidate Processor Context ID */
 #define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
+#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
 #define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
@@ -244,6 +246,12 @@
 #define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 */
 #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS */
 
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
+#define X86_FEATURE_CQM_LLC (11*32+ 1) /* LLC QoS if 1 */
+
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
+#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+
 /*
  * BUG word(s)
 */
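The defines above encode a feature as (word * 32 + bit), which is why NCAPINTS grows from 11 to 13 when words 11 and 12 are added for the two CPUID 0xF sub-leaves. A small stand-alone sketch of that arithmetic, with the values copied from the hunk above rather than taken from a kernel header:

/* Illustrative sketch only - mirrors the (word*32 + bit) encoding above. */
#include <stdio.h>

#define X86_FEATURE_INTEL_PT      ( 7*32+15)
#define X86_FEATURE_CQM           ( 9*32+12)
#define X86_FEATURE_CQM_LLC       (11*32+ 1)
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0)

static void show(const char *name, int feature)
{
	/* cpu_has() indexes x86_capability[word] and tests bit (feature % 32) */
	printf("%-26s word %2d, bit %2d\n", name, feature / 32, feature % 32);
}

int main(void)
{
	show("X86_FEATURE_INTEL_PT", X86_FEATURE_INTEL_PT);
	show("X86_FEATURE_CQM", X86_FEATURE_CQM);
	show("X86_FEATURE_CQM_LLC", X86_FEATURE_CQM_LLC);
	show("X86_FEATURE_CQM_OCCUP_LLC", X86_FEATURE_CQM_OCCUP_LLC);
	return 0;
}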
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index d2203b5d9538..23ba6765b718 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -109,6 +109,9 @@ struct cpuinfo_x86 {
 	/* in KB - valid for CPUS which support this call: */
 	int x86_cache_size;
 	int x86_cache_alignment; /* In bytes */
+	/* Cache QoS architectural values: */
+	int x86_cache_max_rmid; /* max index */
+	int x86_cache_occ_scale; /* scale to bytes */
 	int x86_power;
 	unsigned long loops_per_jiffy;
 	/* cpuid returned max cores value: */
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index 3ce079136c11..1a4eae695ca8 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
 #define MSR_IA32_PERF_CAPABILITIES 0x00000345
 #define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
 
+#define MSR_IA32_RTIT_CTL 0x00000570
+#define RTIT_CTL_TRACEEN BIT(0)
+#define RTIT_CTL_OS BIT(2)
+#define RTIT_CTL_USR BIT(3)
+#define RTIT_CTL_CR3EN BIT(7)
+#define RTIT_CTL_TOPA BIT(8)
+#define RTIT_CTL_TSC_EN BIT(10)
+#define RTIT_CTL_DISRETC BIT(11)
+#define RTIT_CTL_BRANCH_EN BIT(13)
+#define MSR_IA32_RTIT_STATUS 0x00000571
+#define RTIT_STATUS_CONTEXTEN BIT(1)
+#define RTIT_STATUS_TRIGGEREN BIT(2)
+#define RTIT_STATUS_ERROR BIT(4)
+#define RTIT_STATUS_STOPPED BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH 0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE 0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK 0x00000561
+
 #define MSR_MTRRfix64K_00000 0x00000250
 #define MSR_MTRRfix16K_80000 0x00000258
 #define MSR_MTRRfix16K_A0000 0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 80091ae54c2b..9bff68798836 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,7 +39,8 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o perf_event_intel_cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o
 
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
 				perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3f70538012e2..a62cf04dac8a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -646,6 +646,30 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		c->x86_capability[10] = eax;
 	}
 
+	/* Additional Intel-defined flags: level 0x0000000F */
+	if (c->cpuid_level >= 0x0000000F) {
+		u32 eax, ebx, ecx, edx;
+
+		/* QoS sub-leaf, EAX=0Fh, ECX=0 */
+		cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[11] = edx;
+		if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+			/* will be overridden if occupancy monitoring exists */
+			c->x86_cache_max_rmid = ebx;
+
+			/* QoS sub-leaf, EAX=0Fh, ECX=1 */
+			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_capability[12] = edx;
+			if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+				c->x86_cache_max_rmid = ecx;
+				c->x86_cache_occ_scale = ebx;
+			}
+		} else {
+			c->x86_cache_max_rmid = -1;
+			c->x86_cache_occ_scale = -1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	xlvl = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = xlvl;
@@ -834,6 +858,20 @@ static void generic_identify(struct cpuinfo_x86 *c)
 	detect_nopl(c);
 }
 
+static void x86_init_cache_qos(struct cpuinfo_x86 *c)
+{
+	/*
+	 * The heavy lifting of max_rmid and cache_occ_scale are handled
+	 * in get_cpu_cap(). Here we just set the max_rmid for the boot_cpu
+	 * in case CQM bits really aren't there in this CPU.
+	 */
+	if (c != &boot_cpu_data) {
+		boot_cpu_data.x86_cache_max_rmid =
+			min(boot_cpu_data.x86_cache_max_rmid,
+			    c->x86_cache_max_rmid);
+	}
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -923,6 +961,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	init_hypervisor(c);
 	x86_init_rdrand(c);
+	x86_init_cache_qos(c);
 
 	/*
	 * Clear/Set all flags overriden by options, need do it
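For reference, the CQM enumeration done in get_cpu_cap() above can also be reproduced from userspace. The following is an illustrative stand-alone sketch (not kernel code) that mirrors the same CPUID 0xF logic using GCC's <cpuid.h>; the EBX value of sub-leaf 1 is the upscaling factor that x86_cache_occ_scale stores, i.e. occupancy in bytes = raw counter * scale:

/* Illustrative sketch only - mirrors the CPUID.0xF enumeration above. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID.0xF:0 - EDX bit 1 = LLC QoS monitoring, EBX = max RMID */
	if (!__get_cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx) ||
	    !(edx & (1u << 1))) {
		printf("no LLC QoS (CQM) monitoring on this CPU\n");
		return 0;
	}
	printf("max RMID (leaf 0xF:0, EBX): %u\n", ebx);

	/* CPUID.0xF:1 - EDX bit 0 = LLC occupancy, ECX = max RMID, EBX = scale */
	if (__get_cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx) && (edx & 1u))
		printf("LLC occupancy monitoring: max RMID %u, scale %u bytes\n",
		       ecx, ebx);
	return 0;
}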
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 000000000000..1c338b0eba05
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,131 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+	TOPA_4K = 0,
+	TOPA_8K,
+	TOPA_16K,
+	TOPA_32K,
+	TOPA_64K,
+	TOPA_128K,
+	TOPA_256K,
+	TOPA_512K,
+	TOPA_1MB,
+	TOPA_2MB,
+	TOPA_4MB,
+	TOPA_8MB,
+	TOPA_16MB,
+	TOPA_32MB,
+	TOPA_64MB,
+	TOPA_128MB,
+	TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+	return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+	u64 end   : 1;
+	u64 rsvd0 : 1;
+	u64 intr  : 1;
+	u64 rsvd1 : 1;
+	u64 stop  : 1;
+	u64 rsvd2 : 1;
+	u64 size  : 4;
+	u64 rsvd3 : 2;
+	u64 base  : 36;
+	u64 rsvd4 : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+	struct pmu pmu;
+	u32 caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ *		cpu, depending on perf event configuration
+ * @cpu:	cpu for per-cpu allocation
+ * @tables:	list of ToPA tables in this buffer
+ * @first:	shorthand for first topa table
+ * @last:	shorthand for last topa table
+ * @cur:	current topa table
+ * @nr_pages:	buffer size in pages
+ * @cur_idx:	current output region's index within @cur table
+ * @output_off:	offset within the current output region
+ * @data_size:	running total of the amount of data in this buffer
+ * @lost:	if data was lost/truncated
+ * @head:	logical write offset inside the buffer
+ * @snapshot:	if this is for a snapshot/overwrite counter
+ * @stop_pos:	STOP topa entry in the buffer
+ * @intr_pos:	INT topa entry in the buffer
+ * @data_pages:	array of pages from perf
+ * @topa_index:	table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+	int			cpu;
+	struct list_head	tables;
+	struct topa		*first, *last, *cur;
+	unsigned int		cur_idx;
+	size_t			output_off;
+	unsigned long		nr_pages;
+	local_t			data_size;
+	local_t			lost;
+	local64_t		head;
+	bool			snapshot;
+	unsigned long		stop_pos, intr_pos;
+	void			**data_pages;
+	struct topa_entry	*topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt context
+ * @handle:	perf output handle
+ * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
+ */
+struct pt {
+	struct perf_output_handle handle;
+	int			handle_nmi;
+};
+
+#endif /* __INTEL_PT_H__ */
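The ToPA size field above is a 4-bit power-of-two encoding starting at 4 KiB, which is all the sizes() helper does. A stand-alone sketch of the same mapping, with the enum values copied from the header above:

/* Illustrative sketch only - same formula as the sizes() helper above. */
#include <stdio.h>

enum topa_sz { TOPA_4K = 0, TOPA_2MB = 9, TOPA_128MB = 15 };

static unsigned int sizes(enum topa_sz tsz)
{
	return 1u << (tsz + 12);	/* 4K << encoding */
}

int main(void)
{
	printf("TOPA_4K    -> %u bytes\n", sizes(TOPA_4K));	/* 4096 */
	printf("TOPA_2MB   -> %u bytes\n", sizes(TOPA_2MB));	/* 2097152 */
	printf("TOPA_128MB -> %u bytes\n", sizes(TOPA_128MB));	/* 134217728 */
	return 0;
}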
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e2888a3ad1e3..87848ebe2bb7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -263,6 +263,14 @@ static void hw_perf_event_destroy(struct perf_event *event)
 	}
 }
 
+void hw_perf_lbr_event_destroy(struct perf_event *event)
+{
+	hw_perf_event_destroy(event);
+
+	/* undo the lbr/bts event accounting */
+	x86_del_exclusive(x86_lbr_exclusive_lbr);
+}
+
 static inline int x86_pmu_initialized(void)
 {
 	return x86_pmu.handle_irq != NULL;
@@ -302,6 +310,35 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+/*
+ * Check if we can create event of a certain type (that no conflicting events
+ * are present).
+ */
+int x86_add_exclusive(unsigned int what)
+{
+	int ret = -EBUSY, i;
+
+	if (atomic_inc_not_zero(&x86_pmu.lbr_exclusive[what]))
+		return 0;
+
+	mutex_lock(&pmc_reserve_mutex);
+	for (i = 0; i < ARRAY_SIZE(x86_pmu.lbr_exclusive); i++)
+		if (i != what && atomic_read(&x86_pmu.lbr_exclusive[i]))
+			goto out;
+
+	atomic_inc(&x86_pmu.lbr_exclusive[what]);
+	ret = 0;
+
+out:
+	mutex_unlock(&pmc_reserve_mutex);
+	return ret;
+}
+
+void x86_del_exclusive(unsigned int what)
+{
+	atomic_dec(&x86_pmu.lbr_exclusive[what]);
+}
+
 int x86_setup_perfctr(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
@@ -346,6 +383,12 @@ int x86_setup_perfctr(struct perf_event *event)
 		/* BTS is currently only allowed for user-mode. */
 		if (!attr->exclude_kernel)
 			return -EOPNOTSUPP;
+
+		/* disallow bts if conflicting events are present */
+		if (x86_add_exclusive(x86_lbr_exclusive_lbr))
+			return -EBUSY;
+
+		event->destroy = hw_perf_lbr_event_destroy;
 	}
 
 	hwc->config |= config;
@@ -399,39 +442,41 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
-		/*
-		 * check that PEBS LBR correction does not conflict with
-		 * whatever the user is asking with attr->branch_sample_type
-		 */
-		if (event->attr.precise_ip > 1 &&
-		    x86_pmu.intel_cap.pebs_format < 2) {
-			u64 *br_type = &event->attr.branch_sample_type;
-
-			if (has_branch_stack(event)) {
-				if (!precise_br_compat(event))
-					return -EOPNOTSUPP;
-
-				/* branch_sample_type is compatible */
-
-			} else {
-				/*
-				 * user did not specify branch_sample_type
-				 *
-				 * For PEBS fixups, we capture all
-				 * the branches at the priv level of the
-				 * event.
-				 */
-				*br_type = PERF_SAMPLE_BRANCH_ANY;
-
-				if (!event->attr.exclude_user)
-					*br_type |= PERF_SAMPLE_BRANCH_USER;
-
-				if (!event->attr.exclude_kernel)
-					*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
-			}
-		}
-	}
+	}
+	/*
+	 * check that PEBS LBR correction does not conflict with
+	 * whatever the user is asking with attr->branch_sample_type
+	 */
+	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+		u64 *br_type = &event->attr.branch_sample_type;
+
+		if (has_branch_stack(event)) {
+			if (!precise_br_compat(event))
+				return -EOPNOTSUPP;
+
+			/* branch_sample_type is compatible */
+
+		} else {
+			/*
+			 * user did not specify branch_sample_type
+			 *
+			 * For PEBS fixups, we capture all
+			 * the branches at the priv level of the
+			 * event.
+			 */
+			*br_type = PERF_SAMPLE_BRANCH_ANY;
+
+			if (!event->attr.exclude_user)
+				*br_type |= PERF_SAMPLE_BRANCH_USER;
+
+			if (!event->attr.exclude_kernel)
+				*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
+		}
+	}
 
+	if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+		event->attach_state |= PERF_ATTACH_TASK_DATA;
+
 	/*
 	 * Generate PMC IRQs:
 	 * (keep 'enabled' bit clear for now)
@@ -449,6 +494,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 	if (event->attr.type == PERF_TYPE_RAW)
 		event->hw.config |= event->attr.config & X86_RAW_EVENT_MASK;
 
+	if (event->attr.sample_period && x86_pmu.limit_period) {
+		if (x86_pmu.limit_period(event, event->attr.sample_period) >
+				event->attr.sample_period)
+			return -EINVAL;
+	}
+
 	return x86_setup_perfctr(event);
 }
 
@@ -728,14 +779,17 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	struct event_constraint *c;
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct perf_event *e;
-	int i, wmin, wmax, num = 0;
+	int i, wmin, wmax, unsched = 0;
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
 
+	if (x86_pmu.start_scheduling)
+		x86_pmu.start_scheduling(cpuc);
+
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
 		hwc = &cpuc->event_list[i]->hw;
-		c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
+		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
 		hwc->constraint = c;
 
 		wmin = min(wmin, c->weight);
@@ -768,24 +822,30 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 
 	/* slow path */
 	if (i != n)
-		num = perf_assign_events(cpuc->event_list, n, wmin,
+		unsched = perf_assign_events(cpuc->event_list, n, wmin,
 					  wmax, assign);
 
 	/*
-	 * Mark the event as committed, so we do not put_constraint()
-	 * in case new events are added and fail scheduling.
+	 * In case of success (unsched = 0), mark events as committed,
+	 * so we do not put_constraint() in case new events are added
+	 * and fail to be scheduled
+	 *
+	 * We invoke the lower level commit callback to lock the resource
+	 *
+	 * We do not need to do all of this in case we are called to
+	 * validate an event group (assign == NULL)
 	 */
-	if (!num && assign) {
+	if (!unsched && assign) {
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
 			e->hw.flags |= PERF_X86_EVENT_COMMITTED;
+			if (x86_pmu.commit_scheduling)
+				x86_pmu.commit_scheduling(cpuc, e, assign[i]);
 		}
 	}
-	/*
-	 * scheduling failed or is just a simulation,
-	 * free resources if necessary
-	 */
-	if (!assign || num) {
+
+	if (!assign || unsched) {
+
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
 			/*
@@ -795,11 +855,18 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
 				continue;
 
+			/*
+			 * release events that failed scheduling
+			 */
 			if (x86_pmu.put_event_constraints)
 				x86_pmu.put_event_constraints(cpuc, e);
 		}
 	}
-	return num ? -EINVAL : 0;
+
+	if (x86_pmu.stop_scheduling)
+		x86_pmu.stop_scheduling(cpuc);
+
+	return unsched ? -EINVAL : 0;
 }
 
 /*
@@ -986,6 +1053,9 @@ int x86_perf_event_set_period(struct perf_event *event)
 	if (left > x86_pmu.max_period)
 		left = x86_pmu.max_period;
 
+	if (x86_pmu.limit_period)
+		left = x86_pmu.limit_period(event, left);
+
 	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
 
 	/*
@@ -1033,7 +1103,6 @@ static int x86_pmu_add(struct perf_event *event, int flags)
 
 	hwc = &event->hw;
 
-	perf_pmu_disable(event->pmu);
 	n0 = cpuc->n_events;
 	ret = n = collect_events(cpuc, event, false);
 	if (ret < 0)
@@ -1071,7 +1140,6 @@ done_collect:
 
 	ret = 0;
 out:
-	perf_pmu_enable(event->pmu);
 	return ret;
 }
 
@@ -1103,7 +1171,7 @@ static void x86_pmu_start(struct perf_event *event, int flags)
 void perf_event_print_debug(void)
 {
 	u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left, fixed;
-	u64 pebs;
+	u64 pebs, debugctl;
 	struct cpu_hw_events *cpuc;
 	unsigned long flags;
 	int cpu, idx;
@@ -1121,14 +1189,20 @@ void perf_event_print_debug(void)
 		rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
 		rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
 		rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, fixed);
-		rdmsrl(MSR_IA32_PEBS_ENABLE, pebs);
 
 		pr_info("\n");
 		pr_info("CPU#%d: ctrl: %016llx\n", cpu, ctrl);
 		pr_info("CPU#%d: status: %016llx\n", cpu, status);
 		pr_info("CPU#%d: overflow: %016llx\n", cpu, overflow);
 		pr_info("CPU#%d: fixed: %016llx\n", cpu, fixed);
-		pr_info("CPU#%d: pebs: %016llx\n", cpu, pebs);
+		if (x86_pmu.pebs_constraints) {
+			rdmsrl(MSR_IA32_PEBS_ENABLE, pebs);
+			pr_info("CPU#%d: pebs: %016llx\n", cpu, pebs);
+		}
+		if (x86_pmu.lbr_nr) {
+			rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+			pr_info("CPU#%d: debugctl: %016llx\n", cpu, debugctl);
+		}
 	}
 	pr_info("CPU#%d: active: %016llx\n", cpu, *(u64 *)cpuc->active_mask);
 
@@ -1321,11 +1395,12 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
 {
 	unsigned int cpu = (long)hcpu;
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
-	int ret = NOTIFY_OK;
+	int i, ret = NOTIFY_OK;
 
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_UP_PREPARE:
-		cpuc->kfree_on_online = NULL;
+		for (i = 0 ; i < X86_PERF_KFREE_MAX; i++)
+			cpuc->kfree_on_online[i] = NULL;
 		if (x86_pmu.cpu_prepare)
 			ret = x86_pmu.cpu_prepare(cpu);
 		break;
@@ -1336,7 +1411,10 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
 		break;
 
 	case CPU_ONLINE:
-		kfree(cpuc->kfree_on_online);
+		for (i = 0 ; i < X86_PERF_KFREE_MAX; i++) {
+			kfree(cpuc->kfree_on_online[i]);
+			cpuc->kfree_on_online[i] = NULL;
+		}
 		break;
 
 	case CPU_DYING:
@@ -1712,7 +1790,7 @@ static int validate_event(struct perf_event *event)
 	if (IS_ERR(fake_cpuc))
 		return PTR_ERR(fake_cpuc);
 
-	c = x86_pmu.get_event_constraints(fake_cpuc, event);
+	c = x86_pmu.get_event_constraints(fake_cpuc, -1, event);
 
 	if (!c || !c->weight)
 		ret = -EINVAL;
@@ -1914,10 +1992,10 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
-static void x86_pmu_flush_branch_stack(void)
+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
+	if (x86_pmu.sched_task)
+		x86_pmu.sched_task(ctx, sched_in);
 }
 
 void perf_check_microcode(void)
@@ -1949,7 +2027,8 @@ static struct pmu pmu = {
 	.commit_txn = x86_pmu_commit_txn,
 
 	.event_idx = x86_pmu_event_idx,
-	.flush_branch_stack = x86_pmu_flush_branch_stack,
+	.sched_task = x86_pmu_sched_task,
+	.task_ctx_size = sizeof(struct x86_perf_task_context),
 };
 
 void arch_perf_update_userpage(struct perf_event *event,
@@ -1968,13 +2047,23 @@ void arch_perf_update_userpage(struct perf_event *event,
 
 	data = cyc2ns_read_begin();
 
+	/*
+	 * Internal timekeeping for enabled/running/stopped times
+	 * is always in the local_clock domain.
+	 */
 	userpg->cap_user_time = 1;
 	userpg->time_mult = data->cyc2ns_mul;
 	userpg->time_shift = data->cyc2ns_shift;
 	userpg->time_offset = data->cyc2ns_offset - now;
 
-	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = data->cyc2ns_offset;
+	/*
+	 * cap_user_time_zero doesn't make sense when we're using a different
+	 * time base for the records.
+	 */
+	if (event->clock == &local_clock) {
+		userpg->cap_user_time_zero = 1;
+		userpg->time_zero = data->cyc2ns_offset;
+	}
 
 	cyc2ns_read_end(data);
 }
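The arch_perf_update_userpage() change above means time_zero is only advertised when the event uses the default local clock; an event opened with a different clockid loses cap_user_time_zero. A hedged userspace sketch (again illustrative, not part of this merge) that observes this through the mmap'ed control page:

/* Illustrative sketch only - reads cap_user_time_zero from the mmap page. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void report(int use_clockid)
{
	struct perf_event_attr attr;
	struct perf_event_mmap_page *pg;
	long page = sysconf(_SC_PAGESIZE);
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.use_clockid = use_clockid;
	attr.clockid = CLOCK_MONOTONIC_RAW;

	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return;
	}
	/* map only the control page to read the cap_* bits */
	pg = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (pg != MAP_FAILED) {
		printf("use_clockid=%d: cap_user_time=%u cap_user_time_zero=%u\n",
		       use_clockid, (unsigned)pg->cap_user_time,
		       (unsigned)pg->cap_user_time_zero);
		munmap(pg, page);
	}
	close(fd);
}

int main(void)
{
	report(0);	/* default clock: time_zero expected to be advertised */
	report(1);	/* CLOCK_MONOTONIC_RAW: cap_user_time_zero expected 0 */
	return 0;
}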
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index df525d2be1e8..329f0356ad4a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -71,6 +71,8 @@ struct event_constraint {
 #define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
 #define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
 #define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x80 /* dynamic alloc'd constraint */
 #define PERF_X86_EVENT_RDPMC_ALLOWED 0x40 /* grant rdpmc permission */
 
 
@@ -123,8 +125,37 @@ struct intel_shared_regs {
 	unsigned core_id; /* per-core: core id */
 };
 
+enum intel_excl_state_type {
+	INTEL_EXCL_UNUSED = 0, /* counter is unused */
+	INTEL_EXCL_SHARED = 1, /* counter can be used by both threads */
+	INTEL_EXCL_EXCLUSIVE = 2, /* counter can be used by one thread only */
+};
+
+struct intel_excl_states {
+	enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
+	enum intel_excl_state_type state[X86_PMC_IDX_MAX];
+	int num_alloc_cntrs;/* #counters allocated */
+	int max_alloc_cntrs;/* max #counters allowed */
+	bool sched_started; /* true if scheduling has started */
+};
+
+struct intel_excl_cntrs {
+	raw_spinlock_t lock;
+
+	struct intel_excl_states states[2];
+
+	int refcnt; /* per-core: #HT threads */
+	unsigned core_id; /* per-core: core id */
+};
+
 #define MAX_LBR_ENTRIES 16
 
+enum {
+	X86_PERF_KFREE_SHARED = 0,
+	X86_PERF_KFREE_EXCL = 1,
+	X86_PERF_KFREE_MAX
+};
+
 struct cpu_hw_events {
 	/*
	 * Generic x86 PMC bits
@@ -179,6 +210,12 @@ struct cpu_hw_events {
	 * used on Intel NHM/WSM/SNB
	 */
 	struct intel_shared_regs *shared_regs;
+	/*
+	 * manage exclusive counter access between hyperthread
+	 */
+	struct event_constraint *constraint_list; /* in enable order */
+	struct intel_excl_cntrs *excl_cntrs;
+	int excl_thread_id; /* 0 or 1 */
 
 	/*
	 * AMD specific bits
@@ -187,7 +224,7 @@ struct cpu_hw_events {
 	/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
 	u64 perf_ctr_virt_mask;
 
-	void *kfree_on_online;
+	void *kfree_on_online[X86_PERF_KFREE_MAX];
 };
 
 #define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
@@ -202,6 +239,10 @@ struct cpu_hw_events {
 #define EVENT_CONSTRAINT(c, n, m) \
 	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
 
+#define INTEL_EXCLEVT_CONSTRAINT(c, n) \
+	__EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
+			   0, PERF_X86_EVENT_EXCL)
+
 /*
  * The overlap flag marks event constraints with overlapping counter
 * masks. This is the case if the counter mask of such an event is not
@@ -259,6 +300,10 @@ struct cpu_hw_events {
 #define INTEL_FLAGS_UEVENT_CONSTRAINT(c, n) \
 	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
 
+#define INTEL_EXCLUEVT_CONSTRAINT(c, n) \
+	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+			   HWEIGHT(n), 0, PERF_X86_EVENT_EXCL)
+
 #define INTEL_PLD_CONSTRAINT(c, n) \
 	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
 			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
@@ -283,22 +328,40 @@ struct cpu_hw_events {
 
 /* Check flags and event code, and set the HSW load flag */
 #define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(code, n) \
 	__EVENT_CONSTRAINT(code, n, \
			   ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
 
+#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(code, n) \
+	__EVENT_CONSTRAINT(code, n, \
+			   ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
+			   HWEIGHT(n), 0, \
+			   PERF_X86_EVENT_PEBS_LD_HSW|PERF_X86_EVENT_EXCL)
+
 /* Check flags and event code/umask, and set the HSW store flag */
 #define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(code, n) \
 	__EVENT_CONSTRAINT(code, n, \
			   INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_ST_HSW)
 
+#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(code, n) \
+	__EVENT_CONSTRAINT(code, n, \
+			   INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
+			   HWEIGHT(n), 0, \
+			   PERF_X86_EVENT_PEBS_ST_HSW|PERF_X86_EVENT_EXCL)
+
 /* Check flags and event code/umask, and set the HSW load flag */
 #define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(code, n) \
 	__EVENT_CONSTRAINT(code, n, \
			   INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
 
+#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(code, n) \
+	__EVENT_CONSTRAINT(code, n, \
+			   INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
+			   HWEIGHT(n), 0, \
+			   PERF_X86_EVENT_PEBS_LD_HSW|PERF_X86_EVENT_EXCL)
+
 /* Check flags and event code/umask, and set the HSW N/A flag */
 #define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(code, n) \
 	__EVENT_CONSTRAINT(code, n, \
@@ -408,6 +471,13 @@ union x86_pmu_config {
 
 #define X86_CONFIG(args...) ((union x86_pmu_config){.bits = {args}}).value
 
+enum {
+	x86_lbr_exclusive_lbr,
+	x86_lbr_exclusive_bts,
+	x86_lbr_exclusive_pt,
+	x86_lbr_exclusive_max,
+};
+
 /*
  * struct x86_pmu - generic x86 pmu
 */
@@ -443,14 +513,25 @@ struct x86_pmu {
 	u64 max_period;
 	struct event_constraint *
 		(*get_event_constraints)(struct cpu_hw_events *cpuc,
+					 int idx,
 					 struct perf_event *event);
 
 	void (*put_event_constraints)(struct cpu_hw_events *cpuc,
 				      struct perf_event *event);
+
+	void (*commit_scheduling)(struct cpu_hw_events *cpuc,
+				  struct perf_event *event,
+				  int cntr);
+
+	void (*start_scheduling)(struct cpu_hw_events *cpuc);
+
+	void (*stop_scheduling)(struct cpu_hw_events *cpuc);
+
 	struct event_constraint *event_constraints;
 	struct x86_pmu_quirk *quirks;
 	int perfctr_second_write;
 	bool late_ack;
+	unsigned (*limit_period)(struct perf_event *event, unsigned l);
 
 	/*
	 * sysfs attrs
@@ -472,7 +553,8 @@ struct x86_pmu {
 	void (*cpu_dead)(int cpu);
 
 	void (*check_microcode)(void);
-	void (*flush_branch_stack)(void);
+	void (*sched_task)(struct perf_event_context *ctx,
+			   bool sched_in);
 
 	/*
	 * Intel Arch Perfmon v2+
@@ -504,10 +586,15 @@ struct x86_pmu {
 	bool lbr_double_abort; /* duplicated lbr aborts */
 
 	/*
+	 * Intel PT/LBR/BTS are exclusive
+	 */
+	atomic_t lbr_exclusive[x86_lbr_exclusive_max];
+
+	/*
	 * Extra registers for events
	 */
 	struct extra_reg *extra_regs;
-	unsigned int er_flags;
+	unsigned int flags;
 
 	/*
	 * Intel host/guest support (KVM)
@@ -515,6 +602,13 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+struct x86_perf_task_context {
+	u64 lbr_from[MAX_LBR_ENTRIES];
+	u64 lbr_to[MAX_LBR_ENTRIES];
+	int lbr_callstack_users;
+	int lbr_stack_state;
+};
+
 #define x86_add_quirk(func_) \
 do { \
 	static struct x86_pmu_quirk __quirk __initdata = { \
@@ -524,8 +618,13 @@ do { \
 	x86_pmu.quirks = &__quirk; \
 } while (0)
 
-#define ERF_NO_HT_SHARING 1
-#define ERF_HAS_RSP_1 2
+/*
+ * x86_pmu flags
+ */
+#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
+#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
+#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */
+#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */
 
 #define EVENT_VAR(_id) event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -546,6 +645,12 @@ static struct perf_pmu_events_attr event_attr_##v = { \
 
 extern struct x86_pmu x86_pmu __read_mostly;
 
+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+	return x86_pmu.lbr_sel_map &&
+		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 int x86_perf_event_set_period(struct perf_event *event);
@@ -588,6 +693,12 @@ static inline int x86_pmu_rdpmc_index(int index)
 	return x86_pmu.rdpmc_index ? x86_pmu.rdpmc_index(index) : index;
 }
 
+int x86_add_exclusive(unsigned int what);
+
+void x86_del_exclusive(unsigned int what);
+
+void hw_perf_lbr_event_destroy(struct perf_event *event);
+
 int x86_setup_perfctr(struct perf_event *event);
 
 int x86_pmu_hw_config(struct perf_event *event);
@@ -674,10 +785,34 @@ static inline int amd_pmu_init(void)
 
 #ifdef CONFIG_CPU_SUP_INTEL
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
+	    x86_pmu.intel_cap.pebs_format < 2)
+		return true;
+
+	return false;
+}
+
+static inline bool intel_pmu_has_bts(struct perf_event *event)
+{
+	if (event->attr.config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS &&
+	    !event->attr.freq && event->hw.sample_period == 1)
+		return true;
+
+	return false;
+}
+
 int intel_pmu_save_and_restart(struct perf_event *event);
 
 struct event_constraint *
-x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event);
+x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event);
 
 struct intel_shared_regs *allocate_shared_regs(int cpu);
 
@@ -727,13 +862,15 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_pmu_lbr_reset(void);
 
 void intel_pmu_lbr_enable(struct perf_event *event);
 
 void intel_pmu_lbr_disable(struct perf_event *event);
 
-void intel_pmu_lbr_enable_all(void);
+void intel_pmu_lbr_enable_all(bool pmi);
 
 void intel_pmu_lbr_disable_all(void);
 
@@ -747,8 +884,18 @@ void intel_pmu_lbr_init_atom(void);
 
 void intel_pmu_lbr_init_snb(void);
 
+void intel_pmu_lbr_init_hsw(void);
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+void intel_pt_interrupt(void);
+
+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
@@ -758,6 +905,10 @@ int knc_pmu_init(void);
 ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 			  char *page);
 
+static inline int is_ht_workaround_enabled(void)
+{
+	return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
+}
 #else /* CONFIG_CPU_SUP_INTEL */
 
 static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 28926311aac1..1cee5d2d7ece 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -382,6 +382,7 @@ static int amd_pmu_cpu_prepare(int cpu)
 static void amd_pmu_cpu_starting(int cpu)
 {
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+	void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
 	struct amd_nb *nb;
 	int i, nb_id;
 
@@ -399,7 +400,7 @@ static void amd_pmu_cpu_starting(int cpu)
 			continue;
 
 		if (nb->nb_id == nb_id) {
-			cpuc->kfree_on_online = cpuc->amd_nb;
+			*onln = cpuc->amd_nb;
 			cpuc->amd_nb = nb;
 			break;
 		}
@@ -429,7 +430,8 @@ static void amd_pmu_cpu_dead(int cpu)
 }
 
 static struct event_constraint *
-amd_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
 {
 	/*
	 * if not NB event or no NB, then no constraints
@@ -537,7 +539,8 @@ static struct event_constraint amd_f15_PMC50 = EVENT_CONSTRAINT(0, 0x3F, 0);
 static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);
 
 static struct event_constraint *
-amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event)
+amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx,
+			       struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int event_code = amd_get_event_code(hwc);
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index a61f5c6911da..989d3c215d2b 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -796,7 +796,7 @@ static int setup_ibs_ctl(int ibs_eilvt_off)
 * the IBS interrupt vector is handled by perf_ibs_cpu_notifier that
 * is using the new offset.
 */
-static int force_ibs_eilvt_setup(void)
+static void force_ibs_eilvt_setup(void)
 {
 	int offset;
 	int ret;
@@ -811,26 +811,24 @@ static int force_ibs_eilvt_setup(void)
 
 	if (offset == APIC_EILVT_NR_MAX) {
 		printk(KERN_DEBUG "No EILVT entry available\n");
-		return -EBUSY;
+		return;
 	}
 
 	ret = setup_ibs_ctl(offset);
 	if (ret)
 		goto out;
 
-	if (!ibs_eilvt_valid()) {
-		ret = -EFAULT;
+	if (!ibs_eilvt_valid())
 		goto out;
-	}
 
 	pr_info("IBS: LVT offset %d assigned\n", offset);
 
-	return 0;
+	return;
 out:
 	preempt_disable();
 	put_eilvt(offset);
 	preempt_enable();
-	return ret;
+	return;
 }
 
 static void ibs_eilvt_setup(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 258990688a5e..9da2400c2ec3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -12,6 +12,7 @@
12#include <linux/init.h> 12#include <linux/init.h>
13#include <linux/slab.h> 13#include <linux/slab.h>
14#include <linux/export.h> 14#include <linux/export.h>
15#include <linux/watchdog.h>
15 16
16#include <asm/cpufeature.h> 17#include <asm/cpufeature.h>
17#include <asm/hardirq.h> 18#include <asm/hardirq.h>
@@ -113,6 +114,12 @@ static struct event_constraint intel_snb_event_constraints[] __read_mostly =
113 INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.LOAD_LATENCY */ 114 INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.LOAD_LATENCY */
114 INTEL_UEVENT_CONSTRAINT(0x04a3, 0xf), /* CYCLE_ACTIVITY.CYCLES_NO_DISPATCH */ 115 INTEL_UEVENT_CONSTRAINT(0x04a3, 0xf), /* CYCLE_ACTIVITY.CYCLES_NO_DISPATCH */
115 INTEL_UEVENT_CONSTRAINT(0x02a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */ 116 INTEL_UEVENT_CONSTRAINT(0x02a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
117
118 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
119 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
120 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
121 INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
122
116 EVENT_CONSTRAINT_END 123 EVENT_CONSTRAINT_END
117}; 124};
118 125
@@ -131,15 +138,12 @@ static struct event_constraint intel_ivb_event_constraints[] __read_mostly =
 	INTEL_UEVENT_CONSTRAINT(0x08a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
 	INTEL_UEVENT_CONSTRAINT(0x0ca3, 0x4), /* CYCLE_ACTIVITY.STALLS_L1D_PENDING */
 	INTEL_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PREC_DIST */
-	/*
-	 * Errata BV98 -- MEM_*_RETIRED events can leak between counters of SMT
-	 * siblings; disable these events because they can corrupt unrelated
-	 * counters.
-	 */
-	INTEL_EVENT_CONSTRAINT(0xd0, 0x0), /* MEM_UOPS_RETIRED.* */
-	INTEL_EVENT_CONSTRAINT(0xd1, 0x0), /* MEM_LOAD_UOPS_RETIRED.* */
-	INTEL_EVENT_CONSTRAINT(0xd2, 0x0), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
-	INTEL_EVENT_CONSTRAINT(0xd3, 0x0), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
+	INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),	/* MEM_UOPS_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),	/* MEM_LOAD_UOPS_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),	/* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf),	/* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
 	EVENT_CONSTRAINT_END
 };
 
@@ -217,6 +221,21 @@ static struct event_constraint intel_hsw_event_constraints[] = {
 	INTEL_UEVENT_CONSTRAINT(0x0ca3, 0x4),
 	/* CYCLE_ACTIVITY.CYCLES_NO_EXECUTE */
 	INTEL_UEVENT_CONSTRAINT(0x04a3, 0xf),
+
+	INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),	/* MEM_UOPS_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),	/* MEM_LOAD_UOPS_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),	/* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+	INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf),	/* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
+	EVENT_CONSTRAINT_END
+};
+
+struct event_constraint intel_bdw_event_constraints[] = {
+	FIXED_EVENT_CONSTRAINT(0x00c0, 0),	/* INST_RETIRED.ANY */
+	FIXED_EVENT_CONSTRAINT(0x003c, 1),	/* CPU_CLK_UNHALTED.CORE */
+	FIXED_EVENT_CONSTRAINT(0x0300, 2),	/* CPU_CLK_UNHALTED.REF */
+	INTEL_UEVENT_CONSTRAINT(0x148, 0x4),	/* L1D_PEND_MISS.PENDING */
+	INTEL_EVENT_CONSTRAINT(0xa3, 0x4),	/* CYCLE_ACTIVITY.* */
 	EVENT_CONSTRAINT_END
 };
 
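The table changes above read more clearly side by side: the old IvyBridge workaround for erratum BV98 gave the MEM_* events an empty counter mask, so they could never be scheduled at all, while the new INTEL_EXCLEVT_CONSTRAINT entries allow counters 0-3 but tag the events as exclusive across HT siblings, which the scheduling code added later in this patch enforces. A minimal standalone sketch of that difference (an illustrative struct and a made-up FLAG_EXCL stand-in for PERF_X86_EVENT_EXCL, not the kernel's event_constraint):

#include <stdio.h>

#define FLAG_EXCL	0x1		/* stand-in for PERF_X86_EVENT_EXCL */

struct constraint {			/* illustrative only */
	unsigned event;			/* event code, e.g. 0xd1 */
	unsigned cntmsk;		/* allowed generic counters */
	unsigned flags;
};

int main(void)
{
	struct constraint old_wa = { 0xd1, 0x0, 0 };	      /* BV98: never schedulable */
	struct constraint new_wa = { 0xd1, 0xf, FLAG_EXCL };  /* counters 0-3, HT-exclusive */

	printf("old: event 0x%x on counter mask 0x%x\n", old_wa.event, old_wa.cntmsk);
	printf("new: event 0x%x on counter mask 0x%x%s\n", new_wa.event, new_wa.cntmsk,
	       (new_wa.flags & FLAG_EXCL) ? " (exclusive across siblings)" : "");
	return 0;
}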
@@ -415,6 +434,202 @@ static __initconst const u64 snb_hw_cache_event_ids
415 434
416}; 435};
417 436
437/*
438 * Notes on the events:
439 * - data reads do not include code reads (comparable to earlier tables)
440 * - data counts include speculative execution (except L1 write, dtlb, bpu)
441 * - remote node access includes remote memory, remote cache, remote mmio.
442 * - prefetches are not included in the counts because they are not
443 * reliably counted.
444 */
445
446#define HSW_DEMAND_DATA_RD BIT_ULL(0)
447#define HSW_DEMAND_RFO BIT_ULL(1)
448#define HSW_ANY_RESPONSE BIT_ULL(16)
449#define HSW_SUPPLIER_NONE BIT_ULL(17)
450#define HSW_L3_MISS_LOCAL_DRAM BIT_ULL(22)
451#define HSW_L3_MISS_REMOTE_HOP0 BIT_ULL(27)
452#define HSW_L3_MISS_REMOTE_HOP1 BIT_ULL(28)
453#define HSW_L3_MISS_REMOTE_HOP2P BIT_ULL(29)
454#define HSW_L3_MISS (HSW_L3_MISS_LOCAL_DRAM| \
455 HSW_L3_MISS_REMOTE_HOP0|HSW_L3_MISS_REMOTE_HOP1| \
456 HSW_L3_MISS_REMOTE_HOP2P)
457#define HSW_SNOOP_NONE BIT_ULL(31)
458#define HSW_SNOOP_NOT_NEEDED BIT_ULL(32)
459#define HSW_SNOOP_MISS BIT_ULL(33)
460#define HSW_SNOOP_HIT_NO_FWD BIT_ULL(34)
461#define HSW_SNOOP_HIT_WITH_FWD BIT_ULL(35)
462#define HSW_SNOOP_HITM BIT_ULL(36)
463#define HSW_SNOOP_NON_DRAM BIT_ULL(37)
464#define HSW_ANY_SNOOP (HSW_SNOOP_NONE| \
465 HSW_SNOOP_NOT_NEEDED|HSW_SNOOP_MISS| \
466 HSW_SNOOP_HIT_NO_FWD|HSW_SNOOP_HIT_WITH_FWD| \
467 HSW_SNOOP_HITM|HSW_SNOOP_NON_DRAM)
468#define HSW_SNOOP_DRAM (HSW_ANY_SNOOP & ~HSW_SNOOP_NON_DRAM)
469#define HSW_DEMAND_READ HSW_DEMAND_DATA_RD
470#define HSW_DEMAND_WRITE HSW_DEMAND_RFO
471#define HSW_L3_MISS_REMOTE (HSW_L3_MISS_REMOTE_HOP0|\
472 HSW_L3_MISS_REMOTE_HOP1|HSW_L3_MISS_REMOTE_HOP2P)
473#define HSW_LLC_ACCESS HSW_ANY_RESPONSE
474
475#define BDW_L3_MISS_LOCAL BIT(26)
476#define BDW_L3_MISS (BDW_L3_MISS_LOCAL| \
477 HSW_L3_MISS_REMOTE_HOP0|HSW_L3_MISS_REMOTE_HOP1| \
478 HSW_L3_MISS_REMOTE_HOP2P)
479
480
481static __initconst const u64 hsw_hw_cache_event_ids
482 [PERF_COUNT_HW_CACHE_MAX]
483 [PERF_COUNT_HW_CACHE_OP_MAX]
484 [PERF_COUNT_HW_CACHE_RESULT_MAX] =
485{
486 [ C(L1D ) ] = {
487 [ C(OP_READ) ] = {
488 [ C(RESULT_ACCESS) ] = 0x81d0, /* MEM_UOPS_RETIRED.ALL_LOADS */
489 [ C(RESULT_MISS) ] = 0x151, /* L1D.REPLACEMENT */
490 },
491 [ C(OP_WRITE) ] = {
492 [ C(RESULT_ACCESS) ] = 0x82d0, /* MEM_UOPS_RETIRED.ALL_STORES */
493 [ C(RESULT_MISS) ] = 0x0,
494 },
495 [ C(OP_PREFETCH) ] = {
496 [ C(RESULT_ACCESS) ] = 0x0,
497 [ C(RESULT_MISS) ] = 0x0,
498 },
499 },
500 [ C(L1I ) ] = {
501 [ C(OP_READ) ] = {
502 [ C(RESULT_ACCESS) ] = 0x0,
503 [ C(RESULT_MISS) ] = 0x280, /* ICACHE.MISSES */
504 },
505 [ C(OP_WRITE) ] = {
506 [ C(RESULT_ACCESS) ] = -1,
507 [ C(RESULT_MISS) ] = -1,
508 },
509 [ C(OP_PREFETCH) ] = {
510 [ C(RESULT_ACCESS) ] = 0x0,
511 [ C(RESULT_MISS) ] = 0x0,
512 },
513 },
514 [ C(LL ) ] = {
515 [ C(OP_READ) ] = {
516 [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
517 [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
518 },
519 [ C(OP_WRITE) ] = {
520 [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
521 [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
522 },
523 [ C(OP_PREFETCH) ] = {
524 [ C(RESULT_ACCESS) ] = 0x0,
525 [ C(RESULT_MISS) ] = 0x0,
526 },
527 },
528 [ C(DTLB) ] = {
529 [ C(OP_READ) ] = {
530 [ C(RESULT_ACCESS) ] = 0x81d0, /* MEM_UOPS_RETIRED.ALL_LOADS */
531 [ C(RESULT_MISS) ] = 0x108, /* DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK */
532 },
533 [ C(OP_WRITE) ] = {
534 [ C(RESULT_ACCESS) ] = 0x82d0, /* MEM_UOPS_RETIRED.ALL_STORES */
535 [ C(RESULT_MISS) ] = 0x149, /* DTLB_STORE_MISSES.MISS_CAUSES_A_WALK */
536 },
537 [ C(OP_PREFETCH) ] = {
538 [ C(RESULT_ACCESS) ] = 0x0,
539 [ C(RESULT_MISS) ] = 0x0,
540 },
541 },
542 [ C(ITLB) ] = {
543 [ C(OP_READ) ] = {
544 [ C(RESULT_ACCESS) ] = 0x6085, /* ITLB_MISSES.STLB_HIT */
545 [ C(RESULT_MISS) ] = 0x185, /* ITLB_MISSES.MISS_CAUSES_A_WALK */
546 },
547 [ C(OP_WRITE) ] = {
548 [ C(RESULT_ACCESS) ] = -1,
549 [ C(RESULT_MISS) ] = -1,
550 },
551 [ C(OP_PREFETCH) ] = {
552 [ C(RESULT_ACCESS) ] = -1,
553 [ C(RESULT_MISS) ] = -1,
554 },
555 },
556 [ C(BPU ) ] = {
557 [ C(OP_READ) ] = {
558 [ C(RESULT_ACCESS) ] = 0xc4, /* BR_INST_RETIRED.ALL_BRANCHES */
559 [ C(RESULT_MISS) ] = 0xc5, /* BR_MISP_RETIRED.ALL_BRANCHES */
560 },
561 [ C(OP_WRITE) ] = {
562 [ C(RESULT_ACCESS) ] = -1,
563 [ C(RESULT_MISS) ] = -1,
564 },
565 [ C(OP_PREFETCH) ] = {
566 [ C(RESULT_ACCESS) ] = -1,
567 [ C(RESULT_MISS) ] = -1,
568 },
569 },
570 [ C(NODE) ] = {
571 [ C(OP_READ) ] = {
572 [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
573 [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
574 },
575 [ C(OP_WRITE) ] = {
576 [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
577 [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
578 },
579 [ C(OP_PREFETCH) ] = {
580 [ C(RESULT_ACCESS) ] = 0x0,
581 [ C(RESULT_MISS) ] = 0x0,
582 },
583 },
584};
585
586static __initconst const u64 hsw_hw_cache_extra_regs
587 [PERF_COUNT_HW_CACHE_MAX]
588 [PERF_COUNT_HW_CACHE_OP_MAX]
589 [PERF_COUNT_HW_CACHE_RESULT_MAX] =
590{
591 [ C(LL ) ] = {
592 [ C(OP_READ) ] = {
593 [ C(RESULT_ACCESS) ] = HSW_DEMAND_READ|
594 HSW_LLC_ACCESS,
595 [ C(RESULT_MISS) ] = HSW_DEMAND_READ|
596 HSW_L3_MISS|HSW_ANY_SNOOP,
597 },
598 [ C(OP_WRITE) ] = {
599 [ C(RESULT_ACCESS) ] = HSW_DEMAND_WRITE|
600 HSW_LLC_ACCESS,
601 [ C(RESULT_MISS) ] = HSW_DEMAND_WRITE|
602 HSW_L3_MISS|HSW_ANY_SNOOP,
603 },
604 [ C(OP_PREFETCH) ] = {
605 [ C(RESULT_ACCESS) ] = 0x0,
606 [ C(RESULT_MISS) ] = 0x0,
607 },
608 },
609 [ C(NODE) ] = {
610 [ C(OP_READ) ] = {
611 [ C(RESULT_ACCESS) ] = HSW_DEMAND_READ|
612 HSW_L3_MISS_LOCAL_DRAM|
613 HSW_SNOOP_DRAM,
614 [ C(RESULT_MISS) ] = HSW_DEMAND_READ|
615 HSW_L3_MISS_REMOTE|
616 HSW_SNOOP_DRAM,
617 },
618 [ C(OP_WRITE) ] = {
619 [ C(RESULT_ACCESS) ] = HSW_DEMAND_WRITE|
620 HSW_L3_MISS_LOCAL_DRAM|
621 HSW_SNOOP_DRAM,
622 [ C(RESULT_MISS) ] = HSW_DEMAND_WRITE|
623 HSW_L3_MISS_REMOTE|
624 HSW_SNOOP_DRAM,
625 },
626 [ C(OP_PREFETCH) ] = {
627 [ C(RESULT_ACCESS) ] = 0x0,
628 [ C(RESULT_MISS) ] = 0x0,
629 },
630 },
631};
632
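The OFFCORE_RESPONSE encodings in the two tables above are plain ORs of the bit definitions that precede them, and the resulting 64-bit values are what ends up in the extra register exposed as config1 (see the offcore_rsp format attribute later in this file). A small standalone check that recomputes the LL OP_READ entries from the same bit positions (user-space illustration only, not kernel code):

#include <stdio.h>
#include <stdint.h>

#define BIT_ULL(n)			(1ULL << (n))

#define HSW_DEMAND_DATA_RD		BIT_ULL(0)
#define HSW_ANY_RESPONSE		BIT_ULL(16)
#define HSW_L3_MISS_LOCAL_DRAM		BIT_ULL(22)
#define HSW_L3_MISS_REMOTE_HOP0		BIT_ULL(27)
#define HSW_L3_MISS_REMOTE_HOP1		BIT_ULL(28)
#define HSW_L3_MISS_REMOTE_HOP2P	BIT_ULL(29)
#define HSW_L3_MISS			(HSW_L3_MISS_LOCAL_DRAM | HSW_L3_MISS_REMOTE_HOP0 | \
					 HSW_L3_MISS_REMOTE_HOP1 | HSW_L3_MISS_REMOTE_HOP2P)
#define HSW_SNOOP_NON_DRAM		BIT_ULL(37)
#define HSW_ANY_SNOOP			(0x7fULL << 31)		/* all seven snoop bits, 31..37 */
#define HSW_SNOOP_DRAM			(HSW_ANY_SNOOP & ~HSW_SNOOP_NON_DRAM)

int main(void)
{
	uint64_t access = HSW_DEMAND_DATA_RD | HSW_ANY_RESPONSE;
	uint64_t miss   = HSW_DEMAND_DATA_RD | HSW_L3_MISS | HSW_ANY_SNOOP;

	printf("LL read access (config1): 0x%llx\n", (unsigned long long)access);
	printf("LL read miss   (config1): 0x%llx\n", (unsigned long long)miss);
	printf("snoop-to-DRAM mask:       0x%llx\n", (unsigned long long)HSW_SNOOP_DRAM);
	return 0;
}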
418static __initconst const u64 westmere_hw_cache_event_ids 633static __initconst const u64 westmere_hw_cache_event_ids
419 [PERF_COUNT_HW_CACHE_MAX] 634 [PERF_COUNT_HW_CACHE_MAX]
420 [PERF_COUNT_HW_CACHE_OP_MAX] 635 [PERF_COUNT_HW_CACHE_OP_MAX]
@@ -1029,21 +1244,10 @@ static __initconst const u64 slm_hw_cache_event_ids
1029 }, 1244 },
1030}; 1245};
 
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
-	/* user explicitly requested branch sampling */
-	if (has_branch_stack(event))
-		return true;
-
-	/* implicit branch sampling to correct PEBS skid */
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
-	    x86_pmu.intel_cap.pebs_format < 2)
-		return true;
-
-	return false;
-}
-
-static void intel_pmu_disable_all(void)
+/*
+ * Use from PMIs where the LBRs are already disabled.
+ */
+static void __intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
@@ -1051,17 +1255,24 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+	else
+		intel_bts_disable_local();
 
 	intel_pmu_pebs_disable_all();
+}
+
+static void intel_pmu_disable_all(void)
+{
+	__intel_pmu_disable_all();
 	intel_pmu_lbr_disable_all();
 }
 
-static void intel_pmu_enable_all(int added)
+static void __intel_pmu_enable_all(int added, bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	intel_pmu_pebs_enable_all();
-	intel_pmu_lbr_enable_all();
+	intel_pmu_lbr_enable_all(pmi);
 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
 			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask);
 
@@ -1073,7 +1284,13 @@ static void intel_pmu_enable_all(int added)
 			return;
 
 		intel_pmu_enable_bts(event->hw.config);
-	}
+	} else
+		intel_bts_enable_local();
+}
+
+static void intel_pmu_enable_all(int added)
+{
+	__intel_pmu_enable_all(added, false);
 }
 
1079/* 1296/*
@@ -1207,7 +1424,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
1207 * must disable before any actual event 1424 * must disable before any actual event
1208 * because any event may be combined with LBR 1425 * because any event may be combined with LBR
1209 */ 1426 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
1211 intel_pmu_lbr_disable(event); 1428 intel_pmu_lbr_disable(event);
1212 1429
1213 if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) { 1430 if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1268,7 +1485,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
1268 * must enabled before any actual event 1485 * must enabled before any actual event
1269 * because any event may be combined with LBR 1486 * because any event may be combined with LBR
1270 */ 1487 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
1272 intel_pmu_lbr_enable(event); 1489 intel_pmu_lbr_enable(event);
1273 1490
1274 if (event->attr.exclude_host) 1491 if (event->attr.exclude_host)
@@ -1334,6 +1551,18 @@ static void intel_pmu_reset(void)
1334 if (ds) 1551 if (ds)
1335 ds->bts_index = ds->bts_buffer_base; 1552 ds->bts_index = ds->bts_buffer_base;
1336 1553
1554 /* Ack all overflows and disable fixed counters */
1555 if (x86_pmu.version >= 2) {
1556 intel_pmu_ack_status(intel_pmu_get_status());
1557 wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
1558 }
1559
1560 /* Reset LBRs and LBR freezing */
1561 if (x86_pmu.lbr_nr) {
1562 update_debugctlmsr(get_debugctlmsr() &
1563 ~(DEBUGCTLMSR_FREEZE_LBRS_ON_PMI|DEBUGCTLMSR_LBR));
1564 }
1565
1337 local_irq_restore(flags); 1566 local_irq_restore(flags);
1338} 1567}
1339 1568
@@ -1357,8 +1586,9 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
1357 */ 1586 */
1358 if (!x86_pmu.late_ack) 1587 if (!x86_pmu.late_ack)
1359 apic_write(APIC_LVTPC, APIC_DM_NMI); 1588 apic_write(APIC_LVTPC, APIC_DM_NMI);
-	intel_pmu_disable_all();
+	__intel_pmu_disable_all();
 	handled = intel_pmu_drain_bts_buffer();
+	handled += intel_bts_interrupt();
1362 status = intel_pmu_get_status(); 1592 status = intel_pmu_get_status();
1363 if (!status) 1593 if (!status)
1364 goto done; 1594 goto done;
@@ -1399,6 +1629,14 @@ again:
1399 } 1629 }
1400 1630
1401 /* 1631 /*
1632 * Intel PT
1633 */
1634 if (__test_and_clear_bit(55, (unsigned long *)&status)) {
1635 handled++;
1636 intel_pt_interrupt();
1637 }
1638
1639 /*
1402 * Checkpointed counters can lead to 'spurious' PMIs because the 1640 * Checkpointed counters can lead to 'spurious' PMIs because the
1403 * rollback caused by the PMI will have cleared the overflow status 1641 * rollback caused by the PMI will have cleared the overflow status
1404 * bit. Therefore always force probe these counters. 1642 * bit. Therefore always force probe these counters.
@@ -1433,7 +1671,7 @@ again:
1433 goto again; 1671 goto again;
1434 1672
1435done: 1673done:
-	intel_pmu_enable_all(0);
+	__intel_pmu_enable_all(0, true);
1437 /* 1675 /*
1438 * Only unmask the NMI after the overflow counters 1676 * Only unmask the NMI after the overflow counters
1439 * have been reset. This avoids spurious NMIs on 1677 * have been reset. This avoids spurious NMIs on
@@ -1464,7 +1702,7 @@ intel_bts_constraints(struct perf_event *event)
1464 1702
1465static int intel_alt_er(int idx) 1703static int intel_alt_er(int idx)
1466{ 1704{
-	if (!(x86_pmu.er_flags & ERF_HAS_RSP_1))
+	if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
1468 return idx; 1706 return idx;
1469 1707
1470 if (idx == EXTRA_REG_RSP_0) 1708 if (idx == EXTRA_REG_RSP_0)
@@ -1624,7 +1862,8 @@ intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
1624} 1862}
1625 1863
1626struct event_constraint * 1864struct event_constraint *
-x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
1628{ 1867{
1629 struct event_constraint *c; 1868 struct event_constraint *c;
1630 1869
@@ -1641,7 +1880,8 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
1641} 1880}
1642 1881
1643static struct event_constraint * 1882static struct event_constraint *
-intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+__intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			      struct perf_event *event)
1645{ 1885{
1646 struct event_constraint *c; 1886 struct event_constraint *c;
1647 1887
@@ -1657,7 +1897,278 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event
1657 if (c) 1897 if (c)
1658 return c; 1898 return c;
1659 1899
-	return x86_get_event_constraints(cpuc, event);
+	return x86_get_event_constraints(cpuc, idx, event);
1901}
1902
1903static void
1904intel_start_scheduling(struct cpu_hw_events *cpuc)
1905{
1906 struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
1907 struct intel_excl_states *xl, *xlo;
1908 int tid = cpuc->excl_thread_id;
1909 int o_tid = 1 - tid; /* sibling thread */
1910
1911 /*
1912 * nothing needed if in group validation mode
1913 */
1914 if (cpuc->is_fake || !is_ht_workaround_enabled())
1915 return;
1916
1917 /*
1918 * no exclusion needed
1919 */
1920 if (!excl_cntrs)
1921 return;
1922
1923 xlo = &excl_cntrs->states[o_tid];
1924 xl = &excl_cntrs->states[tid];
1925
1926 xl->sched_started = true;
1927 xl->num_alloc_cntrs = 0;
1928 /*
1929 * lock shared state until we are done scheduling
1930 * in stop_event_scheduling()
1931 * makes scheduling appear as a transaction
1932 */
1933 WARN_ON_ONCE(!irqs_disabled());
1934 raw_spin_lock(&excl_cntrs->lock);
1935
1936 /*
1937 * save initial state of sibling thread
1938 */
1939 memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
1940}
1941
1942static void
1943intel_stop_scheduling(struct cpu_hw_events *cpuc)
1944{
1945 struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
1946 struct intel_excl_states *xl, *xlo;
1947 int tid = cpuc->excl_thread_id;
1948 int o_tid = 1 - tid; /* sibling thread */
1949
1950 /*
1951 * nothing needed if in group validation mode
1952 */
1953 if (cpuc->is_fake || !is_ht_workaround_enabled())
1954 return;
1955 /*
1956 * no exclusion needed
1957 */
1958 if (!excl_cntrs)
1959 return;
1960
1961 xlo = &excl_cntrs->states[o_tid];
1962 xl = &excl_cntrs->states[tid];
1963
1964 /*
1965 * make new sibling thread state visible
1966 */
1967 memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
1968
1969 xl->sched_started = false;
1970 /*
1971 * release shared state lock (acquired in intel_start_scheduling())
1972 */
1973 raw_spin_unlock(&excl_cntrs->lock);
1974}
1975
1976static struct event_constraint *
1977intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
1978 int idx, struct event_constraint *c)
1979{
1980 struct event_constraint *cx;
1981 struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
1982 struct intel_excl_states *xl, *xlo;
1983 int is_excl, i;
1984 int tid = cpuc->excl_thread_id;
1985 int o_tid = 1 - tid; /* alternate */
1986
1987 /*
1988 * validating a group does not require
1989 * enforcing cross-thread exclusion
1990 */
1991 if (cpuc->is_fake || !is_ht_workaround_enabled())
1992 return c;
1993
1994 /*
1995 * no exclusion needed
1996 */
1997 if (!excl_cntrs)
1998 return c;
1999 /*
2000 * event requires exclusive counter access
2001 * across HT threads
2002 */
2003 is_excl = c->flags & PERF_X86_EVENT_EXCL;
2004
2005 /*
2006 * xl = state of current HT
2007 * xlo = state of sibling HT
2008 */
2009 xl = &excl_cntrs->states[tid];
2010 xlo = &excl_cntrs->states[o_tid];
2011
2012 /*
2013 * do not allow scheduling of more than max_alloc_cntrs
2014 * which is set to half the available generic counters.
2015 * this helps avoid counter starvation of sibling thread
2016 * by ensuring at most half the counters cannot be in
2017	 * exclusive mode. There are no designated counters for the
2018	 * limit; any N/2 counters can be used. This helps with
2019	 * events that have specific counter constraints.
2020 */
2021 if (xl->num_alloc_cntrs++ == xl->max_alloc_cntrs)
2022 return &emptyconstraint;
2023
2024 cx = c;
2025
2026 /*
2027 * because we modify the constraint, we need
2028 * to make a copy. Static constraints come
2029 * from static const tables.
2030 *
2031 * only needed when constraint has not yet
2032 * been cloned (marked dynamic)
2033 */
2034 if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
2035
2036 /* sanity check */
2037 if (idx < 0)
2038 return &emptyconstraint;
2039
2040 /*
2041 * grab pre-allocated constraint entry
2042 */
2043 cx = &cpuc->constraint_list[idx];
2044
2045 /*
2046 * initialize dynamic constraint
2047 * with static constraint
2048 */
2049 memcpy(cx, c, sizeof(*cx));
2050
2051 /*
2052 * mark constraint as dynamic, so we
2053 * can free it later on
2054 */
2055 cx->flags |= PERF_X86_EVENT_DYNAMIC;
2056 }
2057
2058 /*
2059 * From here on, the constraint is dynamic.
2060 * Either it was just allocated above, or it
2061	 * was allocated during an earlier invocation
2062 * of this function
2063 */
2064
2065 /*
2066 * Modify static constraint with current dynamic
2067 * state of thread
2068 *
2069 * EXCLUSIVE: sibling counter measuring exclusive event
2070 * SHARED : sibling counter measuring non-exclusive event
2071 * UNUSED : sibling counter unused
2072 */
2073 for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
2074 /*
2075 * exclusive event in sibling counter
2076 * our corresponding counter cannot be used
2077 * regardless of our event
2078 */
2079 if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
2080 __clear_bit(i, cx->idxmsk);
2081 /*
2082 * if measuring an exclusive event, sibling
2083 * measuring non-exclusive, then counter cannot
2084 * be used
2085 */
2086 if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
2087 __clear_bit(i, cx->idxmsk);
2088 }
2089
2090 /*
2091 * recompute actual bit weight for scheduling algorithm
2092 */
2093 cx->weight = hweight64(cx->idxmsk64);
2094
2095 /*
2096 * if we return an empty mask, then switch
2097 * back to static empty constraint to avoid
2098 * the cost of freeing later on
2099 */
2100 if (cx->weight == 0)
2101 cx = &emptyconstraint;
2102
2103 return cx;
2104}
2105
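The counter filtering above can be modelled outside the kernel in a few lines. This standalone sketch (illustrative types only; the real state lives in struct intel_excl_cntrs) mirrors the two __clear_bit() cases: a counter is dropped if the sibling thread runs an exclusive event on it, or if the current event is exclusive and the sibling merely shares that counter.

/* Standalone model of the sibling-state filtering in intel_get_excl_constraints(). */
#include <stdio.h>

enum excl_state { UNUSED, SHARED, EXCLUSIVE };	/* mirrors INTEL_EXCL_* */

static unsigned filter_mask(unsigned idxmsk, const enum excl_state *sibling,
			    int nr_counters, int is_excl)
{
	for (int i = 0; i < nr_counters; i++) {
		if (!(idxmsk & (1u << i)))
			continue;
		/* sibling runs an exclusive event there: counter is off limits */
		if (sibling[i] == EXCLUSIVE)
			idxmsk &= ~(1u << i);
		/* we are exclusive, sibling shares that counter: also off limits */
		if (is_excl && sibling[i] == SHARED)
			idxmsk &= ~(1u << i);
	}
	return idxmsk;
}

int main(void)
{
	enum excl_state sibling[4] = { EXCLUSIVE, SHARED, UNUSED, UNUSED };
	unsigned m = filter_mask(0xf, sibling, 4, /* is_excl = */ 1);

	printf("usable counter mask: 0x%x (%d counters left)\n",
	       m, __builtin_popcount(m));
	return 0;
}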
2106static struct event_constraint *
2107intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
2108 struct perf_event *event)
2109{
2110 struct event_constraint *c1 = event->hw.constraint;
2111 struct event_constraint *c2;
2112
2113 /*
2114 * first time only
2115 * - static constraint: no change across incremental scheduling calls
2116 * - dynamic constraint: handled by intel_get_excl_constraints()
2117 */
2118 c2 = __intel_get_event_constraints(cpuc, idx, event);
2119 if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
2120 bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
2121 c1->weight = c2->weight;
2122 c2 = c1;
2123 }
2124
2125 if (cpuc->excl_cntrs)
2126 return intel_get_excl_constraints(cpuc, event, idx, c2);
2127
2128 return c2;
2129}
2130
2131static void intel_put_excl_constraints(struct cpu_hw_events *cpuc,
2132 struct perf_event *event)
2133{
2134 struct hw_perf_event *hwc = &event->hw;
2135 struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
2136 struct intel_excl_states *xlo, *xl;
2137 unsigned long flags = 0; /* keep compiler happy */
2138 int tid = cpuc->excl_thread_id;
2139 int o_tid = 1 - tid;
2140
2141 /*
2142 * nothing needed if in group validation mode
2143 */
2144 if (cpuc->is_fake)
2145 return;
2146
2147 WARN_ON_ONCE(!excl_cntrs);
2148
2149 if (!excl_cntrs)
2150 return;
2151
2152 xl = &excl_cntrs->states[tid];
2153 xlo = &excl_cntrs->states[o_tid];
2154
2155 /*
2156 * put_constraint may be called from x86_schedule_events()
2157 * which already has the lock held so here make locking
2158 * conditional
2159 */
2160 if (!xl->sched_started)
2161 raw_spin_lock_irqsave(&excl_cntrs->lock, flags);
2162
2163 /*
2164 * if event was actually assigned, then mark the
2165 * counter state as unused now
2166 */
2167 if (hwc->idx >= 0)
2168 xlo->state[hwc->idx] = INTEL_EXCL_UNUSED;
2169
2170 if (!xl->sched_started)
2171 raw_spin_unlock_irqrestore(&excl_cntrs->lock, flags);
1661} 2172}
1662 2173
1663static void 2174static void
@@ -1678,7 +2189,57 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
1678static void intel_put_event_constraints(struct cpu_hw_events *cpuc, 2189static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
1679 struct perf_event *event) 2190 struct perf_event *event)
1680{ 2191{
2192 struct event_constraint *c = event->hw.constraint;
2193
1681 intel_put_shared_regs_event_constraints(cpuc, event); 2194 intel_put_shared_regs_event_constraints(cpuc, event);
2195
2196 /*
2197	 * if the PMU has exclusive counter restrictions, then
2198	 * all events are subject to them and must call the
2199 * put_excl_constraints() routine
2200 */
2201 if (c && cpuc->excl_cntrs)
2202 intel_put_excl_constraints(cpuc, event);
2203
2204 /* cleanup dynamic constraint */
2205 if (c && (c->flags & PERF_X86_EVENT_DYNAMIC))
2206 event->hw.constraint = NULL;
2207}
2208
2209static void intel_commit_scheduling(struct cpu_hw_events *cpuc,
2210 struct perf_event *event, int cntr)
2211{
2212 struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
2213 struct event_constraint *c = event->hw.constraint;
2214 struct intel_excl_states *xlo, *xl;
2215 int tid = cpuc->excl_thread_id;
2216 int o_tid = 1 - tid;
2217 int is_excl;
2218
2219 if (cpuc->is_fake || !c)
2220 return;
2221
2222 is_excl = c->flags & PERF_X86_EVENT_EXCL;
2223
2224 if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
2225 return;
2226
2227 WARN_ON_ONCE(!excl_cntrs);
2228
2229 if (!excl_cntrs)
2230 return;
2231
2232 xl = &excl_cntrs->states[tid];
2233 xlo = &excl_cntrs->states[o_tid];
2234
2235 WARN_ON_ONCE(!raw_spin_is_locked(&excl_cntrs->lock));
2236
2237 if (cntr >= 0) {
2238 if (is_excl)
2239 xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
2240 else
2241 xlo->init_state[cntr] = INTEL_EXCL_SHARED;
2242 }
1682} 2243}
1683 2244
1684static void intel_pebs_aliases_core2(struct perf_event *event) 2245static void intel_pebs_aliases_core2(struct perf_event *event)
@@ -1747,10 +2308,21 @@ static int intel_pmu_hw_config(struct perf_event *event)
1747 if (event->attr.precise_ip && x86_pmu.pebs_aliases) 2308 if (event->attr.precise_ip && x86_pmu.pebs_aliases)
1748 x86_pmu.pebs_aliases(event); 2309 x86_pmu.pebs_aliases(event);
1749 2310
-	if (intel_pmu_needs_lbr_smpl(event)) {
+	if (needs_branch_stack(event)) {
1751 ret = intel_pmu_setup_lbr_filter(event); 2312 ret = intel_pmu_setup_lbr_filter(event);
1752 if (ret) 2313 if (ret)
1753 return ret; 2314 return ret;
2315
2316 /*
2317 * BTS is set up earlier in this path, so don't account twice
2318 */
2319 if (!intel_pmu_has_bts(event)) {
2320 /* disallow lbr if conflicting events are present */
2321 if (x86_add_exclusive(x86_lbr_exclusive_lbr))
2322 return -EBUSY;
2323
2324 event->destroy = hw_perf_lbr_event_destroy;
2325 }
1754 } 2326 }
1755 2327
1756 if (event->attr.type != PERF_TYPE_RAW) 2328 if (event->attr.type != PERF_TYPE_RAW)
@@ -1891,9 +2463,12 @@ static struct event_constraint counter2_constraint =
1891 EVENT_CONSTRAINT(0, 0x4, 0); 2463 EVENT_CONSTRAINT(0, 0x4, 0);
1892 2464
1893static struct event_constraint * 2465static struct event_constraint *
-hsw_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
 {
-	struct event_constraint *c = intel_get_event_constraints(cpuc, event);
+	struct event_constraint *c;
+
+	c = intel_get_event_constraints(cpuc, idx, event);
 
1898 /* Handle special quirk on in_tx_checkpointed only in counter 2 */ 2473 /* Handle special quirk on in_tx_checkpointed only in counter 2 */
1899 if (event->hw.config & HSW_IN_TX_CHECKPOINTED) { 2474 if (event->hw.config & HSW_IN_TX_CHECKPOINTED) {
@@ -1905,6 +2480,32 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
1905 return c; 2480 return c;
1906} 2481}
1907 2482
2483/*
2484 * Broadwell:
2485 *
2486 * The INST_RETIRED.ALL period always needs to have lowest 6 bits cleared
2487 * (BDM55) and it must not use a period smaller than 100 (BDM11). We combine
2488 * the two to enforce a minimum period of 128 (the smallest value that has bits
2489 * 0-5 cleared and >= 100).
2490 *
2491 * Because of how the code in x86_perf_event_set_period() works, the truncation
2492 * of the lower 6 bits is 'harmless' as we'll occasionally add a longer period
2493 * to make up for the 'lost' events due to carrying the 'error' in period_left.
2494 *
2495 * Therefore the effective (average) period matches the requested period,
2496 * despite coarser hardware granularity.
2497 */
2498static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
2499{
2500 if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
2501 X86_CONFIG(.event=0xc0, .umask=0x01)) {
2502 if (left < 128)
2503 left = 128;
2504 left &= ~0x3fu;
2505 }
2506 return left;
2507}
2508
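The effect of bdw_limit_period() is easy to verify by hand; a standalone worked example (plain arithmetic, not kernel code) for a few requested periods:

/* Worked example of the BDM11/BDM55 period rounding used above. */
#include <stdio.h>

static unsigned bdw_round(unsigned left)
{
	if (left < 128)			/* BDM11: no period below ~100, so use 128 */
		left = 128;
	left &= ~0x3fu;			/* BDM55: lowest 6 bits must be clear */
	return left;
}

int main(void)
{
	unsigned samples[] = { 1, 100, 128, 200, 100003 };

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("requested %6u -> programmed %6u\n",
		       samples[i], bdw_round(samples[i]));
	return 0;
}

Anything below 128 is bumped up to 128, and every period is rounded down to a multiple of 64; as the comment above explains, x86_perf_event_set_period() carries the rounding error forward so the average period still matches the request.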
1908PMU_FORMAT_ATTR(event, "config:0-7" ); 2509PMU_FORMAT_ATTR(event, "config:0-7" );
1909PMU_FORMAT_ATTR(umask, "config:8-15" ); 2510PMU_FORMAT_ATTR(umask, "config:8-15" );
1910PMU_FORMAT_ATTR(edge, "config:18" ); 2511PMU_FORMAT_ATTR(edge, "config:18" );
@@ -1979,16 +2580,52 @@ struct intel_shared_regs *allocate_shared_regs(int cpu)
1979 return regs; 2580 return regs;
1980} 2581}
1981 2582
2583static struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
2584{
2585 struct intel_excl_cntrs *c;
2586 int i;
2587
2588 c = kzalloc_node(sizeof(struct intel_excl_cntrs),
2589 GFP_KERNEL, cpu_to_node(cpu));
2590 if (c) {
2591 raw_spin_lock_init(&c->lock);
2592 for (i = 0; i < X86_PMC_IDX_MAX; i++) {
2593 c->states[0].state[i] = INTEL_EXCL_UNUSED;
2594 c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
2595
2596 c->states[1].state[i] = INTEL_EXCL_UNUSED;
2597 c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
2598 }
2599 c->core_id = -1;
2600 }
2601 return c;
2602}
2603
1982static int intel_pmu_cpu_prepare(int cpu) 2604static int intel_pmu_cpu_prepare(int cpu)
1983{ 2605{
1984 struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu); 2606 struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
1985 2607
-	if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
-		return NOTIFY_OK;
+	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
+		cpuc->shared_regs = allocate_shared_regs(cpu);
+		if (!cpuc->shared_regs)
+			return NOTIFY_BAD;
+	}
 
-	cpuc->shared_regs = allocate_shared_regs(cpu);
-	if (!cpuc->shared_regs)
-		return NOTIFY_BAD;
+	if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+		size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
+
+		cpuc->constraint_list = kzalloc(sz, GFP_KERNEL);
+		if (!cpuc->constraint_list)
+			return NOTIFY_BAD;
+
+		cpuc->excl_cntrs = allocate_excl_cntrs(cpu);
+		if (!cpuc->excl_cntrs) {
+			kfree(cpuc->constraint_list);
+			kfree(cpuc->shared_regs);
+			return NOTIFY_BAD;
+		}
+		cpuc->excl_thread_id = 0;
+	}
 
1993 return NOTIFY_OK; 2630 return NOTIFY_OK;
1994} 2631}
@@ -2010,13 +2647,15 @@ static void intel_pmu_cpu_starting(int cpu)
2010 if (!cpuc->shared_regs) 2647 if (!cpuc->shared_regs)
2011 return; 2648 return;
2012 2649
-	if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+	if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
+		void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
+
 		for_each_cpu(i, topology_thread_cpumask(cpu)) {
 			struct intel_shared_regs *pc;
 
 			pc = per_cpu(cpu_hw_events, i).shared_regs;
 			if (pc && pc->core_id == core_id) {
-				cpuc->kfree_on_online = cpuc->shared_regs;
+				*onln = cpuc->shared_regs;
2020 cpuc->shared_regs = pc; 2659 cpuc->shared_regs = pc;
2021 break; 2660 break;
2022 } 2661 }
@@ -2027,6 +2666,44 @@ static void intel_pmu_cpu_starting(int cpu)
2027 2666
2028 if (x86_pmu.lbr_sel_map) 2667 if (x86_pmu.lbr_sel_map)
2029 cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR]; 2668 cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
2669
2670 if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
2671 int h = x86_pmu.num_counters >> 1;
2672
2673 for_each_cpu(i, topology_thread_cpumask(cpu)) {
2674 struct intel_excl_cntrs *c;
2675
2676 c = per_cpu(cpu_hw_events, i).excl_cntrs;
2677 if (c && c->core_id == core_id) {
2678 cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
2679 cpuc->excl_cntrs = c;
2680 cpuc->excl_thread_id = 1;
2681 break;
2682 }
2683 }
2684 cpuc->excl_cntrs->core_id = core_id;
2685 cpuc->excl_cntrs->refcnt++;
2686 /*
2687 * set hard limit to half the number of generic counters
2688 */
2689 cpuc->excl_cntrs->states[0].max_alloc_cntrs = h;
2690 cpuc->excl_cntrs->states[1].max_alloc_cntrs = h;
2691 }
2692}
2693
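The hard limit installed just above is simply half of the generic counters per HT sibling; a trivial worked example (not kernel code):

#include <stdio.h>

int main(void)
{
	int num_counters[] = { 4, 8 };	/* typical generic-counter counts */

	for (unsigned i = 0; i < 2; i++)
		printf("%d generic counters -> each HT sibling may hold at most %d\n",
		       num_counters[i], num_counters[i] >> 1);
	return 0;
}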
2694static void free_excl_cntrs(int cpu)
2695{
2696 struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
2697 struct intel_excl_cntrs *c;
2698
2699 c = cpuc->excl_cntrs;
2700 if (c) {
2701 if (c->core_id == -1 || --c->refcnt == 0)
2702 kfree(c);
2703 cpuc->excl_cntrs = NULL;
2704 kfree(cpuc->constraint_list);
2705 cpuc->constraint_list = NULL;
2706 }
2030} 2707}
2031 2708
2032static void intel_pmu_cpu_dying(int cpu) 2709static void intel_pmu_cpu_dying(int cpu)
@@ -2041,19 +2718,9 @@ static void intel_pmu_cpu_dying(int cpu)
2041 cpuc->shared_regs = NULL; 2718 cpuc->shared_regs = NULL;
2042 } 2719 }
2043 2720
-	fini_debug_store_on_cpu(cpu);
-}
+	free_excl_cntrs(cpu);
 
-static void intel_pmu_flush_branch_stack(void)
-{
-	/*
-	 * Intel LBR does not tag entries with the
-	 * PID of the current task, then we need to
-	 * flush it on ctxsw
-	 * For now, we simply reset it
-	 */
-	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
+	fini_debug_store_on_cpu(cpu);
 }
2058 2725
2059PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63"); 2726PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
@@ -2107,7 +2774,7 @@ static __initconst const struct x86_pmu intel_pmu = {
2107 .cpu_starting = intel_pmu_cpu_starting, 2774 .cpu_starting = intel_pmu_cpu_starting,
2108 .cpu_dying = intel_pmu_cpu_dying, 2775 .cpu_dying = intel_pmu_cpu_dying,
2109 .guest_get_msrs = intel_guest_get_msrs, 2776 .guest_get_msrs = intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.sched_task		= intel_pmu_lbr_sched_task,
2111}; 2778};
2112 2779
2113static __init void intel_clovertown_quirk(void) 2780static __init void intel_clovertown_quirk(void)
@@ -2264,6 +2931,27 @@ static __init void intel_nehalem_quirk(void)
2264 } 2931 }
2265} 2932}
2266 2933
2934/*
2935 * enable software workaround for errata:
2936 * SNB: BJ122
2937 * IVB: BV98
2938 * HSW: HSD29
2939 *
2940 * Only needed when HT is enabled. However, detecting
2941 * whether HT is enabled is difficult (model specific). So instead,
2942 * we enable the workaround during early boot, and verify whether
2943 * it is needed in a later initcall phase, once we have valid
2944 * topology information to check if HT is actually enabled.
2945 */
2946static __init void intel_ht_bug(void)
2947{
2948 x86_pmu.flags |= PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED;
2949
2950 x86_pmu.commit_scheduling = intel_commit_scheduling;
2951 x86_pmu.start_scheduling = intel_start_scheduling;
2952 x86_pmu.stop_scheduling = intel_stop_scheduling;
2953}
2954
2267EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3"); 2955EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
2268EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82") 2956EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")
2269 2957
@@ -2443,7 +3131,7 @@ __init int intel_pmu_init(void)
2443 x86_pmu.event_constraints = intel_slm_event_constraints; 3131 x86_pmu.event_constraints = intel_slm_event_constraints;
2444 x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints; 3132 x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
2445 x86_pmu.extra_regs = intel_slm_extra_regs; 3133 x86_pmu.extra_regs = intel_slm_extra_regs;
-	x86_pmu.er_flags |= ERF_HAS_RSP_1;
+	x86_pmu.flags |= PMU_FL_HAS_RSP_1;
2447 pr_cont("Silvermont events, "); 3135 pr_cont("Silvermont events, ");
2448 break; 3136 break;
2449 3137
@@ -2461,7 +3149,7 @@ __init int intel_pmu_init(void)
2461 x86_pmu.enable_all = intel_pmu_nhm_enable_all; 3149 x86_pmu.enable_all = intel_pmu_nhm_enable_all;
2462 x86_pmu.pebs_constraints = intel_westmere_pebs_event_constraints; 3150 x86_pmu.pebs_constraints = intel_westmere_pebs_event_constraints;
2463 x86_pmu.extra_regs = intel_westmere_extra_regs; 3151 x86_pmu.extra_regs = intel_westmere_extra_regs;
-	x86_pmu.er_flags |= ERF_HAS_RSP_1;
+	x86_pmu.flags |= PMU_FL_HAS_RSP_1;
2465 3153
2466 x86_pmu.cpu_events = nhm_events_attrs; 3154 x86_pmu.cpu_events = nhm_events_attrs;
2467 3155
@@ -2478,6 +3166,7 @@ __init int intel_pmu_init(void)
2478 case 42: /* 32nm SandyBridge */ 3166 case 42: /* 32nm SandyBridge */
2479 case 45: /* 32nm SandyBridge-E/EN/EP */ 3167 case 45: /* 32nm SandyBridge-E/EN/EP */
2480 x86_add_quirk(intel_sandybridge_quirk); 3168 x86_add_quirk(intel_sandybridge_quirk);
3169 x86_add_quirk(intel_ht_bug);
2481 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, 3170 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
2482 sizeof(hw_cache_event_ids)); 3171 sizeof(hw_cache_event_ids));
2483 memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, 3172 memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
@@ -2492,9 +3181,11 @@ __init int intel_pmu_init(void)
2492 x86_pmu.extra_regs = intel_snbep_extra_regs; 3181 x86_pmu.extra_regs = intel_snbep_extra_regs;
2493 else 3182 else
2494 x86_pmu.extra_regs = intel_snb_extra_regs; 3183 x86_pmu.extra_regs = intel_snb_extra_regs;
+
+
 	/* all extra regs are per-cpu when HT is on */
-	x86_pmu.er_flags |= ERF_HAS_RSP_1;
-	x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+	x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+	x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
2498 3189
2499 x86_pmu.cpu_events = snb_events_attrs; 3190 x86_pmu.cpu_events = snb_events_attrs;
2500 3191
@@ -2510,6 +3201,7 @@ __init int intel_pmu_init(void)
2510 3201
2511 case 58: /* 22nm IvyBridge */ 3202 case 58: /* 22nm IvyBridge */
2512 case 62: /* 22nm IvyBridge-EP/EX */ 3203 case 62: /* 22nm IvyBridge-EP/EX */
3204 x86_add_quirk(intel_ht_bug);
2513 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, 3205 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
2514 sizeof(hw_cache_event_ids)); 3206 sizeof(hw_cache_event_ids));
2515 /* dTLB-load-misses on IVB is different than SNB */ 3207 /* dTLB-load-misses on IVB is different than SNB */
@@ -2528,8 +3220,8 @@ __init int intel_pmu_init(void)
2528 else 3220 else
2529 x86_pmu.extra_regs = intel_snb_extra_regs; 3221 x86_pmu.extra_regs = intel_snb_extra_regs;
2530 /* all extra regs are per-cpu when HT is on */ 3222 /* all extra regs are per-cpu when HT is on */
-	x86_pmu.er_flags |= ERF_HAS_RSP_1;
-	x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+	x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+	x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
2533 3225
2534 x86_pmu.cpu_events = snb_events_attrs; 3226 x86_pmu.cpu_events = snb_events_attrs;
2535 3227
@@ -2545,19 +3237,20 @@ __init int intel_pmu_init(void)
2545 case 63: /* 22nm Haswell Server */ 3237 case 63: /* 22nm Haswell Server */
2546 case 69: /* 22nm Haswell ULT */ 3238 case 69: /* 22nm Haswell ULT */
2547 case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */ 3239 case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
+		x86_add_quirk(intel_ht_bug);
 		x86_pmu.late_ack = true;
-		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
-		memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
+		memcpy(hw_cache_event_ids, hsw_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+		memcpy(hw_cache_extra_regs, hsw_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 
-		intel_pmu_lbr_init_snb();
+		intel_pmu_lbr_init_hsw();
 
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
 		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
 		/* all extra regs are per-cpu when HT is on */
-		x86_pmu.er_flags |= ERF_HAS_RSP_1;
-		x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
2561 3254
2562 x86_pmu.hw_config = hsw_hw_config; 3255 x86_pmu.hw_config = hsw_hw_config;
2563 x86_pmu.get_event_constraints = hsw_get_event_constraints; 3256 x86_pmu.get_event_constraints = hsw_get_event_constraints;
@@ -2566,6 +3259,39 @@ __init int intel_pmu_init(void)
2566 pr_cont("Haswell events, "); 3259 pr_cont("Haswell events, ");
2567 break; 3260 break;
2568 3261
3262 case 61: /* 14nm Broadwell Core-M */
3263 case 86: /* 14nm Broadwell Xeon D */
3264 x86_pmu.late_ack = true;
3265 memcpy(hw_cache_event_ids, hsw_hw_cache_event_ids, sizeof(hw_cache_event_ids));
3266 memcpy(hw_cache_extra_regs, hsw_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
3267
3268 /* L3_MISS_LOCAL_DRAM is BIT(26) in Broadwell */
3269 hw_cache_extra_regs[C(LL)][C(OP_READ)][C(RESULT_MISS)] = HSW_DEMAND_READ |
3270 BDW_L3_MISS|HSW_SNOOP_DRAM;
3271 hw_cache_extra_regs[C(LL)][C(OP_WRITE)][C(RESULT_MISS)] = HSW_DEMAND_WRITE|BDW_L3_MISS|
3272 HSW_SNOOP_DRAM;
3273 hw_cache_extra_regs[C(NODE)][C(OP_READ)][C(RESULT_ACCESS)] = HSW_DEMAND_READ|
3274 BDW_L3_MISS_LOCAL|HSW_SNOOP_DRAM;
3275 hw_cache_extra_regs[C(NODE)][C(OP_WRITE)][C(RESULT_ACCESS)] = HSW_DEMAND_WRITE|
3276 BDW_L3_MISS_LOCAL|HSW_SNOOP_DRAM;
3277
3278 intel_pmu_lbr_init_snb();
3279
3280 x86_pmu.event_constraints = intel_bdw_event_constraints;
3281 x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
3282 x86_pmu.extra_regs = intel_snbep_extra_regs;
3283 x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
3284 /* all extra regs are per-cpu when HT is on */
3285 x86_pmu.flags |= PMU_FL_HAS_RSP_1;
3286 x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
3287
3288 x86_pmu.hw_config = hsw_hw_config;
3289 x86_pmu.get_event_constraints = hsw_get_event_constraints;
3290 x86_pmu.cpu_events = hsw_events_attrs;
3291 x86_pmu.limit_period = bdw_limit_period;
3292 pr_cont("Broadwell events, ");
3293 break;
3294
2569 default: 3295 default:
2570 switch (x86_pmu.version) { 3296 switch (x86_pmu.version) {
2571 case 1: 3297 case 1:
@@ -2651,3 +3377,47 @@ __init int intel_pmu_init(void)
2651 3377
2652 return 0; 3378 return 0;
2653} 3379}
3380
3381/*
3382 * HT bug: phase 2 init
3383 * Called once we have valid topology information to check
3384 * whether or not HT is enabled
3385 * If HT is off, then we disable the workaround
3386 */
3387static __init int fixup_ht_bug(void)
3388{
3389 int cpu = smp_processor_id();
3390 int w, c;
3391 /*
3392 * problem not present on this CPU model, nothing to do
3393 */
3394 if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
3395 return 0;
3396
3397 w = cpumask_weight(topology_thread_cpumask(cpu));
3398 if (w > 1) {
3399 pr_info("PMU erratum BJ122, BV98, HSD29 worked around, HT is on\n");
3400 return 0;
3401 }
3402
3403 watchdog_nmi_disable_all();
3404
3405 x86_pmu.flags &= ~(PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED);
3406
3407 x86_pmu.commit_scheduling = NULL;
3408 x86_pmu.start_scheduling = NULL;
3409 x86_pmu.stop_scheduling = NULL;
3410
3411 watchdog_nmi_enable_all();
3412
3413 get_online_cpus();
3414
3415 for_each_online_cpu(c) {
3416 free_excl_cntrs(c);
3417 }
3418
3419 put_online_cpus();
3420 pr_info("PMU erratum BJ122, BV98, HSD29 workaround disabled, HT off\n");
3421 return 0;
3422}
3423subsys_initcall(fixup_ht_bug)
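fixup_ht_bug() decides by looking at the weight of the thread sibling cpumask. Roughly the same check can be made from user space through sysfs; a standalone sketch, assuming the usual topology/thread_siblings_list layout:

/* Counts CPU 0's SMT siblings, mirroring the cpumask_weight() test above. */
#include <stdio.h>
#include <string.h>

static int sibling_count(const char *path)
{
	FILE *f = fopen(path, "r");
	char buf[256];
	int count = 0;

	if (!f || !fgets(buf, sizeof(buf), f)) {
		if (f)
			fclose(f);
		return -1;
	}
	fclose(f);

	/* The list looks like "0,4" or "0-1"; expand ranges and count entries. */
	for (char *p = strtok(buf, ",\n"); p; p = strtok(NULL, ",\n")) {
		int lo, hi;

		if (sscanf(p, "%d-%d", &lo, &hi) == 2)
			count += hi - lo + 1;
		else
			count += 1;
	}
	return count;
}

int main(void)
{
	int w = sibling_count("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");

	if (w < 0) {
		fprintf(stderr, "could not read sibling list\n");
		return 1;
	}
	printf(w > 1 ? "HT is on: workaround stays enabled\n"
		     : "HT is off: workaround would be disabled\n");
	return 0;
}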
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 000000000000..ac1f0c55f379
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,525 @@
1/*
2 * BTS PMU driver for perf
3 * Copyright (c) 2013-2014, Intel Corporation.
4 *
5 * This program is free software; you can redistribute it and/or modify it
6 * under the terms and conditions of the GNU General Public License,
7 * version 2, as published by the Free Software Foundation.
8 *
9 * This program is distributed in the hope it will be useful, but WITHOUT
10 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
12 * more details.
13 */
14
15#undef DEBUG
16
17#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
18
19#include <linux/bitops.h>
20#include <linux/types.h>
21#include <linux/slab.h>
22#include <linux/debugfs.h>
23#include <linux/device.h>
24#include <linux/coredump.h>
25
26#include <asm-generic/sizes.h>
27#include <asm/perf_event.h>
28
29#include "perf_event.h"
30
31struct bts_ctx {
32 struct perf_output_handle handle;
33 struct debug_store ds_back;
34 int started;
35};
36
37static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
38
39#define BTS_RECORD_SIZE 24
40#define BTS_SAFETY_MARGIN 4080
41
42struct bts_phys {
43 struct page *page;
44 unsigned long size;
45 unsigned long offset;
46 unsigned long displacement;
47};
48
49struct bts_buffer {
50 size_t real_size; /* multiple of BTS_RECORD_SIZE */
51 unsigned int nr_pages;
52 unsigned int nr_bufs;
53 unsigned int cur_buf;
54 bool snapshot;
55 local_t data_size;
56 local_t lost;
57 local_t head;
58 unsigned long end;
59 void **data_pages;
60 struct bts_phys buf[0];
61};
62
63struct pmu bts_pmu;
64
65void intel_pmu_enable_bts(u64 config);
66void intel_pmu_disable_bts(void);
67
68static size_t buf_size(struct page *page)
69{
70 return 1 << (PAGE_SHIFT + page_private(page));
71}
72
73static void *
74bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
75{
76 struct bts_buffer *buf;
77 struct page *page;
78 int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
79 unsigned long offset;
80 size_t size = nr_pages << PAGE_SHIFT;
81 int pg, nbuf, pad;
82
83 /* count all the high order buffers */
84 for (pg = 0, nbuf = 0; pg < nr_pages;) {
85 page = virt_to_page(pages[pg]);
86 if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
87 return NULL;
88 pg += 1 << page_private(page);
89 nbuf++;
90 }
91
92 /*
 93	 * to avoid interrupts in overwrite mode, only allow one physical buffer
94 */
95 if (overwrite && nbuf > 1)
96 return NULL;
97
98 buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
99 if (!buf)
100 return NULL;
101
102 buf->nr_pages = nr_pages;
103 buf->nr_bufs = nbuf;
104 buf->snapshot = overwrite;
105 buf->data_pages = pages;
106 buf->real_size = size - size % BTS_RECORD_SIZE;
107
108 for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
109 unsigned int __nr_pages;
110
111 page = virt_to_page(pages[pg]);
112 __nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
113 buf->buf[nbuf].page = page;
114 buf->buf[nbuf].offset = offset;
115 buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
116 buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
117 pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
118 buf->buf[nbuf].size -= pad;
119
120 pg += __nr_pages;
121 offset += __nr_pages << PAGE_SHIFT;
122 }
123
124 return buf;
125}
126
127static void bts_buffer_free_aux(void *data)
128{
129 kfree(data);
130}
131
132static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
133{
134 return buf->buf[idx].offset + buf->buf[idx].displacement;
135}
136
137static void
138bts_config_buffer(struct bts_buffer *buf)
139{
140 int cpu = raw_smp_processor_id();
141 struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
142 struct bts_phys *phys = &buf->buf[buf->cur_buf];
143 unsigned long index, thresh = 0, end = phys->size;
144 struct page *page = phys->page;
145
146 index = local_read(&buf->head);
147
148 if (!buf->snapshot) {
149 if (buf->end < phys->offset + buf_size(page))
150 end = buf->end - phys->offset - phys->displacement;
151
152 index -= phys->offset + phys->displacement;
153
154 if (end - index > BTS_SAFETY_MARGIN)
155 thresh = end - BTS_SAFETY_MARGIN;
156 else if (end - index > BTS_RECORD_SIZE)
157 thresh = end - BTS_RECORD_SIZE;
158 else
159 thresh = end;
160 }
161
162 ds->bts_buffer_base = (u64)(long)page_address(page) + phys->displacement;
163 ds->bts_index = ds->bts_buffer_base + index;
164 ds->bts_absolute_maximum = ds->bts_buffer_base + end;
165 ds->bts_interrupt_threshold = !buf->snapshot
166 ? ds->bts_buffer_base + thresh
167 : ds->bts_absolute_maximum + BTS_RECORD_SIZE;
168}
169
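The interrupt threshold chosen in bts_config_buffer() follows a three-way rule: leave a full safety margin when there is room, otherwise stop one record early, otherwise interrupt at the very end. A standalone sketch with illustrative numbers:

/* Standalone sketch of the interrupt-threshold choice in bts_config_buffer(). */
#include <stdio.h>

#define RECORD_SIZE	24UL
#define SAFETY_MARGIN	4080UL

static unsigned long pick_thresh(unsigned long index, unsigned long end)
{
	if (end - index > SAFETY_MARGIN)	/* plenty of room: stop a margin early */
		return end - SAFETY_MARGIN;
	if (end - index > RECORD_SIZE)		/* little room: stop one record early */
		return end - RECORD_SIZE;
	return end;				/* no room left: interrupt at the end */
}

int main(void)
{
	unsigned long end = 65536;
	unsigned long index[] = { 0, 62000, 65520 };

	for (unsigned i = 0; i < sizeof(index) / sizeof(index[0]); i++)
		printf("index %5lu, end %lu -> threshold %lu\n",
		       index[i], end, pick_thresh(index[i], end));
	return 0;
}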
170static void bts_buffer_pad_out(struct bts_phys *phys, unsigned long head)
171{
172 unsigned long index = head - phys->offset;
173
174 memset(page_address(phys->page) + index, 0, phys->size - index);
175}
176
177static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
178{
179 if (buf->snapshot)
180 return false;
181
182 if (local_read(&buf->data_size) >= bts->handle.size ||
183 bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
184 return true;
185
186 return false;
187}
188
189static void bts_update(struct bts_ctx *bts)
190{
191 int cpu = raw_smp_processor_id();
192 struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
193 struct bts_buffer *buf = perf_get_aux(&bts->handle);
194 unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
195
196 if (!buf)
197 return;
198
199 head = index + bts_buffer_offset(buf, buf->cur_buf);
200 old = local_xchg(&buf->head, head);
201
202 if (!buf->snapshot) {
203 if (old == head)
204 return;
205
206 if (ds->bts_index >= ds->bts_absolute_maximum)
207 local_inc(&buf->lost);
208
209 /*
210 * old and head are always in the same physical buffer, so we
211 * can subtract them to get the data size.
212 */
213 local_add(head - old, &buf->data_size);
214 } else {
215 local_set(&buf->data_size, head);
216 }
217}
218
219static void __bts_event_start(struct perf_event *event)
220{
221 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
222 struct bts_buffer *buf = perf_get_aux(&bts->handle);
223 u64 config = 0;
224
225 if (!buf || bts_buffer_is_full(buf, bts))
226 return;
227
228 event->hw.state = 0;
229
230 if (!buf->snapshot)
231 config |= ARCH_PERFMON_EVENTSEL_INT;
232 if (!event->attr.exclude_kernel)
233 config |= ARCH_PERFMON_EVENTSEL_OS;
234 if (!event->attr.exclude_user)
235 config |= ARCH_PERFMON_EVENTSEL_USR;
236
237 bts_config_buffer(buf);
238
239 /*
240 * local barrier to make sure that ds configuration made it
241 * before we enable BTS
242 */
243 wmb();
244
245 intel_pmu_enable_bts(config);
246}
247
248static void bts_event_start(struct perf_event *event, int flags)
249{
250 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
251
252 __bts_event_start(event);
253
254 /* PMI handler: this counter is running and likely generating PMIs */
255 ACCESS_ONCE(bts->started) = 1;
256}
257
258static void __bts_event_stop(struct perf_event *event)
259{
260 /*
261 * No extra synchronization is mandated by the documentation to have
262 * BTS data stores globally visible.
263 */
264 intel_pmu_disable_bts();
265
266 if (event->hw.state & PERF_HES_STOPPED)
267 return;
268
269 ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
270}
271
272static void bts_event_stop(struct perf_event *event, int flags)
273{
274 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
275
276 /* PMI handler: don't restart this counter */
277 ACCESS_ONCE(bts->started) = 0;
278
279 __bts_event_stop(event);
280
281 if (flags & PERF_EF_UPDATE)
282 bts_update(bts);
283}
284
285void intel_bts_enable_local(void)
286{
287 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
288
289 if (bts->handle.event && bts->started)
290 __bts_event_start(bts->handle.event);
291}
292
293void intel_bts_disable_local(void)
294{
295 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
296
297 if (bts->handle.event)
298 __bts_event_stop(bts->handle.event);
299}
300
301static int
302bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
303{
304 unsigned long head, space, next_space, pad, gap, skip, wakeup;
305 unsigned int next_buf;
306 struct bts_phys *phys, *next_phys;
307 int ret;
308
309 if (buf->snapshot)
310 return 0;
311
312 head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
313 if (WARN_ON_ONCE(head != local_read(&buf->head)))
314 return -EINVAL;
315
316 phys = &buf->buf[buf->cur_buf];
317 space = phys->offset + phys->displacement + phys->size - head;
318 pad = space;
319 if (space > handle->size) {
320 space = handle->size;
321 space -= space % BTS_RECORD_SIZE;
322 }
323 if (space <= BTS_SAFETY_MARGIN) {
324 /* See if next phys buffer has more space */
325 next_buf = buf->cur_buf + 1;
326 if (next_buf >= buf->nr_bufs)
327 next_buf = 0;
328 next_phys = &buf->buf[next_buf];
329 gap = buf_size(phys->page) - phys->displacement - phys->size +
330 next_phys->displacement;
331 skip = pad + gap;
332 if (handle->size >= skip) {
333 next_space = next_phys->size;
334 if (next_space + skip > handle->size) {
335 next_space = handle->size - skip;
336 next_space -= next_space % BTS_RECORD_SIZE;
337 }
338 if (next_space > space || !space) {
339 if (pad)
340 bts_buffer_pad_out(phys, head);
341 ret = perf_aux_output_skip(handle, skip);
342 if (ret)
343 return ret;
344 /* Advance to next phys buffer */
345 phys = next_phys;
346 space = next_space;
347 head = phys->offset + phys->displacement;
348 /*
349 * After this, cur_buf and head won't match ds
350 * anymore, so we must not be racing with
351 * bts_update().
352 */
353 buf->cur_buf = next_buf;
354 local_set(&buf->head, head);
355 }
356 }
357 }
358
359 /* Don't go far beyond wakeup watermark */
360 wakeup = BTS_SAFETY_MARGIN + BTS_RECORD_SIZE + handle->wakeup -
361 handle->head;
362 if (space > wakeup) {
363 space = wakeup;
364 space -= space % BTS_RECORD_SIZE;
365 }
366
367 buf->end = head + space;
368
369 /*
370 * If we have no space, the lost notification would have been sent when
371 * we hit absolute_maximum - see bts_update()
372 */
373 if (!space)
374 return -ENOSPC;
375
376 return 0;
377}
378
379int intel_bts_interrupt(void)
380{
381 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
382 struct perf_event *event = bts->handle.event;
383 struct bts_buffer *buf;
384 s64 old_head;
385 int err;
386
387 if (!event || !bts->started)
388 return 0;
389
390 buf = perf_get_aux(&bts->handle);
391 /*
392 * Skip snapshot counters: they don't use the interrupt, but
393 * there's no other way of telling, because the pointer will
394 * keep moving
395 */
396 if (!buf || buf->snapshot)
397 return 0;
398
399 old_head = local_read(&buf->head);
400 bts_update(bts);
401
402 /* no new data */
403 if (old_head == local_read(&buf->head))
404 return 0;
405
406 perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
407 !!local_xchg(&buf->lost, 0));
408
409 buf = perf_aux_output_begin(&bts->handle, event);
410 if (!buf)
411 return 1;
412
413 err = bts_buffer_reset(buf, &bts->handle);
414 if (err)
415 perf_aux_output_end(&bts->handle, 0, false);
416
417 return 1;
418}
419
420static void bts_event_del(struct perf_event *event, int mode)
421{
422 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
423 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
424 struct bts_buffer *buf = perf_get_aux(&bts->handle);
425
426 bts_event_stop(event, PERF_EF_UPDATE);
427
428 if (buf) {
429 if (buf->snapshot)
430 bts->handle.head =
431 local_xchg(&buf->data_size,
432 buf->nr_pages << PAGE_SHIFT);
433 perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
434 !!local_xchg(&buf->lost, 0));
435 }
436
437 cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
438 cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
439 cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
440 cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
441}
442
443static int bts_event_add(struct perf_event *event, int mode)
444{
445 struct bts_buffer *buf;
446 struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
447 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
448 struct hw_perf_event *hwc = &event->hw;
449 int ret = -EBUSY;
450
451 event->hw.state = PERF_HES_STOPPED;
452
453 if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
454 return -EBUSY;
455
456 if (bts->handle.event)
457 return -EBUSY;
458
459 buf = perf_aux_output_begin(&bts->handle, event);
460 if (!buf)
461 return -EINVAL;
462
463 ret = bts_buffer_reset(buf, &bts->handle);
464 if (ret) {
465 perf_aux_output_end(&bts->handle, 0, false);
466 return ret;
467 }
468
469 bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
470 bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
471 bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
472
473 if (mode & PERF_EF_START) {
474 bts_event_start(event, 0);
475 if (hwc->state & PERF_HES_STOPPED) {
476 bts_event_del(event, 0);
477 return -EBUSY;
478 }
479 }
480
481 return 0;
482}
483
484static void bts_event_destroy(struct perf_event *event)
485{
486 x86_del_exclusive(x86_lbr_exclusive_bts);
487}
488
489static int bts_event_init(struct perf_event *event)
490{
491 if (event->attr.type != bts_pmu.type)
492 return -ENOENT;
493
494 if (x86_add_exclusive(x86_lbr_exclusive_bts))
495 return -EBUSY;
496
497 event->destroy = bts_event_destroy;
498
499 return 0;
500}
501
502static void bts_event_read(struct perf_event *event)
503{
504}
505
506static __init int bts_init(void)
507{
508 if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
509 return -ENODEV;
510
511 bts_pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
512 bts_pmu.task_ctx_nr = perf_sw_context;
513 bts_pmu.event_init = bts_event_init;
514 bts_pmu.add = bts_event_add;
515 bts_pmu.del = bts_event_del;
516 bts_pmu.start = bts_event_start;
517 bts_pmu.stop = bts_event_stop;
518 bts_pmu.read = bts_event_read;
519 bts_pmu.setup_aux = bts_buffer_setup_aux;
520 bts_pmu.free_aux = bts_buffer_free_aux;
521
522 return perf_pmu_register(&bts_pmu, "intel_bts", -1);
523}
524
525module_init(bts_init);
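
As an aside, and not part of the diff itself: a minimal user-space sketch of how the 'intel_bts' PMU registered above could be exercised once it appears in sysfs. The sysfs path follows the usual dynamic-PMU convention, the attribute settings are illustrative, and the AUX-area mmap needed to actually consume the trace is deliberately omitted.

/* Hypothetical user-space sketch: open a per-thread intel_bts event. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	FILE *f;
	int type, fd;

	/* Dynamic PMU type exported by perf_pmu_register("intel_bts", ...). */
	f = fopen("/sys/bus/event_source/devices/intel_bts/type", "r");
	if (!f)
		return 1;
	if (fscanf(f, "%d", &type) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;		/* the intel_bts PMU */
	attr.exclude_kernel = 1;	/* don't record ring-0 branches */

	/* pid = 0, cpu = -1: trace the calling thread wherever it runs. */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	/*
	 * A real consumer would now mmap the perf ring buffer plus an AUX
	 * area (aux_offset/aux_size in the mmap control page) and parse the
	 * BTS records out of it; that part is omitted for brevity.
	 */
	close(fd);
	return 0;
}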
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
new file mode 100644
index 000000000000..e4d1b8b738fa
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -0,0 +1,1379 @@
1/*
2 * Intel Cache Quality-of-Service Monitoring (CQM) support.
3 *
4 * Based very, very heavily on work by Peter Zijlstra.
5 */
6
7#include <linux/perf_event.h>
8#include <linux/slab.h>
9#include <asm/cpu_device_id.h>
10#include "perf_event.h"
11
12#define MSR_IA32_PQR_ASSOC 0x0c8f
13#define MSR_IA32_QM_CTR 0x0c8e
14#define MSR_IA32_QM_EVTSEL 0x0c8d
15
16static unsigned int cqm_max_rmid = -1;
17static unsigned int cqm_l3_scale; /* supposedly cacheline size */
18
19struct intel_cqm_state {
20 raw_spinlock_t lock;
21 int rmid;
22 int cnt;
23};
24
25static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
26
27/*
28 * Protects cache_groups, cqm_rmid_free_lru and cqm_rmid_limbo_lru.
29 * Also protects event->hw.cqm_rmid
30 *
31 * Hold either for stability, both for modification of ->hw.cqm_rmid.
32 */
33static DEFINE_MUTEX(cache_mutex);
34static DEFINE_RAW_SPINLOCK(cache_lock);
35
36/*
37 * Groups of events that have the same target(s), one RMID per group.
38 */
39static LIST_HEAD(cache_groups);
40
41/*
42 * Mask of CPUs for reading CQM values. We only need one per socket.
43 */
44static cpumask_t cqm_cpumask;
45
46#define RMID_VAL_ERROR (1ULL << 63)
47#define RMID_VAL_UNAVAIL (1ULL << 62)
48
49#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
50
51#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
52
53/*
54 * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
55 *
56 * This rmid is always free and is guaranteed to have an associated
57 * near-zero occupancy value, i.e. no cachelines are tagged with this
58 * RMID, once __intel_cqm_rmid_rotate() returns.
59 */
60static unsigned int intel_cqm_rotation_rmid;
61
62#define INVALID_RMID (-1)
63
64/*
65 * Is @rmid valid for programming the hardware?
66 *
67 * rmid 0 is reserved by the hardware for all non-monitored tasks, which
68 * means that we should never come across an rmid with that value.
69 * Likewise, an rmid value of -1 indicates "no rmid currently
70 * assigned" and is used as part of the rotation code.
71 */
72static inline bool __rmid_valid(unsigned int rmid)
73{
74 if (!rmid || rmid == INVALID_RMID)
75 return false;
76
77 return true;
78}
79
80static u64 __rmid_read(unsigned int rmid)
81{
82 u64 val;
83
84 /*
85 * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
86 * it just says that to increase confusion.
87 */
88 wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
89 rdmsrl(MSR_IA32_QM_CTR, val);
90
91 /*
92 * Aside from the ERROR and UNAVAIL bits, assume this thing returns
93 * the number of cachelines tagged with @rmid.
94 */
95 return val;
96}
97
98enum rmid_recycle_state {
99 RMID_YOUNG = 0,
100 RMID_AVAILABLE,
101 RMID_DIRTY,
102};
103
104struct cqm_rmid_entry {
105 unsigned int rmid;
106 enum rmid_recycle_state state;
107 struct list_head list;
108 unsigned long queue_time;
109};
110
111/*
112 * cqm_rmid_free_lru - A least recently used list of RMIDs.
113 *
114 * Oldest entry at the head, newest (most recently used) entry at the
115 * tail. This list is never traversed, it's only used to keep track of
116 * the lru order. That is, we only pick entries off the head or insert
117 * them at the tail.
118 *
119 * All entries on the list are 'free', and their RMIDs are not currently
120 * in use. To mark an RMID as in use, remove its entry from the lru
121 * list.
122 *
123 *
124 * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
125 *
126 * This list contains RMIDs that no one is currently using but that
127 * may have a non-zero occupancy value associated with them. The
128 * rotation worker moves RMIDs from the limbo list to the free list once
129 * the occupancy value drops below __intel_cqm_threshold.
130 *
131 * Both lists are protected by cache_mutex.
132 */
133static LIST_HEAD(cqm_rmid_free_lru);
134static LIST_HEAD(cqm_rmid_limbo_lru);
135
136/*
137 * We use a simple array of pointers so that we can lookup a struct
138 * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
139 * and __put_rmid() from having to worry about dealing with struct
140 * cqm_rmid_entry - they just deal with rmids, i.e. integers.
141 *
142 * Once this array is initialized it is read-only. No locks are required
143 * to access it.
144 *
145 * All entries for all RMIDs can be looked up in this array at all
146 * times.
147 */
148static struct cqm_rmid_entry **cqm_rmid_ptrs;
149
150static inline struct cqm_rmid_entry *__rmid_entry(int rmid)
151{
152 struct cqm_rmid_entry *entry;
153
154 entry = cqm_rmid_ptrs[rmid];
155 WARN_ON(entry->rmid != rmid);
156
157 return entry;
158}
159
160/*
161 * Returns < 0 on failure.
162 *
163 * We expect to be called with cache_mutex held.
164 */
165static int __get_rmid(void)
166{
167 struct cqm_rmid_entry *entry;
168
169 lockdep_assert_held(&cache_mutex);
170
171 if (list_empty(&cqm_rmid_free_lru))
172 return INVALID_RMID;
173
174 entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
175 list_del(&entry->list);
176
177 return entry->rmid;
178}
179
180static void __put_rmid(unsigned int rmid)
181{
182 struct cqm_rmid_entry *entry;
183
184 lockdep_assert_held(&cache_mutex);
185
186 WARN_ON(!__rmid_valid(rmid));
187 entry = __rmid_entry(rmid);
188
189 entry->queue_time = jiffies;
190 entry->state = RMID_YOUNG;
191
192 list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
193}
194
195static int intel_cqm_setup_rmid_cache(void)
196{
197 struct cqm_rmid_entry *entry;
198 unsigned int nr_rmids;
199 int r = 0;
200
201 nr_rmids = cqm_max_rmid + 1;
202 cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
203 nr_rmids, GFP_KERNEL);
204 if (!cqm_rmid_ptrs)
205 return -ENOMEM;
206
207 for (; r <= cqm_max_rmid; r++) {
208 struct cqm_rmid_entry *entry;
209
210 entry = kmalloc(sizeof(*entry), GFP_KERNEL);
211 if (!entry)
212 goto fail;
213
214 INIT_LIST_HEAD(&entry->list);
215 entry->rmid = r;
216 cqm_rmid_ptrs[r] = entry;
217
218 list_add_tail(&entry->list, &cqm_rmid_free_lru);
219 }
220
221 /*
222 * RMID 0 is special and is always allocated. It's used for all
223 * tasks that are not monitored.
224 */
225 entry = __rmid_entry(0);
226 list_del(&entry->list);
227
228 mutex_lock(&cache_mutex);
229 intel_cqm_rotation_rmid = __get_rmid();
230 mutex_unlock(&cache_mutex);
231
232 return 0;
233fail:
234 while (r--)
235 kfree(cqm_rmid_ptrs[r]);
236
237 kfree(cqm_rmid_ptrs);
238 return -ENOMEM;
239}
240
241/*
242 * Determine if @a and @b measure the same set of tasks.
243 *
244 * If @a and @b measure the same set of tasks then we want to share a
245 * single RMID.
246 */
247static bool __match_event(struct perf_event *a, struct perf_event *b)
248{
249 /* Per-cpu and task events don't mix */
250 if ((a->attach_state & PERF_ATTACH_TASK) !=
251 (b->attach_state & PERF_ATTACH_TASK))
252 return false;
253
254#ifdef CONFIG_CGROUP_PERF
255 if (a->cgrp != b->cgrp)
256 return false;
257#endif
258
259 /* If not task event, we're machine wide */
260 if (!(b->attach_state & PERF_ATTACH_TASK))
261 return true;
262
263 /*
264 * Events that target the same task are placed into the same cache group.
265 */
266 if (a->hw.target == b->hw.target)
267 return true;
268
269 /*
270 * Are we an inherited event?
271 */
272 if (b->parent == a)
273 return true;
274
275 return false;
276}
277
278#ifdef CONFIG_CGROUP_PERF
279static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
280{
281 if (event->attach_state & PERF_ATTACH_TASK)
282 return perf_cgroup_from_task(event->hw.target);
283
284 return event->cgrp;
285}
286#endif
287
288/*
289 * Determine if @a's tasks intersect with @b's tasks
290 *
291 * There are combinations of events that we explicitly prohibit,
292 *
293 * PROHIBITS
294 * system-wide -> cgroup and task
295 * cgroup -> system-wide
296 * -> task in cgroup
297 * task -> system-wide
298 * -> task in cgroup
299 *
300 * Call this function before allocating an RMID.
301 */
302static bool __conflict_event(struct perf_event *a, struct perf_event *b)
303{
304#ifdef CONFIG_CGROUP_PERF
305 /*
306 * We can have any number of cgroups but only one system-wide
307 * event at a time.
308 */
309 if (a->cgrp && b->cgrp) {
310 struct perf_cgroup *ac = a->cgrp;
311 struct perf_cgroup *bc = b->cgrp;
312
313 /*
314 * This condition should have been caught in
315 * __match_event() and we should be sharing an RMID.
316 */
317 WARN_ON_ONCE(ac == bc);
318
319 if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
320 cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
321 return true;
322
323 return false;
324 }
325
326 if (a->cgrp || b->cgrp) {
327 struct perf_cgroup *ac, *bc;
328
329 /*
330 * cgroup and system-wide events are mutually exclusive
331 */
332 if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
333 (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
334 return true;
335
336 /*
337 * Ensure neither event is part of the other's cgroup
338 */
339 ac = event_to_cgroup(a);
340 bc = event_to_cgroup(b);
341 if (ac == bc)
342 return true;
343
344 /*
345 * Must have cgroup and non-intersecting task events.
346 */
347 if (!ac || !bc)
348 return false;
349
350 /*
351 * We have cgroup and task events, and the task belongs
352 * to a cgroup. Check for overlap.
353 */
354 if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
355 cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
356 return true;
357
358 return false;
359 }
360#endif
361 /*
362 * If one of them is not a task event, same story as above with cgroups.
363 */
364 if (!(a->attach_state & PERF_ATTACH_TASK) ||
365 !(b->attach_state & PERF_ATTACH_TASK))
366 return true;
367
368 /*
369 * Must be non-overlapping.
370 */
371 return false;
372}
373
374struct rmid_read {
375 unsigned int rmid;
376 atomic64_t value;
377};
378
379static void __intel_cqm_event_count(void *info);
380
381/*
382 * Exchange the RMID of a group of events.
383 */
384static unsigned int
385intel_cqm_xchg_rmid(struct perf_event *group, unsigned int rmid)
386{
387 struct perf_event *event;
388 unsigned int old_rmid = group->hw.cqm_rmid;
389 struct list_head *head = &group->hw.cqm_group_entry;
390
391 lockdep_assert_held(&cache_mutex);
392
393 /*
394 * If our RMID is being deallocated, perform a read now.
395 */
396 if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
397 struct rmid_read rr = {
398 .value = ATOMIC64_INIT(0),
399 .rmid = old_rmid,
400 };
401
402 on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
403 &rr, 1);
404 local64_set(&group->count, atomic64_read(&rr.value));
405 }
406
407 raw_spin_lock_irq(&cache_lock);
408
409 group->hw.cqm_rmid = rmid;
410 list_for_each_entry(event, head, hw.cqm_group_entry)
411 event->hw.cqm_rmid = rmid;
412
413 raw_spin_unlock_irq(&cache_lock);
414
415 return old_rmid;
416}
417
418/*
419 * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
420 * cachelines are still tagged with RMIDs in limbo, we progressively
421 * increment the threshold until we find an RMID in limbo with <=
422 * __intel_cqm_threshold lines tagged. This is designed to mitigate the
423 * problem where cachelines tagged with an RMID are not steadily being
424 * evicted.
425 *
426 * On successful rotations we decrease the threshold back towards zero.
427 *
428 * __intel_cqm_max_threshold provides an upper bound on the threshold,
429 * and is measured in bytes because it's exposed to userland.
430 */
431static unsigned int __intel_cqm_threshold;
432static unsigned int __intel_cqm_max_threshold;
433
434/*
435 * Mark limbo RMIDs still above __intel_cqm_threshold on this cpu as dirty.
436 */
437static void intel_cqm_stable(void *arg)
438{
439 struct cqm_rmid_entry *entry;
440
441 list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
442 if (entry->state != RMID_AVAILABLE)
443 break;
444
445 if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
446 entry->state = RMID_DIRTY;
447 }
448}
449
450/*
451 * If we have group events waiting for an RMID that don't conflict with
452 * events already running, assign @rmid.
453 */
454static bool intel_cqm_sched_in_event(unsigned int rmid)
455{
456 struct perf_event *leader, *event;
457
458 lockdep_assert_held(&cache_mutex);
459
460 leader = list_first_entry(&cache_groups, struct perf_event,
461 hw.cqm_groups_entry);
462 event = leader;
463
464 list_for_each_entry_continue(event, &cache_groups,
465 hw.cqm_groups_entry) {
466 if (__rmid_valid(event->hw.cqm_rmid))
467 continue;
468
469 if (__conflict_event(event, leader))
470 continue;
471
472 intel_cqm_xchg_rmid(event, rmid);
473 return true;
474 }
475
476 return false;
477}
478
479/*
480 * Initially use this constant for both the limbo queue time and the
481 * rotation timer interval, pmu::hrtimer_interval_ms.
482 *
483 * They don't need to be the same, but the two are related since if you
484 * rotate faster than you recycle RMIDs, you may run out of available
485 * RMIDs.
486 */
487#define RMID_DEFAULT_QUEUE_TIME 250 /* ms */
488
489static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
490
491/*
492 * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
493 * @available: number of freeable RMIDs on the limbo list
494 *
495 * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
496 * cachelines are tagged with those RMIDs. After this we can reuse them
497 * and know that the current set of active RMIDs is stable.
498 *
499 * Return %true or %false depending on whether stabilization needs to be
500 * reattempted.
501 *
502 * If we return %true then @available is updated to indicate the
503 * number of RMIDs on the limbo list that have been queued for the
504 * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
505 * are above __intel_cqm_threshold.
506 */
507static bool intel_cqm_rmid_stabilize(unsigned int *available)
508{
509 struct cqm_rmid_entry *entry, *tmp;
510
511 lockdep_assert_held(&cache_mutex);
512
513 *available = 0;
514 list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
515 unsigned long min_queue_time;
516 unsigned long now = jiffies;
517
518 /*
519 * We hold RMIDs placed into limbo for a minimum queue
520 * time. Before the minimum queue time has elapsed we do
521 * not recycle RMIDs.
522 *
523 * The reasoning is that until a sufficient time has
524 * passed since we stopped using an RMID, any RMID
525 * placed onto the limbo list will likely still have
526 * data tagged in the cache, which means we'll probably
527 * fail to recycle it anyway.
528 *
529 * We can save ourselves an expensive IPI by skipping
530 * any RMIDs that have not been queued for the minimum
531 * time.
532 */
533 min_queue_time = entry->queue_time +
534 msecs_to_jiffies(__rmid_queue_time_ms);
535
536 if (time_after(min_queue_time, now))
537 break;
538
539 entry->state = RMID_AVAILABLE;
540 (*available)++;
541 }
542
543 /*
544 * Fast return if none of the RMIDs on the limbo list have been
545 * sitting on the queue for the minimum queue time.
546 */
547 if (!*available)
548 return false;
549
550 /*
551 * Test whether an RMID is free for each package.
552 */
553 on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
554
555 list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
556 /*
557 * Exhausted all RMIDs that have waited min queue time.
558 */
559 if (entry->state == RMID_YOUNG)
560 break;
561
562 if (entry->state == RMID_DIRTY)
563 continue;
564
565 list_del(&entry->list); /* remove from limbo */
566
567 /*
568 * The rotation RMID gets priority if it's currently
569 * invalid, in which case we skip adding this RMID
570 * to the free lru.
571 */
572 if (!__rmid_valid(intel_cqm_rotation_rmid)) {
573 intel_cqm_rotation_rmid = entry->rmid;
574 continue;
575 }
576
577 /*
578 * If we have groups waiting for RMIDs, hand
579 * them one now provided they don't conflict.
580 */
581 if (intel_cqm_sched_in_event(entry->rmid))
582 continue;
583
584 /*
585 * Otherwise place it onto the free list.
586 */
587 list_add_tail(&entry->list, &cqm_rmid_free_lru);
588 }
589
590
591 return __rmid_valid(intel_cqm_rotation_rmid);
592}
593
594/*
595 * Pick a victim group and move it to the tail of the group list.
596 * @next: The first group without an RMID
597 */
598static void __intel_cqm_pick_and_rotate(struct perf_event *next)
599{
600 struct perf_event *rotor;
601 unsigned int rmid;
602
603 lockdep_assert_held(&cache_mutex);
604
605 rotor = list_first_entry(&cache_groups, struct perf_event,
606 hw.cqm_groups_entry);
607
608 /*
609 * The group at the front of the list should always have a valid
610 * RMID. If it doesn't then no groups have RMIDs assigned and we
611 * don't need to rotate the list.
612 */
613 if (next == rotor)
614 return;
615
616 rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
617 __put_rmid(rmid);
618
619 list_rotate_left(&cache_groups);
620}
621
622/*
623 * Deallocate the RMIDs from any events that conflict with @event, and
624 * place them on the back of the group list.
625 */
626static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
627{
628 struct perf_event *group, *g;
629 unsigned int rmid;
630
631 lockdep_assert_held(&cache_mutex);
632
633 list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
634 if (group == event)
635 continue;
636
637 rmid = group->hw.cqm_rmid;
638
639 /*
640 * Skip events that don't have a valid RMID.
641 */
642 if (!__rmid_valid(rmid))
643 continue;
644
645 /*
646 * No conflict? No problem! Leave the event alone.
647 */
648 if (!__conflict_event(group, event))
649 continue;
650
651 intel_cqm_xchg_rmid(group, INVALID_RMID);
652 __put_rmid(rmid);
653 }
654}
655
656/*
657 * Attempt to rotate the groups and assign new RMIDs.
658 *
659 * We rotate for two reasons,
660 * 1. To handle the scheduling of conflicting events
661 * 2. To recycle RMIDs
662 *
663 * Rotating RMIDs is complicated because the hardware doesn't give us
664 * any clues.
665 *
666 * There are problems with the hardware interface; when you change the
667 * task:RMID map, cachelines retain their 'old' tags, giving a skewed
668 * picture. In order to work around this, we must always keep one free
669 * RMID - intel_cqm_rotation_rmid.
670 *
671 * Rotation works by taking away an RMID from a group (the old RMID),
672 * and assigning the free RMID to another group (the new RMID). We must
673 * then wait for the old RMID to not be used (no cachelines tagged).
674 * This ensures that all cachelines are tagged with 'active' RMIDs. At
675 * this point we can start reading values for the new RMID and treat the
676 * old RMID as the free RMID for the next rotation.
677 *
678 * Return %true or %false depending on whether we did any rotating.
679 */
680static bool __intel_cqm_rmid_rotate(void)
681{
682 struct perf_event *group, *start = NULL;
683 unsigned int threshold_limit;
684 unsigned int nr_needed = 0;
685 unsigned int nr_available;
686 bool rotated = false;
687
688 mutex_lock(&cache_mutex);
689
690again:
691 /*
692 * Fast path through this function if there are no groups and no
693 * RMIDs that need cleaning.
694 */
695 if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
696 goto out;
697
698 list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
699 if (!__rmid_valid(group->hw.cqm_rmid)) {
700 if (!start)
701 start = group;
702 nr_needed++;
703 }
704 }
705
706 /*
707 * We have some event groups, but they all have RMIDs assigned
708 * and no RMIDs need cleaning.
709 */
710 if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
711 goto out;
712
713 if (!nr_needed)
714 goto stabilize;
715
716 /*
717 * We have more event groups without RMIDs than available RMIDs,
718 * or we have event groups that conflict with the ones currently
719 * scheduled.
720 *
721 * We force deallocate the rmid of the group at the head of
722 * cache_groups. The first event group without an RMID then gets
723 * assigned intel_cqm_rotation_rmid. This ensures we always make
724 * forward progress.
725 *
726 * Rotate the cache_groups list so the previous head is now the
727 * tail.
728 */
729 __intel_cqm_pick_and_rotate(start);
730
731 /*
732 * If the rotation is going to succeed, reduce the threshold so
733 * that we don't needlessly reuse dirty RMIDs.
734 */
735 if (__rmid_valid(intel_cqm_rotation_rmid)) {
736 intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
737 intel_cqm_rotation_rmid = __get_rmid();
738
739 intel_cqm_sched_out_conflicting_events(start);
740
741 if (__intel_cqm_threshold)
742 __intel_cqm_threshold--;
743 }
744
745 rotated = true;
746
747stabilize:
748 /*
749 * We now need to stabilize the RMID we freed above (if any) to
750 * ensure that the next time we rotate we have an RMID with zero
751 * occupancy value.
752 *
753 * Alternatively, if we didn't need to perform any rotation,
754 * we'll have a bunch of RMIDs in limbo that need stabilizing.
755 */
756 threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
757
758 while (intel_cqm_rmid_stabilize(&nr_available) &&
759 __intel_cqm_threshold < threshold_limit) {
760 unsigned int steal_limit;
761
762 /*
763 * Don't spin if nobody is actively waiting for an RMID;
764 * the rotation worker will be kicked as soon as an
765 * event needs an RMID anyway.
766 */
767 if (!nr_needed)
768 break;
769
770 /* Allow max 25% of RMIDs to be in limbo. */
771 steal_limit = (cqm_max_rmid + 1) / 4;
772
773 /*
774 * We failed to stabilize any RMIDs so our rotation
775 * logic is now stuck. In order to make forward progress
776 * we have a few options:
777 *
778 * 1. rotate ("steal") another RMID
779 * 2. increase the threshold
780 * 3. do nothing
781 *
782 * We do both of 1. and 2. until we hit the steal limit.
783 *
784 * The steal limit prevents all RMIDs ending up on the
785 * limbo list. This can happen if every RMID has a
786 * non-zero occupancy above threshold_limit, and the
787 * occupancy values aren't dropping fast enough.
788 *
789 * Note that there is prioritisation at work here - we'd
790 * rather increase the number of RMIDs on the limbo list
791 * than increase the threshold, because increasing the
792 * threshold skews the event data (because we reuse
793 * dirty RMIDs) - threshold bumps are a last resort.
794 */
795 if (nr_available < steal_limit)
796 goto again;
797
798 __intel_cqm_threshold++;
799 }
800
801out:
802 mutex_unlock(&cache_mutex);
803 return rotated;
804}
805
806static void intel_cqm_rmid_rotate(struct work_struct *work);
807
808static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
809
810static struct pmu intel_cqm_pmu;
811
812static void intel_cqm_rmid_rotate(struct work_struct *work)
813{
814 unsigned long delay;
815
816 __intel_cqm_rmid_rotate();
817
818 delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
819 schedule_delayed_work(&intel_cqm_rmid_work, delay);
820}
821
822/*
823 * Find a group and set up an RMID.
824 *
825 * If we're part of a group, we use the group's RMID.
826 */
827static void intel_cqm_setup_event(struct perf_event *event,
828 struct perf_event **group)
829{
830 struct perf_event *iter;
831 unsigned int rmid;
832 bool conflict = false;
833
834 list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
835 rmid = iter->hw.cqm_rmid;
836
837 if (__match_event(iter, event)) {
838 /* All tasks in a group share an RMID */
839 event->hw.cqm_rmid = rmid;
840 *group = iter;
841 return;
842 }
843
844 /*
845 * We only care about conflicts for events that are
846 * actually scheduled in (and hence have a valid RMID).
847 */
848 if (__conflict_event(iter, event) && __rmid_valid(rmid))
849 conflict = true;
850 }
851
852 if (conflict)
853 rmid = INVALID_RMID;
854 else
855 rmid = __get_rmid();
856
857 event->hw.cqm_rmid = rmid;
858}
859
860static void intel_cqm_event_read(struct perf_event *event)
861{
862 unsigned long flags;
863 unsigned int rmid;
864 u64 val;
865
866 /*
867 * Task events are handled by intel_cqm_event_count().
868 */
869 if (event->cpu == -1)
870 return;
871
872 raw_spin_lock_irqsave(&cache_lock, flags);
873 rmid = event->hw.cqm_rmid;
874
875 if (!__rmid_valid(rmid))
876 goto out;
877
878 val = __rmid_read(rmid);
879
880 /*
881 * Ignore this reading on error states and do not update the value.
882 */
883 if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
884 goto out;
885
886 local64_set(&event->count, val);
887out:
888 raw_spin_unlock_irqrestore(&cache_lock, flags);
889}
890
891static void __intel_cqm_event_count(void *info)
892{
893 struct rmid_read *rr = info;
894 u64 val;
895
896 val = __rmid_read(rr->rmid);
897
898 if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
899 return;
900
901 atomic64_add(val, &rr->value);
902}
903
904static inline bool cqm_group_leader(struct perf_event *event)
905{
906 return !list_empty(&event->hw.cqm_groups_entry);
907}
908
909static u64 intel_cqm_event_count(struct perf_event *event)
910{
911 unsigned long flags;
912 struct rmid_read rr = {
913 .value = ATOMIC64_INIT(0),
914 };
915
916 /*
917 * We only need to worry about task events. System-wide events
918 * are handled as usual, i.e. entirely with
919 * intel_cqm_event_read().
920 */
921 if (event->cpu != -1)
922 return __perf_event_count(event);
923
924 /*
925 * Only the group leader gets to report values. This stops us
926 * reporting duplicate values to userspace, and gives us a clear
927 * rule for which task gets to report the values.
928 *
929 * Note that it is impossible to attribute these values to
930 * specific packages - we forfeit that ability when we create
931 * task events.
932 */
933 if (!cqm_group_leader(event))
934 return 0;
935
936 /*
937 * Notice that we don't perform the reading of an RMID
938 * atomically, because we can't hold a spin lock across the
939 * IPIs.
940 *
941 * Speculatively perform the read, since @event might be
942 * assigned a different (possibly invalid) RMID while we're
943 * busy performing the IPI calls. It's therefore necessary to
944 * check @event's RMID afterwards, and if it has changed,
945 * discard the result of the read.
946 */
947 rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
948
949 if (!__rmid_valid(rr.rmid))
950 goto out;
951
952 on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
953
954 raw_spin_lock_irqsave(&cache_lock, flags);
955 if (event->hw.cqm_rmid == rr.rmid)
956 local64_set(&event->count, atomic64_read(&rr.value));
957 raw_spin_unlock_irqrestore(&cache_lock, flags);
958out:
959 return __perf_event_count(event);
960}
961
962static void intel_cqm_event_start(struct perf_event *event, int mode)
963{
964 struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
965 unsigned int rmid = event->hw.cqm_rmid;
966 unsigned long flags;
967
968 if (!(event->hw.cqm_state & PERF_HES_STOPPED))
969 return;
970
971 event->hw.cqm_state &= ~PERF_HES_STOPPED;
972
973 raw_spin_lock_irqsave(&state->lock, flags);
974
975 if (state->cnt++)
976 WARN_ON_ONCE(state->rmid != rmid);
977 else
978 WARN_ON_ONCE(state->rmid);
979
980 state->rmid = rmid;
981 wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
982
983 raw_spin_unlock_irqrestore(&state->lock, flags);
984}
985
986static void intel_cqm_event_stop(struct perf_event *event, int mode)
987{
988 struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
989 unsigned long flags;
990
991 if (event->hw.cqm_state & PERF_HES_STOPPED)
992 return;
993
994 event->hw.cqm_state |= PERF_HES_STOPPED;
995
996 raw_spin_lock_irqsave(&state->lock, flags);
997 intel_cqm_event_read(event);
998
999 if (!--state->cnt) {
1000 state->rmid = 0;
1001 wrmsrl(MSR_IA32_PQR_ASSOC, 0);
1002 } else {
1003 WARN_ON_ONCE(!state->rmid);
1004 }
1005
1006 raw_spin_unlock_irqrestore(&state->lock, flags);
1007}
1008
1009static int intel_cqm_event_add(struct perf_event *event, int mode)
1010{
1011 unsigned long flags;
1012 unsigned int rmid;
1013
1014 raw_spin_lock_irqsave(&cache_lock, flags);
1015
1016 event->hw.cqm_state = PERF_HES_STOPPED;
1017 rmid = event->hw.cqm_rmid;
1018
1019 if (__rmid_valid(rmid) && (mode & PERF_EF_START))
1020 intel_cqm_event_start(event, mode);
1021
1022 raw_spin_unlock_irqrestore(&cache_lock, flags);
1023
1024 return 0;
1025}
1026
1027static void intel_cqm_event_del(struct perf_event *event, int mode)
1028{
1029 intel_cqm_event_stop(event, mode);
1030}
1031
1032static void intel_cqm_event_destroy(struct perf_event *event)
1033{
1034 struct perf_event *group_other = NULL;
1035
1036 mutex_lock(&cache_mutex);
1037
1038 /*
1039 * If there's another event in this group...
1040 */
1041 if (!list_empty(&event->hw.cqm_group_entry)) {
1042 group_other = list_first_entry(&event->hw.cqm_group_entry,
1043 struct perf_event,
1044 hw.cqm_group_entry);
1045 list_del(&event->hw.cqm_group_entry);
1046 }
1047
1048 /*
1049 * And we're the group leader...
1050 */
1051 if (cqm_group_leader(event)) {
1052 /*
1053 * If there was a group_other, make that leader, otherwise
1054 * destroy the group and return the RMID.
1055 */
1056 if (group_other) {
1057 list_replace(&event->hw.cqm_groups_entry,
1058 &group_other->hw.cqm_groups_entry);
1059 } else {
1060 unsigned int rmid = event->hw.cqm_rmid;
1061
1062 if (__rmid_valid(rmid))
1063 __put_rmid(rmid);
1064 list_del(&event->hw.cqm_groups_entry);
1065 }
1066 }
1067
1068 mutex_unlock(&cache_mutex);
1069}
1070
1071static int intel_cqm_event_init(struct perf_event *event)
1072{
1073 struct perf_event *group = NULL;
1074 bool rotate = false;
1075
1076 if (event->attr.type != intel_cqm_pmu.type)
1077 return -ENOENT;
1078
1079 if (event->attr.config & ~QOS_EVENT_MASK)
1080 return -EINVAL;
1081
1082 /* unsupported modes and filters */
1083 if (event->attr.exclude_user ||
1084 event->attr.exclude_kernel ||
1085 event->attr.exclude_hv ||
1086 event->attr.exclude_idle ||
1087 event->attr.exclude_host ||
1088 event->attr.exclude_guest ||
1089 event->attr.sample_period) /* no sampling */
1090 return -EINVAL;
1091
1092 INIT_LIST_HEAD(&event->hw.cqm_group_entry);
1093 INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
1094
1095 event->destroy = intel_cqm_event_destroy;
1096
1097 mutex_lock(&cache_mutex);
1098
1099 /* Will also set rmid */
1100 intel_cqm_setup_event(event, &group);
1101
1102 if (group) {
1103 list_add_tail(&event->hw.cqm_group_entry,
1104 &group->hw.cqm_group_entry);
1105 } else {
1106 list_add_tail(&event->hw.cqm_groups_entry,
1107 &cache_groups);
1108
1109 /*
1110 * All RMIDs are either in use or have recently been
1111 * used. Kick the rotation worker to clean/free some.
1112 *
1113 * We only do this for the group leader, rather than for
1114 * every event in a group to save on needless work.
1115 */
1116 if (!__rmid_valid(event->hw.cqm_rmid))
1117 rotate = true;
1118 }
1119
1120 mutex_unlock(&cache_mutex);
1121
1122 if (rotate)
1123 schedule_delayed_work(&intel_cqm_rmid_work, 0);
1124
1125 return 0;
1126}
1127
1128EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
1129EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
1130EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
1131EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
1132EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
1133
1134static struct attribute *intel_cqm_events_attr[] = {
1135 EVENT_PTR(intel_cqm_llc),
1136 EVENT_PTR(intel_cqm_llc_pkg),
1137 EVENT_PTR(intel_cqm_llc_unit),
1138 EVENT_PTR(intel_cqm_llc_scale),
1139 EVENT_PTR(intel_cqm_llc_snapshot),
1140 NULL,
1141};
1142
1143static struct attribute_group intel_cqm_events_group = {
1144 .name = "events",
1145 .attrs = intel_cqm_events_attr,
1146};
1147
1148PMU_FORMAT_ATTR(event, "config:0-7");
1149static struct attribute *intel_cqm_formats_attr[] = {
1150 &format_attr_event.attr,
1151 NULL,
1152};
1153
1154static struct attribute_group intel_cqm_format_group = {
1155 .name = "format",
1156 .attrs = intel_cqm_formats_attr,
1157};
1158
1159static ssize_t
1160max_recycle_threshold_show(struct device *dev, struct device_attribute *attr,
1161 char *page)
1162{
1163 ssize_t rv;
1164
1165 mutex_lock(&cache_mutex);
1166 rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
1167 mutex_unlock(&cache_mutex);
1168
1169 return rv;
1170}
1171
1172static ssize_t
1173max_recycle_threshold_store(struct device *dev,
1174 struct device_attribute *attr,
1175 const char *buf, size_t count)
1176{
1177 unsigned int bytes, cachelines;
1178 int ret;
1179
1180 ret = kstrtouint(buf, 0, &bytes);
1181 if (ret)
1182 return ret;
1183
1184 mutex_lock(&cache_mutex);
1185
1186 __intel_cqm_max_threshold = bytes;
1187 cachelines = bytes / cqm_l3_scale;
1188
1189 /*
1190 * The new maximum takes effect immediately.
1191 */
1192 if (__intel_cqm_threshold > cachelines)
1193 __intel_cqm_threshold = cachelines;
1194
1195 mutex_unlock(&cache_mutex);
1196
1197 return count;
1198}
1199
1200static DEVICE_ATTR_RW(max_recycle_threshold);
1201
1202static struct attribute *intel_cqm_attrs[] = {
1203 &dev_attr_max_recycle_threshold.attr,
1204 NULL,
1205};
1206
1207static const struct attribute_group intel_cqm_group = {
1208 .attrs = intel_cqm_attrs,
1209};
1210
1211static const struct attribute_group *intel_cqm_attr_groups[] = {
1212 &intel_cqm_events_group,
1213 &intel_cqm_format_group,
1214 &intel_cqm_group,
1215 NULL,
1216};
1217
1218static struct pmu intel_cqm_pmu = {
1219 .hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME,
1220 .attr_groups = intel_cqm_attr_groups,
1221 .task_ctx_nr = perf_sw_context,
1222 .event_init = intel_cqm_event_init,
1223 .add = intel_cqm_event_add,
1224 .del = intel_cqm_event_del,
1225 .start = intel_cqm_event_start,
1226 .stop = intel_cqm_event_stop,
1227 .read = intel_cqm_event_read,
1228 .count = intel_cqm_event_count,
1229};
1230
1231static inline void cqm_pick_event_reader(int cpu)
1232{
1233 int phys_id = topology_physical_package_id(cpu);
1234 int i;
1235
1236 for_each_cpu(i, &cqm_cpumask) {
1237 if (phys_id == topology_physical_package_id(i))
1238 return; /* already got reader for this socket */
1239 }
1240
1241 cpumask_set_cpu(cpu, &cqm_cpumask);
1242}
1243
1244static void intel_cqm_cpu_prepare(unsigned int cpu)
1245{
1246 struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
1247 struct cpuinfo_x86 *c = &cpu_data(cpu);
1248
1249 raw_spin_lock_init(&state->lock);
1250 state->rmid = 0;
1251 state->cnt = 0;
1252
1253 WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
1254 WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
1255}
1256
1257static void intel_cqm_cpu_exit(unsigned int cpu)
1258{
1259 int phys_id = topology_physical_package_id(cpu);
1260 int i;
1261
1262 /*
1263 * Is @cpu a designated cqm reader?
1264 */
1265 if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
1266 return;
1267
1268 for_each_online_cpu(i) {
1269 if (i == cpu)
1270 continue;
1271
1272 if (phys_id == topology_physical_package_id(i)) {
1273 cpumask_set_cpu(i, &cqm_cpumask);
1274 break;
1275 }
1276 }
1277}
1278
1279static int intel_cqm_cpu_notifier(struct notifier_block *nb,
1280 unsigned long action, void *hcpu)
1281{
1282 unsigned int cpu = (unsigned long)hcpu;
1283
1284 switch (action & ~CPU_TASKS_FROZEN) {
1285 case CPU_UP_PREPARE:
1286 intel_cqm_cpu_prepare(cpu);
1287 break;
1288 case CPU_DOWN_PREPARE:
1289 intel_cqm_cpu_exit(cpu);
1290 break;
1291 case CPU_STARTING:
1292 cqm_pick_event_reader(cpu);
1293 break;
1294 }
1295
1296 return NOTIFY_OK;
1297}
1298
1299static const struct x86_cpu_id intel_cqm_match[] = {
1300 { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC },
1301 {}
1302};
1303
1304static int __init intel_cqm_init(void)
1305{
1306 char *str, scale[20];
1307 int i, cpu, ret;
1308
1309 if (!x86_match_cpu(intel_cqm_match))
1310 return -ENODEV;
1311
1312 cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
1313
1314 /*
1315 * It's possible that not all resources support the same number
1316 * of RMIDs. Instead of making scheduling much more complicated
1317 * (where we'd have to match a task's RMID to a cpu that supports
1318 * that many RMIDs) just find the minimum number of RMIDs supported
1319 * across all cpus.
1320 *
1321 * Also, check that the scales match on all cpus.
1322 */
1323 cpu_notifier_register_begin();
1324
1325 for_each_online_cpu(cpu) {
1326 struct cpuinfo_x86 *c = &cpu_data(cpu);
1327
1328 if (c->x86_cache_max_rmid < cqm_max_rmid)
1329 cqm_max_rmid = c->x86_cache_max_rmid;
1330
1331 if (c->x86_cache_occ_scale != cqm_l3_scale) {
1332 pr_err("Multiple LLC scale values, disabling\n");
1333 ret = -EINVAL;
1334 goto out;
1335 }
1336 }
1337
1338 /*
1339 * A reasonable upper limit on the max threshold is the number
1340 * of lines tagged per RMID if all RMIDs have the same number of
1341 * lines tagged in the LLC.
1342 *
1343 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
1344 */
1345 __intel_cqm_max_threshold =
1346 boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
1347
1348 snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
1349 str = kstrdup(scale, GFP_KERNEL);
1350 if (!str) {
1351 ret = -ENOMEM;
1352 goto out;
1353 }
1354
1355 event_attr_intel_cqm_llc_scale.event_str = str;
1356
1357 ret = intel_cqm_setup_rmid_cache();
1358 if (ret)
1359 goto out;
1360
1361 for_each_online_cpu(i) {
1362 intel_cqm_cpu_prepare(i);
1363 cqm_pick_event_reader(i);
1364 }
1365
1366 __perf_cpu_notifier(intel_cqm_cpu_notifier);
1367
1368 ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
1369 if (ret)
1370 pr_err("Intel CQM perf registration failed: %d\n", ret);
1371 else
1372 pr_info("Intel CQM monitoring enabled\n");
1373
1374out:
1375 cpu_notifier_register_done();
1376
1377 return ret;
1378}
1379device_initcall(intel_cqm_init);
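
In the same spirit, an illustrative user-space sketch (again, not part of the series) that opens the llc_occupancy event for the calling task and reads back the raw count. config = 1 matches the event=0x01 attribute string above; the value returned by read() is the unscaled counter, which perf tooling multiplies by the exported .scale (cqm_l3_scale) to arrive at bytes.

/* Hypothetical sketch: read llc_occupancy for the current task. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	FILE *f;
	int type, fd;

	f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
	if (!f)
		return 1;
	if (fscanf(f, "%d", &type) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 1;	/* QOS_L3_OCCUP_EVENT_ID, i.e. "llc_occupancy" */

	/* pid = 0, cpu = -1: a task event, serviced by intel_cqm_event_count(). */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	sleep(1);	/* give the task a chance to touch some cachelines */

	if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
		printf("llc_occupancy (raw; multiply by .scale for bytes): %llu\n",
		       (unsigned long long)count);

	close(fd);
	return 0;
}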
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 073983398364..ca69ea56c712 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -461,7 +461,8 @@ void intel_pmu_enable_bts(u64 config)
461 461
462 debugctlmsr |= DEBUGCTLMSR_TR; 462 debugctlmsr |= DEBUGCTLMSR_TR;
463 debugctlmsr |= DEBUGCTLMSR_BTS; 463 debugctlmsr |= DEBUGCTLMSR_BTS;
464 debugctlmsr |= DEBUGCTLMSR_BTINT; 464 if (config & ARCH_PERFMON_EVENTSEL_INT)
465 debugctlmsr |= DEBUGCTLMSR_BTINT;
465 466
466 if (!(config & ARCH_PERFMON_EVENTSEL_OS)) 467 if (!(config & ARCH_PERFMON_EVENTSEL_OS))
467 debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS; 468 debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
@@ -611,6 +612,10 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
611 INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */ 612 INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
612 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */ 613 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
613 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf), 614 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
615 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
616 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
617 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
618 INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
614 /* Allow all events as PEBS with no flags */ 619 /* Allow all events as PEBS with no flags */
615 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf), 620 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
616 EVENT_CONSTRAINT_END 621 EVENT_CONSTRAINT_END
@@ -622,6 +627,10 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
622 INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */ 627 INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
623 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */ 628 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
624 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf), 629 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
630 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
631 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
632 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
633 INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
625 /* Allow all events as PEBS with no flags */ 634 /* Allow all events as PEBS with no flags */
626 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf), 635 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
627 EVENT_CONSTRAINT_END 636 EVENT_CONSTRAINT_END
@@ -633,16 +642,16 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
633 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */ 642 /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
634 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf), 643 INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
635 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */ 644 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
636 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */ 645 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
637 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */ 646 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
638 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */ 647 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
639 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */ 648 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
640 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */ 649 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
641 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */ 650 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
642 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */ 651 INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
643 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */ 652 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
644 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */ 653 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */
645 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */ 654 INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */
646 /* Allow all events as PEBS with no flags */ 655 /* Allow all events as PEBS with no flags */
647 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf), 656 INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
648 EVENT_CONSTRAINT_END 657 EVENT_CONSTRAINT_END
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 58f1a94beaf0..94e5b506caa6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
39#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */ 39#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */
40#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */ 40#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */
41#define LBR_FAR_BIT 8 /* do not capture far branches */ 41#define LBR_FAR_BIT 8 /* do not capture far branches */
42#define LBR_CALL_STACK_BIT 9 /* enable call stack */
42 43
43#define LBR_KERNEL (1 << LBR_KERNEL_BIT) 44#define LBR_KERNEL (1 << LBR_KERNEL_BIT)
44#define LBR_USER (1 << LBR_USER_BIT) 45#define LBR_USER (1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
49#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT) 50#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT)
50#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT) 51#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT)
51#define LBR_FAR (1 << LBR_FAR_BIT) 52#define LBR_FAR (1 << LBR_FAR_BIT)
53#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)
52 54
53#define LBR_PLM (LBR_KERNEL | LBR_USER) 55#define LBR_PLM (LBR_KERNEL | LBR_USER)
54 56
@@ -69,33 +71,31 @@ static enum {
69#define LBR_FROM_FLAG_IN_TX (1ULL << 62) 71#define LBR_FROM_FLAG_IN_TX (1ULL << 62)
70#define LBR_FROM_FLAG_ABORT (1ULL << 61) 72#define LBR_FROM_FLAG_ABORT (1ULL << 61)
71 73
72#define for_each_branch_sample_type(x) \
73 for ((x) = PERF_SAMPLE_BRANCH_USER; \
74 (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
75
76/* 74/*
77 * x86control flow change classification 75 * x86control flow change classification
78 * x86control flow changes include branches, interrupts, traps, faults 76 * x86control flow changes include branches, interrupts, traps, faults
79 */ 77 */
80enum { 78enum {
81 X86_BR_NONE = 0, /* unknown */ 79 X86_BR_NONE = 0, /* unknown */
82 80
83 X86_BR_USER = 1 << 0, /* branch target is user */ 81 X86_BR_USER = 1 << 0, /* branch target is user */
84 X86_BR_KERNEL = 1 << 1, /* branch target is kernel */ 82 X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
85 83
86 X86_BR_CALL = 1 << 2, /* call */ 84 X86_BR_CALL = 1 << 2, /* call */
87 X86_BR_RET = 1 << 3, /* return */ 85 X86_BR_RET = 1 << 3, /* return */
88 X86_BR_SYSCALL = 1 << 4, /* syscall */ 86 X86_BR_SYSCALL = 1 << 4, /* syscall */
89 X86_BR_SYSRET = 1 << 5, /* syscall return */ 87 X86_BR_SYSRET = 1 << 5, /* syscall return */
90 X86_BR_INT = 1 << 6, /* sw interrupt */ 88 X86_BR_INT = 1 << 6, /* sw interrupt */
91 X86_BR_IRET = 1 << 7, /* return from interrupt */ 89 X86_BR_IRET = 1 << 7, /* return from interrupt */
92 X86_BR_JCC = 1 << 8, /* conditional */ 90 X86_BR_JCC = 1 << 8, /* conditional */
93 X86_BR_JMP = 1 << 9, /* jump */ 91 X86_BR_JMP = 1 << 9, /* jump */
94 X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */ 92 X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
95 X86_BR_IND_CALL = 1 << 11,/* indirect calls */ 93 X86_BR_IND_CALL = 1 << 11,/* indirect calls */
96 X86_BR_ABORT = 1 << 12,/* transaction abort */ 94 X86_BR_ABORT = 1 << 12,/* transaction abort */
97 X86_BR_IN_TX = 1 << 13,/* in transaction */ 95 X86_BR_IN_TX = 1 << 13,/* in transaction */
98 X86_BR_NO_TX = 1 << 14,/* not in transaction */ 96 X86_BR_NO_TX = 1 << 14,/* not in transaction */
97 X86_BR_ZERO_CALL = 1 << 15,/* zero length call */
98 X86_BR_CALL_STACK = 1 << 16,/* call stack */
99}; 99};
100 100
101#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL) 101#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -112,13 +112,15 @@ enum {
112 X86_BR_JMP |\ 112 X86_BR_JMP |\
113 X86_BR_IRQ |\ 113 X86_BR_IRQ |\
114 X86_BR_ABORT |\ 114 X86_BR_ABORT |\
115 X86_BR_IND_CALL) 115 X86_BR_IND_CALL |\
116 X86_BR_ZERO_CALL)
116 117
117#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY) 118#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
118 119
119#define X86_BR_ANY_CALL \ 120#define X86_BR_ANY_CALL \
120 (X86_BR_CALL |\ 121 (X86_BR_CALL |\
121 X86_BR_IND_CALL |\ 122 X86_BR_IND_CALL |\
123 X86_BR_ZERO_CALL |\
122 X86_BR_SYSCALL |\ 124 X86_BR_SYSCALL |\
123 X86_BR_IRQ |\ 125 X86_BR_IRQ |\
124 X86_BR_INT) 126 X86_BR_INT)
@@ -130,17 +132,32 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
130 * otherwise it becomes near impossible to get a reliable stack. 132 * otherwise it becomes near impossible to get a reliable stack.
131 */ 133 */
132 134
133static void __intel_pmu_lbr_enable(void) 135static void __intel_pmu_lbr_enable(bool pmi)
134{ 136{
135 u64 debugctl;
136 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 137 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
138 u64 debugctl, lbr_select = 0, orig_debugctl;
137 139
138 if (cpuc->lbr_sel) 140 /*
139 wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config); 141 * No need to reprogram LBR_SELECT in a PMI, as it
142 * did not change.
143 */
144 if (cpuc->lbr_sel && !pmi) {
145 lbr_select = cpuc->lbr_sel->config;
146 wrmsrl(MSR_LBR_SELECT, lbr_select);
147 }
140 148
141 rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl); 149 rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
142 debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI); 150 orig_debugctl = debugctl;
143 wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl); 151 debugctl |= DEBUGCTLMSR_LBR;
152 /*
153 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
154 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
155 * may cause superfluous increase/decrease of LBR_TOS.
156 */
157 if (!(lbr_select & LBR_CALL_STACK))
158 debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
159 if (orig_debugctl != debugctl)
160 wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
144} 161}
145 162
146static void __intel_pmu_lbr_disable(void) 163static void __intel_pmu_lbr_disable(void)
@@ -181,9 +198,116 @@ void intel_pmu_lbr_reset(void)
181 intel_pmu_lbr_reset_64(); 198 intel_pmu_lbr_reset_64();
182} 199}
183 200
201/*
202 * TOS = most recently recorded branch
203 */
204static inline u64 intel_pmu_lbr_tos(void)
205{
206 u64 tos;
207
208 rdmsrl(x86_pmu.lbr_tos, tos);
209 return tos;
210}
211
212enum {
213 LBR_NONE,
214 LBR_VALID,
215};
216
217static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
218{
219 int i;
220 unsigned lbr_idx, mask;
221 u64 tos;
222
223 if (task_ctx->lbr_callstack_users == 0 ||
224 task_ctx->lbr_stack_state == LBR_NONE) {
225 intel_pmu_lbr_reset();
226 return;
227 }
228
229 mask = x86_pmu.lbr_nr - 1;
230 tos = intel_pmu_lbr_tos();
231 for (i = 0; i < x86_pmu.lbr_nr; i++) {
232 lbr_idx = (tos - i) & mask;
233 wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
234 wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
235 }
236 task_ctx->lbr_stack_state = LBR_NONE;
237}
238
239static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
240{
241 int i;
242 unsigned lbr_idx, mask;
243 u64 tos;
244
245 if (task_ctx->lbr_callstack_users == 0) {
246 task_ctx->lbr_stack_state = LBR_NONE;
247 return;
248 }
249
250 mask = x86_pmu.lbr_nr - 1;
251 tos = intel_pmu_lbr_tos();
252 for (i = 0; i < x86_pmu.lbr_nr; i++) {
253 lbr_idx = (tos - i) & mask;
254 rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
255 rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
256 }
257 task_ctx->lbr_stack_state = LBR_VALID;
258}
259
260void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
261{
262 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
263 struct x86_perf_task_context *task_ctx;
264
265 if (!x86_pmu.lbr_nr)
266 return;
267
268 /*
269 * If LBR callstack feature is enabled and the stack was saved when
270 * the task was scheduled out, restore the stack. Otherwise flush
271 * the LBR stack.
272 */
273 task_ctx = ctx ? ctx->task_ctx_data : NULL;
274 if (task_ctx) {
275 if (sched_in) {
276 __intel_pmu_lbr_restore(task_ctx);
277 cpuc->lbr_context = ctx;
278 } else {
279 __intel_pmu_lbr_save(task_ctx);
280 }
281 return;
282 }
283
284 /*
 285 * When sampling the branch stack in system-wide mode, it may be
286 * necessary to flush the stack on context switch. This happens
287 * when the branch stack does not tag its entries with the pid
288 * of the current task. Otherwise it becomes impossible to
289 * associate a branch entry with a task. This ambiguity is more
290 * likely to appear when the branch stack supports priv level
291 * filtering and the user sets it to monitor only at the user
292 * level (which could be a useful measurement in system-wide
 293 * mode). In that case, the risk is high of having a branch
 294 * stack with branches from multiple tasks.
295 */
296 if (sched_in) {
297 intel_pmu_lbr_reset();
298 cpuc->lbr_context = ctx;
299 }
300}
301
302static inline bool branch_user_callstack(unsigned br_sel)
303{
304 return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
305}
306
184void intel_pmu_lbr_enable(struct perf_event *event) 307void intel_pmu_lbr_enable(struct perf_event *event)
185{ 308{
186 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 309 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
310 struct x86_perf_task_context *task_ctx;
187 311
188 if (!x86_pmu.lbr_nr) 312 if (!x86_pmu.lbr_nr)
189 return; 313 return;
@@ -198,18 +322,33 @@ void intel_pmu_lbr_enable(struct perf_event *event)
198 } 322 }
199 cpuc->br_sel = event->hw.branch_reg.reg; 323 cpuc->br_sel = event->hw.branch_reg.reg;
200 324
325 if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
326 event->ctx->task_ctx_data) {
327 task_ctx = event->ctx->task_ctx_data;
328 task_ctx->lbr_callstack_users++;
329 }
330
201 cpuc->lbr_users++; 331 cpuc->lbr_users++;
332 perf_sched_cb_inc(event->ctx->pmu);
202} 333}
203 334
204void intel_pmu_lbr_disable(struct perf_event *event) 335void intel_pmu_lbr_disable(struct perf_event *event)
205{ 336{
206 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 337 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
338 struct x86_perf_task_context *task_ctx;
207 339
208 if (!x86_pmu.lbr_nr) 340 if (!x86_pmu.lbr_nr)
209 return; 341 return;
210 342
343 if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
344 event->ctx->task_ctx_data) {
345 task_ctx = event->ctx->task_ctx_data;
346 task_ctx->lbr_callstack_users--;
347 }
348
211 cpuc->lbr_users--; 349 cpuc->lbr_users--;
212 WARN_ON_ONCE(cpuc->lbr_users < 0); 350 WARN_ON_ONCE(cpuc->lbr_users < 0);
351 perf_sched_cb_dec(event->ctx->pmu);
213 352
214 if (cpuc->enabled && !cpuc->lbr_users) { 353 if (cpuc->enabled && !cpuc->lbr_users) {
215 __intel_pmu_lbr_disable(); 354 __intel_pmu_lbr_disable();
@@ -218,12 +357,12 @@ void intel_pmu_lbr_disable(struct perf_event *event)
218 } 357 }
219} 358}
220 359
221void intel_pmu_lbr_enable_all(void) 360void intel_pmu_lbr_enable_all(bool pmi)
222{ 361{
223 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 362 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
224 363
225 if (cpuc->lbr_users) 364 if (cpuc->lbr_users)
226 __intel_pmu_lbr_enable(); 365 __intel_pmu_lbr_enable(pmi);
227} 366}
228 367
229void intel_pmu_lbr_disable_all(void) 368void intel_pmu_lbr_disable_all(void)
@@ -234,18 +373,6 @@ void intel_pmu_lbr_disable_all(void)
234 __intel_pmu_lbr_disable(); 373 __intel_pmu_lbr_disable();
235} 374}
236 375
237/*
238 * TOS = most recently recorded branch
239 */
240static inline u64 intel_pmu_lbr_tos(void)
241{
242 u64 tos;
243
244 rdmsrl(x86_pmu.lbr_tos, tos);
245
246 return tos;
247}
248
249static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) 376static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
250{ 377{
251 unsigned long mask = x86_pmu.lbr_nr - 1; 378 unsigned long mask = x86_pmu.lbr_nr - 1;
@@ -350,7 +477,7 @@ void intel_pmu_lbr_read(void)
350 * - in case there is no HW filter 477 * - in case there is no HW filter
351 * - in case the HW filter has errata or limitations 478 * - in case the HW filter has errata or limitations
352 */ 479 */
353static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event) 480static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
354{ 481{
355 u64 br_type = event->attr.branch_sample_type; 482 u64 br_type = event->attr.branch_sample_type;
356 int mask = 0; 483 int mask = 0;
@@ -387,11 +514,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
387 if (br_type & PERF_SAMPLE_BRANCH_COND) 514 if (br_type & PERF_SAMPLE_BRANCH_COND)
388 mask |= X86_BR_JCC; 515 mask |= X86_BR_JCC;
389 516
517 if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
518 if (!x86_pmu_has_lbr_callstack())
519 return -EOPNOTSUPP;
520 if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
521 return -EINVAL;
522 mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
523 X86_BR_CALL_STACK;
524 }
525
390 /* 526 /*
391 * stash actual user request into reg, it may 527 * stash actual user request into reg, it may
392 * be used by fixup code for some CPU 528 * be used by fixup code for some CPU
393 */ 529 */
394 event->hw.branch_reg.reg = mask; 530 event->hw.branch_reg.reg = mask;
531 return 0;
395} 532}
396 533
397/* 534/*
@@ -403,14 +540,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
403{ 540{
404 struct hw_perf_event_extra *reg; 541 struct hw_perf_event_extra *reg;
405 u64 br_type = event->attr.branch_sample_type; 542 u64 br_type = event->attr.branch_sample_type;
406 u64 mask = 0, m; 543 u64 mask = 0, v;
407 u64 v; 544 int i;
408 545
409 for_each_branch_sample_type(m) { 546 for (i = 0; i < PERF_SAMPLE_BRANCH_MAX_SHIFT; i++) {
410 if (!(br_type & m)) 547 if (!(br_type & (1ULL << i)))
411 continue; 548 continue;
412 549
413 v = x86_pmu.lbr_sel_map[m]; 550 v = x86_pmu.lbr_sel_map[i];
414 if (v == LBR_NOT_SUPP) 551 if (v == LBR_NOT_SUPP)
415 return -EOPNOTSUPP; 552 return -EOPNOTSUPP;
416 553
@@ -420,8 +557,12 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
420 reg = &event->hw.branch_reg; 557 reg = &event->hw.branch_reg;
421 reg->idx = EXTRA_REG_LBR; 558 reg->idx = EXTRA_REG_LBR;
422 559
423 /* LBR_SELECT operates in suppress mode so invert mask */ 560 /*
424 reg->config = ~mask & x86_pmu.lbr_sel_mask; 561 * The first 9 bits (LBR_SEL_MASK) in LBR_SELECT operate
562 * in suppress mode. So LBR_SELECT should be set to
563 * (~mask & LBR_SEL_MASK) | (mask & ~LBR_SEL_MASK)
564 */
565 reg->config = mask ^ x86_pmu.lbr_sel_mask;
425 566
426 return 0; 567 return 0;
427} 568}
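The comment above gives the longhand form of the LBR_SELECT value while the code stores mask ^ x86_pmu.lbr_sel_mask; the two are equal by Boolean algebra, which the stand-alone check below verifies exhaustively for small masks. Illustration only; the LBR_SEL_MASK value here is a stand-in for the real constant (the comment says the low 9 bits).

/* Illustration: mask ^ SEL == (~mask & SEL) | (mask & ~SEL). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t lbr_sel_mask = 0x1ff;	/* stand-in: low 9 select bits */
	uint64_t mask;

	for (mask = 0; mask < (1u << 12); mask++) {
		uint64_t xor_form = mask ^ lbr_sel_mask;
		uint64_t long_form = (~mask & lbr_sel_mask) |
				     (mask & ~lbr_sel_mask);

		assert(xor_form == long_form);
	}
	printf("XOR form matches the suppress-mode expression\n");
	return 0;
}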
@@ -439,7 +580,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
439 /* 580 /*
440 * setup SW LBR filter 581 * setup SW LBR filter
441 */ 582 */
442 intel_pmu_setup_sw_lbr_filter(event); 583 ret = intel_pmu_setup_sw_lbr_filter(event);
584 if (ret)
585 return ret;
443 586
444 /* 587 /*
445 * setup HW LBR filter, if any 588 * setup HW LBR filter, if any
@@ -568,6 +711,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
568 ret = X86_BR_INT; 711 ret = X86_BR_INT;
569 break; 712 break;
570 case 0xe8: /* call near rel */ 713 case 0xe8: /* call near rel */
714 insn_get_immediate(&insn);
715 if (insn.immediate1.value == 0) {
716 /* zero length call */
717 ret = X86_BR_ZERO_CALL;
718 break;
719 }
571 case 0x9a: /* call far absolute */ 720 case 0x9a: /* call far absolute */
572 ret = X86_BR_CALL; 721 ret = X86_BR_CALL;
573 break; 722 break;
@@ -678,35 +827,49 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
678/* 827/*
679 * Map interface branch filters onto LBR filters 828 * Map interface branch filters onto LBR filters
680 */ 829 */
681static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = { 830static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
682 [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY, 831 [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
683 [PERF_SAMPLE_BRANCH_USER] = LBR_USER, 832 [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
684 [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL, 833 [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
685 [PERF_SAMPLE_BRANCH_HV] = LBR_IGN, 834 [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
686 [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP 835 [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_REL_JMP
687 | LBR_IND_JMP | LBR_FAR, 836 | LBR_IND_JMP | LBR_FAR,
688 /* 837 /*
689 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches 838 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
690 */ 839 */
691 [PERF_SAMPLE_BRANCH_ANY_CALL] = 840 [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
692 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR, 841 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
693 /* 842 /*
694 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL 843 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
695 */ 844 */
696 [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP, 845 [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
697 [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, 846 [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
698}; 847};
699 848
700static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = { 849static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
701 [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY, 850 [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
702 [PERF_SAMPLE_BRANCH_USER] = LBR_USER, 851 [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
703 [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL, 852 [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
704 [PERF_SAMPLE_BRANCH_HV] = LBR_IGN, 853 [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
705 [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR, 854 [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
706 [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL 855 [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
707 | LBR_FAR, 856 | LBR_FAR,
708 [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL, 857 [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
709 [PERF_SAMPLE_BRANCH_COND] = LBR_JCC, 858 [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
859};
860
861static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
862 [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
863 [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
864 [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
865 [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
866 [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
867 [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
868 | LBR_FAR,
869 [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
870 [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
871 [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
872 | LBR_RETURN | LBR_CALL_STACK,
710}; 873};
711 874
712/* core */ 875/* core */
@@ -765,6 +928,20 @@ void __init intel_pmu_lbr_init_snb(void)
765 pr_cont("16-deep LBR, "); 928 pr_cont("16-deep LBR, ");
766} 929}
767 930
931/* haswell */
932void intel_pmu_lbr_init_hsw(void)
933{
934 x86_pmu.lbr_nr = 16;
935 x86_pmu.lbr_tos = MSR_LBR_TOS;
936 x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
937 x86_pmu.lbr_to = MSR_LBR_NHM_TO;
938
939 x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
940 x86_pmu.lbr_sel_map = hsw_lbr_sel_map;
941
942 pr_cont("16-deep LBR, ");
943}
944
768/* atom */ 945/* atom */
769void __init intel_pmu_lbr_init_atom(void) 946void __init intel_pmu_lbr_init_atom(void)
770{ 947{
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 000000000000..f2770641c0fd
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,1103 @@
1/*
2 * Intel(R) Processor Trace PMU driver for perf
3 * Copyright (c) 2013-2014, Intel Corporation.
4 *
5 * This program is free software; you can redistribute it and/or modify it
6 * under the terms and conditions of the GNU General Public License,
7 * version 2, as published by the Free Software Foundation.
8 *
9 * This program is distributed in the hope it will be useful, but WITHOUT
10 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
11 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
12 * more details.
13 *
14 * Intel PT is specified in the Intel Architecture Instruction Set Extensions
15 * Programming Reference:
16 * http://software.intel.com/en-us/intel-isa-extensions
17 */
18
19#undef DEBUG
20
21#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
22
23#include <linux/types.h>
24#include <linux/slab.h>
25#include <linux/device.h>
26
27#include <asm/perf_event.h>
28#include <asm/insn.h>
29#include <asm/io.h>
30
31#include "perf_event.h"
32#include "intel_pt.h"
33
34static DEFINE_PER_CPU(struct pt, pt_ctx);
35
36static struct pt_pmu pt_pmu;
37
38enum cpuid_regs {
39 CR_EAX = 0,
40 CR_ECX,
41 CR_EDX,
42 CR_EBX
43};
44
45/*
46 * Capabilities of Intel PT hardware, such as number of address bits or
47 * supported output schemes, are cached and exported to userspace as the
48 * "caps" attribute group of the pt pmu device
49 * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
50 * relevant bits together with intel_pt traces.
51 *
52 * These are necessary both for trace decoding (payloads_lip, address width
53 * encoded in IP-related packets) and for event configuration (bitmasks with
54 * permitted values for certain bit fields).
55 */
56#define PT_CAP(_n, _l, _r, _m) \
57 [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
58 .reg = _r, .mask = _m }
59
60static struct pt_cap_desc {
61 const char *name;
62 u32 leaf;
63 u8 reg;
64 u32 mask;
65} pt_caps[] = {
66 PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
67 PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
68 PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
69 PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
70 PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
71};
72
73static u32 pt_cap_get(enum pt_capabilities cap)
74{
75 struct pt_cap_desc *cd = &pt_caps[cap];
76 u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
77 unsigned int shift = __ffs(cd->mask);
78
79 return (c & cd->mask) >> shift;
80}
81
82static ssize_t pt_cap_show(struct device *cdev,
83 struct device_attribute *attr,
84 char *buf)
85{
86 struct dev_ext_attribute *ea =
87 container_of(attr, struct dev_ext_attribute, attr);
88 enum pt_capabilities cap = (long)ea->var;
89
90 return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
91}
92
93static struct attribute_group pt_cap_group = {
94 .name = "caps",
95};
96
97PMU_FORMAT_ATTR(tsc, "config:10" );
98PMU_FORMAT_ATTR(noretcomp, "config:11" );
99
100static struct attribute *pt_formats_attr[] = {
101 &format_attr_tsc.attr,
102 &format_attr_noretcomp.attr,
103 NULL,
104};
105
106static struct attribute_group pt_format_group = {
107 .name = "format",
108 .attrs = pt_formats_attr,
109};
110
111static const struct attribute_group *pt_attr_groups[] = {
112 &pt_cap_group,
113 &pt_format_group,
114 NULL,
115};
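Since the capability values registered above are meant to be picked up by user space, a minimal reader is sketched below. It only assumes the sysfs path named in the comment earlier in this file (/sys/bus/event_source/devices/intel_pt/caps/); it is not part of the patch.

/* Illustration: dump the intel_pt "caps" attribute group from sysfs. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/bus/event_source/devices/intel_pt/caps";
	struct dirent *de;
	char path[512], val[64];
	DIR *d = opendir(dir);

	if (!d) {
		perror(dir);
		return 1;
	}
	while ((de = readdir(d)) != NULL) {
		FILE *f;

		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(val, sizeof(val), f))
			printf("%-24s %s", de->d_name, val);	/* value ends in '\n' */
		fclose(f);
	}
	closedir(d);
	return 0;
}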
116
117static int __init pt_pmu_hw_init(void)
118{
119 struct dev_ext_attribute *de_attrs;
120 struct attribute **attrs;
121 size_t size;
122 int ret;
123 long i;
124
125 attrs = NULL;
126 ret = -ENODEV;
127 if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
128 goto fail;
129
130 for (i = 0; i < PT_CPUID_LEAVES; i++) {
131 cpuid_count(20, i,
132 &pt_pmu.caps[CR_EAX + i*4],
133 &pt_pmu.caps[CR_EBX + i*4],
134 &pt_pmu.caps[CR_ECX + i*4],
135 &pt_pmu.caps[CR_EDX + i*4]);
136 }
137
138 ret = -ENOMEM;
139 size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps)+1);
140 attrs = kzalloc(size, GFP_KERNEL);
141 if (!attrs)
142 goto fail;
143
144 size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps)+1);
145 de_attrs = kzalloc(size, GFP_KERNEL);
146 if (!de_attrs)
147 goto fail;
148
149 for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
150 struct dev_ext_attribute *de_attr = de_attrs + i;
151
152 de_attr->attr.attr.name = pt_caps[i].name;
153
154 sysfs_attr_init(&de_attr->attr.attr);
155
156 de_attr->attr.attr.mode = S_IRUGO;
157 de_attr->attr.show = pt_cap_show;
158 de_attr->var = (void *)i;
159
160 attrs[i] = &de_attr->attr.attr;
161 }
162
163 pt_cap_group.attrs = attrs;
164
165 return 0;
166
167fail:
168 kfree(attrs);
169
170 return ret;
171}
172
173#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
174
175static bool pt_event_valid(struct perf_event *event)
176{
177 u64 config = event->attr.config;
178
179 if ((config & PT_CONFIG_MASK) != config)
180 return false;
181
182 return true;
183}
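The check above rejects any config bits outside PT_CONFIG_MASK, i.e. anything other than the tsc (config:10) and noretcomp (config:11) format bits declared earlier. The arithmetic sketch below, not part of the patch, simply replays that test in user space with the bit positions taken from the format strings.

/* Illustration: which attr.config values pass the PT_CONFIG_MASK test. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t pt_config_mask = (1ULL << 10) | (1ULL << 11);
	const uint64_t configs[] = { 0, 1ULL << 10, 1ULL << 11, 1ULL << 12 };
	unsigned int i;

	for (i = 0; i < sizeof(configs) / sizeof(configs[0]); i++) {
		int valid = (configs[i] & pt_config_mask) == configs[i];

		printf("config=%#llx -> %s\n",
		       (unsigned long long)configs[i],
		       valid ? "valid" : "rejected");
	}
	return 0;
}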
184
185/*
186 * PT configuration helpers
187 * These all are cpu affine and operate on a local PT
188 */
189
190static bool pt_is_running(void)
191{
192 u64 ctl;
193
194 rdmsrl(MSR_IA32_RTIT_CTL, ctl);
195
196 return !!(ctl & RTIT_CTL_TRACEEN);
197}
198
199static void pt_config(struct perf_event *event)
200{
201 u64 reg;
202
203 reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
204
205 if (!event->attr.exclude_kernel)
206 reg |= RTIT_CTL_OS;
207 if (!event->attr.exclude_user)
208 reg |= RTIT_CTL_USR;
209
210 reg |= (event->attr.config & PT_CONFIG_MASK);
211
212 wrmsrl(MSR_IA32_RTIT_CTL, reg);
213}
214
215static void pt_config_start(bool start)
216{
217 u64 ctl;
218
219 rdmsrl(MSR_IA32_RTIT_CTL, ctl);
220 if (start)
221 ctl |= RTIT_CTL_TRACEEN;
222 else
223 ctl &= ~RTIT_CTL_TRACEEN;
224 wrmsrl(MSR_IA32_RTIT_CTL, ctl);
225
226 /*
227 * A wrmsr that disables trace generation serializes other PT
228 * registers and causes all data packets to be written to memory,
229 * but a fence is required for the data to become globally visible.
230 *
231 * The below WMB, separating data store and aux_head store matches
232 * the consumer's RMB that separates aux_head load and data load.
233 */
234 if (!start)
235 wmb();
236}
237
238static void pt_config_buffer(void *buf, unsigned int topa_idx,
239 unsigned int output_off)
240{
241 u64 reg;
242
243 wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
244
245 reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
246
247 wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
248}
249
250/*
251 * Keep ToPA table-related metadata on the same page as the actual table,
252 * taking up a few words from the top
253 */
254
255#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
256
257/**
258 * struct topa - page-sized ToPA table with metadata at the top
259 * @table: actual ToPA table entries, as understood by PT hardware
260 * @list: linkage to struct pt_buffer's list of tables
261 * @phys: physical address of this page
262 * @offset: offset of the first entry in this table in the buffer
263 * @size: total size of all entries in this table
264 * @last: index of the last initialized entry in this table
265 */
266struct topa {
267 struct topa_entry table[TENTS_PER_PAGE];
268 struct list_head list;
269 u64 phys;
270 u64 offset;
271 size_t size;
272 int last;
273};
274
275/* make -1 stand for the last table entry */
276#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
277
278/**
279 * topa_alloc() - allocate page-sized ToPA table
280 * @cpu: CPU on which to allocate.
281 * @gfp: Allocation flags.
282 *
283 * Return: On success, return the pointer to ToPA table page.
284 */
285static struct topa *topa_alloc(int cpu, gfp_t gfp)
286{
287 int node = cpu_to_node(cpu);
288 struct topa *topa;
289 struct page *p;
290
291 p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
292 if (!p)
293 return NULL;
294
295 topa = page_address(p);
296 topa->last = 0;
297 topa->phys = page_to_phys(p);
298
299 /*
299 * In case of single-entry ToPA, always put the self-referencing END
301 * link as the 2nd entry in the table
302 */
303 if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
304 TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
305 TOPA_ENTRY(topa, 1)->end = 1;
306 }
307
308 return topa;
309}
310
311/**
312 * topa_free() - free a page-sized ToPA table
313 * @topa: Table to deallocate.
314 */
315static void topa_free(struct topa *topa)
316{
317 free_page((unsigned long)topa);
318}
319
320/**
321 * topa_insert_table() - insert a ToPA table into a buffer
322 * @buf: PT buffer that's being extended.
323 * @topa: New topa table to be inserted.
324 *
325 * If it's the first table in this buffer, set up buffer's pointers
326 * accordingly; otherwise, add an END=1 link entry pointing to @topa in the current
327 * "last" table and adjust the last table pointer to @topa.
328 */
329static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
330{
331 struct topa *last = buf->last;
332
333 list_add_tail(&topa->list, &buf->tables);
334
335 if (!buf->first) {
336 buf->first = buf->last = buf->cur = topa;
337 return;
338 }
339
340 topa->offset = last->offset + last->size;
341 buf->last = topa;
342
343 if (!pt_cap_get(PT_CAP_topa_multiple_entries))
344 return;
345
346 BUG_ON(last->last != TENTS_PER_PAGE - 1);
347
348 TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
349 TOPA_ENTRY(last, -1)->end = 1;
350}
351
352/**
353 * topa_table_full() - check if a ToPA table is filled up
354 * @topa: ToPA table.
355 */
356static bool topa_table_full(struct topa *topa)
357{
358 /* single-entry ToPA is a special case */
359 if (!pt_cap_get(PT_CAP_topa_multiple_entries))
360 return !!topa->last;
361
362 return topa->last == TENTS_PER_PAGE - 1;
363}
364
365/**
366 * topa_insert_pages() - create a list of ToPA tables
367 * @buf: PT buffer being initialized.
368 * @gfp: Allocation flags.
369 *
370 * This initializes a list of ToPA tables with entries from
371 * the data_pages provided by rb_alloc_aux().
372 *
373 * Return: 0 on success or error code.
374 */
375static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
376{
377 struct topa *topa = buf->last;
378 int order = 0;
379 struct page *p;
380
381 p = virt_to_page(buf->data_pages[buf->nr_pages]);
382 if (PagePrivate(p))
383 order = page_private(p);
384
385 if (topa_table_full(topa)) {
386 topa = topa_alloc(buf->cpu, gfp);
387 if (!topa)
388 return -ENOMEM;
389
390 topa_insert_table(buf, topa);
391 }
392
393 TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
394 TOPA_ENTRY(topa, -1)->size = order;
395 if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
396 TOPA_ENTRY(topa, -1)->intr = 1;
397 TOPA_ENTRY(topa, -1)->stop = 1;
398 }
399
400 topa->last++;
401 topa->size += sizes(order);
402
403 buf->nr_pages += 1ul << order;
404
405 return 0;
406}
407
408/**
409 * pt_topa_dump() - print ToPA tables and their entries
410 * @buf: PT buffer.
411 */
412static void pt_topa_dump(struct pt_buffer *buf)
413{
414 struct topa *topa;
415
416 list_for_each_entry(topa, &buf->tables, list) {
417 int i;
418
419 pr_debug("# table @%p (%016Lx), off %llx size %zx\n", topa->table,
420 topa->phys, topa->offset, topa->size);
421 for (i = 0; i < TENTS_PER_PAGE; i++) {
422 pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
423 &topa->table[i],
424 (unsigned long)topa->table[i].base << TOPA_SHIFT,
425 sizes(topa->table[i].size),
426 topa->table[i].end ? 'E' : ' ',
427 topa->table[i].intr ? 'I' : ' ',
428 topa->table[i].stop ? 'S' : ' ',
429 *(u64 *)&topa->table[i]);
430 if ((pt_cap_get(PT_CAP_topa_multiple_entries) &&
431 topa->table[i].stop) ||
432 topa->table[i].end)
433 break;
434 }
435 }
436}
437
438/**
439 * pt_buffer_advance() - advance to the next output region
440 * @buf: PT buffer.
441 *
442 * Advance the current pointers in the buffer to the next ToPA entry.
443 */
444static void pt_buffer_advance(struct pt_buffer *buf)
445{
446 buf->output_off = 0;
447 buf->cur_idx++;
448
449 if (buf->cur_idx == buf->cur->last) {
450 if (buf->cur == buf->last)
451 buf->cur = buf->first;
452 else
453 buf->cur = list_entry(buf->cur->list.next, struct topa,
454 list);
455 buf->cur_idx = 0;
456 }
457}
458
459/**
460 * pt_update_head() - calculate current offsets and sizes
461 * @pt: Per-cpu pt context.
462 *
463 * Update buffer's current write pointer position and data size.
464 */
465static void pt_update_head(struct pt *pt)
466{
467 struct pt_buffer *buf = perf_get_aux(&pt->handle);
468 u64 topa_idx, base, old;
469
470 /* offset of the first region in this table from the beginning of buf */
471 base = buf->cur->offset + buf->output_off;
472
473 /* offset of the current output region within this table */
474 for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
475 base += sizes(buf->cur->table[topa_idx].size);
476
477 if (buf->snapshot) {
478 local_set(&buf->data_size, base);
479 } else {
480 old = (local64_xchg(&buf->head, base) &
481 ((buf->nr_pages << PAGE_SHIFT) - 1));
482 if (base < old)
483 base += buf->nr_pages << PAGE_SHIFT;
484
485 local_add(base - old, &buf->data_size);
486 }
487}
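The non-snapshot branch above computes how much new data was produced even when the write position wrapped: the previous head is reduced modulo the buffer size and, if the new head compares lower, a full buffer length is added first. A user-space model of just that arithmetic (illustration only, not part of the patch):

/* Illustration: wrap-around accounting as done in pt_update_head(). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t new_data(uint64_t old_head, uint64_t new_head,
			 uint64_t buf_size)
{
	uint64_t old = old_head & (buf_size - 1);

	if (new_head < old)
		new_head += buf_size;
	return new_head - old;
}

int main(void)
{
	const uint64_t size = 1 << 16;	/* 64 KiB, power of two */

	assert(new_data(0, 4096, size) == 4096);
	/* writer wrapped: old head near the end, new head near the start */
	assert(new_data(size - 512, 512, size) == 1024);
	printf("wrap-around accounting checks out\n");
	return 0;
}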
488
489/**
490 * pt_buffer_region() - obtain current output region's address
491 * @buf: PT buffer.
492 */
493static void *pt_buffer_region(struct pt_buffer *buf)
494{
495 return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
496}
497
498/**
499 * pt_buffer_region_size() - obtain current output region's size
500 * @buf: PT buffer.
501 */
502static size_t pt_buffer_region_size(struct pt_buffer *buf)
503{
504 return sizes(buf->cur->table[buf->cur_idx].size);
505}
506
507/**
508 * pt_handle_status() - take care of possible status conditions
509 * @pt: Per-cpu pt context.
510 */
511static void pt_handle_status(struct pt *pt)
512{
513 struct pt_buffer *buf = perf_get_aux(&pt->handle);
514 int advance = 0;
515 u64 status;
516
517 rdmsrl(MSR_IA32_RTIT_STATUS, status);
518
519 if (status & RTIT_STATUS_ERROR) {
520 pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
521 pt_topa_dump(buf);
522 status &= ~RTIT_STATUS_ERROR;
523 }
524
525 if (status & RTIT_STATUS_STOPPED) {
526 status &= ~RTIT_STATUS_STOPPED;
527
528 /*
529 * On systems that only do single-entry ToPA, hitting STOP
530 * means we are already losing data; need to let the decoder
531 * know.
532 */
533 if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
534 buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
535 local_inc(&buf->lost);
536 advance++;
537 }
538 }
539
540 /*
541 * Also, on single-entry ToPA implementations, the interrupt will come
542 * before the output reaches its output region's boundary.
543 */
544 if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
545 pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
546 void *head = pt_buffer_region(buf);
547
548 /* everything within this margin needs to be zeroed out */
549 memset(head + buf->output_off, 0,
550 pt_buffer_region_size(buf) -
551 buf->output_off);
552 advance++;
553 }
554
555 if (advance)
556 pt_buffer_advance(buf);
557
558 wrmsrl(MSR_IA32_RTIT_STATUS, status);
559}
560
561/**
562 * pt_read_offset() - translate registers into buffer pointers
563 * @buf: PT buffer.
564 *
565 * Set buffer's output pointers from MSR values.
566 */
567static void pt_read_offset(struct pt_buffer *buf)
568{
569 u64 offset, base_topa;
570
571 rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
572 buf->cur = phys_to_virt(base_topa);
573
574 rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
575 /* offset within current output region */
576 buf->output_off = offset >> 32;
577 /* index of current output region within this table */
578 buf->cur_idx = (offset & 0xffffff80) >> 7;
579}
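pt_read_offset() above is the inverse of the packing done in pt_config_buffer(): bits 63:32 of RTIT_OUTPUT_MASK carry the offset within the current region, bits 31:7 the ToPA entry index. A small round-trip check of that encoding, provided purely as an illustration:

/* Illustration: OUTPUT_MASK pack (pt_config_buffer) vs. unpack (pt_read_offset). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t idx, off;

	for (idx = 0; idx < 512; idx++) {
		for (off = 0; off < 4096; off += 64) {
			uint64_t reg = 0x7f | (idx << 7) | (off << 32);
			uint64_t out_off = reg >> 32;
			uint64_t out_idx = (reg & 0xffffff80) >> 7;

			assert(out_off == off && out_idx == idx);
		}
	}
	printf("OUTPUT_MASK pack/unpack round-trips\n");
	return 0;
}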
580
581/**
582 * pt_topa_next_entry() - obtain index of the first page in the next ToPA entry
583 * @buf: PT buffer.
584 * @pg: Page offset in the buffer.
585 *
586 * When advancing to the next output region (ToPA entry), given a page offset
587 * into the buffer, we need to find the offset of the first page in the next
588 * region.
589 */
590static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
591{
592 struct topa_entry *te = buf->topa_index[pg];
593
594 /* one region */
595 if (buf->first == buf->last && buf->first->last == 1)
596 return pg;
597
598 do {
599 pg++;
600 pg &= buf->nr_pages - 1;
601 } while (buf->topa_index[pg] == te);
602
603 return pg;
604}
605
606/**
607 * pt_buffer_reset_markers() - place interrupt and stop bits in the buffer
608 * @buf: PT buffer.
609 * @handle: Current output handle.
610 *
611 * Place INT and STOP marks to prevent overwriting old data that the consumer
612 * hasn't yet collected.
613 */
614static int pt_buffer_reset_markers(struct pt_buffer *buf,
615 struct perf_output_handle *handle)
616
617{
618 unsigned long idx, npages, end;
619
620 if (buf->snapshot)
621 return 0;
622
623 /* can't stop in the middle of an output region */
624 if (buf->output_off + handle->size + 1 <
625 sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size))
626 return -EINVAL;
627
628
629 /* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
630 if (!pt_cap_get(PT_CAP_topa_multiple_entries))
631 return 0;
632
633 /* clear STOP and INT from current entry */
634 buf->topa_index[buf->stop_pos]->stop = 0;
635 buf->topa_index[buf->intr_pos]->intr = 0;
636
637 if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
638 npages = (handle->size + 1) >> PAGE_SHIFT;
639 end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
640 /*if (end > handle->wakeup >> PAGE_SHIFT)
641 end = handle->wakeup >> PAGE_SHIFT;*/
642 idx = end & (buf->nr_pages - 1);
643 buf->stop_pos = idx;
644 idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages - 1;
645 idx &= buf->nr_pages - 1;
646 buf->intr_pos = idx;
647 }
648
649 buf->topa_index[buf->stop_pos]->stop = 1;
650 buf->topa_index[buf->intr_pos]->intr = 1;
651
652 return 0;
653}
654
655/**
656 * pt_buffer_setup_topa_index() - build topa_index[] table of regions
657 * @buf: PT buffer.
658 *
659 * topa_index[] references output regions indexed by offset into the
660 * buffer for purposes of quick reverse lookup.
661 */
662static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
663{
664 struct topa *cur = buf->first, *prev = buf->last;
665 struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
666 *te_prev = TOPA_ENTRY(prev, prev->last - 1);
667 int pg = 0, idx = 0, ntopa = 0;
668
669 while (pg < buf->nr_pages) {
670 int tidx;
671
672 /* pages within one topa entry */
673 for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
674 buf->topa_index[pg] = te_prev;
675
676 te_prev = te_cur;
677
678 if (idx == cur->last - 1) {
679 /* advance to next topa table */
680 idx = 0;
681 cur = list_entry(cur->list.next, struct topa, list);
682 ntopa++;
683 } else
684 idx++;
685 te_cur = TOPA_ENTRY(cur, idx);
686 }
687
688}
689
690/**
691 * pt_buffer_reset_offsets() - adjust buffer's write pointers from aux_head
692 * @buf: PT buffer.
693 * @head: Write pointer (aux_head) from AUX buffer.
694 *
695 * Find the ToPA table and entry corresponding to given @head and set buffer's
696 * "current" pointers accordingly.
697 */
698static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
699{
700 int pg;
701
702 if (buf->snapshot)
703 head &= (buf->nr_pages << PAGE_SHIFT) - 1;
704
705 pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
706 pg = pt_topa_next_entry(buf, pg);
707
708 buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
709 buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
710 (unsigned long)buf->cur) / sizeof(struct topa_entry);
711 buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
712
713 local64_set(&buf->head, head);
714 local_set(&buf->data_size, 0);
715}
716
717/**
718 * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
719 * @buf: PT buffer.
720 */
721static void pt_buffer_fini_topa(struct pt_buffer *buf)
722{
723 struct topa *topa, *iter;
724
725 list_for_each_entry_safe(topa, iter, &buf->tables, list) {
726 /*
727 * right now, this is in free_aux() path only, so
728 * no need to unlink this table from the list
729 */
730 topa_free(topa);
731 }
732}
733
734/**
735 * pt_buffer_init_topa() - initialize ToPA table for pt buffer
736 * @buf: PT buffer.
737 * @size: Total size of all regions within this ToPA.
738 * @gfp: Allocation flags.
739 */
740static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
741 gfp_t gfp)
742{
743 struct topa *topa;
744 int err;
745
746 topa = topa_alloc(buf->cpu, gfp);
747 if (!topa)
748 return -ENOMEM;
749
750 topa_insert_table(buf, topa);
751
752 while (buf->nr_pages < nr_pages) {
753 err = topa_insert_pages(buf, gfp);
754 if (err) {
755 pt_buffer_fini_topa(buf);
756 return -ENOMEM;
757 }
758 }
759
760 pt_buffer_setup_topa_index(buf);
761
762 /* link last table to the first one, unless we're double buffering */
763 if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
764 TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
765 TOPA_ENTRY(buf->last, -1)->end = 1;
766 }
767
768 pt_topa_dump(buf);
769 return 0;
770}
771
772/**
773 * pt_buffer_setup_aux() - set up topa tables for a PT buffer
774 * @cpu: CPU on which to allocate, -1 means current.
775 * @pages: Array of pointers to buffer pages passed from perf core.
776 * @nr_pages: Number of pages in the buffer.
777 * @snapshot: If this is a snapshot/overwrite counter.
778 *
779 * This is a pmu::setup_aux callback that sets up ToPA tables and all the
780 * bookkeeping for an AUX buffer.
781 *
782 * Return: Our private PT buffer structure.
783 */
784static void *
785pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
786{
787 struct pt_buffer *buf;
788 int node, ret;
789
790 if (!nr_pages)
791 return NULL;
792
793 if (cpu == -1)
794 cpu = raw_smp_processor_id();
795 node = cpu_to_node(cpu);
796
797 buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
798 GFP_KERNEL, node);
799 if (!buf)
800 return NULL;
801
802 buf->cpu = cpu;
803 buf->snapshot = snapshot;
804 buf->data_pages = pages;
805
806 INIT_LIST_HEAD(&buf->tables);
807
808 ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
809 if (ret) {
810 kfree(buf);
811 return NULL;
812 }
813
814 return buf;
815}
816
817/**
818 * pt_buffer_free_aux() - perf AUX deallocation path callback
819 * @data: PT buffer.
820 */
821static void pt_buffer_free_aux(void *data)
822{
823 struct pt_buffer *buf = data;
824
825 pt_buffer_fini_topa(buf);
826 kfree(buf);
827}
828
829/**
830 * pt_buffer_is_full() - check if the buffer is full
831 * @buf: PT buffer.
832 * @pt: Per-cpu pt handle.
833 *
834 * If the user hasn't read data from the output region that aux_head
835 * points to, the buffer is considered full: the user needs to read at
836 * least this region and update aux_tail to point past it.
837 */
838static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
839{
840 if (buf->snapshot)
841 return false;
842
843 if (local_read(&buf->data_size) >= pt->handle.size)
844 return true;
845
846 return false;
847}
848
849/**
850 * intel_pt_interrupt() - PT PMI handler
851 */
852void intel_pt_interrupt(void)
853{
854 struct pt *pt = this_cpu_ptr(&pt_ctx);
855 struct pt_buffer *buf;
856 struct perf_event *event = pt->handle.event;
857
858 /*
859 * There may be a dangling PT bit in the interrupt status register
860 * after PT has been disabled by pt_event_stop(). Make sure we don't
861 * do anything (particularly, re-enable) for this event here.
862 */
863 if (!ACCESS_ONCE(pt->handle_nmi))
864 return;
865
866 pt_config_start(false);
867
868 if (!event)
869 return;
870
871 buf = perf_get_aux(&pt->handle);
872 if (!buf)
873 return;
874
875 pt_read_offset(buf);
876
877 pt_handle_status(pt);
878
879 pt_update_head(pt);
880
881 perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
882 local_xchg(&buf->lost, 0));
883
884 if (!event->hw.state) {
885 int ret;
886
887 buf = perf_aux_output_begin(&pt->handle, event);
888 if (!buf) {
889 event->hw.state = PERF_HES_STOPPED;
890 return;
891 }
892
893 pt_buffer_reset_offsets(buf, pt->handle.head);
894 ret = pt_buffer_reset_markers(buf, &pt->handle);
895 if (ret) {
896 perf_aux_output_end(&pt->handle, 0, true);
897 return;
898 }
899
900 pt_config_buffer(buf->cur->table, buf->cur_idx,
901 buf->output_off);
902 wrmsrl(MSR_IA32_RTIT_STATUS, 0);
903 pt_config(event);
904 }
905}
906
907/*
908 * PMU callbacks
909 */
910
911static void pt_event_start(struct perf_event *event, int mode)
912{
913 struct pt *pt = this_cpu_ptr(&pt_ctx);
914 struct pt_buffer *buf = perf_get_aux(&pt->handle);
915
916 if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt)) {
917 event->hw.state = PERF_HES_STOPPED;
918 return;
919 }
920
921 ACCESS_ONCE(pt->handle_nmi) = 1;
922 event->hw.state = 0;
923
924 pt_config_buffer(buf->cur->table, buf->cur_idx,
925 buf->output_off);
926 wrmsrl(MSR_IA32_RTIT_STATUS, 0);
927 pt_config(event);
928}
929
930static void pt_event_stop(struct perf_event *event, int mode)
931{
932 struct pt *pt = this_cpu_ptr(&pt_ctx);
933
934 /*
935 * Protect against the PMI racing with disabling wrmsr,
936 * see comment in intel_pt_interrupt().
937 */
938 ACCESS_ONCE(pt->handle_nmi) = 0;
939 pt_config_start(false);
940
941 if (event->hw.state == PERF_HES_STOPPED)
942 return;
943
944 event->hw.state = PERF_HES_STOPPED;
945
946 if (mode & PERF_EF_UPDATE) {
947 struct pt *pt = this_cpu_ptr(&pt_ctx);
948 struct pt_buffer *buf = perf_get_aux(&pt->handle);
949
950 if (!buf)
951 return;
952
953 if (WARN_ON_ONCE(pt->handle.event != event))
954 return;
955
956 pt_read_offset(buf);
957
958 pt_handle_status(pt);
959
960 pt_update_head(pt);
961 }
962}
963
964static void pt_event_del(struct perf_event *event, int mode)
965{
966 struct pt *pt = this_cpu_ptr(&pt_ctx);
967 struct pt_buffer *buf;
968
969 pt_event_stop(event, PERF_EF_UPDATE);
970
971 buf = perf_get_aux(&pt->handle);
972
973 if (buf) {
974 if (buf->snapshot)
975 pt->handle.head =
976 local_xchg(&buf->data_size,
977 buf->nr_pages << PAGE_SHIFT);
978 perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
979 local_xchg(&buf->lost, 0));
980 }
981}
982
983static int pt_event_add(struct perf_event *event, int mode)
984{
985 struct pt_buffer *buf;
986 struct pt *pt = this_cpu_ptr(&pt_ctx);
987 struct hw_perf_event *hwc = &event->hw;
988 int ret = -EBUSY;
989
990 if (pt->handle.event)
991 goto out;
992
993 buf = perf_aux_output_begin(&pt->handle, event);
994 if (!buf) {
995 ret = -EINVAL;
996 goto out;
997 }
998
999 pt_buffer_reset_offsets(buf, pt->handle.head);
1000 if (!buf->snapshot) {
1001 ret = pt_buffer_reset_markers(buf, &pt->handle);
1002 if (ret) {
1003 perf_aux_output_end(&pt->handle, 0, true);
1004 goto out;
1005 }
1006 }
1007
1008 if (mode & PERF_EF_START) {
1009 pt_event_start(event, 0);
1010 if (hwc->state == PERF_HES_STOPPED) {
1011 pt_event_del(event, 0);
1012 ret = -EBUSY;
1013 }
1014 } else {
1015 hwc->state = PERF_HES_STOPPED;
1016 }
1017
1018 ret = 0;
1019out:
1020
1021 if (ret)
1022 hwc->state = PERF_HES_STOPPED;
1023
1024 return ret;
1025}
1026
1027static void pt_event_read(struct perf_event *event)
1028{
1029}
1030
1031static void pt_event_destroy(struct perf_event *event)
1032{
1033 x86_del_exclusive(x86_lbr_exclusive_pt);
1034}
1035
1036static int pt_event_init(struct perf_event *event)
1037{
1038 if (event->attr.type != pt_pmu.pmu.type)
1039 return -ENOENT;
1040
1041 if (!pt_event_valid(event))
1042 return -EINVAL;
1043
1044 if (x86_add_exclusive(x86_lbr_exclusive_pt))
1045 return -EBUSY;
1046
1047 event->destroy = pt_event_destroy;
1048
1049 return 0;
1050}
1051
1052static __init int pt_init(void)
1053{
1054 int ret, cpu, prior_warn = 0;
1055
1056 BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
1057 get_online_cpus();
1058 for_each_online_cpu(cpu) {
1059 u64 ctl;
1060
1061 ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
1062 if (!ret && (ctl & RTIT_CTL_TRACEEN))
1063 prior_warn++;
1064 }
1065 put_online_cpus();
1066
1067 if (prior_warn) {
1068 x86_add_exclusive(x86_lbr_exclusive_pt);
1069 pr_warn("PT is enabled at boot time, doing nothing\n");
1070
1071 return -EBUSY;
1072 }
1073
1074 ret = pt_pmu_hw_init();
1075 if (ret)
1076 return ret;
1077
1078 if (!pt_cap_get(PT_CAP_topa_output)) {
1079 pr_warn("ToPA output is not supported on this CPU\n");
1080 return -ENODEV;
1081 }
1082
1083 if (!pt_cap_get(PT_CAP_topa_multiple_entries))
1084 pt_pmu.pmu.capabilities =
1085 PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
1086
1087 pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
1088 pt_pmu.pmu.attr_groups = pt_attr_groups;
1089 pt_pmu.pmu.task_ctx_nr = perf_sw_context;
1090 pt_pmu.pmu.event_init = pt_event_init;
1091 pt_pmu.pmu.add = pt_event_add;
1092 pt_pmu.pmu.del = pt_event_del;
1093 pt_pmu.pmu.start = pt_event_start;
1094 pt_pmu.pmu.stop = pt_event_stop;
1095 pt_pmu.pmu.read = pt_event_read;
1096 pt_pmu.pmu.setup_aux = pt_buffer_setup_aux;
1097 pt_pmu.pmu.free_aux = pt_buffer_free_aux;
1098 ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
1099
1100 return ret;
1101}
1102
1103module_init(pt_init);
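Once the PMU above is registered, user space can open an intel_pt event by feeding the dynamic PMU type from sysfs into perf_event_attr.type. The sketch below is an illustration only; it assumes the standard per-PMU "type" sysfs file and leaves out mapping and draining the AUX area entirely.

/* Illustration: open an intel_pt event using the dynamic PMU type. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	unsigned int type = 0;
	int fd;
	FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");

	if (!f || fscanf(f, "%u", &type) != 1) {
		fprintf(stderr, "intel_pt PMU not available\n");
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.exclude_kernel = 1;	/* trace user space only */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	close(fd);
	return 0;
}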
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c
index 21af6149edf2..12d9548457e7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c
@@ -1132,8 +1132,7 @@ static int snbep_pci2phy_map_init(int devid)
1132 } 1132 }
1133 } 1133 }
1134 1134
1135 if (ubox_dev) 1135 pci_dev_put(ubox_dev);
1136 pci_dev_put(ubox_dev);
1137 1136
1138 return err ? pcibios_err_to_errno(err) : 0; 1137 return err ? pcibios_err_to_errno(err) : 0;
1139} 1138}
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 60639093d536..3d423a101fae 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -41,6 +41,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
41 { X86_FEATURE_HWP_ACT_WINDOW, CR_EAX, 9, 0x00000006, 0 }, 41 { X86_FEATURE_HWP_ACT_WINDOW, CR_EAX, 9, 0x00000006, 0 },
42 { X86_FEATURE_HWP_EPP, CR_EAX,10, 0x00000006, 0 }, 42 { X86_FEATURE_HWP_EPP, CR_EAX,10, 0x00000006, 0 },
43 { X86_FEATURE_HWP_PKG_REQ, CR_EAX,11, 0x00000006, 0 }, 43 { X86_FEATURE_HWP_PKG_REQ, CR_EAX,11, 0x00000006, 0 },
44 { X86_FEATURE_INTEL_PT, CR_EBX,25, 0x00000007, 0 },
44 { X86_FEATURE_APERFMPERF, CR_ECX, 0, 0x00000006, 0 }, 45 { X86_FEATURE_APERFMPERF, CR_ECX, 0, 0x00000006, 0 },
45 { X86_FEATURE_EPB, CR_ECX, 3, 0x00000006, 0 }, 46 { X86_FEATURE_EPB, CR_ECX, 3, 0x00000006, 0 },
46 { X86_FEATURE_HW_PSTATE, CR_EDX, 7, 0x80000007, 0 }, 47 { X86_FEATURE_HW_PSTATE, CR_EDX, 7, 0x80000007, 0 },
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 24d079604fd5..1deffe6cc873 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -354,6 +354,7 @@ int __copy_instruction(u8 *dest, u8 *src)
354{ 354{
355 struct insn insn; 355 struct insn insn;
356 kprobe_opcode_t buf[MAX_INSN_SIZE]; 356 kprobe_opcode_t buf[MAX_INSN_SIZE];
357 int length;
357 unsigned long recovered_insn = 358 unsigned long recovered_insn =
358 recover_probed_instruction(buf, (unsigned long)src); 359 recover_probed_instruction(buf, (unsigned long)src);
359 360
@@ -361,16 +362,18 @@ int __copy_instruction(u8 *dest, u8 *src)
361 return 0; 362 return 0;
362 kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE); 363 kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
363 insn_get_length(&insn); 364 insn_get_length(&insn);
365 length = insn.length;
366
364 /* Another subsystem puts a breakpoint, failed to recover */ 367 /* Another subsystem puts a breakpoint, failed to recover */
365 if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) 368 if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION)
366 return 0; 369 return 0;
367 memcpy(dest, insn.kaddr, insn.length); 370 memcpy(dest, insn.kaddr, length);
368 371
369#ifdef CONFIG_X86_64 372#ifdef CONFIG_X86_64
370 if (insn_rip_relative(&insn)) { 373 if (insn_rip_relative(&insn)) {
371 s64 newdisp; 374 s64 newdisp;
372 u8 *disp; 375 u8 *disp;
373 kernel_insn_init(&insn, dest, insn.length); 376 kernel_insn_init(&insn, dest, length);
374 insn_get_displacement(&insn); 377 insn_get_displacement(&insn);
375 /* 378 /*
376 * The copied instruction uses the %rip-relative addressing 379 * The copied instruction uses the %rip-relative addressing
@@ -394,7 +397,7 @@ int __copy_instruction(u8 *dest, u8 *src)
394 *(s32 *) disp = (s32) newdisp; 397 *(s32 *) disp = (s32) newdisp;
395 } 398 }
396#endif 399#endif
397 return insn.length; 400 return length;
398} 401}
399 402
400static int arch_copy_kprobe(struct kprobe *p) 403static int arch_copy_kprobe(struct kprobe *p)
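The kprobes hunk above caches insn.length before the buffer is handed to kernel_insn_init() a second time, because the re-initialization wipes the previously decoded length. A user-space analogue of that hazard, with a made-up toy decoder standing in for the kernel's insn code (illustration only):

/* Illustration: save a decoded field before re-initializing the decoder. */
#include <assert.h>
#include <string.h>

struct toy_insn {
	const unsigned char *kaddr;
	int length;			/* filled in by toy_get_length() */
};

static void toy_insn_init(struct toy_insn *in, const unsigned char *buf)
{
	memset(in, 0, sizeof(*in));	/* wipes ->length as well */
	in->kaddr = buf;
}

static void toy_get_length(struct toy_insn *in)
{
	/* pretend the first byte encodes the instruction length */
	in->length = in->kaddr[0];
}

int main(void)
{
	unsigned char src[] = { 3, 0x90, 0x90 }, dest[8];
	struct toy_insn in;
	int length;

	toy_insn_init(&in, src);
	toy_get_length(&in);
	length = in.length;		/* save it before the re-init below */

	memcpy(dest, in.kaddr, length);
	toy_insn_init(&in, dest);	/* second pass: ->length is 0 again */

	assert(length == 3 && in.length == 0);
	return 0;
}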