author     Johannes Weiner <hannes@cmpxchg.org>            2018-10-26 18:06:27 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2018-10-26 19:26:32 -0400
commit     eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 (patch)
tree       69e37010954e597b404709ecd9a11b9f7373cf0f
parent     246b3b3342c9b0a2e24cda2178be87bc36e1c874 (diff)
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly what impact this has on workload productivity, or how
close the system is to lockups and OOM kills.  In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risking complete machine lockups, this patch implements a
way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively.  Stall states are aggregate versions of the per-task
delay accounting delays:

	cpu: some tasks are runnable but not executing on a CPU
	memory: tasks are reclaiming, or waiting for swapin or thrashing cache
	io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
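
As an illustration of the sampling and aggregation described above, the
following standalone sketch (plain userspace C; the struct and function
names are invented for this example and are not part of the patch)
combines per-CPU stall samples into a system-wide percentage, weighting
each CPU by its non-idle time:

    /*
     * Illustrative only: a userspace model of the per-CPU weighting the
     * commit message describes. Nothing here is kernel code; all names
     * are made up for the example.
     */
    #include <stdio.h>
    #include <stdint.h>

    struct cpu_sample {
        uint64_t some_ns;    /* time with at least one task stalled */
        uint64_t nonidle_ns; /* time the CPU had any non-idle task */
    };

    /* Average the per-CPU "some" times, weighted by non-idle time. */
    static double some_pressure_pct(const struct cpu_sample *s, int ncpu,
                                    uint64_t period_ns)
    {
        uint64_t weighted = 0, nonidle = 0;

        for (int i = 0; i < ncpu; i++) {
            weighted += s[i].some_ns * s[i].nonidle_ns;
            nonidle  += s[i].nonidle_ns;
        }
        if (!nonidle)
            return 0.0; /* a fully idle period contributes no pressure */
        return 100.0 * ((double)weighted / nonidle) / period_ns;
    }

    int main(void)
    {
        /* One CPU stalled for 1s of a 2s busy window, one CPU idle. */
        struct cpu_sample s[2] = {
            { .some_ns = 1000000000ULL, .nonidle_ns = 2000000000ULL },
            { .some_ns = 0,             .nonidle_ns = 0 },
        };

        printf("some=%.1f%%\n", some_pressure_pct(s, 2, 2000000000ULL));
        return 0;
    }

With one CPU stalled for half of a 2-second window and a second CPU fully
idle, the idle CPU carries zero weight and the reported pressure stays at
50% instead of being diluted to 25%.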
-rw-r--r--   Documentation/accounting/psi.txt |  64
-rw-r--r--   include/linux/psi.h              |  28
-rw-r--r--   include/linux/psi_types.h        |  92
-rw-r--r--   include/linux/sched.h            |  10
-rw-r--r--   init/Kconfig                     |  15
-rw-r--r--   kernel/fork.c                    |   4
-rw-r--r--   kernel/sched/Makefile            |   1
-rw-r--r--   kernel/sched/core.c              |  12
-rw-r--r--   kernel/sched/psi.c               | 657
-rw-r--r--   kernel/sched/sched.h             |   2
-rw-r--r--   kernel/sched/stats.h             |  86
-rw-r--r--   mm/compaction.c                  |   5
-rw-r--r--   mm/filemap.c                     |  15
-rw-r--r--   mm/page_alloc.c                  |   9
-rw-r--r--   mm/vmscan.c                      |   9
15 files changed, 1003 insertions(+), 6 deletions(-)
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..3753a82f1cf5
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
1================================
2PSI - Pressure Stall Information
3================================
4
5:Date: April, 2018
6:Author: Johannes Weiner <hannes@cmpxchg.org>
7
8When CPU, memory or IO devices are contended, workloads experience
9latency spikes, throughput losses, and run the risk of OOM kills.
10
11Without an accurate measure of such contention, users are forced to
12either play it safe and under-utilize their hardware resources, or
13roll the dice and frequently suffer the disruptions resulting from
14excessive overcommit.
15
16The psi feature identifies and quantifies the disruptions caused by
17such resource crunches and the time impact they have on complex workloads
18or even entire systems.
19
20Having an accurate measure of productivity losses caused by resource
21scarcity aids users in sizing workloads to hardware--or provisioning
22hardware according to workload demand.
23
24As psi aggregates this information in realtime, systems can be managed
25dynamically using techniques such as load shedding, migrating jobs to
26other systems or data centers, or strategically pausing or killing low
27priority or restartable batch jobs.
28
29This allows maximizing hardware utilization without sacrificing
30workload health or risking major disruptions such as OOM kills.
31
32Pressure interface
33==================
34
35Pressure information for each resource is exported through the
36respective file in /proc/pressure/ -- cpu, memory, and io.
37
38The format for CPU is as such:
39
40some avg10=0.00 avg60=0.00 avg300=0.00 total=0
41
42and for memory and IO:
43
44some avg10=0.00 avg60=0.00 avg300=0.00 total=0
45full avg10=0.00 avg60=0.00 avg300=0.00 total=0
46
47The "some" line indicates the share of time in which at least some
48tasks are stalled on a given resource.
49
50The "full" line indicates the share of time in which all non-idle
51tasks are stalled on a given resource simultaneously. In this state
52actual CPU cycles are going to waste, and a workload that spends
53extended time in this state is considered to be thrashing. This has
54severe impact on performance, and it's useful to distinguish this
55situation from a state where some tasks are stalled but the CPU is
56still doing productive work. As such, time spent in this subset of the
57stall state is tracked separately and exported in the "full" averages.
58
59The ratios are tracked as recent trends over ten, sixty, and three
60hundred second windows, which gives insight into short term events as
61well as medium and long term trends. The total absolute stall time is
62tracked and exported as well, to allow detection of latency spikes
63which wouldn't necessarily make a dent in the time averages, or to
64average trends over custom time frames.
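
For illustration only, a minimal userspace reader of these files might
look like the sketch below; it simply parses the line format documented
above (the kernel exports total= in microseconds, see psi_show() in
kernel/sched/psi.c further down):

    /* Sketch: parse one /proc/pressure file as documented above. */
    #include <stdio.h>

    int main(void)
    {
        char type[8];
        double avg10, avg60, avg300;
        unsigned long long total;
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        /* One line per state: "some ..." and, for memory and io, "full ..." */
        while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                      type, &avg10, &avg60, &avg300, &total) == 5)
            printf("%s: 10s=%.2f%% 60s=%.2f%% 300s=%.2f%% stalled=%lluus\n",
                   type, avg10, avg60, avg300, total);
        fclose(f);
        return 0;
    }

Diffing the total= counter between two reads is how the custom-window
averaging mentioned in the last paragraph would typically be done.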
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..b0daf050de58
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,28 @@
1#ifndef _LINUX_PSI_H
2#define _LINUX_PSI_H
3
4#include <linux/psi_types.h>
5#include <linux/sched.h>
6
7#ifdef CONFIG_PSI
8
9extern bool psi_disabled;
10
11void psi_init(void);
12
13void psi_task_change(struct task_struct *task, int clear, int set);
14
15void psi_memstall_tick(struct task_struct *task, int cpu);
16void psi_memstall_enter(unsigned long *flags);
17void psi_memstall_leave(unsigned long *flags);
18
19#else /* CONFIG_PSI */
20
21static inline void psi_init(void) {}
22
23static inline void psi_memstall_enter(unsigned long *flags) {}
24static inline void psi_memstall_leave(unsigned long *flags) {}
25
26#endif /* CONFIG_PSI */
27
28#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..2cf422db5d18
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,92 @@
1#ifndef _LINUX_PSI_TYPES_H
2#define _LINUX_PSI_TYPES_H
3
4#include <linux/seqlock.h>
5#include <linux/types.h>
6
7#ifdef CONFIG_PSI
8
9/* Tracked task states */
10enum psi_task_count {
11 NR_IOWAIT,
12 NR_MEMSTALL,
13 NR_RUNNING,
14 NR_PSI_TASK_COUNTS,
15};
16
17/* Task state bitmasks */
18#define TSK_IOWAIT (1 << NR_IOWAIT)
19#define TSK_MEMSTALL (1 << NR_MEMSTALL)
20#define TSK_RUNNING (1 << NR_RUNNING)
21
22/* Resources that workloads could be stalled on */
23enum psi_res {
24 PSI_IO,
25 PSI_MEM,
26 PSI_CPU,
27 NR_PSI_RESOURCES,
28};
29
30/*
31 * Pressure states for each resource:
32 *
33 * SOME: Stalled tasks & working tasks
34 * FULL: Stalled tasks & no working tasks
35 */
36enum psi_states {
37 PSI_IO_SOME,
38 PSI_IO_FULL,
39 PSI_MEM_SOME,
40 PSI_MEM_FULL,
41 PSI_CPU_SOME,
42 /* Only per-CPU, to weigh the CPU in the global average: */
43 PSI_NONIDLE,
44 NR_PSI_STATES,
45};
46
47struct psi_group_cpu {
48 /* 1st cacheline updated by the scheduler */
49
50 /* Aggregator needs to know of concurrent changes */
51 seqcount_t seq ____cacheline_aligned_in_smp;
52
53 /* States of the tasks belonging to this group */
54 unsigned int tasks[NR_PSI_TASK_COUNTS];
55
56 /* Period time sampling buckets for each state of interest (ns) */
57 u32 times[NR_PSI_STATES];
58
59 /* Time of last task change in this group (rq_clock) */
60 u64 state_start;
61
62 /* 2nd cacheline updated by the aggregator */
63
64 /* Delta detection against the sampling buckets */
65 u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
66};
67
68struct psi_group {
69 /* Protects data updated during an aggregation */
70 struct mutex stat_lock;
71
72 /* Per-cpu task state & time tracking */
73 struct psi_group_cpu __percpu *pcpu;
74
75 /* Periodic aggregation state */
76 u64 total_prev[NR_PSI_STATES - 1];
77 u64 last_update;
78 u64 next_update;
79 struct delayed_work clock_work;
80
81 /* Total stall times and sampled pressure averages */
82 u64 total[NR_PSI_STATES - 1];
83 unsigned long avg[NR_PSI_STATES - 1][3];
84};
85
86#else /* CONFIG_PSI */
87
88struct psi_group { };
89
90#endif /* CONFIG_PSI */
91
92#endif /* _LINUX_PSI_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index adfb3f9a7597..b8fcc6b3080c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
25#include <linux/latencytop.h> 25#include <linux/latencytop.h>
26#include <linux/sched/prio.h> 26#include <linux/sched/prio.h>
27#include <linux/signal_types.h> 27#include <linux/signal_types.h>
28#include <linux/psi_types.h>
28#include <linux/mm_types_task.h> 29#include <linux/mm_types_task.h>
29#include <linux/task_io_accounting.h> 30#include <linux/task_io_accounting.h>
30#include <linux/rseq.h> 31#include <linux/rseq.h>
@@ -706,6 +707,10 @@ struct task_struct {
706 unsigned sched_contributes_to_load:1; 707 unsigned sched_contributes_to_load:1;
707 unsigned sched_migrated:1; 708 unsigned sched_migrated:1;
708 unsigned sched_remote_wakeup:1; 709 unsigned sched_remote_wakeup:1;
710#ifdef CONFIG_PSI
711 unsigned sched_psi_wake_requeue:1;
712#endif
713
709 /* Force alignment to the next boundary: */ 714 /* Force alignment to the next boundary: */
710 unsigned :0; 715 unsigned :0;
711 716
@@ -965,6 +970,10 @@ struct task_struct {
965 kernel_siginfo_t *last_siginfo; 970 kernel_siginfo_t *last_siginfo;
966 971
967 struct task_io_accounting ioac; 972 struct task_io_accounting ioac;
973#ifdef CONFIG_PSI
974 /* Pressure stall state */
975 unsigned int psi_flags;
976#endif
968#ifdef CONFIG_TASK_XACCT 977#ifdef CONFIG_TASK_XACCT
969 /* Accumulated RSS usage: */ 978 /* Accumulated RSS usage: */
970 u64 acct_rss_mem1; 979 u64 acct_rss_mem1;
@@ -1391,6 +1400,7 @@ extern struct pid *cad_pid;
1391#define PF_KTHREAD 0x00200000 /* I am a kernel thread */ 1400#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
1392#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ 1401#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
1393#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ 1402#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
1403#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
1394#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ 1404#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
1395#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ 1405#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
1396#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ 1406#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/init/Kconfig b/init/Kconfig
index 317d5ccb5191..26e639df5517 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -490,6 +490,21 @@ config TASK_IO_ACCOUNTING
490 490
491 Say N if unsure. 491 Say N if unsure.
492 492
493config PSI
494 bool "Pressure stall information tracking"
495 help
496 Collect metrics that indicate how overcommitted the CPU, memory,
497 and IO capacity are in the system.
498
499 If you say Y here, the kernel will create /proc/pressure/ with the
500 pressure statistics files cpu, memory, and io. These will indicate
501 the share of walltime in which some or all tasks in the system are
502 delayed due to contention of the respective resource.
503
504 For more details see Documentation/accounting/psi.txt.
505
506 Say N if unsure.
507
493endmenu # "CPU/Task time and stats accounting" 508endmenu # "CPU/Task time and stats accounting"
494 509
495config CPU_ISOLATION 510config CPU_ISOLATION
diff --git a/kernel/fork.c b/kernel/fork.c
index 3c719fec46c5..8f82a3bdcb8f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1822,6 +1822,10 @@ static __latent_entropy struct task_struct *copy_process(
1822 1822
1823 p->default_timer_slack_ns = current->timer_slack_ns; 1823 p->default_timer_slack_ns = current->timer_slack_ns;
1824 1824
1825#ifdef CONFIG_PSI
1826 p->psi_flags = 0;
1827#endif
1828
1825 task_io_accounting_init(&p->ioac); 1829 task_io_accounting_init(&p->ioac);
1826 acct_clear_integrals(p); 1830 acct_clear_integrals(p);
1827 1831
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe183404c38..21fb5a5662b5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
29obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o 29obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
30obj-$(CONFIG_MEMBARRIER) += membarrier.o 30obj-$(CONFIG_MEMBARRIER) += membarrier.o
31obj-$(CONFIG_CPU_ISOLATION) += isolation.o 31obj-$(CONFIG_CPU_ISOLATION) += isolation.o
32obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3efef387797..fd2fce8a001b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -722,8 +722,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
722 if (!(flags & ENQUEUE_NOCLOCK)) 722 if (!(flags & ENQUEUE_NOCLOCK))
723 update_rq_clock(rq); 723 update_rq_clock(rq);
724 724
725 if (!(flags & ENQUEUE_RESTORE)) 725 if (!(flags & ENQUEUE_RESTORE)) {
726 sched_info_queued(rq, p); 726 sched_info_queued(rq, p);
727 psi_enqueue(p, flags & ENQUEUE_WAKEUP);
728 }
727 729
728 p->sched_class->enqueue_task(rq, p, flags); 730 p->sched_class->enqueue_task(rq, p, flags);
729} 731}
@@ -733,8 +735,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
733 if (!(flags & DEQUEUE_NOCLOCK)) 735 if (!(flags & DEQUEUE_NOCLOCK))
734 update_rq_clock(rq); 736 update_rq_clock(rq);
735 737
736 if (!(flags & DEQUEUE_SAVE)) 738 if (!(flags & DEQUEUE_SAVE)) {
737 sched_info_dequeued(rq, p); 739 sched_info_dequeued(rq, p);
740 psi_dequeue(p, flags & DEQUEUE_SLEEP);
741 }
738 742
739 p->sched_class->dequeue_task(rq, p, flags); 743 p->sched_class->dequeue_task(rq, p, flags);
740} 744}
@@ -2037,6 +2041,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
2037 cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); 2041 cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
2038 if (task_cpu(p) != cpu) { 2042 if (task_cpu(p) != cpu) {
2039 wake_flags |= WF_MIGRATED; 2043 wake_flags |= WF_MIGRATED;
2044 psi_ttwu_dequeue(p);
2040 set_task_cpu(p, cpu); 2045 set_task_cpu(p, cpu);
2041 } 2046 }
2042 2047
@@ -3051,6 +3056,7 @@ void scheduler_tick(void)
3051 curr->sched_class->task_tick(rq, curr, 0); 3056 curr->sched_class->task_tick(rq, curr, 0);
3052 cpu_load_update_active(rq); 3057 cpu_load_update_active(rq);
3053 calc_global_load_tick(rq); 3058 calc_global_load_tick(rq);
3059 psi_task_tick(rq);
3054 3060
3055 rq_unlock(rq, &rf); 3061 rq_unlock(rq, &rf);
3056 3062
@@ -6067,6 +6073,8 @@ void __init sched_init(void)
6067 6073
6068 init_schedstats(); 6074 init_schedstats();
6069 6075
6076 psi_init();
6077
6070 scheduler_running = 1; 6078 scheduler_running = 1;
6071} 6079}
6072 6080
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..595414599b98
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,657 @@
1/*
2 * Pressure stall information for CPU, memory and IO
3 *
4 * Copyright (c) 2018 Facebook, Inc.
5 * Author: Johannes Weiner <hannes@cmpxchg.org>
6 *
7 * When CPU, memory and IO are contended, tasks experience delays that
8 * reduce throughput and introduce latencies into the workload. Memory
9 * and IO contention, in addition, can cause a full loss of forward
10 * progress in which the CPU goes idle.
11 *
12 * This code aggregates individual task delays into resource pressure
13 * metrics that indicate problems with both workload health and
14 * resource utilization.
15 *
16 * Model
17 *
18 * The time in which a task can execute on a CPU is our baseline for
19 * productivity. Pressure expresses the amount of time in which this
20 * potential cannot be realized due to resource contention.
21 *
22 * This concept of productivity has two components: the workload and
23 * the CPU. To measure the impact of pressure on both, we define two
24 * contention states for a resource: SOME and FULL.
25 *
26 * In the SOME state of a given resource, one or more tasks are
27 * delayed on that resource. This affects the workload's ability to
28 * perform work, but the CPU may still be executing other tasks.
29 *
30 * In the FULL state of a given resource, all non-idle tasks are
31 * delayed on that resource such that nobody is advancing and the CPU
32 * goes idle. This leaves both workload and CPU unproductive.
33 *
34 * (Naturally, the FULL state doesn't exist for the CPU resource.)
35 *
36 * SOME = nr_delayed_tasks != 0
37 * FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
38 *
39 * The percentage of wallclock time spent in those compound stall
40 * states gives pressure numbers between 0 and 100 for each resource,
41 * where the SOME percentage indicates workload slowdowns and the FULL
42 * percentage indicates reduced CPU utilization:
43 *
44 * %SOME = time(SOME) / period
45 * %FULL = time(FULL) / period
46 *
47 * Multiple CPUs
48 *
49 * The more tasks and available CPUs there are, the more work can be
50 * performed concurrently. This means that the potential that can go
51 * unrealized due to resource contention *also* scales with non-idle
52 * tasks and CPUs.
53 *
54 * Consider a scenario where 257 number crunching tasks are trying to
55 * run concurrently on 256 CPUs. If we simply aggregated the task
56 * states, we would have to conclude a CPU SOME pressure number of
57 * 100%, since *somebody* is waiting on a runqueue at all
58 * times. However, that is clearly not the amount of contention the
59 * workload is experiencing: only one out of 256 possible execution
60 * threads will be contended at any given time, or about 0.4%.
61 *
62 * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
63 * given time *one* of the tasks is delayed due to a lack of memory.
64 * Again, looking purely at the task state would yield a memory FULL
65 * pressure number of 0%, since *somebody* is always making forward
66 * progress. But again this wouldn't capture the amount of execution
67 * potential lost, which is 1 out of 4 CPUs, or 25%.
68 *
69 * To calculate wasted potential (pressure) with multiple processors,
70 * we have to base our calculation on the number of non-idle tasks in
71 * conjunction with the number of available CPUs, which is the number
72 * of potential execution threads. SOME then becomes the proportion of
73 * delayed tasks to possible threads, and FULL is the share of possible
74 * threads that are unproductive due to delays:
75 *
76 * threads = min(nr_nonidle_tasks, nr_cpus)
77 * SOME = min(nr_delayed_tasks / threads, 1)
78 * FULL = (threads - min(nr_running_tasks, threads)) / threads
79 *
80 * For the 257 number crunchers on 256 CPUs, this yields:
81 *
82 * threads = min(257, 256)
83 * SOME = min(1 / 256, 1) = 0.4%
84 * FULL = (256 - min(257, 256)) / 256 = 0%
85 *
86 * For the 1 out of 4 memory-delayed tasks, this yields:
87 *
88 * threads = min(4, 4)
89 * SOME = min(1 / 4, 1) = 25%
90 * FULL = (4 - min(3, 4)) / 4 = 25%
91 *
92 * [ Substitute nr_cpus with 1, and you can see that it's a natural
93 * extension of the single-CPU model. ]
94 *
95 * Implementation
96 *
97 * To assess the precise time spent in each such state, we would have
98 * to freeze the system on task changes and start/stop the state
99 * clocks accordingly. Obviously that doesn't scale in practice.
100 *
101 * Because the scheduler aims to distribute the compute load evenly
102 * among the available CPUs, we can track task state locally to each
103 * CPU and, at much lower frequency, extrapolate the global state for
104 * the cumulative stall times and the running averages.
105 *
106 * For each runqueue, we track:
107 *
108 * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
109 * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
110 * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
111 *
112 * and then periodically aggregate:
113 *
114 * tNONIDLE = sum(tNONIDLE[i])
115 *
116 * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
117 * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
118 *
119 * %SOME = tSOME / period
120 * %FULL = tFULL / period
121 *
122 * This gives us an approximation of pressure that is practical
123 * cost-wise, yet way more sensitive and accurate than periodic
124 * sampling of the aggregate task states would be.
125 */
126
127#include <linux/sched/loadavg.h>
128#include <linux/seq_file.h>
129#include <linux/proc_fs.h>
130#include <linux/seqlock.h>
131#include <linux/cgroup.h>
132#include <linux/module.h>
133#include <linux/sched.h>
134#include <linux/psi.h>
135#include "sched.h"
136
137static int psi_bug __read_mostly;
138
139bool psi_disabled __read_mostly;
140core_param(psi_disabled, psi_disabled, bool, 0644);
141
142/* Running averages - we need to be higher-res than loadavg */
143#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
144#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
145#define EXP_60s 1981 /* 1/exp(2s/60s) */
146#define EXP_300s 2034 /* 1/exp(2s/300s) */
147
148/* Sampling frequency in nanoseconds */
149static u64 psi_period __read_mostly;
150
151/* System-level pressure and stall tracking */
152static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
153static struct psi_group psi_system = {
154 .pcpu = &system_group_pcpu,
155};
156
157static void psi_update_work(struct work_struct *work);
158
159static void group_init(struct psi_group *group)
160{
161 int cpu;
162
163 for_each_possible_cpu(cpu)
164 seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
165 group->next_update = sched_clock() + psi_period;
166 INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
167 mutex_init(&group->stat_lock);
168}
169
170void __init psi_init(void)
171{
172 if (psi_disabled)
173 return;
174
175 psi_period = jiffies_to_nsecs(PSI_FREQ);
176 group_init(&psi_system);
177}
178
179static bool test_state(unsigned int *tasks, enum psi_states state)
180{
181 switch (state) {
182 case PSI_IO_SOME:
183 return tasks[NR_IOWAIT];
184 case PSI_IO_FULL:
185 return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
186 case PSI_MEM_SOME:
187 return tasks[NR_MEMSTALL];
188 case PSI_MEM_FULL:
189 return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
190 case PSI_CPU_SOME:
191 return tasks[NR_RUNNING] > 1;
192 case PSI_NONIDLE:
193 return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
194 tasks[NR_RUNNING];
195 default:
196 return false;
197 }
198}
199
200static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
201{
202 struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
203 unsigned int tasks[NR_PSI_TASK_COUNTS];
204 u64 now, state_start;
205 unsigned int seq;
206 int s;
207
208 /* Snapshot a coherent view of the CPU state */
209 do {
210 seq = read_seqcount_begin(&groupc->seq);
211 now = cpu_clock(cpu);
212 memcpy(times, groupc->times, sizeof(groupc->times));
213 memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
214 state_start = groupc->state_start;
215 } while (read_seqcount_retry(&groupc->seq, seq));
216
217 /* Calculate state time deltas against the previous snapshot */
218 for (s = 0; s < NR_PSI_STATES; s++) {
219 u32 delta;
220 /*
221 * In addition to already concluded states, we also
222 * incorporate currently active states on the CPU,
223 * since states may last for many sampling periods.
224 *
225 * This way we keep our delta sampling buckets small
226 * (u32) and our reported pressure close to what's
227 * actually happening.
228 */
229 if (test_state(tasks, s))
230 times[s] += now - state_start;
231
232 delta = times[s] - groupc->times_prev[s];
233 groupc->times_prev[s] = times[s];
234
235 times[s] = delta;
236 }
237}
238
239static void calc_avgs(unsigned long avg[3], int missed_periods,
240 u64 time, u64 period)
241{
242 unsigned long pct;
243
244 /* Fill in zeroes for periods of no activity */
245 if (missed_periods) {
246 avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
247 avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
248 avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
249 }
250
251 /* Sample the most recent active period */
252 pct = div_u64(time * 100, period);
253 pct *= FIXED_1;
254 avg[0] = calc_load(avg[0], EXP_10s, pct);
255 avg[1] = calc_load(avg[1], EXP_60s, pct);
256 avg[2] = calc_load(avg[2], EXP_300s, pct);
257}
258
259static bool update_stats(struct psi_group *group)
260{
261 u64 deltas[NR_PSI_STATES - 1] = { 0, };
262 unsigned long missed_periods = 0;
263 unsigned long nonidle_total = 0;
264 u64 now, expires, period;
265 int cpu;
266 int s;
267
268 mutex_lock(&group->stat_lock);
269
270 /*
271 * Collect the per-cpu time buckets and average them into a
272 * single time sample that is normalized to wallclock time.
273 *
274 * For averaging, each CPU is weighted by its non-idle time in
275 * the sampling period. This eliminates artifacts from uneven
276 * loading, or even entirely idle CPUs.
277 */
278 for_each_possible_cpu(cpu) {
279 u32 times[NR_PSI_STATES];
280 u32 nonidle;
281
282 get_recent_times(group, cpu, times);
283
284 nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
285 nonidle_total += nonidle;
286
287 for (s = 0; s < PSI_NONIDLE; s++)
288 deltas[s] += (u64)times[s] * nonidle;
289 }
290
291 /*
292 * Integrate the sample into the running statistics that are
293 * reported to userspace: the cumulative stall times and the
294 * decaying averages.
295 *
296 * Pressure percentages are sampled at PSI_FREQ. We might be
297 * called more often when the user polls more frequently than
298 * that; we might be called less often when there is no task
299 * activity, thus no data, and clock ticks are sporadic. The
300 * below handles both.
301 */
302
303 /* total= */
304 for (s = 0; s < NR_PSI_STATES - 1; s++)
305 group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
306
307 /* avgX= */
308 now = sched_clock();
309 expires = group->next_update;
310 if (now < expires)
311 goto out;
312 if (now - expires > psi_period)
313 missed_periods = div_u64(now - expires, psi_period);
314
315 /*
316 * The periodic clock tick can get delayed for various
317 * reasons, especially on loaded systems. To avoid clock
318 * drift, we schedule the clock in fixed psi_period intervals.
319 * But the deltas we sample out of the per-cpu buckets above
320 * are based on the actual time elapsing between clock ticks.
321 */
322 group->next_update = expires + ((1 + missed_periods) * psi_period);
323 period = now - (group->last_update + (missed_periods * psi_period));
324 group->last_update = now;
325
326 for (s = 0; s < NR_PSI_STATES - 1; s++) {
327 u32 sample;
328
329 sample = group->total[s] - group->total_prev[s];
330 /*
331 * Due to the lockless sampling of the time buckets,
332 * recorded time deltas can slip into the next period,
333 * which under full pressure can result in samples in
334 * excess of the period length.
335 *
336 * We don't want to report non-sensical pressures in
337 * excess of 100%, nor do we want to drop such events
338 * on the floor. Instead we punt any overage into the
339 * future until pressure subsides. By doing this we
340 * don't underreport the occurring pressure curve, we
341 * just report it delayed by one period length.
342 *
343 * The error isn't cumulative. As soon as another
344 * delta slips from a period P to P+1, by definition
345 * it frees up its time T in P.
346 */
347 if (sample > period)
348 sample = period;
349 group->total_prev[s] += sample;
350 calc_avgs(group->avg[s], missed_periods, sample, period);
351 }
352out:
353 mutex_unlock(&group->stat_lock);
354 return nonidle_total;
355}
356
357static void psi_update_work(struct work_struct *work)
358{
359 struct delayed_work *dwork;
360 struct psi_group *group;
361 bool nonidle;
362
363 dwork = to_delayed_work(work);
364 group = container_of(dwork, struct psi_group, clock_work);
365
366 /*
367 * If there is task activity, periodically fold the per-cpu
368 * times and feed samples into the running averages. If things
369 * are idle and there is no data to process, stop the clock.
370 * Once restarted, we'll catch up the running averages in one
371 * go - see calc_avgs() and missed_periods.
372 */
373
374 nonidle = update_stats(group);
375
376 if (nonidle) {
377 unsigned long delay = 0;
378 u64 now;
379
380 now = sched_clock();
381 if (group->next_update > now)
382 delay = nsecs_to_jiffies(group->next_update - now) + 1;
383 schedule_delayed_work(dwork, delay);
384 }
385}
386
387static void record_times(struct psi_group_cpu *groupc, int cpu,
388 bool memstall_tick)
389{
390 u32 delta;
391 u64 now;
392
393 now = cpu_clock(cpu);
394 delta = now - groupc->state_start;
395 groupc->state_start = now;
396
397 if (test_state(groupc->tasks, PSI_IO_SOME)) {
398 groupc->times[PSI_IO_SOME] += delta;
399 if (test_state(groupc->tasks, PSI_IO_FULL))
400 groupc->times[PSI_IO_FULL] += delta;
401 }
402
403 if (test_state(groupc->tasks, PSI_MEM_SOME)) {
404 groupc->times[PSI_MEM_SOME] += delta;
405 if (test_state(groupc->tasks, PSI_MEM_FULL))
406 groupc->times[PSI_MEM_FULL] += delta;
407 else if (memstall_tick) {
408 u32 sample;
409 /*
410 * Since we care about lost potential, a
411 * memstall is FULL when there are no other
412 * working tasks, but also when the CPU is
413 * actively reclaiming and nothing productive
414 * could run even if it were runnable.
415 *
416 * When the timer tick sees a reclaiming CPU,
417 * regardless of runnable tasks, sample a FULL
418 * tick (or less if it hasn't been a full tick
419 * since the last state change).
420 */
421 sample = min(delta, (u32)jiffies_to_nsecs(1));
422 groupc->times[PSI_MEM_FULL] += sample;
423 }
424 }
425
426 if (test_state(groupc->tasks, PSI_CPU_SOME))
427 groupc->times[PSI_CPU_SOME] += delta;
428
429 if (test_state(groupc->tasks, PSI_NONIDLE))
430 groupc->times[PSI_NONIDLE] += delta;
431}
432
433static void psi_group_change(struct psi_group *group, int cpu,
434 unsigned int clear, unsigned int set)
435{
436 struct psi_group_cpu *groupc;
437 unsigned int t, m;
438
439 groupc = per_cpu_ptr(group->pcpu, cpu);
440
441 /*
442 * First we assess the aggregate resource states this CPU's
443 * tasks have been in since the last change, and account any
444 * SOME and FULL time these may have resulted in.
445 *
446 * Then we update the task counts according to the state
447 * change requested through the @clear and @set bits.
448 */
449 write_seqcount_begin(&groupc->seq);
450
451 record_times(groupc, cpu, false);
452
453 for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
454 if (!(m & (1 << t)))
455 continue;
456 if (groupc->tasks[t] == 0 && !psi_bug) {
457 printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
458 cpu, t, groupc->tasks[0],
459 groupc->tasks[1], groupc->tasks[2],
460 clear, set);
461 psi_bug = 1;
462 }
463 groupc->tasks[t]--;
464 }
465
466 for (t = 0; set; set &= ~(1 << t), t++)
467 if (set & (1 << t))
468 groupc->tasks[t]++;
469
470 write_seqcount_end(&groupc->seq);
471
472 if (!delayed_work_pending(&group->clock_work))
473 schedule_delayed_work(&group->clock_work, PSI_FREQ);
474}
475
476void psi_task_change(struct task_struct *task, int clear, int set)
477{
478 int cpu = task_cpu(task);
479
480 if (!task->pid)
481 return;
482
483 if (((task->psi_flags & set) ||
484 (task->psi_flags & clear) != clear) &&
485 !psi_bug) {
486 printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
487 task->pid, task->comm, cpu,
488 task->psi_flags, clear, set);
489 psi_bug = 1;
490 }
491
492 task->psi_flags &= ~clear;
493 task->psi_flags |= set;
494
495 psi_group_change(&psi_system, cpu, clear, set);
496}
497
498void psi_memstall_tick(struct task_struct *task, int cpu)
499{
500 struct psi_group_cpu *groupc;
501
502 groupc = per_cpu_ptr(psi_system.pcpu, cpu);
503 write_seqcount_begin(&groupc->seq);
504 record_times(groupc, cpu, true);
505 write_seqcount_end(&groupc->seq);
506}
507
508/**
509 * psi_memstall_enter - mark the beginning of a memory stall section
510 * @flags: flags to handle nested sections
511 *
512 * Marks the calling task as being stalled due to a lack of memory,
513 * such as waiting for a refault or performing reclaim.
514 */
515void psi_memstall_enter(unsigned long *flags)
516{
517 struct rq_flags rf;
518 struct rq *rq;
519
520 if (psi_disabled)
521 return;
522
523 *flags = current->flags & PF_MEMSTALL;
524 if (*flags)
525 return;
526 /*
527 * PF_MEMSTALL setting & accounting needs to be atomic wrt
528 * changes to the task's scheduling state, otherwise we can
529 * race with CPU migration.
530 */
531 rq = this_rq_lock_irq(&rf);
532
533 current->flags |= PF_MEMSTALL;
534 psi_task_change(current, 0, TSK_MEMSTALL);
535
536 rq_unlock_irq(rq, &rf);
537}
538
539/**
540 * psi_memstall_leave - mark the end of a memory stall section
541 * @flags: flags to handle nested memdelay sections
542 *
543 * Marks the calling task as no longer stalled due to lack of memory.
544 */
545void psi_memstall_leave(unsigned long *flags)
546{
547 struct rq_flags rf;
548 struct rq *rq;
549
550 if (psi_disabled)
551 return;
552
553 if (*flags)
554 return;
555 /*
556 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
557 * changes to the task's scheduling state, otherwise we could
558 * race with CPU migration.
559 */
560 rq = this_rq_lock_irq(&rf);
561
562 current->flags &= ~PF_MEMSTALL;
563 psi_task_change(current, TSK_MEMSTALL, 0);
564
565 rq_unlock_irq(rq, &rf);
566}
567
568static int psi_show(struct seq_file *m, struct psi_group *group,
569 enum psi_res res)
570{
571 int full;
572
573 if (psi_disabled)
574 return -EOPNOTSUPP;
575
576 update_stats(group);
577
578 for (full = 0; full < 2 - (res == PSI_CPU); full++) {
579 unsigned long avg[3];
580 u64 total;
581 int w;
582
583 for (w = 0; w < 3; w++)
584 avg[w] = group->avg[res * 2 + full][w];
585 total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC);
586
587 seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
588 full ? "full" : "some",
589 LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
590 LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
591 LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
592 total);
593 }
594
595 return 0;
596}
597
598static int psi_io_show(struct seq_file *m, void *v)
599{
600 return psi_show(m, &psi_system, PSI_IO);
601}
602
603static int psi_memory_show(struct seq_file *m, void *v)
604{
605 return psi_show(m, &psi_system, PSI_MEM);
606}
607
608static int psi_cpu_show(struct seq_file *m, void *v)
609{
610 return psi_show(m, &psi_system, PSI_CPU);
611}
612
613static int psi_io_open(struct inode *inode, struct file *file)
614{
615 return single_open(file, psi_io_show, NULL);
616}
617
618static int psi_memory_open(struct inode *inode, struct file *file)
619{
620 return single_open(file, psi_memory_show, NULL);
621}
622
623static int psi_cpu_open(struct inode *inode, struct file *file)
624{
625 return single_open(file, psi_cpu_show, NULL);
626}
627
628static const struct file_operations psi_io_fops = {
629 .open = psi_io_open,
630 .read = seq_read,
631 .llseek = seq_lseek,
632 .release = single_release,
633};
634
635static const struct file_operations psi_memory_fops = {
636 .open = psi_memory_open,
637 .read = seq_read,
638 .llseek = seq_lseek,
639 .release = single_release,
640};
641
642static const struct file_operations psi_cpu_fops = {
643 .open = psi_cpu_open,
644 .read = seq_read,
645 .llseek = seq_lseek,
646 .release = single_release,
647};
648
649static int __init psi_proc_init(void)
650{
651 proc_mkdir("pressure", NULL);
652 proc_create("pressure/io", 0, NULL, &psi_io_fops);
653 proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
654 proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
655 return 0;
656}
657module_init(psi_proc_init);
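
To make the arithmetic of the model comment at the top of this file
concrete, here is a small standalone sketch (illustrative userspace C,
not kernel code) that reproduces the comment's two worked examples; SOME
is clamped at 100% by capping the delayed-task count at the thread count:

    /* Illustrative only: the SOME/FULL model from the psi.c header comment. */
    #include <stdio.h>

    static unsigned int min_u(unsigned int a, unsigned int b)
    {
        return a < b ? a : b;
    }

    static void pressure(unsigned int nr_cpus, unsigned int nr_nonidle,
                         unsigned int nr_running, unsigned int nr_delayed)
    {
        unsigned int threads = min_u(nr_nonidle, nr_cpus);
        double some = (double)min_u(nr_delayed, threads) / threads;
        double full = (double)(threads - min_u(nr_running, threads)) / threads;

        printf("threads=%u SOME=%.1f%% FULL=%.1f%%\n",
               threads, 100 * some, 100 * full);
    }

    int main(void)
    {
        /* 257 number crunchers on 256 CPUs: SOME ~0.4%, FULL 0% */
        pressure(256, 257, 257, 1);
        /* 4 CPUs, 4 tasks, 1 stalled on memory: SOME 25%, FULL 25% */
        pressure(4, 4, 3, 1);
        return 0;
    }

The decaying averages then feed these per-period samples through the same
fixed-point scheme as the load average: the EXP_10s, EXP_60s and EXP_300s
constants above correspond to 2048/exp(2s/10s), 2048/exp(2s/60s) and
2048/exp(2s/300s) respectively.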
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1de189bb9209..618577fc9aa8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
54#include <linux/proc_fs.h> 54#include <linux/proc_fs.h>
55#include <linux/prefetch.h> 55#include <linux/prefetch.h>
56#include <linux/profile.h> 56#include <linux/profile.h>
57#include <linux/psi.h>
57#include <linux/rcupdate_wait.h> 58#include <linux/rcupdate_wait.h>
58#include <linux/security.h> 59#include <linux/security.h>
59#include <linux/stop_machine.h> 60#include <linux/stop_machine.h>
@@ -319,6 +320,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
319#ifdef CONFIG_CGROUP_SCHED 320#ifdef CONFIG_CGROUP_SCHED
320 321
321#include <linux/cgroup.h> 322#include <linux/cgroup.h>
323#include <linux/psi.h>
322 324
323struct cfs_rq; 325struct cfs_rq;
324struct rt_rq; 326struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..4904c4677000 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,6 +55,92 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
55# define schedstat_val_or_zero(var) 0 55# define schedstat_val_or_zero(var) 0
56#endif /* CONFIG_SCHEDSTATS */ 56#endif /* CONFIG_SCHEDSTATS */
57 57
58#ifdef CONFIG_PSI
59/*
60 * PSI tracks state that persists across sleeps, such as iowaits and
61 * memory stalls. As a result, it has to distinguish between sleeps,
62 * where a task's runnable state changes, and requeues, where a task
63 * and its state are being moved between CPUs and runqueues.
64 */
65static inline void psi_enqueue(struct task_struct *p, bool wakeup)
66{
67 int clear = 0, set = TSK_RUNNING;
68
69 if (psi_disabled)
70 return;
71
72 if (!wakeup || p->sched_psi_wake_requeue) {
73 if (p->flags & PF_MEMSTALL)
74 set |= TSK_MEMSTALL;
75 if (p->sched_psi_wake_requeue)
76 p->sched_psi_wake_requeue = 0;
77 } else {
78 if (p->in_iowait)
79 clear |= TSK_IOWAIT;
80 }
81
82 psi_task_change(p, clear, set);
83}
84
85static inline void psi_dequeue(struct task_struct *p, bool sleep)
86{
87 int clear = TSK_RUNNING, set = 0;
88
89 if (psi_disabled)
90 return;
91
92 if (!sleep) {
93 if (p->flags & PF_MEMSTALL)
94 clear |= TSK_MEMSTALL;
95 } else {
96 if (p->in_iowait)
97 set |= TSK_IOWAIT;
98 }
99
100 psi_task_change(p, clear, set);
101}
102
103static inline void psi_ttwu_dequeue(struct task_struct *p)
104{
105 if (psi_disabled)
106 return;
107 /*
108 * Is the task being migrated during a wakeup? Make sure to
109 * deregister its sleep-persistent psi states from the old
110 * queue, and let psi_enqueue() know it has to requeue.
111 */
112 if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
113 struct rq_flags rf;
114 struct rq *rq;
115 int clear = 0;
116
117 if (p->in_iowait)
118 clear |= TSK_IOWAIT;
119 if (p->flags & PF_MEMSTALL)
120 clear |= TSK_MEMSTALL;
121
122 rq = __task_rq_lock(p, &rf);
123 psi_task_change(p, clear, 0);
124 p->sched_psi_wake_requeue = 1;
125 __task_rq_unlock(rq, &rf);
126 }
127}
128
129static inline void psi_task_tick(struct rq *rq)
130{
131 if (psi_disabled)
132 return;
133
134 if (unlikely(rq->curr->flags & PF_MEMSTALL))
135 psi_memstall_tick(rq->curr, cpu_of(rq));
136}
137#else /* CONFIG_PSI */
138static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
139static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
140static inline void psi_ttwu_dequeue(struct task_struct *p) {}
141static inline void psi_task_tick(struct rq *rq) {}
142#endif /* CONFIG_PSI */
143
58#ifdef CONFIG_SCHED_INFO 144#ifdef CONFIG_SCHED_INFO
59static inline void sched_info_reset_dequeued(struct task_struct *t) 145static inline void sched_info_reset_dequeued(struct task_struct *t)
60{ 146{
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..7c607479de4a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
22#include <linux/kthread.h> 22#include <linux/kthread.h>
23#include <linux/freezer.h> 23#include <linux/freezer.h>
24#include <linux/page_owner.h> 24#include <linux/page_owner.h>
25#include <linux/psi.h>
25#include "internal.h" 26#include "internal.h"
26 27
27#ifdef CONFIG_COMPACTION 28#ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
2068 pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1; 2069 pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
2069 2070
2070 while (!kthread_should_stop()) { 2071 while (!kthread_should_stop()) {
2072 unsigned long pflags;
2073
2071 trace_mm_compaction_kcompactd_sleep(pgdat->node_id); 2074 trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
2072 wait_event_freezable(pgdat->kcompactd_wait, 2075 wait_event_freezable(pgdat->kcompactd_wait,
2073 kcompactd_work_requested(pgdat)); 2076 kcompactd_work_requested(pgdat));
2074 2077
2078 psi_memstall_enter(&pflags);
2075 kcompactd_do_work(pgdat); 2079 kcompactd_do_work(pgdat);
2080 psi_memstall_leave(&pflags);
2076 } 2081 }
2077 2082
2078 return 0; 2083 return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 01a841f17bf4..41586009fa42 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
37#include <linux/shmem_fs.h> 37#include <linux/shmem_fs.h>
38#include <linux/rmap.h> 38#include <linux/rmap.h>
39#include <linux/delayacct.h> 39#include <linux/delayacct.h>
40#include <linux/psi.h>
40#include "internal.h" 41#include "internal.h"
41 42
42#define CREATE_TRACE_POINTS 43#define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
1075 struct wait_page_queue wait_page; 1076 struct wait_page_queue wait_page;
1076 wait_queue_entry_t *wait = &wait_page.wait; 1077 wait_queue_entry_t *wait = &wait_page.wait;
1077 bool thrashing = false; 1078 bool thrashing = false;
1079 unsigned long pflags;
1078 int ret = 0; 1080 int ret = 0;
1079 1081
1080 if (bit_nr == PG_locked && !PageSwapBacked(page) && 1082 if (bit_nr == PG_locked &&
1081 !PageUptodate(page) && PageWorkingset(page)) { 1083 !PageUptodate(page) && PageWorkingset(page)) {
1082 delayacct_thrashing_start(); 1084 if (!PageSwapBacked(page))
1085 delayacct_thrashing_start();
1086 psi_memstall_enter(&pflags);
1083 thrashing = true; 1087 thrashing = true;
1084 } 1088 }
1085 1089
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
1121 1125
1122 finish_wait(q, wait); 1126 finish_wait(q, wait);
1123 1127
1124 if (thrashing) 1128 if (thrashing) {
1125 delayacct_thrashing_end(); 1129 if (!PageSwapBacked(page))
1130 delayacct_thrashing_end();
1131 psi_memstall_leave(&pflags);
1132 }
1126 1133
1127 /* 1134 /*
1128 * A signal could leave PageWaiters set. Clearing it here if 1135 * A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20f25d06c00c..f97b5a1700a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
66#include <linux/ftrace.h> 66#include <linux/ftrace.h>
67#include <linux/lockdep.h> 67#include <linux/lockdep.h>
68#include <linux/nmi.h> 68#include <linux/nmi.h>
69#include <linux/psi.h>
69 70
70#include <asm/sections.h> 71#include <asm/sections.h>
71#include <asm/tlbflush.h> 72#include <asm/tlbflush.h>
@@ -3549,15 +3550,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
3549 enum compact_priority prio, enum compact_result *compact_result) 3550 enum compact_priority prio, enum compact_result *compact_result)
3550{ 3551{
3551 struct page *page; 3552 struct page *page;
3553 unsigned long pflags;
3552 unsigned int noreclaim_flag; 3554 unsigned int noreclaim_flag;
3553 3555
3554 if (!order) 3556 if (!order)
3555 return NULL; 3557 return NULL;
3556 3558
3559 psi_memstall_enter(&pflags);
3557 noreclaim_flag = memalloc_noreclaim_save(); 3560 noreclaim_flag = memalloc_noreclaim_save();
3561
3558 *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, 3562 *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
3559 prio); 3563 prio);
3564
3560 memalloc_noreclaim_restore(noreclaim_flag); 3565 memalloc_noreclaim_restore(noreclaim_flag);
3566 psi_memstall_leave(&pflags);
3561 3567
3562 if (*compact_result <= COMPACT_INACTIVE) 3568 if (*compact_result <= COMPACT_INACTIVE)
3563 return NULL; 3569 return NULL;
@@ -3756,11 +3762,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
3756 struct reclaim_state reclaim_state; 3762 struct reclaim_state reclaim_state;
3757 int progress; 3763 int progress;
3758 unsigned int noreclaim_flag; 3764 unsigned int noreclaim_flag;
3765 unsigned long pflags;
3759 3766
3760 cond_resched(); 3767 cond_resched();
3761 3768
3762 /* We now go into synchronous reclaim */ 3769 /* We now go into synchronous reclaim */
3763 cpuset_memory_pressure_bump(); 3770 cpuset_memory_pressure_bump();
3771 psi_memstall_enter(&pflags);
3764 fs_reclaim_acquire(gfp_mask); 3772 fs_reclaim_acquire(gfp_mask);
3765 noreclaim_flag = memalloc_noreclaim_save(); 3773 noreclaim_flag = memalloc_noreclaim_save();
3766 reclaim_state.reclaimed_slab = 0; 3774 reclaim_state.reclaimed_slab = 0;
@@ -3772,6 +3780,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
3772 current->reclaim_state = NULL; 3780 current->reclaim_state = NULL;
3773 memalloc_noreclaim_restore(noreclaim_flag); 3781 memalloc_noreclaim_restore(noreclaim_flag);
3774 fs_reclaim_release(gfp_mask); 3782 fs_reclaim_release(gfp_mask);
3783 psi_memstall_leave(&pflags);
3775 3784
3776 cond_resched(); 3785 cond_resched();
3777 3786
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 87e9fef341d2..8ea87586925e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
49#include <linux/prefetch.h> 49#include <linux/prefetch.h>
50#include <linux/printk.h> 50#include <linux/printk.h>
51#include <linux/dax.h> 51#include <linux/dax.h>
52#include <linux/psi.h>
52 53
53#include <asm/tlbflush.h> 54#include <asm/tlbflush.h>
54#include <asm/div64.h> 55#include <asm/div64.h>
@@ -3305,6 +3306,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
3305{ 3306{
3306 struct zonelist *zonelist; 3307 struct zonelist *zonelist;
3307 unsigned long nr_reclaimed; 3308 unsigned long nr_reclaimed;
3309 unsigned long pflags;
3308 int nid; 3310 int nid;
3309 unsigned int noreclaim_flag; 3311 unsigned int noreclaim_flag;
3310 struct scan_control sc = { 3312 struct scan_control sc = {
@@ -3333,9 +3335,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
3333 sc.gfp_mask, 3335 sc.gfp_mask,
3334 sc.reclaim_idx); 3336 sc.reclaim_idx);
3335 3337
3338 psi_memstall_enter(&pflags);
3336 noreclaim_flag = memalloc_noreclaim_save(); 3339 noreclaim_flag = memalloc_noreclaim_save();
3340
3337 nr_reclaimed = do_try_to_free_pages(zonelist, &sc); 3341 nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
3342
3338 memalloc_noreclaim_restore(noreclaim_flag); 3343 memalloc_noreclaim_restore(noreclaim_flag);
3344 psi_memstall_leave(&pflags);
3339 3345
3340 trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); 3346 trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
3341 3347
@@ -3500,6 +3506,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
3500 int i; 3506 int i;
3501 unsigned long nr_soft_reclaimed; 3507 unsigned long nr_soft_reclaimed;
3502 unsigned long nr_soft_scanned; 3508 unsigned long nr_soft_scanned;
3509 unsigned long pflags;
3503 struct zone *zone; 3510 struct zone *zone;
3504 struct scan_control sc = { 3511 struct scan_control sc = {
3505 .gfp_mask = GFP_KERNEL, 3512 .gfp_mask = GFP_KERNEL,
@@ -3510,6 +3517,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
3510 .may_swap = 1, 3517 .may_swap = 1,
3511 }; 3518 };
3512 3519
3520 psi_memstall_enter(&pflags);
3513 __fs_reclaim_acquire(); 3521 __fs_reclaim_acquire();
3514 3522
3515 count_vm_event(PAGEOUTRUN); 3523 count_vm_event(PAGEOUTRUN);
@@ -3611,6 +3619,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
3611out: 3619out:
3612 snapshot_refaults(NULL, pgdat); 3620 snapshot_refaults(NULL, pgdat);
3613 __fs_reclaim_release(); 3621 __fs_reclaim_release();
3622 psi_memstall_leave(&pflags);
3614 /* 3623 /*
3615 * Return the order kswapd stopped reclaiming at as 3624 * Return the order kswapd stopped reclaiming at as
3616 * prepare_kswapd_sleep() takes it into account. If another caller 3625 * prepare_kswapd_sleep() takes it into account. If another caller