1 files changed, 182 insertions, 0 deletions
diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
new file mode 100644
index 000000000000..621111ce5740
--- /dev/null
+++ b/Documentation/accounting/psi.rst
@@ -0,0 +1,182 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex
+workloads or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such::
+
+        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO::
+
+        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+        full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+a severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
+
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger, the user has to open the psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used::
+
+        <some|full> <stall amount in us> <time window in us>
+
+For example, writing "some 150000 1000000" into /proc/pressure/memory
+adds a 150ms threshold for partial memory stall measured within a 1sec
+time window. Writing "full 50000 1000000" into /proc/pressure/io adds
+a 50ms threshold for full io stall measured within a 1sec time window.
+
+Triggers can be set on more than one psi metric, and more than one trigger
+can be specified for the same psi metric. However, each trigger requires a
+separate file descriptor so that it can be polled separately from the
+others; a separate open() syscall should therefore be made for each
+trigger, even when opening the same psi interface file.
+
+Monitors activate only when the system enters the stall state for the
+monitored psi metric and deactivate upon exit from the stall state. While
+the system is in the stall state, psi signal growth is monitored at a rate
+of 10 times per tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, so the minimum
+monitoring update interval is 50ms and the maximum is 1s. The minimum is set
+to prevent overly frequent polling. The maximum is chosen as a high enough
+number after which monitors are most likely not needed and psi averages can
+be used instead.
+
+When activated, a psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when the system
+is bouncing in and out of the stall state.
+
+Notifications to userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+::
+
+  #include <errno.h>
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <poll.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /*
+   * Monitor memory partial stall with 1s tracking window size
+   * and 150ms threshold.
+   */
+  int main() {
+        const char trig[] = "some 150000 1000000";
+        struct pollfd fds;
+        int n;
+
+        fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+        if (fds.fd < 0) {
+                printf("/proc/pressure/memory open error: %s\n",
+                        strerror(errno));
+                return 1;
+        }
+        fds.events = POLLPRI;
+
+        if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+                printf("/proc/pressure/memory write error: %s\n",
+                        strerror(errno));
+                return 1;
+        }
+
+        printf("waiting for events...\n");
+        while (1) {
+                n = poll(&fds, 1, -1);
+                if (n < 0) {
+                        printf("poll error: %s\n", strerror(errno));
+                        return 1;
+                }
+                if (fds.revents & POLLERR) {
+                        printf("got POLLERR, event source is gone\n");
+                        return 0;
+                }
+                if (fds.revents & POLLPRI) {
+                        printf("event triggered!\n");
+                } else {
+                        printf("unknown event received: 0x%x\n", fds.revents);
+                        return 1;
+                }
+        }
+
+        return 0;
+  }
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
+
+Per-cgroup psi monitors can be specified and used the same way as
+system-wide ones.
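
Since system-wide and per-cgroup monitors take the same trigger string, userspace may want to build and sanity-check that string before writing it. The following is a minimal sketch outside the patch itself. The helper names psi_format_trigger() and psi_update_interval_us() are hypothetical. The window bounds (500ms to 10s) and the "10 times per tracking window" rate come from the documentation text; the check that the threshold is non-zero and no larger than the window is an assumption of this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Window bounds from the documentation text: 500ms to 10s, in microseconds. */
#define PSI_WINDOW_MIN_US 500000UL
#define PSI_WINDOW_MAX_US 10000000UL

/*
 * Hypothetical helper: format "<some|full> <stall us> <window us>" into buf.
 * Returns the string length on success, or -1 if the arguments are
 * rejected or buf is too small.
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
                       unsigned long stall_us, unsigned long window_us)
{
    int n;

    if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
        return -1;
    if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
        return -1;
    /* Assumption of this sketch: a usable threshold fits inside the window. */
    if (stall_us == 0 || stall_us > window_us)
        return -1;

    n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}

/* Signal growth is checked 10 times per tracking window. */
unsigned long psi_update_interval_us(unsigned long window_us)
{
    return window_us / 10;
}
```

The resulting string could then be passed to write() on a descriptor opened as in the userspace monitor usage example, with one descriptor per trigger.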

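Because the /proc/pressure/ files and the per-cgroup *.pressure files share one line format, a single parser can serve both. The sketch below is illustrative only: struct psi_line, psi_parse_line() and psi_stall_percent() are hypothetical names, not kernel or libc APIs. The second function shows how the "total" field might be used to average stall time over a custom time frame, by sampling it twice and dividing the difference by the elapsed time.

```c
#include <stdio.h>

/* One parsed line of a pressure file (hypothetical layout for this sketch). */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* stall ratios in percent */
    unsigned long long total;     /* cumulative stall time in us */
};

/* Parse one line of a pressure file. Returns 0 on success, -1 if malformed. */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total);

    return n == 5 ? 0 : -1;
}

/*
 * Stall percentage over a custom interval, computed from two samples of
 * "total" taken elapsed_us microseconds apart. Returns 0.0 for an empty
 * interval or a counter that went backwards.
 */
double psi_stall_percent(unsigned long long total_before,
                         unsigned long long total_after,
                         unsigned long long elapsed_us)
{
    if (elapsed_us == 0 || total_after < total_before)
        return 0.0;
    return 100.0 * (double)(total_after - total_before) / (double)elapsed_us;
}
```

For instance, if "total" grows by 150000us between two reads taken one second apart, the stall share for that second is 15%.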