author Linus Torvalds <torvalds@linux-foundation.org> 2019-07-16 15:21:41 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2019-07-16 15:21:41 -0400
commit c309b6f24222246c18a8b65d3950e6e755440865
tree 11893170f5c246bb0dee8066e85878af04162ab0 /Documentation/admin-guide
parent 3e859477a1db52a0435d06a55fdb54f62d69c292
parent 168869492e7009b6861b615f1d030c99bc805e83
Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...

Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/aoe/aoe.rst | 150
-rw-r--r-- Documentation/admin-guide/aoe/autoload.sh | 17
-rw-r--r-- Documentation/admin-guide/aoe/examples.rst | 23
-rw-r--r-- Documentation/admin-guide/aoe/index.rst | 17
-rw-r--r-- Documentation/admin-guide/aoe/status.sh | 30
-rw-r--r-- Documentation/admin-guide/aoe/todo.rst | 17
-rw-r--r-- Documentation/admin-guide/aoe/udev-install.sh | 33
-rw-r--r-- Documentation/admin-guide/aoe/udev.txt | 26
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg | 588
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg | 459
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/conn-states-8.dot | 18
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst | 42
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/disk-states-8.dot | 16
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot | 85
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/figures.rst | 30
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/index.rst | 19
-rw-r--r-- Documentation/admin-guide/blockdev/drbd/node-states-8.dot | 13
-rw-r--r-- Documentation/admin-guide/blockdev/floppy.rst | 255
-rw-r--r-- Documentation/admin-guide/blockdev/index.rst | 16
-rw-r--r-- Documentation/admin-guide/blockdev/nbd.rst | 31
-rw-r--r-- Documentation/admin-guide/blockdev/paride.rst | 439
-rw-r--r-- Documentation/admin-guide/blockdev/ramdisk.rst | 177
-rw-r--r-- Documentation/admin-guide/blockdev/zram.rst | 422
-rw-r--r-- Documentation/admin-guide/btmrvl.rst | 124
-rw-r--r-- Documentation/admin-guide/bug-hunting.rst | 4
-rw-r--r-- Documentation/admin-guide/cgroup-v1/blkio-controller.rst | 302
-rw-r--r-- Documentation/admin-guide/cgroup-v1/cgroups.rst | 695
-rw-r--r-- Documentation/admin-guide/cgroup-v1/cpuacct.rst | 50
-rw-r--r-- Documentation/admin-guide/cgroup-v1/cpusets.rst | 866
-rw-r--r-- Documentation/admin-guide/cgroup-v1/devices.rst | 132
-rw-r--r-- Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst | 127
-rw-r--r-- Documentation/admin-guide/cgroup-v1/hugetlb.rst | 50
-rw-r--r-- Documentation/admin-guide/cgroup-v1/index.rst | 28
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 355
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memory.rst | 1003
-rw-r--r-- Documentation/admin-guide/cgroup-v1/net_cls.rst | 44
-rw-r--r-- Documentation/admin-guide/cgroup-v1/net_prio.rst | 57
-rw-r--r-- Documentation/admin-guide/cgroup-v1/pids.rst | 92
-rw-r--r-- Documentation/admin-guide/cgroup-v1/rdma.rst | 117
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst | 8
-rw-r--r-- Documentation/admin-guide/clearing-warn-once.rst | 9
-rw-r--r-- Documentation/admin-guide/cpu-load.rst | 114
-rw-r--r-- Documentation/admin-guide/cputopology.rst | 177
-rw-r--r-- Documentation/admin-guide/device-mapper/cache-policies.rst | 131
-rw-r--r-- Documentation/admin-guide/device-mapper/cache.rst | 337
-rw-r--r-- Documentation/admin-guide/device-mapper/delay.rst | 31
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-crypt.rst | 173
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-dust.txt | 272
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-flakey.rst | 74
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-init.rst | 125
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-integrity.rst | 259
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-io.rst | 75
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-log.rst | 57
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-queue-length.rst | 48
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-raid.rst | 419
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-service-time.rst | 101
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-uevent.rst | 110
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-zoned.rst | 146
-rw-r--r-- Documentation/admin-guide/device-mapper/era.rst | 116
-rw-r--r-- Documentation/admin-guide/device-mapper/index.rst | 42
-rw-r--r-- Documentation/admin-guide/device-mapper/kcopyd.rst | 47
-rw-r--r-- Documentation/admin-guide/device-mapper/linear.rst | 63
-rw-r--r-- Documentation/admin-guide/device-mapper/log-writes.rst | 145
-rw-r--r-- Documentation/admin-guide/device-mapper/persistent-data.rst | 88
-rw-r--r-- Documentation/admin-guide/device-mapper/snapshot.rst | 196
-rw-r--r-- Documentation/admin-guide/device-mapper/statistics.rst | 225
-rw-r--r-- Documentation/admin-guide/device-mapper/striped.rst | 61
-rw-r--r-- Documentation/admin-guide/device-mapper/switch.rst | 141
-rw-r--r-- Documentation/admin-guide/device-mapper/thin-provisioning.rst | 427
-rw-r--r-- Documentation/admin-guide/device-mapper/unstriped.rst | 135
-rw-r--r-- Documentation/admin-guide/device-mapper/verity.rst | 229
-rw-r--r-- Documentation/admin-guide/device-mapper/writecache.rst | 79
-rw-r--r-- Documentation/admin-guide/device-mapper/zero.rst | 37
-rw-r--r-- Documentation/admin-guide/efi-stub.rst | 100
-rw-r--r-- Documentation/admin-guide/gpio/index.rst | 17
-rw-r--r-- Documentation/admin-guide/gpio/sysfs.rst | 167
-rw-r--r-- Documentation/admin-guide/highuid.rst | 80
-rw-r--r-- Documentation/admin-guide/hw-vuln/l1tf.rst | 2
-rw-r--r-- Documentation/admin-guide/hw_random.rst | 105
-rw-r--r-- Documentation/admin-guide/index.rst | 28
-rw-r--r-- Documentation/admin-guide/iostats.rst | 197
-rw-r--r-- Documentation/admin-guide/kdump/gdbmacros.txt | 264
-rw-r--r-- Documentation/admin-guide/kdump/index.rst | 20
-rw-r--r-- Documentation/admin-guide/kdump/kdump.rst | 534
-rw-r--r-- Documentation/admin-guide/kdump/vmcoreinfo.rst | 488
-rw-r--r-- Documentation/admin-guide/kernel-parameters.rst | 2
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt | 44
-rw-r--r-- Documentation/admin-guide/kernel-per-CPU-kthreads.rst | 356
-rw-r--r-- Documentation/admin-guide/laptops/asus-laptop.rst | 271
-rw-r--r-- Documentation/admin-guide/laptops/disk-shock-protection.rst | 151
-rw-r--r-- Documentation/admin-guide/laptops/index.rst | 17
-rw-r--r-- Documentation/admin-guide/laptops/laptop-mode.rst | 781
-rw-r--r-- Documentation/admin-guide/laptops/lg-laptop.rst | 84
-rw-r--r-- Documentation/admin-guide/laptops/sony-laptop.rst | 174
-rw-r--r-- Documentation/admin-guide/laptops/sonypi.rst | 158
-rw-r--r-- Documentation/admin-guide/laptops/thinkpad-acpi.rst | 1562
-rw-r--r-- Documentation/admin-guide/laptops/toshiba_haps.rst | 87
-rw-r--r-- Documentation/admin-guide/lcd-panel-cgram.rst | 27
-rw-r--r-- Documentation/admin-guide/ldm.rst | 121
-rw-r--r-- Documentation/admin-guide/lockup-watchdogs.rst | 83
-rw-r--r-- Documentation/admin-guide/mm/cma_debugfs.rst | 25
-rw-r--r-- Documentation/admin-guide/mm/index.rst | 3
-rw-r--r-- Documentation/admin-guide/mm/ksm.rst | 2
-rw-r--r-- Documentation/admin-guide/mm/numa_memory_policy.rst | 2
-rw-r--r-- Documentation/admin-guide/namespaces/compatibility-list.rst | 43
-rw-r--r-- Documentation/admin-guide/namespaces/index.rst | 11
-rw-r--r-- Documentation/admin-guide/namespaces/resource-control.rst | 18
-rw-r--r-- Documentation/admin-guide/numastat.rst | 30
-rw-r--r-- Documentation/admin-guide/perf/arm-ccn.rst | 61
-rw-r--r-- Documentation/admin-guide/perf/arm_dsu_pmu.rst | 29
-rw-r--r-- Documentation/admin-guide/perf/hisi-pmu.rst | 60
-rw-r--r-- Documentation/admin-guide/perf/index.rst | 16
-rw-r--r-- Documentation/admin-guide/perf/qcom_l2_pmu.rst | 39
-rw-r--r-- Documentation/admin-guide/perf/qcom_l3_pmu.rst | 26
-rw-r--r-- Documentation/admin-guide/perf/thunderx2-pmu.rst | 42
-rw-r--r-- Documentation/admin-guide/perf/xgene-pmu.rst | 49
-rw-r--r-- Documentation/admin-guide/pnp.rst | 292
-rw-r--r-- Documentation/admin-guide/rapidio.rst | 107
-rw-r--r-- Documentation/admin-guide/rtc.rst | 140
-rw-r--r-- Documentation/admin-guide/svga.rst | 249
-rw-r--r-- Documentation/admin-guide/sysctl/abi.rst | 67
-rw-r--r-- Documentation/admin-guide/sysctl/fs.rst | 384
-rw-r--r-- Documentation/admin-guide/sysctl/index.rst | 98
-rw-r--r-- Documentation/admin-guide/sysctl/kernel.rst | 1177
-rw-r--r-- Documentation/admin-guide/sysctl/net.rst | 461
-rw-r--r-- Documentation/admin-guide/sysctl/sunrpc.rst | 25
-rw-r--r-- Documentation/admin-guide/sysctl/user.rst | 78
-rw-r--r-- Documentation/admin-guide/sysctl/vm.rst | 964
-rw-r--r-- Documentation/admin-guide/video-output.rst | 34
129 files changed, 22085 insertions, 33 deletions
diff --git a/Documentation/admin-guide/aoe/aoe.rst b/Documentation/admin-guide/aoe/aoe.rst
new file mode 100644
index 000000000000..a05e751363a0
--- /dev/null
+++ b/Documentation/admin-guide/aoe/aoe.rst
@@ -0,0 +1,150 @@
Introduction
============

ATA over Ethernet is a network protocol that provides simple access to
block storage on the LAN.

  http://support.coraid.com/documents/AoEr11.txt

The EtherDrive (R) HOWTO for 2.6 and 3.x kernels is found at:

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html

It has many tips and hints. See especially the recommended
tunings for virtual memory:

  http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19

The aoetools are userland programs that are designed to work with this
driver.  The aoetools are on sourceforge.

  http://aoetools.sourceforge.net/

The scripts in this Documentation/admin-guide/aoe directory are intended to
document the use of the driver and are not necessary if you install
the aoetools.


Creating Device Nodes
=====================

  Users of udev should find the block device nodes created
  automatically, but to create all the necessary device nodes, use the
  udev configuration rules provided in udev.txt (in this directory).

  There is a udev-install.sh script that shows how to install these
  rules on your system.

  There is also an autoload script that shows how to edit
  /etc/modprobe.d/aoe.conf to ensure that the aoe module is loaded when
  necessary.  Preloading the aoe module is preferable to autoloading,
  however, because AoE discovery takes a few seconds.  It can be
  confusing when an AoE device is not present the first time a
  command is run but appears a second later.

Using Device Nodes
==================

  "cat /dev/etherd/err" blocks, waiting for error diagnostic output,
  such as reports of retransmitted packets.

  "echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
  limit ATA over Ethernet traffic to eth2 and eth4.  AoE traffic from
  untrusted networks should be ignored as a matter of security.  See
  also the aoe_iflist driver option described below.

  "echo > /dev/etherd/discover" tells the driver to find out what AoE
  devices are available.

  In the future these character devices may disappear and be replaced
  by sysfs counterparts.  Using the commands in aoetools insulates
  users from these implementation details.

  The block devices are named like this::

      e{shelf}.{slot}
      e{shelf}.{slot}p{part}

  ... so that "e0.2" is the third blade from the left (slot 2) in the
  first shelf (shelf address zero).  That's the whole disk.  The first
  partition on that disk would be "e0.2p1".

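  The naming scheme above can be taken apart with plain shell parameter
  expansion.  The following is a minimal sketch, not part of the driver
  or of aoetools: aoe_parse_name is a hypothetical helper introduced
  here for illustration, and its validation of the name is deliberately
  loose.

```shell
#!/bin/sh
# Split an AoE block device name such as "e0.2" or "e0.2p1" into its
# shelf, slot, and optional partition numbers.
# aoe_parse_name is a hypothetical helper, not an aoetools command.
aoe_parse_name() {
    name=$1
    case $name in
        e[0-9]*.[0-9]*) ;;
        *) echo "not an AoE device name: $name" 1>&2; return 1 ;;
    esac
    rest=${name#e}        # drop the leading "e":      0.2p1
    shelf=${rest%%.*}     # text before the first dot: 0
    rest=${rest#*.}       # text after the first dot:  2p1
    slot=${rest%%p*}      # text before any "p":       2
    part=
    case $rest in
        *p*) part=${rest#*p} ;;   # text after the "p": 1
    esac
    echo "shelf=$shelf slot=$slot part=${part:-none}"
}

aoe_parse_name e0.2      # prints: shelf=0 slot=2 part=none
aoe_parse_name e0.2p1    # prints: shelf=0 slot=2 part=1
```

  The whole-disk name reports "part=none", which matches the text: a
  name without a "p" suffix refers to the entire disk.
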
Using sysfs
===========

  Each aoe block device in /sys/block has the extra attributes of
  state, mac, and netif.  The state attribute is "up" when the device
  is ready for I/O and "down" if detected but unusable.  The
  "down,closewait" state shows that the device is still open and
  cannot come up again until it has been closed.

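  A script can branch on the state attribute directly.  The sketch
  below is an illustration under stated assumptions: aoe_state is a
  hypothetical helper, the "etherd!" directory prefix is inferred from
  the way status.sh strips everything up to "!" from the sysfs path,
  and the sysfs_dir override is borrowed from status.sh so the function
  can be tried against a scratch directory.

```shell
#!/bin/sh
# Print the sysfs "state" attribute of an AoE device and succeed only
# when the device is ready for I/O ("up").
# aoe_state is a hypothetical helper, not an aoetools command.
aoe_state() {
    dev=$1                                # e.g. e0.2
    sysd=${sysfs_dir:-/sys}               # same override as status.sh
    f="$sysd/block/etherd!$dev/state"     # assumed sysfs layout
    if test ! -r "$f"; then
        echo "unknown"
        return 2
    fi
    state=`cat "$f"`
    echo "$state"
    test "$state" = up
}

# Example: prints the state, or "unknown" if the device is absent.
aoe_state e0.2 || true
```

  The function returns 0 only for "up", so any other state, including
  "down" and "down,closewait", is treated as not ready.
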
  The mac attribute is the ethernet address of the remote AoE device.
  The netif attribute is the network interface on the localhost
  through which we are communicating with the remote AoE device.

  There is a script in this directory that formats this information in
  a convenient way, as shown below.  Users with aoetools should use
  the aoe-stat command instead::

    root@makki root# sh Documentation/admin-guide/aoe/status.sh
       e10.0            eth3              up
       e10.1            eth3              up
       e10.2            eth3              up
       e10.3            eth3              up
       e10.4            eth3              up
       e10.5            eth3              up
       e10.6            eth3              up
       e10.7            eth3              up
       e10.8            eth3              up
       e10.9            eth3              up
        e4.0            eth1              up
        e4.1            eth1              up
        e4.2            eth1              up
        e4.3            eth1              up
        e4.4            eth1              up
        e4.5            eth1              up
        e4.6            eth1              up
        e4.7            eth1              up
        e4.8            eth1              up
        e4.9            eth1              up

  Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
  option discussed below) instead of /dev/etherd/interfaces to limit
  AoE traffic to the network interfaces in the given
  whitespace-separated list.  Unlike the old character device, the
  sysfs entry can be read from as well as written to.

  It's helpful to trigger discovery after setting the list of allowed
  interfaces.  The aoetools package provides an aoe-discover script
  for this purpose.  You can also directly use the
  /dev/etherd/discover special file described above.

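  The two steps above (set the interface list, then trigger discovery)
  can be combined.  This is a sketch only: aoe_limit_ifaces and the
  AOE_IFLIST and AOE_DISCOVER variables are hypothetical names
  introduced here so the paths can be redirected for a dry run; the
  default paths are the ones named in the text.

```shell
#!/bin/sh
# Restrict AoE traffic to the given interfaces, read the list back,
# then ask the driver to rediscover devices.
# aoe_limit_ifaces, AOE_IFLIST and AOE_DISCOVER are hypothetical names.
aoe_limit_ifaces() {
    iflist=${AOE_IFLIST:-/sys/module/aoe/parameters/aoe_iflist}
    discover=${AOE_DISCOVER:-/dev/etherd/discover}
    if test ! -w "$iflist"; then
        echo "cannot write $iflist (is the aoe module loaded?)" 1>&2
        return 1
    fi
    echo "$*" > "$iflist"     # whitespace-separated interface names
    cat "$iflist"             # the sysfs entry can be read back
    if test -w "$discover"; then
        echo > "$discover"    # same effect as "echo > /dev/etherd/discover"
    fi
}

# Example (needs root and a loaded aoe module):
#   aoe_limit_ifaces eth1 eth3
```

  Users with aoetools installed would normally run aoe-discover for
  the second step rather than writing to the special file themselves.
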
Driver Options
==============

  There is a boot option for the built-in aoe driver and a
  corresponding module parameter, aoe_iflist.  Without this option,
  all network interfaces may be used for ATA over Ethernet.  Here is a
  usage example for the module parameter::

    modprobe aoe aoe_iflist="eth1 eth3"

  The aoe_deadsecs module parameter determines the maximum number of
  seconds that the driver will wait for an AoE device to provide a
  response to an AoE command.  After aoe_deadsecs seconds have
  elapsed, the AoE device will be marked as "down".  A value of zero
  is supported for testing purposes and makes the aoe driver keep
  trying AoE commands forever.

  The aoe_maxout module parameter has a default of 128.  This is the
  maximum number of unanswered packets that may be outstanding to an
  AoE target at one time.

  The aoe_dyndevs module parameter defaults to 1, meaning that the
  driver will assign a block device minor number to a discovered AoE
  target based on the order of its discovery.  With dynamic minor
  device numbers in use, a greater range of AoE shelf and slot
  addresses can be supported.  Users with udev will never have to
  think about minor numbers.  Using aoe_dyndevs=0 allows device nodes
  to be pre-created using a static minor-number scheme with the
  aoe-mkshelf script in the aoetools.
diff --git a/Documentation/admin-guide/aoe/autoload.sh b/Documentation/admin-guide/aoe/autoload.sh
new file mode 100644
index 000000000000..815dff4691c9
--- /dev/null
+++ b/Documentation/admin-guide/aoe/autoload.sh
@@ -0,0 +1,17 @@
#!/bin/sh
# set aoe to autoload by installing the
# aliases in /etc/modprobe.d/

f=/etc/modprobe.d/aoe.conf

if test ! -r $f || test ! -w $f; then
    echo "cannot configure $f for module autoloading" 1>&2
    exit 1
fi

grep major-152 $f >/dev/null
if [ $? = 1 ]; then
    echo alias block-major-152 aoe >> $f
    echo alias char-major-152 aoe >> $f
fi

diff --git a/Documentation/admin-guide/aoe/examples.rst b/Documentation/admin-guide/aoe/examples.rst
new file mode 100644
index 000000000000..91f3198e52c1
--- /dev/null
+++ b/Documentation/admin-guide/aoe/examples.rst
@@ -0,0 +1,23 @@
Example of udev rules
---------------------

  .. include:: udev.txt
     :literal:

Example of udev install rules script
------------------------------------

  .. literalinclude:: udev-install.sh
     :language: shell

Example script to get status
----------------------------

  .. literalinclude:: status.sh
     :language: shell

Example of AoE autoload script
------------------------------

  .. literalinclude:: autoload.sh
     :language: shell
diff --git a/Documentation/admin-guide/aoe/index.rst b/Documentation/admin-guide/aoe/index.rst
new file mode 100644
index 000000000000..d71c5df15922
--- /dev/null
+++ b/Documentation/admin-guide/aoe/index.rst
@@ -0,0 +1,17 @@
=======================
ATA over Ethernet (AoE)
=======================

.. toctree::
    :maxdepth: 1

    aoe
    todo
    examples

.. only::  subproject and html

   Indices
   =======

   * :ref:`genindex`
diff --git a/Documentation/admin-guide/aoe/status.sh b/Documentation/admin-guide/aoe/status.sh
new file mode 100644
index 000000000000..eeec7baae57a
--- /dev/null
+++ b/Documentation/admin-guide/aoe/status.sh
@@ -0,0 +1,30 @@
#! /bin/sh
# collate and present sysfs information about AoE storage
#
# A more complete version of this script is aoe-stat, in the
# aoetools.

set -e
format="%8s\t%8s\t%8s\n"
me=`basename $0`
sysd=${sysfs_dir:-/sys}

# printf "$format" device netif state

# Suse 9.1 Pro doesn't put /sys in /etc/mtab
#test -z "`mount | grep sysfs`" && {
test ! -d "$sysd/block" && {
    echo "$me Error: sysfs is not mounted" 1>&2
    exit 1
}

for d in `ls -d $sysd/block/etherd* 2>/dev/null | grep -v p` end; do
    # maybe ls comes up empty, so we use "end"
    test "$d" = end && continue

    dev=`echo "$d" | sed 's/.*!//'`
    printf "$format" \
        "$dev" \
        "`cat \"$d/netif\"`" \
        "`cat \"$d/state\"`"
done | sort
diff --git a/Documentation/admin-guide/aoe/todo.rst b/Documentation/admin-guide/aoe/todo.rst
new file mode 100644
index 000000000000..dea8db5a33e1
--- /dev/null
+++ b/Documentation/admin-guide/aoe/todo.rst
@@ -0,0 +1,17 @@
TODO
====

There is a potential for deadlock when allocating a struct sk_buff for
data that needs to be written out to aoe storage.  If the data is
being written from a dirty page in order to free that page, and if
there are no other pages available, then deadlock may occur when a
free page is needed for the sk_buff allocation.  This situation has
not been observed, but it would be nice to eliminate any potential for
deadlock under memory pressure.

Because ATA over Ethernet is not fragmented by the kernel's IP code,
the destructor member of the struct sk_buff is available to the aoe
driver.  By using a mempool for allocating all but the first few
sk_buffs, and by registering a destructor, we should be able to
efficiently allocate sk_buffs without introducing any potential for
deadlock.
diff --git a/Documentation/admin-guide/aoe/udev-install.sh b/Documentation/admin-guide/aoe/udev-install.sh
new file mode 100644
index 000000000000..15e86f58c036
--- /dev/null
+++ b/Documentation/admin-guide/aoe/udev-install.sh
@@ -0,0 +1,33 @@
# install the aoe-specific udev rules from udev.txt into
# the system's udev configuration
#

me="`basename $0`"

# find udev.conf, often /etc/udev/udev.conf
# (or environment can specify where to find udev.conf)
#
if test -z "$conf"; then
    if test -r /etc/udev/udev.conf; then
        conf=/etc/udev/udev.conf
    else
        conf="`find /etc -type f -name udev.conf 2> /dev/null`"
        if test -z "$conf" || test ! -r "$conf"; then
            echo "$me Error: no udev.conf found" 1>&2
            exit 1
        fi
    fi
fi

# find the directory where udev rules are stored, often
# /etc/udev/rules.d
#
rules_d="`sed -n '/^udev_rules=/{ s!udev_rules=!!; s!\"!!g; p; }' $conf`"
if test -z "$rules_d" ; then
    rules_d=/etc/udev/rules.d
fi
if test ! -d "$rules_d"; then
    echo "$me Error: cannot find udev rules directory" 1>&2
    exit 1
fi
sh -xc "cp `dirname $0`/udev.txt $rules_d/60-aoe.rules"
diff --git a/Documentation/admin-guide/aoe/udev.txt b/Documentation/admin-guide/aoe/udev.txt
new file mode 100644
index 000000000000..5fb756466bc7
--- /dev/null
+++ b/Documentation/admin-guide/aoe/udev.txt
@@ -0,0 +1,26 @@
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines.  Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
#
#   ecashin@makki ~$ su
#   Password:
#   bash# find /etc -type f -name udev.conf
#   /etc/udev/udev.conf
#   bash# grep udev_rules= /etc/udev/udev.conf
#   udev_rules="/etc/udev/rules.d/"
#   bash# ls /etc/udev/rules.d/
#   10-wacom.rules  50-udev.rules
#   bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
#           /etc/udev/rules.d/60-aoe.rules
#

# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"

# aoe block devices
KERNEL=="etherd*", GROUP="disk"
diff --git a/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
new file mode 100644
index 000000000000..f87cfa0dc2fb
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
@@ -0,0 +1,588 @@
1<?xml version="1.0" encoding="UTF-8" standalone="no"?>
2<!-- Created with Inkscape (http://www.inkscape.org/) -->
3<svg
4 xmlns:svg="http://www.w3.org/2000/svg"
5 xmlns="http://www.w3.org/2000/svg"
6 version="1.0"
7 width="210mm"
8 height="297mm"
9 viewBox="0 0 21000 29700"
10 id="svg2"
11 style="fill-rule:evenodd">
12 <defs
13 id="defs4" />
14 <g
15 id="Default"
16 style="visibility:visible">
17 <desc
18 id="desc180">Master slide</desc>
19 </g>
20 <path
21 d="M 11999,8601 L 11899,8301 L 12099,8301 L 11999,8601 z"
22 id="path193"
23 style="fill:#008000;visibility:visible" />
24 <path
25 d="M 11999,7801 L 11999,8361"
26 id="path197"
27 style="fill:none;stroke:#008000;visibility:visible" />
28 <path
29 d="M 7999,10401 L 7899,10101 L 8099,10101 L 7999,10401 z"
30 id="path209"
31 style="fill:#008000;visibility:visible" />
32 <path
33 d="M 7999,9601 L 7999,10161"
34 id="path213"
35 style="fill:none;stroke:#008000;visibility:visible" />
36 <path
37 d="M 11999,7801 L 11685,7840 L 11724,7644 L 11999,7801 z"
38 id="path225"
39 style="fill:#008000;visibility:visible" />
40 <path
41 d="M 7999,7001 L 11764,7754"
42 id="path229"
43 style="fill:none;stroke:#008000;visibility:visible" />
44 <g
45 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-1244.4792,1416.5139)"
46 id="g245"
47 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
48 <text
49 id="text247">
50 <tspan
51 x="9139 9368 9579 9808 9986 10075 10252 10481 10659 10837 10909"
52 y="9284"
53 id="tspan249">RSDataReply</tspan>
54 </text>
55 </g>
56 <path
57 d="M 7999,9601 L 8281,9458 L 8311,9655 L 7999,9601 z"
58 id="path259"
59 style="fill:#008000;visibility:visible" />
60 <path
61 d="M 11999,9001 L 8236,9565"
62 id="path263"
63 style="fill:none;stroke:#008000;visibility:visible" />
64 <g
65 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,1620.9382,-1639.4947)"
66 id="g279"
67 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
68 <text
69 id="text281">
70 <tspan
71 x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
72 y="7023"
73 id="tspan283">CsumRSRequest</tspan>
74 </text>
75 </g>
76 <text
77 id="text297"
78 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
79 <tspan
80 x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
81 y="5707"
82 id="tspan299">w_make_resync_request()</tspan>
83 </text>
84 <text
85 id="text313"
86 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
87 <tspan
88 x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
89 y="7806"
90 id="tspan315">receive_DataRequest()</tspan>
91 </text>
92 <text
93 id="text329"
94 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
95 <tspan
96 x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
97 y="8606"
98 id="tspan331">drbd_endio_read_sec()</tspan>
99 </text>
100 <text
101 id="text345"
102 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
103 <tspan
104 x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
105 y="9007"
106 id="tspan347">w_e_end_csum_rs_req()</tspan>
107 </text>
108 <text
109 id="text361"
110 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
111 <tspan
112 x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
113 y="9507"
114 id="tspan363">receive_RSDataReply()</tspan>
115 </text>
116 <text
117 id="text377"
118 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
119 <tspan
120 x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
121 y="10407"
122 id="tspan379">drbd_endio_write_sec()</tspan>
123 </text>
124 <text
125 id="text393"
126 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
127 <tspan
128 x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
129 y="10907"
130 id="tspan395">e_end_resync_block()</tspan>
131 </text>
132 <path
133 d="M 11999,11601 L 11685,11640 L 11724,11444 L 11999,11601 z"
134 id="path405"
135 style="fill:#000080;visibility:visible" />
136 <path
137 d="M 7999,10801 L 11764,11554"
138 id="path409"
139 style="fill:none;stroke:#000080;visibility:visible" />
140 <g
141 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,2434.7562,-1674.649)"
142 id="g425"
143 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
144 <text
145 id="text427">
146 <tspan
147 x="9320 9621 9726 9798 9887 10065 10277 10438"
148 y="10943"
149 id="tspan429">WriteAck</tspan>
150 </text>
151 </g>
152 <text
153 id="text443"
154 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
155 <tspan
156 x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
157 y="11559"
158 id="tspan445">got_BlockAck()</tspan>
159 </text>
160 <text
161 id="text459"
162 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
163 <tspan
164 x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14302 14540 14658 14777 14870 15107 15225 15437 15649 15886"
165 y="4877"
166 id="tspan461">Checksum based Resync, case not in sync</tspan>
167 </text>
168 <text
169 id="text475"
170 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
171 <tspan
172 x="6961 7266 7571 7854 8159 8299 8536 8654 8891 9010 9247 9484 9603 9840 9958 10077 10170 10407"
173 y="2806"
174 id="tspan477">DRBD-8.3 data flow</tspan>
175 </text>
176 <text
177 id="text491"
178 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
179 <tspan
180 x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
181 y="7005"
182 id="tspan493">w_e_send_csum()</tspan>
183 </text>
184 <path
185 d="M 11999,17601 L 11899,17301 L 12099,17301 L 11999,17601 z"
186 id="path503"
187 style="fill:#008000;visibility:visible" />
188 <path
189 d="M 11999,16801 L 11999,17361"
190 id="path507"
191 style="fill:none;stroke:#008000;visibility:visible" />
192 <path
193 d="M 11999,16801 L 11685,16840 L 11724,16644 L 11999,16801 z"
194 id="path519"
195 style="fill:#008000;visibility:visible" />
196 <path
197 d="M 7999,16001 L 11764,16754"
198 id="path523"
199 style="fill:none;stroke:#008000;visibility:visible" />
200 <g
201 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-2539.5806,1529.3491)"
202 id="g539"
203 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
204 <text
205 id="text541">
206 <tspan
207 x="9269 9498 9709 9798 9959 10048 10226 10437 10598 10776"
208 y="18265"
209 id="tspan543">RSIsInSync</tspan>
210 </text>
211 </g>
212 <path
213 d="M 7999,18601 L 8281,18458 L 8311,18655 L 7999,18601 z"
214 id="path553"
215 style="fill:#000080;visibility:visible" />
216 <path
217 d="M 11999,18001 L 8236,18565"
218 id="path557"
219 style="fill:none;stroke:#000080;visibility:visible" />
220 <g
221 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,3461.4027,-1449.3012)"
222 id="g573"
223 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
224 <text
225 id="text575">
226 <tspan
227 x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
228 y="16023"
229 id="tspan577">CsumRSRequest</tspan>
230 </text>
231 </g>
232 <text
233 id="text591"
234 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
235 <tspan
236 x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
237 y="16806"
238 id="tspan593">receive_DataRequest()</tspan>
239 </text>
240 <text
241 id="text607"
242 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
243 <tspan
244 x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
245 y="17606"
246 id="tspan609">drbd_endio_read_sec()</tspan>
247 </text>
248 <text
249 id="text623"
250 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
251 <tspan
252 x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
253 y="18007"
254 id="tspan625">w_e_end_csum_rs_req()</tspan>
255 </text>
256 <text
257 id="text639"
258 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
259 <tspan
260 x="5735 5913 6091 6180 6357 6446 6607 6696 6874 7085 7246 7424 7585 7691"
261 y="18507"
262 id="tspan641">got_IsInSync()</tspan>
263 </text>
264 <text
265 id="text655"
266 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
267 <tspan
268 x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14159 14396 14514 14726 14937 15175"
269 y="13877"
270 id="tspan657">Checksum based Resync, case in sync</tspan>
271 </text>
272 <path
273 d="M 12000,24601 L 11900,24301 L 12100,24301 L 12000,24601 z"
274 id="path667"
275 style="fill:#008000;visibility:visible" />
276 <path
277 d="M 12000,23801 L 12000,24361"
278 id="path671"
279 style="fill:none;stroke:#008000;visibility:visible" />
280 <path
281 d="M 8000,26401 L 7900,26101 L 8100,26101 L 8000,26401 z"
282 id="path683"
283 style="fill:#008000;visibility:visible" />
284 <path
285 d="M 8000,25601 L 8000,26161"
286 id="path687"
287 style="fill:none;stroke:#008000;visibility:visible" />
288 <path
289 d="M 12000,23801 L 11686,23840 L 11725,23644 L 12000,23801 z"
290 id="path699"
291 style="fill:#008000;visibility:visible" />
292 <path
293 d="M 8000,23001 L 11765,23754"
294 id="path703"
295 style="fill:none;stroke:#008000;visibility:visible" />
296 <g
297 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-3543.8452,1630.5143)"
298 id="g719"
299 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
300 <text
301 id="text721">
302 <tspan
303 x="9464 9710 9921 10150 10328 10505 10577"
304 y="25236"
305 id="tspan723">OVReply</tspan>
306 </text>
307 </g>
308 <path
309 d="M 8000,25601 L 8282,25458 L 8312,25655 L 8000,25601 z"
310 id="path733"
311 style="fill:#008000;visibility:visible" />
312 <path
313 d="M 12000,25001 L 8237,25565"
314 id="path737"
315 style="fill:none;stroke:#008000;visibility:visible" />
316 <g
317 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,4918.2801,-1381.2128)"
318 id="g753"
319 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
320 <text
321 id="text755">
322 <tspan
323 x="9142 9388 9599 9828 10006 10183 10361 10539 10700"
324 y="23106"
325 id="tspan757">OVRequest</tspan>
326 </text>
327 </g>
328 <text
329 id="text771"
330 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
331 <tspan
332 x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13656 13868 14097 14274 14452 14630 14808 14969 15058 15163"
333 y="23806"
334 id="tspan773">receive_OVRequest()</tspan>
335 </text>
336 <text
337 id="text787"
338 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
339 <tspan
340 x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
341 y="24606"
342 id="tspan789">drbd_endio_read_sec()</tspan>
343 </text>
344 <text
345 id="text803"
346 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
347 <tspan
348 x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14004 14182 14288 14465 14643 14749"
349 y="25007"
350 id="tspan805">w_e_end_ov_req()</tspan>
351 </text>
352 <text
353 id="text819"
354 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
355 <tspan
356 x="5101 5207 5385 5546 5723 5795 5956 6134 6312 6557 6769 6998 7175 7353 7425 7586 7692"
357 y="25507"
358 id="tspan821">receive_OVReply()</tspan>
359 </text>
360 <text
361 id="text835"
362 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
363 <tspan
364 x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
365 y="26407"
366 id="tspan837">drbd_endio_read_sec()</tspan>
367 </text>
368 <text
369 id="text851"
370 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
371 <tspan
372 x="4902 5131 5308 5486 5664 5842 6020 6197 6375 6553 6714 6892 6998 7175 7353 7425 7586 7692"
373 y="26907"
374 id="tspan853">w_e_end_ov_reply()</tspan>
375 </text>
376 <path
377 d="M 12000,27601 L 11686,27640 L 11725,27444 L 12000,27601 z"
378 id="path863"
379 style="fill:#000080;visibility:visible" />
380 <path
381 d="M 8000,26801 L 11765,27554"
382 id="path867"
383 style="fill:none;stroke:#000080;visibility:visible" />
384 <g
385 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,5704.1907,-1328.312)"
386 id="g883"
387 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
388 <text
389 id="text885">
390 <tspan
391 x="9279 9525 9736 9965 10143 10303 10481 10553"
392 y="26935"
393 id="tspan887">OVResult</tspan>
394 </text>
395 </g>
396 <text
397 id="text901"
398 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
399 <tspan
400 x="12200 12378 12556 12645 12822 13068 13280 13508 13686 13847 14025 14097 14185 14291"
401 y="27559"
402 id="tspan903">got_OVResult()</tspan>
403 </text>
404 <text
405 id="text917"
406 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
407 <tspan
408 x="8000 8330 8567 8660 8754 8991 9228 9346 9558 9795 9935 10028 10146"
409 y="21877"
410 id="tspan919">Online verify</tspan>
411 </text>
412 <text
413 id="text933"
414 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
415 <tspan
416 x="4641 4870 5047 5310 5488 5649 5826 6004 6182 6343 6521 6626 6804 6982 7160 7338 7499 7587 7693"
417 y="23005"
418 id="tspan935">w_make_ov_request()</tspan>
419 </text>
420 <path
421 d="M 8000,6500 L 7900,6200 L 8100,6200 L 8000,6500 z"
422 id="path945"
423 style="fill:#008000;visibility:visible" />
424 <path
425 d="M 8000,5700 L 8000,6260"
426 id="path949"
427 style="fill:none;stroke:#008000;visibility:visible" />
428 <path
429 d="M 3900,5500 L 3700,5500 L 3700,11000 L 3900,11000"
430 id="path961"
431 style="fill:none;stroke:#000000;visibility:visible" />
432 <path
433 d="M 3900,14500 L 3700,14500 L 3700,18600 L 3900,18600"
434 id="path973"
435 style="fill:none;stroke:#000000;visibility:visible" />
436 <path
437 d="M 3900,22800 L 3700,22800 L 3700,26900 L 3900,26900"
438 id="path985"
439 style="fill:none;stroke:#000000;visibility:visible" />
440 <text
441 id="text1001"
442 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
443 <tspan
444 x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
445 y="6506"
446 id="tspan1003">drbd_endio_read_sec()</tspan>
447 </text>
448 <text
449 id="text1017"
450 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
451 <tspan
452 x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
453 y="14708"
454 id="tspan1019">w_make_resync_request()</tspan>
455 </text>
456 <text
457 id="text1033"
458 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
459 <tspan
460 x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
461 y="16006"
462 id="tspan1035">w_e_send_csum()</tspan>
463 </text>
464 <path
465 d="M 8000,15501 L 7900,15201 L 8100,15201 L 8000,15501 z"
466 id="path1045"
467 style="fill:#008000;visibility:visible" />
468 <path
469 d="M 8000,14701 L 8000,15261"
470 id="path1049"
471 style="fill:none;stroke:#008000;visibility:visible" />
472 <text
473 id="text1065"
474 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
475 <tspan
476 x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
477 y="15507"
478 id="tspan1067">drbd_endio_read_sec()</tspan>
479 </text>
480 <path
481 d="M 16100,9000 L 16300,9000 L 16300,7500 L 16100,7500"
482 id="path1077"
483 style="fill:none;stroke:#000000;visibility:visible" />
484 <path
485 d="M 16100,18000 L 16300,18000 L 16300,16500 L 16100,16500"
486 id="path1089"
487 style="fill:none;stroke:#000000;visibility:visible" />
488 <path
489 d="M 16100,25000 L 16300,25000 L 16300,23500 L 16100,23500"
490 id="path1101"
491 style="fill:none;stroke:#000000;visibility:visible" />
492 <text
493 id="text1117"
494 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
495 <tspan
496 x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
497 y="5402"
498 id="tspan1119">rs_begin_io()</tspan>
499 </text>
500 <text
501 id="text1133"
502 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
503 <tspan
504 x="2027 2133 2294 2472 2649 2827 3005 3077 3255 3432 3504 3682 3788"
505 y="14402"
506 id="tspan1135">rs_begin_io()</tspan>
507 </text>
508 <text
509 id="text1149"
510 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
511 <tspan
512 x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
513 y="22602"
514 id="tspan1151">rs_begin_io()</tspan>
515 </text>
516 <text
517 id="text1165"
518 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
519 <tspan
520 x="1426 1532 1693 1871 2031 2209 2472 2649 2721 2899 2988 3166 3344 3416 3593 3699"
521 y="11302"
522 id="tspan1167">rs_complete_io()</tspan>
523 </text>
524 <text
525 id="text1181"
526 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
527 <tspan
528 x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
529 y="18931"
530 id="tspan1183">rs_complete_io()</tspan>
531 </text>
532 <text
533 id="text1197"
534 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
535 <tspan
536 x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
537 y="27231"
538 id="tspan1199">rs_complete_io()</tspan>
539 </text>
540 <text
541 id="text1213"
542 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
543 <tspan
544 x="16126 16232 16393 16571 16748 16926 17104 17176 17354 17531 17603 17781 17887"
545 y="7402"
546 id="tspan1215">rs_begin_io()</tspan>
547 </text>
548 <text
549 id="text1229"
550 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
551 <tspan
552 x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
553 y="16331"
554 id="tspan1231">rs_begin_io()</tspan>
555 </text>
556 <text
557 id="text1245"
558 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
559 <tspan
560 x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
561 y="23302"
562 id="tspan1247">rs_begin_io()</tspan>
563 </text>
564 <text
565 id="text1261"
566 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
567 <tspan
568 x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
569 y="9302"
570 id="tspan1263">rs_complete_io()</tspan>
571 </text>
572 <text
573 id="text1277"
574 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
575 <tspan
576 x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
577 y="18331"
578 id="tspan1279">rs_complete_io()</tspan>
579 </text>
580 <text
581 id="text1293"
582 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
583 <tspan
584 x="16126 16232 16393 16571 16731 16909 17172 17349 17421 17599 17688 17866 18044 18116 18293 18399"
585 y="25302"
586 id="tspan1295">rs_complete_io()</tspan>
587 </text>
588</svg>
diff --git a/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
new file mode 100644
index 000000000000..48a1e2165fec
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
@@ -0,0 +1,459 @@
1<?xml version="1.0" encoding="UTF-8" standalone="no"?>
2<!-- Created with Inkscape (http://www.inkscape.org/) -->
3<svg
4 xmlns:svg="http://www.w3.org/2000/svg"
5 xmlns="http://www.w3.org/2000/svg"
6 version="1.0"
7 width="210mm"
8 height="297mm"
9 viewBox="0 0 21000 29700"
10 id="svg2"
11 style="fill-rule:evenodd">
12 <defs
13 id="defs4" />
14 <g
15 id="Default"
16 style="visibility:visible">
17 <desc
18 id="desc176">Master slide</desc>
19 </g>
20 <path
21 d="M 11999,19601 L 11899,19301 L 12099,19301 L 11999,19601 z"
22 id="path189"
23 style="fill:#008000;visibility:visible" />
24 <path
25 d="M 11999,18801 L 11999,19361"
26 id="path193"
27 style="fill:none;stroke:#008000;visibility:visible" />
28 <path
29 d="M 7999,21401 L 7899,21101 L 8099,21101 L 7999,21401 z"
30 id="path205"
31 style="fill:#008000;visibility:visible" />
32 <path
33 d="M 7999,20601 L 7999,21161"
34 id="path209"
35 style="fill:none;stroke:#008000;visibility:visible" />
36 <path
37 d="M 11999,18801 L 11685,18840 L 11724,18644 L 11999,18801 z"
38 id="path221"
39 style="fill:#008000;visibility:visible" />
40 <path
41 d="M 7999,18001 L 11764,18754"
42 id="path225"
43 style="fill:none;stroke:#008000;visibility:visible" />
44 <text
45 x="-3023.845"
46 y="1106.8124"
47 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
48 id="text243"
49 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
50 <tspan
51 x="6115.1553 6344.1553 6555.1553 6784.1553 6962.1553 7051.1553 7228.1553 7457.1553 7635.1553 7813.1553 7885.1553"
52 y="21390.812"
53 id="tspan245">RSDataReply</tspan>
54 </text>
55 <path
56 d="M 7999,20601 L 8281,20458 L 8311,20655 L 7999,20601 z"
57 id="path255"
58 style="fill:#008000;visibility:visible" />
59 <path
60 d="M 11999,20001 L 8236,20565"
61 id="path259"
62 style="fill:none;stroke:#008000;visibility:visible" />
63 <text
64 x="3502.5356"
65 y="-2184.6621"
66 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
67 id="text277"
68 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
69 <tspan
70 x="12321.536 12550.536 12761.536 12990.536 13168.536 13257.536 13434.536 13663.536 13841.536 14019.536 14196.536 14374.536 14535.536"
71 y="15854.338"
72 id="tspan279">RSDataRequest</tspan>
73 </text>
74 <text
75 id="text293"
76 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
77 <tspan
78 x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
79 y="17807"
80 id="tspan295">w_make_resync_request()</tspan>
81 </text>
82 <text
83 id="text309"
84 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
85 <tspan
86 x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
87 y="18806"
88 id="tspan311">receive_DataRequest()</tspan>
89 </text>
90 <text
91 id="text325"
92 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
93 <tspan
94 x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
95 y="19606"
96 id="tspan327">drbd_endio_read_sec()</tspan>
97 </text>
98 <text
99 id="text341"
100 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
101 <tspan
102 x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13770 13931 14109 14287 14375 14553 14731 14837 15015 15192 15298"
103 y="20007"
104 id="tspan343">w_e_end_rsdata_req()</tspan>
105 </text>
106 <text
107 id="text357"
108 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
109 <tspan
110 x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
111 y="20507"
112 id="tspan359">receive_RSDataReply()</tspan>
113 </text>
114 <text
115 id="text373"
116 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
117 <tspan
118 x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
119 y="21407"
120 id="tspan375">drbd_endio_write_sec()</tspan>
121 </text>
122 <text
123 id="text389"
124 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
125 <tspan
126 x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
127 y="21907"
128 id="tspan391">e_end_resync_block()</tspan>
129 </text>
130 <path
131 d="M 11999,22601 L 11685,22640 L 11724,22444 L 11999,22601 z"
132 id="path401"
133 style="fill:#000080;visibility:visible" />
134 <path
135 d="M 7999,21801 L 11764,22554"
136 id="path405"
137 style="fill:none;stroke:#000080;visibility:visible" />
138 <text
139 x="4290.3008"
140 y="-2369.6162"
141 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
142 id="text423"
143 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
144 <tspan
145 x="13610.301 13911.301 14016.301 14088.301 14177.301 14355.301 14567.301 14728.301"
146 y="19573.385"
147 id="tspan425">WriteAck</tspan>
148 </text>
149 <text
150 id="text439"
151 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
152 <tspan
153 x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
154 y="22559"
155 id="tspan441">got_BlockAck()</tspan>
156 </text>
157 <text
158 id="text455"
159 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
160 <tspan
161 x="7999 8304 8541 8753 8964 9201 9413 9531 9769 9862 10099 10310 10522 10734 10852 10971 11208 11348 11585 11822"
162 y="16877"
163 id="tspan457">Resync blocks, 4-32K</tspan>
164 </text>
165 <path
166 d="M 12000,7601 L 11900,7301 L 12100,7301 L 12000,7601 z"
167 id="path467"
168 style="fill:#008000;visibility:visible" />
169 <path
170 d="M 12000,6801 L 12000,7361"
171 id="path471"
172 style="fill:none;stroke:#008000;visibility:visible" />
173 <path
174 d="M 12000,6801 L 11686,6840 L 11725,6644 L 12000,6801 z"
175 id="path483"
176 style="fill:#008000;visibility:visible" />
177 <path
178 d="M 8000,6001 L 11765,6754"
179 id="path487"
180 style="fill:none;stroke:#008000;visibility:visible" />
181 <text
182 x="-1288.1796"
183 y="1279.7666"
184 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
185 id="text505"
186 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
187 <tspan
188 x="8174.8208 8475.8203 8580.8203 8652.8203 8741.8203 8919.8203 9131.8203 9292.8203"
189 y="9516.7666"
190 id="tspan507">WriteAck</tspan>
191 </text>
192 <path
193 d="M 8000,8601 L 8282,8458 L 8312,8655 L 8000,8601 z"
194 id="path517"
195 style="fill:#000080;visibility:visible" />
196 <path
197 d="M 12000,8001 L 8237,8565"
198 id="path521"
199 style="fill:none;stroke:#000080;visibility:visible" />
200 <text
201 x="1065.6655"
202 y="-2097.7664"
203 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
204 id="text539"
205 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
206 <tspan
207 x="10682.666 10911.666 11088.666 11177.666"
208 y="4107.2339"
209 id="tspan541">Data</tspan>
210 </text>
211 <text
212 id="text555"
213 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
214 <tspan
215 x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
216 y="5505"
217 id="tspan557">drbd_make_request()</tspan>
218 </text>
219 <text
220 id="text571"
221 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
222 <tspan
223 x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14190"
224 y="6806"
225 id="tspan573">receive_Data()</tspan>
226 </text>
227 <text
228 id="text587"
229 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
230 <tspan
231 x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14207 14312 14384 14473 14651 14829 14990 15168 15328 15434"
232 y="7606"
233 id="tspan589">drbd_endio_write_sec()</tspan>
234 </text>
235 <text
236 id="text603"
237 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
238 <tspan
239 x="12192 12370 12548 12725 12903 13081 13259 13437 13509 13686 13847 14008 14114"
240 y="8007"
241 id="tspan605">e_end_block()</tspan>
242 </text>
243 <text
244 id="text619"
245 style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
246 <tspan
247 x="5647 5825 6003 6092 6269 6481 6553 6731 6892 7052 7264 7425 7586 7692"
248 y="8606"
249 id="tspan621">got_BlockAck()</tspan>
250 </text>
251 <text
252 id="text635"
253 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
254 <tspan
255 x="8000 8305 8542 8779 9016 9109 9346 9486 9604 9956 10049 10189 10328 10565 10705 10942 11179 11298 11603 11742 11835 11954 12191 12310 12428 12665 12902 13139 13279 13516 13753"
256 y="4877"
257 id="tspan637">Regular mirrored write, 512-32K</tspan>
258 </text>
259 <text
260 id="text651"
261 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
262 <tspan
263 x="5381 5610 5787 5948 6126 6304 6482 6659 6837 7015 7087 7265 7426 7587 7692"
264 y="6003"
265 id="tspan653">w_send_dblock()</tspan>
266 </text>
267 <path
268 d="M 8000,6800 L 7900,6500 L 8100,6500 L 8000,6800 z"
269 id="path663"
270 style="fill:#008000;visibility:visible" />
271 <path
272 d="M 8000,6000 L 8000,6560"
273 id="path667"
274 style="fill:none;stroke:#008000;visibility:visible" />
275 <text
276 id="text683"
277 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
278 <tspan
279 x="4602 4780 4886 5063 5241 5419 5597 5775 5952 6024 6202 6380 6609 6714 6786 6875 7053 7231 7409 7515 7587 7692"
280 y="6905"
281 id="tspan685">drbd_endio_write_pri()</tspan>
282 </text>
283 <path
284 d="M 12000,13602 L 11900,13302 L 12100,13302 L 12000,13602 z"
285 id="path695"
286 style="fill:#008000;visibility:visible" />
287 <path
288 d="M 12000,12802 L 12000,13362"
289 id="path699"
290 style="fill:none;stroke:#008000;visibility:visible" />
291 <path
292 d="M 12000,12802 L 11686,12841 L 11725,12645 L 12000,12802 z"
293 id="path711"
294 style="fill:#008000;visibility:visible" />
295 <path
296 d="M 8000,12002 L 11765,12755"
297 id="path715"
298 style="fill:none;stroke:#008000;visibility:visible" />
299 <text
300 x="-2155.5266"
301 y="1201.5964"
302 transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
303 id="text733"
304 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
305 <tspan
306 x="7202.4736 7431.4736 7608.4736 7697.4736 7875.4736 8104.4736 8282.4736 8459.4736 8531.4736"
307 y="15454.597"
308 id="tspan735">DataReply</tspan>
309 </text>
310 <path
311 d="M 8000,14602 L 8282,14459 L 8312,14656 L 8000,14602 z"
312 id="path745"
313 style="fill:#008000;visibility:visible" />
314 <path
315 d="M 12000,14002 L 8237,14566"
316 id="path749"
317 style="fill:none;stroke:#008000;visibility:visible" />
318 <text
319 x="2280.3804"
320 y="-2103.2141"
321 transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
322 id="text767"
323 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
324 <tspan
325 x="11316.381 11545.381 11722.381 11811.381 11989.381 12218.381 12396.381 12573.381 12751.381 12929.381 13090.381"
326 y="9981.7861"
327 id="tspan769">DataRequest</tspan>
328 </text>
329 <text
330 id="text783"
331 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
332 <tspan
333 x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
334 y="11506"
335 id="tspan785">drbd_make_request()</tspan>
336 </text>
337 <text
338 id="text799"
339 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
340 <tspan
341 x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14312 14490 14668 14846 15024 15185 15273 15379"
342 y="12807"
343 id="tspan801">receive_DataRequest()</tspan>
344 </text>
345 <text
346 id="text815"
347 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
348 <tspan
349 x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
350 y="13607"
351 id="tspan817">drbd_endio_read_sec()</tspan>
352 </text>
353 <text
354 id="text831"
355 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
356 <tspan
357 x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14021 14110 14288 14465 14571 14749 14927 15033"
358 y="14008"
359 id="tspan833">w_e_end_data_req()</tspan>
360 </text>
361 <g
362 id="g835"
363 style="visibility:visible">
364 <desc
365 id="desc837">Drawing</desc>
366 <text
367 id="text847"
368 style="font-size:318px;font-weight:400;fill:#008000;font-family:Helvetica embedded">
369 <tspan
370 x="4885 4991 5169 5330 5507 5579 5740 5918 6096 6324 6502 6591 6769 6997 7175 7353 7425 7586 7692"
371 y="14607"
372 id="tspan849">receive_DataReply()</tspan>
373 </text>
374 </g>
375 <text
376 id="text863"
377 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
378 <tspan
379 x="8000 8305 8398 8610 8821 8914 9151 9363 9575 9693 9833 10070 10307 10544 10663 10781 11018 11255 11493 11632 11869 12106"
380 y="10878"
381 id="tspan865">Diskless read, 512-32K</tspan>
382 </text>
383 <text
384 id="text879"
385 style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
386 <tspan
387 x="5029 5258 5435 5596 5774 5952 6130 6307 6413 6591 6769 6947 7125 7230 7408 7586 7692"
388 y="12004"
389 id="tspan881">w_send_read_req()</tspan>
390 </text>
391 <text
392 id="text895"
393 style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
394 <tspan
395 x="6961 7266 7571 7854 8159 8278 8515 8633 8870 9107 9226 9463 9581 9700 9793 10030"
396 y="2806"
397 id="tspan897">DRBD 8 data flow</tspan>
398 </text>
399 <path
400 d="M 3900,5300 L 3700,5300 L 3700,7000 L 3900,7000"
401 id="path907"
402 style="fill:none;stroke:#000000;visibility:visible" />
403 <path
404 d="M 3900,17600 L 3700,17600 L 3700,22000 L 3900,22000"
405 id="path919"
406 style="fill:none;stroke:#000000;visibility:visible" />
407 <path
408 d="M 16100,20000 L 16300,20000 L 16300,18500 L 16100,18500"
409 id="path931"
410 style="fill:none;stroke:#000000;visibility:visible" />
411 <text
412 id="text947"
413 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
414 <tspan
415 x="2126 2304 2376 2554 2731 2909 3087 3159 3337 3515 3587 3764 3870"
416 y="5202"
417 id="tspan949">al_begin_io()</tspan>
418 </text>
419 <text
420 id="text963"
421 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
422 <tspan
423 x="1632 1810 1882 2060 2220 2398 2661 2839 2910 3088 3177 3355 3533 3605 3783 3888"
424 y="7331"
425 id="tspan965">al_complete_io()</tspan>
426 </text>
427 <text
428 id="text979"
429 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
430 <tspan
431 x="2126 2232 2393 2571 2748 2926 3104 3176 3354 3531 3603 3781 3887"
432 y="17431"
433 id="tspan981">rs_begin_io()</tspan>
434 </text>
435 <text
436 id="text995"
437 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
438 <tspan
439 x="1626 1732 1893 2071 2231 2409 2672 2849 2921 3099 3188 3366 3544 3616 3793 3899"
440 y="22331"
441 id="tspan997">rs_complete_io()</tspan>
442 </text>
443 <text
444 id="text1011"
445 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
446 <tspan
447 x="16027 16133 16294 16472 16649 16827 17005 17077 17255 17432 17504 17682 17788"
448 y="18402"
449 id="tspan1013">rs_begin_io()</tspan>
450 </text>
451 <text
452 id="text1027"
453 style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
454 <tspan
455 x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
456 y="20331"
457 id="tspan1029">rs_complete_io()</tspan>
458 </text>
459</svg>
diff --git a/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot b/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
new file mode 100644
index 000000000000..025e8cf5e64a
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
@@ -0,0 +1,18 @@
1digraph conn_states {
2 StandAlone -> WFConnection [ label = "ioctl_set_net()" ]
3 WFConnection -> Unconnected [ label = "unable to bind()" ]
4 WFConnection -> WFReportParams [ label = "in connect() after accept" ]
5 WFReportParams -> StandAlone [ label = "checks in receive_param()" ]
6 WFReportParams -> Connected [ label = "in receive_param()" ]
7 WFReportParams -> WFBitMapS [ label = "sync_handshake()" ]
8 WFReportParams -> WFBitMapT [ label = "sync_handshake()" ]
9 WFBitMapS -> SyncSource [ label = "receive_bitmap()" ]
10 WFBitMapT -> SyncTarget [ label = "receive_bitmap()" ]
11 SyncSource -> Connected
12 SyncTarget -> Connected
13 SyncSource -> PausedSyncS
14 SyncTarget -> PausedSyncT
15 PausedSyncS -> SyncSource
16 PausedSyncT -> SyncTarget
17 Connected -> WFConnection [ label = "* on network error" ]
18}
diff --git a/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst b/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
new file mode 100644
index 000000000000..66036b901644
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
@@ -0,0 +1,42 @@
1================================
2kernel data structure for DRBD-9
3================================
4
5This describes the in-kernel data structure for DRBD-9. Starting with
6Linux v3.14, DRBD is being reorganized to use this data structure.
7
8Basic Data Structure
9====================
10
11A node has a number of DRBD resources. Each such resource has a number of
12devices (aka volumes) and connections to other nodes ("peer nodes"). Each DRBD
13device is represented by a block device locally.
14
15The DRBD objects are interconnected to form a matrix as depicted below; a
16drbd_peer_device object sits at each intersection between a drbd_device and a
17drbd_connection::
18
19 /--------------+---------------+.....+---------------\
20 | resource | device | | device |
21 +--------------+---------------+.....+---------------+
22 | connection | peer_device | | peer_device |
23 +--------------+---------------+.....+---------------+
24 : : : : :
25 : : : : :
26 +--------------+---------------+.....+---------------+
27 | connection | peer_device | | peer_device |
28 \--------------+---------------+.....+---------------/
29
30In this table, horizontally, devices can be accessed from resources by their
31volume number. Likewise, peer_devices can be accessed from connections by
32their volume number. Objects in the vertical direction are connected by doubly
33linked lists. There are back pointers from peer_devices to their connections and
34devices, and from connections and devices to their resource.
35
36All resources are in the drbd_resources doubly linked list. In addition, all
37devices can be accessed by their minor device number via the drbd_devices idr.
38
39The drbd_resource, drbd_connection, and drbd_device objects are reference
40counted. The peer_device objects only serve to establish the links between
41devices and connections; their lifetime is determined by the lifetime of the
42device and connection which they reference.
diff --git a/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot b/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
new file mode 100644
index 000000000000..d06cfb46fb98
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
@@ -0,0 +1,16 @@
1digraph disk_states {
2 Diskless -> Inconsistent [ label = "ioctl_set_disk()" ]
3 Diskless -> Consistent [ label = "ioctl_set_disk()" ]
4 Diskless -> Outdated [ label = "ioctl_set_disk()" ]
5 Consistent -> Outdated [ label = "receive_param()" ]
6 Consistent -> UpToDate [ label = "receive_param()" ]
7 Consistent -> Inconsistent [ label = "start resync" ]
8 Outdated -> Inconsistent [ label = "start resync" ]
9 UpToDate -> Inconsistent [ label = "ioctl_replicate" ]
10 Inconsistent -> UpToDate [ label = "resync completed" ]
11 Consistent -> Failed [ label = "io completion error" ]
12 Outdated -> Failed [ label = "io completion error" ]
13 UpToDate -> Failed [ label = "io completion error" ]
14 Inconsistent -> Failed [ label = "io completion error" ]
15 Failed -> Diskless [ label = "sending notify to peer" ]
16}
diff --git a/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot b/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
new file mode 100644
index 000000000000..6d9cf0a7b11d
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
@@ -0,0 +1,85 @@
1// vim: set sw=2 sts=2 :
2digraph {
3 rankdir=BT
4 bgcolor=white
5
6 node [shape=plaintext]
7 node [fontcolor=black]
8
9 StandAlone [ style=filled,fillcolor=gray,label=StandAlone ]
10
11 node [fontcolor=lightgray]
12
13 Unconnected [ label=Unconnected ]
14
15 CommTrouble [ shape=record,
16 label="{communication loss|{Timeout|BrokenPipe|NetworkFailure}}" ]
17
18 node [fontcolor=gray]
19
20 subgraph cluster_try_connect {
21 label="try to connect, handshake"
22 rank=max
23 WFConnection [ label=WFConnection ]
24 WFReportParams [ label=WFReportParams ]
25 }
26
27 TearDown [ label=TearDown ]
28
29 Connected [ label=Connected,style=filled,fillcolor=green,fontcolor=black ]
30
31 node [fontcolor=lightblue]
32
33 StartingSyncS [ label=StartingSyncS ]
34 StartingSyncT [ label=StartingSyncT ]
35
36 subgraph cluster_bitmap_exchange {
37 node [fontcolor=red]
38 fontcolor=red
39 label="new application (WRITE?) requests blocked\lwhile bitmap is exchanged"
40
41 WFBitMapT [ label=WFBitMapT ]
42 WFSyncUUID [ label=WFSyncUUID ]
43 WFBitMapS [ label=WFBitMapS ]
44 }
45
46 node [fontcolor=blue]
47
48 cluster_resync [ shape=record,label="{<any>resynchronisation process running\l'concurrent' application requests allowed|{{<T>PausedSyncT\nSyncTarget}|{<S>PausedSyncS\nSyncSource}}}" ]
49
50 node [shape=box,fontcolor=black]
51
52 // drbdadm [label="drbdadm connect"]
53 // handshake [label="drbd_connect()\ndrbd_do_handshake\ndrbd_sync_handshake() etc."]
54 // comm_error [label="communication trouble"]
55
56 //
57 // edges
58 // --------------------------------------
59
60 StandAlone -> Unconnected [ label="drbdadm connect" ]
61 Unconnected -> StandAlone [ label="drbdadm disconnect\lor serious communication trouble" ]
62 Unconnected -> WFConnection [ label="receiver thread is started" ]
63 WFConnection -> WFReportParams [ headlabel="accept()\land/or \lconnect()\l" ]
64
65 WFReportParams -> StandAlone [ label="during handshake\lpeers do not agree\labout something essential" ]
66 WFReportParams -> Connected [ label="data identical\lno sync needed",color=green,fontcolor=green ]
67
68 WFReportParams -> WFBitMapS
69 WFReportParams -> WFBitMapT
70 WFBitMapT -> WFSyncUUID [minlen=0.1,constraint=false]
71
72 WFBitMapS -> cluster_resync:S
73 WFSyncUUID -> cluster_resync:T
74
75 edge [color=green]
76 cluster_resync:any -> Connected [ label="resync done",fontcolor=green ]
77
78 edge [color=red]
79 WFReportParams -> CommTrouble
80 Connected -> CommTrouble
81 cluster_resync:any -> CommTrouble
82 edge [color=black]
83 CommTrouble -> Unconnected [label="receiver thread is stopped" ]
84
85}
diff --git a/Documentation/admin-guide/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst
new file mode 100644
index 000000000000..bd9a4901fe46
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/figures.rst
@@ -0,0 +1,30 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3.. The files included here are intended to help understand the implementation
4
5Data flows that relate some functions, and write packets
6========================================================
7
8.. kernel-figure:: DRBD-8.3-data-packets.svg
9 :alt: DRBD-8.3-data-packets.svg
10 :align: center
11
12.. kernel-figure:: DRBD-data-packets.svg
13 :alt: DRBD-data-packets.svg
14 :align: center
15
16
17Sub graphs of DRBD's state transitions
18======================================
19
20.. kernel-figure:: conn-states-8.dot
21 :alt: conn-states-8.dot
22 :align: center
23
24.. kernel-figure:: disk-states-8.dot
25 :alt: disk-states-8.dot
26 :align: center
27
28.. kernel-figure:: node-states-8.dot
29 :alt: node-states-8.dot
30 :align: center
diff --git a/Documentation/admin-guide/blockdev/drbd/index.rst b/Documentation/admin-guide/blockdev/drbd/index.rst
new file mode 100644
index 000000000000..68ecd5c113e9
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/index.rst
@@ -0,0 +1,19 @@
1==========================================
2Distributed Replicated Block Device - DRBD
3==========================================
4
5Description
6===========
7
8 DRBD is a shared-nothing, synchronously replicated block device. It
9 is designed to serve as a building block for high availability
10 clusters and in this context, is a "drop-in" replacement for shared
11 storage. Simplistically, you could see it as a network RAID 1.
12
13 Please visit http://www.drbd.org to find out more.
14
15.. toctree::
16 :maxdepth: 1
17
18 data-structure-v9
19 figures
diff --git a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
new file mode 100644
index 000000000000..bfa54e1f8016
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
@@ -0,0 +1,13 @@
1digraph node_states {
2 Secondary -> Primary [ label = "ioctl_set_state()" ]
3 Primary -> Secondary [ label = "ioctl_set_state()" ]
4}
5
6digraph peer_states {
7 Secondary -> Primary [ label = "recv state packet" ]
8 Primary -> Secondary [ label = "recv state packet" ]
9 Primary -> Unknown [ label = "connection lost" ]
10 Secondary -> Unknown [ label = "connection lost" ]
11 Unknown -> Primary [ label = "connected" ]
12 Unknown -> Secondary [ label = "connected" ]
13}
diff --git a/Documentation/admin-guide/blockdev/floppy.rst b/Documentation/admin-guide/blockdev/floppy.rst
new file mode 100644
index 000000000000..4a8f31cf4139
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/floppy.rst
@@ -0,0 +1,255 @@
1=============
2Floppy Driver
3=============
4
5FAQ list:
6=========
7
8A FAQ list may be found in the fdutils package (see below), and also
9at <http://fdutils.linux.lu/faq.html>.
10
11
12LILO configuration options (Thinkpad users, read this)
13======================================================
14
15The floppy driver is configured using the 'floppy=' option in
16lilo. This option can be typed at the boot prompt, or entered in the
17lilo configuration file.
18
19Example: If your kernel is called linux-2.6.9, type the following line
20at the lilo boot prompt (if you have a thinkpad)::
21
22 linux-2.6.9 floppy=thinkpad
23
24You may also enter the following line in /etc/lilo.conf, in the description
25of linux-2.6.9::
26
27 append = "floppy=thinkpad"
28
29Several floppy related options may be given, for example::
30
31 linux-2.6.9 floppy=daring floppy=two_fdc
32 append = "floppy=daring floppy=two_fdc"
33
34If you give options both in the lilo config file and on the boot
35prompt, the option strings of both places are concatenated, the boot
36prompt options coming last. That's why there are also options to
37restore the default behavior.
38
39
40Module configuration options
41============================
42
43If you use the floppy driver as a module, use the following syntax::
44
45 modprobe floppy floppy="<options>"
46
47Example::
48
49 modprobe floppy floppy="omnibook messages"
50
51If you need certain options enabled every time you load the floppy driver,
52you can put::
53
54 options floppy floppy="omnibook messages"
55
56in a configuration file in /etc/modprobe.d/.
57
58
59The floppy driver related options are:
60
61 floppy=asus_pci
62 Sets the bit mask to allow only units 0 and 1. (default)
63
64 floppy=daring
65 Tells the floppy driver that you have a well behaved floppy controller.
66 This allows more efficient and smoother operation, but may fail on
67 certain controllers. This may speed up certain operations.
68
69 floppy=0,daring
70 Tells the floppy driver that your floppy controller should be used
71 with caution.
72
73 floppy=one_fdc
74 Tells the floppy driver that you have only one floppy controller.
75 (default)
76
77 floppy=two_fdc / floppy=<address>,two_fdc
78 Tells the floppy driver that you have two floppy controllers.
79 The second floppy controller is assumed to be at <address>.
80 This option is not needed if the second controller is at address
81 0x370, and if you use the 'cmos' option.
82
83 floppy=thinkpad
84 Tells the floppy driver that you have a Thinkpad. Thinkpads use an
85 inverted convention for the disk change line.
86
87 floppy=0,thinkpad
88 Tells the floppy driver that you don't have a Thinkpad.
89
90 floppy=omnibook / floppy=nodma
91 Tells the floppy driver not to use DMA for data transfers.
92 This is needed on HP Omnibooks, which don't have a workable
93 DMA channel for the floppy driver. This option is also useful
94 if you frequently get "Unable to allocate DMA memory" messages.
95 Indeed, DMA memory needs to be contiguous in physical memory,
96 and is thus harder to find, whereas non-DMA buffers may be
97 allocated in virtual memory. However, I advise against this if
98 you have an FDC without a FIFO (8272A or 82072). 82072A and
99 later are OK. You also need at least a 486 to use nodma.
100 If you use nodma mode, I suggest you also set the FIFO
101 threshold to 10 or lower, in order to limit the number of data
102 transfer interrupts.
103
104 If you have a FIFO-able FDC, the floppy driver automatically
105 falls back on non DMA mode if no DMA-able memory can be found.
106 If you want to avoid this, explicitly ask for 'yesdma'.
107
108 floppy=yesdma
109 Tells the floppy driver that a workable DMA channel is available.
110 (default)
111
112 floppy=nofifo
113 Disables the FIFO entirely. This is needed if you get "Bus
114 master arbitration error" messages from your Ethernet card (or
115 from other devices) while accessing the floppy.
116
117 floppy=usefifo
118 Enables the FIFO. (default)
119
120 floppy=<threshold>,fifo_depth
121 Sets the FIFO threshold. This is mostly relevant in DMA
122 mode. If this is higher, the floppy driver tolerates more
123 interrupt latency, but it triggers more interrupts (i.e. it
124 imposes more load on the rest of the system). If this is
125 lower, the interrupt latency should be lower too (faster
126 processor). The benefit of a lower threshold is fewer
127 interrupts.
128
129 To tune the fifo threshold, switch on over/underrun messages
130 using 'floppycontrol --messages'. Then access a floppy
131 disk. If you get a huge amount of "Over/Underrun - retrying"
132 messages, then the fifo threshold is too low. Try with a
133 higher value, until you only get an occasional Over/Underrun.
134 It is a good idea to compile the floppy driver as a module
135 when doing this tuning, because that lets you try different
136 fifo values without rebooting the machine for each test. Note
137 that you need to do 'floppycontrol --messages' every time you
138 re-insert the module.
139
140 Usually, tuning the fifo threshold should not be needed, as
141 the default (0xa) is reasonable.
142
143 floppy=<drive>,<type>,cmos
144 Sets the CMOS type of <drive> to <type>. This is mandatory if
145 you have more than two floppy drives (only two can be
146 described in the physical CMOS), or if your BIOS uses
147 non-standard CMOS types. The CMOS types are:
148
149 == ==================================
150 0 Use the value of the physical CMOS
151 1 5 1/4 DD
152 2 5 1/4 HD
153 3 3 1/2 DD
154 4 3 1/2 HD
155 5 3 1/2 ED
156 6 3 1/2 ED
157 16 unknown or not installed
158 == ==================================
159
160 (Note: there are two valid types for ED drives. This is because 5 was
161 initially chosen to represent floppy *tapes*, and 6 for ED drives.
162 AMI ignored this, and used 5 for ED drives. That's why the floppy
163 driver handles both.)
164
165 floppy=unexpected_interrupts
166 Print a warning message when an unexpected interrupt is received.
167 (default)
168
169 floppy=no_unexpected_interrupts / floppy=L40SX
170 Don't print a message when an unexpected interrupt is received. This
171 is needed on IBM L40SX laptops in certain video modes. (There seems
172 to be an interaction between video and floppy. The unexpected
173 interrupts affect only performance, and can be safely ignored.)
174
175 floppy=broken_dcl
176 Don't use the disk change line, but assume that the disk was
177 changed whenever the device node is reopened. Needed on some
178 boxes where the disk change line is broken or unsupported.
179 This should be regarded as a stopgap measure, because it makes
180 floppy operation less efficient due to unneeded cache
181 flushings, and slightly more unreliable. Please verify your
182 cable, connection and jumper settings if you have any DCL
183 problems. However, some older drives, and also some laptops
184 are known not to have a DCL.
185
186 floppy=debug
187 Print debugging messages.
188
189 floppy=messages
190 Print informational messages for some operations (disk change
191 notifications, warnings about over and underruns, and about
192 autodetection).
193
194 floppy=silent_dcl_clear
195 Uses a less noisy way to clear the disk change line (which
196 doesn't involve seeks). Implied by 'daring' option.
197
198 floppy=<nr>,irq
199 Sets the floppy IRQ to <nr> instead of 6.
200
201 floppy=<nr>,dma
202 Sets the floppy DMA channel to <nr> instead of 2.
203
204 floppy=slow
205 Use PS/2 stepping rate::
206
207 PS/2 floppies have much slower step rates than regular floppies.
208 It has been recommended to use about 1/4 of the default speed
209 in some more extreme cases.
210
211
212Supporting utilities and additional documentation:
213==================================================
214
215Additional parameters of the floppy driver can be configured at
216runtime. Utilities which do this can be found in the fdutils package.
217This package also contains a new version of mtools which allows access
218to high capacity disks (up to 1992K on a high density 3 1/2 disk!).
219It also contains additional documentation about the floppy driver.
220
221The latest version can be found at the fdutils homepage:
222
223 http://fdutils.linux.lu
224
225The fdutils releases can be found at:
226
227 http://fdutils.linux.lu/download.html
228
229 http://www.tux.org/pub/knaff/fdutils/
230
231 ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
232
233Reporting problems about the floppy driver
234==========================================
235
236If you have a question or a bug report about the floppy driver, mail
237me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use
238comp.os.linux.hardware. As the volume in these groups is rather high,
239be sure to include the word "floppy" (or "FLOPPY") in the subject
240line. If the reported problem happens when mounting floppy disks, be
241sure to mention also the type of the filesystem in the subject line.
242
243Be sure to read the FAQ before mailing/posting any bug reports!
244
245Alain
246
247Changelog
248=========
249
25010-30-2004 :
251 Cleanup, updating, add reference to module configuration.
252 James Nelson <james4765@gmail.com>
253
2546-3-2000 :
255 Original Document
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
new file mode 100644
index 000000000000..b903cf152091
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -0,0 +1,16 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3=============
4Block Devices
5=============
6
7.. toctree::
8 :maxdepth: 1
9
10 floppy
11 nbd
12 paride
13 ramdisk
14 zram
15
16 drbd/index
diff --git a/Documentation/admin-guide/blockdev/nbd.rst b/Documentation/admin-guide/blockdev/nbd.rst
new file mode 100644
index 000000000000..d78dfe559dcf
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/nbd.rst
@@ -0,0 +1,31 @@
1==================================
2Network Block Device (TCP version)
3==================================
4
51) Overview
6-----------
7
8What is it: With this compiled in the kernel (or as a module), Linux
9can use a remote server as one of its block devices. So every time
10the client computer wants to read, e.g., /dev/nb0, it sends a
11request over TCP to the server, which will reply with the data read.
12This can be used for stations with low disk space (or even diskless)
13to borrow disk space from another computer.
14Unlike NFS, it is possible to put any filesystem on it, etc.
15
16For more information, or to download the nbd-client and nbd-server
17tools, go to http://nbd.sf.net/.
18
19The nbd kernel module need only be installed on the client
20system, as the nbd-server is completely in userspace. In fact,
21the nbd-server has been successfully ported to other operating
22systems, including Windows.
23
24A) NBD parameters
25-----------------
26
27max_part
28 Number of partitions per device (default: 0).
29
30nbds_max
31 Number of block devices that should be initialized (default: 16).
diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst
new file mode 100644
index 000000000000..87b4278bf314
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/paride.rst
@@ -0,0 +1,439 @@
1===================================
2Linux and parallel port IDE devices
3===================================
4
5PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
6
71. Introduction
8===============
9
10Owing to the simplicity and near universality of the parallel port interface
11to personal computers, many external devices such as portable hard-disk,
12CD-ROM, LS-120 and tape drives use the parallel port to connect to their
13host computer. While some devices (notably scanners) use ad-hoc methods
14to pass commands and data through the parallel port interface, most
15external devices are actually identical to an internal model, but with
16a parallel-port adapter chip added in. Some of the original parallel port
17adapters were little more than mechanisms for multiplexing a SCSI bus.
18(The Iomega PPA-3 adapter used in the ZIP drives is an example of this
19approach). Most current designs, however, take a different approach.
20The adapter chip reproduces a small ISA or IDE bus in the external device
21and the communication protocol provides operations for reading and writing
22device registers, as well as data block transfer functions. Sometimes,
23the device being addressed via the parallel cable is a standard SCSI
24controller like an NCR 5380. The "ditto" family of external tape
25drives use the ISA replicator to interface a floppy disk controller,
26which is then connected to a floppy-tape mechanism. The vast majority
27of external parallel port devices, however, are now based on standard
28IDE type devices, which require no intermediate controller. If one
29were to open up a parallel port CD-ROM drive, for instance, one would
30find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
31that interconnected a standard PC parallel port cable and a standard
32IDE cable. It is usually possible to exchange the CD-ROM device with
33any other device using the IDE interface.
34
35This document describes the support in Linux for parallel port IDE
36devices. It does not cover parallel port SCSI devices, "ditto" tape
37drives or scanners. Many different devices are supported by the
38parallel port IDE subsystem, including:
39
40 - MicroSolutions backpack CD-ROM
41 - MicroSolutions backpack PD/CD
42 - MicroSolutions backpack hard-drives
43 - MicroSolutions backpack 8000t tape drive
44 - SyQuest EZ-135, EZ-230 & SparQ drives
45 - Avatar Shark
46 - Imation Superdisk LS-120
47 - Maxell Superdisk LS-120
48 - FreeCom Power CD
49 - Hewlett-Packard 5GB and 8GB tape drives
50 - Hewlett-Packard 7100 and 7200 CD-RW drives
51
52as well as most of the clone and no-name products on the market.
53
54To support such a wide range of devices, PARIDE, the parallel port IDE
55subsystem, is actually structured in three parts. There is a base
56paride module which provides a registry and some common methods for
57accessing the parallel ports. The second component is a set of
58high-level drivers for each of the different types of supported devices:
59
60 === =============
61 pd IDE disk
62 pcd ATAPI CD-ROM
63 pf ATAPI disk
64 pt ATAPI tape
65 pg ATAPI generic
66 === =============
67
68(Currently, the pg driver is only used with CD-R drives).
69
70The high-level drivers function according to the relevant standards.
71The third component of PARIDE is a set of low-level protocol drivers
72for each of the parallel port IDE adapter chips. Thanks to the interest
73and encouragement of Linux users from many parts of the world,
74support is available for almost all known adapter protocols:
75
76 ==== ====================================== ====
77 aten ATEN EH-100 (HK)
78 bpck Microsolutions backpack (US)
79 comm DataStor (old-type) "commuter" adapter (TW)
80 dstr DataStor EP-2000 (TW)
81 epat Shuttle EPAT (UK)
82 epia Shuttle EPIA (UK)
83 fit2 FIT TD-2000 (US)
84 fit3 FIT TD-3000 (US)
85 friq Freecom IQ cable (DE)
86 frpw Freecom Power (DE)
87 kbic KingByte KBIC-951A and KBIC-971A (TW)
88 ktti KT Technology PHd adapter (SG)
89 on20 OnSpec 90c20 (US)
90 on26 OnSpec 90c26 (US)
91 ==== ====================================== ====
92
93
942. Using the PARIDE subsystem
95=============================
96
97While configuring the Linux kernel, you may choose either to build
98the PARIDE drivers into your kernel, or to build them as modules.
99
100In either case, you will need to select "Parallel port IDE device support"
101as well as at least one of the high-level drivers and at least one
102of the parallel port communication protocols. If you do not know
103what kind of parallel port adapter is used in your drive, you could
104begin by checking the file names and any text files on your DOS
105installation floppy. Alternatively, you can look at the markings on
106the adapter chip itself. That's usually sufficient to identify the
107correct device.
108
109You can actually select all the protocol modules, and allow the PARIDE
110subsystem to try them all for you.
111
112For the "brand-name" products listed above, here are the protocol
113and high-level drivers that you would use:
114
115 ================ ============ ====== ========
116 Manufacturer Model Driver Protocol
117 ================ ============ ====== ========
118 MicroSolutions CD-ROM pcd bpck
119 MicroSolutions PD drive pf bpck
120 MicroSolutions hard-drive pd bpck
121 MicroSolutions 8000t tape pt bpck
122 SyQuest EZ, SparQ pd epat
123 Imation Superdisk pf epat
124 Maxell Superdisk pf friq
125 Avatar Shark pd epat
126 FreeCom CD-ROM pcd frpw
127 Hewlett-Packard 5GB Tape pt epat
128 Hewlett-Packard 7200e (CD) pcd epat
129 Hewlett-Packard 7200e (CD-R) pg epat
130 ================ ============ ====== ========
131
1322.1 Configuring built-in drivers
133---------------------------------
134
135We recommend that you get to know how the drivers work and how to
136configure them as loadable modules, before attempting to compile a
137kernel with the drivers built-in.
138
139If you built all of your PARIDE support directly into your kernel,
140and you have just a single parallel port IDE device, your kernel should
141locate it automatically for you. If you have more than one device,
142you may need to give some command line options to your bootloader
143(e.g. LILO); how to do that is beyond the scope of this document.
144
145The high-level drivers accept a number of command line parameters, all
146of which are documented in the source files in linux/drivers/block/paride.
147By default, each driver will automatically try all parallel ports it
148can find, and all protocol types that have been installed, until it finds
149a parallel port IDE adapter. Once it finds one, the probe stops. So,
150if you have more than one device, you will need to tell the drivers
151how to identify them. This requires specifying the port address, the
152protocol identification number and, for some devices, the drive's
153chain ID. While your system is booting, a number of messages are
154displayed on the console. Like all such messages, they can be
155reviewed with the 'dmesg' command. Among those messages will be
156some lines like::
157
158 paride: bpck registered as protocol 0
159 paride: epat registered as protocol 1
160
161The numbers will always be the same until you build a new kernel with
162different protocol selections. You should note these numbers as you
163will need them to identify the devices.
164
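The registration numbers can be collected from the kernel log rather than
read off by eye. A minimal sketch, assuming the message format shown above;
``paride_protocols`` is a hypothetical helper name, and the sample input is
copied from the messages in this section:

```shell
#!/bin/sh
# Extract "<protocol> <number>" pairs from kernel log text on stdin.
# Matches with or without a leading timestamp on each line.
paride_protocols() {
    sed -n 's/.*paride: \([^ ]*\) registered as protocol \([0-9][0-9]*\).*/\1 \2/p'
}

# Sample input taken from the messages shown above. On a real system
# the input would come from:  dmesg | paride_protocols
printf '%s\n' \
    'paride: bpck registered as protocol 0' \
    'paride: epat registered as protocol 1' | paride_protocols
```

Each output line pairs a protocol name with the number to use in the drive
parameters.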
165If you happen to be using a MicroSolutions backpack device, you will
166also need to know the unit ID number for each drive. This is usually
167the last two digits of the drive's serial number (but read MicroSolutions'
168documentation about this).
169
170As an example, let's assume that you have a MicroSolutions PD/CD drive
171with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
172EZ-135 connected to the chained port on the PD/CD drive and also an
173Imation Superdisk connected to port 0x278. You could give the following
174options on your boot command::
175
176 pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
177
178In the last option, pf.drive1 configures device /dev/pf1, the 0x378
179is the parallel port base address, the 0 is the protocol registration
180number and 36 is the chain ID.
181
182Please note: while PARIDE will work both with and without the
183PARPORT parallel port sharing system that is included by the
184"Parallel port support" option, PARPORT must be included and enabled
185if you want to use chains of devices on the same parallel port.
186
1872.2 Loading and configuring PARIDE as modules
188----------------------------------------------
189
190It is much faster and simpler to get to understand the PARIDE drivers
191if you use them as loadable kernel modules.
192
193Note 1:
194 using these drivers with the "kerneld" automatic module loading
195 system is not recommended for beginners, and is not documented here.
196
197Note 2:
198 if you build PARPORT support as a loadable module, PARIDE must
199 also be built as loadable modules, and PARPORT must be loaded before
200 the PARIDE modules.
201
202To use PARIDE, you must begin by::
203
204 insmod paride
205
206this loads a base module which provides a registry for the protocols,
207among other tasks.
208
209Then, load as many of the protocol modules as you think you might need.
210As you load each module, it will register the protocols that it supports,
211and print a log message to your kernel log file and your console. For
212example::
213
214 # insmod epat
215 paride: epat registered as protocol 0
216 # insmod kbic
217 paride: k951 registered as protocol 1
218 paride: k971 registered as protocol 2
219
220Finally, you can load high-level drivers for each kind of device that
221you have connected. By default, each driver will autoprobe for a single
222device, but you can support up to four similar devices by giving their
223individual co-ordinates when you load the driver.
224
225For example, if you had two no-name CD-ROM drives both using the
226KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
227you could give the following command::
228
229 # insmod pcd drive0=0x378,1 drive1=0x3bc,1
230
231For most adapters, giving a port address and protocol number is sufficient,
232but check the source files in linux/drivers/block/paride for more
233information. (Hopefully someone will write some man pages one day!).
234
235As another example, here's what happens when PARPORT is installed, and
236a SyQuest EZ-135 is attached to port 0x378::
237
238 # insmod paride
239 paride: version 1.0 installed
240 # insmod epat
241 paride: epat registered as protocol 0
242 # insmod pd
243 pd: pd version 1.0, major 45, cluster 64, nice 0
244 pda: Sharing parport1 at 0x378
245 pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
246 pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
247 pda: pda1
248
249Note that the last line is the output from the generic partition table
250scanner - in this case it reports that it has found a disk with one partition.
251
2522.3 Using a PARIDE device
253--------------------------
254
255Once the drivers have been loaded, you can access PARIDE devices in the
256same way as their traditional counterparts. You will probably need to
257create the device "special files". Here is a simple script that you can
258cut to a file and execute::
259
260 #!/bin/bash
261 #
262 # mkd -- a script to create the device special files for the PARIDE subsystem
263 #
264 function mkdev {
265 mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
266 }
267 #
268 function pd {
269 D=$( printf "\\$( printf '%03o' $(( $1 + 97 )) )" ) # unit 0..3 -> letter a..d
270 mkdev pd$D b 45 $(( $1 * 16 ))
271 for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
272 do mkdev pd$D$P b 45 $(( $1 * 16 + P ))
273 done
274 }
275 #
276 cd /dev
277 #
278 for u in 0 1 2 3 ; do pd $u ; done
279 for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
280 for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
281 for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
282 for u in 0 1 2 3 ; do mkdev npt$u c 96 $(( u + 128 )) ; done
283 for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
284 #
285 # end of mkd
286
287With the device files and drivers in place, you can access PARIDE devices
288like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
289
290 mount /dev/pcd0 /cdrom
291
292If you have a fresh Avatar Shark cartridge, and the drive is pda, you
293might do something like::
294
295 fdisk /dev/pda -- make a new partition table with
296 partition 1 of type 83
297
298 mke2fs /dev/pda1 -- to build the file system
299
300 mkdir /shark -- make a place to mount the disk
301
302 mount /dev/pda1 /shark
303
304Devices like the Imation superdisk work in the same way, except that
305they do not have a partition table. For example, to make a 120MB
306floppy that you could share with a DOS system::
307
308 mkdosfs /dev/pf0
309 mount /dev/pf0 /mnt
310
311
3122.4 The pf driver
313------------------
314
315The pf driver is intended for use with parallel port ATAPI disk
316devices. The most common devices in this category are PD drives
317and LS-120 drives. Traditionally, media for these devices are not
318partitioned. Consequently, the pf driver does not support partitioned
319media. This may be changed in a future version of the driver.
320
3212.5 Using the pt driver
322------------------------
323
324The pt driver for parallel port ATAPI tape drives is a minimal driver.
325It does not yet support many of the standard tape ioctl operations.
326For best performance, a block size of 32KB should be used. You will
327probably want to set the parallel port delay to 0, if you can.
328
3292.6 Using the pg driver
330------------------------
331
332The pg driver can be used in conjunction with the cdrecord program
333to create CD-ROMs. Please get cdrecord version 1.6.1 or later
334from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
335your parallel port should ideally be set to EPP mode, and the "port delay"
336should be set to 0. With those settings it is possible to record at 2x
337speed without any buffer underruns. If you cannot get the driver to work
338in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
339
340
3413. Troubleshooting
342==================
343
3443.1 Use EPP mode if you can
345----------------------------
346
347The most common problems that people report with the PARIDE drivers
348concern the parallel port CMOS settings. At this time, none of the
349PARIDE protocol modules support ECP mode, or any ECP combination modes.
350If you are able to do so, please set your parallel port into EPP mode
351using your CMOS setup procedure.
352
3533.2 Check the port delay
354-------------------------
355
356Some parallel ports cannot reliably transfer data at full speed. To
357offset the errors, the PARIDE protocol modules introduce a "port
358delay" between each access to the i/o ports. Each protocol sets
359a default value for this delay. In most cases, the user can override
360the default and set it to 0 - resulting in somewhat higher transfer
361rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. If you experience corrupt data
363transfers, or unexpected failures, you may wish to increase the
364port delay. The delay can be programmed using the "driveN" parameters
365to each of the high-level drivers. Please see the notes above, or
366read the comments at the beginning of the driver source files in
367linux/drivers/block/paride.
368
3693.3 Some drives need a printer reset
370-------------------------------------
371
372There appear to be a number of "noname" external drives on the market
373that do not always power up correctly. We have noticed this with some
374drives based on OnSpec and older Freecom adapters. In these rare cases,
375the adapter can often be reinitialised by issuing a "printer reset" on
376the parallel port. As the reset operation is potentially disruptive in
377multiple device environments, the PARIDE drivers will not do it
automatically. You can, however, force a printer reset by doing::
379
380 insmod lp reset=1
381 rmmod lp
382
383If you have one of these marginal cases, you should probably build
384your paride drivers as modules, and arrange to do the printer reset
385before loading the PARIDE drivers.
386
3873.4 Use the verbose option and dmesg if you need help
388------------------------------------------------------
389
390While a lot of testing has gone into these drivers to make them work
391as smoothly as possible, problems will arise. If you do have problems,
392please check all the obvious things first: does the drive work in
DOS with the manufacturer's drivers? If that doesn't yield any useful
394clues, then please make sure that only one drive is hooked to your system,
395and that either (a) PARPORT is enabled or (b) no other device driver
396is using your parallel port (check in /proc/ioports). Then, load the
397appropriate drivers (you can load several protocol modules if you want)
398as in::
399
400 # insmod paride
401 # insmod epat
402 # insmod bpck
403 # insmod kbic
404 ...
405 # insmod pd verbose=1
406
407(using the correct driver for the type of device you have, of course).
408The verbose=1 parameter will cause the drivers to log a trace of their
409activity as they attempt to locate your drive.
410
411Use 'dmesg' to capture a log of all the PARIDE messages (any messages
412beginning with paride:, a protocol module's name or a driver's name) and
413include that with your bug report. You can submit a bug report in one
414of two ways. Either send it directly to the author of the PARIDE suite,
415by e-mail to grant@torque.net, or join the linux-parport mailing list
416and post your report there.
417
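One way to collect those messages is to filter the kernel log by prefix.
The sketch below assumes that each PARIDE message starts with a module name
followed by a colon. The module list (paride, epat, bpck, kbic, pd) is taken
from the example above and should be adjusted to the modules you actually
loaded; the function name paride_log is illustrative:

```shell
# paride_log: keep only lines that start with one of the listed module
# names followed by a colon. Reads a kernel log on standard input.
# Assumption: lines carry no timestamp prefix (util-linux "dmesg -t").
paride_log() {
    grep -E '^(paride|epat|bpck|kbic|pd):'
}

# Typical use (needs permission to read the kernel log):
#   dmesg -t | paride_log > paride-report.txt
```

The resulting file can then be attached to the bug report.
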
4183.5 For more information or help
419---------------------------------
420
421You can join the linux-parport mailing list by sending a mail message
422to:
423
424 linux-parport-request@torque.net
425
426with the single word::
427
428 subscribe
429
430in the body of the mail message (not in the subject line). Please be
431sure that your mail program is correctly set up when you do this, as
432the list manager is a robot that will subscribe you using the reply
433address in your mail headers. REMOVE any anti-spam gimmicks you may
434have in your mail headers, when sending mail to the list server.
435
436You might also find some useful information on the linux-parport
437web pages (although they are not always up to date) at
438
439 http://web.archive.org/web/%2E/http://www.torque.net/parport/
diff --git a/Documentation/admin-guide/blockdev/ramdisk.rst b/Documentation/admin-guide/blockdev/ramdisk.rst
new file mode 100644
index 000000000000..b7c2268f8dec
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/ramdisk.rst
@@ -0,0 +1,177 @@
1==========================================
2Using the RAM disk block device with Linux
3==========================================
4
5.. Contents:
6
7 1) Overview
8 2) Kernel Command Line Parameters
9 3) Using "rdev -r"
10 4) An Example of Creating a Compressed RAM Disk
11
12
131) Overview
14-----------
15
16The RAM disk driver is a way to use main system memory as a block device. It
17is required for initrd, an initial filesystem used if you need to load modules
18in order to access the root filesystem (see Documentation/admin-guide/initrd.rst). It can
19also be used for a temporary filesystem for crypto work, since the contents
20are erased on reboot.
21
22The RAM disk dynamically grows as more space is required. It does this by using
23RAM from the buffer cache. The driver marks the buffers it is using as dirty
24so that the VM subsystem does not try to reclaim them later.
25
The RAM disk driver supports up to 16 RAM disks by default, and can be reconfigured
27to support an unlimited number of RAM disks (at your own risk). Just change
28the configuration symbol BLK_DEV_RAM_COUNT in the Block drivers config menu
29and (re)build the kernel.
30
31To use RAM disk support with your system, run './MAKEDEV ram' from the /dev
32directory. RAM disks are all major number 1, and start with minor number 0
33for /dev/ram0, etc. If used, modern kernels use /dev/ram0 for an initrd.
34
35The new RAM disk also has the ability to load compressed RAM disk images,
36allowing one to squeeze more programs onto an average installation or
37rescue floppy disk.
38
39
402) Parameters
41---------------------------------
42
432a) Kernel Command Line Parameters
44
45 ramdisk_size=N
46 Size of the ramdisk.
47
48This parameter tells the RAM disk driver to set up RAM disks of N k size. The
49default is 4096 (4 MB).
50
512b) Module parameters
52
53 rd_nr
54 /dev/ramX devices created.
55
56 max_part
57 Maximum partition number.
58
59 rd_size
60 See ramdisk_size.
61
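As a small worked example of the units involved, the sketch below converts a
size in MB to the 1 k units that ramdisk_size and rd_size take. The helper
name rd_size_kb is illustrative, and the modprobe line assumes that the
driver is built as a module named brd, which this document does not state:

```shell
# rd_size_kb MB: convert a size in MB to the 1 k units used by
# ramdisk_size= (kernel command line) and rd_size= (module parameter).
rd_size_kb() {
    echo $(( $1 * 1024 ))
}

echo "ramdisk_size=$(rd_size_kb 16)"    # ramdisk_size=16384

# Assumed usage when the driver is a module named brd (needs root):
#   modprobe brd rd_nr=2 rd_size="$(rd_size_kb 16)"
```
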
623) Using "rdev -r"
63------------------
64
65The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
66as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
67to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
6814 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
69prompt/wait sequence is to be given before trying to read the RAM disk. Since
70the RAM disk dynamically grows as data is being written into it, a size field
71is not required. Bits 11 to 13 are not currently used and may as well be zero.
72These numbers are no magical secrets, as seen below::
73
74 ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
75 ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
76 ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
77
78Consider a typical two floppy disk setup, where you will have the
79kernel on disk one, and have already put a RAM disk image onto disk #2.
80
81Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
82starts at an offset of 0 kB from the beginning of the floppy.
83The command line equivalent is: "ramdisk_start=0"
84
85You want bit 14 as one, indicating that a RAM disk is to be loaded.
86The command line equivalent is: "load_ramdisk=1"
87
88You want bit 15 as one, indicating that you want a prompt/keypress
89sequence so that you have a chance to switch floppy disks.
90The command line equivalent is: "prompt_ramdisk=1"
91
92Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
93So to create disk one of the set, you would do::
94
95 /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
96 /usr/src/linux# rdev /dev/fd0 /dev/fd0
97 /usr/src/linux# rdev -r /dev/fd0 49152
98
99If you make a boot disk that has LILO, then for the above, you would use::
100
101 append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
102
103Since the default start = 0 and the default prompt = 1, you could use::
104
105 append = "load_ramdisk=1"
106
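The bit layout described in this section can be checked with a few lines of
shell arithmetic. The following is a sketch only; the function names
rdev_word and rdev_decode are illustrative and are not part of rdev:

```shell
# rdev_word START_KB LOAD PROMPT: build the 16-bit word set by "rdev -r".
#   bits 0-10: offset of the RAM disk image in 1 k blocks (0..2047)
#   bit 14   : load a RAM disk            (16384 = 0x4000)
#   bit 15   : prompt before reading it   (32768 = 0x8000)
rdev_word() {
    start=$1 load=$2 prompt=$3
    if [ "$start" -lt 0 ] || [ "$start" -gt 2047 ]; then
        echo "offset must fit in 11 bits (0..2047)" >&2
        return 1
    fi
    echo $(( start + load * 16384 + prompt * 32768 ))
}

# rdev_decode WORD: print the equivalent kernel command line options.
rdev_decode() {
    word=$1
    echo "ramdisk_start=$(( word & 2047 ))" \
         "load_ramdisk=$(( (word >> 14) & 1 ))" \
         "prompt_ramdisk=$(( (word >> 15) & 1 ))"
}

rdev_word 0 1 1      # 49152
rdev_decode 49552    # ramdisk_start=400 load_ramdisk=1 prompt_ramdisk=1
```

The first call reproduces the value used for the two floppy setup above.
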
107
1084) An Example of Creating a Compressed RAM Disk
109-----------------------------------------------
110
111To create a RAM disk image, you will need a spare block device to
112construct it on. This can be the RAM disk device itself, or an
113unused disk partition (such as an unmounted swap partition). For this
114example, we will use the RAM disk device, "/dev/ram0".
115
116Note: This technique should not be done on a machine with less than 8 MB
117of RAM. If using a spare disk partition instead of /dev/ram0, then this
118restriction does not apply.
119
120a) Decide on the RAM disk size that you want. Say 2 MB for this example.
121 Create it by writing to the RAM disk device. (This step is not currently
122 required, but may be in the future.) It is wise to zero out the
123 area (esp. for disks) so that maximal compression is achieved for
124 the unused blocks of the image that you are about to create::
125
126 dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
127
128b) Make a filesystem on it. Say ext2fs for this example::
129
130 mke2fs -vm0 /dev/ram0 2048
131
132c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
133 and unmount it again.
134
135d) Compress the contents of the RAM disk. The level of compression
136 will be approximately 50% of the space used by the files. Unused
137 space on the RAM disk will compress to almost nothing::
138
139 dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
140
141e) Put the kernel onto the floppy::
142
143 dd if=zImage of=/dev/fd0 bs=1k
144
145f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
146 that is slightly larger than the kernel, so that you can put another
147 (possibly larger) kernel onto the same floppy later without overlapping
148 the RAM disk image. An offset of 400 kB for kernels about 350 kB in
149 size would be reasonable. Make sure offset+size of ram_image.gz is
150 not larger than the total space on your floppy (usually 1440 kB)::
151
152 dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
153
154g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
155 For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
156 have 2^15 + 2^14 + 400 = 49552::
157
158 rdev /dev/fd0 /dev/fd0
159 rdev -r /dev/fd0 49552
160
161That is it. You now have your boot/root compressed RAM disk floppy. Some
162users may wish to combine steps (d) and (f) by using a pipe.
163
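Combining steps (d) and (f) with a pipe might look like the sketch below. It
reuses the device names, the 2048 k size and the 400 k offset from this
example. The helper fits_on_floppy is hypothetical and only illustrates the
offset+size check from step (f):

```shell
# fits_on_floppy OFFSET_KB IMAGE [CAPACITY_KB]: succeed if the image, written
# at OFFSET_KB, ends within the floppy (1440 kB unless told otherwise).
fits_on_floppy() {
    offset_kb=$1 image=$2 capacity_kb=${3:-1440}
    bytes=$(wc -c < "$image")
    size_kb=$(( (bytes + 1023) / 1024 ))    # round up to whole 1 k blocks
    [ $(( offset_kb + size_kb )) -le "$capacity_kb" ]
}

# Steps (d) and (f) in one pipeline, without the temporary file
# (needs root and overwrites the floppy from 400 kB onwards):
#   dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 | dd of=/dev/fd0 bs=1k seek=400
#
# With a temporary file, the size can be checked before writing:
#   fits_on_floppy 400 /tmp/ram_image.gz &&
#       dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
```

The pipeline saves the temporary file, but the size of the compressed image
is then only known after it has been written.
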
164
165 Paul Gortmaker 12/95
166
167Changelog:
168----------
169
17010-22-04 :
171 Updated to reflect changes in command line options, remove
172 obsolete references, general cleanup.
173 James Nelson (james4765@gmail.com)
174
175
17612-95 :
177 Original Document
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
new file mode 100644
index 000000000000..6eccf13219ff
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -0,0 +1,422 @@
1========================================
2zram: Compressed RAM based block devices
3========================================
4
5Introduction
6============
7
8The zram module creates RAM based block devices named /dev/zram<id>
9(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
10in memory itself. These disks allow very fast I/O and compression provides
11good amounts of memory savings. Some of the usecases include /tmp storage,
12use as swap disks, various caches under /var and maybe many more :)
13
14Statistics for individual zram devices are exported through sysfs nodes at
15/sys/block/zram<id>/
16
17Usage
18=====
19
20There are several ways to configure and manage zram device(-s):
21
22a) using zram and zram_control sysfs attributes
23b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
24
25In this document we will describe only 'manual' zram configuration steps,
26IOW, zram and zram_control sysfs attributes.
27
28In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Note that the zram
maintainers do not develop or maintain util-linux or zramctl; should
you have any questions, please contact util-linux@vger.kernel.org
32
The following shows a typical sequence of steps for using zram.
34
35WARNING
36=======
37
38For the sake of simplicity we skip error checking parts in most of the
39examples below. However, it is your sole responsibility to handle errors.
40
41zram sysfs attributes always return negative values in case of errors.
42The list of possible return codes:
43
======== =============================================================
-EBUSY   an attempt to modify an attribute that cannot be changed once
         the device has been initialised. Please reset device first;
-ENOMEM  zram was not able to allocate enough memory to fulfil your
         needs;
-EINVAL  invalid input has been provided.
======== =============================================================
51
If you use 'echo', the returned value is set by the 'echo' utility,
and, in the general case, something like::

    echo 3 > /sys/block/zram0/max_comp_streams
    if [ $? -ne 0 ]; then
        handle_error
    fi
59
60should suffice.
61
621) Load Module
63==============
64
65::
66
67 modprobe zram num_devices=4
68 This creates 4 devices: /dev/zram{0,1,2,3}
69
70num_devices parameter is optional and tells zram how many devices should be
71pre-created. Default: 1.
72
732) Set max number of compression streams
74========================================
75
Regardless of the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPU - thus
allowing several concurrent compression operations. The number of
allocated compression streams goes down when some of the CPUs
go offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or have only 1 CPU online.
82
83To find out how many streams are currently available::
84
85 cat /sys/block/zram0/max_comp_streams
86
873) Select compression algorithm
88===============================
89
Using the comp_algorithm device attribute, one can see the available and
currently selected (shown in square brackets) compression algorithms, and
change the selected compression algorithm (once the device is initialised,
the compression algorithm can no longer be changed).
94
95Examples::
96
97 #show supported compression algorithms
98 cat /sys/block/zram0/comp_algorithm
99 lzo [lz4]
100
101 #select lzo compression algorithm
102 echo lzo > /sys/block/zram0/comp_algorithm
103
104For the time being, the `comp_algorithm` content does not necessarily
105show every compression algorithm supported by the kernel. We keep this
106list primarily to simplify device configuration and one can configure
107a new device with a compression algorithm that is not listed in
108`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
109and, if some of the algorithms were built as modules, it's impossible
110to list all of them using, for instance, /proc/crypto or any other
111method. This, however, has an advantage of permitting the usage of
112custom crypto compression modules (implementing S/W or H/W compression).
113
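Scripts that need only the currently selected algorithm can pick the
bracketed entry out of the comp_algorithm line. A minimal sketch, assuming
the "lzo [lz4]" output format shown above (the function name
selected_algorithm is illustrative):

```shell
# selected_algorithm LINE: print the bracketed entry from a comp_algorithm
# line such as "lzo [lz4]".
selected_algorithm() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

selected_algorithm "lzo [lz4]"    # lz4

# Typical use:
#   selected_algorithm "$(cat /sys/block/zram0/comp_algorithm)"
```
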
1144) Set Disksize
115===============
116
117Set disk size by writing the value to sysfs node 'disksize'.
118The value can be either in bytes or you can use mem suffixes.
119Examples::
120
121 # Initialize /dev/zram0 with 50MB disksize
122 echo $((50*1024*1024)) > /sys/block/zram0/disksize
123
124 # Using mem suffixes
125 echo 256K > /sys/block/zram0/disksize
126 echo 512M > /sys/block/zram0/disksize
127 echo 1G > /sys/block/zram0/disksize
128
129Note:
130There is little point creating a zram of greater than twice the size of memory
131since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
132size of the disk when not in use so a huge zram is wasteful.
133
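The note above can be turned into a small calculation. The sketch below
reads MemTotal (reported in kB) from a meminfo-style file and prints a
disksize of twice that amount, together with the roughly 0.1% overhead
mentioned above. The 2:1 ratio is the expectation stated in this document,
not a guarantee, and the function names are illustrative:

```shell
# mem_total_kb [FILE]: MemTotal in kB from a meminfo-style file.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# max_useful_disksize KB: twice the memory size, in bytes.
max_useful_disksize() {
    echo $(( $1 * 1024 * 2 ))
}

# idle_overhead BYTES: about 0.1% of the disk size.
idle_overhead() {
    echo $(( $1 / 1000 ))
}

if [ -r /proc/meminfo ]; then
    size=$(max_useful_disksize "$(mem_total_kb)")
    echo "upper bound for disksize: $size bytes"
    echo "approximate overhead when unused: $(idle_overhead "$size") bytes"
fi

# To apply it (needs root, device must not be initialised yet):
#   echo "$size" > /sys/block/zram0/disksize
```
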
1345) Set memory limit: Optional
135=============================
136
137Set memory limit by writing the value to sysfs node 'mem_limit'.
138The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
140Examples::
141
142 # limit /dev/zram0 with 50MB memory
143 echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
144
145 # Using mem suffixes
146 echo 256K > /sys/block/zram0/mem_limit
147 echo 512M > /sys/block/zram0/mem_limit
148 echo 1G > /sys/block/zram0/mem_limit
149
150 # To disable memory limit
151 echo 0 > /sys/block/zram0/mem_limit
152
1536) Activate
154===========
155
156::
157
158 mkswap /dev/zram0
159 swapon /dev/zram0
160
161 mkfs.ext4 /dev/zram1
162 mount /dev/zram1 /tmp
163
1647) Add/remove zram devices
165==========================
166
167zram provides a control interface, which enables dynamic (on-demand) device
168addition and removal.
169
170In order to add a new /dev/zramX device, perform read operation on hot_add
171attribute. This will return either new device's device id (meaning that you
172can use /dev/zram<id>) or error code.
173
174Example::
175
176 cat /sys/class/zram-control/hot_add
177 1
178
179To remove the existing /dev/zramX device (where X is a device id)
180execute::
181
182 echo X > /sys/class/zram-control/hot_remove
183
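The two control attributes can be wrapped in small helpers, for example when
devices are created on demand. This is a sketch under stated assumptions:
the ZRAM_CONTROL and ZRAM_BLOCK variables are not part of zram, they only
make the sysfs locations overridable so that the helpers can be dry-run
against a scratch directory:

```shell
# Default to the real sysfs locations; the override is an assumption made
# here for testing, not a zram feature.
ZRAM_CONTROL=${ZRAM_CONTROL:-/sys/class/zram-control}
ZRAM_BLOCK=${ZRAM_BLOCK:-/sys/block}

# zram_add DISKSIZE: create a device, set its size, print its id.
zram_add() {
    id=$(cat "$ZRAM_CONTROL/hot_add") || return 1
    echo "$1" > "$ZRAM_BLOCK/zram$id/disksize" || return 1
    echo "$id"
}

# zram_remove ID: remove a device again. It must not be in use.
zram_remove() {
    echo "$1" > "$ZRAM_CONTROL/hot_remove"
}

# Typical use (needs root):
#   id=$(zram_add 512M) && mkswap "/dev/zram$id" && swapon "/dev/zram$id"
#   swapoff "/dev/zram$id" && zram_remove "$id"
```

Reading hot_add has a side effect (it creates the device), so the id is
captured once and reused rather than read again.
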
1848) Stats
185========
186
187Per-device statistics are exported as various nodes under /sys/block/zram<id>/
188
189A brief description of exported device attributes. For more details please
190read Documentation/ABI/testing/sysfs-block-zram.
191
====================== ====== ===============================================
Name                   access description
====================== ====== ===============================================
disksize               RW     show and set the device's disk size
initstate              RO     shows the initialization state of the device
reset                  WO     trigger device reset
mem_used_max           WO     reset the `mem_used_max` counter (see later)
mem_limit              WO     specifies the maximum amount of memory ZRAM can
                              use to store the compressed data
writeback_limit        WO     specifies the maximum amount of write IO zram
                              can write out to backing device as 4KB unit
writeback_limit_enable RW     show and set writeback_limit feature
max_comp_streams       RW     the number of possible concurrent compress
                              operations
comp_algorithm         RW     show and change the compression algorithm
compact                WO     trigger memory compaction
debug_stat             RO     this file is used for zram debugging purposes
backing_dev            RW     set up backend storage for zram to write out
idle                   WO     mark allocated slot as idle
====================== ====== ===============================================
212
213
214User space is advised to use the following files to read the device statistics.
215
216File /sys/block/zram<id>/stat
217
218Represents block layer statistics. Read Documentation/block/stat.rst for
219details.
220
221File /sys/block/zram<id>/io_stat
222
223The stat file represents device's I/O statistics not accounted by block
224layer and, thus, not available in zram<id>/stat file. It consists of a
225single line of text and contains the following stats separated by
226whitespace:
227
 ============= =============================================================
 failed_reads  The number of failed reads
 failed_writes The number of failed writes
 invalid_io    The number of non-page-size-aligned I/O requests
 notify_free   Depending on device usage scenario it may account

               a) the number of pages freed because of swap slot free
                  notifications
               b) the number of pages freed because of
                  REQ_OP_DISCARD requests sent by bio. The former ones are
                  sent to a swap block device when a swap slot is freed,
                  which implies that this disk is being used as a swap disk.

               The latter ones are sent by filesystem mounted with
               discard option, whenever some data blocks are getting
               discarded.
 ============= =============================================================
245
246File /sys/block/zram<id>/mm_stat
247
248The stat file represents device's mm statistics. It consists of a single
249line of text and contains the following stats separated by whitespace:
250
 ================ =============================================================
 orig_data_size   uncompressed size of data stored in this disk.
                  This excludes same-element-filled pages (same_pages) since
                  no memory is allocated for them.
                  Unit: bytes
 compr_data_size  compressed size of data stored in this disk
 mem_used_total   the amount of memory allocated for this disk. This
                  includes allocator fragmentation and metadata overhead,
                  allocated for this disk. So, allocator space efficiency
                  can be calculated using compr_data_size and this statistic.
                  Unit: bytes
 mem_limit        the maximum amount of memory ZRAM can use to store
                  the compressed data
 mem_used_max     the maximum amount of memory zram has consumed to
                  store the data
 same_pages       the number of same element filled pages written to this disk.
                  No memory is allocated for such pages.
 pages_compacted  the number of pages freed during compaction
 huge_pages       the number of incompressible pages
 ================ =============================================================
271
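The first three mm_stat fields are enough to compute the compression ratio
and the allocator space efficiency mentioned above. A minimal sketch,
assuming the field order of the table above; the function name
mm_stat_summary is illustrative and the sample numbers are made up:

```shell
# mm_stat_summary LINE: print the compression ratio and allocator efficiency
# from one mm_stat line. Fields used, in the order listed above:
#   1 orig_data_size  2 compr_data_size  3 mem_used_total
mm_stat_summary() {
    echo "$1" | awk '{
        if ($2 > 0) printf "compression ratio: %.2f\n", $1 / $2
        if ($3 > 0) printf "allocator efficiency: %.2f\n", $2 / $3
    }'
}

# Sample line with made-up numbers (not measured on a real system):
mm_stat_summary "4194304 1048576 1310720 0 1310720 0 0 0"

# Typical use:
#   mm_stat_summary "$(cat /sys/block/zram0/mm_stat)"
```

For the sample line this prints a compression ratio of 4.00 and an
allocator efficiency of 0.80.
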
272File /sys/block/zram<id>/bd_stat
273
274The stat file represents device's backing device statistics. It consists of
275a single line of text and contains the following stats separated by whitespace:
276
 ============== =============================================================
 bd_count       size of data written in backing device.
                Unit: 4K bytes
 bd_reads       the number of reads from backing device
                Unit: 4K bytes
 bd_writes      the number of writes to backing device
                Unit: 4K bytes
 ============== =============================================================
285
2869) Deactivate
287=============
288
289::
290
291 swapoff /dev/zram0
292 umount /dev/zram1
293
29410) Reset
295=========
296
297 Write any positive value to 'reset' sysfs node::
298
299 echo 1 > /sys/block/zram0/reset
300 echo 1 > /sys/block/zram1/reset
301
302 This frees all the memory allocated for the given device and
303 resets the disksize to zero. You must set the disksize again
304 before reusing the device.
305
306Optional Feature
307================
308
309writeback
310---------
311
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, the admin should set up the backing device via::

    echo /dev/sda5 > /sys/block/zramX/backing_dev

before setting the disksize. Only partitions are supported at the moment.
If the admin wants to use incompressible page writeback, they can do so via::

    echo huge > /sys/block/zramX/writeback
322
To use idle page writeback, the user first needs to declare zram pages
as idle::

    echo all > /sys/block/zramX/idle

From now on, all pages on zram are idle pages. The idle mark
is removed only when someone requests access to the block.
IOW, unless there is an access request, those pages remain idle pages.
331
The admin can request writeback of those idle pages at the right time via::

    echo idle > /sys/block/zramX/writeback

With this command, zram writes idle pages back from memory to the storage.

If there is a lot of write IO to a flash device, it may suffer from
flash wearout, so the admin needs to design a write limit
to guarantee storage health for the entire product life.
341
To address this concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so writeback is not
limited. IOW, if the admin wants to apply a writeback budget, he should
enable writeback_limit_enable via::

    $ echo 1 > /sys/block/zramX/writeback_limit_enable

Once writeback_limit_enable is set, zram doesn't allow any writeback
until the admin sets the budget via /sys/block/zramX/writeback_limit.

(If the admin doesn't enable writeback_limit_enable, the value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
354
If the admin wants to limit writeback to 400M per day, he could do it
like below::

    $ MB_SHIFT=20
    $ SHIFT_4K=12
    $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
        /sys/block/zram0/writeback_limit
    $ echo 1 > /sys/block/zram0/writeback_limit_enable

If the admin wants to allow further writes once the budget is exhausted,
he could do it like below::

    $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
        /sys/block/zram0/writeback_limit
369
If the admin wants to see the remaining writeback budget since he set it::

    $ cat /sys/block/zramX/writeback_limit

If the admin wants to disable the writeback limit, he could do::

    $ echo 0 > /sys/block/zramX/writeback_limit_enable

The writeback_limit count is reset whenever you reset zram (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so it is the user's job
to keep track of how much writeback happened before the reset in order to
allocate the extra writeback budget in the next setting.

If the admin wants to measure the writeback count in a certain period, he
can read it from the 3rd column of /sys/block/zram0/bd_stat.
385
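Both writeback_limit and the bd_stat columns count in 4K units, so a little
arithmetic is needed on either side. The sketch below converts a budget in
MB to that unit and converts the third bd_stat column back to bytes. The
function names are illustrative and the sample bd_stat line is made up:

```shell
# writeback_budget MB: convert a budget in MB to the 4K units that
# writeback_limit expects.
writeback_budget() {
    echo $(( ($1 << 20) >> 12 ))
}

# bd_writes_bytes LINE: bytes written to the backing device, taken from the
# third column (bd_writes, in 4K units) of a bd_stat line.
bd_writes_bytes() {
    set -- $1            # split the line into whitespace-separated fields
    echo $(( $3 * 4096 ))
}

writeback_budget 400           # 102400
bd_writes_bytes "120 7 100"    # 409600 (sample line, made-up numbers)

# Typical use (needs root):
#   writeback_budget 400 > /sys/block/zram0/writeback_limit
#   bd_writes_bytes "$(cat /sys/block/zram0/bd_stat)"
```

Sampling bd_writes_bytes at the start and end of a period and subtracting
the two values gives the amount written back in that period.
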
386memory tracking
387===============
388
With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about
zram blocks. It can be useful for catching cold or incompressible
pages of a process with pagemap.

If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::
395
396 300 75.033841 .wh.
397 301 63.806904 s...
398 302 63.806919 ..hi
399
400First column
401 zram's block index.
402Second column
403 access time since the system was booted
404Third column
405 state of the block:
406
407 s:
408 same page
409 w:
410 written page to backing store
411 h:
412 huge page
413 i:
414 idle page
415
The first line of the example above says that block 300 was accessed at
75.033841 seconds and that the block is huge, so it has been written back to
the backing storage. This is a debugging feature, so nobody should rely on it
to work properly.
420
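Since block_state is plain text with three whitespace-separated columns, it
can be filtered with standard tools. A minimal sketch that lists the blocks
carrying a given state flag (the function name block_state_filter is
illustrative):

```shell
# block_state_filter FLAG: read block_state lines on standard input and
# print the index of every block whose state column contains FLAG
# (one of s, w, h, i).
block_state_filter() {
    awk -v flag="$1" 'index($3, flag) { print $1 }'
}

# Sample input copied from the output shown above:
printf '300 75.033841 .wh.\n301 63.806904 s...\n302 63.806919 ..hi\n' |
    block_state_filter h    # prints 300 and 302

# Typical use (needs root and debugfs):
#   block_state_filter i < /sys/kernel/debug/zram/zram0/block_state
```
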
421Nitin Gupta
422ngupta@vflare.org
diff --git a/Documentation/admin-guide/btmrvl.rst b/Documentation/admin-guide/btmrvl.rst
new file mode 100644
index 000000000000..ec57740ead0c
--- /dev/null
+++ b/Documentation/admin-guide/btmrvl.rst
@@ -0,0 +1,124 @@
1=============
2btmrvl driver
3=============
4
5All commands are used via debugfs interface.
6
7Set/get driver configurations
8=============================
9
10Path: /debug/btmrvl/config/
11
gpiogap=[n], hscfgcmd
    These commands are used to configure the host sleep parameters::

        bit 7:0  -- Gap
        bit 15:8 -- GPIO

    where GPIO is the pin number of the GPIO used to wake up the host.
    It can be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface
    wakeup will be used instead).

    where Gap is the gap in milliseconds between the wakeup signal and
    the wakeup event, or 0xff for the special host sleep setting.

    Usage::

        # Use SDIO interface to wake up the host and set GAP to 0x80:
        echo 0xff80 > /debug/btmrvl/config/gpiogap
        echo 1 > /debug/btmrvl/config/hscfgcmd

        # Use GPIO pin #3 to wake up the host and set GAP to 0xff:
        echo 0x03ff > /debug/btmrvl/config/gpiogap
        echo 1 > /debug/btmrvl/config/hscfgcmd
33
34psmode=[n], pscmd
35 These commands are used to enable/disable auto sleep mode
36
37 where the option is::
38
39 1 -- Enable auto sleep mode
40 0 -- Disable auto sleep mode
41
42 Usage::
43
44 # Enable auto sleep mode
45 echo 1 > /debug/btmrvl/config/psmode
46 echo 1 > /debug/btmrvl/config/pscmd
47
48 # Disable auto sleep mode
49 echo 0 > /debug/btmrvl/config/psmode
50 echo 1 > /debug/btmrvl/config/pscmd
51
52
53hsmode=[n], hscmd
54 These commands are used to enable host sleep or wake up firmware
55
56 where the option is::
57
58 1 -- Enable host sleep
59 0 -- Wake up firmware
60
61 Usage::
62
63 # Enable host sleep
64 echo 1 > /debug/btmrvl/config/hsmode
65 echo 1 > /debug/btmrvl/config/hscmd
66
67 # Wake up firmware
68 echo 0 > /debug/btmrvl/config/hsmode
69 echo 1 > /debug/btmrvl/config/hscmd
70
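The gpiogap value packs two bytes into one number: the GPIO pin in the high
byte and the gap in the low byte, as the usage examples for gpiogap show.
A minimal sketch of building that value follows; the function name
gpiogap_value is illustrative and not part of the driver:

```shell
# gpiogap_value GPIO GAP: pack the GPIO pin number into the high byte and
# the gap into the low byte.
gpiogap_value() {
    printf '0x%02x%02x\n' "$1" "$2"
}

gpiogap_value 0xff 0x80    # 0xff80
gpiogap_value 3 0xff       # 0x03ff

# Typical use (debugfs path as given in this document):
#   gpiogap_value 3 0xff > /debug/btmrvl/config/gpiogap
#   echo 1 > /debug/btmrvl/config/hscfgcmd
```
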
71
72Get driver status
73=================
74
75Path: /debug/btmrvl/status/
76
77Usage::
78
79 cat /debug/btmrvl/status/<args>
80
81where the args are:
82
83curpsmode
84 This command displays current auto sleep status.
85
86psstate
    This command displays the power save state.
88
89hsstate
    This command displays the host sleep state.
91
92txdnldrdy
93 This command displays the value of Tx download ready flag.
94
95Issuing a raw hci command
96=========================
97
Use hcitool to issue a raw HCI command; refer to the hcitool manual.

Usage::

    hcitool cmd <ogf> <ocf> [Parameters]
103
104Interface Control Command::
105
106 hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface
107 hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface
108 hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface
109 hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface
110 hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface
111 hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface
112
113SD8688 firmware
114===============
115
116Images:
117
118- /lib/firmware/sd8688_helper.bin
119- /lib/firmware/sd8688.bin
120
121
122The images can be downloaded from:
123
124git.infradead.org/users/dwmw2/linux-firmware.git/libertas/
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
index b761aa2a51d2..44b8a4edd348 100644
--- a/Documentation/admin-guide/bug-hunting.rst
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -90,9 +90,9 @@ the disk is not available then you have three options:
90 run a null modem to a second machine and capture the output there 90 run a null modem to a second machine and capture the output there
91 using your favourite communication program. Minicom works well. 91 using your favourite communication program. Minicom works well.
92 92
93(3) Use Kdump (see Documentation/kdump/kdump.rst), 93(3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
94 extract the kernel ring buffer from old memory with using dmesg 94 extract the kernel ring buffer from old memory with using dmesg
95 gdbmacro in Documentation/kdump/gdbmacros.txt. 95 gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.
96 96
97Finding the bug's location 97Finding the bug's location
98-------------------------- 98--------------------------
diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
new file mode 100644
index 000000000000..1d7d962933be
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
@@ -0,0 +1,302 @@
1===================
2Block IO Controller
3===================
4
Overview
========
cgroup subsys "blkio" implements the block IO controller. There is a need
for various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes and at intermediate nodes in a storage hierarchy.
The plan is to use the same cgroup-based management interface for the blkio
controller and to switch IO policies in the background based on user options.

One IO control policy is the throttling policy, which can be used to
specify upper IO rate limits on devices. This policy is implemented in
the generic block layer and can be used on leaf nodes as well as on
higher-level logical devices like device mapper.
17
HOWTO
=====
Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller::

        CONFIG_BLK_CGROUP=y

- Enable throttling in block layer::

        CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see "Why are cgroups needed?" in
  Documentation/admin-guide/cgroup-v1/cgroups.rst)::

        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group. The
  format for the policy is "<major>:<minor> <bytes_per_second>"::

        echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  The above puts a limit of 1MB/second on reads by the root group
  on the device with major/minor number 8:16.

- Run dd to read a file and see whether the rate is throttled to 1MB/s::

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be set using the blkio.throttle.write_bps_device file.
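
As a rough cross-check of the dd result above, the expected transfer time
under a bytes-per-second limit is the transfer size divided by the limit.
The helper name ``expected_seconds`` is hypothetical, and it uses integer
shell arithmetic, so treat it as a sketch rather than a precise model of
the throttler:

```shell
# expected_seconds BYTES RATE_BPS
# Hypothetical helper: whole seconds needed to move BYTES at RATE_BPS.
expected_seconds() {
    echo $(( $1 / $2 ))
}

# 1024 blocks of 4 KiB read against a 1048576 bytes/s limit.
expected_seconds $(( 1024 * 4096 )) 1048576
```

This prints 4, in line with the roughly 4 seconds that dd reported.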
50
Hierarchical Cgroups
====================

Throttling implements hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from cgroup side, which currently is a development option and
not publicly available.

If somebody created a hierarchy as follows::

                        root
                        /  \
                     test1 test2
                        |
                     test3

Throttling with "sane_behavior" will handle the
hierarchy correctly. For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.

Throttling without "sane_behavior" enabled from cgroup side will
practically treat all groups as being at the same level, as if the
hierarchy looked like the following::

                                pivot
                             /  /   \  \
                        root  test1 test2  test3
80Various user visible config options
81===================================
82CONFIG_BLK_CGROUP
83 - Block IO controller.
84
85CONFIG_BFQ_CGROUP_DEBUG
86 - Debug help. Right now some additional stats file show up in cgroup
87 if this option is enabled.
88
89CONFIG_BLK_DEV_THROTTLING
90 - Enable block device throttling support in block layer.
91
92Details of cgroup files
93=======================
94Proportional weight policy files
95--------------------------------
96- blkio.weight
97 - Specifies per cgroup weight. This is default weight of the group
98 on all the devices until and unless overridden by per device rule.
99 (See blkio.weight_device).
100 Currently allowed range of weights is from 10 to 1000.
101
102- blkio.weight_device
103 - One can specify per cgroup per device rules using this interface.
104 These rules override the default value of group weight as specified
105 by blkio.weight.
106
107 Following is the format::
108
109 # echo dev_maj:dev_minor weight > blkio.weight_device
110
111 Configure weight=300 on /dev/sdb (8:16) in this cgroup::
112
113 # echo 8:16 300 > blkio.weight_device
114 # cat blkio.weight_device
115 dev weight
116 8:16 300
117
118 Configure weight=500 on /dev/sda (8:0) in this cgroup::
119
120 # echo 8:0 500 > blkio.weight_device
121 # cat blkio.weight_device
122 dev weight
123 8:0 500
124 8:16 300
125
126 Remove specific weight for /dev/sda in this cgroup::
127
128 # echo 8:0 0 > blkio.weight_device
129 # cat blkio.weight_device
130 dev weight
131 8:16 300
132
- blkio.leaf_weight[_device]
        - Equivalents of blkio.weight[_device] for the purpose of
          deciding how much weight tasks in the given cgroup have while
          competing with the cgroup's child cgroups. For details,
          please refer to Documentation/block/cfq-iosched.txt.
138
139- blkio.time
140 - disk time allocated to cgroup per device in milliseconds. First
141 two fields specify the major and minor number of the device and
142 third field specifies the disk time allocated to group in
143 milliseconds.
144
145- blkio.sectors
146 - number of sectors transferred to/from disk by the group. First
147 two fields specify the major and minor number of the device and
148 third field specifies the number of sectors transferred by the
149 group to/from the device.
150
151- blkio.io_service_bytes
152 - Number of bytes transferred to/from the disk by the group. These
153 are further divided by the type of operation - read or write, sync
154 or async. First two fields specify the major and minor number of the
155 device, third field specifies the operation type and the fourth field
156 specifies the number of bytes.
157
158- blkio.io_serviced
159 - Number of IOs (bio) issued to the disk by the group. These
160 are further divided by the type of operation - read or write, sync
161 or async. First two fields specify the major and minor number of the
162 device, third field specifies the operation type and the fourth field
163 specifies the number of IOs.
164
165- blkio.io_service_time
166 - Total amount of time between request dispatch and request completion
167 for the IOs done by this cgroup. This is in nanoseconds to make it
168 meaningful for flash devices too. For devices with queue depth of 1,
169 this time represents the actual service time. When queue_depth > 1,
170 that is no longer true as requests may be served out of order. This
171 may cause the service time for a given IO to include the service time
172 of multiple IOs when served out of order which may result in total
173 io_service_time > actual time elapsed. This time is further divided by
174 the type of operation - read or write, sync or async. First two fields
175 specify the major and minor number of the device, third field
176 specifies the operation type and the fourth field specifies the
177 io_service_time in ns.
178
179- blkio.io_wait_time
180 - Total amount of time the IOs for this cgroup spent waiting in the
181 scheduler queues for service. This can be greater than the total time
182 elapsed since it is cumulative io_wait_time for all IOs. It is not a
183 measure of total time the cgroup spent waiting but rather a measure of
184 the wait_time for its individual IOs. For devices with queue_depth > 1
185 this metric does not include the time spent waiting for service once
186 the IO is dispatched to the device but till it actually gets serviced
187 (there might be a time lag here due to re-ordering of requests by the
188 device). This is in nanoseconds to make it meaningful for flash
189 devices too. This time is further divided by the type of operation -
190 read or write, sync or async. First two fields specify the major and
191 minor number of the device, third field specifies the operation type
192 and the fourth field specifies the io_wait_time in ns.
193
194- blkio.io_merged
195 - Total number of bios/requests merged into requests belonging to this
196 cgroup. This is further divided by the type of operation - read or
197 write, sync or async.
198
199- blkio.io_queued
200 - Total number of requests queued up at any given instant for this
201 cgroup. This is further divided by the type of operation - read or
202 write, sync or async.
203
204- blkio.avg_queue_size
205 - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
206 The average queue size for this cgroup over the entire time of this
207 cgroup's existence. Queue size samples are taken each time one of the
208 queues of this cgroup gets a timeslice.
209
210- blkio.group_wait_time
211 - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
212 This is the amount of time the cgroup had to wait since it became busy
213 (i.e., went from 0 to 1 request queued) to get a timeslice for one of
214 its queues. This is different from the io_wait_time which is the
215 cumulative total of the amount of time spent by each IO in that cgroup
216 waiting in the scheduler queue. This is in nanoseconds. If this is
217 read when the cgroup is in a waiting (for timeslice) state, the stat
218 will only report the group_wait_time accumulated till the last time it
219 got a timeslice and will not include the current delta.
220
221- blkio.empty_time
222 - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
223 This is the amount of time a cgroup spends without any pending
224 requests when not being served, i.e., it does not include any time
225 spent idling for one of the queues of the cgroup. This is in
226 nanoseconds. If this is read when the cgroup is in an empty state,
227 the stat will only report the empty_time accumulated till the last
228 time it had a pending request and will not include the current delta.
229
230- blkio.idle_time
231 - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
232 This is the amount of time spent by the IO scheduler idling for a
233 given cgroup in anticipation of a better request than the existing ones
234 from other queues/cgroups. This is in nanoseconds. If this is read
235 when the cgroup is in an idling state, the stat will only report the
236 idle_time accumulated till the last idle period and will not include
237 the current delta.
238
- blkio.dequeue
        - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
          gives statistics about how many times a group was dequeued
          from the service tree of the device. First two fields specify the
          major and minor number of the device and third field specifies the
          number of times a group was dequeued from a particular device.
245
246- blkio.*_recursive
247 - Recursive version of various stats. These files show the
248 same information as their non-recursive counterparts but
249 include stats from all the descendant cgroups.
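
Several of the stat files above share the per-line layout described in
their entries: device numbers, operation type, then a value. The following
sketch totals such a file per operation type. It assumes lines of the form
``<major>:<minor> <operation> <value>``; the function name ``sum_stat_by_op``
and the sample data are invented for illustration:

```shell
# sum_stat_by_op: read blkio stat lines on stdin and print the total per
# operation type across all devices.
# Assumes each data line is "<major>:<minor> <operation> <value>";
# lines with a different number of fields are skipped.
sum_stat_by_op() {
    awk 'NF == 3 { total[$2] += $3 }
         END { for (op in total) print op, total[op] }' | sort
}

# Invented sample data. On a real system, feed a stat file instead, e.g.
#   sum_stat_by_op < /sys/fs/cgroup/blkio/blkio.io_service_bytes
printf '%s\n' '8:0 Read 4096' '8:0 Write 8192' '8:16 Read 1024' | sum_stat_by_op
```

With the sample input, this reports a Read total of 5120 and a Write total
of 8192.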
250
251Throttling/Upper limit policy files
252-----------------------------------
253- blkio.throttle.read_bps_device
254 - Specifies upper limit on READ rate from the device. IO rate is
255 specified in bytes per second. Rules are per device. Following is
256 the format::
257
258 echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
259
260- blkio.throttle.write_bps_device
261 - Specifies upper limit on WRITE rate to the device. IO rate is
262 specified in bytes per second. Rules are per device. Following is
263 the format::
264
265 echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
266
267- blkio.throttle.read_iops_device
268 - Specifies upper limit on READ rate from the device. IO rate is
269 specified in IO per second. Rules are per device. Following is
270 the format::
271
272 echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
273
274- blkio.throttle.write_iops_device
275 - Specifies upper limit on WRITE rate to the device. IO rate is
276 specified in io per second. Rules are per device. Following is
277 the format::
278
279 echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
280
281Note: If both BW and IOPS rules are specified for a device, then IO is
282 subjected to both the constraints.
283
284- blkio.throttle.io_serviced
285 - Number of IOs (bio) issued to the disk by the group. These
286 are further divided by the type of operation - read or write, sync
287 or async. First two fields specify the major and minor number of the
288 device, third field specifies the operation type and the fourth field
289 specifies the number of IOs.
290
291- blkio.throttle.io_service_bytes
292 - Number of bytes transferred to/from the disk by the group. These
293 are further divided by the type of operation - read or write, sync
294 or async. First two fields specify the major and minor number of the
295 device, third field specifies the operation type and the fourth field
296 specifies the number of bytes.
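
The note above says that IO is subject to both constraints when BW and IOPS
rules are set for the same device. As a simplified model (it ignores bursts
and the throttler's internal accounting, and assumes a fixed IO size), the
achievable byte rate is the smaller of the BW limit and the IOPS limit
multiplied by the IO size. The helper name ``effective_bps`` is hypothetical:

```shell
# effective_bps BPS_LIMIT IOPS_LIMIT IO_SIZE_BYTES
# Hypothetical helper: print the tighter of the two limits, expressed in
# bytes per second, for a workload issuing IOs of a fixed size.
effective_bps() {
    by_iops=$(( $2 * $3 ))
    if [ "$by_iops" -lt "$1" ]; then
        echo "$by_iops"
    else
        echo "$1"
    fi
}

# 1 MiB/s BW limit, 100 IOPS limit, 4 KiB IOs: the IOPS rule dominates.
effective_bps 1048576 100 4096
```

Here the IOPS rule allows only 100 * 4096 = 409600 bytes/s, so the function
prints 409600 rather than the 1048576 byte BW limit.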
297
298Common files among various policies
299-----------------------------------
300- blkio.reset_stats
301 - Writing an int to this file will result in resetting all the stats
302 for that cgroup.
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
new file mode 100644
index 000000000000..b0688011ed06
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -0,0 +1,695 @@
1==============
2Control Groups
3==============
4
5Written by Paul Menage <menage@google.com> based on
6Documentation/admin-guide/cgroup-v1/cpusets.rst
7
8Original copyright statements from cpusets.txt:
9
10Portions Copyright (C) 2004 BULL SA.
11
12Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
13
14Modified by Paul Jackson <pj@sgi.com>
15
16Modified by Christoph Lameter <cl@linux.com>
17
.. CONTENTS:

   1. Control Groups
     1.1 What are cgroups ?
     1.2 Why are cgroups needed ?
     1.3 How are cgroups implemented ?
     1.4 What does notify_on_release do ?
     1.5 What does clone_children do ?
     1.6 How do I use cgroups ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Attaching processes
     2.3 Mounting hierarchies by name
   3. Kernel API
     3.1 Overview
     3.2 Synchronization
     3.3 Subsystem API
   4. Extended attributes usage
   5. Questions
37
1. Control Groups
=================

1.1 What are cgroups ?
----------------------
43
44Control Groups provide a mechanism for aggregating/partitioning sets of
45tasks, and all their future children, into hierarchical groups with
46specialized behaviour.
47
48Definitions:
49
50A *cgroup* associates a set of tasks with a set of parameters for one
51or more subsystems.
52
53A *subsystem* is a module that makes use of the task grouping
54facilities provided by cgroups to treat groups of tasks in
55particular ways. A subsystem is typically a "resource controller" that
56schedules a resource or applies per-cgroup limits, but it may be
57anything that wants to act on a group of processes, e.g. a
58virtualization subsystem.
59
60A *hierarchy* is a set of cgroups arranged in a tree, such that
61every task in the system is in exactly one of the cgroups in the
62hierarchy, and a set of subsystems; each subsystem has system-specific
63state attached to each cgroup in the hierarchy. Each hierarchy has
64an instance of the cgroup virtual filesystem associated with it.
65
66At any one time there may be multiple active hierarchies of task
67cgroups. Each hierarchy is a partition of all tasks in the system.
68
69User-level code may create and destroy cgroups by name in an
70instance of the cgroup virtual file system, specify and query to
71which cgroup a task is assigned, and list the task PIDs assigned to
72a cgroup. Those creations and assignments only affect the hierarchy
73associated with that instance of the cgroup file system.
74
75On their own, the only use for cgroups is for simple job
76tracking. The intention is that other subsystems hook into the generic
77cgroup support to provide new attributes for cgroups, such as
78accounting/limiting the resources which processes in a cgroup can
79access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
80you to associate a set of CPUs and a set of memory nodes with the
81tasks in each cgroup.
82
1.2 Why are cgroups needed ?
----------------------------
85
86There are multiple efforts to provide process aggregations in the
87Linux kernel, mainly for resource-tracking purposes. Such efforts
88include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
89namespaces. These all require the basic notion of a
90grouping/partitioning of processes, with newly forked processes ending
91up in the same group (cgroup) as their parent process.
92
93The kernel cgroup patch provides the minimum essential kernel
94mechanisms required to efficiently implement such groups. It has
95minimal impact on the system fast paths, and provides hooks for
96specific subsystems such as cpusets to provide additional behaviour as
97desired.
98
99Multiple hierarchy support is provided to allow for situations where
100the division of tasks into cgroups is distinctly different for
101different subsystems - having parallel hierarchies allows each
102hierarchy to be a natural division of tasks, without having to handle
103complex combinations of tasks that would be present if several
104unrelated subsystems needed to be forced into the same tree of
105cgroups.
106
107At one extreme, each resource controller or subsystem could be in a
108separate hierarchy; at the other extreme, all subsystems
109would be attached to the same hierarchy.
110
111As an example of a scenario (originally proposed by vatsa@in.ibm.com)
112that can benefit from multiple hierarchies, consider a large
113university server with various users - students, professors, system
114tasks etc. The resource planning for this server could be along the
115following lines::
116
117 CPU : "Top cpuset"
118 / \
119 CPUSet1 CPUSet2
120 | |
121 (Professors) (Students)
122
123 In addition (system tasks) are attached to topcpuset (so
124 that they can run anywhere) with a limit of 20%
125
126 Memory : Professors (50%), Students (30%), system (20%)
127
128 Disk : Professors (50%), Students (30%), system (20%)
129
130 Network : WWW browsing (20%), Network File System (60%), others (20%)
131 / \
132 Professors (15%) students (5%)
133
134Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
135into the NFS network class.
136
137At the same time Firefox/Lynx will share an appropriate CPU/Memory class
138depending on who launched it (prof/student).
139
140With the ability to classify tasks differently for different resources
141(by putting those resource subsystems in different hierarchies),
142the admin can easily set up a script which receives exec notifications
143and depending on who is launching the browser he can::
144
145 # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
146
147With only a single hierarchy, he now would potentially have to create
148a separate cgroup for every browser launched and associate it with
149appropriate network and other resource class. This may lead to
150proliferation of such cgroups.
151
152Also let's say that the administrator would like to give enhanced network
153access temporarily to a student's browser (since it is night and the user
154wants to do online gaming :)) OR give one of the student's simulation
155apps enhanced CPU power.
156
157With ability to write PIDs directly to resource classes, it's just a
158matter of::
159
160 # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
161 (after some time)
162 # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
163
164Without this ability, the administrator would have to split the cgroup into
165multiple separate ones and then associate the new cgroups with the
166new resource classes.
167
168
169
1.3 How are cgroups implemented ?
---------------------------------
172
173Control Groups extends the kernel as follows:
174
175 - Each task in the system has a reference-counted pointer to a
176 css_set.
177
178 - A css_set contains a set of reference-counted pointers to
179 cgroup_subsys_state objects, one for each cgroup subsystem
180 registered in the system. There is no direct link from a task to
181 the cgroup of which it's a member in each hierarchy, but this
182 can be determined by following pointers through the
183 cgroup_subsys_state objects. This is because accessing the
184 subsystem state is something that's expected to happen frequently
185 and in performance-critical code, whereas operations that require a
186 task's actual cgroup assignments (in particular, moving between
187 cgroups) are less common. A linked list runs through the cg_list
188 field of each task_struct using the css_set, anchored at
189 css_set->tasks.
190
191 - A cgroup hierarchy filesystem can be mounted for browsing and
192 manipulation from user space.
193
194 - You can list all the tasks (by PID) attached to any cgroup.
195
196The implementation of cgroups requires a few, simple hooks
197into the rest of the kernel, none in performance-critical paths:
198
199 - in init/main.c, to initialize the root cgroups and initial
200 css_set at system boot.
201
202 - in fork and exit, to attach and detach a task from its css_set.
203
204In addition, a new file system of type "cgroup" may be mounted, to
205enable browsing and modifying the cgroups presently known to the
206kernel. When mounting a cgroup hierarchy, you may specify a
207comma-separated list of subsystems to mount as the filesystem mount
208options. By default, mounting the cgroup filesystem attempts to
209mount a hierarchy containing all registered subsystems.
210
211If an active hierarchy with exactly the same set of subsystems already
212exists, it will be reused for the new mount. If no existing hierarchy
213matches, and any of the requested subsystems are in use in an existing
214hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
215is activated, associated with the requested subsystems.
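
The three possible outcomes described above (reuse, -EBUSY, new hierarchy)
can be modelled in a few lines of shell. This is only an illustration of the
stated rules, not the kernel's implementation; the function names are
hypothetical, subsystem sets are passed as comma-separated lists, and the
"mount all registered subsystems" default is not modelled:

```shell
# normalize LIST: sort and de-duplicate a comma-separated subsystem list.
normalize() {
    printf '%s\n' "$1" | tr ',' '\n' | sort -u | paste -sd, -
}

# mount_outcome REQUESTED [EXISTING...]
# Print what a mount request for REQUESTED would do, given the subsystem
# sets of the already active hierarchies.
mount_outcome() {
    requested=$(normalize "$1")
    shift
    # An active hierarchy with exactly the same set is reused.
    for existing in "$@"; do
        if [ "$(normalize "$existing")" = "$requested" ]; then
            echo reuse
            return 0
        fi
    done
    # Otherwise, any requested subsystem already in use means -EBUSY.
    for existing in "$@"; do
        for subsys in $(printf '%s\n' "$requested" | tr ',' ' '); do
            case ",$existing," in
                *",$subsys,"*) echo EBUSY; return 0 ;;
            esac
        done
    done
    echo new
}

mount_outcome cpuset,memory memory,cpuset   # same set, different order
mount_outcome cpuset cpuset,memory          # cpuset is already in use
mount_outcome blkio cpuset,memory           # nothing in common
```

The three calls print reuse, EBUSY and new, in that order.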
216
217It's not currently possible to bind a new subsystem to an active
218cgroup hierarchy, or to unbind a subsystem from an active cgroup
219hierarchy. This may be possible in future, but is fraught with nasty
220error-recovery issues.
221
222When a cgroup filesystem is unmounted, if there are any
223child cgroups created below the top-level cgroup, that hierarchy
224will remain active even though unmounted; if there are no
225child cgroups then the hierarchy will be deactivated.
226
227No new system calls are added for cgroups - all support for
228querying and modifying cgroups is via this cgroup file system.
229
230Each task under /proc has an added file named 'cgroup' displaying,
231for each active hierarchy, the subsystem names and the cgroup name
232as the path relative to the root of the cgroup file system.
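
A script can use this file to find out which cgroup a task belongs to in a
given hierarchy. The sketch below assumes the colon-separated line layout
``<hierarchy-id>:<subsystems>:<path>`` documented in the cgroups(7) manual
page; the function name ``cgroup_path_for`` and the sample lines are invented
for illustration:

```shell
# cgroup_path_for SUBSYS
# Read /proc/<pid>/cgroup-style lines on stdin and print the cgroup path of
# the hierarchy that has SUBSYS attached. Returns non-zero if none matches.
cgroup_path_for() {
    while IFS=: read -r _id subsystems path; do
        case ",$subsystems," in
            *",$1,"*) echo "$path"; return 0 ;;
        esac
    done
    return 1
}

# Invented sample lines. On a real system use, for example:
#   cgroup_path_for cpuset < /proc/self/cgroup
printf '%s\n' '4:cpuset:/Charlie' '3:cpu,cpuacct:/' | cgroup_path_for cpuset
```

With the sample input this prints /Charlie.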
233
234Each cgroup is represented by a directory in the cgroup file system
235containing the following files describing that cgroup:
236
237 - tasks: list of tasks (by PID) attached to that cgroup. This list
238 is not guaranteed to be sorted. Writing a thread ID into this file
239 moves the thread into this cgroup.
240 - cgroup.procs: list of thread group IDs in the cgroup. This list is
241 not guaranteed to be sorted or free of duplicate TGIDs, and userspace
242 should sort/uniquify the list if this property is required.
243 Writing a thread group ID into this file moves all threads in that
244 group into this cgroup.
245 - notify_on_release flag: run the release agent on exit?
246 - release_agent: the path to use for release notifications (this file
247 exists in the top cgroup only)
248
249Other subsystems such as cpusets may add additional files in each
250cgroup dir.
251
New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroup's
directory, as listed above.
256
257The named hierarchical structure of nested cgroups allows partitioning
258a large system into nested, dynamically changeable, "soft-partitions".
259
260The attachment of each task, automatically inherited at fork by any
261children of that task, to a cgroup allows organizing the work load
262on a system into related sets of tasks. A task may be re-attached to
263any other cgroup, if allowed by the permissions on the necessary
264cgroup file system directories.
265
266When a task is moved from one cgroup to another, it gets a new
267css_set pointer - if there's an already existing css_set with the
268desired collection of cgroups then that group is reused, otherwise a new
269css_set is allocated. The appropriate existing css_set is located by
270looking into a hash table.
271
272To allow access from a cgroup to the css_sets (and hence tasks)
273that comprise it, a set of cg_cgroup_link objects form a lattice;
274each cg_cgroup_link is linked into a list of cg_cgroup_links for
275a single cgroup on its cgrp_link_list field, and a list of
276cg_cgroup_links for a single css_set on its cg_link_list.
277
278Thus the set of tasks in a cgroup can be listed by iterating over
279each css_set that references the cgroup, and sub-iterating over
280each css_set's task set.
281
282The use of a Linux virtual file system (vfs) to represent the
283cgroup hierarchy provides for a familiar permission and name space
284for cgroups, with a minimum of additional kernel code.
285
1.4 What does notify_on_release do ?
------------------------------------
288
289If the notify_on_release flag is enabled (1) in a cgroup, then
290whenever the last task in the cgroup leaves (exits or attaches to
291some other cgroup) and the last child cgroup of that cgroup
292is removed, then the kernel runs the command specified by the contents
293of the "release_agent" file in that hierarchy's root directory,
294supplying the pathname (relative to the mount point of the cgroup
295file system) of the abandoned cgroup. This enables automatic
296removal of abandoned cgroups. The default value of
297notify_on_release in the root cgroup at system boot is disabled
298(0). The default value of other cgroups at creation is the current
299value of their parents' notify_on_release settings. The default value of
300a cgroup hierarchy's release_agent path is empty.
301
1.5 What does clone_children do ?
---------------------------------
304
305This flag only affects the cpuset controller. If the clone_children
306flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
307configuration from the parent during initialization.
308
1.6 How do I use cgroups ?
--------------------------
311
312To start a new job that is to be contained within a cgroup, using
313the "cpuset" cgroup subsystem, the steps are something like::
314
315 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
316 2) mkdir /sys/fs/cgroup/cpuset
317 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
318 4) Create the new cgroup by doing mkdir's and write's (or echo's) in
319 the /sys/fs/cgroup/cpuset virtual file system.
320 5) Start a task that will be the "founding father" of the new job.
321 6) Attach that task to the new cgroup by writing its PID to the
322 /sys/fs/cgroup/cpuset tasks file for that cgroup.
323 7) fork, exec or clone the job tasks from this founding father task.
324
For example, the following sequence of commands will set up a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup::
328
329 mount -t tmpfs cgroup_root /sys/fs/cgroup
330 mkdir /sys/fs/cgroup/cpuset
331 mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
332 cd /sys/fs/cgroup/cpuset
333 mkdir Charlie
334 cd Charlie
335 /bin/echo 2-3 > cpuset.cpus
336 /bin/echo 1 > cpuset.mems
337 /bin/echo $$ > tasks
338 sh
339 # The subshell 'sh' is now running in cgroup Charlie
340 # The next line should display '/Charlie'
341 cat /proc/self/cgroup
342
2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cgroups can be done through the cgroup
virtual filesystem.
351
352To mount a cgroup hierarchy with all available subsystems, type::
353
354 # mount -t cgroup xxx /sys/fs/cgroup
355
356The "xxx" is not interpreted by the cgroup code, but will appear in
357/proc/mounts so may be any useful identifying string that you like.
358
359Note: Some subsystems do not work without some user input first. For instance,
360if cpusets are enabled the user will have to populate the cpus and mems files
361for each new cgroup created before that group can be used.
362
363As explained in section `1.2 Why are cgroups needed?` you should create
364different hierarchies of cgroups for each single resource or group of
365resources you want to control. Therefore, you should mount a tmpfs on
366/sys/fs/cgroup and create directories for each cgroup resource or resource
367group::
368
369 # mount -t tmpfs cgroup_root /sys/fs/cgroup
370 # mkdir /sys/fs/cgroup/rg1
371
372To mount a cgroup hierarchy with just the cpuset and memory
373subsystems, type::
374
375 # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
376
While remounting cgroups is currently supported, it is not recommended.
Remounting allows changing the bound subsystems and
release_agent. Rebinding is hardly useful, as it only works when the
hierarchy is empty, and release_agent itself should be replaced with
conventional fsnotify. Support for remounting will be removed in
the future.

To specify a hierarchy's release_agent::
385
386 # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
387 xxx /sys/fs/cgroup/rg1
388
389Note that specifying 'release_agent' more than once will return failure.
390
391Note that changing the set of subsystems is currently only supported
392when the hierarchy consists of a single (root) cgroup. Supporting
393the ability to arbitrarily bind/unbind subsystems from an existing
394cgroup hierarchy is intended to be implemented in the future.
395
396Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
397tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
398is the cgroup that holds the whole system.
399
400If you want to change the value of release_agent::
401
402 # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
403
404It can also be changed via remount.
405
406If you want to create a new cgroup under /sys/fs/cgroup/rg1::
407
408 # cd /sys/fs/cgroup/rg1
409 # mkdir my_cgroup
410
Now you want to do something with this cgroup::
412
413 # cd my_cgroup
414
415In this directory you can find several files::
416
417 # ls
418 cgroup.procs notify_on_release tasks
419 (plus whatever files added by the attached subsystems)
420
421Now attach your shell to this cgroup::
422
423 # /bin/echo $$ > tasks
424
425You can also create cgroups inside your cgroup by using mkdir in this
426directory::
427
428 # mkdir my_sub_cs
429
430To remove a cgroup, just use rmdir::
431
432 # rmdir my_sub_cs
433
434This will fail if the cgroup is in use (has cgroups inside, or
435has processes attached, or is held alive by other subsystem-specific
436reference).
437
2.2 Attaching processes
-----------------------
440
441::
442
443 # /bin/echo PID > tasks
444
445Note that it is PID, not PIDs. You can only attach ONE task at a time.
446If you have several tasks to attach, you have to do it one after another::
447
448 # /bin/echo PID1 > tasks
449 # /bin/echo PID2 > tasks
450 ...
451 # /bin/echo PIDn > tasks
452
453You can attach the current shell task by echoing 0::
454
455 # echo 0 > tasks
456
457You can use the cgroup.procs file instead of the tasks file to move all
458threads in a threadgroup at once. Echoing the PID of any task in a
459threadgroup to cgroup.procs causes all tasks in that threadgroup to be
460attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
461in the writing task's threadgroup.
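
Because only one task can be attached per write, a small loop is the usual
way to attach several. The helper below is a sketch with a hypothetical name
(``attach_pids``). It issues one write per PID and uses append redirection so
that it can also be tried out against an ordinary file:

```shell
# attach_pids TASKS_FILE PID...
# Hypothetical helper: write each PID to TASKS_FILE with its own write,
# one after another. Stops at the first failure.
attach_pids() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        echo "$pid" >> "$tasks_file" || return 1
    done
}

# Example (needs a mounted hierarchy and suitable permissions):
#   attach_pids /sys/fs/cgroup/rg1/my_cgroup/tasks 1234 1235 1236
```

Passing the path of a cgroup.procs file instead moves whole threadgroups, as
described above.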
462
463Note: Since every task is always a member of exactly one cgroup in each
464mounted hierarchy, to remove a task from its current cgroup you must
465move it into a new cgroup (possibly the root cgroup) by writing to the
466new cgroup's tasks file.
467
468Note: Due to some restrictions enforced by some cgroup subsystems, moving
469a process to another cgroup can fail.
470
2.3 Mounting hierarchies by name
--------------------------------
473
474Passing the name=<x> option when mounting a cgroups hierarchy
475associates the given name with the hierarchy. This can be used when
476mounting a pre-existing hierarchy, in order to refer to it by name
477rather than by its set of active subsystems. Each hierarchy is either
478nameless, or has a unique name.
479
480The name should match ``[\w.-]+``
481
482When passing a name=<x> option for a new hierarchy, you need to
483specify subsystems manually; the legacy behaviour of mounting all
484subsystems when none are explicitly specified is not supported when
485you give a hierarchy a name.
486
487The name of the hierarchy appears as part of the hierarchy description
488in /proc/mounts and /proc/<pid>/cgroup.
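The sketch below looks up a named hierarchy in the per-process cgroup listing. It assumes the usual ``hierarchy-ID:controller-list:path`` line layout of /proc/<pid>/cgroup, in which a named hierarchy shows up as ``name=<x>`` in the middle field. The helper name ``named_hierarchy_path`` is hypothetical, and the mount line in the comment uses placeholder names and paths.

```shell
# Illustrative mount of a named hierarchy with no subsystems attached
# (the name "demo" and the mount point are placeholders):
#   mount -t cgroup -o none,name=demo none /mnt/demo

# named_hierarchy_path NAME [FILE]
# Hypothetical helper: prints the cgroup path recorded for the hierarchy
# called NAME in FILE (default: /proc/self/cgroup). Prints nothing if the
# name is not found.
named_hierarchy_path() {
    awk -v want="name=$1" -F: '
        {
            n = split($2, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == want) {
                    # Re-join fields 3..NF in case the path contains ":".
                    path = $3
                    for (j = 4; j <= NF; j++) path = path ":" $j
                    print path
                    exit
                }
        }' "${2:-/proc/self/cgroup}"
}
```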
489
490
4913. Kernel API
492=============
493
4943.1 Overview
495------------
496
497Each kernel subsystem that wants to hook into the generic cgroup
498system needs to create a cgroup_subsys object. This contains
499various methods, which are callbacks from the cgroup system, along
500with a subsystem ID which will be assigned by the cgroup system.
501
502Other fields in the cgroup_subsys object include:
503
504- subsys_id: a unique array index for the subsystem, indicating which
505 entry in cgroup->subsys[] this subsystem should be managing.
506
507- name: should be initialized to a unique subsystem name. Should be
508 no longer than MAX_CGROUP_TYPE_NAMELEN.
509
510- early_init: indicate if the subsystem needs early initialization
511 at system boot.
512
513Each cgroup object created by the system has an array of pointers,
514indexed by subsystem ID; each pointer is entirely managed by its
515subsystem; the generic cgroup code will never touch it.
516
5173.2 Synchronization
518-------------------
519
520There is a global mutex, cgroup_mutex, used by the cgroup
521system. This should be taken by anything that wants to modify a
522cgroup. It may also be taken to prevent cgroups from being
523modified, but more specific locks may be more appropriate in that
524situation.
525
526See kernel/cgroup.c for more details.
527
528Subsystems can take/release the cgroup_mutex via the functions
529cgroup_lock()/cgroup_unlock().
530
531Accessing a task's cgroup pointer may be done in the following ways:

532- while holding cgroup_mutex
533- while holding the task's alloc_lock (via task_lock())
534- inside an rcu_read_lock() section via rcu_dereference()
535
5363.3 Subsystem API
537-----------------
538
539Each subsystem should:
540
541- add an entry in linux/cgroup_subsys.h
542- define a cgroup_subsys object called <name>_cgrp_subsys
543
544Each subsystem may export the following methods. The only mandatory
545methods are css_alloc/free. Any others that are null are presumed to
546be successful no-ops.
547
548``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
549(cgroup_mutex held by caller)
550
551Called to allocate a subsystem state object for a cgroup. The
552subsystem should allocate its subsystem state object for the passed
553cgroup, returning a pointer to the new object on success or an
554ERR_PTR() value. On success, the subsystem pointer should point to
555a structure of type cgroup_subsys_state (typically embedded in a
556larger subsystem-specific object), which will be initialized by the
557cgroup system. Note that this will be called at initialization to
558create the root subsystem state for this subsystem; this case can be
559identified by the passed cgroup object having a NULL parent (since
560it's the root of the hierarchy) and may be an appropriate place for
561initialization code.
562
563``int css_online(struct cgroup *cgrp)``
564(cgroup_mutex held by caller)
565
566Called after @cgrp has successfully completed all allocations and been made
567visible to cgroup_for_each_child/descendant_*() iterators. The
568subsystem may choose to fail creation by returning -errno. This
569callback can be used to implement reliable state sharing and
570propagation along the hierarchy. See the comment on
571cgroup_for_each_descendant_pre() for details.
572
573``void css_offline(struct cgroup *cgrp)``
574(cgroup_mutex held by caller)
575
576This is the counterpart of css_online() and is called iff css_online()
577has succeeded on @cgrp. This signifies the beginning of the end of
578@cgrp. @cgrp is being removed and the subsystem should start dropping
579all references it's holding on @cgrp. When all references are dropped,
580cgroup removal will proceed to the next step - css_free(). After this
581callback, @cgrp should be considered dead to the subsystem.
582
583``void css_free(struct cgroup *cgrp)``
584(cgroup_mutex held by caller)
585
586The cgroup system is about to free @cgrp; the subsystem should free
587its subsystem state object. By the time this method is called, @cgrp
588is completely unused; @cgrp->parent is still valid. (Note: this can also
589be called for a newly-created cgroup if an error occurs after this
590subsystem's css_alloc() method has been called for the new cgroup.)
591
592``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
593(cgroup_mutex held by caller)
594
595Called prior to moving one or more tasks into a cgroup; if the
596subsystem returns an error, this will abort the attach operation.
597@tset contains the tasks to be attached and is guaranteed to have at
598least one task in it.
599
600If there are multiple tasks in the taskset, then:
601 - it's guaranteed that all are from the same thread group
602 - @tset contains all tasks from the thread group whether or not
603 they're switching cgroups
604 - the first task is the leader
605
606Each @tset entry also contains the task's old cgroup and tasks which
607aren't switching cgroup can be skipped easily using the
608cgroup_taskset_for_each() iterator. Note that this isn't called on a
609fork. If this method returns 0 (success) then this should remain valid
610while the caller holds cgroup_mutex and it is ensured that either
611attach() or cancel_attach() will be called in future.
612
613``void css_reset(struct cgroup_subsys_state *css)``
614(cgroup_mutex held by caller)
615
616An optional operation which should restore @css's configuration to the
617initial state. This is currently only used on the unified hierarchy
618when a subsystem is disabled on a cgroup through
619"cgroup.subtree_control" but should remain enabled because other
620subsystems depend on it. cgroup core makes such a css invisible by
621removing the associated interface files and invokes this callback so
622that the hidden subsystem can return to the initial neutral state.
623This prevents unexpected resource control from a hidden css and
624ensures that the configuration is in the initial state when it is made
625visible again later.
626
627``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
628(cgroup_mutex held by caller)
629
630Called when a task attach operation has failed after can_attach() has succeeded.
631A subsystem whose can_attach() has side effects should provide this
632function so that it can roll them back; otherwise it is not necessary.
633This will be called only for subsystems whose can_attach() operation has
634succeeded. The parameters are identical to can_attach().
635
636``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
637(cgroup_mutex held by caller)
638
639Called after the task has been attached to the cgroup, to allow any
640post-attachment activity that requires memory allocations or blocking.
641The parameters are identical to can_attach().
642
643``void fork(struct task_struct *task)``
644
645Called when a task is forked into a cgroup.
646
647``void exit(struct task_struct *task)``
648
649Called during task exit.
650
651``void free(struct task_struct *task)``
652
653Called when the task_struct is freed.
654
655``void bind(struct cgroup *root)``
656(cgroup_mutex held by caller)
657
658Called when a cgroup subsystem is rebound to a different hierarchy
659and root cgroup. Currently this will only involve movement between
660the default hierarchy (which never has sub-cgroups) and a hierarchy
661that is being created/destroyed (and hence has no sub-cgroups).
662
6634. Extended attribute usage
664===========================
665
666cgroup filesystem supports certain types of extended attributes in its
667directories and files. The currently supported types are:
668
669 - Trusted (XATTR_TRUSTED)
670 - Security (XATTR_SECURITY)
671
672Both require CAP_SYS_ADMIN capability to set.
673
674Like in tmpfs, the extended attributes in cgroup filesystem are stored
675using kernel memory and it's advised to keep the usage to a minimum. This
676is the reason why user-defined extended attributes are not supported: any
677user could set them, and there is no limit on the value size.
678
679The current known users for this feature are SELinux to limit cgroup usage
680in containers and systemd for assorted meta data like main PID in a cgroup
681(systemd creates a cgroup per service).
682
6835. Questions
684============
685
686::
687
688 Q: what's up with this '/bin/echo' ?
689 A: bash's builtin 'echo' command does not check calls to write() against
690 errors. If you use it in the cgroup file system, you won't be
691 able to tell whether a command succeeded or failed.
692
693 Q: When I attach processes, only the first of the line gets really attached !
694 A: We can only return one error code per call to write(). So you should also
695 put only ONE PID.
diff --git a/Documentation/admin-guide/cgroup-v1/cpuacct.rst b/Documentation/admin-guide/cgroup-v1/cpuacct.rst
new file mode 100644
index 000000000000..d30ed81d2ad7
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cpuacct.rst
@@ -0,0 +1,50 @@
1=========================
2CPU Accounting Controller
3=========================
4
5The CPU accounting controller is used to group tasks using cgroups and
6account the CPU usage of these groups of tasks.
7
8The CPU accounting controller supports multi-hierarchy groups. An accounting
9group accumulates the CPU usage of all of its child groups and the tasks
10directly present in its group.
11
12Accounting groups can be created by first mounting the cgroup filesystem::
13
14 # mount -t cgroup -ocpuacct none /sys/fs/cgroup
15
16With the above step, the initial or the parent accounting group becomes
17visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
18the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
19/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
20by this group which is essentially the CPU time obtained by all the tasks
21in the system.
22
23New accounting groups can be created under the parent group /sys/fs/cgroup::
24
25 # cd /sys/fs/cgroup
26 # mkdir g1
27 # echo $$ > g1/tasks
28
29The above steps create a new group g1 and move the current shell
30process (bash) into it. CPU time consumed by this bash and its children
31can be obtained from g1/cpuacct.usage and the same is accumulated in
32/sys/fs/cgroup/cpuacct.usage also.
33
34cpuacct.stat file lists a few statistics which further divide the
35CPU time obtained by the cgroup into user and system times. Currently
36the following statistics are supported:
37
38- user: Time spent by tasks of the cgroup in user mode.
39- system: Time spent by tasks of the cgroup in kernel mode.
40
41user and system are reported in USER_HZ units.
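To turn the cpuacct.stat tick counts into seconds, divide by USER_HZ, which ``getconf CLK_TCK`` reports on typical systems. The following is a minimal sketch under that assumption; the helper name ``cpuacct_stat_seconds`` and its optional second argument are invented for illustration.

```shell
# cpuacct_stat_seconds STAT_FILE [USER_HZ]
# Hypothetical helper: prints the "user" and "system" lines of a
# cpuacct.stat-style file converted from USER_HZ ticks to seconds.
# USER_HZ defaults to the value reported by getconf CLK_TCK.
cpuacct_stat_seconds() {
    awk -v hz="${2:-$(getconf CLK_TCK)}" '
        $1 == "user" || $1 == "system" { printf "%s %.2f\n", $1, $2 / hz }
    ' "$1"
}
```

For the group created above this might be run as ``cpuacct_stat_seconds g1/cpuacct.stat``. The cpuacct.usage file needs no such conversion because it is already in nanoseconds.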
42
43The cpuacct controller uses the percpu_counter interface to collect user and
44system times. This has two side effects:
45
46- It is theoretically possible to see wrong values for user and system times.
47 This is because percpu_counter_read() on 32bit systems isn't safe
48 against concurrent writes.
49- It is possible to see slightly outdated values for user and system times
50 due to the batch processing nature of percpu_counter.
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
new file mode 100644
index 000000000000..86a6ae995d54
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -0,0 +1,866 @@
1=======
2CPUSETS
3=======
4
5Copyright (C) 2004 BULL SA.
6
7Written by Simon.Derr@bull.net
8
9- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
10- Modified by Paul Jackson <pj@sgi.com>
11- Modified by Christoph Lameter <cl@linux.com>
12- Modified by Paul Menage <menage@google.com>
13- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
14
15.. CONTENTS:
16
17 1. Cpusets
18 1.1 What are cpusets ?
19 1.2 Why are cpusets needed ?
20 1.3 How are cpusets implemented ?
21 1.4 What are exclusive cpusets ?
22 1.5 What is memory_pressure ?
23 1.6 What is memory spread ?
24 1.7 What is sched_load_balance ?
25 1.8 What is sched_relax_domain_level ?
26 1.9 How do I use cpusets ?
27 2. Usage Examples and Syntax
28 2.1 Basic Usage
29 2.2 Adding/removing cpus
30 2.3 Setting flags
31 2.4 Attaching processes
32 3. Questions
33 4. Contact
34
351. Cpusets
36==========
37
381.1 What are cpusets ?
39----------------------
40
41Cpusets provide a mechanism for assigning a set of CPUs and Memory
42Nodes to a set of tasks. In this document "Memory Node" refers to
43an on-line node that contains memory.
44
45Cpusets constrain the CPU and Memory placement of tasks to only
46the resources within a task's current cpuset. They form a nested
47hierarchy visible in a virtual file system. These are the essential
48hooks, beyond what is already present, required to manage dynamic
49job placement on large systems.
50
51Cpusets use the generic cgroup subsystem described in
52Documentation/admin-guide/cgroup-v1/cgroups.rst.
53
54Requests by a task, using the sched_setaffinity(2) system call to
55include CPUs in its CPU affinity mask, and using the mbind(2) and
56set_mempolicy(2) system calls to include Memory Nodes in its memory
57policy, are both filtered through that task's cpuset, filtering out any
58CPUs or Memory Nodes not in that cpuset. The scheduler will not
59schedule a task on a CPU that is not allowed in its cpus_allowed
60vector, and the kernel page allocator will not allocate a page on a
61node that is not allowed in the requesting task's mems_allowed vector.
62
63User level code may create and destroy cpusets by name in the cgroup
64virtual file system, manage the attributes and permissions of these
65cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
66specify and query to which cpuset a task is assigned, and list the
67task pids assigned to a cpuset.
68
69
701.2 Why are cpusets needed ?
71----------------------------
72
73The management of large computer systems, with many processors (CPUs),
74complex memory cache hierarchies and multiple Memory Nodes having
75non-uniform access times (NUMA) presents additional challenges for
76the efficient scheduling and memory placement of processes.
77
78Frequently more modest sized systems can be operated with adequate
79efficiency just by letting the operating system automatically share
80the available CPU and Memory resources amongst the requesting tasks.
81
82But larger systems, which benefit more from careful processor and
83memory placement to reduce memory access times and contention,
84and which typically represent a larger investment for the customer,
85can benefit from explicitly placing jobs on properly sized subsets of
86the system.
87
88This can be especially valuable on:
89
90 * Web Servers running multiple instances of the same web application,
91 * Servers running different applications (for instance, a web server
92 and a database), or
93 * NUMA systems running large HPC applications with demanding
94 performance characteristics.
95
96These subsets, or "soft partitions", must be able to be dynamically
97adjusted, as the job mix changes, without impacting other concurrently
98executing jobs. The location of the running jobs' pages may also be moved
99when the memory locations are changed.
100
101The kernel cpuset patch provides the minimum essential kernel
102mechanisms required to efficiently implement such subsets. It
103leverages existing CPU and Memory Placement facilities in the Linux
104kernel to avoid any additional impact on the critical scheduler or
105memory allocator code.
106
107
1081.3 How are cpusets implemented ?
109---------------------------------
110
111Cpusets provide a Linux kernel mechanism to constrain which CPUs and
112Memory Nodes are used by a process or set of processes.
113
114The Linux kernel already has a pair of mechanisms to specify on which
115CPUs a task may be scheduled (sched_setaffinity) and on which Memory
116Nodes it may obtain memory (mbind, set_mempolicy).
117
118Cpusets extend these two mechanisms as follows:
119
120 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
121 kernel.
122 - Each task in the system is attached to a cpuset, via a pointer
123 in the task structure to a reference counted cgroup structure.
124 - Calls to sched_setaffinity are filtered to just those CPUs
125 allowed in that task's cpuset.
126 - Calls to mbind and set_mempolicy are filtered to just
127 those Memory Nodes allowed in that task's cpuset.
128 - The root cpuset contains all the system's CPUs and Memory
129 Nodes.
130 - For any cpuset, one can define child cpusets containing a subset
131 of the parent's CPU and Memory Node resources.
132 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
133 browsing and manipulation from user space.
134 - A cpuset may be marked exclusive, which ensures that no other
135 cpuset (except direct ancestors and descendants) may contain
136 any overlapping CPUs or Memory Nodes.
137 - You can list all the tasks (by pid) attached to any cpuset.
138
139The implementation of cpusets requires a few, simple hooks
140into the rest of the kernel, none in performance critical paths:
141
142 - in init/main.c, to initialize the root cpuset at system boot.
143 - in fork and exit, to attach and detach a task from its cpuset.
144 - in sched_setaffinity, to mask the requested CPUs by what's
145 allowed in that task's cpuset.
146 - in sched.c migrate_live_tasks(), to keep migrating tasks within
147 the CPUs allowed by their cpuset, if possible.
148 - in the mbind and set_mempolicy system calls, to mask the requested
149 Memory Nodes by what's allowed in that task's cpuset.
150 - in page_alloc.c, to restrict memory to allowed nodes.
151 - in vmscan.c, to restrict page recovery to the current cpuset.
152
153You should mount the "cgroup" filesystem type in order to enable
154browsing and modifying the cpusets presently known to the kernel. No
155new system calls are added for cpusets - all support for querying and
156modifying cpusets is via this cpuset file system.
157
158The /proc/<pid>/status file for each task has four added lines,
159displaying the task's cpus_allowed (on which CPUs it may be scheduled)
160and mems_allowed (on which Memory Nodes it may obtain memory),
161in the two formats seen in the following example::
162
163 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
164 Cpus_allowed_list: 0-127
165 Mems_allowed: ffffffff,ffffffff
166 Mems_allowed_list: 0-63
167
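The two formats carry the same information: in the example above the four 32-bit mask words of Cpus_allowed correspond to the 128 CPUs of the list 0-127. As a small sketch, the function below counts the entries in a list-format value. It assumes the list contains only plain ``N`` and ``N-M`` entries separated by commas, and the name ``cpulist_count`` is made up for this example.

```shell
# cpulist_count LIST
# Hypothetical helper: counts the CPUs (or Memory Nodes) named by a
# list-format string such as "0-3,8,10-11". The body runs in a subshell
# so the IFS and globbing changes do not leak into the caller.
cpulist_count() (
    IFS=,
    set -f
    count=0
    for range in $1; do
        case $range in
            *-*) count=$((count + ${range#*-} - ${range%-*} + 1)) ;;
            *)   count=$((count + 1)) ;;
        esac
    done
    echo "$count"
)
```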
168Each cpuset is represented by a directory in the cgroup file system
169containing (on top of the standard cgroup files) the following
170files describing that cpuset:
171
172 - cpuset.cpus: list of CPUs in that cpuset
173 - cpuset.mems: list of Memory Nodes in that cpuset
174 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
175 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
176 - cpuset.mem_exclusive flag: is memory placement exclusive?
177 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
178 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
179 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
180 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
181 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
182 - cpuset.sched_relax_domain_level: the searching range when migrating tasks
183
184In addition, only the root cpuset has the following file:
185
186 - cpuset.memory_pressure_enabled flag: compute memory_pressure?
187
188New cpusets are created using the mkdir system call or shell
189command. The properties of a cpuset, such as its flags, allowed
190CPUs and Memory Nodes, and attached tasks, are modified by writing
191to the appropriate file in that cpuset's directory, as listed above.
192
193The named hierarchical structure of nested cpusets allows partitioning
194a large system into nested, dynamically changeable, "soft-partitions".
195
196The attachment of each task, automatically inherited at fork by any
197children of that task, to a cpuset allows organizing the work load
198on a system into related sets of tasks such that each set is constrained
199to using the CPUs and Memory Nodes of a particular cpuset. A task
200may be re-attached to any other cpuset, if allowed by the permissions
201on the necessary cpuset file system directories.
202
203Such management of a system "in the large" integrates smoothly with
204the detailed placement done on individual tasks and memory regions
205using the sched_setaffinity, mbind and set_mempolicy system calls.
206
207The following rules apply to each cpuset:
208
209 - Its CPUs and Memory Nodes must be a subset of its parent's.
210 - It can't be marked exclusive unless its parent is.
211 - If its cpu or memory is exclusive, they may not overlap any sibling.
212
213These rules, and the natural hierarchy of cpusets, enable efficient
214enforcement of the exclusive guarantee, without having to scan all
215cpusets every time any of them change to ensure nothing overlaps an
216exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
217to represent the cpuset hierarchy provides for a familiar permission
218and name space for cpusets, with a minimum of additional kernel code.
219
220The cpus and mems files in the root (top_cpuset) cpuset are
221read-only. The cpus file automatically tracks the value of
222cpu_online_mask using a CPU hotplug notifier, and the mems file
223automatically tracks the value of node_states[N_MEMORY]--i.e.,
224nodes with memory--using the cpuset_track_online_nodes() hook.
225
226
2271.4 What are exclusive cpusets ?
228--------------------------------
229
230If a cpuset is cpu or mem exclusive, no other cpuset, other than
231a direct ancestor or descendant, may share any of the same CPUs or
232Memory Nodes.
233
234A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
235i.e. it restricts kernel allocations for page, buffer and other data
236commonly shared by the kernel across multiple users. All cpusets,
237whether hardwalled or not, restrict allocations of memory for user
238space. This enables configuring a system so that several independent
239jobs can share common kernel data, such as file system pages, while
240isolating each job's user allocation in its own cpuset. To do this,
241construct a large mem_exclusive cpuset to hold all the jobs, and
242construct child, non-mem_exclusive cpusets for each individual job.
243Only a small amount of typical kernel memory, such as requests from
244interrupt handlers, is allowed to be taken outside even a
245mem_exclusive cpuset.
246
247
2481.5 What is memory_pressure ?
249-----------------------------
250The memory_pressure of a cpuset provides a simple per-cpuset metric
251of the rate at which the tasks in a cpuset are attempting to free up
252in-use memory on the nodes of the cpuset to satisfy additional memory
253requests.
254
255This enables batch managers monitoring jobs running in dedicated
256cpusets to efficiently detect what level of memory pressure that job
257is causing.
258
259This is useful both on tightly managed systems running a wide mix of
260submitted jobs, which may choose to terminate or re-prioritize jobs that
261are trying to use more memory than allowed on the nodes assigned to them,
262and with tightly coupled, long running, massively parallel scientific
263computing jobs that will dramatically fail to meet required performance
264goals if they start to use more memory than allowed to them.
265
266This mechanism provides a very economical way for the batch manager
267to monitor a cpuset for signs of memory pressure. It's up to the
268batch manager or other user code to decide what to do about it and
269take action.
270
271==>
272 Unless this feature is enabled by writing "1" to the special file
273 /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
274 code of __alloc_pages() for this metric reduces to simply noticing
275 that the cpuset_memory_pressure_enabled flag is zero. So only
276 systems that enable this feature will compute the metric.
277
278Why a per-cpuset, running average:
279
280 Because this meter is per-cpuset, rather than per-task or mm,
281 the system load imposed by a batch scheduler monitoring this
282 metric is sharply reduced on large systems, because a scan of
283 the tasklist can be avoided on each set of queries.
284
285 Because this meter is a running average, instead of an accumulating
286 counter, a batch scheduler can detect memory pressure with a
287 single read, instead of having to read and accumulate results
288 for a period of time.
289
290 Because this meter is per-cpuset rather than per-task or mm,
291 the batch scheduler can obtain the key information, memory
292 pressure in a cpuset, with a single read, rather than having to
293 query and accumulate results over all the (dynamically changing)
294 set of tasks in the cpuset.
295
296A per-cpuset simple digital filter (requires a spinlock and 3 words
297of data per-cpuset) is kept, and updated by any task attached to that
298cpuset, if it enters the synchronous (direct) page reclaim code.
299
300A per-cpuset file provides an integer number representing the recent
301(half-life of 10 seconds) rate of direct page reclaims caused by
302the tasks in the cpuset, in units of reclaims attempted per second,
303times 1000.
304
305
3061.6 What is memory spread ?
307---------------------------
308There are two boolean flag files per cpuset that control where the
309kernel allocates pages for the file system buffers and related
310in-kernel data structures. They are called 'cpuset.memory_spread_page' and
311'cpuset.memory_spread_slab'.
312
313If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
314the kernel will spread the file system buffers (page cache) evenly
315over all the nodes that the faulting task is allowed to use, instead
316of preferring to put those pages on the node where the task is running.
317
318If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
319then the kernel will spread some file system related slab caches,
320such as for inodes and dentries, evenly over all the nodes that the
321faulting task is allowed to use, instead of preferring to put those
322pages on the node where the task is running.
323
324The setting of these flags does not affect anonymous data segment or
325stack segment pages of a task.
326
327By default, both kinds of memory spreading are off, and memory
328pages are allocated on the node local to where the task is running,
329except perhaps as modified by the task's NUMA mempolicy or cpuset
330configuration, so long as sufficient free memory pages are available.
331
332When new cpusets are created, they inherit the memory spread settings
333of their parent.
334
335Setting memory spreading causes allocations for the affected page
336or slab caches to ignore the task's NUMA mempolicy and be spread
337instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
338mempolicies will not notice any change in these calls as a result of
339their containing task's memory spread settings. If memory spreading
340is turned off, then the currently specified NUMA mempolicy once again
341applies to memory page allocations.
342
343Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
344files. By default they contain "0", meaning that the feature is off
345for that cpuset. If a "1" is written to that file, then that turns
346the named feature on.
347
348The implementation is simple.
349
350Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
351PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
352joins that cpuset. The page allocation calls for the page cache
353are modified to perform an inline check for this PFA_SPREAD_PAGE task
354flag, and if set, a call to a new routine cpuset_mem_spread_node()
355returns the node to prefer for the allocation.
356
357Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
358PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
359pages from the node returned by cpuset_mem_spread_node().
360
361The cpuset_mem_spread_node() routine is also simple. It uses the
362value of a per-task rotor cpuset_mem_spread_rotor to select the next
363node in the current task's mems_allowed to prefer for the allocation.
364
365This memory placement policy is also known (in other contexts) as
366round-robin or interleave.
367
368This policy can provide substantial improvements for jobs that need
369to place thread local data on the corresponding node, but that need
370to access large file system data sets that need to be spread across
371the several nodes in the job's cpuset in order to fit. Without this
372policy, especially for jobs that might have one thread reading in the
373data set, the memory allocation across the nodes in the job's cpuset
374can become very uneven.
375
3761.7 What is sched_load_balance ?
377--------------------------------
378
379The kernel scheduler (kernel/sched/core.c) automatically load balances
380tasks. If one CPU is underutilized, kernel code running on that
381CPU will look for tasks on other more overloaded CPUs and move those
382tasks to itself, within the constraints of such placement mechanisms
383as cpusets and sched_setaffinity.
384
385The algorithmic cost of load balancing and its impact on key shared
386kernel data structures such as the task list increases more than
387linearly with the number of CPUs being balanced. So the scheduler
388has support to partition the system's CPUs into a number of sched
389domains such that it only load balances within each sched domain.
390Each sched domain covers some subset of the CPUs in the system;
391no two sched domains overlap; some CPUs might not be in any sched
392domain and hence won't be load balanced.
393
394Put simply, it costs less to balance between two smaller sched domains
395than one big one, but doing so means that overloads in one of the
396two domains won't be load balanced to the other one.
397
398By default, there is one sched domain covering all CPUs, including those
399marked isolated using the kernel boot time "isolcpus=" argument. However,
400the isolated CPUs will not participate in load balancing, and will not
401have tasks running on them unless explicitly assigned.
402
403This default load balancing across all CPUs is not well suited for
404the following two situations:
405
406 1) On large systems, load balancing across many CPUs is expensive.
407 If the system is managed using cpusets to place independent jobs
408 on separate sets of CPUs, full load balancing is unnecessary.
409 2) Systems supporting realtime on some CPUs need to minimize
410 system overhead on those CPUs, including avoiding task load
411 balancing if that is not needed.
412
413When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
414setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
415be contained in a single sched domain, ensuring that load balancing
416can move a task (not otherwise pinned, as by sched_setaffinity)
417from any CPU in that cpuset to any other.
418
419When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
420scheduler will avoid load balancing across the CPUs in that cpuset,
421--except-- in so far as is necessary because some overlapping cpuset
422has "sched_load_balance" enabled.
423
424So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
425enabled, then the scheduler will have one sched domain covering all
426CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
427cpusets won't matter, as we're already fully load balancing.
428
429Therefore in the above two situations, the top cpuset flag
430"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
431child cpusets have this flag enabled.
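
The arrangement described above can be sketched as a small shell helper. This
is only an illustration under stated assumptions: the function name
partition_load_balance and its arguments are hypothetical, the cpuset
hierarchy is assumed to be mounted already, and the child cpusets are assumed
to exist.

```shell
# Sketch (hypothetical helper): confine load balancing to child cpusets.
# usage: partition_load_balance ROOT CHILD...
#   ROOT  - cpuset mount point, e.g. /sys/fs/cgroup/cpuset
#   CHILD - existing cpuset directories directly below ROOT
partition_load_balance() {
    plb_root=$1
    shift
    # Request load balancing inside each child cpuset.
    for plb_child in "$@"; do
        /bin/echo 1 > "$plb_root/$plb_child/cpuset.sched_load_balance" || return 1
    done
    # Stop requesting a single sched domain covering all CPUs.
    /bin/echo 0 > "$plb_root/cpuset.sched_load_balance" || return 1
}
```

A call such as ``partition_load_balance /sys/fs/cgroup/cpuset rt batch`` (with
hypothetical child names) would leave balancing enabled only inside the two
children.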
432
433When doing this, you don't usually want to leave any unpinned tasks in
434the top cpuset that might use non-trivial amounts of CPU, as such tasks
435may be artificially constrained to some subset of CPUs, depending on
436the particulars of this flag setting in descendant cpusets. Even if
437such a task could use spare CPU cycles in some other CPUs, the kernel
438scheduler might not consider the possibility of load balancing that
439task to that underused CPU.
440
441Of course, tasks pinned to a particular CPU can be left in a cpuset
442that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
443else anyway.
444
445There is an impedance mismatch here, between cpusets and sched domains.
446Cpusets are hierarchical and nest. Sched domains are flat; they don't
447overlap and each CPU is in at most one sched domain.
448
449It is necessary for sched domains to be flat because load balancing
450across partially overlapping sets of CPUs would risk unstable dynamics
451that would be beyond our understanding. So if each of two partially
452overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
453form a single sched domain that is a superset of both. We won't move
454a task to a CPU outside its cpuset, but the scheduler load balancing
455code might waste some compute cycles considering that possibility.
456
457This mismatch is why there is not a simple one-to-one relation
458between which cpusets have the flag "cpuset.sched_load_balance" enabled,
459and the sched domain configuration. If a cpuset enables the flag, it
460will get balancing across all its CPUs, but if it disables the flag,
461it will only be assured of no load balancing if no other overlapping
462cpuset enables the flag.
463
464If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
465one of them has this flag enabled, then the other may find its
466tasks only partially load balanced, just on the overlapping CPUs.
467This is just the general case of the top_cpuset example given a few
468paragraphs above. In the general case, as in the top cpuset case,
469don't leave tasks that might use non-trivial amounts of CPU in
470such partially load balanced cpusets, as they may be artificially
471constrained to some subset of the CPUs allowed to them, for lack of
472load balancing to the other CPUs.
473
474CPUs in "cpuset.isolcpus" were excluded from load balancing by the
475isolcpus= kernel boot option, and will never be load balanced regardless
476of the value of "cpuset.sched_load_balance" in any cpuset.
477
4781.7.1 sched_load_balance implementation details.
479------------------------------------------------
480
481The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
482to most cpuset flags.) When enabled for a cpuset, the kernel will
483ensure that it can load balance across all the CPUs in that cpuset
484(makes sure that all the CPUs in the cpus_allowed of that cpuset are
485in the same sched domain.)
486
487If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
488then they will be (must be) both in the same sched domain.
489
490If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
491then by the above that means there is a single sched domain covering
492the whole system, regardless of any other cpuset settings.
493
494The kernel commits to user space that it will avoid load balancing
495where it can. It will pick as fine a granularity partition of sched
496domains as it can while still providing load balancing for any set
497of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
498
499The internal kernel cpuset to scheduler interface passes from the
500cpuset code to the scheduler code a partition of the load balanced
501CPUs in the system. This partition is a set of subsets (represented
502as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
503all the CPUs that must be load balanced.
504
505The cpuset code builds a new such partition and passes it to the
506scheduler sched domain setup code, to have the sched domains rebuilt
507as necessary, whenever:
508
509 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
510 - or CPUs come or go from a cpuset with this flag enabled,
511 - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
512 and with this flag enabled changes,
513 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
514 - or a cpu is offlined/onlined.
515
516This partition exactly defines what sched domains the scheduler should
517set up - one sched domain for each element (struct cpumask) in the
518partition.
519
520The scheduler remembers the currently active sched domain partitions.
521When the scheduler routine partition_sched_domains() is invoked from
522the cpuset code to update these sched domains, it compares the new
523partition requested with the current, and updates its sched domains,
524removing the old and adding the new, for each change.
525
526
5271.8 What is sched_relax_domain_level ?
528--------------------------------------
529
530Within a sched domain, the scheduler migrates tasks in two ways: periodic
531load balancing on tick, and migration at the time of certain scheduling events.
532
533When a task is woken up, the scheduler tries to move it to an idle CPU.
534For example, if a task A running on CPU X activates another task B
535on the same CPU X, and if CPU Y is X's sibling and is idle,
536then the scheduler migrates task B to CPU Y so that task B can start on
537CPU Y without waiting for task A on CPU X.
538
539And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
540extra tasks from other busy CPUs to help them before it goes
541idle.
542
543Of course, finding movable tasks and/or idle CPUs has a search cost, so
544the scheduler might not search all CPUs in the domain
545every time. In fact, on some architectures, the search range on
546events is limited to the socket or node where the CPU is located,
547while the load balance on tick searches all.
548
549For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
550is idle while CPU X and its siblings are busy, the scheduler can't migrate
551woken task B from X to Z since Z is outside its search range.
552As a result, task B on CPU X needs to wait for task A or for load balancing
553on the next tick. For some applications in special situations, waiting
5541 tick may be too long.
555
556The 'cpuset.sched_relax_domain_level' file allows you to request a change
557to this search range. This file takes an integer value which
558indicates the size of the search range in levels, ideally as follows;
559the initial value, -1, indicates that the cpuset has no request.
560
561====== ===========================================================
562 -1 no request. use system default or follow request of others.
563 0 no search.
564 1 search siblings (hyperthreads in a core).
565 2 search cores in a package.
566 3 search cpus in a node [= system wide on non-NUMA system]
567 4 search nodes in a chunk of node [on NUMA system]
568 5 search system wide [on NUMA system]
569====== ===========================================================
570
571The system default is architecture dependent. The system default
572can be changed using the relax_domain_level= boot parameter.
573
574This file is per-cpuset and affects the sched domain that the cpuset
575belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
576is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
577there is no sched domain belonging to the cpuset.
578
579If multiple cpusets are overlapping and hence they form a single sched
580domain, the largest value among those is used. Be careful, if one
581requests 0 and others are -1 then 0 is used.
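
The "largest value wins" rule can be illustrated with a short shell function.
This is only a sketch of the rule as stated above, not the kernel's code; the
function name effective_relax_level is hypothetical.

```shell
# Sketch (hypothetical helper): the level used by a sched domain formed
# from several overlapping cpusets is the largest requested value.
# usage: effective_relax_level LEVEL...
effective_relax_level() {
    erl_max=-1                      # -1 means "no request"
    for erl_level in "$@"; do
        if [ "$erl_level" -gt "$erl_max" ]; then
            erl_max=$erl_level
        fi
    done
    printf '%s\n' "$erl_max"
}
```

For instance, ``effective_relax_level -1 0 -1`` prints 0, which is the "be
careful" case: a single request for 0 (no search) overrides the others'
lack of a request.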
582
583Note that modifying this file will have both good and bad effects,
584and whether it is acceptable or not depends on your situation.
585Don't modify this file if you are not sure.
586
587If your situation is:
588
589 - The migration costs between CPUs can be assumed to be considerably
590 small (for you) due to your special application's behavior or
591 special hardware support for CPU cache etc.
592 - The searching cost has no impact (for you), or you can make
593 the searching cost small enough by keeping cpusets compact etc.
594 - Low latency is required even if it sacrifices cache hit rate etc.
595 then increasing 'sched_relax_domain_level' would benefit you.
596
597
5981.9 How do I use cpusets ?
599--------------------------
600
601In order to minimize the impact of cpusets on critical kernel
602code, such as the scheduler, and due to the fact that the kernel
603does not support one task updating the memory placement of another
604task directly, the impact on a task of changing its cpuset CPU
605or Memory Node placement, or of changing to which cpuset a task
606is attached, is subtle.
607
608If a cpuset has its Memory Nodes modified, then for each task attached
609to that cpuset, the next time that the kernel attempts to allocate
610a page of memory for that task, the kernel will notice the change
611in the task's cpuset, and update its per-task memory placement to
612remain within the new cpuset's memory placement. If the task was using
613mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
614its new cpuset, then the task will continue to use whatever subset
615of MPOL_BIND nodes are still allowed in the new cpuset. If the task
616was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
617in the new cpuset, then the task will be essentially treated as if it
618was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
619as queried by get_mempolicy(), doesn't change). If a task is moved
620from one cpuset to another, then the kernel will adjust the task's
621memory placement, as above, the next time that the kernel attempts
622to allocate a page of memory for that task.
623
624If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
625will have its allowed CPU placement changed immediately. Similarly,
626if a task's pid is written to another cpuset's 'tasks' file, then its
627allowed CPU placement is changed immediately. If such a task had been
628bound to some subset of its cpuset using the sched_setaffinity() call,
629the task will be allowed to run on any CPU allowed in its new cpuset,
630negating the effect of the prior sched_setaffinity() call.
631
632In summary, the memory placement of a task whose cpuset is changed is
633updated by the kernel, on the next allocation of a page for that task,
634and the processor placement is updated immediately.
635
636Normally, once a page is allocated (given a physical page
637of main memory) then that page stays on whatever node it
638was allocated, so long as it remains allocated, even if the
639cpuset's memory placement policy 'cpuset.mems' subsequently changes.
640If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
641tasks are attached to that cpuset, any pages that task had
642allocated to it on nodes in its previous cpuset are migrated
643to the task's new cpuset. The relative placement of the page within
644the cpuset is preserved during these migration operations if possible.
645For example if the page was on the second valid node of the prior cpuset
646then the page will be placed on the second valid node of the new cpuset.
647
648Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
649'cpuset.mems' file is modified, pages allocated to tasks in that
650cpuset, that were on nodes in the previous setting of 'cpuset.mems',
651will be moved to nodes in the new setting of 'cpuset.mems'.
652Pages that were not in the task's prior cpuset, or in the cpuset's
653prior 'cpuset.mems' setting, will not be moved.
654
655There is an exception to the above. If hotplug functionality is used
656to remove all the CPUs that are currently assigned to a cpuset,
657then all the tasks in that cpuset will be moved to the nearest ancestor
658with non-empty cpus. But the moving of some (or all) tasks might fail if
659cpuset is bound with another cgroup subsystem which has some restrictions
660on task attaching. In this failing case, those tasks will stay
661in the original cpuset, and the kernel will automatically update
662their cpus_allowed to allow all online CPUs. When memory hotplug
663functionality for removing Memory Nodes is available, a similar exception
664is expected to apply there as well. In general, the kernel prefers to
665violate cpuset placement, over starving a task that has had all
666its allowed CPUs or Memory Nodes taken offline.
667
668There is a second exception to the above. GFP_ATOMIC requests are
669kernel internal allocations that must be satisfied, immediately.
670The kernel may drop some request, in rare cases even panic, if a
671GFP_ATOMIC alloc fails. If the request cannot be satisfied within
672the current task's cpuset, then we relax the cpuset, and look for
673memory anywhere we can find it. It's better to violate the cpuset
674than stress the kernel.
675
676To start a new job that is to be contained within a cpuset, the steps are:
677
678 1) mkdir /sys/fs/cgroup/cpuset
679 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
680 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
681 the /sys/fs/cgroup/cpuset virtual file system.
682 4) Start a task that will be the "founding father" of the new job.
683 5) Attach that task to the new cpuset by writing its pid to the
684 /sys/fs/cgroup/cpuset tasks file for that cpuset.
685 6) fork, exec or clone the job tasks from this founding father task.
686
687For example, the following sequence of commands will set up a cpuset
688named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
689and then start a subshell 'sh' in that cpuset::
690
691 mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
692 cd /sys/fs/cgroup/cpuset
693 mkdir Charlie
694 cd Charlie
695 /bin/echo 2-3 > cpuset.cpus
696 /bin/echo 1 > cpuset.mems
697 /bin/echo $$ > tasks
698 sh
699 # The subshell 'sh' is now running in cpuset Charlie
700 # The next line should display '/Charlie'
701 cat /proc/self/cpuset
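
The creation steps in the example above can be wrapped in a reusable function.
This is a minimal sketch, assuming the cpuset hierarchy is already mounted;
the name make_cpuset and its argument order are hypothetical.

```shell
# Sketch (hypothetical helper): create a cpuset and set its CPUs and
# Memory Nodes.
# usage: make_cpuset ROOT NAME CPUS MEMS
#   ROOT - cpuset mount point, e.g. /sys/fs/cgroup/cpuset
#   NAME - name of the new cpuset, e.g. Charlie
#   CPUS - CPU list, e.g. 2-3
#   MEMS - Memory Node list, e.g. 1
make_cpuset() {
    mc_dir=$1/$2
    mkdir -p "$mc_dir" || return 1
    # /bin/echo is used so that a failed write() is reported.
    /bin/echo "$3" > "$mc_dir/cpuset.cpus" || return 1
    /bin/echo "$4" > "$mc_dir/cpuset.mems" || return 1
}
```

With this, the "Charlie" cpuset above would be created by
``make_cpuset /sys/fs/cgroup/cpuset Charlie 2-3 1``, after which a shell can
attach itself by writing its pid to that cpuset's tasks file.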
702
703There are ways to query or modify cpusets:
704
705 - via the cpuset file system directly, using the various cd, mkdir, echo,
706 cat, rmdir commands from the shell, or their equivalent from C.
707 - via the C library libcpuset.
708 - via the C library libcgroup.
709 (http://sourceforge.net/projects/libcg/)
710 - via the python application cset.
711 (http://code.google.com/p/cpuset/)
712
713The sched_setaffinity calls can also be done at the shell prompt using
714SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
715calls can be done at the shell prompt using the numactl command
716(part of Andi Kleen's numa package).
717
7182. Usage Examples and Syntax
719============================
720
7212.1 Basic Usage
722---------------
723
724Creating, modifying, using the cpusets can be done through the cpuset
725virtual filesystem.
726
727To mount it, type:
728# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
729
730Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
731tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
732is the cpuset that holds the whole system.
733
734If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
735
736 # cd /sys/fs/cgroup/cpuset
737 # mkdir my_cpuset
738
739Now you want to do something with this cpuset::
740
741 # cd my_cpuset
742
743In this directory you can find several files::
744
745 # ls
746 cgroup.clone_children cpuset.memory_pressure
747 cgroup.event_control cpuset.memory_spread_page
748 cgroup.procs cpuset.memory_spread_slab
749 cpuset.cpu_exclusive cpuset.mems
750 cpuset.cpus cpuset.sched_load_balance
751 cpuset.mem_exclusive cpuset.sched_relax_domain_level
752 cpuset.mem_hardwall notify_on_release
753 cpuset.memory_migrate tasks
754
755Reading them will give you information about the state of this cpuset:
756the CPUs and Memory Nodes it can use, the processes that are using
757it, its properties. By writing to these files you can manipulate
758the cpuset.
759
760Set some flags::
761
762 # /bin/echo 1 > cpuset.cpu_exclusive
763
764Add some cpus::
765
766 # /bin/echo 0-7 > cpuset.cpus
767
768Add some mems::
769
770 # /bin/echo 0-7 > cpuset.mems
771
772Now attach your shell to this cpuset::
773
774 # /bin/echo $$ > tasks
775
776You can also create cpusets inside your cpuset by using mkdir in this
777directory::
778
779 # mkdir my_sub_cs
780
781To remove a cpuset, just use rmdir::
782
783 # rmdir my_sub_cs
784
785This will fail if the cpuset is in use (has cpusets inside, or has
786processes attached).
787
788Note that for legacy reasons, the "cpuset" filesystem exists as a
789wrapper around the cgroup filesystem.
790
791The command::
792
793 mount -t cpuset X /sys/fs/cgroup/cpuset
794
795is equivalent to::
796
797 mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
798 echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
799
8002.2 Adding/removing cpus
801------------------------
802
803This is the syntax to use when writing in the cpus or mems files
804in cpuset directories::
805
806 # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
807 # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
808
809To add a CPU to a cpuset, write the new list of CPUs including the
810CPU to be added. To add 6 to the above cpuset::
811
812 # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
813
814Similarly to remove a CPU from a cpuset, write the new list of CPUs
815without the CPU to be removed.
816
817To remove all the CPUs::
818
819 # /bin/echo "" > cpuset.cpus -> clear cpus list
820
8212.3 Setting flags
822-----------------
823
824The syntax is very simple::
825
826 # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
827 # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
828
8292.4 Attaching processes
830-----------------------
831
832::
833
834 # /bin/echo PID > tasks
835
836Note that it is PID, not PIDs. You can only attach ONE task at a time.
837If you have several tasks to attach, you have to do it one after another::
838
839 # /bin/echo PID1 > tasks
840 # /bin/echo PID2 > tasks
841 ...
842 # /bin/echo PIDn > tasks
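
Attaching several tasks one write at a time can be done with a loop. The
following is a sketch only; the function name attach_pids and its output
messages are hypothetical.

```shell
# Sketch (hypothetical helper): attach PIDs to a cpuset one at a time.
# usage: attach_pids TASKS_FILE PID...
attach_pids() {
    ap_tasks=$1
    shift
    for ap_pid in "$@"; do
        # One write per PID, so each PID gets its own error code.
        if /bin/echo "$ap_pid" > "$ap_tasks"; then
            echo "attached $ap_pid"
        else
            echo "failed $ap_pid" >&2
            return 1
        fi
    done
}
```

The loop stops at the first PID that cannot be attached, so the caller can
tell exactly which write failed.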
843
844
8453. Questions
846============
847
848Q:
849 what's up with this '/bin/echo' ?
850
851A:
852 bash's builtin 'echo' command does not check calls to write() against
853 errors. If you use it in the cpuset file system, you won't be
854 able to tell whether a command succeeded or failed.
855
856Q:
857 When I attach processes, only the first PID on the line actually gets attached!
858
859A:
860 We can only return one error code per call to write(). So you should also
861 put only ONE pid.
862
8634. Contact
864==========
865
866Web: http://www.bullopensource.org/cpuset
diff --git a/Documentation/admin-guide/cgroup-v1/devices.rst b/Documentation/admin-guide/cgroup-v1/devices.rst
new file mode 100644
index 000000000000..e1886783961e
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/devices.rst
@@ -0,0 +1,132 @@
1===========================
2Device Whitelist Controller
3===========================
4
51. Description
6==============
7
8Implement a cgroup to track and enforce open and mknod restrictions
9on device files. A device cgroup associates a device access
10whitelist with each cgroup. A whitelist entry has 4 fields.
11'type' is a (all), c (char), or b (block). 'all' means it applies
12to all types and all major and minor numbers. Major and minor are
13either an integer or * for all. Access is a composition of r
14(read), w (write), and m (mknod).
15
16The root device cgroup starts with rwm to 'all'. A child device
17cgroup gets a copy of the parent. Administrators can then remove
18devices from the whitelist or add new entries. A child cgroup can
19never receive a device access which is denied by its parent.
20
212. User Interface
22=================
23
24An entry is added using devices.allow, and removed using
25devices.deny. For instance::
26
27 echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
28
29allows cgroup 1 to read and mknod the device usually known as
30/dev/null. Doing::
31
32 echo a > /sys/fs/cgroup/1/devices.deny
33
34will remove the default 'a *:* rwm' entry. Doing::
35
36 echo a > /sys/fs/cgroup/1/devices.allow
37
38will add the 'a *:* rwm' entry to the whitelist.
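
Because an entry has a fixed four-field layout, it can be checked before it is
written. The function below is a sketch under stated assumptions: the name
dev_entry is hypothetical, and it only handles 'b' and 'c' entries (the bare
'a' form shown above is written as-is).

```shell
# Sketch (hypothetical helper): build a whitelist entry "TYPE MAJOR:MINOR ACCESS"
# after checking each field.
# usage: dev_entry TYPE MAJOR MINOR ACCESS
dev_entry() {
    case $1 in
        b|c) ;;
        *) echo "bad type: $1" >&2; return 1 ;;
    esac
    # Major and minor are either an integer or * for all.
    for de_num in "$2" "$3"; do
        case $de_num in
            '*') ;;
            ''|*[!0-9]*) echo "bad device number: $de_num" >&2; return 1 ;;
        esac
    done
    # Access is a composition of r (read), w (write) and m (mknod).
    case $4 in
        ''|*[!rwm]*) echo "bad access: $4" >&2; return 1 ;;
    esac
    printf '%s %s:%s %s\n' "$1" "$2" "$3" "$4"
}
```

The output can then be redirected to a cgroup's devices.allow or devices.deny
file, for example ``dev_entry c 1 3 mr > /sys/fs/cgroup/1/devices.allow``.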
39
403. Security
41===========
42
43Any task can move itself between cgroups. This clearly won't
44suffice, but we can decide the best way to adequately restrict
45movement as people get some experience with this. We may just want
46to require CAP_SYS_ADMIN, which at least is a separate bit from
47CAP_MKNOD. We may want to just refuse moving to a cgroup which
48isn't a descendant of the current one. Or we may want to use
49CAP_MAC_ADMIN, since we really are trying to lock down root.
50
51CAP_SYS_ADMIN is needed to modify the whitelist or move another
52task to a new cgroup. (Again we'll probably want to change that).
53
54A cgroup may not be granted more permissions than the cgroup's
55parent has.
56
574. Hierarchy
58============
59
60device cgroups maintain hierarchy by making sure a cgroup never has more
61access permissions than its parent. Every time an entry is written to
62a cgroup's devices.deny file, all its children will have that entry removed
63from their whitelist and all the locally set whitelist entries will be
64re-evaluated. In case one of the locally set whitelist entries would provide
65more access than the cgroup's parent, it'll be removed from the whitelist.
66
67Example::
68
69 A
70 / \
71 B
72
73 group behavior exceptions
74 A allow "b 8:* rwm", "c 116:1 rw"
75 B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
76
77If a device is denied in group A::
78
79 # echo "c 116:* r" > A/devices.deny
80
81it'll propagate down and after revalidating B's entries, the whitelist entry
82"c 116:2 rwm" will be removed::
83
84 group whitelist entries denied devices
85 A all "b 8:* rwm", "c 116:* rw"
86 B "c 1:3 rwm", "b 3:* rwm" all the rest
87
88In case parent's exceptions change and local exceptions are not allowed
89anymore, they'll be deleted.
90
91Notice that new whitelist entries will not be propagated::
92
93 A
94 / \
95 B
96
97 group whitelist entries denied devices
98 A "c 1:3 rwm", "c 1:5 r" all the rest
99 B "c 1:3 rwm", "c 1:5 r" all the rest
100
101when adding ``c *:3 rwm``::
102
103 # echo "c *:3 rwm" >A/devices.allow
104
105the result::
106
107 group whitelist entries denied devices
108 A "c *:3 rwm", "c 1:5 r" all the rest
109 B "c 1:3 rwm", "c 1:5 r" all the rest
110
111but now it'll be possible to add new entries to B::
112
113 # echo "c 2:3 rwm" >B/devices.allow
114 # echo "c 50:3 r" >B/devices.allow
115
116or even::
117
118 # echo "c *:3 rwm" >B/devices.allow
119
120Allowing or denying all by writing 'a' to devices.allow or devices.deny will
121not be possible once the device cgroup has children.
122
1234.1 Hierarchy (internal implementation)
124---------------------------------------
125
126device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
127list of exceptions. The internal state is controlled using the same user
128interface to preserve compatibility with the previous whitelist-only
129implementation. Removal or addition of exceptions that will reduce the access
130to devices will be propagated down the hierarchy.
131For every propagated exception, the effective rules will be re-evaluated based
132on current parent's access rules.
diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
new file mode 100644
index 000000000000..582d3427de3f
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
@@ -0,0 +1,127 @@
1==============
2Cgroup Freezer
3==============
4
5The cgroup freezer is useful to batch job management systems which start
6and stop sets of tasks in order to schedule the resources of a machine
7according to the desires of a system administrator. This sort of program
8is often used on HPC clusters to schedule access to the cluster as a
9whole. The cgroup freezer uses cgroups to describe the set of tasks to
10be started/stopped by the batch job management system. It also provides
11a means to start and stop the tasks composing the job.
12
13The cgroup freezer will also be useful for checkpointing running groups
14of tasks. The freezer allows the checkpoint code to obtain a consistent
15image of the tasks by attempting to force the tasks in a cgroup into a
16quiescent state. Once the tasks are quiescent another task can
17walk /proc or invoke a kernel interface to gather information about the
18quiesced tasks. Checkpointed tasks can be restarted later should a
19recoverable error occur. This also allows the checkpointed tasks to be
20migrated between nodes in a cluster by copying the gathered information
21to another node and restarting the tasks there.
22
23Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
24and resuming tasks in userspace. Both of these signals are observable
25from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
26blocked, or ignored it can be seen by waiting or ptracing parent tasks.
27SIGCONT is especially unsuitable since it can be caught by the task. Any
28programs designed to watch for SIGSTOP and SIGCONT could be broken by
29attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
30demonstrate this problem using nested bash shells::
31
32 $ echo $$
33 16644
34 $ bash
35 $ echo $$
36 16690
37
38 From a second, unrelated bash shell:
39 $ kill -SIGSTOP 16690
40 $ kill -SIGCONT 16690
41
42 <at this point 16690 exits and causes 16644 to exit too>
43
44This happens because bash can observe both signals and choose how it
45responds to them.
46
47Another example of a program which catches and responds to these
48signals is gdb. In fact any program designed to use ptrace is likely to
49have a problem with this method of stopping and resuming tasks.
50
51In contrast, the cgroup freezer uses the kernel freezer code to
52prevent the freeze/unfreeze cycle from becoming visible to the tasks
53being frozen. This allows the bash example above and gdb to run as
54expected.
55
56The cgroup freezer is hierarchical. Freezing a cgroup freezes all
57tasks belonging to the cgroup and all its descendant cgroups. Each
58cgroup has its own state (self-state) and the state inherited from the
59parent (parent-state). Iff both states are THAWED, the cgroup is
60THAWED.
61
62The following cgroupfs files are created by cgroup freezer.
63
64* freezer.state: Read-write.
65
66 When read, returns the effective state of the cgroup - "THAWED",
67 "FREEZING" or "FROZEN". This is the combined self and parent-states.
68 If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
69
70 FREEZING cgroup transitions into FROZEN state when all tasks
71 belonging to the cgroup and its descendants become frozen. Note that
72 a cgroup reverts to FREEZING from FROZEN after a new task is added
73 to the cgroup or one of its descendant cgroups until the new task is
74 frozen.
75
76 When written, sets the self-state of the cgroup. Two values are
77 allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
78 if not already freezing, enters FREEZING state along with all its
79 descendant cgroups.
80
81 If THAWED is written, the self-state of the cgroup is changed to
82 THAWED. Note that the effective state may not change to THAWED if
83 the parent-state is still freezing. If a cgroup's effective state
84 becomes THAWED, all its descendants which are freezing because of
85 the cgroup also leave the freezing state.
86
87* freezer.self_freezing: Read only.
88
89 Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
90 This value is 1 iff the last write to freezer.state was "FROZEN".
91
92* freezer.parent_freezing: Read only.
93
94 Shows the parent-state. 0 if none of the cgroup's ancestors is
95 frozen; otherwise, 1.
96
97The root cgroup is non-freezable and the above interface files don't
98exist.
99
100* Examples of usage::
101
102 # mkdir /sys/fs/cgroup/freezer
103 # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
104 # mkdir /sys/fs/cgroup/freezer/0
105 # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
106
107to get status of the freezer subsystem::
108
109 # cat /sys/fs/cgroup/freezer/0/freezer.state
110 THAWED
111
112to freeze all tasks in the container::
113
114 # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
115 # cat /sys/fs/cgroup/freezer/0/freezer.state
116 FREEZING
117 # cat /sys/fs/cgroup/freezer/0/freezer.state
118 FROZEN
119
120to unfreeze all tasks in the container::
121
122 # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
123 # cat /sys/fs/cgroup/freezer/0/freezer.state
124 THAWED
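
Since a cgroup may report FREEZING for a while before it reaches FROZEN, a
script typically has to poll freezer.state. The following is a minimal sketch;
the function name freeze_cgroup, the default of 10 tries and the one-second
polling interval are assumptions made for illustration.

```shell
# Sketch (hypothetical helper): request a freeze and wait until the
# effective state is FROZEN.
# usage: freeze_cgroup DIR [TRIES]
#   DIR   - freezer cgroup directory, e.g. /sys/fs/cgroup/freezer/0
#   TRIES - number of polls before giving up (default 10)
freeze_cgroup() {
    fc_state=$1/freezer.state
    fc_tries=${2:-10}
    /bin/echo FROZEN > "$fc_state" || return 1
    while [ "$fc_tries" -gt 0 ]; do
        # Reading returns THAWED, FREEZING or FROZEN.
        if [ "$(cat "$fc_state")" = "FROZEN" ]; then
            return 0
        fi
        fc_tries=$((fc_tries - 1))
        sleep 1
    done
    return 1
}
```

A non-zero return means the cgroup did not reach FROZEN within the allowed
number of polls, or the state file could not be written.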
125
126This is the basic mechanism which should do the right thing for user space tasks
127in a simple scenario.
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
new file mode 100644
index 000000000000..a3902aa253a9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -0,0 +1,50 @@
1==================
2HugeTLB Controller
3==================
4
5The HugeTLB controller allows limiting the HugeTLB usage per control group and
6enforces the controller limit during page fault. Since HugeTLB doesn't
7support page reclaim, enforcing the limit at page fault time implies that
8the application will get a SIGBUS signal if it tries to access HugeTLB pages
9beyond its limit. This requires the application to know beforehand how many
10HugeTLB pages it would require for its use.
11
12HugeTLB controller can be created by first mounting the cgroup filesystem.
13
14# mount -t cgroup -o hugetlb none /sys/fs/cgroup
15
16With the above step, the initial or the parent HugeTLB group becomes
17visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
18the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
19
20New groups can be created under the parent group /sys/fs/cgroup::
21
22 # cd /sys/fs/cgroup
23 # mkdir g1
24 # echo $$ > g1/tasks
25
26The above steps create a new group g1 and move the current shell
27process (bash) into it.
28
29Brief summary of control files::
30
31 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
32 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
33 hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
34 hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB limit
35
36For a system supporting three hugepage sizes (64k, 32M and 1G), the control
37files include::
38
39 hugetlb.1GB.limit_in_bytes
40 hugetlb.1GB.max_usage_in_bytes
41 hugetlb.1GB.usage_in_bytes
42 hugetlb.1GB.failcnt
43 hugetlb.64KB.limit_in_bytes
44 hugetlb.64KB.max_usage_in_bytes
45 hugetlb.64KB.usage_in_bytes
46 hugetlb.64KB.failcnt
47 hugetlb.32MB.limit_in_bytes
48 hugetlb.32MB.max_usage_in_bytes
49 hugetlb.32MB.usage_in_bytes
50 hugetlb.32MB.failcnt
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
new file mode 100644
index 000000000000..10bf48bae0b0
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -0,0 +1,28 @@
1========================
2Control Groups version 1
3========================
4
5.. toctree::
6 :maxdepth: 1
7
8 cgroups
9
10 blkio-controller
11 cpuacct
12 cpusets
13 devices
14 freezer-subsystem
15 hugetlb
16 memcg_test
17 memory
18 net_cls
19 net_prio
20 pids
21 rdma
22
23.. only:: subproject and html
24
25 Indices
26 =======
27
28 * :ref:`genindex`
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
new file mode 100644
index 000000000000..3f7115e07b5d
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -0,0 +1,355 @@
1=====================================================
2Memory Resource Controller(Memcg) Implementation Memo
3=====================================================
4
5Last Updated: 2010/2
6
7Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
8
9Because the VM is getting complex (one of the reasons is memcg...), memcg's
10behavior is complex. This document describes memcg's internal behavior.
11Please note that implementation details can change.
12
13(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.
14
150. How to record usage ?
16========================
17
18 2 objects are used.
19
20 page_cgroup ....an object per page.
21
22 Allocated at boot or memory hotplug. Freed at memory hot removal.
23
24 swap_cgroup ... an entry per swp_entry.
25
26 Allocated at swapon(). Freed at swapoff().
27
28 The page_cgroup has a USED bit, so a page_cgroup is never charged twice.
29 swap_cgroup is used only when a charged page is swapped out.
30
311. Charge
32=========
33
34 a page/swp_entry may be charged (usage += PAGE_SIZE) at
35
36 mem_cgroup_try_charge()
37
382. Uncharge
39===========
40
41 a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
42
43 mem_cgroup_uncharge()
44 Called when a page's refcount goes down to 0.
45
46 mem_cgroup_uncharge_swap()
47 Called when swp_entry's refcnt goes down to 0. A charge against swap
48 disappears.
49
503. charge-commit-cancel
51=======================
52
53 Memcg pages are charged in two steps:
54
55 - mem_cgroup_try_charge()
56 - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
57
58 At try_charge(), there are no flags to say "this page is charged".
59 At this point, usage += PAGE_SIZE.
60
61 At commit(), the page is associated with the memcg.
62
63 At cancel(), simply usage -= PAGE_SIZE.
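
The two-step sequence can be pictured with a toy model. The sketch below is
not kernel code: shell variables stand in for the usage counter and for the
page's owner record, and the limit of two pages is an arbitrary assumption.
It only shows that try_charge() moves the counter, commit() records the
association, and cancel() gives the reservation back.

```shell
#!/bin/sh
# Toy model only: shell variables stand in for the memcg usage counter
# and the per-page owner record. None of this is kernel code.
PAGE_SIZE=4096
usage=0        # bytes currently charged to the group
limit=8192     # hypothetical limit of two pages
owner=""       # group the page is associated with, if any

# Step 1: reserve the charge. The page itself is not marked yet.
try_charge() {
    if [ $((usage + PAGE_SIZE)) -gt "$limit" ]; then
        return 1    # over limit; the real code would attempt reclaim first
    fi
    usage=$((usage + PAGE_SIZE))
}

# Step 2a: associate the page with the group; usage is unchanged.
commit_charge() {
    owner=$1
}

# Step 2b: give the reservation back.
cancel_charge() {
    usage=$((usage - PAGE_SIZE))
}
```

A successful path is ``try_charge && commit_charge g1``; an aborted one is
``try_charge`` followed by ``cancel_charge``, which leaves usage where it was.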
64
65In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
66
674. Anonymous
68============
69
70 Anonymous page is newly allocated at
71 - page fault into MAP_ANONYMOUS mapping.
72 - Copy-On-Write.
73
74 4.1 Swap-in.
75 At swap-in, the page is taken from swap-cache. There are 2 cases.
76
77 (a) If the SwapCache is newly allocated and read, it has no charges.
78 (b) If the SwapCache has been mapped by processes, it has been
79 charged already.
80
81 4.2 Swap-out.
82 At swap-out, typical state transition is below.
83
84 (a) add to swap cache. (marked as SwapCache)
85 swp_entry's refcnt += 1.
86 (b) fully unmapped.
87 swp_entry's refcnt += # of ptes.
88 (c) write back to swap.
89 (d) delete from swap cache. (remove from SwapCache)
90 swp_entry's refcnt -= 1.
91
92
93 Finally, at task exit,
94 (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
95
965. Page Cache
97=============
98
99 Page Cache is charged at
100 - add_to_page_cache_locked().
101
102 The logic is very clear. (About migration, see below)
103
104 Note:
105 __remove_from_page_cache() is called by remove_from_page_cache()
106 and __remove_mapping().
107
1086. Shmem(tmpfs) Page Cache
109===========================
110
111 The best way to understand shmem's page state transition is to read
112 mm/shmem.c.
113
114 But brief explanation of the behavior of memcg around shmem will be
115 helpful to understand the logic.
116
117 Shmem's page (just leaf page, not direct/indirect block) can be on
118
119 - radix-tree of shmem's inode.
120 - SwapCache.
121 - Both on radix-tree and SwapCache. This happens at swap-in
122 and swap-out,
123
124 It's charged when...
125
126 - A new page is added to shmem's radix-tree.
127 - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
128
1297. Page Migration
130=================
131
132 mem_cgroup_migrate()
133
1348. LRU
135======
136 Each memcg has its own private LRU. Now, its handling is under global
137 VM's control (meaning that it's handled under the global pgdat->lru_lock).
138 Almost all routines around memcg's LRU are called by the global LRU's
139 list management functions under pgdat->lru_lock.
140
141 A special function is mem_cgroup_isolate_pages(). This scans
142 memcg's private LRU and calls __isolate_lru_page() to extract a page
143 from the LRU.
144
145 (By __isolate_lru_page(), the page is removed from both of global and
146 private LRU.)
147
148
1499. Typical Tests.
150=================
151
152 Tests for racy cases.
153
1549.1 Small limit to memcg.
155-------------------------
156
157 When testing racy cases, it's a good idea to set memcg's limit
158 to a very small value rather than GBs. Many races have been found in tests
159 under xKB or xxMB limits.
160
161 (Memory behavior under a GB limit and memory behavior under an MB limit
162 are very different.)
163
1649.2 Shmem
165---------
166
167 Historically, memcg's shmem handling was poor and we saw a fair amount
168 of trouble here. This is because shmem is page cache but can also be
169 SwapCache. Testing with shmem/tmpfs is always a good test.
170
1719.3 Migration
172-------------
173
174 For NUMA, migration is another special case. For easy testing, cpuset
175 is useful. The following is a sample script to set up migration::
176
177 mount -t cgroup -o cpuset none /opt/cpuset
178
179 mkdir /opt/cpuset/01
180 echo 1 > /opt/cpuset/01/cpuset.cpus
181 echo 0 > /opt/cpuset/01/cpuset.mems
182 echo 1 > /opt/cpuset/01/cpuset.memory_migrate
183 mkdir /opt/cpuset/02
184 echo 1 > /opt/cpuset/02/cpuset.cpus
185 echo 1 > /opt/cpuset/02/cpuset.mems
186 echo 1 > /opt/cpuset/02/cpuset.memory_migrate
187
188 With the above setup, when you move a task from 01 to 02, page migration
189 from node 0 to node 1 will occur. The following is a script to migrate all
190 tasks under a cpuset::
191
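192 G1=/opt/cpuset/01 G2=/opt/cpuset/02 # example values: the groups created above
193 move_task()
194 {
195 for pid in $1
196 do
197 /bin/echo $pid >$2/tasks 2>/dev/null
198 echo -n $pid
199 echo -n " "
200 done
201 echo END
202 }
203
204 G1_TASK=$(cat ${G1}/tasks)
205 G2_TASK=$(cat ${G2}/tasks)
206 move_task "${G1_TASK}" ${G2} &
207 # pass "${G2_TASK}" and ${G1} instead to migrate in the other direction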
208
2099.4 Memory hotplug
210------------------
211
212 Memory hotplug is another good test.
213
214 To offline memory, do the following::
215
216 # echo offline > /sys/devices/system/memory/memoryXXX/state
217
218 (XXX is the number of the memory block)
219
220 This is an easy way to test page migration, too.
221
2229.5 mkdir/rmdir
223---------------
224
225 When using hierarchy, mkdir/rmdir test should be done.
226 Use tests like the following::
227
228 echo 1 >/opt/cgroup/01/memory/use_hierarchy
229 mkdir /opt/cgroup/01/child_a
230 mkdir /opt/cgroup/01/child_b
231
232 set limit to 01.
233 add limit to 01/child_b
234 run jobs under child_a and child_b
235
236 create/delete following groups at random while jobs are running::
237
238 /opt/cgroup/01/child_a/child_aa
239 /opt/cgroup/01/child_b/child_bb
240 /opt/cgroup/01/child_c
241
242 running new jobs in new group is also good.
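
The create/delete churn described above might be scripted along these lines.
This is a minimal sketch: the helper names ``toggle_group`` and ``churn_groups``
are hypothetical, and the loop toggles the three groups in a fixed order
rather than at random to keep the example deterministic.

```shell
#!/bin/sh
# Create the group directory if it is missing, remove it if present.
# On a real cgroup mount, rmdir fails while the group still has tasks
# or child groups; that failure is tolerated here.
toggle_group() {
    if [ -d "$1" ]; then
        rmdir "$1" 2>/dev/null || true
    else
        mkdir -p "$1"
    fi
}

# Toggle the three example groups for $2 rounds under base directory $1.
churn_groups() {
    base=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        toggle_group "$base/child_c"
        toggle_group "$base/child_a/child_aa"
        toggle_group "$base/child_b/child_bb"
        i=$((i + 1))
    done
}
```

Running ``churn_groups /opt/cgroup/01 1000 &`` while the jobs in child_a and
child_b are active exercises group creation and removal under load.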
243
2449.6 Mount with other subsystems
245-------------------------------
246
247 Mounting with other subsystems is a good test because there is a
248 race and lock dependency with other cgroup subsystems.
249
250 example::
251
252 # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
253
254 and do task move, mkdir, rmdir etc...under this.
255
2569.7 swapoff
257-----------
258
259 Besides swap management being one of the complicated parts of memcg,
260 the call path of swap-in at swapoff is not the same as the usual swap-in path.
261 It's worth testing explicitly.
262
263 For example, a test like the following is good:
264
265 (Shell-A)::
266
267 # mount -t cgroup none /cgroup -o memory
268 # mkdir /cgroup/test
269 # echo 40M > /cgroup/test/memory.limit_in_bytes
270 # echo 0 > /cgroup/test/tasks
271
272 Run malloc(100M) program under this. You'll see 60M of swaps.
273
274 (Shell-B)::
275
276 # move all tasks in /cgroup/test to /cgroup
277 # /sbin/swapoff -a
278 # rmdir /cgroup/test
279 # kill malloc task.
280
281 Of course, tmpfs vs. swapoff should be tested, too.
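
The "move all tasks" step in Shell-B is written as a comment rather than a
command. One possible way to carry it out is sketched below; the function
name ``move_all_tasks`` is hypothetical. It writes one pid per write, as the
tasks file expects, and reports how many writes succeeded.

```shell
#!/bin/sh
# Move every task listed in $1/tasks to the cgroup directory $2 and
# print how many writes succeeded. The list is read once up front
# because the source file changes as tasks leave the group.
move_all_tasks() {
    src=$1
    dst=$2
    moved=0
    pids=$(cat "$src/tasks")
    for pid in $pids; do
        # A task may exit between the read and the write; ignore that.
        if echo "$pid" 2>/dev/null > "$dst/tasks"; then
            moved=$((moved + 1))
        fi
    done
    echo "$moved"
}
```

With the paths used above, the call would be
``move_all_tasks /cgroup/test /cgroup``.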
282
2839.8 OOM-Killer
284--------------
285
286 Out-of-memory caused by memcg's limit will kill tasks under
287 the memcg. When hierarchy is used, a task under hierarchy
288 will be killed by the kernel.
289
290 In this case, panic_on_oom shouldn't be invoked and tasks
291 in other groups shouldn't be killed.
292
293 It's not difficult to cause OOM under memcg, as follows.
294
295 Case A) when you can swapoff::
296
297 #swapoff -a
298 #echo 50M > /memory.limit_in_bytes
299
300 run 51M of malloc
301
302 Case B) when you use mem+swap limitation::
303
304 #echo 50M > memory.limit_in_bytes
305 #echo 50M > memory.memsw.limit_in_bytes
306
307 run 51M of malloc
308
3099.9 Move charges at task migration
310----------------------------------
311
312 Charges associated with a task can be moved along with task migration.
313
314 (Shell-A)::
315
316 #mkdir /cgroup/A
317 #echo $$ >/cgroup/A/tasks
318
319 run some programs which use some amount of memory in /cgroup/A.
320
321 (Shell-B)::
322
323 #mkdir /cgroup/B
324 #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
325 #echo "pid of the program running in group A" >/cgroup/B/tasks
326
327 You can see charges have been moved by reading ``*.usage_in_bytes`` or
328 memory.stat of both A and B.
329
330 See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value should
331 be written to move_charge_at_immigrate.
332
3339.10 Memory thresholds
334----------------------
335
336 Memory controller implements memory thresholds using cgroups notification
337 API. You can use tools/cgroup/cgroup_event_listener.c to test it.
338
339 (Shell-A) Create cgroup and run event listener::
340
341 # mkdir /cgroup/A
342 # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
343
344 (Shell-B) Add task to cgroup and try to allocate and free memory::
345
346 # echo $$ >/cgroup/A/tasks
347 # a="$(dd if=/dev/zero bs=1M count=10)"
348 # a=
349
350 You will see a message from cgroup_event_listener every time you cross
351 the thresholds.
352
353 Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
354
355 It's a good idea to test the root cgroup as well.
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
new file mode 100644
index 000000000000..41bdc038dad9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -0,0 +1,1003 @@
1==========================
2Memory Resource Controller
3==========================
4
5NOTE:
6 This document is hopelessly outdated and needs a complete
7 rewrite. It still contains useful information, so we are keeping it
8 here, but make sure to check the current code if you need a deeper
9 understanding.
10
11NOTE:
12 The Memory Resource Controller has generically been referred to as the
13 memory controller in this document. Do not confuse memory controller
14 used here with the memory controller that is used in hardware.
15
16(For editors) In this document:
17 When we mention a cgroup (cgroupfs's directory) with memory controller,
18 we call it "memory cgroup". When you see git-log and source code, you'll
19 see patch's title and function names tend to use "memcg".
20 In this document, we avoid using it.
21
22Benefits and Purpose of the memory controller
23=============================================
24
25The memory controller isolates the memory behaviour of a group of tasks
26from the rest of the system. The article on LWN [12] mentions some probable
27uses of the memory controller. The memory controller can be used to
28
29a. Isolate an application or a group of applications
30 Memory-hungry applications can be isolated and limited to a smaller
31 amount of memory.
32b. Create a cgroup with a limited amount of memory; this can be used
33 as a good alternative to booting with mem=XXXX.
34c. Virtualization solutions can control the amount of memory they want
35 to assign to a virtual machine instance.
36d. A CD/DVD burner could control the amount of memory used by the
37 rest of the system to ensure that burning does not fail due to lack
38 of available memory.
39e. There are several other use cases; find one or use the controller just
40 for fun (to learn and hack on the VM subsystem).
41
42Current Status: linux-2.6.34-mmotm(development version of 2010/April)
43
44Features:
45
46 - accounting anonymous pages, file caches, swap caches usage and limiting them.
47 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
48 - optionally, memory+swap usage can be accounted and limited.
49 - hierarchical accounting
50 - soft limit
51 - moving (recharging) account at moving a task is selectable.
52 - usage threshold notifier
53 - memory pressure notifier
54 - oom-killer disable knob and oom-notifier
55 - Root cgroup has no limit controls.
56
57 Kernel memory support is a work in progress, and the current version provides
58 basic functionality. (See Section 2.7)
59
60Brief summary of control files.
61
62==================================== ==========================================
63 tasks attach a task(thread) and show list of
64 threads
65 cgroup.procs show list of processes
66 cgroup.event_control an interface for event_fd()
67 memory.usage_in_bytes show current usage for memory
68 (See 5.5 for details)
69 memory.memsw.usage_in_bytes show current usage for memory+Swap
70 (See 5.5 for details)
71 memory.limit_in_bytes set/show limit of memory usage
72 memory.memsw.limit_in_bytes set/show limit of memory+Swap usage
73 memory.failcnt show the number of memory usage hits limits
74 memory.memsw.failcnt show the number of memory+Swap hits limits
75 memory.max_usage_in_bytes show max memory usage recorded
76 memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
77 memory.soft_limit_in_bytes set/show soft limit of memory usage
78 memory.stat show various statistics
79 memory.use_hierarchy set/show hierarchical account enabled
80 memory.force_empty trigger forced page reclaim
81 memory.pressure_level set memory pressure notifications
82 memory.swappiness set/show swappiness parameter of vmscan
83 (See sysctl's vm.swappiness)
84 memory.move_charge_at_immigrate set/show controls of moving charges
85 memory.oom_control set/show oom controls.
86 memory.numa_stat show the number of memory usage per numa
87 node
88
89 memory.kmem.limit_in_bytes set/show hard limit for kernel memory
90 memory.kmem.usage_in_bytes show current kernel memory allocation
91 memory.kmem.failcnt show the number of kernel memory usage
92 hits limits
93 memory.kmem.max_usage_in_bytes show max kernel memory usage recorded
94
95 memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory
96 memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation
97 memory.kmem.tcp.failcnt show the number of tcp buf memory usage
98 hits limits
99 memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded
100==================================== ==========================================
101
1021. History
103==========
104
105The memory controller has a long history. A request for comments for the memory
106controller was posted by Balbir Singh [1]. At the time the RFC was posted
107there were several implementations for memory control. The goal of the
108RFC was to build consensus and agreement for the minimal features required
109for memory control. The first RSS controller was posted by Balbir Singh[2]
110in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
111RSS controller. At OLS, at the resource management BoF, everyone suggested
112that we handle both page cache and RSS together. Another request was raised
113to allow user space handling of OOM. The current memory controller is
114at version 6; it combines both mapped (RSS) and unmapped Page
115Cache Control [11].
116
1172. Memory Control
118=================
119
120Memory is a unique resource in the sense that it is present in a limited
121amount. If a task requires a lot of CPU processing, the task can spread
122its processing over a period of hours, days, months or years, but with
123memory, the same physical memory needs to be reused to accomplish the task.
124
125The memory controller implementation has been divided into phases. These
126are:
127
1281. Memory controller
1292. mlock(2) controller
1303. Kernel user memory accounting and slab control
1314. user mappings length controller
132
133The memory controller is the first controller developed.
134
1352.1. Design
136-----------
137
138The core of the design is a counter called the page_counter. The
139page_counter tracks the current memory usage and limit of the group of
140processes associated with the controller. Each cgroup has a memory controller
141specific data structure (mem_cgroup) associated with it.
142
1432.2. Accounting
144---------------
145
146::
147
148 +--------------------+
149 | mem_cgroup |
150 | (page_counter) |
151 +--------------------+
152 / ^ \
153 / | \
154 +---------------+ | +---------------+
155 | mm_struct | |.... | mm_struct |
156 | | | | |
157 +---------------+ | +---------------+
158 |
159 + --------------+
160 |
161 +---------------+ +------+--------+
162 | page +----------> page_cgroup|
163 | | | |
164 +---------------+ +---------------+
165
166 (Figure 1: Hierarchy of Accounting)
167
168
169Figure 1 shows the important aspects of the controller
170
1711. Accounting happens per cgroup
1722. Each mm_struct knows about which cgroup it belongs to
1733. Each page has a pointer to the page_cgroup, which in turn knows the
174 cgroup it belongs to
175
176The accounting is done as follows: mem_cgroup_charge_common() is invoked to
177set up the necessary data structures and check if the cgroup that is being
178charged is over its limit. If it is, then reclaim is invoked on the cgroup.
179More details can be found in the reclaim section of this document.
180If everything goes well, a page meta-data-structure called page_cgroup is
181updated. page_cgroup has its own LRU on cgroup.
182(*) page_cgroup structure is allocated at boot/memory-hotplug time.
183
1842.2.1 Accounting details
185------------------------
186
187All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
188Some pages which are never reclaimable and will not be on the LRU
189are not accounted. We just account pages under usual VM management.
190
191RSS pages are accounted at page_fault unless they've already been accounted
192for earlier. A file page will be accounted for as Page Cache when it's
193inserted into inode (radix-tree). While it's mapped into the page tables of
194processes, duplicate accounting is carefully avoided.
195
196An RSS page is unaccounted when it's fully unmapped. A PageCache page is
197unaccounted when it's removed from radix-tree. Even if RSS pages are fully
198unmapped (by kswapd), they may exist as SwapCache in the system until they
199are really freed. Such SwapCaches are also accounted.
200A swapped-in page is not accounted until it's mapped.
201
202Note: The kernel does swapin-readahead and reads multiple swaps at once.
203This means swapped-in pages may contain pages for other tasks than a task
204causing page fault. So, we avoid accounting at swap-in I/O.
205
206At page migration, accounting information is kept.
207
208Note: we just account pages-on-LRU because our purpose is to control amount
209of used pages; not-on-LRU pages tend to be out-of-control from VM view.
210
2112.3 Shared Page Accounting
212--------------------------
213
214Shared pages are accounted on the basis of the first touch approach. The
215cgroup that first touches a page is accounted for the page. The principle
216behind this approach is that a cgroup that aggressively uses a shared
217page will eventually get charged for it (once it is uncharged from
218the cgroup that brought it in -- this will happen on memory pressure).
219
220But see section 8.2: when moving a task to another cgroup, its pages may
221be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
222
223Exception: If CONFIG_MEMCG_SWAP is not used.
224When you do swapoff and force swapped-out pages of shmem(tmpfs) to
225be brought back into memory, charges for pages are accounted against the
226caller of swapoff rather than the users of shmem.
227
2282.4 Swap Extension (CONFIG_MEMCG_SWAP)
229--------------------------------------
230
231Swap Extension allows you to record charge for swap. A swapped-in page is
232charged back to original page allocator if possible.
233
234When swap is accounted, the following files are added.
235
236 - memory.memsw.usage_in_bytes.
237 - memory.memsw.limit_in_bytes.
238
239memsw means memory+swap. Usage of memory+swap is limited by
240memsw.limit_in_bytes.
241
242Example: Assume a system with 4G of swap. A task which allocates 6G of memory
243(by mistake) under 2G memory limitation will use all swap.
244In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
245By using the memsw limit, you can avoid system OOM which can be caused by swap
246shortage.
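
The numbers in this example can be checked with a little arithmetic. The
sketch below makes a simplifying assumption: the group's resident memory sits
at its memory limit, so the swap it can still consume is bounded by the memsw
limit minus the memory limit and by the size of swap. The function name
``max_swap`` is hypothetical; all values use the same unit (here, G).

```shell
#!/bin/sh
# Print the most swap a group can consume while its resident memory
# sits at the memory limit.
#   $1 = memory limit, $2 = memsw limit (-1 = not set), $3 = total swap
max_swap() {
    mem=$1
    memsw=$2
    swap=$3
    if [ "$memsw" -lt 0 ]; then
        echo "$swap"            # no memsw limit: all of swap can be used
        return
    fi
    room=$((memsw - mem))
    if [ "$room" -lt 0 ]; then
        room=0
    fi
    if [ "$room" -lt "$swap" ]; then
        echo "$room"
    else
        echo "$swap"
    fi
}
```

Under that assumption, ``max_swap 2 -1 4`` gives 4 (all swap is consumed), and
``max_swap 2 3 4`` gives 1, leaving 3G of swap for the rest of the system.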
247
248**why 'memory+swap' rather than swap**
249
250The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
251to move account from memory to swap...there is no change in usage of
252memory+swap. In other words, when we want to limit the usage of swap without
253affecting global LRU, memory+swap limit is better than just limiting swap from
254an OS point of view.
255
256**What happens when a cgroup hits memory.memsw.limit_in_bytes**
257
258When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
259in this cgroup. Then, swap-out will not be done by the cgroup routine and file
260caches are dropped. But as mentioned above, the global LRU can still swap out
261memory from it for the sanity of the system's memory management state. You
262can't forbid it by cgroup.
263
2642.5 Reclaim
265-----------
266
267Each cgroup maintains a per cgroup LRU which has the same structure as
268global VM. When a cgroup goes over its limit, we first try
269to reclaim memory from the cgroup so as to make space for the new
270pages that the cgroup has touched. If the reclaim is unsuccessful,
271an OOM routine is invoked to select and kill the bulkiest task in the
272cgroup. (See 10. OOM Control below.)
273
274The reclaim algorithm has not been modified for cgroups, except that
275pages that are selected for reclaiming come from the per-cgroup LRU
276list.
277
278NOTE:
279 Reclaim does not work for the root cgroup, since we cannot set any
280 limits on the root cgroup.
281
282Note2:
283 When panic_on_oom is set to "2", the whole system will panic.
284
285When oom event notifier is registered, event will be delivered.
286(See oom_control section)
287
2882.6 Locking
289-----------
290
291 lock_page_cgroup()/unlock_page_cgroup() should not be called under
292 the i_pages lock.
293
294 The other lock order is as follows:
295
296 PG_locked.
297 mm->page_table_lock
298 pgdat->lru_lock
299 lock_page_cgroup.
300
301 In many cases, just lock_page_cgroup() is called.
302
303 per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
304 pgdat->lru_lock, it has no lock of its own.
305
3062.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
307-----------------------------------------------
308
309With the Kernel memory extension, the Memory Controller is able to limit
310the amount of kernel memory used by the system. Kernel memory is fundamentally
311different than user memory, since it can't be swapped out, which makes it
312possible to DoS the system by consuming too much of this precious resource.
313
314Kernel memory accounting is enabled for all memory cgroups by default. But
315it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
316at boot time. In this case, kernel memory will not be accounted at all.
317
318Kernel memory limits are not imposed for the root cgroup. Usage for the root
319cgroup may or may not be accounted. The memory used is accumulated into
320memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
321(currently only for tcp).
322
323The main "kmem" counter is fed into the main counter, so kmem charges will
324also be visible from the user counter.
325
326Currently no soft limit is implemented for kernel memory. It is future work
327to trigger slab reclaim when those limits are reached.
328
3292.7.1 Current Kernel Memory resources accounted
330-----------------------------------------------
331
332stack pages:
333 every process consumes some stack pages. By accounting into
334 kernel memory, we prevent new processes from being created when the kernel
335 memory usage is too high.
336
337slab pages:
338 pages allocated by the SLAB or SLUB allocator are tracked. A copy
339 of each kmem_cache is created the first time the cache is touched
340 from inside the memcg. The creation is done lazily, so some objects can still be
341 skipped while the cache is being created. All objects in a slab page should
342 belong to the same memcg. This only fails to hold when a task is migrated to a
343 different memcg during the page allocation by the cache.
344
345sockets memory pressure:
346 some sockets protocols have memory pressure
347 thresholds. The Memory Controller allows them to be controlled individually
348 per cgroup, instead of globally.
349
350tcp memory pressure:
351 sockets memory pressure for the tcp protocol.
352
3532.7.2 Common use cases
354----------------------
355
356Because the "kmem" counter is fed to the main user counter, kernel memory can
357never be limited completely independently of user memory. Say "U" is the user
358limit, and "K" the kernel limit. There are three possible ways limits can be
359set:
360
361U != 0, K = unlimited:
362 This is the standard memcg limitation mechanism already present before kmem
363 accounting. Kernel memory is completely ignored.
364
365U != 0, K < U:
366 Kernel memory is a subset of the user memory. This setup is useful in
367 deployments where the total amount of memory per-cgroup is overcommitted.
368 Overcommitting kernel memory limits is definitely not recommended, since the
369 box can still run out of non-reclaimable memory.
370 In this case, the admin could set up K so that the sum of all groups is
371 never greater than the total memory, and freely set U at the cost of his
372 QoS.
373
374WARNING:
375 In the current implementation, memory reclaim will NOT be
376 triggered for a cgroup when it hits K while staying below U, which makes
377 this setup impractical.
378
379U != 0, K >= U:
380 Since kmem charges will also be fed to the user counter, reclaim will be
381 triggered for the cgroup for both kinds of memory. This setup gives the
382 admin a unified view of memory, and it is also useful for people who just
383 want to track kernel memory usage.
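
The three cases can be told apart mechanically. The following sketch
classifies a pair of limits; the function name ``kmem_setup`` and the one-word
labels it prints are hypothetical, and -1 is used here to stand for an
unlimited K.

```shell
#!/bin/sh
# Classify a (U, K) pair of limits, both in bytes.
#   $1 = U, the user limit
#   $2 = K, the kernel limit; -1 stands for "unlimited"
kmem_setup() {
    u=$1
    k=$2
    if [ "$k" -lt 0 ]; then
        echo "unlimited"    # K = unlimited: kernel memory is not limited
    elif [ "$k" -lt "$u" ]; then
        echo "subset"       # K < U: kernel memory is a subset of user memory
    else
        echo "unified"      # K >= U: reclaim is driven by the user counter
    fi
}
```

For instance, ``kmem_setup 1073741824 268435456`` prints ``subset``, the setup
the warning above calls impractical in the current implementation.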
384
3853. User Interface
386=================
387
3883.0. Configuration
389------------------
390
391a. Enable CONFIG_CGROUPS
392b. Enable CONFIG_MEMCG
393c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
394d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
395
3963.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
397-------------------------------------------------------------------
398
399::
400
401 # mount -t tmpfs none /sys/fs/cgroup
402 # mkdir /sys/fs/cgroup/memory
403 # mount -t cgroup none /sys/fs/cgroup/memory -o memory
404
4053.2. Make the new group and move bash into it::
406
407 # mkdir /sys/fs/cgroup/memory/0
408 # echo $$ > /sys/fs/cgroup/memory/0/tasks
409
410Since now we're in the 0 cgroup, we can alter the memory limit::
411
412 # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
413
414NOTE:
415 We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
416 mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
417 Gibibytes.)
418
419NOTE:
420 We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
421
422NOTE:
423 We cannot set limits on the root cgroup any more.
424
425::
426
427 # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
428 4194304
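
The suffix arithmetic can be reproduced in the shell, which is handy when
comparing a written value such as 4M with the byte count read back. This is a
minimal sketch; the function name ``to_bytes`` is hypothetical, and it only
handles the suffixes listed in the note above.

```shell
#!/bin/sh
# Convert a value with an optional k/K, m/M or g/G suffix to bytes,
# using binary multiples (1K = 1024 bytes).
to_bytes() {
    v=$1
    case "$v" in
        *[kK]) echo "$(( ${v%?} * 1024 ))" ;;
        *[mM]) echo "$(( ${v%?} * 1024 * 1024 ))" ;;
        *[gG]) echo "$(( ${v%?} * 1024 * 1024 * 1024 ))" ;;
        *)     echo "$v" ;;
    esac
}
```

``to_bytes 4M`` prints 4194304, matching the value shown by ``cat`` above.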
429
430We can check the usage::
431
432 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
433 1216512
434
435A successful write to this file does not guarantee a successful setting of
436this limit to the value written into the file. This can be due to a
437number of factors, such as rounding up to page boundaries or the total
438availability of memory on the system. The user is required to re-read
439this file after a write to guarantee the value committed by the kernel::
440
441 # echo 1 > memory.limit_in_bytes
442 # cat memory.limit_in_bytes
443 4096
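
A script that sets limits can fold the write and the re-read into one step.
The helper below is a sketch with a hypothetical name, ``set_limit_checked``;
it prints whatever value the file reports after the write so the caller can
compare it with the value requested.

```shell
#!/bin/sh
# Write $2 to the limit file $1, then print the value the file reports.
# Returns non-zero if the write itself fails.
set_limit_checked() {
    file=$1
    want=$2
    echo "$want" > "$file" || return 1
    cat "$file"
}
```

A caller might use it as
``got=$(set_limit_checked memory.limit_in_bytes 1)`` and then act on ``$got``
rather than on the value it asked for.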
444
445The memory.failcnt field gives the number of times that the cgroup limit was
446exceeded.
447
448The memory.stat file gives accounting information. Now, the number of
449caches, RSS and Active pages/Inactive pages are shown.
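
memory.stat is a plain text file with one "name value" pair per line, so a
single counter can be pulled out with awk. The sketch below assumes that
layout; the function name ``memstat_get`` is hypothetical.

```shell
#!/bin/sh
# Print the value of counter $2 from the memory.stat-style file $1.
# Exits non-zero if the counter is not present.
memstat_get() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}
```

For example, ``memstat_get /sys/fs/cgroup/memory/0/memory.stat rss`` prints
the rss counter of the group created above.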
450
4514. Testing
452==========
453
454For testing features and implementation, see memcg_test.rst.
455
456Performance testing is also important. To see the pure memory controller's
457overhead, testing on tmpfs will give you good numbers for small overheads.
458Example: do a kernel make on tmpfs.
459
460Page-fault scalability is also important. When measuring parallel
461page faults, a multi-process test may be better than a multi-thread
462test because the latter has noise from shared objects/status.
463
464But the above two test extreme situations.
465Running ordinary tests under the memory controller is always helpful.
466
4674.1 Troubleshooting
468-------------------
469
470Sometimes a user might find that the application under a cgroup is
471terminated by the OOM killer. There are several causes for this:
472
4731. The cgroup limit is too low (just too low to do anything useful)
4742. The user is using anonymous memory and swap is turned off or too low
475
476A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
477some of the pages cached in the cgroup (page cache pages).
478
479To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
480seeing what happens will be helpful.
481
4824.2 Task migration
483------------------
484
485When a task migrates from one cgroup to another, its charge is not
486carried forward by default. The pages allocated from the original cgroup still
487remain charged to it; the charge is dropped when the page is freed or
488reclaimed.
489
490You can move charges of a task along with task migration.
491See 8. "Move charges at task migration"
492
4934.3 Removing a cgroup
494---------------------
495
496A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
497cgroup might have some charge associated with it, even though all
498tasks have migrated away from it. (because we charge against pages, not
499against tasks.)
500
501We move the stats to root (if use_hierarchy==0) or parent (if
502use_hierarchy==1), and no change on the charge except uncharging
503from the child.
504
505Charges recorded in swap information are not updated at removal of a cgroup.
506Recorded information is discarded and a cgroup which uses swap (swapcache)
507will be charged as a new owner of it.
508
509About use_hierarchy, see Section 6.
510
5115. Misc. interfaces
512===================
513
5145.1 force_empty
515---------------
516 memory.force_empty interface is provided to make cgroup's memory usage empty.
517 When writing anything to this::
518
519 # echo 0 > memory.force_empty
520
521 the cgroup will be reclaimed and as many pages reclaimed as possible.
522
523 The typical use case for this interface is before calling rmdir().
524 Although rmdir() offlines the memcg, the memcg may still stay around due to
525 charged file caches. Some out-of-use page caches may remain charged until
526 memory pressure happens. If you want to avoid that, force_empty will be useful.
527
528 Also, note that when memory.kmem.limit_in_bytes is set the charges due to
529 kernel pages will still be seen. This is not considered a failure and the
530 write will still return success. In this case, it is expected that
531 memory.kmem.usage_in_bytes == memory.usage_in_bytes.
532
533 About use_hierarchy, see Section 6.
534
5355.2 stat file
536-------------
537
538The memory.stat file includes the following statistics:
539
540per-memory cgroup local status
541^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
542
543=============== ===============================================================
544cache # of bytes of page cache memory.
545rss # of bytes of anonymous and swap cache memory (includes
546 transparent hugepages).
547rss_huge # of bytes of anonymous transparent hugepages.
548mapped_file # of bytes of mapped file (includes tmpfs/shmem)
549pgpgin # of charging events to the memory cgroup. The charging
550 event happens each time a page is accounted as either mapped
551 anon page(RSS) or cache page(Page Cache) to the cgroup.
552pgpgout # of uncharging events to the memory cgroup. The uncharging
553 event happens each time a page is unaccounted from the cgroup.
554swap # of bytes of swap usage
555dirty # of bytes that are waiting to get written back to the disk.
556writeback # of bytes of file/anon cache that are queued for syncing to
557 disk.
558inactive_anon # of bytes of anonymous and swap cache memory on inactive
559 LRU list.
560active_anon # of bytes of anonymous and swap cache memory on active
561 LRU list.
562inactive_file # of bytes of file-backed memory on inactive LRU list.
563active_file # of bytes of file-backed memory on active LRU list.
564unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
565=============== ===============================================================
566
567status considering hierarchy (see memory.use_hierarchy settings)
568^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
569
570========================= ===================================================
571hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
572 under which the memory cgroup is
573hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
574 hierarchy under which memory cgroup is.
575
576total_<counter> # hierarchical version of <counter>, which in
577 addition to the cgroup's own value includes the
578 sum of all hierarchical children's values of
579 <counter>, e.g. total_cache
580========================= ===================================================
581
582The following additional stats are dependent on CONFIG_DEBUG_VM
583^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
584
585========================= ========================================
586recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
587recent_rotated_file VM internal parameter. (see mm/vmscan.c)
588recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
589recent_scanned_file VM internal parameter. (see mm/vmscan.c)
590========================= ========================================
591
592Memo:
593 recent_rotated means recent frequency of LRU rotation.
594 recent_scanned means recent # of scans to LRU.
595 These are shown for debugging; please see the code for their meanings.
596
597Note:
598 Only anonymous and swap cache memory is listed as part of 'rss' stat.
599 This should not be confused with the true 'resident set size' or the
600 amount of physical memory used by the cgroup.
601
602 'rss + mapped_file' will give you the resident set size of the cgroup.
603
604 (Note: file and shmem may be shared among other cgroups. In that case,
605 mapped_file is accounted only when the memory cgroup is owner of page
606 cache.)
607
6085.3 swappiness
609--------------
610
611Overrides /proc/sys/vm/swappiness for the particular group. The tunable
612in the root cgroup corresponds to the global swappiness setting.
613
614Please note that unlike during the global reclaim, limit reclaim
615enforces that 0 swappiness really prevents any swapping even if
616there is swap storage available. This might lead to the memcg OOM killer
617being invoked if there are no file pages to reclaim.
618
6195.4 failcnt
620-----------
621
622A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
623This failcnt(== failure count) shows the number of times that a usage counter
624hit its limit. When a memory cgroup hits a limit, failcnt increases and
625memory under it will be reclaimed.
626
627You can reset failcnt by writing 0 to failcnt file::
628
629 # echo 0 > .../memory.failcnt
630
6315.5 usage_in_bytes
632------------------
633
634For efficiency, like other kernel components, the memory cgroup uses some
635optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes is
636affected by this and doesn't show the 'exact' value of memory (and swap) usage;
637it is a fuzzy value for efficient access. (Of course, it is synchronized when
638necessary.) If you want to know a more exact memory usage, you should use the
639RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
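
The RSS+CACHE(+SWAP) sum mentioned above can be sketched as a small shell
helper. This is only an illustration: the function name memcg_exact_usage is
made up here, and the sample input is hypothetical data in memory.stat format
rather than output from a real system.

```shell
# Sum rss + cache (+ swap, if present) from memory.stat-formatted text on stdin.
# Only the local counters are matched; total_* lines are ignored.
memcg_exact_usage() {
    awk '$1 == "rss" || $1 == "cache" || $1 == "swap" { sum += $2 }
         END { printf "%.0f\n", sum }'
}

# Hypothetical sample; on a real system the input would be the memory.stat
# file of the cgroup of interest.
printf '%s\n' 'cache 4096' 'rss 8192' 'swap 0' 'total_cache 4096' | memcg_exact_usage
```

The swap line only appears when swap accounting is enabled, so the helper
simply adds it when it is there.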
640
6415.6 numa_stat
642-------------
643
644This is similar to numa_maps but operates on a per-memcg basis. This is
645useful for providing visibility into the numa locality information within
646a memcg since the pages are allowed to be allocated from any physical
647node. One of the use cases is evaluating application performance by
648combining this information with the application's CPU allocation.
649
650Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
651per-node page counts including "hierarchical_<counter>" which sums up all
652hierarchical children's values in addition to the memcg's own value.
653
654The output format of memory.numa_stat is::
655
656 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
657 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
658 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
659 unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
660 hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
661
662The "total" count is the sum of file + anon + unevictable.
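
As a rough sketch of how the format above can be consumed, the helper below
prints, for each line, the counter name, the reported count and the sum of the
per-node values. The name numa_stat_sums and the sample numbers are assumptions
made for illustration.

```shell
# For each numa_stat-formatted line on stdin ("name=<count> N0=<n> N1=<n> ..."),
# print: <name> <reported count> <sum of per-node values>.
numa_stat_sums() {
    awk '{
        split($1, head, "=")
        sum = 0
        for (i = 2; i <= NF; i++) {
            split($i, node, "=")
            sum += node[2]
        }
        printf "%s %.0f %.0f\n", head[1], head[2], sum
    }'
}

# Hypothetical sample with two nodes.
printf '%s\n' 'total=30 N0=10 N1=20' 'file=12 N0=2 N1=10' | numa_stat_sums
```

Comparing the second and third columns is a quick way to see how a counter is
spread across nodes.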
663
6646. Hierarchy support
665====================
666
667The memory controller supports a deep hierarchy and hierarchical accounting.
668The hierarchy is created by creating the appropriate cgroups in the
669cgroup filesystem. Consider for example, the following cgroup filesystem
670hierarchy::
671
672 root
673 / | \
674 / | \
675 a b c
676 | \
677 | \
678 d e
679
680In the diagram above, with hierarchical accounting enabled, all memory
681usage of e is accounted to its ancestors up to the root (i.e., c and root)
682that have memory.use_hierarchy enabled. If one of the ancestors goes over its
683limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
684children of the ancestor.
685
6866.1 Enabling hierarchical accounting and reclaim
687------------------------------------------------
688
689A memory cgroup by default disables the hierarchy feature. Support
690can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::
691
692 # echo 1 > memory.use_hierarchy
693
694The feature can be disabled by::
695
696 # echo 0 > memory.use_hierarchy
697
698NOTE1:
699 Enabling/disabling will fail if either the cgroup already has other
700 cgroups created below it, or if the parent cgroup has use_hierarchy
701 enabled.
702
703NOTE2:
704 When panic_on_oom is set to "2", the whole system will panic in
705 case of an OOM event in any cgroup.
706
7077. Soft limits
708==============
709
710Soft limits allow for greater sharing of memory. The idea behind soft limits
711is to allow control groups to use as much of the memory as needed, provided
712
713a. There is no memory contention
714b. They do not exceed their hard limit
715
716When the system detects memory contention or low memory, control groups
717are pushed back to their soft limits. If the soft limit of each control
718group is very high, they are pushed back as much as possible to make
719sure that one control group does not starve the others of memory.
720
721Please note that soft limits is a best-effort feature; it comes with
722no guarantees, but it does its best to make sure that when memory is
723heavily contended for, memory is allocated based on the soft limit
724hints/setup. Currently soft limit based reclaim is set up such that
725it gets invoked from balance_pgdat (kswapd).
726
7277.1 Interface
728-------------
729
730Soft limits can be set up by using the following commands (in this example we
731assume a soft limit of 256 MiB)::
732
733 # echo 256M > memory.soft_limit_in_bytes
734
735If we want to change this to 1G, we can at any time use::
736
737 # echo 1G > memory.soft_limit_in_bytes
738
739NOTE1:
740 Soft limits take effect over a long period of time, since they involve
741 reclaiming memory for balancing between memory cgroups.
742NOTE2:
743 It is recommended to set the soft limit always below the hard limit,
744 otherwise the hard limit will take precedence.
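
The recommendation in NOTE2 can be checked before writing the values. The
sketch below is an illustration only: to_bytes and soft_below_hard are
hypothetical helper names, and only the K, M and G suffixes (taken as powers of
1024) are handled.

```shell
# Convert a size such as 256M or 1G to bytes.
to_bytes() {
    num=${1%[KkMmGg]}
    case $1 in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Succeeds (exit status 0) when the soft limit $1 is strictly below the
# hard limit $2.
soft_below_hard() {
    [ "$(to_bytes "$1")" -lt "$(to_bytes "$2")" ]
}

if soft_below_hard 256M 1G; then
    echo "soft limit is below the hard limit"
fi
```

A script could run such a check before echoing the value into
memory.soft_limit_in_bytes.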
745
7468. Move charges at task migration
747=================================
748
749Users can move charges associated with a task along with task migration, that
750is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
751This feature is not supported in !CONFIG_MMU environments because of lack of
752page tables.
753
7548.1 Interface
755-------------
756
757This feature is disabled by default. It can be enabled (and disabled again) by
758writing to memory.move_charge_at_immigrate of the destination cgroup.
759
760If you want to enable it::
761
762 # echo (some positive value) > memory.move_charge_at_immigrate
763
764Note:
765 Each bit of move_charge_at_immigrate has its own meaning about what type
766 of charges should be moved. See 8.2 for details.
767Note:
768 Charges are moved only when you move mm->owner, in other words,
769 a leader of a thread group.
770Note:
771 If we cannot find enough space for the task in the destination cgroup, we
772 try to make space by reclaiming memory. Task migration may fail if we
773 cannot make enough space.
774Note:
775 It can take several seconds if you move many charges.
776
777And if you want to disable it again::
778
779 # echo 0 > memory.move_charge_at_immigrate
780
7818.2 Type of charges which can be moved
782--------------------------------------
783
784Each bit in move_charge_at_immigrate has its own meaning about what type of
785charges should be moved. But in any case, it must be noted that an account of
786a page or a swap can be moved only when it is charged to the task's current
787(old) memory cgroup.
788
789+---+--------------------------------------------------------------------------+
790|bit| what type of charges would be moved ? |
791+===+==========================================================================+
792| 0 | A charge of an anonymous page (or swap of it) used by the target task. |
793| | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
794+---+--------------------------------------------------------------------------+
795| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
796| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
797| | anonymous pages, file pages (and swaps) in the range mmapped by the task |
798| | will be moved even if the task hasn't done page fault, i.e. they might |
799| | not be the task's "RSS", but other task's "RSS" that maps the same file. |
800| | And mapcount of the page is ignored (the page can be moved even if |
801| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to |
802| | enable move of swap charges. |
803+---+--------------------------------------------------------------------------+
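
To make the bit table concrete, the following sketch decodes a
move_charge_at_immigrate value into the charge types it selects. The function
name decode_move_charge and the labels "anon" and "file" are chosen here for
illustration.

```shell
# Print which charge types a move_charge_at_immigrate value selects,
# following the table above (bit 0: anonymous pages, bit 1: file pages).
decode_move_charge() {
    val=$1
    out=
    if [ $((val & 1)) -ne 0 ]; then
        out="anon"
    fi
    if [ $((val & 2)) -ne 0 ]; then
        out="${out:+$out,}file"
    fi
    echo "${out:-none}"
}

decode_move_charge 3
```

Writing 3 therefore asks for both anonymous and file charges to be moved,
while 0 disables the feature.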
804
8058.3 TODO
806--------
807
808- All of moving charge operations are done under cgroup_mutex. It's not good
809 behavior to hold the mutex too long, so we may need some trick.
810
8119. Memory thresholds
812====================
813
814Memory cgroup implements memory thresholds using the cgroups notification
815API (see cgroups.txt). It allows registering multiple memory and memsw
816thresholds and getting notifications when usage crosses them.
817
818To register a threshold, an application must:
819
820- create an eventfd using eventfd(2);
821- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
822- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
823 cgroup.event_control.
824
825Application will be notified through eventfd when memory usage crosses
826threshold in any direction.
827
828It's applicable for root and non-root cgroup.
829
83010. OOM Control
831===============
832
833memory.oom_control file is for OOM notification and other controls.
834
835Memory cgroup implements OOM notifier using the cgroup notification
836API (See cgroups.txt). It allows registering multiple OOM notification
837deliveries and getting a notification when OOM happens.
838
839To register a notifier, an application must:
840
841 - create an eventfd using eventfd(2)
842 - open memory.oom_control file
843 - write string like "<event_fd> <fd of memory.oom_control>" to
844 cgroup.event_control
845
846The application will be notified through eventfd when OOM happens.
847OOM notification doesn't work for the root cgroup.
848
849You can disable the OOM-killer by writing "1" to memory.oom_control file, as::
850
851 # echo 1 > memory.oom_control
852
853If OOM-killer is disabled, tasks under cgroup will hang/sleep
854in memory cgroup's OOM-waitqueue when they request accountable memory.
855
856To let them run again, you have to relax the memory cgroup's OOM status by
857
858 * enlarging the limit or reducing usage.
859
860To reduce usage,
861
862 * kill some tasks.
863 * move some tasks to other group with account migration.
864 * remove some files (on tmpfs?)
865
866Then, stopped tasks will work again.
867
868At reading, current status of OOM is shown.
869
870 - oom_kill_disable 0 or 1
871 (if 1, oom-killer is disabled)
872 - under_oom 0 or 1
873 (if 1, the memory cgroup is under OOM, tasks may be stopped.)
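
A monitoring script might read the two fields shown above. The helper below is
a minimal sketch; the name oom_field is hypothetical and the sample text only
mimics the format described in this section.

```shell
# Read memory.oom_control-formatted text on stdin and print the value of the
# field named in $1 (oom_kill_disable or under_oom).
oom_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical sample: OOM killer disabled, cgroup currently under OOM.
printf '%s\n' 'oom_kill_disable 1' 'under_oom 1' | oom_field under_oom
```

With the OOM killer disabled, under_oom being 1 indicates that tasks may be
waiting and that the limit or the usage needs attention.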
874
87511. Memory Pressure
876===================
877
878The pressure level notifications can be used to monitor the memory
879allocation cost; based on the pressure, applications can implement
880different strategies of managing their memory resources. The pressure
881levels are defined as follows:
882
883The "low" level means that the system is reclaiming memory for new
884allocations. Monitoring this reclaiming activity might be useful for
885maintaining cache level. Upon notification, the program (typically
886"Activity Manager") might analyze vmstat and act in advance (e.g.
887prematurely shut down unimportant services).
888
889The "medium" level means that the system is experiencing medium memory
890pressure, the system might be making swap, paging out active file caches,
891etc. Upon this event applications may decide to further analyze
892vmstat/zoneinfo/memcg or internal memory usage statistics and free any
893resources that can be easily reconstructed or re-read from a disk.
894
895The "critical" level means that the system is actively thrashing, it is
896about to run out of memory (OOM) or even the in-kernel OOM killer is on its
897way to trigger. Applications should do whatever they can to help the
898system. It might be too late to consult with vmstat or any other
899statistics, so it's advisable to take an immediate action.
900
901By default, events are propagated upward until the event is handled, i.e. the
902events are not pass-through. For example, you have three cgroups: A->B->C. Now
903you set up an event listener on cgroups A, B and C, and suppose group C
904experiences some pressure. In this situation, only group C will receive the
905notification, i.e. groups A and B will not receive it. This is done to avoid
906excessive "broadcasting" of messages, which disturbs the system and which is
907especially bad if we are low on memory or thrashing. Group B will receive
908notification only if there are no event listeners for group C.
909
910There are three optional modes that specify different propagation behavior:
911
912 - "default": this is the default behavior specified above. This mode is the
913 same as omitting the optional mode parameter, preserved for backwards
914 compatibility.
915
916 - "hierarchy": events always propagate up to the root, similar to the default
917 behavior, except that propagation continues regardless of whether there are
918 event listeners at each level, with the "hierarchy" mode. In the above
919 example, groups A, B, and C will receive notification of memory pressure.
920
921 - "local": events are pass-through, i.e. they only receive notifications when
922 memory pressure is experienced in the memcg for which the notification is
923 registered. In the above example, group C will receive notification if
924 registered for "local" notification and the group experiences memory
925 pressure. However, group B will never receive notification, regardless if
926 there is an event listener for group C or not, if group B is registered for
927 local notification.
928
929The level and event notification mode ("hierarchy" or "local", if necessary) are
930specified by a comma-delimited string, e.g. "low,hierarchy" specifies
931hierarchical, pass-through, notification for all ancestor memcgs. Notification
932that is the default, non pass-through behavior, does not specify a mode.
933"medium,local" specifies pass-through notification for the medium level.
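
The level and mode strings described above can be validated before they are
written to cgroup.event_control. The sketch below assumes the three levels and
three modes listed in this section; valid_pressure_arg is a hypothetical helper
name.

```shell
# Succeeds when $1 is a valid "<level>[,<mode>]" string, where the level is
# low, medium or critical and the optional mode is default, hierarchy or local.
valid_pressure_arg() {
    case $1 in
        low|medium|critical) return 0 ;;
        low,*|medium,*|critical,*)
            case ${1#*,} in
                default|hierarchy|local) return 0 ;;
            esac ;;
    esac
    return 1
}

for arg in low medium,local high; do
    if valid_pressure_arg "$arg"; then
        echo "$arg: ok"
    else
        echo "$arg: invalid"
    fi
done
```

Rejecting malformed strings early gives a clearer error than a failed write to
the control file.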
934
935The file memory.pressure_level is only used to setup an eventfd. To
936register a notification, an application must:
937
938- create an eventfd using eventfd(2);
939- open memory.pressure_level;
940- write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
941 to cgroup.event_control.
942
943Application will be notified through eventfd when memory pressure is at
944the specific level (or higher). Read/write operations to
945memory.pressure_level are not implemented.
946
947Test:
948
949 Here is a small script example that makes a new cgroup, sets up a
950 memory limit, sets up a notification in the cgroup and then makes child
951 cgroup experience a critical pressure::
952
953 # cd /sys/fs/cgroup/memory/
954 # mkdir foo
955 # cd foo
956 # cgroup_event_listener memory.pressure_level low,hierarchy &
957 # echo 8000000 > memory.limit_in_bytes
958 # echo 8000000 > memory.memsw.limit_in_bytes
959 # echo $$ > tasks
960 # dd if=/dev/zero | read x
961
962 (Expect a bunch of notifications, and eventually, the oom-killer will
963 trigger.)
964
96512. TODO
966========
967
9681. Make per-cgroup scanner reclaim not-shared pages first
9692. Teach controller to account for shared-pages
9703. Start reclamation in the background when the limit is
971 not yet hit but the usage is getting closer
972
973Summary
974=======
975
976Overall, the memory controller has been a stable controller and has been
977commented and discussed quite extensively in the community.
978
979References
980==========
981
9821. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
9832. Singh, Balbir. Memory Controller (RSS Control),
984 http://lwn.net/Articles/222762/
9853. Emelianov, Pavel. Resource controllers based on process cgroups
986 http://lkml.org/lkml/2007/3/6/198
9874. Emelianov, Pavel. RSS controller based on process cgroups (v2)
988 http://lkml.org/lkml/2007/4/9/78
9895. Emelianov, Pavel. RSS controller based on process cgroups (v3)
990 http://lkml.org/lkml/2007/5/30/244
9916. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
9927. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
993 subsystem (v3), http://lwn.net/Articles/235534/
9948. Singh, Balbir. RSS controller v2 test results (lmbench),
995 http://lkml.org/lkml/2007/5/17/232
9969. Singh, Balbir. RSS controller v2 AIM9 results
997 http://lkml.org/lkml/2007/5/18/1
99810. Singh, Balbir. Memory controller v6 test results,
999 http://lkml.org/lkml/2007/8/19/36
100011. Singh, Balbir. Memory controller introduction (v6),
1001 http://lkml.org/lkml/2007/8/17/69
100212. Corbet, Jonathan, Controlling memory use in cgroups,
1003 http://lwn.net/Articles/243795/
diff --git a/Documentation/admin-guide/cgroup-v1/net_cls.rst b/Documentation/admin-guide/cgroup-v1/net_cls.rst
new file mode 100644
index 000000000000..a2cf272af7a0
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/net_cls.rst
@@ -0,0 +1,44 @@
1=========================
2Network classifier cgroup
3=========================
4
5The Network classifier cgroup provides an interface to
6tag network packets with a class identifier (classid).
7
8The Traffic Controller (tc) can be used to assign
9different priorities to packets from different cgroups.
10Also, Netfilter (iptables) can use this tag to perform
11actions on such packets.
12
13Creating a net_cls cgroups instance creates a net_cls.classid file.
14This net_cls.classid value is initialized to 0.
15
16You can write hexadecimal values to net_cls.classid; the format for these
17values is 0xAAAABBBB; AAAA is the major handle number and BBBB
18is the minor handle number.
19Reading net_cls.classid yields a decimal result.
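
The relation between a tc handle, the hexadecimal value written and the decimal
value read back can be sketched as follows. The helper names are made up for
illustration, and the sketch assumes both halves of the handle are given in
hexadecimal without a 0x prefix.

```shell
# Build the 0xAAAABBBB classid for a handle given as "major:minor".
classid_from_handle() {
    major=${1%%:*}
    minor=${1#*:}
    printf '0x%x\n' $(( (0x$major << 16) | 0x$minor ))
}

# Show the decimal form that reading net_cls.classid would yield.
classid_decimal() {
    echo $(($1))
}

classid_from_handle 10:1
classid_decimal 0x100001
```

For the handle 10:1 this gives 0x100001, which reads back as 1048577.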
20
21Example::
22
23 mkdir /sys/fs/cgroup/net_cls
24 mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
25 mkdir /sys/fs/cgroup/net_cls/0
26 echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
27
28- setting a 10:1 handle::
29
30 cat /sys/fs/cgroup/net_cls/0/net_cls.classid
31 1048577
32
33- configuring tc::
34
35 tc qdisc add dev eth0 root handle 10: htb
36 tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
37
38- creating traffic class 10:1::
39
40 tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
41
42configuring iptables, basic example::
43
44 iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
diff --git a/Documentation/admin-guide/cgroup-v1/net_prio.rst b/Documentation/admin-guide/cgroup-v1/net_prio.rst
new file mode 100644
index 000000000000..b40905871c64
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/net_prio.rst
@@ -0,0 +1,57 @@
1=======================
2Network priority cgroup
3=======================
4
5The Network priority cgroup provides an interface to allow an administrator to
6dynamically set the priority of network traffic generated by various
7applications.
8
9Nominally, an application would set the priority of its traffic via the
10SO_PRIORITY socket option. This however, is not always possible because:
11
121) The application may not have been coded to set this value
132) The priority of application traffic is often a site-specific administrative
14 decision rather than an application defined one.
15
16This cgroup allows an administrator to assign a process to a group which defines
17the priority of egress traffic on a given interface. Network priority groups can
18be created by first mounting the cgroup filesystem::
19
20 # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
21
22With the above step, the initial group acting as the parent accounting group
23becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
24the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
25
26Each net_prio cgroup contains two files that are subsystem specific:
27
28net_prio.prioidx
29 This file is read-only, and is simply informative. It contains a unique
30 integer value that the kernel uses as an internal representation of this
31 cgroup.
32
33net_prio.ifpriomap
34 This file contains a map of the priorities assigned to traffic originating
35 from processes in this group and egressing the system on various interfaces.
36 It contains a list of tuples in the form <ifname priority>. Contents of this
37 file can be modified by echoing a string into the file using the same tuple
38 format. For example::
39
40 echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
41
42This command would force any traffic originating from processes belonging to the
43iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
44said traffic set to the value 5. The parent accounting group also has a
45writeable 'net_prio.ifpriomap' file that can be used to set a system default
46priority.
47
48Priorities are set immediately prior to queueing a frame to the device
49queueing discipline (qdisc) so priorities will be assigned prior to the hardware
50queue selection being made.
51
52One usage for the net_prio cgroup is with mqprio qdisc allowing application
53traffic to be steered to hardware/driver based traffic classes. These mappings
54can then be managed by administrators or other networking protocols such as
55DCBX.
56
57A new net_prio cgroup inherits the parent's configuration.
diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst
new file mode 100644
index 000000000000..6acebd9e72c8
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/pids.rst
@@ -0,0 +1,92 @@
1=========================
2Process Number Controller
3=========================
4
5Abstract
6--------
7
8The process number controller is used to allow a cgroup hierarchy to stop any
9new tasks from being fork()'d or clone()'d after a certain limit is reached.
10
11Since it is trivial to hit the task limit without hitting any kmemcg limits in
12place, PIDs are a fundamental resource. As such, PID exhaustion must be
13preventable in the scope of a cgroup hierarchy by allowing resource limiting of
14the number of tasks in a cgroup.
15
16Usage
17-----
18
19In order to use the `pids` controller, set the maximum number of tasks in
20pids.max (this is not available in the root cgroup for obvious reasons). The
21number of processes currently in the cgroup is given by pids.current.
22
23Organisational operations are not blocked by cgroup policies, so it is possible
24to have pids.current > pids.max. This can be done by either setting the limit to
25be smaller than pids.current, or attaching enough processes to the cgroup such
26that pids.current > pids.max. However, it is not possible to violate a cgroup
27policy through fork() or clone(). fork() and clone() will return -EAGAIN if the
28creation of a new process would cause a cgroup policy to be violated.
29
30To set a cgroup to have no limit, set pids.max to "max". This is the default for
31all new cgroups (N.B. that PID limits are hierarchical, so the most stringent
32limit in the hierarchy is followed).
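
Because the most stringent limit in the hierarchy applies, the effective limit
of a cgroup is the minimum of its own pids.max and those of its ancestors. A
minimal sketch, with effective_pids_max as a hypothetical helper that takes the
pids.max values as arguments:

```shell
# Print the effective limit given the pids.max values of a cgroup and all of
# its ancestors ("max" means no limit at that level).
effective_pids_max() {
    eff=max
    for v in "$@"; do
        if [ "$v" = max ]; then
            continue
        fi
        if [ "$eff" = max ] || [ "$v" -lt "$eff" ]; then
            eff=$v
        fi
    done
    echo "$eff"
}

# A child without its own limit under a parent limited to 2.
effective_pids_max max 2
```

In practice the arguments would come from reading pids.max at each level of
the path from the cgroup up to the root of the hierarchy.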
33
34pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
35superset of parent/child/pids.current.
36
37The pids.events file contains event counters:
38
39 - max: Number of times fork failed because limit was hit.
40
41Example
42-------
43
44First, we mount the pids controller::
45
46 # mkdir -p /sys/fs/cgroup/pids
47 # mount -t cgroup -o pids none /sys/fs/cgroup/pids
48
49Then we create a hierarchy, set limits and attach processes to it::
50
51 # mkdir -p /sys/fs/cgroup/pids/parent/child
52 # echo 2 > /sys/fs/cgroup/pids/parent/pids.max
53 # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
54 # cat /sys/fs/cgroup/pids/parent/pids.current
55 2
56 #
57
58It should be noted that attempts to overcome the set limit (2 in this case) will
59fail::
60
61 # cat /sys/fs/cgroup/pids/parent/pids.current
62 2
63 # ( /bin/echo "Here's some processes for you." | cat )
64 sh: fork: Resource temporarily unavailable
65 #
66
67Even if we migrate to a child cgroup (which doesn't have a set limit), we will
68not be able to overcome the most stringent limit in the hierarchy (in this case,
69parent's)::
70
71 # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
72 # cat /sys/fs/cgroup/pids/parent/pids.current
73 2
74 # cat /sys/fs/cgroup/pids/parent/child/pids.current
75 2
76 # cat /sys/fs/cgroup/pids/parent/child/pids.max
77 max
78 # ( /bin/echo "Here's some processes for you." | cat )
79 sh: fork: Resource temporarily unavailable
80 #
81
82We can set a limit that is smaller than pids.current, which will stop any new
83processes from being forked at all (note that the shell itself counts towards
84pids.current)::
85
86 # echo 1 > /sys/fs/cgroup/pids/parent/pids.max
87 # /bin/echo "We can't even spawn a single process now."
88 sh: fork: Resource temporarily unavailable
89 # echo 0 > /sys/fs/cgroup/pids/parent/pids.max
90 # /bin/echo "We can't even spawn a single process now."
91 sh: fork: Resource temporarily unavailable
92 #
diff --git a/Documentation/admin-guide/cgroup-v1/rdma.rst b/Documentation/admin-guide/cgroup-v1/rdma.rst
new file mode 100644
index 000000000000..2fcb0a9bf790
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/rdma.rst
@@ -0,0 +1,117 @@
1===============
2RDMA Controller
3===============
4
5.. Contents
6
7 1. Overview
8 1-1. What is RDMA controller?
9 1-2. Why RDMA controller needed?
10 1-3. How is RDMA controller implemented?
11 2. Usage Examples
12
131. Overview
14===========
15
161-1. What is RDMA controller?
17-----------------------------
18
19RDMA controller allows user to limit RDMA/IB specific resources that a given
20set of processes can use. These processes are grouped using RDMA controller.
21
22RDMA controller defines two resources which can be limited for processes of a
23cgroup.
24
251-2. Why RDMA controller needed?
26--------------------------------
27
28Currently user space applications can easily take away all the rdma verb
29specific resources such as AH, CQ, QP, MR etc. As a result, other applications
30in other cgroups or kernel space ULPs may not even get a chance to allocate any
31rdma resources. This can lead to service unavailability.
32
33Therefore RDMA controller is needed through which resource consumption
34of processes can be limited. Through this controller different rdma
35resources can be accounted.
36
371-3. How is RDMA controller implemented?
38----------------------------------------
39
40RDMA cgroup allows limit configuration of resources. Rdma cgroup maintains
41resource accounting per cgroup, per device using resource pool structure.
42Each such resource pool is currently limited to 64 resources by the rdma
43cgroup; this limit can be extended later if required.
44
45This resource pool object is linked to the cgroup css. Typically there
46are 0 to 4 resource pool instances per cgroup, per device in most use cases.
47However, nothing prevents having more. At present hundreds of RDMA devices per
48single cgroup may not be handled optimally; however, there is no
49known use case or requirement for such a configuration either.
50
51Since RDMA resources can be allocated from any process and can be freed by any
52of the child processes which shares the address space, rdma resources are
53always owned by the creator cgroup css. This allows process migration from one
54to other cgroup without major complexity of transferring resource ownership;
55because such ownership is not really present due to shared nature of
56rdma resources. Linking resources around css also ensures that cgroups can be
57deleted after processes have migrated. This also allows process migration with
58active resources, even though that is not a primary use case.
59
60Whenever RDMA resource charging occurs, owner rdma cgroup is returned to
61the caller. Same rdma cgroup should be passed while uncharging the resource.
62This also allows process migrated with active RDMA resource to charge
63to new owner cgroup for new resource. It also allows to uncharge resource of
64a process from previously charged cgroup which is migrated to new cgroup,
65even though that is not a primary use case.
66
67A resource pool object is created in the following situations.
68(a) The user sets a limit and no previous resource pool exists for the device
69of interest for the cgroup.
70(b) No resource limits were configured, but the IB/RDMA stack tries to
71charge the resource. The pool is needed so that resources charged while
72applications run without limits are uncharged correctly once limits are
73enforced later; otherwise the usage count would drop below zero.
74
75A resource pool is destroyed when all of its resource limits are set to max and
76the last resource is deallocated.
77
78The user should set all limits to the max value to remove/unconfigure
79the resource pool for a particular device.
80
81The IB stack honors limits enforced by the rdma controller. When an application
82queries the maximum resource limits of an IB device, it returns the minimum of
83what is configured by the user for a given cgroup and what is supported by
84the IB device.
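The rule above amounts to taking a per-resource minimum. A minimal sketch in Python, assuming hypothetical names (``effective_limit``, ``MAX``) that are not part of any kernel or verbs interface; it only models the arithmetic, not how the IB stack actually reports limits:

```python
# Illustrative model only; not a kernel or libibverbs API.
MAX = None  # stands in for the "max" keyword in rdma.max (no cgroup limit)


def effective_limit(cgroup_limit, device_limit):
    """Return the limit an application would see for one resource."""
    if cgroup_limit is MAX:
        # No cgroup limit configured: only the device capability applies.
        return device_limit
    return min(cgroup_limit, device_limit)


print(effective_limit(2000, 65536))
print(effective_limit(MAX, 65536))
```

The device capability value 65536 is made up for the example.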
85
86The following resources can be accounted by the rdma controller.
87
88 ========== =============================
89 hca_handle Maximum number of HCA Handles
90 hca_object Maximum number of HCA Objects
91 ========== =============================
92
932. Usage Examples
94=================
95
96(a) Configure resource limit::
97
98 echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
99 echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
100
101(b) Query resource limit::
102
103 cat /sys/fs/cgroup/rdma/2/rdma.max
104 #Output:
105 mlx4_0 hca_handle=2 hca_object=2000
106 ocrdma1 hca_handle=3 hca_object=max
107
108(c) Query current usage::
109
110 cat /sys/fs/cgroup/rdma/2/rdma.current
111 #Output:
112 mlx4_0 hca_handle=1 hca_object=20
113 ocrdma1 hca_handle=1 hca_object=23
114
115(d) Delete resource limit::
116
117 echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
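The ``rdma.max`` and ``rdma.current`` output shown in the examples above is one device per line followed by ``key=value`` pairs. A small Python sketch of reading that format, assuming a hypothetical helper name ``parse_rdma_file`` and mapping the ``max`` keyword to ``None``:

```python
def parse_rdma_file(text):
    """Parse rdma.max / rdma.current text into {device: {resource: value}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        device, pairs = fields[0], fields[1:]
        limits = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # "max" means no limit is configured for this resource.
            limits[key] = None if value == "max" else int(value)
        result[device] = limits
    return result


sample = "mlx4_0 hca_handle=2 hca_object=2000\nocrdma1 hca_handle=3 hca_object=max\n"
print(parse_rdma_file(sample))
```

In practice the text would come from reading a file such as ``/sys/fs/cgroup/rdma/2/rdma.max``.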
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8269e869cb1e..3b29005aa981 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
 conventions of cgroup v2. It describes all userland-visible aspects
 of cgroup including core and specific controller behaviors. All
 future changes must be reflected in this document. Documentation for
-v1 is available under Documentation/cgroup-v1/.
+v1 is available under Documentation/admin-guide/cgroup-v1/.

 .. CONTENTS

@@ -1014,7 +1014,7 @@ All time durations are in microseconds.
  A read-only nested-key file which exists on non-root cgroups.

  Shows pressure stall information for CPU. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.


 Memory
@@ -1355,7 +1355,7 @@ PAGE_SIZE multiple when read back.
  A read-only nested-key file which exists on non-root cgroups.

  Shows pressure stall information for memory. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.


 Usage Guidelines
@@ -1498,7 +1498,7 @@ IO Interface Files
  A read-only nested-key file which exists on non-root cgroups.

  Shows pressure stall information for IO. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.


 Writeback
diff --git a/Documentation/admin-guide/clearing-warn-once.rst b/Documentation/admin-guide/clearing-warn-once.rst
new file mode 100644
index 000000000000..211fd926cf00
--- /dev/null
+++ b/Documentation/admin-guide/clearing-warn-once.rst
@@ -0,0 +1,9 @@
1Clearing WARN_ONCE
2------------------
3
4WARN_ONCE / WARN_ON_ONCE / printk_once only emit a message once.
5
6echo 1 > /sys/kernel/debug/clear_warn_once
7
8clears the state and allows the warnings to print once again.
9This can be useful after test suite runs to reproduce problems.
diff --git a/Documentation/admin-guide/cpu-load.rst b/Documentation/admin-guide/cpu-load.rst
new file mode 100644
index 000000000000..2d01ce43d2a2
--- /dev/null
+++ b/Documentation/admin-guide/cpu-load.rst
@@ -0,0 +1,114 @@
1========
2CPU load
3========
4
5Linux exports various bits of information via ``/proc/stat`` and
6``/proc/uptime`` that userland tools, such as top(1), use to calculate
7the average time system spent in a particular state, for example::
8
9 $ iostat
10 Linux 2.6.18.3-exp (linmac) 02/20/2007
11
12 avg-cpu: %user %nice %system %iowait %steal %idle
13 10.01 0.00 2.92 5.44 0.00 81.63
14
15 ...
16
17Here the system thinks that over the default sampling period the
18system spent 10.01% of the time doing work in user space, 2.92% in the
19kernel, and was overall 81.63% of the time idle.
20
21In most cases the ``/proc/stat`` information reflects the reality quite
22closely; however, due to the nature of how/when the kernel collects
23this data, sometimes it cannot be trusted at all.
24
25So how is this information collected? Whenever a timer interrupt is
26signalled, the kernel looks at what kind of task was running at this
27moment and increments the counter that corresponds to this task's
28kind/state. The problem with this is that the system could have
29switched between various states multiple times between two timer
30interrupts, yet the counter is incremented only for the last state.
31
32
33Example
34-------
35
36If we imagine the system with one task that periodically burns cycles
37in the following manner::
38
39 time line between two timer interrupts
40 |--------------------------------------|
41 ^ ^
42 |_ something begins working |
43 |_ something goes to sleep
44 (only to be awakened quite soon)
45
46In the above situation the system will be 0% loaded according to the
47``/proc/stat`` (since the timer interrupt will always happen when the
48system is executing the idle handler), but in reality the load is
49closer to 99%.
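The effect can be reproduced with a toy model. The following Python sketch is an illustration under simplifying assumptions, not kernel code: the tick is assumed to fire exactly at the start of each period, and the function name and the microsecond figures are made up for the example.

```python
def sampled_vs_real(period_us, busy_start_us, busy_end_us, periods=100):
    """Compare tick-sampled load with real load for a periodic task.

    The task is busy during [busy_start_us, busy_end_us) of every timer
    period; the timer interrupt samples the CPU at offset 0 of each period.
    """
    tick_offset_us = 0
    sampled_busy = 0
    for _ in range(periods):
        # The tick only sees the state the CPU is in at this instant.
        if busy_start_us <= tick_offset_us < busy_end_us:
            sampled_busy += 1
    sampled_pct = 100.0 * sampled_busy / periods
    real_pct = 100.0 * (busy_end_us - busy_start_us) / period_us
    return sampled_pct, real_pct


print(sampled_vs_real(1000, 5, 995))  # (0.0, 99.0)
```

A task that starts just after each tick and sleeps just before the next one is never observed running, so the sampled figure stays at 0% while the real load is 99%.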
50
51One can imagine many more situations where this behavior of the kernel
52will lead to quite erratic information inside ``/proc/stat``::
53
54
  /* gcc -o hog smallhog.c */
  #include <time.h>
  #include <limits.h>
  #include <signal.h>
  #include <sys/time.h>
  #define HIST 10

  static volatile sig_atomic_t stop;

  /* SIGALRM handler: ask the busy loop in hog() to stop. */
  static void sighandler (int signr)
  {
      (void) signr;
      stop = 1;
  }

  /* Spin for up to niters iterations or until the next SIGALRM;
     return the number of iterations left. */
  static unsigned long hog (unsigned long niters)
  {
      stop = 0;
      while (!stop && --niters);
      return niters;
  }

  int main (void)
  {
      int i;
      struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
                              .it_value = { .tv_sec = 0, .tv_usec = 1 } };
      sigset_t set;
      unsigned long v[HIST];
      double tmp = 0.0;
      unsigned long n;
      signal (SIGALRM, &sighandler);
      setitimer (ITIMER_REAL, &it, NULL);

      /* Calibrate: average the iterations completed between two alarms. */
      hog (ULONG_MAX);
      for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
      for (i = 0; i < HIST; ++i) tmp += v[i];
      tmp /= HIST;
      n = tmp - (tmp / 3.0);

      sigemptyset (&set);
      sigaddset (&set, SIGALRM);

      /* Spin for about two thirds of that average, then wait for the alarm. */
      for (;;) {
          hog (n);
          sigwait (&set, &i);
      }
      return 0;
  }
102
103
104References
105----------
106
107- http://lkml.org/lkml/2007/2/12/6
108- Documentation/filesystems/proc.txt (1.8)
109
110
111Thanks
112------
113
114Con Kolivas, Pavel Machek
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
new file mode 100644
index 000000000000..b90dafcc8237
--- /dev/null
+++ b/Documentation/admin-guide/cputopology.rst
@@ -0,0 +1,177 @@
1===========================================
2How CPU topology info is exported via sysfs
3===========================================
4
5Export CPU topology info via sysfs. Items (attributes) are similar
6to /proc/cpuinfo output of some architectures. They reside in
7/sys/devices/system/cpu/cpuX/topology/:
8
9physical_package_id:
10
11 physical package id of cpuX. Typically corresponds to a physical
12 socket number, but the actual value is architecture and platform
13 dependent.
14
15die_id:
16
17 the CPU die ID of cpuX. Typically it is the hardware platform's
18 identifier (rather than the kernel's). The actual value is
19 architecture and platform dependent.
20
21core_id:
22
23 the CPU core ID of cpuX. Typically it is the hardware platform's
24 identifier (rather than the kernel's). The actual value is
25 architecture and platform dependent.
26
27book_id:
28
29 the book ID of cpuX. Typically it is the hardware platform's
30 identifier (rather than the kernel's). The actual value is
31 architecture and platform dependent.
32
33drawer_id:
34
35 the drawer ID of cpuX. Typically it is the hardware platform's
36 identifier (rather than the kernel's). The actual value is
37 architecture and platform dependent.
38
39core_cpus:
40
41 internal kernel map of CPUs within the same core.
42 (deprecated name: "thread_siblings")
43
44core_cpus_list:
45
46 human-readable list of CPUs within the same core.
47 (deprecated name: "thread_siblings_list")
48
49package_cpus:
50
51 internal kernel map of the CPUs sharing the same physical_package_id.
52 (deprecated name: "core_siblings")
53
54package_cpus_list:
55
56 human-readable list of CPUs sharing the same physical_package_id.
57 (deprecated name: "core_siblings_list")
58
59die_cpus:
60
61 internal kernel map of CPUs within the same die.
62
63die_cpus_list:
64
65 human-readable list of CPUs within the same die.
66
67book_siblings:
68
69 internal kernel map of cpuX's hardware threads within the same
70 book_id.
71
72book_siblings_list:
73
74 human-readable list of cpuX's hardware threads within the same
75 book_id.
76
77drawer_siblings:
78
79 internal kernel map of cpuX's hardware threads within the same
80 drawer_id.
81
82drawer_siblings_list:
83
84 human-readable list of cpuX's hardware threads within the same
85 drawer_id.
86
87The architecture-neutral file drivers/base/topology.c exports these attributes.
88However, the book and drawer related sysfs files will only be created if
89CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
90
91CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
92where they reflect the cpu and cache hierarchy.
93
94For an architecture to support this feature, it must define some of
95these macros in include/asm-XXX/topology.h::
96
97 #define topology_physical_package_id(cpu)
98 #define topology_die_id(cpu)
99 #define topology_core_id(cpu)
100 #define topology_book_id(cpu)
101 #define topology_drawer_id(cpu)
102 #define topology_sibling_cpumask(cpu)
103 #define topology_core_cpumask(cpu)
104 #define topology_die_cpumask(cpu)
105 #define topology_book_cpumask(cpu)
106 #define topology_drawer_cpumask(cpu)
107
108The type of ``**_id macros`` is int.
109The type of ``**_cpumask macros`` is ``(const) struct cpumask *``. The latter
110correspond with appropriate ``**_siblings`` sysfs attributes (except for
111topology_sibling_cpumask() which corresponds with thread_siblings).
112
113To be consistent on all architectures, include/linux/topology.h
114provides default definitions for any of the above macros that are
115not defined by include/asm-XXX/topology.h:
116
1171) topology_physical_package_id: -1
1182) topology_die_id: -1
1193) topology_core_id: 0
1204) topology_sibling_cpumask: just the given CPU
1215) topology_core_cpumask: just the given CPU
1226) topology_die_cpumask: just the given CPU
123
124For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
125default definitions for topology_book_id() and topology_book_cpumask().
126For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
127no default definitions for topology_drawer_id() and topology_drawer_cpumask().
128
129Additionally, CPU topology information is provided under
130/sys/devices/system/cpu and includes these files. The internal
131source for the output is in brackets ("[]").
132
133 =========== ==========================================================
134 kernel_max: the maximum CPU index allowed by the kernel configuration.
135 [NR_CPUS-1]
136
137 offline: CPUs that are not online because they have been
138 HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
139 of CPUs allowed by the kernel configuration (kernel_max
140 above). [~cpu_online_mask + cpus >= NR_CPUS]
141
142 online: CPUs that are online and being scheduled [cpu_online_mask]
143
144 possible: CPUs that have been allocated resources and can be
145 brought online if they are present. [cpu_possible_mask]
146
147 present: CPUs that have been identified as being present in the
148 system. [cpu_present_mask]
149 =========== ==========================================================
150
151The format for the above output is compatible with cpulist_parse()
152[see <linux/cpumask.h>]. Some examples follow.
153
154In this example, there are 64 CPUs in the system but cpus 32-63 exceed
155the kernel max which is limited to 0..31 by the NR_CPUS config option
156being 32. Note also that CPUs 2 and 4-31 are not online but could be
157brought online as they are both present and possible::
158
159 kernel_max: 31
160 offline: 2,4-31,32-63
161 online: 0-1,3
162 possible: 0-31
163 present: 0-31
164
165In this example, the NR_CPUS config option is 128, but the kernel was
166started with possible_cpus=144. There are 4 CPUs in the system and cpu2
167was manually taken offline (and is the only CPU that can be brought
168online.)::
169
170 kernel_max: 127
171 offline: 2,4-127,128-143
172 online: 0-1,3
173 possible: 0-127
174 present: 0-3
175
176See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
177as well as more information on the various cpumasks.
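The list format used in the examples above (comma-separated CPU numbers and inclusive ranges) is easy to expand in userspace. A minimal Python sketch, assuming a hypothetical helper name ``parse_cpulist``; it handles only plain numbers and ``a-b`` ranges, and the kernel's own parser may accept additional syntax that this sketch ignores:

```python
def parse_cpulist(text):
    """Expand a list such as "0-1,3" into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        first, sep, last = part.partition("-")
        lo = int(first)
        hi = int(last) if sep else lo
        cpus.update(range(lo, hi + 1))  # ranges are inclusive at both ends
    return sorted(cpus)


print(parse_cpulist("0-1,3"))              # [0, 1, 3]
print(len(parse_cpulist("2,4-31,32-63")))  # 61
```

The input would typically be read from a file such as ``/sys/devices/system/cpu/online``.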
diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst
new file mode 100644
index 000000000000..b17fe352fc41
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/cache-policies.rst
@@ -0,0 +1,131 @@
1=============================
2Guidance for writing policies
3=============================
4
5Try to keep transactionality out of it. The core is careful to
6avoid asking about anything that is migrating. This is a pain, but
7makes it easier to write the policies.
8
9Mappings are loaded into the policy at construction time.
10
11Every bio that is mapped by the target is referred to the policy.
12The policy can return a simple HIT or MISS or issue a migration.
13
14Currently there's no way for the policy to issue background work,
15e.g. to start writing back dirty blocks that are going to be evicted
16soon.
17
18Because we map bios, rather than requests, it's easy for the policy
19to get fooled by many small bios. For this reason the core target
20issues periodic ticks to the policy. It's suggested that the policy
21doesn't update states (e.g., hit counts) for a block more than once
22for each tick. The core ticks by watching bios complete, and so
23tries to see when the io scheduler has let the ios run.
24
25
26Overview of supplied cache replacement policies
27===============================================
28
29multiqueue (mq)
30---------------
31
32This policy is now an alias for smq (see below).
33
34The following tunables are accepted, but have no effect::
35
36 'sequential_threshold <#nr_sequential_ios>'
37 'random_threshold <#nr_random_ios>'
38 'read_promote_adjustment <value>'
39 'write_promote_adjustment <value>'
40 'discard_promote_adjustment <value>'
41
42Stochastic multiqueue (smq)
43---------------------------
44
45This policy is the default.
46
47The stochastic multi-queue (smq) policy addresses some of the problems
48with the multiqueue (mq) policy.
49
50The smq policy (vs mq) offers the promise of less memory utilization,
51improved performance and increased adaptability in the face of changing
52workloads. smq also does not have any cumbersome tuning knobs.
53
54Users may switch from "mq" to "smq" simply by appropriately reloading a
55DM table that is using the cache target. Doing so will cause all of the
56mq policy's hints to be dropped. Also, performance of the cache may
57degrade slightly until smq recalculates the origin device's hotspots
58that should be cached.
59
60Memory usage
61^^^^^^^^^^^^
62
63The mq policy used a lot of memory; 88 bytes per cache block on a 64
64bit machine.
65
66smq uses 28-bit indexes to implement its data structures rather than
67pointers. It avoids storing an explicit hit count for each block. It
68has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
69the entries (each hotspot block covers a larger area than a single
70cache block).
71
72All this means smq uses ~25 bytes per cache block. Still a lot of
73memory, but a substantial improvement nonetheless.
74
75Level balancing
76^^^^^^^^^^^^^^^
77
78mq placed entries in different levels of the multiqueue structures
79based on their hit count (~ln(hit count)). This meant the bottom
80levels generally had the most entries, and the top ones had very
81few. Having unbalanced levels like this reduced the efficacy of the
82multiqueue.
83
84smq does not maintain a hit count; instead it swaps hit entries with
85the least recently used entry from the level above. The overall
86ordering is a side effect of this stochastic process. With this
87scheme we can decide how many entries occupy each multiqueue level,
88resulting in better promotion/demotion decisions.
89
90Adaptability:
91The mq policy maintained a hit count for each cache block. For a
92different block to get promoted to the cache its hit count has to
93exceed the lowest currently in the cache. This meant it could take a
94long time for the cache to adapt between varying IO patterns.
95
96smq doesn't maintain hit counts, so a lot of this problem just goes
97away. In addition it tracks performance of the hotspot queue, which
98is used to decide which blocks to promote. If the hotspot queue is
99performing badly then it starts moving entries more quickly between
100levels. This lets it adapt to new IO patterns very quickly.
101
102Performance
103^^^^^^^^^^^
104
105Testing smq shows substantially better performance than mq.
106
107cleaner
108-------
109
110The cleaner writes back all dirty blocks in a cache to decommission it.
111
112Examples
113========
114
115The syntax for a table is::
116
117 cache <metadata dev> <cache dev> <origin dev> <block size>
118 <#feature_args> [<feature arg>]*
119 <policy> <#policy_args> [<policy arg>]*
120
121The syntax to send a message using the dmsetup command is::
122
123 dmsetup message <mapped device> 0 sequential_threshold 1024
124 dmsetup message <mapped device> 0 random_threshold 8
125
126Using dmsetup::
127
128 dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
129 /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
130 creates a 128GB large mapped device named 'blah' with the
131 sequential threshold set to 1024 and the random_threshold set to 8.
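The 128GB figure in the example above follows from the table length, which is given in 512-byte sectors. A minimal sketch of the arithmetic in Python (the helper name ``sectors_to_gib`` is made up for illustration):

```python
SECTOR_BYTES = 512  # device-mapper table lengths are counted in 512-byte sectors


def sectors_to_gib(sectors):
    """Convert a length in 512-byte sectors to GiB."""
    return sectors * SECTOR_BYTES / 2**30


print(sectors_to_gib(268435456))  # 128.0
```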
diff --git a/Documentation/admin-guide/device-mapper/cache.rst b/Documentation/admin-guide/device-mapper/cache.rst
new file mode 100644
index 000000000000..f15e5254d05b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/cache.rst
@@ -0,0 +1,337 @@
1=====
2Cache
3=====
4
5Introduction
6============
7
8dm-cache is a device mapper target written by Joe Thornber, Heinz
9Mauelshagen, and Mike Snitzer.
10
11It aims to improve performance of a block device (eg, a spindle) by
12dynamically migrating some of its data to a faster, smaller device
13(eg, an SSD).
14
15This device-mapper solution allows us to insert this caching at
16different levels of the dm stack, for instance above the data device for
17a thin-provisioning pool. Caching solutions that are integrated more
18closely with the virtual memory system should give better performance.
19
20The target reuses the metadata library used in the thin-provisioning
21library.
22
23The decision as to what data to migrate and when is left to a plug-in
24policy module. Several of these have been written as we experiment,
25and we hope other people will contribute others for specific io
26scenarios (eg. a vm image server).
27
28Glossary
29========
30
31 Migration
32 Movement of the primary copy of a logical block from one
33 device to the other.
34 Promotion
35 Migration from slow device to fast device.
36 Demotion
37 Migration from fast device to slow device.
38
39The origin device always contains a copy of the logical block, which
40may be out of date or kept in sync with the copy on the cache device
41(depending on policy).
42
43Design
44======
45
46Sub-devices
47-----------
48
49The target is constructed by passing three devices to it (along with
50other parameters detailed later):
51
521. An origin device - the big, slow one.
53
542. A cache device - the small, fast one.
55
563. A small metadata device - records which blocks are in the cache,
57 which are dirty, and extra hints for use by the policy object.
58 This information could be put on the cache device, but having it
59 separate allows the volume manager to configure it differently,
60 e.g. as a mirror for extra robustness. This metadata device may only
61 be used by a single cache device.
62
63Fixed block size
64----------------
65
66The origin is divided up into blocks of a fixed size. This block size
67is configurable when you first create the cache. Typically we've been
68using block sizes of 256KB - 1024KB. The block size must be between 64
69sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
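The block size constraints stated above can be checked before a table is loaded. A minimal Python sketch, assuming hypothetical names (``valid_block_size`` and the constants) and taking the limits directly from the paragraph above:

```python
SECTOR_BYTES = 512
MIN_BLOCK_SECTORS = 64        # 32KB
MAX_BLOCK_SECTORS = 2097152   # 1GB


def valid_block_size(sectors):
    """Check a cache block size (in sectors) against the documented limits."""
    return (MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS
            and sectors % MIN_BLOCK_SECTORS == 0)


print(valid_block_size(512))   # True: 256KB, a multiple of 32KB
print(valid_block_size(100))   # False: not a multiple of 64 sectors
```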
70
71Having a fixed block size simplifies the target a lot. But it is
72something of a compromise. For instance, a small part of a block may be
73getting hit a lot, yet the whole block will be promoted to the cache.
74So large block sizes are bad because they waste cache space. And small
75block sizes are bad because they increase the amount of metadata (both
76in core and on disk).
77
78Cache operating modes
79---------------------
80
81The cache has three operating modes: writeback, writethrough and
82passthrough.
83
84If writeback, the default, is selected then a write to a block that is
85cached will go only to the cache and the block will be marked dirty in
86the metadata.
87
88If writethrough is selected then a write to a cached block will not
89complete until it has hit both the origin and cache devices. Clean
90blocks should remain clean.
91
92If passthrough is selected, useful when the cache contents are not known
93to be coherent with the origin device, then all reads are served from
94the origin device (all reads miss the cache) and all writes are
95forwarded to the origin device; additionally, write hits cause cache
96block invalidates. To enable passthrough mode the cache must be clean.
97Passthrough mode allows a cache device to be activated without having to
98worry about coherency. Coherency that exists is maintained, although
99the cache will gradually cool as writes take place. If the coherency of
100the cache can later be verified, or established through use of the
101"invalidate_cblocks" message, the cache device can be transitioned to
102writethrough or writeback mode while still warm. Otherwise, the cache
103contents can be discarded prior to transitioning to the desired
104operating mode.
105
106A simple cleaner policy is provided, which will clean (write back) all
107dirty blocks in a cache. Useful for decommissioning a cache or when
108shrinking a cache. Shrinking the cache's fast device requires all cache
109blocks, in the area of the cache being removed, to be clean. If the
110area being removed from the cache still contains dirty blocks the resize
111will fail. Care must be taken to never reduce the volume used for the
112cache's fast device until the cache is clean. This is of particular
113importance if writeback mode is used. Writethrough and passthrough
114modes already maintain a clean cache. Future support to partially clean
115the cache, above a specified threshold, will allow for keeping the cache
116warm and in writeback mode during resize.
117
118Migration throttling
119--------------------
120
121Migrating data between the origin and cache device uses bandwidth.
122The user can set a throttle to prevent more than a certain amount of
123migration occurring at any one time. Currently we're not taking any
124account of normal io traffic going to the devices. More work needs
125doing here to avoid migrating during those peak io moments.
126
127For the time being, a message "migration_threshold <#sectors>"
128can be used to set the maximum number of sectors being migrated,
129the default being 2048 sectors (1MB).
130
131Updating on-disk metadata
132-------------------------
133
134On-disk metadata is committed every time a FLUSH or FUA bio is written.
135If no such requests are made then commits will occur every second. This
136means the cache behaves like a physical disk that has a volatile write
137cache. If power is lost you may lose some recent writes. The metadata
138should always be consistent in spite of any crash.
139
140The 'dirty' state for a cache block changes far too frequently for us
141to keep updating it on the fly. So we treat it as a hint. In normal
142operation it will be written when the dm device is suspended. If the
143system crashes all cache blocks will be assumed dirty when restarted.
144
145Per-block policy hints
146----------------------
147
148Policy plug-ins can store a chunk of data per cache block. It's up to
149the policy how big this chunk is, but it should be kept small. Like the
150dirty flags this data is lost if there's a crash so a safe fallback
151value should always be possible.
152
153Policy hints affect performance, not correctness.
154
155Policy messaging
156----------------
157
158Policies will have different tunables, specific to each one, so we
159need a generic way of getting and setting these. Device-mapper
160messages are used. Refer to cache-policies.txt.
161
162Discard bitset resolution
163-------------------------
164
165We can avoid copying data during migration if we know the block has
166been discarded. A prime example of this is when mkfs discards the
167whole block device. We store a bitset tracking the discard state of
168blocks. However, we allow this bitset to have a different block size
169from the cache blocks. This is because we need to track the discard
170state for all of the origin device (compare with the dirty bitset
171which is just for the smaller cache device).
172
173Target interface
174================
175
176Constructor
177-----------
178
179 ::
180
181 cache <metadata dev> <cache dev> <origin dev> <block size>
182 <#feature args> [<feature arg>]*
183 <policy> <#policy args> [policy args]*
184
185 ================ =======================================================
186 metadata dev fast device holding the persistent metadata
187 cache dev fast device holding cached data blocks
188 origin dev slow device holding original data blocks
189 block size cache unit size in sectors
190
191 #feature args number of feature arguments passed
192 feature args writethrough or passthrough (The default is writeback.)
193
194 policy the replacement policy to use
195 #policy args an even number of arguments corresponding to
196 key/value pairs passed to the policy
197 policy args key/value pairs passed to the policy
198 E.g. 'sequential_threshold 1024'
199 See cache-policies.txt for details.
200 ================ =======================================================
201
202Optional feature arguments are:
203
204
205 ==================== ========================================================
206 writethrough write through caching that prohibits cache block
207 content from being different from origin block content.
208 Without this argument, the default behaviour is to write
209 back cache block contents later for performance reasons,
210 so they may differ from the corresponding origin blocks.
211
212 passthrough a degraded mode useful for various cache coherency
213 situations (e.g., rolling back snapshots of
214 underlying storage). Reads and writes always go to
215 the origin. If a write goes to a cached origin
216 block, then the cache block is invalidated.
217 To enable passthrough mode the cache must be clean.
218
219 metadata2 use version 2 of the metadata. This stores the dirty
220 bits in a separate btree, which improves speed of
221 shutting down the cache.
222
223 no_discard_passdown disable passing down discards from the cache
224 to the origin's data device.
225 ==================== ========================================================
226
227A policy called 'default' is always registered. This is an alias for
228the policy we currently think is giving best all round performance.
229
230As the default policy could vary between kernels, if you are relying on
231the characteristics of a specific policy, always request it by name.
232
233Status
234------
235
236::
237
238 <metadata block size> <#used metadata blocks>/<#total metadata blocks>
239 <cache block size> <#used cache blocks>/<#total cache blocks>
240 <#read hits> <#read misses> <#write hits> <#write misses>
241 <#demotions> <#promotions> <#dirty> <#features> <features>*
242 <#core args> <core args>* <policy name> <#policy args> <policy args>*
243 <cache metadata mode>
244
245
246========================= =====================================================
247metadata block size Fixed block size for each metadata block in
248 sectors
249#used metadata blocks Number of metadata blocks used
250#total metadata blocks Total number of metadata blocks
251cache block size Configurable block size for the cache device
252 in sectors
253#used cache blocks Number of blocks resident in the cache
254#total cache blocks Total number of cache blocks
255#read hits Number of times a READ bio has been mapped
256 to the cache
257#read misses Number of times a READ bio has been mapped
258 to the origin
259#write hits Number of times a WRITE bio has been mapped
260 to the cache
261#write misses Number of times a WRITE bio has been
262 mapped to the origin
263#demotions Number of times a block has been removed
264 from the cache
265#promotions Number of times a block has been moved to
266 the cache
267#dirty Number of blocks in the cache that differ
268 from the origin
269#feature args Number of feature args to follow
270feature args 'writethrough' (optional)
271#core args Number of core arguments (must be even)
272core args Key/value pairs for tuning the core
273 e.g. migration_threshold
274policy name Name of the policy
275#policy args Number of policy arguments to follow (must be even)
276policy args Key/value pairs e.g. sequential_threshold
277cache metadata mode ro if read-only, rw if read-write
278
279 In serious cases where even a read-only mode is
280 deemed unsafe no further I/O will be permitted and
281 the status will just contain the string 'Fail'.
282 The userspace recovery tools should then be used.
283needs_check 'needs_check' if set, '-' if not set
284 A metadata operation has failed, resulting in the
285 needs_check flag being set in the metadata's
286 superblock. The metadata device must be
287 deactivated and checked/repaired before the
288 cache can be made fully operational again.
289 '-' indicates needs_check is not set.
290========================= =====================================================
291
292Messages
293--------
294
295Policies will have different tunables, specific to each one, so we
296need a generic way of getting and setting these. Device-mapper
297messages are used. (A sysfs interface would also be possible.)
298
299The message format is::
300
301 <key> <value>
302
303E.g.::
304
305 dmsetup message my_cache 0 sequential_threshold 1024
306
307
308Invalidation is removing an entry from the cache without writing it
309back. Cache blocks can be invalidated via the invalidate_cblocks
310message, which takes an arbitrary number of cblock ranges. Each cblock
311range's end value is "one past the end", meaning 5-10 expresses a range
312of values from 5 to 9. Each cblock must be expressed as a decimal
313value; in the future, a variant message that takes cblock ranges
314expressed in hexadecimal may be needed to better support efficient
315invalidation of larger caches. The cache must be in passthrough mode
316when invalidate_cblocks is used::
317
318 invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
319
320E.g.::
321
322 dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
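
Because the end value of each range is exclusive, it is easy to be off by one when building the argument list. The following is a minimal shell sketch, not part of dmsetup, that counts how many cache blocks a list of invalidate_cblocks arguments covers. The helper name ``cblock_count`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the cache blocks covered by a list of
# invalidate_cblocks arguments. A range "B-E" is half-open, so it
# covers blocks B .. E-1; a bare number covers exactly one block.
cblock_count() {
    total=0
    for arg in "$@"; do
        case "$arg" in
            *-*)
                begin=${arg%-*}
                end=${arg#*-}
                total=$((total + end - begin))
                ;;
            *)
                total=$((total + 1))
                ;;
        esac
    done
    echo "$total"
}

cblock_count 5-10                       # prints 5 (blocks 5, 6, 7, 8, 9)
cblock_count 2345 3456-4567 5678-6789   # prints 2223 (1 + 1111 + 1111)
```

The second call uses the same arguments as the dmsetup example above, so that message would invalidate 2223 cache blocks.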
323
324Examples
325========
326
327The test suite can be found here:
328
329https://github.com/jthornber/device-mapper-test-suite
330
331::
332
333 dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
334 /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
335 dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
336 /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
337 mq 4 sequential_threshold 1024 random_threshold 8'
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst
new file mode 100644
index 000000000000..917ba8c33359
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/delay.rst
@@ -0,0 +1,31 @@
1========
2dm-delay
3========
4
5Device-Mapper's "delay" target delays reads and/or writes
6and maps them to different devices.
7
8Parameters::
9
10 <device> <offset> <delay> [<write_device> <write_offset> <write_delay>
11 [<flush_device> <flush_offset> <flush_delay>]]
12
13With separate write parameters, the first set is only used for reads.
14Offsets are specified in sectors.
15Delays are specified in milliseconds.
16
17Example scripts
18===============
19
20::
21
22 #!/bin/sh
23 # Create device delaying rw operation for 500ms
24 echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
25
26::
27
28 #!/bin/sh
29 # Create device delaying only write operation for 500ms and
30 # splitting reads and writes to different devices $1 $2
31 echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
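
The scripts above use the three-argument and six-argument forms. As a sketch of the nine-argument form, the helper below prints a table line that gives reads, writes and flushes their own delay on a single backing device. The function name ``delay_table`` and the device and size values are illustrative assumptions; all offsets are taken to be 0.

```shell
#!/bin/sh
# Hypothetical helper: print a dm-delay table line that uses one backing
# device for reads, writes and flushes, each with its own delay in
# milliseconds. All offsets are 0.
delay_table() {
    sectors=$1 dev=$2 read_ms=$3 write_ms=$4 flush_ms=$5
    echo "0 $sectors delay $dev 0 $read_ms $dev 0 $write_ms $dev 0 $flush_ms"
}

# Delay only flushes by 500 ms on a 409600-sector device (example values).
delay_table 409600 /dev/sda1 0 0 500
```

On a real system the printed line could be piped to ``dmsetup create delayed``, in the same way as the scripts above.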
diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst
new file mode 100644
index 000000000000..8f4a3f889d43
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst
@@ -0,0 +1,173 @@
1========
2dm-crypt
3========
4
5Device-Mapper's "crypt" target provides transparent encryption of block devices
6using the kernel crypto API.
7
8For a more detailed description of supported parameters see:
9https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
10
11Parameters::
12
13 <cipher> <key> <iv_offset> <device path> \
14 <offset> [<#opt_params> <opt_params>]
15
16<cipher>
17 Encryption cipher, encryption mode and Initial Vector (IV) generator.
18
19 The cipher specifications format is::
20
21 cipher[:keycount]-chainmode-ivmode[:ivopts]
22
23 Examples::
24
25 aes-cbc-essiv:sha256
26 aes-xts-plain64
27 serpent-xts-plain64
28
29 Cipher format also supports direct specification with kernel crypto API
30 format (selected by capi: prefix). The IV specification is the same
31 as for the first format type.
32 This format is mainly used for specification of authenticated modes.
33
34 The crypto API cipher specifications format is::
35
36 capi:cipher_api_spec-ivmode[:ivopts]
37
38 Examples::
39
40 capi:cbc(aes)-essiv:sha256
41 capi:xts(aes)-plain64
42
43 Examples of authenticated modes::
44
45 capi:gcm(aes)-random
46 capi:authenc(hmac(sha256),xts(aes))-random
47 capi:rfc7539(chacha20,poly1305)-random
48
49 The /proc/crypto file contains a list of currently loaded crypto modes.
50
51<key>
52 Key used for encryption. It is encoded either as a hexadecimal number
53 or it can be passed as <key_string> prefixed with single colon
54 character (':') for keys residing in kernel keyring service.
55 You can only use key sizes that are valid for the selected cipher
56 in combination with the selected iv mode.
57 Note that for some iv modes the key string can contain additional
58 keys (for example IV seed) so the key contains more parts concatenated
59 into a single string.
60
61<key_string>
62 The kernel keyring key is identified by a string in the following format:
63 <key_size>:<key_type>:<key_description>.
64
65<key_size>
66 The encryption key size in bytes. The kernel key payload size must match
67 the value passed in <key_size>.
68
69<key_type>
70 Either 'logon' or 'user' kernel key type.
71
72<key_description>
73 The kernel keyring key description that the crypt target should look for
74 when loading a key of <key_type>.
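
As a small illustration of how the three fields combine into the <key> table argument, the sketch below assembles the colon-prefixed string and rejects key types other than the two listed above. The function name ``keyring_key_arg`` is hypothetical.

```shell
#!/bin/sh
# keyring_key_arg <key_size> <key_type> <key_description>
# Prints the <key> table argument for a key held in the kernel keyring,
# i.e. ":" followed by <key_size>:<key_type>:<key_description>.
keyring_key_arg() {
    case "$2" in
        logon|user) ;;
        *) echo "key type must be 'logon' or 'user'" >&2; return 1 ;;
    esac
    echo ":$1:$2:$3"
}

keyring_key_arg 32 logon my_prefix:my_key   # prints :32:logon:my_prefix:my_key
```

The description itself may contain a colon, as in ``my_prefix:my_key``; only the first two colons after the leading one separate fields.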
75
76<keycount>
77 Multi-key compatibility mode. You can define <keycount> keys and
78 then sectors are encrypted according to their offsets (sector 0 uses key0;
79 sector 1 uses key1 etc.). <keycount> must be a power of two.
80
81<iv_offset>
82 The IV offset is a sector count that is added to the sector number
83 before creating the IV.
84
85<device path>
86 This is the device that is going to be used as backend and contains the
87 encrypted data. You can specify it as a path like /dev/xxx or a device
88 number <major>:<minor>.
89
90<offset>
91 Starting sector within the device where the encrypted data begins.
92
93<#opt_params>
94 Number of optional parameters. If there are no optional parameters,
95 the optional parameters section can be skipped or #opt_params can be zero.
96 Otherwise #opt_params is the number of following arguments.
97
98 Example of optional parameters section:
99 3 allow_discards same_cpu_crypt submit_from_crypt_cpus
100
101allow_discards
102 Block discard requests (a.k.a. TRIM) are passed through the crypt device.
103 The default is to ignore discard requests.
104
105 WARNING: Assess the specific security risks carefully before enabling this
106 option. For example, allowing discards on encrypted devices may lead to
107 the leak of information about the ciphertext device (filesystem type,
108 used space etc.) if the discarded blocks can be located easily on the
109 device later.
110
111same_cpu_crypt
112 Perform encryption using the same cpu that IO was submitted on.
113 The default is to use an unbound workqueue so that encryption work
114 is automatically balanced between available CPUs.
115
116submit_from_crypt_cpus
117 Disable offloading writes to a separate thread after encryption.
118 There are some situations where offloading write bios from the
119 encryption threads to a single thread degrades performance
120 significantly. The default is to offload write bios to the same
121 thread because it benefits CFQ to have writes submitted using the
122 same context.
123
124integrity:<bytes>:<type>
125 The device requires additional <bytes> of metadata per sector, stored
126 in the per-bio integrity structure. This metadata must be provided
127 by the underlying dm-integrity target.
128
129 The <type> can be "none" if metadata is used only for persistent IV.
130
131 For Authenticated Encryption with Additional Data (AEAD)
132 the <type> is "aead". An AEAD mode additionally calculates and verifies
133 integrity for the encrypted device. The additional space is then
134 used for storing authentication tag (and persistent IV if needed).
135
136sector_size:<bytes>
137 Use <bytes> as the encryption unit instead of 512-byte sectors.
138 This option can be in the range 512 - 4096 bytes and must be a power of two.
139 The virtual device will announce this size as its minimal I/O and logical sector size.
140
141iv_large_sectors
142 IV generators will use the sector number counted in <sector_size> units
143 instead of the default 512-byte sectors.
144
145 For example, if <sector_size> is 4096 bytes, the plain64 IV for the second
146 sector will be 8 without the flag and 1 if iv_large_sectors is present.
147 The <iv_offset> must be a multiple of <sector_size> (in 512-byte units)
148 if this flag is specified.
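
The arithmetic behind that example can be sketched as follows. This is only an illustration of the numbering described above, assuming an <iv_offset> of 0; the function name ``plain64_iv`` is hypothetical and the code does not reproduce the kernel implementation.

```shell
#!/bin/sh
# plain64_iv <unit_index> <sector_size> <large>
#   unit_index:  zero-based index of the encryption unit on the device
#   sector_size: value given to sector_size:<bytes>
#   large:       1 if iv_large_sectors is set, 0 otherwise
# iv_offset is assumed to be 0 in this sketch.
plain64_iv() {
    unit=$1 size=$2 large=$3
    if [ "$large" -eq 1 ]; then
        # Units are counted in <sector_size> steps.
        echo "$unit"
    else
        # Units are counted in 512-byte steps.
        echo $((unit * size / 512))
    fi
}

plain64_iv 1 4096 0   # second 4096-byte sector, flag absent: prints 8
plain64_iv 1 4096 1   # second 4096-byte sector, flag present: prints 1
```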
149
150Example scripts
151===============
152LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
153encryption with dm-crypt using the 'cryptsetup' utility, see
154https://gitlab.com/cryptsetup/cryptsetup
155
156::
157
158 #!/bin/sh
159 # Create a crypt device using dmsetup
160 dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
161
162::
163
164 #!/bin/sh
165 # Create a crypt device using dmsetup when encryption key is stored in keyring service
166 dmsetup create crypt2 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
167
168::
169
170 #!/bin/sh
171 # Create a crypt device using cryptsetup and LUKS header with default cipher
172 cryptsetup luksFormat $1
173 cryptsetup luksOpen $1 crypt1
diff --git a/Documentation/admin-guide/device-mapper/dm-dust.txt b/Documentation/admin-guide/device-mapper/dm-dust.txt
new file mode 100644
index 000000000000..954d402a1f6a
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-dust.txt
@@ -0,0 +1,272 @@
1dm-dust
2=======
3
4This target emulates the behavior of bad sectors at arbitrary
5locations, and the ability to enable the emulation of the failures
6at an arbitrary time.
7
8This target behaves similarly to a linear target. At a given time,
9the user can send a message to the target to start failing read
10requests on specific blocks (to emulate the behavior of a hard disk
11drive with bad sectors).
12
13When the failure behavior is enabled (i.e.: when the output of
14"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
15in the "bad block list" will fail with EIO ("Input/output error").
16
17Writes of blocks in the "bad block list" will result in the following:
18
191. Remove the block from the "bad block list".
202. Successfully complete the write.
21
22This emulates the "remapped sector" behavior of a drive with bad
23sectors.
24
25Normally, a drive that is encountering bad sectors will most likely
26encounter more bad sectors, at an unknown time or location.
27With dm-dust, the user can use the "addbadblock" and "removebadblock"
28messages to add arbitrary bad blocks at new locations, and the
29"enable" and "disable" messages to modulate the state of whether the
30configured "bad blocks" will be treated as bad, or bypassed.
31This allows the pre-writing of test data and metadata prior to
32simulating a "failure" event where bad sectors start to appear.
33
34Table parameters:
35-----------------
36<device_path> <offset> <blksz>
37
38Mandatory parameters:
39 <device_path>: path to the block device.
40 <offset>: offset to data area from start of device_path
41 <blksz>: block size in bytes
42 (minimum 512, maximum 1073741824, must be a power of 2)
43
44Usage instructions:
45-------------------
46
47First, find the size (in 512-byte sectors) of the device to be used:
48
49$ sudo blockdev --getsz /dev/vdb1
5033552384
51
52Create the dm-dust device:
53(For a device with a block size of 512 bytes)
54$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512'
55
56(For a device with a block size of 4096 bytes)
57$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'
58
59Check the status of the read behavior ("bypass" indicates that all I/O
60will be passed through to the underlying device):
61$ sudo dmsetup status dust1
620 33552384 dust 252:17 bypass
63
64$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
65128+0 records in
66128+0 records out
67
68$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
69128+0 records in
70128+0 records out
71
72Adding and removing bad blocks:
73-------------------------------
74
75At any time (i.e.: whether the device has the "bad block" emulation
76enabled or disabled), bad blocks may be added or removed from the
77device via the "addbadblock" and "removebadblock" messages:
78
79$ sudo dmsetup message dust1 0 addbadblock 60
80kernel: device-mapper: dust: badblock added at block 60
81
82$ sudo dmsetup message dust1 0 addbadblock 67
83kernel: device-mapper: dust: badblock added at block 67
84
85$ sudo dmsetup message dust1 0 addbadblock 72
86kernel: device-mapper: dust: badblock added at block 72
87
88These bad blocks will be stored in the "bad block list".
89While the device is in "bypass" mode, reads and writes will succeed:
90
91$ sudo dmsetup status dust1
920 33552384 dust 252:17 bypass
93
94Enabling block read failures:
95-----------------------------
96
97To enable the "fail read on bad block" behavior, send the "enable" message:
98
99$ sudo dmsetup message dust1 0 enable
100kernel: device-mapper: dust: enabling read failures on bad sectors
101
102$ sudo dmsetup status dust1
1030 33552384 dust 252:17 fail_read_on_bad_block
104
105With the device in "fail read on bad block" mode, attempting to read a
106bad block will encounter an "Input/output error":
107
108$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
109dd: error reading '/dev/mapper/dust1': Input/output error
1100+0 records in
1110+0 records out
1120 bytes copied, 0.00040651 s, 0.0 kB/s
113
114...and writing to the bad blocks will remove the blocks from the list,
115therefore emulating the "remap" behavior of hard disk drives:
116
117$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
118128+0 records in
119128+0 records out
120
121kernel: device-mapper: dust: block 60 removed from badblocklist by write
122kernel: device-mapper: dust: block 67 removed from badblocklist by write
123kernel: device-mapper: dust: block 72 removed from badblocklist by write
124kernel: device-mapper: dust: block 87 removed from badblocklist by write
125
126Bad block add/remove error handling:
127------------------------------------
128
129Attempting to add a bad block that already exists in the list will
130result in an "Invalid argument" error, as well as a helpful message:
131
132$ sudo dmsetup message dust1 0 addbadblock 88
133device-mapper: message ioctl on dust1 failed: Invalid argument
134kernel: device-mapper: dust: block 88 already in badblocklist
135
136Attempting to remove a bad block that doesn't exist in the list will
137result in an "Invalid argument" error, as well as a helpful message:
138
139$ sudo dmsetup message dust1 0 removebadblock 87
140device-mapper: message ioctl on dust1 failed: Invalid argument
141kernel: device-mapper: dust: block 87 not found in badblocklist
142
143Counting the number of bad blocks in the bad block list:
144--------------------------------------------------------
145
146To count the number of bad blocks configured in the device, run the
147following message command:
148
149$ sudo dmsetup message dust1 0 countbadblocks
150
151A message will print with the number of bad blocks currently
152configured on the device:
153
154kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
155
156Querying for specific bad blocks:
157---------------------------------
158
159To find out if a specific block is in the bad block list, run the
160following message command:
161
162$ sudo dmsetup message dust1 0 queryblock 72
163
164The following message will print if the block is in the list:
165device-mapper: dust: queryblock: block 72 found in badblocklist
166
167The following message will print if the block is not in the list:
168device-mapper: dust: queryblock: block 72 not found in badblocklist
169
170The "queryblock" message command will work in both the "enabled"
171and "disabled" modes, allowing the verification of whether a block
172will be treated as "bad" without having to issue I/O to the device,
173or having to "enable" the bad block emulation.
174
175Clearing the bad block list:
176----------------------------
177
178To clear the bad block list (without needing to individually run
179a "removebadblock" message command for every block), run the
180following message command:
181
182$ sudo dmsetup message dust1 0 clearbadblocks
183
184After clearing the bad block list, the following message will appear:
185
186kernel: device-mapper: dust: clearbadblocks: badblocks cleared
187
188If there were no bad blocks to clear, the following message will
189appear:
190
191kernel: device-mapper: dust: clearbadblocks: no badblocks found
192
193Message commands list:
194----------------------
195
196Below is a list of the messages that can be sent to a dust device:
197
198Operations on blocks (requires a <blknum> argument):
199
200addbadblock <blknum>
201queryblock <blknum>
202removebadblock <blknum>
203
204...where <blknum> is a block number within range of the device
205 (corresponding to the block size of the device.)
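
Since <blknum> is counted in units of <blksz> rather than in 512-byte sectors, a sector number taken from another tool has to be converted before it is passed to "addbadblock" or "queryblock". A minimal sketch of that conversion (the function name is hypothetical):

```shell
#!/bin/sh
# sector_to_dust_block <sector> <blksz>
# Convert a 512-byte sector number into the dust block number that
# contains it, for a dust device created with block size <blksz>.
sector_to_dust_block() {
    sector=$1 blksz=$2
    echo $((sector * 512 / blksz))
}

sector_to_dust_block 67 512     # prints 67 (blocks and sectors coincide)
sector_to_dust_block 536 4096   # prints 67 (sectors 536..543 form block 67)
```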
206
207Single argument message commands:
208
209countbadblocks
210clearbadblocks
211disable
212enable
213quiet
214
215Device removal:
216---------------
217
218When finished, remove the device via the "dmsetup remove" command:
219
220$ sudo dmsetup remove dust1
221
222Quiet mode:
223-----------
224
225On test runs with many bad blocks, it may be desirable to avoid
226excessive logging (from bad blocks added, removed, or "remapped").
227This can be done by enabling "quiet mode" via the following message:
228
229$ sudo dmsetup message dust1 0 quiet
230
231This will suppress log messages from add / remove / removed by write
232operations. Log messages from "countbadblocks" or "queryblock"
233message commands will still print in quiet mode.
234
235The status of quiet mode can be seen by running "dmsetup status":
236
237$ sudo dmsetup status dust1
2380 33552384 dust 252:17 fail_read_on_bad_block quiet
239
240To disable quiet mode, send the "quiet" message again:
241
242$ sudo dmsetup message dust1 0 quiet
243
244$ sudo dmsetup status dust1
2450 33552384 dust 252:17 fail_read_on_bad_block verbose
246
247(The presence of "verbose" indicates normal logging.)
248
249"Why not...?"
250-------------
251
252scsi_debug has a "medium error" mode that can fail reads on one
253specified sector (sector 0x1234, hardcoded in the source code), but
254it uses RAM for the persistent storage, which drastically decreases
255the potential device size.
256
257dm-flakey fails all I/O from all block locations at a specified time
258frequency, and not a given point in time.
259
260When a bad sector occurs on a hard disk drive, reads to that sector
261are failed by the device, usually resulting in an error code of EIO
262("I/O error") or ENODATA ("No data available"). However, a write to
263the sector may succeed, and result in the sector becoming readable
264after the device controller no longer experiences errors reading the
265sector (or after a reallocation of the sector). However, there may
266be bad sectors that occur on the device in the future, in a different,
267unpredictable location.
268
269This target seeks to provide a device that can exhibit the behavior
270of a bad sector at a known sector location, at a known time, based
271on a large storage device (at least tens of gigabytes, not occupying
272system memory).
diff --git a/Documentation/admin-guide/device-mapper/dm-flakey.rst b/Documentation/admin-guide/device-mapper/dm-flakey.rst
new file mode 100644
index 000000000000..86138735879d
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-flakey.rst
@@ -0,0 +1,74 @@
1=========
2dm-flakey
3=========
4
5This target is the same as the linear target except that it exhibits
6unreliable behaviour periodically. It's been found useful in simulating
7failing devices for testing purposes.
8
9Starting from the time the table is loaded, the device is available for
10<up interval> seconds, then exhibits unreliable behaviour for <down
11interval> seconds, and then this cycle repeats.
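
The cycle can be pictured with a little modular arithmetic. The sketch below is an illustration only: it assumes time is measured in whole seconds since the table was loaded, and the function name ``flakey_phase`` is hypothetical.

```shell
#!/bin/sh
# flakey_phase <seconds_since_load> <up_interval> <down_interval>
# Prints "up" while the device behaves like a linear target and "down"
# while it exhibits unreliable behaviour.
flakey_phase() {
    t=$1 up=$2 down=$3
    if [ $((t % (up + down))) -lt "$up" ]; then
        echo up
    else
        echo down
    fi
}

flakey_phase 59 60 5   # prints up
flakey_phase 62 60 5   # prints down
flakey_phase 65 60 5   # prints up (the cycle has restarted)
```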
12
13Consider using this in combination with the dm-delay target,
14which can delay reads and writes and/or send them to different
15underlying devices.
16
17Table parameters
18----------------
19
20::
21
22 <dev path> <offset> <up interval> <down interval> \
23 [<num_features> [<feature arguments>]]
24
25Mandatory parameters:
26
27 <dev path>:
28 Full pathname to the underlying block-device, or a
29 "major:minor" device-number.
30 <offset>:
31 Starting sector within the device.
32 <up interval>:
33 Number of seconds device is available.
34 <down interval>:
35 Number of seconds device returns errors.
36
37Optional feature parameters:
38
39 If no feature parameters are present, during the periods of
40 unreliability, all I/O returns errors.
41
42 drop_writes:
43 All write I/O is silently ignored.
44 Read I/O is handled correctly.
45
46 error_writes:
47 All write I/O is failed with an error signalled.
48 Read I/O is handled correctly.
49
50 corrupt_bio_byte <Nth_byte> <direction> <value> <flags>:
51 During <down interval>, replace <Nth_byte> of the data of
52 each matching bio with <value>.
53
54 <Nth_byte>:
55 The offset of the byte to replace.
56 Counting starts at 1, to replace the first byte.
57 <direction>:
58 Either 'r' to corrupt reads or 'w' to corrupt writes.
59 'w' is incompatible with drop_writes.
60 <value>:
61 The value (from 0-255) to write.
62 <flags>:
63 Perform the replacement only if bio->bi_opf has all the
64 selected flags set.
65
66Examples:
67
68Replaces the 32nd byte of READ bios with the value 1::
69
70 corrupt_bio_byte 32 r 1 0
71
72Replaces the 224th byte of REQ_META (=32) bios with the value 0::
73
74 corrupt_bio_byte 224 w 0 32
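
To show where such a feature string sits in a complete table line, the sketch below prints one for a device with the given up and down intervals. It assumes that <num_features> counts every word that follows it, so corrupt_bio_byte with its four arguments gives 5; the function name, device path and sizes are illustrative.

```shell
#!/bin/sh
# flakey_corrupt_table <sectors> <dev> <up> <down> \
#                      <Nth_byte> <direction> <value> <flags>
# Prints a dm-flakey table line with a single corrupt_bio_byte feature.
flakey_corrupt_table() {
    # Assumption: 5 feature words follow (the name plus four arguments).
    echo "0 $1 flakey $2 0 $3 $4 5 corrupt_bio_byte $5 $6 $7 $8"
}

# 60 s up, 5 s down; corrupt the 32nd byte of READ bios with the value 1.
flakey_corrupt_table 409600 /dev/sdb1 60 5 32 r 1 0
```

The printed line could be passed to ``dmsetup create <name> --table``.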
diff --git a/Documentation/admin-guide/device-mapper/dm-init.rst b/Documentation/admin-guide/device-mapper/dm-init.rst
new file mode 100644
index 000000000000..e5242ff17e9b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-init.rst
@@ -0,0 +1,125 @@
1================================
2Early creation of mapped devices
3================================
4
5It is possible to configure a device-mapper device to act as the root device for
6your system in two ways.
7
8The first is to build an initial ramdisk which boots to a minimal userspace
9which configures the device, then pivot_root(8) into it.
10
11The second is to create one or more device-mappers using the module parameter
12"dm-mod.create=" through the kernel boot command line argument.
13
14The format is specified as a string of data separated by commas and optionally
15semi-colons, where:
16
17 - a comma is used to separate fields like name, uuid, flags and table
18 (specifies one device)
19 - a semi-colon is used to separate devices.
20
21So the format will look like this::
22
23 dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
24
25Where::
26
27 <name> ::= The device name.
28 <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
29 <minor> ::= The device minor number | ""
30 <flags> ::= "ro" | "rw"
31 <table> ::= <start_sector> <num_sectors> <target_type> <target_args>
32 <target_type> ::= "verity" | "linear" | ... (see list below)
33
34The dm line should be equivalent to the one used by the dmsetup tool with the
35`--concise` argument.
36
37Target types
38============
39
40Not all target types are available as there are serious risks in allowing
41activation of certain DM targets without first using userspace tools to check
42the validity of associated metadata.
43
44======================= =======================================================
45`cache`                 constrained, userspace should verify cache device
46`crypt`                 allowed
47`delay`                 allowed
48`era`                   constrained, userspace should verify metadata device
49`flakey`                constrained, meant for test
50`linear`                allowed
51`log-writes`            constrained, userspace should verify metadata device
52`mirror`                constrained, userspace should verify main/mirror device
53`raid`                  constrained, userspace should verify metadata device
54`snapshot`              constrained, userspace should verify src/dst device
55`snapshot-origin`       allowed
56`snapshot-merge`        constrained, userspace should verify src/dst device
57`striped`               allowed
58`switch`                constrained, userspace should verify dev path
59`thin`                  constrained, requires dm target message from userspace
60`thin-pool`             constrained, requires dm target message from userspace
61`verity`                allowed
62`writecache`            constrained, userspace should verify cache device
63`zero`                  constrained, not meant for rootfs
64======================= =======================================================
65
66If the target is not listed above, it is constrained by default (not tested).
67
68Examples
69========
70An example of booting to a linear array made up of user-mode linux block
71devices::
72
73 dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
74
75This will boot to a rw dm-linear target of 8192 sectors split across two block
76devices identified by their major:minor numbers. After boot, udev will rename
77this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
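
The comma-separated layout is easy to get wrong by hand, so a boot-time string can be assembled with a small helper. The sketch below is an illustration, and the function name ``dm_create_entry`` is hypothetical. It joins the four header fields and any number of table lines for one device; entries for several devices would then be joined with semi-colons.

```shell
#!/bin/sh
# dm_create_entry <name> <uuid> <minor> <flags> <table>...
# Prints one device entry for dm-mod.create=. Empty <uuid> and <minor>
# fields are passed as "".
dm_create_entry() {
    entry="$1,$2,$3,$4"
    shift 4
    for table in "$@"; do
        entry="$entry,$table"
    done
    echo "$entry"
}

dm_create_entry lroot "" "" rw \
    "0 4096 linear 98:16 0" \
    "4096 4096 linear 98:32 0"
# prints lroot,,,rw,0 4096 linear 98:16 0,4096 4096 linear 98:32 0
```

Apart from the optional spaces after the commas, the output matches the linear example above.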
78
79An example of multiple device-mappers, with the dm-mod.create="..." contents
80is shown here split on multiple lines for readability::
81
82 dm-linear,,1,rw,
83 0 32768 linear 8:1 0,
84 32768 1024000 linear 8:2 0;
85 dm-verity,,3,ro,
86 0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256
87 ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42
88 5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51
89
90Other examples (per target):
91
92"crypt"::
93
94 dm-crypt,,8,ro,
95 0 1048576 crypt aes-xts-plain64
96 babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
97 /dev/sda 0 1 allow_discards
98
99"delay"::
100
101 dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
102
103"linear"::
104
105 dm-linear,,,rw,
106 0 32768 linear /dev/sda1 0,
107 32768 1024000 linear /dev/sda2 0,
108 1056768 204800 linear /dev/sda3 0,
109 1261568 512000 linear /dev/sda4 0
110
111"snapshot-origin"::
112
113 dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
114
115"striped"::
116
117 dm-striped,,4,ro,0 1638400 striped 4 4096
118 /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
119
120"verity"::
121
122 dm-verity,,4,ro,
123 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
124 fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
125 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst
new file mode 100644
index 000000000000..a30aa91b5fbe
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst
@@ -0,0 +1,259 @@
1============
2dm-integrity
3============
4
5The dm-integrity target emulates a block device that has additional
6per-sector tags that can be used for storing integrity information.
7
8A general problem with storing integrity tags with every sector is that
9writing the sector and the integrity tag must be atomic - i.e. in case of
10crash, either both sector and integrity tag or none of them is written.
11
12To guarantee write atomicity, the dm-integrity target uses a journal. It
13writes sector data and integrity tags into the journal, commits the journal
14and then copies the data and integrity tags to their respective locations.
15
16The dm-integrity target can be used with the dm-crypt target - in this
17situation the dm-crypt target creates the integrity data and passes them
18to the dm-integrity target via bio_integrity_payload attached to the bio.
19In this mode, the dm-crypt and dm-integrity targets provide authenticated
20disk encryption - if the attacker modifies the encrypted device, an I/O
21error is returned instead of random data.
22
23The dm-integrity target can also be used as a standalone target. In this
24mode it calculates and verifies the integrity tag internally, so the
25dm-integrity target can be used to detect silent data
26corruption on the disk or in the I/O path.
27
28There's an alternate mode of operation where dm-integrity uses a bitmap
29instead of a journal. If a bit in the bitmap is 1, the corresponding
30region's data and integrity tags are not synchronized - if the machine
31crashes, the unsynchronized regions will be recalculated. The bitmap mode
32is faster than the journal mode, because we don't have to write the data
33twice, but it is also less reliable, because if data corruption happens
34when the machine crashes, it may not be detected.
35
36When loading the target for the first time, the kernel driver will format
37the device. But it will only format the device if the superblock contains
38zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
39target can't be loaded.
40
41To use the target for the first time:
42
431. overwrite the superblock with zeroes
442. load the dm-integrity target with one-sector size, the kernel driver
45 will format the device
463. unload the dm-integrity target
474. read the "provided_data_sectors" value from the superblock
485. load the dm-integrity target with the target size
49 "provided_data_sectors"
506. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
51 with the size "provided_data_sectors"
52
53
54Target arguments:
55
561. the underlying block device
57
582. the number of reserved sectors at the beginning of the device - the
59 dm-integrity target won't read or write these sectors
60
613. the size of the integrity tag (if "-" is used, the size is taken from
62 the internal-hash algorithm)
63
644. mode:
65
66 D - direct writes (without journal)
67 in this mode, journaling is
68 not used and data sectors and integrity tags are written
69 separately. In case of crash, it is possible that the data
70 and integrity tag don't match.
71 J - journaled writes
72 data and integrity tags are written to the
73 journal and atomicity is guaranteed. In case of crash,
74 either both data and tag or none of them are written. The
75 journaled mode halves write throughput because the
76 data have to be written twice.
77 B - bitmap mode - data and metadata are written without any
78 synchronization, the driver maintains a bitmap of dirty
79 regions where data and metadata don't match. This mode can
80 only be used with internal hash.
81 R - recovery mode - in this mode, journal is not replayed,
82 checksums are not checked and writes to the device are not
83 allowed. This mode is useful for data recovery if the
84 device cannot be activated in any of the other standard
85 modes.
86
875. the number of additional arguments
88
89Additional arguments:
90
91journal_sectors:number
92 The size of the journal. This argument is used only when formatting the
93 device. If the device is already formatted, the value from the
94 superblock is used.
95
96interleave_sectors:number
97 The number of interleaved sectors. This value is rounded down to
98 a power of two. If the device is already formatted, the value from
99 the superblock is used.
100
101meta_device:device
102 Don't interleave the data and metadata on one device. Use a
103 separate device for metadata.
104
105buffer_sectors:number
106 The number of sectors in one buffer. The value is rounded down to
107 a power of two.
108
109 The tag area is accessed using buffers; the buffer size is
110 configurable. A larger buffer size means that the I/O size will
111 be larger, but fewer I/Os may be issued.
112
113journal_watermark:number
114 The journal watermark in percent. When the size of the journal
115 exceeds this watermark, the thread that flushes the journal will
116 be started.
117
118commit_time:number
119 Commit time in milliseconds. When this time passes, the journal is
120 written. The journal is also written immediately if a FLUSH
121 request is received.
122
123internal_hash:algorithm(:key) (the key is optional)
124 Use internal hash or crc.
125 When this argument is used, the dm-integrity target won't accept
126 integrity tags from the upper target, but it will automatically
127 generate and verify the integrity tags.
128
129 You can use a crc algorithm (such as crc32); the integrity target
130 will then protect the data against accidental corruption.
131 You can also use a hmac algorithm (for example
132 "hmac(sha256):0123456789abcdef"); in this mode it will provide
133 cryptographic authentication of the data without encryption.
134
135 When this argument is not used, the integrity tags are accepted
136 from an upper layer target, such as dm-crypt. The upper layer
137 target should check the validity of the integrity tags.
138
139recalculate
140 Recalculate the integrity tags automatically. It is only valid
141 when using internal hash.
142
143journal_crypt:algorithm(:key) (the key is optional)
144 Encrypt the journal using the given algorithm to make sure that an
145 attacker can't read the journal. You can use a block cipher here
146 (such as "cbc(aes)") or a stream cipher (for example "chacha20",
147 "salsa20", "ctr(aes)" or "ecb(arc4)").
148
149 The journal contains a history of the last writes to the block device;
150 an attacker reading the journal could see the last sector numbers
151 that were written. From the sector numbers, the attacker can infer
152 the size of files that were written. To protect against this
153 situation, you can encrypt the journal.
154
155journal_mac:algorithm(:key) (the key is optional)
156 Protect sector numbers in the journal from accidental or malicious
157 modification. To protect against accidental modification, use a
158 crc algorithm; to protect against malicious modification, use a
159 hmac algorithm with a key.
160
161 This option is not needed when using internal_hash because in this
162 mode, the integrity of journal entries is checked when replaying
163 the journal. Thus, a modified sector number would be detected at
164 this stage.
165
166block_size:number
167 The size of a data block in bytes. The larger the block size the
168 less overhead there is for per-block integrity metadata.
169 Supported values are 512, 1024, 2048 and 4096 bytes. If not
170 specified the default block size is 512 bytes.
171
172sectors_per_bit:number
173 In the bitmap mode, this parameter specifies the number of
174 512-byte sectors that corresponds to one bitmap bit.
175
176bitmap_flush_interval:number
177 The bitmap flush interval in milliseconds. The metadata buffers
178 are synchronized when this interval expires.
179
180
181The journal mode (D/J), buffer_sectors, journal_watermark and commit_time can
182be changed when reloading the target (load an inactive table and swap the
183tables with suspend and resume). The other arguments should not be changed
184when reloading the target because the layout of disk data depends on them
185and the reloaded target would be non-functional.
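
As an illustration of what "rounded down to a power of two" means for
interleave_sectors and buffer_sectors, here is a minimal Python sketch. The
helper name ``round_down_pow2`` is hypothetical and this is not the kernel's
code; it only shows the arithmetic.

```python
def round_down_pow2(n):
    """Return the largest power of two that is <= n (n must be positive)."""
    if n <= 0:
        raise ValueError("value must be positive")
    # bit_length() gives the position of the highest set bit, counted from 1.
    return 1 << (n.bit_length() - 1)


if __name__ == "__main__":
    for requested in (1, 100, 32768, 40000):
        print(requested, "->", round_down_pow2(requested))
```

A value that is already a power of two is left unchanged; any other value
drops to the next lower power of two.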
186
187
188The layout of the formatted block device:
189
190* reserved sectors
191 (they are not used by this target, they can be used for
192 storing LUKS metadata or for other purpose), the size of the reserved
193 area is specified in the target arguments
194
195* superblock (4kiB)
196 * magic string - identifies that the device was formatted
197 * version
198 * log2(interleave sectors)
199 * integrity tag size
200 * the number of journal sections
201 * provided data sectors - the number of sectors that this target
202 provides (i.e. the size of the device minus the size of all
203 metadata and padding). The user of this target should not send
204 bios that access data beyond the "provided data sectors" limit.
205 * flags
206 SB_FLAG_HAVE_JOURNAL_MAC
207 - a flag is set if journal_mac is used
208 SB_FLAG_RECALCULATING
209 - recalculating is in progress
210 SB_FLAG_DIRTY_BITMAP
211 - journal area contains the bitmap of dirty
212 blocks
213 * log2(sectors per block)
214 * a position where recalculating finished
215* journal
216 The journal is divided into sections, each section contains:
217
218 * metadata area (4kiB), it contains journal entries
219
220 - every journal entry contains:
221
222 * logical sector (specifies where the data and tag should
223 be written)
224 * last 8 bytes of data
225 * integrity tag (the size is specified in the superblock)
226
227 - every metadata sector ends with
228
229 * mac (8-bytes), all the macs in 8 metadata sectors form a
230 64-byte value. It is used to store hmac of sector
231 numbers in the journal section, to protect against a
232 possibility that the attacker tampers with sector
233 numbers in the journal.
234 * commit id
235
236 * data area (the size is variable; it depends on how many journal
237 entries fit into the metadata area)
238
239 - every sector in the data area contains:
240
241 * data (504 bytes of data, the last 8 bytes are stored in
242 the journal entry)
243 * commit id
244
245 To test if the whole journal section was written correctly, every
246 512-byte sector of the journal ends with 8-byte commit id. If the
247 commit id matches on all sectors in a journal section, then it is
248 assumed that the section was written correctly. If the commit id
249 doesn't match, the section was written partially and it should not
250 be replayed.
251
252* one or more runs of interleaved tags and data.
253 Each run contains:
254
255 * tag area - it contains integrity tags. There is one tag for each
256 sector in the data area
257 * data area - it contains data sectors. The number of data sectors
258 in one run must be a power of two. log2 of this value is stored
259 in the superblock.
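
The commit id rule for journal sections described above can be sketched in
Python as follows. This is a simplified illustration, not the driver's
implementation: the function name is hypothetical, and the sketch only checks
that the trailing 8 bytes agree across all sectors of one section. It does not
model how the expected commit id is derived.

```python
SECTOR_SIZE = 512      # every journal sector is 512 bytes
COMMIT_ID_SIZE = 8     # the last 8 bytes of each sector hold the commit id


def section_written_completely(section):
    """Return True if all sectors of a journal section end with the same commit id."""
    if not section or len(section) % SECTOR_SIZE != 0:
        raise ValueError("section must be a non-empty multiple of 512 bytes")
    commit_ids = set()
    for offset in range(0, len(section), SECTOR_SIZE):
        sector = section[offset:offset + SECTOR_SIZE]
        commit_ids.add(sector[-COMMIT_ID_SIZE:])
    # One distinct id means the whole section was written in one commit.
    return len(commit_ids) == 1


def make_sector(commit_id):
    """Build a dummy sector: 504 payload bytes followed by the 8-byte commit id."""
    return bytes(SECTOR_SIZE - COMMIT_ID_SIZE) + commit_id


if __name__ == "__main__":
    good = make_sector(b"AAAAAAAA") * 3
    torn = make_sector(b"AAAAAAAA") * 2 + make_sector(b"BBBBBBBB")
    print(section_written_completely(good))
    print(section_written_completely(torn))
```

A section that fails this check was only partially written and would be
skipped during replay.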
diff --git a/Documentation/admin-guide/device-mapper/dm-io.rst b/Documentation/admin-guide/device-mapper/dm-io.rst
new file mode 100644
index 000000000000..d2492917a1f5
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-io.rst
@@ -0,0 +1,75 @@
1=====
2dm-io
3=====
4
5Dm-io provides synchronous and asynchronous I/O services. There are three
6types of I/O services available, and each type has a sync and an async
7version.
8
9The user must set up an io_region structure to describe the desired location
10of the I/O. Each io_region indicates a block-device along with the starting
11sector and size of the region::
12
13 struct io_region {
14 struct block_device *bdev;
15 sector_t sector;
16 sector_t count;
17 };
18
19Dm-io can read from one io_region or write to one or more io_regions. Writes
20to multiple regions are specified by an array of io_region structures.
21
22The first I/O service type takes a list of memory pages as the data buffer for
23the I/O, along with an offset into the first page::
24
25 struct page_list {
26 struct page_list *next;
27 struct page *page;
28 };
29
30 int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw,
31 struct page_list *pl, unsigned int offset,
32 unsigned long *error_bits);
33 int dm_io_async(unsigned int num_regions, struct io_region *where, int rw,
34 struct page_list *pl, unsigned int offset,
35 io_notify_fn fn, void *context);
36
37The second I/O service type takes an array of bio vectors as the data buffer
38for the I/O. This service can be handy if the caller has a pre-assembled bio,
39but wants to direct different portions of the bio to different devices::
40
41 int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
42 int rw, struct bio_vec *bvec,
43 unsigned long *error_bits);
44 int dm_io_async_bvec(unsigned int num_regions, struct io_region *where,
45 int rw, struct bio_vec *bvec,
46 io_notify_fn fn, void *context);
47
48The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
49data buffer for the I/O. This service can be handy if the caller needs to do
50I/O to a large region but doesn't want to allocate a large number of individual
51memory pages::
52
53 int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
54 void *data, unsigned long *error_bits);
55 int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw,
56 void *data, io_notify_fn fn, void *context);
57
58Callers of the asynchronous I/O services must include the name of a completion
59callback routine and a pointer to some context data for the I/O::
60
61 typedef void (*io_notify_fn)(unsigned long error, void *context);
62
63The "error" parameter in this callback, as well as the `*error_bits` parameter
64in all of the synchronous versions, is a bitset (instead of a simple error value).
65In the case of a write I/O to multiple regions, this bitset allows dm-io to
66indicate success or failure on each individual region.
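
To make the bitset idea concrete, the following Python sketch decodes such a
value into a list of failed regions. It assumes that bit i corresponds to
entry i of the io_region array; the text above does not spell out the bit
order, so treat that mapping and the helper name as assumptions.

```python
def failed_regions(error_bits, num_regions):
    """Return the indices of regions whose bit is set in error_bits.

    Assumption: bit i of error_bits reports the result for region i.
    """
    return [i for i in range(num_regions) if error_bits & (1 << i)]


if __name__ == "__main__":
    # Write to three regions; regions 0 and 2 reported an error.
    print(failed_regions(0b101, 3))
    # No bit set: every region succeeded.
    print(failed_regions(0, 3))
```

A caller that only cares about overall success can simply test whether the
value is zero.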
67
68Before using any of the dm-io services, the user should call dm_io_get()
69and specify the number of pages they expect to perform I/O on concurrently.
70Dm-io will attempt to resize its mempool to make sure enough pages are
71always available in order to avoid unnecessary waiting while performing I/O.
72
73When the user is finished using the dm-io services, they should call
74dm_io_put() and specify the same number of pages that were given on the
75dm_io_get() call.
diff --git a/Documentation/admin-guide/device-mapper/dm-log.rst b/Documentation/admin-guide/device-mapper/dm-log.rst
new file mode 100644
index 000000000000..ba4fce39bc27
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-log.rst
@@ -0,0 +1,57 @@
1=====================
2Device-Mapper Logging
3=====================
4The device-mapper logging code is used by some of the device-mapper
5RAID targets to track regions of the disk that are not consistent.
6A region (or portion of the address space) of the disk may be
7inconsistent because a RAID stripe is currently being operated on or
8a machine died while the region was being altered. In the case of
9mirrors, a region would be considered dirty/inconsistent while you
10are writing to it because the writes need to be replicated for all
11the legs of the mirror and may not reach the legs at the same time.
12Once all writes are complete, the region is considered clean again.
13
14There is a generic logging interface that the device-mapper RAID
15implementations use to perform logging operations (see
16dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
17logging implementations are available and provide different
18capabilities. The list includes:
19
20============== ==============================================================
21Type Files
22============== ==============================================================
23disk drivers/md/dm-log.c
24core drivers/md/dm-log.c
25userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
26============== ==============================================================
27
28The "disk" log type
29-------------------
30This log implementation commits the log state to disk. This way, the
31logging state survives reboots/crashes.
32
33The "core" log type
34-------------------
35This log implementation keeps the log state in memory. The log state
36will not survive a reboot or crash, but there may be a small boost in
37performance. This method can also be used if no storage device is
38available for storing log state.
39
40The "userspace" log type
41------------------------
42This log type simply provides a way to export the log API to userspace,
43so log implementations can be done there. This is done by forwarding most
44logging requests to userspace, where a daemon receives and processes the
45request.
46
47The structures used for communication between kernel and userspace are
48located in include/linux/dm-log-userspace.h. Due to the frequency,
49diversity, and 2-way communication nature of the exchanges between
50kernel and userspace, 'connector' is used as the interface for
51communication.
52
53There are currently two userspace log implementations that leverage this
54framework - "clustered-disk" and "clustered-core". These implementations
55provide a cluster-coherent log for shared-storage. Device-mapper mirroring
56can be used in a shared-storage environment when the cluster log implementations
57are employed.
diff --git a/Documentation/admin-guide/device-mapper/dm-queue-length.rst b/Documentation/admin-guide/device-mapper/dm-queue-length.rst
new file mode 100644
index 000000000000..d8e381c1cb02
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-queue-length.rst
@@ -0,0 +1,48 @@
1===============
2dm-queue-length
3===============
4
5dm-queue-length is a path selector module for device-mapper targets,
6which selects a path with the least number of in-flight I/Os.
7The path selector name is 'queue-length'.
8
9Table parameters for each path: [<repeat_count>]
10
11::
12
13 <repeat_count>: The number of I/Os to dispatch using the selected
14 path before switching to the next path.
15 If not given, internal default is used. To check
16 the default value, see the activated table.
17
18Status for each path: <status> <fail-count> <in-flight>
19
20::
21
22 <status>: 'A' if the path is active, 'F' if the path is failed.
23 <fail-count>: The number of path failures.
24 <in-flight>: The number of in-flight I/Os on the path.
25
26
27Algorithm
28=========
29
30dm-queue-length increments/decrements 'in-flight' when an I/O is
31dispatched/completed respectively.
32dm-queue-length selects a path with the minimum 'in-flight'.
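
The selection rule can be sketched in a few lines of Python. This is an
illustration only: the ``Path`` class and function names are hypothetical, and
the sketch ignores repeat_count and failed paths.

```python
class Path:
    """A path with a counter of I/Os that were dispatched but not yet completed."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0


def select_path(paths):
    """Pick the path with the fewest in-flight I/Os (first one wins a tie)."""
    return min(paths, key=lambda p: p.in_flight)


def dispatch(paths):
    """Select a path for a new I/O and count the I/O as in flight."""
    path = select_path(paths)
    path.in_flight += 1
    return path


def complete(path):
    """Account for a finished I/O on the given path."""
    path.in_flight -= 1


if __name__ == "__main__":
    sda, sdb = Path("sda"), Path("sdb")
    paths = [sda, sdb]
    print(dispatch(paths).name)   # both idle, first path is chosen
    print(dispatch(paths).name)   # sda is busy, so sdb is chosen
    complete(sda)
    print(dispatch(paths).name)   # sda is idle again
```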
33
34
35Examples
36========
37Example with two paths (sda and sdb) and repeat_count == 128.
38
39::
40
41 # echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
42 | dmsetup create test
43 #
44 # dmsetup table
45 test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
46 #
47 # dmsetup status
48 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst
new file mode 100644
index 000000000000..2fe255b130fb
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-raid.rst
@@ -0,0 +1,419 @@
1=======
2dm-raid
3=======
4
5The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
6It allows the MD RAID drivers to be accessed using a device-mapper
7interface.
8
9
10Mapping Table Interface
11-----------------------
12The target is named "raid" and it accepts the following parameters::
13
14 <raid_type> <#raid_params> <raid_params> \
15 <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
16
17<raid_type>:
18
19 ============= ===============================================================
20 raid0 RAID0 striping (no resilience)
21 raid1 RAID1 mirroring
22 raid4 RAID4 with dedicated last parity disk
23 raid5_n RAID5 with dedicated last parity disk supporting takeover
24 Same as raid4
25
26 - Transitory layout
27 raid5_la RAID5 left asymmetric
28
29 - rotating parity 0 with data continuation
30 raid5_ra RAID5 right asymmetric
31
32 - rotating parity N with data continuation
33 raid5_ls RAID5 left symmetric
34
35 - rotating parity 0 with data restart
36 raid5_rs RAID5 right symmetric
37
38 - rotating parity N with data restart
39 raid6_zr RAID6 zero restart
40
41 - rotating parity zero (left-to-right) with data restart
42 raid6_nr RAID6 N restart
43
44 - rotating parity N (right-to-left) with data restart
45 raid6_nc RAID6 N continue
46
47 - rotating parity N (right-to-left) with data continuation
48 raid6_n_6 RAID6 with dedicated parity disks
49
50 - parity and Q-syndrome on the last 2 disks;
51 layout for takeover from/to raid4/raid5_n
52 raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk
53
54 - layout for takeover from raid5_la from/to raid6
55 raid6_ra_6 Same as "raid5_ra" plus dedicated last Q-syndrome disk
56
57 - layout for takeover from raid5_ra from/to raid6
58 raid6_ls_6 Same as "raid5_ls" plus dedicated last Q-syndrome disk
59
60 - layout for takeover from raid5_ls from/to raid6
61 raid6_rs_6 Same as "raid5_rs" plus dedicated last Q-syndrome disk
62
63 - layout for takeover from raid5_rs from/to raid6
64 raid10 Various RAID10 inspired algorithms chosen by additional params
65 (see raid10_format and raid10_copies below)
66
67 - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
68 - RAID1E: Integrated Adjacent Stripe Mirroring
69 - RAID1E: Integrated Offset Stripe Mirroring
70 - and other similar RAID10 variants
71 ============= ===============================================================
72
73 Reference: Chapter 4 of
74 http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
75
76<#raid_params>: The number of parameters that follow.
77
78<raid_params> consists of
79
80 Mandatory parameters:
81 <chunk_size>:
82 Chunk size in sectors. This parameter is often known as
83 "stripe size". It is the only mandatory parameter and
84 is placed first.
85
86 followed by optional parameters (in any order):
87 [sync|nosync]
88 Force or prevent RAID initialization.
89
90 [rebuild <idx>]
91 Rebuild drive number 'idx' (first drive is 0).
92
93 [daemon_sleep <ms>]
94 Interval between runs of the bitmap daemon that
95 clear bits. A longer interval means less bitmap I/O but
96 resyncing after a failure is likely to take longer.
97
98 [min_recovery_rate <kB/sec/disk>]
99 Throttle RAID initialization
100 [max_recovery_rate <kB/sec/disk>]
101 Throttle RAID initialization
102 [write_mostly <idx>]
103 Mark drive index 'idx' write-mostly.
104 [max_write_behind <sectors>]
105 See '--write-behind=' (man mdadm)
106 [stripe_cache <sectors>]
107 Stripe cache size (RAID 4/5/6 only)
108 [region_size <sectors>]
109 The region_size multiplied by the number of regions is the
110 logical size of the array. The bitmap records the device
111 synchronisation state for each region.
112
113 [raid10_copies <# copies>], [raid10_format <near|far|offset>]
114 These two options are used to alter the default layout of
115 a RAID10 configuration. The number of copies can be
116 specified, but the default is 2. There are also three
117 variations to how the copies are laid down - the default
118 is "near". Near copies are what most people think of with
119 respect to mirroring. If these options are left unspecified,
120 or 'raid10_copies 2' and/or 'raid10_format near' are given,
121 then the layouts for 2, 3 and 4 devices are:
122
123 ======== ========== ==============
124 2 drives 3 drives 4 drives
125 ======== ========== ==============
126 A1 A1 A1 A1 A2 A1 A1 A2 A2
127 A2 A2 A2 A3 A3 A3 A3 A4 A4
128 A3 A3 A4 A4 A5 A5 A5 A6 A6
129 A4 A4 A5 A6 A6 A7 A7 A8 A8
130 .. .. .. .. .. .. .. .. ..
131 ======== ========== ==============
132
133 The 2-device layout is equivalent to 2-way RAID1. The 4-device
134 layout is what a traditional RAID10 would look like. The
135 3-device layout is what might be called a 'RAID1E - Integrated
136 Adjacent Stripe Mirroring'.
137
138 If 'raid10_copies 2' and 'raid10_format far', then the layouts
139 for 2, 3 and 4 devices are:
140
141 ======== ============ ===================
142 2 drives 3 drives 4 drives
143 ======== ============ ===================
144 A1 A2 A1 A2 A3 A1 A2 A3 A4
145 A3 A4 A4 A5 A6 A5 A6 A7 A8
146 A5 A6 A7 A8 A9 A9 A10 A11 A12
147 .. .. .. .. .. .. .. .. ..
148 A2 A1 A3 A1 A2 A2 A1 A4 A3
149 A4 A3 A6 A4 A5 A6 A5 A8 A7
150 A6 A5 A9 A7 A8 A10 A9 A12 A11
151 .. .. .. .. .. .. .. .. ..
152 ======== ============ ===================
153
154 If 'raid10_copies 2' and 'raid10_format offset', then the
155 layouts for 2, 3 and 4 devices are:
156
157 ======== ========== ================
158 2 drives 3 drives 4 drives
159 ======== ========== ================
160 A1 A2 A1 A2 A3 A1 A2 A3 A4
161 A2 A1 A3 A1 A2 A2 A1 A4 A3
162 A3 A4 A4 A5 A6 A5 A6 A7 A8
163 A4 A3 A6 A4 A5 A6 A5 A8 A7
164 A5 A6 A7 A8 A9 A9 A10 A11 A12
165 A6 A5 A9 A7 A8 A10 A9 A12 A11
166 .. .. .. .. .. .. .. .. ..
167 ======== ========== ================
168
169 Here we see layouts closely akin to 'RAID1E - Integrated
170 Offset Stripe Mirroring'.
171
172 [delta_disks <N>]
173 The delta_disks option value (-251 < N < +251) triggers
174 device removal (negative value) or device addition (positive
175 value) to any reshape supporting raid levels 4/5/6 and 10.
176 RAID levels 4/5/6 allow for addition of devices (metadata
177 and data device tuple), raid10_near and raid10_offset only
178 allow for device addition. raid10_far does not support any
179 reshaping at all.
180 A minimum number of devices has to be kept to enforce resilience,
181 which is 3 devices for raid4/5 and 4 devices for raid6.
182
183 [data_offset <sectors>]
184 This option value defines the offset into each data device
185 where the data starts. This is used to provide out-of-place
186 reshaping space to avoid writing over data while
187 changing the layout of stripes, hence an interruption/crash
188 may happen at any time without the risk of losing data.
189 E.g. when adding devices to an existing raid set during
190 forward reshaping, the out-of-place space will be allocated
191 at the beginning of each raid device. The kernel raid4/5/6/10
192 MD personalities supporting such device addition will read the data from
193 the existing first stripes (those with smaller number of stripes)
194 starting at data_offset to fill up a new stripe with the larger
195 number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
196 and write that new stripe to offset 0. Same will be applied to all
197 N-1 other new stripes. This out-of-place scheme is used to change
198 the RAID type (i.e. the allocation algorithm) as well, e.g.
199 changing from raid5_ls to raid5_n.
200
201 [journal_dev <dev>]
202 This option adds a journal device to raid4/5/6 raid sets and
203 uses it to close the 'write hole' caused by the non-atomic updates
204 to the component devices which can cause data loss during recovery.
205 The journal device is used as writethrough thus causing writes to
206 be throttled versus non-journaled raid4/5/6 sets.
207 Takeover/reshape is not possible with a raid4/5/6 journal device;
208 it has to be deconfigured before requesting these.
209
210 [journal_mode <mode>]
211 This option sets the caching mode on journaled raid4/5/6 raid sets
212 (see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
213 If 'writeback' is selected the journal device has to be resilient
214 and must not suffer from the 'write hole' problem itself (e.g. use
215 raid1 or raid10) to avoid a single point of failure.
216
217<#raid_devs>: The number of devices composing the array.
218 Each device consists of two entries. The first is the device
219 containing the metadata (if any); the second is the one containing the
220 data. A maximum of 64 metadata/data device entries are supported
221 up to target version 1.8.0.
222 1.9.0 supports up to 253 which is enforced by the used MD kernel runtime.
223
224 If a drive has failed or is missing at creation time, a '-' can be
225 given for both the metadata and data drives for a given position.
226
227
228Example Tables
229--------------
230
231::
232
233 # RAID4 - 4 data drives, 1 parity (no metadata devices)
234 # No metadata devices specified to hold superblock/bitmap info
235 # Chunk size of 1MiB
236 # (Lines separated for easy reading)
237
238 0 1960893648 raid \
239 raid4 1 2048 \
240 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
241
242 # RAID4 - 4 data drives, 1 parity (with metadata devices)
243 # Chunk size of 1MiB, force RAID initialization,
244 # min recovery rate at 20 kiB/sec/disk
245
246 0 1960893648 raid \
247 raid4 4 2048 sync min_recovery_rate 20 \
248 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
249
250
251Status Output
252-------------
253'dmsetup table' displays the table used to construct the mapping.
254The optional parameters are always printed in the order listed
255above with "sync" or "nosync" always output ahead of the other
256arguments, regardless of the order used when originally loading the table.
257Arguments that can be repeated are ordered by value.
258
259
260'dmsetup status' yields information on the state and health of the array.
261The output is as follows (normally a single line, but expanded here for
262clarity)::
263
264 1: <s> <l> raid \
265 2: <raid_type> <#devices> <health_chars> \
266 3: <sync_ratio> <sync_action> <mismatch_cnt>
267
268Line 1 is the standard output produced by device-mapper.
269
270Lines 2 and 3 are produced by the raid target and are best explained by example::
271
272 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
273
274Here we can see the RAID type is raid4, there are 5 devices - all of
275which are 'A'live, and the array is 2/490221568 complete with its initial
276recovery. Here is a fuller description of the individual fields:
277
278 =============== =========================================================
279 <raid_type> Same as the <raid_type> used to create the array.
280 <health_chars> One char for each device, indicating:
281
282 - 'A' = alive and in-sync
283 - 'a' = alive but not in-sync
284 - 'D' = dead/failed.
285 <sync_ratio> The ratio indicating how much of the array has undergone
286 the process described by 'sync_action'. If the
287 'sync_action' is "check" or "repair", then the process
288 of "resync" or "recover" can be considered complete.
289 <sync_action> One of the following possible states:
290
291 idle
292 - No synchronization action is being performed.
293 frozen
294 - The current action has been halted.
295 resync
296 - Array is undergoing its initial synchronization
297 or is resynchronizing after an unclean shutdown
298 (possibly aided by a bitmap).
299 recover
300 - A device in the array is being rebuilt or
301 replaced.
302 check
303 - A user-initiated full check of the array is
304 being performed. All blocks are read and
305 checked for consistency. The number of
306 discrepancies found is recorded in
307 <mismatch_cnt>. No changes are made to the
308 array by this action.
309 repair
310 - The same as "check", but discrepancies are
311 corrected.
312 reshape
313 - The array is undergoing a reshape.
314 <mismatch_cnt> The number of discrepancies found between mirror copies
315 in RAID1/10 or wrong parity values found in RAID4/5/6.
316 This value is valid only after a "check" of the array
317 is performed. A healthy array has a 'mismatch_cnt' of 0.
318 <data_offset> The current data offset to the start of the user data on
319 each component device of a raid set (see the respective
320 raid parameter to support out-of-place reshaping).
321 <journal_char> - 'A' - active write-through journal device.
322 - 'a' - active write-back journal device.
323 - 'D' - dead journal device.
324 - '-' - no journal device.
325 =============== =========================================================
326
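
A monitoring script might split the status line into the fields described
above. The following Python sketch does this for the fields up to
<mismatch_cnt>. The function and dictionary key names are hypothetical, and any
fields that follow <mismatch_cnt> are kept unparsed because their exact
position is not shown in the format above.

```python
def parse_raid_status(line):
    """Split one 'dmsetup status' line of a raid target into named fields."""
    fields = line.split()
    if len(fields) < 9 or fields[2] != "raid":
        raise ValueError("not a raid target status line")
    synced, total = (int(x) for x in fields[6].split("/"))
    return {
        "start": int(fields[0]),
        "length": int(fields[1]),
        "raid_type": fields[3],
        "devices": int(fields[4]),
        "health_chars": fields[5],
        "synced": synced,
        "total": total,
        "sync_action": fields[7],
        "mismatch_cnt": int(fields[8]),
        "extra": fields[9:],  # later fields, left unparsed in this sketch
    }


if __name__ == "__main__":
    status = parse_raid_status("0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0")
    print(status["raid_type"], status["health_chars"])
    dead = [i for i, c in enumerate(status["health_chars"]) if c == "D"]
    print("failed devices:", dead)
```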
327
328Message Interface
329-----------------
330The dm-raid target will accept certain actions through the 'message' interface.
331('man dmsetup' for more information on the message interface.) These actions
332include:
333
334 ========= ================================================
335 "idle" Halt the current sync action.
336 "frozen" Freeze the current sync action.
337 "resync" Initiate/continue a resync.
338 "recover" Initiate/continue a recover process.
339 "check" Initiate a check (i.e. a "scrub") of the array.
340 "repair" Initiate a repair of the array.
341 ========= ================================================
342
343
344Discard Support
345---------------
346The implementation of discard support among hardware vendors varies.
347When a block is discarded, some storage devices will return zeroes when
348the block is read. These devices set the 'discard_zeroes_data'
349attribute. Other devices will return random data. Confusingly, some
350devices that advertise 'discard_zeroes_data' will not reliably return
351zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
352from a number of devices to calculate parity blocks and (for performance
353reasons) relies on 'discard_zeroes_data' being reliable, it is important
354that the devices be consistent. Blocks may be discarded in the middle
355of a RAID 4/5/6 stripe and if subsequent read results are not
356consistent, the parity blocks may be calculated differently at any time;
357making the parity blocks useless for redundancy. It is important to
358understand how your hardware behaves with discards if you are going to
359enable discards with RAID 4/5/6.
360
361Since the behavior of storage devices is unreliable in this respect,
362even when reporting 'discard_zeroes_data', by default RAID 4/5/6
363discard support is disabled -- this ensures data integrity at the
364expense of losing some performance.
365
366Storage devices that properly support 'discard_zeroes_data' are
367increasingly whitelisted in the kernel and can thus be trusted.
368
369For trusted devices, the following dm-raid module parameter can be set
370to safely enable discard support for RAID 4/5/6:
371
372 'devices_handle_discard_safely'
373
374
375Version History
376---------------
377
378::
379
380 1.0.0 Initial version. Support for RAID 4/5/6
381 1.1.0 Added support for RAID 1
382 1.2.0 Handle creation of arrays that contain failed devices.
383 1.3.0 Added support for RAID 10
384 1.3.1 Allow device replacement/rebuild for RAID 10
385 1.3.2 Fix/improve redundancy checking for RAID10
386 1.4.0 Non-functional change. Removes arg from mapping function.
387 1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
388 1.4.2 Add RAID10 "far" and "offset" algorithm support.
389 1.5.0 Add message interface to allow manipulation of the sync_action.
390 New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
391 1.5.1 Add ability to restore transiently failed devices on resume.
392 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
393 1.6.0 Add discard support (and devices_handle_discard_safely module param).
394 1.7.0 Add support for MD RAID0 mappings.
395 1.8.0 Explicitly check for compatible flags in the superblock metadata
396 and reject to start the raid set if any are set by a newer
397 target version, thus avoiding data corruption on a raid set
398 with a reshape in progress.
399 1.9.0 Add support for RAID level takeover/reshape/region size
400 and set size reduction.
401 1.9.1 Fix activation of existing RAID 4/10 mapped devices
402 1.9.2 Don't emit '- -' on the status table line in case the constructor
403 fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
404 'D' on the status line. If '- -' is passed into the constructor, emit
405 '- -' on the table line and '-' as the status line health character.
406 1.10.0 Add support for raid4/5/6 journal device
407 1.10.1 Fix data corruption on reshape request
408 1.11.0 Fix table line argument order
409 (wrong raid10_copies/raid10_format sequence)
410 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
411 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
412 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
413 1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
414 state races.
415 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
416 1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
417 deadlock/potential data corruption. Update superblock when
418 specific devices are requested via rebuild. Fix RAID leg
419 rebuild errors.
diff --git a/Documentation/admin-guide/device-mapper/dm-service-time.rst b/Documentation/admin-guide/device-mapper/dm-service-time.rst
new file mode 100644
index 000000000000..facf277fc13c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-service-time.rst
@@ -0,0 +1,101 @@
1===============
2dm-service-time
3===============
4
5dm-service-time is a path selector module for device-mapper targets,
6which selects a path with the shortest estimated service time for
7the incoming I/O.
8
9The service time for each path is estimated by dividing the total size
10of in-flight I/Os on a path with the performance value of the path.
11The performance value is a relative throughput value among all paths
12in a path-group, and it can be specified as a table argument.
13
14The path selector name is 'service-time'.
15
16Table parameters for each path:
17
18 [<repeat_count> [<relative_throughput>]]
19 <repeat_count>:
20 The number of I/Os to dispatch using the selected
21 path before switching to the next path.
22 If not given, internal default is used. To check
23 the default value, see the activated table.
24 <relative_throughput>:
25 The relative throughput value of the path
26 among all paths in the path-group.
27 The valid range is 0-100.
28 If not given, minimum value '1' is used.
29 If '0' is given, the path isn't selected while
30 other paths having a positive value are available.
31
32Status for each path:
33
34 <status> <fail-count> <in-flight-size> <relative_throughput>
35 <status>:
36 'A' if the path is active, 'F' if the path is failed.
37 <fail-count>:
38 The number of path failures.
39 <in-flight-size>:
40 The size of in-flight I/Os on the path.
41 <relative_throughput>:
42 The relative throughput value of the path
43 among all paths in the path-group.
44
45
46Algorithm
47=========
48
49dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
50dispatched and subtracts when completed.
51Basically, dm-service-time selects a path having minimum service time
52which is calculated by::
53
54 ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
55
56However, some optimizations below are used to reduce the calculation
57as much as possible.
58
59 1. If the paths have the same 'relative_throughput', skip
60 the division and just compare the 'in-flight-size'.
61
62 2. If the paths have the same 'in-flight-size', skip the division
63 and just compare the 'relative_throughput'.
64
65 3. If some paths have non-zero 'relative_throughput' and others
66 have zero 'relative_throughput', ignore those paths with zero
67 'relative_throughput'.
68
69If such optimizations can't be applied, calculate service time, and
70compare service time.
71If calculated service time is equal, the path having maximum
72'relative_throughput' may be better. So compare 'relative_throughput'
73then.
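
The comparison rules above can be sketched in a few lines of Python. This is
only an illustration of the documented rules, not the kernel code: the ``Path``
class, the function names and the cross-multiplication used to avoid the
division are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    in_flight_size: int        # total size of I/O currently in flight
    relative_throughput: int   # 0-100, as given on the table line


def better_path(a, b, incoming):
    """Return whichever of two paths has the shorter estimated service time."""
    # Rule 1: same throughput, so just compare the in-flight size.
    if a.relative_throughput == b.relative_throughput:
        return a if a.in_flight_size <= b.in_flight_size else b
    # Rule 2: same in-flight size, so the higher throughput wins.
    if a.in_flight_size == b.in_flight_size:
        return a if a.relative_throughput > b.relative_throughput else b
    # Rule 3: a zero-throughput path loses to any path with a positive value.
    if a.relative_throughput == 0:
        return b
    if b.relative_throughput == 0:
        return a
    # General case: compare (in_flight + incoming) / throughput.
    # Cross-multiplying (an assumption of this sketch) keeps it in integers.
    st_a = (a.in_flight_size + incoming) * b.relative_throughput
    st_b = (b.in_flight_size + incoming) * a.relative_throughput
    if st_a != st_b:
        return a if st_a < st_b else b
    # Equal service time: prefer the path with the higher throughput.
    return a if a.relative_throughput > b.relative_throughput else b


def select_path(paths, incoming):
    """Pick the best path of a path-group for an incoming I/O of given size."""
    best = paths[0]
    for candidate in paths[1:]:
        best = better_path(best, candidate, incoming)
    return best
```

With ``relative_throughput`` 1 for sda and 4 for sdb, sdb is still chosen when
it already carries somewhat more in-flight I/O than sda, and sda takes over only
once sdb's backlog is large enough to outweigh the throughput difference.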
74
75
76Examples
77========
78When 2 paths (sda and sdb) are used with repeat_count == 128,
79and sda has an average throughput of 1GB/s and sdb of 4GB/s, the
80'relative_throughput' value may be '1' for sda and '4' for sdb::
81
82 # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" | \
83 dmsetup create test
84 #
85 # dmsetup table
86 test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
87 #
88 # dmsetup status
89 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
90
91
92Or '2' for sda and '8' for sdb would be equally valid::
93
94 # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" | \
95 dmsetup create test
96 #
97 # dmsetup table
98 test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
99 #
100 # dmsetup status
101 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
diff --git a/Documentation/admin-guide/device-mapper/dm-uevent.rst b/Documentation/admin-guide/device-mapper/dm-uevent.rst
new file mode 100644
index 000000000000..4a8ee8d069c9
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-uevent.rst
@@ -0,0 +1,110 @@
1====================
2device-mapper uevent
3====================
4
5The device-mapper uevent code adds the capability to device-mapper to create
6and send kobject uevents (uevents). Previously device-mapper events were only
7available through the ioctl interface. The advantage of the uevents interface
8is that the event contains environment attributes providing increased context
9for the event, avoiding the need to query the state of the device-mapper device
10after the event is received.
11
12There are two functions currently for device-mapper events. The first function
13listed creates the event and the second function sends the event(s)::
14
15 void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
16 const char *path, unsigned nr_valid_paths)
17
18 void dm_send_uevents(struct list_head *events, struct kobject *kobj)
19
20
21The variables added to the uevent environment are:
22
23Variable Name: DM_TARGET
24------------------------
25:Uevent Action(s): KOBJ_CHANGE
26:Type: string
27:Description:
28:Value: Name of device-mapper target that generated the event.
29
30Variable Name: DM_ACTION
31------------------------
32:Uevent Action(s): KOBJ_CHANGE
33:Type: string
34:Description:
35:Value: Device-mapper specific action that caused the uevent action.
36 PATH_FAILED - A path has failed;
37 PATH_REINSTATED - A path has been reinstated.
38
39Variable Name: DM_SEQNUM
40------------------------
41:Uevent Action(s): KOBJ_CHANGE
42:Type: unsigned integer
43:Description: A sequence number for this specific device-mapper device.
44:Value: Valid unsigned integer range.
45
46Variable Name: DM_PATH
47----------------------
48:Uevent Action(s): KOBJ_CHANGE
49:Type: string
50:Description: Major and minor number of the path device pertaining to this
51 event.
52:Value: Path name in the form of "Major:Minor"
53
54Variable Name: DM_NR_VALID_PATHS
55--------------------------------
56:Uevent Action(s): KOBJ_CHANGE
57:Type: unsigned integer
58:Description:
59:Value: Valid unsigned integer range.
60
61Variable Name: DM_NAME
62----------------------
63:Uevent Action(s): KOBJ_CHANGE
64:Type: string
65:Description: Name of the device-mapper device.
66:Value: Name
67
68Variable Name: DM_UUID
69----------------------
70:Uevent Action(s): KOBJ_CHANGE
71:Type: string
72:Description: UUID of the device-mapper device.
73:Value: UUID. (Empty string if there isn't one.)
74
75An example of the uevents generated as captured by udevmonitor is shown
76below:
77
781.) Path failure::
79
80 UEVENT[1192521009.711215] change@/block/dm-3
81 ACTION=change
82 DEVPATH=/block/dm-3
83 SUBSYSTEM=block
84 DM_TARGET=multipath
85 DM_ACTION=PATH_FAILED
86 DM_SEQNUM=1
87 DM_PATH=8:32
88 DM_NR_VALID_PATHS=0
89 DM_NAME=mpath2
90 DM_UUID=mpath-35333333000002328
91 MINOR=3
92 MAJOR=253
93 SEQNUM=1130
94
952.) Path reinstate::
96
97 UEVENT[1192521132.989927] change@/block/dm-3
98 ACTION=change
99 DEVPATH=/block/dm-3
100 SUBSYSTEM=block
101 DM_TARGET=multipath
102 DM_ACTION=PATH_REINSTATED
103 DM_SEQNUM=2
104 DM_PATH=8:32
105 DM_NR_VALID_PATHS=1
106 DM_NAME=mpath2
107 DM_UUID=mpath-35333333000002328
108 MINOR=3
109 MAJOR=253
110 SEQNUM=1131
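
A userspace consumer could pick the DM_* variables out of such an event as
sketched below. This is a minimal illustration, assuming the event text has
already been captured as a string in the KEY=VALUE layout shown above; the
function names are hypothetical and no udev library is involved.

```python
SAMPLE = """\
UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_FAILED
DM_SEQNUM=1
DM_PATH=8:32
DM_NR_VALID_PATHS=0
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1130
"""


def parse_uevent(text):
    """Collect the KEY=VALUE lines of one captured uevent into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skips the "UEVENT[...]" header line
        key, value = line.split("=", 1)
        env[key] = value
    return env


def path_event(env):
    """Return (action, path, valid paths) for a path event, else None."""
    action = env.get("DM_ACTION")
    if action not in ("PATH_FAILED", "PATH_REINSTATED"):
        return None
    return action, env["DM_PATH"], int(env["DM_NR_VALID_PATHS"])
```

Because the event already carries DM_PATH and DM_NR_VALID_PATHS, the consumer
does not need to query the device state after receiving it.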
diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst
new file mode 100644
index 000000000000..07f56ebc1730
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst
@@ -0,0 +1,146 @@
1========
2dm-zoned
3========
4
5The dm-zoned device mapper target exposes a zoned block device (ZBC and
6ZAC compliant devices) as a regular block device without any write
7pattern constraints. In effect, it implements a drive-managed zoned
8block device which hides from the user (a file system or an application
9doing raw block device accesses) the sequential write constraints of
10host-managed zoned block devices and can mitigate the potential
11device-side performance degradation due to excessive random writes on
12host-aware zoned block devices.
13
14For a more detailed description of the zoned block device models and
15their constraints see (for SCSI devices):
16
17http://www.t10.org/drafts.htm#ZBC_Family
18
19and (for ATA devices):
20
21http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
22
23The dm-zoned implementation is simple and minimizes system overhead (CPU
24and memory usage as well as storage capacity loss). For a 10TB
25host-managed disk with 256 MB zones, dm-zoned memory usage per disk
26instance is at most 4.5 MB and as little as 5 zones will be used
27internally for storing metadata and performing reclaim operations.
28
29dm-zoned target devices are formatted and checked using the dmzadm
30utility available at:
31
32https://github.com/hgst/dm-zoned-tools
33
34Algorithm
35=========
36
37dm-zoned implements an on-disk buffering scheme to handle non-sequential
38write accesses to the sequential zones of a zoned block device.
39Conventional zones are used for caching as well as for storing internal
40metadata.
41
42The zones of the device are separated into 2 types:
43
441) Metadata zones: these are conventional zones used to store metadata.
45Metadata zones are not reported as useable capacity to the user.
46
472) Data zones: all remaining zones, the vast majority of which will be
48sequential zones used exclusively to store user data. The conventional
49zones of the device may be used also for buffering user random writes.
50Data in these zones may be directly mapped to the conventional zone, but
51later moved to a sequential zone so that the conventional zone can be
52reused for buffering incoming random writes.
53
54dm-zoned exposes a logical device with a sector size of 4096 bytes,
55irrespective of the physical sector size of the backend zoned block
56device being used. This allows reducing the amount of metadata needed to
57manage valid blocks (blocks written).
58
59The on-disk metadata format is as follows:
60
611) The first block of the first conventional zone found contains the
62super block which describes the on-disk amount and position of metadata
63blocks.
64
652) Following the super block, a set of blocks is used to describe the
66mapping of the logical device blocks. The mapping is done per chunk of
67blocks, with the chunk size equal to the zoned block device zone size. The
68mapping table is indexed by chunk number and each mapping entry
69indicates the zone number of the device storing the chunk of data. Each
70mapping entry may also indicate the zone number of a conventional
71zone used to buffer random modifications to the data zone.
72
733) A set of blocks used to store bitmaps indicating the validity of
74blocks in the data zones follows the mapping table. A valid block is
75defined as a block that was written and not discarded. For a buffered
76data chunk, a block is always valid only in the data zone mapping the
77chunk or in the buffer zone of the chunk.
78
79For a logical chunk mapped to a conventional zone, all write operations
80are processed by directly writing to the zone. If the mapping zone is a
81sequential zone, the write operation is processed directly only if the
82write offset within the logical chunk is equal to the write pointer
83offset within the sequential data zone (i.e. the write operation is
84aligned on the zone write pointer). Otherwise, write operations are
85processed indirectly using a buffer zone. In that case, an unused
86conventional zone is allocated and assigned to the chunk being
87accessed. Writing a block to the buffer zone of a chunk will
88automatically invalidate the same block in the sequential zone mapping
89the chunk. If all blocks of the sequential zone become invalid, the zone
90is freed and the chunk buffer zone becomes the primary zone mapping the
91chunk, resulting in native random write performance similar to a regular
92block device.
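
The write handling described above can be modelled roughly as follows. This is
a simplified sketch, not the dm-zoned implementation: the ``Chunk`` class and
its fields are hypothetical, a chunk is assumed to be mapped already, and
buffer zone allocation and zone freeing are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    sequential: bool          # type of the data zone mapping the chunk
    write_pointer: int = 0    # next writable block of a sequential zone
    data_valid: set = field(default_factory=set)    # valid blocks, data zone
    buffer_valid: set = field(default_factory=set)  # valid blocks, buffer zone


def write_block(chunk, block):
    """Process a one-block write and report how it was handled."""
    if not chunk.sequential or block == chunk.write_pointer:
        # Conventional zone, or write aligned on the write pointer:
        # write directly to the zone mapping the chunk.
        chunk.data_valid.add(block)
        chunk.buffer_valid.discard(block)
        if chunk.sequential:
            chunk.write_pointer += 1
        return "direct"
    # Unaligned write to a sequential zone: go through the buffer zone and
    # invalidate the same block in the sequential zone.
    chunk.buffer_valid.add(block)
    chunk.data_valid.discard(block)
    return "buffered"
```

A block is valid in at most one of the two sets at any time, which matches the
bitmap rule that a block is valid either in the data zone or in the buffer zone
of a chunk.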
93
94Read operations are processed according to the block validity
95information provided by the bitmaps. Valid blocks are read either from
96the sequential zone mapping a chunk, or if the chunk is buffered, from
97the buffer zone assigned. If the accessed chunk has no mapping, or the
98accessed blocks are invalid, the read buffer is zeroed and the read
99operation terminated.
100
101After some time, the limited number of conventional zones available may
102be exhausted (all used to map chunks or buffer sequential zones) and
103unaligned writes to unbuffered chunks become impossible. To avoid this
104situation, a reclaim process regularly scans used conventional zones and
105tries to reclaim the least recently used zones by copying the valid
106blocks of the buffer zone to a free sequential zone. Once the copy
107completes, the chunk mapping is updated to point to the sequential zone
108and the buffer zone freed for reuse.
109
110Metadata Protection
111===================
112
113To protect metadata against corruption in case of sudden power loss or
114system crash, 2 sets of metadata zones are used. One set, the primary
115set, is used as the main metadata region, while the secondary set is
116used as a staging area. Modified metadata is first written to the
117secondary set and validated by updating the super block in the secondary
118set; a generation counter is used to indicate that this set contains the
119newest metadata. Once this operation completes, in-place metadata
120block updates can be done in the primary metadata set. This ensures that
121one of the sets is always consistent (all modifications committed or none
122at all). Flush operations are used as a commit point. Upon reception of
123a flush request, metadata modification activity is temporarily blocked
124(for both incoming BIO processing and reclaim process) and all dirty
125metadata blocks are staged and updated. Normal operation is then
126resumed. Flushing metadata thus only temporarily delays write and
127discard requests. Read requests can be processed concurrently while
128metadata flush is being executed.
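
The two-set scheme can be illustrated with a small model. This is a sketch
under stated assumptions, not the on-disk format: the ``MetadataSet`` class,
the ``crash_before_primary`` switch, and the recovery rule of picking the set
with the higher generation are all assumptions made for illustration.

```python
class MetadataSet:
    """One set of metadata blocks plus its super block generation."""

    def __init__(self):
        self.generation = 0
        self.blocks = {}


def flush(primary, secondary, dirty, crash_before_primary=False):
    """Commit dirty metadata blocks: stage first, then update in place."""
    # Step 1: write the modified blocks to the staging (secondary) set and
    # validate them by bumping the generation in its super block.
    secondary.blocks.update(dirty)
    secondary.generation = primary.generation + 1
    if crash_before_primary:
        return  # simulated power loss between the two steps
    # Step 2: in-place update of the primary set.
    primary.blocks.update(dirty)
    primary.generation = secondary.generation


def newest(primary, secondary):
    """Pick the set to trust after a restart (assumed recovery rule)."""
    if secondary.generation > primary.generation:
        return secondary
    return primary
```

If the flush is interrupted after step 1, the secondary set holds the complete
new state and a higher generation, so either all modifications of the flush are
visible or none are.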
129
130Usage
131=====
132
133A zoned block device must first be formatted using the dmzadm tool. This
134will analyze the device zone configuration, determine where to place the
135metadata sets on the device and initialize the metadata sets.
136
137Ex::
138
139 dmzadm --format /dev/sdxx
140
141For a formatted device, the target can be created normally with the
142dmsetup utility. The only parameter that dm-zoned requires is the
143underlying zoned block device name. Ex::
144
145 echo "0 `blockdev --getsz ${dev}` zoned ${dev}" | \
146 dmsetup create dmz-`basename ${dev}`
diff --git a/Documentation/admin-guide/device-mapper/era.rst b/Documentation/admin-guide/device-mapper/era.rst
new file mode 100644
index 000000000000..90dd5c670b9f
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/era.rst
@@ -0,0 +1,116 @@
1======
2dm-era
3======
4
5Introduction
6============
7
8dm-era is a target that behaves similarly to the linear target. In
9addition it keeps track of which blocks were written within a user
10defined period of time called an 'era'. Each era target instance
11maintains the current era as a monotonically increasing 32-bit
12counter.
13
14Use cases include tracking changed blocks for backup software, and
15partially invalidating the contents of a cache to restore cache
16coherency after rolling back a vendor snapshot.
17
18Constructor
19===========
20
21era <metadata dev> <origin dev> <block size>
22
23 ================ ======================================================
24 metadata dev fast device holding the persistent metadata
25 origin dev device holding data blocks that may change
26 block size block size of origin data device, granularity that is
27 tracked by the target
28 ================ ======================================================
29
30Messages
31========
32
33None of the dm messages take any arguments.
34
35checkpoint
36----------
37
38Possibly move to a new era. You shouldn't assume the era has
39incremented. After sending this message, you should check the
40current era via the status line.
41
42take_metadata_snap
43------------------
44
45Create a clone of the metadata, to allow a userland process to read it.
46
47drop_metadata_snap
48------------------
49
50Drop the metadata snapshot.
51
52Status
53======
54
55<metadata block size> <#used metadata blocks>/<#total metadata blocks>
56<current era> <held metadata root | '-'>
57
58========================= ==============================================
59metadata block size Fixed block size for each metadata block in
60 sectors
61#used metadata blocks Number of metadata blocks used
62#total metadata blocks Total number of metadata blocks
63current era The current era
64held metadata root The location, in blocks, of the metadata root
65 that has been 'held' for userspace read
66 access. '-' indicates there is no held root
67========================= ==============================================
68
69Detailed use case
70=================
71
72The scenario of invalidating a cache when rolling back a vendor
73snapshot was the primary use case when developing this target:
74
75Taking a vendor snapshot
76------------------------
77
78- Send a checkpoint message to the era target
79- Make a note of the current era in its status line
80- Take vendor snapshot (the era and snapshot should be forever
81 associated now).
82
83Rolling back to a vendor snapshot
84---------------------------------
85
86- Cache enters passthrough mode (see: dm-cache's docs in cache.rst)
87- Rollback vendor storage
88- Take metadata snapshot
89- Ascertain which blocks have been written since the snapshot was taken
90 by checking each block's era
91- Invalidate those blocks in the caching software
92- Cache returns to writeback/writethrough mode
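
The "ascertain which blocks have been written" step might look like the sketch
below. It assumes the per-block eras have already been read from the metadata
snapshot into a dict by some userland tool, and it treats a block as written
since the snapshot when its era is greater than or equal to the era noted at
snapshot time. Both the input format and that comparison are assumptions of
this sketch.

```python
def blocks_to_invalidate(block_eras, snapshot_era):
    """Return the blocks written in or after the era noted for the snapshot.

    block_eras maps a block number to the era in which it was last written.
    """
    return sorted(block for block, era in block_eras.items()
                  if era >= snapshot_era)
```

For example, with eras ``{0: 3, 1: 5, 2: 7, 3: 4}`` and a snapshot noted at
era 5, blocks 1 and 2 would be invalidated in the cache.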
93
94Memory usage
95============
96
97The target uses a bitset to record writes in the current era. It also
98has a spare bitset ready for switching over to a new era. Other than
99that it uses a few 4k blocks for updating metadata::
100
101 (4 * nr_blocks) bytes + buffers
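
As a rough worked example of the formula above, the sketch below derives
nr_blocks from the origin size and the block size, both in 512-byte sectors.
The function name is hypothetical and the "buffers" part is not modelled.

```python
def era_memory_bytes(origin_sectors, block_size_sectors):
    """Estimate (4 * nr_blocks) bytes for a dm-era target, buffers excluded."""
    # Round up so that a partial trailing block is still counted.
    nr_blocks = -(-origin_sectors // block_size_sectors)
    return 4 * nr_blocks
```

For a 1 TiB origin (2**31 sectors) tracked with 128-sector (64 KiB) blocks this
gives 2**24 blocks, i.e. 64 MiB.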
102
103Resilience
104==========
105
106Metadata is updated on disk before a write to a previously unwritten
107block is performed. As such dm-era should not be affected by a hard
108crash such as power failure.
109
110Userland tools
111==============
112
113Userland tools are found in the increasingly poorly named
114thin-provisioning-tools project:
115
116 https://github.com/jthornber/thin-provisioning-tools
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
new file mode 100644
index 000000000000..c77c58b8f67b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -0,0 +1,42 @@
1=============
2Device Mapper
3=============
4
5.. toctree::
6 :maxdepth: 1
7
8 cache-policies
9 cache
10 delay
11 dm-crypt
12 dm-flakey
13 dm-init
14 dm-integrity
15 dm-io
16 dm-log
17 dm-queue-length
18 dm-raid
19 dm-service-time
20 dm-uevent
21 dm-zoned
22 era
23 kcopyd
24 linear
25 log-writes
26 persistent-data
27 snapshot
28 statistics
29 striped
30 switch
31 thin-provisioning
32 unstriped
33 verity
34 writecache
35 zero
36
37.. only:: subproject and html
38
39 Indices
40 =======
41
42 * :ref:`genindex`
diff --git a/Documentation/admin-guide/device-mapper/kcopyd.rst b/Documentation/admin-guide/device-mapper/kcopyd.rst
new file mode 100644
index 000000000000..7651d395127f
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/kcopyd.rst
@@ -0,0 +1,47 @@
1======
2kcopyd
3======
4
5Kcopyd provides the ability to copy a range of sectors from one block-device
6to one or more other block-devices, with an asynchronous completion
7notification. It is used by dm-snapshot and dm-mirror.
8
9Users of kcopyd must first create a client and indicate how many memory pages
10to set aside for their copy jobs. This is done with a call to
11kcopyd_client_create()::
12
13 int kcopyd_client_create(unsigned int num_pages,
14 struct kcopyd_client **result);
15
16To start a copy job, the user must set up io_region structures to describe
17the source and destinations of the copy. Each io_region indicates a
18block-device along with the starting sector and size of the region. The source
19of the copy is given as one io_region structure, and the destinations of the
20copy are given as an array of io_region structures::
21
22 struct io_region {
23 struct block_device *bdev;
24 sector_t sector;
25 sector_t count;
26 };
27
28To start the copy, the user calls kcopyd_copy(), passing in the client
29pointer, pointers to the source and destination io_regions, the name of a
30completion callback routine, and a pointer to some context data for the copy::
31
32 int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
33 unsigned int num_dests, struct io_region *dests,
34 unsigned int flags, kcopyd_notify_fn fn, void *context);
35
36 typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err,
37 void *context);
38
39When the copy completes, kcopyd will call the user's completion routine,
40passing back the user's context pointer. It will also indicate if a read or
41write error occurred during the copy.
42
43When a user is done with all their copy jobs, they should call
44kcopyd_client_destroy() to delete the kcopyd client, which will release the
45associated memory pages::
46
47 void kcopyd_client_destroy(struct kcopyd_client *kc);
diff --git a/Documentation/admin-guide/device-mapper/linear.rst b/Documentation/admin-guide/device-mapper/linear.rst
new file mode 100644
index 000000000000..9d17fc6e64a9
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/linear.rst
@@ -0,0 +1,63 @@
1=========
2dm-linear
3=========
4
5Device-Mapper's "linear" target maps a linear range of the Device-Mapper
6device onto a linear range of another device. This is the basic building
7block of logical volume managers.
8
9Parameters: <dev path> <offset>
10 <dev path>:
11 Full pathname to the underlying block-device, or a
12 "major:minor" device-number.
13 <offset>:
14 Starting sector within the device.
15
16
17Example scripts
18===============
19
20::
21
22 #!/bin/sh
23 # Create an identity mapping for a device
24 echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
25
26::
27
28 #!/bin/sh
29 # Join 2 devices together
30 size1=`blockdev --getsz $1`
31 size2=`blockdev --getsz $2`
32 echo "0 $size1 linear $1 0
33 $size1 $size2 linear $2 0" | dmsetup create joined
34
35::
36
37 #!/usr/bin/perl -w
38 # Split a device into 4M chunks and then join them together in reverse order.
39
40 my $name = "reverse";
41 my $extent_size = 4 * 1024 * 2;
42 my $dev = $ARGV[0];
43 my $table = "";
44 my $count = 0;
45
46 if (!defined($dev)) {
47 die("Please specify a device.\n");
48 }
49
50 my $dev_size = `blockdev --getsz $dev`;
51 my $extents = int($dev_size / $extent_size) -
52 (($dev_size % $extent_size) ? 1 : 0);
53
54 while ($extents > 0) {
55 my $this_start = $count * $extent_size;
56 $extents--;
57 $count++;
58 my $this_offset = $extents * $extent_size;
59
60 $table .= "$this_start $extent_size linear $dev $this_offset\n";
61 }
62
63 `echo \"$table\" | dmsetup create $name`;
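
The table built by the last script can also be generated without touching any
device, which makes the mapping easier to inspect. The sketch below is an
illustration only: the function name is hypothetical, the device size is passed
in rather than read with blockdev, and unlike the Perl script it keeps every
whole extent even when the device size is not a multiple of the extent size.

```python
def reverse_table(dev, dev_size, extent_size=4 * 1024 * 2):
    """Build a dm table mapping 4M extents of dev in reverse order.

    dev_size and extent_size are in 512-byte sectors.
    """
    extents = dev_size // extent_size
    lines = []
    for count in range(extents):
        this_start = count * extent_size                   # logical offset
        this_offset = (extents - 1 - count) * extent_size  # offset on dev
        lines.append("%d %d linear %s %d"
                     % (this_start, extent_size, dev, this_offset))
    return "\n".join(lines)
```

The resulting string can be piped to ``dmsetup create`` in the same way as in
the scripts above.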
diff --git a/Documentation/admin-guide/device-mapper/log-writes.rst b/Documentation/admin-guide/device-mapper/log-writes.rst
new file mode 100644
index 000000000000..23141f2ffb7c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/log-writes.rst
@@ -0,0 +1,145 @@
1=============
2dm-log-writes
3=============
4
5This target takes 2 devices, one to pass all IO to normally, and one to log all
6of the write operations to. This is intended for file system developers wishing
7to verify the integrity of metadata or data as the file system is written to.
8There is a log_write_entry written for every WRITE request and the target is
9able to take arbitrary data from userspace to insert into the log. The data
10that is in the WRITE requests is copied into the log to make the replay happen
11exactly as it happened originally.
12
13Log Ordering
14============
15
16We log things in order of completion once we are sure the write is no longer in
17cache. This means that normal WRITE requests are not actually logged until the
18next REQ_PREFLUSH request. This is to make it easier for userspace to replay
19the log in a way that correlates to what is on disk and not what is in cache,
20to make it easier to detect improper waiting/flushing.
21
22This works by attaching all WRITE requests to a list once the write completes.
23Once we see a REQ_PREFLUSH request we splice this list onto the request and once
24the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
25completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
26simulate the worst case scenario with regard to power failures. Consider the
27following example (W means write, C means complete):
28
29 W1,W2,W3,C3,C2,Wflush,C1,Cflush
30
31The log would show the following:
32
33 W3,W2,flush,W1....
34
35Again, this is to simulate what is actually on disk; it allows us to detect
36cases where a power failure at a particular point in time would create an
37inconsistent file system.
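
The ordering rule can be reproduced with a small simulation. This is a sketch
of the described behaviour only, not the target's code: events are plain
strings in the notation of the example above, and FUA and DISCARD handling is
left out.

```python
def simulate(events):
    """Return (log, pending) for a sequence such as W1, C1, Wflush, Cflush."""
    log = []
    completed = []    # writes completed but not yet covered by a flush
    attached = []     # writes spliced onto the flush currently in flight
    for event in events:
        if event == "Wflush":
            # Only writes completed by now are attached to this flush.
            attached, completed = completed, []
        elif event == "Cflush":
            # The flush completed: log its writes, then the flush itself.
            log.extend(attached)
            log.append("flush")
            attached = []
        elif event.startswith("C"):
            completed.append("W" + event[1:])
        # Plain "Wn" events only issue a write; nothing is logged yet.
    return log, completed
```

For the sequence in the example this yields the log W3, W2, flush, with W1
still waiting for the next flush.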
38
39Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
40they complete as those requests will obviously bypass the device cache.
41
42Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
43have all the DISCARD requests, and then the WRITE requests and then the FLUSH
44request. Consider the following example:
45
46 WRITE block 1, DISCARD block 1, FLUSH
47
48If we logged DISCARD when it completed, the replay would look like this:
49
50 DISCARD 1, WRITE 1, FLUSH
51
52which isn't quite what happened and wouldn't be caught during the log replay.
53
54Target interface
55================
56
57i) Constructor
58
59 log-writes <dev_path> <log_dev_path>
60
61 ============= ==============================================
62 dev_path Device that all of the IO will go to normally.
63 log_dev_path Device where the log entries are written to.
64 ============= ==============================================
65
66ii) Status
67
68 <#logged entries> <highest allocated sector>
69
70 =========================== ========================
71 #logged entries Number of logged entries
72 highest allocated sector Highest allocated sector
73 =========================== ========================
74
75iii) Messages
76
77 mark <description>
78
79 You can use a dmsetup message to set an arbitrary mark in a log.
80 For example say you want to fsck a file system after every
81 write, but first you need to replay up to the mkfs to make sure
82 we're fsck'ing something reasonable, you would do something like
83 this::
84
85 mkfs.btrfs -f /dev/mapper/log
86 dmsetup message log 0 mark mkfs
87 <run test>
88
89 This would allow you to replay the log up to the mkfs mark and
90 then replay from that point on doing the fsck check in the
91 interval that you want.
92
93 Every log has a mark at the end labeled "dm-log-writes-end".
94
95Userspace component
96===================
97
98There is a userspace tool that will replay the log for you in various ways.
99It can be found here: https://github.com/josefbacik/log-writes
100
101Example usage
102=============
103
104Say you want to test fsync on your file system. You would do something like
105this::
106
107 TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
108 dmsetup create log --table "$TABLE"
109 mkfs.btrfs -f /dev/mapper/log
110 dmsetup message log 0 mark mkfs
111
112 mount /dev/mapper/log /mnt/btrfs-test
113 <some test that does fsync at the end>
114 dmsetup message log 0 mark fsync
115 md5sum /mnt/btrfs-test/foo
116 umount /mnt/btrfs-test
117
118 dmsetup remove log
119 replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
120 mount /dev/sdb /mnt/btrfs-test
121 md5sum /mnt/btrfs-test/foo
122 <verify md5sum's are correct>
123
124 Another option is to do a complicated file system operation and verify the file
125 system is consistent during the entire operation. You could do this with:
126
127 TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
128 dmsetup create log --table "$TABLE"
129 mkfs.btrfs -f /dev/mapper/log
130 dmsetup message log 0 mark mkfs
131
132 mount /dev/mapper/log /mnt/btrfs-test
133 <fsstress to dirty the fs>
134 btrfs filesystem balance /mnt/btrfs-test
135 umount /mnt/btrfs-test
136 dmsetup remove log
137
138 replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
139 btrfsck /dev/sdb
140 replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
141 --fsck "btrfsck /dev/sdb" --check fua
142
143And that will replay the log until it sees a FUA request, run the fsck command
144and if the fsck passes it will replay to the next FUA, until it is completed or
145the fsck command exits abnormally.
diff --git a/Documentation/admin-guide/device-mapper/persistent-data.rst b/Documentation/admin-guide/device-mapper/persistent-data.rst
new file mode 100644
index 000000000000..2065c3c5a091
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/persistent-data.rst
@@ -0,0 +1,88 @@
1===============
2Persistent data
3===============
4
5Introduction
6============
7
8The more-sophisticated device-mapper targets require complex metadata
9that is managed in kernel. In late 2010 we were seeing that various
10different targets were rolling their own data structures, for example:
11
12- Mikulas Patocka's multisnap implementation
13- Heinz Mauelshagen's thin provisioning target
14- Another btree-based caching target posted to dm-devel
15- Another multi-snapshot target based on a design of Daniel Phillips
16
17Maintaining these data structures takes a lot of work, so if possible
18we'd like to reduce the number.
19
20The persistent-data library is an attempt to provide a re-usable
21framework for people who want to store metadata in device-mapper
22targets. It's currently used by the thin-provisioning target and an
23upcoming hierarchical storage target.
24
25Overview
26========
27
28The main documentation is in the header files which can all be found
29under drivers/md/persistent-data.
30
31The block manager
32-----------------
33
34dm-block-manager.[hc]
35
36This provides access to the data on disk in fixed-sized blocks. There
37is a read/write locking interface to prevent concurrent accesses, and
38keep data that is being used in the cache.
39
40Clients of persistent-data are unlikely to use this directly.
41
42The transaction manager
43-----------------------
44
45dm-transaction-manager.[hc]
46
47This restricts access to blocks and enforces copy-on-write semantics.
48The only way you can get hold of a writable block through the
49transaction manager is by shadowing an existing block (ie. doing
50copy-on-write) or allocating a fresh one. Shadowing is elided within
51the same transaction so performance is reasonable. The commit method
52ensures that all data is flushed before it writes the superblock.
53On power failure your metadata will be as it was when last committed.
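
The shadowing rule can be illustrated with a toy model. This is not the
dm-transaction-manager interface: the class and method names are hypothetical,
blocks live in a dict, and reference counting and the superblock write are not
modelled.

```python
import copy


class TransactionManager:
    """Toy copy-on-write block store with per-transaction shadowing."""

    def __init__(self):
        self.blocks = {}     # block number -> contents
        self.fresh = set()   # blocks allocated or shadowed in this transaction
        self.next_free = 0

    def new_block(self, data):
        """Allocate a fresh, writable block."""
        b = self.next_free
        self.next_free += 1
        self.blocks[b] = data
        self.fresh.add(b)
        return b

    def shadow(self, b):
        """Return a writable block holding the contents of block b."""
        if b in self.fresh:
            return b  # already private to this transaction, no copy needed
        return self.new_block(copy.deepcopy(self.blocks[b]))

    def commit(self):
        """End the transaction; later writes must shadow again."""
        self.fresh.clear()
```

Shadowing a committed block yields a copy, while shadowing that copy again
within the same transaction returns the same block, which is the elision
mentioned above.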
54
55The Space Maps
56--------------
57
58dm-space-map.h
59dm-space-map-metadata.[hc]
60dm-space-map-disk.[hc]
61
62On-disk data structures that keep track of reference counts of blocks.
63Also acts as the allocator of new blocks. Currently two
64implementations: a simpler one for managing blocks on a different
65device (eg. thinly-provisioned data blocks); and one for managing
66the metadata space. The latter is complicated by the need to store
67its own data within the space it's managing.
68
69The data structures
70-------------------
71
72dm-btree.[hc]
73dm-btree-remove.c
74dm-btree-spine.c
75dm-btree-internal.h
76
77Currently there is only one data structure, a hierarchical btree.
78There are plans to add more. For example, something with an
79array-like interface would see a lot of use.
80
81The btree is 'hierarchical' in that you can define it to be composed
82of nested btrees, and take multiple keys. For example, the
83thin-provisioning target uses a btree with two levels of nesting.
84The first maps a device id to a mapping tree, and that in turn maps a
85virtual block to a physical block.
86
87Values stored in the btrees can have arbitrary size. Keys are always
8864 bits, although nesting allows you to use multiple keys.
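
The idea of nesting can be shown with ordinary dicts standing in for btrees.
This is purely illustrative and assumes nothing about the dm-btree API; the
function names are hypothetical. The two keys play the roles of the device id
and the virtual block in the thin-provisioning example.

```python
KEY_LIMIT = 2 ** 64  # keys are 64-bit values


def insert(tree, keys, value):
    """Store value under a sequence of keys, one key per nesting level."""
    for k in keys:
        if not 0 <= k < KEY_LIMIT:
            raise ValueError("key does not fit in 64 bits")
    node = tree
    for k in keys[:-1]:
        node = node.setdefault(k, {})   # descend into (or create) a subtree
    node[keys[-1]] = value


def lookup(tree, keys):
    """Follow the keys level by level; a short key list returns a subtree."""
    node = tree
    for k in keys:
        node = node[k]
    return node
```

Looking up only the first key returns the inner mapping tree, while supplying
both keys returns the physical block.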
diff --git a/Documentation/admin-guide/device-mapper/snapshot.rst b/Documentation/admin-guide/device-mapper/snapshot.rst
new file mode 100644
index 000000000000..ccdd8b587a74
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/snapshot.rst
@@ -0,0 +1,196 @@
1==============================
2Device-mapper snapshot support
3==============================
4
5Device-mapper allows you, without massive data copying:
6
7- To create snapshots of any block device i.e. mountable, saved states of
8 the block device which are also writable without interfering with the
9 original content;
10- To create device "forks", i.e. multiple different versions of the
11 same data stream.
12- To merge a snapshot of a block device back into the snapshot's origin
13 device.
14
15In the first two cases, dm copies only the chunks of data that get
16changed and uses a separate copy-on-write (COW) block device for
17storage.
18
19For snapshot merge the contents of the COW storage are merged back into
20the origin device.
21
22
23There are three dm targets available:
24snapshot, snapshot-origin, and snapshot-merge.
25
26- snapshot-origin <origin>
27
28which will normally have one or more snapshots based on it.
29Reads will be mapped directly to the backing device. For each write, the
30original data will be saved in the <COW device> of each snapshot to keep
31its visible content unchanged, at least until the <COW device> fills up.
32
33
34- snapshot <origin> <COW device> <persistent?> <chunksize>
35 [<# feature args> [<arg>]*]
36
37A snapshot of the <origin> block device is created. Changed chunks of
38<chunksize> sectors will be stored on the <COW device>. Writes will
39only go to the <COW device>. Reads will come from the <COW device> or
40from <origin> for unchanged data. <COW device> will often be
41smaller than the origin and if it fills up the snapshot will become
42useless and be disabled, returning errors. So it is important to monitor
43the amount of free space and expand the <COW device> before it fills up.
44
45<persistent?> is P (Persistent) or N (Not persistent - will not survive
46after reboot). O (Overflow) can be added as a persistent store option
47to allow userspace to advertise its support for seeing "Overflow" in the
48snapshot status. So supported store types are "P", "PO" and "N".
49
50The difference between persistent and transient is that with transient
51snapshots less metadata must be saved on disk - it can be kept in
52memory by the kernel.
53
54When loading or unloading the snapshot target, the corresponding
55snapshot-origin or snapshot-merge target must be suspended. A failure to
56suspend the origin target could result in data corruption.
57
58Optional features:
59
60 discard_zeroes_cow - a discard issued to the snapshot device that
61 maps to entire chunks will zero the corresponding exception(s) in
62 the snapshot's exception store.
63
64 discard_passdown_origin - a discard to the snapshot device is passed
65 down to the snapshot-origin's underlying device. This doesn't cause
66 copy-out to the snapshot exception store because the snapshot-origin
67 target is bypassed.
68
69 The discard_passdown_origin feature depends on the discard_zeroes_cow
70 feature being enabled.
71
72
73- snapshot-merge <origin> <COW device> <persistent> <chunksize>
74 [<# feature args> [<arg>]*]
75
76takes the same table arguments as the snapshot target except it only
77works with persistent snapshots. This target assumes the role of the
78"snapshot-origin" target and must not be loaded if the "snapshot-origin"
79is still present for <origin>.
80
81Creates a merging snapshot that takes control of the changed chunks
82stored in the <COW device> of an existing snapshot, through a handover
83procedure, and merges these chunks back into the <origin>. Once merging
84has started (in the background) the <origin> may be opened and the merge
85will continue while I/O is flowing to it. Changes to the <origin> are
86deferred until the merging snapshot's corresponding chunk(s) have been
87merged. Once merging has started the snapshot device, associated with
88the "snapshot" target, will return -EIO when accessed.
89
90
91How snapshot is used by LVM2
92============================
93When you create the first LVM2 snapshot of a volume, four dm devices are used:
94
951) a device containing the original mapping table of the source volume;
962) a device used as the <COW device>;
973) a "snapshot" device, combining #1 and #2, which is the visible snapshot
98 volume;
994) the "original" volume (which uses the device number used by the original
100 source volume), whose table is replaced by a "snapshot-origin" mapping
101 from device #1.
102
103A fixed naming scheme is used, so with the following commands::
104
105 lvcreate -L 1G -n base volumeGroup
106 lvcreate -L 100M --snapshot -n snap volumeGroup/base
107
108we'll have this situation (with volumes in above order)::
109
110 # dmsetup table|grep volumeGroup
111
112 volumeGroup-base-real: 0 2097152 linear 8:19 384
113 volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
114 volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
115 volumeGroup-base: 0 2097152 snapshot-origin 254:11
116
117 # ls -lL /dev/mapper/volumeGroup-*
118 brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
119 brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
120 brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
121 brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
122
123
124How snapshot-merge is used by LVM2
125==================================
126A merging snapshot assumes the role of the "snapshot-origin" while
127merging. As such the "snapshot-origin" is replaced with
128"snapshot-merge". The "-real" device is not changed and the "-cow"
129device is renamed to <origin name>-cow to aid LVM2's cleanup of the
130merging snapshot after it completes. The "snapshot" that hands over its
131COW device to the "snapshot-merge" is deactivated (unless using lvchange
132--refresh); but if it is left active it will simply return I/O errors.
133
134A snapshot will merge into its origin with the following command::
135
136 lvconvert --merge volumeGroup/snap
137
138we'll now have this situation::
139
140 # dmsetup table|grep volumeGroup
141
142 volumeGroup-base-real: 0 2097152 linear 8:19 384
143 volumeGroup-base-cow: 0 204800 linear 8:19 2097536
144 volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
145
146 # ls -lL /dev/mapper/volumeGroup-*
147 brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
148 brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
149 brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
150
151
152How to determine when merging is complete
153=========================================
154The snapshot-merge and snapshot status lines end with:
155
156 <sectors_allocated>/<total_sectors> <metadata_sectors>
157
158Both <sectors_allocated> and <total_sectors> include both data and metadata.
159During merging, the number of sectors allocated gets smaller and
160smaller. Merging has finished when the number of sectors holding data
161is zero, in other words <sectors_allocated> == <metadata_sectors>.
162
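The completion test described above can be sketched as a small shell function. This is only an illustration: the function name merge_is_complete is hypothetical, and it assumes the status line has exactly the field layout shown in this document (start, length, target name, <sectors_allocated>/<total_sectors>, <metadata_sectors>).

```shell
# Hypothetical helper (not part of dmsetup or LVM2): decide whether a
# snapshot-merge has finished from one line of `dmsetup status` output, e.g.
#   0 8388608 snapshot-merge 16/2097152 16
# Returns 0 when merging is complete, 1 when data sectors remain, and
# 2 when the line does not describe a snapshot-merge target.
merge_is_complete() {
    # Split the status line into whitespace-separated fields.
    set -- $1
    target=$3
    usage=$4        # <sectors_allocated>/<total_sectors>
    metadata=$5     # <metadata_sectors>

    [ "$target" = "snapshot-merge" ] || return 2

    # Keep only the part before the slash.
    allocated=${usage%%/*}

    # No data sectors are left once allocated == metadata.
    [ "$allocated" -eq "$metadata" ]
}

# Possible use (assumes the volumeGroup-base device from this document):
#   merge_is_complete "$(dmsetup status volumeGroup-base)" && echo "merged"
```

The function only inspects a string, so a caller would poll dmsetup status and pass each line in until the function returns 0.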
163Here is a practical example (using a hybrid of lvm and dmsetup commands)::
164
165 # lvs
166 LV VG Attr LSize Origin Snap% Move Log Copy% Convert
167 base volumeGroup owi-a- 4.00g
168 snap volumeGroup swi-a- 1.00g base 18.97
169
170 # dmsetup status volumeGroup-snap
171 0 8388608 snapshot 397896/2097152 1560
172 ^^^^ metadata sectors
173
174 # lvconvert --merge -b volumeGroup/snap
175 Merging of volume snap started.
176
177 # lvs volumeGroup/snap
178 LV VG Attr LSize Origin Snap% Move Log Copy% Convert
179 base volumeGroup Owi-a- 4.00g 17.23
180
181 # dmsetup status volumeGroup-base
182 0 8388608 snapshot-merge 281688/2097152 1104
183
184 # dmsetup status volumeGroup-base
185 0 8388608 snapshot-merge 180480/2097152 712
186
187 # dmsetup status volumeGroup-base
188 0 8388608 snapshot-merge 16/2097152 16
189
190Merging has finished.
191
192::
193
194 # lvs
195 LV VG Attr LSize Origin Snap% Move Log Copy% Convert
196 base volumeGroup owi-a- 4.00g
diff --git a/Documentation/admin-guide/device-mapper/statistics.rst b/Documentation/admin-guide/device-mapper/statistics.rst
new file mode 100644
index 000000000000..41ded0bc5933
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/statistics.rst
@@ -0,0 +1,225 @@
1=============
2DM statistics
3=============
4
5Device Mapper supports the collection of I/O statistics on user-defined
6regions of a DM device. If no regions are defined no statistics are
7collected so there isn't any performance impact. Only bio-based DM
8devices are currently supported.
9
10Each user-defined region specifies a starting sector, length and step.
11Individual statistics will be collected for each step-sized area within
12the range specified.
13
14The I/O statistics counters for each step-sized area of a region are
15in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
16Documentation/admin-guide/iostats.rst). But two extra counters (12 and 13) are
17provided: total time spent reading and writing. When the histogram
18argument is used, the 14th parameter is reported that represents the
19histogram of latencies. All these counters may be accessed by sending
20the @stats_print message to the appropriate DM device via dmsetup.
21
22The reported times are in milliseconds and the granularity depends on
23the kernel ticks. When the option precise_timestamps is used, the
24reported times are in nanoseconds.
25
26Each region has a corresponding unique identifier, which we call a
27region_id, that is assigned when the region is created. The region_id
28must be supplied when querying statistics about the region, deleting the
29region, etc. Unique region_ids enable multiple userspace programs to
30request and process statistics for the same DM device without stepping
31on each other's data.
32
33The creation of DM statistics will allocate memory via kmalloc or
34fall back to using vmalloc space. At most, 1/4 of the overall system
35memory may be allocated by DM statistics. The admin can see how much
36memory is used by reading:
37
38 /sys/module/dm_mod/parameters/stats_current_allocated_bytes
39
40Messages
41========
42
43 @stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
44 Create a new region and return the region_id.
45
46 <range>
47 "-"
48 whole device
49 "<start_sector>+<length>"
50 a range of <length> 512-byte sectors
51 starting with <start_sector>.
52
53 <step>
54 "<area_size>"
55 the range is subdivided into areas each containing
56 <area_size> sectors.
57 "/<number_of_areas>"
58 the range is subdivided into the specified
59 number of areas.
60
61 <number_of_optional_arguments>
62 The number of optional arguments
63
64 <optional_arguments>
65 The following optional arguments are supported:
66
67 precise_timestamps
68 use precise timer with nanosecond resolution
69 instead of the "jiffies" variable. When this argument is
70 used, the resulting times are in nanoseconds instead of
71 milliseconds. Precise timestamps are a little bit slower
72 to obtain than jiffies-based timestamps.
73 histogram:n1,n2,n3,n4,...
74 collect histogram of latencies. The
75 numbers n1, n2, etc are times that represent the boundaries
76 of the histogram. If precise_timestamps is not used, the
77 times are in milliseconds, otherwise they are in
78 nanoseconds. For each range, the kernel will report the
79 number of requests that completed within this range. For
80 example, if we use "histogram:10,20,30", the kernel will
81 report four numbers a:b:c:d. a is the number of requests
82 that took 0-10 ms to complete, b is the number of requests
83 that took 10-20 ms to complete, c is the number of requests
84 that took 20-30 ms to complete and d is the number of
85 requests that took more than 30 ms to complete.
86
87 <program_id>
88 An optional parameter. A name that uniquely identifies
89 the userspace owner of the range. This groups ranges together
90 so that userspace programs can identify the ranges they
91 created and ignore those created by others.
92 The kernel returns this string back in the output of
93 @stats_list message, but it doesn't use it for anything else.
94 If we omit the number of optional arguments, program id must not
95 be a number, otherwise it would be interpreted as the number of
96 optional arguments.
97
98 <aux_data>
99 An optional parameter. A word that provides auxiliary data
100 that is useful to the client program that created the range.
101 The kernel returns this string back in the output of
102 @stats_list message, but it doesn't use this value for anything.
103
104 @stats_delete <region_id>
105 Delete the region with the specified id.
106
107 <region_id>
108 region_id returned from @stats_create
109
110 @stats_clear <region_id>
111 Clear all the counters except the in-flight i/o counters.
112
113 <region_id>
114 region_id returned from @stats_create
115
116 @stats_list [<program_id>]
117 List all regions registered with @stats_create.
118
119 <program_id>
120 An optional parameter.
121 If this parameter is specified, only matching regions
122 are returned.
123 If it is not specified, all regions are returned.
124
125 Output format:
126 <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
127 precise_timestamps histogram:n1,n2,n3,...
128
129 The strings "precise_timestamps" and "histogram" are printed only
130 if they were specified when creating the region.
131
132 @stats_print <region_id> [<starting_line> <number_of_lines>]
133 Print counters for each step-sized area of a region.
134
135 <region_id>
136 region_id returned from @stats_create
137
138 <starting_line>
139 The index of the starting line in the output.
140 If omitted, all lines are returned.
141
142 <number_of_lines>
143 The number of lines to include in the output.
144 If omitted, all lines are returned.
145
146 Output format for each step-sized area of a region:
147
148 <start_sector>+<length>
149 counters
150
151 The first 11 counters have the same meaning as
152 `/sys/block/*/stat` or `/proc/diskstats`.
153
154 Please refer to Documentation/admin-guide/iostats.rst for details.
155
156 1. the number of reads completed
157 2. the number of reads merged
158 3. the number of sectors read
159 4. the number of milliseconds spent reading
160 5. the number of writes completed
161 6. the number of writes merged
162 7. the number of sectors written
163 8. the number of milliseconds spent writing
164 9. the number of I/Os currently in progress
165 10. the number of milliseconds spent doing I/Os
166 11. the weighted number of milliseconds spent doing I/Os
167
168 Additional counters:
169
170 12. the total time spent reading in milliseconds
171 13. the total time spent writing in milliseconds
172
173 @stats_print_clear <region_id> [<starting_line> <number_of_lines>]
174 Atomically print and then clear all the counters except the
175 in-flight i/o counters. Useful when the client consuming the
176 statistics does not want to lose any statistics (those updated
177 between printing and clearing).
178
179 <region_id>
180 region_id returned from @stats_create
181
182 <starting_line>
183 The index of the starting line in the output.
184 If omitted, all lines are printed and then cleared.
185
186 <number_of_lines>
187 The number of lines to process.
188 If omitted, all lines are printed and then cleared.
189
190 @stats_set_aux <region_id> <aux_data>
191 Store auxiliary data aux_data for the specified region.
192
193 <region_id>
194 region_id returned from @stats_create
195
196 <aux_data>
197 The string that identifies data which is useful to the client
198 program that created the range. The kernel returns this
199 string back in the output of @stats_list message, but it
200 doesn't use this value for anything.
201
202Examples
203========
204
205Subdivide the DM device 'vol' into 100 pieces and start collecting
206statistics on them::
207
208 dmsetup message vol 0 @stats_create - /100
209
210Set the auxiliary data string to "foo bar baz" (the escape for each
211space must also be escaped, otherwise the shell will consume them)::
212
213 dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
214
215List the statistics::
216
217 dmsetup message vol 0 @stats_list
218
219Print the statistics::
220
221 dmsetup message vol 0 @stats_print 0
222
223Delete the statistics::
224
225 dmsetup message vol 0 @stats_delete 0
diff --git a/Documentation/admin-guide/device-mapper/striped.rst b/Documentation/admin-guide/device-mapper/striped.rst
new file mode 100644
index 000000000000..e9a8da192ae1
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/striped.rst
@@ -0,0 +1,61 @@
1=========
2dm-stripe
3=========
4
5Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
6device across one or more underlying devices. Data is written in "chunks",
7with consecutive chunks rotating among the underlying devices. This can
8potentially provide improved I/O throughput by utilizing several physical
9devices in parallel.
10
11Parameters: <num devs> <chunk size> [<dev path> <offset>]+
12 <num devs>:
13 Number of underlying devices.
14 <chunk size>:
15 Size of each chunk of data. Must be at least as
16 large as the system's PAGE_SIZE.
17 <dev path>:
18 Full pathname to the underlying block-device, or a
19 "major:minor" device-number.
20 <offset>:
21 Starting sector within the device.
22
23One or more underlying devices can be specified. The striped device size must
24be a multiple of the chunk size multiplied by the number of underlying devices.
25
26
27Example scripts
28===============
29
30::
31
32 #!/usr/bin/perl -w
33 # Create a striped device across any number of underlying devices. The device
34 # will be called "stripe_dev" and have a chunk-size of 128k.
35
36 my $chunk_size = 128 * 2;
37 my $dev_name = "stripe_dev";
38 my $num_devs = @ARGV;
39 my @devs = @ARGV;
40 my ($min_dev_size, $stripe_dev_size, $i);
41
42 if (!$num_devs) {
43 die("Specify at least one device\n");
44 }
45
46 $min_dev_size = `blockdev --getsz $devs[0]`;
47 for ($i = 1; $i < $num_devs; $i++) {
48 my $this_size = `blockdev --getsz $devs[$i]`;
49 $min_dev_size = ($min_dev_size < $this_size) ?
50 $min_dev_size : $this_size;
51 }
52
53 $stripe_dev_size = $min_dev_size * $num_devs;
54 $stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
55
56 $table = "0 $stripe_dev_size striped $num_devs $chunk_size";
57 for ($i = 0; $i < $num_devs; $i++) {
58 $table .= " $devs[$i] 0";
59 }
60
61 `echo $table | dmsetup create $dev_name`;
diff --git a/Documentation/admin-guide/device-mapper/switch.rst b/Documentation/admin-guide/device-mapper/switch.rst
new file mode 100644
index 000000000000..7dde06be1a4f
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/switch.rst
@@ -0,0 +1,141 @@
1=========
2dm-switch
3=========
4
5The device-mapper switch target creates a device that supports an
6arbitrary mapping of fixed-size regions of I/O across a fixed set of
7paths. The path used for any specific region can be switched
8dynamically by sending the target a message.
9
10It maps I/O to underlying block devices efficiently when there is a large
11number of fixed-sized address regions but there is no simple pattern
12that would allow for a compact representation of the mapping such as
13dm-stripe.
14
15Background
16----------
17
18Dell EqualLogic and some other iSCSI storage arrays use a distributed
19frameless architecture. In this architecture, the storage group
20consists of a number of distinct storage arrays ("members") each having
21independent controllers, disk storage and network adapters. When a LUN
22is created it is spread across multiple members. The details of the
23spreading are hidden from initiators connected to this storage system.
24The storage group exposes a single target discovery portal, no matter
25how many members are being used. When iSCSI sessions are created, each
26session is connected to an eth port on a single member. Data to a LUN
27can be sent on any iSCSI session, and if the blocks being accessed are
28stored on another member the I/O will be forwarded as required. This
29forwarding is invisible to the initiator. The storage layout is also
30dynamic, and the blocks stored on disk may be moved from member to
31member as needed to balance the load.
32
33This architecture simplifies the management and configuration of both
34the storage group and initiators. In a multipathing configuration, it
35is possible to set up multiple iSCSI sessions to use multiple network
36interfaces on both the host and target to take advantage of the
37increased network bandwidth. An initiator could use a simple round
38robin algorithm to send I/O across all paths and let the storage array
39members forward it as necessary, but there is a performance advantage to
40sending data directly to the correct member.
41
42A device-mapper table already lets you map different regions of a
43device onto different targets. However in this architecture the LUN is
44spread with an address region size on the order of 10s of MBs, which
45means the resulting table could have more than a million entries and
46consume far too much memory.
47
48Using this device-mapper switch target we can now build a two-layer
49device hierarchy:
50
51 Upper Tier - Determine which array member the I/O should be sent to.
52 Lower Tier - Load balance amongst paths to a particular member.
53
54The lower tier consists of a single dm multipath device for each member.
55Each of these multipath devices contains the set of paths directly to
56the array member in one priority group, and leverages existing path
57selectors to load balance amongst these paths. We also build a
58non-preferred priority group containing paths to other array members for
59failover reasons.
60
61The upper tier consists of a single dm-switch device. This device uses
62a bitmap to look up the location of the I/O and choose the appropriate
63lower tier device to route the I/O. By using a bitmap we are able to
64use 4 bits for each address range in a 16 member group (which is very
65large for us). This is a much denser representation than the dm table
66b-tree can achieve.
67
68Construction Parameters
69=======================
70
71 <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
72 <num_paths>
73 The number of paths across which to distribute the I/O.
74
75 <region_size>
76 The number of 512-byte sectors in a region. Each region can be redirected
77 to any of the available paths.
78
79 <num_optional_args>
80 The number of optional arguments. Currently, no optional arguments
81 are supported and so this must be zero.
82
83 <dev_path>
84 The block device that represents a specific path to the device.
85
86 <offset>
87 The offset of the start of data on the specific <dev_path> (in units
88 of 512-byte sectors). This number is added to the sector number when
89 forwarding the request to the specific path. Typically it is zero.
90
91Messages
92========
93
94set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
95
96Modify the region table by specifying which regions are redirected to
97which paths.
98
99<index>
100 The region number (region size was specified in constructor parameters).
101 If index is omitted, the next region (previous index + 1) is used.
102 Expressed in hexadecimal (WITHOUT any prefix like 0x).
103
104<path_nr>
105 The path number in the range 0 ... (<num_paths> - 1).
106 Expressed in hexadecimal (WITHOUT any prefix like 0x).
107
108R<n>,<m>
109 This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
110 are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
111 slots.
112
113Status
114======
115
116No status line is reported.
117
118Example
119=======
120
121Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
122the same size.
123
124Create a switch device with 64kB region size::
125
126 dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
127 switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
128
129Set mappings for the first 7 entries to point to devices switch0, switch1,
130switch2, switch0, switch1, switch2, switch1::
131
132 dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
133
134Set repetitive mapping. This command::
135
136 dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
137
138is equivalent to::
139
140 dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
141 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
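
The equivalence above can be reproduced with a short shell sketch that expands R<n>,<m> into explicit mappings. The function name expand_region_mappings is hypothetical and the expansion is done purely in userspace for illustration. It assumes <n> is no larger than the number of mappings already listed in the same argument list.

```shell
# Hypothetical helper: expand R<n>,<m> tokens of a set_region_mappings
# argument list into the explicit "[<index>]:<path_nr>" form.
# <n> and <m> are hexadecimal, as in the message itself.
expand_region_mappings() {
    paths=""    # path numbers listed so far, in order
    out=""
    for tok in "$@"; do
        case $tok in
        R*,*)
            spec=${tok#R}
            n=$(( 0x${spec%%,*} ))    # length of the repeated pattern
            m=$(( 0x${spec#*,} ))     # number of slots to fill
            i=0
            while [ "$i" -lt "$m" ]; do
                # Copy the path number found n slots back.
                p=$(echo $paths | awk -v n="$n" '{ print $(NF - n + 1) }')
                paths="$paths $p"
                out="$out :$p"
                i=$(( i + 1 ))
            done
            ;;
        *)
            # Ordinary "[<index>]:<path_nr>" token: remember its path number.
            paths="$paths ${tok#*:}"
            out="$out $tok"
            ;;
        esac
    done
    echo "${out# }"
}

# Prints the expanded form of the repetitive mapping used in this document.
expand_region_mappings 1000:1 :2 R2,10
```

Because R2,10 is hexadecimal, it fills 0x10 = 16 slots, which matches the sixteen extra entries in the long form of the command.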
diff --git a/Documentation/admin-guide/device-mapper/thin-provisioning.rst b/Documentation/admin-guide/device-mapper/thin-provisioning.rst
new file mode 100644
index 000000000000..bafebf79da4b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/thin-provisioning.rst
@@ -0,0 +1,427 @@
1=================
2Thin provisioning
3=================
4
5Introduction
6============
7
8This document describes a collection of device-mapper targets that
9between them implement thin-provisioning and snapshots.
10
11The main highlight of this implementation, compared to the previous
12implementation of snapshots, is that it allows many virtual devices to
13be stored on the same data volume. This simplifies administration and
14allows the sharing of data between volumes, thus reducing disk usage.
15
16Another significant feature is support for an arbitrary depth of
17recursive snapshots (snapshots of snapshots of snapshots ...). The
18previous implementation of snapshots did this by chaining together
19lookup tables, and so performance was O(depth). This new
20implementation uses a single data structure to avoid this degradation
21with depth. Fragmentation may still be an issue, however, in some
22scenarios.
23
24Metadata is stored on a separate device from data, giving the
25administrator some freedom, for example to:
26
27- Improve metadata resilience by storing metadata on a mirrored volume
28 but data on a non-mirrored one.
29
30- Improve performance by storing the metadata on SSD.
31
32Status
33======
34
35These targets are considered safe for production use. But different use
36cases will have different performance characteristics, for example due
37to fragmentation of the data volume.
38
39If you find this software is not performing as expected please mail
40dm-devel@redhat.com with details and we'll try our best to improve
41things for you.
42
43Userspace tools for checking and repairing the metadata have been fully
44developed and are available as 'thin_check' and 'thin_repair'. The name
45of the package that provides these utilities varies by distribution (on
46a Red Hat distribution it is named 'device-mapper-persistent-data').
47
48Cookbook
49========
50
51This section describes some quick recipes for using thin provisioning.
52They use the dmsetup program to control the device-mapper driver
53directly. End users will be advised to use a higher-level volume
54manager such as LVM2 once support has been added.
55
56Pool device
57-----------
58
59The pool device ties together the metadata volume and the data volume.
60It maps I/O linearly to the data volume and updates the metadata via
61two mechanisms:
62
63- Function calls from the thin targets
64
65- Device-mapper 'messages' from userspace which control the creation of new
66 virtual devices amongst other things.
67
68Setting up a fresh pool device
69------------------------------
70
71Setting up a pool device requires a valid metadata device, and a
72data device. If you do not have an existing metadata device you can
73make one by zeroing the first 4k to indicate empty metadata::
74
75 dd if=/dev/zero of=$metadata_dev bs=4096 count=1
76
77The amount of metadata you need will vary according to how many blocks
78are shared between thin devices (i.e. through snapshots). If you have
79less sharing than average you'll need a larger-than-average metadata device.
80
81As a guide, we suggest you calculate the number of bytes to use in the
82metadata device as 48 * $data_dev_size / $data_block_size but round it up
83to 2MB if the answer is smaller. If you're creating large numbers of
84snapshots which are recording large amounts of change, you may find you
85need to increase this.
86
87The largest size supported is 16GB: if the device is larger,
88a warning will be issued and the excess space will not be used.
89
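The sizing guideline above can be worked through in a few lines of shell. This is a rough sketch: the function name thin_metadata_bytes is hypothetical, both sizes are assumed to be given in 512-byte sectors, and "2MB" and "16GB" are assumed to mean 2 MiB and 16 GiB.

```shell
# Hypothetical helper: suggested metadata device size in bytes.
#   $1 = data device size, in 512-byte sectors
#   $2 = data block size, in 512-byte sectors
thin_metadata_bytes() {
    # 48 bytes per data block.
    bytes=$(( 48 * $1 / $2 ))
    min=$(( 2 * 1024 * 1024 ))             # assumed meaning of "2MB"
    max=$(( 16 * 1024 * 1024 * 1024 ))     # assumed meaning of "16GB"

    # Round small answers up to the 2MB floor.
    if [ "$bytes" -lt "$min" ]; then
        bytes=$min
    fi
    # Space beyond the largest supported size would not be used.
    if [ "$bytes" -gt "$max" ]; then
        bytes=$max
    fi
    echo "$bytes"
}

# A 20971520-sector data device with 128-sector (64KB) blocks:
thin_metadata_bytes 20971520 128
```

For that example the raw result is 48 * 163840 = 7864320 bytes, about 7.5 MiB, so neither limit applies. Pools with many heavily diverging snapshots may need more than this estimate.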
90Reloading a pool table
91----------------------
92
93You may reload a pool's table; indeed, this is how the pool is resized
94if it runs out of space. (N.B. While specifying a different metadata
95device when reloading is not forbidden at the moment, things will go
96wrong if it does not route I/O to exactly the same on-disk location as
97previously.)
98
99Using an existing pool device
100-----------------------------
101
102::
103
104 dmsetup create pool \
105 --table "0 20971520 thin-pool $metadata_dev $data_dev \
106 $data_block_size $low_water_mark"
107
108$data_block_size gives the smallest unit of disk space that can be
109allocated at a time expressed in units of 512-byte sectors.
110$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
111multiple of 128 (64KB). $data_block_size cannot be changed after the
112thin-pool is created. People primarily interested in thin provisioning
113may want to use a value such as 1024 (512KB). People doing lots of
114snapshotting may want a smaller value such as 128 (64KB). If you are
115not zeroing newly-allocated data, a larger $data_block_size in the
116region of 256000 (128MB) is suggested.
117
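The constraints on $data_block_size can be checked before building a pool table. The sketch below is illustrative only: valid_data_block_size is a hypothetical name, and it assumes its argument is a plain decimal integer number of 512-byte sectors.

```shell
# Hypothetical helper: succeed only if $1 is an acceptable $data_block_size,
# i.e. between 128 (64KB) and 2097152 (1GB) sectors and a multiple of 128.
valid_data_block_size() {
    [ "$1" -ge 128 ] && [ "$1" -le 2097152 ] && [ $(( $1 % 128 )) -eq 0 ]
}

# Report on a few candidate values.
for size in 64 128 200 1024; do
    if valid_data_block_size "$size"; then
        echo "$size: ok"
    else
        echo "$size: rejected"
    fi
done
```

The kernel performs its own validation when the table is loaded; a check like this only gives earlier feedback in a script.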
118$low_water_mark is expressed in blocks of size $data_block_size. If
119free space on the data device drops below this level then a dm event
120will be triggered which a userspace daemon should catch allowing it to
121extend the pool device. Only one such event will be sent.
122
123No special event is triggered if a just resumed device's free space is below
124the low water mark. However, resuming a device always triggers an
125event; a userspace daemon should verify that free space exceeds the low
126water mark when handling this event.
127
128A low water mark for the metadata device is maintained in the kernel and
129will trigger a dm event if free space on the metadata device drops below
130it.
131
132Updating on-disk metadata
133-------------------------
134
135On-disk metadata is committed every time a FLUSH or FUA bio is written.
136If no such requests are made then commits will occur every second. This
137means the thin-provisioning target behaves like a physical disk that has
138a volatile write cache. If power is lost you may lose some recent
139writes. The metadata should always be consistent in spite of any crash.
140
141If data space is exhausted the pool will either error or queue IO
142according to the configuration (see: error_if_no_space). If metadata
143space is exhausted or a metadata operation fails: the pool will error IO
144until the pool is taken offline and repair is performed to 1) fix any
145potential inconsistencies and 2) clear the flag that imposes repair.
146Once the pool's metadata device is repaired it may be resized, which
147will allow the pool to return to normal operation. Note that if a pool
148is flagged as needing repair, the pool's data and metadata devices
149cannot be resized until repair is performed. It should also be noted
150that when the pool's metadata space is exhausted the current metadata
151transaction is aborted. Given that the pool will cache IO whose
152completion may have already been acknowledged to upper IO layers
153(e.g. filesystem) it is strongly suggested that consistency checks
154(e.g. fsck) be performed on those layers when repair of the pool is
155required.
156
157Thin provisioning
158-----------------
159
160i) Creating a new thinly-provisioned volume.
161
162 To create a new thinly-provisioned volume you must send a message to an
163 active pool device, /dev/mapper/pool in this example::
164
165 dmsetup message /dev/mapper/pool 0 "create_thin 0"
166
167 Here '0' is an identifier for the volume, a 24-bit number. It's up
168 to the caller to allocate and manage these identifiers. If the
169 identifier is already in use, the message will fail with -EEXIST.
170
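Because the caller owns the identifier space, it can help to reject values that do not fit in 24 bits before sending the message. The following is a minimal sketch; `is_valid_thin_id` is a hypothetical helper and not part of dmsetup.

```shell
#!/bin/sh
# Sketch: check that a candidate thin device id fits in 24 bits.
# is_valid_thin_id is a hypothetical helper, not part of dmsetup.
is_valid_thin_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -le 16777215 ]          # 2^24 - 1 is the largest 24-bit value
}

for id in 0 16777215 16777216; do
    if is_valid_thin_id "$id"; then
        echo "$id: ok"
    else
        echo "$id: out of range"
    fi
done
```

The check only covers the range; whether an identifier is already in use is still reported by the pool itself via -EEXIST.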
171ii) Using a thinly-provisioned volume.
172
173 Thinly-provisioned volumes are activated using the 'thin' target::
174
175 dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
176
177 The last parameter is the identifier for the thinp device.
178
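The length field in the table above is given in 512-byte sectors, so 2097152 corresponds to 1 GiB. As a rough sketch, a table line can be derived from a size in MiB; `thin_table` is a hypothetical helper that only prints the line and does not call dmsetup.

```shell
#!/bin/sh
# thin_table SIZE_MIB POOL_DEV DEV_ID   (hypothetical helper)
# Prints a 'thin' table line; the length is converted to 512-byte sectors.
thin_table() {
    sectors=$(( $1 * 1024 * 1024 / 512 ))
    echo "0 ${sectors} thin $2 $3"
}

# 1024 MiB reproduces the table used in the example above.
thin_table 1024 /dev/mapper/pool 0
```

The printed line could then be passed to `dmsetup create thin --table "..."` as in the example above.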
179Internal snapshots
180------------------
181
182i) Creating an internal snapshot.
183
184 Snapshots are created with another message to the pool.
185
186 N.B. If the origin device that you wish to snapshot is active, you
187 must suspend it before creating the snapshot to avoid corruption.
188 This is NOT enforced at the moment, so please be careful!
189
190 ::
191
192 dmsetup suspend /dev/mapper/thin
193 dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
194 dmsetup resume /dev/mapper/thin
195
196 Here '1' is the identifier for the volume, a 24-bit number. '0' is the
197 identifier for the origin device.
198
199ii) Using an internal snapshot.
200
201 Once created, the user doesn't have to worry about any connection
202 between the origin and the snapshot. Indeed the snapshot is no
203 different from any other thinly-provisioned device and can be
204 snapshotted itself via the same method. It's perfectly legal to
205 have only one of them active, and there's no ordering requirement on
206 activating or removing them both. (This differs from conventional
207 device-mapper snapshots.)
208
209 Activate it exactly the same way as any other thinly-provisioned volume::
210
211 dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
212
213External snapshots
214------------------
215
216You can use an external **read only** device as an origin for a
217thinly-provisioned volume. Any read to an unprovisioned area of the
218thin device will be passed through to the origin. Writes trigger
219the allocation of new blocks as usual.
220
221One use case for this is VM hosts that want to run guests on
222thinly-provisioned volumes but have the base image on another device
223(possibly shared between many VMs).
224
225You must not write to the origin device if you use this technique!
226Of course, you may write to the thin device and take internal snapshots
227of the thin volume.
228
229i) Creating a snapshot of an external device
230
231 This is the same as creating a thin device.
232 You don't mention the origin at this stage.
233
234 ::
235
236 dmsetup message /dev/mapper/pool 0 "create_thin 0"
237
238ii) Using a snapshot of an external device.
239
240 Append an extra parameter to the thin target specifying the origin::
241
242 dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
243
244 N.B. All descendants (internal snapshots) of this snapshot require the
245 same extra origin parameter.
246
247Deactivation
248------------
249
250All devices using a pool must be deactivated before the pool itself
251can be.
252
253::
254
255 dmsetup remove thin
256 dmsetup remove snap
257 dmsetup remove pool
258
259Reference
260=========
261
262'thin-pool' target
263------------------
264
265i) Constructor
266
267 ::
268
269 thin-pool <metadata dev> <data dev> <data block size (sectors)> \
270 <low water mark (blocks)> [<number of feature args> [<arg>]*]
271
272 Optional feature arguments:
273
274 skip_block_zeroing:
275 Skip the zeroing of newly-provisioned blocks.
276
277 ignore_discard:
278 Disable discard support.
279
280 no_discard_passdown:
281 Don't pass discards down to the underlying
282 data device, but just remove the mapping.
283
284 read_only:
285 Don't allow any changes to be made to the pool
286 metadata. This mode is only available after the
287 thin-pool has been created and first used in full
288 read/write mode. It cannot be specified on initial
289 thin-pool creation.
290
291 error_if_no_space:
292 Error IOs, instead of queueing, if no space.
293
294 Data block size must be between 64KB (128 sectors) and 1GB
295 (2097152 sectors) inclusive.
296
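 As an illustration of the constructor syntax, the sketch below assembles a thin-pool table line and rejects a data block size outside the documented range. `pool_table` is a hypothetical helper, the device names are placeholders, and only the range stated above is checked.

```shell
#!/bin/sh
# pool_table LEN META_DEV DATA_DEV BLOCK_SIZE LOW_WATER [FEATURE...]
# Hypothetical helper: prints a thin-pool table line, nothing is created.
pool_table() {
    len=$1 meta=$2 data=$3 bs=$4 lwm=$5
    shift 5
    # Documented range: 128 sectors (64KB) to 2097152 sectors (1GB) inclusive.
    if [ "$bs" -lt 128 ] || [ "$bs" -gt 2097152 ]; then
        echo "data block size $bs is outside 128..2097152 sectors" >&2
        return 1
    fi
    line="0 $len thin-pool $meta $data $bs $lwm"
    # Feature arguments are optional; when present they follow a count.
    if [ "$#" -gt 0 ]; then
        line="$line $# $*"
    fi
    echo "$line"
}

# Placeholder devices and sizes, chosen only for illustration.
pool_table 20971520 /dev/sdb1 /dev/sdb2 128 32768 skip_block_zeroing
```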
297
298ii) Status
299
300 ::
301
302 <transaction id> <used metadata blocks>/<total metadata blocks>
303 <used data blocks>/<total data blocks> <held metadata root>
304 ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
305 needs_check|- metadata_low_watermark
306
307 transaction id:
308 A 64-bit number used by userspace to help synchronise with metadata
309 from volume managers.
310
311 used data blocks / total data blocks
312 If the number of free blocks drops below the pool's low water mark a
313 dm event will be sent to userspace. This event is edge-triggered and
314 it will occur only once after each resume so volume manager writers
315 should register for the event and then check the target's status.
316
317 held metadata root:
318 The location, in blocks, of the metadata root that has been
319 'held' for userspace read access. '-' indicates there is no
320 held root.
321
322 discard_passdown|no_discard_passdown
323 Whether or not discards are actually being passed down to the
324 underlying device. Even if this is enabled when loading the table,
325 it can be disabled if the underlying device doesn't support it.
326
327 ro|rw|out_of_data_space
328 If the pool encounters certain types of device failures it will
329 drop into a read-only metadata mode in which no changes to
330 the pool metadata (like allocating new blocks) are permitted.
331
332 In serious cases where even a read-only mode is deemed unsafe
333 no further I/O will be permitted and the status will just
334 contain the string 'Fail'. The userspace recovery tools
335 should then be used.
336
337 error_if_no_space|queue_if_no_space
338 If the pool runs out of data or metadata space, the pool will
339 either queue or error the IO destined to the data device. The
340 default is to queue the IO until more space is added or the
341 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
342 module parameter can be used to change this timeout -- it
343 defaults to 60 seconds but may be disabled using a value of 0.
344
345 needs_check
346 A metadata operation has failed, resulting in the needs_check
347 flag being set in the metadata's superblock. The metadata
348 device must be deactivated and checked/repaired before the
349 thin-pool can be made fully operational again. '-' indicates
350 needs_check is not set.
351
352 metadata_low_watermark:
353 Value of metadata low watermark in blocks. The kernel sets this
354 value internally but userspace needs to know this value to
355 determine if an event was caused by crossing this threshold.
356
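 A volume manager that wakes up on a dm event would typically re-read the status fields described above. The sketch below computes data usage as a percentage from the target-specific part of a status line. `pool_data_usage` is a hypothetical helper and the sample string is made up for illustration.

```shell
#!/bin/sh
# pool_data_usage "STATUS_FIELDS"   (hypothetical helper)
# STATUS_FIELDS is the target-specific part of the status line, starting
# at <transaction id>. Prints used data blocks as an integer percentage.
pool_data_usage() {
    set -- $1                 # split the status string into fields
    used=${3%/*}              # field 3 is <used data blocks>/<total data blocks>
    total=${3#*/}
    echo $(( used * 100 / total ))
}

# Made-up sample status, not captured from a real pool.
sample="1 12/1024 512/2048 - rw discard_passdown queue_if_no_space - 1024"
pool_data_usage "$sample"
```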
357iii) Messages
358
359 create_thin <dev id>
360 Create a new thinly-provisioned device.
361 <dev id> is an arbitrary unique 24-bit identifier chosen by
362 the caller.
363
364 create_snap <dev id> <origin id>
365 Create a new snapshot of another thinly-provisioned device.
366 <dev id> is an arbitrary unique 24-bit identifier chosen by
367 the caller.
368 <origin id> is the identifier of the thinly-provisioned device
369 of which the new device will be a snapshot.
370
371 delete <dev id>
372 Deletes a thin device. Irreversible.
373
374 set_transaction_id <current id> <new id>
375 Userland volume managers, such as LVM, need a way to
376 synchronise their external metadata with the internal metadata of the
377 pool target. The thin-pool target offers to store an
378 arbitrary 64-bit transaction id and return it on the target's
379 status line. To avoid races you must provide what you think
380 the current transaction id is when you change it with this
381 compare-and-swap message.
382
383 reserve_metadata_snap
384 Reserve a copy of the data mapping btree for use by userland.
385 This allows userland to inspect the mappings as they were when
386 this message was executed. Use the pool's status command to
387 get the root block associated with the metadata snapshot.
388
389 release_metadata_snap
390 Release a previously reserved copy of the data mapping btree.
391
392'thin' target
393-------------
394
395i) Constructor
396
397 ::
398
399 thin <pool dev> <dev id> [<external origin dev>]
400
401 pool dev:
402 the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
403
404 dev id:
405 the internal device identifier of the device to be
406 activated.
407
408 external origin dev:
409 an optional block device outside the pool to be treated as a
410 read-only snapshot origin: reads to unprovisioned areas of the
411 thin target will be mapped to this device.
412
413The pool doesn't store any size against the thin devices. If you
414load a thin target that is smaller than you've been using previously,
415then you'll have no access to blocks mapped beyond the end. If you
416load a target that is bigger than before, then extra blocks will be
417provisioned as and when needed.
418
419ii) Status
420
421 <nr mapped sectors> <highest mapped sector>
422 If the pool has encountered device errors and failed, the status
423 will just contain the string 'Fail'. The userspace recovery
424 tools should then be used.
425
426 In the case where <nr mapped sectors> is 0, there is no highest
427 mapped sector and the value of <highest mapped sector> is unspecified.
diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst
new file mode 100644
index 000000000000..0a8d3eb3f072
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/unstriped.rst
@@ -0,0 +1,135 @@
1================================
2Device-mapper "unstriped" target
3================================
4
5Introduction
6============
7
8The device-mapper "unstriped" target provides a transparent mechanism to
9unstripe a device-mapper "striped" target to access the underlying disks
10without having to touch the true backing block-device. It can also be
11used to unstripe a hardware RAID-0 to access backing disks.
12
13Parameters:
14<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
15
16<number of stripes>
17 The number of stripes in the RAID 0.
18
19<chunk size>
20 The number of 512-byte sectors in each stripe chunk.
21
22<dev_path>
23 The block device you wish to unstripe.
24
25<stripe #>
26 The stripe number within the device that corresponds to the physical
27 drive you wish to unstripe. This must be 0-indexed.
28
29
30Why use this module?
31====================
32
33An example of undoing an existing dm-stripe
34-------------------------------------------
35
36This small bash script will set up 4 loop devices and use the existing
37striped target to combine the 4 devices into one. It then will use
38the unstriped target on top of the striped device to access the
39individual backing loop devices. We write data to the newly exposed
40unstriped devices and verify the data written matches the correct
41underlying device on the striped array::
42
43 #!/bin/bash
44
45 MEMBER_SIZE=$((128 * 1024 * 1024))
46 NUM=4
47 SEQ_END=$((${NUM}-1))
48 CHUNK=256
49 BS=4096
50
51 RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
52 DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
53 COUNT=$((${MEMBER_SIZE} / ${BS}))
54
55 for i in $(seq 0 ${SEQ_END}); do
56 dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
57 losetup /dev/loop${i} member-${i}
58 DM_PARMS+=" /dev/loop${i} 0"
59 done
60
61 echo $DM_PARMS | dmsetup create raid0
62 for i in $(seq 0 ${SEQ_END}); do
63 echo "0 $((${MEMBER_SIZE}/512)) unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
64 done;
65
66 for i in $(seq 0 ${SEQ_END}); do
67 dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
68 diff /dev/mapper/set-${i} member-${i}
69 done;
70
71 for i in $(seq 0 ${SEQ_END}); do
72 dmsetup remove set-${i}
73 done
74
75 dmsetup remove raid0
76
77 for i in $(seq 0 ${SEQ_END}); do
78 losetup -d /dev/loop${i}
79 rm -f member-${i}
80 done
81
82Another example
83---------------
84
85Intel NVMe drives contain two cores on the physical device.
86Each core of the drive has segregated access to its LBA range.
87The current LBA model has a RAID 0 128k chunk on each core, resulting
88in a 256k stripe across the two cores::
89
90 Core 0: Core 1:
91 __________ __________
92 | LBA 512| | LBA 768|
93 | LBA 0 | | LBA 256|
94 ---------- ----------
95
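The layout in the diagram above follows from ordinary RAID 0 arithmetic. As a sketch, assuming LBAs are counted in 512-byte sectors and a chunk of 256 sectors (128k), a sector on one unstriped device maps back to the striped device as shown below; `unstriped_to_striped` is a hypothetical helper, not kernel code.

```shell
#!/bin/sh
# unstriped_to_striped STRIPES CHUNK STRIPE SECTOR   (hypothetical helper)
# Maps a sector on one unstriped device to the sector on the striped device.
unstriped_to_striped() {
    stripes=$1 chunk=$2 stripe=$3 sector=$4
    row=$(( sector / chunk ))       # which chunk of this stripe
    offset=$(( sector % chunk ))    # position inside that chunk
    echo $(( (row * stripes + stripe) * chunk + offset ))
}

# Two cores, 256-sector chunks: reproduces LBA 0/512 and LBA 256/768.
for s in 0 1; do
    for x in 0 256; do
        echo "core $s sector $x -> LBA $(unstriped_to_striped 2 256 "$s" "$x")"
    done
done
```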
96The purpose of this unstriping is to provide better QoS in noisy
97neighbor environments. When two partitions are created on the
98aggregate drive without this unstriping, reads on one partition
99can affect writes on another partition. This is because the partitions
100are striped across the two cores. When we unstripe this hardware RAID 0
101and make partitions on each new exposed device the two partitions are now
102physically separated.
103
104With the dm-unstriped target we're able to segregate an fio script that
105has read and write jobs that are independent of each other. Compared to
106when we run the test on a combined drive with partitions, we were able
107to get a 92% reduction in read latency using this device mapper target.
108
109
110Example dmsetup usage
111=====================
112
113unstriped on top of Intel NVMe device that has 2 cores
114------------------------------------------------------
115
116::
117
118 dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
119 dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
120
121There will now be two devices that expose Intel NVMe core 0 and 1
122respectively::
123
124 /dev/mapper/nvmset0
125 /dev/mapper/nvmset1
126
127unstriped on top of striped with 4 drives using 128K chunk size
128---------------------------------------------------------------
129
130::
131
132 dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
133 dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
134 dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
135 dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
new file mode 100644
index 000000000000..a4d1c1476d72
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -0,0 +1,229 @@
1=========
2dm-verity
3=========
4
5Device-Mapper's "verity" target provides transparent integrity checking of
6block devices using a cryptographic digest provided by the kernel crypto API.
7This target is read-only.
8
9Construction Parameters
10=======================
11
12::
13
14 <version> <dev> <hash_dev>
15 <data_block_size> <hash_block_size>
16 <num_data_blocks> <hash_start_block>
17 <algorithm> <digest> <salt>
18 [<#opt_params> <opt_params>]
19
20<version>
21 This is the type of the on-disk hash format.
22
23 0 is the original format used in the Chromium OS.
24 The salt is appended when hashing, digests are stored continuously and
25 the rest of the block is padded with zeroes.
26
27 1 is the current format that should be used for new devices.
28 The salt is prepended when hashing and each digest is
29 padded with zeroes to the power of two.
30
31<dev>
32 This is the device containing data, the integrity of which needs to be
33 checked. It may be specified as a path, like /dev/sdaX, or a device number,
34 <major>:<minor>.
35
36<hash_dev>
37 This is the device that supplies the hash tree data. It may be
38 specified similarly to the device path and may be the same device. If the
39 same device is used, the hash_start should be outside the configured
40 dm-verity device.
41
42<data_block_size>
43 The block size on a data device in bytes.
44 Each block corresponds to one digest on the hash device.
45
46<hash_block_size>
47 The size of a hash block in bytes.
48
49<num_data_blocks>
50 The number of data blocks on the data device. Additional blocks are
51 inaccessible. You can place hashes on the same partition as the data; in this
52 case the hashes are placed after <num_data_blocks>.
53
54<hash_start_block>
55 This is the offset, in <hash_block_size>-blocks, from the start of hash_dev
56 to the root block of the hash tree.
57
58<algorithm>
59 The cryptographic hash algorithm used for this device. This should
60 be the name of the algorithm, like "sha1".
61
62<digest>
63 The hexadecimal encoding of the cryptographic hash of the root hash block
64 and the salt. This hash should be trusted as there is no other authenticity
65 beyond this point.
66
67<salt>
68 The hexadecimal encoding of the salt value.
69
70<#opt_params>
71 Number of optional parameters. If there are no optional parameters,
72 the optional parameters section can be skipped or #opt_params can be zero.
73 Otherwise #opt_params is the number of following arguments.
74
75 Example of optional parameters section:
76 1 ignore_corruption
77
78ignore_corruption
79 Log corrupted blocks, but allow read operations to proceed normally.
80
81restart_on_corruption
82 Restart the system when a corrupted block is discovered. This option is
83 not compatible with ignore_corruption and requires user space support to
84 avoid restart loops.
85
86ignore_zero_blocks
87 Do not verify blocks that are expected to contain zeroes and always return
88 zeroes instead. This may be useful if the partition contains unused blocks
89 that are not guaranteed to contain zeroes.
90
91use_fec_from_device <fec_dev>
92 Use forward error correction (FEC) to recover from corruption if hash
93 verification fails. Use encoding data from the specified device. This
94 may be the same device where data and hash blocks reside, in which case
95 fec_start must be outside data and hash areas.
96
97 If the encoding data covers additional metadata, it must be accessible
98 on the hash device after the hash blocks.
99
100 Note: block sizes for data and hash devices must match. Also, if the
101 verity <dev> is encrypted the <fec_dev> should be too.
102
103fec_roots <num>
104 Number of generator roots. This equals the number of parity bytes in
105 the encoding data. For example, in RS(M, N) encoding, the number of roots
106 is M-N.
107
108fec_blocks <num>
109 The number of encoding data blocks on the FEC device. The block size for
110 the FEC device is <data_block_size>.
111
112fec_start <offset>
113 This is the offset, in <data_block_size> blocks, from the start of the
114 FEC device to the beginning of the encoding data.
115
116check_at_most_once
117 Verify data blocks only the first time they are read from the data device,
118 rather than every time. This reduces the overhead of dm-verity so that it
119 can be used on systems that are memory and/or CPU constrained. However, it
120 provides a reduced level of security because only offline tampering of the
121 data device's content will be detected, not online tampering.
122
123 Hash blocks are still verified each time they are read from the hash device,
124 since verification of hash blocks is less performance critical than data
125 blocks, and a hash block will not be verified any more after all the data
126 blocks it covers have been verified anyway.
127
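Since #opt_params counts every following word, including the arguments of options such as fec_roots, the optional section is easy to miscount by hand. The sketch below derives the count from the arguments; `verity_opt_params` is a hypothetical helper and the FEC values are placeholders, not a working configuration.

```shell
#!/bin/sh
# verity_opt_params [OPTION [ARG]...]   (hypothetical helper)
# Prints the optional parameters section with its leading word count.
verity_opt_params() {
    if [ "$#" -gt 0 ]; then
        echo "$# $*"
    fi
}

verity_opt_params ignore_corruption
# Placeholder device and values, shown only to illustrate the counting.
verity_opt_params use_fec_from_device /dev/sdc fec_roots 2 fec_blocks 1024 fec_start 0
```

With no arguments the helper prints nothing, which matches the case where the optional section is skipped.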
128Theory of operation
129===================
130
131dm-verity is meant to be set up as part of a verified boot path. This
132may be anything ranging from a boot using tboot or trustedgrub to just
133booting from a known-good device (like a USB drive or CD).
134
135When a dm-verity device is configured, it is expected that the caller
136has been authenticated in some way (cryptographic signatures, etc).
137After instantiation, all hashes will be verified on-demand during
138disk access. If they cannot be verified up to the root node of the
139tree, the root hash, then the I/O will fail. This should detect
140tampering with any data on the device and the hash data.
141
142Cryptographic hashes are used to assert the integrity of the device on a
143per-block basis. This allows for a lightweight hash computation on first read
144into the page cache. Block hashes are stored linearly, aligned to the nearest
145block size.
146
147If forward error correction (FEC) support is enabled any recovery of
148corrupted data will be verified using the cryptographic hash of the
149corresponding data. This is why combining error correction with
150integrity checking is essential.
151
152Hash Tree
153---------
154
155Each node in the tree is a cryptographic hash. If it is a leaf node, the hash
156of some data block on disk is calculated. If it is an intermediary node,
157the hash of a number of child nodes is calculated.
158
159Each entry in the tree is a collection of neighboring nodes that fit in one
160block. The number is determined based on block_size and the size of the
161selected cryptographic digest algorithm. The hashes are linearly-ordered in
162this entry and any unaligned trailing space is ignored but included when
163calculating the parent node.
164
165The tree looks something like:
166
167 alg = sha256, num_blocks = 32768, block_size = 4096
168
169::
170
171 [ root ]
172 / . . . \
173 [entry_0] [entry_1]
174 / . . . \ . . . \
175 [entry_0_0] . . . [entry_0_127] . . . . [entry_1_127]
176 / ... \ / . . . \ / \
177 blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767
178
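The shape of the tree above can be checked with a little arithmetic: a 4096-byte hash block holds 128 sha256 digests of 32 bytes each, so 32768 data blocks need 256 leaf hash blocks, then 2, then 1 root block. A minimal sketch, assuming the digest size is already a power of two; `hash_tree_levels` is a hypothetical helper.

```shell
#!/bin/sh
# hash_tree_levels NUM_DATA_BLOCKS BLOCK_SIZE DIGEST_SIZE   (hypothetical helper)
# Prints the number of hash blocks per level, from the leaves up to the root.
hash_tree_levels() {
    blocks=$1
    per_block=$(( $2 / $3 ))    # digests that fit in one hash block
    levels=""
    while :; do
        # Round up: a partially filled hash block still occupies a block.
        blocks=$(( (blocks + per_block - 1) / per_block ))
        levels="${levels:+$levels }$blocks"
        if [ "$blocks" -le 1 ]; then
            break
        fi
    done
    echo "$levels"
}

# alg = sha256 (32-byte digest), num_blocks = 32768, block_size = 4096
hash_tree_levels 32768 4096 32
```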
179
180On-disk format
181==============
182
183The verity kernel code does not read the verity metadata on-disk header.
184It only reads the hash blocks which directly follow the header.
185It is expected that a user-space tool will verify the integrity of the
186verity header.
187
188Alternatively, the header can be omitted and the dmsetup parameters can
189be passed via the kernel command-line in a rooted chain of trust where
190the command-line is verified.
191
192Directly following the header (and with sector number padded to the next hash
193block boundary) are the hash blocks which are stored a depth at a time
194(starting from the root), sorted in order of increasing index.
195
196The full specification of kernel parameters and on-disk metadata format
197is available at the cryptsetup project's wiki page
198
199 https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
200
201Status
202======
203V (for Valid) is returned if every check performed so far was valid.
204If any check failed, C (for Corruption) is returned.
205
206Example
207=======
208Set up a device::
209
210 # dmsetup create vroot --readonly --table \
211 "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
212 "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
213 "1234000000000000000000000000000000000000000000000000000000000000"
214
215A command line tool veritysetup is available to compute or verify
216the hash tree or activate the kernel device. This is available from
217the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
218(as a libcryptsetup extension).
219
220Create hash on the device::
221
222 # veritysetup format /dev/sda1 /dev/sda2
223 ...
224 Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
225
226Activate the device::
227
228 # veritysetup create vroot /dev/sda1 /dev/sda2 \
229 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
diff --git a/Documentation/admin-guide/device-mapper/writecache.rst b/Documentation/admin-guide/device-mapper/writecache.rst
new file mode 100644
index 000000000000..d3d7690f5e8d
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/writecache.rst
@@ -0,0 +1,79 @@
1=================
2Writecache target
3=================
4
5The writecache target caches writes on persistent memory or on SSD. It
6doesn't cache reads because reads are supposed to be cached in the page cache
7in normal RAM.
8
9When the device is constructed, the first sector should either be zeroed or
10contain a valid superblock from a previous invocation.
11
12Constructor parameters:
13
141. type of the cache device - "p" or "s"
15
16 - p - persistent memory
17 - s - SSD
182. the underlying device that will be cached
193. the cache device
204. block size (4096 is recommended; the maximum block size is the page
21 size)
225. the number of optional parameters (the parameters with an argument
23 count as two)
24
25 start_sector n (default: 0)
26 offset from the start of cache device in 512-byte sectors
27 high_watermark n (default: 50)
28 start writeback when the number of used blocks reaches this
29 watermark
30 low_watermark x (default: 45)
31 stop writeback when the number of used blocks drops below
32 this watermark
33 writeback_jobs n (default: unlimited)
34 limit the number of blocks that are in flight during
35 writeback. Setting this value reduces writeback
36 throughput, but it may improve latency of read requests
37 autocommit_blocks n (default: 64 for pmem, 65536 for ssd)
38 when the application writes this number of blocks without
39 issuing the FLUSH request, the blocks are automatically
40 committed
41 autocommit_time ms (default: 1000)
42 autocommit time in milliseconds. The data is automatically
43 committed if this time passes and no FLUSH request is
44 received
45 fua (by default on)
46 applicable only to persistent memory - use the FUA flag
47 when writing data from persistent memory back to the
48 underlying device
49 nofua
50 applicable only to persistent memory - don't use the FUA
51 flag when writing back data and send the FLUSH request
52 afterwards
53
54 - some underlying devices perform better with fua, some
55 with nofua. The user should test it
56
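To make the parameter order and the optional-parameter count concrete, the sketch below prints a writecache table line. `writecache_table` is a hypothetical helper, the devices and watermark values are placeholders, and the standard "start length target" table prefix is assumed.

```shell
#!/bin/sh
# writecache_table LEN TYPE ORIGIN_DEV CACHE_DEV BLOCK_SIZE [OPTION [ARG]...]
# Hypothetical helper: prints a table line, nothing is created.
writecache_table() {
    len=$1 type=$2 origin=$3 cache=$4 bs=$5
    shift 5
    line="0 $len writecache $type $origin $cache $bs $#"
    # Each word counts, so an option with an argument contributes two.
    if [ "$#" -gt 0 ]; then
        line="$line $*"
    fi
    echo "$line"
}

# SSD cache with both watermarks set: four optional words in total.
writecache_table 2097152 s /dev/sdb /dev/sdc 4096 high_watermark 60 low_watermark 40
```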
57Status:
581. error indicator - 0 if there was no error, otherwise error number
592. the number of blocks
603. the number of free blocks
614. the number of blocks under writeback
62
63Messages:
64 flush
65 flush the cache device. The message returns successfully
66 if the cache device was flushed without an error
67 flush_on_suspend
68 flush the cache device on next suspend. Use this message
69 when you are going to remove the cache device. The proper
70 sequence for removing the cache device is:
71
72 1. send the "flush_on_suspend" message
73 2. load an inactive table with a linear target that maps
74 to the underlying device
75 3. suspend the device
76 4. ask for status and verify that there are no errors
77 5. resume the device, so that it will use the linear
78 target
79 6. the cache device is now inactive and it can be deleted
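Steps 1 to 5 of the removal sequence above can be written out as dmsetup commands. The sketch below only prints them as a dry run; `writecache_removal_plan` is a hypothetical helper, the device names are placeholders, and the status output in step 4 still has to be inspected before resuming.

```shell
#!/bin/sh
# writecache_removal_plan DM_NAME LEN ORIGIN_DEV   (hypothetical helper)
# Prints the dmsetup commands for steps 1-5; it does not execute them.
writecache_removal_plan() {
    dev=$1 len=$2 origin=$3
    echo "dmsetup message $dev 0 flush_on_suspend"               # step 1
    echo "dmsetup load $dev --table \"0 $len linear $origin 0\""  # step 2
    echo "dmsetup suspend $dev"                                  # step 3
    echo "dmsetup status $dev"                                   # step 4
    echo "dmsetup resume $dev"                                   # step 5
}

writecache_removal_plan wc 2097152 /dev/sdb
```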
diff --git a/Documentation/admin-guide/device-mapper/zero.rst b/Documentation/admin-guide/device-mapper/zero.rst
new file mode 100644
index 000000000000..11fb5cf4597c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/zero.rst
@@ -0,0 +1,37 @@
1=======
2dm-zero
3=======
4
5Device-Mapper's "zero" target provides a block-device that always returns
6zero'd data on reads and silently drops writes. This is similar behavior to
7/dev/zero, but as a block-device instead of a character-device.
8
9Dm-zero has no target-specific parameters.
10
11One very interesting use of dm-zero is for creating "sparse" devices in
12conjunction with dm-snapshot. A sparse device reports a device-size larger
13than the amount of actual storage space available for that device. A user can
14write data anywhere within the sparse device and read it back like a normal
15device. Reads to previously unwritten areas will return a zero'd buffer. When
16enough data has been written to fill up the actual storage space, the sparse
17device is deactivated. This can be very useful for testing device and
18filesystem limitations.
19
20To create a sparse device, start by creating a dm-zero device that's the
21desired size of the sparse device. For this example, we'll assume a 10TB
22sparse device::
23
24 TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
25 echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
26
27Then create a snapshot of the zero device, using any available block-device as
28the COW device. The size of the COW device will determine the amount of real
29space available to the sparse device. For this example, we'll assume /dev/sdb1
30is an available 10GB partition::
31
32 echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
33 dmsetup create sparse1
34
35This will create a 10TB sparse device called /dev/mapper/sparse1 that has
3610GB of actual storage space available. If more than 10GB of data is written
37to this device, it will start returning I/O errors.
diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst
new file mode 100644
index 000000000000..833edb0d0bc4
--- /dev/null
+++ b/Documentation/admin-guide/efi-stub.rst
@@ -0,0 +1,100 @@
1=================
2The EFI Boot Stub
3=================
4
5On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade
6as a PE/COFF image, thereby convincing EFI firmware loaders to load
7it as an EFI executable. The code that modifies the bzImage header,
8along with the EFI-specific entry point that the firmware loader
9jumps to are collectively known as the "EFI boot stub", and live in
10arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
11respectively. For ARM the EFI stub is implemented in
12arch/arm/boot/compressed/efi-header.S and
13arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
14between architectures is in drivers/firmware/efi/libstub.
15
16For arm64, there is no compressed kernel support, so the Image itself
17masquerades as a PE/COFF image and the EFI stub is linked into the
18kernel. The arm64 EFI stub lives in arch/arm64/kernel/efi-entry.S
19and drivers/firmware/efi/libstub/arm64-stub.c.
20
21By using the EFI boot stub it's possible to boot a Linux kernel
22without the use of a conventional EFI boot loader, such as grub or
23elilo. Since the EFI boot stub performs the jobs of a boot loader, in
24a certain sense it *IS* the boot loader.
25
26The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option.
27
28
29How to install bzImage.efi
30--------------------------
31
32The bzImage located in arch/x86/boot/bzImage must be copied to the EFI
33System Partition (ESP) and renamed with the extension ".efi". Without
34the extension the EFI firmware loader will refuse to execute it. It's
35not possible to execute bzImage.efi from the usual Linux file systems
36because EFI firmware doesn't have support for them. For ARM the
37arch/arm/boot/zImage should be copied to the system partition, and it
38may not need to be renamed. Similarly for arm64, arch/arm64/boot/Image
39should be copied but not necessarily renamed.
40
41
42Passing kernel parameters from the EFI shell
43--------------------------------------------
44
45Arguments to the kernel can be passed after bzImage.efi, e.g.::
46
47 fs0:> bzImage.efi console=ttyS0 root=/dev/sda4
48
49
50The "initrd=" option
51--------------------
52
53Like most boot loaders, the EFI stub allows the user to specify
54multiple initrd files using the "initrd=" option. This is the only EFI
55stub-specific command line parameter; everything else is passed to the
56kernel when it boots.
57
58The path to the initrd file must be an absolute path from the
59beginning of the ESP; relative path names do not work. Also, the path
60is an EFI-style path and directory elements must be separated with
61backslashes (\). For example, given the following directory layout::
62
63 fs0:>
64 Kernels\
65 bzImage.efi
66 initrd-large.img
67
68 Ramdisks\
69 initrd-small.img
70 initrd-medium.img
71
72to boot with the initrd-large.img file if the current working
73directory is fs0:\Kernels, the following command must be used::
74
75 fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img
76
77Notice how bzImage.efi can be specified with a relative path. That's
78because the image we're executing is interpreted by the EFI shell,
79which understands relative paths, whereas the rest of the command line
80is passed to bzImage.efi.
81
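When the initrd location is known as a Linux-style path relative to the root of the ESP, it has to be rewritten with backslashes before it is used with "initrd=". A minimal sketch; `efi_path` is a hypothetical helper and assumes the input already starts at the ESP root.

```shell
#!/bin/sh
# efi_path PATH   (hypothetical helper)
# Converts forward slashes to the backslashes expected in EFI-style paths.
efi_path() {
    printf '%s\n' "$1" | tr '/' '\\'
}

# Prints the option as it would be typed after bzImage.efi.
printf 'initrd=%s\n' "$(efi_path /Kernels/initrd-large.img)"
```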
82
83The "dtb=" option
84-----------------
85
86For the ARM and arm64 architectures, a device tree must be provided to
87the kernel. Normally firmware shall supply the device tree via the
88EFI CONFIGURATION TABLE. However, the "dtb=" command line option can
89be used to override the firmware supplied device tree, or to supply
90one when firmware is unable to.
91
92Please note: Firmware adds runtime configuration information to the
93device tree before booting the kernel. If dtb= is used to override
94the device tree, then any runtime data provided by firmware will be
95lost. The dtb= option should only be used either as a debug tool, or
96as a last resort when a device tree is not provided in the EFI
97CONFIGURATION TABLE.
98
99"dtb=" is processed in the same manner as the "initrd=" option that is
100described above.
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
new file mode 100644
index 000000000000..a244ba4e87d5
--- /dev/null
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -0,0 +1,17 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3====
4gpio
5====
6
7.. toctree::
8 :maxdepth: 1
9
10 sysfs
11
12.. only:: subproject and html
13
14 Indices
15 =======
16
17 * :ref:`genindex`
diff --git a/Documentation/admin-guide/gpio/sysfs.rst b/Documentation/admin-guide/gpio/sysfs.rst
new file mode 100644
index 000000000000..ec09ffd983e7
--- /dev/null
+++ b/Documentation/admin-guide/gpio/sysfs.rst
@@ -0,0 +1,167 @@
1GPIO Sysfs Interface for Userspace
2==================================
3
4.. warning::
5
6 THIS ABI IS DEPRECATED, THE ABI DOCUMENTATION HAS BEEN MOVED TO
7 Documentation/ABI/obsolete/sysfs-gpio AND NEW USERSPACE CONSUMERS
8 ARE SUPPOSED TO USE THE CHARACTER DEVICE ABI. THIS OLD SYSFS ABI WILL
9 NOT BE DEVELOPED (NO NEW FEATURES), IT WILL JUST BE MAINTAINED.
10
11Refer to the examples in tools/gpio/* for an introduction to the new
12character device ABI. Also see the userspace header in
13include/uapi/linux/gpio.h
14
15The deprecated sysfs ABI
16------------------------
17Platforms which use the "gpiolib" implementors framework may choose to
18configure a sysfs user interface to GPIOs. This is different from the
19debugfs interface, since it provides control over GPIO direction and
20value instead of just showing a gpio state summary. Plus, it could be
21present on production systems without debugging support.
22
23Given appropriate hardware documentation for the system, userspace could
24know for example that GPIO #23 controls the write protect line used to
25protect boot loader segments in flash memory. System upgrade procedures
26may need to temporarily remove that protection, first importing a GPIO,
27then changing its output state, then updating the code before re-enabling
28the write protection. In normal use, GPIO #23 would never be touched,
29and the kernel would have no need to know about it.
30
31Again depending on appropriate hardware documentation, on some systems
32userspace GPIO can be used to determine system configuration data that
33standard kernels won't know about. And for some tasks, simple userspace
34GPIO drivers could be all that the system really needs.
35
36DO NOT ABUSE SYSFS TO CONTROL HARDWARE THAT HAS PROPER KERNEL DRIVERS.
37PLEASE READ THE DOCUMENT AT Documentation/driver-api/gpio/drivers-on-gpio.rst
38TO AVOID REINVENTING KERNEL WHEELS IN USERSPACE. I MEAN IT. REALLY.
39
40Paths in Sysfs
41--------------
42There are three kinds of entries in /sys/class/gpio:
43
44 - Control interfaces used to get userspace control over GPIOs;
45
46 - GPIOs themselves; and
47
48 - GPIO controllers ("gpio_chip" instances).
49
50That's in addition to standard files including the "device" symlink.
51
52The control interfaces are write-only:
53
54 /sys/class/gpio/
55
56 "export" ...
57 Userspace may ask the kernel to export control of
58 a GPIO to userspace by writing its number to this file.
59
60 Example: "echo 19 > export" will create a "gpio19" node
61 for GPIO #19, if that's not requested by kernel code.
62
63 "unexport" ...
64 Reverses the effect of exporting to userspace.
65
66 Example: "echo 19 > unexport" will remove a "gpio19"
67 node exported using the "export" file.
68
69GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
70and have the following read/write attributes:
71
72 /sys/class/gpio/gpioN/
73
74 "direction" ...
75 reads as either "in" or "out". This value may
76 normally be written. Writing as "out" defaults to
77 initializing the value as low. To ensure glitch free
78 operation, values "low" and "high" may be written to
79 configure the GPIO as an output with that initial value.
80
81 Note that this attribute *will not exist* if the kernel
82 doesn't support changing the direction of a GPIO, or
83 it was exported by kernel code that didn't explicitly
84 allow userspace to reconfigure this GPIO's direction.
85
86 "value" ...
87 reads as either 0 (low) or 1 (high). If the GPIO
88 is configured as an output, this value may be written;
89 any nonzero value is treated as high.
90
91 If the pin can be configured as an interrupt-generating input
92 and if it has been configured to generate interrupts (see the
93 description of "edge"), you can poll(2) on that file and
94 poll(2) will return whenever the interrupt was triggered. If
95 you use poll(2), set the events POLLPRI and POLLERR. If you
96 use select(2), set the file descriptor in exceptfds. After
97 poll(2) returns, either lseek(2) to the beginning of the sysfs
98 file and read the new value or close the file and re-open it
99 to read the value.
100
101 "edge" ...
102 reads as either "none", "rising", "falling", or
103 "both". Write these strings to select the signal edge(s)
104 that will make poll(2) on the "value" file return.
105
106 This file exists only if the pin can be configured as an
107 interrupt generating input pin.
108
109 "active_low" ...
110 reads as either 0 (false) or 1 (true). Write
111 any nonzero value to invert the value attribute both
112 for reading and writing. Existing and subsequent
113 poll(2) support configuration via the edge attribute
114 for "rising" and "falling" edges will follow this
115 setting.
116
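The export and "direction" steps described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
function name and its ``base`` parameter are hypothetical (``base`` exists so
the sketch can be tried against a scratch directory), and on a real system
the "direction" attribute may not exist at all, as noted above.

```python
import os


def export_gpio_as_output(gpio, initial_high=False, base="/sys/class/gpio"):
    """Export GPIO number `gpio` and configure it as an output.

    `base` is a hypothetical parameter; on a real system the files live
    under /sys/class/gpio.
    """
    node = os.path.join(base, "gpio%d" % gpio)
    if not os.path.isdir(node):
        # Ask the kernel to create the gpioN node by writing the number
        # to the write-only "export" control file.
        with open(os.path.join(base, "export"), "w") as f:
            f.write(str(gpio))
    # Writing "high" or "low" (rather than "out") sets the direction and
    # the initial value in one step, which is the glitch-free variant.
    with open(os.path.join(node, "direction"), "w") as f:
        f.write("high" if initial_high else "low")
    return node
```

Writing "high" or "low" instead of "out" followed by a separate write to
"value" is a deliberate choice here, since it avoids briefly driving the
line low.
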
117GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
118controller implementing GPIOs starting at #42) and have the following
119read-only attributes:
120
121 /sys/class/gpio/gpiochipN/
122
123 "base" ...
124 same as N, the first GPIO managed by this chip
125
126 "label" ...
127 provided for diagnostics (not always unique)
128
129 "ngpio" ...
130 how many GPIOs this manages (N to N + ngpio - 1)
131
132Board documentation should in most cases cover what GPIOs are used for
133what purposes. However, those numbers are not always stable; GPIOs on
134a daughtercard might be different depending on the base board being used,
135or other cards in the stack. In such cases, you may need to use the
136gpiochip nodes (possibly in conjunction with schematics) to determine
137the correct GPIO number to use for a given signal.
138
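As a rough sketch of how the gpiochip nodes can be used to work out which
controller owns a given GPIO number, the following Python reads the "base",
"ngpio" and "label" attributes described above. The function name and the
``base`` directory parameter are assumptions made for illustration; they are
not part of the ABI.

```python
import glob
import os


def find_gpiochip(gpio, base="/sys/class/gpio"):
    """Return (label, offset) for the controller managing `gpio`, or None."""
    for chip in sorted(glob.glob(os.path.join(base, "gpiochip*"))):
        with open(os.path.join(chip, "base")) as f:
            first = int(f.read())
        with open(os.path.join(chip, "ngpio")) as f:
            count = int(f.read())
        # The chip manages GPIOs first .. first + count - 1.
        if first <= gpio < first + count:
            with open(os.path.join(chip, "label")) as f:
                return f.read().strip(), gpio - first
    return None
```

The returned offset is the position of the signal within that controller,
which is usually what a schematic refers to. Labels are not always unique,
so they are best treated as a diagnostic hint.
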
139
140Exporting from Kernel code
141--------------------------
142Kernel code can explicitly manage exports of GPIOs which have already been
143requested using gpio_request()::
144
145 /* export the GPIO to userspace */
146 int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
147
148 /* reverse gpiod_export() */
149 void gpiod_unexport(struct gpio_desc *desc);
150
151 /* create a sysfs link to an exported GPIO node */
152 int gpiod_export_link(struct device *dev, const char *name,
153 struct gpio_desc *desc);
154
155After a kernel driver requests a GPIO, it may only be made available in
156the sysfs interface by gpiod_export(). The driver can control whether the
157signal direction may change. This helps drivers prevent userspace code
158from accidentally clobbering important system state.
159
160This explicit exporting can help with debugging (by making some kinds
161of experiments easier), or can provide an always-there interface that's
162suitable for documenting as part of a board support package.
163
164After the GPIO has been exported, gpiod_export_link() allows creating
165symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
166use this to provide the interface under their own device in sysfs with
167a descriptive name.
diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst
new file mode 100644
index 000000000000..6ee70465c0ea
--- /dev/null
+++ b/Documentation/admin-guide/highuid.rst
@@ -0,0 +1,80 @@
1===================================================
2Notes on the change from 16-bit UIDs to 32-bit UIDs
3===================================================
4
5:Author: Chris Wing <wingc@umich.edu>
6:Last updated: January 11, 2000
7
8- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t
9 when communicating between user and kernel space in an ioctl or data
10 structure.
11
12- kernel code should use uid_t and gid_t in kernel-private structures and
13 code.
14
15What's left to be done for 32-bit UIDs on all Linux architectures:
16
17- Disk quotas have an interesting limitation that is not related to the
18 maximum UID/GID. They are limited by the maximum file size on the
19 underlying filesystem, because quota records are written at offsets
20 corresponding to the UID in question.
21 Further investigation is needed to see if the quota system can cope
22 properly with huge UIDs. If it can deal with 64-bit file offsets on all
23 architectures, this should not be a problem.
24
25- Decide whether or not to keep backwards compatibility with the system
26 accounting file, or if we should break it as the comments suggest
27 (currently, the old 16-bit UID and GID are still written to disk, and
28 part of the former pad space is used to store separate 32-bit UID and
29 GID)
30
31- Need to validate that OS emulation calls the 16-bit UID
32 compatibility syscalls, if the OS being emulated used 16-bit UIDs, or
33 uses the 32-bit UID system calls properly otherwise.
34
35 This affects at least:
36
37 - iBCS on Intel
38
39 - sparc32 emulation on sparc64
40 (need to support whatever new 32-bit UID system calls are added to
41 sparc32)
42
43- Validate that all filesystems behave properly.
44
45 At present, 32-bit UIDs _should_ work for:
46
47 - ext2
48 - ufs
49 - isofs
50 - nfs
51 - coda
52 - udf
53
54 Ioctl() fixups have been made for:
55
56 - ncpfs
57 - smbfs
58
59 Filesystems with simple fixups to prevent 16-bit UID wraparound:
60
61 - minix
62 - sysv
63 - qnx4
64
65 Other filesystems have not been checked yet.
66
67- The ncpfs and smbfs filesystems cannot presently use 32-bit UIDs in
68 all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but
69 more are needed. (as well as new user<->kernel data structures)
70
71- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k,
72 sh, and sparc32. Fixing this is probably not that important, but would
73 require adding a new ELF section.
74
75- The ioctl()s used to control the in-kernel NFS server only support
76 16-bit UIDs on arm, i386, m68k, sh, and sparc32.
77
78- make sure that the UID mapping feature of AX25 networking works properly
79 (it should be safe because it's always used a 32-bit integer to
80 communicate between user and kernel)
diff --git a/Documentation/admin-guide/hw-vuln/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst
index 656aee262e23..f83212fae4d5 100644
--- a/Documentation/admin-guide/hw-vuln/l1tf.rst
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -241,7 +241,7 @@ Guest mitigation mechanisms
241 For further information about confining guests to a single or to a group 241 For further information about confining guests to a single or to a group
242 of cores consult the cpusets documentation: 242 of cores consult the cpusets documentation:
243 243
244 https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst 244 https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
245 245
246.. _interrupt_isolation: 246.. _interrupt_isolation:
247 247
diff --git a/Documentation/admin-guide/hw_random.rst b/Documentation/admin-guide/hw_random.rst
new file mode 100644
index 000000000000..121de96e395e
--- /dev/null
+++ b/Documentation/admin-guide/hw_random.rst
@@ -0,0 +1,105 @@
1==========================================================
2Linux support for random number generator in i8xx chipsets
3==========================================================
4
5Introduction
6============
7
8The hw_random framework is software that makes use of a
9special hardware feature on your CPU or motherboard,
10a Random Number Generator (RNG). The software has two parts:
11a core providing the /dev/hwrng character device and its
12sysfs support, plus a hardware-specific driver that plugs
13into that core.
14
15To make the most effective use of these mechanisms, you
16should download the support software as well. Download the
17latest version of the "rng-tools" package from the
18hw_random driver's official Web site:
19
20 http://sourceforge.net/projects/gkernel/
21
22Those tools use /dev/hwrng to fill the kernel entropy pool,
23which is used internally and exported by the /dev/urandom and
24/dev/random special files.
25
26Theory of operation
27===================
28
29CHARACTER DEVICE. Using the standard open()
30and read() system calls, you can read random data from
31the hardware RNG device. This data is NOT CHECKED by any
32fitness tests, and could potentially be bogus (if the
33hardware is faulty or has been tampered with). Data is only
34output if the hardware "has-data" flag is set, but nevertheless
35a security-conscious person would run fitness tests on the
36data before assuming it is truly random.
37
38The rng-tools package uses such tests in "rngd", and lets you
39run them by hand with a "rngtest" utility.
40
41/dev/hwrng is char device major 10, minor 183.
42
43CLASS DEVICE. There is a /sys/class/misc/hw_random node with
44two unique attributes, "rng_available" and "rng_current". The
45"rng_available" attribute lists the hardware-specific drivers
46available, while "rng_current" lists the one which is currently
47connected to /dev/hwrng. If your system has more than one
48RNG available, you may change the one used by writing a name from
49the list in "rng_available" into "rng_current".
50
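Switching the active RNG through the class device might look like the Python
sketch below. It assumes that "rng_available" holds whitespace-separated
driver names; the function name and the ``base`` parameter are hypothetical
and only there for illustration.

```python
import os


def select_rng(name, base="/sys/class/misc/hw_random"):
    """Make `name` the RNG connected to /dev/hwrng; return the available list."""
    with open(os.path.join(base, "rng_available")) as f:
        # Assumption: driver names are separated by whitespace.
        available = f.read().split()
    if name not in available:
        raise ValueError("%r is not listed in rng_available" % name)
    with open(os.path.join(base, "rng_current"), "w") as f:
        f.write(name)
    return available
```

Checking the name against "rng_available" first gives a clearer error than
relying on the write to "rng_current" failing.
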
51==========================================================================
52
53
54Hardware driver for Intel/AMD/VIA Random Number Generators (RNG)
55 - Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com>
56 - Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com>
57
58
59About the Intel RNG hardware, from the firmware hub datasheet
60=============================================================
61
62The Firmware Hub integrates a Random Number Generator (RNG)
63using thermal noise generated from inherently random quantum
64mechanical properties of silicon. When not generating new random
65bits the RNG circuitry will enter a low power state. Intel will
66provide a binary software driver to give third party software
67access to our RNG for use as a security feature. At this time,
68the RNG is only to be used with a system in an OS-present state.
69
70Intel RNG Driver notes
71======================
72
73FIXME: support poll(2)
74
75.. note::
76
77 request_mem_region was removed, for three reasons:
78
79 1) Only one RNG is supported by this driver;
80 2) The location used by the RNG is a fixed location in
81 MMIO-addressable memory;
82 3) users with properly working BIOS e820 handling will always
83 have the region in which the RNG is located reserved, so
84 request_mem_region calls always fail for proper setups.
85 However, for people who use mem=XX, BIOS e820 information is
86 **not** in /proc/iomem, and request_mem_region(RNG_ADDR) can
87 succeed.
88
89Driver details
90==============
91
92Based on:
93 Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet
94 May 1999 Order Number: 290658-002 R
95
96Intel 82802 Firmware Hub:
97 Random Number Generator
98 Programmer's Reference Manual
99 December 1999 Order Number: 298029-001 R
100
101Intel 82802 Firmware HUB Random Number Generator Driver
102 Copyright (c) 2000 Matt Sottek <msottek@quiknet.com>
103
104Special thanks to Matt Sottek. I did the "guts", he
105did the "brains" and all the testing.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 24fbe0568eff..280355d08af5 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -16,6 +16,7 @@ etc.
16 README 16 README
17 kernel-parameters 17 kernel-parameters
18 devices 18 devices
19 sysctl/index
19 20
20This section describes CPU vulnerabilities and their mitigations. 21This section describes CPU vulnerabilities and their mitigations.
21 22
@@ -38,6 +39,8 @@ problems and bugs in particular.
38 ramoops 39 ramoops
39 dynamic-debug-howto 40 dynamic-debug-howto
40 init 41 init
42 kdump/index
43 perf/index
41 44
42This is the beginning of a section with information of interest to 45This is the beginning of a section with information of interest to
43application developers. Documents covering various aspects of the kernel 46application developers. Documents covering various aspects of the kernel
@@ -56,11 +59,13 @@ configure specific aspects of kernel behavior to your liking.
56 59
57 initrd 60 initrd
58 cgroup-v2 61 cgroup-v2
62 cgroup-v1/index
59 serial-console 63 serial-console
60 braille-console 64 braille-console
61 parport 65 parport
62 md 66 md
63 module-signing 67 module-signing
68 rapidio
64 sysrq 69 sysrq
65 unicode 70 unicode
66 vga-softcursor 71 vga-softcursor
@@ -69,14 +74,37 @@ configure specific aspects of kernel behavior to your liking.
69 java 74 java
70 ras 75 ras
71 bcache 76 bcache
77 blockdev/index
72 ext4 78 ext4
73 binderfs 79 binderfs
74 pm/index 80 pm/index
75 thunderbolt 81 thunderbolt
76 LSM/index 82 LSM/index
77 mm/index 83 mm/index
84 namespaces/index
78 perf-security 85 perf-security
79 acpi/index 86 acpi/index
87 aoe/index
88 btmrvl
89 clearing-warn-once
90 cpu-load
91 cputopology
92 device-mapper/index
93 efi-stub
94 gpio/index
95 highuid
96 hw_random
97 iostats
98 kernel-per-CPU-kthreads
99 laptops/index
100 lcd-panel-cgram
101 ldm
102 lockup-watchdogs
103 numastat
104 pnp
105 rtc
106 svga
107 video-output
80 108
81.. only:: subproject and html 109.. only:: subproject and html
82 110
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
new file mode 100644
index 000000000000..5d63b18bd6d1
--- /dev/null
+++ b/Documentation/admin-guide/iostats.rst
@@ -0,0 +1,197 @@
1=====================
2I/O statistics fields
3=====================
4
5Since 2.4.20 (and some versions before, with patches), and 2.5.45,
6more extensive disk statistics have been introduced to help measure disk
7activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
8the work for you, but in case you are interested in creating your own
9tools, the fields are explained here.
10
11In 2.4 now, the information is found as additional fields in
12``/proc/partitions``. In 2.6 and later, the same information is found in two
13places: one is in the file ``/proc/diskstats``, and the other is within
14the sysfs file system, which must be mounted in order to obtain
15the information. Throughout this document we'll assume that sysfs
16is mounted on ``/sys``, although of course it may be mounted anywhere.
17Both ``/proc/diskstats`` and sysfs use the same source for the information
18and so should not differ.
19
20Here are examples of these different formats::
21
22 2.4:
23 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
24 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030
25
26 2.6+ sysfs:
27 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
28 35486 38030 38030 38030
29
30 2.6+ diskstats:
31 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
32 3 1 hda1 35486 38030 38030 38030
33
34 4.18+ diskstats:
35 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0
36
37On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
38a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.
39
40The advantage of one over the other is that the sysfs choice works well
41if you are watching a known, small set of disks. ``/proc/diskstats`` may
42be a better choice if you are watching a large number of disks because
43you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
44each snapshot of your disk statistics.
45
46In 2.4, the statistics fields are those after the device name. In
47the above example, the first field of statistics would be 446216.
48By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
49find just the eleven fields, beginning with 446216. If you look at
50``/proc/diskstats``, the eleven fields will be preceded by the major and
51minor device numbers, and device name. Each of these formats provides
52eleven fields of statistics, each meaning exactly the same things.
53All fields except field 9 are cumulative since boot. Field 9 should
54go to zero as I/Os complete; all others only increase (unless they
55overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long
56(native word size) numbers, and on a very busy or long-lived system they
57may wrap. Applications should be prepared to deal with that; unless
58your observations are measured in large numbers of minutes or hours,
59they should not wrap twice before you notice them.
60
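One way for an application to cope with a wrapped counter is to compute the
difference between two samples modulo the counter width. The Python sketch
below assumes the caller knows the native word size (32 or 64 bits) of the
system the samples came from, and that the counter wrapped at most once
between samples. The function name is hypothetical. It is meant for the
cumulative fields only, not for field 9.

```python
def counter_delta(previous, current, bits=64):
    """Difference between two samples of a counter that wraps at 2**bits."""
    return (current - previous) % (1 << bits)
```

For example, with a 32-bit counter, a previous sample of 2**32 - 10 and a
current sample of 5 give a delta of 15 rather than a large negative number.
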
61Each set of stats only applies to the indicated device; if you want
62system-wide stats you'll have to find all the devices and sum them all up.
63
64Field 1 -- # of reads completed
65 This is the total number of reads completed successfully.
66
67Field 2 -- # of reads merged, field 6 -- # of writes merged
68 Reads and writes which are adjacent to each other may be merged for
69 efficiency. Thus two 4K reads may become one 8K read before it is
70 ultimately handed to the disk, and so it will be counted (and queued)
71 as only one I/O. This field lets you know how often this was done.
72
73Field 3 -- # of sectors read
74 This is the total number of sectors read successfully.
75
76Field 4 -- # of milliseconds spent reading
77 This is the total number of milliseconds spent by all reads (as
78 measured from __make_request() to end_that_request_last()).
79
80Field 5 -- # of writes completed
81 This is the total number of writes completed successfully.
82
83Field 6 -- # of writes merged
84 See the description of field 2.
85
86Field 7 -- # of sectors written
87 This is the total number of sectors written successfully.
88
89Field 8 -- # of milliseconds spent writing
90 This is the total number of milliseconds spent by all writes (as
91 measured from __make_request() to end_that_request_last()).
92
93Field 9 -- # of I/Os currently in progress
94 The only field that should go to zero. Incremented as requests are
95 given to appropriate struct request_queue and decremented as they finish.
96
97Field 10 -- # of milliseconds spent doing I/Os
98 This field increases so long as field 9 is nonzero.
99
100 Since 5.0 this field counts jiffies when at least one request was
101 started or completed. If a request runs for more than 2 jiffies, some
102 I/O time will not be accounted for unless there are other requests.
103
104Field 11 -- weighted # of milliseconds spent doing I/Os
105 This field is incremented at each I/O start, I/O completion, I/O
106 merge, or read of these stats by the number of I/Os in progress
107 (field 9) times the number of milliseconds spent doing I/O since the
108 last update of this field. This can provide an easy measure of both
109 I/O completion time and the backlog that may be accumulating.
110
111Field 12 -- # of discards completed
112 This is the total number of discards completed successfully.
113
114Field 13 -- # of discards merged
115 See the description of field 2.
116
117Field 14 -- # of sectors discarded
118 This is the total number of sectors discarded successfully.
119
120Field 15 -- # of milliseconds spent discarding
121 This is the total number of milliseconds spent by all discards (as
122 measured from __make_request() to end_that_request_last()).
123
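For readers writing their own tools, the field list above maps onto a
``/proc/diskstats`` line as in the following Python sketch. The key names are
invented for illustration and are not kernel identifiers. The sketch assumes
a line with the full eleven or fifteen statistics fields, so the four-field
partition lines described later for early 2.6 kernels would need separate
handling.

```python
# Hypothetical names for fields 1..15, in the order documented above.
FIELD_NAMES = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
    "discards_completed", "discards_merged", "sectors_discarded",
    "ms_discarding",
)


def parse_diskstats_line(line):
    """Split one /proc/diskstats line into (major, minor, name, stats)."""
    parts = line.split()
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    values = [int(v) for v in parts[3:]]
    # zip() stops at the shorter sequence, so lines that carry only the
    # first eleven fields (before 4.18) are handled as well.
    stats = dict(zip(FIELD_NAMES, values))
    return major, minor, name, stats


sample = ("3 0 hda 446216 784926 9550688 4382310 424847 312726 "
          "5922052 19310380 0 3376340 23705160")
major, minor, name, stats = parse_diskstats_line(sample)
print(name, stats["reads_completed"], stats["sectors_written"])
# prints: hda 446216 5922052
```

The sample line is the 2.6+ diskstats example shown earlier in this document.
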
124To avoid introducing performance bottlenecks, no locks are held while
125modifying these counters. This implies that minor inaccuracies may be
126introduced when changes collide, so (for instance) adding up all the
127read I/Os issued per partition should equal those made to the disks ...
128but due to the lack of locking it may only be very close.
129
130In 2.6+, there are counters for each CPU, which make the lack of locking
131almost a non-issue. When the statistics are read, the per-CPU counters
132are summed (possibly overflowing the unsigned long variable they are
133summed to) and the result given to the user. There is no convenient
134user interface for accessing the per-CPU counters themselves.
135
136Disks vs Partitions
137-------------------
138
139There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
140As a result, some statistic information disappeared. The translation from
141a disk address relative to a partition to the disk address relative to
142the host disk happens much earlier. All merges and timings now happen
143at the disk level rather than at both the disk and partition level as
144in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
145partitions from that for disks. There are only *four* fields available
146for partitions on 2.6+ machines. This is reflected in the examples above.
147
148Field 1 -- # of reads issued
149 This is the total number of reads issued to this partition.
150
151Field 2 -- # of sectors read
152 This is the total number of sectors requested to be read from this
153 partition.
154
155Field 3 -- # of writes issued
156 This is the total number of writes issued to this partition.
157
158Field 4 -- # of sectors written
159 This is the total number of sectors requested to be written to
160 this partition.
161
162Note that since the address is translated to a disk-relative one, and no
163record of the partition-relative address is kept, the subsequent success
164or failure of the read cannot be attributed to the partition. In other
165words, the number of reads for partitions is counted slightly before time
166of queuing for partitions, and at completion for whole disks. This is
167a subtle distinction that is probably uninteresting for most cases.
168
169More significant is the error induced by counting the numbers of
170reads/writes before merges for partitions and after for disks. Since a
171typical workload usually contains a lot of successive and adjacent requests,
172the number of reads/writes issued can be several times higher than the
173number of reads/writes completed.
174
175In 2.6.25, the full statistic set is again available for partitions and
176disk and partition statistics are consistent again. Since we still don't
177keep record of the partition-relative address, an operation is attributed to
178the partition which contains the first sector of the request after the
179eventual merges. As requests can be merged across partitions, this could lead
180to some (probably insignificant) inaccuracy.
181
182Additional notes
183----------------
184
185In 2.6+, sysfs is not mounted by default. If your distribution of
186Linux hasn't added it already, here's the line you'll want to add to
187your ``/etc/fstab``::
188
189 none /sys sysfs defaults 0 0
190
191
192In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
193appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
194``/proc/stat`` take a very different format from those in ``/proc/partitions``
195(see proc(5), if your system has it.)
196
197-- ricklind@us.ibm.com
diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
new file mode 100644
index 000000000000..220d0a80ca2c
--- /dev/null
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -0,0 +1,264 @@
1#
2# This file contains a few gdb macros (user defined commands) to extract
3# useful information from kernel crashdump (kdump) like stack traces of
4# all the processes or a particular process and trapinfo.
5#
6# These macros can be used by copying this file in .gdbinit (put in home
7# directory or current directory) or by invoking gdb command with
8# --command=<command-file-name> option
9#
10# Credits:
11# Alexander Nyberg <alexn@telia.com>
12# V Srivatsa <vatsa@in.ibm.com>
13# Maneesh Soni <maneesh@in.ibm.com>
14#
15
16define bttnobp
17 set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
18 set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
19 set $init_t=&init_task
20 set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
21 set var $stacksize = sizeof(union thread_union)
22 while ($next_t != $init_t)
23 set $next_t=(struct task_struct *)$next_t
24 printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
25 printf "===================\n"
26 set var $stackp = $next_t.thread.sp
27 set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
28
29 while ($stackp < $stack_top)
30 if (*($stackp) > _stext && *($stackp) < _sinittext)
31 info symbol *($stackp)
32 end
33 set $stackp += 4
34 end
35 set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
36 while ($next_th != $next_t)
37 set $next_th=(struct task_struct *)$next_th
38 printf "\npid %d; comm %s:\n", $next_th.pid, $next_th.comm
39 printf "===================\n"
40 set var $stackp = $next_th.thread.sp
41 set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
42
43 while ($stackp < $stack_top)
44 if (*($stackp) > _stext && *($stackp) < _sinittext)
45 info symbol *($stackp)
46 end
47 set $stackp += 4
48 end
49 set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
50 end
51 set $next_t=(char *)($next_t->tasks.next) - $tasks_off
52 end
53end
54document bttnobp
55 dump all thread stack traces on a kernel compiled with !CONFIG_FRAME_POINTER
56end
57
58define btthreadstack
59 set var $pid_task = $arg0
60
61 printf "\npid %d; comm %s:\n", $pid_task.pid, $pid_task.comm
62 printf "task struct: "
63 print $pid_task
64 printf "===================\n"
65 set var $stackp = $pid_task.thread.sp
66 set var $stacksize = sizeof(union thread_union)
67 set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
68 set var $stack_bot = ($stackp & ~($stacksize - 1))
69
70 set $stackp = *((unsigned long *) $stackp)
71 while (($stackp < $stack_top) && ($stackp > $stack_bot))
72 set var $addr = *(((unsigned long *) $stackp) + 1)
73 info symbol $addr
74 set $stackp = *((unsigned long *) $stackp)
75 end
76end
77document btthreadstack
78 dump a thread stack using the given task structure pointer
79end
80
81
82define btt
83 set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
84 set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
85 set $init_t=&init_task
86 set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
87 while ($next_t != $init_t)
88 set $next_t=(struct task_struct *)$next_t
89 btthreadstack $next_t
90
91 set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
92 while ($next_th != $next_t)
93 set $next_th=(struct task_struct *)$next_th
94 btthreadstack $next_th
95 set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
96 end
97 set $next_t=(char *)($next_t->tasks.next) - $tasks_off
98 end
99end
100document btt
101 dump all thread stack traces on a kernel compiled with CONFIG_FRAME_POINTER
102end
103
104define btpid
105 set var $pid = $arg0
106 set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
107 set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
108 set $init_t=&init_task
109 set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
110 set var $pid_task = 0
111
112 while ($next_t != $init_t)
113 set $next_t=(struct task_struct *)$next_t
114
115 if ($next_t.pid == $pid)
116 set $pid_task = $next_t
117 end
118
119 set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
120 while ($next_th != $next_t)
121 set $next_th=(struct task_struct *)$next_th
122 if ($next_th.pid == $pid)
123 set $pid_task = $next_th
124 end
125 set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
126 end
127 set $next_t=(char *)($next_t->tasks.next) - $tasks_off
128 end
129
130 btthreadstack $pid_task
131end
132document btpid
133 backtrace of pid
134end
135
136
137define trapinfo
138 set var $pid = $arg0
139 set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
140 set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
141 set $init_t=&init_task
142 set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
143 set var $pid_task = 0
144
145 while ($next_t != $init_t)
146 set $next_t=(struct task_struct *)$next_t
147
148 if ($next_t.pid == $pid)
149 set $pid_task = $next_t
150 end
151
152 set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
153 while ($next_th != $next_t)
154 set $next_th=(struct task_struct *)$next_th
155 if ($next_th.pid == $pid)
156 set $pid_task = $next_th
157 end
158 set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
159 end
160 set $next_t=(char *)($next_t->tasks.next) - $tasks_off
161 end
162
163 printf "Trapno %ld, cr2 0x%lx, error_code %ld\n", $pid_task.thread.trap_no, \
164 $pid_task.thread.cr2, $pid_task.thread.error_code
165
166end
167document trapinfo
168 Run info threads and look up pid of thread #1
169 'trapinfo <pid>' will tell you by which trap & possibly
170 address the kernel panicked.
171end
172
173define dump_log_idx
174 set $idx = $arg0
175 if ($argc > 1)
176 set $prev_flags = $arg1
177 else
178 set $prev_flags = 0
179 end
180 set $msg = ((struct printk_log *) (log_buf + $idx))
181 set $prefix = 1
182 set $newline = 1
183 set $log = log_buf + $idx + sizeof(*$msg)
184
185 # prev & LOG_CONT && !(msg->flags & LOG_PREFIX)
186 if (($prev_flags & 8) && !($msg->flags & 4))
187 set $prefix = 0
188 end
189
190 # msg->flags & LOG_CONT
191 if ($msg->flags & 8)
192 # (prev & LOG_CONT && !(prev & LOG_NEWLINE))
193 if (($prev_flags & 8) && !($prev_flags & 2))
194 set $prefix = 0
195 end
196 # (!(msg->flags & LOG_NEWLINE))
197 if (!($msg->flags & 2))
198 set $newline = 0
199 end
200 end
201
202 if ($prefix)
203 printf "[%5lu.%06lu] ", $msg->ts_nsec / 1000000000, ($msg->ts_nsec % 1000000000) / 1000
204 end
205 if ($msg->text_len != 0)
206 eval "printf \"%%%d.%ds\", $log", $msg->text_len, $msg->text_len
207 end
208 if ($newline)
209 printf "\n"
210 end
211 if ($msg->dict_len > 0)
212 set $dict = $log + $msg->text_len
213 set $idx = 0
214 set $line = 1
215 while ($idx < $msg->dict_len)
216 if ($line)
217 printf " "
218 set $line = 0
219 end
220 set $c = $dict[$idx]
221 if ($c == '\0')
222 printf "\n"
223 set $line = 1
224 else
225 if ($c < ' ' || $c >= 127 || $c == '\\')
226 printf "\\x%02x", $c
227 else
228 printf "%c", $c
229 end
230 end
231 set $idx = $idx + 1
232 end
233 printf "\n"
234 end
235end
236document dump_log_idx
237 Dump a single log given its index in the log buffer. The first
238 parameter is the index into log_buf, the second is optional and
239 specifies the previous log buffer's flags, used for properly
240 formatting continued lines.
241end
242
243define dmesg
244 set $i = log_first_idx
245 set $end_idx = log_first_idx
246 set $prev_flags = 0
247
248 while (1)
249 set $msg = ((struct printk_log *) (log_buf + $i))
250 if ($msg->len == 0)
251 set $i = 0
252 else
253 dump_log_idx $i $prev_flags
254 set $i = $i + $msg->len
255 set $prev_flags = $msg->flags
256 end
257 if ($i == $end_idx)
258 loop_break
259 end
260 end
261end
262document dmesg
263 print the kernel ring buffer
264end
diff --git a/Documentation/admin-guide/kdump/index.rst b/Documentation/admin-guide/kdump/index.rst
new file mode 100644
index 000000000000..8e2ebd0383cd
--- /dev/null
+++ b/Documentation/admin-guide/kdump/index.rst
@@ -0,0 +1,20 @@
1
2================================================================
3Documentation for Kdump - The kexec-based Crash Dumping Solution
4================================================================
5
6This document includes overview, setup and installation, and analysis
7information.
8
9.. toctree::
10 :maxdepth: 1
11
12 kdump
13 vmcoreinfo
14
15.. only:: subproject and html
16
17 Indices
18 =======
19
20 * :ref:`genindex`
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
new file mode 100644
index 000000000000..ac7e131d2935
--- /dev/null
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -0,0 +1,534 @@
1================================================================
2Documentation for Kdump - The kexec-based Crash Dumping Solution
3================================================================
4
5This document includes overview, setup and installation, and analysis
6information.
7
8Overview
9========
10
11Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
12dump of the system kernel's memory needs to be taken (for example, when
13the system panics). The system kernel's memory image is preserved across
14the reboot and is accessible to the dump-capture kernel.
15
16You can use common commands, such as cp and scp, to copy the
17memory image to a dump file on the local disk, or across the network to
18a remote system.
19
20Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
21s390x, arm and arm64 architectures.
22
23When the system kernel boots, it reserves a small section of memory for
24the dump-capture kernel. This ensures that ongoing Direct Memory Access
25(DMA) from the system kernel does not corrupt the dump-capture kernel.
26The kexec -p command loads the dump-capture kernel into this reserved
27memory.
28
29On x86 machines, the first 640 KB of physical memory is needed to boot,
30regardless of where the kernel loads. Therefore, kexec backs up this
31region just before rebooting into the dump-capture kernel.
32
33Similarly, on PPC64 machines the first 32 KB of physical memory is needed
34to boot, regardless of where the kernel is loaded. To support a 64 KB page
35size, kexec backs up the first 64 KB of memory.
36
37For s390x, when kdump is triggered, the crashkernel region is exchanged
38with the region [0, crashkernel region size] and then the kdump kernel
39runs in [0, crashkernel region size]. Therefore no relocatable kernel is
40needed for s390x.
41
42All of the necessary information about the system kernel's core image is
43encoded in the ELF format, and stored in a reserved area of memory
44before a crash. The physical address of the start of the ELF header is
45passed to the dump-capture kernel through the elfcorehdr= boot
46parameter. Optionally the size of the ELF header can also be passed
47when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
48
49
50With the dump-capture kernel, you can access the memory image through
51/proc/vmcore. This exports the dump as an ELF-format file that you can
52write out using file copy commands such as cp or scp. Further, you can
53use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
54debug the dump file. This method ensures that the dump pages are correctly
55ordered.
56
57
58Setup and Installation
59======================
60
61Install kexec-tools
62-------------------
63
641) Log in as the root user.
65
662) Download the kexec-tools user-space package from the following URL:
67
68http://kernel.org/pub/linux/utils/kernel/kexec/kexec-tools.tar.gz
69
70This is a symlink to the latest version.
71
72The latest kexec-tools git tree is available at:
73
74- git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
75- http://www.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
76
77There is also a gitweb interface available at
78http://www.kernel.org/git/?p=utils/kernel/kexec/kexec-tools.git
79
80More information about kexec-tools can be found at
81http://horms.net/projects/kexec/
82
833) Unpack the tarball with the tar command, as follows::
84
85 tar xvpzf kexec-tools.tar.gz
86
874) Change to the kexec-tools directory, as follows::
88
89 cd kexec-tools-VERSION
90
915) Configure the package, as follows::
92
93 ./configure
94
956) Compile the package, as follows::
96
97 make
98
997) Install the package, as follows::
100
101 make install
102
103
104Build the system and dump-capture kernels
105-----------------------------------------
106There are two possible methods of using Kdump.
107
1081) Build a separate custom dump-capture kernel for capturing the
109 kernel core dump.
110
1112) Or use the system kernel binary itself as the dump-capture kernel, in
112 which case there is no need to build a separate dump-capture kernel.
113 This is possible only on architectures that support a relocatable
114 kernel. As of today, the i386, x86_64, ppc64, ia64, arm and arm64
115 architectures support a relocatable kernel.
116
117Building a relocatable kernel is advantageous because one does not have
118to build a second kernel for capturing the dump. At the same time, one
119might want to build a custom dump-capture kernel suited to one's
120needs.
121
122The following configuration settings are required for the system and
123dump-capture kernels to enable kdump support.
124
125System kernel config options
126----------------------------
127
1281) Enable "kexec system call" in "Processor type and features."::
129
130 CONFIG_KEXEC=y
131
1322) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
133 filesystems." This is usually enabled by default::
134
135 CONFIG_SYSFS=y
136
137 Note that "sysfs file system support" might not appear in the "Pseudo
138 filesystems" menu if "Configure standard kernel features (for small
139 systems)" is not enabled in "General Setup." In this case, check the
140 .config file itself to ensure that sysfs is turned on, as follows::
141
142 grep 'CONFIG_SYSFS' .config
143
1443) Enable "Compile the kernel with debug info" in "Kernel hacking."::
145
146 CONFIG_DEBUG_INFO=y
147
148 This causes the kernel to be built with debug symbols. The dump
149 analysis tools require a vmlinux with debug symbols in order to read
150 and analyze a dump file.
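
The three system-kernel options above can be checked in one pass. The following is a minimal sketch, assuming the configuration is a plain-text .config file; the helper name check_kdump_config is hypothetical and not part of any kernel tooling.

```shell
# check_kdump_config (hypothetical helper): report which of the system-kernel
# options listed above are set to "y" in the given .config file.
# Returns 0 when all three are enabled, 1 otherwise.
check_kdump_config() {
    config_file=$1
    missing=0
    for opt in CONFIG_KEXEC CONFIG_SYSFS CONFIG_DEBUG_INFO; do
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: NOT enabled"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_kdump_config .config` in the kernel build directory prints one line per option.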
151
152Dump-capture kernel config options (Arch Independent)
153-----------------------------------------------------
154
1551) Enable "kernel crash dumps" support under "Processor type and
156 features"::
157
158 CONFIG_CRASH_DUMP=y
159
1602) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems"::
161
162 CONFIG_PROC_VMCORE=y
163
164 (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
165
166Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
167--------------------------------------------------------------------
168
1691) On i386, enable high memory support under "Processor type and
170 features"::
171
172 CONFIG_HIGHMEM64G=y
173
174 or::
175
176 CONFIG_HIGHMEM4G=y
177
1782) On i386 and x86_64, disable symmetric multi-processing support
179 under "Processor type and features"::
180
181 CONFIG_SMP=n
182
183 (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
184 when loading the dump-capture kernel, see section "Load the Dump-capture
185 Kernel".)
186
1873) If one wants to build and use a relocatable kernel,
188 enable "Build a relocatable kernel" support under "Processor type and
189 features"::
190
191 CONFIG_RELOCATABLE=y
192
1934) Use a suitable value for "Physical address where the kernel is
194 loaded" (under "Processor type and features"). This only appears when
195 "kernel crash dumps" is enabled. A suitable value depends upon
196 whether the kernel is relocatable or not.
197
198 If you are using a relocatable kernel, use CONFIG_PHYSICAL_START=0x100000.
199 This will compile the kernel for physical address 1MB, but because the
200 kernel is relocatable, it can be run from any physical address, so the
201 kexec boot loader will load it into the memory region reserved for the
202 dump-capture kernel.
203
204 Otherwise it should be the start of the memory region reserved for the
205 second kernel using the boot parameter "crashkernel=Y@X". Here X is
206 the start of the memory region reserved for the dump-capture kernel.
207 Generally X is 16MB (0x1000000), so you can set
208 CONFIG_PHYSICAL_START=0x1000000.
209
2105) Make and install the kernel and its modules. DO NOT add this kernel
211 to the boot loader configuration files.
212
213Dump-capture kernel config options (Arch Dependent, ppc64)
214----------------------------------------------------------
215
2161) Enable "Build a kdump crash kernel" support under "Kernel" options::
217
218 CONFIG_CRASH_DUMP=y
219
2202) Enable "Build a relocatable kernel" support::
221
222 CONFIG_RELOCATABLE=y
223
224 Make and install the kernel and its modules.
225
226Dump-capture kernel config options (Arch Dependent, ia64)
227----------------------------------------------------------
228
229- No specific options are required to create a dump-capture kernel
230 for ia64, other than those specified in the arch independent section
231 above. This means that it is possible to use the system kernel
232 as a dump-capture kernel if desired.
233
234 The crashkernel region can be automatically placed by the system
235 kernel at run time. This is done by specifying the base address as 0,
236 or omitting it altogether::
237
238 crashkernel=256M@0
239
240 or::
241
242 crashkernel=256M
243
244 If the start address is specified, note that the start address of the
245 kernel will be aligned to 64MB, so if the specified address is not
246 64MB-aligned, any space below the alignment point will be wasted.
247
248Dump-capture kernel config options (Arch Dependent, arm)
249----------------------------------------------------------
250
251- To use a relocatable kernel,
252 enable "AUTO_ZRELADDR" support under "Boot" options::
253
254 AUTO_ZRELADDR=y
255
256Dump-capture kernel config options (Arch Dependent, arm64)
257----------------------------------------------------------
258
259- Please note that kvm of the dump-capture kernel will not be enabled
260 on non-VHE systems even if it is configured. This is because the CPU
261 will not be reset to EL2 on panic.
262
263Extended crashkernel syntax
264===========================
265
266While the "crashkernel=size[@offset]" syntax is sufficient for most
267configurations, sometimes it's handy to have the reserved memory dependent
268on the value of System RAM -- that's mostly for distributors that pre-setup
269the kernel command line to avoid an unbootable system after some memory has
270been removed from the machine.
271
272The syntax is::
273
274 crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
275 range=start-[end]
276
277For example::
278
279 crashkernel=512M-2G:64M,2G-:128M
280
281This would mean:
282
283 1) if the RAM is smaller than 512M, then don't reserve anything
284 (this is the "rescue" case)
285 2) if the RAM size is at least 512M but smaller than 2G, then reserve 64M
286 3) if the RAM size is 2G or larger, then reserve 128M
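
The selection rule can be sketched in shell. This is an illustration only, not the kernel's parser: it assumes a range matches when start <= RAM < end, handles only the K, M and G suffixes, and ignores the optional @offset. The helper names to_bytes and pick_crashkernel_size are hypothetical.

```shell
# to_bytes: convert a size such as 512M or 2G to bytes (K, M, G suffixes only).
to_bytes() {
    num=${1%[KMG]}
    case $1 in
        *K) echo $((num * 1024)) ;;
        *M) echo $((num * 1024 * 1024)) ;;
        *G) echo $((num * 1024 * 1024 * 1024)) ;;
        *)  echo "$num" ;;
    esac
}

# pick_crashkernel_size (hypothetical helper): print the size selected by a
# "<range1>:<size1>,<range2>:<size2>" list for a given amount of RAM, or "0"
# when no range matches. An empty range end means "no upper limit".
pick_crashkernel_size() {
    ram=$(to_bytes "$1")
    old_ifs=$IFS
    IFS=,
    set -- $2            # split the list on commas into positional parameters
    IFS=$old_ifs
    for entry in "$@"; do
        range=${entry%%:*}
        size=${entry#*:}
        start=$(to_bytes "${range%%-*}")
        end=${range#*-}
        if [ "$ram" -ge "$start" ]; then
            if [ -z "$end" ] || [ "$ram" -lt "$(to_bytes "$end")" ]; then
                echo "$size"
                return 0
            fi
        fi
    done
    echo 0
}
```

For instance, `pick_crashkernel_size 1G '512M-2G:64M,2G-:128M'` selects the first range, while a RAM size of 256M matches no range at all.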
287
288
289
290Boot into System Kernel
291=======================
292
2931) Update the boot loader (such as grub, yaboot, or lilo) configuration
294 files as necessary.
295
2962) Boot the system kernel with the boot parameter "crashkernel=Y@X",
297 where Y specifies how much memory to reserve for the dump-capture kernel
298 and X specifies the beginning of this reserved memory. For example,
299 "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
300 starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
301
302 On x86 and x86_64, use "crashkernel=64M@16M".
303
304 On ppc64, use "crashkernel=128M@32M".
305
306 On ia64, 256M@256M is a generous value that typically works.
307 The region may be automatically placed on ia64, see the
308 dump-capture kernel config option notes above.
309 If sparse memory is used, the size should be rounded to GRANULE boundaries.
310
311 On s390x, typically use "crashkernel=xxM". The value of xx is dependent
312 on the memory consumption of the kdump system. In general this is not
313 dependent on the memory size of the production system.
314
315 On arm, the use of "crashkernel=Y@X" is no longer necessary; the
316 kernel will automatically locate the crash kernel image within the
317 first 512MB of RAM if X is not given.
318
319 On arm64, use "crashkernel=Y[@X]". Note that the start address of
320 the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
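
To make the "crashkernel=Y@X" arithmetic concrete, the sketch below decodes a value given in megabytes into the physical address range it reserves, and checks the 2MiB alignment required for an explicit start address on arm64. Both helper names are hypothetical, and only the "M" suffix is handled.

```shell
# describe_crashkernel (hypothetical helper): decode "Y@X" given in megabytes,
# e.g. 64M@16M, and print the reserved physical address range in hex.
describe_crashkernel() {
    size_mb=${1%%M@*}          # part before "M@"
    start_mb=${1##*@}          # part after "@"
    start_mb=${start_mb%M}
    start=$((start_mb * 1024 * 1024))
    end=$((start + size_mb * 1024 * 1024 - 1))
    printf '0x%08x-0x%08x\n' "$start" "$end"
}

# is_2mib_aligned (hypothetical helper): succeed when the given byte address
# is a multiple of 2MiB (0x200000).
is_2mib_aligned() {
    [ $(( $1 % 0x200000 )) -eq 0 ]
}
```

With these definitions, `describe_crashkernel 64M@16M` reports a region starting at 0x01000000, matching the example above.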
321
322Load the Dump-capture Kernel
323============================
324
325After booting to the system kernel, the dump-capture kernel needs to be
326loaded.
327
328Based on the architecture and type of image (relocatable or not), one
329can choose to load the uncompressed vmlinux or compressed bzImage/vmlinuz
330of the dump-capture kernel. The following is a summary.
331
332For i386 and x86_64:
333
334 - Use vmlinux if kernel is not relocatable.
335 - Use bzImage/vmlinuz if kernel is relocatable.
336
337For ppc64:
338
339 - Use vmlinux
340
341For ia64:
342
343 - Use vmlinux or vmlinuz.gz
344
345For s390x:
346
347 - Use image or bzImage
348
349For arm:
350
351 - Use zImage
352
353For arm64:
354
355 - Use vmlinux or Image
356
357If you are using an uncompressed vmlinux image, then use the following
358command to load the dump-capture kernel::
359
360 kexec -p <dump-capture-kernel-vmlinux-image> \
361 --initrd=<initrd-for-dump-capture-kernel> --args-linux \
362 --append="root=<root-dev> <arch-specific-options>"
363
364If you are using a compressed bzImage/vmlinuz, then use the following
365command to load the dump-capture kernel::
366
367 kexec -p <dump-capture-kernel-bzImage> \
368 --initrd=<initrd-for-dump-capture-kernel> \
369 --append="root=<root-dev> <arch-specific-options>"
370
371If you are using a compressed zImage, then use the following command
372to load the dump-capture kernel::
373
374 kexec --type zImage -p <dump-capture-kernel-zImage> \
375 --initrd=<initrd-for-dump-capture-kernel> \
376 --dtb=<dtb-for-dump-capture-kernel> \
377 --append="root=<root-dev> <arch-specific-options>"
378
379If you are using an uncompressed Image, then use the following command
380to load the dump-capture kernel::
381
382 kexec -p <dump-capture-kernel-Image> \
383 --initrd=<initrd-for-dump-capture-kernel> \
384 --append="root=<root-dev> <arch-specific-options>"
385
386Please note that --args-linux does not need to be specified for ia64.
387It is planned to make this a no-op on that architecture, but for now
388it should be omitted.
389
390The following are the arch-specific command line options to be used while
391loading the dump-capture kernel.
392
393For i386, x86_64 and ia64:
394
395 "1 irqpoll maxcpus=1 reset_devices"
396
397For ppc64:
398
399 "1 maxcpus=1 noirqdistrib reset_devices"
400
401For s390x:
402
403 "1 maxcpus=1 cgroup_disable=memory"
404
405For arm:
406
407 "1 maxcpus=1 reset_devices"
408
409For arm64:
410
411 "1 maxcpus=1 reset_devices"
412
413Notes on loading the dump-capture kernel:
414
415* By default, the ELF headers are stored in ELF64 format to support
416 systems with more than 4GB memory. On i386, kexec automatically checks if
417 the physical RAM size exceeds the 4 GB limit and if not, uses ELF32.
418 So, on non-PAE systems, ELF32 is always used.
419
420 The --elf32-core-headers option can be used to force the generation of ELF32
421 headers. This is necessary because GDB currently cannot open vmcore files
422 with ELF64 headers on 32-bit systems.
423
424* The "irqpoll" boot parameter reduces driver initialization failures
425 due to shared interrupts in the dump-capture kernel.
426
427* You must specify <root-dev> in the format corresponding to the root
428 device name in the output of the mount command.
429
430* Boot parameter "1" boots the dump-capture kernel into single-user
431 mode without networking. If you want networking, use "3".
432
433* We generally don't have to bring up an SMP kernel just to capture the
434 dump. Hence it is generally useful either to build a UP dump-capture
435 kernel or to specify the maxcpus=1 option while loading the dump-capture
436 kernel. Note that although maxcpus always works, it is better to use
437 nr_cpus instead to save memory where the architecture supports it, such as x86.
438
439* You should enable multi-cpu support in the dump-capture kernel if you intend
440 to use multi-threaded programs with it, such as the parallel dump feature of
441 makedumpfile. Otherwise, the multi-threaded program may suffer a significant
442 performance degradation. To enable multi-cpu support, you should bring up an
443 SMP dump-capture kernel and specify the maxcpus/nr_cpus and disable_cpu_apicid=[X]
444 options while loading it.
445
446* For s390x there are two kdump modes: If an ELF header is specified with
447 the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
448 is done on all other architectures. If no elfcorehdr= kernel parameter is
449 specified, the s390x kdump kernel dynamically creates the header. The
450 second mode has the advantage that for CPU and memory hotplug, kdump does
451 not have to be reloaded with kexec_load().
452
453* For s390x systems with many attached devices the "cio_ignore" kernel
454 parameter should be used for the kdump kernel in order to prevent allocation
455 of kernel memory for devices that are not relevant for kdump. The same
456 applies to systems that use SCSI/FCP devices. In that case the
457 "allow_lun_scan" zfcp module parameter should be set to zero before
458 setting FCP devices online.
459
460Kernel Panic
461============
462
463After successfully loading the dump-capture kernel as previously
464described, the system will reboot into the dump-capture kernel if a
465system crash is triggered. Trigger points are located in panic(),
466die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
467
468The following conditions will execute a crash trigger point:
469
470If a hard lockup is detected and "NMI watchdog" is configured, the system
471will boot into the dump-capture kernel ( die_nmi() ).
472
473If die() is called, and it happens to be a thread with pid 0 or 1, or die()
474is called inside interrupt context or die() is called and panic_on_oops is set,
475the system will boot into the dump-capture kernel.
476
477On powerpc systems when a soft-reset is generated, die() is called by all cpus
478and the system will boot into the dump-capture kernel.
479
480For testing purposes, you can trigger a crash by using "ALT-SysRq-c",
481"echo c > /proc/sysrq-trigger" or write a module to force the panic.
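
Before forcing a test crash it is worth confirming that a dump-capture kernel is actually loaded, since otherwise the panic produces no dump. The sketch below reads /sys/kernel/kexec_crash_loaded, which contains 1 when a crash kernel is loaded; the helper name crash_kernel_loaded is hypothetical, and its optional path argument exists only so the check can be exercised against an ordinary file.

```shell
# crash_kernel_loaded (hypothetical helper): succeed when the state file
# contains "1". The path defaults to /sys/kernel/kexec_crash_loaded.
crash_kernel_loaded() {
    state_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}
```

As root, `crash_kernel_loaded && echo c > /proc/sysrq-trigger` then triggers the crash only when a dump can be captured.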
482
483Write Out the Dump File
484=======================
485
486After the dump-capture kernel is booted, write out the dump file with
487the following command::
488
489 cp /proc/vmcore <dump-file>
490
491
492Analysis
493========
494
495Before analyzing the dump image, you should reboot into a stable kernel.
496
497You can do limited analysis using GDB on the dump file copied out of
498/proc/vmcore. Use the debug vmlinux built with -g and run the following
499command::
500
501 gdb vmlinux <dump-file>
502
503Stack trace for the task on processor 0, register display, and memory
504display work fine.
505
506Note: GDB cannot analyze core files generated in ELF64 format for x86.
507On systems with a maximum of 4GB of memory, you can generate
508ELF32-format headers by passing the --elf32-core-headers option to kexec
509when loading the dump-capture kernel.
510
511You can also use the Crash utility to analyze dump files in Kdump
512format. Crash is available on Dave Anderson's site at the following URL:
513
514 http://people.redhat.com/~anderson/
515
516Trigger Kdump on WARN()
517=======================
518
519The kernel parameter, panic_on_warn, calls panic() in all WARN() paths. This
520will cause a kdump to occur at the panic() call. In cases where a user wants
521to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
522to achieve the same behaviour.
523
524Contact
525=======
526
527- Vivek Goyal (vgoyal@redhat.com)
528- Maneesh Soni (maneesh@in.ibm.com)
529
530GDB macros
531==========
532
533.. include:: gdbmacros.txt
534 :literal:
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
new file mode 100644
index 000000000000..007a6b86e0ee
--- /dev/null
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -0,0 +1,488 @@
1==========
2VMCOREINFO
3==========
4
5What is it?
6===========
7
8VMCOREINFO is a special ELF note section. It contains various
9information from the kernel like structure size, page size, symbol
10values, field offsets, etc. These data are packed into an ELF note
11section and used by user-space tools like crash and makedumpfile to
12analyze a kernel's memory layout.
13
14Common variables
15================
16
17init_uts_ns.name.release
18------------------------
19
20The version of the Linux kernel. Used to find the corresponding source
21code from which the kernel has been built. For example, crash uses it to
22find the corresponding vmlinux in order to process vmcore.
23
24PAGE_SIZE
25---------
26
27The size of a page. It is the smallest unit of data used by the memory
28management facilities. It is usually 4096 bytes in size and a page is
29aligned on 4096 bytes. Used for computing page addresses.
30
31init_uts_ns
32-----------
33
34The UTS namespace which is used to isolate two specific elements of the
35system that relate to the uname(2) system call. It is named after the
36data structure used to store information returned by the uname(2) system
37call.
38
39User-space tools can get the kernel name, host name, kernel release
40number, kernel version, architecture name and OS type from it.
41
42node_online_map
43---------------
44
45An array node_states[N_ONLINE] which represents the set of online nodes
46in a system, one bit position per node number. Used to keep track of
47which nodes are in the system and online.
48
49swapper_pg_dir
50--------------
51
52The global page directory pointer of the kernel. Used to translate
53virtual to physical addresses.
54
55_stext
56------
57
58Defines the beginning of the text section. In general, _stext indicates
59the kernel start address. Used to convert a virtual address from the
60direct kernel map to a physical address.
61
62vmap_area_list
63--------------
64
65Stores the virtual area list. makedumpfile gets the vmalloc start value
66from this variable and its value is necessary for vmalloc translation.
67
68mem_map
69-------
70
71Physical addresses are translated to struct pages by treating them as
72an index into the mem_map array. Right-shifting a physical address
73PAGE_SHIFT bits converts it into a page frame number which is an index
74into that mem_map array.
75
76Used to map an address to the corresponding struct page.
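
The shift described above is a one-line computation. A minimal sketch, assuming PAGE_SHIFT is 12 (4096-byte pages); the helper name phys_to_pfn is hypothetical, and other page sizes need a different shift.

```shell
# PAGE_SHIFT=12 corresponds to 4096-byte pages; this value is an assumption.
PAGE_SHIFT=12

# phys_to_pfn (hypothetical helper): convert a physical address to a page
# frame number by right-shifting it PAGE_SHIFT bits.
phys_to_pfn() {
    echo $(( $1 >> PAGE_SHIFT ))
}
```

For example, the physical address 0x1000000 (16MB) maps to page frame number 4096, which is the index used into mem_map.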
77
78contig_page_data
79----------------
80
81Makedumpfile gets the pglist_data structure from this symbol, which is
82used to describe the memory layout.
83
84User-space tools use this to exclude free pages when dumping memory.
85
86mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
87--------------------------------------------------------------------------
88
89The address of the mem_section array, its length, structure size, and
90the section_mem_map offset.
91
92It exists in the sparse memory mapping model and is somewhat similar
93to the mem_map variable: both are used to translate an
94address.
95
96page
97----
98
99The size of a page structure. struct page is an important data structure
100and it is widely used to compute contiguous memory.
101
102pglist_data
103-----------
104
105The size of a pglist_data structure. This value is used to check if the
106pglist_data structure is valid. It is also used for checking the memory
107type.
108
109zone
110----
111
112The size of a zone structure. This value is used to check if the zone
113structure has been found. It is also used for excluding free pages.
114
115free_area
116---------
117
118The size of a free_area structure. It indicates whether the free_area
119structure is valid or not. Useful when excluding free pages.
120
121list_head
122---------
123
124The size of a list_head structure. Used when iterating lists in a
125post-mortem analysis session.
126
127nodemask_t
128----------
129
130The size of a nodemask_t type. Used to compute the number of online
131nodes.
132
133(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
134-------------------------------------------------------------------------------------------------
135
136User-space tools compute their values based on the offset of these
137variables. The variables are used when excluding unnecessary pages.
138
139(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|node_spanned_pages|node_id)
140-----------------------------------------------------------------------------------------
141
142On NUMA machines, each NUMA node has a pg_data_t to describe its memory
143layout. On UMA machines there is a single pglist_data which describes the
144whole memory.
145
146These values are used to check the memory type and to compute the
147virtual address for memory map.
148
149(zone, free_area|vm_stat|spanned_pages)
150---------------------------------------
151
152Each node is divided into a number of blocks called zones which
153represent ranges within memory. A zone is described by a structure zone.
154
155User-space tools compute required values based on the offset of these
156variables.
157
158(free_area, free_list)
159----------------------
160
161Offset of the free_list's member. This value is used to compute the number
162of free pages.
163
164Each zone has a free_area structure array called free_area[MAX_ORDER].
165The free_list represents a linked list of free page blocks.
166
167(list_head, next|prev)
168----------------------
169
170Offsets of the list_head's members. list_head is used to define a
171circular linked list. User-space tools need these in order to traverse
172lists.
173
174(vmap_area, va_start|list)
175--------------------------
176
177Offsets of the vmap_area's members. They carry vmalloc-specific
178information. Makedumpfile gets the start address of the vmalloc region
179from this.
180
181(zone.free_area, MAX_ORDER)
182---------------------------
183
184Free areas descriptor. User-space tools use this value to iterate the
185free_area ranges. MAX_ORDER is used by the zone buddy allocator.
186
187log_first_idx
188-------------
189
190Index of the first record stored in the buffer log_buf. Used by
191user-space tools to read the strings in the log_buf.
192
193log_buf
194-------
195
196Console output is written to the ring buffer log_buf at index
197log_first_idx. Used to get the kernel log.
198
199log_buf_len
200-----------
201
202log_buf's length.
203
204clear_idx
205---------
206
207The index of the next printk() record to read after the last clear
208command. It indicates the first record after the last
209SYSLOG_ACTION_CLEAR, such as one issued by 'dmesg -c'. Used by user-space
210tools to dump the dmesg log.
211
212log_next_idx
213------------
214
215The index of the next record to store in the buffer log_buf. Used to
216compute the index of the current buffer position.
217
218printk_log
219----------
220
221The size of a structure printk_log. Used to compute the size of
222messages, and extract dmesg log. It encapsulates header information for
223log_buf, such as timestamp, syslog level, etc.
224
225(printk_log, ts_nsec|len|text_len|dict_len)
226-------------------------------------------
227
228It represents field offsets in struct printk_log. User space tools
229parse it and check whether the values of printk_log's members have been
230changed.
231
232(free_area.free_list, MIGRATE_TYPES)
233------------------------------------
234
235The number of migrate types for pages. The free_list is described by the
236array. Used by tools to compute the number of free pages.
237
238NR_FREE_PAGES
239-------------
240
241On linux-2.6.21 or later, the number of free pages is in
242vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
243
244PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison|PG_head_mask
245------------------------------------------------------------------------------
246
247Page attributes. These flags are used to filter out various pages that
248are unnecessary for dumping.
249
250PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
251-----------------------------------------------------------------------------
252
253More page attributes. These flags are used to filter out various pages
254that are unnecessary for dumping.
255
256
257HUGETLB_PAGE_DTOR
258-----------------
259
260The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
261excludes these pages.
262
263x86_64
264======
265
266phys_base
267---------
268
269Used to convert the virtual address of an exported kernel symbol to its
270corresponding physical address.
271
272init_top_pgt
273------------
274
275Used to walk through the whole page table and convert virtual addresses
276to physical addresses. The init_top_pgt is somewhat similar to
277swapper_pg_dir, but it is only used in x86_64.
278
279pgtable_l5_enabled
280------------------
281
282User-space tools need to know whether the crash kernel was in 5-level
283paging mode.
284
285node_data
286---------
287
288This is a struct pglist_data array and stores all NUMA nodes
289information. Makedumpfile gets the pglist_data structure from it.
290
291(node_data, MAX_NUMNODES)
292-------------------------
293
294The maximum number of nodes in the system.
295
296KERNELOFFSET
297------------
298
299The kernel randomization offset. Used to compute the page offset. If
300KASLR is disabled, this value is zero.
301
302KERNEL_IMAGE_SIZE
303-----------------
304
305Currently unused by Makedumpfile. Used to compute the module virtual
306address by Crash.
307
308sme_mask
309--------
310
311AMD-specific with SME support: it indicates the secure memory encryption
312mask. Makedumpfile tools need to know whether the crash kernel was
313encrypted. If SME is enabled in the first kernel, the crash kernel's
314page table entries (pgd/pud/pmd/pte) contain the memory encryption
315mask. This is used to remove the SME mask and obtain the true physical
316address.
317
318Currently, sme_mask stores the value of the C-bit position. If needed,
319additional SME-relevant info can be placed in that variable.
320
321For example::
322
323 [ misc ][ enc bit ][ other misc SME info ]
324 0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
325 63 59 55 51 47 43 39 35 31 27 ... 3
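
To make the masking step concrete, here is a small arithmetic sketch. It assumes, as in the diagram above, that the encryption bit is bit 47. The page table entry value is invented for illustration, and ``strip_sme_mask`` is a hypothetical helper name. A real tool would also have to mask off the entry's flag bits, which is not shown here.

```shell
# Clear the SME encryption bit from a page table entry value.
# $1 = entry value, $2 = sme_mask; either may be given in hex (0x...).
strip_sme_mask() {
    printf '0x%x\n' $(( $1 & ~$2 ))
}

sme_mask=$(( 1 << 47 ))     # C-bit at position 47, as in the diagram (assumption)
strip_sme_mask 0x800012345000 "$sme_mask"    # prints 0x12345000
```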
326
327x86_32
328======
329
330X86_PAE
331-------
332
333Denotes whether physical address extensions are enabled. It has the cost
334of a higher page table lookup overhead, and also consumes more page
335table space per process. Used to check whether PAE was enabled in the
336crash kernel when converting virtual addresses to physical addresses.
337
338ia64
339====
340
341pgdat_list|(pgdat_list, MAX_NUMNODES)
342-------------------------------------
343
344pg_data_t array storing all NUMA nodes information. MAX_NUMNODES
345indicates the number of the nodes.
346
347node_memblk|(node_memblk, NR_NODE_MEMBLKS)
348------------------------------------------
349
350List of node memory chunks. Filled when parsing the SRAT table to obtain
351information about memory nodes. NR_NODE_MEMBLKS indicates the number of
352node memory chunks.
353
354These values are used to compute the number of nodes the crashed kernel used.
355
356node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
357----------------------------------------------------------------
358
359The size of a struct node_memblk_s and the offsets of the
360node_memblk_s's members. Used to compute the number of nodes.
361
362PGTABLE_3|PGTABLE_4
363-------------------
364
365User-space tools need to know whether the crash kernel was in 3-level or
3664-level paging mode. Used to distinguish the page table.
367
368ARM64
369=====
370
371VA_BITS
372-------
373
374The maximum number of bits for virtual addresses. Used to compute the
375virtual memory ranges.
376
377kimage_voffset
378--------------
379
380The offset between the kernel virtual and physical mappings. Used to
381translate virtual to physical addresses.
382
383PHYS_OFFSET
384-----------
385
386Indicates the physical address of the start of memory. Similar to
387kimage_voffset, which is used to translate virtual to physical
388addresses.
389
390KERNELOFFSET
391------------
392
393The kernel randomization offset. Used to compute the page offset. If
394KASLR is disabled, this value is zero.
395
396arm
397===
398
399ARM_LPAE
400--------
401
402It indicates whether the crash kernel supports large physical address
403extensions. Used to translate virtual to physical addresses.
404
405s390
406====
407
408lowcore_ptr
409-----------
410
411An array with a pointer to the lowcore of every CPU. Used to print the
412psw and all registers information.
413
414high_memory
415-----------
416
417Used to get the vmalloc_start address from the high_memory symbol.
418
419(lowcore_ptr, NR_CPUS)
420----------------------
421
422The maximum number of CPUs.
423
424powerpc
425=======
426
427
428node_data|(node_data, MAX_NUMNODES)
429-----------------------------------
430
431See above.
432
433contig_page_data
434----------------
435
436See above.
437
438vmemmap_list
439------------
440
441The vmemmap_list maintains the entire vmemmap physical mapping. Used
442to get vmemmap list count and populated vmemmap regions info. If the
443vmemmap address translation information is stored in the crash kernel,
444it is used to translate vmemmap kernel virtual addresses.
445
446mmu_vmemmap_psize
447-----------------
448
449The size of a page. Used to translate virtual to physical addresses.
450
451mmu_psize_defs
452--------------
453
454Page size definitions, i.e. 4k, 64k, or 16M.
455
456Used to make vtop translations.
457
458vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|(vmemmap_backing, virt_addr)
459--------------------------------------------------------------------------------------------
460
461The vmemmap virtual address space management does not have a traditional
462page table to track which virtual struct pages are backed by a physical
463mapping. The virtual to physical mappings are tracked in a simple linked
464list format.
465
466User-space tools need to know the offset of list, phys and virt_addr
467when computing the count of vmemmap regions.
468
469mmu_psize_def|(mmu_psize_def, shift)
470------------------------------------
471
472The size of a struct mmu_psize_def and the offset of mmu_psize_def's
473member.
474
475Used in vtop translations.
476
477sh
478==
479
480node_data|(node_data, MAX_NUMNODES)
481-----------------------------------
482
483See above.
484
485X2TLB
486-----
487
488Indicates whether the crashed kernel enabled SH extended mode.
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 5d29ba5ad88c..d05d531b4ec9 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -118,7 +118,7 @@ parameter is applicable::
118 LOOP Loopback device support is enabled. 118 LOOP Loopback device support is enabled.
119 M68k M68k architecture is enabled. 119 M68k M68k architecture is enabled.
120 These options have more detailed description inside of 120 These options have more detailed description inside of
121 Documentation/m68k/kernel-options.txt. 121 Documentation/m68k/kernel-options.rst.
122 MDA MDA console support is enabled. 122 MDA MDA console support is enabled.
123 MIPS MIPS architecture is enabled. 123 MIPS MIPS architecture is enabled.
124 MOUSE Appropriate mouse support is enabled. 124 MOUSE Appropriate mouse support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f8b62360b18c..34a363f91b46 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -430,7 +430,7 @@
430 430
431 blkdevparts= Manual partition parsing of block device(s) for 431 blkdevparts= Manual partition parsing of block device(s) for
432 embedded devices based on command line input. 432 embedded devices based on command line input.
433 See Documentation/block/cmdline-partition.txt 433 See Documentation/block/cmdline-partition.rst
434 434
435 boot_delay= Milliseconds to delay each printk during boot. 435 boot_delay= Milliseconds to delay each printk during boot.
436 Values larger than 10 seconds (10000) are changed to 436 Values larger than 10 seconds (10000) are changed to
@@ -708,14 +708,14 @@
708 [KNL, x86_64] select a region under 4G first, and 708 [KNL, x86_64] select a region under 4G first, and
709 fall back to reserve region above 4G when '@offset' 709 fall back to reserve region above 4G when '@offset'
710 hasn't been specified. 710 hasn't been specified.
711 See Documentation/kdump/kdump.rst for further details. 711 See Documentation/admin-guide/kdump/kdump.rst for further details.
712 712
713 crashkernel=range1:size1[,range2:size2,...][@offset] 713 crashkernel=range1:size1[,range2:size2,...][@offset]
714 [KNL] Same as above, but depends on the memory 714 [KNL] Same as above, but depends on the memory
715 in the running system. The syntax of range is 715 in the running system. The syntax of range is
716 start-[end] where start and end are both 716 start-[end] where start and end are both
717 a memory unit (amount[KMG]). See also 717 a memory unit (amount[KMG]). See also
718 Documentation/kdump/kdump.rst for an example. 718 Documentation/admin-guide/kdump/kdump.rst for an example.
719 719
720 crashkernel=size[KMG],high 720 crashkernel=size[KMG],high
721 [KNL, x86_64] range could be above 4G. Allow kernel 721 [KNL, x86_64] range could be above 4G. Allow kernel
@@ -930,7 +930,7 @@
930 edid/1680x1050.bin, or edid/1920x1080.bin is given 930 edid/1680x1050.bin, or edid/1920x1080.bin is given
931 and no file with the same name exists. Details and 931 and no file with the same name exists. Details and
932 instructions how to build your own EDID data are 932 instructions how to build your own EDID data are
933 available in Documentation/EDID/howto.rst. An EDID 933 available in Documentation/driver-api/edid.rst. An EDID
934 data set will only be used for a particular connector, 934 data set will only be used for a particular connector,
935 if its name and a colon are prepended to the EDID 935 if its name and a colon are prepended to the EDID
936 name. Each connector may use a unique EDID data 936 name. Each connector may use a unique EDID data
@@ -1199,15 +1199,15 @@
1199 1199
1200 elevator= [IOSCHED] 1200 elevator= [IOSCHED]
1201 Format: { "mq-deadline" | "kyber" | "bfq" } 1201 Format: { "mq-deadline" | "kyber" | "bfq" }
1202 See Documentation/block/deadline-iosched.txt, 1202 See Documentation/block/deadline-iosched.rst,
1203 Documentation/block/kyber-iosched.txt and 1203 Documentation/block/kyber-iosched.rst and
1204 Documentation/block/bfq-iosched.txt for details. 1204 Documentation/block/bfq-iosched.rst for details.
1205 1205
1206 elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] 1206 elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
1207 Specifies physical address of start of kernel core 1207 Specifies physical address of start of kernel core
1208 image elf header and optionally the size. Generally 1208 image elf header and optionally the size. Generally
1209 kexec loader will pass this option to capture kernel. 1209 kexec loader will pass this option to capture kernel.
1210 See Documentation/kdump/kdump.rst for details. 1210 See Documentation/admin-guide/kdump/kdump.rst for details.
1211 1211
1212 enable_mtrr_cleanup [X86] 1212 enable_mtrr_cleanup [X86]
1213 The kernel tries to adjust MTRR layout from continuous 1213 The kernel tries to adjust MTRR layout from continuous
@@ -1249,7 +1249,7 @@
1249 See also Documentation/fault-injection/. 1249 See also Documentation/fault-injection/.
1250 1250
1251 floppy= [HW] 1251 floppy= [HW]
1252 See Documentation/blockdev/floppy.txt. 1252 See Documentation/admin-guide/blockdev/floppy.rst.
1253 1253
1254 force_pal_cache_flush 1254 force_pal_cache_flush
1255 [IA-64] Avoid check_sal_cache_flush which may hang on 1255 [IA-64] Avoid check_sal_cache_flush which may hang on
@@ -2234,7 +2234,7 @@
2234 memblock=debug [KNL] Enable memblock debug messages. 2234 memblock=debug [KNL] Enable memblock debug messages.
2235 2235
2236 load_ramdisk= [RAM] List of ramdisks to load from floppy 2236 load_ramdisk= [RAM] List of ramdisks to load from floppy
2237 See Documentation/blockdev/ramdisk.txt. 2237 See Documentation/admin-guide/blockdev/ramdisk.rst.
2238 2238
2239 lockd.nlm_grace_period=P [NFS] Assign grace period. 2239 lockd.nlm_grace_period=P [NFS] Assign grace period.
2240 Format: <integer> 2240 Format: <integer>
@@ -3144,7 +3144,7 @@
3144 numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 3144 numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
3145 'node', 'default' can be specified 3145 'node', 'default' can be specified
3146 This can be set from sysctl after boot. 3146 This can be set from sysctl after boot.
3147 See Documentation/sysctl/vm.txt for details. 3147 See Documentation/admin-guide/sysctl/vm.rst for details.
3148 3148
3149 ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. 3149 ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
3150 See Documentation/debugging-via-ohci1394.txt for more 3150 See Documentation/debugging-via-ohci1394.txt for more
@@ -3268,7 +3268,7 @@
3268 3268
3269 pcd. [PARIDE] 3269 pcd. [PARIDE]
3270 See header of drivers/block/paride/pcd.c. 3270 See header of drivers/block/paride/pcd.c.
3271 See also Documentation/blockdev/paride.txt. 3271 See also Documentation/admin-guide/blockdev/paride.rst.
3272 3272
3273 pci=option[,option...] [PCI] various PCI subsystem options. 3273 pci=option[,option...] [PCI] various PCI subsystem options.
3274 3274
@@ -3512,7 +3512,7 @@
3512 needed on a platform with proper driver support. 3512 needed on a platform with proper driver support.
3513 3513
3514 pd. [PARIDE] 3514 pd. [PARIDE]
3515 See Documentation/blockdev/paride.txt. 3515 See Documentation/admin-guide/blockdev/paride.rst.
3516 3516
3517 pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at 3517 pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
3518 boot time. 3518 boot time.
@@ -3527,10 +3527,10 @@
3527 and performance comparison. 3527 and performance comparison.
3528 3528
3529 pf. [PARIDE] 3529 pf. [PARIDE]
3530 See Documentation/blockdev/paride.txt. 3530 See Documentation/admin-guide/blockdev/paride.rst.
3531 3531
3532 pg. [PARIDE] 3532 pg. [PARIDE]
3533 See Documentation/blockdev/paride.txt. 3533 See Documentation/admin-guide/blockdev/paride.rst.
3534 3534
3535 pirq= [SMP,APIC] Manual mp-table setup 3535 pirq= [SMP,APIC] Manual mp-table setup
3536 See Documentation/x86/i386/IO-APIC.rst. 3536 See Documentation/x86/i386/IO-APIC.rst.
@@ -3642,7 +3642,7 @@
3642 3642
3643 prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk 3643 prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
3644 before loading. 3644 before loading.
3645 See Documentation/blockdev/ramdisk.txt. 3645 See Documentation/admin-guide/blockdev/ramdisk.rst.
3646 3646
3647 psi= [KNL] Enable or disable pressure stall information 3647 psi= [KNL] Enable or disable pressure stall information
3648 tracking. 3648 tracking.
@@ -3664,7 +3664,7 @@
3664 pstore.backend= Specify the name of the pstore backend to use 3664 pstore.backend= Specify the name of the pstore backend to use
3665 3665
3666 pt. [PARIDE] 3666 pt. [PARIDE]
3667 See Documentation/blockdev/paride.txt. 3667 See Documentation/admin-guide/blockdev/paride.rst.
3668 3668
3669 pti= [X86_64] Control Page Table Isolation of user and 3669 pti= [X86_64] Control Page Table Isolation of user and
3670 kernel address spaces. Disabling this feature 3670 kernel address spaces. Disabling this feature
@@ -3693,7 +3693,7 @@
3693 See Documentation/admin-guide/md.rst. 3693 See Documentation/admin-guide/md.rst.
3694 3694
3695 ramdisk_size= [RAM] Sizes of RAM disks in kilobytes 3695 ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
3696 See Documentation/blockdev/ramdisk.txt. 3696 See Documentation/admin-guide/blockdev/ramdisk.rst.
3697 3697
3698 random.trust_cpu={on,off} 3698 random.trust_cpu={on,off}
3699 [KNL] Enable or disable trusting the use of the 3699 [KNL] Enable or disable trusting the use of the
@@ -4089,7 +4089,7 @@
4089 4089
4090 relax_domain_level= 4090 relax_domain_level=
4091 [KNL, SMP] Set scheduler's default relax_domain_level. 4091 [KNL, SMP] Set scheduler's default relax_domain_level.
4092 See Documentation/cgroup-v1/cpusets.rst. 4092 See Documentation/admin-guide/cgroup-v1/cpusets.rst.
4093 4093
4094 reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory 4094 reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
4095 Format: <base1>,<size1>[,<base2>,<size2>,...] 4095 Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4347,7 +4347,7 @@
4347 Format: <integer> 4347 Format: <integer>
4348 4348
4349 sonypi.*= [HW] Sony Programmable I/O Control Device driver 4349 sonypi.*= [HW] Sony Programmable I/O Control Device driver
4350 See Documentation/laptops/sonypi.txt 4350 See Documentation/admin-guide/laptops/sonypi.rst
4351 4351
4352 spectre_v2= [X86] Control mitigation of Spectre variant 2 4352 spectre_v2= [X86] Control mitigation of Spectre variant 2
4353 (indirect branch speculation) vulnerability. 4353 (indirect branch speculation) vulnerability.
@@ -4599,7 +4599,7 @@
4599 swapaccount=[0|1] 4599 swapaccount=[0|1]
4600 [KNL] Enable accounting of swap in memory resource 4600 [KNL] Enable accounting of swap in memory resource
4601 controller if no parameter or 1 is given or disable 4601 controller if no parameter or 1 is given or disable
4602 it if 0 is given (See Documentation/cgroup-v1/memory.rst) 4602 it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
4603 4603
4604 swiotlb= [ARM,IA-64,PPC,MIPS,X86] 4604 swiotlb= [ARM,IA-64,PPC,MIPS,X86]
4605 Format: { <int> | force | noforce } 4605 Format: { <int> | force | noforce }
@@ -5066,7 +5066,7 @@
5066 5066
5067 vga= [BOOT,X86-32] Select a particular video mode 5067 vga= [BOOT,X86-32] Select a particular video mode
5068 See Documentation/x86/boot.rst and 5068 See Documentation/x86/boot.rst and
5069 Documentation/svga.txt. 5069 Documentation/admin-guide/svga.rst.
5070 Use vga=ask for menu. 5070 Use vga=ask for menu.
5071 This is actually a boot loader parameter; the value is 5071 This is actually a boot loader parameter; the value is
5072 passed to the kernel using a special protocol. 5072 passed to the kernel using a special protocol.
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
new file mode 100644
index 000000000000..4f18456dd3b1
--- /dev/null
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -0,0 +1,356 @@
1==========================================
2Reducing OS jitter due to per-cpu kthreads
3==========================================
4
5This document lists per-CPU kthreads in the Linux kernel and presents
6options to control their OS jitter. Note that non-per-CPU kthreads are
7not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
8them to a "housekeeping" CPU dedicated to such work.
9
10References
11==========
12
13- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
14
15- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
16
17- man taskset: Using the taskset command to bind tasks to sets
18 of CPUs.
19
20- man sched_setaffinity: Using the sched_setaffinity() system
21 call to bind tasks to sets of CPUs.
22
23- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
24 writing "0" to offline and "1" to online.
25
26- In order to locate kernel-generated OS jitter on CPU N::
27
28 cd /sys/kernel/debug/tracing
29 echo 1 > max_graph_depth # Increase the "1" for more detail
30 echo function_graph > current_tracer
31 # run workload
32 cat per_cpu/cpuN/trace
33
34kthreads
35========
36
37Name:
38 ehca_comp/%u
39
40Purpose:
41 Periodically process Infiniband-related work.
42
43To reduce its OS jitter, do any of the following:
44
451. Don't use eHCA Infiniband hardware, instead choosing hardware
46 that does not require per-CPU kthreads. This will prevent these
47 kthreads from being created in the first place. (This will
48 work for most people, as this hardware, though important, is
49 relatively old and is produced in relatively low unit volumes.)
502. Do all eHCA-Infiniband-related work on other CPUs, including
51 interrupts.
523. Rework the eHCA driver so that its per-CPU kthreads are
53 provisioned only on selected CPUs.
54
55
56Name:
57 irq/%d-%s
58
59Purpose:
60 Handle threaded interrupts.
61
62To reduce its OS jitter, do the following:
63
641. Use irq affinity to force the irq threads to execute on
65 some other CPU.
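
One way to apply this is through ``/proc/irq/<irq>/smp_affinity``, which takes a hexadecimal CPU bitmask. The sketch below only computes a mask that excludes one CPU. ``mask_without_cpu`` is a hypothetical helper, the IRQ number in the comment is a placeholder, and the arithmetic as written assumes fewer than 63 CPUs.

```shell
# Print a hex CPU mask covering CPUs 0..(ncpus-1) except one CPU.
# $1 = number of CPUs, $2 = CPU to exclude
mask_without_cpu() {
    printf '%x\n' $(( ((1 << $1) - 1) & ~(1 << $2) ))
}

mask_without_cpu 8 3        # prints f7: CPUs 0-7 without CPU 3
# As root, the result could then be written for a given IRQ, for example:
#   mask_without_cpu 8 3 > /proc/irq/42/smp_affinity
```

Some interrupts cannot be moved, so it is worth reading the file back after writing it.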
66
67Name:
68 kcmtpd_ctr_%d
69
70Purpose:
71 Handle Bluetooth work.
72
73To reduce its OS jitter, do one of the following:
74
751. Don't use Bluetooth, in which case these kthreads won't be
76 created in the first place.
772. Use irq affinity to force Bluetooth-related interrupts to
78 occur on some other CPU and furthermore initiate all
79 Bluetooth activity on some other CPU.
80
81Name:
82 ksoftirqd/%u
83
84Purpose:
85 Execute softirq handlers when threaded or when under heavy load.
86
87To reduce its OS jitter, each softirq vector must be handled
88separately as follows:
89
90TIMER_SOFTIRQ
91-------------
92
93Do all of the following:
94
951. To the extent possible, keep the CPU out of the kernel when it
96 is non-idle, for example, by avoiding system calls and by forcing
97 both kernel threads and interrupts to execute elsewhere.
982. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
99 the CPU offline, then bring it back online. This forces
100 recurring timers to migrate elsewhere. If you are concerned
101 with multiple CPUs, force them all offline before bringing the
102 first one back online. Once you have onlined the CPUs in question,
103 do not offline any other CPUs, because doing so could force the
104 timer back onto one of the CPUs in question.
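
The offline/online cycle in step 2 can be sketched as follows, using the ``online`` control files listed in the references. ``cycle_cpus`` is a hypothetical helper. It takes the sysfs CPU directory as its first argument so that it can be tried against a scratch directory before being pointed at the real one.

```shell
# Offline every listed CPU first, then bring them all back online, so that
# no timer migrates onto a CPU that is about to be cycled.
# Usage: cycle_cpus DIR CPU...   (DIR is normally /sys/devices/system/cpu)
cycle_cpus() {
    dir=$1; shift
    for cpu in "$@"; do echo 0 > "$dir/cpu$cpu/online" || return 1; done
    for cpu in "$@"; do echo 1 > "$dir/cpu$cpu/online" || return 1; done
}

# As root, once boot has completed, for example:
#   cycle_cpus /sys/devices/system/cpu 2 3
```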
105
106NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
107---------------------------------
108
109Do all of the following:
110
1111. Force networking interrupts onto other CPUs.
1122. Initiate any network I/O on other CPUs.
1133. Once your application has started, prevent CPU-hotplug operations
114 from being initiated from tasks that might run on the CPU to
115 be de-jittered. (It is OK to force this CPU offline and then
116 bring it back online before you start your application.)
117
118BLOCK_SOFTIRQ
119-------------
120
121Do all of the following:
122
1231. Force block-device interrupts onto some other CPU.
1242. Initiate any block I/O on other CPUs.
1253. Once your application has started, prevent CPU-hotplug operations
126 from being initiated from tasks that might run on the CPU to
127 be de-jittered. (It is OK to force this CPU offline and then
128 bring it back online before you start your application.)
129
130IRQ_POLL_SOFTIRQ
131----------------
132
133Do all of the following:
134
1351. Force block-device interrupts onto some other CPU.
1362. Initiate any block I/O and block-I/O polling on other CPUs.
1373. Once your application has started, prevent CPU-hotplug operations
138 from being initiated from tasks that might run on the CPU to
139 be de-jittered. (It is OK to force this CPU offline and then
140 bring it back online before you start your application.)
141
142TASKLET_SOFTIRQ
143---------------
144
145Do one or more of the following:
146
1471. Avoid use of drivers that use tasklets. (Such drivers will contain
148 calls to things like tasklet_schedule().)
1492. Convert all drivers that you must use from tasklets to workqueues.
1503. Force interrupts for drivers using tasklets onto other CPUs,
151 and also do I/O involving these drivers on other CPUs.
152
153SCHED_SOFTIRQ
154-------------
155
156Do all of the following:
157
1581. Avoid sending scheduler IPIs to the CPU to be de-jittered,
159 for example, ensure that at most one runnable kthread is present
160 on that CPU. If a thread that expects to run on the de-jittered
161 CPU awakens, the scheduler will send an IPI that can result in
162 a subsequent SCHED_SOFTIRQ.
1632. Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered
164 is marked as an adaptive-ticks CPU using the "nohz_full="
165 boot parameter. This reduces the number of scheduler-clock
166 interrupts that the de-jittered CPU receives, minimizing its
167 chances of being selected to do the load balancing work that
168 runs in SCHED_SOFTIRQ context.
1693. To the extent possible, keep the CPU out of the kernel when it
170 is non-idle, for example, by avoiding system calls and by
171 forcing both kernel threads and interrupts to execute elsewhere.
172 This further reduces the number of scheduler-clock interrupts
173 received by the de-jittered CPU.
174
175HRTIMER_SOFTIRQ
176---------------
177
178Do all of the following:
179
1801. To the extent possible, keep the CPU out of the kernel when it
181 is non-idle. For example, avoid system calls and force both
182 kernel threads and interrupts to execute elsewhere.
1832. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
184 CPU offline, then bring it back online. This forces recurring
185 timers to migrate elsewhere. If you are concerned with multiple
186 CPUs, force them all offline before bringing the first one
187 back online. Once you have onlined the CPUs in question, do not
188 offline any other CPUs, because doing so could force the timer
189 back onto one of the CPUs in question.
190
191RCU_SOFTIRQ
192-----------
193
194Do at least one of the following:
195
1961. Offload callbacks and keep the CPU in either dyntick-idle or
197 adaptive-ticks state by doing all of the following:
198
199 a. Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
200 de-jittered is marked as an adaptive-ticks CPU using the
201 "nohz_full=" boot parameter. Bind the rcuo kthreads to
202 housekeeping CPUs, which can tolerate OS jitter.
203 b. To the extent possible, keep the CPU out of the kernel
204 when it is non-idle, for example, by avoiding system
205 calls and by forcing both kernel threads and interrupts
206 to execute elsewhere.
207
2082. Enable RCU to do its processing remotely via dyntick-idle by
209 doing all of the following:
210
211 a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
212 b. Ensure that the CPU goes idle frequently, allowing other
213 CPUs to detect that it has passed through an RCU quiescent
214 state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
215 userspace execution also allows other CPUs to detect that
216 the CPU in question has passed through a quiescent state.
217 c. To the extent possible, keep the CPU out of the kernel
218 when it is non-idle, for example, by avoiding system
219 calls and by forcing both kernel threads and interrupts
220 to execute elsewhere.
221
222Name:
223 kworker/%u:%d%s (cpu, id, priority)
224
225Purpose:
226 Execute workqueue requests
227
228To reduce its OS jitter, do any of the following:
229
2301. Run your workload at a real-time priority, which will allow
231 preempting the kworker daemons.
2322. A given workqueue can be made visible in the sysfs filesystem
233 by passing the WQ_SYSFS flag to that workqueue's alloc_workqueue().
234 Such a workqueue can be confined to a given subset of the
235 CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
236 files. The set of WQ_SYSFS workqueues can be displayed using
237 "ls /sys/devices/virtual/workqueue". That said, the workqueues
238 maintainer would like to caution people against indiscriminately
239 sprinkling WQ_SYSFS across all the workqueues. The reason for
240 caution is that it is easy to add WQ_SYSFS, but because sysfs is
241 part of the formal user/kernel API, it can be nearly impossible
242 to remove it, even if its addition was a mistake.
2433. Do any of the following needed to avoid jitter that your
244 application cannot tolerate:
245
246 a. Build your kernel with CONFIG_SLUB=y rather than
247 CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
248 use of each CPU's workqueues to run its cache_reap()
249 function.
250 b. Avoid using oprofile, thus avoiding OS jitter from
251 wq_sync_buffer().
252 c. Limit your CPU frequency so that a CPU-frequency
253 governor is not required, possibly enlisting the aid of
254 special heatsinks or other cooling technologies. If done
255 correctly, and if your CPU architecture permits, you should
256 be able to build your kernel with CONFIG_CPU_FREQ=n to
257 avoid the CPU-frequency governor periodically running
258 on each CPU, including cs_dbs_timer() and od_dbs_timer().
259
260 WARNING: Please check your CPU specifications to
261 make sure that this is safe on your particular system.
262 d. As of v3.18, Christoph Lameter's on-demand vmstat workers
263 commit prevents OS jitter due to vmstat_update() on
264 CONFIG_SMP=y systems. Before v3.18, it is not possible
265 to entirely get rid of the OS jitter, but you can
266 decrease its frequency by writing a large value to
267 /proc/sys/vm/stat_interval. The default value is HZ,
268 for an interval of one second. Of course, larger values
269 will make your virtual-memory statistics update more
270 slowly. You can also run your workload at
271 a real-time priority, thus preempting vmstat_update(),
272 but if your workload is CPU-bound, this is a bad idea.
273 However, there is an RFC patch from Christoph Lameter
274 (based on an earlier one from Gilad Ben-Yossef) that
275 reduces or even eliminates vmstat overhead for some
276 workloads at https://lkml.org/lkml/2013/9/4/379.
277 e. Boot with "elevator=noop" to avoid workqueue use by
278 the block layer.
279 f. If running on high-end powerpc servers, build with
280 CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
281 daemon from running on each CPU every second or so.
282 (This will require editing Kconfig files and will defeat
283 this platform's RAS functionality.) This avoids jitter
284 due to the rtas_event_scan() function.
285 WARNING: Please check your CPU specifications to
286 make sure that this is safe on your particular system.
287 g. If running on Cell Processor, build your kernel with
288 CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
289 spu_gov_work().
290 WARNING: Please check your CPU specifications to
291 make sure that this is safe on your particular system.
292 h. If running on PowerMAC, build your kernel with
293 CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
294 avoiding OS jitter from rackmeter_do_timer().
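
For option 2 above, confining the WQ_SYSFS workqueues might look like the following sketch. ``confine_workqueues`` is a hypothetical helper, and the optional second argument exists only so that the function can be exercised against a scratch directory.

```shell
# Write one CPU mask to the cpumask file of every WQ_SYSFS workqueue.
# $1 = hex CPU mask; $2 = workqueue sysfs directory (defaults to the real one)
confine_workqueues() {
    wq_dir=${2:-/sys/devices/virtual/workqueue}
    for f in "$wq_dir"/*/cpumask; do
        [ -e "$f" ] || continue       # the glob matched nothing
        echo "$1" > "$f" || echo "could not update $f" >&2
    done
}

# As root, to keep these workqueues on CPUs 0-1 (mask 3), for example:
#   confine_workqueues 3
```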
295
296Name:
297 rcuc/%u
298
299Purpose:
300 Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
301
302To reduce its OS jitter, do at least one of the following:
303
3041. Build the kernel with CONFIG_PREEMPT=n. This prevents these
305 kthreads from being created in the first place, and also obviates
306 the need for RCU priority boosting. This approach is feasible
307 for workloads that do not require high degrees of responsiveness.
3082. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
309 kthreads from being created in the first place. This approach
310 is feasible only if your workload never requires RCU priority
311 boosting, for example, if you ensure frequent idle time on all
312 CPUs that might execute within the kernel.
3133. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs=
314 boot parameter offloading RCU callbacks from all CPUs susceptible
315 to OS jitter. This approach prevents the rcuc/%u kthreads from
316 having any work to do, so that they are never awakened.
3174. Ensure that the CPU never enters the kernel, and, in particular,
318 avoid initiating any CPU hotplug operations on this CPU. This is
319 another way of preventing any callbacks from being queued on the
320 CPU, again preventing the rcuc/%u kthreads from having any work
321 to do.
322
323Name:
324 rcuop/%d and rcuos/%d
325
326Purpose:
327 Offload RCU callbacks from the corresponding CPU.
328
329To reduce its OS jitter, do at least one of the following:
330
3311. Use affinity, cgroups, or other mechanism to force these kthreads
332 to execute on some other CPU.
3332. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
334 kthreads from being created in the first place. However, please
335 note that this will not eliminate OS jitter, but will instead
336 shift it to RCU_SOFTIRQ.
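
Option 1 above might be sketched as follows. This is only an illustration: the helper name pin_cmds is hypothetical, CPU 0 is assumed to be a housekeeping CPU, and the pgrep and taskset utilities are assumed to be installed. The helper only prints the commands so that they can be reviewed before being run as root.

```shell
#!/bin/sh
# pin_cmds CPU PID...
# Print (but do not run) one taskset command per PID that would restrict
# that task to the given CPU.  pin_cmds is a hypothetical helper name.
pin_cmds() {
    cpu=$1
    shift
    for pid in "$@"; do
        printf 'taskset -cp %s %s\n' "$cpu" "$pid"
    done
}

# Example (as root), assuming CPU 0 is the housekeeping CPU:
#   pin_cmds 0 $(pgrep '^rcuo[ps]/') | sh
```

Printing the commands rather than executing them keeps the sketch free of side effects; whether CPU 0 is a suitable target depends entirely on your workload.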
337
338Name:
339 watchdog/%u
340
341Purpose:
342 Detect software lockups on each CPU.
343
344To reduce its OS jitter, do at least one of the following:
345
3461. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
347 kthreads from being created in the first place.
3482. Boot with "nosoftlockup=0", which will also prevent these kthreads
349 from being created. Other related watchdog and softlockup boot
350 parameters may be found in Documentation/admin-guide/kernel-parameters.rst
351 and Documentation/watchdog/watchdog-parameters.rst.
3523. Echo a zero to /proc/sys/kernel/watchdog to disable the
353 watchdog timer.
3544. Echo a large number to /proc/sys/kernel/watchdog_thresh in
355 order to reduce the frequency of OS jitter due to the watchdog
356 timer down to a level that is acceptable for your workload.
diff --git a/Documentation/admin-guide/laptops/asus-laptop.rst b/Documentation/admin-guide/laptops/asus-laptop.rst
new file mode 100644
index 000000000000..95176321a25a
--- /dev/null
+++ b/Documentation/admin-guide/laptops/asus-laptop.rst
@@ -0,0 +1,271 @@
1==================
2Asus Laptop Extras
3==================
4
5Version 0.1
6
7August 6, 2009
8
9Corentin Chary <corentincj@iksaif.net>
10http://acpi4asus.sf.net/
11
12 This driver provides support for extra features of ACPI-compatible ASUS laptops.
13 It may also support some MEDION, JVC or VICTOR laptops (such as MEDION 9675 or
14 VICTOR XP7210 for example). It makes all the extra buttons generate input
15 events (like keyboards).
16
17 On some models, it adds support for changing the display brightness and output,
18 switching the LCD backlight on and off, and most importantly, allows you to
19 blink those fancy LEDs intended for reporting mail and wireless status.
20
21This driver supersedes the old asus_acpi driver.
22
23Requirements
24------------
25
26 Kernel 2.6.X sources, configured for your computer, with ACPI support.
27 You also need CONFIG_INPUT and CONFIG_ACPI.
28
29Status
30------
31
32 The features currently supported are the following (see below for
33 detailed description):
34
35 - Fn key combinations
36 - Bluetooth enable and disable
37 - Wlan enable and disable
38 - GPS enable and disable
39 - Video output switching
40 - Ambient Light Sensor on and off
41 - LED control
42 - LED Display control
43 - LCD brightness control
44 - LCD on and off
45
46 A compatibility table by model and feature is maintained on the web
47 site, http://acpi4asus.sf.net/.
48
49Usage
50-----
51
52 Try "modprobe asus-laptop". Check your dmesg (simply type dmesg). You should
53 see some lines like this:
54
55 Asus Laptop Extras version 0.42
56 - L2D model detected.
57
58 If it is not the output you have on your laptop, send it (and the laptop's
59 DSDT) to me.
60
61 That's all. Now, all the events generated by the hotkeys of your laptop
62 should be reported via netlink events. You can check with
63 "acpi_genl monitor" (part of the acpica project).
64
65 Hotkeys are also reported as input keys (like keyboards); you can check
66 which keys are supported using "xev" under X11.
67
68 You can get information on the version of your DSDT table by reading the
69 /sys/devices/platform/asus-laptop/infos entry. If you have a question or a
70 bug report to make, please include the output of this entry.
71
72LEDs
73----
74
75 You can modify LEDs by echoing values to `/sys/class/leds/asus/*/brightness`::
76
77 echo 1 > /sys/class/leds/asus::mail/brightness
78
79 will switch the mail LED on.
80
81 You can also tell whether they are on or off by reading these files, and use
82 kernel triggers like disk-activity or heartbeat.
83
84Backlight
85---------
86
87 You can control lcd backlight power and brightness with
88 /sys/class/backlight/asus-laptop/. Brightness values are between 0 and 15.
89
90Wireless devices
91----------------
92
93 You can turn the internal Bluetooth adapter on/off with the bluetooth entry
94 (only on models with Bluetooth). This usually controls the associated LED.
95 Same for Wlan adapter.
96
97Display switching
98-----------------
99
100 Note: the display switching code is currently considered EXPERIMENTAL.
101
102 Switching works for the following models:
103
104 - L3800C
105 - A2500H
106 - L5800C
107 - M5200N
108 - W1000N (albeit with some glitches)
109 - M6700R
110 - A6JC
111 - F3J
112
113 Switching doesn't work for the following:
114
115 - M3700N
116 - L2X00D (locks the laptop under certain conditions)
117
118 To switch the displays, echo values from 0 to 15 to
119 /sys/devices/platform/asus-laptop/display. The significance of those values
120 is as follows:
121
122 +-------+-----+-----+-----+-----+-----+
123 | Bin | Val | DVI | TV | CRT | LCD |
124 +-------+-----+-----+-----+-----+-----+
125 | 0000 | 0 | | | | |
126 +-------+-----+-----+-----+-----+-----+
127 | 0001 | 1 | | | | X |
128 +-------+-----+-----+-----+-----+-----+
129 | 0010 | 2 | | | X | |
130 +-------+-----+-----+-----+-----+-----+
131 | 0011 | 3 | | | X | X |
132 +-------+-----+-----+-----+-----+-----+
133 | 0100 | 4 | | X | | |
134 +-------+-----+-----+-----+-----+-----+
135 | 0101 | 5 | | X | | X |
136 +-------+-----+-----+-----+-----+-----+
137 | 0110 | 6 | | X | X | |
138 +-------+-----+-----+-----+-----+-----+
139 | 0111 | 7 | | X | X | X |
140 +-------+-----+-----+-----+-----+-----+
141 | 1000 | 8 | X | | | |
142 +-------+-----+-----+-----+-----+-----+
143 | 1001 | 9 | X | | | X |
144 +-------+-----+-----+-----+-----+-----+
145 | 1010 | 10 | X | | X | |
146 +-------+-----+-----+-----+-----+-----+
147 | 1011 | 11 | X | | X | X |
148 +-------+-----+-----+-----+-----+-----+
149 | 1100 | 12 | X | X | | |
150 +-------+-----+-----+-----+-----+-----+
151 | 1101 | 13 | X | X | | X |
152 +-------+-----+-----+-----+-----+-----+
153 | 1110 | 14 | X | X | X | |
154 +-------+-----+-----+-----+-----+-----+
155 | 1111 | 15 | X | X | X | X |
156 +-------+-----+-----+-----+-----+-----+
157
158 In most cases, the appropriate displays must be plugged in for the above
159 combinations to work. TV-Out may need to be initialized at boot time.
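
The table above is a 4-bit mask (LCD=1, CRT=2, TV=4, DVI=8), so the value to echo can be computed rather than looked up. The following is a minimal sketch; the helper name display_value and its lowercase output names are made up for illustration and are not part of the driver.

```shell
#!/bin/sh
# display_value OUTPUT...
# Print the value (0-15) that selects the given outputs, following the
# bit assignments in the table: LCD=1, CRT=2, TV=4, DVI=8.
# display_value is a hypothetical helper name.
display_value() {
    value=0
    for out in "$@"; do
        case $out in
            lcd) bit=1 ;;
            crt) bit=2 ;;
            tv)  bit=4 ;;
            dvi) bit=8 ;;
            *)   echo "unknown output: $out" >&2; return 1 ;;
        esac
        value=$((value | bit))
    done
    echo "$value"
}

# Example (as root): enable LCD and CRT together.
#   display_value lcd crt > /sys/devices/platform/asus-laptop/display
```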
160
161 Debugging:
162
163 1) Check whether the Fn+F8 key:
164
165 a) does not lock the laptop (try a boot with noapic / nolapic if it does)
166 b) generates events (0x6n, where n is the value corresponding to the
167 configuration above)
168 c) actually works
169
170 Record the disp value at every configuration.
171 2) Echo values from 0 to 15 to /sys/devices/platform/asus-laptop/display.
172 Record its value, note any change. If nothing changes, try a broader range,
173 up to 65535.
174 3) Send ANY output (both positive and negative reports are needed, unless your
175 machine is already listed above) to the acpi4asus-user mailing list.
176
177 Note: on some machines (e.g. L3C), after the module has been loaded, only 0x6n
178 events are generated and no actual switching occurs. In such a case, a line
179 like::
180
181 echo $((10#$arg-60)) > /sys/devices/platform/asus-laptop/display
182
183 will usually do the trick ($arg is the 0000006n-like event passed to acpid).
184
185 Note: there is currently no reliable way to read display status on xxN
186 (Centrino) models.
187
188LED display
189-----------
190
191 Some models like the W1N have a LED display that can be used to display
192 several items of information.
193
194 LED display works for the following models:
195
196 - W1000N
197 - W1J
198
199 To control the LED display, use the following::
200
201 echo 0x0T000DDD > /sys/devices/platform/asus-laptop/
202
203 where T controls the 3-letter display and DDD the 3-digit display,
204 according to the tables below::
205
206 DDD (digits)
207 000 to 999 = display digits
208 AAA = ---
209 BBB to FFF = turn-off
210
211 T (type)
212 0 = off
213 1 = dvd
214 2 = vcd
215 3 = mp3
216 4 = cd
217 5 = tv
218 6 = cpu
219 7 = vol
220
221 For example "echo 0x01000001 >/sys/devices/platform/asus-laptop/ledd"
222 would display "DVD001".
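
The 0x0T000DDD word can be assembled from its two fields. The sketch below is illustrative only: the helper name ledd_value is hypothetical, and it accepts just the forms listed in the tables above (three decimal digits, or three letters in the range A to F).

```shell
#!/bin/sh
# ledd_value TYPE DIGITS
# Print the 0x0T000DDD word for the LED display.
# TYPE is 0-7 (see the "T (type)" table); DIGITS is either three decimal
# digits or three letters A-F.  ledd_value is a hypothetical helper name.
ledd_value() {
    type=$1
    digits=$2
    case $type in
        [0-7]) ;;
        *) echo "type must be 0-7" >&2; return 1 ;;
    esac
    case $digits in
        [0-9][0-9][0-9]) ;;
        [A-Fa-f][A-Fa-f][A-Fa-f]) ;;
        *) echo "digits must be 000-999 or three letters A-F" >&2; return 1 ;;
    esac
    printf '0x0%s000%s\n' "$type" "$digits"
}

# Example (as root): show "DVD001".
#   ledd_value 1 001 > /sys/devices/platform/asus-laptop/ledd
```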
223
224Driver options
225--------------
226
227 Options can be passed to the asus-laptop driver using the standard
228 module argument syntax (<param>=<value> when passing the option to the
229 module or asus-laptop.<param>=<value> on the kernel boot line when
230 asus-laptop is statically linked into the kernel).
231
232 wapf: WAPF defines the behavior of the Fn+Fx wlan key
233 The significance of values is yet to be found, but
234 most of the time:
235
236 - 0x0 should do nothing
237 - 0x1 should allow controlling the device with the Fn+Fx key.
238 - 0x4 should send an ACPI event (0x88) while pressing the Fn+Fx key
239 - 0x5 like 0x1 or 0x4
240
241 The default value is 0x1.
242
243Unsupported models
244------------------
245
246 These models will never be supported by this module, as they use a completely
247 different mechanism to handle LEDs and extra stuff (meaning we have no clue
248 how it works):
249
250 - ASUS A1300 (A1B), A1370D
251 - ASUS L7300G
252 - ASUS L8400
253
254Patches, Errors, Questions
255--------------------------
256
257 I appreciate any success or failure
258 reports, especially if they add to or correct the compatibility table.
259 Please include the following information in your report:
260
261 - Asus model name
262 - a copy of your ACPI tables, using the "acpidump" utility
263 - a copy of /sys/devices/platform/asus-laptop/infos
264 - which driver features work and which don't
265 - the observed behavior of non-working features
266
267 Any other comments or patches are also more than welcome.
268
269 acpi4asus-user@lists.sourceforge.net
270
271 http://sourceforge.net/projects/acpi4asus
diff --git a/Documentation/admin-guide/laptops/disk-shock-protection.rst b/Documentation/admin-guide/laptops/disk-shock-protection.rst
new file mode 100644
index 000000000000..e97c5f78d8c3
--- /dev/null
+++ b/Documentation/admin-guide/laptops/disk-shock-protection.rst
@@ -0,0 +1,151 @@
1==========================
2Hard disk shock protection
3==========================
4
5Author: Elias Oltmanns <eo@nebensachen.de>
6
7Last modified: 2008-10-03
8
9
10.. 0. Contents
11
12 1. Intro
13 2. The interface
14 3. References
15 4. CREDITS
16
17
181. Intro
19--------
20
21ATA/ATAPI-7 specifies the IDLE IMMEDIATE command with unload feature.
22Issuing this command should cause the drive to switch to idle mode and
23unload disk heads. This feature is being used in modern laptops in
24conjunction with accelerometers and appropriate software to implement
25a shock protection facility. The idea is to stop all I/O operations on
26the internal hard drive and park its heads on the ramp when critical
27situations are anticipated. The desire to have such a feature
28available on GNU/Linux systems has been the original motivation to
29implement a generic disk head parking interface in the Linux kernel.
30Please note, however, that other components have to be set up on your
31system in order to get disk shock protection working (see
32section 3. References below for pointers to more information about
33that).
34
35
362. The interface
37----------------
38
39For each ATA device, the kernel exports the file
40`block/*/device/unload_heads` in sysfs (here assumed to be mounted under
41/sys). Access to `/sys/block/*/device/unload_heads` is denied with
42-EOPNOTSUPP if the device does not support the unload feature.
43Otherwise, writing an integer value to this file will take the heads
44of the respective drive off the platter and block all I/O operations
45for the specified number of milliseconds. When the timeout expires and
46no further disk head park request has been issued in the meantime,
47normal operation will be resumed. The maximal value accepted for a
48timeout is 30000 milliseconds. Exceeding this limit will return
49-EOVERFLOW, but heads will be parked anyway and the timeout will be
50set to 30 seconds. However, you can always change a timeout to any
51value between 0 and 30000 by issuing a subsequent head park request
52before the timeout of the previous one has expired. In particular, the
53total timeout can exceed 30 seconds and, more importantly, you can
54cancel a previously set timeout and resume normal operation
55immediately by specifying a timeout of 0. Values below -2 are rejected
56with -EINVAL (see below for the special meaning of -1 and -2). If the
57timeout specified for a recent head park request has not yet expired,
58reading from `/sys/block/*/device/unload_heads` will report the number
59of milliseconds remaining until normal operation will be resumed;
60otherwise, reading the unload_heads attribute will return 0.
61
62For example, do the following in order to park the heads of drive
63/dev/sda and stop all I/O operations for five seconds::
64
65 # echo 5000 > /sys/block/sda/device/unload_heads
66
67A simple::
68
69 # cat /sys/block/sda/device/unload_heads
70
71will show you how many milliseconds are left before normal operation
72will be resumed.
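
A script that issues park requests may want to check the timeout before writing it, because the kernel still parks the heads when the value is over the limit. The sketch below covers only ordinary park requests (0 to 30000 milliseconds); the helper name park_heads and the default device path are assumptions for illustration.

```shell
#!/bin/sh
# park_heads MILLISECONDS [FILE]
# Write a head park timeout to an unload_heads attribute after checking
# that it lies in the 0..30000 range described above.  A timeout of 0
# cancels a pending request.  The special values -1 and -2 are
# deliberately not handled here.  park_heads is a hypothetical helper
# name; the default FILE assumes the drive is /dev/sda.
park_heads() {
    ms=$1
    file=${2:-/sys/block/sda/device/unload_heads}
    case $ms in
        ''|*[!0-9]*)
            echo "timeout must be a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "$ms" -gt 30000 ]; then
        echo "timeout must not exceed 30000 ms" >&2
        return 1
    fi
    printf '%s\n' "$ms" > "$file"
}

# Example (as root): park for five seconds, then resume early.
#   park_heads 5000
#   park_heads 0
```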
73
74A word of caution: The fact that the interface operates on a basis of
75milliseconds may raise expectations that cannot be satisfied in
76reality. In fact, the ATA specs clearly state that the time for an
77unload operation to complete is vendor specific. The hint in ATA-7
78that this will typically be within 500 milliseconds apparently has
79been dropped in ATA-8.
80
81There is a technical detail of this implementation that may cause some
82confusion and should be discussed here. When a head park request has
83been issued to a device successfully, all I/O operations on the
84controller port this device is attached to will be deferred. That is
85to say, any other device that may be connected to the same port will
86be affected too. The only exception is that a subsequent head unload
87request to that other device will be executed immediately. Further
88operations on that port will be deferred until the timeout specified
89for either device on the port has expired. As far as PATA (old style
90IDE) configurations are concerned, there can only be two devices
91attached to any single port. In SATA world we have port multipliers
92which means that a user-issued head parking request to one device may
93actually result in stopping I/O to a whole bunch of devices. However,
94since this feature is supposed to be used on laptops and does not seem
95to be very useful in any other environment, there will mostly be one
96device per port. Even if the CD/DVD writer happens to be connected to
97the same port as the hard drive, it generally *should* recover just
98fine from the occasional buffer under-run incurred by a head park
99request to the HD. Actually, when you are using an ide driver rather
100than its libata counterpart (i.e. your disk is called /dev/hda
101instead of /dev/sda), then parking the heads of one drive (drive X)
102will generally not affect the mode of operation of another drive
103(drive Y) on the same port as described above. It is only when a port
104reset is required to recover from an exception on drive Y that further
105I/O operations on that drive (and the reset itself) will be delayed
106until drive X is no longer in the parked state.
107
108Finally, there are some hard drives that only comply with an earlier
109version of the ATA standard than ATA-7, but do support the unload
110feature nonetheless. Unfortunately, there is no safe way Linux can
111detect these devices, so you won't be able to write to the
112unload_heads attribute. If you know that your device really does
113support the unload feature (for instance, because the vendor of your
114laptop or the hard drive itself told you so), then you can tell the
115kernel to enable the usage of this feature for that drive by writing
116the special value -1 to the unload_heads attribute::
117
118 # echo -1 > /sys/block/sda/device/unload_heads
119
120will enable the feature for /dev/sda, and giving -2 instead of -1 will
121disable it again.
122
123
1243. References
125-------------
126
127There are several laptops from different vendors featuring shock
128protection capabilities. As manufacturers have refused to support open
129source development of the required software components so far, Linux
130support for shock protection varies considerably between different
131hardware implementations. Ideally, this section should contain a list
132of pointers at different projects aiming at an implementation of shock
133protection on different systems. Unfortunately, I only know of a
134single project which, although still considered experimental, is fit
135for use. Please feel free to add projects that have been the victims
136of my ignorance.
137
138- http://www.thinkwiki.org/wiki/HDAPS
139
140 See this page for information about Linux support of the hard disk
141 active protection system as implemented in IBM/Lenovo Thinkpads.
142
143
1444. CREDITS
145----------
146
147This implementation of disk head parking has been inspired by a patch
148originally published by Jon Escombe <lists@dresco.co.uk>. My efforts
149to develop an implementation of this feature that is fit to be merged
150into mainline have been aided by various kernel developers, in
151particular by Tejun Heo and Bartlomiej Zolnierkiewicz.
diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst
new file mode 100644
index 000000000000..cd9a1c2695fd
--- /dev/null
+++ b/Documentation/admin-guide/laptops/index.rst
@@ -0,0 +1,17 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3==============
4Laptop Drivers
5==============
6
7.. toctree::
8 :maxdepth: 1
9
10 asus-laptop
11 disk-shock-protection
12 laptop-mode
13 lg-laptop
14 sony-laptop
15 sonypi
16 thinkpad-acpi
17 toshiba_haps
diff --git a/Documentation/admin-guide/laptops/laptop-mode.rst b/Documentation/admin-guide/laptops/laptop-mode.rst
new file mode 100644
index 000000000000..c984c4262f2e
--- /dev/null
+++ b/Documentation/admin-guide/laptops/laptop-mode.rst
@@ -0,0 +1,781 @@
1===============================================
2How to conserve battery power using laptop-mode
3===============================================
4
5Document Author: Bart Samwel (bart@samwel.tk)
6
7Date created: January 2, 2004
8
9Last modified: December 06, 2004
10
11Introduction
12------------
13
14Laptop mode is used to minimize the time that the hard disk needs to be spun up,
15to conserve battery power on laptops. It has been reported to cause significant
16power savings.
17
18.. Contents
19
20 * Introduction
21 * Installation
22 * Caveats
23 * The Details
24 * Tips & Tricks
25 * Control script
26 * ACPI integration
27 * Monitoring tool
28
29
30Installation
31------------
32
33To use laptop mode, you don't need to set any kernel configuration options
34or anything. Simply install all the files included in this document, and
35laptop mode will automatically be started when you're on battery. For
36your convenience, a tarball containing an installer can be downloaded at:
37
38 http://www.samwel.tk/laptop_mode/laptop_mode/
39
40To configure laptop mode, you need to edit the configuration file, which is
41located in /etc/default/laptop-mode on Debian-based systems, or in
42/etc/sysconfig/laptop-mode on other systems.
43
44Unfortunately, automatic enabling of laptop mode does not work for
45laptops that don't have ACPI. On those laptops, you need to start laptop
46mode manually. To start laptop mode, run "laptop_mode start", and to
47stop it, run "laptop_mode stop". (Note: The laptop mode tools package now
48has experimental support for APM; you might want to try that first.)
49
50
51Caveats
52-------
53
54* The downside of laptop mode is that you have a chance of losing up to 10
55 minutes of work. If you cannot afford this, don't use it! The supplied ACPI
56 scripts automatically turn off laptop mode when the battery almost runs out,
57 so that you won't lose any data at the end of your battery life.
58
59* Most desktop hard drives have a very limited lifetime measured in spindown
60 cycles, typically about 50,000 times (it's usually listed on the spec sheet).
61 Check your drive's rating, and don't wear down your drive's lifetime if you
62 don't need to.
63
64* If you mount some of your ext3/reiserfs filesystems with the -n option, then
65 the control script will not be able to remount them correctly. You must set
66 DO_REMOUNTS=0 in the control script, otherwise it will remount them with the
67 wrong options -- or it will fail because it cannot write to /etc/mtab.
68
69* If you have your filesystems listed as type "auto" in fstab, like I did, then
70 the control script will not recognize them as filesystems that need remounting.
71 You must list the filesystems with their true type instead.
72
73* It has been reported that some versions of the mutt mail client use file access
74 times to determine whether a folder contains new mail. If you use mutt and
75 experience this, you must disable the noatime remounting by setting the option
76 DO_REMOUNT_NOATIME to 0 in the configuration file.
77
78
79The Details
80-----------
81
82Laptop mode is controlled by the knob /proc/sys/vm/laptop_mode. This knob is
83present for all kernels that have the laptop mode patch, regardless of any
84configuration options. When the knob is set, any physical disk I/O (that might
85have caused the hard disk to spin up) causes Linux to flush all dirty blocks. The
86result of this is that after a disk has spun down, it will not be spun up
87anymore to write dirty blocks, because those blocks had already been written
88immediately after the most recent read operation. The value of the laptop_mode
89knob determines the time between the occurrence of disk I/O and when the flush
90is triggered. A sensible value for the knob is 5 seconds. Setting the knob to
910 disables laptop mode.
92
93To increase the effectiveness of the laptop_mode strategy, the laptop_mode
94control script increases dirty_expire_centisecs and dirty_writeback_centisecs in
95/proc/sys/vm to about 10 minutes (by default), which means that pages that are
96dirtied are not forced to be written to disk as often. The control script also
97changes the dirty background ratio, so that background writeback of dirty pages
98is not done anymore. Combined with a higher commit value (also 10 minutes) for
99ext3 or ReiserFS filesystems (also done automatically by the control script),
100this results in concentration of disk activity in a small time interval which
101occurs only once every 10 minutes, or whenever the disk is forced to spin up by
102a cache miss. The disk can then be spun down in the periods of inactivity.
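
The two writeback knobs are expressed in centiseconds, so a limit of 10 minutes (600 seconds) corresponds to 60000. The sketch below shows that conversion; the helper names are hypothetical, and writing the same value to both knobs is an assumption made for illustration rather than a description of the control script.

```shell
#!/bin/sh
# secs_to_centisecs SECONDS
# Convert seconds to centiseconds, the unit used by the dirty_* knobs.
# Both function names here are hypothetical.
secs_to_centisecs() {
    case $1 in
        ''|*[!0-9]*)
            echo "seconds must be a non-negative integer" >&2
            return 1 ;;
    esac
    echo $(($1 * 100))
}

# dirty_settings SECONDS
# Print "<knob> <value>" pairs for the two writeback knobs.
# Assumption: both knobs receive the same value.
dirty_settings() {
    age=$(secs_to_centisecs "$1") || return 1
    printf '/proc/sys/vm/dirty_writeback_centisecs %s\n' "$age"
    printf '/proc/sys/vm/dirty_expire_centisecs %s\n' "$age"
}

# Example (as root): apply a 10-minute limit.
#   dirty_settings 600 | while read -r knob value; do
#       echo "$value" > "$knob"
#   done
```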
103
104If you want to find out which process caused the disk to spin up, you can
105gather information by setting the flag /proc/sys/vm/block_dump. When this flag
106is set, Linux reports all disk read and write operations that take place, and
107all block dirtyings done to files. This makes it possible to debug why a disk
108needs to spin up, and to increase battery life even more. The output of
109block_dump is written to the kernel output, and it can be retrieved using
110"dmesg". When you use block_dump and your kernel logging level also includes
111kernel debugging messages, you probably want to turn off klogd, otherwise
112the output of block_dump will be logged, causing disk activity that is not
113normally there.
114
115
116Configuration
117-------------
118
119The laptop mode configuration file is located in /etc/default/laptop-mode on
120Debian-based systems, or in /etc/sysconfig/laptop-mode on other systems. It
121contains the following options:
122
123MAX_AGE:
124
125Maximum time, in seconds, of hard drive spindown time that you are
126comfortable with. Worst case, it's possible that you could lose this
127amount of work if your battery fails while you're in laptop mode.
128
129MINIMUM_BATTERY_MINUTES:
130
131Automatically disable laptop mode if the remaining number of minutes of
132battery power is less than this value. Default is 10 minutes.
133
134AC_HD/BATT_HD:
135
136The idle timeout that should be set on your hard drive when laptop mode
137is active (BATT_HD) and when it is not active (AC_HD). The defaults are
13820 seconds (value 4) for BATT_HD and 2 hours (value 244) for AC_HD. The
139possible values are those listed in the manual page for "hdparm" for the
140"-S" option.
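
The "-S" encoding is not linear, which makes values such as 244 hard to read at a glance. The sketch below decodes a value into seconds using the ranges given in the hdparm manual page; the function name is hypothetical, and your hdparm version's manual page remains the authority.

```shell
#!/bin/sh
# hdparm_S_to_seconds VALUE
# Decode an "hdparm -S" value into a timeout in seconds:
#   0        timeout disabled (printed as 0)
#   1..240   multiples of 5 seconds
#   241..251 1 to 11 units of 30 minutes
#   252      21 minutes
#   255      21 minutes 15 seconds
# 253 (vendor-defined) and 254 (reserved) are rejected.
# hdparm_S_to_seconds is a hypothetical helper name.
hdparm_S_to_seconds() {
    v=$1
    case $v in
        ''|*[!0-9]*) echo "value must be 0-255" >&2; return 1 ;;
    esac
    if [ "$v" -eq 0 ]; then
        echo 0
    elif [ "$v" -le 240 ]; then
        echo $((v * 5))
    elif [ "$v" -le 251 ]; then
        echo $(((v - 240) * 1800))
    elif [ "$v" -eq 252 ]; then
        echo 1260
    elif [ "$v" -eq 255 ]; then
        echo 1275
    else
        echo "value $v has no fixed meaning" >&2
        return 1
    fi
}
```

With the defaults above, BATT_HD=4 decodes to 20 seconds and AC_HD=244 to 7200 seconds (2 hours).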
141
142HD:
143
144The devices for which the spindown timeout should be adjusted by laptop mode.
145Default is /dev/hda. If you specify multiple devices, separate them by a space.
146
147READAHEAD:
148
149Disk readahead, in 512-byte sectors, while laptop mode is active. A large
150readahead can prevent disk accesses for things like executable pages (which are
151loaded on demand while the application executes) and sequentially accessed data
152(MP3s).
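
Because READAHEAD is given in 512-byte sectors, a size in megabytes has to be multiplied by 2048 (1048576 / 512) to get the setting. A small sketch, with a hypothetical helper name:

```shell
#!/bin/sh
# mb_to_sectors MEGABYTES
# Convert a readahead size in megabytes to 512-byte sectors.
# mb_to_sectors is a hypothetical helper name.
mb_to_sectors() {
    case $1 in
        ''|*[!0-9]*)
            echo "size must be a non-negative integer" >&2
            return 1 ;;
    esac
    echo $(($1 * 1024 * 1024 / 512))
}

# Example: a READAHEAD= line for 8 MB of readahead.
#   echo "READAHEAD=$(mb_to_sectors 8)"
```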
153
154DO_REMOUNTS:
155
156The control script automatically remounts any mounted journaled filesystems
157with appropriate commit interval options. When this option is set to 0, this
158feature is disabled.
159
160DO_REMOUNT_NOATIME:
161
162When remounting, should the filesystems be remounted with the noatime option?
163Normally, this is set to "1" (enabled), but there may be programs that require
164access time recording.
165
166DIRTY_RATIO:
167
168The percentage of memory that is allowed to contain "dirty" or unsaved data
169before a writeback is forced, while laptop mode is active. Corresponds to
170the /proc/sys/vm/dirty_ratio sysctl.
171
172DIRTY_BACKGROUND_RATIO:
173
174The percentage of memory that is allowed to contain "dirty" or unsaved data
175after a forced writeback is done due to an exceeding of DIRTY_RATIO. Set
176this nice and low. This corresponds to the /proc/sys/vm/dirty_background_ratio
177sysctl.
178
179Note that the behaviour of dirty_background_ratio is quite different
180when laptop mode is active and when it isn't. When laptop mode is inactive,
181dirty_background_ratio is the threshold percentage at which background writeouts
182start taking place. When laptop mode is active, however, background writeouts
183are disabled, and the dirty_background_ratio only determines how much writeback
184is done when dirty_ratio is reached.
185
186DO_CPU:
187
188Enable CPU frequency scaling when in laptop mode. (Requires CPUFreq to be set up.
189See Documentation/admin-guide/pm/cpufreq.rst for more info. Disabled by default.)
190
191CPU_MAXFREQ:
192
193When on battery, what is the maximum CPU speed that the system should use? Legal
194values are "slowest" for the slowest speed that your CPU is able to operate at,
195or a value listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.
196
197
198Tips & Tricks
199-------------
200
201* Bartek Kania reports getting up to 50 minutes of extra battery life (on top
202 of his regular 3 to 3.5 hours) using a spindown time of 5 seconds (BATT_HD=1).
203
204* You can spin down the disk while playing MP3, by setting disk readahead
205 to 8MB (READAHEAD=16384). Effectively, the disk will read a complete MP3 at
206 once, and will then spin down while the MP3 is playing. (Thanks to Bartek
207 Kania.)
208
209* Drew Scott Daniels observed: "I don't know why, but when I decrease the number
210 of colours that my display uses it consumes less battery power. I've seen
211 this on powerbooks too. I hope that this is a piece of information that
212 might be useful to the Laptop Mode patch or its users."
213
214* In syslog.conf, you can prefix entries with a dash `-` to omit syncing the
215 file after every logging. When you're using laptop-mode and your disk doesn't
216 spin down, this is a likely culprit.
217
218* Richard Atterer observed that laptop mode does not work well with noflushd
219 (http://noflushd.sourceforge.net/); it seems that noflushd prevents laptop-mode
220 from doing its thing.
221
222* If you're worried about your data, you might want to consider using a USB
223 memory stick or something like that as a "working area". (Be aware though
224 that flash memory can only handle a limited number of writes, and overuse
225 may wear out your memory stick pretty quickly. Do _not_ use journalling
226 filesystems on flash memory sticks.)
227
228
229Configuration file for control and ACPI battery scripts
230-------------------------------------------------------
231
232This allows the tunables to be changed for the scripts via an external
233configuration file.
234
235It should be installed as /etc/default/laptop-mode on Debian, and as
236/etc/sysconfig/laptop-mode on Red Hat, SUSE, Mandrake, and other work-alikes.
237
238Config file::
239
240 # Maximum time, in seconds, of hard drive spindown time that you are
241 # comfortable with. Worst case, it's possible that you could lose this
242 # amount of work if your battery fails you while in laptop mode.
243 #MAX_AGE=600
244
245 # Automatically disable laptop mode when the number of minutes of battery
246 # that you have left goes below this threshold.
247 MINIMUM_BATTERY_MINUTES=10
248
249 # Read-ahead, in 512-byte sectors. You can spin down the disk while playing MP3/OGG
250 # by setting the disk readahead to 8MB (READAHEAD=16384). Effectively, the disk
251 # will read a complete MP3 at once, and will then spin down while the MP3/OGG is
252 # playing.
253 #READAHEAD=4096
254
255 # Shall we remount journaled fs. with appropriate commit interval? (1=yes)
256 #DO_REMOUNTS=1
257
258 # And shall we add the "noatime" option to that as well? (1=yes)
259 #DO_REMOUNT_NOATIME=1
260
261 # Dirty synchronous ratio. At this percentage of dirty pages the process
262 # which
263 # calls write() does its own writeback
264 #DIRTY_RATIO=40
265
266 #
267 # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
268 # exceeded, the kernel will wake flusher threads which will then reduce the
269 # amount of dirty memory to dirty_background_ratio. Set this nice and low,
270 # so once some writeout has commenced, we do a lot of it.
271 #
272 #DIRTY_BACKGROUND_RATIO=5
273
274 # kernel default dirty buffer age
275 #DEF_AGE=30
276 #DEF_UPDATE=5
277 #DEF_DIRTY_BACKGROUND_RATIO=10
278 #DEF_DIRTY_RATIO=40
279 #DEF_XFS_AGE_BUFFER=15
280 #DEF_XFS_SYNC_INTERVAL=30
281 #DEF_XFS_BUFD_INTERVAL=1
282
283 # This must be adjusted manually to the value of HZ in the running kernel
284 # on 2.4, until the XFS people change their 2.4 external interfaces to work in
285 # centisecs. This can be automated, but it's a work in progress that still
286 # needs some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for
287 # external interfaces, and that is currently always set to 100. So you don't
288 # need to change this on 2.6.
289 #XFS_HZ=100
290
291 # Should the maximum CPU frequency be adjusted down while on battery?
292 # Requires CPUFreq to be set up.
293 # See Documentation/admin-guide/pm/cpufreq.rst for more info
294 #DO_CPU=0
295
296 # When on battery what is the maximum CPU speed that the system should
297 # use? Legal values are "slowest" for the slowest speed that your
298 # CPU is able to operate at, or a value listed in:
299 # /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
300 # Only applicable if DO_CPU=1.
301 #CPU_MAXFREQ=slowest
302
303 # Idle timeout for your hard drive (man hdparm for valid values, -S option)
304 # Default is 2 hours on AC (AC_HD=244) and 20 seconds for battery (BATT_HD=4).
305 #AC_HD=244
306 #BATT_HD=4
307
308 # The drives for which to adjust the idle timeout. Separate them by a space,
309 # e.g. HD="/dev/hda /dev/hdb".
310 #HD="/dev/hda"
311
312 # Set the spindown timeout on a hard drive?
313 #DO_HD=1
314
315
316Control script
317--------------
318
319Please note that this control script works for the Linux 2.4 and 2.6 series (thanks
320to Kiko Piris).
321
322Control script::
323
324 #!/bin/bash
325
326 # start or stop laptop_mode, best run by a power management daemon when
327 # ac gets connected/disconnected from a laptop
328 #
329 # install as /sbin/laptop_mode
330 #
331 # Contributors to this script: Kiko Piris
332 # Bart Samwel
333 # Micha Feigin
334 # Andrew Morton
335 # Herve Eychenne
336 # Dax Kelson
337 #
338 # Original Linux 2.4 version by: Jens Axboe
339
340 #############################################################################
341
342 # Source config
343 if [ -f /etc/default/laptop-mode ] ; then
344 # Debian
345 . /etc/default/laptop-mode
346 elif [ -f /etc/sysconfig/laptop-mode ] ; then
347 # Others
348 . /etc/sysconfig/laptop-mode
349 fi
350
351 # Don't raise an error if the config file is incomplete
352 # set defaults instead:
353
354 # Maximum time, in seconds, of hard drive spindown time that you are
355 # comfortable with. Worst case, it's possible that you could lose this
356 # amount of work if your battery fails you while in laptop mode.
357 MAX_AGE=${MAX_AGE:-'600'}
358
359 # Read-ahead, in kilobytes
360 READAHEAD=${READAHEAD:-'4096'}
361
362 # Shall we remount journaled fs. with appropriate commit interval? (1=yes)
363 DO_REMOUNTS=${DO_REMOUNTS:-'1'}
364
365 # And shall we add the "noatime" option to that as well? (1=yes)
366 DO_REMOUNT_NOATIME=${DO_REMOUNT_NOATIME:-'1'}
367
368 # Shall we adjust the idle timeout on a hard drive?
369 DO_HD=${DO_HD:-'1'}
370
371 # Adjust idle timeout on which hard drive?
372 HD="${HD:-/dev/hda}"
373
374 # spindown time for HD (hdparm -S values)
375 AC_HD=${AC_HD:-'244'}
376 BATT_HD=${BATT_HD:-'4'}
377
378 # Dirty synchronous ratio. At this percentage of dirty pages the process which
379 # calls write() does its own writeback
380 DIRTY_RATIO=${DIRTY_RATIO:-'40'}
381
382 # cpu frequency scaling
383 # See Documentation/admin-guide/pm/cpufreq.rst for more info
384 DO_CPU=${DO_CPU:-'0'}
385 CPU_MAXFREQ=${CPU_MAXFREQ:-'slowest'}
386
387 #
388 # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
389 # exceeded, the kernel will wake flusher threads which will then reduce the
390 # amount of dirty memory to dirty_background_ratio. Set this nice and low,
391 # so once some writeout has commenced, we do a lot of it.
392 #
393 DIRTY_BACKGROUND_RATIO=${DIRTY_BACKGROUND_RATIO:-'5'}
394
395 # kernel default dirty buffer age
396 DEF_AGE=${DEF_AGE:-'30'}
397 DEF_UPDATE=${DEF_UPDATE:-'5'}
398 DEF_DIRTY_BACKGROUND_RATIO=${DEF_DIRTY_BACKGROUND_RATIO:-'10'}
399 DEF_DIRTY_RATIO=${DEF_DIRTY_RATIO:-'40'}
400 DEF_XFS_AGE_BUFFER=${DEF_XFS_AGE_BUFFER:-'15'}
401 DEF_XFS_SYNC_INTERVAL=${DEF_XFS_SYNC_INTERVAL:-'30'}
402 DEF_XFS_BUFD_INTERVAL=${DEF_XFS_BUFD_INTERVAL:-'1'}
403
404 # This must be adjusted manually to the value of HZ in the running kernel
405 # on 2.4, until the XFS people change their 2.4 external interfaces to work in
406 # centisecs. This can be automated, but it's a work in progress that still needs
407 # some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for external
408 # interfaces, and that is currently always set to 100. So you don't need to
409 # change this on 2.6.
410 XFS_HZ=${XFS_HZ:-'100'}
411
412 #############################################################################
413
414 KLEVEL="$(uname -r |
415 {
416 IFS='.' read a b c
417 echo $a.$b
418 }
419 )"
420 case "$KLEVEL" in
421 "2.4"|"2.6")
422 ;;
423 *)
424 echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')" >&2
425 exit 1
426 ;;
427 esac
428
429 if [ ! -e /proc/sys/vm/laptop_mode ] ; then
430 echo "Kernel is not patched with laptop_mode patch." >&2
431 exit 1
432 fi
433
434 if [ ! -w /proc/sys/vm/laptop_mode ] ; then
435 echo "You do not have enough privileges to enable laptop_mode." >&2
436 exit 1
437 fi
438
439 # Remove an option (the first parameter) of the form option=<number> from
440 # a mount options string (the rest of the parameters).
441 parse_mount_opts () {
442 OPT="$1"
443 shift
444 echo ",$*," | sed \
445 -e 's/,'"$OPT"'=[0-9]*,/,/g' \
446 -e 's/,,*/,/g' \
447 -e 's/^,//' \
448 -e 's/,$//'
449 }
450
451 # Remove an option (the first parameter) without any arguments from
452 # a mount option string (the rest of the parameters).
453 parse_nonumber_mount_opts () {
454 OPT="$1"
455 shift
456 echo ",$*," | sed \
457 -e 's/,'"$OPT"',/,/g' \
458 -e 's/,,*/,/g' \
459 -e 's/^,//' \
460 -e 's/,$//'
461 }
462
463 # Find out the state of a yes/no option (e.g. "atime"/"noatime") in
464 # fstab for a given filesystem, and use this state to replace the
465 # value of the option in another mount options string. The device
466 # is the first argument, the option name the second, and the default
467 # value the third. The remainder is the mount options string.
468 #
469 # Example:
470 # parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime
471 #
472 # If fstab contains, say, "rw" for this filesystem, then the result
473 # will be "defaults,atime".
474 parse_yesno_opts_wfstab () {
475 L_DEV="$1"
476 OPT="$2"
477 DEF_OPT="$3"
478 shift 3
479 L_OPTS="$*"
480 PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)"
481 PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)"
482 # Watch for a default atime in fstab
483 FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
484 if echo "$FSTAB_OPTS" | grep "$OPT" > /dev/null ; then
485 # option specified in fstab: extract the value and use it
486 if echo "$FSTAB_OPTS" | grep "no$OPT" > /dev/null ; then
487 echo "$PARSEDOPTS1,no$OPT"
488 else
489 # no$OPT not found -- so we must have $OPT.
490 echo "$PARSEDOPTS1,$OPT"
491 fi
492 else
493 # option not specified in fstab -- choose the default.
494 echo "$PARSEDOPTS1,$DEF_OPT"
495 fi
496 }
497
498 # Find out the state of a numbered option (e.g. "commit=NNN") in
499 # fstab for a given filesystem, and use this state to replace the
500 # value of the option in another mount options string. The device
501 # is the first argument, and the option name the second. The
502 # remainder is the mount options string in which the replacement
503 # must be done.
504 #
505 # Example:
506 # parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7
507 #
508 # If fstab contains, say, "commit=3,rw" for this filesystem, then the
509 # result will be "rw,commit=3".
510 parse_mount_opts_wfstab () {
511 L_DEV="$1"
512 OPT="$2"
513 shift 2
514 L_OPTS="$*"
515 PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)"
516 # Watch for a default commit in fstab
517 FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
518 if echo "$FSTAB_OPTS" | grep "$OPT=" > /dev/null ; then
519 # option specified in fstab: extract the value, and use it
520 echo -n "$PARSEDOPTS1,$OPT="
521 echo ",$FSTAB_OPTS," | sed \
522 -e 's/.*,'"$OPT"'=//' \
523 -e 's/,.*//'
524 else
525 # option not specified in fstab: set it to 0
526 echo "$PARSEDOPTS1,$OPT=0"
527 fi
528 }
529
530 deduce_fstype () {
531 MP="$1"
532 # My root filesystem unfortunately has
533 # type "unknown" in /etc/mtab. If we encounter
534 # "unknown", we try to get the type from fstab.
535 cat /etc/fstab |
536 grep -v '^#' |
537 while read FSTAB_DEV FSTAB_MP FSTAB_FST FSTAB_OPTS FSTAB_DUMP FSTAB_PASS ; do
538 if [ "$FSTAB_MP" = "$MP" ]; then
539 echo $FSTAB_FST
540 exit 0
541 fi
542 done
543 }
544
545 if [ $DO_REMOUNT_NOATIME -eq 1 ] ; then
546 NOATIME_OPT=",noatime"
547 fi
548
549 case "$1" in
550 start)
551 AGE=$((100*$MAX_AGE))
552 XFS_AGE=$(($XFS_HZ*$MAX_AGE))
553 echo -n "Starting laptop_mode"
554
555 if [ -d /proc/sys/vm/pagebuf ] ; then
556 # (For 2.4 and early 2.6.)
557 # This only needs to be set, not reset -- it is only used when
558 # laptop mode is enabled.
559 echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age
560 echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
561 elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
562 # (A couple of early 2.6 laptop mode patches had these.)
563 # The same goes for these.
564 echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer
565 echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
566 elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then
567 # (2.6.6)
568 # But not for these -- they are also used in normal
569 # operation.
570 echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer
571 echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval
572 elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
573 # (2.6.7 upwards)
574 # And not for these either. These are in centisecs,
575 # not USER_HZ, so we have to use $AGE, not $XFS_AGE.
576 echo $AGE > /proc/sys/fs/xfs/age_buffer_centisecs
577 echo $AGE > /proc/sys/fs/xfs/xfssyncd_centisecs
578 echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs
579 fi
580
581 case "$KLEVEL" in
582 "2.4")
583 echo 1 > /proc/sys/vm/laptop_mode
584 echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush
585 ;;
586 "2.6")
587 echo 5 > /proc/sys/vm/laptop_mode
588 echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs
589 echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs
590 echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
591 echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
592 ;;
593 esac
594 if [ $DO_REMOUNTS -eq 1 ]; then
595 cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
596 PARSEDOPTS="$(parse_mount_opts "$OPTS")"
597 if [ "$FST" = 'unknown' ]; then
598 FST=$(deduce_fstype $MP)
599 fi
600 case "$FST" in
601 "ext3"|"reiserfs")
602 PARSEDOPTS="$(parse_mount_opts commit "$OPTS")"
603 mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE$NOATIME_OPT
604 ;;
605 "xfs")
606 mount $DEV -t $FST $MP -o remount,$OPTS$NOATIME_OPT
607 ;;
608 esac
609 if [ -b $DEV ] ; then
610 blockdev --setra $(($READAHEAD * 2)) $DEV
611 fi
612 done
613 fi
614 if [ $DO_HD -eq 1 ] ; then
615 for THISHD in $HD ; do
616 /sbin/hdparm -S $BATT_HD $THISHD > /dev/null 2>&1
617 /sbin/hdparm -B 1 $THISHD > /dev/null 2>&1
618 done
619 fi
620 if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
621 if [ $CPU_MAXFREQ = 'slowest' ]; then
622 CPU_MAXFREQ=`cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq`
623 fi
624 echo $CPU_MAXFREQ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
625 fi
626 echo "."
627 ;;
628 stop)
629 U_AGE=$((100*$DEF_UPDATE))
630 B_AGE=$((100*$DEF_AGE))
631 echo -n "Stopping laptop_mode"
632 echo 0 > /proc/sys/vm/laptop_mode
633 if [ -f /proc/sys/fs/xfs/age_buffer -a ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
634 # These need to be restored, if there are no lm_*.
635 echo $(($XFS_HZ*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer
636 echo $(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/sync_interval
637 elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
638 # These need to be restored as well.
639 echo $((100*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer_centisecs
640 echo $((100*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/xfssyncd_centisecs
641 echo $((100*$DEF_XFS_BUFD_INTERVAL)) > /proc/sys/fs/xfs/xfsbufd_centisecs
642 fi
643 case "$KLEVEL" in
644 "2.4")
645 echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush
646 ;;
647 "2.6")
648 echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs
649 echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs
650 echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
651 echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
652 ;;
653 esac
654 if [ $DO_REMOUNTS -eq 1 ] ; then
655 cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
656 # Reset commit and atime options to defaults.
657 if [ "$FST" = 'unknown' ]; then
658 FST=$(deduce_fstype $MP)
659 fi
660 case "$FST" in
661 "ext3"|"reiserfs")
662 PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)"
663 PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)"
664 mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
665 ;;
666 "xfs")
667 PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)"
668 mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
669 ;;
670 esac
671 if [ -b $DEV ] ; then
672 blockdev --setra 256 $DEV
673 fi
674 done
675 fi
676 if [ $DO_HD -eq 1 ] ; then
677 for THISHD in $HD ; do
678 /sbin/hdparm -S $AC_HD $THISHD > /dev/null 2>&1
679 /sbin/hdparm -B 255 $THISHD > /dev/null 2>&1
680 done
681 fi
682 if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
683 echo `cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq` > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
684 fi
685 echo "."
686 ;;
687 *)
688 echo "Usage: $0 {start|stop}" >&2
689 exit 1
690 ;;
691
692 esac
693
694 exit 0
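
To see what the option-stripping helper in the control script does, it can be run on its own. The following is a minimal sketch: ``parse_mount_opts`` is copied from the script above, while the sample options strings are made up for illustration and do not come from a real /etc/mtab.

```shell
#!/bin/bash
# Copy of parse_mount_opts from the control script: remove an option of
# the form option=<number> from a comma-separated mount options string.
parse_mount_opts () {
    OPT="$1"
    shift
    echo ",$*," | sed \
        -e 's/,'"$OPT"'=[0-9]*,/,/g' \
        -e 's/,,*/,/g' \
        -e 's/^,//' \
        -e 's/,$//'
}

# Assumed sample options string, for illustration only.
parse_mount_opts commit "rw,commit=7,errors=remount-ro"
# prints: rw,errors=remount-ro

# How the "start" branch builds its remount options from the result.
MAX_AGE=600
echo "remount,$(parse_mount_opts commit "rw,commit=7"),commit=$MAX_AGE,noatime"
# prints: remount,rw,commit=600,noatime
```

Wrapping the string in extra commas before running sed is what lets a single pattern match the option whether it appears first, last, or in the middle.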
695
696
697ACPI integration
698----------------
699
700Dax Kelson submitted this so that the ACPI acpid daemon will
701kick off the laptop_mode script and run hdparm. The part that
702automatically disables laptop mode when the battery is low was
703written by Jan Topinski.
704
705/etc/acpi/events/ac_adapter::
706
707 event=ac_adapter
708 action=/etc/acpi/actions/ac.sh %e
709
710/etc/acpi/events/battery::
711
712 event=battery.*
713 action=/etc/acpi/actions/battery.sh %e
714
715/etc/acpi/actions/ac.sh::
716
717 #!/bin/bash
718
719 # ac on/offline event handler
720
721 status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/$2/state`
722
723 case $status in
724 "on-line")
725 /sbin/laptop_mode stop
726 exit 0
727 ;;
728 "off-line")
729 /sbin/laptop_mode start
730 exit 0
731 ;;
732 esac
733
734
735/etc/acpi/actions/battery.sh::
736
737 #! /bin/bash
738
739 # Automatically disable laptop mode when the battery almost runs out.
740
741 BATT_INFO=/proc/acpi/battery/$2/state
742
743 if [[ -f /proc/sys/vm/laptop_mode ]]
744 then
745 LM=`cat /proc/sys/vm/laptop_mode`
746 if [[ $LM -gt 0 ]]
747 then
748 if [[ -f $BATT_INFO ]]
749 then
750 # Source the config file only now that we know we need
751 if [ -f /etc/default/laptop-mode ] ; then
752 # Debian
753 . /etc/default/laptop-mode
754 elif [ -f /etc/sysconfig/laptop-mode ] ; then
755 # Others
756 . /etc/sysconfig/laptop-mode
757 fi
758 MINIMUM_BATTERY_MINUTES=${MINIMUM_BATTERY_MINUTES:-'10'}
759
760 ACTION="`cat $BATT_INFO | grep charging | cut -c 26-`"
761 if [[ "$ACTION" = "discharging" ]]
762 then
763 PRESENT_RATE=`cat $BATT_INFO | grep "present rate:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
764 REMAINING=`cat $BATT_INFO | grep "remaining capacity:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
765 if (($REMAINING * 60 / $PRESENT_RATE < $MINIMUM_BATTERY_MINUTES))
766 then
767 /sbin/laptop_mode stop
768 fi
769 fi
770 else
771 logger -p daemon.warning "You are using laptop mode and your battery interface $BATT_INFO is missing. This may lead to loss of data when the battery runs out. Check kernel ACPI support and /proc/acpi/battery folder, and edit /etc/acpi/actions/battery.sh to set BATT_INFO to the correct path."
772 fi
773 fi
774 fi
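
The low-battery check in battery.sh is integer arithmetic: remaining capacity times 60, divided by the present discharge rate, gives an estimate of the minutes left. A small sketch of that calculation follows; ``battery_minutes_left`` is a hypothetical helper, the sample numbers are invented, and the units are assumed to be mWh and mW (or mAh and mA) as reported by the battery state file.

```shell
#!/bin/bash
# Hypothetical helper mirroring the arithmetic used in battery.sh.
# $1: remaining capacity (assumed mWh), $2: present discharge rate (assumed mW)
battery_minutes_left () {
    local remaining="$1" rate="$2"
    if [ "$rate" -le 0 ]; then
        return 1        # rate unknown or zero: no estimate possible
    fi
    echo $(( remaining * 60 / rate ))
}

# Assumed sample values, not taken from real hardware.
battery_minutes_left 2000 15000   # prints 8
```

With the default MINIMUM_BATTERY_MINUTES of 10, an estimate of 8 minutes would cause the script to run ``/sbin/laptop_mode stop``. The guard against a zero rate is an addition in this sketch.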
775
776
777Monitoring tool
778---------------
779
780Bartek Kania submitted this tool, which can be used to measure how much time your disk
781spends spun up/down. See tools/laptop/dslm/dslm.c.
diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst
new file mode 100644
index 000000000000..ce9b14671cb9
--- /dev/null
+++ b/Documentation/admin-guide/laptops/lg-laptop.rst
@@ -0,0 +1,84 @@
1.. SPDX-License-Identifier: GPL-2.0+
2
3
4LG Gram laptop extra features
5=============================
6
7By Matan Ziv-Av <matan@svgalib.org>
8
9
10Hotkeys
11-------
12
13The following FN keys are ignored by the kernel without this driver:
14
15- FN-F1 (LG control panel) - Generates F15
16- FN-F5 (Touchpad toggle) - Generates F13
17- FN-F6 (Airplane mode) - Generates RFKILL
18- FN-F8 (Keyboard backlight) - Generates F16.
19 This key also changes keyboard backlight mode.
20- FN-F9 (Reader mode) - Generates F14
21
22The rest of the FN keys work without a need for a special driver.
23
24
25Reader mode
26-----------
27
28Writing 0/1 to /sys/devices/platform/lg-laptop/reader_mode disables/enables
29reader mode. In this mode the screen colors change (blue color reduced),
30and the reader mode indicator LED (on F9 key) turns on.
31
32
33FN Lock
34-------
35
36Writing 0/1 to /sys/devices/platform/lg-laptop/fn_lock disables/enables
37FN lock.
38
39
40Battery care limit
41------------------
42
43Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
44sets the maximum capacity to charge the battery. Limiting the charge
45reduces battery capacity loss over time.
46
47This value is reset to 100 when the kernel boots.
48
49
50Fan mode
51--------
52
53Writing 1/0 to /sys/devices/platform/lg-laptop/fan_mode disables/enables
54the fan silent mode.
55
56
57USB charge
58----------
59
60Writing 0/1 to /sys/devices/platform/lg-laptop/usb_charge disables/enables
61charging another device from the USB port while the device is turned off.
62
63This value is reset to 0 when the kernel boots.
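
Because battery_care_limit and usb_charge are reset at boot, settings like these are typically reapplied from a boot-time script. The following is only a sketch of how that might look: ``lg_set`` and the ``LG_SYSFS`` variable are hypothetical names introduced here, and the attribute names are the ones described in the sections above.

```shell
#!/bin/bash
# LG_SYSFS may be overridden for testing; the default is the sysfs
# directory named in the text above.
LG_SYSFS="${LG_SYSFS:-/sys/devices/platform/lg-laptop}"

# Hypothetical helper: write a value to one of the driver's attributes,
# failing cleanly if the attribute is missing or not writable.
lg_set () {
    local attr="$1" value="$2"
    if [ ! -w "$LG_SYSFS/$attr" ]; then
        echo "lg_set: $LG_SYSFS/$attr is not writable" >&2
        return 1
    fi
    echo "$value" > "$LG_SYSFS/$attr"
}

# Example use (as root), left commented out so the sketch has no side effects:
# lg_set battery_care_limit 80
# lg_set usb_charge 1
# lg_set reader_mode 1
```

Checking for a writable attribute first avoids confusing shell redirection errors on machines where the driver is not loaded.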
64
65
66LEDs
67~~~~
68
69There are two LED devices supported by the driver:
70
71Keyboard backlight
72------------------
73
74An LED device named kbd_led controls the keyboard backlight. There are three
75lighting levels: off (0), low (127) and high (255).
76
77The keyboard backlight is also controlled by the key combination FN-F8
78which cycles through those levels.
79
80
81Touchpad indicator LED
82----------------------
83
84On the F5 key. Controlled by the LED device named tpad_led.
diff --git a/Documentation/admin-guide/laptops/sony-laptop.rst b/Documentation/admin-guide/laptops/sony-laptop.rst
new file mode 100644
index 000000000000..9edcc7f6612f
--- /dev/null
+++ b/Documentation/admin-guide/laptops/sony-laptop.rst
@@ -0,0 +1,174 @@
1=========================================
2Sony Notebook Control Driver (SNC) Readme
3=========================================
4
5 - Copyright (C) 2004- 2005 Stelian Pop <stelian@popies.net>
6 - Copyright (C) 2007 Mattia Dongili <malattia@linux.it>
7
8This mini-driver drives the SNC and SPIC devices present in the ACPI BIOS of
9Sony Vaio laptops. It exposes the functions of both devices under the same
10(hopefully consistent) interface. This also means that the sonypi driver is
11obsoleted by sony-laptop now.
12
13Fn keys (hotkeys):
14------------------
15
16Some models report hotkeys through the SNC or SPIC devices; such events are
17reported both through the ACPI subsystem as acpi events and through the INPUT
18subsystem. See the logs of /proc/bus/input/devices to find out what those
19events are and which input devices are created by the driver.
20Additionally, loading the driver with the debug option will report all events
21in the kernel log.
22
23The "scancodes" passed to the input system (that can be remapped with udev)
24are indexes to the table "sony_laptop_input_keycode_map" in the sony-laptop.c
25module. For example the "FN/E" key combination (EJECTCD on some models)
26generates the scancode 20 (0x14).
27
28Backlight control:
29------------------
30If your laptop model supports it, you will find sysfs files in the
31/sys/class/backlight/sony/
32directory. You will be able to query and set the current screen
33brightness:
34
35 ====================== =========================================
36 brightness get/set screen brightness (an integer
37 between 0 and 7)
38 actual_brightness reading from this file will query the HW
39 to get real brightness value
40 max_brightness the maximum brightness value
41 ====================== =========================================
42
43
44Platform specific:
45------------------
46Loading the sony-laptop module will create a
47/sys/devices/platform/sony-laptop/
48directory populated with some files.
49
50You then read/write integer values from/to those files by using
51standard UNIX tools.
52
53The files are:
54
55 ====================== ==========================================
56 brightness_default screen brightness which will be set
57 when the laptop will be rebooted
58 cdpower power on/off the internal CD drive
59 audiopower power on/off the internal sound card
60 lanpower power on/off the internal ethernet card
61 (only in debug mode)
62 bluetoothpower power on/off the internal bluetooth device
63 fanspeed get/set the fan speed
64 ====================== ==========================================
65
66Note that some files may be missing if they are not supported
67by your particular laptop model.
68
69Example usage::
70
71 # echo "1" > /sys/devices/platform/sony-laptop/brightness_default
72
73sets the lowest screen brightness for the next and later reboots
74
75::
76
77 # echo "8" > /sys/devices/platform/sony-laptop/brightness_default
78
79sets the highest screen brightness for the next and later reboots
80
81::
82
83 # cat /sys/devices/platform/sony-laptop/brightness_default
84
85retrieves the value
86
87::
88
89 # echo "0" > /sys/devices/platform/sony-laptop/audiopower
90
91powers off the sound card
92
93::
94
95 # echo "1" > /sys/devices/platform/sony-laptop/audiopower
96
97powers on the sound card.
98
99
100RFkill control:
101---------------
102More recent Vaio models expose a consistent set of ACPI methods to
103control radio frequency emitting devices. If you are a lucky owner of
104such a laptop you will find the necessary rfkill devices under
105/sys/class/rfkill. Check those starting with sony-* in::
106
107 # grep . /sys/class/rfkill/*/{state,name}
108
109
110Development:
111------------
112
113If you want to help with the development of this driver (and
114you are not afraid of any side effects doing strange things with
115your ACPI BIOS could have on your laptop), load the driver and
116pass the option 'debug=1'.
117
118REPEAT:
119 **DON'T DO THIS IF YOU DON'T LIKE RISKY BUSINESS.**
120
121In your kernel logs you will find the list of all ACPI methods
122the SNC device has on your laptop.
123
124* For new models you will see a long list of meaningless method names;
125 reading the DSDT table source should reveal that:
126
127(1) the SNC device uses an internal capability lookup table
128(2) SN00 is used to find values in the lookup table
129(3) SN06 and SN07 are used to call into the real methods based on
130 offsets you can obtain iterating the table using SN00
131(4) SN02 used to enable events.
132
133Some values in the capability lookup table are more or less known, see
134the code for all sony_call_snc_handle calls, others are more obscure.
135
136* For old models you can see the GCDP/SCDP methods used to power on/off
137 the CD drive, but there are others and they are usually different from
138 model to model.
139
140**I HAVE NO IDEA WHAT THOSE METHODS DO.**
141
142The sony-laptop driver creates, for some of those methods (the most
143current ones found on several Vaio models), an entry under
144/sys/devices/platform/sony-laptop, just like the 'cdpower' one.
145You can create other entries corresponding to your own laptop methods by
146further editing the source (see the 'sony_nc_values' table, and add a new
147entry to this table with your get/set method names using the
148SNC_HANDLE_NAMES macro).
149
150Your mission, should you accept it, is to try finding out what
151those entries are for, by reading/writing random values from/to those
152files and find out what is the impact on your laptop.
153
154Should you find anything interesting, please report it back to me,
155I will not disavow all knowledge of your actions :)
156
157See also http://www.linux.it/~malattia/wiki/index.php/Sony_drivers for other
158useful info.
159
160Bugs/Limitations:
161-----------------
162
163* This driver is not based on official documentation from Sony
164 (because there is none), so there is no guarantee this driver
165 will work at all, or do the right thing. Although this hasn't
166 happened to me, this driver could do very bad things to your
167 laptop, including permanent damage.
168
169* The sony-laptop and sonypi drivers do not interact at all. In the
170 future, sonypi will be removed and replaced by sony-laptop.
171
172* spicctrl, which is the userspace tool used to communicate with the
173 sonypi driver (through /dev/sonypi) is deprecated as well since all
174 its features are now available under the sysfs tree via sony-laptop.
diff --git a/Documentation/admin-guide/laptops/sonypi.rst b/Documentation/admin-guide/laptops/sonypi.rst
new file mode 100644
index 000000000000..c6eaaf48f7c1
--- /dev/null
+++ b/Documentation/admin-guide/laptops/sonypi.rst
@@ -0,0 +1,158 @@
1==================================================
2Sony Programmable I/O Control Device Driver Readme
3==================================================
4
5 - Copyright (C) 2001-2004 Stelian Pop <stelian@popies.net>
6 - Copyright (C) 2001-2002 Alcôve <www.alcove.com>
7 - Copyright (C) 2001 Michael Ashley <m.ashley@unsw.edu.au>
8 - Copyright (C) 2001 Junichi Morita <jun1m@mars.dti.ne.jp>
9 - Copyright (C) 2000 Takaya Kinjo <t-kinjo@tc4.so-net.ne.jp>
10 - Copyright (C) 2000 Andrew Tridgell <tridge@samba.org>
11
12This driver enables access to the Sony Programmable I/O Control Device which
13can be found in many Sony Vaio laptops. Some newer Sony laptops (seems to be
14limited to new FX series laptops, at least the FX501 and the FX702) lack a
15sonypi device and are not supported at all by this driver.
16
17It will give access (through a user space utility) to some events those laptops
18generate, like:
19
20 - jogdial events (the small wheel on the side of Vaios)
21 - capture button events (only on Vaio Picturebook series)
22 - Fn keys
23 - bluetooth button (only on C1VR model)
24 - programmable keys, back, help, zoom, thumbphrase buttons, etc.
25 (when available)
26
27Those events (see linux/sonypi.h) can be polled using the character device node
28/dev/sonypi (major 10, minor auto allocated or specified as an option).
29A simple daemon which translates the jogdial movements into mouse wheel events
30can be downloaded at: <http://popies.net/sonypi/>
31
32Another option to intercept the events is to get them directly through the
33input layer.
34
35This driver also supports some ioctl commands for setting the LCD screen
36brightness and querying the battery charge information (some more
37commands may be added in the future).
38
39This driver can also be used to set the camera controls on Picturebook series
40(brightness, contrast etc), and is used by the video4linux driver for the
41Motion Eye camera.
42
43Please note that this driver was created by reverse engineering the Windows
44driver and the ACPI BIOS, because Sony doesn't agree to release any programming
45specs for its laptops. If someone convinces them to do so, drop me a note.
46
47Driver options:
48---------------
49
50Several options can be passed to the sonypi driver using the standard
51module argument syntax (<param>=<value> when passing the option to the
52module or sonypi.<param>=<value> on the kernel boot line when sonypi is
53statically linked into the kernel). Those options are:
54
55 =============== =======================================================
56 minor: minor number of the misc device /dev/sonypi,
57 default is -1 (automatic allocation, see /proc/misc
58 or kernel logs)
59
60 camera: if you have a PictureBook series Vaio (with the
61 integrated MotionEye camera), set this parameter to 1
62 in order to let the driver access the camera
63
64 fnkeyinit: on some Vaios (C1VE, C1VR etc), the Fn key events don't
65 get enabled unless you set this parameter to 1.
66 Do not use this option unless it's actually necessary,
67 some Vaio models don't deal well with this option.
68 This option is available only if the kernel is
69 compiled without ACPI support (since it conflicts
70 with it and it shouldn't be required anyway if
71 ACPI is already enabled).
72
73 verbose: set to 1 to print unknown events received from the
74 sonypi device.
75 set to 2 to print all events received from the
76 sonypi device.
77
78 compat: uses some compatibility code for enabling the sonypi
79 events. If the driver worked for you in the past
80 (prior to version 1.5) and does not work anymore,
81 add this option and report to the author.
82
83 mask: event mask telling the driver what events will be
84 reported to the user. This parameter is required for
85 some Vaio models where the hardware reuses values
86 used in other Vaio models (like the FX series which does
87 not have a jogdial but reuses the jogdial events for
88 programmable keys events). The default event mask is
89 set to 0xffffffff, meaning that all possible events
90 will be tried. You can use the following bits to
91 construct your own event mask (from
92 drivers/char/sonypi.h)::
93
94 SONYPI_JOGGER_MASK 0x0001
95 SONYPI_CAPTURE_MASK 0x0002
96 SONYPI_FNKEY_MASK 0x0004
97 SONYPI_BLUETOOTH_MASK 0x0008
98 SONYPI_PKEY_MASK 0x0010
99 SONYPI_BACK_MASK 0x0020
100 SONYPI_HELP_MASK 0x0040
101 SONYPI_LID_MASK 0x0080
102 SONYPI_ZOOM_MASK 0x0100
103 SONYPI_THUMBPHRASE_MASK 0x0200
104 SONYPI_MEYE_MASK 0x0400
105 SONYPI_MEMORYSTICK_MASK 0x0800
106 SONYPI_BATTERY_MASK 0x1000
107 SONYPI_WIRELESS_MASK 0x2000
108
109 useinput: if set (which is the default) two input devices are
110 created, one which interprets the jogdial events as
111 mouse events, the other one which acts like a
112 keyboard reporting the pressing of the special keys.
113 =============== =======================================================
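
As an illustration of building a custom event mask from the bits listed above, the values can be OR-ed together with shell arithmetic. This is a sketch only: the choice of Fn-key, programmable-key and lid events is arbitrary, and writing the result in hexadecimal assumes that the module parameter parser accepts a ``0x`` prefix.

```shell
#!/bin/bash
# Bit values copied from the list above.
SONYPI_FNKEY_MASK=0x0004
SONYPI_PKEY_MASK=0x0010
SONYPI_LID_MASK=0x0080

# Report only Fn-key, programmable-key and lid events (arbitrary example).
mask=$(( SONYPI_FNKEY_MASK | SONYPI_PKEY_MASK | SONYPI_LID_MASK ))

# Emit a line suitable for a file in /etc/modprobe.d/.
printf 'options sonypi mask=0x%04x\n' "$mask"
# prints: options sonypi mask=0x0094
```

The same value could be passed as ``sonypi.mask=`` on the kernel boot line when the driver is built in, following the option syntax described at the start of this section.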
114
115Module use:
116-----------
117
118In order to automatically load the sonypi module on use, you can put these
119lines in a configuration file in /etc/modprobe.d/::
120
121 alias char-major-10-250 sonypi
122 options sonypi minor=250
123
124This supposes the use of minor 250 for the sonypi device::
125
126 # mknod /dev/sonypi c 10 250
127
128Bugs:
129-----
130
131 - several users reported that this driver disables the BIOS-managed
132 Fn-keys which put the laptop in sleeping state, or switch the
133 external monitor on/off. There is no workaround yet, since this
134 driver disables all APM management for those keys, by enabling the
135 ACPI management (and the ACPI core stuff is not complete yet). If
136 you have one of those laptops with working Fn keys and want to
137 continue to use them, don't use this driver.
138
139 - some users reported that the laptop speed is lower (dhrystone
140 tested) when using the driver with the fnkeyinit parameter. I cannot
141 reproduce it on my laptop and not all users have this problem.
142 This happens because the fnkeyinit parameter enables the ACPI
143 mode (but without additional ACPI control, like processor
144 speed handling etc). Use ACPI instead of APM if it works on your
145 laptop.
146
147 - sonypi lacks the ability to distinguish between certain key
148 events on some models.
149
150 - some models with the nvidia card (geforce go 6200 tc) use a
151 different way to adjust the backlighting of the screen. There
152 is a userspace utility to adjust the brightness on those models,
153 which can be downloaded from
154 http://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2
155
156 - since all development was done by reverse engineering, there is
157 *absolutely no guarantee* that this driver will not crash your
158 laptop. Permanently.
diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
new file mode 100644
index 000000000000..adea0bf2acc5
--- /dev/null
+++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
@@ -0,0 +1,1562 @@
1===========================
2ThinkPad ACPI Extras Driver
3===========================
4
5Version 0.25
6
7October 16th, 2013
8
9- Borislav Deianov <borislav@users.sf.net>
10- Henrique de Moraes Holschuh <hmh@hmh.eng.br>
11
12http://ibm-acpi.sf.net/
13
14This is a Linux driver for the IBM and Lenovo ThinkPad laptops. It
15supports various features of these laptops which are accessible
16through the ACPI and ACPI EC framework, but not otherwise fully
17supported by the generic Linux ACPI drivers.
18
19This driver used to be named ibm-acpi until kernel 2.6.21 and release
200.13-20070314. It used to be in the drivers/acpi tree, but it was
21moved to the drivers/misc tree and renamed to thinkpad-acpi for kernel
222.6.22, and release 0.14. It was moved to drivers/platform/x86 for
23kernel 2.6.29 and release 0.22.
24
25The driver is named "thinkpad-acpi". In some places, like module
26names and log messages, "thinkpad_acpi" is used because of userspace
27issues.
28
29"tpacpi" is used as a shorthand where "thinkpad-acpi" would be too
30long due to length limitations on some Linux kernel versions.
31
32Status
33------
34
35The features currently supported are the following (see below for
36detailed description):
37
38 - Fn key combinations
39 - Bluetooth enable and disable
40 - video output switching, expansion control
41 - ThinkLight on and off
42 - CMOS/UCMS control
43 - LED control
44 - ACPI sounds
45 - temperature sensors
46 - Experimental: embedded controller register dump
47 - LCD brightness control
48 - Volume control
49 - Fan control and monitoring: fan speed, fan enable/disable
50 - WAN enable and disable
51 - UWB enable and disable
52
53A compatibility table by model and feature is maintained on the web
54site, http://ibm-acpi.sf.net/. I appreciate any success or failure
55reports, especially if they add to or correct the compatibility table.
56Please include the following information in your report:
57
58 - ThinkPad model name
59 - a copy of your ACPI tables, using the "acpidump" utility
60 - a copy of the output of dmidecode, with serial numbers
61 and UUIDs masked off
62 - which driver features work and which don't
63 - the observed behavior of non-working features
64
65Any other comments or patches are also more than welcome.
66
67
68Installation
69------------
70
71If you are compiling this driver as included in the Linux kernel
72sources, look for the CONFIG_THINKPAD_ACPI Kconfig option.
73It is located on the menu path: "Device Drivers" -> "X86 Platform
74Specific Device Drivers" -> "ThinkPad ACPI Laptop Extras".
75
76
77Features
78--------
79
80The driver exports two different interfaces to userspace, which can be
81used to access the features it provides. One is a legacy procfs-based
82interface, which will be removed at some time in the future. The other
83is a new sysfs-based interface which is not complete yet.
84
85The procfs interface creates the /proc/acpi/ibm directory. There is a
86file under that directory for each feature it supports. The procfs
87interface is mostly frozen, and will change very little if at all: it
88will not be extended to add any new functionality in the driver, instead
89all new functionality will be implemented on the sysfs interface.
90
91The sysfs interface tries to blend in with the generic Linux sysfs subsystems
92and classes as much as possible. Since some of these subsystems are not
93yet ready or stabilized, it is expected that this interface will change,
94and any and all userspace programs must deal with it.
95
96
97Notes about the sysfs interface
98^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
99
100Unlike what was done with the procfs interface, correctness when talking
101to the sysfs interfaces will be enforced, as will correctness in the
102thinkpad-acpi's implementation of sysfs interfaces.
103
104Also, any bugs in the thinkpad-acpi sysfs driver code or in the
105thinkpad-acpi's implementation of the sysfs interfaces will be fixed for
106maximum correctness, even if that means changing an interface in
107non-compatible ways. As these interfaces mature both in the kernel and
108in thinkpad-acpi, such changes should become quite rare.
109
110Applications interfacing to the thinkpad-acpi sysfs interfaces must
111follow all sysfs guidelines and correctly process all errors (the sysfs
112interface makes extensive use of errors). File descriptors and open /
113close operations to the sysfs inodes must also be properly implemented.
114
115The version of thinkpad-acpi's sysfs interface is exported by the driver
116as a driver attribute (see below).
117
118Sysfs driver attributes are on the driver's sysfs attribute space,
119for 2.6.23+ this is /sys/bus/platform/drivers/thinkpad_acpi/ and
120/sys/bus/platform/drivers/thinkpad_hwmon/
121
122Sysfs device attributes are on the thinkpad_acpi device sysfs attribute
123space, for 2.6.23+ this is /sys/devices/platform/thinkpad_acpi/.
124
125Sysfs device attributes for the sensors and fan are on the
126thinkpad_hwmon device's sysfs attribute space, but you should locate it
127looking for a hwmon device with the name attribute of "thinkpad", or
128better yet, through libsensors. For 4.14+ sysfs attributes were moved to the
129hwmon device (/sys/bus/platform/devices/thinkpad_hwmon/hwmon/hwmon? or
130/sys/class/hwmon/hwmon?).
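
The lookup by name described above can be sketched in shell. This is only
an illustration: the function name find_thinkpad_hwmon and its optional
base-directory argument are assumptions made for this example, not part of
the driver or of libsensors, which remains the preferred way to find the
sensors.

```shell
# Print the hwmon directory whose "name" attribute reads "thinkpad".
# $1: hwmon class directory (hypothetical parameter, defaults to
#     /sys/class/hwmon so the function can be pointed at a test tree).
find_thinkpad_hwmon() {
    base=${1:-/sys/class/hwmon}
    for d in "$base"/hwmon*; do
        # Skip entries without a readable name attribute.
        [ -r "$d/name" ] || continue
        if [ "$(cat "$d/name")" = "thinkpad" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    # No matching device: the driver is not loaded or exposes no hwmon node.
    return 1
}
```

Searching by name instead of hard-coding hwmon0, hwmon1, ... matters because
the hwmon index depends on driver load order.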
131
132Driver version
133--------------
134
135procfs: /proc/acpi/ibm/driver
136
137sysfs driver attribute: version
138
139The driver name and version. No commands can be written to this file.
140
141
142Sysfs interface version
143-----------------------
144
145sysfs driver attribute: interface_version
146
147Version of the thinkpad-acpi sysfs interface, as an unsigned long
148(output in hex format: 0xAAAABBCC), where:
149
150 AAAA
151 - major revision
152 BB
153 - minor revision
154 CC
155 - bugfix revision
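
As a sketch of how a script might split the 0xAAAABBCC value into its three
fields, the following shell function uses plain arithmetic expansion. The
function name decode_interface_version is an assumption made for this
example.

```shell
# Split an interface_version value (0xAAAABBCC) into major.minor.bugfix.
# $1: the value as read from the interface_version driver attribute.
decode_interface_version() {
    v=$(( $1 ))
    printf '%d.%d.%d\n' \
        $(( (v >> 16) & 0xffff )) \
        $(( (v >> 8) & 0xff )) \
        $(( v & 0xff ))
}
```

On a system with the driver loaded, the argument would typically come from
the interface_version file in the driver attribute directory described
earlier (/sys/bus/platform/drivers/thinkpad_acpi/ for 2.6.23+).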
156
157The sysfs interface version changelog for the driver can be found at the
158end of this document. Changes to the sysfs interface done by the kernel
159subsystems are not documented here, nor are they tracked by this
160attribute.
161
162Changes to the thinkpad-acpi sysfs interface are only considered
163non-experimental when they are submitted to Linux mainline, at which
164point the changes in this interface are documented and interface_version
165may be updated. If you are using any thinkpad-acpi features not yet
166sent to mainline for merging, you do so at your own risk: these features
167may disappear, or be implemented in a different and incompatible way by
168the time they are merged in Linux mainline.
169
170Changes that are backwards-compatible by nature (e.g. the addition of
171attributes that do not change the way the other attributes work) do not
172always warrant an update of interface_version. Therefore, one must
173expect that an attribute might not be there, and deal with it properly
174(an attribute not being there *is* a valid way to make it clear that a
175feature is not available in sysfs).
176
177
178Hot keys
179--------
180
181procfs: /proc/acpi/ibm/hotkey
182
183sysfs device attribute: hotkey_*
184
185In a ThinkPad, the ACPI HKEY handler is responsible for communicating
186some important events and also keyboard hot key presses to the operating
187system. Enabling the hotkey functionality of thinkpad-acpi signals the
188firmware that such a driver is present, and modifies how the ThinkPad
189firmware will behave in many situations.
190
191The driver enables the HKEY ("hot key") event reporting automatically
192when loaded, and disables it when it is removed.
193
194The driver will report HKEY events in the following format::
195
196 ibm/hotkey HKEY 00000080 0000xxxx
197
198Some of these events refer to hot key presses, but not all of them.
199
200The driver will generate events over the input layer for hot keys and
201radio switches, and over the ACPI netlink layer for other events. The
202input layer support accepts the standard IOCTLs to remap the keycodes
203assigned to each hot key.
204
205The hot key bit mask allows some control over which hot keys generate
206events. If a key is "masked" (bit set to 0 in the mask), the firmware
207will handle it. If it is "unmasked", it signals the firmware that
208thinkpad-acpi would prefer to handle it, if the firmware would be so
209kind to allow it (and it often doesn't!).
210
211Not all bits in the mask can be modified. Not all bits that can be
212modified do anything. Not all hot keys can be individually controlled
213by the mask. Some models do not support the mask at all. The behaviour
214of the mask is, therefore, highly dependent on the ThinkPad model.
215
216The driver will filter out any masked hotkeys, so even if the firmware
217doesn't allow disabling a specific hotkey, the driver will not report
218events for masked hotkeys.
219
220Note that unmasking some keys prevents their default behavior. For
221example, if Fn+F5 is unmasked, that key will no longer enable/disable
222Bluetooth by itself in firmware.
223
224Note also that not all Fn key combinations are supported through ACPI
225depending on the ThinkPad model and firmware version. On those
226ThinkPads, it is still possible to support some extra hotkeys by
227polling the "CMOS NVRAM" at least 10 times per second. The driver
228attempts to enable this functionality automatically when required.
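
To make the mask semantics above concrete, the following shell sketch tests
whether a given hot key is unmasked. It assumes that bit N of the mask
corresponds to hot key scan code N (so bit 0 is Fn+F1); this document does
not state that mapping explicitly, and the function name hotkey_is_unmasked
is hypothetical.

```shell
# Succeed if the bit for a hot key is set (key "unmasked") in a mask.
# $1: hot key mask, e.g. 0x80c
# $2: bit number, assumed here to equal the hot key scan code
hotkey_is_unmasked() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}
```

Under that assumption, the mask 0x80c has bits 2, 3 and 11 set, which would
select Fn+F3, Fn+F4 and Fn+F12.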
229
230procfs notes
231^^^^^^^^^^^^
232
233The following commands can be written to the /proc/acpi/ibm/hotkey file::
234
235 echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys
236 echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys
237 ... any other 8-hex-digit mask ...
238 echo reset > /proc/acpi/ibm/hotkey -- restore the recommended mask
239
240The following commands have been deprecated and will cause the kernel
241to log a warning::
242
243 echo enable > /proc/acpi/ibm/hotkey -- does nothing
244 echo disable > /proc/acpi/ibm/hotkey -- returns an error
245
246The procfs interface does not support NVRAM polling control. So as to
247maintain maximum bug-to-bug compatibility, it does not report any masks,
248nor does it allow one to manipulate the hot key mask when the firmware
249does not support masks at all, even if NVRAM polling is in use.
250
251sysfs notes
252^^^^^^^^^^^
253
254 hotkey_bios_enabled:
255 DEPRECATED, WILL BE REMOVED SOON.
256
257 Returns 0.
258
259 hotkey_bios_mask:
260 DEPRECATED, DON'T USE, WILL BE REMOVED IN THE FUTURE.
261
262 Returns the hot keys mask when thinkpad-acpi was loaded.
263 Upon module unload, the hot keys mask will be restored
264 to this value. This is always 0x80c, because those are
265 the hotkeys that were supported by ancient firmware
266 without mask support.
267
268 hotkey_enable:
269 DEPRECATED, WILL BE REMOVED SOON.
270
271 0: returns -EPERM
272 1: does nothing
273
274 hotkey_mask:
275 bit mask to enable reporting (and depending on
276 the firmware, ACPI event generation) for each hot key
277 (see above). Returns the current status of the hot keys
278 mask, and allows one to modify it.
279
280 hotkey_all_mask:
281 bit mask that should enable event reporting for all
282 supported hot keys, when echoed to hotkey_mask above.
283 Unless you know which events need to be handled
284 passively (because the firmware *will* handle them
285 anyway), do *not* use hotkey_all_mask. Use
286 hotkey_recommended_mask, instead. You have been warned.
287
288 hotkey_recommended_mask:
289 bit mask that should enable event reporting for all
290 supported hot keys, except those which are always
291 handled by the firmware anyway. Echo it to
292 hotkey_mask above, to use. This is the default mask
293 used by the driver.
294
295 hotkey_source_mask:
 296 bit mask that selects which hot keys the driver will
297 poll the NVRAM for. This is auto-detected by the driver
298 based on the capabilities reported by the ACPI firmware,
299 but it can be overridden at runtime.
300
301 Hot keys whose bits are set in hotkey_source_mask are
302 polled for in NVRAM, and reported as hotkey events if
303 enabled in hotkey_mask. Only a few hot keys are
304 available through CMOS NVRAM polling.
305
306 Warning: when in NVRAM mode, the volume up/down/mute
307 keys are synthesized according to changes in the mixer,
308 which uses a single volume up or volume down hotkey
309 press to unmute, as per the ThinkPad volume mixer user
310 interface. When in ACPI event mode, volume up/down/mute
311 events are reported by the firmware and can behave
312 differently (and that behaviour changes with firmware
313 version -- not just with firmware models -- as well as
314 OSI(Linux) state).
315
316 hotkey_poll_freq:
317 frequency in Hz for hot key polling. It must be between
318 0 and 25 Hz. Polling is only carried out when strictly
319 needed.
320
321 Setting hotkey_poll_freq to zero disables polling, and
322 will cause hot key presses that require NVRAM polling
323 to never be reported.
324
325 Setting hotkey_poll_freq too low may cause repeated
326 pressings of the same hot key to be misreported as a
327 single key press, or to not even be detected at all.
328 The recommended polling frequency is 10Hz.
329
330 hotkey_radio_sw:
331 If the ThinkPad has a hardware radio switch, this
332 attribute will read 0 if the switch is in the "radios
333 disabled" position, and 1 if the switch is in the
334 "radios enabled" position.
335
336 This attribute has poll()/select() support.
337
338 hotkey_tablet_mode:
339 If the ThinkPad has tablet capabilities, this attribute
340 will read 0 if the ThinkPad is in normal mode, and
341 1 if the ThinkPad is in tablet mode.
342
343 This attribute has poll()/select() support.
344
345 wakeup_reason:
346 Set to 1 if the system is waking up because the user
347 requested a bay ejection. Set to 2 if the system is
348 waking up because the user requested the system to
349 undock. Set to zero for normal wake-ups or wake-ups
350 due to unknown reasons.
351
352 This attribute has poll()/select() support.
353
354 wakeup_hotunplug_complete:
 355 Set to 1 if the system was woken up because of an
356 undock or bay ejection request, and that request
357 was successfully completed. At this point, it might
358 be useful to send the system back to sleep, at the
359 user's choice. Refer to HKEY events 0x4003 and
360 0x3003, below.
361
362 This attribute has poll()/select() support.
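
The advice to echo hotkey_recommended_mask into hotkey_mask can be sketched
as a small shell function. The function name and the optional directory
argument are assumptions made for this example; the default path is the
2.6.23+ device attribute directory given earlier in this document.

```shell
# Copy hotkey_recommended_mask into hotkey_mask.
# $1: thinkpad_acpi device attribute directory (hypothetical parameter,
#     defaults to /sys/devices/platform/thinkpad_acpi).
apply_recommended_hotkey_mask() {
    dir=${1:-/sys/devices/platform/thinkpad_acpi}
    # A missing attribute is a valid way for the driver to signal that
    # the feature is not available, so treat it as a normal failure.
    [ -r "$dir/hotkey_recommended_mask" ] || return 1
    [ -w "$dir/hotkey_mask" ] || return 1
    mask=$(cat "$dir/hotkey_recommended_mask") || return 1
    printf '%s\n' "$mask" > "$dir/hotkey_mask"
}
```

Checking the return status matters here, since the sysfs interface reports
problems through errors and applications are expected to handle them.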
363
364input layer notes
365^^^^^^^^^^^^^^^^^
366
367A hot key is mapped to a single input layer EV_KEY event, possibly
368followed by an EV_MSC MSC_SCAN event that shall contain that key's scan
369code. An EV_SYN event will always be generated to mark the end of the
370event block.
371
372Do not use the EV_MSC MSC_SCAN events to process keys. They are to be
373used as a helper to remap keys, only. They are particularly useful when
374remapping KEY_UNKNOWN keys.
375
376The events are available in an input device, with the following id:
377
378 ============== ==============================
379 Bus BUS_HOST
380 vendor 0x1014 (PCI_VENDOR_ID_IBM) or
381 0x17aa (PCI_VENDOR_ID_LENOVO)
382 product 0x5054 ("TP")
383 version 0x4101
384 ============== ==============================
385
386The version will have its LSB incremented if the keymap changes in a
387backwards-compatible way. The MSB shall always be 0x41 for this input
388device. If the MSB is not 0x41, do not use the device as described in
389this section, as it is either something else (e.g. another input device
390exported by a thinkpad driver, such as HDAPS) or its functionality has
391been changed in a non-backwards compatible way.
392
393Adding other event types for other functionalities shall be considered a
394backwards-compatible change for this input device.
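
The version rule above (MSB must be 0x41, LSB may grow) can be checked with a
short shell function. This is a sketch; the function name
is_tpacpi_hotkey_input is an assumption, and how the version number is
obtained from the input device is left to the caller.

```shell
# Succeed if an input device version has the MSB 0x41 required for the
# thinkpad-acpi hot key input device; the LSB is deliberately ignored,
# since it only counts backwards-compatible keymap changes.
# $1: input device version, e.g. 0x4101
is_tpacpi_hotkey_input() {
    [ $(( ($1 >> 8) & 0xff )) -eq $(( 0x41 )) ]
}
```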
395
396Thinkpad-acpi Hot Key event map (version 0x4101):
397
398======= ======= ============== ==============================================
399ACPI Scan
400event code Key Notes
401======= ======= ============== ==============================================
4020x1001 0x00 FN+F1 -
403
4040x1002 0x01 FN+F2 IBM: battery (rare)
405 Lenovo: Screen lock
406
4070x1003 0x02 FN+F3 Many IBM models always report
408 this hot key, even with hot keys
409 disabled or with Fn+F3 masked
410 off
411 IBM: screen lock, often turns
412 off the ThinkLight as side-effect
413 Lenovo: battery
414
4150x1004 0x03 FN+F4 Sleep button (ACPI sleep button
416 semantics, i.e. sleep-to-RAM).
417 It always generates some kind
418 of event, either the hot key
419 event or an ACPI sleep button
420 event. The firmware may
421 refuse to generate further FN+F4
422 key presses until a S3 or S4 ACPI
423 sleep cycle is performed or some
424 time passes.
425
4260x1005 0x04 FN+F5 Radio. Enables/disables
427 the internal Bluetooth hardware
428 and W-WAN card if left in control
429 of the firmware. Does not affect
430 the WLAN card.
431 Should be used to turn on/off all
432 radios (Bluetooth+W-WAN+WLAN),
433 really.
434
4350x1006 0x05 FN+F6 -
436
4370x1007 0x06 FN+F7 Video output cycle.
438 Do you feel lucky today?
439
4400x1008 0x07 FN+F8 IBM: toggle screen expand
441 Lenovo: configure UltraNav,
442 or toggle screen expand
443
4440x1009 0x08 FN+F9 -
445
446... ... ... ...
447
4480x100B 0x0A FN+F11 -
449
4500x100C 0x0B FN+F12 Sleep to disk. You are always
451 supposed to handle it yourself,
452 either through the ACPI event,
453 or through a hotkey event.
454 The firmware may refuse to
455 generate further FN+F12 key
456 press events until a S3 or S4
457 ACPI sleep cycle is performed,
458 or some time passes.
459
4600x100D 0x0C FN+BACKSPACE -
4610x100E 0x0D FN+INSERT -
4620x100F 0x0E FN+DELETE -
463
4640x1010 0x0F FN+HOME Brightness up. This key is
465 always handled by the firmware
466 in IBM ThinkPads, even when
467 unmasked. Just leave it alone.
468 For Lenovo ThinkPads with a new
469 BIOS, it has to be handled either
470 by the ACPI OSI, or by userspace.
471 The driver does the right thing,
472 never mess with this.
4730x1011 0x10 FN+END Brightness down. See brightness
474 up for details.
475
4760x1012 0x11 FN+PGUP ThinkLight toggle. This key is
477 always handled by the firmware,
478 even when unmasked.
479
4800x1013 0x12 FN+PGDOWN -
481
4820x1014 0x13 FN+SPACE Zoom key
483
4840x1015 0x14 VOLUME UP Internal mixer volume up. This
485 key is always handled by the
486 firmware, even when unmasked.
487 NOTE: Lenovo seems to be changing
488 this.
4890x1016 0x15 VOLUME DOWN Internal mixer volume down. This
490 key is always handled by the
491 firmware, even when unmasked.
492 NOTE: Lenovo seems to be changing
493 this.
4940x1017 0x16 MUTE Mute internal mixer. This
495 key is always handled by the
496 firmware, even when unmasked.
497
4980x1018 0x17 THINKPAD ThinkPad/Access IBM/Lenovo key
499
5000x1019 0x18 unknown
501
502... ... ...
503
5040x1020 0x1F unknown
505======= ======= ============== ==============================================
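
In the table above, the scan code is always the ACPI event number minus
0x1001. A shell sketch of that conversion follows; the function name
hkey_event_to_scancode is an assumption made for this example.

```shell
# Convert a hot key ACPI HKEY event (0x1001..0x1020) to its scan code.
# Prints the scan code as 0xNN, or fails for events outside the table.
hkey_event_to_scancode() {
    ev=$(( $1 ))
    if [ "$ev" -lt $(( 0x1001 )) ] || [ "$ev" -gt $(( 0x1020 )) ]; then
        return 1
    fi
    printf '0x%02X\n' $(( ev - 0x1001 ))
}
```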
506
507The ThinkPad firmware does not allow one to differentiate when most hot
508keys are pressed or released (either that, or we don't know how to, yet).
509For these keys, the driver generates a set of events for a key press and
510immediately issues the same set of events for a key release. It is
511unknown by the driver if the ThinkPad firmware triggered these events on
512hot key press or release, but the firmware will do it for either one, not
513both.
514
515If a key is mapped to KEY_RESERVED, it generates no input events at all.
516If a key is mapped to KEY_UNKNOWN, it generates an input event that
517includes a scan code. If a key is mapped to anything else, it will
518generate input device EV_KEY events.
519
520In addition to the EV_KEY events, thinkpad-acpi may also issue EV_SW
521events for switches:
522
523============== ==============================================
524SW_RFKILL_ALL T60 and later hardware rfkill rocker switch
525SW_TABLET_MODE Tablet ThinkPads HKEY events 0x5009 and 0x500A
526============== ==============================================
527
528Non hotkey ACPI HKEY event map
529------------------------------
530
531Events that are never propagated by the driver:
532
533====== ==================================================
5340x2304 System is waking up from suspend to undock
5350x2305 System is waking up from suspend to eject bay
5360x2404 System is waking up from hibernation to undock
5370x2405 System is waking up from hibernation to eject bay
5380x5001 Lid closed
5390x5002 Lid opened
5400x5009 Tablet swivel: switched to tablet mode
5410x500A Tablet swivel: switched to normal mode
5420x5010 Brightness level changed/control event
5430x6000 KEYBOARD: Numlock key pressed
5440x6005 KEYBOARD: Fn key pressed (TO BE VERIFIED)
5450x7000 Radio Switch may have changed state
546====== ==================================================
547
548
549Events that are propagated by the driver to userspace:
550
551====== =====================================================
5520x2313 ALARM: System is waking up from suspend because
553 the battery is nearly empty
5540x2413 ALARM: System is waking up from hibernation because
555 the battery is nearly empty
5560x3003 Bay ejection (see 0x2x05) complete, can sleep again
5570x3006 Bay hotplug request (hint to power up SATA link when
558 the optical drive tray is ejected)
5590x4003 Undocked (see 0x2x04), can sleep again
5600x4010 Docked into hotplug port replicator (non-ACPI dock)
5610x4011 Undocked from hotplug port replicator (non-ACPI dock)
5620x500B Tablet pen inserted into its storage bay
5630x500C Tablet pen removed from its storage bay
5640x6011 ALARM: battery is too hot
5650x6012 ALARM: battery is extremely hot
5660x6021 ALARM: a sensor is too hot
5670x6022 ALARM: a sensor is extremely hot
5680x6030 System thermal table changed
5690x6032 Thermal Control command set completion (DYTC, Windows)
5700x6040 Nvidia Optimus/AC adapter related (TO BE VERIFIED)
5710x60C0 X1 Yoga 2016, Tablet mode status changed
5720x60F0 Thermal Transformation changed (GMTS, Windows)
573====== =====================================================
574
575Battery nearly empty alarms are a last resort attempt to get the
576operating system to hibernate or shut down cleanly (0x2313), or shut down
577cleanly (0x2413) before power is lost. They must be acted upon, as the
578wake up caused by the firmware will have negated most safety nets...
579
580When any of the "too hot" alarms happen, according to Lenovo the user
581should suspend or hibernate the laptop (and in the case of battery
582alarms, unplug the AC adapter) to let it cool down. These alarms do
583signal that something is wrong: they should never happen under normal
584operating conditions.
585
586The "extremely hot" alarms are emergencies. According to Lenovo, the
587operating system is to force either an immediate suspend or hibernate
588cycle, or a system shutdown. Obviously, something is very wrong if this
589happens.
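
A userspace event handler could sort the alarm events described above into
classes before deciding what to do. The following shell sketch shows one way;
the function name and the class labels it prints are assumptions made for
this example, and the event is expected in lowercase 0x form.

```shell
# Classify a propagated HKEY alarm event by the reaction it calls for.
# $1: event number, e.g. 0x6011
hkey_alarm_class() {
    case "$1" in
        0x2313|0x2413) echo "battery-nearly-empty" ;;  # hibernate or shut down
        0x6011|0x6021) echo "too-hot" ;;               # suspend or hibernate
        0x6012|0x6022) echo "extremely-hot" ;;         # emergency
        *)             echo "not-an-alarm" ;;
    esac
}
```

What to run for each class (suspend, hibernate, shutdown) is a policy
decision that is left to the caller.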
590
591
592Brightness hotkey notes
593^^^^^^^^^^^^^^^^^^^^^^^
594
595Don't mess with the brightness hotkeys in a ThinkPad. If you want
596notifications for OSD, use the sysfs backlight class event support.
597
598The driver will issue KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN events
599automatically for the cases where userspace has to do something to
600implement brightness changes. When you override these events, you will
601either fail to handle properly the ThinkPads that require explicit
602action to change backlight brightness, or the ThinkPads that require
603that no action be taken to work properly.
604
605
606Bluetooth
607---------
608
609procfs: /proc/acpi/ibm/bluetooth
610
611sysfs device attribute: bluetooth_enable (deprecated)
612
613sysfs rfkill class: switch "tpacpi_bluetooth_sw"
614
615This feature shows the presence and current state of a ThinkPad
616Bluetooth device in the internal ThinkPad CDC slot.
617
618If the ThinkPad supports it, the Bluetooth state is stored in NVRAM,
619so it is kept across reboots and power-off.
620
621Procfs notes
622^^^^^^^^^^^^
623
624If Bluetooth is installed, the following commands can be used::
625
626 echo enable > /proc/acpi/ibm/bluetooth
627 echo disable > /proc/acpi/ibm/bluetooth
628
629Sysfs notes
630^^^^^^^^^^^
631
632 If the Bluetooth CDC card is installed, it can be enabled /
633 disabled through the "bluetooth_enable" thinkpad-acpi device
634 attribute, and its current status can also be queried.
635
636 enable:
637
638 - 0: disables Bluetooth / Bluetooth is disabled
639 - 1: enables Bluetooth / Bluetooth is enabled.
640
641 Note: this interface has been superseded by the generic rfkill
642 class. It has been deprecated, and it will be removed in year
643 2010.
644
645 rfkill controller switch "tpacpi_bluetooth_sw": refer to
646 Documentation/driver-api/rfkill.rst for details.
647
648
649Video output control -- /proc/acpi/ibm/video
650--------------------------------------------
651
652This feature allows control over the devices used for video output -
653LCD, CRT or DVI (if available). The following commands are available::
654
655 echo lcd_enable > /proc/acpi/ibm/video
656 echo lcd_disable > /proc/acpi/ibm/video
657 echo crt_enable > /proc/acpi/ibm/video
658 echo crt_disable > /proc/acpi/ibm/video
659 echo dvi_enable > /proc/acpi/ibm/video
660 echo dvi_disable > /proc/acpi/ibm/video
661 echo auto_enable > /proc/acpi/ibm/video
662 echo auto_disable > /proc/acpi/ibm/video
663 echo expand_toggle > /proc/acpi/ibm/video
664 echo video_switch > /proc/acpi/ibm/video
665
666NOTE:
667 Access to this feature is restricted to processes owning the
668 CAP_SYS_ADMIN capability for safety reasons, as it can interact badly
669 enough with some versions of X.org to crash it.
670
671Each video output device can be enabled or disabled individually.
672Reading /proc/acpi/ibm/video shows the status of each device.
673
674Automatic video switching can be enabled or disabled. When automatic
675video switching is enabled, certain events (e.g. opening the lid,
676docking or undocking) cause the video output device to change
677automatically. While this can be useful, it also causes flickering
678and, on the X40, video corruption. By disabling automatic switching,
679the flickering or video corruption can be avoided.
680
681The video_switch command cycles through the available video outputs
682(it simulates the behavior of Fn-F7).
683
684Video expansion can be toggled through this feature. This controls
685whether the display is expanded to fill the entire LCD screen when a
686mode with less than full resolution is used. Note that the current
687video expansion status cannot be determined through this feature.
688
689Note that on many models (particularly those using Radeon graphics
690chips) the X driver configures the video card in a way which prevents
691Fn-F7 from working. This also disables the video output switching
692features of this driver, as it uses the same ACPI methods as
693Fn-F7. Video switching on the console should still work.
694
695UPDATE: refer to https://bugs.freedesktop.org/show_bug.cgi?id=2000
696
697
698ThinkLight control
699------------------
700
701procfs: /proc/acpi/ibm/light
702
703sysfs attributes: as per LED class, for the "tpacpi::thinklight" LED
704
705procfs notes
706^^^^^^^^^^^^
707
708The ThinkLight status can be read and set through the procfs interface. A
709few models which do not make the status available will show the ThinkLight
710status as "unknown". The available commands are::
711
712 echo on > /proc/acpi/ibm/light
713 echo off > /proc/acpi/ibm/light
714
715sysfs notes
716^^^^^^^^^^^
717
718The ThinkLight sysfs interface is documented by the LED class
719documentation, in Documentation/leds/leds-class.rst. The ThinkLight LED name
720is "tpacpi::thinklight".
721
722Due to limitations in the sysfs LED class, if the status of the ThinkLight
723cannot be read or if it is unknown, thinkpad-acpi will report it as "off".
724It is impossible to know if the status returned through sysfs is valid.
725
726
727CMOS/UCMS control
728-----------------
729
730procfs: /proc/acpi/ibm/cmos
731
732sysfs device attribute: cmos_command
733
734This feature is mostly used internally by the ACPI firmware to keep the legacy
735CMOS NVRAM bits in sync with the current machine state, and to record this
736state so that the ThinkPad will retain such settings across reboots.
737
738Some of these commands actually perform actions in some ThinkPad models, but
739this is expected to disappear more and more in newer models. As an example, in
740a T43 and in a X40, commands 12 and 13 still control the ThinkLight state for
741real, but commands 0 to 2 don't control the mixer anymore (they have been
742phased out) and just update the NVRAM.
743
744The range of valid cmos command numbers is 0 to 21, but not all have an
745effect and the behavior varies from model to model. Here is the behavior
746on the X40 (tpb is the ThinkPad Buttons utility):
747
748 - 0 - Related to "Volume down" key press
749 - 1 - Related to "Volume up" key press
750 - 2 - Related to "Mute on" key press
751 - 3 - Related to "Access IBM" key press
752 - 4 - Related to "LCD brightness up" key press
753 - 5 - Related to "LCD brightness down" key press
754 - 11 - Related to "toggle screen expansion" key press/function
755 - 12 - Related to "ThinkLight on"
756 - 13 - Related to "ThinkLight off"
757 - 14 - Related to "ThinkLight" key press (toggle ThinkLight)
758
759The cmos command interface is prone to firmware split-brain problems, as
760in newer ThinkPads it is just a compatibility layer. Do not use it; it is
761exported only as a debug tool.
762
763
764LED control
765-----------
766
767procfs: /proc/acpi/ibm/led
768sysfs attributes: as per LED class, see below for names
769
770Some of the LED indicators can be controlled through this feature. On
771some older ThinkPad models, it is possible to query the status of the
772LED indicators as well. Newer ThinkPads cannot query the real status
773of the LED indicators.
774
775Because misuse of the LEDs could induce an unaware user to perform
776dangerous actions (like undocking or ejecting a bay device while the
777buses are still active), or mask an important alarm (such as a nearly
778empty battery, or a broken battery), access to most LEDs is
779restricted.
780
781Unrestricted access to all LEDs requires that thinkpad-acpi be
782compiled with the CONFIG_THINKPAD_ACPI_UNSAFE_LEDS option enabled.
783Distributions must never enable this option. Individual users that
784are aware of the consequences are welcome to enable it.
785
786Audio mute and microphone mute LEDs are supported, but currently not
787visible to userspace. They are used by the snd-hda-intel audio driver.
788
789procfs notes
790^^^^^^^^^^^^
791
792The available commands are::
793
794 echo '<LED number> on' >/proc/acpi/ibm/led
795 echo '<LED number> off' >/proc/acpi/ibm/led
796 echo '<LED number> blink' >/proc/acpi/ibm/led
797
798The <LED number> range is 0 to 15. The set of LEDs that can be
799controlled varies from model to model. Here is the common ThinkPad
800mapping:
801
802 - 0 - power
803 - 1 - battery (orange)
804 - 2 - battery (green)
805 - 3 - UltraBase/dock
806 - 4 - UltraBay
807 - 5 - UltraBase battery slot
808 - 6 - (unknown)
809 - 7 - standby
810 - 8 - dock status 1
811 - 9 - dock status 2
812 - 10, 11 - (unknown)
813 - 12 - thinkvantage
814 - 13, 14, 15 - (unknown)
815
816All of the above can be turned on and off and can be made to blink.
817
818sysfs notes
819^^^^^^^^^^^
820
821The ThinkPad LED sysfs interface is described in detail by the LED class
822documentation, in Documentation/leds/leds-class.rst.
823
824The LEDs are named (in LED ID order, from 0 to 12):
825"tpacpi::power", "tpacpi:orange:batt", "tpacpi:green:batt",
826"tpacpi::dock_active", "tpacpi::bay_active", "tpacpi::dock_batt",
827"tpacpi::unknown_led", "tpacpi::standby", "tpacpi::dock_status1",
828"tpacpi::dock_status2", "tpacpi::unknown_led2", "tpacpi::unknown_led3",
829"tpacpi::thinkvantage".
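
For scripts that still think in procfs LED numbers, the mapping to the sysfs
names listed above can be written as a lookup. This is a sketch; the function
name tpacpi_led_name is an assumption made for this example.

```shell
# Map a procfs LED number (0 to 12) to its sysfs LED class name.
# Fails for numbers that have no name in the list above.
tpacpi_led_name() {
    case "$1" in
        0)  echo "tpacpi::power" ;;
        1)  echo "tpacpi:orange:batt" ;;
        2)  echo "tpacpi:green:batt" ;;
        3)  echo "tpacpi::dock_active" ;;
        4)  echo "tpacpi::bay_active" ;;
        5)  echo "tpacpi::dock_batt" ;;
        6)  echo "tpacpi::unknown_led" ;;
        7)  echo "tpacpi::standby" ;;
        8)  echo "tpacpi::dock_status1" ;;
        9)  echo "tpacpi::dock_status2" ;;
        10) echo "tpacpi::unknown_led2" ;;
        11) echo "tpacpi::unknown_led3" ;;
        12) echo "tpacpi::thinkvantage" ;;
        *)  return 1 ;;
    esac
}
```

A name returned here may still be absent on a given machine, since LEDs that
are known not to exist on a model are not exported through sysfs.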
830
831Due to limitations in the sysfs LED class, if the status of the LED
832indicators cannot be read due to an error, thinkpad-acpi will report it as
833a brightness of zero (same as LED off).
834
835If the ThinkPad firmware doesn't support reading the current status,
836trying to read the current LED brightness will just return whatever
837brightness was last written to that attribute.
838
839These LEDs can blink using hardware acceleration. To request that a
840ThinkPad indicator LED should blink in hardware accelerated mode, use the
841"timer" trigger, and leave the delay_on and delay_off parameters set to
842zero (to request hardware acceleration autodetection).
843
844LEDs that are known not to exist in a given ThinkPad model are not
845made available through the sysfs interface. If you have a dock and you
846notice there are LEDs listed for your ThinkPad that do not exist (and
847are not in the dock), or if you notice that there are missing LEDs,
848a report to ibm-acpi-devel@lists.sourceforge.net is appreciated.
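
The ID-to-name mapping above can be sketched as a small lookup helper. This is only an illustration: the function name ``led_sysfs_dir`` is hypothetical, and the ``/sys/class/leds`` base directory is assumed to be where the LED class exposes its devices on the running system.

```python
# LED names in LED ID order (0 to 12), copied from the list above.
TPACPI_LED_NAMES = [
    "tpacpi::power", "tpacpi:orange:batt", "tpacpi:green:batt",
    "tpacpi::dock_active", "tpacpi::bay_active", "tpacpi::dock_batt",
    "tpacpi::unknown_led", "tpacpi::standby", "tpacpi::dock_status1",
    "tpacpi::dock_status2", "tpacpi::unknown_led2", "tpacpi::unknown_led3",
    "tpacpi::thinkvantage",
]


def led_sysfs_dir(led_id, base="/sys/class/leds"):
    """Return the assumed sysfs directory for a ThinkPad LED ID (0 to 12).

    The helper name and the default base path are illustrative assumptions.
    """
    if not 0 <= led_id < len(TPACPI_LED_NAMES):
        raise ValueError(f"LED id {led_id} has no sysfs name")
    return f"{base}/{TPACPI_LED_NAMES[led_id]}"


print(led_sysfs_dir(1))   # /sys/class/leds/tpacpi:orange:batt
```

Since LEDs known not to exist on a model are not registered, a directory returned by this helper may be absent on a given machine.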
849
850
851ACPI sounds -- /proc/acpi/ibm/beep
852----------------------------------
853
854The BEEP method is used internally by the ACPI firmware to provide
855audible alerts in various situations. This feature allows the same
856sounds to be triggered manually.
857
858The commands are non-negative integer numbers::
859
860 echo <number> >/proc/acpi/ibm/beep
861
862The valid <number> range is 0 to 17. Not all numbers trigger sounds
863and the sounds vary from model to model. Here is the behavior on the
864X40:
865
866 - 0 - stop a sound in progress (but use 17 to stop 16)
867 - 2 - two beeps, pause, third beep ("low battery")
868 - 3 - single beep
869 - 4 - high, followed by low-pitched beep ("unable")
870 - 5 - single beep
871 - 6 - very high, followed by high-pitched beep ("AC/DC")
872 - 7 - high-pitched beep
873 - 9 - three short beeps
874 - 10 - very long beep
875 - 12 - low-pitched beep
876 - 15 - three high-pitched beeps repeating constantly, stop with 0
877 - 16 - one medium-pitched beep repeating constantly, stop with 17
878 - 17 - stop 16
879
880
881Temperature sensors
882-------------------
883
884procfs: /proc/acpi/ibm/thermal
885
886sysfs device attributes: (hwmon "thinkpad") temp*_input
887
888Most ThinkPads include six or more separate temperature sensors but only
889expose the CPU temperature through the standard ACPI methods. This
890feature shows readings from up to eight different sensors on older
891ThinkPads, and up to sixteen different sensors on newer ThinkPads.
892
893For example, on the X40, a typical output may be:
894
895temperatures:
896 42 42 45 41 36 -128 33 -128
897
898On the T43/p, a typical output may be:
899
900temperatures:
901 48 48 36 52 38 -128 31 -128 48 52 48 -128 -128 -128 -128 -128
902
903The mapping of thermal sensors to physical locations varies depending on
904system-board model (and thus, on ThinkPad model).
905
906http://thinkwiki.org/wiki/Thermal_Sensors is a public wiki page that
907tries to track down these locations for various models.
908
909Most (newer?) models seem to follow this pattern:
910
911- 1: CPU
912- 2: (depends on model)
913- 3: (depends on model)
914- 4: GPU
915- 5: Main battery: main sensor
916- 6: Bay battery: main sensor
917- 7: Main battery: secondary sensor
918- 8: Bay battery: secondary sensor
919- 9-15: (depends on model)
920
921For the R51 (source: Thomas Gruber):
922
923- 2: Mini-PCI
924- 3: Internal HDD
925
926For the T43, T43/p (source: Shmidoax/Thinkwiki.org)
927http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_T43.2C_T43p
928
929- 2: System board, left side (near PCMCIA slot), reported as HDAPS temp
930- 3: PCMCIA slot
931- 9: MCH (northbridge) to DRAM Bus
932- 10: Clock-generator, mini-pci card and ICH (southbridge), under Mini-PCI
933 card, under touchpad
934- 11: Power regulator, underside of system board, below F2 key
935
936The A31 has a very atypical layout for the thermal sensors
937(source: Milos Popovic, http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_A31)
938
939- 1: CPU
940- 2: Main Battery: main sensor
941- 3: Power Converter
942- 4: Bay Battery: main sensor
943- 5: MCH (northbridge)
944- 6: PCMCIA/ambient
945- 7: Main Battery: secondary sensor
946- 8: Bay Battery: secondary sensor
947
948
949Procfs notes
950^^^^^^^^^^^^
951
952 Readings from sensors that are not available return -128.
953 No commands can be written to this file.
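
As a minimal sketch of how userspace might interpret the procfs output, the snippet below parses the text shown in the X40 example and drops the -128 placeholders. The function name ``parse_thermal`` is hypothetical, and sensors are numbered from 1 to match the location lists above.

```python
UNAVAILABLE = -128  # value reported for sensors that are not available


def parse_thermal(text):
    """Map 1-based sensor numbers to temperatures, skipping unavailable ones."""
    # Drop the leading "temperatures:" label if it is present.
    body = text.split(":", 1)[1] if ":" in text else text
    readings = [int(token) for token in body.split()]
    return {
        number: temp
        for number, temp in enumerate(readings, start=1)
        if temp != UNAVAILABLE
    }


sample = "temperatures:\n 42 42 45 41 36 -128 33 -128\n"
print(parse_thermal(sample))
# {1: 42, 2: 42, 3: 45, 4: 41, 5: 36, 7: 33}
```

On a real system the text would come from reading /proc/acpi/ibm/thermal instead of the ``sample`` string.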
954
955Sysfs notes
956^^^^^^^^^^^
957
958 Sensors that are not available return the ENXIO error. This
959 status may change at runtime, as there are hotplug thermal
960 sensors, like those inside the batteries and docks.
961
962 thinkpad-acpi thermal sensors are reported through the hwmon
963 subsystem, and follow all of the hwmon guidelines at
964 Documentation/hwmon.
965
966EXPERIMENTAL: Embedded controller register dump
967-----------------------------------------------
968
969This feature is not included in the thinkpad driver anymore.
970Instead the EC can be accessed through /sys/kernel/debug/ec with
971a userspace tool which can be found here:
972ftp://ftp.suse.com/pub/people/trenn/sources/ec
973
974Use it to determine the register holding the fan
975speed on some models. To do that, do the following:
976
977 - make sure the battery is fully charged
978 - make sure the fan is running
979 - use above mentioned tool to read out the EC
980
981Often fan and temperature values vary between
982readings. Since temperatures don't change very fast, you can take
983several quick dumps to eliminate them.
984
985You can use a similar method to figure out the meaning of other
986embedded controller registers - e.g. make sure nothing else changes
987except the charging or discharging battery to determine which
988registers contain the current battery capacity, etc. If you experiment
989with this, do send me your results (including some complete dumps with
990a description of the conditions when they were taken.)
991
992
993LCD brightness control
994----------------------
995
996procfs: /proc/acpi/ibm/brightness
997
998sysfs backlight device "thinkpad_screen"
999
1000This feature allows software control of the LCD brightness on ThinkPad
1001models which don't have a hardware brightness slider.
1002
1003It has some limitations: the LCD backlight cannot be actually turned
1004on or off by this interface, it just controls the backlight brightness
1005level.
1006
1007On IBM (and some of the earlier Lenovo) ThinkPads, the backlight control
1008has eight brightness levels, ranging from 0 to 7. Some of the levels
1009may not be distinct. Later Lenovo models that implement the ACPI
1010display backlight brightness control methods have 16 levels, ranging
1011from 0 to 15.
1012
1013For IBM ThinkPads, there are two interfaces to the firmware for direct
1014brightness control, EC and UCMS (or CMOS). To select which one should be
1015used, use the brightness_mode module parameter: brightness_mode=1 selects
1016EC mode, brightness_mode=2 selects UCMS mode, brightness_mode=3 selects EC
1017mode with NVRAM backing (so that brightness changes are remembered across
1018shutdown/reboot).
1019
1020The driver tries to select which interface to use from a table of
1021defaults for each ThinkPad model. If it makes a wrong choice, please
1022report this as a bug, so that we can fix it.
1023
1024Lenovo ThinkPads only support brightness_mode=2 (UCMS).
1025
1026When display backlight brightness controls are available through the
1027standard ACPI interface, it is best to use it instead of this direct
1028ThinkPad-specific interface. The driver will disable its native
1029backlight brightness control interface if it detects that the standard
1030ACPI interface is available in the ThinkPad.
1031
1032If you want to use the thinkpad-acpi backlight brightness control
1033instead of the generic ACPI video backlight brightness control for some
1034reason, you should use the acpi_backlight=vendor kernel parameter.
1035
1036The brightness_enable module parameter can be used to control whether
1037the LCD brightness control feature will be enabled when available.
1038brightness_enable=0 forces it to be disabled. brightness_enable=1
1039forces it to be enabled when available, even if the standard ACPI
1040interface is also available.
1041
1042Procfs notes
1043^^^^^^^^^^^^
1044
1045The available commands are::
1046
1047 echo up >/proc/acpi/ibm/brightness
1048 echo down >/proc/acpi/ibm/brightness
1049 echo 'level <level>' >/proc/acpi/ibm/brightness
1050
1051Sysfs notes
1052^^^^^^^^^^^
1053
1054The interface is implemented through the backlight sysfs class, which is
1055poorly documented at this time.
1056
1057Locate the thinkpad_screen device under /sys/class/backlight, and inside
1058it there will be the following attributes:
1059
1060 max_brightness:
1061 Reads the maximum brightness the hardware can be set to.
1062 The minimum is always zero.
1063
1064 actual_brightness:
1065 Reads what brightness the screen is set to at this instant.
1066
1067 brightness:
1068 Writes request the driver to change brightness to the
1069 given value. Reads will tell you what brightness the
1070 driver is trying to set the display to when "power" is set
1071 to zero and the display has not been dimmed by a kernel
1072 power management event.
1073
1074 power:
1075 power management mode, where 0 is "display on", and 1 to 3
1076 will dim the display backlight to brightness level 0
1077 because thinkpad-acpi cannot really turn the backlight
1078 off. Kernel power management events can temporarily
1079 increase the current power management level, i.e. they can
1080 dim the display.
1081
1082
1083WARNING:
1084
1085 Whatever you do, do NOT ever call thinkpad-acpi backlight-level change
1086 interface and the ACPI-based backlight level change interface
1087 (available on newer BIOSes, and driven by the Linux ACPI video driver)
1088 at the same time. The two will interact in bad ways, do funny things,
1089 and maybe reduce the life of the backlight lamps by needlessly kicking
1090 its level up and down at every change.
1091
1092
1093Volume control (Console Audio control)
1094--------------------------------------
1095
1096procfs: /proc/acpi/ibm/volume
1097
1098ALSA: "ThinkPad Console Audio Control", default ID: "ThinkPadEC"
1099
1100NOTE: by default, the volume control interface operates in read-only
1101mode, as it is supposed to be used for on-screen-display purposes.
1102The read/write mode can be enabled through the use of the
1103"volume_control=1" module parameter.
1104
1105NOTE: distros are urged to not enable volume_control by default; this
1106should be done by the local admin only. The ThinkPad UI is for the
1107console audio control to be done through the volume keys only, and for
1108the desktop environment to just provide on-screen-display feedback.
1109Software volume control should be done only in the main AC97/HDA
1110mixer.
1111
1112
1113About the ThinkPad Console Audio control
1114^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1115
1116ThinkPads have a built-in amplifier and muting circuit that drives the
1117console headphone and speakers. This circuit is after the main AC97
1118or HDA mixer in the audio path, and under exclusive control of the
1119firmware.
1120
1121ThinkPads have three special hotkeys to interact with the console
1122audio control: volume up, volume down and mute.
1123
1124It is worth noting that the normal way the mute function works (on
1125ThinkPads that do not have a "mute LED") is:
1126
11271. Press mute to mute. It will *always* mute, you can press it as
1128 many times as you want, and the sound will remain mute.
1129
11302. Press either volume key to unmute the ThinkPad (it will _not_
1131 change the volume, it will just unmute).
1132
1133This is a very superior design when compared to the cheap software-only
1134mute-toggle solution found on normal consumer laptops: you can be
1135absolutely sure the ThinkPad will not make noise if you press the mute
1136button, no matter the previous state.
1137
1138The IBM ThinkPads, and the earlier Lenovo ThinkPads have variable-gain
1139amplifiers driving the speakers and headphone output, and the firmware
1140also handles volume control for the headphone and speakers on these
1141ThinkPads without any help from the operating system (this volume
1142control stage exists after the main AC97 or HDA mixer in the audio
1143path).
1144
1145The newer Lenovo models only have firmware mute control, and depend on
1146the main HDA mixer to do volume control (which is done by the operating
1147system). In this case, the volume keys are filtered out for unmute
1148key press (there are some firmware bugs in this area) and delivered as
1149normal key presses to the operating system (thinkpad-acpi is not
1150involved).
1151
1152
1153The ThinkPad-ACPI volume control
1154^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1155
1156The preferred way to interact with the Console Audio control is the
1157ALSA interface.
1158
1159The legacy procfs interface allows one to read the current state,
1160and if volume control is enabled, accepts the following commands::
1161
1162 echo up >/proc/acpi/ibm/volume
1163 echo down >/proc/acpi/ibm/volume
1164 echo mute >/proc/acpi/ibm/volume
1165 echo unmute >/proc/acpi/ibm/volume
1166 echo 'level <level>' >/proc/acpi/ibm/volume
1167
1168The <level> number range is 0 to 14 although not all of them may be
1169distinct. To unmute the volume after the mute command, use either the
1170up or down command (the level command will not unmute the volume), or
1171the unmute command.
1172
1173You can use the volume_capabilities parameter to tell the driver
1174whether your thinkpad has volume control or mute-only control:
1175volume_capabilities=1 for mixers with mute and volume control,
1176volume_capabilities=2 for mixers with only mute control.
1177
1178If the driver misdetects the capabilities for your ThinkPad model,
1179please report this to ibm-acpi-devel@lists.sourceforge.net, so that we
1180can update the driver.
1181
1182There are two strategies for volume control. To select which one
1183should be used, use the volume_mode module parameter: volume_mode=1
1184selects EC mode, and volume_mode=3 selects EC mode with NVRAM backing
1185(so that volume/mute changes are remembered across shutdown/reboot).
1186
1187The driver will operate in volume_mode=3 by default. If that does not
1188work well on your ThinkPad model, please report this to
1189ibm-acpi-devel@lists.sourceforge.net.
1190
1191The driver supports the standard ALSA module parameters. If the ALSA
1192mixer is disabled, the driver will disable all volume functionality.
1193
1194
1195Fan control and monitoring: fan speed, fan enable/disable
1196---------------------------------------------------------
1197
1198procfs: /proc/acpi/ibm/fan
1199
1200sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1, pwm1_enable, fan2_input
1201
1202sysfs hwmon driver attributes: fan_watchdog
1203
1204NOTE NOTE NOTE:
1205 fan control operations are disabled by default for
1206 safety reasons. To enable them, the module parameter "fan_control=1"
1207 must be given to thinkpad-acpi.
1208
1209This feature attempts to show the current fan speed, control mode and
1210other fan data that might be available. The speed is read directly
1211from the hardware registers of the embedded controller. This is known
1212to work on later R, T, X and Z series ThinkPads but may show a bogus
1213value on other models.
1214
1215Some Lenovo ThinkPads support a secondary fan. This fan cannot be
1216controlled separately, it shares the main fan control.
1217
1218Fan levels
1219^^^^^^^^^^
1220
1221Most ThinkPad fans work in "levels" at the firmware interface. Level 0
1222stops the fan. The higher the level, the higher the fan speed, although
1223adjacent levels often map to the same fan speed. 7 is the highest
1224level, where the fan reaches the maximum recommended speed.
1225
1226Level "auto" means the EC changes the fan level according to some
1227internal algorithm, usually based on readings from the thermal sensors.
1228
1229There is also a "full-speed" level, also known as "disengaged" level.
1230In this level, the EC disables the speed-locked closed-loop fan control,
1231and drives the fan as fast as it can go, which might exceed hardware
1232limits, so use this level with caution.
1233
1234The fan usually ramps up or down slowly from one speed to another, and
1235it is normal for the EC to take several seconds to react to fan
1236commands. The full-speed level may take up to two minutes to ramp up to
1237maximum speed, and in some ThinkPads, the tachometer readings go stale
1238while the EC is transitioning to the full-speed level.
1239
1240WARNING WARNING WARNING: do not leave the fan disabled unless you are
1241monitoring all of the temperature sensor readings and you are ready to
1242enable it if necessary to avoid overheating.
1243
1244An enabled fan in level "auto" may stop spinning if the EC decides the
1245ThinkPad is cool enough and doesn't need the extra airflow. This is
1246normal, and the EC will spin the fan up if the various thermal readings
1247rise too much.
1248
1249On the X40, this seems to depend on the CPU and HDD temperatures.
1250Specifically, the fan is turned on when either the CPU temperature
1251climbs to 56 degrees or the HDD temperature climbs to 46 degrees. The
1252fan is turned off when the CPU temperature drops to 49 degrees and the
1253HDD temperature drops to 41 degrees. These thresholds cannot
1254currently be controlled.
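
The X40 thresholds above describe a hysteresis: the fan starts at one pair of temperatures and stops only at a lower pair. The following sketch models that observed behavior; it is not the EC firmware algorithm, and the function name is hypothetical.

```python
def x40_fan_next_state(fan_on, cpu_temp, hdd_temp):
    """Model the X40 fan on/off decision described in the text.

    Returns True if the fan should be running after this reading.
    """
    if not fan_on:
        # Fan starts when either temperature reaches its upper threshold.
        return cpu_temp >= 56 or hdd_temp >= 46
    # Fan stops only once both temperatures have dropped to the lower ones.
    return not (cpu_temp <= 49 and hdd_temp <= 41)


state = False
for cpu, hdd in [(50, 40), (56, 40), (52, 40), (49, 41)]:
    state = x40_fan_next_state(state, cpu, hdd)
    print(cpu, hdd, "on" if state else "off")
```

The gap between the two threshold pairs is what keeps the fan from toggling rapidly when a temperature hovers near a single limit.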
1255
1256The ThinkPad's ACPI DSDT code will reprogram the fan on its own when
1257certain conditions are met. It will override any fan programming done
1258through thinkpad-acpi.
1259
1260The thinkpad-acpi kernel driver can be programmed to revert the fan
1261level to a safe setting if userspace does not issue one of the procfs
1262fan commands: "enable", "disable", "level" or "watchdog", or if there
1263are no writes to pwm1_enable (or to pwm1 *if and only if* pwm1_enable is
1264set to 1, manual mode) within a configurable amount of time of up to
1265120 seconds. This functionality is called fan safety watchdog.
1266
1267Note that the watchdog timer stops after it enables the fan. It will be
1268rearmed again automatically (using the same interval) when one of the
1269above mentioned fan commands is received. The fan watchdog is,
1270therefore, not suitable to protect against fan mode changes made through
1271means other than the "enable", "disable", and "level" procfs fan
1272commands, or the hwmon fan control sysfs interface.
1273
1274Procfs notes
1275^^^^^^^^^^^^
1276
1277The fan may be enabled or disabled with the following commands::
1278
1279 echo enable >/proc/acpi/ibm/fan
1280 echo disable >/proc/acpi/ibm/fan
1281
1282Placing a fan on level 0 is the same as disabling it. Enabling a fan
1283will try to place it in a safe level if it is too slow or disabled.
1284
1285The fan level can be controlled with the command::
1286
1287 echo 'level <level>' > /proc/acpi/ibm/fan
1288
1289Where <level> is an integer from 0 to 7, or one of the words "auto" or
1290"full-speed" (without the quotes). Not all ThinkPads support the "auto"
1291and "full-speed" levels. The driver accepts "disengaged" as an alias for
1292"full-speed", and reports it as "disengaged" for backwards
1293compatibility.
1294
1295On the X31 and X40 (and ONLY on those models), the fan speed can be
1296controlled to a certain degree. Once the fan is running, it can be
1297forced to run faster or slower with the following command::
1298
1299 echo 'speed <speed>' > /proc/acpi/ibm/fan
1300
1301The sustainable range of fan speeds on the X40 appears to be from about
13023700 to about 7350. Values outside this range either do not have any
1303effect or the fan speed eventually settles somewhere in that range. The
1304fan cannot be stopped or started with this command. This functionality
1305is incomplete, and not available through the sysfs interface.
1306
1307To program the safety watchdog, use the "watchdog" command::
1308
1309 echo 'watchdog <interval in seconds>' > /proc/acpi/ibm/fan
1310
1311If you want to disable the watchdog, use 0 as the interval.
1312
1313Sysfs notes
1314^^^^^^^^^^^
1315
1316The sysfs interface follows the hwmon subsystem guidelines for the most
1317part, and the exception is the fan safety watchdog.
1318
1319Writes to any of the sysfs attributes may return the EINVAL error if
1320that operation is not supported in a given ThinkPad or if the parameter
1321is out-of-bounds, and EPERM if it is forbidden. They may also return
1322EINTR (interrupted system call), and EIO (I/O error while trying to talk
1323to the firmware).
1324
1325Features not yet implemented by the driver return ENOSYS.
1326
1327hwmon device attribute pwm1_enable:
1328 - 0: PWM offline (fan is set to full-speed mode)
1329 - 1: Manual PWM control (use pwm1 to set fan level)
1330 - 2: Hardware PWM control (EC "auto" mode)
1331 - 3: reserved (Software PWM control, not implemented yet)
1332
1333 Modes 0 and 2 are not supported by all ThinkPads, and the
1334 driver is not always able to detect this. If it does know a
1335 mode is unsupported, it will return -EINVAL.
1336
1337hwmon device attribute pwm1:
1338 Fan level, scaled from the firmware values of 0-7 to the hwmon
1339 scale of 0-255. 0 means fan stopped, 255 means highest normal
1340 speed (level 7).
1341
1342 This attribute only commands the fan if pwm1_enable is set to 1
1343 (manual PWM control).
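
To illustrate the scaling between firmware levels 0-7 and the hwmon range 0-255, here is one plausible linear mapping. The exact rounding used by the driver is an assumption here, and both function names are hypothetical.

```python
def level_to_pwm(level):
    """Scale a firmware fan level (0-7) to the hwmon pwm range (0-255)."""
    if not 0 <= level <= 7:
        raise ValueError("fan level must be between 0 and 7")
    return level * 255 // 7


def pwm_to_level(pwm):
    """Scale a hwmon pwm value (0-255) down to a firmware level (0-7)."""
    if not 0 <= pwm <= 255:
        raise ValueError("pwm must be between 0 and 255")
    return pwm >> 5  # 32 pwm steps per firmware level


for level in range(8):
    print(level, level_to_pwm(level))
```

Under this mapping a pwm1 value of 128 corresponds to level 4, and 255 to level 7, the highest normal speed.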
1344
1345hwmon device attribute fan1_input:
1346 Fan tachometer reading, in RPM. May go stale on certain
1347 ThinkPads while the EC transitions the PWM to offline mode,
1348 which can take up to two minutes. May return rubbish on older
1349 ThinkPads.
1350
1351hwmon device attribute fan2_input:
1352 Fan tachometer reading, in RPM, for the secondary fan.
1353 Available only on some ThinkPads. If the secondary fan is
1354 not installed, will always read 0.
1355
1356hwmon driver attribute fan_watchdog:
1357 Fan safety watchdog timer interval, in seconds. Minimum is
1358 1 second, maximum is 120 seconds. 0 disables the watchdog.
1359
1360To stop the fan: set pwm1 to zero, and pwm1_enable to 1.
1361
1362To start the fan in a safe mode: set pwm1_enable to 2. If that fails
1363with EINVAL, try to set pwm1_enable to 1 and pwm1 to at least 128 (255
1364would be the safest choice, though).
1365
1366
1367WAN
1368---
1369
1370procfs: /proc/acpi/ibm/wan
1371
1372sysfs device attribute: wwan_enable (deprecated)
1373
1374sysfs rfkill class: switch "tpacpi_wwan_sw"
1375
1376This feature shows the presence and current state of the built-in
1377Wireless WAN device.
1378
1379If the ThinkPad supports it, the WWAN state is stored in NVRAM,
1380so it is kept across reboots and power-off.
1381
1382It was tested on a Lenovo ThinkPad X60. It should probably work on other
1383ThinkPad models which come with this module installed.
1384
1385Procfs notes
1386^^^^^^^^^^^^
1387
1388If the W-WAN card is installed, the following commands can be used::
1389
1390 echo enable > /proc/acpi/ibm/wan
1391 echo disable > /proc/acpi/ibm/wan
1392
1393Sysfs notes
1394^^^^^^^^^^^
1395
1396 If the W-WAN card is installed, it can be enabled /
1397 disabled through the "wwan_enable" thinkpad-acpi device
1398 attribute, and its current status can also be queried.
1399
1400 enable:
1401 - 0: disables WWAN card / WWAN card is disabled
1402 - 1: enables WWAN card / WWAN card is enabled.
1403
1404 Note: this interface has been superseded by the generic rfkill
1405 class. It has been deprecated, and it will be removed in year
1406 2010.
1407
1408 rfkill controller switch "tpacpi_wwan_sw": refer to
1409 Documentation/driver-api/rfkill.rst for details.
1410
1411
1412EXPERIMENTAL: UWB
1413-----------------
1414
1415This feature is considered EXPERIMENTAL because it has not been extensively
1416tested and validated in various ThinkPad models yet. The feature may not
1417work as expected. USE WITH CAUTION! To use this feature, you need to supply
1418the experimental=1 parameter when loading the module.
1419
1420sysfs rfkill class: switch "tpacpi_uwb_sw"
1421
1422This feature exports an rfkill controller for the UWB device, if one is
1423present and enabled in the BIOS.
1424
1425Sysfs notes
1426^^^^^^^^^^^
1427
1428 rfkill controller switch "tpacpi_uwb_sw": refer to
1429 Documentation/driver-api/rfkill.rst for details.
1430
1431Adaptive keyboard
1432-----------------
1433
1434sysfs device attribute: adaptive_kbd_mode
1435
1436This sysfs attribute controls the keyboard "face" that will be shown on the
1437Lenovo X1 Carbon 2nd gen (2014)'s adaptive keyboard. The value can be read
1438and set.
1439
1440- 1 = Home mode
1441- 2 = Web-browser mode
1442- 3 = Web-conference mode
1443- 4 = Function mode
1444- 5 = Layflat mode
1445
1446For more details about which buttons will appear depending on the mode, please
1447review the laptop's user guide:
1448http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
1449
1450Multiple Commands, Module Parameters
1451------------------------------------
1452
1453Multiple commands can be written to the proc files in one shot by
1454separating them with commas, for example::
1455
1456 echo enable,0xffff > /proc/acpi/ibm/hotkey
1457 echo lcd_disable,crt_enable > /proc/acpi/ibm/video
1458
1459Commands can also be specified when loading the thinkpad-acpi module,
1460for example::
1461
1462 modprobe thinkpad_acpi hotkey=enable,0xffff video=auto_disable
1463
1464
1465Enabling debugging output
1466-------------------------
1467
1468The module takes a debug parameter which can be used to selectively
1469enable various classes of debugging output, for example::
1470
1471 modprobe thinkpad_acpi debug=0xffff
1472
1473will enable all debugging output classes. It takes a bitmask, so
1474to enable more than one output class, just add their values.
1475
1476 ============= ======================================
1477 Debug bitmask Description
1478 ============= ======================================
1479 0x8000 Disclose PID of userspace programs
1480 accessing some functions of the driver
1481 0x0001 Initialization and probing
1482 0x0002 Removal
1483 0x0004 RF Transmitter control (RFKILL)
1484 (bluetooth, WWAN, UWB...)
1485 0x0008 HKEY event interface, hotkeys
1486 0x0010 Fan control
1487 0x0020 Backlight brightness
1488 0x0040 Audio mixer/volume control
1489 ============= ======================================
1490
1491There is also a kernel build option to enable more debugging
1492information, which may be necessary to debug driver problems.
1493
1494The level of debugging information output by the driver can be changed
1495at runtime through sysfs, using the driver attribute debug_level. The
1496attribute takes the same bitmask as the debug module parameter above.
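
A short sketch of composing the bitmask from the table above. The values are taken from the table; the dictionary keys are hypothetical labels chosen for this example, not names used by the driver.

```python
# Bit values from the debug bitmask table; the key names are illustrative.
DEBUG_CLASSES = {
    "init": 0x0001,          # Initialization and probing
    "removal": 0x0002,       # Removal
    "rfkill": 0x0004,        # RF Transmitter control (RFKILL)
    "hkey": 0x0008,          # HKEY event interface, hotkeys
    "fan": 0x0010,           # Fan control
    "brightness": 0x0020,    # Backlight brightness
    "mixer": 0x0040,         # Audio mixer/volume control
    "disclose_pid": 0x8000,  # Disclose PID of userspace programs
}


def debug_mask(*names):
    """Combine the named debug classes into a single bitmask."""
    mask = 0
    for name in names:
        mask |= DEBUG_CLASSES[name]
    return mask


print(f"modprobe thinkpad_acpi debug={debug_mask('fan', 'brightness'):#06x}")
# modprobe thinkpad_acpi debug=0x0030
```

Because every class uses a distinct bit, OR-ing the values gives the same result as adding them.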
1497
1498
1499Force loading of module
1500-----------------------
1501
1502If thinkpad-acpi refuses to detect your ThinkPad, you can try to specify
1503the module parameter force_load=1. Regardless of whether this works or
1504not, please contact ibm-acpi-devel@lists.sourceforge.net with a report.
1505
1506
1507Sysfs interface changelog
1508^^^^^^^^^^^^^^^^^^^^^^^^^
1509
1510========= ===============================================================
15110x000100: Initial sysfs support, as a single platform driver and
1512 device.
15130x000200: Hot key support for 32 hot keys, and radio slider switch
1514 support.
15150x010000: Hot keys are now handled by default over the input
1516 layer, the radio switch generates input event EV_RADIO,
1517 and the driver enables hot key handling by default in
1518 the firmware.
1519
15200x020000: ABI fix: added a separate hwmon platform device and
1521 driver, which must be located by name (thinkpad)
1522 and the hwmon class for libsensors4 (lm-sensors 3)
1523 compatibility. Moved all hwmon attributes to this
1524 new platform device.
1525
15260x020100: Marker for thinkpad-acpi with hot key NVRAM polling
1527 support. If you must, use it to know you should not
1528 start a userspace NVRAM poller (this makes it possible to detect when
1529 NVRAM is compiled out by the user because it is
1530 unneeded/undesired in the first place).
15310x020101: Marker for thinkpad-acpi with hot key NVRAM polling
1532 and proper hotkey_mask semantics (version 8 of the
1533 NVRAM polling patch). Some development snapshots of
1534 0.18 had an earlier version that did strange things
1535 to hotkey_mask.
1536
15370x020200: Add poll()/select() support to the following attributes:
1538 hotkey_radio_sw, wakeup_hotunplug_complete, wakeup_reason
1539
15400x020300: hotkey enable/disable support removed, attributes
1541 hotkey_bios_enabled and hotkey_enable deprecated and
1542 marked for removal.
1543
15440x020400: Marker for 16 LEDs support. Also, LEDs that are known
1545 to not exist in a given model are not registered with
1546 the LED sysfs class anymore.
1547
15480x020500: Updated hotkey driver, hotkey_mask is always available
1549 and it is always able to disable hot keys. Very old
1550 thinkpads are properly supported. hotkey_bios_mask
1551 is deprecated and marked for removal.
1552
15530x020600: Marker for backlight change event support.
1554
15550x020700: Support for mute-only mixers.
1556 Volume control in read-only mode by default.
1557 Marker for ALSA mixer support.
1558
15590x030000: Thermal and fan sysfs attributes were moved to the hwmon
1560 device instead of being attached to the backing platform
1561 device.
1562========= ===============================================================
diff --git a/Documentation/admin-guide/laptops/toshiba_haps.rst b/Documentation/admin-guide/laptops/toshiba_haps.rst
new file mode 100644
index 000000000000..d28b6c3f2849
--- /dev/null
+++ b/Documentation/admin-guide/laptops/toshiba_haps.rst
@@ -0,0 +1,87 @@
1====================================
2Toshiba HDD Active Protection Sensor
3====================================
4
5Kernel driver: toshiba_haps
6
7Author: Azael Avalos <coproscefalo@gmail.com>
8
9
10.. 0. Contents
11
12 1. Description
13 2. Interface
14 3. Accelerometer axes
15 4. Supported devices
16 5. Usage
17
18
191. Description
20--------------
21
22This driver provides support for the accelerometer found in various Toshiba
23laptops, officially called "Toshiba HDD Protection - Shock Sensor",
24and automatically detects laptops that have this device.
25On Windows, Toshiba-provided software monitors this device and provides
26automatic HDD protection (head unload) on sudden moves or harsh vibrations.
27This driver, however, only provides a notification via a sysfs file to let
28userspace tools or daemons act accordingly, as well as a sysfs
29file to set the desired protection level or sensor sensitivity.
30
31
322. Interface
33------------
34
35This device comes with 3 methods:
36
37==== =====================================================================
38_STA Checks existence of the device, returning Zero if the device does not
39 exist or is not supported.
40PTLV Sets the desired protection level.
41RSSS Shuts down the HDD protection interface for a few seconds,
42 then restores normal operation.
43==== =====================================================================
44
45Note:
46 The presence of Solid State Drives (SSD) can make this driver fail to load,
47 because such drives have no moving parts and thus do not require
48 any "protection", and the evaluation of the _STA method
49 found under this device fails as well.
50
51
523. Accelerometer axes
53---------------------
54
55This device does not report any axes. However, a couple of HCI
56(Hardware Configuration Interface) calls (0x6D and 0xA6) are
57provided to query the sensor position; they are handled by the kernel module
58toshiba_acpi since kernel version 3.15.
59
60
614. Supported devices
62--------------------
63
64This driver binds itself to the ACPI device TOS620A, and any Toshiba laptop
65with this device is supported, provided that it has a conventional HDD,
66either alone or in combination with an SSD, rather than only an SSD.
67
68
695. Usage
70--------
71
72The sysfs files under /sys/devices/LNXSYSTM:00/LNXSYBUS:00/TOS620A:00/ are:
73
74================ ============================================================
75protection_level The protection_level is readable and writeable, and
76 provides a way to let userspace query the current protection
77 level, as well as set the desired protection level, the
78 available protection levels are::
79
80 ============ ======= ========== ========
81 0 - Disabled 1 - Low 2 - Medium 3 - High
82 ============ ======= ========== ========
83
84reset_protection The reset_protection entry is write-only, and "1" is
85 the only parameter it accepts. It is used to trigger
86 a reset of the protection interface.
87================ ============================================================
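
The protection levels listed above can be turned into the value to write with a small helper. This is a sketch only: the function name and the lowercase level labels are hypothetical, and the directory is the one quoted in the Usage section, which may differ between machines.

```python
# Level values from the table above; the lowercase labels are illustrative.
PROTECTION_LEVELS = {"disabled": 0, "low": 1, "medium": 2, "high": 3}

# Directory quoted in the Usage section; may differ on a given laptop.
HAPS_DIR = "/sys/devices/LNXSYSTM:00/LNXSYBUS:00/TOS620A:00"


def protection_level_write(name):
    """Return (sysfs path, string to write) for a named protection level."""
    value = PROTECTION_LEVELS[name.lower()]
    return f"{HAPS_DIR}/protection_level", str(value)


path, value = protection_level_write("Medium")
print(path, value)
```

The helper only builds the path and value; performing the actual write requires sufficient privileges on the sysfs file.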
diff --git a/Documentation/admin-guide/lcd-panel-cgram.rst b/Documentation/admin-guide/lcd-panel-cgram.rst
new file mode 100644
index 000000000000..a3eb00c62f53
--- /dev/null
+++ b/Documentation/admin-guide/lcd-panel-cgram.rst
@@ -0,0 +1,27 @@
1======================================
2Parallel port LCD/Keypad Panel support
3======================================
4
5Some LCDs allow you to define up to 8 characters, mapped to ASCII
6characters 0 to 7. The escape code to define a new character is
7'\e[LG' followed by one digit from 0 to 7, representing the character
8number, and up to 8 couples of hex digits terminated by a semi-colon
9(';'). Each couple of digits represents a line, with 1-bits for each
10illuminated pixel with LSB on the right. Lines are numbered from the
11top of the character to the bottom. On a 5x7 matrix, only the 5 lower
12bits of the first 7 bytes are used for each character. If the string
13is incomplete, only complete lines will be redefined. Here are some
14examples::
15
16 printf "\e[LG0010101050D1F0C04;" => 0 = [enter]
17 printf "\e[LG1040E1F0000000000;" => 1 = [up]
18 printf "\e[LG2000000001F0E0400;" => 2 = [down]
19 printf "\e[LG3040E1F001F0E0400;" => 3 = [up-down]
20 printf "\e[LG40002060E1E0E0602;" => 4 = [left]
21 printf "\e[LG500080C0E0F0E0C08;" => 5 = [right]
22 printf "\e[LG60016051516141400;" => 6 = "IP"
23
24 printf "\e[LG00103071F1F070301;" => big speaker
25 printf "\e[LG00002061E1E060200;" => small speaker
26
27Willy
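
The encoding described above can be sketched in Python. This is only an illustration, not part of the driver: the helper names ``rows_from_art`` and ``cgram_escape`` are hypothetical, and the ``#``/``.`` drawing notation is an assumption made for readability. Each row becomes one pair of hex digits, with the rightmost pixel as the LSB.

```python
ESC = "\x1b"  # the '\e' used in the printf examples


def rows_from_art(art):
    """Turn rows drawn with '#' (lit) and '.' (dark) into integers.

    The rightmost pixel is the least significant bit.
    """
    return [int(line.replace("#", "1").replace(".", "0"), 2) for line in art]


def cgram_escape(index, rows):
    """Build the escape sequence that defines custom character `index`."""
    if not 0 <= index <= 7:
        raise ValueError("character number must be between 0 and 7")
    if len(rows) > 8:
        raise ValueError("at most 8 rows can be defined")
    body = "".join(f"{row & 0xFF:02X}" for row in rows)
    return f"{ESC}[LG{index}{body};"


# The [up] arrow from the examples above, drawn on a 5-pixel-wide grid.
UP_ARROW = [
    "..#..",
    ".###.",
    "#####",
    ".....",
    ".....",
    ".....",
    ".....",
    ".....",
]

if __name__ == "__main__":
    # repr() keeps the escape character visible instead of sending it
    # to the terminal.
    print(repr(cgram_escape(1, rows_from_art(UP_ARROW))))
```

Writing the resulting string to the panel device should have the same effect as the corresponding printf line, assuming the device accepts the sequence as documented.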
diff --git a/Documentation/admin-guide/ldm.rst b/Documentation/admin-guide/ldm.rst
new file mode 100644
index 000000000000..12c571368e73
--- /dev/null
+++ b/Documentation/admin-guide/ldm.rst
@@ -0,0 +1,121 @@
1==========================================
2LDM - Logical Disk Manager (Dynamic Disks)
3==========================================
4
5:Author: Originally Written by FlatCap - Richard Russon <ldm@flatcap.org>.
6:Last Updated: Anton Altaparmakov on 30 March 2007 for Windows Vista.
7
8Overview
9--------
10
11Windows 2000, XP, and Vista use a new partitioning scheme. It is a complete
12replacement for the MSDOS style partitions. It stores its information in a
131MiB journalled database at the end of the physical disk. The size of
14partitions is limited only by disk space. The maximum number of partitions is
15nearly 2000.
16
17Any partitions created under the LDM are called "Dynamic Disks". There are no
18longer any primary or extended partitions. Normal MSDOS style partitions are
19now known as Basic Disks.
20
21If you wish to use Spanned, Striped, Mirrored or RAID 5 Volumes, you must use
22Dynamic Disks. The journalling allows Windows to make changes to these
23partitions and filesystems without the need to reboot.
24
25Once the LDM driver has divided up the disk, you can use the MD driver to
26assemble any multi-partition volumes, e.g. Stripes, RAID5.
27
28To prevent legacy applications from repartitioning the disk, the LDM creates a
29dummy MSDOS partition containing one disk-sized partition. This is what is
30supported with the Linux LDM driver.
31
32A newer approach that has been implemented with Vista is to put LDM on top of a
33GPT label disk. This is not supported by the Linux LDM driver yet.
34
35
36Example
37-------
38
39Below we have a 50MiB disk, divided into seven partitions.
40
41.. note::
42
43 The missing 1MiB at the end of the disk is where the LDM database is
44 stored.
45
46+-------++--------------+---------+-----++--------------+---------+----+
47|Device || Offset Bytes | Sectors | MiB || Size Bytes | Sectors | MiB|
48+=======++==============+=========+=====++==============+=========+====+
49|hda || 0 | 0 | 0 || 52428800 | 102400 | 50|
50+-------++--------------+---------+-----++--------------+---------+----+
51|hda1 || 51380224 | 100352 | 49 || 1048576 | 2048 | 1|
52+-------++--------------+---------+-----++--------------+---------+----+
53|hda2 || 16384 | 32 | 0 || 6979584 | 13632 | 6|
54+-------++--------------+---------+-----++--------------+---------+----+
55|hda3 || 6995968 | 13664 | 6 || 10485760 | 20480 | 10|
56+-------++--------------+---------+-----++--------------+---------+----+
57|hda4 || 17481728 | 34144 | 16 || 4194304 | 8192 | 4|
58+-------++--------------+---------+-----++--------------+---------+----+
59|hda5 || 21676032 | 42336 | 20 || 5242880 | 10240 | 5|
60+-------++--------------+---------+-----++--------------+---------+----+
61|hda6 || 26918912 | 52576 | 25 || 10485760 | 20480 | 10|
62+-------++--------------+---------+-----++--------------+---------+----+
63|hda7 || 37404672 | 73056 | 35 || 13959168 | 27264 | 13|
64+-------++--------------+---------+-----++--------------+---------+----+
65
66The LDM Database may not store the partitions in the order that they appear on
67disk, but the driver will sort them.
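
The byte and MiB columns in the table above follow directly from the sector counts. The short Python sketch below reproduces a few rows; it assumes 512-byte sectors (which is what the table's numbers imply) and rounds MiB values down. The function name ``describe`` is hypothetical.

```python
SECTOR_SIZE = 512   # assumption: bytes per sector, consistent with the table
MIB = 1 << 20       # bytes per MiB


def describe(start_sector, size_sectors):
    """Return (offset bytes, offset MiB, size bytes, size MiB)."""
    offset = start_sector * SECTOR_SIZE
    size = size_sectors * SECTOR_SIZE
    return offset, offset // MIB, size, size // MIB


# (start sector, size in sectors) copied from the table above.
PARTITIONS = {
    "hda1": (100352, 2048),
    "hda2": (32, 13632),
    "hda3": (13664, 20480),
}

if __name__ == "__main__":
    for name, (start, length) in PARTITIONS.items():
        print(name, describe(start, length))
```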
68
69When Linux boots, you will see something like::
70
71 hda: 102400 sectors w/32KiB Cache, CHS=50/64/32
72 hda: [LDM] hda1 hda2 hda3 hda4 hda5 hda6 hda7
73
74
75Compiling LDM Support
76---------------------
77
78To enable LDM, choose the following two options:
79
80 - "Advanced partition selection" CONFIG_PARTITION_ADVANCED
81 - "Windows Logical Disk Manager (Dynamic Disk) support" CONFIG_LDM_PARTITION
82
83If you believe the driver isn't working as it should, you can enable the extra
84debugging code. This will produce a LOT of output. The option is:
85
86 - "Windows LDM extra logging" CONFIG_LDM_DEBUG
87
88N.B. The partition code cannot be compiled as a module.
89
90As with all the partition code, if the driver doesn't see signs of its type of
91partition, it will pass control to another driver, so there is no harm in
92enabling it.
93
94If you have Dynamic Disks but don't enable the driver, then all you will see
95is a dummy MSDOS partition filling the whole disk. You won't be able to mount
96any of the volumes on the disk.
97
98
99Booting
100-------
101
102If you enable LDM support, then lilo is capable of booting from any of the
103discovered partitions. However, grub does not understand the LDM partitioning
104and cannot boot from a Dynamic Disk.
105
106
107More Documentation
108------------------
109
110There is an Overview of the LDM together with complete Technical Documentation.
111It is available for download.
112
113 http://www.linux-ntfs.org/
114
115If you have any LDM questions that aren't answered in the documentation, email
116me.
117
118Cheers,
119 FlatCap - Richard Russon
120 ldm@flatcap.org
121
diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst
new file mode 100644
index 000000000000..290840c160af
--- /dev/null
+++ b/Documentation/admin-guide/lockup-watchdogs.rst
@@ -0,0 +1,83 @@
1===============================================================
2Softlockup detector and hardlockup detector (aka nmi_watchdog)
3===============================================================
4
5The Linux kernel can act as a watchdog to detect both soft and hard
6lockups.
7
8A 'softlockup' is defined as a bug that causes the kernel to loop in
9kernel mode for more than 20 seconds (see "Implementation" below for
10details), without giving other tasks a chance to run. The current
11stack trace is displayed upon detection and, by default, the system
12will stay locked up. Alternatively, the kernel can be configured to
13panic; a sysctl, "kernel.softlockup_panic", a kernel parameter,
14"softlockup_panic" (see "Documentation/admin-guide/kernel-parameters.rst" for
15details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are
16provided for this.
17
18A 'hardlockup' is defined as a bug that causes the CPU to loop in
19kernel mode for more than 10 seconds (see "Implementation" below for
20details), without letting other interrupts have a chance to run.
21Similarly to the softlockup case, the current stack trace is displayed
22upon detection and the system will stay locked up unless the default
23behavior is changed, which can be done through a sysctl,
24'hardlockup_panic', a compile time knob, "BOOTPARAM_HARDLOCKUP_PANIC",
25and a kernel parameter, "nmi_watchdog"
26(see "Documentation/admin-guide/kernel-parameters.rst" for details).
27
28The panic option can be used in combination with panic_timeout (this
29timeout is set through the confusingly named "kernel.panic" sysctl),
30to cause the system to reboot automatically after a specified amount
31of time.
32
33Implementation
34==============
35
36The soft and hard lockup detectors are built on top of the hrtimer and
37perf subsystems, respectively. A direct consequence of this is that,
38in principle, they should work in any architecture where these
39subsystems are present.
40
41A periodic hrtimer runs to generate interrupts and kick the watchdog
42task. An NMI perf event is generated every "watchdog_thresh"
43(compile-time initialized to 10 and configurable through sysctl of the
44same name) seconds to check for hardlockups. If any CPU in the system
45does not receive any hrtimer interrupt during that time the
46'hardlockup detector' (the handler for the NMI perf event) will
47generate a kernel warning or call panic, depending on the
48configuration.
49
50The watchdog task is a high priority kernel thread that updates a
51timestamp every time it is scheduled. If that timestamp is not updated
52for 2*watchdog_thresh seconds (the softlockup threshold) the
53'softlockup detector' (coded inside the hrtimer callback function)
54will dump useful debug information to the system log, after which it
55will call panic if it was instructed to do so or resume execution of
56other kernel code.
57
58The period of the hrtimer is 2*watchdog_thresh/5, which means it has
59two or three chances to generate an interrupt before the hardlockup
60detector kicks in.
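
The relationship between these intervals can be checked with a few lines of Python. This is plain arithmetic on the formulas given above, not kernel code; the function name ``watchdog_timings`` is hypothetical.

```python
def watchdog_timings(watchdog_thresh=10):
    """Derive the detector intervals (in seconds) from watchdog_thresh."""
    hrtimer_period = 2 * watchdog_thresh / 5
    return {
        # Interval between hrtimer interrupts that kick the watchdog task.
        "hrtimer_period": hrtimer_period,
        # Interval between NMI perf events (hardlockup check).
        "hardlockup_threshold": watchdog_thresh,
        # Time without a timestamp update before a softlockup is reported.
        "softlockup_threshold": 2 * watchdog_thresh,
        # hrtimer periods that fit into one hardlockup check interval.
        "hrtimer_periods_per_check": watchdog_thresh / hrtimer_period,
    }


if __name__ == "__main__":
    for name, value in watchdog_timings(10).items():
        print(f"{name}: {value}")
```

With the default of 10, the hrtimer fires every 4 seconds, so 2.5 periods fit into one hardlockup check interval, which is where the "two or three chances" come from.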
61
62As explained above, a kernel knob is provided that allows
63administrators to configure the period of the hrtimer and the perf
64event. The right value for a particular environment is a trade-off
65between fast response to lockups and detection overhead.
66
67By default, the watchdog runs on all online cores. However, on a
68kernel configured with NO_HZ_FULL, by default the watchdog runs only
69on the housekeeping cores, not the cores specified in the "nohz_full"
70boot argument. If we allowed the watchdog to run by default on
71the "nohz_full" cores, we would have to run timer ticks to activate
72the scheduler, which would prevent the "nohz_full" functionality
73from protecting the user code on those cores from the kernel.
74Of course, disabling it by default on the nohz_full cores means that
75when those cores do enter the kernel, by default we will not be
76able to detect if they lock up. However, allowing the watchdog
77to continue to run on the housekeeping (non-tickless) cores means
78that we will continue to detect lockups properly on those cores.
79
80In either case, the set of cores excluded from running the watchdog
81may be adjusted via the kernel.watchdog_cpumask sysctl. For
82nohz_full cores, this may be useful for debugging a case where the
83kernel seems to be hanging on the nohz_full cores.
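
A small Python sketch for interpreting such a CPU list follows. It assumes that the sysctl uses the kernel's usual "cpulist" notation (comma-separated CPU numbers and ranges such as ``0,2-4``); that assumption, and the helper name ``parse_cpulist``, are not taken from this document. Only plain numbers and ranges are handled.

```python
def parse_cpulist(text):
    """Expand a list such as '0,2-4' into [0, 2, 3, 4]."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(parse_cpulist("0,2-4"))
```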
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
new file mode 100644
index 000000000000..4e06ffabd78a
--- /dev/null
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -0,0 +1,25 @@
1=====================
2CMA Debugfs Interface
3=====================
4
5The CMA debugfs interface is useful to retrieve basic information out of the
6different CMA areas and to test allocation/release in each of the areas.
7
8Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
9kernel's CMA index. So the first CMA zone would be:
10
11 <debugfs>/cma/cma-0
12
13The structure of the files created under that directory is as follows:
14
15 - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
16 - [RO] count: Amount of memory in the CMA area.
17 - [RO] order_per_bit: Order of pages represented by one bit.
18 - [RO] bitmap: The bitmap of page states in the zone.
19 - [WO] alloc: Allocate N pages from that CMA area. For example::
20
21 echo 5 > <debugfs>/cma/cma-2/alloc
22
23would try to allocate 5 pages from the cma-2 area.
24
25 - [WO] free: Free N pages from that CMA area, similar to the above.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index ddf8d8d33377..11db46448354 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -11,7 +11,7 @@ processes address space and many other cool things.
11Linux memory management is a complex system with many configurable 11Linux memory management is a complex system with many configurable
12settings. Most of these settings are available via ``/proc`` 12settings. Most of these settings are available via ``/proc``
13filesystem and can be queried and adjusted using ``sysctl``. These APIs 13filesystem and can be queried and adjusted using ``sysctl``. These APIs
14are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. 14are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
15 15
16.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html 16.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
17 17
@@ -26,6 +26,7 @@ the Linux memory management.
26 :maxdepth: 1 26 :maxdepth: 1
27 27
28 concepts 28 concepts
29 cma_debugfs
29 hugetlbpage 30 hugetlbpage
30 idle_page_tracking 31 idle_page_tracking
31 ksm 32 ksm
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..874eb0c77d34 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -59,7 +59,7 @@ MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
59 59
60If a region of memory must be split into at least one new MADV_MERGEABLE 60If a region of memory must be split into at least one new MADV_MERGEABLE
61or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process 61or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
62will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). 62will exceed ``vm.max_map_count`` (see Documentation/admin-guide/sysctl/vm.rst).
63 63
64Like other madvise calls, they are intended for use on mapped areas of 64Like other madvise calls, they are intended for use on mapped areas of
65the user address space: they will report ENOMEM if the specified range 65the user address space: they will report ENOMEM if the specified range
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 546f174e5d6a..8463f5538fda 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
15support. 15support.
16 16
17Memory policies should not be confused with cpusets 17Memory policies should not be confused with cpusets
18(``Documentation/cgroup-v1/cpusets.rst``) 18(``Documentation/admin-guide/cgroup-v1/cpusets.rst``)
19which is an administrative mechanism for restricting the nodes from which 19which is an administrative mechanism for restricting the nodes from which
20memory may be allocated by a set of processes. Memory policies are a 20memory may be allocated by a set of processes. Memory policies are a
21programming interface that a NUMA-aware application can take advantage of. When 21programming interface that a NUMA-aware application can take advantage of. When
diff --git a/Documentation/admin-guide/namespaces/compatibility-list.rst b/Documentation/admin-guide/namespaces/compatibility-list.rst
new file mode 100644
index 000000000000..318800b2a943
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/compatibility-list.rst
@@ -0,0 +1,43 @@
1=============================
2Namespaces compatibility list
3=============================
4
5This document describes the problems users
6may have when creating tasks living in different namespaces.
7
8Here's the summary. This matrix shows the known problems that
9occur when tasks share some namespace (the columns) while living
10in different other namespaces (the rows):
11
12==== === === === === ==== ===
13-    UTS IPC VFS PID User Net
14==== === === === === ==== ===
15UTS   X
16IPC       X   1
17VFS           X
18PID       1   1   X
19User      2   2       X
20Net                       X
21==== === === === === ==== ===
22
231. Both the IPC and the PID namespaces provide IDs to address
24 objects inside the kernel, e.g. a semaphore with an IPCID or a
25 process group with a pid.
26
27 In both cases, tasks shouldn't try exposing this ID to some
28 other task living in a different namespace via a shared filesystem
29 or IPC shmem/message. The fact is that this ID is only valid
30 within the namespace it was obtained in and may refer to some
31 other object in another namespace.
32
332. Intentionally, two equal user IDs in different user namespaces
34 should not be equal from the VFS point of view. In other
35 words, user 10 in one user namespace shouldn't have the same
36 access permissions to files belonging to user 10 in another
37 namespace.
38
39 The same is true for shared IPC namespaces: two users
40 from different user namespaces should not access the same IPC objects
41 even if they have equal UIDs.
42
43 But currently this is not so.
diff --git a/Documentation/admin-guide/namespaces/index.rst b/Documentation/admin-guide/namespaces/index.rst
new file mode 100644
index 000000000000..384f2e0f33d2
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/index.rst
@@ -0,0 +1,11 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3==========
4Namespaces
5==========
6
7.. toctree::
8 :maxdepth: 1
9
10 compatibility-list
11 resource-control
diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst
new file mode 100644
index 000000000000..369556e00f0c
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/resource-control.rst
@@ -0,0 +1,18 @@
1===========================
2Namespaces resource control
3===========================
4
5There are many kinds of objects in the kernel that don't have
6individual limits or that have limits that are ineffective when a set
7of processes is allowed to switch user ids. With user namespaces
8enabled in a kernel for people who don't trust their users or their
9users' programs to play nice, this problem becomes more acute.
10
11Therefore it is recommended that memory control groups be enabled in
12kernels that enable user namespaces, and it is further recommended
13that userspace configure memory control groups to limit how much
14memory users they don't trust to play nice can use.
15
16Memory control groups can be configured by installing the libcgroup
17package (present on most distros), editing /etc/cgrules.conf and
18/etc/cgconfig.conf, and setting up libpam-cgroup.
diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
new file mode 100644
index 000000000000..aaf1667489f8
--- /dev/null
+++ b/Documentation/admin-guide/numastat.rst
@@ -0,0 +1,30 @@
1===============================
2Numa policy hit/miss statistics
3===============================
4
5/sys/devices/system/node/node*/numastat
6
7All units are pages. Hugepages have separate counters.
8
9=============== ============================================================
10numa_hit A process wanted to allocate memory from this node,
11 and succeeded.
12
13numa_miss A process wanted to allocate memory from another node,
14 but ended up with memory from this node.
15
16numa_foreign A process wanted to allocate on this node,
17 but ended up with memory from another one.
18
19local_node A process ran on this node and got memory from it.
20
21other_node A process ran on this node and got memory from another node.
22
23interleave_hit Interleaving wanted to allocate from this node
24 and succeeded.
25=============== ============================================================
26
27For easier reading you can use the numastat utility from the numactl package
28(http://oss.sgi.com/projects/libnuma/). Note that it only works
29well right now on machines with a small number of CPUs.
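
As an alternative for scripting, the counters can be read directly. The Python sketch below parses the contents of one numastat file, assuming each line has the form ``name value``; the sample text contains made-up numbers for illustration only, and the function names are hypothetical.

```python
def parse_numastat(text):
    """Parse 'name value' lines into a dict of integer page counts."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def hit_fraction(stats):
    """Fraction of allocations served by this node that were intended for it."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_hit"] / total if total else 0.0


# Illustrative sample only; real values come from
# /sys/devices/system/node/node<N>/numastat.
SAMPLE = """\
numa_hit 900
numa_miss 100
numa_foreign 50
interleave_hit 10
local_node 880
other_node 120
"""

if __name__ == "__main__":
    stats = parse_numastat(SAMPLE)
    print(stats)
    print(f"hit fraction: {hit_fraction(stats):.2f}")
```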
30
diff --git a/Documentation/admin-guide/perf/arm-ccn.rst b/Documentation/admin-guide/perf/arm-ccn.rst
new file mode 100644
index 000000000000..832b0c64023a
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm-ccn.rst
@@ -0,0 +1,61 @@
1==========================
2ARM Cache Coherent Network
3==========================
4
5CCN-504 is a ring-bus interconnect consisting of 11 crosspoints
6(XPs), with each crosspoint supporting up to two device ports,
7so nodes (devices) 0 and 1 are connected to crosspoint 0,
8nodes 2 and 3 to crosspoint 1 etc.
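
The node numbering above can be expressed as a small Python helper. The crosspoint index follows the text directly; deriving the device port as ``node % 2`` is an assumption based on "two device ports" per crosspoint, and the function name is hypothetical.

```python
NUM_CROSSPOINTS = 11   # CCN-504, as stated above
PORTS_PER_XP = 2


def crosspoint_for_node(node):
    """Return (crosspoint index, assumed device port) for a node number."""
    if not 0 <= node < NUM_CROSSPOINTS * PORTS_PER_XP:
        raise ValueError("node number out of range for CCN-504")
    return node // PORTS_PER_XP, node % PORTS_PER_XP


if __name__ == "__main__":
    for node in (0, 1, 2, 3):
        print(node, crosspoint_for_node(node))
```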
9
10PMU (perf) driver
11-----------------
12
13The CCN driver registers a perf PMU driver, which provides
14description of available events and configuration options
15in sysfs, see /sys/bus/event_source/devices/ccn*.
16
17The "format" directory describes format of the config, config1
18and config2 fields of the perf_event_attr structure. The "events"
19directory provides configuration templates for all documented
20events, that can be used with perf tool. For example "xp_valid_flit"
21is an equivalent of "type=0x8,event=0x4". Other parameters must be
22explicitly specified.
23
24For events originating from device, "node" defines its index.
25
26Crosspoint PMU events require "xp" (index), "bus" (bus number)
27and "vc" (virtual channel ID).
28
29Crosspoint watchpoint-based events (special "event" value 0xfe)
30require "xp" and "vc" as as above plus "port" (device port index),
31"dir" (transmit/receive direction), comparator values ("cmp_l"
32and "cmp_h") and "mask", being index of the comparator mask.
33
34Masks are defined separately from the event description
35(due to limited number of the config values) in the "cmp_mask"
36directory, with first 8 configurable by user and additional
374 hardcoded for the most frequent use cases.
38
39Cycle counter is described by a "type" value 0xff and does
40not require any other settings.
41
42The driver also provides a "cpumask" sysfs attribute, which contains
43a single CPU ID, of the processor which will be used to handle all
44the CCN PMU events. It is recommended that the user space tools
45request the events on this processor (if not, the perf_event->cpu value
46will be overwritten anyway). In case of this processor being offlined,
47the events are migrated to another one and the attribute is updated.
48
49Example of perf tool use::
50
51 / # perf list | grep ccn
52 ccn/cycles/ [Kernel PMU event]
53 <...>
54 ccn/xp_valid_flit,xp=?,port=?,vc=?,dir=?/ [Kernel PMU event]
55 <...>
56
57 / # perf stat -a -e ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/ \
58 sleep 1
59
60The driver does not support sampling, therefore "perf record" will
61not work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/arm_dsu_pmu.rst b/Documentation/admin-guide/perf/arm_dsu_pmu.rst
new file mode 100644
index 000000000000..7fd34db75d13
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm_dsu_pmu.rst
@@ -0,0 +1,29 @@
1==================================
2ARM DynamIQ Shared Unit (DSU) PMU
3==================================
4
5ARM DynamIQ Shared Unit integrates one or more cores with an L3 memory system,
6control logic and external interfaces to form a multicore cluster. The PMU
7allows counting the various events related to the L3 cache, Snoop Control Unit
8etc, using 32bit independent counters. It also provides a 64bit cycle counter.
9
10The PMU can only be accessed via CPU system registers and is common to the
11cores connected to the same DSU. Like most of the other uncore PMUs, the DSU
12PMU doesn't support process-specific events and cannot be used in sampling mode.
13
14The DSU provides a bitmap for a subset of implemented events via hardware
15registers. There is no way for the driver to determine if the other events
16are available or not. Hence the driver exposes only those events advertised
17by the DSU, in "events" directory under::
18
19 /sys/bus/event_source/devices/arm_dsu_<N>/
20
21The user should refer to the TRM of the product to figure out the supported events
22and use the raw event code for the unlisted events.
23
24The driver also exposes the CPUs connected to the DSU instance in "associated_cpus".
25
26
27e.g. usage::
28
29 perf stat -a -e arm_dsu_0/cycles/
diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst
new file mode 100644
index 000000000000..404a5c3d9d00
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pmu.rst
@@ -0,0 +1,60 @@
1======================================================
2HiSilicon SoC uncore Performance Monitoring Unit (PMU)
3======================================================
4
5The HiSilicon SoC chip includes various independent system device PMUs
6such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
7independent and have hardware logic to gather statistics and performance
8information.
9
10The HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster
11(CCL) is made up of 4 cpu cores sharing one L3 cache; each CPU die is
12called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has
13two HHAs (0 - 1) and four DDRCs (0 - 3), respectively.
14
15HiSilicon SoC uncore PMU driver
16-------------------------------
17
18Each device PMU has separate registers for event counting, control and
19interrupt, and the PMU driver shall register perf PMU drivers like L3C,
20HHA and DDRC etc. The available events and configuration options shall
21be described in the sysfs, see:
22
23/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or
24/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>.
25The "perf list" command shall list the available events from sysfs.
26
27Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU
28name will appear in the event listing as hisi_sccl<sccl-id>_module<index-id>,
29where "sccl-id" is the identifier of the SCCL and "index-id" is the index of
30the module.
31
32e.g. hisi_sccl3_l3c0/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 in
33SCCL ID #3.
34
35e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in
36SCCL ID #1.
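
The naming scheme can be sketched as a small Python helper that assembles the string passed to ``perf stat -e``. The function name ``hisi_event`` is hypothetical, and the set of module names is taken only from the modules mentioned in this document.

```python
MODULES = ("l3c", "hha", "ddrc")  # modules named in this document


def hisi_event(sccl_id, module, index, event):
    """Build 'hisi_sccl<sccl-id>_<module><index-id>/<event>/'."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    return f"hisi_sccl{sccl_id}_{module}{index}/{event}/"


if __name__ == "__main__":
    print(hisi_event(3, "l3c", 0, "rd_hit_cpipe"))
    print(hisi_event(1, "hha", 0, "rx_operations"))
```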
37
38The driver also provides a "cpumask" sysfs attribute, which shows the CPU core
39ID used to count the uncore PMU event.
40
41Example usage of perf::
42
43 $# perf list
44 hisi_sccl3_l3c0/rd_hit_cpipe/ [kernel PMU event]
45 ------------------------------------------
46 hisi_sccl3_l3c0/wr_hit_cpipe/ [kernel PMU event]
47 ------------------------------------------
48 hisi_sccl1_l3c0/rd_hit_cpipe/ [kernel PMU event]
49 ------------------------------------------
50 hisi_sccl1_l3c0/wr_hit_cpipe/ [kernel PMU event]
51 ------------------------------------------
52
53 $# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5
54 $# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5
55
56The current driver does not support sampling, so "perf record" is unsupported.
57Attaching to a task is also unsupported, as the events are all uncore.
58
59Note: Please contact the maintainer for a complete list of events supported for
60the PMU devices in the SoC and its information if needed.
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
new file mode 100644
index 000000000000..ee4bfd2a740f
--- /dev/null
+++ b/Documentation/admin-guide/perf/index.rst
@@ -0,0 +1,16 @@
1.. SPDX-License-Identifier: GPL-2.0
2
3===========================
4Performance monitor support
5===========================
6
7.. toctree::
8 :maxdepth: 1
9
10 hisi-pmu
11 qcom_l2_pmu
12 qcom_l3_pmu
13 arm-ccn
14 xgene-pmu
15 arm_dsu_pmu
16 thunderx2-pmu
diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
new file mode 100644
index 000000000000..c130178a4a55
--- /dev/null
+++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
@@ -0,0 +1,39 @@
1=====================================================================
2Qualcomm Technologies Level-2 Cache Performance Monitoring Unit (PMU)
3=====================================================================
4
5This driver supports the L2 cache clusters found in Qualcomm Technologies
6Centriq SoCs. There are multiple physical L2 cache clusters, each with their
7own PMU. Each cluster has one or more CPUs associated with it.
8
9There is one logical L2 PMU exposed, which aggregates the results from
10the physical PMUs.
11
12The driver provides a description of its available events and configuration
13options in sysfs, see /sys/devices/l2cache_0.
14
15The "format" directory describes the format of the events.
16
17Events can be envisioned as a 2-dimensional array. Each column represents
18a group of events. There are 8 groups. Only one entry from each
19group can be in use at a time. If multiple events from the same group
20are specified, the conflicting events cannot be counted at the same time.
21
22Events are specified as 0xCCG, where CC is 2 hex digits specifying
23the code (array row) and G specifies the group (column) 0-7.
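
The 0xCCG layout can be illustrated with a short Python sketch. It only mirrors the encoding described above (code in the upper two hex digits, group in the lowest one); the function names are hypothetical, and the conflict check simply looks for two events that share a group.

```python
def encode_event(code, group):
    """Pack an event code (row) and group (column) into 0xCCG."""
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in two hex digits")
    if not 0 <= group <= 7:
        raise ValueError("group must be between 0 and 7")
    return (code << 4) | group


def decode_event(config):
    """Split a 0xCCG value into (code, group)."""
    return config >> 4, config & 0xF


def conflicting(configs):
    """Return True if any two events belong to the same group."""
    groups = [decode_event(config)[1] for config in configs]
    return len(groups) != len(set(groups))


if __name__ == "__main__":
    print(hex(encode_event(0x04, 2)))     # code 0x04 in group 2
    print(decode_event(0x001))            # code 0 in group 1
    print(conflicting([0x001, 0x042]))    # groups 1 and 2
    print(conflicting([0x001, 0x011]))    # both in group 1
```

The cycle counter value 0xFE is outside this scheme and should not be passed through these helpers.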
24
25In addition there is a cycle counter event specified by the value 0xFE
26which is outside the above scheme.
27
28The driver provides a "cpumask" sysfs attribute which contains a mask
29consisting of one CPU per cluster which will be used to handle all the PMU
30events on that cluster.
31
32Examples for use with perf::
33
34 perf stat -e l2cache_0/config=0x001/,l2cache_0/config=0x042/ -a sleep 1
35
36 perf stat -e l2cache_0/config=0xfe/ -C 2 sleep 1
37
38The driver does not support sampling, therefore "perf record" will
39not work. Per-task perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
new file mode 100644
index 000000000000..a3d014a46bfd
--- /dev/null
+++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
@@ -0,0 +1,26 @@
1===========================================================================
2Qualcomm Datacenter Technologies L3 Cache Performance Monitoring Unit (PMU)
3===========================================================================
4
5This driver supports the L3 cache PMUs found in Qualcomm Datacenter Technologies
6Centriq SoCs. The L3 cache on these SoCs is composed of multiple slices, shared
7by all cores within a socket. Each slice is exposed as a separate uncore perf
8PMU with device name l3cache_<socket>_<instance>. User space is responsible
9for aggregating across slices.
10
11The driver provides a description of its available events and configuration
12options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs
13the driver also exposes a "cpumask" sysfs attribute which contains a mask
14consisting of one CPU per socket which will be used to handle all the PMU
15events on that socket.
16
17The hardware implements 32bit event counters and has a flat 8bit event space
18exposed via the "event" format attribute. In addition to the 32bit physical
19counters the driver supports virtual 64bit hardware counters by using hardware
20counter chaining. This feature is exposed via the "lc" (long counter) format
21flag. E.g.::
22
23 perf stat -e l3cache_0_0/read-miss,lc/
24
25Given that these are uncore PMUs the driver does not support sampling, therefore
26"perf record" will not work. Per-task perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst
new file mode 100644
index 000000000000..08e33675853a
--- /dev/null
+++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst
@@ -0,0 +1,42 @@
1=============================================================
2Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
3=============================================================
4
5The ThunderX2 SoC PMU consists of independent, system-wide, per-socket
6PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC).
7
8The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles.
9Events are counted for the default channel (i.e. channel 0) and prorated
10to the total number of channels/tiles.
11
12The DMC and L3C support up to 4 counters. Counters are independently
13programmable and can be started and stopped individually. Each counter
14can be set to a different event. Counters are 32-bit and do not support
15an overflow interrupt; they are read every 2 seconds.
16
17PMU UNCORE (perf) driver:
18
19The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and
20L3C devices. Each PMU can be used to count up to 4 events
21simultaneously. The PMUs provide a description of their available events
22and configuration options under sysfs, see
23/sys/devices/uncore_<l3c_S/dmc_S/>; S is the socket id.
24
25The driver does not support sampling, therefore "perf record" will not
26work. Per-task perf sessions are also not supported.
27
28Examples::
29
30 # perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1
31
32 # perf stat -a -e \
33 uncore_dmc_0/cnt_cycles/,\
34 uncore_dmc_0/data_transfers/,\
35 uncore_dmc_0/read_txns/,\
36 uncore_dmc_0/write_txns/ sleep 1
37
38 # perf stat -a -e \
39 uncore_l3c_0/read_request/,\
40 uncore_l3c_0/read_hit/,\
41 uncore_l3c_0/inv_request/,\
42 uncore_l3c_0/inv_hit/ sleep 1
diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst
new file mode 100644
index 000000000000..644f8ed89152
--- /dev/null
+++ b/Documentation/admin-guide/perf/xgene-pmu.rst
@@ -0,0 +1,49 @@
1================================================
2APM X-Gene SoC Performance Monitoring Unit (PMU)
3================================================
4
5X-Gene SoC PMU consists of various independent system device PMUs such as
6L3 cache(s), I/O bridge(s), memory controller bridge(s) and memory
7controller(s). These PMU devices are loosely architected to follow the
8same model as the PMU for ARM cores. The PMUs share the same top level
9interrupt and status CSR region.
10
11PMU (perf) driver
12-----------------
13
14The xgene-pmu driver registers several perf PMU drivers. Each of the perf
15drivers provides a description of its available events and configuration options
16in sysfs; see /sys/devices/<l3cX/iobX/mcbX/mcX>/.
17
18The "format" directory describes the format of the config (event ID) and
19config1 (agent ID) fields of the perf_event_attr structure. The "events"
20directory provides configuration templates for all supported event types that
21can be used with the perf tool. For example, "l3c0/bank-fifo-full/" is
22equivalent to "l3c0/config=0x0b/".
23
24Most of the SoC PMUs have a specific list of agent IDs used for monitoring
25the performance of a specific datapath. For example, the agents of an L3 cache
26can be a specific CPU or an I/O bridge. Each PMU has a set of 2 registers capable
27of masking the agents from which the requests come. If the bit with
28the bit number corresponding to the agent is set, the event is counted only if
29it is caused by a request from that agent. Each agent ID bit is inversely mapped
30to a corresponding bit in the "config1" field. By default, the event will be
31counted for all agent requests (config1 = 0x0). For all the supported agents of
32each PMU, please refer to the APM X-Gene User Manual.
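
The inverse mapping can be sketched as a bitwise NOT of the set of agents to monitor. This reading is an assumption based on the default value (config1 = 0x0 counts all agents) and on the config1=0xfffffffffffffffe example in this document; the helper name is hypothetical.

```c
#include <stdint.h>

/*
 * Build a config1 value from a set of agents to monitor.
 * agent_mask: bit n set means "count requests from agent n".
 * Each agent bit is inverted in config1, so selecting every agent
 * yields the default value 0x0.
 */
uint64_t xgene_config1_for_agents(uint64_t agent_mask)
{
    return ~agent_mask;
}
```

For instance, selecting only agent 0 gives 0xfffffffffffffffe, the value used with l3c0/read-miss in the perf example of this document.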
33
34Each perf driver also provides a "cpumask" sysfs attribute, which contains a
35single CPU ID of the processor which will be used to handle all the PMU events.
36
37Example for perf tool use::
38
39 / # perf list | grep -e l3c -e iob -e mcb -e mc
40 l3c0/ackq-full/ [Kernel PMU event]
41 <...>
42 mcb1/mcb-csw-stall/ [Kernel PMU event]
43
44 / # perf stat -a -e l3c0/read-miss/,mcb1/csw-write-request/ sleep 1
45
46 / # perf stat -a -e l3c0/read-miss,config1=0xfffffffffffffffe/ sleep 1
47
48The driver does not support sampling, therefore "perf record" will
49not work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst
new file mode 100644
index 000000000000..bab2d10631f0
--- /dev/null
+++ b/Documentation/admin-guide/pnp.rst
@@ -0,0 +1,292 @@
1=================================
2Linux Plug and Play Documentation
3=================================
4
5:Author: Adam Belay <ambx1@neo.rr.com>
6:Last updated: Oct. 16, 2002
7
8
9Overview
10--------
11
12Plug and Play provides a means of detecting and setting resources for legacy or
13otherwise unconfigurable devices. The Linux Plug and Play Layer provides these
14services to compatible drivers.
15
16
17The User Interface
18------------------
19
20The Linux Plug and Play user interface provides a means to activate PnP devices
21for legacy and user level drivers that do not support Linux Plug and Play. The
22user interface is integrated into sysfs.
23
24In addition to the standard sysfs files, the following are created in each
25device's directory:
26- id - displays a list of supported EISA IDs
27- options - displays possible resource configurations
28- resources - displays currently allocated resources and allows resource changes
29
30activating a device
31^^^^^^^^^^^^^^^^^^^
32
33::
34
35 # echo "auto" > resources
36
37This invokes the automatic resource configuration system to activate the device.
38
39manually activating a device
40^^^^^^^^^^^^^^^^^^^^^^^^^^^^
41
42::
43
44 # echo "manual <depnum> <mode>" > resources
45
46 <depnum> - the configuration number
47 <mode> - static or dynamic
48 static = for next boot
49 dynamic = now
50
51disabling a device
52^^^^^^^^^^^^^^^^^^
53
54::
55
56 # echo "disable" > resources
57
58
59EXAMPLE:
60
61Suppose you need to activate the floppy disk controller.
62
631. change to the proper directory, in my case it is
64 /driver/bus/pnp/devices/00:0f::
65
66 # cd /driver/bus/pnp/devices/00:0f
67 # cat name
68 PC standard floppy disk controller
69
702. check if the device is already active::
71
72 # cat resources
73 DISABLED
74
75 - Notice the string "DISABLED". This means the device is not active.
76
773. check the device's possible configurations (optional)::
78
79 # cat options
80 Dependent: 01 - Priority acceptable
81 port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding
82 port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding
83 irq 6
84 dma 2 8-bit compatible
85 Dependent: 02 - Priority acceptable
86 port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding
87 port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding
88 irq 6
89 dma 2 8-bit compatible
90
914. now activate the device::
92
93 # echo "auto" > resources
94
955. finally check if the device is active::
96
97 # cat resources
98 io 0x3f0-0x3f5
99 io 0x3f7-0x3f7
100 irq 6
101 dma 2
102
103There is also a series of kernel parameters::
104
105 pnp_reserve_irq=irq1[,irq2] ....
106 pnp_reserve_dma=dma1[,dma2] ....
107 pnp_reserve_io=io1,size1[,io2,size2] ....
108 pnp_reserve_mem=mem1,size1[,mem2,size2] ....
109
110
111
112The Unified Plug and Play Layer
113-------------------------------
114
115All Plug and Play drivers, protocols, and services meet at a central location
116called the Plug and Play Layer. This layer is responsible for the exchange of
117information between PnP drivers and PnP protocols. Thus it automatically
118forwards commands to the proper protocol. This makes writing PnP drivers
119significantly easier.
120
121The following functions are available from the Plug and Play Layer:
122
123pnp_get_protocol
124 increments the number of uses by one
125
126pnp_put_protocol
127 decrements the number of uses by one
128
129pnp_register_protocol
130 use this to register a new PnP protocol
131
132pnp_unregister_protocol
133 use this function to remove a PnP protocol from the Plug and Play Layer
134
135pnp_register_driver
136 adds a PnP driver to the Plug and Play Layer
137
138 this includes driver model integration
139 returns zero for success or a negative error number for failure; count
140 calls to the .add() method if you need to know how many devices bind to
141 the driver
142
143pnp_unregister_driver
144 removes a PnP driver from the Plug and Play Layer
145
146
147
148Plug and Play Protocols
149-----------------------
150
151This section contains information for PnP protocol developers.
152
153The following Protocols are currently available in the computing world:
154
155- PNPBIOS:
156 used for system devices such as serial and parallel ports.
157- ISAPNP:
158 provides PnP support for the ISA bus
159- ACPI:
160 among its many uses, ACPI provides information about system level
161 devices.
162
163ACPI is meant to replace the PNPBIOS. It is not currently supported by Linux
164Plug and Play, but support is planned for the near future.
165
166
167Requirements for a Linux PnP protocol:
1681. the protocol must use EISA IDs
1692. the protocol must inform the PnP Layer of a device's current configuration
170
171- the ability to set resources is optional but preferred.
172
173The following are PnP protocol related functions:
174
175pnp_add_device
176 use this function to add a PnP device to the PnP layer
177
178 only call this function when all wanted values are set in the pnp_dev
179 structure
180
181pnp_init_device
182 call this to initialize the PnP structure
183
184pnp_remove_device
185 call this to remove a device from the Plug and Play Layer.
186 it will fail if the device is still in use.
187 it automatically frees the memory used by the device and related structures
188
189pnp_add_id
190 adds an EISA ID to the list of supported IDs for the specified device
191
192For more information consult the source of a protocol such as
193/drivers/pnp/pnpbios/core.c.
194
195
196
197Linux Plug and Play Drivers
198---------------------------
199
200This section contains information for Linux PnP driver developers.
201
202The New Way
203^^^^^^^^^^^
204
2051. first make a list of supported EISA IDs
206
207 ex::
208
209 static const struct pnp_id pnp_dev_table[] = {
210 /* Standard LPT Printer Port */
211 {.id = "PNP0400", .driver_data = 0},
212 /* ECP Printer Port */
213 {.id = "PNP0401", .driver_data = 0},
214 {.id = ""}
215 };
216
217 Please note that the character 'X' can be used as a wild card in the function
218 portion (last four characters).
219
220 ex::
221
222 /* Unknown PnP modems */
223 { "PNPCXXX", UNKNOWN_DEV },
224
225 Supported PnP card IDs can optionally be defined.
226 ex::
227
228 static const struct pnp_id pnp_card_table[] = {
229 { "ANYDEVS", 0 },
230 { "", 0 }
231 };
232
2332. Optionally define probe and remove functions. It may make sense not to
234 define these functions if the driver already has a reliable method of detecting
235 the resources, such as the parport_pc driver.
236
237 ex::
238
239 static int
240 serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const
241 struct pnp_id *dev_id)
242 {
243 . . .
244
245 ex::
246
247 static void serial_pnp_remove(struct pnp_dev * dev)
248 {
249 . . .
250
251 consult drivers/tty/serial/8250/8250_pnp.c for more information.
252
2533. create a driver structure
254
255 ex::
256
257 static struct pnp_driver serial_pnp_driver = {
258 .name = "serial",
259 .card_id_table = pnp_card_table,
260 .id_table = pnp_dev_table,
261 .probe = serial_pnp_probe,
262 .remove = serial_pnp_remove,
263 };
264
265 * name and id_table cannot be NULL.
266
2674. register the driver
268
269 ex::
270
271 static int __init serial8250_pnp_init(void)
272 {
273 return pnp_register_driver(&serial_pnp_driver);
274 }
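
The 'X' wild card rule from step 1 can be illustrated with a small stand-alone matcher. This is a sketch of the rule as described in this document, not the PnP layer's own comparison code; the function name `pnp_id_matches` is hypothetical, and the sketch assumes 7-character IDs (3 vendor characters plus 4 function characters).

```c
#include <ctype.h>
#include <string.h>

/*
 * Return 1 if dev_id (e.g. "PNPC123") matches table_id (e.g. "PNPCXXX"),
 * 0 otherwise. The vendor part (first three characters) must match
 * exactly; in the function part (last four characters) an 'X' in the
 * table entry matches any character.
 */
int pnp_id_matches(const char *table_id, const char *dev_id)
{
    if (strlen(table_id) != 7 || strlen(dev_id) != 7)
        return 0;
    if (strncmp(table_id, dev_id, 3) != 0)
        return 0;
    for (int i = 3; i < 7; i++) {
        if (table_id[i] == 'X')
            continue;
        if (toupper((unsigned char)table_id[i]) !=
            toupper((unsigned char)dev_id[i]))
            return 0;
    }
    return 1;
}
```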
275
276The Old Way
277^^^^^^^^^^^
278
279A series of compatibility functions have been created to make it easy to convert
280ISAPNP drivers. They should serve as a temporary solution only.
281
282They are as follows::
283
284 struct pnp_card *pnp_find_card(unsigned short vendor,
285 unsigned short device,
286 struct pnp_card *from)
287
288 struct pnp_dev *pnp_find_dev(struct pnp_card *card,
289 unsigned short vendor,
290 unsigned short function,
291 struct pnp_dev *from)
292
diff --git a/Documentation/admin-guide/rapidio.rst b/Documentation/admin-guide/rapidio.rst
new file mode 100644
index 000000000000..71ff658ab78e
--- /dev/null
+++ b/Documentation/admin-guide/rapidio.rst
@@ -0,0 +1,107 @@
1=======================
2RapidIO Subsystem Guide
3=======================
4
5:Author: Matt Porter
6
7Introduction
8============
9
10RapidIO is a high speed switched fabric interconnect with features aimed
11at the embedded market. RapidIO provides support for memory-mapped I/O
12as well as message-based transactions over the switched fabric network.
13RapidIO has a standardized discovery mechanism not unlike the PCI bus
14standard that allows simple detection of devices in a network.
15
16This documentation is provided for developers intending to support
17RapidIO on new architectures, write new drivers, or to understand the
18subsystem internals.
19
20Known Bugs and Limitations
21==========================
22
23Bugs
24----
25
26None. ;)
27
28Limitations
29-----------
30
311. Access/management of RapidIO memory regions is not supported
32
332. Multiple host enumeration is not supported
34
35RapidIO driver interface
36========================
37
38Drivers are provided a set of calls in order to interface with the
39subsystem to gather info on devices, request/map memory region
40resources, and manage mailboxes/doorbells.
41
42Functions
43---------
44
45.. kernel-doc:: include/linux/rio_drv.h
46 :internal:
47
48.. kernel-doc:: drivers/rapidio/rio-driver.c
49 :export:
50
51.. kernel-doc:: drivers/rapidio/rio.c
52 :export:
53
54Internals
55=========
56
57This chapter contains the autogenerated documentation of the RapidIO
58subsystem.
59
60Structures
61----------
62
63.. kernel-doc:: include/linux/rio.h
64 :internal:
65
66Enumeration and Discovery
67-------------------------
68
69.. kernel-doc:: drivers/rapidio/rio-scan.c
70 :internal:
71
72Driver functionality
73--------------------
74
75.. kernel-doc:: drivers/rapidio/rio.c
76 :internal:
77
78.. kernel-doc:: drivers/rapidio/rio-access.c
79 :internal:
80
81Device model support
82--------------------
83
84.. kernel-doc:: drivers/rapidio/rio-driver.c
85 :internal:
86
87PPC32 support
88-------------
89
90.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c
91 :internal:
92
93Credits
94=======
95
96The following people have contributed to the RapidIO subsystem directly
97or indirectly:
98
991. Matt Porter\ mporter@kernel.crashing.org
100
1012. Randy Vinson\ rvinson@mvista.com
102
1033. Dan Malek\ dan@embeddedalley.com
104
105The following people have contributed to this document:
106
1071. Matt Porter\ mporter@kernel.crashing.org
diff --git a/Documentation/admin-guide/rtc.rst b/Documentation/admin-guide/rtc.rst
new file mode 100644
index 000000000000..688c95b11919
--- /dev/null
+++ b/Documentation/admin-guide/rtc.rst
@@ -0,0 +1,140 @@
1=======================================
2Real Time Clock (RTC) Drivers for Linux
3=======================================
4
5When Linux developers talk about a "Real Time Clock", they usually mean
6something that tracks wall clock time and is battery backed so that it
7works even with system power off. Such clocks will normally not track
8the local time zone or daylight saving time -- unless they dual boot
9with MS-Windows -- but will instead be set to Coordinated Universal Time
10(UTC, formerly "Greenwich Mean Time").
11
12The newest non-PC hardware tends to just count seconds, like the time(2)
13system call reports, but RTCs also very commonly represent time using
14the Gregorian calendar and 24 hour time, as reported by gmtime(3).
15
16Linux has two largely-compatible userspace RTC API families you may
17need to know about:
18
19 * /dev/rtc ... is the RTC provided by PC compatible systems,
20 so it's not very portable to non-x86 systems.
21
22 * /dev/rtc0, /dev/rtc1 ... are part of a framework that's
23 supported by a wide variety of RTC chips on all systems.
24
25Programmers need to understand that the PC/AT functionality is not
26always available, and some systems can do much more. That is, the
27RTCs use the same API to make requests in both RTC frameworks (using
28different filenames of course), but the hardware may not offer the
29same functionality. For example, not every RTC is hooked up to an
30IRQ, so they can't all issue alarms; and where standard PC RTCs can
31only issue an alarm up to 24 hours in the future, other hardware may
32be able to schedule one any time in the upcoming century.
33
34
35Old PC/AT-Compatible driver: /dev/rtc
36--------------------------------------
37
38All PCs (even Alpha machines) have a Real Time Clock built into them.
39Usually they are built into the chipset of the computer, but some may
40actually have a Motorola MC146818 (or clone) on the board. This is the
41clock that keeps the date and time while your computer is turned off.
42
43ACPI has standardized that MC146818 functionality, and extended it in
44a few ways (enabling longer alarm periods, and wake-from-hibernate).
45That functionality is NOT exposed in the old driver.
46
47The RTC can also be used to generate signals from a slow 2Hz to a
48relatively fast 8192Hz, in increments of powers of two. These signals
49are reported by interrupt number 8. (Oh! So *that* is what IRQ 8 is
50for...) It can also function as a 24hr alarm, raising IRQ 8 when the
51alarm goes off. The alarm can also be programmed to only check any
52subset of the three programmable values, meaning that it could be set to
53ring on the 30th second of the 30th minute of every hour, for example.
54The clock can also be set to generate an interrupt upon every clock
55update, thus generating a 1Hz signal.
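
The set of periodic frequencies described above (powers of two from 2Hz to 8192Hz) is easy to check before programming the device. The helper below is only a sketch with a made-up name; the driver performs its own validation.

```c
/*
 * Return 1 if hz is a periodic interrupt rate the PC RTC can generate
 * according to the description above: a power of two in 2..8192.
 */
int rtc_periodic_freq_ok(unsigned long hz)
{
    return hz >= 2 && hz <= 8192 && (hz & (hz - 1)) == 0;
}
```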
56
57The interrupts are reported via /dev/rtc (major 10, minor 135, read only
58character device) in the form of an unsigned long. The low byte contains
59the type of interrupt (update-done, alarm-rang, or periodic) that was
60raised, and the remaining bytes contain the number of interrupts since
61the last read. Status information is reported through the pseudo-file
62/proc/driver/rtc if the /proc filesystem was enabled. The driver has
63built in locking so that only one process is allowed to have the /dev/rtc
64interface open at a time.
65
66A user process can monitor these interrupts by doing a read(2) or a
67select(2) on /dev/rtc -- either will block/stop the user process until
68the next interrupt is received. This is useful for things like
69reasonably high frequency data acquisition where one doesn't want to
70burn up 100% CPU by polling gettimeofday(2) and the like.
71
72At high frequencies, or under high loads, the user process should check
73the number of interrupts received since the last read to determine if
74there has been any interrupt "pileup" so to speak. Just for reference, a
75typical 486-33 running a tight read loop on /dev/rtc will start to suffer
76occasional interrupt pileup (i.e. > 1 IRQ event since last read) for
77frequencies above 1024Hz. So you really should check the high bytes
78of the value you read, especially at frequencies above that of the
79normal timer interrupt, which is 100Hz.
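
The layout of the value returned by read(2) on /dev/rtc, and the pileup check recommended above, can be sketched as follows. The structure and function names are hypothetical, and the flag constants are local copies that are assumed to mirror RTC_UF, RTC_AF and RTC_PF from the kernel's rtc.h header.

```c
/* Local copies of the interrupt type flags (assumed to mirror rtc.h). */
#define SKETCH_RTC_UF 0x10UL /* update-done interrupt */
#define SKETCH_RTC_AF 0x20UL /* alarm interrupt */
#define SKETCH_RTC_PF 0x40UL /* periodic interrupt */

struct rtc_irq_info {
    unsigned long flags; /* low byte: type of interrupt */
    unsigned long count; /* remaining bytes: interrupts since last read */
};

/* Split the unsigned long read from /dev/rtc into its two fields. */
struct rtc_irq_info rtc_decode_irq(unsigned long data)
{
    struct rtc_irq_info info;

    info.flags = data & 0xffUL;
    info.count = data >> 8;
    return info;
}

/* More than one interrupt since the last read means some were missed. */
int rtc_irq_pileup(struct rtc_irq_info info)
{
    return info.count > 1;
}
```

A reader loop would call rtc_decode_irq() on every value it reads and treat a true result from rtc_irq_pileup() as a sign that it cannot keep up with the programmed frequency.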
80
81Programming and/or enabling interrupt frequencies greater than 64Hz is
82only allowed by root. This is perhaps a bit conservative, but we don't want
83an evil user generating lots of IRQs on a slow 386sx-16, where it might have
84a negative impact on performance. This 64Hz limit can be changed by writing
85a different value to /proc/sys/dev/rtc/max-user-freq. Note that the
86interrupt handler is only a few lines of code to minimize any possibility
87of this effect.
88
89Also, if the kernel time is synchronized with an external source, the
90kernel will write the time back to the CMOS clock every 11 minutes. In
91the process of doing this, the kernel briefly turns off RTC periodic
92interrupts, so be aware of this if you are doing serious work. If you
93don't synchronize the kernel time with an external source (via ntp or
94whatever) then the kernel will keep its hands off the RTC, allowing you
95exclusive access to the device for your applications.
96
97The alarm and/or interrupt frequency are programmed into the RTC via
98various ioctl(2) calls as listed in ./include/linux/rtc.h.
99Rather than write 50 pages describing the ioctl() and so on, it is
100perhaps more useful to point to a small test program that demonstrates
101how to use them, and demonstrates the features of the driver. This is
102probably a lot more useful to people interested in writing applications
103that will be using this driver. See the program named at the end of this document.
104
105(The original /dev/rtc driver was written by Paul Gortmaker.)
106
107
108New portable "RTC Class" drivers: /dev/rtcN
109--------------------------------------------
110
111Because Linux supports many non-ACPI and non-PC platforms, some of which
112have more than one RTC style clock, it needed a more portable solution
113than expecting a single battery-backed MC146818 clone on every system.
114Accordingly, a new "RTC Class" framework has been defined. It offers
115three different userspace interfaces:
116
117 * /dev/rtcN ... much the same as the older /dev/rtc interface
118
119 * /sys/class/rtc/rtcN ... sysfs attributes support read-only
120 access to some RTC attributes.
121
122 * /proc/driver/rtc ... the system clock RTC may expose itself
123 using a procfs interface. If there is no RTC for the system clock,
124 rtc0 is used by default. More information is (currently) shown
125 here than through sysfs.
126
127The RTC Class framework supports a wide variety of RTCs, ranging from those
128integrated into embeddable system-on-chip (SOC) processors to discrete chips
129using I2C, SPI, or some other bus to communicate with the host CPU. There's
130even support for PC-style RTCs ... including the features exposed on newer PCs
131through ACPI.
132
133The new framework also removes the "one RTC per system" restriction. For
134example, maybe the low-power battery-backed RTC is a discrete I2C chip, but
135a high functionality RTC is integrated into the SOC. That system might read
136the system clock from the discrete RTC, but use the integrated one for all
137other tasks, because of its greater functionality.
138
139Check out tools/testing/selftests/rtc/rtctest.c for an example usage of the
140ioctl interface.
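
For a first impression of the ioctl interface, reading the current time needs only the RTC_RD_TIME request and struct rtc_time from <linux/rtc.h>. The following is a minimal sketch, not a replacement for the selftest; the wrapper name `rtc_read_time` is an assumption, and the device path is whatever RTC node exists on the system.

```c
#include <fcntl.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Read the current time of the RTC at "path" (for example "/dev/rtc0").
 * Returns 0 on success and -1 if the device cannot be opened or the
 * ioctl fails.
 */
int rtc_read_time(const char *path, struct rtc_time *tm)
{
    int fd = open(path, O_RDONLY);
    int ret;

    if (fd < 0)
        return -1;
    ret = ioctl(fd, RTC_RD_TIME, tm);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

As with struct tm, the tm_year field counts years since 1900 and tm_mon counts months from 0.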
diff --git a/Documentation/admin-guide/svga.rst b/Documentation/admin-guide/svga.rst
new file mode 100644
index 000000000000..b6c2f9acca92
--- /dev/null
+++ b/Documentation/admin-guide/svga.rst
@@ -0,0 +1,249 @@
1.. include:: <isonum.txt>
2
3=================================
4Video Mode Selection Support 2.13
5=================================
6
7:Copyright: |copy| 1995--1999 Martin Mares, <mj@ucw.cz>
8
9Intro
10~~~~~
11
12This small document describes the "Video Mode Selection" feature which
13allows the use of various special video modes supported by the video BIOS. Due
14to usage of the BIOS, the selection is limited to boot time (before the
15kernel decompression starts) and works only on 80X86 machines.
16
17.. note::
18
19 Short intro for the impatient: Just use vga=ask for the first time,
20 enter ``scan`` on the video mode prompt, pick the mode you want to use,
21 remember its mode ID (the four-digit hexadecimal number) and then
22 set the vga parameter to this number (converted to decimal first).
23
24The video mode to be used is selected by a kernel parameter which can be
25specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..."
26option of LILO (or some other boot loader you use) or by the "vidmode" utility
27(present in standard Linux utility packages). You can use the following values
28of this parameter::
29
30 NORMAL_VGA - Standard 80x25 mode available on all display adapters.
31
32 EXTENDED_VGA - Standard 8-pixel font mode: 80x43 on EGA, 80x50 on VGA.
33
34 ASK_VGA - Display a video mode menu upon startup (see below).
35
36 0..35 - Menu item number (when you have used the menu to view the list of
37 modes available on your adapter, you can specify the menu item you want
38 to use). 0..9 correspond to "0".."9", 10..35 to "a".."z". Warning: the
39 mode list displayed may vary as the kernel version changes, because the
40 modes are listed in a "first detected -- first displayed" manner. It's
41 better to use absolute mode numbers instead.
42
43 0x.... - Hexadecimal video mode ID (also displayed on the menu, see below
44 for exact meaning of the ID). Warning: rdev and LILO don't support
45 hexadecimal numbers -- you have to convert it to decimal manually.
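
The manual conversion from a hexadecimal mode ID to the decimal value expected by such tools is a one-liner with strtoul(3). The wrapper name below is made up for illustration.

```c
#include <stdlib.h>

/*
 * Convert a hexadecimal video mode ID such as "0x0f01" (the "0x"
 * prefix is optional) to the decimal value to pass as vga=<number>.
 */
unsigned long vga_mode_to_decimal(const char *hex_id)
{
    return strtoul(hex_id, NULL, 16);
}
```

For example, the mode ID 0x0f01 corresponds to vga=3841.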
46
47Menu
48~~~~
49
50The ASK_VGA mode causes the kernel to offer a video mode menu upon
51bootup. It displays a "Press <RETURN> to see video modes available, <SPACE>
52to continue or wait 30 secs" message. If you press <RETURN>, you enter the
53menu, if you press <SPACE> or wait 30 seconds, the kernel will boot up in
54the standard 80x25 mode.
55
56The menu looks like::
57
58 Video adapter: <name-of-detected-video-adapter>
59 Mode: COLSxROWS:
60 0 0F00 80x25
61 1 0F01 80x50
62 2 0F02 80x43
63 3 0F03 80x28
64 ....
65 Enter mode number or ``scan``: <flashing-cursor-here>
66
67<name-of-detected-video-adapter> tells which video adapter Linux detected
68-- it's either a generic adapter name (MDA, CGA, HGC, EGA, VGA, VESA VGA [a VGA
69with VESA-compliant BIOS]) or a chipset name (e.g., Trident). Direct detection
70of chipsets is turned off by default as it's inherently unreliable due to
71absolutely insane PC design.
72
73"0 0F00 80x25" means that the first menu item (the menu items are numbered
74from "0" to "9" and from "a" to "z") is a 80x25 mode with ID=0x0f00 (see the
75next section for a description of mode IDs).
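
The numbering of menu items ("0" to "9", then "a" to "z") maps to item numbers 0..35 as sketched below. The function name is hypothetical and the code assumes an ASCII-compatible character set.

```c
/*
 * Return the menu key for item number 0..35 ('0'..'9', then 'a'..'z'),
 * or '\0' if the item number is out of range.
 */
char vga_menu_key(int item)
{
    if (item >= 0 && item <= 9)
        return (char)('0' + item);
    if (item >= 10 && item <= 35)
        return (char)('a' + (item - 10));
    return '\0';
}
```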
76
77<flashing-cursor-here> encourages you to enter the item number or mode ID
78you wish to set and press <RETURN>. If the computer complains about
79"Unknown mode ID", it is trying to tell you that it isn't possible to set such
80a mode. It's also possible to press only <RETURN>, which leaves the current mode.
81
82The mode list usually contains a few basic modes and some VESA modes. In
83case your chipset has been detected, some chipset-specific modes are shown as
84well (some of these might be missing or unusable on your machine as different
85BIOSes are often shipped with the same card and the mode numbers depend purely
86on the VGA BIOS).
87
88The modes displayed on the menu are partially sorted: The list starts with
89the standard modes (80x25 and 80x50) followed by "special" modes (80x28 and
9080x43), local modes (if the local modes feature is enabled), VESA modes and
91finally SVGA modes for the auto-detected adapter.
92
93If you are not happy with the mode list offered (e.g., if you think your card
94is able to do more), you can enter "scan" instead of item number / mode ID. The
95program will try to ask the BIOS for all possible video mode numbers and test
96what happens then. The screen will probably flash wildly for some time,
97strange noises will be heard from inside the monitor and so on, and then
98all consistent video modes supported by your BIOS will appear (plus maybe some
99``ghost modes``). If you are afraid this could damage your monitor, don't use
100this function.
101
102After scanning, the mode ordering is a bit different: the auto-detected SVGA
103modes are not listed at all and the modes revealed by ``scan`` are shown before
104all VESA modes.
105
106Mode IDs
107~~~~~~~~
108
109Because of the complexity of all the video stuff, the video mode IDs
110used here are also a bit complex. A video mode ID is a 16-bit number usually
111expressed in a hexadecimal notation (starting with "0x"). You can set a mode
112by entering its ID directly if you know it, even if it isn't shown on the menu.
113
114The ID numbers can be divided into the following regions::
115
116 0x0000 to 0x00ff - menu item references. 0x0000 is the first item. Don't use
117 outside the menu as this can change from boot to boot (especially if you
118 have used the ``scan`` feature).
119
120 0x0100 to 0x017f - standard BIOS modes. The ID is a BIOS video mode number
121 (as presented to INT 10, function 00) increased by 0x0100.
122
123 0x0200 to 0x08ff - VESA BIOS modes. The ID is a VESA mode ID increased by
124 0x0100. All VESA modes should be autodetected and shown on the menu.
125
126 0x0900 to 0x09ff - Video7 special modes. Set by calling INT 0x10, AX=0x6f05.
127 (Usually 940=80x43, 941=132x25, 942=132x44, 943=80x60, 944=100x60,
128 945=132x28 for the standard Video7 BIOS)
129
130 0x0f00 to 0x0fff - special modes (they are set by various tricks -- usually
131 by modifying one of the standard modes). Currently available:
132 0x0f00 standard 80x25, don't reset mode if already set (=FFFF)
133 0x0f01 standard with 8-point font: 80x43 on EGA, 80x50 on VGA
134 0x0f02 VGA 80x43 (VGA switched to 350 scanlines with a 8-point font)
135 0x0f03 VGA 80x28 (standard VGA scans, but 14-point font)
136 0x0f04 leave current video mode
137 0x0f05 VGA 80x30 (480 scans, 16-point font)
138 0x0f06 VGA 80x34 (480 scans, 14-point font)
139 0x0f07 VGA 80x60 (480 scans, 8-point font)
140 0x0f08 Graphics hack (see the VIDEO_GFX_HACK paragraph below)
141
142 0x1000 to 0x7fff - modes specified by resolution. The code has a "0xRRCC"
143 form where RR is a number of rows and CC is a number of columns.
144 E.g., 0x1950 corresponds to a 80x25 mode, 0x2b84 to 132x43 etc.
145 This is the only fully portable way to refer to a non-standard mode,
146 but it relies on the mode being found and displayed on the menu
147 (remember that mode scanning is not done automatically).
148
149 0xff00 to 0xffff - aliases for backward compatibility:
150 0xffff equivalent to 0x0f00 (standard 80x25)
151 0xfffe equivalent to 0x0f01 (EGA 80x43 or VGA 80x50)
152
153If you add 0x8000 to the mode ID, the program will try to recalculate
154vertical display timing according to mode parameters, which can be used to
155eliminate some annoying bugs of certain VGA BIOSes (usually those used for
156cards with S3 chipsets and old Cirrus Logic BIOSes) -- mainly extra lines at the
157end of the display.
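
Adding 0x8000 amounts to setting the top bit of the 16-bit mode ID. A minimal sketch follows; the macro and function names are invented for the example and the helper is meant for IDs below 0x8000.

```c
#include <stdint.h>

#define VGA_RECALC_FLAG 0x8000u /* illustrative name for the 0x8000 flag */

/* Return the mode ID with the "recalculate vertical timing" flag set. */
uint16_t vga_mode_with_recalc(uint16_t id)
{
    return (uint16_t)(id | VGA_RECALC_FLAG);
}
```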
158
159Options
160~~~~~~~
161
162Build options for arch/x86/boot/* are selected by the kernel kconfig
163utility and the kernel .config file.
164
165VIDEO_GFX_HACK - includes a special hack for setting graphics modes
166to be used later by special drivers.
167It allows setting _any_ BIOS mode, including graphic ones, and forcing a specific
168text screen resolution instead of reading it from BIOS variables. Don't use it
169unless you think you know what you're doing. To activate this setup, use
170mode number 0x0f08 (see the Mode IDs section above).
171
172Still doesn't work?
173~~~~~~~~~~~~~~~~~~~
174
175When the mode detection doesn't work (e.g., the mode list is incorrect or
176the machine hangs instead of displaying the menu), try to switch off some of
177the configuration options listed under "Options". If it fails, you can still use
178your kernel with the video mode set directly via the kernel parameter.
179
180In either case, please send me a bug report containing what _exactly_
181happens and how the configuration switches affect the behaviour of the bug.
182
183If you start Linux from M$-DOS, you might also use some DOS tools for
184video mode setting. In this case, you must specify the 0x0f04 mode ("leave
185current settings") to Linux, because if you don't and you use any non-standard
186mode, Linux will switch to 80x25 automatically.
187
188If you set some extended mode and there's one or more extra lines on the
189bottom of the display containing already scrolled-out text, your VGA BIOS
190contains the most common video BIOS bug called "incorrect vertical display
191end setting". Adding 0x8000 to the mode ID might fix the problem. Unfortunately,
192this must be done manually -- no autodetection mechanisms are available.
193
194History
195~~~~~~~
196
197=============== ================================================================
1981.0 (??-Nov-95) First version supporting all adapters supported by the old
199 setup.S + Cirrus Logic 54XX. Present in some 1.3.4? kernels
200 and then removed due to instability on some machines.
2012.0 (28-Jan-96) Rewritten from scratch. Cirrus Logic 64XX support added, almost
202 everything is configurable, the VESA support should be much more
203 stable, explicit mode numbering allowed, "scan" implemented etc.
2042.1 (30-Jan-96) VESA modes moved to 0x200-0x3ff. Mode selection by resolution
205 supported. Few bugs fixed. VESA modes are listed prior to
206 modes supplied by SVGA autodetection as they are more reliable.
207 CLGD autodetect works better. Doesn't depend on 80x25 being
208 active when started. Scanning fixed. 80x43 (any VGA) added.
209 Code cleaned up.
2102.2 (01-Feb-96) EGA 80x43 fixed. VESA extended to 0x200-0x4ff (non-standard 02XX
211 VESA modes work now). Display end bug workaround supported.
212 Special modes renumbered to allow adding of the "recalculate"
213 flag, 0xffff and 0xfffe became aliases instead of real IDs.
214 Screen contents retained during mode changes.
2152.3 (15-Mar-96) Changed to work with 1.3.74 kernel.
2162.4 (18-Mar-96) Added patches by Hans Lermen fixing a memory overwrite problem
217 with some boot loaders. Memory management rewritten to reflect
218 these changes. Unfortunately, screen contents retaining works
219 only with some loaders now.
220 Added a Tseng 132x60 mode.
2212.5 (19-Mar-96) Fixed a VESA mode scanning bug introduced in 2.4.
2222.6 (25-Mar-96) Some VESA BIOS errors not reported -- it fixes error reports on
223 several cards with broken VESA code (e.g., ATI VGA).
2242.7 (09-Apr-96) - Accepted all VESA modes in range 0x100 to 0x7ff, because some
225 cards use very strange mode numbers.
226 - Added Realtek VGA modes (thanks to Gonzalo Tornaria).
227 - Hardware testing order slightly changed, tests based on ROM
228 contents done as first.
229 - Added support for special Video7 mode switching functions
230 (thanks to Tom Vander Aa).
231 - Added 480-scanline modes (especially useful for notebooks,
232 original version written by hhanemaa@cs.ruu.nl, patched by
233 Jeff Chua, rewritten by me).
234 - Screen store/restore fixed.
2352.8 (14-Apr-96) - Previous release was not compilable without CONFIG_VIDEO_SVGA.
236 - Better recognition of text modes during mode scan.
2372.9 (12-May-96) - Ignored VESA modes 0x80 - 0xff (more VESA BIOS bugs!)
2382.10(11-Nov-96) - The whole thing made optional.
239 - Added the CONFIG_VIDEO_400_HACK switch.
240 - Added the CONFIG_VIDEO_GFX_HACK switch.
241 - Code cleanup.
2422.11(03-May-97) - Yet another cleanup, now including also the documentation.
243 - Direct testing of SVGA adapters turned off by default, ``scan``
244 offered explicitly on the prompt line.
245 - Removed the doc section describing adding of new probing
246 functions as I try to get rid of _all_ hardware probing here.
2472.12(25-May-98) Added support for VESA frame buffer graphics.
2482.13(14-May-99) Minor documentation fixes.
249=============== ================================================================
diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst
new file mode 100644
index 000000000000..599bcde7f0b7
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/abi.rst
@@ -0,0 +1,67 @@
1================================
2Documentation for /proc/sys/abi/
3================================
4
5kernel version 2.6.0.test2
6
7Copyright (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net>
8
9For general info: index.rst.
10
11------------------------------------------------------------------------------
12
13This directory relates to binary emulation, also known as personality types or ABI.
14When a process is executed, it's linked to an exec_domain whose
15personality is defined using values available from /proc/sys/abi.
16You can find further details about abi in include/linux/personality.h.
17
18Here are the files present in the 2.6 kernel:
19
20- defhandler_coff
21- defhandler_elf
22- defhandler_lcall7
23- defhandler_libsco
24- fake_utsname
25- trace
26
27defhandler_coff
28---------------
29
30defined value:
31 PER_SCOSVR3::
32
33 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
34
35defhandler_elf
36--------------
37
38defined value:
39 PER_LINUX::
40
41 0
42
43defhandler_lcall7
44-----------------
45
46defined value:
47 PER_SVR4::
48
49 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
50
51defhandler_libsco
52-----------------
53
54defined value:
55 PER_SVR4::
56
57 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
58
59fake_utsname
60------------
61
62Unused
63
64trace
65-----
66
67Unused
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
new file mode 100644
index 000000000000..2a45119e3331
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -0,0 +1,384 @@
1===============================
2Documentation for /proc/sys/fs/
3===============================
4
5kernel version 2.2.10
6
7Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
8
9Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
10
11For general info and legal blurb, please look in intro.rst.
12
13------------------------------------------------------------------------------
14
15This file contains documentation for the sysctl files in
16/proc/sys/fs/ and is valid for Linux kernel version 2.2.
17
18The files in this directory can be used to tune and monitor
19miscellaneous and general things in the operation of the Linux
20kernel. Since some of the files _can_ be used to screw up your
21system, it is advisable to read both documentation and source
22before actually making adjustments.
23
241. /proc/sys/fs
25===============
26
27Currently, these files are in /proc/sys/fs:
28
29- aio-max-nr
30- aio-nr
31- dentry-state
32- dquot-max
33- dquot-nr
34- file-max
35- file-nr
36- inode-max
37- inode-nr
38- inode-state
39- nr_open
40- overflowuid
41- overflowgid
42- pipe-user-pages-hard
43- pipe-user-pages-soft
44- protected_fifos
45- protected_hardlinks
46- protected_regular
47- protected_symlinks
48- suid_dumpable
49- super-max
50- super-nr
51
52
53aio-nr & aio-max-nr
54-------------------
55
56aio-nr is the running total of the number of events specified on the
57io_setup system call for all currently active aio contexts. If aio-nr
58reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
59raising aio-max-nr does not result in the pre-allocation or re-sizing
60of any kernel data structures.
61
62
63dentry-state
64------------
65
66From linux/include/linux/dcache.h::
67
68 struct dentry_stat_t dentry_stat {
69 int nr_dentry;
70 int nr_unused;
71 int age_limit; /* age in seconds */
72 int want_pages; /* pages requested by system */
73 int nr_negative; /* # of unused negative dentries */
74 int dummy; /* Reserved for future use */
75 };
76
77Dentries are dynamically allocated and deallocated.
78
79nr_dentry shows the total number of dentries allocated (active
80+ unused). nr_unused shows the number of dentries that are not
81actively used, but are saved in the LRU list for future reuse.
82
83Age_limit is the age in seconds after which dcache entries
84can be reclaimed when memory is short and want_pages is
85nonzero when shrink_dcache_pages() has been called and the
86dcache isn't pruned yet.
87
88nr_negative shows the number of unused dentries that are also
89negative dentries which do not map to any files. Instead,
90they help speeding up rejection of non-existing files provided
91by the users.
92
93
94dquot-max & dquot-nr
95--------------------
96
97The file dquot-max shows the maximum number of cached disk
98quota entries.
99
100The file dquot-nr shows the number of allocated disk quota
101entries and the number of free disk quota entries.
102
103If the number of free cached disk quotas is very low and
104you have some awesome number of simultaneous system users,
105you might want to raise the limit.
106
107
108file-max & file-nr
109------------------
110
111The value in file-max denotes the maximum number of file-
112handles that the Linux kernel will allocate. When you get lots
113of error messages about running out of file handles, you might
114want to increase this limit.
115
116Historically, the kernel was able to allocate file handles
117dynamically, but not to free them again. The three values in
118file-nr denote the number of allocated file handles, the number
119of allocated but unused file handles, and the maximum number of
120file handles. Linux 2.6 always reports 0 as the number of free
121file handles -- this is not an error, it just means that the
122number of allocated file handles exactly matches the number of
123used file handles.
124
125Attempts to allocate more file descriptors than file-max are
126reported with printk; look for "VFS: file-max limit <number>
127reached".
128
129
130nr_open
131-------
132
133This denotes the maximum number of file-handles a process can
134allocate. Default value is 1024*1024 (1048576) which should be
135enough for most machines. Actual limit depends on RLIMIT_NOFILE
136resource limit.
137
138
139inode-max, inode-nr & inode-state
140---------------------------------
141
142As with file handles, the kernel allocates the inode structures
143dynamically, but can't free them yet.
144
145The value in inode-max denotes the maximum number of inode
146handlers. This value should be 3-4 times larger than the value
147in file-max, since stdin, stdout and network sockets also
148need an inode struct to handle them. When you regularly run
149out of inodes, you need to increase this value.
150
151The file inode-nr contains the first two items from
152inode-state, so we'll skip to that file...
153
154Inode-state contains three actual numbers and four dummies.
155The actual numbers are, in order of appearance, nr_inodes,
156nr_free_inodes and preshrink.
157
158Nr_inodes stands for the number of inodes the system has
159allocated, this can be slightly more than inode-max because
160Linux allocates them one pageful at a time.
161
162Nr_free_inodes represents the number of free inodes (?) and
163preshrink is nonzero when the nr_inodes > inode-max and the
164system needs to prune the inode list instead of allocating
165more.
166
167
168overflowgid & overflowuid
169-------------------------
170
171Some filesystems only support 16-bit UIDs and GIDs, although in Linux
172UIDs and GIDs are 32 bits. When one of these filesystems is mounted
173with writes enabled, any UID or GID that would exceed 65535 is translated
174to a fixed value before being written to disk.
175
176These sysctls allow you to change the value of the fixed UID and GID.
177The default is 65534.
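
The translation described above amounts to a single comparison. A minimal sketch, assuming a hypothetical helper ``id_for_16bit_fs`` (the kernel does this in C inside the affected filesystems; the code below only models the rule):

```python
OVERFLOW_ID = 65534  # documented default of overflowuid / overflowgid


def id_for_16bit_fs(ident, overflow_id=OVERFLOW_ID):
    """Return the UID or GID that a 16-bit-only filesystem would store.

    IDs that fit in 16 bits are stored unchanged; anything above 65535
    is replaced by the fixed overflow value.
    """
    return ident if ident <= 0xFFFF else overflow_id


print(id_for_16bit_fs(1000))    # fits in 16 bits, stored as-is
print(id_for_16bit_fs(100000))  # too large, stored as the overflow ID
```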
178
179
180pipe-user-pages-hard
181--------------------
182
183Maximum total number of pages a non-privileged user may allocate for pipes.
184Once this limit is reached, no new pipes may be allocated until usage goes
185below the limit again. When set to 0, no limit is applied, which is the default
186setting.
187
188
189pipe-user-pages-soft
190--------------------
191
192Maximum total number of pages a non-privileged user may allocate for pipes
193before the pipe size gets limited to a single page. Once this limit is reached,
194new pipes will be limited to a single page in size for this user in order to
195limit total memory usage, and trying to increase them using fcntl() will be
196denied until usage goes below the limit again. The default value allows
197allocating up to 1024 pipes at their default size. When set to 0, no limit is
198applied.
199
200
201protected_fifos
202---------------
203
204The intent of this protection is to avoid unintentional writes to
205an attacker-controlled FIFO, where a program expected to create a regular
206file.
207
208When set to "0", writing to FIFOs is unrestricted.
209
210When set to "1" don't allow O_CREAT open on FIFOs that we don't own
211in world writable sticky directories, unless they are owned by the
212owner of the directory.
213
214When set to "2" it also applies to group writable sticky directories.
215
216This protection is based on the restrictions in Openwall.
217
218
219protected_hardlinks
220--------------------
221
222A long-standing class of security issues is the hardlink-based
223time-of-check-time-of-use race, most commonly seen in world-writable
224directories like /tmp. The common method of exploitation of this flaw
225is to cross privilege boundaries when following a given hardlink (i.e. a
226root process follows a hardlink created by another user). Additionally,
227on systems without separated partitions, this stops unauthorized users
228from "pinning" vulnerable setuid/setgid files against being upgraded by
229the administrator, or linking to special files.
230
231When set to "0", hardlink creation behavior is unrestricted.
232
233When set to "1" hardlinks cannot be created by users if they do not
234already own the source file, or do not have read/write access to it.
235
236This protection is based on the restrictions in Openwall and grsecurity.
237
238
239protected_regular
240-----------------
241
242This protection is similar to protected_fifos, but it
243avoids writes to an attacker-controlled regular file, where a program
244expected to create one.
245
246When set to "0", writing to regular files is unrestricted.
247
248When set to "1" don't allow O_CREAT open on regular files that we
249don't own in world writable sticky directories, unless they are
250owned by the owner of the directory.
251
252When set to "2" it also applies to group writable sticky directories.
253
254
255protected_symlinks
256------------------
257
258A long-standing class of security issues is the symlink-based
259time-of-check-time-of-use race, most commonly seen in world-writable
260directories like /tmp. The common method of exploitation of this flaw
261is to cross privilege boundaries when following a given symlink (i.e. a
262root process follows a symlink belonging to another user). For a likely
263incomplete list of hundreds of examples across the years, please see:
264http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
265
266When set to "0", symlink following behavior is unrestricted.
267
268When set to "1" symlinks are permitted to be followed only when outside
269a sticky world-writable directory, or when the uid of the symlink and
270follower match, or when the directory owner matches the symlink's owner.
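
The three conditions for mode "1" can be read as a short decision function. This is a sketch of the rule as stated above, not the kernel's implementation; the function name and its parameters are hypothetical.

```python
def may_follow_symlink(dir_is_sticky_world_writable, symlink_uid,
                       follower_uid, dir_uid):
    """Model the protected_symlinks=1 rule described in the text."""
    # Outside a sticky world-writable directory nothing is restricted.
    if not dir_is_sticky_world_writable:
        return True
    # The follower owns the symlink.
    if symlink_uid == follower_uid:
        return True
    # The directory owner created the symlink.
    return dir_uid == symlink_uid


# A root process (uid 0) following a link that uid 1000 planted in a
# root-owned sticky directory such as /tmp is refused.
print(may_follow_symlink(True, symlink_uid=1000, follower_uid=0, dir_uid=0))
```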
271
272This protection is based on the restrictions in Openwall and grsecurity.
273
274
275suid_dumpable:
276--------------
277
278This value can be used to query and set the core dump mode for setuid
279or otherwise protected/tainted binaries. The modes are
280
281= ========== ===============================================================
2820 (default) traditional behaviour. Any process which has changed
283 privilege levels or is execute only will not be dumped.
2841 (debug) all processes dump core when possible. The core dump is
285 owned by the current user and no security is applied. This is
286 intended for system debugging situations only.
287 Ptrace is unchecked.
288 This is insecure as it allows regular users to examine the
289 memory contents of privileged processes.
2902 (suidsafe) any binary which normally would not be dumped is dumped
291 anyway, but only if the "core_pattern" kernel sysctl is set to
292 either a pipe handler or a fully qualified path. (For more
293 details on this limitation, see CVE-2006-2451.) This mode is
294 appropriate when administrators are attempting to debug
295 problems in a normal environment, and either have a core dump
296 pipe handler that knows to treat privileged core dumps with
297 care, or specific directory defined for catching core dumps.
298 If a core dump happens without a pipe handler or fully
299 qualified path, a message will be emitted to syslog warning
300 about the lack of a correct setting.
301= ========== ===============================================================
302
303
304super-max & super-nr
305--------------------
306
307These numbers control the maximum number of superblocks, and
308thus the maximum number of mounted filesystems the kernel
309can have. You only need to increase super-max if you need to
310mount more filesystems than the current value in super-max
311allows you to.
312
313
314aio-nr & aio-max-nr
315-------------------
316
317aio-nr shows the current system-wide number of asynchronous io
318requests. aio-max-nr allows you to change the maximum value
319aio-nr can grow to.
320
321
322mount-max
323---------
324
325This denotes the maximum number of mounts that may exist
326in a mount namespace.
327
328
329
3302. /proc/sys/fs/binfmt_misc
331===========================
332
333Documentation for the files in /proc/sys/fs/binfmt_misc is
334in Documentation/admin-guide/binfmt-misc.rst.
335
336
3373. /proc/sys/fs/mqueue - POSIX message queues filesystem
338========================================================
339
340
341The "mqueue" filesystem provides the necessary kernel features to enable the
342creation of a user space library that implements the POSIX message queues
343API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
344Interfaces specification.)
345
346The "mqueue" filesystem contains values for determining/setting the amount of
347resources used by the file system.
348
349/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
350maximum number of message queues allowed on the system.
351
352/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
353maximum number of messages in a queue. It is the upper bound for the
354per-queue limit that the user sets in the mq_open invocation: that queue
355attribute must be less than or equal to msg_max.
356
357/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
358maximum message size value (it is every message queue's attribute set during
359its creation).
360
361/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
362default number of messages in a queue value if attr parameter of mq_open(2) is
363NULL. If it exceeds msg_max, the default value is initialized to msg_max.
364
365/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
366the default message size value if attr parameter of mq_open(2) is NULL. If it
367exceeds msgsize_max, the default value is initialized to msgsize_max.
368
3694. /proc/sys/fs/epoll - Configuration options for the epoll interface
370=====================================================================
371
372This directory contains configuration options for the epoll(7) interface.
373
374max_user_watches
375----------------
376
377Every epoll file descriptor can store a number of files to be monitored
378for event readiness. Each one of these monitored files constitutes a "watch".
379This configuration option sets the maximum number of "watches" that are
380allowed for each user.
381Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
382on a 64bit one.
383The current default value for max_user_watches is 1/32 of the available
384low memory, divided by the "watch" cost in bytes.
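
As a rough worked example of that default: the sketch below applies the 1/32 rule with integer division. The helper name, the 1 GiB low-memory figure, and the use of the approximate 160-byte cost are assumptions for illustration; the kernel's own calculation may round differently.

```python
def default_max_user_watches(low_memory_bytes, watch_cost_bytes):
    """Estimate the default: 1/32 of low memory divided by the watch cost."""
    return (low_memory_bytes // 32) // watch_cost_bytes


ONE_GIB = 1 << 30  # assumed amount of low memory for this example

# 64-bit kernel, using the approximate 160-byte per-watch cost.
print(default_max_user_watches(ONE_GIB, 160))  # 209715
```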
diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst
new file mode 100644
index 000000000000..03346f98c7b9
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/index.rst
@@ -0,0 +1,98 @@
1===========================
2Documentation for /proc/sys
3===========================
4
5Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
6
7------------------------------------------------------------------------------
8
9'Why', I hear you ask, 'would anyone even _want_ documentation
10for them sysctl files? If anybody really needs it, it's all in
11the source...'
12
13Well, this documentation is written because some people either
14don't know they need to tweak something, or because they don't
15have the time or knowledge to read the source code.
16
17Furthermore, the programmers who built sysctl have built it to
18be actually used, not just for the fun of programming it :-)
19
20------------------------------------------------------------------------------
21
22Legal blurb:
23
24As usual, there are two main things to consider:
25
261. you get what you pay for
272. it's free
28
29The consequences are that I won't guarantee the correctness of
30this document, and if you come to me complaining about how you
31screwed up your system because of wrong documentation, I won't
32feel sorry for you. I might even laugh at you...
33
34But of course, if you _do_ manage to screw up your system using
35only the sysctl options used in this file, I'd like to hear of
36it. Not only to have a great laugh, but also to make sure that
37you're the last RTFMing person to screw up.
38
39In short, e-mail your suggestions, corrections and / or horror
40stories to: <riel@nl.linux.org>
41
42Rik van Riel.
43
44--------------------------------------------------------------
45
46Introduction
47============
48
49Sysctl is a means of configuring certain aspects of the kernel
50at run-time, and the /proc/sys/ directory is there so that you
51don't even need special tools to do it!
52In fact, there are only four things needed to use these config
53facilities:
54
55- a running Linux system
56- root access
57- common sense (this is especially hard to come by these days)
58- knowledge of what all those values mean
59
60As a quick 'ls /proc/sys' will show, the directory consists of
61several (arch-dependent?) subdirs. Each subdir is mainly about
62one part of the kernel, so you can do configuration on a piece
63by piece basis, or just some 'thematic frobbing'.
64
65This documentation is about:
66
67=============== ===============================================================
68abi/ execution domains & personalities
69debug/ <empty>
70dev/ device specific information (eg dev/cdrom/info)
71fs/ specific filesystems
72 filehandle, inode, dentry and quota tuning
73 binfmt_misc <Documentation/admin-guide/binfmt-misc.rst>
74kernel/ global kernel info / tuning
75 miscellaneous stuff
76net/ networking stuff, for documentation look in:
77 <Documentation/networking/>
78proc/ <empty>
79sunrpc/ SUN Remote Procedure Call (NFS)
80vm/ memory management tuning
81 buffer and cache management
82user/ Per user per user namespace limits
83=============== ===============================================================
84
85These are the subdirs I have on my system. There might be more
86or other subdirs in another setup. If you see another dir, I'd
87really like to hear about it :-)
88
89.. toctree::
90 :maxdepth: 1
91
92 abi
93 fs
94 kernel
95 net
96 sunrpc
97 user
98 vm
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
new file mode 100644
index 000000000000..032c7cd3cede
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -0,0 +1,1177 @@
1===================================
2Documentation for /proc/sys/kernel/
3===================================
4
5kernel version 2.2.10
6
7Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
8
9Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
10
11For general info and legal blurb, please look in index.rst.
12
13------------------------------------------------------------------------------
14
15This file contains documentation for the sysctl files in
16/proc/sys/kernel/ and is valid for Linux kernel version 2.2.
17
18The files in this directory can be used to tune and monitor
19miscellaneous and general things in the operation of the Linux
20kernel. Since some of the files _can_ be used to screw up your
21system, it is advisable to read both documentation and source
22before actually making adjustments.
23
24Currently, these files might (depending on your configuration)
25show up in /proc/sys/kernel:
26
27- acct
28- acpi_video_flags
29- auto_msgmni
30- bootloader_type [ X86 only ]
31- bootloader_version [ X86 only ]
32- cap_last_cap
33- core_pattern
34- core_pipe_limit
35- core_uses_pid
36- ctrl-alt-del
37- dmesg_restrict
38- domainname
39- hostname
40- hotplug
41- hardlockup_all_cpu_backtrace
42- hardlockup_panic
43- hung_task_panic
44- hung_task_check_count
45- hung_task_timeout_secs
46- hung_task_check_interval_secs
47- hung_task_warnings
48- hyperv_record_panic_msg
49- kexec_load_disabled
50- kptr_restrict
51- l2cr [ PPC only ]
52- modprobe ==> Documentation/debugging-modules.txt
53- modules_disabled
54- msg_next_id [ sysv ipc ]
55- msgmax
56- msgmnb
57- msgmni
58- nmi_watchdog
59- osrelease
60- ostype
61- overflowgid
62- overflowuid
63- panic
64- panic_on_oops
65- panic_on_stackoverflow
66- panic_on_unrecovered_nmi
67- panic_on_warn
68- panic_print
69- panic_on_rcu_stall
70- perf_cpu_time_max_percent
71- perf_event_paranoid
72- perf_event_max_stack
73- perf_event_mlock_kb
74- perf_event_max_contexts_per_stack
75- pid_max
76- powersave-nap [ PPC only ]
77- printk
78- printk_delay
79- printk_ratelimit
80- printk_ratelimit_burst
81- pty ==> Documentation/filesystems/devpts.txt
82- randomize_va_space
83- real-root-dev ==> Documentation/admin-guide/initrd.rst
84- reboot-cmd [ SPARC only ]
85- rtsig-max
86- rtsig-nr
87- sched_energy_aware
88- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst
89- sem
90- sem_next_id [ sysv ipc ]
91- sg-big-buff [ generic SCSI device (sg) ]
92- shm_next_id [ sysv ipc ]
93- shm_rmid_forced
94- shmall
95- shmmax [ sysv ipc ]
96- shmmni
97- softlockup_all_cpu_backtrace
98- soft_watchdog
99- stack_erasing
100- stop-a [ SPARC only ]
101- sysrq ==> Documentation/admin-guide/sysrq.rst
102- sysctl_writes_strict
103- tainted ==> Documentation/admin-guide/tainted-kernels.rst
104- threads-max
105- unknown_nmi_panic
106- watchdog
107- watchdog_thresh
108- version
109
110
111acct:
112=====
113
114highwater lowwater frequency
115
116If BSD-style process accounting is enabled these values control
117its behaviour. If free space on filesystem where the log lives
118goes below <lowwater>% accounting suspends. If free space gets
119above <highwater>% accounting resumes. <Frequency> determines
120how often the amount of free space is checked (value is in
121seconds). Default:
1224 2 30
123That is, suspend accounting if <= 2% of the space is free; resume it
124if >= 4% is free; consider information about the amount of free space
125valid for 30 seconds.
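
The suspend/resume behaviour is a simple hysteresis between the two thresholds. A minimal sketch of that logic, assuming a hypothetical function ``accounting_active`` and ignoring the frequency field (which only controls how often the check runs):

```python
def accounting_active(active, free_percent, highwater=4, lowwater=2):
    """Return whether accounting is active after one free-space check.

    Defaults mirror the documented "4 2 30" setting.
    """
    if active and free_percent <= lowwater:
        return False  # suspend: free space dropped to the low-water mark
    if not active and free_percent >= highwater:
        return True   # resume: free space recovered to the high-water mark
    return active     # between the marks nothing changes


state = True
for free in (10, 3, 2, 3, 4):
    state = accounting_active(state, free)
    print(free, state)
```

Note that at 3% free the state depends on history: accounting keeps running if it was running, and stays suspended if it was suspended.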
126
127
128acpi_video_flags:
129=================
130
131flags
132
133See Doc*/kernel/power/video.txt; it allows the video boot mode to be
134set at run time.
135
136
137auto_msgmni:
138============
139
140This variable has no effect and may be removed in future kernel
141releases. Reading it always returns 0.
142Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni
143upon memory add/remove or upon ipc namespace creation/removal.
144Echoing "1" into this file enabled msgmni automatic recomputing.
145Echoing "0" turned it off. auto_msgmni default value was 1.
146
147
148bootloader_type:
149================
150
151x86 bootloader identification
152
153This gives the bootloader type number as indicated by the bootloader,
154shifted left by 4, and OR'd with the low four bits of the bootloader
155version. The reason for this encoding is that this used to match the
156type_of_loader field in the kernel header; the encoding is kept for
157backwards compatibility. That is, if the full bootloader type number
158is 0x15 and the full version number is 0x234, this file will contain
159the value 340 = 0x154.
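
The encoding in the example above can be checked with a few lines of arithmetic. This is an illustrative sketch only; the function name is hypothetical and the kernel computes the value in C.

```python
def bootloader_type_value(full_type, full_version):
    """Combine the type (shifted left by 4) with the low 4 version bits."""
    return (full_type << 4) | (full_version & 0xF)


value = bootloader_type_value(0x15, 0x234)
print(value, hex(value))  # 340 0x154
```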
160
161See the type_of_loader and ext_loader_type fields in
162Documentation/x86/boot.rst for additional information.
163
164
165bootloader_version:
166===================
167
168x86 bootloader version
169
170The complete bootloader version number. In the example above, this
171file will contain the value 564 = 0x234.
172
173See the type_of_loader and ext_loader_ver fields in
174Documentation/x86/boot.rst for additional information.
175
176
177cap_last_cap:
178=============
179
180Highest valid capability of the running kernel. Exports
181CAP_LAST_CAP from the kernel.
182
183
184core_pattern:
185=============
186
187core_pattern is used to specify a core dumpfile pattern name.
188
189* max length 127 characters; default value is "core"
190* core_pattern is used as a pattern template for the output filename;
191 certain string patterns (beginning with '%') are substituted with
192 their actual values.
193* backward compatibility with core_uses_pid:
194
195 If core_pattern does not include "%p" (default does not)
196 and core_uses_pid is set, then .PID will be appended to
197 the filename.
198
199* corename format specifiers::
200
201 %<NUL> '%' is dropped
202 %% output one '%'
203 %p pid
204 %P global pid (init PID namespace)
205 %i tid
206 %I global tid (init PID namespace)
207 %u uid (in initial user namespace)
208 %g gid (in initial user namespace)
209 %d dump mode, matches PR_SET_DUMPABLE and
210 /proc/sys/fs/suid_dumpable
211 %s signal number
212 %t UNIX time of dump
213 %h hostname
214 %e executable filename (may be shortened)
215 %E executable path
216 %<OTHER> both are dropped
217
218* If the first character of the pattern is a '|', the kernel will treat
219 the rest of the pattern as a command to run. The core dump will be
220 written to the standard input of that program instead of to a file.
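
The substitution rules listed above can be modelled in a few lines. The following is a simplified sketch, not the kernel's code: the function name and the ``values`` dictionary (mapping a specifier letter to its value) are assumptions, and the '|' pipe case is deliberately left out.

```python
def expand_core_pattern(pattern, values, core_uses_pid=False):
    """Expand a core_pattern template into a file name.

    values maps specifier letters such as "p" or "e" to their values.
    """
    out = []
    saw_pid = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 == len(pattern):
            break  # '%' at the very end of the pattern is dropped
        spec = pattern[i + 1]
        if spec == "%":
            out.append("%")  # "%%" outputs one '%'
        elif spec in values:
            out.append(str(values[spec]))
            saw_pid = saw_pid or spec == "p"
        # Any other specifier: both characters are dropped.
        i += 2
    name = "".join(out)
    # Backward compatibility with core_uses_pid.
    if core_uses_pid and not saw_pid:
        name += "." + str(values["p"])
    return name


print(expand_core_pattern("core.%e.%p", {"e": "myprog", "p": 1234}))
print(expand_core_pattern("core", {"p": 42}, core_uses_pid=True))
```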
221
222
223core_pipe_limit:
224================
225
226This sysctl is only applicable when core_pattern is configured to pipe
227core files to a user space helper (when the first character of
228core_pattern is a '|', see above). When collecting cores via a pipe
229to an application, it is occasionally useful for the collecting
230application to gather data about the crashing process from its
231/proc/pid directory. In order to do this safely, the kernel must wait
232for the collecting process to exit, so as not to remove the crashing
233process's proc files prematurely. This in turn creates the
234possibility that a misbehaving userspace collecting process can block
235the reaping of a crashed process simply by never exiting. This sysctl
236defends against that. It defines how many concurrent crashing
237processes may be piped to user space applications in parallel. If
238this value is exceeded, then those crashing processes above that value
239are noted via the kernel log and their cores are skipped. 0 is a
240special value, indicating that unlimited processes may be captured in
241parallel, but that no waiting will take place (i.e. the collecting
242process is not guaranteed access to /proc/<crashing pid>/). This
243value defaults to 0.
244
245
246core_uses_pid:
247==============
248
249The default coredump filename is "core". By setting
250core_uses_pid to 1, the coredump filename becomes core.PID.
251If core_pattern does not include "%p" (default does not)
252and core_uses_pid is set, then .PID will be appended to
253the filename.
254
255
256ctrl-alt-del:
257=============
258
259When the value in this file is 0, ctrl-alt-del is trapped and
260sent to the init(1) program to handle a graceful restart.
261When, however, the value is > 0, Linux's reaction to a Vulcan
262Nerve Pinch (tm) will be an immediate reboot, without even
263syncing its dirty buffers.
264
265Note:
266 when a program (like dosemu) has the keyboard in 'raw'
267 mode, the ctrl-alt-del is intercepted by the program before it
268 ever reaches the kernel tty layer, and it's up to the program
269 to decide what to do with it.
270
271
272dmesg_restrict:
273===============
274
275This toggle indicates whether unprivileged users are prevented
276from using dmesg(8) to view messages from the kernel's log buffer.
277When dmesg_restrict is set to (0) there are no restrictions. When
278dmesg_restrict is set to (1), users must have CAP_SYSLOG to use
279dmesg(8).
280
281The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the
282default value of dmesg_restrict.
283
284
285domainname & hostname:
286======================
287
288These files can be used to set the NIS/YP domainname and the
289hostname of your box in exactly the same way as the commands
290domainname and hostname, i.e.::
291
292 # echo "darkstar" > /proc/sys/kernel/hostname
293 # echo "mydomain" > /proc/sys/kernel/domainname
294
295has the same effect as::
296
297 # hostname "darkstar"
298 # domainname "mydomain"
299
300Note, however, that the classic darkstar.frop.org has the
301hostname "darkstar" and DNS (Internet Domain Name Server)
302domainname "frop.org", not to be confused with the NIS (Network
303Information Service) or YP (Yellow Pages) domainname. These two
304domain names are in general different. For a detailed discussion
305see the hostname(1) man page.
306
307
308hardlockup_all_cpu_backtrace:
309=============================
310
311This value controls the hard lockup detector behavior when a hard
312lockup condition is detected as to whether or not to gather further
313debug information. If enabled, arch-specific all-CPU stack dumping
314will be initiated.
315
3160: do nothing. This is the default behavior.
317
3181: on detection capture more debug information.
319
320
321hardlockup_panic:
322=================
323
324This parameter can be used to control whether the kernel panics
325when a hard lockup is detected.
326
327 0 - don't panic on hard lockup
328 1 - panic on hard lockup
329
330See Documentation/admin-guide/lockup-watchdogs.rst for more information. This can
331also be set using the nmi_watchdog kernel parameter.
332
333
334hotplug:
335========
336
337Path for the hotplug policy agent.
338Default value is "/sbin/hotplug".
339
340
341hung_task_panic:
342================
343
344Controls the kernel's behavior when a hung task is detected.
345This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
346
3470: continue operation. This is the default behavior.
348
3491: panic immediately.
350
351
352hung_task_check_count:
353======================
354
355The upper bound on the number of tasks that are checked.
356This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
357
358
359hung_task_timeout_secs:
360=======================
361
362When a task in D state has not been scheduled for more than
363this many seconds, a warning is reported.
364This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
365
3660: means infinite timeout - no checking done.
367
368Possible values to set are in range {0..LONG_MAX/HZ}.
369
370
371hung_task_check_interval_secs:
372==============================
373
374Hung task check interval. If hung task checking is enabled
375(see hung_task_timeout_secs), the check is done every
376hung_task_check_interval_secs seconds.
377This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
378
3790 (default): means use hung_task_timeout_secs as checking interval.
380Possible values to set are in range {0..LONG_MAX/HZ}.
381
382
383hung_task_warnings:
384===================
385
386The maximum number of warnings to report. During a check interval
387if a hung task is detected, this value is decreased by 1.
388When this value reaches 0, no more warnings will be reported.
389This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
390
391-1: report an infinite number of warnings.
392
393
394hyperv_record_panic_msg:
395========================
396
397Controls whether the panic kmsg data should be reported to Hyper-V.
398
3990: do not report panic kmsg data.
400
4011: report the panic kmsg data. This is the default behavior.
402
403
404kexec_load_disabled:
405====================
406
407A toggle indicating if the kexec_load syscall has been disabled. This
408value defaults to 0 (false: kexec_load enabled), but can be set to 1
409(true: kexec_load disabled). Once true, kexec can no longer be used, and
410the toggle cannot be set back to false. This allows a kexec image to be
411loaded before disabling the syscall, allowing a system to set up (and
412later use) an image without it being altered. Generally used together
413with the "modules_disabled" sysctl.
414
415
416kptr_restrict:
417==============
418
419This toggle indicates whether restrictions are placed on
420exposing kernel addresses via /proc and other interfaces.
421
422When kptr_restrict is set to 0 (the default) the address is hashed before
423printing. (This is equivalent to %p.)
424
425When kptr_restrict is set to (1), kernel pointers printed using the %pK
426format specifier will be replaced with 0's unless the user has CAP_SYSLOG
427and effective user and group ids are equal to the real ids. This is
428because %pK checks are done at read() time rather than open() time, so
429if permissions are elevated between the open() and the read() (e.g. via
430a setuid binary) then %pK will not leak kernel pointers to unprivileged
431users. Note, this is a temporary solution only. The correct long-term
432solution is to do the permission checks at open() time. Consider removing
433world read permissions from files that use %pK, and using dmesg_restrict
434to protect against uses of %pK in dmesg(8) if leaking kernel pointer
435values to unprivileged users is a concern.
436
437When kptr_restrict is set to (2), kernel pointers printed using
438%pK will be replaced with 0's regardless of privileges.
439
440
441l2cr: (PPC only)
442================
443
444This flag controls the L2 cache of G3 processor boards. If
4450, the cache is disabled. Enabled if nonzero.
446
447
448modules_disabled:
449=================
450
451A toggle value indicating if modules are allowed to be loaded
452in an otherwise modular kernel. This toggle defaults to off
453(0), but can be set true (1). Once true, modules can be
454neither loaded nor unloaded, and the toggle cannot be set back
455to false. Generally used with the "kexec_load_disabled" toggle.
456
457
458msg_next_id, sem_next_id, and shm_next_id:
459==========================================
460
461These three toggles allow the desired id to be specified for the next
462allocated IPC object: message, semaphore or shared memory respectively.
463
464By default they are equal to -1, which means generic allocation logic.
465Possible values to set are in range {0..INT_MAX}.
466
467Notes:
468 1) The kernel doesn't guarantee that the new object will have the desired
469 id, so it's up to userspace how to handle an object with the "wrong" id.
470 2) A toggle with a non-default value will be set back to -1 by the kernel
471 after a successful IPC object allocation. If an IPC object allocation syscall
472 fails, it is undefined whether the value remains unmodified or is reset to -1.
473
474
475nmi_watchdog:
476=============
477
478This parameter can be used to control the NMI watchdog
479(i.e. the hard lockup detector) on x86 systems.
480
4810 - disable the hard lockup detector
482
4831 - enable the hard lockup detector
484
485The hard lockup detector monitors each CPU for its ability to respond to
486timer interrupts. The mechanism utilizes CPU performance counter registers
487that are programmed to generate Non-Maskable Interrupts (NMIs) periodically
488while a CPU is busy. Hence, the alternative name 'NMI watchdog'.
489
490The NMI watchdog is disabled by default if the kernel is running as a guest
491in a KVM virtual machine. This default can be overridden by adding::
492
493 nmi_watchdog=1
494
495to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst).
496
497
498numa_balancing:
499===============
500
501Enables/disables automatic page fault based NUMA memory
502balancing. Memory is moved automatically to nodes
503that access it often.
504
505Enables/disables automatic NUMA memory balancing. On NUMA machines, there
506is a performance penalty if remote memory is accessed by a CPU. When this
507feature is enabled the kernel samples what task thread is accessing memory
508by periodically unmapping pages and later trapping a page fault. At the
509time of the page fault, it is determined if the data being accessed should
510be migrated to a local memory node.
511
512The unmapping of pages and trapping faults incur additional overhead that
513ideally is offset by improved memory locality but there is no universal
514guarantee. If the target workload is already bound to NUMA nodes then this
515feature should be disabled. Otherwise, if the system overhead from the
516feature is too high then the rate the kernel samples for NUMA hinting
517faults may be controlled by the numa_balancing_scan_period_min_ms,
518numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
519numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
520
521numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
522===============================================================================================================================
523
524
525Automatic NUMA balancing scans a task's address space and unmaps pages to
526detect if pages are properly placed or if the data should be migrated to a
527memory node local to where the task is running. Every "scan delay" the task
528scans the next "scan size" number of pages in its address space. When the
529end of the address space is reached the scanner restarts from the beginning.
530
531In combination, the "scan delay" and "scan size" determine the scan rate.
532When "scan delay" decreases, the scan rate increases. The scan delay and
533hence the scan rate of every task is adaptive and depends on historical
534behaviour. If pages are properly placed then the scan delay increases,
535otherwise the scan delay decreases. The "scan size" is not adaptive but
536the higher the "scan size", the higher the scan rate.
537
538Higher scan rates incur higher system overhead as page faults must be
539trapped and potentially data must be migrated. However, the higher the scan
540rate, the more quickly a task's memory is migrated to a local node if the
541workload pattern changes and minimises performance impact due to remote
542memory accesses. These sysctls control the thresholds for scan delays and
543the number of pages scanned.
544
545numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
546scan a task's virtual memory. It effectively controls the maximum scanning
547rate for each task.
548
549numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
550when it initially forks.
551
552numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
553scan a task's virtual memory. It effectively controls the minimum scanning
554rate for each task.
555
556numa_balancing_scan_size_mb is how many megabytes worth of pages are
557scanned for a given scan.
558
559
560osrelease, ostype & version:
561============================
562
563::
564
565 # cat osrelease
566 2.1.88
567 # cat ostype
568 Linux
569 # cat version
570 #5 Wed Feb 25 21:49:24 MET 1998
571
572The files osrelease and ostype should be clear enough. Version
573needs a little more clarification however. The '#5' means that
574this is the fifth kernel built from this source base and the
575date behind it indicates the time the kernel was built.
576The only way to tune these values is to rebuild the kernel :-)
577
578
579overflowgid & overflowuid:
580==========================
581
582If your architecture did not always support 32-bit UIDs (i.e. arm,
583i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to
584applications that use the old 16-bit UID/GID system calls, if the
585actual UID or GID would exceed 65535.
586
587These sysctls allow you to change the value of the fixed UID and GID.
588The default is 65534.
589
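The substitution described above can be sketched in shell. This is a model
of the rule only, not of the kernel code: legacy_id is a hypothetical helper,
and the second argument stands in for the current overflowuid or overflowgid
value.

```shell
# legacy_id REAL_ID [OVERFLOW_ID]
# Hypothetical helper: prints the id a 16-bit UID/GID system call
# would report, following the rule above. OVERFLOW_ID defaults to
# the documented default of 65534.
legacy_id() {
    real=$1
    overflow=${2:-65534}
    if [ "$real" -gt 65535 ]; then
        echo "$overflow"    # id does not fit in 16 bits
    else
        echo "$real"
    fi
}
```

For instance, ``legacy_id 100000`` prints ``65534``, while
``legacy_id 1000`` prints ``1000``.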
590
591panic:
592======
593
594The value in this file represents the number of seconds the kernel
595waits before rebooting on a panic. When you use the software watchdog,
596the recommended setting is 60.
597
598
599panic_on_io_nmi:
600================
601
602Controls the kernel's behavior when a CPU receives an NMI caused by
603an IO error.
604
6050: try to continue operation (default)
606
6071: panic immediately. The IO error triggered an NMI. This indicates a
608 serious system condition which could result in IO data corruption.
609 Rather than continuing, panicking might be a better choice. Some
610 servers issue this sort of NMI when the dump button is pushed,
611 and you can use this option to take a crash dump.
612
613
614panic_on_oops:
615==============
616
617Controls the kernel's behaviour when an oops or BUG is encountered.
618
6190: try to continue operation
620
6211: panic immediately. If the `panic` sysctl is also non-zero then the
622 machine will be rebooted.
623
624
625panic_on_stackoverflow:
626=======================
627
628Controls the kernel's behavior when it detects an overflow of the
629kernel, IRQ or exception stacks (user stacks are not covered).
630This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
631
6320: try to continue operation.
633
6341: panic immediately.
635
636
637panic_on_unrecovered_nmi:
638=========================
639
640The default Linux behaviour on an NMI caused by a memory error or an
641unknown source is to continue operation. For many environments such as
642scientific computing it is preferable that the box is taken out and the
643error dealt with rather than letting an uncorrected parity/ECC error propagate.
644
645A small number of systems do generate NMI's for bizarre random reasons
646such as power management so the default is off. This sysctl works like
647the other panic controls in this directory.
648
649
650panic_on_warn:
651==============
652
653Calls panic() in the WARN() path when set to 1. This is useful to avoid
654a kernel rebuild when attempting to kdump at the location of a WARN().
655
6560: only WARN(), default behaviour.
657
6581: call panic() after printing out WARN() location.
659
660
661panic_print:
662============
663
664Bitmask for printing system info when a panic happens. The user can choose
665any combination of the following bits:
666
667===== ========================================
668bit 0 print all tasks info
669bit 1 print system memory info
670bit 2 print timer info
671bit 3 print locks info if CONFIG_LOCKDEP is on
672bit 4 print ftrace buffer
673===== ========================================
674
675So, for example, to print tasks and memory info on panic, the user can run::
676
677 echo 3 > /proc/sys/kernel/panic_print
678
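To check what a given panic_print value selects, the bitmask can be decoded
in shell. This is a minimal sketch: panic_print_bits is a hypothetical helper
and the short labels are illustrative names for the five bits listed in the
table above.

```shell
# panic_print_bits MASK
# Hypothetical helper: prints one label per bit set in MASK,
# using the bit assignments from the table above.
panic_print_bits() {
    mask=$1
    out=""
    if [ $((mask & 1)) -ne 0 ]; then out="$out tasks"; fi
    if [ $((mask & 2)) -ne 0 ]; then out="$out memory"; fi
    if [ $((mask & 4)) -ne 0 ]; then out="$out timers"; fi
    if [ $((mask & 8)) -ne 0 ]; then out="$out locks"; fi
    if [ $((mask & 16)) -ne 0 ]; then out="$out ftrace"; fi
    echo "${out# }"    # drop the leading space
}
```

For instance, ``panic_print_bits 3`` prints ``tasks memory``, which matches
the echo example above.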
679
680panic_on_rcu_stall:
681===================
682
683When set to 1, calls panic() after RCU stall detection messages. This
684is useful to define the root cause of RCU stalls using a vmcore.
685
6860: do not panic() when RCU stall takes place, default behavior.
687
6881: panic() after printing RCU stall messages.
689
690
691perf_cpu_time_max_percent:
692==========================
693
694Hints to the kernel how much CPU time it should be allowed to
695use to handle perf sampling events. If the perf subsystem
696is informed that its samples are exceeding this limit, it
697will drop its sampling frequency to attempt to reduce its CPU
698usage.
699
700Some perf sampling happens in NMIs. If these samples
701unexpectedly take too long to execute, the NMIs can become
702stacked up next to each other so much that nothing else is
703allowed to execute.
704
7050:
706 disable the mechanism. Do not monitor or correct perf's
707 sampling rate no matter how much CPU time it takes.
708
7091-100:
710 attempt to throttle perf's sample rate to this
711 percentage of CPU. Note: the kernel calculates an
712 "expected" length of each sample event. 100 here means
713 100% of that expected length. Even if this is set to
714 100, you may still see sample throttling if this
715 length is exceeded. Set to 0 if you truly do not care
716 how much CPU is consumed.
717
718
719perf_event_paranoid:
720====================
721
722Controls use of the performance events system by unprivileged
723users (without CAP_SYS_ADMIN). The default value is 2.
724
725=== ==================================================================
726 -1 Allow use of (almost) all events by all users
727
728 Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
729
730>=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
731
732 Disallow raw tracepoint access by users without CAP_SYS_ADMIN
733
734>=1 Disallow CPU event access by users without CAP_SYS_ADMIN
735
736>=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
737=== ==================================================================
738
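Because the levels in the table are cumulative, a given setting implies every
restriction listed at or below it. The sketch below spells that out; it is
illustrative only, perf_restrictions is a hypothetical helper, and it models
just the ">=0", ">=1" and ">=2" rows for users without CAP_SYS_ADMIN.

```shell
# perf_restrictions LEVEL
# Hypothetical helper: prints the restrictions from the table above
# that apply at LEVEL. Prints nothing for -1.
perf_restrictions() {
    level=$1
    if [ "$level" -ge 0 ]; then
        echo "no ftrace function tracepoint"
        echo "no raw tracepoint access"
    fi
    if [ "$level" -ge 1 ]; then
        echo "no CPU event access"
    fi
    if [ "$level" -ge 2 ]; then
        echo "no kernel profiling"
    fi
}
```

It could be run against the live setting with
``perf_restrictions "$(cat /proc/sys/kernel/perf_event_paranoid)"``.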
739
740perf_event_max_stack:
741=====================
742
743Controls maximum number of stack frames to copy for (attr.sample_type &
744PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using
745'perf record -g' or 'perf trace --call-graph fp'.
746
747This can only be done when no events are in use that have callchains
748enabled, otherwise writing to this file will return -EBUSY.
749
750The default value is 127.
751
752
753perf_event_mlock_kb:
754====================
755
756Controls the size of the per-cpu ring buffer not counted against the mlock limit.
757
758The default value is 512 + 1 page
759
760
761perf_event_max_contexts_per_stack:
762==================================
763
764Controls maximum number of stack frame context entries for
765(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for
766instance, when using 'perf record -g' or 'perf trace --call-graph fp'.
767
768This can only be done when no events are in use that have callchains
769enabled, otherwise writing to this file will return -EBUSY.
770
771The default value is 8.
772
773
774pid_max:
775========
776
777PID allocation wrap value. When the kernel's next PID value
778reaches this value, it wraps back to a minimum PID value.
779PIDs of value pid_max or larger are not allocated.
780
781
782ns_last_pid:
783============
784
785The last pid allocated in the current pid namespace (the one in which the
786task using this sysctl lives). When selecting a pid for the next task on
787fork, the kernel tries to allocate a number starting from this one.
788
789
790powersave-nap: (PPC only)
791=========================
792
793If set, Linux-PPC will use the 'nap' mode of powersaving,
794otherwise the 'doze' mode will be used.
795
796==============================================================
797
798printk:
799=======
800
801The four values in printk denote: console_loglevel,
802default_message_loglevel, minimum_console_loglevel and
803default_console_loglevel respectively.
804
805These values influence printk() behavior when printing or
806logging error messages. See 'man 2 syslog' for more info on
807the different loglevels.
808
809- console_loglevel:
810 messages with a higher priority than
811 this will be printed to the console
812- default_message_loglevel:
813 messages without an explicit priority
814 will be printed with this priority
815- minimum_console_loglevel:
816 minimum (highest) value to which
817 console_loglevel can be set
818- default_console_loglevel:
819 default value for console_loglevel
820
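Since a higher priority means a numerically lower loglevel, a message reaches
the console only when its level is below console_loglevel. A minimal sketch
of that comparison follows; reaches_console is a hypothetical helper, and its
second argument is assumed to hold the four whitespace-separated values in
the order listed above.

```shell
# reaches_console MSG_LEVEL PRINTK_VALUES
# Hypothetical helper: succeeds if a message at MSG_LEVEL would be
# printed to the console, given the contents of the printk file.
reaches_console() {
    msg_level=$1
    # Split the four fields; the first one is console_loglevel.
    set -- $2
    console_loglevel=$1
    [ "$msg_level" -lt "$console_loglevel" ]
}
```

It might be used as
``reaches_console 3 "$(cat /proc/sys/kernel/printk)" && echo printed``.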
821
822printk_delay:
823=============
824
825Delay each printk message by printk_delay milliseconds.
826
827Values from 0 to 10000 are allowed.
828
829
830printk_ratelimit:
831=================
832
833Some warning messages are rate limited. printk_ratelimit specifies
834the minimum length of time between these messages (in jiffies). By
835default we allow one every 5 seconds.
836
837A value of 0 will disable rate limiting.
838
839
840printk_ratelimit_burst:
841=======================
842
843While long term we enforce one message per printk_ratelimit
844seconds, we do allow a burst of messages to pass through.
845printk_ratelimit_burst specifies the number of messages we can
846send before ratelimiting kicks in.
847
848
849printk_devkmsg:
850===============
851
852Control the logging to /dev/kmsg from userspace:
853
854ratelimit:
855 default, ratelimited
856
857on: unlimited logging to /dev/kmsg from userspace
858
859off: logging to /dev/kmsg disabled
860
861The kernel command line parameter printk.devkmsg= overrides this and is
862a one-time setting until next reboot: once set, it cannot be changed by
863this sysctl interface anymore.
864
865
866randomize_va_space:
867===================
868
869This option can be used to select the type of process address
870space randomization that is used in the system, for architectures
871that support this feature.
872
873== ===========================================================================
8740 Turn the process address space randomization off. This is the
875 default for architectures that do not support this feature anyway,
876 and kernels that are booted with the "norandmaps" parameter.
877
8781 Make the addresses of mmap base, stack and VDSO page randomized.
879 This, among other things, implies that shared libraries will be
880 loaded to random addresses. Also for PIE-linked binaries, the
881 location of code start is randomized. This is the default if the
882 CONFIG_COMPAT_BRK option is enabled.
883
8842 Additionally enable heap randomization. This is the default if
885 CONFIG_COMPAT_BRK is disabled.
886
887 There are a few legacy applications out there (such as some ancient
888 versions of libc.so.5 from 1996) that assume that brk area starts
889 just after the end of the code+bss. These applications break when
890 start of the brk area is randomized. There are however no known
891 non-legacy applications that would be broken this way, so for most
892 systems it is safe to choose full randomization.
893
894 Systems with ancient and/or broken binaries should be configured
895 with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
896 address space randomization.
897== ===========================================================================
898
899
900reboot-cmd: (Sparc only)
901========================
902
903??? This seems to be a way to give an argument to the Sparc
904ROM/Flash boot loader. Maybe to tell it what to do after
905rebooting. ???
906
907
908rtsig-max & rtsig-nr:
909=====================
910
911The file rtsig-max can be used to tune the maximum number
912of POSIX realtime (queued) signals that can be outstanding
913in the system.
914
915rtsig-nr shows the number of RT signals currently queued.
916
917
918sched_energy_aware:
919===================
920
921Enables/disables Energy Aware Scheduling (EAS). EAS starts
922automatically on platforms where it can run (that is,
923platforms with asymmetric CPU topologies and having an Energy
924Model available). If your platform happens to meet the
925requirements for EAS but you do not want to use it, change
926this value to 0.
927
928
929sched_schedstats:
930=================
931
932Enables/disables scheduler statistics. Enabling this feature
933incurs a small amount of overhead in the scheduler but is
934useful for debugging and performance tuning.
935
936
937sg-big-buff:
938============
939
940This file shows the size of the generic SCSI (sg) buffer.
941You can't tune it just yet, but you could change it at
942compile time by editing include/scsi/sg.h and changing
943the value of SG_BIG_BUFF.
944
945There shouldn't be any reason to change this value. If
946you can come up with one, you probably know what you
947are doing anyway :)
948
949
950shmall:
951=======
952
953This parameter sets the total number of shared memory pages that
954can be used system wide. Hence, SHMALL should always be at least
955ceil(shmmax/PAGE_SIZE).
956
957If you are not sure what the default PAGE_SIZE is on your Linux
958system, you can run the following command::
959
960 # getconf PAGE_SIZE
961
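The lower bound ceil(shmmax/PAGE_SIZE) can be computed with integer
arithmetic in shell. This is a sketch under stated assumptions: min_shmall is
a hypothetical helper, and values too large for the shell's signed arithmetic
will not compute correctly.

```shell
# min_shmall SHMMAX PAGE_SIZE
# Hypothetical helper: prints ceil(SHMMAX / PAGE_SIZE), the smallest
# shmall value consistent with the rule above.
min_shmall() {
    shmmax=$1
    page_size=$2
    # Adding (page_size - 1) before dividing rounds the result up.
    echo $(( (shmmax + page_size - 1) / page_size ))
}
```

For instance, ``min_shmall 4097 4096`` prints ``2``. On a live system the
arguments could come from ``cat /proc/sys/kernel/shmmax`` and
``getconf PAGE_SIZE``.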
962
963shmmax:
964=======
965
966This value can be used to query and set the run time limit
967on the maximum shared memory segment size that can be created.
968Shared memory segments up to 1Gb are now supported in the
969kernel. This value defaults to SHMMAX.
970
971
972shm_rmid_forced:
973================
974
975Linux lets you set resource limits, including how much memory one
976process can consume, via setrlimit(2). Unfortunately, shared memory
977segments are allowed to exist without association with any process, and
978thus might not be counted against any resource limits. If enabled,
979shared memory segments are automatically destroyed when their attach
980count becomes zero after a detach or a process termination. It will
981also destroy segments that were created, but never attached to, on exit
982from the process. The only use left for IPC_RMID is to immediately
983destroy an unattached segment. Of course, this breaks the way things are
984defined, so some applications might stop working. Note that this
985feature will do you no good unless you also configure your resource
986limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't
987need this.
988
989Note that if you change this from 0 to 1, already created segments
990without users and with a dead originating process will be destroyed.
991
992
993sysctl_writes_strict:
994=====================
995
996Control how file position affects the behavior of updating sysctl values
997via the /proc/sys interface:
998
999 == ======================================================================
1000 -1 Legacy per-write sysctl value handling, with no printk warnings.
1001 Each write syscall must fully contain the sysctl value to be
1002 written, and multiple writes on the same sysctl file descriptor
1003 will rewrite the sysctl value, regardless of file position.
1004 0 Same behavior as above, but warn about processes that perform writes
1005 to a sysctl file descriptor when the file position is not 0.
1006 1 (default) Respect file position when writing sysctl strings. Multiple
1007 writes will append to the sysctl value buffer. Anything past the max
1008 length of the sysctl value buffer will be ignored. Writes to numeric
1009 sysctl entries must always be at file position 0 and the value must
1010 be fully contained in the buffer sent in the write syscall.
1011 == ======================================================================
1012
1013
1014softlockup_all_cpu_backtrace:
1015=============================
1016
1017This value controls the soft lockup detector thread's behavior
1018when a soft lockup condition is detected as to whether or not
1019to gather further debug information. If enabled, each cpu will
1020be issued an NMI and instructed to capture a stack trace.
1021
1022This feature is only applicable for architectures which support
1023NMI.
1024
10250: do nothing. This is the default behavior.
1026
10271: on detection capture more debug information.
1028
1029
1030soft_watchdog:
1031==============
1032
1033This parameter can be used to control the soft lockup detector.
1034
1035 0 - disable the soft lockup detector
1036
1037 1 - enable the soft lockup detector
1038
1039The soft lockup detector monitors CPUs for threads that are hogging the CPUs
1040without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
1041from running. The mechanism depends on the CPU's ability to respond to timer
1042interrupts which are needed for the 'watchdog/N' threads to be woken up by
1043the watchdog timer function, otherwise the NMI watchdog - if enabled - can
1044detect a hard lockup condition.
1045
1046
1047stack_erasing:
1048==============
1049
1050This parameter can be used to control kernel stack erasing at the end
1051of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK.
1052
1053That erasing reduces the information which kernel stack leak bugs
1054can reveal and blocks some uninitialized stack variable attacks.
1055The tradeoff is the performance impact: on a single CPU system kernel
1056compilation sees a 1% slowdown, other systems and workloads may vary.
1057
1058 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated.
1059
1060 1: kernel stack erasing is enabled (default), it is performed before
1061 returning to the userspace at the end of syscalls.
1062
1063
1064tainted
1065=======
1066
1067Non-zero if the kernel has been tainted. Numeric values, which can be
1068ORed together. The letters are seen in "Tainted" line of Oops reports.
1069
1070====== ===== ==============================================================
1071 1 `(P)` proprietary module was loaded
1072 2 `(F)` module was force loaded
1073 4 `(S)` SMP kernel oops on an officially SMP incapable processor
1074 8 `(R)` module was force unloaded
1075 16 `(M)` processor reported a Machine Check Exception (MCE)
1076 32 `(B)` bad page referenced or some unexpected page flags
1077 64 `(U)` taint requested by userspace application
1078 128 `(D)` kernel died recently, i.e. there was an OOPS or BUG
1079 256 `(A)` an ACPI table was overridden by user
1080 512 `(W)` kernel issued warning
1081 1024 `(C)` staging driver was loaded
1082 2048 `(I)` workaround for bug in platform firmware applied
1083 4096 `(O)` externally-built ("out-of-tree") module was loaded
1084 8192 `(E)` unsigned module was loaded
1085 16384 `(L)` soft lockup occurred
1086 32768 `(K)` kernel has been live patched
1087 65536 `(X)` Auxiliary taint, defined for and used by distros
1088131072 `(T)` The kernel was built with the struct randomization plugin
1089====== ===== ==============================================================
1090
1091See Documentation/admin-guide/tainted-kernels.rst for more information.
1092
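Because the values are ORed together, a tainted value can be decoded bit by
bit. The sketch below maps each set bit to its letter from the table above;
decode_taint is a hypothetical helper, not an existing tool.

```shell
# decode_taint VALUE
# Hypothetical helper: prints the letters from the table above for
# every bit set in VALUE, lowest bit first.
decode_taint() {
    value=$1
    # Letters in bit order: 1, 2, 4, ... 131072.
    letters="P F S R M B U D A W C I O E L K X T"
    bit=1
    out=""
    for letter in $letters; do
        if [ $((value & bit)) -ne 0 ]; then
            out="$out$letter"
        fi
        bit=$((bit * 2))
    done
    echo "$out"
}
```

For instance, ``decode_taint 4097`` prints ``PO`` (a proprietary module and
an out-of-tree module were loaded). The live value could be passed as
``decode_taint "$(cat /proc/sys/kernel/tainted)"``.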
1093
1094threads-max:
1095============
1096
1097This value controls the maximum number of threads that can be created
1098using fork().
1099
1100During initialization the kernel sets this value such that even if the
1101maximum number of threads is created, the thread structures occupy only
1102a part (1/8th) of the available RAM pages.
1103
1104The minimum value that can be written to threads-max is 20.
1105
1106The maximum value that can be written to threads-max is given by the
1107constant FUTEX_TID_MASK (0x3fffffff).
1108
1109If a value outside of this range is written to threads-max an error
1110EINVAL occurs.
1111
1112The value written is checked against the available RAM pages. If the
1113thread structures would occupy too much (more than 1/8th) of the
1114available RAM pages threads-max is reduced accordingly.
1115
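The documented range can be checked before writing a value. This sketch
covers only the fixed bounds given above (20 and FUTEX_TID_MASK); it does not
model the RAM-based reduction, and threads_max_in_range is a hypothetical
helper.

```shell
# threads_max_in_range VALUE
# Hypothetical helper: succeeds if VALUE lies within the range that
# can be written to threads-max without EINVAL, per the text above.
threads_max_in_range() {
    [ "$1" -ge 20 ] && [ "$1" -le $((0x3fffffff)) ]
}
```

It might be used as
``threads_max_in_range 100000 && echo 100000 > /proc/sys/kernel/threads-max``.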
1116
1117unknown_nmi_panic:
1118==================
1119
1120The value in this file affects the handling of NMIs. When the
1121value is non-zero, an unknown NMI is trapped and a panic occurs. At
1122that time, kernel debugging information is displayed on the console.
1123
1124For example, the NMI switch found on most IA32 servers triggers an
1125unknown NMI. If a system hangs, try pressing the NMI switch.
1126
1127
1128watchdog:
1129=========
1130
1131This parameter can be used to disable or enable the soft lockup detector
1132_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time.
1133
1134 0 - disable both lockup detectors
1135
1136 1 - enable both lockup detectors
1137
1138The soft lockup detector and the NMI watchdog can also be disabled or
1139enabled individually, using the soft_watchdog and nmi_watchdog parameters.
1140If the watchdog parameter is read, for example by executing::
1141
1142 cat /proc/sys/kernel/watchdog
1143
1144the output of this command (0 or 1) shows the logical OR of soft_watchdog
1145and nmi_watchdog.
1146
1147
1148watchdog_cpumask:
1149=================
1150
1151This value can be used to control on which cpus the watchdog may run.
1152The default cpumask is all possible cores, but if NO_HZ_FULL is
1153enabled in the kernel config, and cores are specified with the
1154nohz_full= boot argument, those cores are excluded by default.
1155Offline cores can be included in this mask, and if the core is later
1156brought online, the watchdog will be started based on the mask value.
1157
1158Typically this value would only be touched in the nohz_full case
1159to re-enable cores that by default were not running the watchdog,
1160if a kernel lockup was suspected on those cores.
1161
1162The argument value is the standard cpulist format for cpumasks,
1163so for example to enable the watchdog on cores 0, 2, 3, and 4 you
1164might say::
1165
1166 echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask
1167
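As an illustration of the cpulist format used above, the following sketch
expands a list such as "0,2-4" into individual core numbers. The helper name
parse_cpulist is hypothetical, and the sketch assumes only plain numbers and
simple "first-last" ranges; any other cpulist syntax is not handled.

```python
def parse_cpulist(cpulist):
    """Expand a cpulist string such as "0,2-4" into a sorted list of CPUs."""
    cpus = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue  # tolerate empty fields such as a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            # Ranges are inclusive on both ends.
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)
```

For the example above, parse_cpulist("0,2-4") yields cores 0, 2, 3 and 4.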
1168
1169watchdog_thresh:
1170================
1171
1172This value can be used to control the frequency of hrtimer and NMI
1173events and the soft and hard lockup thresholds. The default threshold
1174is 10 seconds.
1175
1176The softlockup threshold is (2 * watchdog_thresh). Setting this
1177tunable to zero will disable lockup detection altogether.
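
The relation between watchdog_thresh and the softlockup threshold can be
sketched as follows. The function name is hypothetical, and returning None
for a value of zero is merely this sketch's way of saying that detection is
disabled.

```python
def soft_lockup_threshold(watchdog_thresh=10):
    """Softlockup threshold in seconds, or None if detection is disabled."""
    if watchdog_thresh == 0:
        # Zero disables lockup detection altogether.
        return None
    return 2 * watchdog_thresh
```

With the default watchdog_thresh of 10 seconds this gives a softlockup
threshold of 20 seconds.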
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
new file mode 100644
index 000000000000..a7d44e71019d
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -0,0 +1,461 @@
1================================
2Documentation for /proc/sys/net/
3================================
4
5Copyright
6
7Copyright (c) 1999
8
9 - Terrehon Bowden <terrehon@pacbell.net>
10 - Bodo Bauer <bb@ricochet.net>
11
12Copyright (c) 2000
13
14 - Jorge Nerin <comandante@zaralinux.com>
15
16Copyright (c) 2009
17
18 - Shen Feng <shen@cn.fujitsu.com>
19
20For general info and legal blurb, please look in index.rst.
21
22------------------------------------------------------------------------------
23
24This file contains the documentation for the sysctl files in
25/proc/sys/net
26
27The interface to the networking parts of the kernel is located in
28/proc/sys/net. The following table shows all possible subdirectories. You may
29see only some of them, depending on your kernel's configuration.
30
31
32Table : Subdirectories in /proc/sys/net
33
34 ========= =================== = ========== ==================
35 Directory Content Directory Content
36 ========= =================== = ========== ==================
37 core General parameter appletalk Appletalk protocol
38 unix Unix domain sockets netrom NET/ROM
39 802 E802 protocol ax25 AX25
40 ethernet Ethernet protocol rose X.25 PLP layer
41 ipv4 IP version 4 x25 X.25 protocol
42 ipx IPX token-ring IBM token ring
43 bridge Bridging decnet DEC net
44 ipv6 IP version 6 tipc TIPC
45 ========= =================== = ========== ==================
46
471. /proc/sys/net/core - Network core options
48============================================
49
50bpf_jit_enable
51--------------
52
53This enables the BPF Just in Time (JIT) compiler. BPF is a flexible
54and efficient infrastructure that can execute bytecode at various
55hook points. It is used in a number of Linux kernel subsystems such
56as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints)
57and security (e.g. seccomp). LLVM has a BPF back end that can compile
58restricted C into a sequence of BPF instructions. After program load
59through bpf(2) and passing a verifier in the kernel, a JIT will then
60translate these BPF proglets into native CPU instructions. There are
61two flavors of JITs, the newer eBPF JIT currently supported on:
62
63 - x86_64
64 - x86_32
65 - arm64
66 - arm32
67 - ppc64
68 - sparc64
69 - mips64
70 - s390x
71 - riscv
72
73And the older cBPF JIT supported on the following archs:
74
75 - mips
76 - ppc
77 - sparc
78
79eBPF JITs are a superset of cBPF JITs, meaning the kernel will
80migrate cBPF instructions into eBPF instructions and then JIT
81compile them transparently. Older cBPF JITs can only translate
82tcpdump filters, seccomp rules, etc., but not the eBPF programs
83loaded through bpf(2) mentioned above.
84
85Values:
86
87 - 0 - disable the JIT (default value)
88 - 1 - enable the JIT
89 - 2 - enable the JIT and ask the compiler to emit traces to the kernel log.
90
91bpf_jit_harden
92--------------
93
94This enables hardening for the BPF JIT compiler. eBPF JIT backends
95are supported. Enabling hardening costs some performance, but can
96mitigate JIT spraying.
97
98Values:
99
100 - 0 - disable JIT hardening (default value)
101 - 1 - enable JIT hardening for unprivileged users only
102 - 2 - enable JIT hardening for all users
103
104bpf_jit_kallsyms
105----------------
106
107When the BPF JIT compiler is enabled, the compiled images are at
108addresses unknown to the kernel, meaning they show up neither in traces
109nor in /proc/kallsyms. This enables export of these addresses, which can
110be used for debugging/tracing. If bpf_jit_harden is enabled, this
111feature is disabled.
112
113Values :
114
115 - 0 - disable JIT kallsyms export (default value)
116 - 1 - enable JIT kallsyms export for privileged users only
117
118bpf_jit_limit
119-------------
120
121This enforces a global limit for memory allocations to the BPF JIT
122compiler in order to reject unprivileged JIT requests once it has
123been surpassed. bpf_jit_limit contains the value of the global limit
124in bytes.
125
126dev_weight
127----------
128
129The maximum number of packets that the kernel can handle on a NAPI
130interrupt. It is a per-CPU variable. For drivers that support LRO or GRO_HW,
131a hardware-aggregated packet is counted as one packet in this context.
132
133Default: 64
134
135dev_weight_rx_bias
136------------------
137
138RPS (e.g. RFS, aRFS) processing competes with the registered NAPI poll function
139of the driver for the per-softirq-cycle netdev_budget. This parameter influences
140the proportion of the configured netdev_budget that is spent on RPS-based packet
141processing during RX softirq cycles. It is further meant to make the current
142dev_weight adaptable to asymmetric CPU needs on the RX/TX sides of the network stack
143(see dev_weight_tx_bias). It is effective on a per-CPU basis. The value is derived
144from dev_weight by multiplication (dev_weight * dev_weight_rx_bias).
145
146Default: 1
147
148dev_weight_tx_bias
149------------------
150
151Scales the maximum number of packets that can be processed during a TX softirq cycle.
152Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric
153net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog.
154
155Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias).
156
157Default: 1
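
The two bias calculations above can be illustrated with a small sketch. The
function name and the dictionary keys are hypothetical; the arithmetic is the
multiplication stated for dev_weight_rx_bias and dev_weight_tx_bias, using
the documented defaults.

```python
def effective_weights(dev_weight=64, rx_bias=1, tx_bias=1):
    """Scale dev_weight by the RX and TX bias factors."""
    return {
        "rx": dev_weight * rx_bias,  # dev_weight * dev_weight_rx_bias
        "tx": dev_weight * tx_bias,  # dev_weight * dev_weight_tx_bias
    }
```

With the defaults both sides stay at 64 packets; setting only
dev_weight_rx_bias to 2 doubles the RX side while leaving the TX side as is.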
158
159default_qdisc
160-------------
161
162The default queuing discipline to use for network devices. This allows
163overriding the default of pfifo_fast with an alternative. Since the default
164queuing discipline is created without additional parameters, it is best suited
165to queuing disciplines that work well without configuration, like stochastic
166fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use
167queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin,
168which require setting up classes and bandwidths. Note that physical multiqueue
169interfaces still use mq as root qdisc, which in turn uses this default for its
170leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead
171default to noqueue.
172
173Default: pfifo_fast
174
175busy_read
176---------
177
178Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL)
179Approximate time in us to busy loop waiting for packets on the device queue.
180This sets the default value of the SO_BUSY_POLL socket option.
181Can be set or overridden per socket by setting socket option SO_BUSY_POLL,
182which is the preferred method of enabling. If you need to enable the feature
183globally via sysctl, a value of 50 is recommended.
184
185Will increase power usage.
186
187Default: 0 (off)
188
189busy_poll
190---------
191Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL)
192Approximate time in us to busy loop waiting for events.
193The recommended value depends on the number of sockets you poll on:
19450 for several sockets, 100 for several hundred.
195For more than that you probably want to use epoll.
196Note that only sockets with SO_BUSY_POLL set will be busy polled,
197so you want to either selectively set SO_BUSY_POLL on those sockets or set
198net.core.busy_read globally.
199
200Will increase power usage.
201
202Default: 0 (off)
203
204rmem_default
205------------
206
207The default setting of the socket receive buffer in bytes.
208
209rmem_max
210--------
211
212The maximum receive socket buffer size in bytes.
213
214tstamp_allow_data
215-----------------
216Allow processes to receive tx timestamps looped together with the original
217packet contents. If disabled, transmit timestamp requests from unprivileged
218processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set.
219
220Default: 1 (on)
221
222
223wmem_default
224------------
225
226The default setting (in bytes) of the socket send buffer.
227
228wmem_max
229--------
230
231The maximum send socket buffer size in bytes.
232
233message_burst and message_cost
234------------------------------
235
236These parameters are used to limit the warning messages written to the kernel
237log from the networking code. They enforce a rate limit to make a
238denial-of-service attack impossible. A higher message_cost factor results in
239fewer messages being written. message_burst controls when messages will
240be dropped. The default settings limit warning messages to one every five
241seconds.
242
243warnings
244--------
245
246This sysctl is now unused.
247
248This was used to control console messages from the networking stack that
249occur because of problems on the network like duplicate address or bad
250checksums.
251
252These messages are now emitted at KERN_DEBUG and can generally be enabled
253and controlled by the dynamic_debug facility.
254
255netdev_budget
256-------------
257
258Maximum number of packets taken from all interfaces in one polling cycle (NAPI
259poll). In one polling cycle interfaces which are registered to polling are
260probed in a round-robin manner. Also, a polling cycle may not exceed
261netdev_budget_usecs microseconds, even if netdev_budget has not been
262exhausted.
263
264netdev_budget_usecs
265-------------------
266
267Maximum number of microseconds in one NAPI polling cycle. Polling
268will exit when either netdev_budget_usecs have elapsed during the
269poll cycle or the number of packets processed reaches netdev_budget.
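
The two exit conditions of a polling cycle can be illustrated with a toy
model. Everything here is a simplification made up for illustration: the
function name, the fixed cost per packet and the way interfaces are
represented are assumptions, and real NAPI polling is considerably more
involved.

```python
def napi_poll_cycle(pending, netdev_budget, netdev_budget_usecs,
                    dev_weight=64, usecs_per_packet=1):
    """Toy model of one polling cycle over several interfaces.

    pending holds the number of queued packets per interface.
    Returns (packets processed, reason the cycle ended).
    """
    if dev_weight <= 0:
        raise ValueError("dev_weight must be positive")
    pending = list(pending)  # work on a copy
    processed = 0
    elapsed_usecs = 0
    while any(pending):
        # Interfaces are visited in a round-robin manner.
        for i in range(len(pending)):
            if processed >= netdev_budget:
                return processed, "budget"
            if elapsed_usecs >= netdev_budget_usecs:
                return processed, "time"
            # One poll handles at most dev_weight packets and never
            # more than what is left of the budget.
            batch = min(pending[i], dev_weight, netdev_budget - processed)
            pending[i] -= batch
            processed += batch
            elapsed_usecs += batch * usecs_per_packet
    return processed, "idle"
```

In this model a cycle over two busy interfaces with a budget of 300 packets
ends with reason "budget", while the same cycle with a very small time limit
ends with reason "time" before the packet budget is used up.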
270
271netdev_max_backlog
272------------------
273
274Maximum number of packets queued on the INPUT side when the interface
275receives packets faster than the kernel can process them.
276
277netdev_rss_key
278--------------
279
280RSS (Receive Side Scaling) enabled drivers use a 40-byte host key that is
281randomly generated.
282Some user space programs might need to read its content even if drivers do
283not provide ethtool -x support yet.
284
285::
286
287 myhost:~# cat /proc/sys/net/core/netdev_rss_key
288 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total)
289
290The file contains NUL bytes if no driver has ever called the netdev_rss_key_fill() function.
291
292Note:
293 /proc/sys/net/core/netdev_rss_key contains 52 bytes of key,
294 but most drivers only use 40 bytes of it.
295
296::
297
298 myhost:~# ethtool -x eth0
299 RX flow hash indirection table for eth0 with 8 RX ring(s):
300 0: 0 1 2 3 4 5 6 7
301 RSS hash key:
302 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89
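
User space that reads the key may want it as raw bytes. The following sketch
assumes the colon-separated hexadecimal layout shown in the examples above;
the helper names are hypothetical.

```python
def parse_rss_key(text):
    """Convert a colon-separated hex string into raw key bytes."""
    return bytes(int(octet, 16) for octet in text.strip().split(":"))


def driver_view(key, used_bytes=40):
    """Leading part of the key, as used by drivers that take 40 bytes."""
    return key[:used_bytes]


def is_unset(key):
    """True if the key consists only of NUL bytes."""
    return not any(key)
```

Applied to the contents of /proc/sys/net/core/netdev_rss_key, parse_rss_key()
would return the full 52-byte key, and driver_view() the first 40 bytes of
it.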
303
304netdev_tstamp_prequeue
305----------------------
306
307If set to 0, RX packet timestamps can be sampled after RPS processing, when
308the target CPU processes packets. This might delay timestamps somewhat, but
309allows the load to be distributed over several CPUs.
310
311If set to 1 (default), timestamps are sampled as soon as possible, before
312queueing.
313
314optmem_max
315----------
316
317Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
318of struct cmsghdr structures with appended data.
319
320fb_tunnels_only_for_init_net
321----------------------------
322
323Controls whether fallback tunnels (like tunl0, gre0, gretap0, erspan0,
324sit0, ip6tnl0, ip6gre0) are automatically created when a new
325network namespace is created, if the corresponding tunnel is present
326in the initial network namespace.
327If set to 1, these devices are not automatically created, and
328user space is responsible for creating them if needed.
329
330Default : 0 (for compatibility reasons)
331
332devconf_inherit_init_net
333------------------------
334
335Controls whether a new network namespace should inherit all current
336settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
337default, we keep the current behavior: for IPv4 we inherit all current
338settings from init_net and for IPv6 we reset all settings to default.
339
340If set to 1, both IPv4 and IPv6 settings are forced to inherit from
341current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
342forced to reset to their default values.
343
344Default : 0 (for compatibility reasons)
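
The three values of devconf_inherit_init_net can be summarized in a small
lookup table. This is only a restatement of the text above; the function
name and the strings "inherit" and "reset" are hypothetical labels.

```python
def devconf_inheritance(value):
    """Describe what a new network namespace does with its conf settings."""
    table = {
        # Default: IPv4 inherits from init_net, IPv6 is reset to defaults.
        0: {"ipv4": "inherit", "ipv6": "reset"},
        # Both families inherit the current settings of init_net.
        1: {"ipv4": "inherit", "ipv6": "inherit"},
        # Both families are reset to their default values.
        2: {"ipv4": "reset", "ipv6": "reset"},
    }
    return table[value]
```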
345
3462. /proc/sys/net/unix - Parameters for Unix domain sockets
347==========================================================
348
349There is only one file in this directory.
350unix_dgram_qlen limits the maximum number of datagrams queued in a Unix domain
351socket's buffer. It will not take effect unless the PF_UNIX flag is specified.
352
353
3543. /proc/sys/net/ipv4 - IPv4 settings
355=====================================
356Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for
357descriptions of these entries.
358
359
3604. Appletalk
361============
362
363The /proc/sys/net/appletalk directory holds the Appletalk configuration data
364when Appletalk is loaded. The configurable parameters are:
365
366aarp-expiry-time
367----------------
368
369The amount of time we keep an AARP entry before expiring it. Used to age out
370old hosts.
371
372aarp-resolve-time
373-----------------
374
375The amount of time we will spend trying to resolve an Appletalk address.
376
377aarp-retransmit-limit
378---------------------
379
380The number of times we will retransmit a query before giving up.
381
382aarp-tick-time
383--------------
384
385Controls the rate at which expirations are checked.
386
387The directory /proc/net/appletalk holds the list of active Appletalk sockets
388on a machine.
389
390The fields indicate the DDP type, the local address (in network:node format),
391the remote address, the size of the transmit pending queue, the size of the
392receive queue (bytes waiting for applications to read), the state and the uid
393owning the socket.
394
395/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
396shows the name of the interface, its Appletalk address, the network range on
397that address (or network number for phase 1 networks), and the status of the
398interface.
399
400/proc/net/atalk_route lists each known network route. It lists the target
401(network) that the route leads to, the router (may be directly connected), the
402route flags, and the device the route is using.
403
404
4055. IPX
406------
407
408The IPX protocol has no tunable values in proc/sys/net.
409
410The IPX protocol does, however, provide proc/net/ipx. This lists each IPX
411socket giving the local and remote addresses in Novell format (that is
412network:node:port). In accordance with the strange Novell tradition,
413everything but the port is in hex. Not_Connected is displayed for sockets that
414are not tied to a specific remote address. The Tx and Rx queue sizes indicate
415the number of bytes pending for transmission and reception. The state
416indicates the state the socket is in and the uid is the owning uid of the
417socket.
418
419The /proc/net/ipx_interface file lists all IPX interfaces. For each interface
420it gives the network number, the node number, and indicates if the network is
421the primary network. It also indicates which device it is bound to (or
422Internal for internal networks) and the Frame Type if appropriate. Linux
423supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for
424IPX.
425
426The /proc/net/ipx_route table holds a list of IPX routes. For each route it
427gives the destination network, the router node (or Directly) and the network
428address of the router (or Connected) for internal networks.
429
4306. TIPC
431-------
432
433tipc_rmem
434---------
435
436The TIPC protocol now has a tunable for the receive memory, similar to the
437tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max)
438
439::
440
441 # cat /proc/sys/net/tipc/tipc_rmem
442 4252725 34021800 68043600
443 #
444
445The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
446are scaled (shifted) versions of that same value. Note that the min value
447is not at this point in time used in any meaningful way, but the triplet is
448preserved in order to be consistent with things like tcp_rmem.
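
The triplet can be split into its three fields as sketched below. The
function name is hypothetical. The shift amounts in the comments were worked
out from the sample output above and are not taken from the kernel source,
so other kernels may scale the values differently.

```python
def parse_tipc_rmem(text):
    """Split the contents of tipc_rmem into (min, default, max)."""
    minimum, default, maximum = (int(field) for field in text.split())
    return minimum, default, maximum


minimum, default, maximum = parse_tipc_rmem("4252725 34021800 68043600")

# For the sample values, default and min are shifted versions of max.
default_matches = (maximum >> 1) == default
minimum_matches = (maximum >> 4) == minimum
```

For the sample output both comparisons hold: the default is half of the max
value and the min is one sixteenth of it.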
449
450named_timeout
451-------------
452
453TIPC name table updates are distributed asynchronously in a cluster, without
454any form of transaction handling. This means that different race scenarios are
455possible. One such is that a name withdrawal sent out by one node and received
456by another node may arrive after a second, overlapping name publication already
457has been accepted from a third node, although the conflicting updates
458originally may have been issued in the correct sequential order.
459If named_timeout is nonzero, failed topology updates will be placed on a defer
460queue until another event arrives that clears the error, or until the timeout
461expires. Value is in milliseconds.
diff --git a/Documentation/admin-guide/sysctl/sunrpc.rst b/Documentation/admin-guide/sysctl/sunrpc.rst
new file mode 100644
index 000000000000..09780a682afd
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/sunrpc.rst
@@ -0,0 +1,25 @@
1===================================
2Documentation for /proc/sys/sunrpc/
3===================================
4
5kernel version 2.2.10
6
7Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
8
9For general info and legal blurb, please look in index.rst.
10
11------------------------------------------------------------------------------
12
13This file contains the documentation for the sysctl files in
14/proc/sys/sunrpc and is valid for Linux kernel version 2.2.
15
16The files in this directory can be used to (re)set the debug
17flags of the SUN Remote Procedure Call (RPC) subsystem in
18the Linux kernel. This stuff is used for NFS, KNFSD and
19maybe a few other things as well.
20
21The files in there are used to control the debugging flags:
22rpc_debug, nfs_debug, nfsd_debug and nlm_debug.
23
24These flags are for kernel hackers only. You should read the
25source code in net/sunrpc/ for more information.
diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst
new file mode 100644
index 000000000000..650eaa03f15e
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/user.rst
@@ -0,0 +1,78 @@
1=================================
2Documentation for /proc/sys/user/
3=================================
4
5kernel version 4.9.0
6
7Copyright (c) 2016 Eric Biederman <ebiederm@xmission.com>
8
9------------------------------------------------------------------------------
10
11This file contains the documentation for the sysctl files in
12/proc/sys/user.
13
14The files in this directory can be used to override the default
15limits on the number of namespaces and other objects that have
16per-user, per-user-namespace limits.
17
18The primary purpose of these limits is to stop programs that
19malfunction and attempt to create a ridiculous number of objects,
20before the malfunction becomes a system wide problem. It is the
21intention that the defaults of these limits are set high enough that
22no program in normal operation should run into these limits.
23
24The creation of per-user, per-user-namespace objects is charged to
25the user in the user namespace who created the object and
26verified to be below the per-user limit in that user namespace.
27
28The creation of objects is also charged to all of the users
29who created the user namespaces in which the creation of the object
30happens (user namespaces can be nested) and verified to be below the per-user
31limits in the user namespaces of those users.
32
33This recursive counting of created objects ensures that creating a
34user namespace does not allow a user to escape their current limits.
35
36Currently, these files are in /proc/sys/user:
37
38max_cgroup_namespaces
39=====================
40
41 The maximum number of cgroup namespaces that any user in the current
42 user namespace may create.
43
44max_ipc_namespaces
45==================
46
47 The maximum number of ipc namespaces that any user in the current
48 user namespace may create.
49
50max_mnt_namespaces
51==================
52
53 The maximum number of mount namespaces that any user in the current
54 user namespace may create.
55
56max_net_namespaces
57==================
58
59 The maximum number of network namespaces that any user in the
60 current user namespace may create.
61
62max_pid_namespaces
63==================
64
65 The maximum number of pid namespaces that any user in the current
66 user namespace may create.
67
68max_user_namespaces
69===================
70
71 The maximum number of user namespaces that any user in the current
72 user namespace may create.
73
74max_uts_namespaces
75==================
76
77 The maximum number of uts namespaces that any user in the current
78 user namespace may create.
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
new file mode 100644
index 000000000000..64aeee1009ca
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -0,0 +1,964 @@
1===============================
2Documentation for /proc/sys/vm/
3===============================
4
5kernel version 2.6.29
6
7Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
8
9Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>
10
11For general info and legal blurb, please look in index.rst.
12
13------------------------------------------------------------------------------
14
15This file contains the documentation for the sysctl files in
16/proc/sys/vm and is valid for Linux kernel version 2.6.29.
17
18The files in this directory can be used to tune the operation
19of the virtual memory (VM) subsystem of the Linux kernel and
20the writeout of dirty data to disk.
21
22Default values and initialization routines for most of these
23files can be found in mm/swap.c.
24
25Currently, these files are in /proc/sys/vm:
26
27- admin_reserve_kbytes
28- block_dump
29- compact_memory
30- compact_unevictable_allowed
31- dirty_background_bytes
32- dirty_background_ratio
33- dirty_bytes
34- dirty_expire_centisecs
35- dirty_ratio
36- dirtytime_expire_seconds
37- dirty_writeback_centisecs
38- drop_caches
39- extfrag_threshold
40- hugetlb_shm_group
41- laptop_mode
42- legacy_va_layout
43- lowmem_reserve_ratio
44- max_map_count
45- memory_failure_early_kill
46- memory_failure_recovery
47- min_free_kbytes
48- min_slab_ratio
49- min_unmapped_ratio
50- mmap_min_addr
51- mmap_rnd_bits
52- mmap_rnd_compat_bits
53- nr_hugepages
54- nr_hugepages_mempolicy
55- nr_overcommit_hugepages
56- nr_trim_pages (only if CONFIG_MMU=n)
57- numa_zonelist_order
58- oom_dump_tasks
59- oom_kill_allocating_task
60- overcommit_kbytes
61- overcommit_memory
62- overcommit_ratio
63- page-cluster
64- panic_on_oom
65- percpu_pagelist_fraction
66- stat_interval
67- stat_refresh
68- numa_stat
69- swappiness
70- unprivileged_userfaultfd
71- user_reserve_kbytes
72- vfs_cache_pressure
73- watermark_boost_factor
74- watermark_scale_factor
75- zone_reclaim_mode
76
77
78admin_reserve_kbytes
79====================
80
81The amount of free memory in the system that should be reserved for users
82with the capability CAP_SYS_ADMIN.
83
84admin_reserve_kbytes defaults to min(3% of free pages, 8MB)
85
86That should provide enough for the admin to log in and kill a process,
87if necessary, under the default overcommit 'guess' mode.
88
89Systems running under overcommit 'never' should increase this to account
90for the full Virtual Memory Size of programs used to recover. Otherwise,
91root may not be able to log in to recover the system.
92
93How do you calculate a minimum useful reserve?
94
95sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
96
97For overcommit 'guess', we can sum resident set sizes (RSS).
98On x86_64 this is about 8MB.
99
100For overcommit 'never', we can take the max of their virtual sizes (VSZ)
101and add the sum of their RSS.
102On x86_64 this is about 128MB.
103
104Changing this takes effect whenever an application requests memory.
105
106
107block_dump
108==========
109
110block_dump enables block I/O debugging when set to a nonzero value. More
111information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
112
113
114compact_memory
115==============
116
117Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
118all zones are compacted such that free memory is available in contiguous
119blocks where possible. This can be important for example in the allocation of
120huge pages although processes will also directly compact memory as required.
121
122
123compact_unevictable_allowed
124===========================
125
126Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
127allowed to examine the unevictable lru (mlocked pages) for pages to compact.
128This should be used on systems where stalls for minor page faults are an
129acceptable trade for large contiguous free memory. Set to 0 to prevent
130compaction from moving pages that are unevictable. Default value is 1.
131
132
133dirty_background_bytes
134======================
135
136Contains the amount of dirty memory at which the background kernel
137flusher threads will start writeback.
138
139Note:
140 dirty_background_bytes is the counterpart of dirty_background_ratio. Only
141 one of them may be specified at a time. When one sysctl is written it is
142 immediately taken into account to evaluate the dirty memory limits and the
143 other appears as 0 when read.
144
145
146dirty_background_ratio
147======================
148
149Contains, as a percentage of total available memory that contains free pages
150and reclaimable pages, the number of pages at which the background kernel
151flusher threads will start writing out dirty data.
152
153The total available memory is not equal to total system memory.
154
155
156dirty_bytes
157===========
158
159Contains the amount of dirty memory at which a process generating disk writes
160will itself start writeback.
161
162Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
163specified at a time. When one sysctl is written it is immediately taken into
164account to evaluate the dirty memory limits and the other appears as 0 when
165read.
166
167Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
168value lower than this limit will be ignored and the old configuration will be
169retained.
170
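The interplay between dirty_bytes and dirty_ratio described in the two notes
above can be modelled as a small sketch. The class and method names are
hypothetical, and the page size is passed in explicitly because it depends
on the architecture.

```python
class DirtyLimits:
    """Toy model of the dirty_ratio / dirty_bytes pair."""

    def __init__(self, dirty_ratio, page_size):
        self.page_size = page_size
        self.dirty_ratio = dirty_ratio
        self.dirty_bytes = 0

    def write_ratio(self, percent):
        # Writing one sysctl makes the other read as 0.
        self.dirty_ratio = percent
        self.dirty_bytes = 0

    def write_bytes(self, value):
        # Values below two pages are ignored; the old configuration stays.
        if value < 2 * self.page_size:
            return False
        self.dirty_bytes = value
        self.dirty_ratio = 0
        return True
```

With 4096-byte pages, writing 4096 to dirty_bytes leaves the model untouched,
while writing 8192 takes effect and makes dirty_ratio read as 0.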
171
172dirty_expire_centisecs
173======================
174
175This tunable is used to define when dirty data is old enough to be eligible
176for writeout by the kernel flusher threads. It is expressed in hundredths
177of a second. Data which has been dirty in-memory for longer than this
178interval will be written out next time a flusher thread wakes up.
179
180
181dirty_ratio
182===========
183
184Contains, as a percentage of total available memory that contains free pages
185and reclaimable pages, the number of pages at which a process which is
186generating disk writes will itself start writing out dirty data.
187
188The total available memory is not equal to total system memory.
189
190
191dirtytime_expire_seconds
192========================
193
194When a lazytime inode is constantly having its pages dirtied, the inode with
195an updated timestamp will never get a chance to be written out. And, if the
196only thing that has happened on the file system is a dirtytime inode caused
197by an atime update, a worker will be scheduled to make sure that inode
198eventually gets pushed out to disk. This tunable is used to define when a dirty
199inode is old enough to be eligible for writeback by the kernel flusher threads.
200It is also used as the interval at which to wake up the dirtytime_writeback thread.
201
202
203dirty_writeback_centisecs
204=========================
205
206The kernel flusher threads will periodically wake up and write `old` data
207out to disk. This tunable expresses the interval between those wakeups, in
208hundredths of a second.
209
210Setting this to zero disables periodic writeback altogether.
211
212
213drop_caches
214===========
215
216Writing to this will cause the kernel to drop clean caches, as well as
217reclaimable slab objects like dentries and inodes. Once dropped, their
218memory becomes free.
219
220To free pagecache::
221
222 echo 1 > /proc/sys/vm/drop_caches
223
224To free reclaimable slab objects (includes dentries and inodes)::
225
226 echo 2 > /proc/sys/vm/drop_caches
227
228To free slab objects and pagecache::
229
230 echo 3 > /proc/sys/vm/drop_caches
231
232This is a non-destructive operation and will not free any dirty objects.
233To increase the number of objects freed by this operation, the user may run
234`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
235number of dirty objects on the system and create more candidates to be
236dropped.
237
238This file is not a means to control the growth of the various kernel caches
239(inodes, dentries, pagecache, etc.). These objects are automatically
240reclaimed by the kernel when memory is needed elsewhere on the system.
241
242Use of this file can cause performance problems. Since it discards cached
243objects, it may cost a significant amount of I/O and CPU to recreate the
244dropped objects, especially if they were under heavy use. Because of this,
245use outside of a testing or debugging environment is not recommended.
246
247You may see informational messages in your kernel log when this file is
248used::
249
250 cat (1234): drop_caches: 3
251
252These are informational only. They do not mean that anything is wrong
253with your system. To disable them, echo 4 (bit 2) into drop_caches.
254
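The values written to drop_caches in the examples above act as a bit mask.
The following sketch decodes such a value; the function name and the
dictionary keys are hypothetical labels for the three bits mentioned in this
section.

```python
def decode_drop_caches(value):
    """Decode a value written to drop_caches into its individual bits."""
    return {
        "pagecache": bool(value & 1),       # bit 0: free pagecache
        "slab": bool(value & 2),            # bit 1: free reclaimable slab objects
        "quiet": bool(value & 4),           # bit 2: disable the log messages
    }
```

Decoding 3 shows both the pagecache and slab bits set, which matches the
"free slab objects and pagecache" example above.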
255
256extfrag_threshold
257=================
258
259This parameter affects whether the kernel will compact memory or direct
260reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
261debugfs shows what the fragmentation index for each order is in each zone in
262the system. Values tending towards 0 imply allocations would fail due to lack
263of memory, values towards 1000 imply failures are due to fragmentation and -1
264implies that the allocation will succeed as long as watermarks are met.
265
266The kernel will not compact memory in a zone if the
267fragmentation index is <= extfrag_threshold. The default value is 500.
268
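The threshold rule can be expressed as a one-line check. The function name
is hypothetical, and the sketch covers only the comparison stated above, not
the other conditions the kernel takes into account.

```python
def kernel_may_compact(fragmentation_index, extfrag_threshold=500):
    """False when the index is <= the threshold, so no compaction is tried."""
    return fragmentation_index > extfrag_threshold
```

An index of -1 (the allocation will succeed as long as watermarks are met)
never exceeds the threshold, so no compaction is attempted in that case.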
269
270highmem_is_dirtyable
271====================
272
273Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).
274
275This parameter controls whether the high memory is considered for dirty
276writers throttling. This is not the case by default which means that
277only the amount of memory directly visible/usable by the kernel can
278be dirtied. As a result, on systems with a large amount of memory and
279lowmem basically depleted, writers might be throttled too early and
280streaming writes can get very slow.
281
282Changing the value to non-zero would allow more memory to be dirtied
283and thus allow writers to write more data which can be flushed to the
284storage more effectively. Note this also comes with a risk of premature
285OOM kills because some writers (e.g. direct block device writes) can
286only use the low memory and they can fill it up with dirty data without
287any throttling.
288
289
290hugetlb_shm_group
291=================
292
293hugetlb_shm_group contains the group id that is allowed to create SysV
294shared memory segments using hugetlb pages.
295
296
297laptop_mode
298===========
299
300laptop_mode is a knob that controls "laptop mode". All the things that are
301controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.
302
303
304legacy_va_layout
305================
306
307If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
308will use the legacy (2.4) layout for all processes.
309
310
311lowmem_reserve_ratio
312====================
313
314For some specialised workloads on highmem machines it is dangerous for
315the kernel to allow process memory to be allocated from the "lowmem"
316zone. This is because that memory could then be pinned via the mlock()
317system call, or by unavailability of swap space.
318
319And on large highmem machines this lack of reclaimable lowmem memory
320can be fatal.
321
322So the Linux page allocator has a mechanism which prevents allocations
323which *could* use highmem from using too much lowmem. This means that
324a certain amount of lowmem is defended from the possibility of being
325captured into pinned user memory.
326
327(The same argument applies to the old 16 megabyte ISA DMA region. This
328mechanism will also defend that region from allocations which could use
329highmem or lowmem).
330
331The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
332in defending these lower zones.
333
334If you have a machine which uses highmem or ISA DMA and your
335applications are using mlock(), or if you are running with no swap then
336you probably should change the lowmem_reserve_ratio setting.
337
338The lowmem_reserve_ratio is an array. You can see its values by reading this file::
339
340 % cat /proc/sys/vm/lowmem_reserve_ratio
341 256 256 32
342
343However, these values are not used directly. The kernel calculates the number
344of protection pages for each zone from them. These are shown as an array of
345protection pages in /proc/zoneinfo, as in the following example from an x86-64
346machine. Each zone has an array of protection pages like this::
347
348 Node 0, zone DMA
349 pages free 1355
350 min 3
351 low 3
352 high 4
353 :
354 :
355 numa_other 0
356 protection: (0, 2004, 2004, 2004)
357 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
358 pagesets
359 cpu: 0 pcp: 0
360 :
361
362These protections are added to the score used to judge whether this zone should
363be used for page allocation or should be reclaimed.
364
365In this example, if normal pages (index=2) are requested from this DMA zone and
366watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone
367should not be used because pages_free (1355) is smaller than watermark +
368protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone would
369be used for the normal page request. If the request is for the DMA zone
370(index=0), protection[0] (=0) is used.
371
372zone[i]'s protection[j] is calculated by the following expression::
373
374 (i < j):
375 zone[i]->protection[j]
376 = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
377 / lowmem_reserve_ratio[i];
378 (i = j):
379 (should not be protected) = 0;
380 (i > j):
381 (not necessary, but looks 0)
382
383The default values of lowmem_reserve_ratio[i] are:
384
385 === ====================================
386 256 (if zone[i] means DMA or DMA32 zone)
387 32 (others)
388 === ====================================
389
390As the expression above shows, these values are reciprocals of the ratio.
391256 means 1/256. The number of protection pages becomes about 0.39% of the
392total managed pages of the higher zones on the node.
393
394If you would like to protect more pages, smaller values are effective.
395The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
396disables protection of the pages.
397
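The protection expression and the allocation check described above can be
sketched as follows. This is a minimal model under stated assumptions: the
function names protection_pages and zone_usable are hypothetical, the zone
sizes in the usage note are made up, and treating "free equals watermark plus
protection" as unusable is an assumption (the text only describes the strictly
smaller case).

```python
def protection_pages(managed_pages, ratios):
    """Return table[i][j] = zone[i]->protection[j] for one node.

    managed_pages[i] is the number of managed pages in zone i and
    ratios[i] is lowmem_reserve_ratio[i]; zones are ordered low to high.
    """
    n = len(managed_pages)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        if ratios[i] < 1:
            # A value less than 1 disables protection for this zone.
            continue
        for j in range(i + 1, n):
            # Sum of managed pages from zone[i+1] to zone[j].
            higher = sum(managed_pages[i + 1 : j + 1])
            table[i][j] = higher // ratios[i]
    return table


def zone_usable(free_pages, watermark, protection):
    """Zone is skipped when free pages do not exceed watermark + protection."""
    return free_pages > watermark + protection
```

For hypothetical zones of 4000, 500000 and 1000000 managed pages with the
default ratios (256, 256, 32), protection_pages returns
[[0, 1953, 5859], [0, 0, 3906], [0, 0, 0]]. With the numbers from the
/proc/zoneinfo example, zone_usable(1355, 4, 2004) is False.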
398
399max_map_count
400=============
401
402This file contains the maximum number of memory map areas a process
403may have. Memory map areas are used as a side-effect of calling
404malloc, directly by mmap, mprotect, and madvise, and also when loading
405shared libraries.
406
407While most applications need less than a thousand maps, certain
408programs, particularly malloc debuggers, may consume lots of them,
409e.g., up to one or two maps per allocation.
410
411The default value is 65530.
412
413
414memory_failure_early_kill
415=========================
416
417Control how to kill processes when an uncorrected memory error (typically
418a 2-bit error in a memory module) that cannot be handled by the kernel is
419detected in the background by hardware. In some cases (like the page
420still having a valid copy on disk) the kernel will handle the failure
421transparently without affecting any applications. But if there is
422no other up-to-date copy of the data, it will kill processes to prevent
423any data corruption from propagating.
424
4251: Kill all processes that have the corrupted and not reloadable page mapped
426as soon as the corruption is detected. Note this is not supported
427for a few types of pages, like kernel internally allocated data or
428the swap cache, but works for the majority of user pages.
429
4300: Only unmap the corrupted page from all processes and only kill a process
431that tries to access it.
432
433The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
434handle this if they want to.
435
436This is only active on architectures/platforms with advanced machine
437check handling and depends on the hardware capabilities.
438
439Applications can override this setting individually with the PR_MCE_KILL prctl.
440
441
442memory_failure_recovery
443=======================
444
445Enable memory failure recovery (when supported by the platform).
446
4471: Attempt recovery.
448
4490: Always panic on a memory failure.
450
451
452min_free_kbytes
453===============
454
455This is used to force the Linux VM to keep a minimum number
456of kilobytes free. The VM uses this number to compute a
457watermark[WMARK_MIN] value for each lowmem zone in the system.
458Each lowmem zone gets a number of reserved free pages based
459proportionally on its size.
460
461Some minimal amount of memory is needed to satisfy PF_MEMALLOC
462allocations; if you set this to lower than 1024KB, your system will
463become subtly broken, and prone to deadlock under high loads.
464
465Setting this too high will OOM your machine instantly.
466
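The proportional split described above can be illustrated with a short sketch.
It assumes 4KB pages and uses a hypothetical helper name, per_zone_min_pages;
it only models "each lowmem zone gets a share proportional to its size" and is
not the kernel's actual watermark code.

```python
def per_zone_min_pages(min_free_kbytes, lowmem_zone_pages, page_size_kb=4):
    """Split min_free_kbytes across lowmem zones in proportion to zone size.

    lowmem_zone_pages maps a zone name to its size in pages.
    page_size_kb=4 is an assumption (4KB pages).
    """
    total_min_pages = min_free_kbytes // page_size_kb
    lowmem_total = sum(lowmem_zone_pages.values())
    return {
        name: total_min_pages * pages // lowmem_total
        for name, pages in lowmem_zone_pages.items()
    }
```

For example, with min_free_kbytes set to 65536 and two hypothetical zones of
250000 and 750000 pages, the zones receive 4096 and 12288 pages respectively.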
467
468min_slab_ratio
469==============
470
471This is available only on NUMA kernels.
472
473A percentage of the total pages in each zone. On zone reclaim
474(when fallback from the local zone occurs), slabs will be reclaimed if more
475than this percentage of pages in a zone are reclaimable slab pages.
476This ensures that slab growth stays under control even in NUMA
477systems that rarely perform global reclaim.
478
479The default is 5 percent.
480
481Note that slab reclaim is triggered in a per zone / node fashion.
482The process of reclaiming slab memory is currently not node specific
483and may not be fast.
484
485
486min_unmapped_ratio
487==================
488
489This is available only on NUMA kernels.
490
491This is a percentage of the total pages in each zone. Zone reclaim will
492only occur if more than this percentage of pages are in a state that
493zone_reclaim_mode allows to be reclaimed.
494
495If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
496against all file-backed unmapped pages including swapcache pages and tmpfs
497files. Otherwise, only unmapped pages backed by normal files but not tmpfs
498files and similar are considered.
499
500The default is 1 percent.
501
502
503mmap_min_addr
504=============
505
506This file indicates the amount of address space which a user process will
507be restricted from mmapping. Since kernel null dereference bugs could
508accidentally operate based on the information in the first couple of pages
509of memory, userspace processes should not be allowed to write to them. By
510default this value is set to 0 and no protections will be enforced by the
511security module. Setting this value to something like 64k will allow the
512vast majority of applications to work correctly and provide defense in depth
513against future potential kernel bugs.
514
515
516mmap_rnd_bits
517=============
518
519This value can be used to select the number of bits to use to
520determine the random offset to the base address of vma regions
521resulting from mmap allocations on architectures which support
522tuning address space randomization. This value will be bounded
523by the architecture's minimum and maximum supported values.
524
525This value can be changed after boot using the
526/proc/sys/vm/mmap_rnd_bits tunable.
527
528
529mmap_rnd_compat_bits
530====================
531
532This value can be used to select the number of bits to use to
533determine the random offset to the base address of vma regions
534resulting from mmap allocations for applications run in
535compatibility mode on architectures which support tuning address
536space randomization. This value will be bounded by the
537architecture's minimum and maximum supported values.
538
539This value can be changed after boot using the
540/proc/sys/vm/mmap_rnd_compat_bits tunable.
541
542
543nr_hugepages
544============
545
546Change the minimum size of the hugepage pool.
547
548See Documentation/admin-guide/mm/hugetlbpage.rst
549
550
551nr_hugepages_mempolicy
552======================
553
554Change the size of the hugepage pool at run-time on a specific
555set of NUMA nodes.
556
557See Documentation/admin-guide/mm/hugetlbpage.rst
558
559
560nr_overcommit_hugepages
561=======================
562
563Change the maximum size of the hugepage pool. The maximum is
564nr_hugepages + nr_overcommit_hugepages.
565
566See Documentation/admin-guide/mm/hugetlbpage.rst
567
568
569nr_trim_pages
570=============
571
572This is available only on NOMMU kernels.
573
574This value adjusts the excess page trimming behaviour of power-of-2 aligned
575NOMMU mmap allocations.
576
577A value of 0 disables trimming of allocations entirely, while a value of 1
578trims excess pages aggressively. Any value >= 1 acts as the watermark where
579trimming of allocations is initiated.
580
581The default value is 1.
582
583See Documentation/nommu-mmap.txt for more information.
584
585
586numa_zonelist_order
587===================
588
589This sysctl is only for NUMA and it is deprecated. Anything but
590Node order will fail!
591
592'where the memory is allocated from' is controlled by zonelists.
593
594(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 to keep the explanation
595simple. You may be able to read ZONE_DMA as ZONE_DMA32...)
596
597In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows:
598ZONE_NORMAL -> ZONE_DMA.
599This means that a memory allocation request for GFP_KERNEL will
600get memory from ZONE_DMA only when ZONE_NORMAL is not available.
601
602In the NUMA case, you can think of the following 2 types of order.
603Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::
604
605 (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
606 (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
607
608Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
609will be used before ZONE_NORMAL is exhausted. This increases the possibility of
610out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
611
612Type(B) cannot offer the best locality but is more robust against OOM of
613the DMA zone.
614
615Type(A) is called "Node" order. Type(B) is "Zone" order.
616
617"Node order" orders the zonelists by node, then by zone within each node.
618Specify "[Nn]ode" for node order.
619
620"Zone Order" orders the zonelists by zone type, then by node within each
621zone. Specify "[Zz]one" for zone order.
622
623Specify "[Dd]efault" to request automatic configuration.
624
625On 32-bit, the Normal zone needs to be preserved for allocations accessible
626by the kernel, so "zone" order will be selected.
627
628On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
629order will be selected.
630
631Default order is recommended unless this is causing problems for your
632system/application.
633
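The two orderings (A) and (B) above can be reproduced with a small sketch. The
helper build_zonelist is hypothetical, and visiting remote nodes in ascending
node id order is a simplifying assumption made only for this illustration.

```python
def build_zonelist(local_node, node_zones, order,
                   zone_priority=("ZONE_NORMAL", "ZONE_DMA")):
    """Return a list of (node, zone) pairs for allocations from local_node.

    node_zones maps a node id to the set of zones present on that node.
    zone_priority lists zone types from most to least preferred.
    """
    # Local node first, then the remaining nodes (assumed: ascending id).
    nodes = [local_node] + sorted(n for n in node_zones if n != local_node)
    if order.lower() == "node":
        # By node, then by zone within each node.
        return [(n, z) for n in nodes for z in zone_priority if z in node_zones[n]]
    if order.lower() == "zone":
        # By zone type, then by node within each zone type.
        return [(n, z) for z in zone_priority for n in nodes if z in node_zones[n]]
    raise ValueError(f"unknown order: {order!r}")
```

With node_zones = {0: {"ZONE_NORMAL", "ZONE_DMA"}, 1: {"ZONE_NORMAL"}},
"node" order yields the sequence shown as (A) and "zone" order yields (B).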
634
635oom_dump_tasks
636==============
637
638Enables a system-wide task dump (excluding kernel threads) to be produced
639when the kernel performs an OOM kill, and includes such information as
640pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
641score, and name. This is helpful to determine why the OOM killer was
642invoked, to identify the rogue task that caused it, and to determine why
643the OOM killer chose the task it did to kill.
644
645If this is set to zero, this information is suppressed. On very
646large systems with thousands of tasks it may not be feasible to dump
647the memory state information for each one. Such systems should not
648be forced to incur a performance penalty in OOM conditions when the
649information may not be desired.
650
651If this is set to non-zero, this information is shown whenever the
652OOM killer actually kills a memory-hogging task.
653
654The default value is 1 (enabled).
655
656
657oom_kill_allocating_task
658========================
659
660This enables or disables killing the OOM-triggering task in
661out-of-memory situations.
662
663If this is set to zero, the OOM killer will scan through the entire
664tasklist and select a task based on heuristics to kill. This normally
665selects a rogue memory-hogging task that frees up a large amount of
666memory when killed.
667
668If this is set to non-zero, the OOM killer simply kills the task that
669triggered the out-of-memory condition. This avoids the expensive
670tasklist scan.
671
672If panic_on_oom is selected, it takes precedence over whatever value
673is used in oom_kill_allocating_task.
674
675The default value is 0.
676
677
678overcommit_kbytes
679=================
680
681When overcommit_memory is set to 2, the committed address space is not
682permitted to exceed swap plus this amount of physical RAM. See below.
683
684Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
685of them may be specified at a time. Setting one disables the other (which
686then appears as 0 when read).
687
688
689overcommit_memory
690=================
691
692This value contains a flag that enables memory overcommitment.
693
694When this flag is 0, the kernel attempts to estimate the amount
695of free memory left when userspace requests more memory.
696
697When this flag is 1, the kernel pretends there is always enough
698memory until it actually runs out.
699
700When this flag is 2, the kernel uses a "never overcommit"
701policy that attempts to prevent any overcommit of memory.
702Note that user_reserve_kbytes affects this policy.
703
704This feature can be very useful because there are a lot of
705programs that malloc() huge amounts of memory "just-in-case"
706and don't use much of it.
707
708The default value is 0.
709
710See Documentation/vm/overcommit-accounting.rst and
711mm/util.c::__vm_enough_memory() for more information.
712
713
714overcommit_ratio
715================
716
717When overcommit_memory is set to 2, the committed address
718space is not permitted to exceed swap plus this percentage
719of physical RAM. See above.
720
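The limit that applies when overcommit_memory is 2 can be sketched as simple
arithmetic. The helper name commit_limit_kb is hypothetical, and the sketch
follows only the wording above ("swap plus this amount or percentage of
physical RAM"); any further adjustments the kernel makes are not modelled.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, overcommit_kbytes=0):
    """Committed address space limit in KB for overcommit_memory=2.

    Only one of overcommit_ratio / overcommit_kbytes is in effect:
    a non-zero overcommit_kbytes takes the place of the ratio.
    """
    if overcommit_kbytes:
        allowed_ram_kb = overcommit_kbytes
    else:
        allowed_ram_kb = ram_kb * overcommit_ratio // 100
    return swap_kb + allowed_ram_kb
```

For a hypothetical machine with 8000000 KB of RAM and 2000000 KB of swap, a
ratio of 50 gives a limit of 6000000 KB, while overcommit_kbytes=1000000 gives
3000000 KB.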
721
722page-cluster
723============
724
725page-cluster controls the maximum number of consecutive pages that
726are read in from swap in a single attempt. This is the swap counterpart
727to page cache readahead.
728"Consecutive" here is not in terms of virtual/physical addresses,
729but consecutive on swap space, meaning the pages were swapped out together.
730
731It is a logarithmic value - setting it to zero means "1 page", setting
732it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
733Zero disables swap readahead completely.
734
735The default value is three (eight pages at a time). There may be some
736small benefits in tuning this to a different value if your workload is
737swap-intensive.
738
739Lower values mean lower latencies for initial faults, but at the same time
740extra faults and I/O delays for subsequent faults if they would have been part
741of the consecutive pages that readahead would have brought in.
742
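Because page-cluster is a logarithmic value, the number of pages read per
attempt is a power of two. A one-line sketch (the function name
swap_readahead_pages is hypothetical):

```python
def swap_readahead_pages(page_cluster: int) -> int:
    """Pages read from swap in a single attempt: 2 ** page_cluster."""
    if page_cluster < 0:
        raise ValueError("page_cluster must be non-negative")
    return 1 << page_cluster
```

swap_readahead_pages(0) is 1, which corresponds to readahead being disabled,
and the default of 3 gives 8 pages.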
743
744panic_on_oom
745============
746
747This enables or disables the panic on out-of-memory feature.
748
749If this is set to 0, the kernel will kill some rogue process using the
750OOM killer. Usually, the OOM killer can kill rogue processes and the
751system will survive.
752
753If this is set to 1, the kernel panics when out-of-memory happens.
754However, if a process limits its node usage via mempolicy/cpusets
755and those nodes run out of memory, one process
756may be killed by the OOM killer. No panic occurs in this case,
757because other nodes' memory may be free. This means the overall system
758state may not be fatal yet.
759
760If this is set to 2, the kernel always panics, even in the
761case mentioned above. Even if OOM happens under a memory cgroup, the whole
762system panics.
763
764The default value is 0.
765
7661 and 2 are intended for failover in clusters. Please select either
767according to your failover policy.
768
769panic_on_oom=2 combined with kdump gives you a very strong tool to investigate
770why OOM happens, because you can get a snapshot.
771
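The three settings can be summarised as a small decision function. This is a
sketch of the behaviour described above, not kernel code; the name oom_action
and the constrained flag (meaning the OOM is limited by mempolicy, cpusets or
a memory cgroup) are hypothetical.

```python
def oom_action(panic_on_oom: int, constrained: bool) -> str:
    """Return "panic" or "kill" for an out-of-memory event.

    constrained is True when the shortage is confined to the nodes or
    cgroup a process is limited to, rather than being system-wide.
    """
    if panic_on_oom == 2:
        # Always panic, even for constrained OOMs.
        return "panic"
    if panic_on_oom == 1 and not constrained:
        # Panic only when the whole system is out of memory.
        return "panic"
    # Otherwise the OOM killer picks a process to kill.
    return "kill"
```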
772
773percpu_pagelist_fraction
774========================
775
776This is the fraction of pages at most (high mark pcp->high) in each zone that
777are allocated for each per cpu page list. The min value for this is 8. It
778means that we don't allow more than 1/8th of pages in each zone to be
779allocated in any single per_cpu_pagelist. This entry only changes the value
780of hot per cpu pagelists. Users can specify a number like 100 to allocate
7811/100th of each zone to each per cpu page list.
782
783The batch value of each per cpu pagelist is also updated as a result. It is
784set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).
785
786The initial value is zero. The kernel does not use this value at boot time to set
787the high water marks for each per cpu page list. If the user writes '0' to this
788sysctl, it will revert to this default behavior.
789
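The relationship between the fraction, pcp->high and the batch value can be
sketched as below. The helper name pcp_high_and_batch is hypothetical,
PAGE_SHIFT=12 (4KB pages) is an assumption, and the default behaviour selected
by writing 0 is not modelled.

```python
def pcp_high_and_batch(zone_pages, fraction, page_shift=12):
    """Return (high, batch) for one per cpu page list.

    page_shift=12 assumes 4KB pages; fraction must be at least 8.
    """
    if fraction < 8:
        raise ValueError("the minimum fraction is 8")
    high = zone_pages // fraction
    # batch is high/4, capped at PAGE_SHIFT * 8.
    batch = min(high // 4, page_shift * 8)
    return high, batch
```

For a hypothetical zone of 1000000 pages and a fraction of 100, high is 10000
and batch is capped at 96.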
790
791stat_interval
792=============
793
794The time interval between which vm statistics are updated. The default
795is 1 second.
796
797
798stat_refresh
799============
800
801Any read or write (by root only) flushes all the per-cpu vm statistics
802into their global totals, for more accurate reports when testing
803e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
804
805As a side-effect, it also checks for negative totals (elsewhere reported
806as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
807(At time of writing, a few stats are known sometimes to be found negative,
808with no ill effects: errors and warnings on these stats are suppressed.)
809
810
811numa_stat
812=========
813
814This interface allows runtime configuration of numa statistics.
815
816When page allocation performance becomes a bottleneck and you can tolerate
817some possible tool breakage and decreased numa counter precision, you can
818do::
819
820 echo 0 > /proc/sys/vm/numa_stat
821
822When page allocation performance is not a bottleneck and you want all
823tooling to work, you can do::
824
825 echo 1 > /proc/sys/vm/numa_stat
826
827
828swappiness
829==========
830
831This control is used to define how aggressively the kernel will swap
832memory pages. Higher values will increase aggressiveness, lower values
833decrease the amount of swap. A value of 0 instructs the kernel not to
834initiate swap until the amount of free and file-backed pages is less
835than the high water mark in a zone.
836
837The default value is 60.
838
839
840unprivileged_userfaultfd
841========================
842
843This flag controls whether unprivileged users can use the userfaultfd
844system calls. Set this to 1 to allow unprivileged users to use the
845userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
846privileged users (with the CAP_SYS_PTRACE capability).
847
848The default value is 1.
849
850
851user_reserve_kbytes
852===================
853
854When overcommit_memory is set to 2, "never overcommit" mode, reserve
855min(3% of current process size, user_reserve_kbytes) of free memory.
856This is intended to prevent a user from starting a single memory hogging
857process, such that they cannot recover (kill the hog).
858
859user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
860
861If this is reduced to zero, then the user will be allowed to allocate
862all free memory with a single process, minus admin_reserve_kbytes.
863Any subsequent attempts to execute a command will result in
864"fork: Cannot allocate memory".
865
866Changing this takes effect whenever an application requests memory.
867
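The reserve formula above is easy to check numerically. This sketch follows
the documented expression literally; the function name user_reserve_kb is
hypothetical and sizes are taken in KB.

```python
def user_reserve_kb(process_size_kb: int, user_reserve_kbytes: int) -> int:
    """min(3% of current process size, user_reserve_kbytes), in KB."""
    return min(process_size_kb * 3 // 100, user_reserve_kbytes)
```

With user_reserve_kbytes at 131072 (128MB), a process of 1000000 KB reserves
30000 KB, while a process of 10000000 KB is capped at 131072 KB.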
868
869vfs_cache_pressure
870==================
871
872This percentage value controls the tendency of the kernel to reclaim
873the memory which is used for caching of directory and inode objects.
874
875At the default value of vfs_cache_pressure=100 the kernel will attempt to
876reclaim dentries and inodes at a "fair" rate with respect to pagecache and
877swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
878to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
879never reclaim dentries and inodes due to memory pressure and this can easily
880lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
881causes the kernel to prefer to reclaim dentries and inodes.
882
883Increasing vfs_cache_pressure significantly beyond 100 may have negative
884performance impact. Reclaim code needs to take various locks to find freeable
885directory and inode objects. With vfs_cache_pressure=1000, it will look for
886ten times more freeable objects than there are.
887
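The effect of the percentage can be illustrated by scaling the number of
freeable objects reported to reclaim. This is a simplified model, and the
function name scaled_freeable_objects is hypothetical.

```python
def scaled_freeable_objects(freeable: int, vfs_cache_pressure: int = 100) -> int:
    """Number of dentry/inode objects reclaim is told it can look for."""
    return freeable * vfs_cache_pressure // 100
```

With 5000 freeable objects, a pressure of 100 reports 5000, a pressure of 0
reports none, and a pressure of 1000 reports 50000, ten times more than
actually exist.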
888
889watermark_boost_factor
890======================
891
892This factor controls the level of reclaim when memory is being fragmented.
893It defines the percentage of the high watermark of a zone that will be
894reclaimed if pages of different mobility are being mixed within pageblocks.
895The intent is that compaction has less work to do in the future and to
896increase the success rate of future high-order allocations such as SLUB
897allocations, THP and hugetlbfs pages.
898
899To make it sensible with respect to the watermark_scale_factor
900parameter, the unit is in fractions of 10,000. The default value of
90115,000 on !DISCONTIGMEM configurations means that up to 150% of the high
902watermark will be reclaimed in the event of a pageblock being mixed due
903to fragmentation. The level of reclaim is determined by the number of
904fragmentation events that occurred in the recent past. If this value is
905smaller than a pageblock then a pageblock's worth of pages will be reclaimed
906(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
907
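The upper bound on reclaim implied by the text can be sketched as follows.
The helper name max_watermark_boost is hypothetical, and pageblock_pages=512
assumes 2MB pageblocks with 4KB pages, as in the 64-bit x86 example.

```python
def max_watermark_boost(high_watermark_pages, watermark_boost_factor=15000,
                        pageblock_pages=512):
    """Upper bound, in pages, on reclaim triggered by fragmentation."""
    if watermark_boost_factor == 0:
        # A boost factor of 0 disables the feature.
        return 0
    boost = high_watermark_pages * watermark_boost_factor // 10000
    # Never less than one pageblock's worth of pages.
    return max(boost, pageblock_pages)
```

With the default factor of 15000, a high watermark of 1000 pages allows up to
1500 pages (150%), while a high watermark of 100 pages is raised to one
pageblock (512 pages under the stated assumption).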
908
909watermark_scale_factor
910======================
911
912This factor controls the aggressiveness of kswapd. It defines the
913amount of memory left in a node/system before kswapd is woken up and
914how much memory needs to be free before kswapd goes back to sleep.
915
916The unit is in fractions of 10,000. The default value of 10 means the
917distances between watermarks are 0.1% of the available memory in the
918node/system. The maximum value is 1000, or 10% of memory.
919
920A high rate of threads entering direct reclaim (allocstall) or kswapd
921going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
922that the number of free pages kswapd maintains for latency reasons is
923too small for the allocation bursts occurring in the system. This knob
924can then be used to tune kswapd aggressiveness accordingly.
925
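The "fractions of 10,000" unit can be made concrete with a short calculation.
The function name watermark_distance_pages is hypothetical, and the sketch
only models the proportional distance described above.

```python
def watermark_distance_pages(available_pages: int, watermark_scale_factor: int = 10) -> int:
    """Distance between watermarks, in pages, for a node/system."""
    if watermark_scale_factor > 1000:
        raise ValueError("the maximum value is 1000")
    return available_pages * watermark_scale_factor // 10000
```

For 1000000 pages of available memory, the default of 10 gives 1000 pages
(0.1%) and the maximum of 1000 gives 100000 pages (10%).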
926
927zone_reclaim_mode
928=================
929
930Zone_reclaim_mode allows someone to set more or less aggressive approaches to
931reclaim memory when a zone runs out of memory. If it is set to zero then no
932zone reclaim occurs. Allocations will be satisfied from other zones / nodes
933in the system.
934
935This value is the bitwise OR of the following:
936
937= ===================================
9381 Zone reclaim on
9392 Zone reclaim writes dirty pages out
9404 Zone reclaim swaps pages
941= ===================================
942
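Since the value is a bit mask, it can be decoded bit by bit. The sketch below
uses the descriptions from the table; the constant and function names are
hypothetical and are not meant to match kernel identifiers.

```python
# Bit values taken from the table; the names are illustrative only.
ZONE_RECLAIM_FLAGS = (
    (1, "Zone reclaim on"),
    (2, "Zone reclaim writes dirty pages out"),
    (4, "Zone reclaim swaps pages"),
)


def describe_zone_reclaim_mode(mode: int) -> list:
    """Return the descriptions of all bits set in zone_reclaim_mode."""
    return [text for bit, text in ZONE_RECLAIM_FLAGS if mode & bit]


# A value of 3 enables zone reclaim and writing out dirty pages.
print(describe_zone_reclaim_mode(3))
```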
943zone_reclaim_mode is disabled by default. For file servers or workloads
944that benefit from having their data cached, zone_reclaim_mode should be
945left disabled as the caching effect is likely to be more important than
946data locality.
947
948zone_reclaim may be enabled if it's known that the workload is partitioned
949such that each partition fits within a NUMA node and that accessing remote
950memory would cause a measurable performance reduction. The page allocator
951will then reclaim easily reusable pages (those page cache pages that are
952currently not used) before allocating off node pages.
953
954Allowing zone reclaim to write out pages stops processes that are
955writing large amounts of data from dirtying pages on other nodes. Zone
956reclaim will write out dirty pages if a zone fills up and so effectively
957throttle the process. This may decrease the performance of a single process
958since it cannot use all of system memory to buffer the outgoing writes
959anymore, but it preserves the memory on other nodes so that the performance
960of other processes running on other nodes will not be affected.
961
962Allowing regular swap effectively restricts allocations to the local
963node unless explicitly overridden by memory policies or cpuset
964configurations.
diff --git a/Documentation/admin-guide/video-output.rst b/Documentation/admin-guide/video-output.rst
new file mode 100644
index 000000000000..56d6fa2e2368
--- /dev/null
+++ b/Documentation/admin-guide/video-output.rst
@@ -0,0 +1,34 @@
1Video Output Switcher Control
2~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3
42006 luming.yu@intel.com
5
6The output sysfs class driver provides an abstract video output layer that
7can be used to hook platform-specific methods to enable/disable a video output
8device through a common sysfs interface. For example, on my IBM ThinkPad T42
9laptop, the ACPI video driver registered its output devices and read/write
10methods for 'state' with the output sysfs class. The user interface under sysfs is::
11
12 linux:/sys/class/video_output # tree .
13 .
14 |-- CRT0
15 | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
16 | |-- state
17 | |-- subsystem -> ../../../class/video_output
18 | `-- uevent
19 |-- DVI0
20 | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
21 | |-- state
22 | |-- subsystem -> ../../../class/video_output
23 | `-- uevent
24 |-- LCD0
25 | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
26 | |-- state
27 | |-- subsystem -> ../../../class/video_output
28 | `-- uevent
29 `-- TV0
30 |-- device -> ../../../devices/pci0000:00/0000:00:01.0
31 |-- state
32 |-- subsystem -> ../../../class/video_output
33 `-- uevent
34