author | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2012-06-25 12:31:00 -0400 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2012-06-25 12:31:00 -0400 |
commit | bcc66c0b8881f88459f9ac21038455bcafacdc6e (patch) | |
tree | b402e677253c3fc1038ca4a52fc54fc223261133 /Documentation | |
parent | 1c1b86215730ef07d8851c2247b9ecf73038d05d (diff) | |
parent | 6b16351acbd415e66ba16bf7d473ece1574cf0bc (diff) |
Merge 3.5-rc4 into staging-next
This picks up the staging changes made in 3.5-rc4 so that everyone can sync up
properly.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/arm/SPEAr/overview.txt | 2 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt | 93 | ||||
-rw-r--r-- | Documentation/hwmon/coretemp | 22 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 9 | ||||
-rw-r--r-- | Documentation/networking/stmmac.txt | 44 | ||||
-rw-r--r-- | Documentation/vm/frontswap.txt | 278 |
6 files changed, 427 insertions, 21 deletions
diff --git a/Documentation/arm/SPEAr/overview.txt b/Documentation/arm/SPEAr/overview.txt
index 57aae7765c74..65610bf52ebf 100644
--- a/Documentation/arm/SPEAr/overview.txt
+++ b/Documentation/arm/SPEAr/overview.txt
@@ -60,4 +60,4 @@ Introduction | |||
60 | Document Author | 60 | Document Author |
61 | --------------- | 61 | --------------- |
62 | 62 | ||
63 | Viresh Kumar <viresh.kumar@st.com>, (c) 2010-2012 ST Microelectronics | 63 | Viresh Kumar <viresh.linux@gmail.com>, (c) 2010-2012 ST Microelectronics |
diff --git a/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt b/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt
new file mode 100644
index 000000000000..ae8af1694e95
--- /dev/null
+++ b/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt
@@ -0,0 +1,93 @@ | |||
1 | Pinctrl-based I2C Bus Mux | ||
2 | |||
3 | This binding describes an I2C bus multiplexer that uses pin multiplexing to | ||
4 | route the I2C signals, and represents the pin multiplexing configuration | ||
5 | using the pinctrl device tree bindings. | ||
6 | |||
7 | +-----+ +-----+ | ||
8 | | dev | | dev | | ||
9 | +------------------------+ +-----+ +-----+ | ||
10 | | SoC | | | | ||
11 | | /----|------+--------+ | ||
12 | | +---+ +------+ | child bus A, on first set of pins | ||
13 | | |I2C|---|Pinmux| | | ||
14 | | +---+ +------+ | child bus B, on second set of pins | ||
15 | | \----|------+--------+--------+ | ||
16 | | | | | | | ||
17 | +------------------------+ +-----+ +-----+ +-----+ | ||
18 | | dev | | dev | | dev | | ||
19 | +-----+ +-----+ +-----+ | ||
20 | |||
21 | Required properties: | ||
22 | - compatible: i2c-mux-pinctrl | ||
23 | - i2c-parent: The phandle of the I2C bus that this multiplexer's master-side | ||
24 | port is connected to. | ||
25 | |||
26 | Also required are: | ||
27 | |||
28 | * Standard pinctrl properties that specify the pin mux state for each child | ||
29 | bus. See ../pinctrl/pinctrl-bindings.txt. | ||
30 | |||
31 | * Standard I2C mux properties. See mux.txt in this directory. | ||
32 | |||
33 | * I2C child bus nodes. See mux.txt in this directory. | ||
34 | |||
35 | For each named state defined in the pinctrl-names property, an I2C child bus | ||
36 | will be created. I2C child bus numbers are assigned based on the index into | ||
37 | the pinctrl-names property. | ||
38 | |||
39 | The only exception is that no bus will be created for a state named "idle". If | ||
40 | such a state is defined, it must be the last entry in pinctrl-names. For | ||
41 | example: | ||
42 | |||
43 | pinctrl-names = "ddc", "pta", "idle" -> ddc = bus 0, pta = bus 1 | ||
44 | pinctrl-names = "ddc", "idle", "pta" -> Invalid ("idle" not last) | ||
45 | pinctrl-names = "idle", "ddc", "pta" -> Invalid ("idle" not last) | ||
46 | |||
47 | Whenever an access is made to a device on a child bus, the relevant pinctrl | ||
48 | state will be programmed into hardware. | ||
49 | |||
50 | If an idle state is defined, whenever an access is not being made to a device | ||
51 | on a child bus, the idle pinctrl state will be programmed into hardware. | ||
52 | |||
53 | If an idle state is not defined, the most recently used pinctrl state will be | ||
54 | left programmed into hardware whenever no access is being made of a device on | ||
55 | a child bus. | ||
56 | |||
57 | Example: | ||
58 | |||
59 | i2cmux { | ||
60 | compatible = "i2c-mux-pinctrl"; | ||
61 | #address-cells = <1>; | ||
62 | #size-cells = <0>; | ||
63 | |||
64 | i2c-parent = <&i2c1>; | ||
65 | |||
66 | pinctrl-names = "ddc", "pta", "idle"; | ||
67 | pinctrl-0 = <&state_i2cmux_ddc>; | ||
68 | pinctrl-1 = <&state_i2cmux_pta>; | ||
69 | pinctrl-2 = <&state_i2cmux_idle>; | ||
70 | |||
71 | i2c@0 { | ||
72 | reg = <0>; | ||
73 | #address-cells = <1>; | ||
74 | #size-cells = <0>; | ||
75 | |||
76 | eeprom { | ||
77 | compatible = "eeprom"; | ||
78 | reg = <0x50>; | ||
79 | }; | ||
80 | }; | ||
81 | |||
82 | i2c@1 { | ||
83 | reg = <1>; | ||
84 | #address-cells = <1>; | ||
85 | #size-cells = <0>; | ||
86 | |||
87 | eeprom { | ||
88 | compatible = "eeprom"; | ||
89 | reg = <0x50>; | ||
90 | }; | ||
91 | }; | ||
92 | }; | ||
93 | |||
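The bus-numbering rule in the i2c-mux-pinctrl binding above (one child bus per pinctrl-names entry, indexed by position, with an optional "idle" state that must come last and gets no bus) can be sketched in C. This is an illustrative helper only; the function name `toy_count_child_buses` is hypothetical and is not taken from the kernel driver.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code.
 * Returns the number of child buses implied by a pinctrl-names list,
 * or -1 if a state named "idle" appears anywhere but the last entry.
 * Child bus i corresponds to names[i] for 0 <= i < return value.
 */
int toy_count_child_buses(const char *const names[], size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(names[i], "idle") == 0) {
            if (i != n - 1)
                return -1;          /* "idle" not last: invalid */
            return (int)(n - 1);    /* no bus is created for "idle" */
        }
    }
    return (int)n;                  /* no idle state: every entry is a bus */
}
```

Applied to the three examples in the binding text, the helper yields 2 for "ddc", "pta", "idle" and -1 for both orderings where "idle" is not last.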
diff --git a/Documentation/hwmon/coretemp b/Documentation/hwmon/coretemp
index 84d46c0c71a3..c86b50c03ea8 100644
--- a/Documentation/hwmon/coretemp
+++ b/Documentation/hwmon/coretemp
@@ -6,7 +6,9 @@ Supported chips: | |||
6 | Prefix: 'coretemp' | 6 | Prefix: 'coretemp' |
7 | CPUID: family 0x6, models 0xe (Pentium M DC), 0xf (Core 2 DC 65nm), | 7 | CPUID: family 0x6, models 0xe (Pentium M DC), 0xf (Core 2 DC 65nm), |
8 | 0x16 (Core 2 SC 65nm), 0x17 (Penryn 45nm), | 8 | 0x16 (Core 2 SC 65nm), 0x17 (Penryn 45nm), |
9 | 0x1a (Nehalem), 0x1c (Atom), 0x1e (Lynnfield) | 9 | 0x1a (Nehalem), 0x1c (Atom), 0x1e (Lynnfield), |
10 | 0x26 (Tunnel Creek Atom), 0x27 (Medfield Atom), | ||
11 | 0x36 (Cedar Trail Atom) | ||
10 | Datasheet: Intel 64 and IA-32 Architectures Software Developer's Manual | 12 | Datasheet: Intel 64 and IA-32 Architectures Software Developer's Manual |
11 | Volume 3A: System Programming Guide | 13 | Volume 3A: System Programming Guide |
12 | http://softwarecommunity.intel.com/Wiki/Mobility/720.htm | 14 | http://softwarecommunity.intel.com/Wiki/Mobility/720.htm |
@@ -52,6 +54,17 @@ Some information comes from ark.intel.com | |||
52 | 54 | ||
53 | Process Processor TjMax(C) | 55 | Process Processor TjMax(C) |
54 | 56 | ||
57 | 22nm Core i5/i7 Processors | ||
58 | i7 3920XM, 3820QM, 3720QM, 3667U, 3520M 105 | ||
59 | i5 3427U, 3360M/3320M 105 | ||
60 | i7 3770/3770K 105 | ||
61 | i5 3570/3570K, 3550, 3470/3450 105 | ||
62 | i7 3770S 103 | ||
63 | i5 3570S/3550S, 3475S/3470S/3450S 103 | ||
64 | i7 3770T 94 | ||
65 | i5 3570T 94 | ||
66 | i5 3470T 91 | ||
67 | |||
55 | 32nm Core i3/i5/i7 Processors | 68 | 32nm Core i3/i5/i7 Processors |
56 | i7 660UM/640/620, 640LM/620, 620M, 610E 105 | 69 | i7 660UM/640/620, 640LM/620, 620M, 610E 105 |
57 | i5 540UM/520/430, 540M/520/450/430 105 | 70 | i5 540UM/520/430, 540M/520/450/430 105 |
@@ -65,6 +78,11 @@ Process Processor TjMax(C) | |||
65 | U3400 105 | 78 | U3400 105 |
66 | P4505/P4500 90 | 79 | P4505/P4500 90 |
67 | 80 | ||
81 | 32nm Atom Processors | ||
82 | Z2460 90 | ||
83 | D2700/2550/2500 100 | ||
84 | N2850/2800/2650/2600 100 | ||
85 | |||
68 | 45nm Xeon Processors 5400 Quad-Core | 86 | 45nm Xeon Processors 5400 Quad-Core |
69 | X5492, X5482, X5472, X5470, X5460, X5450 85 | 87 | X5492, X5482, X5472, X5470, X5460, X5450 85 |
70 | E5472, E5462, E5450/40/30/20/10/05 85 | 88 | E5472, E5462, E5450/40/30/20/10/05 85 |
@@ -85,6 +103,8 @@ Process Processor TjMax(C) | |||
85 | N475/470/455/450 100 | 103 | N475/470/455/450 100 |
86 | N280/270 90 | 104 | N280/270 90 |
87 | 330/230 125 | 105 | 330/230 125 |
106 | E680/660/640/620 90 | ||
107 | E680T/660T/640T/620T 110 | ||
88 | 108 | ||
89 | 45nm Core2 Processors | 109 | 45nm Core2 Processors |
90 | Solo ULV SU3500/3300 100 | 110 | Solo ULV SU3500/3300 100 |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c45513d806ab..a92c5ebf373e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2543,6 +2543,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. | |||
2543 | 2543 | ||
2544 | sched_debug [KNL] Enables verbose scheduler debug messages. | 2544 | sched_debug [KNL] Enables verbose scheduler debug messages. |
2545 | 2545 | ||
2546 | skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate | ||
2547 | xtime_lock contention on larger systems, and/or RCU lock | ||
2548 | contention on all systems with CONFIG_MAXSMP set. | ||
2549 | Format: { "0" | "1" } | ||
2550 | 0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1") | ||
2551 | 1 -- enable. | ||
2552 | Note: increases power consumption, thus should only be | ||
2553 | enabled if running jitter sensitive (HPC/RT) workloads. | ||
2554 | |||
2546 | security= [SECURITY] Choose a security module to enable at boot. | 2555 | security= [SECURITY] Choose a security module to enable at boot. |
2547 | If this boot parameter is not specified, only the first | 2556 | If this boot parameter is not specified, only the first |
2548 | security module asking for security registration will be | 2557 | security module asking for security registration will be |
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index ab1e8d7004c5..5cb9a1972460 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -10,8 +10,8 @@ Currently this network device driver is for all STM embedded MAC/GMAC | |||
10 | (i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 | 10 | (i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 |
11 | FF1152AMT0221 D1215994A VIRTEX FPGA board. | 11 | FF1152AMT0221 D1215994A VIRTEX FPGA board. |
12 | 12 | ||
13 | DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether MAC 10/100 | 13 | DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether |
14 | Universal version 4.0 have been used for developing this driver. | 14 | MAC 10/100 Universal version 4.0 have been used for developing this driver. |
15 | 15 | ||
16 | This driver supports both the platform bus and PCI. | 16 | This driver supports both the platform bus and PCI. |
17 | 17 | ||
@@ -54,27 +54,27 @@ net_device structure enabling the scatter/gather feature. | |||
54 | When one or more packets are received, an interrupt happens. The interrupts | 54 | When one or more packets are received, an interrupt happens. The interrupts |
55 | are not queued so the driver has to scan all the descriptors in the ring during | 55 | are not queued so the driver has to scan all the descriptors in the ring during |
56 | the receive process. | 56 | the receive process. |
57 | This is based on NAPI so the interrupt handler signals only if there is work to be | 57 | This is based on NAPI so the interrupt handler signals only if there is work |
58 | done, and it exits. | 58 | to be done, and it exits. |
59 | Then the poll method will be scheduled at some future point. | 59 | Then the poll method will be scheduled at some future point. |
60 | The incoming packets are stored, by the DMA, in a list of pre-allocated socket | 60 | The incoming packets are stored, by the DMA, in a list of pre-allocated socket |
61 | buffers in order to avoid the memcpy (Zero-copy). | 61 | buffers in order to avoid the memcpy (Zero-copy). |
62 | 62 | ||
63 | 4.3) Timer-Driver Interrupt | 63 | 4.3) Timer-Driver Interrupt |
64 | Instead of having the device that asynchronously notifies the frame receptions, the | 64 | Instead of having the device that asynchronously notifies the frame receptions, |
65 | driver configures a timer to generate an interrupt at regular intervals. | 65 | the driver configures a timer to generate an interrupt at regular intervals. |
66 | Based on the granularity of the timer, the frames that are received by the device | 66 | Based on the granularity of the timer, the frames that are received by the |
67 | will experience different levels of latency. Some NICs have dedicated timer | 67 | device will experience different levels of latency. Some NICs have dedicated |
68 | device to perform this task. STMMAC can use either the RTC device or the TMU | 68 | timer device to perform this task. STMMAC can use either the RTC device or the |
69 | channel 2 on STLinux platforms. | 69 | TMU channel 2 on STLinux platforms. |
70 | The timers frequency can be passed to the driver as parameter; when change it, | 70 | The timers frequency can be passed to the driver as parameter; when change it, |
71 | take care of both hardware capability and network stability/performance impact. | 71 | take care of both hardware capability and network stability/performance impact. |
72 | Several performance tests on STM platforms showed this optimisation allows to spare | 72 | Several performance tests on STM platforms showed this optimisation allows to |
73 | the CPU while having the maximum throughput. | 73 | spare the CPU while having the maximum throughput. |
74 | 74 | ||
75 | 4.4) WOL | 75 | 4.4) WOL |
76 | Wake up on Lan feature through Magic and Unicast frames are supported for the GMAC | 76 | Wake up on Lan feature through Magic and Unicast frames are supported for the |
77 | core. | 77 | GMAC core. |
78 | 78 | ||
79 | 4.5) DMA descriptors | 79 | 4.5) DMA descriptors |
80 | Driver handles both normal and enhanced descriptors. The latter has been only | 80 | Driver handles both normal and enhanced descriptors. The latter has been only |
@@ -106,7 +106,8 @@ Several driver's information can be passed through the platform | |||
106 | These are included in the include/linux/stmmac.h header file | 106 | These are included in the include/linux/stmmac.h header file |
107 | and detailed below as well: | 107 | and detailed below as well: |
108 | 108 | ||
109 | struct plat_stmmacenet_data { | 109 | struct plat_stmmacenet_data { |
110 | char *phy_bus_name; | ||
110 | int bus_id; | 111 | int bus_id; |
111 | int phy_addr; | 112 | int phy_addr; |
112 | int interface; | 113 | int interface; |
@@ -124,19 +125,24 @@ and detailed below as well: | |||
124 | void (*bus_setup)(void __iomem *ioaddr); | 125 | void (*bus_setup)(void __iomem *ioaddr); |
125 | int (*init)(struct platform_device *pdev); | 126 | int (*init)(struct platform_device *pdev); |
126 | void (*exit)(struct platform_device *pdev); | 127 | void (*exit)(struct platform_device *pdev); |
128 | void *custom_cfg; | ||
129 | void *custom_data; | ||
127 | void *bsp_priv; | 130 | void *bsp_priv; |
128 | }; | 131 | }; |
129 | 132 | ||
130 | Where: | 133 | Where: |
134 | o phy_bus_name: phy bus name to attach to the stmmac. | ||
131 | o bus_id: bus identifier. | 135 | o bus_id: bus identifier. |
132 | o phy_addr: the physical address can be passed from the platform. | 136 | o phy_addr: the physical address can be passed from the platform. |
133 | If it is set to -1 the driver will automatically | 137 | If it is set to -1 the driver will automatically |
134 | detect it at run-time by probing all the 32 addresses. | 138 | detect it at run-time by probing all the 32 addresses. |
135 | o interface: PHY device's interface. | 139 | o interface: PHY device's interface. |
136 | o mdio_bus_data: specific platform fields for the MDIO bus. | 140 | o mdio_bus_data: specific platform fields for the MDIO bus. |
137 | o pbl: the Programmable Burst Length is maximum number of beats to | 141 | o dma_cfg: internal DMA parameters |
142 | o pbl: the Programmable Burst Length is maximum number of beats to | ||
138 | be transferred in one DMA transaction. | 143 | be transferred in one DMA transaction. |
139 | GMAC also enables the 4xPBL by default. | 144 | GMAC also enables the 4xPBL by default. |
145 | o fixed_burst/mixed_burst/burst_len | ||
140 | o clk_csr: fixed CSR Clock range selection. | 146 | o clk_csr: fixed CSR Clock range selection. |
141 | o has_gmac: uses the GMAC core. | 147 | o has_gmac: uses the GMAC core. |
142 | o enh_desc: if sets the MAC will use the enhanced descriptor structure. | 148 | o enh_desc: if sets the MAC will use the enhanced descriptor structure. |
@@ -160,8 +166,9 @@ Where: | |||
160 | this is sometime necessary on some platforms (e.g. ST boxes) | 166 | this is sometime necessary on some platforms (e.g. ST boxes) |
161 | where the HW needs to have set some PIO lines or system cfg | 167 | where the HW needs to have set some PIO lines or system cfg |
162 | registers. | 168 | registers. |
163 | o custom_cfg: this is a custom configuration that can be passed while | 169 | o custom_cfg/custom_data: this is a custom configuration that can be passed |
164 | initialising the resources. | 170 | while initialising the resources. |
171 | o bsp_priv: another private pointer. | ||
165 | 172 | ||
166 | For MDIO bus The we have: | 173 | For MDIO bus The we have: |
167 | 174 | ||
@@ -180,7 +187,6 @@ Where: | |||
180 | o irqs: list of IRQs, one per PHY. | 187 | o irqs: list of IRQs, one per PHY. |
181 | o probed_phy_irq: if irqs is NULL, use this for probed PHY. | 188 | o probed_phy_irq: if irqs is NULL, use this for probed PHY. |
182 | 189 | ||
183 | |||
184 | For DMA engine we have the following internal fields that should be | 190 | For DMA engine we have the following internal fields that should be |
185 | tuned according to the HW capabilities. | 191 | tuned according to the HW capabilities. |
186 | 192 | ||
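The stmmac text above says that a phy_addr of -1 makes the driver detect the PHY at run time by probing all 32 addresses. A minimal sketch of that convention follows. All names here (`toy_resolve_phy_addr`, `toy_phy_at`, `TOY_PHY_MAX_ADDR`) are hypothetical, and returning the first address that responds is an assumption of this sketch rather than something the document states.

```c
#define TOY_PHY_MAX_ADDR 32   /* the 32 addresses mentioned in the text */

/* Hypothetical probe callback: nonzero if a PHY answers at addr. */
typedef int (*toy_phy_present_fn)(int addr, void *ctx);

/*
 * Resolve the PHY address: any value other than -1 is used as given;
 * -1 requests a scan of addresses 0..31. Returns the chosen address,
 * or -1 if the scan finds nothing.
 */
int toy_resolve_phy_addr(int phy_addr, toy_phy_present_fn present, void *ctx)
{
    int addr;

    if (phy_addr != -1)
        return phy_addr;

    for (addr = 0; addr < TOY_PHY_MAX_ADDR; addr++)
        if (present(addr, ctx))
            return addr;        /* assumption: first responder wins */

    return -1;
}

/* Demo stub standing in for real bus access: one PHY at *(int *)ctx. */
int toy_phy_at(int addr, void *ctx)
{
    return addr == *(int *)ctx;
}
```

With the demo stub and a PHY placed at address 5, `toy_resolve_phy_addr(-1, toy_phy_at, &where)` scans and returns 5, while an explicit `phy_addr` of 3 is returned unchanged without any probing.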
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt
new file mode 100644
index 000000000000..37067cf455f4
--- /dev/null
+++ b/Documentation/vm/frontswap.txt
@@ -0,0 +1,278 @@ | |||
1 | Frontswap provides a "transcendent memory" interface for swap pages. | ||
2 | In some environments, dramatic performance savings may be obtained because | ||
3 | swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. | ||
4 | |||
5 | (Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" | ||
6 | and the only necessary changes to the core kernel for transcendent memory; | ||
7 | all other supporting code -- the "backends" -- is implemented as drivers. | ||
8 | See the LWN.net article "Transcendent memory in a nutshell" for a detailed | ||
9 | overview of frontswap and related kernel parts: | ||
10 | https://lwn.net/Articles/454795/ ) | ||
11 | |||
12 | Frontswap is so named because it can be thought of as the opposite of | ||
13 | a "backing" store for a swap device. The storage is assumed to be | ||
14 | a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming | ||
15 | to the requirements of transcendent memory (such as Xen's "tmem", or | ||
16 | in-kernel compressed memory, aka "zcache", or future RAM-like devices); | ||
17 | this pseudo-RAM device is not directly accessible or addressable by the | ||
18 | kernel and is of unknown and possibly time-varying size. The driver | ||
19 | links itself to frontswap by calling frontswap_register_ops to set the | ||
20 | frontswap_ops funcs appropriately and the functions it provides must | ||
21 | conform to certain policies as follows: | ||
22 | |||
23 | An "init" prepares the device to receive frontswap pages associated | ||
24 | with the specified swap device number (aka "type"). A "store" will | ||
25 | copy the page to transcendent memory and associate it with the type and | ||
26 | offset associated with the page. A "load" will copy the page, if found, | ||
27 | from transcendent memory into kernel memory, but will NOT remove the page | ||
28 | from transcendent memory. An "invalidate_page" will remove the page | ||
29 | from transcendent memory and an "invalidate_area" will remove ALL pages | ||
30 | associated with the swap type (e.g., like swapoff) and notify the "device" | ||
31 | to refuse further stores with that swap type. | ||
32 | |||
33 | Once a page is successfully stored, a matching load on the page will normally | ||
34 | succeed. So when the kernel finds itself in a situation where it needs | ||
35 | to swap out a page, it first attempts to use frontswap. If the store returns | ||
36 | success, the data has been successfully saved to transcendent memory and | ||
37 | a disk write and, if the data is later read back, a disk read are avoided. | ||
38 | If a store returns failure, transcendent memory has rejected the data, and the | ||
39 | page can be written to swap as usual. | ||
40 | |||
41 | If a backend chooses, frontswap can be configured as a "writethrough | ||
42 | cache" by calling frontswap_writethrough(). In this mode, the reduction | ||
43 | in swap device writes is lost (and also a non-trivial performance advantage) | ||
44 | in order to allow the backend to arbitrarily "reclaim" space used to | ||
45 | store frontswap pages to more completely manage its memory usage. | ||
46 | |||
47 | Note that if a page is stored and the page already exists in transcendent memory | ||
48 | (a "duplicate" store), either the store succeeds and the data is overwritten, | ||
49 | or the store fails AND the page is invalidated. This ensures stale data may | ||
50 | never be obtained from frontswap. | ||
51 | |||
52 | If properly configured, monitoring of frontswap is done via debugfs in | ||
53 | the /sys/kernel/debug/frontswap directory. The effectiveness of | ||
54 | frontswap can be measured (across all swap devices) with: | ||
55 | |||
56 | failed_stores - how many store attempts have failed | ||
57 | loads - how many loads were attempted (all should succeed) | ||
58 | succ_stores - how many store attempts have succeeded | ||
59 | invalidates - how many invalidates were attempted | ||
60 | |||
61 | A backend implementation may provide additional metrics. | ||
62 | |||
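One way to turn the debugfs counters listed above into a single effectiveness figure is the fraction of store attempts that succeeded. The sketch below is illustrative only: the helper names are hypothetical, and the ratio is one possible metric, not one defined by frontswap itself.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter from an already opened stream.
 * Returns 0 on success, -1 if no number could be parsed.
 */
int toy_read_counter(FILE *f, unsigned long long *value)
{
    return fscanf(f, "%llu", value) == 1 ? 0 : -1;
}

/* Fraction of store attempts the backend accepted (0.0 if none were made). */
double toy_store_hit_ratio(unsigned long long succ_stores,
                           unsigned long long failed_stores)
{
    unsigned long long total = succ_stores + failed_stores;

    return total ? (double)succ_stores / (double)total : 0.0;
}
```

On a real system the stream would typically come from `fopen()` on files such as succ_stores and failed_stores under /sys/kernel/debug/frontswap, which requires debugfs to be mounted and sufficient privileges to read it.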
63 | FAQ | ||
64 | |||
65 | 1) Where's the value? | ||
66 | |||
67 | When a workload starts swapping, performance falls through the floor. | ||
68 | Frontswap significantly increases performance in many such workloads by | ||
69 | providing a clean, dynamic interface to read and write swap pages to | ||
70 | "transcendent memory" that is otherwise not directly addressable to the kernel. | ||
71 | This interface is ideal when data is transformed to a different form | ||
72 | and size (such as with compression) or secretly moved (as might be | ||
73 | useful for write-balancing for some RAM-like devices). Swap pages (and | ||
74 | evicted page-cache pages) are a great use for this kind of slower-than-RAM- | ||
75 | but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and | ||
76 | cleancache) interface to transcendent memory provides a nice way to read | ||
77 | and write -- and indirectly "name" -- the pages. | ||
78 | |||
79 | Frontswap -- and cleancache -- with a fairly small impact on the kernel, | ||
80 | provides a huge amount of flexibility for more dynamic, flexible RAM | ||
81 | utilization in various system configurations: | ||
82 | |||
83 | In the single kernel case, aka "zcache", pages are compressed and | ||
84 | stored in local memory, thus increasing the total anonymous pages | ||
85 | that can be safely kept in RAM. Zcache essentially trades off CPU | ||
86 | cycles used in compression/decompression for better memory utilization. | ||
87 | Benchmarks have shown little or no impact when memory pressure is | ||
88 | low while providing a significant performance improvement (25%+) | ||
89 | on some workloads under high memory pressure. | ||
90 | |||
91 | "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory | ||
92 | support for clustered systems. Frontswap pages are locally compressed | ||
93 | as in zcache, but then "remotified" to another system's RAM. This | ||
94 | allows RAM to be dynamically load-balanced back-and-forth as needed, | ||
95 | i.e. when system A is overcommitted, it can swap to system B, and | ||
96 | vice versa. RAMster can also be configured as a memory server so | ||
97 | many servers in a cluster can swap, dynamically as needed, to a single | ||
98 | server configured with a large amount of RAM... without pre-configuring | ||
99 | how much of the RAM is available for each of the clients! | ||
100 | |||
101 | In the virtual case, the whole point of virtualization is to statistically | ||
102 | multiplex physical resources across the varying demands of multiple | ||
103 | virtual machines. This is really hard to do with RAM and efforts to do | ||
104 | it well with no kernel changes have essentially failed (except in some | ||
105 | well-publicized special-case workloads). | ||
106 | Specifically, the Xen Transcendent Memory backend allows otherwise | ||
107 | "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | ||
108 | virtual machines, but the pages can be compressed and deduplicated to | ||
109 | optimize RAM utilization. And when guest OS's are induced to surrender | ||
110 | underutilized RAM (e.g. with "selfballooning"), sudden unexpected | ||
111 | memory pressure may result in swapping; frontswap allows those pages | ||
112 | to be swapped to and from hypervisor RAM (if overall host system memory | ||
113 | conditions allow), thus mitigating the potentially awful performance impact | ||
114 | of unplanned swapping. | ||
115 | |||
116 | A KVM implementation is underway and has been RFC'ed to lkml. And, | ||
117 | using frontswap, investigation is also underway on the use of NVM as | ||
118 | a memory extension technology. | ||
119 | |||
120 | 2) Sure there may be performance advantages in some situations, but | ||
121 | what's the space/time overhead of frontswap? | ||
122 | |||
123 | If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into | ||
124 | nothingness and the only overhead is a few extra bytes per swapon'ed | ||
125 | swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" | ||
126 | registers, there is one extra global variable compared to zero for | ||
127 | every swap page read or written. If CONFIG_FRONTSWAP is enabled | ||
128 | AND a frontswap backend registers AND the backend fails every "store" | ||
129 | request (i.e. provides no memory despite claiming it might), | ||
130 | CPU overhead is still negligible -- and since every frontswap fail | ||
131 | precedes a swap page write-to-disk, the system is highly likely | ||
132 | to be I/O bound and using a small fraction of a percent of a CPU | ||
133 | will be irrelevant anyway. | ||
134 | |||
135 | As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend | ||
136 | registers, one bit is allocated for every swap page for every swap | ||
137 | device that is swapon'd. This is added to the EIGHT bits (which | ||
138 | was sixteen until about 2.6.34) that the kernel already allocates | ||
139 | for every swap page for every swap device that is swapon'd. (Hugh | ||
140 | Dickins has observed that frontswap could probably steal one of | ||
141 | the existing eight bits, but let's worry about that minor optimization | ||
142 | later.) For very large swap disks (which are rare) on a standard | ||
143 | 4K pagesize, this is 1MB per 32GB swap. | ||
144 | |||
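The "1MB per 32GB swap" figure above follows from one bit per swap page: 32GB / 4KB = 8,388,608 pages, and 8,388,608 bits / 8 = 1,048,576 bytes. A small helper makes the arithmetic reusable for other sizes. The name `toy_map_bytes` is hypothetical and the function only models the calculation, not any kernel allocation.

```c
#include <stdint.h>

/*
 * Bytes needed for a one-bit-per-page map covering swap_bytes of swap,
 * rounded up to a whole byte. Hypothetical helper for the arithmetic only.
 */
uint64_t toy_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
    uint64_t pages = swap_bytes / page_size;

    return (pages + 7) / 8;
}
```

For comparison, the eight bits per swap page that the text says the kernel already allocates come to 8MB for the same 32GB device, so the frontswap bit adds one eighth on top of that.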
145 | When swap pages are stored in transcendent memory instead of written | ||
146 | out to disk, there is a side effect that this may create more memory | ||
147 | pressure that can potentially outweigh the other advantages. A | ||
148 | backend, such as zcache, must implement policies to carefully (but | ||
149 | dynamically) manage memory limits to ensure this doesn't happen. | ||
150 | |||
151 | 3) OK, how about a quick overview of what this frontswap patch does | ||
152 | in terms that a kernel hacker can grok? | ||
153 | |||
154 | Let's assume that a frontswap "backend" has registered during | ||
155 | kernel initialization; this registration indicates that this | ||
156 | frontswap backend has access to some "memory" that is not directly | ||
157 | accessible by the kernel. Exactly how much memory it provides is | ||
158 | entirely dynamic and random. | ||
159 | |||
160 | Whenever a swap-device is swapon'd frontswap_init() is called, | ||
161 | passing the swap device number (aka "type") as a parameter. | ||
162 | This notifies frontswap to expect attempts to "store" swap pages | ||
163 | associated with that number. | ||
164 | |||
165 | Whenever the swap subsystem is readying a page to write to a swap | ||
166 | device (cf. swap_writepage()), frontswap_store is called. Frontswap | ||
167 | consults with the frontswap backend and if the backend says it does NOT | ||
168 | have room, frontswap_store returns -1 and the kernel swaps the page | ||
169 | to the swap device as normal. Note that the response from the frontswap | ||
170 | backend is unpredictable to the kernel; it may choose to never accept a | ||
171 | page, it could accept every ninth page, or it might accept every | ||
172 | page. But if the backend does accept a page, the data from the page | ||
173 | has already been copied and associated with the type and offset, | ||
174 | and the backend guarantees the persistence of the data. In this case, | ||
175 | frontswap sets a bit in the "frontswap_map" for the swap device | ||
176 | corresponding to the page offset on the swap device to which it would | ||
177 | otherwise have written the data. | ||
178 | |||
179 | When the swap subsystem needs to swap-in a page (swap_readpage()), | ||
180 | it first calls frontswap_load() which checks the frontswap_map to | ||
181 | see if the page was earlier accepted by the frontswap backend. If | ||
182 | it was, the page of data is filled from the frontswap backend and | ||
183 | the swap-in is complete. If not, the normal swap-in code is | ||
184 | executed to obtain the page of data from the real swap device. | ||
185 | |||
186 | So every time the frontswap backend accepts a page, a swap device write | ||
187 | and (potentially) a swap device read are replaced by a "frontswap backend | ||
188 | store" and (possibly) a "frontswap backend load", which are presumably much | ||
189 | faster. | ||
190 | |||
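The store and load flow described in this answer can be modelled in userspace with a one-bit-per-slot map and a backend that is free to refuse pages. The following is a toy sketch under stated assumptions: a single swap device, a fixed-capacity in-memory pool standing in for transcendent memory, and hypothetical names throughout (`toy_store`, `toy_load`, `toy_invalidate_page`). It is not the kernel implementation.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */
#define TOY_NR_PAGES  64u     /* slots on the single toy swap device */
#define TOY_CAPACITY  4u      /* pages the toy backend agrees to hold */

/* One bit per swap slot: set while the backend holds that page. */
static uint8_t toy_map[TOY_NR_PAGES / 8];
/* Stand-in for the backend's memory, indexed by swap offset. */
static uint8_t toy_pool[TOY_NR_PAGES][TOY_PAGE_SIZE];
static unsigned toy_used;

static int toy_test_bit(unsigned off)
{
    return (toy_map[off / 8] >> (off % 8)) & 1;
}

/* Remove a page from the backend, if present. */
void toy_invalidate_page(unsigned off)
{
    if (off < TOY_NR_PAGES && toy_test_bit(off)) {
        toy_map[off / 8] &= (uint8_t)~(1u << (off % 8));
        toy_used--;
    }
}

/*
 * Returns 0 if the backend accepted the page, or -1 if the caller must
 * write the page to the real swap device as usual.
 */
int toy_store(unsigned off, const void *page)
{
    if (off >= TOY_NR_PAGES)
        return -1;
    if (!toy_test_bit(off)) {
        if (toy_used >= TOY_CAPACITY)
            return -1;                      /* backend says "no room" */
        toy_map[off / 8] |= (uint8_t)(1u << (off % 8));
        toy_used++;
    }
    /* A duplicate store simply overwrites the earlier copy. */
    memcpy(toy_pool[off], page, TOY_PAGE_SIZE);
    return 0;
}

/* Returns 0 and fills page on a hit; -1 means read the real swap device. */
int toy_load(unsigned off, void *page)
{
    if (off >= TOY_NR_PAGES || !toy_test_bit(off))
        return -1;
    memcpy(page, toy_pool[off], TOY_PAGE_SIZE);   /* page stays stored */
    return 0;
}
```

The fixed capacity is only one possible acceptance policy; as the text notes, a real backend may accept or reject any given page, so callers must always be prepared for the -1 path. In this toy a duplicate store always succeeds, so the "store fails and the page is invalidated" case is not exercised.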
191 | 4) Can't frontswap be configured as a "special" swap device that is | ||
192 | just higher priority than any real swap device (e.g. like zswap, | ||
193 | or maybe swap-over-nbd/NFS)? | ||
194 | |||
195 | No. First, the existing swap subsystem doesn't allow for any kind of | ||
196 | swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, | ||
197 | but this would require fairly drastic changes. Even if it were | ||
198 | rewritten, the existing swap subsystem uses the block I/O layer which | ||
199 | assumes a swap device is fixed size and any page in it is linearly | ||
200 | addressable. Frontswap barely touches the existing swap subsystem, | ||
201 | and works around the constraints of the block I/O subsystem to provide | ||
202 | a great deal of flexibility and dynamicity. | ||
203 | |||
204 | For example, the acceptance of any swap page by the frontswap backend is | ||
205 | entirely unpredictable. This is critical to the definition of frontswap | ||
206 | backends because it grants completely dynamic discretion to the | ||
207 | backend. In zcache, one cannot know a priori how compressible a page is. | ||
208 | "Poorly" compressible pages can be rejected, and "poorly" can itself be | ||
209 | defined dynamically depending on current memory constraints. | ||
210 | |||
211 | Further, frontswap is entirely synchronous whereas a real swap | ||
212 | device is, by definition, asynchronous and uses block I/O. The | ||
213 | block I/O layer is not only unnecessary, but may perform "optimizations" | ||
214 | that are inappropriate for a RAM-oriented device including delaying | ||
215 | the write of some pages for a significant amount of time. Synchrony is | ||
216 | required to ensure the dynamicity of the backend and to avoid thorny race | ||
217 | conditions that would unnecessarily and greatly complicate frontswap | ||
218 | and/or the block I/O subsystem. That said, only the initial "store" | ||
219 | and "load" operations need be synchronous. A separate asynchronous thread | ||
220 | is free to manipulate the pages stored by frontswap. For example, | ||
221 | the "remotification" thread in RAMster uses standard asynchronous | ||
222 | kernel sockets to move compressed frontswap pages to a remote machine. | ||
223 | Similarly, a KVM guest-side implementation could do in-guest compression | ||
224 | and use "batched" hypercalls. | ||
225 | |||
226 | In a virtualized environment, the dynamicity allows the hypervisor | ||
227 | (or host OS) to do "intelligent overcommit". For example, it can | ||
228 | choose to accept pages only until host-swapping might be imminent, | ||
229 | then force guests to do their own swapping. | ||
230 | |||
231 | There is a downside to the transcendent memory specifications for | ||
232 | frontswap: Since any "store" might fail, there must always be a real | ||
233 | slot on a real swap device to swap the page. Thus frontswap must be | ||
234 | implemented as a "shadow" to every swapon'd device with the potential | ||
235 | capability of holding every page that the swap device might have held | ||
236 | and the possibility that it might hold no pages at all. This means | ||
237 | that frontswap cannot contain more pages than the total of swapon'd | ||
238 | swap devices. For example, if NO swap device is configured on some | ||
239 | installation, frontswap is useless. Swapless portable devices | ||
240 | can still use frontswap but a backend for such devices must configure | ||
241 | some kind of "ghost" swap device and ensure that it is never used. | ||
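The "shadow" constraint above can be sketched as a toy model. The names (toy_swap, toy_swap_out) and the four-slot device are hypothetical and stand in for the real swap subsystem; the point is only that a real slot is reserved before the backend is asked, so frontswap can never hold more pages than the swap device has slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SWAP_SLOTS 4	/* size of the hypothetical swap device */

enum toy_where { TOY_EMPTY = 0, TOY_IN_FRONTSWAP, TOY_ON_DEVICE };

struct toy_swap {
	enum toy_where slot[TOY_SWAP_SLOTS];
	unsigned next_free;	/* next unused slot on the device */
};

/*
 * Swap out one page. A real slot is always reserved first; only then
 * does the backend's answer (modelled by backend_accepts) decide
 * whether the data lives in frontswap or on the device itself.
 * Returns the slot index, or -1 if the device has no free slot.
 */
static int toy_swap_out(struct toy_swap *s, bool backend_accepts)
{
	assert(s != NULL);
	if (s->next_free >= TOY_SWAP_SLOTS)
		return -1;	/* no real slot, so frontswap cannot help */
	int idx = (int)s->next_free++;
	s->slot[idx] = backend_accepts ? TOY_IN_FRONTSWAP : TOY_ON_DEVICE;
	return idx;
}
```

With zero slots configured, every call returns -1 regardless of what the backend would have accepted, which mirrors the statement that frontswap is useless when no swap device is configured.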
242 | |||
243 | 5) Why this weird definition about "duplicate stores"? If a page | ||
244 | has been previously successfully stored, can't it always be | ||
245 | successfully overwritten? | ||
246 | |||
247 | Nearly always it can, but no, sometimes it cannot. Consider an example | ||
248 | where data is compressed and the original 4K page has been compressed | ||
249 | to 1K. Now an attempt is made to overwrite the page with data that | ||
250 | is non-compressible and so would take the entire 4K. But the backend | ||
251 | has no more space. In this case, the store must be rejected. Whenever | ||
252 | frontswap rejects a store that would overwrite, it also must invalidate | ||
253 | the old data and ensure that it is no longer accessible. Since the | ||
254 | swap subsystem then writes the new data to the real swap device, | ||
255 | this is the correct course of action to ensure coherency. | ||
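A minimal sketch of the duplicate-store rule, assuming a hypothetical byte-counted pool (toy_pool and the toy_* functions are illustrative, not the frontswap API). The old copy is invalidated before the new one is tried, so a rejected overwrite can never leave stale data reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Hypothetical backend pool; len[i] == 0 means "nothing stored". */
struct toy_pool {
	size_t free_bytes;
	size_t len[TOY_SLOTS];
};

/* Drop whatever is stored for this offset and reclaim its space. */
static void toy_invalidate(struct toy_pool *p, unsigned offset)
{
	assert(offset < TOY_SLOTS);
	p->free_bytes += p->len[offset];
	p->len[offset] = 0;
}

/*
 * Store (or overwrite) the page at offset with new_len compressed
 * bytes. On rejection the old data is already gone, so the copy the
 * swap subsystem writes to the real device is the only one left.
 */
static bool toy_store(struct toy_pool *p, unsigned offset, size_t new_len)
{
	assert(offset < TOY_SLOTS && new_len > 0);
	toy_invalidate(p, offset);
	if (new_len > p->free_bytes)
		return false;
	p->free_bytes -= new_len;
	p->len[offset] = new_len;
	return true;
}

static bool toy_has_page(const struct toy_pool *p, unsigned offset)
{
	assert(offset < TOY_SLOTS);
	return p->len[offset] != 0;
}
```

This follows the example in the answer: a page first stored as 1K and then overwritten with incompressible 4K data is rejected when the pool is too small, and afterwards the backend no longer reports a page at that offset.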
256 | |||
257 | 6) What is frontswap_shrink for? | ||
258 | |||
259 | When the (non-frontswap) swap subsystem swaps out a page to a real | ||
260 | swap device, that page is only taking up low-value pre-allocated disk | ||
261 | space. But if frontswap has placed a page in transcendent memory, that | ||
262 | page may be taking up valuable real estate. The frontswap_shrink | ||
263 | routine allows code outside of the swap subsystem to force pages out | ||
264 | of the memory managed by frontswap and back into kernel-addressable memory. | ||
265 | For example, in RAMster, a "suction driver" thread will attempt | ||
266 | to "repatriate" pages sent to a remote machine back to the local machine; | ||
267 | this is driven using the frontswap_shrink mechanism when memory pressure | ||
268 | subsides. | ||
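The effect of shrinking can be modelled with a toy counter-based sketch. It is not the kernel's frontswap_shrink implementation; toy_frontswap and toy_shrink are hypothetical names, and the model only tracks page counts to show pages moving out of the backend until a target is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting only; no page data is modelled. */
struct toy_frontswap {
	unsigned long backend_pages;	/* pages held by the backend */
	unsigned long kernel_pages;	/* pages back in kernel memory */
};

/*
 * Pull pages out of the backend until at most target_pages remain.
 * Returns the number of pages moved; zero if already at or below
 * the target.
 */
static unsigned long toy_shrink(struct toy_frontswap *fs,
				unsigned long target_pages)
{
	unsigned long moved = 0;

	assert(fs != NULL);
	while (fs->backend_pages > target_pages) {
		fs->backend_pages--;
		fs->kernel_pages++;
		moved++;
	}
	return moved;
}
```

A caller such as the "suction driver" described above would invoke something like this once memory pressure has subsided, choosing the target according to how much transcendent memory it wants to free.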
269 | |||
270 | 7) Why does the frontswap patch create the new include file swapfile.h? | ||
271 | |||
272 | The frontswap code depends on some swap-subsystem-internal data | ||
273 | structures that have, over the years, moved back and forth between | ||
274 | static and global. This seemed a reasonable compromise: Define | ||
275 | them as global but declare them in a new include file that isn't | ||
276 | included by the large number of source files that include swap.h. | ||
277 | |||
278 | Dan Magenheimer, last updated April 9, 2012 | ||