| -rw-r--r-- | Documentation/vm/frontswap.txt | 278 | ||||
| -rw-r--r-- | mm/Kconfig | 17 | ||||
| -rw-r--r-- | mm/Makefile | 1 |
3 files changed, 296 insertions, 0 deletions
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt new file mode 100644 index 000000000000..a9f731af0fac --- /dev/null +++ b/Documentation/vm/frontswap.txt | |||
| @@ -0,0 +1,278 @@ | |||
| 1 | Frontswap provides a "transcendent memory" interface for swap pages. | ||
| 2 | In some environments, dramatic performance savings may be obtained because | ||
| 3 | swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. | ||
| 4 | |||
| 5 | (Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" | ||
| 6 | and the only necessary changes to the core kernel for transcendent memory; | ||
| 7 | all other supporting code -- the "backends" -- is implemented as drivers. | ||
| 8 | See the LWN.net article "Transcendent memory in a nutshell" for a detailed | ||
| 9 | overview of frontswap and related kernel parts: | ||
| 10 | https://lwn.net/Articles/454795/ ) | ||
| 11 | |||
| 12 | Frontswap is so named because it can be thought of as the opposite of | ||
| 13 | a "backing" store for a swap device. The storage is assumed to be | ||
| 14 | a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming | ||
| 15 | to the requirements of transcendent memory (such as Xen's "tmem", or | ||
| 16 | in-kernel compressed memory, aka "zcache", or future RAM-like devices); | ||
| 17 | this pseudo-RAM device is not directly accessible or addressable by the | ||
| 18 | kernel and is of unknown and possibly time-varying size. The driver | ||
| 19 | links itself to frontswap by calling frontswap_register_ops to set the | ||
| 20 | frontswap_ops funcs appropriately and the functions it provides must | ||
| 21 | conform to certain policies as follows: | ||
| 22 | |||
| 23 | An "init" prepares the device to receive frontswap pages associated | ||
| 24 | with the specified swap device number (aka "type"). A "put_page" will | ||
| 25 | copy the page to transcendent memory and associate it with the type and | ||
| 26 | offset associated with the page. A "get_page" will copy the page, if found, | ||
| 27 | from transcendent memory into kernel memory, but will NOT remove the page | ||
| 28 | from transcendent memory. An "invalidate_page" will remove the page | ||
| 29 | from transcendent memory and an "invalidate_area" will remove ALL pages | ||
| 30 | associated with the swap type (e.g., like swapoff) and notify the "device" | ||
| 31 | to refuse further puts with that swap type. | ||
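The contract described above can be sketched as a toy user-space model. This is only an illustration, not kernel code: the class name ToyBackend, the dictionary storage, and the 0 return value for success are assumptions made for the sketch (the text only states that a failed put is reported to the caller).

```python
class ToyBackend:
    """Toy user-space model of the five backend operations (hypothetical, not kernel code)."""

    def __init__(self):
        self.pages = {}      # (swap type, offset) -> copy of the page data
        self.active = set()  # swap types currently allowed to receive puts

    def init(self, swap_type):
        # Prepare to receive pages for this swap device number ("type").
        self.active.add(swap_type)

    def put_page(self, swap_type, offset, data):
        # Copy the page in. Refuse if the type was never initialised or
        # has been invalidated. 0 = accepted, -1 = rejected (assumed values).
        if swap_type not in self.active:
            return -1
        self.pages[(swap_type, offset)] = bytes(data)
        return 0

    def get_page(self, swap_type, offset):
        # Copy the page out if found; the stored copy is NOT removed.
        return self.pages.get((swap_type, offset))

    def invalidate_page(self, swap_type, offset):
        # Remove one page, if present.
        self.pages.pop((swap_type, offset), None)

    def invalidate_area(self, swap_type):
        # Remove ALL pages of this type and refuse further puts for it.
        for key in [k for k in self.pages if k[0] == swap_type]:
            del self.pages[key]
        self.active.discard(swap_type)
```

In this model a get after a successful put returns the same bytes any number of times, and a put after invalidate_area is rejected until init is called again for that type.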
| 32 | |||
| 33 | Once a page is successfully put, a matching get on the page will normally | ||
| 34 | succeed. So when the kernel finds itself in a situation where it needs | ||
| 35 | to swap out a page, it first attempts to use frontswap. If the put returns | ||
| 36 | success, the data has been successfully saved to transcendent memory and | ||
| 37 | a disk write and, if the data is later read back, a disk read are avoided. | ||
| 38 | If a put returns failure, transcendent memory has rejected the data, and the | ||
| 39 | page can be written to swap as usual. | ||
| 40 | |||
| 41 | If a backend chooses, frontswap can be configured as a "writethrough | ||
| 42 | cache" by calling frontswap_writethrough(). In this mode, the reduction | ||
| 43 | in swap device writes is lost (and also a non-trivial performance advantage) | ||
| 44 | in order to allow the backend to arbitrarily "reclaim" space used to | ||
| 45 | store frontswap pages to more completely manage its memory usage. | ||
| 46 | |||
| 47 | Note that if a page is put and the page already exists in transcendent memory | ||
| 48 | (a "duplicate" put), either the put succeeds and the data is overwritten, | ||
| 49 | or the put fails AND the page is invalidated. This ensures stale data may | ||
| 50 | never be obtained from frontswap. | ||
| 51 | |||
| 52 | If properly configured, monitoring of frontswap is done via debugfs in | ||
| 53 | the /sys/kernel/debug/frontswap directory. The effectiveness of | ||
| 54 | frontswap can be measured (across all swap devices) with: | ||
| 55 | |||
| 56 | failed_puts - how many put attempts have failed | ||
| 57 | gets - how many gets were attempted (all should succeed) | ||
| 58 | succ_puts - how many put attempts have succeeded | ||
| 59 | invalidates - how many invalidates were attempted | ||
| 60 | |||
| 61 | A backend implementation may provide additional metrics. | ||
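As a rough sketch of how these counters might be consumed, the following reads the four files named above and derives the fraction of puts that were accepted. The helper names and the "missing file becomes None" behaviour are assumptions for illustration. Reading debugfs normally requires root and a mounted debugfs.

```python
from pathlib import Path

# Counter file names as listed in the text above.
COUNTERS = ("failed_puts", "gets", "succ_puts", "invalidates")


def read_frontswap_counters(debug_dir="/sys/kernel/debug/frontswap"):
    """Return the four counters as ints; unreadable or missing files become None."""
    result = {}
    for name in COUNTERS:
        try:
            result[name] = int((Path(debug_dir) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def put_success_ratio(counters):
    """Fraction of put attempts the backend accepted, or None if unknown."""
    ok, failed = counters["succ_puts"], counters["failed_puts"]
    if ok is None or failed is None or ok + failed == 0:
        return None
    return ok / (ok + failed)


if __name__ == "__main__":
    stats = read_frontswap_counters()
    print(stats, put_success_ratio(stats))
```

A ratio near zero suggests the backend is rejecting most pages, so nearly all swap traffic still reaches the real swap device.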
| 62 | |||
| 63 | FAQ | ||
| 64 | |||
| 65 | 1) Where's the value? | ||
| 66 | |||
| 67 | When a workload starts swapping, performance falls through the floor. | ||
| 68 | Frontswap significantly increases performance in many such workloads by | ||
| 69 | providing a clean, dynamic interface to read and write swap pages to | ||
| 70 | "transcendent memory" that is otherwise not directly addressable to the kernel. | ||
| 71 | This interface is ideal when data is transformed to a different form | ||
| 72 | and size (such as with compression) or secretly moved (as might be | ||
| 73 | useful for write-balancing for some RAM-like devices). Swap pages (and | ||
| 74 | evicted page-cache pages) are a great use for this kind of slower-than-RAM- | ||
| 75 | but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and | ||
| 76 | cleancache) interface to transcendent memory provides a nice way to read | ||
| 77 | and write -- and indirectly "name" -- the pages. | ||
| 78 | |||
| 79 | Frontswap -- and cleancache -- with a fairly small impact on the kernel, | ||
| 80 | provides a huge amount of flexibility for more dynamic, flexible RAM | ||
| 81 | utilization in various system configurations: | ||
| 82 | |||
| 83 | In the single kernel case, aka "zcache", pages are compressed and | ||
| 84 | stored in local memory, thus increasing the total anonymous pages | ||
| 85 | that can be safely kept in RAM. Zcache essentially trades off CPU | ||
| 86 | cycles used in compression/decompression for better memory utilization. | ||
| 87 | Benchmarks have shown little or no impact when memory pressure is | ||
| 88 | low while providing a significant performance improvement (25%+) | ||
| 89 | on some workloads under high memory pressure. | ||
| 90 | |||
| 91 | "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory | ||
| 92 | support for clustered systems. Frontswap pages are locally compressed | ||
| 93 | as in zcache, but then "remotified" to another system's RAM. This | ||
| 94 | allows RAM to be dynamically load-balanced back-and-forth as needed, | ||
| 95 | i.e. when system A is overcommitted, it can swap to system B, and | ||
| 96 | vice versa. RAMster can also be configured as a memory server so | ||
| 97 | many servers in a cluster can swap, dynamically as needed, to a single | ||
| 98 | server configured with a large amount of RAM... without pre-configuring | ||
| 99 | how much of the RAM is available for each of the clients! | ||
| 100 | |||
| 101 | In the virtual case, the whole point of virtualization is to statistically | ||
| 102 | multiplex physical resources across the varying demands of multiple | ||
| 103 | virtual machines. This is really hard to do with RAM and efforts to do | ||
| 104 | it well with no kernel changes have essentially failed (except in some | ||
| 105 | well-publicized special-case workloads). | ||
| 106 | Specifically, the Xen Transcendent Memory backend allows otherwise | ||
| 107 | "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | ||
| 108 | virtual machines, but the pages can be compressed and deduplicated to | ||
| 109 | optimize RAM utilization. And when guest OS's are induced to surrender | ||
| 110 | underutilized RAM (e.g. with "selfballooning"), sudden unexpected | ||
| 111 | memory pressure may result in swapping; frontswap allows those pages | ||
| 112 | to be swapped to and from hypervisor RAM (if overall host system memory | ||
| 113 | conditions allow), thus mitigating the potentially awful performance impact | ||
| 114 | of unplanned swapping. | ||
| 115 | |||
| 116 | A KVM implementation is underway and has been RFC'ed to lkml. And, | ||
| 117 | using frontswap, investigation is also underway on the use of NVM as | ||
| 118 | a memory extension technology. | ||
| 119 | |||
| 120 | 2) Sure there may be performance advantages in some situations, but | ||
| 121 | what's the space/time overhead of frontswap? | ||
| 122 | |||
| 123 | If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into | ||
| 124 | nothingness and the only overhead is a few extra bytes per swapon'ed | ||
| 125 | swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" | ||
| 126 | registers, one extra global variable is compared against zero for | ||
| 127 | every swap page read or written. If CONFIG_FRONTSWAP is enabled | ||
| 128 | AND a frontswap backend registers AND the backend fails every "put" | ||
| 129 | request (i.e. provides no memory despite claiming it might), | ||
| 130 | CPU overhead is still negligible -- and since every frontswap fail | ||
| 131 | precedes a swap page write-to-disk, the system is highly likely | ||
| 132 | to be I/O bound and using a small fraction of a percent of a CPU | ||
| 133 | will be irrelevant anyway. | ||
| 134 | |||
| 135 | As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend | ||
| 136 | registers, one bit is allocated for every swap page for every swap | ||
| 137 | device that is swapon'd. This is added to the EIGHT bits (which | ||
| 138 | was sixteen until about 2.6.34) that the kernel already allocates | ||
| 139 | for every swap page for every swap device that is swapon'd. (Hugh | ||
| 140 | Dickins has observed that frontswap could probably steal one of | ||
| 141 | the existing eight bits, but let's worry about that minor optimization | ||
| 142 | later.) For very large swap disks (which are rare) on a standard | ||
| 143 | 4K pagesize, this is 1MB per 32GB swap. | ||
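The 1MB figure follows directly from one bit per swap page. A small worked check (the function name is made up for this sketch):

```python
def frontswap_map_bytes(swap_bytes, page_size=4096):
    """Bytes needed for a bitmap holding one bit per swap page."""
    pages = swap_bytes // page_size
    return (pages + 7) // 8  # round up to whole bytes


GIB = 1 << 30
MIB = 1 << 20

# 32 GiB of swap / 4 KiB pages = 8Mi pages -> 8Mi bits -> 1 MiB
print(frontswap_map_bytes(32 * GIB) / MIB)  # 1.0
```

By the same arithmetic, the eight bits per page that the kernel already allocates come to 8MB for the same 32GB of swap, so the frontswap bit adds one eighth on top of that.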
| 144 | |||
| 145 | When swap pages are stored in transcendent memory instead of written | ||
| 146 | out to disk, there is a side effect that this may create more memory | ||
| 147 | pressure that can potentially outweigh the other advantages. A | ||
| 148 | backend, such as zcache, must implement policies to carefully (but | ||
| 149 | dynamically) manage memory limits to ensure this doesn't happen. | ||
| 150 | |||
| 151 | 3) OK, how about a quick overview of what this frontswap patch does | ||
| 152 | in terms that a kernel hacker can grok? | ||
| 153 | |||
| 154 | Let's assume that a frontswap "backend" has registered during | ||
| 155 | kernel initialization; this registration indicates that this | ||
| 156 | frontswap backend has access to some "memory" that is not directly | ||
| 157 | accessible by the kernel. Exactly how much memory it provides is | ||
| 158 | entirely dynamic and random. | ||
| 159 | |||
| 160 | Whenever a swap-device is swapon'd, frontswap_init() is called, | ||
| 161 | passing the swap device number (aka "type") as a parameter. | ||
| 162 | This notifies frontswap to expect attempts to "put" swap pages | ||
| 163 | associated with that number. | ||
| 164 | |||
| 165 | Whenever the swap subsystem is readying a page to write to a swap | ||
| 166 | device (cf. swap_writepage()), frontswap_put_page is called. Frontswap | ||
| 167 | consults with the frontswap backend and if the backend says it does NOT | ||
| 168 | have room, frontswap_put_page returns -1 and the kernel swaps the page | ||
| 169 | to the swap device as normal. Note that the response from the frontswap | ||
| 170 | backend is unpredictable to the kernel; it may choose to never accept a | ||
| 171 | page, it could accept every ninth page, or it might accept every | ||
| 172 | page. But if the backend does accept a page, the data from the page | ||
| 173 | has already been copied and associated with the type and offset, | ||
| 174 | and the backend guarantees the persistence of the data. In this case, | ||
| 175 | frontswap sets a bit in the "frontswap_map" for the swap device | ||
| 176 | corresponding to the page offset on the swap device to which it would | ||
| 177 | otherwise have written the data. | ||
| 178 | |||
| 179 | When the swap subsystem needs to swap-in a page (swap_readpage()), | ||
| 180 | it first calls frontswap_get_page() which checks the frontswap_map to | ||
| 181 | see if the page was earlier accepted by the frontswap backend. If | ||
| 182 | it was, the page of data is filled from the frontswap backend and | ||
| 183 | the swap-in is complete. If not, the normal swap-in code is | ||
| 184 | executed to obtain the page of data from the real swap device. | ||
| 185 | |||
| 186 | So every time the frontswap backend accepts a page, a swap device write | ||
| 187 | and (potentially) a swap device read are replaced by a "frontswap backend | ||
| 188 | put" and (possibly) a "frontswap backend get", which are presumably much | ||
| 189 | faster. | ||
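The swap-out and swap-in paths described in this answer can be modelled in a few lines. This is a toy sketch, not the kernel implementation: ToySwapDevice, the callable-based backend, and the "reject odd offsets" policy are all invented for illustration. Only the use of a per-slot frontswap_map bit and the fallback to the real device come from the text.

```python
class ToySwapDevice:
    """Toy model of one swap device shadowed by frontswap (hypothetical)."""

    def __init__(self, nr_slots, backend_put, backend_get):
        self.disk = {}                           # offset -> data on the "real" device
        self.frontswap_map = [False] * nr_slots  # one bit per swap slot
        self.backend_put = backend_put           # callable(offset, data) -> accepted?
        self.backend_get = backend_get           # callable(offset) -> data
        self.disk_writes = 0
        self.disk_reads = 0

    def swap_out(self, offset, data):
        # Try frontswap first; the backend may accept or refuse at will.
        if self.backend_put(offset, data):
            self.frontswap_map[offset] = True
            return "frontswap"
        # Refused: clear the bit and write to the real swap device as usual.
        self.frontswap_map[offset] = False
        self.disk[offset] = data
        self.disk_writes += 1
        return "disk"

    def swap_in(self, offset):
        # The bit tells us whether the backend holds this page.
        if self.frontswap_map[offset]:
            return self.backend_get(offset)
        self.disk_reads += 1
        return self.disk[offset]


# Arbitrary demo backend: accepts even offsets only.
store = {}


def put(offset, data):
    if offset % 2:
        store.pop(offset, None)  # a refused put must not leave old data behind
        return False
    store[offset] = data
    return True


dev = ToySwapDevice(4, put, store.__getitem__)
```

With this demo backend, swapping out offsets 0 and 1 sends the first page to the backend and the second to the disk, and each later swap-in is served from wherever the page landed.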
| 190 | |||
| 191 | 4) Can't frontswap be configured as a "special" swap device that is | ||
| 192 | just higher priority than any real swap device (e.g. like zswap, | ||
| 193 | or maybe swap-over-nbd/NFS)? | ||
| 194 | |||
| 195 | No. First, the existing swap subsystem doesn't allow for any kind of | ||
| 196 | swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, | ||
| 197 | but this would require fairly drastic changes. Even if it were | ||
| 198 | rewritten, the existing swap subsystem uses the block I/O layer which | ||
| 199 | assumes a swap device is fixed size and any page in it is linearly | ||
| 200 | addressable. Frontswap barely touches the existing swap subsystem, | ||
| 201 | and works around the constraints of the block I/O subsystem to provide | ||
| 202 | a great deal of flexibility and dynamicity. | ||
| 203 | |||
| 204 | For example, the acceptance of any swap page by the frontswap backend is | ||
| 205 | entirely unpredictable. This is critical to the definition of frontswap | ||
| 206 | backends because it grants completely dynamic discretion to the | ||
| 207 | backend. In zcache, one cannot know a priori how compressible a page is. | ||
| 208 | "Poorly" compressible pages can be rejected, and "poorly" can itself be | ||
| 209 | defined dynamically depending on current memory constraints. | ||
| 210 | |||
| 211 | Further, frontswap is entirely synchronous whereas a real swap | ||
| 212 | device is, by definition, asynchronous and uses block I/O. The | ||
| 213 | block I/O layer is not only unnecessary, but may perform "optimizations" | ||
| 214 | that are inappropriate for a RAM-oriented device including delaying | ||
| 215 | the write of some pages for a significant amount of time. Synchrony is | ||
| 216 | required to ensure the dynamicity of the backend and to avoid thorny race | ||
| 217 | conditions that would unnecessarily and greatly complicate frontswap | ||
| 218 | and/or the block I/O subsystem. That said, only the initial "put" | ||
| 219 | and "get" operations need be synchronous. A separate asynchronous thread | ||
| 220 | is free to manipulate the pages stored by frontswap. For example, | ||
| 221 | the "remotification" thread in RAMster uses standard asynchronous | ||
| 222 | kernel sockets to move compressed frontswap pages to a remote machine. | ||
| 223 | Similarly, a KVM guest-side implementation could do in-guest compression | ||
| 224 | and use "batched" hypercalls. | ||
| 225 | |||
| 226 | In a virtualized environment, the dynamicity allows the hypervisor | ||
| 227 | (or host OS) to do "intelligent overcommit". For example, it can | ||
| 228 | choose to accept pages only until host-swapping might be imminent, | ||
| 229 | then force guests to do their own swapping. | ||
| 230 | |||
| 231 | There is a downside to the transcendent memory specifications for | ||
| 232 | frontswap: Since any "put" might fail, there must always be a real | ||
| 233 | slot on a real swap device to swap the page. Thus frontswap must be | ||
| 234 | implemented as a "shadow" to every swapon'd device with the potential | ||
| 235 | capability of holding every page that the swap device might have held | ||
| 236 | and the possibility that it might hold no pages at all. This means | ||
| 237 | that frontswap cannot contain more pages than the total of swapon'd | ||
| 238 | swap devices. For example, if NO swap device is configured on some | ||
| 239 | installation, frontswap is useless. Swapless portable devices | ||
| 240 | can still use frontswap but a backend for such devices must configure | ||
| 241 | some kind of "ghost" swap device and ensure that it is never used. | ||
| 242 | |||
| 243 | 5) Why this weird definition about "duplicate puts"? If a page | ||
| 244 | has been previously successfully put, can't it always be | ||
| 245 | successfully overwritten? | ||
| 246 | |||
| 247 | Nearly always it can, but no, sometimes it cannot. Consider an example | ||
| 248 | where data is compressed and the original 4K page has been compressed | ||
| 249 | to 1K. Now an attempt is made to overwrite the page with data that | ||
| 250 | is non-compressible and so would take the entire 4K. But the backend | ||
| 251 | has no more space. In this case, the put must be rejected. Whenever | ||
| 252 | frontswap rejects a put that would overwrite, it also must invalidate | ||
| 253 | the old data and ensure that it is no longer accessible. Since the | ||
| 254 | swap subsystem then writes the new data to the real swap device, | ||
| 255 | this is the correct course of action to ensure coherency. | ||
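The scenario in this answer can be reproduced with a toy compressed store. Everything here is an assumption for illustration: the class name, the fixed byte budget, and the use of zlib stand in for whatever a real backend such as zcache does. The point is the rule that a rejected overwrite must also drop the old copy.

```python
import os
import zlib


class ToyCompressedStore:
    """Toy zcache-like store with a fixed byte budget (hypothetical, not kernel code)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.blobs = {}  # (swap type, offset) -> compressed bytes

    def used(self):
        return sum(len(blob) for blob in self.blobs.values())

    def put(self, key, page):
        blob = zlib.compress(page)
        # Drop any old copy first, so a failed overwrite can never
        # leave stale data behind.
        self.blobs.pop(key, None)
        if self.used() + len(blob) > self.capacity:
            return -1  # rejected, and the old page is already invalidated
        self.blobs[key] = blob
        return 0

    def get(self, key):
        blob = self.blobs.get(key)
        return None if blob is None else zlib.decompress(blob)


if __name__ == "__main__":
    zstore = ToyCompressedStore(capacity_bytes=2048)
    key = (0, 5)
    print(zstore.put(key, bytes(4096)))       # all-zero page compresses well
    print(zstore.put(key, os.urandom(4096)))  # random page does not fit the budget
    print(zstore.get(key))                    # old copy must be gone
```

After the rejected overwrite, a get for the same key finds nothing, which matches the requirement that the caller then relies on the copy written to the real swap device.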
| 256 | |||
| 257 | 6) What is frontswap_shrink for? | ||
| 258 | |||
| 259 | When the (non-frontswap) swap subsystem swaps out a page to a real | ||
| 260 | swap device, that page is only taking up low-value pre-allocated disk | ||
| 261 | space. But if frontswap has placed a page in transcendent memory, that | ||
| 262 | page may be taking up valuable real estate. The frontswap_shrink | ||
| 263 | routine allows code outside of the swap subsystem to force pages out | ||
| 264 | of the memory managed by frontswap and back into kernel-addressable memory. | ||
| 265 | For example, in RAMster, a "suction driver" thread will attempt | ||
| 266 | to "repatriate" pages sent to a remote machine back to the local machine; | ||
| 267 | this is driven using the frontswap_shrink mechanism when memory pressure | ||
| 268 | subsides. | ||
| 269 | |||
| 270 | 7) Why does the frontswap patch create the new include file swapfile.h? | ||
| 271 | |||
| 272 | The frontswap code depends on some swap-subsystem-internal data | ||
| 273 | structures that have, over the years, moved back and forth between | ||
| 274 | static and global. This seemed a reasonable compromise: Define | ||
| 275 | them as global but declare them in a new include file that isn't | ||
| 276 | included by the large number of source files that include swap.h. | ||
| 277 | |||
| 278 | Dan Magenheimer, last updated April 9, 2012 | ||
diff --git a/mm/Kconfig b/mm/Kconfig index e338407f1225..2613c910935a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig | |||
| @@ -379,3 +379,20 @@ config CLEANCACHE | |||
| 379 | in a negligible performance hit. | 379 | in a negligible performance hit. |
| 380 | 380 | ||
| 381 | If unsure, say Y to enable cleancache | 381 | If unsure, say Y to enable cleancache |
| 382 | |||
| 383 | config FRONTSWAP | ||
| 384 | bool "Enable frontswap to cache swap pages if tmem is present" | ||
| 385 | depends on SWAP | ||
| 386 | default n | ||
| 387 | help | ||
| 388 | Frontswap is so named because it can be thought of as the opposite | ||
| 389 | of a "backing" store for a swap device. The data is stored into | ||
| 390 | "transcendent memory", memory that is not directly accessible or | ||
| 391 | addressable by the kernel and is of unknown and possibly | ||
| 392 | time-varying size. When space in transcendent memory is available, | ||
| 393 | a significant swap I/O reduction may be achieved. When none is | ||
| 394 | available, all frontswap calls are reduced to a single pointer- | ||
| 395 | compare-against-NULL resulting in a negligible performance hit | ||
| 396 | and swap data is stored as normal on the matching swap device. | ||
| 397 | |||
| 398 | If unsure, say Y to enable frontswap. | ||
diff --git a/mm/Makefile b/mm/Makefile index 50ec00ef2a0e..306742a28266 100644 --- a/mm/Makefile +++ b/mm/Makefile | |||
| @@ -26,6 +26,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o | |||
| 26 | 26 | ||
| 27 | obj-$(CONFIG_BOUNCE) += bounce.o | 27 | obj-$(CONFIG_BOUNCE) += bounce.o |
| 28 | obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o | 28 | obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o |
| 29 | obj-$(CONFIG_FRONTSWAP) += frontswap.o | ||
| 29 | obj-$(CONFIG_HAS_DMA) += dmapool.o | 30 | obj-$(CONFIG_HAS_DMA) += dmapool.o |
| 30 | obj-$(CONFIG_HUGETLBFS) += hugetlb.o | 31 | obj-$(CONFIG_HUGETLBFS) += hugetlb.o |
| 31 | obj-$(CONFIG_NUMA) += mempolicy.o | 32 | obj-$(CONFIG_NUMA) += mempolicy.o |
