author      Dan Magenheimer <dan.magenheimer@oracle.com>    2013-05-20 10:52:17 -0400
committer   Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2013-05-20 11:21:04 -0400
commit      8bb3e55103b37869175333e00fc01b34b0459529 (patch)
tree        96eb4df3801d92460a82b708014406f06df2bdd5
parent      642f2ecc092f4d2d5a9b7219090531508017c324 (diff)
staging: ramster: add how-to document
Add how-to documentation that provides a step-by-step guide
for configuring and trying out a ramster cluster.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-rw-r--r--   drivers/staging/zcache/ramster/ramster-howto.txt   366
1 file changed, 366 insertions(+), 0 deletions(-)
diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000000000000..7b1ee3bbfdd5
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
RAMSTER HOW-TO

Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>

This is a HOWTO document for ramster which, as of this writing, is in
the kernel as a subdirectory of zcache in drivers/staging, called ramster.
(Zcache can be built with or without ramster functionality.)  If enabled
and properly configured, ramster allows memory capacity load balancing
across multiple machines in a cluster.  Further, the ramster code serves
as an example of asynchronous access for zcache (as well as cleancache and
frontswap) that may prove useful for future transcendent memory
implementations, such as KVM and NVRAM.  While ramster works today on
any network connection that supports kernel sockets, its features may
become more interesting on future high-speed fabrics/interconnects.

Ramster requires both kernel and userland support.  The userland support,
called ramster-tools, is known to work with EL6-based distros, but is a
set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
includes an init file, a config file, and a userland binary that interfaces
to the kernel.  This state of userland support reflects the abysmal userland
skills of this suitably-embarrassed author; any help/patches to turn
ramster-tools into more distributable rpms/debs useful for a wider range
of distros would be appreciated.  The source RPM that can be used as a
starting point is available at:
    http://oss.oracle.com/projects/tmem/files/RAMster/

As a result of this author's ignorance, userland setup described in this
HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
if this offends anyone!

Kernel support has only been tested on x86_64.  Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues.  A kernel configuration that
includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
if no ocfs2 filesystem is mounted.
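
A simple way to confirm that no ocfs2 filesystem is currently mounted
(purely an optional sanity check, not part of the setup sequence below):

        # mount | grep ocfs2

No output means nothing of type ocfs2 is mounted.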

This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node called the "local" node becomes overcommitted
and the other node called the "remote" node provides additional RAM
capacity for use by the local node.  Ramster is capable of more complex
topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".

If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)

Now that you've memorized the contents of those articles, let's get started!

A. PRELIMINARY

1) Install two x86_64 Linux systems that are known to work when
   upgraded to a recent upstream Linux kernel version.

On each system:

2) Configure, build and install, then boot Linux, just to ensure it
   can be done with an unmodified upstream kernel.  Confirm you booted
   the upstream kernel with "uname -a".

3) If you plan to do any performance testing, or plan to test anything
   other than swapping, the "WasActive" patch is also highly recommended.
   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
   For a demo or simple testing, the patch can be ignored.

4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
   can be found at:
    http://oss.oracle.com/projects/tmem/files/RAMster/
   (Sorry but for now, non-EL6 users must recreate ramster-tools on
   their own from source.  See above.)

5) Ensure that debugfs is mounted at each boot.  Examples below assume it
   is mounted at /sys/kernel/debug.
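
   For example (a minimal sketch; adapt to your distro's conventions),
   debugfs can be mounted by hand with:

        # mount -t debugfs none /sys/kernel/debug

   and made persistent across boots with an /etc/fstab line such as:

        debugfs  /sys/kernel/debug  debugfs  defaults  0 0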

B. BUILDING RAMSTER INTO THE KERNEL

Do the following on each system:

1) Using the kernel configuration mechanism of your choice, change
   your config to include:

        CONFIG_CLEANCACHE=y
        CONFIG_FRONTSWAP=y
        CONFIG_STAGING=y
        CONFIG_CONFIGFS_FS=y   # NOTE: MUST BE y, not m
        CONFIG_ZCACHE=y
        CONFIG_RAMSTER=y

   For a linux-3.10 or later kernel, you should also set:

        CONFIG_ZCACHE_DEBUG=y
        CONFIG_RAMSTER_DEBUG=y

   Before building the kernel please doublecheck your kernel config
   file to ensure all of the settings are correct.
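
   One quick way to do that (assuming you build in the kernel source
   tree, so the configuration is in ./.config) is:

        # egrep 'CLEANCACHE|FRONTSWAP|STAGING|CONFIGFS_FS|ZCACHE|RAMSTER' .config

   and confirm that each of the options listed above appears with "=y".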

2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
   so that the new kernel will boot.

3) Add "zcache" and "ramster" as kernel boot parameters for the new kernel.
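
   For example, on an EL6 system using /etc/grub.conf, the kernel line
   for the new kernel might end up looking something like this (the
   kernel version and root device below are only illustrative):

        kernel /vmlinuz-3.10.0-ramster ro root=/dev/sda1 zcache ramster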

4) Reboot each system approximately simultaneously.

5) Check dmesg to ensure there are some messages from ramster, prefixed
   by "ramster:"

        # dmesg | grep ramster

   You should also see a lot of files in:

        # ls /sys/kernel/debug/zcache
        # ls /sys/kernel/debug/ramster

   These are mostly counters for various zcache and ramster activities.
   You should also see files in:

        # ls /sys/kernel/mm/ramster

   These are sysfs files that control ramster as we shall see.

   Ramster will now act as a single-system zcache on each system, but it
   doesn't yet know anything about the cluster, so it can't yet do
   anything remotely.

C. CONFIGURING THE RAMSTER CLUSTER

This part can be error-prone unless you are familiar with clustering
filesystems.  We need to describe the cluster in a /etc/ramster.conf
file, and the init scripts that parse it are extremely picky about
the syntax.

1) Create a /etc/ramster.conf file and ensure it is identical on both
   systems.  This file mimics the ocfs2 format and there is a good amount
   of documentation that can be searched for ocfs2.conf, but you can use:

        cluster:
                name = ramster
                node_count = 2
        node:
                name = system1
                cluster = ramster
                number = 0
                ip_address = my.ip.ad.r1
                ip_port = 7777
        node:
                name = system2
                cluster = ramster
                number = 1
                ip_address = my.ip.ad.r2
                ip_port = 7777

   You must ensure that the "name" field in the file exactly matches
   the output of "hostname" on each system; if "hostname" shows a
   fully-qualified hostname, ensure the name is fully qualified in
   /etc/ramster.conf.  Obviously, substitute my.ip.ad.rx with proper
   ip addresses.
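
   A quick way to compare the two (the grep pattern here is only a
   sketch) is:

        # hostname
        # grep "name =" /etc/ramster.conf

   The hostname printed by the first command must appear verbatim as one
   of the node "name" values printed by the second (the cluster "name"
   line will also match; ignore it).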

2) Enable the ramster service and configure it.  If you used the
   EL6 ramster-tools, this would be:

        # chkconfig --add ramster
        # service ramster configure

   Set "load on boot" to "y", the cluster to start to "ramster" (or
   whatever name you chose in ramster.conf), the heartbeat dead threshold
   to "500", and the network idle timeout to "1000000".  Leave the others
   at their defaults.

3) Reboot both systems.  After reboot, try (assuming EL6 ramster-tools):

        # service ramster status

   You should see "Checking RAMSTER cluster "ramster": Online".  If you do
   not, something is wrong and ramster will not work.  Note that you
   should also see that the driver for "configfs" is loaded and mounted,
   the driver for ocfs2_dlmfs is not loaded, and some numbers for network
   parameters.  You will also see "Checking RAMSTER heartbeat: Not active".
   That's all OK.

4) Now you need to start the cluster heartbeat; the cluster is not "up"
   until all nodes detect a heartbeat.  In a real cluster, heartbeat detection
   is done via a cluster filesystem, but ramster doesn't require one.  Some
   hack-y kernel code in ramster can start the heartbeat for you though if
   you tell it what nodes are "up".  To enable the heartbeat, do:

        # echo 0 > /sys/kernel/mm/ramster/manual_node_up
        # echo 1 > /sys/kernel/mm/ramster/manual_node_up

   This must be done on BOTH nodes and, to avoid timeouts, must be done
   approximately concurrently on both nodes.  On an EL6 system, it is
   convenient to put these lines in /etc/rc.local.  To confirm that the
   cluster is now up, on both systems do:

        # dmesg | grep ramster

   You should see ramster "Accepted connection" messages in dmesg on both
   nodes after this.  Note that if you check userland status again with

        # service ramster status

   you will still see "Checking RAMSTER heartbeat: Not active".  That's
   still OK... the ramster kernel heartbeat hack doesn't communicate to
   userland.

5) You must now tell each node the node to which it should "remotify" pages.
   On this two-node cluster, we will assume the "local" node, node 0, has
   memory overcommitted and will use ramster to utilize RAM capacity on
   the "remote node", node 1.  To configure this, on node 0, you do:

        # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

   You should see "ramster: node 1 set as remotification target" in dmesg
   on node 0.  Again, on EL6, /etc/rc.local is a good place to put this
   on node 0 so you don't forget to do it at each boot.

6) One more step:  By default, the ramster code does not "remotify" any
   pages; this default exists primarily for testing purposes, though it
   is sometimes useful in its own right.  This may change in the future,
   but for now, on node 0, you do:

        # echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
        # echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable

   The first enables remotifying of swap (persistent, aka frontswap) pages;
   the second enables remotifying of page cache (ephemeral, cleancache)
   pages.

   On EL6, these lines can also be put in /etc/rc.local (AFTER the
   node_up lines), or at the beginning of a script that runs a workload.
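
   Putting the pieces from steps 4-6 together, the end of /etc/rc.local
   on node 0 might look something like the following sketch (node 1 would
   keep only the two manual_node_up lines, unless it is also configured
   to remotify pages):

        echo 0 > /sys/kernel/mm/ramster/manual_node_up
        echo 1 > /sys/kernel/mm/ramster/manual_node_up
        echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
        echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
        echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable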

7) Note that most testing has been done with both/all machines booted
   roughly simultaneously to avoid cluster timeouts.  Ideally, you should
   do this too unless you are trying to break ramster rather than just
   use it. ;-)

D. TESTING RAMSTER

1) Note that ramster has no value unless pages get "remotified".  For
   swap/frontswap/persistent pages, this doesn't happen unless/until
   the workload would cause swapping to occur, at which point pages
   are put into frontswap/zcache, and the remotification thread starts
   working.  To get to the point where the system swaps, you either
   need a workload for which the working set exceeds the RAM in the
   system, or you need to somehow reduce the amount of RAM one of
   the systems sees.  The latter is easy when testing in a VM, but
   harder on physical systems.  In some cases, "mem=xxxM" on the
   kernel command line restricts memory, but for some values of xxx
   the kernel may fail to boot.  One may also try creating a fixed
   RAMdisk, doing nothing with it, but ensuring that it eats up a fixed
   amount of RAM.
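
   One way to do the RAMdisk trick (a sketch only, assuming the "brd"
   ramdisk driver is available as a module; rd_size is in KB, so the
   example below pins roughly 1GB of RAM) is:

        # modprobe brd rd_nr=1 rd_size=1048576
        # dd if=/dev/zero of=/dev/ram0 bs=1M count=1024

   The pages backing /dev/ram0 stay allocated and are not swappable, so
   the effective RAM available to the rest of the system shrinks.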

2) To see if ramster is working, on the "remote node", node 1, try:

        # grep . /sys/kernel/debug/ramster/foreign_*
        # # note, that is space-dot-space between grep and the pathname

   to monitor the number (and max) of ephemeral and persistent pages
   that ramster has sent.  If these stay at zero, ramster is not working,
   either because the workload on the local node (node 0) isn't creating
   enough memory pressure or because "remotifying" isn't working.  On the
   local system, node 0, you can also watch lots of useful information.
   Try:

        grep . /sys/kernel/debug/zcache/*pageframes* \
                /sys/kernel/debug/zcache/*zbytes* \
                /sys/kernel/debug/zcache/*zpages* \
                /sys/kernel/debug/ramster/*remote*

   Of particular note are the remote_*_pages_succ_get counters.  These
   show how many disk reads and/or disk writes have been avoided on the
   overcommitted local system by storing pages remotely using ramster.
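
   To watch these counters change while a workload runs, something like
   the following (purely a convenience) works well:

        # watch -n 5 'grep . /sys/kernel/debug/ramster/*remote*'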

   At the risk of information overload, you can also grep:

        /sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*

   These show, for example, how many disk reads and/or disk writes have
   been avoided by using zcache to optimize RAM on the local system.


AUTOMATIC SWAP REPATRIATION

You may notice that while the systems are idle, the foreign persistent
page count on the remote machine slowly decreases.  This is because
ramster implements "frontswap selfshrinking":  When possible, swap
pages that have been remotified are slowly repatriated to the local
machine.  This is so that local RAM can be used when possible and
so that, in case of remote machine crash, the probability of loss
of data is reduced.

REBOOTING / POWEROFF

If a system is shut down while some of its swap pages still reside
on a remote system, the system may lock up during the shutdown
sequence.  This will occur if the network is shut down before the
swap mechanism is shut down, which is the default ordering on many
distros.  To avoid this annoying problem, simply shut off the swap
subsystem before starting the shutdown sequence, e.g.:

        # swapoff -a
        # reboot

Ideally, this swapoff-before-ifdown ordering should be enforced permanently
using shutdown scripts.
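
Until your distro's shutdown scripts are adjusted, one low-tech option
(a sketch only; the script name and location are arbitrary) is a small
wrapper that is used instead of plain "reboot":

        #!/bin/sh
        # /usr/local/sbin/reboot-ramster: turn off swap first so that no
        # swap pages are still parked on a remote node when the network
        # goes down during shutdown
        swapoff -a
        reboot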

KNOWN PROBLEMS

1) You may periodically see messages such as:

        ramster_r2net, message length problem

   This is harmless but indicates that a node is sending messages
   containing compressed pages that exceed the maximum for zcache
   (PAGE_SIZE*15/16).  The sender side needs to be fixed.

2) If you see a "No longer connected to node..." message or a "No connection
   established with node X after N seconds" message, you may be in an
   unrecoverable state.  If you are certain all of the appropriate
   cluster configuration steps described above have been performed, try
   rebooting the two servers concurrently to see if the cluster starts.

   Note that "Connection to node... shutdown, state 7" is an intermediate
   connection state.  As long as you later see "Accepted connection", the
   intermediate states are harmless.

3) There are known issues in counting certain values.  As a result
   you may see periodic warnings from the kernel.  Almost always you
   will see "ramster: bad accounting for XXX".  There are also "WARN_ONCE"
   messages.  If you see kernel warnings with a tombstone, please report
   them.  They are harmless but reflect bugs that need to be eventually fixed.

ADVANCED RAMSTER TOPOLOGIES

The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.

In the example described above, the "remote" node serves as a RAM
overflow for the "local" node.  This can be made symmetric by appropriate
settings of the sysfs remote_target_nodenum file.  For example, by setting:

        # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 0, and

        # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, each node can serve as a RAM overflow for the other.

For more than two nodes, a "RAM server" can be configured.  For a
three-node system, set:

        # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, and

        # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 2.  Then node 0 is a RAM server for node 1 and node 2.

In this implementation of ramster, any remote node is potentially a single
point of failure (SPOF).  Though the probability of failure is reduced
by automatic swap repatriation (see above), a proposed future enhancement
to ramster improves high availability for the cluster by sending a copy
of each page of data to two other nodes.  Patches welcome!