author    Dan Magenheimer <dan.magenheimer@oracle.com>    2013-05-20 10:52:17 -0400
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2013-05-20 11:21:04 -0400
commit    8bb3e55103b37869175333e00fc01b34b0459529 (patch)
tree      96eb4df3801d92460a82b708014406f06df2bdd5
parent    642f2ecc092f4d2d5a9b7219090531508017c324 (diff)

staging: ramster: add how-to document

Add how-to documentation that provides a step-by-step guide for
configuring and trying out a ramster cluster.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 -rw-r--r-- drivers/staging/zcache/ramster/ramster-howto.txt | 366
 1 file changed, 366 insertions, 0 deletions
diff --git a/drivers/staging/zcache/ramster/ramster-howto.txt b/drivers/staging/zcache/ramster/ramster-howto.txt
new file mode 100644
index 000000000000..7b1ee3bbfdd5
--- /dev/null
+++ b/drivers/staging/zcache/ramster/ramster-howto.txt
@@ -0,0 +1,366 @@
RAMSTER HOW-TO

Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>

This is a HOWTO document for ramster which, as of this writing, lives
in the kernel as a subdirectory of zcache in drivers/staging, called
ramster.  (Zcache can be built with or without ramster functionality.)
If enabled and properly configured, ramster allows memory capacity
load balancing across multiple machines in a cluster.  Further, the
ramster code serves as an example of asynchronous access for zcache
(as well as cleancache and frontswap) that may prove useful for future
transcendent memory implementations, such as KVM and NVRAM.  While
ramster works today on any network connection that supports kernel
sockets, its features may become more interesting on future
high-speed fabrics/interconnects.

Ramster requires both kernel and userland support.  The userland
support, called ramster-tools, is known to work with EL6-based
distros, but is a set of poorly hacked, slightly modified cluster
tools based on ocfs2, which includes an init file, a config file, and
a userland binary that interfaces to the kernel.  This state of
userland support reflects the abysmal userland skills of this
suitably-embarrassed author; any help/patches to turn ramster-tools
into more distributable rpms/debs useful for a wider range of distros
would be appreciated.  A source RPM that can be used as a starting
point is available at:
    http://oss.oracle.com/projects/tmem/files/RAMster/

As a result of this author's ignorance, the userland setup described
in this HOWTO assumes an EL6 distro and is described in EL6 syntax.
Apologies if this offends anyone!

Kernel support has only been tested on x86_64.  Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues.  A kernel configuration
that includes CONFIG_OCFS2_FS should build OK, and should certainly
run OK if no ocfs2 filesystem is mounted.

This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node, called the "local" node, becomes
overcommitted and the other node, called the "remote" node, provides
additional RAM capacity for use by the local node.  Ramster is capable
of more complex topologies; see the last section, "ADVANCED RAMSTER
TOPOLOGIES".

If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)

Now that you've memorized the contents of those articles, let's get started!

A. PRELIMINARY

1) Install two x86_64 Linux systems that are known to work when
   upgraded to a recent upstream Linux kernel version.

On each system:

2) Configure, build and install, then boot Linux, just to ensure it
   can be done with an unmodified upstream kernel.  Confirm you booted
   the upstream kernel with "uname -a".
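
   One possible sequence (a sketch; the bootloader step varies by
   distro, and "make install" assumes your distro provides an
   installkernel script):

    # make olddefconfig
    # make -j$(nproc)
    # make modules_install
    # make install

   and, after rebooting into the new kernel:

    # uname -a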

3) Unless you plan to test only swapping, or if you plan to do any
   performance testing, the "WasActive" patch is also highly
   recommended.  (Search lkml.org for WasActive, apply the patch, and
   rebuild your kernel.)  For a demo or simple testing, the patch can
   be ignored.

4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
   can be found at:
    http://oss.oracle.com/projects/tmem/files/RAMster/
   (Sorry, but for now, non-EL6 users must recreate ramster-tools on
   their own from source.  See above.)
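
   For example (a sketch; the package filename below is hypothetical,
   so substitute whatever version you actually downloaded):

    # rpm -ivh ramster-tools-<version>.el6.x86_64.rpm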

5) Ensure that debugfs is mounted at each boot.  Examples below assume
   it is mounted at /sys/kernel/debug.
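
   Either mount it manually or add an /etc/fstab entry (both forms
   below are standard, but double-check them against your distro):

    # mount -t debugfs none /sys/kernel/debug

   or, in /etc/fstab:

    debugfs  /sys/kernel/debug  debugfs  defaults  0 0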

B. BUILDING RAMSTER INTO THE KERNEL

Do the following on each system:

1) Using the kernel configuration mechanism of your choice, change
   your config to include:

    CONFIG_CLEANCACHE=y
    CONFIG_FRONTSWAP=y
    CONFIG_STAGING=y
    CONFIG_CONFIGFS_FS=y   # NOTE: MUST BE y, not m
    CONFIG_ZCACHE=y
    CONFIG_RAMSTER=y

   For a linux-3.10 or later kernel, you should also set:

    CONFIG_ZCACHE_DEBUG=y
    CONFIG_RAMSTER_DEBUG=y

   Before building the kernel, please double-check your kernel config
   file to ensure all of the settings are correct.
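
   A quick way to verify (a sketch; run from the top of your kernel
   build tree):

    # grep -E "CONFIG_(CLEANCACHE|FRONTSWAP|STAGING|CONFIGFS_FS|ZCACHE|RAMSTER)" .config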

2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
   so that the new kernel will boot.

3) Add "zcache" and "ramster" as kernel boot parameters for the new kernel.
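
   For example, the kernel line in /etc/grub.conf might end up looking
   like this (the kernel version and root device below are
   placeholders):

    kernel /vmlinuz-3.10.0 ro root=/dev/sda1 zcache ramster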

4) Reboot each system, approximately simultaneously.

5) Check dmesg to ensure there are some messages from ramster, prefixed
   by "ramster:":

    # dmesg | grep ramster

   You should also see a lot of files in:

    # ls /sys/kernel/debug/zcache
    # ls /sys/kernel/debug/ramster

   These are mostly counters for various zcache and ramster activities.
   You should also see files in:

    # ls /sys/kernel/mm/ramster

   These are sysfs files that control ramster, as we shall see.

   Ramster will now act as a single-system zcache on each system, but
   it doesn't yet know anything about the cluster, so it can't yet do
   anything remotely.

C. CONFIGURING THE RAMSTER CLUSTER

This part can be error-prone unless you are familiar with clustering
filesystems.  We need to describe the cluster in an /etc/ramster.conf
file, and the init scripts that parse it are extremely picky about
the syntax.

1) Create an /etc/ramster.conf file and ensure it is identical on both
   systems.  This file mimics the ocfs2 format, and a good amount of
   documentation can be found by searching for ocfs2.conf, but you can
   use:

    cluster:
            name = ramster
            node_count = 2
    node:
            name = system1
            cluster = ramster
            number = 0
            ip_address = my.ip.ad.r1
            ip_port = 7777
    node:
            name = system2
            cluster = ramster
            number = 1
            ip_address = my.ip.ad.r2
            ip_port = 7777

   You must ensure that the "name" field in the file exactly matches
   the output of "hostname" on each system; if "hostname" shows a
   fully-qualified hostname, ensure the name is fully qualified in
   /etc/ramster.conf.  Obviously, substitute my.ip.ad.rx with proper
   IP addresses.
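
   A quick sanity check on each system (a sketch; just compare the two
   outputs visually):

    # hostname
    # grep "name = " /etc/ramster.conf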

2) Enable the ramster service and configure it.  If you used the
   EL6 ramster-tools, this would be:

    # chkconfig --add ramster
    # service ramster configure

   Set "load on boot" to "y", the cluster to start to "ramster" (or
   whatever name you chose in ramster.conf), the heartbeat dead
   threshold to "500", and the network idle timeout to "1000000".
   Leave the others at their defaults.

3) Reboot both systems.  After reboot, try (assuming EL6 ramster-tools):

    # service ramster status

   You should see "Checking RAMSTER cluster "ramster": Online".  If you
   do not, something is wrong and ramster will not work.  Note that you
   should also see that the driver for "configfs" is loaded and mounted,
   that the driver for ocfs2_dlmfs is not loaded, and some numbers for
   network parameters.  You will also see "Checking RAMSTER heartbeat:
   Not active".  That's all OK.

4) Now you need to start the cluster heartbeat; the cluster is not "up"
   until all nodes detect a heartbeat.  In a real cluster, heartbeat
   detection is done via a cluster filesystem, but ramster doesn't
   require one.  Some hack-y kernel code in ramster can start the
   heartbeat for you, though, if you tell it which nodes are "up".  To
   enable the heartbeat, do:

    # echo 0 > /sys/kernel/mm/ramster/manual_node_up
    # echo 1 > /sys/kernel/mm/ramster/manual_node_up

   This must be done on BOTH nodes and, to avoid timeouts,
   approximately concurrently on both.  On an EL6 system, it is
   convenient to put these lines in /etc/rc.local.  To confirm that the
   cluster is now up, on both systems do:

    # dmesg | grep ramster

   You should see ramster "Accepted connection" messages in dmesg on
   both nodes after this.  Note that if you check userland status again
   with

    # service ramster status

   you will still see "Checking RAMSTER heartbeat: Not active".  That's
   still OK... the ramster kernel heartbeat hack doesn't communicate to
   userland.

5) You now must tell each node which node it should "remotify" pages
   to.  In this two-node cluster, we will assume the "local" node,
   node 0, has its memory overcommitted and will use ramster to utilize
   RAM capacity on the "remote" node, node 1.  To configure this, on
   node 0, you do:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

   You should see "ramster: node 1 set as remotification target" in
   dmesg on node 0.  Again, on EL6, /etc/rc.local is a good place to
   put this on node 0 so you don't forget to do it at each boot.

6) One more step: By default, the ramster code does not "remotify" any
   pages; this default exists primarily for testing purposes, but it is
   sometimes useful.  This may change in the future, but for now, on
   node 0, you do:

    # echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    # echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable

   The first enables remotifying of swap (persistent, aka frontswap)
   pages; the second enables remotifying of page cache (ephemeral, aka
   cleancache) pages.

   On EL6, these lines can also be put in /etc/rc.local (AFTER the
   node_up lines), or at the beginning of a script that runs a workload.
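
   Putting it all together, /etc/rc.local on node 0 in this two-node
   example might end with the following (a sketch; order matters, and
   the node_up lines must still run at about the same time as on
   node 1):

    echo 0 > /sys/kernel/mm/ramster/manual_node_up
    echo 1 > /sys/kernel/mm/ramster/manual_node_up
    echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
    echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable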

7) Note that most testing has been done with both/all machines booted
   roughly simultaneously to avoid cluster timeouts.  Ideally, you
   should do this too unless you are trying to break ramster rather
   than just use it. ;-)

D. TESTING RAMSTER

1) Note that ramster has no value unless pages get "remotified".  For
   swap/frontswap/persistent pages, this doesn't happen unless/until
   the workload would cause swapping to occur, at which point pages
   are put into frontswap/zcache, and the remotification thread starts
   working.  To get to the point where the system swaps, you either
   need a workload for which the working set exceeds the RAM in the
   system, or you need to somehow reduce the amount of RAM one of
   the systems sees.  The latter is easy when testing in a VM, but
   harder on physical systems.  In some cases, "mem=xxxM" on the
   kernel command line restricts memory, but for some values of xxx
   the kernel may fail to boot.  One may also try creating a fixed
   RAMdisk, doing nothing with it, but ensuring that it eats up a fixed
   amount of RAM.
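
   One way to do that last trick (a sketch; the mount point and size
   are arbitrary; note that ramfs pages cannot be swapped out, so they
   pin RAM, whereas tmpfs pages are swappable and would defeat the
   purpose):

    # mkdir -p /mnt/eatmem
    # mount -t ramfs ramfs /mnt/eatmem
    # dd if=/dev/zero of=/mnt/eatmem/fill bs=1M count=2048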

2) To see if ramster is working, on the "remote" node, node 1, try:

    # grep . /sys/kernel/debug/ramster/foreign_*
    # # note, that is space-dot-space between grep and the pathname

   to monitor the number (and max) of ephemeral and persistent pages
   that ramster has sent.  If these stay at zero, ramster is not
   working, either because the workload on the local node (node 0)
   isn't creating enough memory pressure or because "remotifying" isn't
   working.  On the local system, node 0, you can also watch lots of
   useful information.  Try:

    grep . /sys/kernel/debug/zcache/*pageframes* \
        /sys/kernel/debug/zcache/*zbytes* \
        /sys/kernel/debug/zcache/*zpages* \
        /sys/kernel/debug/ramster/*remote*

   Of particular note are the remote_*_pages_succ_get counters.  These
   show how many disk reads and/or disk writes have been avoided on the
   overcommitted local system by storing pages remotely using ramster.
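
   To watch any of these counters continuously, something like this
   works (a sketch):

    # watch -n 5 'grep . /sys/kernel/debug/ramster/foreign_*'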

   At the risk of information overload, you can also grep:

    /sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*

   These show, for example, how many disk reads and/or disk writes have
   been avoided by using zcache to optimize RAM on the local system.


AUTOMATIC SWAP REPATRIATION

You may notice that while the systems are idle, the foreign persistent
page count on the remote machine slowly decreases.  This is because
ramster implements "frontswap selfshrinking":  When possible, swap
pages that have been remotified are slowly repatriated to the local
machine.  This is so that local RAM can be used when possible and
so that, in case of a remote machine crash, the probability of
data loss is reduced.

REBOOTING / POWEROFF

If a system is shut down while some of its swap pages still reside
on a remote system, the system may lock up during the shutdown
sequence.  This will occur if the network is shut down before the
swap mechanism is shut down, which is the default ordering on many
distros.  To avoid this annoying problem, simply shut off the swap
subsystem before starting the shutdown sequence, e.g.:

    # swapoff -a
    # reboot

Ideally, this swapoff-before-ifdown ordering should be enforced
permanently using shutdown scripts.
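
A minimal sketch of such a shutdown hook (the script name is
hypothetical, and hooking it into the shutdown sequence is
distro-specific; on some distros a kill script only runs for services
marked as running, so verify it actually executes):

    #!/bin/sh
    # /etc/init.d/ramster-swapoff (hypothetical): turn off swap early
    # in the shutdown sequence, before networking goes down, so that
    # no swap pages are left stranded on a remote node.
    case "$1" in
    stop)
            swapoff -a
            ;;
    esac

For an EL6-style sysvinit layout, link it so it runs before the
network is stopped, e.g.:

    # ln -s ../init.d/ramster-swapoff /etc/rc0.d/K01ramster-swapoff
    # ln -s ../init.d/ramster-swapoff /etc/rc6.d/K01ramster-swapoff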

KNOWN PROBLEMS

1) You may periodically see messages such as:

    ramster_r2net, message length problem

   This is harmless but indicates that a node is sending messages
   containing compressed pages that exceed the maximum for zcache
   (PAGE_SIZE*15/16).  The sender side needs to be fixed.

2) If you see a "No longer connected to node..." message or a "No
   connection established with node X after N seconds" message, it is
   possible you are in an unrecoverable state.  If you are certain all
   of the appropriate cluster configuration steps described above have
   been performed, try rebooting the two servers concurrently to see
   if the cluster starts.

   Note that "Connection to node... shutdown, state 7" is an
   intermediate connection state.  As long as you later see "Accepted
   connection", the intermediate states are harmless.

3) There are known issues in counting certain values.  As a result,
   you may see periodic warnings from the kernel.  Almost always you
   will see "ramster: bad accounting for XXX".  There are also
   "WARN_ONCE" messages.  If you see kernel warnings with a tombstone,
   please report them.  They are harmless but reflect bugs that
   eventually need to be fixed.

ADVANCED RAMSTER TOPOLOGIES

The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.

In the example described above, the "remote" node serves as a RAM
overflow for the "local" node.  This can be made symmetric by
appropriate settings of the sysfs remote_target_nodenum file.  For
example, by setting:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 0, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, each node can serve as a RAM overflow for the other.

For more than two nodes, a "RAM server" can be configured.  For a
three-node system, set:

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 2.  Then node 0 is a RAM server for node 1 and node 2.

In this implementation of ramster, any remote node is potentially a
single point of failure (SPOF).  Though the probability of failure is
reduced by automatic swap repatriation (see above), a proposed future
enhancement to ramster improves high availability for the cluster by
sending a copy of each page of data to two other nodes.  Patches
welcome!