Diffstat (limited to 'Documentation/filesystems/ceph.txt')
| -rw-r--r-- | Documentation/filesystems/ceph.txt | 139 |
1 files changed, 139 insertions, 0 deletions
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
new file mode 100644
index 000000000000..6e03917316bd
--- /dev/null
+++ b/Documentation/filesystems/ceph.txt
| @@ -0,0 +1,139 @@ | |||
| 1 | Ceph Distributed File System | ||
| 2 | ============================ | ||
| 3 | |||
| 4 | Ceph is a distributed network file system designed to provide good | ||
| 5 | performance, reliability, and scalability. | ||
| 6 | |||
| 7 | Basic features include: | ||
| 8 | |||
| 9 | * POSIX semantics | ||
| 10 | * Seamless scaling from 1 to many thousands of nodes | ||
| 11 | * High availability and reliability. No single points of failure. | ||
| 12 | * N-way replication of data across storage nodes | ||
| 13 | * Fast recovery from node failures | ||
| 14 | * Automatic rebalancing of data on node addition/removal | ||
| 15 | * Easy deployment: most FS components are userspace daemons | ||
| 16 | |||
| 17 | Also, | ||
| 18 | * Flexible snapshots (on any directory) | ||
| 19 | * Recursive accounting (nested files, directories, bytes) | ||
| 20 | |||
| 21 | In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely | ||
| 22 | on symmetric access by all clients to shared block devices, Ceph | ||
| 23 | separates data and metadata management into independent server | ||
| 24 | clusters, similar to Lustre. Unlike Lustre, however, metadata and | ||
| 25 | storage nodes run entirely as user space daemons. Storage nodes | ||
| 26 | utilize btrfs to store data objects, leveraging its advanced features | ||
| 27 | (checksumming, metadata replication, etc.). File data is striped | ||
| 28 | across storage nodes in large chunks to distribute workload and | ||
| 29 | facilitate high throughput. When storage nodes fail, data is | ||
| 30 | re-replicated in a distributed fashion by the storage nodes themselves | ||
| 31 | (with some minimal coordination from a cluster monitor), making the | ||
| 32 | system extremely efficient and scalable. | ||
| 33 | |||
| 34 | Metadata servers effectively form a large, consistent, distributed | ||
| 35 | in-memory cache above the file namespace that is extremely scalable, | ||
| 36 | dynamically redistributes metadata in response to workload changes, | ||
| 37 | and can tolerate arbitrary (well, non-Byzantine) node failures. The | ||
| 38 | metadata server takes a somewhat unconventional approach to metadata | ||
| 39 | storage to significantly improve performance for common workloads. In | ||
| 40 | particular, inodes with only a single link are embedded in | ||
| 41 | directories, allowing entire directories of dentries and inodes to be | ||
| 42 | loaded into its cache with a single I/O operation. The contents of | ||
| 43 | extremely large directories can be fragmented and managed by | ||
| 44 | independent metadata servers, allowing scalable concurrent access. | ||
| 45 | |||
| 46 | The system offers automatic data rebalancing/migration when scaling | ||
| 47 | from a small cluster of just a few nodes to many hundreds, without | ||
| 48 | requiring an administrator to carve the data set into static volumes or | ||
| 49 | go through the tedious process of migrating data between servers. | ||
| 50 | When the file system approaches full, new nodes can be easily added | ||
| 51 | and things will "just work." | ||
| 52 | |||
| 53 | Ceph includes a flexible snapshot mechanism that allows a user to create | ||
| 54 | a snapshot on any subdirectory (and its nested contents) in the | ||
| 55 | system. Snapshot creation and deletion are as simple as 'mkdir | ||
| 56 | .snap/foo' and 'rmdir .snap/foo'. | ||
| 57 | |||
| 58 | Ceph also provides some recursive accounting on directories for nested | ||
| 59 | files and bytes. That is, a 'getfattr -d foo' on any directory in the | ||
| 60 | system will reveal the total number of nested regular files and | ||
| 61 | subdirectories, and a summation of all nested file sizes. This makes | ||
| 62 | the identification of large disk space consumers relatively quick, as | ||
| 63 | no 'du' or similar recursive scan of the file system is required. | ||
| 64 | |||
| 65 | |||
| 66 | Mount Syntax | ||
| 67 | ============ | ||
| 68 | |||
| 69 | The basic mount syntax is: | ||
| 70 | |||
| 71 | # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt | ||
| 72 | |||
| 73 | You only need to specify a single monitor, as the client will get the | ||
| 74 | full list when it connects. (However, if the monitor you specify | ||
| 75 | happens to be down, the mount won't succeed.) The port can be left | ||
| 76 | off if the monitor is using the default. So if the monitor is at | ||
| 77 | 1.2.3.4, | ||
| 78 | |||
| 79 | # mount -t ceph 1.2.3.4:/ /mnt/ceph | ||
| 80 | |||
| 81 | is sufficient. If /sbin/mount.ceph is installed, a hostname can be | ||
| 82 | used instead of an IP address. | ||
| 83 | |||
| 84 | |||
| 85 | |||
| 86 | Mount Options | ||
| 87 | ============= | ||
| 88 | |||
| 89 | ip=A.B.C.D[:N] | ||
| 90 | Specify the IP and/or port the client should bind to locally. | ||
| 91 | There is normally not much reason to do this. If the IP is not | ||
| 92 | specified, the client's IP address is determined by looking at the | ||
| 93 | address its connection to the monitor originates from. | ||
| 94 | |||
| 95 | wsize=X | ||
| 96 | Specify the maximum write size in bytes. By default there is no | ||
| 97 | maximum. Ceph will normally size writes based on the file stripe | ||
| 98 | size. | ||
| 99 | |||
| 100 | rsize=X | ||
| 101 | Specify the maximum readahead. | ||
| 102 | |||
| 103 | mount_timeout=X | ||
| 104 | Specify the timeout value for mount (in seconds), in the case | ||
| 105 | of a non-responsive Ceph file system. The default is 30 | ||
| 106 | seconds. | ||
| 107 | |||
| 108 | rbytes | ||
| 109 | When stat() is called on a directory, set st_size to 'rbytes', | ||
| 110 | the summation of file sizes over all files nested beneath that | ||
| 111 | directory. This is the default. | ||
| 112 | |||
| 113 | norbytes | ||
| 114 | When stat() is called on a directory, set st_size to the | ||
| 115 | number of entries in that directory. | ||
| 116 | |||
| 117 | nocrc | ||
| 118 | Disable CRC32C calculation for data writes. If set, the OSD | ||
| 119 | must rely on TCP's checksum to detect data corruption | ||
| 120 | in the data payload. | ||
| 121 | |||
| 122 | noasyncreaddir | ||
| 123 | Disable the client's use of its local cache to satisfy readdir | ||
| 124 | requests. (This does not change correctness; the client uses | ||
| 125 | cached metadata only when a lease or capability ensures it is | ||
| 126 | valid.) | ||
| 127 | |||
| 128 | |||
| 129 | More Information | ||
| 130 | ================ | ||
| 131 | |||
| 132 | For more information on Ceph, see the home page at | ||
| 133 | http://ceph.newdream.net/ | ||
| 134 | |||
| 135 | The Linux kernel client source tree is available at | ||
| 136 | git://ceph.newdream.net/linux-ceph-client.git | ||
| 137 | |||
| 138 | and the source for the full system is at | ||
| 139 | git://ceph.newdream.net/ceph.git | ||
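
As an illustration of the mount syntax documented in the file above (monip[:port][,monip2[:port]...]:/[subdir]), the following is a minimal sketch of how a script might assemble the device string and the matching mount command line. It is not part of the patch. The helper names ceph_mount_source and ceph_mount_command are hypothetical, and the port number in the usage example is only a sample value. The option names are the ones listed under Mount Options.

```python
def ceph_mount_source(monitors, subdir=""):
    """Build the device string for 'mount -t ceph' (hypothetical helper).

    monitors: list of (ip, port) pairs; port may be None to use the default.
    subdir:   optional path below the file system root.
    """
    if not monitors:
        raise ValueError("at least one monitor address is required")
    parts = []
    for ip, port in monitors:
        # The port is optional; omit it when the monitor uses the default.
        parts.append(ip if port is None else f"{ip}:{port}")
    # Monitors are comma-separated, followed by ':/' and the optional subdir.
    return ",".join(parts) + ":/" + subdir.lstrip("/")


def ceph_mount_command(monitors, mountpoint, subdir="", options=()):
    """Return the argv list for the mount invocation (it is not executed)."""
    cmd = ["mount", "-t", "ceph", ceph_mount_source(monitors, subdir), mountpoint]
    if options:
        # Mount options are passed as one comma-separated -o argument.
        cmd += ["-o", ",".join(options)]
    return cmd


if __name__ == "__main__":
    # Single monitor on the default port, as in the example in the text.
    print(" ".join(ceph_mount_command([("1.2.3.4", None)], "/mnt/ceph")))
    # Two monitors, a subdirectory, and two documented options.
    print(" ".join(ceph_mount_command(
        [("1.2.3.4", 6789), ("1.2.3.5", None)],
        "/mnt/ceph",
        subdir="data",
        options=("norbytes", "mount_timeout=60"),
    )))
```

The sketch only builds the argument list. Running it through subprocess would require root privileges and a reachable monitor, so that step is left out.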
