Staging: pohmelfs: documentation.

This patch adds a description of the POHMELFS design and implementation.
A separate file covers mount options, default parameters and usage examples.

Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
author: Evgeniy Polyakov <zbr@ioremap.net> 2009-02-09 09:02:34 -0500
committer: Greg Kroah-Hartman <gregkh@suse.de> 2009-04-03 17:53:33 -0400
commit: b8523c40d57f5996a467f83825cb05583a5a7da4 (patch)
tree: d345233b8e97d64995d60370eca78c5f3fdefa61
parent: e333720166a432ea890dbd438b465fd0cee3be32 (diff)
3 files changed, 383 insertions, 0 deletions
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
new file mode 100644
index 000000000000..6d6db60d567d
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -0,0 +1,70 @@
+POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
+                Evgeniy Polyakov <zbr@ioremap.net>
+Homepage: http://www.ioremap.net/projects/pohmelfs
+POHMELFS first began as a network filesystem with coherent local data and
+metadata caches but is now evolving into a parallel distributed filesystem.
+Main features of this FS include:
+ * Locally coherent cache for data and metadata with (potentially) byte-range locks.
+        Since all Linux filesystems lock the whole inode during writing, the algorithm
+        is very simple and does not use byte ranges, although they are sent in
+        locking messages.
+ * Completely async processing of all events except creation of hard and symbolic
+        links, and rename events.
+        Object creation and data reading and writing are processed asynchronously.
+ * Flexible object architecture optimized for network processing.
+        Ability to create long paths to objects and remove arbitrarily huge
+        directories with a single network command.
+        (like removing the whole kernel tree via a single network command).
+ * Very high performance.
+ * Fast and scalable multithreaded userspace server. Being in userspace, it works
+        with any underlying filesystem and is still much faster than the async in-kernel NFS one.
+ * Client is able to switch between different servers (if one goes down, the client
+        automatically reconnects to the next one and so on).
+ * Transactions support. Full failover for all operations.
+        Resending transactions to different servers on timeout or error.
+ * Read request (data read, directory listing, lookup requests) balancing between multiple servers.
+ * Write requests are replicated to multiple servers and completed only when all of them have acked.
+ * Ability to add and/or remove servers from the working set at run-time.
+ * Strong authentication and possible data encryption in network channel.
+ * Extended attributes support.
+POHMELFS is based on transactions, which are potentially long-standing objects that live
+in the client's memory. Each transaction contains all the information needed to process a given
+command (or set of commands, which is frequently used during data writing: single transactions
+can contain creation and data writing commands). Transactions are committed by all the servers
+to which they are sent and, in case of failures, are eventually resent or dropped with an error.
+For example, reading will return an error if no servers are available.
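The rule that a replicated transaction completes only once every server has acknowledged it can be sketched as follows. This is a minimal illustration, not the POHMELFS implementation: the structure, the function names and the 32-server limit are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one replicated transaction (up to 32 servers). */
struct replicated_trans {
	uint32_t pending;	/* bit i set: server i has not acknowledged yet */
};

/* Mark the transaction as sent to servers 0..nservers-1. */
static void trans_send_to(struct replicated_trans *t, unsigned int nservers)
{
	t->pending = nservers >= 32 ? UINT32_MAX : (((uint32_t)1 << nservers) - 1);
}

/* Record an ack from one server; returns true once every server has acked. */
static bool trans_ack(struct replicated_trans *t, unsigned int server)
{
	assert(server < 32);
	t->pending &= ~((uint32_t)1 << server);
	return t->pending == 0;
}
```

With three servers, acks from servers 0 and 2 leave the transaction pending; only the ack from server 1 completes it.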
+POHMELFS uses an asynchronous approach to data processing. Courtesy of transactions, it is
+possible to detach replies from requests and, if the command requires data to be received, the
+caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
+servers and async threads will pick up replies in parallel, find appropriate transactions in the
+system and put the data where it belongs (like the page or inode cache).
+The main feature of POHMELFS is its writeback data and metadata cache.
+Only a few non-performance critical operations use the write-through cache and
+are synchronous: hard and symbolic link creation, and object rename. Creation,
+removal of objects and data writing are asynchronous and are sent to
+the server during system writeback. Only one writer at a time is allowed for any
+given inode, which is guarded by an appropriate locking protocol.
+Because of this feature, POHMELFS is extremely fast at metadata intensive
+workloads and can fully utilize the bandwidth to the servers when doing bulk
+data transfers.
+POHMELFS clients operate with a working set of servers and are capable of balancing read-only
+operations (like lookups or directory listings) between them.
+Administrators can add or remove servers from the set at run-time via special commands (described
+in the Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers.
+POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
+One can select any kernel supported cipher, encryption mode, hash type and operation mode
+(hmac or digest). It is also possible to use both or neither (default). Crypto configuration
+is checked during mount time and, if the server does not support it, appropriate capabilities
+will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified).
+Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
+crypto operations and send the resulting data to server or submit it up the stack. This number
+can be controlled via a mount option.
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
new file mode 100644
index 000000000000..4e3d50157083
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/info.txt
@@ -0,0 +1,86 @@
+POHMELFS usage information.
+Mount options:
+idx=%u
+ Each mountpoint is associated with a special index via this option.
+ Administrator can add or remove servers from the given index, so all mounts,
+ which were attached to it, are updated.
+ Default is 0.
+trans_scan_timeout=%u
+ This timeout, expressed in milliseconds, specifies how often transaction
+ trees are scanned for stale requests, which have to be resent or, if the number of
+ retries exceeds the specified limit, dropped with an error.
+ Default is 5 seconds.
+drop_scan_timeout=%u
+ Internal timeout, expressed in milliseconds, which specifies how frequently
+ inodes marked to be dropped are freed. It also specifies how frequently
+ the system checks whether servers have to be added to or removed from the current working set.
+ Default is 1 second.
+wait_on_page_timeout=%u
+ Number of milliseconds to wait for a reply from the remote server to a data reading command.
+ If this timeout is exceeded, reading returns an error.
+ Default is 5 seconds.
+trans_retries=%u
+ This is the number of times that a transaction will be resent to a server that did
+ not answer for the last @trans_scan_timeout milliseconds.
+ When the number of resends exceeds this limit, the transaction is completed with error.
+ Default is 5 resends.
+crypto_thread_num=%u
+ Number of crypto processing threads. Threads are used both for RX and TX traffic.
+ Default is 2, or no threads if crypto operations are not supported.
+trans_max_pages=%u
+ Maximum number of pages in a single transaction. This parameter also controls
+ the number of pages allocated for crypto processing (each crypto thread has
+ a pool of pages, the number of which is equal to 'trans_max_pages').
+ Default is 100 pages.
+crypto_fail_unsupported
+ If specified, mount will fail if the server does not support requested crypto operations.
+ By default mount will disable non-matching crypto operations.
+mcache_timeout=%u
+ Maximum number of milliseconds to wait for the mcache objects to be processed.
+ Mcache includes locks (the given lock should be granted by the server) and attributes (they should be
+ fully received in the given timeframe).
+ Default is 5 seconds.
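How trans_scan_timeout and trans_retries interact can be sketched as a per-transaction decision made on each scan. This is a simplified model under stated assumptions: the structure, the enum and the function name are hypothetical and are not taken from the POHMELFS source.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a pending transaction. */
struct pending_trans {
	uint64_t last_sent_ms;	/* time of the most recent (re)send */
	unsigned int resends;	/* how many times it has been resent so far */
};

enum scan_action { SCAN_KEEP, SCAN_RESEND, SCAN_DROP };

/*
 * Decide what a periodic scan does with one transaction.
 * scan_timeout_ms models trans_scan_timeout, max_retries models trans_retries.
 */
static enum scan_action scan_one(struct pending_trans *t, uint64_t now_ms,
				 uint64_t scan_timeout_ms, unsigned int max_retries)
{
	if (now_ms - t->last_sent_ms < scan_timeout_ms)
		return SCAN_KEEP;	/* the server still has time to answer */

	if (t->resends >= max_retries)
		return SCAN_DROP;	/* complete the transaction with an error */

	t->resends++;
	t->last_sent_ms = now_ms;
	return SCAN_RESEND;
}
```

With the defaults (5000 ms, 5 resends), a transaction that never gets an answer is resent five times and dropped on the next stale scan.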
+Usage examples.
+Add server server1.net:1025 to (or remove it from, if it is already there) the working set with index $idx,
+with the appropriate hash algorithm and key file, and cipher algorithm, mode and key file:
+$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
+Mount the filesystem with the given index $idx at the /mnt mountpoint.
+The client will connect to all servers added to the working set by the previous command:
+mount -t pohmel -o idx=$idx q /mnt
+Servers can also be added to or removed from the working set after mounting.
+Server installation.
+Creating a server which listens on address 0.0.0.0, port 1025.
+The working root directory is set to /mnt (note that the server chroots there, so you need appropriate
+permissions). The server will negotiate hash/cipher with the client if the client requests it, using
+the given key files.
+Number of working threads is set to 10.
+# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key
+ -A 6                    - listen on ipv6 address. Default: Disabled.
+ -r root                 - path to root directory. Default: /tmp.
+ -a addr                 - listen address. Default: 0.0.0.0.
+ -p port                 - listen port. Default: 1025.
+ -w workers              - number of workers per connected client. Default: 1.
+ -K file                 - hash key size. Default: none.
+ -k file                 - cipher key size. Default: none.
+ -h                      - this help.
+The number of worker threads specifies how many workers will be created for each client.
+Bulk single-client transfers are usually better handled with a smaller number (like 1-3).
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
new file mode 100644
index 000000000000..40ea6c295afb
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/network_protocol.txt
@@ -0,0 +1,227 @@
+POHMELFS network protocol.
+The basic structure used in network communication is the following command:
+struct netfs_cmd
+{
+        __u16                   cmd;    /* Command number */
+        __u16                   csize;  /* Attached crypto information size */
+        __u16                   cpad;   /* Attached padding size */
+        __u16                   ext;    /* External flags */
+        __u32                   size;   /* Size of the attached data */
+        __u32                   trans;  /* Transaction id */
+        __u64                   id;     /* Object ID to operate on. Used for feedback.*/
+        __u64                   start;  /* Start of the object. */
+        __u64                   iv;     /* IV sequence */
+        __u8                    data[0];
+};
+Commands can be embedded into a transaction command (which in turn has its own command),
+so one can extend the protocol as needed without breaking backward compatibility as long
+as old commands are supported. All string lengths include the trailing 0 byte.
+All commands are transferred over the network in big-endian. CPU endianness is used at the end peers.
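The big-endian rule can be illustrated with a small userspace serializer for the header fields. This is a sketch, not the kernel code: the struct and function names are hypothetical, and it assumes the header travels as its fields in declaration order with no padding (40 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the header fields of struct netfs_cmd (illustration only). */
struct cmd_header {
	uint16_t cmd, csize, cpad, ext;
	uint32_t size, trans;
	uint64_t id, start, iv;
};

#define CMD_HEADER_WIRE_SIZE 40	/* 4 * 2 + 2 * 4 + 3 * 8 bytes */

/* Store the low `bytes` bytes of v at p, most significant byte first. */
static void put_be(uint8_t *p, uint64_t v, size_t bytes)
{
	for (size_t i = 0; i < bytes; i++)
		p[i] = (uint8_t)(v >> (8 * (bytes - 1 - i)));
}

/* Serialize the header into buf in big-endian, in declaration order. */
static void cmd_header_to_wire(const struct cmd_header *h,
			       uint8_t buf[CMD_HEADER_WIRE_SIZE])
{
	put_be(buf + 0, h->cmd, 2);
	put_be(buf + 2, h->csize, 2);
	put_be(buf + 4, h->cpad, 2);
	put_be(buf + 6, h->ext, 2);
	put_be(buf + 8, h->size, 4);
	put_be(buf + 12, h->trans, 4);
	put_be(buf + 16, h->id, 8);
	put_be(buf + 24, h->start, 8);
	put_be(buf + 32, h->iv, 8);
}
```

Writing bytes explicitly keeps the result independent of the host CPU endianness, which is the point of fixing the wire order.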
+@cmd - command number, which specifies the command to be processed. The following
+        commands are currently used:
+        NETFS_READDIR   = 1,    /* Read directory for given inode number */
+        NETFS_READ_PAGE,        /* Read data page from the server */
+        NETFS_WRITE_PAGE,       /* Write data page to the server */
+        NETFS_CREATE,           /* Create directory entry */
+        NETFS_REMOVE,           /* Remove directory entry */
+        NETFS_LOOKUP,           /* Lookup single object */
+        NETFS_LINK,             /* Create a link */
+        NETFS_TRANS,            /* Transaction */
+        NETFS_OPEN,             /* Open intent */
+        NETFS_INODE_INFO,       /* Metadata cache coherency synchronization message */
+        NETFS_PAGE_CACHE,       /* Page cache invalidation message */
+        NETFS_READ_PAGES,       /* Read multiple contiguous pages in one go */
+        NETFS_RENAME,           /* Rename object */
+        NETFS_CAPABILITIES,     /* Capabilities of the client, for example supported crypto */
+        NETFS_LOCK,             /* Distributed lock message */
+        NETFS_XATTR_SET,        /* Set extended attribute */
+        NETFS_XATTR_GET,        /* Get extended attribute */
+@ext - external flags. Used by different commands to specify some extra arguments
+        like partial size of the embedded objects or creation flags.
+@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
+        but size of the requested data is incorporated here. It does not include size of the command
+        header (struct netfs_cmd) itself.
+@id - id of the object this command operates on. Each command can use it for its own purpose.
+@start - start of the object this command operates on. Each command can use it for its own purpose.
+@csize, @cpad - size and padding size of the (attached if needed) crypto information.
+Command specifications.
+@NETFS_READDIR
+This command is used to sync content of the remote dir to the client.
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to read.
+@start - zero.
+@NETFS_READ_PAGE
+This command is used to read data from remote server.
+Data size does not exceed local page cache size.
+@id - inode number.
+@start - first byte offset.
+@size - number of bytes to read plus length of the path to object.
+@ext - object path length.
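As an illustration of the field usage above, a read request for one page could be filled in as follows. This is a sketch under assumptions: the struct holds only the fields discussed here, the function name is hypothetical, the value 2 follows from the command list starting at 1, and the path length is assumed to include the trailing 0 byte and to fit in 16 bits.

```c
#include <stdint.h>
#include <string.h>

#define NETFS_READ_PAGE 2	/* second entry of the command list, which starts at 1 */

/* Only the header fields relevant to this command (illustration only). */
struct read_page_req {
	uint16_t cmd, ext;
	uint32_t size;
	uint64_t id, start;
};

static struct read_page_req make_read_page(uint64_t ino, uint64_t offset,
					   uint32_t bytes, const char *path)
{
	struct read_page_req r;
	uint16_t path_len = (uint16_t)(strlen(path) + 1);	/* include trailing 0 */

	r.cmd = NETFS_READ_PAGE;
	r.ext = path_len;		/* object path length */
	r.size = bytes + path_len;	/* bytes to read plus path length */
	r.id = ino;			/* inode number */
	r.start = offset;		/* first byte offset */
	return r;
}
```

Since @size carries both quantities, the receiver recovers the requested byte count as @size minus @ext.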
+@NETFS_CREATE
+Used to create object.
+It does not require that all directories above the object
+already exist; it will create them automatically. Each object has
+an associated @netfs_path_entry data structure, which contains the creation
+mode (permissions and type) and the length of the name as well as the name itself.
+@start - 0
+@size - size of all the data structures needed to create a path
+@id - local inode number
+@ext - 0
+@NETFS_REMOVE
+Used to remove object.
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number.
+@start - zero.
+@NETFS_LOOKUP
+Lookup information about object on server.
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to look the object up in.
+@start - local inode number of the object to look at.
+@NETFS_LINK
+Create a hard link or a symlink.
+Command is sent as "object_path|target_path".
+@size - size of the above string.
+@id - parent local inode number.
+@start - 1 for symlink, 0 for hardlink.
+@ext - size of the "object_path" above.
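The "object_path|target_path" payload and its two length fields might be built like this. The names and the fixed 256-byte buffer are hypothetical. Two points are assumptions, since the text above does not settle them: @size is taken to include the trailing 0 byte, and @ext is taken to count only the characters of object_path, without the separator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the attached data and the two length fields. */
struct link_payload {
	char data[256];
	uint32_t size;	/* length of "object_path|target_path" plus trailing 0 */
	uint16_t ext;	/* length of the object_path part (assumption: no separator) */
};

/* Returns 0 on success, -1 if the joined string does not fit. */
static int make_link_payload(struct link_payload *p, const char *object_path,
			     const char *target_path)
{
	int n = snprintf(p->data, sizeof(p->data), "%s|%s", object_path, target_path);

	if (n < 0 || (size_t)n >= sizeof(p->data))
		return -1;

	p->size = (uint32_t)n + 1;
	p->ext = (uint16_t)strlen(object_path);
	return 0;
}
```

The same layout applies to the "old_path|new_path" string used by NETFS_RENAME.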
+@NETFS_TRANS
+Transaction header.
+@size - incorporates all embedded command sizes including their header sizes.
+@start - transaction generation number - unique id used to find transaction.
+@ext - transaction flags. Unused at the moment.
+@id - 0.
+@NETFS_OPEN
+Open intent for given transaction.
+@id - local inode number.
+@start - 0.
+@size - path length to the object.
+@ext - open flags (O_RDWR and so on).
+@NETFS_INODE_INFO
+Metadata update command.
+It is sent to servers when attributes of the object are changed and received
+when data or metadata were updated. It operates with the following structure:
+struct netfs_inode_info
+{
+        unsigned int            mode;
+        unsigned int            nlink;
+        unsigned int            uid;
+        unsigned int            gid;
+        unsigned int            blocksize;
+        unsigned int            padding;
+        __u64                   ino;
+        __u64                   blocks;
+        __u64                   rdev;
+        __u64                   size;
+        __u64                   version;
+};
+It effectively mirrors stat(2) returned data.
+@ext - path length to the object.
+@size - the same plus size of the netfs_inode_info structure.
+@id - local inode number.
+@start - 0.
+@NETFS_PAGE_CACHE
+Command is only received by clients. It contains information about
+page to be marked as not up-to-date.
+@id - client's inode number.
+@start - last byte of the page to be invalidated. If it is not equal to
+        the current inode size, the inode will be truncated via vmtruncate().
+@size - 0
+@ext - 0
+@NETFS_READ_PAGES
+Used to read multiple contiguous pages in one go.
+@start - first byte of the contiguous region to read.
+@size - consists of two fields: the lower 8 bits represent the page cache shift
+        used by the client, the remaining 3 bytes hold the number of pages.
+@id - local inode number.
+@ext - path length to the object.
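The two-field layout of @size for NETFS_READ_PAGES can be sketched with a few bit operations. The helper names are hypothetical. The layout follows the description above: page cache shift in the low 8 bits, page count in the upper 24 bits.

```c
#include <stdint.h>

/* Pack the page count (upper 24 bits) and page cache shift (lower 8 bits). */
static uint32_t read_pages_pack(uint32_t nr_pages, uint8_t page_shift)
{
	return ((nr_pages & 0xffffffu) << 8) | page_shift;
}

/* Extract the page cache shift used by the client. */
static uint8_t read_pages_shift(uint32_t size)
{
	return (uint8_t)(size & 0xffu);
}

/* Extract the number of pages requested. */
static uint32_t read_pages_count(uint32_t size)
{
	return size >> 8;
}
```

For example, 16 pages with a 4096-byte page (shift 12) pack to 0x100c, and the requested byte count is the page count shifted left by the page shift.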
+@NETFS_RENAME
+Used to rename object.
+Attached data is formed into following string: "old_path|new_path".
+@id - local inode number.
+@start - parent inode number.
+@size - length of the above string.
+@ext - length of the old path part.
+@NETFS_CAPABILITIES
+Used to exchange crypto capabilities with server.
+If crypto capabilities are not supported by the server, the client will disable them
+or fail (if the 'crypto_fail_unsupported' mount option was specified).
+@id - superblock index. Used to specify crypto information for group of servers.
+@size - size of the attached capabilities structure.
+@start - 0.
+@size - 0.
+@scsize - 0.
+@NETFS_LOCK
+Used to send lock request/release messages. Although it sends byte range request
+and is capable of flushing pages based on that, it is not used, since all Linux
+filesystems lock the whole inode.
+@id - lock generation number.
+@start - start of the locked range.
+@size - size of the locked range.
+@ext - lock type: read/write. Not actually used. The 15th bit is used to determine
+        whether it is a lock request (1) or release (0).
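The request/release flag in @ext can be sketched as a single bit test. The macro and function names are hypothetical, and the sketch assumes bits are numbered from 0, so that bit 15 is the top bit of the 16-bit field.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOCK_EXT_REQUEST 0x8000u	/* hypothetical name: bit 15 set means request */

/* Combine a lock type (low 15 bits) with the request/release flag. */
static uint16_t lock_ext(uint16_t type, bool request)
{
	return (uint16_t)((type & 0x7fffu) | (request ? LOCK_EXT_REQUEST : 0u));
}

/* True for a lock request, false for a release. */
static bool lock_ext_is_request(uint16_t ext)
{
	return (ext & LOCK_EXT_REQUEST) != 0;
}
```

Keeping the flag in the top bit leaves the low bits free for the lock type, even though the type is not currently used.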
+@NETFS_XATTR_SET
+@NETFS_XATTR_GET
+Used to set/get extended attributes for given inode.
+@id - attribute generation number or xattr setting type
+@start - size of the attribute (request or attached)
+@size - name length, path len and data size for given attribute
+@ext - path length for given object