Documentation/filesystems/rpc-cache.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202

	This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.

CACHES
======
The caching replaces the old exports table and allows for
a wide variety of values to be caches.

There are a number of caches that are similar in structure though
quite possibly very different in content and use.  There is a corpus
of common code for managing these caches.

Examples of caches that are likely to be needed are:
  - mapping from IP address to client name
  - mapping from client name and filesystem to export options
  - mapping from UID to list of GIDs, to work around NFS's limitation
    of 16 gids.
  - mappings between local UID/GID and remote UID/GID for sites that
    do not have uniform uid assignment
  - mapping from network identify to public key for crypto authentication.

The common code handles such things as:
   - general cache lookup with correct locking
   - supporting 'NEGATIVE' as well as positive entries
   - allowing an EXPIRED time on cache items, and removing
     items after they expire, and are no longer in-use.
   - making requests to user-space to fill in cache entries
   - allowing user-space to directly set entries in the cache
   - delaying RPC requests that depend on as-yet incomplete
     cache entries, and replaying those requests when the cache entry
     is complete.
   - clean out old entries as they expire.

Creating a Cache
----------------

1/ A cache needs a datum to store.  This is in the form of a
   structure definition that must contain a
     struct cache_head
   as an element, usually the first.
   It will also contain a key and some content.
   Each cache element is reference counted and contains
   expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
   describes the cache.  This stores the hash table, some
   parameters for cache management, and some operations detailing how
   to work with particular cache items.
   The operations requires are:
   	struct cache_head *alloc(void)
		This simply allocates appropriate memory and returns
   		a pointer to the cache_detail embedded within the
		structure
	void cache_put(struct kref *)
		This is called when the last reference to an item is
		dropped.  The pointer passed is to the 'ref' field
		in the cache_head.  cache_put should release any
		references create by 'cache_init' and, if CACHE_VALID
		is set, any references created by cache_update.
		It should then release the memory allocated by
   		'alloc'.
        int match(struct cache_head *orig, struct cache_head *new)
		test if the keys in the two structures match.  Return
		1 if they do, 0 if they don't.
	void init(struct cache_head *orig, struct cache_head *new)
		Set the 'key' fields in 'new' from 'orig'.  This may
		include taking references to shared objects.
	void update(struct cache_head *orig, struct cache_head *new)
		Set the 'content' fileds in 'new' from 'orig'.
	int cache_show(struct seq_file *m, struct cache_detail *cd,
			struct cache_head *h)
		Optional.  Used to provide a /proc file that lists the
		contents of a cache.  This should show one item,
   		usually on just one line.
	int cache_request(struct cache_detail *cd, struct cache_head *h,
   		char **bpp, int *blen)
		Format a request to be send to user-space for an item
   		to be instantiated.  *bpp is a buffer of size *blen.
		bpp should be moved forward over the encoded message,
		and  *blen should be reduced to show how much free
		space remains.  Return 0 on success or <0 if not
		enough room or other problem.
	int cache_parse(struct cache_detail *cd, char *buf, int len)
		A message from user space has arrived to fill out a
		cache entry.  It is in 'buf' of length 'len'.
		cache_parse should parse this, find the item in the
		cache with sunrpc_cache_lookup, and update the item
		with sunrpc_cache_update.


3/ A cache needs to be registered using cache_register().  This
   includes it on a list of caches that will be regularly
   cleaned to discard old data.

Using a cache
-------------

To find a value in a cache, call sunrpc_cache_lookup passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry.  If no
entry is found, a new entry will be create, added to the cache, and
marked as not containing valid data.

The item returned is typically passed to cache_check which will check
if the data is valid, and may initiate an up-call to get fresh data.
cache_check will return -ENOENT in the entry is negative or if an up
call is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid;

cache_check can be passed a "struct cache_req *".  This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req).  This is
done when the found cache item is not uptodate, but the is reason to
believe that userspace might provide information soon.  When the cache
item does become valid, the deferred copy of the request will be
revisited (->revisit).  It is expected that this method will
reschedule the request for processing.

The value returned by sunrpc_cache_lookup can also be passed to
sunrpc_cache_update to set the content for the item.  A second item is
passed which should hold the content.  If the item found by _lookup
has valid data, then it is discarded and a new item is created.  This
saves any user of an item from worrying about content changing while
it is being inspected.  If the item found by _lookup does not contain
valid data, then the content is copied across and CACHE_VALID is set.

Populating a cache
------------------

Each cache has a name, and when the cache is registered, a directory
with that name is created in /proc/net/rpc

This directory contains a file called 'channel' which is a channel
for communicating between kernel and user for populating the cache.
This directory may later contain other files of interacting
with the cache.

The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
  - a key
  - an expiry time
  - a content.
with the intention that an item in the cache with the give key
should be create or updated to have the given content, and the
expiry time should be set on that item.

Reading from a channel is a bit more interesting.  When a cache
lookup fails, or when it succeeds but finds an entry that may soon
expire, a request is lodged for that cache item to be updated by
user-space.  These requests appear in the channel file.

Successive reads will return successive requests.
If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.

Thus a user-space helper is likely to:
  open the channel.
    select for readable
    read a request
    write a response
  loop.

If it dies and needs to be restarted, any requests that have not been
answered will still appear in the file and will be read by the new
instance of the helper.

Each cache should define a "cache_parse" method which takes a message
written from user-space and processes it.  It should return an error
(which propagates back to the write syscall) or 0.

Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.

Note: If a cache has no active readers on the channel, and has had not
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail.  This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.

request/response format
-----------------------

While each cache is free to use it's own format for requests
and responses over channel, the following is recommended as
appropriate and support routines are available to help:
Each request or response record should be printable ASCII
with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
much be quoted.  two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
   hex digits, and pairs of these digits provide the bytes in the
   field.
2/ otherwise a \ in the field must be followed by 3 octal digits
   which give the code for a byte.  Other characters are treated
   as them selves.  At the very least, space, newline, nul, and
   '\' must be quoted in this way.
/* -*- mode: c; c-basic-offset: 8; -*-
 * vim: noexpandtab sw=8 ts=8 sts=0:
 *
 * file.c
 *
 * File open, close, extend, truncate
 *
 * Copyright (C) 2002, 2004 Oracle.  All rights reserved.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public
 * License along with this program; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 021110-1307, USA.
 */

#include <linux/capability.h>
#include <linux/fs.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/uio.h>
#include <linux/sched.h>
#include <linux/pipe_fs_i.h>
#include <linux/mount.h>
#include <linux/writeback.h>

#define MLOG_MASK_PREFIX ML_INODE
#include <cluster/masklog.h>

#include "ocfs2.h"

#include "alloc.h"
#include "aops.h"
#include "dir.h"
#include "dlmglue.h"
#include "extent_map.h"
#include "file.h"
#include "sysfile.h"
#include "inode.h"
#include "ioctl.h"
#include "journal.h"
#include "mmap.h"
#include "suballoc.h"
#include "super.h"

#include "buffer_head_io.h"

static int ocfs2_sync_inode(struct inode *inode)
{
	filemap_fdatawrite(inode->i_mapping);
	return sync_mapping_buffers(inode->i_mapping);
}

static int ocfs2_file_open(struct inode *inode, struct file *file)
{
	int status;
	int mode = file->f_flags;
	struct ocfs2_inode_info *oi = OCFS2_I(inode);

	mlog_entry("(0x%p, 0x%p, '%.*s')\n", inode, file,
		   file->f_path.dentry->d_name.len, file->f_path.dentry->d_name.name);

	spin_lock(&oi->ip_lock);

	/* Check that the inode hasn't been wiped from disk by another
	 * node. If it hasn't then we're safe as long as we hold the
	 * spin lock until our increment of open count. */
	if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED) {
		spin_unlock(&oi->ip_lock);

		status = -ENOENT;
		goto leave;
	}

	if (mode & O_DIRECT)
		oi->ip_flags |= OCFS2_INODE_OPEN_DIRECT;

	oi->ip_open_count++;
	spin_unlock(&oi->ip_lock);
	status = 0;
leave:
	mlog_exit(status);
	return status;
}

static int ocfs2_file_release(struct inode *inode, struct file *file)
{
	struct ocfs2_inode_info *oi = OCFS2_I(inode);

	mlog_entry("(0x%p, 0x%p, '%.*s')\n", inode, file,
		       file->f_path.dentry->d_name.len,
		       file->f_path.dentry->d_name.name);

	spin_lock(&oi->ip_lock);
	if (!--oi->ip_open_count)
		oi->ip_flags &= ~OCFS2_INODE_OPEN_DIRECT;
	spin_unlock(&oi->ip_lock);

	mlog_exit(0);

	return 0;
}

static int ocfs2_sync_file(struct file *file,
			   struct dentry *dentry,
			   int datasync)
{
	int err = 0;
	journal_t *journal;
	struct inode *inode = dentry->d_inode;
	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

	mlog_entry("(0x%p, 0x%p, %d, '%.*s')\n", file, dentry, datasync,
		   dentry->d_name.len, dentry->d_name.name);

	err = ocfs2_sync_inode(dentry->d_inode);
	if (err)
		goto bail;

	journal = osb->journal->j_journal;
	err = journal_force_commit(journal);

bail:
	mlog_exit(err);

	return (err < 0) ? -EIO : 0;
}

int ocfs2_should_update_atime(struct inode *inode,
			      struct vfsmount *vfsmnt)
{
	struct timespec now;
	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
		return 0;

	if ((inode->i_flags & S_NOATIME) ||
	    ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode)))
		return 0;

	/*
	 * We can be called with no vfsmnt structure - NFSD will
	 * sometimes do this.
	 *
	 * Note that our action here is different than touch_atime() -
	 * if we can't tell whether this is a noatime mount, then we
	 * don't know whether to trust the value of s_atime_quantum.
	 */
	if (vfsmnt == NULL)
		return 0;

	if ((vfsmnt->mnt_flags & MNT_NOATIME) ||
	    ((vfsmnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
		return 0;

	if (vfsmnt->mnt_flags & MNT_RELATIME) {
		if ((timespec_compare(&inode->i_atime, &inode->i_mtime) <= 0) ||
		    (timespec_compare(&inode->i_atime, &inode->i_ctime) <= 0))
			return 1;

		return 0;
	}

	now = CURRENT_TIME;
	if ((now.tv_sec - inode->i_atime.tv_sec <= osb->s_atime_quantum))
		return 0;
	else
		return 1;
}

int ocfs2_update_inode_atime(struct inode *inode,
			     struct buffer_head *bh)
{
	int ret;
	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
	handle_t *handle;

	mlog_entry_void();

	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
	if (handle == NULL) {
		ret = -ENOMEM;
		mlog_errno(ret);
		goto out;
	}

	inode->i_atime = CURRENT_TIME;
	ret = ocfs2_mark_inode_dirty(handle, inode, bh);
	if (ret < 0)
		mlog_errno(ret);

	ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle);
out:
	mlog_exit(ret);
	return ret;
}

int ocfs2_set_inode_size(handle_t *handle,
			 struct inode *inode,
			 struct buffer_head *fe_bh,
			 u64 new_i_size)
{
	int status;

	mlog_entry_void();
	i_size_write(inode, new_i_size);
	inode->i_blocks = ocfs2_align_bytes_to_sectors(new_i_size);
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;

	status = ocfs2_mark_inode_dirty(handle, inode, fe_bh);
	if (status < 0) {
		mlog_errno(status);
		goto bail;
	}

bail:
	mlog_exit(status);
	return status;
}

static int ocfs2_simple_size_update(struct inode *inode,
				    struct buffer_head *di_bh,
				    u64 new_i_size)
{
	int ret;
	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
	handle_t *handle = NULL;

	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
	if (handle == NULL) {
		ret = -ENOMEM;
		mlog_errno(ret);
		goto out;
	}

	ret = ocfs2_set_inode_size(handle, inode, di_bh,
				   new_i_size);
	if (ret < 0)
		mlog_errno(ret);

	ocfs2_commit_trans(osb, handle);
out:
	return ret;
}

static int ocfs2_orphan_for_truncate(struct ocfs2_super *osb,
				     struct inode *inode,
				     struct buffer_head *fe_bh,
				     u64 new_i_size)
{
	int status;
	handle_t *handle;
	struct ocfs2_dinode *di;

	mlog_entry_void();

	/* TODO: This needs to actually orphan the inode in this
	 * transaction. */

	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
	if (IS_ERR(handle)) {
		status = PTR_ERR(handle);
		mlog_errno(status);
		goto out;
	}

	status = ocfs2_journal_access(handle, inode, fe_bh,
				      OCFS2_JOURNAL_ACCESS_WRITE);
	if (status < 0) {
		mlog_errno(status);
		goto out_commit;
	}

	/*
	 * Do this before setting i_size.
	 */
	status = ocfs2_zero_tail_for_truncate(inode, handle, new_i_size);
	if (status) {
		mlog_errno(status);
		goto out_commit;
	}

	i_size_write(inode, new_i_size);
	inode->i_blocks = ocfs2_align_bytes_to_sectors(new_i_size);
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;

	di = (struct ocfs2_dinode *) fe_bh->b_data;
	di->i_size = cpu_to_le64(new_i_size);
	di->i_ctime = di->i_mtime = cpu_to_le64(inode->i_ctime.tv_sec);
	di->i_ctime_nsec = di->i_mtime_nsec = cpu_to_le32(inode->i_ctime.tv_nsec);

	status = ocfs2_journal_dirty(handle, fe_bh);
	if (status < 0)
		mlog_errno(status);

out_commit:
	ocfs2_commit_trans(osb, handle);
out:

	mlog_exit(status);
	return status;
}

static int ocfs2_truncate_file(struct inode *inode,
			       struct buffer_head *di_bh,
			       u64 new_i_size)
{
	int status = 0;
	struct ocfs2_dinode *fe = NULL;
	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
	struct ocfs2_truncate_context *tc = NULL;

	mlog_entry("(inode = %llu, new_i_size = %llu\n",
		   (unsigned long long)OCFS2_I(inode)->ip_blkno,
		   (unsigned long long)new_i_size);

	truncate_inode_pages(inode->i_mapping, new_i_size);

	fe = (struct ocfs2_dinode *) di_bh->b_data;
	if (!OCFS2_IS_VALID_DINODE(fe)) {
		OCFS2_RO_ON_INVALID_DINODE(inode->i_sb, fe);
		status = -EIO;
		goto bail;
	}

	mlog_bug_on_msg(le64_to_cpu(fe->i_size) != i_size_read(inode),
			"Inode %llu, inode i_size = %lld != di "
			"i_size = %llu, i_flags = 0x%x\n",
			(unsigned long long)OCFS2_I(inode)->ip_blkno,
			i_size_read(inode),
			(unsigned long long)le64_to_cpu(fe->i_size),
			le32_to_cpu(fe->i_flags));

	if (new_i_size > le64_to_cpu(fe->i_size)) {
		mlog(0, "asked to truncate file with size (%llu) to size (%llu)!\n",
		     (unsigned long long)le64_to_cpu(fe->i_size),
		     (unsigned long long)new_i_size);
		status = -EINVAL;
		mlog_errno(status);
		goto bail;
	}

	mlog(0, "inode %llu, i_size = %llu, new_i_size = %llu\n",
	     (unsigned long long)le64_to_cpu(fe->i_blkno),
	     (unsigned long long)le64_to_cpu(fe->i_size),
	     (unsigned long long)new_i_size);

	/* lets handle the simple truncate cases before doing any more
	 * cluster locking. */
	if (new_i_size == le64_to_cpu(fe->i_size))
		goto bail;

	/* This forces other nodes to sync and drop their pages. Do
	 * this even if we have a truncate without allocation change -
	 * ocfs2 cluster sizes can be much greater than page size, so
	 * we have to truncate them anyway.  */
	status = ocfs2_data_lock(inode, 1);
	if (status < 0) {
		mlog_errno(status);
		goto bail;
	}

	/* alright, we're going to need to do a full blown alloc size
	 * change. Orphan the inode so that recovery can complete the
	 * truncate if necessary. This does the task of marking
	 * i_size. */
	status = ocfs2_orphan_for_truncate(osb, inode, di_bh, new_i_size);
	if (status < 0) {
		mlog_errno(status);
		goto bail_unlock_data;
	}

	status = ocfs2_prepare_truncate(osb, inode, di_bh, &tc);
	if (status < 0) {
		mlog_errno(status);
		goto bail_unlock_data;
	}

	status = ocfs2_commit_truncate(osb, inode, di_bh, tc);
	if (status < 0) {
		mlog_errno(status);
		goto bail_unlock_data;
	}

	/* TODO: orphan dir cleanup here. */
bail_unlock_data:
	ocfs2_data_unlock(inode, 1);

bail:

	mlog_exit(status);
	return status;
}

/*
 * extend allocation only here.
 * we'll update all the disk stuff, and oip->alloc_size
 *
 * expect stuff to be locked, a transaction started and enough data /
 * metadata reservations in the contexts.
 *
 * Will return -EAGAIN, and a reason if a restart is needed.
 * If passed in, *reason will always be set, even in error.
 */
int ocfs2_do_extend_allocation(struct ocfs2_super *osb,
			       struct inode *inode,
			       u32 *logical_offset,
			       u32 clusters_to_add,
			       struct buffer_head *fe_bh,
			       handle_t *handle,
			       struct ocfs2_alloc_context *data_ac,
			       struct ocfs2_alloc_context *meta_ac,
			       enum ocfs2_alloc_restarted *reason_ret)
{
	int status = 0;
	int free_extents;
	struct ocfs2_dinode *fe = (struct ocfs2_dinode *) fe_bh->b_data;
	enum ocfs2_alloc_restarted reason = RESTART_NONE;
	u32 bit_off, num_bits;
	u64 block;

	BUG_ON(!clusters_to_add);

	free_extents = ocfs2_num_free_extents(osb, inode, fe);
	if (free_extents < 0) {
		status = free_extents;
		mlog_errno(status);
		goto leave;
	}

	/* there are two cases which could cause us to EAGAIN in the
	 * we-need-more-metadata case:
	 * 1) we haven't reserved *any*
	 * 2) we are so fragmented, we've needed to add metadata too
	 *    many times. */
	if (!free_extents && !meta_ac) {
		mlog(0, "we haven't reserved any metadata!\n");
		status = -EAGAIN;
		reason = RESTART_META;
		goto leave;
	} else if ((!free_extents)
		   && (ocfs2_alloc_context_bits_left(meta_ac)
		       < ocfs2_extend_meta_needed(fe))) {
		mlog(0, "filesystem is really fragmented...\n");
		status = -EAGAIN;
		reason = RESTART_META;
		goto leave;
	}

	status = ocfs2_claim_clusters(osb, handle, data_ac, 1,