| Commit message (Collapse) | Author | Age |
... | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Implement ceph_osdc_notify() for sending notifies.
Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
These are going to be used by request_reinit() code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.
If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
These are going to be used by homeless OSD sessions code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message. This sets up the scene for the updated OSD
client.
Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.
We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
They are called only once, from ceph_osdc_stop() and
handle_osds_timeout() respectively.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Either unused or useless:
osdmap->mkfs_epoch
osd->o_marked_for_keepalive
monc->num_generic_requests
osdc->map_waiters
osdc->last_requested_map
osdc->timeout_tid
osd_req_op_cls_response_data()
osdmap_apply_incremental() @msgr arg
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | |_|/
| |/| |
| | | |
| | | | |
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"
* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Clean up.
After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In a cluster failover scenario, it is desirable for the client to
attempt to reconnect quickly, as an alternate NFS server is already
waiting to take over for the down server. The client can't see that
a server IP address has moved to a new server until the existing
connection is gone.
For fabrics and devices where it is meaningful, set a definite upper
bound on the amount of time before it is determined that a
connection is no longer valid. This allows the RPC client to detect
connection loss in a timely matter, then perform a fresh resolution
of the server GUID in case it has changed (cluster failover).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:
+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS
ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.
The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.
Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.
This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Move the the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.
This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
rpcrdma_create_chunks() has been replaced, and can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Update documenting comments to reflect code changes over the past
year.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.
This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7c1 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7c1 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.
Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.
Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.
Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7c1 is applied to your
NFS server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.
Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.
The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.
The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.
As more segments are added to the chunk lists, the header increases
in size. Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.
To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.
For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.
The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.
For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|