aboutsummaryrefslogtreecommitdiffstats
path: root/net/ceph
Commit message (Collapse)AuthorAge
...
* | crush: add set_choose_local_[fallback_]tries stepsIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | This allows all of the tunables to be overridden by a specific rule. Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292, 0497db49e5973b50df26251ed0e3f4ac7578e66e. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: generalize descend_onceIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The legacy behavior is to make the normal number of tries for the recursive chooseleaf call. The descend_once tunable changed this to making a single try and bail if we get a reject (note that it is impossible to collide in the recursive case). The new set_chooseleaf_tries lets you select the number of recursive chooseleaf attempts for indep mode, or default to 1. Use the same behavior for firstn, except default to total_tries when the legacy tunables are set (for compatibility). This makes the rule step override the (new) default of 1 recursive attempt, keeping behavior consistent with indep mode. Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: CHOOSE_LEAF -> CHOOSELEAF throughoutIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | This aligns the internal identifier names with the user-visible names in the decompiled crush map language. Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: add SET_CHOOSE_TRIES rule stepIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: apply chooseleaf_tries to firstn mode tooIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well. Note that we have slightly different behavior here than with indep: If the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh. In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep. Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: new SET_CHOOSE_LEAF_TRIES commandIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools). (We should do the same for the other tunables, by the way!) Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: pass parent r value for indep callIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values). Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: clarify numrep vs endposIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: strip firstn conditionals out of crush_choose, renameIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code. Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec9 (or earlier)... 2007. Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: add note about r in recursive chooseIlya Dryomov2013-12-31
| | | | | | | | | | | | | | Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: use breadth-first search for indep modeIlya Dryomov2013-12-31
| | | | | | | | | | | | | | Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: return CRUSH_ITEM_UNDEF for failed placements with indepIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | For firstn mode, if we fail to make a valid placement choice, we just continue and return a short result to the caller. For indep mode, however, we need to make the position stable, and return an undefined value on failed placements to avoid shifting later results to the left. Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: eliminate CRUSH_MAX_SET result size limitationIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation. Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: fix some commentsIlya Dryomov2013-12-31
| | | | | | | | | | | | | | Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: reduce scope of some local variablesIlya Dryomov2013-12-31
| | | | | | | | | | | | | | Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: factor out (trivial) crush_destroy_rule()Ilya Dryomov2013-12-31
| | | | | | | | | | | | | | Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | crush: pass weight vector size to map functionIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass the size of the weight vector into crush_do_rule() to ensure that we don't access values past the end. This can happen if the caller misbehaves and passes a weight vector that is smaller than max_devices. Currently the monitor tries to prevent that from happening, but this will gracefully tolerate previous bad osdmaps that got into this state. It's also a bit more defensive. Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | libceph: update ceph_features.hIlya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | | | | | This updates ceph_features.h so that it has all feature bits defined in ceph.git. In the interim since the last update, ceph.git crossed the "32 feature bits" point, and, the addition of the 33rd bit wasn't handled correctly. The work-around is squashed into this commit and reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | libceph: all features fields must be u64Ilya Dryomov2013-12-31
| | | | | | | | | | | | | | | | | | In preparation for ceph_features.h update, change all features fields from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this point.) Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | libceph: resend all writes after the osdmap loses the full flagJosh Durgin2013-12-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the current full handling, there is a race between osds and clients getting the first map marked full. If the osd wins, it will return -ENOSPC to any writes, but the client may already have writes in flight. This results in the client getting the error and propagating it up the stack. For rbd, the block layer turns this into EIO, which can cause corruption in filesystems above it. To avoid this race, osds are being changed to drop writes that came from clients with an osdmap older than the last osdmap marked full. In order for this to work, clients must resend all writes after they encounter a full -> not full transition in the osdmap. osds will wait for an updated map instead of processing a request from a client with a newer map, so resent writes will not be dropped by the osd unless there is another not full -> full transition. This approach requires both osds and clients to be fixed to avoid the race. Old clients talking to osds with this fix may hang instead of returning EIO and potentially corrupting an fs. New clients talking to old osds have the same behavior as before if they encounter this race. Fixes: http://tracker.ceph.com/issues/6938 Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* | libceph: block I/O when PAUSE or FULL osd map flags are setJosh Durgin2013-12-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PAUSEWR and PAUSERD flags are meant to stop the cluster from processing writes and reads, respectively. The FULL flag is set when the cluster determines that it is out of space, and will no longer process writes. PAUSEWR and PAUSERD are purely client-side settings already implemented in userspace clients. The osd does nothing special with these flags. When the FULL flag is set, however, the osd responds to all writes with -ENOSPC. For cephfs, this makes sense, but for rbd the block layer translates this into EIO. If a cluster goes from full to non-full quickly, a filesystem on top of rbd will not behave well, since some writes succeed while others get EIO. Fix this by blocking any writes when the FULL flag is set in the osd client. This is the same strategy used by userspace, so apply it by default. A follow-on patch makes this configurable. __map_request() is called to re-target osd requests in case the available osds changed. Add a paused field to a ceph_osd_request, and set it whenever an appropriate osd map flag is set. Avoid queueing paused requests in __map_request(), but force them to be resent if they become unpaused. Also subscribe to the next osd map from the monitor if any of these flags are set, so paused requests can be unblocked as soon as possible. Fixes: http://tracker.ceph.com/issues/6079 Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* | ceph: Add necessary clean up if invalid reply received in handle_reply()Li Wang2013-12-13
|/ | | | | | | | Wake up possible waiters, invoke the call back if any, unregister the request Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
* net: 8021q/bluetooth/bridge/can/ceph: Remove extern from function prototypesJoe Perches2013-10-19
| | | | | | | | | | | | | There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-19
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "These fix several bugs with RBD from 3.11 that didn't get tested in time for the merge window: some error handling, a use-after-free, and a sequencing issue when unmapping and image races with a notify operation. There is also a patch fixing a problem with the new ceph + fscache code that just went in" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: fscache: check consistency does not decrement refcount rbd: fix error handling from rbd_snap_name() rbd: ignore unmapped snapshots that no longer exist rbd: fix use-after free of rbd_dev->disk rbd: make rbd_obj_notify_ack() synchronous rbd: complete notifies before cleaning up osd_client and rbd_dev libceph: add function to ensure notifies are complete
| * libceph: add function to ensure notifies are completeJosh Durgin2013-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without a way to flush the osd client's notify workqueue, a watch event that is unregistered could continue receiving callbacks indefinitely. Unregistering the event simply means no new notifies are added to the queue, but there may still be events in the queue that will call the watch callback for the event. If the queue is flushed after the event is unregistered, the caller can be sure no more watch callbacks will occur for the canceled watch. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-09
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "This includes both the first pile of Ceph patches (which I sent to torvalds@vger, sigh) and a few new patches that add support for fscache for Ceph. That includes a few fscache core fixes that David Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski for putting this feature together) This first batch of patches (included here) had (has) several important RBD bug fixes, hole punch support, several different cleanups in the page cache interactions, improvements in the truncate code (new truncate mutex to avoid shenanigans with i_mutex), and a series of fixes in the synchronous striping read/write code. On top of that is a random collection of small fixes all across the tree (error code checks and error path cleanup, obsolete wq flags, etc)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits) ceph: use d_invalidate() to invalidate aliases ceph: remove ceph_lookup_inode() ceph: trivial buildbot warnings fix ceph: Do not do invalidate if the filesystem is mounted nofsc ceph: page still marked private_2 ceph: ceph_readpage_to_fscache didn't check if marked ceph: clean PgPrivate2 on returning from readpages ceph: use fscache as a local presisent cache fscache: Netfs function for cleanup post readpages FS-Cache: Fix heading in documentation CacheFiles: Implement interface to check cache consistency FS-Cache: Add interface to check consistency of a cached object rbd: fix null dereference in dout rbd: fix buffer size for writes to images with snapshots libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc rbd: fix I/O error propagation for reads ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem ceph: allow sync_read/write return partial successed size of read/write. ceph: fix bugs about handling short-read for sync read mode. ceph: remove useless variable revoked_rdcache ...
| * libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calcSage Weil2013-09-04
| | | | | | | | | | | | | | | | | | | | Fix a typo that used the wrong bitmask for the pg.seed calculation. This is normally unnoticed because in most cases pg_num == pgp_num. It is, however, a bug that is easily corrected. CC: stable@vger.kernel.org Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linary.org>
| * libceph: create_singlethread_workqueue() doesn't return ERR_PTRsDan Carpenter2013-08-27
| | | | | | | | | | | | | | | | | | | | | | create_singlethread_workqueue() returns NULL on error, and it doesn't return ERR_PTRs. I tweaked the error handling a little to be consistent with earlier in the function. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * libceph: potential NULL dereference in ceph_osdc_handle_map()Dan Carpenter2013-08-27
| | | | | | | | | | | | | | | | There are two places where we read "nr_maps" if both of them are set to zero then we would hit a NULL dereference here. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * libceph: fix error handling in handle_reply()Dan Carpenter2013-08-27
| | | | | | | | | | | | | | | | | | | | We've tried to fix the error paths in this function before, but there is still a hidden goto in the ceph_decode_need() macro which goes to the wrong place. We need to release the "req" and unlock a mutex before returning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * ceph: punch hole supportLi Wang2013-08-15
| | | | | | | | | | | | | | This patch implements fallocate and punch hole support for Ceph kernel client. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
| * ceph: WQ_NON_REENTRANT is meaningless and going awayTejun Heo2013-08-09
| | | | | | | | | | | | | | | | | | | | | | dbf2576e37 ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sage Weil <sage@inktank.com> Cc: ceph-devel@vger.kernel.org
| * libceph: unregister request in __map_request failed and nofail == falsemajianpeng2013-08-09
| | | | | | | | | | | | | | | | | | | | For nofail == false request, if __map_request failed, the caller does cleanup work, like releasing the relative pages. It doesn't make any sense to retry this request. CC: stable@vger.kernel.org Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | net: add sk_stream_is_writeable() helperEric Dumazet2013-07-24
|/ | | | | | | | | | | | | | | Several call sites use the hardcoded following condition : sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) Lets use a helper because TCP_NOTSENT_LOWAT support will change this condition for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* libceph: call r_unsafe_callback when unsafe reply is receivedYan, Zheng2013-07-03
| | | | | | | | | | | | | | | | | | We can't use !req->r_sent to check if OSD request is sent for the first time, this is because __cancel_request() zeros req->r_sent when OSD map changes. Rather than adding a new variable to struct ceph_osd_request to indicate if it's sent for the first time, We can call the unsafe callback only when unsafe OSD reply is received. If OSD's first reply is safe, just skip calling the unsafe callback. The purpose of unsafe callback is adding unsafe request to a list, so that fsync(2) can wait for the safe reply. fsync(2) doesn't need to wait for a write(2) that hasn't returned yet. So it's OK to add request to the unsafe list when the first OSD reply is received. (ceph_sync_write() returns after receiving the first OSD reply) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: Fix NULL pointer dereference in auth client codeTyler Hicks2013-07-03
| | | | | | | | | | | | | | | | A malicious monitor can craft an auth reply message that could cause a NULL function pointer dereference in the client's kernel. To prevent this, the auth_none protocol handler needs an empty ceph_auth_client_ops->build_request() function. CVE-2013-1059 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Chanam Park <chanam.park@hkpco.kr> Reviewed-by: Seth Arnold <seth.arnold@canonical.com> Reviewed-by: Sage Weil <sage@inktank.com> Cc: stable@vger.kernel.org
* libceph: fix truncate size calculationYan, Zheng2013-07-03
| | | | | | | check the "not truncated yet" case Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: fix safe completionYan, Zheng2013-07-03
| | | | | | | | handle_reply() calls complete_request() only if the first OSD reply has ONDISK flag. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: print more info for short message headerAlex Elder2013-07-03
| | | | | | | | | | | | | If an osd client response message arrives that has a front section that's too big for the buffer set aside to receive it, a warning gets reported and a new buffer is allocated. The warning says nothing about which connection had the problem. Add the peer type and number to what gets reported, to be a bit more informative. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: add lingering request reference when registeredAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an osd request is set to linger, the osd client holds onto the request so it can be re-submitted following certain osd map changes. The osd client holds a reference to the request until it is unregistered. This is used by rbd for watch requests. Currently, the reference is taken when the request is marked with the linger flag. This means that if an error occurs after that time but before the the request completes successfully, that reference is leaked. There's really no reason to take the reference until the request is registered in the the osd client's list of lingering requests, and that only happens when the lingering (watch) request completes successfully. So take that reference only when it gets registered following succesful completion, and drop it (as before) when the request gets unregistered. This avoids the reference problem on error in rbd. Rearrange ceph_osdc_unregister_linger_request() to avoid using the request pointer after it may have been freed. And hold an extra reference in kick_requests() while handling a linger request that has not yet been registered, to ensure it doesn't go away. This resolves: http://tracker.ceph.com/issues/3859 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: must hold mutex for reset_changed_osds()Alex Elder2013-05-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | An osd client has a red-black tree describing its osds, and occasionally we would get crashes due to one of these trees tree becoming corrupt somehow. The problem turned out to be that reset_changed_osds() was being called without protection of the osd client request mutex. That function would call __reset_osd() for any osd that had changed, and __reset_osd() would call __remove_osd() for any osd with no outstanding requests, and finally __remove_osd() would remove the corresponding entry from the red-black tree. Thus, the tree was getting modified without having any lock protection, and was vulnerable to problems due to concurrent updates. This appears to be the only osd tree updating path that has this problem. It can be fairly easily fixed by moving the call up a few lines, to just before the request mutex gets dropped in kick_requests(). This resolves: http://tracker.ceph.com/issues/5043 Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: init sent and completed when startingAlex Elder2013-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rbd code has a need to be able to restart an osd request that has already been started and completed once before. This currently wouldn't work right because the osd client code assumes an osd request will be started exactly once Certain fields in a request are never cleared and this leads to trouble if you try to reuse it. Specifically, the r_sent, r_got_reply, and r_completed fields are never cleared. The r_sent field records the osd incarnation at the time the request was sent to that osd. If that's non-zero, the message won't get re-mapped to a target osd properly, and won't be put on the unsafe requests list the first time it's sent as it should. The r_got_reply field is used in handle_reply() to ensure the reply to a request is processed only once. And the r_completed field is used for lingering requests to avoid calling the callback function every time the osd client re-sends the request on behalf of its initiator. Each osd request passes through ceph_osdc_start_request() when responsibility for the request is handed over to the osd client for completion. We can safely zero these three fields there each time a request gets started. One last related change--clear the r_linger flag when a request is no longer registered as a linger request. This resolves: http://tracker.ceph.com/issues/5026 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: use slab cache for osd client requestsAlex Elder2013-05-02
| | | | | | | | | | | Create a slab cache to manage allocation of ceph_osdc_request structures. This resolves: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: allocate ceph message data with a slab allocatorAlex Elder2013-05-02
| | | | | | | | | | Create a slab cache to manage ceph_msg_data structure allocation. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: allocate ceph messages with a slab allocatorAlex Elder2013-05-02
| | | | | | | | | | Create a slab cache to manage ceph_msg structure allocation. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: create source file "net/ceph/snapshot.c"Alex Elder2013-05-02
| | | | | | | | | | | | | This creates a new source file "net/ceph/snapshot.c" to contain utility routines related to ceph snapshot contexts. The main motivation was to define ceph_create_snap_context() as a common way to create these structures, but I've moved the definitions of ceph_get_snap_context() and ceph_put_snap_context() there too. (The benefit of inlining those is very small, and I'd rather keep this collection of functions together.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: fix byte order mismatchAlex Elder2013-05-02
| | | | | | | | | | | | | | | | | | | | | | A WATCH op includes an object version. The version that's supplied is incorrectly byte-swapped osd_req_op_watch_init() where it's first assigned (it's been this way since that code was first added). The result is that the version sent to the osd is wrong, because that value gets byte-swapped again in osd_req_encode_op(). This is the source of a sparse warning related to improper byte order in the assignment. The approach of using the version to avoid a race is deprecated (see http://tracker.ceph.com/issues/3871), and the watch parameter is no longer even examined by the osd. So fix the assignment in osd_req_op_watch_init() so it no longer does the byte swap. This resolves: http://tracker.ceph.com/issues/3847 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: support pages for class request dataAlex Elder2013-05-02
| | | | | | | | Add the ability to provide an array of pages as outbound request data for object class method calls. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: fix two messenger bugsAlex Elder2013-05-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes four small changes in the ceph messenger. While getting copyup functionality working I found two bugs in the messenger. Existing paths through the code did not trigger these problems, but they're fixed here: - In ceph_msg_data_pagelist_cursor_init(), the cursor's last_piece field was being checked against the length supplied. This was OK until this commit: ccba6d98 libceph: implement multiple data items in a message That commit changed the cursor init routines to allow lengths to be supplied that exceeded the size of the current data item. Because of this, we have to use the assigned cursor resid field rather than the provided length in determining whether the cursor points to the last piece of a data item. - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously catching attempts to add page data to a message if the message already had data assigned to it. That was OK until that same commit, at which point it was fine for messages to have multiple data items. It slipped through because that BUG_ON() call was present twice in that function. (You can never be too careful.) In addition two other minor things are changed: - In ceph_msg_data_cursor_init(), the local variable "data" was getting assigned twice. - In ceph_msg_data_advance(), it was assumed that the type-specific advance routine would set new_piece to true after it advanced past the last piece. That may have been fine, but since we check for that case we might as well set it explicitly in ceph_msg_data_advance(). This resolves: http://tracker.ceph.com/issues/4762 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* libceph: support raw data requestsAlex Elder2013-05-02
| | | | | | | | | | | | | | | | | Allow osd request ops that aren't otherwise structured (not class, extent, or watch ops) to specify "raw" data to be used to hold incoming data for the op. Make use of this capability for the osd STAT op. Prefix the name of the private function osd_req_op_init() with "_", and expose a new function by that (earlier) name whose purpose is to initialize osd ops with (only) implied data. For now we'll just support the use of a page array for an osd op with incoming raw data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>