libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule

    step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses a chooseleaf rule the remapped replica ends up under the <node_type> bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves but one fail under a particular <node_type> bucket, that remaining leaf holds all the data from its failed peers.

This behavior also limits the number of peers that can participate in the re-replication of the data held by the failed leaf, which increases the time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent from <node_type> down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, so only the inner descent is retried.

In order to disperse re-replicated data as widely as possible across a storage cluster after a failure, we want to retry the outer descent. So, fix up crush_choose() to allow the inner descent to return immediately on choosing a failed leaf. Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare.

This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Sage Weil <sage@inktank.com>
author: Jim Schutt <jaschut@sandia.gov> 2012-11-30 11:15:25 -0500
committer: Alex Elder <elder@inktank.com> 2013-01-17 13:42:39 -0500
commit: 1604f488ac2dcce33c8218e75a000e8c5fb57e61 (patch)
tree: 084b399c1c9be245e62543a024d727241f7a9ad4 /net/ceph/osdmap.c
parent: 390306c38dd43908f7f7730229999790a773d1d5 (diff)
1 files changed, 6 insertions, 0 deletions
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index de73214b5d26..ca05871635bc 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -170,6 +170,7 @@ static struct crush_map *crush_decode(void *pbyval, void *end)
        c->choose_local_tries = 2;
        c->choose_local_fallback_tries = 5;
        c->choose_total_tries = 19;
+        c->chooseleaf_descend_once = 0;
        ceph_decode_need(p, end, 4*sizeof(u32), bad);
        magic = ceph_decode_32(p);
@@ -336,6 +337,11 @@ static struct crush_map *crush_decode(void *pbyval, void *end)
        dout("crush decode tunable choose_total_tries = %d",
             c->choose_total_tries);
+        ceph_decode_need(p, end, sizeof(u32), done);
+        c->chooseleaf_descend_once = ceph_decode_32(p);
+        dout("crush decode tunable chooseleaf_descend_once = %d",
+             c->chooseleaf_descend_once);
 done:
        dout("crush_decode success\n");
        return c;