drbd: panic on delayed completion of aborted requests

"Aborting" requests, or force-detaching the disk, is intended for completely blocked/hung local backing devices which no longer complete requests at all, not even with error completions. In this situation, usually a hard-reset and failover is the only way out.

By "aborting", basically faking a local error-completion, we allow for a more graceful switchover by cleanly migrating services. Still, the affected node has to be rebooted "soon".

By completing these requests, we allow the upper layers to re-use the associated data pages.

If the local backing device later "recovers" and now DMAs some data from disk into the original request pages, in the best case it will just put random data into unused pages; but typically it will corrupt completely unrelated data that has meanwhile been placed there, causing all sorts of damage.

This means that a delayed successful completion, especially for READ requests, is a reason to panic().

We assume that a delayed *error* completion is OK, though we will still complain noisily about it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
author: Lars Ellenberg <lars.ellenberg@linbit.com> 2012-09-03 09:48:21 -0400
committer: Jens Axboe <axboe@kernel.dk> 2012-10-30 03:39:17 -0400
commit: 7fb907c15fb8d0e10e72c8566a13f6defab3f484
tree: 60cf652915f97cb84e842c3e173f2729e0baac72
parent: dbd0820c6f7b7db9a97d63ea379fc174a63ddbca
1 files changed, 36 insertions, 0 deletions
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 1352455dd7dd..66dcb2d7eada 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -227,6 +227,42 @@ void drbd_endio_pri(struct bio *bio, int error)
                error = -EIO;
        }

+        /* If this request was aborted locally before,
+         * but now was completed "successfully",
+         * chances are that this caused arbitrary data corruption.
+         *
+         * "aborting" requests, or force-detaching the disk, is intended for
+         * completely blocked/hung local backing devices which do no longer
+         * complete requests at all, not even do error completions.  In this
+         * situation, usually a hard-reset and failover is the only way out.
+         *
+         * By "aborting", basically faking a local error-completion,
+         * we allow for a more graceful swichover by cleanly migrating services.
+         * Still the affected node has to be rebooted "soon".
+         *
+         * By completing these requests, we allow the upper layers to re-use
+         * the associated data pages.
+         *
+         * If later the local backing device "recovers", and now DMAs some data
+         * from disk into the original request pages, in the best case it will
+         * just put random data into unused pages; but typically it will corrupt
+         * meanwhile completely unrelated data, causing all sorts of damage.
+         *
+         * Which means delayed successful completion,
+         * especially for READ requests,
+         * is a reason to panic().
+         *
+         * We assume that a delayed *error* completion is OK,
+         * though we still will complain noisily about it.
+         */
+        if (unlikely(req->rq_state & RQ_LOCAL_ABORTED)) {
+                if (__ratelimit(&drbd_ratelimit_state))
+                        dev_emerg(DEV, "delayed completion of aborted local request; disk-timeout may be too aggressive\n");
+
+                if (!error)
+                        panic("possible random memory corruption caused by delayed completion of aborted local request\n");
+        }
+
        /* to avoid recursion in __req_mod */
        if (unlikely(error)) {
                what = (bio_data_dir(bio) == WRITE)
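
The decision the added hunk makes can be restated as a small user-space sketch. This is only an illustration, not the driver code: `classify_completion`, the `completion_action` enum and the `SKETCH_RQ_LOCAL_ABORTED` bit value are hypothetical names invented here, and the sketch assumes the kernel convention that `error` is 0 on success and a negative errno on failure.

```c
#include <assert.h>

/* Hypothetical bit value, for illustration only; the real
 * RQ_LOCAL_ABORTED flag is defined in the DRBD sources. */
#define SKETCH_RQ_LOCAL_ABORTED (1u << 0)

/* Hypothetical names for the three outcomes described in the commit. */
enum completion_action {
	COMPLETION_NORMAL,   /* request was never aborted: usual handling */
	COMPLETION_COMPLAIN, /* aborted, late *error* completion: log only */
	COMPLETION_PANIC     /* aborted, late *successful* completion */
};

/* Classify a local completion.  In the patch, the rate-limited
 * dev_emerg() message is emitted in both aborted cases; panic()
 * follows only when the late completion reports success. */
static enum completion_action
classify_completion(unsigned int rq_state, int error)
{
	assert(error <= 0); /* assumption: 0 or negative errno */

	if (!(rq_state & SKETCH_RQ_LOCAL_ABORTED))
		return COMPLETION_NORMAL;

	return error ? COMPLETION_COMPLAIN : COMPLETION_PANIC;
}
```

For example, `classify_completion(SKETCH_RQ_LOCAL_ABORTED, 0)` yields `COMPLETION_PANIC`, while the same flag with a negative `error` yields `COMPLETION_COMPLAIN`. A late success is the dangerous case because it implies the device may have written into pages that were already handed back to the upper layers.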