xfs: force background CIL push under sustained load

I have been seeing occasional pauses in transaction throughput up to 30s long under heavy parallel workloads. The only notable thing was that the xfsaild was trying to be active during the pauses, but making no progress. It was running exactly 20 times a second (on the 50ms no-progress backoff), and the number of pushbuf events was constant across this time as well. IOWs, the xfsaild appeared to be stuck on buffers that it could not push out.

Further investigation indicated that it was trying to push out inode buffers that were pinned and/or locked. The xfsbufd was also getting woken at the same frequency (by the xfsaild, no doubt) to push out delayed write buffers. The xfsbufd was not making any progress because all the buffers in the delwri queue were pinned. This scan-and-make-no-progress dance went on in the trace for some seconds, before the xfssyncd came along and issued a log force, and then things started going again.

However, I noticed something strange about the log force - there were way too many IOs issued. 516 log buffers were written, to be exact. That added up to 129MB of log IO, which got me very interested because it's almost exactly 25% of the size of the log. The delayed logging code is supposed to aggregate the minimum of 25% of the log or 8MB worth of changes before flushing. That's what really puzzled me - why did a log force write 129MB instead of only 8MB?

Essentially what has happened is that no CIL pushes had occurred since the previous tail push which cleared out 25% of the log space. That caused all the new transactions to block because there wasn't log space for them, but they kick the xfsaild to push the tail. However, the xfsaild was not making progress because there were buffers it could not lock and flush, and the xfsbufd could not flush them because they were pinned. As a result, both the xfsaild and the xfsbufd could not move the tail of the log forward without the CIL first committing.

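As a quick sanity check on the log force figures quoted above, the arithmetic can be sketched in C. This assumes "MB" means MiB and that every log buffer write was the same size; the message states neither, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average size of one log buffer write, in bytes. */
static uint64_t avg_log_write_bytes(uint64_t total_bytes, uint64_t nr_buffers)
{
	assert(nr_buffers != 0);	/* guard the division */
	return total_bytes / nr_buffers;
}

/* Log size implied if total_bytes were exactly one quarter of the log. */
static uint64_t implied_log_bytes(uint64_t total_bytes)
{
	return total_bytes * 4;
}
```

With 129MiB spread over 516 writes, `avg_log_write_bytes()` gives 262144 bytes (256KiB) per write, and `implied_log_bytes()` gives 516MiB, so the log was probably around 512MiB. Either way, a single checkpoint of 129MiB is roughly 16 times the intended 8MiB.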
The cause of the problem was that the background CIL push, which should happen when 8MB of aggregated changes have been committed, is being held off by the concurrent transaction commit load. The background push does a down_write_trylock() which will fail if there is a concurrent transaction commit holding the push lock in read mode. With 8 CPUs all doing transactions as fast as they can, there were enough concurrent transaction commits to hold off the background push until tail-pushing could no longer free log space, and the halt would occur.

It should be noted that there is no reason why it would halt at 25% of log space used by a single CIL checkpoint. This bug could definitely violate the "no transaction should be larger than half the log" requirement and hence result in corruption if the system crashed under heavy load. This sort of bug is exactly the reason why delayed logging was tagged as experimental....

The fix is to start blocking background pushes once the threshold has been exceeded. Rework the threshold calculations to keep the amount of log space a CIL checkpoint can use to below that of the AIL push threshold to avoid the problem completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
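The push policy described in the message can be modelled as a small pure function. This is only a sketch of the decision logic, not code from the kernel: the enum, the function name, and the `trylock_ok` parameter (standing in for the result of down_write_trylock()) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of one background push attempt. */
enum push_action {
	PUSH_SKIP,	/* do nothing; a later commit will try again */
	PUSH_NOW,	/* trylock succeeded; push without blocking anyone */
	PUSH_BLOCK,	/* take the lock in blocking mode, stalling new commits */
};

static enum push_action
background_push_action(uint64_t space_used, uint64_t soft_limit,
		       uint64_t hard_limit, bool trylock_ok)
{
	assert(soft_limit <= hard_limit);

	/* Not enough aggregated yet: no push is needed. */
	if (space_used < soft_limit)
		return PUSH_SKIP;

	/* No committer holds the lock in read mode: push opportunistically. */
	if (trylock_ok)
		return PUSH_NOW;

	/* Commits keep winning the lock; past the hard limit, force the push. */
	if (space_used > hard_limit)
		return PUSH_BLOCK;

	return PUSH_SKIP;
}
```

In this model, the behaviour before the fix corresponds to a hard limit that is never reached (for example `UINT64_MAX`): under constant commit load the trylock keeps failing and the function returns `PUSH_SKIP` indefinitely, which matches the unbounded checkpoint growth the message describes.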
author: Dave Chinner <dchinner@redhat.com> 2010-09-24 04:13:44 -0400
committer: Alex Elder <aelder@sgi.com> 2010-09-29 08:51:03 -0400
commit: 80168676ebfe4af51407d30f336d67f082d45201
tree: 30f12022ffadd153cec05f460bdaa620690d7b12 /fs/xfs/xfs_log_priv.h
parent: 899611ee7d373e5eeda08e9a8632684e1ebbbf00
1 files changed, 21 insertions, 16 deletions
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ced52b98b322..edcdfe01617f 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -426,13 +426,13 @@ struct xfs_cil {
 };
 
 /*
- * The amount of log space we should the CIL to aggregate is difficult to size.
- * Whatever we chose we have to make we can get a reservation for the log space
- * effectively, that it is large enough to capture sufficient relogging to
- * reduce log buffer IO significantly, but it is not too large for the log or
- * induces too much latency when writing out through the iclogs. We track both
- * space consumed and the number of vectors in the checkpoint context, so we
- * need to decide which to use for limiting.
+ * The amount of log space we allow the CIL to aggregate is difficult to size.
+ * Whatever we choose, we have to make sure we can get a reservation for the
+ * log space effectively, that it is large enough to capture sufficient
+ * relogging to reduce log buffer IO significantly, but it is not too large for
+ * the log or induces too much latency when writing out through the iclogs. We
+ * track both space consumed and the number of vectors in the checkpoint
+ * context, so we need to decide which to use for limiting.
  *
  * Every log buffer we write out during a push needs a header reserved, which
  * is at least one sector and more for v2 logs. Hence we need a reservation of
@@ -459,16 +459,21 @@ struct xfs_cil {
  * checkpoint transaction ticket is specific to the checkpoint context, rather
  * than the CIL itself.
  *
- * With dynamic reservations, we can basically make up arbitrary limits for the
- * checkpoint size so long as they don't violate any other size rules.  Hence
- * the initial maximum size for the checkpoint transaction will be set to a
- * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit
- * right now based on the latency of writing out a large amount of data through
- * the circular iclog buffers.
+ * With dynamic reservations, we can effectively make up arbitrary limits for
+ * the checkpoint size so long as they don't violate any other size rules.
+ * Recovery imposes a rule that no transaction exceed half the log, so we are
+ * limited by that.  Furthermore, the log transaction reservation subsystem
+ * tries to keep 25% of the log free, so we need to keep below that limit or we
+ * risk running out of free log space to start any new transactions.
+ *
+ * In order to keep background CIL push efficient, we will set a lower
+ * threshold at which background pushing is attempted without blocking current
+ * transaction commits.  A separate, higher bound defines when CIL pushes are
+ * enforced to ensure we stay within our maximum checkpoint size bounds.
+ * threshold, yet give us plenty of space for aggregation on large logs.
  */
-
-#define XLOG_CIL_SPACE_LIMIT(log)       \
-        (min((log->l_logsize >> 2), (8 * 1024 * 1024)))
+#define XLOG_CIL_SPACE_LIMIT(log)       (log->l_logsize >> 3)
+#define XLOG_CIL_HARD_SPACE_LIMIT(log)  (3 * (log->l_logsize >> 4))
 
 /*
  * The reservation head lsn is not made up of a cycle number and block number.
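To see what the reworked thresholds mean in practice, the macros in the diff can be restated as plain functions over a log size in bytes. This is an illustrative userspace sketch: the function names are hypothetical, and only the shift and multiply expressions come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old limit: a quarter of the log or 8MB, whichever is smaller. */
static uint64_t old_cil_space_limit(uint64_t logsize)
{
	uint64_t quarter = logsize >> 2;
	uint64_t cap = 8ULL * 1024 * 1024;

	return quarter < cap ? quarter : cap;
}

/* New background push threshold: 1/8 (12.5%) of the log. */
static uint64_t cil_space_limit(uint64_t logsize)
{
	return logsize >> 3;
}

/* New hard limit: 3/16 (18.75%) of the log. */
static uint64_t cil_hard_space_limit(uint64_t logsize)
{
	return 3 * (logsize >> 4);
}

/* True if soft <= hard and hard is below the 25% the log tries to keep free. */
static bool cil_limits_ordered(uint64_t logsize)
{
	return cil_space_limit(logsize) <= cil_hard_space_limit(logsize) &&
	       cil_hard_space_limit(logsize) < (logsize >> 2);
}
```

For a 512MiB log, this gives a background threshold of 64MiB and a hard limit of 96MiB, both under the 128MiB (25%) that the reservation subsystem tries to keep free; the old limit would have been 8MiB. For a 16MiB log, the new values are 2MiB and 3MiB, against an old limit of 4MiB.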