path: root/fs
author	Dave Chinner <dchinner@redhat.com>	2014-07-14 17:08:24 -0400
committer	Dave Chinner <david@fromorbit.com>	2014-07-14 17:08:24 -0400
commit	cf11da9c5d374962913ca5ba0ce0886b58286224 (patch)
tree	88480a47229aa9a3244beca6cae49e0ae00df37b /fs
parent	aa182e64f16fc29a4984c2d79191b161888bbd9b (diff)
xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served its purpose, but is no longer a sufficient solution to the stack usage problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason for undoing all our "keep stack usage down" modifications. What it does allow us to do is have the freedom to refine and perfect the modifications knowing that if we get it wrong it won't blow up in our faces - we have a safety net now.

This is important because we still have the issue of older kernels having smaller stacks and that they are still supported and are demonstrating a wide range of different stack overflows. Red Hat has several open bugs for allocation based stack overflows from directory modifications and direct IO block allocation, and these problems still need to be solved. If we can solve them upstream, then distros won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack overflow report has had a specific characteristic - it has happened during or directly after a bmap btree block split. That event requires a new block to be allocated to the tree, and so we effectively stack one allocation stack on top of another, and that's when we get into trouble.

A further observation is that bmap btree block splits are much rarer than writeback allocation - over a range of different workloads I've observed the ratio of bmap btree inserts to splits ranges from 100:1 (xfstests run) to 10000:1 (local VM image server with sparse files that range in the hundreds of thousands to millions of extents). Either way, bmap btree split events are much, much rarer than allocation events.

Finally, we have to move the kswapd state to the allocation workqueue work when allocation is done on behalf of kswapd. This is proving to cause significant perturbation in performance under memory pressure and appears to be generating allocation deadlock warnings under some workloads, so avoiding the use of a workqueue for the majority of kswapd writeback allocation will minimise the impact of such behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split() and only do it for bmap btree splits. Stack switches during allocation will be much rarer, so there won't be significant performance overhead caused by switching stacks. The worst case stack from all allocation paths will be split, not just writeback. And the majority of memory allocations will be done in the correct context (e.g. kswapd) without causing additional latency, and so we simplify the memory reclaim interactions between processes, workqueues and kswapd.

The worst stack I've been able to generate with this patch in place is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes (and 36 functions) is memory reclaim, swap and the IO stack. And this occurs in the inode allocation from an open(O_CREAT) syscall, not writeback.

The amount of stack being used is much less than I've previously been able to generate - fs_mark testing has been able to generate stack usage of around 7k without too much trouble; with this patch it's only just getting to 5.5k. This is primarily because the metadata allocation paths (e.g. directory blocks) are no longer causing double splits on the same stack, and hence now stack tracing is showing swapping being the worst stack consumer rather than XFS.
Performance of fs_mark inode create workloads is unchanged. Performance of fs_mark async fsync workloads is consistently good with context switches reduced by around 150,000/s (30%). Performance of dbench, streaming IO and postmark is unchanged. Allocation deadlock warnings have not been seen on the workloads that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
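For context, the mechanism the patch relies on is the standard kernel pattern of handing a deep call chain off to a workqueue worker, which runs on a fresh kworker stack, while the submitter blocks on an on-stack completion. Below is a minimal, illustrative sketch of that pattern, not part of the patch: the names deep_call_args, deep_call_worker and deep_call_on_fresh_stack are made up for illustration, and system_unbound_wq stands in for the dedicated xfs_alloc_wq that XFS actually queues the split work on.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

/* Arguments handed to the worker; this lives on the submitter's stack. */
struct deep_call_args {
	int			input;
	int			result;
	struct completion	*done;
	struct work_struct	work;
};

/* Runs on a kworker thread, i.e. on a fresh, almost empty stack. */
static void
deep_call_worker(
	struct work_struct	*work)
{
	struct deep_call_args	*args = container_of(work,
					struct deep_call_args, work);

	/* stand-in for the deep call chain (e.g. a bmap btree split) */
	args->result = args->input;
	complete(args->done);
}

/*
 * Called from a context that may already have consumed most of its stack:
 * queue the work, sleep until the worker signals the completion, then hand
 * back the result as if the call had been made directly.
 */
static int
deep_call_on_fresh_stack(
	int			input)
{
	struct deep_call_args	args;
	DECLARE_COMPLETION_ONSTACK(done);

	args.input = input;
	args.done = &done;
	INIT_WORK_ONSTACK(&args.work, deep_call_worker);
	queue_work(system_unbound_wq, &args.work);
	wait_for_completion(&done);
	destroy_work_on_stack(&args.work);
	return args.result;
}

The cost of this pattern is a context switch per call, which is why the patch confines the switch to bmap btree splits (rare) rather than every allocation (common), and why kswapd's task flags have to be propagated to the worker when the split is performed on kswapd's behalf.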
Diffstat (limited to 'fs')
-rw-r--r--	fs/xfs/xfs_bmap.c	7
-rw-r--r--	fs/xfs/xfs_bmap.h	4
-rw-r--r--	fs/xfs/xfs_bmap_util.c	43
-rw-r--r--	fs/xfs/xfs_bmap_util.h	13
-rw-r--r--	fs/xfs/xfs_btree.c	82
-rw-r--r--	fs/xfs/xfs_iomap.c	3
6 files changed, 90 insertions(+), 62 deletions(-)
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 96175df211b1..75c3fe5f3d9d 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4298,8 +4298,8 @@ xfs_bmapi_delay(
 }
 
 
-int
-__xfs_bmapi_allocate(
+static int
+xfs_bmapi_allocate(
 	struct xfs_bmalloca	*bma)
 {
 	struct xfs_mount	*mp = bma->ip->i_mount;
@@ -4578,9 +4578,6 @@ xfs_bmapi_write(
 	bma.flist = flist;
 	bma.firstblock = firstblock;
 
-	if (flags & XFS_BMAPI_STACK_SWITCH)
-		bma.stack_switch = 1;
-
 	while (bno < end && n < *nmap) {
 		inhole = eof || bma.got.br_startoff > bno;
 		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
diff --git a/fs/xfs/xfs_bmap.h b/fs/xfs/xfs_bmap.h
index 38ba36e9b2f0..b879ca56a64c 100644
--- a/fs/xfs/xfs_bmap.h
+++ b/fs/xfs/xfs_bmap.h
@@ -77,7 +77,6 @@ typedef struct xfs_bmap_free
  * from written to unwritten, otherwise convert from unwritten to written.
  */
 #define XFS_BMAPI_CONVERT	0x040
-#define XFS_BMAPI_STACK_SWITCH	0x080
 
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
@@ -86,8 +85,7 @@ typedef struct xfs_bmap_free
 	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
 	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
-	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
-	{ XFS_BMAPI_STACK_SWITCH, "STACK_SWITCH" }
+	{ XFS_BMAPI_CONVERT,	"CONVERT" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671811d6..64731ef3324d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -249,49 +249,6 @@ xfs_bmap_rtalloc(
 }
 
 /*
- * Stack switching interfaces for allocation
- */
-static void
-xfs_bmapi_allocate_worker(
-	struct work_struct	*work)
-{
-	struct xfs_bmalloca	*args = container_of(work,
-						struct xfs_bmalloca, work);
-	unsigned long		pflags;
-
-	/* we are in a transaction context here */
-	current_set_flags_nested(&pflags, PF_FSTRANS);
-
-	args->result = __xfs_bmapi_allocate(args);
-	complete(args->done);
-
-	current_restore_flags_nested(&pflags, PF_FSTRANS);
-}
-
-/*
- * Some allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Otherwise just
- * call directly to avoid the context switch overhead here.
- */
-int
-xfs_bmapi_allocate(
-	struct xfs_bmalloca	*args)
-{
-	DECLARE_COMPLETION_ONSTACK(done);
-
-	if (!args->stack_switch)
-		return __xfs_bmapi_allocate(args);
-
-
-	args->done = &done;
-	INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
-	queue_work(xfs_alloc_wq, &args->work);
-	wait_for_completion(&done);
-	destroy_work_on_stack(&args->work);
-	return args->result;
-}
-
-/*
  * Check if the endoff is outside the last extent. If so the caller will grow
  * the allocation to a stripe unit boundary.  All offsets are considered outside
  * the end of file for an empty fork, so 1 is returned in *eof in that case.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 935ed2b24edf..2fdb72d2c908 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -50,12 +50,11 @@ struct xfs_bmalloca {
 	xfs_extlen_t		total;	/* total blocks needed for xaction */
 	xfs_extlen_t		minlen;	/* minimum allocation size (blocks) */
 	xfs_extlen_t		minleft; /* amount must be left after alloc */
-	char			eof;	/* set if allocating past last extent */
-	char			wasdel;	/* replacing a delayed allocation */
-	char			userdata;/* set if is user data */
-	char			aeof;	/* allocated space at eof */
-	char			conv;	/* overwriting unwritten extents */
-	char			stack_switch;
+	bool			eof;	/* set if allocating past last extent */
+	bool			wasdel;	/* replacing a delayed allocation */
+	bool			userdata;/* set if is user data */
+	bool			aeof;	/* allocated space at eof */
+	bool			conv;	/* overwriting unwritten extents */
 	int			flags;
 	struct completion	*done;
 	struct work_struct	work;
@@ -65,8 +64,6 @@ struct xfs_bmalloca {
 int	xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
 			int *committed);
 int	xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
-int	xfs_bmapi_allocate(struct xfs_bmalloca *args);
-int	__xfs_bmapi_allocate(struct xfs_bmalloca *args);
 int	xfs_bmap_eof(struct xfs_inode *ip, xfs_fileoff_t endoff,
 		     int whichfork, int *eof);
 int	xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index bf810c6baf2b..cf893bc1e373 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -33,6 +33,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_cksum.h"
+#include "xfs_alloc.h"
 
 /*
  * Cursor allocation zone.
@@ -2323,7 +2324,7 @@ error1:
  * record (to be inserted into parent).
  */
 STATIC int					/* error */
-xfs_btree_split(
+__xfs_btree_split(
 	struct xfs_btree_cur	*cur,
 	int			level,
 	union xfs_btree_ptr	*ptrp,
@@ -2503,6 +2504,85 @@ error0:
 	return error;
 }
 
+struct xfs_btree_split_args {
+	struct xfs_btree_cur	*cur;
+	int			level;
+	union xfs_btree_ptr	*ptrp;
+	union xfs_btree_key	*key;
+	struct xfs_btree_cur	**curp;
+	int			*stat;		/* success/failure */
+	int			result;
+	bool			kswapd;	/* allocation in kswapd context */
+	struct completion	*done;
+	struct work_struct	work;
+};
+
+/*
+ * Stack switching interfaces for allocation
+ */
+static void
+xfs_btree_split_worker(
+	struct work_struct	*work)
+{
+	struct xfs_btree_split_args	*args = container_of(work,
+						struct xfs_btree_split_args, work);
+	unsigned long		pflags;
+	unsigned long		new_pflags = PF_FSTRANS;
+
+	/*
+	 * we are in a transaction context here, but may also be doing work
+	 * in kswapd context, and hence we may need to inherit that state
+	 * temporarily to ensure that we don't block waiting for memory reclaim
+	 * in any way.
+	 */
+	if (args->kswapd)
+		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+	current_set_flags_nested(&pflags, new_pflags);
+
+	args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
+					 args->key, args->curp, args->stat);
+	complete(args->done);
+
+	current_restore_flags_nested(&pflags, new_pflags);
+}
+
+/*
+ * BMBT split requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. For the other
+ * btree types, just call directly to avoid the context switch overhead here.
+ */
+STATIC int					/* error */
+xfs_btree_split(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	union xfs_btree_ptr	*ptrp,
+	union xfs_btree_key	*key,
+	struct xfs_btree_cur	**curp,
+	int			*stat)		/* success/failure */
+{
+	struct xfs_btree_split_args	args;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	if (cur->bc_btnum != XFS_BTNUM_BMAP)
+		return __xfs_btree_split(cur, level, ptrp, key, curp, stat);
+
+	args.cur = cur;
+	args.level = level;
+	args.ptrp = ptrp;
+	args.key = key;
+	args.curp = curp;
+	args.stat = stat;
+	args.done = &done;
+	args.kswapd = current_is_kswapd();
+	INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
+	queue_work(xfs_alloc_wq, &args.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&args.work);
+	return args.result;
+}
+
+
 /*
  * Copy the old inode root contents into a real block and make the
  * broot point to it.
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 6c5eb4c551e3..6d3ec2b6ee29 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -749,8 +749,7 @@ xfs_iomap_write_allocate(
 	 * pointer that the caller gave to us.
 	 */
 	error = xfs_bmapi_write(tp, ip, map_start_fsb,
-				count_fsb,
-				XFS_BMAPI_STACK_SWITCH,
+				count_fsb, 0,
 				&first_block, 1,
 				imap, &nimaps, &free_list);
 	if (error)