authorLinus Torvalds <torvalds@linux-foundation.org>2016-10-13 23:28:22 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2016-10-13 23:28:22 -0400
commit35a891be96f1f8e1227e6ad3ca827b8a08ce47ea (patch)
treeab67c3b97a49f8e8ba2d011d4a706d52bcde318b
parent40bd3a5f341b4ef4c6a49fb68938247d3065d8ad (diff)
parentfeac470e3642e8956ac9b7f14224e6b301b9219d (diff)
Merge tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
< XFS has gained super CoW powers! >
 ----------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Pull XFS support for shared data extents from Dave Chinner:
 "This is the second part of the XFS updates for this merge cycle. This
  pullreq contains the new shared data extents feature for XFS. Given
  the complexity and size of this change I am expecting - like the
  addition of reverse mapping last cycle - that there will be some
  follow-up bug fixes and cleanups around the -rc3 stage for issues that
  I'm sure will show up once the code hits a wider userbase.

  What it is:

  At the most basic level we are simply adding shared data extents to
  XFS - i.e. a single extent on disk can now have multiple owners. To do
  this we have to add new on-disk features to both track the shared
  extents and the number of times they've been shared. This is done by
  the new "refcount" btree that sits in every allocation group. When we
  share or unshare an extent, this tree gets updated.

  Along with this new tree, the reverse mapping tree needs to be updated
  to track each owner of a shared extent. This also needs to be updated
  on every share/unshare operation. These interactions at extent
  allocation and freeing time have complex ordering and recovery
  constraints, so there's a significant amount of new intent-based
  transaction code to ensure that operations are performed atomically
  from both the runtime and integrity/crash recovery perspectives.

  We also need to break sharing when writes hit a shared extent - this
  is where the new copy-on-write implementation comes in. We allocate
  new storage and copy the original data along with the overwrite data
  into the new location. We only do this for data as we don't share
  metadata at all - each inode has its own metadata that tracks the
  shared data extents, the extents undergoing CoW and its own private
  extents.

  Of course, being XFS, nothing is simple - we use delayed allocation
  for CoW similar to how we use it for normal writes.
  ENOSPC is a significant issue here - we build on the reservation code
  added in 4.8-rc1 with the reverse mapping feature to ensure we don't
  get spurious ENOSPC issues part way through a CoW operation. These
  mechanisms also help minimise fragmentation due to repeated CoW
  operations. To further reduce fragmentation overhead, we've also
  introduced a CoW extent size hint, which indicates how large a region
  we should allocate when we execute a CoW operation.

  With all this functionality in place, we can hook up .copy_file_range,
  .clone_file_range and .dedupe_file_range and we gain all the
  capabilities of reflink and other vfs-provided functionality that
  enables manipulation of shared extents. We also added a fallocate mode
  that explicitly unshares a range of a file, which we implemented as an
  explicit CoW of all the shared extents in a file.

  As such, it's a huge chunk of new functionality with new on-disk
  format features and internal infrastructure. It warns at mount time as
  an experimental feature and that it may eat data (as we do with all
  new on-disk features until they stabilise). We have not released
  userspace support for it yet - userspace support currently requires a
  download from Darrick's xfsprogs repo and a build from source, so
  access to this feature is really developer/tester only at this point.
  Initial userspace support will be released at the same time as the
  kernel with this code in it.

  The new code causes 5-6 new failures with xfstests - these aren't
  serious functional failures but rather the output of tests changing
  slightly due to perturbations in layouts, space usage, etc. OTOH,
  we've added 150+ new tests to xfstests that specifically exercise this
  new functionality, so it's got far better test coverage than any
  functionality we've previously added to XFS.
  Darrick has done a pretty amazing job getting us to this stage, and
  special mention also needs to go to Christoph (review, testing,
  improvements and bug fixes) and Brian (who caught several intricate
  bugs during review) for the effort they've also put in.

  Summary:

   - unshare range (FALLOC_FL_UNSHARE) support for fallocate

   - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for the
     fsxattr interface

   - shared extent support for XFS

   - copy-on-write support for shared extents

   - copy_file_range support

   - clone_file_range support (implements reflink)

   - dedupe_file_range support

   - defrag support for reverse mapping enabled filesystems"

* tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits)
  xfs: convert COW blocks to real blocks before unwritten extent conversion
  xfs: rework refcount cow recovery error handling
  xfs: clear reflink flag if setting realtime flag
  xfs: fix error initialization
  xfs: fix label inaccuracies
  xfs: remove isize check from unshare operation
  xfs: reduce stack usage of _reflink_clear_inode_flag
  xfs: check inode reflink flag before calling reflink functions
  xfs: implement swapext for rmap filesystems
  xfs: refactor swapext code
  xfs: various swapext cleanups
  xfs: recognize the reflink feature bit
  xfs: simulate per-AG reservations being critically low
  xfs: don't mix reflink and DAX mode for now
  xfs: check for invalid inode reflink flags
  xfs: set a default CoW extent size of 32 blocks
  xfs: convert unwritten status of reverse mappings for shared files
  xfs: use interval query for rmap alloc operations on shared files
  xfs: add shared rmap map/unmap/convert log item types
  xfs: increase log reservations for reflink
  ...
-rw-r--r--fs/open.c5
-rw-r--r--fs/xfs/Makefile7
-rw-r--r--fs/xfs/libxfs/xfs_ag_resv.c15
-rw-r--r--fs/xfs/libxfs/xfs_alloc.c23
-rw-r--r--fs/xfs/libxfs/xfs_bmap.c575
-rw-r--r--fs/xfs/libxfs/xfs_bmap.h67
-rw-r--r--fs/xfs/libxfs/xfs_bmap_btree.c18
-rw-r--r--fs/xfs/libxfs/xfs_btree.c8
-rw-r--r--fs/xfs/libxfs/xfs_btree.h16
-rw-r--r--fs/xfs/libxfs/xfs_defer.h2
-rw-r--r--fs/xfs/libxfs/xfs_format.h97
-rw-r--r--fs/xfs/libxfs/xfs_fs.h10
-rw-r--r--fs/xfs/libxfs/xfs_inode_buf.c24
-rw-r--r--fs/xfs/libxfs/xfs_inode_buf.h1
-rw-r--r--fs/xfs/libxfs/xfs_inode_fork.c70
-rw-r--r--fs/xfs/libxfs/xfs_inode_fork.h28
-rw-r--r--fs/xfs/libxfs/xfs_log_format.h118
-rw-r--r--fs/xfs/libxfs/xfs_refcount.c1698
-rw-r--r--fs/xfs/libxfs/xfs_refcount.h70
-rw-r--r--fs/xfs/libxfs/xfs_refcount_btree.c451
-rw-r--r--fs/xfs/libxfs/xfs_refcount_btree.h74
-rw-r--r--fs/xfs/libxfs/xfs_rmap.c914
-rw-r--r--fs/xfs/libxfs/xfs_rmap.h7
-rw-r--r--fs/xfs/libxfs/xfs_rmap_btree.c82
-rw-r--r--fs/xfs/libxfs/xfs_rmap_btree.h7
-rw-r--r--fs/xfs/libxfs/xfs_sb.c9
-rw-r--r--fs/xfs/libxfs/xfs_shared.h2
-rw-r--r--fs/xfs/libxfs/xfs_trans_resv.c23
-rw-r--r--fs/xfs/libxfs/xfs_trans_resv.h3
-rw-r--r--fs/xfs/libxfs/xfs_trans_space.h9
-rw-r--r--fs/xfs/libxfs/xfs_types.h3
-rw-r--r--fs/xfs/xfs_aops.c222
-rw-r--r--fs/xfs/xfs_aops.h4
-rw-r--r--fs/xfs/xfs_bmap_item.c508
-rw-r--r--fs/xfs/xfs_bmap_item.h98
-rw-r--r--fs/xfs/xfs_bmap_util.c589
-rw-r--r--fs/xfs/xfs_dir2_readdir.c3
-rw-r--r--fs/xfs/xfs_error.h10
-rw-r--r--fs/xfs/xfs_file.c221
-rw-r--r--fs/xfs/xfs_fsops.c107
-rw-r--r--fs/xfs/xfs_fsops.h3
-rw-r--r--fs/xfs/xfs_globals.c5
-rw-r--r--fs/xfs/xfs_icache.c243
-rw-r--r--fs/xfs/xfs_icache.h7
-rw-r--r--fs/xfs/xfs_inode.c51
-rw-r--r--fs/xfs/xfs_inode.h19
-rw-r--r--fs/xfs/xfs_inode_item.c2
-rw-r--r--fs/xfs/xfs_ioctl.c75
-rw-r--r--fs/xfs/xfs_iomap.c35
-rw-r--r--fs/xfs/xfs_iomap.h3
-rw-r--r--fs/xfs/xfs_iops.c1
-rw-r--r--fs/xfs/xfs_itable.c8
-rw-r--r--fs/xfs/xfs_linux.h1
-rw-r--r--fs/xfs/xfs_log_recover.c357
-rw-r--r--fs/xfs/xfs_mount.c32
-rw-r--r--fs/xfs/xfs_mount.h8
-rw-r--r--fs/xfs/xfs_ondisk.h3
-rw-r--r--fs/xfs/xfs_pnfs.c7
-rw-r--r--fs/xfs/xfs_refcount_item.c539
-rw-r--r--fs/xfs/xfs_refcount_item.h101
-rw-r--r--fs/xfs/xfs_reflink.c1688
-rw-r--r--fs/xfs/xfs_reflink.h58
-rw-r--r--fs/xfs/xfs_rmap_item.c12
-rw-r--r--fs/xfs/xfs_stats.c1
-rw-r--r--fs/xfs/xfs_stats.h18
-rw-r--r--fs/xfs/xfs_super.c87
-rw-r--r--fs/xfs/xfs_sysctl.c9
-rw-r--r--fs/xfs/xfs_sysctl.h1
-rw-r--r--fs/xfs/xfs_trace.h742
-rw-r--r--fs/xfs/xfs_trans.h29
-rw-r--r--fs/xfs/xfs_trans_bmap.c249
-rw-r--r--fs/xfs/xfs_trans_refcount.c264
-rw-r--r--fs/xfs/xfs_trans_rmap.c9
-rw-r--r--include/linux/falloc.h3
-rw-r--r--include/uapi/linux/falloc.h18
-rw-r--r--include/uapi/linux/fs.h4
76 files changed, 10580 insertions, 310 deletions
diff --git a/fs/open.c b/fs/open.c
index a7719cfb7257..d3ed8171e8e0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -267,6 +267,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		    (mode & ~FALLOC_FL_INSERT_RANGE))
 		return -EINVAL;
 
+	/* Unshare range should only be used with allocate mode. */
+	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
+	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
+		return -EINVAL;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 584e87e11cb6..26ef1958b65b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -55,6 +55,8 @@ xfs-y += $(addprefix libxfs/, \
 				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_refcount.o \
+				   xfs_refcount_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
@@ -88,6 +90,7 @@ xfs-y += xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_reflink.o \
 				   xfs_stats.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
@@ -100,16 +103,20 @@ xfs-y += xfs_aops.o \
 # low-level transaction/log code
 xfs-y				+= xfs_log.o \
 				   xfs_log_cil.o \
+				   xfs_bmap_item.o \
 				   xfs_buf_item.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
+				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
+				   xfs_trans_bmap.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
+				   xfs_trans_refcount.o \
 				   xfs_trans_rmap.o \
 
 # optional features
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index e3ae0f2b4294..e5ebc3770460 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -38,6 +38,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_btree.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Per-AG Block Reservations
@@ -108,7 +109,9 @@ xfs_ag_resv_critical(
 	trace_xfs_ag_resv_critical(pag, type, avail);
 
 	/* Critically low if less than 10% or max btree height remains. */
-	return avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS;
+	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
+			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL,
+			XFS_RANDOM_AG_RESV_CRITICAL);
 }
 
 /*
@@ -228,6 +231,11 @@ xfs_ag_resv_init(
 	if (pag->pag_meta_resv.ar_asked == 0) {
 		ask = used = 0;
 
+		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
+				pag->pag_agno, &ask, &used);
+		if (error)
+			goto out;
+
 		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
 				ask, used);
 		if (error)
@@ -238,6 +246,11 @@ xfs_ag_resv_init(
 	if (pag->pag_agfl_resv.ar_asked == 0) {
 		ask = used = 0;
 
+		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
+				&ask, &used);
+		if (error)
+			goto out;
+
 		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
 		if (error)
 			goto out;
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index ca75dc90ebe0..effb64cf714f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -52,10 +52,23 @@ STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
 		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
 
+unsigned int
+xfs_refc_block(
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return XFS_RMAP_BLOCK(mp) + 1;
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		return XFS_FIBT_BLOCK(mp) + 1;
+	return XFS_IBT_BLOCK(mp) + 1;
+}
+
 xfs_extlen_t
 xfs_prealloc_blocks(
 	struct xfs_mount	*mp)
 {
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		return xfs_refc_block(mp) + 1;
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return XFS_RMAP_BLOCK(mp) + 1;
 	if (xfs_sb_version_hasfinobt(&mp->m_sb))
@@ -115,6 +128,8 @@ xfs_alloc_ag_max_usable(
 		blocks++;	/* finobt root block */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		blocks++;	/* rmap root block */
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks++;	/* refcount root block */
 
 	return mp->m_sb.sb_agblocks - blocks;
 }
@@ -2321,6 +2336,9 @@ xfs_alloc_log_agf(
 		offsetof(xfs_agf_t, agf_btreeblks),
 		offsetof(xfs_agf_t, agf_uuid),
 		offsetof(xfs_agf_t, agf_rmap_blocks),
+		offsetof(xfs_agf_t, agf_refcount_blocks),
+		offsetof(xfs_agf_t, agf_refcount_root),
+		offsetof(xfs_agf_t, agf_refcount_level),
 		/* needed so that we don't log the whole rest of the structure: */
 		offsetof(xfs_agf_t, agf_spare64),
 		sizeof(xfs_agf_t)
@@ -2458,6 +2476,10 @@ xfs_agf_verify(
 	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
 		return false;
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    be32_to_cpu(agf->agf_refcount_level) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	return true;;
 
 }
@@ -2578,6 +2600,7 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		pag->pagf_levels[XFS_BTNUM_RMAPi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+		pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9d7f61d36645..c27344cf38e1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -48,6 +48,7 @@
 #include "xfs_filestream.h"
 #include "xfs_rmap.h"
 #include "xfs_ag_resv.h"
+#include "xfs_refcount.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -140,7 +141,8 @@ xfs_bmbt_lookup_ge(
  */
 static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) >
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -150,7 +152,8 @@ static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
  */
 static inline bool xfs_bmap_wants_extents(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) <=
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -640,6 +643,7 @@ xfs_bmap_btree_to_extents(
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(ifp->if_flags & XFS_IFEXTENTS);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
 	rblock = ifp->if_broot;
@@ -706,6 +710,7 @@ xfs_bmap_extents_to_btree(
 	xfs_bmbt_ptr_t		*pp;		/* root block address pointer */
 
 	mp = ip->i_mount;
+	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS);
 
@@ -748,6 +753,7 @@ xfs_bmap_extents_to_btree(
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
 	} else if (dfops->dop_low) {
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = *firstblock;
 	} else {
@@ -762,6 +768,21 @@ xfs_bmap_extents_to_btree(
 		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 		return error;
 	}
+
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		dfops->dop_low = true;
+		goto try_another_ag;
+	}
 	/*
 	 * Allocation can't fail, the space was reserved.
 	 */
@@ -837,6 +858,7 @@ xfs_bmap_local_to_extents_empty(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL);
 	ASSERT(ifp->if_bytes == 0);
 	ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0);
@@ -896,6 +918,7 @@ xfs_bmap_local_to_extents(
 	 * file currently fits in an inode.
 	 */
 	if (*firstblock == NULLFSBLOCK) {
+try_another_ag:
 		args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
@@ -908,6 +931,19 @@ xfs_bmap_local_to_extents(
 	if (error)
 		goto done;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		goto try_another_ag;
+	}
 	/* Can't fail, the space was reserved. */
 	ASSERT(args.fsbno != NULLFSBLOCK);
 	ASSERT(args.len == 1);
@@ -1670,7 +1706,8 @@ xfs_bmap_one_block(
  */
 STATIC int				/* error */
 xfs_bmap_add_extent_delay_real(
-	struct xfs_bmalloca	*bma)
+	struct xfs_bmalloca	*bma,
+	int			whichfork)
 {
 	struct xfs_bmbt_irec	*new = &bma->got;
 	int			diff;	/* temp value */
@@ -1688,11 +1725,14 @@ xfs_bmap_add_extent_delay_real(
 	xfs_filblks_t		temp=0;	/* value for da_new calculations */
 	xfs_filblks_t		temp2=0;/* value for da_new calculations */
 	int			tmp_rval;	/* partial logging flags */
-	int			whichfork = XFS_DATA_FORK;
 	struct xfs_mount	*mp;
+	xfs_extnum_t		*nextents;
 
 	mp = bma->ip->i_mount;
 	ifp = XFS_IFORK_PTR(bma->ip, whichfork);
+	ASSERT(whichfork != XFS_ATTR_FORK);
+	nextents = (whichfork == XFS_COW_FORK ? &bma->ip->i_cnextents :
+						&bma->ip->i_d.di_nextents);
 
 	ASSERT(bma->idx >= 0);
 	ASSERT(bma->idx <= ifp->if_bytes / sizeof(struct xfs_bmbt_rec));
@@ -1706,6 +1746,9 @@ xfs_bmap_add_extent_delay_real(
 #define	RIGHT		r[1]
 #define	PREV		r[2]
 
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
+
 	/*
 	 * Set up a bunch of variables to make the tests simpler.
 	 */
@@ -1792,7 +1835,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
 		xfs_iext_remove(bma->ip, bma->idx + 1, 2, state);
-		bma->ip->i_d.di_nextents--;
+		(*nextents)--;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1894,7 +1937,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_bmbt_set_startblock(ep, new->br_startblock);
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1964,7 +2007,7 @@ xfs_bmap_add_extent_delay_real(
 		temp = PREV.br_blockcount - new->br_blockcount;
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2048,7 +2091,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_pre_update(bma->ip, bma->idx, state, _THIS_IP_);
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx + 1, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2117,7 +2160,7 @@ xfs_bmap_add_extent_delay_real(
 		RIGHT.br_blockcount = temp2;
 		/* insert LEFT (r[0]) and RIGHT (r[1]) at the same time */
 		xfs_iext_insert(bma->ip, bma->idx + 1, 2, &LEFT, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2215,7 +2258,8 @@ xfs_bmap_add_extent_delay_real(
 
 	xfs_bmap_check_leaf_extents(bma->cur, bma->ip, whichfork);
 done:
-	bma->logflags |= rval;
+	if (whichfork != XFS_COW_FORK)
+		bma->logflags |= rval;
 	return error;
 #undef	LEFT
 #undef	RIGHT
@@ -2759,6 +2803,7 @@ done:
 STATIC void
 xfs_bmap_add_extent_hole_delay(
 	xfs_inode_t		*ip,	/* incore inode pointer */
+	int			whichfork,
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_bmbt_irec_t		*new)	/* new data to add to file extents */
 {
@@ -2770,8 +2815,10 @@ xfs_bmap_add_extent_hole_delay(
 	int			state;  /* state bits, accessed thru macros */
 	xfs_filblks_t		temp=0;	/* temp for indirect calculations */
 
-	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
 	state = 0;
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 	ASSERT(isnullstartblock(new->br_startblock));
 
 	/*
@@ -2789,7 +2836,7 @@ xfs_bmap_add_extent_hole_delay(
 	 * Check and set flags if the current (right) segment exists.
 	 * If it doesn't exist, we're converting the hole at end-of-file.
 	 */
-	if (*idx < ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
+	if (*idx < ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
 		state |= BMAP_RIGHT_VALID;
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *idx), &right);
 
@@ -2923,6 +2970,7 @@ xfs_bmap_add_extent_hole_real(
 	ASSERT(!isnullstartblock(new->br_startblock));
 	ASSERT(!bma->cur ||
 	       !(bma->cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	XFS_STATS_INC(mp, xs_add_exlist);
 
@@ -3648,7 +3696,9 @@ xfs_bmap_btalloc(
 	else if (mp->m_dalign)
 		stripe_align = mp->m_dalign;
 
-	if (xfs_alloc_is_userdata(ap->datatype))
+	if (ap->flags & XFS_BMAPI_COWFORK)
+		align = xfs_get_cowextsz_hint(ap->ip);
+	else if (xfs_alloc_is_userdata(ap->datatype))
 		align = xfs_get_extsz_hint(ap->ip);
 	if (unlikely(align)) {
 		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
@@ -3856,7 +3906,8 @@ xfs_bmap_btalloc(
 	ASSERT(nullfb || fb_agno == args.agno ||
 	       (ap->dfops->dop_low && fb_agno < args.agno));
 	ap->length = args.len;
-	ap->ip->i_d.di_nblocks += args.len;
+	if (!(ap->flags & XFS_BMAPI_COWFORK))
+		ap->ip->i_d.di_nblocks += args.len;
 	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
 	if (ap->wasdel)
 		ap->ip->i_delayed_blks -= args.len;
@@ -3876,6 +3927,63 @@ xfs_bmap_btalloc(
 }
 
 /*
+ * For a remap operation, just "allocate" an extent at the address that the
+ * caller passed in, and ensure that the AGFL is the right size.  The caller
+ * will then map the "allocated" extent into the file somewhere.
+ */
+STATIC int
+xfs_bmap_remap_alloc(
+	struct xfs_bmalloca	*ap)
+{
+	struct xfs_trans	*tp = ap->tp;
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agblock_t		bno;
+	struct xfs_alloc_arg	args;
+	int			error;
+
+	/*
+	 * validate that the block number is legal - the enables us to detect
+	 * and handle a silent filesystem corruption rather than crashing.
+	 */
+	memset(&args, 0, sizeof(struct xfs_alloc_arg));
+	args.tp = ap->tp;
+	args.mp = ap->tp->t_mountp;
+	bno = *ap->firstblock;
+	args.agno = XFS_FSB_TO_AGNO(mp, bno);
+	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
+	if (args.agno >= mp->m_sb.sb_agcount ||
+	    args.agbno >= mp->m_sb.sb_agblocks)
+		return -EFSCORRUPTED;
+
+	/* "Allocate" the extent from the range we passed in. */
+	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
+	ap->blkno = bno;
+	ap->ip->i_d.di_nblocks += ap->length;
+	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+
+	/* Fix the freelist, like a real allocator does. */
+	args.datatype = ap->datatype;
+	args.pag = xfs_perag_get(args.mp, args.agno);
+	ASSERT(args.pag);
+
+	/*
+	 * The freelist fixing code will decline the allocation if
+	 * the size and shape of the free space doesn't allow for
+	 * allocating the extent and updating all the metadata that
+	 * happens during an allocation.  We're remapping, not
+	 * allocating, so skip that check by pretending to be freeing.
+	 */
+	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
+	if (error)
+		goto error0;
+error0:
+	xfs_perag_put(args.pag);
+	if (error)
+		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
+	return error;
+}
+
+/*
  * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
  * It figures out where to ask the underlying allocator to put the new extent.
  */
@@ -3883,6 +3991,8 @@ STATIC int
3883xfs_bmap_alloc( 3991xfs_bmap_alloc(
3884 struct xfs_bmalloca *ap) /* bmap alloc argument struct */ 3992 struct xfs_bmalloca *ap) /* bmap alloc argument struct */
3885{ 3993{
3994 if (ap->flags & XFS_BMAPI_REMAP)
3995 return xfs_bmap_remap_alloc(ap);
3886 if (XFS_IS_REALTIME_INODE(ap->ip) && 3996 if (XFS_IS_REALTIME_INODE(ap->ip) &&
3887 xfs_alloc_is_userdata(ap->datatype)) 3997 xfs_alloc_is_userdata(ap->datatype))
3888 return xfs_bmap_rtalloc(ap); 3998 return xfs_bmap_rtalloc(ap);
@@ -4012,12 +4122,11 @@ xfs_bmapi_read(
4012 int error; 4122 int error;
4013 int eof; 4123 int eof;
4014 int n = 0; 4124 int n = 0;
4015 int whichfork = (flags & XFS_BMAPI_ATTRFORK) ? 4125 int whichfork = xfs_bmapi_whichfork(flags);
4016 XFS_ATTR_FORK : XFS_DATA_FORK;
4017 4126
4018 ASSERT(*nmap >= 1); 4127 ASSERT(*nmap >= 1);
4019 ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE| 4128 ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE|
4020 XFS_BMAPI_IGSTATE))); 4129 XFS_BMAPI_IGSTATE|XFS_BMAPI_COWFORK)));
4021 ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)); 4130 ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
4022 4131
4023 if (unlikely(XFS_TEST_ERROR( 4132 if (unlikely(XFS_TEST_ERROR(
@@ -4035,6 +4144,16 @@ xfs_bmapi_read(
4035 4144
4036 ifp = XFS_IFORK_PTR(ip, whichfork); 4145 ifp = XFS_IFORK_PTR(ip, whichfork);
4037 4146
4147 /* No CoW fork? Return a hole. */
4148 if (whichfork == XFS_COW_FORK && !ifp) {
4149 mval->br_startoff = bno;
4150 mval->br_startblock = HOLESTARTBLOCK;
4151 mval->br_blockcount = len;
4152 mval->br_state = XFS_EXT_NORM;
4153 *nmap = 1;
4154 return 0;
4155 }
4156
4038 if (!(ifp->if_flags & XFS_IFEXTENTS)) { 4157 if (!(ifp->if_flags & XFS_IFEXTENTS)) {
4039 error = xfs_iread_extents(NULL, ip, whichfork); 4158 error = xfs_iread_extents(NULL, ip, whichfork);
4040 if (error) 4159 if (error)
@@ -4084,6 +4203,7 @@ xfs_bmapi_read(
4084int 4203int
4085xfs_bmapi_reserve_delalloc( 4204xfs_bmapi_reserve_delalloc(
4086 struct xfs_inode *ip, 4205 struct xfs_inode *ip,
4206 int whichfork,
4087 xfs_fileoff_t aoff, 4207 xfs_fileoff_t aoff,
4088 xfs_filblks_t len, 4208 xfs_filblks_t len,
4089 struct xfs_bmbt_irec *got, 4209 struct xfs_bmbt_irec *got,
@@ -4092,7 +4212,7 @@ xfs_bmapi_reserve_delalloc(
4092 int eof) 4212 int eof)
4093{ 4213{
4094 struct xfs_mount *mp = ip->i_mount; 4214 struct xfs_mount *mp = ip->i_mount;
4095 struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK); 4215 struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
4096 xfs_extlen_t alen; 4216 xfs_extlen_t alen;
4097 xfs_extlen_t indlen; 4217 xfs_extlen_t indlen;
4098 char rt = XFS_IS_REALTIME_INODE(ip); 4218 char rt = XFS_IS_REALTIME_INODE(ip);
@@ -4104,7 +4224,10 @@ xfs_bmapi_reserve_delalloc(
4104 alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff); 4224 alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
4105 4225
4106 /* Figure out the extent size, adjust alen */ 4226 /* Figure out the extent size, adjust alen */
4107 extsz = xfs_get_extsz_hint(ip); 4227 if (whichfork == XFS_COW_FORK)
4228 extsz = xfs_get_cowextsz_hint(ip);
4229 else
4230 extsz = xfs_get_extsz_hint(ip);
4108 if (extsz) { 4231 if (extsz) {
4109 error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof, 4232 error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof,
4110 1, 0, &aoff, &alen); 4233 1, 0, &aoff, &alen);
@@ -4151,7 +4274,7 @@ xfs_bmapi_reserve_delalloc(
4151 got->br_startblock = nullstartblock(indlen); 4274 got->br_startblock = nullstartblock(indlen);
4152 got->br_blockcount = alen; 4275 got->br_blockcount = alen;
4153 got->br_state = XFS_EXT_NORM; 4276 got->br_state = XFS_EXT_NORM;
4154 xfs_bmap_add_extent_hole_delay(ip, lastx, got); 4277 xfs_bmap_add_extent_hole_delay(ip, whichfork, lastx, got);
4155 4278
4156 /* 4279 /*
4157 * Update our extent pointer, given that xfs_bmap_add_extent_hole_delay 4280 * Update our extent pointer, given that xfs_bmap_add_extent_hole_delay
@@ -4182,8 +4305,7 @@ xfs_bmapi_allocate(
4182 struct xfs_bmalloca *bma) 4305 struct xfs_bmalloca *bma)
4183{ 4306{
4184 struct xfs_mount *mp = bma->ip->i_mount; 4307 struct xfs_mount *mp = bma->ip->i_mount;
4185 int whichfork = (bma->flags & XFS_BMAPI_ATTRFORK) ? 4308 int whichfork = xfs_bmapi_whichfork(bma->flags);
4186 XFS_ATTR_FORK : XFS_DATA_FORK;
4187 struct xfs_ifork *ifp = XFS_IFORK_PTR(bma->ip, whichfork); 4309 struct xfs_ifork *ifp = XFS_IFORK_PTR(bma->ip, whichfork);
4188 int tmp_logflags = 0; 4310 int tmp_logflags = 0;
4189 int error; 4311 int error;
@@ -4278,7 +4400,7 @@ xfs_bmapi_allocate(
4278 bma->got.br_state = XFS_EXT_UNWRITTEN; 4400 bma->got.br_state = XFS_EXT_UNWRITTEN;
4279 4401
4280 if (bma->wasdel) 4402 if (bma->wasdel)
4281 error = xfs_bmap_add_extent_delay_real(bma); 4403 error = xfs_bmap_add_extent_delay_real(bma, whichfork);
4282 else 4404 else
4283 error = xfs_bmap_add_extent_hole_real(bma, whichfork); 4405 error = xfs_bmap_add_extent_hole_real(bma, whichfork);
4284 4406
@@ -4308,8 +4430,7 @@ xfs_bmapi_convert_unwritten(
4308 xfs_filblks_t len, 4430 xfs_filblks_t len,
4309 int flags) 4431 int flags)
4310{ 4432{
4311 int whichfork = (flags & XFS_BMAPI_ATTRFORK) ? 4433 int whichfork = xfs_bmapi_whichfork(flags);
4312 XFS_ATTR_FORK : XFS_DATA_FORK;
4313 struct xfs_ifork *ifp = XFS_IFORK_PTR(bma->ip, whichfork); 4434 struct xfs_ifork *ifp = XFS_IFORK_PTR(bma->ip, whichfork);
4314 int tmp_logflags = 0; 4435 int tmp_logflags = 0;
4315 int error; 4436 int error;
@@ -4325,6 +4446,8 @@ xfs_bmapi_convert_unwritten(
4325 (XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT)) 4446 (XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
4326 return 0; 4447 return 0;
4327 4448
4449 ASSERT(whichfork != XFS_COW_FORK);
4450
4328 /* 4451 /*
4329 * Modify (by adding) the state flag, if writing. 4452 * Modify (by adding) the state flag, if writing.
4330 */ 4453 */
@@ -4431,8 +4554,7 @@ xfs_bmapi_write(
4431 orig_mval = mval; 4554 orig_mval = mval;
4432 orig_nmap = *nmap; 4555 orig_nmap = *nmap;
4433#endif 4556#endif
4434 whichfork = (flags & XFS_BMAPI_ATTRFORK) ? 4557 whichfork = xfs_bmapi_whichfork(flags);
4435 XFS_ATTR_FORK : XFS_DATA_FORK;
4436 4558
4437 ASSERT(*nmap >= 1); 4559 ASSERT(*nmap >= 1);
4438 ASSERT(*nmap <= XFS_BMAP_MAX_NMAP); 4560 ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
@@ -4441,6 +4563,11 @@ xfs_bmapi_write(
4441 ASSERT(len > 0); 4563 ASSERT(len > 0);
4442 ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL); 4564 ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
4443 ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); 4565 ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
4566 ASSERT(!(flags & XFS_BMAPI_REMAP) || whichfork == XFS_DATA_FORK);
4567 ASSERT(!(flags & XFS_BMAPI_PREALLOC) || !(flags & XFS_BMAPI_REMAP));
4568 ASSERT(!(flags & XFS_BMAPI_CONVERT) || !(flags & XFS_BMAPI_REMAP));
4569 ASSERT(!(flags & XFS_BMAPI_PREALLOC) || whichfork != XFS_COW_FORK);
4570 ASSERT(!(flags & XFS_BMAPI_CONVERT) || whichfork != XFS_COW_FORK);
4444 4571
 4445	/* zeroing is currently only for data extents, not metadata */ 4572	/* zeroing is currently only for data extents, not metadata */
4446 ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) != 4573 ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
@@ -4502,6 +4629,14 @@ xfs_bmapi_write(
4502 wasdelay = !inhole && isnullstartblock(bma.got.br_startblock); 4629 wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
4503 4630
4504 /* 4631 /*
4632 * Make sure we only reflink into a hole.
4633 */
4634 if (flags & XFS_BMAPI_REMAP)
4635 ASSERT(inhole);
4636 if (flags & XFS_BMAPI_COWFORK)
4637 ASSERT(!inhole);
4638
4639 /*
4505 * First, deal with the hole before the allocated space 4640 * First, deal with the hole before the allocated space
4506 * that we found, if any. 4641 * that we found, if any.
4507 */ 4642 */
@@ -4531,6 +4666,17 @@ xfs_bmapi_write(
4531 goto error0; 4666 goto error0;
4532 if (bma.blkno == NULLFSBLOCK) 4667 if (bma.blkno == NULLFSBLOCK)
4533 break; 4668 break;
4669
4670 /*
4671 * If this is a CoW allocation, record the data in
4672 * the refcount btree for orphan recovery.
4673 */
4674 if (whichfork == XFS_COW_FORK) {
4675 error = xfs_refcount_alloc_cow_extent(mp, dfops,
4676 bma.blkno, bma.length);
4677 if (error)
4678 goto error0;
4679 }
4534 } 4680 }
4535 4681
4536 /* Deal with the allocated space we found. */ 4682 /* Deal with the allocated space we found. */
@@ -4696,7 +4842,8 @@ xfs_bmap_del_extent(
4696 xfs_btree_cur_t *cur, /* if null, not a btree */ 4842 xfs_btree_cur_t *cur, /* if null, not a btree */
4697 xfs_bmbt_irec_t *del, /* data to remove from extents */ 4843 xfs_bmbt_irec_t *del, /* data to remove from extents */
4698 int *logflagsp, /* inode logging flags */ 4844 int *logflagsp, /* inode logging flags */
4699 int whichfork) /* data or attr fork */ 4845 int whichfork, /* data or attr fork */
4846 int bflags) /* bmapi flags */
4700{ 4847{
4701 xfs_filblks_t da_new; /* new delay-alloc indirect blocks */ 4848 xfs_filblks_t da_new; /* new delay-alloc indirect blocks */
4702 xfs_filblks_t da_old; /* old delay-alloc indirect blocks */ 4849 xfs_filblks_t da_old; /* old delay-alloc indirect blocks */
@@ -4725,6 +4872,8 @@ xfs_bmap_del_extent(
4725 4872
4726 if (whichfork == XFS_ATTR_FORK) 4873 if (whichfork == XFS_ATTR_FORK)
4727 state |= BMAP_ATTRFORK; 4874 state |= BMAP_ATTRFORK;
4875 else if (whichfork == XFS_COW_FORK)
4876 state |= BMAP_COWFORK;
4728 4877
4729 ifp = XFS_IFORK_PTR(ip, whichfork); 4878 ifp = XFS_IFORK_PTR(ip, whichfork);
4730 ASSERT((*idx >= 0) && (*idx < ifp->if_bytes / 4879 ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
@@ -4805,6 +4954,7 @@ xfs_bmap_del_extent(
4805 /* 4954 /*
4806 * Matches the whole extent. Delete the entry. 4955 * Matches the whole extent. Delete the entry.
4807 */ 4956 */
4957 trace_xfs_bmap_pre_update(ip, *idx, state, _THIS_IP_);
4808 xfs_iext_remove(ip, *idx, 1, 4958 xfs_iext_remove(ip, *idx, 1,
4809 whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0); 4959 whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0);
4810 --*idx; 4960 --*idx;
@@ -4988,9 +5138,16 @@ xfs_bmap_del_extent(
4988 /* 5138 /*
4989 * If we need to, add to list of extents to delete. 5139 * If we need to, add to list of extents to delete.
4990 */ 5140 */
4991 if (do_fx) 5141 if (do_fx && !(bflags & XFS_BMAPI_REMAP)) {
4992 xfs_bmap_add_free(mp, dfops, del->br_startblock, 5142 if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
4993 del->br_blockcount, NULL); 5143 error = xfs_refcount_decrease_extent(mp, dfops, del);
5144 if (error)
5145 goto done;
5146 } else
5147 xfs_bmap_add_free(mp, dfops, del->br_startblock,
5148 del->br_blockcount, NULL);
5149 }
5150
4994 /* 5151 /*
4995 * Adjust inode # blocks in the file. 5152 * Adjust inode # blocks in the file.
4996 */ 5153 */
@@ -4999,7 +5156,7 @@ xfs_bmap_del_extent(
4999 /* 5156 /*
5000 * Adjust quota data. 5157 * Adjust quota data.
5001 */ 5158 */
5002 if (qfield) 5159 if (qfield && !(bflags & XFS_BMAPI_REMAP))
5003 xfs_trans_mod_dquot_byino(tp, ip, qfield, (long)-nblks); 5160 xfs_trans_mod_dquot_byino(tp, ip, qfield, (long)-nblks);
5004 5161
5005 /* 5162 /*
@@ -5014,6 +5171,175 @@ done:
5014 return error; 5171 return error;
5015} 5172}
5016 5173
5174/* Remove an extent from the CoW fork. Similar to xfs_bmap_del_extent. */
5175int
5176xfs_bunmapi_cow(
5177 struct xfs_inode *ip,
5178 struct xfs_bmbt_irec *del)
5179{
5180 xfs_filblks_t da_new;
5181 xfs_filblks_t da_old;
5182 xfs_fsblock_t del_endblock = 0;
5183 xfs_fileoff_t del_endoff;
5184 int delay;
5185 struct xfs_bmbt_rec_host *ep;
5186 int error;
5187 struct xfs_bmbt_irec got;
5188 xfs_fileoff_t got_endoff;
5189 struct xfs_ifork *ifp;
5190 struct xfs_mount *mp;
5191 xfs_filblks_t nblks;
5192 struct xfs_bmbt_irec new;
5193 /* REFERENCED */
5194 uint qfield;
5195 xfs_filblks_t temp;
5196 xfs_filblks_t temp2;
5197 int state = BMAP_COWFORK;
5198 int eof;
5199 xfs_extnum_t eidx;
5200
5201 mp = ip->i_mount;
5202 XFS_STATS_INC(mp, xs_del_exlist);
5203
5204 ep = xfs_bmap_search_extents(ip, del->br_startoff, XFS_COW_FORK, &eof,
5205 &eidx, &got, &new);
5206
5207 ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); ifp = ifp;
5208 ASSERT((eidx >= 0) && (eidx < ifp->if_bytes /
5209 (uint)sizeof(xfs_bmbt_rec_t)));
5210 ASSERT(del->br_blockcount > 0);
5211 ASSERT(got.br_startoff <= del->br_startoff);
5212 del_endoff = del->br_startoff + del->br_blockcount;
5213 got_endoff = got.br_startoff + got.br_blockcount;
5214 ASSERT(got_endoff >= del_endoff);
5215 delay = isnullstartblock(got.br_startblock);
5216 ASSERT(isnullstartblock(del->br_startblock) == delay);
5217 qfield = 0;
5218 error = 0;
5219 /*
5220 * If deleting a real allocation, must free up the disk space.
5221 */
5222 if (!delay) {
5223 nblks = del->br_blockcount;
5224 qfield = XFS_TRANS_DQ_BCOUNT;
5225 /*
5226 * Set up del_endblock and cur for later.
5227 */
5228 del_endblock = del->br_startblock + del->br_blockcount;
5229 da_old = da_new = 0;
5230 } else {
5231 da_old = startblockval(got.br_startblock);
5232 da_new = 0;
5233 nblks = 0;
5234 }
5235 qfield = qfield;
5236 nblks = nblks;
5237
5238 /*
5239 * Set flag value to use in switch statement.
5240 * Left-contig is 2, right-contig is 1.
5241 */
5242 switch (((got.br_startoff == del->br_startoff) << 1) |
5243 (got_endoff == del_endoff)) {
5244 case 3:
5245 /*
5246 * Matches the whole extent. Delete the entry.
5247 */
5248 xfs_iext_remove(ip, eidx, 1, BMAP_COWFORK);
5249 --eidx;
5250 break;
5251
5252 case 2:
5253 /*
5254 * Deleting the first part of the extent.
5255 */
5256 trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
5257 xfs_bmbt_set_startoff(ep, del_endoff);
5258 temp = got.br_blockcount - del->br_blockcount;
5259 xfs_bmbt_set_blockcount(ep, temp);
5260 if (delay) {
5261 temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
5262 da_old);
5263 xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
5264 trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
5265 da_new = temp;
5266 break;
5267 }
5268 xfs_bmbt_set_startblock(ep, del_endblock);
5269 trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
5270 break;
5271
5272 case 1:
5273 /*
5274 * Deleting the last part of the extent.
5275 */
5276 temp = got.br_blockcount - del->br_blockcount;
5277 trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
5278 xfs_bmbt_set_blockcount(ep, temp);
5279 if (delay) {
5280 temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
5281 da_old);
5282 xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
5283 trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
5284 da_new = temp;
5285 break;
5286 }
5287 trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
5288 break;
5289
5290 case 0:
5291 /*
5292 * Deleting the middle of the extent.
5293 */
5294 temp = del->br_startoff - got.br_startoff;
5295 trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
5296 xfs_bmbt_set_blockcount(ep, temp);
5297 new.br_startoff = del_endoff;
5298 temp2 = got_endoff - del_endoff;
5299 new.br_blockcount = temp2;
5300 new.br_state = got.br_state;
5301 if (!delay) {
5302 new.br_startblock = del_endblock;
5303 } else {
5304 temp = xfs_bmap_worst_indlen(ip, temp);
5305 xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
5306 temp2 = xfs_bmap_worst_indlen(ip, temp2);
5307 new.br_startblock = nullstartblock((int)temp2);
5308 da_new = temp + temp2;
5309 while (da_new > da_old) {
5310 if (temp) {
5311 temp--;
5312 da_new--;
5313 xfs_bmbt_set_startblock(ep,
5314 nullstartblock((int)temp));
5315 }
5316 if (da_new == da_old)
5317 break;
5318 if (temp2) {
5319 temp2--;
5320 da_new--;
5321 new.br_startblock =
5322 nullstartblock((int)temp2);
5323 }
5324 }
5325 }
5326 trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
5327 xfs_iext_insert(ip, eidx + 1, 1, &new, state);
5328 ++eidx;
5329 break;
5330 }
5331
5332 /*
5333 * Account for change in delayed indirect blocks.
5334 * Nothing to do for disk quota accounting here.
5335 */
5336 ASSERT(da_old >= da_new);
5337 if (da_old > da_new)
5338 xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
5339
5340 return error;
5341}
5342
5017/* 5343/*
5018 * Unmap (remove) blocks from a file. 5344 * Unmap (remove) blocks from a file.
5019 * If nexts is nonzero then the number of extents to remove is limited to 5345 * If nexts is nonzero then the number of extents to remove is limited to
@@ -5021,17 +5347,16 @@ done:
5021 * *done is set. 5347 * *done is set.
5022 */ 5348 */
5023int /* error */ 5349int /* error */
5024xfs_bunmapi( 5350__xfs_bunmapi(
5025 xfs_trans_t *tp, /* transaction pointer */ 5351 xfs_trans_t *tp, /* transaction pointer */
5026 struct xfs_inode *ip, /* incore inode */ 5352 struct xfs_inode *ip, /* incore inode */
5027 xfs_fileoff_t bno, /* starting offset to unmap */ 5353 xfs_fileoff_t bno, /* starting offset to unmap */
5028 xfs_filblks_t len, /* length to unmap in file */ 5354 xfs_filblks_t *rlen, /* i/o: amount remaining */
5029 int flags, /* misc flags */ 5355 int flags, /* misc flags */
5030 xfs_extnum_t nexts, /* number of extents max */ 5356 xfs_extnum_t nexts, /* number of extents max */
5031 xfs_fsblock_t *firstblock, /* first allocated block 5357 xfs_fsblock_t *firstblock, /* first allocated block
5032 controls a.g. for allocs */ 5358 controls a.g. for allocs */
5033 struct xfs_defer_ops *dfops, /* i/o: list extents to free */ 5359 struct xfs_defer_ops *dfops) /* i/o: deferred updates */
5034 int *done) /* set if not done yet */
5035{ 5360{
5036 xfs_btree_cur_t *cur; /* bmap btree cursor */ 5361 xfs_btree_cur_t *cur; /* bmap btree cursor */
5037 xfs_bmbt_irec_t del; /* extent being deleted */ 5362 xfs_bmbt_irec_t del; /* extent being deleted */
@@ -5053,11 +5378,12 @@ xfs_bunmapi(
5053 int wasdel; /* was a delayed alloc extent */ 5378 int wasdel; /* was a delayed alloc extent */
5054 int whichfork; /* data or attribute fork */ 5379 int whichfork; /* data or attribute fork */
5055 xfs_fsblock_t sum; 5380 xfs_fsblock_t sum;
5381 xfs_filblks_t len = *rlen; /* length to unmap in file */
5056 5382
5057 trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_); 5383 trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
5058 5384
5059 whichfork = (flags & XFS_BMAPI_ATTRFORK) ? 5385 whichfork = xfs_bmapi_whichfork(flags);
5060 XFS_ATTR_FORK : XFS_DATA_FORK; 5386 ASSERT(whichfork != XFS_COW_FORK);
5061 ifp = XFS_IFORK_PTR(ip, whichfork); 5387 ifp = XFS_IFORK_PTR(ip, whichfork);
5062 if (unlikely( 5388 if (unlikely(
5063 XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS && 5389 XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
@@ -5079,7 +5405,7 @@ xfs_bunmapi(
5079 return error; 5405 return error;
5080 nextents = ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t); 5406 nextents = ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
5081 if (nextents == 0) { 5407 if (nextents == 0) {
5082 *done = 1; 5408 *rlen = 0;
5083 return 0; 5409 return 0;
5084 } 5410 }
5085 XFS_STATS_INC(mp, xs_blk_unmap); 5411 XFS_STATS_INC(mp, xs_blk_unmap);
@@ -5324,7 +5650,7 @@ xfs_bunmapi(
5324 cur->bc_private.b.flags &= ~XFS_BTCUR_BPRV_WASDEL; 5650 cur->bc_private.b.flags &= ~XFS_BTCUR_BPRV_WASDEL;
5325 5651
5326 error = xfs_bmap_del_extent(ip, tp, &lastx, dfops, cur, &del, 5652 error = xfs_bmap_del_extent(ip, tp, &lastx, dfops, cur, &del,
5327 &tmp_logflags, whichfork); 5653 &tmp_logflags, whichfork, flags);
5328 logflags |= tmp_logflags; 5654 logflags |= tmp_logflags;
5329 if (error) 5655 if (error)
5330 goto error0; 5656 goto error0;
@@ -5350,7 +5676,10 @@ nodelete:
5350 extno++; 5676 extno++;
5351 } 5677 }
5352 } 5678 }
5353 *done = bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0; 5679 if (bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0)
5680 *rlen = 0;
5681 else
5682 *rlen = bno - start + 1;
5354 5683
5355 /* 5684 /*
5356 * Convert to a btree if necessary. 5685 * Convert to a btree if necessary.
@@ -5406,6 +5735,27 @@ error0:
5406 return error; 5735 return error;
5407} 5736}
5408 5737
5738/* Unmap a range of a file. */
5739int
5740xfs_bunmapi(
5741 xfs_trans_t *tp,
5742 struct xfs_inode *ip,
5743 xfs_fileoff_t bno,
5744 xfs_filblks_t len,
5745 int flags,
5746 xfs_extnum_t nexts,
5747 xfs_fsblock_t *firstblock,
5748 struct xfs_defer_ops *dfops,
5749 int *done)
5750{
5751 int error;
5752
5753 error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
5754 dfops);
5755 *done = (len == 0);
5756 return error;
5757}
5758
5409/* 5759/*
5410 * Determine whether an extent shift can be accomplished by a merge with the 5760 * Determine whether an extent shift can be accomplished by a merge with the
5411 * extent that precedes the target hole of the shift. 5761 * extent that precedes the target hole of the shift.
@@ -5985,3 +6335,146 @@ out:
5985 xfs_trans_cancel(tp); 6335 xfs_trans_cancel(tp);
5986 return error; 6336 return error;
5987} 6337}
6338
6339/* Deferred mapping is only for real extents in the data fork. */
6340static bool
6341xfs_bmap_is_update_needed(
6342 struct xfs_bmbt_irec *bmap)
6343{
6344 return bmap->br_startblock != HOLESTARTBLOCK &&
6345 bmap->br_startblock != DELAYSTARTBLOCK;
6346}
6347
6348/* Record a bmap intent. */
6349static int
6350__xfs_bmap_add(
6351 struct xfs_mount *mp,
6352 struct xfs_defer_ops *dfops,
6353 enum xfs_bmap_intent_type type,
6354 struct xfs_inode *ip,
6355 int whichfork,
6356 struct xfs_bmbt_irec *bmap)
6357{
6358 int error;
6359 struct xfs_bmap_intent *bi;
6360
6361 trace_xfs_bmap_defer(mp,
6362 XFS_FSB_TO_AGNO(mp, bmap->br_startblock),
6363 type,
6364 XFS_FSB_TO_AGBNO(mp, bmap->br_startblock),
6365 ip->i_ino, whichfork,
6366 bmap->br_startoff,
6367 bmap->br_blockcount,
6368 bmap->br_state);
6369
6370 bi = kmem_alloc(sizeof(struct xfs_bmap_intent), KM_SLEEP | KM_NOFS);
6371 INIT_LIST_HEAD(&bi->bi_list);
6372 bi->bi_type = type;
6373 bi->bi_owner = ip;
6374 bi->bi_whichfork = whichfork;
6375 bi->bi_bmap = *bmap;
6376
6377 error = xfs_defer_join(dfops, bi->bi_owner);
6378 if (error) {
6379 kmem_free(bi);
6380 return error;
6381 }
6382
6383 xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
6384 return 0;
6385}
6386
6387/* Map an extent into a file. */
6388int
6389xfs_bmap_map_extent(
6390 struct xfs_mount *mp,
6391 struct xfs_defer_ops *dfops,
6392 struct xfs_inode *ip,
6393 struct xfs_bmbt_irec *PREV)
6394{
6395 if (!xfs_bmap_is_update_needed(PREV))
6396 return 0;
6397
6398 return __xfs_bmap_add(mp, dfops, XFS_BMAP_MAP, ip,
6399 XFS_DATA_FORK, PREV);
6400}
6401
6402/* Unmap an extent out of a file. */
6403int
6404xfs_bmap_unmap_extent(
6405 struct xfs_mount *mp,
6406 struct xfs_defer_ops *dfops,
6407 struct xfs_inode *ip,
6408 struct xfs_bmbt_irec *PREV)
6409{
6410 if (!xfs_bmap_is_update_needed(PREV))
6411 return 0;
6412
6413 return __xfs_bmap_add(mp, dfops, XFS_BMAP_UNMAP, ip,
6414 XFS_DATA_FORK, PREV);
6415}
6416
6417/*
6418 * Process one of the deferred bmap operations. We pass back the
6419 * btree cursor to maintain our lock on the bmapbt between calls.
6420 */
6421int
6422xfs_bmap_finish_one(
6423 struct xfs_trans *tp,
6424 struct xfs_defer_ops *dfops,
6425 struct xfs_inode *ip,
6426 enum xfs_bmap_intent_type type,
6427 int whichfork,
6428 xfs_fileoff_t startoff,
6429 xfs_fsblock_t startblock,
6430 xfs_filblks_t blockcount,
6431 xfs_exntst_t state)
6432{
6433 struct xfs_bmbt_irec bmap;
6434 int nimaps = 1;
6435 xfs_fsblock_t firstfsb;
6436 int flags = XFS_BMAPI_REMAP;
6437 int done;
6438 int error = 0;
6439
6440 bmap.br_startblock = startblock;
6441 bmap.br_startoff = startoff;
6442 bmap.br_blockcount = blockcount;
6443 bmap.br_state = state;
6444
6445 trace_xfs_bmap_deferred(tp->t_mountp,
6446 XFS_FSB_TO_AGNO(tp->t_mountp, startblock), type,
6447 XFS_FSB_TO_AGBNO(tp->t_mountp, startblock),
6448 ip->i_ino, whichfork, startoff, blockcount, state);
6449
6450 if (whichfork != XFS_DATA_FORK && whichfork != XFS_ATTR_FORK)
6451 return -EFSCORRUPTED;
6452 if (whichfork == XFS_ATTR_FORK)
6453 flags |= XFS_BMAPI_ATTRFORK;
6454
6455 if (XFS_TEST_ERROR(false, tp->t_mountp,
6456 XFS_ERRTAG_BMAP_FINISH_ONE,
6457 XFS_RANDOM_BMAP_FINISH_ONE))
6458 return -EIO;
6459
6460 switch (type) {
6461 case XFS_BMAP_MAP:
6462 firstfsb = bmap.br_startblock;
6463 error = xfs_bmapi_write(tp, ip, bmap.br_startoff,
6464 bmap.br_blockcount, flags, &firstfsb,
6465 bmap.br_blockcount, &bmap, &nimaps,
6466 dfops);
6467 break;
6468 case XFS_BMAP_UNMAP:
6469 error = xfs_bunmapi(tp, ip, bmap.br_startoff,
6470 bmap.br_blockcount, flags, 1, &firstfsb,
6471 dfops, &done);
6472 ASSERT(done);
6473 break;
6474 default:
6475 ASSERT(0);
6476 error = -EFSCORRUPTED;
6477 }
6478
6479 return error;
6480}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 8395f6e8cf7d..f97db7132564 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -97,6 +97,19 @@ struct xfs_extent_free_item
97 */ 97 */
98#define XFS_BMAPI_ZERO 0x080 98#define XFS_BMAPI_ZERO 0x080
99 99
100/*
101 * Map the inode offset to the block given in ap->firstblock. Primarily
102 * used for reflink. The range must be in a hole, and this flag cannot be
103 * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
104 *
105 * For bunmapi, this flag unmaps the range without adjusting quota, reducing
106 * refcount, or freeing the blocks.
107 */
108#define XFS_BMAPI_REMAP 0x100
109
110/* Map something in the CoW fork. */
111#define XFS_BMAPI_COWFORK 0x200
112
100#define XFS_BMAPI_FLAGS \ 113#define XFS_BMAPI_FLAGS \
101 { XFS_BMAPI_ENTIRE, "ENTIRE" }, \ 114 { XFS_BMAPI_ENTIRE, "ENTIRE" }, \
102 { XFS_BMAPI_METADATA, "METADATA" }, \ 115 { XFS_BMAPI_METADATA, "METADATA" }, \
@@ -105,12 +118,24 @@ struct xfs_extent_free_item
105 { XFS_BMAPI_IGSTATE, "IGSTATE" }, \ 118 { XFS_BMAPI_IGSTATE, "IGSTATE" }, \
106 { XFS_BMAPI_CONTIG, "CONTIG" }, \ 119 { XFS_BMAPI_CONTIG, "CONTIG" }, \
107 { XFS_BMAPI_CONVERT, "CONVERT" }, \ 120 { XFS_BMAPI_CONVERT, "CONVERT" }, \
108 { XFS_BMAPI_ZERO, "ZERO" } 121 { XFS_BMAPI_ZERO, "ZERO" }, \
122 { XFS_BMAPI_REMAP, "REMAP" }, \
123 { XFS_BMAPI_COWFORK, "COWFORK" }
109 124
110 125
111static inline int xfs_bmapi_aflag(int w) 126static inline int xfs_bmapi_aflag(int w)
112{ 127{
113 return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK : 0); 128 return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK :
129 (w == XFS_COW_FORK ? XFS_BMAPI_COWFORK : 0));
130}
131
132static inline int xfs_bmapi_whichfork(int bmapi_flags)
133{
134 if (bmapi_flags & XFS_BMAPI_COWFORK)
135 return XFS_COW_FORK;
136 else if (bmapi_flags & XFS_BMAPI_ATTRFORK)
137 return XFS_ATTR_FORK;
138 return XFS_DATA_FORK;
114} 139}
115 140
116/* 141/*
@@ -131,13 +156,15 @@ static inline int xfs_bmapi_aflag(int w)
131#define BMAP_LEFT_VALID (1 << 6) 156#define BMAP_LEFT_VALID (1 << 6)
132#define BMAP_RIGHT_VALID (1 << 7) 157#define BMAP_RIGHT_VALID (1 << 7)
133#define BMAP_ATTRFORK (1 << 8) 158#define BMAP_ATTRFORK (1 << 8)
159#define BMAP_COWFORK (1 << 9)
134 160
135#define XFS_BMAP_EXT_FLAGS \ 161#define XFS_BMAP_EXT_FLAGS \
136 { BMAP_LEFT_CONTIG, "LC" }, \ 162 { BMAP_LEFT_CONTIG, "LC" }, \
137 { BMAP_RIGHT_CONTIG, "RC" }, \ 163 { BMAP_RIGHT_CONTIG, "RC" }, \
138 { BMAP_LEFT_FILLING, "LF" }, \ 164 { BMAP_LEFT_FILLING, "LF" }, \
139 { BMAP_RIGHT_FILLING, "RF" }, \ 165 { BMAP_RIGHT_FILLING, "RF" }, \
140 { BMAP_ATTRFORK, "ATTR" } 166 { BMAP_ATTRFORK, "ATTR" }, \
167 { BMAP_COWFORK, "COW" }
141 168
142 169
143/* 170/*
@@ -186,10 +213,15 @@ int xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
186 xfs_fsblock_t *firstblock, xfs_extlen_t total, 213 xfs_fsblock_t *firstblock, xfs_extlen_t total,
187 struct xfs_bmbt_irec *mval, int *nmap, 214 struct xfs_bmbt_irec *mval, int *nmap,
188 struct xfs_defer_ops *dfops); 215 struct xfs_defer_ops *dfops);
216int __xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
217 xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
218 xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
219 struct xfs_defer_ops *dfops);
189int xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip, 220int xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
190 xfs_fileoff_t bno, xfs_filblks_t len, int flags, 221 xfs_fileoff_t bno, xfs_filblks_t len, int flags,
191 xfs_extnum_t nexts, xfs_fsblock_t *firstblock, 222 xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
192 struct xfs_defer_ops *dfops, int *done); 223 struct xfs_defer_ops *dfops, int *done);
224int xfs_bunmapi_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *del);
193int xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx, 225int xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
194 xfs_extnum_t num); 226 xfs_extnum_t num);
195uint xfs_default_attroffset(struct xfs_inode *ip); 227uint xfs_default_attroffset(struct xfs_inode *ip);
@@ -203,8 +235,31 @@ struct xfs_bmbt_rec_host *
203 xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno, 235 xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno,
204 int fork, int *eofp, xfs_extnum_t *lastxp, 236 int fork, int *eofp, xfs_extnum_t *lastxp,
205 struct xfs_bmbt_irec *gotp, struct xfs_bmbt_irec *prevp); 237 struct xfs_bmbt_irec *gotp, struct xfs_bmbt_irec *prevp);
206int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, xfs_fileoff_t aoff, 238int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
207 xfs_filblks_t len, struct xfs_bmbt_irec *got, 239 xfs_fileoff_t aoff, xfs_filblks_t len,
208 struct xfs_bmbt_irec *prev, xfs_extnum_t *lastx, int eof); 240 struct xfs_bmbt_irec *got, struct xfs_bmbt_irec *prev,
241 xfs_extnum_t *lastx, int eof);
242
243enum xfs_bmap_intent_type {
244 XFS_BMAP_MAP = 1,
245 XFS_BMAP_UNMAP,
246};
247
248struct xfs_bmap_intent {
249 struct list_head bi_list;
250 enum xfs_bmap_intent_type bi_type;
251 struct xfs_inode *bi_owner;
252 int bi_whichfork;
253 struct xfs_bmbt_irec bi_bmap;
254};
255
256int xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_defer_ops *dfops,
257 struct xfs_inode *ip, enum xfs_bmap_intent_type type,
258 int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
259 xfs_filblks_t blockcount, xfs_exntst_t state);
260int xfs_bmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
261 struct xfs_inode *ip, struct xfs_bmbt_irec *imap);
262int xfs_bmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
263 struct xfs_inode *ip, struct xfs_bmbt_irec *imap);
209 264
210#endif /* __XFS_BMAP_H__ */ 265#endif /* __XFS_BMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index cd85274e810c..8007d2ba9aef 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -453,6 +453,7 @@ xfs_bmbt_alloc_block(
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		/*
 		 * Make sure there is sufficient room left in the AG to
@@ -482,6 +483,22 @@ xfs_bmbt_alloc_block(
 	if (error)
 		goto error0;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		cur->bc_private.b.dfops->dop_low = true;
+		args.fsbno = cur->bc_private.b.firstblock;
+		goto try_another_ag;
+	}
+
 	if (args.fsbno == NULLFSBLOCK && args.minleft) {
 		/*
 		 * Could not find an AG with enough free space to satisfy
@@ -777,6 +794,7 @@ xfs_bmbt_init_cursor(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_btree_cur	*cur;
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index aa1752f918b8..5c8e6f2ce44f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -45,9 +45,10 @@ kmem_zone_t *xfs_btree_cur_zone;
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
 	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
-	  XFS_FIBT_MAGIC },
+	  XFS_FIBT_MAGIC, 0 },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC,
+	  XFS_REFC_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1216,6 +1217,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_RMAP:
 		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
 		break;
+	case XFS_BTNUM_REFC:
+		xfs_buf_set_ref(bp, XFS_REFC_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 3f8556a5c2ad..c2b01d1c79ee 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -49,6 +49,7 @@ union xfs_btree_key {
 	struct xfs_inobt_key		inobt;
 	struct xfs_rmap_key		rmap;
 	struct xfs_rmap_key		__rmap_bigkey[2];
+	struct xfs_refcount_key		refc;
 };
 
 union xfs_btree_rec {
@@ -57,6 +58,7 @@ union xfs_btree_rec {
 	struct xfs_alloc_rec		alloc;
 	struct xfs_inobt_rec		inobt;
 	struct xfs_rmap_rec		rmap;
+	struct xfs_refcount_rec		refc;
 };
 
 /*
@@ -72,6 +74,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
+#define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
 
 /*
  * For logging record fields.
@@ -105,6 +108,7 @@ do { \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
+	case XFS_BTNUM_REFC: __XFS_BTREE_STATS_INC(__mp, refcbt, stat); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
 	} \
 } while (0)
@@ -127,6 +131,8 @@ do { \
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
 	case XFS_BTNUM_RMAP: \
 		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
+	case XFS_BTNUM_REFC: \
+		__XFS_BTREE_STATS_ADD(__mp, refcbt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
 	} \
 } while (0)
@@ -217,6 +223,15 @@ union xfs_btree_irec {
 	struct xfs_bmbt_irec		b;
 	struct xfs_inobt_rec_incore	i;
 	struct xfs_rmap_irec		r;
+	struct xfs_refcount_irec	rc;
+};
+
+/* Per-AG btree private information. */
+union xfs_btree_cur_private {
+	struct {
+		unsigned long	nr_ops;		/* # record updates */
+		int		shape_changes;	/* # of extent splits */
+	} refc;
 };
 
 /*
@@ -243,6 +258,7 @@ typedef struct xfs_btree_cur
 		struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
 		struct xfs_defer_ops *dfops;	/* deferred updates */
 		xfs_agnumber_t	agno;	/* ag number */
+		union xfs_btree_cur_private	priv;
 	} a;
 	struct {			/* needed for BMAP */
 		struct xfs_inode *ip;	/* pointer to our inode */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index e96533d178cf..f6e93ef0bffe 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,8 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_BMAP,
+	XFS_DEFER_OPS_TYPE_REFCOUNT,
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_MAX,
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 270fb5cf4fa1..f6547fc5e016 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -456,9 +456,11 @@ xfs_sb_has_compat_feature(
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
+#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -546,6 +548,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
 }
 
+static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
+}
+
 /*
  * end of superblock version macros
  */
@@ -641,14 +649,17 @@ typedef struct xfs_agf {
 	uuid_t		agf_uuid;	/* uuid of filesystem */
 
 	__be32		agf_rmap_blocks;	/* rmapbt blocks used */
-	__be32		agf_padding;		/* padding */
+	__be32		agf_refcount_blocks;	/* refcountbt blocks used */
+
+	__be32		agf_refcount_root;	/* refcount tree root block */
+	__be32		agf_refcount_level;	/* refcount btree levels */
 
 	/*
 	 * reserve some contiguous space for future logged fields before we add
 	 * the unlogged fields. This makes the range logging via flags and
 	 * structure offsets much simpler.
 	 */
-	__be64		agf_spare64[15];
+	__be64		agf_spare64[14];
 
 	/* unlogged fields, written during buffer writeback. */
 	__be64		agf_lsn;	/* last write sequence */
@@ -674,8 +685,11 @@ typedef struct xfs_agf {
 #define	XFS_AGF_BTREEBLKS	0x00000800
 #define	XFS_AGF_UUID		0x00001000
 #define	XFS_AGF_RMAP_BLOCKS	0x00002000
-#define	XFS_AGF_SPARE64		0x00004000
-#define	XFS_AGF_NUM_BITS	15
+#define	XFS_AGF_REFCOUNT_BLOCKS	0x00004000
+#define	XFS_AGF_REFCOUNT_ROOT	0x00008000
+#define	XFS_AGF_REFCOUNT_LEVEL	0x00010000
+#define	XFS_AGF_SPARE64		0x00020000
+#define	XFS_AGF_NUM_BITS	18
 #define	XFS_AGF_ALL_BITS	((1 << XFS_AGF_NUM_BITS) - 1)
 
 #define XFS_AGF_FLAGS \
@@ -693,6 +707,9 @@ typedef struct xfs_agf {
 	{ XFS_AGF_BTREEBLKS,	"BTREEBLKS" }, \
 	{ XFS_AGF_UUID,		"UUID" }, \
 	{ XFS_AGF_RMAP_BLOCKS,	"RMAP_BLOCKS" }, \
+	{ XFS_AGF_REFCOUNT_BLOCKS,	"REFCOUNT_BLOCKS" }, \
+	{ XFS_AGF_REFCOUNT_ROOT,	"REFCOUNT_ROOT" }, \
+	{ XFS_AGF_REFCOUNT_LEVEL,	"REFCOUNT_LEVEL" }, \
 	{ XFS_AGF_SPARE64,	"SPARE64" }
 
 /* disk block (xfs_daddr_t) in the AG */
@@ -885,7 +902,8 @@ typedef struct xfs_dinode {
 	__be64		di_changecount;	/* number of attribute changes */
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
-	__u8		di_pad2[16];	/* more padding for future expansion */
+	__be32		di_cowextsize;	/* basic cow extent size for file */
+	__u8		di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
@@ -1041,9 +1059,14 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  * 16 bits of the XFS_XFLAG_s range.
  */
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
+#define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
+#define XFS_DIFLAG2_COWEXTSIZE_BIT	2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
+#define XFS_DIFLAG2_REFLINK	(1 << XFS_DIFLAG2_REFLINK_BIT)
+#define XFS_DIFLAG2_COWEXTSIZE	(1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 
-#define XFS_DIFLAG2_ANY		(XFS_DIFLAG2_DAX)
+#define XFS_DIFLAG2_ANY \
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
 
 /*
  * Inode number format:
@@ -1353,7 +1376,9 @@ struct xfs_owner_info {
 #define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
 #define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
-#define XFS_RMAP_OWN_MIN	(-8ULL)	/* guard */
+#define XFS_RMAP_OWN_REFC	(-8ULL)	/* refcount tree */
+#define XFS_RMAP_OWN_COW	(-9ULL)	/* cow allocations */
+#define XFS_RMAP_OWN_MIN	(-10ULL) /* guard */
 
 #define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
 
@@ -1434,6 +1459,62 @@ typedef __be32 xfs_rmap_ptr_t;
 	 XFS_IBT_BLOCK(mp) + 1)
 
 /*
+ * Reference Count Btree format definitions
+ *
+ */
+#define	XFS_REFC_CRC_MAGIC	0x52334643	/* 'R3FC' */
+
+unsigned int xfs_refc_block(struct xfs_mount *mp);
+
+/*
+ * Data record/key structure
+ *
+ * Each record associates a range of physical blocks (starting at
+ * rc_startblock and ending rc_blockcount blocks later) with a reference
+ * count (rc_refcount).  Extents that are being used to stage a copy on
+ * write (CoW) operation are recorded in the refcount btree with a
+ * refcount of 1.  All other records must have a refcount > 1 and must
+ * track an extent mapped only by file data forks.
+ *
+ * Extents with a single owner (attributes, metadata, non-shared file
+ * data) are not tracked here.  Free space is also not tracked here.
+ * This is consistent with pre-reflink XFS.
+ */
+
+/*
+ * Extents that are being used to stage a copy on write are stored
+ * in the refcount btree with a refcount of 1 and the upper bit set
+ * on the startblock.  This speeds up mount time deletion of stale
+ * staging extents because they're all at the right side of the tree.
+ */
+#define XFS_REFC_COW_START		((xfs_agblock_t)(1U << 31))
+#define REFCNTBT_COWFLAG_BITLEN		1
+#define REFCNTBT_AGBLOCK_BITLEN		31
+
+struct xfs_refcount_rec {
+	__be32		rc_startblock;	/* starting block number */
+	__be32		rc_blockcount;	/* count of blocks */
+	__be32		rc_refcount;	/* number of inodes linked here */
+};
+
+struct xfs_refcount_key {
+	__be32		rc_startblock;	/* starting block number */
+};
+
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
+	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+};
+
+#define MAXREFCOUNT	((xfs_nlink_t)~0U)
+#define MAXREFCEXTLEN	((xfs_extlen_t)~0U)
+
+/* btree pointer type */
+typedef __be32 xfs_refcount_ptr_t;
+
+
+/*
  * BMAP Btree format definitions
 
  * This includes both the root block definition that sits inside an inode fork
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 79455058b752..b72dc821d78b 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -81,14 +81,16 @@ struct getbmapx {
 #define BMV_IF_PREALLOC		0x4	/* rtn status BMV_OF_PREALLOC if req */
 #define BMV_IF_DELALLOC		0x8	/* rtn status BMV_OF_DELALLOC if req */
 #define BMV_IF_NO_HOLES		0x10	/* Do not return holes */
+#define BMV_IF_COWFORK		0x20	/* return CoW fork rather than data */
 #define BMV_IF_VALID	\
 	(BMV_IF_ATTRFORK|BMV_IF_NO_DMAPI_READ|BMV_IF_PREALLOC|	\
-	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES)
+	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES|BMV_IF_COWFORK)
 
 /*	bmv_oflags values - returned for each non-header segment */
 #define BMV_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
 #define BMV_OF_DELALLOC		0x2	/* segment = delayed allocation */
 #define BMV_OF_LAST		0x4	/* segment is the last in the file */
+#define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
  * Structure for XFS_IOC_FSSETDM.
@@ -206,7 +208,8 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks */
-#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000 /* files can share blocks */
 
 /*
  * Minimum and maximum sizes need for growth checks.
@@ -275,7 +278,8 @@ typedef struct xfs_bstat {
 #define	bs_projid	bs_projid_lo	/* (previously just bs_projid) */
 	__u16		bs_forkoff;	/* inode fork offset in bytes */
 	__u16		bs_projid_hi;	/* higher part of project id */
-	unsigned char	bs_pad[10];	/* pad space, unused */
+	unsigned char	bs_pad[6];	/* pad space, unused */
+	__u32		bs_cowextsize;	/* cow extent size */
 	__u32		bs_dmevmask;	/* DMIG event mask */
 	__u16		bs_dmstate;	/* DMIG state info */
 	__u16		bs_aextents;	/* attribute number of extents */
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 4b9769e23c83..8de9a3a29589 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -256,6 +256,7 @@ xfs_inode_from_disk(
 		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
 		to->di_flags2 = be64_to_cpu(from->di_flags2);
+		to->di_cowextsize = be32_to_cpu(from->di_cowextsize);
 	}
 }
 
@@ -305,7 +306,7 @@ xfs_inode_to_disk(
 	to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 	to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 	to->di_flags2 = cpu_to_be64(from->di_flags2);
-
+	to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 	to->di_ino = cpu_to_be64(ip->i_ino);
 	to->di_lsn = cpu_to_be64(lsn);
 	memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -357,6 +358,7 @@ xfs_log_dinode_to_disk(
 	to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 	to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 	to->di_flags2 = cpu_to_be64(from->di_flags2);
+	to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 	to->di_ino = cpu_to_be64(from->di_ino);
 	to->di_lsn = cpu_to_be64(from->di_lsn);
 	memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
@@ -373,6 +375,9 @@ xfs_dinode_verify(
 	struct xfs_inode	*ip,
 	struct xfs_dinode	*dip)
 {
+	uint16_t		flags;
+	uint64_t		flags2;
+
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return false;
 
@@ -389,6 +394,23 @@ xfs_dinode_verify(
 		return false;
 	if (!uuid_equal(&dip->di_uuid, &mp->m_sb.sb_meta_uuid))
 		return false;
+
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	/* don't allow reflink/cowextsize if we don't have reflink */
+	if ((flags2 & (XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)) &&
+	    !xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+
+	/* don't let reflink and realtime mix */
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
+		return false;
+
+	/* don't let reflink and dax mix */
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags2 & XFS_DIFLAG2_DAX))
+		return false;
+
 	return true;
 }
 
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 7c4dd321b215..62d9d4681c8c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -47,6 +47,7 @@ struct xfs_icdinode {
 	__uint16_t	di_flags;	/* random flags, XFS_DIFLAG_... */
 
 	__uint64_t	di_flags2;	/* more random flags */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
 
 	xfs_ictimestamp_t di_crtime;	/* time created */
 };
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index bbcc8c7a44b3..5dd56d3dbb3a 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -121,6 +121,26 @@ xfs_iformat_fork(
 		return -EFSCORRUPTED;
 	}
 
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (VFS_I(ip)->i_mode & S_IFMT) != S_IFREG)) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, wrong file type for reflink.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (ip->i_d.di_flags & XFS_DIFLAG_REALTIME))) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, has reflink+realtime flag set.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
 	switch (VFS_I(ip)->i_mode & S_IFMT) {
 	case S_IFIFO:
 	case S_IFCHR:
@@ -186,9 +206,14 @@ xfs_iformat_fork(
 		XFS_ERROR_REPORT("xfs_iformat(7)", XFS_ERRLEVEL_LOW, ip->i_mount);
 		return -EFSCORRUPTED;
 	}
-	if (error) {
+	if (error)
 		return error;
+
+	if (xfs_is_reflink_inode(ip)) {
+		ASSERT(ip->i_cowfp == NULL);
+		xfs_ifork_init_cow(ip);
 	}
+
 	if (!XFS_DFORK_Q(dip))
 		return 0;
 
@@ -208,7 +233,8 @@ xfs_iformat_fork(
 			XFS_CORRUPTION_ERROR("xfs_iformat(8)",
 					     XFS_ERRLEVEL_LOW,
 					     ip->i_mount, dip);
-			return -EFSCORRUPTED;
+			error = -EFSCORRUPTED;
+			break;
 		}
 
 		error = xfs_iformat_local(ip, dip, XFS_ATTR_FORK, size);
@@ -226,6 +252,9 @@ xfs_iformat_fork(
 	if (error) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+		if (ip->i_cowfp)
+			kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 		xfs_idestroy_fork(ip, XFS_DATA_FORK);
 	}
 	return error;
@@ -740,6 +769,9 @@ xfs_idestroy_fork(
 	if (whichfork == XFS_ATTR_FORK) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+	} else if (whichfork == XFS_COW_FORK) {
+		kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 	}
 }
 
@@ -927,6 +959,19 @@ xfs_iext_get_ext(
 	}
 }
 
+/* Convert bmap state flags to an inode fork. */
+struct xfs_ifork *
+xfs_iext_state_to_fork(
+	struct xfs_inode	*ip,
+	int			state)
+{
+	if (state & BMAP_COWFORK)
+		return ip->i_cowfp;
+	else if (state & BMAP_ATTRFORK)
+		return ip->i_afp;
+	return &ip->i_df;
+}
+
 /*
  * Insert new item(s) into the extent records for incore inode
  * fork 'ifp'.  'count' new items are inserted at index 'idx'.
@@ -939,7 +984,7 @@ xfs_iext_insert(
 	xfs_bmbt_irec_t	*new,		/* items to insert */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = xfs_iext_state_to_fork(ip, state);
 	xfs_extnum_t	i;		/* extent record index */
 
 	trace_xfs_iext_insert(ip, idx, new, state, _RET_IP_);
@@ -1189,7 +1234,7 @@ xfs_iext_remove(
 	int		ext_diff,	/* number of extents to remove */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = xfs_iext_state_to_fork(ip, state);
 	xfs_extnum_t	nextents;	/* number of extents in file */
 	int		new_size;	/* size of extents after removal */
 
@@ -1934,3 +1979,20 @@ xfs_iext_irec_update_extoffs(
 		ifp->if_u1.if_ext_irec[i].er_extoff += ext_diff;
 	}
 }
+
+/*
+ * Initialize an inode's copy-on-write fork.
+ */
+void
+xfs_ifork_init_cow(
+	struct xfs_inode	*ip)
+{
+	if (ip->i_cowfp)
+		return;
+
+	ip->i_cowfp = kmem_zone_zalloc(xfs_ifork_zone,
+				       KM_SLEEP | KM_NOFS);
+	ip->i_cowfp->if_flags = XFS_IFEXTENTS;
+	ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
+	ip->i_cnextents = 0;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index f95e072ae646..c9476f50e32d 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -92,7 +92,9 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_PTR(ip,w)		\
 	((w) == XFS_DATA_FORK ?		\
 		&(ip)->i_df :		\
-		(ip)->i_afp)
+		((w) == XFS_ATTR_FORK ?	\
+			(ip)->i_afp :	\
+			(ip)->i_cowfp))
 #define XFS_IFORK_DSIZE(ip) \
 	(XFS_IFORK_Q(ip) ? \
 		XFS_IFORK_BOFF(ip) : \
@@ -105,26 +107,38 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_SIZE(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		XFS_IFORK_DSIZE(ip) : \
-		XFS_IFORK_ASIZE(ip))
+		((w) == XFS_ATTR_FORK ? \
+			XFS_IFORK_ASIZE(ip) : \
+			0))
 #define XFS_IFORK_FORMAT(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_format : \
-		(ip)->i_d.di_aformat)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_aformat : \
+			(ip)->i_cformat))
 #define XFS_IFORK_FMT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_format = (n)) : \
-		((ip)->i_d.di_aformat = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_aformat = (n)) : \
+			((ip)->i_cformat = (n))))
 #define XFS_IFORK_NEXTENTS(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_nextents : \
-		(ip)->i_d.di_anextents)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_anextents : \
+			(ip)->i_cnextents))
 #define XFS_IFORK_NEXT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_nextents = (n)) : \
-		((ip)->i_d.di_anextents = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_anextents = (n)) : \
+			((ip)->i_cnextents = (n))))
 #define XFS_IFORK_MAXEXT(ip, w) \
 	(XFS_IFORK_SIZE(ip, w) / sizeof(xfs_bmbt_rec_t))
 
+struct xfs_ifork *xfs_iext_state_to_fork(struct xfs_inode *ip, int state);
+
 int		xfs_iformat_fork(struct xfs_inode *, struct xfs_dinode *);
 void		xfs_iflush_fork(struct xfs_inode *, struct xfs_dinode *,
 				struct xfs_inode_log_item *, int);
@@ -169,4 +183,6 @@ void xfs_iext_irec_update_extoffs(struct xfs_ifork *, int, int);
 
 extern struct kmem_zone	*xfs_ifork_zone;
 
+extern void xfs_ifork_init_cow(struct xfs_inode *ip);
+
 #endif	/* __XFS_INODE_FORK_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index fc5eef85d61e..083cdd6d6c28 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -112,7 +112,11 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_ICREATE		20
 #define XLOG_REG_TYPE_RUI_FORMAT	21
 #define XLOG_REG_TYPE_RUD_FORMAT	22
-#define XLOG_REG_TYPE_MAX		22
+#define XLOG_REG_TYPE_CUI_FORMAT	23
+#define XLOG_REG_TYPE_CUD_FORMAT	24
+#define XLOG_REG_TYPE_BUI_FORMAT	25
+#define XLOG_REG_TYPE_BUD_FORMAT	26
+#define XLOG_REG_TYPE_MAX		26
 
 /*
  * Flags to log operation header
@@ -231,6 +235,10 @@ typedef struct xfs_trans_header {
 #define XFS_LI_ICREATE		0x123f
 #define XFS_LI_RUI		0x1240	/* rmap update intent */
 #define XFS_LI_RUD		0x1241
+#define XFS_LI_CUI		0x1242	/* refcount update intent */
+#define XFS_LI_CUD		0x1243
+#define XFS_LI_BUI		0x1244	/* bmbt update intent */
+#define XFS_LI_BUD		0x1245
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -242,7 +250,11 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
 	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
 	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
-	{ XFS_LI_RUD,		"XFS_LI_RUD" }
+	{ XFS_LI_RUD,		"XFS_LI_RUD" }, \
+	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
+	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
+	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
+	{ XFS_LI_BUD,		"XFS_LI_BUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -411,7 +423,8 @@ struct xfs_log_dinode {
 	__uint64_t	di_changecount;	/* number of attribute changes */
 	xfs_lsn_t	di_lsn;		/* flush sequence */
 	__uint64_t	di_flags2;	/* more random flags */
-	__uint8_t	di_pad2[16];	/* more padding for future expansion */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
+	__uint8_t	di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
@@ -622,8 +635,11 @@ struct xfs_map_extent {
 
 /* rmap me_flags: upper bits are flags, lower byte is type code */
 #define XFS_RMAP_EXTENT_MAP		1
+#define XFS_RMAP_EXTENT_MAP_SHARED	2
 #define XFS_RMAP_EXTENT_UNMAP		3
+#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
 #define XFS_RMAP_EXTENT_CONVERT		5
+#define XFS_RMAP_EXTENT_CONVERT_SHARED	6
 #define XFS_RMAP_EXTENT_ALLOC		7
 #define XFS_RMAP_EXTENT_FREE		8
 #define XFS_RMAP_EXTENT_TYPE_MASK	0xFF
@@ -671,6 +687,102 @@ struct xfs_rud_log_format {
 };
 
 /*
+ * CUI/CUD (refcount update) log format definitions
+ */
+struct xfs_phys_extent {
+	__uint64_t		pe_startblock;
+	__uint32_t		pe_len;
+	__uint32_t		pe_flags;
+};
+
+/* refcount pe_flags: upper bits are flags, lower byte is type code */
+/* Type codes are taken directly from enum xfs_refcount_intent_type. */
+#define XFS_REFCOUNT_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_REFCOUNT_EXTENT_FLAGS	(XFS_REFCOUNT_EXTENT_TYPE_MASK)
+
+/*
+ * This is the structure used to lay out a cui log item in the
+ * log.  The cui_extents field is a variable size array whose
+ * size is given by cui_nextents.
+ */
+struct xfs_cui_log_format {
+	__uint16_t		cui_type;	/* cui log item type */
+	__uint16_t		cui_size;	/* size of this item */
+	__uint32_t		cui_nextents;	/* # extents to free */
+	__uint64_t		cui_id;		/* cui identifier */
+	struct xfs_phys_extent	cui_extents[];	/* array of extents */
+};
+
+static inline size_t
+xfs_cui_log_format_sizeof(
+	unsigned int		nr)
+{
+	return sizeof(struct xfs_cui_log_format) +
+			nr * sizeof(struct xfs_phys_extent);
+}
+
+/*
+ * This is the structure used to lay out a cud log item in the
+ * log.  The cud_extents array is a variable size array whose
+ * size is given by cud_nextents;
+ */
+struct xfs_cud_log_format {
+	__uint16_t		cud_type;	/* cud log item type */
+	__uint16_t		cud_size;	/* size of this item */
+	__uint32_t		__pad;
+	__uint64_t		cud_cui_id;	/* id of corresponding cui */
+};
+
+/*
+ * BUI/BUD (inode block mapping) log format definitions
+ */
+
+/* bmbt me_flags: upper bits are flags, lower byte is type code */
+/* Type codes are taken directly from enum xfs_bmap_intent_type. */
+#define XFS_BMAP_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_BMAP_EXTENT_ATTR_FORK	(1U << 31)
+#define XFS_BMAP_EXTENT_UNWRITTEN	(1U << 30)
+
+#define XFS_BMAP_EXTENT_FLAGS		(XFS_BMAP_EXTENT_TYPE_MASK | \
+					 XFS_BMAP_EXTENT_ATTR_FORK | \
+					 XFS_BMAP_EXTENT_UNWRITTEN)
+
+/*
+ * This is the structure used to lay out a bui log item in the
+ * log.  The bui_extents field is a variable size array whose
+ * size is given by bui_nextents.
+ */
+struct xfs_bui_log_format {
+	__uint16_t		bui_type;	/* bui log item type */
+	__uint16_t		bui_size;	/* size of this item */
+	__uint32_t		bui_nextents;	/* # extents to free */
+	__uint64_t		bui_id;		/* bui identifier */
+	struct xfs_map_extent	bui_extents[];	/* array of extents to bmap */
+};
+
+static inline size_t
+xfs_bui_log_format_sizeof(
+	unsigned int		nr)
+{
+	return sizeof(struct xfs_bui_log_format) +
+			nr * sizeof(struct xfs_map_extent);
+}
+
+/*
+ * This is the structure used to lay out a bud log item in the
+ * log.  The bud_extents array is a variable size array whose
+ * size is given by bud_nextents;
+ */
+struct xfs_bud_log_format {
+	__uint16_t		bud_type;	/* bud log item type */
+	__uint16_t		bud_size;	/* size of this item */
+	__uint32_t		__pad;
+	__uint64_t		bud_bui_id;	/* id of corresponding bui */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
new file mode 100644
index 000000000000..b177ef33cd4c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -0,0 +1,1698 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_shared.h"
23#include "xfs_format.h"
24#include "xfs_log_format.h"
25#include "xfs_trans_resv.h"
26#include "xfs_sb.h"
27#include "xfs_mount.h"
28#include "xfs_defer.h"
29#include "xfs_btree.h"
30#include "xfs_bmap.h"
31#include "xfs_refcount_btree.h"
32#include "xfs_alloc.h"
33#include "xfs_error.h"
34#include "xfs_trace.h"
35#include "xfs_cksum.h"
36#include "xfs_trans.h"
37#include "xfs_bit.h"
38#include "xfs_refcount.h"
39#include "xfs_rmap.h"
40
41/* Allowable refcount adjustment amounts. */
42enum xfs_refc_adjust_op {
43 XFS_REFCOUNT_ADJUST_INCREASE = 1,
44 XFS_REFCOUNT_ADJUST_DECREASE = -1,
45 XFS_REFCOUNT_ADJUST_COW_ALLOC = 0,
46 XFS_REFCOUNT_ADJUST_COW_FREE = -1,
47};
48
49STATIC int __xfs_refcount_cow_alloc(struct xfs_btree_cur *rcur,
50 xfs_agblock_t agbno, xfs_extlen_t aglen,
51 struct xfs_defer_ops *dfops);
52STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
53 xfs_agblock_t agbno, xfs_extlen_t aglen,
54 struct xfs_defer_ops *dfops);
55
56/*
57 * Look up the first record less than or equal to [bno, len] in the btree
58 * given by cur.
59 */
60int
61xfs_refcount_lookup_le(
62 struct xfs_btree_cur *cur,
63 xfs_agblock_t bno,
64 int *stat)
65{
66 trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
67 XFS_LOOKUP_LE);
68 cur->bc_rec.rc.rc_startblock = bno;
69 cur->bc_rec.rc.rc_blockcount = 0;
70 return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
71}
72
73/*
74 * Look up the first record greater than or equal to [bno, len] in the btree
75 * given by cur.
76 */
77int
78xfs_refcount_lookup_ge(
79 struct xfs_btree_cur *cur,
80 xfs_agblock_t bno,
81 int *stat)
82{
83 trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
84 XFS_LOOKUP_GE);
85 cur->bc_rec.rc.rc_startblock = bno;
86 cur->bc_rec.rc.rc_blockcount = 0;
87 return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
88}
89
90/* Convert on-disk record to in-core format. */
91static inline void
92xfs_refcount_btrec_to_irec(
93 union xfs_btree_rec *rec,
94 struct xfs_refcount_irec *irec)
95{
96 irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
97 irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
98 irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
99}
100
101/*
102 * Get the data from the pointed-to record.
103 */
104int
105xfs_refcount_get_rec(
106 struct xfs_btree_cur *cur,
107 struct xfs_refcount_irec *irec,
108 int *stat)
109{
110 union xfs_btree_rec *rec;
111 int error;
112
113 error = xfs_btree_get_rec(cur, &rec, stat);
114 if (!error && *stat == 1) {
115 xfs_refcount_btrec_to_irec(rec, irec);
116 trace_xfs_refcount_get(cur->bc_mp, cur->bc_private.a.agno,
117 irec);
118 }
119 return error;
120}
121
122/*
123 * Update the record referred to by cur to the value given
124 * by [bno, len, refcount].
125 * This either works (return 0) or gets an EFSCORRUPTED error.
126 */
127STATIC int
128xfs_refcount_update(
129 struct xfs_btree_cur *cur,
130 struct xfs_refcount_irec *irec)
131{
132 union xfs_btree_rec rec;
133 int error;
134
135 trace_xfs_refcount_update(cur->bc_mp, cur->bc_private.a.agno, irec);
136 rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
137 rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
138 rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
139 error = xfs_btree_update(cur, &rec);
140 if (error)
141 trace_xfs_refcount_update_error(cur->bc_mp,
142 cur->bc_private.a.agno, error, _RET_IP_);
143 return error;
144}
145
146/*
147 * Insert the record referred to by cur to the value given
148 * by [bno, len, refcount].
149 * This either works (return 0) or gets an EFSCORRUPTED error.
150 */
151STATIC int
152xfs_refcount_insert(
153 struct xfs_btree_cur *cur,
154 struct xfs_refcount_irec *irec,
155 int *i)
156{
157 int error;
158
159 trace_xfs_refcount_insert(cur->bc_mp, cur->bc_private.a.agno, irec);
160 cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
161 cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
162 cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
163 error = xfs_btree_insert(cur, i);
164 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
165out_error:
166 if (error)
167 trace_xfs_refcount_insert_error(cur->bc_mp,
168 cur->bc_private.a.agno, error, _RET_IP_);
169 return error;
170}
171
172/*
173 * Remove the record referred to by cur, then set the pointer to the spot
174 * where the record could be re-inserted, in case we want to increment or
175 * decrement the cursor.
176 * This either works (return 0) or gets an EFSCORRUPTED error.
177 */
178STATIC int
179xfs_refcount_delete(
180 struct xfs_btree_cur *cur,
181 int *i)
182{
183 struct xfs_refcount_irec irec;
184 int found_rec;
185 int error;
186
187 error = xfs_refcount_get_rec(cur, &irec, &found_rec);
188 if (error)
189 goto out_error;
190 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
191 trace_xfs_refcount_delete(cur->bc_mp, cur->bc_private.a.agno, &irec);
192 error = xfs_btree_delete(cur, i);
193 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
194 if (error)
195 goto out_error;
196 error = xfs_refcount_lookup_ge(cur, irec.rc_startblock, &found_rec);
197out_error:
198 if (error)
199 trace_xfs_refcount_delete_error(cur->bc_mp,
200 cur->bc_private.a.agno, error, _RET_IP_);
201 return error;
202}
203
204/*
205 * Adjusting the Reference Count
206 *
207 * As stated elsewhere, the reference count btree (refcbt) stores
208 * >1 reference counts for extents of physical blocks. In this
209 * operation, we're either raising or lowering the reference count of
210 * some subrange stored in the tree:
211 *
212 * <------ adjustment range ------>
213 * ----+ +---+-----+ +--+--------+---------
214 * 2 | | 3 | 4 | |17| 55 | 10
215 * ----+ +---+-----+ +--+--------+---------
216 * X axis is the physical block number;
217 * reference counts are the numbers inside the rectangles
218 *
219 * The first thing we need to do is to ensure that there are no
220 * refcount extents crossing either boundary of the range to be
221 * adjusted. For any extent that does cross a boundary, split it into
222 * two extents so that we can increment the refcount of one of the
223 * pieces later:
224 *
225 * <------ adjustment range ------>
226 * ----+ +---+-----+ +--+--------+----+----
227 * 2 | | 3 | 2 | |17| 55 | 10 | 10
228 * ----+ +---+-----+ +--+--------+----+----
229 *
230 * For this next step, let's assume that all the physical blocks in
231 * the adjustment range are mapped to a file and are therefore in use
232 * at least once. Therefore, we can infer that any gap in the
233 * refcount tree within the adjustment range represents a physical
234 * extent with refcount == 1:
235 *
236 * <------ adjustment range ------>
237 * ----+---+---+-----+-+--+--------+----+----
238 * 2 |"1"| 3 | 2 |1|17| 55 | 10 | 10
239 * ----+---+---+-----+-+--+--------+----+----
240 * ^
241 *
242 * For each extent that falls within the interval range, figure out
243 * which extent is to the left or the right of that extent. Now we
244 * have a left, current, and right extent. If the new reference count
245 * of the center extent enables us to merge left, center, and right
246 * into one record covering all three, do so. If the center extent is
247 * at the left end of the range, abuts the left extent, and its new
248 * reference count matches the left extent's record, then merge them.
249 * If the center extent is at the right end of the range, abuts the
250 * right extent, and the reference counts match, merge those. In the
251 * example, we can left merge (assuming an increment operation):
252 *
253 * <------ adjustment range ------>
254 * --------+---+-----+-+--+--------+----+----
255 * 2 | 3 | 2 |1|17| 55 | 10 | 10
256 * --------+---+-----+-+--+--------+----+----
257 * ^
258 *
259 * For all other extents within the range, adjust the reference count
260 * or delete it if the refcount falls below 2. If we were
261 * incrementing, the end result looks like this:
262 *
263 * <------ adjustment range ------>
264 * --------+---+-----+-+--+--------+----+----
265 * 2 | 4 | 3 |2|18| 56 | 11 | 10
266 * --------+---+-----+-+--+--------+----+----
267 *
268 * The result of a decrement operation looks as such:
269 *
270 * <------ adjustment range ------>
271 * ----+ +---+ +--+--------+----+----
272 * 2 | | 2 | |16| 54 | 9 | 10
273 * ----+ +---+ +--+--------+----+----
274 * DDDD 111111DD
275 *
276 * The blocks marked "D" are freed; the blocks marked "1" are only
277 * referenced once and therefore the record is removed from the
278 * refcount btree.
279 */
280
281/* Next block after this extent. */
282static inline xfs_agblock_t
283xfs_refc_next(
284 struct xfs_refcount_irec *rc)
285{
286 return rc->rc_startblock + rc->rc_blockcount;
287}
288
289/*
290 * Split a refcount extent that crosses agbno.
291 */
292STATIC int
293xfs_refcount_split_extent(
294 struct xfs_btree_cur *cur,
295 xfs_agblock_t agbno,
296 bool *shape_changed)
297{
298 struct xfs_refcount_irec rcext, tmp;
299 int found_rec;
300 int error;
301
302 *shape_changed = false;
303 error = xfs_refcount_lookup_le(cur, agbno, &found_rec);
304 if (error)
305 goto out_error;
306 if (!found_rec)
307 return 0;
308
309 error = xfs_refcount_get_rec(cur, &rcext, &found_rec);
310 if (error)
311 goto out_error;
312 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
313 if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
314 return 0;
315
316 *shape_changed = true;
317 trace_xfs_refcount_split_extent(cur->bc_mp, cur->bc_private.a.agno,
318 &rcext, agbno);
319
320 /* Establish the right extent. */
321 tmp = rcext;
322 tmp.rc_startblock = agbno;
323 tmp.rc_blockcount -= (agbno - rcext.rc_startblock);
324 error = xfs_refcount_update(cur, &tmp);
325 if (error)
326 goto out_error;
327
328 /* Insert the left extent. */
329 tmp = rcext;
330 tmp.rc_blockcount = agbno - rcext.rc_startblock;
331 error = xfs_refcount_insert(cur, &tmp, &found_rec);
332 if (error)
333 goto out_error;
334 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
335 return error;
336
337out_error:
338 trace_xfs_refcount_split_extent_error(cur->bc_mp,
339 cur->bc_private.a.agno, error, _RET_IP_);
340 return error;
341}
342
343/*
344 * Merge the left, center, and right extents.
345 */
346STATIC int
347xfs_refcount_merge_center_extents(
348 struct xfs_btree_cur *cur,
349 struct xfs_refcount_irec *left,
350 struct xfs_refcount_irec *center,
351 struct xfs_refcount_irec *right,
352 unsigned long long extlen,
353 xfs_agblock_t *agbno,
354 xfs_extlen_t *aglen)
355{
356 int error;
357 int found_rec;
358
359 trace_xfs_refcount_merge_center_extents(cur->bc_mp,
360 cur->bc_private.a.agno, left, center, right);
361
362 /*
363 * Make sure the center and right extents are not in the btree.
364 * If the center extent was synthesized, the first delete call
365 * removes the right extent and we skip the second deletion.
366 * If center and right were in the btree, then the first delete
367 * call removes the center and the second one removes the right
368 * extent.
369 */
370 error = xfs_refcount_lookup_ge(cur, center->rc_startblock,
371 &found_rec);
372 if (error)
373 goto out_error;
374 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
375
376 error = xfs_refcount_delete(cur, &found_rec);
377 if (error)
378 goto out_error;
379 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
380
381 if (center->rc_refcount > 1) {
382 error = xfs_refcount_delete(cur, &found_rec);
383 if (error)
384 goto out_error;
385 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
386 out_error);
387 }
388
389 /* Enlarge the left extent. */
390 error = xfs_refcount_lookup_le(cur, left->rc_startblock,
391 &found_rec);
392 if (error)
393 goto out_error;
394 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
395
396 left->rc_blockcount = extlen;
397 error = xfs_refcount_update(cur, left);
398 if (error)
399 goto out_error;
400
401 *aglen = 0;
402 return error;
403
404out_error:
405 trace_xfs_refcount_merge_center_extents_error(cur->bc_mp,
406 cur->bc_private.a.agno, error, _RET_IP_);
407 return error;
408}
409
410/*
411 * Merge with the left extent.
412 */
413STATIC int
414xfs_refcount_merge_left_extent(
415 struct xfs_btree_cur *cur,
416 struct xfs_refcount_irec *left,
417 struct xfs_refcount_irec *cleft,
418 xfs_agblock_t *agbno,
419 xfs_extlen_t *aglen)
420{
421 int error;
422 int found_rec;
423
424 trace_xfs_refcount_merge_left_extent(cur->bc_mp,
425 cur->bc_private.a.agno, left, cleft);
426
427 /* If the extent at agbno (cleft) wasn't synthesized, remove it. */
428 if (cleft->rc_refcount > 1) {
429 error = xfs_refcount_lookup_le(cur, cleft->rc_startblock,
430 &found_rec);
431 if (error)
432 goto out_error;
433 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
434 out_error);
435
436 error = xfs_refcount_delete(cur, &found_rec);
437 if (error)
438 goto out_error;
439 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
440 out_error);
441 }
442
443 /* Enlarge the left extent. */
444 error = xfs_refcount_lookup_le(cur, left->rc_startblock,
445 &found_rec);
446 if (error)
447 goto out_error;
448 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
449
450 left->rc_blockcount += cleft->rc_blockcount;
451 error = xfs_refcount_update(cur, left);
452 if (error)
453 goto out_error;
454
455 *agbno += cleft->rc_blockcount;
456 *aglen -= cleft->rc_blockcount;
457 return error;
458
459out_error:
460 trace_xfs_refcount_merge_left_extent_error(cur->bc_mp,
461 cur->bc_private.a.agno, error, _RET_IP_);
462 return error;
463}
464
465/*
466 * Merge with the right extent.
467 */
468STATIC int
469xfs_refcount_merge_right_extent(
470 struct xfs_btree_cur *cur,
471 struct xfs_refcount_irec *right,
472 struct xfs_refcount_irec *cright,
473 xfs_agblock_t *agbno,
474 xfs_extlen_t *aglen)
475{
476 int error;
477 int found_rec;
478
479 trace_xfs_refcount_merge_right_extent(cur->bc_mp,
480 cur->bc_private.a.agno, cright, right);
481
482 /*
483 * If the extent ending at agbno+aglen (cright) wasn't synthesized,
484 * remove it.
485 */
486 if (cright->rc_refcount > 1) {
487 error = xfs_refcount_lookup_le(cur, cright->rc_startblock,
488 &found_rec);
489 if (error)
490 goto out_error;
491 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
492 out_error);
493
494 error = xfs_refcount_delete(cur, &found_rec);
495 if (error)
496 goto out_error;
497 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
498 out_error);
499 }
500
501 /* Enlarge the right extent. */
502 error = xfs_refcount_lookup_le(cur, right->rc_startblock,
503 &found_rec);
504 if (error)
505 goto out_error;
506 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
507
508 right->rc_startblock -= cright->rc_blockcount;
509 right->rc_blockcount += cright->rc_blockcount;
510 error = xfs_refcount_update(cur, right);
511 if (error)
512 goto out_error;
513
514 *aglen -= cright->rc_blockcount;
515 return error;
516
517out_error:
518 trace_xfs_refcount_merge_right_extent_error(cur->bc_mp,
519 cur->bc_private.a.agno, error, _RET_IP_);
520 return error;
521}
522
523#define XFS_FIND_RCEXT_SHARED 1
524#define XFS_FIND_RCEXT_COW 2
525/*
526 * Find the left extent and the one after it (cleft). This function assumes
527 * that we've already split any extent crossing agbno.
528 */
529STATIC int
530xfs_refcount_find_left_extents(
531 struct xfs_btree_cur *cur,
532 struct xfs_refcount_irec *left,
533 struct xfs_refcount_irec *cleft,
534 xfs_agblock_t agbno,
535 xfs_extlen_t aglen,
536 int flags)
537{
538 struct xfs_refcount_irec tmp;
539 int error;
540 int found_rec;
541
542 left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
543 error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec);
544 if (error)
545 goto out_error;
546 if (!found_rec)
547 return 0;
548
549 error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
550 if (error)
551 goto out_error;
552 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
553
554 if (xfs_refc_next(&tmp) != agbno)
555 return 0;
556 if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
557 return 0;
558 if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
559 return 0;
560 /* We have a left extent; retrieve (or invent) the next right one */
561 *left = tmp;
562
563 error = xfs_btree_increment(cur, 0, &found_rec);
564 if (error)
565 goto out_error;
566 if (found_rec) {
567 error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
568 if (error)
569 goto out_error;
570 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
571 out_error);
572
573 /* if tmp starts at the end of our range, just use that */
574 if (tmp.rc_startblock == agbno)
575 *cleft = tmp;
576 else {
577 /*
578 * There's a gap in the refcntbt at the start of the
579 * range we're interested in (refcount == 1) so
580 * synthesize the implied extent and pass it back.
581 * We assume here that the agbno/aglen range was
582 * passed in from a data fork extent mapping and
583 * therefore is allocated to exactly one owner.
584 */
585 cleft->rc_startblock = agbno;
586 cleft->rc_blockcount = min(aglen,
587 tmp.rc_startblock - agbno);
588 cleft->rc_refcount = 1;
589 }
590 } else {
591 /*
592 * No extents, so pretend that there's one covering the whole
593 * range.
594 */
595 cleft->rc_startblock = agbno;
596 cleft->rc_blockcount = aglen;
597 cleft->rc_refcount = 1;
598 }
599 trace_xfs_refcount_find_left_extent(cur->bc_mp, cur->bc_private.a.agno,
600 left, cleft, agbno);
601 return error;
602
603out_error:
604 trace_xfs_refcount_find_left_extent_error(cur->bc_mp,
605 cur->bc_private.a.agno, error, _RET_IP_);
606 return error;
607}
608
609/*
610 * Find the right extent and the one before it (cright). This function
611 * assumes that we've already split any extents crossing agbno + aglen.
612 */
613STATIC int
614xfs_refcount_find_right_extents(
615 struct xfs_btree_cur *cur,
616 struct xfs_refcount_irec *right,
617 struct xfs_refcount_irec *cright,
618 xfs_agblock_t agbno,
619 xfs_extlen_t aglen,
620 int flags)
621{
622 struct xfs_refcount_irec tmp;
623 int error;
624 int found_rec;
625
626 right->rc_startblock = cright->rc_startblock = NULLAGBLOCK;
627 error = xfs_refcount_lookup_ge(cur, agbno + aglen, &found_rec);
628 if (error)
629 goto out_error;
630 if (!found_rec)
631 return 0;
632
633 error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
634 if (error)
635 goto out_error;
636 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
637
638 if (tmp.rc_startblock != agbno + aglen)
639 return 0;
640 if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
641 return 0;
642 if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
643 return 0;
644 /* We have a right extent; retrieve (or invent) the next left one */
645 *right = tmp;
646
647 error = xfs_btree_decrement(cur, 0, &found_rec);
648 if (error)
649 goto out_error;
650 if (found_rec) {
651 error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
652 if (error)
653 goto out_error;
654 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
655 out_error);
656
657 /* if tmp ends at the end of our range, just use that */
658 if (xfs_refc_next(&tmp) == agbno + aglen)
659 *cright = tmp;
660 else {
661 /*
662 * There's a gap in the refcntbt at the end of the
663 * range we're interested in (refcount == 1) so
664 * create the implied extent and pass it back.
665 * We assume here that the agbno/aglen range was
666 * passed in from a data fork extent mapping and
667 * therefore is allocated to exactly one owner.
668 */
669 cright->rc_startblock = max(agbno, xfs_refc_next(&tmp));
670 cright->rc_blockcount = right->rc_startblock -
671 cright->rc_startblock;
672 cright->rc_refcount = 1;
673 }
674 } else {
675 /*
676 * No extents, so pretend that there's one covering the whole
677 * range.
678 */
679 cright->rc_startblock = agbno;
680 cright->rc_blockcount = aglen;
681 cright->rc_refcount = 1;
682 }
683 trace_xfs_refcount_find_right_extent(cur->bc_mp, cur->bc_private.a.agno,
684 cright, right, agbno + aglen);
685 return error;
686
687out_error:
688 trace_xfs_refcount_find_right_extent_error(cur->bc_mp,
689 cur->bc_private.a.agno, error, _RET_IP_);
690 return error;
691}
692
693/* Is this extent valid? */
694static inline bool
695xfs_refc_valid(
696 struct xfs_refcount_irec *rc)
697{
698 return rc->rc_startblock != NULLAGBLOCK;
699}
700
701/*
702 * Try to merge with any extents on the boundaries of the adjustment range.
703 */
704STATIC int
705xfs_refcount_merge_extents(
706 struct xfs_btree_cur *cur,
707 xfs_agblock_t *agbno,
708 xfs_extlen_t *aglen,
709 enum xfs_refc_adjust_op adjust,
710 int flags,
711 bool *shape_changed)
712{
713 struct xfs_refcount_irec left = {0}, cleft = {0};
714 struct xfs_refcount_irec cright = {0}, right = {0};
715 int error;
716 unsigned long long ulen;
717 bool cequal;
718
719 *shape_changed = false;
720 /*
721 * Find the extent just below agbno [left], just above agbno [cleft],
722 * just below (agbno + aglen) [cright], and just above (agbno + aglen)
723 * [right].
724 */
725 error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
726 *aglen, flags);
727 if (error)
728 return error;
729 error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
730 *aglen, flags);
731 if (error)
732 return error;
733
734 /* No left or right extent to merge; exit. */
735 if (!xfs_refc_valid(&left) && !xfs_refc_valid(&right))
736 return 0;
737
738 cequal = (cleft.rc_startblock == cright.rc_startblock) &&
739 (cleft.rc_blockcount == cright.rc_blockcount);
740
741 /* Try to merge left, cleft, and right. cleft must == cright. */
742 ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount +
743 right.rc_blockcount;
744 if (xfs_refc_valid(&left) && xfs_refc_valid(&right) &&
745 xfs_refc_valid(&cleft) && xfs_refc_valid(&cright) && cequal &&
746 left.rc_refcount == cleft.rc_refcount + adjust &&
747 right.rc_refcount == cleft.rc_refcount + adjust &&
748 ulen < MAXREFCEXTLEN) {
749 *shape_changed = true;
750 return xfs_refcount_merge_center_extents(cur, &left, &cleft,
751 &right, ulen, agbno, aglen);
752 }
753
754 /* Try to merge left and cleft. */
755 ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount;
756 if (xfs_refc_valid(&left) && xfs_refc_valid(&cleft) &&
757 left.rc_refcount == cleft.rc_refcount + adjust &&
758 ulen < MAXREFCEXTLEN) {
759 *shape_changed = true;
760 error = xfs_refcount_merge_left_extent(cur, &left, &cleft,
761 agbno, aglen);
762 if (error)
763 return error;
764
765 /*
766 * If we just merged left + cleft and cleft == cright,
767 * we no longer have a cright to merge with right. We're done.
768 */
769 if (cequal)
770 return 0;
771 }
772
773 /* Try to merge cright and right. */
774 ulen = (unsigned long long)right.rc_blockcount + cright.rc_blockcount;
775 if (xfs_refc_valid(&right) && xfs_refc_valid(&cright) &&
776 right.rc_refcount == cright.rc_refcount + adjust &&
777 ulen < MAXREFCEXTLEN) {
778 *shape_changed = true;
779 return xfs_refcount_merge_right_extent(cur, &right, &cright,
780 agbno, aglen);
781 }
782
783 return error;
784}
785
786/*
787 * While we're adjusting the refcounts records of an extent, we have
788 * to keep an eye on the number of extents we're dirtying -- run too
789 * many in a single transaction and we'll exceed the transaction's
790 * reservation and crash the fs. Each record adds 12 bytes to the
791 * log (plus any key updates) so we'll conservatively assume 24 bytes
792 * per record. We must also leave space for btree splits on both ends
793 * of the range and space for the CUD and a new CUI.
794 *
795 * XXX: This is a pretty hand-wavy estimate. The penalty for guessing
796 * true incorrectly is a shutdown FS; the penalty for guessing false
797 * incorrectly is more transaction rolls than might be necessary.
798 * Be conservative here.
799 */
800static bool
801xfs_refcount_still_have_space(
802 struct xfs_btree_cur *cur)
803{
804 unsigned long overhead;
805
806 overhead = cur->bc_private.a.priv.refc.shape_changes *
807 xfs_allocfree_log_count(cur->bc_mp, 1);
808 overhead *= cur->bc_mp->m_sb.sb_blocksize;
809
810 /*
811 * Only allow 2 refcount extent updates per transaction if the
812 * refcount continue update "error" has been injected.
813 */
814 if (cur->bc_private.a.priv.refc.nr_ops > 2 &&
815 XFS_TEST_ERROR(false, cur->bc_mp,
816 XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE,
817 XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE))
818 return false;
819
820 if (cur->bc_private.a.priv.refc.nr_ops == 0)
821 return true;
822 else if (overhead > cur->bc_tp->t_log_res)
823 return false;
824 return cur->bc_tp->t_log_res - overhead >
825 cur->bc_private.a.priv.refc.nr_ops * 32;
826}
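The space check above boils down to simple arithmetic on the transaction's log reservation. A standalone model of that heuristic (the `struct refc_state` fields and the `still_have_space` name are hypothetical stand-ins, not the kernel API; the 32-byte-per-record figure comes from the function above):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cursor/transaction state read above. */
struct refc_state {
	unsigned long	shape_changes;	  /* btree shape changes so far */
	unsigned long	nr_ops;		  /* refcount record updates so far */
	unsigned long	log_res;	  /* transaction log reservation, bytes */
	unsigned long	blocksize;	  /* filesystem block size, bytes */
	unsigned long	allocfree_blocks; /* worst-case blocks per alloc/free */
};

/*
 * Mirror of the check above: charge each shape change a worst-case
 * alloc/free footprint, then require ~32 bytes of remaining log
 * reservation for every record update still pending.
 */
static bool still_have_space(const struct refc_state *s)
{
	unsigned long overhead;

	overhead = s->shape_changes * s->allocfree_blocks * s->blocksize;
	if (s->nr_ops == 0)
		return true;
	if (overhead > s->log_res)
		return false;
	return s->log_res - overhead > s->nr_ops * 32;
}
```

For instance, with a 100,000-byte reservation and no shape changes, ten pending record updates easily fit; ten shape changes at 4 blocks of 4096 bytes each (163,840 bytes of overhead) exhaust the reservation and force a transaction roll.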
827
828/*
829 * Adjust the refcounts of middle extents. At this point we should have
830 * split extents that crossed the adjustment range; merged with adjacent
831 * extents; and updated agbno/aglen to reflect the merges. Therefore,
832 * all we have to do is update the extents inside [agbno, agbno + aglen].
833 */
834STATIC int
835xfs_refcount_adjust_extents(
836 struct xfs_btree_cur *cur,
837 xfs_agblock_t *agbno,
838 xfs_extlen_t *aglen,
839 enum xfs_refc_adjust_op adj,
840 struct xfs_defer_ops *dfops,
841 struct xfs_owner_info *oinfo)
842{
843 struct xfs_refcount_irec ext, tmp;
844 int error;
845 int found_rec, found_tmp;
846 xfs_fsblock_t fsbno;
847
848 /* Merging did all the work already. */
849 if (*aglen == 0)
850 return 0;
851
852 error = xfs_refcount_lookup_ge(cur, *agbno, &found_rec);
853 if (error)
854 goto out_error;
855
856 while (*aglen > 0 && xfs_refcount_still_have_space(cur)) {
857 error = xfs_refcount_get_rec(cur, &ext, &found_rec);
858 if (error)
859 goto out_error;
860 if (!found_rec) {
861 ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
862 ext.rc_blockcount = 0;
863 ext.rc_refcount = 0;
864 }
865
866 /*
867 * Deal with a hole in the refcount tree; if a file maps to
868 * these blocks and there's no refcountbt record, pretend that
869 * there is one with refcount == 1.
870 */
871 if (ext.rc_startblock != *agbno) {
872 tmp.rc_startblock = *agbno;
873 tmp.rc_blockcount = min(*aglen,
874 ext.rc_startblock - *agbno);
875 tmp.rc_refcount = 1 + adj;
876 trace_xfs_refcount_modify_extent(cur->bc_mp,
877 cur->bc_private.a.agno, &tmp);
878
879 /*
880 * Either cover the hole (increment) or
881 * delete the range (decrement).
882 */
883 if (tmp.rc_refcount) {
884 error = xfs_refcount_insert(cur, &tmp,
885 &found_tmp);
886 if (error)
887 goto out_error;
888 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
889 found_tmp == 1, out_error);
890 cur->bc_private.a.priv.refc.nr_ops++;
891 } else {
892 fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
893 cur->bc_private.a.agno,
894 tmp.rc_startblock);
895 xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
896 tmp.rc_blockcount, oinfo);
897 }
898
899 (*agbno) += tmp.rc_blockcount;
900 (*aglen) -= tmp.rc_blockcount;
901
902 error = xfs_refcount_lookup_ge(cur, *agbno,
903 &found_rec);
904 if (error)
905 goto out_error;
906 }
907
908 /* Stop if there's nothing left to modify */
909 if (*aglen == 0 || !xfs_refcount_still_have_space(cur))
910 break;
911
912 /*
913 * Adjust the reference count and either update the tree
914 * (incr) or free the blocks (decr).
915 */
916 if (ext.rc_refcount == MAXREFCOUNT)
917 goto skip;
918 ext.rc_refcount += adj;
919 trace_xfs_refcount_modify_extent(cur->bc_mp,
920 cur->bc_private.a.agno, &ext);
921 if (ext.rc_refcount > 1) {
922 error = xfs_refcount_update(cur, &ext);
923 if (error)
924 goto out_error;
925 cur->bc_private.a.priv.refc.nr_ops++;
926 } else if (ext.rc_refcount == 1) {
927 error = xfs_refcount_delete(cur, &found_rec);
928 if (error)
929 goto out_error;
930 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
931 found_rec == 1, out_error);
932 cur->bc_private.a.priv.refc.nr_ops++;
933 goto advloop;
934 } else {
935 fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
936 cur->bc_private.a.agno,
937 ext.rc_startblock);
938 xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
939 ext.rc_blockcount, oinfo);
940 }
941
942skip:
943 error = xfs_btree_increment(cur, 0, &found_rec);
944 if (error)
945 goto out_error;
946
947advloop:
948 (*agbno) += ext.rc_blockcount;
949 (*aglen) -= ext.rc_blockcount;
950 }
951
952 return error;
953out_error:
954 trace_xfs_refcount_modify_extent_error(cur->bc_mp,
955 cur->bc_private.a.agno, error, _RET_IP_);
956 return error;
957}
958
959/* Adjust the reference count of a range of AG blocks. */
960STATIC int
961xfs_refcount_adjust(
962 struct xfs_btree_cur *cur,
963 xfs_agblock_t agbno,
964 xfs_extlen_t aglen,
965 xfs_agblock_t *new_agbno,
966 xfs_extlen_t *new_aglen,
967 enum xfs_refc_adjust_op adj,
968 struct xfs_defer_ops *dfops,
969 struct xfs_owner_info *oinfo)
970{
971 bool shape_changed;
972 int shape_changes = 0;
973 int error;
974
975 *new_agbno = agbno;
976 *new_aglen = aglen;
977 if (adj == XFS_REFCOUNT_ADJUST_INCREASE)
978 trace_xfs_refcount_increase(cur->bc_mp, cur->bc_private.a.agno,
979 agbno, aglen);
980 else
981 trace_xfs_refcount_decrease(cur->bc_mp, cur->bc_private.a.agno,
982 agbno, aglen);
983
984 /*
985 * Ensure that no rcextents cross the boundary of the adjustment range.
986 */
987 error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
988 if (error)
989 goto out_error;
990 if (shape_changed)
991 shape_changes++;
992
993 error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
994 if (error)
995 goto out_error;
996 if (shape_changed)
997 shape_changes++;
998
999 /*
1000 * Try to merge with the left or right extents of the range.
1001 */
1002 error = xfs_refcount_merge_extents(cur, new_agbno, new_aglen, adj,
1003 XFS_FIND_RCEXT_SHARED, &shape_changed);
1004 if (error)
1005 goto out_error;
1006 if (shape_changed)
1007 shape_changes++;
1008 if (shape_changes)
1009 cur->bc_private.a.priv.refc.shape_changes++;
1010
1011 /* Now that we've taken care of the ends, adjust the middle extents */
1012 error = xfs_refcount_adjust_extents(cur, new_agbno, new_aglen,
1013 adj, dfops, oinfo);
1014 if (error)
1015 goto out_error;
1016
1017 return 0;
1018
1019out_error:
1020 trace_xfs_refcount_adjust_error(cur->bc_mp, cur->bc_private.a.agno,
1021 error, _RET_IP_);
1022 return error;
1023}
1024
1025/* Clean up after calling xfs_refcount_finish_one. */
1026void
1027xfs_refcount_finish_one_cleanup(
1028 struct xfs_trans *tp,
1029 struct xfs_btree_cur *rcur,
1030 int error)
1031{
1032 struct xfs_buf *agbp;
1033
1034 if (rcur == NULL)
1035 return;
1036 agbp = rcur->bc_private.a.agbp;
1037 xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
1038 if (error)
1039 xfs_trans_brelse(tp, agbp);
1040}
1041
1042/*
1043 * Process one of the deferred refcount operations. We pass back the
1044 * btree cursor to maintain our lock on the btree between calls.
1045 * This saves time and eliminates a buffer deadlock between the
1046 * superblock and the AGF because we'll always grab them in the same
1047 * order.
1048 */
1049int
1050xfs_refcount_finish_one(
1051 struct xfs_trans *tp,
1052 struct xfs_defer_ops *dfops,
1053 enum xfs_refcount_intent_type type,
1054 xfs_fsblock_t startblock,
1055 xfs_extlen_t blockcount,
1056 xfs_fsblock_t *new_fsb,
1057 xfs_extlen_t *new_len,
1058 struct xfs_btree_cur **pcur)
1059{
1060 struct xfs_mount *mp = tp->t_mountp;
1061 struct xfs_btree_cur *rcur;
1062 struct xfs_buf *agbp = NULL;
1063 int error = 0;
1064 xfs_agnumber_t agno;
1065 xfs_agblock_t bno;
1066 xfs_agblock_t new_agbno;
1067 unsigned long nr_ops = 0;
1068 int shape_changes = 0;
1069
1070 agno = XFS_FSB_TO_AGNO(mp, startblock);
1071 ASSERT(agno != NULLAGNUMBER);
1072 bno = XFS_FSB_TO_AGBNO(mp, startblock);
1073
1074 trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, startblock),
1075 type, XFS_FSB_TO_AGBNO(mp, startblock),
1076 blockcount);
1077
1078 if (XFS_TEST_ERROR(false, mp,
1079 XFS_ERRTAG_REFCOUNT_FINISH_ONE,
1080 XFS_RANDOM_REFCOUNT_FINISH_ONE))
1081 return -EIO;
1082
1083 /*
1084 * If we haven't gotten a cursor or the cursor AG doesn't match
1085 * the startblock, get one now.
1086 */
1087 rcur = *pcur;
1088 if (rcur != NULL && rcur->bc_private.a.agno != agno) {
1089 nr_ops = rcur->bc_private.a.priv.refc.nr_ops;
1090 shape_changes = rcur->bc_private.a.priv.refc.shape_changes;
1091 xfs_refcount_finish_one_cleanup(tp, rcur, 0);
1092 rcur = NULL;
1093 *pcur = NULL;
1094 }
1095 if (rcur == NULL) {
1096 error = xfs_alloc_read_agf(tp->t_mountp, tp, agno,
1097 XFS_ALLOC_FLAG_FREEING, &agbp);
1098 if (error)
1099 return error;
1100 if (!agbp)
1101 return -EFSCORRUPTED;
1102
1103 rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, agno, dfops);
1104 if (!rcur) {
1105 error = -ENOMEM;
1106 goto out_cur;
1107 }
1108 rcur->bc_private.a.priv.refc.nr_ops = nr_ops;
1109 rcur->bc_private.a.priv.refc.shape_changes = shape_changes;
1110 }
1111 *pcur = rcur;
1112
1113 switch (type) {
1114 case XFS_REFCOUNT_INCREASE:
1115 error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
1116 new_len, XFS_REFCOUNT_ADJUST_INCREASE, dfops, NULL);
1117 *new_fsb = XFS_AGB_TO_FSB(mp, agno, new_agbno);
1118 break;
1119 case XFS_REFCOUNT_DECREASE:
1120 error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
1121 new_len, XFS_REFCOUNT_ADJUST_DECREASE, dfops, NULL);
1122 *new_fsb = XFS_AGB_TO_FSB(mp, agno, new_agbno);
1123 break;
1124 case XFS_REFCOUNT_ALLOC_COW:
1125 *new_fsb = startblock + blockcount;
1126 *new_len = 0;
1127 error = __xfs_refcount_cow_alloc(rcur, bno, blockcount, dfops);
1128 break;
1129 case XFS_REFCOUNT_FREE_COW:
1130 *new_fsb = startblock + blockcount;
1131 *new_len = 0;
1132 error = __xfs_refcount_cow_free(rcur, bno, blockcount, dfops);
1133 break;
1134 default:
1135 ASSERT(0);
1136 error = -EFSCORRUPTED;
1137 }
1138 if (!error && *new_len > 0)
1139 trace_xfs_refcount_finish_one_leftover(mp, agno, type,
1140 bno, blockcount, new_agbno, *new_len);
1141 return error;
1142
1143out_cur:
1144 xfs_trans_brelse(tp, agbp);
1145
1146 return error;
1147}
1148
1149/*
1150 * Record a refcount intent for later processing.
1151 */
1152static int
1153__xfs_refcount_add(
1154 struct xfs_mount *mp,
1155 struct xfs_defer_ops *dfops,
1156 enum xfs_refcount_intent_type type,
1157 xfs_fsblock_t startblock,
1158 xfs_extlen_t blockcount)
1159{
1160 struct xfs_refcount_intent *ri;
1161
1162 trace_xfs_refcount_defer(mp, XFS_FSB_TO_AGNO(mp, startblock),
1163 type, XFS_FSB_TO_AGBNO(mp, startblock),
1164 blockcount);
1165
1166 ri = kmem_alloc(sizeof(struct xfs_refcount_intent),
1167 KM_SLEEP | KM_NOFS);
1168 INIT_LIST_HEAD(&ri->ri_list);
1169 ri->ri_type = type;
1170 ri->ri_startblock = startblock;
1171 ri->ri_blockcount = blockcount;
1172
1173 xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_REFCOUNT, &ri->ri_list);
1174 return 0;
1175}
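__xfs_refcount_add reduces to queueing an intent record for a later processing pass. A toy userspace model of that queue-then-finish flow (all names here are illustrative; the kernel uses xfs_defer_add and logged intent items, not a bare linked list, and finishing an intent does real btree work rather than just freeing it):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative intent types, mirroring enum xfs_refcount_intent_type. */
enum intent_type { INCREASE = 1, DECREASE, ALLOC_COW, FREE_COW };

struct intent {
	enum intent_type	type;
	unsigned long long	startblock;
	unsigned int		blockcount;
	struct intent		*next;
};

/* Record an intent for later processing (cf. __xfs_refcount_add). */
static void intent_add(struct intent **head, enum intent_type type,
		       unsigned long long startblock,
		       unsigned int blockcount)
{
	struct intent *ri = malloc(sizeof(*ri));

	ri->type = type;
	ri->startblock = startblock;
	ri->blockcount = blockcount;
	ri->next = *head;
	*head = ri;
}

/* Drain the queue, one intent per step (cf. xfs_refcount_finish_one). */
static int intent_finish_all(struct intent **head)
{
	int processed = 0;

	while (*head) {
		struct intent *ri = *head;

		*head = ri->next;
		free(ri);		/* "finish" the work, then drop it */
		processed++;
	}
	return processed;
}
```

The point of the split is visible even in the toy: recording an intent is cheap and can happen mid-transaction, while the heavy refcount updates are deferred to a phase that can roll transactions as needed.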
1176
1177/*
1178 * Increase the reference count of the blocks backing a file's extent.
1179 */
1180int
1181xfs_refcount_increase_extent(
1182 struct xfs_mount *mp,
1183 struct xfs_defer_ops *dfops,
1184 struct xfs_bmbt_irec *PREV)
1185{
1186 if (!xfs_sb_version_hasreflink(&mp->m_sb))
1187 return 0;
1188
1189 return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_INCREASE,
1190 PREV->br_startblock, PREV->br_blockcount);
1191}
1192
1193/*
1194 * Decrease the reference count of the blocks backing a file's extent.
1195 */
1196int
1197xfs_refcount_decrease_extent(
1198 struct xfs_mount *mp,
1199 struct xfs_defer_ops *dfops,
1200 struct xfs_bmbt_irec *PREV)
1201{
1202 if (!xfs_sb_version_hasreflink(&mp->m_sb))
1203 return 0;
1204
1205 return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_DECREASE,
1206 PREV->br_startblock, PREV->br_blockcount);
1207}
1208
1209/*
1210 * Given an AG extent, find the lowest-numbered run of shared blocks
1211 * within that range and return the range in fbno/flen. If
1212 * find_end_of_shared is set, return the longest contiguous extent of
1213 * shared blocks; if not, just return the first extent we find. If no
1214 * shared blocks are found, fbno and flen will be set to NULLAGBLOCK
1215 * and 0, respectively.
1216 */
1217int
1218xfs_refcount_find_shared(
1219 struct xfs_btree_cur *cur,
1220 xfs_agblock_t agbno,
1221 xfs_extlen_t aglen,
1222 xfs_agblock_t *fbno,
1223 xfs_extlen_t *flen,
1224 bool find_end_of_shared)
1225{
1226 struct xfs_refcount_irec tmp;
1227 int i;
1228 int have;
1229 int error;
1230
1231 trace_xfs_refcount_find_shared(cur->bc_mp, cur->bc_private.a.agno,
1232 agbno, aglen);
1233
1234 /* By default, skip the whole range */
1235 *fbno = NULLAGBLOCK;
1236 *flen = 0;
1237
1238 /* Try to find a refcount extent that crosses the start */
1239 error = xfs_refcount_lookup_le(cur, agbno, &have);
1240 if (error)
1241 goto out_error;
1242 if (!have) {
1243 /* No left extent, look at the next one */
1244 error = xfs_btree_increment(cur, 0, &have);
1245 if (error)
1246 goto out_error;
1247 if (!have)
1248 goto done;
1249 }
1250 error = xfs_refcount_get_rec(cur, &tmp, &i);
1251 if (error)
1252 goto out_error;
1253 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
1254
1255 /* If the extent ends before the start, look at the next one */
1256 if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
1257 error = xfs_btree_increment(cur, 0, &have);
1258 if (error)
1259 goto out_error;
1260 if (!have)
1261 goto done;
1262 error = xfs_refcount_get_rec(cur, &tmp, &i);
1263 if (error)
1264 goto out_error;
1265 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
1266 }
1267
1268 /* If the extent starts after the range we want, bail out */
1269 if (tmp.rc_startblock >= agbno + aglen)
1270 goto done;
1271
1272 /* We found the start of a shared extent! */
1273 if (tmp.rc_startblock < agbno) {
1274 tmp.rc_blockcount -= (agbno - tmp.rc_startblock);
1275 tmp.rc_startblock = agbno;
1276 }
1277
1278 *fbno = tmp.rc_startblock;
1279 *flen = min(tmp.rc_blockcount, agbno + aglen - *fbno);
1280 if (!find_end_of_shared)
1281 goto done;
1282
1283 /* Otherwise, find the end of this shared extent */
1284 while (*fbno + *flen < agbno + aglen) {
1285 error = xfs_btree_increment(cur, 0, &have);
1286 if (error)
1287 goto out_error;
1288 if (!have)
1289 break;
1290 error = xfs_refcount_get_rec(cur, &tmp, &i);
1291 if (error)
1292 goto out_error;
1293 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
1294 if (tmp.rc_startblock >= agbno + aglen ||
1295 tmp.rc_startblock != *fbno + *flen)
1296 break;
1297 *flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
1298 }
1299
1300done:
1301 trace_xfs_refcount_find_shared_result(cur->bc_mp,
1302 cur->bc_private.a.agno, *fbno, *flen);
1303
1304out_error:
1305 if (error)
1306 trace_xfs_refcount_find_shared_error(cur->bc_mp,
1307 cur->bc_private.a.agno, error, _RET_IP_);
1308 return error;
1309}
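Stripped of btree cursors and corruption checks, the scan above is an interval walk over sorted records. A self-contained sketch under those assumptions (types and names are hypothetical; the records stand in for refcount btree entries, which only exist where blocks are shared):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLBLOCK	((uint32_t)-1)

struct rec {
	uint32_t	start;	/* first block of the shared run */
	uint32_t	len;	/* length of the shared run */
};

/*
 * Find the lowest-numbered run of shared blocks within
 * [agbno, agbno + aglen). If find_end is set, extend across
 * contiguous records; clamp the result to the query range.
 * Returns NULLBLOCK/0 if nothing in the range is shared.
 */
static void find_shared(const struct rec *recs, size_t nrecs,
			uint32_t agbno, uint32_t aglen,
			uint32_t *fbno, uint32_t *flen, int find_end)
{
	size_t i;

	*fbno = NULLBLOCK;
	*flen = 0;
	for (i = 0; i < nrecs; i++) {
		uint32_t start = recs[i].start, len = recs[i].len;

		if (start + len <= agbno)
			continue;		/* ends before the range */
		if (start >= agbno + aglen)
			break;			/* starts after the range */
		if (*fbno == NULLBLOCK) {
			if (start < agbno) {	/* clamp to range start */
				len -= agbno - start;
				start = agbno;
			}
			*fbno = start;
			*flen = (len < agbno + aglen - start) ?
					len : agbno + aglen - start;
			if (!find_end)
				return;
		} else if (start == *fbno + *flen) {
			uint32_t max = agbno + aglen - *fbno;

			*flen += len;		/* contiguous: extend */
			if (*flen > max)
				*flen = max;
		} else {
			break;			/* gap: the run ends */
		}
	}
}
```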
1310
1311/*
1312 * Recovering CoW Blocks After a Crash
1313 *
1314 * Due to the way that the copy on write mechanism works, there's a window of
1315 * opportunity in which we can lose track of allocated blocks during a crash.
1316 * Because CoW uses delayed allocation in the in-core CoW fork, writeback
1317 * causes blocks to be allocated and stored in the CoW fork. The blocks are
1318 * no longer in the free space btree but are not otherwise recorded anywhere
1319 * until the write completes and the blocks are mapped into the file. A crash
1320 * in between allocation and remapping results in the replacement blocks being
1321 * lost. This situation is exacerbated by the CoW extent size hint because
1322 * allocations can hang around for a long time.
1323 *
1324 * However, there is a place where we can record these allocations before they
1325 * become mappings -- the reference count btree. The btree normally omits
1326 * extents with refcount == 1, so we can repurpose refcount == 1 records to
1327 * track CoW allocations. Blocks used for CoW writeout cannot be shared, so there should be
1328 * no conflict with shared block records. These mappings should be created
1329 * when we allocate blocks to the CoW fork and deleted when they're removed
1330 * from the CoW fork.
1331 *
1332 * Minor nit: records for in-progress CoW allocations and records for shared
1333 * extents must never be merged, to preserve the property that (except for CoW
1334 * allocations) there are no refcount btree entries with refcount == 1. The
1335 * only time this could potentially happen is when unsharing a block that's
1336 * adjacent to CoW allocations, so we must be careful to avoid this.
1337 *
1338 * At mount time we recover lost CoW allocations by searching the refcount
1339 * btree for these refcount == 1 mappings. These represent CoW allocations
1340 * that were in progress at the time the filesystem went down, so we can free
1341 * them to get the space back.
1342 *
1343 * This mechanism is superior to creating EFIs for unmapped CoW extents for
1344 * several reasons -- first, EFIs pin the tail of the log and would have to be
1345 * periodically relogged to avoid filling up the log. Second, CoW completions
1346 * will have to file an EFD and create new EFIs for whatever remains in the
1347 * CoW fork; this partially takes care of the first problem, but extent-size reservations
1348 * will have to periodically relog even if there's no writeout in progress.
1349 * This can happen if the CoW extent size hint is set, which you really want.
1350 * Third, EFIs cannot currently be automatically relogged into newer
1351 * transactions to advance the log tail. Fourth, stuffing the log full of
1352 * EFIs places an upper bound on the number of CoW allocations that can be
1353 * held filesystem-wide at any given time. Recording them in the refcount
1354 * btree doesn't require us to maintain any state in memory and doesn't pin
1355 * the log.
1356 */
1357/*
1358 * Adjust the refcounts of CoW allocations. These allocations are "magic"
1359 * in that they're not referenced anywhere else in the filesystem, so we
1360 * stash them in the refcount btree with a refcount of 1 until either file
1361 * remapping (or CoW cancellation) happens.
1362 */
1363STATIC int
1364xfs_refcount_adjust_cow_extents(
1365 struct xfs_btree_cur *cur,
1366 xfs_agblock_t agbno,
1367 xfs_extlen_t aglen,
1368 enum xfs_refc_adjust_op adj,
1369 struct xfs_defer_ops *dfops,
1370 struct xfs_owner_info *oinfo)
1371{
1372 struct xfs_refcount_irec ext, tmp;
1373 int error;
1374 int found_rec, found_tmp;
1375
1376 if (aglen == 0)
1377 return 0;
1378
1379 /* Find any overlapping refcount records */
1380 error = xfs_refcount_lookup_ge(cur, agbno, &found_rec);
1381 if (error)
1382 goto out_error;
1383 error = xfs_refcount_get_rec(cur, &ext, &found_rec);
1384 if (error)
1385 goto out_error;
1386 if (!found_rec) {
1387 ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks +
1388 XFS_REFC_COW_START;
1389 ext.rc_blockcount = 0;
1390 ext.rc_refcount = 0;
1391 }
1392
1393 switch (adj) {
1394 case XFS_REFCOUNT_ADJUST_COW_ALLOC:
1395 /* Adding a CoW reservation, there should be nothing here. */
1396 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1397 ext.rc_startblock >= agbno + aglen, out_error);
1398
1399 tmp.rc_startblock = agbno;
1400 tmp.rc_blockcount = aglen;
1401 tmp.rc_refcount = 1;
1402 trace_xfs_refcount_modify_extent(cur->bc_mp,
1403 cur->bc_private.a.agno, &tmp);
1404
1405 error = xfs_refcount_insert(cur, &tmp,
1406 &found_tmp);
1407 if (error)
1408 goto out_error;
1409 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1410 found_tmp == 1, out_error);
1411 break;
1412 case XFS_REFCOUNT_ADJUST_COW_FREE:
1413 /* Removing a CoW reservation, there should be one extent. */
1414 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1415 ext.rc_startblock == agbno, out_error);
1416 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1417 ext.rc_blockcount == aglen, out_error);
1418 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1419 ext.rc_refcount == 1, out_error);
1420
1421 ext.rc_refcount = 0;
1422 trace_xfs_refcount_modify_extent(cur->bc_mp,
1423 cur->bc_private.a.agno, &ext);
1424 error = xfs_refcount_delete(cur, &found_rec);
1425 if (error)
1426 goto out_error;
1427 XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
1428 found_rec == 1, out_error);
1429 break;
1430 default:
1431 ASSERT(0);
1432 }
1433
1434 return error;
1435out_error:
1436 trace_xfs_refcount_modify_extent_error(cur->bc_mp,
1437 cur->bc_private.a.agno, error, _RET_IP_);
1438 return error;
1439}
1440
1441/*
1442 * Add or remove refcount btree entries for CoW reservations.
1443 */
1444STATIC int
1445xfs_refcount_adjust_cow(
1446 struct xfs_btree_cur *cur,
1447 xfs_agblock_t agbno,
1448 xfs_extlen_t aglen,
1449 enum xfs_refc_adjust_op adj,
1450 struct xfs_defer_ops *dfops)
1451{
1452 bool shape_changed;
1453 int error;
1454
1455 agbno += XFS_REFC_COW_START;
1456
1457 /*
1458 * Ensure that no rcextents cross the boundary of the adjustment range.
1459 */
1460 error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
1461 if (error)
1462 goto out_error;
1463
1464 error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
1465 if (error)
1466 goto out_error;
1467
1468 /*
1469 * Try to merge with the left or right extents of the range.
1470 */
1471 error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
1472 XFS_FIND_RCEXT_COW, &shape_changed);
1473 if (error)
1474 goto out_error;
1475
1476 /* Now that we've taken care of the ends, adjust the middle extents */
1477 error = xfs_refcount_adjust_cow_extents(cur, agbno, aglen, adj,
1478 dfops, NULL);
1479 if (error)
1480 goto out_error;
1481
1482 return 0;
1483
1484out_error:
1485 trace_xfs_refcount_adjust_cow_error(cur->bc_mp, cur->bc_private.a.agno,
1486 error, _RET_IP_);
1487 return error;
1488}
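The `agbno += XFS_REFC_COW_START` step above keys CoW staging records into a disjoint half of the AG block-number keyspace, so they can never collide or merge with ordinary shared-extent records (and `xfs_refcount_recover_cow_leftovers` refuses to run if `sb_agblocks` reaches that constant). A toy model of the encoding, assuming the constant is the high bit of a 32-bit block number:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for XFS_REFC_COW_START: CoW staging records
 * are stored with the high bit of the AG block number set, splitting
 * the refcount btree keyspace into "shared" and "CoW staging" halves.
 */
#define REFC_COW_START	((uint32_t)1 << 31)

static uint32_t cow_start_key(uint32_t agbno)
{
	return agbno + REFC_COW_START;		/* encode into CoW half */
}

static bool is_cow_record(uint32_t rc_startblock)
{
	return (rc_startblock & REFC_COW_START) != 0;
}

static uint32_t cow_record_agbno(uint32_t rc_startblock)
{
	return rc_startblock - REFC_COW_START;	/* decode to an AG block */
}
```

Because no valid AG can have the high bit's worth of blocks, the two halves never overlap, which is exactly the property the merge code relies on when it refuses to join CoW and shared records.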
1489
1490/*
1491 * Record a CoW allocation in the refcount btree.
1492 */
1493STATIC int
1494__xfs_refcount_cow_alloc(
1495 struct xfs_btree_cur *rcur,
1496 xfs_agblock_t agbno,
1497 xfs_extlen_t aglen,
1498 struct xfs_defer_ops *dfops)
1499{
1500 int error;
1501
1502 trace_xfs_refcount_cow_increase(rcur->bc_mp, rcur->bc_private.a.agno,
1503 agbno, aglen);
1504
1505 /* Add refcount btree reservation */
1506 error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
1507 XFS_REFCOUNT_ADJUST_COW_ALLOC, dfops);
1508 if (error)
1509 return error;
1510
1511 /* Add rmap entry */
1512 if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
1513 error = xfs_rmap_alloc_extent(rcur->bc_mp, dfops,
1514 rcur->bc_private.a.agno,
1515 agbno, aglen, XFS_RMAP_OWN_COW);
1516 if (error)
1517 return error;
1518 }
1519
1520 return error;
1521}
1522
1523/*
1524 * Remove a CoW allocation from the refcount btree.
1525 */
1526STATIC int
1527__xfs_refcount_cow_free(
1528 struct xfs_btree_cur *rcur,
1529 xfs_agblock_t agbno,
1530 xfs_extlen_t aglen,
1531 struct xfs_defer_ops *dfops)
1532{
1533 int error;
1534
1535 trace_xfs_refcount_cow_decrease(rcur->bc_mp, rcur->bc_private.a.agno,
1536 agbno, aglen);
1537
1538 /* Remove refcount btree reservation */
1539 error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
1540 XFS_REFCOUNT_ADJUST_COW_FREE, dfops);
1541 if (error)
1542 return error;
1543
1544 /* Remove rmap entry */
1545 if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
1546 error = xfs_rmap_free_extent(rcur->bc_mp, dfops,
1547 rcur->bc_private.a.agno,
1548 agbno, aglen, XFS_RMAP_OWN_COW);
1549 if (error)
1550 return error;
1551 }
1552
1553 return error;
1554}
1555
1556/* Record a CoW staging extent in the refcount btree. */
1557int
1558xfs_refcount_alloc_cow_extent(
1559 struct xfs_mount *mp,
1560 struct xfs_defer_ops *dfops,
1561 xfs_fsblock_t fsb,
1562 xfs_extlen_t len)
1563{
1564 if (!xfs_sb_version_hasreflink(&mp->m_sb))
1565 return 0;
1566
1567 return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_ALLOC_COW,
1568 fsb, len);
1569}
1570
1571/* Forget a CoW staging event in the refcount btree. */
1572int
1573xfs_refcount_free_cow_extent(
1574 struct xfs_mount *mp,
1575 struct xfs_defer_ops *dfops,
1576 xfs_fsblock_t fsb,
1577 xfs_extlen_t len)
1578{
1579 if (!xfs_sb_version_hasreflink(&mp->m_sb))
1580 return 0;
1581
1582 return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_FREE_COW,
1583 fsb, len);
1584}
1585
1586struct xfs_refcount_recovery {
1587 struct list_head rr_list;
1588 struct xfs_refcount_irec rr_rrec;
1589};
1590
1591/* Stuff an extent on the recovery list. */
1592STATIC int
1593xfs_refcount_recover_extent(
1594 struct xfs_btree_cur *cur,
1595 union xfs_btree_rec *rec,
1596 void *priv)
1597{
1598 struct list_head *debris = priv;
1599 struct xfs_refcount_recovery *rr;
1600
1601 if (be32_to_cpu(rec->refc.rc_refcount) != 1)
1602 return -EFSCORRUPTED;
1603
1604 rr = kmem_alloc(sizeof(struct xfs_refcount_recovery), KM_SLEEP);
1605 xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec);
1606 list_add_tail(&rr->rr_list, debris);
1607
1608 return 0;
1609}
1610
1611/* Find and remove leftover CoW reservations. */
1612int
1613xfs_refcount_recover_cow_leftovers(
1614 struct xfs_mount *mp,
1615 xfs_agnumber_t agno)
1616{
1617 struct xfs_trans *tp;
1618 struct xfs_btree_cur *cur;
1619 struct xfs_buf *agbp;
1620 struct xfs_refcount_recovery *rr, *n;
1621 struct list_head debris;
1622 union xfs_btree_irec low;
1623 union xfs_btree_irec high;
1624 struct xfs_defer_ops dfops;
1625 xfs_fsblock_t fsb;
1626 xfs_agblock_t agbno;
1627 int error;
1628
1629 if (mp->m_sb.sb_agblocks >= XFS_REFC_COW_START)
1630 return -EOPNOTSUPP;
1631
1632 error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
1633 if (error)
1634 return error;
1635 cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
1636
1637 /* Find all the leftover CoW staging extents. */
1638 INIT_LIST_HEAD(&debris);
1639 memset(&low, 0, sizeof(low));
1640 memset(&high, 0, sizeof(high));
1641 low.rc.rc_startblock = XFS_REFC_COW_START;
1642 high.rc.rc_startblock = -1U;
1643 error = xfs_btree_query_range(cur, &low, &high,
1644 xfs_refcount_recover_extent, &debris);
1645 if (error)
1646 goto out_cursor;
1647 xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
1648 xfs_buf_relse(agbp);
1649
1650 /* Now iterate the list to free the leftovers */
1651 list_for_each_entry(rr, &debris, rr_list) {
1652 /* Set up transaction. */
1653 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
1654 if (error)
1655 goto out_free;
1656
1657 trace_xfs_refcount_recover_extent(mp, agno, &rr->rr_rrec);
1658
1659 /* Free the orphan record */
1660 xfs_defer_init(&dfops, &fsb);
1661 agbno = rr->rr_rrec.rc_startblock - XFS_REFC_COW_START;
1662 fsb = XFS_AGB_TO_FSB(mp, agno, agbno);
1663 error = xfs_refcount_free_cow_extent(mp, &dfops, fsb,
1664 rr->rr_rrec.rc_blockcount);
1665 if (error)
1666 goto out_defer;
1667
1668 /* Free the block. */
1669 xfs_bmap_add_free(mp, &dfops, fsb,
1670 rr->rr_rrec.rc_blockcount, NULL);
1671
1672 error = xfs_defer_finish(&tp, &dfops, NULL);
1673 if (error)
1674 goto out_defer;
1675
1676 error = xfs_trans_commit(tp);
1677 if (error)
1678 goto out_free;
1679 }
1680
1681out_free:
1682 /* Free the leftover list */
1683 list_for_each_entry_safe(rr, n, &debris, rr_list) {
1684 list_del(&rr->rr_list);
1685 kmem_free(rr);
1686 }
1687 return error;
1688
1689out_cursor:
1690 xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
1691 xfs_buf_relse(agbp);
1692 goto out_free;
1693
1694out_defer:
1695 xfs_defer_cancel(&dfops);
1696 xfs_trans_cancel(tp);
1697 goto out_free;
1698}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
new file mode 100644
index 000000000000..098dc668ab2c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -0,0 +1,70 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#ifndef __XFS_REFCOUNT_H__
21#define __XFS_REFCOUNT_H__
22
23extern int xfs_refcount_lookup_le(struct xfs_btree_cur *cur,
24 xfs_agblock_t bno, int *stat);
25extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
26 xfs_agblock_t bno, int *stat);
27extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
28 struct xfs_refcount_irec *irec, int *stat);
29
30enum xfs_refcount_intent_type {
31 XFS_REFCOUNT_INCREASE = 1,
32 XFS_REFCOUNT_DECREASE,
33 XFS_REFCOUNT_ALLOC_COW,
34 XFS_REFCOUNT_FREE_COW,
35};
36
37struct xfs_refcount_intent {
38 struct list_head ri_list;
39 enum xfs_refcount_intent_type ri_type;
40 xfs_fsblock_t ri_startblock;
41 xfs_extlen_t ri_blockcount;
42};
43
44extern int xfs_refcount_increase_extent(struct xfs_mount *mp,
45 struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
46extern int xfs_refcount_decrease_extent(struct xfs_mount *mp,
47 struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
48
49extern void xfs_refcount_finish_one_cleanup(struct xfs_trans *tp,
50 struct xfs_btree_cur *rcur, int error);
51extern int xfs_refcount_finish_one(struct xfs_trans *tp,
52 struct xfs_defer_ops *dfops, enum xfs_refcount_intent_type type,
53 xfs_fsblock_t startblock, xfs_extlen_t blockcount,
54 xfs_fsblock_t *new_fsb, xfs_extlen_t *new_len,
55 struct xfs_btree_cur **pcur);
56
57extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
58 xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
59 xfs_extlen_t *flen, bool find_end_of_shared);
60
61extern int xfs_refcount_alloc_cow_extent(struct xfs_mount *mp,
62 struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
63 xfs_extlen_t len);
64extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
65 struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
66 xfs_extlen_t len);
67extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
68 xfs_agnumber_t agno);
69
70#endif /* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
new file mode 100644
index 000000000000..453bb2757ec2
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -0,0 +1,451 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_shared.h"
23#include "xfs_format.h"
24#include "xfs_log_format.h"
25#include "xfs_trans_resv.h"
26#include "xfs_sb.h"
27#include "xfs_mount.h"
28#include "xfs_btree.h"
29#include "xfs_bmap.h"
30#include "xfs_refcount_btree.h"
31#include "xfs_alloc.h"
32#include "xfs_error.h"
33#include "xfs_trace.h"
34#include "xfs_cksum.h"
35#include "xfs_trans.h"
36#include "xfs_bit.h"
37#include "xfs_rmap.h"
38
39static struct xfs_btree_cur *
40xfs_refcountbt_dup_cursor(
41 struct xfs_btree_cur *cur)
42{
43 return xfs_refcountbt_init_cursor(cur->bc_mp, cur->bc_tp,
44 cur->bc_private.a.agbp, cur->bc_private.a.agno,
45 cur->bc_private.a.dfops);
46}
47
48STATIC void
49xfs_refcountbt_set_root(
50 struct xfs_btree_cur *cur,
51 union xfs_btree_ptr *ptr,
52 int inc)
53{
54 struct xfs_buf *agbp = cur->bc_private.a.agbp;
55 struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp);
56 xfs_agnumber_t seqno = be32_to_cpu(agf->agf_seqno);
57 struct xfs_perag *pag = xfs_perag_get(cur->bc_mp, seqno);
58
59 ASSERT(ptr->s != 0);
60
61 agf->agf_refcount_root = ptr->s;
62 be32_add_cpu(&agf->agf_refcount_level, inc);
63 pag->pagf_refcount_level += inc;
64 xfs_perag_put(pag);
65
66 xfs_alloc_log_agf(cur->bc_tp, agbp,
67 XFS_AGF_REFCOUNT_ROOT | XFS_AGF_REFCOUNT_LEVEL);
68}
69
70STATIC int
71xfs_refcountbt_alloc_block(
72 struct xfs_btree_cur *cur,
73 union xfs_btree_ptr *start,
74 union xfs_btree_ptr *new,
75 int *stat)
76{
77 struct xfs_buf *agbp = cur->bc_private.a.agbp;
78 struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp);
79 struct xfs_alloc_arg args; /* block allocation args */
80 int error; /* error return value */
81
82 XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
83
84 memset(&args, 0, sizeof(args));
85 args.tp = cur->bc_tp;
86 args.mp = cur->bc_mp;
87 args.type = XFS_ALLOCTYPE_NEAR_BNO;
88 args.fsbno = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
89 xfs_refc_block(args.mp));
90 args.firstblock = args.fsbno;
91 xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
92 args.minlen = args.maxlen = args.prod = 1;
93 args.resv = XFS_AG_RESV_METADATA;
94
95 error = xfs_alloc_vextent(&args);
96 if (error)
97 goto out_error;
98 trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
99 args.agbno, 1);
100 if (args.fsbno == NULLFSBLOCK) {
101 XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
102 *stat = 0;
103 return 0;
104 }
105 ASSERT(args.agno == cur->bc_private.a.agno);
106 ASSERT(args.len == 1);
107
108 new->s = cpu_to_be32(args.agbno);
109 be32_add_cpu(&agf->agf_refcount_blocks, 1);
110 xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
111
112 XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
113 *stat = 1;
114 return 0;
115
116out_error:
117 XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
118 return error;
119}
120
121STATIC int
122xfs_refcountbt_free_block(
123 struct xfs_btree_cur *cur,
124 struct xfs_buf *bp)
125{
126 struct xfs_mount *mp = cur->bc_mp;
127 struct xfs_buf *agbp = cur->bc_private.a.agbp;
128 struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp);
129 xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
130 struct xfs_owner_info oinfo;
131 int error;
132
133 trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
134 XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
135 xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
136 be32_add_cpu(&agf->agf_refcount_blocks, -1);
137 xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
138 error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
139 XFS_AG_RESV_METADATA);
140 if (error)
141 return error;
142
143	return 0;
144}
145
146STATIC int
147xfs_refcountbt_get_minrecs(
148 struct xfs_btree_cur *cur,
149 int level)
150{
151 return cur->bc_mp->m_refc_mnr[level != 0];
152}
153
154STATIC int
155xfs_refcountbt_get_maxrecs(
156 struct xfs_btree_cur *cur,
157 int level)
158{
159 return cur->bc_mp->m_refc_mxr[level != 0];
160}
161
162STATIC void
163xfs_refcountbt_init_key_from_rec(
164 union xfs_btree_key *key,
165 union xfs_btree_rec *rec)
166{
167 key->refc.rc_startblock = rec->refc.rc_startblock;
168}
169
170STATIC void
171xfs_refcountbt_init_high_key_from_rec(
172 union xfs_btree_key *key,
173 union xfs_btree_rec *rec)
174{
175 __u32 x;
176
177 x = be32_to_cpu(rec->refc.rc_startblock);
178 x += be32_to_cpu(rec->refc.rc_blockcount) - 1;
179 key->refc.rc_startblock = cpu_to_be32(x);
180}
181
182STATIC void
183xfs_refcountbt_init_rec_from_cur(
184 struct xfs_btree_cur *cur,
185 union xfs_btree_rec *rec)
186{
187 rec->refc.rc_startblock = cpu_to_be32(cur->bc_rec.rc.rc_startblock);
188 rec->refc.rc_blockcount = cpu_to_be32(cur->bc_rec.rc.rc_blockcount);
189 rec->refc.rc_refcount = cpu_to_be32(cur->bc_rec.rc.rc_refcount);
190}
191
192STATIC void
193xfs_refcountbt_init_ptr_from_cur(
194 struct xfs_btree_cur *cur,
195 union xfs_btree_ptr *ptr)
196{
197 struct xfs_agf *agf = XFS_BUF_TO_AGF(cur->bc_private.a.agbp);
198
199 ASSERT(cur->bc_private.a.agno == be32_to_cpu(agf->agf_seqno));
200 ASSERT(agf->agf_refcount_root != 0);
201
202 ptr->s = agf->agf_refcount_root;
203}
204
205STATIC __int64_t
206xfs_refcountbt_key_diff(
207 struct xfs_btree_cur *cur,
208 union xfs_btree_key *key)
209{
210 struct xfs_refcount_irec *rec = &cur->bc_rec.rc;
211 struct xfs_refcount_key *kp = &key->refc;
212
213 return (__int64_t)be32_to_cpu(kp->rc_startblock) - rec->rc_startblock;
214}
215
216STATIC __int64_t
217xfs_refcountbt_diff_two_keys(
218 struct xfs_btree_cur *cur,
219 union xfs_btree_key *k1,
220 union xfs_btree_key *k2)
221{
222 return (__int64_t)be32_to_cpu(k1->refc.rc_startblock) -
223 be32_to_cpu(k2->refc.rc_startblock);
224}
225
226STATIC bool
227xfs_refcountbt_verify(
228 struct xfs_buf *bp)
229{
230 struct xfs_mount *mp = bp->b_target->bt_mount;
231 struct xfs_btree_block *block = XFS_BUF_TO_BLOCK(bp);
232 struct xfs_perag *pag = bp->b_pag;
233 unsigned int level;
234
235 if (block->bb_magic != cpu_to_be32(XFS_REFC_CRC_MAGIC))
236 return false;
237
238 if (!xfs_sb_version_hasreflink(&mp->m_sb))
239 return false;
240 if (!xfs_btree_sblock_v5hdr_verify(bp))
241 return false;
242
243 level = be16_to_cpu(block->bb_level);
244 if (pag && pag->pagf_init) {
245 if (level >= pag->pagf_refcount_level)
246 return false;
247 } else if (level >= mp->m_refc_maxlevels)
248 return false;
249
250 return xfs_btree_sblock_verify(bp, mp->m_refc_mxr[level != 0]);
251}
252
253STATIC void
254xfs_refcountbt_read_verify(
255 struct xfs_buf *bp)
256{
257 if (!xfs_btree_sblock_verify_crc(bp))
258 xfs_buf_ioerror(bp, -EFSBADCRC);
259 else if (!xfs_refcountbt_verify(bp))
260 xfs_buf_ioerror(bp, -EFSCORRUPTED);
261
262 if (bp->b_error) {
263 trace_xfs_btree_corrupt(bp, _RET_IP_);
264 xfs_verifier_error(bp);
265 }
266}
267
268STATIC void
269xfs_refcountbt_write_verify(
270 struct xfs_buf *bp)
271{
272 if (!xfs_refcountbt_verify(bp)) {
273 trace_xfs_btree_corrupt(bp, _RET_IP_);
274 xfs_buf_ioerror(bp, -EFSCORRUPTED);
275 xfs_verifier_error(bp);
276 return;
277 }
278 xfs_btree_sblock_calc_crc(bp);
279
280}
281
282const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
283 .name = "xfs_refcountbt",
284 .verify_read = xfs_refcountbt_read_verify,
285 .verify_write = xfs_refcountbt_write_verify,
286};
287
288#if defined(DEBUG) || defined(XFS_WARN)
289STATIC int
290xfs_refcountbt_keys_inorder(
291 struct xfs_btree_cur *cur,
292 union xfs_btree_key *k1,
293 union xfs_btree_key *k2)
294{
295 return be32_to_cpu(k1->refc.rc_startblock) <
296 be32_to_cpu(k2->refc.rc_startblock);
297}
298
299STATIC int
300xfs_refcountbt_recs_inorder(
301 struct xfs_btree_cur *cur,
302 union xfs_btree_rec *r1,
303 union xfs_btree_rec *r2)
304{
305 return be32_to_cpu(r1->refc.rc_startblock) +
306 be32_to_cpu(r1->refc.rc_blockcount) <=
307 be32_to_cpu(r2->refc.rc_startblock);
308}
309#endif
310
311static const struct xfs_btree_ops xfs_refcountbt_ops = {
312 .rec_len = sizeof(struct xfs_refcount_rec),
313 .key_len = sizeof(struct xfs_refcount_key),
314
315 .dup_cursor = xfs_refcountbt_dup_cursor,
316 .set_root = xfs_refcountbt_set_root,
317 .alloc_block = xfs_refcountbt_alloc_block,
318 .free_block = xfs_refcountbt_free_block,
319 .get_minrecs = xfs_refcountbt_get_minrecs,
320 .get_maxrecs = xfs_refcountbt_get_maxrecs,
321 .init_key_from_rec = xfs_refcountbt_init_key_from_rec,
322 .init_high_key_from_rec = xfs_refcountbt_init_high_key_from_rec,
323 .init_rec_from_cur = xfs_refcountbt_init_rec_from_cur,
324 .init_ptr_from_cur = xfs_refcountbt_init_ptr_from_cur,
325 .key_diff = xfs_refcountbt_key_diff,
326 .buf_ops = &xfs_refcountbt_buf_ops,
327 .diff_two_keys = xfs_refcountbt_diff_two_keys,
328#if defined(DEBUG) || defined(XFS_WARN)
329 .keys_inorder = xfs_refcountbt_keys_inorder,
330 .recs_inorder = xfs_refcountbt_recs_inorder,
331#endif
332};
333
334/*
335 * Allocate a new refcount btree cursor.
336 */
337struct xfs_btree_cur *
338xfs_refcountbt_init_cursor(
339 struct xfs_mount *mp,
340 struct xfs_trans *tp,
341 struct xfs_buf *agbp,
342 xfs_agnumber_t agno,
343 struct xfs_defer_ops *dfops)
344{
345 struct xfs_agf *agf = XFS_BUF_TO_AGF(agbp);
346 struct xfs_btree_cur *cur;
347
348 ASSERT(agno != NULLAGNUMBER);
349 ASSERT(agno < mp->m_sb.sb_agcount);
350 cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
351
352 cur->bc_tp = tp;
353 cur->bc_mp = mp;
354 cur->bc_btnum = XFS_BTNUM_REFC;
355 cur->bc_blocklog = mp->m_sb.sb_blocklog;
356 cur->bc_ops = &xfs_refcountbt_ops;
357
358 cur->bc_nlevels = be32_to_cpu(agf->agf_refcount_level);
359
360 cur->bc_private.a.agbp = agbp;
361 cur->bc_private.a.agno = agno;
362 cur->bc_private.a.dfops = dfops;
363 cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
364
365 cur->bc_private.a.priv.refc.nr_ops = 0;
366 cur->bc_private.a.priv.refc.shape_changes = 0;
367
368 return cur;
369}
370
371/*
372 * Calculate the number of records in a refcount btree block.
373 */
374int
375xfs_refcountbt_maxrecs(
376 struct xfs_mount *mp,
377 int blocklen,
378 bool leaf)
379{
380 blocklen -= XFS_REFCOUNT_BLOCK_LEN;
381
382 if (leaf)
383 return blocklen / sizeof(struct xfs_refcount_rec);
384 return blocklen / (sizeof(struct xfs_refcount_key) +
385 sizeof(xfs_refcount_ptr_t));
386}
387
388/* Compute the maximum height of a refcount btree. */
389void
390xfs_refcountbt_compute_maxlevels(
391 struct xfs_mount *mp)
392{
393 mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
394 mp->m_refc_mnr, mp->m_sb.sb_agblocks);
395}
396
397/* Calculate the refcount btree size for some records. */
398xfs_extlen_t
399xfs_refcountbt_calc_size(
400 struct xfs_mount *mp,
401 unsigned long long len)
402{
403 return xfs_btree_calc_size(mp, mp->m_refc_mnr, len);
404}
405
406/*
407 * Calculate the maximum refcount btree size.
408 */
409xfs_extlen_t
410xfs_refcountbt_max_size(
411 struct xfs_mount *mp)
412{
413 /* Bail out if we're uninitialized, which can happen in mkfs. */
414 if (mp->m_refc_mxr[0] == 0)
415 return 0;
416
417 return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
418}
419
420/*
421 * Figure out how many blocks to reserve and how many are used by this btree.
422 */
423int
424xfs_refcountbt_calc_reserves(
425 struct xfs_mount *mp,
426 xfs_agnumber_t agno,
427 xfs_extlen_t *ask,
428 xfs_extlen_t *used)
429{
430 struct xfs_buf *agbp;
431 struct xfs_agf *agf;
432 xfs_extlen_t tree_len;
433 int error;
434
435 if (!xfs_sb_version_hasreflink(&mp->m_sb))
436 return 0;
437
438 *ask += xfs_refcountbt_max_size(mp);
439
440 error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
441 if (error)
442 return error;
443
444 agf = XFS_BUF_TO_AGF(agbp);
445 tree_len = be32_to_cpu(agf->agf_refcount_blocks);
446 xfs_buf_relse(agbp);
447
448 *used += tree_len;
449
450 return error;
451}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
new file mode 100644
index 000000000000..3be7768bd51a
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -0,0 +1,74 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write to the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#ifndef __XFS_REFCOUNT_BTREE_H__
21#define __XFS_REFCOUNT_BTREE_H__
22
23/*
24 * Reference Count Btree on-disk structures
25 */
26
27struct xfs_buf;
28struct xfs_btree_cur;
29struct xfs_mount;
30
31/*
32 * Btree block header size
33 */
34#define XFS_REFCOUNT_BLOCK_LEN XFS_BTREE_SBLOCK_CRC_LEN
35
36/*
37 * Record, key, and pointer address macros for btree blocks.
38 *
39 * (note that some of these may appear unused, but they are used in userspace)
40 */
41#define XFS_REFCOUNT_REC_ADDR(block, index) \
42 ((struct xfs_refcount_rec *) \
43 ((char *)(block) + \
44 XFS_REFCOUNT_BLOCK_LEN + \
45 (((index) - 1) * sizeof(struct xfs_refcount_rec))))
46
47#define XFS_REFCOUNT_KEY_ADDR(block, index) \
48 ((struct xfs_refcount_key *) \
49 ((char *)(block) + \
50 XFS_REFCOUNT_BLOCK_LEN + \
51 ((index) - 1) * sizeof(struct xfs_refcount_key)))
52
53#define XFS_REFCOUNT_PTR_ADDR(block, index, maxrecs) \
54 ((xfs_refcount_ptr_t *) \
55 ((char *)(block) + \
56 XFS_REFCOUNT_BLOCK_LEN + \
57 (maxrecs) * sizeof(struct xfs_refcount_key) + \
58 ((index) - 1) * sizeof(xfs_refcount_ptr_t)))
59
60extern struct xfs_btree_cur *xfs_refcountbt_init_cursor(struct xfs_mount *mp,
61 struct xfs_trans *tp, struct xfs_buf *agbp, xfs_agnumber_t agno,
62 struct xfs_defer_ops *dfops);
63extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
64 bool leaf);
65extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
66
67extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
68 unsigned long long len);
69extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
70
71extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
72 xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
73
74#endif /* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 73d05407d663..3a8cc7139912 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -148,6 +148,37 @@ done:
148	return error;
149}
150
151STATIC int
152xfs_rmap_delete(
153 struct xfs_btree_cur *rcur,
154 xfs_agblock_t agbno,
155 xfs_extlen_t len,
156 uint64_t owner,
157 uint64_t offset,
158 unsigned int flags)
159{
160 int i;
161 int error;
162
163 trace_xfs_rmap_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
164 len, owner, offset, flags);
165
166 error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
167 if (error)
168 goto done;
169 XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
170
171 error = xfs_btree_delete(rcur, &i);
172 if (error)
173 goto done;
174 XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
175done:
176 if (error)
177 trace_xfs_rmap_delete_error(rcur->bc_mp,
178 rcur->bc_private.a.agno, error, _RET_IP_);
179 return error;
180}
181
182static int
183xfs_rmap_btrec_to_irec(
184	union xfs_btree_rec	*rec,
@@ -180,6 +211,160 @@ xfs_rmap_get_rec(
211	return xfs_rmap_btrec_to_irec(rec, irec);
212}
213
214struct xfs_find_left_neighbor_info {
215 struct xfs_rmap_irec high;
216 struct xfs_rmap_irec *irec;
217 int *stat;
218};
219
220/* For each rmap given, figure out if it matches the key we want. */
221STATIC int
222xfs_rmap_find_left_neighbor_helper(
223 struct xfs_btree_cur *cur,
224 struct xfs_rmap_irec *rec,
225 void *priv)
226{
227 struct xfs_find_left_neighbor_info *info = priv;
228
229 trace_xfs_rmap_find_left_neighbor_candidate(cur->bc_mp,
230 cur->bc_private.a.agno, rec->rm_startblock,
231 rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
232 rec->rm_flags);
233
234 if (rec->rm_owner != info->high.rm_owner)
235 return XFS_BTREE_QUERY_RANGE_CONTINUE;
236 if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
237 !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
238 rec->rm_offset + rec->rm_blockcount - 1 != info->high.rm_offset)
239 return XFS_BTREE_QUERY_RANGE_CONTINUE;
240
241 *info->irec = *rec;
242 *info->stat = 1;
243 return XFS_BTREE_QUERY_RANGE_ABORT;
244}
245
246/*
247 * Find the record to the left of the given extent, being careful only to
248 * return a match with the same owner and adjacent physical and logical
249 * block ranges.
250 */
251int
252xfs_rmap_find_left_neighbor(
253 struct xfs_btree_cur *cur,
254 xfs_agblock_t bno,
255 uint64_t owner,
256 uint64_t offset,
257 unsigned int flags,
258 struct xfs_rmap_irec *irec,
259 int *stat)
260{
261 struct xfs_find_left_neighbor_info info;
262 int error;
263
264 *stat = 0;
265 if (bno == 0)
266 return 0;
267 info.high.rm_startblock = bno - 1;
268 info.high.rm_owner = owner;
269 if (!XFS_RMAP_NON_INODE_OWNER(owner) &&
270 !(flags & XFS_RMAP_BMBT_BLOCK)) {
271 if (offset == 0)
272 return 0;
273 info.high.rm_offset = offset - 1;
274 } else
275 info.high.rm_offset = 0;
276 info.high.rm_flags = flags;
277 info.high.rm_blockcount = 0;
278 info.irec = irec;
279 info.stat = stat;
280
281 trace_xfs_rmap_find_left_neighbor_query(cur->bc_mp,
282 cur->bc_private.a.agno, bno, 0, owner, offset, flags);
283
284 error = xfs_rmap_query_range(cur, &info.high, &info.high,
285 xfs_rmap_find_left_neighbor_helper, &info);
286 if (error == XFS_BTREE_QUERY_RANGE_ABORT)
287 error = 0;
288 if (*stat)
289 trace_xfs_rmap_find_left_neighbor_result(cur->bc_mp,
290 cur->bc_private.a.agno, irec->rm_startblock,
291 irec->rm_blockcount, irec->rm_owner,
292 irec->rm_offset, irec->rm_flags);
293 return error;
294}
295
296/* For each rmap given, figure out if it matches the key we want. */
297STATIC int
298xfs_rmap_lookup_le_range_helper(
299 struct xfs_btree_cur *cur,
300 struct xfs_rmap_irec *rec,
301 void *priv)
302{
303 struct xfs_find_left_neighbor_info *info = priv;
304
305 trace_xfs_rmap_lookup_le_range_candidate(cur->bc_mp,
306 cur->bc_private.a.agno, rec->rm_startblock,
307 rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
308 rec->rm_flags);
309
310 if (rec->rm_owner != info->high.rm_owner)
311 return XFS_BTREE_QUERY_RANGE_CONTINUE;
312 if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
313 !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
314 (rec->rm_offset > info->high.rm_offset ||
315 rec->rm_offset + rec->rm_blockcount <= info->high.rm_offset))
316 return XFS_BTREE_QUERY_RANGE_CONTINUE;
317
318 *info->irec = *rec;
319 *info->stat = 1;
320 return XFS_BTREE_QUERY_RANGE_ABORT;
321}
322
323/*
324 * Find the record to the left of the given extent, being careful only to
325 * return a match with the same owner and overlapping physical and logical
326 * block ranges. This is the overlapping-interval version of
327 * xfs_rmap_lookup_le.
328 */
329int
330xfs_rmap_lookup_le_range(
331 struct xfs_btree_cur *cur,
332 xfs_agblock_t bno,
333 uint64_t owner,
334 uint64_t offset,
335 unsigned int flags,
336 struct xfs_rmap_irec *irec,
337 int *stat)
338{
339 struct xfs_find_left_neighbor_info info;
340 int error;
341
342 info.high.rm_startblock = bno;
343 info.high.rm_owner = owner;
344 if (!XFS_RMAP_NON_INODE_OWNER(owner) && !(flags & XFS_RMAP_BMBT_BLOCK))
345 info.high.rm_offset = offset;
346 else
347 info.high.rm_offset = 0;
348 info.high.rm_flags = flags;
349 info.high.rm_blockcount = 0;
350 *stat = 0;
351 info.irec = irec;
352 info.stat = stat;
353
354 trace_xfs_rmap_lookup_le_range(cur->bc_mp,
355 cur->bc_private.a.agno, bno, 0, owner, offset, flags);
356 error = xfs_rmap_query_range(cur, &info.high, &info.high,
357 xfs_rmap_lookup_le_range_helper, &info);
358 if (error == XFS_BTREE_QUERY_RANGE_ABORT)
359 error = 0;
360 if (*stat)
361 trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
362 cur->bc_private.a.agno, irec->rm_startblock,
363 irec->rm_blockcount, irec->rm_owner,
364 irec->rm_offset, irec->rm_flags);
365 return error;
366}
367
368/*
369 * Find the extent in the rmap btree and remove it.
370 *
@@ -1093,11 +1278,704 @@ done:
1278	return error;
1279}
1280
1281/*
1282 * Convert an unwritten extent to a real extent or vice versa. If there is no
1283 * possibility of overlapping extents, delegate to the simpler convert
1284 * function.
1285 */
1286STATIC int
1287xfs_rmap_convert_shared(
1288 struct xfs_btree_cur *cur,
1289 xfs_agblock_t bno,
1290 xfs_extlen_t len,
1291 bool unwritten,
1292 struct xfs_owner_info *oinfo)
1293{
1294 struct xfs_mount *mp = cur->bc_mp;
1295 struct xfs_rmap_irec r[4]; /* neighbor extent entries */
1296 /* left is 0, right is 1, prev is 2 */
1297 /* new is 3 */
1298 uint64_t owner;
1299 uint64_t offset;
1300 uint64_t new_endoff;
1301 unsigned int oldext;
1302 unsigned int newext;
1303 unsigned int flags = 0;
1304 int i;
1305 int state = 0;
1306 int error;
1307
1308 xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
1309 ASSERT(!(XFS_RMAP_NON_INODE_OWNER(owner) ||
1310 (flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))));
1311 oldext = unwritten ? XFS_RMAP_UNWRITTEN : 0;
1312 new_endoff = offset + len;
1313 trace_xfs_rmap_convert(mp, cur->bc_private.a.agno, bno, len,
1314 unwritten, oinfo);
1315
1316 /*
1317	 * For the initial lookup, look for an exact match or the left-adjacent
1318 * record for our insertion point. This will also give us the record for
1319 * start block contiguity tests.
1320 */
1321	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
1322			&PREV, &i);
	if (error)
		goto done;
1323	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1324
1325 ASSERT(PREV.rm_offset <= offset);
1326 ASSERT(PREV.rm_offset + PREV.rm_blockcount >= new_endoff);
1327 ASSERT((PREV.rm_flags & XFS_RMAP_UNWRITTEN) == oldext);
1328 newext = ~oldext & XFS_RMAP_UNWRITTEN;
1329
1330 /*
1331 * Set flags determining what part of the previous oldext allocation
1332 * extent is being replaced by a newext allocation.
1333 */
1334 if (PREV.rm_offset == offset)
1335 state |= RMAP_LEFT_FILLING;
1336 if (PREV.rm_offset + PREV.rm_blockcount == new_endoff)
1337 state |= RMAP_RIGHT_FILLING;
1338
1339 /* Is there a left record that abuts our range? */
1340 error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, newext,
1341 &LEFT, &i);
1342 if (error)
1343 goto done;
1344 if (i) {
1345 state |= RMAP_LEFT_VALID;
1346 XFS_WANT_CORRUPTED_GOTO(mp,
1347 LEFT.rm_startblock + LEFT.rm_blockcount <= bno,
1348 done);
1349 if (xfs_rmap_is_mergeable(&LEFT, owner, newext))
1350 state |= RMAP_LEFT_CONTIG;
1351 }
1352
1353 /* Is there a right record that abuts our range? */
1354 error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
1355 newext, &i);
1356 if (error)
1357 goto done;
1358 if (i) {
1359 state |= RMAP_RIGHT_VALID;
1360 error = xfs_rmap_get_rec(cur, &RIGHT, &i);
1361 if (error)
1362 goto done;
1363 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1364 XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= RIGHT.rm_startblock,
1365 done);
1366 trace_xfs_rmap_find_right_neighbor_result(cur->bc_mp,
1367 cur->bc_private.a.agno, RIGHT.rm_startblock,
1368 RIGHT.rm_blockcount, RIGHT.rm_owner,
1369 RIGHT.rm_offset, RIGHT.rm_flags);
1370 if (xfs_rmap_is_mergeable(&RIGHT, owner, newext))
1371 state |= RMAP_RIGHT_CONTIG;
1372 }
1373
1374 /* check that left + prev + right is not too long */
1375 if ((state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
1376 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) ==
1377 (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
1378 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG) &&
1379 (unsigned long)LEFT.rm_blockcount + len +
1380 RIGHT.rm_blockcount > XFS_RMAP_LEN_MAX)
1381 state &= ~RMAP_RIGHT_CONTIG;
1382
1383 trace_xfs_rmap_convert_state(mp, cur->bc_private.a.agno, state,
1384 _RET_IP_);
1385 /*
1386 * Switch out based on the FILLING and CONTIG state bits.
1387 */
1388 switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
1389 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
1390 case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
1391 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
1392 /*
1393 * Setting all of a previous oldext extent to newext.
1394 * The left and right neighbors are both contiguous with new.
1395 */
1396 error = xfs_rmap_delete(cur, RIGHT.rm_startblock,
1397 RIGHT.rm_blockcount, RIGHT.rm_owner,
1398 RIGHT.rm_offset, RIGHT.rm_flags);
1399 if (error)
1400 goto done;
1401 error = xfs_rmap_delete(cur, PREV.rm_startblock,
1402 PREV.rm_blockcount, PREV.rm_owner,
1403 PREV.rm_offset, PREV.rm_flags);
1404 if (error)
1405 goto done;
1406 NEW = LEFT;
1407 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1408 NEW.rm_blockcount, NEW.rm_owner,
1409 NEW.rm_offset, NEW.rm_flags, &i);
1410 if (error)
1411 goto done;
1412 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1413 NEW.rm_blockcount += PREV.rm_blockcount + RIGHT.rm_blockcount;
1414 error = xfs_rmap_update(cur, &NEW);
1415 if (error)
1416 goto done;
1417 break;
1418
1419 case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
1420 /*
1421 * Setting all of a previous oldext extent to newext.
1422 * The left neighbor is contiguous, the right is not.
1423 */
1424 error = xfs_rmap_delete(cur, PREV.rm_startblock,
1425 PREV.rm_blockcount, PREV.rm_owner,
1426 PREV.rm_offset, PREV.rm_flags);
1427 if (error)
1428 goto done;
1429 NEW = LEFT;
1430 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1431 NEW.rm_blockcount, NEW.rm_owner,
1432 NEW.rm_offset, NEW.rm_flags, &i);
1433 if (error)
1434 goto done;
1435 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1436 NEW.rm_blockcount += PREV.rm_blockcount;
1437 error = xfs_rmap_update(cur, &NEW);
1438 if (error)
1439 goto done;
1440 break;
1441
1442 case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
1443 /*
1444 * Setting all of a previous oldext extent to newext.
1445 * The right neighbor is contiguous, the left is not.
1446 */
1447 error = xfs_rmap_delete(cur, RIGHT.rm_startblock,
1448 RIGHT.rm_blockcount, RIGHT.rm_owner,
1449 RIGHT.rm_offset, RIGHT.rm_flags);
1450 if (error)
1451 goto done;
1452 NEW = PREV;
1453 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1454 NEW.rm_blockcount, NEW.rm_owner,
1455 NEW.rm_offset, NEW.rm_flags, &i);
1456 if (error)
1457 goto done;
1458 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1459 NEW.rm_blockcount += RIGHT.rm_blockcount;
1460 NEW.rm_flags = RIGHT.rm_flags;
1461 error = xfs_rmap_update(cur, &NEW);
1462 if (error)
1463 goto done;
1464 break;
1465
1466 case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING:
1467 /*
1468 * Setting all of a previous oldext extent to newext.
1469 * Neither the left nor right neighbors are contiguous with
1470 * the new one.
1471 */
1472 NEW = PREV;
1473 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1474 NEW.rm_blockcount, NEW.rm_owner,
1475 NEW.rm_offset, NEW.rm_flags, &i);
1476 if (error)
1477 goto done;
1478 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1479 NEW.rm_flags = newext;
1480 error = xfs_rmap_update(cur, &NEW);
1481 if (error)
1482 goto done;
1483 break;
1484
1485 case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG:
1486 /*
1487 * Setting the first part of a previous oldext extent to newext.
1488 * The left neighbor is contiguous.
1489 */
1490 NEW = PREV;
1491 error = xfs_rmap_delete(cur, NEW.rm_startblock,
1492 NEW.rm_blockcount, NEW.rm_owner,
1493 NEW.rm_offset, NEW.rm_flags);
1494 if (error)
1495 goto done;
1496 NEW.rm_offset += len;
1497 NEW.rm_startblock += len;
1498 NEW.rm_blockcount -= len;
1499 error = xfs_rmap_insert(cur, NEW.rm_startblock,
1500 NEW.rm_blockcount, NEW.rm_owner,
1501 NEW.rm_offset, NEW.rm_flags);
1502 if (error)
1503 goto done;
1504 NEW = LEFT;
1505 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1506 NEW.rm_blockcount, NEW.rm_owner,
1507 NEW.rm_offset, NEW.rm_flags, &i);
1508 if (error)
1509 goto done;
1510 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1511 NEW.rm_blockcount += len;
1512 error = xfs_rmap_update(cur, &NEW);
1513 if (error)
1514 goto done;
1515 break;
1516
1517 case RMAP_LEFT_FILLING:
1518 /*
1519 * Setting the first part of a previous oldext extent to newext.
1520 * The left neighbor is not contiguous.
1521 */
1522 NEW = PREV;
1523 error = xfs_rmap_delete(cur, NEW.rm_startblock,
1524 NEW.rm_blockcount, NEW.rm_owner,
1525 NEW.rm_offset, NEW.rm_flags);
1526 if (error)
1527 goto done;
1528 NEW.rm_offset += len;
1529 NEW.rm_startblock += len;
1530 NEW.rm_blockcount -= len;
1531 error = xfs_rmap_insert(cur, NEW.rm_startblock,
1532 NEW.rm_blockcount, NEW.rm_owner,
1533 NEW.rm_offset, NEW.rm_flags);
1534 if (error)
1535 goto done;
1536 error = xfs_rmap_insert(cur, bno, len, owner, offset, newext);
1537 if (error)
1538 goto done;
1539 break;
1540
1541 case RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
1542 /*
1543 * Setting the last part of a previous oldext extent to newext.
1544 * The right neighbor is contiguous with the new allocation.
1545 */
1546 NEW = PREV;
1547 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1548 NEW.rm_blockcount, NEW.rm_owner,
1549 NEW.rm_offset, NEW.rm_flags, &i);
1550 if (error)
1551 goto done;
1552 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1553 NEW.rm_blockcount = offset - NEW.rm_offset;
1554 error = xfs_rmap_update(cur, &NEW);
1555 if (error)
1556 goto done;
1557 NEW = RIGHT;
1558 error = xfs_rmap_delete(cur, NEW.rm_startblock,
1559 NEW.rm_blockcount, NEW.rm_owner,
1560 NEW.rm_offset, NEW.rm_flags);
1561 if (error)
1562 goto done;
1563 NEW.rm_offset = offset;
1564 NEW.rm_startblock = bno;
1565 NEW.rm_blockcount += len;
1566 error = xfs_rmap_insert(cur, NEW.rm_startblock,
1567 NEW.rm_blockcount, NEW.rm_owner,
1568 NEW.rm_offset, NEW.rm_flags);
1569 if (error)
1570 goto done;
1571 break;
1572
1573 case RMAP_RIGHT_FILLING:
1574 /*
1575 * Setting the last part of a previous oldext extent to newext.
1576 * The right neighbor is not contiguous.
1577 */
1578 NEW = PREV;
1579 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1580 NEW.rm_blockcount, NEW.rm_owner,
1581 NEW.rm_offset, NEW.rm_flags, &i);
1582 if (error)
1583 goto done;
1584 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1585 NEW.rm_blockcount -= len;
1586 error = xfs_rmap_update(cur, &NEW);
1587 if (error)
1588 goto done;
1589 error = xfs_rmap_insert(cur, bno, len, owner, offset, newext);
1590 if (error)
1591 goto done;
1592 break;
1593
1594 case 0:
1595 /*
1596 * Setting the middle part of a previous oldext extent to
1597 * newext. Contiguity is impossible here.
1598 * One extent becomes three extents.
1599 */
1600 /* new right extent - oldext */
1601 NEW.rm_startblock = bno + len;
1602 NEW.rm_owner = owner;
1603 NEW.rm_offset = new_endoff;
1604 NEW.rm_blockcount = PREV.rm_offset + PREV.rm_blockcount -
1605 new_endoff;
1606 NEW.rm_flags = PREV.rm_flags;
1607 error = xfs_rmap_insert(cur, NEW.rm_startblock,
1608 NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
1609 NEW.rm_flags);
1610 if (error)
1611 goto done;
1612 /* new left extent - oldext */
1613 NEW = PREV;
1614 error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
1615 NEW.rm_blockcount, NEW.rm_owner,
1616 NEW.rm_offset, NEW.rm_flags, &i);
1617 if (error)
1618 goto done;
1619 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
1620 NEW.rm_blockcount = offset - NEW.rm_offset;
1621 error = xfs_rmap_update(cur, &NEW);
1622 if (error)
1623 goto done;
1624 /* new middle extent - newext */
1625 NEW.rm_startblock = bno;
1626 NEW.rm_blockcount = len;
1627 NEW.rm_owner = owner;
1628 NEW.rm_offset = offset;
1629 NEW.rm_flags = newext;
1630 error = xfs_rmap_insert(cur, NEW.rm_startblock,
1631 NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
1632 NEW.rm_flags);
1633 if (error)
1634 goto done;
1635 break;
1636
1637 case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
1638 case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
1639 case RMAP_LEFT_FILLING | RMAP_RIGHT_CONTIG:
1640 case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
1641 case RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
1642 case RMAP_LEFT_CONTIG:
1643 case RMAP_RIGHT_CONTIG:
1644 /*
1645 * These cases are all impossible.
1646 */
1647 ASSERT(0);
1648 }
1649
1650 trace_xfs_rmap_convert_done(mp, cur->bc_private.a.agno, bno, len,
1651 unwritten, oinfo);
1652done:
1653 if (error)
1654 trace_xfs_rmap_convert_error(cur->bc_mp,
1655 cur->bc_private.a.agno, error, _RET_IP_);
1656 return error;
1657}
1658
1659#undef NEW
1660#undef LEFT
1661#undef RIGHT
1662#undef PREV
1663
1664/*
1665 * Find an extent in the rmap btree and unmap it. For rmap extent types that
1666 * can overlap (data fork rmaps on reflink filesystems) we must be careful
1667 * that the prev/next records in the btree might belong to another owner.
1668 * Therefore we must use delete+insert to alter any of the key fields.
1669 *
1670 * For every other situation there can only be one owner for a given extent,
1671 * so we can call the regular _free function.
1672 */
1673STATIC int
1674xfs_rmap_unmap_shared(
1675 struct xfs_btree_cur *cur,
1676 xfs_agblock_t bno,
1677 xfs_extlen_t len,
1678 bool unwritten,
1679 struct xfs_owner_info *oinfo)
1680{
1681 struct xfs_mount *mp = cur->bc_mp;
1682 struct xfs_rmap_irec ltrec;
1683 uint64_t ltoff;
1684 int error = 0;
1685 int i;
1686 uint64_t owner;
1687 uint64_t offset;
1688 unsigned int flags;
1689
1690 xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
1691 if (unwritten)
1692 flags |= XFS_RMAP_UNWRITTEN;
1693 trace_xfs_rmap_unmap(mp, cur->bc_private.a.agno, bno, len,
1694 unwritten, oinfo);
1695
1696 /*
1697 * We should always have a left record because there's a static record
1698 * for the AG headers at rm_startblock == 0 created by mkfs/growfs that
1699 * will not ever be removed from the tree.
1700 */
1701 error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
1702 &ltrec, &i);
1703 if (error)
1704 goto out_error;
1705 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
1706 ltoff = ltrec.rm_offset;
1707
1708 /* Make sure the extent we found covers the entire freeing range. */
1709 XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_startblock <= bno &&
1710 ltrec.rm_startblock + ltrec.rm_blockcount >=
1711 bno + len, out_error);
1712
1713 /* Make sure the owner matches what we expect to find in the tree. */
1714 XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner, out_error);
1715
1716 /* Make sure the unwritten flag matches. */
1717 XFS_WANT_CORRUPTED_GOTO(mp, (flags & XFS_RMAP_UNWRITTEN) ==
1718 (ltrec.rm_flags & XFS_RMAP_UNWRITTEN), out_error);
1719
1720 /* Check the offset. */
1721 XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_offset <= offset, out_error);
1722 XFS_WANT_CORRUPTED_GOTO(mp, offset <= ltoff + ltrec.rm_blockcount,
1723 out_error);
1724
1725 if (ltrec.rm_startblock == bno && ltrec.rm_blockcount == len) {
1726 /* Exact match, simply remove the record from rmap tree. */
1727 error = xfs_rmap_delete(cur, ltrec.rm_startblock,
1728 ltrec.rm_blockcount, ltrec.rm_owner,
1729 ltrec.rm_offset, ltrec.rm_flags);
1730 if (error)
1731 goto out_error;
1732 } else if (ltrec.rm_startblock == bno) {
1733 /*
1734 * Overlap left hand side of extent: move the start, trim the
1735 * length and update the current record.
1736 *
1737 * ltbno ltlen
1738 * Orig: |oooooooooooooooooooo|
1739 * Freeing: |fffffffff|
1740 * Result: |rrrrrrrrrr|
1741 * bno len
1742 */
1743
1744 /* Delete prev rmap. */
1745 error = xfs_rmap_delete(cur, ltrec.rm_startblock,
1746 ltrec.rm_blockcount, ltrec.rm_owner,
1747 ltrec.rm_offset, ltrec.rm_flags);
1748 if (error)
1749 goto out_error;
1750
1751 /* Add an rmap at the new offset. */
1752 ltrec.rm_startblock += len;
1753 ltrec.rm_blockcount -= len;
1754 ltrec.rm_offset += len;
1755 error = xfs_rmap_insert(cur, ltrec.rm_startblock,
1756 ltrec.rm_blockcount, ltrec.rm_owner,
1757 ltrec.rm_offset, ltrec.rm_flags);
1758 if (error)
1759 goto out_error;
1760 } else if (ltrec.rm_startblock + ltrec.rm_blockcount == bno + len) {
1761 /*
1762 * Overlap right hand side of extent: trim the length and
1763 * update the current record.
1764 *
1765 * ltbno ltlen
1766 * Orig: |oooooooooooooooooooo|
1767 * Freeing: |fffffffff|
1768 * Result: |rrrrrrrrrr|
1769 * bno len
1770 */
1771 error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
1772 ltrec.rm_blockcount, ltrec.rm_owner,
1773 ltrec.rm_offset, ltrec.rm_flags, &i);
1774 if (error)
1775 goto out_error;
1776 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
1777 ltrec.rm_blockcount -= len;
1778 error = xfs_rmap_update(cur, &ltrec);
1779 if (error)
1780 goto out_error;
1781 } else {
1782 /*
 1783 * Overlap middle of extent: trim the length of the existing
 1784 * record down to the new left-extent length, then insert a
 1785 * new record containing the remaining right-extent space.
1787 *
1788 * ltbno ltlen
1789 * Orig: |oooooooooooooooooooo|
1790 * Freeing: |fffffffff|
1791 * Result: |rrrrr| |rrrr|
1792 * bno len
1793 */
1794 xfs_extlen_t orig_len = ltrec.rm_blockcount;
1795
1796 /* Shrink the left side of the rmap */
1797 error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
1798 ltrec.rm_blockcount, ltrec.rm_owner,
1799 ltrec.rm_offset, ltrec.rm_flags, &i);
1800 if (error)
1801 goto out_error;
1802 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
1803 ltrec.rm_blockcount = bno - ltrec.rm_startblock;
1804 error = xfs_rmap_update(cur, &ltrec);
1805 if (error)
1806 goto out_error;
1807
1808 /* Add an rmap at the new offset */
1809 error = xfs_rmap_insert(cur, bno + len,
1810 orig_len - len - ltrec.rm_blockcount,
1811 ltrec.rm_owner, offset + len,
1812 ltrec.rm_flags);
1813 if (error)
1814 goto out_error;
1815 }
1816
1817 trace_xfs_rmap_unmap_done(mp, cur->bc_private.a.agno, bno, len,
1818 unwritten, oinfo);
1819out_error:
1820 if (error)
1821 trace_xfs_rmap_unmap_error(cur->bc_mp,
1822 cur->bc_private.a.agno, error, _RET_IP_);
1823 return error;
1824}
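The three partial-overlap cases in xfs_rmap_unmap_shared above reduce to interval arithmetic on (startblock, blockcount, offset). A standalone sketch of the left-trim and middle-split updates, using a hypothetical simplified struct rather than the kernel's xfs_rmap_irec:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_rmap_irec (illustration only). */
struct rec { uint64_t start, len, offset; };

/* Freeing [r->start, r->start + len): move the start and offset up. */
static void trim_left(struct rec *r, uint64_t len)
{
	r->start += len;
	r->len -= len;
	r->offset += len;
}

/* Freeing [bno, bno + len) strictly inside *r: shrink *r to the left
 * piece and return the right piece, mirroring the delete/update/insert
 * arithmetic in the middle-overlap branch above. */
static struct rec split_middle(struct rec *r, uint64_t bno, uint64_t len)
{
	struct rec right;
	uint64_t left_len = bno - r->start;

	right.start = bno + len;
	right.len = r->len - len - left_len;
	right.offset = r->offset + left_len + len;
	r->len = left_len;
	return right;
}
```

The exact-match case simply deletes the record, so it needs no arithmetic at all.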
1825
1826/*
1827 * Find an extent in the rmap btree and map it. For rmap extent types that
 1828 * can overlap (data fork rmaps on reflink filesystems), we must be careful
 1829 * because the prev/next records in the btree might belong to another owner,
 1830 * so we must use delete+insert to alter any of the key fields.
1831 *
1832 * For every other situation there can only be one owner for a given extent,
1833 * so we can call the regular _alloc function.
1834 */
1835STATIC int
1836xfs_rmap_map_shared(
1837 struct xfs_btree_cur *cur,
1838 xfs_agblock_t bno,
1839 xfs_extlen_t len,
1840 bool unwritten,
1841 struct xfs_owner_info *oinfo)
1842{
1843 struct xfs_mount *mp = cur->bc_mp;
1844 struct xfs_rmap_irec ltrec;
1845 struct xfs_rmap_irec gtrec;
1846 int have_gt;
1847 int have_lt;
1848 int error = 0;
1849 int i;
1850 uint64_t owner;
1851 uint64_t offset;
1852 unsigned int flags = 0;
1853
1854 xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
1855 if (unwritten)
1856 flags |= XFS_RMAP_UNWRITTEN;
1857 trace_xfs_rmap_map(mp, cur->bc_private.a.agno, bno, len,
1858 unwritten, oinfo);
1859
1860 /* Is there a left record that abuts our range? */
1861 error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, flags,
1862 &ltrec, &have_lt);
1863 if (error)
1864 goto out_error;
1865 if (have_lt &&
1866 !xfs_rmap_is_mergeable(&ltrec, owner, flags))
1867 have_lt = 0;
1868
1869 /* Is there a right record that abuts our range? */
1870 error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
1871 flags, &have_gt);
1872 if (error)
1873 goto out_error;
1874 if (have_gt) {
1875 error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
1876 if (error)
1877 goto out_error;
1878 XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
1879 trace_xfs_rmap_find_right_neighbor_result(cur->bc_mp,
1880 cur->bc_private.a.agno, gtrec.rm_startblock,
1881 gtrec.rm_blockcount, gtrec.rm_owner,
1882 gtrec.rm_offset, gtrec.rm_flags);
1883
1884 if (!xfs_rmap_is_mergeable(&gtrec, owner, flags))
1885 have_gt = 0;
1886 }
1887
1888 if (have_lt &&
1889 ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
1890 ltrec.rm_offset + ltrec.rm_blockcount == offset) {
1891 /*
1892 * Left edge contiguous, merge into left record.
1893 *
1894 * ltbno ltlen
1895 * orig: |ooooooooo|
1896 * adding: |aaaaaaaaa|
1897 * result: |rrrrrrrrrrrrrrrrrrr|
1898 * bno len
1899 */
1900 ltrec.rm_blockcount += len;
1901 if (have_gt &&
1902 bno + len == gtrec.rm_startblock &&
1903 offset + len == gtrec.rm_offset) {
1904 /*
1905 * Right edge also contiguous, delete right record
1906 * and merge into left record.
1907 *
1908 * ltbno ltlen gtbno gtlen
1909 * orig: |ooooooooo| |ooooooooo|
1910 * adding: |aaaaaaaaa|
1911 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
1912 */
1913 ltrec.rm_blockcount += gtrec.rm_blockcount;
1914 error = xfs_rmap_delete(cur, gtrec.rm_startblock,
1915 gtrec.rm_blockcount, gtrec.rm_owner,
1916 gtrec.rm_offset, gtrec.rm_flags);
1917 if (error)
1918 goto out_error;
1919 }
1920
1921 /* Point the cursor back to the left record and update. */
1922 error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
1923 ltrec.rm_blockcount, ltrec.rm_owner,
1924 ltrec.rm_offset, ltrec.rm_flags, &i);
1925 if (error)
1926 goto out_error;
1927 XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
1928
1929 error = xfs_rmap_update(cur, &ltrec);
1930 if (error)
1931 goto out_error;
1932 } else if (have_gt &&
1933 bno + len == gtrec.rm_startblock &&
1934 offset + len == gtrec.rm_offset) {
1935 /*
1936 * Right edge contiguous, merge into right record.
1937 *
1938 * gtbno gtlen
1939 * Orig: |ooooooooo|
1940 * adding: |aaaaaaaaa|
1941 * Result: |rrrrrrrrrrrrrrrrrrr|
1942 * bno len
1943 */
1944 /* Delete the old record. */
1945 error = xfs_rmap_delete(cur, gtrec.rm_startblock,
1946 gtrec.rm_blockcount, gtrec.rm_owner,
1947 gtrec.rm_offset, gtrec.rm_flags);
1948 if (error)
1949 goto out_error;
1950
1951 /* Move the start and re-add it. */
1952 gtrec.rm_startblock = bno;
1953 gtrec.rm_blockcount += len;
1954 gtrec.rm_offset = offset;
1955 error = xfs_rmap_insert(cur, gtrec.rm_startblock,
1956 gtrec.rm_blockcount, gtrec.rm_owner,
1957 gtrec.rm_offset, gtrec.rm_flags);
1958 if (error)
1959 goto out_error;
1960 } else {
1961 /*
1962 * No contiguous edge with identical owner, insert
1963 * new record at current cursor position.
1964 */
1965 error = xfs_rmap_insert(cur, bno, len, owner, offset, flags);
1966 if (error)
1967 goto out_error;
1968 }
1969
1970 trace_xfs_rmap_map_done(mp, cur->bc_private.a.agno, bno, len,
1971 unwritten, oinfo);
1972out_error:
1973 if (error)
1974 trace_xfs_rmap_map_error(cur->bc_mp,
1975 cur->bc_private.a.agno, error, _RET_IP_);
1976 return error;
1977}
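The merge decisions in xfs_rmap_map_shared hinge on contiguity in both block space and file-offset space. A minimal sketch of those two predicates (simplified types and names are mine; the kernel additionally filters candidates through xfs_rmap_is_mergeable for owner and flag compatibility):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified record (illustration only, not the kernel struct). */
struct rec { uint64_t start, len, offset; };

/* Left record merges when it ends exactly where the new extent begins,
 * in both block space and file-offset space. */
static bool left_contig(const struct rec *lt, uint64_t bno, uint64_t offset)
{
	return lt->start + lt->len == bno && lt->offset + lt->len == offset;
}

/* Right record merges when the new extent ends exactly where it begins. */
static bool right_contig(const struct rec *gt, uint64_t bno, uint64_t len,
			 uint64_t offset)
{
	return bno + len == gt->start && offset + len == gt->offset;
}
```

When both predicates hold, the function above absorbs the right record into the left one and deletes it; when neither holds, a fresh record is inserted.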
1978
1101struct xfs_rmap_query_range_info { 1979struct xfs_rmap_query_range_info {
1102 xfs_rmap_query_range_fn fn; 1980 xfs_rmap_query_range_fn fn;
1103 void *priv; 1981 void *priv;
@@ -1237,15 +2115,27 @@ xfs_rmap_finish_one(
1237 case XFS_RMAP_MAP: 2115 case XFS_RMAP_MAP:
1238 error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo); 2116 error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
1239 break; 2117 break;
2118 case XFS_RMAP_MAP_SHARED:
2119 error = xfs_rmap_map_shared(rcur, bno, blockcount, unwritten,
2120 &oinfo);
2121 break;
1240 case XFS_RMAP_FREE: 2122 case XFS_RMAP_FREE:
1241 case XFS_RMAP_UNMAP: 2123 case XFS_RMAP_UNMAP:
1242 error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten, 2124 error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
1243 &oinfo); 2125 &oinfo);
1244 break; 2126 break;
2127 case XFS_RMAP_UNMAP_SHARED:
2128 error = xfs_rmap_unmap_shared(rcur, bno, blockcount, unwritten,
2129 &oinfo);
2130 break;
1245 case XFS_RMAP_CONVERT: 2131 case XFS_RMAP_CONVERT:
1246 error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten, 2132 error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
1247 &oinfo); 2133 &oinfo);
1248 break; 2134 break;
2135 case XFS_RMAP_CONVERT_SHARED:
2136 error = xfs_rmap_convert_shared(rcur, bno, blockcount,
2137 !unwritten, &oinfo);
2138 break;
1249 default: 2139 default:
1250 ASSERT(0); 2140 ASSERT(0);
1251 error = -EFSCORRUPTED; 2141 error = -EFSCORRUPTED;
@@ -1263,9 +2153,10 @@ out_cur:
1263 */ 2153 */
1264static bool 2154static bool
1265xfs_rmap_update_is_needed( 2155xfs_rmap_update_is_needed(
1266 struct xfs_mount *mp) 2156 struct xfs_mount *mp,
2157 int whichfork)
1267{ 2158{
1268 return xfs_sb_version_hasrmapbt(&mp->m_sb); 2159 return xfs_sb_version_hasrmapbt(&mp->m_sb) && whichfork != XFS_COW_FORK;
1269} 2160}
1270 2161
1271/* 2162/*
@@ -1311,10 +2202,11 @@ xfs_rmap_map_extent(
1311 int whichfork, 2202 int whichfork,
1312 struct xfs_bmbt_irec *PREV) 2203 struct xfs_bmbt_irec *PREV)
1313{ 2204{
1314 if (!xfs_rmap_update_is_needed(mp)) 2205 if (!xfs_rmap_update_is_needed(mp, whichfork))
1315 return 0; 2206 return 0;
1316 2207
1317 return __xfs_rmap_add(mp, dfops, XFS_RMAP_MAP, ip->i_ino, 2208 return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
2209 XFS_RMAP_MAP_SHARED : XFS_RMAP_MAP, ip->i_ino,
1318 whichfork, PREV); 2210 whichfork, PREV);
1319} 2211}
1320 2212
@@ -1327,10 +2219,11 @@ xfs_rmap_unmap_extent(
1327 int whichfork, 2219 int whichfork,
1328 struct xfs_bmbt_irec *PREV) 2220 struct xfs_bmbt_irec *PREV)
1329{ 2221{
1330 if (!xfs_rmap_update_is_needed(mp)) 2222 if (!xfs_rmap_update_is_needed(mp, whichfork))
1331 return 0; 2223 return 0;
1332 2224
1333 return __xfs_rmap_add(mp, dfops, XFS_RMAP_UNMAP, ip->i_ino, 2225 return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
2226 XFS_RMAP_UNMAP_SHARED : XFS_RMAP_UNMAP, ip->i_ino,
1334 whichfork, PREV); 2227 whichfork, PREV);
1335} 2228}
1336 2229
@@ -1343,10 +2236,11 @@ xfs_rmap_convert_extent(
1343 int whichfork, 2236 int whichfork,
1344 struct xfs_bmbt_irec *PREV) 2237 struct xfs_bmbt_irec *PREV)
1345{ 2238{
1346 if (!xfs_rmap_update_is_needed(mp)) 2239 if (!xfs_rmap_update_is_needed(mp, whichfork))
1347 return 0; 2240 return 0;
1348 2241
1349 return __xfs_rmap_add(mp, dfops, XFS_RMAP_CONVERT, ip->i_ino, 2242 return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
2243 XFS_RMAP_CONVERT_SHARED : XFS_RMAP_CONVERT, ip->i_ino,
1350 whichfork, PREV); 2244 whichfork, PREV);
1351} 2245}
1352 2246
@@ -1362,7 +2256,7 @@ xfs_rmap_alloc_extent(
1362{ 2256{
1363 struct xfs_bmbt_irec bmap; 2257 struct xfs_bmbt_irec bmap;
1364 2258
1365 if (!xfs_rmap_update_is_needed(mp)) 2259 if (!xfs_rmap_update_is_needed(mp, XFS_DATA_FORK))
1366 return 0; 2260 return 0;
1367 2261
1368 bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno); 2262 bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
@@ -1386,7 +2280,7 @@ xfs_rmap_free_extent(
1386{ 2280{
1387 struct xfs_bmbt_irec bmap; 2281 struct xfs_bmbt_irec bmap;
1388 2282
1389 if (!xfs_rmap_update_is_needed(mp)) 2283 if (!xfs_rmap_update_is_needed(mp, XFS_DATA_FORK))
1390 return 0; 2284 return 0;
1391 2285
1392 bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno); 2286 bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 71cf99a4acba..789930599339 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -206,4 +206,11 @@ int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
206 xfs_fsblock_t startblock, xfs_filblks_t blockcount, 206 xfs_fsblock_t startblock, xfs_filblks_t blockcount,
207 xfs_exntst_t state, struct xfs_btree_cur **pcur); 207 xfs_exntst_t state, struct xfs_btree_cur **pcur);
208 208
209int xfs_rmap_find_left_neighbor(struct xfs_btree_cur *cur, xfs_agblock_t bno,
210 uint64_t owner, uint64_t offset, unsigned int flags,
211 struct xfs_rmap_irec *irec, int *stat);
212int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
213 uint64_t owner, uint64_t offset, unsigned int flags,
214 struct xfs_rmap_irec *irec, int *stat);
215
209#endif /* __XFS_RMAP_H__ */ 216#endif /* __XFS_RMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 17b8eeb34ac8..83e672ff7577 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -35,6 +35,7 @@
35#include "xfs_cksum.h" 35#include "xfs_cksum.h"
36#include "xfs_error.h" 36#include "xfs_error.h"
37#include "xfs_extent_busy.h" 37#include "xfs_extent_busy.h"
38#include "xfs_ag_resv.h"
38 39
39/* 40/*
40 * Reverse map btree. 41 * Reverse map btree.
@@ -512,6 +513,83 @@ void
512xfs_rmapbt_compute_maxlevels( 513xfs_rmapbt_compute_maxlevels(
513 struct xfs_mount *mp) 514 struct xfs_mount *mp)
514{ 515{
515 mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp, 516 /*
516 mp->m_rmap_mnr, mp->m_sb.sb_agblocks); 517 * On a non-reflink filesystem, the maximum number of rmap
518 * records is the number of blocks in the AG, hence the max
519 * rmapbt height is log_$maxrecs($agblocks). However, with
520 * reflink each AG block can have up to 2^32 (per the refcount
521 * record format) owners, which means that theoretically we
522 * could face up to 2^64 rmap records.
523 *
524 * That effectively means that the max rmapbt height must be
525 * XFS_BTREE_MAXLEVELS. "Fortunately" we'll run out of AG
526 * blocks to feed the rmapbt long before the rmapbt reaches
527 * maximum height. The reflink code uses ag_resv_critical to
528 * disallow reflinking when less than 10% of the per-AG metadata
529 * block reservation since the fallback is a regular file copy.
530 */
531 if (xfs_sb_version_hasreflink(&mp->m_sb))
532 mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
533 else
534 mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
535 mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
536}
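The non-reflink bound described in the comment above, log_maxrecs(agblocks), can be sketched as a simplified stand-in for xfs_btree_compute_maxlevels (the real helper distinguishes leaf and node minrecs; this sketch assumes a single fanout value):

```c
#include <assert.h>

/* Levels needed so a btree whose minimally-full nodes hold 'minrecs'
 * entries can address at least 'nrecs' records (simplified sketch). */
static unsigned int compute_maxlevels(unsigned long long minrecs,
				      unsigned long long nrecs)
{
	/* Blocks needed at the leaf level. */
	unsigned long long blocks = (nrecs + minrecs - 1) / minrecs;
	unsigned int level;

	/* Each interior level shrinks the block count by the fanout. */
	for (level = 1; blocks > 1; level++)
		blocks = (blocks + minrecs - 1) / minrecs;
	return level;
}
```

With reflink the record count is no longer bounded by agblocks, which is why the function above pins the height to XFS_BTREE_MAXLEVELS instead of computing it.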
537
 538/* Calculate the rmap btree size for some records. */
539xfs_extlen_t
540xfs_rmapbt_calc_size(
541 struct xfs_mount *mp,
542 unsigned long long len)
543{
544 return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
545}
546
547/*
 548 * Calculate the maximum rmap btree size.
549 */
550xfs_extlen_t
551xfs_rmapbt_max_size(
552 struct xfs_mount *mp)
553{
554 /* Bail out if we're uninitialized, which can happen in mkfs. */
555 if (mp->m_rmap_mxr[0] == 0)
556 return 0;
557
558 return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
559}
560
561/*
562 * Figure out how many blocks to reserve and how many are used by this btree.
563 */
564int
565xfs_rmapbt_calc_reserves(
566 struct xfs_mount *mp,
567 xfs_agnumber_t agno,
568 xfs_extlen_t *ask,
569 xfs_extlen_t *used)
570{
571 struct xfs_buf *agbp;
572 struct xfs_agf *agf;
573 xfs_extlen_t pool_len;
574 xfs_extlen_t tree_len;
575 int error;
576
577 if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
578 return 0;
579
580 /* Reserve 1% of the AG or enough for 1 block per record. */
581 pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
582 *ask += pool_len;
583
584 error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
585 if (error)
586 return error;
587
588 agf = XFS_BUF_TO_AGF(agbp);
589 tree_len = be32_to_cpu(agf->agf_rmap_blocks);
590 xfs_buf_relse(agbp);
591
592 *used += tree_len;
593
594 return error;
517} 595}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index e73a55357dab..2a9ac472fb15 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
58int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf); 58int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
59extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp); 59extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
60 60
61extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
62 unsigned long long len);
63extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
64
65extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
66 xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
67
61#endif /* __XFS_RMAP_BTREE_H__ */ 68#endif /* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 4aecc5fefe96..a70aec910626 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -38,6 +38,8 @@
38#include "xfs_ialloc_btree.h" 38#include "xfs_ialloc_btree.h"
39#include "xfs_log.h" 39#include "xfs_log.h"
40#include "xfs_rmap_btree.h" 40#include "xfs_rmap_btree.h"
41#include "xfs_bmap.h"
42#include "xfs_refcount_btree.h"
41 43
42/* 44/*
43 * Physical superblock buffer manipulations. Shared with libxfs in userspace. 45 * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -737,6 +739,13 @@ xfs_sb_mount_common(
737 mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2; 739 mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
738 mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2; 740 mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
739 741
742 mp->m_refc_mxr[0] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
743 true);
744 mp->m_refc_mxr[1] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
745 false);
746 mp->m_refc_mnr[0] = mp->m_refc_mxr[0] / 2;
747 mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
748
740 mp->m_bsize = XFS_FSB_TO_BB(mp, 1); 749 mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
741 mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK, 750 mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
742 sbp->sb_inopblock); 751 sbp->sb_inopblock);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 0c5b30bd884c..c6f4eb46fe26 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -39,6 +39,7 @@ extern const struct xfs_buf_ops xfs_agf_buf_ops;
39extern const struct xfs_buf_ops xfs_agfl_buf_ops; 39extern const struct xfs_buf_ops xfs_agfl_buf_ops;
40extern const struct xfs_buf_ops xfs_allocbt_buf_ops; 40extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
41extern const struct xfs_buf_ops xfs_rmapbt_buf_ops; 41extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
42extern const struct xfs_buf_ops xfs_refcountbt_buf_ops;
42extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops; 43extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
43extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops; 44extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
44extern const struct xfs_buf_ops xfs_bmbt_buf_ops; 45extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
@@ -122,6 +123,7 @@ int xfs_log_calc_minimum_size(struct xfs_mount *);
122#define XFS_INO_REF 2 123#define XFS_INO_REF 2
123#define XFS_ATTR_BTREE_REF 1 124#define XFS_ATTR_BTREE_REF 1
124#define XFS_DQUOT_REF 1 125#define XFS_DQUOT_REF 1
126#define XFS_REFC_BTREE_REF 1
125 127
126/* 128/*
127 * Flags for xfs_trans_ichgtime(). 129 * Flags for xfs_trans_ichgtime().
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 301ef2f4dbd6..b456cca1bfb2 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -67,13 +67,14 @@ xfs_calc_buf_res(
67 * Per-extent log reservation for the btree changes involved in freeing or 67 * Per-extent log reservation for the btree changes involved in freeing or
68 * allocating an extent. In classic XFS there were two trees that will be 68 * allocating an extent. In classic XFS there were two trees that will be
69 * modified (bnobt + cntbt). With rmap enabled, there are three trees 69 * modified (bnobt + cntbt). With rmap enabled, there are three trees
70 * (rmapbt). The number of blocks reserved is based on the formula: 70 * (rmapbt). With reflink, there are four trees (refcountbt). The number of
71 * blocks reserved is based on the formula:
71 * 72 *
72 * num trees * ((2 blocks/level * max depth) - 1) 73 * num trees * ((2 blocks/level * max depth) - 1)
73 * 74 *
74 * Keep in mind that max depth is calculated separately for each type of tree. 75 * Keep in mind that max depth is calculated separately for each type of tree.
75 */ 76 */
76static uint 77uint
77xfs_allocfree_log_count( 78xfs_allocfree_log_count(
78 struct xfs_mount *mp, 79 struct xfs_mount *mp,
79 uint num_ops) 80 uint num_ops)
@@ -83,6 +84,8 @@ xfs_allocfree_log_count(
83 blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1); 84 blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
84 if (xfs_sb_version_hasrmapbt(&mp->m_sb)) 85 if (xfs_sb_version_hasrmapbt(&mp->m_sb))
85 blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1); 86 blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
87 if (xfs_sb_version_hasreflink(&mp->m_sb))
88 blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
86 89
87 return blocks; 90 return blocks;
88} 91}
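The reservation formula in the comment above, num trees * ((2 blocks/level * max depth) - 1), can be mirrored in a standalone sketch (parameter names are mine; pass 0 for a tree that is not enabled on this filesystem):

```c
#include <assert.h>

/* Per-operation block reservation: each tree can split at every level
 * (2 blocks per level) except the root, hence 2*depth - 1 per tree.
 * Sketch of the formula in the comment, not the kernel function. */
static unsigned int allocfree_log_count(unsigned int num_ops,
					unsigned int ag_maxlevels,
					unsigned int rmap_maxlevels,
					unsigned int refc_maxlevels)
{
	unsigned int blocks;

	/* bnobt + cntbt: two trees of the same depth. */
	blocks = num_ops * 2 * (2 * ag_maxlevels - 1);
	if (rmap_maxlevels)
		blocks += num_ops * (2 * rmap_maxlevels - 1);
	if (refc_maxlevels)
		blocks += num_ops * (2 * refc_maxlevels - 1);
	return blocks;
}
```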
@@ -809,11 +812,18 @@ xfs_trans_resv_calc(
809 * require a permanent reservation on space. 812 * require a permanent reservation on space.
810 */ 813 */
811 resp->tr_write.tr_logres = xfs_calc_write_reservation(mp); 814 resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
812 resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT; 815 if (xfs_sb_version_hasreflink(&mp->m_sb))
816 resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
817 else
818 resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
813 resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES; 819 resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
814 820
815 resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp); 821 resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
816 resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT; 822 if (xfs_sb_version_hasreflink(&mp->m_sb))
823 resp->tr_itruncate.tr_logcount =
824 XFS_ITRUNCATE_LOG_COUNT_REFLINK;
825 else
826 resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
817 resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES; 827 resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
818 828
819 resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp); 829 resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
@@ -870,7 +880,10 @@ xfs_trans_resv_calc(
870 resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; 880 resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
871 881
872 resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp); 882 resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
873 resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; 883 if (xfs_sb_version_hasreflink(&mp->m_sb))
884 resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
885 else
886 resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
874 resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; 887 resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
875 888
876 /* 889 /*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 0eb46ed6d404..b7e5357d060a 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -87,6 +87,7 @@ struct xfs_trans_resv {
87#define XFS_DEFAULT_LOG_COUNT 1 87#define XFS_DEFAULT_LOG_COUNT 1
88#define XFS_DEFAULT_PERM_LOG_COUNT 2 88#define XFS_DEFAULT_PERM_LOG_COUNT 2
89#define XFS_ITRUNCATE_LOG_COUNT 2 89#define XFS_ITRUNCATE_LOG_COUNT 2
90#define XFS_ITRUNCATE_LOG_COUNT_REFLINK 8
90#define XFS_INACTIVE_LOG_COUNT 2 91#define XFS_INACTIVE_LOG_COUNT 2
91#define XFS_CREATE_LOG_COUNT 2 92#define XFS_CREATE_LOG_COUNT 2
92#define XFS_CREATE_TMPFILE_LOG_COUNT 2 93#define XFS_CREATE_TMPFILE_LOG_COUNT 2
@@ -96,11 +97,13 @@ struct xfs_trans_resv {
96#define XFS_LINK_LOG_COUNT 2 97#define XFS_LINK_LOG_COUNT 2
97#define XFS_RENAME_LOG_COUNT 2 98#define XFS_RENAME_LOG_COUNT 2
98#define XFS_WRITE_LOG_COUNT 2 99#define XFS_WRITE_LOG_COUNT 2
100#define XFS_WRITE_LOG_COUNT_REFLINK 8
99#define XFS_ADDAFORK_LOG_COUNT 2 101#define XFS_ADDAFORK_LOG_COUNT 2
100#define XFS_ATTRINVAL_LOG_COUNT 1 102#define XFS_ATTRINVAL_LOG_COUNT 1
101#define XFS_ATTRSET_LOG_COUNT 3 103#define XFS_ATTRSET_LOG_COUNT 3
102#define XFS_ATTRRM_LOG_COUNT 3 104#define XFS_ATTRRM_LOG_COUNT 3
103 105
104void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp); 106void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
107uint xfs_allocfree_log_count(struct xfs_mount *mp, uint num_ops);
105 108
106#endif /* __XFS_TRANS_RESV_H__ */ 109#endif /* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 41e0428d8175..7917f6e44286 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -21,6 +21,8 @@
21/* 21/*
22 * Components of space reservations. 22 * Components of space reservations.
23 */ 23 */
24#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
25 (((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
24#define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) \ 26#define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) \
25 (((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0])) 27 (((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
26#define XFS_EXTENTADD_SPACE_RES(mp,w) (XFS_BM_MAXLEVELS(mp,w) - 1) 28#define XFS_EXTENTADD_SPACE_RES(mp,w) (XFS_BM_MAXLEVELS(mp,w) - 1)
@@ -28,6 +30,13 @@
28 (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \ 30 (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
29 XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \ 31 XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
30 XFS_EXTENTADD_SPACE_RES(mp,w)) 32 XFS_EXTENTADD_SPACE_RES(mp,w))
33#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
34 (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
35 XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
36 XFS_EXTENTADD_SPACE_RES(mp,w) + \
37 ((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
38 XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) * \
39 (mp)->m_rmap_maxlevels)
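The new XFS_SWAP_RMAP_SPACE_RES macro is two ceil-divisions added together: bmbt expansion blocks plus rmapbt expansion blocks. A sketch of the same arithmetic as plain functions (names are mine, not kernel helpers):

```c
#include <assert.h>

/* Ceil-divide, matching the (b + per_block - 1) / per_block pattern
 * used by the space-reservation macros (illustration only). */
static unsigned long howmany(unsigned long b, unsigned long per_block)
{
	return (b + per_block - 1) / per_block;
}

/* Blocks reserved to swap rmaps for 'b' blocks: bmbt growth for the
 * extent records plus rmapbt growth, per the macro's two terms. */
static unsigned long swap_rmap_space_res(unsigned long b,
					 unsigned long ext_per_block,
					 unsigned long extentadd,
					 unsigned long rmaps_per_block,
					 unsigned long rmap_maxlevels)
{
	return howmany(b, ext_per_block) * extentadd +
	       howmany(b, rmaps_per_block) * rmap_maxlevels;
}
```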
31#define XFS_DAENTER_1B(mp,w) \ 40#define XFS_DAENTER_1B(mp,w) \
32 ((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1) 41 ((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
33#define XFS_DAENTER_DBS(mp,w) \ 42#define XFS_DAENTER_DBS(mp,w) \
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 3d503647f26b..8d74870468c2 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -90,6 +90,7 @@ typedef __int64_t xfs_sfiloff_t; /* signed block number in a file */
90 */ 90 */
91#define XFS_DATA_FORK 0 91#define XFS_DATA_FORK 0
92#define XFS_ATTR_FORK 1 92#define XFS_ATTR_FORK 1
93#define XFS_COW_FORK 2
93 94
94/* 95/*
95 * Min numbers of data/attr fork btree root pointers. 96 * Min numbers of data/attr fork btree root pointers.
@@ -109,7 +110,7 @@ typedef enum {
109 110
110typedef enum { 111typedef enum {
111 XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi, 112 XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
112 XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX 113 XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_MAX
113} xfs_btnum_t; 114} xfs_btnum_t;
114 115
115struct xfs_name { 116struct xfs_name {
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa91e3b1..3e57a56cf829 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -31,6 +31,7 @@
31#include "xfs_bmap.h" 31#include "xfs_bmap.h"
32#include "xfs_bmap_util.h" 32#include "xfs_bmap_util.h"
33#include "xfs_bmap_btree.h" 33#include "xfs_bmap_btree.h"
34#include "xfs_reflink.h"
34#include <linux/gfp.h> 35#include <linux/gfp.h>
35#include <linux/mpage.h> 36#include <linux/mpage.h>
36#include <linux/pagevec.h> 37#include <linux/pagevec.h>
@@ -39,6 +40,7 @@
39/* flags for direct write completions */ 40/* flags for direct write completions */
40#define XFS_DIO_FLAG_UNWRITTEN (1 << 0) 41#define XFS_DIO_FLAG_UNWRITTEN (1 << 0)
41#define XFS_DIO_FLAG_APPEND (1 << 1) 42#define XFS_DIO_FLAG_APPEND (1 << 1)
43#define XFS_DIO_FLAG_COW (1 << 2)
42 44
43/* 45/*
44 * structure owned by writepages passed to individual writepage calls 46 * structure owned by writepages passed to individual writepage calls
@@ -287,6 +289,25 @@ xfs_end_io(
287 error = -EIO; 289 error = -EIO;
288 290
289 /* 291 /*
292 * For a CoW extent, we need to move the mapping from the CoW fork
293 * to the data fork. If instead an error happened, just dump the
294 * new blocks.
295 */
296 if (ioend->io_type == XFS_IO_COW) {
297 if (error)
298 goto done;
299 if (ioend->io_bio->bi_error) {
300 error = xfs_reflink_cancel_cow_range(ip,
301 ioend->io_offset, ioend->io_size);
302 goto done;
303 }
304 error = xfs_reflink_end_cow(ip, ioend->io_offset,
305 ioend->io_size);
306 if (error)
307 goto done;
308 }
309
310 /*
290 * For unwritten extents we need to issue transactions to convert a 311 * For unwritten extents we need to issue transactions to convert a
 291 * range to normal written extents after the data I/O has finished. 312
292 * Detecting and handling completion IO errors is done individually 313 * Detecting and handling completion IO errors is done individually
@@ -301,7 +322,8 @@ xfs_end_io(
301 } else if (ioend->io_append_trans) { 322 } else if (ioend->io_append_trans) {
302 error = xfs_setfilesize_ioend(ioend, error); 323 error = xfs_setfilesize_ioend(ioend, error);
303 } else { 324 } else {
304 ASSERT(!xfs_ioend_is_append(ioend)); 325 ASSERT(!xfs_ioend_is_append(ioend) ||
326 ioend->io_type == XFS_IO_COW);
305 } 327 }
306 328
307done: 329done:
@@ -315,7 +337,7 @@ xfs_end_bio(
 	struct xfs_ioend	*ioend = bio->bi_private;
 	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN)
+	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
 		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
 	else if (ioend->io_append_trans)
 		queue_work(mp->m_data_workqueue, &ioend->io_work);
@@ -341,6 +363,7 @@ xfs_map_blocks(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	ASSERT(type != XFS_IO_COW);
 	if (type == XFS_IO_UNWRITTEN)
 		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
@@ -355,6 +378,13 @@ xfs_map_blocks(
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				imap, &nimaps, bmapi_flags);
+	/*
+	 * Truncate an overwrite extent if there's a pending CoW
+	 * reservation before the end of this extent.  This forces us
+	 * to come back to writepage to take care of the CoW.
+	 */
+	if (nimaps && type == XFS_IO_OVERWRITE)
+		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (error)
@@ -362,7 +392,8 @@ xfs_map_blocks(
 
 	if (type == XFS_IO_DELALLOC &&
 	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, offset, imap);
+		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
+				imap);
 		if (!error)
 			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
 		return error;
@@ -737,6 +768,56 @@ out_invalidate:
 	return;
 }
 
+static int
+xfs_map_cow(
+	struct xfs_writepage_ctx *wpc,
+	struct inode		*inode,
+	loff_t			offset,
+	unsigned int		*new_type)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_bmbt_irec	imap;
+	bool			is_cow = false, need_alloc = false;
+	int			error;
+
+	/*
+	 * If we already have a valid COW mapping keep using it.
+	 */
+	if (wpc->io_type == XFS_IO_COW) {
+		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
+		if (wpc->imap_valid) {
+			*new_type = XFS_IO_COW;
+			return 0;
+		}
+	}
+
+	/*
+	 * Else we need to check if there is a COW mapping at this offset.
+	 */
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap, &need_alloc);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	if (!is_cow)
+		return 0;
+
+	/*
+	 * And if the COW mapping has a delayed extent here we need to
+	 * allocate real space for it now.
+	 */
+	if (need_alloc) {
+		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
+				&imap);
+		if (error)
+			return error;
+	}
+
+	wpc->io_type = *new_type = XFS_IO_COW;
+	wpc->imap_valid = true;
+	wpc->imap = imap;
+	return 0;
+}
+
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
@@ -769,6 +850,7 @@ xfs_writepage_map(
 	int			error = 0;
 	int			count = 0;
 	int			uptodate = 1;
+	unsigned int		new_type;
 
 	bh = head = page_buffers(page);
 	offset = page_offset(page);
@@ -789,22 +871,13 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (buffer_unwritten(bh)) {
-			if (wpc->io_type != XFS_IO_UNWRITTEN) {
-				wpc->io_type = XFS_IO_UNWRITTEN;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_delay(bh)) {
-			if (wpc->io_type != XFS_IO_DELALLOC) {
-				wpc->io_type = XFS_IO_DELALLOC;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_uptodate(bh)) {
-			if (wpc->io_type != XFS_IO_OVERWRITE) {
-				wpc->io_type = XFS_IO_OVERWRITE;
-				wpc->imap_valid = false;
-			}
-		} else {
+		if (buffer_unwritten(bh))
+			new_type = XFS_IO_UNWRITTEN;
+		else if (buffer_delay(bh))
+			new_type = XFS_IO_DELALLOC;
+		else if (buffer_uptodate(bh))
+			new_type = XFS_IO_OVERWRITE;
+		else {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
 			/*
@@ -817,6 +890,17 @@ xfs_writepage_map(
 			continue;
 		}
 
+		if (xfs_is_reflink_inode(XFS_I(inode))) {
+			error = xfs_map_cow(wpc, inode, offset, &new_type);
+			if (error)
+				goto out;
+		}
+
+		if (wpc->io_type != new_type) {
+			wpc->io_type = new_type;
+			wpc->imap_valid = false;
+		}
+
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 offset);
@@ -1107,18 +1191,24 @@ xfs_map_direct(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	bool			is_cow)
 {
 	uintptr_t		*flags = (uintptr_t *)&bh_result->b_private;
 	xfs_off_t		size = bh_result->b_size;
 
 	trace_xfs_get_blocks_map_direct(XFS_I(inode), offset, size,
-		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, imap);
+		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : is_cow ? XFS_IO_COW :
+		XFS_IO_OVERWRITE, imap);
 
 	if (ISUNWRITTEN(imap)) {
 		*flags |= XFS_DIO_FLAG_UNWRITTEN;
 		set_buffer_defer_completion(bh_result);
-	} else if (offset + size > i_size_read(inode) || offset + size < 0) {
+	} else if (is_cow) {
+		*flags |= XFS_DIO_FLAG_COW;
+		set_buffer_defer_completion(bh_result);
+	}
+	if (offset + size > i_size_read(inode) || offset + size < 0) {
 		*flags |= XFS_DIO_FLAG_APPEND;
 		set_buffer_defer_completion(bh_result);
 	}
@@ -1164,6 +1254,44 @@ xfs_map_trim_size(
 		bh_result->b_size = mapping_size;
 }
 
+/* Bounce unaligned directio writes to the page cache. */
+static int
+xfs_bounce_unaligned_dio_write(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset_fsb,
+	struct xfs_bmbt_irec	*imap)
+{
+	struct xfs_bmbt_irec	irec;
+	xfs_fileoff_t		delta;
+	bool			shared;
+	bool			x;
+	int			error;
+
+	irec = *imap;
+	if (offset_fsb > irec.br_startoff) {
+		delta = offset_fsb - irec.br_startoff;
+		irec.br_blockcount -= delta;
+		irec.br_startblock += delta;
+		irec.br_startoff = offset_fsb;
+	}
+	error = xfs_reflink_trim_around_shared(ip, &irec, &shared, &x);
+	if (error)
+		return error;
+
+	/*
+	 * We're here because we're trying to do a directio write to a
+	 * region that isn't aligned to a filesystem block.  If any part
+	 * of the extent is shared, fall back to buffered mode to handle
+	 * the RMW.  This is done by returning -EREMCHG ("remote addr
+	 * changed"), which is caught further up the call stack.
+	 */
+	if (shared) {
+		trace_xfs_reflink_bounce_dio_write(ip, imap);
+		return -EREMCHG;
+	}
+	return 0;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1183,6 +1311,8 @@ __xfs_get_blocks(
 	xfs_off_t		offset;
 	ssize_t			size;
 	int			new = 0;
+	bool			is_cow = false;
+	bool			need_alloc = false;
 
 	BUG_ON(create && !direct);
 
@@ -1208,8 +1338,26 @@ __xfs_get_blocks(
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-			&imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (create && direct && xfs_is_reflink_inode(ip))
+		is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap,
+					&need_alloc);
+	if (!is_cow) {
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+					&imap, &nimaps, XFS_BMAPI_ENTIRE);
+		/*
+		 * Truncate an overwrite extent if there's a pending CoW
+		 * reservation before the end of this extent.  This
+		 * forces us to come back to get_blocks to take care of
+		 * the CoW.
+		 */
+		if (create && direct && nimaps &&
+		    imap.br_startblock != HOLESTARTBLOCK &&
+		    imap.br_startblock != DELAYSTARTBLOCK &&
+		    !ISUNWRITTEN(&imap))
+			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb,
+					&imap);
+	}
+	ASSERT(!need_alloc);
 	if (error)
 		goto out_unlock;
 
@@ -1261,6 +1409,13 @@ __xfs_get_blocks(
 	if (imap.br_startblock != HOLESTARTBLOCK &&
 	    imap.br_startblock != DELAYSTARTBLOCK &&
 	    (create || !ISUNWRITTEN(&imap))) {
+		if (create && direct && !is_cow) {
+			error = xfs_bounce_unaligned_dio_write(ip, offset_fsb,
+					&imap);
+			if (error)
+				return error;
+		}
+
 		xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
@@ -1269,7 +1424,8 @@ __xfs_get_blocks(
 		if (dax_fault)
 			ASSERT(!ISUNWRITTEN(&imap));
 		else
-			xfs_map_direct(inode, bh_result, &imap, offset);
+			xfs_map_direct(inode, bh_result, &imap, offset,
+					is_cow);
 	}
 }
 
@@ -1391,11 +1547,14 @@ xfs_end_io_direct_write(
 		i_size_write(inode, offset + size);
 	spin_unlock(&ip->i_flags_lock);
 
+	if (flags & XFS_DIO_FLAG_COW)
+		error = xfs_reflink_end_cow(ip, offset, size);
 	if (flags & XFS_DIO_FLAG_UNWRITTEN) {
 		trace_xfs_end_io_direct_write_unwritten(ip, offset, size);
 
 		error = xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (flags & XFS_DIO_FLAG_APPEND) {
+	}
+	if (flags & XFS_DIO_FLAG_APPEND) {
 		trace_xfs_end_io_direct_write_append(ip, offset, size);
 
 		error = xfs_setfilesize(ip, offset, size);
@@ -1425,6 +1584,17 @@ xfs_vm_bmap(
 
 	trace_xfs_vm_bmap(XFS_I(inode));
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+
+	/*
+	 * The swap code (ab-)uses ->bmap to get a block mapping and then
+	 * bypasses the file system for actual I/O.  We really can't allow
+	 * that on reflink inodes, so we have to skip out here.  And yes,
+	 * 0 is the magic code for a bmap error.
+	 */
+	if (xfs_is_reflink_inode(ip)) {
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+		return 0;
+	}
 	filemap_write_and_wait(mapping);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 	return generic_block_bmap(mapping, block, xfs_get_blocks);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3bca2ac..b3c6634f9518 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -28,13 +28,15 @@ enum {
 	XFS_IO_DELALLOC,	/* covers delalloc region */
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
+	XFS_IO_COW,		/* covers copy-on-write extent */
 };
 
 #define XFS_IO_TYPES \
 	{ XFS_IO_INVALID,		"invalid" }, \
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
-	{ XFS_IO_OVERWRITE,		"overwrite" }
+	{ XFS_IO_OVERWRITE,		"overwrite" }, \
+	{ XFS_IO_COW,			"CoW" }
 
 /*
  * Structure for buffered I/O completions.
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
new file mode 100644
index 000000000000..9bf57c76623b
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.c
@@ -0,0 +1,508 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_format.h"
23#include "xfs_log_format.h"
24#include "xfs_trans_resv.h"
25#include "xfs_bit.h"
26#include "xfs_mount.h"
27#include "xfs_defer.h"
28#include "xfs_inode.h"
29#include "xfs_trans.h"
30#include "xfs_trans_priv.h"
31#include "xfs_buf_item.h"
32#include "xfs_bmap_item.h"
33#include "xfs_log.h"
34#include "xfs_bmap.h"
35#include "xfs_icache.h"
36#include "xfs_trace.h"
37
38
39kmem_zone_t *xfs_bui_zone;
40kmem_zone_t *xfs_bud_zone;
41
42static inline struct xfs_bui_log_item *BUI_ITEM(struct xfs_log_item *lip)
43{
44 return container_of(lip, struct xfs_bui_log_item, bui_item);
45}
46
47void
48xfs_bui_item_free(
49 struct xfs_bui_log_item *buip)
50{
51 kmem_zone_free(xfs_bui_zone, buip);
52}
53
54STATIC void
55xfs_bui_item_size(
56 struct xfs_log_item *lip,
57 int *nvecs,
58 int *nbytes)
59{
60 struct xfs_bui_log_item *buip = BUI_ITEM(lip);
61
62 *nvecs += 1;
63 *nbytes += xfs_bui_log_format_sizeof(buip->bui_format.bui_nextents);
64}
65
66/*
67 * This is called to fill in the vector of log iovecs for the
68 * given bui log item. We use only 1 iovec, and we point that
69 * at the bui_log_format structure embedded in the bui item.
70 * It is at this point that we assert that all of the extent
71 * slots in the bui item have been filled.
72 */
73STATIC void
74xfs_bui_item_format(
75 struct xfs_log_item *lip,
76 struct xfs_log_vec *lv)
77{
78 struct xfs_bui_log_item *buip = BUI_ITEM(lip);
79 struct xfs_log_iovec *vecp = NULL;
80
81 ASSERT(atomic_read(&buip->bui_next_extent) ==
82 buip->bui_format.bui_nextents);
83
84 buip->bui_format.bui_type = XFS_LI_BUI;
85 buip->bui_format.bui_size = 1;
86
87 xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUI_FORMAT, &buip->bui_format,
88 xfs_bui_log_format_sizeof(buip->bui_format.bui_nextents));
89}
90
91/*
92 * Pinning has no meaning for a bui item, so just return.
93 */
94STATIC void
95xfs_bui_item_pin(
96 struct xfs_log_item *lip)
97{
98}
99
100/*
101 * The unpin operation is the last place a BUI is manipulated in the log. It is
102 * either inserted in the AIL or aborted in the event of a log I/O error. In
103 * either case, the BUI transaction has been successfully committed to make it
104 * this far. Therefore, we expect whoever committed the BUI to either construct
105 * and commit the BUD or drop the BUD's reference in the event of error. Simply
106 * drop the log's BUI reference now that the log is done with it.
107 */
108STATIC void
109xfs_bui_item_unpin(
110 struct xfs_log_item *lip,
111 int remove)
112{
113 struct xfs_bui_log_item *buip = BUI_ITEM(lip);
114
115 xfs_bui_release(buip);
116}
117
118/*
119 * BUI items have no locking or pushing. However, since BUIs are pulled from
120 * the AIL when their corresponding BUDs are committed to disk, their situation
121 * is very similar to being pinned. Return XFS_ITEM_PINNED so that the caller
122 * will eventually flush the log. This should help in getting the BUI out of
123 * the AIL.
124 */
125STATIC uint
126xfs_bui_item_push(
127 struct xfs_log_item *lip,
128 struct list_head *buffer_list)
129{
130 return XFS_ITEM_PINNED;
131}
132
133/*
134 * The BUI has been either committed or aborted if the transaction has been
135 * cancelled. If the transaction was cancelled, a BUD isn't going to be
136 * constructed and thus we free the BUI here directly.
137 */
138STATIC void
139xfs_bui_item_unlock(
140 struct xfs_log_item *lip)
141{
142 if (lip->li_flags & XFS_LI_ABORTED)
143 xfs_bui_item_free(BUI_ITEM(lip));
144}
145
146/*
147 * The BUI is logged only once and cannot be moved in the log, so simply return
148 * the lsn at which it's been logged.
149 */
150STATIC xfs_lsn_t
151xfs_bui_item_committed(
152 struct xfs_log_item *lip,
153 xfs_lsn_t lsn)
154{
155 return lsn;
156}
157
158/*
159 * The BUI dependency tracking op doesn't do squat. It can't because
160 * it doesn't know where the free extent is coming from. The dependency
161 * tracking has to be handled by the "enclosing" metadata object. For
162 * example, for inodes, the inode is locked throughout the extent freeing
163 * so the dependency should be recorded there.
164 */
165STATIC void
166xfs_bui_item_committing(
167 struct xfs_log_item *lip,
168 xfs_lsn_t lsn)
169{
170}
171
172/*
173 * This is the ops vector shared by all bui log items.
174 */
175static const struct xfs_item_ops xfs_bui_item_ops = {
176 .iop_size = xfs_bui_item_size,
177 .iop_format = xfs_bui_item_format,
178 .iop_pin = xfs_bui_item_pin,
179 .iop_unpin = xfs_bui_item_unpin,
180 .iop_unlock = xfs_bui_item_unlock,
181 .iop_committed = xfs_bui_item_committed,
182 .iop_push = xfs_bui_item_push,
183 .iop_committing = xfs_bui_item_committing,
184};
185
186/*
187 * Allocate and initialize a bui item with the given number of extents.
188 */
189struct xfs_bui_log_item *
190xfs_bui_init(
191 struct xfs_mount *mp)
192
193{
194 struct xfs_bui_log_item *buip;
195
196 buip = kmem_zone_zalloc(xfs_bui_zone, KM_SLEEP);
197
198 xfs_log_item_init(mp, &buip->bui_item, XFS_LI_BUI, &xfs_bui_item_ops);
199 buip->bui_format.bui_nextents = XFS_BUI_MAX_FAST_EXTENTS;
200 buip->bui_format.bui_id = (uintptr_t)(void *)buip;
201 atomic_set(&buip->bui_next_extent, 0);
202 atomic_set(&buip->bui_refcount, 2);
203
204 return buip;
205}
206
207/*
208 * Freeing the BUI requires that we remove it from the AIL if it has already
209 * been placed there. However, the BUI may not yet have been placed in the AIL
210 * when called by xfs_bui_release() from BUD processing due to the ordering of
211 * committed vs unpin operations in bulk insert operations. Hence the reference
212 * count to ensure only the last caller frees the BUI.
213 */
214void
215xfs_bui_release(
216 struct xfs_bui_log_item *buip)
217{
218 if (atomic_dec_and_test(&buip->bui_refcount)) {
219 xfs_trans_ail_remove(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
220 xfs_bui_item_free(buip);
221 }
222}
223
224static inline struct xfs_bud_log_item *BUD_ITEM(struct xfs_log_item *lip)
225{
226 return container_of(lip, struct xfs_bud_log_item, bud_item);
227}
228
229STATIC void
230xfs_bud_item_size(
231 struct xfs_log_item *lip,
232 int *nvecs,
233 int *nbytes)
234{
235 *nvecs += 1;
236 *nbytes += sizeof(struct xfs_bud_log_format);
237}
238
239/*
240 * This is called to fill in the vector of log iovecs for the
241 * given bud log item. We use only 1 iovec, and we point that
242 * at the bud_log_format structure embedded in the bud item.
243 * It is at this point that we assert that all of the extent
244 * slots in the bud item have been filled.
245 */
246STATIC void
247xfs_bud_item_format(
248 struct xfs_log_item *lip,
249 struct xfs_log_vec *lv)
250{
251 struct xfs_bud_log_item *budp = BUD_ITEM(lip);
252 struct xfs_log_iovec *vecp = NULL;
253
254 budp->bud_format.bud_type = XFS_LI_BUD;
255 budp->bud_format.bud_size = 1;
256
257 xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUD_FORMAT, &budp->bud_format,
258 sizeof(struct xfs_bud_log_format));
259}
260
261/*
262 * Pinning has no meaning for a bud item, so just return.
263 */
264STATIC void
265xfs_bud_item_pin(
266 struct xfs_log_item *lip)
267{
268}
269
270/*
271 * Since pinning has no meaning for a bud item, unpinning does
272 * not either.
273 */
274STATIC void
275xfs_bud_item_unpin(
276 struct xfs_log_item *lip,
277 int remove)
278{
279}
280
281/*
282 * There isn't much you can do to push on a bud item. It is simply stuck
283 * waiting for the log to be flushed to disk.
284 */
285STATIC uint
286xfs_bud_item_push(
287 struct xfs_log_item *lip,
288 struct list_head *buffer_list)
289{
290 return XFS_ITEM_PINNED;
291}
292
293/*
294 * The BUD is either committed or aborted if the transaction is cancelled. If
295 * the transaction is cancelled, drop our reference to the BUI and free the
296 * BUD.
297 */
298STATIC void
299xfs_bud_item_unlock(
300 struct xfs_log_item *lip)
301{
302 struct xfs_bud_log_item *budp = BUD_ITEM(lip);
303
304 if (lip->li_flags & XFS_LI_ABORTED) {
305 xfs_bui_release(budp->bud_buip);
306 kmem_zone_free(xfs_bud_zone, budp);
307 }
308}
309
310/*
311 * When the bud item is committed to disk, all we need to do is delete our
312 * reference to our partner bui item and then free ourselves. Since we're
313 * freeing ourselves we must return -1 to keep the transaction code from
314 * further referencing this item.
315 */
316STATIC xfs_lsn_t
317xfs_bud_item_committed(
318 struct xfs_log_item *lip,
319 xfs_lsn_t lsn)
320{
321 struct xfs_bud_log_item *budp = BUD_ITEM(lip);
322
323 /*
324 * Drop the BUI reference regardless of whether the BUD has been
325 * aborted. Once the BUD transaction is constructed, it is the sole
326 * responsibility of the BUD to release the BUI (even if the BUI is
327 * aborted due to log I/O error).
328 */
329 xfs_bui_release(budp->bud_buip);
330 kmem_zone_free(xfs_bud_zone, budp);
331
332 return (xfs_lsn_t)-1;
333}
334
335/*
336 * The BUD dependency tracking op doesn't do squat. It can't because
337 * it doesn't know where the free extent is coming from. The dependency
338 * tracking has to be handled by the "enclosing" metadata object. For
339 * example, for inodes, the inode is locked throughout the extent freeing
340 * so the dependency should be recorded there.
341 */
342STATIC void
343xfs_bud_item_committing(
344 struct xfs_log_item *lip,
345 xfs_lsn_t lsn)
346{
347}
348
349/*
350 * This is the ops vector shared by all bud log items.
351 */
352static const struct xfs_item_ops xfs_bud_item_ops = {
353 .iop_size = xfs_bud_item_size,
354 .iop_format = xfs_bud_item_format,
355 .iop_pin = xfs_bud_item_pin,
356 .iop_unpin = xfs_bud_item_unpin,
357 .iop_unlock = xfs_bud_item_unlock,
358 .iop_committed = xfs_bud_item_committed,
359 .iop_push = xfs_bud_item_push,
360 .iop_committing = xfs_bud_item_committing,
361};
362
363/*
364 * Allocate and initialize a bud item with the given number of extents.
365 */
366struct xfs_bud_log_item *
367xfs_bud_init(
368 struct xfs_mount *mp,
369 struct xfs_bui_log_item *buip)
370
371{
372 struct xfs_bud_log_item *budp;
373
374 budp = kmem_zone_zalloc(xfs_bud_zone, KM_SLEEP);
375 xfs_log_item_init(mp, &budp->bud_item, XFS_LI_BUD, &xfs_bud_item_ops);
376 budp->bud_buip = buip;
377 budp->bud_format.bud_bui_id = buip->bui_format.bui_id;
378
379 return budp;
380}
381
382/*
383 * Process a bmap update intent item that was recovered from the log.
384 * We need to update some inode's bmbt.
385 */
386int
387xfs_bui_recover(
388 struct xfs_mount *mp,
389 struct xfs_bui_log_item *buip)
390{
391 int error = 0;
392 unsigned int bui_type;
393 struct xfs_map_extent *bmap;
394 xfs_fsblock_t startblock_fsb;
395 xfs_fsblock_t inode_fsb;
396 bool op_ok;
397 struct xfs_bud_log_item *budp;
398 enum xfs_bmap_intent_type type;
399 int whichfork;
400 xfs_exntst_t state;
401 struct xfs_trans *tp;
402 struct xfs_inode *ip = NULL;
403 struct xfs_defer_ops dfops;
404 xfs_fsblock_t firstfsb;
405
406 ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
407
408 /* Only one mapping operation per BUI... */
409 if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) {
410 set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
411 xfs_bui_release(buip);
412 return -EIO;
413 }
414
415 /*
416 * First check the validity of the extent described by the
417 * BUI. If anything is bad, then toss the BUI.
418 */
419 bmap = &buip->bui_format.bui_extents[0];
420 startblock_fsb = XFS_BB_TO_FSB(mp,
421 XFS_FSB_TO_DADDR(mp, bmap->me_startblock));
422 inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp,
423 XFS_INO_TO_FSB(mp, bmap->me_owner)));
424 switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
425 case XFS_BMAP_MAP:
426 case XFS_BMAP_UNMAP:
427 op_ok = true;
428 break;
429 default:
430 op_ok = false;
431 break;
432 }
433 if (!op_ok || startblock_fsb == 0 ||
434 bmap->me_len == 0 ||
435 inode_fsb == 0 ||
436 startblock_fsb >= mp->m_sb.sb_dblocks ||
437 bmap->me_len >= mp->m_sb.sb_agblocks ||
438 inode_fsb >= mp->m_sb.sb_dblocks ||
439 (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
440 /*
441 * This will pull the BUI from the AIL and
442 * free the memory associated with it.
443 */
444 set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
445 xfs_bui_release(buip);
446 return -EIO;
447 }
448
449 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
450 if (error)
451 return error;
452 budp = xfs_trans_get_bud(tp, buip);
453
454 /* Grab the inode. */
455 error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip);
456 if (error)
457 goto err_inode;
458
459 if (VFS_I(ip)->i_nlink == 0)
460 xfs_iflags_set(ip, XFS_IRECOVERY);
461 xfs_defer_init(&dfops, &firstfsb);
462
463 /* Process deferred bmap item. */
464 state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
465 XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
466 whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
467 XFS_ATTR_FORK : XFS_DATA_FORK;
468 bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK;
469 switch (bui_type) {
470 case XFS_BMAP_MAP:
471 case XFS_BMAP_UNMAP:
472 type = bui_type;
473 break;
474 default:
475 error = -EFSCORRUPTED;
476 goto err_dfops;
477 }
478 xfs_trans_ijoin(tp, ip, 0);
479
480 error = xfs_trans_log_finish_bmap_update(tp, budp, &dfops, type,
481 ip, whichfork, bmap->me_startoff,
482 bmap->me_startblock, bmap->me_len,
483 state);
484 if (error)
485 goto err_dfops;
486
487 /* Finish transaction, free inodes. */
488 error = xfs_defer_finish(&tp, &dfops, NULL);
489 if (error)
490 goto err_dfops;
491
492 set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
493 error = xfs_trans_commit(tp);
494 xfs_iunlock(ip, XFS_ILOCK_EXCL);
495 IRELE(ip);
496
497 return error;
498
499err_dfops:
500 xfs_defer_cancel(&dfops);
501err_inode:
502 xfs_trans_cancel(tp);
503 if (ip) {
504 xfs_iunlock(ip, XFS_ILOCK_EXCL);
505 IRELE(ip);
506 }
507 return error;
508}
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
new file mode 100644
index 000000000000..c867daae4a3c
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.h
@@ -0,0 +1,98 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#ifndef __XFS_BMAP_ITEM_H__
21#define __XFS_BMAP_ITEM_H__
22
23/*
24 * There are (currently) two pairs of bmap btree redo item types: map & unmap.
25 * The common abbreviations for these are BUI (bmap update intent) and BUD
26 * (bmap update done). The redo item type is encoded in the flags field of
27 * each xfs_map_extent.
28 *
29 * *I items should be recorded in the *first* of a series of rolled
30 * transactions, and the *D items should be recorded in the same transaction
31 * that records the associated bmbt updates.
32 *
33 * Should the system crash after the commit of the first transaction but
34 * before the commit of the final transaction in a series, log recovery will
35 * use the redo information recorded by the intent items to replay the
36 * bmbt metadata updates in the non-first transaction.
37 */
38
39/* kernel only BUI/BUD definitions */
40
41struct xfs_mount;
42struct kmem_zone;
43
44/*
45 * Max number of extents in fast allocation path.
46 */
47#define XFS_BUI_MAX_FAST_EXTENTS 1
48
49/*
50 * Define BUI flag bits. Manipulated by set/clear/test_bit operators.
51 */
52#define XFS_BUI_RECOVERED 1
53
54/*
55 * This is the "bmap update intent" log item. It is used to log the fact that
56 * some file block mappings need to change. It is used in conjunction with the
57 * "bmap update done" log item described below.
58 *
59 * These log items follow the same rules as struct xfs_efi_log_item; see the
60 * comments about that structure (in xfs_extfree_item.h) for more details.
61 */
62struct xfs_bui_log_item {
63 struct xfs_log_item bui_item;
64 atomic_t bui_refcount;
65 atomic_t bui_next_extent;
66 unsigned long bui_flags; /* misc flags */
67 struct xfs_bui_log_format bui_format;
68};
69
70static inline size_t
71xfs_bui_log_item_sizeof(
72 unsigned int nr)
73{
74 return offsetof(struct xfs_bui_log_item, bui_format) +
75 xfs_bui_log_format_sizeof(nr);
76}
77
78/*
79 * This is the "bmap update done" log item. It is used to log the fact that
80 * some bmbt updates mentioned in an earlier bui item have been performed.
81 */
82struct xfs_bud_log_item {
83 struct xfs_log_item bud_item;
84 struct xfs_bui_log_item *bud_buip;
85 struct xfs_bud_log_format bud_format;
86};
87
88extern struct kmem_zone *xfs_bui_zone;
89extern struct kmem_zone *xfs_bud_zone;
90
91struct xfs_bui_log_item *xfs_bui_init(struct xfs_mount *);
92struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
93 struct xfs_bui_log_item *);
94void xfs_bui_item_free(struct xfs_bui_log_item *);
95void xfs_bui_release(struct xfs_bui_log_item *);
96int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
97
98#endif /* __XFS_BMAP_ITEM_H__ */
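
The header above sizes a BUI log item as a fixed header plus a flexible array of extent records (`offsetof` on the format member plus a per-extent size). A minimal standalone sketch of that sizing pattern, using illustrative struct names rather than the real XFS definitions:

```c
#include <stddef.h>

/* Illustrative stand-in for an extent record logged with the intent. */
struct demo_map_extent {
	unsigned long long	me_startblock;
	unsigned long long	me_startoff;
	unsigned int		me_len;
	unsigned int		me_flags;
};

/* Illustrative stand-in for the on-log format: header + flexible array. */
struct demo_bui_format {
	unsigned short		bui_type;
	unsigned short		bui_size;
	unsigned int		bui_nextents;
	unsigned long long	bui_id;
	struct demo_map_extent	bui_extents[];	/* flexible array member */
};

/* Same shape as xfs_bui_log_item_sizeof(): header bytes + nr records. */
static inline size_t demo_bui_sizeof(unsigned int nr)
{
	return offsetof(struct demo_bui_format, bui_extents) +
	       nr * sizeof(struct demo_map_extent);
}
```

The `offsetof` form is the idiomatic way to size a structure ending in a flexible array member, since `sizeof` the outer struct may include tail padding that the flexible array overlays.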
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e827d657c314..552465e011ec 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -42,6 +42,9 @@
42#include "xfs_icache.h" 42#include "xfs_icache.h"
43#include "xfs_log.h" 43#include "xfs_log.h"
44#include "xfs_rmap_btree.h" 44#include "xfs_rmap_btree.h"
45#include "xfs_iomap.h"
46#include "xfs_reflink.h"
47#include "xfs_refcount.h"
45 48
46/* Kernel only BMAP related definitions and functions */ 49/* Kernel only BMAP related definitions and functions */
47 50
@@ -389,11 +392,13 @@ xfs_bmap_count_blocks(
389STATIC int 392STATIC int
390xfs_getbmapx_fix_eof_hole( 393xfs_getbmapx_fix_eof_hole(
391 xfs_inode_t *ip, /* xfs incore inode pointer */ 394 xfs_inode_t *ip, /* xfs incore inode pointer */
395 int whichfork,
392 struct getbmapx *out, /* output structure */ 396 struct getbmapx *out, /* output structure */
393 int prealloced, /* this is a file with 397 int prealloced, /* this is a file with
394 * preallocated data space */ 398 * preallocated data space */
395 __int64_t end, /* last block requested */ 399 __int64_t end, /* last block requested */
396 xfs_fsblock_t startblock) 400 xfs_fsblock_t startblock,
401 bool moretocome)
397{ 402{
398 __int64_t fixlen; 403 __int64_t fixlen;
399 xfs_mount_t *mp; /* file system mount point */ 404 xfs_mount_t *mp; /* file system mount point */
@@ -418,8 +423,9 @@ xfs_getbmapx_fix_eof_hole(
418 else 423 else
419 out->bmv_block = xfs_fsb_to_db(ip, startblock); 424 out->bmv_block = xfs_fsb_to_db(ip, startblock);
420 fileblock = XFS_BB_TO_FSB(ip->i_mount, out->bmv_offset); 425 fileblock = XFS_BB_TO_FSB(ip->i_mount, out->bmv_offset);
421 ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK); 426 ifp = XFS_IFORK_PTR(ip, whichfork);
422 if (xfs_iext_bno_to_ext(ifp, fileblock, &lastx) && 427 if (!moretocome &&
428 xfs_iext_bno_to_ext(ifp, fileblock, &lastx) &&
423 (lastx == (ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t))-1)) 429 (lastx == (ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t))-1))
424 out->bmv_oflags |= BMV_OF_LAST; 430 out->bmv_oflags |= BMV_OF_LAST;
425 } 431 }
@@ -427,6 +433,81 @@ xfs_getbmapx_fix_eof_hole(
427 return 1; 433 return 1;
428} 434}
429 435
436/* Adjust the reported bmap around shared/unshared extent transitions. */
437STATIC int
438xfs_getbmap_adjust_shared(
439 struct xfs_inode *ip,
440 int whichfork,
441 struct xfs_bmbt_irec *map,
442 struct getbmapx *out,
443 struct xfs_bmbt_irec *next_map)
444{
445 struct xfs_mount *mp = ip->i_mount;
446 xfs_agnumber_t agno;
447 xfs_agblock_t agbno;
448 xfs_agblock_t ebno;
449 xfs_extlen_t elen;
450 xfs_extlen_t nlen;
451 int error;
452
453 next_map->br_startblock = NULLFSBLOCK;
454 next_map->br_startoff = NULLFILEOFF;
455 next_map->br_blockcount = 0;
456
457 /* Only written data blocks can be shared. */
458 if (!xfs_is_reflink_inode(ip) || whichfork != XFS_DATA_FORK ||
459 map->br_startblock == DELAYSTARTBLOCK ||
460 map->br_startblock == HOLESTARTBLOCK ||
461 ISUNWRITTEN(map))
462 return 0;
463
464 agno = XFS_FSB_TO_AGNO(mp, map->br_startblock);
465 agbno = XFS_FSB_TO_AGBNO(mp, map->br_startblock);
466 error = xfs_reflink_find_shared(mp, agno, agbno, map->br_blockcount,
467 &ebno, &elen, true);
468 if (error)
469 return error;
470
471 if (ebno == NULLAGBLOCK) {
472 /* No shared blocks at all. */
473 return 0;
474 } else if (agbno == ebno) {
475 /*
476 * Shared extent at (agbno, elen). Shrink the reported
477 * extent length and prepare to move the start of map[i]
478 * to agbno+elen, with the aim of (re)formatting the new
479 * map[i] the next time through the inner loop.
480 */
481 out->bmv_length = XFS_FSB_TO_BB(mp, elen);
482 out->bmv_oflags |= BMV_OF_SHARED;
483 if (elen != map->br_blockcount) {
484 *next_map = *map;
485 next_map->br_startblock += elen;
486 next_map->br_startoff += elen;
487 next_map->br_blockcount -= elen;
488 }
489 map->br_blockcount -= elen;
490 } else {
491 /*
492 * There's an unshared extent (agbno, ebno - agbno)
493 * followed by shared extent at (ebno, elen). Shrink
494 * the reported extent length to cover only the unshared
495 * extent and prepare to move up the start of map[i] to
496 * ebno, with the aim of (re)formatting the new map[i]
497 * the next time through the inner loop.
498 */
499 *next_map = *map;
500 nlen = ebno - agbno;
501 out->bmv_length = XFS_FSB_TO_BB(mp, nlen);
502 next_map->br_startblock += nlen;
503 next_map->br_startoff += nlen;
504 next_map->br_blockcount -= nlen;
505 map->br_blockcount -= nlen;
506 }
507
508 return 0;
509}
510
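
The adjust-shared helper above handles three cases: no shared blocks, a shared run at the front of the mapping, or an unshared prefix before the first shared run; in the latter two it shrinks the reported extent and queues the remainder for the next loop pass. A simplified, self-contained sketch of that trimming logic (types and names are illustrative, not the XFS ones):

```c
/* Simplified extent: [start, start + len). Illustrative only. */
struct ext {
	unsigned long	start;
	unsigned long	len;
};

#define NO_SHARED	((unsigned long)-1)

/*
 * Trim *map at the first shared run [shared_start, shared_start+shared_len)
 * (or NO_SHARED if none) so the reported piece is wholly shared or wholly
 * unshared; the leftover goes into *next for the next iteration.
 * Returns 1 if the reported piece is shared.
 */
static int trim_at_shared(struct ext *map, unsigned long shared_start,
			  unsigned long shared_len, struct ext *next)
{
	next->start = NO_SHARED;
	next->len = 0;

	if (shared_start == NO_SHARED)
		return 0;			/* case 1: nothing shared */

	if (shared_start == map->start) {
		/* case 2: shared run at the front; defer the tail */
		if (shared_len < map->len) {
			next->start = map->start + shared_len;
			next->len = map->len - shared_len;
		}
		map->len = shared_len;
		return 1;
	}

	/* case 3: unshared prefix; report it, defer from shared_start on */
	next->start = shared_start;
	next->len = map->len - (shared_start - map->start);
	map->len = shared_start - map->start;
	return 0;
}
```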
430/* 511/*
431 * Get inode's extents as described in bmv, and format for output. 512 * Get inode's extents as described in bmv, and format for output.
432 * Calls formatter to fill the user's buffer until all extents 513 * Calls formatter to fill the user's buffer until all extents
@@ -459,12 +540,28 @@ xfs_getbmap(
459 int iflags; /* interface flags */ 540 int iflags; /* interface flags */
460 int bmapi_flags; /* flags for xfs_bmapi */ 541 int bmapi_flags; /* flags for xfs_bmapi */
461 int cur_ext = 0; 542 int cur_ext = 0;
543 struct xfs_bmbt_irec inject_map;
462 544
463 mp = ip->i_mount; 545 mp = ip->i_mount;
464 iflags = bmv->bmv_iflags; 546 iflags = bmv->bmv_iflags;
465 whichfork = iflags & BMV_IF_ATTRFORK ? XFS_ATTR_FORK : XFS_DATA_FORK;
466 547
467 if (whichfork == XFS_ATTR_FORK) { 548#ifndef DEBUG
549 /* Only allow CoW fork queries if we're debugging. */
550 if (iflags & BMV_IF_COWFORK)
551 return -EINVAL;
552#endif
553 if ((iflags & BMV_IF_ATTRFORK) && (iflags & BMV_IF_COWFORK))
554 return -EINVAL;
555
556 if (iflags & BMV_IF_ATTRFORK)
557 whichfork = XFS_ATTR_FORK;
558 else if (iflags & BMV_IF_COWFORK)
559 whichfork = XFS_COW_FORK;
560 else
561 whichfork = XFS_DATA_FORK;
562
563 switch (whichfork) {
564 case XFS_ATTR_FORK:
468 if (XFS_IFORK_Q(ip)) { 565 if (XFS_IFORK_Q(ip)) {
469 if (ip->i_d.di_aformat != XFS_DINODE_FMT_EXTENTS && 566 if (ip->i_d.di_aformat != XFS_DINODE_FMT_EXTENTS &&
470 ip->i_d.di_aformat != XFS_DINODE_FMT_BTREE && 567 ip->i_d.di_aformat != XFS_DINODE_FMT_BTREE &&
@@ -480,7 +577,20 @@ xfs_getbmap(
480 577
481 prealloced = 0; 578 prealloced = 0;
482 fixlen = 1LL << 32; 579 fixlen = 1LL << 32;
483 } else { 580 break;
581 case XFS_COW_FORK:
582 if (ip->i_cformat != XFS_DINODE_FMT_EXTENTS)
583 return -EINVAL;
584
585 if (xfs_get_cowextsz_hint(ip)) {
586 prealloced = 1;
587 fixlen = mp->m_super->s_maxbytes;
588 } else {
589 prealloced = 0;
590 fixlen = XFS_ISIZE(ip);
591 }
592 break;
593 default:
484 if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS && 594 if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
485 ip->i_d.di_format != XFS_DINODE_FMT_BTREE && 595 ip->i_d.di_format != XFS_DINODE_FMT_BTREE &&
486 ip->i_d.di_format != XFS_DINODE_FMT_LOCAL) 596 ip->i_d.di_format != XFS_DINODE_FMT_LOCAL)
@@ -494,6 +604,7 @@ xfs_getbmap(
494 prealloced = 0; 604 prealloced = 0;
495 fixlen = XFS_ISIZE(ip); 605 fixlen = XFS_ISIZE(ip);
496 } 606 }
607 break;
497 } 608 }
498 609
499 if (bmv->bmv_length == -1) { 610 if (bmv->bmv_length == -1) {
@@ -520,7 +631,8 @@ xfs_getbmap(
520 return -ENOMEM; 631 return -ENOMEM;
521 632
522 xfs_ilock(ip, XFS_IOLOCK_SHARED); 633 xfs_ilock(ip, XFS_IOLOCK_SHARED);
523 if (whichfork == XFS_DATA_FORK) { 634 switch (whichfork) {
635 case XFS_DATA_FORK:
524 if (!(iflags & BMV_IF_DELALLOC) && 636 if (!(iflags & BMV_IF_DELALLOC) &&
525 (ip->i_delayed_blks || XFS_ISIZE(ip) > ip->i_d.di_size)) { 637 (ip->i_delayed_blks || XFS_ISIZE(ip) > ip->i_d.di_size)) {
526 error = filemap_write_and_wait(VFS_I(ip)->i_mapping); 638 error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
@@ -538,8 +650,14 @@ xfs_getbmap(
538 } 650 }
539 651
540 lock = xfs_ilock_data_map_shared(ip); 652 lock = xfs_ilock_data_map_shared(ip);
541 } else { 653 break;
654 case XFS_COW_FORK:
655 lock = XFS_ILOCK_SHARED;
656 xfs_ilock(ip, lock);
657 break;
658 case XFS_ATTR_FORK:
542 lock = xfs_ilock_attr_map_shared(ip); 659 lock = xfs_ilock_attr_map_shared(ip);
660 break;
543 } 661 }
544 662
545 /* 663 /*
@@ -581,7 +699,8 @@ xfs_getbmap(
581 goto out_free_map; 699 goto out_free_map;
582 ASSERT(nmap <= subnex); 700 ASSERT(nmap <= subnex);
583 701
584 for (i = 0; i < nmap && nexleft && bmv->bmv_length; i++) { 702 for (i = 0; i < nmap && nexleft && bmv->bmv_length &&
703 cur_ext < bmv->bmv_count; i++) {
585 out[cur_ext].bmv_oflags = 0; 704 out[cur_ext].bmv_oflags = 0;
586 if (map[i].br_state == XFS_EXT_UNWRITTEN) 705 if (map[i].br_state == XFS_EXT_UNWRITTEN)
587 out[cur_ext].bmv_oflags |= BMV_OF_PREALLOC; 706 out[cur_ext].bmv_oflags |= BMV_OF_PREALLOC;
@@ -614,9 +733,16 @@ xfs_getbmap(
614 goto out_free_map; 733 goto out_free_map;
615 } 734 }
616 735
617 if (!xfs_getbmapx_fix_eof_hole(ip, &out[cur_ext], 736 /* Is this a shared block? */
618 prealloced, bmvend, 737 error = xfs_getbmap_adjust_shared(ip, whichfork,
619 map[i].br_startblock)) 738 &map[i], &out[cur_ext], &inject_map);
739 if (error)
740 goto out_free_map;
741
742 if (!xfs_getbmapx_fix_eof_hole(ip, whichfork,
743 &out[cur_ext], prealloced, bmvend,
744 map[i].br_startblock,
745 inject_map.br_startblock != NULLFSBLOCK))
620 goto out_free_map; 746 goto out_free_map;
621 747
622 bmv->bmv_offset = 748 bmv->bmv_offset =
@@ -636,11 +762,16 @@ xfs_getbmap(
636 continue; 762 continue;
637 } 763 }
638 764
639 nexleft--; 765 if (inject_map.br_startblock != NULLFSBLOCK) {
766 map[i] = inject_map;
767 i--;
768 } else
769 nexleft--;
640 bmv->bmv_entries++; 770 bmv->bmv_entries++;
641 cur_ext++; 771 cur_ext++;
642 } 772 }
643 } while (nmap && nexleft && bmv->bmv_length); 773 } while (nmap && nexleft && bmv->bmv_length &&
774 cur_ext < bmv->bmv_count);
644 775
645 out_free_map: 776 out_free_map:
646 kmem_free(map); 777 kmem_free(map);
@@ -1433,8 +1564,8 @@ xfs_insert_file_space(
1433 */ 1564 */
1434static int 1565static int
1435xfs_swap_extents_check_format( 1566xfs_swap_extents_check_format(
1436 xfs_inode_t *ip, /* target inode */ 1567 struct xfs_inode *ip, /* target inode */
1437 xfs_inode_t *tip) /* tmp inode */ 1568 struct xfs_inode *tip) /* tmp inode */
1438{ 1569{
1439 1570
1440 /* Should never get a local format */ 1571 /* Should never get a local format */
@@ -1450,6 +1581,13 @@ xfs_swap_extents_check_format(
1450 return -EINVAL; 1581 return -EINVAL;
1451 1582
1452 /* 1583 /*
1584 * If we have to use the (expensive) rmap swap method, we can
1585 * handle any number of extents and any format.
1586 */
1587 if (xfs_sb_version_hasrmapbt(&ip->i_mount->m_sb))
1588 return 0;
1589
1590 /*
1453 * if the target inode is in extent form and the temp inode is in btree 1591 * if the target inode is in extent form and the temp inode is in btree
1454 * form then we will end up with the target inode in the wrong format 1592 * form then we will end up with the target inode in the wrong format
1455 * as we already know there are less extents in the temp inode. 1593 * as we already know there are less extents in the temp inode.
@@ -1518,125 +1656,161 @@ xfs_swap_extent_flush(
1518 return 0; 1656 return 0;
1519} 1657}
1520 1658
1521int 1659/*
1522xfs_swap_extents( 1660 * Move extents from one file to another, when rmap is enabled.
1523 xfs_inode_t *ip, /* target inode */ 1661 */
1524 xfs_inode_t *tip, /* tmp inode */ 1662STATIC int
1525 xfs_swapext_t *sxp) 1663xfs_swap_extent_rmap(
1664 struct xfs_trans **tpp,
1665 struct xfs_inode *ip,
1666 struct xfs_inode *tip)
1526{ 1667{
1527 xfs_mount_t *mp = ip->i_mount; 1668 struct xfs_bmbt_irec irec;
1528 xfs_trans_t *tp; 1669 struct xfs_bmbt_irec uirec;
1529 xfs_bstat_t *sbp = &sxp->sx_stat; 1670 struct xfs_bmbt_irec tirec;
1530 xfs_ifork_t *tempifp, *ifp, *tifp; 1671 xfs_fileoff_t offset_fsb;
1531 int src_log_flags, target_log_flags; 1672 xfs_fileoff_t end_fsb;
1532 int error = 0; 1673 xfs_filblks_t count_fsb;
1533 int aforkblks = 0; 1674 xfs_fsblock_t firstfsb;
1534 int taforkblks = 0; 1675 struct xfs_defer_ops dfops;
1535 __uint64_t tmp; 1676 int error;
1536 int lock_flags; 1677 xfs_filblks_t ilen;
1537 1678 xfs_filblks_t rlen;
1538 /* XXX: we can't do this with rmap, will fix later */ 1679 int nimaps;
1539 if (xfs_sb_version_hasrmapbt(&mp->m_sb)) 1680 __uint64_t tip_flags2;
1540 return -EOPNOTSUPP;
1541
1542 tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
1543 if (!tempifp) {
1544 error = -ENOMEM;
1545 goto out;
1546 }
1547 1681
1548 /* 1682 /*
1549 * Lock the inodes against other IO, page faults and truncate to 1683 * If the source file has shared blocks, we must flag the donor
1550 * begin with. Then we can ensure the inodes are flushed and have no 1684 * file as having shared blocks so that we get the shared-block
1551 * page cache safely. Once we have done this we can take the ilocks and 1685 * rmap functions when we go to fix up the rmaps. The flags
 1552 * do the rest of the checks. 1686 * will be switched for real later.
1553 */ 1687 */
1554 lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; 1688 tip_flags2 = tip->i_d.di_flags2;
1555 xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL); 1689 if (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
1556 xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL); 1690 tip->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
1557 1691
1558 /* Verify that both files have the same format */ 1692 offset_fsb = 0;
1559 if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) { 1693 end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
1560 error = -EINVAL; 1694 count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
1561 goto out_unlock; 1695
1562 } 1696 while (count_fsb) {
1697 /* Read extent from the donor file */
1698 nimaps = 1;
1699 error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
1700 &nimaps, 0);
1701 if (error)
1702 goto out;
1703 ASSERT(nimaps == 1);
1704 ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
1705
1706 trace_xfs_swap_extent_rmap_remap(tip, &tirec);
1707 ilen = tirec.br_blockcount;
1708
1709 /* Unmap the old blocks in the source file. */
1710 while (tirec.br_blockcount) {
1711 xfs_defer_init(&dfops, &firstfsb);
1712 trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
1713
1714 /* Read extent from the source file */
1715 nimaps = 1;
1716 error = xfs_bmapi_read(ip, tirec.br_startoff,
1717 tirec.br_blockcount, &irec,
1718 &nimaps, 0);
1719 if (error)
1720 goto out_defer;
1721 ASSERT(nimaps == 1);
1722 ASSERT(tirec.br_startoff == irec.br_startoff);
1723 trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
1724
1725 /* Trim the extent. */
1726 uirec = tirec;
1727 uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
1728 tirec.br_blockcount,
1729 irec.br_blockcount);
1730 trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
1731
1732 /* Remove the mapping from the donor file. */
1733 error = xfs_bmap_unmap_extent((*tpp)->t_mountp, &dfops,
1734 tip, &uirec);
1735 if (error)
1736 goto out_defer;
1563 1737
1564 /* Verify both files are either real-time or non-realtime */ 1738 /* Remove the mapping from the source file. */
1565 if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) { 1739 error = xfs_bmap_unmap_extent((*tpp)->t_mountp, &dfops,
1566 error = -EINVAL; 1740 ip, &irec);
1567 goto out_unlock; 1741 if (error)
1568 } 1742 goto out_defer;
1569 1743
1570 error = xfs_swap_extent_flush(ip); 1744 /* Map the donor file's blocks into the source file. */
1571 if (error) 1745 error = xfs_bmap_map_extent((*tpp)->t_mountp, &dfops,
1572 goto out_unlock; 1746 ip, &uirec);
1573 error = xfs_swap_extent_flush(tip); 1747 if (error)
1574 if (error) 1748 goto out_defer;
1575 goto out_unlock;
1576 1749
1577 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp); 1750 /* Map the source file's blocks into the donor file. */
1578 if (error) 1751 error = xfs_bmap_map_extent((*tpp)->t_mountp, &dfops,
1579 goto out_unlock; 1752 tip, &irec);
1753 if (error)
1754 goto out_defer;
1580 1755
1581 /* 1756 error = xfs_defer_finish(tpp, &dfops, ip);
1582 * Lock and join the inodes to the tansaction so that transaction commit 1757 if (error)
1583 * or cancel will unlock the inodes from this point onwards. 1758 goto out_defer;
1584 */
1585 xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
1586 lock_flags |= XFS_ILOCK_EXCL;
1587 xfs_trans_ijoin(tp, ip, lock_flags);
1588 xfs_trans_ijoin(tp, tip, lock_flags);
1589 1759
1760 tirec.br_startoff += rlen;
1761 if (tirec.br_startblock != HOLESTARTBLOCK &&
1762 tirec.br_startblock != DELAYSTARTBLOCK)
1763 tirec.br_startblock += rlen;
1764 tirec.br_blockcount -= rlen;
1765 }
1590 1766
1591 /* Verify all data are being swapped */ 1767 /* Roll on... */
1592 if (sxp->sx_offset != 0 || 1768 count_fsb -= ilen;
1593 sxp->sx_length != ip->i_d.di_size || 1769 offset_fsb += ilen;
1594 sxp->sx_length != tip->i_d.di_size) {
1595 error = -EFAULT;
1596 goto out_trans_cancel;
1597 } 1770 }
1598 1771
1599 trace_xfs_swap_extent_before(ip, 0); 1772 tip->i_d.di_flags2 = tip_flags2;
1600 trace_xfs_swap_extent_before(tip, 1); 1773 return 0;
1601 1774
1602 /* check inode formats now that data is flushed */ 1775out_defer:
1603 error = xfs_swap_extents_check_format(ip, tip); 1776 xfs_defer_cancel(&dfops);
1604 if (error) { 1777out:
1605 xfs_notice(mp, 1778 trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
1606 "%s: inode 0x%llx format is incompatible for exchanging.", 1779 tip->i_d.di_flags2 = tip_flags2;
1607 __func__, ip->i_ino); 1780 return error;
1608 goto out_trans_cancel; 1781}
1609 } 1782
1783/* Swap the extents of two files by swapping data forks. */
1784STATIC int
1785xfs_swap_extent_forks(
1786 struct xfs_trans *tp,
1787 struct xfs_inode *ip,
1788 struct xfs_inode *tip,
1789 int *src_log_flags,
1790 int *target_log_flags)
1791{
1792 struct xfs_ifork tempifp, *ifp, *tifp;
1793 int aforkblks = 0;
1794 int taforkblks = 0;
1795 __uint64_t tmp;
1796 int error;
1610 1797
1611 /*
1612 * Compare the current change & modify times with that
1613 * passed in. If they differ, we abort this swap.
1614 * This is the mechanism used to ensure the calling
1615 * process that the file was not changed out from
1616 * under it.
1617 */
1618 if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
1619 (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
1620 (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
1621 (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
1622 error = -EBUSY;
1623 goto out_trans_cancel;
1624 }
1625 /* 1798 /*
1626 * Count the number of extended attribute blocks 1799 * Count the number of extended attribute blocks
1627 */ 1800 */
1628 if ( ((XFS_IFORK_Q(ip) != 0) && (ip->i_d.di_anextents > 0)) && 1801 if ( ((XFS_IFORK_Q(ip) != 0) && (ip->i_d.di_anextents > 0)) &&
1629 (ip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) { 1802 (ip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
1630 error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &aforkblks); 1803 error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK,
1804 &aforkblks);
1631 if (error) 1805 if (error)
1632 goto out_trans_cancel; 1806 return error;
1633 } 1807 }
1634 if ( ((XFS_IFORK_Q(tip) != 0) && (tip->i_d.di_anextents > 0)) && 1808 if ( ((XFS_IFORK_Q(tip) != 0) && (tip->i_d.di_anextents > 0)) &&
1635 (tip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) { 1809 (tip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
1636 error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK, 1810 error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK,
1637 &taforkblks); 1811 &taforkblks);
1638 if (error) 1812 if (error)
1639 goto out_trans_cancel; 1813 return error;
1640 } 1814 }
1641 1815
1642 /* 1816 /*
@@ -1645,31 +1819,23 @@ xfs_swap_extents(
1645 * buffers, and so the validation done on read will expect the owner 1819 * buffers, and so the validation done on read will expect the owner
1646 * field to be correctly set. Once we change the owners, we can swap the 1820 * field to be correctly set. Once we change the owners, we can swap the
1647 * inode forks. 1821 * inode forks.
1648 *
1649 * Note the trickiness in setting the log flags - we set the owner log
1650 * flag on the opposite inode (i.e. the inode we are setting the new
1651 * owner to be) because once we swap the forks and log that, log
1652 * recovery is going to see the fork as owned by the swapped inode,
1653 * not the pre-swapped inodes.
1654 */ 1822 */
1655 src_log_flags = XFS_ILOG_CORE;
1656 target_log_flags = XFS_ILOG_CORE;
1657 if (ip->i_d.di_version == 3 && 1823 if (ip->i_d.di_version == 3 &&
1658 ip->i_d.di_format == XFS_DINODE_FMT_BTREE) { 1824 ip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
1659 target_log_flags |= XFS_ILOG_DOWNER; 1825 (*target_log_flags) |= XFS_ILOG_DOWNER;
1660 error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, 1826 error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK,
1661 tip->i_ino, NULL); 1827 tip->i_ino, NULL);
1662 if (error) 1828 if (error)
1663 goto out_trans_cancel; 1829 return error;
1664 } 1830 }
1665 1831
1666 if (tip->i_d.di_version == 3 && 1832 if (tip->i_d.di_version == 3 &&
1667 tip->i_d.di_format == XFS_DINODE_FMT_BTREE) { 1833 tip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
1668 src_log_flags |= XFS_ILOG_DOWNER; 1834 (*src_log_flags) |= XFS_ILOG_DOWNER;
1669 error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK, 1835 error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK,
1670 ip->i_ino, NULL); 1836 ip->i_ino, NULL);
1671 if (error) 1837 if (error)
1672 goto out_trans_cancel; 1838 return error;
1673 } 1839 }
1674 1840
1675 /* 1841 /*
@@ -1677,9 +1843,9 @@ xfs_swap_extents(
1677 */ 1843 */
1678 ifp = &ip->i_df; 1844 ifp = &ip->i_df;
1679 tifp = &tip->i_df; 1845 tifp = &tip->i_df;
1680 *tempifp = *ifp; /* struct copy */ 1846 tempifp = *ifp; /* struct copy */
1681 *ifp = *tifp; /* struct copy */ 1847 *ifp = *tifp; /* struct copy */
1682 *tifp = *tempifp; /* struct copy */ 1848 *tifp = tempifp; /* struct copy */
1683 1849
1684 /* 1850 /*
1685 * Fix the on-disk inode values 1851 * Fix the on-disk inode values
@@ -1719,12 +1885,12 @@ xfs_swap_extents(
1719 ifp->if_u1.if_extents = 1885 ifp->if_u1.if_extents =
1720 ifp->if_u2.if_inline_ext; 1886 ifp->if_u2.if_inline_ext;
1721 } 1887 }
1722 src_log_flags |= XFS_ILOG_DEXT; 1888 (*src_log_flags) |= XFS_ILOG_DEXT;
1723 break; 1889 break;
1724 case XFS_DINODE_FMT_BTREE: 1890 case XFS_DINODE_FMT_BTREE:
1725 ASSERT(ip->i_d.di_version < 3 || 1891 ASSERT(ip->i_d.di_version < 3 ||
1726 (src_log_flags & XFS_ILOG_DOWNER)); 1892 (*src_log_flags & XFS_ILOG_DOWNER));
1727 src_log_flags |= XFS_ILOG_DBROOT; 1893 (*src_log_flags) |= XFS_ILOG_DBROOT;
1728 break; 1894 break;
1729 } 1895 }
1730 1896
@@ -1738,15 +1904,166 @@ xfs_swap_extents(
1738 tifp->if_u1.if_extents = 1904 tifp->if_u1.if_extents =
1739 tifp->if_u2.if_inline_ext; 1905 tifp->if_u2.if_inline_ext;
1740 } 1906 }
1741 target_log_flags |= XFS_ILOG_DEXT; 1907 (*target_log_flags) |= XFS_ILOG_DEXT;
1742 break; 1908 break;
1743 case XFS_DINODE_FMT_BTREE: 1909 case XFS_DINODE_FMT_BTREE:
1744 target_log_flags |= XFS_ILOG_DBROOT; 1910 (*target_log_flags) |= XFS_ILOG_DBROOT;
1745 ASSERT(tip->i_d.di_version < 3 || 1911 ASSERT(tip->i_d.di_version < 3 ||
1746 (target_log_flags & XFS_ILOG_DOWNER)); 1912 (*target_log_flags & XFS_ILOG_DOWNER));
1747 break; 1913 break;
1748 } 1914 }
1749 1915
1916 return 0;
1917}
1918
1919int
1920xfs_swap_extents(
1921 struct xfs_inode *ip, /* target inode */
1922 struct xfs_inode *tip, /* tmp inode */
1923 struct xfs_swapext *sxp)
1924{
1925 struct xfs_mount *mp = ip->i_mount;
1926 struct xfs_trans *tp;
1927 struct xfs_bstat *sbp = &sxp->sx_stat;
1928 int src_log_flags, target_log_flags;
1929 int error = 0;
1930 int lock_flags;
1931 struct xfs_ifork *cowfp;
1932 __uint64_t f;
1933 int resblks;
1934
1935 /*
1936 * Lock the inodes against other IO, page faults and truncate to
 1937 * begin with. Then we can safely ensure the inodes are flushed and
 1938 * have no page cache. Once we have done this we can take the ilocks and
1939 * do the rest of the checks.
1940 */
1941 lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
1942 xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
1943 xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);
1944
1945 /* Verify that both files have the same format */
1946 if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
1947 error = -EINVAL;
1948 goto out_unlock;
1949 }
1950
1951 /* Verify both files are either real-time or non-realtime */
1952 if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
1953 error = -EINVAL;
1954 goto out_unlock;
1955 }
1956
1957 error = xfs_swap_extent_flush(ip);
1958 if (error)
1959 goto out_unlock;
1960 error = xfs_swap_extent_flush(tip);
1961 if (error)
1962 goto out_unlock;
1963
1964 /*
1965 * Extent "swapping" with rmap requires a permanent reservation and
1966 * a block reservation because it's really just a remap operation
1967 * performed with log redo items!
1968 */
1969 if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
1970 /*
1971 * Conceptually this shouldn't affect the shape of either
1972 * bmbt, but since we atomically move extents one by one,
1973 * we reserve enough space to rebuild both trees.
1974 */
1975 resblks = XFS_SWAP_RMAP_SPACE_RES(mp,
1976 XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK),
1977 XFS_DATA_FORK) +
1978 XFS_SWAP_RMAP_SPACE_RES(mp,
1979 XFS_IFORK_NEXTENTS(tip, XFS_DATA_FORK),
1980 XFS_DATA_FORK);
1981 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks,
1982 0, 0, &tp);
1983 } else
1984 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0,
1985 0, 0, &tp);
1986 if (error)
1987 goto out_unlock;
1988
1989 /*
 1990 * Lock and join the inodes to the transaction so that transaction commit
1991 * or cancel will unlock the inodes from this point onwards.
1992 */
1993 xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
1994 lock_flags |= XFS_ILOCK_EXCL;
1995 xfs_trans_ijoin(tp, ip, 0);
1996 xfs_trans_ijoin(tp, tip, 0);
1997
1998
1999 /* Verify all data are being swapped */
2000 if (sxp->sx_offset != 0 ||
2001 sxp->sx_length != ip->i_d.di_size ||
2002 sxp->sx_length != tip->i_d.di_size) {
2003 error = -EFAULT;
2004 goto out_trans_cancel;
2005 }
2006
2007 trace_xfs_swap_extent_before(ip, 0);
2008 trace_xfs_swap_extent_before(tip, 1);
2009
2010 /* check inode formats now that data is flushed */
2011 error = xfs_swap_extents_check_format(ip, tip);
2012 if (error) {
2013 xfs_notice(mp,
2014 "%s: inode 0x%llx format is incompatible for exchanging.",
2015 __func__, ip->i_ino);
2016 goto out_trans_cancel;
2017 }
2018
2019 /*
2020 * Compare the current change & modify times with that
2021 * passed in. If they differ, we abort this swap.
2022 * This is the mechanism used to ensure the calling
2023 * process that the file was not changed out from
2024 * under it.
2025 */
2026 if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
2027 (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
2028 (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
2029 (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
2030 error = -EBUSY;
2031 goto out_trans_cancel;
2032 }
2033
2034 /*
2035 * Note the trickiness in setting the log flags - we set the owner log
2036 * flag on the opposite inode (i.e. the inode we are setting the new
2037 * owner to be) because once we swap the forks and log that, log
2038 * recovery is going to see the fork as owned by the swapped inode,
2039 * not the pre-swapped inodes.
2040 */
2041 src_log_flags = XFS_ILOG_CORE;
2042 target_log_flags = XFS_ILOG_CORE;
2043
2044 if (xfs_sb_version_hasrmapbt(&mp->m_sb))
2045 error = xfs_swap_extent_rmap(&tp, ip, tip);
2046 else
2047 error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
2048 &target_log_flags);
2049 if (error)
2050 goto out_trans_cancel;
2051
2052 /* Do we have to swap reflink flags? */
2053 if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
2054 (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
2055 f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
2056 ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
2057 ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
2058 tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
2059 tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
2060 cowfp = ip->i_cowfp;
2061 ip->i_cowfp = tip->i_cowfp;
2062 tip->i_cowfp = cowfp;
2063 xfs_inode_set_cowblocks_tag(ip);
2064 xfs_inode_set_cowblocks_tag(tip);
2065 }
2066
1750 xfs_trans_log_inode(tp, ip, src_log_flags); 2067 xfs_trans_log_inode(tp, ip, src_log_flags);
1751 xfs_trans_log_inode(tp, tip, target_log_flags); 2068 xfs_trans_log_inode(tp, tip, target_log_flags);
1752 2069
@@ -1761,16 +2078,16 @@ xfs_swap_extents(
1761 2078
1762 trace_xfs_swap_extent_after(ip, 0); 2079 trace_xfs_swap_extent_after(ip, 0);
1763 trace_xfs_swap_extent_after(tip, 1); 2080 trace_xfs_swap_extent_after(tip, 1);
1764out:
1765 kmem_free(tempifp);
1766 return error;
1767 2081
1768out_unlock:
1769 xfs_iunlock(ip, lock_flags); 2082 xfs_iunlock(ip, lock_flags);
1770 xfs_iunlock(tip, lock_flags); 2083 xfs_iunlock(tip, lock_flags);
1771 goto out; 2084 return error;
1772 2085
1773out_trans_cancel: 2086out_trans_cancel:
1774 xfs_trans_cancel(tp); 2087 xfs_trans_cancel(tp);
1775 goto out; 2088
2089out_unlock:
2090 xfs_iunlock(ip, lock_flags);
2091 xfs_iunlock(tip, lock_flags);
2092 return error;
1776} 2093}
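
When both inodes carry different reflink state, the code above swaps just the `XFS_DIFLAG2_REFLINK` bit between the two flag words while leaving all other bits in place. A minimal sketch of that single-bit swap, with an illustrative flag value rather than the real one:

```c
#define DEMO_REFLINK	0x2ULL	/* illustrative flag bit */

/* Exchange only the DEMO_REFLINK bit between two flag words. */
static void swap_reflink_flag(unsigned long long *a, unsigned long long *b)
{
	unsigned long long f = *a & DEMO_REFLINK;	/* save a's bit */

	*a &= ~DEMO_REFLINK;
	*a |= *b & DEMO_REFLINK;	/* a takes b's bit */
	*b &= ~DEMO_REFLINK;
	*b |= f;			/* b takes a's saved bit */
}
```

All other flag bits in both words are untouched, which is why the kernel code masks with the flag on every step instead of swapping the whole word.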
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index f44f79996978..29816981b50a 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -84,7 +84,8 @@ xfs_dir2_sf_getdents(
84 84
85 sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data; 85 sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
86 86
87 ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count)); 87 if (dp->i_d.di_size < xfs_dir2_sf_hdr_size(sfp->i8count))
88 return -EFSCORRUPTED;
88 89
89 /* 90 /*
90 * If the block number in the offset is out of range, we're done. 91 * If the block number in the offset is out of range, we're done.
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 3d224702fbc0..05f8666733a0 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -92,7 +92,11 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
92#define XFS_ERRTAG_BMAPIFORMAT 21 92#define XFS_ERRTAG_BMAPIFORMAT 21
93#define XFS_ERRTAG_FREE_EXTENT 22 93#define XFS_ERRTAG_FREE_EXTENT 22
94#define XFS_ERRTAG_RMAP_FINISH_ONE 23 94#define XFS_ERRTAG_RMAP_FINISH_ONE 23
95#define XFS_ERRTAG_MAX 24 95#define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE 24
96#define XFS_ERRTAG_REFCOUNT_FINISH_ONE 25
97#define XFS_ERRTAG_BMAP_FINISH_ONE 26
98#define XFS_ERRTAG_AG_RESV_CRITICAL 27
99#define XFS_ERRTAG_MAX 28
96 100
97/* 101/*
98 * Random factors for above tags, 1 means always, 2 means 1/2 time, etc. 102 * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -121,6 +125,10 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
121#define XFS_RANDOM_BMAPIFORMAT XFS_RANDOM_DEFAULT 125#define XFS_RANDOM_BMAPIFORMAT XFS_RANDOM_DEFAULT
122#define XFS_RANDOM_FREE_EXTENT 1 126#define XFS_RANDOM_FREE_EXTENT 1
123#define XFS_RANDOM_RMAP_FINISH_ONE 1 127#define XFS_RANDOM_RMAP_FINISH_ONE 1
128#define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE 1
129#define XFS_RANDOM_REFCOUNT_FINISH_ONE 1
130#define XFS_RANDOM_BMAP_FINISH_ONE 1
131#define XFS_RANDOM_AG_RESV_CRITICAL 4
124 132
125#ifdef DEBUG 133#ifdef DEBUG
126extern int xfs_error_test_active; 134extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2bc58b3fd37d..a314fc7b56fa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -38,6 +38,7 @@
38#include "xfs_icache.h" 38#include "xfs_icache.h"
39#include "xfs_pnfs.h" 39#include "xfs_pnfs.h"
40#include "xfs_iomap.h" 40#include "xfs_iomap.h"
41#include "xfs_reflink.h"
41 42
42#include <linux/dcache.h> 43#include <linux/dcache.h>
43#include <linux/falloc.h> 44#include <linux/falloc.h>
@@ -634,6 +635,13 @@ xfs_file_dio_aio_write(
634 635
635 trace_xfs_file_direct_write(ip, count, iocb->ki_pos); 636 trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
636 637
638 /* If this is a block-aligned directio CoW, remap immediately. */
639 if (xfs_is_reflink_inode(ip) && !unaligned_io) {
640 ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
641 if (ret)
642 goto out;
643 }
644
637 data = *from; 645 data = *from;
638 ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data, 646 ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
639 xfs_get_blocks_direct, xfs_end_io_direct_write, 647 xfs_get_blocks_direct, xfs_end_io_direct_write,
@@ -735,6 +743,9 @@ write_retry:
735 enospc = xfs_inode_free_quota_eofblocks(ip); 743 enospc = xfs_inode_free_quota_eofblocks(ip);
736 if (enospc) 744 if (enospc)
737 goto write_retry; 745 goto write_retry;
746 enospc = xfs_inode_free_quota_cowblocks(ip);
747 if (enospc)
748 goto write_retry;
738 } else if (ret == -ENOSPC && !enospc) { 749 } else if (ret == -ENOSPC && !enospc) {
739 struct xfs_eofblocks eofb = {0}; 750 struct xfs_eofblocks eofb = {0};
740 751
@@ -774,10 +785,20 @@ xfs_file_write_iter(
774 785
775 if (IS_DAX(inode)) 786 if (IS_DAX(inode))
776 ret = xfs_file_dax_write(iocb, from); 787 ret = xfs_file_dax_write(iocb, from);
777 else if (iocb->ki_flags & IOCB_DIRECT) 788 else if (iocb->ki_flags & IOCB_DIRECT) {
789 /*
790 * Allow a directio write to fall back to a buffered
791 * write *only* in the case that we're doing a reflink
792 * CoW. In all other directio scenarios we do not
793 * allow an operation to fall back to buffered mode.
794 */
778 ret = xfs_file_dio_aio_write(iocb, from); 795 ret = xfs_file_dio_aio_write(iocb, from);
779 else 796 if (ret == -EREMCHG)
797 goto buffered;
798 } else {
799buffered:
780 ret = xfs_file_buffered_aio_write(iocb, from); 800 ret = xfs_file_buffered_aio_write(iocb, from);
801 }
781 802
782 if (ret > 0) { 803 if (ret > 0) {
783 XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret); 804 XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
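The hunk above is the one place XFS now allows a direct I/O write to fall back to buffered: only when the DIO path reports `-EREMCHG` for a reflink CoW it cannot service directly. A toy mirror of that dispatch shape, with both write paths replaced by illustrative stand-ins (the helper names here are not kernel functions):

```c
#include <errno.h>

/* Illustrative stand-ins for the two real write paths. */
static int fake_dio_write(int simulate_remchg)
{
	return simulate_remchg ? -EREMCHG : 100;	/* bytes written */
}

static int fake_buffered_write(void)
{
	return 100;					/* bytes written */
}

/*
 * Mirror of the dispatch in xfs_file_write_iter(): only a CoW-forced
 * -EREMCHG falls through to the buffered path; any other DIO result,
 * success or failure, is returned as-is.
 */
static int write_dispatch(int direct, int simulate_remchg)
{
	int ret;

	if (direct) {
		ret = fake_dio_write(simulate_remchg);
		if (ret != -EREMCHG)
			return ret;
		/* fall back to buffered only for the reflink CoW case */
	}
	return fake_buffered_write();
}
```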
@@ -791,7 +812,7 @@ xfs_file_write_iter(
791#define XFS_FALLOC_FL_SUPPORTED \ 812#define XFS_FALLOC_FL_SUPPORTED \
792 (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \ 813 (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
793 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \ 814 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \
794 FALLOC_FL_INSERT_RANGE) 815 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
795 816
796STATIC long 817STATIC long
797xfs_file_fallocate( 818xfs_file_fallocate(
@@ -881,9 +902,15 @@ xfs_file_fallocate(
881 902
882 if (mode & FALLOC_FL_ZERO_RANGE) 903 if (mode & FALLOC_FL_ZERO_RANGE)
883 error = xfs_zero_file_space(ip, offset, len); 904 error = xfs_zero_file_space(ip, offset, len);
884 else 905 else {
906 if (mode & FALLOC_FL_UNSHARE_RANGE) {
907 error = xfs_reflink_unshare(ip, offset, len);
908 if (error)
909 goto out_unlock;
910 }
885 error = xfs_alloc_file_space(ip, offset, len, 911 error = xfs_alloc_file_space(ip, offset, len,
886 XFS_BMAPI_PREALLOC); 912 XFS_BMAPI_PREALLOC);
913 }
887 if (error) 914 if (error)
888 goto out_unlock; 915 goto out_unlock;
889 } 916 }
@@ -920,6 +947,189 @@ out_unlock:
920 return error; 947 return error;
921} 948}
922 949
 950/*
 951 * Wait for in-flight directio and flush all dirty pagecache in the
 952 * given range out to disk.
 953 */
953static int
954xfs_file_wait_for_io(
955 struct inode *inode,
956 loff_t offset,
957 size_t len)
958{
959 loff_t rounding;
960 loff_t ioffset;
961 loff_t iendoffset;
962 loff_t bs;
963 int ret;
964
965 bs = inode->i_sb->s_blocksize;
966 inode_dio_wait(inode);
967
968 rounding = max_t(xfs_off_t, bs, PAGE_SIZE);
969 ioffset = round_down(offset, rounding);
970 iendoffset = round_up(offset + len, rounding) - 1;
971 ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
972 iendoffset);
973 return ret;
974}
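Before waiting, `xfs_file_wait_for_io()` widens the requested range out to the larger of the filesystem block size and the page size, so that partially-covered pages and blocks are flushed too. A pure-arithmetic mirror of that rounding (the `round_down`/`round_up` math matches the kernel macros; the page-size constant is fixed here for illustration):

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4 KiB pages */

/* Start of the widened range: offset rounded down to the granule. */
static uint64_t flush_start(uint64_t offset, uint64_t bs)
{
	uint64_t rounding = bs > SKETCH_PAGE_SIZE ? bs : SKETCH_PAGE_SIZE;

	return offset - (offset % rounding);		/* round_down */
}

/* Inclusive end of the widened range: end rounded up, minus one. */
static uint64_t flush_end(uint64_t offset, uint64_t len, uint64_t bs)
{
	uint64_t rounding = bs > SKETCH_PAGE_SIZE ? bs : SKETCH_PAGE_SIZE;
	uint64_t end = offset + len;

	return ((end + rounding - 1) / rounding) * rounding - 1; /* round_up - 1 */
}
```

For example, a flush of 100 bytes at offset 5000 on a 512-byte-block filesystem still covers the whole 4096–8191 page span, matching the inclusive end offset passed to `filemap_write_and_wait_range()`.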
975
976/* Hook up to the VFS reflink function */
977STATIC int
978xfs_file_share_range(
979 struct file *file_in,
980 loff_t pos_in,
981 struct file *file_out,
982 loff_t pos_out,
983 u64 len,
984 bool is_dedupe)
985{
986 struct inode *inode_in;
987 struct inode *inode_out;
988 ssize_t ret;
989 loff_t bs;
990 loff_t isize;
991 int same_inode;
992 loff_t blen;
993 unsigned int flags = 0;
994
995 inode_in = file_inode(file_in);
996 inode_out = file_inode(file_out);
997 bs = inode_out->i_sb->s_blocksize;
998
999 /* Don't touch certain kinds of inodes */
1000 if (IS_IMMUTABLE(inode_out))
1001 return -EPERM;
1002 if (IS_SWAPFILE(inode_in) ||
1003 IS_SWAPFILE(inode_out))
1004 return -ETXTBSY;
1005
1006 /* Reflink only works within this filesystem. */
1007 if (inode_in->i_sb != inode_out->i_sb)
1008 return -EXDEV;
1009 same_inode = (inode_in->i_ino == inode_out->i_ino);
1010
1011 /* Don't reflink dirs, pipes, sockets... */
1012 if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
1013 return -EISDIR;
1014 if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
1015 return -EINVAL;
1016 if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
1017 return -EINVAL;
1018
1019 /* Don't share DAX file data for now. */
1020 if (IS_DAX(inode_in) || IS_DAX(inode_out))
1021 return -EINVAL;
1022
1023 /* Are we going all the way to the end? */
1024 isize = i_size_read(inode_in);
1025 if (isize == 0)
1026 return 0;
1027 if (len == 0)
1028 len = isize - pos_in;
1029
1030 /* Ensure offsets don't wrap and the input is inside i_size */
1031 if (pos_in + len < pos_in || pos_out + len < pos_out ||
1032 pos_in + len > isize)
1033 return -EINVAL;
1034
1035 /* Don't allow dedupe past EOF in the dest file */
1036 if (is_dedupe) {
1037 loff_t disize;
1038
1039 disize = i_size_read(inode_out);
1040 if (pos_out >= disize || pos_out + len > disize)
1041 return -EINVAL;
1042 }
1043
1044 /* If we're linking to EOF, continue to the block boundary. */
1045 if (pos_in + len == isize)
1046 blen = ALIGN(isize, bs) - pos_in;
1047 else
1048 blen = len;
1049
1050 /* Only reflink if we're aligned to block boundaries */
1051 if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
1052 !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
1053 return -EINVAL;
1054
1055 /* Don't allow overlapped reflink within the same file */
1056 if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
1057 return -EINVAL;
1058
1059 /* Wait for the completion of any pending IOs on srcfile */
1060 ret = xfs_file_wait_for_io(inode_in, pos_in, len);
1061 if (ret)
1062 goto out;
1063 ret = xfs_file_wait_for_io(inode_out, pos_out, len);
1064 if (ret)
1065 goto out;
1066
1067 if (is_dedupe)
1068 flags |= XFS_REFLINK_DEDUPE;
1069 ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
1070 pos_out, len, flags);
1071 if (ret < 0)
1072 goto out;
1073
1074out:
1075 return ret;
1076}
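The two checks at the heart of `xfs_file_share_range()` — both ranges block-aligned, and no self-overlap when source and destination are the same inode — can be isolated as plain predicates. A sketch of that validation, with the same overlap condition as the code above (function names here are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_aligned(uint64_t x, uint64_t bs)
{
	return (x % bs) == 0;
}

/*
 * Mirror the validation in xfs_file_share_range(): reflink only works
 * on block-aligned ranges, and a same-inode remap must not overlap
 * itself.
 */
static bool share_range_ok(uint64_t pos_in, uint64_t pos_out, uint64_t blen,
			   uint64_t bs, bool same_inode)
{
	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return false;
	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
		return false;	/* overlapping ranges within one file */
	return true;
}
```

Note the alignment requirement is relaxed only at EOF: the earlier `blen = ALIGN(isize, bs) - pos_in` step extends a link that ends exactly at i_size out to the block boundary before these checks run.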
1077
1078STATIC ssize_t
1079xfs_file_copy_range(
1080 struct file *file_in,
1081 loff_t pos_in,
1082 struct file *file_out,
1083 loff_t pos_out,
1084 size_t len,
1085 unsigned int flags)
1086{
1087 int error;
1088
1089 error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
1090 len, false);
1091 if (error)
1092 return error;
1093 return len;
1094}
1095
1096STATIC int
1097xfs_file_clone_range(
1098 struct file *file_in,
1099 loff_t pos_in,
1100 struct file *file_out,
1101 loff_t pos_out,
1102 u64 len)
1103{
1104 return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
1105 len, false);
1106}
1107
1108#define XFS_MAX_DEDUPE_LEN (16 * 1024 * 1024)
1109STATIC ssize_t
1110xfs_file_dedupe_range(
1111 struct file *src_file,
1112 u64 loff,
1113 u64 len,
1114 struct file *dst_file,
1115 u64 dst_loff)
1116{
1117 int error;
1118
1119 /*
1120 * Limit the total length we will dedupe for each operation.
1121 * This is intended to bound the total time spent in this
1122 * ioctl to something sane.
1123 */
1124 if (len > XFS_MAX_DEDUPE_LEN)
1125 len = XFS_MAX_DEDUPE_LEN;
1126
1127 error = xfs_file_share_range(src_file, loff, dst_file, dst_loff,
1128 len, true);
1129 if (error)
1130 return error;
1131 return len;
1132}
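`xfs_file_dedupe_range()` silently caps each request at `XFS_MAX_DEDUPE_LEN` (16 MiB) and reports the clamped length as the bytes deduped, bounding the time a single ioctl can spend. The clamp itself, extracted as a sketch (constant name prefixed to mark it as a stand-in):

```c
#include <stdint.h>

#define SKETCH_MAX_DEDUPE_LEN (16ULL * 1024 * 1024)

/*
 * Cap a single dedupe request, as xfs_file_dedupe_range() does, so one
 * ioctl cannot spend unbounded time comparing and remapping extents.
 * Callers see the clamped length and must loop to cover larger ranges.
 */
static uint64_t clamp_dedupe_len(uint64_t len)
{
	return len > SKETCH_MAX_DEDUPE_LEN ? SKETCH_MAX_DEDUPE_LEN : len;
}
```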
923 1133
924STATIC int 1134STATIC int
925xfs_file_open( 1135xfs_file_open(
@@ -1581,6 +1791,9 @@ const struct file_operations xfs_file_operations = {
1581 .fsync = xfs_file_fsync, 1791 .fsync = xfs_file_fsync,
1582 .get_unmapped_area = thp_get_unmapped_area, 1792 .get_unmapped_area = thp_get_unmapped_area,
1583 .fallocate = xfs_file_fallocate, 1793 .fallocate = xfs_file_fallocate,
1794 .copy_file_range = xfs_file_copy_range,
1795 .clone_file_range = xfs_file_clone_range,
1796 .dedupe_file_range = xfs_file_dedupe_range,
1584}; 1797};
1585 1798
1586const struct file_operations xfs_dir_file_operations = { 1799const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 94ac06f3d908..93d12fa2670d 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -43,6 +43,7 @@
43#include "xfs_log.h" 43#include "xfs_log.h"
44#include "xfs_filestream.h" 44#include "xfs_filestream.h"
45#include "xfs_rmap.h" 45#include "xfs_rmap.h"
46#include "xfs_ag_resv.h"
46 47
47/* 48/*
48 * File system operations 49 * File system operations
@@ -108,7 +109,9 @@ xfs_fs_geometry(
108 (xfs_sb_version_hassparseinodes(&mp->m_sb) ? 109 (xfs_sb_version_hassparseinodes(&mp->m_sb) ?
109 XFS_FSOP_GEOM_FLAGS_SPINODES : 0) | 110 XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
110 (xfs_sb_version_hasrmapbt(&mp->m_sb) ? 111 (xfs_sb_version_hasrmapbt(&mp->m_sb) ?
111 XFS_FSOP_GEOM_FLAGS_RMAPBT : 0); 112 XFS_FSOP_GEOM_FLAGS_RMAPBT : 0) |
113 (xfs_sb_version_hasreflink(&mp->m_sb) ?
114 XFS_FSOP_GEOM_FLAGS_REFLINK : 0);
112 geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ? 115 geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
113 mp->m_sb.sb_logsectsize : BBSIZE; 116 mp->m_sb.sb_logsectsize : BBSIZE;
114 geo->rtsectsize = mp->m_sb.sb_blocksize; 117 geo->rtsectsize = mp->m_sb.sb_blocksize;
@@ -259,6 +262,12 @@ xfs_growfs_data_private(
259 agf->agf_longest = cpu_to_be32(tmpsize); 262 agf->agf_longest = cpu_to_be32(tmpsize);
260 if (xfs_sb_version_hascrc(&mp->m_sb)) 263 if (xfs_sb_version_hascrc(&mp->m_sb))
261 uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid); 264 uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
265 if (xfs_sb_version_hasreflink(&mp->m_sb)) {
266 agf->agf_refcount_root = cpu_to_be32(
267 xfs_refc_block(mp));
268 agf->agf_refcount_level = cpu_to_be32(1);
269 agf->agf_refcount_blocks = cpu_to_be32(1);
270 }
262 271
263 error = xfs_bwrite(bp); 272 error = xfs_bwrite(bp);
264 xfs_buf_relse(bp); 273 xfs_buf_relse(bp);
@@ -450,6 +459,17 @@ xfs_growfs_data_private(
450 rrec->rm_offset = 0; 459 rrec->rm_offset = 0;
451 be16_add_cpu(&block->bb_numrecs, 1); 460 be16_add_cpu(&block->bb_numrecs, 1);
452 461
462 /* account for refc btree root */
463 if (xfs_sb_version_hasreflink(&mp->m_sb)) {
464 rrec = XFS_RMAP_REC_ADDR(block, 5);
465 rrec->rm_startblock = cpu_to_be32(
466 xfs_refc_block(mp));
467 rrec->rm_blockcount = cpu_to_be32(1);
468 rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
469 rrec->rm_offset = 0;
470 be16_add_cpu(&block->bb_numrecs, 1);
471 }
472
453 error = xfs_bwrite(bp); 473 error = xfs_bwrite(bp);
454 xfs_buf_relse(bp); 474 xfs_buf_relse(bp);
455 if (error) 475 if (error)
@@ -507,6 +527,28 @@ xfs_growfs_data_private(
507 goto error0; 527 goto error0;
508 } 528 }
509 529
530 /*
531 * refcount btree root block
532 */
533 if (xfs_sb_version_hasreflink(&mp->m_sb)) {
534 bp = xfs_growfs_get_hdr_buf(mp,
535 XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
536 BTOBB(mp->m_sb.sb_blocksize), 0,
537 &xfs_refcountbt_buf_ops);
538 if (!bp) {
539 error = -ENOMEM;
540 goto error0;
541 }
542
543 xfs_btree_init_block(mp, bp, XFS_REFC_CRC_MAGIC,
544 0, 0, agno,
545 XFS_BTREE_CRC_BLOCKS);
546
547 error = xfs_bwrite(bp);
548 xfs_buf_relse(bp);
549 if (error)
550 goto error0;
551 }
510 } 552 }
511 xfs_trans_agblocks_delta(tp, nfree); 553 xfs_trans_agblocks_delta(tp, nfree);
512 /* 554 /*
@@ -589,6 +631,11 @@ xfs_growfs_data_private(
589 xfs_set_low_space_thresholds(mp); 631 xfs_set_low_space_thresholds(mp);
590 mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); 632 mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
591 633
634 /* Reserve AG metadata blocks. */
635 error = xfs_fs_reserve_ag_blocks(mp);
636 if (error && error != -ENOSPC)
637 goto out;
638
592 /* update secondary superblocks. */ 639 /* update secondary superblocks. */
593 for (agno = 1; agno < nagcount; agno++) { 640 for (agno = 1; agno < nagcount; agno++) {
594 error = 0; 641 error = 0;
@@ -639,6 +686,8 @@ xfs_growfs_data_private(
639 continue; 686 continue;
640 } 687 }
641 } 688 }
689
690 out:
642 return saved_error ? saved_error : error; 691 return saved_error ? saved_error : error;
643 692
644 error0: 693 error0:
@@ -948,3 +997,59 @@ xfs_do_force_shutdown(
948 "Please umount the filesystem and rectify the problem(s)"); 997 "Please umount the filesystem and rectify the problem(s)");
949 } 998 }
950} 999}
1000
1001/*
1002 * Reserve free space for per-AG metadata.
1003 */
1004int
1005xfs_fs_reserve_ag_blocks(
1006 struct xfs_mount *mp)
1007{
1008 xfs_agnumber_t agno;
1009 struct xfs_perag *pag;
1010 int error = 0;
1011 int err2;
1012
1013 for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
1014 pag = xfs_perag_get(mp, agno);
1015 err2 = xfs_ag_resv_init(pag);
1016 xfs_perag_put(pag);
1017 if (err2 && !error)
1018 error = err2;
1019 }
1020
1021 if (error && error != -ENOSPC) {
1022 xfs_warn(mp,
1023 "Error %d reserving per-AG metadata reserve pool.", error);
1024 xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
1025 }
1026
1027 return error;
1028}
1029
1030/*
1031 * Free space reserved for per-AG metadata.
1032 */
1033int
1034xfs_fs_unreserve_ag_blocks(
1035 struct xfs_mount *mp)
1036{
1037 xfs_agnumber_t agno;
1038 struct xfs_perag *pag;
1039 int error = 0;
1040 int err2;
1041
1042 for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
1043 pag = xfs_perag_get(mp, agno);
1044 err2 = xfs_ag_resv_free(pag);
1045 xfs_perag_put(pag);
1046 if (err2 && !error)
1047 error = err2;
1048 }
1049
1050 if (error)
1051 xfs_warn(mp,
1052 "Error %d freeing per-AG metadata reserve pool.", error);
1053
1054 return error;
1055}
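Both reservation loops above use the same error-aggregation idiom: keep an `err2` for each allocation group, record only the first failure in `error`, and keep iterating so every AG gets its attempt. A sketch of that pattern over an array of hypothetical per-AG results standing in for `xfs_ag_resv_init()` return values:

```c
/*
 * First-error-wins aggregation, as in xfs_fs_reserve_ag_blocks(): visit
 * every AG even after a failure, but report only the first error seen.
 */
static int reserve_all(const int *ag_results, int nagcount)
{
	int error = 0;
	int err2;

	for (int agno = 0; agno < nagcount; agno++) {
		err2 = ag_results[agno];	/* per-AG reservation result */
		if (err2 && !error)
			error = err2;		/* remember the first error */
	}
	return error;
}
```

This is why the caller in `xfs_growfs_data_private()` treats `-ENOSPC` specially: a partially-filled reserve pool is tolerable at grow time, while any other error shuts the filesystem down.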
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index f32713f14f9a..f34915898fea 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
26 xfs_fsop_resblks_t *outval); 26 xfs_fsop_resblks_t *outval);
27extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags); 27extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
28 28
29extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
30extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
31
29#endif /* __XFS_FSOPS_H__ */ 32#endif /* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 4d41b241298f..687a4b01fc53 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -21,8 +21,8 @@
21/* 21/*
22 * Tunable XFS parameters. xfs_params is required even when CONFIG_SYSCTL=n, 22 * Tunable XFS parameters. xfs_params is required even when CONFIG_SYSCTL=n,
23 * other XFS code uses these values. Times are measured in centisecs (i.e. 23 * other XFS code uses these values. Times are measured in centisecs (i.e.
24 * 100ths of a second) with the exception of eofb_timer, which is measured in 24 * 100ths of a second) with the exception of eofb_timer and cowb_timer, which
25 * seconds. 25 * are measured in seconds.
26 */ 26 */
27xfs_param_t xfs_params = { 27xfs_param_t xfs_params = {
28 /* MIN DFLT MAX */ 28 /* MIN DFLT MAX */
@@ -42,6 +42,7 @@ xfs_param_t xfs_params = {
42 .inherit_nodfrg = { 0, 1, 1 }, 42 .inherit_nodfrg = { 0, 1, 1 },
43 .fstrm_timer = { 1, 30*100, 3600*100}, 43 .fstrm_timer = { 1, 30*100, 3600*100},
44 .eofb_timer = { 1, 300, 3600*24}, 44 .eofb_timer = { 1, 300, 3600*24},
45 .cowb_timer = { 1, 1800, 3600*24},
45}; 46};
46 47
47struct xfs_globals xfs_globals = { 48struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 65b2e3f85f52..14796b744e0a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -33,6 +33,7 @@
33#include "xfs_bmap_util.h" 33#include "xfs_bmap_util.h"
34#include "xfs_dquot_item.h" 34#include "xfs_dquot_item.h"
35#include "xfs_dquot.h" 35#include "xfs_dquot.h"
36#include "xfs_reflink.h"
36 37
37#include <linux/kthread.h> 38#include <linux/kthread.h>
38#include <linux/freezer.h> 39#include <linux/freezer.h>
@@ -76,6 +77,9 @@ xfs_inode_alloc(
76 ip->i_mount = mp; 77 ip->i_mount = mp;
77 memset(&ip->i_imap, 0, sizeof(struct xfs_imap)); 78 memset(&ip->i_imap, 0, sizeof(struct xfs_imap));
78 ip->i_afp = NULL; 79 ip->i_afp = NULL;
80 ip->i_cowfp = NULL;
81 ip->i_cnextents = 0;
82 ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
79 memset(&ip->i_df, 0, sizeof(xfs_ifork_t)); 83 memset(&ip->i_df, 0, sizeof(xfs_ifork_t));
80 ip->i_flags = 0; 84 ip->i_flags = 0;
81 ip->i_delayed_blks = 0; 85 ip->i_delayed_blks = 0;
@@ -101,6 +105,8 @@ xfs_inode_free_callback(
101 105
102 if (ip->i_afp) 106 if (ip->i_afp)
103 xfs_idestroy_fork(ip, XFS_ATTR_FORK); 107 xfs_idestroy_fork(ip, XFS_ATTR_FORK);
108 if (ip->i_cowfp)
109 xfs_idestroy_fork(ip, XFS_COW_FORK);
104 110
105 if (ip->i_itemp) { 111 if (ip->i_itemp) {
106 ASSERT(!(ip->i_itemp->ili_item.li_flags & XFS_LI_IN_AIL)); 112 ASSERT(!(ip->i_itemp->ili_item.li_flags & XFS_LI_IN_AIL));
@@ -787,6 +793,33 @@ xfs_eofblocks_worker(
787 xfs_queue_eofblocks(mp); 793 xfs_queue_eofblocks(mp);
788} 794}
789 795
796/*
797 * Background scanning to trim preallocated CoW space. This is queued
798 * based on the 'speculative_cow_prealloc_lifetime' tunable (5m by default).
799 * (We'll just piggyback on the post-EOF prealloc space workqueue.)
800 */
801STATIC void
802xfs_queue_cowblocks(
803 struct xfs_mount *mp)
804{
805 rcu_read_lock();
806 if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_COWBLOCKS_TAG))
807 queue_delayed_work(mp->m_eofblocks_workqueue,
808 &mp->m_cowblocks_work,
809 msecs_to_jiffies(xfs_cowb_secs * 1000));
810 rcu_read_unlock();
811}
812
813void
814xfs_cowblocks_worker(
815 struct work_struct *work)
816{
817 struct xfs_mount *mp = container_of(to_delayed_work(work),
818 struct xfs_mount, m_cowblocks_work);
819 xfs_icache_free_cowblocks(mp, NULL);
820 xfs_queue_cowblocks(mp);
821}
822
790int 823int
791xfs_inode_ag_iterator( 824xfs_inode_ag_iterator(
792 struct xfs_mount *mp, 825 struct xfs_mount *mp,
@@ -1343,18 +1376,30 @@ xfs_inode_free_eofblocks(
1343 return ret; 1376 return ret;
1344} 1377}
1345 1378
1346int 1379static int
1347xfs_icache_free_eofblocks( 1380__xfs_icache_free_eofblocks(
1348 struct xfs_mount *mp, 1381 struct xfs_mount *mp,
1349 struct xfs_eofblocks *eofb) 1382 struct xfs_eofblocks *eofb,
1383 int (*execute)(struct xfs_inode *ip, int flags,
1384 void *args),
1385 int tag)
1350{ 1386{
1351 int flags = SYNC_TRYLOCK; 1387 int flags = SYNC_TRYLOCK;
1352 1388
1353 if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC)) 1389 if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC))
1354 flags = SYNC_WAIT; 1390 flags = SYNC_WAIT;
1355 1391
1356 return xfs_inode_ag_iterator_tag(mp, xfs_inode_free_eofblocks, flags, 1392 return xfs_inode_ag_iterator_tag(mp, execute, flags,
1357 eofb, XFS_ICI_EOFBLOCKS_TAG); 1393 eofb, tag);
1394}
1395
1396int
1397xfs_icache_free_eofblocks(
1398 struct xfs_mount *mp,
1399 struct xfs_eofblocks *eofb)
1400{
1401 return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_eofblocks,
1402 XFS_ICI_EOFBLOCKS_TAG);
1358} 1403}
1359 1404
1360/* 1405/*
@@ -1363,9 +1408,11 @@ xfs_icache_free_eofblocks(
1363 * failure. We make a best effort by including each quota under low free space 1408 * failure. We make a best effort by including each quota under low free space
1364 * conditions (less than 1% free space) in the scan. 1409 * conditions (less than 1% free space) in the scan.
1365 */ 1410 */
1366int 1411static int
1367xfs_inode_free_quota_eofblocks( 1412__xfs_inode_free_quota_eofblocks(
1368 struct xfs_inode *ip) 1413 struct xfs_inode *ip,
1414 int (*execute)(struct xfs_mount *mp,
1415 struct xfs_eofblocks *eofb))
1369{ 1416{
1370 int scan = 0; 1417 int scan = 0;
1371 struct xfs_eofblocks eofb = {0}; 1418 struct xfs_eofblocks eofb = {0};
@@ -1401,14 +1448,25 @@ xfs_inode_free_quota_eofblocks(
1401 } 1448 }
1402 1449
1403 if (scan) 1450 if (scan)
1404 xfs_icache_free_eofblocks(ip->i_mount, &eofb); 1451 execute(ip->i_mount, &eofb);
1405 1452
1406 return scan; 1453 return scan;
1407} 1454}
1408 1455
1409void 1456int
1410xfs_inode_set_eofblocks_tag( 1457xfs_inode_free_quota_eofblocks(
1411 xfs_inode_t *ip) 1458 struct xfs_inode *ip)
1459{
1460 return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_eofblocks);
1461}
1462
1463static void
1464__xfs_inode_set_eofblocks_tag(
1465 xfs_inode_t *ip,
1466 void (*execute)(struct xfs_mount *mp),
1467 void (*set_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
1468 int error, unsigned long caller_ip),
1469 int tag)
1412{ 1470{
1413 struct xfs_mount *mp = ip->i_mount; 1471 struct xfs_mount *mp = ip->i_mount;
1414 struct xfs_perag *pag; 1472 struct xfs_perag *pag;
@@ -1426,26 +1484,22 @@ xfs_inode_set_eofblocks_tag(
1426 1484
1427 pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); 1485 pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
1428 spin_lock(&pag->pag_ici_lock); 1486 spin_lock(&pag->pag_ici_lock);
1429 trace_xfs_inode_set_eofblocks_tag(ip);
1430 1487
1431 tagged = radix_tree_tagged(&pag->pag_ici_root, 1488 tagged = radix_tree_tagged(&pag->pag_ici_root, tag);
1432 XFS_ICI_EOFBLOCKS_TAG);
1433 radix_tree_tag_set(&pag->pag_ici_root, 1489 radix_tree_tag_set(&pag->pag_ici_root,
1434 XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), 1490 XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
1435 XFS_ICI_EOFBLOCKS_TAG);
1436 if (!tagged) { 1491 if (!tagged) {
1437 /* propagate the eofblocks tag up into the perag radix tree */ 1492 /* propagate the eofblocks tag up into the perag radix tree */
1438 spin_lock(&ip->i_mount->m_perag_lock); 1493 spin_lock(&ip->i_mount->m_perag_lock);
1439 radix_tree_tag_set(&ip->i_mount->m_perag_tree, 1494 radix_tree_tag_set(&ip->i_mount->m_perag_tree,
1440 XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino), 1495 XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
1441 XFS_ICI_EOFBLOCKS_TAG); 1496 tag);
1442 spin_unlock(&ip->i_mount->m_perag_lock); 1497 spin_unlock(&ip->i_mount->m_perag_lock);
1443 1498
1444 /* kick off background trimming */ 1499 /* kick off background trimming */
1445 xfs_queue_eofblocks(ip->i_mount); 1500 execute(ip->i_mount);
1446 1501
1447 trace_xfs_perag_set_eofblocks(ip->i_mount, pag->pag_agno, 1502 set_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
1448 -1, _RET_IP_);
1449 } 1503 }
1450 1504
1451 spin_unlock(&pag->pag_ici_lock); 1505 spin_unlock(&pag->pag_ici_lock);
@@ -1453,9 +1507,22 @@ xfs_inode_set_eofblocks_tag(
1453} 1507}
1454 1508
1455void 1509void
1456xfs_inode_clear_eofblocks_tag( 1510xfs_inode_set_eofblocks_tag(
1457 xfs_inode_t *ip) 1511 xfs_inode_t *ip)
1458{ 1512{
1513 trace_xfs_inode_set_eofblocks_tag(ip);
1514 return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_eofblocks,
1515 trace_xfs_perag_set_eofblocks,
1516 XFS_ICI_EOFBLOCKS_TAG);
1517}
1518
1519static void
1520__xfs_inode_clear_eofblocks_tag(
1521 xfs_inode_t *ip,
1522 void (*clear_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
1523 int error, unsigned long caller_ip),
1524 int tag)
1525{
1459 struct xfs_mount *mp = ip->i_mount; 1526 struct xfs_mount *mp = ip->i_mount;
1460 struct xfs_perag *pag; 1527 struct xfs_perag *pag;
1461 1528
@@ -1465,23 +1532,141 @@ xfs_inode_clear_eofblocks_tag(
1465 1532
1466 pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); 1533 pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
1467 spin_lock(&pag->pag_ici_lock); 1534 spin_lock(&pag->pag_ici_lock);
1468 trace_xfs_inode_clear_eofblocks_tag(ip);
1469 1535
1470 radix_tree_tag_clear(&pag->pag_ici_root, 1536 radix_tree_tag_clear(&pag->pag_ici_root,
1471 XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), 1537 XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
1472 XFS_ICI_EOFBLOCKS_TAG); 1538 if (!radix_tree_tagged(&pag->pag_ici_root, tag)) {
1473 if (!radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_EOFBLOCKS_TAG)) {
1474 /* clear the eofblocks tag from the perag radix tree */ 1539 /* clear the eofblocks tag from the perag radix tree */
1475 spin_lock(&ip->i_mount->m_perag_lock); 1540 spin_lock(&ip->i_mount->m_perag_lock);
1476 radix_tree_tag_clear(&ip->i_mount->m_perag_tree, 1541 radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
1477 XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino), 1542 XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
1478 XFS_ICI_EOFBLOCKS_TAG); 1543 tag);
1479 spin_unlock(&ip->i_mount->m_perag_lock); 1544 spin_unlock(&ip->i_mount->m_perag_lock);
1480 trace_xfs_perag_clear_eofblocks(ip->i_mount, pag->pag_agno, 1545 clear_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
1481 -1, _RET_IP_);
1482 } 1546 }
1483 1547
1484 spin_unlock(&pag->pag_ici_lock); 1548 spin_unlock(&pag->pag_ici_lock);
1485 xfs_perag_put(pag); 1549 xfs_perag_put(pag);
1486} 1550}
1487 1551
1552void
1553xfs_inode_clear_eofblocks_tag(
1554 xfs_inode_t *ip)
1555{
1556 trace_xfs_inode_clear_eofblocks_tag(ip);
1557 return __xfs_inode_clear_eofblocks_tag(ip,
1558 trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
1559}
1560
1561/*
1562 * Automatic CoW Reservation Freeing
1563 *
1564 * These functions automatically garbage collect leftover CoW reservations
1565 * that were made on behalf of a cowextsize hint when we start to run out
1566 * of quota or when the reservations sit around for too long. If the file
1567 * has dirty pages or is undergoing writeback, its CoW reservations will
1568 * be retained.
1569 *
1570 * The actual garbage collection piggybacks off the same code that runs
1571 * the speculative EOF preallocation garbage collector.
1572 */
1573STATIC int
1574xfs_inode_free_cowblocks(
1575 struct xfs_inode *ip,
1576 int flags,
1577 void *args)
1578{
1579 int ret;
1580 struct xfs_eofblocks *eofb = args;
1581 bool need_iolock = true;
1582 int match;
1583
1584 ASSERT(!eofb || (eofb && eofb->eof_scan_owner != 0));
1585
1586 if (!xfs_reflink_has_real_cow_blocks(ip)) {
1587 trace_xfs_inode_free_cowblocks_invalid(ip);
1588 xfs_inode_clear_cowblocks_tag(ip);
1589 return 0;
1590 }
1591
1592 /*
1593 * If the mapping is dirty or under writeback we cannot touch the
1594 * CoW fork. Leave it alone if we're in the midst of a directio.
1595 */
1596 if (mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
1597 mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
1598 atomic_read(&VFS_I(ip)->i_dio_count))
1599 return 0;
1600
1601 if (eofb) {
1602 if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
1603 match = xfs_inode_match_id_union(ip, eofb);
1604 else
1605 match = xfs_inode_match_id(ip, eofb);
1606 if (!match)
1607 return 0;
1608
1609 /* skip the inode if the file size is too small */
1610 if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
1611 XFS_ISIZE(ip) < eofb->eof_min_file_size)
1612 return 0;
1613
1614 /*
1615 * A scan owner implies we already hold the iolock. Skip it in
1616 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
1617 * the possibility of EAGAIN being returned.
1618 */
1619 if (eofb->eof_scan_owner == ip->i_ino)
1620 need_iolock = false;
1621 }
1622
1623 /* Free the CoW blocks */
1624 if (need_iolock) {
1625 xfs_ilock(ip, XFS_IOLOCK_EXCL);
1626 xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
1627 }
1628
1629 ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
1630
1631 if (need_iolock) {
1632 xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
1633 xfs_iunlock(ip, XFS_IOLOCK_EXCL);
1634 }
1635
1636 return ret;
1637}
1638
1639int
1640xfs_icache_free_cowblocks(
1641 struct xfs_mount *mp,
1642 struct xfs_eofblocks *eofb)
1643{
1644 return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_cowblocks,
1645 XFS_ICI_COWBLOCKS_TAG);
1646}
1647
1648int
1649xfs_inode_free_quota_cowblocks(
1650 struct xfs_inode *ip)
1651{
1652 return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_cowblocks);
1653}
1654
1655void
1656xfs_inode_set_cowblocks_tag(
1657 xfs_inode_t *ip)
1658{
1659 trace_xfs_inode_set_eofblocks_tag(ip);
1660 return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_cowblocks,
1661 trace_xfs_perag_set_eofblocks,
1662 XFS_ICI_COWBLOCKS_TAG);
1663}
1664
1665void
1666xfs_inode_clear_cowblocks_tag(
1667 xfs_inode_t *ip)
1668{
1669 trace_xfs_inode_clear_eofblocks_tag(ip);
1670 return __xfs_inode_clear_eofblocks_tag(ip,
1671 trace_xfs_perag_clear_eofblocks, XFS_ICI_COWBLOCKS_TAG);
1672}
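The icache refactor above hoists the EOF-blocks radix-tree tagging into generic `__xfs_inode_set_eofblocks_tag()`/`__xfs_inode_clear_eofblocks_tag()` helpers parameterized by tag, worker, and trace point, so COWBLOCKS reuses the machinery with only the tag swapped. The key state transitions — "first set" kicks the background worker, "last clear" drops the per-AG tag — can be modeled with a toy bitmask in place of the radix tree (everything here is a stand-in, not kernel API):

```c
#include <stdbool.h>

enum { TAG_RECLAIM = 0, TAG_EOFBLOCKS = 1, TAG_COWBLOCKS = 2 };

/* Set a tag bit; one generic helper serves EOFBLOCKS and COWBLOCKS alike. */
static unsigned int tag_set(unsigned int tags, int tag)
{
	return tags | (1u << tag);
}

/*
 * True when the tag is not yet set — the moment the real code propagates
 * the tag into the per-AG tree and queues the background trim worker.
 */
static bool tag_is_first_set(unsigned int tags, int tag)
{
	return ((tags >> tag) & 1) == 0;
}
```

Modeling it this way makes the refactor's payoff visible: the set/clear logic is tag-agnostic, and only the callbacks (which worker to queue, which trace point to fire) differ between the EOF and CoW flavors.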
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 05bac99bef75..a1e02f4708ab 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -40,6 +40,7 @@ struct xfs_eofblocks {
40 in xfs_inode_ag_iterator */ 40 in xfs_inode_ag_iterator */
41#define XFS_ICI_RECLAIM_TAG 0 /* inode is to be reclaimed */ 41#define XFS_ICI_RECLAIM_TAG 0 /* inode is to be reclaimed */
42#define XFS_ICI_EOFBLOCKS_TAG 1 /* inode has blocks beyond EOF */ 42#define XFS_ICI_EOFBLOCKS_TAG 1 /* inode has blocks beyond EOF */
43#define XFS_ICI_COWBLOCKS_TAG 2 /* inode can have cow blocks to gc */
43 44
44/* 45/*
45 * Flags for xfs_iget() 46 * Flags for xfs_iget()
@@ -70,6 +71,12 @@ int xfs_inode_free_quota_eofblocks(struct xfs_inode *ip);
70void xfs_eofblocks_worker(struct work_struct *); 71void xfs_eofblocks_worker(struct work_struct *);
71void xfs_queue_eofblocks(struct xfs_mount *); 72void xfs_queue_eofblocks(struct xfs_mount *);
72 73
74void xfs_inode_set_cowblocks_tag(struct xfs_inode *ip);
75void xfs_inode_clear_cowblocks_tag(struct xfs_inode *ip);
76int xfs_icache_free_cowblocks(struct xfs_mount *, struct xfs_eofblocks *);
77int xfs_inode_free_quota_cowblocks(struct xfs_inode *ip);
78void xfs_cowblocks_worker(struct work_struct *);
79
73int xfs_inode_ag_iterator(struct xfs_mount *mp, 80int xfs_inode_ag_iterator(struct xfs_mount *mp,
74 int (*execute)(struct xfs_inode *ip, int flags, void *args), 81 int (*execute)(struct xfs_inode *ip, int flags, void *args),
75 int flags, void *args); 82 int flags, void *args);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 624e1dfa716b..4e560e6a12c1 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -49,6 +49,7 @@
49#include "xfs_trans_priv.h" 49#include "xfs_trans_priv.h"
50#include "xfs_log.h" 50#include "xfs_log.h"
51#include "xfs_bmap_btree.h" 51#include "xfs_bmap_btree.h"
52#include "xfs_reflink.h"
52 53
53kmem_zone_t *xfs_inode_zone; 54kmem_zone_t *xfs_inode_zone;
54 55
@@ -77,6 +78,29 @@ xfs_get_extsz_hint(
77} 78}
78 79
79/* 80/*
81 * Helper function to extract CoW extent size hint from inode.
82 * Between the extent size hint and the CoW extent size hint, we
83 * return the greater of the two. If the value is zero (automatic),
84 * use the default size.
85 */
86xfs_extlen_t
87xfs_get_cowextsz_hint(
88 struct xfs_inode *ip)
89{
90 xfs_extlen_t a, b;
91
92 a = 0;
93 if (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
94 a = ip->i_d.di_cowextsize;
95 b = xfs_get_extsz_hint(ip);
96
97 a = max(a, b);
98 if (a == 0)
99 return XFS_DEFAULT_COWEXTSZ_HINT;
100 return a;
101}
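`xfs_get_cowextsz_hint()` above reduces to: take the larger of the CoW extent size hint and the regular extent size hint, and fall back to a default when both are zero ("automatic"). A pure-function mirror of that logic, with an assumed default value since `XFS_DEFAULT_COWEXTSZ_HINT` is defined elsewhere:

```c
#include <stdint.h>

#define SKETCH_DEFAULT_COWEXTSZ 32U	/* assumed default, in fs blocks */

/*
 * Mirror of xfs_get_cowextsz_hint(): the CoW allocation granularity is
 * the greater of di_cowextsize and the regular extent size hint, with a
 * default when neither is set.
 */
static uint32_t cowextsz_hint(uint32_t cowextsize, uint32_t extsize)
{
	uint32_t a = cowextsize > extsize ? cowextsize : extsize;

	return a == 0 ? SKETCH_DEFAULT_COWEXTSZ : a;
}
```

Taking the max keeps CoW allocations at least as coarse as normal delayed allocations, which limits fragmentation when shared extents are repeatedly overwritten.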
102
103/*
80 * These two are wrapper routines around the xfs_ilock() routine used to 104 * These two are wrapper routines around the xfs_ilock() routine used to
81 * centralize some grungy code. They are used in places that wish to lock the 105 * centralize some grungy code. They are used in places that wish to lock the
82 * inode solely for reading the extents. The reason these places can't just 106 * inode solely for reading the extents. The reason these places can't just
@@ -651,6 +675,8 @@ _xfs_dic2xflags(
651 if (di_flags2 & XFS_DIFLAG2_ANY) { 675 if (di_flags2 & XFS_DIFLAG2_ANY) {
652 if (di_flags2 & XFS_DIFLAG2_DAX) 676 if (di_flags2 & XFS_DIFLAG2_DAX)
653 flags |= FS_XFLAG_DAX; 677 flags |= FS_XFLAG_DAX;
678 if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
679 flags |= FS_XFLAG_COWEXTSIZE;
654 } 680 }
655 681
656 if (has_attr) 682 if (has_attr)
@@ -834,6 +860,7 @@ xfs_ialloc(
834 if (ip->i_d.di_version == 3) { 860 if (ip->i_d.di_version == 3) {
835 inode->i_version = 1; 861 inode->i_version = 1;
836 ip->i_d.di_flags2 = 0; 862 ip->i_d.di_flags2 = 0;
863 ip->i_d.di_cowextsize = 0;
837 ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec; 864 ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec;
838 ip->i_d.di_crtime.t_nsec = (__int32_t)tv.tv_nsec; 865 ip->i_d.di_crtime.t_nsec = (__int32_t)tv.tv_nsec;
839 } 866 }
@@ -896,6 +923,15 @@ xfs_ialloc(
896 ip->i_d.di_flags |= di_flags; 923 ip->i_d.di_flags |= di_flags;
897 ip->i_d.di_flags2 |= di_flags2; 924 ip->i_d.di_flags2 |= di_flags2;
898 } 925 }
926 if (pip &&
927 (pip->i_d.di_flags2 & XFS_DIFLAG2_ANY) &&
928 pip->i_d.di_version == 3 &&
929 ip->i_d.di_version == 3) {
930 if (pip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) {
931 ip->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
932 ip->i_d.di_cowextsize = pip->i_d.di_cowextsize;
933 }
934 }
899 /* FALLTHROUGH */ 935 /* FALLTHROUGH */
900 case S_IFLNK: 936 case S_IFLNK:
901 ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS; 937 ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
@@ -1586,6 +1622,20 @@ xfs_itruncate_extents(
1586 goto out; 1622 goto out;
1587 } 1623 }
1588 1624
1625 /* Remove all pending CoW reservations. */
1626 error = xfs_reflink_cancel_cow_blocks(ip, &tp, first_unmap_block,
1627 last_block);
1628 if (error)
1629 goto out;
1630
1631 /*
1632 * Clear the reflink flag if we truncated everything.
1633 */
1634 if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
1635 ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
1636 xfs_inode_clear_cowblocks_tag(ip);
1637 }
1638
1589 /* 1639 /*
1590 * Always re-log the inode so that our permanent transaction can keep 1640 * Always re-log the inode so that our permanent transaction can keep
1591 * on rolling it forward in the log. 1641 * on rolling it forward in the log.
@@ -1850,6 +1900,7 @@ xfs_inactive(
1850 } 1900 }
1851 1901
1852 mp = ip->i_mount; 1902 mp = ip->i_mount;
1903 ASSERT(!xfs_iflags_test(ip, XFS_IRECOVERY));
1853 1904
1854 /* If this is a read-only mount, don't do this (would generate I/O) */ 1905 /* If this is a read-only mount, don't do this (would generate I/O) */
1855 if (mp->m_flags & XFS_MOUNT_RDONLY) 1906 if (mp->m_flags & XFS_MOUNT_RDONLY)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8f30d2533b48..f14c1de2549d 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -47,6 +47,7 @@ typedef struct xfs_inode {
47 47
48 /* Extent information. */ 48 /* Extent information. */
49 xfs_ifork_t *i_afp; /* attribute fork pointer */ 49 xfs_ifork_t *i_afp; /* attribute fork pointer */
50 xfs_ifork_t *i_cowfp; /* copy on write extents */
50 xfs_ifork_t i_df; /* data fork */ 51 xfs_ifork_t i_df; /* data fork */
51 52
52 /* operations vectors */ 53 /* operations vectors */
@@ -65,6 +66,9 @@ typedef struct xfs_inode {
65 66
66 struct xfs_icdinode i_d; /* most of ondisk inode */ 67 struct xfs_icdinode i_d; /* most of ondisk inode */
67 68
69 xfs_extnum_t i_cnextents; /* # of extents in cow fork */
70 unsigned int i_cformat; /* format of cow fork */
71
68 /* VFS inode */ 72 /* VFS inode */
69 struct inode i_vnode; /* embedded VFS inode */ 73 struct inode i_vnode; /* embedded VFS inode */
70} xfs_inode_t; 74} xfs_inode_t;
@@ -202,6 +206,11 @@ xfs_get_initial_prid(struct xfs_inode *dp)
202 return XFS_PROJID_DEFAULT; 206 return XFS_PROJID_DEFAULT;
203} 207}
204 208
209static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
210{
211 return ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
212}
213
205/* 214/*
206 * In-core inode flags. 215 * In-core inode flags.
207 */ 216 */
@@ -217,6 +226,12 @@ xfs_get_initial_prid(struct xfs_inode *dp)
217#define XFS_IPINNED (1 << __XFS_IPINNED_BIT) 226#define XFS_IPINNED (1 << __XFS_IPINNED_BIT)
218#define XFS_IDONTCACHE (1 << 9) /* don't cache the inode long term */ 227#define XFS_IDONTCACHE (1 << 9) /* don't cache the inode long term */
219#define XFS_IEOFBLOCKS (1 << 10)/* has the preallocblocks tag set */ 228#define XFS_IEOFBLOCKS (1 << 10)/* has the preallocblocks tag set */
229/*
230 * If this unlinked inode is in the middle of recovery, don't let drop_inode
231 * truncate and free the inode. This can happen if we iget the inode during
232 * log recovery to replay a bmap operation on the inode.
233 */
234#define XFS_IRECOVERY (1 << 11)
220 235
221/* 236/*
222 * Per-lifetime flags need to be reset when re-using a reclaimable inode during 237 * Per-lifetime flags need to be reset when re-using a reclaimable inode during
@@ -411,6 +426,7 @@ int xfs_iflush(struct xfs_inode *, struct xfs_buf **);
411void xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint); 426void xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
412 427
413xfs_extlen_t xfs_get_extsz_hint(struct xfs_inode *ip); 428xfs_extlen_t xfs_get_extsz_hint(struct xfs_inode *ip);
429xfs_extlen_t xfs_get_cowextsz_hint(struct xfs_inode *ip);
414 430
415int xfs_dir_ialloc(struct xfs_trans **, struct xfs_inode *, umode_t, 431int xfs_dir_ialloc(struct xfs_trans **, struct xfs_inode *, umode_t,
416 xfs_nlink_t, xfs_dev_t, prid_t, int, 432 xfs_nlink_t, xfs_dev_t, prid_t, int,
@@ -474,4 +490,7 @@ do { \
474 490
475extern struct kmem_zone *xfs_inode_zone; 491extern struct kmem_zone *xfs_inode_zone;
476 492
493/* The default CoW extent size hint. */
494#define XFS_DEFAULT_COWEXTSZ_HINT 32
495
477#endif /* __XFS_INODE_H__ */ 496#endif /* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 892c2aced207..9610e9c00952 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -368,7 +368,7 @@ xfs_inode_to_log_dinode(
368 to->di_crtime.t_sec = from->di_crtime.t_sec; 368 to->di_crtime.t_sec = from->di_crtime.t_sec;
369 to->di_crtime.t_nsec = from->di_crtime.t_nsec; 369 to->di_crtime.t_nsec = from->di_crtime.t_nsec;
370 to->di_flags2 = from->di_flags2; 370 to->di_flags2 = from->di_flags2;
371 371 to->di_cowextsize = from->di_cowextsize;
372 to->di_ino = ip->i_ino; 372 to->di_ino = ip->i_ino;
373 to->di_lsn = lsn; 373 to->di_lsn = lsn;
374 memset(to->di_pad2, 0, sizeof(to->di_pad2)); 374 memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0d9021f0551e..c245bed3249b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -903,6 +903,8 @@ xfs_ioc_fsgetxattr(
903 xfs_ilock(ip, XFS_ILOCK_SHARED); 903 xfs_ilock(ip, XFS_ILOCK_SHARED);
904 fa.fsx_xflags = xfs_ip2xflags(ip); 904 fa.fsx_xflags = xfs_ip2xflags(ip);
905 fa.fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog; 905 fa.fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog;
906 fa.fsx_cowextsize = ip->i_d.di_cowextsize <<
907 ip->i_mount->m_sb.sb_blocklog;
906 fa.fsx_projid = xfs_get_projid(ip); 908 fa.fsx_projid = xfs_get_projid(ip);
907 909
908 if (attr) { 910 if (attr) {
@@ -973,12 +975,13 @@ xfs_set_diflags(
973 if (ip->i_d.di_version < 3) 975 if (ip->i_d.di_version < 3)
974 return; 976 return;
975 977
976 di_flags2 = 0; 978 di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
977 if (xflags & FS_XFLAG_DAX) 979 if (xflags & FS_XFLAG_DAX)
978 di_flags2 |= XFS_DIFLAG2_DAX; 980 di_flags2 |= XFS_DIFLAG2_DAX;
981 if (xflags & FS_XFLAG_COWEXTSIZE)
982 di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
979 983
980 ip->i_d.di_flags2 = di_flags2; 984 ip->i_d.di_flags2 = di_flags2;
981
982} 985}
983 986
984STATIC void 987STATIC void
@@ -1031,6 +1034,14 @@ xfs_ioctl_setattr_xflags(
1031 return -EINVAL; 1034 return -EINVAL;
1032 } 1035 }
1033 1036
1037 /* Clear reflink if we are actually able to set the rt flag. */
1038 if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
1039 ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
1040
1041 /* Don't allow us to set DAX mode for a reflinked file for now. */
1042 if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
1043 return -EINVAL;
1044
1034 /* 1045 /*
1035 * Can't modify an immutable/append-only file unless 1046 * Can't modify an immutable/append-only file unless
1036 * we have appropriate permission. 1047 * we have appropriate permission.
@@ -1219,6 +1230,56 @@ xfs_ioctl_setattr_check_extsize(
1219 return 0; 1230 return 0;
1220} 1231}
1221 1232
1233/*
1234 * CoW extent size hint validation rules are:
1235 *
1236 * 1. CoW extent size hint can only be set if reflink is enabled on the fs.
 1237 * The inode does not have to have any shared blocks, but it must be a v3 inode.
1238 * 2. FS_XFLAG_COWEXTSIZE is only valid for directories and regular files;
1239 * for a directory, the hint is propagated to new files.
1240 * 3. Can be changed on files & directories at any time.
1241 * 4. CoW extsize hint of 0 turns off hints, clears inode flags.
1242 * 5. Extent size must be a multiple of the appropriate block size.
1243 * 6. The extent size hint must be limited to half the AG size to avoid
1244 * alignment extending the extent beyond the limits of the AG.
1245 */
1246static int
1247xfs_ioctl_setattr_check_cowextsize(
1248 struct xfs_inode *ip,
1249 struct fsxattr *fa)
1250{
1251 struct xfs_mount *mp = ip->i_mount;
1252
1253 if (!(fa->fsx_xflags & FS_XFLAG_COWEXTSIZE))
1254 return 0;
1255
1256 if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
1257 ip->i_d.di_version != 3)
1258 return -EINVAL;
1259
1260 if (!S_ISREG(VFS_I(ip)->i_mode) && !S_ISDIR(VFS_I(ip)->i_mode))
1261 return -EINVAL;
1262
1263 if (fa->fsx_cowextsize != 0) {
1264 xfs_extlen_t size;
1265 xfs_fsblock_t cowextsize_fsb;
1266
1267 cowextsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
1268 if (cowextsize_fsb > MAXEXTLEN)
1269 return -EINVAL;
1270
1271 size = mp->m_sb.sb_blocksize;
1272 if (cowextsize_fsb > mp->m_sb.sb_agblocks / 2)
1273 return -EINVAL;
1274
1275 if (fa->fsx_cowextsize % size)
1276 return -EINVAL;
1277 } else
1278 fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
1279
1280 return 0;
1281}
1282
1222static int 1283static int
1223xfs_ioctl_setattr_check_projid( 1284xfs_ioctl_setattr_check_projid(
1224 struct xfs_inode *ip, 1285 struct xfs_inode *ip,
@@ -1311,6 +1372,10 @@ xfs_ioctl_setattr(
1311 if (code) 1372 if (code)
1312 goto error_trans_cancel; 1373 goto error_trans_cancel;
1313 1374
1375 code = xfs_ioctl_setattr_check_cowextsize(ip, fa);
1376 if (code)
1377 goto error_trans_cancel;
1378
1314 code = xfs_ioctl_setattr_xflags(tp, ip, fa); 1379 code = xfs_ioctl_setattr_xflags(tp, ip, fa);
1315 if (code) 1380 if (code)
1316 goto error_trans_cancel; 1381 goto error_trans_cancel;
@@ -1346,6 +1411,12 @@ xfs_ioctl_setattr(
1346 ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog; 1411 ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog;
1347 else 1412 else
1348 ip->i_d.di_extsize = 0; 1413 ip->i_d.di_extsize = 0;
1414 if (ip->i_d.di_version == 3 &&
1415 (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
1416 ip->i_d.di_cowextsize = fa->fsx_cowextsize >>
1417 mp->m_sb.sb_blocklog;
1418 else
1419 ip->i_d.di_cowextsize = 0;
1349 1420
1350 code = xfs_trans_commit(tp); 1421 code = xfs_trans_commit(tp);
1351 1422
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c08253e11545..d907eb9f8ef3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -39,6 +39,7 @@
39#include "xfs_quota.h" 39#include "xfs_quota.h"
40#include "xfs_dquot_item.h" 40#include "xfs_dquot_item.h"
41#include "xfs_dquot.h" 41#include "xfs_dquot.h"
42#include "xfs_reflink.h"
42 43
43 44
44#define XFS_WRITEIO_ALIGN(mp,off) (((off) >> mp->m_writeio_log) \ 45#define XFS_WRITEIO_ALIGN(mp,off) (((off) >> mp->m_writeio_log) \
@@ -70,7 +71,7 @@ xfs_bmbt_to_iomap(
70 iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip)); 71 iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
71} 72}
72 73
73static xfs_extlen_t 74xfs_extlen_t
74xfs_eof_alignment( 75xfs_eof_alignment(
75 struct xfs_inode *ip, 76 struct xfs_inode *ip,
76 xfs_extlen_t extsize) 77 xfs_extlen_t extsize)
@@ -609,7 +610,7 @@ xfs_file_iomap_begin_delay(
609 } 610 }
610 611
611retry: 612retry:
612 error = xfs_bmapi_reserve_delalloc(ip, offset_fsb, 613 error = xfs_bmapi_reserve_delalloc(ip, XFS_DATA_FORK, offset_fsb,
613 end_fsb - offset_fsb, &got, 614 end_fsb - offset_fsb, &got,
614 &prev, &idx, eof); 615 &prev, &idx, eof);
615 switch (error) { 616 switch (error) {
@@ -666,6 +667,7 @@ out_unlock:
666int 667int
667xfs_iomap_write_allocate( 668xfs_iomap_write_allocate(
668 xfs_inode_t *ip, 669 xfs_inode_t *ip,
670 int whichfork,
669 xfs_off_t offset, 671 xfs_off_t offset,
670 xfs_bmbt_irec_t *imap) 672 xfs_bmbt_irec_t *imap)
671{ 673{
@@ -678,8 +680,12 @@ xfs_iomap_write_allocate(
678 xfs_trans_t *tp; 680 xfs_trans_t *tp;
679 int nimaps; 681 int nimaps;
680 int error = 0; 682 int error = 0;
683 int flags = 0;
681 int nres; 684 int nres;
682 685
686 if (whichfork == XFS_COW_FORK)
687 flags |= XFS_BMAPI_COWFORK;
688
683 /* 689 /*
684 * Make sure that the dquots are there. 690 * Make sure that the dquots are there.
685 */ 691 */
@@ -773,7 +779,7 @@ xfs_iomap_write_allocate(
773 * pointer that the caller gave to us. 779 * pointer that the caller gave to us.
774 */ 780 */
775 error = xfs_bmapi_write(tp, ip, map_start_fsb, 781 error = xfs_bmapi_write(tp, ip, map_start_fsb,
776 count_fsb, 0, &first_block, 782 count_fsb, flags, &first_block,
777 nres, imap, &nimaps, 783 nres, imap, &nimaps,
778 &dfops); 784 &dfops);
779 if (error) 785 if (error)
@@ -955,14 +961,22 @@ xfs_file_iomap_begin(
955 struct xfs_mount *mp = ip->i_mount; 961 struct xfs_mount *mp = ip->i_mount;
956 struct xfs_bmbt_irec imap; 962 struct xfs_bmbt_irec imap;
957 xfs_fileoff_t offset_fsb, end_fsb; 963 xfs_fileoff_t offset_fsb, end_fsb;
964 bool shared, trimmed;
958 int nimaps = 1, error = 0; 965 int nimaps = 1, error = 0;
959 unsigned lockmode; 966 unsigned lockmode;
960 967
961 if (XFS_FORCED_SHUTDOWN(mp)) 968 if (XFS_FORCED_SHUTDOWN(mp))
962 return -EIO; 969 return -EIO;
963 970
964 if ((flags & IOMAP_WRITE) && 971 if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
965 !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) { 972 error = xfs_reflink_reserve_cow_range(ip, offset, length);
973 if (error < 0)
974 return error;
975 }
976
977 if ((flags & IOMAP_WRITE) && !IS_DAX(inode) &&
978 !xfs_get_extsz_hint(ip)) {
979 /* Reserve delalloc blocks for regular writeback. */
966 return xfs_file_iomap_begin_delay(inode, offset, length, flags, 980 return xfs_file_iomap_begin_delay(inode, offset, length, flags,
967 iomap); 981 iomap);
968 } 982 }
@@ -976,7 +990,14 @@ xfs_file_iomap_begin(
976 end_fsb = XFS_B_TO_FSB(mp, offset + length); 990 end_fsb = XFS_B_TO_FSB(mp, offset + length);
977 991
978 error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap, 992 error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
979 &nimaps, XFS_BMAPI_ENTIRE); 993 &nimaps, 0);
994 if (error) {
995 xfs_iunlock(ip, lockmode);
996 return error;
997 }
998
999 /* Trim the mapping to the nearest shared extent boundary. */
1000 error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
980 if (error) { 1001 if (error) {
981 xfs_iunlock(ip, lockmode); 1002 xfs_iunlock(ip, lockmode);
982 return error; 1003 return error;
@@ -1015,6 +1036,8 @@ xfs_file_iomap_begin(
1015 } 1036 }
1016 1037
1017 xfs_bmbt_to_iomap(ip, iomap, &imap); 1038 xfs_bmbt_to_iomap(ip, iomap, &imap);
1039 if (shared)
1040 iomap->flags |= IOMAP_F_SHARED;
1018 return 0; 1041 return 0;
1019} 1042}
1020 1043
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 6498be485932..6d45cf01fcff 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,12 +25,13 @@ struct xfs_bmbt_irec;
25 25
26int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t, 26int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
27 struct xfs_bmbt_irec *, int); 27 struct xfs_bmbt_irec *, int);
28int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t, 28int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
29 struct xfs_bmbt_irec *); 29 struct xfs_bmbt_irec *);
30int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t); 30int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
31 31
32void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *, 32void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
33 struct xfs_bmbt_irec *); 33 struct xfs_bmbt_irec *);
34xfs_extlen_t xfs_eof_alignment(struct xfs_inode *ip, xfs_extlen_t extsize);
34 35
35extern struct iomap_ops xfs_iomap_ops; 36extern struct iomap_ops xfs_iomap_ops;
36extern struct iomap_ops xfs_xattr_iomap_ops; 37extern struct iomap_ops xfs_xattr_iomap_ops;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c5da95eb79b8..405a65cd9d6b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1159,6 +1159,7 @@ xfs_diflags_to_iflags(
1159 inode->i_flags |= S_NOATIME; 1159 inode->i_flags |= S_NOATIME;
1160 if (S_ISREG(inode->i_mode) && 1160 if (S_ISREG(inode->i_mode) &&
1161 ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE && 1161 ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
1162 !xfs_is_reflink_inode(ip) &&
1162 (ip->i_mount->m_flags & XFS_MOUNT_DAX || 1163 (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
1163 ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)) 1164 ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
1164 inode->i_flags |= S_DAX; 1165 inode->i_flags |= S_DAX;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index ce73eb34620d..66e881790c17 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -66,7 +66,7 @@ xfs_bulkstat_one_int(
66 if (!buffer || xfs_internal_inum(mp, ino)) 66 if (!buffer || xfs_internal_inum(mp, ino))
67 return -EINVAL; 67 return -EINVAL;
68 68
69 buf = kmem_alloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL); 69 buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
70 if (!buf) 70 if (!buf)
71 return -ENOMEM; 71 return -ENOMEM;
72 72
@@ -111,6 +111,12 @@ xfs_bulkstat_one_int(
111 buf->bs_aextents = dic->di_anextents; 111 buf->bs_aextents = dic->di_anextents;
112 buf->bs_forkoff = XFS_IFORK_BOFF(ip); 112 buf->bs_forkoff = XFS_IFORK_BOFF(ip);
113 113
114 if (dic->di_version == 3) {
115 if (dic->di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
116 buf->bs_cowextsize = dic->di_cowextsize <<
117 mp->m_sb.sb_blocklog;
118 }
119
114 switch (dic->di_format) { 120 switch (dic->di_format) {
115 case XFS_DINODE_FMT_DEV: 121 case XFS_DINODE_FMT_DEV:
116 buf->bs_rdev = ip->i_df.if_u2.if_rdev; 122 buf->bs_rdev = ip->i_df.if_u2.if_rdev;
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index b8d64d520e12..68640fb63a54 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -116,6 +116,7 @@ typedef __u32 xfs_nlink_t;
116#define xfs_inherit_nodefrag xfs_params.inherit_nodfrg.val 116#define xfs_inherit_nodefrag xfs_params.inherit_nodfrg.val
117#define xfs_fstrm_centisecs xfs_params.fstrm_timer.val 117#define xfs_fstrm_centisecs xfs_params.fstrm_timer.val
118#define xfs_eofb_secs xfs_params.eofb_timer.val 118#define xfs_eofb_secs xfs_params.eofb_timer.val
119#define xfs_cowb_secs xfs_params.cowb_timer.val
119 120
120#define current_cpu() (raw_smp_processor_id()) 121#define current_cpu() (raw_smp_processor_id())
121#define current_pid() (current->pid) 122#define current_pid() (current->pid)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 846483d56949..9b3d7c76915d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -45,6 +45,8 @@
45#include "xfs_dir2.h" 45#include "xfs_dir2.h"
46#include "xfs_rmap_item.h" 46#include "xfs_rmap_item.h"
47#include "xfs_buf_item.h" 47#include "xfs_buf_item.h"
48#include "xfs_refcount_item.h"
49#include "xfs_bmap_item.h"
48 50
49#define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1) 51#define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1)
50 52
@@ -1924,6 +1926,10 @@ xlog_recover_reorder_trans(
1924 case XFS_LI_EFI: 1926 case XFS_LI_EFI:
1925 case XFS_LI_RUI: 1927 case XFS_LI_RUI:
1926 case XFS_LI_RUD: 1928 case XFS_LI_RUD:
1929 case XFS_LI_CUI:
1930 case XFS_LI_CUD:
1931 case XFS_LI_BUI:
1932 case XFS_LI_BUD:
1927 trace_xfs_log_recover_item_reorder_tail(log, 1933 trace_xfs_log_recover_item_reorder_tail(log,
1928 trans, item, pass); 1934 trans, item, pass);
1929 list_move_tail(&item->ri_list, &inode_list); 1935 list_move_tail(&item->ri_list, &inode_list);
@@ -2242,6 +2248,7 @@ xlog_recover_get_buf_lsn(
2242 case XFS_ABTB_MAGIC: 2248 case XFS_ABTB_MAGIC:
2243 case XFS_ABTC_MAGIC: 2249 case XFS_ABTC_MAGIC:
2244 case XFS_RMAP_CRC_MAGIC: 2250 case XFS_RMAP_CRC_MAGIC:
2251 case XFS_REFC_CRC_MAGIC:
2245 case XFS_IBT_CRC_MAGIC: 2252 case XFS_IBT_CRC_MAGIC:
2246 case XFS_IBT_MAGIC: { 2253 case XFS_IBT_MAGIC: {
2247 struct xfs_btree_block *btb = blk; 2254 struct xfs_btree_block *btb = blk;
@@ -2415,6 +2422,9 @@ xlog_recover_validate_buf_type(
2415 case XFS_RMAP_CRC_MAGIC: 2422 case XFS_RMAP_CRC_MAGIC:
2416 bp->b_ops = &xfs_rmapbt_buf_ops; 2423 bp->b_ops = &xfs_rmapbt_buf_ops;
2417 break; 2424 break;
2425 case XFS_REFC_CRC_MAGIC:
2426 bp->b_ops = &xfs_refcountbt_buf_ops;
2427 break;
2418 default: 2428 default:
2419 warnmsg = "Bad btree block magic!"; 2429 warnmsg = "Bad btree block magic!";
2420 break; 2430 break;
@@ -3547,6 +3557,242 @@ xlog_recover_rud_pass2(
3547} 3557}
3548 3558
3549/* 3559/*
 3560 * Copy a CUI format buffer from the given buf into the destination
3561 * CUI format structure. The CUI/CUD items were designed not to need any
3562 * special alignment handling.
3563 */
3564static int
3565xfs_cui_copy_format(
3566 struct xfs_log_iovec *buf,
3567 struct xfs_cui_log_format *dst_cui_fmt)
3568{
3569 struct xfs_cui_log_format *src_cui_fmt;
3570 uint len;
3571
3572 src_cui_fmt = buf->i_addr;
3573 len = xfs_cui_log_format_sizeof(src_cui_fmt->cui_nextents);
3574
3575 if (buf->i_len == len) {
3576 memcpy(dst_cui_fmt, src_cui_fmt, len);
3577 return 0;
3578 }
3579 return -EFSCORRUPTED;
3580}
3581
3582/*
3583 * This routine is called to create an in-core extent refcount update
3584 * item from the cui format structure which was logged on disk.
3585 * It allocates an in-core cui, copies the extents from the format
3586 * structure into it, and adds the cui to the AIL with the given
3587 * LSN.
3588 */
3589STATIC int
3590xlog_recover_cui_pass2(
3591 struct xlog *log,
3592 struct xlog_recover_item *item,
3593 xfs_lsn_t lsn)
3594{
3595 int error;
3596 struct xfs_mount *mp = log->l_mp;
3597 struct xfs_cui_log_item *cuip;
3598 struct xfs_cui_log_format *cui_formatp;
3599
3600 cui_formatp = item->ri_buf[0].i_addr;
3601
3602 cuip = xfs_cui_init(mp, cui_formatp->cui_nextents);
3603 error = xfs_cui_copy_format(&item->ri_buf[0], &cuip->cui_format);
3604 if (error) {
3605 xfs_cui_item_free(cuip);
3606 return error;
3607 }
3608 atomic_set(&cuip->cui_next_extent, cui_formatp->cui_nextents);
3609
3610 spin_lock(&log->l_ailp->xa_lock);
3611 /*
3612 * The CUI has two references. One for the CUD and one for CUI to ensure
3613 * it makes it into the AIL. Insert the CUI into the AIL directly and
3614 * drop the CUI reference. Note that xfs_trans_ail_update() drops the
3615 * AIL lock.
3616 */
3617 xfs_trans_ail_update(log->l_ailp, &cuip->cui_item, lsn);
3618 xfs_cui_release(cuip);
3619 return 0;
3620}
3621
3622
3623/*
 3624 * This routine is called when a CUD format structure is found in a committed
3625 * transaction in the log. Its purpose is to cancel the corresponding CUI if it
3626 * was still in the log. To do this it searches the AIL for the CUI with an id
3627 * equal to that in the CUD format structure. If we find it we drop the CUD
3628 * reference, which removes the CUI from the AIL and frees it.
3629 */
3630STATIC int
3631xlog_recover_cud_pass2(
3632 struct xlog *log,
3633 struct xlog_recover_item *item)
3634{
3635 struct xfs_cud_log_format *cud_formatp;
3636 struct xfs_cui_log_item *cuip = NULL;
3637 struct xfs_log_item *lip;
3638 __uint64_t cui_id;
3639 struct xfs_ail_cursor cur;
3640 struct xfs_ail *ailp = log->l_ailp;
3641
3642 cud_formatp = item->ri_buf[0].i_addr;
3643 if (item->ri_buf[0].i_len != sizeof(struct xfs_cud_log_format))
3644 return -EFSCORRUPTED;
3645 cui_id = cud_formatp->cud_cui_id;
3646
3647 /*
3648 * Search for the CUI with the id in the CUD format structure in the
3649 * AIL.
3650 */
3651 spin_lock(&ailp->xa_lock);
3652 lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
3653 while (lip != NULL) {
3654 if (lip->li_type == XFS_LI_CUI) {
3655 cuip = (struct xfs_cui_log_item *)lip;
3656 if (cuip->cui_format.cui_id == cui_id) {
3657 /*
3658 * Drop the CUD reference to the CUI. This
3659 * removes the CUI from the AIL and frees it.
3660 */
3661 spin_unlock(&ailp->xa_lock);
3662 xfs_cui_release(cuip);
3663 spin_lock(&ailp->xa_lock);
3664 break;
3665 }
3666 }
3667 lip = xfs_trans_ail_cursor_next(ailp, &cur);
3668 }
3669
3670 xfs_trans_ail_cursor_done(&cur);
3671 spin_unlock(&ailp->xa_lock);
3672
3673 return 0;
3674}
3675
3676/*
 3677 * Copy a BUI format buffer from the given buf into the destination
3678 * BUI format structure. The BUI/BUD items were designed not to need any
3679 * special alignment handling.
3680 */
3681static int
3682xfs_bui_copy_format(
3683 struct xfs_log_iovec *buf,
3684 struct xfs_bui_log_format *dst_bui_fmt)
3685{
3686 struct xfs_bui_log_format *src_bui_fmt;
3687 uint len;
3688
3689 src_bui_fmt = buf->i_addr;
3690 len = xfs_bui_log_format_sizeof(src_bui_fmt->bui_nextents);
3691
3692 if (buf->i_len == len) {
3693 memcpy(dst_bui_fmt, src_bui_fmt, len);
3694 return 0;
3695 }
3696 return -EFSCORRUPTED;
3697}
3698
3699/*
3700 * This routine is called to create an in-core extent bmap update
3701 * item from the bui format structure which was logged on disk.
3702 * It allocates an in-core bui, copies the extents from the format
3703 * structure into it, and adds the bui to the AIL with the given
3704 * LSN.
3705 */
3706STATIC int
3707xlog_recover_bui_pass2(
3708 struct xlog *log,
3709 struct xlog_recover_item *item,
3710 xfs_lsn_t lsn)
3711{
3712 int error;
3713 struct xfs_mount *mp = log->l_mp;
3714 struct xfs_bui_log_item *buip;
3715 struct xfs_bui_log_format *bui_formatp;
3716
3717 bui_formatp = item->ri_buf[0].i_addr;
3718
3719 if (bui_formatp->bui_nextents != XFS_BUI_MAX_FAST_EXTENTS)
3720 return -EFSCORRUPTED;
3721 buip = xfs_bui_init(mp);
3722 error = xfs_bui_copy_format(&item->ri_buf[0], &buip->bui_format);
3723 if (error) {
3724 xfs_bui_item_free(buip);
3725 return error;
3726 }
3727 atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);
3728
3729 spin_lock(&log->l_ailp->xa_lock);
3730 /*
 3731 * The BUI has two references. One for the BUD and one for BUI to ensure
 3732 * it makes it into the AIL. Insert the BUI into the AIL directly and
 3733 * drop the BUI reference. Note that xfs_trans_ail_update() drops the
3734 * AIL lock.
3735 */
3736 xfs_trans_ail_update(log->l_ailp, &buip->bui_item, lsn);
3737 xfs_bui_release(buip);
3738 return 0;
3739}
3740
3741
3742/*
 3743 * This routine is called when a BUD format structure is found in a committed
3744 * transaction in the log. Its purpose is to cancel the corresponding BUI if it
3745 * was still in the log. To do this it searches the AIL for the BUI with an id
3746 * equal to that in the BUD format structure. If we find it we drop the BUD
3747 * reference, which removes the BUI from the AIL and frees it.
3748 */
3749STATIC int
3750xlog_recover_bud_pass2(
3751 struct xlog *log,
3752 struct xlog_recover_item *item)
3753{
3754 struct xfs_bud_log_format *bud_formatp;
3755 struct xfs_bui_log_item *buip = NULL;
3756 struct xfs_log_item *lip;
3757 __uint64_t bui_id;
3758 struct xfs_ail_cursor cur;
3759 struct xfs_ail *ailp = log->l_ailp;
3760
3761 bud_formatp = item->ri_buf[0].i_addr;
3762 if (item->ri_buf[0].i_len != sizeof(struct xfs_bud_log_format))
3763 return -EFSCORRUPTED;
3764 bui_id = bud_formatp->bud_bui_id;
3765
3766 /*
3767 * Search for the BUI with the id in the BUD format structure in the
3768 * AIL.
3769 */
3770 spin_lock(&ailp->xa_lock);
3771 lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
3772 while (lip != NULL) {
3773 if (lip->li_type == XFS_LI_BUI) {
3774 buip = (struct xfs_bui_log_item *)lip;
3775 if (buip->bui_format.bui_id == bui_id) {
3776 /*
3777 * Drop the BUD reference to the BUI. This
3778 * removes the BUI from the AIL and frees it.
3779 */
3780 spin_unlock(&ailp->xa_lock);
3781 xfs_bui_release(buip);
3782 spin_lock(&ailp->xa_lock);
3783 break;
3784 }
3785 }
3786 lip = xfs_trans_ail_cursor_next(ailp, &cur);
3787 }
3788
3789 xfs_trans_ail_cursor_done(&cur);
3790 spin_unlock(&ailp->xa_lock);
3791
3792 return 0;
3793}
3794
3795/*
 3550 * This routine is called when an inode create format structure is found in a 3796 * committed transaction in the log. Its purpose is to initialise the inodes
 3551 * committed transaction in the log. Its purpose is to initialise the inodes 3797 * being allocated on disk. This requires us to get inode cluster buffers that
3552 * being allocated on disk. This requires us to get inode cluster buffers that 3798 * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3773,6 +4019,10 @@ xlog_recover_ra_pass2(
3773 case XFS_LI_QUOTAOFF: 4019 case XFS_LI_QUOTAOFF:
3774 case XFS_LI_RUI: 4020 case XFS_LI_RUI:
3775 case XFS_LI_RUD: 4021 case XFS_LI_RUD:
4022 case XFS_LI_CUI:
4023 case XFS_LI_CUD:
4024 case XFS_LI_BUI:
4025 case XFS_LI_BUD:
3776 default: 4026 default:
3777 break; 4027 break;
3778 } 4028 }
@@ -3798,6 +4048,10 @@ xlog_recover_commit_pass1(
3798 case XFS_LI_ICREATE: 4048 case XFS_LI_ICREATE:
3799 case XFS_LI_RUI: 4049 case XFS_LI_RUI:
3800 case XFS_LI_RUD: 4050 case XFS_LI_RUD:
4051 case XFS_LI_CUI:
4052 case XFS_LI_CUD:
4053 case XFS_LI_BUI:
4054 case XFS_LI_BUD:
3801 /* nothing to do in pass 1 */ 4055 /* nothing to do in pass 1 */
3802 return 0; 4056 return 0;
3803 default: 4057 default:
@@ -3832,6 +4086,14 @@ xlog_recover_commit_pass2(
3832 return xlog_recover_rui_pass2(log, item, trans->r_lsn); 4086 return xlog_recover_rui_pass2(log, item, trans->r_lsn);
3833 case XFS_LI_RUD: 4087 case XFS_LI_RUD:
3834 return xlog_recover_rud_pass2(log, item); 4088 return xlog_recover_rud_pass2(log, item);
4089 case XFS_LI_CUI:
4090 return xlog_recover_cui_pass2(log, item, trans->r_lsn);
4091 case XFS_LI_CUD:
4092 return xlog_recover_cud_pass2(log, item);
4093 case XFS_LI_BUI:
4094 return xlog_recover_bui_pass2(log, item, trans->r_lsn);
4095 case XFS_LI_BUD:
4096 return xlog_recover_bud_pass2(log, item);
3835 case XFS_LI_DQUOT: 4097 case XFS_LI_DQUOT:
3836 return xlog_recover_dquot_pass2(log, buffer_list, item, 4098 return xlog_recover_dquot_pass2(log, buffer_list, item,
3837 trans->r_lsn); 4099 trans->r_lsn);
@@ -4419,12 +4681,94 @@ xlog_recover_cancel_rui(
4419 spin_lock(&ailp->xa_lock); 4681 spin_lock(&ailp->xa_lock);
4420} 4682}
4421 4683
4684/* Recover the CUI if necessary. */
4685STATIC int
4686xlog_recover_process_cui(
4687 struct xfs_mount *mp,
4688 struct xfs_ail *ailp,
4689 struct xfs_log_item *lip)
4690{
4691 struct xfs_cui_log_item *cuip;
4692 int error;
4693
4694 /*
4695 * Skip CUIs that we've already processed.
4696 */
4697 cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
4698 if (test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags))
4699 return 0;
4700
4701 spin_unlock(&ailp->xa_lock);
4702 error = xfs_cui_recover(mp, cuip);
4703 spin_lock(&ailp->xa_lock);
4704
4705 return error;
4706}
4707
4708/* Release the CUI since we're cancelling everything. */
4709STATIC void
4710xlog_recover_cancel_cui(
4711 struct xfs_mount *mp,
4712 struct xfs_ail *ailp,
4713 struct xfs_log_item *lip)
4714{
4715 struct xfs_cui_log_item *cuip;
4716
4717 cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
4718
4719 spin_unlock(&ailp->xa_lock);
4720 xfs_cui_release(cuip);
4721 spin_lock(&ailp->xa_lock);
4722}
4723
4724/* Recover the BUI if necessary. */
4725STATIC int
4726xlog_recover_process_bui(
4727 struct xfs_mount *mp,
4728 struct xfs_ail *ailp,
4729 struct xfs_log_item *lip)
4730{
4731 struct xfs_bui_log_item *buip;
4732 int error;
4733
4734 /*
4735 * Skip BUIs that we've already processed.
4736 */
4737 buip = container_of(lip, struct xfs_bui_log_item, bui_item);
4738 if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags))
4739 return 0;
4740
4741 spin_unlock(&ailp->xa_lock);
4742 error = xfs_bui_recover(mp, buip);
4743 spin_lock(&ailp->xa_lock);
4744
4745 return error;
4746}
4747
4748/* Release the BUI since we're cancelling everything. */
4749STATIC void
4750xlog_recover_cancel_bui(
4751 struct xfs_mount *mp,
4752 struct xfs_ail *ailp,
4753 struct xfs_log_item *lip)
4754{
4755 struct xfs_bui_log_item *buip;
4756
4757 buip = container_of(lip, struct xfs_bui_log_item, bui_item);
4758
4759 spin_unlock(&ailp->xa_lock);
4760 xfs_bui_release(buip);
4761 spin_lock(&ailp->xa_lock);
4762}
4763
4764/* Is this log item a deferred action intent? */
4765static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
4766{
4767	switch (lip->li_type) {
4768	case XFS_LI_EFI:
4769	case XFS_LI_RUI:
4770	case XFS_LI_CUI:
4771	case XFS_LI_BUI:
4772		return true;
4773	default:
4774		return false;
@@ -4488,6 +4832,12 @@ xlog_recover_process_intents(
4832	case XFS_LI_RUI:
4833		error = xlog_recover_process_rui(log->l_mp, ailp, lip);
4834		break;
4835	case XFS_LI_CUI:
4836		error = xlog_recover_process_cui(log->l_mp, ailp, lip);
4837		break;
4838	case XFS_LI_BUI:
4839		error = xlog_recover_process_bui(log->l_mp, ailp, lip);
4840		break;
4841	}
4842	if (error)
4843		goto out;
@@ -4535,6 +4885,12 @@ xlog_recover_cancel_intents(
4885	case XFS_LI_RUI:
4886		xlog_recover_cancel_rui(log->l_mp, ailp, lip);
4887		break;
4888	case XFS_LI_CUI:
4889		xlog_recover_cancel_cui(log->l_mp, ailp, lip);
4890		break;
4891	case XFS_LI_BUI:
4892		xlog_recover_cancel_bui(log->l_mp, ailp, lip);
4893		break;
4894	}
4895
4896	lip = xfs_trans_ail_cursor_next(ailp, &cur);
@@ -4613,6 +4969,7 @@ xlog_recover_process_one_iunlink(
4969	if (error)
4970		goto fail_iput;
4971
4972	xfs_iflags_clear(ip, XFS_IRECOVERY);
4973	ASSERT(VFS_I(ip)->i_nlink == 0);
4974	ASSERT(VFS_I(ip)->i_mode != 0);
4975
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 56e85a6c85c7..fc7873942bea 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -43,6 +43,8 @@
43#include "xfs_icache.h"
44#include "xfs_sysfs.h"
45#include "xfs_rmap_btree.h"
46#include "xfs_refcount_btree.h"
47#include "xfs_reflink.h"
48
49
50static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -684,6 +686,7 @@ xfs_mountfs(
686	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
687	xfs_ialloc_compute_maxlevels(mp);
688	xfs_rmapbt_compute_maxlevels(mp);
689	xfs_refcountbt_compute_maxlevels(mp);
690
691	xfs_set_maxicount(mp);
692
@@ -923,6 +926,15 @@ xfs_mountfs(
926	}
927
928	/*
929 * During the second phase of log recovery, we need iget and
930 * iput to behave like they do for an active filesystem.
931 * xfs_fs_drop_inode needs to be able to prevent the deletion
932 * of inodes before we're done replaying log items on those
933 * inodes.
934 */
935 mp->m_super->s_flags |= MS_ACTIVE;
936
937 /*
938	 * Finish recovering the file system. This part needed to be delayed
939	 * until after the root and real-time bitmap inodes were consistently
940	 * read in.
@@ -974,10 +986,28 @@ xfs_mountfs(
986	if (error)
987		xfs_warn(mp,
988	"Unable to allocate reserve blocks. Continuing without reserve pool.");
989
990 /* Recover any CoW blocks that never got remapped. */
991 error = xfs_reflink_recover_cow(mp);
992 if (error) {
993 xfs_err(mp,
994 "Error %d recovering leftover CoW allocations.", error);
995 xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
996 goto out_quota;
997 }
998
999 /* Reserve AG blocks for future btree expansion. */
1000 error = xfs_fs_reserve_ag_blocks(mp);
1001 if (error && error != -ENOSPC)
1002 goto out_agresv;
1003	}
1004
1005	return 0;
1006
1007 out_agresv:
1008 xfs_fs_unreserve_ag_blocks(mp);
1009 out_quota:
1010 xfs_qm_unmount_quotas(mp);
1011 out_rtunmount:
1012	xfs_rtunmount_inodes(mp);
1013 out_rele_rip:
@@ -1019,7 +1049,9 @@ xfs_unmountfs(
1049	int error;
1050
1051	cancel_delayed_work_sync(&mp->m_eofblocks_work);
1052	cancel_delayed_work_sync(&mp->m_cowblocks_work);
1053
1054	xfs_fs_unreserve_ag_blocks(mp);
1055	xfs_qm_unmount_quotas(mp);
1056	xfs_rtunmount_inodes(mp);
1057	IRELE(mp->m_rootip);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 041d9493e798..819b80b15bfb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -124,10 +124,13 @@ typedef struct xfs_mount {
124	uint	m_inobt_mnr[2];	/* min inobt btree records */
125	uint	m_rmap_mxr[2];	/* max rmap btree records */
126	uint	m_rmap_mnr[2];	/* min rmap btree records */
127	uint	m_refc_mxr[2];	/* max refc btree records */
128	uint	m_refc_mnr[2];	/* min refc btree records */
129	uint	m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
130	uint	m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
131	uint	m_in_maxlevels;	/* max inobt btree levels. */
132	uint	m_rmap_maxlevels; /* max rmap btree levels */
133	uint	m_refc_maxlevels; /* max refcount btree level */
134	xfs_extlen_t	m_ag_prealloc_blocks; /* reserved ag blocks */
135	uint	m_alloc_set_aside; /* space we can't use */
136	uint	m_ag_max_usable; /* max space per AG */
@@ -161,6 +164,8 @@ typedef struct xfs_mount {
164	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
165	struct delayed_work	m_eofblocks_work; /* background eof blocks
166						     trimming */
167	struct delayed_work	m_cowblocks_work; /* background cow blocks
168						     trimming */
169	bool	m_update_sb;	/* sb needs update in mount */
170	int64_t	m_low_space[XFS_LOWSP_MAX];
171				/* low free space thresholds */
@@ -399,6 +404,9 @@ typedef struct xfs_perag {
404	struct xfs_ag_resv	pag_meta_resv;
405	/* Blocks reserved for just AGFL-based metadata. */
406	struct xfs_ag_resv	pag_agfl_resv;
407
408	/* reference count */
409	__uint8_t	pagf_refcount_level;
410} xfs_perag_t;
411
412static inline struct xfs_ag_resv *
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 69e2986a3776..0c381d71b242 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -49,6 +49,8 @@ xfs_check_ondisk_structs(void)
49	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr, 56);
50	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key, 4);
51	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec, 16);
52	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_key, 4);
53	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_rec, 12);
54	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key, 20);
55	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec, 24);
56	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp, 8);
@@ -56,6 +58,7 @@ xfs_check_ondisk_structs(void)
58	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t, 4);
59	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t, 8);
60	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t, 4);
61	XFS_CHECK_STRUCT_SIZE(xfs_refcount_ptr_t, 4);
62	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t, 4);
63
64	/* dir/attr trees */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 0f14b2e4bf6c..93a7aafa56d6 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -114,6 +114,13 @@ xfs_fs_map_blocks(
114		return -ENXIO;
115
116	/*
117	 * The pNFS block layout spec actually supports reflink like
118	 * functionality, but the Linux pNFS server doesn't implement it yet.
119	 */
120	if (xfs_is_reflink_inode(ip))
121		return -ENXIO;
122
123	/*
124	 * Lock out any other I/O before we flush and invalidate the pagecache,
125	 * and then hand out a layout to the remote system. This is very
126	 * similar to direct I/O, except that the synchronization is much more
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
new file mode 100644
index 000000000000..fe86a668a57e
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.c
@@ -0,0 +1,539 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_format.h"
23#include "xfs_log_format.h"
24#include "xfs_trans_resv.h"
25#include "xfs_bit.h"
26#include "xfs_mount.h"
27#include "xfs_defer.h"
28#include "xfs_trans.h"
29#include "xfs_trans_priv.h"
30#include "xfs_buf_item.h"
31#include "xfs_refcount_item.h"
32#include "xfs_log.h"
33#include "xfs_refcount.h"
34
35
36kmem_zone_t *xfs_cui_zone;
37kmem_zone_t *xfs_cud_zone;
38
39static inline struct xfs_cui_log_item *CUI_ITEM(struct xfs_log_item *lip)
40{
41 return container_of(lip, struct xfs_cui_log_item, cui_item);
42}
43
44void
45xfs_cui_item_free(
46 struct xfs_cui_log_item *cuip)
47{
48 if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
49 kmem_free(cuip);
50 else
51 kmem_zone_free(xfs_cui_zone, cuip);
52}
53
54STATIC void
55xfs_cui_item_size(
56 struct xfs_log_item *lip,
57 int *nvecs,
58 int *nbytes)
59{
60 struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
61
62 *nvecs += 1;
63 *nbytes += xfs_cui_log_format_sizeof(cuip->cui_format.cui_nextents);
64}
65
66/*
67 * This is called to fill in the vector of log iovecs for the
68 * given cui log item. We use only 1 iovec, and we point that
69 * at the cui_log_format structure embedded in the cui item.
70 * It is at this point that we assert that all of the extent
71 * slots in the cui item have been filled.
72 */
73STATIC void
74xfs_cui_item_format(
75 struct xfs_log_item *lip,
76 struct xfs_log_vec *lv)
77{
78 struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
79 struct xfs_log_iovec *vecp = NULL;
80
81 ASSERT(atomic_read(&cuip->cui_next_extent) ==
82 cuip->cui_format.cui_nextents);
83
84 cuip->cui_format.cui_type = XFS_LI_CUI;
85 cuip->cui_format.cui_size = 1;
86
87 xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUI_FORMAT, &cuip->cui_format,
88 xfs_cui_log_format_sizeof(cuip->cui_format.cui_nextents));
89}
90
91/*
 92 * Pinning has no meaning for a cui item, so just return.
93 */
94STATIC void
95xfs_cui_item_pin(
96 struct xfs_log_item *lip)
97{
98}
99
100/*
 101 * The unpin operation is the last place a CUI is manipulated in the log. It is
102 * either inserted in the AIL or aborted in the event of a log I/O error. In
103 * either case, the CUI transaction has been successfully committed to make it
104 * this far. Therefore, we expect whoever committed the CUI to either construct
105 * and commit the CUD or drop the CUD's reference in the event of error. Simply
106 * drop the log's CUI reference now that the log is done with it.
107 */
108STATIC void
109xfs_cui_item_unpin(
110 struct xfs_log_item *lip,
111 int remove)
112{
113 struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
114
115 xfs_cui_release(cuip);
116}
117
118/*
119 * CUI items have no locking or pushing. However, since CUIs are pulled from
120 * the AIL when their corresponding CUDs are committed to disk, their situation
121 * is very similar to being pinned. Return XFS_ITEM_PINNED so that the caller
122 * will eventually flush the log. This should help in getting the CUI out of
123 * the AIL.
124 */
125STATIC uint
126xfs_cui_item_push(
127 struct xfs_log_item *lip,
128 struct list_head *buffer_list)
129{
130 return XFS_ITEM_PINNED;
131}
132
133/*
134 * The CUI has been either committed or aborted if the transaction has been
 135 * cancelled. If the transaction was cancelled, a CUD isn't going to be
136 * constructed and thus we free the CUI here directly.
137 */
138STATIC void
139xfs_cui_item_unlock(
140 struct xfs_log_item *lip)
141{
142 if (lip->li_flags & XFS_LI_ABORTED)
143 xfs_cui_item_free(CUI_ITEM(lip));
144}
145
146/*
147 * The CUI is logged only once and cannot be moved in the log, so simply return
148 * the lsn at which it's been logged.
149 */
150STATIC xfs_lsn_t
151xfs_cui_item_committed(
152 struct xfs_log_item *lip,
153 xfs_lsn_t lsn)
154{
155 return lsn;
156}
157
158/*
159 * The CUI dependency tracking op doesn't do squat. It can't because
160 * it doesn't know where the free extent is coming from. The dependency
161 * tracking has to be handled by the "enclosing" metadata object. For
162 * example, for inodes, the inode is locked throughout the extent freeing
163 * so the dependency should be recorded there.
164 */
165STATIC void
166xfs_cui_item_committing(
167 struct xfs_log_item *lip,
168 xfs_lsn_t lsn)
169{
170}
171
172/*
173 * This is the ops vector shared by all cui log items.
174 */
175static const struct xfs_item_ops xfs_cui_item_ops = {
176 .iop_size = xfs_cui_item_size,
177 .iop_format = xfs_cui_item_format,
178 .iop_pin = xfs_cui_item_pin,
179 .iop_unpin = xfs_cui_item_unpin,
180 .iop_unlock = xfs_cui_item_unlock,
181 .iop_committed = xfs_cui_item_committed,
182 .iop_push = xfs_cui_item_push,
183 .iop_committing = xfs_cui_item_committing,
184};
185
186/*
 187 * Allocate and initialize a cui item with the given number of extents.
188 */
189struct xfs_cui_log_item *
190xfs_cui_init(
191 struct xfs_mount *mp,
192 uint nextents)
193
194{
195 struct xfs_cui_log_item *cuip;
196
197 ASSERT(nextents > 0);
198 if (nextents > XFS_CUI_MAX_FAST_EXTENTS)
199 cuip = kmem_zalloc(xfs_cui_log_item_sizeof(nextents),
200 KM_SLEEP);
201 else
202 cuip = kmem_zone_zalloc(xfs_cui_zone, KM_SLEEP);
203
204 xfs_log_item_init(mp, &cuip->cui_item, XFS_LI_CUI, &xfs_cui_item_ops);
205 cuip->cui_format.cui_nextents = nextents;
206 cuip->cui_format.cui_id = (uintptr_t)(void *)cuip;
207 atomic_set(&cuip->cui_next_extent, 0);
208 atomic_set(&cuip->cui_refcount, 2);
209
210 return cuip;
211}
212
213/*
214 * Freeing the CUI requires that we remove it from the AIL if it has already
215 * been placed there. However, the CUI may not yet have been placed in the AIL
216 * when called by xfs_cui_release() from CUD processing due to the ordering of
217 * committed vs unpin operations in bulk insert operations. Hence the reference
218 * count to ensure only the last caller frees the CUI.
219 */
220void
221xfs_cui_release(
222 struct xfs_cui_log_item *cuip)
223{
224 if (atomic_dec_and_test(&cuip->cui_refcount)) {
225 xfs_trans_ail_remove(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
226 xfs_cui_item_free(cuip);
227 }
228}
229
230static inline struct xfs_cud_log_item *CUD_ITEM(struct xfs_log_item *lip)
231{
232 return container_of(lip, struct xfs_cud_log_item, cud_item);
233}
234
235STATIC void
236xfs_cud_item_size(
237 struct xfs_log_item *lip,
238 int *nvecs,
239 int *nbytes)
240{
241 *nvecs += 1;
242 *nbytes += sizeof(struct xfs_cud_log_format);
243}
244
245/*
246 * This is called to fill in the vector of log iovecs for the
247 * given cud log item. We use only 1 iovec, and we point that
248 * at the cud_log_format structure embedded in the cud item.
249 * It is at this point that we assert that all of the extent
250 * slots in the cud item have been filled.
251 */
252STATIC void
253xfs_cud_item_format(
254 struct xfs_log_item *lip,
255 struct xfs_log_vec *lv)
256{
257 struct xfs_cud_log_item *cudp = CUD_ITEM(lip);
258 struct xfs_log_iovec *vecp = NULL;
259
260 cudp->cud_format.cud_type = XFS_LI_CUD;
261 cudp->cud_format.cud_size = 1;
262
263 xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUD_FORMAT, &cudp->cud_format,
264 sizeof(struct xfs_cud_log_format));
265}
266
267/*
 268 * Pinning has no meaning for a cud item, so just return.
269 */
270STATIC void
271xfs_cud_item_pin(
272 struct xfs_log_item *lip)
273{
274}
275
276/*
 277 * Since pinning has no meaning for a cud item, unpinning does
278 * not either.
279 */
280STATIC void
281xfs_cud_item_unpin(
282 struct xfs_log_item *lip,
283 int remove)
284{
285}
286
287/*
 288 * There isn't much you can do to push on a cud item. It is simply stuck
289 * waiting for the log to be flushed to disk.
290 */
291STATIC uint
292xfs_cud_item_push(
293 struct xfs_log_item *lip,
294 struct list_head *buffer_list)
295{
296 return XFS_ITEM_PINNED;
297}
298
299/*
300 * The CUD is either committed or aborted if the transaction is cancelled. If
301 * the transaction is cancelled, drop our reference to the CUI and free the
302 * CUD.
303 */
304STATIC void
305xfs_cud_item_unlock(
306 struct xfs_log_item *lip)
307{
308 struct xfs_cud_log_item *cudp = CUD_ITEM(lip);
309
310 if (lip->li_flags & XFS_LI_ABORTED) {
311 xfs_cui_release(cudp->cud_cuip);
312 kmem_zone_free(xfs_cud_zone, cudp);
313 }
314}
315
316/*
317 * When the cud item is committed to disk, all we need to do is delete our
318 * reference to our partner cui item and then free ourselves. Since we're
319 * freeing ourselves we must return -1 to keep the transaction code from
320 * further referencing this item.
321 */
322STATIC xfs_lsn_t
323xfs_cud_item_committed(
324 struct xfs_log_item *lip,
325 xfs_lsn_t lsn)
326{
327 struct xfs_cud_log_item *cudp = CUD_ITEM(lip);
328
329 /*
330 * Drop the CUI reference regardless of whether the CUD has been
331 * aborted. Once the CUD transaction is constructed, it is the sole
332 * responsibility of the CUD to release the CUI (even if the CUI is
333 * aborted due to log I/O error).
334 */
335 xfs_cui_release(cudp->cud_cuip);
336 kmem_zone_free(xfs_cud_zone, cudp);
337
338 return (xfs_lsn_t)-1;
339}
340
341/*
342 * The CUD dependency tracking op doesn't do squat. It can't because
343 * it doesn't know where the free extent is coming from. The dependency
344 * tracking has to be handled by the "enclosing" metadata object. For
345 * example, for inodes, the inode is locked throughout the extent freeing
346 * so the dependency should be recorded there.
347 */
348STATIC void
349xfs_cud_item_committing(
350 struct xfs_log_item *lip,
351 xfs_lsn_t lsn)
352{
353}
354
355/*
356 * This is the ops vector shared by all cud log items.
357 */
358static const struct xfs_item_ops xfs_cud_item_ops = {
359 .iop_size = xfs_cud_item_size,
360 .iop_format = xfs_cud_item_format,
361 .iop_pin = xfs_cud_item_pin,
362 .iop_unpin = xfs_cud_item_unpin,
363 .iop_unlock = xfs_cud_item_unlock,
364 .iop_committed = xfs_cud_item_committed,
365 .iop_push = xfs_cud_item_push,
366 .iop_committing = xfs_cud_item_committing,
367};
368
369/*
 370 * Allocate and initialize a cud item, pairing it with the given cui item.
371 */
372struct xfs_cud_log_item *
373xfs_cud_init(
374 struct xfs_mount *mp,
375 struct xfs_cui_log_item *cuip)
376
377{
378 struct xfs_cud_log_item *cudp;
379
380 cudp = kmem_zone_zalloc(xfs_cud_zone, KM_SLEEP);
381 xfs_log_item_init(mp, &cudp->cud_item, XFS_LI_CUD, &xfs_cud_item_ops);
382 cudp->cud_cuip = cuip;
383 cudp->cud_format.cud_cui_id = cuip->cui_format.cui_id;
384
385 return cudp;
386}
387
388/*
389 * Process a refcount update intent item that was recovered from the log.
390 * We need to update the refcountbt.
391 */
392int
393xfs_cui_recover(
394 struct xfs_mount *mp,
395 struct xfs_cui_log_item *cuip)
396{
397 int i;
398 int error = 0;
399 unsigned int refc_type;
400 struct xfs_phys_extent *refc;
401 xfs_fsblock_t startblock_fsb;
402 bool op_ok;
403 struct xfs_cud_log_item *cudp;
404 struct xfs_trans *tp;
405 struct xfs_btree_cur *rcur = NULL;
406 enum xfs_refcount_intent_type type;
407 xfs_fsblock_t firstfsb;
408 xfs_fsblock_t new_fsb;
409 xfs_extlen_t new_len;
410 struct xfs_bmbt_irec irec;
411 struct xfs_defer_ops dfops;
412 bool requeue_only = false;
413
414 ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
415
416 /*
417 * First check the validity of the extents described by the
418 * CUI. If any are bad, then assume that all are bad and
419 * just toss the CUI.
420 */
421 for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
422 refc = &cuip->cui_format.cui_extents[i];
423 startblock_fsb = XFS_BB_TO_FSB(mp,
424 XFS_FSB_TO_DADDR(mp, refc->pe_startblock));
425 switch (refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
426 case XFS_REFCOUNT_INCREASE:
427 case XFS_REFCOUNT_DECREASE:
428 case XFS_REFCOUNT_ALLOC_COW:
429 case XFS_REFCOUNT_FREE_COW:
430 op_ok = true;
431 break;
432 default:
433 op_ok = false;
434 break;
435 }
436 if (!op_ok || startblock_fsb == 0 ||
437 refc->pe_len == 0 ||
438 startblock_fsb >= mp->m_sb.sb_dblocks ||
439 refc->pe_len >= mp->m_sb.sb_agblocks ||
440 (refc->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)) {
441 /*
442 * This will pull the CUI from the AIL and
443 * free the memory associated with it.
444 */
445 set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
446 xfs_cui_release(cuip);
447 return -EIO;
448 }
449 }
450
451 /*
452 * Under normal operation, refcount updates are deferred, so we
453 * wouldn't be adding them directly to a transaction. All
454 * refcount updates manage reservation usage internally and
455 * dynamically by deferring work that won't fit in the
456 * transaction. Normally, any work that needs to be deferred
457 * gets attached to the same defer_ops that scheduled the
 458 * refcount update. However, we're in log recovery here, so we
 459 * create our own defer_ops and use that to finish up any
460 * work that doesn't fit.
461 */
462 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
463 if (error)
464 return error;
465 cudp = xfs_trans_get_cud(tp, cuip);
466
467 xfs_defer_init(&dfops, &firstfsb);
468 for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
469 refc = &cuip->cui_format.cui_extents[i];
470 refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
471 switch (refc_type) {
472 case XFS_REFCOUNT_INCREASE:
473 case XFS_REFCOUNT_DECREASE:
474 case XFS_REFCOUNT_ALLOC_COW:
475 case XFS_REFCOUNT_FREE_COW:
476 type = refc_type;
477 break;
478 default:
479 error = -EFSCORRUPTED;
480 goto abort_error;
481 }
482 if (requeue_only) {
483 new_fsb = refc->pe_startblock;
484 new_len = refc->pe_len;
485 } else
486 error = xfs_trans_log_finish_refcount_update(tp, cudp,
487 &dfops, type, refc->pe_startblock, refc->pe_len,
488 &new_fsb, &new_len, &rcur);
489 if (error)
490 goto abort_error;
491
492 /* Requeue what we didn't finish. */
493 if (new_len > 0) {
494 irec.br_startblock = new_fsb;
495 irec.br_blockcount = new_len;
496 switch (type) {
497 case XFS_REFCOUNT_INCREASE:
498 error = xfs_refcount_increase_extent(
499 tp->t_mountp, &dfops, &irec);
500 break;
501 case XFS_REFCOUNT_DECREASE:
502 error = xfs_refcount_decrease_extent(
503 tp->t_mountp, &dfops, &irec);
504 break;
505 case XFS_REFCOUNT_ALLOC_COW:
506 error = xfs_refcount_alloc_cow_extent(
507 tp->t_mountp, &dfops,
508 irec.br_startblock,
509 irec.br_blockcount);
510 break;
511 case XFS_REFCOUNT_FREE_COW:
512 error = xfs_refcount_free_cow_extent(
513 tp->t_mountp, &dfops,
514 irec.br_startblock,
515 irec.br_blockcount);
516 break;
517 default:
518 ASSERT(0);
519 }
520 if (error)
521 goto abort_error;
522 requeue_only = true;
523 }
524 }
525
526 xfs_refcount_finish_one_cleanup(tp, rcur, error);
527 error = xfs_defer_finish(&tp, &dfops, NULL);
528 if (error)
529 goto abort_error;
530 set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
531 error = xfs_trans_commit(tp);
532 return error;
533
534abort_error:
535 xfs_refcount_finish_one_cleanup(tp, rcur, error);
536 xfs_defer_cancel(&dfops);
537 xfs_trans_cancel(tp);
538 return error;
539}
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
new file mode 100644
index 000000000000..5b74dddfa64b
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.h
@@ -0,0 +1,101 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#ifndef __XFS_REFCOUNT_ITEM_H__
21#define __XFS_REFCOUNT_ITEM_H__
22
23/*
24 * There are (currently) two pairs of refcount btree redo item types:
25 * increase and decrease. The log items for these are CUI (refcount
26 * update intent) and CUD (refcount update done). The redo item type
 27 * is encoded in the flags field of each xfs_phys_extent.
28 *
29 * *I items should be recorded in the *first* of a series of rolled
30 * transactions, and the *D items should be recorded in the same
31 * transaction that records the associated refcountbt updates.
32 *
33 * Should the system crash after the commit of the first transaction
34 * but before the commit of the final transaction in a series, log
35 * recovery will use the redo information recorded by the intent items
36 * to replay the refcountbt metadata updates.
37 */
38
39/* kernel only CUI/CUD definitions */
40
41struct xfs_mount;
42struct kmem_zone;
43
44/*
45 * Max number of extents in fast allocation path.
46 */
47#define XFS_CUI_MAX_FAST_EXTENTS 16
48
49/*
50 * Define CUI flag bits. Manipulated by set/clear/test_bit operators.
51 */
52#define XFS_CUI_RECOVERED 1
53
54/*
 55 * This is the "refcount update intent" log item. It is used to log
 56 * the fact that some reference counts need to change. It is used in
57 * conjunction with the "refcount update done" log item described
58 * below.
59 *
60 * These log items follow the same rules as struct xfs_efi_log_item;
61 * see the comments about that structure (in xfs_extfree_item.h) for
62 * more details.
63 */
64struct xfs_cui_log_item {
65 struct xfs_log_item cui_item;
66 atomic_t cui_refcount;
67 atomic_t cui_next_extent;
68 unsigned long cui_flags; /* misc flags */
69 struct xfs_cui_log_format cui_format;
70};
71
72static inline size_t
73xfs_cui_log_item_sizeof(
74 unsigned int nr)
75{
76 return offsetof(struct xfs_cui_log_item, cui_format) +
77 xfs_cui_log_format_sizeof(nr);
78}
79
80/*
81 * This is the "refcount update done" log item. It is used to log the
82 * fact that some refcountbt updates mentioned in an earlier cui item
83 * have been performed.
84 */
85struct xfs_cud_log_item {
86 struct xfs_log_item cud_item;
87 struct xfs_cui_log_item *cud_cuip;
88 struct xfs_cud_log_format cud_format;
89};
90
91extern struct kmem_zone *xfs_cui_zone;
92extern struct kmem_zone *xfs_cud_zone;
93
94struct xfs_cui_log_item *xfs_cui_init(struct xfs_mount *, uint);
95struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
96 struct xfs_cui_log_item *);
97void xfs_cui_item_free(struct xfs_cui_log_item *);
98void xfs_cui_release(struct xfs_cui_log_item *);
99int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip);
100
101#endif /* __XFS_REFCOUNT_ITEM_H__ */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
new file mode 100644
index 000000000000..5965e9455d91
--- /dev/null
+++ b/fs/xfs/xfs_reflink.c
@@ -0,0 +1,1688 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_shared.h"
23#include "xfs_format.h"
24#include "xfs_log_format.h"
25#include "xfs_trans_resv.h"
26#include "xfs_mount.h"
27#include "xfs_defer.h"
28#include "xfs_da_format.h"
29#include "xfs_da_btree.h"
30#include "xfs_inode.h"
31#include "xfs_trans.h"
32#include "xfs_inode_item.h"
33#include "xfs_bmap.h"
34#include "xfs_bmap_util.h"
35#include "xfs_error.h"
36#include "xfs_dir2.h"
37#include "xfs_dir2_priv.h"
38#include "xfs_ioctl.h"
39#include "xfs_trace.h"
40#include "xfs_log.h"
41#include "xfs_icache.h"
42#include "xfs_pnfs.h"
43#include "xfs_btree.h"
44#include "xfs_refcount_btree.h"
45#include "xfs_refcount.h"
46#include "xfs_bmap_btree.h"
47#include "xfs_trans_space.h"
48#include "xfs_bit.h"
49#include "xfs_alloc.h"
50#include "xfs_quota_defs.h"
51#include "xfs_quota.h"
52#include "xfs_btree.h"
53#include "xfs_bmap_btree.h"
54#include "xfs_reflink.h"
55#include "xfs_iomap.h"
56#include "xfs_rmap_btree.h"
57#include "xfs_sb.h"
58#include "xfs_ag_resv.h"
59
60/*
61 * Copy on Write of Shared Blocks
62 *
63 * XFS must preserve "the usual" file semantics even when two files share
64 * the same physical blocks. This means that a write to one file must not
65 * alter the blocks in a different file; the way that we'll do that is
66 * through the use of a copy-on-write mechanism. At a high level, that
67 * means that when we want to write to a shared block, we allocate a new
68 * block, write the data to the new block, and if that succeeds we map the
69 * new block into the file.
70 *
71 * XFS provides a "delayed allocation" mechanism that defers the allocation
72 * of disk blocks to dirty-but-not-yet-mapped file blocks as long as
73 * possible. This reduces fragmentation by enabling the filesystem to ask
74 * for bigger chunks less often, which is exactly what we want for CoW.
75 *
76 * The delalloc mechanism begins when the kernel wants to make a block
77 * writable (write_begin or page_mkwrite). If the offset is not mapped, we
78 * create a delalloc mapping, which is a regular in-core extent, but without
79 * a real startblock. (For delalloc mappings, the startblock encodes both
80 * a flag that this is a delalloc mapping, and a worst-case estimate of how
81 * many blocks might be required to put the mapping into the BMBT.)  Delalloc
82 * mappings are reservations against the free space in the filesystem;
83 * adjacent mappings can also be combined into fewer larger mappings.
84 *
85 * When dirty pages are being written out (typically in writepage), the
86 * delalloc reservations are converted into real mappings by allocating
87 * blocks and replacing the delalloc mapping with real ones. A delalloc
88 * mapping can be replaced by several real ones if the free space is
89 * fragmented.
90 *
91 * We want to adapt the delalloc mechanism for copy-on-write, since the
92 * write paths are similar. The first two steps (creating the reservation
93 * and allocating the blocks) are exactly the same as delalloc except that
94 * the mappings must be stored in a separate CoW fork because we do not want
95 * to disturb the mapping in the data fork until we're sure that the write
96 * succeeded. IO completion in this case is the process of removing the old
97 * mapping from the data fork and moving the new mapping from the CoW fork to
98 * the data fork. This will be discussed shortly.
99 *
100 * For now, unaligned directio writes will be bounced back to the page cache.
101 * Block-aligned directio writes will use the same mechanism as buffered
102 * writes.
103 *
104 * CoW remapping must be done after the data block write completes,
105 * because we don't want to destroy the old data fork map until we're sure
106 * the new block has been written. Since the new mappings are kept in a
107 * separate fork, we can simply iterate these mappings to find the ones
108 * that cover the file blocks that we just CoW'd. For each extent, simply
109 * unmap the corresponding range in the data fork, map the new range into
110 * the data fork, and remove the extent from the CoW fork.
111 *
112 * Since the remapping operation can be applied to an arbitrary file
113 * range, we record the need for the remap step as a flag in the ioend
114 * instead of declaring a new IO type. This is required for direct io
115 * because we only have one ioend for the whole dio, and we have to be
116 * able to remember the presence of unwritten blocks and CoW blocks
117 * with a single ioend structure.  The more ground we can cover with
118 * one ioend, the better.
119 */
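The CoW scheme described above can be modeled in a few lines of user-space C. This is a hypothetical toy, not kernel code: a "fork" is just an array of extents, and after a CoW write completes, the CoW-fork extent's blocks replace the overlapping mapping in the data fork. All names (`struct extent`, `remap_after_cow`) are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an in-core extent record. */
struct extent {
	long startoff;		/* file offset, in blocks */
	long startblock;	/* disk block number */
	long blockcount;	/* length in blocks */
};

/*
 * Model of CoW IO completion: point each overlapping data-fork
 * mapping at the freshly written CoW blocks.  For simplicity this
 * sketch assumes the CoW extent fully covers each overlapped extent.
 */
static void remap_after_cow(struct extent *data, size_t ndata,
			    const struct extent *cow)
{
	for (size_t i = 0; i < ndata; i++) {
		long dend = data[i].startoff + data[i].blockcount;
		long cend = cow->startoff + cow->blockcount;

		if (cow->startoff >= dend || cend <= data[i].startoff)
			continue;	/* no overlap with this extent */
		data[i].startblock = cow->startblock +
				     (data[i].startoff - cow->startoff);
	}
}
```

The real code additionally unmaps the old data-fork range and removes the extent from the CoW fork, under a rolling transaction.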
120
121/*
122 * Given an AG extent, find the lowest-numbered run of shared blocks
123 * within that range and return the range in fbno/flen. If
124 * find_end_of_shared is true, return the longest contiguous extent of
125 * shared blocks. If there are no shared extents, fbno and flen will
126 * be set to NULLAGBLOCK and 0, respectively.
127 */
128int
129xfs_reflink_find_shared(
130 struct xfs_mount *mp,
131 xfs_agnumber_t agno,
132 xfs_agblock_t agbno,
133 xfs_extlen_t aglen,
134 xfs_agblock_t *fbno,
135 xfs_extlen_t *flen,
136 bool find_end_of_shared)
137{
138 struct xfs_buf *agbp;
139 struct xfs_btree_cur *cur;
140 int error;
141
142 error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
143 if (error)
144 return error;
145
146 cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
147
148 error = xfs_refcount_find_shared(cur, agbno, aglen, fbno, flen,
149 find_end_of_shared);
150
151 xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
152
153 xfs_buf_relse(agbp);
154 return error;
155}
156
157/*
158 * Trim the mapping to the next block where there's a change in the
159 * shared/unshared status. More specifically, this means that we
160 * find the lowest-numbered extent of shared blocks that coincides with
161 * the given block mapping. If the shared extent overlaps the start of
162 * the mapping, trim the mapping to the end of the shared extent. If
163 * the shared extent begins partway into the mapping, trim the mapping
164 * to the start of the shared extent.  If there are no shared regions that
165 * overlap, just return the original extent.
166 */
167int
168xfs_reflink_trim_around_shared(
169 struct xfs_inode *ip,
170 struct xfs_bmbt_irec *irec,
171 bool *shared,
172 bool *trimmed)
173{
174 xfs_agnumber_t agno;
175 xfs_agblock_t agbno;
176 xfs_extlen_t aglen;
177 xfs_agblock_t fbno;
178 xfs_extlen_t flen;
179 int error = 0;
180
181 /* Holes, unwritten, and delalloc extents cannot be shared */
182 if (!xfs_is_reflink_inode(ip) ||
183 ISUNWRITTEN(irec) ||
184 irec->br_startblock == HOLESTARTBLOCK ||
185 irec->br_startblock == DELAYSTARTBLOCK) {
186 *shared = false;
187 return 0;
188 }
189
190 trace_xfs_reflink_trim_around_shared(ip, irec);
191
192 agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
193 agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
194 aglen = irec->br_blockcount;
195
196 error = xfs_reflink_find_shared(ip->i_mount, agno, agbno,
197 aglen, &fbno, &flen, true);
198 if (error)
199 return error;
200
201 *shared = *trimmed = false;
202 if (fbno == NULLAGBLOCK) {
203 /* No shared blocks at all. */
204 return 0;
205 } else if (fbno == agbno) {
206 /*
207 * The start of this extent is shared. Truncate the
208 * mapping at the end of the shared region so that a
209 * subsequent iteration starts at the start of the
210 * unshared region.
211 */
212 irec->br_blockcount = flen;
213 *shared = true;
214 if (flen != aglen)
215 *trimmed = true;
216 return 0;
217 } else {
218 /*
219 * There's a shared extent midway through this extent.
220 * Truncate the mapping at the start of the shared
221 * extent so that a subsequent iteration starts at the
222 * start of the shared region.
223 */
224 irec->br_blockcount = fbno - agbno;
225 *trimmed = true;
226 return 0;
227 }
228}
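The three trimming cases above reduce to simple interval arithmetic. The sketch below is a hedged user-space restatement, not the kernel function: given a mapping [mbno, mbno+mlen) and the first shared run [fbno, fbno+flen) within it (fbno == -1 if none), it returns the trimmed length so the result is either fully shared or fully unshared. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Trim a mapping of mlen blocks starting at mbno around the first
 * shared run (fbno/flen), mirroring xfs_reflink_trim_around_shared:
 *  - no shared run: keep the whole mapping, not shared
 *  - shared run starts at mbno: keep only the shared prefix
 *  - shared run starts mid-mapping: keep only the unshared prefix
 */
static long trim_around_shared(long mbno, long mlen,
			       long fbno, long flen, bool *shared)
{
	if (fbno == -1) {
		*shared = false;
		return mlen;
	}
	if (fbno == mbno) {
		*shared = true;
		return flen < mlen ? flen : mlen;
	}
	*shared = false;
	return fbno - mbno;
}
```

Each call trims to the next shared/unshared boundary, so a caller can walk a range by advancing past the returned length and repeating.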
229
230/* Create a CoW reservation for a range of blocks within a file. */
231static int
232__xfs_reflink_reserve_cow(
233 struct xfs_inode *ip,
234 xfs_fileoff_t *offset_fsb,
235 xfs_fileoff_t end_fsb,
236 bool *skipped)
237{
238 struct xfs_bmbt_irec got, prev, imap;
239 xfs_fileoff_t orig_end_fsb;
240 int nimaps, eof = 0, error = 0;
241 bool shared = false, trimmed = false;
242 xfs_extnum_t idx;
243 xfs_extlen_t align;
244
245 /* Already reserved? Skip the refcount btree access. */
246 xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
247 &got, &prev);
248 if (!eof && got.br_startoff <= *offset_fsb) {
249 end_fsb = orig_end_fsb = got.br_startoff + got.br_blockcount;
250 trace_xfs_reflink_cow_found(ip, &got);
251 goto done;
252 }
253
254 /* Read extent from the source file. */
255 nimaps = 1;
256 error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
257 &imap, &nimaps, 0);
258 if (error)
259 goto out_unlock;
260 ASSERT(nimaps == 1);
261
262 /* Trim the mapping to the nearest shared extent boundary. */
263 error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
264 if (error)
265 goto out_unlock;
266
267 end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
268
269 /* Not shared? Just report the (potentially capped) extent. */
270 if (!shared) {
271 *skipped = true;
272 goto done;
273 }
274
275 /*
276 * Fork all the shared blocks from our write offset until the end of
277 * the extent.
278 */
279 error = xfs_qm_dqattach_locked(ip, 0);
280 if (error)
281 goto out_unlock;
282
283 align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
284 if (align)
285 end_fsb = roundup_64(end_fsb, align);
286
287retry:
288 error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
289 end_fsb - *offset_fsb, &got,
290 &prev, &idx, eof);
291 switch (error) {
292 case 0:
293 break;
294 case -ENOSPC:
295 case -EDQUOT:
296 /* retry without any preallocation */
297 trace_xfs_reflink_cow_enospc(ip, &imap);
298 if (end_fsb != orig_end_fsb) {
299 end_fsb = orig_end_fsb;
300 goto retry;
301 }
302 /*FALLTHRU*/
303 default:
304 goto out_unlock;
305 }
306
307 if (end_fsb != orig_end_fsb)
308 xfs_inode_set_cowblocks_tag(ip);
309
310 trace_xfs_reflink_cow_alloc(ip, &got);
311done:
312 *offset_fsb = end_fsb;
313out_unlock:
314 return error;
315}
316
317/* Create a CoW reservation for part of a file. */
318int
319xfs_reflink_reserve_cow_range(
320 struct xfs_inode *ip,
321 xfs_off_t offset,
322 xfs_off_t count)
323{
324 struct xfs_mount *mp = ip->i_mount;
325 xfs_fileoff_t offset_fsb, end_fsb;
326 bool skipped = false;
327 int error;
328
329 trace_xfs_reflink_reserve_cow_range(ip, offset, count);
330
331 offset_fsb = XFS_B_TO_FSBT(mp, offset);
332 end_fsb = XFS_B_TO_FSB(mp, offset + count);
333
334 xfs_ilock(ip, XFS_ILOCK_EXCL);
335 while (offset_fsb < end_fsb) {
336 error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb,
337 &skipped);
338 if (error) {
339 trace_xfs_reflink_reserve_cow_range_error(ip, error,
340 _RET_IP_);
341 break;
342 }
343 }
344 xfs_iunlock(ip, XFS_ILOCK_EXCL);
345
346 return error;
347}
348
349/* Allocate all CoW reservations covering a range of blocks in a file. */
350static int
351__xfs_reflink_allocate_cow(
352 struct xfs_inode *ip,
353 xfs_fileoff_t *offset_fsb,
354 xfs_fileoff_t end_fsb)
355{
356 struct xfs_mount *mp = ip->i_mount;
357 struct xfs_bmbt_irec imap;
358 struct xfs_defer_ops dfops;
359 struct xfs_trans *tp;
360 xfs_fsblock_t first_block;
361 xfs_fileoff_t next_fsb;
362 int nimaps = 1, error;
363 bool skipped = false;
364
365 xfs_defer_init(&dfops, &first_block);
366
367 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
368 XFS_TRANS_RESERVE, &tp);
369 if (error)
370 return error;
371
372 xfs_ilock(ip, XFS_ILOCK_EXCL);
373
374 next_fsb = *offset_fsb;
375 error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
376 if (error)
377 goto out_trans_cancel;
378
379 if (skipped) {
380 *offset_fsb = next_fsb;
381 goto out_trans_cancel;
382 }
383
384 xfs_trans_ijoin(tp, ip, 0);
385 error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
386 XFS_BMAPI_COWFORK, &first_block,
387 XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
388 &imap, &nimaps, &dfops);
389 if (error)
390 goto out_trans_cancel;
391
392 /* We might not have been able to map the whole delalloc extent */
393 *offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
394
395 error = xfs_defer_finish(&tp, &dfops, NULL);
396 if (error)
397 goto out_trans_cancel;
398
399 error = xfs_trans_commit(tp);
400
401out_unlock:
402 xfs_iunlock(ip, XFS_ILOCK_EXCL);
403 return error;
404out_trans_cancel:
405 xfs_defer_cancel(&dfops);
406 xfs_trans_cancel(tp);
407 goto out_unlock;
408}
409
410/* Allocate all CoW reservations covering a part of a file. */
411int
412xfs_reflink_allocate_cow_range(
413 struct xfs_inode *ip,
414 xfs_off_t offset,
415 xfs_off_t count)
416{
417 struct xfs_mount *mp = ip->i_mount;
418 xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
419 xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, offset + count);
420 int error;
421
422 ASSERT(xfs_is_reflink_inode(ip));
423
424 trace_xfs_reflink_allocate_cow_range(ip, offset, count);
425
426 /*
427 * Make sure that the dquots are there.
428 */
429 error = xfs_qm_dqattach(ip, 0);
430 if (error)
431 return error;
432
433 while (offset_fsb < end_fsb) {
434 error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
435 if (error) {
436 trace_xfs_reflink_allocate_cow_range_error(ip, error,
437 _RET_IP_);
438 break;
439 }
440 }
441
442 return error;
443}
444
445/*
446 * Find the CoW reservation (and whether or not it needs block allocation)
447 * for a given byte offset of a file.
448 */
449bool
450xfs_reflink_find_cow_mapping(
451 struct xfs_inode *ip,
452 xfs_off_t offset,
453 struct xfs_bmbt_irec *imap,
454 bool *need_alloc)
455{
456 struct xfs_bmbt_irec irec;
457 struct xfs_ifork *ifp;
458 struct xfs_bmbt_rec_host *gotp;
459 xfs_fileoff_t bno;
460 xfs_extnum_t idx;
461
462 ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
463 ASSERT(xfs_is_reflink_inode(ip));
464
465 /* Find the extent in the CoW fork. */
466 ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
467 bno = XFS_B_TO_FSBT(ip->i_mount, offset);
468 gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
469 if (!gotp)
470 return false;
471
472 xfs_bmbt_get_all(gotp, &irec);
473 if (bno >= irec.br_startoff + irec.br_blockcount ||
474 bno < irec.br_startoff)
475 return false;
476
477 trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
478 &irec);
479
480 /* If it's still delalloc, we must allocate later. */
481 *imap = irec;
482 *need_alloc = !!(isnullstartblock(irec.br_startblock));
483
484 return true;
485}
486
487/*
488 * Trim an extent to end at the next CoW reservation past offset_fsb.
489 */
490int
491xfs_reflink_trim_irec_to_next_cow(
492 struct xfs_inode *ip,
493 xfs_fileoff_t offset_fsb,
494 struct xfs_bmbt_irec *imap)
495{
496 struct xfs_bmbt_irec irec;
497 struct xfs_ifork *ifp;
498 struct xfs_bmbt_rec_host *gotp;
499 xfs_extnum_t idx;
500
501 if (!xfs_is_reflink_inode(ip))
502 return 0;
503
504 /* Find the extent in the CoW fork. */
505 ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
506 gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
507 if (!gotp)
508 return 0;
509 xfs_bmbt_get_all(gotp, &irec);
510
511 /* This is the extent before; try sliding up one. */
512 if (irec.br_startoff < offset_fsb) {
513 idx++;
514 if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
515 return 0;
516 gotp = xfs_iext_get_ext(ifp, idx);
517 xfs_bmbt_get_all(gotp, &irec);
518 }
519
520 if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
521 return 0;
522
523 imap->br_blockcount = irec.br_startoff - imap->br_startoff;
524 trace_xfs_reflink_trim_irec(ip, imap);
525
526 return 0;
527}
528
529/*
530 * Cancel all pending CoW reservations for some block range of an inode.
531 */
532int
533xfs_reflink_cancel_cow_blocks(
534 struct xfs_inode *ip,
535 struct xfs_trans **tpp,
536 xfs_fileoff_t offset_fsb,
537 xfs_fileoff_t end_fsb)
538{
539 struct xfs_bmbt_irec irec;
540 xfs_filblks_t count_fsb;
541 xfs_fsblock_t firstfsb;
542 struct xfs_defer_ops dfops;
543 int error = 0;
544 int nimaps;
545
546 if (!xfs_is_reflink_inode(ip))
547 return 0;
548
549 /* Go find the old extent in the CoW fork. */
550 while (offset_fsb < end_fsb) {
551 nimaps = 1;
552 count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
553 error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
554 &nimaps, XFS_BMAPI_COWFORK);
555 if (error)
556 break;
557 ASSERT(nimaps == 1);
558
559 trace_xfs_reflink_cancel_cow(ip, &irec);
560
561 if (irec.br_startblock == DELAYSTARTBLOCK) {
562 /* Free a delayed allocation. */
563 xfs_mod_fdblocks(ip->i_mount, irec.br_blockcount,
564 false);
565 ip->i_delayed_blks -= irec.br_blockcount;
566
567 /* Remove the mapping from the CoW fork. */
568 error = xfs_bunmapi_cow(ip, &irec);
569 if (error)
570 break;
571 } else if (irec.br_startblock == HOLESTARTBLOCK) {
572 /* empty */
573 } else {
574 xfs_trans_ijoin(*tpp, ip, 0);
575 xfs_defer_init(&dfops, &firstfsb);
576
577 /* Free the CoW orphan record. */
578 error = xfs_refcount_free_cow_extent(ip->i_mount,
579 &dfops, irec.br_startblock,
580 irec.br_blockcount);
581 if (error)
582 break;
583
584 xfs_bmap_add_free(ip->i_mount, &dfops,
585 irec.br_startblock, irec.br_blockcount,
586 NULL);
587
588 /* Update quota accounting */
589 xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
590 -(long)irec.br_blockcount);
591
592 /* Roll the transaction */
593 error = xfs_defer_finish(tpp, &dfops, ip);
594 if (error) {
595 xfs_defer_cancel(&dfops);
596 break;
597 }
598
599 /* Remove the mapping from the CoW fork. */
600 error = xfs_bunmapi_cow(ip, &irec);
601 if (error)
602 break;
603 }
604
605 /* Roll on... */
606 offset_fsb = irec.br_startoff + irec.br_blockcount;
607 }
608
609 return error;
610}
611
612/*
613 * Cancel all pending CoW reservations for some byte range of an inode.
614 */
615int
616xfs_reflink_cancel_cow_range(
617 struct xfs_inode *ip,
618 xfs_off_t offset,
619 xfs_off_t count)
620{
621 struct xfs_trans *tp;
622 xfs_fileoff_t offset_fsb;
623 xfs_fileoff_t end_fsb;
624 int error;
625
626 trace_xfs_reflink_cancel_cow_range(ip, offset, count);
627 ASSERT(xfs_is_reflink_inode(ip));
628
629 offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
630 if (count == NULLFILEOFF)
631 end_fsb = NULLFILEOFF;
632 else
633 end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
634
635 /* Start a rolling transaction to remove the mappings */
636 error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
637 0, 0, 0, &tp);
638 if (error)
639 goto out;
640
641 xfs_ilock(ip, XFS_ILOCK_EXCL);
642 xfs_trans_ijoin(tp, ip, 0);
643
644 /* Scrape out the old CoW reservations */
645 error = xfs_reflink_cancel_cow_blocks(ip, &tp, offset_fsb, end_fsb);
646 if (error)
647 goto out_cancel;
648
649 error = xfs_trans_commit(tp);
650
651 xfs_iunlock(ip, XFS_ILOCK_EXCL);
652 return error;
653
654out_cancel:
655 xfs_trans_cancel(tp);
656 xfs_iunlock(ip, XFS_ILOCK_EXCL);
657out:
658 trace_xfs_reflink_cancel_cow_range_error(ip, error, _RET_IP_);
659 return error;
660}
661
662/*
663 * Remap parts of a file's data fork after a successful CoW.
664 */
665int
666xfs_reflink_end_cow(
667 struct xfs_inode *ip,
668 xfs_off_t offset,
669 xfs_off_t count)
670{
671 struct xfs_bmbt_irec irec;
672 struct xfs_bmbt_irec uirec;
673 struct xfs_trans *tp;
674 xfs_fileoff_t offset_fsb;
675 xfs_fileoff_t end_fsb;
676 xfs_filblks_t count_fsb;
677 xfs_fsblock_t firstfsb;
678 struct xfs_defer_ops dfops;
679 int error;
680 unsigned int resblks;
681 xfs_filblks_t ilen;
682 xfs_filblks_t rlen;
683 int nimaps;
684
685 trace_xfs_reflink_end_cow(ip, offset, count);
686
687 offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
688 end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
689 count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
690
691 /* Start a rolling transaction to switch the mappings */
692 resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
693 error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
694 resblks, 0, 0, &tp);
695 if (error)
696 goto out;
697
698 xfs_ilock(ip, XFS_ILOCK_EXCL);
699 xfs_trans_ijoin(tp, ip, 0);
700
701 /* Go find the old extent in the CoW fork. */
702 while (offset_fsb < end_fsb) {
703 /* Read extent from the source file */
704 nimaps = 1;
705 count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
706 error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
707 &nimaps, XFS_BMAPI_COWFORK);
708 if (error)
709 goto out_cancel;
710 ASSERT(nimaps == 1);
711
712 ASSERT(irec.br_startblock != DELAYSTARTBLOCK);
713 trace_xfs_reflink_cow_remap(ip, &irec);
714
715 /*
716 * We can have a hole in the CoW fork if part of a directio
717 * write is CoW but part of it isn't.
718 */
719 rlen = ilen = irec.br_blockcount;
720 if (irec.br_startblock == HOLESTARTBLOCK)
721 goto next_extent;
722
723 /* Unmap the old blocks in the data fork. */
724 while (rlen) {
725 xfs_defer_init(&dfops, &firstfsb);
726 error = __xfs_bunmapi(tp, ip, irec.br_startoff,
727 &rlen, 0, 1, &firstfsb, &dfops);
728 if (error)
729 goto out_defer;
730
731 /*
732 * Trim the extent to whatever got unmapped.
733 * Remember, bunmapi works backwards.
734 */
735 uirec.br_startblock = irec.br_startblock + rlen;
736 uirec.br_startoff = irec.br_startoff + rlen;
737 uirec.br_blockcount = irec.br_blockcount - rlen;
738 irec.br_blockcount = rlen;
739 trace_xfs_reflink_cow_remap_piece(ip, &uirec);
740
741 /* Free the CoW orphan record. */
742 error = xfs_refcount_free_cow_extent(tp->t_mountp,
743 &dfops, uirec.br_startblock,
744 uirec.br_blockcount);
745 if (error)
746 goto out_defer;
747
748 /* Map the new blocks into the data fork. */
749 error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
750 ip, &uirec);
751 if (error)
752 goto out_defer;
753
754 /* Remove the mapping from the CoW fork. */
755 error = xfs_bunmapi_cow(ip, &uirec);
756 if (error)
757 goto out_defer;
758
759 error = xfs_defer_finish(&tp, &dfops, ip);
760 if (error)
761 goto out_defer;
762 }
763
764next_extent:
765 /* Roll on... */
766 offset_fsb = irec.br_startoff + ilen;
767 }
768
769 error = xfs_trans_commit(tp);
770 xfs_iunlock(ip, XFS_ILOCK_EXCL);
771 if (error)
772 goto out;
773 return 0;
774
775out_defer:
776 xfs_defer_cancel(&dfops);
777out_cancel:
778 xfs_trans_cancel(tp);
779 xfs_iunlock(ip, XFS_ILOCK_EXCL);
780out:
781 trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
782 return error;
783}
784
785/*
786 * Free leftover CoW reservations that didn't get cleaned out.
787 */
788int
789xfs_reflink_recover_cow(
790 struct xfs_mount *mp)
791{
792 xfs_agnumber_t agno;
793 int error = 0;
794
795 if (!xfs_sb_version_hasreflink(&mp->m_sb))
796 return 0;
797
798 for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
799 error = xfs_refcount_recover_cow_leftovers(mp, agno);
800 if (error)
801 break;
802 }
803
804 return error;
805}
806
807/*
808 * Reflinking (Block) Ranges of Two Files Together
809 *
810 * First, ensure that the reflink flag is set on both inodes. The flag is an
811 * optimization to avoid unnecessary refcount btree lookups in the write path.
812 *
813 * Now we can iteratively remap the range of extents (and holes) in src to the
814 * corresponding ranges in dest. Let drange and srange denote the ranges of
815 * logical blocks in dest and src touched by the reflink operation.
816 *
817 * While the length of drange is greater than zero,
818 * - Read src's bmbt at the start of srange ("imap")
819 * - If imap doesn't exist, make imap appear to start at the end of srange
820 * with zero length.
821 * - If imap starts before srange, advance imap to start at srange.
822 * - If imap goes beyond srange, truncate imap to end at the end of srange.
823 * - Punch (imap start - srange start + imap len) blocks from dest at
824 * offset (drange start).
825 * - If imap points to a real range of pblks,
826 * > Increase the refcount of the imap's pblks
827 * > Map imap's pblks into dest at the offset
828 * (drange start + imap start - srange start)
829 * - Advance drange and srange by (imap start - srange start + imap len)
830 *
831 * Finally, if the reflink made dest longer, update both the in-core and
832 * on-disk file sizes.
833 *
834 * ASCII Art Demonstration:
835 *
836 * Let's say we want to reflink this source file:
837 *
838 * ----SSSSSSS-SSSSS----SSSSSS (src file)
839 * <-------------------->
840 *
841 * into this destination file:
842 *
843 * --DDDDDDDDDDDDDDDDDDD--DDD (dest file)
844 * <-------------------->
845 * '-' means a hole, and 'S' and 'D' are written blocks in the src and dest.
846 * Observe that the range has different logical offsets in either file.
847 *
848 * Consider that the first extent in the source file doesn't line up with our
849 * reflink range. Unmapping and remapping are separate operations, so we can
850 * unmap more blocks from the destination file than we remap.
851 *
852 * ----SSSSSSS-SSSSS----SSSSSS
853 * <------->
854 * --DDDDD---------DDDDD--DDD
855 * <------->
856 *
857 * Now remap the source extent into the destination file:
858 *
859 * ----SSSSSSS-SSSSS----SSSSSS
860 * <------->
861 * --DDDDD--SSSSSSSDDDDD--DDD
862 * <------->
863 *
864 * Do likewise with the second hole and extent in our range. Holes in the
865 * unmap range don't affect our operation.
866 *
867 * ----SSSSSSS-SSSSS----SSSSSS
868 * <---->
869 * --DDDDD--SSSSSSS-SSSSS-DDD
870 * <---->
871 *
872 * Finally, unmap and remap part of the third extent. This will increase the
873 * size of the destination file.
874 *
875 * ----SSSSSSS-SSSSS----SSSSSS
876 * <----->
877 * --DDDDD--SSSSSSS-SSSSS----SSS
878 * <----->
879 *
880 * Once we update the destination file's i_size, we're done.
881 */
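The per-iteration bookkeeping in the loop above (translate the source mapping into destination offsets, then advance drange and srange by the same amount) can be sketched as a tiny user-space helper. This is an illustrative model only; `struct irec` and `translate_imap` are hypothetical names, not kernel APIs.

```c
#include <assert.h>

/* Toy stand-in for a bmbt record. */
struct irec {
	long startoff;
	long startblock;
	long blockcount;
};

/*
 * Translate one source mapping into the destination file and return
 * how far both ranges advance: (imap end - srange start), matching
 * range_len in xfs_reflink_remap_blocks.
 */
static long translate_imap(struct irec *imap, long srcoff, long destoff)
{
	long range_len = imap->startoff + imap->blockcount - srcoff;

	imap->startoff += destoff - srcoff;
	return range_len;
}
```

In the real loop the translated mapping is then punched out of and remapped into the destination, and srcoff, destoff, and the remaining length all move by range_len.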
882
883/*
884 * Ensure the reflink bit is set in both inodes.
885 */
886STATIC int
887xfs_reflink_set_inode_flag(
888 struct xfs_inode *src,
889 struct xfs_inode *dest)
890{
891 struct xfs_mount *mp = src->i_mount;
892 int error;
893 struct xfs_trans *tp;
894
895 if (xfs_is_reflink_inode(src) && xfs_is_reflink_inode(dest))
896 return 0;
897
898 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
899 if (error)
900 goto out_error;
901
902 /* Lock both files against IO */
903 if (src->i_ino == dest->i_ino)
904 xfs_ilock(src, XFS_ILOCK_EXCL);
905 else
906 xfs_lock_two_inodes(src, dest, XFS_ILOCK_EXCL);
907
908 if (!xfs_is_reflink_inode(src)) {
909 trace_xfs_reflink_set_inode_flag(src);
910 xfs_trans_ijoin(tp, src, XFS_ILOCK_EXCL);
911 src->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
912 xfs_trans_log_inode(tp, src, XFS_ILOG_CORE);
913 xfs_ifork_init_cow(src);
914 } else
915 xfs_iunlock(src, XFS_ILOCK_EXCL);
916
917 if (src->i_ino == dest->i_ino)
918 goto commit_flags;
919
920 if (!xfs_is_reflink_inode(dest)) {
921 trace_xfs_reflink_set_inode_flag(dest);
922 xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
923 dest->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
924 xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
925 xfs_ifork_init_cow(dest);
926 } else
927 xfs_iunlock(dest, XFS_ILOCK_EXCL);
928
929commit_flags:
930 error = xfs_trans_commit(tp);
931 if (error)
932 goto out_error;
933 return error;
934
935out_error:
936 trace_xfs_reflink_set_inode_flag_error(dest, error, _RET_IP_);
937 return error;
938}
939
940/*
941 * Update destination inode size & cowextsize hint, if necessary.
942 */
943STATIC int
944xfs_reflink_update_dest(
945 struct xfs_inode *dest,
946 xfs_off_t newlen,
947 xfs_extlen_t cowextsize)
948{
949 struct xfs_mount *mp = dest->i_mount;
950 struct xfs_trans *tp;
951 int error;
952
953 if (newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
954 return 0;
955
956 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
957 if (error)
958 goto out_error;
959
960 xfs_ilock(dest, XFS_ILOCK_EXCL);
961 xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
962
963 if (newlen > i_size_read(VFS_I(dest))) {
964 trace_xfs_reflink_update_inode_size(dest, newlen);
965 i_size_write(VFS_I(dest), newlen);
966 dest->i_d.di_size = newlen;
967 }
968
969 if (cowextsize) {
970 dest->i_d.di_cowextsize = cowextsize;
971 dest->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
972 }
973
974 xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
975
976 error = xfs_trans_commit(tp);
977 if (error)
978 goto out_error;
979 return error;
980
981out_error:
982 trace_xfs_reflink_update_inode_size_error(dest, error, _RET_IP_);
983 return error;
984}
985
986/*
987 * Do we have enough reserve in this AG to handle a reflink? The refcount
988 * btree already reserved all the space it needs, but the rmap btree can grow
989 * infinitely, so we won't allow more reflinks when the AG is down to the
990 * btree reserves.
991 */
992static int
993xfs_reflink_ag_has_free_space(
994 struct xfs_mount *mp,
995 xfs_agnumber_t agno)
996{
997 struct xfs_perag *pag;
998 int error = 0;
999
1000 if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
1001 return 0;
1002
1003 pag = xfs_perag_get(mp, agno);
1004 if (xfs_ag_resv_critical(pag, XFS_AG_RESV_AGFL) ||
1005 xfs_ag_resv_critical(pag, XFS_AG_RESV_METADATA))
1006 error = -ENOSPC;
1007 xfs_perag_put(pag);
1008 return error;
1009}
1010
1011/*
1012 * Unmap a range of blocks from a file, then map other blocks into the hole.
1013 * The range to unmap is (destoff : irec->br_startoff + irec->br_blockcount).
1014 * The extent irec is mapped into dest at irec->br_startoff.
1015 */
1016STATIC int
1017xfs_reflink_remap_extent(
1018 struct xfs_inode *ip,
1019 struct xfs_bmbt_irec *irec,
1020 xfs_fileoff_t destoff,
1021 xfs_off_t new_isize)
1022{
1023 struct xfs_mount *mp = ip->i_mount;
1024 struct xfs_trans *tp;
1025 xfs_fsblock_t firstfsb;
1026 unsigned int resblks;
1027 struct xfs_defer_ops dfops;
1028 struct xfs_bmbt_irec uirec;
1029 bool real_extent;
1030 xfs_filblks_t rlen;
1031 xfs_filblks_t unmap_len;
1032 xfs_off_t newlen;
1033 int error;
1034
1035 unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
1036 trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
1037
1038 /* Only remap normal extents. */
1039 real_extent = (irec->br_startblock != HOLESTARTBLOCK &&
1040 irec->br_startblock != DELAYSTARTBLOCK &&
1041 !ISUNWRITTEN(irec));
1042
1043 /* No reflinking if we're low on space */
1044 if (real_extent) {
1045 error = xfs_reflink_ag_has_free_space(mp,
1046 XFS_FSB_TO_AGNO(mp, irec->br_startblock));
1047 if (error)
1048 goto out;
1049 }
1050
1051 /* Start a rolling transaction to switch the mappings */
1052 resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
1053 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
1054 if (error)
1055 goto out;
1056
1057 xfs_ilock(ip, XFS_ILOCK_EXCL);
1058 xfs_trans_ijoin(tp, ip, 0);
1059
1060 /* If we're not just clearing space, then do we have enough quota? */
1061 if (real_extent) {
1062 error = xfs_trans_reserve_quota_nblks(tp, ip,
1063 irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
1064 if (error)
1065 goto out_cancel;
1066 }
1067
1068 trace_xfs_reflink_remap(ip, irec->br_startoff,
1069 irec->br_blockcount, irec->br_startblock);
1070
1071 /* Unmap the old blocks in the data fork. */
1072 rlen = unmap_len;
1073 while (rlen) {
1074 xfs_defer_init(&dfops, &firstfsb);
1075 error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
1076 &firstfsb, &dfops);
1077 if (error)
1078 goto out_defer;
1079
1080 /*
1081 * Trim the extent to whatever got unmapped.
1082 * Remember, bunmapi works backwards.
1083 */
1084 uirec.br_startblock = irec->br_startblock + rlen;
1085 uirec.br_startoff = irec->br_startoff + rlen;
1086 uirec.br_blockcount = unmap_len - rlen;
1087 unmap_len = rlen;
1088
1089 /* If this isn't a real mapping, we're done. */
1090 if (!real_extent || uirec.br_blockcount == 0)
1091 goto next_extent;
1092
1093 trace_xfs_reflink_remap(ip, uirec.br_startoff,
1094 uirec.br_blockcount, uirec.br_startblock);
1095
1096 /* Update the refcount tree */
1097 error = xfs_refcount_increase_extent(mp, &dfops, &uirec);
1098 if (error)
1099 goto out_defer;
1100
1101 /* Map the new blocks into the data fork. */
1102 error = xfs_bmap_map_extent(mp, &dfops, ip, &uirec);
1103 if (error)
1104 goto out_defer;
1105
1106 /* Update quota accounting. */
1107 xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
1108 uirec.br_blockcount);
1109
1110 /* Update dest isize if needed. */
1111 newlen = XFS_FSB_TO_B(mp,
1112 uirec.br_startoff + uirec.br_blockcount);
1113 newlen = min_t(xfs_off_t, newlen, new_isize);
1114 if (newlen > i_size_read(VFS_I(ip))) {
1115 trace_xfs_reflink_update_inode_size(ip, newlen);
1116 i_size_write(VFS_I(ip), newlen);
1117 ip->i_d.di_size = newlen;
1118 xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
1119 }
1120
1121next_extent:
1122 /* Process all the deferred stuff. */
1123 error = xfs_defer_finish(&tp, &dfops, ip);
1124 if (error)
1125 goto out_defer;
1126 }
1127
1128 error = xfs_trans_commit(tp);
1129 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1130 if (error)
1131 goto out;
1132 return 0;
1133
1134out_defer:
1135 xfs_defer_cancel(&dfops);
1136out_cancel:
1137 xfs_trans_cancel(tp);
1138 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1139out:
1140 trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
1141 return error;
1142}
1143
1144/*
1145 * Iteratively remap one file's extents (and holes) to another's.
1146 */
1147STATIC int
1148xfs_reflink_remap_blocks(
1149 struct xfs_inode *src,
1150 xfs_fileoff_t srcoff,
1151 struct xfs_inode *dest,
1152 xfs_fileoff_t destoff,
1153 xfs_filblks_t len,
1154 xfs_off_t new_isize)
1155{
1156 struct xfs_bmbt_irec imap;
1157 int nimaps;
1158 int error = 0;
1159 xfs_filblks_t range_len;
1160
1161 /* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
1162 while (len) {
1163 trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
1164 dest, destoff);
1165 /* Read extent from the source file */
1166 nimaps = 1;
1167 xfs_ilock(src, XFS_ILOCK_EXCL);
1168 error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
1169 xfs_iunlock(src, XFS_ILOCK_EXCL);
1170 if (error)
1171 goto err;
1172 ASSERT(nimaps == 1);
1173
1174 trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
1175 &imap);
1176
1177 /* Translate imap into the destination file. */
1178 range_len = imap.br_startoff + imap.br_blockcount - srcoff;
1179 imap.br_startoff += destoff - srcoff;
1180
1181 /* Clear dest from destoff to the end of imap and map it in. */
1182 error = xfs_reflink_remap_extent(dest, &imap, destoff,
1183 new_isize);
1184 if (error)
1185 goto err;
1186
1187 if (fatal_signal_pending(current)) {
1188 error = -EINTR;
1189 goto err;
1190 }
1191
1192 /* Advance drange/srange */
1193 srcoff += range_len;
1194 destoff += range_len;
1195 len -= range_len;
1196 }
1197
1198 return 0;
1199
1200err:
1201 trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
1202 return error;
1203}
1204
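The loop in xfs_reflink_remap_blocks() above advances by the distance from srcoff to the end of whatever mapping xfs_bmapi_read() returned, and shifts the mapping's start offset into the destination file. A hypothetical userspace sketch of that bookkeeping (simplified irec type; not the kernel structures):

```c
#include <assert.h>
#include <stdint.h>

struct irec { uint64_t br_startoff, br_blockcount; };

/* Returns range_len, the stride by which srcoff/destoff advance and
 * len shrinks; shifts the mapping by (destoff - srcoff). Unsigned
 * wraparound makes the shift correct even when destoff < srcoff. */
static uint64_t translate_imap(struct irec *imap,
                               uint64_t srcoff, uint64_t destoff)
{
    uint64_t range_len = imap->br_startoff + imap->br_blockcount - srcoff;

    imap->br_startoff += destoff - srcoff;
    return range_len;
}
```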
1205/*
1206 * Read a page's worth of file data into the page cache. Return the page
1207 * locked.
1208 */
1209static struct page *
1210xfs_get_page(
1211 struct inode *inode,
1212 xfs_off_t offset)
1213{
1214 struct address_space *mapping;
1215 struct page *page;
1216 pgoff_t n;
1217
1218 n = offset >> PAGE_SHIFT;
1219 mapping = inode->i_mapping;
1220 page = read_mapping_page(mapping, n, NULL);
1221 if (IS_ERR(page))
1222 return page;
1223 if (!PageUptodate(page)) {
1224 put_page(page);
1225 return ERR_PTR(-EIO);
1226 }
1227 lock_page(page);
1228 return page;
1229}
1230
1231/*
1232 * Compare extents of two files to see if they are the same.
1233 */
1234static int
1235xfs_compare_extents(
1236 struct inode *src,
1237 xfs_off_t srcoff,
1238 struct inode *dest,
1239 xfs_off_t destoff,
1240 xfs_off_t len,
1241 bool *is_same)
1242{
1243 xfs_off_t src_poff;
1244 xfs_off_t dest_poff;
1245 void *src_addr;
1246 void *dest_addr;
1247 struct page *src_page;
1248 struct page *dest_page;
1249 xfs_off_t cmp_len;
1250 bool same;
1251 int error;
1252
1253 error = -EINVAL;
1254 same = true;
1255 while (len) {
1256 src_poff = srcoff & (PAGE_SIZE - 1);
1257 dest_poff = destoff & (PAGE_SIZE - 1);
1258 cmp_len = min(PAGE_SIZE - src_poff,
1259 PAGE_SIZE - dest_poff);
1260 cmp_len = min(cmp_len, len);
1261 ASSERT(cmp_len > 0);
1262
1263 trace_xfs_reflink_compare_extents(XFS_I(src), srcoff, cmp_len,
1264 XFS_I(dest), destoff);
1265
1266 src_page = xfs_get_page(src, srcoff);
1267 if (IS_ERR(src_page)) {
1268 error = PTR_ERR(src_page);
1269 goto out_error;
1270 }
1271 dest_page = xfs_get_page(dest, destoff);
1272 if (IS_ERR(dest_page)) {
1273 error = PTR_ERR(dest_page);
1274 unlock_page(src_page);
1275 put_page(src_page);
1276 goto out_error;
1277 }
1278 src_addr = kmap_atomic(src_page);
1279 dest_addr = kmap_atomic(dest_page);
1280
1281 flush_dcache_page(src_page);
1282 flush_dcache_page(dest_page);
1283
1284 if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
1285 same = false;
1286
1287 kunmap_atomic(dest_addr);
1288 kunmap_atomic(src_addr);
1289 unlock_page(dest_page);
1290 unlock_page(src_page);
1291 put_page(dest_page);
1292 put_page(src_page);
1293
1294 if (!same)
1295 break;
1296
1297 srcoff += cmp_len;
1298 destoff += cmp_len;
1299 len -= cmp_len;
1300 }
1301
1302 *is_same = same;
1303 return 0;
1304
1305out_error:
1306 trace_xfs_reflink_compare_extents_error(XFS_I(dest), error, _RET_IP_);
1307 return error;
1308}
1309
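Each pass of xfs_compare_extents() above can only compare up to the end of whichever page (source or destination) ends first, since the two offsets may have different alignments within their pages. A userspace sketch of the window calculation (hypothetical helper; PAGE_SIZE fixed at 4096 here):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL  /* assumed page size */

/* cmp_len = min(bytes left in src page, bytes left in dest page, len) */
static size_t cmp_window(size_t srcoff, size_t destoff, size_t len)
{
    size_t src_poff  = srcoff  & (PAGE_SIZE - 1);
    size_t dest_poff = destoff & (PAGE_SIZE - 1);
    size_t cmp = PAGE_SIZE - src_poff;

    if (PAGE_SIZE - dest_poff < cmp)
        cmp = PAGE_SIZE - dest_poff;
    if (len < cmp)
        cmp = len;
    return cmp;
}
```

Because the window is always positive and bounded by len, repeatedly advancing both offsets by the window covers the whole range, matching the ASSERT(cmp_len > 0) in the loop.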
1310/*
1311 * Link a range of blocks from one file to another.
1312 */
1313int
1314xfs_reflink_remap_range(
1315 struct xfs_inode *src,
1316 xfs_off_t srcoff,
1317 struct xfs_inode *dest,
1318 xfs_off_t destoff,
1319 xfs_off_t len,
1320 unsigned int flags)
1321{
1322 struct xfs_mount *mp = src->i_mount;
1323 xfs_fileoff_t sfsbno, dfsbno;
1324 xfs_filblks_t fsblen;
1325 int error;
1326 xfs_extlen_t cowextsize;
1327 bool is_same;
1328
1329 if (!xfs_sb_version_hasreflink(&mp->m_sb))
1330 return -EOPNOTSUPP;
1331
1332 if (XFS_FORCED_SHUTDOWN(mp))
1333 return -EIO;
1334
1335 /* Don't reflink realtime inodes */
1336 if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
1337 return -EINVAL;
1338
1339 if (flags & ~XFS_REFLINK_ALL)
1340 return -EINVAL;
1341
1342 trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
1343
1344 /* Lock both files against IO */
1345 if (src->i_ino == dest->i_ino) {
1346 xfs_ilock(src, XFS_IOLOCK_EXCL);
1347 xfs_ilock(src, XFS_MMAPLOCK_EXCL);
1348 } else {
1349 xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
1350 xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
1351 }
1352
1353 /*
1354 * Check that the extents are the same.
1355 */
1356 if (flags & XFS_REFLINK_DEDUPE) {
1357 is_same = false;
1358 error = xfs_compare_extents(VFS_I(src), srcoff, VFS_I(dest),
1359 destoff, len, &is_same);
1360 if (error)
1361 goto out_error;
1362 if (!is_same) {
1363 error = -EBADE;
1364 goto out_error;
1365 }
1366 }
1367
1368 error = xfs_reflink_set_inode_flag(src, dest);
1369 if (error)
1370 goto out_error;
1371
1372 /*
1373 * Invalidate the page cache so that we can clear any CoW mappings
1374 * in the destination file.
1375 */
1376 truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
1377 PAGE_ALIGN(destoff + len) - 1);
1378
1379 dfsbno = XFS_B_TO_FSBT(mp, destoff);
1380 sfsbno = XFS_B_TO_FSBT(mp, srcoff);
1381 fsblen = XFS_B_TO_FSB(mp, len);
1382 error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
1383 destoff + len);
1384 if (error)
1385 goto out_error;
1386
1387 /*
1388 * Carry the cowextsize hint from src to dest if we're sharing the
1389 * entire source file to the entire destination file, the source file
1390 * has a cowextsize hint, and the destination file does not.
1391 */
1392 cowextsize = 0;
1393 if (srcoff == 0 && len == i_size_read(VFS_I(src)) &&
1394 (src->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) &&
1395 destoff == 0 && len >= i_size_read(VFS_I(dest)) &&
1396 !(dest->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
1397 cowextsize = src->i_d.di_cowextsize;
1398
1399 error = xfs_reflink_update_dest(dest, destoff + len, cowextsize);
1400 if (error)
1401 goto out_error;
1402
1403out_error:
1404 xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
1405 xfs_iunlock(src, XFS_IOLOCK_EXCL);
1406 if (src->i_ino != dest->i_ino) {
1407 xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
1408 xfs_iunlock(dest, XFS_IOLOCK_EXCL);
1409 }
1410 if (error)
1411 trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
1412 return error;
1413}
1414
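xfs_reflink_remap_range() above converts the byte range into blocks with two different rounding rules: the start offset truncates downward (the "T" in XFS_B_TO_FSBT) while the length rounds upward (XFS_B_TO_FSB), so the block range always covers the full byte range. A sketch assuming a 4096-byte block (hypothetical helpers, not the kernel macros):

```c
#include <assert.h>
#include <stdint.h>

#define BLKSIZE 4096ULL  /* assumed filesystem block size */

/* Round a byte count down to whole blocks (start offsets). */
static uint64_t b_to_fsbt(uint64_t b) { return b / BLKSIZE; }

/* Round a byte count up to whole blocks (lengths). */
static uint64_t b_to_fsb(uint64_t b)  { return (b + BLKSIZE - 1) / BLKSIZE; }
```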
1415/*
1416 * The user wants to preemptively CoW all shared blocks in this file,
1417 * which enables us to turn off the reflink flag. Iterate all
1418 * extents which are not prealloc/delalloc to see which ranges are
1419 * mentioned in the refcount tree, then read those blocks into the
1420 * pagecache, dirty them, fsync them back out, and then we can update
1421 * the inode flag. What happens if we run out of memory? :)
1422 */
1423STATIC int
1424xfs_reflink_dirty_extents(
1425 struct xfs_inode *ip,
1426 xfs_fileoff_t fbno,
1427 xfs_filblks_t end,
1428 xfs_off_t isize)
1429{
1430 struct xfs_mount *mp = ip->i_mount;
1431 xfs_agnumber_t agno;
1432 xfs_agblock_t agbno;
1433 xfs_extlen_t aglen;
1434 xfs_agblock_t rbno;
1435 xfs_extlen_t rlen;
1436 xfs_off_t fpos;
1437 xfs_off_t flen;
1438 struct xfs_bmbt_irec map[2];
1439 int nmaps;
1440 int error = 0;
1441
1442 while (end - fbno > 0) {
1443 nmaps = 1;
1444 /*
1445 * Look for extents in the file. Skip holes, delalloc, or
1446 * unwritten extents; they can't be reflinked.
1447 */
1448 error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
1449 if (error)
1450 goto out;
1451 if (nmaps == 0)
1452 break;
1453 if (map[0].br_startblock == HOLESTARTBLOCK ||
1454 map[0].br_startblock == DELAYSTARTBLOCK ||
1455 ISUNWRITTEN(&map[0]))
1456 goto next;
1457
1458 map[1] = map[0];
1459 while (map[1].br_blockcount) {
1460 agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
1461 agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
1462 aglen = map[1].br_blockcount;
1463
1464 error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
1465 &rbno, &rlen, true);
1466 if (error)
1467 goto out;
1468 if (rbno == NULLAGBLOCK)
1469 break;
1470
1471 /* Dirty the pages */
1472 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1473 fpos = XFS_FSB_TO_B(mp, map[1].br_startoff +
1474 (rbno - agbno));
1475 flen = XFS_FSB_TO_B(mp, rlen);
1476 if (fpos + flen > isize)
1477 flen = isize - fpos;
1478 error = iomap_file_dirty(VFS_I(ip), fpos, flen,
1479 &xfs_iomap_ops);
1480 xfs_ilock(ip, XFS_ILOCK_EXCL);
1481 if (error)
1482 goto out;
1483
1484 map[1].br_blockcount -= (rbno - agbno + rlen);
1485 map[1].br_startoff += (rbno - agbno + rlen);
1486 map[1].br_startblock += (rbno - agbno + rlen);
1487 }
1488
1489next:
1490 fbno = map[0].br_startoff + map[0].br_blockcount;
1491 }
1492out:
1493 return error;
1494}
1495
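Inside xfs_reflink_dirty_extents() above, xfs_reflink_find_shared() reports a shared run as an AG-relative block (rbno) and length (rlen); the code translates that back to a file position via the mapping's startoff, then trims the mapping past the run. A hypothetical userspace sketch of those two steps (simplified mapping type keeping only the AG-relative start):

```c
#include <assert.h>
#include <stdint.h>

struct map { uint64_t br_startoff, br_startblock_agbno, br_blockcount; };

/* File offset of a shared run: startoff plus rbno's distance from the
 * mapping's own AG block. */
static uint64_t shared_run_fileoff(const struct map *m, uint64_t rbno)
{
    return m->br_startoff + (rbno - m->br_startblock_agbno);
}

/* Resume the scan just past the end of the shared run. */
static void advance_past_run(struct map *m, uint64_t rbno, uint64_t rlen)
{
    uint64_t consumed = (rbno - m->br_startblock_agbno) + rlen;

    m->br_startoff += consumed;
    m->br_startblock_agbno += consumed;
    m->br_blockcount -= consumed;
}
```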
1496/* Clear the inode reflink flag if there are no shared extents. */
1497int
1498xfs_reflink_clear_inode_flag(
1499 struct xfs_inode *ip,
1500 struct xfs_trans **tpp)
1501{
1502 struct xfs_mount *mp = ip->i_mount;
1503 xfs_fileoff_t fbno;
1504 xfs_filblks_t end;
1505 xfs_agnumber_t agno;
1506 xfs_agblock_t agbno;
1507 xfs_extlen_t aglen;
1508 xfs_agblock_t rbno;
1509 xfs_extlen_t rlen;
1510 struct xfs_bmbt_irec map;
1511 int nmaps;
1512 int error = 0;
1513
1514 ASSERT(xfs_is_reflink_inode(ip));
1515
1516 fbno = 0;
1517 end = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
1518 while (end - fbno > 0) {
1519 nmaps = 1;
1520 /*
1521 * Look for extents in the file. Skip holes, delalloc, or
1522 * unwritten extents; they can't be reflinked.
1523 */
1524 error = xfs_bmapi_read(ip, fbno, end - fbno, &map, &nmaps, 0);
1525 if (error)
1526 return error;
1527 if (nmaps == 0)
1528 break;
1529 if (map.br_startblock == HOLESTARTBLOCK ||
1530 map.br_startblock == DELAYSTARTBLOCK ||
1531 ISUNWRITTEN(&map))
1532 goto next;
1533
1534 agno = XFS_FSB_TO_AGNO(mp, map.br_startblock);
1535 agbno = XFS_FSB_TO_AGBNO(mp, map.br_startblock);
1536 aglen = map.br_blockcount;
1537
1538 error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
1539 &rbno, &rlen, false);
1540 if (error)
1541 return error;
1542 /* Is there still a shared block here? */
1543 if (rbno != NULLAGBLOCK)
1544 return 0;
1545next:
1546 fbno = map.br_startoff + map.br_blockcount;
1547 }
1548
1549 /*
1550 * We didn't find any shared blocks so turn off the reflink flag.
1551 * First, get rid of any leftover CoW mappings.
1552 */
1553 error = xfs_reflink_cancel_cow_blocks(ip, tpp, 0, NULLFILEOFF);
1554 if (error)
1555 return error;
1556
1557 /* Clear the inode flag. */
1558 trace_xfs_reflink_unset_inode_flag(ip);
1559 ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
1560 xfs_inode_clear_cowblocks_tag(ip);
1561 xfs_trans_ijoin(*tpp, ip, 0);
1562 xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
1563
1564 return error;
1565}
1566
1567/*
1568 * Clear the inode reflink flag if there are no shared extents and the size
1569 * hasn't changed.
1570 */
1571STATIC int
1572xfs_reflink_try_clear_inode_flag(
1573 struct xfs_inode *ip)
1574{
1575 struct xfs_mount *mp = ip->i_mount;
1576 struct xfs_trans *tp;
1577 int error = 0;
1578
1579 /* Start a rolling transaction to remove the mappings */
1580 error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
1581 if (error)
1582 return error;
1583
1584 xfs_ilock(ip, XFS_ILOCK_EXCL);
1585 xfs_trans_ijoin(tp, ip, 0);
1586
1587 error = xfs_reflink_clear_inode_flag(ip, &tp);
1588 if (error)
1589 goto cancel;
1590
1591 error = xfs_trans_commit(tp);
1592 if (error)
1593 goto out;
1594
1595 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1596 return 0;
1597cancel:
1598 xfs_trans_cancel(tp);
1599out:
1600 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1601 return error;
1602}
1603
1604/*
1605 * Pre-COW all shared blocks within a given byte range of a file and turn off
1606 * the reflink flag if we unshare all of the file's blocks.
1607 */
1608int
1609xfs_reflink_unshare(
1610 struct xfs_inode *ip,
1611 xfs_off_t offset,
1612 xfs_off_t len)
1613{
1614 struct xfs_mount *mp = ip->i_mount;
1615 xfs_fileoff_t fbno;
1616 xfs_filblks_t end;
1617 xfs_off_t isize;
1618 int error;
1619
1620 if (!xfs_is_reflink_inode(ip))
1621 return 0;
1622
1623 trace_xfs_reflink_unshare(ip, offset, len);
1624
1625 inode_dio_wait(VFS_I(ip));
1626
1627 /* Try to CoW the selected ranges */
1628 xfs_ilock(ip, XFS_ILOCK_EXCL);
1629 fbno = XFS_B_TO_FSBT(mp, offset);
1630 isize = i_size_read(VFS_I(ip));
1631 end = XFS_B_TO_FSB(mp, offset + len);
1632 error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
1633 if (error)
1634 goto out_unlock;
1635 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1636
1637 /* Wait for the IO to finish */
1638 error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
1639 if (error)
1640 goto out;
1641
1642 /* Turn off the reflink flag if possible. */
1643 error = xfs_reflink_try_clear_inode_flag(ip);
1644 if (error)
1645 goto out;
1646
1647 return 0;
1648
1649out_unlock:
1650 xfs_iunlock(ip, XFS_ILOCK_EXCL);
1651out:
1652 trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
1653 return error;
1654}
1655
1656/*
1657 * Does this inode have any real CoW reservations?
1658 */
1659bool
1660xfs_reflink_has_real_cow_blocks(
1661 struct xfs_inode *ip)
1662{
1663 struct xfs_bmbt_irec irec;
1664 struct xfs_ifork *ifp;
1665 struct xfs_bmbt_rec_host *gotp;
1666 xfs_extnum_t idx;
1667
1668 if (!xfs_is_reflink_inode(ip))
1669 return false;
1670
1671 /* Go find the old extent in the CoW fork. */
1672 ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
1673 gotp = xfs_iext_bno_to_ext(ifp, 0, &idx);
1674 while (gotp) {
1675 xfs_bmbt_get_all(gotp, &irec);
1676
1677 if (!isnullstartblock(irec.br_startblock))
1678 return true;
1679
1680 /* Roll on... */
1681 idx++;
1682 if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
1683 break;
1684 gotp = xfs_iext_get_ext(ifp, idx);
1685 }
1686
1687 return false;
1688}
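xfs_reflink_has_real_cow_blocks() above walks the CoW fork's extent records looking for any record whose start block is real (allocated) rather than a delalloc reservation. A userspace sketch of the same scan, with a sentinel value standing in for isnullstartblock() (hypothetical; the kernel encodes delalloc state in the block number itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELALLOC_BLOCK UINT64_MAX  /* stands in for isnullstartblock() */

/* Return nonzero if any extent holds a real allocated start block. */
static int has_real_cow_block(const uint64_t *startblocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (startblocks[i] != DELALLOC_BLOCK)
            return 1;
    return 0;
}
```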
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
new file mode 100644
index 000000000000..5dc3c8ac12aa
--- /dev/null
+++ b/fs/xfs/xfs_reflink.h
@@ -0,0 +1,58 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#ifndef __XFS_REFLINK_H
21#define __XFS_REFLINK_H 1
22
23extern int xfs_reflink_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
24 xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
25 xfs_extlen_t *flen, bool find_maximal);
26extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
27 struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
28
29extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
30 xfs_off_t offset, xfs_off_t count);
31extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
32 xfs_off_t offset, xfs_off_t count);
33extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
34 struct xfs_bmbt_irec *imap, bool *need_alloc);
35extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
36 xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
37
38extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
39 struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
40 xfs_fileoff_t end_fsb);
41extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
42 xfs_off_t count);
43extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
44 xfs_off_t count);
45extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
46#define XFS_REFLINK_DEDUPE 1 /* only reflink if contents match */
47#define XFS_REFLINK_ALL (XFS_REFLINK_DEDUPE)
48extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
49 struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
50 unsigned int flags);
51extern int xfs_reflink_clear_inode_flag(struct xfs_inode *ip,
52 struct xfs_trans **tpp);
53extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
54 xfs_off_t len);
55
56extern bool xfs_reflink_has_real_cow_blocks(struct xfs_inode *ip);
57
58#endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 0432a459871c..73c827831551 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -441,8 +441,11 @@ xfs_rui_recover(
441 XFS_FSB_TO_DADDR(mp, rmap->me_startblock));
442 switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
443 case XFS_RMAP_EXTENT_MAP:
444 case XFS_RMAP_EXTENT_MAP_SHARED:
445 case XFS_RMAP_EXTENT_UNMAP:
446 case XFS_RMAP_EXTENT_UNMAP_SHARED:
447 case XFS_RMAP_EXTENT_CONVERT:
448 case XFS_RMAP_EXTENT_CONVERT_SHARED:
449 case XFS_RMAP_EXTENT_ALLOC:
450 case XFS_RMAP_EXTENT_FREE:
451 op_ok = true;
@@ -481,12 +484,21 @@ xfs_rui_recover(
484 case XFS_RMAP_EXTENT_MAP:
485 type = XFS_RMAP_MAP;
486 break;
487 case XFS_RMAP_EXTENT_MAP_SHARED:
488 type = XFS_RMAP_MAP_SHARED;
489 break;
490 case XFS_RMAP_EXTENT_UNMAP:
491 type = XFS_RMAP_UNMAP;
492 break;
493 case XFS_RMAP_EXTENT_UNMAP_SHARED:
494 type = XFS_RMAP_UNMAP_SHARED;
495 break;
496 case XFS_RMAP_EXTENT_CONVERT:
497 type = XFS_RMAP_CONVERT;
498 break;
499 case XFS_RMAP_EXTENT_CONVERT_SHARED:
500 type = XFS_RMAP_CONVERT_SHARED;
501 break;
502 case XFS_RMAP_EXTENT_ALLOC:
503 type = XFS_RMAP_ALLOC;
504 break;
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index 6e812fe0fd43..12d48cd8f8a4 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -62,6 +62,7 @@ int xfs_stats_format(struct xfsstats __percpu *stats, char *buf)
62 { "ibt2", XFSSTAT_END_IBT_V2 },
63 { "fibt2", XFSSTAT_END_FIBT_V2 },
64 { "rmapbt", XFSSTAT_END_RMAP_V2 },
65 { "refcntbt", XFSSTAT_END_REFCOUNT },
66 /* we print both series of quota information together */
67 { "qm", XFSSTAT_END_QM },
68 };
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index 657865f51e78..79ad2e69fc33 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -213,7 +213,23 @@ struct xfsstats {
213 __uint32_t xs_rmap_2_alloc;
214 __uint32_t xs_rmap_2_free;
215 __uint32_t xs_rmap_2_moves;
216#define XFSSTAT_END_REFCOUNT (XFSSTAT_END_RMAP_V2 + 15)
217 __uint32_t xs_refcbt_2_lookup;
218 __uint32_t xs_refcbt_2_compare;
219 __uint32_t xs_refcbt_2_insrec;
220 __uint32_t xs_refcbt_2_delrec;
221 __uint32_t xs_refcbt_2_newroot;
222 __uint32_t xs_refcbt_2_killroot;
223 __uint32_t xs_refcbt_2_increment;
224 __uint32_t xs_refcbt_2_decrement;
225 __uint32_t xs_refcbt_2_lshift;
226 __uint32_t xs_refcbt_2_rshift;
227 __uint32_t xs_refcbt_2_split;
228 __uint32_t xs_refcbt_2_join;
229 __uint32_t xs_refcbt_2_alloc;
230 __uint32_t xs_refcbt_2_free;
231 __uint32_t xs_refcbt_2_moves;
232#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_REFCOUNT + 6)
233 __uint32_t xs_qm_dqreclaims;
234 __uint32_t xs_qm_dqreclaim_misses;
235 __uint32_t xs_qm_dquot_dups;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2d092f9577ca..ade4691e3f74 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,9 @@
47#include "xfs_sysfs.h"
48#include "xfs_ondisk.h"
49#include "xfs_rmap_item.h"
50#include "xfs_refcount_item.h"
51#include "xfs_bmap_item.h"
52#include "xfs_reflink.h"
53
54#include <linux/namei.h>
55#include <linux/init.h>
@@ -936,6 +939,7 @@ xfs_fs_destroy_inode(
939 struct inode *inode)
940{
941 struct xfs_inode *ip = XFS_I(inode);
942 int error;
943
944 trace_xfs_destroy_inode(ip);
945
@@ -943,6 +947,14 @@ xfs_fs_destroy_inode(
947 XFS_STATS_INC(ip->i_mount, vn_rele);
948 XFS_STATS_INC(ip->i_mount, vn_remove);
949
950 if (xfs_is_reflink_inode(ip)) {
951 error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
952 if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
953 xfs_warn(ip->i_mount,
954"Error %d while evicting CoW blocks for inode %llu.",
955 error, ip->i_ino);
956 }
957
958 xfs_inactive(ip);
959
960 ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
@@ -1006,6 +1018,16 @@ xfs_fs_drop_inode(
1018{
1019 struct xfs_inode *ip = XFS_I(inode);
1020
1021 /*
1022 * If this unlinked inode is in the middle of recovery, don't
1023 * drop the inode just yet; log recovery will take care of
1024 * that. See the comment for this inode flag.
1025 */
1026 if (ip->i_flags & XFS_IRECOVERY) {
1027 ASSERT(ip->i_mount->m_log->l_flags & XLOG_RECOVERY_NEEDED);
1028 return 0;
1029 }
1030
1031 return generic_drop_inode(inode) || (ip->i_flags & XFS_IDONTCACHE);
1032}
1033
@@ -1296,10 +1318,31 @@ xfs_fs_remount(
1318 xfs_restore_resvblks(mp);
1319 xfs_log_work_queue(mp);
1320 xfs_queue_eofblocks(mp);
1321
1322 /* Recover any CoW blocks that never got remapped. */
1323 error = xfs_reflink_recover_cow(mp);
1324 if (error) {
1325 xfs_err(mp,
1326 "Error %d recovering leftover CoW allocations.", error);
1327 xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
1328 return error;
1329 }
1330
1331 /* Create the per-AG metadata reservation pool. */
1332 error = xfs_fs_reserve_ag_blocks(mp);
1333 if (error && error != -ENOSPC)
1334 return error;
1335 }
1336
1337 /* rw -> ro */
1338 if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
1339 /* Free the per-AG metadata reservation pool. */
1340 error = xfs_fs_unreserve_ag_blocks(mp);
1341 if (error) {
1342 xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
1343 return error;
1344 }
1345
1346 /*
1347 * Before we sync the metadata, we need to free up the reserve
1348 * block pool so that the used block count in the superblock on
@@ -1490,6 +1533,7 @@ xfs_fs_fill_super(
1533 atomic_set(&mp->m_active_trans, 0);
1534 INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
1535 INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
1536 INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
1537 mp->m_kobj.kobject.kset = xfs_kset;
1538
1539 mp->m_super = sb;
@@ -1572,6 +1616,9 @@ xfs_fs_fill_super(
1616 "DAX unsupported by block device. Turning off DAX.");
1617 mp->m_flags &= ~XFS_MOUNT_DAX;
1618 }
1619 if (xfs_sb_version_hasreflink(&mp->m_sb))
1620 xfs_alert(mp,
1621 "DAX and reflink have not been tested together!");
1622 }
1623
1624 if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
@@ -1585,6 +1632,10 @@ xfs_fs_fill_super(
1632 "EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!");
1633 }
1634
1635 if (xfs_sb_version_hasreflink(&mp->m_sb))
1636 xfs_alert(mp,
1637 "EXPERIMENTAL reflink feature enabled. Use at your own risk!");
1638
1639 error = xfs_mountfs(mp);
1640 if (error)
1641 goto out_filestream_unmount;
@@ -1788,8 +1839,38 @@ xfs_init_zones(void)
1839 if (!xfs_rui_zone)
1840 goto out_destroy_rud_zone;
1841
1842 xfs_cud_zone = kmem_zone_init(sizeof(struct xfs_cud_log_item),
1843 "xfs_cud_item");
1844 if (!xfs_cud_zone)
1845 goto out_destroy_rui_zone;
1846
1847 xfs_cui_zone = kmem_zone_init(
1848 xfs_cui_log_item_sizeof(XFS_CUI_MAX_FAST_EXTENTS),
1849 "xfs_cui_item");
1850 if (!xfs_cui_zone)
1851 goto out_destroy_cud_zone;
1852
1853 xfs_bud_zone = kmem_zone_init(sizeof(struct xfs_bud_log_item),
1854 "xfs_bud_item");
1855 if (!xfs_bud_zone)
1856 goto out_destroy_cui_zone;
1857
1858 xfs_bui_zone = kmem_zone_init(
1859 xfs_bui_log_item_sizeof(XFS_BUI_MAX_FAST_EXTENTS),
1860 "xfs_bui_item");
1861 if (!xfs_bui_zone)
1862 goto out_destroy_bud_zone;
1863
1864 return 0;
1865
1866 out_destroy_bud_zone:
1867 kmem_zone_destroy(xfs_bud_zone);
1868 out_destroy_cui_zone:
1869 kmem_zone_destroy(xfs_cui_zone);
1870 out_destroy_cud_zone:
1871 kmem_zone_destroy(xfs_cud_zone);
1872 out_destroy_rui_zone:
1873 kmem_zone_destroy(xfs_rui_zone);
1874 out_destroy_rud_zone:
1875 kmem_zone_destroy(xfs_rud_zone);
1876 out_destroy_icreate_zone:
@@ -1832,6 +1913,10 @@ xfs_destroy_zones(void)
1913 * destroy caches.
1914 */
1915 rcu_barrier();
1916 kmem_zone_destroy(xfs_bui_zone);
1917 kmem_zone_destroy(xfs_bud_zone);
1918 kmem_zone_destroy(xfs_cui_zone);
1919 kmem_zone_destroy(xfs_cud_zone);
1920 kmem_zone_destroy(xfs_rui_zone);
1921 kmem_zone_destroy(xfs_rud_zone);
1922 kmem_zone_destroy(xfs_icreate_zone);
@@ -1885,6 +1970,8 @@ init_xfs_fs(void)
1885 1970
1886 xfs_extent_free_init_defer_op(); 1971 xfs_extent_free_init_defer_op();
1887 xfs_rmap_update_init_defer_op(); 1972 xfs_rmap_update_init_defer_op();
1973 xfs_refcount_update_init_defer_op();
1974 xfs_bmap_update_init_defer_op();
1888 1975
1889 xfs_dir_startup(); 1976 xfs_dir_startup();
1890 1977
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index aed74d3f8da9..afe1f66aaa69 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -184,6 +184,15 @@ static struct ctl_table xfs_table[] = {
184 .extra1 = &xfs_params.eofb_timer.min,
185 .extra2 = &xfs_params.eofb_timer.max,
186 },
187 {
188 .procname = "speculative_cow_prealloc_lifetime",
189 .data = &xfs_params.cowb_timer.val,
190 .maxlen = sizeof(int),
191 .mode = 0644,
192 .proc_handler = proc_dointvec_minmax,
193 .extra1 = &xfs_params.cowb_timer.min,
194 .extra2 = &xfs_params.cowb_timer.max,
195 },
196 /* please keep this the last entry */
197#ifdef CONFIG_PROC_FS
198 {
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index ffef45375754..984a3499cfe3 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -48,6 +48,7 @@ typedef struct xfs_param {
48 xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
49 xfs_sysctl_val_t fstrm_timer; /* Filestream dir-AG assoc'n timeout. */
50 xfs_sysctl_val_t eofb_timer; /* Interval between eofb scan wakeups */
51 xfs_sysctl_val_t cowb_timer; /* Interval between cowb scan wakeups */
52} xfs_param_t;
53
54/*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 16093c7dacde..ad188d3a83f3 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -39,6 +39,7 @@ struct xfs_buf_log_format;
39struct xfs_inode_log_format;
40struct xfs_bmbt_irec;
41struct xfs_btree_cur;
42struct xfs_refcount_irec;
43
44DECLARE_EVENT_CLASS(xfs_attr_list_class,
45 TP_PROTO(struct xfs_attr_list_context *ctx),
@@ -135,6 +136,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim);
136DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim);
137DEFINE_PERAG_REF_EVENT(xfs_perag_set_eofblocks);
138DEFINE_PERAG_REF_EVENT(xfs_perag_clear_eofblocks);
139DEFINE_PERAG_REF_EVENT(xfs_perag_set_cowblocks);
140DEFINE_PERAG_REF_EVENT(xfs_perag_clear_cowblocks);
141
142DECLARE_EVENT_CLASS(xfs_ag_class,
143 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
@@ -268,10 +271,10 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
271 __field(unsigned long, caller_ip)
272 ),
273 TP_fast_assign(
274 struct xfs_ifork *ifp;
275 struct xfs_bmbt_irec r;
276
277 ifp = xfs_iext_state_to_fork(ip, state);
278 xfs_bmbt_get_all(xfs_iext_get_ext(ifp, idx), &r);
279 __entry->dev = VFS_I(ip)->i_sb->s_dev;
280 __entry->ino = ip->i_ino;
@@ -686,6 +689,9 @@ DEFINE_INODE_EVENT(xfs_dquot_dqdetach);
689DEFINE_INODE_EVENT(xfs_inode_set_eofblocks_tag);
690DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
691DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
692DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
693DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
694DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
695
696DEFINE_INODE_EVENT(xfs_filemap_fault);
697DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
@@ -2581,10 +2587,20 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
2587DEFINE_AG_ERROR_EVENT(xfs_rmap_insert_error);
2588DEFINE_AG_ERROR_EVENT(xfs_rmap_delete_error);
2589DEFINE_AG_ERROR_EVENT(xfs_rmap_update_error);
2590
2591DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_candidate);
2592DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_query);
2593DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_candidate);
2594DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range);
2595DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
2596DEFINE_RMAPBT_EVENT(xfs_rmap_find_right_neighbor_result);
2597DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
2598
2599/* deferred bmbt updates */
2600#define DEFINE_BMAP_DEFERRED_EVENT DEFINE_RMAP_DEFERRED_EVENT
2601DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_defer);
2602DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_deferred);
2603
2604/* per-AG reservation */
2605DECLARE_EVENT_CLASS(xfs_ag_resv_class,
2606 TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type resv,
@@ -2639,6 +2655,728 @@ DEFINE_AG_RESV_EVENT(xfs_ag_resv_needed);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_free_error);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_init_error);
 
2658/* refcount tracepoint classes */
2659
2660/* reuse the discard trace class for agbno/aglen-based traces */
2661#define DEFINE_AG_EXTENT_EVENT(name) DEFINE_DISCARD_EVENT(name)
2662
2663/* ag btree lookup tracepoint class */
2664#define XFS_AG_BTREE_CMP_FORMAT_STR \
2665 { XFS_LOOKUP_EQ, "eq" }, \
2666 { XFS_LOOKUP_LE, "le" }, \
2667 { XFS_LOOKUP_GE, "ge" }
2668DECLARE_EVENT_CLASS(xfs_ag_btree_lookup_class,
2669 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2670 xfs_agblock_t agbno, xfs_lookup_t dir),
2671 TP_ARGS(mp, agno, agbno, dir),
2672 TP_STRUCT__entry(
2673 __field(dev_t, dev)
2674 __field(xfs_agnumber_t, agno)
2675 __field(xfs_agblock_t, agbno)
2676 __field(xfs_lookup_t, dir)
2677 ),
2678 TP_fast_assign(
2679 __entry->dev = mp->m_super->s_dev;
2680 __entry->agno = agno;
2681 __entry->agbno = agbno;
2682 __entry->dir = dir;
2683 ),
2684	TP_printk("dev %d:%d agno %u agbno %u cmp %s(%d)",
2685 MAJOR(__entry->dev), MINOR(__entry->dev),
2686 __entry->agno,
2687 __entry->agbno,
2688 __print_symbolic(__entry->dir, XFS_AG_BTREE_CMP_FORMAT_STR),
2689 __entry->dir)
2690)
2691
2692#define DEFINE_AG_BTREE_LOOKUP_EVENT(name) \
2693DEFINE_EVENT(xfs_ag_btree_lookup_class, name, \
2694 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2695 xfs_agblock_t agbno, xfs_lookup_t dir), \
2696 TP_ARGS(mp, agno, agbno, dir))
2697
2698/* single-rcext tracepoint class */
2699DECLARE_EVENT_CLASS(xfs_refcount_extent_class,
2700 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2701 struct xfs_refcount_irec *irec),
2702 TP_ARGS(mp, agno, irec),
2703 TP_STRUCT__entry(
2704 __field(dev_t, dev)
2705 __field(xfs_agnumber_t, agno)
2706 __field(xfs_agblock_t, startblock)
2707 __field(xfs_extlen_t, blockcount)
2708 __field(xfs_nlink_t, refcount)
2709 ),
2710 TP_fast_assign(
2711 __entry->dev = mp->m_super->s_dev;
2712 __entry->agno = agno;
2713 __entry->startblock = irec->rc_startblock;
2714 __entry->blockcount = irec->rc_blockcount;
2715 __entry->refcount = irec->rc_refcount;
2716 ),
2717	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u",
2718 MAJOR(__entry->dev), MINOR(__entry->dev),
2719 __entry->agno,
2720 __entry->startblock,
2721 __entry->blockcount,
2722 __entry->refcount)
2723)
2724
2725#define DEFINE_REFCOUNT_EXTENT_EVENT(name) \
2726DEFINE_EVENT(xfs_refcount_extent_class, name, \
2727 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2728 struct xfs_refcount_irec *irec), \
2729 TP_ARGS(mp, agno, irec))
2730
2731/* single-rcext and an agbno tracepoint class */
2732DECLARE_EVENT_CLASS(xfs_refcount_extent_at_class,
2733 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2734 struct xfs_refcount_irec *irec, xfs_agblock_t agbno),
2735 TP_ARGS(mp, agno, irec, agbno),
2736 TP_STRUCT__entry(
2737 __field(dev_t, dev)
2738 __field(xfs_agnumber_t, agno)
2739 __field(xfs_agblock_t, startblock)
2740 __field(xfs_extlen_t, blockcount)
2741 __field(xfs_nlink_t, refcount)
2742 __field(xfs_agblock_t, agbno)
2743 ),
2744 TP_fast_assign(
2745 __entry->dev = mp->m_super->s_dev;
2746 __entry->agno = agno;
2747 __entry->startblock = irec->rc_startblock;
2748 __entry->blockcount = irec->rc_blockcount;
2749 __entry->refcount = irec->rc_refcount;
2750 __entry->agbno = agbno;
2751 ),
2752	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u @ agbno %u",
2753 MAJOR(__entry->dev), MINOR(__entry->dev),
2754 __entry->agno,
2755 __entry->startblock,
2756 __entry->blockcount,
2757 __entry->refcount,
2758 __entry->agbno)
2759)
2760
2761#define DEFINE_REFCOUNT_EXTENT_AT_EVENT(name) \
2762DEFINE_EVENT(xfs_refcount_extent_at_class, name, \
2763 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2764 struct xfs_refcount_irec *irec, xfs_agblock_t agbno), \
2765 TP_ARGS(mp, agno, irec, agbno))
2766
2767/* double-rcext tracepoint class */
2768DECLARE_EVENT_CLASS(xfs_refcount_double_extent_class,
2769 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2770 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2),
2771 TP_ARGS(mp, agno, i1, i2),
2772 TP_STRUCT__entry(
2773 __field(dev_t, dev)
2774 __field(xfs_agnumber_t, agno)
2775 __field(xfs_agblock_t, i1_startblock)
2776 __field(xfs_extlen_t, i1_blockcount)
2777 __field(xfs_nlink_t, i1_refcount)
2778 __field(xfs_agblock_t, i2_startblock)
2779 __field(xfs_extlen_t, i2_blockcount)
2780 __field(xfs_nlink_t, i2_refcount)
2781 ),
2782 TP_fast_assign(
2783 __entry->dev = mp->m_super->s_dev;
2784 __entry->agno = agno;
2785 __entry->i1_startblock = i1->rc_startblock;
2786 __entry->i1_blockcount = i1->rc_blockcount;
2787 __entry->i1_refcount = i1->rc_refcount;
2788 __entry->i2_startblock = i2->rc_startblock;
2789 __entry->i2_blockcount = i2->rc_blockcount;
2790 __entry->i2_refcount = i2->rc_refcount;
2791 ),
2792 TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
2793		"agbno %u len %u refcount %u",
2794 MAJOR(__entry->dev), MINOR(__entry->dev),
2795 __entry->agno,
2796 __entry->i1_startblock,
2797 __entry->i1_blockcount,
2798 __entry->i1_refcount,
2799 __entry->i2_startblock,
2800 __entry->i2_blockcount,
2801 __entry->i2_refcount)
2802)
2803
2804#define DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(name) \
2805DEFINE_EVENT(xfs_refcount_double_extent_class, name, \
2806 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2807 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2), \
2808 TP_ARGS(mp, agno, i1, i2))
2809
2810/* double-rcext and an agbno tracepoint class */
2811DECLARE_EVENT_CLASS(xfs_refcount_double_extent_at_class,
2812 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2813 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
2814 xfs_agblock_t agbno),
2815 TP_ARGS(mp, agno, i1, i2, agbno),
2816 TP_STRUCT__entry(
2817 __field(dev_t, dev)
2818 __field(xfs_agnumber_t, agno)
2819 __field(xfs_agblock_t, i1_startblock)
2820 __field(xfs_extlen_t, i1_blockcount)
2821 __field(xfs_nlink_t, i1_refcount)
2822 __field(xfs_agblock_t, i2_startblock)
2823 __field(xfs_extlen_t, i2_blockcount)
2824 __field(xfs_nlink_t, i2_refcount)
2825 __field(xfs_agblock_t, agbno)
2826 ),
2827 TP_fast_assign(
2828 __entry->dev = mp->m_super->s_dev;
2829 __entry->agno = agno;
2830 __entry->i1_startblock = i1->rc_startblock;
2831 __entry->i1_blockcount = i1->rc_blockcount;
2832 __entry->i1_refcount = i1->rc_refcount;
2833 __entry->i2_startblock = i2->rc_startblock;
2834 __entry->i2_blockcount = i2->rc_blockcount;
2835 __entry->i2_refcount = i2->rc_refcount;
2836 __entry->agbno = agbno;
2837 ),
2838 TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
2839		"agbno %u len %u refcount %u @ agbno %u",
2840 MAJOR(__entry->dev), MINOR(__entry->dev),
2841 __entry->agno,
2842 __entry->i1_startblock,
2843 __entry->i1_blockcount,
2844 __entry->i1_refcount,
2845 __entry->i2_startblock,
2846 __entry->i2_blockcount,
2847 __entry->i2_refcount,
2848 __entry->agbno)
2849)
2850
2851#define DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(name) \
2852DEFINE_EVENT(xfs_refcount_double_extent_at_class, name, \
2853 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2854 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
2855 xfs_agblock_t agbno), \
2856 TP_ARGS(mp, agno, i1, i2, agbno))
2857
2858/* triple-rcext tracepoint class */
2859DECLARE_EVENT_CLASS(xfs_refcount_triple_extent_class,
2860 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2861 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
2862 struct xfs_refcount_irec *i3),
2863 TP_ARGS(mp, agno, i1, i2, i3),
2864 TP_STRUCT__entry(
2865 __field(dev_t, dev)
2866 __field(xfs_agnumber_t, agno)
2867 __field(xfs_agblock_t, i1_startblock)
2868 __field(xfs_extlen_t, i1_blockcount)
2869 __field(xfs_nlink_t, i1_refcount)
2870 __field(xfs_agblock_t, i2_startblock)
2871 __field(xfs_extlen_t, i2_blockcount)
2872 __field(xfs_nlink_t, i2_refcount)
2873 __field(xfs_agblock_t, i3_startblock)
2874 __field(xfs_extlen_t, i3_blockcount)
2875 __field(xfs_nlink_t, i3_refcount)
2876 ),
2877 TP_fast_assign(
2878 __entry->dev = mp->m_super->s_dev;
2879 __entry->agno = agno;
2880 __entry->i1_startblock = i1->rc_startblock;
2881 __entry->i1_blockcount = i1->rc_blockcount;
2882 __entry->i1_refcount = i1->rc_refcount;
2883 __entry->i2_startblock = i2->rc_startblock;
2884 __entry->i2_blockcount = i2->rc_blockcount;
2885 __entry->i2_refcount = i2->rc_refcount;
2886 __entry->i3_startblock = i3->rc_startblock;
2887 __entry->i3_blockcount = i3->rc_blockcount;
2888 __entry->i3_refcount = i3->rc_refcount;
2889 ),
2890 TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
2891 "agbno %u len %u refcount %u -- "
2892		"agbno %u len %u refcount %u",
2893 MAJOR(__entry->dev), MINOR(__entry->dev),
2894 __entry->agno,
2895 __entry->i1_startblock,
2896 __entry->i1_blockcount,
2897 __entry->i1_refcount,
2898 __entry->i2_startblock,
2899 __entry->i2_blockcount,
2900 __entry->i2_refcount,
2901 __entry->i3_startblock,
2902 __entry->i3_blockcount,
2903 __entry->i3_refcount)
2904);
2905
2906#define DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(name) \
2907DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
2908 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
2909 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
2910 struct xfs_refcount_irec *i3), \
2911 TP_ARGS(mp, agno, i1, i2, i3))
2912
2913/* refcount btree tracepoints */
2914DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
2915DEFINE_BUSY_EVENT(xfs_refcountbt_free_block);
2916DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
2917DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
2918DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);
2919DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_insert);
2920DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_delete);
2921DEFINE_AG_ERROR_EVENT(xfs_refcount_insert_error);
2922DEFINE_AG_ERROR_EVENT(xfs_refcount_delete_error);
2923DEFINE_AG_ERROR_EVENT(xfs_refcount_update_error);
2924
2925/* refcount adjustment tracepoints */
2926DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
2927DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
2928DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_increase);
2929DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_decrease);
2930DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
2931DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
2932DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_recover_extent);
2933DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
2934DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
2935DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
2936DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
2937DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
2938DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
2939DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_cow_error);
2940DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
2941DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
2942DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);
2943DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_left_extent_error);
2944DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_right_extent_error);
2945DEFINE_AG_ERROR_EVENT(xfs_refcount_find_left_extent_error);
2946DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
2947
2948/* reflink helpers */
2949DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
2950DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
2951DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
2952#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
2953DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
2954DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
2955
2956TRACE_EVENT(xfs_refcount_finish_one_leftover,
2957 TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
2958 int type, xfs_agblock_t agbno, xfs_extlen_t len,
2959 xfs_agblock_t new_agbno, xfs_extlen_t new_len),
2960 TP_ARGS(mp, agno, type, agbno, len, new_agbno, new_len),
2961 TP_STRUCT__entry(
2962 __field(dev_t, dev)
2963 __field(xfs_agnumber_t, agno)
2964 __field(int, type)
2965 __field(xfs_agblock_t, agbno)
2966 __field(xfs_extlen_t, len)
2967 __field(xfs_agblock_t, new_agbno)
2968 __field(xfs_extlen_t, new_len)
2969 ),
2970 TP_fast_assign(
2971 __entry->dev = mp->m_super->s_dev;
2972 __entry->agno = agno;
2973 __entry->type = type;
2974 __entry->agbno = agbno;
2975 __entry->len = len;
2976 __entry->new_agbno = new_agbno;
2977 __entry->new_len = new_len;
2978 ),
2979 TP_printk("dev %d:%d type %d agno %u agbno %u len %u new_agbno %u new_len %u",
2980 MAJOR(__entry->dev), MINOR(__entry->dev),
2981 __entry->type,
2982 __entry->agno,
2983 __entry->agbno,
2984 __entry->len,
2985 __entry->new_agbno,
2986 __entry->new_len)
2987);
2988
2989/* simple inode-based error/%ip tracepoint class */
2990DECLARE_EVENT_CLASS(xfs_inode_error_class,
2991 TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
2992 TP_ARGS(ip, error, caller_ip),
2993 TP_STRUCT__entry(
2994 __field(dev_t, dev)
2995 __field(xfs_ino_t, ino)
2996 __field(int, error)
2997 __field(unsigned long, caller_ip)
2998 ),
2999 TP_fast_assign(
3000 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3001 __entry->ino = ip->i_ino;
3002 __entry->error = error;
3003 __entry->caller_ip = caller_ip;
3004 ),
3005 TP_printk("dev %d:%d ino %llx error %d caller %ps",
3006 MAJOR(__entry->dev), MINOR(__entry->dev),
3007 __entry->ino,
3008 __entry->error,
3009 (char *)__entry->caller_ip)
3010);
3011
3012#define DEFINE_INODE_ERROR_EVENT(name) \
3013DEFINE_EVENT(xfs_inode_error_class, name, \
3014 TP_PROTO(struct xfs_inode *ip, int error, \
3015 unsigned long caller_ip), \
3016 TP_ARGS(ip, error, caller_ip))
3017
3018/* reflink allocator */
3019TRACE_EVENT(xfs_bmap_remap_alloc,
3020 TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
3021 xfs_extlen_t len),
3022 TP_ARGS(ip, fsbno, len),
3023 TP_STRUCT__entry(
3024 __field(dev_t, dev)
3025 __field(xfs_ino_t, ino)
3026 __field(xfs_fsblock_t, fsbno)
3027 __field(xfs_extlen_t, len)
3028 ),
3029 TP_fast_assign(
3030 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3031 __entry->ino = ip->i_ino;
3032 __entry->fsbno = fsbno;
3033 __entry->len = len;
3034 ),
3035 TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
3036 MAJOR(__entry->dev), MINOR(__entry->dev),
3037 __entry->ino,
3038 __entry->fsbno,
3039 __entry->len)
3040);
3041DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
3042
3043/* reflink tracepoint classes */
3044
3045/* two-file io tracepoint class */
3046DECLARE_EVENT_CLASS(xfs_double_io_class,
3047 TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len,
3048 struct xfs_inode *dest, xfs_off_t doffset),
3049 TP_ARGS(src, soffset, len, dest, doffset),
3050 TP_STRUCT__entry(
3051 __field(dev_t, dev)
3052 __field(xfs_ino_t, src_ino)
3053 __field(loff_t, src_isize)
3054 __field(loff_t, src_disize)
3055 __field(loff_t, src_offset)
3056 __field(size_t, len)
3057 __field(xfs_ino_t, dest_ino)
3058 __field(loff_t, dest_isize)
3059 __field(loff_t, dest_disize)
3060 __field(loff_t, dest_offset)
3061 ),
3062 TP_fast_assign(
3063 __entry->dev = VFS_I(src)->i_sb->s_dev;
3064 __entry->src_ino = src->i_ino;
3065 __entry->src_isize = VFS_I(src)->i_size;
3066 __entry->src_disize = src->i_d.di_size;
3067 __entry->src_offset = soffset;
3068 __entry->len = len;
3069 __entry->dest_ino = dest->i_ino;
3070 __entry->dest_isize = VFS_I(dest)->i_size;
3071 __entry->dest_disize = dest->i_d.di_size;
3072 __entry->dest_offset = doffset;
3073 ),
3074 TP_printk("dev %d:%d count %zd "
3075 "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx -> "
3076 "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx",
3077 MAJOR(__entry->dev), MINOR(__entry->dev),
3078 __entry->len,
3079 __entry->src_ino,
3080 __entry->src_isize,
3081 __entry->src_disize,
3082 __entry->src_offset,
3083 __entry->dest_ino,
3084 __entry->dest_isize,
3085 __entry->dest_disize,
3086 __entry->dest_offset)
3087)
3088
3089#define DEFINE_DOUBLE_IO_EVENT(name) \
3090DEFINE_EVENT(xfs_double_io_class, name, \
3091 TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len, \
3092 struct xfs_inode *dest, xfs_off_t doffset), \
3093 TP_ARGS(src, soffset, len, dest, doffset))
3094
3095/* two-file vfs io tracepoint class */
3096DECLARE_EVENT_CLASS(xfs_double_vfs_io_class,
3097 TP_PROTO(struct inode *src, u64 soffset, u64 len,
3098 struct inode *dest, u64 doffset),
3099 TP_ARGS(src, soffset, len, dest, doffset),
3100 TP_STRUCT__entry(
3101 __field(dev_t, dev)
3102 __field(unsigned long, src_ino)
3103 __field(loff_t, src_isize)
3104 __field(loff_t, src_offset)
3105 __field(size_t, len)
3106 __field(unsigned long, dest_ino)
3107 __field(loff_t, dest_isize)
3108 __field(loff_t, dest_offset)
3109 ),
3110 TP_fast_assign(
3111 __entry->dev = src->i_sb->s_dev;
3112 __entry->src_ino = src->i_ino;
3113 __entry->src_isize = i_size_read(src);
3114 __entry->src_offset = soffset;
3115 __entry->len = len;
3116 __entry->dest_ino = dest->i_ino;
3117 __entry->dest_isize = i_size_read(dest);
3118 __entry->dest_offset = doffset;
3119 ),
3120 TP_printk("dev %d:%d count %zd "
3121 "ino 0x%lx isize 0x%llx offset 0x%llx -> "
3122 "ino 0x%lx isize 0x%llx offset 0x%llx",
3123 MAJOR(__entry->dev), MINOR(__entry->dev),
3124 __entry->len,
3125 __entry->src_ino,
3126 __entry->src_isize,
3127 __entry->src_offset,
3128 __entry->dest_ino,
3129 __entry->dest_isize,
3130 __entry->dest_offset)
3131)
3132
3133#define DEFINE_DOUBLE_VFS_IO_EVENT(name) \
3134DEFINE_EVENT(xfs_double_vfs_io_class, name, \
3135 TP_PROTO(struct inode *src, u64 soffset, u64 len, \
3136 struct inode *dest, u64 doffset), \
3137 TP_ARGS(src, soffset, len, dest, doffset))
3138
3139/* CoW write tracepoint */
3140DECLARE_EVENT_CLASS(xfs_copy_on_write_class,
3141 TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk,
3142 xfs_extlen_t len, xfs_fsblock_t new_pblk),
3143 TP_ARGS(ip, lblk, pblk, len, new_pblk),
3144 TP_STRUCT__entry(
3145 __field(dev_t, dev)
3146 __field(xfs_ino_t, ino)
3147 __field(xfs_fileoff_t, lblk)
3148 __field(xfs_fsblock_t, pblk)
3149 __field(xfs_extlen_t, len)
3150 __field(xfs_fsblock_t, new_pblk)
3151 ),
3152 TP_fast_assign(
3153 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3154 __entry->ino = ip->i_ino;
3155 __entry->lblk = lblk;
3156 __entry->pblk = pblk;
3157 __entry->len = len;
3158 __entry->new_pblk = new_pblk;
3159 ),
3160 TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx pblk 0x%llx "
3161 "len 0x%x new_pblk %llu",
3162 MAJOR(__entry->dev), MINOR(__entry->dev),
3163 __entry->ino,
3164 __entry->lblk,
3165 __entry->pblk,
3166 __entry->len,
3167 __entry->new_pblk)
3168)
3169
3170#define DEFINE_COW_EVENT(name) \
3171DEFINE_EVENT(xfs_copy_on_write_class, name, \
3172 TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk, \
3173 xfs_extlen_t len, xfs_fsblock_t new_pblk), \
3174 TP_ARGS(ip, lblk, pblk, len, new_pblk))
3175
3176/* inode/irec events */
3177DECLARE_EVENT_CLASS(xfs_inode_irec_class,
3178 TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec),
3179 TP_ARGS(ip, irec),
3180 TP_STRUCT__entry(
3181 __field(dev_t, dev)
3182 __field(xfs_ino_t, ino)
3183 __field(xfs_fileoff_t, lblk)
3184 __field(xfs_extlen_t, len)
3185 __field(xfs_fsblock_t, pblk)
3186 ),
3187 TP_fast_assign(
3188 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3189 __entry->ino = ip->i_ino;
3190 __entry->lblk = irec->br_startoff;
3191 __entry->len = irec->br_blockcount;
3192 __entry->pblk = irec->br_startblock;
3193 ),
3194 TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
3195 MAJOR(__entry->dev), MINOR(__entry->dev),
3196 __entry->ino,
3197 __entry->lblk,
3198 __entry->len,
3199 __entry->pblk)
3200);
3201#define DEFINE_INODE_IREC_EVENT(name) \
3202DEFINE_EVENT(xfs_inode_irec_class, name, \
3203 TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \
3204 TP_ARGS(ip, irec))
3205
3206/* refcount/reflink tracepoint definitions */
3207
3208/* reflink tracepoints */
3209DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
3210DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
3211DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
3212DEFINE_IOMAP_EVENT(xfs_reflink_remap_imap);
3213TRACE_EVENT(xfs_reflink_remap_blocks_loop,
3214 TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
3215 xfs_filblks_t len, struct xfs_inode *dest,
3216 xfs_fileoff_t doffset),
3217 TP_ARGS(src, soffset, len, dest, doffset),
3218 TP_STRUCT__entry(
3219 __field(dev_t, dev)
3220 __field(xfs_ino_t, src_ino)
3221 __field(xfs_fileoff_t, src_lblk)
3222 __field(xfs_filblks_t, len)
3223 __field(xfs_ino_t, dest_ino)
3224 __field(xfs_fileoff_t, dest_lblk)
3225 ),
3226 TP_fast_assign(
3227 __entry->dev = VFS_I(src)->i_sb->s_dev;
3228 __entry->src_ino = src->i_ino;
3229 __entry->src_lblk = soffset;
3230 __entry->len = len;
3231 __entry->dest_ino = dest->i_ino;
3232 __entry->dest_lblk = doffset;
3233 ),
3234 TP_printk("dev %d:%d len 0x%llx "
3235 "ino 0x%llx offset 0x%llx blocks -> "
3236 "ino 0x%llx offset 0x%llx blocks",
3237 MAJOR(__entry->dev), MINOR(__entry->dev),
3238 __entry->len,
3239 __entry->src_ino,
3240 __entry->src_lblk,
3241 __entry->dest_ino,
3242 __entry->dest_lblk)
3243);
3244TRACE_EVENT(xfs_reflink_punch_range,
3245 TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
3246 xfs_extlen_t len),
3247 TP_ARGS(ip, lblk, len),
3248 TP_STRUCT__entry(
3249 __field(dev_t, dev)
3250 __field(xfs_ino_t, ino)
3251 __field(xfs_fileoff_t, lblk)
3252 __field(xfs_extlen_t, len)
3253 ),
3254 TP_fast_assign(
3255 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3256 __entry->ino = ip->i_ino;
3257 __entry->lblk = lblk;
3258 __entry->len = len;
3259 ),
3260 TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
3261 MAJOR(__entry->dev), MINOR(__entry->dev),
3262 __entry->ino,
3263 __entry->lblk,
3264 __entry->len)
3265);
3266TRACE_EVENT(xfs_reflink_remap,
3267 TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
3268 xfs_extlen_t len, xfs_fsblock_t new_pblk),
3269 TP_ARGS(ip, lblk, len, new_pblk),
3270 TP_STRUCT__entry(
3271 __field(dev_t, dev)
3272 __field(xfs_ino_t, ino)
3273 __field(xfs_fileoff_t, lblk)
3274 __field(xfs_extlen_t, len)
3275 __field(xfs_fsblock_t, new_pblk)
3276 ),
3277 TP_fast_assign(
3278 __entry->dev = VFS_I(ip)->i_sb->s_dev;
3279 __entry->ino = ip->i_ino;
3280 __entry->lblk = lblk;
3281 __entry->len = len;
3282 __entry->new_pblk = new_pblk;
3283 ),
3284 TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
3285 MAJOR(__entry->dev), MINOR(__entry->dev),
3286 __entry->ino,
3287 __entry->lblk,
3288 __entry->len,
3289 __entry->new_pblk)
3290);
3291DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
3292DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
3293DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
3294DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
3295DEFINE_INODE_ERROR_EVENT(xfs_reflink_reflink_main_loop_error);
3296DEFINE_INODE_ERROR_EVENT(xfs_reflink_read_iomap_error);
3297DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
3298DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
3299
3300/* dedupe tracepoints */
3301DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
3302DEFINE_INODE_ERROR_EVENT(xfs_reflink_compare_extents_error);
3303
3304/* ioctl tracepoints */
3305DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_reflink);
3306DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_clone_range);
3307DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_file_extent_same);
3308TRACE_EVENT(xfs_ioctl_clone,
3309 TP_PROTO(struct inode *src, struct inode *dest),
3310 TP_ARGS(src, dest),
3311 TP_STRUCT__entry(
3312 __field(dev_t, dev)
3313 __field(unsigned long, src_ino)
3314 __field(loff_t, src_isize)
3315 __field(unsigned long, dest_ino)
3316 __field(loff_t, dest_isize)
3317 ),
3318 TP_fast_assign(
3319 __entry->dev = src->i_sb->s_dev;
3320 __entry->src_ino = src->i_ino;
3321 __entry->src_isize = i_size_read(src);
3322 __entry->dest_ino = dest->i_ino;
3323 __entry->dest_isize = i_size_read(dest);
3324 ),
3325 TP_printk("dev %d:%d "
3326 "ino 0x%lx isize 0x%llx -> "
3327		"ino 0x%lx isize 0x%llx",
3328 MAJOR(__entry->dev), MINOR(__entry->dev),
3329 __entry->src_ino,
3330 __entry->src_isize,
3331 __entry->dest_ino,
3332 __entry->dest_isize)
3333);
3334
3335/* unshare tracepoints */
3336DEFINE_SIMPLE_IO_EVENT(xfs_reflink_unshare);
3337DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cow_eof_block);
3338DEFINE_PAGE_EVENT(xfs_reflink_unshare_page);
3339DEFINE_INODE_ERROR_EVENT(xfs_reflink_unshare_error);
3340DEFINE_INODE_ERROR_EVENT(xfs_reflink_cow_eof_block_error);
3341DEFINE_INODE_ERROR_EVENT(xfs_reflink_dirty_page_error);
3342
3343/* copy on write */
3344DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
3345DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
3346DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
3347DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
3348
3349DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
3350DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
3351
3352DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
3353DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
3354DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
3355
3356DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
3357DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
3358DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
3359DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_piece);
3360
3361DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_range_error);
3362DEFINE_INODE_ERROR_EVENT(xfs_reflink_allocate_cow_range_error);
3363DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
3364DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
3365
3366DEFINE_COW_EVENT(xfs_reflink_fork_buf);
3367DEFINE_COW_EVENT(xfs_reflink_finish_fork_buf);
3368DEFINE_INODE_ERROR_EVENT(xfs_reflink_fork_buf_error);
3369DEFINE_INODE_ERROR_EVENT(xfs_reflink_finish_fork_buf_error);
3370
3371DEFINE_INODE_EVENT(xfs_reflink_cancel_pending_cow);
3372DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
3373DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_pending_cow_error);
3374
3375/* rmap swapext tracepoints */
3376DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
3377DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
3378DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
3379
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index e2bf86aad33d..61b7fbdd3ebd 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -36,6 +36,11 @@ struct xfs_busy_extent;
 struct xfs_rud_log_item;
 struct xfs_rui_log_item;
 struct xfs_btree_cur;
+struct xfs_cui_log_item;
+struct xfs_cud_log_item;
+struct xfs_defer_ops;
+struct xfs_bui_log_item;
+struct xfs_bud_log_item;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -248,4 +253,28 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
+/* refcount updates */
+enum xfs_refcount_intent_type;
+
+void xfs_refcount_update_init_defer_op(void);
+struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
+		struct xfs_cui_log_item *cuip);
+int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
+		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
+		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
+		xfs_extlen_t blockcount, xfs_fsblock_t *new_fsb,
+		xfs_extlen_t *new_len, struct xfs_btree_cur **pcur);
+
+/* mapping updates */
+enum xfs_bmap_intent_type;
+
+void xfs_bmap_update_init_defer_op(void);
+struct xfs_bud_log_item *xfs_trans_get_bud(struct xfs_trans *tp,
+		struct xfs_bui_log_item *buip);
+int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
+		struct xfs_bud_log_item *budp, struct xfs_defer_ops *dfops,
+		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
+		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount, xfs_exntst_t state);
+
 #endif /* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
new file mode 100644
index 000000000000..6408e7d7c08c
--- /dev/null
+++ b/fs/xfs/xfs_trans_bmap.c
@@ -0,0 +1,249 @@
+/*
+ * Copyright (C) 2016 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_bmap_item.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_inode.h"
+
+/*
+ * This routine is called to allocate a "bmap update done"
+ * log item.
+ */
+struct xfs_bud_log_item *
+xfs_trans_get_bud(
+	struct xfs_trans		*tp,
+	struct xfs_bui_log_item		*buip)
+{
+	struct xfs_bud_log_item		*budp;
+
+	budp = xfs_bud_init(tp->t_mountp, buip);
+	xfs_trans_add_item(tp, &budp->bud_item);
+	return budp;
+}
+
+/*
+ * Finish a bmap update and log it to the BUD. Note that the
+ * transaction is marked dirty regardless of whether the bmap update
+ * succeeds or fails to support the BUI/BUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_bmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_bud_log_item		*budp,
+	struct xfs_defer_ops		*dop,
+	enum xfs_bmap_intent_type	type,
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	int				error;
+
+	error = xfs_bmap_finish_one(tp, dop, ip, type, whichfork, startoff,
+			startblock, blockcount, state);
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the BUI and frees the BUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	budp->bud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	return error;
+}
86
87/* Sort bmap intents by inode. */
88static int
89xfs_bmap_update_diff_items(
90 void *priv,
91 struct list_head *a,
92 struct list_head *b)
93{
94 struct xfs_bmap_intent *ba;
95 struct xfs_bmap_intent *bb;
96
97 ba = container_of(a, struct xfs_bmap_intent, bi_list);
98 bb = container_of(b, struct xfs_bmap_intent, bi_list);
99 return ba->bi_owner->i_ino - bb->bi_owner->i_ino;
100}
101
102/* Get a BUI. */
103STATIC void *
104xfs_bmap_update_create_intent(
105 struct xfs_trans *tp,
106 unsigned int count)
107{
108 struct xfs_bui_log_item *buip;
109
110 ASSERT(count == XFS_BUI_MAX_FAST_EXTENTS);
111 ASSERT(tp != NULL);
112
113 buip = xfs_bui_init(tp->t_mountp);
114 ASSERT(buip != NULL);
115
116 /*
117 * Get a log_item_desc to point at the new item.
118 */
119 xfs_trans_add_item(tp, &buip->bui_item);
120 return buip;
121}
122
123/* Set the map extent flags for this mapping. */
124static void
125xfs_trans_set_bmap_flags(
126 struct xfs_map_extent *bmap,
127 enum xfs_bmap_intent_type type,
128 int whichfork,
129 xfs_exntst_t state)
130{
131 bmap->me_flags = 0;
132 switch (type) {
133 case XFS_BMAP_MAP:
134 case XFS_BMAP_UNMAP:
135 bmap->me_flags = type;
136 break;
137 default:
138 ASSERT(0);
139 }
140 if (state == XFS_EXT_UNWRITTEN)
141 bmap->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
142 if (whichfork == XFS_ATTR_FORK)
143 bmap->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
144}
145
146/* Log bmap updates in the intent item. */
147STATIC void
148xfs_bmap_update_log_item(
149 struct xfs_trans *tp,
150 void *intent,
151 struct list_head *item)
152{
153 struct xfs_bui_log_item *buip = intent;
154 struct xfs_bmap_intent *bmap;
155 uint next_extent;
156 struct xfs_map_extent *map;
157
158 bmap = container_of(item, struct xfs_bmap_intent, bi_list);
159
160 tp->t_flags |= XFS_TRANS_DIRTY;
161 buip->bui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
162
163 /*
164 * atomic_inc_return gives us the value after the increment;
165 * we want to use it as an array index so we need to subtract 1 from
166 * it.
167 */
168 next_extent = atomic_inc_return(&buip->bui_next_extent) - 1;
169 ASSERT(next_extent < buip->bui_format.bui_nextents);
170 map = &buip->bui_format.bui_extents[next_extent];
171 map->me_owner = bmap->bi_owner->i_ino;
172 map->me_startblock = bmap->bi_bmap.br_startblock;
173 map->me_startoff = bmap->bi_bmap.br_startoff;
174 map->me_len = bmap->bi_bmap.br_blockcount;
175 xfs_trans_set_bmap_flags(map, bmap->bi_type, bmap->bi_whichfork,
176 bmap->bi_bmap.br_state);
177}
178
179/* Get a BUD so we can process all the deferred bmap updates. */
180STATIC void *
181xfs_bmap_update_create_done(
182 struct xfs_trans *tp,
183 void *intent,
184 unsigned int count)
185{
186 return xfs_trans_get_bud(tp, intent);
187}
188
189/* Process a deferred bmap update. */
190STATIC int
191xfs_bmap_update_finish_item(
192 struct xfs_trans *tp,
193 struct xfs_defer_ops *dop,
194 struct list_head *item,
195 void *done_item,
196 void **state)
197{
198 struct xfs_bmap_intent *bmap;
199 int error;
200
201 bmap = container_of(item, struct xfs_bmap_intent, bi_list);
202 error = xfs_trans_log_finish_bmap_update(tp, done_item, dop,
203 bmap->bi_type,
204 bmap->bi_owner, bmap->bi_whichfork,
205 bmap->bi_bmap.br_startoff,
206 bmap->bi_bmap.br_startblock,
207 bmap->bi_bmap.br_blockcount,
208 bmap->bi_bmap.br_state);
209 kmem_free(bmap);
210 return error;
211}
212
213/* Abort all pending BUIs. */
214STATIC void
215xfs_bmap_update_abort_intent(
216 void *intent)
217{
218 xfs_bui_release(intent);
219}
220
221/* Cancel a deferred bmap update. */
222STATIC void
223xfs_bmap_update_cancel_item(
224 struct list_head *item)
225{
226 struct xfs_bmap_intent *bmap;
227
228 bmap = container_of(item, struct xfs_bmap_intent, bi_list);
229 kmem_free(bmap);
230}
231
232static const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
233 .type = XFS_DEFER_OPS_TYPE_BMAP,
234 .max_items = XFS_BUI_MAX_FAST_EXTENTS,
235 .diff_items = xfs_bmap_update_diff_items,
236 .create_intent = xfs_bmap_update_create_intent,
237 .abort_intent = xfs_bmap_update_abort_intent,
238 .log_item = xfs_bmap_update_log_item,
239 .create_done = xfs_bmap_update_create_done,
240 .finish_item = xfs_bmap_update_finish_item,
241 .cancel_item = xfs_bmap_update_cancel_item,
242};
243
244/* Register the deferred op type. */
245void
246xfs_bmap_update_init_defer_op(void)
247{
248 xfs_defer_init_op_type(&xfs_bmap_update_defer_type);
249}
diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
new file mode 100644
index 000000000000..94c1877af834
--- /dev/null
+++ b/fs/xfs/xfs_trans_refcount.c
@@ -0,0 +1,264 @@
1/*
2 * Copyright (C) 2016 Oracle. All Rights Reserved.
3 *
4 * Author: Darrick J. Wong <darrick.wong@oracle.com>
5 *
6 * This program is free software; you can redistribute it and/or
7 * modify it under the terms of the GNU General Public License
8 * as published by the Free Software Foundation; either version 2
9 * of the License, or (at your option) any later version.
10 *
11 * This program is distributed in the hope that it would be useful,
12 * but WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 * GNU General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; if not, write the Free Software Foundation,
18 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
19 */
20#include "xfs.h"
21#include "xfs_fs.h"
22#include "xfs_shared.h"
23#include "xfs_format.h"
24#include "xfs_log_format.h"
25#include "xfs_trans_resv.h"
26#include "xfs_mount.h"
27#include "xfs_defer.h"
28#include "xfs_trans.h"
29#include "xfs_trans_priv.h"
30#include "xfs_refcount_item.h"
31#include "xfs_alloc.h"
32#include "xfs_refcount.h"
33
34/*
35 * This routine is called to allocate a "refcount update done"
36 * log item.
37 */
38struct xfs_cud_log_item *
39xfs_trans_get_cud(
40 struct xfs_trans *tp,
41 struct xfs_cui_log_item *cuip)
42{
43 struct xfs_cud_log_item *cudp;
44
45 cudp = xfs_cud_init(tp->t_mountp, cuip);
46 xfs_trans_add_item(tp, &cudp->cud_item);
47 return cudp;
48}
49
50/*
51 * Finish a refcount update and log it to the CUD. Note that the
52 * transaction is marked dirty regardless of whether the refcount
53 * update succeeds or fails to support the CUI/CUD lifecycle rules.
54 */
55int
56xfs_trans_log_finish_refcount_update(
57 struct xfs_trans *tp,
58 struct xfs_cud_log_item *cudp,
59 struct xfs_defer_ops *dop,
60 enum xfs_refcount_intent_type type,
61 xfs_fsblock_t startblock,
62 xfs_extlen_t blockcount,
63 xfs_fsblock_t *new_fsb,
64 xfs_extlen_t *new_len,
65 struct xfs_btree_cur **pcur)
66{
67 int error;
68
69 error = xfs_refcount_finish_one(tp, dop, type, startblock,
70 blockcount, new_fsb, new_len, pcur);
71
72 /*
73 * Mark the transaction dirty, even on error. This ensures the
74 * transaction is aborted, which:
75 *
76 * 1.) releases the CUI and frees the CUD
77 * 2.) shuts down the filesystem
78 */
79 tp->t_flags |= XFS_TRANS_DIRTY;
80 cudp->cud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
81
82 return error;
83}
84
85/* Sort refcount intents by AG. */
86static int
87xfs_refcount_update_diff_items(
88 void *priv,
89 struct list_head *a,
90 struct list_head *b)
91{
92 struct xfs_mount *mp = priv;
93 struct xfs_refcount_intent *ra;
94 struct xfs_refcount_intent *rb;
95
96 ra = container_of(a, struct xfs_refcount_intent, ri_list);
97 rb = container_of(b, struct xfs_refcount_intent, ri_list);
98 return XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
99 XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
100}
101
102/* Get a CUI. */
103STATIC void *
104xfs_refcount_update_create_intent(
105 struct xfs_trans *tp,
106 unsigned int count)
107{
108 struct xfs_cui_log_item *cuip;
109
110 ASSERT(tp != NULL);
111 ASSERT(count > 0);
112
113 cuip = xfs_cui_init(tp->t_mountp, count);
114 ASSERT(cuip != NULL);
115
116 /*
117 * Get a log_item_desc to point at the new item.
118 */
119 xfs_trans_add_item(tp, &cuip->cui_item);
120 return cuip;
121}
122
123/* Set the phys extent flags for this refcount update. */
124static void
125xfs_trans_set_refcount_flags(
126 struct xfs_phys_extent *refc,
127 enum xfs_refcount_intent_type type)
128{
129 refc->pe_flags = 0;
130 switch (type) {
131 case XFS_REFCOUNT_INCREASE:
132 case XFS_REFCOUNT_DECREASE:
133 case XFS_REFCOUNT_ALLOC_COW:
134 case XFS_REFCOUNT_FREE_COW:
135 refc->pe_flags |= type;
136 break;
137 default:
138 ASSERT(0);
139 }
140}
141
142/* Log refcount updates in the intent item. */
143STATIC void
144xfs_refcount_update_log_item(
145 struct xfs_trans *tp,
146 void *intent,
147 struct list_head *item)
148{
149 struct xfs_cui_log_item *cuip = intent;
150 struct xfs_refcount_intent *refc;
151 uint next_extent;
152 struct xfs_phys_extent *ext;
153
154 refc = container_of(item, struct xfs_refcount_intent, ri_list);
155
156 tp->t_flags |= XFS_TRANS_DIRTY;
157 cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
158
159 /*
160 * atomic_inc_return gives us the value after the increment;
161 * we want to use it as an array index so we need to subtract 1 from
162 * it.
163 */
164 next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
165 ASSERT(next_extent < cuip->cui_format.cui_nextents);
166 ext = &cuip->cui_format.cui_extents[next_extent];
167 ext->pe_startblock = refc->ri_startblock;
168 ext->pe_len = refc->ri_blockcount;
169 xfs_trans_set_refcount_flags(ext, refc->ri_type);
170}
171
172/* Get a CUD so we can process all the deferred refcount updates. */
173STATIC void *
174xfs_refcount_update_create_done(
175 struct xfs_trans *tp,
176 void *intent,
177 unsigned int count)
178{
179 return xfs_trans_get_cud(tp, intent);
180}
181
182/* Process a deferred refcount update. */
183STATIC int
184xfs_refcount_update_finish_item(
185 struct xfs_trans *tp,
186 struct xfs_defer_ops *dop,
187 struct list_head *item,
188 void *done_item,
189 void **state)
190{
191 struct xfs_refcount_intent *refc;
192 xfs_fsblock_t new_fsb;
193 xfs_extlen_t new_aglen;
194 int error;
195
196 refc = container_of(item, struct xfs_refcount_intent, ri_list);
197 error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
198 refc->ri_type,
199 refc->ri_startblock,
200 refc->ri_blockcount,
201 &new_fsb, &new_aglen,
202 (struct xfs_btree_cur **)state);
203 /* Did we run out of reservation? Requeue what we didn't finish. */
204 if (!error && new_aglen > 0) {
205 ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
206 refc->ri_type == XFS_REFCOUNT_DECREASE);
207 refc->ri_startblock = new_fsb;
208 refc->ri_blockcount = new_aglen;
209 return -EAGAIN;
210 }
211 kmem_free(refc);
212 return error;
213}
214
215/* Clean up after processing deferred refcounts. */
216STATIC void
217xfs_refcount_update_finish_cleanup(
218 struct xfs_trans *tp,
219 void *state,
220 int error)
221{
222 struct xfs_btree_cur *rcur = state;
223
224 xfs_refcount_finish_one_cleanup(tp, rcur, error);
225}
226
227/* Abort all pending CUIs. */
228STATIC void
229xfs_refcount_update_abort_intent(
230 void *intent)
231{
232 xfs_cui_release(intent);
233}
234
235/* Cancel a deferred refcount update. */
236STATIC void
237xfs_refcount_update_cancel_item(
238 struct list_head *item)
239{
240 struct xfs_refcount_intent *refc;
241
242 refc = container_of(item, struct xfs_refcount_intent, ri_list);
243 kmem_free(refc);
244}
245
246static const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
247 .type = XFS_DEFER_OPS_TYPE_REFCOUNT,
248 .max_items = XFS_CUI_MAX_FAST_EXTENTS,
249 .diff_items = xfs_refcount_update_diff_items,
250 .create_intent = xfs_refcount_update_create_intent,
251 .abort_intent = xfs_refcount_update_abort_intent,
252 .log_item = xfs_refcount_update_log_item,
253 .create_done = xfs_refcount_update_create_done,
254 .finish_item = xfs_refcount_update_finish_item,
255 .finish_cleanup = xfs_refcount_update_finish_cleanup,
256 .cancel_item = xfs_refcount_update_cancel_item,
257};
258
259/* Register the deferred op type. */
260void
261xfs_refcount_update_init_defer_op(void)
262{
263 xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
264}
diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
index 5a50ef881568..9ead064b5e90 100644
--- a/fs/xfs/xfs_trans_rmap.c
+++ b/fs/xfs/xfs_trans_rmap.c
@@ -48,12 +48,21 @@ xfs_trans_set_rmap_flags(
 	case XFS_RMAP_MAP:
 		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
 		break;
+	case XFS_RMAP_MAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
+		break;
 	case XFS_RMAP_UNMAP:
 		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
 		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
+		break;
 	case XFS_RMAP_CONVERT:
 		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
 		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
+		break;
 	case XFS_RMAP_ALLOC:
 		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
 		break;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 996111000a8c..7494dc67c66f 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -25,6 +25,7 @@ struct space_resv {
 				 FALLOC_FL_PUNCH_HOLE |		\
 				 FALLOC_FL_COLLAPSE_RANGE |	\
 				 FALLOC_FL_ZERO_RANGE |		\
-				 FALLOC_FL_INSERT_RANGE)
+				 FALLOC_FL_INSERT_RANGE |	\
+				 FALLOC_FL_UNSHARE_RANGE)
 
 #endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 3e445a760f14..b075f601919b 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -58,4 +58,22 @@
  */
 #define FALLOC_FL_INSERT_RANGE	0x20
 
+/*
+ * FALLOC_FL_UNSHARE_RANGE is used to unshare shared blocks within the
+ * file size without overwriting any existing data. The purpose of this
+ * call is to preemptively reallocate any blocks that are subject to
+ * copy-on-write.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the operation. Most will limit operations to filesystem
+ * block size boundaries, but this boundary may be larger or smaller
+ * depending on the filesystem and/or the configuration of the filesystem
+ * or file.
+ *
+ * This flag can only be used with allocate-mode fallocate, which is
+ * to say that it cannot be used with the punch, zero, collapse, or
+ * insert range modes.
+ */
+#define FALLOC_FL_UNSHARE_RANGE	0x40
+
 #endif /* _UAPI_FALLOC_H_ */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 2473272169f2..acb2b6152ba0 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,7 +158,8 @@ struct fsxattr {
 	__u32		fsx_extsize;	/* extsize field value (get/set)*/
 	__u32		fsx_nextents;	/* nextents field value (get)	*/
 	__u32		fsx_projid;	/* project identifier (get/set) */
-	unsigned char	fsx_pad[12];
+	__u32		fsx_cowextsize;	/* CoW extsize field value (get/set)*/
+	unsigned char	fsx_pad[8];
 };
 
 /*
@@ -179,6 +180,7 @@ struct fsxattr {
 #define FS_XFLAG_NODEFRAG	0x00002000	/* do not defragment */
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
+#define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */
 
 /* the read-only stuff doesn't really belong here, but any other place is