author	Linus Torvalds <torvalds@linux-foundation.org>	2013-07-02 12:39:34 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2013-07-02 12:39:34 -0400
commit	9e239bb93914e1c832d54161c7f8f398d0c914ab (patch)
tree	0fe11e8e717152660ad77d77e66bf0f1695d7ed1
parent	63580e51bb3e7ec459501165884e5f815a7a9322 (diff)
parent	6ae06ff51eab5dcbbf959b05ce0f11003a305ba5 (diff)
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.

  In the bug fixes category, of note is a fix for on-line resizing file
  systems where the block size is smaller than the page size (i.e., file
  systems with 1k blocks on x86, or more interestingly file systems with
  4k blocks on Power or ia64 systems).

  In the cleanup category, ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and
  added a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduced in 3.9, no longer has a scalability bottleneck caused by
  the i_es_lru spinlock.  Other optimizations include some changes to
  reduce CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
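Nearly all of the cross-filesystem churn in the diffstat below comes from one
interface change: the address_space operation ->invalidatepage() now takes a
byte length in addition to the starting offset, so that punch hole can
invalidate a sub-range of a page instead of always invalidating from the
offset through to the end of the page.  As a rough sketch of the conversion
each filesystem below performs (not taken from this diff; "myfs" and its
helper are hypothetical):

	/* old prototype: everything from offset to the end of the page goes */
	static void myfs_invalidatepage(struct page *page, unsigned long offset);

	/* new prototype: only the range [offset, offset + length) goes */
	static void myfs_invalidatepage(struct page *page, unsigned int offset,
					unsigned int length)
	{
		/* "whole page" is now offset == 0 && length == PAGE_CACHE_SIZE */
		if (offset == 0 && length == PAGE_CACHE_SIZE)
			myfs_release_private_state(page);	/* hypothetical helper */
	}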
-rw-r--r--	Documentation/filesystems/Locking	6
-rw-r--r--	Documentation/filesystems/vfs.txt	20
-rw-r--r--	fs/9p/vfs_addr.c	5
-rw-r--r--	fs/afs/file.c	10
-rw-r--r--	fs/btrfs/disk-io.c	3
-rw-r--r--	fs/btrfs/extent_io.c	2
-rw-r--r--	fs/btrfs/inode.c	3
-rw-r--r--	fs/buffer.c	21
-rw-r--r--	fs/ceph/addr.c	15
-rw-r--r--	fs/cifs/file.c	5
-rw-r--r--	fs/exofs/inode.c	6
-rw-r--r--	fs/ext3/inode.c	9
-rw-r--r--	fs/ext3/namei.c	7
-rw-r--r--	fs/ext4/balloc.c	14
-rw-r--r--	fs/ext4/ext4.h	187
-rw-r--r--	fs/ext4/ext4_jbd2.c	58
-rw-r--r--	fs/ext4/ext4_jbd2.h	29
-rw-r--r--	fs/ext4/extents.c	193
-rw-r--r--	fs/ext4/extents_status.c	75
-rw-r--r--	fs/ext4/extents_status.h	5
-rw-r--r--	fs/ext4/file.c	14
-rw-r--r--	fs/ext4/fsync.c	52
-rw-r--r--	fs/ext4/ialloc.c	3
-rw-r--r--	fs/ext4/indirect.c	40
-rw-r--r--	fs/ext4/inline.c	4
-rw-r--r--	fs/ext4/inode.c	1751
-rw-r--r--	fs/ext4/mballoc.c	21
-rw-r--r--	fs/ext4/move_extent.c	3
-rw-r--r--	fs/ext4/namei.c	7
-rw-r--r--	fs/ext4/page-io.c	325
-rw-r--r--	fs/ext4/resize.c	24
-rw-r--r--	fs/ext4/super.c	155
-rw-r--r--	fs/f2fs/data.c	3
-rw-r--r--	fs/f2fs/node.c	3
-rw-r--r--	fs/gfs2/aops.c	17
-rw-r--r--	fs/jbd/transaction.c	19
-rw-r--r--	fs/jbd2/Kconfig	6
-rw-r--r--	fs/jbd2/checkpoint.c	22
-rw-r--r--	fs/jbd2/commit.c	184
-rw-r--r--	fs/jbd2/journal.c	166
-rw-r--r--	fs/jbd2/recovery.c	11
-rw-r--r--	fs/jbd2/revoke.c	49
-rw-r--r--	fs/jbd2/transaction.c	526
-rw-r--r--	fs/jfs/jfs_metapage.c	5
-rw-r--r--	fs/logfs/file.c	3
-rw-r--r--	fs/logfs/segment.c	3
-rw-r--r--	fs/nfs/file.c	8
-rw-r--r--	fs/ntfs/aops.c	2
-rw-r--r--	fs/ocfs2/aops.c	5
-rw-r--r--	fs/reiserfs/inode.c	12
-rw-r--r--	fs/ubifs/file.c	5
-rw-r--r--	fs/xfs/xfs_aops.c	14
-rw-r--r--	fs/xfs/xfs_trace.h	15
-rw-r--r--	include/linux/buffer_head.h	3
-rw-r--r--	include/linux/fs.h	2
-rw-r--r--	include/linux/jbd.h	28
-rw-r--r--	include/linux/jbd2.h	175
-rw-r--r--	include/linux/jbd_common.h	26
-rw-r--r--	include/linux/mm.h	3
-rw-r--r--	include/trace/events/ext3.h	12
-rw-r--r--	include/trace/events/ext4.h	304
-rw-r--r--	mm/readahead.c	2
-rw-r--r--	mm/truncate.c	117
63 files changed, 2652 insertions, 2170 deletions
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index bdd82b2339d9..9858f337529c 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -189,7 +189,7 @@ prototypes:
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata);
 	sector_t (*bmap)(struct address_space *, sector_t);
-	int (*invalidatepage) (struct page *, unsigned long);
+	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
@@ -310,8 +310,8 @@ filesystems and by the swapper. The latter will eventually go away. Please,
 keep it that way and don't breed new callers.
 
 	->invalidatepage() is called when the filesystem must attempt to drop
 some or all of the buffers from the page when it is being truncated. It
 returns zero on success. If ->invalidatepage is zero, the kernel uses
 block_invalidatepage() instead.
 
 	->releasepage() is called when the kernel is about to try to drop the
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4a35f6614a66..e6bd1ffd821e 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -549,7 +549,7 @@ struct address_space_operations
 -------------------------------
 
 This describes how the VFS can manipulate mapping of a file to page cache in
-your filesystem. As of kernel 2.6.22, the following members are defined:
+your filesystem. The following members are defined:
 
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -566,7 +566,7 @@ struct address_space_operations {
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata);
 	sector_t (*bmap)(struct address_space *, sector_t);
-	int (*invalidatepage) (struct page *, unsigned long);
+	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
@@ -685,14 +685,14 @@ struct address_space_operations {
   invalidatepage: If a page has PagePrivate set, then invalidatepage
 	will be called when part or all of the page is to be removed
 	from the address space. This generally corresponds to either a
-	truncation or a complete invalidation of the address space
-	(in the latter case 'offset' will always be 0).
-	Any private data associated with the page should be updated
-	to reflect this truncation. If offset is 0, then
-	the private data should be released, because the page
-	must be able to be completely discarded. This may be done by
-	calling the ->releasepage function, but in this case the
+	truncation, punch hole or a complete invalidation of the address
+	space (in the latter case 'offset' will always be 0 and 'length'
+	will be PAGE_CACHE_SIZE). Any private data associated with the page
+	should be updated to reflect this truncation. If offset is 0 and
+	length is PAGE_CACHE_SIZE, then the private data should be released,
+	because the page must be able to be completely discarded. This may
+	be done by calling the ->releasepage function, but in this case the
 	release MUST succeed.
 
   releasepage: releasepage is called on PagePrivate pages to indicate
 	that the page should be freed if possible. ->releasepage
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 055562c580b4..9ff073f4090a 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -148,13 +148,14 @@ static int v9fs_release_page(struct page *page, gfp_t gfp)
  * @offset: offset in the page
  */
 
-static void v9fs_invalidate_page(struct page *page, unsigned long offset)
+static void v9fs_invalidate_page(struct page *page, unsigned int offset,
+				 unsigned int length)
 {
 	/*
 	 * If called with zero offset, we should release
 	 * the private state assocated with the page
 	 */
-	if (offset == 0)
+	if (offset == 0 && length == PAGE_CACHE_SIZE)
 		v9fs_fscache_invalidate_page(page);
 }
 
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 8f6e9234d565..66d50fe2ee45 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -19,7 +19,8 @@
 #include "internal.h"
 
 static int afs_readpage(struct file *file, struct page *page);
-static void afs_invalidatepage(struct page *page, unsigned long offset);
+static void afs_invalidatepage(struct page *page, unsigned int offset,
+			       unsigned int length);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
 
@@ -310,16 +311,17 @@ static int afs_launder_page(struct page *page)
  * - release a page and clean up its private data if offset is 0 (indicating
  *   the entire page)
  */
-static void afs_invalidatepage(struct page *page, unsigned long offset)
+static void afs_invalidatepage(struct page *page, unsigned int offset,
+			       unsigned int length)
 {
 	struct afs_writeback *wb = (struct afs_writeback *) page_private(page);
 
-	_enter("{%lu},%lu", page->index, offset);
+	_enter("{%lu},%u,%u", page->index, offset, length);
 
 	BUG_ON(!PageLocked(page));
 
 	/* we clean up only if the entire page is being invalidated */
-	if (offset == 0) {
+	if (offset == 0 && length == PAGE_CACHE_SIZE) {
 #ifdef CONFIG_AFS_FSCACHE
 		if (PageFsCache(page)) {
 			struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b8b60b660c8f..b0292b3ead54 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1013,7 +1013,8 @@ static int btree_releasepage(struct page *page, gfp_t gfp_flags)
 	return try_release_extent_buffer(page);
 }
 
-static void btree_invalidatepage(struct page *page, unsigned long offset)
+static void btree_invalidatepage(struct page *page, unsigned int offset,
+				 unsigned int length)
 {
 	struct extent_io_tree *tree;
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e7e7afb4a872..6bca9472f313 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2957,7 +2957,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 		pg_offset = i_size & (PAGE_CACHE_SIZE - 1);
 		if (page->index > end_index ||
 		   (page->index == end_index && !pg_offset)) {
-			page->mapping->a_ops->invalidatepage(page, 0);
+			page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
 			unlock_page(page);
 			return 0;
 		}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a46b656d08de..4f9d16b70d3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7493,7 +7493,8 @@ static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 	return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
 }
 
-static void btrfs_invalidatepage(struct page *page, unsigned long offset)
+static void btrfs_invalidatepage(struct page *page, unsigned int offset,
+				 unsigned int length)
 {
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *tree;
diff --git a/fs/buffer.c b/fs/buffer.c
index d2a4d1bb2d57..f93392e2df12 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1454,7 +1454,8 @@ static void discard_buffer(struct buffer_head * bh)
  * block_invalidatepage - invalidate part or all of a buffer-backed page
  *
  * @page: the page which is affected
- * @offset: the index of the truncation point
+ * @offset: start of the range to invalidate
+ * @length: length of the range to invalidate
  *
  * block_invalidatepage() is called when all or part of the page has become
  * invalidated by a truncate operation.
@@ -1465,15 +1466,22 @@ static void discard_buffer(struct buffer_head * bh)
  * point.  Because the caller is about to free (and possibly reuse) those
  * blocks on-disk.
  */
-void block_invalidatepage(struct page *page, unsigned long offset)
+void block_invalidatepage(struct page *page, unsigned int offset,
+			  unsigned int length)
 {
 	struct buffer_head *head, *bh, *next;
 	unsigned int curr_off = 0;
+	unsigned int stop = length + offset;
 
 	BUG_ON(!PageLocked(page));
 	if (!page_has_buffers(page))
 		goto out;
 
+	/*
+	 * Check for overflow
+	 */
+	BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
+
 	head = page_buffers(page);
 	bh = head;
 	do {
@@ -1481,6 +1489,12 @@ void block_invalidatepage(struct page *page, unsigned long offset)
 		next = bh->b_this_page;
 
 		/*
+		 * Are we still fully in range ?
+		 */
+		if (next_off > stop)
+			goto out;
+
+		/*
 		 * is this block fully invalidated?
 		 */
 		if (offset <= curr_off)
@@ -1501,6 +1515,7 @@ out:
 }
 EXPORT_SYMBOL(block_invalidatepage);
 
+
 /*
  * We attach and possibly dirty the buffers atomically wrt
  * __set_page_dirty_buffers() via private_lock.  try_to_free_buffers
@@ -2841,7 +2856,7 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
 		 * they may have been added in ext3_writepage().  Make them
 		 * freeable here, so the page does not leak.
 		 */
-		do_invalidatepage(page, 0);
+		do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
 		unlock_page(page);
 		return 0; /* don't care */
 	}
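The new logic in block_invalidatepage() clamps the walk over the page's
buffer_heads to the invalidated byte range.  The same arithmetic, pulled out
into a stand-alone user-space sketch (illustrative only; the kernel walks
buffer_heads rather than printing offsets):

	#include <stdio.h>

	int main(void)
	{
		unsigned int block_size = 1024, page_size = 4096;
		unsigned int offset = 1024, length = 2048;	/* punch bytes 1024..3071 */
		unsigned int curr_off = 0, stop = offset + length;

		while (curr_off < page_size) {
			unsigned int next_off = curr_off + block_size;

			if (next_off > stop)	/* block sticks out past the range: stop */
				break;
			if (offset <= curr_off)	/* block lies fully inside: discard it */
				printf("discard block at byte %u\n", curr_off);
			curr_off = next_off;
		}
		return 0;
	}

With 1k blocks this discards only the buffers at bytes 1024 and 2048 and
leaves the first and last blocks of the page intact, which is exactly the
partial-page case punch hole needs on a blocksize < pagesize filesystem.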
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac101040..38b5c1bc6776 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -143,7 +143,8 @@ static int ceph_set_page_dirty(struct page *page)
  * dirty page counters appropriately.  Only called if there is private
  * data on the page.
  */
-static void ceph_invalidatepage(struct page *page, unsigned long offset)
+static void ceph_invalidatepage(struct page *page, unsigned int offset,
+				unsigned int length)
 {
 	struct inode *inode;
 	struct ceph_inode_info *ci;
@@ -163,20 +164,20 @@ static void ceph_invalidatepage(struct page *page, unsigned long offset)
 	if (!PageDirty(page))
 		pr_err("%p invalidatepage %p page not dirty\n", inode, page);
 
-	if (offset == 0)
+	if (offset == 0 && length == PAGE_CACHE_SIZE)
 		ClearPageChecked(page);
 
 	ci = ceph_inode(inode);
-	if (offset == 0) {
-		dout("%p invalidatepage %p idx %lu full dirty page %lu\n",
-		     inode, page, page->index, offset);
+	if (offset == 0 && length == PAGE_CACHE_SIZE) {
+		dout("%p invalidatepage %p idx %lu full dirty page\n",
+		     inode, page, page->index);
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 		ceph_put_snap_context(snapc);
 		page->private = 0;
 		ClearPagePrivate(page);
 	} else {
-		dout("%p invalidatepage %p idx %lu partial dirty page\n",
-		     inode, page, page->index);
+		dout("%p invalidatepage %p idx %lu partial dirty page %u(%u)\n",
+		     inode, page, page->index, offset, length);
 	}
 }
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d24c9f4..4d8ba8d491e5 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3546,11 +3546,12 @@ static int cifs_release_page(struct page *page, gfp_t gfp)
 	return cifs_fscache_release_page(page, gfp);
 }
 
-static void cifs_invalidate_page(struct page *page, unsigned long offset)
+static void cifs_invalidate_page(struct page *page, unsigned int offset,
+				 unsigned int length)
 {
 	struct cifsInodeInfo *cifsi = CIFS_I(page->mapping->host);
 
-	if (offset == 0)
+	if (offset == 0 && length == PAGE_CACHE_SIZE)
 		cifs_fscache_invalidate_page(page, &cifsi->vfs_inode);
 }
 
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index d1f80abd8828..2ec8eb1ab269 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -953,9 +953,11 @@ static int exofs_releasepage(struct page *page, gfp_t gfp)
 	return 0;
 }
 
-static void exofs_invalidatepage(struct page *page, unsigned long offset)
+static void exofs_invalidatepage(struct page *page, unsigned int offset,
+				 unsigned int length)
 {
-	EXOFS_DBGMSG("page 0x%lx offset 0x%lx\n", page->index, offset);
+	EXOFS_DBGMSG("page 0x%lx offset 0x%x length 0x%x\n",
+		     page->index, offset, length);
 	WARN_ON(1);
 }
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 23c712825640..f67668f724ba 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1825,19 +1825,20 @@ ext3_readpages(struct file *file, struct address_space *mapping,
 	return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
-static void ext3_invalidatepage(struct page *page, unsigned long offset)
+static void ext3_invalidatepage(struct page *page, unsigned int offset,
+				unsigned int length)
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
 
-	trace_ext3_invalidatepage(page, offset);
+	trace_ext3_invalidatepage(page, offset, length);
 
 	/*
 	 * If it's a full truncate we just forget about the pending dirtying
 	 */
-	if (offset == 0)
+	if (offset == 0 && length == PAGE_CACHE_SIZE)
 		ClearPageChecked(page);
 
-	journal_invalidatepage(journal, page, offset);
+	journal_invalidatepage(journal, page, offset, length);
 }
 
 static int ext3_releasepage(struct page *page, gfp_t wait)
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 692de13e3596..cea8ecf3e76e 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -576,11 +576,8 @@ static int htree_dirblock_to_tree(struct file *dir_file,
 		if (!ext3_check_dir_entry("htree_dirblock_to_tree", dir, de, bh,
 				(block<<EXT3_BLOCK_SIZE_BITS(dir->i_sb))
 					+((char *)de - bh->b_data))) {
-			/* On error, skip the f_pos to the next block. */
-			dir_file->f_pos = (dir_file->f_pos |
-					(dir->i_sb->s_blocksize - 1)) + 1;
-			brelse (bh);
-			return count;
+			/* silently ignore the rest of the block */
+			break;
 		}
 		ext3fs_dirhash(de->name, de->name_len, hinfo);
 		if ((hinfo->hash < start_hash) ||
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index d0f13eada0ed..58339393fa6e 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -682,11 +682,15 @@ ext4_fsblk_t ext4_count_free_clusters(struct super_block *sb)
 
 static inline int test_root(ext4_group_t a, int b)
 {
-	int num = b;
-
-	while (a > num)
-		num *= b;
-	return num == a;
+	while (1) {
+		if (a < b)
+			return 0;
+		if (a == b)
+			return 1;
+		if ((a % b) != 0)
+			return 0;
+		a = a / b;
+	}
 }
 
 static int ext4_group_sparse(ext4_group_t group)
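The rewritten test_root() decides whether group a is an exact power of b
without the overflow risk of the old version, which multiplied num upward
until it reached or passed a.  Dividing a down by b instead fails fast on the
first non-zero remainder and never wraps.  The same loop, restated as a small
self-checking user-space program (illustrative only):

	#include <assert.h>

	static int test_root(unsigned int a, unsigned int b)
	{
		while (1) {
			if (a < b)
				return 0;
			if (a == b)
				return 1;
			if ((a % b) != 0)
				return 0;
			a = a / b;
		}
	}

	int main(void)
	{
		assert(test_root(49, 7));	/* 7^2 */
		assert(test_root(343, 7));	/* 7^3 */
		assert(!test_root(50, 7));
		assert(!test_root(1, 7));	/* callers handle small groups separately */
		return 0;
	}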
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4af03ea84aa3..b577e45425b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -177,38 +177,28 @@ struct ext4_map_blocks {
 };
 
 /*
- * For delayed allocation tracking
- */
-struct mpage_da_data {
-	struct inode *inode;
-	sector_t b_blocknr;		/* start block number of extent */
-	size_t b_size;			/* size of extent */
-	unsigned long b_state;		/* state of the extent */
-	unsigned long first_page, next_page;	/* extent of pages */
-	struct writeback_control *wbc;
-	int io_done;
-	int pages_written;
-	int retval;
-};
-
-/*
  * Flags for ext4_io_end->flags
  */
 #define	EXT4_IO_END_UNWRITTEN	0x0001
-#define EXT4_IO_END_ERROR	0x0002
-#define EXT4_IO_END_DIRECT	0x0004
+#define EXT4_IO_END_DIRECT	0x0002
 
 /*
- * For converting uninitialized extents on a work queue.
+ * For converting uninitialized extents on a work queue. 'handle' is used for
+ * buffered writeback.
  */
 typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished IO list */
+	handle_t		*handle;	/* handle reserved for extent
+						 * conversion */
 	struct inode		*inode;		/* file being written to */
+	struct bio		*bio;		/* Linked list of completed
+						 * bios covering the extent */
 	unsigned int		flag;		/* unwritten or not */
 	loff_t			offset;		/* offset in the file */
 	ssize_t			size;		/* size of the extent */
 	struct kiocb		*iocb;		/* iocb struct for AIO */
 	int			result;		/* error value for AIO */
+	atomic_t		count;		/* reference counter */
 } ext4_io_end_t;
 
 struct ext4_io_submit {
@@ -581,11 +571,6 @@ enum {
 #define EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER	0x0020
 
 /*
- * Flags used by ext4_discard_partial_page_buffers
- */
-#define EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED	0x0001
-
-/*
  * ioctl commands
  */
 #define	EXT4_IOC_GETFLAGS		FS_IOC_GETFLAGS
@@ -879,6 +864,7 @@ struct ext4_inode_info {
 	rwlock_t i_es_lock;
 	struct list_head i_es_lru;
 	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
+	unsigned long i_touch_when;	/* jiffies of last accessing */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -903,12 +889,22 @@ struct ext4_inode_info {
 	qsize_t i_reserved_quota;
 #endif
 
-	/* completed IOs that might need unwritten extents handling */
-	struct list_head i_completed_io_list;
+	/* Lock protecting lists below */
 	spinlock_t i_completed_io_lock;
+	/*
+	 * Completed IOs that need unwritten extents handling and have
+	 * transaction reserved
+	 */
+	struct list_head i_rsv_conversion_list;
+	/*
+	 * Completed IOs that need unwritten extents handling and don't have
+	 * transaction reserved
+	 */
+	struct list_head i_unrsv_conversion_list;
 	atomic_t i_ioend_count;	/* Number of outstanding io_end structs */
 	atomic_t i_unwritten;	/* Nr. of inflight conversions pending */
-	struct work_struct i_unwritten_work;	/* deferred extent conversion */
+	struct work_struct i_rsv_conversion_work;
+	struct work_struct i_unrsv_conversion_work;
 
 	spinlock_t i_block_reservation_lock;
 
@@ -1245,7 +1241,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_stats;
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
-	unsigned int s_max_writeback_mb_bump;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
@@ -1281,8 +1276,10 @@ struct ext4_sb_info {
 	struct flex_groups *s_flex_groups;
 	ext4_group_t s_flex_groups_allocated;
 
-	/* workqueue for dio unwritten */
-	struct workqueue_struct *dio_unwritten_wq;
+	/* workqueue for unreserved extent convertions (dio) */
+	struct workqueue_struct *unrsv_conversion_wq;
+	/* workqueue for reserved extent conversions (buffered io) */
+	struct workqueue_struct *rsv_conversion_wq;
 
 	/* timer for periodic error stats printing */
 	struct timer_list s_err_report;
@@ -1307,6 +1304,7 @@ struct ext4_sb_info {
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
 	struct list_head s_es_lru;
+	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 };
@@ -1342,6 +1340,9 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
 					       struct ext4_io_end *io_end)
 {
 	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		/* Writeback has to have coversion transaction reserved */
+		WARN_ON(EXT4_SB(inode->i_sb)->s_journal && !io_end->handle &&
+			!(io_end->flag & EXT4_IO_END_DIRECT));
 		io_end->flag |= EXT4_IO_END_UNWRITTEN;
 		atomic_inc(&EXT4_I(inode)->i_unwritten);
 	}
@@ -1999,7 +2000,6 @@ static inline unsigned char get_dtype(struct super_block *sb, int filetype)
 
 /* fsync.c */
 extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
-extern int ext4_flush_unwritten_io(struct inode *);
 
 /* hash.c */
 extern int ext4fs_dirhash(const char *name, int len, struct
@@ -2088,7 +2088,7 @@ extern int ext4_change_inode_journal_flag(struct inode *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_can_truncate(struct inode *inode);
 extern void ext4_truncate(struct inode *);
-extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
+extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern void ext4_get_inode_flags(struct ext4_inode_info *);
@@ -2096,9 +2096,12 @@ extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
-extern int ext4_discard_partial_page_buffers(handle_t *handle,
-		struct address_space *mapping, loff_t from,
-		loff_t length, int flags);
+extern int ext4_block_truncate_page(handle_t *handle,
+		struct address_space *mapping, loff_t from);
+extern int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length);
+extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
+			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern void ext4_da_update_reserve_space(struct inode *inode,
@@ -2111,7 +2114,7 @@ extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 				  const struct iovec *iov, loff_t offset,
 				  unsigned long nr_segs);
 extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
-extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk);
+extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
 extern void ext4_ind_truncate(handle_t *, struct inode *inode);
 extern int ext4_free_hole_blocks(handle_t *handle, struct inode *inode,
 				 ext4_lblk_t first, ext4_lblk_t stop);
@@ -2166,42 +2169,96 @@ extern int ext4_alloc_flex_bg_array(struct super_block *sb,
 				    ext4_group_t ngroup);
 extern const char *ext4_decode_error(struct super_block *sb, int errno,
 				     char nbuf[16]);
+
 extern __printf(4, 5)
 void __ext4_error(struct super_block *, const char *, unsigned int,
 		  const char *, ...);
-#define ext4_error(sb, message...)	__ext4_error(sb, __func__, \
-						     __LINE__, ## message)
 extern __printf(5, 6)
-void ext4_error_inode(struct inode *, const char *, unsigned int, ext4_fsblk_t,
+void __ext4_error_inode(struct inode *, const char *, unsigned int, ext4_fsblk_t,
 		      const char *, ...);
 extern __printf(5, 6)
-void ext4_error_file(struct file *, const char *, unsigned int, ext4_fsblk_t,
+void __ext4_error_file(struct file *, const char *, unsigned int, ext4_fsblk_t,
 		     const char *, ...);
 extern void __ext4_std_error(struct super_block *, const char *,
 			     unsigned int, int);
 extern __printf(4, 5)
 void __ext4_abort(struct super_block *, const char *, unsigned int,
 		  const char *, ...);
-#define ext4_abort(sb, message...)	__ext4_abort(sb, __func__, \
-						     __LINE__, ## message)
 extern __printf(4, 5)
 void __ext4_warning(struct super_block *, const char *, unsigned int,
 		    const char *, ...);
-#define ext4_warning(sb, message...)	__ext4_warning(sb, __func__, \
-						       __LINE__, ## message)
 extern __printf(3, 4)
-void ext4_msg(struct super_block *, const char *, const char *, ...);
+void __ext4_msg(struct super_block *, const char *, const char *, ...);
 extern void __dump_mmp_msg(struct super_block *, struct mmp_struct *mmp,
 			   const char *, unsigned int, const char *);
-#define dump_mmp_msg(sb, mmp, msg)	__dump_mmp_msg(sb, mmp, __func__, \
-						       __LINE__, msg)
 extern __printf(7, 8)
 void __ext4_grp_locked_error(const char *, unsigned int,
 			     struct super_block *, ext4_group_t,
 			     unsigned long, ext4_fsblk_t,
 			     const char *, ...);
-#define ext4_grp_locked_error(sb, grp, message...) \
-	__ext4_grp_locked_error(__func__, __LINE__, (sb), (grp), ## message)
+
+#ifdef CONFIG_PRINTK
+
+#define ext4_error_inode(inode, func, line, block, fmt, ...)		\
+	__ext4_error_inode(inode, func, line, block, fmt, ##__VA_ARGS__)
+#define ext4_error_file(file, func, line, block, fmt, ...)		\
+	__ext4_error_file(file, func, line, block, fmt, ##__VA_ARGS__)
+#define ext4_error(sb, fmt, ...)					\
+	__ext4_error(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define ext4_abort(sb, fmt, ...)					\
+	__ext4_abort(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define ext4_warning(sb, fmt, ...)					\
+	__ext4_warning(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define ext4_msg(sb, level, fmt, ...)					\
+	__ext4_msg(sb, level, fmt, ##__VA_ARGS__)
+#define dump_mmp_msg(sb, mmp, msg)					\
+	__dump_mmp_msg(sb, mmp, __func__, __LINE__, msg)
+#define ext4_grp_locked_error(sb, grp, ino, block, fmt, ...)		\
+	__ext4_grp_locked_error(__func__, __LINE__, sb, grp, ino, block, \
+				fmt, ##__VA_ARGS__)
+
+#else
+
+#define ext4_error_inode(inode, func, line, block, fmt, ...)		\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_error_inode(inode, "", 0, block, " ");			\
+} while (0)
+#define ext4_error_file(file, func, line, block, fmt, ...)		\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_error_file(file, "", 0, block, " ");			\
+} while (0)
+#define ext4_error(sb, fmt, ...)					\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_error(sb, "", 0, " ");					\
+} while (0)
+#define ext4_abort(sb, fmt, ...)					\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_abort(sb, "", 0, " ");					\
+} while (0)
+#define ext4_warning(sb, fmt, ...)					\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_warning(sb, "", 0, " ");					\
+} while (0)
+#define ext4_msg(sb, level, fmt, ...)					\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_msg(sb, "", " ");					\
+} while (0)
+#define dump_mmp_msg(sb, mmp, msg)					\
+	__dump_mmp_msg(sb, mmp, "", 0, "")
+#define ext4_grp_locked_error(sb, grp, ino, block, fmt, ...)		\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_grp_locked_error("", 0, sb, grp, ino, block, " ");	\
+} while (0)
+
+#endif
+
 extern void ext4_update_dynamic_rev(struct super_block *sb);
 extern int ext4_update_compat_feature(handle_t *handle, struct super_block *sb,
 				      __u32 compat);
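The !CONFIG_PRINTK variants above shrink the object size by dropping the
__func__, __LINE__ and format-string arguments, while the no_printk() call
preserves the compiler's printf-style type checking without emitting any
code.  The shape of that trick, reduced to a toy example (my_log/__my_log
are made-up names, not ext4 functions):

	#include <stdarg.h>
	#include <stdio.h>

	void __my_log(const char *func, int line, const char *fmt, ...)
	{
		va_list args;

		va_start(args, fmt);
		printf("%s:%d: ", func, line);
		vprintf(fmt, args);
		va_end(args);
	}

	/* like the kernel's no_printk(): arguments are type-checked, nothing runs */
	#define no_printk(fmt, ...) \
		do { if (0) printf(fmt, ##__VA_ARGS__); } while (0)

	#ifdef MY_PRINTK
	#define my_log(fmt, ...) __my_log(__func__, __LINE__, fmt, ##__VA_ARGS__)
	#else
	#define my_log(fmt, ...)			\
	do {						\
		no_printk(fmt, ##__VA_ARGS__);		\
		__my_log("", 0, " ");			\
	} while (0)
	#endif

	int main(void)
	{
		my_log("resized to %d groups\n", 42);
		return 0;
	}

The per-call-site strings and line numbers vanish from the image, but a
mistyped format argument still produces the usual compiler warning.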
@@ -2312,6 +2369,7 @@ struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
 {
 	struct ext4_group_info ***grp_info;
 	long indexv, indexh;
+	BUG_ON(group >= EXT4_SB(sb)->s_groups_count);
 	grp_info = EXT4_SB(sb)->s_group_info;
 	indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
 	indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
@@ -2598,8 +2656,7 @@ struct ext4_extent;
 
 extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
 extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
-extern int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks,
-				       int chunk);
+extern int ext4_ext_index_trans_blocks(struct inode *inode, int extents);
 extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
 extern void ext4_ext_truncate(handle_t *, struct inode *);
@@ -2609,8 +2666,8 @@ extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
 			  loff_t len);
-extern int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
-					  ssize_t len);
+extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
+					  loff_t offset, ssize_t len);
 extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			   struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_calc_metadata_amount(struct inode *inode,
@@ -2650,12 +2707,15 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
 
 /* page-io.c */
 extern int __init ext4_init_pageio(void);
-extern void ext4_add_complete_io(ext4_io_end_t *io_end);
 extern void ext4_exit_pageio(void);
-extern void ext4_ioend_shutdown(struct inode *);
-extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
-extern void ext4_end_io_work(struct work_struct *work);
+extern ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end);
+extern int ext4_put_io_end(ext4_io_end_t *io_end);
+extern void ext4_put_io_end_defer(ext4_io_end_t *io_end);
+extern void ext4_io_submit_init(struct ext4_io_submit *io,
+				struct writeback_control *wbc);
+extern void ext4_end_io_rsv_work(struct work_struct *work);
+extern void ext4_end_io_unrsv_work(struct work_struct *work);
 extern void ext4_io_submit(struct ext4_io_submit *io);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
@@ -2668,20 +2728,17 @@ extern void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp);
 extern int ext4_mmp_csum_verify(struct super_block *sb,
 				struct mmp_struct *mmp);
 
-/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
+/*
+ * Note that these flags will never ever appear in a buffer_head's state flag.
+ * See EXT4_MAP_... to see where this is used.
+ */
 enum ext4_state_bits {
 	BH_Uninit	/* blocks are allocated but uninitialized on disk */
 	 = BH_JBDPrivateStart,
 	BH_AllocFromCluster,	/* allocated blocks were part of already
-				 * allocated cluster. Note that this flag will
-				 * never, ever appear in a buffer_head's state
-				 * flag. See EXT4_MAP_FROM_CLUSTER to see where
-				 * this is used. */
+				 * allocated cluster. */
 };
 
-BUFFER_FNS(Uninit, uninit)
-TAS_BUFFER_FNS(Uninit, uninit)
-
 /*
  * Add new method to test whether block and inode bitmaps are properly
  * initialized. With uninit_bg reading the block from disk is not enough
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 451eb4045330..72a3600aedbd 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -38,31 +38,43 @@ static void ext4_put_nojournal(handle_t *handle)
 /*
  * Wrappers for jbd2_journal_start/end.
  */
-handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int nblocks)
+static int ext4_journal_check_start(struct super_block *sb)
 {
 	journal_t *journal;
 
 	might_sleep();
-
-	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
 	if (sb->s_flags & MS_RDONLY)
-		return ERR_PTR(-EROFS);
-
+		return -EROFS;
 	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
 	journal = EXT4_SB(sb)->s_journal;
-	if (!journal)
-		return ext4_get_nojournal();
 	/*
 	 * Special case here: if the journal has aborted behind our
 	 * backs (eg. EIO in the commit thread), then we still need to
 	 * take the FS itself readonly cleanly.
 	 */
-	if (is_journal_aborted(journal)) {
+	if (journal && is_journal_aborted(journal)) {
 		ext4_abort(sb, "Detected aborted journal");
-		return ERR_PTR(-EROFS);
+		return -EROFS;
 	}
-	return jbd2__journal_start(journal, nblocks, GFP_NOFS, type, line);
+	return 0;
+}
+
+handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
+				  int type, int blocks, int rsv_blocks)
+{
+	journal_t *journal;
+	int err;
+
+	trace_ext4_journal_start(sb, blocks, rsv_blocks, _RET_IP_);
+	err = ext4_journal_check_start(sb);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	journal = EXT4_SB(sb)->s_journal;
+	if (!journal)
+		return ext4_get_nojournal();
+	return jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS,
+				   type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
@@ -86,6 +98,30 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
 	return err;
 }
 
+handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
+					int type)
+{
+	struct super_block *sb;
+	int err;
+
+	if (!ext4_handle_valid(handle))
+		return ext4_get_nojournal();
+
+	sb = handle->h_journal->j_private;
+	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
+					  _RET_IP_);
+	err = ext4_journal_check_start(sb);
+	if (err < 0) {
+		jbd2_journal_free_reserved(handle);
+		return ERR_PTR(err);
+	}
+
+	err = jbd2_journal_start_reserved(handle, type, line);
+	if (err < 0)
+		return ERR_PTR(err);
+	return handle;
+}
+
 void ext4_journal_abort_handle(const char *caller, unsigned int line,
 			       const char *err_fn, struct buffer_head *bh,
 			       handle_t *handle, int err)
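Together with the ext4_jbd2.h changes that follow, this gives buffered
writeback a two-phase pattern: reserve journal credits when the write is
submitted, then turn the reservation into a live handle at I/O completion,
where blocking on journal space is no longer acceptable.  Roughly (a
simplified sketch of the intended calling sequence based on these helpers,
not a verbatim excerpt; error handling omitted):

	/* submission: start a normal handle plus rsv_blocks of reserved credits */
	handle = ext4_journal_start_with_reserve(inode, EXT4_HT_WRITE_PAGE,
						 needed_blocks, rsv_blocks);
	io_end->handle = handle->h_rsv_handle;	/* stash the reserved handle */
	handle->h_rsv_handle = NULL;
	...
	ext4_journal_stop(handle);

	/* completion: activate the reservation for the extent conversion */
	handle = ext4_journal_start_reserved(io_end->handle,
					     EXT4_HT_EXT_CONVERT);
	ext4_convert_unwritten_extents(handle, inode, offset, size);
	ext4_journal_stop(handle);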
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index c8c6885406db..2877258d9497 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -134,7 +134,8 @@ static inline int ext4_jbd2_credits_xattr(struct inode *inode)
 #define EXT4_HT_MIGRATE			 8
 #define EXT4_HT_MOVE_EXTENTS		 9
 #define EXT4_HT_XATTR			10
-#define EXT4_HT_MAX			11
+#define EXT4_HT_EXT_CONVERT		11
+#define EXT4_HT_MAX			12
 
 /**
  * struct ext4_journal_cb_entry - Base structure for callback information.
@@ -265,7 +266,7 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
 	__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int nblocks);
+				  int type, int blocks, int rsv_blocks);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
 #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -300,21 +301,37 @@ static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed)
 }
 
 #define ext4_journal_start_sb(sb, type, nblocks)			\
-	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks))
+	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0)
 
 #define ext4_journal_start(inode, type, nblocks)			\
-	__ext4_journal_start((inode), __LINE__, (type), (nblocks))
+	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0)
+
+#define ext4_journal_start_with_reserve(inode, type, blocks, rsv_blocks) \
+	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks))
 
 static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     unsigned int line, int type,
-					     int nblocks)
+					     int blocks, int rsv_blocks)
 {
-	return __ext4_journal_start_sb(inode->i_sb, line, type, nblocks);
+	return __ext4_journal_start_sb(inode->i_sb, line, type, blocks,
+				       rsv_blocks);
 }
 
 #define ext4_journal_stop(handle) \
 	__ext4_journal_stop(__func__, __LINE__, (handle))
 
+#define ext4_journal_start_reserved(handle, type) \
+	__ext4_journal_start_reserved((handle), __LINE__, (type))
+
+handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
+					int type);
+
+static inline void ext4_journal_free_reserved(handle_t *handle)
+{
+	if (ext4_handle_valid(handle))
+		jbd2_journal_free_reserved(handle);
+}
+
 static inline handle_t *ext4_journal_current_handle(void)
 {
 	return journal_current_handle();
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index bc0f1910b9cf..7097b0f680e6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2125,7 +2125,8 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 			next_del = ext4_find_delayed_extent(inode, &es);
 			if (!exists && next_del) {
 				exists = 1;
-				flags |= FIEMAP_EXTENT_DELALLOC;
+				flags |= (FIEMAP_EXTENT_DELALLOC |
+					  FIEMAP_EXTENT_UNKNOWN);
 			}
 			up_read(&EXT4_I(inode)->i_data_sem);
 
@@ -2328,17 +2329,15 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int nrblocks,
 }
 
 /*
- * How many index/leaf blocks need to change/allocate to modify nrblocks?
+ * How many index/leaf blocks need to change/allocate to add @extents extents?
  *
- * if nrblocks are fit in a single extent (chunk flag is 1), then
- * in the worse case, each tree level index/leaf need to be changed
- * if the tree split due to insert a new extent, then the old tree
- * index/leaf need to be updated too
+ * If we add a single extent, then in the worse case, each tree level
+ * index/leaf need to be changed in case of the tree split.
  *
- * If the nrblocks are discontiguous, they could cause
- * the whole tree split more than once, but this is really rare.
+ * If more extents are inserted, they could cause the whole tree split more
+ * than once, but this is really rare.
  */
-int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+int ext4_ext_index_trans_blocks(struct inode *inode, int extents)
 {
 	int index;
 	int depth;
@@ -2349,7 +2348,7 @@ int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 
 	depth = ext_depth(inode);
 
-	if (chunk)
+	if (extents <= 1)
 		index = depth * 2;
 	else
 		index = depth * 3;
@@ -2357,20 +2356,24 @@ int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
 	return index;
 }
 
+static inline int get_default_free_blocks_flags(struct inode *inode)
+{
+	if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+		return EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET;
+	else if (ext4_should_journal_data(inode))
+		return EXT4_FREE_BLOCKS_FORGET;
+	return 0;
+}
+
 static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 			      struct ext4_extent *ex,
-			      ext4_fsblk_t *partial_cluster,
+			      long long *partial_cluster,
 			      ext4_lblk_t from, ext4_lblk_t to)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	unsigned short ee_len = ext4_ext_get_actual_len(ex);
 	ext4_fsblk_t pblk;
-	int flags = 0;
-
-	if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
-		flags |= EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET;
-	else if (ext4_should_journal_data(inode))
-		flags |= EXT4_FREE_BLOCKS_FORGET;
+	int flags = get_default_free_blocks_flags(inode);
 
 	/*
 	 * For bigalloc file systems, we never free a partial cluster
@@ -2388,7 +2391,8 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 	 * partial cluster here.
 	 */
 	pblk = ext4_ext_pblock(ex) + ee_len - 1;
-	if (*partial_cluster && (EXT4_B2C(sbi, pblk) != *partial_cluster)) {
+	if ((*partial_cluster > 0) &&
+	    (EXT4_B2C(sbi, pblk) != *partial_cluster)) {
 		ext4_free_blocks(handle, inode, NULL,
 				 EXT4_C2B(sbi, *partial_cluster),
 				 sbi->s_cluster_ratio, flags);
@@ -2414,41 +2418,46 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		ext4_lblk_t num;
+		unsigned int unaligned;
 
 		num = le32_to_cpu(ex->ee_block) + ee_len - from;
 		pblk = ext4_ext_pblock(ex) + ee_len - num;
-		ext_debug("free last %u blocks starting %llu\n", num, pblk);
+		/*
+		 * Usually we want to free partial cluster at the end of the
+		 * extent, except for the situation when the cluster is still
+		 * used by any other extent (partial_cluster is negative).
+		 */
+		if (*partial_cluster < 0 &&
+		    -(*partial_cluster) == EXT4_B2C(sbi, pblk + num - 1))
+			flags |= EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER;
+
+		ext_debug("free last %u blocks starting %llu partial %lld\n",
+			  num, pblk, *partial_cluster);
 		ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
 		/*
 		 * If the block range to be freed didn't start at the
 		 * beginning of a cluster, and we removed the entire
-		 * extent, save the partial cluster here, since we
-		 * might need to delete if we determine that the
-		 * truncate operation has removed all of the blocks in
-		 * the cluster.
+		 * extent and the cluster is not used by any other extent,
+		 * save the partial cluster here, since we might need to
+		 * delete if we determine that the truncate operation has
+		 * removed all of the blocks in the cluster.
+		 *
+		 * On the other hand, if we did not manage to free the whole
+		 * extent, we have to mark the cluster as used (store negative
+		 * cluster number in partial_cluster).
 		 */
-		if (pblk & (sbi->s_cluster_ratio - 1) &&
-		    (ee_len == num))
+		unaligned = pblk & (sbi->s_cluster_ratio - 1);
+		if (unaligned && (ee_len == num) &&
+		    (*partial_cluster != -((long long)EXT4_B2C(sbi, pblk))))
 			*partial_cluster = EXT4_B2C(sbi, pblk);
-		else
+		else if (unaligned)
+			*partial_cluster = -((long long)EXT4_B2C(sbi, pblk));
+		else if (*partial_cluster > 0)
 			*partial_cluster = 0;
-	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
-		/* head removal */
-		ext4_lblk_t num;
-		ext4_fsblk_t start;
-
-		num = to - from;
-		start = ext4_ext_pblock(ex);
-
-		ext_debug("free first %u blocks starting %llu\n", num, start);
-		ext4_free_blocks(handle, inode, NULL, start, num, flags);
-
-	} else {
-		printk(KERN_INFO "strange request: removal(2) "
-		       "%u-%u from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), ee_len);
-	}
+	} else
+		ext4_error(sbi->s_sb, "strange request: removal(2) "
+			   "%u-%u from %u:%u\n",
+			   from, to, le32_to_cpu(ex->ee_block), ee_len);
 	return 0;
 }
2454 2463
@@ -2461,12 +2470,16 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
  * @handle: The journal handle
  * @inode:  The files inode
  * @path:   The path to the leaf
+ * @partial_cluster: The cluster which we'll have to free if all extents
+ *                   has been released from it. It gets negative in case
+ *                   that the cluster is still used.
  * @start:  The first block to remove
  * @end:    The last block to remove
  */
 static int
 ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
-		 struct ext4_ext_path *path, ext4_fsblk_t *partial_cluster,
+		 struct ext4_ext_path *path,
+		 long long *partial_cluster,
 		 ext4_lblk_t start, ext4_lblk_t end)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2479,6 +2492,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 	unsigned short ex_ee_len;
 	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
+	ext4_fsblk_t pblk;
 
 	/* the header must be checked already in ext4_ext_remove_space() */
 	ext_debug("truncate since %u in leaf to %u\n", start, end);
@@ -2490,7 +2504,9 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		return -EIO;
 	}
 	/* find where to start removing */
-	ex = EXT_LAST_EXTENT(eh);
+	ex = path[depth].p_ext;
+	if (!ex)
+		ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
 	ex_ee_len = ext4_ext_get_actual_len(ex);
@@ -2517,6 +2533,16 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 
 		/* If this extent is beyond the end of the hole, skip it */
 		if (end < ex_ee_block) {
+			/*
+			 * We're going to skip this extent and move to another,
+			 * so if this extent is not cluster aligned we have
+			 * to mark the current cluster as used to avoid
+			 * accidentally freeing it later on
+			 */
+			pblk = ext4_ext_pblock(ex);
+			if (pblk & (sbi->s_cluster_ratio - 1))
+				*partial_cluster =
+					-((long long)EXT4_B2C(sbi, pblk));
 			ex--;
 			ex_ee_block = le32_to_cpu(ex->ee_block);
 			ex_ee_len = ext4_ext_get_actual_len(ex);
@@ -2592,7 +2618,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 					sizeof(struct ext4_extent));
 			}
 			le16_add_cpu(&eh->eh_entries, -1);
-		} else
+		} else if (*partial_cluster > 0)
 			*partial_cluster = 0;
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2610,17 +2636,13 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		err = ext4_ext_correct_indexes(handle, inode, path);
 
 	/*
-	 * If there is still a entry in the leaf node, check to see if
-	 * it references the partial cluster.  This is the only place
-	 * where it could; if it doesn't, we can free the cluster.
+	 * Free the partial cluster only if the current extent does not
+	 * reference it. Otherwise we might free used cluster.
 	 */
-	if (*partial_cluster && ex >= EXT_FIRST_EXTENT(eh) &&
+	if (*partial_cluster > 0 &&
 	    (EXT4_B2C(sbi, ext4_ext_pblock(ex) + ex_ee_len - 1) !=
 	     *partial_cluster)) {
-		int flags = EXT4_FREE_BLOCKS_FORGET;
-
-		if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
-			flags |= EXT4_FREE_BLOCKS_METADATA;
+		int flags = get_default_free_blocks_flags(inode);
 
 		ext4_free_blocks(handle, inode, NULL,
 				 EXT4_C2B(sbi, *partial_cluster),
@@ -2664,7 +2686,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	struct super_block *sb = inode->i_sb;
 	int depth = ext_depth(inode);
 	struct ext4_ext_path *path = NULL;
-	ext4_fsblk_t partial_cluster = 0;
+	long long partial_cluster = 0;
 	handle_t *handle;
 	int i = 0, err = 0;
 
@@ -2676,7 +2698,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 		return PTR_ERR(handle);
 
 again:
-	trace_ext4_ext_remove_space(inode, start, depth);
+	trace_ext4_ext_remove_space(inode, start, end, depth);
 
 	/*
 	 * Check if we are removing extents inside the extent tree. If that
@@ -2844,17 +2866,14 @@ again:
 		}
 	}
 
-	trace_ext4_ext_remove_space_done(inode, start, depth, partial_cluster,
-			path->p_hdr->eh_entries);
+	trace_ext4_ext_remove_space_done(inode, start, end, depth,
+			partial_cluster, path->p_hdr->eh_entries);
 
 	/* If we still have something in the partial cluster and we have removed
 	 * even the first extent, then we should free the blocks in the partial
 	 * cluster as well. */
-	if (partial_cluster && path->p_hdr->eh_entries == 0) {
-		int flags = EXT4_FREE_BLOCKS_FORGET;
-
-		if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
-			flags |= EXT4_FREE_BLOCKS_METADATA;
+	if (partial_cluster > 0 && path->p_hdr->eh_entries == 0) {
+		int flags = get_default_free_blocks_flags(inode);
 
 		ext4_free_blocks(handle, inode, NULL,
 				 EXT4_C2B(EXT4_SB(sb), partial_cluster),
@@ -4363,7 +4382,7 @@ out2:
 	}
 
 out3:
-	trace_ext4_ext_map_blocks_exit(inode, map, err ? err : allocated);
+	trace_ext4_ext_map_blocks_exit(inode, flags, map, err ? err : allocated);
 
 	return err ? err : allocated;
 }
@@ -4446,7 +4465,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
-		return ext4_punch_hole(file, offset, len);
+		return ext4_punch_hole(inode, offset, len);
 
 	ret = ext4_convert_inline_data(inode);
 	if (ret)
@@ -4548,10 +4567,9 @@ retry:
  * function, to convert the fallocated extents after IO is completed.
  * Returns 0 on success.
  */
-int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
-				    ssize_t len)
+int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
+				   loff_t offset, ssize_t len)
 {
-	handle_t *handle;
 	unsigned int max_blocks;
 	int ret = 0;
 	int ret2 = 0;
@@ -4566,16 +4584,32 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 	max_blocks = ((EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) -
 		      map.m_lblk);
 	/*
-	 * credits to insert 1 extent into extent tree
+	 * This is somewhat ugly but the idea is clear: When transaction is
+	 * reserved, everything goes into it. Otherwise we rather start several
+	 * smaller transactions for conversion of each extent separately.
 	 */
-	credits = ext4_chunk_trans_blocks(inode, max_blocks);
+	if (handle) {
+		handle = ext4_journal_start_reserved(handle,
+						     EXT4_HT_EXT_CONVERT);
+		if (IS_ERR(handle))
+			return PTR_ERR(handle);
+		credits = 0;
+	} else {
+		/*
+		 * credits to insert 1 extent into extent tree
+		 */
+		credits = ext4_chunk_trans_blocks(inode, max_blocks);
+	}
 	while (ret >= 0 && ret < max_blocks) {
 		map.m_lblk += ret;
 		map.m_len = (max_blocks -= ret);
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			break;
+		if (credits) {
+			handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+						    credits);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				break;
+			}
 		}
 		ret = ext4_map_blocks(handle, inode, &map,
 				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
@@ -4586,10 +4620,13 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 				     inode->i_ino, map.m_lblk,
 				     map.m_len, ret);
 		ext4_mark_inode_dirty(handle, inode);
-		ret2 = ext4_journal_stop(handle);
-		if (ret <= 0 || ret2 )
+		if (credits)
+			ret2 = ext4_journal_stop(handle);
+		if (ret <= 0 || ret2)
 			break;
 	}
+	if (!credits)
+		ret2 = ext4_journal_stop(handle);
 	return ret > 0 ? ret2 : ret;
 }
 
@@ -4659,7 +4696,7 @@ static int ext4_xattr_fiemap(struct inode *inode,
 		error = ext4_get_inode_loc(inode, &iloc);
 		if (error)
 			return error;
-		physical = iloc.bh->b_blocknr << blockbits;
+		physical = (__u64)iloc.bh->b_blocknr << blockbits;
 		offset = EXT4_GOOD_OLD_INODE_SIZE +
 				EXT4_I(inode)->i_extra_isize;
 		physical += offset;
@@ -4667,7 +4704,7 @@ static int ext4_xattr_fiemap(struct inode *inode,
 		flags |= FIEMAP_EXTENT_DATA_INLINE;
 		brelse(iloc.bh);
 	} else { /* external block */
-		physical = EXT4_I(inode)->i_file_acl << blockbits;
+		physical = (__u64)EXT4_I(inode)->i_file_acl << blockbits;
 		length = inode->i_sb->s_blocksize;
 	}
 
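[Editor's sketch, not part of the patch] The widening of partial_cluster from ext4_fsblk_t to long long in the hunks above is what lets one variable carry three states across ext4_remove_blocks()/ext4_ext_rm_leaf(): a positive value names a cluster whose leftover blocks may still need freeing, a negative value (the cluster number negated) marks a cluster known to be used by another extent, and zero means nothing is outstanding. A minimal userspace sketch of that sign convention, with hypothetical names:

#include <stdio.h>

/*
 * Illustrative only -- mirrors the sign convention the patch adopts:
 *   > 0 : cluster whose remaining blocks may still need freeing
 *   < 0 : cluster known to be used by another extent (stored negated)
 *     0 : no partial cluster outstanding
 */
typedef long long partial_cluster_t;

static void mark_pending(partial_cluster_t *pc, long long cluster) { *pc = cluster; }
static void mark_in_use(partial_cluster_t *pc, long long cluster)  { *pc = -cluster; }

static int may_free(partial_cluster_t pc, long long cluster)
{
	/* Only a positive, matching entry is safe to free. */
	return pc > 0 && pc == cluster;
}

int main(void)
{
	partial_cluster_t pc = 0;

	mark_pending(&pc, 42);
	printf("free cluster 42? %d\n", may_free(pc, 42)); /* 1 */
	mark_in_use(&pc, 42);
	printf("free cluster 42? %d\n", may_free(pc, 42)); /* 0 */
	return 0;
}

An unsigned type could not distinguish "free still pending" from "known to be in use", which is why the signature change ripples through the whole call chain above.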
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e6941e622d31..ee018d5f397e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -10,6 +10,7 @@
  * Ext4 extents status tree core functions.
  */
 #include <linux/rbtree.h>
+#include <linux/list_sort.h>
 #include "ext4.h"
 #include "extents_status.h"
 #include "ext4_extents.h"
@@ -291,7 +292,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_find_delayed_extent_range_exit(inode, es);
 }
 
@@ -672,7 +672,6 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
 
 	return err;
@@ -734,7 +733,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
 	return found;
 }
@@ -878,12 +876,28 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
 				     EXTENT_STATUS_WRITTEN);
 }
 
+static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
+				     struct list_head *b)
+{
+	struct ext4_inode_info *eia, *eib;
+	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
+	eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+
+	if (eia->i_touch_when == eib->i_touch_when)
+		return 0;
+	if (time_after(eia->i_touch_when, eib->i_touch_when))
+		return 1;
+	else
+		return -1;
+}
+
 static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
 				struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
-	struct list_head *cur, *tmp, scanned;
+	struct list_head *cur, *tmp;
+	LIST_HEAD(skiped);
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk = 0;
 
@@ -893,23 +907,41 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (!nr_to_scan)
 		return ret;
 
-	INIT_LIST_HEAD(&scanned);
-
 	spin_lock(&sbi->s_es_lru_lock);
+
+	/*
+	 * If the inode that is at the head of LRU list is newer than
+	 * last_sorted time, that means that we need to sort this list.
+	 */
+	ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info, i_es_lru);
+	if (sbi->s_es_last_sorted < ei->i_touch_when) {
+		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
+		sbi->s_es_last_sorted = jiffies;
+	}
+
 	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
-		list_move_tail(cur, &scanned);
+		/*
+		 * If we have already reclaimed all extents from extent
+		 * status tree, just stop the loop immediately.
+		 */
+		if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+			break;
 
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
 
-		read_lock(&ei->i_es_lock);
-		if (ei->i_es_lru_nr == 0) {
-			read_unlock(&ei->i_es_lock);
+		/* Skip the inode that is newer than the last_sorted time */
+		if (sbi->s_es_last_sorted < ei->i_touch_when) {
+			list_move_tail(cur, &skiped);
 			continue;
 		}
-		read_unlock(&ei->i_es_lock);
+
+		if (ei->i_es_lru_nr == 0)
+			continue;
 
 		write_lock(&ei->i_es_lock);
 		ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
+		if (ei->i_es_lru_nr == 0)
+			list_del_init(&ei->i_es_lru);
 		write_unlock(&ei->i_es_lock);
 
 		nr_shrunk += ret;
@@ -917,7 +949,9 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		if (nr_to_scan == 0)
 			break;
 	}
-	list_splice_tail(&scanned, &sbi->s_es_lru);
+
+	/* Move the newer inodes into the tail of the LRU list. */
+	list_splice_tail(&skiped, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 
 	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
@@ -925,21 +959,19 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	return ret;
 }
 
-void ext4_es_register_shrinker(struct super_block *sb)
+void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
-	struct ext4_sb_info *sbi;
-
-	sbi = EXT4_SB(sb);
 	INIT_LIST_HEAD(&sbi->s_es_lru);
 	spin_lock_init(&sbi->s_es_lru_lock);
+	sbi->s_es_last_sorted = 0;
 	sbi->s_es_shrinker.shrink = ext4_es_shrink;
 	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&sbi->s_es_shrinker);
 }
 
-void ext4_es_unregister_shrinker(struct super_block *sb)
+void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 {
-	unregister_shrinker(&EXT4_SB(sb)->s_es_shrinker);
+	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
 void ext4_es_lru_add(struct inode *inode)
@@ -947,11 +979,14 @@ void ext4_es_lru_add(struct inode *inode)
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	ei->i_touch_when = jiffies;
+
+	if (!list_empty(&ei->i_es_lru))
+		return;
+
 	spin_lock(&sbi->s_es_lru_lock);
 	if (list_empty(&ei->i_es_lru))
 		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
-	else
-		list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 }
 
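[Editor's sketch, not part of the patch] The comparator added for list_sort() above follows the usual tri-state contract: return negative, zero, or positive to order a before, level with, or after b, so the LRU list ends up sorted oldest i_touch_when first and the shrinker can skip recently touched inodes. (The kernel version uses time_after() so it stays correct across jiffies wraparound.) A userspace analogue of the same comparator shape, with qsort() standing in for list_sort():

#include <stdio.h>
#include <stdlib.h>

struct inode_info {
	unsigned long i_touch_when;	/* jiffies-like timestamp */
};

/* Same tri-state shape as the patch's comparator: oldest entries first. */
static int touch_time_cmp(const void *a, const void *b)
{
	const struct inode_info *ia = a, *ib = b;

	if (ia->i_touch_when == ib->i_touch_when)
		return 0;
	return ia->i_touch_when > ib->i_touch_when ? 1 : -1;
}

int main(void)
{
	struct inode_info v[] = { { 30 }, { 10 }, { 20 } };

	qsort(v, sizeof(v) / sizeof(v[0]), sizeof(v[0]), touch_time_cmp);
	for (size_t i = 0; i < 3; i++)
		printf("%lu\n", v[i].i_touch_when);	/* 10 20 30 */
	return 0;
}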
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f740eb03b707..e936730cc5b0 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -39,6 +39,7 @@
 				 EXTENT_STATUS_DELAYED | \
 				 EXTENT_STATUS_HOLE)
 
+struct ext4_sb_info;
 struct ext4_extent;
 
 struct extent_status {
@@ -119,8 +120,8 @@ static inline void ext4_es_store_status(struct extent_status *es,
 	es->es_pblk = block;
 }
 
-extern void ext4_es_register_shrinker(struct super_block *sb);
-extern void ext4_es_unregister_shrinker(struct super_block *sb);
+extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
+extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_lru_add(struct inode *inode);
 extern void ext4_es_lru_del(struct inode *inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1b4d51b5d86..b19f0a457f32 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -312,7 +312,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 	blkbits = inode->i_sb->s_blocksize_bits;
 	startoff = *offset;
 	lastoff = startoff;
-	endoff = (map->m_lblk + map->m_len) << blkbits;
+	endoff = (loff_t)(map->m_lblk + map->m_len) << blkbits;
 
 	index = startoff >> PAGE_CACHE_SHIFT;
 	end = endoff >> PAGE_CACHE_SHIFT;
@@ -457,7 +457,7 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		ret = ext4_map_blocks(NULL, inode, &map, 0);
 		if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
 			if (last != start)
-				dataoff = last << blkbits;
+				dataoff = (loff_t)last << blkbits;
 			break;
 		}
 
@@ -468,7 +468,7 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		ext4_es_find_delayed_extent_range(inode, last, last, &es);
 		if (es.es_len != 0 && in_range(last, es.es_lblk, es.es_len)) {
 			if (last != start)
-				dataoff = last << blkbits;
+				dataoff = (loff_t)last << blkbits;
 			break;
 		}
 
@@ -486,7 +486,7 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		}
 
 		last++;
-		dataoff = last << blkbits;
+		dataoff = (loff_t)last << blkbits;
 	} while (last <= end);
 
 	mutex_unlock(&inode->i_mutex);
@@ -540,7 +540,7 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		ret = ext4_map_blocks(NULL, inode, &map, 0);
 		if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
 			last += ret;
-			holeoff = last << blkbits;
+			holeoff = (loff_t)last << blkbits;
 			continue;
 		}
 
@@ -551,7 +551,7 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		ext4_es_find_delayed_extent_range(inode, last, last, &es);
 		if (es.es_len != 0 && in_range(last, es.es_lblk, es.es_len)) {
 			last = es.es_lblk + es.es_len;
-			holeoff = last << blkbits;
+			holeoff = (loff_t)last << blkbits;
 			continue;
 		}
 
@@ -566,7 +566,7 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 					      &map, &holeoff);
 		if (!unwritten) {
 			last += ret;
-			holeoff = last << blkbits;
+			holeoff = (loff_t)last << blkbits;
 			continue;
 		}
 	}
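[Editor's sketch, not part of the patch] Every hunk in this file is the same class of fix: `last` is a 32-bit logical block number, so `last << blkbits` is evaluated in 32-bit arithmetic and wraps for offsets at or beyond 4 GiB with 4 KiB blocks; casting to loff_t first makes the shift happen in 64 bits. A small demonstration of the difference:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t last = 1200000;	/* a block ~4.6 GiB into the file */
	unsigned int blkbits = 12;	/* 4 KiB blocks */

	/* Shift happens in 32 bits and wraps before the widening assignment. */
	long long wrong = last << blkbits;
	/* The fix: widen first, then shift in 64 bits. */
	long long right = (long long)last << blkbits;

	printf("wrong: %lld\n", wrong);	/*  620232704 */
	printf("right: %lld\n", right);	/* 4915200000 */
	return 0;
}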
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index e0ba8a408def..a8bc47f75fa0 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -73,32 +73,6 @@ static int ext4_sync_parent(struct inode *inode)
 	return ret;
 }
 
-/**
- * __sync_file - generic_file_fsync without the locking and filemap_write
- * @inode:	inode to sync
- * @datasync:	only sync essential metadata if true
- *
- * This is just generic_file_fsync without the locking.  This is needed for
- * nojournal mode to make sure this inodes data/metadata makes it to disk
- * properly.  The i_mutex should be held already.
- */
-static int __sync_inode(struct inode *inode, int datasync)
-{
-	int err;
-	int ret;
-
-	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
-	err = sync_inode_metadata(inode, 1);
-	if (ret == 0)
-		ret = err;
-	return ret;
-}
-
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -116,7 +90,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	struct inode *inode = file->f_mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
-	int ret, err;
+	int ret = 0, err;
 	tid_t commit_tid;
 	bool needs_barrier = false;
 
@@ -124,25 +98,24 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	trace_ext4_sync_file_enter(file, datasync);
 
-	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
-	if (ret)
-		return ret;
-	mutex_lock(&inode->i_mutex);
-
-	if (inode->i_sb->s_flags & MS_RDONLY)
-		goto out;
-
-	ret = ext4_flush_unwritten_io(inode);
-	if (ret < 0)
+	if (inode->i_sb->s_flags & MS_RDONLY) {
+		/* Make sure that we read updated s_mount_flags value */
+		smp_rmb();
+		if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+			ret = -EROFS;
 		goto out;
+	}
 
 	if (!journal) {
-		ret = __sync_inode(inode, datasync);
+		ret = generic_file_fsync(file, start, end, datasync);
 		if (!ret && !hlist_empty(&inode->i_dentry))
 			ret = ext4_sync_parent(inode);
 		goto out;
 	}
 
+	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (ret)
+		return ret;
 	/*
 	 * data=writeback,ordered:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -172,8 +145,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		if (!ret)
 			ret = err;
 	}
- out:
-	mutex_unlock(&inode->i_mutex);
+out:
 	trace_ext4_sync_file_exit(inode, ret);
 	return ret;
 }
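[Editor's sketch, not part of the patch] The new read-only branch relies on a barrier pairing: the abort path sets EXT4_MF_FS_ABORTED, issues a write barrier, and only then marks the superblock read-only, so a reader that observes MS_RDONLY and then executes smp_rmb() is guaranteed to also observe the aborted flag. A userspace C11 analogue of that publish/observe pattern, with hypothetical names (release/acquire stand in for smp_wmb()/smp_rmb()):

#include <stdatomic.h>
#include <stdio.h>

static int fs_aborted;			/* plain flag, published below */
static atomic_int sb_rdonly;

static void abort_fs(void)		/* writer side */
{
	fs_aborted = 1;
	/* release ordering plays the role of smp_wmb() */
	atomic_store_explicit(&sb_rdonly, 1, memory_order_release);
}

static int fsync_check(void)		/* reader side */
{
	if (atomic_load_explicit(&sb_rdonly, memory_order_acquire)) {
		/* acquire ordering plays the role of smp_rmb() */
		return fs_aborted ? -30 /* EROFS */ : 0;
	}
	return 0;
}

int main(void)
{
	abort_fs();
	printf("fsync -> %d\n", fsync_check());	/* -30 */
	return 0;
}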
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 00a818d67b54..f03598c6ffd3 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -747,7 +747,8 @@ repeat_in_this_group:
 		if (!handle) {
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
-							 handle_type, nblocks);
+							 handle_type, nblocks,
+							 0);
 			if (IS_ERR(handle)) {
 				err = PTR_ERR(handle);
 				ext4_std_error(sb, err);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index b8d5d351e24f..87b30cd357e7 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -624,7 +624,7 @@ cleanup:
 		partial--;
 	}
 out:
-	trace_ext4_ind_map_blocks_exit(inode, map, err);
+	trace_ext4_ind_map_blocks_exit(inode, flags, map, err);
 	return err;
 }
 
@@ -675,11 +675,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
@@ -779,27 +774,18 @@ int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock)
 	return (blk_bits / EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb)) + 1;
 }
 
-int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+/*
+ * Calculate number of indirect blocks touched by mapping @nrblocks logically
+ * contiguous blocks
+ */
+int ext4_ind_trans_blocks(struct inode *inode, int nrblocks)
 {
-	int indirects;
-
-	/* if nrblocks are contiguous */
-	if (chunk) {
-		/*
-		 * With N contiguous data blocks, we need at most
-		 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
-		 * 2 dindirect blocks, and 1 tindirect block
-		 */
-		return DIV_ROUND_UP(nrblocks,
-				    EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
-	}
 	/*
-	 * if nrblocks are not contiguous, worse case, each block touch
-	 * a indirect block, and each indirect block touch a double indirect
-	 * block, plus a triple indirect block
+	 * With N contiguous data blocks, we need at most
+	 * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) + 1 indirect blocks,
+	 * 2 dindirect blocks, and 1 tindirect block
 	 */
-	indirects = nrblocks * 2 + 1;
-	return indirects;
+	return DIV_ROUND_UP(nrblocks, EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
 }
 
 /*
@@ -940,11 +926,13 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
 			     __le32 *last)
 {
 	__le32 *p;
-	int	flags = EXT4_FREE_BLOCKS_FORGET | EXT4_FREE_BLOCKS_VALIDATED;
+	int	flags = EXT4_FREE_BLOCKS_VALIDATED;
 	int	err;
 
 	if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
-		flags |= EXT4_FREE_BLOCKS_METADATA;
+		flags |= EXT4_FREE_BLOCKS_FORGET | EXT4_FREE_BLOCKS_METADATA;
+	else if (ext4_should_journal_data(inode))
+		flags |= EXT4_FREE_BLOCKS_FORGET;
 
 	if (!ext4_data_block_valid(EXT4_SB(inode->i_sb), block_to_free,
 				   count)) {
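[Editor's sketch, not part of the patch] The collapsed ext4_ind_trans_blocks() keeps only the contiguous-run bound spelled out in its comment: N contiguous data blocks touch at most N/EXT4_ADDR_PER_BLOCK + 1 indirect blocks, two double-indirect blocks, and one triple-indirect block. A quick userspace check of that arithmetic, assuming 4 KiB blocks (1024 block addresses per indirect block):

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Worst case for N contiguous blocks: N/addr_per_block indirect blocks
 * rounded up, plus 1 more indirect, 2 dindirect, 1 tindirect -> + 4. */
static int ind_trans_blocks(int nrblocks, int addr_per_block)
{
	return DIV_ROUND_UP(nrblocks, addr_per_block) + 4;
}

int main(void)
{
	/* 4 KiB blocks, 4-byte block numbers -> 1024 addresses per block */
	printf("%d\n", ind_trans_blocks(1, 1024));	/* 5 */
	printf("%d\n", ind_trans_blocks(1024, 1024));	/* 5 */
	printf("%d\n", ind_trans_blocks(1025, 1024));	/* 6 */
	return 0;
}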
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1a346a6bdc8f..d9ecbf1113a7 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -72,7 +72,7 @@ static int get_max_inline_xattr_value_size(struct inode *inode,
 		entry = (struct ext4_xattr_entry *)
 			((void *)raw_inode + EXT4_I(inode)->i_inline_off);
 
-		free += le32_to_cpu(entry->e_value_size);
+		free += EXT4_XATTR_SIZE(le32_to_cpu(entry->e_value_size));
 		goto out;
 	}
 
@@ -1810,7 +1810,7 @@ int ext4_inline_data_fiemap(struct inode *inode,
 	if (error)
 		goto out;
 
-	physical = iloc.bh->b_blocknr << inode->i_sb->s_blocksize_bits;
+	physical = (__u64)iloc.bh->b_blocknr << inode->i_sb->s_blocksize_bits;
 	physical += (char *)ext4_raw_inode(&iloc) - iloc.bh->b_data;
 	physical += offsetof(struct ext4_inode, i_block);
 	length = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d6382b89ecbd..0188e65e1f58 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -132,12 +132,12 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,
 						   new_size);
 }
 
-static void ext4_invalidatepage(struct page *page, unsigned long offset);
+static void ext4_invalidatepage(struct page *page, unsigned int offset,
+				unsigned int length);
 static int __ext4_journalled_writepage(struct page *page, unsigned int len);
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
-static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle,
-		struct inode *inode, struct page *page, loff_t from,
-		loff_t length, int flags);
+static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
+				  int pextents);
 
 /*
  * Test whether an inode is a fast symlink.
@@ -215,7 +215,8 @@ void ext4_evict_inode(struct inode *inode)
 			filemap_write_and_wait(&inode->i_data);
 		}
 		truncate_inode_pages(&inode->i_data, 0);
-		ext4_ioend_shutdown(inode);
+
+		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 		goto no_delete;
 	}
 
@@ -225,8 +226,8 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_should_order_data(inode))
 		ext4_begin_ordered_truncate(inode, 0);
 	truncate_inode_pages(&inode->i_data, 0);
-	ext4_ioend_shutdown(inode);
 
+	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 	if (is_bad_inode(inode))
 		goto no_delete;
 
@@ -423,66 +424,6 @@ static int __check_block_validity(struct inode *inode, const char *func,
 #define check_block_validity(inode, map)	\
 	__check_block_validity((inode), __func__, __LINE__, (map))
 
-/*
- * Return the number of contiguous dirty pages in a given inode
- * starting at page frame idx.
- */
-static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
-				    unsigned int max_pages)
-{
-	struct address_space *mapping = inode->i_mapping;
-	pgoff_t	index;
-	struct pagevec pvec;
-	pgoff_t num = 0;
-	int i, nr_pages, done = 0;
-
-	if (max_pages == 0)
-		return 0;
-	pagevec_init(&pvec, 0);
-	while (!done) {
-		index = idx;
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-					      PAGECACHE_TAG_DIRTY,
-					      (pgoff_t)PAGEVEC_SIZE);
-		if (nr_pages == 0)
-			break;
-		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
-			struct buffer_head *bh, *head;
-
-			lock_page(page);
-			if (unlikely(page->mapping != mapping) ||
-			    !PageDirty(page) ||
-			    PageWriteback(page) ||
-			    page->index != idx) {
-				done = 1;
-				unlock_page(page);
-				break;
-			}
-			if (page_has_buffers(page)) {
-				bh = head = page_buffers(page);
-				do {
-					if (!buffer_delay(bh) &&
-					    !buffer_unwritten(bh))
-						done = 1;
-					bh = bh->b_this_page;
-				} while (!done && (bh != head));
-			}
-			unlock_page(page);
-			if (done)
-				break;
-			idx++;
-			num++;
-			if (num >= max_pages) {
-				done = 1;
-				break;
-			}
-		}
-		pagevec_release(&pvec);
-	}
-	return num;
-}
-
 #ifdef ES_AGGRESSIVE_TEST
 static void ext4_map_blocks_es_recheck(handle_t *handle,
 				       struct inode *inode,
@@ -573,6 +514,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
573 "logical block %lu\n", inode->i_ino, flags, map->m_len, 514 "logical block %lu\n", inode->i_ino, flags, map->m_len,
574 (unsigned long) map->m_lblk); 515 (unsigned long) map->m_lblk);
575 516
517 ext4_es_lru_add(inode);
518
576 /* Lookup extent status tree firstly */ 519 /* Lookup extent status tree firstly */
577 if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) { 520 if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
578 if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) { 521 if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
@@ -1118,10 +1061,13 @@ static int ext4_write_end(struct file *file,
 		}
 	}
 
-	if (ext4_has_inline_data(inode))
-		copied = ext4_write_inline_data_end(inode, pos, len,
-						    copied, page);
-	else
+	if (ext4_has_inline_data(inode)) {
+		ret = ext4_write_inline_data_end(inode, pos, len,
+						 copied, page);
+		if (ret < 0)
+			goto errout;
+		copied = ret;
+	} else
 		copied = block_write_end(file, mapping, pos,
 					 len, copied, page, fsdata);
 
@@ -1157,8 +1103,6 @@ static int ext4_write_end(struct file *file,
 	if (i_size_changed)
 		ext4_mark_inode_dirty(handle, inode);
 
-	if (copied < 0)
-		ret = copied;
 	if (pos + len > inode->i_size && ext4_can_truncate(inode))
 		/* if we have allocated more blocks and copied
 		 * less. We will have blocks allocated outside
@@ -1415,21 +1359,28 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
 }
 
 static void ext4_da_page_release_reservation(struct page *page,
-					     unsigned long offset)
+					     unsigned int offset,
+					     unsigned int length)
 {
 	int to_release = 0;
 	struct buffer_head *head, *bh;
 	unsigned int curr_off = 0;
 	struct inode *inode = page->mapping->host;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	unsigned int stop = offset + length;
 	int num_clusters;
 	ext4_fsblk_t lblk;
 
+	BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
+
 	head = page_buffers(page);
 	bh = head;
 	do {
 		unsigned int next_off = curr_off + bh->b_size;
 
+		if (next_off > stop)
+			break;
+
 		if ((offset <= curr_off) && (buffer_delay(bh))) {
 			to_release++;
 			clear_buffer_delay(bh);
@@ -1460,140 +1411,43 @@ static void ext4_da_page_release_reservation(struct page *page,
  * Delayed allocation stuff
  */
 
-/*
- * mpage_da_submit_io - walks through extent of pages and try to write
- * them with writepage() call back
- *
- * @mpd->inode: inode
- * @mpd->first_page: first page of the extent
- * @mpd->next_page: page after the last page of the extent
- *
- * By the time mpage_da_submit_io() is called we expect all blocks
- * to be allocated. this may be wrong if allocation failed.
- *
- * As pages are already locked by write_cache_pages(), we can't use it
- */
-static int mpage_da_submit_io(struct mpage_da_data *mpd,
-			      struct ext4_map_blocks *map)
-{
-	struct pagevec pvec;
-	unsigned long index, end;
-	int ret = 0, err, nr_pages, i;
-	struct inode *inode = mpd->inode;
-	struct address_space *mapping = inode->i_mapping;
-	loff_t size = i_size_read(inode);
-	unsigned int len, block_start;
-	struct buffer_head *bh, *page_bufs = NULL;
-	sector_t pblock = 0, cur_logical = 0;
-	struct ext4_io_submit io_submit;
+struct mpage_da_data {
+	struct inode *inode;
+	struct writeback_control *wbc;
 
-	BUG_ON(mpd->next_page <= mpd->first_page);
-	memset(&io_submit, 0, sizeof(io_submit));
+	pgoff_t first_page;	/* The first page to write */
+	pgoff_t next_page;	/* Current page to examine */
+	pgoff_t last_page;	/* Last page to examine */
 	/*
-	 * We need to start from the first_page to the next_page - 1
-	 * to make sure we also write the mapped dirty buffer_heads.
-	 * If we look at mpd->b_blocknr we would only be looking
-	 * at the currently mapped buffer_heads.
+	 * Extent to map - this can be after first_page because that can be
+	 * fully mapped. We somewhat abuse m_flags to store whether the extent
+	 * is delalloc or unwritten.
 	 */
-	index = mpd->first_page;
-	end = mpd->next_page - 1;
-
-	pagevec_init(&pvec, 0);
-	while (index <= end) {
-		nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
-		if (nr_pages == 0)
-			break;
-		for (i = 0; i < nr_pages; i++) {
-			int skip_page = 0;
-			struct page *page = pvec.pages[i];
-
-			index = page->index;
-			if (index > end)
-				break;
-
-			if (index == size >> PAGE_CACHE_SHIFT)
-				len = size & ~PAGE_CACHE_MASK;
-			else
-				len = PAGE_CACHE_SIZE;
-			if (map) {
-				cur_logical = index << (PAGE_CACHE_SHIFT -
-							inode->i_blkbits);
-				pblock = map->m_pblk + (cur_logical -
-							map->m_lblk);
-			}
-			index++;
-
-			BUG_ON(!PageLocked(page));
-			BUG_ON(PageWriteback(page));
-
-			bh = page_bufs = page_buffers(page);
-			block_start = 0;
-			do {
-				if (map && (cur_logical >= map->m_lblk) &&
-				    (cur_logical <= (map->m_lblk +
-						     (map->m_len - 1)))) {
-					if (buffer_delay(bh)) {
-						clear_buffer_delay(bh);
-						bh->b_blocknr = pblock;
-					}
-					if (buffer_unwritten(bh) ||
-					    buffer_mapped(bh))
-						BUG_ON(bh->b_blocknr != pblock);
-					if (map->m_flags & EXT4_MAP_UNINIT)
-						set_buffer_uninit(bh);
-					clear_buffer_unwritten(bh);
-				}
-
-				/*
-				 * skip page if block allocation undone and
-				 * block is dirty
-				 */
-				if (ext4_bh_delay_or_unwritten(NULL, bh))
-					skip_page = 1;
-				bh = bh->b_this_page;
-				block_start += bh->b_size;
-				cur_logical++;
-				pblock++;
-			} while (bh != page_bufs);
-
-			if (skip_page) {
-				unlock_page(page);
-				continue;
-			}
-
-			clear_page_dirty_for_io(page);
-			err = ext4_bio_write_page(&io_submit, page, len,
-						  mpd->wbc);
-			if (!err)
-				mpd->pages_written++;
-			/*
-			 * In error case, we have to continue because
-			 * remaining pages are still locked
-			 */
-			if (ret == 0)
-				ret = err;
-		}
-		pagevec_release(&pvec);
-	}
-	ext4_io_submit(&io_submit);
-	return ret;
-}
+	struct ext4_map_blocks map;
+	struct ext4_io_submit io_submit;	/* IO submission data */
+};
 
-static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd)
+static void mpage_release_unused_pages(struct mpage_da_data *mpd,
+				       bool invalidate)
 {
 	int nr_pages, i;
 	pgoff_t index, end;
 	struct pagevec pvec;
 	struct inode *inode = mpd->inode;
 	struct address_space *mapping = inode->i_mapping;
-	ext4_lblk_t start, last;
+
+	/* This is necessary when next_page == 0. */
+	if (mpd->first_page >= mpd->next_page)
+		return;
 
 	index = mpd->first_page;
 	end   = mpd->next_page - 1;
-
-	start = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	last = end << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	ext4_es_remove_extent(inode, start, last - start + 1);
+	if (invalidate) {
+		ext4_lblk_t start, last;
+		start = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		last = end << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		ext4_es_remove_extent(inode, start, last - start + 1);
+	}
 
 	pagevec_init(&pvec, 0);
 	while (index <= end) {
@@ -1606,14 +1460,15 @@ static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd)
 			break;
 			BUG_ON(!PageLocked(page));
 			BUG_ON(PageWriteback(page));
-			block_invalidatepage(page, 0);
-			ClearPageUptodate(page);
+			if (invalidate) {
+				block_invalidatepage(page, 0, PAGE_CACHE_SIZE);
+				ClearPageUptodate(page);
+			}
 			unlock_page(page);
 		}
 		index = pvec.pages[nr_pages - 1]->index + 1;
 		pagevec_release(&pvec);
 	}
-	return;
 }
 
 static void ext4_print_free_blocks(struct inode *inode)
@@ -1642,215 +1497,6 @@ static void ext4_print_free_blocks(struct inode *inode)
 	return;
 }
 
-/*
- * mpage_da_map_and_submit - go through given space, map them
- * if necessary, and then submit them for I/O
- *
- * @mpd - bh describing space
- *
- * The function skips space we know is already mapped to disk blocks.
- *
- */
-static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
-{
-	int err, blks, get_blocks_flags;
-	struct ext4_map_blocks map, *mapp = NULL;
-	sector_t next = mpd->b_blocknr;
-	unsigned max_blocks = mpd->b_size >> mpd->inode->i_blkbits;
-	loff_t disksize = EXT4_I(mpd->inode)->i_disksize;
-	handle_t *handle = NULL;
-
-	/*
-	 * If the blocks are mapped already, or we couldn't accumulate
-	 * any blocks, then proceed immediately to the submission stage.
-	 */
-	if ((mpd->b_size == 0) ||
-	    ((mpd->b_state & (1 << BH_Mapped)) &&
-	     !(mpd->b_state & (1 << BH_Delay)) &&
-	     !(mpd->b_state & (1 << BH_Unwritten))))
-		goto submit_io;
-
-	handle = ext4_journal_current_handle();
-	BUG_ON(!handle);
-
-	/*
-	 * Call ext4_map_blocks() to allocate any delayed allocation
-	 * blocks, or to convert an uninitialized extent to be
-	 * initialized (in the case where we have written into
-	 * one or more preallocated blocks).
-	 *
-	 * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE to
-	 * indicate that we are on the delayed allocation path. This
-	 * affects functions in many different parts of the allocation
-	 * call path. This flag exists primarily because we don't
-	 * want to change *many* call functions, so ext4_map_blocks()
-	 * will set the EXT4_STATE_DELALLOC_RESERVED flag once the
-	 * inode's allocation semaphore is taken.
-	 *
-	 * If the blocks in questions were delalloc blocks, set
-	 * EXT4_GET_BLOCKS_DELALLOC_RESERVE so the delalloc accounting
-	 * variables are updated after the blocks have been allocated.
-	 */
-	map.m_lblk = next;
-	map.m_len = max_blocks;
-	/*
-	 * We're in delalloc path and it is possible that we're going to
-	 * need more metadata blocks than previously reserved. However
-	 * we must not fail because we're in writeback and there is
-	 * nothing we can do about it so it might result in data loss.
-	 * So use reserved blocks to allocate metadata if possible.
-	 */
-	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
-	if (ext4_should_dioread_nolock(mpd->inode))
-		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
-	if (mpd->b_state & (1 << BH_Delay))
-		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
-
-
-	blks = ext4_map_blocks(handle, mpd->inode, &map, get_blocks_flags);
-	if (blks < 0) {
-		struct super_block *sb = mpd->inode->i_sb;
-
-		err = blks;
-		/*
-		 * If get block returns EAGAIN or ENOSPC and there
-		 * appears to be free blocks we will just let
-		 * mpage_da_submit_io() unlock all of the pages.
-		 */
-		if (err == -EAGAIN)
-			goto submit_io;
-
-		if (err == -ENOSPC && ext4_count_free_clusters(sb)) {
-			mpd->retval = err;
-			goto submit_io;
-		}
-
-		/*
-		 * get block failure will cause us to loop in
-		 * writepages, because a_ops->writepage won't be able
-		 * to make progress. The page will be redirtied by
-		 * writepage and writepages will again try to write
-		 * the same.
-		 */
-		if (!(EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)) {
-			ext4_msg(sb, KERN_CRIT,
-				 "delayed block allocation failed for inode %lu "
-				 "at logical offset %llu with max blocks %zd "
-				 "with error %d", mpd->inode->i_ino,
-				 (unsigned long long) next,
-				 mpd->b_size >> mpd->inode->i_blkbits, err);
-			ext4_msg(sb, KERN_CRIT,
-				"This should not happen!! Data will be lost");
-			if (err == -ENOSPC)
-				ext4_print_free_blocks(mpd->inode);
-		}
-		/* invalidate all the pages */
-		ext4_da_block_invalidatepages(mpd);
-
-		/* Mark this page range as having been completed */
-		mpd->io_done = 1;
-		return;
-	}
-	BUG_ON(blks == 0);
-
-	mapp = &map;
-	if (map.m_flags & EXT4_MAP_NEW) {
-		struct block_device *bdev = mpd->inode->i_sb->s_bdev;
-		int i;
-
-		for (i = 0; i < map.m_len; i++)
-			unmap_underlying_metadata(bdev, map.m_pblk + i);
-	}
-
-	/*
-	 * Update on-disk size along with block allocation.
-	 */
-	disksize = ((loff_t) next + blks) << mpd->inode->i_blkbits;
-	if (disksize > i_size_read(mpd->inode))
-		disksize = i_size_read(mpd->inode);
-	if (disksize > EXT4_I(mpd->inode)->i_disksize) {
-		ext4_update_i_disksize(mpd->inode, disksize);
-		err = ext4_mark_inode_dirty(handle, mpd->inode);
-		if (err)
-			ext4_error(mpd->inode->i_sb,
-				   "Failed to mark inode %lu dirty",
-				   mpd->inode->i_ino);
-	}
-
-submit_io:
-	mpage_da_submit_io(mpd, mapp);
-	mpd->io_done = 1;
-}
-
-#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
-		(1 << BH_Delay) | (1 << BH_Unwritten))
-
-/*
- * mpage_add_bh_to_extent - try to add one more block to extent of blocks
- *
- * @mpd->lbh - extent of blocks
- * @logical - logical number of the block in the file
- * @b_state - b_state of the buffer head added
- *
- * the function is used to collect contig. blocks in same state
- */
-static void mpage_add_bh_to_extent(struct mpage_da_data *mpd, sector_t logical,
-				   unsigned long b_state)
-{
-	sector_t next;
-	int blkbits = mpd->inode->i_blkbits;
-	int nrblocks = mpd->b_size >> blkbits;
-
-	/*
-	 * XXX Don't go larger than mballoc is willing to allocate
-	 * This is a stopgap solution.  We eventually need to fold
-	 * mpage_da_submit_io() into this function and then call
-	 * ext4_map_blocks() multiple times in a loop
-	 */
-	if (nrblocks >= (8*1024*1024 >> blkbits))
-		goto flush_it;
-
-	/* check if the reserved journal credits might overflow */
-	if (!ext4_test_inode_flag(mpd->inode, EXT4_INODE_EXTENTS)) {
-		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
-			/*
-			 * With non-extent format we are limited by the journal
-			 * credit available.  Total credit needed to insert
-			 * nrblocks contiguous blocks is dependent on the
-			 * nrblocks.  So limit nrblocks.
-			 */
-			goto flush_it;
-		}
-	}
-	/*
-	 * First block in the extent
-	 */
-	if (mpd->b_size == 0) {
-		mpd->b_blocknr = logical;
-		mpd->b_size = 1 << blkbits;
-		mpd->b_state = b_state & BH_FLAGS;
-		return;
-	}
-
-	next = mpd->b_blocknr + nrblocks;
-	/*
-	 * Can we merge the block to our big extent?
-	 */
-	if (logical == next && (b_state & BH_FLAGS) == mpd->b_state) {
-		mpd->b_size += 1 << blkbits;
-		return;
-	}
-
-flush_it:
-	/*
-	 * We couldn't merge the block to our extent, so we
-	 * need to flush current extent and start new one
-	 */
-	mpage_da_map_and_submit(mpd);
-	return;
-}
-
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh)
 {
 	return (buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh);
@@ -1883,6 +1529,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
1883 "logical block %lu\n", inode->i_ino, map->m_len, 1529 "logical block %lu\n", inode->i_ino, map->m_len,
1884 (unsigned long) map->m_lblk); 1530 (unsigned long) map->m_lblk);
1885 1531
1532 ext4_es_lru_add(inode);
1533
1886 /* Lookup extent status tree firstly */ 1534 /* Lookup extent status tree firstly */
1887 if (ext4_es_lookup_extent(inode, iblock, &es)) { 1535 if (ext4_es_lookup_extent(inode, iblock, &es)) {
1888 1536
@@ -2156,7 +1804,7 @@ out:
2156 * lock so we have to do some magic. 1804 * lock so we have to do some magic.
2157 * 1805 *
2158 * This function can get called via... 1806 * This function can get called via...
2159 * - ext4_da_writepages after taking page lock (have journal handle) 1807 * - ext4_writepages after taking page lock (have journal handle)
2160 * - journal_submit_inode_data_buffers (no journal handle) 1808 * - journal_submit_inode_data_buffers (no journal handle)
2161 * - shrink_page_list via the kswapd/direct reclaim (no journal handle) 1809 * - shrink_page_list via the kswapd/direct reclaim (no journal handle)
2162 * - grab_page_cache when doing write_begin (have journal handle) 1810 * - grab_page_cache when doing write_begin (have journal handle)
@@ -2234,76 +1882,405 @@ static int ext4_writepage(struct page *page,
2234 */ 1882 */
2235 return __ext4_journalled_writepage(page, len); 1883 return __ext4_journalled_writepage(page, len);
2236 1884
2237 memset(&io_submit, 0, sizeof(io_submit)); 1885 ext4_io_submit_init(&io_submit, wbc);
1886 io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
1887 if (!io_submit.io_end) {
1888 redirty_page_for_writepage(wbc, page);
1889 unlock_page(page);
1890 return -ENOMEM;
1891 }
2238 ret = ext4_bio_write_page(&io_submit, page, len, wbc); 1892 ret = ext4_bio_write_page(&io_submit, page, len, wbc);
2239 ext4_io_submit(&io_submit); 1893 ext4_io_submit(&io_submit);
1894 /* Drop io_end reference we got from init */
1895 ext4_put_io_end_defer(io_submit.io_end);
2240 return ret; 1896 return ret;
2241} 1897}
2242 1898
1899#define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))
1900
2243/* 1901/*
2244 * This is called via ext4_da_writepages() to 1902 * mballoc gives us at most this number of blocks...
2245 * calculate the total number of credits to reserve to fit 1903 * XXX: That seems to be only a limitation of ext4_mb_normalize_request().
2246 * a single extent allocation into a single transaction; 1904 * The rest of mballoc seems to handle chunks up to full group size.
2247 * ext4_da_writepages() will loop calling this before
2248 * the block allocation.
2249 */ 1905 */
1906#define MAX_WRITEPAGES_EXTENT_LEN 2048
2250 1907
2251static int ext4_da_writepages_trans_blocks(struct inode *inode) 1908/*
1909 * mpage_add_bh_to_extent - try to add bh to extent of blocks to map
1910 *
1911 * @mpd - extent of blocks
1912 * @lblk - logical number of the block in the file
1913 * @b_state - b_state of the buffer head added
1914 *
 1915 * the function is used to collect contiguous blocks in the same state
1916 */
1917static int mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
1918 unsigned long b_state)
1919{
1920 struct ext4_map_blocks *map = &mpd->map;
1921
1922 /* Don't go larger than mballoc is willing to allocate */
1923 if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
1924 return 0;
1925
1926 /* First block in the extent? */
1927 if (map->m_len == 0) {
1928 map->m_lblk = lblk;
1929 map->m_len = 1;
1930 map->m_flags = b_state & BH_FLAGS;
1931 return 1;
1932 }
1933
1934 /* Can we merge the block to our big extent? */
1935 if (lblk == map->m_lblk + map->m_len &&
1936 (b_state & BH_FLAGS) == map->m_flags) {
1937 map->m_len++;
1938 return 1;
1939 }
1940 return 0;
1941}
1942
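
The new helper is pure accumulate-or-reject logic: start an extent on the first block, grow it while blocks stay contiguous and share the delay/unwritten state, otherwise return 0 so the caller maps what it already has. A minimal user-space sketch of that contract, with illustrative types rather than the kernel's struct ext4_map_blocks:

#include <stdbool.h>

#define MAX_EXTENT_LEN 2048		/* mirrors MAX_WRITEPAGES_EXTENT_LEN */

struct extent {
	unsigned long lblk;		/* first logical block */
	unsigned long len;		/* number of blocks, 0 == empty */
	unsigned long state;		/* delay/unwritten style flags */
};

/* Returns true if the block was absorbed, false if the caller must
 * map and submit the current extent before retrying. */
static bool extent_add(struct extent *e, unsigned long lblk,
		       unsigned long state)
{
	if (e->len >= MAX_EXTENT_LEN)
		return false;		/* don't outgrow the allocator */
	if (e->len == 0) {		/* first block starts the extent */
		e->lblk = lblk;
		e->len = 1;
		e->state = state;
		return true;
	}
	if (lblk == e->lblk + e->len && state == e->state) {
		e->len++;		/* contiguous and same state: merge */
		return true;
	}
	return false;			/* discontiguous or mixed state */
}
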
1943static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
1944 struct buffer_head *head,
1945 struct buffer_head *bh,
1946 ext4_lblk_t lblk)
1947{
1948 struct inode *inode = mpd->inode;
1949 ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
1950 >> inode->i_blkbits;
1951
1952 do {
1953 BUG_ON(buffer_locked(bh));
1954
1955 if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
1956 (!buffer_delay(bh) && !buffer_unwritten(bh)) ||
1957 lblk >= blocks) {
1958 /* Found extent to map? */
1959 if (mpd->map.m_len)
1960 return false;
1961 if (lblk >= blocks)
1962 return true;
1963 continue;
1964 }
1965 if (!mpage_add_bh_to_extent(mpd, lblk, bh->b_state))
1966 return false;
1967 } while (lblk++, (bh = bh->b_this_page) != head);
1968 return true;
1969}
1970
1971static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
2252{ 1972{
2253 int max_blocks = EXT4_I(inode)->i_reserved_data_blocks; 1973 int len;
1974 loff_t size = i_size_read(mpd->inode);
1975 int err;
1976
1977 BUG_ON(page->index != mpd->first_page);
1978 if (page->index == size >> PAGE_CACHE_SHIFT)
1979 len = size & ~PAGE_CACHE_MASK;
1980 else
1981 len = PAGE_CACHE_SIZE;
1982 clear_page_dirty_for_io(page);
1983 err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
1984 if (!err)
1985 mpd->wbc->nr_to_write--;
1986 mpd->first_page++;
2254 1987
1988 return err;
1989}
1990
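
mpage_submit_page() writes a full page except when the page straddles i_size, where only the bytes below i_size go out. The len computation is easiest to see with numbers; a small sketch assuming 4096-byte pages (the PAGE_CACHE_* values are per-arch in the kernel):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_SHIFT	12
#define PAGE_MASK	(~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long size = 10000;	/* i_size */

	for (unsigned long index = 0; index <= size >> PAGE_SHIFT; index++) {
		/* last page: only the tail below i_size is written */
		unsigned long len = (index == size >> PAGE_SHIFT)
			? (size & ~PAGE_MASK) : PAGE_SIZE;
		printf("page %lu: write %lu bytes\n", index, len);
	}
	/* prints 4096, 4096, 1808 */
	return 0;
}
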
1991/*
 1992 * mpage_map_and_submit_buffers - update buffers corresponding to changed extent and
1993 * submit fully mapped pages for IO
1994 *
1995 * @mpd - description of extent to map, on return next extent to map
1996 *
1997 * Scan buffers corresponding to changed extent (we expect corresponding pages
1998 * to be already locked) and update buffer state according to new extent state.
1999 * We map delalloc buffers to their physical location, clear unwritten bits,
2000 * and mark buffers as uninit when we perform writes to uninitialized extents
2001 * and do extent conversion after IO is finished. If the last page is not fully
2002 * mapped, we update @map to the next extent in the last page that needs
2003 * mapping. Otherwise we submit the page for IO.
2004 */
2005static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
2006{
2007 struct pagevec pvec;
2008 int nr_pages, i;
2009 struct inode *inode = mpd->inode;
2010 struct buffer_head *head, *bh;
2011 int bpp_bits = PAGE_CACHE_SHIFT - inode->i_blkbits;
2012 ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
2013 >> inode->i_blkbits;
2014 pgoff_t start, end;
2015 ext4_lblk_t lblk;
2016 sector_t pblock;
2017 int err;
2018
2019 start = mpd->map.m_lblk >> bpp_bits;
2020 end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
2021 lblk = start << bpp_bits;
2022 pblock = mpd->map.m_pblk;
2023
2024 pagevec_init(&pvec, 0);
2025 while (start <= end) {
2026 nr_pages = pagevec_lookup(&pvec, inode->i_mapping, start,
2027 PAGEVEC_SIZE);
2028 if (nr_pages == 0)
2029 break;
2030 for (i = 0; i < nr_pages; i++) {
2031 struct page *page = pvec.pages[i];
2032
2033 if (page->index > end)
2034 break;
 2035 /* Up to 'end' pages must be contiguous */
2036 BUG_ON(page->index != start);
2037 bh = head = page_buffers(page);
2038 do {
2039 if (lblk < mpd->map.m_lblk)
2040 continue;
2041 if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
2042 /*
2043 * Buffer after end of mapped extent.
2044 * Find next buffer in the page to map.
2045 */
2046 mpd->map.m_len = 0;
2047 mpd->map.m_flags = 0;
2048 add_page_bufs_to_extent(mpd, head, bh,
2049 lblk);
2050 pagevec_release(&pvec);
2051 return 0;
2052 }
2053 if (buffer_delay(bh)) {
2054 clear_buffer_delay(bh);
2055 bh->b_blocknr = pblock++;
2056 }
2057 clear_buffer_unwritten(bh);
2058 } while (++lblk < blocks &&
2059 (bh = bh->b_this_page) != head);
2060
2061 /*
2062 * FIXME: This is going to break if dioread_nolock
2063 * supports blocksize < pagesize as we will try to
2064 * convert potentially unmapped parts of inode.
2065 */
2066 mpd->io_submit.io_end->size += PAGE_CACHE_SIZE;
2067 /* Page fully mapped - let IO run! */
2068 err = mpage_submit_page(mpd, page);
2069 if (err < 0) {
2070 pagevec_release(&pvec);
2071 return err;
2072 }
2073 start++;
2074 }
2075 pagevec_release(&pvec);
2076 }
2077 /* Extent fully mapped and matches with page boundary. We are done. */
2078 mpd->map.m_len = 0;
2079 mpd->map.m_flags = 0;
2080 return 0;
2081}
2082
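
The page-walk above converts between logical blocks and page indexes through bpp_bits, the log2 of blocks per page. With 4k pages and 1k blocks (bpp_bits = 2), an extent of 9 blocks at m_lblk = 10 spans pages 2 through 4, and the buffer scan restarts at the first block of the first page so whole pages are walked. A quick check of that index math:

#include <stdio.h>

int main(void)
{
	int bpp_bits = 12 - 10;			/* 4k page, 1k block -> 2 */
	unsigned long m_lblk = 10, m_len = 9;

	unsigned long start = m_lblk >> bpp_bits;		/* page 2 */
	unsigned long end = (m_lblk + m_len - 1) >> bpp_bits;	/* page 4 */
	unsigned long lblk = start << bpp_bits;			/* block 8 */

	printf("pages %lu..%lu, scan buffers from block %lu\n",
	       start, end, lblk);
	return 0;
}
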
2083static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
2084{
2085 struct inode *inode = mpd->inode;
2086 struct ext4_map_blocks *map = &mpd->map;
2087 int get_blocks_flags;
2088 int err;
2089
2090 trace_ext4_da_write_pages_extent(inode, map);
2255 /* 2091 /*
2256 * With non-extent format the journal credit needed to 2092 * Call ext4_map_blocks() to allocate any delayed allocation blocks, or
2444 2257 * insert nrblocks contiguous blocks is dependent on 2093 * to convert an uninitialized extent to be initialized (in the case
2445 2258 * the number of contiguous blocks. So we will limit 2094 * where we have written into one or more preallocated blocks). It is
2446 2259 * the number of contiguous blocks to a sane value 2095 * possible that we're going to need more metadata blocks than
2096 * previously reserved. However we must not fail because we're in
2097 * writeback and there is nothing we can do about it so it might result
2098 * in data loss. So use reserved blocks to allocate metadata if
2099 * possible.
2100 *
2101 * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE if the blocks
2102 * in question are delalloc blocks. This affects functions in many
2103 * different parts of the allocation call path. This flag exists
 2104 * primarily because we don't want to change *many* callers, so
2105 * ext4_map_blocks() will set the EXT4_STATE_DELALLOC_RESERVED flag
2106 * once the inode's allocation semaphore is taken.
2260 */ 2107 */
2261 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) && 2108 get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
2262 (max_blocks > EXT4_MAX_TRANS_DATA)) 2109 EXT4_GET_BLOCKS_METADATA_NOFAIL;
2263 max_blocks = EXT4_MAX_TRANS_DATA; 2110 if (ext4_should_dioread_nolock(inode))
2111 get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
2112 if (map->m_flags & (1 << BH_Delay))
2113 get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
2264 2114
2265 return ext4_chunk_trans_blocks(inode, max_blocks); 2115 err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
2116 if (err < 0)
2117 return err;
2118 if (map->m_flags & EXT4_MAP_UNINIT) {
2119 if (!mpd->io_submit.io_end->handle &&
2120 ext4_handle_valid(handle)) {
2121 mpd->io_submit.io_end->handle = handle->h_rsv_handle;
2122 handle->h_rsv_handle = NULL;
2123 }
2124 ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
2125 }
2126
2127 BUG_ON(map->m_len == 0);
2128 if (map->m_flags & EXT4_MAP_NEW) {
2129 struct block_device *bdev = inode->i_sb->s_bdev;
2130 int i;
2131
2132 for (i = 0; i < map->m_len; i++)
2133 unmap_underlying_metadata(bdev, map->m_pblk + i);
2134 }
2135 return 0;
2266} 2136}
2267 2137
2268/* 2138/*
2269 * write_cache_pages_da - walk the list of dirty pages of the given 2139 * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
2270 * address space and accumulate pages that need writing, and call 2140 * mpd->len and submit pages underlying it for IO
2271 * mpage_da_map_and_submit to map a single contiguous memory region 2141 *
2272 * and then write them. 2142 * @handle - handle for journal operations
2143 * @mpd - extent to map
2144 *
 2145 * The function maps the extent starting at mpd->lblk of length mpd->len. If it
 2146 * is delayed, blocks are allocated; if it is unwritten, we may need to convert
 2147 * them to initialized or split the described range from a larger unwritten
 2148 * extent. Note that we need not map all of the described range since allocation
 2149 * can return fewer blocks or the range is covered by more unwritten extents. We
2150 * cannot map more because we are limited by reserved transaction credits. On
2151 * the other hand we always make sure that the last touched page is fully
2152 * mapped so that it can be written out (and thus forward progress is
2153 * guaranteed). After mapping we submit all mapped pages for IO.
2273 */ 2154 */
2274static int write_cache_pages_da(handle_t *handle, 2155static int mpage_map_and_submit_extent(handle_t *handle,
2275 struct address_space *mapping, 2156 struct mpage_da_data *mpd,
2276 struct writeback_control *wbc, 2157 bool *give_up_on_write)
2277 struct mpage_da_data *mpd,
2278 pgoff_t *done_index)
2279{ 2158{
2280 struct buffer_head *bh, *head; 2159 struct inode *inode = mpd->inode;
2281 struct inode *inode = mapping->host; 2160 struct ext4_map_blocks *map = &mpd->map;
2282 struct pagevec pvec; 2161 int err;
2283 unsigned int nr_pages; 2162 loff_t disksize;
2284 sector_t logical;
2285 pgoff_t index, end;
2286 long nr_to_write = wbc->nr_to_write;
2287 int i, tag, ret = 0;
2288
2289 memset(mpd, 0, sizeof(struct mpage_da_data));
2290 mpd->wbc = wbc;
2291 mpd->inode = inode;
2292 pagevec_init(&pvec, 0);
2293 index = wbc->range_start >> PAGE_CACHE_SHIFT;
2294 end = wbc->range_end >> PAGE_CACHE_SHIFT;
2295 2163
2296 if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) 2164 mpd->io_submit.io_end->offset =
2165 ((loff_t)map->m_lblk) << inode->i_blkbits;
2166 while (map->m_len) {
2167 err = mpage_map_one_extent(handle, mpd);
2168 if (err < 0) {
2169 struct super_block *sb = inode->i_sb;
2170
2171 if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
2172 goto invalidate_dirty_pages;
2173 /*
 2174 * Let the upper layers retry transient errors.
 2175 * In the case of ENOSPC, if ext4_count_free_clusters()
2176 * is non-zero, a commit should free up blocks.
2177 */
2178 if ((err == -ENOMEM) ||
2179 (err == -ENOSPC && ext4_count_free_clusters(sb)))
2180 return err;
2181 ext4_msg(sb, KERN_CRIT,
2182 "Delayed block allocation failed for "
2183 "inode %lu at logical offset %llu with"
2184 " max blocks %u with error %d",
2185 inode->i_ino,
2186 (unsigned long long)map->m_lblk,
2187 (unsigned)map->m_len, -err);
2188 ext4_msg(sb, KERN_CRIT,
2189 "This should not happen!! Data will "
2190 "be lost\n");
2191 if (err == -ENOSPC)
2192 ext4_print_free_blocks(inode);
2193 invalidate_dirty_pages:
2194 *give_up_on_write = true;
2195 return err;
2196 }
2197 /*
2198 * Update buffer state, submit mapped pages, and get us new
2199 * extent to map
2200 */
2201 err = mpage_map_and_submit_buffers(mpd);
2202 if (err < 0)
2203 return err;
2204 }
2205
2206 /* Update on-disk size after IO is submitted */
2207 disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
2208 if (disksize > i_size_read(inode))
2209 disksize = i_size_read(inode);
2210 if (disksize > EXT4_I(inode)->i_disksize) {
2211 int err2;
2212
2213 ext4_update_i_disksize(inode, disksize);
2214 err2 = ext4_mark_inode_dirty(handle, inode);
2215 if (err2)
2216 ext4_error(inode->i_sb,
2217 "Failed to mark inode %lu dirty",
2218 inode->i_ino);
2219 if (!err)
2220 err = err2;
2221 }
2222 return err;
2223}
2224
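
The error branch in mpage_map_and_submit_extent() separates transient failures, which the caller may simply retry, from fatal ones, where the still-dirty pages get invalidated via give_up_on_write. A hedged sketch of that decision, with stand-in predicates for the EXT4_MF_FS_ABORTED test and ext4_count_free_clusters():

#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the kernel-side checks, fixed for illustration. */
static bool fs_aborted(void) { return false; }
static bool commit_may_free_blocks(void) { return true; }

enum disposition { RETRY_LATER, GIVE_UP };

static enum disposition classify_map_error(int err, bool *give_up_on_write)
{
	if (fs_aborted())
		goto give_up;
	if (err == -ENOMEM)
		return RETRY_LATER;	/* transient allocation failure */
	if (err == -ENOSPC && commit_may_free_blocks())
		return RETRY_LATER;	/* a journal commit may free space */
give_up:
	*give_up_on_write = true;	/* invalidate the dirty pages */
	return GIVE_UP;
}
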
2225/*
2226 * Calculate the total number of credits to reserve for one writepages
2227 * iteration. This is called from ext4_writepages(). We map an extent of
 2228 * up to MAX_WRITEPAGES_EXTENT_LEN blocks and then we go on and finish mapping
2229 * the last partial page. So in total we can map MAX_WRITEPAGES_EXTENT_LEN +
2230 * bpp - 1 blocks in bpp different extents.
2231 */
2232static int ext4_da_writepages_trans_blocks(struct inode *inode)
2233{
2234 int bpp = ext4_journal_blocks_per_page(inode);
2235
2236 return ext4_meta_trans_blocks(inode,
2237 MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
2238}
2239
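
For the common 4k-block, 4k-page geometry bpp is 1, so one iteration reserves credits for a 2048-block extent; with 1k blocks under 4k pages bpp is 4 and the figure becomes 2051, the worst case of MAX_WRITEPAGES_EXTENT_LEN blocks plus a final partial page spread over bpp extents. Checking the formula stand-alone:

#include <stdio.h>

#define MAX_WRITEPAGES_EXTENT_LEN 2048

int main(void)
{
	/* bpp == blocks per page: 1 for 4k/4k, 4 for 1k blocks on 4k pages */
	for (int bpp = 1; bpp <= 4; bpp *= 4)
		printf("bpp=%d -> map up to %d blocks per iteration\n",
		       bpp, MAX_WRITEPAGES_EXTENT_LEN + bpp - 1);
	return 0;
}
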
2240/*
2241 * mpage_prepare_extent_to_map - find & lock contiguous range of dirty pages
2242 * and underlying extent to map
2243 *
2244 * @mpd - where to look for pages
2245 *
2246 * Walk dirty pages in the mapping. If they are fully mapped, submit them for
 2247 * IO immediately. When we find a page which isn't mapped, we start accumulating
 2248 * an extent of buffers underlying these pages that needs mapping (formed by
2249 * either delayed or unwritten buffers). We also lock the pages containing
 2250 * these buffers. The extent found is returned in the @mpd structure (starting at
2251 * mpd->lblk with length mpd->len blocks).
2252 *
2253 * Note that this function can attach bios to one io_end structure which are
 2254 * neither logically nor physically contiguous. Although it may seem an
 2255 * unnecessary complication, it is actually inevitable in the blocksize < pagesize
2256 * case as we need to track IO to all buffers underlying a page in one io_end.
2257 */
2258static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
2259{
2260 struct address_space *mapping = mpd->inode->i_mapping;
2261 struct pagevec pvec;
2262 unsigned int nr_pages;
2263 pgoff_t index = mpd->first_page;
2264 pgoff_t end = mpd->last_page;
2265 int tag;
2266 int i, err = 0;
2267 int blkbits = mpd->inode->i_blkbits;
2268 ext4_lblk_t lblk;
2269 struct buffer_head *head;
2270
2271 if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
2297 tag = PAGECACHE_TAG_TOWRITE; 2272 tag = PAGECACHE_TAG_TOWRITE;
2298 else 2273 else
2299 tag = PAGECACHE_TAG_DIRTY; 2274 tag = PAGECACHE_TAG_DIRTY;
2300 2275
2301 *done_index = index; 2276 pagevec_init(&pvec, 0);
2277 mpd->map.m_len = 0;
2278 mpd->next_page = index;
2302 while (index <= end) { 2279 while (index <= end) {
2303 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag, 2280 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
2304 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); 2281 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
2305 if (nr_pages == 0) 2282 if (nr_pages == 0)
2306 return 0; 2283 goto out;
2307 2284
2308 for (i = 0; i < nr_pages; i++) { 2285 for (i = 0; i < nr_pages; i++) {
2309 struct page *page = pvec.pages[i]; 2286 struct page *page = pvec.pages[i];
@@ -2318,31 +2295,21 @@ static int write_cache_pages_da(handle_t *handle,
2318 if (page->index > end) 2295 if (page->index > end)
2319 goto out; 2296 goto out;
2320 2297
2321 *done_index = page->index + 1; 2298 /* If we can't merge this page, we are done. */
2322 2299 if (mpd->map.m_len > 0 && mpd->next_page != page->index)
2323 /* 2300 goto out;
2324 * If we can't merge this page, and we have
2325 * accumulated a contiguous region, write it
2326 */
2327 if ((mpd->next_page != page->index) &&
2328 (mpd->next_page != mpd->first_page)) {
2329 mpage_da_map_and_submit(mpd);
2330 goto ret_extent_tail;
2331 }
2332 2301
2333 lock_page(page); 2302 lock_page(page);
2334
2335 /* 2303 /*
2336 * If the page is no longer dirty, or its 2304 * If the page is no longer dirty, or its mapping no
2337 * mapping no longer corresponds to inode we 2305 * longer corresponds to inode we are writing (which
2338 * are writing (which means it has been 2306 * means it has been truncated or invalidated), or the
2339 * truncated or invalidated), or the page is 2307 * page is already under writeback and we are not doing
2340 * already under writeback and we are not 2308 * a data integrity writeback, skip the page
2341 * doing a data integrity writeback, skip the page
2342 */ 2309 */
2343 if (!PageDirty(page) || 2310 if (!PageDirty(page) ||
2344 (PageWriteback(page) && 2311 (PageWriteback(page) &&
2345 (wbc->sync_mode == WB_SYNC_NONE)) || 2312 (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
2346 unlikely(page->mapping != mapping)) { 2313 unlikely(page->mapping != mapping)) {
2347 unlock_page(page); 2314 unlock_page(page);
2348 continue; 2315 continue;
@@ -2351,106 +2318,70 @@ static int write_cache_pages_da(handle_t *handle,
2351 wait_on_page_writeback(page); 2318 wait_on_page_writeback(page);
2352 BUG_ON(PageWriteback(page)); 2319 BUG_ON(PageWriteback(page));
2353 2320
2354 /* 2321 if (mpd->map.m_len == 0)
2355 * If we have inline data and arrive here, it means that
2356 * we will soon create the block for the 1st page, so
2357 * we'd better clear the inline data here.
2358 */
2359 if (ext4_has_inline_data(inode)) {
2360 BUG_ON(ext4_test_inode_state(inode,
2361 EXT4_STATE_MAY_INLINE_DATA));
2362 ext4_destroy_inline_data(handle, inode);
2363 }
2364
2365 if (mpd->next_page != page->index)
2366 mpd->first_page = page->index; 2322 mpd->first_page = page->index;
2367 mpd->next_page = page->index + 1; 2323 mpd->next_page = page->index + 1;
2368 logical = (sector_t) page->index <<
2369 (PAGE_CACHE_SHIFT - inode->i_blkbits);
2370
2371 /* Add all dirty buffers to mpd */ 2324 /* Add all dirty buffers to mpd */
2325 lblk = ((ext4_lblk_t)page->index) <<
2326 (PAGE_CACHE_SHIFT - blkbits);
2372 head = page_buffers(page); 2327 head = page_buffers(page);
2373 bh = head; 2328 if (!add_page_bufs_to_extent(mpd, head, head, lblk))
2374 do { 2329 goto out;
2375 BUG_ON(buffer_locked(bh)); 2330 /* So far everything mapped? Submit the page for IO. */
2376 /* 2331 if (mpd->map.m_len == 0) {
2377 * We need to try to allocate unmapped blocks 2332 err = mpage_submit_page(mpd, page);
2378 * in the same page. Otherwise we won't make 2333 if (err < 0)
2379 * progress with the page in ext4_writepage
2380 */
2381 if (ext4_bh_delay_or_unwritten(NULL, bh)) {
2382 mpage_add_bh_to_extent(mpd, logical,
2383 bh->b_state);
2384 if (mpd->io_done)
2385 goto ret_extent_tail;
2386 } else if (buffer_dirty(bh) &&
2387 buffer_mapped(bh)) {
2388 /*
2389 * mapped dirty buffer. We need to
2390 * update the b_state because we look
2391 * at b_state in mpage_da_map_blocks.
2392 * We don't update b_size because if we
2393 * find an unmapped buffer_head later
2394 * we need to use the b_state flag of
2395 * that buffer_head.
2396 */
2397 if (mpd->b_size == 0)
2398 mpd->b_state =
2399 bh->b_state & BH_FLAGS;
2400 }
2401 logical++;
2402 } while ((bh = bh->b_this_page) != head);
2403
2404 if (nr_to_write > 0) {
2405 nr_to_write--;
2406 if (nr_to_write == 0 &&
2407 wbc->sync_mode == WB_SYNC_NONE)
2408 /*
2409 * We stop writing back only if we are
2410 * not doing integrity sync. In case of
2411 * integrity sync we have to keep going
2412 * because someone may be concurrently
2413 * dirtying pages, and we might have
2414 * synced a lot of newly appeared dirty
2415 * pages, but have not synced all of the
2416 * old dirty pages.
2417 */
2418 goto out; 2334 goto out;
2419 } 2335 }
2336
2337 /*
2338 * Accumulated enough dirty pages? This doesn't apply
2339 * to WB_SYNC_ALL mode. For integrity sync we have to
2340 * keep going because someone may be concurrently
2341 * dirtying pages, and we might have synced a lot of
2342 * newly appeared dirty pages, but have not synced all
2343 * of the old dirty pages.
2344 */
2345 if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
2346 mpd->next_page - mpd->first_page >=
2347 mpd->wbc->nr_to_write)
2348 goto out;
2420 } 2349 }
2421 pagevec_release(&pvec); 2350 pagevec_release(&pvec);
2422 cond_resched(); 2351 cond_resched();
2423 } 2352 }
2424 return 0; 2353 return 0;
2425ret_extent_tail:
2426 ret = MPAGE_DA_EXTENT_TAIL;
2427out: 2354out:
2428 pagevec_release(&pvec); 2355 pagevec_release(&pvec);
2429 cond_resched(); 2356 return err;
2430 return ret;
2431} 2357}
2432 2358
2359static int __writepage(struct page *page, struct writeback_control *wbc,
2360 void *data)
2361{
2362 struct address_space *mapping = data;
2363 int ret = ext4_writepage(page, wbc);
2364 mapping_set_error(mapping, ret);
2365 return ret;
2366}
2433 2367
2434static int ext4_da_writepages(struct address_space *mapping, 2368static int ext4_writepages(struct address_space *mapping,
2435 struct writeback_control *wbc) 2369 struct writeback_control *wbc)
2436{ 2370{
2437 pgoff_t index; 2371 pgoff_t writeback_index = 0;
2372 long nr_to_write = wbc->nr_to_write;
2438 int range_whole = 0; 2373 int range_whole = 0;
2374 int cycled = 1;
2439 handle_t *handle = NULL; 2375 handle_t *handle = NULL;
2440 struct mpage_da_data mpd; 2376 struct mpage_da_data mpd;
2441 struct inode *inode = mapping->host; 2377 struct inode *inode = mapping->host;
2442 int pages_written = 0; 2378 int needed_blocks, rsv_blocks = 0, ret = 0;
2443 unsigned int max_pages;
2444 int range_cyclic, cycled = 1, io_done = 0;
2445 int needed_blocks, ret = 0;
2446 long desired_nr_to_write, nr_to_writebump = 0;
2447 loff_t range_start = wbc->range_start;
2448 struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb); 2379 struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
2449 pgoff_t done_index = 0; 2380 bool done;
2450 pgoff_t end;
2451 struct blk_plug plug; 2381 struct blk_plug plug;
2382 bool give_up_on_write = false;
2452 2383
2453 trace_ext4_da_writepages(inode, wbc); 2384 trace_ext4_writepages(inode, wbc);
2454 2385
2455 /* 2386 /*
2456 * No pages to write? This is mainly a kludge to avoid starting 2387 * No pages to write? This is mainly a kludge to avoid starting
@@ -2460,164 +2391,165 @@ static int ext4_da_writepages(struct address_space *mapping,
2460 if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) 2391 if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
2461 return 0; 2392 return 0;
2462 2393
2394 if (ext4_should_journal_data(inode)) {
2395 struct blk_plug plug;
2396 int ret;
2397
2398 blk_start_plug(&plug);
2399 ret = write_cache_pages(mapping, wbc, __writepage, mapping);
2400 blk_finish_plug(&plug);
2401 return ret;
2402 }
2403
2463 /* 2404 /*
2464 * If the filesystem has aborted, it is read-only, so return 2405 * If the filesystem has aborted, it is read-only, so return
2465 * right away instead of dumping stack traces later on that 2406 * right away instead of dumping stack traces later on that
2466 * will obscure the real source of the problem. We test 2407 * will obscure the real source of the problem. We test
2467 * EXT4_MF_FS_ABORTED instead of sb->s_flag's MS_RDONLY because 2408 * EXT4_MF_FS_ABORTED instead of sb->s_flag's MS_RDONLY because
2468 * the latter could be true if the filesystem is mounted 2409 * the latter could be true if the filesystem is mounted
2469 * read-only, and in that case, ext4_da_writepages should 2410 * read-only, and in that case, ext4_writepages should
2470 * *never* be called, so if that ever happens, we would want 2411 * *never* be called, so if that ever happens, we would want
2471 * the stack trace. 2412 * the stack trace.
2472 */ 2413 */
2473 if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED)) 2414 if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
2474 return -EROFS; 2415 return -EROFS;
2475 2416
2417 if (ext4_should_dioread_nolock(inode)) {
2418 /*
 2419 * We may need to convert up to one extent per block in
2420 * the page and we may dirty the inode.
2421 */
2422 rsv_blocks = 1 + (PAGE_CACHE_SIZE >> inode->i_blkbits);
2423 }
2424
2425 /*
2426 * If we have inline data and arrive here, it means that
2427 * we will soon create the block for the 1st page, so
2428 * we'd better clear the inline data here.
2429 */
2430 if (ext4_has_inline_data(inode)) {
2431 /* Just inode will be modified... */
2432 handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
2433 if (IS_ERR(handle)) {
2434 ret = PTR_ERR(handle);
2435 goto out_writepages;
2436 }
2437 BUG_ON(ext4_test_inode_state(inode,
2438 EXT4_STATE_MAY_INLINE_DATA));
2439 ext4_destroy_inline_data(handle, inode);
2440 ext4_journal_stop(handle);
2441 }
2442
2476 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) 2443 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
2477 range_whole = 1; 2444 range_whole = 1;
2478 2445
2479 range_cyclic = wbc->range_cyclic;
2480 if (wbc->range_cyclic) { 2446 if (wbc->range_cyclic) {
2481 index = mapping->writeback_index; 2447 writeback_index = mapping->writeback_index;
2482 if (index) 2448 if (writeback_index)
2483 cycled = 0; 2449 cycled = 0;
2484 wbc->range_start = index << PAGE_CACHE_SHIFT; 2450 mpd.first_page = writeback_index;
2485 wbc->range_end = LLONG_MAX; 2451 mpd.last_page = -1;
2486 wbc->range_cyclic = 0;
2487 end = -1;
2488 } else { 2452 } else {
2489 index = wbc->range_start >> PAGE_CACHE_SHIFT; 2453 mpd.first_page = wbc->range_start >> PAGE_CACHE_SHIFT;
2490 end = wbc->range_end >> PAGE_CACHE_SHIFT; 2454 mpd.last_page = wbc->range_end >> PAGE_CACHE_SHIFT;
2491 }
2492
2493 /*
2494 * This works around two forms of stupidity. The first is in
2495 * the writeback code, which caps the maximum number of pages
2496 * written to be 1024 pages. This is wrong on multiple
2497 * levels; different architectures have a different page size,
2498 * which changes the maximum amount of data which gets
2499 * written. Secondly, 4 megabytes is way too small. XFS
2500 * forces this value to be 16 megabytes by multiplying
2501 * nr_to_write parameter by four, and then relies on its
2502 * allocator to allocate larger extents to make them
2503 * contiguous. Unfortunately this brings us to the second
2504 * stupidity, which is that ext4's mballoc code only allocates
2505 * at most 2048 blocks. So we force contiguous writes up to
2506 * the number of dirty blocks in the inode, or
2507 * sbi->max_writeback_mb_bump, whichever is smaller.
2508 */
2509 max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
2510 if (!range_cyclic && range_whole) {
2511 if (wbc->nr_to_write == LONG_MAX)
2512 desired_nr_to_write = wbc->nr_to_write;
2513 else
2514 desired_nr_to_write = wbc->nr_to_write * 8;
2515 } else
2516 desired_nr_to_write = ext4_num_dirty_pages(inode, index,
2517 max_pages);
2518 if (desired_nr_to_write > max_pages)
2519 desired_nr_to_write = max_pages;
2520
2521 if (wbc->nr_to_write < desired_nr_to_write) {
2522 nr_to_writebump = desired_nr_to_write - wbc->nr_to_write;
2523 wbc->nr_to_write = desired_nr_to_write;
2524 } 2455 }
2525 2456
2457 mpd.inode = inode;
2458 mpd.wbc = wbc;
2459 ext4_io_submit_init(&mpd.io_submit, wbc);
2526retry: 2460retry:
2527 if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) 2461 if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
2528 tag_pages_for_writeback(mapping, index, end); 2462 tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
2529 2463 done = false;
2530 blk_start_plug(&plug); 2464 blk_start_plug(&plug);
2531 while (!ret && wbc->nr_to_write > 0) { 2465 while (!done && mpd.first_page <= mpd.last_page) {
2466 /* For each extent of pages we use new io_end */
2467 mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
2468 if (!mpd.io_submit.io_end) {
2469 ret = -ENOMEM;
2470 break;
2471 }
2532 2472
2533 /* 2473 /*
2534 * we insert one extent at a time, so we need the 2474 * We have two constraints: we find one extent to map and we
2535 * credits required for a single extent allocation. 2475 * must always write out the whole page (makes a difference when
2536 * journalled mode is currently not supported 2476 * blocksize < pagesize) so that we don't block on IO when we
2537 * by delalloc 2477 * try to write out the rest of the page. Journalled mode is
2478 * not supported by delalloc.
2538 */ 2479 */
2539 BUG_ON(ext4_should_journal_data(inode)); 2480 BUG_ON(ext4_should_journal_data(inode));
2540 needed_blocks = ext4_da_writepages_trans_blocks(inode); 2481 needed_blocks = ext4_da_writepages_trans_blocks(inode);
2541 2482
2542 /* start a new transaction*/ 2483 /* start a new transaction */
2543 handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, 2484 handle = ext4_journal_start_with_reserve(inode,
2544 needed_blocks); 2485 EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
2545 if (IS_ERR(handle)) { 2486 if (IS_ERR(handle)) {
2546 ret = PTR_ERR(handle); 2487 ret = PTR_ERR(handle);
2547 ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: " 2488 ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
2548 "%ld pages, ino %lu; err %d", __func__, 2489 "%ld pages, ino %lu; err %d", __func__,
2549 wbc->nr_to_write, inode->i_ino, ret); 2490 wbc->nr_to_write, inode->i_ino, ret);
2550 blk_finish_plug(&plug); 2491 /* Release allocated io_end */
2551 goto out_writepages; 2492 ext4_put_io_end(mpd.io_submit.io_end);
2493 break;
2552 } 2494 }
2553 2495
2554 /* 2496 trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
2555 * Now call write_cache_pages_da() to find the next 2497 ret = mpage_prepare_extent_to_map(&mpd);
2556 * contiguous region of logical blocks that need 2498 if (!ret) {
2557 * blocks to be allocated by ext4 and submit them. 2499 if (mpd.map.m_len)
2558 */ 2500 ret = mpage_map_and_submit_extent(handle, &mpd,
2559 ret = write_cache_pages_da(handle, mapping, 2501 &give_up_on_write);
2560 wbc, &mpd, &done_index); 2502 else {
2561 /* 2503 /*
2562 * If we have a contiguous extent of pages and we 2504 * We scanned the whole range (or exhausted
2563 * haven't done the I/O yet, map the blocks and submit 2505 * nr_to_write), submitted what was mapped and
2564 * them for I/O. 2506 * didn't find anything needing mapping. We are
2565 */ 2507 * done.
2566 if (!mpd.io_done && mpd.next_page != mpd.first_page) { 2508 */
2567 mpage_da_map_and_submit(&mpd); 2509 done = true;
2568 ret = MPAGE_DA_EXTENT_TAIL; 2510 }
2569 } 2511 }
2570 trace_ext4_da_write_pages(inode, &mpd);
2571 wbc->nr_to_write -= mpd.pages_written;
2572
2573 ext4_journal_stop(handle); 2512 ext4_journal_stop(handle);
2574 2513 /* Submit prepared bio */
2575 if ((mpd.retval == -ENOSPC) && sbi->s_journal) { 2514 ext4_io_submit(&mpd.io_submit);
2576 /* commit the transaction which would 2515 /* Unlock pages we didn't use */
2516 mpage_release_unused_pages(&mpd, give_up_on_write);
2517 /* Drop our io_end reference we got from init */
2518 ext4_put_io_end(mpd.io_submit.io_end);
2519
2520 if (ret == -ENOSPC && sbi->s_journal) {
2521 /*
2522 * Commit the transaction which would
2577 * free blocks released in the transaction 2523 * free blocks released in the transaction
2578 * and try again 2524 * and try again
2579 */ 2525 */
2580 jbd2_journal_force_commit_nested(sbi->s_journal); 2526 jbd2_journal_force_commit_nested(sbi->s_journal);
2581 ret = 0; 2527 ret = 0;
2582 } else if (ret == MPAGE_DA_EXTENT_TAIL) { 2528 continue;
2583 /* 2529 }
2584 * Got one extent; now try with the rest of the pages. 2530 /* Fatal error - ENOMEM, EIO... */
2585 * If mpd.retval is set to -EIO, the journal is aborted. 2531 if (ret)
2586 * So we don't need to write any more.
2587 */
2588 pages_written += mpd.pages_written;
2589 ret = mpd.retval;
2590 io_done = 1;
2591 } else if (wbc->nr_to_write)
2592 /*
2593 * There is no more writeout needed
2594 * or we requested a non-blocking writeout
2595 * and we found the device congested
2596 */
2597 break; 2532 break;
2598 } 2533 }
2599 blk_finish_plug(&plug); 2534 blk_finish_plug(&plug);
2600 if (!io_done && !cycled) { 2535 if (!ret && !cycled) {
2601 cycled = 1; 2536 cycled = 1;
2602 index = 0; 2537 mpd.last_page = writeback_index - 1;
2603 wbc->range_start = index << PAGE_CACHE_SHIFT; 2538 mpd.first_page = 0;
2604 wbc->range_end = mapping->writeback_index - 1;
2605 goto retry; 2539 goto retry;
2606 } 2540 }
2607 2541
2608 /* Update index */ 2542 /* Update index */
2609 wbc->range_cyclic = range_cyclic;
2610 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) 2543 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
2611 /* 2544 /*
2612 * set the writeback_index so that range_cyclic 2545 * Set the writeback_index so that range_cyclic
2613 * mode will write it back later 2546 * mode will write it back later
2614 */ 2547 */
2615 mapping->writeback_index = done_index; 2548 mapping->writeback_index = mpd.first_page;
2616 2549
2617out_writepages: 2550out_writepages:
2618 wbc->nr_to_write -= nr_to_writebump; 2551 trace_ext4_writepages_result(inode, wbc, ret,
2619 wbc->range_start = range_start; 2552 nr_to_write - wbc->nr_to_write);
2620 trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
2621 return ret; 2553 return ret;
2622} 2554}
2623 2555
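
The retry label and the cycled flag implement the usual two-pass range_cyclic writeback: pass one runs from mapping->writeback_index to end of file, and only if nothing failed does pass two wrap around to cover pages 0 .. writeback_index - 1. A compact sketch of that control flow, with an illustrative callback in place of the page scanning:

/* Two-pass cyclic writeback skeleton: pass 1 covers
 * [writeback_index, EOF], pass 2 wraps to [0, writeback_index - 1]. */
static int writepages_cyclic(unsigned long writeback_index,
			     int (*write_range)(unsigned long first,
						unsigned long last))
{
	unsigned long first = writeback_index;
	unsigned long last = (unsigned long)-1;	/* -1 == up to EOF */
	int cycled = (writeback_index == 0);	/* nothing to wrap to */
	int ret;

retry:
	ret = write_range(first, last);
	if (!ret && !cycled) {
		cycled = 1;			/* wrap around exactly once */
		first = 0;
		last = writeback_index - 1;
		goto retry;
	}
	return ret;
}
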
@@ -2829,7 +2761,8 @@ static int ext4_da_write_end(struct file *file,
2829 return ret ? ret : copied; 2761 return ret ? ret : copied;
2830} 2762}
2831 2763
2832static void ext4_da_invalidatepage(struct page *page, unsigned long offset) 2764static void ext4_da_invalidatepage(struct page *page, unsigned int offset,
2765 unsigned int length)
2833{ 2766{
2834 /* 2767 /*
2835 * Drop reserved blocks 2768 * Drop reserved blocks
@@ -2838,10 +2771,10 @@ static void ext4_da_invalidatepage(struct page *page, unsigned long offset)
2838 if (!page_has_buffers(page)) 2771 if (!page_has_buffers(page))
2839 goto out; 2772 goto out;
2840 2773
2841 ext4_da_page_release_reservation(page, offset); 2774 ext4_da_page_release_reservation(page, offset, length);
2842 2775
2843out: 2776out:
2844 ext4_invalidatepage(page, offset); 2777 ext4_invalidatepage(page, offset, length);
2845 2778
2846 return; 2779 return;
2847} 2780}
@@ -2864,7 +2797,7 @@ int ext4_alloc_da_blocks(struct inode *inode)
2864 * laptop_mode, not even desirable). However, to do otherwise 2797 * laptop_mode, not even desirable). However, to do otherwise
2865 * would require replicating code paths in: 2798 * would require replicating code paths in:
2866 * 2799 *
2867 * ext4_da_writepages() -> 2800 * ext4_writepages() ->
2868 * write_cache_pages() ---> (via passed in callback function) 2801 * write_cache_pages() ---> (via passed in callback function)
2869 * __mpage_da_writepage() --> 2802 * __mpage_da_writepage() -->
2870 * mpage_add_bh_to_extent() 2803 * mpage_add_bh_to_extent()
@@ -2989,37 +2922,40 @@ ext4_readpages(struct file *file, struct address_space *mapping,
2989 return mpage_readpages(mapping, pages, nr_pages, ext4_get_block); 2922 return mpage_readpages(mapping, pages, nr_pages, ext4_get_block);
2990} 2923}
2991 2924
2992static void ext4_invalidatepage(struct page *page, unsigned long offset) 2925static void ext4_invalidatepage(struct page *page, unsigned int offset,
2926 unsigned int length)
2993{ 2927{
2994 trace_ext4_invalidatepage(page, offset); 2928 trace_ext4_invalidatepage(page, offset, length);
2995 2929
2996 /* No journalling happens on data buffers when this function is used */ 2930 /* No journalling happens on data buffers when this function is used */
2997 WARN_ON(page_has_buffers(page) && buffer_jbd(page_buffers(page))); 2931 WARN_ON(page_has_buffers(page) && buffer_jbd(page_buffers(page)));
2998 2932
2999 block_invalidatepage(page, offset); 2933 block_invalidatepage(page, offset, length);
3000} 2934}
3001 2935
3002static int __ext4_journalled_invalidatepage(struct page *page, 2936static int __ext4_journalled_invalidatepage(struct page *page,
3003 unsigned long offset) 2937 unsigned int offset,
2938 unsigned int length)
3004{ 2939{
3005 journal_t *journal = EXT4_JOURNAL(page->mapping->host); 2940 journal_t *journal = EXT4_JOURNAL(page->mapping->host);
3006 2941
3007 trace_ext4_journalled_invalidatepage(page, offset); 2942 trace_ext4_journalled_invalidatepage(page, offset, length);
3008 2943
3009 /* 2944 /*
3010 * If it's a full truncate we just forget about the pending dirtying 2945 * If it's a full truncate we just forget about the pending dirtying
3011 */ 2946 */
3012 if (offset == 0) 2947 if (offset == 0 && length == PAGE_CACHE_SIZE)
3013 ClearPageChecked(page); 2948 ClearPageChecked(page);
3014 2949
3015 return jbd2_journal_invalidatepage(journal, page, offset); 2950 return jbd2_journal_invalidatepage(journal, page, offset, length);
3016} 2951}
3017 2952
3018/* Wrapper for aops... */ 2953/* Wrapper for aops... */
3019static void ext4_journalled_invalidatepage(struct page *page, 2954static void ext4_journalled_invalidatepage(struct page *page,
3020 unsigned long offset) 2955 unsigned int offset,
2956 unsigned int length)
3021{ 2957{
3022 WARN_ON(__ext4_journalled_invalidatepage(page, offset) < 0); 2958 WARN_ON(__ext4_journalled_invalidatepage(page, offset, length) < 0);
3023} 2959}
3024 2960
3025static int ext4_releasepage(struct page *page, gfp_t wait) 2961static int ext4_releasepage(struct page *page, gfp_t wait)
@@ -3067,9 +3003,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
3067 struct inode *inode = file_inode(iocb->ki_filp); 3003 struct inode *inode = file_inode(iocb->ki_filp);
3068 ext4_io_end_t *io_end = iocb->private; 3004 ext4_io_end_t *io_end = iocb->private;
3069 3005
3070 /* if not async direct IO or dio with a 0-byte write, just return */ 3006 /* if not async direct IO, just return */
3071 if (!io_end || !size) 3007 if (!io_end) {
3072 goto out; 3008 inode_dio_done(inode);
3009 if (is_async)
3010 aio_complete(iocb, ret, 0);
3011 return;
3012 }
3073 3013
3074 ext_debug("ext4_end_io_dio(): io_end 0x%p " 3014 ext_debug("ext4_end_io_dio(): io_end 0x%p "
3075 "for inode %lu, iocb 0x%p, offset %llu, size %zd\n", 3015 "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
@@ -3077,25 +3017,13 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
3077 size); 3017 size);
3078 3018
3079 iocb->private = NULL; 3019 iocb->private = NULL;
3080
3081 /* if not aio dio with unwritten extents, just free io and return */
3082 if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
3083 ext4_free_io_end(io_end);
3084out:
3085 inode_dio_done(inode);
3086 if (is_async)
3087 aio_complete(iocb, ret, 0);
3088 return;
3089 }
3090
3091 io_end->offset = offset; 3020 io_end->offset = offset;
3092 io_end->size = size; 3021 io_end->size = size;
3093 if (is_async) { 3022 if (is_async) {
3094 io_end->iocb = iocb; 3023 io_end->iocb = iocb;
3095 io_end->result = ret; 3024 io_end->result = ret;
3096 } 3025 }
3097 3026 ext4_put_io_end_defer(io_end);
3098 ext4_add_complete_io(io_end);
3099} 3027}
3100 3028
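
ext4_end_io_dio() now only fills in the io_end and drops a reference instead of freeing outright; the submission path holds its own reference. That is plain reference counting: creator, submitter and completion each hold a ref, and whoever drops the last one frees the structure. A generic user-space sketch of the discipline (illustrative, not the ext4 types):

#include <stdatomic.h>
#include <stdlib.h>

struct io_end {
	atomic_int ref;
	/* offset, size, iocb ... elided */
};

static struct io_end *io_end_alloc(void)
{
	struct io_end *io = calloc(1, sizeof(*io));

	if (io)
		atomic_init(&io->ref, 1);	/* creator's reference */
	return io;
}

static struct io_end *io_end_get(struct io_end *io)
{
	atomic_fetch_add(&io->ref, 1);		/* e.g. for the iocb */
	return io;
}

static void io_end_put(struct io_end *io)
{
	/* Whoever drops the last reference finishes and frees;
	 * everyone else just returns, so there is no use-after-free. */
	if (atomic_fetch_sub(&io->ref, 1) == 1)
		free(io);
}
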
3101/* 3029/*
@@ -3129,6 +3057,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
3129 get_block_t *get_block_func = NULL; 3057 get_block_t *get_block_func = NULL;
3130 int dio_flags = 0; 3058 int dio_flags = 0;
3131 loff_t final_size = offset + count; 3059 loff_t final_size = offset + count;
3060 ext4_io_end_t *io_end = NULL;
3132 3061
3133 /* Use the old path for reads and writes beyond i_size. */ 3062 /* Use the old path for reads and writes beyond i_size. */
3134 if (rw != WRITE || final_size > inode->i_size) 3063 if (rw != WRITE || final_size > inode->i_size)
@@ -3136,11 +3065,18 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
3136 3065
3137 BUG_ON(iocb->private == NULL); 3066 BUG_ON(iocb->private == NULL);
3138 3067
3068 /*
3069 * Make all waiters for direct IO properly wait also for extent
 3070 * conversion. This also disallows a race between truncate() and
3071 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
3072 */
3073 if (rw == WRITE)
3074 atomic_inc(&inode->i_dio_count);
3075
3139 /* If we do an overwrite dio, i_mutex locking can be released */ 3076 /* If we do an overwrite dio, i_mutex locking can be released */
3140 overwrite = *((int *)iocb->private); 3077 overwrite = *((int *)iocb->private);
3141 3078
3142 if (overwrite) { 3079 if (overwrite) {
3143 atomic_inc(&inode->i_dio_count);
3144 down_read(&EXT4_I(inode)->i_data_sem); 3080 down_read(&EXT4_I(inode)->i_data_sem);
3145 mutex_unlock(&inode->i_mutex); 3081 mutex_unlock(&inode->i_mutex);
3146 } 3082 }
@@ -3167,13 +3103,16 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
3167 iocb->private = NULL; 3103 iocb->private = NULL;
3168 ext4_inode_aio_set(inode, NULL); 3104 ext4_inode_aio_set(inode, NULL);
3169 if (!is_sync_kiocb(iocb)) { 3105 if (!is_sync_kiocb(iocb)) {
3170 ext4_io_end_t *io_end = ext4_init_io_end(inode, GFP_NOFS); 3106 io_end = ext4_init_io_end(inode, GFP_NOFS);
3171 if (!io_end) { 3107 if (!io_end) {
3172 ret = -ENOMEM; 3108 ret = -ENOMEM;
3173 goto retake_lock; 3109 goto retake_lock;
3174 } 3110 }
3175 io_end->flag |= EXT4_IO_END_DIRECT; 3111 io_end->flag |= EXT4_IO_END_DIRECT;
3176 iocb->private = io_end; 3112 /*
3113 * Grab reference for DIO. Will be dropped in ext4_end_io_dio()
3114 */
3115 iocb->private = ext4_get_io_end(io_end);
3177 /* 3116 /*
3178 * we save the io structure for current async direct 3117 * we save the io structure for current async direct
3179 * IO, so that later ext4_map_blocks() could flag the 3118 * IO, so that later ext4_map_blocks() could flag the
@@ -3197,33 +3136,42 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
3197 NULL, 3136 NULL,
3198 dio_flags); 3137 dio_flags);
3199 3138
3200 if (iocb->private)
3201 ext4_inode_aio_set(inode, NULL);
3202 /* 3139 /*
3203 * The io_end structure takes a reference to the inode, that 3140 * Put our reference to io_end. This can free the io_end structure e.g.
3204 * structure needs to be destroyed and the reference to the 3141 * in sync IO case or in case of error. It can even perform extent
3205 * inode needs to be dropped when IO is complete, even with a 0 3142 * conversion if all bios we submitted finished before we got here.
3206 * byte write, or on failure. 3143 * Note that in that case iocb->private can already be set to NULL
3207 * 3144 * here.
3208 * In the successful AIO DIO case, the io_end structure will
3209 * be destroyed and the reference to the inode will be dropped
3210 * after the end_io call back function is called.
3211 *
3212 * In the case there is 0 byte write, or error case, since VFS
3213 * direct IO won't invoke the end_io call back function, we
3214 * need to free the end_io structure here.
3215 */ 3145 */
3216 if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) { 3146 if (io_end) {
3217 ext4_free_io_end(iocb->private); 3147 ext4_inode_aio_set(inode, NULL);
3218 iocb->private = NULL; 3148 ext4_put_io_end(io_end);
3219 } else if (ret > 0 && !overwrite && ext4_test_inode_state(inode, 3149 /*
 3150 * When no IO was submitted, ext4_end_io_dio() was not
 3151 * called, so we have to put iocb's reference.
3152 */
3153 if (ret <= 0 && ret != -EIOCBQUEUED && iocb->private) {
3154 WARN_ON(iocb->private != io_end);
3155 WARN_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
3156 WARN_ON(io_end->iocb);
3157 /*
3158 * Generic code already did inode_dio_done() so we
3159 * have to clear EXT4_IO_END_DIRECT to not do it for
3160 * the second time.
3161 */
3162 io_end->flag = 0;
3163 ext4_put_io_end(io_end);
3164 iocb->private = NULL;
3165 }
3166 }
3167 if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
3220 EXT4_STATE_DIO_UNWRITTEN)) { 3168 EXT4_STATE_DIO_UNWRITTEN)) {
3221 int err; 3169 int err;
3222 /* 3170 /*
3223 * for the non-AIO case, since the IO is already 3171 * for the non-AIO case, since the IO is already
3224 * completed, we could do the conversion right here 3172 * completed, we could do the conversion right here
3225 */ 3173 */
3226 err = ext4_convert_unwritten_extents(inode, 3174 err = ext4_convert_unwritten_extents(NULL, inode,
3227 offset, ret); 3175 offset, ret);
3228 if (err < 0) 3176 if (err < 0)
3229 ret = err; 3177 ret = err;
@@ -3231,9 +3179,10 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
3231 } 3179 }
3232 3180
3233retake_lock: 3181retake_lock:
3182 if (rw == WRITE)
3183 inode_dio_done(inode);
3234 /* take i_mutex locking again if we do an overwrite dio */ 3184 /* take i_mutex locking again if we do an overwrite dio */
3235 if (overwrite) { 3185 if (overwrite) {
3236 inode_dio_done(inode);
3237 up_read(&EXT4_I(inode)->i_data_sem); 3186 up_read(&EXT4_I(inode)->i_data_sem);
3238 mutex_lock(&inode->i_mutex); 3187 mutex_lock(&inode->i_mutex);
3239 } 3188 }
@@ -3292,6 +3241,7 @@ static const struct address_space_operations ext4_aops = {
3292 .readpage = ext4_readpage, 3241 .readpage = ext4_readpage,
3293 .readpages = ext4_readpages, 3242 .readpages = ext4_readpages,
3294 .writepage = ext4_writepage, 3243 .writepage = ext4_writepage,
3244 .writepages = ext4_writepages,
3295 .write_begin = ext4_write_begin, 3245 .write_begin = ext4_write_begin,
3296 .write_end = ext4_write_end, 3246 .write_end = ext4_write_end,
3297 .bmap = ext4_bmap, 3247 .bmap = ext4_bmap,
@@ -3307,6 +3257,7 @@ static const struct address_space_operations ext4_journalled_aops = {
3307 .readpage = ext4_readpage, 3257 .readpage = ext4_readpage,
3308 .readpages = ext4_readpages, 3258 .readpages = ext4_readpages,
3309 .writepage = ext4_writepage, 3259 .writepage = ext4_writepage,
3260 .writepages = ext4_writepages,
3310 .write_begin = ext4_write_begin, 3261 .write_begin = ext4_write_begin,
3311 .write_end = ext4_journalled_write_end, 3262 .write_end = ext4_journalled_write_end,
3312 .set_page_dirty = ext4_journalled_set_page_dirty, 3263 .set_page_dirty = ext4_journalled_set_page_dirty,
@@ -3322,7 +3273,7 @@ static const struct address_space_operations ext4_da_aops = {
3322 .readpage = ext4_readpage, 3273 .readpage = ext4_readpage,
3323 .readpages = ext4_readpages, 3274 .readpages = ext4_readpages,
3324 .writepage = ext4_writepage, 3275 .writepage = ext4_writepage,
3325 .writepages = ext4_da_writepages, 3276 .writepages = ext4_writepages,
3326 .write_begin = ext4_da_write_begin, 3277 .write_begin = ext4_da_write_begin,
3327 .write_end = ext4_da_write_end, 3278 .write_end = ext4_da_write_end,
3328 .bmap = ext4_bmap, 3279 .bmap = ext4_bmap,
@@ -3355,89 +3306,56 @@ void ext4_set_aops(struct inode *inode)
3355 inode->i_mapping->a_ops = &ext4_aops; 3306 inode->i_mapping->a_ops = &ext4_aops;
3356} 3307}
3357 3308
3358
3359/* 3309/*
3360 * ext4_discard_partial_page_buffers() 3310 * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
3361 * Wrapper function for ext4_discard_partial_page_buffers_no_lock. 3311 * up to the end of the block which corresponds to `from'.
3362 * This function finds and locks the page containing the offset 3312 * This required during truncate. We need to physically zero the tail end
3363 * "from" and passes it to ext4_discard_partial_page_buffers_no_lock. 3313 * of that block so it doesn't yield old data if the file is later grown.
3364 * Calling functions that already have the page locked should call
3365 * ext4_discard_partial_page_buffers_no_lock directly.
3366 */ 3314 */
3367int ext4_discard_partial_page_buffers(handle_t *handle, 3315int ext4_block_truncate_page(handle_t *handle,
3368 struct address_space *mapping, loff_t from, 3316 struct address_space *mapping, loff_t from)
3369 loff_t length, int flags)
3370{ 3317{
3318 unsigned offset = from & (PAGE_CACHE_SIZE-1);
3319 unsigned length;
3320 unsigned blocksize;
3371 struct inode *inode = mapping->host; 3321 struct inode *inode = mapping->host;
3372 struct page *page;
3373 int err = 0;
3374 3322
3375 page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT, 3323 blocksize = inode->i_sb->s_blocksize;
3376 mapping_gfp_mask(mapping) & ~__GFP_FS); 3324 length = blocksize - (offset & (blocksize - 1));
3377 if (!page)
3378 return -ENOMEM;
3379
3380 err = ext4_discard_partial_page_buffers_no_lock(handle, inode, page,
3381 from, length, flags);
3382 3325
3383 unlock_page(page); 3326 return ext4_block_zero_page_range(handle, mapping, from, length);
3384 page_cache_release(page);
3385 return err;
3386} 3327}
3387 3328
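
The wrapper reduces truncate-tail zeroing to a single range zero: blocksize - (offset & (blocksize - 1)) is exactly the distance from 'from' to the end of its block. Worked numbers, assuming 4096-byte blocks:

#include <stdio.h>

int main(void)
{
	unsigned blocksize = 4096;
	unsigned long from = 5000;	/* new i_size */
	unsigned offset = from & (blocksize - 1);		  /* 904 */
	unsigned length = blocksize - (offset & (blocksize - 1)); /* 3192 */

	printf("zero %u bytes starting %u bytes into the block\n",
	       length, offset);
	return 0;
}
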
3388/* 3329/*
3389 * ext4_discard_partial_page_buffers_no_lock() 3330 * ext4_block_zero_page_range() zeros out a mapping of length 'length'
3390 * Zeros a page range of length 'length' starting from offset 'from'. 3331 * starting from file offset 'from'. The range to be zeroed must
3391 * Buffer heads that correspond to the block aligned regions of the 3332 * be contained within one block. If the specified range exceeds
3392 * zeroed range will be unmapped. Non-block-aligned regions 3333 * the end of the block it will be shortened to the end of the block
3393 * will have the corresponding buffer head mapped if needed so that 3334 * that corresponds to 'from'
3394 * the region of the page can be updated with the partial zero out.
3395 *
3396 * This function assumes that the page has already been locked. 3335 */
3397 * The range to be discarded must be contained within the given page. 3336int ext4_block_zero_page_range(handle_t *handle,
3398 * If the specified range exceeds the end of the page it will be shortened
3399 * to the end of the page that corresponds to 'from'. This function is
3400 * appropriate for updating a page and its buffer heads to be unmapped and 3341 unsigned blocksize, max, pos;
3401 * zeroed for blocks that have been either released, or are going to be
3402 * released.
3403 *
3404 * handle: The journal handle
3405 * inode: The files inode
3406 * page: A locked page that contains the offset "from"
3407 * from: The starting byte offset (from the beginning of the file)
3408 * to begin discarding
3409 * len: The length of bytes to discard
3410 * flags: Optional flags that may be used:
3411 *
3412 * EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
3413 * Only zero the regions of the page whose buffer heads
3414 * have already been unmapped. This flag is appropriate
3415 * for updating the contents of a page whose blocks may
3416 * have already been released, and we only want to zero
3417 * out the regions that correspond to those released blocks.
3418 *
3419 * Returns zero on success or negative on failure.
3420 */ 3335 */
3421static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle, 3336int ext4_block_zero_page_range(handle_t *handle,
3422 struct inode *inode, struct page *page, loff_t from, 3337 struct address_space *mapping, loff_t from, loff_t length)
3423 loff_t length, int flags)
3424{ 3338{
3425 ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; 3339 ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
3426 unsigned int offset = from & (PAGE_CACHE_SIZE-1); 3340 unsigned offset = from & (PAGE_CACHE_SIZE-1);
3427 unsigned int blocksize, max, pos; 3341 unsigned blocksize, max, pos;
3428 ext4_lblk_t iblock; 3342 ext4_lblk_t iblock;
3343 struct inode *inode = mapping->host;
3429 struct buffer_head *bh; 3344 struct buffer_head *bh;
3345 struct page *page;
3430 int err = 0; 3346 int err = 0;
3431 3347
3432 blocksize = inode->i_sb->s_blocksize; 3348 page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
3433 max = PAGE_CACHE_SIZE - offset; 3349 mapping_gfp_mask(mapping) & ~__GFP_FS);
3350 if (!page)
3351 return -ENOMEM;
3434 3352
3435 if (index != page->index) 3353 blocksize = inode->i_sb->s_blocksize;
3436 return -EINVAL; 3354 max = blocksize - (offset & (blocksize - 1));
3437 3355
3438 /* 3356 /*
3439 * correct length if it does not fall between 3357 * correct length if it does not fall between
3440 * 'from' and the end of the page 3358 * 'from' and the end of the block
3441 */ 3359 */
3442 if (length > max || length < 0) 3360 if (length > max || length < 0)
3443 length = max; 3361 length = max;
@@ -3455,106 +3373,91 @@ static int ext4_discard_partial_page_buffers_no_lock(handle_t *handle,
3455 iblock++; 3373 iblock++;
3456 pos += blocksize; 3374 pos += blocksize;
3457 } 3375 }
3458 3376 if (buffer_freed(bh)) {
3459 pos = offset; 3377 BUFFER_TRACE(bh, "freed: skip");
3460 while (pos < offset + length) { 3378 goto unlock;
3461 unsigned int end_of_block, range_to_discard; 3379 }
3462 3380 if (!buffer_mapped(bh)) {
3463 err = 0; 3381 BUFFER_TRACE(bh, "unmapped");
3464 3382 ext4_get_block(inode, iblock, bh, 0);
3465 /* The length of space left to zero and unmap */ 3383 /* unmapped? It's a hole - nothing to do */
3466 range_to_discard = offset + length - pos;
3467
3468 /* The length of space until the end of the block */
3469 end_of_block = blocksize - (pos & (blocksize-1));
3470
3471 /*
3472 * Do not unmap or zero past end of block
3473 * for this buffer head
3474 */
3475 if (range_to_discard > end_of_block)
3476 range_to_discard = end_of_block;
3477
3478
3479 /*
3480 * Skip this buffer head if we are only zeroing unmapped
3481 * regions of the page
3482 */
3483 if (flags & EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED &&
3484 buffer_mapped(bh))
3485 goto next;
3486
3487 /* If the range is block aligned, unmap */
3488 if (range_to_discard == blocksize) {
3489 clear_buffer_dirty(bh);
3490 bh->b_bdev = NULL;
3491 clear_buffer_mapped(bh);
3492 clear_buffer_req(bh);
3493 clear_buffer_new(bh);
3494 clear_buffer_delay(bh);
3495 clear_buffer_unwritten(bh);
3496 clear_buffer_uptodate(bh);
3497 zero_user(page, pos, range_to_discard);
3498 BUFFER_TRACE(bh, "Buffer discarded");
3499 goto next;
3500 }
3501
3502 /*
3503 * If this block is not completely contained in the range
3504 * to be discarded, then it is not going to be released. Because
3505 * we need to keep this block, we need to make sure this part
3506 * of the page is uptodate before we modify it by writing
3507 * partial zeros on it.
3508 */
3509 if (!buffer_mapped(bh)) { 3384 if (!buffer_mapped(bh)) {
3510 /* 3385 BUFFER_TRACE(bh, "still unmapped");
3511 * Buffer head must be mapped before we can read 3386 goto unlock;
3512 * from the block
3513 */
3514 BUFFER_TRACE(bh, "unmapped");
3515 ext4_get_block(inode, iblock, bh, 0);
3516 /* unmapped? It's a hole - nothing to do */
3517 if (!buffer_mapped(bh)) {
3518 BUFFER_TRACE(bh, "still unmapped");
3519 goto next;
3520 }
3521 } 3387 }
3388 }
3522 3389
3523 /* Ok, it's mapped. Make sure it's up-to-date */ 3390 /* Ok, it's mapped. Make sure it's up-to-date */
3524 if (PageUptodate(page)) 3391 if (PageUptodate(page))
3525 set_buffer_uptodate(bh); 3392 set_buffer_uptodate(bh);
3526 3393
3527 if (!buffer_uptodate(bh)) { 3394 if (!buffer_uptodate(bh)) {
3528 err = -EIO; 3395 err = -EIO;
3529 ll_rw_block(READ, 1, &bh); 3396 ll_rw_block(READ, 1, &bh);
3530 wait_on_buffer(bh); 3397 wait_on_buffer(bh);
3531 /* Uhhuh. Read error. Complain and punt. */ 3398 /* Uhhuh. Read error. Complain and punt. */
3532 if (!buffer_uptodate(bh)) 3399 if (!buffer_uptodate(bh))
3533 goto next; 3400 goto unlock;
3534 } 3401 }
3402 if (ext4_should_journal_data(inode)) {
3403 BUFFER_TRACE(bh, "get write access");
3404 err = ext4_journal_get_write_access(handle, bh);
3405 if (err)
3406 goto unlock;
3407 }
3408 zero_user(page, offset, length);
3409 BUFFER_TRACE(bh, "zeroed end of block");
3535 3410
3536 if (ext4_should_journal_data(inode)) { 3411 if (ext4_should_journal_data(inode)) {
3537 BUFFER_TRACE(bh, "get write access"); 3412 err = ext4_handle_dirty_metadata(handle, inode, bh);
3538 err = ext4_journal_get_write_access(handle, bh); 3413 } else {
3539 if (err) 3414 err = 0;
3540 goto next; 3415 mark_buffer_dirty(bh);
3541 } 3416 if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE))
3417 err = ext4_jbd2_file_inode(handle, inode);
3418 }
3419
3420unlock:
3421 unlock_page(page);
3422 page_cache_release(page);
3423 return err;
3424}
3542 3425
3543 zero_user(page, pos, range_to_discard); 3426int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
3427 loff_t lstart, loff_t length)
3428{
3429 struct super_block *sb = inode->i_sb;
3430 struct address_space *mapping = inode->i_mapping;
3431 unsigned partial_start, partial_end;
3432 ext4_fsblk_t start, end;
3433 loff_t byte_end = (lstart + length - 1);
3434 int err = 0;
3544 3435
3545 err = 0; 3436 partial_start = lstart & (sb->s_blocksize - 1);
3546 if (ext4_should_journal_data(inode)) { 3437 partial_end = byte_end & (sb->s_blocksize - 1);
3547 err = ext4_handle_dirty_metadata(handle, inode, bh);
3548 } else
3549 mark_buffer_dirty(bh);
3550 3438
3551 BUFFER_TRACE(bh, "Partial buffer zeroed"); 3439 start = lstart >> sb->s_blocksize_bits;
3552next: 3440 end = byte_end >> sb->s_blocksize_bits;
3553 bh = bh->b_this_page;
3554 iblock++;
3555 pos += range_to_discard;
3556 }
3557 3441
3442 /* Handle partial zero within the single block */
3443 if (start == end &&
3444 (partial_start || (partial_end != sb->s_blocksize - 1))) {
3445 err = ext4_block_zero_page_range(handle, mapping,
3446 lstart, length);
3447 return err;
3448 }
3449 /* Handle partial zero out on the start of the range */
3450 if (partial_start) {
3451 err = ext4_block_zero_page_range(handle, mapping,
3452 lstart, sb->s_blocksize);
3453 if (err)
3454 return err;
3455 }
3456 /* Handle partial zero out on the end of the range */
3457 if (partial_end != sb->s_blocksize - 1)
3458 err = ext4_block_zero_page_range(handle, mapping,
3459 byte_end - partial_end,
3460 partial_end + 1);
3558 return err; 3461 return err;
3559} 3462}
3560 3463
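
The replacement logic above is worth spelling out: ext4_zero_partial_blocks() reduces the old per-page buffer walk to pure block arithmetic, since a byte range can leave at most two partial blocks, one at each end. A minimal userspace sketch of that decision, assuming only a power-of-two block size (the helper and its printf reporting are illustrative, not kernel code):

#include <stdio.h>

/* Which partial blocks does zeroing [lstart, lstart+length) touch?
 * Mirrors the ext4_zero_partial_blocks() arithmetic shown above. */
static void partial_blocks(unsigned long long lstart, unsigned long long length,
			   unsigned blocksize)
{
	unsigned long long byte_end = lstart + length - 1;
	unsigned partial_start = lstart & (blocksize - 1);
	unsigned partial_end = byte_end & (blocksize - 1);
	unsigned long long start = lstart / blocksize;
	unsigned long long end = byte_end / blocksize;

	if (start == end && (partial_start || partial_end != blocksize - 1)) {
		/* whole range inside one block: a single zeroing call */
		printf("zero %llu..%llu within block %llu\n",
		       lstart, byte_end, start);
		return;
	}
	if (partial_start)			/* unaligned head */
		printf("zero head from %llu to end of its block\n", lstart);
	if (partial_end != blocksize - 1)	/* unaligned tail */
		printf("zero tail %llu..%llu\n",
		       byte_end - partial_end, byte_end);
}

int main(void)
{
	partial_blocks(1000, 5000, 4096);	/* head and tail both partial */
	return 0;
}
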
@@ -3580,14 +3483,12 @@ int ext4_can_truncate(struct inode *inode)
3580 * Returns: 0 on success or negative on failure 3483 * Returns: 0 on success or negative on failure
3581 */ 3484 */
3582 3485
3583int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) 3486int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
3584{ 3487{
3585 struct inode *inode = file_inode(file);
3586 struct super_block *sb = inode->i_sb; 3488 struct super_block *sb = inode->i_sb;
3587 ext4_lblk_t first_block, stop_block; 3489 ext4_lblk_t first_block, stop_block;
3588 struct address_space *mapping = inode->i_mapping; 3490 struct address_space *mapping = inode->i_mapping;
3589 loff_t first_page, last_page, page_len; 3491 loff_t first_block_offset, last_block_offset;
3590 loff_t first_page_offset, last_page_offset;
3591 handle_t *handle; 3492 handle_t *handle;
3592 unsigned int credits; 3493 unsigned int credits;
3593 int ret = 0; 3494 int ret = 0;
@@ -3638,23 +3539,16 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
3638 offset; 3539 offset;
3639 } 3540 }
3640 3541
3641 first_page = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 3542 first_block_offset = round_up(offset, sb->s_blocksize);
3642 last_page = (offset + length) >> PAGE_CACHE_SHIFT; 3543 last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
3643 3544
3644 first_page_offset = first_page << PAGE_CACHE_SHIFT; 3545 /* Now release the pages and zero the block-aligned part of pages */
3645 last_page_offset = last_page << PAGE_CACHE_SHIFT; 3546 if (last_block_offset > first_block_offset)
3646 3547 truncate_pagecache_range(inode, first_block_offset,
3647 /* Now release the pages */ 3548 last_block_offset);
3648 if (last_page_offset > first_page_offset) {
3649 truncate_pagecache_range(inode, first_page_offset,
3650 last_page_offset - 1);
3651 }
3652 3549
3653 /* Wait for all existing dio workers; newcomers will block on i_mutex */ 3550 /* Wait for all existing dio workers; newcomers will block on i_mutex */
3654 ext4_inode_block_unlocked_dio(inode); 3551 ext4_inode_block_unlocked_dio(inode);
3655 ret = ext4_flush_unwritten_io(inode);
3656 if (ret)
3657 goto out_dio;
3658 inode_dio_wait(inode); 3552 inode_dio_wait(inode);
3659 3553
3660 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) 3554 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
@@ -3668,66 +3562,10 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
3668 goto out_dio; 3562 goto out_dio;
3669 } 3563 }
3670 3564
3671 /* 3565 ret = ext4_zero_partial_blocks(handle, inode, offset,
3672 * Now we need to zero out the non-page-aligned data in the 3566 length);
3673 * pages at the start and tail of the hole, and unmap the 3567 if (ret)
3674 * buffer heads for the block aligned regions of the page that 3568 goto out_stop;
3675 * were completely zeroed.
3676 */
3677 if (first_page > last_page) {
3678 /*
3679 * If the file space being truncated is contained
3680 * within a page just zero out and unmap the middle of
3681 * that page
3682 */
3683 ret = ext4_discard_partial_page_buffers(handle,
3684 mapping, offset, length, 0);
3685
3686 if (ret)
3687 goto out_stop;
3688 } else {
3689 /*
3690 * zero out and unmap the partial page that contains
3691 * the start of the hole
3692 */
3693 page_len = first_page_offset - offset;
3694 if (page_len > 0) {
3695 ret = ext4_discard_partial_page_buffers(handle, mapping,
3696 offset, page_len, 0);
3697 if (ret)
3698 goto out_stop;
3699 }
3700
3701 /*
3702 * zero out and unmap the partial page that contains
3703 * the end of the hole
3704 */
3705 page_len = offset + length - last_page_offset;
3706 if (page_len > 0) {
3707 ret = ext4_discard_partial_page_buffers(handle, mapping,
3708 last_page_offset, page_len, 0);
3709 if (ret)
3710 goto out_stop;
3711 }
3712 }
3713
3714 /*
3715 * If i_size is contained in the last page, we need to
3716 * unmap and zero the partial page after i_size
3717 */
3718 if (inode->i_size >> PAGE_CACHE_SHIFT == last_page &&
3719 inode->i_size % PAGE_CACHE_SIZE != 0) {
3720 page_len = PAGE_CACHE_SIZE -
3721 (inode->i_size & (PAGE_CACHE_SIZE - 1));
3722
3723 if (page_len > 0) {
3724 ret = ext4_discard_partial_page_buffers(handle,
3725 mapping, inode->i_size, page_len, 0);
3726
3727 if (ret)
3728 goto out_stop;
3729 }
3730 }
3731 3569
3732 first_block = (offset + sb->s_blocksize - 1) >> 3570 first_block = (offset + sb->s_blocksize - 1) >>
3733 EXT4_BLOCK_SIZE_BITS(sb); 3571 EXT4_BLOCK_SIZE_BITS(sb);
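
The unit change above is the heart of this hunk: the hole is now trimmed to filesystem blocks instead of pages, so only the fully block-aligned middle is dropped from the page cache while the ragged ends are zeroed in place. A small sketch of the rounding, with local helpers standing in for the kernel's round_up()/round_down() macros and made-up numbers:

#include <stdio.h>

static unsigned long long round_up_ull(unsigned long long x, unsigned a)
{
	return (x + a - 1) & ~(unsigned long long)(a - 1);
}

static unsigned long long round_down_ull(unsigned long long x, unsigned a)
{
	return x & ~(unsigned long long)(a - 1);
}

int main(void)
{
	unsigned blocksize = 1024;		/* e.g. 1k blocks under 4k pages */
	unsigned long long offset = 1500, length = 6000;

	unsigned long long first = round_up_ull(offset, blocksize);
	unsigned long long last = round_down_ull(offset + length, blocksize) - 1;

	/* Cache is truncated only over [first, last]; the partial blocks
	 * at 1500..2047 and 7168..7499 are zeroed in place instead. */
	printf("truncate page cache bytes %llu..%llu\n", first, last);
	return 0;
}
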
@@ -3803,7 +3641,6 @@ void ext4_truncate(struct inode *inode)
3803 unsigned int credits; 3641 unsigned int credits;
3804 handle_t *handle; 3642 handle_t *handle;
3805 struct address_space *mapping = inode->i_mapping; 3643 struct address_space *mapping = inode->i_mapping;
3806 loff_t page_len;
3807 3644
3808 /* 3645 /*
3809 * There is a possibility that we're either freeing the inode 3646 * There is a possibility that we're either freeing the inode
@@ -3830,12 +3667,6 @@ void ext4_truncate(struct inode *inode)
3830 return; 3667 return;
3831 } 3668 }
3832 3669
3833 /*
3834 * finish any pending end_io work so we won't run the risk of
3835 * converting any truncated blocks to initialized later
3836 */
3837 ext4_flush_unwritten_io(inode);
3838
3839 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) 3670 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
3840 credits = ext4_writepage_trans_blocks(inode); 3671 credits = ext4_writepage_trans_blocks(inode);
3841 else 3672 else
@@ -3847,14 +3678,8 @@ void ext4_truncate(struct inode *inode)
3847 return; 3678 return;
3848 } 3679 }
3849 3680
3850 if (inode->i_size % PAGE_CACHE_SIZE != 0) { 3681 if (inode->i_size & (inode->i_sb->s_blocksize - 1))
3851 page_len = PAGE_CACHE_SIZE - 3682 ext4_block_truncate_page(handle, mapping, inode->i_size);
3852 (inode->i_size & (PAGE_CACHE_SIZE - 1));
3853
3854 if (ext4_discard_partial_page_buffers(handle,
3855 mapping, inode->i_size, page_len, 0))
3856 goto out_stop;
3857 }
3858 3683
3859 /* 3684 /*
3860 * We add the inode to the orphan list, so that if this 3685 * We add the inode to the orphan list, so that if this
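
The same block-granularity idea lands in ext4_truncate() above: the old page-sized test is replaced by a mask check, since a partial tail block exists exactly when i_size has low bits set under the block mask. A worked example with an assumed 4k block size:

#include <stdio.h>

int main(void)
{
	unsigned long long i_size = 10000;	/* hypothetical file size */
	unsigned blocksize = 4096;

	/* 10000 & 4095 == 1808, so the last block is partial and bytes
	 * 10000..12287 (the rest of that block) must be zeroed. */
	if (i_size & (blocksize - 1))
		printf("zero tail %llu..%llu\n",
		       i_size, i_size | (blocksize - 1ULL));
	return 0;
}
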
@@ -4623,7 +4448,8 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
4623 inode->i_size >> PAGE_CACHE_SHIFT); 4448 inode->i_size >> PAGE_CACHE_SHIFT);
4624 if (!page) 4449 if (!page)
4625 return; 4450 return;
4626 ret = __ext4_journalled_invalidatepage(page, offset); 4451 ret = __ext4_journalled_invalidatepage(page, offset,
4452 PAGE_CACHE_SIZE - offset);
4627 unlock_page(page); 4453 unlock_page(page);
4628 page_cache_release(page); 4454 page_cache_release(page);
4629 if (ret != -EBUSY) 4455 if (ret != -EBUSY)
@@ -4805,7 +4631,7 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
4805 struct kstat *stat) 4631 struct kstat *stat)
4806{ 4632{
4807 struct inode *inode; 4633 struct inode *inode;
4808 unsigned long delalloc_blocks; 4634 unsigned long long delalloc_blocks;
4809 4635
4810 inode = dentry->d_inode; 4636 inode = dentry->d_inode;
4811 generic_fillattr(inode, stat); 4637 generic_fillattr(inode, stat);
@@ -4823,15 +4649,16 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
4823 delalloc_blocks = EXT4_C2B(EXT4_SB(inode->i_sb), 4649 delalloc_blocks = EXT4_C2B(EXT4_SB(inode->i_sb),
4824 EXT4_I(inode)->i_reserved_data_blocks); 4650 EXT4_I(inode)->i_reserved_data_blocks);
4825 4651
4826 stat->blocks += (delalloc_blocks << inode->i_sb->s_blocksize_bits)>>9; 4652 stat->blocks += delalloc_blocks << (inode->i_sb->s_blocksize_bits-9);
4827 return 0; 4653 return 0;
4828} 4654}
4829 4655
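
Two fixes travel together in the getattr hunk: delalloc_blocks is widened to unsigned long long, and the conversion to 512-byte sectors becomes a single shift so no 32-bit intermediate can overflow. A worked sketch of the arithmetic with an invented reservation:

#include <stdio.h>

int main(void)
{
	unsigned long long delalloc_blocks = 3000000;	/* hypothetical */
	unsigned blocksize_bits = 12;			/* 4k blocks */

	/* One shift by (bits - 9) instead of "<< bits >> 9": with a 32-bit
	 * intermediate, 3000000 << 12 (~12.3e9) would already have wrapped. */
	unsigned long long sectors = delalloc_blocks << (blocksize_bits - 9);
	printf("%llu fs blocks = %llu sectors\n", delalloc_blocks, sectors);
	return 0;
}
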
4830static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk) 4656static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
4657 int pextents)
4831{ 4658{
4832 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) 4659 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
4833 return ext4_ind_trans_blocks(inode, nrblocks, chunk); 4660 return ext4_ind_trans_blocks(inode, lblocks);
4834 return ext4_ext_index_trans_blocks(inode, nrblocks, chunk); 4661 return ext4_ext_index_trans_blocks(inode, pextents);
4835} 4662}
4836 4663
4837/* 4664/*
@@ -4845,7 +4672,8 @@ static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
4845 * 4672 *
4846 * Also account for superblock, inode, quota and xattr blocks 4673 * Also account for superblock, inode, quota and xattr blocks
4847 */ 4674 */
4848static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk) 4675static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
4676 int pextents)
4849{ 4677{
4850 ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb); 4678 ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
4851 int gdpblocks; 4679 int gdpblocks;
@@ -4853,14 +4681,10 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
4853 int ret = 0; 4681 int ret = 0;
4854 4682
4855 /* 4683 /*
4856 * How many index blocks need to touch to modify nrblocks? 4684 * How many index blocks need to touch to map @lblocks logical blocks
4857 * The "Chunk" flag indicating whether the nrblocks is 4685 * to @pextents physical extents?
4858 * physically contiguous on disk
4859 *
4860 * For Direct IO and fallocate, they calls get_block to allocate
4861 * one single extent at a time, so they could set the "Chunk" flag
4862 */ 4686 */
4863 idxblocks = ext4_index_trans_blocks(inode, nrblocks, chunk); 4687 idxblocks = ext4_index_trans_blocks(inode, lblocks, pextents);
4864 4688
4865 ret = idxblocks; 4689 ret = idxblocks;
4866 4690
@@ -4868,12 +4692,7 @@ static int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
4868 * Now let's see how many group bitmaps and group descriptors need 4692 * Now let's see how many group bitmaps and group descriptors need
4869 * to account 4693 * to account
4870 */ 4694 */
4871 groups = idxblocks; 4695 groups = idxblocks + pextents;
4872 if (chunk)
4873 groups += 1;
4874 else
4875 groups += nrblocks;
4876
4877 gdpblocks = groups; 4696 gdpblocks = groups;
4878 if (groups > ngroups) 4697 if (groups > ngroups)
4879 groups = ngroups; 4698 groups = ngroups;
@@ -4904,7 +4723,7 @@ int ext4_writepage_trans_blocks(struct inode *inode)
4904 int bpp = ext4_journal_blocks_per_page(inode); 4723 int bpp = ext4_journal_blocks_per_page(inode);
4905 int ret; 4724 int ret;
4906 4725
4907 ret = ext4_meta_trans_blocks(inode, bpp, 0); 4726 ret = ext4_meta_trans_blocks(inode, bpp, bpp);
4908 4727
4909 /* Account for data blocks for journalled mode */ 4728 /* Account for data blocks for journalled mode */
4910 if (ext4_should_journal_data(inode)) 4729 if (ext4_should_journal_data(inode))
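
With the (nrblocks, chunk) pair replaced by an explicit (lblocks, pextents), the credit estimate can say what it means: every physical extent may dirty its own block bitmap and group descriptor. A simplified model of the calculation (constants invented; the superblock, inode and quota credits the real function adds are left out):

#include <stdio.h>

static int meta_trans_blocks(int idxblocks, int pextents,
			     int ngroups, int gdb_count)
{
	int groups = idxblocks + pextents;	/* worst-case bitmaps touched */
	int gdpblocks = groups;

	if (groups > ngroups)
		groups = ngroups;		/* can't exceed group count */
	if (gdpblocks > gdb_count)
		gdpblocks = gdb_count;		/* or descriptor blocks */

	return idxblocks + groups + gdpblocks;
}

int main(void)
{
	/* ext4_writepage_trans_blocks() now passes bpp for both arguments:
	 * each block of the page could land in its own extent. */
	int bpp = 4;				/* e.g. 1k blocks, 4k page */
	printf("credits ~ %d\n", meta_trans_blocks(1, bpp, 128, 16));
	return 0;
}
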
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index def84082a9a9..a9ff5e5137ca 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2105,6 +2105,7 @@ repeat:
2105 group = ac->ac_g_ex.fe_group; 2105 group = ac->ac_g_ex.fe_group;
2106 2106
2107 for (i = 0; i < ngroups; group++, i++) { 2107 for (i = 0; i < ngroups; group++, i++) {
2108 cond_resched();
2108 /* 2109 /*
2109 * Artificially restricted ngroups for non-extent 2110 * Artificially restricted ngroups for non-extent
2110 * files makes group > ngroups possible on first loop. 2111 * files makes group > ngroups possible on first loop.
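
The one-line cond_resched() above matters on huge filesystems: the group scan can walk tens of thousands of groups without ever blocking, so it now offers the scheduler a preemption point. The shape of the idiom in userspace terms, with sched_yield() standing in for the kernel's cond_resched():

#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* A long scan with no sleeping calls: yield periodically so one
	 * thread cannot monopolize the CPU for the whole walk. */
	for (long group = 0; group < 1000000; group++) {
		if ((group & 1023) == 0)
			sched_yield();
		/* ... try to allocate from this group ... */
	}
	printf("scan complete\n");
	return 0;
}
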
@@ -4405,17 +4406,20 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
4405repeat: 4406repeat:
4406 /* allocate space in core */ 4407 /* allocate space in core */
4407 *errp = ext4_mb_regular_allocator(ac); 4408 *errp = ext4_mb_regular_allocator(ac);
4408 if (*errp) { 4409 if (*errp)
4409 ext4_discard_allocated_blocks(ac); 4410 goto discard_and_exit;
4410 goto errout;
4411 }
4412 4411
4413 /* as we've just preallocated more space than 4412 /* as we've just preallocated more space than
4414 * user requested orinally, we store allocated 4413 * user requested originally, we store allocated
4415 * space in a special descriptor */ 4414 * space in a special descriptor */
4416 if (ac->ac_status == AC_STATUS_FOUND && 4415 if (ac->ac_status == AC_STATUS_FOUND &&
4417 ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) 4416 ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
4418 ext4_mb_new_preallocation(ac); 4417 *errp = ext4_mb_new_preallocation(ac);
4418 if (*errp) {
4419 discard_and_exit:
4420 ext4_discard_allocated_blocks(ac);
4421 goto errout;
4422 }
4419 } 4423 }
4420 if (likely(ac->ac_status == AC_STATUS_FOUND)) { 4424 if (likely(ac->ac_status == AC_STATUS_FOUND)) {
4421 *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs); 4425 *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
@@ -4612,10 +4616,11 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
4612 BUG_ON(bh && (count > 1)); 4616 BUG_ON(bh && (count > 1));
4613 4617
4614 for (i = 0; i < count; i++) { 4618 for (i = 0; i < count; i++) {
4619 cond_resched();
4615 if (!bh) 4620 if (!bh)
4616 tbh = sb_find_get_block(inode->i_sb, 4621 tbh = sb_find_get_block(inode->i_sb,
4617 block + i); 4622 block + i);
4618 if (unlikely(!tbh)) 4623 if (!tbh)
4619 continue; 4624 continue;
4620 ext4_forget(handle, flags & EXT4_FREE_BLOCKS_METADATA, 4625 ext4_forget(handle, flags & EXT4_FREE_BLOCKS_METADATA,
4621 inode, tbh, block + i); 4626 inode, tbh, block + i);
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 3dcbf364022f..e86dddbd8296 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -912,7 +912,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
912 struct page *pagep[2] = {NULL, NULL}; 912 struct page *pagep[2] = {NULL, NULL};
913 handle_t *handle; 913 handle_t *handle;
914 ext4_lblk_t orig_blk_offset; 914 ext4_lblk_t orig_blk_offset;
915 long long offs = orig_page_offset << PAGE_CACHE_SHIFT;
916 unsigned long blocksize = orig_inode->i_sb->s_blocksize; 915 unsigned long blocksize = orig_inode->i_sb->s_blocksize;
917 unsigned int w_flags = 0; 916 unsigned int w_flags = 0;
918 unsigned int tmp_data_size, data_size, replaced_size; 917 unsigned int tmp_data_size, data_size, replaced_size;
@@ -940,8 +939,6 @@ again:
940 orig_blk_offset = orig_page_offset * blocks_per_page + 939 orig_blk_offset = orig_page_offset * blocks_per_page +
941 data_offset_in_page; 940 data_offset_in_page;
942 941
943 offs = (long long)orig_blk_offset << orig_inode->i_blkbits;
944
945 /* Calculate data_size */ 942 /* Calculate data_size */
946 if ((orig_blk_offset + block_len_in_page - 1) == 943 if ((orig_blk_offset + block_len_in_page - 1) ==
947 ((orig_inode->i_size - 1) >> orig_inode->i_blkbits)) { 944 ((orig_inode->i_size - 1) >> orig_inode->i_blkbits)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 6653fc35ecb7..ab2f6dc44b3a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -918,11 +918,8 @@ static int htree_dirblock_to_tree(struct file *dir_file,
918 bh->b_data, bh->b_size, 918 bh->b_data, bh->b_size,
919 (block<<EXT4_BLOCK_SIZE_BITS(dir->i_sb)) 919 (block<<EXT4_BLOCK_SIZE_BITS(dir->i_sb))
920 + ((char *)de - bh->b_data))) { 920 + ((char *)de - bh->b_data))) {
921 /* On error, skip the f_pos to the next block. */ 921 /* silently ignore the rest of the block */
922 dir_file->f_pos = (dir_file->f_pos | 922 break;
923 (dir->i_sb->s_blocksize - 1)) + 1;
924 brelse(bh);
925 return count;
926 } 923 }
927 ext4fs_dirhash(de->name, de->name_len, hinfo); 924 ext4fs_dirhash(de->name, de->name_len, hinfo);
928 if ((hinfo->hash < start_hash) || 925 if ((hinfo->hash < start_hash) ||
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 4acf1f78881b..48786cdb5e6c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -46,46 +46,121 @@ void ext4_exit_pageio(void)
46} 46}
47 47
48/* 48/*
49 * This function is called by ext4_evict_inode() to make sure there is 49 * Print an buffer I/O error compatible with the fs/buffer.c. This
50 * no more pending I/O completion work left to do. 50 * provides compatibility with dmesg scrapers that look for a specific
51 * buffer I/O error message. We really need a unified error reporting
52 * structure to userspace ala Digital Unix's uerf system, but it's
53 * probably not going to happen in my lifetime, due to LKML politics...
51 */ 54 */
52void ext4_ioend_shutdown(struct inode *inode) 55static void buffer_io_error(struct buffer_head *bh)
53{ 56{
54 wait_queue_head_t *wq = ext4_ioend_wq(inode); 57 char b[BDEVNAME_SIZE];
58 printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
59 bdevname(bh->b_bdev, b),
60 (unsigned long long)bh->b_blocknr);
61}
55 62
56 wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0)); 63static void ext4_finish_bio(struct bio *bio)
57 /* 64{
58 * We need to make sure the work structure is finished being 65 int i;
59 * used before we let the inode get destroyed. 66 int error = !test_bit(BIO_UPTODATE, &bio->bi_flags);
60 */ 67
61 if (work_pending(&EXT4_I(inode)->i_unwritten_work)) 68 for (i = 0; i < bio->bi_vcnt; i++) {
62 cancel_work_sync(&EXT4_I(inode)->i_unwritten_work); 69 struct bio_vec *bvec = &bio->bi_io_vec[i];
70 struct page *page = bvec->bv_page;
71 struct buffer_head *bh, *head;
72 unsigned bio_start = bvec->bv_offset;
73 unsigned bio_end = bio_start + bvec->bv_len;
74 unsigned under_io = 0;
75 unsigned long flags;
76
77 if (!page)
78 continue;
79
80 if (error) {
81 SetPageError(page);
82 set_bit(AS_EIO, &page->mapping->flags);
83 }
84 bh = head = page_buffers(page);
85 /*
86 * We check all buffers in the page under BH_Uptodate_Lock
87 * to avoid races with other end io clearing async_write flags
88 */
89 local_irq_save(flags);
90 bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
91 do {
92 if (bh_offset(bh) < bio_start ||
93 bh_offset(bh) + bh->b_size > bio_end) {
94 if (buffer_async_write(bh))
95 under_io++;
96 continue;
97 }
98 clear_buffer_async_write(bh);
99 if (error)
100 buffer_io_error(bh);
101 } while ((bh = bh->b_this_page) != head);
102 bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
103 local_irq_restore(flags);
104 if (!under_io)
105 end_page_writeback(page);
106 }
63} 107}
64 108
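
ext4_finish_bio() above takes over the per-page bookkeeping that used to sit in ext4_end_bio(): for each bio_vec it walks the page's circular buffer ring, clears async_write on the buffers this bio covered, and ends page writeback only if no other buffer on the page is still in flight. The walk in miniature, over a toy two-buffer ring rather than real buffer_heads:

#include <stdbool.h>
#include <stdio.h>

struct buf {				/* toy stand-in for buffer_head */
	unsigned offset, size;
	bool async_write;
	struct buf *next;		/* circular, like b_this_page */
};

/* Clear async_write on buffers inside [start, end); report whether the
 * page is now free of writeback (no buffer outside the range in flight). */
static bool finish_range(struct buf *head, unsigned start, unsigned end)
{
	struct buf *b = head;
	unsigned under_io = 0;

	do {
		if (b->offset < start || b->offset + b->size > end) {
			if (b->async_write)
				under_io++;	/* belongs to another bio */
			continue;
		}
		b->async_write = false;		/* this bio finished it */
	} while ((b = b->next) != head);

	return under_io == 0;
}

int main(void)
{
	struct buf b2 = { 1024, 1024, true, NULL };
	struct buf b1 = { 0, 1024, true, &b2 };

	b2.next = &b1;				/* close the ring */
	/* Our bio covered only the first 1k, so writeback must stay set. */
	printf("end_page_writeback? %d\n", finish_range(&b1, 0, 1024));
	return 0;
}
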
65void ext4_free_io_end(ext4_io_end_t *io) 109static void ext4_release_io_end(ext4_io_end_t *io_end)
66{ 110{
67 BUG_ON(!io); 111 struct bio *bio, *next_bio;
68 BUG_ON(!list_empty(&io->list)); 112
69 BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN); 113 BUG_ON(!list_empty(&io_end->list));
114 BUG_ON(io_end->flag & EXT4_IO_END_UNWRITTEN);
115 WARN_ON(io_end->handle);
70 116
71 if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count)) 117 if (atomic_dec_and_test(&EXT4_I(io_end->inode)->i_ioend_count))
72 wake_up_all(ext4_ioend_wq(io->inode)); 118 wake_up_all(ext4_ioend_wq(io_end->inode));
73 kmem_cache_free(io_end_cachep, io); 119
120 for (bio = io_end->bio; bio; bio = next_bio) {
121 next_bio = bio->bi_private;
122 ext4_finish_bio(bio);
123 bio_put(bio);
124 }
125 if (io_end->flag & EXT4_IO_END_DIRECT)
126 inode_dio_done(io_end->inode);
127 if (io_end->iocb)
128 aio_complete(io_end->iocb, io_end->result, 0);
129 kmem_cache_free(io_end_cachep, io_end);
74} 130}
75 131
76/* check a range of space and convert unwritten extents to written. */ 132static void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
133{
134 struct inode *inode = io_end->inode;
135
136 io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
137 /* Wake up anyone waiting on unwritten extent conversion */
138 if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
139 wake_up_all(ext4_ioend_wq(inode));
140}
141
142/*
143 * Check a range of space and convert unwritten extents to written. Note that
144 * we are protected from truncate touching same part of extent tree by the
145 * fact that truncate code waits for all DIO to finish (thus exclusion from
146 * direct IO is achieved) and also waits for PageWriteback bits. Thus we
147 * cannot get to ext4_ext_truncate() before all IOs overlapping that range are
148 * completed (happens from ext4_free_ioend()).
149 */
77static int ext4_end_io(ext4_io_end_t *io) 150static int ext4_end_io(ext4_io_end_t *io)
78{ 151{
79 struct inode *inode = io->inode; 152 struct inode *inode = io->inode;
80 loff_t offset = io->offset; 153 loff_t offset = io->offset;
81 ssize_t size = io->size; 154 ssize_t size = io->size;
155 handle_t *handle = io->handle;
82 int ret = 0; 156 int ret = 0;
83 157
84 ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p," 158 ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
85 "list->prev 0x%p\n", 159 "list->prev 0x%p\n",
86 io, inode->i_ino, io->list.next, io->list.prev); 160 io, inode->i_ino, io->list.next, io->list.prev);
87 161
88 ret = ext4_convert_unwritten_extents(inode, offset, size); 162 io->handle = NULL; /* Following call will use up the handle */
163 ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
89 if (ret < 0) { 164 if (ret < 0) {
90 ext4_msg(inode->i_sb, KERN_EMERG, 165 ext4_msg(inode->i_sb, KERN_EMERG,
91 "failed to convert unwritten extents to written " 166 "failed to convert unwritten extents to written "
@@ -93,30 +168,22 @@ static int ext4_end_io(ext4_io_end_t *io)
93 "(inode %lu, offset %llu, size %zd, error %d)", 168 "(inode %lu, offset %llu, size %zd, error %d)",
94 inode->i_ino, offset, size, ret); 169 inode->i_ino, offset, size, ret);
95 } 170 }
96 /* Wake up anyone waiting on unwritten extent conversion */ 171 ext4_clear_io_unwritten_flag(io);
97 if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten)) 172 ext4_release_io_end(io);
98 wake_up_all(ext4_ioend_wq(inode));
99 if (io->flag & EXT4_IO_END_DIRECT)
100 inode_dio_done(inode);
101 if (io->iocb)
102 aio_complete(io->iocb, io->result, 0);
103 return ret; 173 return ret;
104} 174}
105 175
106static void dump_completed_IO(struct inode *inode) 176static void dump_completed_IO(struct inode *inode, struct list_head *head)
107{ 177{
108#ifdef EXT4FS_DEBUG 178#ifdef EXT4FS_DEBUG
109 struct list_head *cur, *before, *after; 179 struct list_head *cur, *before, *after;
110 ext4_io_end_t *io, *io0, *io1; 180 ext4_io_end_t *io, *io0, *io1;
111 181
112 if (list_empty(&EXT4_I(inode)->i_completed_io_list)) { 182 if (list_empty(head))
113 ext4_debug("inode %lu completed_io list is empty\n",
114 inode->i_ino);
115 return; 183 return;
116 }
117 184
118 ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino); 185 ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
119 list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) { 186 list_for_each_entry(io, head, list) {
120 cur = &io->list; 187 cur = &io->list;
121 before = cur->prev; 188 before = cur->prev;
122 io0 = container_of(before, ext4_io_end_t, list); 189 io0 = container_of(before, ext4_io_end_t, list);
@@ -130,23 +197,30 @@ static void dump_completed_IO(struct inode *inode)
130} 197}
131 198
132/* Add the io_end to per-inode completed end_io list. */ 199/* Add the io_end to per-inode completed end_io list. */
133void ext4_add_complete_io(ext4_io_end_t *io_end) 200static void ext4_add_complete_io(ext4_io_end_t *io_end)
134{ 201{
135 struct ext4_inode_info *ei = EXT4_I(io_end->inode); 202 struct ext4_inode_info *ei = EXT4_I(io_end->inode);
136 struct workqueue_struct *wq; 203 struct workqueue_struct *wq;
137 unsigned long flags; 204 unsigned long flags;
138 205
139 BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN)); 206 BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
140 wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
141
142 spin_lock_irqsave(&ei->i_completed_io_lock, flags); 207 spin_lock_irqsave(&ei->i_completed_io_lock, flags);
143 if (list_empty(&ei->i_completed_io_list)) 208 if (io_end->handle) {
144 queue_work(wq, &ei->i_unwritten_work); 209 wq = EXT4_SB(io_end->inode->i_sb)->rsv_conversion_wq;
145 list_add_tail(&io_end->list, &ei->i_completed_io_list); 210 if (list_empty(&ei->i_rsv_conversion_list))
211 queue_work(wq, &ei->i_rsv_conversion_work);
212 list_add_tail(&io_end->list, &ei->i_rsv_conversion_list);
213 } else {
214 wq = EXT4_SB(io_end->inode->i_sb)->unrsv_conversion_wq;
215 if (list_empty(&ei->i_unrsv_conversion_list))
216 queue_work(wq, &ei->i_unrsv_conversion_work);
217 list_add_tail(&io_end->list, &ei->i_unrsv_conversion_list);
218 }
146 spin_unlock_irqrestore(&ei->i_completed_io_lock, flags); 219 spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
147} 220}
148 221
149static int ext4_do_flush_completed_IO(struct inode *inode) 222static int ext4_do_flush_completed_IO(struct inode *inode,
223 struct list_head *head)
150{ 224{
151 ext4_io_end_t *io; 225 ext4_io_end_t *io;
152 struct list_head unwritten; 226 struct list_head unwritten;
@@ -155,8 +229,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
155 int err, ret = 0; 229 int err, ret = 0;
156 230
157 spin_lock_irqsave(&ei->i_completed_io_lock, flags); 231 spin_lock_irqsave(&ei->i_completed_io_lock, flags);
158 dump_completed_IO(inode); 232 dump_completed_IO(inode, head);
159 list_replace_init(&ei->i_completed_io_list, &unwritten); 233 list_replace_init(head, &unwritten);
160 spin_unlock_irqrestore(&ei->i_completed_io_lock, flags); 234 spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
161 235
162 while (!list_empty(&unwritten)) { 236 while (!list_empty(&unwritten)) {
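
The locking idiom here is unchanged even though the list is now passed in: splice everything to a private head while holding the spinlock, then run the expensive conversions with the lock dropped. The splice step in userspace form, a pthread mutex standing in for the irq-safe spinlock and a singly linked list for list_head:

#include <pthread.h>
#include <stdio.h>

struct node { struct node *next; int id; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *completed;		/* shared completed-io list */

static void flush_completed(void)
{
	struct node *local;

	pthread_mutex_lock(&lock);	/* kernel: spin_lock_irqsave() */
	local = completed;		/* kernel: list_replace_init() */
	completed = NULL;
	pthread_mutex_unlock(&lock);

	for (; local; local = local->next)	/* lock no longer held */
		printf("convert io_end %d\n", local->id);
}

int main(void)
{
	struct node n2 = { NULL, 2 }, n1 = { &n2, 1 };

	completed = &n1;
	flush_completed();
	return 0;
}
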
@@ -167,30 +241,25 @@ static int ext4_do_flush_completed_IO(struct inode *inode)
167 err = ext4_end_io(io); 241 err = ext4_end_io(io);
168 if (unlikely(!ret && err)) 242 if (unlikely(!ret && err))
169 ret = err; 243 ret = err;
170 io->flag &= ~EXT4_IO_END_UNWRITTEN;
171 ext4_free_io_end(io);
172 } 244 }
173 return ret; 245 return ret;
174} 246}
175 247
176/* 248/*
177 * work on completed aio dio IO, to convert unwritten extents to extents 249 * work on completed IO, to convert unwritten extents to extents
178 */ 250 */
179void ext4_end_io_work(struct work_struct *work) 251void ext4_end_io_rsv_work(struct work_struct *work)
180{ 252{
181 struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info, 253 struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
182 i_unwritten_work); 254 i_rsv_conversion_work);
183 ext4_do_flush_completed_IO(&ei->vfs_inode); 255 ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_rsv_conversion_list);
184} 256}
185 257
186int ext4_flush_unwritten_io(struct inode *inode) 258void ext4_end_io_unrsv_work(struct work_struct *work)
187{ 259{
188 int ret; 260 struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
189 WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) && 261 i_unrsv_conversion_work);
190 !(inode->i_state & I_FREEING)); 262 ext4_do_flush_completed_IO(&ei->vfs_inode, &ei->i_unrsv_conversion_list);
191 ret = ext4_do_flush_completed_IO(inode);
192 ext4_unwritten_wait(inode);
193 return ret;
194} 263}
195 264
196ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags) 265ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
@@ -200,83 +269,70 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
200 atomic_inc(&EXT4_I(inode)->i_ioend_count); 269 atomic_inc(&EXT4_I(inode)->i_ioend_count);
201 io->inode = inode; 270 io->inode = inode;
202 INIT_LIST_HEAD(&io->list); 271 INIT_LIST_HEAD(&io->list);
272 atomic_set(&io->count, 1);
203 } 273 }
204 return io; 274 return io;
205} 275}
206 276
207/* 277void ext4_put_io_end_defer(ext4_io_end_t *io_end)
208 * Print a buffer I/O error compatible with fs/buffer.c. This
209 * provides compatibility with dmesg scrapers that look for a specific
210 * buffer I/O error message. We really need a unified error reporting
211 * structure to userspace ala Digital Unix's uerf system, but it's
212 * probably not going to happen in my lifetime, due to LKML politics...
213 */
214static void buffer_io_error(struct buffer_head *bh)
215{ 278{
216 char b[BDEVNAME_SIZE]; 279 if (atomic_dec_and_test(&io_end->count)) {
217 printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n", 280 if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
218 bdevname(bh->b_bdev, b), 281 ext4_release_io_end(io_end);
219 (unsigned long long)bh->b_blocknr); 282 return;
283 }
284 ext4_add_complete_io(io_end);
285 }
286}
287
288int ext4_put_io_end(ext4_io_end_t *io_end)
289{
290 int err = 0;
291
292 if (atomic_dec_and_test(&io_end->count)) {
293 if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
294 err = ext4_convert_unwritten_extents(io_end->handle,
295 io_end->inode, io_end->offset,
296 io_end->size);
297 io_end->handle = NULL;
298 ext4_clear_io_unwritten_flag(io_end);
299 }
300 ext4_release_io_end(io_end);
301 }
302 return err;
303}
304
305ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end)
306{
307 atomic_inc(&io_end->count);
308 return io_end;
220} 309}
221 310
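
The count field initialized above is the pivot of the rework: the submitter holds the initial reference, every bio takes another through ext4_get_io_end(), and whoever drops the last one either frees the io_end or hands it to the conversion worker. A compressed model of that lifecycle in C11 atomics (invented types; the real put path also special-cases a zero-sized io_end):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct io_end {
	atomic_int count;
	bool unwritten;		/* extents still need conversion? */
};

static void release(struct io_end *io)          { puts("release io_end"); }
static void queue_conversion(struct io_end *io) { puts("queue conversion"); }

static struct io_end *get(struct io_end *io)
{
	atomic_fetch_add(&io->count, 1);
	return io;
}

static void put_defer(struct io_end *io)
{
	if (atomic_fetch_sub(&io->count, 1) == 1) {	/* last reference */
		if (io->unwritten)
			queue_conversion(io);	/* finish in a worker */
		else
			release(io);
	}
}

int main(void)
{
	struct io_end io = { .unwritten = true };

	atomic_init(&io.count, 1);	/* submitter's reference */
	put_defer(get(&io));		/* a bio completes */
	put_defer(&io);			/* submitter drops its reference */
	return 0;
}
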
222static void ext4_end_bio(struct bio *bio, int error) 311static void ext4_end_bio(struct bio *bio, int error)
223{ 312{
224 ext4_io_end_t *io_end = bio->bi_private; 313 ext4_io_end_t *io_end = bio->bi_private;
225 struct inode *inode;
226 int i;
227 int blocksize;
228 sector_t bi_sector = bio->bi_sector; 314 sector_t bi_sector = bio->bi_sector;
229 315
230 BUG_ON(!io_end); 316 BUG_ON(!io_end);
231 inode = io_end->inode;
232 blocksize = 1 << inode->i_blkbits;
233 bio->bi_private = NULL;
234 bio->bi_end_io = NULL; 317 bio->bi_end_io = NULL;
235 if (test_bit(BIO_UPTODATE, &bio->bi_flags)) 318 if (test_bit(BIO_UPTODATE, &bio->bi_flags))
236 error = 0; 319 error = 0;
237 for (i = 0; i < bio->bi_vcnt; i++) {
238 struct bio_vec *bvec = &bio->bi_io_vec[i];
239 struct page *page = bvec->bv_page;
240 struct buffer_head *bh, *head;
241 unsigned bio_start = bvec->bv_offset;
242 unsigned bio_end = bio_start + bvec->bv_len;
243 unsigned under_io = 0;
244 unsigned long flags;
245 320
246 if (!page) 321 if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
247 continue;
248
249 if (error) {
250 SetPageError(page);
251 set_bit(AS_EIO, &page->mapping->flags);
252 }
253 bh = head = page_buffers(page);
254 /* 322 /*
255 * We check all buffers in the page under BH_Uptodate_Lock 323 * Link bio into list hanging from io_end. We have to do it
256 * to avoid races with other end io clearing async_write flags 324 * atomically as bio completions can be racing against each
325 * other.
257 */ 326 */
258 local_irq_save(flags); 327 bio->bi_private = xchg(&io_end->bio, bio);
259 bit_spin_lock(BH_Uptodate_Lock, &head->b_state); 328 } else {
260 do { 329 ext4_finish_bio(bio);
261 if (bh_offset(bh) < bio_start || 330 bio_put(bio);
262 bh_offset(bh) + blocksize > bio_end) {
263 if (buffer_async_write(bh))
264 under_io++;
265 continue;
266 }
267 clear_buffer_async_write(bh);
268 if (error)
269 buffer_io_error(bh);
270 } while ((bh = bh->b_this_page) != head);
271 bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
272 local_irq_restore(flags);
273 if (!under_io)
274 end_page_writeback(page);
275 } 331 }
276 bio_put(bio);
277 332
278 if (error) { 333 if (error) {
279 io_end->flag |= EXT4_IO_END_ERROR; 334 struct inode *inode = io_end->inode;
335
280 ext4_warning(inode->i_sb, "I/O error writing to inode %lu " 336 ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
281 "(offset %llu size %ld starting block %llu)", 337 "(offset %llu size %ld starting block %llu)",
282 inode->i_ino, 338 inode->i_ino,
@@ -285,13 +341,7 @@ static void ext4_end_bio(struct bio *bio, int error)
285 (unsigned long long) 341 (unsigned long long)
286 bi_sector >> (inode->i_blkbits - 9)); 342 bi_sector >> (inode->i_blkbits - 9));
287 } 343 }
288 344 ext4_put_io_end_defer(io_end);
289 if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
290 ext4_free_io_end(io_end);
291 return;
292 }
293
294 ext4_add_complete_io(io_end);
295} 345}
296 346
297void ext4_io_submit(struct ext4_io_submit *io) 347void ext4_io_submit(struct ext4_io_submit *io)
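
In the reworked ext4_end_bio() above, completed bios are chained onto the io_end through bi_private with a single atomic exchange, a lock-free stack push that tolerates completions racing on different CPUs. The idiom isolated in C11 (toy types; the kernel spells the exchange xchg()):

#include <stdatomic.h>
#include <stdio.h>

struct bio { struct bio *next; int id; };	/* next plays bi_private */

static _Atomic(struct bio *) io_end_bios;	/* list hung off the io_end */

static void push_bio(struct bio *b)
{
	/* Swing the head to b and link b to the old head. Safe here even
	 * though the two steps aren't one atom, because the list is only
	 * walked after the last io_end reference is dropped. */
	b->next = atomic_exchange(&io_end_bios, b);
}

int main(void)
{
	struct bio a = { NULL, 1 }, b = { NULL, 2 };

	push_bio(&a);
	push_bio(&b);
	for (struct bio *p = atomic_load(&io_end_bios); p; p = p->next)
		printf("bio %d\n", p->id);	/* drained at release time */
	return 0;
}
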
@@ -305,43 +355,38 @@ void ext4_io_submit(struct ext4_io_submit *io)
305 bio_put(io->io_bio); 355 bio_put(io->io_bio);
306 } 356 }
307 io->io_bio = NULL; 357 io->io_bio = NULL;
308 io->io_op = 0; 358}
359
360void ext4_io_submit_init(struct ext4_io_submit *io,
361 struct writeback_control *wbc)
362{
363 io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
364 io->io_bio = NULL;
309 io->io_end = NULL; 365 io->io_end = NULL;
310} 366}
311 367
312static int io_submit_init(struct ext4_io_submit *io, 368static int io_submit_init_bio(struct ext4_io_submit *io,
313 struct inode *inode, 369 struct buffer_head *bh)
314 struct writeback_control *wbc,
315 struct buffer_head *bh)
316{ 370{
317 ext4_io_end_t *io_end;
318 struct page *page = bh->b_page;
319 int nvecs = bio_get_nr_vecs(bh->b_bdev); 371 int nvecs = bio_get_nr_vecs(bh->b_bdev);
320 struct bio *bio; 372 struct bio *bio;
321 373
322 io_end = ext4_init_io_end(inode, GFP_NOFS);
323 if (!io_end)
324 return -ENOMEM;
325 bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES)); 374 bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
375 if (!bio)
376 return -ENOMEM;
326 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9); 377 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
327 bio->bi_bdev = bh->b_bdev; 378 bio->bi_bdev = bh->b_bdev;
328 bio->bi_private = io->io_end = io_end;
329 bio->bi_end_io = ext4_end_bio; 379 bio->bi_end_io = ext4_end_bio;
330 380 bio->bi_private = ext4_get_io_end(io->io_end);
331 io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
332
333 io->io_bio = bio; 381 io->io_bio = bio;
334 io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
335 io->io_next_block = bh->b_blocknr; 382 io->io_next_block = bh->b_blocknr;
336 return 0; 383 return 0;
337} 384}
338 385
339static int io_submit_add_bh(struct ext4_io_submit *io, 386static int io_submit_add_bh(struct ext4_io_submit *io,
340 struct inode *inode, 387 struct inode *inode,
341 struct writeback_control *wbc,
342 struct buffer_head *bh) 388 struct buffer_head *bh)
343{ 389{
344 ext4_io_end_t *io_end;
345 int ret; 390 int ret;
346 391
347 if (io->io_bio && bh->b_blocknr != io->io_next_block) { 392 if (io->io_bio && bh->b_blocknr != io->io_next_block) {
@@ -349,18 +394,14 @@ submit_and_retry:
349 ext4_io_submit(io); 394 ext4_io_submit(io);
350 } 395 }
351 if (io->io_bio == NULL) { 396 if (io->io_bio == NULL) {
352 ret = io_submit_init(io, inode, wbc, bh); 397 ret = io_submit_init_bio(io, bh);
353 if (ret) 398 if (ret)
354 return ret; 399 return ret;
355 } 400 }
356 io_end = io->io_end;
357 if (test_clear_buffer_uninit(bh))
358 ext4_set_io_unwritten_flag(inode, io_end);
359 io->io_end->size += bh->b_size;
360 io->io_next_block++;
361 ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); 401 ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
362 if (ret != bh->b_size) 402 if (ret != bh->b_size)
363 goto submit_and_retry; 403 goto submit_and_retry;
404 io->io_next_block++;
364 return 0; 405 return 0;
365} 406}
366 407
@@ -432,7 +473,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
432 do { 473 do {
433 if (!buffer_async_write(bh)) 474 if (!buffer_async_write(bh))
434 continue; 475 continue;
435 ret = io_submit_add_bh(io, inode, wbc, bh); 476 ret = io_submit_add_bh(io, inode, bh);
436 if (ret) { 477 if (ret) {
437 /* 478 /*
438 * We only get here on ENOMEM. Not much else 479 * We only get here on ENOMEM. Not much else
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index b27c96d01965..c5adbb318a90 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -79,12 +79,20 @@ static int verify_group_input(struct super_block *sb,
79 ext4_fsblk_t end = start + input->blocks_count; 79 ext4_fsblk_t end = start + input->blocks_count;
80 ext4_group_t group = input->group; 80 ext4_group_t group = input->group;
81 ext4_fsblk_t itend = input->inode_table + sbi->s_itb_per_group; 81 ext4_fsblk_t itend = input->inode_table + sbi->s_itb_per_group;
82 unsigned overhead = ext4_group_overhead_blocks(sb, group); 82 unsigned overhead;
83 ext4_fsblk_t metaend = start + overhead; 83 ext4_fsblk_t metaend;
84 struct buffer_head *bh = NULL; 84 struct buffer_head *bh = NULL;
85 ext4_grpblk_t free_blocks_count, offset; 85 ext4_grpblk_t free_blocks_count, offset;
86 int err = -EINVAL; 86 int err = -EINVAL;
87 87
88 if (group != sbi->s_groups_count) {
89 ext4_warning(sb, "Cannot add at group %u (only %u groups)",
90 input->group, sbi->s_groups_count);
91 return -EINVAL;
92 }
93
94 overhead = ext4_group_overhead_blocks(sb, group);
95 metaend = start + overhead;
88 input->free_blocks_count = free_blocks_count = 96 input->free_blocks_count = free_blocks_count =
89 input->blocks_count - 2 - overhead - sbi->s_itb_per_group; 97 input->blocks_count - 2 - overhead - sbi->s_itb_per_group;
90 98
@@ -96,10 +104,7 @@ static int verify_group_input(struct super_block *sb,
96 free_blocks_count, input->reserved_blocks); 104 free_blocks_count, input->reserved_blocks);
97 105
98 ext4_get_group_no_and_offset(sb, start, NULL, &offset); 106 ext4_get_group_no_and_offset(sb, start, NULL, &offset);
99 if (group != sbi->s_groups_count) 107 if (offset != 0)
100 ext4_warning(sb, "Cannot add at group %u (only %u groups)",
101 input->group, sbi->s_groups_count);
102 else if (offset != 0)
103 ext4_warning(sb, "Last group not full"); 108 ext4_warning(sb, "Last group not full");
104 else if (input->reserved_blocks > input->blocks_count / 5) 109 else if (input->reserved_blocks > input->blocks_count / 5)
105 ext4_warning(sb, "Reserved blocks too high (%u)", 110 ext4_warning(sb, "Reserved blocks too high (%u)",
@@ -1551,11 +1556,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
1551 int reserved_gdb = ext4_bg_has_super(sb, input->group) ? 1556 int reserved_gdb = ext4_bg_has_super(sb, input->group) ?
1552 le16_to_cpu(es->s_reserved_gdt_blocks) : 0; 1557 le16_to_cpu(es->s_reserved_gdt_blocks) : 0;
1553 struct inode *inode = NULL; 1558 struct inode *inode = NULL;
1554 int gdb_off, gdb_num; 1559 int gdb_off;
1555 int err; 1560 int err;
1556 __u16 bg_flags = 0; 1561 __u16 bg_flags = 0;
1557 1562
1558 gdb_num = input->group / EXT4_DESC_PER_BLOCK(sb);
1559 gdb_off = input->group % EXT4_DESC_PER_BLOCK(sb); 1563 gdb_off = input->group % EXT4_DESC_PER_BLOCK(sb);
1560 1564
1561 if (gdb_off == 0 && !EXT4_HAS_RO_COMPAT_FEATURE(sb, 1565 if (gdb_off == 0 && !EXT4_HAS_RO_COMPAT_FEATURE(sb,
@@ -1656,12 +1660,10 @@ errout:
1656 err = err2; 1660 err = err2;
1657 1661
1658 if (!err) { 1662 if (!err) {
1659 ext4_fsblk_t first_block;
1660 first_block = ext4_group_first_block_no(sb, 0);
1661 if (test_opt(sb, DEBUG)) 1663 if (test_opt(sb, DEBUG))
1662 printk(KERN_DEBUG "EXT4-fs: extended group to %llu " 1664 printk(KERN_DEBUG "EXT4-fs: extended group to %llu "
1663 "blocks\n", ext4_blocks_count(es)); 1665 "blocks\n", ext4_blocks_count(es));
1664 update_backups(sb, EXT4_SB(sb)->s_sbh->b_blocknr - first_block, 1666 update_backups(sb, EXT4_SB(sb)->s_sbh->b_blocknr,
1665 (char *)es, sizeof(struct ext4_super_block), 0); 1667 (char *)es, sizeof(struct ext4_super_block), 0);
1666 } 1668 }
1667 return err; 1669 return err;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94cc84db7c9a..85b3dd60169b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -69,6 +69,7 @@ static void ext4_mark_recovery_complete(struct super_block *sb,
69static void ext4_clear_journal_err(struct super_block *sb, 69static void ext4_clear_journal_err(struct super_block *sb,
70 struct ext4_super_block *es); 70 struct ext4_super_block *es);
71static int ext4_sync_fs(struct super_block *sb, int wait); 71static int ext4_sync_fs(struct super_block *sb, int wait);
72static int ext4_sync_fs_nojournal(struct super_block *sb, int wait);
72static int ext4_remount(struct super_block *sb, int *flags, char *data); 73static int ext4_remount(struct super_block *sb, int *flags, char *data);
73static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf); 74static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
74static int ext4_unfreeze(struct super_block *sb); 75static int ext4_unfreeze(struct super_block *sb);
@@ -398,6 +399,11 @@ static void ext4_handle_error(struct super_block *sb)
398 } 399 }
399 if (test_opt(sb, ERRORS_RO)) { 400 if (test_opt(sb, ERRORS_RO)) {
400 ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only"); 401 ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
402 /*
403 * Make sure updated value of ->s_mount_flags will be visible
404 * before ->s_flags update
405 */
406 smp_wmb();
401 sb->s_flags |= MS_RDONLY; 407 sb->s_flags |= MS_RDONLY;
402 } 408 }
403 if (test_opt(sb, ERRORS_PANIC)) 409 if (test_opt(sb, ERRORS_PANIC))
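
The smp_wmb() added above enforces store order: any CPU that observes MS_RDONLY in s_flags must also observe the abort bit in s_mount_flags, which a reader pairs with a read barrier. The pairing modeled with C11 fences (field names borrowed from the diff; run single-threaded here just to show the shape):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int mount_flags;	/* models ->s_mount_flags */
static atomic_int s_flags;	/* models ->s_flags */

static void error_path(void)	/* writer side */
{
	atomic_store_explicit(&mount_flags, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* smp_wmb() */
	atomic_store_explicit(&s_flags, 1, memory_order_relaxed);
}

static void reader(void)	/* pairs with an smp_rmb() */
{
	if (atomic_load_explicit(&s_flags, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_acquire);
		/* ordering guarantees the abort bit is visible now */
		printf("aborted = %d\n",
		       atomic_load_explicit(&mount_flags,
					    memory_order_relaxed));
	}
}

int main(void)
{
	error_path();
	reader();
	return 0;
}
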
@@ -422,9 +428,9 @@ void __ext4_error(struct super_block *sb, const char *function,
422 ext4_handle_error(sb); 428 ext4_handle_error(sb);
423} 429}
424 430
425void ext4_error_inode(struct inode *inode, const char *function, 431void __ext4_error_inode(struct inode *inode, const char *function,
426 unsigned int line, ext4_fsblk_t block, 432 unsigned int line, ext4_fsblk_t block,
427 const char *fmt, ...) 433 const char *fmt, ...)
428{ 434{
429 va_list args; 435 va_list args;
430 struct va_format vaf; 436 struct va_format vaf;
@@ -451,9 +457,9 @@ void ext4_error_inode(struct inode *inode, const char *function,
451 ext4_handle_error(inode->i_sb); 457 ext4_handle_error(inode->i_sb);
452} 458}
453 459
454void ext4_error_file(struct file *file, const char *function, 460void __ext4_error_file(struct file *file, const char *function,
455 unsigned int line, ext4_fsblk_t block, 461 unsigned int line, ext4_fsblk_t block,
456 const char *fmt, ...) 462 const char *fmt, ...)
457{ 463{
458 va_list args; 464 va_list args;
459 struct va_format vaf; 465 struct va_format vaf;
@@ -570,8 +576,13 @@ void __ext4_abort(struct super_block *sb, const char *function,
570 576
571 if ((sb->s_flags & MS_RDONLY) == 0) { 577 if ((sb->s_flags & MS_RDONLY) == 0) {
572 ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only"); 578 ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
573 sb->s_flags |= MS_RDONLY;
574 EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; 579 EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
580 /*
581 * Make sure updated value of ->s_mount_flags will be visible
582 * before ->s_flags update
583 */
584 smp_wmb();
585 sb->s_flags |= MS_RDONLY;
575 if (EXT4_SB(sb)->s_journal) 586 if (EXT4_SB(sb)->s_journal)
576 jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); 587 jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO);
577 save_error_info(sb, function, line); 588 save_error_info(sb, function, line);
@@ -580,7 +591,8 @@ void __ext4_abort(struct super_block *sb, const char *function,
580 panic("EXT4-fs panic from previous error\n"); 591 panic("EXT4-fs panic from previous error\n");
581} 592}
582 593
583void ext4_msg(struct super_block *sb, const char *prefix, const char *fmt, ...) 594void __ext4_msg(struct super_block *sb,
595 const char *prefix, const char *fmt, ...)
584{ 596{
585 struct va_format vaf; 597 struct va_format vaf;
586 va_list args; 598 va_list args;
@@ -750,8 +762,10 @@ static void ext4_put_super(struct super_block *sb)
750 ext4_unregister_li_request(sb); 762 ext4_unregister_li_request(sb);
751 dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED); 763 dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
752 764
753 flush_workqueue(sbi->dio_unwritten_wq); 765 flush_workqueue(sbi->unrsv_conversion_wq);
754 destroy_workqueue(sbi->dio_unwritten_wq); 766 flush_workqueue(sbi->rsv_conversion_wq);
767 destroy_workqueue(sbi->unrsv_conversion_wq);
768 destroy_workqueue(sbi->rsv_conversion_wq);
755 769
756 if (sbi->s_journal) { 770 if (sbi->s_journal) {
757 err = jbd2_journal_destroy(sbi->s_journal); 771 err = jbd2_journal_destroy(sbi->s_journal);
@@ -760,7 +774,7 @@ static void ext4_put_super(struct super_block *sb)
760 ext4_abort(sb, "Couldn't clean up the journal"); 774 ext4_abort(sb, "Couldn't clean up the journal");
761 } 775 }
762 776
763 ext4_es_unregister_shrinker(sb); 777 ext4_es_unregister_shrinker(sbi);
764 del_timer(&sbi->s_err_report); 778 del_timer(&sbi->s_err_report);
765 ext4_release_system_zone(sb); 779 ext4_release_system_zone(sb);
766 ext4_mb_release(sb); 780 ext4_mb_release(sb);
@@ -849,6 +863,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
849 rwlock_init(&ei->i_es_lock); 863 rwlock_init(&ei->i_es_lock);
850 INIT_LIST_HEAD(&ei->i_es_lru); 864 INIT_LIST_HEAD(&ei->i_es_lru);
851 ei->i_es_lru_nr = 0; 865 ei->i_es_lru_nr = 0;
866 ei->i_touch_when = 0;
852 ei->i_reserved_data_blocks = 0; 867 ei->i_reserved_data_blocks = 0;
853 ei->i_reserved_meta_blocks = 0; 868 ei->i_reserved_meta_blocks = 0;
854 ei->i_allocated_meta_blocks = 0; 869 ei->i_allocated_meta_blocks = 0;
@@ -859,13 +874,15 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
859 ei->i_reserved_quota = 0; 874 ei->i_reserved_quota = 0;
860#endif 875#endif
861 ei->jinode = NULL; 876 ei->jinode = NULL;
862 INIT_LIST_HEAD(&ei->i_completed_io_list); 877 INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
878 INIT_LIST_HEAD(&ei->i_unrsv_conversion_list);
863 spin_lock_init(&ei->i_completed_io_lock); 879 spin_lock_init(&ei->i_completed_io_lock);
864 ei->i_sync_tid = 0; 880 ei->i_sync_tid = 0;
865 ei->i_datasync_tid = 0; 881 ei->i_datasync_tid = 0;
866 atomic_set(&ei->i_ioend_count, 0); 882 atomic_set(&ei->i_ioend_count, 0);
867 atomic_set(&ei->i_unwritten, 0); 883 atomic_set(&ei->i_unwritten, 0);
868 INIT_WORK(&ei->i_unwritten_work, ext4_end_io_work); 884 INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
885 INIT_WORK(&ei->i_unrsv_conversion_work, ext4_end_io_unrsv_work);
869 886
870 return &ei->vfs_inode; 887 return &ei->vfs_inode;
871} 888}
@@ -1093,6 +1110,7 @@ static const struct super_operations ext4_nojournal_sops = {
1093 .dirty_inode = ext4_dirty_inode, 1110 .dirty_inode = ext4_dirty_inode,
1094 .drop_inode = ext4_drop_inode, 1111 .drop_inode = ext4_drop_inode,
1095 .evict_inode = ext4_evict_inode, 1112 .evict_inode = ext4_evict_inode,
1113 .sync_fs = ext4_sync_fs_nojournal,
1096 .put_super = ext4_put_super, 1114 .put_super = ext4_put_super,
1097 .statfs = ext4_statfs, 1115 .statfs = ext4_statfs,
1098 .remount_fs = ext4_remount, 1116 .remount_fs = ext4_remount,
@@ -1908,7 +1926,6 @@ static int ext4_fill_flex_info(struct super_block *sb)
1908 struct ext4_sb_info *sbi = EXT4_SB(sb); 1926 struct ext4_sb_info *sbi = EXT4_SB(sb);
1909 struct ext4_group_desc *gdp = NULL; 1927 struct ext4_group_desc *gdp = NULL;
1910 ext4_group_t flex_group; 1928 ext4_group_t flex_group;
1911 unsigned int groups_per_flex = 0;
1912 int i, err; 1929 int i, err;
1913 1930
1914 sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex; 1931 sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
@@ -1916,7 +1933,6 @@ static int ext4_fill_flex_info(struct super_block *sb)
1916 sbi->s_log_groups_per_flex = 0; 1933 sbi->s_log_groups_per_flex = 0;
1917 return 1; 1934 return 1;
1918 } 1935 }
1919 groups_per_flex = 1U << sbi->s_log_groups_per_flex;
1920 1936
1921 err = ext4_alloc_flex_bg_array(sb, sbi->s_groups_count); 1937 err = ext4_alloc_flex_bg_array(sb, sbi->s_groups_count);
1922 if (err) 1938 if (err)
@@ -2164,19 +2180,22 @@ static void ext4_orphan_cleanup(struct super_block *sb,
2164 list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan); 2180 list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
2165 dquot_initialize(inode); 2181 dquot_initialize(inode);
2166 if (inode->i_nlink) { 2182 if (inode->i_nlink) {
2167 ext4_msg(sb, KERN_DEBUG, 2183 if (test_opt(sb, DEBUG))
2168 "%s: truncating inode %lu to %lld bytes", 2184 ext4_msg(sb, KERN_DEBUG,
2169 __func__, inode->i_ino, inode->i_size); 2185 "%s: truncating inode %lu to %lld bytes",
2186 __func__, inode->i_ino, inode->i_size);
2170 jbd_debug(2, "truncating inode %lu to %lld bytes\n", 2187 jbd_debug(2, "truncating inode %lu to %lld bytes\n",
2171 inode->i_ino, inode->i_size); 2188 inode->i_ino, inode->i_size);
2172 mutex_lock(&inode->i_mutex); 2189 mutex_lock(&inode->i_mutex);
2190 truncate_inode_pages(inode->i_mapping, inode->i_size);
2173 ext4_truncate(inode); 2191 ext4_truncate(inode);
2174 mutex_unlock(&inode->i_mutex); 2192 mutex_unlock(&inode->i_mutex);
2175 nr_truncates++; 2193 nr_truncates++;
2176 } else { 2194 } else {
2177 ext4_msg(sb, KERN_DEBUG, 2195 if (test_opt(sb, DEBUG))
2178 "%s: deleting unreferenced inode %lu", 2196 ext4_msg(sb, KERN_DEBUG,
2179 __func__, inode->i_ino); 2197 "%s: deleting unreferenced inode %lu",
2198 __func__, inode->i_ino);
2180 jbd_debug(2, "deleting unreferenced inode %lu\n", 2199 jbd_debug(2, "deleting unreferenced inode %lu\n",
2181 inode->i_ino); 2200 inode->i_ino);
2182 nr_orphans++; 2201 nr_orphans++;
@@ -2377,7 +2396,10 @@ struct ext4_attr {
2377 ssize_t (*show)(struct ext4_attr *, struct ext4_sb_info *, char *); 2396 ssize_t (*show)(struct ext4_attr *, struct ext4_sb_info *, char *);
2378 ssize_t (*store)(struct ext4_attr *, struct ext4_sb_info *, 2397 ssize_t (*store)(struct ext4_attr *, struct ext4_sb_info *,
2379 const char *, size_t); 2398 const char *, size_t);
2380 int offset; 2399 union {
2400 int offset;
2401 int deprecated_val;
2402 } u;
2381}; 2403};
2382 2404
2383static int parse_strtoull(const char *buf, 2405static int parse_strtoull(const char *buf,
@@ -2446,7 +2468,7 @@ static ssize_t inode_readahead_blks_store(struct ext4_attr *a,
2446static ssize_t sbi_ui_show(struct ext4_attr *a, 2468static ssize_t sbi_ui_show(struct ext4_attr *a,
2447 struct ext4_sb_info *sbi, char *buf) 2469 struct ext4_sb_info *sbi, char *buf)
2448{ 2470{
2449 unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset); 2471 unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
2450 2472
2451 return snprintf(buf, PAGE_SIZE, "%u\n", *ui); 2473 return snprintf(buf, PAGE_SIZE, "%u\n", *ui);
2452} 2474}
@@ -2455,7 +2477,7 @@ static ssize_t sbi_ui_store(struct ext4_attr *a,
2455 struct ext4_sb_info *sbi, 2477 struct ext4_sb_info *sbi,
2456 const char *buf, size_t count) 2478 const char *buf, size_t count)
2457{ 2479{
2458 unsigned int *ui = (unsigned int *) (((char *) sbi) + a->offset); 2480 unsigned int *ui = (unsigned int *) (((char *) sbi) + a->u.offset);
2459 unsigned long t; 2481 unsigned long t;
2460 int ret; 2482 int ret;
2461 2483
@@ -2504,12 +2526,20 @@ static ssize_t trigger_test_error(struct ext4_attr *a,
2504 return count; 2526 return count;
2505} 2527}
2506 2528
2529static ssize_t sbi_deprecated_show(struct ext4_attr *a,
2530 struct ext4_sb_info *sbi, char *buf)
2531{
2532 return snprintf(buf, PAGE_SIZE, "%d\n", a->u.deprecated_val);
2533}
2534
2507#define EXT4_ATTR_OFFSET(_name,_mode,_show,_store,_elname) \ 2535#define EXT4_ATTR_OFFSET(_name,_mode,_show,_store,_elname) \
2508static struct ext4_attr ext4_attr_##_name = { \ 2536static struct ext4_attr ext4_attr_##_name = { \
2509 .attr = {.name = __stringify(_name), .mode = _mode }, \ 2537 .attr = {.name = __stringify(_name), .mode = _mode }, \
2510 .show = _show, \ 2538 .show = _show, \
2511 .store = _store, \ 2539 .store = _store, \
2512 .offset = offsetof(struct ext4_sb_info, _elname), \ 2540 .u = { \
2541 .offset = offsetof(struct ext4_sb_info, _elname),\
2542 }, \
2513} 2543}
2514#define EXT4_ATTR(name, mode, show, store) \ 2544#define EXT4_ATTR(name, mode, show, store) \
2515static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store) 2545static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
@@ -2520,6 +2550,14 @@ static struct ext4_attr ext4_attr_##name = __ATTR(name, mode, show, store)
2520#define EXT4_RW_ATTR_SBI_UI(name, elname) \ 2550#define EXT4_RW_ATTR_SBI_UI(name, elname) \
2521 EXT4_ATTR_OFFSET(name, 0644, sbi_ui_show, sbi_ui_store, elname) 2551 EXT4_ATTR_OFFSET(name, 0644, sbi_ui_show, sbi_ui_store, elname)
2522#define ATTR_LIST(name) &ext4_attr_##name.attr 2552#define ATTR_LIST(name) &ext4_attr_##name.attr
2553#define EXT4_DEPRECATED_ATTR(_name, _val) \
2554static struct ext4_attr ext4_attr_##_name = { \
2555 .attr = {.name = __stringify(_name), .mode = 0444 }, \
2556 .show = sbi_deprecated_show, \
2557 .u = { \
2558 .deprecated_val = _val, \
2559 }, \
2560}
2523 2561
2524EXT4_RO_ATTR(delayed_allocation_blocks); 2562EXT4_RO_ATTR(delayed_allocation_blocks);
2525EXT4_RO_ATTR(session_write_kbytes); 2563EXT4_RO_ATTR(session_write_kbytes);
@@ -2534,7 +2572,7 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
2534EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs); 2572EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
2535EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request); 2573EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
2536EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc); 2574EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
2537EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump); 2575EXT4_DEPRECATED_ATTR(max_writeback_mb_bump, 128);
2538EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb); 2576EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
2539EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error); 2577EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error);
2540 2578
@@ -3763,7 +3801,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
3763 sbi->s_err_report.data = (unsigned long) sb; 3801 sbi->s_err_report.data = (unsigned long) sb;
3764 3802
3765 /* Register extent status tree shrinker */ 3803 /* Register extent status tree shrinker */
3766 ext4_es_register_shrinker(sb); 3804 ext4_es_register_shrinker(sbi);
3767 3805
3768 err = percpu_counter_init(&sbi->s_freeclusters_counter, 3806 err = percpu_counter_init(&sbi->s_freeclusters_counter,
3769 ext4_count_free_clusters(sb)); 3807 ext4_count_free_clusters(sb));
@@ -3787,7 +3825,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
3787 } 3825 }
3788 3826
3789 sbi->s_stripe = ext4_get_stripe_size(sbi); 3827 sbi->s_stripe = ext4_get_stripe_size(sbi);
3790 sbi->s_max_writeback_mb_bump = 128;
3791 sbi->s_extent_max_zeroout_kb = 32; 3828 sbi->s_extent_max_zeroout_kb = 32;
3792 3829
3793 /* 3830 /*
@@ -3915,12 +3952,20 @@ no_journal:
3915 * The maximum number of concurrent works can be high and 3952 * The maximum number of concurrent works can be high and
3916 * concurrency isn't really necessary. Limit it to 1. 3953 * concurrency isn't really necessary. Limit it to 1.
3917 */ 3954 */
3918 EXT4_SB(sb)->dio_unwritten_wq = 3955 EXT4_SB(sb)->rsv_conversion_wq =
3919 alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1); 3956 alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
3920 if (!EXT4_SB(sb)->dio_unwritten_wq) { 3957 if (!EXT4_SB(sb)->rsv_conversion_wq) {
3921 printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n"); 3958 printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
3922 ret = -ENOMEM; 3959 ret = -ENOMEM;
3923 goto failed_mount_wq; 3960 goto failed_mount4;
3961 }
3962
3963 EXT4_SB(sb)->unrsv_conversion_wq =
3964 alloc_workqueue("ext4-unrsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
3965 if (!EXT4_SB(sb)->unrsv_conversion_wq) {
3966 printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
3967 ret = -ENOMEM;
3968 goto failed_mount4;
3924 } 3969 }
3925 3970
3926 /* 3971 /*
@@ -4074,14 +4119,17 @@ failed_mount4a:
4074 sb->s_root = NULL; 4119 sb->s_root = NULL;
4075failed_mount4: 4120failed_mount4:
4076 ext4_msg(sb, KERN_ERR, "mount failed"); 4121 ext4_msg(sb, KERN_ERR, "mount failed");
4077 destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq); 4122 if (EXT4_SB(sb)->rsv_conversion_wq)
4123 destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
4124 if (EXT4_SB(sb)->unrsv_conversion_wq)
4125 destroy_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
4078failed_mount_wq: 4126failed_mount_wq:
4079 if (sbi->s_journal) { 4127 if (sbi->s_journal) {
4080 jbd2_journal_destroy(sbi->s_journal); 4128 jbd2_journal_destroy(sbi->s_journal);
4081 sbi->s_journal = NULL; 4129 sbi->s_journal = NULL;
4082 } 4130 }
4083failed_mount3: 4131failed_mount3:
4084 ext4_es_unregister_shrinker(sb); 4132 ext4_es_unregister_shrinker(sbi);
4085 del_timer(&sbi->s_err_report); 4133 del_timer(&sbi->s_err_report);
4086 if (sbi->s_flex_groups) 4134 if (sbi->s_flex_groups)
4087 ext4_kvfree(sbi->s_flex_groups); 4135 ext4_kvfree(sbi->s_flex_groups);
@@ -4517,19 +4565,52 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
4517{ 4565{
4518 int ret = 0; 4566 int ret = 0;
4519 tid_t target; 4567 tid_t target;
4568 bool needs_barrier = false;
4520 struct ext4_sb_info *sbi = EXT4_SB(sb); 4569 struct ext4_sb_info *sbi = EXT4_SB(sb);
4521 4570
4522 trace_ext4_sync_fs(sb, wait); 4571 trace_ext4_sync_fs(sb, wait);
4523 flush_workqueue(sbi->dio_unwritten_wq); 4572 flush_workqueue(sbi->rsv_conversion_wq);
4573 flush_workqueue(sbi->unrsv_conversion_wq);
4524 /* 4574 /*
4525 * Writeback quota in non-journalled quota case - journalled quota has 4575 * Writeback quota in non-journalled quota case - journalled quota has
4526 * no dirty dquots 4576 * no dirty dquots
4527 */ 4577 */
4528 dquot_writeback_dquots(sb, -1); 4578 dquot_writeback_dquots(sb, -1);
4579 /*
4580 * Data writeback is possible w/o journal transaction, so barrier must
 4581 * be sent at the end of the function. But we can skip it if
4582 * transaction_commit will do it for us.
4583 */
4584 target = jbd2_get_latest_transaction(sbi->s_journal);
4585 if (wait && sbi->s_journal->j_flags & JBD2_BARRIER &&
4586 !jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
4587 needs_barrier = true;
4588
4529 if (jbd2_journal_start_commit(sbi->s_journal, &target)) { 4589 if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
4530 if (wait) 4590 if (wait)
4531 jbd2_log_wait_commit(sbi->s_journal, target); 4591 ret = jbd2_log_wait_commit(sbi->s_journal, target);
4592 }
4593 if (needs_barrier) {
4594 int err;
4595 err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
4596 if (!ret)
4597 ret = err;
4532 } 4598 }
4599
4600 return ret;
4601}
4602
4603static int ext4_sync_fs_nojournal(struct super_block *sb, int wait)
4604{
4605 int ret = 0;
4606
4607 trace_ext4_sync_fs(sb, wait);
4608 flush_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
4609 flush_workqueue(EXT4_SB(sb)->unrsv_conversion_wq);
4610 dquot_writeback_dquots(sb, -1);
4611 if (wait && test_opt(sb, BARRIER))
4612 ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
4613
4533 return ret; 4614 return ret;
4534} 4615}
4535 4616
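The reworked ext4_sync_fs() issues its own cache flush only when no journal commit it waits on will send one anyway, and the new nojournal variant falls back to the BARRIER mount option. A userspace restatement of that decision, plus the error-propagation idiom from the hunk (all names hypothetical):

#include <stdbool.h>

/* Flush explicitly only if the caller waits, barriers are enabled, and the
 * commit being waited on will not already flush the device for us. */
static bool needs_explicit_barrier(bool wait, bool barriers_enabled,
				   bool commit_sends_barrier)
{
	return wait && barriers_enabled && !commit_sends_barrier;
}

/* Keep the first failure, as 'if (!ret) ret = err;' does above. */
static int propagate_first_error(int ret, int err)
{
	return ret ? ret : err;
}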
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 91ff93b0b0f4..ce11d9a92aed 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -698,7 +698,8 @@ static ssize_t f2fs_direct_IO(int rw, struct kiocb *iocb,
698 get_data_block_ro); 698 get_data_block_ro);
699} 699}
700 700
701static void f2fs_invalidate_data_page(struct page *page, unsigned long offset) 701static void f2fs_invalidate_data_page(struct page *page, unsigned int offset,
702 unsigned int length)
702{ 703{
703 struct inode *inode = page->mapping->host; 704 struct inode *inode = page->mapping->host;
704 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 705 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 3df43b4efd89..74f3c7b03eb2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1205,7 +1205,8 @@ static int f2fs_set_node_page_dirty(struct page *page)
1205 return 0; 1205 return 0;
1206} 1206}
1207 1207
1208static void f2fs_invalidate_node_page(struct page *page, unsigned long offset) 1208static void f2fs_invalidate_node_page(struct page *page, unsigned int offset,
1209 unsigned int length)
1209{ 1210{
1210 struct inode *inode = page->mapping->host; 1211 struct inode *inode = page->mapping->host;
1211 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 1212 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0bad69ed6336..ee48ad37d9c0 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -110,7 +110,7 @@ static int gfs2_writepage_common(struct page *page,
110 /* Is the page fully outside i_size? (truncate in progress) */ 110 /* Is the page fully outside i_size? (truncate in progress) */
111 offset = i_size & (PAGE_CACHE_SIZE-1); 111 offset = i_size & (PAGE_CACHE_SIZE-1);
112 if (page->index > end_index || (page->index == end_index && !offset)) { 112 if (page->index > end_index || (page->index == end_index && !offset)) {
113 page->mapping->a_ops->invalidatepage(page, 0); 113 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
114 goto out; 114 goto out;
115 } 115 }
116 return 1; 116 return 1;
@@ -299,7 +299,8 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
299 299
300 /* Is the page fully outside i_size? (truncate in progress) */ 300 /* Is the page fully outside i_size? (truncate in progress) */
301 if (page->index > end_index || (page->index == end_index && !offset)) { 301 if (page->index > end_index || (page->index == end_index && !offset)) {
302 page->mapping->a_ops->invalidatepage(page, 0); 302 page->mapping->a_ops->invalidatepage(page, 0,
303 PAGE_CACHE_SIZE);
303 unlock_page(page); 304 unlock_page(page);
304 continue; 305 continue;
305 } 306 }
@@ -943,27 +944,33 @@ static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh)
943 unlock_buffer(bh); 944 unlock_buffer(bh);
944} 945}
945 946
946static void gfs2_invalidatepage(struct page *page, unsigned long offset) 947static void gfs2_invalidatepage(struct page *page, unsigned int offset,
948 unsigned int length)
947{ 949{
948 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); 950 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
951 unsigned int stop = offset + length;
952 int partial_page = (offset || length < PAGE_CACHE_SIZE);
949 struct buffer_head *bh, *head; 953 struct buffer_head *bh, *head;
950 unsigned long pos = 0; 954 unsigned long pos = 0;
951 955
952 BUG_ON(!PageLocked(page)); 956 BUG_ON(!PageLocked(page));
953 if (offset == 0) 957 if (!partial_page)
954 ClearPageChecked(page); 958 ClearPageChecked(page);
955 if (!page_has_buffers(page)) 959 if (!page_has_buffers(page))
956 goto out; 960 goto out;
957 961
958 bh = head = page_buffers(page); 962 bh = head = page_buffers(page);
959 do { 963 do {
964 if (pos + bh->b_size > stop)
965 return;
966
960 if (offset <= pos) 967 if (offset <= pos)
961 gfs2_discard(sdp, bh); 968 gfs2_discard(sdp, bh);
962 pos += bh->b_size; 969 pos += bh->b_size;
963 bh = bh->b_this_page; 970 bh = bh->b_this_page;
964 } while (bh != head); 971 } while (bh != head);
965out: 972out:
966 if (offset == 0) 973 if (!partial_page)
967 try_to_release_page(page, 0); 974 try_to_release_page(page, 0);
968} 975}
969 976
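The gfs2 hunk is one instance of the tree-wide ->invalidatepage() change from a bare offset to an (offset, length) range. A self-contained sketch of the resulting walk, assuming hypothetical buffer bookkeeping:

#define DEMO_PAGE_SIZE 4096u

struct demo_buf { unsigned int size; int discarded; };

/* Discard only buffers fully inside [offset, offset + length); a buffer
 * crossing the end of the range stops the walk early. */
static void invalidate_range(struct demo_buf *bufs, int nbufs,
			     unsigned int offset, unsigned int length)
{
	unsigned int stop = offset + length, pos = 0;
	int partial_page = (offset || length < DEMO_PAGE_SIZE);
	int i;

	for (i = 0; i < nbufs; i++) {
		if (pos + bufs[i].size > stop)
			return;			/* past the range: stop */
		if (offset <= pos)
			bufs[i].discarded = 1;	/* gfs2_discard() stand-in */
		pos += bufs[i].size;
	}
	/* a full-page call (!partial_page) would also release the page and
	 * clear per-page state, as the hunk does via try_to_release_page() */
	(void)partial_page;
}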
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index e3e255c0a509..be0c39b66fe0 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -2019,16 +2019,20 @@ zap_buffer_unlocked:
2019 * void journal_invalidatepage() - invalidate a journal page 2019 * void journal_invalidatepage() - invalidate a journal page
2020 * @journal: journal to use for flush 2020 * @journal: journal to use for flush
2021 * @page: page to flush 2021 * @page: page to flush
2022 * @offset: length of page to invalidate. 2022 * @offset: offset of the range to invalidate
2023 * @length: length of the range to invalidate
2023 * 2024 *
2024 * Reap page buffers containing data after offset in page. 2025 * Reap page buffers containing data in specified range in page.
2025 */ 2026 */
2026void journal_invalidatepage(journal_t *journal, 2027void journal_invalidatepage(journal_t *journal,
2027 struct page *page, 2028 struct page *page,
2028 unsigned long offset) 2029 unsigned int offset,
2030 unsigned int length)
2029{ 2031{
2030 struct buffer_head *head, *bh, *next; 2032 struct buffer_head *head, *bh, *next;
2033 unsigned int stop = offset + length;
2031 unsigned int curr_off = 0; 2034 unsigned int curr_off = 0;
2035 int partial_page = (offset || length < PAGE_CACHE_SIZE);
2032 int may_free = 1; 2036 int may_free = 1;
2033 2037
2034 if (!PageLocked(page)) 2038 if (!PageLocked(page))
@@ -2036,6 +2040,8 @@ void journal_invalidatepage(journal_t *journal,
2036 if (!page_has_buffers(page)) 2040 if (!page_has_buffers(page))
2037 return; 2041 return;
2038 2042
2043 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
2044
2039 /* We will potentially be playing with lists other than just the 2045 /* We will potentially be playing with lists other than just the
2040 * data lists (especially for journaled data mode), so be 2046 * data lists (especially for journaled data mode), so be
2041 * cautious in our locking. */ 2047 * cautious in our locking. */
@@ -2045,11 +2051,14 @@ void journal_invalidatepage(journal_t *journal,
2045 unsigned int next_off = curr_off + bh->b_size; 2051 unsigned int next_off = curr_off + bh->b_size;
2046 next = bh->b_this_page; 2052 next = bh->b_this_page;
2047 2053
2054 if (next_off > stop)
2055 return;
2056
2048 if (offset <= curr_off) { 2057 if (offset <= curr_off) {
2049 /* This block is wholly outside the truncation point */ 2058 /* This block is wholly outside the truncation point */
2050 lock_buffer(bh); 2059 lock_buffer(bh);
2051 may_free &= journal_unmap_buffer(journal, bh, 2060 may_free &= journal_unmap_buffer(journal, bh,
2052 offset > 0); 2061 partial_page);
2053 unlock_buffer(bh); 2062 unlock_buffer(bh);
2054 } 2063 }
2055 curr_off = next_off; 2064 curr_off = next_off;
@@ -2057,7 +2066,7 @@ void journal_invalidatepage(journal_t *journal,
2057 2066
2058 } while (bh != head); 2067 } while (bh != head);
2059 2068
2060 if (!offset) { 2069 if (!partial_page) {
2061 if (may_free && try_to_free_buffers(page)) 2070 if (may_free && try_to_free_buffers(page))
2062 J_ASSERT(!page_has_buffers(page)); 2071 J_ASSERT(!page_has_buffers(page));
2063 } 2072 }
diff --git a/fs/jbd2/Kconfig b/fs/jbd2/Kconfig
index 69a48c2944da..5a9f5534d57b 100644
--- a/fs/jbd2/Kconfig
+++ b/fs/jbd2/Kconfig
@@ -20,7 +20,7 @@ config JBD2
20 20
21config JBD2_DEBUG 21config JBD2_DEBUG
22 bool "JBD2 (ext4) debugging support" 22 bool "JBD2 (ext4) debugging support"
23 depends on JBD2 && DEBUG_FS 23 depends on JBD2
24 help 24 help
25 If you are using the ext4 journaled file system (or 25 If you are using the ext4 journaled file system (or
26 potentially any other filesystem/device using JBD2), this option 26 potentially any other filesystem/device using JBD2), this option
@@ -29,7 +29,7 @@ config JBD2_DEBUG
29 By default, the debugging output will be turned off. 29 By default, the debugging output will be turned off.
30 30
31 If you select Y here, then you will be able to turn on debugging 31 If you select Y here, then you will be able to turn on debugging
32 with "echo N > /sys/kernel/debug/jbd2/jbd2-debug", where N is a 32 with "echo N > /sys/module/jbd2/parameters/jbd2_debug", where N is a
33 number between 1 and 5. The higher the number, the more debugging 33 number between 1 and 5. The higher the number, the more debugging
34 output is generated. To turn debugging off again, do 34 output is generated. To turn debugging off again, do
35 "echo 0 > /sys/kernel/debug/jbd2/jbd2-debug". 35 "echo 0 > /sys/module/jbd2/parameters/jbd2_debug".
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index c78841ee81cf..7f34f4716165 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -120,8 +120,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
120 int nblocks, space_left; 120 int nblocks, space_left;
121 /* assert_spin_locked(&journal->j_state_lock); */ 121 /* assert_spin_locked(&journal->j_state_lock); */
122 122
123 nblocks = jbd_space_needed(journal); 123 nblocks = jbd2_space_needed(journal);
124 while (__jbd2_log_space_left(journal) < nblocks) { 124 while (jbd2_log_space_left(journal) < nblocks) {
125 if (journal->j_flags & JBD2_ABORT) 125 if (journal->j_flags & JBD2_ABORT)
126 return; 126 return;
127 write_unlock(&journal->j_state_lock); 127 write_unlock(&journal->j_state_lock);
@@ -140,8 +140,8 @@ void __jbd2_log_wait_for_space(journal_t *journal)
140 */ 140 */
141 write_lock(&journal->j_state_lock); 141 write_lock(&journal->j_state_lock);
142 spin_lock(&journal->j_list_lock); 142 spin_lock(&journal->j_list_lock);
143 nblocks = jbd_space_needed(journal); 143 nblocks = jbd2_space_needed(journal);
144 space_left = __jbd2_log_space_left(journal); 144 space_left = jbd2_log_space_left(journal);
145 if (space_left < nblocks) { 145 if (space_left < nblocks) {
146 int chkpt = journal->j_checkpoint_transactions != NULL; 146 int chkpt = journal->j_checkpoint_transactions != NULL;
147 tid_t tid = 0; 147 tid_t tid = 0;
@@ -156,7 +156,15 @@ void __jbd2_log_wait_for_space(journal_t *journal)
156 /* We were able to recover space; yay! */ 156 /* We were able to recover space; yay! */
157 ; 157 ;
158 } else if (tid) { 158 } else if (tid) {
159 /*
160 * jbd2_journal_commit_transaction() may want
161 * to take the checkpoint_mutex if JBD2_FLUSHED
162 * is set. So we need to temporarily drop it.
163 */
164 mutex_unlock(&journal->j_checkpoint_mutex);
159 jbd2_log_wait_commit(journal, tid); 165 jbd2_log_wait_commit(journal, tid);
166 write_lock(&journal->j_state_lock);
167 continue;
160 } else { 168 } else {
161 printk(KERN_ERR "%s: needed %d blocks and " 169 printk(KERN_ERR "%s: needed %d blocks and "
162 "only had %d space available\n", 170 "only had %d space available\n",
@@ -625,10 +633,6 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
625 633
626 __jbd2_journal_drop_transaction(journal, transaction); 634 __jbd2_journal_drop_transaction(journal, transaction);
627 jbd2_journal_free_transaction(transaction); 635 jbd2_journal_free_transaction(transaction);
628
629 /* Just in case anybody was waiting for more transactions to be
630 checkpointed... */
631 wake_up(&journal->j_wait_logspace);
632 ret = 1; 636 ret = 1;
633out: 637out:
634 return ret; 638 return ret;
@@ -690,9 +694,7 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
690 J_ASSERT(transaction->t_state == T_FINISHED); 694 J_ASSERT(transaction->t_state == T_FINISHED);
691 J_ASSERT(transaction->t_buffers == NULL); 695 J_ASSERT(transaction->t_buffers == NULL);
692 J_ASSERT(transaction->t_forget == NULL); 696 J_ASSERT(transaction->t_forget == NULL);
693 J_ASSERT(transaction->t_iobuf_list == NULL);
694 J_ASSERT(transaction->t_shadow_list == NULL); 697 J_ASSERT(transaction->t_shadow_list == NULL);
695 J_ASSERT(transaction->t_log_list == NULL);
696 J_ASSERT(transaction->t_checkpoint_list == NULL); 698 J_ASSERT(transaction->t_checkpoint_list == NULL);
697 J_ASSERT(transaction->t_checkpoint_io_list == NULL); 699 J_ASSERT(transaction->t_checkpoint_io_list == NULL);
698 J_ASSERT(atomic_read(&transaction->t_updates) == 0); 700 J_ASSERT(atomic_read(&transaction->t_updates) == 0);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 0f53946f13c1..559bec1a37b4 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -30,15 +30,22 @@
30#include <trace/events/jbd2.h> 30#include <trace/events/jbd2.h>
31 31
32/* 32/*
33 * Default IO end handler for temporary BJ_IO buffer_heads. 33 * IO end handler for temporary buffer_heads handling writes to the journal.
34 */ 34 */
35static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate) 35static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
36{ 36{
37 struct buffer_head *orig_bh = bh->b_private;
38
37 BUFFER_TRACE(bh, ""); 39 BUFFER_TRACE(bh, "");
38 if (uptodate) 40 if (uptodate)
39 set_buffer_uptodate(bh); 41 set_buffer_uptodate(bh);
40 else 42 else
41 clear_buffer_uptodate(bh); 43 clear_buffer_uptodate(bh);
44 if (orig_bh) {
45 clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
46 smp_mb__after_clear_bit();
47 wake_up_bit(&orig_bh->b_state, BH_Shadow);
48 }
42 unlock_buffer(bh); 49 unlock_buffer(bh);
43} 50}
44 51
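The new end_io handler publishes completion through the original buffer: clear BH_Shadow with release semantics, a full barrier, then wake the bit waitqueue. A C11 analogue of that handoff (names hypothetical; a conservative store stands in for clear_bit_unlock() plus smp_mb__after_clear_bit()):

#include <stdatomic.h>

/* Clear the flag first, then wake; waiters must re-check the flag after
 * waking, which is what the kernel's bit-waitqueue pairing guarantees. */
static void shadow_io_done(atomic_bool *shadow, void (*wake_all)(void))
{
	atomic_store_explicit(shadow, false, memory_order_seq_cst);
	wake_all();
}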
@@ -85,8 +92,7 @@ nope:
85 __brelse(bh); 92 __brelse(bh);
86} 93}
87 94
88static void jbd2_commit_block_csum_set(journal_t *j, 95static void jbd2_commit_block_csum_set(journal_t *j, struct buffer_head *bh)
89 struct journal_head *descriptor)
90{ 96{
91 struct commit_header *h; 97 struct commit_header *h;
92 __u32 csum; 98 __u32 csum;
@@ -94,12 +100,11 @@ static void jbd2_commit_block_csum_set(journal_t *j,
94 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 100 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
95 return; 101 return;
96 102
97 h = (struct commit_header *)(jh2bh(descriptor)->b_data); 103 h = (struct commit_header *)(bh->b_data);
98 h->h_chksum_type = 0; 104 h->h_chksum_type = 0;
99 h->h_chksum_size = 0; 105 h->h_chksum_size = 0;
100 h->h_chksum[0] = 0; 106 h->h_chksum[0] = 0;
101 csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data, 107 csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
102 j->j_blocksize);
103 h->h_chksum[0] = cpu_to_be32(csum); 108 h->h_chksum[0] = cpu_to_be32(csum);
104} 109}
105 110
@@ -116,7 +121,6 @@ static int journal_submit_commit_record(journal_t *journal,
116 struct buffer_head **cbh, 121 struct buffer_head **cbh,
117 __u32 crc32_sum) 122 __u32 crc32_sum)
118{ 123{
119 struct journal_head *descriptor;
120 struct commit_header *tmp; 124 struct commit_header *tmp;
121 struct buffer_head *bh; 125 struct buffer_head *bh;
122 int ret; 126 int ret;
@@ -127,12 +131,10 @@ static int journal_submit_commit_record(journal_t *journal,
127 if (is_journal_aborted(journal)) 131 if (is_journal_aborted(journal))
128 return 0; 132 return 0;
129 133
130 descriptor = jbd2_journal_get_descriptor_buffer(journal); 134 bh = jbd2_journal_get_descriptor_buffer(journal);
131 if (!descriptor) 135 if (!bh)
132 return 1; 136 return 1;
133 137
134 bh = jh2bh(descriptor);
135
136 tmp = (struct commit_header *)bh->b_data; 138 tmp = (struct commit_header *)bh->b_data;
137 tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); 139 tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
138 tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); 140 tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
@@ -146,9 +148,9 @@ static int journal_submit_commit_record(journal_t *journal,
146 tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; 148 tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
147 tmp->h_chksum[0] = cpu_to_be32(crc32_sum); 149 tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
148 } 150 }
149 jbd2_commit_block_csum_set(journal, descriptor); 151 jbd2_commit_block_csum_set(journal, bh);
150 152
151 JBUFFER_TRACE(descriptor, "submit commit block"); 153 BUFFER_TRACE(bh, "submit commit block");
152 lock_buffer(bh); 154 lock_buffer(bh);
153 clear_buffer_dirty(bh); 155 clear_buffer_dirty(bh);
154 set_buffer_uptodate(bh); 156 set_buffer_uptodate(bh);
@@ -180,7 +182,6 @@ static int journal_wait_on_commit_record(journal_t *journal,
180 if (unlikely(!buffer_uptodate(bh))) 182 if (unlikely(!buffer_uptodate(bh)))
181 ret = -EIO; 183 ret = -EIO;
182 put_bh(bh); /* One for getblk() */ 184 put_bh(bh); /* One for getblk() */
183 jbd2_journal_put_journal_head(bh2jh(bh));
184 185
185 return ret; 186 return ret;
186} 187}
@@ -321,7 +322,7 @@ static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
321} 322}
322 323
323static void jbd2_descr_block_csum_set(journal_t *j, 324static void jbd2_descr_block_csum_set(journal_t *j,
324 struct journal_head *descriptor) 325 struct buffer_head *bh)
325{ 326{
326 struct jbd2_journal_block_tail *tail; 327 struct jbd2_journal_block_tail *tail;
327 __u32 csum; 328 __u32 csum;
@@ -329,12 +330,10 @@ static void jbd2_descr_block_csum_set(journal_t *j,
329 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 330 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
330 return; 331 return;
331 332
332 tail = (struct jbd2_journal_block_tail *) 333 tail = (struct jbd2_journal_block_tail *)(bh->b_data + j->j_blocksize -
333 (jh2bh(descriptor)->b_data + j->j_blocksize -
334 sizeof(struct jbd2_journal_block_tail)); 334 sizeof(struct jbd2_journal_block_tail));
335 tail->t_checksum = 0; 335 tail->t_checksum = 0;
336 csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data, 336 csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
337 j->j_blocksize);
338 tail->t_checksum = cpu_to_be32(csum); 337 tail->t_checksum = cpu_to_be32(csum);
339} 338}
340 339
@@ -343,20 +342,21 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
343{ 342{
344 struct page *page = bh->b_page; 343 struct page *page = bh->b_page;
345 __u8 *addr; 344 __u8 *addr;
346 __u32 csum; 345 __u32 csum32;
347 346
348 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 347 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
349 return; 348 return;
350 349
351 sequence = cpu_to_be32(sequence); 350 sequence = cpu_to_be32(sequence);
352 addr = kmap_atomic(page); 351 addr = kmap_atomic(page);
353 csum = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&sequence, 352 csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&sequence,
354 sizeof(sequence)); 353 sizeof(sequence));
355 csum = jbd2_chksum(j, csum, addr + offset_in_page(bh->b_data), 354 csum32 = jbd2_chksum(j, csum32, addr + offset_in_page(bh->b_data),
356 bh->b_size); 355 bh->b_size);
357 kunmap_atomic(addr); 356 kunmap_atomic(addr);
358 357
359 tag->t_checksum = cpu_to_be32(csum); 358 /* We only have space to store the lower 16 bits of the crc32c. */
359 tag->t_checksum = cpu_to_be16(csum32);
360} 360}
361/* 361/*
362 * jbd2_journal_commit_transaction 362 * jbd2_journal_commit_transaction
@@ -368,7 +368,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
368{ 368{
369 struct transaction_stats_s stats; 369 struct transaction_stats_s stats;
370 transaction_t *commit_transaction; 370 transaction_t *commit_transaction;
371 struct journal_head *jh, *new_jh, *descriptor; 371 struct journal_head *jh;
372 struct buffer_head *descriptor;
372 struct buffer_head **wbuf = journal->j_wbuf; 373 struct buffer_head **wbuf = journal->j_wbuf;
373 int bufs; 374 int bufs;
374 int flags; 375 int flags;
@@ -392,6 +393,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
392 tid_t first_tid; 393 tid_t first_tid;
393 int update_tail; 394 int update_tail;
394 int csum_size = 0; 395 int csum_size = 0;
396 LIST_HEAD(io_bufs);
397 LIST_HEAD(log_bufs);
395 398
396 if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 399 if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
397 csum_size = sizeof(struct jbd2_journal_block_tail); 400 csum_size = sizeof(struct jbd2_journal_block_tail);
@@ -424,13 +427,13 @@ void jbd2_journal_commit_transaction(journal_t *journal)
424 J_ASSERT(journal->j_committing_transaction == NULL); 427 J_ASSERT(journal->j_committing_transaction == NULL);
425 428
426 commit_transaction = journal->j_running_transaction; 429 commit_transaction = journal->j_running_transaction;
427 J_ASSERT(commit_transaction->t_state == T_RUNNING);
428 430
429 trace_jbd2_start_commit(journal, commit_transaction); 431 trace_jbd2_start_commit(journal, commit_transaction);
430 jbd_debug(1, "JBD2: starting commit of transaction %d\n", 432 jbd_debug(1, "JBD2: starting commit of transaction %d\n",
431 commit_transaction->t_tid); 433 commit_transaction->t_tid);
432 434
433 write_lock(&journal->j_state_lock); 435 write_lock(&journal->j_state_lock);
436 J_ASSERT(commit_transaction->t_state == T_RUNNING);
434 commit_transaction->t_state = T_LOCKED; 437 commit_transaction->t_state = T_LOCKED;
435 438
436 trace_jbd2_commit_locking(journal, commit_transaction); 439 trace_jbd2_commit_locking(journal, commit_transaction);
@@ -520,6 +523,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
520 */ 523 */
521 jbd2_journal_switch_revoke_table(journal); 524 jbd2_journal_switch_revoke_table(journal);
522 525
526 /*
527 * Reserved credits cannot be claimed anymore, free them
528 */
529 atomic_sub(atomic_read(&journal->j_reserved_credits),
530 &commit_transaction->t_outstanding_credits);
531
523 trace_jbd2_commit_flushing(journal, commit_transaction); 532 trace_jbd2_commit_flushing(journal, commit_transaction);
524 stats.run.rs_flushing = jiffies; 533 stats.run.rs_flushing = jiffies;
525 stats.run.rs_locked = jbd2_time_diff(stats.run.rs_locked, 534 stats.run.rs_locked = jbd2_time_diff(stats.run.rs_locked,
@@ -533,7 +542,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
533 wake_up(&journal->j_wait_transaction_locked); 542 wake_up(&journal->j_wait_transaction_locked);
534 write_unlock(&journal->j_state_lock); 543 write_unlock(&journal->j_state_lock);
535 544
536 jbd_debug(3, "JBD2: commit phase 2\n"); 545 jbd_debug(3, "JBD2: commit phase 2a\n");
537 546
538 /* 547 /*
539 * Now start flushing things to disk, in the order they appear 548 * Now start flushing things to disk, in the order they appear
@@ -545,10 +554,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
545 554
546 blk_start_plug(&plug); 555 blk_start_plug(&plug);
547 jbd2_journal_write_revoke_records(journal, commit_transaction, 556 jbd2_journal_write_revoke_records(journal, commit_transaction,
548 WRITE_SYNC); 557 &log_bufs, WRITE_SYNC);
549 blk_finish_plug(&plug); 558 blk_finish_plug(&plug);
550 559
551 jbd_debug(3, "JBD2: commit phase 2\n"); 560 jbd_debug(3, "JBD2: commit phase 2b\n");
552 561
553 /* 562 /*
554 * Way to go: we have now written out all of the data for a 563 * Way to go: we have now written out all of the data for a
@@ -571,8 +580,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
571 atomic_read(&commit_transaction->t_outstanding_credits)); 580 atomic_read(&commit_transaction->t_outstanding_credits));
572 581
573 err = 0; 582 err = 0;
574 descriptor = NULL;
575 bufs = 0; 583 bufs = 0;
584 descriptor = NULL;
576 blk_start_plug(&plug); 585 blk_start_plug(&plug);
577 while (commit_transaction->t_buffers) { 586 while (commit_transaction->t_buffers) {
578 587
@@ -604,8 +613,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
604 record the metadata buffer. */ 613 record the metadata buffer. */
605 614
606 if (!descriptor) { 615 if (!descriptor) {
607 struct buffer_head *bh;
608
609 J_ASSERT (bufs == 0); 616 J_ASSERT (bufs == 0);
610 617
611 jbd_debug(4, "JBD2: get descriptor\n"); 618 jbd_debug(4, "JBD2: get descriptor\n");
@@ -616,26 +623,26 @@ void jbd2_journal_commit_transaction(journal_t *journal)
616 continue; 623 continue;
617 } 624 }
618 625
619 bh = jh2bh(descriptor);
620 jbd_debug(4, "JBD2: got buffer %llu (%p)\n", 626 jbd_debug(4, "JBD2: got buffer %llu (%p)\n",
621 (unsigned long long)bh->b_blocknr, bh->b_data); 627 (unsigned long long)descriptor->b_blocknr,
622 header = (journal_header_t *)&bh->b_data[0]; 628 descriptor->b_data);
629 header = (journal_header_t *)descriptor->b_data;
623 header->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); 630 header->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
624 header->h_blocktype = cpu_to_be32(JBD2_DESCRIPTOR_BLOCK); 631 header->h_blocktype = cpu_to_be32(JBD2_DESCRIPTOR_BLOCK);
625 header->h_sequence = cpu_to_be32(commit_transaction->t_tid); 632 header->h_sequence = cpu_to_be32(commit_transaction->t_tid);
626 633
627 tagp = &bh->b_data[sizeof(journal_header_t)]; 634 tagp = &descriptor->b_data[sizeof(journal_header_t)];
628 space_left = bh->b_size - sizeof(journal_header_t); 635 space_left = descriptor->b_size -
636 sizeof(journal_header_t);
629 first_tag = 1; 637 first_tag = 1;
630 set_buffer_jwrite(bh); 638 set_buffer_jwrite(descriptor);
631 set_buffer_dirty(bh); 639 set_buffer_dirty(descriptor);
632 wbuf[bufs++] = bh; 640 wbuf[bufs++] = descriptor;
633 641
634 /* Record it so that we can wait for IO 642 /* Record it so that we can wait for IO
635 completion later */ 643 completion later */
636 BUFFER_TRACE(bh, "ph3: file as descriptor"); 644 BUFFER_TRACE(descriptor, "ph3: file as descriptor");
637 jbd2_journal_file_buffer(descriptor, commit_transaction, 645 jbd2_file_log_bh(&log_bufs, descriptor);
638 BJ_LogCtl);
639 } 646 }
640 647
641 /* Where is the buffer to be written? */ 648 /* Where is the buffer to be written? */
@@ -658,29 +665,22 @@ void jbd2_journal_commit_transaction(journal_t *journal)
658 665
659 /* Bump b_count to prevent truncate from stumbling over 666 /* Bump b_count to prevent truncate from stumbling over
660 the shadowed buffer! @@@ This can go if we ever get 667 the shadowed buffer! @@@ This can go if we ever get
661 rid of the BJ_IO/BJ_Shadow pairing of buffers. */ 668 rid of the shadow pairing of buffers. */
662 atomic_inc(&jh2bh(jh)->b_count); 669 atomic_inc(&jh2bh(jh)->b_count);
663 670
664 /* Make a temporary IO buffer with which to write it out
665 (this will requeue both the metadata buffer and the
666 temporary IO buffer). new_bh goes on BJ_IO*/
667
668 set_bit(BH_JWrite, &jh2bh(jh)->b_state);
669 /* 671 /*
670 * akpm: jbd2_journal_write_metadata_buffer() sets 672 * Make a temporary IO buffer with which to write it out
671 * new_bh->b_transaction to commit_transaction. 673 * (this will requeue the metadata buffer to BJ_Shadow).
672 * We need to clean this up before we release new_bh
673 * (which is of type BJ_IO)
674 */ 674 */
675 set_bit(BH_JWrite, &jh2bh(jh)->b_state);
675 JBUFFER_TRACE(jh, "ph3: write metadata"); 676 JBUFFER_TRACE(jh, "ph3: write metadata");
676 flags = jbd2_journal_write_metadata_buffer(commit_transaction, 677 flags = jbd2_journal_write_metadata_buffer(commit_transaction,
677 jh, &new_jh, blocknr); 678 jh, &wbuf[bufs], blocknr);
678 if (flags < 0) { 679 if (flags < 0) {
679 jbd2_journal_abort(journal, flags); 680 jbd2_journal_abort(journal, flags);
680 continue; 681 continue;
681 } 682 }
682 set_bit(BH_JWrite, &jh2bh(new_jh)->b_state); 683 jbd2_file_log_bh(&io_bufs, wbuf[bufs]);
683 wbuf[bufs++] = jh2bh(new_jh);
684 684
685 /* Record the new block's tag in the current descriptor 685 /* Record the new block's tag in the current descriptor
686 buffer */ 686 buffer */
@@ -694,10 +694,11 @@ void jbd2_journal_commit_transaction(journal_t *journal)
694 tag = (journal_block_tag_t *) tagp; 694 tag = (journal_block_tag_t *) tagp;
695 write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr); 695 write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
696 tag->t_flags = cpu_to_be16(tag_flag); 696 tag->t_flags = cpu_to_be16(tag_flag);
697 jbd2_block_tag_csum_set(journal, tag, jh2bh(new_jh), 697 jbd2_block_tag_csum_set(journal, tag, wbuf[bufs],
698 commit_transaction->t_tid); 698 commit_transaction->t_tid);
699 tagp += tag_bytes; 699 tagp += tag_bytes;
700 space_left -= tag_bytes; 700 space_left -= tag_bytes;
701 bufs++;
701 702
702 if (first_tag) { 703 if (first_tag) {
703 memcpy (tagp, journal->j_uuid, 16); 704 memcpy (tagp, journal->j_uuid, 16);
@@ -809,7 +810,7 @@ start_journal_io:
809 the log. Before we can commit it, wait for the IO so far to 810 the log. Before we can commit it, wait for the IO so far to
810 complete. Control buffers being written are on the 811 complete. Control buffers being written are on the
811 transaction's t_log_list queue, and metadata buffers are on 812 transaction's t_log_list queue, and metadata buffers are on
812 the t_iobuf_list queue. 813 the io_bufs list.
813 814
814 Wait for the buffers in reverse order. That way we are 815 Wait for the buffers in reverse order. That way we are
815 less likely to be woken up until all IOs have completed, and 816 less likely to be woken up until all IOs have completed, and
@@ -818,47 +819,33 @@ start_journal_io:
818 819
819 jbd_debug(3, "JBD2: commit phase 3\n"); 820 jbd_debug(3, "JBD2: commit phase 3\n");
820 821
821 /* 822 while (!list_empty(&io_bufs)) {
822 * akpm: these are BJ_IO, and j_list_lock is not needed. 823 struct buffer_head *bh = list_entry(io_bufs.prev,
823 * See __journal_try_to_free_buffer. 824 struct buffer_head,
824 */ 825 b_assoc_buffers);
825wait_for_iobuf:
826 while (commit_transaction->t_iobuf_list != NULL) {
827 struct buffer_head *bh;
828 826
829 jh = commit_transaction->t_iobuf_list->b_tprev; 827 wait_on_buffer(bh);
830 bh = jh2bh(jh); 828 cond_resched();
831 if (buffer_locked(bh)) {
832 wait_on_buffer(bh);
833 goto wait_for_iobuf;
834 }
835 if (cond_resched())
836 goto wait_for_iobuf;
837 829
838 if (unlikely(!buffer_uptodate(bh))) 830 if (unlikely(!buffer_uptodate(bh)))
839 err = -EIO; 831 err = -EIO;
840 832 jbd2_unfile_log_bh(bh);
841 clear_buffer_jwrite(bh);
842
843 JBUFFER_TRACE(jh, "ph4: unfile after journal write");
844 jbd2_journal_unfile_buffer(journal, jh);
845 833
846 /* 834 /*
847 * ->t_iobuf_list should contain only dummy buffer_heads 835 * The list contains temporary buffer heads created by
848 * which were created by jbd2_journal_write_metadata_buffer(). 836 * jbd2_journal_write_metadata_buffer().
849 */ 837 */
850 BUFFER_TRACE(bh, "dumping temporary bh"); 838 BUFFER_TRACE(bh, "dumping temporary bh");
851 jbd2_journal_put_journal_head(jh);
852 __brelse(bh); 839 __brelse(bh);
853 J_ASSERT_BH(bh, atomic_read(&bh->b_count) == 0); 840 J_ASSERT_BH(bh, atomic_read(&bh->b_count) == 0);
854 free_buffer_head(bh); 841 free_buffer_head(bh);
855 842
856 /* We also have to unlock and free the corresponding 843 /* We also have to refile the corresponding shadowed buffer */
857 shadowed buffer */
858 jh = commit_transaction->t_shadow_list->b_tprev; 844 jh = commit_transaction->t_shadow_list->b_tprev;
859 bh = jh2bh(jh); 845 bh = jh2bh(jh);
860 clear_bit(BH_JWrite, &bh->b_state); 846 clear_buffer_jwrite(bh);
861 J_ASSERT_BH(bh, buffer_jbddirty(bh)); 847 J_ASSERT_BH(bh, buffer_jbddirty(bh));
848 J_ASSERT_BH(bh, !buffer_shadow(bh));
862 849
863 /* The metadata is now released for reuse, but we need 850 /* The metadata is now released for reuse, but we need
864 to remember it against this transaction so that when 851 to remember it against this transaction so that when
@@ -866,14 +853,6 @@ wait_for_iobuf:
866 required. */ 853 required. */
867 JBUFFER_TRACE(jh, "file as BJ_Forget"); 854 JBUFFER_TRACE(jh, "file as BJ_Forget");
868 jbd2_journal_file_buffer(jh, commit_transaction, BJ_Forget); 855 jbd2_journal_file_buffer(jh, commit_transaction, BJ_Forget);
869 /*
870 * Wake up any transactions which were waiting for this IO to
871 * complete. The barrier must be here so that changes by
872 * jbd2_journal_file_buffer() take effect before wake_up_bit()
873 * does the waitqueue check.
874 */
875 smp_mb();
876 wake_up_bit(&bh->b_state, BH_Unshadow);
877 JBUFFER_TRACE(jh, "brelse shadowed buffer"); 856 JBUFFER_TRACE(jh, "brelse shadowed buffer");
878 __brelse(bh); 857 __brelse(bh);
879 } 858 }
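The open-coded wait_for_iobuf retry loop becomes a simple tail-first drain of the local io_bufs list; waiting in reverse submission order means the task usually sleeps once, on the last IO to finish. A minimal sketch with a hypothetical doubly linked list:

struct demo_node { struct demo_node *prev, *next; };

/* head is a list head, as with LIST_HEAD(io_bufs) in the hunk above. */
static void drain_tail_first(struct demo_node *head,
			     void (*wait_io)(struct demo_node *),
			     void (*release)(struct demo_node *))
{
	while (head->prev != head) {		/* !list_empty() */
		struct demo_node *last = head->prev;

		wait_io(last);			/* wait_on_buffer() */
		last->prev->next = head;	/* unfile, i.e. list_del() */
		head->prev = last->prev;
		release(last);			/* __brelse() + free */
	}
}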
@@ -883,26 +862,19 @@ wait_for_iobuf:
883 jbd_debug(3, "JBD2: commit phase 4\n"); 862 jbd_debug(3, "JBD2: commit phase 4\n");
884 863
885 /* Here we wait for the revoke record and descriptor record buffers */ 864 /* Here we wait for the revoke record and descriptor record buffers */
886 wait_for_ctlbuf: 865 while (!list_empty(&log_bufs)) {
887 while (commit_transaction->t_log_list != NULL) {
888 struct buffer_head *bh; 866 struct buffer_head *bh;
889 867
890 jh = commit_transaction->t_log_list->b_tprev; 868 bh = list_entry(log_bufs.prev, struct buffer_head, b_assoc_buffers);
891 bh = jh2bh(jh); 869 wait_on_buffer(bh);
892 if (buffer_locked(bh)) { 870 cond_resched();
893 wait_on_buffer(bh);
894 goto wait_for_ctlbuf;
895 }
896 if (cond_resched())
897 goto wait_for_ctlbuf;
898 871
899 if (unlikely(!buffer_uptodate(bh))) 872 if (unlikely(!buffer_uptodate(bh)))
900 err = -EIO; 873 err = -EIO;
901 874
902 BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile"); 875 BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile");
903 clear_buffer_jwrite(bh); 876 clear_buffer_jwrite(bh);
904 jbd2_journal_unfile_buffer(journal, jh); 877 jbd2_unfile_log_bh(bh);
905 jbd2_journal_put_journal_head(jh);
906 __brelse(bh); /* One for getblk */ 878 __brelse(bh); /* One for getblk */
907 /* AKPM: bforget here */ 879 /* AKPM: bforget here */
908 } 880 }
@@ -952,9 +924,7 @@ wait_for_iobuf:
952 J_ASSERT(list_empty(&commit_transaction->t_inode_list)); 924 J_ASSERT(list_empty(&commit_transaction->t_inode_list));
953 J_ASSERT(commit_transaction->t_buffers == NULL); 925 J_ASSERT(commit_transaction->t_buffers == NULL);
954 J_ASSERT(commit_transaction->t_checkpoint_list == NULL); 926 J_ASSERT(commit_transaction->t_checkpoint_list == NULL);
955 J_ASSERT(commit_transaction->t_iobuf_list == NULL);
956 J_ASSERT(commit_transaction->t_shadow_list == NULL); 927 J_ASSERT(commit_transaction->t_shadow_list == NULL);
957 J_ASSERT(commit_transaction->t_log_list == NULL);
958 928
959restart_loop: 929restart_loop:
960 /* 930 /*
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 95457576e434..02c7ad9d7a41 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -103,6 +103,24 @@ EXPORT_SYMBOL(jbd2_inode_cache);
103static void __journal_abort_soft (journal_t *journal, int errno); 103static void __journal_abort_soft (journal_t *journal, int errno);
104static int jbd2_journal_create_slab(size_t slab_size); 104static int jbd2_journal_create_slab(size_t slab_size);
105 105
106#ifdef CONFIG_JBD2_DEBUG
107void __jbd2_debug(int level, const char *file, const char *func,
108 unsigned int line, const char *fmt, ...)
109{
110 struct va_format vaf;
111 va_list args;
112
113 if (level > jbd2_journal_enable_debug)
114 return;
115 va_start(args, fmt);
116 vaf.fmt = fmt;
117 vaf.va = &args;
118 printk(KERN_DEBUG "%s: (%s, %u): %pV\n", file, func, line, &vaf);
119 va_end(args);
120}
121EXPORT_SYMBOL(__jbd2_debug);
122#endif
123
106/* Checksumming functions */ 124/* Checksumming functions */
107int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb) 125int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb)
108{ 126{
@@ -310,14 +328,12 @@ static void journal_kill_thread(journal_t *journal)
310 * 328 *
311 * If the source buffer has already been modified by a new transaction 329 * If the source buffer has already been modified by a new transaction
312 * since we took the last commit snapshot, we use the frozen copy of 330 * since we took the last commit snapshot, we use the frozen copy of
313 * that data for IO. If we end up using the existing buffer_head's data 331 * that data for IO. If we end up using the existing buffer_head's data
314 * for the write, then we *have* to lock the buffer to prevent anyone 332 * for the write, then we have to make sure nobody modifies it while the
315 * else from using and possibly modifying it while the IO is in 333 * IO is in progress. do_get_write_access() handles this.
316 * progress.
317 * 334 *
318 * The function returns a pointer to the buffer_heads to be used for IO. 335 * The function returns a pointer to the buffer_head to be used for IO.
319 * 336 *
320 * We assume that the journal has already been locked in this function.
321 * 337 *
322 * Return value: 338 * Return value:
323 * <0: Error 339 * <0: Error
@@ -330,15 +346,14 @@ static void journal_kill_thread(journal_t *journal)
330 346
331int jbd2_journal_write_metadata_buffer(transaction_t *transaction, 347int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
332 struct journal_head *jh_in, 348 struct journal_head *jh_in,
333 struct journal_head **jh_out, 349 struct buffer_head **bh_out,
334 unsigned long long blocknr) 350 sector_t blocknr)
335{ 351{
336 int need_copy_out = 0; 352 int need_copy_out = 0;
337 int done_copy_out = 0; 353 int done_copy_out = 0;
338 int do_escape = 0; 354 int do_escape = 0;
339 char *mapped_data; 355 char *mapped_data;
340 struct buffer_head *new_bh; 356 struct buffer_head *new_bh;
341 struct journal_head *new_jh;
342 struct page *new_page; 357 struct page *new_page;
343 unsigned int new_offset; 358 unsigned int new_offset;
344 struct buffer_head *bh_in = jh2bh(jh_in); 359 struct buffer_head *bh_in = jh2bh(jh_in);
@@ -368,14 +383,13 @@ retry_alloc:
368 383
369 /* keep subsequent assertions sane */ 384 /* keep subsequent assertions sane */
370 atomic_set(&new_bh->b_count, 1); 385 atomic_set(&new_bh->b_count, 1);
371 new_jh = jbd2_journal_add_journal_head(new_bh); /* This sleeps */
372 386
387 jbd_lock_bh_state(bh_in);
388repeat:
373 /* 389 /*
374 * If a new transaction has already done a buffer copy-out, then 390 * If a new transaction has already done a buffer copy-out, then
375 * we use that version of the data for the commit. 391 * we use that version of the data for the commit.
376 */ 392 */
377 jbd_lock_bh_state(bh_in);
378repeat:
379 if (jh_in->b_frozen_data) { 393 if (jh_in->b_frozen_data) {
380 done_copy_out = 1; 394 done_copy_out = 1;
381 new_page = virt_to_page(jh_in->b_frozen_data); 395 new_page = virt_to_page(jh_in->b_frozen_data);
@@ -415,7 +429,7 @@ repeat:
415 jbd_unlock_bh_state(bh_in); 429 jbd_unlock_bh_state(bh_in);
416 tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS); 430 tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
417 if (!tmp) { 431 if (!tmp) {
418 jbd2_journal_put_journal_head(new_jh); 432 brelse(new_bh);
419 return -ENOMEM; 433 return -ENOMEM;
420 } 434 }
421 jbd_lock_bh_state(bh_in); 435 jbd_lock_bh_state(bh_in);
@@ -426,7 +440,7 @@ repeat:
426 440
427 jh_in->b_frozen_data = tmp; 441 jh_in->b_frozen_data = tmp;
428 mapped_data = kmap_atomic(new_page); 442 mapped_data = kmap_atomic(new_page);
429 memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size); 443 memcpy(tmp, mapped_data + new_offset, bh_in->b_size);
430 kunmap_atomic(mapped_data); 444 kunmap_atomic(mapped_data);
431 445
432 new_page = virt_to_page(tmp); 446 new_page = virt_to_page(tmp);
@@ -452,14 +466,14 @@ repeat:
452 } 466 }
453 467
454 set_bh_page(new_bh, new_page, new_offset); 468 set_bh_page(new_bh, new_page, new_offset);
455 new_jh->b_transaction = NULL; 469 new_bh->b_size = bh_in->b_size;
456 new_bh->b_size = jh2bh(jh_in)->b_size; 470 new_bh->b_bdev = journal->j_dev;
457 new_bh->b_bdev = transaction->t_journal->j_dev;
458 new_bh->b_blocknr = blocknr; 471 new_bh->b_blocknr = blocknr;
472 new_bh->b_private = bh_in;
459 set_buffer_mapped(new_bh); 473 set_buffer_mapped(new_bh);
460 set_buffer_dirty(new_bh); 474 set_buffer_dirty(new_bh);
461 475
462 *jh_out = new_jh; 476 *bh_out = new_bh;
463 477
464 /* 478 /*
465 * The to-be-written buffer needs to get moved to the io queue, 479 * The to-be-written buffer needs to get moved to the io queue,
@@ -470,11 +484,9 @@ repeat:
470 spin_lock(&journal->j_list_lock); 484 spin_lock(&journal->j_list_lock);
471 __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow); 485 __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
472 spin_unlock(&journal->j_list_lock); 486 spin_unlock(&journal->j_list_lock);
487 set_buffer_shadow(bh_in);
473 jbd_unlock_bh_state(bh_in); 488 jbd_unlock_bh_state(bh_in);
474 489
475 JBUFFER_TRACE(new_jh, "file as BJ_IO");
476 jbd2_journal_file_buffer(new_jh, transaction, BJ_IO);
477
478 return do_escape | (done_copy_out << 1); 490 return do_escape | (done_copy_out << 1);
479} 491}
480 492
@@ -484,35 +496,6 @@ repeat:
484 */ 496 */
485 497
486/* 498/*
487 * __jbd2_log_space_left: Return the number of free blocks left in the journal.
488 *
489 * Called with the journal already locked.
490 *
491 * Called under j_state_lock
492 */
493
494int __jbd2_log_space_left(journal_t *journal)
495{
496 int left = journal->j_free;
497
498 /* assert_spin_locked(&journal->j_state_lock); */
499
500 /*
501 * Be pessimistic here about the number of those free blocks which
502 * might be required for log descriptor control blocks.
503 */
504
505#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */
506
507 left -= MIN_LOG_RESERVED_BLOCKS;
508
509 if (left <= 0)
510 return 0;
511 left -= (left >> 3);
512 return left;
513}
514
515/*
516 * Called with j_state_lock locked for writing. 499 * Called with j_state_lock locked for writing.
517 * Returns true if a transaction commit was started. 500 * Returns true if a transaction commit was started.
518 */ 501 */
@@ -564,20 +547,17 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid)
564} 547}
565 548
566/* 549/*
567 * Force and wait upon a commit if the calling process is not within 550 * Force and wait any uncommitted transactions. We can only force the running
568 * transaction. This is used for forcing out undo-protected data which contains 551 * transaction if we don't have an active handle, otherwise, we will deadlock.
569 * bitmaps, when the fs is running out of space. 552 * Returns: <0 in case of error,
570 * 553 * 0 if nothing to commit,
571 * We can only force the running transaction if we don't have an active handle; 554 * 1 if transaction was successfully committed.
572 * otherwise, we will deadlock.
573 *
574 * Returns true if a transaction was started.
575 */ 555 */
576int jbd2_journal_force_commit_nested(journal_t *journal) 556static int __jbd2_journal_force_commit(journal_t *journal)
577{ 557{
578 transaction_t *transaction = NULL; 558 transaction_t *transaction = NULL;
579 tid_t tid; 559 tid_t tid;
580 int need_to_start = 0; 560 int need_to_start = 0, ret = 0;
581 561
582 read_lock(&journal->j_state_lock); 562 read_lock(&journal->j_state_lock);
583 if (journal->j_running_transaction && !current->journal_info) { 563 if (journal->j_running_transaction && !current->journal_info) {
@@ -588,16 +568,53 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
588 transaction = journal->j_committing_transaction; 568 transaction = journal->j_committing_transaction;
589 569
590 if (!transaction) { 570 if (!transaction) {
571 /* Nothing to commit */
591 read_unlock(&journal->j_state_lock); 572 read_unlock(&journal->j_state_lock);
592 return 0; /* Nothing to retry */ 573 return 0;
593 } 574 }
594
595 tid = transaction->t_tid; 575 tid = transaction->t_tid;
596 read_unlock(&journal->j_state_lock); 576 read_unlock(&journal->j_state_lock);
597 if (need_to_start) 577 if (need_to_start)
598 jbd2_log_start_commit(journal, tid); 578 jbd2_log_start_commit(journal, tid);
599 jbd2_log_wait_commit(journal, tid); 579 ret = jbd2_log_wait_commit(journal, tid);
600 return 1; 580 if (!ret)
581 ret = 1;
582
583 return ret;
584}
585
586/**
 587 * Force and wait upon a commit if the calling process is not within a
588 * transaction. This is used for forcing out undo-protected data which contains
589 * bitmaps, when the fs is running out of space.
590 *
591 * @journal: journal to force
592 * Returns true if progress was made.
593 */
594int jbd2_journal_force_commit_nested(journal_t *journal)
595{
596 int ret;
597
598 ret = __jbd2_journal_force_commit(journal);
599 return ret > 0;
600}
601
602/**
 603 * int jbd2_journal_force_commit() - force any uncommitted transactions
604 * @journal: journal to force
605 *
 606 * Caller wants an unconditional commit. We can only force the running transaction
607 * if we don't have an active handle, otherwise, we will deadlock.
608 */
609int jbd2_journal_force_commit(journal_t *journal)
610{
611 int ret;
612
613 J_ASSERT(!current->journal_info);
614 ret = __jbd2_journal_force_commit(journal);
615 if (ret > 0)
616 ret = 0;
617 return ret;
601} 618}
602 619
603/* 620/*
@@ -798,7 +815,7 @@ int jbd2_journal_bmap(journal_t *journal, unsigned long blocknr,
798 * But we don't bother doing that, so there will be coherency problems with 815 * But we don't bother doing that, so there will be coherency problems with
799 * mmaps of blockdevs which hold live JBD-controlled filesystems. 816 * mmaps of blockdevs which hold live JBD-controlled filesystems.
800 */ 817 */
801struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal) 818struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
802{ 819{
803 struct buffer_head *bh; 820 struct buffer_head *bh;
804 unsigned long long blocknr; 821 unsigned long long blocknr;
@@ -817,7 +834,7 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
817 set_buffer_uptodate(bh); 834 set_buffer_uptodate(bh);
818 unlock_buffer(bh); 835 unlock_buffer(bh);
819 BUFFER_TRACE(bh, "return this buffer"); 836 BUFFER_TRACE(bh, "return this buffer");
820 return jbd2_journal_add_journal_head(bh); 837 return bh;
821} 838}
822 839
823/* 840/*
@@ -1062,11 +1079,10 @@ static journal_t * journal_init_common (void)
1062 return NULL; 1079 return NULL;
1063 1080
1064 init_waitqueue_head(&journal->j_wait_transaction_locked); 1081 init_waitqueue_head(&journal->j_wait_transaction_locked);
1065 init_waitqueue_head(&journal->j_wait_logspace);
1066 init_waitqueue_head(&journal->j_wait_done_commit); 1082 init_waitqueue_head(&journal->j_wait_done_commit);
1067 init_waitqueue_head(&journal->j_wait_checkpoint);
1068 init_waitqueue_head(&journal->j_wait_commit); 1083 init_waitqueue_head(&journal->j_wait_commit);
1069 init_waitqueue_head(&journal->j_wait_updates); 1084 init_waitqueue_head(&journal->j_wait_updates);
1085 init_waitqueue_head(&journal->j_wait_reserved);
1070 mutex_init(&journal->j_barrier); 1086 mutex_init(&journal->j_barrier);
1071 mutex_init(&journal->j_checkpoint_mutex); 1087 mutex_init(&journal->j_checkpoint_mutex);
1072 spin_lock_init(&journal->j_revoke_lock); 1088 spin_lock_init(&journal->j_revoke_lock);
@@ -1076,6 +1092,7 @@ static journal_t * journal_init_common (void)
1076 journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE); 1092 journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
1077 journal->j_min_batch_time = 0; 1093 journal->j_min_batch_time = 0;
1078 journal->j_max_batch_time = 15000; /* 15ms */ 1094 journal->j_max_batch_time = 15000; /* 15ms */
1095 atomic_set(&journal->j_reserved_credits, 0);
1079 1096
1080 /* The journal is marked for error until we succeed with recovery! */ 1097 /* The journal is marked for error until we succeed with recovery! */
1081 journal->j_flags = JBD2_ABORT; 1098 journal->j_flags = JBD2_ABORT;
@@ -1318,6 +1335,7 @@ static int journal_reset(journal_t *journal)
1318static void jbd2_write_superblock(journal_t *journal, int write_op) 1335static void jbd2_write_superblock(journal_t *journal, int write_op)
1319{ 1336{
1320 struct buffer_head *bh = journal->j_sb_buffer; 1337 struct buffer_head *bh = journal->j_sb_buffer;
1338 journal_superblock_t *sb = journal->j_superblock;
1321 int ret; 1339 int ret;
1322 1340
1323 trace_jbd2_write_superblock(journal, write_op); 1341 trace_jbd2_write_superblock(journal, write_op);
@@ -1339,6 +1357,7 @@ static void jbd2_write_superblock(journal_t *journal, int write_op)
1339 clear_buffer_write_io_error(bh); 1357 clear_buffer_write_io_error(bh);
1340 set_buffer_uptodate(bh); 1358 set_buffer_uptodate(bh);
1341 } 1359 }
1360 jbd2_superblock_csum_set(journal, sb);
1342 get_bh(bh); 1361 get_bh(bh);
1343 bh->b_end_io = end_buffer_write_sync; 1362 bh->b_end_io = end_buffer_write_sync;
1344 ret = submit_bh(write_op, bh); 1363 ret = submit_bh(write_op, bh);
@@ -1435,7 +1454,6 @@ void jbd2_journal_update_sb_errno(journal_t *journal)
1435 jbd_debug(1, "JBD2: updating superblock error (errno %d)\n", 1454 jbd_debug(1, "JBD2: updating superblock error (errno %d)\n",
1436 journal->j_errno); 1455 journal->j_errno);
1437 sb->s_errno = cpu_to_be32(journal->j_errno); 1456 sb->s_errno = cpu_to_be32(journal->j_errno);
1438 jbd2_superblock_csum_set(journal, sb);
1439 read_unlock(&journal->j_state_lock); 1457 read_unlock(&journal->j_state_lock);
1440 1458
1441 jbd2_write_superblock(journal, WRITE_SYNC); 1459 jbd2_write_superblock(journal, WRITE_SYNC);
@@ -2325,13 +2343,13 @@ static struct journal_head *journal_alloc_journal_head(void)
2325#ifdef CONFIG_JBD2_DEBUG 2343#ifdef CONFIG_JBD2_DEBUG
2326 atomic_inc(&nr_journal_heads); 2344 atomic_inc(&nr_journal_heads);
2327#endif 2345#endif
2328 ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS); 2346 ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
2329 if (!ret) { 2347 if (!ret) {
2330 jbd_debug(1, "out of memory for journal_head\n"); 2348 jbd_debug(1, "out of memory for journal_head\n");
2331 pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__); 2349 pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
2332 while (!ret) { 2350 while (!ret) {
2333 yield(); 2351 yield();
2334 ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS); 2352 ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
2335 } 2353 }
2336 } 2354 }
2337 return ret; 2355 return ret;
@@ -2393,10 +2411,8 @@ struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh)
2393 struct journal_head *new_jh = NULL; 2411 struct journal_head *new_jh = NULL;
2394 2412
2395repeat: 2413repeat:
2396 if (!buffer_jbd(bh)) { 2414 if (!buffer_jbd(bh))
2397 new_jh = journal_alloc_journal_head(); 2415 new_jh = journal_alloc_journal_head();
2398 memset(new_jh, 0, sizeof(*new_jh));
2399 }
2400 2416
2401 jbd_lock_bh_journal_head(bh); 2417 jbd_lock_bh_journal_head(bh);
2402 if (buffer_jbd(bh)) { 2418 if (buffer_jbd(bh)) {
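Switching to kmem_cache_zalloc() folds the separate memset() (deleted from jbd2_journal_add_journal_head() above) into the allocation. The userspace equivalent of the simplification:

#include <stdlib.h>
#include <string.h>

/* Before: allocate, then zero at the call site. */
static void *alloc_then_zero(size_t size)
{
	void *p = malloc(size);		/* kmem_cache_alloc() */

	if (p)
		memset(p, 0, size);	/* the memset the patch removes */
	return p;
}

/* After: one call that returns zeroed memory, like kmem_cache_zalloc(). */
static void *alloc_zeroed(size_t size)
{
	return calloc(1, size);
}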
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 626846bac32f..d4851464b57e 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -399,18 +399,17 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
399static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag, 399static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
400 void *buf, __u32 sequence) 400 void *buf, __u32 sequence)
401{ 401{
402 __u32 provided, calculated; 402 __u32 csum32;
403 403
404 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 404 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
405 return 1; 405 return 1;
406 406
407 sequence = cpu_to_be32(sequence); 407 sequence = cpu_to_be32(sequence);
408 calculated = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&sequence, 408 csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&sequence,
409 sizeof(sequence)); 409 sizeof(sequence));
410 calculated = jbd2_chksum(j, calculated, buf, j->j_blocksize); 410 csum32 = jbd2_chksum(j, csum32, buf, j->j_blocksize);
411 provided = be32_to_cpu(tag->t_checksum);
412 411
413 return provided == cpu_to_be32(calculated); 412 return tag->t_checksum == cpu_to_be16(csum32);
414} 413}
415 414
416static int do_one_pass(journal_t *journal, 415static int do_one_pass(journal_t *journal,
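The tag checksum is now stored truncated: the writer keeps only the low 16 bits of the crc32c (t_checksum becomes a __be16), and recovery recomputes the full 32-bit value and compares the low half. A runnable sketch of that truncate-and-compare, with a stand-in checksum function (byte-order handling via cpu_to_be16() is elided):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for jbd2_chksum(); any 32-bit checksum illustrates the point. */
static uint32_t toy_csum(uint32_t seed, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t c = seed;

	while (len--)
		c = (c << 5) + c + *p++;
	return c;
}

int main(void)
{
	unsigned char block[32];
	uint32_t csum32;
	uint16_t t_checksum;

	memset(block, 0xab, sizeof(block));
	csum32 = toy_csum(0, block, sizeof(block));
	t_checksum = (uint16_t)csum32;	/* writer: keep the low 16 bits */

	/* reader: recompute the full value, compare the low half */
	printf("tag %s\n",
	       t_checksum == (uint16_t)toy_csum(0, block, sizeof(block)) ?
	       "verified" : "corrupt");
	return 0;
}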
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f30b80b4ce8b..198c9c10276d 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -122,9 +122,10 @@ struct jbd2_revoke_table_s
122 122
123#ifdef __KERNEL__ 123#ifdef __KERNEL__
124static void write_one_revoke_record(journal_t *, transaction_t *, 124static void write_one_revoke_record(journal_t *, transaction_t *,
125 struct journal_head **, int *, 125 struct list_head *,
126 struct buffer_head **, int *,
126 struct jbd2_revoke_record_s *, int); 127 struct jbd2_revoke_record_s *, int);
127static void flush_descriptor(journal_t *, struct journal_head *, int, int); 128static void flush_descriptor(journal_t *, struct buffer_head *, int, int);
128#endif 129#endif
129 130
130/* Utility functions to maintain the revoke table */ 131/* Utility functions to maintain the revoke table */
@@ -531,9 +532,10 @@ void jbd2_journal_switch_revoke_table(journal_t *journal)
531 */ 532 */
532void jbd2_journal_write_revoke_records(journal_t *journal, 533void jbd2_journal_write_revoke_records(journal_t *journal,
533 transaction_t *transaction, 534 transaction_t *transaction,
535 struct list_head *log_bufs,
534 int write_op) 536 int write_op)
535{ 537{
536 struct journal_head *descriptor; 538 struct buffer_head *descriptor;
537 struct jbd2_revoke_record_s *record; 539 struct jbd2_revoke_record_s *record;
538 struct jbd2_revoke_table_s *revoke; 540 struct jbd2_revoke_table_s *revoke;
539 struct list_head *hash_list; 541 struct list_head *hash_list;
@@ -553,7 +555,7 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
553 while (!list_empty(hash_list)) { 555 while (!list_empty(hash_list)) {
554 record = (struct jbd2_revoke_record_s *) 556 record = (struct jbd2_revoke_record_s *)
555 hash_list->next; 557 hash_list->next;
556 write_one_revoke_record(journal, transaction, 558 write_one_revoke_record(journal, transaction, log_bufs,
557 &descriptor, &offset, 559 &descriptor, &offset,
558 record, write_op); 560 record, write_op);
559 count++; 561 count++;
@@ -573,13 +575,14 @@ void jbd2_journal_write_revoke_records(journal_t *journal,
573 575
574static void write_one_revoke_record(journal_t *journal, 576static void write_one_revoke_record(journal_t *journal,
575 transaction_t *transaction, 577 transaction_t *transaction,
576 struct journal_head **descriptorp, 578 struct list_head *log_bufs,
579 struct buffer_head **descriptorp,
577 int *offsetp, 580 int *offsetp,
578 struct jbd2_revoke_record_s *record, 581 struct jbd2_revoke_record_s *record,
579 int write_op) 582 int write_op)
580{ 583{
581 int csum_size = 0; 584 int csum_size = 0;
582 struct journal_head *descriptor; 585 struct buffer_head *descriptor;
583 int offset; 586 int offset;
584 journal_header_t *header; 587 journal_header_t *header;
585 588
@@ -609,26 +612,26 @@ static void write_one_revoke_record(journal_t *journal,
609 descriptor = jbd2_journal_get_descriptor_buffer(journal); 612 descriptor = jbd2_journal_get_descriptor_buffer(journal);
610 if (!descriptor) 613 if (!descriptor)
611 return; 614 return;
612 header = (journal_header_t *) &jh2bh(descriptor)->b_data[0]; 615 header = (journal_header_t *)descriptor->b_data;
613 header->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); 616 header->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
614 header->h_blocktype = cpu_to_be32(JBD2_REVOKE_BLOCK); 617 header->h_blocktype = cpu_to_be32(JBD2_REVOKE_BLOCK);
615 header->h_sequence = cpu_to_be32(transaction->t_tid); 618 header->h_sequence = cpu_to_be32(transaction->t_tid);
616 619
617 /* Record it so that we can wait for IO completion later */ 620 /* Record it so that we can wait for IO completion later */
618 JBUFFER_TRACE(descriptor, "file as BJ_LogCtl"); 621 BUFFER_TRACE(descriptor, "file in log_bufs");
619 jbd2_journal_file_buffer(descriptor, transaction, BJ_LogCtl); 622 jbd2_file_log_bh(log_bufs, descriptor);
620 623
621 offset = sizeof(jbd2_journal_revoke_header_t); 624 offset = sizeof(jbd2_journal_revoke_header_t);
622 *descriptorp = descriptor; 625 *descriptorp = descriptor;
623 } 626 }
624 627
625 if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT)) { 628 if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT)) {
626 * ((__be64 *)(&jh2bh(descriptor)->b_data[offset])) = 629 * ((__be64 *)(&descriptor->b_data[offset])) =
627 cpu_to_be64(record->blocknr); 630 cpu_to_be64(record->blocknr);
628 offset += 8; 631 offset += 8;
629 632
630 } else { 633 } else {
631 * ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) = 634 * ((__be32 *)(&descriptor->b_data[offset])) =
632 cpu_to_be32(record->blocknr); 635 cpu_to_be32(record->blocknr);
633 offset += 4; 636 offset += 4;
634 } 637 }
@@ -636,8 +639,7 @@ static void write_one_revoke_record(journal_t *journal,
636 *offsetp = offset; 639 *offsetp = offset;
637} 640}
638 641
639static void jbd2_revoke_csum_set(journal_t *j, 642static void jbd2_revoke_csum_set(journal_t *j, struct buffer_head *bh)
640 struct journal_head *descriptor)
641{ 643{
642 struct jbd2_journal_revoke_tail *tail; 644 struct jbd2_journal_revoke_tail *tail;
643 __u32 csum; 645 __u32 csum;
@@ -645,12 +647,10 @@ static void jbd2_revoke_csum_set(journal_t *j,
645 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2)) 647 if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
646 return; 648 return;
647 649
648 tail = (struct jbd2_journal_revoke_tail *) 650 tail = (struct jbd2_journal_revoke_tail *)(bh->b_data + j->j_blocksize -
649 (jh2bh(descriptor)->b_data + j->j_blocksize -
650 sizeof(struct jbd2_journal_revoke_tail)); 651 sizeof(struct jbd2_journal_revoke_tail));
651 tail->r_checksum = 0; 652 tail->r_checksum = 0;
652 csum = jbd2_chksum(j, j->j_csum_seed, jh2bh(descriptor)->b_data, 653 csum = jbd2_chksum(j, j->j_csum_seed, bh->b_data, j->j_blocksize);
653 j->j_blocksize);
654 tail->r_checksum = cpu_to_be32(csum); 654 tail->r_checksum = cpu_to_be32(csum);
655} 655}
656 656
@@ -662,25 +662,24 @@ static void jbd2_revoke_csum_set(journal_t *j,
662 */ 662 */
663 663
664static void flush_descriptor(journal_t *journal, 664static void flush_descriptor(journal_t *journal,
665 struct journal_head *descriptor, 665 struct buffer_head *descriptor,
666 int offset, int write_op) 666 int offset, int write_op)
667{ 667{
668 jbd2_journal_revoke_header_t *header; 668 jbd2_journal_revoke_header_t *header;
669 struct buffer_head *bh = jh2bh(descriptor);
670 669
671 if (is_journal_aborted(journal)) { 670 if (is_journal_aborted(journal)) {
672 put_bh(bh); 671 put_bh(descriptor);
673 return; 672 return;
674 } 673 }
675 674
676 header = (jbd2_journal_revoke_header_t *) jh2bh(descriptor)->b_data; 675 header = (jbd2_journal_revoke_header_t *)descriptor->b_data;
677 header->r_count = cpu_to_be32(offset); 676 header->r_count = cpu_to_be32(offset);
678 jbd2_revoke_csum_set(journal, descriptor); 677 jbd2_revoke_csum_set(journal, descriptor);
679 678
680 set_buffer_jwrite(bh); 679 set_buffer_jwrite(descriptor);
681 BUFFER_TRACE(bh, "write"); 680 BUFFER_TRACE(descriptor, "write");
682 set_buffer_dirty(bh); 681 set_buffer_dirty(descriptor);
683 write_dirty_buffer(bh, write_op); 682 write_dirty_buffer(descriptor, write_op);
684} 683}
685#endif 684#endif
686 685
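A user-space sketch of the record layout written by write_one_revoke_record() above: revoked block numbers are appended big-endian after the revoke header, 4 or 8 bytes each. htonl() stands in for the kernel's cpu_to_be32(), and the header size is invented for the demo:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Append one 32-bit revoked block number; returns the new offset,
     * which the caller stores back through *offsetp in the real code. */
    static size_t pack_revoke32(uint8_t *data, size_t offset, uint32_t blocknr)
    {
            uint32_t be = htonl(blocknr);

            memcpy(data + offset, &be, sizeof(be));
            return offset + 4;
    }

    int main(void)
    {
            uint8_t block[64] = { 0 };
            size_t off = 16;                /* pretend revoke header size */

            off = pack_revoke32(block, off, 12345);
            return off == 20 ? 0 : 1;
    }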
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10f524c59ea8..7aa9a32573bb 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -89,7 +89,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
89 transaction->t_expires = jiffies + journal->j_commit_interval; 89 transaction->t_expires = jiffies + journal->j_commit_interval;
90 spin_lock_init(&transaction->t_handle_lock); 90 spin_lock_init(&transaction->t_handle_lock);
91 atomic_set(&transaction->t_updates, 0); 91 atomic_set(&transaction->t_updates, 0);
92 atomic_set(&transaction->t_outstanding_credits, 0); 92 atomic_set(&transaction->t_outstanding_credits,
93 atomic_read(&journal->j_reserved_credits));
93 atomic_set(&transaction->t_handle_count, 0); 94 atomic_set(&transaction->t_handle_count, 0);
94 INIT_LIST_HEAD(&transaction->t_inode_list); 95 INIT_LIST_HEAD(&transaction->t_inode_list);
95 INIT_LIST_HEAD(&transaction->t_private_list); 96 INIT_LIST_HEAD(&transaction->t_private_list);
@@ -141,6 +142,112 @@ static inline void update_t_max_wait(transaction_t *transaction,
141} 142}
142 143
143/* 144/*
 145 * Wait until the running transaction passes T_LOCKED state. Also starts the
 146 * commit if needed. The function expects the running transaction to exist and
 147 * releases j_state_lock.
148 */
149static void wait_transaction_locked(journal_t *journal)
150 __releases(journal->j_state_lock)
151{
152 DEFINE_WAIT(wait);
153 int need_to_start;
154 tid_t tid = journal->j_running_transaction->t_tid;
155
156 prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
157 TASK_UNINTERRUPTIBLE);
158 need_to_start = !tid_geq(journal->j_commit_request, tid);
159 read_unlock(&journal->j_state_lock);
160 if (need_to_start)
161 jbd2_log_start_commit(journal, tid);
162 schedule();
163 finish_wait(&journal->j_wait_transaction_locked, &wait);
164}
165
166static void sub_reserved_credits(journal_t *journal, int blocks)
167{
168 atomic_sub(blocks, &journal->j_reserved_credits);
169 wake_up(&journal->j_wait_reserved);
170}
171
172/*
 173 * Wait until we can add credits for the handle to the running transaction.
 174 * Called with j_state_lock held for reading. Returns 0 if the handle joined
 175 * the running transaction. Returns 1 if we had to wait; j_state_lock is then
 176 * dropped and the caller must retry.
177 */
178static int add_transaction_credits(journal_t *journal, int blocks,
179 int rsv_blocks)
180{
181 transaction_t *t = journal->j_running_transaction;
182 int needed;
183 int total = blocks + rsv_blocks;
184
185 /*
186 * If the current transaction is locked down for commit, wait
187 * for the lock to be released.
188 */
189 if (t->t_state == T_LOCKED) {
190 wait_transaction_locked(journal);
191 return 1;
192 }
193
194 /*
195 * If there is not enough space left in the log to write all
196 * potential buffers requested by this operation, we need to
197 * stall pending a log checkpoint to free some more log space.
198 */
199 needed = atomic_add_return(total, &t->t_outstanding_credits);
200 if (needed > journal->j_max_transaction_buffers) {
201 /*
202 * If the current transaction is already too large,
203 * then start to commit it: we can then go back and
204 * attach this handle to a new transaction.
205 */
206 atomic_sub(total, &t->t_outstanding_credits);
207 wait_transaction_locked(journal);
208 return 1;
209 }
210
211 /*
212 * The commit code assumes that it can get enough log space
213 * without forcing a checkpoint. This is *critical* for
214 * correctness: a checkpoint of a buffer which is also
215 * associated with a committing transaction creates a deadlock,
216 * so commit simply cannot force through checkpoints.
217 *
218 * We must therefore ensure the necessary space in the journal
219 * *before* starting to dirty potentially checkpointed buffers
220 * in the new transaction.
221 */
222 if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
223 atomic_sub(total, &t->t_outstanding_credits);
224 read_unlock(&journal->j_state_lock);
225 write_lock(&journal->j_state_lock);
226 if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
227 __jbd2_log_wait_for_space(journal);
228 write_unlock(&journal->j_state_lock);
229 return 1;
230 }
231
232 /* No reservation? We are done... */
233 if (!rsv_blocks)
234 return 0;
235
236 needed = atomic_add_return(rsv_blocks, &journal->j_reserved_credits);
237 /* We allow at most half of a transaction to be reserved */
238 if (needed > journal->j_max_transaction_buffers / 2) {
239 sub_reserved_credits(journal, rsv_blocks);
240 atomic_sub(total, &t->t_outstanding_credits);
241 read_unlock(&journal->j_state_lock);
242 wait_event(journal->j_wait_reserved,
243 atomic_read(&journal->j_reserved_credits) + rsv_blocks
244 <= journal->j_max_transaction_buffers / 2);
245 return 1;
246 }
247 return 0;
248}
249
250/*
144 * start_this_handle: Given a handle, deal with any locking or stalling 251 * start_this_handle: Given a handle, deal with any locking or stalling
145 * needed to make sure that there is enough journal space for the handle 252 * needed to make sure that there is enough journal space for the handle
146 * to begin. Attach the handle to a transaction and set up the 253 * to begin. Attach the handle to a transaction and set up the
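The last check in add_transaction_credits() caps reserved credits at half of the maximum transaction size. A small user-space model of that invariant (the numbers are arbitrary; the kernel tracks the running total in journal->j_reserved_credits):

    #include <stdio.h>

    /* Model of the reservation cap: reserved credits, including the new
     * request, may never exceed half of the maximum transaction size. */
    static int can_reserve(int reserved, int rsv_blocks, int max_buffers)
    {
            return reserved + rsv_blocks <= max_buffers / 2;
    }

    int main(void)
    {
            printf("%d\n", can_reserve(90, 30, 256));   /* 1: 120 <= 128 */
            printf("%d\n", can_reserve(110, 30, 256));  /* 0: 140 > 128 */
            return 0;
    }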
@@ -151,18 +258,24 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
151 gfp_t gfp_mask) 258 gfp_t gfp_mask)
152{ 259{
153 transaction_t *transaction, *new_transaction = NULL; 260 transaction_t *transaction, *new_transaction = NULL;
154 tid_t tid; 261 int blocks = handle->h_buffer_credits;
155 int needed, need_to_start; 262 int rsv_blocks = 0;
156 int nblocks = handle->h_buffer_credits;
157 unsigned long ts = jiffies; 263 unsigned long ts = jiffies;
158 264
159 if (nblocks > journal->j_max_transaction_buffers) { 265 /*
 266 * 1/2 of a transaction can be reserved, so in practice we can handle
 267 * only 1/2 of the maximum transaction size per operation
268 */
269 if (WARN_ON(blocks > journal->j_max_transaction_buffers / 2)) {
160 printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n", 270 printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
161 current->comm, nblocks, 271 current->comm, blocks,
162 journal->j_max_transaction_buffers); 272 journal->j_max_transaction_buffers / 2);
163 return -ENOSPC; 273 return -ENOSPC;
164 } 274 }
165 275
276 if (handle->h_rsv_handle)
277 rsv_blocks = handle->h_rsv_handle->h_buffer_credits;
278
166alloc_transaction: 279alloc_transaction:
167 if (!journal->j_running_transaction) { 280 if (!journal->j_running_transaction) {
168 new_transaction = kmem_cache_zalloc(transaction_cache, 281 new_transaction = kmem_cache_zalloc(transaction_cache,
@@ -199,8 +312,12 @@ repeat:
199 return -EROFS; 312 return -EROFS;
200 } 313 }
201 314
202 /* Wait on the journal's transaction barrier if necessary */ 315 /*
 203 if (journal->j_barrier_count) { 316 * Wait on the journal's transaction barrier if necessary. Specifically,
317 * we allow reserved handles to proceed because otherwise commit could
318 * deadlock on page writeback not being able to complete.
319 */
320 if (!handle->h_reserved && journal->j_barrier_count) {
204 read_unlock(&journal->j_state_lock); 321 read_unlock(&journal->j_state_lock);
205 wait_event(journal->j_wait_transaction_locked, 322 wait_event(journal->j_wait_transaction_locked,
206 journal->j_barrier_count == 0); 323 journal->j_barrier_count == 0);
@@ -213,7 +330,7 @@ repeat:
213 goto alloc_transaction; 330 goto alloc_transaction;
214 write_lock(&journal->j_state_lock); 331 write_lock(&journal->j_state_lock);
215 if (!journal->j_running_transaction && 332 if (!journal->j_running_transaction &&
216 !journal->j_barrier_count) { 333 (handle->h_reserved || !journal->j_barrier_count)) {
217 jbd2_get_transaction(journal, new_transaction); 334 jbd2_get_transaction(journal, new_transaction);
218 new_transaction = NULL; 335 new_transaction = NULL;
219 } 336 }
@@ -223,85 +340,18 @@ repeat:
223 340
224 transaction = journal->j_running_transaction; 341 transaction = journal->j_running_transaction;
225 342
226 /* 343 if (!handle->h_reserved) {
227 * If the current transaction is locked down for commit, wait for the 344 /* We may have dropped j_state_lock - restart in that case */
228 * lock to be released. 345 if (add_transaction_credits(journal, blocks, rsv_blocks))
229 */ 346 goto repeat;
230 if (transaction->t_state == T_LOCKED) { 347 } else {
231 DEFINE_WAIT(wait);
232
233 prepare_to_wait(&journal->j_wait_transaction_locked,
234 &wait, TASK_UNINTERRUPTIBLE);
235 read_unlock(&journal->j_state_lock);
236 schedule();
237 finish_wait(&journal->j_wait_transaction_locked, &wait);
238 goto repeat;
239 }
240
241 /*
242 * If there is not enough space left in the log to write all potential
243 * buffers requested by this operation, we need to stall pending a log
244 * checkpoint to free some more log space.
245 */
246 needed = atomic_add_return(nblocks,
247 &transaction->t_outstanding_credits);
248
249 if (needed > journal->j_max_transaction_buffers) {
250 /* 348 /*
 251 * If the current transaction is already too large, then start 349 * We have a reserved handle so we are allowed to join the T_LOCKED
 252 * to commit it: we can then go back and attach this handle to 350 * transaction and we don't have to check for transaction size
 253 * a new transaction. 351 * or journal space.
254 */ 352 */
255 DEFINE_WAIT(wait); 353 sub_reserved_credits(journal, blocks);
256 354 handle->h_reserved = 0;
257 jbd_debug(2, "Handle %p starting new commit...\n", handle);
258 atomic_sub(nblocks, &transaction->t_outstanding_credits);
259 prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
260 TASK_UNINTERRUPTIBLE);
261 tid = transaction->t_tid;
262 need_to_start = !tid_geq(journal->j_commit_request, tid);
263 read_unlock(&journal->j_state_lock);
264 if (need_to_start)
265 jbd2_log_start_commit(journal, tid);
266 schedule();
267 finish_wait(&journal->j_wait_transaction_locked, &wait);
268 goto repeat;
269 }
270
271 /*
272 * The commit code assumes that it can get enough log space
273 * without forcing a checkpoint. This is *critical* for
274 * correctness: a checkpoint of a buffer which is also
275 * associated with a committing transaction creates a deadlock,
276 * so commit simply cannot force through checkpoints.
277 *
278 * We must therefore ensure the necessary space in the journal
279 * *before* starting to dirty potentially checkpointed buffers
280 * in the new transaction.
281 *
282 * The worst part is, any transaction currently committing can
283 * reduce the free space arbitrarily. Be careful to account for
284 * those buffers when checkpointing.
285 */
286
287 /*
288 * @@@ AKPM: This seems rather over-defensive. We're giving commit
289 * a _lot_ of headroom: 1/4 of the journal plus the size of
290 * the committing transaction. Really, we only need to give it
291 * committing_transaction->t_outstanding_credits plus "enough" for
292 * the log control blocks.
293 * Also, this test is inconsistent with the matching one in
294 * jbd2_journal_extend().
295 */
296 if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
297 jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
298 atomic_sub(nblocks, &transaction->t_outstanding_credits);
299 read_unlock(&journal->j_state_lock);
300 write_lock(&journal->j_state_lock);
301 if (__jbd2_log_space_left(journal) < jbd_space_needed(journal))
302 __jbd2_log_wait_for_space(journal);
303 write_unlock(&journal->j_state_lock);
304 goto repeat;
305 } 355 }
306 356
307 /* OK, account for the buffers that this operation expects to 357 /* OK, account for the buffers that this operation expects to
@@ -309,15 +359,16 @@ repeat:
309 */ 359 */
310 update_t_max_wait(transaction, ts); 360 update_t_max_wait(transaction, ts);
311 handle->h_transaction = transaction; 361 handle->h_transaction = transaction;
312 handle->h_requested_credits = nblocks; 362 handle->h_requested_credits = blocks;
313 handle->h_start_jiffies = jiffies; 363 handle->h_start_jiffies = jiffies;
314 atomic_inc(&transaction->t_updates); 364 atomic_inc(&transaction->t_updates);
315 atomic_inc(&transaction->t_handle_count); 365 atomic_inc(&transaction->t_handle_count);
316 jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n", 366 jbd_debug(4, "Handle %p given %d credits (total %d, free %lu)\n",
317 handle, nblocks, 367 handle, blocks,
318 atomic_read(&transaction->t_outstanding_credits), 368 atomic_read(&transaction->t_outstanding_credits),
319 __jbd2_log_space_left(journal)); 369 jbd2_log_space_left(journal));
320 read_unlock(&journal->j_state_lock); 370 read_unlock(&journal->j_state_lock);
371 current->journal_info = handle;
321 372
322 lock_map_acquire(&handle->h_lockdep_map); 373 lock_map_acquire(&handle->h_lockdep_map);
323 jbd2_journal_free_transaction(new_transaction); 374 jbd2_journal_free_transaction(new_transaction);
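Moving the current->journal_info assignment into start_this_handle() means both normal and reserved starts record the task's live handle. The accessor that relies on this invariant is, as in include/linux/jbd2.h:

    static inline handle_t *journal_current_handle(void)
    {
            return current->journal_info;   /* NULL when no handle is live */
    }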
@@ -348,16 +399,21 @@ static handle_t *new_handle(int nblocks)
348 * 399 *
349 * We make sure that the transaction can guarantee at least nblocks of 400 * We make sure that the transaction can guarantee at least nblocks of
350 * modified buffers in the log. We block until the log can guarantee 401 * modified buffers in the log. We block until the log can guarantee
351 * that much space. 402 * that much space. Additionally, if rsv_blocks > 0, we also create another
352 * 403 * handle with rsv_blocks reserved blocks in the journal. This handle is
 353 * This function is visible to journal users (like ext3fs), so is not 404 * stored in h_rsv_handle. It is not attached to any particular transaction
354 * called with the journal already locked. 405 * and thus doesn't block transaction commit. If the caller uses this reserved
406 * handle, it has to set h_rsv_handle to NULL as otherwise jbd2_journal_stop()
 407 * on the parent handle will dispose of the reserved one. A reserved handle
 408 * has to be converted to a normal handle using jbd2_journal_start_reserved()
 409 * before it can be used.
355 * 410 *
356 * Return a pointer to a newly allocated handle, or an ERR_PTR() value 411 * Return a pointer to a newly allocated handle, or an ERR_PTR() value
357 * on failure. 412 * on failure.
358 */ 413 */
359handle_t *jbd2__journal_start(journal_t *journal, int nblocks, gfp_t gfp_mask, 414handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
360 unsigned int type, unsigned int line_no) 415 gfp_t gfp_mask, unsigned int type,
416 unsigned int line_no)
361{ 417{
362 handle_t *handle = journal_current_handle(); 418 handle_t *handle = journal_current_handle();
363 int err; 419 int err;
@@ -374,13 +430,24 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, gfp_t gfp_mask,
374 handle = new_handle(nblocks); 430 handle = new_handle(nblocks);
375 if (!handle) 431 if (!handle)
376 return ERR_PTR(-ENOMEM); 432 return ERR_PTR(-ENOMEM);
433 if (rsv_blocks) {
434 handle_t *rsv_handle;
377 435
378 current->journal_info = handle; 436 rsv_handle = new_handle(rsv_blocks);
437 if (!rsv_handle) {
438 jbd2_free_handle(handle);
439 return ERR_PTR(-ENOMEM);
440 }
441 rsv_handle->h_reserved = 1;
442 rsv_handle->h_journal = journal;
443 handle->h_rsv_handle = rsv_handle;
444 }
379 445
380 err = start_this_handle(journal, handle, gfp_mask); 446 err = start_this_handle(journal, handle, gfp_mask);
381 if (err < 0) { 447 if (err < 0) {
448 if (handle->h_rsv_handle)
449 jbd2_free_handle(handle->h_rsv_handle);
382 jbd2_free_handle(handle); 450 jbd2_free_handle(handle);
383 current->journal_info = NULL;
384 return ERR_PTR(err); 451 return ERR_PTR(err);
385 } 452 }
386 handle->h_type = type; 453 handle->h_type = type;
@@ -395,10 +462,65 @@ EXPORT_SYMBOL(jbd2__journal_start);
395 462
396handle_t *jbd2_journal_start(journal_t *journal, int nblocks) 463handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
397{ 464{
398 return jbd2__journal_start(journal, nblocks, GFP_NOFS, 0, 0); 465 return jbd2__journal_start(journal, nblocks, 0, GFP_NOFS, 0, 0);
399} 466}
400EXPORT_SYMBOL(jbd2_journal_start); 467EXPORT_SYMBOL(jbd2_journal_start);
401 468
469void jbd2_journal_free_reserved(handle_t *handle)
470{
471 journal_t *journal = handle->h_journal;
472
473 WARN_ON(!handle->h_reserved);
474 sub_reserved_credits(journal, handle->h_buffer_credits);
475 jbd2_free_handle(handle);
476}
477EXPORT_SYMBOL(jbd2_journal_free_reserved);
478
479/**
480 * int jbd2_journal_start_reserved(handle_t *handle) - start reserved handle
481 * @handle: handle to start
482 *
 483 * Start a handle that has been previously reserved via jbd2__journal_start().
484 * This attaches @handle to the running transaction (or creates one if there's
 485 * no transaction running). Unlike jbd2_journal_start() this function cannot
 486 * block on journal commit, checkpointing, or similar operations. It can block
 487 * on memory allocation or a frozen journal, though.
488 *
489 * Return 0 on success, non-zero on error - handle is freed in that case.
490 */
491int jbd2_journal_start_reserved(handle_t *handle, unsigned int type,
492 unsigned int line_no)
493{
494 journal_t *journal = handle->h_journal;
495 int ret = -EIO;
496
497 if (WARN_ON(!handle->h_reserved)) {
 498 /* Someone passed in a normal handle? Just stop it. */
499 jbd2_journal_stop(handle);
500 return ret;
501 }
502 /*
 503 * The usefulness of mixing reserved and unreserved handles is
 504 * questionable. So far nobody seems to need it, so just error out.
505 */
506 if (WARN_ON(current->journal_info)) {
507 jbd2_journal_free_reserved(handle);
508 return ret;
509 }
510
511 handle->h_journal = NULL;
512 /*
513 * GFP_NOFS is here because callers are likely from writeback or
 514 * similarly constrained call sites.
515 */
516 ret = start_this_handle(journal, handle, GFP_NOFS);
517 if (ret < 0)
518 jbd2_journal_free_reserved(handle);
519 handle->h_type = type;
520 handle->h_line_no = line_no;
521 return ret;
522}
523EXPORT_SYMBOL(jbd2_journal_start_reserved);
402 524
403/** 525/**
404 * int jbd2_journal_extend() - extend buffer credits. 526 * int jbd2_journal_extend() - extend buffer credits.
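A hedged usage sketch of the reserved-handle flow introduced above; the function, the credit counts, and the handle type (0) are invented for illustration, and only the jbd2 calls come from this patch:

    static int example_reserved_flow(journal_t *journal)
    {
            handle_t *handle, *rsv;
            int err;

            /* Start an 8-credit handle and reserve 4 more for later. */
            handle = jbd2__journal_start(journal, 8, 4, GFP_NOFS, 0, 0);
            if (IS_ERR(handle))
                    return PTR_ERR(handle);

            /* ... modify metadata under "handle" ... */

            /* Detach the reserved handle so jbd2_journal_stop() below
             * does not dispose of it. */
            rsv = handle->h_rsv_handle;
            handle->h_rsv_handle = NULL;
            err = jbd2_journal_stop(handle);
            if (err)
                    return err;

            /* Later, e.g. from writeback: bind the reserved handle to
             * the running transaction; it cannot block on commit. */
            err = jbd2_journal_start_reserved(rsv, 0, 0);
            if (err)
                    return err;     /* rsv is already freed on error */

            /* ... do the deferred work under "rsv" ... */
            return jbd2_journal_stop(rsv);
    }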
@@ -423,49 +545,53 @@ EXPORT_SYMBOL(jbd2_journal_start);
423int jbd2_journal_extend(handle_t *handle, int nblocks) 545int jbd2_journal_extend(handle_t *handle, int nblocks)
424{ 546{
425 transaction_t *transaction = handle->h_transaction; 547 transaction_t *transaction = handle->h_transaction;
426 journal_t *journal = transaction->t_journal; 548 journal_t *journal;
427 int result; 549 int result;
428 int wanted; 550 int wanted;
429 551
430 result = -EIO; 552 WARN_ON(!transaction);
431 if (is_handle_aborted(handle)) 553 if (is_handle_aborted(handle))
432 goto out; 554 return -EROFS;
555 journal = transaction->t_journal;
433 556
434 result = 1; 557 result = 1;
435 558
436 read_lock(&journal->j_state_lock); 559 read_lock(&journal->j_state_lock);
437 560
438 /* Don't extend a locked-down transaction! */ 561 /* Don't extend a locked-down transaction! */
439 if (handle->h_transaction->t_state != T_RUNNING) { 562 if (transaction->t_state != T_RUNNING) {
440 jbd_debug(3, "denied handle %p %d blocks: " 563 jbd_debug(3, "denied handle %p %d blocks: "
441 "transaction not running\n", handle, nblocks); 564 "transaction not running\n", handle, nblocks);
442 goto error_out; 565 goto error_out;
443 } 566 }
444 567
445 spin_lock(&transaction->t_handle_lock); 568 spin_lock(&transaction->t_handle_lock);
446 wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks; 569 wanted = atomic_add_return(nblocks,
570 &transaction->t_outstanding_credits);
447 571
448 if (wanted > journal->j_max_transaction_buffers) { 572 if (wanted > journal->j_max_transaction_buffers) {
449 jbd_debug(3, "denied handle %p %d blocks: " 573 jbd_debug(3, "denied handle %p %d blocks: "
450 "transaction too large\n", handle, nblocks); 574 "transaction too large\n", handle, nblocks);
575 atomic_sub(nblocks, &transaction->t_outstanding_credits);
451 goto unlock; 576 goto unlock;
452 } 577 }
453 578
454 if (wanted > __jbd2_log_space_left(journal)) { 579 if (wanted + (wanted >> JBD2_CONTROL_BLOCKS_SHIFT) >
580 jbd2_log_space_left(journal)) {
455 jbd_debug(3, "denied handle %p %d blocks: " 581 jbd_debug(3, "denied handle %p %d blocks: "
456 "insufficient log space\n", handle, nblocks); 582 "insufficient log space\n", handle, nblocks);
583 atomic_sub(nblocks, &transaction->t_outstanding_credits);
457 goto unlock; 584 goto unlock;
458 } 585 }
459 586
460 trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev, 587 trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev,
461 handle->h_transaction->t_tid, 588 transaction->t_tid,
462 handle->h_type, handle->h_line_no, 589 handle->h_type, handle->h_line_no,
463 handle->h_buffer_credits, 590 handle->h_buffer_credits,
464 nblocks); 591 nblocks);
465 592
466 handle->h_buffer_credits += nblocks; 593 handle->h_buffer_credits += nblocks;
467 handle->h_requested_credits += nblocks; 594 handle->h_requested_credits += nblocks;
468 atomic_add(nblocks, &transaction->t_outstanding_credits);
469 result = 0; 595 result = 0;
470 596
471 jbd_debug(3, "extended handle %p by %d\n", handle, nblocks); 597 jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
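The extend path now reserves headroom for journal control blocks on top of the requested data blocks. A user-space model of that check; the shift value is an assumption standing in for JBD2_CONTROL_BLOCKS_SHIFT:

    #include <stdio.h>

    #define CONTROL_BLOCKS_SHIFT 5  /* assumed value for the demo */

    /* Wanted data blocks plus a control-block estimate must fit. */
    static int fits_in_log(unsigned int wanted, unsigned int space_left)
    {
            return wanted + (wanted >> CONTROL_BLOCKS_SHIFT) <= space_left;
    }

    int main(void)
    {
            printf("%d\n", fits_in_log(64, 66));    /* 1: 64 + 2 <= 66 */
            printf("%d\n", fits_in_log(64, 65));    /* 0: 64 + 2 > 65 */
            return 0;
    }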
@@ -473,7 +599,6 @@ unlock:
473 spin_unlock(&transaction->t_handle_lock); 599 spin_unlock(&transaction->t_handle_lock);
474error_out: 600error_out:
475 read_unlock(&journal->j_state_lock); 601 read_unlock(&journal->j_state_lock);
476out:
477 return result; 602 return result;
478} 603}
479 604
@@ -490,19 +615,22 @@ out:
490 * to a running handle, a call to jbd2_journal_restart will commit the 615 * to a running handle, a call to jbd2_journal_restart will commit the
491 * handle's transaction so far and reattach the handle to a new 616 * handle's transaction so far and reattach the handle to a new
 492 * transaction capable of guaranteeing the requested number of 617 * transaction capable of guaranteeing the requested number of
 493 * credits. 618 * credits. We preserve the reserved handle if one is attached to the
 619 * passed-in handle.
494 */ 620 */
495int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask) 621int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
496{ 622{
497 transaction_t *transaction = handle->h_transaction; 623 transaction_t *transaction = handle->h_transaction;
498 journal_t *journal = transaction->t_journal; 624 journal_t *journal;
499 tid_t tid; 625 tid_t tid;
500 int need_to_start, ret; 626 int need_to_start, ret;
501 627
628 WARN_ON(!transaction);
502 /* If we've had an abort of any type, don't even think about 629 /* If we've had an abort of any type, don't even think about
503 * actually doing the restart! */ 630 * actually doing the restart! */
504 if (is_handle_aborted(handle)) 631 if (is_handle_aborted(handle))
505 return 0; 632 return 0;
633 journal = transaction->t_journal;
506 634
507 /* 635 /*
508 * First unlink the handle from its current transaction, and start the 636 * First unlink the handle from its current transaction, and start the
@@ -515,12 +643,18 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
515 spin_lock(&transaction->t_handle_lock); 643 spin_lock(&transaction->t_handle_lock);
516 atomic_sub(handle->h_buffer_credits, 644 atomic_sub(handle->h_buffer_credits,
517 &transaction->t_outstanding_credits); 645 &transaction->t_outstanding_credits);
646 if (handle->h_rsv_handle) {
647 sub_reserved_credits(journal,
648 handle->h_rsv_handle->h_buffer_credits);
649 }
518 if (atomic_dec_and_test(&transaction->t_updates)) 650 if (atomic_dec_and_test(&transaction->t_updates))
519 wake_up(&journal->j_wait_updates); 651 wake_up(&journal->j_wait_updates);
652 tid = transaction->t_tid;
520 spin_unlock(&transaction->t_handle_lock); 653 spin_unlock(&transaction->t_handle_lock);
654 handle->h_transaction = NULL;
655 current->journal_info = NULL;
521 656
522 jbd_debug(2, "restarting handle %p\n", handle); 657 jbd_debug(2, "restarting handle %p\n", handle);
523 tid = transaction->t_tid;
524 need_to_start = !tid_geq(journal->j_commit_request, tid); 658 need_to_start = !tid_geq(journal->j_commit_request, tid);
525 read_unlock(&journal->j_state_lock); 659 read_unlock(&journal->j_state_lock);
526 if (need_to_start) 660 if (need_to_start)
@@ -557,6 +691,14 @@ void jbd2_journal_lock_updates(journal_t *journal)
557 write_lock(&journal->j_state_lock); 691 write_lock(&journal->j_state_lock);
558 ++journal->j_barrier_count; 692 ++journal->j_barrier_count;
559 693
694 /* Wait until there are no reserved handles */
695 if (atomic_read(&journal->j_reserved_credits)) {
696 write_unlock(&journal->j_state_lock);
697 wait_event(journal->j_wait_reserved,
698 atomic_read(&journal->j_reserved_credits) == 0);
699 write_lock(&journal->j_state_lock);
700 }
701
560 /* Wait until there are no running updates */ 702 /* Wait until there are no running updates */
561 while (1) { 703 while (1) {
562 transaction_t *transaction = journal->j_running_transaction; 704 transaction_t *transaction = journal->j_running_transaction;
@@ -619,6 +761,12 @@ static void warn_dirty_buffer(struct buffer_head *bh)
619 bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr); 761 bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
620} 762}
621 763
764static int sleep_on_shadow_bh(void *word)
765{
766 io_schedule();
767 return 0;
768}
769
622/* 770/*
623 * If the buffer is already part of the current transaction, then there 771 * If the buffer is already part of the current transaction, then there
624 * is nothing we need to do. If it is already part of a prior 772 * is nothing we need to do. If it is already part of a prior
@@ -634,17 +782,16 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
634 int force_copy) 782 int force_copy)
635{ 783{
636 struct buffer_head *bh; 784 struct buffer_head *bh;
637 transaction_t *transaction; 785 transaction_t *transaction = handle->h_transaction;
638 journal_t *journal; 786 journal_t *journal;
639 int error; 787 int error;
640 char *frozen_buffer = NULL; 788 char *frozen_buffer = NULL;
641 int need_copy = 0; 789 int need_copy = 0;
642 unsigned long start_lock, time_lock; 790 unsigned long start_lock, time_lock;
643 791
792 WARN_ON(!transaction);
644 if (is_handle_aborted(handle)) 793 if (is_handle_aborted(handle))
645 return -EROFS; 794 return -EROFS;
646
647 transaction = handle->h_transaction;
648 journal = transaction->t_journal; 795 journal = transaction->t_journal;
649 796
650 jbd_debug(5, "journal_head %p, force_copy %d\n", jh, force_copy); 797 jbd_debug(5, "journal_head %p, force_copy %d\n", jh, force_copy);
@@ -754,41 +901,29 @@ repeat:
754 * journaled. If the primary copy is already going to 901 * journaled. If the primary copy is already going to
755 * disk then we cannot do copy-out here. */ 902 * disk then we cannot do copy-out here. */
756 903
757 if (jh->b_jlist == BJ_Shadow) { 904 if (buffer_shadow(bh)) {
758 DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
759 wait_queue_head_t *wqh;
760
761 wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
762
763 JBUFFER_TRACE(jh, "on shadow: sleep"); 905 JBUFFER_TRACE(jh, "on shadow: sleep");
764 jbd_unlock_bh_state(bh); 906 jbd_unlock_bh_state(bh);
765 /* commit wakes up all shadow buffers after IO */ 907 wait_on_bit(&bh->b_state, BH_Shadow,
766 for ( ; ; ) { 908 sleep_on_shadow_bh, TASK_UNINTERRUPTIBLE);
767 prepare_to_wait(wqh, &wait.wait,
768 TASK_UNINTERRUPTIBLE);
769 if (jh->b_jlist != BJ_Shadow)
770 break;
771 schedule();
772 }
773 finish_wait(wqh, &wait.wait);
774 goto repeat; 909 goto repeat;
775 } 910 }
776 911
777 /* Only do the copy if the currently-owning transaction 912 /*
778 * still needs it. If it is on the Forget list, the 913 * Only do the copy if the currently-owning transaction still
779 * committing transaction is past that stage. The 914 * needs it. If buffer isn't on BJ_Metadata list, the
780 * buffer had better remain locked during the kmalloc, 915 * committing transaction is past that stage (here we use the
781 * but that should be true --- we hold the journal lock 916 * fact that BH_Shadow is set under bh_state lock together with
782 * still and the buffer is already on the BUF_JOURNAL 917 * refiling to BJ_Shadow list and at this point we know the
783 * list so won't be flushed. 918 * buffer doesn't have BH_Shadow set).
784 * 919 *
785 * Subtle point, though: if this is a get_undo_access, 920 * Subtle point, though: if this is a get_undo_access,
786 * then we will be relying on the frozen_data to contain 921 * then we will be relying on the frozen_data to contain
787 * the new value of the committed_data record after the 922 * the new value of the committed_data record after the
788 * transaction, so we HAVE to force the frozen_data copy 923 * transaction, so we HAVE to force the frozen_data copy
789 * in that case. */ 924 * in that case.
790 925 */
791 if (jh->b_jlist != BJ_Forget || force_copy) { 926 if (jh->b_jlist == BJ_Metadata || force_copy) {
792 JBUFFER_TRACE(jh, "generate frozen data"); 927 JBUFFER_TRACE(jh, "generate frozen data");
793 if (!frozen_buffer) { 928 if (!frozen_buffer) {
794 JBUFFER_TRACE(jh, "allocate memory for buffer"); 929 JBUFFER_TRACE(jh, "allocate memory for buffer");
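The wait_on_bit() call above replaces the open-coded waitqueue loop. Its contract in this era of the kernel: the action callback sleeps and returns 0 to keep waiting (the bit is then re-checked), or non-zero to abort the wait. A sketch of the pairing, using the names from the patch:

    static int sleep_on_shadow_bh(void *word)
    {
            io_schedule();          /* block until commit-side I/O wakes us */
            return 0;               /* not interrupted: re-check BH_Shadow */
    }

    /* caller side, as in do_get_write_access() above: */
    wait_on_bit(&bh->b_state, BH_Shadow, sleep_on_shadow_bh,
                TASK_UNINTERRUPTIBLE);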
@@ -915,14 +1050,16 @@ int jbd2_journal_get_write_access(handle_t *handle, struct buffer_head *bh)
915int jbd2_journal_get_create_access(handle_t *handle, struct buffer_head *bh) 1050int jbd2_journal_get_create_access(handle_t *handle, struct buffer_head *bh)
916{ 1051{
917 transaction_t *transaction = handle->h_transaction; 1052 transaction_t *transaction = handle->h_transaction;
918 journal_t *journal = transaction->t_journal; 1053 journal_t *journal;
919 struct journal_head *jh = jbd2_journal_add_journal_head(bh); 1054 struct journal_head *jh = jbd2_journal_add_journal_head(bh);
920 int err; 1055 int err;
921 1056
922 jbd_debug(5, "journal_head %p\n", jh); 1057 jbd_debug(5, "journal_head %p\n", jh);
1058 WARN_ON(!transaction);
923 err = -EROFS; 1059 err = -EROFS;
924 if (is_handle_aborted(handle)) 1060 if (is_handle_aborted(handle))
925 goto out; 1061 goto out;
1062 journal = transaction->t_journal;
926 err = 0; 1063 err = 0;
927 1064
928 JBUFFER_TRACE(jh, "entry"); 1065 JBUFFER_TRACE(jh, "entry");
@@ -1128,12 +1265,14 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
1128int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh) 1265int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
1129{ 1266{
1130 transaction_t *transaction = handle->h_transaction; 1267 transaction_t *transaction = handle->h_transaction;
1131 journal_t *journal = transaction->t_journal; 1268 journal_t *journal;
1132 struct journal_head *jh; 1269 struct journal_head *jh;
1133 int ret = 0; 1270 int ret = 0;
1134 1271
1272 WARN_ON(!transaction);
1135 if (is_handle_aborted(handle)) 1273 if (is_handle_aborted(handle))
1136 goto out; 1274 return -EROFS;
1275 journal = transaction->t_journal;
1137 jh = jbd2_journal_grab_journal_head(bh); 1276 jh = jbd2_journal_grab_journal_head(bh);
1138 if (!jh) { 1277 if (!jh) {
1139 ret = -EUCLEAN; 1278 ret = -EUCLEAN;
@@ -1227,7 +1366,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
1227 1366
1228 JBUFFER_TRACE(jh, "file as BJ_Metadata"); 1367 JBUFFER_TRACE(jh, "file as BJ_Metadata");
1229 spin_lock(&journal->j_list_lock); 1368 spin_lock(&journal->j_list_lock);
1230 __jbd2_journal_file_buffer(jh, handle->h_transaction, BJ_Metadata); 1369 __jbd2_journal_file_buffer(jh, transaction, BJ_Metadata);
1231 spin_unlock(&journal->j_list_lock); 1370 spin_unlock(&journal->j_list_lock);
1232out_unlock_bh: 1371out_unlock_bh:
1233 jbd_unlock_bh_state(bh); 1372 jbd_unlock_bh_state(bh);
@@ -1258,12 +1397,17 @@ out:
1258int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh) 1397int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
1259{ 1398{
1260 transaction_t *transaction = handle->h_transaction; 1399 transaction_t *transaction = handle->h_transaction;
1261 journal_t *journal = transaction->t_journal; 1400 journal_t *journal;
1262 struct journal_head *jh; 1401 struct journal_head *jh;
1263 int drop_reserve = 0; 1402 int drop_reserve = 0;
1264 int err = 0; 1403 int err = 0;
1265 int was_modified = 0; 1404 int was_modified = 0;
1266 1405
1406 WARN_ON(!transaction);
1407 if (is_handle_aborted(handle))
1408 return -EROFS;
1409 journal = transaction->t_journal;
1410
1267 BUFFER_TRACE(bh, "entry"); 1411 BUFFER_TRACE(bh, "entry");
1268 1412
1269 jbd_lock_bh_state(bh); 1413 jbd_lock_bh_state(bh);
@@ -1290,7 +1434,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
1290 */ 1434 */
1291 jh->b_modified = 0; 1435 jh->b_modified = 0;
1292 1436
1293 if (jh->b_transaction == handle->h_transaction) { 1437 if (jh->b_transaction == transaction) {
1294 J_ASSERT_JH(jh, !jh->b_frozen_data); 1438 J_ASSERT_JH(jh, !jh->b_frozen_data);
1295 1439
1296 /* If we are forgetting a buffer which is already part 1440 /* If we are forgetting a buffer which is already part
@@ -1385,19 +1529,21 @@ drop:
1385int jbd2_journal_stop(handle_t *handle) 1529int jbd2_journal_stop(handle_t *handle)
1386{ 1530{
1387 transaction_t *transaction = handle->h_transaction; 1531 transaction_t *transaction = handle->h_transaction;
1388 journal_t *journal = transaction->t_journal; 1532 journal_t *journal;
1389 int err, wait_for_commit = 0; 1533 int err = 0, wait_for_commit = 0;
1390 tid_t tid; 1534 tid_t tid;
1391 pid_t pid; 1535 pid_t pid;
1392 1536
1537 if (!transaction)
1538 goto free_and_exit;
1539 journal = transaction->t_journal;
1540
1393 J_ASSERT(journal_current_handle() == handle); 1541 J_ASSERT(journal_current_handle() == handle);
1394 1542
1395 if (is_handle_aborted(handle)) 1543 if (is_handle_aborted(handle))
1396 err = -EIO; 1544 err = -EIO;
1397 else { 1545 else
1398 J_ASSERT(atomic_read(&transaction->t_updates) > 0); 1546 J_ASSERT(atomic_read(&transaction->t_updates) > 0);
1399 err = 0;
1400 }
1401 1547
1402 if (--handle->h_ref > 0) { 1548 if (--handle->h_ref > 0) {
1403 jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1, 1549 jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
@@ -1407,7 +1553,7 @@ int jbd2_journal_stop(handle_t *handle)
1407 1553
1408 jbd_debug(4, "Handle %p going down\n", handle); 1554 jbd_debug(4, "Handle %p going down\n", handle);
1409 trace_jbd2_handle_stats(journal->j_fs_dev->bd_dev, 1555 trace_jbd2_handle_stats(journal->j_fs_dev->bd_dev,
1410 handle->h_transaction->t_tid, 1556 transaction->t_tid,
1411 handle->h_type, handle->h_line_no, 1557 handle->h_type, handle->h_line_no,
1412 jiffies - handle->h_start_jiffies, 1558 jiffies - handle->h_start_jiffies,
1413 handle->h_sync, handle->h_requested_credits, 1559 handle->h_sync, handle->h_requested_credits,
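Because jbd2_journal_stop() now tolerates handle->h_transaction == NULL, a handle left detached by a failed restart can still be disposed of safely. A hedged caller sketch (the variable names are illustrative):

    err = jbd2__journal_restart(handle, nblocks, GFP_NOFS);
    if (err) {
            /* handle has no transaction now; stop just frees it via
             * the free_and_exit path added above. */
            jbd2_journal_stop(handle);
            return err;
    }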
@@ -1518,33 +1664,13 @@ int jbd2_journal_stop(handle_t *handle)
1518 1664
1519 lock_map_release(&handle->h_lockdep_map); 1665 lock_map_release(&handle->h_lockdep_map);
1520 1666
1667 if (handle->h_rsv_handle)
1668 jbd2_journal_free_reserved(handle->h_rsv_handle);
1669free_and_exit:
1521 jbd2_free_handle(handle); 1670 jbd2_free_handle(handle);
1522 return err; 1671 return err;
1523} 1672}
1524 1673
1525/**
1526 * int jbd2_journal_force_commit() - force any uncommitted transactions
1527 * @journal: journal to force
1528 *
1529 * For synchronous operations: force any uncommitted transactions
1530 * to disk. May seem kludgy, but it reuses all the handle batching
1531 * code in a very simple manner.
1532 */
1533int jbd2_journal_force_commit(journal_t *journal)
1534{
1535 handle_t *handle;
1536 int ret;
1537
1538 handle = jbd2_journal_start(journal, 1);
1539 if (IS_ERR(handle)) {
1540 ret = PTR_ERR(handle);
1541 } else {
1542 handle->h_sync = 1;
1543 ret = jbd2_journal_stop(handle);
1544 }
1545 return ret;
1546}
1547
1548/* 1674/*
1549 * 1675 *
1550 * List management code snippets: various functions for manipulating the 1676 * List management code snippets: various functions for manipulating the
@@ -1601,10 +1727,10 @@ __blist_del_buffer(struct journal_head **list, struct journal_head *jh)
1601 * Remove a buffer from the appropriate transaction list. 1727 * Remove a buffer from the appropriate transaction list.
1602 * 1728 *
1603 * Note that this function can *change* the value of 1729 * Note that this function can *change* the value of
1604 * bh->b_transaction->t_buffers, t_forget, t_iobuf_list, t_shadow_list, 1730 * bh->b_transaction->t_buffers, t_forget, t_shadow_list, t_log_list or
1605 * t_log_list or t_reserved_list. If the caller is holding onto a copy of one 1731 * t_reserved_list. If the caller is holding onto a copy of one of these
1606 * of these pointers, it could go bad. Generally the caller needs to re-read 1732 * pointers, it could go bad. Generally the caller needs to re-read the
1607 * the pointer from the transaction_t. 1733 * pointer from the transaction_t.
1608 * 1734 *
1609 * Called under j_list_lock. 1735 * Called under j_list_lock.
1610 */ 1736 */
@@ -1634,15 +1760,9 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
1634 case BJ_Forget: 1760 case BJ_Forget:
1635 list = &transaction->t_forget; 1761 list = &transaction->t_forget;
1636 break; 1762 break;
1637 case BJ_IO:
1638 list = &transaction->t_iobuf_list;
1639 break;
1640 case BJ_Shadow: 1763 case BJ_Shadow:
1641 list = &transaction->t_shadow_list; 1764 list = &transaction->t_shadow_list;
1642 break; 1765 break;
1643 case BJ_LogCtl:
1644 list = &transaction->t_log_list;
1645 break;
1646 case BJ_Reserved: 1766 case BJ_Reserved:
1647 list = &transaction->t_reserved_list; 1767 list = &transaction->t_reserved_list;
1648 break; 1768 break;
@@ -2034,18 +2154,23 @@ zap_buffer_unlocked:
2034 * void jbd2_journal_invalidatepage() 2154 * void jbd2_journal_invalidatepage()
2035 * @journal: journal to use for flush... 2155 * @journal: journal to use for flush...
2036 * @page: page to flush 2156 * @page: page to flush
2037 * @offset: length of page to invalidate. 2157 * @offset: start of the range to invalidate
2158 * @length: length of the range to invalidate
2038 * 2159 *
 2039 * Reap page buffers containing data after offset in page. Can return -EBUSY 2160 * Reap page buffers containing data in the specified range of the page.
2040 * if buffers are part of the committing transaction and the page is straddling 2161 * Can return -EBUSY if buffers are part of the committing transaction and
2041 * i_size. Caller then has to wait for current commit and try again. 2162 * the page is straddling i_size. Caller then has to wait for current commit
2163 * and try again.
2042 */ 2164 */
2043int jbd2_journal_invalidatepage(journal_t *journal, 2165int jbd2_journal_invalidatepage(journal_t *journal,
2044 struct page *page, 2166 struct page *page,
2045 unsigned long offset) 2167 unsigned int offset,
2168 unsigned int length)
2046{ 2169{
2047 struct buffer_head *head, *bh, *next; 2170 struct buffer_head *head, *bh, *next;
2171 unsigned int stop = offset + length;
2048 unsigned int curr_off = 0; 2172 unsigned int curr_off = 0;
2173 int partial_page = (offset || length < PAGE_CACHE_SIZE);
2049 int may_free = 1; 2174 int may_free = 1;
2050 int ret = 0; 2175 int ret = 0;
2051 2176
@@ -2054,6 +2179,8 @@ int jbd2_journal_invalidatepage(journal_t *journal,
2054 if (!page_has_buffers(page)) 2179 if (!page_has_buffers(page))
2055 return 0; 2180 return 0;
2056 2181
2182 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
2183
2057 /* We will potentially be playing with lists other than just the 2184 /* We will potentially be playing with lists other than just the
2058 * data lists (especially for journaled data mode), so be 2185 * data lists (especially for journaled data mode), so be
2059 * cautious in our locking. */ 2186 * cautious in our locking. */
@@ -2063,10 +2190,13 @@ int jbd2_journal_invalidatepage(journal_t *journal,
2063 unsigned int next_off = curr_off + bh->b_size; 2190 unsigned int next_off = curr_off + bh->b_size;
2064 next = bh->b_this_page; 2191 next = bh->b_this_page;
2065 2192
2193 if (next_off > stop)
2194 return 0;
2195
2066 if (offset <= curr_off) { 2196 if (offset <= curr_off) {
2067 /* This block is wholly outside the truncation point */ 2197 /* This block is wholly outside the truncation point */
2068 lock_buffer(bh); 2198 lock_buffer(bh);
2069 ret = journal_unmap_buffer(journal, bh, offset > 0); 2199 ret = journal_unmap_buffer(journal, bh, partial_page);
2070 unlock_buffer(bh); 2200 unlock_buffer(bh);
2071 if (ret < 0) 2201 if (ret < 0)
2072 return ret; 2202 return ret;
@@ -2077,7 +2207,7 @@ int jbd2_journal_invalidatepage(journal_t *journal,
2077 2207
2078 } while (bh != head); 2208 } while (bh != head);
2079 2209
2080 if (!offset) { 2210 if (!partial_page) {
2081 if (may_free && try_to_free_buffers(page)) 2211 if (may_free && try_to_free_buffers(page))
2082 J_ASSERT(!page_has_buffers(page)); 2212 J_ASSERT(!page_has_buffers(page));
2083 } 2213 }
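A hedged sketch of what a buffer_head-backed filesystem's ->invalidatepage looks like against the new range-based prototype; "myfs" and the PageChecked handling are illustrative, not from this patch:

    static void myfs_invalidatepage(struct page *page, unsigned int offset,
                                    unsigned int length)
    {
            /* offset == 0 && length == PAGE_CACHE_SIZE means the whole
             * page is going away; anything else is partial. */
            if (offset == 0 && length == PAGE_CACHE_SIZE)
                    ClearPageChecked(page);

            block_invalidatepage(page, offset, length);
    }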
@@ -2138,15 +2268,9 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
2138 case BJ_Forget: 2268 case BJ_Forget:
2139 list = &transaction->t_forget; 2269 list = &transaction->t_forget;
2140 break; 2270 break;
2141 case BJ_IO:
2142 list = &transaction->t_iobuf_list;
2143 break;
2144 case BJ_Shadow: 2271 case BJ_Shadow:
2145 list = &transaction->t_shadow_list; 2272 list = &transaction->t_shadow_list;
2146 break; 2273 break;
2147 case BJ_LogCtl:
2148 list = &transaction->t_log_list;
2149 break;
2150 case BJ_Reserved: 2274 case BJ_Reserved:
2151 list = &transaction->t_reserved_list; 2275 list = &transaction->t_reserved_list;
2152 break; 2276 break;
@@ -2248,10 +2372,12 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
2248int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode) 2372int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
2249{ 2373{
2250 transaction_t *transaction = handle->h_transaction; 2374 transaction_t *transaction = handle->h_transaction;
2251 journal_t *journal = transaction->t_journal; 2375 journal_t *journal;
2252 2376
2377 WARN_ON(!transaction);
2253 if (is_handle_aborted(handle)) 2378 if (is_handle_aborted(handle))
2254 return -EIO; 2379 return -EROFS;
2380 journal = transaction->t_journal;
2255 2381
2256 jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino, 2382 jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino,
2257 transaction->t_tid); 2383 transaction->t_tid);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 6740d34cd82b..9e3aaff11f89 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -571,9 +571,10 @@ static int metapage_releasepage(struct page *page, gfp_t gfp_mask)
571 return ret; 571 return ret;
572} 572}
573 573
574static void metapage_invalidatepage(struct page *page, unsigned long offset) 574static void metapage_invalidatepage(struct page *page, unsigned int offset,
575 unsigned int length)
575{ 576{
576 BUG_ON(offset); 577 BUG_ON(offset || length < PAGE_CACHE_SIZE);
577 578
578 BUG_ON(PageWriteback(page)); 579 BUG_ON(PageWriteback(page));
579 580
diff --git a/fs/logfs/file.c b/fs/logfs/file.c
index c2219a6dd3c8..57914fc32b62 100644
--- a/fs/logfs/file.c
+++ b/fs/logfs/file.c
@@ -159,7 +159,8 @@ static int logfs_writepage(struct page *page, struct writeback_control *wbc)
159 return __logfs_writepage(page); 159 return __logfs_writepage(page);
160} 160}
161 161
162static void logfs_invalidatepage(struct page *page, unsigned long offset) 162static void logfs_invalidatepage(struct page *page, unsigned int offset,
163 unsigned int length)
163{ 164{
164 struct logfs_block *block = logfs_block(page); 165 struct logfs_block *block = logfs_block(page);
165 166
diff --git a/fs/logfs/segment.c b/fs/logfs/segment.c
index 038da0991794..d448a777166b 100644
--- a/fs/logfs/segment.c
+++ b/fs/logfs/segment.c
@@ -884,7 +884,8 @@ static struct logfs_area *alloc_area(struct super_block *sb)
884 return area; 884 return area;
885} 885}
886 886
887static void map_invalidatepage(struct page *page, unsigned long l) 887static void map_invalidatepage(struct page *page, unsigned int o,
888 unsigned int l)
888{ 889{
889 return; 890 return;
890} 891}
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index a87a44f84113..6b4a79f4ad1d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -451,11 +451,13 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
451 * - Called if either PG_private or PG_fscache is set on the page 451 * - Called if either PG_private or PG_fscache is set on the page
452 * - Caller holds page lock 452 * - Caller holds page lock
453 */ 453 */
454static void nfs_invalidate_page(struct page *page, unsigned long offset) 454static void nfs_invalidate_page(struct page *page, unsigned int offset,
455 unsigned int length)
455{ 456{
456 dfprintk(PAGECACHE, "NFS: invalidate_page(%p, %lu)\n", page, offset); 457 dfprintk(PAGECACHE, "NFS: invalidate_page(%p, %u, %u)\n",
458 page, offset, length);
457 459
458 if (offset != 0) 460 if (offset != 0 || length < PAGE_CACHE_SIZE)
459 return; 461 return;
460 /* Cancel any unstarted writes on this page */ 462 /* Cancel any unstarted writes on this page */
461 nfs_wb_page_cancel(page_file_mapping(page)->host, page); 463 nfs_wb_page_cancel(page_file_mapping(page)->host, page);
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index fa9c05f97af4..d267ea6aa1a0 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -1372,7 +1372,7 @@ retry_writepage:
1372 * The page may have dirty, unmapped buffers. Make them 1372 * The page may have dirty, unmapped buffers. Make them
1373 * freeable here, so the page does not leak. 1373 * freeable here, so the page does not leak.
1374 */ 1374 */
1375 block_invalidatepage(page, 0); 1375 block_invalidatepage(page, 0, PAGE_CACHE_SIZE);
1376 unlock_page(page); 1376 unlock_page(page);
1377 ntfs_debug("Write outside i_size - truncated?"); 1377 ntfs_debug("Write outside i_size - truncated?");
1378 return 0; 1378 return 0;
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 20dfec72e903..79736a28d84f 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -603,11 +603,12 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
603 * from ext3. PageChecked() bits have been removed as OCFS2 does not 603 * from ext3. PageChecked() bits have been removed as OCFS2 does not
604 * do journalled data. 604 * do journalled data.
605 */ 605 */
606static void ocfs2_invalidatepage(struct page *page, unsigned long offset) 606static void ocfs2_invalidatepage(struct page *page, unsigned int offset,
607 unsigned int length)
607{ 608{
608 journal_t *journal = OCFS2_SB(page->mapping->host->i_sb)->journal->j_journal; 609 journal_t *journal = OCFS2_SB(page->mapping->host->i_sb)->journal->j_journal;
609 610
610 jbd2_journal_invalidatepage(journal, page, offset); 611 jbd2_journal_invalidatepage(journal, page, offset, length);
611} 612}
612 613
613static int ocfs2_releasepage(struct page *page, gfp_t wait) 614static int ocfs2_releasepage(struct page *page, gfp_t wait)
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index f844533792ee..0048cc16a6a8 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -2975,16 +2975,19 @@ static int invalidatepage_can_drop(struct inode *inode, struct buffer_head *bh)
2975} 2975}
2976 2976
2977/* clm -- taken from fs/buffer.c:block_invalidate_page */ 2977/* clm -- taken from fs/buffer.c:block_invalidate_page */
2978static void reiserfs_invalidatepage(struct page *page, unsigned long offset) 2978static void reiserfs_invalidatepage(struct page *page, unsigned int offset,
2979 unsigned int length)
2979{ 2980{
2980 struct buffer_head *head, *bh, *next; 2981 struct buffer_head *head, *bh, *next;
2981 struct inode *inode = page->mapping->host; 2982 struct inode *inode = page->mapping->host;
2982 unsigned int curr_off = 0; 2983 unsigned int curr_off = 0;
2984 unsigned int stop = offset + length;
2985 int partial_page = (offset || length < PAGE_CACHE_SIZE);
2983 int ret = 1; 2986 int ret = 1;
2984 2987
2985 BUG_ON(!PageLocked(page)); 2988 BUG_ON(!PageLocked(page));
2986 2989
2987 if (offset == 0) 2990 if (!partial_page)
2988 ClearPageChecked(page); 2991 ClearPageChecked(page);
2989 2992
2990 if (!page_has_buffers(page)) 2993 if (!page_has_buffers(page))
@@ -2996,6 +2999,9 @@ static void reiserfs_invalidatepage(struct page *page, unsigned long offset)
2996 unsigned int next_off = curr_off + bh->b_size; 2999 unsigned int next_off = curr_off + bh->b_size;
2997 next = bh->b_this_page; 3000 next = bh->b_this_page;
2998 3001
3002 if (next_off > stop)
3003 goto out;
3004
2999 /* 3005 /*
3000 * is this block fully invalidated? 3006 * is this block fully invalidated?
3001 */ 3007 */
@@ -3014,7 +3020,7 @@ static void reiserfs_invalidatepage(struct page *page, unsigned long offset)
3014 * The get_block cached value has been unconditionally invalidated, 3020 * The get_block cached value has been unconditionally invalidated,
3015 * so real IO is not possible anymore. 3021 * so real IO is not possible anymore.
3016 */ 3022 */
3017 if (!offset && ret) { 3023 if (!partial_page && ret) {
3018 ret = try_to_release_page(page, 0); 3024 ret = try_to_release_page(page, 0);
3019 /* maybe should BUG_ON(!ret); - neilb */ 3025 /* maybe should BUG_ON(!ret); - neilb */
3020 } 3026 }
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 14374530784c..123c79b7261e 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1277,13 +1277,14 @@ int ubifs_setattr(struct dentry *dentry, struct iattr *attr)
1277 return err; 1277 return err;
1278} 1278}
1279 1279
1280static void ubifs_invalidatepage(struct page *page, unsigned long offset) 1280static void ubifs_invalidatepage(struct page *page, unsigned int offset,
1281 unsigned int length)
1281{ 1282{
1282 struct inode *inode = page->mapping->host; 1283 struct inode *inode = page->mapping->host;
1283 struct ubifs_info *c = inode->i_sb->s_fs_info; 1284 struct ubifs_info *c = inode->i_sb->s_fs_info;
1284 1285
1285 ubifs_assert(PagePrivate(page)); 1286 ubifs_assert(PagePrivate(page));
1286 if (offset) 1287 if (offset || length < PAGE_CACHE_SIZE)
1287 /* Partial page remains dirty */ 1288 /* Partial page remains dirty */
1288 return; 1289 return;
1289 1290
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41a695048be7..596ec71da00e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -843,10 +843,12 @@ xfs_cluster_write(
843STATIC void 843STATIC void
844xfs_vm_invalidatepage( 844xfs_vm_invalidatepage(
845 struct page *page, 845 struct page *page,
846 unsigned long offset) 846 unsigned int offset,
847 unsigned int length)
847{ 848{
848 trace_xfs_invalidatepage(page->mapping->host, page, offset); 849 trace_xfs_invalidatepage(page->mapping->host, page, offset,
849 block_invalidatepage(page, offset); 850 length);
851 block_invalidatepage(page, offset, length);
850} 852}
851 853
852/* 854/*
@@ -910,7 +912,7 @@ next_buffer:
910 912
911 xfs_iunlock(ip, XFS_ILOCK_EXCL); 913 xfs_iunlock(ip, XFS_ILOCK_EXCL);
912out_invalidate: 914out_invalidate:
913 xfs_vm_invalidatepage(page, 0); 915 xfs_vm_invalidatepage(page, 0, PAGE_CACHE_SIZE);
914 return; 916 return;
915} 917}
916 918
@@ -940,7 +942,7 @@ xfs_vm_writepage(
940 int count = 0; 942 int count = 0;
941 int nonblocking = 0; 943 int nonblocking = 0;
942 944
943 trace_xfs_writepage(inode, page, 0); 945 trace_xfs_writepage(inode, page, 0, 0);
944 946
945 ASSERT(page_has_buffers(page)); 947 ASSERT(page_has_buffers(page));
946 948
@@ -1171,7 +1173,7 @@ xfs_vm_releasepage(
1171{ 1173{
1172 int delalloc, unwritten; 1174 int delalloc, unwritten;
1173 1175
1174 trace_xfs_releasepage(page->mapping->host, page, 0); 1176 trace_xfs_releasepage(page->mapping->host, page, 0, 0);
1175 1177
1176 xfs_count_page_state(page, &delalloc, &unwritten); 1178 xfs_count_page_state(page, &delalloc, &unwritten);
1177 1179
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index aa4db3307d36..a04701de6bbd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -974,14 +974,16 @@ DEFINE_RW_EVENT(xfs_file_splice_read);
 DEFINE_RW_EVENT(xfs_file_splice_write);
 
 DECLARE_EVENT_CLASS(xfs_page_class,
-	TP_PROTO(struct inode *inode, struct page *page, unsigned long off),
-	TP_ARGS(inode, page, off),
+	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
+		 unsigned int len),
+	TP_ARGS(inode, page, off, len),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
 		__field(pgoff_t, pgoff)
 		__field(loff_t, size)
 		__field(unsigned long, offset)
+		__field(unsigned int, length)
 		__field(int, delalloc)
 		__field(int, unwritten)
 	),
@@ -995,24 +997,27 @@ DECLARE_EVENT_CLASS(xfs_page_class,
 		__entry->pgoff = page_offset(page);
 		__entry->size = i_size_read(inode);
 		__entry->offset = off;
+		__entry->length = len;
 		__entry->delalloc = delalloc;
 		__entry->unwritten = unwritten;
 	),
 	TP_printk("dev %d:%d ino 0x%llx pgoff 0x%lx size 0x%llx offset %lx "
-		  "delalloc %d unwritten %d",
+		  "length %x delalloc %d unwritten %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->pgoff,
 		  __entry->size,
 		  __entry->offset,
+		  __entry->length,
 		  __entry->delalloc,
 		  __entry->unwritten)
 )
 
 #define DEFINE_PAGE_EVENT(name) \
 DEFINE_EVENT(xfs_page_class, name, \
-	TP_PROTO(struct inode *inode, struct page *page, unsigned long off), \
-	TP_ARGS(inode, page, off))
+	TP_PROTO(struct inode *inode, struct page *page, unsigned long off, \
+		 unsigned int len), \
+	TP_ARGS(inode, page, off, len))
 DEFINE_PAGE_EVENT(xfs_writepage);
 DEFINE_PAGE_EVENT(xfs_releasepage);
 DEFINE_PAGE_EVENT(xfs_invalidatepage);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 9e52b0626b39..f5a3b838ddb0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -198,7 +198,8 @@ extern int buffer_heads_over_limit;
  * Generic address_space_operations implementations for buffer_head-backed
  * address_spaces.
  */
-void block_invalidatepage(struct page *page, unsigned long offset);
+void block_invalidatepage(struct page *page, unsigned int offset,
+			  unsigned int length);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc);
 int block_write_full_page_endio(struct page *page, get_block_t *get_block,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7c30e3a62baf..f8a5240541b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -364,7 +364,7 @@ struct address_space_operations {
 
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
-	void (*invalidatepage) (struct page *, unsigned long);
+	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
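
The widened ->invalidatepage signature above hands the filesystem both the
byte offset and the length of the range being invalidated, so a partial-page
punch can be told apart from a full-page truncate.  A minimal sketch of a
conforming implementation (hypothetical "examplefs", modelled on the ubifs
and xfs hunks in this series -- not code from this patch set):

static void examplefs_invalidatepage(struct page *page, unsigned int offset,
				     unsigned int length)
{
	/* Partial invalidation: per-page private state must survive. */
	if (offset || length < PAGE_CACHE_SIZE)
		return;

	/* The whole page is gone: let the generic helper drop the buffers. */
	block_invalidatepage(page, offset, length);
}
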
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 7e0b622503c4..8685d1be12c7 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -27,7 +27,6 @@
 #include <linux/buffer_head.h>
 #include <linux/journal-head.h>
 #include <linux/stddef.h>
-#include <linux/bit_spinlock.h>
 #include <linux/mutex.h>
 #include <linux/timer.h>
 #include <linux/lockdep.h>
@@ -244,6 +243,31 @@ typedef struct journal_superblock_s
 
 #include <linux/fs.h>
 #include <linux/sched.h>
+
+enum jbd_state_bits {
+	BH_JBD			/* Has an attached ext3 journal_head */
+	  = BH_PrivateStart,
+	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
+	BH_Freed,		/* Has been freed (truncated) */
+	BH_Revoked,		/* Has been revoked from the log */
+	BH_RevokeValid,		/* Revoked flag is valid */
+	BH_JBDDirty,		/* Is dirty but journaled */
+	BH_State,		/* Pins most journal_head state */
+	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
+	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
+	BH_JBDPrivateStart,	/* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+
 #include <linux/jbd_common.h>
 
 #define J_ASSERT(assert)	BUG_ON(!(assert))
@@ -840,7 +864,7 @@ extern void journal_release_buffer (handle_t *, struct buffer_head *);
 extern int	 journal_forget (handle_t *, struct buffer_head *);
 extern void	 journal_sync_buffer (struct buffer_head *);
 extern void	 journal_invalidatepage(journal_t *,
-				struct page *, unsigned long);
+				struct page *, unsigned int, unsigned int);
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 journal_stop(handle_t *);
 extern int	 journal_flush (journal_t *);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6e051f472edb..d5b50a19463c 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -26,7 +26,6 @@
 #include <linux/buffer_head.h>
 #include <linux/journal-head.h>
 #include <linux/stddef.h>
-#include <linux/bit_spinlock.h>
 #include <linux/mutex.h>
 #include <linux/timer.h>
 #include <linux/slab.h>
@@ -57,17 +56,13 @@
  */
 #define JBD2_EXPENSIVE_CHECKING
 extern ushort jbd2_journal_enable_debug;
+void __jbd2_debug(int level, const char *file, const char *func,
+		  unsigned int line, const char *fmt, ...);
 
-#define jbd_debug(n, f, a...)					\
-	do {							\
-		if ((n) <= jbd2_journal_enable_debug) {		\
-			printk (KERN_DEBUG "(%s, %d): %s: ",	\
-				__FILE__, __LINE__, __func__);	\
-			printk (f, ## a);			\
-		}						\
-	} while (0)
+#define jbd_debug(n, fmt, a...) \
+	__jbd2_debug((n), __FILE__, __func__, __LINE__, (fmt), ##a)
 #else
-#define jbd_debug(f, a...)	/**/
+#define jbd_debug(n, fmt, a...)	/**/
 #endif
 
 extern void *jbd2_alloc(size_t size, gfp_t flags);
@@ -302,6 +297,34 @@ typedef struct journal_superblock_s
 
 #include <linux/fs.h>
 #include <linux/sched.h>
+
+enum jbd_state_bits {
+	BH_JBD			/* Has an attached ext3 journal_head */
+	  = BH_PrivateStart,
+	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
+	BH_Freed,		/* Has been freed (truncated) */
+	BH_Revoked,		/* Has been revoked from the log */
+	BH_RevokeValid,		/* Revoked flag is valid */
+	BH_JBDDirty,		/* Is dirty but journaled */
+	BH_State,		/* Pins most journal_head state */
+	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
+	BH_Shadow,		/* IO on shadow buffer is running */
+	BH_Verified,		/* Metadata block has been verified ok */
+	BH_JBDPrivateStart,	/* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+BUFFER_FNS(Shadow, shadow)
+BUFFER_FNS(Verified, verified)
+
 #include <linux/jbd_common.h>
 
 #define J_ASSERT(assert)	BUG_ON(!(assert))
@@ -382,8 +405,15 @@ struct jbd2_revoke_table_s;
 
 struct jbd2_journal_handle
 {
-	/* Which compound transaction is this update a part of? */
-	transaction_t		*h_transaction;
+	union {
+		/* Which compound transaction is this update a part of? */
+		transaction_t	*h_transaction;
+		/* Which journal handle belongs to - used iff h_reserved set */
+		journal_t	*h_journal;
+	};
+
+	/* Handle reserved for finishing the logical operation */
+	handle_t		*h_rsv_handle;
 
 	/* Number of remaining buffers we are allowed to dirty: */
 	int			h_buffer_credits;
@@ -398,6 +428,7 @@ struct jbd2_journal_handle
 
 	/* Flags [no locking] */
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
+	unsigned int	h_reserved:	1;	/* handle with reserved credits */
 	unsigned int	h_aborted:	1;	/* fatal error on handle */
 	unsigned int	h_type:		8;	/* for handle statistics */
 	unsigned int	h_line_no:	16;	/* for handle statistics */
@@ -524,12 +555,6 @@ struct transaction_s
 	struct journal_head	*t_checkpoint_io_list;
 
 	/*
-	 * Doubly-linked circular list of temporary buffers currently undergoing
-	 * IO in the log [j_list_lock]
-	 */
-	struct journal_head	*t_iobuf_list;
-
-	/*
 	 * Doubly-linked circular list of metadata buffers being shadowed by log
 	 * IO. The IO buffers on the iobuf list and the shadow buffers on this
 	 * list match each other one for one at all times. [j_list_lock]
@@ -537,12 +562,6 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * Doubly-linked circular list of control buffers being written to the
-	 * log. [j_list_lock]
-	 */
-	struct journal_head	*t_log_list;
-
-	/*
 	 * List of inodes whose data we've modified in data=ordered mode.
 	 * [j_list_lock]
 	 */
@@ -671,11 +690,10 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  *     waiting for checkpointing
  * @j_wait_transaction_locked: Wait queue for waiting for a locked transaction
  *     to start committing, or for a barrier lock to be released
- * @j_wait_logspace: Wait queue for waiting for checkpointing to complete
  * @j_wait_done_commit: Wait queue for waiting for commit to complete
- * @j_wait_checkpoint: Wait queue to trigger checkpointing
  * @j_wait_commit: Wait queue to trigger commit
  * @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_reserved: Wait queue to wait for reserved buffer credits to drop
  * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
  * @j_head: Journal head - identifies the first unused block in the journal
  * @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -689,6 +707,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
  *     journal
  * @j_fs_dev: Device which holds the client fs. For internal journal this will
  *     be equal to j_dev
+ * @j_reserved_credits: Number of buffers reserved from the running transaction
  * @j_maxlen: Total maximum capacity of the journal region on disk.
  * @j_list_lock: Protects the buffer lists and internal buffer state.
  * @j_inode: Optional inode where we store the journal. If present, all journal
@@ -778,21 +797,18 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_transaction_locked;
 
-	/* Wait queue for waiting for checkpointing to complete */
-	wait_queue_head_t	j_wait_logspace;
-
 	/* Wait queue for waiting for commit to complete */
 	wait_queue_head_t	j_wait_done_commit;
 
-	/* Wait queue to trigger checkpointing */
-	wait_queue_head_t	j_wait_checkpoint;
-
 	/* Wait queue to trigger commit */
 	wait_queue_head_t	j_wait_commit;
 
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for reserved buffer credits to drop */
+	wait_queue_head_t	j_wait_reserved;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct mutex		j_checkpoint_mutex;
 
@@ -847,6 +863,9 @@ struct journal_s
 	/* Total maximum capacity of the journal region on disk. */
 	unsigned int		j_maxlen;
 
+	/* Number of buffers reserved from the running transaction */
+	atomic_t		j_reserved_credits;
+
 	/*
 	 * Protects the buffer lists and internal buffer state.
 	 */
@@ -991,9 +1010,17 @@ extern void __jbd2_journal_file_buffer(struct journal_head *, transaction_t *, i
 extern void __journal_free_buffer(struct journal_head *bh);
 extern void jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int);
 extern void __journal_clean_data_list(transaction_t *transaction);
+static inline void jbd2_file_log_bh(struct list_head *head, struct buffer_head *bh)
+{
+	list_add_tail(&bh->b_assoc_buffers, head);
+}
+static inline void jbd2_unfile_log_bh(struct buffer_head *bh)
+{
+	list_del_init(&bh->b_assoc_buffers);
+}
 
 /* Log buffer allocation */
-extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
+struct buffer_head *jbd2_journal_get_descriptor_buffer(journal_t *journal);
 int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
 int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
 			      unsigned long *block);
@@ -1039,11 +1066,10 @@ extern void jbd2_buffer_abort_trigger(struct journal_head *jh,
 				      struct jbd2_buffer_trigger_type *triggers);
 
 /* Buffer IO */
-extern int
-jbd2_journal_write_metadata_buffer(transaction_t *transaction,
-				  struct journal_head *jh_in,
-				  struct journal_head **jh_out,
-				  unsigned long long blocknr);
+extern int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
+					      struct journal_head *jh_in,
+					      struct buffer_head **bh_out,
+					      sector_t blocknr);
 
 /* Transaction locking */
 extern void		__wait_on_journal (journal_t *);
@@ -1076,10 +1102,14 @@ static inline handle_t *journal_current_handle(void)
  */
 
 extern handle_t *jbd2_journal_start(journal_t *, int nblocks);
-extern handle_t *jbd2__journal_start(journal_t *, int nblocks, gfp_t gfp_mask,
-				     unsigned int type, unsigned int line_no);
+extern handle_t *jbd2__journal_start(journal_t *, int blocks, int rsv_blocks,
+				     gfp_t gfp_mask, unsigned int type,
+				     unsigned int line_no);
 extern int	 jbd2_journal_restart(handle_t *, int nblocks);
 extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
+extern int	 jbd2_journal_start_reserved(handle_t *handle,
+				unsigned int type, unsigned int line_no);
+extern void	 jbd2_journal_free_reserved(handle_t *handle);
 extern int	 jbd2_journal_extend (handle_t *, int nblocks);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
@@ -1090,7 +1120,7 @@ extern int jbd2_journal_dirty_metadata (handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_forget (handle_t *, struct buffer_head *);
 extern void	 journal_sync_buffer (struct buffer_head *);
 extern int	 jbd2_journal_invalidatepage(journal_t *,
-				struct page *, unsigned long);
+				struct page *, unsigned int, unsigned int);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
 extern int	 jbd2_journal_flush (journal_t *);
@@ -1125,6 +1155,7 @@ extern void jbd2_journal_ack_err (journal_t *);
 extern int	   jbd2_journal_clear_err  (journal_t *);
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
+extern int	   jbd2_journal_force_commit_nested(journal_t *);
 extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
@@ -1178,8 +1209,10 @@ extern int jbd2_journal_init_revoke_caches(void);
 extern void	   jbd2_journal_destroy_revoke(journal_t *);
 extern int	   jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
 extern int	   jbd2_journal_cancel_revoke(handle_t *, struct journal_head *);
-extern void	   jbd2_journal_write_revoke_records(journal_t *,
-						     transaction_t *, int);
+extern void	   jbd2_journal_write_revoke_records(journal_t *journal,
+						     transaction_t *transaction,
+						     struct list_head *log_bufs,
+						     int write_op);
 
 /* Recovery revoke support */
 extern int	jbd2_journal_set_revoke(journal_t *, unsigned long long, tid_t);
@@ -1195,11 +1228,9 @@ extern void jbd2_clear_buffer_revoked_flags(journal_t *journal);
  * transitions on demand.
  */
 
-int __jbd2_log_space_left(journal_t *); /* Called with journal locked */
 int jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
-int jbd2_journal_force_commit_nested(journal_t *journal);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
@@ -1235,7 +1266,7 @@ static inline int is_journal_aborted(journal_t *journal)
 
 static inline int is_handle_aborted(handle_t *handle)
 {
-	if (handle->h_aborted)
+	if (handle->h_aborted || !handle->h_transaction)
 		return 1;
 	return is_journal_aborted(handle->h_transaction->t_journal);
 }
@@ -1266,16 +1297,37 @@ extern int jbd2_journal_blocks_per_page(struct inode *inode);
 extern size_t journal_tag_bytes(journal_t *journal);
 
 /*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction control blocks.
+ */
+#define JBD2_CONTROL_BLOCKS_SHIFT 5
+
+/*
  * Return the minimum number of blocks which must be free in the journal
  * before a new transaction may be started.  Must be called under j_state_lock.
  */
-static inline int jbd_space_needed(journal_t *journal)
+static inline int jbd2_space_needed(journal_t *journal)
 {
 	int nblocks = journal->j_max_transaction_buffers;
-	if (journal->j_committing_transaction)
-		nblocks += atomic_read(&journal->j_committing_transaction->
-				       t_outstanding_credits);
-	return nblocks;
+	return nblocks + (nblocks >> JBD2_CONTROL_BLOCKS_SHIFT);
+}
+
+/*
+ * Return number of free blocks in the log. Must be called under j_state_lock.
+ */
+static inline unsigned long jbd2_log_space_left(journal_t *journal)
+{
+	/* Allow for rounding errors */
+	unsigned long free = journal->j_free - 32;
+
+	if (journal->j_committing_transaction) {
+		unsigned long committing = atomic_read(&journal->
+			j_committing_transaction->t_outstanding_credits);
+
+		/* Transaction + control blocks */
+		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
+	}
+	return free;
 }
 
 /*
@@ -1286,11 +1338,9 @@ static inline int jbd_space_needed(journal_t *journal)
 #define BJ_None		0	/* Not journaled */
 #define BJ_Metadata	1	/* Normal journaled metadata */
 #define BJ_Forget	2	/* Buffer superseded by this transaction */
-#define BJ_IO		3	/* Buffer is for temporary IO use */
-#define BJ_Shadow	4	/* Buffer contents being shadowed to the log */
-#define BJ_LogCtl	5	/* Buffer contains log descriptors */
-#define BJ_Reserved	6	/* Buffer is reserved for access by journal */
-#define BJ_Types	7
+#define BJ_Shadow	3	/* Buffer contents being shadowed to the log */
+#define BJ_Reserved	4	/* Buffer is reserved for access by journal */
+#define BJ_Types	5
 
 extern int jbd_blocks_per_page(struct inode *inode);
 
@@ -1319,6 +1369,19 @@ static inline u32 jbd2_chksum(journal_t *journal, u32 crc,
 	return *(u32 *)desc.ctx;
 }
 
+/* Return most recent uncommitted transaction */
+static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
+{
+	tid_t tid;
+
+	read_lock(&journal->j_state_lock);
+	tid = journal->j_commit_request;
+	if (journal->j_running_transaction)
+		tid = journal->j_running_transaction->t_tid;
+	read_unlock(&journal->j_state_lock);
+	return tid;
+}
+
 #ifdef __KERNEL__
 
 #define buffer_trace_init(bh)	do {} while (0)
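
Taken together, the handle changes above add a reservation mechanism: a
handle can be started with extra credits parked in h_rsv_handle and bound to
a transaction later, or dropped via jbd2_journal_free_reserved() if unused.
A sketch of one plausible calling sequence (illustrative only; everything
except the jbd2 functions is made up, the credit counts are arbitrary, and
error handling of the reserved handle is elided):

static int example_two_phase_op(journal_t *journal)
{
	handle_t *handle, *rsv;
	int err;

	/* Phase 1: 8 credits now, 4 more reserved for later. */
	handle = jbd2__journal_start(journal, 8, 4, GFP_NOFS, 0, 0);
	if (IS_ERR(handle))
		return PTR_ERR(handle);
	rsv = handle->h_rsv_handle;	/* carries the reserved credits */

	/* ... dirty up to 8 buffers under "handle" ... */
	err = jbd2_journal_stop(handle);
	if (err || !rsv)
		return err;

	/* Phase 2: bind the reserved credits to a running transaction. */
	err = jbd2_journal_start_reserved(rsv, 0, 0);
	if (err)
		return err;
	/* ... dirty up to 4 more buffers under "rsv" ... */
	return jbd2_journal_stop(rsv);
}
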
diff --git a/include/linux/jbd_common.h b/include/linux/jbd_common.h
index 6133679bc4c0..3dc53432355f 100644
--- a/include/linux/jbd_common.h
+++ b/include/linux/jbd_common.h
@@ -1,31 +1,7 @@
 #ifndef _LINUX_JBD_STATE_H
 #define _LINUX_JBD_STATE_H
 
-enum jbd_state_bits {
-	BH_JBD			/* Has an attached ext3 journal_head */
-	  = BH_PrivateStart,
-	BH_JWrite,		/* Being written to log (@@@ DEBUGGING) */
-	BH_Freed,		/* Has been freed (truncated) */
-	BH_Revoked,		/* Has been revoked from the log */
-	BH_RevokeValid,		/* Revoked flag is valid */
-	BH_JBDDirty,		/* Is dirty but journaled */
-	BH_State,		/* Pins most journal_head state */
-	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
-	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
-	BH_Verified,		/* Metadata block has been verified ok */
-	BH_JBDPrivateStart,	/* First bit available for private use by FS */
-};
-
-BUFFER_FNS(JBD, jbd)
-BUFFER_FNS(JWrite, jwrite)
-BUFFER_FNS(JBDDirty, jbddirty)
-TAS_BUFFER_FNS(JBDDirty, jbddirty)
-BUFFER_FNS(Revoked, revoked)
-TAS_BUFFER_FNS(Revoked, revoked)
-BUFFER_FNS(RevokeValid, revokevalid)
-TAS_BUFFER_FNS(RevokeValid, revokevalid)
-BUFFER_FNS(Freed, freed)
-BUFFER_FNS(Verified, verified)
+#include <linux/bit_spinlock.h>
 
 static inline struct buffer_head *jh2bh(struct journal_head *jh)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528a41a4..66d881f1d576 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1041,7 +1041,8 @@ int get_kernel_page(unsigned long start, int write, struct page **pages);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
-extern void do_invalidatepage(struct page *page, unsigned long offset);
+extern void do_invalidatepage(struct page *page, unsigned int offset,
+			      unsigned int length);
 
 int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
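
With offset and length both available to do_invalidatepage(), a caller that
punches a hole can invalidate just the affected bytes of each page.  A sketch
of the arithmetic (illustrative only; example_punch_in_page is not a function
from this patch series):

static void example_punch_in_page(struct page *page, loff_t start, loff_t len)
{
	unsigned int offset = start & (PAGE_CACHE_SIZE - 1);
	unsigned int length = min_t(loff_t, len, PAGE_CACHE_SIZE - offset);

	do_invalidatepage(page, offset, length);
}
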
diff --git a/include/trace/events/ext3.h b/include/trace/events/ext3.h
index 15d11a39be47..6797b9de90ed 100644
--- a/include/trace/events/ext3.h
+++ b/include/trace/events/ext3.h
@@ -290,13 +290,14 @@ DEFINE_EVENT(ext3__page_op, ext3_releasepage,
 );
 
 TRACE_EVENT(ext3_invalidatepage,
-	TP_PROTO(struct page *page, unsigned long offset),
+	TP_PROTO(struct page *page, unsigned int offset, unsigned int length),
 
-	TP_ARGS(page, offset),
+	TP_ARGS(page, offset, length),
 
 	TP_STRUCT__entry(
 		__field( pgoff_t, index			)
-		__field( unsigned long, offset		)
+		__field( unsigned int, offset		)
+		__field( unsigned int, length		)
 		__field( ino_t, ino			)
 		__field( dev_t, dev			)
 
@@ -305,14 +306,15 @@ TRACE_EVENT(ext3_invalidatepage,
 	TP_fast_assign(
 		__entry->index	= page->index;
 		__entry->offset	= offset;
+		__entry->length	= length;
 		__entry->ino	= page->mapping->host->i_ino;
 		__entry->dev	= page->mapping->host->i_sb->s_dev;
 	),
 
-	TP_printk("dev %d,%d ino %lu page_index %lu offset %lu",
+	TP_printk("dev %d,%d ino %lu page_index %lu offset %u length %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->index, __entry->offset)
+		  __entry->index, __entry->offset, __entry->length)
 );
 
 TRACE_EVENT(ext3_discard_blocks,
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 8ee15b97cd38..2068db241f22 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -19,6 +19,57 @@ struct extent_status;
 
 #define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))
 
+#define show_mballoc_flags(flags) __print_flags(flags, "|",	\
+	{ EXT4_MB_HINT_MERGE,		"HINT_MERGE" },		\
+	{ EXT4_MB_HINT_RESERVED,	"HINT_RESV" },		\
+	{ EXT4_MB_HINT_METADATA,	"HINT_MDATA" },		\
+	{ EXT4_MB_HINT_FIRST,		"HINT_FIRST" },		\
+	{ EXT4_MB_HINT_BEST,		"HINT_BEST" },		\
+	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
+	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
+	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
+	{ EXT4_MB_HINT_GOAL_ONLY,	"HINT_GOAL_ONLY" },	\
+	{ EXT4_MB_HINT_TRY_GOAL,	"HINT_TRY_GOAL" },	\
+	{ EXT4_MB_DELALLOC_RESERVED,	"DELALLOC_RESV" },	\
+	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
+	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
+	{ EXT4_MB_USE_RESERVED,		"USE_RESV" })
+
+#define show_map_flags(flags) __print_flags(flags, "|",			\
+	{ EXT4_GET_BLOCKS_CREATE,		"CREATE" },		\
+	{ EXT4_GET_BLOCKS_UNINIT_EXT,		"UNINIT" },		\
+	{ EXT4_GET_BLOCKS_DELALLOC_RESERVE,	"DELALLOC" },		\
+	{ EXT4_GET_BLOCKS_PRE_IO,		"PRE_IO" },		\
+	{ EXT4_GET_BLOCKS_CONVERT,		"CONVERT" },		\
+	{ EXT4_GET_BLOCKS_METADATA_NOFAIL,	"METADATA_NOFAIL" },	\
+	{ EXT4_GET_BLOCKS_NO_NORMALIZE,		"NO_NORMALIZE" },	\
+	{ EXT4_GET_BLOCKS_KEEP_SIZE,		"KEEP_SIZE" },		\
+	{ EXT4_GET_BLOCKS_NO_LOCK,		"NO_LOCK" },		\
+	{ EXT4_GET_BLOCKS_NO_PUT_HOLE,		"NO_PUT_HOLE" })
+
+#define show_mflags(flags) __print_flags(flags, "",	\
+	{ EXT4_MAP_NEW,		"N" },			\
+	{ EXT4_MAP_MAPPED,	"M" },			\
+	{ EXT4_MAP_UNWRITTEN,	"U" },			\
+	{ EXT4_MAP_BOUNDARY,	"B" },			\
+	{ EXT4_MAP_UNINIT,	"u" },			\
+	{ EXT4_MAP_FROM_CLUSTER, "C" })
+
+#define show_free_flags(flags) __print_flags(flags, "|",	\
+	{ EXT4_FREE_BLOCKS_METADATA,		"METADATA" },	\
+	{ EXT4_FREE_BLOCKS_FORGET,		"FORGET" },	\
+	{ EXT4_FREE_BLOCKS_VALIDATED,		"VALIDATED" },	\
+	{ EXT4_FREE_BLOCKS_NO_QUOT_UPDATE,	"NO_QUOTA" },	\
+	{ EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER,"1ST_CLUSTER" },\
+	{ EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER,	"LAST_CLUSTER" })
+
+#define show_extent_status(status) __print_flags(status, "",	\
+	{ (1 << 3),	"W" },					\
+	{ (1 << 2),	"U" },					\
+	{ (1 << 1),	"D" },					\
+	{ (1 << 0),	"H" })
+
+
 TRACE_EVENT(ext4_free_inode,
 	TP_PROTO(struct inode *inode),
 
@@ -281,7 +332,7 @@ DEFINE_EVENT(ext4__write_end, ext4_da_write_end,
 	TP_ARGS(inode, pos, len, copied)
 );
 
-TRACE_EVENT(ext4_da_writepages,
+TRACE_EVENT(ext4_writepages,
 	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
 
 	TP_ARGS(inode, wbc),
@@ -324,46 +375,62 @@ TRACE_EVENT(ext4_da_writepages,
 );
 
 TRACE_EVENT(ext4_da_write_pages,
-	TP_PROTO(struct inode *inode, struct mpage_da_data *mpd),
+	TP_PROTO(struct inode *inode, pgoff_t first_page,
+		 struct writeback_control *wbc),
 
-	TP_ARGS(inode, mpd),
+	TP_ARGS(inode, first_page, wbc),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
-		__field( __u64, b_blocknr )
-		__field( __u32, b_size )
-		__field( __u32, b_state )
-		__field( unsigned long, first_page )
-		__field( int, io_done )
-		__field( int, pages_written )
-		__field( int, sync_mode )
+		__field( pgoff_t, first_page )
+		__field( long, nr_to_write )
+		__field( int, sync_mode )
 	),
 
 	TP_fast_assign(
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = inode->i_ino;
-		__entry->b_blocknr = mpd->b_blocknr;
-		__entry->b_size = mpd->b_size;
-		__entry->b_state = mpd->b_state;
-		__entry->first_page = mpd->first_page;
-		__entry->io_done = mpd->io_done;
-		__entry->pages_written = mpd->pages_written;
-		__entry->sync_mode = mpd->wbc->sync_mode;
+		__entry->first_page = first_page;
+		__entry->nr_to_write = wbc->nr_to_write;
+		__entry->sync_mode = wbc->sync_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu b_blocknr %llu b_size %u b_state 0x%04x "
-		  "first_page %lu io_done %d pages_written %d sync_mode %d",
+	TP_printk("dev %d,%d ino %lu first_page %lu nr_to_write %ld "
+		  "sync_mode %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino,
-		  __entry->b_blocknr, __entry->b_size,
-		  __entry->b_state, __entry->first_page,
-		  __entry->io_done, __entry->pages_written,
-		  __entry->sync_mode
-		  )
+		  (unsigned long) __entry->ino, __entry->first_page,
+		  __entry->nr_to_write, __entry->sync_mode)
 );
 
-TRACE_EVENT(ext4_da_writepages_result,
+TRACE_EVENT(ext4_da_write_pages_extent,
+	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map),
+
+	TP_ARGS(inode, map),
+
+	TP_STRUCT__entry(
+		__field( dev_t, dev )
+		__field( ino_t, ino )
+		__field( __u64, lblk )
+		__field( __u32, len )
+		__field( __u32, flags )
+	),
+
+	TP_fast_assign(
+		__entry->dev = inode->i_sb->s_dev;
+		__entry->ino = inode->i_ino;
+		__entry->lblk = map->m_lblk;
+		__entry->len = map->m_len;
+		__entry->flags = map->m_flags;
+	),
+
+	TP_printk("dev %d,%d ino %lu lblk %llu len %u flags %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->lblk, __entry->len,
+		  show_mflags(__entry->flags))
+);
+
+TRACE_EVENT(ext4_writepages_result,
 	TP_PROTO(struct inode *inode, struct writeback_control *wbc,
 		 int ret, int pages_written),
 
@@ -444,16 +511,16 @@ DEFINE_EVENT(ext4__page_op, ext4_releasepage,
 );
 
 DECLARE_EVENT_CLASS(ext4_invalidatepage_op,
-	TP_PROTO(struct page *page, unsigned long offset),
+	TP_PROTO(struct page *page, unsigned int offset, unsigned int length),
 
-	TP_ARGS(page, offset),
+	TP_ARGS(page, offset, length),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
 		__field( pgoff_t, index )
-		__field( unsigned long, offset )
-
+		__field( unsigned int, offset )
+		__field( unsigned int, length )
 	),
 
@@ -461,24 +528,26 @@ DECLARE_EVENT_CLASS(ext4_invalidatepage_op,
 		__entry->ino	= page->mapping->host->i_ino;
 		__entry->index	= page->index;
 		__entry->offset	= offset;
+		__entry->length	= length;
 	),
 
-	TP_printk("dev %d,%d ino %lu page_index %lu offset %lu",
+	TP_printk("dev %d,%d ino %lu page_index %lu offset %u length %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  (unsigned long) __entry->index, __entry->offset)
+		  (unsigned long) __entry->index,
+		  __entry->offset, __entry->length)
 );
 
 DEFINE_EVENT(ext4_invalidatepage_op, ext4_invalidatepage,
-	TP_PROTO(struct page *page, unsigned long offset),
+	TP_PROTO(struct page *page, unsigned int offset, unsigned int length),
 
-	TP_ARGS(page, offset)
+	TP_ARGS(page, offset, length)
 );
 
 DEFINE_EVENT(ext4_invalidatepage_op, ext4_journalled_invalidatepage,
-	TP_PROTO(struct page *page, unsigned long offset),
+	TP_PROTO(struct page *page, unsigned int offset, unsigned int length),
 
-	TP_ARGS(page, offset)
+	TP_ARGS(page, offset, length)
 );
 
 TRACE_EVENT(ext4_discard_blocks,
@@ -673,10 +742,10 @@ TRACE_EVENT(ext4_request_blocks,
 		__entry->flags	= ar->flags;
 	),
 
-	TP_printk("dev %d,%d ino %lu flags %u len %u lblk %u goal %llu "
+	TP_printk("dev %d,%d ino %lu flags %s len %u lblk %u goal %llu "
 		  "lleft %u lright %u pleft %llu pright %llu ",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino, __entry->flags,
+		  (unsigned long) __entry->ino, show_mballoc_flags(__entry->flags),
 		  __entry->len, __entry->logical, __entry->goal,
 		  __entry->lleft, __entry->lright, __entry->pleft,
 		  __entry->pright)
@@ -715,10 +784,10 @@ TRACE_EVENT(ext4_allocate_blocks,
 		__entry->flags	= ar->flags;
 	),
 
-	TP_printk("dev %d,%d ino %lu flags %u len %u block %llu lblk %u "
+	TP_printk("dev %d,%d ino %lu flags %s len %u block %llu lblk %u "
 		  "goal %llu lleft %u lright %u pleft %llu pright %llu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino, __entry->flags,
+		  (unsigned long) __entry->ino, show_mballoc_flags(__entry->flags),
 		  __entry->len, __entry->block, __entry->logical,
 		  __entry->goal, __entry->lleft, __entry->lright,
 		  __entry->pleft, __entry->pright)
@@ -748,11 +817,11 @@ TRACE_EVENT(ext4_free_blocks,
 		__entry->mode		= inode->i_mode;
 	),
 
-	TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %d",
+	TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->mode, __entry->block, __entry->count,
-		  __entry->flags)
+		  show_free_flags(__entry->flags))
 );
 
 TRACE_EVENT(ext4_sync_file_enter,
@@ -903,7 +972,7 @@ TRACE_EVENT(ext4_mballoc_alloc,
 	),
 
 	TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u "
-		  "result %u/%d/%u@%u blks %u grps %u cr %u flags 0x%04x "
+		  "result %u/%d/%u@%u blks %u grps %u cr %u flags %s "
 		  "tail %u broken %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
@@ -914,7 +983,7 @@ TRACE_EVENT(ext4_mballoc_alloc,
 		  __entry->result_group, __entry->result_start,
 		  __entry->result_len, __entry->result_logical,
 		  __entry->found, __entry->groups, __entry->cr,
-		  __entry->flags, __entry->tail,
+		  show_mballoc_flags(__entry->flags), __entry->tail,
 		  __entry->buddy ? 1 << __entry->buddy : 0)
 );
 
@@ -1528,10 +1597,10 @@ DECLARE_EVENT_CLASS(ext4__map_blocks_enter,
 		__entry->flags	= flags;
 	),
 
-	TP_printk("dev %d,%d ino %lu lblk %u len %u flags %u",
+	TP_printk("dev %d,%d ino %lu lblk %u len %u flags %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len, __entry->flags)
+		  __entry->lblk, __entry->len, show_map_flags(__entry->flags))
 );
 
 DEFINE_EVENT(ext4__map_blocks_enter, ext4_ext_map_blocks_enter,
@@ -1549,47 +1618,53 @@ DEFINE_EVENT(ext4__map_blocks_enter, ext4_ind_map_blocks_enter,
 );
 
 DECLARE_EVENT_CLASS(ext4__map_blocks_exit,
-	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, int ret),
+	TP_PROTO(struct inode *inode, unsigned flags, struct ext4_map_blocks *map,
+		 int ret),
 
-	TP_ARGS(inode, map, ret),
+	TP_ARGS(inode, flags, map, ret),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
+		__field( unsigned int, flags )
 		__field( ext4_fsblk_t, pblk )
 		__field( ext4_lblk_t, lblk )
 		__field( unsigned int, len )
-		__field( unsigned int, flags )
+		__field( unsigned int, mflags )
 		__field( int, ret )
 	),
 
 	TP_fast_assign(
 		__entry->dev    = inode->i_sb->s_dev;
 		__entry->ino    = inode->i_ino;
+		__entry->flags	= flags;
 		__entry->pblk	= map->m_pblk;
 		__entry->lblk	= map->m_lblk;
 		__entry->len	= map->m_len;
-		__entry->flags	= map->m_flags;
+		__entry->mflags	= map->m_flags;
 		__entry->ret	= ret;
 	),
 
-	TP_printk("dev %d,%d ino %lu lblk %u pblk %llu len %u flags %x ret %d",
+	TP_printk("dev %d,%d ino %lu flags %s lblk %u pblk %llu len %u "
+		  "mflags %s ret %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->pblk,
-		  __entry->len, __entry->flags, __entry->ret)
+		  show_map_flags(__entry->flags), __entry->lblk, __entry->pblk,
+		  __entry->len, show_mflags(__entry->mflags), __entry->ret)
 );
 
 DEFINE_EVENT(ext4__map_blocks_exit, ext4_ext_map_blocks_exit,
-	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, int ret),
+	TP_PROTO(struct inode *inode, unsigned flags,
+		 struct ext4_map_blocks *map, int ret),
 
-	TP_ARGS(inode, map, ret)
+	TP_ARGS(inode, flags, map, ret)
 );
 
 DEFINE_EVENT(ext4__map_blocks_exit, ext4_ind_map_blocks_exit,
-	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, int ret),
+	TP_PROTO(struct inode *inode, unsigned flags,
+		 struct ext4_map_blocks *map, int ret),
 
-	TP_ARGS(inode, map, ret)
+	TP_ARGS(inode, flags, map, ret)
 );
 
 TRACE_EVENT(ext4_ext_load_extent,
@@ -1638,25 +1713,50 @@ TRACE_EVENT(ext4_load_inode,
 );
 
 TRACE_EVENT(ext4_journal_start,
-	TP_PROTO(struct super_block *sb, int nblocks, unsigned long IP),
+	TP_PROTO(struct super_block *sb, int blocks, int rsv_blocks,
+		 unsigned long IP),
 
-	TP_ARGS(sb, nblocks, IP),
+	TP_ARGS(sb, blocks, rsv_blocks, IP),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field(unsigned long, ip )
-		__field( int, nblocks )
+		__field( int, blocks )
+		__field( int, rsv_blocks )
 	),
 
 	TP_fast_assign(
 		__entry->dev	 = sb->s_dev;
 		__entry->ip	 = IP;
-		__entry->nblocks = nblocks;
+		__entry->blocks	 = blocks;
+		__entry->rsv_blocks = rsv_blocks;
 	),
 
-	TP_printk("dev %d,%d nblocks %d caller %pF",
+	TP_printk("dev %d,%d blocks, %d rsv_blocks, %d caller %pF",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->nblocks, (void *)__entry->ip)
+		  __entry->blocks, __entry->rsv_blocks, (void *)__entry->ip)
+);
+
+TRACE_EVENT(ext4_journal_start_reserved,
+	TP_PROTO(struct super_block *sb, int blocks, unsigned long IP),
+
+	TP_ARGS(sb, blocks, IP),
+
+	TP_STRUCT__entry(
+		__field( dev_t, dev )
+		__field(unsigned long, ip )
+		__field( int, blocks )
+	),
+
+	TP_fast_assign(
+		__entry->dev	 = sb->s_dev;
+		__entry->ip	 = IP;
+		__entry->blocks	 = blocks;
+	),
+
+	TP_printk("dev %d,%d blocks, %d caller %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->blocks, (void *)__entry->ip)
 );
 
 DECLARE_EVENT_CLASS(ext4__trim,
@@ -1736,12 +1836,12 @@ TRACE_EVENT(ext4_ext_handle_uninitialized_extents,
 		__entry->newblk	= newblock;
 	),
 
-	TP_printk("dev %d,%d ino %lu m_lblk %u m_pblk %llu m_len %u flags %x "
+	TP_printk("dev %d,%d ino %lu m_lblk %u m_pblk %llu m_len %u flags %s "
 		  "allocated %d newblock %llu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  (unsigned) __entry->lblk, (unsigned long long) __entry->pblk,
-		  __entry->len, __entry->flags,
+		  __entry->len, show_map_flags(__entry->flags),
 		  (unsigned int) __entry->allocated,
 		  (unsigned long long) __entry->newblk)
 );
@@ -1769,10 +1869,10 @@ TRACE_EVENT(ext4_get_implied_cluster_alloc_exit,
 		__entry->ret	= ret;
 	),
 
-	TP_printk("dev %d,%d m_lblk %u m_pblk %llu m_len %u m_flags %u ret %d",
+	TP_printk("dev %d,%d m_lblk %u m_pblk %llu m_len %u m_flags %s ret %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->lblk, (unsigned long long) __entry->pblk,
-		  __entry->len, __entry->flags, __entry->ret)
+		  __entry->len, show_mflags(__entry->flags), __entry->ret)
 );
 
 TRACE_EVENT(ext4_ext_put_in_cache,
@@ -1926,7 +2026,7 @@ TRACE_EVENT(ext4_ext_show_extent,
 TRACE_EVENT(ext4_remove_blocks,
 	    TP_PROTO(struct inode *inode, struct ext4_extent *ex,
 		     ext4_lblk_t from, ext4_fsblk_t to,
-		     ext4_fsblk_t partial_cluster),
+		     long long partial_cluster),
 
 	TP_ARGS(inode, ex, from, to, partial_cluster),
 
@@ -1935,7 +2035,7 @@ TRACE_EVENT(ext4_remove_blocks,
 		__field( ino_t, ino )
 		__field( ext4_lblk_t, from )
 		__field( ext4_lblk_t, to )
-		__field( ext4_fsblk_t, partial )
+		__field( long long, partial )
 		__field( ext4_fsblk_t, ee_pblk )
 		__field( ext4_lblk_t, ee_lblk )
 		__field( unsigned short, ee_len )
@@ -1953,7 +2053,7 @@ TRACE_EVENT(ext4_remove_blocks,
 	),
 
 	TP_printk("dev %d,%d ino %lu extent [%u(%llu), %u]"
-		  "from %u to %u partial_cluster %u",
+		  "from %u to %u partial_cluster %lld",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  (unsigned) __entry->ee_lblk,
@@ -1961,19 +2061,20 @@ TRACE_EVENT(ext4_remove_blocks,
 		  (unsigned short) __entry->ee_len,
 		  (unsigned) __entry->from,
 		  (unsigned) __entry->to,
-		  (unsigned) __entry->partial)
+		  (long long) __entry->partial)
 );
 
 TRACE_EVENT(ext4_ext_rm_leaf,
 	TP_PROTO(struct inode *inode, ext4_lblk_t start,
-		 struct ext4_extent *ex, ext4_fsblk_t partial_cluster),
+		 struct ext4_extent *ex,
+		 long long partial_cluster),
 
 	TP_ARGS(inode, start, ex, partial_cluster),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
-		__field( ext4_fsblk_t, partial )
+		__field( long long, partial )
 		__field( ext4_lblk_t, start )
 		__field( ext4_lblk_t, ee_lblk )
 		__field( ext4_fsblk_t, ee_pblk )
@@ -1991,14 +2092,14 @@ TRACE_EVENT(ext4_ext_rm_leaf,
 	),
 
 	TP_printk("dev %d,%d ino %lu start_lblk %u last_extent [%u(%llu), %u]"
-		  "partial_cluster %u",
+		  "partial_cluster %lld",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  (unsigned) __entry->start,
 		  (unsigned) __entry->ee_lblk,
 		  (unsigned long long) __entry->ee_pblk,
 		  (unsigned short) __entry->ee_len,
-		  (unsigned) __entry->partial)
+		  (long long) __entry->partial)
 );
 
 TRACE_EVENT(ext4_ext_rm_idx,
@@ -2025,14 +2126,16 @@ TRACE_EVENT(ext4_ext_rm_idx,
 );
 
 TRACE_EVENT(ext4_ext_remove_space,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth),
+	TP_PROTO(struct inode *inode, ext4_lblk_t start,
+		 ext4_lblk_t end, int depth),
 
-	TP_ARGS(inode, start, depth),
+	TP_ARGS(inode, start, end, depth),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
 		__field( ext4_lblk_t, start )
+		__field( ext4_lblk_t, end )
 		__field( int, depth )
 	),
 
@@ -2040,28 +2143,31 @@ TRACE_EVENT(ext4_ext_remove_space,
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = inode->i_ino;
 		__entry->start = start;
+		__entry->end = end;
 		__entry->depth = depth;
 	),
 
-	TP_printk("dev %d,%d ino %lu since %u depth %d",
+	TP_printk("dev %d,%d ino %lu since %u end %u depth %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  (unsigned) __entry->start,
+		  (unsigned) __entry->end,
 		  __entry->depth)
 );
 
 TRACE_EVENT(ext4_ext_remove_space_done,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth,
-		 ext4_lblk_t partial, __le16 eh_entries),
+	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t end,
+		 int depth, long long partial, __le16 eh_entries),
 
-	TP_ARGS(inode, start, depth, partial, eh_entries),
+	TP_ARGS(inode, start, end, depth, partial, eh_entries),
 
 	TP_STRUCT__entry(
 		__field( dev_t, dev )
 		__field( ino_t, ino )
 		__field( ext4_lblk_t, start )
+		__field( ext4_lblk_t, end )
 		__field( int, depth )
-		__field( ext4_lblk_t, partial )
+		__field( long long, partial )
 		__field( unsigned short, eh_entries )
 	),
 
@@ -2069,18 +2175,20 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = inode->i_ino;
 		__entry->start = start;
+		__entry->end = end;
 		__entry->depth = depth;
 		__entry->partial = partial;
 		__entry->eh_entries = le16_to_cpu(eh_entries);
 	),
 
-	TP_printk("dev %d,%d ino %lu since %u depth %d partial %u "
+	TP_printk("dev %d,%d ino %lu since %u end %u depth %d partial %lld "
 		  "remaining_entries %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  (unsigned) __entry->start,
+		  (unsigned) __entry->end,
 		  __entry->depth,
-		  (unsigned) __entry->partial,
+		  (long long) __entry->partial,
 		  (unsigned short) __entry->eh_entries)
 );
 
@@ -2095,7 +2203,7 @@ TRACE_EVENT(ext4_es_insert_extent,
 		__field( ext4_lblk_t, lblk )
 		__field( ext4_lblk_t, len )
 		__field( ext4_fsblk_t, pblk )
-		__field( unsigned long long, status )
+		__field( char, status )
 	),
 
 	TP_fast_assign(
@@ -2104,14 +2212,14 @@ TRACE_EVENT(ext4_es_insert_extent,
 		__entry->lblk = es->es_lblk;
 		__entry->len = es->es_len;
 		__entry->pblk = ext4_es_pblock(es);
-		__entry->status = ext4_es_status(es);
+		__entry->status = ext4_es_status(es) >> 60;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->lblk, __entry->len,
-		  __entry->pblk, __entry->status)
+		  __entry->pblk, show_extent_status(__entry->status))
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
@@ -2172,7 +2280,7 @@ TRACE_EVENT(ext4_es_find_delayed_extent_range_exit,
 		__field( ext4_lblk_t, lblk )
 		__field( ext4_lblk_t, len )
 		__field( ext4_fsblk_t, pblk )
-		__field( unsigned long long, status )
+		__field( char, status )
 	),
 
 	TP_fast_assign(
@@ -2181,14 +2289,14 @@ TRACE_EVENT(ext4_es_find_delayed_extent_range_exit,
 		__entry->lblk = es->es_lblk;
 		__entry->len = es->es_len;
 		__entry->pblk = ext4_es_pblock(es);
-		__entry->status = ext4_es_status(es);
+		__entry->status = ext4_es_status(es) >> 60;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
 		  __entry->lblk, __entry->len,
-		  __entry->pblk, __entry->status)
+		  __entry->pblk, show_extent_status(__entry->status))
 );
 
 TRACE_EVENT(ext4_es_lookup_extent_enter,
@@ -2225,7 +2333,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
 		__field( ext4_lblk_t, lblk )
 		__field( ext4_lblk_t, len )
 		__field( ext4_fsblk_t, pblk )
-		__field( unsigned long long, status )
+		__field( char, status )
 		__field( int, found )
 	),
 
@@ -2235,16 +2343,16 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
 		__entry->lblk = es->es_lblk;
 		__entry->len = es->es_len;
 		__entry->pblk = ext4_es_pblock(es);
-		__entry->status = ext4_es_status(es);
+		__entry->status = ext4_es_status(es) >> 60;
 		__entry->found = found;
 	),
 
-	TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %llx",
+	TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->found,
 		  __entry->lblk, __entry->len,
 		  __entry->found ? __entry->pblk : 0,
-		  __entry->found ? __entry->status : 0)
+		  show_extent_status(__entry->found ? __entry->status : 0))
 );
 
 TRACE_EVENT(ext4_es_shrink_enter,
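
The ">> 60" in the tracepoints above relies on the extent-status flags living in the top bits of the 64-bit es_pblk word, with the physical block number in the low bits: the shift turns ext4_es_status()'s masked value into a small integer that fits the new 'char' field and that show_extent_status() can pretty-print. A minimal userspace sketch of that packing follows; the constants and helper names are illustrative, not the kernel's.

#include <stdio.h>
#include <stdint.h>

#define ES_SHIFT 60				/* assumed: flags occupy bits 63..60 */
#define ES_MASK  (~((1ULL << ES_SHIFT) - 1))

/* low 60 bits: physical block number */
static uint64_t es_pblock(uint64_t es_pblk) { return es_pblk & ~ES_MASK; }
/* top 4 bits: status flags, still in place (cf. ext4_es_status()) */
static uint64_t es_status(uint64_t es_pblk) { return es_pblk & ES_MASK; }

int main(void)
{
	/* a "written"-style flag in bit 63 plus physical block 123456 */
	uint64_t es_pblk = (1ULL << 63) | 123456;

	/* the tracepoints store es_status() >> 60, i.e. 0x8 here */
	printf("pblk=%llu status=0x%llx\n",
	       (unsigned long long)es_pblock(es_pblk),
	       (unsigned long long)(es_status(es_pblk) >> ES_SHIFT));
	return 0;
}
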
diff --git a/mm/readahead.c b/mm/readahead.c
index daed28dd5830..829a77c62834 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -48,7 +48,7 @@ static void read_cache_pages_invalidate_page(struct address_space *mapping,
 	if (!trylock_page(page))
 		BUG();
 	page->mapping = mapping;
-	do_invalidatepage(page, 0);
+	do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
 	page->mapping = NULL;
 	unlock_page(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736e54b7..e2e8a8a7eb9d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -26,7 +26,8 @@
 /**
  * do_invalidatepage - invalidate part or all of a page
  * @page: the page which is affected
- * @offset: the index of the truncation point
+ * @offset: start of the range to invalidate
+ * @length: length of the range to invalidate
  *
  * do_invalidatepage() is called when all or part of the page has become
  * invalidated by a truncate operation.
@@ -37,24 +38,18 @@
  * point.  Because the caller is about to free (and possibly reuse) those
  * blocks on-disk.
  */
-void do_invalidatepage(struct page *page, unsigned long offset)
+void do_invalidatepage(struct page *page, unsigned int offset,
+		       unsigned int length)
 {
-	void (*invalidatepage)(struct page *, unsigned long);
+	void (*invalidatepage)(struct page *, unsigned int, unsigned int);
+
 	invalidatepage = page->mapping->a_ops->invalidatepage;
 #ifdef CONFIG_BLOCK
 	if (!invalidatepage)
 		invalidatepage = block_invalidatepage;
 #endif
 	if (invalidatepage)
-		(*invalidatepage)(page, offset);
-}
-
-static inline void truncate_partial_page(struct page *page, unsigned partial)
-{
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
-	cleancache_invalidate_page(page->mapping, page);
-	if (page_has_private(page))
-		do_invalidatepage(page, partial);
+		(*invalidatepage)(page, offset, length);
 }
 
 /*
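
With the extra length argument, an ->invalidatepage implementation can tell a whole-page invalidation (offset 0, length PAGE_CACHE_SIZE, as the updated callers above pass) from a partial one. A hedged sketch of the shape such a handler now takes; example_invalidatepage is hypothetical, only the signature comes from the patch.

static void example_invalidatepage(struct page *page, unsigned int offset,
				   unsigned int length)
{
	if (offset == 0 && length == PAGE_CACHE_SIZE) {
		/* the whole page is going away: drop all private state */
	} else {
		/* only bytes [offset, offset + length) are invalidated;
		 * private state outside that range presumably survives */
	}
}
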
@@ -103,7 +98,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 		return -EIO;
 
 	if (page_has_private(page))
-		do_invalidatepage(page, 0);
+		do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
 
 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 
@@ -185,11 +180,11 @@ int invalidate_inode_page(struct page *page)
  * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
- * @lend: offset to which to truncate
+ * @lend: offset to which to truncate (inclusive)
  *
  * Truncate the page cache, removing the pages that are between
- * specified offsets (and zeroing out partial page
- * (if lstart is not page aligned)).
+ * specified offsets (and zeroing out partial pages
+ * if lstart or lend + 1 is not page aligned).
  *
  * Truncate takes two passes - the first pass is nonblocking.  It will not
  * block on page locks and it will not block on writeback.  The second pass
@@ -200,35 +195,58 @@ int invalidate_inode_page(struct page *page)
  * We pass down the cache-hot hint to the page freeing code.  Even if the
  * mapping is large, it is probably the case that the final pages are the most
  * recently touched, and freeing happens in ascending file offset order.
+ *
+ * Note that since ->invalidatepage() accepts range to invalidate
+ * truncate_inode_pages_range is able to handle cases where lend + 1 is not
+ * page aligned properly.
  */
 void truncate_inode_pages_range(struct address_space *mapping,
 				loff_t lstart, loff_t lend)
 {
-	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
-	struct pagevec pvec;
-	pgoff_t index;
-	pgoff_t end;
-	int i;
+	pgoff_t		start;		/* inclusive */
+	pgoff_t		end;		/* exclusive */
+	unsigned int	partial_start;	/* inclusive */
+	unsigned int	partial_end;	/* exclusive */
+	struct pagevec	pvec;
+	pgoff_t		index;
+	int		i;
 
 	cleancache_invalidate_inode(mapping);
 	if (mapping->nrpages == 0)
 		return;
 
-	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
-	end = (lend >> PAGE_CACHE_SHIFT);
+	/* Offsets within partial pages */
+	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
+	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
+
+	/*
+	 * 'start' and 'end' always covers the range of pages to be fully
+	 * truncated. Partial pages are covered with 'partial_start' at the
+	 * start of the range and 'partial_end' at the end of the range.
+	 * Note that 'end' is exclusive while 'lend' is inclusive.
+	 */
+	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	if (lend == -1)
+		/*
+		 * lend == -1 indicates end-of-file so we have to set 'end'
+		 * to the highest possible pgoff_t and since the type is
+		 * unsigned we're using -1.
+		 */
+		end = -1;
+	else
+		end = (lend + 1) >> PAGE_CACHE_SHIFT;
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index < end && pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
 			index = page->index;
-			if (index > end)
+			if (index >= end)
 				break;
 
 			if (!trylock_page(page))
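
To make the new bookkeeping concrete, here is a small userspace sketch of the start/end/partial computation above, assuming 4 KiB pages and the inclusive byte range 1536..10239; the local names mirror the kernel's, nothing else is taken from it.

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12			/* assumed 4 KiB pages */
#define PAGE_CACHE_SIZE  (1L << PAGE_CACHE_SHIFT)

int main(void)
{
	long long lstart = 1536, lend = 10239;	/* lend is inclusive */
	unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1);
	unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
	unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	unsigned long end = (lend == -1) ? (unsigned long)-1
			: (unsigned long)((lend + 1) >> PAGE_CACHE_SHIFT);

	/* prints: start=1 end=2 partial_start=1536 partial_end=2048
	 * page 1 ([start, end)) is dropped whole; page 0 keeps bytes
	 * 0..1535 and page 2 keeps bytes 2048..4095, with the truncated
	 * part of each partial page zeroed rather than freed */
	printf("start=%lu end=%lu partial_start=%u partial_end=%u\n",
	       start, end, partial_start, partial_end);
	return 0;
}
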
@@ -247,27 +265,56 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			index++;
 	}
 
-	if (partial) {
+	if (partial_start) {
 		struct page *page = find_lock_page(mapping, start - 1);
 		if (page) {
+			unsigned int top = PAGE_CACHE_SIZE;
+			if (start > end) {
+				/* Truncation within a single page */
+				top = partial_end;
+				partial_end = 0;
+			}
 			wait_on_page_writeback(page);
-			truncate_partial_page(page, partial);
+			zero_user_segment(page, partial_start, top);
+			cleancache_invalidate_page(mapping, page);
+			if (page_has_private(page))
+				do_invalidatepage(page, partial_start,
+						  top - partial_start);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
+	if (partial_end) {
+		struct page *page = find_lock_page(mapping, end);
+		if (page) {
+			wait_on_page_writeback(page);
+			zero_user_segment(page, 0, partial_end);
+			cleancache_invalidate_page(mapping, page);
+			if (page_has_private(page))
+				do_invalidatepage(page, 0,
+						  partial_end);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+	}
+	/*
+	 * If the truncation happened within a single page no pages
+	 * will be released, just zeroed, so we can bail out now.
+	 */
+	if (start >= end)
+		return;
 
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
 		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index > end) {
+		if (index == start && pvec.pages[0]->index >= end) {
 			pagevec_release(&pvec);
 			break;
 		}
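
The start > end branch in the hunk above covers a hole that lies entirely within one page. Working the arithmetic with the same 4 KiB assumption: for lstart = 100, lend = 199, partial_start = 100, partial_end = 200, start = 1 and end = 0, so start > end holds, top is set to 200, bytes [100, 200) of page 0 are zeroed in a single zero_user_segment() call, partial_end is cleared so the second partial-page block is skipped, and start >= end makes the function return before the page-freeing loop, since no whole page needs releasing.
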
@@ -277,7 +324,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 		/* We rely upon deletion not changing page->index */
 		index = page->index;
-		if (index > end)
+		if (index >= end)
 			break;
 
 		lock_page(page);
@@ -598,10 +645,8 @@ void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
 	 * This rounding is currently just for example: unmap_mapping_range
 	 * expands its hole outwards, whereas we want it to contract the hole
 	 * inwards.  However, existing callers of truncate_pagecache_range are
-	 * doing their own page rounding first; and truncate_inode_pages_range
-	 * currently BUGs if lend is not pagealigned-1 (it handles partial
-	 * page at start of hole, but not partial page at end of hole).  Note
-	 * unmap_mapping_range allows holelen 0 for all, and we allow lend -1.
+	 * doing their own page rounding first.  Note that unmap_mapping_range
+	 * allows holelen 0 for all, and we allow lend -1 for end of file.
 	 */
 
 	/*
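
With the BUG_ON on unaligned lend removed, hole-punching paths can hand truncate_inode_pages_range() an arbitrary inclusive byte range instead of a page-aligned-minus-one one. A hedged sketch of the calling convention; the surrounding context is hypothetical, only the function and its (mapping, lstart, lend) signature come from this patch.

	/* punch bytes 100..199 of this inode out of the page cache */
	truncate_inode_pages_range(inode->i_mapping, 100, 199);

	/* lend == -1 still means "to end of file" */
	truncate_inode_pages_range(inode->i_mapping, 0, -1);
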