author    Linus Torvalds <torvalds@linux-foundation.org>  2014-06-02 12:29:34 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>  2014-06-02 12:29:34 -0400
commit    681a2895486243a82547d8c9f53043eb54b53da0 (patch)
tree      464273280aed6db55a99cc0d8614d4393f94fc48 /block
parent    6c52486dedbb30a1313da64945dcd686b4579c51 (diff)
parent    ed851860b4552fc8963ecf71eab9f6f7a5c19d74 (diff)
Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into next
Pull block core updates from Jens Axboe:
 "It's a big(ish) round this time, lots of development effort has gone
  into blk-mq in the last 3 months.  Generally we're heading to where
  3.16 will be a feature complete and performant blk-mq.  scsi-mq is
  progressing nicely and will hopefully be in 3.17.  A nvme port is in
  progress, and the Micron pci-e flash driver, mtip32xx, is converted
  and will be sent in with the driver pull request for 3.16.

  This pull request contains:

   - Lots of prep and support patches for scsi-mq have been integrated.
     All from Christoph.

   - API and code cleanups for blk-mq from Christoph.

   - Lots of good corner case and error handling cleanup fixes for
     blk-mq from Ming Lei.

   - A slew of blk-mq updates from me:

     * Provide strict mappings so that the driver can rely on the CPU
       to queue mapping.  This enables optimizations in the driver.

     * Provided a bitmap tagging instead of percpu_ida, which never
       really worked well for blk-mq.  percpu_ida relies on the fact
       that we have a lot more tags available than we really need, it
       fails miserably for cases where we exhaust (or are close to
       exhausting) the tag space.

     * Provide sane support for shared tag maps, as utilized by scsi-mq.

     * Various fixes for IO timeouts.

     * API cleanups, and lots of perf tweaks and optimizations.

   - Remove 'buffer' from struct request.  This is ancient code, from
     when requests were always virtually mapped.  Kill it, to reclaim
     some space in struct request.  From me.

   - Remove 'magic' from blk_plug.  Since we store these on the stack
     and since we've never caught any actual bugs with this, let's just
     get rid of it.  From me.

   - Only call part_in_flight() once for IO completion, as it includes
     two atomic reads.  Hopefully we'll get a better implementation
     soon, as the part IO stats are now one of the more expensive parts
     of doing IO on blk-mq.  From me.

   - File migration of block code from {mm,fs}/ to block/.  This
     includes bio.c, bio-integrity.c, bounce.c, and ioprio.c.  From me,
     from a discussion on lkml.

  That should describe the meat of the pull request.  Also has various
  little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong,
  Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam
  Bradshaw"

* 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits)
  blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()
  blk-mq: remember to start timeout handler for direct queue
  block: ensure that the timer is always added
  blk-mq: blk_mq_unregister_hctx() can be static
  blk-mq: make the sysfs mq/ layout reflect current mappings
  blk-mq: blk_mq_tag_to_rq should handle flush request
  block: remove dead code in scsi_ioctl:blk_verify_command
  blk-mq: request initialization optimizations
  block: add queue flag for disabling SG merging
  block: remove 'magic' from struct blk_plug
  blk-mq: remove alloc_hctx and free_hctx methods
  blk-mq: add file comments and update copyright notices
  blk-mq: remove blk_mq_alloc_request_pinned
  blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
  blk-mq: remove blk_mq_wait_for_tags
  blk-mq: initialize request in __blk_mq_alloc_request
  blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
  blk-mq: add helper to insert requests from irq context
  blk-mq: remove stale comment for blk_mq_complete_request()
  blk-mq: allow non-softirq completions
  ...
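For orientation, a minimal sketch of the driver-facing side of the shared tag-map work described above; the ops structure, command type, queue depth and pool sizes are hypothetical placeholders, and the real conversions live in drivers such as null_blk and mtip32xx. This is a sketch assuming the 3.16-era blk-mq API, not part of this merge:

	#include <linux/blk-mq.h>

	struct my_cmd { int tag; };			/* hypothetical per-request payload */
	static struct blk_mq_ops my_mq_ops;		/* .queue_rq etc. supplied by a real driver */

	static struct blk_mq_tag_set my_tag_set = {
		.ops		= &my_mq_ops,
		.nr_hw_queues	= 1,
		.queue_depth	= 64,			/* served by the new bitmap tags */
		.numa_node	= NUMA_NO_NODE,
		.cmd_size	= sizeof(struct my_cmd),
		.flags		= BLK_MQ_F_SHOULD_MERGE,
	};

	static int my_driver_init(void)
	{
		struct request_queue *q;
		int ret;

		ret = blk_mq_alloc_tag_set(&my_tag_set);
		if (ret)
			return ret;

		/* queues created from the same tag_set share its tag map */
		q = blk_mq_init_queue(&my_tag_set);
		if (IS_ERR(q)) {
			blk_mq_free_tag_set(&my_tag_set);
			return PTR_ERR(q);
		}
		return 0;
	}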
Diffstat (limited to 'block')
-rw-r--r--  block/Makefile         |    7
-rw-r--r--  block/bio-integrity.c  |  657
-rw-r--r--  block/bio.c            | 2038
-rw-r--r--  block/blk-core.c       |  113
-rw-r--r--  block/blk-flush.c      |   40
-rw-r--r--  block/blk-iopoll.c     |    4
-rw-r--r--  block/blk-lib.c        |    4
-rw-r--r--  block/blk-map.c        |    3
-rw-r--r--  block/blk-merge.c      |   28
-rw-r--r--  block/blk-mq-cpu.c     |   17
-rw-r--r--  block/blk-mq-cpumap.c  |   27
-rw-r--r--  block/blk-mq-sysfs.c   |  160
-rw-r--r--  block/blk-mq-tag.c     |  561
-rw-r--r--  block/blk-mq-tag.h     |   71
-rw-r--r--  block/blk-mq.c         | 1415
-rw-r--r--  block/blk-mq.h         |   32
-rw-r--r--  block/blk-sysfs.c      |   47
-rw-r--r--  block/blk-throttle.c   |   10
-rw-r--r--  block/blk-timeout.c    |   60
-rw-r--r--  block/blk.h            |    9
-rw-r--r--  block/bounce.c         |  287
-rw-r--r--  block/cfq-iosched.c    |    4
-rw-r--r--  block/ioprio.c         |  241
-rw-r--r--  block/scsi_ioctl.c     |    4
24 files changed, 5083 insertions, 756 deletions
diff --git a/block/Makefile b/block/Makefile
index 20645e88fb57..a2ce6ac935ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -2,13 +2,15 @@
 # Makefile for the kernel block layer
 #
 
-obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
+obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
 			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
-			genhd.o scsi_ioctl.o partition-generic.o partitions/
+			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
+			partitions/
 
+obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
@@ -20,3 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
+obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
new file mode 100644
index 000000000000..9e241063a616
--- /dev/null
+++ b/block/bio-integrity.c
@@ -0,0 +1,657 @@
1/*
2 * bio-integrity.c - bio data integrity extensions
3 *
4 * Copyright (C) 2007, 2008, 2009 Oracle Corporation
5 * Written by: Martin K. Petersen <martin.petersen@oracle.com>
6 *
7 * This program is free software; you can redistribute it and/or
8 * modify it under the terms of the GNU General Public License version
9 * 2 as published by the Free Software Foundation.
10 *
11 * This program is distributed in the hope that it will be useful, but
12 * WITHOUT ANY WARRANTY; without even the implied warranty of
13 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
14 * General Public License for more details.
15 *
16 * You should have received a copy of the GNU General Public License
17 * along with this program; see the file COPYING. If not, write to
18 * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
19 * USA.
20 *
21 */
22
23#include <linux/blkdev.h>
24#include <linux/mempool.h>
25#include <linux/export.h>
26#include <linux/bio.h>
27#include <linux/workqueue.h>
28#include <linux/slab.h>
29
30#define BIP_INLINE_VECS 4
31
32static struct kmem_cache *bip_slab;
33static struct workqueue_struct *kintegrityd_wq;
34
35/**
36 * bio_integrity_alloc - Allocate integrity payload and attach it to bio
37 * @bio: bio to attach integrity metadata to
38 * @gfp_mask: Memory allocation mask
39 * @nr_vecs: Number of integrity metadata scatter-gather elements
40 *
41 * Description: This function prepares a bio for attaching integrity
42 * metadata. nr_vecs specifies the maximum number of pages containing
43 * integrity metadata that can be attached.
44 */
45struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
46 gfp_t gfp_mask,
47 unsigned int nr_vecs)
48{
49 struct bio_integrity_payload *bip;
50 struct bio_set *bs = bio->bi_pool;
51 unsigned long idx = BIO_POOL_NONE;
52 unsigned inline_vecs;
53
54 if (!bs) {
55 bip = kmalloc(sizeof(struct bio_integrity_payload) +
56 sizeof(struct bio_vec) * nr_vecs, gfp_mask);
57 inline_vecs = nr_vecs;
58 } else {
59 bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
60 inline_vecs = BIP_INLINE_VECS;
61 }
62
63 if (unlikely(!bip))
64 return NULL;
65
66 memset(bip, 0, sizeof(*bip));
67
68 if (nr_vecs > inline_vecs) {
69 bip->bip_vec = bvec_alloc(gfp_mask, nr_vecs, &idx,
70 bs->bvec_integrity_pool);
71 if (!bip->bip_vec)
72 goto err;
73 } else {
74 bip->bip_vec = bip->bip_inline_vecs;
75 }
76
77 bip->bip_slab = idx;
78 bip->bip_bio = bio;
79 bio->bi_integrity = bip;
80
81 return bip;
82err:
83 mempool_free(bip, bs->bio_integrity_pool);
84 return NULL;
85}
86EXPORT_SYMBOL(bio_integrity_alloc);
87
88/**
89 * bio_integrity_free - Free bio integrity payload
90 * @bio: bio containing bip to be freed
91 *
92 * Description: Used to free the integrity portion of a bio. Usually
93 * called from bio_free().
94 */
95void bio_integrity_free(struct bio *bio)
96{
97 struct bio_integrity_payload *bip = bio->bi_integrity;
98 struct bio_set *bs = bio->bi_pool;
99
100 if (bip->bip_owns_buf)
101 kfree(bip->bip_buf);
102
103 if (bs) {
104 if (bip->bip_slab != BIO_POOL_NONE)
105 bvec_free(bs->bvec_integrity_pool, bip->bip_vec,
106 bip->bip_slab);
107
108 mempool_free(bip, bs->bio_integrity_pool);
109 } else {
110 kfree(bip);
111 }
112
113 bio->bi_integrity = NULL;
114}
115EXPORT_SYMBOL(bio_integrity_free);
116
117static inline unsigned int bip_integrity_vecs(struct bio_integrity_payload *bip)
118{
119 if (bip->bip_slab == BIO_POOL_NONE)
120 return BIP_INLINE_VECS;
121
122 return bvec_nr_vecs(bip->bip_slab);
123}
124
125/**
126 * bio_integrity_add_page - Attach integrity metadata
127 * @bio: bio to update
128 * @page: page containing integrity metadata
129 * @len: number of bytes of integrity metadata in page
130 * @offset: start offset within page
131 *
132 * Description: Attach a page containing integrity metadata to bio.
133 */
134int bio_integrity_add_page(struct bio *bio, struct page *page,
135 unsigned int len, unsigned int offset)
136{
137 struct bio_integrity_payload *bip = bio->bi_integrity;
138 struct bio_vec *iv;
139
140 if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
141 printk(KERN_ERR "%s: bip_vec full\n", __func__);
142 return 0;
143 }
144
145 iv = bip->bip_vec + bip->bip_vcnt;
146
147 iv->bv_page = page;
148 iv->bv_len = len;
149 iv->bv_offset = offset;
150 bip->bip_vcnt++;
151
152 return len;
153}
154EXPORT_SYMBOL(bio_integrity_add_page);
155
156static int bdev_integrity_enabled(struct block_device *bdev, int rw)
157{
158 struct blk_integrity *bi = bdev_get_integrity(bdev);
159
160 if (bi == NULL)
161 return 0;
162
163 if (rw == READ && bi->verify_fn != NULL &&
164 (bi->flags & INTEGRITY_FLAG_READ))
165 return 1;
166
167 if (rw == WRITE && bi->generate_fn != NULL &&
168 (bi->flags & INTEGRITY_FLAG_WRITE))
169 return 1;
170
171 return 0;
172}
173
174/**
175 * bio_integrity_enabled - Check whether integrity can be passed
176 * @bio: bio to check
177 *
178 * Description: Determines whether bio_integrity_prep() can be called
179 * on this bio or not. bio data direction and target device must be
181 * set prior to calling. The function honors the write_generate and
181 * read_verify flags in sysfs.
182 */
183int bio_integrity_enabled(struct bio *bio)
184{
185 if (!bio_is_rw(bio))
186 return 0;
187
188 /* Already protected? */
189 if (bio_integrity(bio))
190 return 0;
191
192 return bdev_integrity_enabled(bio->bi_bdev, bio_data_dir(bio));
193}
194EXPORT_SYMBOL(bio_integrity_enabled);
195
196/**
197 * bio_integrity_hw_sectors - Convert 512b sectors to hardware ditto
198 * @bi: blk_integrity profile for device
199 * @sectors: Number of 512 sectors to convert
200 *
201 * Description: The block layer calculates everything in 512 byte
202 * sectors but integrity metadata is done in terms of the hardware
203 * sector size of the storage device. Convert the block layer sectors
204 * to physical sectors.
205 */
206static inline unsigned int bio_integrity_hw_sectors(struct blk_integrity *bi,
207 unsigned int sectors)
208{
209 /* At this point there are only 512b or 4096b DIF/EPP devices */
210 if (bi->sector_size == 4096)
211 return sectors >>= 3;
212
213 return sectors;
214}
215
216static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
217 unsigned int sectors)
218{
219 return bio_integrity_hw_sectors(bi, sectors) * bi->tuple_size;
220}
221
222/**
223 * bio_integrity_tag_size - Retrieve integrity tag space
224 * @bio: bio to inspect
225 *
226 * Description: Returns the maximum number of tag bytes that can be
227 * attached to this bio. Filesystems can use this to determine how
228 * much metadata to attach to an I/O.
229 */
230unsigned int bio_integrity_tag_size(struct bio *bio)
231{
232 struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
233
234 BUG_ON(bio->bi_iter.bi_size == 0);
235
236 return bi->tag_size * (bio->bi_iter.bi_size / bi->sector_size);
237}
238EXPORT_SYMBOL(bio_integrity_tag_size);
239
240static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
241 int set)
242{
243 struct bio_integrity_payload *bip = bio->bi_integrity;
244 struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
245 unsigned int nr_sectors;
246
247 BUG_ON(bip->bip_buf == NULL);
248
249 if (bi->tag_size == 0)
250 return -1;
251
252 nr_sectors = bio_integrity_hw_sectors(bi,
253 DIV_ROUND_UP(len, bi->tag_size));
254
255 if (nr_sectors * bi->tuple_size > bip->bip_iter.bi_size) {
256 printk(KERN_ERR "%s: tag too big for bio: %u > %u\n", __func__,
257 nr_sectors * bi->tuple_size, bip->bip_iter.bi_size);
258 return -1;
259 }
260
261 if (set)
262 bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
263 else
264 bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
265
266 return 0;
267}
268
269/**
270 * bio_integrity_set_tag - Attach a tag buffer to a bio
271 * @bio: bio to attach buffer to
272 * @tag_buf: Pointer to a buffer containing tag data
273 * @len: Length of the included buffer
274 *
275 * Description: Use this function to tag a bio by leveraging the extra
276 * space provided by devices formatted with integrity protection. The
277 * size of the integrity buffer must be <= to the size reported by
278 * bio_integrity_tag_size().
279 */
280int bio_integrity_set_tag(struct bio *bio, void *tag_buf, unsigned int len)
281{
282 BUG_ON(bio_data_dir(bio) != WRITE);
283
284 return bio_integrity_tag(bio, tag_buf, len, 1);
285}
286EXPORT_SYMBOL(bio_integrity_set_tag);
287
288/**
289 * bio_integrity_get_tag - Retrieve a tag buffer from a bio
290 * @bio: bio to retrieve buffer from
291 * @tag_buf: Pointer to a buffer for the tag data
292 * @len: Length of the target buffer
293 *
294 * Description: Use this function to retrieve the tag buffer from a
295 * completed I/O. The size of the integrity buffer must be <= to the
296 * size reported by bio_integrity_tag_size().
297 */
298int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
299{
300 BUG_ON(bio_data_dir(bio) != READ);
301
302 return bio_integrity_tag(bio, tag_buf, len, 0);
303}
304EXPORT_SYMBOL(bio_integrity_get_tag);
305
306/**
307 * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
308 * @bio: bio to generate/verify integrity metadata for
309 * @operate: operate number, 1 for generate, 0 for verify
310 */
311static int bio_integrity_generate_verify(struct bio *bio, int operate)
312{
313 struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
314 struct blk_integrity_exchg bix;
315 struct bio_vec *bv;
316 sector_t sector;
317 unsigned int sectors, ret = 0, i;
318 void *prot_buf = bio->bi_integrity->bip_buf;
319
320 if (operate)
321 sector = bio->bi_iter.bi_sector;
322 else
323 sector = bio->bi_integrity->bip_iter.bi_sector;
324
325 bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
326 bix.sector_size = bi->sector_size;
327
328 bio_for_each_segment_all(bv, bio, i) {
329 void *kaddr = kmap_atomic(bv->bv_page);
330 bix.data_buf = kaddr + bv->bv_offset;
331 bix.data_size = bv->bv_len;
332 bix.prot_buf = prot_buf;
333 bix.sector = sector;
334
335 if (operate)
336 bi->generate_fn(&bix);
337 else {
338 ret = bi->verify_fn(&bix);
339 if (ret) {
340 kunmap_atomic(kaddr);
341 return ret;
342 }
343 }
344
345 sectors = bv->bv_len / bi->sector_size;
346 sector += sectors;
347 prot_buf += sectors * bi->tuple_size;
348
349 kunmap_atomic(kaddr);
350 }
351 return ret;
352}
353
354/**
355 * bio_integrity_generate - Generate integrity metadata for a bio
356 * @bio: bio to generate integrity metadata for
357 *
358 * Description: Generates integrity metadata for a bio by calling the
359 * block device's generation callback function. The bio must have a
360 * bip attached with enough room to accommodate the generated
361 * integrity metadata.
362 */
363static void bio_integrity_generate(struct bio *bio)
364{
365 bio_integrity_generate_verify(bio, 1);
366}
367
368static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
369{
370 if (bi)
371 return bi->tuple_size;
372
373 return 0;
374}
375
376/**
377 * bio_integrity_prep - Prepare bio for integrity I/O
378 * @bio: bio to prepare
379 *
380 * Description: Allocates a buffer for integrity metadata, maps the
381 * pages and attaches them to a bio. The bio must have data
382 * direction, target device and start sector set prior to calling. In
383 * the WRITE case, integrity metadata will be generated using the
384 * block device's integrity function. In the READ case, the buffer
385 * will be prepared for DMA and a suitable end_io handler set up.
386 */
387int bio_integrity_prep(struct bio *bio)
388{
389 struct bio_integrity_payload *bip;
390 struct blk_integrity *bi;
391 struct request_queue *q;
392 void *buf;
393 unsigned long start, end;
394 unsigned int len, nr_pages;
395 unsigned int bytes, offset, i;
396 unsigned int sectors;
397
398 bi = bdev_get_integrity(bio->bi_bdev);
399 q = bdev_get_queue(bio->bi_bdev);
400 BUG_ON(bi == NULL);
401 BUG_ON(bio_integrity(bio));
402
403 sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
404
405 /* Allocate kernel buffer for protection data */
406 len = sectors * blk_integrity_tuple_size(bi);
407 buf = kmalloc(len, GFP_NOIO | q->bounce_gfp);
408 if (unlikely(buf == NULL)) {
409 printk(KERN_ERR "could not allocate integrity buffer\n");
410 return -ENOMEM;
411 }
412
413 end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
414 start = ((unsigned long) buf) >> PAGE_SHIFT;
415 nr_pages = end - start;
416
417 /* Allocate bio integrity payload and integrity vectors */
418 bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
419 if (unlikely(bip == NULL)) {
420 printk(KERN_ERR "could not allocate data integrity bioset\n");
421 kfree(buf);
422 return -EIO;
423 }
424
425 bip->bip_owns_buf = 1;
426 bip->bip_buf = buf;
427 bip->bip_iter.bi_size = len;
428 bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
429
430 /* Map it */
431 offset = offset_in_page(buf);
432 for (i = 0 ; i < nr_pages ; i++) {
433 int ret;
434 bytes = PAGE_SIZE - offset;
435
436 if (len <= 0)
437 break;
438
439 if (bytes > len)
440 bytes = len;
441
442 ret = bio_integrity_add_page(bio, virt_to_page(buf),
443 bytes, offset);
444
445 if (ret == 0)
446 return 0;
447
448 if (ret < bytes)
449 break;
450
451 buf += bytes;
452 len -= bytes;
453 offset = 0;
454 }
455
456 /* Install custom I/O completion handler if read verify is enabled */
457 if (bio_data_dir(bio) == READ) {
458 bip->bip_end_io = bio->bi_end_io;
459 bio->bi_end_io = bio_integrity_endio;
460 }
461
462 /* Auto-generate integrity metadata if this is a write */
463 if (bio_data_dir(bio) == WRITE)
464 bio_integrity_generate(bio);
465
466 return 0;
467}
468EXPORT_SYMBOL(bio_integrity_prep);
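A hedged sketch of the submitter's side of the two helpers above, mirroring what the generic submission path does; the surrounding function is hypothetical and error handling is trimmed:

	/* Sketch only: attach protection information before the bio reaches the driver. */
	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
		/* protection buffer or bip allocation failed */
		bio_endio(bio, -EIO);
		return;
	}
	/* WRITEs now carry generated metadata; READs are verified later in bio_integrity_endio() */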
469
470/**
471 * bio_integrity_verify - Verify integrity metadata for a bio
472 * @bio: bio to verify
473 *
474 * Description: This function is called to verify the integrity of a
475 * bio. The data in the bio io_vec is compared to the integrity
476 * metadata returned by the HBA.
477 */
478static int bio_integrity_verify(struct bio *bio)
479{
480 return bio_integrity_generate_verify(bio, 0);
481}
482
483/**
484 * bio_integrity_verify_fn - Integrity I/O completion worker
485 * @work: Work struct stored in bio to be verified
486 *
487 * Description: This workqueue function is called to complete a READ
488 * request. The function verifies the transferred integrity metadata
489 * and then calls the original bio end_io function.
490 */
491static void bio_integrity_verify_fn(struct work_struct *work)
492{
493 struct bio_integrity_payload *bip =
494 container_of(work, struct bio_integrity_payload, bip_work);
495 struct bio *bio = bip->bip_bio;
496 int error;
497
498 error = bio_integrity_verify(bio);
499
500 /* Restore original bio completion handler */
501 bio->bi_end_io = bip->bip_end_io;
502 bio_endio_nodec(bio, error);
503}
504
505/**
506 * bio_integrity_endio - Integrity I/O completion function
507 * @bio: Protected bio
508 * @error: Pointer to errno
509 *
510 * Description: Completion for integrity I/O
511 *
512 * Normally I/O completion is done in interrupt context. However,
513 * verifying I/O integrity is a time-consuming task which must be run
514 * in process context. This function postpones completion
515 * accordingly.
516 */
517void bio_integrity_endio(struct bio *bio, int error)
518{
519 struct bio_integrity_payload *bip = bio->bi_integrity;
520
521 BUG_ON(bip->bip_bio != bio);
522
523 /* In case of an I/O error there is no point in verifying the
524 * integrity metadata. Restore original bio end_io handler
525 * and run it.
526 */
527 if (error) {
528 bio->bi_end_io = bip->bip_end_io;
529 bio_endio(bio, error);
530
531 return;
532 }
533
534 INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
535 queue_work(kintegrityd_wq, &bip->bip_work);
536}
537EXPORT_SYMBOL(bio_integrity_endio);
538
539/**
540 * bio_integrity_advance - Advance integrity vector
541 * @bio: bio whose integrity vector to update
542 * @bytes_done: number of data bytes that have been completed
543 *
544 * Description: This function calculates how many integrity bytes the
545 * number of completed data bytes correspond to and advances the
546 * integrity vector accordingly.
547 */
548void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
549{
550 struct bio_integrity_payload *bip = bio->bi_integrity;
551 struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
552 unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
553
554 bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
555}
556EXPORT_SYMBOL(bio_integrity_advance);
557
558/**
559 * bio_integrity_trim - Trim integrity vector
560 * @bio: bio whose integrity vector to update
561 * @offset: offset to first data sector
562 * @sectors: number of data sectors
563 *
564 * Description: Used to trim the integrity vector in a cloned bio.
565 * The ivec will be advanced corresponding to 'offset' data sectors
566 * and the length will be truncated corresponding to 'len' data
567 * sectors.
568 */
569void bio_integrity_trim(struct bio *bio, unsigned int offset,
570 unsigned int sectors)
571{
572 struct bio_integrity_payload *bip = bio->bi_integrity;
573 struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
574
575 bio_integrity_advance(bio, offset << 9);
576 bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
577}
578EXPORT_SYMBOL(bio_integrity_trim);
579
580/**
581 * bio_integrity_clone - Callback for cloning bios with integrity metadata
582 * @bio: New bio
583 * @bio_src: Original bio
584 * @gfp_mask: Memory allocation mask
585 *
586 * Description: Called to allocate a bip when cloning a bio
587 */
588int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
589 gfp_t gfp_mask)
590{
591 struct bio_integrity_payload *bip_src = bio_src->bi_integrity;
592 struct bio_integrity_payload *bip;
593
594 BUG_ON(bip_src == NULL);
595
596 bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt);
597
598 if (bip == NULL)
599 return -EIO;
600
601 memcpy(bip->bip_vec, bip_src->bip_vec,
602 bip_src->bip_vcnt * sizeof(struct bio_vec));
603
604 bip->bip_vcnt = bip_src->bip_vcnt;
605 bip->bip_iter = bip_src->bip_iter;
606
607 return 0;
608}
609EXPORT_SYMBOL(bio_integrity_clone);
610
611int bioset_integrity_create(struct bio_set *bs, int pool_size)
612{
613 if (bs->bio_integrity_pool)
614 return 0;
615
616 bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
617 if (!bs->bio_integrity_pool)
618 return -1;
619
620 bs->bvec_integrity_pool = biovec_create_pool(pool_size);
621 if (!bs->bvec_integrity_pool) {
622 mempool_destroy(bs->bio_integrity_pool);
623 return -1;
624 }
625
626 return 0;
627}
628EXPORT_SYMBOL(bioset_integrity_create);
629
630void bioset_integrity_free(struct bio_set *bs)
631{
632 if (bs->bio_integrity_pool)
633 mempool_destroy(bs->bio_integrity_pool);
634
635 if (bs->bvec_integrity_pool)
636 mempool_destroy(bs->bvec_integrity_pool);
637}
638EXPORT_SYMBOL(bioset_integrity_free);
639
640void __init bio_integrity_init(void)
641{
642 /*
643 * kintegrityd won't block much but may burn a lot of CPU cycles.
644 * Make it highpri CPU intensive wq with max concurrency of 1.
645 */
646 kintegrityd_wq = alloc_workqueue("kintegrityd", WQ_MEM_RECLAIM |
647 WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);
648 if (!kintegrityd_wq)
649 panic("Failed to create kintegrityd\n");
650
651 bip_slab = kmem_cache_create("bio_integrity_payload",
652 sizeof(struct bio_integrity_payload) +
653 sizeof(struct bio_vec) * BIP_INLINE_VECS,
654 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
655 if (!bip_slab)
656 panic("Failed to create slab\n");
657}
diff --git a/block/bio.c b/block/bio.c
new file mode 100644
index 000000000000..96d28eee8a1e
--- /dev/null
+++ b/block/bio.c
@@ -0,0 +1,2038 @@
1/*
2 * Copyright (C) 2001 Jens Axboe <axboe@kernel.dk>
3 *
4 * This program is free software; you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License version 2 as
6 * published by the Free Software Foundation.
7 *
8 * This program is distributed in the hope that it will be useful,
9 * but WITHOUT ANY WARRANTY; without even the implied warranty of
10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11 * GNU General Public License for more details.
12 *
13 * You should have received a copy of the GNU General Public License
14 * along with this program; if not, write to the Free Software
15 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
16 *
17 */
18#include <linux/mm.h>
19#include <linux/swap.h>
20#include <linux/bio.h>
21#include <linux/blkdev.h>
22#include <linux/uio.h>
23#include <linux/iocontext.h>
24#include <linux/slab.h>
25#include <linux/init.h>
26#include <linux/kernel.h>
27#include <linux/export.h>
28#include <linux/mempool.h>
29#include <linux/workqueue.h>
30#include <linux/cgroup.h>
31#include <scsi/sg.h> /* for struct sg_iovec */
32
33#include <trace/events/block.h>
34
35/*
36 * Test patch to inline a certain number of bi_io_vec's inside the bio
37 * itself, to shrink a bio data allocation from two mempool calls to one
38 */
39#define BIO_INLINE_VECS 4
40
41/*
42 * if you change this list, also change bvec_alloc or things will
43 * break badly! cannot be bigger than what you can fit into an
44 * unsigned short
45 */
46#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
47static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
48 BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
49};
50#undef BV
51
52/*
53 * fs_bio_set is the bio_set containing bio and iovec memory pools used by
54 * IO code that does not need private memory pools.
55 */
56struct bio_set *fs_bio_set;
57EXPORT_SYMBOL(fs_bio_set);
58
59/*
60 * Our slab pool management
61 */
62struct bio_slab {
63 struct kmem_cache *slab;
64 unsigned int slab_ref;
65 unsigned int slab_size;
66 char name[8];
67};
68static DEFINE_MUTEX(bio_slab_lock);
69static struct bio_slab *bio_slabs;
70static unsigned int bio_slab_nr, bio_slab_max;
71
72static struct kmem_cache *bio_find_or_create_slab(unsigned int extra_size)
73{
74 unsigned int sz = sizeof(struct bio) + extra_size;
75 struct kmem_cache *slab = NULL;
76 struct bio_slab *bslab, *new_bio_slabs;
77 unsigned int new_bio_slab_max;
78 unsigned int i, entry = -1;
79
80 mutex_lock(&bio_slab_lock);
81
82 i = 0;
83 while (i < bio_slab_nr) {
84 bslab = &bio_slabs[i];
85
86 if (!bslab->slab && entry == -1)
87 entry = i;
88 else if (bslab->slab_size == sz) {
89 slab = bslab->slab;
90 bslab->slab_ref++;
91 break;
92 }
93 i++;
94 }
95
96 if (slab)
97 goto out_unlock;
98
99 if (bio_slab_nr == bio_slab_max && entry == -1) {
100 new_bio_slab_max = bio_slab_max << 1;
101 new_bio_slabs = krealloc(bio_slabs,
102 new_bio_slab_max * sizeof(struct bio_slab),
103 GFP_KERNEL);
104 if (!new_bio_slabs)
105 goto out_unlock;
106 bio_slab_max = new_bio_slab_max;
107 bio_slabs = new_bio_slabs;
108 }
109 if (entry == -1)
110 entry = bio_slab_nr++;
111
112 bslab = &bio_slabs[entry];
113
114 snprintf(bslab->name, sizeof(bslab->name), "bio-%d", entry);
115 slab = kmem_cache_create(bslab->name, sz, 0, SLAB_HWCACHE_ALIGN, NULL);
116 if (!slab)
117 goto out_unlock;
118
119 bslab->slab = slab;
120 bslab->slab_ref = 1;
121 bslab->slab_size = sz;
122out_unlock:
123 mutex_unlock(&bio_slab_lock);
124 return slab;
125}
126
127static void bio_put_slab(struct bio_set *bs)
128{
129 struct bio_slab *bslab = NULL;
130 unsigned int i;
131
132 mutex_lock(&bio_slab_lock);
133
134 for (i = 0; i < bio_slab_nr; i++) {
135 if (bs->bio_slab == bio_slabs[i].slab) {
136 bslab = &bio_slabs[i];
137 break;
138 }
139 }
140
141 if (WARN(!bslab, KERN_ERR "bio: unable to find slab!\n"))
142 goto out;
143
144 WARN_ON(!bslab->slab_ref);
145
146 if (--bslab->slab_ref)
147 goto out;
148
149 kmem_cache_destroy(bslab->slab);
150 bslab->slab = NULL;
151
152out:
153 mutex_unlock(&bio_slab_lock);
154}
155
156unsigned int bvec_nr_vecs(unsigned short idx)
157{
158 return bvec_slabs[idx].nr_vecs;
159}
160
161void bvec_free(mempool_t *pool, struct bio_vec *bv, unsigned int idx)
162{
163 BIO_BUG_ON(idx >= BIOVEC_NR_POOLS);
164
165 if (idx == BIOVEC_MAX_IDX)
166 mempool_free(bv, pool);
167 else {
168 struct biovec_slab *bvs = bvec_slabs + idx;
169
170 kmem_cache_free(bvs->slab, bv);
171 }
172}
173
174struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
175 mempool_t *pool)
176{
177 struct bio_vec *bvl;
178
179 /*
180 * see comment near bvec_array define!
181 */
182 switch (nr) {
183 case 1:
184 *idx = 0;
185 break;
186 case 2 ... 4:
187 *idx = 1;
188 break;
189 case 5 ... 16:
190 *idx = 2;
191 break;
192 case 17 ... 64:
193 *idx = 3;
194 break;
195 case 65 ... 128:
196 *idx = 4;
197 break;
198 case 129 ... BIO_MAX_PAGES:
199 *idx = 5;
200 break;
201 default:
202 return NULL;
203 }
204
205 /*
206 * idx now points to the pool we want to allocate from. only the
207 * 1-vec entry pool is mempool backed.
208 */
209 if (*idx == BIOVEC_MAX_IDX) {
210fallback:
211 bvl = mempool_alloc(pool, gfp_mask);
212 } else {
213 struct biovec_slab *bvs = bvec_slabs + *idx;
214 gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
215
216 /*
217 * Make this allocation restricted and don't dump info on
218 * allocation failures, since we'll fallback to the mempool
219 * in case of failure.
220 */
221 __gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
222
223 /*
224 * Try a slab allocation. If this fails and __GFP_WAIT
225 * is set, retry with the 1-entry mempool
226 */
227 bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
228 if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
229 *idx = BIOVEC_MAX_IDX;
230 goto fallback;
231 }
232 }
233
234 return bvl;
235}
236
237static void __bio_free(struct bio *bio)
238{
239 bio_disassociate_task(bio);
240
241 if (bio_integrity(bio))
242 bio_integrity_free(bio);
243}
244
245static void bio_free(struct bio *bio)
246{
247 struct bio_set *bs = bio->bi_pool;
248 void *p;
249
250 __bio_free(bio);
251
252 if (bs) {
253 if (bio_flagged(bio, BIO_OWNS_VEC))
254 bvec_free(bs->bvec_pool, bio->bi_io_vec, BIO_POOL_IDX(bio));
255
256 /*
257 * If we have front padding, adjust the bio pointer before freeing
258 */
259 p = bio;
260 p -= bs->front_pad;
261
262 mempool_free(p, bs->bio_pool);
263 } else {
264 /* Bio was allocated by bio_kmalloc() */
265 kfree(bio);
266 }
267}
268
269void bio_init(struct bio *bio)
270{
271 memset(bio, 0, sizeof(*bio));
272 bio->bi_flags = 1 << BIO_UPTODATE;
273 atomic_set(&bio->bi_remaining, 1);
274 atomic_set(&bio->bi_cnt, 1);
275}
276EXPORT_SYMBOL(bio_init);
277
278/**
279 * bio_reset - reinitialize a bio
280 * @bio: bio to reset
281 *
282 * Description:
283 * After calling bio_reset(), @bio will be in the same state as a freshly
284 * allocated bio returned by bio_alloc_bioset() - the only fields that are
285 * preserved are the ones that are initialized by bio_alloc_bioset(). See
286 * comment in struct bio.
287 */
288void bio_reset(struct bio *bio)
289{
290 unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
291
292 __bio_free(bio);
293
294 memset(bio, 0, BIO_RESET_BYTES);
295 bio->bi_flags = flags|(1 << BIO_UPTODATE);
296 atomic_set(&bio->bi_remaining, 1);
297}
298EXPORT_SYMBOL(bio_reset);
299
300static void bio_chain_endio(struct bio *bio, int error)
301{
302 bio_endio(bio->bi_private, error);
303 bio_put(bio);
304}
305
306/**
307 * bio_chain - chain bio completions
308 * @bio: the target bio
309 * @parent: the @bio's parent bio
310 *
311 * The caller won't have a bi_end_io called when @bio completes - instead,
312 * @parent's bi_end_io won't be called until both @parent and @bio have
313 * completed; the chained bio will also be freed when it completes.
314 *
315 * The caller must not set bi_private or bi_end_io in @bio.
316 */
317void bio_chain(struct bio *bio, struct bio *parent)
318{
319 BUG_ON(bio->bi_private || bio->bi_end_io);
320
321 bio->bi_private = parent;
322 bio->bi_end_io = bio_chain_endio;
323 atomic_inc(&parent->bi_remaining);
324}
325EXPORT_SYMBOL(bio_chain);
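A hedged usage sketch, not part of this file: a driver splitting an I/O can clone a front piece, chain it to the original, and submit both, so the original bio's completion waits for both halves. front_bytes, the use of fs_bio_set and the missing NULL check are assumptions of the sketch:

	struct bio *split = bio_clone_fast(bio, GFP_NOIO, fs_bio_set);

	split->bi_iter.bi_size = front_bytes;	/* clone covers only the front piece */
	bio_chain(split, bio);			/* bio's bi_end_io now also waits for split */
	bio_advance(bio, front_bytes);		/* original describes the remainder */

	generic_make_request(split);
	generic_make_request(bio);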
326
327static void bio_alloc_rescue(struct work_struct *work)
328{
329 struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
330 struct bio *bio;
331
332 while (1) {
333 spin_lock(&bs->rescue_lock);
334 bio = bio_list_pop(&bs->rescue_list);
335 spin_unlock(&bs->rescue_lock);
336
337 if (!bio)
338 break;
339
340 generic_make_request(bio);
341 }
342}
343
344static void punt_bios_to_rescuer(struct bio_set *bs)
345{
346 struct bio_list punt, nopunt;
347 struct bio *bio;
348
349 /*
350 * In order to guarantee forward progress we must punt only bios that
351 * were allocated from this bio_set; otherwise, if there was a bio on
352 * there for a stacking driver higher up in the stack, processing it
353 * could require allocating bios from this bio_set, and doing that from
354 * our own rescuer would be bad.
355 *
356 * Since bio lists are singly linked, pop them all instead of trying to
357 * remove from the middle of the list:
358 */
359
360 bio_list_init(&punt);
361 bio_list_init(&nopunt);
362
363 while ((bio = bio_list_pop(current->bio_list)))
364 bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
365
366 *current->bio_list = nopunt;
367
368 spin_lock(&bs->rescue_lock);
369 bio_list_merge(&bs->rescue_list, &punt);
370 spin_unlock(&bs->rescue_lock);
371
372 queue_work(bs->rescue_workqueue, &bs->rescue_work);
373}
374
375/**
376 * bio_alloc_bioset - allocate a bio for I/O
377 * @gfp_mask: the GFP_ mask given to the slab allocator
378 * @nr_iovecs: number of iovecs to pre-allocate
379 * @bs: the bio_set to allocate from.
380 *
381 * Description:
382 * If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
383 * backed by the @bs's mempool.
384 *
385 * When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
386 * able to allocate a bio. This is due to the mempool guarantees. To make this
387 * work, callers must never allocate more than 1 bio at a time from this pool.
388 * Callers that need to allocate more than 1 bio must always submit the
389 * previously allocated bio for IO before attempting to allocate a new one.
390 * Failure to do so can cause deadlocks under memory pressure.
391 *
392 * Note that when running under generic_make_request() (i.e. any block
393 * driver), bios are not submitted until after you return - see the code in
394 * generic_make_request() that converts recursion into iteration, to prevent
395 * stack overflows.
396 *
397 * This would normally mean allocating multiple bios under
398 * generic_make_request() would be susceptible to deadlocks, but we have
399 * deadlock avoidance code that resubmits any blocked bios from a rescuer
400 * thread.
401 *
402 * However, we do not guarantee forward progress for allocations from other
403 * mempools. Doing multiple allocations from the same mempool under
404 * generic_make_request() should be avoided - instead, use bio_set's front_pad
405 * for per bio allocations.
406 *
407 * RETURNS:
408 * Pointer to new bio on success, NULL on failure.
409 */
410struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
411{
412 gfp_t saved_gfp = gfp_mask;
413 unsigned front_pad;
414 unsigned inline_vecs;
415 unsigned long idx = BIO_POOL_NONE;
416 struct bio_vec *bvl = NULL;
417 struct bio *bio;
418 void *p;
419
420 if (!bs) {
421 if (nr_iovecs > UIO_MAXIOV)
422 return NULL;
423
424 p = kmalloc(sizeof(struct bio) +
425 nr_iovecs * sizeof(struct bio_vec),
426 gfp_mask);
427 front_pad = 0;
428 inline_vecs = nr_iovecs;
429 } else {
430 /*
431 * generic_make_request() converts recursion to iteration; this
432 * means if we're running beneath it, any bios we allocate and
433 * submit will not be submitted (and thus freed) until after we
434 * return.
435 *
436 * This exposes us to a potential deadlock if we allocate
437 * multiple bios from the same bio_set() while running
438 * underneath generic_make_request(). If we were to allocate
439 * multiple bios (say a stacking block driver that was splitting
440 * bios), we would deadlock if we exhausted the mempool's
441 * reserve.
442 *
443 * We solve this, and guarantee forward progress, with a rescuer
444 * workqueue per bio_set. If we go to allocate and there are
445 * bios on current->bio_list, we first try the allocation
446 * without __GFP_WAIT; if that fails, we punt those bios we
447 * would be blocking to the rescuer workqueue before we retry
448 * with the original gfp_flags.
449 */
450
451 if (current->bio_list && !bio_list_empty(current->bio_list))
452 gfp_mask &= ~__GFP_WAIT;
453
454 p = mempool_alloc(bs->bio_pool, gfp_mask);
455 if (!p && gfp_mask != saved_gfp) {
456 punt_bios_to_rescuer(bs);
457 gfp_mask = saved_gfp;
458 p = mempool_alloc(bs->bio_pool, gfp_mask);
459 }
460
461 front_pad = bs->front_pad;
462 inline_vecs = BIO_INLINE_VECS;
463 }
464
465 if (unlikely(!p))
466 return NULL;
467
468 bio = p + front_pad;
469 bio_init(bio);
470
471 if (nr_iovecs > inline_vecs) {
472 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
473 if (!bvl && gfp_mask != saved_gfp) {
474 punt_bios_to_rescuer(bs);
475 gfp_mask = saved_gfp;
476 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
477 }
478
479 if (unlikely(!bvl))
480 goto err_free;
481
482 bio->bi_flags |= 1 << BIO_OWNS_VEC;
483 } else if (nr_iovecs) {
484 bvl = bio->bi_inline_vecs;
485 }
486
487 bio->bi_pool = bs;
488 bio->bi_flags |= idx << BIO_POOL_OFFSET;
489 bio->bi_max_vecs = nr_iovecs;
490 bio->bi_io_vec = bvl;
491 return bio;
492
493err_free:
494 mempool_free(p, bs->bio_pool);
495 return NULL;
496}
497EXPORT_SYMBOL(bio_alloc_bioset);
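A hedged sketch of the front_pad pattern recommended above: a stacking driver creates a private bio_set so its per-I/O state is allocated in front of each struct bio in a single mempool-backed allocation. struct my_io and the pool size of 4 are hypothetical:

	struct my_io {
		void		*driver_state;	/* per-I/O bookkeeping */
		struct bio	bio;		/* must be last: inline biovecs follow the bio */
	};

	/* at initialization time */
	struct bio_set *bs = bioset_create(4, offsetof(struct my_io, bio));

	/* per I/O: one allocation yields both structures */
	struct bio *bio = bio_alloc_bioset(GFP_NOIO, 1, bs);
	struct my_io *io = container_of(bio, struct my_io, bio);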
498
499void zero_fill_bio(struct bio *bio)
500{
501 unsigned long flags;
502 struct bio_vec bv;
503 struct bvec_iter iter;
504
505 bio_for_each_segment(bv, bio, iter) {
506 char *data = bvec_kmap_irq(&bv, &flags);
507 memset(data, 0, bv.bv_len);
508 flush_dcache_page(bv.bv_page);
509 bvec_kunmap_irq(data, &flags);
510 }
511}
512EXPORT_SYMBOL(zero_fill_bio);
513
514/**
515 * bio_put - release a reference to a bio
516 * @bio: bio to release reference to
517 *
518 * Description:
519 * Put a reference to a &struct bio, either one you have gotten with
520 * bio_alloc, bio_get or bio_clone. The last put of a bio will free it.
521 **/
522void bio_put(struct bio *bio)
523{
524 BIO_BUG_ON(!atomic_read(&bio->bi_cnt));
525
526 /*
527 * last put frees it
528 */
529 if (atomic_dec_and_test(&bio->bi_cnt))
530 bio_free(bio);
531}
532EXPORT_SYMBOL(bio_put);
533
534inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
535{
536 if (unlikely(!bio_flagged(bio, BIO_SEG_VALID)))
537 blk_recount_segments(q, bio);
538
539 return bio->bi_phys_segments;
540}
541EXPORT_SYMBOL(bio_phys_segments);
542
543/**
544 * __bio_clone_fast - clone a bio that shares the original bio's biovec
545 * @bio: destination bio
546 * @bio_src: bio to clone
547 *
548 * Clone a &bio. Caller will own the returned bio, but not
549 * the actual data it points to. Reference count of returned
550 * bio will be one.
551 *
552 * Caller must ensure that @bio_src is not freed before @bio.
553 */
554void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
555{
556 BUG_ON(bio->bi_pool && BIO_POOL_IDX(bio) != BIO_POOL_NONE);
557
558 /*
559 * most users will be overriding ->bi_bdev with a new target,
560 * so we don't set nor calculate new physical/hw segment counts here
561 */
562 bio->bi_bdev = bio_src->bi_bdev;
563 bio->bi_flags |= 1 << BIO_CLONED;
564 bio->bi_rw = bio_src->bi_rw;
565 bio->bi_iter = bio_src->bi_iter;
566 bio->bi_io_vec = bio_src->bi_io_vec;
567}
568EXPORT_SYMBOL(__bio_clone_fast);
569
570/**
571 * bio_clone_fast - clone a bio that shares the original bio's biovec
572 * @bio: bio to clone
573 * @gfp_mask: allocation priority
574 * @bs: bio_set to allocate from
575 *
576 * Like __bio_clone_fast, only also allocates the returned bio
577 */
578struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
579{
580 struct bio *b;
581
582 b = bio_alloc_bioset(gfp_mask, 0, bs);
583 if (!b)
584 return NULL;
585
586 __bio_clone_fast(b, bio);
587
588 if (bio_integrity(bio)) {
589 int ret;
590
591 ret = bio_integrity_clone(b, bio, gfp_mask);
592
593 if (ret < 0) {
594 bio_put(b);
595 return NULL;
596 }
597 }
598
599 return b;
600}
601EXPORT_SYMBOL(bio_clone_fast);
602
603/**
604 * bio_clone_bioset - clone a bio
605 * @bio_src: bio to clone
606 * @gfp_mask: allocation priority
607 * @bs: bio_set to allocate from
608 *
609 * Clone bio. Caller will own the returned bio, but not the actual data it
610 * points to. Reference count of returned bio will be one.
611 */
612struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
613 struct bio_set *bs)
614{
615 struct bvec_iter iter;
616 struct bio_vec bv;
617 struct bio *bio;
618
619 /*
620 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
621 * bio_src->bi_io_vec to bio->bi_io_vec.
622 *
623 * We can't do that anymore, because:
624 *
625 * - The point of cloning the biovec is to produce a bio with a biovec
626 * the caller can modify: bi_idx and bi_bvec_done should be 0.
627 *
628 * - The original bio could've had more than BIO_MAX_PAGES biovecs; if
629 * we tried to clone the whole thing bio_alloc_bioset() would fail.
630 * But the clone should succeed as long as the number of biovecs we
631 * actually need to allocate is fewer than BIO_MAX_PAGES.
632 *
633 * - Lastly, bi_vcnt should not be looked at or relied upon by code
634 * that does not own the bio - reason being drivers don't use it for
635 * iterating over the biovec anymore, so expecting it to be kept up
636 * to date (i.e. for clones that share the parent biovec) is just
637 * asking for trouble and would force extra work on
638 * __bio_clone_fast() anyways.
639 */
640
641 bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
642 if (!bio)
643 return NULL;
644
645 bio->bi_bdev = bio_src->bi_bdev;
646 bio->bi_rw = bio_src->bi_rw;
647 bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
648 bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
649
650 if (bio->bi_rw & REQ_DISCARD)
651 goto integrity_clone;
652
653 if (bio->bi_rw & REQ_WRITE_SAME) {
654 bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
655 goto integrity_clone;
656 }
657
658 bio_for_each_segment(bv, bio_src, iter)
659 bio->bi_io_vec[bio->bi_vcnt++] = bv;
660
661integrity_clone:
662 if (bio_integrity(bio_src)) {
663 int ret;
664
665 ret = bio_integrity_clone(bio, bio_src, gfp_mask);
666 if (ret < 0) {
667 bio_put(bio);
668 return NULL;
669 }
670 }
671
672 return bio;
673}
674EXPORT_SYMBOL(bio_clone_bioset);
675
676/**
677 * bio_get_nr_vecs - return approx number of vecs
678 * @bdev: I/O target
679 *
680 * Return the approximate number of pages we can send to this target.
681 * There's no guarantee that you will be able to fit this number of pages
682 * into a bio, it does not account for dynamic restrictions that vary
683 * on offset.
684 */
685int bio_get_nr_vecs(struct block_device *bdev)
686{
687 struct request_queue *q = bdev_get_queue(bdev);
688 int nr_pages;
689
690 nr_pages = min_t(unsigned,
691 queue_max_segments(q),
692 queue_max_sectors(q) / (PAGE_SIZE >> 9) + 1);
693
694 return min_t(unsigned, nr_pages, BIO_MAX_PAGES);
695
696}
697EXPORT_SYMBOL(bio_get_nr_vecs);
698
699static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
700 *page, unsigned int len, unsigned int offset,
701 unsigned int max_sectors)
702{
703 int retried_segments = 0;
704 struct bio_vec *bvec;
705
706 /*
707 * cloned bio must not modify vec list
708 */
709 if (unlikely(bio_flagged(bio, BIO_CLONED)))
710 return 0;
711
712 if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
713 return 0;
714
715 /*
716 * For filesystems with a blocksize smaller than the pagesize
717 * we will often be called with the same page as last time and
718 * a consecutive offset. Optimize this special case.
719 */
720 if (bio->bi_vcnt > 0) {
721 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
722
723 if (page == prev->bv_page &&
724 offset == prev->bv_offset + prev->bv_len) {
725 unsigned int prev_bv_len = prev->bv_len;
726 prev->bv_len += len;
727
728 if (q->merge_bvec_fn) {
729 struct bvec_merge_data bvm = {
730 /* prev_bvec is already charged in
731 bi_size, discharge it in order to
732 simulate merging updated prev_bvec
733 as new bvec. */
734 .bi_bdev = bio->bi_bdev,
735 .bi_sector = bio->bi_iter.bi_sector,
736 .bi_size = bio->bi_iter.bi_size -
737 prev_bv_len,
738 .bi_rw = bio->bi_rw,
739 };
740
741 if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
742 prev->bv_len -= len;
743 return 0;
744 }
745 }
746
747 goto done;
748 }
749 }
750
751 if (bio->bi_vcnt >= bio->bi_max_vecs)
752 return 0;
753
754 /*
755 * we might lose a segment or two here, but rather that than
756 * make this too complex.
757 */
758
759 while (bio->bi_phys_segments >= queue_max_segments(q)) {
760
761 if (retried_segments)
762 return 0;
763
764 retried_segments = 1;
765 blk_recount_segments(q, bio);
766 }
767
768 /*
769 * setup the new entry, we might clear it again later if we
770 * cannot add the page
771 */
772 bvec = &bio->bi_io_vec[bio->bi_vcnt];
773 bvec->bv_page = page;
774 bvec->bv_len = len;
775 bvec->bv_offset = offset;
776
777 /*
778 * if queue has other restrictions (eg varying max sector size
779 * depending on offset), it can specify a merge_bvec_fn in the
780 * queue to get further control
781 */
782 if (q->merge_bvec_fn) {
783 struct bvec_merge_data bvm = {
784 .bi_bdev = bio->bi_bdev,
785 .bi_sector = bio->bi_iter.bi_sector,
786 .bi_size = bio->bi_iter.bi_size,
787 .bi_rw = bio->bi_rw,
788 };
789
790 /*
791 * merge_bvec_fn() returns number of bytes it can accept
792 * at this offset
793 */
794 if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
795 bvec->bv_page = NULL;
796 bvec->bv_len = 0;
797 bvec->bv_offset = 0;
798 return 0;
799 }
800 }
801
802 /* If we may be able to merge these biovecs, force a recount */
803 if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
804 bio->bi_flags &= ~(1 << BIO_SEG_VALID);
805
806 bio->bi_vcnt++;
807 bio->bi_phys_segments++;
808 done:
809 bio->bi_iter.bi_size += len;
810 return len;
811}
812
813/**
814 * bio_add_pc_page - attempt to add page to bio
815 * @q: the target queue
816 * @bio: destination bio
817 * @page: page to add
818 * @len: vec entry length
819 * @offset: vec entry offset
820 *
821 * Attempt to add a page to the bio_vec maplist. This can fail for a
822 * number of reasons, such as the bio being full or target block device
823 * limitations. The target block device must allow bio's up to PAGE_SIZE,
824 * so it is always possible to add a single page to an empty bio.
825 *
826 * This should only be used by REQ_PC bios.
827 */
828int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
829 unsigned int len, unsigned int offset)
830{
831 return __bio_add_page(q, bio, page, len, offset,
832 queue_max_hw_sectors(q));
833}
834EXPORT_SYMBOL(bio_add_pc_page);
835
836/**
837 * bio_add_page - attempt to add page to bio
838 * @bio: destination bio
839 * @page: page to add
840 * @len: vec entry length
841 * @offset: vec entry offset
842 *
843 * Attempt to add a page to the bio_vec maplist. This can fail for a
844 * number of reasons, such as the bio being full or target block device
845 * limitations. The target block device must allow bio's up to PAGE_SIZE,
846 * so it is always possible to add a single page to an empty bio.
847 */
848int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
849 unsigned int offset)
850{
851 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
852 return __bio_add_page(q, bio, page, len, offset, queue_max_sectors(q));
853}
854EXPORT_SYMBOL(bio_add_page);
855
856struct submit_bio_ret {
857 struct completion event;
858 int error;
859};
860
861static void submit_bio_wait_endio(struct bio *bio, int error)
862{
863 struct submit_bio_ret *ret = bio->bi_private;
864
865 ret->error = error;
866 complete(&ret->event);
867}
868
869/**
870 * submit_bio_wait - submit a bio, and wait until it completes
871 * @rw: whether to %READ or %WRITE, or maybe to %READA (read ahead)
872 * @bio: The &struct bio which describes the I/O
873 *
874 * Simple wrapper around submit_bio(). Returns 0 on success, or the error from
875 * bio_endio() on failure.
876 */
877int submit_bio_wait(int rw, struct bio *bio)
878{
879 struct submit_bio_ret ret;
880
881 rw |= REQ_SYNC;
882 init_completion(&ret.event);
883 bio->bi_private = &ret;
884 bio->bi_end_io = submit_bio_wait_endio;
885 submit_bio(rw, bio);
886 wait_for_completion(&ret.event);
887
888 return ret.error;
889}
890EXPORT_SYMBOL(submit_bio_wait);
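A hedged usage sketch: synchronously reading one page from a block device with the helpers in this file. bdev, sector and page are assumed to exist in the caller, and error handling is elided:

	struct bio *bio = bio_alloc(GFP_KERNEL, 1);

	bio->bi_bdev = bdev;
	bio->bi_iter.bi_sector = sector;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	ret = submit_bio_wait(READ, bio);	/* sleeps until bi_end_io fires */
	bio_put(bio);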
891
892/**
893 * bio_advance - increment/complete a bio by some number of bytes
894 * @bio: bio to advance
895 * @bytes: number of bytes to complete
896 *
897 * This updates bi_sector, bi_size and bi_idx; if the number of bytes to
898 * complete doesn't align with a bvec boundary, then bv_len and bv_offset will
899 * be updated on the last bvec as well.
900 *
901 * @bio will then represent the remaining, uncompleted portion of the io.
902 */
903void bio_advance(struct bio *bio, unsigned bytes)
904{
905 if (bio_integrity(bio))
906 bio_integrity_advance(bio, bytes);
907
908 bio_advance_iter(bio, &bio->bi_iter, bytes);
909}
910EXPORT_SYMBOL(bio_advance);
911
912/**
913 * bio_alloc_pages - allocates a single page for each bvec in a bio
914 * @bio: bio to allocate pages for
915 * @gfp_mask: flags for allocation
916 *
917 * Allocates pages up to @bio->bi_vcnt.
918 *
919 * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
920 * freed.
921 */
922int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
923{
924 int i;
925 struct bio_vec *bv;
926
927 bio_for_each_segment_all(bv, bio, i) {
928 bv->bv_page = alloc_page(gfp_mask);
929 if (!bv->bv_page) {
930 while (--bv >= bio->bi_io_vec)
931 __free_page(bv->bv_page);
932 return -ENOMEM;
933 }
934 }
935
936 return 0;
937}
938EXPORT_SYMBOL(bio_alloc_pages);
939
940/**
941 * bio_copy_data - copy contents of data buffers from one chain of bios to
942 * another
943 * @src: source bio list
944 * @dst: destination bio list
945 *
946 * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
947 * @src and @dst as linked lists of bios.
948 *
949 * Stops when it reaches the end of either @src or @dst - that is, copies
950 * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
951 */
952void bio_copy_data(struct bio *dst, struct bio *src)
953{
954 struct bvec_iter src_iter, dst_iter;
955 struct bio_vec src_bv, dst_bv;
956 void *src_p, *dst_p;
957 unsigned bytes;
958
959 src_iter = src->bi_iter;
960 dst_iter = dst->bi_iter;
961
962 while (1) {
963 if (!src_iter.bi_size) {
964 src = src->bi_next;
965 if (!src)
966 break;
967
968 src_iter = src->bi_iter;
969 }
970
971 if (!dst_iter.bi_size) {
972 dst = dst->bi_next;
973 if (!dst)
974 break;
975
976 dst_iter = dst->bi_iter;
977 }
978
979 src_bv = bio_iter_iovec(src, src_iter);
980 dst_bv = bio_iter_iovec(dst, dst_iter);
981
982 bytes = min(src_bv.bv_len, dst_bv.bv_len);
983
984 src_p = kmap_atomic(src_bv.bv_page);
985 dst_p = kmap_atomic(dst_bv.bv_page);
986
987 memcpy(dst_p + dst_bv.bv_offset,
988 src_p + src_bv.bv_offset,
989 bytes);
990
991 kunmap_atomic(dst_p);
992 kunmap_atomic(src_p);
993
994 bio_advance_iter(src, &src_iter, bytes);
995 bio_advance_iter(dst, &dst_iter, bytes);
996 }
997}
998EXPORT_SYMBOL(bio_copy_data);
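A hedged usage sketch for bio_copy_data(): bounce the payload of a single source bio into a newly built copy backed by fresh pages. GFP_NOIO and the absence of allocation-failure handling are assumptions of the sketch:

	struct bio *copy = bio_alloc(GFP_NOIO, bio_segments(src));
	struct bvec_iter iter;
	struct bio_vec bv;

	copy->bi_bdev = src->bi_bdev;		/* bio_add_page() consults the queue limits */
	bio_for_each_segment(bv, src, iter)
		bio_add_page(copy, alloc_page(GFP_NOIO), bv.bv_len, 0);

	bio_copy_data(copy, src);		/* destination first, then source */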
999
1000struct bio_map_data {
1001 int nr_sgvecs;
1002 int is_our_pages;
1003 struct sg_iovec sgvecs[];
1004};
1005
1006static void bio_set_map_data(struct bio_map_data *bmd, struct bio *bio,
1007 const struct sg_iovec *iov, int iov_count,
1008 int is_our_pages)
1009{
1010 memcpy(bmd->sgvecs, iov, sizeof(struct sg_iovec) * iov_count);
1011 bmd->nr_sgvecs = iov_count;
1012 bmd->is_our_pages = is_our_pages;
1013 bio->bi_private = bmd;
1014}
1015
1016static struct bio_map_data *bio_alloc_map_data(unsigned int iov_count,
1017 gfp_t gfp_mask)
1018{
1019 if (iov_count > UIO_MAXIOV)
1020 return NULL;
1021
1022 return kmalloc(sizeof(struct bio_map_data) +
1023 sizeof(struct sg_iovec) * iov_count, gfp_mask);
1024}
1025
1026static int __bio_copy_iov(struct bio *bio, const struct sg_iovec *iov, int iov_count,
1027 int to_user, int from_user, int do_free_page)
1028{
1029 int ret = 0, i;
1030 struct bio_vec *bvec;
1031 int iov_idx = 0;
1032 unsigned int iov_off = 0;
1033
1034 bio_for_each_segment_all(bvec, bio, i) {
1035 char *bv_addr = page_address(bvec->bv_page);
1036 unsigned int bv_len = bvec->bv_len;
1037
1038 while (bv_len && iov_idx < iov_count) {
1039 unsigned int bytes;
1040 char __user *iov_addr;
1041
1042 bytes = min_t(unsigned int,
1043 iov[iov_idx].iov_len - iov_off, bv_len);
1044 iov_addr = iov[iov_idx].iov_base + iov_off;
1045
1046 if (!ret) {
1047 if (to_user)
1048 ret = copy_to_user(iov_addr, bv_addr,
1049 bytes);
1050
1051 if (from_user)
1052 ret = copy_from_user(bv_addr, iov_addr,
1053 bytes);
1054
1055 if (ret)
1056 ret = -EFAULT;
1057 }
1058
1059 bv_len -= bytes;
1060 bv_addr += bytes;
1061 iov_addr += bytes;
1062 iov_off += bytes;
1063
1064 if (iov[iov_idx].iov_len == iov_off) {
1065 iov_idx++;
1066 iov_off = 0;
1067 }
1068 }
1069
1070 if (do_free_page)
1071 __free_page(bvec->bv_page);
1072 }
1073
1074 return ret;
1075}
1076
1077/**
1078 * bio_uncopy_user - finish previously mapped bio
1079 * @bio: bio being terminated
1080 *
1081 * Free pages allocated from bio_copy_user() and write back data
1082 * to user space in case of a read.
1083 */
1084int bio_uncopy_user(struct bio *bio)
1085{
1086 struct bio_map_data *bmd = bio->bi_private;
1087 struct bio_vec *bvec;
1088 int ret = 0, i;
1089
1090 if (!bio_flagged(bio, BIO_NULL_MAPPED)) {
1091 /*
1092 * if we're in a workqueue, the request is orphaned, so
1093 * don't copy into a random user address space, just free.
1094 */
1095 if (current->mm)
1096 ret = __bio_copy_iov(bio, bmd->sgvecs, bmd->nr_sgvecs,
1097 bio_data_dir(bio) == READ,
1098 0, bmd->is_our_pages);
1099 else if (bmd->is_our_pages)
1100 bio_for_each_segment_all(bvec, bio, i)
1101 __free_page(bvec->bv_page);
1102 }
1103 kfree(bmd);
1104 bio_put(bio);
1105 return ret;
1106}
1107EXPORT_SYMBOL(bio_uncopy_user);
1108
1109/**
1110 * bio_copy_user_iov - copy user data to bio
1111 * @q: destination block queue
1112 * @map_data: pointer to the rq_map_data holding pages (if necessary)
1113 * @iov: the iovec.
1114 * @iov_count: number of elements in the iovec
1115 * @write_to_vm: bool indicating whether data is written into the pages (a READ)
1116 * @gfp_mask: memory allocation flags
1117 *
1118 * Prepares and returns a bio for indirect user io, bouncing data
1119 * to/from kernel pages as necessary. Must be paired with a
1120 * call to bio_uncopy_user() on io completion.
1121 */
1122struct bio *bio_copy_user_iov(struct request_queue *q,
1123 struct rq_map_data *map_data,
1124 const struct sg_iovec *iov, int iov_count,
1125 int write_to_vm, gfp_t gfp_mask)
1126{
1127 struct bio_map_data *bmd;
1128 struct bio_vec *bvec;
1129 struct page *page;
1130 struct bio *bio;
1131 int i, ret;
1132 int nr_pages = 0;
1133 unsigned int len = 0;
1134 unsigned int offset = map_data ? map_data->offset & ~PAGE_MASK : 0;
1135
1136 for (i = 0; i < iov_count; i++) {
1137 unsigned long uaddr;
1138 unsigned long end;
1139 unsigned long start;
1140
1141 uaddr = (unsigned long)iov[i].iov_base;
1142 end = (uaddr + iov[i].iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
1143 start = uaddr >> PAGE_SHIFT;
1144
1145 /*
1146 * Overflow, abort
1147 */
1148 if (end < start)
1149 return ERR_PTR(-EINVAL);
1150
1151 nr_pages += end - start;
1152 len += iov[i].iov_len;
1153 }
1154
1155 if (offset)
1156 nr_pages++;
1157
1158 bmd = bio_alloc_map_data(iov_count, gfp_mask);
1159 if (!bmd)
1160 return ERR_PTR(-ENOMEM);
1161
1162 ret = -ENOMEM;
1163 bio = bio_kmalloc(gfp_mask, nr_pages);
1164 if (!bio)
1165 goto out_bmd;
1166
1167 if (!write_to_vm)
1168 bio->bi_rw |= REQ_WRITE;
1169
1170 ret = 0;
1171
1172 if (map_data) {
1173 nr_pages = 1 << map_data->page_order;
1174 i = map_data->offset / PAGE_SIZE;
1175 }
1176 while (len) {
1177 unsigned int bytes = PAGE_SIZE;
1178
1179 bytes -= offset;
1180
1181 if (bytes > len)
1182 bytes = len;
1183
1184 if (map_data) {
1185 if (i == map_data->nr_entries * nr_pages) {
1186 ret = -ENOMEM;
1187 break;
1188 }
1189
1190 page = map_data->pages[i / nr_pages];
1191 page += (i % nr_pages);
1192
1193 i++;
1194 } else {
1195 page = alloc_page(q->bounce_gfp | gfp_mask);
1196 if (!page) {
1197 ret = -ENOMEM;
1198 break;
1199 }
1200 }
1201
1202 if (bio_add_pc_page(q, bio, page, bytes, offset) < bytes)
1203 break;
1204
1205 len -= bytes;
1206 offset = 0;
1207 }
1208
1209 if (ret)
1210 goto cleanup;
1211
1212 /*
1213 * success
1214 */
1215 if ((!write_to_vm && (!map_data || !map_data->null_mapped)) ||
1216 (map_data && map_data->from_user)) {
1217 ret = __bio_copy_iov(bio, iov, iov_count, 0, 1, 0);
1218 if (ret)
1219 goto cleanup;
1220 }
1221
1222 bio_set_map_data(bmd, bio, iov, iov_count, map_data ? 0 : 1);
1223 return bio;
1224cleanup:
1225 if (!map_data)
1226 bio_for_each_segment_all(bvec, bio, i)
1227 __free_page(bvec->bv_page);
1228
1229 bio_put(bio);
1230out_bmd:
1231 kfree(bmd);
1232 return ERR_PTR(ret);
1233}
1234
1235/**
1236 * bio_copy_user - copy user data to bio
1237 * @q: destination block queue
1238 * @map_data: pointer to the rq_map_data holding pages (if necessary)
1239 * @uaddr: start of user address
1240 * @len: length in bytes
1241 * @write_to_vm: bool indicating whether data is written into the pages (a READ)
1242 * @gfp_mask: memory allocation flags
1243 *
1244 * Prepares and returns a bio for indirect user io, bouncing data
1245 * to/from kernel pages as necessary. Must be paired with a
1246 * call to bio_uncopy_user() on io completion.
1247 */
1248struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *map_data,
1249 unsigned long uaddr, unsigned int len,
1250 int write_to_vm, gfp_t gfp_mask)
1251{
1252 struct sg_iovec iov;
1253
1254 iov.iov_base = (void __user *)uaddr;
1255 iov.iov_len = len;
1256
1257 return bio_copy_user_iov(q, map_data, &iov, 1, write_to_vm, gfp_mask);
1258}
1259EXPORT_SYMBOL(bio_copy_user);
1260
1261static struct bio *__bio_map_user_iov(struct request_queue *q,
1262 struct block_device *bdev,
1263 const struct sg_iovec *iov, int iov_count,
1264 int write_to_vm, gfp_t gfp_mask)
1265{
1266 int i, j;
1267 int nr_pages = 0;
1268 struct page **pages;
1269 struct bio *bio;
1270 int cur_page = 0;
1271 int ret, offset;
1272
1273 for (i = 0; i < iov_count; i++) {
1274 unsigned long uaddr = (unsigned long)iov[i].iov_base;
1275 unsigned long len = iov[i].iov_len;
1276 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
1277 unsigned long start = uaddr >> PAGE_SHIFT;
1278
1279 /*
1280 * Overflow, abort
1281 */
1282 if (end < start)
1283 return ERR_PTR(-EINVAL);
1284
1285 nr_pages += end - start;
1286 /*
1287 * buffer must be aligned to at least hardsector size for now
1288 */
1289 if (uaddr & queue_dma_alignment(q))
1290 return ERR_PTR(-EINVAL);
1291 }
1292
1293 if (!nr_pages)
1294 return ERR_PTR(-EINVAL);
1295
1296 bio = bio_kmalloc(gfp_mask, nr_pages);
1297 if (!bio)
1298 return ERR_PTR(-ENOMEM);
1299
1300 ret = -ENOMEM;
1301 pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask);
1302 if (!pages)
1303 goto out;
1304
1305 for (i = 0; i < iov_count; i++) {
1306 unsigned long uaddr = (unsigned long)iov[i].iov_base;
1307 unsigned long len = iov[i].iov_len;
1308 unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
1309 unsigned long start = uaddr >> PAGE_SHIFT;
1310 const int local_nr_pages = end - start;
1311 const int page_limit = cur_page + local_nr_pages;
1312
1313 ret = get_user_pages_fast(uaddr, local_nr_pages,
1314 write_to_vm, &pages[cur_page]);
1315 if (ret < local_nr_pages) {
1316 ret = -EFAULT;
1317 goto out_unmap;
1318 }
1319
1320 offset = uaddr & ~PAGE_MASK;
1321 for (j = cur_page; j < page_limit; j++) {
1322 unsigned int bytes = PAGE_SIZE - offset;
1323
1324 if (len <= 0)
1325 break;
1326
1327 if (bytes > len)
1328 bytes = len;
1329
1330 /*
1331 * sorry...
1332 */
1333 if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <
1334 bytes)
1335 break;
1336
1337 len -= bytes;
1338 offset = 0;
1339 }
1340
1341 cur_page = j;
1342 /*
1343 * release the pages we didn't map into the bio, if any
1344 */
1345 while (j < page_limit)
1346 page_cache_release(pages[j++]);
1347 }
1348
1349 kfree(pages);
1350
1351 /*
1352 * set data direction, and check if mapped pages need bouncing
1353 */
1354 if (!write_to_vm)
1355 bio->bi_rw |= REQ_WRITE;
1356
1357 bio->bi_bdev = bdev;
1358 bio->bi_flags |= (1 << BIO_USER_MAPPED);
1359 return bio;
1360
1361 out_unmap:
1362 for (i = 0; i < nr_pages; i++) {
1363 if(!pages[i])
1364 break;
1365 page_cache_release(pages[i]);
1366 }
1367 out:
1368 kfree(pages);
1369 bio_put(bio);
1370 return ERR_PTR(ret);
1371}
1372
1373/**
1374 * bio_map_user - map user address into bio
1375 * @q: the struct request_queue for the bio
1376 * @bdev: destination block device
1377 * @uaddr: start of user address
1378 * @len: length in bytes
1379 * @write_to_vm: bool indicating whether data is written into the pages (a READ)
1380 * @gfp_mask: memory allocation flags
1381 *
1382 * Map the user space address into a bio suitable for io to a block
1383 * device. Returns an error pointer in case of error.
1384 */
1385struct bio *bio_map_user(struct request_queue *q, struct block_device *bdev,
1386 unsigned long uaddr, unsigned int len, int write_to_vm,
1387 gfp_t gfp_mask)
1388{
1389 struct sg_iovec iov;
1390
1391 iov.iov_base = (void __user *)uaddr;
1392 iov.iov_len = len;
1393
1394 return bio_map_user_iov(q, bdev, &iov, 1, write_to_vm, gfp_mask);
1395}
1396EXPORT_SYMBOL(bio_map_user);
1397
1398/**
1399 * bio_map_user_iov - map user sg_iovec table into bio
1400 * @q: the struct request_queue for the bio
1401 * @bdev: destination block device
1402 * @iov: the iovec.
1403 * @iov_count: number of elements in the iovec
1404 * @write_to_vm: bool indicating whether data is written into the pages (a READ)
1405 * @gfp_mask: memory allocation flags
1406 *
1407 * Map the user space address into a bio suitable for io to a block
1408 * device. Returns an error pointer in case of error.
1409 */
1410struct bio *bio_map_user_iov(struct request_queue *q, struct block_device *bdev,
1411 const struct sg_iovec *iov, int iov_count,
1412 int write_to_vm, gfp_t gfp_mask)
1413{
1414 struct bio *bio;
1415
1416 bio = __bio_map_user_iov(q, bdev, iov, iov_count, write_to_vm,
1417 gfp_mask);
1418 if (IS_ERR(bio))
1419 return bio;
1420
1421 /*
1422 * subtle -- if __bio_map_user() ended up bouncing a bio,
1423 * it would normally disappear when its bi_end_io is run.
1424 * however, we need it for the unmap, so grab an extra
1425 * reference to it
1426 */
1427 bio_get(bio);
1428
1429 return bio;
1430}
1431
1432static void __bio_unmap_user(struct bio *bio)
1433{
1434 struct bio_vec *bvec;
1435 int i;
1436
1437 /*
1438 * make sure we dirty pages we wrote to
1439 */
1440 bio_for_each_segment_all(bvec, bio, i) {
1441 if (bio_data_dir(bio) == READ)
1442 set_page_dirty_lock(bvec->bv_page);
1443
1444 page_cache_release(bvec->bv_page);
1445 }
1446
1447 bio_put(bio);
1448}
1449
1450/**
1451 * bio_unmap_user - unmap a bio
1452 * @bio: the bio being unmapped
1453 *
1454 * Unmap a bio previously mapped by bio_map_user(). Must be called from
1455 * process context.
1456 *
1457 * bio_unmap_user() may sleep.
1458 */
1459void bio_unmap_user(struct bio *bio)
1460{
1461 __bio_unmap_user(bio);
1462 bio_put(bio);
1463}
1464EXPORT_SYMBOL(bio_unmap_user);
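An illustrative sketch (not part of the file): the map/copy helpers above are normally driven through the request-mapping code rather than called directly. Roughly modelled on block/blk-map.c, with the example_ names being hypothetical and alignment checks, request wiring and error handling simplified:

static struct bio *example_map_user_buf(struct request_queue *q,
					unsigned long uaddr, unsigned int len,
					int reading, gfp_t gfp)
{
	/* unaligned buffers must be bounced through kernel pages */
	if (uaddr & queue_dma_alignment(q))
		return bio_copy_user(q, NULL, uaddr, len, reading, gfp);

	/* aligned buffers can have their pages pinned and mapped directly */
	return bio_map_user(q, NULL, uaddr, len, reading, gfp);
}

static int example_unmap_user_buf(struct bio *bio)
{
	/* BIO_USER_MAPPED tells the two cases apart at teardown time */
	if (bio_flagged(bio, BIO_USER_MAPPED)) {
		bio_unmap_user(bio);
		return 0;
	}

	/* bounced: copy back to user space (for reads) and free the pages */
	return bio_uncopy_user(bio);
}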
1465
1466static void bio_map_kern_endio(struct bio *bio, int err)
1467{
1468 bio_put(bio);
1469}
1470
1471static struct bio *__bio_map_kern(struct request_queue *q, void *data,
1472 unsigned int len, gfp_t gfp_mask)
1473{
1474 unsigned long kaddr = (unsigned long)data;
1475 unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
1476 unsigned long start = kaddr >> PAGE_SHIFT;
1477 const int nr_pages = end - start;
1478 int offset, i;
1479 struct bio *bio;
1480
1481 bio = bio_kmalloc(gfp_mask, nr_pages);
1482 if (!bio)
1483 return ERR_PTR(-ENOMEM);
1484
1485 offset = offset_in_page(kaddr);
1486 for (i = 0; i < nr_pages; i++) {
1487 unsigned int bytes = PAGE_SIZE - offset;
1488
1489 if (len <= 0)
1490 break;
1491
1492 if (bytes > len)
1493 bytes = len;
1494
1495 if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
1496 offset) < bytes)
1497 break;
1498
1499 data += bytes;
1500 len -= bytes;
1501 offset = 0;
1502 }
1503
1504 bio->bi_end_io = bio_map_kern_endio;
1505 return bio;
1506}
1507
1508/**
1509 * bio_map_kern - map kernel address into bio
1510 * @q: the struct request_queue for the bio
1511 * @data: pointer to buffer to map
1512 * @len: length in bytes
1513 * @gfp_mask: allocation flags for bio allocation
1514 *
1515 * Map the kernel address into a bio suitable for io to a block
1516 * device. Returns an error pointer in case of error.
1517 */
1518struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
1519 gfp_t gfp_mask)
1520{
1521 struct bio *bio;
1522
1523 bio = __bio_map_kern(q, data, len, gfp_mask);
1524 if (IS_ERR(bio))
1525 return bio;
1526
1527 if (bio->bi_iter.bi_size == len)
1528 return bio;
1529
1530 /*
1531 * Don't support partial mappings.
1532 */
1533 bio_put(bio);
1534 return ERR_PTR(-EINVAL);
1535}
1536EXPORT_SYMBOL(bio_map_kern);
1537
1538static void bio_copy_kern_endio(struct bio *bio, int err)
1539{
1540 struct bio_vec *bvec;
1541 const int read = bio_data_dir(bio) == READ;
1542 struct bio_map_data *bmd = bio->bi_private;
1543 int i;
1544 char *p = bmd->sgvecs[0].iov_base;
1545
1546 bio_for_each_segment_all(bvec, bio, i) {
1547 char *addr = page_address(bvec->bv_page);
1548
1549 if (read)
1550 memcpy(p, addr, bvec->bv_len);
1551
1552 __free_page(bvec->bv_page);
1553 p += bvec->bv_len;
1554 }
1555
1556 kfree(bmd);
1557 bio_put(bio);
1558}
1559
1560/**
1561 * bio_copy_kern - copy kernel address into bio
1562 * @q: the struct request_queue for the bio
1563 * @data: pointer to buffer to copy
1564 * @len: length in bytes
1565 * @gfp_mask: allocation flags for bio and page allocation
1566 * @reading: data direction is READ
1567 *
1568 * Copy the kernel address into a bio suitable for io to a block
1569 * device. Returns an error pointer in case of error.
1570 */
1571struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
1572 gfp_t gfp_mask, int reading)
1573{
1574 struct bio *bio;
1575 struct bio_vec *bvec;
1576 int i;
1577
1578 bio = bio_copy_user(q, NULL, (unsigned long)data, len, 1, gfp_mask);
1579 if (IS_ERR(bio))
1580 return bio;
1581
1582 if (!reading) {
1583 void *p = data;
1584
1585 bio_for_each_segment_all(bvec, bio, i) {
1586 char *addr = page_address(bvec->bv_page);
1587
1588 memcpy(addr, p, bvec->bv_len);
1589 p += bvec->bv_len;
1590 }
1591 }
1592
1593 bio->bi_end_io = bio_copy_kern_endio;
1594
1595 return bio;
1596}
1597EXPORT_SYMBOL(bio_copy_kern);
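For kernel buffers the choice between the two helpers above follows the same shape. A hedged sketch, loosely based on what blk_rq_map_kern() does; the example_ function is hypothetical, and direction flagging on the bio as well as length and queue-limit checks are omitted:

static struct bio *example_map_kernel_buf(struct request_queue *q, void *kbuf,
					  unsigned int len, int reading,
					  gfp_t gfp)
{
	unsigned long addr = (unsigned long) kbuf;

	/*
	 * Stack or misaligned buffers cannot be mapped in place; bounce
	 * them with bio_copy_kern(). Everything else can be mapped
	 * directly with bio_map_kern().
	 */
	if (!blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf))
		return bio_copy_kern(q, kbuf, len, gfp, reading);

	return bio_map_kern(q, kbuf, len, gfp);
}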
1598
1599/*
1600 * bio_set_pages_dirty() and bio_check_pages_dirty() are support functions
1601 * for performing direct-IO in BIOs.
1602 *
1603 * The problem is that we cannot run set_page_dirty() from interrupt context
1604 * because the required locks are not interrupt-safe. So what we can do is to
1605 * mark the pages dirty _before_ performing IO. And in interrupt context,
1606 * check that the pages are still dirty. If so, fine. If not, redirty them
1607 * in process context.
1608 *
1609 * We special-case compound pages here: normally this means reads into hugetlb
1610 * pages. The logic in here doesn't really work right for compound pages
1611 * because the VM does not uniformly chase down the head page in all cases.
1612 * But dirtiness of compound pages is pretty meaningless anyway: the VM doesn't
1613 * handle them at all. So we skip compound pages here at an early stage.
1614 *
1615 * Note that this code is very hard to test under normal circumstances because
1616 * direct-io pins the pages with get_user_pages(). This makes
1617 * is_page_cache_freeable return false, and the VM will not clean the pages.
1618 * But other code (e.g., flusher threads) could clean the pages if they are mapped
1619 * pagecache.
1620 *
1621 * Simply disabling the call to bio_set_pages_dirty() is a good way to test the
1622 * deferred bio dirtying paths.
1623 */
1624
1625/*
1626 * bio_set_pages_dirty() will mark all the bio's pages as dirty.
1627 */
1628void bio_set_pages_dirty(struct bio *bio)
1629{
1630 struct bio_vec *bvec;
1631 int i;
1632
1633 bio_for_each_segment_all(bvec, bio, i) {
1634 struct page *page = bvec->bv_page;
1635
1636 if (page && !PageCompound(page))
1637 set_page_dirty_lock(page);
1638 }
1639}
1640
1641static void bio_release_pages(struct bio *bio)
1642{
1643 struct bio_vec *bvec;
1644 int i;
1645
1646 bio_for_each_segment_all(bvec, bio, i) {
1647 struct page *page = bvec->bv_page;
1648
1649 if (page)
1650 put_page(page);
1651 }
1652}
1653
1654/*
1655 * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
1656 * If they are, then fine. If, however, some pages are clean then they must
1657 * have been written out during the direct-IO read. So we take another ref on
1658 * the BIO and the offending pages and re-dirty the pages in process context.
1659 *
1660 * It is expected that bio_check_pages_dirty() will wholly own the BIO from
1661 * here on. It will run one page_cache_release() against each page and will
1662 * run one bio_put() against the BIO.
1663 */
1664
1665static void bio_dirty_fn(struct work_struct *work);
1666
1667static DECLARE_WORK(bio_dirty_work, bio_dirty_fn);
1668static DEFINE_SPINLOCK(bio_dirty_lock);
1669static struct bio *bio_dirty_list;
1670
1671/*
1672 * This runs in process context
1673 */
1674static void bio_dirty_fn(struct work_struct *work)
1675{
1676 unsigned long flags;
1677 struct bio *bio;
1678
1679 spin_lock_irqsave(&bio_dirty_lock, flags);
1680 bio = bio_dirty_list;
1681 bio_dirty_list = NULL;
1682 spin_unlock_irqrestore(&bio_dirty_lock, flags);
1683
1684 while (bio) {
1685 struct bio *next = bio->bi_private;
1686
1687 bio_set_pages_dirty(bio);
1688 bio_release_pages(bio);
1689 bio_put(bio);
1690 bio = next;
1691 }
1692}
1693
1694void bio_check_pages_dirty(struct bio *bio)
1695{
1696 struct bio_vec *bvec;
1697 int nr_clean_pages = 0;
1698 int i;
1699
1700 bio_for_each_segment_all(bvec, bio, i) {
1701 struct page *page = bvec->bv_page;
1702
1703 if (PageDirty(page) || PageCompound(page)) {
1704 page_cache_release(page);
1705 bvec->bv_page = NULL;
1706 } else {
1707 nr_clean_pages++;
1708 }
1709 }
1710
1711 if (nr_clean_pages) {
1712 unsigned long flags;
1713
1714 spin_lock_irqsave(&bio_dirty_lock, flags);
1715 bio->bi_private = bio_dirty_list;
1716 bio_dirty_list = bio;
1717 spin_unlock_irqrestore(&bio_dirty_lock, flags);
1718 schedule_work(&bio_dirty_work);
1719 } else {
1720 bio_put(bio);
1721 }
1722}
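A condensed sketch of the scheme described in the comment block above, loosely following fs/direct-io.c: pages are dirtied before submission, and the completion side either re-checks them (async read) or simply releases them. The example_ names are hypothetical and reference counting of the surrounding dio is left out:

static void example_dio_end_io(struct bio *bio, int error)
{
	if (bio_data_dir(bio) == READ) {
		/* redirties clean pages, possibly deferring to process context */
		bio_check_pages_dirty(bio);
	} else {
		struct bio_vec *bvec;
		int i;

		bio_for_each_segment_all(bvec, bio, i)
			page_cache_release(bvec->bv_page);
		bio_put(bio);
	}
}

static void example_dio_submit(struct bio *bio, int rw)
{
	bio->bi_end_io = example_dio_end_io;

	/* mark the pinned pages dirty up front, as explained above */
	if (rw == READ)
		bio_set_pages_dirty(bio);

	submit_bio(rw, bio);
}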
1723
1724#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
1725void bio_flush_dcache_pages(struct bio *bi)
1726{
1727 struct bio_vec bvec;
1728 struct bvec_iter iter;
1729
1730 bio_for_each_segment(bvec, bi, iter)
1731 flush_dcache_page(bvec.bv_page);
1732}
1733EXPORT_SYMBOL(bio_flush_dcache_pages);
1734#endif
1735
1736/**
1737 * bio_endio - end I/O on a bio
1738 * @bio: bio
1739 * @error: error, if any
1740 *
1741 * Description:
1742 * bio_endio() will end I/O on the whole bio. bio_endio() is the
1743 * preferred way to end I/O on a bio, it takes care of clearing
1744 * BIO_UPTODATE on error. @error is 0 on success, and one of the
1745 * established -Exxxx (-EIO, for instance) error values in case
1746 * something went wrong. No one should call bi_end_io() directly on a
1747 * bio unless they own it and thus know that it has an end_io
1748 * function.
1749 **/
1750void bio_endio(struct bio *bio, int error)
1751{
1752 while (bio) {
1753 BUG_ON(atomic_read(&bio->bi_remaining) <= 0);
1754
1755 if (error)
1756 clear_bit(BIO_UPTODATE, &bio->bi_flags);
1757 else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
1758 error = -EIO;
1759
1760 if (!atomic_dec_and_test(&bio->bi_remaining))
1761 return;
1762
1763 /*
1764 * Need to have a real endio function for chained bios,
1765 * otherwise various corner cases will break (like stacking
1766 * block devices that save/restore bi_end_io) - however, we want
1767 * to avoid unbounded recursion and blowing the stack. Tail call
1768 * optimization would handle this, but compiling with frame
1769 * pointers also disables gcc's sibling call optimization.
1770 */
1771 if (bio->bi_end_io == bio_chain_endio) {
1772 struct bio *parent = bio->bi_private;
1773 bio_put(bio);
1774 bio = parent;
1775 } else {
1776 if (bio->bi_end_io)
1777 bio->bi_end_io(bio, error);
1778 bio = NULL;
1779 }
1780 }
1781}
1782EXPORT_SYMBOL(bio_endio);
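A minimal sketch of the "own the bio" case mentioned in the description: a stacking driver that clones a bio, hooks its own completion handler, and propagates the result to the original with bio_endio(). The example_ names are illustrative and error handling is trimmed:

static void example_clone_endio(struct bio *clone, int error)
{
	struct bio *orig = clone->bi_private;

	bio_put(clone);
	/* complete the original bio with the clone's status */
	bio_endio(orig, error);
}

static void example_remap_and_submit(struct bio *orig, struct block_device *bdev,
				     struct bio_set *bs)
{
	struct bio *clone = bio_clone_fast(orig, GFP_NOIO, bs);

	if (!clone) {
		bio_endio(orig, -ENOMEM);
		return;
	}

	clone->bi_bdev = bdev;
	clone->bi_private = orig;
	clone->bi_end_io = example_clone_endio;
	generic_make_request(clone);
}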
1783
1784/**
1785 * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
1786 * @bio: bio
1787 * @error: error, if any
1788 *
1789 * For code that has saved and restored bi_end_io; think hard before using this
1790 * function, probably you should've cloned the entire bio.
1791 **/
1792void bio_endio_nodec(struct bio *bio, int error)
1793{
1794 atomic_inc(&bio->bi_remaining);
1795 bio_endio(bio, error);
1796}
1797EXPORT_SYMBOL(bio_endio_nodec);
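A hedged sketch of the save/restore case the comment refers to; struct example_ctx and its fields are hypothetical, and as the comment says, cloning the bio is usually the better option. The point is that this handler already ran off a bio_endio() that consumed one bi_remaining reference, so it must not consume another:

struct example_ctx {
	bio_end_io_t	*saved_end_io;
	void		*saved_private;
};

static void example_hijacked_endio(struct bio *bio, int error)
{
	struct example_ctx *ctx = bio->bi_private;

	/* put back what was saved when the bio was hijacked */
	bio->bi_end_io = ctx->saved_end_io;
	bio->bi_private = ctx->saved_private;
	kfree(ctx);

	/* bi_remaining was already dropped once for this completion */
	bio_endio_nodec(bio, error);
}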
1798
1799/**
1800 * bio_split - split a bio
1801 * @bio: bio to split
1802 * @sectors: number of sectors to split from the front of @bio
1803 * @gfp: gfp mask
1804 * @bs: bio set to allocate from
1805 *
1806 * Allocates and returns a new bio which represents @sectors from the start of
1807 * @bio, and updates @bio to represent the remaining sectors.
1808 *
1809 * The newly allocated bio will point to @bio's bi_io_vec; it is the caller's
1810 * responsibility to ensure that @bio is not freed before the split.
1811 */
1812struct bio *bio_split(struct bio *bio, int sectors,
1813 gfp_t gfp, struct bio_set *bs)
1814{
1815 struct bio *split = NULL;
1816
1817 BUG_ON(sectors <= 0);
1818 BUG_ON(sectors >= bio_sectors(bio));
1819
1820 split = bio_clone_fast(bio, gfp, bs);
1821 if (!split)
1822 return NULL;
1823
1824 split->bi_iter.bi_size = sectors << 9;
1825
1826 if (bio_integrity(split))
1827 bio_integrity_trim(split, 0, sectors);
1828
1829 bio_advance(bio, split->bi_iter.bi_size);
1830
1831 return split;
1832}
1833EXPORT_SYMBOL(bio_split);
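A sketch of how a driver might use bio_split() together with bio_chain() to cap the size of the bios it processes. The example_ function and max_sectors are assumptions, and fs_bio_set is used purely for illustration:

static void example_make_request(struct bio *bio, unsigned int max_sectors)
{
	if (bio_sectors(bio) > max_sectors) {
		struct bio *split;

		split = bio_split(bio, max_sectors, GFP_NOIO, fs_bio_set);

		/* the remainder only completes once the split part has */
		bio_chain(split, bio);

		/* requeue the remainder and carry on with the front part */
		generic_make_request(bio);
		bio = split;
	}

	/* ... remap and submit 'bio', now at most max_sectors long ... */
}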
1834
1835/**
1836 * bio_trim - trim a bio
1837 * @bio: bio to trim
1838 * @offset: number of sectors to trim from the front of @bio
1839 * @size: size we want to trim @bio to, in sectors
1840 */
1841void bio_trim(struct bio *bio, int offset, int size)
1842{
1843 /* 'bio' is a cloned bio which we need to trim to match
1844 * the given offset and size.
1845 */
1846
1847 size <<= 9;
1848 if (offset == 0 && size == bio->bi_iter.bi_size)
1849 return;
1850
1851 clear_bit(BIO_SEG_VALID, &bio->bi_flags);
1852
1853 bio_advance(bio, offset << 9);
1854
1855 bio->bi_iter.bi_size = size;
1856}
1857EXPORT_SYMBOL_GPL(bio_trim);
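For illustration, the typical caller is a stacking driver that clones a bio and then narrows the clone to the region it cares about; the example_ helper and its sector arguments are hypothetical:

static struct bio *example_clone_region(struct bio *bio, int offset, int size,
					struct bio_set *bs)
{
	struct bio *clone = bio_clone_fast(bio, GFP_NOIO, bs);

	if (!clone)
		return NULL;

	/* keep 'size' sectors starting 'offset' sectors into the clone */
	bio_trim(clone, offset, size);
	return clone;
}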
1858
1859/*
1860 * create memory pools for biovec's in a bio_set.
1861 * use the global biovec slabs created for general use.
1862 */
1863mempool_t *biovec_create_pool(int pool_entries)
1864{
1865 struct biovec_slab *bp = bvec_slabs + BIOVEC_MAX_IDX;
1866
1867 return mempool_create_slab_pool(pool_entries, bp->slab);
1868}
1869
1870void bioset_free(struct bio_set *bs)
1871{
1872 if (bs->rescue_workqueue)
1873 destroy_workqueue(bs->rescue_workqueue);
1874
1875 if (bs->bio_pool)
1876 mempool_destroy(bs->bio_pool);
1877
1878 if (bs->bvec_pool)
1879 mempool_destroy(bs->bvec_pool);
1880
1881 bioset_integrity_free(bs);
1882 bio_put_slab(bs);
1883
1884 kfree(bs);
1885}
1886EXPORT_SYMBOL(bioset_free);
1887
1888/**
1889 * bioset_create - Create a bio_set
1890 * @pool_size: Number of bio and bio_vecs to cache in the mempool
1891 * @front_pad: Number of bytes to allocate in front of the returned bio
1892 *
1893 * Description:
1894 * Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
1895 * to ask for a number of bytes to be allocated in front of the bio.
1896 * Front pad allocation is useful for embedding the bio inside
1897 * another structure, to avoid allocating extra data to go with the bio.
1898 * Note that the bio must be embedded at the END of that structure always,
1899 * or things will break badly.
1900 */
1901struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
1902{
1903 unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
1904 struct bio_set *bs;
1905
1906 bs = kzalloc(sizeof(*bs), GFP_KERNEL);
1907 if (!bs)
1908 return NULL;
1909
1910 bs->front_pad = front_pad;
1911
1912 spin_lock_init(&bs->rescue_lock);
1913 bio_list_init(&bs->rescue_list);
1914 INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
1915
1916 bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
1917 if (!bs->bio_slab) {
1918 kfree(bs);
1919 return NULL;
1920 }
1921
1922 bs->bio_pool = mempool_create_slab_pool(pool_size, bs->bio_slab);
1923 if (!bs->bio_pool)
1924 goto bad;
1925
1926 bs->bvec_pool = biovec_create_pool(pool_size);
1927 if (!bs->bvec_pool)
1928 goto bad;
1929
1930 bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
1931 if (!bs->rescue_workqueue)
1932 goto bad;
1933
1934 return bs;
1935bad:
1936 bioset_free(bs);
1937 return NULL;
1938}
1939EXPORT_SYMBOL(bioset_create);
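A sketch of the front-pad embedding the description refers to. struct example_io and the pool size are assumptions; the point is that the bio sits at the end of the structure and front_pad covers everything before it, so container_of() recovers the wrapper from the bio:

struct example_io {
	void		*private;
	struct bio	bio;	/* must stay last, as noted above */
};

static struct bio_set *example_bs;

static int __init example_init(void)
{
	example_bs = bioset_create(64, offsetof(struct example_io, bio));
	return example_bs ? 0 : -ENOMEM;
}

static struct example_io *example_alloc_io(unsigned int nr_vecs)
{
	struct bio *bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, example_bs);

	if (!bio)
		return NULL;

	/* front_pad guarantees the containing structure was allocated too */
	return container_of(bio, struct example_io, bio);
}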
1940
1941#ifdef CONFIG_BLK_CGROUP
1942/**
1943 * bio_associate_current - associate a bio with %current
1944 * @bio: target bio
1945 *
1946 * Associate @bio with %current if it hasn't been associated yet. Block
1947 * layer will treat @bio as if it were issued by %current no matter which
1948 * task actually issues it.
1949 *
1950 * This function takes an extra reference of @task's io_context and blkcg
1951 * which will be put when @bio is released. The caller must own @bio,
1952 * ensure %current->io_context exists, and is responsible for synchronizing
1953 * calls to this function.
1954 */
1955int bio_associate_current(struct bio *bio)
1956{
1957 struct io_context *ioc;
1958 struct cgroup_subsys_state *css;
1959
1960 if (bio->bi_ioc)
1961 return -EBUSY;
1962
1963 ioc = current->io_context;
1964 if (!ioc)
1965 return -ENOENT;
1966
1967 /* acquire active ref on @ioc and associate */
1968 get_io_context_active(ioc);
1969 bio->bi_ioc = ioc;
1970
1971 /* associate blkcg if exists */
1972 rcu_read_lock();
1973 css = task_css(current, blkio_cgrp_id);
1974 if (css && css_tryget(css))
1975 bio->bi_css = css;
1976 rcu_read_unlock();
1977
1978 return 0;
1979}
1980
1981/**
1982 * bio_disassociate_task - undo bio_associate_current()
1983 * @bio: target bio
1984 */
1985void bio_disassociate_task(struct bio *bio)
1986{
1987 if (bio->bi_ioc) {
1988 put_io_context(bio->bi_ioc);
1989 bio->bi_ioc = NULL;
1990 }
1991 if (bio->bi_css) {
1992 css_put(bio->bi_css);
1993 bio->bi_css = NULL;
1994 }
1995}
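A hedged sketch of when bio_associate_current() is useful: a driver that accepts bios in the caller's context but issues them later from a worker thread. The example_ function is hypothetical; a -ENOENT return simply means the task has no io_context to attach:

static void example_queue_bio(struct bio *bio)
{
	/*
	 * Record the submitting task's io_context and blkcg now, so that
	 * throttling and accounting still follow the originator when a
	 * worker issues the bio later. The references are dropped via
	 * bio_disassociate_task() when the bio is freed.
	 */
	bio_associate_current(bio);

	/* ... hand 'bio' off to a workqueue for actual submission ... */
}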
1996
1997#endif /* CONFIG_BLK_CGROUP */
1998
1999static void __init biovec_init_slabs(void)
2000{
2001 int i;
2002
2003 for (i = 0; i < BIOVEC_NR_POOLS; i++) {
2004 int size;
2005 struct biovec_slab *bvs = bvec_slabs + i;
2006
2007 if (bvs->nr_vecs <= BIO_INLINE_VECS) {
2008 bvs->slab = NULL;
2009 continue;
2010 }
2011
2012 size = bvs->nr_vecs * sizeof(struct bio_vec);
2013 bvs->slab = kmem_cache_create(bvs->name, size, 0,
2014 SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
2015 }
2016}
2017
2018static int __init init_bio(void)
2019{
2020 bio_slab_max = 2;
2021 bio_slab_nr = 0;
2022 bio_slabs = kzalloc(bio_slab_max * sizeof(struct bio_slab), GFP_KERNEL);
2023 if (!bio_slabs)
2024 panic("bio: can't allocate bios\n");
2025
2026 bio_integrity_init();
2027 biovec_init_slabs();
2028
2029 fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
2030 if (!fs_bio_set)
2031 panic("bio: can't allocate bios\n");
2032
2033 if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
2034 panic("bio: can't create integrity pool\n");
2035
2036 return 0;
2037}
2038subsys_initcall(init_bio);
diff --git a/block/blk-core.c b/block/blk-core.c
index a0e3096c4bb5..40d654861c33 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -146,8 +146,8 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
146 printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n", 146 printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n",
147 (unsigned long long)blk_rq_pos(rq), 147 (unsigned long long)blk_rq_pos(rq),
148 blk_rq_sectors(rq), blk_rq_cur_sectors(rq)); 148 blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
149 printk(KERN_INFO " bio %p, biotail %p, buffer %p, len %u\n", 149 printk(KERN_INFO " bio %p, biotail %p, len %u\n",
150 rq->bio, rq->biotail, rq->buffer, blk_rq_bytes(rq)); 150 rq->bio, rq->biotail, blk_rq_bytes(rq));
151 151
152 if (rq->cmd_type == REQ_TYPE_BLOCK_PC) { 152 if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
153 printk(KERN_INFO " cdb: "); 153 printk(KERN_INFO " cdb: ");
@@ -251,8 +251,10 @@ void blk_sync_queue(struct request_queue *q)
251 struct blk_mq_hw_ctx *hctx; 251 struct blk_mq_hw_ctx *hctx;
252 int i; 252 int i;
253 253
254 queue_for_each_hw_ctx(q, hctx, i) 254 queue_for_each_hw_ctx(q, hctx, i) {
255 cancel_delayed_work_sync(&hctx->delayed_work); 255 cancel_delayed_work_sync(&hctx->run_work);
256 cancel_delayed_work_sync(&hctx->delay_work);
257 }
256 } else { 258 } else {
257 cancel_delayed_work_sync(&q->delay_work); 259 cancel_delayed_work_sync(&q->delay_work);
258 } 260 }
@@ -574,12 +576,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
574 if (!q) 576 if (!q)
575 return NULL; 577 return NULL;
576 578
577 if (percpu_counter_init(&q->mq_usage_counter, 0))
578 goto fail_q;
579
580 q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask); 579 q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
581 if (q->id < 0) 580 if (q->id < 0)
582 goto fail_c; 581 goto fail_q;
583 582
584 q->backing_dev_info.ra_pages = 583 q->backing_dev_info.ra_pages =
585 (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; 584 (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
@@ -637,8 +636,6 @@ fail_bdi:
637 bdi_destroy(&q->backing_dev_info); 636 bdi_destroy(&q->backing_dev_info);
638fail_id: 637fail_id:
639 ida_simple_remove(&blk_queue_ida, q->id); 638 ida_simple_remove(&blk_queue_ida, q->id);
640fail_c:
641 percpu_counter_destroy(&q->mq_usage_counter);
642fail_q: 639fail_q:
643 kmem_cache_free(blk_requestq_cachep, q); 640 kmem_cache_free(blk_requestq_cachep, q);
644 return NULL; 641 return NULL;
@@ -846,6 +843,47 @@ static void freed_request(struct request_list *rl, unsigned int flags)
846 __freed_request(rl, sync ^ 1); 843 __freed_request(rl, sync ^ 1);
847} 844}
848 845
846int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
847{
848 struct request_list *rl;
849
850 spin_lock_irq(q->queue_lock);
851 q->nr_requests = nr;
852 blk_queue_congestion_threshold(q);
853
854 /* congestion isn't cgroup aware and follows root blkcg for now */
855 rl = &q->root_rl;
856
857 if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
858 blk_set_queue_congested(q, BLK_RW_SYNC);
859 else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
860 blk_clear_queue_congested(q, BLK_RW_SYNC);
861
862 if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
863 blk_set_queue_congested(q, BLK_RW_ASYNC);
864 else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
865 blk_clear_queue_congested(q, BLK_RW_ASYNC);
866
867 blk_queue_for_each_rl(rl, q) {
868 if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
869 blk_set_rl_full(rl, BLK_RW_SYNC);
870 } else {
871 blk_clear_rl_full(rl, BLK_RW_SYNC);
872 wake_up(&rl->wait[BLK_RW_SYNC]);
873 }
874
875 if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
876 blk_set_rl_full(rl, BLK_RW_ASYNC);
877 } else {
878 blk_clear_rl_full(rl, BLK_RW_ASYNC);
879 wake_up(&rl->wait[BLK_RW_ASYNC]);
880 }
881 }
882
883 spin_unlock_irq(q->queue_lock);
884 return 0;
885}
886
849/* 887/*
850 * Determine if elevator data should be initialized when allocating the 888 * Determine if elevator data should be initialized when allocating the
851 * request associated with @bio. 889 * request associated with @bio.
@@ -1135,7 +1173,7 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
1135struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask) 1173struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
1136{ 1174{
1137 if (q->mq_ops) 1175 if (q->mq_ops)
1138 return blk_mq_alloc_request(q, rw, gfp_mask); 1176 return blk_mq_alloc_request(q, rw, gfp_mask, false);
1139 else 1177 else
1140 return blk_old_get_request(q, rw, gfp_mask); 1178 return blk_old_get_request(q, rw, gfp_mask);
1141} 1179}
@@ -1231,12 +1269,15 @@ static void add_acct_request(struct request_queue *q, struct request *rq,
1231static void part_round_stats_single(int cpu, struct hd_struct *part, 1269static void part_round_stats_single(int cpu, struct hd_struct *part,
1232 unsigned long now) 1270 unsigned long now)
1233{ 1271{
1272 int inflight;
1273
1234 if (now == part->stamp) 1274 if (now == part->stamp)
1235 return; 1275 return;
1236 1276
1237 if (part_in_flight(part)) { 1277 inflight = part_in_flight(part);
1278 if (inflight) {
1238 __part_stat_add(cpu, part, time_in_queue, 1279 __part_stat_add(cpu, part, time_in_queue,
1239 part_in_flight(part) * (now - part->stamp)); 1280 inflight * (now - part->stamp));
1240 __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); 1281 __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
1241 } 1282 }
1242 part->stamp = now; 1283 part->stamp = now;
@@ -1360,7 +1401,6 @@ void blk_add_request_payload(struct request *rq, struct page *page,
1360 1401
1361 rq->__data_len = rq->resid_len = len; 1402 rq->__data_len = rq->resid_len = len;
1362 rq->nr_phys_segments = 1; 1403 rq->nr_phys_segments = 1;
1363 rq->buffer = bio_data(bio);
1364} 1404}
1365EXPORT_SYMBOL_GPL(blk_add_request_payload); 1405EXPORT_SYMBOL_GPL(blk_add_request_payload);
1366 1406
@@ -1402,12 +1442,6 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
1402 bio->bi_next = req->bio; 1442 bio->bi_next = req->bio;
1403 req->bio = bio; 1443 req->bio = bio;
1404 1444
1405 /*
1406 * may not be valid. if the low level driver said
1407 * it didn't need a bounce buffer then it better
1408 * not touch req->buffer either...
1409 */
1410 req->buffer = bio_data(bio);
1411 req->__sector = bio->bi_iter.bi_sector; 1445 req->__sector = bio->bi_iter.bi_sector;
1412 req->__data_len += bio->bi_iter.bi_size; 1446 req->__data_len += bio->bi_iter.bi_size;
1413 req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); 1447 req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
@@ -1432,6 +1466,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
1432 * added on the elevator at this point. In addition, we don't have 1466 * added on the elevator at this point. In addition, we don't have
1433 * reliable access to the elevator outside queue lock. Only check basic 1467 * reliable access to the elevator outside queue lock. Only check basic
1434 * merging parameters without querying the elevator. 1468 * merging parameters without querying the elevator.
1469 *
1470 * Caller must ensure !blk_queue_nomerges(q) beforehand.
1435 */ 1471 */
1436bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, 1472bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
1437 unsigned int *request_count) 1473 unsigned int *request_count)
@@ -1441,9 +1477,6 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
1441 bool ret = false; 1477 bool ret = false;
1442 struct list_head *plug_list; 1478 struct list_head *plug_list;
1443 1479
1444 if (blk_queue_nomerges(q))
1445 goto out;
1446
1447 plug = current->plug; 1480 plug = current->plug;
1448 if (!plug) 1481 if (!plug)
1449 goto out; 1482 goto out;
@@ -1522,7 +1555,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
1522 * Check if we can merge with the plugged list before grabbing 1555 * Check if we can merge with the plugged list before grabbing
1523 * any locks. 1556 * any locks.
1524 */ 1557 */
1525 if (blk_attempt_plug_merge(q, bio, &request_count)) 1558 if (!blk_queue_nomerges(q) &&
1559 blk_attempt_plug_merge(q, bio, &request_count))
1526 return; 1560 return;
1527 1561
1528 spin_lock_irq(q->queue_lock); 1562 spin_lock_irq(q->queue_lock);
@@ -1654,7 +1688,7 @@ static int __init fail_make_request_debugfs(void)
1654 struct dentry *dir = fault_create_debugfs_attr("fail_make_request", 1688 struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
1655 NULL, &fail_make_request); 1689 NULL, &fail_make_request);
1656 1690
1657 return IS_ERR(dir) ? PTR_ERR(dir) : 0; 1691 return PTR_ERR_OR_ZERO(dir);
1658} 1692}
1659 1693
1660late_initcall(fail_make_request_debugfs); 1694late_initcall(fail_make_request_debugfs);
@@ -2434,7 +2468,6 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
2434 } 2468 }
2435 2469
2436 req->__data_len -= total_bytes; 2470 req->__data_len -= total_bytes;
2437 req->buffer = bio_data(req->bio);
2438 2471
2439 /* update sector only for requests with clear definition of sector */ 2472 /* update sector only for requests with clear definition of sector */
2440 if (req->cmd_type == REQ_TYPE_FS) 2473 if (req->cmd_type == REQ_TYPE_FS)
@@ -2503,7 +2536,7 @@ EXPORT_SYMBOL_GPL(blk_unprep_request);
2503/* 2536/*
2504 * queue lock must be held 2537 * queue lock must be held
2505 */ 2538 */
2506static void blk_finish_request(struct request *req, int error) 2539void blk_finish_request(struct request *req, int error)
2507{ 2540{
2508 if (blk_rq_tagged(req)) 2541 if (blk_rq_tagged(req))
2509 blk_queue_end_tag(req->q, req); 2542 blk_queue_end_tag(req->q, req);
@@ -2529,6 +2562,7 @@ static void blk_finish_request(struct request *req, int error)
2529 __blk_put_request(req->q, req); 2562 __blk_put_request(req->q, req);
2530 } 2563 }
2531} 2564}
2565EXPORT_SYMBOL(blk_finish_request);
2532 2566
2533/** 2567/**
2534 * blk_end_bidi_request - Complete a bidi request 2568 * blk_end_bidi_request - Complete a bidi request
@@ -2752,10 +2786,9 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
2752 /* Bit 0 (R/W) is identical in rq->cmd_flags and bio->bi_rw */ 2786 /* Bit 0 (R/W) is identical in rq->cmd_flags and bio->bi_rw */
2753 rq->cmd_flags |= bio->bi_rw & REQ_WRITE; 2787 rq->cmd_flags |= bio->bi_rw & REQ_WRITE;
2754 2788
2755 if (bio_has_data(bio)) { 2789 if (bio_has_data(bio))
2756 rq->nr_phys_segments = bio_phys_segments(q, bio); 2790 rq->nr_phys_segments = bio_phys_segments(q, bio);
2757 rq->buffer = bio_data(bio); 2791
2758 }
2759 rq->__data_len = bio->bi_iter.bi_size; 2792 rq->__data_len = bio->bi_iter.bi_size;
2760 rq->bio = rq->biotail = bio; 2793 rq->bio = rq->biotail = bio;
2761 2794
@@ -2831,7 +2864,7 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
2831 2864
2832/* 2865/*
2833 * Copy attributes of the original request to the clone request. 2866 * Copy attributes of the original request to the clone request.
2834 * The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not copied. 2867 * The actual data parts (e.g. ->cmd, ->sense) are not copied.
2835 */ 2868 */
2836static void __blk_rq_prep_clone(struct request *dst, struct request *src) 2869static void __blk_rq_prep_clone(struct request *dst, struct request *src)
2837{ 2870{
@@ -2857,7 +2890,7 @@ static void __blk_rq_prep_clone(struct request *dst, struct request *src)
2857 * 2890 *
2858 * Description: 2891 * Description:
2859 * Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq. 2892 * Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
2860 * The actual data parts of @rq_src (e.g. ->cmd, ->buffer, ->sense) 2893 * The actual data parts of @rq_src (e.g. ->cmd, ->sense)
2861 * are not copied, and copying such parts is the caller's responsibility. 2894 * are not copied, and copying such parts is the caller's responsibility.
2862 * Also, pages which the original bios are pointing to are not copied 2895 * Also, pages which the original bios are pointing to are not copied
2863 * and the cloned bios just point same pages. 2896 * and the cloned bios just point same pages.
@@ -2904,20 +2937,25 @@ free_and_out:
2904} 2937}
2905EXPORT_SYMBOL_GPL(blk_rq_prep_clone); 2938EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
2906 2939
2907int kblockd_schedule_work(struct request_queue *q, struct work_struct *work) 2940int kblockd_schedule_work(struct work_struct *work)
2908{ 2941{
2909 return queue_work(kblockd_workqueue, work); 2942 return queue_work(kblockd_workqueue, work);
2910} 2943}
2911EXPORT_SYMBOL(kblockd_schedule_work); 2944EXPORT_SYMBOL(kblockd_schedule_work);
2912 2945
2913int kblockd_schedule_delayed_work(struct request_queue *q, 2946int kblockd_schedule_delayed_work(struct delayed_work *dwork,
2914 struct delayed_work *dwork, unsigned long delay) 2947 unsigned long delay)
2915{ 2948{
2916 return queue_delayed_work(kblockd_workqueue, dwork, delay); 2949 return queue_delayed_work(kblockd_workqueue, dwork, delay);
2917} 2950}
2918EXPORT_SYMBOL(kblockd_schedule_delayed_work); 2951EXPORT_SYMBOL(kblockd_schedule_delayed_work);
2919 2952
2920#define PLUG_MAGIC 0x91827364 2953int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
2954 unsigned long delay)
2955{
2956 return queue_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
2957}
2958EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
2921 2959
2922/** 2960/**
2923 * blk_start_plug - initialize blk_plug and track it inside the task_struct 2961 * blk_start_plug - initialize blk_plug and track it inside the task_struct
@@ -2937,7 +2975,6 @@ void blk_start_plug(struct blk_plug *plug)
2937{ 2975{
2938 struct task_struct *tsk = current; 2976 struct task_struct *tsk = current;
2939 2977
2940 plug->magic = PLUG_MAGIC;
2941 INIT_LIST_HEAD(&plug->list); 2978 INIT_LIST_HEAD(&plug->list);
2942 INIT_LIST_HEAD(&plug->mq_list); 2979 INIT_LIST_HEAD(&plug->mq_list);
2943 INIT_LIST_HEAD(&plug->cb_list); 2980 INIT_LIST_HEAD(&plug->cb_list);
@@ -3034,8 +3071,6 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
3034 LIST_HEAD(list); 3071 LIST_HEAD(list);
3035 unsigned int depth; 3072 unsigned int depth;
3036 3073
3037 BUG_ON(plug->magic != PLUG_MAGIC);
3038
3039 flush_plug_callbacks(plug, from_schedule); 3074 flush_plug_callbacks(plug, from_schedule);
3040 3075
3041 if (!list_empty(&plug->mq_list)) 3076 if (!list_empty(&plug->mq_list))
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 43e6b4755e9a..ff87c664b7df 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,21 +130,13 @@ static void blk_flush_restore_request(struct request *rq)
130 blk_clear_rq_complete(rq); 130 blk_clear_rq_complete(rq);
131} 131}
132 132
133static void mq_flush_run(struct work_struct *work)
134{
135 struct request *rq;
136
137 rq = container_of(work, struct request, mq_flush_work);
138
139 memset(&rq->csd, 0, sizeof(rq->csd));
140 blk_mq_insert_request(rq, false, true, false);
141}
142
143static bool blk_flush_queue_rq(struct request *rq, bool add_front) 133static bool blk_flush_queue_rq(struct request *rq, bool add_front)
144{ 134{
145 if (rq->q->mq_ops) { 135 if (rq->q->mq_ops) {
146 INIT_WORK(&rq->mq_flush_work, mq_flush_run); 136 struct request_queue *q = rq->q;
147 kblockd_schedule_work(rq->q, &rq->mq_flush_work); 137
138 blk_mq_add_to_requeue_list(rq, add_front);
139 blk_mq_kick_requeue_list(q);
148 return false; 140 return false;
149 } else { 141 } else {
150 if (add_front) 142 if (add_front)
@@ -231,8 +223,10 @@ static void flush_end_io(struct request *flush_rq, int error)
231 struct request *rq, *n; 223 struct request *rq, *n;
232 unsigned long flags = 0; 224 unsigned long flags = 0;
233 225
234 if (q->mq_ops) 226 if (q->mq_ops) {
235 spin_lock_irqsave(&q->mq_flush_lock, flags); 227 spin_lock_irqsave(&q->mq_flush_lock, flags);
228 q->flush_rq->cmd_flags = 0;
229 }
236 230
237 running = &q->flush_queue[q->flush_running_idx]; 231 running = &q->flush_queue[q->flush_running_idx];
238 BUG_ON(q->flush_pending_idx == q->flush_running_idx); 232 BUG_ON(q->flush_pending_idx == q->flush_running_idx);
@@ -306,23 +300,9 @@ static bool blk_kick_flush(struct request_queue *q)
306 */ 300 */
307 q->flush_pending_idx ^= 1; 301 q->flush_pending_idx ^= 1;
308 302
309 if (q->mq_ops) { 303 blk_rq_init(q, q->flush_rq);
310 struct blk_mq_ctx *ctx = first_rq->mq_ctx; 304 if (q->mq_ops)
311 struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu); 305 blk_mq_clone_flush_request(q->flush_rq, first_rq);
312
313 blk_mq_rq_init(hctx, q->flush_rq);
314 q->flush_rq->mq_ctx = ctx;
315
316 /*
317 * Reuse the tag value from the fist waiting request,
318 * with blk-mq the tag is generated during request
319 * allocation and drivers can rely on it being inside
320 * the range they asked for.
321 */
322 q->flush_rq->tag = first_rq->tag;
323 } else {
324 blk_rq_init(q, q->flush_rq);
325 }
326 306
327 q->flush_rq->cmd_type = REQ_TYPE_FS; 307 q->flush_rq->cmd_type = REQ_TYPE_FS;
328 q->flush_rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ; 308 q->flush_rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
diff --git a/block/blk-iopoll.c b/block/blk-iopoll.c
index c11d24e379e2..d828b44a404b 100644
--- a/block/blk-iopoll.c
+++ b/block/blk-iopoll.c
@@ -64,12 +64,12 @@ EXPORT_SYMBOL(__blk_iopoll_complete);
64 * iopoll handler will not be invoked again before blk_iopoll_sched_prep() 64 * iopoll handler will not be invoked again before blk_iopoll_sched_prep()
65 * is called. 65 * is called.
66 **/ 66 **/
67void blk_iopoll_complete(struct blk_iopoll *iopoll) 67void blk_iopoll_complete(struct blk_iopoll *iop)
68{ 68{
69 unsigned long flags; 69 unsigned long flags;
70 70
71 local_irq_save(flags); 71 local_irq_save(flags);
72 __blk_iopoll_complete(iopoll); 72 __blk_iopoll_complete(iop);
73 local_irq_restore(flags); 73 local_irq_restore(flags);
74} 74}
75EXPORT_SYMBOL(blk_iopoll_complete); 75EXPORT_SYMBOL(blk_iopoll_complete);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 97a733cf3d5f..8411be3c19d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,8 +226,8 @@ EXPORT_SYMBOL(blkdev_issue_write_same);
226 * Generate and issue number of bios with zerofiled pages. 226 * Generate and issue number of bios with zerofiled pages.
227 */ 227 */
228 228
229int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, 229static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
230 sector_t nr_sects, gfp_t gfp_mask) 230 sector_t nr_sects, gfp_t gfp_mask)
231{ 231{
232 int ret; 232 int ret;
233 struct bio *bio; 233 struct bio *bio;
diff --git a/block/blk-map.c b/block/blk-map.c
index f7b22bc21518..f890d4345b0c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -155,7 +155,6 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
155 if (!bio_flagged(bio, BIO_USER_MAPPED)) 155 if (!bio_flagged(bio, BIO_USER_MAPPED))
156 rq->cmd_flags |= REQ_COPY_USER; 156 rq->cmd_flags |= REQ_COPY_USER;
157 157
158 rq->buffer = NULL;
159 return 0; 158 return 0;
160unmap_rq: 159unmap_rq:
161 blk_rq_unmap_user(bio); 160 blk_rq_unmap_user(bio);
@@ -238,7 +237,6 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
238 blk_queue_bounce(q, &bio); 237 blk_queue_bounce(q, &bio);
239 bio_get(bio); 238 bio_get(bio);
240 blk_rq_bio_prep(q, rq, bio); 239 blk_rq_bio_prep(q, rq, bio);
241 rq->buffer = NULL;
242 return 0; 240 return 0;
243} 241}
244EXPORT_SYMBOL(blk_rq_map_user_iov); 242EXPORT_SYMBOL(blk_rq_map_user_iov);
@@ -325,7 +323,6 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
325 } 323 }
326 324
327 blk_queue_bounce(q, &rq->bio); 325 blk_queue_bounce(q, &rq->bio);
328 rq->buffer = NULL;
329 return 0; 326 return 0;
330} 327}
331EXPORT_SYMBOL(blk_rq_map_kern); 328EXPORT_SYMBOL(blk_rq_map_kern);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6c583f9c5b65..b3bf0df0f4c2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
13 struct bio *bio) 13 struct bio *bio)
14{ 14{
15 struct bio_vec bv, bvprv = { NULL }; 15 struct bio_vec bv, bvprv = { NULL };
16 int cluster, high, highprv = 1; 16 int cluster, high, highprv = 1, no_sg_merge;
17 unsigned int seg_size, nr_phys_segs; 17 unsigned int seg_size, nr_phys_segs;
18 struct bio *fbio, *bbio; 18 struct bio *fbio, *bbio;
19 struct bvec_iter iter; 19 struct bvec_iter iter;
@@ -35,12 +35,21 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
35 cluster = blk_queue_cluster(q); 35 cluster = blk_queue_cluster(q);
36 seg_size = 0; 36 seg_size = 0;
37 nr_phys_segs = 0; 37 nr_phys_segs = 0;
38 no_sg_merge = test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
39 high = 0;
38 for_each_bio(bio) { 40 for_each_bio(bio) {
39 bio_for_each_segment(bv, bio, iter) { 41 bio_for_each_segment(bv, bio, iter) {
40 /* 42 /*
43 * If SG merging is disabled, each bio vector is
44 * a segment
45 */
46 if (no_sg_merge)
47 goto new_segment;
48
49 /*
41 * the trick here is making sure that a high page is 50 * the trick here is making sure that a high page is
42 * never considered part of another segment, since that 51 * never considered part of another segment, since
43 * might change with the bounce page. 52 * that might change with the bounce page.
44 */ 53 */
45 high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q); 54 high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
46 if (!high && !highprv && cluster) { 55 if (!high && !highprv && cluster) {
@@ -84,11 +93,16 @@ void blk_recalc_rq_segments(struct request *rq)
84 93
85void blk_recount_segments(struct request_queue *q, struct bio *bio) 94void blk_recount_segments(struct request_queue *q, struct bio *bio)
86{ 95{
87 struct bio *nxt = bio->bi_next; 96 if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
97 bio->bi_phys_segments = bio->bi_vcnt;
98 else {
99 struct bio *nxt = bio->bi_next;
100
101 bio->bi_next = NULL;
102 bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
103 bio->bi_next = nxt;
104 }
88 105
89 bio->bi_next = NULL;
90 bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
91 bio->bi_next = nxt;
92 bio->bi_flags |= (1 << BIO_SEG_VALID); 106 bio->bi_flags |= (1 << BIO_SEG_VALID);
93} 107}
94EXPORT_SYMBOL(blk_recount_segments); 108EXPORT_SYMBOL(blk_recount_segments);
diff --git a/block/blk-mq-cpu.c b/block/blk-mq-cpu.c
index 136ef8643bba..bb3ed488f7b5 100644
--- a/block/blk-mq-cpu.c
+++ b/block/blk-mq-cpu.c
@@ -1,3 +1,8 @@
1/*
2 * CPU notifier helper code for blk-mq
3 *
4 * Copyright (C) 2013-2014 Jens Axboe
5 */
1#include <linux/kernel.h> 6#include <linux/kernel.h>
2#include <linux/module.h> 7#include <linux/module.h>
3#include <linux/init.h> 8#include <linux/init.h>
@@ -18,14 +23,18 @@ static int blk_mq_main_cpu_notify(struct notifier_block *self,
18{ 23{
19 unsigned int cpu = (unsigned long) hcpu; 24 unsigned int cpu = (unsigned long) hcpu;
20 struct blk_mq_cpu_notifier *notify; 25 struct blk_mq_cpu_notifier *notify;
26 int ret = NOTIFY_OK;
21 27
22 raw_spin_lock(&blk_mq_cpu_notify_lock); 28 raw_spin_lock(&blk_mq_cpu_notify_lock);
23 29
24 list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) 30 list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) {
25 notify->notify(notify->data, action, cpu); 31 ret = notify->notify(notify->data, action, cpu);
32 if (ret != NOTIFY_OK)
33 break;
34 }
26 35
27 raw_spin_unlock(&blk_mq_cpu_notify_lock); 36 raw_spin_unlock(&blk_mq_cpu_notify_lock);
28 return NOTIFY_OK; 37 return ret;
29} 38}
30 39
31void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier) 40void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
@@ -45,7 +54,7 @@ void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
45} 54}
46 55
47void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, 56void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
48 void (*fn)(void *, unsigned long, unsigned int), 57 int (*fn)(void *, unsigned long, unsigned int),
49 void *data) 58 void *data)
50{ 59{
51 notifier->notify = fn; 60 notifier->notify = fn;
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 097921329619..1065d7c65fa1 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -1,3 +1,8 @@
1/*
2 * CPU <-> hardware queue mapping helpers
3 *
4 * Copyright (C) 2013-2014 Jens Axboe
5 */
1#include <linux/kernel.h> 6#include <linux/kernel.h>
2#include <linux/threads.h> 7#include <linux/threads.h>
3#include <linux/module.h> 8#include <linux/module.h>
@@ -80,19 +85,35 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
80 return 0; 85 return 0;
81} 86}
82 87
83unsigned int *blk_mq_make_queue_map(struct blk_mq_reg *reg) 88unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
84{ 89{
85 unsigned int *map; 90 unsigned int *map;
86 91
87 /* If cpus are offline, map them to first hctx */ 92 /* If cpus are offline, map them to first hctx */
88 map = kzalloc_node(sizeof(*map) * num_possible_cpus(), GFP_KERNEL, 93 map = kzalloc_node(sizeof(*map) * num_possible_cpus(), GFP_KERNEL,
89 reg->numa_node); 94 set->numa_node);
90 if (!map) 95 if (!map)
91 return NULL; 96 return NULL;
92 97
93 if (!blk_mq_update_queue_map(map, reg->nr_hw_queues)) 98 if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
94 return map; 99 return map;
95 100
96 kfree(map); 101 kfree(map);
97 return NULL; 102 return NULL;
98} 103}
104
105/*
106 * We have no quick way of doing reverse lookups. This is only used at
107 * queue init time, so runtime isn't important.
108 */
109int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
110{
111 int i;
112
113 for_each_possible_cpu(i) {
114 if (index == mq_map[i])
115 return cpu_to_node(i);
116 }
117
118 return NUMA_NO_NODE;
119}
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b0ba264b0522..ed5217867555 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -203,59 +203,24 @@ static ssize_t blk_mq_hw_sysfs_rq_list_show(struct blk_mq_hw_ctx *hctx,
203 return ret; 203 return ret;
204} 204}
205 205
206static ssize_t blk_mq_hw_sysfs_ipi_show(struct blk_mq_hw_ctx *hctx, char *page) 206static ssize_t blk_mq_hw_sysfs_tags_show(struct blk_mq_hw_ctx *hctx, char *page)
207{
208 ssize_t ret;
209
210 spin_lock(&hctx->lock);
211 ret = sprintf(page, "%u\n", !!(hctx->flags & BLK_MQ_F_SHOULD_IPI));
212 spin_unlock(&hctx->lock);
213
214 return ret;
215}
216
217static ssize_t blk_mq_hw_sysfs_ipi_store(struct blk_mq_hw_ctx *hctx,
218 const char *page, size_t len)
219{ 207{
220 struct blk_mq_ctx *ctx; 208 return blk_mq_tag_sysfs_show(hctx->tags, page);
221 unsigned long ret;
222 unsigned int i;
223
224 if (kstrtoul(page, 10, &ret)) {
225 pr_err("blk-mq-sysfs: invalid input '%s'\n", page);
226 return -EINVAL;
227 }
228
229 spin_lock(&hctx->lock);
230 if (ret)
231 hctx->flags |= BLK_MQ_F_SHOULD_IPI;
232 else
233 hctx->flags &= ~BLK_MQ_F_SHOULD_IPI;
234 spin_unlock(&hctx->lock);
235
236 hctx_for_each_ctx(hctx, ctx, i)
237 ctx->ipi_redirect = !!ret;
238
239 return len;
240} 209}
241 210
242static ssize_t blk_mq_hw_sysfs_tags_show(struct blk_mq_hw_ctx *hctx, char *page) 211static ssize_t blk_mq_hw_sysfs_active_show(struct blk_mq_hw_ctx *hctx, char *page)
243{ 212{
244 return blk_mq_tag_sysfs_show(hctx->tags, page); 213 return sprintf(page, "%u\n", atomic_read(&hctx->nr_active));
245} 214}
246 215
247static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page) 216static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
248{ 217{
249 unsigned int i, queue_num, first = 1; 218 unsigned int i, first = 1;
250 ssize_t ret = 0; 219 ssize_t ret = 0;
251 220
252 blk_mq_disable_hotplug(); 221 blk_mq_disable_hotplug();
253 222
254 for_each_online_cpu(i) { 223 for_each_cpu(i, hctx->cpumask) {
255 queue_num = hctx->queue->mq_map[i];
256 if (queue_num != hctx->queue_num)
257 continue;
258
259 if (first) 224 if (first)
260 ret += sprintf(ret + page, "%u", i); 225 ret += sprintf(ret + page, "%u", i);
261 else 226 else
@@ -307,15 +272,14 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_dispatched = {
307 .attr = {.name = "dispatched", .mode = S_IRUGO }, 272 .attr = {.name = "dispatched", .mode = S_IRUGO },
308 .show = blk_mq_hw_sysfs_dispatched_show, 273 .show = blk_mq_hw_sysfs_dispatched_show,
309}; 274};
275static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_active = {
276 .attr = {.name = "active", .mode = S_IRUGO },
277 .show = blk_mq_hw_sysfs_active_show,
278};
310static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_pending = { 279static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_pending = {
311 .attr = {.name = "pending", .mode = S_IRUGO }, 280 .attr = {.name = "pending", .mode = S_IRUGO },
312 .show = blk_mq_hw_sysfs_rq_list_show, 281 .show = blk_mq_hw_sysfs_rq_list_show,
313}; 282};
314static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_ipi = {
315 .attr = {.name = "ipi_redirect", .mode = S_IRUGO | S_IWUSR},
316 .show = blk_mq_hw_sysfs_ipi_show,
317 .store = blk_mq_hw_sysfs_ipi_store,
318};
319static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tags = { 283static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tags = {
320 .attr = {.name = "tags", .mode = S_IRUGO }, 284 .attr = {.name = "tags", .mode = S_IRUGO },
321 .show = blk_mq_hw_sysfs_tags_show, 285 .show = blk_mq_hw_sysfs_tags_show,
@@ -330,9 +294,9 @@ static struct attribute *default_hw_ctx_attrs[] = {
330 &blk_mq_hw_sysfs_run.attr, 294 &blk_mq_hw_sysfs_run.attr,
331 &blk_mq_hw_sysfs_dispatched.attr, 295 &blk_mq_hw_sysfs_dispatched.attr,
332 &blk_mq_hw_sysfs_pending.attr, 296 &blk_mq_hw_sysfs_pending.attr,
333 &blk_mq_hw_sysfs_ipi.attr,
334 &blk_mq_hw_sysfs_tags.attr, 297 &blk_mq_hw_sysfs_tags.attr,
335 &blk_mq_hw_sysfs_cpus.attr, 298 &blk_mq_hw_sysfs_cpus.attr,
299 &blk_mq_hw_sysfs_active.attr,
336 NULL, 300 NULL,
337}; 301};
338 302
@@ -363,6 +327,42 @@ static struct kobj_type blk_mq_hw_ktype = {
363 .release = blk_mq_sysfs_release, 327 .release = blk_mq_sysfs_release,
364}; 328};
365 329
330static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
331{
332 struct blk_mq_ctx *ctx;
333 int i;
334
335 if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
336 return;
337
338 hctx_for_each_ctx(hctx, ctx, i)
339 kobject_del(&ctx->kobj);
340
341 kobject_del(&hctx->kobj);
342}
343
344static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
345{
346 struct request_queue *q = hctx->queue;
347 struct blk_mq_ctx *ctx;
348 int i, ret;
349
350 if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
351 return 0;
352
353 ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
354 if (ret)
355 return ret;
356
357 hctx_for_each_ctx(hctx, ctx, i) {
358 ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu);
359 if (ret)
360 break;
361 }
362
363 return ret;
364}
365
366void blk_mq_unregister_disk(struct gendisk *disk) 366void blk_mq_unregister_disk(struct gendisk *disk)
367{ 367{
368 struct request_queue *q = disk->queue; 368 struct request_queue *q = disk->queue;
@@ -371,11 +371,11 @@ void blk_mq_unregister_disk(struct gendisk *disk)
371 int i, j; 371 int i, j;
372 372
373 queue_for_each_hw_ctx(q, hctx, i) { 373 queue_for_each_hw_ctx(q, hctx, i) {
374 hctx_for_each_ctx(hctx, ctx, j) { 374 blk_mq_unregister_hctx(hctx);
375 kobject_del(&ctx->kobj); 375
376 hctx_for_each_ctx(hctx, ctx, j)
376 kobject_put(&ctx->kobj); 377 kobject_put(&ctx->kobj);
377 } 378
378 kobject_del(&hctx->kobj);
379 kobject_put(&hctx->kobj); 379 kobject_put(&hctx->kobj);
380 } 380 }
381 381
@@ -386,15 +386,30 @@ void blk_mq_unregister_disk(struct gendisk *disk)
386 kobject_put(&disk_to_dev(disk)->kobj); 386 kobject_put(&disk_to_dev(disk)->kobj);
387} 387}
388 388
389static void blk_mq_sysfs_init(struct request_queue *q)
390{
391 struct blk_mq_hw_ctx *hctx;
392 struct blk_mq_ctx *ctx;
393 int i, j;
394
395 kobject_init(&q->mq_kobj, &blk_mq_ktype);
396
397 queue_for_each_hw_ctx(q, hctx, i) {
398 kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
399
400 hctx_for_each_ctx(hctx, ctx, j)
401 kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
402 }
403}
404
389int blk_mq_register_disk(struct gendisk *disk) 405int blk_mq_register_disk(struct gendisk *disk)
390{ 406{
391 struct device *dev = disk_to_dev(disk); 407 struct device *dev = disk_to_dev(disk);
392 struct request_queue *q = disk->queue; 408 struct request_queue *q = disk->queue;
393 struct blk_mq_hw_ctx *hctx; 409 struct blk_mq_hw_ctx *hctx;
394 struct blk_mq_ctx *ctx; 410 int ret, i;
395 int ret, i, j;
396 411
397 kobject_init(&q->mq_kobj, &blk_mq_ktype); 412 blk_mq_sysfs_init(q);
398 413
399 ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); 414 ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
400 if (ret < 0) 415 if (ret < 0)
@@ -403,20 +418,10 @@ int blk_mq_register_disk(struct gendisk *disk)
403 kobject_uevent(&q->mq_kobj, KOBJ_ADD); 418 kobject_uevent(&q->mq_kobj, KOBJ_ADD);
404 419
405 queue_for_each_hw_ctx(q, hctx, i) { 420 queue_for_each_hw_ctx(q, hctx, i) {
406 kobject_init(&hctx->kobj, &blk_mq_hw_ktype); 421 hctx->flags |= BLK_MQ_F_SYSFS_UP;
407 ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", i); 422 ret = blk_mq_register_hctx(hctx);
408 if (ret) 423 if (ret)
409 break; 424 break;
410
411 if (!hctx->nr_ctx)
412 continue;
413
414 hctx_for_each_ctx(hctx, ctx, j) {
415 kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
416 ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu);
417 if (ret)
418 break;
419 }
420 } 425 }
421 426
422 if (ret) { 427 if (ret) {
@@ -426,3 +431,26 @@ int blk_mq_register_disk(struct gendisk *disk)
426 431
427 return 0; 432 return 0;
428} 433}
434
435void blk_mq_sysfs_unregister(struct request_queue *q)
436{
437 struct blk_mq_hw_ctx *hctx;
438 int i;
439
440 queue_for_each_hw_ctx(q, hctx, i)
441 blk_mq_unregister_hctx(hctx);
442}
443
444int blk_mq_sysfs_register(struct request_queue *q)
445{
446 struct blk_mq_hw_ctx *hctx;
447 int i, ret = 0;
448
449 queue_for_each_hw_ctx(q, hctx, i) {
450 ret = blk_mq_register_hctx(hctx);
451 if (ret)
452 break;
453 }
454
455 return ret;
456}
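
The sysfs changes above boil down to an init-once/register-many pattern: kobject_init() moves into a one-time blk_mq_sysfs_init(), while the per-hctx register/unregister helpers are guarded by the new BLK_MQ_F_SYSFS_UP flag so blk_mq_sysfs_register()/blk_mq_sysfs_unregister() can safely re-run when queue mappings change. As a rough stand-alone illustration of that guard-flag pattern (not kernel code; every name here is invented):

#include <stdbool.h>
#include <stdio.h>

#define FAKE_SYSFS_UP	(1 << 0)	/* stands in for BLK_MQ_F_SYSFS_UP */

struct fake_hctx {
	unsigned int flags;
	unsigned int nr_ctx;		/* mapped software contexts */
	bool registered;		/* stands in for the added kobject */
};

static int fake_register_hctx(struct fake_hctx *hctx)
{
	/* same guard as the kernel helper: skip unmapped or not-up hctxs */
	if (!hctx->nr_ctx || !(hctx->flags & FAKE_SYSFS_UP))
		return 0;
	hctx->registered = true;
	return 0;
}

static void fake_unregister_hctx(struct fake_hctx *hctx)
{
	if (!hctx->nr_ctx || !(hctx->flags & FAKE_SYSFS_UP))
		return;
	hctx->registered = false;
}

int main(void)
{
	struct fake_hctx hctx = { .flags = FAKE_SYSFS_UP, .nr_ctx = 4 };

	fake_register_hctx(&hctx);
	printf("registered: %d\n", hctx.registered);

	/* a queue remap unregisters, updates the mapping, re-registers */
	fake_unregister_hctx(&hctx);
	hctx.nr_ctx = 2;
	fake_register_hctx(&hctx);
	printf("re-registered with %u ctxs: %d\n", hctx.nr_ctx, hctx.registered);
	return 0;
}
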
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 83ae96c51a27..d90c4aeb7dd3 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -1,78 +1,345 @@
1/*
2 * Fast and scalable bitmap tagging variant. Uses sparser bitmaps spread
3 * over multiple cachelines to avoid ping-pong between multiple submitters
 4 * or submitter and completer. Uses rolling wakeups to avoid falling off
5 * the scaling cliff when we run out of tags and have to start putting
6 * submitters to sleep.
7 *
8 * Uses active queue tracking to support fairer distribution of tags
9 * between multiple submitters when a shared tag map is used.
10 *
11 * Copyright (C) 2013-2014 Jens Axboe
12 */
1#include <linux/kernel.h> 13#include <linux/kernel.h>
2#include <linux/module.h> 14#include <linux/module.h>
3#include <linux/percpu_ida.h> 15#include <linux/random.h>
4 16
5#include <linux/blk-mq.h> 17#include <linux/blk-mq.h>
6#include "blk.h" 18#include "blk.h"
7#include "blk-mq.h" 19#include "blk-mq.h"
8#include "blk-mq-tag.h" 20#include "blk-mq-tag.h"
9 21
22static bool bt_has_free_tags(struct blk_mq_bitmap_tags *bt)
23{
24 int i;
25
26 for (i = 0; i < bt->map_nr; i++) {
27 struct blk_align_bitmap *bm = &bt->map[i];
28 int ret;
29
30 ret = find_first_zero_bit(&bm->word, bm->depth);
31 if (ret < bm->depth)
32 return true;
33 }
34
35 return false;
36}
37
38bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
39{
40 if (!tags)
41 return true;
42
43 return bt_has_free_tags(&tags->bitmap_tags);
44}
45
46static inline void bt_index_inc(unsigned int *index)
47{
48 *index = (*index + 1) & (BT_WAIT_QUEUES - 1);
49}
50
10/* 51/*
11 * Per tagged queue (tag address space) map 52 * If a previously inactive queue goes active, bump the active user count.
12 */ 53 */
13struct blk_mq_tags { 54bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
14 unsigned int nr_tags; 55{
15 unsigned int nr_reserved_tags; 56 if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
16 unsigned int nr_batch_move; 57 !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
17 unsigned int nr_max_cache; 58 atomic_inc(&hctx->tags->active_queues);
18 59
19 struct percpu_ida free_tags; 60 return true;
20 struct percpu_ida reserved_tags; 61}
21};
22 62
23void blk_mq_wait_for_tags(struct blk_mq_tags *tags) 63/*
 64 * Wake up all waiters potentially sleeping on normal (non-reserved) tags
65 */
66static void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags)
24{ 67{
25 int tag = blk_mq_get_tag(tags, __GFP_WAIT, false); 68 struct blk_mq_bitmap_tags *bt;
26 blk_mq_put_tag(tags, tag); 69 int i, wake_index;
70
71 bt = &tags->bitmap_tags;
72 wake_index = bt->wake_index;
73 for (i = 0; i < BT_WAIT_QUEUES; i++) {
74 struct bt_wait_state *bs = &bt->bs[wake_index];
75
76 if (waitqueue_active(&bs->wait))
77 wake_up(&bs->wait);
78
79 bt_index_inc(&wake_index);
80 }
27} 81}
28 82
29bool blk_mq_has_free_tags(struct blk_mq_tags *tags) 83/*
84 * If a previously busy queue goes inactive, potential waiters could now
85 * be allowed to queue. Wake them up and check.
86 */
87void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
88{
89 struct blk_mq_tags *tags = hctx->tags;
90
91 if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
92 return;
93
94 atomic_dec(&tags->active_queues);
95
96 blk_mq_tag_wakeup_all(tags);
97}
98
99/*
100 * For shared tag users, we track the number of currently active users
101 * and attempt to provide a fair share of the tag depth for each of them.
102 */
103static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
104 struct blk_mq_bitmap_tags *bt)
105{
106 unsigned int depth, users;
107
108 if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
109 return true;
110 if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
111 return true;
112
113 /*
114 * Don't try dividing an ant
115 */
116 if (bt->depth == 1)
117 return true;
118
119 users = atomic_read(&hctx->tags->active_queues);
120 if (!users)
121 return true;
122
123 /*
124 * Allow at least some tags
125 */
126 depth = max((bt->depth + users - 1) / users, 4U);
127 return atomic_read(&hctx->nr_active) < depth;
128}
129
130static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
30{ 131{
31 return !tags || 132 int tag, org_last_tag, end;
32 percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0; 133
134 org_last_tag = last_tag;
135 end = bm->depth;
136 do {
137restart:
138 tag = find_next_zero_bit(&bm->word, end, last_tag);
139 if (unlikely(tag >= end)) {
140 /*
 141 * We started with an offset; restart from 0 to
142 * exhaust the map.
143 */
144 if (org_last_tag && last_tag) {
145 end = last_tag;
146 last_tag = 0;
147 goto restart;
148 }
149 return -1;
150 }
151 last_tag = tag + 1;
152 } while (test_and_set_bit_lock(tag, &bm->word));
153
154 return tag;
33} 155}
34 156
35static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp) 157/*
158 * Straight forward bitmap tag implementation, where each bit is a tag
159 * (cleared == free, and set == busy). The small twist is using per-cpu
160 * last_tag caches, which blk-mq stores in the blk_mq_ctx software queue
161 * contexts. This enables us to drastically limit the space searched,
162 * without dirtying an extra shared cacheline like we would if we stored
163 * the cache value inside the shared blk_mq_bitmap_tags structure. On top
164 * of that, each word of tags is in a separate cacheline. This means that
165 * multiple users will tend to stick to different cachelines, at least
166 * until the map is exhausted.
167 */
168static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
169 unsigned int *tag_cache)
36{ 170{
171 unsigned int last_tag, org_last_tag;
172 int index, i, tag;
173
174 if (!hctx_may_queue(hctx, bt))
175 return -1;
176
177 last_tag = org_last_tag = *tag_cache;
178 index = TAG_TO_INDEX(bt, last_tag);
179
180 for (i = 0; i < bt->map_nr; i++) {
181 tag = __bt_get_word(&bt->map[index], TAG_TO_BIT(bt, last_tag));
182 if (tag != -1) {
183 tag += (index << bt->bits_per_word);
184 goto done;
185 }
186
187 last_tag = 0;
188 if (++index >= bt->map_nr)
189 index = 0;
190 }
191
192 *tag_cache = 0;
193 return -1;
194
195 /*
196 * Only update the cache from the allocation path, if we ended
197 * up using the specific cached tag.
198 */
199done:
200 if (tag == org_last_tag) {
201 last_tag = tag + 1;
202 if (last_tag >= bt->depth - 1)
203 last_tag = 0;
204
205 *tag_cache = last_tag;
206 }
207
208 return tag;
209}
210
211static struct bt_wait_state *bt_wait_ptr(struct blk_mq_bitmap_tags *bt,
212 struct blk_mq_hw_ctx *hctx)
213{
214 struct bt_wait_state *bs;
215
216 if (!hctx)
217 return &bt->bs[0];
218
219 bs = &bt->bs[hctx->wait_index];
220 bt_index_inc(&hctx->wait_index);
221 return bs;
222}
223
224static int bt_get(struct blk_mq_bitmap_tags *bt, struct blk_mq_hw_ctx *hctx,
225 unsigned int *last_tag, gfp_t gfp)
226{
227 struct bt_wait_state *bs;
228 DEFINE_WAIT(wait);
37 int tag; 229 int tag;
38 230
39 tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ? 231 tag = __bt_get(hctx, bt, last_tag);
40 TASK_UNINTERRUPTIBLE : TASK_RUNNING); 232 if (tag != -1)
41 if (tag < 0) 233 return tag;
42 return BLK_MQ_TAG_FAIL; 234
43 return tag + tags->nr_reserved_tags; 235 if (!(gfp & __GFP_WAIT))
236 return -1;
237
238 bs = bt_wait_ptr(bt, hctx);
239 do {
240 bool was_empty;
241
242 was_empty = list_empty(&wait.task_list);
243 prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
244
245 tag = __bt_get(hctx, bt, last_tag);
246 if (tag != -1)
247 break;
248
249 if (was_empty)
250 atomic_set(&bs->wait_cnt, bt->wake_cnt);
251
252 io_schedule();
253 } while (1);
254
255 finish_wait(&bs->wait, &wait);
256 return tag;
257}
258
259static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags,
260 struct blk_mq_hw_ctx *hctx,
261 unsigned int *last_tag, gfp_t gfp)
262{
263 int tag;
264
265 tag = bt_get(&tags->bitmap_tags, hctx, last_tag, gfp);
266 if (tag >= 0)
267 return tag + tags->nr_reserved_tags;
268
269 return BLK_MQ_TAG_FAIL;
44} 270}
45 271
46static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags, 272static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
47 gfp_t gfp) 273 gfp_t gfp)
48{ 274{
49 int tag; 275 int tag, zero = 0;
50 276
51 if (unlikely(!tags->nr_reserved_tags)) { 277 if (unlikely(!tags->nr_reserved_tags)) {
52 WARN_ON_ONCE(1); 278 WARN_ON_ONCE(1);
53 return BLK_MQ_TAG_FAIL; 279 return BLK_MQ_TAG_FAIL;
54 } 280 }
55 281
56 tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ? 282 tag = bt_get(&tags->breserved_tags, NULL, &zero, gfp);
57 TASK_UNINTERRUPTIBLE : TASK_RUNNING);
58 if (tag < 0) 283 if (tag < 0)
59 return BLK_MQ_TAG_FAIL; 284 return BLK_MQ_TAG_FAIL;
285
60 return tag; 286 return tag;
61} 287}
62 288
63unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved) 289unsigned int blk_mq_get_tag(struct blk_mq_hw_ctx *hctx, unsigned int *last_tag,
290 gfp_t gfp, bool reserved)
64{ 291{
65 if (!reserved) 292 if (!reserved)
66 return __blk_mq_get_tag(tags, gfp); 293 return __blk_mq_get_tag(hctx->tags, hctx, last_tag, gfp);
67 294
68 return __blk_mq_get_reserved_tag(tags, gfp); 295 return __blk_mq_get_reserved_tag(hctx->tags, gfp);
296}
297
298static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
299{
300 int i, wake_index;
301
302 wake_index = bt->wake_index;
303 for (i = 0; i < BT_WAIT_QUEUES; i++) {
304 struct bt_wait_state *bs = &bt->bs[wake_index];
305
306 if (waitqueue_active(&bs->wait)) {
307 if (wake_index != bt->wake_index)
308 bt->wake_index = wake_index;
309
310 return bs;
311 }
312
313 bt_index_inc(&wake_index);
314 }
315
316 return NULL;
317}
318
319static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
320{
321 const int index = TAG_TO_INDEX(bt, tag);
322 struct bt_wait_state *bs;
323
324 /*
 325 * The unlock memory barrier needs to order access to the request in the
 326 * free path against the clearing of the tag bit
327 */
328 clear_bit_unlock(TAG_TO_BIT(bt, tag), &bt->map[index].word);
329
330 bs = bt_wake_ptr(bt);
331 if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
332 atomic_set(&bs->wait_cnt, bt->wake_cnt);
333 bt_index_inc(&bt->wake_index);
334 wake_up(&bs->wait);
335 }
69} 336}
70 337
71static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag) 338static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)
72{ 339{
73 BUG_ON(tag >= tags->nr_tags); 340 BUG_ON(tag >= tags->nr_tags);
74 341
75 percpu_ida_free(&tags->free_tags, tag - tags->nr_reserved_tags); 342 bt_clear_tag(&tags->bitmap_tags, tag);
76} 343}
77 344
78static void __blk_mq_put_reserved_tag(struct blk_mq_tags *tags, 345static void __blk_mq_put_reserved_tag(struct blk_mq_tags *tags,
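
The hctx_may_queue() helper added in the hunk above is the heart of the shared-tag fairness: each active queue gets roughly its proportional share of the tag depth, rounded up and floored at 4 so small maps still make progress. A stand-alone sketch of just that arithmetic (invented numbers, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

static bool may_queue(unsigned int total_depth, unsigned int users,
		      unsigned int nr_active)
{
	unsigned int depth;

	if (total_depth == 1 || users == 0)
		return true;

	depth = (total_depth + users - 1) / users;	/* round up */
	if (depth < 4)
		depth = 4;				/* allow at least some tags */
	return nr_active < depth;
}

int main(void)
{
	/* 128 tags shared by 5 active queues -> ~26 tags each */
	printf("%d\n", may_queue(128, 5, 20));	/* 1: under its share */
	printf("%d\n", may_queue(128, 5, 26));	/* 0: at its share */
	/* tiny map: the 4-tag floor still lets a queue make progress */
	printf("%d\n", may_queue(8, 8, 3));	/* 1 */
	return 0;
}
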
@@ -80,22 +347,43 @@ static void __blk_mq_put_reserved_tag(struct blk_mq_tags *tags,
80{ 347{
81 BUG_ON(tag >= tags->nr_reserved_tags); 348 BUG_ON(tag >= tags->nr_reserved_tags);
82 349
83 percpu_ida_free(&tags->reserved_tags, tag); 350 bt_clear_tag(&tags->breserved_tags, tag);
84} 351}
85 352
86void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag) 353void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, unsigned int tag,
354 unsigned int *last_tag)
87{ 355{
88 if (tag >= tags->nr_reserved_tags) 356 struct blk_mq_tags *tags = hctx->tags;
89 __blk_mq_put_tag(tags, tag); 357
90 else 358 if (tag >= tags->nr_reserved_tags) {
359 const int real_tag = tag - tags->nr_reserved_tags;
360
361 __blk_mq_put_tag(tags, real_tag);
362 *last_tag = real_tag;
363 } else
91 __blk_mq_put_reserved_tag(tags, tag); 364 __blk_mq_put_reserved_tag(tags, tag);
92} 365}
93 366
94static int __blk_mq_tag_iter(unsigned id, void *data) 367static void bt_for_each_free(struct blk_mq_bitmap_tags *bt,
368 unsigned long *free_map, unsigned int off)
95{ 369{
96 unsigned long *tag_map = data; 370 int i;
97 __set_bit(id, tag_map); 371
98 return 0; 372 for (i = 0; i < bt->map_nr; i++) {
373 struct blk_align_bitmap *bm = &bt->map[i];
374 int bit = 0;
375
376 do {
377 bit = find_next_zero_bit(&bm->word, bm->depth, bit);
378 if (bit >= bm->depth)
379 break;
380
381 __set_bit(bit + off, free_map);
382 bit++;
383 } while (1);
384
385 off += (1 << bt->bits_per_word);
386 }
99} 387}
100 388
101void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, 389void blk_mq_tag_busy_iter(struct blk_mq_tags *tags,
@@ -109,21 +397,128 @@ void blk_mq_tag_busy_iter(struct blk_mq_tags *tags,
109 if (!tag_map) 397 if (!tag_map)
110 return; 398 return;
111 399
112 percpu_ida_for_each_free(&tags->free_tags, __blk_mq_tag_iter, tag_map); 400 bt_for_each_free(&tags->bitmap_tags, tag_map, tags->nr_reserved_tags);
113 if (tags->nr_reserved_tags) 401 if (tags->nr_reserved_tags)
114 percpu_ida_for_each_free(&tags->reserved_tags, __blk_mq_tag_iter, 402 bt_for_each_free(&tags->breserved_tags, tag_map, 0);
115 tag_map);
116 403
117 fn(data, tag_map); 404 fn(data, tag_map);
118 kfree(tag_map); 405 kfree(tag_map);
119} 406}
407EXPORT_SYMBOL(blk_mq_tag_busy_iter);
408
409static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
410{
411 unsigned int i, used;
412
413 for (i = 0, used = 0; i < bt->map_nr; i++) {
414 struct blk_align_bitmap *bm = &bt->map[i];
415
416 used += bitmap_weight(&bm->word, bm->depth);
417 }
418
419 return bt->depth - used;
420}
421
422static void bt_update_count(struct blk_mq_bitmap_tags *bt,
423 unsigned int depth)
424{
425 unsigned int tags_per_word = 1U << bt->bits_per_word;
426 unsigned int map_depth = depth;
427
428 if (depth) {
429 int i;
430
431 for (i = 0; i < bt->map_nr; i++) {
432 bt->map[i].depth = min(map_depth, tags_per_word);
433 map_depth -= bt->map[i].depth;
434 }
435 }
436
437 bt->wake_cnt = BT_WAIT_BATCH;
438 if (bt->wake_cnt > depth / 4)
439 bt->wake_cnt = max(1U, depth / 4);
440
441 bt->depth = depth;
442}
443
444static int bt_alloc(struct blk_mq_bitmap_tags *bt, unsigned int depth,
445 int node, bool reserved)
446{
447 int i;
448
449 bt->bits_per_word = ilog2(BITS_PER_LONG);
450
451 /*
 452 * Depth can be zero for reserved tags; that's not a failure
453 * condition.
454 */
455 if (depth) {
456 unsigned int nr, tags_per_word;
457
458 tags_per_word = (1 << bt->bits_per_word);
459
460 /*
461 * If the tag space is small, shrink the number of tags
462 * per word so we spread over a few cachelines, at least.
 463 * If there are fewer than 4 tags, just forget about it; it's not
464 * going to work optimally anyway.
465 */
466 if (depth >= 4) {
467 while (tags_per_word * 4 > depth) {
468 bt->bits_per_word--;
469 tags_per_word = (1 << bt->bits_per_word);
470 }
471 }
472
473 nr = ALIGN(depth, tags_per_word) / tags_per_word;
474 bt->map = kzalloc_node(nr * sizeof(struct blk_align_bitmap),
475 GFP_KERNEL, node);
476 if (!bt->map)
477 return -ENOMEM;
478
479 bt->map_nr = nr;
480 }
481
482 bt->bs = kzalloc(BT_WAIT_QUEUES * sizeof(*bt->bs), GFP_KERNEL);
483 if (!bt->bs) {
484 kfree(bt->map);
485 return -ENOMEM;
486 }
487
488 for (i = 0; i < BT_WAIT_QUEUES; i++)
489 init_waitqueue_head(&bt->bs[i].wait);
490
491 bt_update_count(bt, depth);
492 return 0;
493}
494
495static void bt_free(struct blk_mq_bitmap_tags *bt)
496{
497 kfree(bt->map);
498 kfree(bt->bs);
499}
500
501static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
502 int node)
503{
504 unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
505
506 if (bt_alloc(&tags->bitmap_tags, depth, node, false))
507 goto enomem;
508 if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, node, true))
509 goto enomem;
510
511 return tags;
512enomem:
513 bt_free(&tags->bitmap_tags);
514 kfree(tags);
515 return NULL;
516}
120 517
121struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags, 518struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
122 unsigned int reserved_tags, int node) 519 unsigned int reserved_tags, int node)
123{ 520{
124 unsigned int nr_tags, nr_cache;
125 struct blk_mq_tags *tags; 521 struct blk_mq_tags *tags;
126 int ret;
127 522
128 if (total_tags > BLK_MQ_TAG_MAX) { 523 if (total_tags > BLK_MQ_TAG_MAX) {
129 pr_err("blk-mq: tag depth too large\n"); 524 pr_err("blk-mq: tag depth too large\n");
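
bt_alloc() above sizes the sparse map so that even small tag spaces spread across at least four words (and therefore cachelines), by shrinking bits_per_word until four words' worth of tags no longer exceeds the depth. A rough stand-alone reproduction of that sizing logic, assuming 64-bit longs (illustrative only, not the kernel code):

#include <stdio.h>

static void size_map(unsigned int depth)
{
	unsigned int bits_per_word = 6;			/* ilog2(64) */
	unsigned int tags_per_word = 1u << bits_per_word;
	unsigned int nr;

	/* shrink the word until at least 4 words are needed, as in bt_alloc() */
	if (depth >= 4) {
		while (tags_per_word * 4 > depth) {
			bits_per_word--;
			tags_per_word = 1u << bits_per_word;
		}
	}
	nr = (depth + tags_per_word - 1) / tags_per_word;	/* ALIGN()/div */
	printf("depth=%3u -> %2u tags/word, %2u words\n",
	       depth, tags_per_word, nr);
}

int main(void)
{
	size_map(256);	/* keeps full 64-tag words: 4 words */
	size_map(64);	/* shrinks to 16 tags/word: 4 words */
	size_map(32);	/* shrinks to 8 tags/word: 4 words */
	return 0;
}
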
@@ -134,73 +529,59 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
134 if (!tags) 529 if (!tags)
135 return NULL; 530 return NULL;
136 531
137 nr_tags = total_tags - reserved_tags;
138 nr_cache = nr_tags / num_possible_cpus();
139
140 if (nr_cache < BLK_MQ_TAG_CACHE_MIN)
141 nr_cache = BLK_MQ_TAG_CACHE_MIN;
142 else if (nr_cache > BLK_MQ_TAG_CACHE_MAX)
143 nr_cache = BLK_MQ_TAG_CACHE_MAX;
144
145 tags->nr_tags = total_tags; 532 tags->nr_tags = total_tags;
146 tags->nr_reserved_tags = reserved_tags; 533 tags->nr_reserved_tags = reserved_tags;
147 tags->nr_max_cache = nr_cache;
148 tags->nr_batch_move = max(1u, nr_cache / 2);
149 534
150 ret = __percpu_ida_init(&tags->free_tags, tags->nr_tags - 535 return blk_mq_init_bitmap_tags(tags, node);
151 tags->nr_reserved_tags, 536}
152 tags->nr_max_cache,
153 tags->nr_batch_move);
154 if (ret)
155 goto err_free_tags;
156 537
157 if (reserved_tags) { 538void blk_mq_free_tags(struct blk_mq_tags *tags)
158 /* 539{
159 * With max_cahe and batch set to 1, the allocator fallbacks to 540 bt_free(&tags->bitmap_tags);
160 * no cached. It's fine reserved tags allocation is slow. 541 bt_free(&tags->breserved_tags);
161 */ 542 kfree(tags);
162 ret = __percpu_ida_init(&tags->reserved_tags, reserved_tags, 543}
163 1, 1);
164 if (ret)
165 goto err_reserved_tags;
166 }
167 544
168 return tags; 545void blk_mq_tag_init_last_tag(struct blk_mq_tags *tags, unsigned int *tag)
546{
547 unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
169 548
170err_reserved_tags: 549 *tag = prandom_u32() % depth;
171 percpu_ida_destroy(&tags->free_tags);
172err_free_tags:
173 kfree(tags);
174 return NULL;
175} 550}
176 551
177void blk_mq_free_tags(struct blk_mq_tags *tags) 552int blk_mq_tag_update_depth(struct blk_mq_tags *tags, unsigned int tdepth)
178{ 553{
179 percpu_ida_destroy(&tags->free_tags); 554 tdepth -= tags->nr_reserved_tags;
180 percpu_ida_destroy(&tags->reserved_tags); 555 if (tdepth > tags->nr_tags)
181 kfree(tags); 556 return -EINVAL;
557
558 /*
 559 * We don't need to (and can't) update reserved tags here; they remain
560 * static and should never need resizing.
561 */
562 bt_update_count(&tags->bitmap_tags, tdepth);
563 blk_mq_tag_wakeup_all(tags);
564 return 0;
182} 565}
183 566
184ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page) 567ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page)
185{ 568{
186 char *orig_page = page; 569 char *orig_page = page;
187 unsigned int cpu; 570 unsigned int free, res;
188 571
189 if (!tags) 572 if (!tags)
190 return 0; 573 return 0;
191 574
192 page += sprintf(page, "nr_tags=%u, reserved_tags=%u, batch_move=%u," 575 page += sprintf(page, "nr_tags=%u, reserved_tags=%u, "
193 " max_cache=%u\n", tags->nr_tags, tags->nr_reserved_tags, 576 "bits_per_word=%u\n",
194 tags->nr_batch_move, tags->nr_max_cache); 577 tags->nr_tags, tags->nr_reserved_tags,
578 tags->bitmap_tags.bits_per_word);
195 579
196 page += sprintf(page, "nr_free=%u, nr_reserved=%u\n", 580 free = bt_unused_tags(&tags->bitmap_tags);
197 percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids), 581 res = bt_unused_tags(&tags->breserved_tags);
198 percpu_ida_free_tags(&tags->reserved_tags, nr_cpu_ids));
199 582
200 for_each_possible_cpu(cpu) { 583 page += sprintf(page, "nr_free=%u, nr_reserved=%u\n", free, res);
201 page += sprintf(page, " cpu%02u: nr_free=%u\n", cpu, 584 page += sprintf(page, "active_queues=%u\n", atomic_read(&tags->active_queues));
202 percpu_ida_free_tags(&tags->free_tags, cpu));
203 }
204 585
205 return page - orig_page; 586 return page - orig_page;
206} 587}
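
The core allocation path in this file is the wrap-around scan in __bt_get_word(): search from the cached offset to the end of the word, then, if the search started at a non-zero offset, rescan from bit 0 up to the original offset before declaring the word full. The following single-word, single-threaded toy model (plain C, no kernel helpers, no atomicity) mirrors that control flow:

#include <stdio.h>

/* toy find_next_zero_bit() over one word: first clear bit in [off, size) */
static int next_zero_bit(unsigned long word, int size, int off)
{
	int bit;

	for (bit = off; bit < size; bit++)
		if (!(word & (1UL << bit)))
			return bit;
	return size;			/* "not found", like the kernel helper */
}

static int get_tag(unsigned long *word, int depth, int last_tag)
{
	int org_last_tag = last_tag, end = depth, tag;

restart:
	tag = next_zero_bit(*word, end, last_tag);
	if (tag >= end) {
		/* we started at an offset: wrap and scan [0, offset) */
		if (org_last_tag && last_tag) {
			end = last_tag;
			last_tag = 0;
			goto restart;
		}
		return -1;		/* word is exhausted */
	}
	*word |= 1UL << tag;		/* test_and_set_bit_lock() in the kernel */
	return tag;
}

int main(void)
{
	unsigned long word = 0;
	int i;

	/* cached offset 5 on an 8-deep word: expect 5 6 7 0 1 2 3 4, then -1 */
	for (i = 0; i < 9; i++)
		printf("%d ", get_tag(&word, 8, 5));
	printf("\n");
	return 0;
}
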
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 947ba2c6148e..c959de58d2a5 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -1,17 +1,59 @@
1#ifndef INT_BLK_MQ_TAG_H 1#ifndef INT_BLK_MQ_TAG_H
2#define INT_BLK_MQ_TAG_H 2#define INT_BLK_MQ_TAG_H
3 3
4struct blk_mq_tags; 4#include "blk-mq.h"
5
6enum {
7 BT_WAIT_QUEUES = 8,
8 BT_WAIT_BATCH = 8,
9};
10
11struct bt_wait_state {
12 atomic_t wait_cnt;
13 wait_queue_head_t wait;
14} ____cacheline_aligned_in_smp;
15
16#define TAG_TO_INDEX(bt, tag) ((tag) >> (bt)->bits_per_word)
17#define TAG_TO_BIT(bt, tag) ((tag) & ((1 << (bt)->bits_per_word) - 1))
18
19struct blk_mq_bitmap_tags {
20 unsigned int depth;
21 unsigned int wake_cnt;
22 unsigned int bits_per_word;
23
24 unsigned int map_nr;
25 struct blk_align_bitmap *map;
26
27 unsigned int wake_index;
28 struct bt_wait_state *bs;
29};
30
31/*
32 * Tag address space map.
33 */
34struct blk_mq_tags {
35 unsigned int nr_tags;
36 unsigned int nr_reserved_tags;
37
38 atomic_t active_queues;
39
40 struct blk_mq_bitmap_tags bitmap_tags;
41 struct blk_mq_bitmap_tags breserved_tags;
42
43 struct request **rqs;
44 struct list_head page_list;
45};
46
5 47
6extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node); 48extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node);
7extern void blk_mq_free_tags(struct blk_mq_tags *tags); 49extern void blk_mq_free_tags(struct blk_mq_tags *tags);
8 50
9extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved); 51extern unsigned int blk_mq_get_tag(struct blk_mq_hw_ctx *hctx, unsigned int *last_tag, gfp_t gfp, bool reserved);
10extern void blk_mq_wait_for_tags(struct blk_mq_tags *tags); 52extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, unsigned int tag, unsigned int *last_tag);
11extern void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag);
12extern void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, void (*fn)(void *data, unsigned long *), void *data);
13extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags); 53extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
14extern ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page); 54extern ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page);
55extern void blk_mq_tag_init_last_tag(struct blk_mq_tags *tags, unsigned int *last_tag);
56extern int blk_mq_tag_update_depth(struct blk_mq_tags *tags, unsigned int depth);
15 57
16enum { 58enum {
17 BLK_MQ_TAG_CACHE_MIN = 1, 59 BLK_MQ_TAG_CACHE_MIN = 1,
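
The TAG_TO_INDEX()/TAG_TO_BIT() macros above split a tag into a word index and a bit offset using bits_per_word. A quick stand-alone check of that arithmetic (the macros here take bits_per_word directly instead of a blk_mq_bitmap_tags pointer, purely for illustration):

#include <stdio.h>

#define TAG_TO_INDEX(bits, tag)	((tag) >> (bits))
#define TAG_TO_BIT(bits, tag)	((tag) & ((1u << (bits)) - 1))

int main(void)
{
	unsigned int bits_per_word = 5;		/* 32 tags per word */
	unsigned int tags[] = { 0, 31, 32, 71 };
	unsigned int i;

	for (i = 0; i < sizeof(tags) / sizeof(tags[0]); i++)
		printf("tag %2u -> word %u, bit %2u\n", tags[i],
		       TAG_TO_INDEX(bits_per_word, tags[i]),
		       TAG_TO_BIT(bits_per_word, tags[i]));
	return 0;
}
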
@@ -24,4 +66,23 @@ enum {
24 BLK_MQ_TAG_MAX = BLK_MQ_TAG_FAIL - 1, 66 BLK_MQ_TAG_MAX = BLK_MQ_TAG_FAIL - 1,
25}; 67};
26 68
69extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
70extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
71
72static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
73{
74 if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
75 return false;
76
77 return __blk_mq_tag_busy(hctx);
78}
79
80static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
81{
82 if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
83 return;
84
85 __blk_mq_tag_idle(hctx);
86}
87
27#endif 88#endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1d2a9bdbee57..0f5879c42dcd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1,3 +1,9 @@
1/*
2 * Block multiqueue core code
3 *
4 * Copyright (C) 2013-2014 Jens Axboe
5 * Copyright (C) 2013-2014 Christoph Hellwig
6 */
1#include <linux/kernel.h> 7#include <linux/kernel.h>
2#include <linux/module.h> 8#include <linux/module.h>
3#include <linux/backing-dev.h> 9#include <linux/backing-dev.h>
@@ -56,38 +62,40 @@ static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
56{ 62{
57 unsigned int i; 63 unsigned int i;
58 64
59 for (i = 0; i < hctx->nr_ctx_map; i++) 65 for (i = 0; i < hctx->ctx_map.map_size; i++)
60 if (hctx->ctx_map[i]) 66 if (hctx->ctx_map.map[i].word)
61 return true; 67 return true;
62 68
63 return false; 69 return false;
64} 70}
65 71
72static inline struct blk_align_bitmap *get_bm(struct blk_mq_hw_ctx *hctx,
73 struct blk_mq_ctx *ctx)
74{
75 return &hctx->ctx_map.map[ctx->index_hw / hctx->ctx_map.bits_per_word];
76}
77
78#define CTX_TO_BIT(hctx, ctx) \
79 ((ctx)->index_hw & ((hctx)->ctx_map.bits_per_word - 1))
80
66/* 81/*
67 * Mark this ctx as having pending work in this hardware queue 82 * Mark this ctx as having pending work in this hardware queue
68 */ 83 */
69static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx, 84static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
70 struct blk_mq_ctx *ctx) 85 struct blk_mq_ctx *ctx)
71{ 86{
72 if (!test_bit(ctx->index_hw, hctx->ctx_map)) 87 struct blk_align_bitmap *bm = get_bm(hctx, ctx);
73 set_bit(ctx->index_hw, hctx->ctx_map); 88
89 if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
90 set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
74} 91}
75 92
76static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx, 93static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
77 gfp_t gfp, bool reserved) 94 struct blk_mq_ctx *ctx)
78{ 95{
79 struct request *rq; 96 struct blk_align_bitmap *bm = get_bm(hctx, ctx);
80 unsigned int tag;
81 97
82 tag = blk_mq_get_tag(hctx->tags, gfp, reserved); 98 clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
83 if (tag != BLK_MQ_TAG_FAIL) {
84 rq = hctx->rqs[tag];
85 rq->tag = tag;
86
87 return rq;
88 }
89
90 return NULL;
91} 99}
92 100
93static int blk_mq_queue_enter(struct request_queue *q) 101static int blk_mq_queue_enter(struct request_queue *q)
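
The hunk above replaces the flat per-hctx ctx_map bitmap with an array of blk_align_bitmap words, so each word can live in its own cacheline; a software queue's index_hw selects word index_hw / bits_per_word and bit index_hw & (bits_per_word - 1). A simplified stand-alone model of mark/clear/has-pending over such a sparse map (sizes invented, no per-cacheline alignment or atomics):

#include <stdbool.h>
#include <stdio.h>

#define WORDS		4
#define BITS_PER_WORD	8	/* stands in for ctx_map.bits_per_word */

static unsigned long ctx_map[WORDS];	/* one word per (pretend) cacheline */

static void mark_pending(unsigned int index_hw)
{
	ctx_map[index_hw / BITS_PER_WORD] |=
			1UL << (index_hw & (BITS_PER_WORD - 1));
}

static void clear_pending(unsigned int index_hw)
{
	ctx_map[index_hw / BITS_PER_WORD] &=
			~(1UL << (index_hw & (BITS_PER_WORD - 1)));
}

static bool has_pending(void)
{
	unsigned int i;

	for (i = 0; i < WORDS; i++)
		if (ctx_map[i])
			return true;
	return false;
}

int main(void)
{
	mark_pending(3);		/* ctx 3  -> word 0, bit 3 */
	mark_pending(19);		/* ctx 19 -> word 2, bit 3 */
	printf("pending: %d\n", has_pending());
	clear_pending(3);
	clear_pending(19);
	printf("pending: %d\n", has_pending());
	return 0;
}
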
@@ -186,78 +194,95 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
186 if (blk_queue_io_stat(q)) 194 if (blk_queue_io_stat(q))
187 rw_flags |= REQ_IO_STAT; 195 rw_flags |= REQ_IO_STAT;
188 196
197 INIT_LIST_HEAD(&rq->queuelist);
198 /* csd/requeue_work/fifo_time is initialized before use */
199 rq->q = q;
189 rq->mq_ctx = ctx; 200 rq->mq_ctx = ctx;
190 rq->cmd_flags = rw_flags; 201 rq->cmd_flags |= rw_flags;
191 rq->start_time = jiffies; 202 /* do not touch atomic flags, it needs atomic ops against the timer */
203 rq->cpu = -1;
204 INIT_HLIST_NODE(&rq->hash);
205 RB_CLEAR_NODE(&rq->rb_node);
206 rq->rq_disk = NULL;
207 rq->part = NULL;
208#ifdef CONFIG_BLK_CGROUP
209 rq->rl = NULL;
192 set_start_time_ns(rq); 210 set_start_time_ns(rq);
211 rq->io_start_time_ns = 0;
212#endif
213 rq->nr_phys_segments = 0;
214#if defined(CONFIG_BLK_DEV_INTEGRITY)
215 rq->nr_integrity_segments = 0;
216#endif
217 rq->special = NULL;
218 /* tag was already set */
219 rq->errors = 0;
220
221 rq->extra_len = 0;
222 rq->sense_len = 0;
223 rq->resid_len = 0;
224 rq->sense = NULL;
225
226 INIT_LIST_HEAD(&rq->timeout_list);
227 rq->end_io = NULL;
228 rq->end_io_data = NULL;
229 rq->next_rq = NULL;
230
193 ctx->rq_dispatched[rw_is_sync(rw_flags)]++; 231 ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
194} 232}
195 233
196static struct request *blk_mq_alloc_request_pinned(struct request_queue *q, 234static struct request *
197 int rw, gfp_t gfp, 235__blk_mq_alloc_request(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
198 bool reserved) 236 struct blk_mq_ctx *ctx, int rw, gfp_t gfp, bool reserved)
199{ 237{
200 struct request *rq; 238 struct request *rq;
239 unsigned int tag;
201 240
202 do { 241 tag = blk_mq_get_tag(hctx, &ctx->last_tag, gfp, reserved);
203 struct blk_mq_ctx *ctx = blk_mq_get_ctx(q); 242 if (tag != BLK_MQ_TAG_FAIL) {
204 struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu); 243 rq = hctx->tags->rqs[tag];
205 244
206 rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved); 245 rq->cmd_flags = 0;
207 if (rq) { 246 if (blk_mq_tag_busy(hctx)) {
208 blk_mq_rq_ctx_init(q, ctx, rq, rw); 247 rq->cmd_flags = REQ_MQ_INFLIGHT;
209 break; 248 atomic_inc(&hctx->nr_active);
210 } 249 }
211 250
212 blk_mq_put_ctx(ctx); 251 rq->tag = tag;
213 if (!(gfp & __GFP_WAIT)) 252 blk_mq_rq_ctx_init(q, ctx, rq, rw);
214 break; 253 return rq;
215 254 }
216 __blk_mq_run_hw_queue(hctx);
217 blk_mq_wait_for_tags(hctx->tags);
218 } while (1);
219 255
220 return rq; 256 return NULL;
221} 257}
222 258
223struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp) 259struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
260 bool reserved)
224{ 261{
262 struct blk_mq_ctx *ctx;
263 struct blk_mq_hw_ctx *hctx;
225 struct request *rq; 264 struct request *rq;
226 265
227 if (blk_mq_queue_enter(q)) 266 if (blk_mq_queue_enter(q))
228 return NULL; 267 return NULL;
229 268
230 rq = blk_mq_alloc_request_pinned(q, rw, gfp, false); 269 ctx = blk_mq_get_ctx(q);
231 if (rq) 270 hctx = q->mq_ops->map_queue(q, ctx->cpu);
232 blk_mq_put_ctx(rq->mq_ctx);
233 return rq;
234}
235
236struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw,
237 gfp_t gfp)
238{
239 struct request *rq;
240 271
241 if (blk_mq_queue_enter(q)) 272 rq = __blk_mq_alloc_request(q, hctx, ctx, rw, gfp & ~__GFP_WAIT,
242 return NULL; 273 reserved);
274 if (!rq && (gfp & __GFP_WAIT)) {
275 __blk_mq_run_hw_queue(hctx);
276 blk_mq_put_ctx(ctx);
243 277
244 rq = blk_mq_alloc_request_pinned(q, rw, gfp, true); 278 ctx = blk_mq_get_ctx(q);
245 if (rq) 279 hctx = q->mq_ops->map_queue(q, ctx->cpu);
246 blk_mq_put_ctx(rq->mq_ctx); 280 rq = __blk_mq_alloc_request(q, hctx, ctx, rw, gfp, reserved);
281 }
282 blk_mq_put_ctx(ctx);
247 return rq; 283 return rq;
248} 284}
249EXPORT_SYMBOL(blk_mq_alloc_reserved_request); 285EXPORT_SYMBOL(blk_mq_alloc_request);
250
251/*
252 * Re-init and set pdu, if we have it
253 */
254void blk_mq_rq_init(struct blk_mq_hw_ctx *hctx, struct request *rq)
255{
256 blk_rq_init(hctx->queue, rq);
257
258 if (hctx->cmd_size)
259 rq->special = blk_mq_rq_to_pdu(rq);
260}
261 286
262static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, 287static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
263 struct blk_mq_ctx *ctx, struct request *rq) 288 struct blk_mq_ctx *ctx, struct request *rq)
@@ -265,9 +290,11 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
265 const int tag = rq->tag; 290 const int tag = rq->tag;
266 struct request_queue *q = rq->q; 291 struct request_queue *q = rq->q;
267 292
268 blk_mq_rq_init(hctx, rq); 293 if (rq->cmd_flags & REQ_MQ_INFLIGHT)
269 blk_mq_put_tag(hctx->tags, tag); 294 atomic_dec(&hctx->nr_active);
270 295
296 clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
297 blk_mq_put_tag(hctx, tag, &ctx->last_tag);
271 blk_mq_queue_exit(q); 298 blk_mq_queue_exit(q);
272} 299}
273 300
@@ -283,20 +310,47 @@ void blk_mq_free_request(struct request *rq)
283 __blk_mq_free_request(hctx, ctx, rq); 310 __blk_mq_free_request(hctx, ctx, rq);
284} 311}
285 312
286bool blk_mq_end_io_partial(struct request *rq, int error, unsigned int nr_bytes) 313/*
314 * Clone all relevant state from a request that has been put on hold in
315 * the flush state machine into the preallocated flush request that hangs
316 * off the request queue.
317 *
 318 * To the driver the flush request should be invisible; that's why we
 319 * impersonate the original request here.
320 */
321void blk_mq_clone_flush_request(struct request *flush_rq,
322 struct request *orig_rq)
287{ 323{
288 if (blk_update_request(rq, error, blk_rq_bytes(rq))) 324 struct blk_mq_hw_ctx *hctx =
289 return true; 325 orig_rq->q->mq_ops->map_queue(orig_rq->q, orig_rq->mq_ctx->cpu);
326
327 flush_rq->mq_ctx = orig_rq->mq_ctx;
328 flush_rq->tag = orig_rq->tag;
329 memcpy(blk_mq_rq_to_pdu(flush_rq), blk_mq_rq_to_pdu(orig_rq),
330 hctx->cmd_size);
331}
290 332
333inline void __blk_mq_end_io(struct request *rq, int error)
334{
291 blk_account_io_done(rq); 335 blk_account_io_done(rq);
292 336
293 if (rq->end_io) 337 if (rq->end_io) {
294 rq->end_io(rq, error); 338 rq->end_io(rq, error);
295 else 339 } else {
340 if (unlikely(blk_bidi_rq(rq)))
341 blk_mq_free_request(rq->next_rq);
296 blk_mq_free_request(rq); 342 blk_mq_free_request(rq);
297 return false; 343 }
344}
345EXPORT_SYMBOL(__blk_mq_end_io);
346
347void blk_mq_end_io(struct request *rq, int error)
348{
349 if (blk_update_request(rq, error, blk_rq_bytes(rq)))
350 BUG();
351 __blk_mq_end_io(rq, error);
298} 352}
299EXPORT_SYMBOL(blk_mq_end_io_partial); 353EXPORT_SYMBOL(blk_mq_end_io);
300 354
301static void __blk_mq_complete_request_remote(void *data) 355static void __blk_mq_complete_request_remote(void *data)
302{ 356{
@@ -305,18 +359,22 @@ static void __blk_mq_complete_request_remote(void *data)
305 rq->q->softirq_done_fn(rq); 359 rq->q->softirq_done_fn(rq);
306} 360}
307 361
308void __blk_mq_complete_request(struct request *rq) 362static void blk_mq_ipi_complete_request(struct request *rq)
309{ 363{
310 struct blk_mq_ctx *ctx = rq->mq_ctx; 364 struct blk_mq_ctx *ctx = rq->mq_ctx;
365 bool shared = false;
311 int cpu; 366 int cpu;
312 367
313 if (!ctx->ipi_redirect) { 368 if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
314 rq->q->softirq_done_fn(rq); 369 rq->q->softirq_done_fn(rq);
315 return; 370 return;
316 } 371 }
317 372
318 cpu = get_cpu(); 373 cpu = get_cpu();
319 if (cpu != ctx->cpu && cpu_online(ctx->cpu)) { 374 if (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags))
375 shared = cpus_share_cache(cpu, ctx->cpu);
376
377 if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
320 rq->csd.func = __blk_mq_complete_request_remote; 378 rq->csd.func = __blk_mq_complete_request_remote;
321 rq->csd.info = rq; 379 rq->csd.info = rq;
322 rq->csd.flags = 0; 380 rq->csd.flags = 0;
@@ -327,6 +385,16 @@ void __blk_mq_complete_request(struct request *rq)
327 put_cpu(); 385 put_cpu();
328} 386}
329 387
388void __blk_mq_complete_request(struct request *rq)
389{
390 struct request_queue *q = rq->q;
391
392 if (!q->softirq_done_fn)
393 blk_mq_end_io(rq, rq->errors);
394 else
395 blk_mq_ipi_complete_request(rq);
396}
397
330/** 398/**
331 * blk_mq_complete_request - end I/O on a request 399 * blk_mq_complete_request - end I/O on a request
332 * @rq: the request being processed 400 * @rq: the request being processed
@@ -337,7 +405,9 @@ void __blk_mq_complete_request(struct request *rq)
337 **/ 405 **/
338void blk_mq_complete_request(struct request *rq) 406void blk_mq_complete_request(struct request *rq)
339{ 407{
340 if (unlikely(blk_should_fake_timeout(rq->q))) 408 struct request_queue *q = rq->q;
409
410 if (unlikely(blk_should_fake_timeout(q)))
341 return; 411 return;
342 if (!blk_mark_rq_complete(rq)) 412 if (!blk_mark_rq_complete(rq))
343 __blk_mq_complete_request(rq); 413 __blk_mq_complete_request(rq);
@@ -350,13 +420,31 @@ static void blk_mq_start_request(struct request *rq, bool last)
350 420
351 trace_block_rq_issue(q, rq); 421 trace_block_rq_issue(q, rq);
352 422
423 rq->resid_len = blk_rq_bytes(rq);
424 if (unlikely(blk_bidi_rq(rq)))
425 rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
426
353 /* 427 /*
354 * Just mark start time and set the started bit. Due to memory 428 * Just mark start time and set the started bit. Due to memory
355 * ordering, we know we'll see the correct deadline as long as 429 * ordering, we know we'll see the correct deadline as long as
356 * REQ_ATOMIC_STARTED is seen. 430 * REQ_ATOMIC_STARTED is seen. Use the default queue timeout,
431 * unless one has been set in the request.
432 */
433 if (!rq->timeout)
434 rq->deadline = jiffies + q->rq_timeout;
435 else
436 rq->deadline = jiffies + rq->timeout;
437
438 /*
439 * Mark us as started and clear complete. Complete might have been
440 * set if requeue raced with timeout, which then marked it as
441 * complete. So be sure to clear complete again when we start
442 * the request, otherwise we'll ignore the completion event.
357 */ 443 */
358 rq->deadline = jiffies + q->rq_timeout; 444 if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
359 set_bit(REQ_ATOM_STARTED, &rq->atomic_flags); 445 set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
446 if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
447 clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
360 448
361 if (q->dma_drain_size && blk_rq_bytes(rq)) { 449 if (q->dma_drain_size && blk_rq_bytes(rq)) {
362 /* 450 /*
@@ -378,7 +466,7 @@ static void blk_mq_start_request(struct request *rq, bool last)
378 rq->cmd_flags |= REQ_END; 466 rq->cmd_flags |= REQ_END;
379} 467}
380 468
381static void blk_mq_requeue_request(struct request *rq) 469static void __blk_mq_requeue_request(struct request *rq)
382{ 470{
383 struct request_queue *q = rq->q; 471 struct request_queue *q = rq->q;
384 472
@@ -391,6 +479,86 @@ static void blk_mq_requeue_request(struct request *rq)
391 rq->nr_phys_segments--; 479 rq->nr_phys_segments--;
392} 480}
393 481
482void blk_mq_requeue_request(struct request *rq)
483{
484 __blk_mq_requeue_request(rq);
485 blk_clear_rq_complete(rq);
486
487 BUG_ON(blk_queued_rq(rq));
488 blk_mq_add_to_requeue_list(rq, true);
489}
490EXPORT_SYMBOL(blk_mq_requeue_request);
491
492static void blk_mq_requeue_work(struct work_struct *work)
493{
494 struct request_queue *q =
495 container_of(work, struct request_queue, requeue_work);
496 LIST_HEAD(rq_list);
497 struct request *rq, *next;
498 unsigned long flags;
499
500 spin_lock_irqsave(&q->requeue_lock, flags);
501 list_splice_init(&q->requeue_list, &rq_list);
502 spin_unlock_irqrestore(&q->requeue_lock, flags);
503
504 list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
505 if (!(rq->cmd_flags & REQ_SOFTBARRIER))
506 continue;
507
508 rq->cmd_flags &= ~REQ_SOFTBARRIER;
509 list_del_init(&rq->queuelist);
510 blk_mq_insert_request(rq, true, false, false);
511 }
512
513 while (!list_empty(&rq_list)) {
514 rq = list_entry(rq_list.next, struct request, queuelist);
515 list_del_init(&rq->queuelist);
516 blk_mq_insert_request(rq, false, false, false);
517 }
518
519 blk_mq_run_queues(q, false);
520}
521
522void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
523{
524 struct request_queue *q = rq->q;
525 unsigned long flags;
526
527 /*
528 * We abuse this flag that is otherwise used by the I/O scheduler to
 529 * request head insertion from the workqueue.
530 */
531 BUG_ON(rq->cmd_flags & REQ_SOFTBARRIER);
532
533 spin_lock_irqsave(&q->requeue_lock, flags);
534 if (at_head) {
535 rq->cmd_flags |= REQ_SOFTBARRIER;
536 list_add(&rq->queuelist, &q->requeue_list);
537 } else {
538 list_add_tail(&rq->queuelist, &q->requeue_list);
539 }
540 spin_unlock_irqrestore(&q->requeue_lock, flags);
541}
542EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
543
544void blk_mq_kick_requeue_list(struct request_queue *q)
545{
546 kblockd_schedule_work(&q->requeue_work);
547}
548EXPORT_SYMBOL(blk_mq_kick_requeue_list);
549
550struct request *blk_mq_tag_to_rq(struct blk_mq_hw_ctx *hctx, unsigned int tag)
551{
552 struct request_queue *q = hctx->queue;
553
554 if ((q->flush_rq->cmd_flags & REQ_FLUSH_SEQ) &&
555 q->flush_rq->tag == tag)
556 return q->flush_rq;
557
558 return hctx->tags->rqs[tag];
559}
560EXPORT_SYMBOL(blk_mq_tag_to_rq);
561
394struct blk_mq_timeout_data { 562struct blk_mq_timeout_data {
395 struct blk_mq_hw_ctx *hctx; 563 struct blk_mq_hw_ctx *hctx;
396 unsigned long *next; 564 unsigned long *next;
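
blk_mq_requeue_work() above splices the whole requeue list onto a private list under the lock, then makes two passes: REQ_SOFTBARRIER-marked requests are inserted at the head first, everything else at the tail in order. A much-simplified stand-alone sketch of that two-pass ordering (arrays instead of lists, no locking; all names invented):

#include <stdbool.h>
#include <stdio.h>

struct fake_rq {
	int id;
	bool at_head;	/* stands in for REQ_SOFTBARRIER */
	bool done;
};

static void requeue_work(struct fake_rq *rqs, int nr)
{
	int i;

	/* pass 1: head insertions (REQ_SOFTBARRIER in the kernel) */
	for (i = 0; i < nr; i++)
		if (rqs[i].at_head && !rqs[i].done) {
			printf("insert rq %d at head\n", rqs[i].id);
			rqs[i].done = true;
		}

	/* pass 2: everything else, preserving order */
	for (i = 0; i < nr; i++)
		if (!rqs[i].done) {
			printf("insert rq %d at tail\n", rqs[i].id);
			rqs[i].done = true;
		}
}

int main(void)
{
	struct fake_rq rqs[] = {
		{ .id = 1, .at_head = false },
		{ .id = 2, .at_head = true },
		{ .id = 3, .at_head = false },
	};

	requeue_work(rqs, 3);
	return 0;
}
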
@@ -412,12 +580,13 @@ static void blk_mq_timeout_check(void *__data, unsigned long *free_tags)
412 do { 580 do {
413 struct request *rq; 581 struct request *rq;
414 582
415 tag = find_next_zero_bit(free_tags, hctx->queue_depth, tag); 583 tag = find_next_zero_bit(free_tags, hctx->tags->nr_tags, tag);
416 if (tag >= hctx->queue_depth) 584 if (tag >= hctx->tags->nr_tags)
417 break; 585 break;
418 586
419 rq = hctx->rqs[tag++]; 587 rq = blk_mq_tag_to_rq(hctx, tag++);
420 588 if (rq->q != hctx->queue)
589 continue;
421 if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) 590 if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
422 continue; 591 continue;
423 592
@@ -442,6 +611,28 @@ static void blk_mq_hw_ctx_check_timeout(struct blk_mq_hw_ctx *hctx,
442 blk_mq_tag_busy_iter(hctx->tags, blk_mq_timeout_check, &data); 611 blk_mq_tag_busy_iter(hctx->tags, blk_mq_timeout_check, &data);
443} 612}
444 613
614static enum blk_eh_timer_return blk_mq_rq_timed_out(struct request *rq)
615{
616 struct request_queue *q = rq->q;
617
618 /*
619 * We know that complete is set at this point. If STARTED isn't set
620 * anymore, then the request isn't active and the "timeout" should
621 * just be ignored. This can happen due to the bitflag ordering.
622 * Timeout first checks if STARTED is set, and if it is, assumes
623 * the request is active. But if we race with completion, then
 624 * both flags will get cleared. So check here again, and ignore
625 * a timeout event with a request that isn't active.
626 */
627 if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
628 return BLK_EH_NOT_HANDLED;
629
630 if (!q->mq_ops->timeout)
631 return BLK_EH_RESET_TIMER;
632
633 return q->mq_ops->timeout(rq);
634}
635
445static void blk_mq_rq_timer(unsigned long data) 636static void blk_mq_rq_timer(unsigned long data)
446{ 637{
447 struct request_queue *q = (struct request_queue *) data; 638 struct request_queue *q = (struct request_queue *) data;
@@ -449,11 +640,24 @@ static void blk_mq_rq_timer(unsigned long data)
449 unsigned long next = 0; 640 unsigned long next = 0;
450 int i, next_set = 0; 641 int i, next_set = 0;
451 642
452 queue_for_each_hw_ctx(q, hctx, i) 643 queue_for_each_hw_ctx(q, hctx, i) {
644 /*
 645 * If no software queues are currently mapped to this
646 * hardware queue, there's nothing to check
647 */
648 if (!hctx->nr_ctx || !hctx->tags)
649 continue;
650
453 blk_mq_hw_ctx_check_timeout(hctx, &next, &next_set); 651 blk_mq_hw_ctx_check_timeout(hctx, &next, &next_set);
652 }
454 653
455 if (next_set) 654 if (next_set) {
456 mod_timer(&q->timeout, round_jiffies_up(next)); 655 next = blk_rq_timeout(round_jiffies_up(next));
656 mod_timer(&q->timeout, next);
657 } else {
658 queue_for_each_hw_ctx(q, hctx, i)
659 blk_mq_tag_idle(hctx);
660 }
457} 661}
458 662
459/* 663/*
@@ -495,9 +699,38 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
495 return false; 699 return false;
496} 700}
497 701
498void blk_mq_add_timer(struct request *rq) 702/*
703 * Process software queues that have been marked busy, splicing them
704 * to the for-dispatch
705 */
706static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
499{ 707{
500 __blk_add_timer(rq, NULL); 708 struct blk_mq_ctx *ctx;
709 int i;
710
711 for (i = 0; i < hctx->ctx_map.map_size; i++) {
712 struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
713 unsigned int off, bit;
714
715 if (!bm->word)
716 continue;
717
718 bit = 0;
719 off = i * hctx->ctx_map.bits_per_word;
720 do {
721 bit = find_next_bit(&bm->word, bm->depth, bit);
722 if (bit >= bm->depth)
723 break;
724
725 ctx = hctx->ctxs[bit + off];
726 clear_bit(bit, &bm->word);
727 spin_lock(&ctx->lock);
728 list_splice_tail_init(&ctx->rq_list, list);
729 spin_unlock(&ctx->lock);
730
731 bit++;
732 } while (1);
733 }
501} 734}
502 735
503/* 736/*
@@ -509,10 +742,11 @@ void blk_mq_add_timer(struct request *rq)
509static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) 742static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
510{ 743{
511 struct request_queue *q = hctx->queue; 744 struct request_queue *q = hctx->queue;
512 struct blk_mq_ctx *ctx;
513 struct request *rq; 745 struct request *rq;
514 LIST_HEAD(rq_list); 746 LIST_HEAD(rq_list);
515 int bit, queued; 747 int queued;
748
749 WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
516 750
517 if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) 751 if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
518 return; 752 return;
@@ -522,15 +756,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
522 /* 756 /*
523 * Touch any software queue that has pending entries. 757 * Touch any software queue that has pending entries.
524 */ 758 */
525 for_each_set_bit(bit, hctx->ctx_map, hctx->nr_ctx) { 759 flush_busy_ctxs(hctx, &rq_list);
526 clear_bit(bit, hctx->ctx_map);
527 ctx = hctx->ctxs[bit];
528 BUG_ON(bit != ctx->index_hw);
529
530 spin_lock(&ctx->lock);
531 list_splice_tail_init(&ctx->rq_list, &rq_list);
532 spin_unlock(&ctx->lock);
533 }
534 760
535 /* 761 /*
536 * If we have previous entries on our dispatch list, grab them 762 * If we have previous entries on our dispatch list, grab them
@@ -544,13 +770,9 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
544 } 770 }
545 771
546 /* 772 /*
547 * Delete and return all entries from our dispatch list
548 */
549 queued = 0;
550
551 /*
552 * Now process all the entries, sending them to the driver. 773 * Now process all the entries, sending them to the driver.
553 */ 774 */
775 queued = 0;
554 while (!list_empty(&rq_list)) { 776 while (!list_empty(&rq_list)) {
555 int ret; 777 int ret;
556 778
@@ -565,13 +787,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
565 queued++; 787 queued++;
566 continue; 788 continue;
567 case BLK_MQ_RQ_QUEUE_BUSY: 789 case BLK_MQ_RQ_QUEUE_BUSY:
568 /*
569 * FIXME: we should have a mechanism to stop the queue
570 * like blk_stop_queue, otherwise we will waste cpu
571 * time
572 */
573 list_add(&rq->queuelist, &rq_list); 790 list_add(&rq->queuelist, &rq_list);
574 blk_mq_requeue_request(rq); 791 __blk_mq_requeue_request(rq);
575 break; 792 break;
576 default: 793 default:
577 pr_err("blk-mq: bad return on queue: %d\n", ret); 794 pr_err("blk-mq: bad return on queue: %d\n", ret);
@@ -601,17 +818,44 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
601 } 818 }
602} 819}
603 820
821/*
822 * It'd be great if the workqueue API had a way to pass
823 * in a mask and had some smarts for more clever placement.
824 * For now we just round-robin here, switching for every
825 * BLK_MQ_CPU_WORK_BATCH queued items.
826 */
827static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
828{
829 int cpu = hctx->next_cpu;
830
831 if (--hctx->next_cpu_batch <= 0) {
832 int next_cpu;
833
834 next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
835 if (next_cpu >= nr_cpu_ids)
836 next_cpu = cpumask_first(hctx->cpumask);
837
838 hctx->next_cpu = next_cpu;
839 hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
840 }
841
842 return cpu;
843}
844
604void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) 845void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
605{ 846{
606 if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) 847 if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
607 return; 848 return;
608 849
609 if (!async) 850 if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask))
610 __blk_mq_run_hw_queue(hctx); 851 __blk_mq_run_hw_queue(hctx);
852 else if (hctx->queue->nr_hw_queues == 1)
853 kblockd_schedule_delayed_work(&hctx->run_work, 0);
611 else { 854 else {
612 struct request_queue *q = hctx->queue; 855 unsigned int cpu;
613 856
614 kblockd_schedule_delayed_work(q, &hctx->delayed_work, 0); 857 cpu = blk_mq_hctx_next_cpu(hctx);
858 kblockd_schedule_delayed_work_on(cpu, &hctx->run_work, 0);
615 } 859 }
616} 860}
617 861
@@ -626,14 +870,17 @@ void blk_mq_run_queues(struct request_queue *q, bool async)
626 test_bit(BLK_MQ_S_STOPPED, &hctx->state)) 870 test_bit(BLK_MQ_S_STOPPED, &hctx->state))
627 continue; 871 continue;
628 872
873 preempt_disable();
629 blk_mq_run_hw_queue(hctx, async); 874 blk_mq_run_hw_queue(hctx, async);
875 preempt_enable();
630 } 876 }
631} 877}
632EXPORT_SYMBOL(blk_mq_run_queues); 878EXPORT_SYMBOL(blk_mq_run_queues);
633 879
634void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx) 880void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
635{ 881{
636 cancel_delayed_work(&hctx->delayed_work); 882 cancel_delayed_work(&hctx->run_work);
883 cancel_delayed_work(&hctx->delay_work);
637 set_bit(BLK_MQ_S_STOPPED, &hctx->state); 884 set_bit(BLK_MQ_S_STOPPED, &hctx->state);
638} 885}
639EXPORT_SYMBOL(blk_mq_stop_hw_queue); 886EXPORT_SYMBOL(blk_mq_stop_hw_queue);
@@ -651,11 +898,25 @@ EXPORT_SYMBOL(blk_mq_stop_hw_queues);
651void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx) 898void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
652{ 899{
653 clear_bit(BLK_MQ_S_STOPPED, &hctx->state); 900 clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
901
902 preempt_disable();
654 __blk_mq_run_hw_queue(hctx); 903 __blk_mq_run_hw_queue(hctx);
904 preempt_enable();
655} 905}
656EXPORT_SYMBOL(blk_mq_start_hw_queue); 906EXPORT_SYMBOL(blk_mq_start_hw_queue);
657 907
658void blk_mq_start_stopped_hw_queues(struct request_queue *q) 908void blk_mq_start_hw_queues(struct request_queue *q)
909{
910 struct blk_mq_hw_ctx *hctx;
911 int i;
912
913 queue_for_each_hw_ctx(q, hctx, i)
914 blk_mq_start_hw_queue(hctx);
915}
916EXPORT_SYMBOL(blk_mq_start_hw_queues);
917
918
919void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
659{ 920{
660 struct blk_mq_hw_ctx *hctx; 921 struct blk_mq_hw_ctx *hctx;
661 int i; 922 int i;
@@ -665,19 +926,47 @@ void blk_mq_start_stopped_hw_queues(struct request_queue *q)
665 continue; 926 continue;
666 927
667 clear_bit(BLK_MQ_S_STOPPED, &hctx->state); 928 clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
668 blk_mq_run_hw_queue(hctx, true); 929 preempt_disable();
930 blk_mq_run_hw_queue(hctx, async);
931 preempt_enable();
669 } 932 }
670} 933}
671EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues); 934EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
672 935
673static void blk_mq_work_fn(struct work_struct *work) 936static void blk_mq_run_work_fn(struct work_struct *work)
674{ 937{
675 struct blk_mq_hw_ctx *hctx; 938 struct blk_mq_hw_ctx *hctx;
676 939
677 hctx = container_of(work, struct blk_mq_hw_ctx, delayed_work.work); 940 hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
941
678 __blk_mq_run_hw_queue(hctx); 942 __blk_mq_run_hw_queue(hctx);
679} 943}
680 944
945static void blk_mq_delay_work_fn(struct work_struct *work)
946{
947 struct blk_mq_hw_ctx *hctx;
948
949 hctx = container_of(work, struct blk_mq_hw_ctx, delay_work.work);
950
951 if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state))
952 __blk_mq_run_hw_queue(hctx);
953}
954
955void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
956{
957 unsigned long tmo = msecs_to_jiffies(msecs);
958
959 if (hctx->queue->nr_hw_queues == 1)
960 kblockd_schedule_delayed_work(&hctx->delay_work, tmo);
961 else {
962 unsigned int cpu;
963
964 cpu = blk_mq_hctx_next_cpu(hctx);
965 kblockd_schedule_delayed_work_on(cpu, &hctx->delay_work, tmo);
966 }
967}
968EXPORT_SYMBOL(blk_mq_delay_queue);
969
681static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, 970static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
682 struct request *rq, bool at_head) 971 struct request *rq, bool at_head)
683{ 972{
@@ -689,12 +978,13 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
689 list_add(&rq->queuelist, &ctx->rq_list); 978 list_add(&rq->queuelist, &ctx->rq_list);
690 else 979 else
691 list_add_tail(&rq->queuelist, &ctx->rq_list); 980 list_add_tail(&rq->queuelist, &ctx->rq_list);
981
692 blk_mq_hctx_mark_pending(hctx, ctx); 982 blk_mq_hctx_mark_pending(hctx, ctx);
693 983
694 /* 984 /*
695 * We do this early, to ensure we are on the right CPU. 985 * We do this early, to ensure we are on the right CPU.
696 */ 986 */
697 blk_mq_add_timer(rq); 987 blk_add_timer(rq);
698} 988}
699 989
700void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue, 990void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
@@ -719,10 +1009,10 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
719 spin_unlock(&ctx->lock); 1009 spin_unlock(&ctx->lock);
720 } 1010 }
721 1011
722 blk_mq_put_ctx(current_ctx);
723
724 if (run_queue) 1012 if (run_queue)
725 blk_mq_run_hw_queue(hctx, async); 1013 blk_mq_run_hw_queue(hctx, async);
1014
1015 blk_mq_put_ctx(current_ctx);
726} 1016}
727 1017
728static void blk_mq_insert_requests(struct request_queue *q, 1018static void blk_mq_insert_requests(struct request_queue *q,
@@ -758,9 +1048,8 @@ static void blk_mq_insert_requests(struct request_queue *q,
758 } 1048 }
759 spin_unlock(&ctx->lock); 1049 spin_unlock(&ctx->lock);
760 1050
761 blk_mq_put_ctx(current_ctx);
762
763 blk_mq_run_hw_queue(hctx, from_schedule); 1051 blk_mq_run_hw_queue(hctx, from_schedule);
1052 blk_mq_put_ctx(current_ctx);
764} 1053}
765 1054
766static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b) 1055static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -823,24 +1112,169 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
823static void blk_mq_bio_to_request(struct request *rq, struct bio *bio) 1112static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
824{ 1113{
825 init_request_from_bio(rq, bio); 1114 init_request_from_bio(rq, bio);
826 blk_account_io_start(rq, 1); 1115
1116 if (blk_do_io_stat(rq)) {
1117 rq->start_time = jiffies;
1118 blk_account_io_start(rq, 1);
1119 }
827} 1120}
828 1121
829static void blk_mq_make_request(struct request_queue *q, struct bio *bio) 1122static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
1123 struct blk_mq_ctx *ctx,
1124 struct request *rq, struct bio *bio)
1125{
1126 struct request_queue *q = hctx->queue;
1127
1128 if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE)) {
1129 blk_mq_bio_to_request(rq, bio);
1130 spin_lock(&ctx->lock);
1131insert_rq:
1132 __blk_mq_insert_request(hctx, rq, false);
1133 spin_unlock(&ctx->lock);
1134 return false;
1135 } else {
1136 spin_lock(&ctx->lock);
1137 if (!blk_mq_attempt_merge(q, ctx, bio)) {
1138 blk_mq_bio_to_request(rq, bio);
1139 goto insert_rq;
1140 }
1141
1142 spin_unlock(&ctx->lock);
1143 __blk_mq_free_request(hctx, ctx, rq);
1144 return true;
1145 }
1146}
1147
1148struct blk_map_ctx {
1149 struct blk_mq_hw_ctx *hctx;
1150 struct blk_mq_ctx *ctx;
1151};
1152
1153static struct request *blk_mq_map_request(struct request_queue *q,
1154 struct bio *bio,
1155 struct blk_map_ctx *data)
830{ 1156{
831 struct blk_mq_hw_ctx *hctx; 1157 struct blk_mq_hw_ctx *hctx;
832 struct blk_mq_ctx *ctx; 1158 struct blk_mq_ctx *ctx;
1159 struct request *rq;
1160 int rw = bio_data_dir(bio);
1161
1162 if (unlikely(blk_mq_queue_enter(q))) {
1163 bio_endio(bio, -EIO);
1164 return NULL;
1165 }
1166
1167 ctx = blk_mq_get_ctx(q);
1168 hctx = q->mq_ops->map_queue(q, ctx->cpu);
1169
1170 if (rw_is_sync(bio->bi_rw))
1171 rw |= REQ_SYNC;
1172
1173 trace_block_getrq(q, bio, rw);
1174 rq = __blk_mq_alloc_request(q, hctx, ctx, rw, GFP_ATOMIC, false);
1175 if (unlikely(!rq)) {
1176 __blk_mq_run_hw_queue(hctx);
1177 blk_mq_put_ctx(ctx);
1178 trace_block_sleeprq(q, bio, rw);
1179
1180 ctx = blk_mq_get_ctx(q);
1181 hctx = q->mq_ops->map_queue(q, ctx->cpu);
1182 rq = __blk_mq_alloc_request(q, hctx, ctx, rw,
1183 __GFP_WAIT|GFP_ATOMIC, false);
1184 }
1185
1186 hctx->queued++;
1187 data->hctx = hctx;
1188 data->ctx = ctx;
1189 return rq;
1190}
1191
1192/*
1193 * Multiple hardware queue variant. This will not use per-process plugs,
1194 * but will attempt to bypass the hctx queueing if we can go straight to
1195 * hardware for SYNC IO.
1196 */
1197static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
1198{
833 const int is_sync = rw_is_sync(bio->bi_rw); 1199 const int is_sync = rw_is_sync(bio->bi_rw);
834 const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA); 1200 const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
835 int rw = bio_data_dir(bio); 1201 struct blk_map_ctx data;
836 struct request *rq; 1202 struct request *rq;
1203
1204 blk_queue_bounce(q, &bio);
1205
1206 if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
1207 bio_endio(bio, -EIO);
1208 return;
1209 }
1210
1211 rq = blk_mq_map_request(q, bio, &data);
1212 if (unlikely(!rq))
1213 return;
1214
1215 if (unlikely(is_flush_fua)) {
1216 blk_mq_bio_to_request(rq, bio);
1217 blk_insert_flush(rq);
1218 goto run_queue;
1219 }
1220
1221 if (is_sync) {
1222 int ret;
1223
1224 blk_mq_bio_to_request(rq, bio);
1225 blk_mq_start_request(rq, true);
1226 blk_add_timer(rq);
1227
1228 /*
1229 * For OK queue, we are done. For error, kill it. Any other
1230 * error (busy), just add it to our list as we previously
1231 * would have done
1232 */
1233 ret = q->mq_ops->queue_rq(data.hctx, rq);
1234 if (ret == BLK_MQ_RQ_QUEUE_OK)
1235 goto done;
1236 else {
1237 __blk_mq_requeue_request(rq);
1238
1239 if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
1240 rq->errors = -EIO;
1241 blk_mq_end_io(rq, rq->errors);
1242 goto done;
1243 }
1244 }
1245 }
1246
1247 if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
1248 /*
1249 * For a SYNC request, send it to the hardware immediately. For
1250 * an ASYNC request, just ensure that we run it later on. The
1251 * latter allows for merging opportunities and more efficient
1252 * dispatching.
1253 */
1254run_queue:
1255 blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
1256 }
1257done:
1258 blk_mq_put_ctx(data.ctx);
1259}
1260
1261/*
1262 * Single hardware queue variant. This will attempt to use any per-process
1263 * plug for merging and IO deferral.
1264 */
1265static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
1266{
1267 const int is_sync = rw_is_sync(bio->bi_rw);
1268 const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
837 unsigned int use_plug, request_count = 0; 1269 unsigned int use_plug, request_count = 0;
1270 struct blk_map_ctx data;
1271 struct request *rq;
838 1272
839 /* 1273 /*
840 * If we have multiple hardware queues, just go directly to 1274 * If we have multiple hardware queues, just go directly to
841 * one of those for sync IO. 1275 * one of those for sync IO.
842 */ 1276 */
843 use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync); 1277 use_plug = !is_flush_fua && !is_sync;
844 1278
845 blk_queue_bounce(q, &bio); 1279 blk_queue_bounce(q, &bio);
846 1280
@@ -849,37 +1283,14 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
849 return; 1283 return;
850 } 1284 }
851 1285
852 if (use_plug && blk_attempt_plug_merge(q, bio, &request_count)) 1286 if (use_plug && !blk_queue_nomerges(q) &&
1287 blk_attempt_plug_merge(q, bio, &request_count))
853 return; 1288 return;
854 1289
855 if (blk_mq_queue_enter(q)) { 1290 rq = blk_mq_map_request(q, bio, &data);
856 bio_endio(bio, -EIO);
857 return;
858 }
859
860 ctx = blk_mq_get_ctx(q);
861 hctx = q->mq_ops->map_queue(q, ctx->cpu);
862
863 if (is_sync)
864 rw |= REQ_SYNC;
865 trace_block_getrq(q, bio, rw);
866 rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false);
867 if (likely(rq))
868 blk_mq_rq_ctx_init(q, ctx, rq, rw);
869 else {
870 blk_mq_put_ctx(ctx);
871 trace_block_sleeprq(q, bio, rw);
872 rq = blk_mq_alloc_request_pinned(q, rw, __GFP_WAIT|GFP_ATOMIC,
873 false);
874 ctx = rq->mq_ctx;
875 hctx = q->mq_ops->map_queue(q, ctx->cpu);
876 }
877
878 hctx->queued++;
879 1291
880 if (unlikely(is_flush_fua)) { 1292 if (unlikely(is_flush_fua)) {
881 blk_mq_bio_to_request(rq, bio); 1293 blk_mq_bio_to_request(rq, bio);
882 blk_mq_put_ctx(ctx);
883 blk_insert_flush(rq); 1294 blk_insert_flush(rq);
884 goto run_queue; 1295 goto run_queue;
885 } 1296 }
@@ -901,31 +1312,23 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
901 trace_block_plug(q); 1312 trace_block_plug(q);
902 } 1313 }
903 list_add_tail(&rq->queuelist, &plug->mq_list); 1314 list_add_tail(&rq->queuelist, &plug->mq_list);
904 blk_mq_put_ctx(ctx); 1315 blk_mq_put_ctx(data.ctx);
905 return; 1316 return;
906 } 1317 }
907 } 1318 }
908 1319
909 spin_lock(&ctx->lock); 1320 if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
910 1321 /*
911 if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) && 1322 * For a SYNC request, send it to the hardware immediately. For
912 blk_mq_attempt_merge(q, ctx, bio)) 1323 * an ASYNC request, just ensure that we run it later on. The
913 __blk_mq_free_request(hctx, ctx, rq); 1324 * latter allows for merging opportunities and more efficient
914 else { 1325 * dispatching.
915 blk_mq_bio_to_request(rq, bio); 1326 */
916 __blk_mq_insert_request(hctx, rq, false); 1327run_queue:
1328 blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
917 } 1329 }
918 1330
919 spin_unlock(&ctx->lock); 1331 blk_mq_put_ctx(data.ctx);
920 blk_mq_put_ctx(ctx);
921
922 /*
923 * For a SYNC request, send it to the hardware immediately. For an
924 * ASYNC request, just ensure that we run it later on. The latter
925 * allows for merging opportunities and more efficient dispatching.
926 */
927run_queue:
928 blk_mq_run_hw_queue(hctx, !is_sync || is_flush_fua);
929} 1332}
930 1333
931/* 1334/*
@@ -937,32 +1340,153 @@ struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
937} 1340}
938EXPORT_SYMBOL(blk_mq_map_queue); 1341EXPORT_SYMBOL(blk_mq_map_queue);
939 1342
940struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_reg *reg, 1343static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
941 unsigned int hctx_index) 1344 struct blk_mq_tags *tags, unsigned int hctx_idx)
942{ 1345{
943 return kmalloc_node(sizeof(struct blk_mq_hw_ctx), 1346 struct page *page;
944 GFP_KERNEL | __GFP_ZERO, reg->numa_node); 1347
1348 if (tags->rqs && set->ops->exit_request) {
1349 int i;
1350
1351 for (i = 0; i < tags->nr_tags; i++) {
1352 if (!tags->rqs[i])
1353 continue;
1354 set->ops->exit_request(set->driver_data, tags->rqs[i],
1355 hctx_idx, i);
1356 }
1357 }
1358
1359 while (!list_empty(&tags->page_list)) {
1360 page = list_first_entry(&tags->page_list, struct page, lru);
1361 list_del_init(&page->lru);
1362 __free_pages(page, page->private);
1363 }
1364
1365 kfree(tags->rqs);
1366
1367 blk_mq_free_tags(tags);
945} 1368}
946EXPORT_SYMBOL(blk_mq_alloc_single_hw_queue);
947 1369
948void blk_mq_free_single_hw_queue(struct blk_mq_hw_ctx *hctx, 1370static size_t order_to_size(unsigned int order)
949 unsigned int hctx_index)
950{ 1371{
951 kfree(hctx); 1372 return (size_t)PAGE_SIZE << order;
952} 1373}
953EXPORT_SYMBOL(blk_mq_free_single_hw_queue);
954 1374
955static void blk_mq_hctx_notify(void *data, unsigned long action, 1375static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
956 unsigned int cpu) 1376 unsigned int hctx_idx)
1377{
1378 struct blk_mq_tags *tags;
1379 unsigned int i, j, entries_per_page, max_order = 4;
1380 size_t rq_size, left;
1381
1382 tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
1383 set->numa_node);
1384 if (!tags)
1385 return NULL;
1386
1387 INIT_LIST_HEAD(&tags->page_list);
1388
1389 tags->rqs = kmalloc_node(set->queue_depth * sizeof(struct request *),
1390 GFP_KERNEL, set->numa_node);
1391 if (!tags->rqs) {
1392 blk_mq_free_tags(tags);
1393 return NULL;
1394 }
1395
1396 /*
1397 * rq_size is the size of the request plus driver payload, rounded
1398 * to the cacheline size
1399 */
1400 rq_size = round_up(sizeof(struct request) + set->cmd_size,
1401 cache_line_size());
1402 left = rq_size * set->queue_depth;
1403
1404 for (i = 0; i < set->queue_depth; ) {
1405 int this_order = max_order;
1406 struct page *page;
1407 int to_do;
1408 void *p;
1409
1410 while (left < order_to_size(this_order - 1) && this_order)
1411 this_order--;
1412
1413 do {
1414 page = alloc_pages_node(set->numa_node, GFP_KERNEL,
1415 this_order);
1416 if (page)
1417 break;
1418 if (!this_order--)
1419 break;
1420 if (order_to_size(this_order) < rq_size)
1421 break;
1422 } while (1);
1423
1424 if (!page)
1425 goto fail;
1426
1427 page->private = this_order;
1428 list_add_tail(&page->lru, &tags->page_list);
1429
1430 p = page_address(page);
1431 entries_per_page = order_to_size(this_order) / rq_size;
1432 to_do = min(entries_per_page, set->queue_depth - i);
1433 left -= to_do * rq_size;
1434 for (j = 0; j < to_do; j++) {
1435 tags->rqs[i] = p;
1436 if (set->ops->init_request) {
1437 if (set->ops->init_request(set->driver_data,
1438 tags->rqs[i], hctx_idx, i,
1439 set->numa_node))
1440 goto fail;
1441 }
1442
1443 p += rq_size;
1444 i++;
1445 }
1446 }
1447
1448 return tags;
1449
1450fail:
1451 pr_warn("%s: failed to allocate requests\n", __func__);
1452 blk_mq_free_rq_map(set, tags, hctx_idx);
1453 return NULL;
1454}
1455
1456static void blk_mq_free_bitmap(struct blk_mq_ctxmap *bitmap)
1457{
1458 kfree(bitmap->map);
1459}
1460
1461static int blk_mq_alloc_bitmap(struct blk_mq_ctxmap *bitmap, int node)
1462{
1463 unsigned int bpw = 8, total, num_maps, i;
1464
1465 bitmap->bits_per_word = bpw;
1466
1467 num_maps = ALIGN(nr_cpu_ids, bpw) / bpw;
1468 bitmap->map = kzalloc_node(num_maps * sizeof(struct blk_align_bitmap),
1469 GFP_KERNEL, node);
1470 if (!bitmap->map)
1471 return -ENOMEM;
1472
1473 bitmap->map_size = num_maps;
1474
1475 total = nr_cpu_ids;
1476 for (i = 0; i < num_maps; i++) {
1477 bitmap->map[i].depth = min(total, bitmap->bits_per_word);
1478 total -= bitmap->map[i].depth;
1479 }
1480
1481 return 0;
1482}
1483
1484static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
957{ 1485{
958 struct blk_mq_hw_ctx *hctx = data;
959 struct request_queue *q = hctx->queue; 1486 struct request_queue *q = hctx->queue;
960 struct blk_mq_ctx *ctx; 1487 struct blk_mq_ctx *ctx;
961 LIST_HEAD(tmp); 1488 LIST_HEAD(tmp);
962 1489
963 if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
964 return;
965
966 /* 1490 /*
967 * Move ctx entries to new CPU, if this one is going away. 1491 * Move ctx entries to new CPU, if this one is going away.
968 */ 1492 */
@@ -971,12 +1495,12 @@ static void blk_mq_hctx_notify(void *data, unsigned long action,
971 spin_lock(&ctx->lock); 1495 spin_lock(&ctx->lock);
972 if (!list_empty(&ctx->rq_list)) { 1496 if (!list_empty(&ctx->rq_list)) {
973 list_splice_init(&ctx->rq_list, &tmp); 1497 list_splice_init(&ctx->rq_list, &tmp);
974 clear_bit(ctx->index_hw, hctx->ctx_map); 1498 blk_mq_hctx_clear_pending(hctx, ctx);
975 } 1499 }
976 spin_unlock(&ctx->lock); 1500 spin_unlock(&ctx->lock);
977 1501
978 if (list_empty(&tmp)) 1502 if (list_empty(&tmp))
979 return; 1503 return NOTIFY_OK;
980 1504
981 ctx = blk_mq_get_ctx(q); 1505 ctx = blk_mq_get_ctx(q);
982 spin_lock(&ctx->lock); 1506 spin_lock(&ctx->lock);
@@ -993,210 +1517,103 @@ static void blk_mq_hctx_notify(void *data, unsigned long action,
993 blk_mq_hctx_mark_pending(hctx, ctx); 1517 blk_mq_hctx_mark_pending(hctx, ctx);
994 1518
995 spin_unlock(&ctx->lock); 1519 spin_unlock(&ctx->lock);
996 blk_mq_put_ctx(ctx);
997 1520
998 blk_mq_run_hw_queue(hctx, true); 1521 blk_mq_run_hw_queue(hctx, true);
1522 blk_mq_put_ctx(ctx);
1523 return NOTIFY_OK;
999} 1524}
1000 1525
1001static int blk_mq_init_hw_commands(struct blk_mq_hw_ctx *hctx, 1526static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
1002 int (*init)(void *, struct blk_mq_hw_ctx *,
1003 struct request *, unsigned int),
1004 void *data)
1005{ 1527{
1006 unsigned int i; 1528 struct request_queue *q = hctx->queue;
1007 int ret = 0; 1529 struct blk_mq_tag_set *set = q->tag_set;
1008
1009 for (i = 0; i < hctx->queue_depth; i++) {
1010 struct request *rq = hctx->rqs[i];
1011
1012 ret = init(data, hctx, rq, i);
1013 if (ret)
1014 break;
1015 }
1016
1017 return ret;
1018}
1019 1530
1020int blk_mq_init_commands(struct request_queue *q, 1531 if (set->tags[hctx->queue_num])
1021 int (*init)(void *, struct blk_mq_hw_ctx *, 1532 return NOTIFY_OK;
1022 struct request *, unsigned int),
1023 void *data)
1024{
1025 struct blk_mq_hw_ctx *hctx;
1026 unsigned int i;
1027 int ret = 0;
1028 1533
1029 queue_for_each_hw_ctx(q, hctx, i) { 1534 set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
1030 ret = blk_mq_init_hw_commands(hctx, init, data); 1535 if (!set->tags[hctx->queue_num])
1031 if (ret) 1536 return NOTIFY_STOP;
1032 break;
1033 }
1034 1537
1035 return ret; 1538 hctx->tags = set->tags[hctx->queue_num];
1539 return NOTIFY_OK;
1036} 1540}
1037EXPORT_SYMBOL(blk_mq_init_commands);
1038 1541
1039static void blk_mq_free_hw_commands(struct blk_mq_hw_ctx *hctx, 1542static int blk_mq_hctx_notify(void *data, unsigned long action,
1040 void (*free)(void *, struct blk_mq_hw_ctx *, 1543 unsigned int cpu)
1041 struct request *, unsigned int),
1042 void *data)
1043{ 1544{
1044 unsigned int i; 1545 struct blk_mq_hw_ctx *hctx = data;
1045 1546
1046 for (i = 0; i < hctx->queue_depth; i++) { 1547 if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
1047 struct request *rq = hctx->rqs[i]; 1548 return blk_mq_hctx_cpu_offline(hctx, cpu);
1549 else if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
1550 return blk_mq_hctx_cpu_online(hctx, cpu);
1048 1551
1049 free(data, hctx, rq, i); 1552 return NOTIFY_OK;
1050 }
1051} 1553}
1052 1554
1053void blk_mq_free_commands(struct request_queue *q, 1555static void blk_mq_exit_hw_queues(struct request_queue *q,
1054 void (*free)(void *, struct blk_mq_hw_ctx *, 1556 struct blk_mq_tag_set *set, int nr_queue)
1055 struct request *, unsigned int),
1056 void *data)
1057{ 1557{
1058 struct blk_mq_hw_ctx *hctx; 1558 struct blk_mq_hw_ctx *hctx;
1059 unsigned int i; 1559 unsigned int i;
1060 1560
1061 queue_for_each_hw_ctx(q, hctx, i) 1561 queue_for_each_hw_ctx(q, hctx, i) {
1062 blk_mq_free_hw_commands(hctx, free, data); 1562 if (i == nr_queue)
1063} 1563 break;
1064EXPORT_SYMBOL(blk_mq_free_commands);
1065 1564
1066static void blk_mq_free_rq_map(struct blk_mq_hw_ctx *hctx) 1565 if (set->ops->exit_hctx)
1067{ 1566 set->ops->exit_hctx(hctx, i);
1068 struct page *page;
1069 1567
1070 while (!list_empty(&hctx->page_list)) { 1568 blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
1071 page = list_first_entry(&hctx->page_list, struct page, lru); 1569 kfree(hctx->ctxs);
1072 list_del_init(&page->lru); 1570 blk_mq_free_bitmap(&hctx->ctx_map);
1073 __free_pages(page, page->private);
1074 } 1571 }
1075 1572
1076 kfree(hctx->rqs);
1077
1078 if (hctx->tags)
1079 blk_mq_free_tags(hctx->tags);
1080}
1081
1082static size_t order_to_size(unsigned int order)
1083{
1084 size_t ret = PAGE_SIZE;
1085
1086 while (order--)
1087 ret *= 2;
1088
1089 return ret;
1090} 1573}
1091 1574
1092static int blk_mq_init_rq_map(struct blk_mq_hw_ctx *hctx, 1575static void blk_mq_free_hw_queues(struct request_queue *q,
1093 unsigned int reserved_tags, int node) 1576 struct blk_mq_tag_set *set)
1094{ 1577{
1095 unsigned int i, j, entries_per_page, max_order = 4; 1578 struct blk_mq_hw_ctx *hctx;
1096 size_t rq_size, left; 1579 unsigned int i;
1097
1098 INIT_LIST_HEAD(&hctx->page_list);
1099
1100 hctx->rqs = kmalloc_node(hctx->queue_depth * sizeof(struct request *),
1101 GFP_KERNEL, node);
1102 if (!hctx->rqs)
1103 return -ENOMEM;
1104
1105 /*
1106 * rq_size is the size of the request plus driver payload, rounded
1107 * to the cacheline size
1108 */
1109 rq_size = round_up(sizeof(struct request) + hctx->cmd_size,
1110 cache_line_size());
1111 left = rq_size * hctx->queue_depth;
1112
1113 for (i = 0; i < hctx->queue_depth;) {
1114 int this_order = max_order;
1115 struct page *page;
1116 int to_do;
1117 void *p;
1118
1119 while (left < order_to_size(this_order - 1) && this_order)
1120 this_order--;
1121
1122 do {
1123 page = alloc_pages_node(node, GFP_KERNEL, this_order);
1124 if (page)
1125 break;
1126 if (!this_order--)
1127 break;
1128 if (order_to_size(this_order) < rq_size)
1129 break;
1130 } while (1);
1131
1132 if (!page)
1133 break;
1134
1135 page->private = this_order;
1136 list_add_tail(&page->lru, &hctx->page_list);
1137
1138 p = page_address(page);
1139 entries_per_page = order_to_size(this_order) / rq_size;
1140 to_do = min(entries_per_page, hctx->queue_depth - i);
1141 left -= to_do * rq_size;
1142 for (j = 0; j < to_do; j++) {
1143 hctx->rqs[i] = p;
1144 blk_mq_rq_init(hctx, hctx->rqs[i]);
1145 p += rq_size;
1146 i++;
1147 }
1148 }
1149
1150 if (i < (reserved_tags + BLK_MQ_TAG_MIN))
1151 goto err_rq_map;
1152 else if (i != hctx->queue_depth) {
1153 hctx->queue_depth = i;
1154 pr_warn("%s: queue depth set to %u because of low memory\n",
1155 __func__, i);
1156 }
1157 1580
1158 hctx->tags = blk_mq_init_tags(hctx->queue_depth, reserved_tags, node); 1581 queue_for_each_hw_ctx(q, hctx, i) {
1159 if (!hctx->tags) { 1582 free_cpumask_var(hctx->cpumask);
1160err_rq_map: 1583 kfree(hctx);
1161 blk_mq_free_rq_map(hctx);
1162 return -ENOMEM;
1163 } 1584 }
1164
1165 return 0;
1166} 1585}
1167 1586
1168static int blk_mq_init_hw_queues(struct request_queue *q, 1587static int blk_mq_init_hw_queues(struct request_queue *q,
1169 struct blk_mq_reg *reg, void *driver_data) 1588 struct blk_mq_tag_set *set)
1170{ 1589{
1171 struct blk_mq_hw_ctx *hctx; 1590 struct blk_mq_hw_ctx *hctx;
1172 unsigned int i, j; 1591 unsigned int i;
1173 1592
1174 /* 1593 /*
1175 * Initialize hardware queues 1594 * Initialize hardware queues
1176 */ 1595 */
1177 queue_for_each_hw_ctx(q, hctx, i) { 1596 queue_for_each_hw_ctx(q, hctx, i) {
1178 unsigned int num_maps;
1179 int node; 1597 int node;
1180 1598
1181 node = hctx->numa_node; 1599 node = hctx->numa_node;
1182 if (node == NUMA_NO_NODE) 1600 if (node == NUMA_NO_NODE)
1183 node = hctx->numa_node = reg->numa_node; 1601 node = hctx->numa_node = set->numa_node;
1184 1602
1185 INIT_DELAYED_WORK(&hctx->delayed_work, blk_mq_work_fn); 1603 INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
1604 INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
1186 spin_lock_init(&hctx->lock); 1605 spin_lock_init(&hctx->lock);
1187 INIT_LIST_HEAD(&hctx->dispatch); 1606 INIT_LIST_HEAD(&hctx->dispatch);
1188 hctx->queue = q; 1607 hctx->queue = q;
1189 hctx->queue_num = i; 1608 hctx->queue_num = i;
1190 hctx->flags = reg->flags; 1609 hctx->flags = set->flags;
1191 hctx->queue_depth = reg->queue_depth; 1610 hctx->cmd_size = set->cmd_size;
1192 hctx->cmd_size = reg->cmd_size;
1193 1611
1194 blk_mq_init_cpu_notifier(&hctx->cpu_notifier, 1612 blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
1195 blk_mq_hctx_notify, hctx); 1613 blk_mq_hctx_notify, hctx);
1196 blk_mq_register_cpu_notifier(&hctx->cpu_notifier); 1614 blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
1197 1615
1198 if (blk_mq_init_rq_map(hctx, reg->reserved_tags, node)) 1616 hctx->tags = set->tags[i];
1199 break;
1200 1617
1201 /* 1618 /*
1202 * Allocate space for all possible cpus to avoid allocation in 1619 * Allocate space for all possible cpus to avoid allocation in
@@ -1207,17 +1624,13 @@ static int blk_mq_init_hw_queues(struct request_queue *q,
1207 if (!hctx->ctxs) 1624 if (!hctx->ctxs)
1208 break; 1625 break;
1209 1626
1210 num_maps = ALIGN(nr_cpu_ids, BITS_PER_LONG) / BITS_PER_LONG; 1627 if (blk_mq_alloc_bitmap(&hctx->ctx_map, node))
1211 hctx->ctx_map = kzalloc_node(num_maps * sizeof(unsigned long),
1212 GFP_KERNEL, node);
1213 if (!hctx->ctx_map)
1214 break; 1628 break;
1215 1629
1216 hctx->nr_ctx_map = num_maps;
1217 hctx->nr_ctx = 0; 1630 hctx->nr_ctx = 0;
1218 1631
1219 if (reg->ops->init_hctx && 1632 if (set->ops->init_hctx &&
1220 reg->ops->init_hctx(hctx, driver_data, i)) 1633 set->ops->init_hctx(hctx, set->driver_data, i))
1221 break; 1634 break;
1222 } 1635 }
1223 1636
@@ -1227,17 +1640,7 @@ static int blk_mq_init_hw_queues(struct request_queue *q,
1227 /* 1640 /*
1228 * Init failed 1641 * Init failed
1229 */ 1642 */
1230 queue_for_each_hw_ctx(q, hctx, j) { 1643 blk_mq_exit_hw_queues(q, set, i);
1231 if (i == j)
1232 break;
1233
1234 if (reg->ops->exit_hctx)
1235 reg->ops->exit_hctx(hctx, j);
1236
1237 blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
1238 blk_mq_free_rq_map(hctx);
1239 kfree(hctx->ctxs);
1240 }
1241 1644
1242 return 1; 1645 return 1;
1243} 1646}
@@ -1258,12 +1661,13 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
1258 __ctx->queue = q; 1661 __ctx->queue = q;
1259 1662
1260 /* If the cpu isn't online, the cpu is mapped to first hctx */ 1663 /* If the cpu isn't online, the cpu is mapped to first hctx */
1261 hctx = q->mq_ops->map_queue(q, i);
1262 hctx->nr_ctx++;
1263
1264 if (!cpu_online(i)) 1664 if (!cpu_online(i))
1265 continue; 1665 continue;
1266 1666
1667 hctx = q->mq_ops->map_queue(q, i);
1668 cpumask_set_cpu(i, hctx->cpumask);
1669 hctx->nr_ctx++;
1670
1267 /* 1671 /*
1268 * Set local node, IFF we have more than one hw queue. If 1672 * Set local node, IFF we have more than one hw queue. If
1269 * not, we remain on the home node of the device 1673 * not, we remain on the home node of the device
@@ -1280,6 +1684,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
1280 struct blk_mq_ctx *ctx; 1684 struct blk_mq_ctx *ctx;
1281 1685
1282 queue_for_each_hw_ctx(q, hctx, i) { 1686 queue_for_each_hw_ctx(q, hctx, i) {
1687 cpumask_clear(hctx->cpumask);
1283 hctx->nr_ctx = 0; 1688 hctx->nr_ctx = 0;
1284 } 1689 }
1285 1690
@@ -1288,115 +1693,208 @@ static void blk_mq_map_swqueue(struct request_queue *q)
1288 */ 1693 */
1289 queue_for_each_ctx(q, ctx, i) { 1694 queue_for_each_ctx(q, ctx, i) {
1290 /* If the cpu isn't online, the cpu is mapped to first hctx */ 1695 /* If the cpu isn't online, the cpu is mapped to first hctx */
1696 if (!cpu_online(i))
1697 continue;
1698
1291 hctx = q->mq_ops->map_queue(q, i); 1699 hctx = q->mq_ops->map_queue(q, i);
1700 cpumask_set_cpu(i, hctx->cpumask);
1292 ctx->index_hw = hctx->nr_ctx; 1701 ctx->index_hw = hctx->nr_ctx;
1293 hctx->ctxs[hctx->nr_ctx++] = ctx; 1702 hctx->ctxs[hctx->nr_ctx++] = ctx;
1294 } 1703 }
1704
1705 queue_for_each_hw_ctx(q, hctx, i) {
1706 /*
 1707		 * If no software queues are mapped to this hardware queue,
1708 * disable it and free the request entries
1709 */
1710 if (!hctx->nr_ctx) {
1711 struct blk_mq_tag_set *set = q->tag_set;
1712
1713 if (set->tags[i]) {
1714 blk_mq_free_rq_map(set, set->tags[i], i);
1715 set->tags[i] = NULL;
1716 hctx->tags = NULL;
1717 }
1718 continue;
1719 }
1720
1721 /*
1722 * Initialize batch roundrobin counts
1723 */
1724 hctx->next_cpu = cpumask_first(hctx->cpumask);
1725 hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
1726 }
1295} 1727}
1296 1728
1297struct request_queue *blk_mq_init_queue(struct blk_mq_reg *reg, 1729static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
1298 void *driver_data)
1299{ 1730{
1300 struct blk_mq_hw_ctx **hctxs; 1731 struct blk_mq_hw_ctx *hctx;
1301 struct blk_mq_ctx *ctx;
1302 struct request_queue *q; 1732 struct request_queue *q;
1733 bool shared;
1303 int i; 1734 int i;
1304 1735
1305 if (!reg->nr_hw_queues || 1736 if (set->tag_list.next == set->tag_list.prev)
1306 !reg->ops->queue_rq || !reg->ops->map_queue || 1737 shared = false;
1307 !reg->ops->alloc_hctx || !reg->ops->free_hctx) 1738 else
1308 return ERR_PTR(-EINVAL); 1739 shared = true;
1740
1741 list_for_each_entry(q, &set->tag_list, tag_set_list) {
1742 blk_mq_freeze_queue(q);
1309 1743
1310 if (!reg->queue_depth) 1744 queue_for_each_hw_ctx(q, hctx, i) {
1311 reg->queue_depth = BLK_MQ_MAX_DEPTH; 1745 if (shared)
1312 else if (reg->queue_depth > BLK_MQ_MAX_DEPTH) { 1746 hctx->flags |= BLK_MQ_F_TAG_SHARED;
1313 pr_err("blk-mq: queuedepth too large (%u)\n", reg->queue_depth); 1747 else
1314 reg->queue_depth = BLK_MQ_MAX_DEPTH; 1748 hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
1749 }
1750 blk_mq_unfreeze_queue(q);
1315 } 1751 }
1752}
1316 1753
1317 if (reg->queue_depth < (reg->reserved_tags + BLK_MQ_TAG_MIN)) 1754static void blk_mq_del_queue_tag_set(struct request_queue *q)
1318 return ERR_PTR(-EINVAL); 1755{
1756 struct blk_mq_tag_set *set = q->tag_set;
1757
1758 blk_mq_freeze_queue(q);
1759
1760 mutex_lock(&set->tag_list_lock);
1761 list_del_init(&q->tag_set_list);
1762 blk_mq_update_tag_set_depth(set);
1763 mutex_unlock(&set->tag_list_lock);
1764
1765 blk_mq_unfreeze_queue(q);
1766}
1767
1768static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
1769 struct request_queue *q)
1770{
1771 q->tag_set = set;
1772
1773 mutex_lock(&set->tag_list_lock);
1774 list_add_tail(&q->tag_set_list, &set->tag_list);
1775 blk_mq_update_tag_set_depth(set);
1776 mutex_unlock(&set->tag_list_lock);
1777}
1778
1779struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
1780{
1781 struct blk_mq_hw_ctx **hctxs;
1782 struct blk_mq_ctx *ctx;
1783 struct request_queue *q;
1784 unsigned int *map;
1785 int i;
1319 1786
1320 ctx = alloc_percpu(struct blk_mq_ctx); 1787 ctx = alloc_percpu(struct blk_mq_ctx);
1321 if (!ctx) 1788 if (!ctx)
1322 return ERR_PTR(-ENOMEM); 1789 return ERR_PTR(-ENOMEM);
1323 1790
1324 hctxs = kmalloc_node(reg->nr_hw_queues * sizeof(*hctxs), GFP_KERNEL, 1791 hctxs = kmalloc_node(set->nr_hw_queues * sizeof(*hctxs), GFP_KERNEL,
1325 reg->numa_node); 1792 set->numa_node);
1326 1793
1327 if (!hctxs) 1794 if (!hctxs)
1328 goto err_percpu; 1795 goto err_percpu;
1329 1796
1330 for (i = 0; i < reg->nr_hw_queues; i++) { 1797 map = blk_mq_make_queue_map(set);
1331 hctxs[i] = reg->ops->alloc_hctx(reg, i); 1798 if (!map)
1799 goto err_map;
1800
1801 for (i = 0; i < set->nr_hw_queues; i++) {
1802 int node = blk_mq_hw_queue_to_node(map, i);
1803
1804 hctxs[i] = kzalloc_node(sizeof(struct blk_mq_hw_ctx),
1805 GFP_KERNEL, node);
1332 if (!hctxs[i]) 1806 if (!hctxs[i])
1333 goto err_hctxs; 1807 goto err_hctxs;
1334 1808
1335 hctxs[i]->numa_node = NUMA_NO_NODE; 1809 if (!zalloc_cpumask_var(&hctxs[i]->cpumask, GFP_KERNEL))
1810 goto err_hctxs;
1811
1812 atomic_set(&hctxs[i]->nr_active, 0);
1813 hctxs[i]->numa_node = node;
1336 hctxs[i]->queue_num = i; 1814 hctxs[i]->queue_num = i;
1337 } 1815 }
1338 1816
1339 q = blk_alloc_queue_node(GFP_KERNEL, reg->numa_node); 1817 q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);
1340 if (!q) 1818 if (!q)
1341 goto err_hctxs; 1819 goto err_hctxs;
1342 1820
1343 q->mq_map = blk_mq_make_queue_map(reg); 1821 if (percpu_counter_init(&q->mq_usage_counter, 0))
1344 if (!q->mq_map)
1345 goto err_map; 1822 goto err_map;
1346 1823
1347 setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q); 1824 setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
1348 blk_queue_rq_timeout(q, 30000); 1825 blk_queue_rq_timeout(q, 30000);
1349 1826
1350 q->nr_queues = nr_cpu_ids; 1827 q->nr_queues = nr_cpu_ids;
1351 q->nr_hw_queues = reg->nr_hw_queues; 1828 q->nr_hw_queues = set->nr_hw_queues;
1829 q->mq_map = map;
1352 1830
1353 q->queue_ctx = ctx; 1831 q->queue_ctx = ctx;
1354 q->queue_hw_ctx = hctxs; 1832 q->queue_hw_ctx = hctxs;
1355 1833
1356 q->mq_ops = reg->ops; 1834 q->mq_ops = set->ops;
1357 q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT; 1835 q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
1358 1836
1837 if (!(set->flags & BLK_MQ_F_SG_MERGE))
1838 q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;
1839
1359 q->sg_reserved_size = INT_MAX; 1840 q->sg_reserved_size = INT_MAX;
1360 1841
1361 blk_queue_make_request(q, blk_mq_make_request); 1842 INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
1362 blk_queue_rq_timed_out(q, reg->ops->timeout); 1843 INIT_LIST_HEAD(&q->requeue_list);
1363 if (reg->timeout) 1844 spin_lock_init(&q->requeue_lock);
1364 blk_queue_rq_timeout(q, reg->timeout); 1845
1846 if (q->nr_hw_queues > 1)
1847 blk_queue_make_request(q, blk_mq_make_request);
1848 else
1849 blk_queue_make_request(q, blk_sq_make_request);
1850
1851 blk_queue_rq_timed_out(q, blk_mq_rq_timed_out);
1852 if (set->timeout)
1853 blk_queue_rq_timeout(q, set->timeout);
1854
1855 /*
1856 * Do this after blk_queue_make_request() overrides it...
1857 */
1858 q->nr_requests = set->queue_depth;
1365 1859
1366 if (reg->ops->complete) 1860 if (set->ops->complete)
1367 blk_queue_softirq_done(q, reg->ops->complete); 1861 blk_queue_softirq_done(q, set->ops->complete);
1368 1862
1369 blk_mq_init_flush(q); 1863 blk_mq_init_flush(q);
1370 blk_mq_init_cpu_queues(q, reg->nr_hw_queues); 1864 blk_mq_init_cpu_queues(q, set->nr_hw_queues);
1371 1865
1372 q->flush_rq = kzalloc(round_up(sizeof(struct request) + reg->cmd_size, 1866 q->flush_rq = kzalloc(round_up(sizeof(struct request) +
1373 cache_line_size()), GFP_KERNEL); 1867 set->cmd_size, cache_line_size()),
1868 GFP_KERNEL);
1374 if (!q->flush_rq) 1869 if (!q->flush_rq)
1375 goto err_hw; 1870 goto err_hw;
1376 1871
1377 if (blk_mq_init_hw_queues(q, reg, driver_data)) 1872 if (blk_mq_init_hw_queues(q, set))
1378 goto err_flush_rq; 1873 goto err_flush_rq;
1379 1874
1380 blk_mq_map_swqueue(q);
1381
1382 mutex_lock(&all_q_mutex); 1875 mutex_lock(&all_q_mutex);
1383 list_add_tail(&q->all_q_node, &all_q_list); 1876 list_add_tail(&q->all_q_node, &all_q_list);
1384 mutex_unlock(&all_q_mutex); 1877 mutex_unlock(&all_q_mutex);
1385 1878
1879 blk_mq_add_queue_tag_set(set, q);
1880
1881 blk_mq_map_swqueue(q);
1882
1386 return q; 1883 return q;
1387 1884
1388err_flush_rq: 1885err_flush_rq:
1389 kfree(q->flush_rq); 1886 kfree(q->flush_rq);
1390err_hw: 1887err_hw:
1391 kfree(q->mq_map);
1392err_map:
1393 blk_cleanup_queue(q); 1888 blk_cleanup_queue(q);
1394err_hctxs: 1889err_hctxs:
1395 for (i = 0; i < reg->nr_hw_queues; i++) { 1890 kfree(map);
1891 for (i = 0; i < set->nr_hw_queues; i++) {
1396 if (!hctxs[i]) 1892 if (!hctxs[i])
1397 break; 1893 break;
1398 reg->ops->free_hctx(hctxs[i], i); 1894 free_cpumask_var(hctxs[i]->cpumask);
1895 kfree(hctxs[i]);
1399 } 1896 }
1897err_map:
1400 kfree(hctxs); 1898 kfree(hctxs);
1401err_percpu: 1899err_percpu:
1402 free_percpu(ctx); 1900 free_percpu(ctx);
@@ -1406,18 +1904,14 @@ EXPORT_SYMBOL(blk_mq_init_queue);
1406 1904
1407void blk_mq_free_queue(struct request_queue *q) 1905void blk_mq_free_queue(struct request_queue *q)
1408{ 1906{
1409 struct blk_mq_hw_ctx *hctx; 1907 struct blk_mq_tag_set *set = q->tag_set;
1410 int i;
1411 1908
1412 queue_for_each_hw_ctx(q, hctx, i) { 1909 blk_mq_del_queue_tag_set(q);
1413 kfree(hctx->ctx_map); 1910
1414 kfree(hctx->ctxs); 1911 blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
1415 blk_mq_free_rq_map(hctx); 1912 blk_mq_free_hw_queues(q, set);
1416 blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier); 1913
1417 if (q->mq_ops->exit_hctx) 1914 percpu_counter_destroy(&q->mq_usage_counter);
1418 q->mq_ops->exit_hctx(hctx, i);
1419 q->mq_ops->free_hctx(hctx, i);
1420 }
1421 1915
1422 free_percpu(q->queue_ctx); 1916 free_percpu(q->queue_ctx);
1423 kfree(q->queue_hw_ctx); 1917 kfree(q->queue_hw_ctx);
@@ -1437,6 +1931,8 @@ static void blk_mq_queue_reinit(struct request_queue *q)
1437{ 1931{
1438 blk_mq_freeze_queue(q); 1932 blk_mq_freeze_queue(q);
1439 1933
1934 blk_mq_sysfs_unregister(q);
1935
1440 blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues); 1936 blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
1441 1937
1442 /* 1938 /*
@@ -1447,6 +1943,8 @@ static void blk_mq_queue_reinit(struct request_queue *q)
1447 1943
1448 blk_mq_map_swqueue(q); 1944 blk_mq_map_swqueue(q);
1449 1945
1946 blk_mq_sysfs_register(q);
1947
1450 blk_mq_unfreeze_queue(q); 1948 blk_mq_unfreeze_queue(q);
1451} 1949}
1452 1950
@@ -1456,10 +1954,10 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
1456 struct request_queue *q; 1954 struct request_queue *q;
1457 1955
1458 /* 1956 /*
1459 * Before new mapping is established, hotadded cpu might already start 1957 * Before new mappings are established, hotadded cpu might already
1460 * handling requests. This doesn't break anything as we map offline 1958 * start handling requests. This doesn't break anything as we map
1461 * CPUs to first hardware queue. We will re-init queue below to get 1959 * offline CPUs to first hardware queue. We will re-init the queue
1462 * optimal settings. 1960 * below to get optimal settings.
1463 */ 1961 */
1464 if (action != CPU_DEAD && action != CPU_DEAD_FROZEN && 1962 if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
1465 action != CPU_ONLINE && action != CPU_ONLINE_FROZEN) 1963 action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
@@ -1472,6 +1970,81 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
1472 return NOTIFY_OK; 1970 return NOTIFY_OK;
1473} 1971}
1474 1972
1973int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
1974{
1975 int i;
1976
1977 if (!set->nr_hw_queues)
1978 return -EINVAL;
1979 if (!set->queue_depth || set->queue_depth > BLK_MQ_MAX_DEPTH)
1980 return -EINVAL;
1981 if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
1982 return -EINVAL;
1983
1984 if (!set->nr_hw_queues || !set->ops->queue_rq || !set->ops->map_queue)
1985 return -EINVAL;
1986
1987
1988 set->tags = kmalloc_node(set->nr_hw_queues *
1989 sizeof(struct blk_mq_tags *),
1990 GFP_KERNEL, set->numa_node);
1991 if (!set->tags)
1992 goto out;
1993
1994 for (i = 0; i < set->nr_hw_queues; i++) {
1995 set->tags[i] = blk_mq_init_rq_map(set, i);
1996 if (!set->tags[i])
1997 goto out_unwind;
1998 }
1999
2000 mutex_init(&set->tag_list_lock);
2001 INIT_LIST_HEAD(&set->tag_list);
2002
2003 return 0;
2004
2005out_unwind:
2006 while (--i >= 0)
2007 blk_mq_free_rq_map(set, set->tags[i], i);
2008out:
2009 return -ENOMEM;
2010}
2011EXPORT_SYMBOL(blk_mq_alloc_tag_set);
2012
2013void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
2014{
2015 int i;
2016
2017 for (i = 0; i < set->nr_hw_queues; i++) {
2018 if (set->tags[i])
2019 blk_mq_free_rq_map(set, set->tags[i], i);
2020 }
2021
2022 kfree(set->tags);
2023}
2024EXPORT_SYMBOL(blk_mq_free_tag_set);
2025
2026int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
2027{
2028 struct blk_mq_tag_set *set = q->tag_set;
2029 struct blk_mq_hw_ctx *hctx;
2030 int i, ret;
2031
2032 if (!set || nr > set->queue_depth)
2033 return -EINVAL;
2034
2035 ret = 0;
2036 queue_for_each_hw_ctx(q, hctx, i) {
2037 ret = blk_mq_tag_update_depth(hctx->tags, nr);
2038 if (ret)
2039 break;
2040 }
2041
2042 if (!ret)
2043 q->nr_requests = nr;
2044
2045 return ret;
2046}
2047
1475void blk_mq_disable_hotplug(void) 2048void blk_mq_disable_hotplug(void)
1476{ 2049{
1477 mutex_lock(&all_q_mutex); 2050 mutex_lock(&all_q_mutex);
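
The blk-mq.c changes above replace the old blk_mq_reg registration with a driver-owned struct blk_mq_tag_set that is allocated once and shared by every queue built on it. The following is a minimal, hypothetical driver-side sketch of the resulting flow; my_queue_rq, my_mq_ops and the chosen depth/cmd_size values are illustrative only, while the tag-set fields and the blk_mq_alloc_tag_set()/blk_mq_init_queue()/blk_mq_free_tag_set() calls are the ones added in this diff.

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* Hypothetical dispatch hook: a real driver would start the hardware command here. */
static int my_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	blk_mq_end_io(rq, 0);			/* complete immediately for this sketch */
	return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
	.map_queue	= blk_mq_map_queue,	/* default CPU -> hctx mapping */
};

static struct blk_mq_tag_set my_set;		/* must outlive the queue(s) built on it */
static struct request_queue *my_q;

static int my_driver_init(void)
{
	int ret;

	my_set.ops		= &my_mq_ops;
	my_set.nr_hw_queues	= 1;		/* single queue: blk_sq_make_request() path */
	my_set.queue_depth	= 64;
	my_set.reserved_tags	= 0;
	my_set.cmd_size		= 64;		/* per-request driver payload, packed after struct request */
	my_set.numa_node	= NUMA_NO_NODE;
	my_set.flags		= BLK_MQ_F_SHOULD_MERGE;

	ret = blk_mq_alloc_tag_set(&my_set);	/* allocates set->tags[] and the request maps */
	if (ret)
		return ret;

	my_q = blk_mq_init_queue(&my_set);	/* builds hctxs and links the queue into set->tag_list */
	if (IS_ERR(my_q)) {
		blk_mq_free_tag_set(&my_set);
		return PTR_ERR(my_q);
	}
	return 0;
}

/* Teardown would be blk_cleanup_queue(my_q) followed by blk_mq_free_tag_set(&my_set). */
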
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ebbe6bac9d61..de7b3bbd5bd6 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -1,6 +1,8 @@
1#ifndef INT_BLK_MQ_H 1#ifndef INT_BLK_MQ_H
2#define INT_BLK_MQ_H 2#define INT_BLK_MQ_H
3 3
4struct blk_mq_tag_set;
5
4struct blk_mq_ctx { 6struct blk_mq_ctx {
5 struct { 7 struct {
6 spinlock_t lock; 8 spinlock_t lock;
@@ -9,7 +11,8 @@ struct blk_mq_ctx {
9 11
10 unsigned int cpu; 12 unsigned int cpu;
11 unsigned int index_hw; 13 unsigned int index_hw;
12 unsigned int ipi_redirect; 14
15 unsigned int last_tag ____cacheline_aligned_in_smp;
13 16
14 /* incremented at dispatch time */ 17 /* incremented at dispatch time */
15 unsigned long rq_dispatched[2]; 18 unsigned long rq_dispatched[2];
@@ -20,21 +23,23 @@ struct blk_mq_ctx {
20 23
21 struct request_queue *queue; 24 struct request_queue *queue;
22 struct kobject kobj; 25 struct kobject kobj;
23}; 26} ____cacheline_aligned_in_smp;
24 27
25void __blk_mq_complete_request(struct request *rq); 28void __blk_mq_complete_request(struct request *rq);
26void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); 29void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
27void blk_mq_init_flush(struct request_queue *q); 30void blk_mq_init_flush(struct request_queue *q);
28void blk_mq_drain_queue(struct request_queue *q); 31void blk_mq_drain_queue(struct request_queue *q);
29void blk_mq_free_queue(struct request_queue *q); 32void blk_mq_free_queue(struct request_queue *q);
30void blk_mq_rq_init(struct blk_mq_hw_ctx *hctx, struct request *rq); 33void blk_mq_clone_flush_request(struct request *flush_rq,
34 struct request *orig_rq);
35int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
31 36
32/* 37/*
33 * CPU hotplug helpers 38 * CPU hotplug helpers
34 */ 39 */
35struct blk_mq_cpu_notifier; 40struct blk_mq_cpu_notifier;
36void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, 41void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
37 void (*fn)(void *, unsigned long, unsigned int), 42 int (*fn)(void *, unsigned long, unsigned int),
38 void *data); 43 void *data);
39void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier); 44void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier);
40void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier); 45void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier);
@@ -45,10 +50,23 @@ void blk_mq_disable_hotplug(void);
45/* 50/*
46 * CPU -> queue mappings 51 * CPU -> queue mappings
47 */ 52 */
48struct blk_mq_reg; 53extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
49extern unsigned int *blk_mq_make_queue_map(struct blk_mq_reg *reg);
50extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues); 54extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
55extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
51 56
52void blk_mq_add_timer(struct request *rq); 57/*
58 * sysfs helpers
59 */
60extern int blk_mq_sysfs_register(struct request_queue *q);
61extern void blk_mq_sysfs_unregister(struct request_queue *q);
62
63/*
64 * Basic implementation of sparser bitmap, allowing the user to spread
65 * the bits over more cachelines.
66 */
67struct blk_align_bitmap {
68 unsigned long word;
69 unsigned long depth;
70} ____cacheline_aligned_in_smp;
53 71
54#endif 72#endif
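
The blk_align_bitmap added here backs the reworked hctx->ctx_map, allocated by blk_mq_alloc_bitmap() earlier in this diff. A rough worked example of the layout, not taken from the patch and using assumed numbers (nr_cpu_ids == 20, default bits_per_word == 8):

/*
 * Illustration only (values assumed, not from the patch):
 *
 *	num_maps = ALIGN(20, 8) / 8 = 3 words
 *	map[0].depth = 8, map[1].depth = 8, map[2].depth = 4
 *
 * Each blk_align_bitmap word sits in its own cacheline
 * (____cacheline_aligned_in_smp), so software queues setting their pending
 * bits in different words no longer contend on the same cacheline the way
 * the old densely packed unsigned long ctx_map[] used by
 * clear_bit(ctx->index_hw, hctx->ctx_map) did.
 */
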
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7500f876dae4..23321fbab293 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -48,11 +48,10 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
48static ssize_t 48static ssize_t
49queue_requests_store(struct request_queue *q, const char *page, size_t count) 49queue_requests_store(struct request_queue *q, const char *page, size_t count)
50{ 50{
51 struct request_list *rl;
52 unsigned long nr; 51 unsigned long nr;
53 int ret; 52 int ret, err;
54 53
55 if (!q->request_fn) 54 if (!q->request_fn && !q->mq_ops)
56 return -EINVAL; 55 return -EINVAL;
57 56
58 ret = queue_var_store(&nr, page, count); 57 ret = queue_var_store(&nr, page, count);
@@ -62,40 +61,14 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
62 if (nr < BLKDEV_MIN_RQ) 61 if (nr < BLKDEV_MIN_RQ)
63 nr = BLKDEV_MIN_RQ; 62 nr = BLKDEV_MIN_RQ;
64 63
65 spin_lock_irq(q->queue_lock); 64 if (q->request_fn)
66 q->nr_requests = nr; 65 err = blk_update_nr_requests(q, nr);
67 blk_queue_congestion_threshold(q); 66 else
68 67 err = blk_mq_update_nr_requests(q, nr);
69 /* congestion isn't cgroup aware and follows root blkcg for now */ 68
70 rl = &q->root_rl; 69 if (err)
71 70 return err;
72 if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
73 blk_set_queue_congested(q, BLK_RW_SYNC);
74 else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
75 blk_clear_queue_congested(q, BLK_RW_SYNC);
76
77 if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
78 blk_set_queue_congested(q, BLK_RW_ASYNC);
79 else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
80 blk_clear_queue_congested(q, BLK_RW_ASYNC);
81
82 blk_queue_for_each_rl(rl, q) {
83 if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
84 blk_set_rl_full(rl, BLK_RW_SYNC);
85 } else {
86 blk_clear_rl_full(rl, BLK_RW_SYNC);
87 wake_up(&rl->wait[BLK_RW_SYNC]);
88 }
89
90 if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
91 blk_set_rl_full(rl, BLK_RW_ASYNC);
92 } else {
93 blk_clear_rl_full(rl, BLK_RW_ASYNC);
94 wake_up(&rl->wait[BLK_RW_ASYNC]);
95 }
96 }
97 71
98 spin_unlock_irq(q->queue_lock);
99 return ret; 72 return ret;
100} 73}
101 74
@@ -544,8 +517,6 @@ static void blk_release_queue(struct kobject *kobj)
544 if (q->queue_tags) 517 if (q->queue_tags)
545 __blk_queue_free_tags(q); 518 __blk_queue_free_tags(q);
546 519
547 percpu_counter_destroy(&q->mq_usage_counter);
548
549 if (q->mq_ops) 520 if (q->mq_ops)
550 blk_mq_free_queue(q); 521 blk_mq_free_queue(q);
551 522
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 033745cd7fba..9353b4683359 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -744,7 +744,7 @@ static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
744static bool throtl_slice_used(struct throtl_grp *tg, bool rw) 744static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
745{ 745{
746 if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw])) 746 if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
747 return 0; 747 return false;
748 748
749 return 1; 749 return 1;
750} 750}
@@ -842,7 +842,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
842 if (tg->io_disp[rw] + 1 <= io_allowed) { 842 if (tg->io_disp[rw] + 1 <= io_allowed) {
843 if (wait) 843 if (wait)
844 *wait = 0; 844 *wait = 0;
845 return 1; 845 return true;
846 } 846 }
847 847
848 /* Calc approx time to dispatch */ 848 /* Calc approx time to dispatch */
@@ -880,7 +880,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
880 if (tg->bytes_disp[rw] + bio->bi_iter.bi_size <= bytes_allowed) { 880 if (tg->bytes_disp[rw] + bio->bi_iter.bi_size <= bytes_allowed) {
881 if (wait) 881 if (wait)
882 *wait = 0; 882 *wait = 0;
883 return 1; 883 return true;
884 } 884 }
885 885
886 /* Calc approx time to dispatch */ 886 /* Calc approx time to dispatch */
@@ -923,7 +923,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
923 if (tg->bps[rw] == -1 && tg->iops[rw] == -1) { 923 if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
924 if (wait) 924 if (wait)
925 *wait = 0; 925 *wait = 0;
926 return 1; 926 return true;
927 } 927 }
928 928
929 /* 929 /*
@@ -1258,7 +1258,7 @@ out_unlock:
1258 * of throtl_data->service_queue. Those bio's are ready and issued by this 1258 * of throtl_data->service_queue. Those bio's are ready and issued by this
1259 * function. 1259 * function.
1260 */ 1260 */
1261void blk_throtl_dispatch_work_fn(struct work_struct *work) 1261static void blk_throtl_dispatch_work_fn(struct work_struct *work)
1262{ 1262{
1263 struct throtl_data *td = container_of(work, struct throtl_data, 1263 struct throtl_data *td = container_of(work, struct throtl_data,
1264 dispatch_work); 1264 dispatch_work);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index d96f7061c6fd..95a09590ccfd 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -96,11 +96,7 @@ static void blk_rq_timed_out(struct request *req)
96 __blk_complete_request(req); 96 __blk_complete_request(req);
97 break; 97 break;
98 case BLK_EH_RESET_TIMER: 98 case BLK_EH_RESET_TIMER:
99 if (q->mq_ops) 99 blk_add_timer(req);
100 blk_mq_add_timer(req);
101 else
102 blk_add_timer(req);
103
104 blk_clear_rq_complete(req); 100 blk_clear_rq_complete(req);
105 break; 101 break;
106 case BLK_EH_NOT_HANDLED: 102 case BLK_EH_NOT_HANDLED:
@@ -170,7 +166,26 @@ void blk_abort_request(struct request *req)
170} 166}
171EXPORT_SYMBOL_GPL(blk_abort_request); 167EXPORT_SYMBOL_GPL(blk_abort_request);
172 168
173void __blk_add_timer(struct request *req, struct list_head *timeout_list) 169unsigned long blk_rq_timeout(unsigned long timeout)
170{
171 unsigned long maxt;
172
173 maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
174 if (time_after(timeout, maxt))
175 timeout = maxt;
176
177 return timeout;
178}
179
180/**
181 * blk_add_timer - Start timeout timer for a single request
182 * @req: request that is about to start running.
183 *
184 * Notes:
185 * Each request has its own timer, and as it is added to the queue, we
186 * set up the timer. When the request completes, we cancel the timer.
187 */
188void blk_add_timer(struct request *req)
174{ 189{
175 struct request_queue *q = req->q; 190 struct request_queue *q = req->q;
176 unsigned long expiry; 191 unsigned long expiry;
@@ -188,32 +203,29 @@ void __blk_add_timer(struct request *req, struct list_head *timeout_list)
188 req->timeout = q->rq_timeout; 203 req->timeout = q->rq_timeout;
189 204
190 req->deadline = jiffies + req->timeout; 205 req->deadline = jiffies + req->timeout;
191 if (timeout_list) 206 if (!q->mq_ops)
192 list_add_tail(&req->timeout_list, timeout_list); 207 list_add_tail(&req->timeout_list, &req->q->timeout_list);
193 208
194 /* 209 /*
195 * If the timer isn't already pending or this timeout is earlier 210 * If the timer isn't already pending or this timeout is earlier
196 * than an existing one, modify the timer. Round up to next nearest 211 * than an existing one, modify the timer. Round up to next nearest
197 * second. 212 * second.
198 */ 213 */
199 expiry = round_jiffies_up(req->deadline); 214 expiry = blk_rq_timeout(round_jiffies_up(req->deadline));
200 215
201 if (!timer_pending(&q->timeout) || 216 if (!timer_pending(&q->timeout) ||
202 time_before(expiry, q->timeout.expires)) 217 time_before(expiry, q->timeout.expires)) {
203 mod_timer(&q->timeout, expiry); 218 unsigned long diff = q->timeout.expires - expiry;
204 219
205} 220 /*
221 * Due to added timer slack to group timers, the timer
222 * will often be a little in front of what we asked for.
223 * So apply some tolerance here too, otherwise we keep
224 * modifying the timer because expires for value X
225 * will be X + something.
226 */
227 if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
228 mod_timer(&q->timeout, expiry);
229 }
206 230
207/**
208 * blk_add_timer - Start timeout timer for a single request
209 * @req: request that is about to start running.
210 *
211 * Notes:
212 * Each request has its own timer, and as it is added to the queue, we
213 * set up the timer. When the request completes, we cancel the timer.
214 */
215void blk_add_timer(struct request *req)
216{
217 __blk_add_timer(req, &req->q->timeout_list);
218} 231}
219
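
blk_add_timer() now also drives blk-mq timeouts, and the new blk_rq_timeout() clamp (BLK_MAX_TIMEOUT, defined as 5 * HZ in the blk.h hunk below) bounds how far out the queue timer is armed. A rough worked example with assumed values (HZ == 1000, the 30-second default set in blk_mq_init_queue above):

/*
 * Illustration only (assumed values, not from the patch):
 *
 *	req->deadline = jiffies + 30 * HZ;
 *	expiry = blk_rq_timeout(round_jiffies_up(req->deadline))
 *	       = min(rounded deadline, round_jiffies_up(jiffies + 5 * HZ))
 *	       ~ jiffies + 5 * HZ
 *
 * so the queue timer fires within roughly five seconds even though the
 * request deadline is 30 seconds out, and an already pending q->timeout is
 * only re-armed when it lies at least HZ / 2 beyond the new expiry, which
 * keeps ordinary timer slack from forcing a mod_timer() on every submission.
 */
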
diff --git a/block/blk.h b/block/blk.h
index 1d880f1f957f..45385e9abf6f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -9,6 +9,9 @@
9/* Number of requests a "batching" process may submit */ 9/* Number of requests a "batching" process may submit */
10#define BLK_BATCH_REQ 32 10#define BLK_BATCH_REQ 32
11 11
12/* Max future timer expiry for timeouts */
13#define BLK_MAX_TIMEOUT (5 * HZ)
14
12extern struct kmem_cache *blk_requestq_cachep; 15extern struct kmem_cache *blk_requestq_cachep;
13extern struct kmem_cache *request_cachep; 16extern struct kmem_cache *request_cachep;
14extern struct kobj_type blk_queue_ktype; 17extern struct kobj_type blk_queue_ktype;
@@ -37,9 +40,9 @@ bool __blk_end_bidi_request(struct request *rq, int error,
37void blk_rq_timed_out_timer(unsigned long data); 40void blk_rq_timed_out_timer(unsigned long data);
38void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout, 41void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout,
39 unsigned int *next_set); 42 unsigned int *next_set);
40void __blk_add_timer(struct request *req, struct list_head *timeout_list); 43unsigned long blk_rq_timeout(unsigned long timeout);
44void blk_add_timer(struct request *req);
41void blk_delete_timer(struct request *); 45void blk_delete_timer(struct request *);
42void blk_add_timer(struct request *);
43 46
44 47
45bool bio_attempt_front_merge(struct request_queue *q, struct request *req, 48bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
@@ -185,6 +188,8 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)
185 return q->nr_congestion_off; 188 return q->nr_congestion_off;
186} 189}
187 190
191extern int blk_update_nr_requests(struct request_queue *, unsigned int);
192
188/* 193/*
 189 * Contribute to IO statistics IFF: 194 * Contribute to IO statistics IFF:
190 * 195 *
diff --git a/block/bounce.c b/block/bounce.c
new file mode 100644
index 000000000000..523918b8c6dc
--- /dev/null
+++ b/block/bounce.c
@@ -0,0 +1,287 @@
1/* bounce buffer handling for block devices
2 *
3 * - Split from highmem.c
4 */
5
6#include <linux/mm.h>
7#include <linux/export.h>
8#include <linux/swap.h>
9#include <linux/gfp.h>
10#include <linux/bio.h>
11#include <linux/pagemap.h>
12#include <linux/mempool.h>
13#include <linux/blkdev.h>
14#include <linux/init.h>
15#include <linux/hash.h>
16#include <linux/highmem.h>
17#include <linux/bootmem.h>
18#include <asm/tlbflush.h>
19
20#include <trace/events/block.h>
21
22#define POOL_SIZE 64
23#define ISA_POOL_SIZE 16
24
25static mempool_t *page_pool, *isa_page_pool;
26
27#if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
28static __init int init_emergency_pool(void)
29{
30#if defined(CONFIG_HIGHMEM) && !defined(CONFIG_MEMORY_HOTPLUG)
31 if (max_pfn <= max_low_pfn)
32 return 0;
33#endif
34
35 page_pool = mempool_create_page_pool(POOL_SIZE, 0);
36 BUG_ON(!page_pool);
37 printk("bounce pool size: %d pages\n", POOL_SIZE);
38
39 return 0;
40}
41
42__initcall(init_emergency_pool);
43#endif
44
45#ifdef CONFIG_HIGHMEM
46/*
47 * highmem version, map in to vec
48 */
49static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
50{
51 unsigned long flags;
52 unsigned char *vto;
53
54 local_irq_save(flags);
55 vto = kmap_atomic(to->bv_page);
56 memcpy(vto + to->bv_offset, vfrom, to->bv_len);
57 kunmap_atomic(vto);
58 local_irq_restore(flags);
59}
60
61#else /* CONFIG_HIGHMEM */
62
63#define bounce_copy_vec(to, vfrom) \
64 memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
65
66#endif /* CONFIG_HIGHMEM */
67
68/*
69 * allocate pages in the DMA region for the ISA pool
70 */
71static void *mempool_alloc_pages_isa(gfp_t gfp_mask, void *data)
72{
73 return mempool_alloc_pages(gfp_mask | GFP_DMA, data);
74}
75
76/*
77 * gets called "every" time someone init's a queue with BLK_BOUNCE_ISA
78 * as the max address, so check if the pool has already been created.
79 */
80int init_emergency_isa_pool(void)
81{
82 if (isa_page_pool)
83 return 0;
84
85 isa_page_pool = mempool_create(ISA_POOL_SIZE, mempool_alloc_pages_isa,
86 mempool_free_pages, (void *) 0);
87 BUG_ON(!isa_page_pool);
88
89 printk("isa bounce pool size: %d pages\n", ISA_POOL_SIZE);
90 return 0;
91}
92
93/*
94 * Simple bounce buffer support for highmem pages. Depending on the
95 * queue gfp mask set, *to may or may not be a highmem page. kmap it
96 * always, it will do the Right Thing
97 */
98static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
99{
100 unsigned char *vfrom;
101 struct bio_vec tovec, *fromvec = from->bi_io_vec;
102 struct bvec_iter iter;
103
104 bio_for_each_segment(tovec, to, iter) {
105 if (tovec.bv_page != fromvec->bv_page) {
106 /*
107 * fromvec->bv_offset and fromvec->bv_len might have
108 * been modified by the block layer, so use the original
109 * copy, bounce_copy_vec already uses tovec->bv_len
110 */
111 vfrom = page_address(fromvec->bv_page) +
112 tovec.bv_offset;
113
114 bounce_copy_vec(&tovec, vfrom);
115 flush_dcache_page(tovec.bv_page);
116 }
117
118 fromvec++;
119 }
120}
121
122static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
123{
124 struct bio *bio_orig = bio->bi_private;
125 struct bio_vec *bvec, *org_vec;
126 int i;
127
128 if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
129 set_bit(BIO_EOPNOTSUPP, &bio_orig->bi_flags);
130
131 /*
132 * free up bounce indirect pages used
133 */
134 bio_for_each_segment_all(bvec, bio, i) {
135 org_vec = bio_orig->bi_io_vec + i;
136 if (bvec->bv_page == org_vec->bv_page)
137 continue;
138
139 dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
140 mempool_free(bvec->bv_page, pool);
141 }
142
143 bio_endio(bio_orig, err);
144 bio_put(bio);
145}
146
147static void bounce_end_io_write(struct bio *bio, int err)
148{
149 bounce_end_io(bio, page_pool, err);
150}
151
152static void bounce_end_io_write_isa(struct bio *bio, int err)
153{
154
155 bounce_end_io(bio, isa_page_pool, err);
156}
157
158static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err)
159{
160 struct bio *bio_orig = bio->bi_private;
161
162 if (test_bit(BIO_UPTODATE, &bio->bi_flags))
163 copy_to_high_bio_irq(bio_orig, bio);
164
165 bounce_end_io(bio, pool, err);
166}
167
168static void bounce_end_io_read(struct bio *bio, int err)
169{
170 __bounce_end_io_read(bio, page_pool, err);
171}
172
173static void bounce_end_io_read_isa(struct bio *bio, int err)
174{
175 __bounce_end_io_read(bio, isa_page_pool, err);
176}
177
178#ifdef CONFIG_NEED_BOUNCE_POOL
179static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
180{
181 if (bio_data_dir(bio) != WRITE)
182 return 0;
183
184 if (!bdi_cap_stable_pages_required(&q->backing_dev_info))
185 return 0;
186
187 return test_bit(BIO_SNAP_STABLE, &bio->bi_flags);
188}
189#else
190static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
191{
192 return 0;
193}
194#endif /* CONFIG_NEED_BOUNCE_POOL */
195
196static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
197 mempool_t *pool, int force)
198{
199 struct bio *bio;
200 int rw = bio_data_dir(*bio_orig);
201 struct bio_vec *to, from;
202 struct bvec_iter iter;
203 unsigned i;
204
205 if (force)
206 goto bounce;
207 bio_for_each_segment(from, *bio_orig, iter)
208 if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
209 goto bounce;
210
211 return;
212bounce:
213 bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
214
215 bio_for_each_segment_all(to, bio, i) {
216 struct page *page = to->bv_page;
217
218 if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
219 continue;
220
221 inc_zone_page_state(to->bv_page, NR_BOUNCE);
222 to->bv_page = mempool_alloc(pool, q->bounce_gfp);
223
224 if (rw == WRITE) {
225 char *vto, *vfrom;
226
227 flush_dcache_page(page);
228
229 vto = page_address(to->bv_page) + to->bv_offset;
230 vfrom = kmap_atomic(page) + to->bv_offset;
231 memcpy(vto, vfrom, to->bv_len);
232 kunmap_atomic(vfrom);
233 }
234 }
235
236 trace_block_bio_bounce(q, *bio_orig);
237
238 bio->bi_flags |= (1 << BIO_BOUNCED);
239
240 if (pool == page_pool) {
241 bio->bi_end_io = bounce_end_io_write;
242 if (rw == READ)
243 bio->bi_end_io = bounce_end_io_read;
244 } else {
245 bio->bi_end_io = bounce_end_io_write_isa;
246 if (rw == READ)
247 bio->bi_end_io = bounce_end_io_read_isa;
248 }
249
250 bio->bi_private = *bio_orig;
251 *bio_orig = bio;
252}
253
254void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
255{
256 int must_bounce;
257 mempool_t *pool;
258
259 /*
260 * Data-less bio, nothing to bounce
261 */
262 if (!bio_has_data(*bio_orig))
263 return;
264
265 must_bounce = must_snapshot_stable_pages(q, *bio_orig);
266
267 /*
268 * for non-isa bounce case, just check if the bounce pfn is equal
269 * to or bigger than the highest pfn in the system -- in that case,
270 * don't waste time iterating over bio segments
271 */
272 if (!(q->bounce_gfp & GFP_DMA)) {
273 if (queue_bounce_pfn(q) >= blk_max_pfn && !must_bounce)
274 return;
275 pool = page_pool;
276 } else {
277 BUG_ON(!isa_page_pool);
278 pool = isa_page_pool;
279 }
280
281 /*
282 * slow path
283 */
284 __blk_queue_bounce(q, bio_orig, pool, must_bounce);
285}
286
287EXPORT_SYMBOL(blk_queue_bounce);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e0985f1955e7..22dffebc7c73 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -908,7 +908,7 @@ static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
908{ 908{
909 if (cfqd->busy_queues) { 909 if (cfqd->busy_queues) {
910 cfq_log(cfqd, "schedule dispatch"); 910 cfq_log(cfqd, "schedule dispatch");
911 kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work); 911 kblockd_schedule_work(&cfqd->unplug_work);
912 } 912 }
913} 913}
914 914
@@ -4460,7 +4460,7 @@ out_free:
4460static ssize_t 4460static ssize_t
4461cfq_var_show(unsigned int var, char *page) 4461cfq_var_show(unsigned int var, char *page)
4462{ 4462{
4463 return sprintf(page, "%d\n", var); 4463 return sprintf(page, "%u\n", var);
4464} 4464}
4465 4465
4466static ssize_t 4466static ssize_t
diff --git a/block/ioprio.c b/block/ioprio.c
new file mode 100644
index 000000000000..e50170ca7c33
--- /dev/null
+++ b/block/ioprio.c
@@ -0,0 +1,241 @@
1/*
2 * fs/ioprio.c
3 *
4 * Copyright (C) 2004 Jens Axboe <axboe@kernel.dk>
5 *
6 * Helper functions for setting/querying io priorities of processes. The
 7 * system calls closely mimic getpriority/setpriority, see the man page for
8 * those. The prio argument is a composite of prio class and prio data, where
9 * the data argument has meaning within that class. The standard scheduling
10 * classes have 8 distinct prio levels, with 0 being the highest prio and 7
11 * being the lowest.
12 *
13 * IOW, setting BE scheduling class with prio 2 is done ala:
14 *
15 * unsigned int prio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | 2;
16 *
17 * ioprio_set(PRIO_PROCESS, pid, prio);
18 *
19 * See also Documentation/block/ioprio.txt
20 *
21 */
22#include <linux/gfp.h>
23#include <linux/kernel.h>
24#include <linux/export.h>
25#include <linux/ioprio.h>
26#include <linux/blkdev.h>
27#include <linux/capability.h>
28#include <linux/syscalls.h>
29#include <linux/security.h>
30#include <linux/pid_namespace.h>
31
32int set_task_ioprio(struct task_struct *task, int ioprio)
33{
34 int err;
35 struct io_context *ioc;
36 const struct cred *cred = current_cred(), *tcred;
37
38 rcu_read_lock();
39 tcred = __task_cred(task);
40 if (!uid_eq(tcred->uid, cred->euid) &&
41 !uid_eq(tcred->uid, cred->uid) && !capable(CAP_SYS_NICE)) {
42 rcu_read_unlock();
43 return -EPERM;
44 }
45 rcu_read_unlock();
46
47 err = security_task_setioprio(task, ioprio);
48 if (err)
49 return err;
50
51 ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
52 if (ioc) {
53 ioc->ioprio = ioprio;
54 put_io_context(ioc);
55 }
56
57 return err;
58}
59EXPORT_SYMBOL_GPL(set_task_ioprio);
60
61SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
62{
63 int class = IOPRIO_PRIO_CLASS(ioprio);
64 int data = IOPRIO_PRIO_DATA(ioprio);
65 struct task_struct *p, *g;
66 struct user_struct *user;
67 struct pid *pgrp;
68 kuid_t uid;
69 int ret;
70
71 switch (class) {
72 case IOPRIO_CLASS_RT:
73 if (!capable(CAP_SYS_ADMIN))
74 return -EPERM;
75 /* fall through, rt has prio field too */
76 case IOPRIO_CLASS_BE:
77 if (data >= IOPRIO_BE_NR || data < 0)
78 return -EINVAL;
79
80 break;
81 case IOPRIO_CLASS_IDLE:
82 break;
83 case IOPRIO_CLASS_NONE:
84 if (data)
85 return -EINVAL;
86 break;
87 default:
88 return -EINVAL;
89 }
90
91 ret = -ESRCH;
92 rcu_read_lock();
93 switch (which) {
94 case IOPRIO_WHO_PROCESS:
95 if (!who)
96 p = current;
97 else
98 p = find_task_by_vpid(who);
99 if (p)
100 ret = set_task_ioprio(p, ioprio);
101 break;
102 case IOPRIO_WHO_PGRP:
103 if (!who)
104 pgrp = task_pgrp(current);
105 else
106 pgrp = find_vpid(who);
107 do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
108 ret = set_task_ioprio(p, ioprio);
109 if (ret)
110 break;
111 } while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
112 break;
113 case IOPRIO_WHO_USER:
114 uid = make_kuid(current_user_ns(), who);
115 if (!uid_valid(uid))
116 break;
117 if (!who)
118 user = current_user();
119 else
120 user = find_user(uid);
121
122 if (!user)
123 break;
124
125 do_each_thread(g, p) {
126 if (!uid_eq(task_uid(p), uid))
127 continue;
128 ret = set_task_ioprio(p, ioprio);
129 if (ret)
130 goto free_uid;
131 } while_each_thread(g, p);
132free_uid:
133 if (who)
134 free_uid(user);
135 break;
136 default:
137 ret = -EINVAL;
138 }
139
140 rcu_read_unlock();
141 return ret;
142}
143
144static int get_task_ioprio(struct task_struct *p)
145{
146 int ret;
147
148 ret = security_task_getioprio(p);
149 if (ret)
150 goto out;
151 ret = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, IOPRIO_NORM);
152 if (p->io_context)
153 ret = p->io_context->ioprio;
154out:
155 return ret;
156}
157
158int ioprio_best(unsigned short aprio, unsigned short bprio)
159{
160 unsigned short aclass = IOPRIO_PRIO_CLASS(aprio);
161 unsigned short bclass = IOPRIO_PRIO_CLASS(bprio);
162
163 if (aclass == IOPRIO_CLASS_NONE)
164 aclass = IOPRIO_CLASS_BE;
165 if (bclass == IOPRIO_CLASS_NONE)
166 bclass = IOPRIO_CLASS_BE;
167
168 if (aclass == bclass)
169 return min(aprio, bprio);
170 if (aclass > bclass)
171 return bprio;
172 else
173 return aprio;
174}
175
176SYSCALL_DEFINE2(ioprio_get, int, which, int, who)
177{
178 struct task_struct *g, *p;
179 struct user_struct *user;
180 struct pid *pgrp;
181 kuid_t uid;
182 int ret = -ESRCH;
183 int tmpio;
184
185 rcu_read_lock();
186 switch (which) {
187 case IOPRIO_WHO_PROCESS:
188 if (!who)
189 p = current;
190 else
191 p = find_task_by_vpid(who);
192 if (p)
193 ret = get_task_ioprio(p);
194 break;
195 case IOPRIO_WHO_PGRP:
196 if (!who)
197 pgrp = task_pgrp(current);
198 else
199 pgrp = find_vpid(who);
200 do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
201 tmpio = get_task_ioprio(p);
202 if (tmpio < 0)
203 continue;
204 if (ret == -ESRCH)
205 ret = tmpio;
206 else
207 ret = ioprio_best(ret, tmpio);
208 } while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
209 break;
210 case IOPRIO_WHO_USER:
211 uid = make_kuid(current_user_ns(), who);
212 if (!who)
213 user = current_user();
214 else
215 user = find_user(uid);
216
217 if (!user)
218 break;
219
220 do_each_thread(g, p) {
221 if (!uid_eq(task_uid(p), user->uid))
222 continue;
223 tmpio = get_task_ioprio(p);
224 if (tmpio < 0)
225 continue;
226 if (ret == -ESRCH)
227 ret = tmpio;
228 else
229 ret = ioprio_best(ret, tmpio);
230 } while_each_thread(g, p);
231
232 if (who)
233 free_uid(user);
234 break;
235 default:
236 ret = -EINVAL;
237 }
238
239 rcu_read_unlock();
240 return ret;
241}
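
The header comment of this new file describes how userspace composes a priority value from a class and a per-class level. Below is a userspace sketch of that usage, assuming the constants are copied by hand to mirror include/linux/ioprio.h (no uapi header exports them at this point) and using raw syscall(2), since glibc ships no ioprio_set()/ioprio_get() wrappers; this follows the approach of the sample in Documentation/block/ioprio.txt.

/* set the calling process to best-effort class, level 2, then read it back */
#define _GNU_SOURCE             /* for syscall() */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* mirrored from the kernel's ioprio.h; not provided by libc headers */
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(mask)         ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)          ((mask) & ((1 << IOPRIO_CLASS_SHIFT) - 1))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

int main(void)
{
        int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 2);

        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) < 0) {
                perror("ioprio_set");
                return 1;
        }

        ioprio = syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
        if (ioprio < 0) {
                perror("ioprio_get");
                return 1;
        }
        printf("class %d, data %d\n",
               IOPRIO_PRIO_CLASS(ioprio), IOPRIO_PRIO_DATA(ioprio));
        return 0;
}

For IOPRIO_WHO_PGRP and IOPRIO_WHO_USER, ioprio_get() folds the per-task values through ioprio_best(), so the syscall reports the highest effective priority among all matching threads.
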
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 26487972ac54..9c28a5b38042 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -205,10 +205,6 @@ int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm)
         if (capable(CAP_SYS_RAWIO))
                 return 0;
 
-        /* if there's no filter set, assume we're filtering everything out */
-        if (!filter)
-                return -EPERM;
-
         /* Anybody who can open the device can do a read-safe command */
         if (test_bit(cmd[0], filter->read_ok))
                 return 0;