-rw-r--r--  Documentation/md-cluster.txt    | 176
-rw-r--r--  crypto/async_tx/async_pq.c      | 19
-rw-r--r--  drivers/md/Kconfig              | 16
-rw-r--r--  drivers/md/Makefile             | 1
-rw-r--r--  drivers/md/bitmap.c             | 189
-rw-r--r--  drivers/md/bitmap.h             | 10
-rw-r--r--  drivers/md/md-cluster.c         | 965
-rw-r--r--  drivers/md/md-cluster.h         | 29
-rw-r--r--  drivers/md/md.c                 | 382
-rw-r--r--  drivers/md/md.h                 | 26
-rw-r--r--  drivers/md/raid0.c              | 48
-rw-r--r--  drivers/md/raid1.c              | 29
-rw-r--r--  drivers/md/raid10.c             | 8
-rw-r--r--  drivers/md/raid5.c              | 826
-rw-r--r--  drivers/md/raid5.h              | 59
-rw-r--r--  include/linux/async_tx.h        | 3
-rw-r--r--  include/linux/raid/pq.h         | 1
-rw-r--r--  include/uapi/linux/raid/md_p.h  | 7
-rw-r--r--  include/uapi/linux/raid/md_u.h  | 1
-rw-r--r--  lib/raid6/algos.c               | 41
-rw-r--r--  lib/raid6/altivec.uc            | 1
-rw-r--r--  lib/raid6/avx2.c                | 3
-rw-r--r--  lib/raid6/int.uc                | 41
-rw-r--r--  lib/raid6/mmx.c                 | 2
-rw-r--r--  lib/raid6/neon.c                | 1
-rw-r--r--  lib/raid6/sse1.c                | 2
-rw-r--r--  lib/raid6/sse2.c                | 227
-rw-r--r--  lib/raid6/test/test.c           | 51
-rw-r--r--  lib/raid6/tilegx.uc             | 1
29 files changed, 2860 insertions, 305 deletions
diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt
new file mode 100644
index 000000000000..de1af7db3355
--- /dev/null
+++ b/Documentation/md-cluster.txt
@@ -0,0 +1,176 @@
1The cluster MD is a shared-device RAID for a cluster.
2
3
41. On-disk format
5
6Separate write-intent bitmaps are used for each cluster node.
7The bitmaps record all writes that may have been started on that node,
8and may not yet have finished. The on-disk layout is:
9
100 4k 8k 12k
11-------------------------------------------------------------------
12| idle | md super | bm super [0] + bits |
13| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
14| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
15| bm bits [3, contd] | | |
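
For illustration, the starting offset of a node's region in this layout can be
derived from the chunk count; a minimal sketch, condensing the calculation done
in bitmap_read_sb() later in this patch (the variable names here are
placeholders):

    /* Sketch: start of cluster slot <slot>'s bitmap region, in 512-byte
     * sectors, relative to the base bitmap offset.
     */
    sector_t bm_blocks = resync_max_sectors;            /* array size, in sectors */

    sector_div(bm_blocks, chunksize >> 9);              /* -> number of bitmap bits */
    bm_blocks = ((bm_blocks + 7) >> 3) + sizeof(bitmap_super_t); /* bits -> bytes, plus sb */
    bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); /* -> 4k blocks */
    bitmap_offset = base_offset + slot * (bm_blocks << 3); /* 4k blocks -> sectors */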
16
17During "normal" functioning we assume the filesystem ensures that only one
18node writes to any given block at a time, so a write
19request will
20 - set the appropriate bit (if not already set)
21 - commit the write to all mirrors
22 - schedule the bit to be cleared after a timeout.
23
24Reads are just handled normally. It is up to the filesystem to
25ensure one node doesn't read from a location where another node (or the same
26node) is writing.
27
28
292. DLM Locks for management
30
31There are two locks for managing the device:
32
332.1 Bitmap lock resource (bm_lockres)
34
35 The bm_lockres protects individual node bitmaps. They are named in the
36 form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
37 joins the cluster, it acquires the lock in PW mode and holds it for as
38 long as the node remains part of the cluster. The lock resource
39 number is based on the slot number returned by the DLM subsystem. Since
40 DLM starts node count from one and bitmap slots start from zero, one is
41 subtracted from the DLM slot number to arrive at the bitmap slot number.
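
As a minimal sketch (condensed from join() in md-cluster.c later in this patch),
the per-node bitmap lock is set up like this:

    /* Sketch: build this node's bitmap lock resource name from the
     * zero-based bitmap slot and hold it in PW mode.
     */
    char str[64];
    int my_slot = cinfo->slot_number - 1;    /* DLM slot -> bitmap slot */

    snprintf(str, sizeof(str), "bitmap%04d", my_slot);
    cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1);
    dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW);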
42
433. Communication
44
45Each node has to communicate with the other nodes when starting or ending a
46resync, and when updating the metadata superblock.
47
483.1 Message Types
49
50 There are 3 types of messages which are passed
51
52 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
53 updated, and that they must re-read the md superblock. This is performed
54 synchronously.
55
56 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
57 so that each node may suspend or resume the region.
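
As an illustration of 3.1.2, a resync range is announced roughly as follows
(condensed from resync_send() in md-cluster.c later in this patch; the end of a
resync is signalled by sending an empty (0,0) range):

    /* Sketch: announce that [lo, hi) is being resynced/suspended. */
    struct cluster_msg cmsg = {
        .type = cpu_to_le32(RESYNCING),
        .slot = cpu_to_le32(cinfo->slot_number - 1),
        .low  = cpu_to_le64(lo),
        .high = cpu_to_le64(hi),
    };

    sendmsg(cinfo, &cmsg);   /* takes TOKEN, writes the MESSAGE LVB, waits for ACK */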
58
593.2 Communication mechanism
60
61 The DLM LVB is used to communicate between the nodes of the cluster. There
62 are three resources used for the purpose:
63
64 3.2.1 Token: The resource which protects the entire communication
65 system. The node holding the token resource is allowed to
66 communicate.
67
68 3.2.2 Message: The lock resource which carries the data to be
69 communicated.
70
71 3.2.3 Ack: The resource whose acquisition means the message has been
72 acknowledged by all nodes in the cluster. The BAST of the resource
73 is used to inform the receiving nodes that a node wants to communicate.
74
75The algorithm is:
76
77 1. receive status
78
79 sender receiver receiver
80 ACK:CR ACK:CR ACK:CR
81
82 2. sender get EX of TOKEN
83 sender get EX of MESSAGE
84 sender receiver receiver
85 TOKEN:EX ACK:CR ACK:CR
86 MESSAGE:EX
87 ACK:CR
88
89 Sender checks that it still needs to send a message. Messages received
90 or other events that happened while waiting for the TOKEN may have made
91 this message inappropriate or redundant.
92
93 3. sender write LVB.
94 sender down-convert MESSAGE from EX to CR
95 sender try to get EX of ACK
96 [ wait until all receivers have *processed* the MESSAGE ]
97
98 [ triggered by bast of ACK ]
99 receiver get CR of MESSAGE
100 receiver read LVB
101 receiver processes the message
102 [ wait finish ]
103 receiver release ACK
104
105 sender receiver receiver
106 TOKEN:EX MESSAGE:CR MESSAGE:CR
107 MESSAGE:CR
108 ACK:EX
109
110 4. triggered by grant of EX on ACK (indicating all receivers have processed
111 message)
112 sender down-convert ACK from EX to CR
113 sender release MESSAGE
114 sender release TOKEN
115 receiver upconvert to EX of MESSAGE
116 receiver get CR of ACK
117 receiver release MESSAGE
118
119 sender receiver receiver
120 ACK:CR ACK:CR ACK:CR
121
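
Condensed into code, the sender's side of this exchange looks roughly like the
following (a sketch of lock_comm()/__sendmsg() later in this patch; error
handling omitted):

    dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);    /* 2. take TOKEN */
    dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);  /*    take MESSAGE */
    memcpy(cinfo->message_lockres->lksb.sb_lvbptr, cmsg, /* 3. write the LVB */
           sizeof(struct cluster_msg));
    dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CR);  /*    let receivers read it */
    dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);      /*    BAST wakes receivers; returns
                                                          *    once all have released ACK */
    dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);      /* 4. down-convert ACK */
    dlm_unlock_sync(cinfo->message_lockres);             /*    release MESSAGE */
    dlm_unlock_sync(cinfo->token_lockres);               /*    release TOKEN */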
122
1234. Handling Failures
124
1254.1 Node Failure
126 When a node fails, the DLM informs the cluster with the slot number. The node
127 starts a cluster recovery thread. The cluster recovery thread:
128 - acquires the bitmap<number> lock of the failed node
129 - opens the bitmap
130 - reads the bitmap of the failed node
131 - copies the set bits to the local node's bitmap
132 - cleans the bitmap of the failed node
133 - releases bitmap<number> lock of the failed node
134 - initiates resync of the bitmap on the current node
135
136 The resync process is the regular md resync. However, in a clustered
137 environment when a resync is performed, it needs to tell other nodes
138 of the areas which are suspended. Before a resync starts, the node
139 sends out RESYNC_START with the (lo,hi) range of the area which needs
140 to be suspended. Each node maintains a suspend_list, which contains
141 the list of ranges which are currently suspended. On receiving
142 RESYNC_START, the node adds the range to the suspend_list. Similarly,
143 when the node performing the resync finishes, it sends RESYNC_FINISHED
144 to other nodes and other nodes remove the corresponding entry from
145 the suspend_list.
146
147 A helper function, should_suspend(), can be used to check if a particular
148 I/O range should be suspended or not.
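
A minimal sketch of such a check, modeled on area_resyncing() in md-cluster.c
later in this patch:

    /* Sketch: does [lo, hi) overlap any currently suspended range? */
    static int should_suspend(struct md_cluster_info *cinfo,
                              sector_t lo, sector_t hi)
    {
        struct suspend_info *s;
        int ret = 0;

        spin_lock_irq(&cinfo->suspend_lock);
        list_for_each_entry(s, &cinfo->suspend_list, list)
            if (hi > s->lo && lo < s->hi) {
                ret = 1;
                break;
            }
        spin_unlock_irq(&cinfo->suspend_lock);
        return ret;
    }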
149
1504.2 Device Failure
151 Device failures are handled and communicated via the metadata update
152 routine.
153
1545. Adding a new Device
155To add a new device, it is necessary that all nodes "see" the new device
156to be added. For this, the following algorithm is used:
157
158 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
159 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD); see the sketch after this list
160 2. Node 1 sends NEWDISK with uuid and slot number
161 3. Other nodes issue kobject_uevent_env with uuid and slot number
162 (Steps 4,5 could be a udev rule)
163 4. In userspace, the node searches for the disk, perhaps
164 using blkid -t SUB_UUID=""
165 5. Other nodes issue either of the following depending on whether the disk
166 was found:
167 ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
168 disc.number set to slot number)
169 ioctl(CLUSTERED_DISK_NACK)
170 6. Other nodes drop lock on no-new-devs (CR) if device is found
171 7. Node 1 attempts EX lock on no-new-devs
172 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
173 as SpareLocal
174 9. If it cannot get the no-new-dev lock, it fails the operation and sends METADATA_UPDATED
175 10. Other nodes learn whether or not the disk was added from the
176 following METADATA_UPDATED.
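
For illustration, the ioctl from step 1 looks roughly like the userspace sketch
below (mdadm performs the equivalent internally; the exact mdu_disk_info_t
fields filled in here are an assumption):

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/sysmacros.h>
    #include <linux/raid/md_p.h>
    #include <linux/raid/md_u.h>

    /* Sketch: ask the local node to add the device (given by dev_t)
     * as a cluster-wide addition; error handling omitted.
     */
    static int cluster_add_disk(int md_fd, dev_t rdev)
    {
        mdu_disk_info_t info = { 0 };

        info.major = major(rdev);
        info.minor = minor(rdev);
        info.state = 1 << MD_DISK_CLUSTER_ADD;  /* assumed to be a bit number,
                                                 * like the other MD_DISK_* flags */
        return ioctl(md_fd, ADD_NEW_DISK, &info);
    }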
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index d05327caf69d..5d355e0c2633 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -124,6 +124,7 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
124{ 124{
125 void **srcs; 125 void **srcs;
126 int i; 126 int i;
127 int start = -1, stop = disks - 3;
127 128
128 if (submit->scribble) 129 if (submit->scribble)
129 srcs = submit->scribble; 130 srcs = submit->scribble;
@@ -134,10 +135,21 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
134 if (blocks[i] == NULL) { 135 if (blocks[i] == NULL) {
135 BUG_ON(i > disks - 3); /* P or Q can't be zero */ 136 BUG_ON(i > disks - 3); /* P or Q can't be zero */
136 srcs[i] = (void*)raid6_empty_zero_page; 137 srcs[i] = (void*)raid6_empty_zero_page;
137 } else 138 } else {
138 srcs[i] = page_address(blocks[i]) + offset; 139 srcs[i] = page_address(blocks[i]) + offset;
140 if (i < disks - 2) {
141 stop = i;
142 if (start == -1)
143 start = i;
144 }
145 }
139 } 146 }
140 raid6_call.gen_syndrome(disks, len, srcs); 147 if (submit->flags & ASYNC_TX_PQ_XOR_DST) {
148 BUG_ON(!raid6_call.xor_syndrome);
149 if (start >= 0)
150 raid6_call.xor_syndrome(disks, start, stop, len, srcs);
151 } else
152 raid6_call.gen_syndrome(disks, len, srcs);
141 async_tx_sync_epilog(submit); 153 async_tx_sync_epilog(submit);
142} 154}
143 155
@@ -178,7 +190,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
178 if (device) 190 if (device)
179 unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO); 191 unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO);
180 192
181 if (unmap && 193 /* XORing P/Q is only implemented in software */
194 if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
182 (src_cnt <= dma_maxpq(device, 0) || 195 (src_cnt <= dma_maxpq(device, 0) ||
183 dma_maxpq(device, DMA_PREP_CONTINUE) > 0) && 196 dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
184 is_dma_pq_aligned(device, offset, 0, len)) { 197 is_dma_pq_aligned(device, offset, 0, len)) {
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 6ddc983417d5..edcf4ab66e00 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -175,6 +175,22 @@ config MD_FAULTY
175 175
176 In unsure, say N. 176 In unsure, say N.
177 177
178
179config MD_CLUSTER
180 tristate "Cluster Support for MD (EXPERIMENTAL)"
181 depends on BLK_DEV_MD
182 depends on DLM
183 default n
184 ---help---
185 Clustering support for MD devices. This enables locking and
186 synchronization across multiple systems on the cluster, so all
187 nodes in the cluster can access the MD devices simultaneously.
188
189 This brings the redundancy (and uptime) of RAID levels across the
190 nodes of the cluster.
191
192 If unsure, say N.
193
178source "drivers/md/bcache/Kconfig" 194source "drivers/md/bcache/Kconfig"
179 195
180config BLK_DEV_DM_BUILTIN 196config BLK_DEV_DM_BUILTIN
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 1863feaa5846..dba4db5985fb 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o
30obj-$(CONFIG_MD_RAID456) += raid456.o 30obj-$(CONFIG_MD_RAID456) += raid456.o
31obj-$(CONFIG_MD_MULTIPATH) += multipath.o 31obj-$(CONFIG_MD_MULTIPATH) += multipath.o
32obj-$(CONFIG_MD_FAULTY) += faulty.o 32obj-$(CONFIG_MD_FAULTY) += faulty.o
33obj-$(CONFIG_MD_CLUSTER) += md-cluster.o
33obj-$(CONFIG_BCACHE) += bcache/ 34obj-$(CONFIG_BCACHE) += bcache/
34obj-$(CONFIG_BLK_DEV_MD) += md-mod.o 35obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
35obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o 36obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 3a5767968ba0..2bc56e2a3526 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -205,6 +205,10 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
205 struct block_device *bdev; 205 struct block_device *bdev;
206 struct mddev *mddev = bitmap->mddev; 206 struct mddev *mddev = bitmap->mddev;
207 struct bitmap_storage *store = &bitmap->storage; 207 struct bitmap_storage *store = &bitmap->storage;
208 int node_offset = 0;
209
210 if (mddev_is_clustered(bitmap->mddev))
211 node_offset = bitmap->cluster_slot * store->file_pages;
208 212
209 while ((rdev = next_active_rdev(rdev, mddev)) != NULL) { 213 while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
210 int size = PAGE_SIZE; 214 int size = PAGE_SIZE;
@@ -433,6 +437,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
433 /* This might have been changed by a reshape */ 437 /* This might have been changed by a reshape */
434 sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors); 438 sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
435 sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize); 439 sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize);
440 sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes);
436 sb->sectors_reserved = cpu_to_le32(bitmap->mddev-> 441 sb->sectors_reserved = cpu_to_le32(bitmap->mddev->
437 bitmap_info.space); 442 bitmap_info.space);
438 kunmap_atomic(sb); 443 kunmap_atomic(sb);
@@ -544,6 +549,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
544 bitmap_super_t *sb; 549 bitmap_super_t *sb;
545 unsigned long chunksize, daemon_sleep, write_behind; 550 unsigned long chunksize, daemon_sleep, write_behind;
546 unsigned long long events; 551 unsigned long long events;
552 int nodes = 0;
547 unsigned long sectors_reserved = 0; 553 unsigned long sectors_reserved = 0;
548 int err = -EINVAL; 554 int err = -EINVAL;
549 struct page *sb_page; 555 struct page *sb_page;
@@ -562,6 +568,22 @@ static int bitmap_read_sb(struct bitmap *bitmap)
562 return -ENOMEM; 568 return -ENOMEM;
563 bitmap->storage.sb_page = sb_page; 569 bitmap->storage.sb_page = sb_page;
564 570
571re_read:
572 /* If cluster_slot is set, the cluster is setup */
573 if (bitmap->cluster_slot >= 0) {
574 sector_t bm_blocks = bitmap->mddev->resync_max_sectors;
575
576 sector_div(bm_blocks,
577 bitmap->mddev->bitmap_info.chunksize >> 9);
578 /* bits to bytes */
579 bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t);
580 /* to 4k blocks */
581 bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096);
582 bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3);
583 pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__,
584 bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset);
585 }
586
565 if (bitmap->storage.file) { 587 if (bitmap->storage.file) {
566 loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host); 588 loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
567 int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; 589 int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
@@ -577,12 +599,15 @@ static int bitmap_read_sb(struct bitmap *bitmap)
577 if (err) 599 if (err)
578 return err; 600 return err;
579 601
602 err = -EINVAL;
580 sb = kmap_atomic(sb_page); 603 sb = kmap_atomic(sb_page);
581 604
582 chunksize = le32_to_cpu(sb->chunksize); 605 chunksize = le32_to_cpu(sb->chunksize);
583 daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ; 606 daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ;
584 write_behind = le32_to_cpu(sb->write_behind); 607 write_behind = le32_to_cpu(sb->write_behind);
585 sectors_reserved = le32_to_cpu(sb->sectors_reserved); 608 sectors_reserved = le32_to_cpu(sb->sectors_reserved);
609 nodes = le32_to_cpu(sb->nodes);
610 strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
586 611
587 /* verify that the bitmap-specific fields are valid */ 612 /* verify that the bitmap-specific fields are valid */
588 if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) 613 if (sb->magic != cpu_to_le32(BITMAP_MAGIC))
@@ -619,7 +644,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
619 goto out; 644 goto out;
620 } 645 }
621 events = le64_to_cpu(sb->events); 646 events = le64_to_cpu(sb->events);
622 if (events < bitmap->mddev->events) { 647 if (!nodes && (events < bitmap->mddev->events)) {
623 printk(KERN_INFO 648 printk(KERN_INFO
624 "%s: bitmap file is out of date (%llu < %llu) " 649 "%s: bitmap file is out of date (%llu < %llu) "
625 "-- forcing full recovery\n", 650 "-- forcing full recovery\n",
@@ -634,20 +659,40 @@ static int bitmap_read_sb(struct bitmap *bitmap)
634 if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN) 659 if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN)
635 set_bit(BITMAP_HOSTENDIAN, &bitmap->flags); 660 set_bit(BITMAP_HOSTENDIAN, &bitmap->flags);
636 bitmap->events_cleared = le64_to_cpu(sb->events_cleared); 661 bitmap->events_cleared = le64_to_cpu(sb->events_cleared);
662 strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
637 err = 0; 663 err = 0;
664
638out: 665out:
639 kunmap_atomic(sb); 666 kunmap_atomic(sb);
667 /* Assigning chunksize is required for "re_read" */
668 bitmap->mddev->bitmap_info.chunksize = chunksize;
669 if (nodes && (bitmap->cluster_slot < 0)) {
670 err = md_setup_cluster(bitmap->mddev, nodes);
671 if (err) {
672 pr_err("%s: Could not setup cluster service (%d)\n",
673 bmname(bitmap), err);
674 goto out_no_sb;
675 }
676 bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev);
677 goto re_read;
678 }
679
680
640out_no_sb: 681out_no_sb:
641 if (test_bit(BITMAP_STALE, &bitmap->flags)) 682 if (test_bit(BITMAP_STALE, &bitmap->flags))
642 bitmap->events_cleared = bitmap->mddev->events; 683 bitmap->events_cleared = bitmap->mddev->events;
643 bitmap->mddev->bitmap_info.chunksize = chunksize; 684 bitmap->mddev->bitmap_info.chunksize = chunksize;
644 bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep; 685 bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
645 bitmap->mddev->bitmap_info.max_write_behind = write_behind; 686 bitmap->mddev->bitmap_info.max_write_behind = write_behind;
687 bitmap->mddev->bitmap_info.nodes = nodes;
646 if (bitmap->mddev->bitmap_info.space == 0 || 688 if (bitmap->mddev->bitmap_info.space == 0 ||
647 bitmap->mddev->bitmap_info.space > sectors_reserved) 689 bitmap->mddev->bitmap_info.space > sectors_reserved)
648 bitmap->mddev->bitmap_info.space = sectors_reserved; 690 bitmap->mddev->bitmap_info.space = sectors_reserved;
649 if (err) 691 if (err) {
650 bitmap_print_sb(bitmap); 692 bitmap_print_sb(bitmap);
693 if (bitmap->cluster_slot < 0)
694 md_cluster_stop(bitmap->mddev);
695 }
651 return err; 696 return err;
652} 697}
653 698
@@ -692,9 +737,10 @@ static inline struct page *filemap_get_page(struct bitmap_storage *store,
692} 737}
693 738
694static int bitmap_storage_alloc(struct bitmap_storage *store, 739static int bitmap_storage_alloc(struct bitmap_storage *store,
695 unsigned long chunks, int with_super) 740 unsigned long chunks, int with_super,
741 int slot_number)
696{ 742{
697 int pnum; 743 int pnum, offset = 0;
698 unsigned long num_pages; 744 unsigned long num_pages;
699 unsigned long bytes; 745 unsigned long bytes;
700 746
@@ -703,6 +749,7 @@ static int bitmap_storage_alloc(struct bitmap_storage *store,
703 bytes += sizeof(bitmap_super_t); 749 bytes += sizeof(bitmap_super_t);
704 750
705 num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE); 751 num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE);
752 offset = slot_number * (num_pages - 1);
706 753
707 store->filemap = kmalloc(sizeof(struct page *) 754 store->filemap = kmalloc(sizeof(struct page *)
708 * num_pages, GFP_KERNEL); 755 * num_pages, GFP_KERNEL);
@@ -713,20 +760,22 @@ static int bitmap_storage_alloc(struct bitmap_storage *store,
713 store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO); 760 store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO);
714 if (store->sb_page == NULL) 761 if (store->sb_page == NULL)
715 return -ENOMEM; 762 return -ENOMEM;
716 store->sb_page->index = 0;
717 } 763 }
764
718 pnum = 0; 765 pnum = 0;
719 if (store->sb_page) { 766 if (store->sb_page) {
720 store->filemap[0] = store->sb_page; 767 store->filemap[0] = store->sb_page;
721 pnum = 1; 768 pnum = 1;
769 store->sb_page->index = offset;
722 } 770 }
771
723 for ( ; pnum < num_pages; pnum++) { 772 for ( ; pnum < num_pages; pnum++) {
724 store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO); 773 store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO);
725 if (!store->filemap[pnum]) { 774 if (!store->filemap[pnum]) {
726 store->file_pages = pnum; 775 store->file_pages = pnum;
727 return -ENOMEM; 776 return -ENOMEM;
728 } 777 }
729 store->filemap[pnum]->index = pnum; 778 store->filemap[pnum]->index = pnum + offset;
730 } 779 }
731 store->file_pages = pnum; 780 store->file_pages = pnum;
732 781
@@ -885,6 +934,28 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block)
885 } 934 }
886} 935}
887 936
937static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
938{
939 unsigned long bit;
940 struct page *page;
941 void *paddr;
942 unsigned long chunk = block >> bitmap->counts.chunkshift;
943 int set = 0;
944
945 page = filemap_get_page(&bitmap->storage, chunk);
946 if (!page)
947 return -EINVAL;
948 bit = file_page_offset(&bitmap->storage, chunk);
949 paddr = kmap_atomic(page);
950 if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags))
951 set = test_bit(bit, paddr);
952 else
953 set = test_bit_le(bit, paddr);
954 kunmap_atomic(paddr);
955 return set;
956}
957
958
888/* this gets called when the md device is ready to unplug its underlying 959/* this gets called when the md device is ready to unplug its underlying
889 * (slave) device queues -- before we let any writes go down, we need to 960 * (slave) device queues -- before we let any writes go down, we need to
890 * sync the dirty pages of the bitmap file to disk */ 961 * sync the dirty pages of the bitmap file to disk */
@@ -935,7 +1006,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n
935 */ 1006 */
936static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) 1007static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
937{ 1008{
938 unsigned long i, chunks, index, oldindex, bit; 1009 unsigned long i, chunks, index, oldindex, bit, node_offset = 0;
939 struct page *page = NULL; 1010 struct page *page = NULL;
940 unsigned long bit_cnt = 0; 1011 unsigned long bit_cnt = 0;
941 struct file *file; 1012 struct file *file;
@@ -981,6 +1052,9 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
981 if (!bitmap->mddev->bitmap_info.external) 1052 if (!bitmap->mddev->bitmap_info.external)
982 offset = sizeof(bitmap_super_t); 1053 offset = sizeof(bitmap_super_t);
983 1054
1055 if (mddev_is_clustered(bitmap->mddev))
1056 node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
1057
984 for (i = 0; i < chunks; i++) { 1058 for (i = 0; i < chunks; i++) {
985 int b; 1059 int b;
986 index = file_page_index(&bitmap->storage, i); 1060 index = file_page_index(&bitmap->storage, i);
@@ -1001,7 +1075,7 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
1001 bitmap->mddev, 1075 bitmap->mddev,
1002 bitmap->mddev->bitmap_info.offset, 1076 bitmap->mddev->bitmap_info.offset,
1003 page, 1077 page,
1004 index, count); 1078 index + node_offset, count);
1005 1079
1006 if (ret) 1080 if (ret)
1007 goto err; 1081 goto err;
@@ -1207,7 +1281,6 @@ void bitmap_daemon_work(struct mddev *mddev)
1207 j < bitmap->storage.file_pages 1281 j < bitmap->storage.file_pages
1208 && !test_bit(BITMAP_STALE, &bitmap->flags); 1282 && !test_bit(BITMAP_STALE, &bitmap->flags);
1209 j++) { 1283 j++) {
1210
1211 if (test_page_attr(bitmap, j, 1284 if (test_page_attr(bitmap, j,
1212 BITMAP_PAGE_DIRTY)) 1285 BITMAP_PAGE_DIRTY))
1213 /* bitmap_unplug will handle the rest */ 1286 /* bitmap_unplug will handle the rest */
@@ -1530,11 +1603,13 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n
1530 return; 1603 return;
1531 } 1604 }
1532 if (!*bmc) { 1605 if (!*bmc) {
1533 *bmc = 2 | (needed ? NEEDED_MASK : 0); 1606 *bmc = 2;
1534 bitmap_count_page(&bitmap->counts, offset, 1); 1607 bitmap_count_page(&bitmap->counts, offset, 1);
1535 bitmap_set_pending(&bitmap->counts, offset); 1608 bitmap_set_pending(&bitmap->counts, offset);
1536 bitmap->allclean = 0; 1609 bitmap->allclean = 0;
1537 } 1610 }
1611 if (needed)
1612 *bmc |= NEEDED_MASK;
1538 spin_unlock_irq(&bitmap->counts.lock); 1613 spin_unlock_irq(&bitmap->counts.lock);
1539} 1614}
1540 1615
@@ -1591,6 +1666,10 @@ static void bitmap_free(struct bitmap *bitmap)
1591 if (!bitmap) /* there was no bitmap */ 1666 if (!bitmap) /* there was no bitmap */
1592 return; 1667 return;
1593 1668
1669 if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info &&
1670 bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev))
1671 md_cluster_stop(bitmap->mddev);
1672
1594 /* Shouldn't be needed - but just in case.... */ 1673 /* Shouldn't be needed - but just in case.... */
1595 wait_event(bitmap->write_wait, 1674 wait_event(bitmap->write_wait,
1596 atomic_read(&bitmap->pending_writes) == 0); 1675 atomic_read(&bitmap->pending_writes) == 0);
@@ -1636,7 +1715,7 @@ void bitmap_destroy(struct mddev *mddev)
1636 * initialize the bitmap structure 1715 * initialize the bitmap structure
1637 * if this returns an error, bitmap_destroy must be called to do clean up 1716 * if this returns an error, bitmap_destroy must be called to do clean up
1638 */ 1717 */
1639int bitmap_create(struct mddev *mddev) 1718struct bitmap *bitmap_create(struct mddev *mddev, int slot)
1640{ 1719{
1641 struct bitmap *bitmap; 1720 struct bitmap *bitmap;
1642 sector_t blocks = mddev->resync_max_sectors; 1721 sector_t blocks = mddev->resync_max_sectors;
@@ -1650,7 +1729,7 @@ int bitmap_create(struct mddev *mddev)
1650 1729
1651 bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL); 1730 bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
1652 if (!bitmap) 1731 if (!bitmap)
1653 return -ENOMEM; 1732 return ERR_PTR(-ENOMEM);
1654 1733
1655 spin_lock_init(&bitmap->counts.lock); 1734 spin_lock_init(&bitmap->counts.lock);
1656 atomic_set(&bitmap->pending_writes, 0); 1735 atomic_set(&bitmap->pending_writes, 0);
@@ -1659,6 +1738,7 @@ int bitmap_create(struct mddev *mddev)
1659 init_waitqueue_head(&bitmap->behind_wait); 1738 init_waitqueue_head(&bitmap->behind_wait);
1660 1739
1661 bitmap->mddev = mddev; 1740 bitmap->mddev = mddev;
1741 bitmap->cluster_slot = slot;
1662 1742
1663 if (mddev->kobj.sd) 1743 if (mddev->kobj.sd)
1664 bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap"); 1744 bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap");
@@ -1706,12 +1786,14 @@ int bitmap_create(struct mddev *mddev)
1706 printk(KERN_INFO "created bitmap (%lu pages) for device %s\n", 1786 printk(KERN_INFO "created bitmap (%lu pages) for device %s\n",
1707 bitmap->counts.pages, bmname(bitmap)); 1787 bitmap->counts.pages, bmname(bitmap));
1708 1788
1709 mddev->bitmap = bitmap; 1789 err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0;
1710 return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; 1790 if (err)
1791 goto error;
1711 1792
1793 return bitmap;
1712 error: 1794 error:
1713 bitmap_free(bitmap); 1795 bitmap_free(bitmap);
1714 return err; 1796 return ERR_PTR(err);
1715} 1797}
1716 1798
1717int bitmap_load(struct mddev *mddev) 1799int bitmap_load(struct mddev *mddev)
@@ -1765,6 +1847,60 @@ out:
1765} 1847}
1766EXPORT_SYMBOL_GPL(bitmap_load); 1848EXPORT_SYMBOL_GPL(bitmap_load);
1767 1849
1850/* Loads the bitmap associated with slot and copies the resync information
1851 * to our bitmap
1852 */
1853int bitmap_copy_from_slot(struct mddev *mddev, int slot,
1854 sector_t *low, sector_t *high, bool clear_bits)
1855{
1856 int rv = 0, i, j;
1857 sector_t block, lo = 0, hi = 0;
1858 struct bitmap_counts *counts;
1859 struct bitmap *bitmap = bitmap_create(mddev, slot);
1860
1861 if (IS_ERR(bitmap))
1862 return PTR_ERR(bitmap);
1863
1864 rv = bitmap_read_sb(bitmap);
1865 if (rv)
1866 goto err;
1867
1868 rv = bitmap_init_from_disk(bitmap, 0);
1869 if (rv)
1870 goto err;
1871
1872 counts = &bitmap->counts;
1873 for (j = 0; j < counts->chunks; j++) {
1874 block = (sector_t)j << counts->chunkshift;
1875 if (bitmap_file_test_bit(bitmap, block)) {
1876 if (!lo)
1877 lo = block;
1878 hi = block;
1879 bitmap_file_clear_bit(bitmap, block);
1880 bitmap_set_memory_bits(mddev->bitmap, block, 1);
1881 bitmap_file_set_bit(mddev->bitmap, block);
1882 }
1883 }
1884
1885 if (clear_bits) {
1886 bitmap_update_sb(bitmap);
1887 /* Setting this for the ev_page should be enough.
1888 * And we do not require both write_all and PAGE_DIRTY either
1889 */
1890 for (i = 0; i < bitmap->storage.file_pages; i++)
1891 set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY);
1892 bitmap_write_all(bitmap);
1893 bitmap_unplug(bitmap);
1894 }
1895 *low = lo;
1896 *high = hi;
1897err:
1898 bitmap_free(bitmap);
1899 return rv;
1900}
1901EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
1902
1903
1768void bitmap_status(struct seq_file *seq, struct bitmap *bitmap) 1904void bitmap_status(struct seq_file *seq, struct bitmap *bitmap)
1769{ 1905{
1770 unsigned long chunk_kb; 1906 unsigned long chunk_kb;
@@ -1849,7 +1985,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
1849 memset(&store, 0, sizeof(store)); 1985 memset(&store, 0, sizeof(store));
1850 if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file) 1986 if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
1851 ret = bitmap_storage_alloc(&store, chunks, 1987 ret = bitmap_storage_alloc(&store, chunks,
1852 !bitmap->mddev->bitmap_info.external); 1988 !bitmap->mddev->bitmap_info.external,
1989 bitmap->cluster_slot);
1853 if (ret) 1990 if (ret)
1854 goto err; 1991 goto err;
1855 1992
@@ -2021,13 +2158,18 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
2021 return -EINVAL; 2158 return -EINVAL;
2022 mddev->bitmap_info.offset = offset; 2159 mddev->bitmap_info.offset = offset;
2023 if (mddev->pers) { 2160 if (mddev->pers) {
2161 struct bitmap *bitmap;
2024 mddev->pers->quiesce(mddev, 1); 2162 mddev->pers->quiesce(mddev, 1);
2025 rv = bitmap_create(mddev); 2163 bitmap = bitmap_create(mddev, -1);
2026 if (!rv) 2164 if (IS_ERR(bitmap))
2165 rv = PTR_ERR(bitmap);
2166 else {
2167 mddev->bitmap = bitmap;
2027 rv = bitmap_load(mddev); 2168 rv = bitmap_load(mddev);
2028 if (rv) { 2169 if (rv) {
2029 bitmap_destroy(mddev); 2170 bitmap_destroy(mddev);
2030 mddev->bitmap_info.offset = 0; 2171 mddev->bitmap_info.offset = 0;
2172 }
2031 } 2173 }
2032 mddev->pers->quiesce(mddev, 0); 2174 mddev->pers->quiesce(mddev, 0);
2033 if (rv) 2175 if (rv)
@@ -2186,6 +2328,8 @@ __ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store);
2186 2328
2187static ssize_t metadata_show(struct mddev *mddev, char *page) 2329static ssize_t metadata_show(struct mddev *mddev, char *page)
2188{ 2330{
2331 if (mddev_is_clustered(mddev))
2332 return sprintf(page, "clustered\n");
2189 return sprintf(page, "%s\n", (mddev->bitmap_info.external 2333 return sprintf(page, "%s\n", (mddev->bitmap_info.external
2190 ? "external" : "internal")); 2334 ? "external" : "internal"));
2191} 2335}
@@ -2198,7 +2342,8 @@ static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len)
2198 return -EBUSY; 2342 return -EBUSY;
2199 if (strncmp(buf, "external", 8) == 0) 2343 if (strncmp(buf, "external", 8) == 0)
2200 mddev->bitmap_info.external = 1; 2344 mddev->bitmap_info.external = 1;
2201 else if (strncmp(buf, "internal", 8) == 0) 2345 else if ((strncmp(buf, "internal", 8) == 0) ||
2346 (strncmp(buf, "clustered", 9) == 0))
2202 mddev->bitmap_info.external = 0; 2347 mddev->bitmap_info.external = 0;
2203 else 2348 else
2204 return -EINVAL; 2349 return -EINVAL;
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 30210b9c4ef9..f1f4dd01090d 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -130,8 +130,9 @@ typedef struct bitmap_super_s {
130 __le32 write_behind; /* 60 number of outstanding write-behind writes */ 130 __le32 write_behind; /* 60 number of outstanding write-behind writes */
131 __le32 sectors_reserved; /* 64 number of 512-byte sectors that are 131 __le32 sectors_reserved; /* 64 number of 512-byte sectors that are
132 * reserved for the bitmap. */ 132 * reserved for the bitmap. */
133 133 __le32 nodes; /* 68 the maximum number of nodes in cluster. */
134 __u8 pad[256 - 68]; /* set to zero */ 134 __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
135 __u8 pad[256 - 136]; /* set to zero */
135} bitmap_super_t; 136} bitmap_super_t;
136 137
137/* notes: 138/* notes:
@@ -226,12 +227,13 @@ struct bitmap {
226 wait_queue_head_t behind_wait; 227 wait_queue_head_t behind_wait;
227 228
228 struct kernfs_node *sysfs_can_clear; 229 struct kernfs_node *sysfs_can_clear;
230 int cluster_slot; /* Slot offset for clustered env */
229}; 231};
230 232
231/* the bitmap API */ 233/* the bitmap API */
232 234
233/* these are used only by md/bitmap */ 235/* these are used only by md/bitmap */
234int bitmap_create(struct mddev *mddev); 236struct bitmap *bitmap_create(struct mddev *mddev, int slot);
235int bitmap_load(struct mddev *mddev); 237int bitmap_load(struct mddev *mddev);
236void bitmap_flush(struct mddev *mddev); 238void bitmap_flush(struct mddev *mddev);
237void bitmap_destroy(struct mddev *mddev); 239void bitmap_destroy(struct mddev *mddev);
@@ -260,6 +262,8 @@ void bitmap_daemon_work(struct mddev *mddev);
260 262
261int bitmap_resize(struct bitmap *bitmap, sector_t blocks, 263int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
262 int chunksize, int init); 264 int chunksize, int init);
265int bitmap_copy_from_slot(struct mddev *mddev, int slot,
266 sector_t *lo, sector_t *hi, bool clear_bits);
263#endif 267#endif
264 268
265#endif 269#endif
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
new file mode 100644
index 000000000000..fcfc4b9b2672
--- /dev/null
+++ b/drivers/md/md-cluster.c
@@ -0,0 +1,965 @@
1/*
2 * Copyright (C) 2015, SUSE
3 *
4 * This program is free software; you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License as published by
6 * the Free Software Foundation; either version 2, or (at your option)
7 * any later version.
8 *
9 */
10
11
12#include <linux/module.h>
13#include <linux/dlm.h>
14#include <linux/sched.h>
15#include <linux/raid/md_p.h>
16#include "md.h"
17#include "bitmap.h"
18#include "md-cluster.h"
19
20#define LVB_SIZE 64
21#define NEW_DEV_TIMEOUT 5000
22
23struct dlm_lock_resource {
24 dlm_lockspace_t *ls;
25 struct dlm_lksb lksb;
26 char *name; /* lock name. */
27 uint32_t flags; /* flags to pass to dlm_lock() */
28 struct completion completion; /* completion for synchronized locking */
29 void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
30 struct mddev *mddev; /* pointing back to mddev. */
31};
32
33struct suspend_info {
34 int slot;
35 sector_t lo;
36 sector_t hi;
37 struct list_head list;
38};
39
40struct resync_info {
41 __le64 lo;
42 __le64 hi;
43};
44
45/* md_cluster_info flags */
46#define MD_CLUSTER_WAITING_FOR_NEWDISK 1
47
48
49struct md_cluster_info {
50 /* dlm lock space and resources for clustered raid. */
51 dlm_lockspace_t *lockspace;
52 int slot_number;
53 struct completion completion;
54 struct dlm_lock_resource *sb_lock;
55 struct mutex sb_mutex;
56 struct dlm_lock_resource *bitmap_lockres;
57 struct list_head suspend_list;
58 spinlock_t suspend_lock;
59 struct md_thread *recovery_thread;
60 unsigned long recovery_map;
61 /* communication lock resources */
62 struct dlm_lock_resource *ack_lockres;
63 struct dlm_lock_resource *message_lockres;
64 struct dlm_lock_resource *token_lockres;
65 struct dlm_lock_resource *no_new_dev_lockres;
66 struct md_thread *recv_thread;
67 struct completion newdisk_completion;
68 unsigned long state;
69};
70
71enum msg_type {
72 METADATA_UPDATED = 0,
73 RESYNCING,
74 NEWDISK,
75 REMOVE,
76 RE_ADD,
77};
78
79struct cluster_msg {
80 int type;
81 int slot;
82 /* TODO: Unionize this for smaller footprint */
83 sector_t low;
84 sector_t high;
85 char uuid[16];
86 int raid_slot;
87};
88
89static void sync_ast(void *arg)
90{
91 struct dlm_lock_resource *res;
92
93 res = (struct dlm_lock_resource *) arg;
94 complete(&res->completion);
95}
96
97static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
98{
99 int ret = 0;
100
101 init_completion(&res->completion);
102 ret = dlm_lock(res->ls, mode, &res->lksb,
103 res->flags, res->name, strlen(res->name),
104 0, sync_ast, res, res->bast);
105 if (ret)
106 return ret;
107 wait_for_completion(&res->completion);
108 return res->lksb.sb_status;
109}
110
111static int dlm_unlock_sync(struct dlm_lock_resource *res)
112{
113 return dlm_lock_sync(res, DLM_LOCK_NL);
114}
115
116static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
117 char *name, void (*bastfn)(void *arg, int mode), int with_lvb)
118{
119 struct dlm_lock_resource *res = NULL;
120 int ret, namelen;
121 struct md_cluster_info *cinfo = mddev->cluster_info;
122
123 res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL);
124 if (!res)
125 return NULL;
126 res->ls = cinfo->lockspace;
127 res->mddev = mddev;
128 namelen = strlen(name);
129 res->name = kzalloc(namelen + 1, GFP_KERNEL);
130 if (!res->name) {
131 pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name);
132 goto out_err;
133 }
134 strlcpy(res->name, name, namelen + 1);
135 if (with_lvb) {
136 res->lksb.sb_lvbptr = kzalloc(LVB_SIZE, GFP_KERNEL);
137 if (!res->lksb.sb_lvbptr) {
138 pr_err("md-cluster: Unable to allocate LVB for resource %s\n", name);
139 goto out_err;
140 }
141 res->flags = DLM_LKF_VALBLK;
142 }
143
144 if (bastfn)
145 res->bast = bastfn;
146
147 res->flags |= DLM_LKF_EXPEDITE;
148
149 ret = dlm_lock_sync(res, DLM_LOCK_NL);
150 if (ret) {
151 pr_err("md-cluster: Unable to lock NL on new lock resource %s\n", name);
152 goto out_err;
153 }
154 res->flags &= ~DLM_LKF_EXPEDITE;
155 res->flags |= DLM_LKF_CONVERT;
156
157 return res;
158out_err:
159 kfree(res->lksb.sb_lvbptr);
160 kfree(res->name);
161 kfree(res);
162 return NULL;
163}
164
165static void lockres_free(struct dlm_lock_resource *res)
166{
167 if (!res)
168 return;
169
170 init_completion(&res->completion);
171 dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res);
172 wait_for_completion(&res->completion);
173
174 kfree(res->name);
175 kfree(res->lksb.sb_lvbptr);
176 kfree(res);
177}
178
179static char *pretty_uuid(char *dest, char *src)
180{
181 int i, len = 0;
182
183 for (i = 0; i < 16; i++) {
184 if (i == 4 || i == 6 || i == 8 || i == 10)
185 len += sprintf(dest + len, "-");
186 len += sprintf(dest + len, "%02x", (__u8)src[i]);
187 }
188 return dest;
189}
190
191static void add_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres,
192 sector_t lo, sector_t hi)
193{
194 struct resync_info *ri;
195
196 ri = (struct resync_info *)lockres->lksb.sb_lvbptr;
197 ri->lo = cpu_to_le64(lo);
198 ri->hi = cpu_to_le64(hi);
199}
200
201static struct suspend_info *read_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres)
202{
203 struct resync_info ri;
204 struct suspend_info *s = NULL;
205 sector_t hi = 0;
206
207 dlm_lock_sync(lockres, DLM_LOCK_CR);
208 memcpy(&ri, lockres->lksb.sb_lvbptr, sizeof(struct resync_info));
209 hi = le64_to_cpu(ri.hi);
210 if (ri.hi > 0) {
211 s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
212 if (!s)
213 goto out;
214 s->hi = hi;
215 s->lo = le64_to_cpu(ri.lo);
216 }
217 dlm_unlock_sync(lockres);
218out:
219 return s;
220}
221
222static void recover_bitmaps(struct md_thread *thread)
223{
224 struct mddev *mddev = thread->mddev;
225 struct md_cluster_info *cinfo = mddev->cluster_info;
226 struct dlm_lock_resource *bm_lockres;
227 char str[64];
228 int slot, ret;
229 struct suspend_info *s, *tmp;
230 sector_t lo, hi;
231
232 while (cinfo->recovery_map) {
233 slot = fls64((u64)cinfo->recovery_map) - 1;
234
235 /* Clear suspend_area associated with the bitmap */
236 spin_lock_irq(&cinfo->suspend_lock);
237 list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
238 if (slot == s->slot) {
239 list_del(&s->list);
240 kfree(s);
241 }
242 spin_unlock_irq(&cinfo->suspend_lock);
243
244 snprintf(str, 64, "bitmap%04d", slot);
245 bm_lockres = lockres_init(mddev, str, NULL, 1);
246 if (!bm_lockres) {
247 pr_err("md-cluster: Cannot initialize bitmaps\n");
248 goto clear_bit;
249 }
250
251 ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
252 if (ret) {
253 pr_err("md-cluster: Could not DLM lock %s: %d\n",
254 str, ret);
255 goto clear_bit;
256 }
257 ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi, true);
258 if (ret) {
259 pr_err("md-cluster: Could not copy data from bitmap %d\n", slot);
260 goto dlm_unlock;
261 }
262 if (hi > 0) {
263 /* TODO:Wait for current resync to get over */
264 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
265 if (lo < mddev->recovery_cp)
266 mddev->recovery_cp = lo;
267 md_check_recovery(mddev);
268 }
269dlm_unlock:
270 dlm_unlock_sync(bm_lockres);
271clear_bit:
272 clear_bit(slot, &cinfo->recovery_map);
273 }
274}
275
276static void recover_prep(void *arg)
277{
278}
279
280static void recover_slot(void *arg, struct dlm_slot *slot)
281{
282 struct mddev *mddev = arg;
283 struct md_cluster_info *cinfo = mddev->cluster_info;
284
285 pr_info("md-cluster: %s Node %d/%d down. My slot: %d. Initiating recovery.\n",
286 mddev->bitmap_info.cluster_name,
287 slot->nodeid, slot->slot,
288 cinfo->slot_number);
289 set_bit(slot->slot - 1, &cinfo->recovery_map);
290 if (!cinfo->recovery_thread) {
291 cinfo->recovery_thread = md_register_thread(recover_bitmaps,
292 mddev, "recover");
293 if (!cinfo->recovery_thread) {
294 pr_warn("md-cluster: Could not create recovery thread\n");
295 return;
296 }
297 }
298 md_wakeup_thread(cinfo->recovery_thread);
299}
300
301static void recover_done(void *arg, struct dlm_slot *slots,
302 int num_slots, int our_slot,
303 uint32_t generation)
304{
305 struct mddev *mddev = arg;
306 struct md_cluster_info *cinfo = mddev->cluster_info;
307
308 cinfo->slot_number = our_slot;
309 complete(&cinfo->completion);
310}
311
312static const struct dlm_lockspace_ops md_ls_ops = {
313 .recover_prep = recover_prep,
314 .recover_slot = recover_slot,
315 .recover_done = recover_done,
316};
317
318/*
319 * The BAST function for the ack lock resource
320 * This function wakes up the receive thread in
321 * order to receive and process the message.
322 */
323static void ack_bast(void *arg, int mode)
324{
325 struct dlm_lock_resource *res = (struct dlm_lock_resource *)arg;
326 struct md_cluster_info *cinfo = res->mddev->cluster_info;
327
328 if (mode == DLM_LOCK_EX)
329 md_wakeup_thread(cinfo->recv_thread);
330}
331
332static void __remove_suspend_info(struct md_cluster_info *cinfo, int slot)
333{
334 struct suspend_info *s, *tmp;
335
336 list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
337 if (slot == s->slot) {
338 pr_info("%s:%d Deleting suspend_info: %d\n",
339 __func__, __LINE__, slot);
340 list_del(&s->list);
341 kfree(s);
342 break;
343 }
344}
345
346static void remove_suspend_info(struct md_cluster_info *cinfo, int slot)
347{
348 spin_lock_irq(&cinfo->suspend_lock);
349 __remove_suspend_info(cinfo, slot);
350 spin_unlock_irq(&cinfo->suspend_lock);
351}
352
353
354static void process_suspend_info(struct md_cluster_info *cinfo,
355 int slot, sector_t lo, sector_t hi)
356{
357 struct suspend_info *s;
358
359 if (!hi) {
360 remove_suspend_info(cinfo, slot);
361 return;
362 }
363 s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
364 if (!s)
365 return;
366 s->slot = slot;
367 s->lo = lo;
368 s->hi = hi;
369 spin_lock_irq(&cinfo->suspend_lock);
370 /* Remove existing entry (if exists) before adding */
371 __remove_suspend_info(cinfo, slot);
372 list_add(&s->list, &cinfo->suspend_list);
373 spin_unlock_irq(&cinfo->suspend_lock);
374}
375
376static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
377{
378 char disk_uuid[64];
379 struct md_cluster_info *cinfo = mddev->cluster_info;
380 char event_name[] = "EVENT=ADD_DEVICE";
381 char raid_slot[16];
382 char *envp[] = {event_name, disk_uuid, raid_slot, NULL};
383 int len;
384
385 len = snprintf(disk_uuid, 64, "DEVICE_UUID=");
386 pretty_uuid(disk_uuid + len, cmsg->uuid);
387 snprintf(raid_slot, 16, "RAID_DISK=%d", cmsg->raid_slot);
388 pr_info("%s:%d Sending kobject change with %s and %s\n", __func__, __LINE__, disk_uuid, raid_slot);
389 init_completion(&cinfo->newdisk_completion);
390 set_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state);
391 kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp);
392 wait_for_completion_timeout(&cinfo->newdisk_completion,
393 NEW_DEV_TIMEOUT);
394 clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state);
395}
396
397
398static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
399{
400 struct md_cluster_info *cinfo = mddev->cluster_info;
401
402 md_reload_sb(mddev);
403 dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
404}
405
406static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
407{
408 struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot);
409
410 if (rdev)
411 md_kick_rdev_from_array(rdev);
412 else
413 pr_warn("%s: %d Could not find disk(%d) to REMOVE\n", __func__, __LINE__, msg->raid_slot);
414}
415
416static void process_readd_disk(struct mddev *mddev, struct cluster_msg *msg)
417{
418 struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot);
419
420 if (rdev && test_bit(Faulty, &rdev->flags))
421 clear_bit(Faulty, &rdev->flags);
422 else
423 pr_warn("%s: %d Could not find disk(%d) which is faulty", __func__, __LINE__, msg->raid_slot);
424}
425
426static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
427{
428 switch (msg->type) {
429 case METADATA_UPDATED:
430 pr_info("%s: %d Received message: METADATA_UPDATE from %d\n",
431 __func__, __LINE__, msg->slot);
432 process_metadata_update(mddev, msg);
433 break;
434 case RESYNCING:
435 pr_info("%s: %d Received message: RESYNCING from %d\n",
436 __func__, __LINE__, msg->slot);
437 process_suspend_info(mddev->cluster_info, msg->slot,
438 msg->low, msg->high);
439 break;
440 case NEWDISK:
441 pr_info("%s: %d Received message: NEWDISK from %d\n",
442 __func__, __LINE__, msg->slot);
443 process_add_new_disk(mddev, msg);
444 break;
445 case REMOVE:
446 pr_info("%s: %d Received REMOVE from %d\n",
447 __func__, __LINE__, msg->slot);
448 process_remove_disk(mddev, msg);
449 break;
450 case RE_ADD:
451 pr_info("%s: %d Received RE_ADD from %d\n",
452 __func__, __LINE__, msg->slot);
453 process_readd_disk(mddev, msg);
454 break;
455 default:
456 pr_warn("%s:%d Received unknown message from %d\n",
457 __func__, __LINE__, msg->slot);
458 }
459}
460
461/*
462 * thread for receiving message
463 */
464static void recv_daemon(struct md_thread *thread)
465{
466 struct md_cluster_info *cinfo = thread->mddev->cluster_info;
467 struct dlm_lock_resource *ack_lockres = cinfo->ack_lockres;
468 struct dlm_lock_resource *message_lockres = cinfo->message_lockres;
469 struct cluster_msg msg;
470
471 /*get CR on Message*/
472 if (dlm_lock_sync(message_lockres, DLM_LOCK_CR)) {
473 pr_err("md/raid1:failed to get CR on MESSAGE\n");
474 return;
475 }
476
477 /* read lvb and wake up thread to process this message_lockres */
478 memcpy(&msg, message_lockres->lksb.sb_lvbptr, sizeof(struct cluster_msg));
479 process_recvd_msg(thread->mddev, &msg);
480
481 /*release CR on ack_lockres*/
482 dlm_unlock_sync(ack_lockres);
483 /*up-convert to EX on message_lockres*/
484 dlm_lock_sync(message_lockres, DLM_LOCK_EX);
485 /*get CR on ack_lockres again*/
486 dlm_lock_sync(ack_lockres, DLM_LOCK_CR);
487 /*release CR on message_lockres*/
488 dlm_unlock_sync(message_lockres);
489}
490
491/* lock_comm()
492 * Takes the lock on the TOKEN lock resource so no other
493 * node can communicate while the operation is underway.
494 */
495static int lock_comm(struct md_cluster_info *cinfo)
496{
497 int error;
498
499 error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
500 if (error)
501 pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
502 __func__, __LINE__, error);
503 return error;
504}
505
506static void unlock_comm(struct md_cluster_info *cinfo)
507{
508 dlm_unlock_sync(cinfo->token_lockres);
509}
510
511/* __sendmsg()
512 * This function performs the actual sending of the message. This function is
513 * usually called after performing the encompassing operation
514 * The function:
515 * 1. Grabs the message lockresource in EX mode
516 * 2. Copies the message to the message LVB
517 * 3. Downconverts message lockresource to CR
518 * 4. Upconverts ack lock resource from CR to EX. This forces the BAST on other nodes
519 * and the other nodes read the message. The thread will wait here until all other
520 * nodes have released ack lock resource.
521 * 5. Downconvert ack lockresource to CR
522 */
523static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
524{
525 int error;
526 int slot = cinfo->slot_number - 1;
527
528 cmsg->slot = cpu_to_le32(slot);
529 /*get EX on Message*/
530 error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
531 if (error) {
532 pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
533 goto failed_message;
534 }
535
536 memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg,
537 sizeof(struct cluster_msg));
538 /*down-convert EX to CR on Message*/
539 error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CR);
540 if (error) {
541 pr_err("md-cluster: failed to convert EX to CR on MESSAGE(%d)\n",
542 error);
543 goto failed_message;
544 }
545
546 /*up-convert CR to EX on Ack*/
547 error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
548 if (error) {
549 pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
550 error);
551 goto failed_ack;
552 }
553
554 /*down-convert EX to CR on Ack*/
555 error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
556 if (error) {
557 pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
558 error);
559 goto failed_ack;
560 }
561
562failed_ack:
563 dlm_unlock_sync(cinfo->message_lockres);
564failed_message:
565 return error;
566}
567
568static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
569{
570 int ret;
571
572 lock_comm(cinfo);
573 ret = __sendmsg(cinfo, cmsg);
574 unlock_comm(cinfo);
575 return ret;
576}
577
578static int gather_all_resync_info(struct mddev *mddev, int total_slots)
579{
580 struct md_cluster_info *cinfo = mddev->cluster_info;
581 int i, ret = 0;
582 struct dlm_lock_resource *bm_lockres;
583 struct suspend_info *s;
584 char str[64];
585
586
587 for (i = 0; i < total_slots; i++) {
588 memset(str, '\0', 64);
589 snprintf(str, 64, "bitmap%04d", i);
590 bm_lockres = lockres_init(mddev, str, NULL, 1);
591 if (!bm_lockres)
592 return -ENOMEM;
593 if (i == (cinfo->slot_number - 1))
594 continue;
595
596 bm_lockres->flags |= DLM_LKF_NOQUEUE;
597 ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
598 if (ret == -EAGAIN) {
599 memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE);
600 s = read_resync_info(mddev, bm_lockres);
601 if (s) {
602 pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n",
603 __func__, __LINE__,
604 (unsigned long long) s->lo,
605 (unsigned long long) s->hi, i);
606 spin_lock_irq(&cinfo->suspend_lock);
607 s->slot = i;
608 list_add(&s->list, &cinfo->suspend_list);
609 spin_unlock_irq(&cinfo->suspend_lock);
610 }
611 ret = 0;
612 lockres_free(bm_lockres);
613 continue;
614 }
615 if (ret)
616 goto out;
617 /* TODO: Read the disk bitmap sb and check if it needs recovery */
618 dlm_unlock_sync(bm_lockres);
619 lockres_free(bm_lockres);
620 }
621out:
622 return ret;
623}
624
625static int join(struct mddev *mddev, int nodes)
626{
627 struct md_cluster_info *cinfo;
628 int ret, ops_rv;
629 char str[64];
630
631 if (!try_module_get(THIS_MODULE))
632 return -ENOENT;
633
634 cinfo = kzalloc(sizeof(struct md_cluster_info), GFP_KERNEL);
635 if (!cinfo)
636 return -ENOMEM;
637
638 init_completion(&cinfo->completion);
639
640 mutex_init(&cinfo->sb_mutex);
641 mddev->cluster_info = cinfo;
642
643 memset(str, 0, 64);
644 pretty_uuid(str, mddev->uuid);
645 ret = dlm_new_lockspace(str, mddev->bitmap_info.cluster_name,
646 DLM_LSFL_FS, LVB_SIZE,
647 &md_ls_ops, mddev, &ops_rv, &cinfo->lockspace);
648 if (ret)
649 goto err;
650 wait_for_completion(&cinfo->completion);
651 if (nodes < cinfo->slot_number) {
652 pr_err("md-cluster: Slot allotted(%d) is greater than available slots(%d).",
653 cinfo->slot_number, nodes);
654 ret = -ERANGE;
655 goto err;
656 }
657 cinfo->sb_lock = lockres_init(mddev, "cmd-super",
658 NULL, 0);
659 if (!cinfo->sb_lock) {
660 ret = -ENOMEM;
661 goto err;
662 }
663 /* Initiate the communication resources */
664 ret = -ENOMEM;
665 cinfo->recv_thread = md_register_thread(recv_daemon, mddev, "cluster_recv");
666 if (!cinfo->recv_thread) {
667 pr_err("md-cluster: cannot allocate memory for recv_thread!\n");
668 goto err;
669 }
670 cinfo->message_lockres = lockres_init(mddev, "message", NULL, 1);
671 if (!cinfo->message_lockres)
672 goto err;
673 cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0);
674 if (!cinfo->token_lockres)
675 goto err;
676 cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
677 if (!cinfo->ack_lockres)
678 goto err;
679 cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0);
680 if (!cinfo->no_new_dev_lockres)
681 goto err;
682
683 /* get sync CR lock on ACK. */
684 if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
685 pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
686 ret);
687 /* get sync CR lock on no-new-dev. */
688 if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR))
689 pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret);
690
691
692 pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number);
693 snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1);
694 cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1);
695 if (!cinfo->bitmap_lockres)
696 goto err;
697 if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) {
698 pr_err("Failed to get bitmap lock\n");
699 ret = -EINVAL;
700 goto err;
701 }
702
703 INIT_LIST_HEAD(&cinfo->suspend_list);
704 spin_lock_init(&cinfo->suspend_lock);
705
706 ret = gather_all_resync_info(mddev, nodes);
707 if (ret)
708 goto err;
709
710 return 0;
711err:
712 lockres_free(cinfo->message_lockres);
713 lockres_free(cinfo->token_lockres);
714 lockres_free(cinfo->ack_lockres);
715 lockres_free(cinfo->no_new_dev_lockres);
716 lockres_free(cinfo->bitmap_lockres);
717 lockres_free(cinfo->sb_lock);
718 if (cinfo->lockspace)
719 dlm_release_lockspace(cinfo->lockspace, 2);
720 mddev->cluster_info = NULL;
721 kfree(cinfo);
722 module_put(THIS_MODULE);
723 return ret;
724}
725
726static int leave(struct mddev *mddev)
727{
728 struct md_cluster_info *cinfo = mddev->cluster_info;
729
730 if (!cinfo)
731 return 0;
732 md_unregister_thread(&cinfo->recovery_thread);
733 md_unregister_thread(&cinfo->recv_thread);
734 lockres_free(cinfo->message_lockres);
735 lockres_free(cinfo->token_lockres);
736 lockres_free(cinfo->ack_lockres);
737 lockres_free(cinfo->no_new_dev_lockres);
738 lockres_free(cinfo->sb_lock);
739 lockres_free(cinfo->bitmap_lockres);
740 dlm_release_lockspace(cinfo->lockspace, 2);
741 return 0;
742}
743
744/* slot_number(): Returns the MD slot number to use
745 * DLM starts the slot numbers from 1, wheras cluster-md
746 * wants the number to be from zero, so we deduct one
747 */
748static int slot_number(struct mddev *mddev)
749{
750 struct md_cluster_info *cinfo = mddev->cluster_info;
751
752 return cinfo->slot_number - 1;
753}
754
755static void resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
756{
757 struct md_cluster_info *cinfo = mddev->cluster_info;
758
759 add_resync_info(mddev, cinfo->bitmap_lockres, lo, hi);
760 /* Re-acquire the lock to refresh LVB */
761 dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW);
762}
763
764static int metadata_update_start(struct mddev *mddev)
765{
766 return lock_comm(mddev->cluster_info);
767}
768
769static int metadata_update_finish(struct mddev *mddev)
770{
771 struct md_cluster_info *cinfo = mddev->cluster_info;
772 struct cluster_msg cmsg;
773 int ret;
774
775 memset(&cmsg, 0, sizeof(cmsg));
776 cmsg.type = cpu_to_le32(METADATA_UPDATED);
777 ret = __sendmsg(cinfo, &cmsg);
778 unlock_comm(cinfo);
779 return ret;
780}
781
782static int metadata_update_cancel(struct mddev *mddev)
783{
784 struct md_cluster_info *cinfo = mddev->cluster_info;
785
786 return dlm_unlock_sync(cinfo->token_lockres);
787}
788
789static int resync_send(struct mddev *mddev, enum msg_type type,
790 sector_t lo, sector_t hi)
791{
792 struct md_cluster_info *cinfo = mddev->cluster_info;
793 struct cluster_msg cmsg;
794 int slot = cinfo->slot_number - 1;
795
796 pr_info("%s:%d lo: %llu hi: %llu\n", __func__, __LINE__,
797 (unsigned long long)lo,
798 (unsigned long long)hi);
799 resync_info_update(mddev, lo, hi);
800 cmsg.type = cpu_to_le32(type);
801 cmsg.slot = cpu_to_le32(slot);
802 cmsg.low = cpu_to_le64(lo);
803 cmsg.high = cpu_to_le64(hi);
804 return sendmsg(cinfo, &cmsg);
805}
806
807static int resync_start(struct mddev *mddev, sector_t lo, sector_t hi)
808{
809 pr_info("%s:%d\n", __func__, __LINE__);
810 return resync_send(mddev, RESYNCING, lo, hi);
811}
812
813static void resync_finish(struct mddev *mddev)
814{
815 pr_info("%s:%d\n", __func__, __LINE__);
816 resync_send(mddev, RESYNCING, 0, 0);
817}
818
819static int area_resyncing(struct mddev *mddev, sector_t lo, sector_t hi)
820{
821 struct md_cluster_info *cinfo = mddev->cluster_info;
822 int ret = 0;
823 struct suspend_info *s;
824
825 spin_lock_irq(&cinfo->suspend_lock);
826 if (list_empty(&cinfo->suspend_list))
827 goto out;
828 list_for_each_entry(s, &cinfo->suspend_list, list)
829 if (hi > s->lo && lo < s->hi) {
830 ret = 1;
831 break;
832 }
833out:
834 spin_unlock_irq(&cinfo->suspend_lock);
835 return ret;
836}
837
838static int add_new_disk_start(struct mddev *mddev, struct md_rdev *rdev)
839{
840 struct md_cluster_info *cinfo = mddev->cluster_info;
841 struct cluster_msg cmsg;
842 int ret = 0;
843 struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
844 char *uuid = sb->device_uuid;
845
846 memset(&cmsg, 0, sizeof(cmsg));
847 cmsg.type = cpu_to_le32(NEWDISK);
848 memcpy(cmsg.uuid, uuid, 16);
849 cmsg.raid_slot = rdev->desc_nr;
850 lock_comm(cinfo);
851 ret = __sendmsg(cinfo, &cmsg);
852 if (ret)
853 return ret;
854 cinfo->no_new_dev_lockres->flags |= DLM_LKF_NOQUEUE;
855 ret = dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_EX);
856 cinfo->no_new_dev_lockres->flags &= ~DLM_LKF_NOQUEUE;
857 /* Some node does not "see" the device */
858 if (ret == -EAGAIN)
859 ret = -ENOENT;
860 else
861 dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
862 return ret;
863}
864
865static int add_new_disk_finish(struct mddev *mddev)
866{
867 struct cluster_msg cmsg;
868 struct md_cluster_info *cinfo = mddev->cluster_info;
869 int ret;
870 /* Write sb and inform others */
871 md_update_sb(mddev, 1);
872 cmsg.type = METADATA_UPDATED;
873 ret = __sendmsg(cinfo, &cmsg);
874 unlock_comm(cinfo);
875 return ret;
876}
877
878static int new_disk_ack(struct mddev *mddev, bool ack)
879{
880 struct md_cluster_info *cinfo = mddev->cluster_info;
881
882 if (!test_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state)) {
883 pr_warn("md-cluster(%s): Spurious cluster confirmation\n", mdname(mddev));
884 return -EINVAL;
885 }
886
887 if (ack)
888 dlm_unlock_sync(cinfo->no_new_dev_lockres);
889 complete(&cinfo->newdisk_completion);
890 return 0;
891}
892
893static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
894{
895 struct cluster_msg cmsg;
896 struct md_cluster_info *cinfo = mddev->cluster_info;
897 cmsg.type = REMOVE;
898 cmsg.raid_slot = rdev->desc_nr;
899 return __sendmsg(cinfo, &cmsg);
900}
901
902static int gather_bitmaps(struct md_rdev *rdev)
903{
904 int sn, err;
905 sector_t lo, hi;
906 struct cluster_msg cmsg;
907 struct mddev *mddev = rdev->mddev;
908 struct md_cluster_info *cinfo = mddev->cluster_info;
909
910 cmsg.type = RE_ADD;
911 cmsg.raid_slot = rdev->desc_nr;
912 err = sendmsg(cinfo, &cmsg);
913 if (err)
914 goto out;
915
916 for (sn = 0; sn < mddev->bitmap_info.nodes; sn++) {
917 if (sn == (cinfo->slot_number - 1))
918 continue;
919 err = bitmap_copy_from_slot(mddev, sn, &lo, &hi, false);
920 if (err) {
921 pr_warn("md-cluster: Could not gather bitmaps from slot %d", sn);
922 goto out;
923 }
924 if ((hi > 0) && (lo < mddev->recovery_cp))
925 mddev->recovery_cp = lo;
926 }
927out:
928 return err;
929}
930
931static struct md_cluster_operations cluster_ops = {
932 .join = join,
933 .leave = leave,
934 .slot_number = slot_number,
935 .resync_info_update = resync_info_update,
936 .resync_start = resync_start,
937 .resync_finish = resync_finish,
938 .metadata_update_start = metadata_update_start,
939 .metadata_update_finish = metadata_update_finish,
940 .metadata_update_cancel = metadata_update_cancel,
941 .area_resyncing = area_resyncing,
942 .add_new_disk_start = add_new_disk_start,
943 .add_new_disk_finish = add_new_disk_finish,
944 .new_disk_ack = new_disk_ack,
945 .remove_disk = remove_disk,
946 .gather_bitmaps = gather_bitmaps,
947};
948
949static int __init cluster_init(void)
950{
951 pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n");
952 pr_info("Registering Cluster MD functions\n");
953 register_md_cluster_operations(&cluster_ops, THIS_MODULE);
954 return 0;
955}
956
957static void cluster_exit(void)
958{
959 unregister_md_cluster_operations();
960}
961
962module_init(cluster_init);
963module_exit(cluster_exit);
964MODULE_LICENSE("GPL");
965MODULE_DESCRIPTION("Clustering support for MD");
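Note: area_resyncing() above reduces to a plain interval-overlap test against the per-node suspend list. A minimal stand-alone sketch of that test follows; the helper name is illustrative and not part of the patch.

#include <linux/types.h>

/* Hypothetical helper (not in the patch): a request [lo, hi) overlaps a
 * suspended region [s_lo, s_hi) iff it starts before the region ends and
 * ends after the region starts -- the exact test area_resyncing() applies
 * to every entry on cinfo->suspend_list.
 */
static bool range_overlaps(sector_t lo, sector_t hi, sector_t s_lo, sector_t s_hi)
{
	return hi > s_lo && lo < s_hi;
}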
diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
new file mode 100644
index 000000000000..6817ee00e053
--- /dev/null
+++ b/drivers/md/md-cluster.h
@@ -0,0 +1,29 @@
1
2
3#ifndef _MD_CLUSTER_H
4#define _MD_CLUSTER_H
5
6#include "md.h"
7
8struct mddev;
9struct md_rdev;
10
11struct md_cluster_operations {
12 int (*join)(struct mddev *mddev, int nodes);
13 int (*leave)(struct mddev *mddev);
14 int (*slot_number)(struct mddev *mddev);
15 void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi);
16 int (*resync_start)(struct mddev *mddev, sector_t lo, sector_t hi);
17 void (*resync_finish)(struct mddev *mddev);
18 int (*metadata_update_start)(struct mddev *mddev);
19 int (*metadata_update_finish)(struct mddev *mddev);
20 int (*metadata_update_cancel)(struct mddev *mddev);
21 int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi);
22 int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev);
23 int (*add_new_disk_finish)(struct mddev *mddev);
24 int (*new_disk_ack)(struct mddev *mddev, bool ack);
25 int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev);
26 int (*gather_bitmaps)(struct md_rdev *rdev);
27};
28
29#endif /* _MD_CLUSTER_H */
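Note: a cluster backend is expected to populate struct md_cluster_operations and hand it to the md core, mirroring what md-cluster.c does with cluster_ops. A skeleton sketch is shown below; the my_* names are stand-ins, only the registration helpers come from this patch, and a real backend must fill in every callback the core may invoke.

#include <linux/module.h>
#include "md.h"		/* register_md_cluster_operations() */

/* Skeleton only: stub callbacks standing in for a real backend. */
static int my_join(struct mddev *mddev, int nodes) { return 0; }
static int my_leave(struct mddev *mddev) { return 0; }
static int my_slot_number(struct mddev *mddev) { return 0; }

static struct md_cluster_operations my_cluster_ops = {
	.join		= my_join,
	.leave		= my_leave,
	.slot_number	= my_slot_number,
	/* the remaining callbacks are wired up the same way */
};

static int __init my_cluster_init(void)
{
	return register_md_cluster_operations(&my_cluster_ops, THIS_MODULE);
}

static void __exit my_cluster_exit(void)
{
	unregister_md_cluster_operations();
}

module_init(my_cluster_init);
module_exit(my_cluster_exit);
MODULE_LICENSE("GPL");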
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e6178787ce3d..d4f31e195e26 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -53,6 +53,7 @@
53#include <linux/slab.h> 53#include <linux/slab.h>
54#include "md.h" 54#include "md.h"
55#include "bitmap.h" 55#include "bitmap.h"
56#include "md-cluster.h"
56 57
57#ifndef MODULE 58#ifndef MODULE
58static void autostart_arrays(int part); 59static void autostart_arrays(int part);
@@ -66,6 +67,11 @@ static void autostart_arrays(int part);
66static LIST_HEAD(pers_list); 67static LIST_HEAD(pers_list);
67static DEFINE_SPINLOCK(pers_lock); 68static DEFINE_SPINLOCK(pers_lock);
68 69
70struct md_cluster_operations *md_cluster_ops;
71EXPORT_SYMBOL(md_cluster_ops);
72struct module *md_cluster_mod;
73EXPORT_SYMBOL(md_cluster_mod);
74
69static DECLARE_WAIT_QUEUE_HEAD(resync_wait); 75static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
70static struct workqueue_struct *md_wq; 76static struct workqueue_struct *md_wq;
71static struct workqueue_struct *md_misc_wq; 77static struct workqueue_struct *md_misc_wq;
@@ -640,7 +646,7 @@ void mddev_unlock(struct mddev *mddev)
640} 646}
641EXPORT_SYMBOL_GPL(mddev_unlock); 647EXPORT_SYMBOL_GPL(mddev_unlock);
642 648
643static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr) 649struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr)
644{ 650{
645 struct md_rdev *rdev; 651 struct md_rdev *rdev;
646 652
@@ -650,6 +656,7 @@ static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr)
650 656
651 return NULL; 657 return NULL;
652} 658}
659EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);
653 660
654static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev) 661static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
655{ 662{
@@ -2047,11 +2054,11 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
2047 int choice = 0; 2054 int choice = 0;
2048 if (mddev->pers) 2055 if (mddev->pers)
2049 choice = mddev->raid_disks; 2056 choice = mddev->raid_disks;
2050 while (find_rdev_nr_rcu(mddev, choice)) 2057 while (md_find_rdev_nr_rcu(mddev, choice))
2051 choice++; 2058 choice++;
2052 rdev->desc_nr = choice; 2059 rdev->desc_nr = choice;
2053 } else { 2060 } else {
2054 if (find_rdev_nr_rcu(mddev, rdev->desc_nr)) { 2061 if (md_find_rdev_nr_rcu(mddev, rdev->desc_nr)) {
2055 rcu_read_unlock(); 2062 rcu_read_unlock();
2056 return -EBUSY; 2063 return -EBUSY;
2057 } 2064 }
@@ -2166,11 +2173,12 @@ static void export_rdev(struct md_rdev *rdev)
2166 kobject_put(&rdev->kobj); 2173 kobject_put(&rdev->kobj);
2167} 2174}
2168 2175
2169static void kick_rdev_from_array(struct md_rdev *rdev) 2176void md_kick_rdev_from_array(struct md_rdev *rdev)
2170{ 2177{
2171 unbind_rdev_from_array(rdev); 2178 unbind_rdev_from_array(rdev);
2172 export_rdev(rdev); 2179 export_rdev(rdev);
2173} 2180}
2181EXPORT_SYMBOL_GPL(md_kick_rdev_from_array);
2174 2182
2175static void export_array(struct mddev *mddev) 2183static void export_array(struct mddev *mddev)
2176{ 2184{
@@ -2179,7 +2187,7 @@ static void export_array(struct mddev *mddev)
2179 while (!list_empty(&mddev->disks)) { 2187 while (!list_empty(&mddev->disks)) {
2180 rdev = list_first_entry(&mddev->disks, struct md_rdev, 2188 rdev = list_first_entry(&mddev->disks, struct md_rdev,
2181 same_set); 2189 same_set);
2182 kick_rdev_from_array(rdev); 2190 md_kick_rdev_from_array(rdev);
2183 } 2191 }
2184 mddev->raid_disks = 0; 2192 mddev->raid_disks = 0;
2185 mddev->major_version = 0; 2193 mddev->major_version = 0;
@@ -2208,7 +2216,7 @@ static void sync_sbs(struct mddev *mddev, int nospares)
2208 } 2216 }
2209} 2217}
2210 2218
2211static void md_update_sb(struct mddev *mddev, int force_change) 2219void md_update_sb(struct mddev *mddev, int force_change)
2212{ 2220{
2213 struct md_rdev *rdev; 2221 struct md_rdev *rdev;
2214 int sync_req; 2222 int sync_req;
@@ -2369,6 +2377,37 @@ repeat:
2369 wake_up(&rdev->blocked_wait); 2377 wake_up(&rdev->blocked_wait);
2370 } 2378 }
2371} 2379}
2380EXPORT_SYMBOL(md_update_sb);
2381
2382static int add_bound_rdev(struct md_rdev *rdev)
2383{
2384 struct mddev *mddev = rdev->mddev;
2385 int err = 0;
2386
2387 if (!mddev->pers->hot_remove_disk) {
2388 /* If there is hot_add_disk but no hot_remove_disk
2389 * then added disks are for geometry changes,
2390 * and should be added immediately.
2391 */
2392 super_types[mddev->major_version].
2393 validate_super(mddev, rdev);
2394 err = mddev->pers->hot_add_disk(mddev, rdev);
2395 if (err) {
2396 unbind_rdev_from_array(rdev);
2397 export_rdev(rdev);
2398 return err;
2399 }
2400 }
2401 sysfs_notify_dirent_safe(rdev->sysfs_state);
2402
2403 set_bit(MD_CHANGE_DEVS, &mddev->flags);
2404 if (mddev->degraded)
2405 set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
2406 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
2407 md_new_event(mddev);
2408 md_wakeup_thread(mddev->thread);
2409 return 0;
2410}
2372 2411
2373/* words written to sysfs files may, or may not, be \n terminated. 2412/* words written to sysfs files may, or may not, be \n terminated.
2374 * We want to accept with case. For this we use cmd_match. 2413 * We want to accept with case. For this we use cmd_match.
@@ -2471,10 +2510,16 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
2471 err = -EBUSY; 2510 err = -EBUSY;
2472 else { 2511 else {
2473 struct mddev *mddev = rdev->mddev; 2512 struct mddev *mddev = rdev->mddev;
2474 kick_rdev_from_array(rdev); 2513 if (mddev_is_clustered(mddev))
2514 md_cluster_ops->remove_disk(mddev, rdev);
2515 md_kick_rdev_from_array(rdev);
2516 if (mddev_is_clustered(mddev))
2517 md_cluster_ops->metadata_update_start(mddev);
2475 if (mddev->pers) 2518 if (mddev->pers)
2476 md_update_sb(mddev, 1); 2519 md_update_sb(mddev, 1);
2477 md_new_event(mddev); 2520 md_new_event(mddev);
2521 if (mddev_is_clustered(mddev))
2522 md_cluster_ops->metadata_update_finish(mddev);
2478 err = 0; 2523 err = 0;
2479 } 2524 }
2480 } else if (cmd_match(buf, "writemostly")) { 2525 } else if (cmd_match(buf, "writemostly")) {
@@ -2553,6 +2598,21 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
2553 clear_bit(Replacement, &rdev->flags); 2598 clear_bit(Replacement, &rdev->flags);
2554 err = 0; 2599 err = 0;
2555 } 2600 }
2601 } else if (cmd_match(buf, "re-add")) {
2602 if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) {
2603 /* clear_bit is performed _after_ all the devices
2604 * have their local Faulty bit cleared. If any writes
2605 * happen in the meantime in the local node, they
2606 * will land in the local bitmap, which will be synced
2607 * by this node eventually
2608 */
2609 if (!mddev_is_clustered(rdev->mddev) ||
2610 (err = md_cluster_ops->gather_bitmaps(rdev)) == 0) {
2611 clear_bit(Faulty, &rdev->flags);
2612 err = add_bound_rdev(rdev);
2613 }
2614 } else
2615 err = -EBUSY;
2556 } 2616 }
2557 if (!err) 2617 if (!err)
2558 sysfs_notify_dirent_safe(rdev->sysfs_state); 2618 sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -3127,7 +3187,7 @@ static void analyze_sbs(struct mddev *mddev)
3127 "md: fatal superblock inconsistency in %s" 3187 "md: fatal superblock inconsistency in %s"
3128 " -- removing from array\n", 3188 " -- removing from array\n",
3129 bdevname(rdev->bdev,b)); 3189 bdevname(rdev->bdev,b));
3130 kick_rdev_from_array(rdev); 3190 md_kick_rdev_from_array(rdev);
3131 } 3191 }
3132 3192
3133 super_types[mddev->major_version]. 3193 super_types[mddev->major_version].
@@ -3142,18 +3202,27 @@ static void analyze_sbs(struct mddev *mddev)
3142 "md: %s: %s: only %d devices permitted\n", 3202 "md: %s: %s: only %d devices permitted\n",
3143 mdname(mddev), bdevname(rdev->bdev, b), 3203 mdname(mddev), bdevname(rdev->bdev, b),
3144 mddev->max_disks); 3204 mddev->max_disks);
3145 kick_rdev_from_array(rdev); 3205 md_kick_rdev_from_array(rdev);
3146 continue; 3206 continue;
3147 } 3207 }
3148 if (rdev != freshest) 3208 if (rdev != freshest) {
3149 if (super_types[mddev->major_version]. 3209 if (super_types[mddev->major_version].
3150 validate_super(mddev, rdev)) { 3210 validate_super(mddev, rdev)) {
3151 printk(KERN_WARNING "md: kicking non-fresh %s" 3211 printk(KERN_WARNING "md: kicking non-fresh %s"
3152 " from array!\n", 3212 " from array!\n",
3153 bdevname(rdev->bdev,b)); 3213 bdevname(rdev->bdev,b));
3154 kick_rdev_from_array(rdev); 3214 md_kick_rdev_from_array(rdev);
3155 continue; 3215 continue;
3156 } 3216 }
3217 /* No device should have a Candidate flag
3218 * when reading devices
3219 */
3220 if (test_bit(Candidate, &rdev->flags)) {
3221 pr_info("md: kicking Cluster Candidate %s from array!\n",
3222 bdevname(rdev->bdev, b));
3223 md_kick_rdev_from_array(rdev);
3224 }
3225 }
3157 if (mddev->level == LEVEL_MULTIPATH) { 3226 if (mddev->level == LEVEL_MULTIPATH) {
3158 rdev->desc_nr = i++; 3227 rdev->desc_nr = i++;
3159 rdev->raid_disk = rdev->desc_nr; 3228 rdev->raid_disk = rdev->desc_nr;
@@ -4008,8 +4077,12 @@ size_store(struct mddev *mddev, const char *buf, size_t len)
4008 if (err) 4077 if (err)
4009 return err; 4078 return err;
4010 if (mddev->pers) { 4079 if (mddev->pers) {
4080 if (mddev_is_clustered(mddev))
4081 md_cluster_ops->metadata_update_start(mddev);
4011 err = update_size(mddev, sectors); 4082 err = update_size(mddev, sectors);
4012 md_update_sb(mddev, 1); 4083 md_update_sb(mddev, 1);
4084 if (mddev_is_clustered(mddev))
4085 md_cluster_ops->metadata_update_finish(mddev);
4013 } else { 4086 } else {
4014 if (mddev->dev_sectors == 0 || 4087 if (mddev->dev_sectors == 0 ||
4015 mddev->dev_sectors > sectors) 4088 mddev->dev_sectors > sectors)
@@ -4354,7 +4427,6 @@ min_sync_store(struct mddev *mddev, const char *buf, size_t len)
4354{ 4427{
4355 unsigned long long min; 4428 unsigned long long min;
4356 int err; 4429 int err;
4357 int chunk;
4358 4430
4359 if (kstrtoull(buf, 10, &min)) 4431 if (kstrtoull(buf, 10, &min))
4360 return -EINVAL; 4432 return -EINVAL;
@@ -4368,16 +4440,8 @@ min_sync_store(struct mddev *mddev, const char *buf, size_t len)
4368 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 4440 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
4369 goto out_unlock; 4441 goto out_unlock;
4370 4442
4371 /* Must be a multiple of chunk_size */ 4443 /* Round down to multiple of 4K for safety */
4372 chunk = mddev->chunk_sectors; 4444 mddev->resync_min = round_down(min, 8);
4373 if (chunk) {
4374 sector_t temp = min;
4375
4376 err = -EINVAL;
4377 if (sector_div(temp, chunk))
4378 goto out_unlock;
4379 }
4380 mddev->resync_min = min;
4381 err = 0; 4445 err = 0;
4382 4446
4383out_unlock: 4447out_unlock:
@@ -5077,10 +5141,16 @@ int md_run(struct mddev *mddev)
5077 } 5141 }
5078 if (err == 0 && pers->sync_request && 5142 if (err == 0 && pers->sync_request &&
5079 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { 5143 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
5080 err = bitmap_create(mddev); 5144 struct bitmap *bitmap;
5081 if (err) 5145
5146 bitmap = bitmap_create(mddev, -1);
5147 if (IS_ERR(bitmap)) {
5148 err = PTR_ERR(bitmap);
5082 printk(KERN_ERR "%s: failed to create bitmap (%d)\n", 5149 printk(KERN_ERR "%s: failed to create bitmap (%d)\n",
5083 mdname(mddev), err); 5150 mdname(mddev), err);
5151 } else
5152 mddev->bitmap = bitmap;
5153
5084 } 5154 }
5085 if (err) { 5155 if (err) {
5086 mddev_detach(mddev); 5156 mddev_detach(mddev);
@@ -5232,6 +5302,8 @@ static void md_clean(struct mddev *mddev)
5232 5302
5233static void __md_stop_writes(struct mddev *mddev) 5303static void __md_stop_writes(struct mddev *mddev)
5234{ 5304{
5305 if (mddev_is_clustered(mddev))
5306 md_cluster_ops->metadata_update_start(mddev);
5235 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5307 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
5236 flush_workqueue(md_misc_wq); 5308 flush_workqueue(md_misc_wq);
5237 if (mddev->sync_thread) { 5309 if (mddev->sync_thread) {
@@ -5250,6 +5322,8 @@ static void __md_stop_writes(struct mddev *mddev)
5250 mddev->in_sync = 1; 5322 mddev->in_sync = 1;
5251 md_update_sb(mddev, 1); 5323 md_update_sb(mddev, 1);
5252 } 5324 }
5325 if (mddev_is_clustered(mddev))
5326 md_cluster_ops->metadata_update_finish(mddev);
5253} 5327}
5254 5328
5255void md_stop_writes(struct mddev *mddev) 5329void md_stop_writes(struct mddev *mddev)
@@ -5636,6 +5710,8 @@ static int get_array_info(struct mddev *mddev, void __user *arg)
5636 info.state = (1<<MD_SB_CLEAN); 5710 info.state = (1<<MD_SB_CLEAN);
5637 if (mddev->bitmap && mddev->bitmap_info.offset) 5711 if (mddev->bitmap && mddev->bitmap_info.offset)
5638 info.state |= (1<<MD_SB_BITMAP_PRESENT); 5712 info.state |= (1<<MD_SB_BITMAP_PRESENT);
5713 if (mddev_is_clustered(mddev))
5714 info.state |= (1<<MD_SB_CLUSTERED);
5639 info.active_disks = insync; 5715 info.active_disks = insync;
5640 info.working_disks = working; 5716 info.working_disks = working;
5641 info.failed_disks = failed; 5717 info.failed_disks = failed;
@@ -5691,7 +5767,7 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
5691 return -EFAULT; 5767 return -EFAULT;
5692 5768
5693 rcu_read_lock(); 5769 rcu_read_lock();
5694 rdev = find_rdev_nr_rcu(mddev, info.number); 5770 rdev = md_find_rdev_nr_rcu(mddev, info.number);
5695 if (rdev) { 5771 if (rdev) {
5696 info.major = MAJOR(rdev->bdev->bd_dev); 5772 info.major = MAJOR(rdev->bdev->bd_dev);
5697 info.minor = MINOR(rdev->bdev->bd_dev); 5773 info.minor = MINOR(rdev->bdev->bd_dev);
@@ -5724,6 +5800,13 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
5724 struct md_rdev *rdev; 5800 struct md_rdev *rdev;
5725 dev_t dev = MKDEV(info->major,info->minor); 5801 dev_t dev = MKDEV(info->major,info->minor);
5726 5802
5803 if (mddev_is_clustered(mddev) &&
5804 !(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) {
5805 pr_err("%s: Cannot add to clustered mddev.\n",
5806 mdname(mddev));
5807 return -EINVAL;
5808 }
5809
5727 if (info->major != MAJOR(dev) || info->minor != MINOR(dev)) 5810 if (info->major != MAJOR(dev) || info->minor != MINOR(dev))
5728 return -EOVERFLOW; 5811 return -EOVERFLOW;
5729 5812
@@ -5810,31 +5893,38 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
5810 else 5893 else
5811 clear_bit(WriteMostly, &rdev->flags); 5894 clear_bit(WriteMostly, &rdev->flags);
5812 5895
5896 /*
5897 * check whether the device shows up in other nodes
5898 */
5899 if (mddev_is_clustered(mddev)) {
5900 if (info->state & (1 << MD_DISK_CANDIDATE)) {
5901 /* Through --cluster-confirm */
5902 set_bit(Candidate, &rdev->flags);
5903 err = md_cluster_ops->new_disk_ack(mddev, true);
5904 if (err) {
5905 export_rdev(rdev);
5906 return err;
5907 }
5908 } else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) {
5909 /* --add initiated by this node */
5910 err = md_cluster_ops->add_new_disk_start(mddev, rdev);
5911 if (err) {
5912 md_cluster_ops->add_new_disk_finish(mddev);
5913 export_rdev(rdev);
5914 return err;
5915 }
5916 }
5917 }
5918
5813 rdev->raid_disk = -1; 5919 rdev->raid_disk = -1;
5814 err = bind_rdev_to_array(rdev, mddev); 5920 err = bind_rdev_to_array(rdev, mddev);
5815 if (!err && !mddev->pers->hot_remove_disk) {
5816 /* If there is hot_add_disk but no hot_remove_disk
5817 * then added disks for geometry changes,
5818 * and should be added immediately.
5819 */
5820 super_types[mddev->major_version].
5821 validate_super(mddev, rdev);
5822 err = mddev->pers->hot_add_disk(mddev, rdev);
5823 if (err)
5824 unbind_rdev_from_array(rdev);
5825 }
5826 if (err) 5921 if (err)
5827 export_rdev(rdev); 5922 export_rdev(rdev);
5828 else 5923 else
5829 sysfs_notify_dirent_safe(rdev->sysfs_state); 5924 err = add_bound_rdev(rdev);
5830 5925 if (mddev_is_clustered(mddev) &&
5831 set_bit(MD_CHANGE_DEVS, &mddev->flags); 5926 (info->state & (1 << MD_DISK_CLUSTER_ADD)))
5832 if (mddev->degraded) 5927 md_cluster_ops->add_new_disk_finish(mddev);
5833 set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
5834 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
5835 if (!err)
5836 md_new_event(mddev);
5837 md_wakeup_thread(mddev->thread);
5838 return err; 5928 return err;
5839 } 5929 }
5840 5930
@@ -5895,18 +5985,29 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
5895 if (!rdev) 5985 if (!rdev)
5896 return -ENXIO; 5986 return -ENXIO;
5897 5987
5988 if (mddev_is_clustered(mddev))
5989 md_cluster_ops->metadata_update_start(mddev);
5990
5898 clear_bit(Blocked, &rdev->flags); 5991 clear_bit(Blocked, &rdev->flags);
5899 remove_and_add_spares(mddev, rdev); 5992 remove_and_add_spares(mddev, rdev);
5900 5993
5901 if (rdev->raid_disk >= 0) 5994 if (rdev->raid_disk >= 0)
5902 goto busy; 5995 goto busy;
5903 5996
5904 kick_rdev_from_array(rdev); 5997 if (mddev_is_clustered(mddev))
5998 md_cluster_ops->remove_disk(mddev, rdev);
5999
6000 md_kick_rdev_from_array(rdev);
5905 md_update_sb(mddev, 1); 6001 md_update_sb(mddev, 1);
5906 md_new_event(mddev); 6002 md_new_event(mddev);
5907 6003
6004 if (mddev_is_clustered(mddev))
6005 md_cluster_ops->metadata_update_finish(mddev);
6006
5908 return 0; 6007 return 0;
5909busy: 6008busy:
6009 if (mddev_is_clustered(mddev))
6010 md_cluster_ops->metadata_update_cancel(mddev);
5910 printk(KERN_WARNING "md: cannot remove active disk %s from %s ...\n", 6011 printk(KERN_WARNING "md: cannot remove active disk %s from %s ...\n",
5911 bdevname(rdev->bdev,b), mdname(mddev)); 6012 bdevname(rdev->bdev,b), mdname(mddev));
5912 return -EBUSY; 6013 return -EBUSY;
@@ -5956,12 +6057,15 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
5956 err = -EINVAL; 6057 err = -EINVAL;
5957 goto abort_export; 6058 goto abort_export;
5958 } 6059 }
6060
6061 if (mddev_is_clustered(mddev))
6062 md_cluster_ops->metadata_update_start(mddev);
5959 clear_bit(In_sync, &rdev->flags); 6063 clear_bit(In_sync, &rdev->flags);
5960 rdev->desc_nr = -1; 6064 rdev->desc_nr = -1;
5961 rdev->saved_raid_disk = -1; 6065 rdev->saved_raid_disk = -1;
5962 err = bind_rdev_to_array(rdev, mddev); 6066 err = bind_rdev_to_array(rdev, mddev);
5963 if (err) 6067 if (err)
5964 goto abort_export; 6068 goto abort_clustered;
5965 6069
5966 /* 6070 /*
5967 * The rest should better be atomic, we can have disk failures 6071 * The rest should better be atomic, we can have disk failures
@@ -5972,6 +6076,8 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
5972 6076
5973 md_update_sb(mddev, 1); 6077 md_update_sb(mddev, 1);
5974 6078
6079 if (mddev_is_clustered(mddev))
6080 md_cluster_ops->metadata_update_finish(mddev);
5975 /* 6081 /*
5976 * Kick recovery, maybe this spare has to be added to the 6082 * Kick recovery, maybe this spare has to be added to the
5977 * array immediately. 6083 * array immediately.
@@ -5981,6 +6087,9 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
5981 md_new_event(mddev); 6087 md_new_event(mddev);
5982 return 0; 6088 return 0;
5983 6089
6090abort_clustered:
6091 if (mddev_is_clustered(mddev))
6092 md_cluster_ops->metadata_update_cancel(mddev);
5984abort_export: 6093abort_export:
5985 export_rdev(rdev); 6094 export_rdev(rdev);
5986 return err; 6095 return err;
@@ -6038,9 +6147,14 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
6038 if (mddev->pers) { 6147 if (mddev->pers) {
6039 mddev->pers->quiesce(mddev, 1); 6148 mddev->pers->quiesce(mddev, 1);
6040 if (fd >= 0) { 6149 if (fd >= 0) {
6041 err = bitmap_create(mddev); 6150 struct bitmap *bitmap;
6042 if (!err) 6151
6152 bitmap = bitmap_create(mddev, -1);
6153 if (!IS_ERR(bitmap)) {
6154 mddev->bitmap = bitmap;
6043 err = bitmap_load(mddev); 6155 err = bitmap_load(mddev);
6156 } else
6157 err = PTR_ERR(bitmap);
6044 } 6158 }
6045 if (fd < 0 || err) { 6159 if (fd < 0 || err) {
6046 bitmap_destroy(mddev); 6160 bitmap_destroy(mddev);
@@ -6293,6 +6407,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
6293 return rv; 6407 return rv;
6294 } 6408 }
6295 } 6409 }
6410 if (mddev_is_clustered(mddev))
6411 md_cluster_ops->metadata_update_start(mddev);
6296 if (info->size >= 0 && mddev->dev_sectors / 2 != info->size) 6412 if (info->size >= 0 && mddev->dev_sectors / 2 != info->size)
6297 rv = update_size(mddev, (sector_t)info->size * 2); 6413 rv = update_size(mddev, (sector_t)info->size * 2);
6298 6414
@@ -6300,33 +6416,49 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
6300 rv = update_raid_disks(mddev, info->raid_disks); 6416 rv = update_raid_disks(mddev, info->raid_disks);
6301 6417
6302 if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) { 6418 if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) {
6303 if (mddev->pers->quiesce == NULL || mddev->thread == NULL) 6419 if (mddev->pers->quiesce == NULL || mddev->thread == NULL) {
6304 return -EINVAL; 6420 rv = -EINVAL;
6305 if (mddev->recovery || mddev->sync_thread) 6421 goto err;
6306 return -EBUSY; 6422 }
6423 if (mddev->recovery || mddev->sync_thread) {
6424 rv = -EBUSY;
6425 goto err;
6426 }
6307 if (info->state & (1<<MD_SB_BITMAP_PRESENT)) { 6427 if (info->state & (1<<MD_SB_BITMAP_PRESENT)) {
6428 struct bitmap *bitmap;
6308 /* add the bitmap */ 6429 /* add the bitmap */
6309 if (mddev->bitmap) 6430 if (mddev->bitmap) {
6310 return -EEXIST; 6431 rv = -EEXIST;
6311 if (mddev->bitmap_info.default_offset == 0) 6432 goto err;
6312 return -EINVAL; 6433 }
6434 if (mddev->bitmap_info.default_offset == 0) {
6435 rv = -EINVAL;
6436 goto err;
6437 }
6313 mddev->bitmap_info.offset = 6438 mddev->bitmap_info.offset =
6314 mddev->bitmap_info.default_offset; 6439 mddev->bitmap_info.default_offset;
6315 mddev->bitmap_info.space = 6440 mddev->bitmap_info.space =
6316 mddev->bitmap_info.default_space; 6441 mddev->bitmap_info.default_space;
6317 mddev->pers->quiesce(mddev, 1); 6442 mddev->pers->quiesce(mddev, 1);
6318 rv = bitmap_create(mddev); 6443 bitmap = bitmap_create(mddev, -1);
6319 if (!rv) 6444 if (!IS_ERR(bitmap)) {
6445 mddev->bitmap = bitmap;
6320 rv = bitmap_load(mddev); 6446 rv = bitmap_load(mddev);
6447 } else
6448 rv = PTR_ERR(bitmap);
6321 if (rv) 6449 if (rv)
6322 bitmap_destroy(mddev); 6450 bitmap_destroy(mddev);
6323 mddev->pers->quiesce(mddev, 0); 6451 mddev->pers->quiesce(mddev, 0);
6324 } else { 6452 } else {
6325 /* remove the bitmap */ 6453 /* remove the bitmap */
6326 if (!mddev->bitmap) 6454 if (!mddev->bitmap) {
6327 return -ENOENT; 6455 rv = -ENOENT;
6328 if (mddev->bitmap->storage.file) 6456 goto err;
6329 return -EINVAL; 6457 }
6458 if (mddev->bitmap->storage.file) {
6459 rv = -EINVAL;
6460 goto err;
6461 }
6330 mddev->pers->quiesce(mddev, 1); 6462 mddev->pers->quiesce(mddev, 1);
6331 bitmap_destroy(mddev); 6463 bitmap_destroy(mddev);
6332 mddev->pers->quiesce(mddev, 0); 6464 mddev->pers->quiesce(mddev, 0);
@@ -6334,6 +6466,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
6334 } 6466 }
6335 } 6467 }
6336 md_update_sb(mddev, 1); 6468 md_update_sb(mddev, 1);
6469 if (mddev_is_clustered(mddev))
6470 md_cluster_ops->metadata_update_finish(mddev);
6471 return rv;
6472err:
6473 if (mddev_is_clustered(mddev))
6474 md_cluster_ops->metadata_update_cancel(mddev);
6337 return rv; 6475 return rv;
6338} 6476}
6339 6477
@@ -6393,6 +6531,7 @@ static inline bool md_ioctl_valid(unsigned int cmd)
6393 case SET_DISK_FAULTY: 6531 case SET_DISK_FAULTY:
6394 case STOP_ARRAY: 6532 case STOP_ARRAY:
6395 case STOP_ARRAY_RO: 6533 case STOP_ARRAY_RO:
6534 case CLUSTERED_DISK_NACK:
6396 return true; 6535 return true;
6397 default: 6536 default:
6398 return false; 6537 return false;
@@ -6665,6 +6804,13 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
6665 goto unlock; 6804 goto unlock;
6666 } 6805 }
6667 6806
6807 case CLUSTERED_DISK_NACK:
6808 if (mddev_is_clustered(mddev))
6809 md_cluster_ops->new_disk_ack(mddev, false);
6810 else
6811 err = -EINVAL;
6812 goto unlock;
6813
6668 case HOT_ADD_DISK: 6814 case HOT_ADD_DISK:
6669 err = hot_add_disk(mddev, new_decode_dev(arg)); 6815 err = hot_add_disk(mddev, new_decode_dev(arg));
6670 goto unlock; 6816 goto unlock;
@@ -7238,6 +7384,55 @@ int unregister_md_personality(struct md_personality *p)
7238} 7384}
7239EXPORT_SYMBOL(unregister_md_personality); 7385EXPORT_SYMBOL(unregister_md_personality);
7240 7386
7387int register_md_cluster_operations(struct md_cluster_operations *ops, struct module *module)
7388{
7389 if (md_cluster_ops != NULL)
7390 return -EALREADY;
7391 spin_lock(&pers_lock);
7392 md_cluster_ops = ops;
7393 md_cluster_mod = module;
7394 spin_unlock(&pers_lock);
7395 return 0;
7396}
7397EXPORT_SYMBOL(register_md_cluster_operations);
7398
7399int unregister_md_cluster_operations(void)
7400{
7401 spin_lock(&pers_lock);
7402 md_cluster_ops = NULL;
7403 spin_unlock(&pers_lock);
7404 return 0;
7405}
7406EXPORT_SYMBOL(unregister_md_cluster_operations);
7407
7408int md_setup_cluster(struct mddev *mddev, int nodes)
7409{
7410 int err;
7411
7412 err = request_module("md-cluster");
7413 if (err) {
7414 pr_err("md-cluster module not found.\n");
7415 return err;
7416 }
7417
7418 spin_lock(&pers_lock);
7419 if (!md_cluster_ops || !try_module_get(md_cluster_mod)) {
7420 spin_unlock(&pers_lock);
7421 return -ENOENT;
7422 }
7423 spin_unlock(&pers_lock);
7424
7425 return md_cluster_ops->join(mddev, nodes);
7426}
7427
7428void md_cluster_stop(struct mddev *mddev)
7429{
7430 if (!md_cluster_ops)
7431 return;
7432 md_cluster_ops->leave(mddev);
7433 module_put(md_cluster_mod);
7434}
7435
7241static int is_mddev_idle(struct mddev *mddev, int init) 7436static int is_mddev_idle(struct mddev *mddev, int init)
7242{ 7437{
7243 struct md_rdev *rdev; 7438 struct md_rdev *rdev;
@@ -7375,7 +7570,11 @@ int md_allow_write(struct mddev *mddev)
7375 mddev->safemode == 0) 7570 mddev->safemode == 0)
7376 mddev->safemode = 1; 7571 mddev->safemode = 1;
7377 spin_unlock(&mddev->lock); 7572 spin_unlock(&mddev->lock);
7573 if (mddev_is_clustered(mddev))
7574 md_cluster_ops->metadata_update_start(mddev);
7378 md_update_sb(mddev, 0); 7575 md_update_sb(mddev, 0);
7576 if (mddev_is_clustered(mddev))
7577 md_cluster_ops->metadata_update_finish(mddev);
7379 sysfs_notify_dirent_safe(mddev->sysfs_state); 7578 sysfs_notify_dirent_safe(mddev->sysfs_state);
7380 } else 7579 } else
7381 spin_unlock(&mddev->lock); 7580 spin_unlock(&mddev->lock);
@@ -7576,6 +7775,9 @@ void md_do_sync(struct md_thread *thread)
7576 md_new_event(mddev); 7775 md_new_event(mddev);
7577 update_time = jiffies; 7776 update_time = jiffies;
7578 7777
7778 if (mddev_is_clustered(mddev))
7779 md_cluster_ops->resync_start(mddev, j, max_sectors);
7780
7579 blk_start_plug(&plug); 7781 blk_start_plug(&plug);
7580 while (j < max_sectors) { 7782 while (j < max_sectors) {
7581 sector_t sectors; 7783 sector_t sectors;
@@ -7618,8 +7820,7 @@ void md_do_sync(struct md_thread *thread)
7618 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 7820 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
7619 break; 7821 break;
7620 7822
7621 sectors = mddev->pers->sync_request(mddev, j, &skipped, 7823 sectors = mddev->pers->sync_request(mddev, j, &skipped);
7622 currspeed < speed_min(mddev));
7623 if (sectors == 0) { 7824 if (sectors == 0) {
7624 set_bit(MD_RECOVERY_INTR, &mddev->recovery); 7825 set_bit(MD_RECOVERY_INTR, &mddev->recovery);
7625 break; 7826 break;
@@ -7636,6 +7837,8 @@ void md_do_sync(struct md_thread *thread)
7636 j += sectors; 7837 j += sectors;
7637 if (j > 2) 7838 if (j > 2)
7638 mddev->curr_resync = j; 7839 mddev->curr_resync = j;
7840 if (mddev_is_clustered(mddev))
7841 md_cluster_ops->resync_info_update(mddev, j, max_sectors);
7639 mddev->curr_mark_cnt = io_sectors; 7842 mddev->curr_mark_cnt = io_sectors;
7640 if (last_check == 0) 7843 if (last_check == 0)
7641 /* this is the earliest that rebuild will be 7844 /* this is the earliest that rebuild will be
@@ -7677,11 +7880,18 @@ void md_do_sync(struct md_thread *thread)
7677 /((jiffies-mddev->resync_mark)/HZ +1) +1; 7880 /((jiffies-mddev->resync_mark)/HZ +1) +1;
7678 7881
7679 if (currspeed > speed_min(mddev)) { 7882 if (currspeed > speed_min(mddev)) {
7680 if ((currspeed > speed_max(mddev)) || 7883 if (currspeed > speed_max(mddev)) {
7681 !is_mddev_idle(mddev, 0)) {
7682 msleep(500); 7884 msleep(500);
7683 goto repeat; 7885 goto repeat;
7684 } 7886 }
7887 if (!is_mddev_idle(mddev, 0)) {
7888 /*
7889 * Give other IO more of a chance.
7890 * The faster the devices, the less we wait.
7891 */
7892 wait_event(mddev->recovery_wait,
7893 !atomic_read(&mddev->recovery_active));
7894 }
7685 } 7895 }
7686 } 7896 }
7687 printk(KERN_INFO "md: %s: %s %s.\n",mdname(mddev), desc, 7897 printk(KERN_INFO "md: %s: %s %s.\n",mdname(mddev), desc,
@@ -7694,7 +7904,10 @@ void md_do_sync(struct md_thread *thread)
7694 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); 7904 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
7695 7905
7696 /* tell personality that we are finished */ 7906 /* tell personality that we are finished */
7697 mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); 7907 mddev->pers->sync_request(mddev, max_sectors, &skipped);
7908
7909 if (mddev_is_clustered(mddev))
7910 md_cluster_ops->resync_finish(mddev);
7698 7911
7699 if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 7912 if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
7700 mddev->curr_resync > 2) { 7913 mddev->curr_resync > 2) {
@@ -7925,8 +8138,13 @@ void md_check_recovery(struct mddev *mddev)
7925 sysfs_notify_dirent_safe(mddev->sysfs_state); 8138 sysfs_notify_dirent_safe(mddev->sysfs_state);
7926 } 8139 }
7927 8140
7928 if (mddev->flags & MD_UPDATE_SB_FLAGS) 8141 if (mddev->flags & MD_UPDATE_SB_FLAGS) {
8142 if (mddev_is_clustered(mddev))
8143 md_cluster_ops->metadata_update_start(mddev);
7929 md_update_sb(mddev, 0); 8144 md_update_sb(mddev, 0);
8145 if (mddev_is_clustered(mddev))
8146 md_cluster_ops->metadata_update_finish(mddev);
8147 }
7930 8148
7931 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && 8149 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
7932 !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { 8150 !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
@@ -8024,6 +8242,8 @@ void md_reap_sync_thread(struct mddev *mddev)
8024 set_bit(MD_CHANGE_DEVS, &mddev->flags); 8242 set_bit(MD_CHANGE_DEVS, &mddev->flags);
8025 } 8243 }
8026 } 8244 }
8245 if (mddev_is_clustered(mddev))
8246 md_cluster_ops->metadata_update_start(mddev);
8027 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 8247 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
8028 mddev->pers->finish_reshape) 8248 mddev->pers->finish_reshape)
8029 mddev->pers->finish_reshape(mddev); 8249 mddev->pers->finish_reshape(mddev);
@@ -8036,6 +8256,8 @@ void md_reap_sync_thread(struct mddev *mddev)
8036 rdev->saved_raid_disk = -1; 8256 rdev->saved_raid_disk = -1;
8037 8257
8038 md_update_sb(mddev, 1); 8258 md_update_sb(mddev, 1);
8259 if (mddev_is_clustered(mddev))
8260 md_cluster_ops->metadata_update_finish(mddev);
8039 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); 8261 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
8040 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); 8262 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
8041 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 8263 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
@@ -8656,6 +8878,28 @@ err_wq:
8656 return ret; 8878 return ret;
8657} 8879}
8658 8880
8881void md_reload_sb(struct mddev *mddev)
8882{
8883 struct md_rdev *rdev, *tmp;
8884
8885 rdev_for_each_safe(rdev, tmp, mddev) {
8886 rdev->sb_loaded = 0;
8887 ClearPageUptodate(rdev->sb_page);
8888 }
8889 mddev->raid_disks = 0;
8890 analyze_sbs(mddev);
8891 rdev_for_each_safe(rdev, tmp, mddev) {
8892 struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
8893 /* since we don't write to faulty devices, we figure out if the
8894 * disk is faulty by comparing events
8895 */
8896 if (mddev->events > sb->events)
8897 set_bit(Faulty, &rdev->flags);
8898 }
8899
8900}
8901EXPORT_SYMBOL(md_reload_sb);
8902
8659#ifndef MODULE 8903#ifndef MODULE
8660 8904
8661/* 8905/*
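Note: the md.c hunks above call the cluster hooks from md_do_sync() in a fixed order: announce the resync, keep the other nodes' view of the in-progress window fresh, then announce completion. A condensed sketch of that sequence follows; it is illustrative only, as the real function interleaves these calls with the actual sync requests, speed throttling and error handling.

#include "md.h"

/* Illustration only: the cluster-notification order used by md_do_sync(). */
static void resync_notify_sketch(struct mddev *mddev, sector_t max_sectors)
{
	sector_t j = 0;

	if (mddev_is_clustered(mddev))
		md_cluster_ops->resync_start(mddev, j, max_sectors);

	/* ... each time a chunk is synced, j advances and the window is re-sent ... */
	if (mddev_is_clustered(mddev))
		md_cluster_ops->resync_info_update(mddev, j, max_sectors);

	/* ... once j reaches max_sectors ... */
	if (mddev_is_clustered(mddev))
		md_cluster_ops->resync_finish(mddev);
}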
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 318ca8fd430f..4046a6c6f223 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -23,6 +23,7 @@
23#include <linux/timer.h> 23#include <linux/timer.h>
24#include <linux/wait.h> 24#include <linux/wait.h>
25#include <linux/workqueue.h> 25#include <linux/workqueue.h>
26#include "md-cluster.h"
26 27
27#define MaxSector (~(sector_t)0) 28#define MaxSector (~(sector_t)0)
28 29
@@ -170,6 +171,10 @@ enum flag_bits {
170 * a want_replacement device with same 171 * a want_replacement device with same
171 * raid_disk number. 172 * raid_disk number.
172 */ 173 */
174 Candidate, /* For clustered environments only:
175 * This device is seen locally but not
176 * by the whole cluster
177 */
173}; 178};
174 179
175#define BB_LEN_MASK (0x00000000000001FFULL) 180#define BB_LEN_MASK (0x00000000000001FFULL)
@@ -202,6 +207,8 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
202 int is_new); 207 int is_new);
203extern void md_ack_all_badblocks(struct badblocks *bb); 208extern void md_ack_all_badblocks(struct badblocks *bb);
204 209
210struct md_cluster_info;
211
205struct mddev { 212struct mddev {
206 void *private; 213 void *private;
207 struct md_personality *pers; 214 struct md_personality *pers;
@@ -430,6 +437,8 @@ struct mddev {
430 unsigned long daemon_sleep; /* how many jiffies between updates? */ 437 unsigned long daemon_sleep; /* how many jiffies between updates? */
431 unsigned long max_write_behind; /* write-behind mode */ 438 unsigned long max_write_behind; /* write-behind mode */
432 int external; 439 int external;
440 int nodes; /* Maximum number of nodes in the cluster */
441 char cluster_name[64]; /* Name of the cluster */
433 } bitmap_info; 442 } bitmap_info;
434 443
435 atomic_t max_corr_read_errors; /* max read retries */ 444 atomic_t max_corr_read_errors; /* max read retries */
@@ -448,6 +457,7 @@ struct mddev {
448 struct work_struct flush_work; 457 struct work_struct flush_work;
449 struct work_struct event_work; /* used by dm to report failure event */ 458 struct work_struct event_work; /* used by dm to report failure event */
450 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); 459 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
460 struct md_cluster_info *cluster_info;
451}; 461};
452 462
453static inline int __must_check mddev_lock(struct mddev *mddev) 463static inline int __must_check mddev_lock(struct mddev *mddev)
@@ -496,7 +506,7 @@ struct md_personality
496 int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); 506 int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
497 int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); 507 int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
498 int (*spare_active) (struct mddev *mddev); 508 int (*spare_active) (struct mddev *mddev);
499 sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster); 509 sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped);
500 int (*resize) (struct mddev *mddev, sector_t sectors); 510 int (*resize) (struct mddev *mddev, sector_t sectors);
501 sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); 511 sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);
502 int (*check_reshape) (struct mddev *mddev); 512 int (*check_reshape) (struct mddev *mddev);
@@ -608,6 +618,11 @@ static inline void safe_put_page(struct page *p)
608 618
609extern int register_md_personality(struct md_personality *p); 619extern int register_md_personality(struct md_personality *p);
610extern int unregister_md_personality(struct md_personality *p); 620extern int unregister_md_personality(struct md_personality *p);
621extern int register_md_cluster_operations(struct md_cluster_operations *ops,
622 struct module *module);
623extern int unregister_md_cluster_operations(void);
624extern int md_setup_cluster(struct mddev *mddev, int nodes);
625extern void md_cluster_stop(struct mddev *mddev);
611extern struct md_thread *md_register_thread( 626extern struct md_thread *md_register_thread(
612 void (*run)(struct md_thread *thread), 627 void (*run)(struct md_thread *thread),
613 struct mddev *mddev, 628 struct mddev *mddev,
@@ -654,6 +669,10 @@ extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
654 struct mddev *mddev); 669 struct mddev *mddev);
655 670
656extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); 671extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule);
672extern void md_reload_sb(struct mddev *mddev);
673extern void md_update_sb(struct mddev *mddev, int force);
674extern void md_kick_rdev_from_array(struct md_rdev * rdev);
675struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
657static inline int mddev_check_plugged(struct mddev *mddev) 676static inline int mddev_check_plugged(struct mddev *mddev)
658{ 677{
659 return !!blk_check_plugged(md_unplug, mddev, 678 return !!blk_check_plugged(md_unplug, mddev,
@@ -669,4 +688,9 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
669 } 688 }
670} 689}
671 690
691extern struct md_cluster_operations *md_cluster_ops;
692static inline int mddev_is_clustered(struct mddev *mddev)
693{
694 return mddev->cluster_info && mddev->bitmap_info.nodes > 1;
695}
672#endif /* _MD_MD_H */ 696#endif /* _MD_MD_H */
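Note: with the declarations above in place, every superblock write on a clustered array follows the same bracket that the md.c hunks repeat: take the cluster-wide token, write the superblock, then broadcast METADATA_UPDATED (or cancel on error). A minimal sketch of that calling pattern:

#include "md.h"

/* Illustration only: the metadata-update bracket used around md_update_sb()
 * on clustered arrays. Not a drop-in helper.
 */
static void clustered_update_sb_sketch(struct mddev *mddev)
{
	if (mddev_is_clustered(mddev))
		md_cluster_ops->metadata_update_start(mddev);

	md_update_sb(mddev, 1);		/* exported by this patch */

	if (mddev_is_clustered(mddev))
		md_cluster_ops->metadata_update_finish(mddev);
}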
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 3b5d7f704aa3..2cb59a641cd2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -271,14 +271,16 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
271 goto abort; 271 goto abort;
272 } 272 }
273 273
274 blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); 274 if (mddev->queue) {
275 blk_queue_io_opt(mddev->queue, 275 blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
276 (mddev->chunk_sectors << 9) * mddev->raid_disks); 276 blk_queue_io_opt(mddev->queue,
277 277 (mddev->chunk_sectors << 9) * mddev->raid_disks);
278 if (!discard_supported) 278
279 queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 279 if (!discard_supported)
280 else 280 queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
281 queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 281 else
282 queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
283 }
282 284
283 pr_debug("md/raid0:%s: done.\n", mdname(mddev)); 285 pr_debug("md/raid0:%s: done.\n", mdname(mddev));
284 *private_conf = conf; 286 *private_conf = conf;
@@ -429,9 +431,12 @@ static int raid0_run(struct mddev *mddev)
429 } 431 }
430 if (md_check_no_bitmap(mddev)) 432 if (md_check_no_bitmap(mddev))
431 return -EINVAL; 433 return -EINVAL;
432 blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); 434
433 blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); 435 if (mddev->queue) {
434 blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); 436 blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
437 blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
438 blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
439 }
435 440
436 /* if private is not null, we are here after takeover */ 441 /* if private is not null, we are here after takeover */
437 if (mddev->private == NULL) { 442 if (mddev->private == NULL) {
@@ -448,16 +453,17 @@ static int raid0_run(struct mddev *mddev)
448 printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n", 453 printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n",
449 mdname(mddev), 454 mdname(mddev),
450 (unsigned long long)mddev->array_sectors); 455 (unsigned long long)mddev->array_sectors);
451 /* calculate the max read-ahead size. 456
452 * For read-ahead of large files to be effective, we need to 457 if (mddev->queue) {
453 * readahead at least twice a whole stripe. i.e. number of devices 458 /* calculate the max read-ahead size.
454 * multiplied by chunk size times 2. 459 * For read-ahead of large files to be effective, we need to
455 * If an individual device has an ra_pages greater than the 460 * readahead at least twice a whole stripe. i.e. number of devices
456 * chunk size, then we will not drive that device as hard as it 461 * multiplied by chunk size times 2.
457 * wants. We consider this a configuration error: a larger 462 * If an individual device has an ra_pages greater than the
458 * chunksize should be used in that case. 463 * chunk size, then we will not drive that device as hard as it
459 */ 464 * wants. We consider this a configuration error: a larger
460 { 465 * chunksize should be used in that case.
466 */
461 int stripe = mddev->raid_disks * 467 int stripe = mddev->raid_disks *
462 (mddev->chunk_sectors << 9) / PAGE_SIZE; 468 (mddev->chunk_sectors << 9) / PAGE_SIZE;
463 if (mddev->queue->backing_dev_info.ra_pages < 2* stripe) 469 if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
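Note: the read-ahead comment above asks for at least two whole stripes of read-ahead. As a worked example with illustrative numbers (not from the patch): 4 member disks with 1024-sector (512 KiB) chunks and 4 KiB pages give 4 * 524288 / 4096 = 512 pages per stripe, so the target is 1024 pages. The same arithmetic as a hypothetical helper:

/* Hypothetical helper (not in the patch): the raid0 read-ahead target,
 * two full stripes expressed in pages.
 */
static unsigned long raid0_ra_target_sketch(int raid_disks,
					    unsigned int chunk_sectors,
					    unsigned long page_size)
{
	unsigned long stripe = raid_disks * (chunk_sectors << 9) / page_size;

	return 2 * stripe;
}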
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238afa54..9157a29c8dbf 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -539,7 +539,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
539 has_nonrot_disk = 0; 539 has_nonrot_disk = 0;
540 choose_next_idle = 0; 540 choose_next_idle = 0;
541 541
542 choose_first = (conf->mddev->recovery_cp < this_sector + sectors); 542 if ((conf->mddev->recovery_cp < this_sector + sectors) ||
543 (mddev_is_clustered(conf->mddev) &&
544 md_cluster_ops->area_resyncing(conf->mddev, this_sector,
545 this_sector + sectors)))
546 choose_first = 1;
547 else
548 choose_first = 0;
543 549
544 for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { 550 for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
545 sector_t dist; 551 sector_t dist;
@@ -1102,8 +1108,10 @@ static void make_request(struct mddev *mddev, struct bio * bio)
1102 md_write_start(mddev, bio); /* wait on superblock update early */ 1108 md_write_start(mddev, bio); /* wait on superblock update early */
1103 1109
1104 if (bio_data_dir(bio) == WRITE && 1110 if (bio_data_dir(bio) == WRITE &&
1105 bio_end_sector(bio) > mddev->suspend_lo && 1111 ((bio_end_sector(bio) > mddev->suspend_lo &&
1106 bio->bi_iter.bi_sector < mddev->suspend_hi) { 1112 bio->bi_iter.bi_sector < mddev->suspend_hi) ||
1113 (mddev_is_clustered(mddev) &&
1114 md_cluster_ops->area_resyncing(mddev, bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
1107 /* As the suspend_* range is controlled by 1115 /* As the suspend_* range is controlled by
1108 * userspace, we want an interruptible 1116 * userspace, we want an interruptible
1109 * wait. 1117 * wait.
@@ -1114,7 +1122,10 @@ static void make_request(struct mddev *mddev, struct bio * bio)
1114 prepare_to_wait(&conf->wait_barrier, 1122 prepare_to_wait(&conf->wait_barrier,
1115 &w, TASK_INTERRUPTIBLE); 1123 &w, TASK_INTERRUPTIBLE);
1116 if (bio_end_sector(bio) <= mddev->suspend_lo || 1124 if (bio_end_sector(bio) <= mddev->suspend_lo ||
1117 bio->bi_iter.bi_sector >= mddev->suspend_hi) 1125 bio->bi_iter.bi_sector >= mddev->suspend_hi ||
1126 (mddev_is_clustered(mddev) &&
1127 !md_cluster_ops->area_resyncing(mddev,
1128 bio->bi_iter.bi_sector, bio_end_sector(bio))))
1118 break; 1129 break;
1119 schedule(); 1130 schedule();
1120 } 1131 }
@@ -1561,6 +1572,7 @@ static int raid1_spare_active(struct mddev *mddev)
1561 struct md_rdev *rdev = conf->mirrors[i].rdev; 1572 struct md_rdev *rdev = conf->mirrors[i].rdev;
1562 struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev; 1573 struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
1563 if (repl 1574 if (repl
1575 && !test_bit(Candidate, &repl->flags)
1564 && repl->recovery_offset == MaxSector 1576 && repl->recovery_offset == MaxSector
1565 && !test_bit(Faulty, &repl->flags) 1577 && !test_bit(Faulty, &repl->flags)
1566 && !test_and_set_bit(In_sync, &repl->flags)) { 1578 && !test_and_set_bit(In_sync, &repl->flags)) {
@@ -2468,7 +2480,7 @@ static int init_resync(struct r1conf *conf)
2468 * that can be installed to exclude normal IO requests. 2480 * that can be installed to exclude normal IO requests.
2469 */ 2481 */
2470 2482
2471static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) 2483static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped)
2472{ 2484{
2473 struct r1conf *conf = mddev->private; 2485 struct r1conf *conf = mddev->private;
2474 struct r1bio *r1_bio; 2486 struct r1bio *r1_bio;
@@ -2521,13 +2533,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
2521 *skipped = 1; 2533 *skipped = 1;
2522 return sync_blocks; 2534 return sync_blocks;
2523 } 2535 }
2524 /*
2525 * If there is non-resync activity waiting for a turn,
2526 * and resync is going fast enough,
2527 * then let it though before starting on this new sync request.
2528 */
2529 if (!go_faster && conf->nr_waiting)
2530 msleep_interruptible(1000);
2531 2536
2532 bitmap_cond_end_sync(mddev->bitmap, sector_nr); 2537 bitmap_cond_end_sync(mddev->bitmap, sector_nr);
2533 r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO); 2538 r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
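Note: the raid1 hunks extend the existing suspend_lo/suspend_hi wait so that writes also block while any node is resyncing the bio's range. Stripped of the raid1 plumbing, the wait has the shape sketched below; the real code folds this into make_request() and sleeps on conf->wait_barrier.

#include <linux/wait.h>
#include <linux/sched.h>
#include "md.h"

/* Sketch only: block until no node is resyncing [lo, hi). */
static void wait_while_area_resyncing(struct mddev *mddev, wait_queue_head_t *wq,
				      sector_t lo, sector_t hi)
{
	DEFINE_WAIT(w);

	for (;;) {
		prepare_to_wait(wq, &w, TASK_INTERRUPTIBLE);
		if (!mddev_is_clustered(mddev) ||
		    !md_cluster_ops->area_resyncing(mddev, lo, hi))
			break;
		schedule();
	}
	finish_wait(wq, &w);
}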
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c49d15d..e793ab6b3570 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2889,7 +2889,7 @@ static int init_resync(struct r10conf *conf)
2889 */ 2889 */
2890 2890
2891static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, 2891static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
2892 int *skipped, int go_faster) 2892 int *skipped)
2893{ 2893{
2894 struct r10conf *conf = mddev->private; 2894 struct r10conf *conf = mddev->private;
2895 struct r10bio *r10_bio; 2895 struct r10bio *r10_bio;
@@ -2994,12 +2994,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
2994 if (conf->geo.near_copies < conf->geo.raid_disks && 2994 if (conf->geo.near_copies < conf->geo.raid_disks &&
2995 max_sector > (sector_nr | chunk_mask)) 2995 max_sector > (sector_nr | chunk_mask))
2996 max_sector = (sector_nr | chunk_mask) + 1; 2996 max_sector = (sector_nr | chunk_mask) + 1;
2997 /*
2998 * If there is non-resync activity waiting for us then
2999 * put in a delay to throttle resync.
3000 */
3001 if (!go_faster && conf->nr_waiting)
3002 msleep_interruptible(1000);
3003 2997
3004 /* Again, very different code for resync and recovery. 2998 /* Again, very different code for resync and recovery.
3005 * Both must result in an r10bio with a list of bios that 2999 * Both must result in an r10bio with a list of bios that
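Note: both raid1 and raid10 drop the go_faster msleep throttle here; the pacing now lives in md_do_sync(), which, when the array is not otherwise idle, waits for its own in-flight resync IO to drain instead of sleeping a fixed second (see the md.c hunk above). In sketch form, with the idleness test passed in as a parameter since is_mddev_idle() is static to md.c:

#include <linux/wait.h>
#include "md.h"

/* Sketch only: the pacing that replaces the removed msleep. */
static void resync_throttle_sketch(struct mddev *mddev, bool array_idle)
{
	if (!array_idle)
		wait_event(mddev->recovery_wait,
			   !atomic_read(&mddev->recovery_active));
}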
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b2c572..77dfd720aaa0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -54,6 +54,7 @@
54#include <linux/slab.h> 54#include <linux/slab.h>
55#include <linux/ratelimit.h> 55#include <linux/ratelimit.h>
56#include <linux/nodemask.h> 56#include <linux/nodemask.h>
57#include <linux/flex_array.h>
57#include <trace/events/block.h> 58#include <trace/events/block.h>
58 59
59#include "md.h" 60#include "md.h"
@@ -496,7 +497,7 @@ static void shrink_buffers(struct stripe_head *sh)
496 } 497 }
497} 498}
498 499
499static int grow_buffers(struct stripe_head *sh) 500static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
500{ 501{
501 int i; 502 int i;
502 int num = sh->raid_conf->pool_size; 503 int num = sh->raid_conf->pool_size;
@@ -504,7 +505,7 @@ static int grow_buffers(struct stripe_head *sh)
504 for (i = 0; i < num; i++) { 505 for (i = 0; i < num; i++) {
505 struct page *page; 506 struct page *page;
506 507
507 if (!(page = alloc_page(GFP_KERNEL))) { 508 if (!(page = alloc_page(gfp))) {
508 return 1; 509 return 1;
509 } 510 }
510 sh->dev[i].page = page; 511 sh->dev[i].page = page;
@@ -525,6 +526,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
525 BUG_ON(atomic_read(&sh->count) != 0); 526 BUG_ON(atomic_read(&sh->count) != 0);
526 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); 527 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
527 BUG_ON(stripe_operations_active(sh)); 528 BUG_ON(stripe_operations_active(sh));
529 BUG_ON(sh->batch_head);
528 530
529 pr_debug("init_stripe called, stripe %llu\n", 531 pr_debug("init_stripe called, stripe %llu\n",
530 (unsigned long long)sector); 532 (unsigned long long)sector);
@@ -552,8 +554,10 @@ retry:
552 } 554 }
553 if (read_seqcount_retry(&conf->gen_lock, seq)) 555 if (read_seqcount_retry(&conf->gen_lock, seq))
554 goto retry; 556 goto retry;
557 sh->overwrite_disks = 0;
555 insert_hash(conf, sh); 558 insert_hash(conf, sh);
556 sh->cpu = smp_processor_id(); 559 sh->cpu = smp_processor_id();
560 set_bit(STRIPE_BATCH_READY, &sh->state);
557} 561}
558 562
559static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector, 563static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
@@ -668,20 +672,28 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
668 *(conf->hash_locks + hash)); 672 *(conf->hash_locks + hash));
669 sh = __find_stripe(conf, sector, conf->generation - previous); 673 sh = __find_stripe(conf, sector, conf->generation - previous);
670 if (!sh) { 674 if (!sh) {
671 if (!conf->inactive_blocked) 675 if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
672 sh = get_free_stripe(conf, hash); 676 sh = get_free_stripe(conf, hash);
677 if (!sh && llist_empty(&conf->released_stripes) &&
678 !test_bit(R5_DID_ALLOC, &conf->cache_state))
679 set_bit(R5_ALLOC_MORE,
680 &conf->cache_state);
681 }
673 if (noblock && sh == NULL) 682 if (noblock && sh == NULL)
674 break; 683 break;
675 if (!sh) { 684 if (!sh) {
676 conf->inactive_blocked = 1; 685 set_bit(R5_INACTIVE_BLOCKED,
686 &conf->cache_state);
677 wait_event_lock_irq( 687 wait_event_lock_irq(
678 conf->wait_for_stripe, 688 conf->wait_for_stripe,
679 !list_empty(conf->inactive_list + hash) && 689 !list_empty(conf->inactive_list + hash) &&
680 (atomic_read(&conf->active_stripes) 690 (atomic_read(&conf->active_stripes)
681 < (conf->max_nr_stripes * 3 / 4) 691 < (conf->max_nr_stripes * 3 / 4)
682 || !conf->inactive_blocked), 692 || !test_bit(R5_INACTIVE_BLOCKED,
693 &conf->cache_state)),
683 *(conf->hash_locks + hash)); 694 *(conf->hash_locks + hash));
684 conf->inactive_blocked = 0; 695 clear_bit(R5_INACTIVE_BLOCKED,
696 &conf->cache_state);
685 } else { 697 } else {
686 init_stripe(sh, sector, previous); 698 init_stripe(sh, sector, previous);
687 atomic_inc(&sh->count); 699 atomic_inc(&sh->count);
@@ -708,6 +720,130 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
708 return sh; 720 return sh;
709} 721}
710 722
723static bool is_full_stripe_write(struct stripe_head *sh)
724{
725 BUG_ON(sh->overwrite_disks > (sh->disks - sh->raid_conf->max_degraded));
726 return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded);
727}
728
729static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
730{
731 local_irq_disable();
732 if (sh1 > sh2) {
733 spin_lock(&sh2->stripe_lock);
734 spin_lock_nested(&sh1->stripe_lock, 1);
735 } else {
736 spin_lock(&sh1->stripe_lock);
737 spin_lock_nested(&sh2->stripe_lock, 1);
738 }
739}
740
741static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
742{
743 spin_unlock(&sh1->stripe_lock);
744 spin_unlock(&sh2->stripe_lock);
745 local_irq_enable();
746}
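
lock_two_stripes() avoids an ABBA deadlock by always taking the two stripe locks in a fixed (address) order, regardless of argument order. The same idea in plain pthreads, with the kernel's local_irq_disable() and spin_lock_nested() annotation left out:

    /* build with: cc -pthread */
    #include <pthread.h>
    #include <stdio.h>

    struct toy_stripe {
        pthread_mutex_t lock;
        int data;
    };

    /* Lock two stripes in a globally consistent (address) order, mirroring the
     * sh1 > sh2 comparison above, so two threads locking the same pair in
     * opposite argument order cannot deadlock. */
    static void lock_two(struct toy_stripe *a, struct toy_stripe *b)
    {
        if (a > b) {
            pthread_mutex_lock(&b->lock);
            pthread_mutex_lock(&a->lock);
        } else {
            pthread_mutex_lock(&a->lock);
            pthread_mutex_lock(&b->lock);
        }
    }

    static void unlock_two(struct toy_stripe *a, struct toy_stripe *b)
    {
        pthread_mutex_unlock(&a->lock);
        pthread_mutex_unlock(&b->lock);
    }

    int main(void)
    {
        struct toy_stripe s1 = { PTHREAD_MUTEX_INITIALIZER, 0 };
        struct toy_stripe s2 = { PTHREAD_MUTEX_INITIALIZER, 0 };

        lock_two(&s1, &s2);
        s1.data++; s2.data++;
        unlock_two(&s1, &s2);
        printf("%d %d\n", s1.data, s2.data);
        return 0;
    }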
747
748/* Only a freshly initialized, full-stripe, normal-write stripe can be added to a batch list */
749static bool stripe_can_batch(struct stripe_head *sh)
750{
751 return test_bit(STRIPE_BATCH_READY, &sh->state) &&
752 is_full_stripe_write(sh);
753}
754
755/* we only search backwards for a batch head */
756static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
757{
758 struct stripe_head *head;
759 sector_t head_sector, tmp_sec;
760 int hash;
761 int dd_idx;
762
763 if (!stripe_can_batch(sh))
764 return;
765 /* Don't cross chunks, so the stripe's pd_idx/qd_idx stay the same */
766 tmp_sec = sh->sector;
767 if (!sector_div(tmp_sec, conf->chunk_sectors))
768 return;
769 head_sector = sh->sector - STRIPE_SECTORS;
770
771 hash = stripe_hash_locks_hash(head_sector);
772 spin_lock_irq(conf->hash_locks + hash);
773 head = __find_stripe(conf, head_sector, conf->generation);
774 if (head && !atomic_inc_not_zero(&head->count)) {
775 spin_lock(&conf->device_lock);
776 if (!atomic_read(&head->count)) {
777 if (!test_bit(STRIPE_HANDLE, &head->state))
778 atomic_inc(&conf->active_stripes);
779 BUG_ON(list_empty(&head->lru) &&
780 !test_bit(STRIPE_EXPANDING, &head->state));
781 list_del_init(&head->lru);
782 if (head->group) {
783 head->group->stripes_cnt--;
784 head->group = NULL;
785 }
786 }
787 atomic_inc(&head->count);
788 spin_unlock(&conf->device_lock);
789 }
790 spin_unlock_irq(conf->hash_locks + hash);
791
792 if (!head)
793 return;
794 if (!stripe_can_batch(head))
795 goto out;
796
797 lock_two_stripes(head, sh);
798 /* clear_batch_ready clears the flag */
799 if (!stripe_can_batch(head) || !stripe_can_batch(sh))
800 goto unlock_out;
801
802 if (sh->batch_head)
803 goto unlock_out;
804
805 dd_idx = 0;
806 while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
807 dd_idx++;
808 if (head->dev[dd_idx].towrite->bi_rw != sh->dev[dd_idx].towrite->bi_rw)
809 goto unlock_out;
810
811 if (head->batch_head) {
812 spin_lock(&head->batch_head->batch_lock);
813 /* This batch list is already running */
814 if (!stripe_can_batch(head)) {
815 spin_unlock(&head->batch_head->batch_lock);
816 goto unlock_out;
817 }
818
819 /*
820 * at this point, head's BATCH_READY could be cleared, but we
821 * can still add the stripe to batch list
822 */
823 list_add(&sh->batch_list, &head->batch_list);
824 spin_unlock(&head->batch_head->batch_lock);
825
826 sh->batch_head = head->batch_head;
827 } else {
828 head->batch_head = head;
829 sh->batch_head = head->batch_head;
830 spin_lock(&head->batch_lock);
831 list_add_tail(&sh->batch_list, &head->batch_list);
832 spin_unlock(&head->batch_lock);
833 }
834
835 if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
836 if (atomic_dec_return(&conf->preread_active_stripes)
837 < IO_THRESHOLD)
838 md_wakeup_thread(conf->mddev->thread);
839
840 atomic_inc(&sh->count);
841unlock_out:
842 unlock_two_stripes(head, sh);
843out:
844 release_stripe(head);
845}
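
stripe_add_to_batch_list() only searches one stripe backwards (sh->sector - STRIPE_SECTORS) and refuses to batch across a chunk boundary, which guarantees every member of a batch shares the same pd_idx/qd_idx. A hedged sketch of just that eligibility test (sector arithmetic only; the hash lookup, refcounting and locking above are omitted, and the constants are assumptions):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TOY_STRIPE_SECTORS 8ULL   /* 4KiB stripes, 512-byte sectors (assumed) */

    /*
     * A stripe may join a batch headed by its immediate predecessor only if it
     * is not the first stripe of a chunk; otherwise the predecessor lives in a
     * different chunk and has different parity positions.
     */
    static bool toy_can_search_back(uint64_t sector, uint64_t chunk_sectors,
                                    uint64_t *head_sector)
    {
        if (sector % chunk_sectors == 0)
            return false;                 /* first stripe in the chunk */
        *head_sector = sector - TOY_STRIPE_SECTORS;
        return true;
    }

    int main(void)
    {
        uint64_t head = 0;

        /* 512-sector chunks: sector 0 cannot batch backwards, sector 8 can. */
        printf("sector 0 can batch: %d\n", toy_can_search_back(0, 512, &head));
        if (toy_can_search_back(8, 512, &head))
            printf("sector 8 can batch, head sector %llu\n",
                   (unsigned long long)head);
        return 0;
    }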
846
711/* Determine if 'data_offset' or 'new_data_offset' should be used 847/* Determine if 'data_offset' or 'new_data_offset' should be used
712 * in this stripe_head. 848 * in this stripe_head.
713 */ 849 */
@@ -738,6 +874,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
738{ 874{
739 struct r5conf *conf = sh->raid_conf; 875 struct r5conf *conf = sh->raid_conf;
740 int i, disks = sh->disks; 876 int i, disks = sh->disks;
877 struct stripe_head *head_sh = sh;
741 878
742 might_sleep(); 879 might_sleep();
743 880
@@ -746,6 +883,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
746 int replace_only = 0; 883 int replace_only = 0;
747 struct bio *bi, *rbi; 884 struct bio *bi, *rbi;
748 struct md_rdev *rdev, *rrdev = NULL; 885 struct md_rdev *rdev, *rrdev = NULL;
886
887 sh = head_sh;
749 if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) { 888 if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
750 if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags)) 889 if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
751 rw = WRITE_FUA; 890 rw = WRITE_FUA;
@@ -764,6 +903,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
764 if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags)) 903 if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags))
765 rw |= REQ_SYNC; 904 rw |= REQ_SYNC;
766 905
906again:
767 bi = &sh->dev[i].req; 907 bi = &sh->dev[i].req;
768 rbi = &sh->dev[i].rreq; /* For writing to replacement */ 908 rbi = &sh->dev[i].rreq; /* For writing to replacement */
769 909
@@ -782,7 +922,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
782 /* We raced and saw duplicates */ 922 /* We raced and saw duplicates */
783 rrdev = NULL; 923 rrdev = NULL;
784 } else { 924 } else {
785 if (test_bit(R5_ReadRepl, &sh->dev[i].flags) && rrdev) 925 if (test_bit(R5_ReadRepl, &head_sh->dev[i].flags) && rrdev)
786 rdev = rrdev; 926 rdev = rrdev;
787 rrdev = NULL; 927 rrdev = NULL;
788 } 928 }
@@ -853,13 +993,15 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
853 __func__, (unsigned long long)sh->sector, 993 __func__, (unsigned long long)sh->sector,
854 bi->bi_rw, i); 994 bi->bi_rw, i);
855 atomic_inc(&sh->count); 995 atomic_inc(&sh->count);
996 if (sh != head_sh)
997 atomic_inc(&head_sh->count);
856 if (use_new_offset(conf, sh)) 998 if (use_new_offset(conf, sh))
857 bi->bi_iter.bi_sector = (sh->sector 999 bi->bi_iter.bi_sector = (sh->sector
858 + rdev->new_data_offset); 1000 + rdev->new_data_offset);
859 else 1001 else
860 bi->bi_iter.bi_sector = (sh->sector 1002 bi->bi_iter.bi_sector = (sh->sector
861 + rdev->data_offset); 1003 + rdev->data_offset);
862 if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) 1004 if (test_bit(R5_ReadNoMerge, &head_sh->dev[i].flags))
863 bi->bi_rw |= REQ_NOMERGE; 1005 bi->bi_rw |= REQ_NOMERGE;
864 1006
865 if (test_bit(R5_SkipCopy, &sh->dev[i].flags)) 1007 if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
@@ -903,6 +1045,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
903 __func__, (unsigned long long)sh->sector, 1045 __func__, (unsigned long long)sh->sector,
904 rbi->bi_rw, i); 1046 rbi->bi_rw, i);
905 atomic_inc(&sh->count); 1047 atomic_inc(&sh->count);
1048 if (sh != head_sh)
1049 atomic_inc(&head_sh->count);
906 if (use_new_offset(conf, sh)) 1050 if (use_new_offset(conf, sh))
907 rbi->bi_iter.bi_sector = (sh->sector 1051 rbi->bi_iter.bi_sector = (sh->sector
908 + rrdev->new_data_offset); 1052 + rrdev->new_data_offset);
@@ -934,8 +1078,18 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
934 pr_debug("skip op %ld on disc %d for sector %llu\n", 1078 pr_debug("skip op %ld on disc %d for sector %llu\n",
935 bi->bi_rw, i, (unsigned long long)sh->sector); 1079 bi->bi_rw, i, (unsigned long long)sh->sector);
936 clear_bit(R5_LOCKED, &sh->dev[i].flags); 1080 clear_bit(R5_LOCKED, &sh->dev[i].flags);
1081 if (sh->batch_head)
1082 set_bit(STRIPE_BATCH_ERR,
1083 &sh->batch_head->state);
937 set_bit(STRIPE_HANDLE, &sh->state); 1084 set_bit(STRIPE_HANDLE, &sh->state);
938 } 1085 }
1086
1087 if (!head_sh->batch_head)
1088 continue;
1089 sh = list_first_entry(&sh->batch_list, struct stripe_head,
1090 batch_list);
1091 if (sh != head_sh)
1092 goto again;
939 } 1093 }
940} 1094}
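
ops_run_io() now walks the whole batch for each device: after issuing I/O for the head it jumps back to 'again:' with sh advanced to the next member, stopping once the circular batch_list walk returns to the head. The shape of that traversal, sketched with a plain circular pointer instead of a list_head:

    #include <stdio.h>

    struct toy_stripe {
        int id;
        struct toy_stripe *batch_next;   /* circular: last member points back to head */
    };

    /* Visit the batch head and then every member, stopping when the circular
     * walk returns to the head - the shape of the 'again:' loop above. */
    static void toy_for_each_in_batch(struct toy_stripe *head,
                                      void (*fn)(struct toy_stripe *))
    {
        struct toy_stripe *sh = head;

        do {
            fn(sh);
            sh = sh->batch_next;
        } while (sh != head);
    }

    static void toy_issue_io(struct toy_stripe *sh)
    {
        printf("issue I/O for stripe %d\n", sh->id);
    }

    int main(void)
    {
        struct toy_stripe c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };

        c.batch_next = &a;               /* close the circle back to the head */
        toy_for_each_in_batch(&a, toy_issue_io);
        return 0;
    }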
941 1095
@@ -1051,6 +1205,7 @@ static void ops_run_biofill(struct stripe_head *sh)
1051 struct async_submit_ctl submit; 1205 struct async_submit_ctl submit;
1052 int i; 1206 int i;
1053 1207
1208 BUG_ON(sh->batch_head);
1054 pr_debug("%s: stripe %llu\n", __func__, 1209 pr_debug("%s: stripe %llu\n", __func__,
1055 (unsigned long long)sh->sector); 1210 (unsigned long long)sh->sector);
1056 1211
@@ -1109,16 +1264,28 @@ static void ops_complete_compute(void *stripe_head_ref)
1109 1264
1110/* return a pointer to the address conversion region of the scribble buffer */ 1265/* return a pointer to the address conversion region of the scribble buffer */
1111static addr_conv_t *to_addr_conv(struct stripe_head *sh, 1266static addr_conv_t *to_addr_conv(struct stripe_head *sh,
1112 struct raid5_percpu *percpu) 1267 struct raid5_percpu *percpu, int i)
1113{ 1268{
1114 return percpu->scribble + sizeof(struct page *) * (sh->disks + 2); 1269 void *addr;
1270
1271 addr = flex_array_get(percpu->scribble, i);
1272 return addr + sizeof(struct page *) * (sh->disks + 2);
1273}
1274
1275/* return a pointer to the address conversion region of the scribble buffer */
1276static struct page **to_addr_page(struct raid5_percpu *percpu, int i)
1277{
1278 void *addr;
1279
1280 addr = flex_array_get(percpu->scribble, i);
1281 return addr;
1115} 1282}
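
With batching, every stripe handled in one async run needs its own scratch region, so the per-CPU scribble buffer becomes an array of regions: region i holds (disks + 2) page pointers followed by the address-conversion area, which is what to_addr_page() and to_addr_conv() return. A plain-malloc sketch of that layout (the kernel uses a flex_array; all names here are made up):

    #include <stdio.h>
    #include <stdlib.h>

    typedef void *toy_addr_conv_t;       /* stand-in for addr_conv_t */

    struct toy_scribble {
        size_t region_len;               /* bytes per region               */
        size_t nregions;                 /* one region per batched stripe  */
        unsigned char *buf;
    };

    static struct toy_scribble *toy_scribble_alloc(int disks, size_t nregions)
    {
        struct toy_scribble *s = malloc(sizeof(*s));

        if (!s)
            return NULL;
        s->region_len = sizeof(void *) * (disks + 2) +
                        sizeof(toy_addr_conv_t) * (disks + 2);
        s->nregions = nregions;
        s->buf = calloc(nregions, s->region_len);
        if (!s->buf) {
            free(s);
            return NULL;
        }
        return s;
    }

    /* Page-pointer table of region i (what to_addr_page() returns). */
    static void **toy_addr_page(struct toy_scribble *s, size_t i)
    {
        return (void **)(s->buf + i * s->region_len);
    }

    /* Address-conversion area of region i (what to_addr_conv() returns). */
    static toy_addr_conv_t *toy_addr_conv(struct toy_scribble *s, size_t i, int disks)
    {
        return (toy_addr_conv_t *)((unsigned char *)toy_addr_page(s, i) +
                                   sizeof(void *) * (disks + 2));
    }

    int main(void)
    {
        int disks = 6;
        struct toy_scribble *s = toy_scribble_alloc(disks, 4);

        if (!s)
            return 1;
        printf("region 2 pages at %p, conv at %p\n",
               (void *)toy_addr_page(s, 2), (void *)toy_addr_conv(s, 2, disks));
        free(s->buf);
        free(s);
        return 0;
    }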
1116 1283
1117static struct dma_async_tx_descriptor * 1284static struct dma_async_tx_descriptor *
1118ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu) 1285ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
1119{ 1286{
1120 int disks = sh->disks; 1287 int disks = sh->disks;
1121 struct page **xor_srcs = percpu->scribble; 1288 struct page **xor_srcs = to_addr_page(percpu, 0);
1122 int target = sh->ops.target; 1289 int target = sh->ops.target;
1123 struct r5dev *tgt = &sh->dev[target]; 1290 struct r5dev *tgt = &sh->dev[target];
1124 struct page *xor_dest = tgt->page; 1291 struct page *xor_dest = tgt->page;
@@ -1127,6 +1294,8 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
1127 struct async_submit_ctl submit; 1294 struct async_submit_ctl submit;
1128 int i; 1295 int i;
1129 1296
1297 BUG_ON(sh->batch_head);
1298
1130 pr_debug("%s: stripe %llu block: %d\n", 1299 pr_debug("%s: stripe %llu block: %d\n",
1131 __func__, (unsigned long long)sh->sector, target); 1300 __func__, (unsigned long long)sh->sector, target);
1132 BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags)); 1301 BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
@@ -1138,7 +1307,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
1138 atomic_inc(&sh->count); 1307 atomic_inc(&sh->count);
1139 1308
1140 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL, 1309 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
1141 ops_complete_compute, sh, to_addr_conv(sh, percpu)); 1310 ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
1142 if (unlikely(count == 1)) 1311 if (unlikely(count == 1))
1143 tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit); 1312 tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
1144 else 1313 else
@@ -1156,7 +1325,9 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
1156 * destination buffer is recorded in srcs[count] and the Q destination 1325 * destination buffer is recorded in srcs[count] and the Q destination
1157 * is recorded in srcs[count+1]. 1326 * is recorded in srcs[count+1].
1158 */ 1327 */
1159static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh) 1328static int set_syndrome_sources(struct page **srcs,
1329 struct stripe_head *sh,
1330 int srctype)
1160{ 1331{
1161 int disks = sh->disks; 1332 int disks = sh->disks;
1162 int syndrome_disks = sh->ddf_layout ? disks : (disks - 2); 1333 int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
@@ -1171,8 +1342,15 @@ static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh)
1171 i = d0_idx; 1342 i = d0_idx;
1172 do { 1343 do {
1173 int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks); 1344 int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
1345 struct r5dev *dev = &sh->dev[i];
1174 1346
1175 srcs[slot] = sh->dev[i].page; 1347 if (i == sh->qd_idx || i == sh->pd_idx ||
1348 (srctype == SYNDROME_SRC_ALL) ||
1349 (srctype == SYNDROME_SRC_WANT_DRAIN &&
1350 test_bit(R5_Wantdrain, &dev->flags)) ||
1351 (srctype == SYNDROME_SRC_WRITTEN &&
1352 dev->written))
1353 srcs[slot] = sh->dev[i].page;
1176 i = raid6_next_disk(i, disks); 1354 i = raid6_next_disk(i, disks);
1177 } while (i != d0_idx); 1355 } while (i != d0_idx);
1178 1356
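
set_syndrome_sources() now filters which device pages feed the P/Q computation: everything for SYNDROME_SRC_ALL, only R5_Wantdrain devices for the read-modify-write prexor pass, and only written devices for the final reconstruct. The selection rule alone, sketched with slot ordering ignored and invented names:

    #include <stdbool.h>
    #include <stdio.h>

    enum toy_srctype { TOY_SRC_ALL, TOY_SRC_WANT_DRAIN, TOY_SRC_WRITTEN };

    struct toy_dev {
        bool want_drain;   /* stands in for R5_Wantdrain         */
        bool written;      /* stands in for dev->written != NULL */
    };

    /* Decide whether device i contributes a source page to the syndrome op.
     * Parity devices (pd/qd) always contribute; data devices contribute
     * according to srctype, as in the condition above. */
    static bool toy_use_source(const struct toy_dev *dev, bool is_parity,
                               enum toy_srctype type)
    {
        if (is_parity || type == TOY_SRC_ALL)
            return true;
        if (type == TOY_SRC_WANT_DRAIN)
            return dev->want_drain;
        return dev->written;             /* TOY_SRC_WRITTEN */
    }

    int main(void)
    {
        struct toy_dev d = { .want_drain = true, .written = false };

        printf("drain pass uses it:   %d\n",
               toy_use_source(&d, false, TOY_SRC_WANT_DRAIN));
        printf("written pass uses it: %d\n",
               toy_use_source(&d, false, TOY_SRC_WRITTEN));
        return 0;
    }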
@@ -1183,7 +1361,7 @@ static struct dma_async_tx_descriptor *
1183ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) 1361ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
1184{ 1362{
1185 int disks = sh->disks; 1363 int disks = sh->disks;
1186 struct page **blocks = percpu->scribble; 1364 struct page **blocks = to_addr_page(percpu, 0);
1187 int target; 1365 int target;
1188 int qd_idx = sh->qd_idx; 1366 int qd_idx = sh->qd_idx;
1189 struct dma_async_tx_descriptor *tx; 1367 struct dma_async_tx_descriptor *tx;
@@ -1193,6 +1371,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
1193 int i; 1371 int i;
1194 int count; 1372 int count;
1195 1373
1374 BUG_ON(sh->batch_head);
1196 if (sh->ops.target < 0) 1375 if (sh->ops.target < 0)
1197 target = sh->ops.target2; 1376 target = sh->ops.target2;
1198 else if (sh->ops.target2 < 0) 1377 else if (sh->ops.target2 < 0)
@@ -1211,12 +1390,12 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
1211 atomic_inc(&sh->count); 1390 atomic_inc(&sh->count);
1212 1391
1213 if (target == qd_idx) { 1392 if (target == qd_idx) {
1214 count = set_syndrome_sources(blocks, sh); 1393 count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
1215 blocks[count] = NULL; /* regenerating p is not necessary */ 1394 blocks[count] = NULL; /* regenerating p is not necessary */
1216 BUG_ON(blocks[count+1] != dest); /* q should already be set */ 1395 BUG_ON(blocks[count+1] != dest); /* q should already be set */
1217 init_async_submit(&submit, ASYNC_TX_FENCE, NULL, 1396 init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
1218 ops_complete_compute, sh, 1397 ops_complete_compute, sh,
1219 to_addr_conv(sh, percpu)); 1398 to_addr_conv(sh, percpu, 0));
1220 tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); 1399 tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
1221 } else { 1400 } else {
1222 /* Compute any data- or p-drive using XOR */ 1401 /* Compute any data- or p-drive using XOR */
@@ -1229,7 +1408,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
1229 1408
1230 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, 1409 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
1231 NULL, ops_complete_compute, sh, 1410 NULL, ops_complete_compute, sh,
1232 to_addr_conv(sh, percpu)); 1411 to_addr_conv(sh, percpu, 0));
1233 tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit); 1412 tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit);
1234 } 1413 }
1235 1414
@@ -1248,9 +1427,10 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
1248 struct r5dev *tgt = &sh->dev[target]; 1427 struct r5dev *tgt = &sh->dev[target];
1249 struct r5dev *tgt2 = &sh->dev[target2]; 1428 struct r5dev *tgt2 = &sh->dev[target2];
1250 struct dma_async_tx_descriptor *tx; 1429 struct dma_async_tx_descriptor *tx;
1251 struct page **blocks = percpu->scribble; 1430 struct page **blocks = to_addr_page(percpu, 0);
1252 struct async_submit_ctl submit; 1431 struct async_submit_ctl submit;
1253 1432
1433 BUG_ON(sh->batch_head);
1254 pr_debug("%s: stripe %llu block1: %d block2: %d\n", 1434 pr_debug("%s: stripe %llu block1: %d block2: %d\n",
1255 __func__, (unsigned long long)sh->sector, target, target2); 1435 __func__, (unsigned long long)sh->sector, target, target2);
1256 BUG_ON(target < 0 || target2 < 0); 1436 BUG_ON(target < 0 || target2 < 0);
@@ -1290,7 +1470,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
1290 /* Missing P+Q, just recompute */ 1470 /* Missing P+Q, just recompute */
1291 init_async_submit(&submit, ASYNC_TX_FENCE, NULL, 1471 init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
1292 ops_complete_compute, sh, 1472 ops_complete_compute, sh,
1293 to_addr_conv(sh, percpu)); 1473 to_addr_conv(sh, percpu, 0));
1294 return async_gen_syndrome(blocks, 0, syndrome_disks+2, 1474 return async_gen_syndrome(blocks, 0, syndrome_disks+2,
1295 STRIPE_SIZE, &submit); 1475 STRIPE_SIZE, &submit);
1296 } else { 1476 } else {
@@ -1314,21 +1494,21 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
1314 init_async_submit(&submit, 1494 init_async_submit(&submit,
1315 ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, 1495 ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
1316 NULL, NULL, NULL, 1496 NULL, NULL, NULL,
1317 to_addr_conv(sh, percpu)); 1497 to_addr_conv(sh, percpu, 0));
1318 tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, 1498 tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE,
1319 &submit); 1499 &submit);
1320 1500
1321 count = set_syndrome_sources(blocks, sh); 1501 count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
1322 init_async_submit(&submit, ASYNC_TX_FENCE, tx, 1502 init_async_submit(&submit, ASYNC_TX_FENCE, tx,
1323 ops_complete_compute, sh, 1503 ops_complete_compute, sh,
1324 to_addr_conv(sh, percpu)); 1504 to_addr_conv(sh, percpu, 0));
1325 return async_gen_syndrome(blocks, 0, count+2, 1505 return async_gen_syndrome(blocks, 0, count+2,
1326 STRIPE_SIZE, &submit); 1506 STRIPE_SIZE, &submit);
1327 } 1507 }
1328 } else { 1508 } else {
1329 init_async_submit(&submit, ASYNC_TX_FENCE, NULL, 1509 init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
1330 ops_complete_compute, sh, 1510 ops_complete_compute, sh,
1331 to_addr_conv(sh, percpu)); 1511 to_addr_conv(sh, percpu, 0));
1332 if (failb == syndrome_disks) { 1512 if (failb == syndrome_disks) {
1333 /* We're missing D+P. */ 1513 /* We're missing D+P. */
1334 return async_raid6_datap_recov(syndrome_disks+2, 1514 return async_raid6_datap_recov(syndrome_disks+2,
@@ -1352,17 +1532,18 @@ static void ops_complete_prexor(void *stripe_head_ref)
1352} 1532}
1353 1533
1354static struct dma_async_tx_descriptor * 1534static struct dma_async_tx_descriptor *
1355ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, 1535ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
1356 struct dma_async_tx_descriptor *tx) 1536 struct dma_async_tx_descriptor *tx)
1357{ 1537{
1358 int disks = sh->disks; 1538 int disks = sh->disks;
1359 struct page **xor_srcs = percpu->scribble; 1539 struct page **xor_srcs = to_addr_page(percpu, 0);
1360 int count = 0, pd_idx = sh->pd_idx, i; 1540 int count = 0, pd_idx = sh->pd_idx, i;
1361 struct async_submit_ctl submit; 1541 struct async_submit_ctl submit;
1362 1542
1363 /* existing parity data subtracted */ 1543 /* existing parity data subtracted */
1364 struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; 1544 struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
1365 1545
1546 BUG_ON(sh->batch_head);
1366 pr_debug("%s: stripe %llu\n", __func__, 1547 pr_debug("%s: stripe %llu\n", __func__,
1367 (unsigned long long)sh->sector); 1548 (unsigned long long)sh->sector);
1368 1549
@@ -1374,31 +1555,56 @@ ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu,
1374 } 1555 }
1375 1556
1376 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx, 1557 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
1377 ops_complete_prexor, sh, to_addr_conv(sh, percpu)); 1558 ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
1378 tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit); 1559 tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
1379 1560
1380 return tx; 1561 return tx;
1381} 1562}
1382 1563
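
The prexor step is the "subtract" half of a read-modify-write: the data about to be overwritten is XORed back out of the parity so that XORing in the new data later yields correct parity without reading the other data disks. For the single-parity case this reduces to P_new = P_old ^ D_old ^ D_new; a tiny sketch of that arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    /* Read-modify-write of one parity byte: subtract (XOR out) the old data,
     * then add (XOR in) the new data.  Over GF(2) both steps are plain XOR. */
    static uint8_t rmw_parity(uint8_t p_old, uint8_t d_old, uint8_t d_new)
    {
        uint8_t p = p_old ^ d_old;   /* prexor: remove old contribution      */
        return p ^ d_new;            /* reconstruct: add new contribution    */
    }

    int main(void)
    {
        /* 3-disk toy stripe: d0, d1, parity = d0 ^ d1 */
        uint8_t d0 = 0x3c, d1 = 0x5a, p = d0 ^ d1;
        uint8_t d0_new = 0xff;

        p = rmw_parity(p, d0, d0_new);
        printf("parity ok: %d\n", p == (uint8_t)(d0_new ^ d1));
        return 0;
    }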
1383static struct dma_async_tx_descriptor * 1564static struct dma_async_tx_descriptor *
1565ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
1566 struct dma_async_tx_descriptor *tx)
1567{
1568 struct page **blocks = to_addr_page(percpu, 0);
1569 int count;
1570 struct async_submit_ctl submit;
1571
1572 pr_debug("%s: stripe %llu\n", __func__,
1573 (unsigned long long)sh->sector);
1574
1575 count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
1576
1577 init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
1578 ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
1579 tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
1580
1581 return tx;
1582}
1583
1584static struct dma_async_tx_descriptor *
1384ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) 1585ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
1385{ 1586{
1386 int disks = sh->disks; 1587 int disks = sh->disks;
1387 int i; 1588 int i;
1589 struct stripe_head *head_sh = sh;
1388 1590
1389 pr_debug("%s: stripe %llu\n", __func__, 1591 pr_debug("%s: stripe %llu\n", __func__,
1390 (unsigned long long)sh->sector); 1592 (unsigned long long)sh->sector);
1391 1593
1392 for (i = disks; i--; ) { 1594 for (i = disks; i--; ) {
1393 struct r5dev *dev = &sh->dev[i]; 1595 struct r5dev *dev;
1394 struct bio *chosen; 1596 struct bio *chosen;
1395 1597
1396 if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) { 1598 sh = head_sh;
1599 if (test_and_clear_bit(R5_Wantdrain, &head_sh->dev[i].flags)) {
1397 struct bio *wbi; 1600 struct bio *wbi;
1398 1601
1602again:
1603 dev = &sh->dev[i];
1399 spin_lock_irq(&sh->stripe_lock); 1604 spin_lock_irq(&sh->stripe_lock);
1400 chosen = dev->towrite; 1605 chosen = dev->towrite;
1401 dev->towrite = NULL; 1606 dev->towrite = NULL;
1607 sh->overwrite_disks = 0;
1402 BUG_ON(dev->written); 1608 BUG_ON(dev->written);
1403 wbi = dev->written = chosen; 1609 wbi = dev->written = chosen;
1404 spin_unlock_irq(&sh->stripe_lock); 1610 spin_unlock_irq(&sh->stripe_lock);
@@ -1423,6 +1629,15 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
1423 } 1629 }
1424 wbi = r5_next_bio(wbi, dev->sector); 1630 wbi = r5_next_bio(wbi, dev->sector);
1425 } 1631 }
1632
1633 if (head_sh->batch_head) {
1634 sh = list_first_entry(&sh->batch_list,
1635 struct stripe_head,
1636 batch_list);
1637 if (sh == head_sh)
1638 continue;
1639 goto again;
1640 }
1426 } 1641 }
1427 } 1642 }
1428 1643
@@ -1478,12 +1693,15 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
1478 struct dma_async_tx_descriptor *tx) 1693 struct dma_async_tx_descriptor *tx)
1479{ 1694{
1480 int disks = sh->disks; 1695 int disks = sh->disks;
1481 struct page **xor_srcs = percpu->scribble; 1696 struct page **xor_srcs;
1482 struct async_submit_ctl submit; 1697 struct async_submit_ctl submit;
1483 int count = 0, pd_idx = sh->pd_idx, i; 1698 int count, pd_idx = sh->pd_idx, i;
1484 struct page *xor_dest; 1699 struct page *xor_dest;
1485 int prexor = 0; 1700 int prexor = 0;
1486 unsigned long flags; 1701 unsigned long flags;
1702 int j = 0;
1703 struct stripe_head *head_sh = sh;
1704 int last_stripe;
1487 1705
1488 pr_debug("%s: stripe %llu\n", __func__, 1706 pr_debug("%s: stripe %llu\n", __func__,
1489 (unsigned long long)sh->sector); 1707 (unsigned long long)sh->sector);
@@ -1500,15 +1718,18 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
1500 ops_complete_reconstruct(sh); 1718 ops_complete_reconstruct(sh);
1501 return; 1719 return;
1502 } 1720 }
1721again:
1722 count = 0;
1723 xor_srcs = to_addr_page(percpu, j);
1503 /* check if prexor is active which means only process blocks 1724 /* check if prexor is active which means only process blocks
1504 * that are part of a read-modify-write (written) 1725 * that are part of a read-modify-write (written)
1505 */ 1726 */
1506 if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) { 1727 if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
1507 prexor = 1; 1728 prexor = 1;
1508 xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; 1729 xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
1509 for (i = disks; i--; ) { 1730 for (i = disks; i--; ) {
1510 struct r5dev *dev = &sh->dev[i]; 1731 struct r5dev *dev = &sh->dev[i];
1511 if (dev->written) 1732 if (head_sh->dev[i].written)
1512 xor_srcs[count++] = dev->page; 1733 xor_srcs[count++] = dev->page;
1513 } 1734 }
1514 } else { 1735 } else {
@@ -1525,17 +1746,32 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
1525 * set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST 1746 * set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST
1526 * for the synchronous xor case 1747 * for the synchronous xor case
1527 */ 1748 */
1528 flags = ASYNC_TX_ACK | 1749 last_stripe = !head_sh->batch_head ||
1529 (prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST); 1750 list_first_entry(&sh->batch_list,
1530 1751 struct stripe_head, batch_list) == head_sh;
1531 atomic_inc(&sh->count); 1752 if (last_stripe) {
1753 flags = ASYNC_TX_ACK |
1754 (prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
1755
1756 atomic_inc(&head_sh->count);
1757 init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh,
1758 to_addr_conv(sh, percpu, j));
1759 } else {
1760 flags = prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST;
1761 init_async_submit(&submit, flags, tx, NULL, NULL,
1762 to_addr_conv(sh, percpu, j));
1763 }
1532 1764
1533 init_async_submit(&submit, flags, tx, ops_complete_reconstruct, sh,
1534 to_addr_conv(sh, percpu));
1535 if (unlikely(count == 1)) 1765 if (unlikely(count == 1))
1536 tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit); 1766 tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
1537 else 1767 else
1538 tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit); 1768 tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
1769 if (!last_stripe) {
1770 j++;
1771 sh = list_first_entry(&sh->batch_list, struct stripe_head,
1772 batch_list);
1773 goto again;
1774 }
1539} 1775}
1540 1776
1541static void 1777static void
@@ -1543,8 +1779,12 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
1543 struct dma_async_tx_descriptor *tx) 1779 struct dma_async_tx_descriptor *tx)
1544{ 1780{
1545 struct async_submit_ctl submit; 1781 struct async_submit_ctl submit;
1546 struct page **blocks = percpu->scribble; 1782 struct page **blocks;
1547 int count, i; 1783 int count, i, j = 0;
1784 struct stripe_head *head_sh = sh;
1785 int last_stripe;
1786 int synflags;
1787 unsigned long txflags;
1548 1788
1549 pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); 1789 pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
1550 1790
@@ -1562,13 +1802,36 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
1562 return; 1802 return;
1563 } 1803 }
1564 1804
1565 count = set_syndrome_sources(blocks, sh); 1805again:
1806 blocks = to_addr_page(percpu, j);
1566 1807
1567 atomic_inc(&sh->count); 1808 if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
1809 synflags = SYNDROME_SRC_WRITTEN;
1810 txflags = ASYNC_TX_ACK | ASYNC_TX_PQ_XOR_DST;
1811 } else {
1812 synflags = SYNDROME_SRC_ALL;
1813 txflags = ASYNC_TX_ACK;
1814 }
1815
1816 count = set_syndrome_sources(blocks, sh, synflags);
1817 last_stripe = !head_sh->batch_head ||
1818 list_first_entry(&sh->batch_list,
1819 struct stripe_head, batch_list) == head_sh;
1568 1820
1569 init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct, 1821 if (last_stripe) {
1570 sh, to_addr_conv(sh, percpu)); 1822 atomic_inc(&head_sh->count);
1823 init_async_submit(&submit, txflags, tx, ops_complete_reconstruct,
1824 head_sh, to_addr_conv(sh, percpu, j));
1825 } else
1826 init_async_submit(&submit, 0, tx, NULL, NULL,
1827 to_addr_conv(sh, percpu, j));
1571 async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); 1828 async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
1829 if (!last_stripe) {
1830 j++;
1831 sh = list_first_entry(&sh->batch_list, struct stripe_head,
1832 batch_list);
1833 goto again;
1834 }
1572} 1835}
1573 1836
1574static void ops_complete_check(void *stripe_head_ref) 1837static void ops_complete_check(void *stripe_head_ref)
@@ -1589,7 +1852,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
1589 int pd_idx = sh->pd_idx; 1852 int pd_idx = sh->pd_idx;
1590 int qd_idx = sh->qd_idx; 1853 int qd_idx = sh->qd_idx;
1591 struct page *xor_dest; 1854 struct page *xor_dest;
1592 struct page **xor_srcs = percpu->scribble; 1855 struct page **xor_srcs = to_addr_page(percpu, 0);
1593 struct dma_async_tx_descriptor *tx; 1856 struct dma_async_tx_descriptor *tx;
1594 struct async_submit_ctl submit; 1857 struct async_submit_ctl submit;
1595 int count; 1858 int count;
@@ -1598,6 +1861,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
1598 pr_debug("%s: stripe %llu\n", __func__, 1861 pr_debug("%s: stripe %llu\n", __func__,
1599 (unsigned long long)sh->sector); 1862 (unsigned long long)sh->sector);
1600 1863
1864 BUG_ON(sh->batch_head);
1601 count = 0; 1865 count = 0;
1602 xor_dest = sh->dev[pd_idx].page; 1866 xor_dest = sh->dev[pd_idx].page;
1603 xor_srcs[count++] = xor_dest; 1867 xor_srcs[count++] = xor_dest;
@@ -1608,7 +1872,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
1608 } 1872 }
1609 1873
1610 init_async_submit(&submit, 0, NULL, NULL, NULL, 1874 init_async_submit(&submit, 0, NULL, NULL, NULL,
1611 to_addr_conv(sh, percpu)); 1875 to_addr_conv(sh, percpu, 0));
1612 tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, 1876 tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
1613 &sh->ops.zero_sum_result, &submit); 1877 &sh->ops.zero_sum_result, &submit);
1614 1878
@@ -1619,20 +1883,21 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
1619 1883
1620static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp) 1884static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp)
1621{ 1885{
1622 struct page **srcs = percpu->scribble; 1886 struct page **srcs = to_addr_page(percpu, 0);
1623 struct async_submit_ctl submit; 1887 struct async_submit_ctl submit;
1624 int count; 1888 int count;
1625 1889
1626 pr_debug("%s: stripe %llu checkp: %d\n", __func__, 1890 pr_debug("%s: stripe %llu checkp: %d\n", __func__,
1627 (unsigned long long)sh->sector, checkp); 1891 (unsigned long long)sh->sector, checkp);
1628 1892
1629 count = set_syndrome_sources(srcs, sh); 1893 BUG_ON(sh->batch_head);
1894 count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
1630 if (!checkp) 1895 if (!checkp)
1631 srcs[count] = NULL; 1896 srcs[count] = NULL;
1632 1897
1633 atomic_inc(&sh->count); 1898 atomic_inc(&sh->count);
1634 init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check, 1899 init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
1635 sh, to_addr_conv(sh, percpu)); 1900 sh, to_addr_conv(sh, percpu, 0));
1636 async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE, 1901 async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
1637 &sh->ops.zero_sum_result, percpu->spare_page, &submit); 1902 &sh->ops.zero_sum_result, percpu->spare_page, &submit);
1638} 1903}
@@ -1667,8 +1932,12 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
1667 async_tx_ack(tx); 1932 async_tx_ack(tx);
1668 } 1933 }
1669 1934
1670 if (test_bit(STRIPE_OP_PREXOR, &ops_request)) 1935 if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
1671 tx = ops_run_prexor(sh, percpu, tx); 1936 if (level < 6)
1937 tx = ops_run_prexor5(sh, percpu, tx);
1938 else
1939 tx = ops_run_prexor6(sh, percpu, tx);
1940 }
1672 1941
1673 if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) { 1942 if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
1674 tx = ops_run_biodrain(sh, tx); 1943 tx = ops_run_biodrain(sh, tx);
@@ -1693,7 +1962,7 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
1693 BUG(); 1962 BUG();
1694 } 1963 }
1695 1964
1696 if (overlap_clear) 1965 if (overlap_clear && !sh->batch_head)
1697 for (i = disks; i--; ) { 1966 for (i = disks; i--; ) {
1698 struct r5dev *dev = &sh->dev[i]; 1967 struct r5dev *dev = &sh->dev[i];
1699 if (test_and_clear_bit(R5_Overlap, &dev->flags)) 1968 if (test_and_clear_bit(R5_Overlap, &dev->flags))
@@ -1702,10 +1971,10 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
1702 put_cpu(); 1971 put_cpu();
1703} 1972}
1704 1973
1705static int grow_one_stripe(struct r5conf *conf, int hash) 1974static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
1706{ 1975{
1707 struct stripe_head *sh; 1976 struct stripe_head *sh;
1708 sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL); 1977 sh = kmem_cache_zalloc(conf->slab_cache, gfp);
1709 if (!sh) 1978 if (!sh)
1710 return 0; 1979 return 0;
1711 1980
@@ -1713,17 +1982,23 @@ static int grow_one_stripe(struct r5conf *conf, int hash)
1713 1982
1714 spin_lock_init(&sh->stripe_lock); 1983 spin_lock_init(&sh->stripe_lock);
1715 1984
1716 if (grow_buffers(sh)) { 1985 if (grow_buffers(sh, gfp)) {
1717 shrink_buffers(sh); 1986 shrink_buffers(sh);
1718 kmem_cache_free(conf->slab_cache, sh); 1987 kmem_cache_free(conf->slab_cache, sh);
1719 return 0; 1988 return 0;
1720 } 1989 }
1721 sh->hash_lock_index = hash; 1990 sh->hash_lock_index =
1991 conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
1722 /* we just created an active stripe so... */ 1992 /* we just created an active stripe so... */
1723 atomic_set(&sh->count, 1); 1993 atomic_set(&sh->count, 1);
1724 atomic_inc(&conf->active_stripes); 1994 atomic_inc(&conf->active_stripes);
1725 INIT_LIST_HEAD(&sh->lru); 1995 INIT_LIST_HEAD(&sh->lru);
1996
1997 spin_lock_init(&sh->batch_lock);
1998 INIT_LIST_HEAD(&sh->batch_list);
1999 sh->batch_head = NULL;
1726 release_stripe(sh); 2000 release_stripe(sh);
2001 conf->max_nr_stripes++;
1727 return 1; 2002 return 1;
1728} 2003}
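
grow_one_stripe() now takes its allocation flags from the caller (GFP_KERNEL at array setup; presumably stricter flags such as GFP_NOIO from other paths) and spreads new stripes round-robin across the hash locks by deriving hash_lock_index from the running stripe count. The distribution rule on its own, with an assumed lock count:

    #include <stdio.h>

    #define TOY_NR_HASH_LOCKS 8   /* stands in for NR_STRIPE_HASH_LOCKS (assumed value) */

    int main(void)
    {
        int max_nr_stripes = 0;

        /* Each newly grown stripe lands on the next hash lock in turn. */
        for (int i = 0; i < 10; i++) {
            int hash_lock_index = max_nr_stripes % TOY_NR_HASH_LOCKS;

            max_nr_stripes++;
            printf("stripe %d -> hash lock %d\n", i, hash_lock_index);
        }
        return 0;
    }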
1729 2004
@@ -1731,7 +2006,6 @@ static int grow_stripes(struct r5conf *conf, int num)
1731{ 2006{
1732 struct kmem_cache *sc; 2007 struct kmem_cache *sc;
1733 int devs = max(conf->raid_disks, conf->previous_raid_disks); 2008 int devs = max(conf->raid_disks, conf->previous_raid_disks);
1734 int hash;
1735 2009
1736 if (conf->mddev->gendisk) 2010 if (conf->mddev->gendisk)
1737 sprintf(conf->cache_name[0], 2011 sprintf(conf->cache_name[0],
@@ -1749,13 +2023,10 @@ static int grow_stripes(struct r5conf *conf, int num)
1749 return 1; 2023 return 1;
1750 conf->slab_cache = sc; 2024 conf->slab_cache = sc;
1751 conf->pool_size = devs; 2025 conf->pool_size = devs;
1752 hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; 2026 while (num--)
1753 while (num--) { 2027 if (!grow_one_stripe(conf, GFP_KERNEL))
1754 if (!grow_one_stripe(conf, hash))
1755 return 1; 2028 return 1;
1756 conf->max_nr_stripes++; 2029
1757 hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
1758 }
1759 return 0; 2030 return 0;
1760} 2031}
1761 2032
@@ -1772,13 +2043,21 @@ static int grow_stripes(struct r5conf *conf, int num)
1772 * calculate over all devices (not just the data blocks), using zeros in place 2043 * calculate over all devices (not just the data blocks), using zeros in place
1773 * of the P and Q blocks. 2044 * of the P and Q blocks.
1774 */ 2045 */
1775static size_t scribble_len(int num) 2046static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags)
1776{ 2047{
2048 struct flex_array *ret;
1777 size_t len; 2049 size_t len;
1778 2050
1779 len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2); 2051 len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2);
1780 2052 ret = flex_array_alloc(len, cnt, flags);
1781 return len; 2053 if (!ret)
2054 return NULL;
2055 /* always prealloc all elements, so no locking is required */
2056 if (flex_array_prealloc(ret, 0, cnt, flags)) {
2057 flex_array_free(ret);
2058 return NULL;
2059 }
2060 return ret;
1782} 2061}
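
scribble_alloc() sizes the per-CPU scratch space for the worst case of one region per stripe in a batch, i.e. chunk_sectors / STRIPE_SECTORS regions of (num + 2) page pointers plus the address-conversion area, and preallocates every element so the I/O path never has to allocate or take a lock. A sketch of the sizing arithmetic only (the constants and the addr_conv size are assumptions):

    #include <stdio.h>

    #define TOY_STRIPE_SECTORS 8   /* 4KiB stripe / 512B sectors (assumed) */

    /* Bytes needed for one scratch region covering 'num' data devices. */
    static unsigned long toy_region_len(int num)
    {
        return sizeof(void *) * (num + 2) +   /* page pointer table            */
               sizeof(void *) * (num + 2);    /* addr_conv area, assumed to be
                                                 pointer-sized here            */
    }

    int main(void)
    {
        int data_disks = 8;
        int chunk_sectors = 512;              /* 256KiB chunk                  */
        int regions = chunk_sectors / TOY_STRIPE_SECTORS;

        printf("%d regions of %lu bytes = %lu bytes per CPU\n",
               regions, toy_region_len(data_disks),
               regions * toy_region_len(data_disks));
        return 0;
    }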
1783 2062
1784static int resize_stripes(struct r5conf *conf, int newsize) 2063static int resize_stripes(struct r5conf *conf, int newsize)
@@ -1896,16 +2175,16 @@ static int resize_stripes(struct r5conf *conf, int newsize)
1896 err = -ENOMEM; 2175 err = -ENOMEM;
1897 2176
1898 get_online_cpus(); 2177 get_online_cpus();
1899 conf->scribble_len = scribble_len(newsize);
1900 for_each_present_cpu(cpu) { 2178 for_each_present_cpu(cpu) {
1901 struct raid5_percpu *percpu; 2179 struct raid5_percpu *percpu;
1902 void *scribble; 2180 struct flex_array *scribble;
1903 2181
1904 percpu = per_cpu_ptr(conf->percpu, cpu); 2182 percpu = per_cpu_ptr(conf->percpu, cpu);
1905 scribble = kmalloc(conf->scribble_len, GFP_NOIO); 2183 scribble = scribble_alloc(newsize, conf->chunk_sectors /
2184 STRIPE_SECTORS, GFP_NOIO);
1906 2185
1907 if (scribble) { 2186 if (scribble) {
1908 kfree(percpu->scribble); 2187 flex_array_free(percpu->scribble);
1909 percpu->scribble = scribble; 2188 percpu->scribble = scribble;
1910 } else { 2189 } else {
1911 err = -ENOMEM; 2190 err = -ENOMEM;
@@ -1937,9 +2216,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
1937 return err; 2216 return err;
1938} 2217}
1939 2218
1940static int drop_one_stripe(struct r5conf *conf, int hash) 2219static int drop_one_stripe(struct r5conf *conf)
1941{ 2220{
1942 struct stripe_head *sh; 2221 struct stripe_head *sh;
2222 int hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS;
1943 2223
1944 spin_lock_irq(conf->hash_locks + hash); 2224 spin_lock_irq(conf->hash_locks + hash);
1945 sh = get_free_stripe(conf, hash); 2225 sh = get_free_stripe(conf, hash);
@@ -1950,15 +2230,15 @@ static int drop_one_stripe(struct r5conf *conf, int hash)
1950 shrink_buffers(sh); 2230 shrink_buffers(sh);
1951 kmem_cache_free(conf->slab_cache, sh); 2231 kmem_cache_free(conf->slab_cache, sh);
1952 atomic_dec(&conf->active_stripes); 2232 atomic_dec(&conf->active_stripes);
2233 conf->max_nr_stripes--;
1953 return 1; 2234 return 1;
1954} 2235}
1955 2236
1956static void shrink_stripes(struct r5conf *conf) 2237static void shrink_stripes(struct r5conf *conf)
1957{ 2238{
1958 int hash; 2239 while (conf->max_nr_stripes &&
1959 for (hash = 0; hash < NR_STRIPE_HASH_LOCKS; hash++) 2240 drop_one_stripe(conf))
1960 while (drop_one_stripe(conf, hash)) 2241 ;
1961 ;
1962 2242
1963 if (conf->slab_cache) 2243 if (conf->slab_cache)
1964 kmem_cache_destroy(conf->slab_cache); 2244 kmem_cache_destroy(conf->slab_cache);
@@ -2154,10 +2434,16 @@ static void raid5_end_write_request(struct bio *bi, int error)
2154 } 2434 }
2155 rdev_dec_pending(rdev, conf->mddev); 2435 rdev_dec_pending(rdev, conf->mddev);
2156 2436
2437 if (sh->batch_head && !uptodate)
2438 set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state);
2439
2157 if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags)) 2440 if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
2158 clear_bit(R5_LOCKED, &sh->dev[i].flags); 2441 clear_bit(R5_LOCKED, &sh->dev[i].flags);
2159 set_bit(STRIPE_HANDLE, &sh->state); 2442 set_bit(STRIPE_HANDLE, &sh->state);
2160 release_stripe(sh); 2443 release_stripe(sh);
2444
2445 if (sh->batch_head && sh != sh->batch_head)
2446 release_stripe(sh->batch_head);
2161} 2447}
2162 2448
2163static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous); 2449static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous);
@@ -2535,7 +2821,7 @@ static void
2535schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s, 2821schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
2536 int rcw, int expand) 2822 int rcw, int expand)
2537{ 2823{
2538 int i, pd_idx = sh->pd_idx, disks = sh->disks; 2824 int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
2539 struct r5conf *conf = sh->raid_conf; 2825 struct r5conf *conf = sh->raid_conf;
2540 int level = conf->level; 2826 int level = conf->level;
2541 2827
@@ -2571,13 +2857,15 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
2571 if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state)) 2857 if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
2572 atomic_inc(&conf->pending_full_writes); 2858 atomic_inc(&conf->pending_full_writes);
2573 } else { 2859 } else {
2574 BUG_ON(level == 6);
2575 BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) || 2860 BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
2576 test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags))); 2861 test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
2862 BUG_ON(level == 6 &&
2863 (!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
2864 test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
2577 2865
2578 for (i = disks; i--; ) { 2866 for (i = disks; i--; ) {
2579 struct r5dev *dev = &sh->dev[i]; 2867 struct r5dev *dev = &sh->dev[i];
2580 if (i == pd_idx) 2868 if (i == pd_idx || i == qd_idx)
2581 continue; 2869 continue;
2582 2870
2583 if (dev->towrite && 2871 if (dev->towrite &&
@@ -2624,7 +2912,8 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
2624 * toread/towrite point to the first in a chain. 2912 * toread/towrite point to the first in a chain.
2625 * The bi_next chain must be in order. 2913 * The bi_next chain must be in order.
2626 */ 2914 */
2627static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite) 2915static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
2916 int forwrite, int previous)
2628{ 2917{
2629 struct bio **bip; 2918 struct bio **bip;
2630 struct r5conf *conf = sh->raid_conf; 2919 struct r5conf *conf = sh->raid_conf;
@@ -2643,6 +2932,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
2643 * protect it. 2932 * protect it.
2644 */ 2933 */
2645 spin_lock_irq(&sh->stripe_lock); 2934 spin_lock_irq(&sh->stripe_lock);
2935 /* Don't allow new IO added to stripes in batch list */
2936 if (sh->batch_head)
2937 goto overlap;
2646 if (forwrite) { 2938 if (forwrite) {
2647 bip = &sh->dev[dd_idx].towrite; 2939 bip = &sh->dev[dd_idx].towrite;
2648 if (*bip == NULL) 2940 if (*bip == NULL)
@@ -2657,6 +2949,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
2657 if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi)) 2949 if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
2658 goto overlap; 2950 goto overlap;
2659 2951
2952 if (!forwrite || previous)
2953 clear_bit(STRIPE_BATCH_READY, &sh->state);
2954
2660 BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next); 2955 BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
2661 if (*bip) 2956 if (*bip)
2662 bi->bi_next = *bip; 2957 bi->bi_next = *bip;
@@ -2674,7 +2969,8 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
2674 sector = bio_end_sector(bi); 2969 sector = bio_end_sector(bi);
2675 } 2970 }
2676 if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS) 2971 if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
2677 set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags); 2972 if (!test_and_set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags))
2973 sh->overwrite_disks++;
2678 } 2974 }
2679 2975
2680 pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n", 2976 pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
@@ -2688,6 +2984,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
2688 sh->bm_seq = conf->seq_flush+1; 2984 sh->bm_seq = conf->seq_flush+1;
2689 set_bit(STRIPE_BIT_DELAY, &sh->state); 2985 set_bit(STRIPE_BIT_DELAY, &sh->state);
2690 } 2986 }
2987
2988 if (stripe_can_batch(sh))
2989 stripe_add_to_batch_list(conf, sh);
2691 return 1; 2990 return 1;
2692 2991
2693 overlap: 2992 overlap:
@@ -2720,6 +3019,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
2720 struct bio **return_bi) 3019 struct bio **return_bi)
2721{ 3020{
2722 int i; 3021 int i;
3022 BUG_ON(sh->batch_head);
2723 for (i = disks; i--; ) { 3023 for (i = disks; i--; ) {
2724 struct bio *bi; 3024 struct bio *bi;
2725 int bitmap_end = 0; 3025 int bitmap_end = 0;
@@ -2746,6 +3046,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
2746 /* fail all writes first */ 3046 /* fail all writes first */
2747 bi = sh->dev[i].towrite; 3047 bi = sh->dev[i].towrite;
2748 sh->dev[i].towrite = NULL; 3048 sh->dev[i].towrite = NULL;
3049 sh->overwrite_disks = 0;
2749 spin_unlock_irq(&sh->stripe_lock); 3050 spin_unlock_irq(&sh->stripe_lock);
2750 if (bi) 3051 if (bi)
2751 bitmap_end = 1; 3052 bitmap_end = 1;
@@ -2834,6 +3135,7 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
2834 int abort = 0; 3135 int abort = 0;
2835 int i; 3136 int i;
2836 3137
3138 BUG_ON(sh->batch_head);
2837 clear_bit(STRIPE_SYNCING, &sh->state); 3139 clear_bit(STRIPE_SYNCING, &sh->state);
2838 if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags)) 3140 if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags))
2839 wake_up(&conf->wait_for_overlap); 3141 wake_up(&conf->wait_for_overlap);
@@ -3064,6 +3366,7 @@ static void handle_stripe_fill(struct stripe_head *sh,
3064{ 3366{
3065 int i; 3367 int i;
3066 3368
3369 BUG_ON(sh->batch_head);
3067 /* look for blocks to read/compute, skip this if a compute 3370 /* look for blocks to read/compute, skip this if a compute
3068 * is already in flight, or if the stripe contents are in the 3371 * is already in flight, or if the stripe contents are in the
3069 * midst of changing due to a write 3372 * midst of changing due to a write
@@ -3087,6 +3390,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
3087 int i; 3390 int i;
3088 struct r5dev *dev; 3391 struct r5dev *dev;
3089 int discard_pending = 0; 3392 int discard_pending = 0;
3393 struct stripe_head *head_sh = sh;
3394 bool do_endio = false;
3395 int wakeup_nr = 0;
3090 3396
3091 for (i = disks; i--; ) 3397 for (i = disks; i--; )
3092 if (sh->dev[i].written) { 3398 if (sh->dev[i].written) {
@@ -3102,8 +3408,11 @@ static void handle_stripe_clean_event(struct r5conf *conf,
3102 clear_bit(R5_UPTODATE, &dev->flags); 3408 clear_bit(R5_UPTODATE, &dev->flags);
3103 if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) { 3409 if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) {
3104 WARN_ON(test_bit(R5_UPTODATE, &dev->flags)); 3410 WARN_ON(test_bit(R5_UPTODATE, &dev->flags));
3105 dev->page = dev->orig_page;
3106 } 3411 }
3412 do_endio = true;
3413
3414returnbi:
3415 dev->page = dev->orig_page;
3107 wbi = dev->written; 3416 wbi = dev->written;
3108 dev->written = NULL; 3417 dev->written = NULL;
3109 while (wbi && wbi->bi_iter.bi_sector < 3418 while (wbi && wbi->bi_iter.bi_sector <
@@ -3120,6 +3429,17 @@ static void handle_stripe_clean_event(struct r5conf *conf,
3120 STRIPE_SECTORS, 3429 STRIPE_SECTORS,
3121 !test_bit(STRIPE_DEGRADED, &sh->state), 3430 !test_bit(STRIPE_DEGRADED, &sh->state),
3122 0); 3431 0);
3432 if (head_sh->batch_head) {
3433 sh = list_first_entry(&sh->batch_list,
3434 struct stripe_head,
3435 batch_list);
3436 if (sh != head_sh) {
3437 dev = &sh->dev[i];
3438 goto returnbi;
3439 }
3440 }
3441 sh = head_sh;
3442 dev = &sh->dev[i];
3123 } else if (test_bit(R5_Discard, &dev->flags)) 3443 } else if (test_bit(R5_Discard, &dev->flags))
3124 discard_pending = 1; 3444 discard_pending = 1;
3125 WARN_ON(test_bit(R5_SkipCopy, &dev->flags)); 3445 WARN_ON(test_bit(R5_SkipCopy, &dev->flags));
@@ -3141,8 +3461,17 @@ static void handle_stripe_clean_event(struct r5conf *conf,
3141 * will be reinitialized 3461 * will be reinitialized
3142 */ 3462 */
3143 spin_lock_irq(&conf->device_lock); 3463 spin_lock_irq(&conf->device_lock);
3464unhash:
3144 remove_hash(sh); 3465 remove_hash(sh);
3466 if (head_sh->batch_head) {
3467 sh = list_first_entry(&sh->batch_list,
3468 struct stripe_head, batch_list);
3469 if (sh != head_sh)
3470 goto unhash;
3471 }
3145 spin_unlock_irq(&conf->device_lock); 3472 spin_unlock_irq(&conf->device_lock);
3473 sh = head_sh;
3474
3146 if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) 3475 if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state))
3147 set_bit(STRIPE_HANDLE, &sh->state); 3476 set_bit(STRIPE_HANDLE, &sh->state);
3148 3477
@@ -3151,6 +3480,45 @@ static void handle_stripe_clean_event(struct r5conf *conf,
3151 if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state)) 3480 if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
3152 if (atomic_dec_and_test(&conf->pending_full_writes)) 3481 if (atomic_dec_and_test(&conf->pending_full_writes))
3153 md_wakeup_thread(conf->mddev->thread); 3482 md_wakeup_thread(conf->mddev->thread);
3483
3484 if (!head_sh->batch_head || !do_endio)
3485 return;
3486 for (i = 0; i < head_sh->disks; i++) {
3487 if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
3488 wakeup_nr++;
3489 }
3490 while (!list_empty(&head_sh->batch_list)) {
3491 int i;
3492 sh = list_first_entry(&head_sh->batch_list,
3493 struct stripe_head, batch_list);
3494 list_del_init(&sh->batch_list);
3495
3496 set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
3497 head_sh->state & ~((1 << STRIPE_ACTIVE) |
3498 (1 << STRIPE_PREREAD_ACTIVE) |
3499 STRIPE_EXPAND_SYNC_FLAG));
3500 sh->check_state = head_sh->check_state;
3501 sh->reconstruct_state = head_sh->reconstruct_state;
3502 for (i = 0; i < sh->disks; i++) {
3503 if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
3504 wakeup_nr++;
3505 sh->dev[i].flags = head_sh->dev[i].flags;
3506 }
3507
3508 spin_lock_irq(&sh->stripe_lock);
3509 sh->batch_head = NULL;
3510 spin_unlock_irq(&sh->stripe_lock);
3511 if (sh->state & STRIPE_EXPAND_SYNC_FLAG)
3512 set_bit(STRIPE_HANDLE, &sh->state);
3513 release_stripe(sh);
3514 }
3515
3516 spin_lock_irq(&head_sh->stripe_lock);
3517 head_sh->batch_head = NULL;
3518 spin_unlock_irq(&head_sh->stripe_lock);
3519 wake_up_nr(&conf->wait_for_overlap, wakeup_nr);
3520 if (head_sh->state & STRIPE_EXPAND_SYNC_FLAG)
3521 set_bit(STRIPE_HANDLE, &head_sh->state);
3154} 3522}
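
Once the head stripe's writes complete, handle_stripe_clean_event() tears the batch down: each member inherits the head's final state minus per-stripe bits such as STRIPE_ACTIVE and STRIPE_PREREAD_ACTIVE, has its batch_head pointer cleared, and is released on its own. A much-simplified sketch of that hand-off over a plain singly linked list (field and flag names are illustrative, and all locking is omitted):

    #include <stdio.h>

    #define TOY_ACTIVE   (1u << 0)   /* per-stripe, must not be inherited */
    #define TOY_DEGRADED (1u << 1)
    #define TOY_HANDLE   (1u << 2)

    struct toy_stripe {
        unsigned int state;
        struct toy_stripe *batch_head;
        struct toy_stripe *next;             /* batch list, head first */
    };

    /* Propagate the head's final state to every batch member and detach it. */
    static void toy_break_batch(struct toy_stripe *head)
    {
        unsigned int inherit = head->state & ~TOY_ACTIVE;

        for (struct toy_stripe *sh = head->next; sh; ) {
            struct toy_stripe *next = sh->next;

            sh->state = inherit;
            sh->batch_head = NULL;
            sh->next = NULL;
            sh = next;
        }
        head->next = NULL;
        head->batch_head = NULL;
    }

    int main(void)
    {
        struct toy_stripe m = { 0 }, h = { TOY_ACTIVE | TOY_HANDLE, &h, &m };

        m.batch_head = &h;
        toy_break_batch(&h);
        printf("member state %#x, detached %d\n", m.state, m.batch_head == NULL);
        return 0;
    }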
3155 3523
3156static void handle_stripe_dirtying(struct r5conf *conf, 3524static void handle_stripe_dirtying(struct r5conf *conf,
@@ -3161,28 +3529,27 @@ static void handle_stripe_dirtying(struct r5conf *conf,
3161 int rmw = 0, rcw = 0, i; 3529 int rmw = 0, rcw = 0, i;
3162 sector_t recovery_cp = conf->mddev->recovery_cp; 3530 sector_t recovery_cp = conf->mddev->recovery_cp;
3163 3531
3164 /* RAID6 requires 'rcw' in current implementation. 3532 /* Check whether resync is now happening or should start.
3165 * Otherwise, check whether resync is now happening or should start.
3166 * If yes, then the array is dirty (after unclean shutdown or 3533 * If yes, then the array is dirty (after unclean shutdown or
3167 * initial creation), so parity in some stripes might be inconsistent. 3534 * initial creation), so parity in some stripes might be inconsistent.
3168 * In this case, we need to always do reconstruct-write, to ensure 3535 * In this case, we need to always do reconstruct-write, to ensure
3169 * that in case of drive failure or read-error correction, we 3536 * that in case of drive failure or read-error correction, we
3170 * generate correct data from the parity. 3537 * generate correct data from the parity.
3171 */ 3538 */
3172 if (conf->max_degraded == 2 || 3539 if (conf->rmw_level == PARITY_DISABLE_RMW ||
3173 (recovery_cp < MaxSector && sh->sector >= recovery_cp && 3540 (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
3174 s->failed == 0)) { 3541 s->failed == 0)) {
3175 /* Calculate the real rcw later - for now make it 3542 /* Calculate the real rcw later - for now make it
3176 * look like rcw is cheaper 3543 * look like rcw is cheaper
3177 */ 3544 */
3178 rcw = 1; rmw = 2; 3545 rcw = 1; rmw = 2;
3179 pr_debug("force RCW max_degraded=%u, recovery_cp=%llu sh->sector=%llu\n", 3546 pr_debug("force RCW rmw_level=%u, recovery_cp=%llu sh->sector=%llu\n",
3180 conf->max_degraded, (unsigned long long)recovery_cp, 3547 conf->rmw_level, (unsigned long long)recovery_cp,
3181 (unsigned long long)sh->sector); 3548 (unsigned long long)sh->sector);
3182 } else for (i = disks; i--; ) { 3549 } else for (i = disks; i--; ) {
3183 /* would I have to read this buffer for read_modify_write */ 3550 /* would I have to read this buffer for read_modify_write */
3184 struct r5dev *dev = &sh->dev[i]; 3551 struct r5dev *dev = &sh->dev[i];
3185 if ((dev->towrite || i == sh->pd_idx) && 3552 if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
3186 !test_bit(R5_LOCKED, &dev->flags) && 3553 !test_bit(R5_LOCKED, &dev->flags) &&
3187 !(test_bit(R5_UPTODATE, &dev->flags) || 3554 !(test_bit(R5_UPTODATE, &dev->flags) ||
3188 test_bit(R5_Wantcompute, &dev->flags))) { 3555 test_bit(R5_Wantcompute, &dev->flags))) {
@@ -3192,7 +3559,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
3192 rmw += 2*disks; /* cannot read it */ 3559 rmw += 2*disks; /* cannot read it */
3193 } 3560 }
3194 /* Would I have to read this buffer for reconstruct_write */ 3561 /* Would I have to read this buffer for reconstruct_write */
3195 if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx && 3562 if (!test_bit(R5_OVERWRITE, &dev->flags) &&
3563 i != sh->pd_idx && i != sh->qd_idx &&
3196 !test_bit(R5_LOCKED, &dev->flags) && 3564 !test_bit(R5_LOCKED, &dev->flags) &&
3197 !(test_bit(R5_UPTODATE, &dev->flags) || 3565 !(test_bit(R5_UPTODATE, &dev->flags) ||
3198 test_bit(R5_Wantcompute, &dev->flags))) { 3566 test_bit(R5_Wantcompute, &dev->flags))) {
@@ -3205,7 +3573,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
3205 pr_debug("for sector %llu, rmw=%d rcw=%d\n", 3573 pr_debug("for sector %llu, rmw=%d rcw=%d\n",
3206 (unsigned long long)sh->sector, rmw, rcw); 3574 (unsigned long long)sh->sector, rmw, rcw);
3207 set_bit(STRIPE_HANDLE, &sh->state); 3575 set_bit(STRIPE_HANDLE, &sh->state);
3208 if (rmw < rcw && rmw > 0) { 3576 if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_ENABLE_RMW)) && rmw > 0) {
3209 /* prefer read-modify-write, but need to get some data */ 3577 /* prefer read-modify-write, but need to get some data */
3210 if (conf->mddev->queue) 3578 if (conf->mddev->queue)
3211 blk_add_trace_msg(conf->mddev->queue, 3579 blk_add_trace_msg(conf->mddev->queue,
@@ -3213,7 +3581,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
3213 (unsigned long long)sh->sector, rmw); 3581 (unsigned long long)sh->sector, rmw);
3214 for (i = disks; i--; ) { 3582 for (i = disks; i--; ) {
3215 struct r5dev *dev = &sh->dev[i]; 3583 struct r5dev *dev = &sh->dev[i];
3216 if ((dev->towrite || i == sh->pd_idx) && 3584 if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
3217 !test_bit(R5_LOCKED, &dev->flags) && 3585 !test_bit(R5_LOCKED, &dev->flags) &&
3218 !(test_bit(R5_UPTODATE, &dev->flags) || 3586 !(test_bit(R5_UPTODATE, &dev->flags) ||
3219 test_bit(R5_Wantcompute, &dev->flags)) && 3587 test_bit(R5_Wantcompute, &dev->flags)) &&
@@ -3232,7 +3600,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
3232 } 3600 }
3233 } 3601 }
3234 } 3602 }
3235 if (rcw <= rmw && rcw > 0) { 3603 if ((rcw < rmw || (rcw == rmw && conf->rmw_level != PARITY_ENABLE_RMW)) && rcw > 0) {
3236 /* want reconstruct write, but need to get some data */ 3604 /* want reconstruct write, but need to get some data */
3237 int qread =0; 3605 int qread =0;
3238 rcw = 0; 3606 rcw = 0;
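
handle_stripe_dirtying() now counts how many device reads a read-modify-write (rmw) versus a reconstruct-write (rcw) would need for this stripe, and conf->rmw_level can force one policy or break a tie. A hedged sketch of just the comparison (the per-device LOCKED/UPTODATE accounting and the rmw > 0 guard are omitted, and the policy names are illustrative rather than the kernel's PARITY_* constants):

    #include <stdio.h>

    enum toy_rmw_level {                 /* mirrors the rmw_level policies */
        TOY_DISABLE_RMW,                 /* always reconstruct-write       */
        TOY_ENABLE_RMW,                  /* allow rmw, prefer it on a tie  */
        TOY_PREFER_RCW,                  /* allow rmw, prefer rcw on a tie */
    };

    /* Return 1 to do read-modify-write, 0 to do reconstruct-write. */
    static int toy_choose_rmw(int rmw_reads, int rcw_reads, enum toy_rmw_level level)
    {
        if (level == TOY_DISABLE_RMW)
            return 0;
        if (rmw_reads < rcw_reads)
            return 1;
        if (rmw_reads == rcw_reads && level == TOY_ENABLE_RMW)
            return 1;
        return 0;
    }

    int main(void)
    {
        /* Small update on a wide RAID6 stripe: rmw needs fewer reads. */
        printf("%d\n", toy_choose_rmw(3, 7, TOY_ENABLE_RMW));
        /* Tie: the policy decides. */
        printf("%d %d\n", toy_choose_rmw(4, 4, TOY_ENABLE_RMW),
                          toy_choose_rmw(4, 4, TOY_PREFER_RCW));
        return 0;
    }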
@@ -3290,6 +3658,7 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
3290{ 3658{
3291 struct r5dev *dev = NULL; 3659 struct r5dev *dev = NULL;
3292 3660
3661 BUG_ON(sh->batch_head);
3293 set_bit(STRIPE_HANDLE, &sh->state); 3662 set_bit(STRIPE_HANDLE, &sh->state);
3294 3663
3295 switch (sh->check_state) { 3664 switch (sh->check_state) {
@@ -3380,6 +3749,7 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
3380 int qd_idx = sh->qd_idx; 3749 int qd_idx = sh->qd_idx;
3381 struct r5dev *dev; 3750 struct r5dev *dev;
3382 3751
3752 BUG_ON(sh->batch_head);
3383 set_bit(STRIPE_HANDLE, &sh->state); 3753 set_bit(STRIPE_HANDLE, &sh->state);
3384 3754
3385 BUG_ON(s->failed > 2); 3755 BUG_ON(s->failed > 2);
@@ -3543,6 +3913,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
3543 * copy some of them into a target stripe for expand. 3913 * copy some of them into a target stripe for expand.
3544 */ 3914 */
3545 struct dma_async_tx_descriptor *tx = NULL; 3915 struct dma_async_tx_descriptor *tx = NULL;
3916 BUG_ON(sh->batch_head);
3546 clear_bit(STRIPE_EXPAND_SOURCE, &sh->state); 3917 clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
3547 for (i = 0; i < sh->disks; i++) 3918 for (i = 0; i < sh->disks; i++)
3548 if (i != sh->pd_idx && i != sh->qd_idx) { 3919 if (i != sh->pd_idx && i != sh->qd_idx) {
@@ -3615,8 +3986,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
3615 3986
3616 memset(s, 0, sizeof(*s)); 3987 memset(s, 0, sizeof(*s));
3617 3988
3618 s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); 3989 s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state) && !sh->batch_head;
3619 s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); 3990 s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
3620 s->failed_num[0] = -1; 3991 s->failed_num[0] = -1;
3621 s->failed_num[1] = -1; 3992 s->failed_num[1] = -1;
3622 3993
@@ -3786,6 +4157,80 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
3786 rcu_read_unlock(); 4157 rcu_read_unlock();
3787} 4158}
3788 4159
4160static int clear_batch_ready(struct stripe_head *sh)
4161{
4162 struct stripe_head *tmp;
4163 if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
4164 return 0;
4165 spin_lock(&sh->stripe_lock);
4166 if (!sh->batch_head) {
4167 spin_unlock(&sh->stripe_lock);
4168 return 0;
4169 }
4170
4171 /*
 4172 * this stripe could have been added to a batch list before we
 4173 * checked BATCH_READY; if so, skip it
4174 */
4175 if (sh->batch_head != sh) {
4176 spin_unlock(&sh->stripe_lock);
4177 return 1;
4178 }
4179 spin_lock(&sh->batch_lock);
4180 list_for_each_entry(tmp, &sh->batch_list, batch_list)
4181 clear_bit(STRIPE_BATCH_READY, &tmp->state);
4182 spin_unlock(&sh->batch_lock);
4183 spin_unlock(&sh->stripe_lock);
4184
4185 /*
4186 * BATCH_READY is cleared, no new stripes can be added.
4187 * batch_list can be accessed without lock
4188 */
4189 return 0;
4190}
4191
4192static void check_break_stripe_batch_list(struct stripe_head *sh)
4193{
4194 struct stripe_head *head_sh, *next;
4195 int i;
4196
4197 if (!test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
4198 return;
4199
4200 head_sh = sh;
4201 do {
4202 sh = list_first_entry(&sh->batch_list,
4203 struct stripe_head, batch_list);
4204 BUG_ON(sh == head_sh);
4205 } while (!test_bit(STRIPE_DEGRADED, &sh->state));
4206
4207 while (sh != head_sh) {
4208 next = list_first_entry(&sh->batch_list,
4209 struct stripe_head, batch_list);
4210 list_del_init(&sh->batch_list);
4211
4212 set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
4213 head_sh->state & ~((1 << STRIPE_ACTIVE) |
4214 (1 << STRIPE_PREREAD_ACTIVE) |
4215 (1 << STRIPE_DEGRADED) |
4216 STRIPE_EXPAND_SYNC_FLAG));
4217 sh->check_state = head_sh->check_state;
4218 sh->reconstruct_state = head_sh->reconstruct_state;
4219 for (i = 0; i < sh->disks; i++)
4220 sh->dev[i].flags = head_sh->dev[i].flags &
4221 (~((1 << R5_WriteError) | (1 << R5_Overlap)));
4222
4223 spin_lock_irq(&sh->stripe_lock);
4224 sh->batch_head = NULL;
4225 spin_unlock_irq(&sh->stripe_lock);
4226
4227 set_bit(STRIPE_HANDLE, &sh->state);
4228 release_stripe(sh);
4229
4230 sh = next;
4231 }
4232}
4233
3789static void handle_stripe(struct stripe_head *sh) 4234static void handle_stripe(struct stripe_head *sh)
3790{ 4235{
3791 struct stripe_head_state s; 4236 struct stripe_head_state s;
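clear_batch_ready() and check_break_stripe_batch_list() above carry the new write-batching state. A batched stripe points batch_head at the head of its batch (the head points at itself) and the members hang off the head's batch_list, so handle_stripe() in the next hunk bails out early for members, and on STRIPE_BATCH_ERR each member inherits most of the head's state before being released. As far as I can tell from the set_mask_bits() call, a member keeps only its own EXPAND/SYNC bits and takes everything else from the head minus ACTIVE, PREREAD_ACTIVE and DEGRADED; a plain-C sketch of that merge, with an invented name (inherit_batch_state() is not a kernel function):

    /* Illustrative only -- not the kernel's atomic set_mask_bits() macro. */
    unsigned long inherit_batch_state(unsigned long member, unsigned long head,
                                      unsigned long expand_sync_mask,
                                      unsigned long skip_mask)
    {
        /* keep the member's own expand/sync bits ... */
        member &= expand_sync_mask;
        /* ... and copy in the head's state minus the skipped bits */
        return member | (head & ~(skip_mask | expand_sync_mask));
    }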
@@ -3803,7 +4248,14 @@ static void handle_stripe(struct stripe_head *sh)
3803 return; 4248 return;
3804 } 4249 }
3805 4250
3806 if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) { 4251 if (clear_batch_ready(sh) ) {
4252 clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
4253 return;
4254 }
4255
4256 check_break_stripe_batch_list(sh);
4257
4258 if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
3807 spin_lock(&sh->stripe_lock); 4259 spin_lock(&sh->stripe_lock);
3808 /* Cannot process 'sync' concurrently with 'discard' */ 4260 /* Cannot process 'sync' concurrently with 'discard' */
3809 if (!test_bit(STRIPE_DISCARD, &sh->state) && 4261 if (!test_bit(STRIPE_DISCARD, &sh->state) &&
@@ -4158,7 +4610,7 @@ static int raid5_congested(struct mddev *mddev, int bits)
4158 * how busy the stripe_cache is 4610 * how busy the stripe_cache is
4159 */ 4611 */
4160 4612
4161 if (conf->inactive_blocked) 4613 if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
4162 return 1; 4614 return 1;
4163 if (conf->quiesce) 4615 if (conf->quiesce)
4164 return 1; 4616 return 1;
@@ -4180,8 +4632,12 @@ static int raid5_mergeable_bvec(struct mddev *mddev,
4180 unsigned int chunk_sectors = mddev->chunk_sectors; 4632 unsigned int chunk_sectors = mddev->chunk_sectors;
4181 unsigned int bio_sectors = bvm->bi_size >> 9; 4633 unsigned int bio_sectors = bvm->bi_size >> 9;
4182 4634
4183 if ((bvm->bi_rw & 1) == WRITE) 4635 /*
4184 return biovec->bv_len; /* always allow writes to be mergeable */ 4636 * always allow writes to be mergeable, read as well if array
4637 * is degraded as we'll go through stripe cache anyway.
4638 */
4639 if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
4640 return biovec->bv_len;
4185 4641
4186 if (mddev->new_chunk_sectors < mddev->chunk_sectors) 4642 if (mddev->new_chunk_sectors < mddev->chunk_sectors)
4187 chunk_sectors = mddev->new_chunk_sectors; 4643 chunk_sectors = mddev->new_chunk_sectors;
@@ -4603,12 +5059,14 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
4603 } 5059 }
4604 set_bit(STRIPE_DISCARD, &sh->state); 5060 set_bit(STRIPE_DISCARD, &sh->state);
4605 finish_wait(&conf->wait_for_overlap, &w); 5061 finish_wait(&conf->wait_for_overlap, &w);
5062 sh->overwrite_disks = 0;
4606 for (d = 0; d < conf->raid_disks; d++) { 5063 for (d = 0; d < conf->raid_disks; d++) {
4607 if (d == sh->pd_idx || d == sh->qd_idx) 5064 if (d == sh->pd_idx || d == sh->qd_idx)
4608 continue; 5065 continue;
4609 sh->dev[d].towrite = bi; 5066 sh->dev[d].towrite = bi;
4610 set_bit(R5_OVERWRITE, &sh->dev[d].flags); 5067 set_bit(R5_OVERWRITE, &sh->dev[d].flags);
4611 raid5_inc_bi_active_stripes(bi); 5068 raid5_inc_bi_active_stripes(bi);
5069 sh->overwrite_disks++;
4612 } 5070 }
4613 spin_unlock_irq(&sh->stripe_lock); 5071 spin_unlock_irq(&sh->stripe_lock);
4614 if (conf->mddev->bitmap) { 5072 if (conf->mddev->bitmap) {
@@ -4656,7 +5114,12 @@ static void make_request(struct mddev *mddev, struct bio * bi)
4656 5114
4657 md_write_start(mddev, bi); 5115 md_write_start(mddev, bi);
4658 5116
4659 if (rw == READ && 5117 /*
5118 * If array is degraded, better not do chunk aligned read because
5119 * later we might have to read it again in order to reconstruct
5120 * data on failed drives.
5121 */
5122 if (rw == READ && mddev->degraded == 0 &&
4660 mddev->reshape_position == MaxSector && 5123 mddev->reshape_position == MaxSector &&
4661 chunk_aligned_read(mddev,bi)) 5124 chunk_aligned_read(mddev,bi))
4662 return; 5125 return;
@@ -4772,7 +5235,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
4772 } 5235 }
4773 5236
4774 if (test_bit(STRIPE_EXPANDING, &sh->state) || 5237 if (test_bit(STRIPE_EXPANDING, &sh->state) ||
4775 !add_stripe_bio(sh, bi, dd_idx, rw)) { 5238 !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
4776 /* Stripe is busy expanding or 5239 /* Stripe is busy expanding or
4777 * add failed due to overlap. Flush everything 5240 * add failed due to overlap. Flush everything
4778 * and wait a while 5241 * and wait a while
@@ -4785,7 +5248,8 @@ static void make_request(struct mddev *mddev, struct bio * bi)
4785 } 5248 }
4786 set_bit(STRIPE_HANDLE, &sh->state); 5249 set_bit(STRIPE_HANDLE, &sh->state);
4787 clear_bit(STRIPE_DELAYED, &sh->state); 5250 clear_bit(STRIPE_DELAYED, &sh->state);
4788 if ((bi->bi_rw & REQ_SYNC) && 5251 if ((!sh->batch_head || sh == sh->batch_head) &&
5252 (bi->bi_rw & REQ_SYNC) &&
4789 !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) 5253 !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
4790 atomic_inc(&conf->preread_active_stripes); 5254 atomic_inc(&conf->preread_active_stripes);
4791 release_stripe_plug(mddev, sh); 5255 release_stripe_plug(mddev, sh);
@@ -5050,8 +5514,7 @@ ret:
5050 return reshape_sectors; 5514 return reshape_sectors;
5051} 5515}
5052 5516
5053/* FIXME go_faster isn't used */ 5517static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped)
5054static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster)
5055{ 5518{
5056 struct r5conf *conf = mddev->private; 5519 struct r5conf *conf = mddev->private;
5057 struct stripe_head *sh; 5520 struct stripe_head *sh;
@@ -5186,7 +5649,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
5186 return handled; 5649 return handled;
5187 } 5650 }
5188 5651
5189 if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { 5652 if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
5190 release_stripe(sh); 5653 release_stripe(sh);
5191 raid5_set_bi_processed_stripes(raid_bio, scnt); 5654 raid5_set_bi_processed_stripes(raid_bio, scnt);
5192 conf->retry_read_aligned = raid_bio; 5655 conf->retry_read_aligned = raid_bio;
@@ -5312,6 +5775,8 @@ static void raid5d(struct md_thread *thread)
5312 int batch_size, released; 5775 int batch_size, released;
5313 5776
5314 released = release_stripe_list(conf, conf->temp_inactive_list); 5777 released = release_stripe_list(conf, conf->temp_inactive_list);
5778 if (released)
5779 clear_bit(R5_DID_ALLOC, &conf->cache_state);
5315 5780
5316 if ( 5781 if (
5317 !list_empty(&conf->bitmap_list)) { 5782 !list_empty(&conf->bitmap_list)) {
@@ -5350,6 +5815,13 @@ static void raid5d(struct md_thread *thread)
5350 pr_debug("%d stripes handled\n", handled); 5815 pr_debug("%d stripes handled\n", handled);
5351 5816
5352 spin_unlock_irq(&conf->device_lock); 5817 spin_unlock_irq(&conf->device_lock);
5818 if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state)) {
5819 grow_one_stripe(conf, __GFP_NOWARN);
5820 /* Set flag even if allocation failed. This helps
5821 * slow down allocation requests when mem is short
5822 */
5823 set_bit(R5_DID_ALLOC, &conf->cache_state);
5824 }
5353 5825
5354 async_tx_issue_pending_all(); 5826 async_tx_issue_pending_all();
5355 blk_finish_plug(&plug); 5827 blk_finish_plug(&plug);
@@ -5365,7 +5837,7 @@ raid5_show_stripe_cache_size(struct mddev *mddev, char *page)
5365 spin_lock(&mddev->lock); 5837 spin_lock(&mddev->lock);
5366 conf = mddev->private; 5838 conf = mddev->private;
5367 if (conf) 5839 if (conf)
5368 ret = sprintf(page, "%d\n", conf->max_nr_stripes); 5840 ret = sprintf(page, "%d\n", conf->min_nr_stripes);
5369 spin_unlock(&mddev->lock); 5841 spin_unlock(&mddev->lock);
5370 return ret; 5842 return ret;
5371} 5843}
@@ -5375,30 +5847,24 @@ raid5_set_cache_size(struct mddev *mddev, int size)
5375{ 5847{
5376 struct r5conf *conf = mddev->private; 5848 struct r5conf *conf = mddev->private;
5377 int err; 5849 int err;
5378 int hash;
5379 5850
5380 if (size <= 16 || size > 32768) 5851 if (size <= 16 || size > 32768)
5381 return -EINVAL; 5852 return -EINVAL;
5382 hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS; 5853
5383 while (size < conf->max_nr_stripes) { 5854 conf->min_nr_stripes = size;
5384 if (drop_one_stripe(conf, hash)) 5855 while (size < conf->max_nr_stripes &&
5385 conf->max_nr_stripes--; 5856 drop_one_stripe(conf))
5386 else 5857 ;
5387 break; 5858
5388 hash--; 5859
5389 if (hash < 0)
5390 hash = NR_STRIPE_HASH_LOCKS - 1;
5391 }
5392 err = md_allow_write(mddev); 5860 err = md_allow_write(mddev);
5393 if (err) 5861 if (err)
5394 return err; 5862 return err;
5395 hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; 5863
5396 while (size > conf->max_nr_stripes) { 5864 while (size > conf->max_nr_stripes)
5397 if (grow_one_stripe(conf, hash)) 5865 if (!grow_one_stripe(conf, GFP_KERNEL))
5398 conf->max_nr_stripes++; 5866 break;
5399 else break; 5867
5400 hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
5401 }
5402 return 0; 5868 return 0;
5403} 5869}
5404EXPORT_SYMBOL(raid5_set_cache_size); 5870EXPORT_SYMBOL(raid5_set_cache_size);
@@ -5433,6 +5899,49 @@ raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
5433 raid5_store_stripe_cache_size); 5899 raid5_store_stripe_cache_size);
5434 5900
5435static ssize_t 5901static ssize_t
5902raid5_show_rmw_level(struct mddev *mddev, char *page)
5903{
5904 struct r5conf *conf = mddev->private;
5905 if (conf)
5906 return sprintf(page, "%d\n", conf->rmw_level);
5907 else
5908 return 0;
5909}
5910
5911static ssize_t
5912raid5_store_rmw_level(struct mddev *mddev, const char *page, size_t len)
5913{
5914 struct r5conf *conf = mddev->private;
5915 unsigned long new;
5916
5917 if (!conf)
5918 return -ENODEV;
5919
5920 if (len >= PAGE_SIZE)
5921 return -EINVAL;
5922
5923 if (kstrtoul(page, 10, &new))
5924 return -EINVAL;
5925
5926 if (new != PARITY_DISABLE_RMW && !raid6_call.xor_syndrome)
5927 return -EINVAL;
5928
5929 if (new != PARITY_DISABLE_RMW &&
5930 new != PARITY_ENABLE_RMW &&
5931 new != PARITY_PREFER_RMW)
5932 return -EINVAL;
5933
5934 conf->rmw_level = new;
5935 return len;
5936}
5937
5938static struct md_sysfs_entry
5939raid5_rmw_level = __ATTR(rmw_level, S_IRUGO | S_IWUSR,
5940 raid5_show_rmw_level,
5941 raid5_store_rmw_level);
5942
5943
5944static ssize_t
5436raid5_show_preread_threshold(struct mddev *mddev, char *page) 5945raid5_show_preread_threshold(struct mddev *mddev, char *page)
5437{ 5946{
5438 struct r5conf *conf; 5947 struct r5conf *conf;
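The rmw_level attribute above is hooked into raid5_attrs[] further down, so it appears next to stripe_cache_size in the array's md sysfs directory. Writing 0, 1 or 2 selects PARITY_DISABLE_RMW, PARITY_ENABLE_RMW or PARITY_PREFER_RMW, and anything other than 0 is rejected while the active raid6 algorithm lacks an xor_syndrome() implementation. A hedged user-space example -- the md0 name is illustrative, and the path is assumed from the existing stripe_cache_size attribute:

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/md0/md/rmw_level", "w");

        if (!f) {
            perror("rmw_level");
            return 1;
        }
        fputs("1\n", f);        /* 1 == PARITY_ENABLE_RMW, per the enum in raid5.h */
        return fclose(f) ? 1 : 0;
    }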
@@ -5463,7 +5972,7 @@ raid5_store_preread_threshold(struct mddev *mddev, const char *page, size_t len)
5463 conf = mddev->private; 5972 conf = mddev->private;
5464 if (!conf) 5973 if (!conf)
5465 err = -ENODEV; 5974 err = -ENODEV;
5466 else if (new > conf->max_nr_stripes) 5975 else if (new > conf->min_nr_stripes)
5467 err = -EINVAL; 5976 err = -EINVAL;
5468 else 5977 else
5469 conf->bypass_threshold = new; 5978 conf->bypass_threshold = new;
@@ -5618,6 +6127,7 @@ static struct attribute *raid5_attrs[] = {
5618 &raid5_preread_bypass_threshold.attr, 6127 &raid5_preread_bypass_threshold.attr,
5619 &raid5_group_thread_cnt.attr, 6128 &raid5_group_thread_cnt.attr,
5620 &raid5_skip_copy.attr, 6129 &raid5_skip_copy.attr,
6130 &raid5_rmw_level.attr,
5621 NULL, 6131 NULL,
5622}; 6132};
5623static struct attribute_group raid5_attrs_group = { 6133static struct attribute_group raid5_attrs_group = {
@@ -5699,7 +6209,8 @@ raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
5699static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu) 6209static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu)
5700{ 6210{
5701 safe_put_page(percpu->spare_page); 6211 safe_put_page(percpu->spare_page);
5702 kfree(percpu->scribble); 6212 if (percpu->scribble)
6213 flex_array_free(percpu->scribble);
5703 percpu->spare_page = NULL; 6214 percpu->spare_page = NULL;
5704 percpu->scribble = NULL; 6215 percpu->scribble = NULL;
5705} 6216}
@@ -5709,7 +6220,9 @@ static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu
5709 if (conf->level == 6 && !percpu->spare_page) 6220 if (conf->level == 6 && !percpu->spare_page)
5710 percpu->spare_page = alloc_page(GFP_KERNEL); 6221 percpu->spare_page = alloc_page(GFP_KERNEL);
5711 if (!percpu->scribble) 6222 if (!percpu->scribble)
5712 percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL); 6223 percpu->scribble = scribble_alloc(max(conf->raid_disks,
6224 conf->previous_raid_disks), conf->chunk_sectors /
6225 STRIPE_SECTORS, GFP_KERNEL);
5713 6226
5714 if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) { 6227 if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) {
5715 free_scratch_buffer(conf, percpu); 6228 free_scratch_buffer(conf, percpu);
@@ -5740,6 +6253,8 @@ static void raid5_free_percpu(struct r5conf *conf)
5740 6253
5741static void free_conf(struct r5conf *conf) 6254static void free_conf(struct r5conf *conf)
5742{ 6255{
6256 if (conf->shrinker.seeks)
6257 unregister_shrinker(&conf->shrinker);
5743 free_thread_groups(conf); 6258 free_thread_groups(conf);
5744 shrink_stripes(conf); 6259 shrink_stripes(conf);
5745 raid5_free_percpu(conf); 6260 raid5_free_percpu(conf);
@@ -5807,6 +6322,30 @@ static int raid5_alloc_percpu(struct r5conf *conf)
5807 return err; 6322 return err;
5808} 6323}
5809 6324
6325static unsigned long raid5_cache_scan(struct shrinker *shrink,
6326 struct shrink_control *sc)
6327{
6328 struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
6329 int ret = 0;
6330 while (ret < sc->nr_to_scan) {
6331 if (drop_one_stripe(conf) == 0)
6332 return SHRINK_STOP;
6333 ret++;
6334 }
6335 return ret;
6336}
6337
6338static unsigned long raid5_cache_count(struct shrinker *shrink,
6339 struct shrink_control *sc)
6340{
6341 struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
6342
6343 if (conf->max_nr_stripes < conf->min_nr_stripes)
6344 /* unlikely, but not impossible */
6345 return 0;
6346 return conf->max_nr_stripes - conf->min_nr_stripes;
6347}
6348
5810static struct r5conf *setup_conf(struct mddev *mddev) 6349static struct r5conf *setup_conf(struct mddev *mddev)
5811{ 6350{
5812 struct r5conf *conf; 6351 struct r5conf *conf;
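raid5_cache_count() and raid5_cache_scan() turn the stripe cache into a memory-shrinker client: min_nr_stripes is the floor configured through stripe_cache_size, max_nr_stripes is what is currently allocated, and the shrinker may only reclaim the difference, giving up (SHRINK_STOP) as soon as drop_one_stripe() fails. A toy user-space model of that accounting -- toy_cache, toy_count() and toy_scan() are invented names:

    struct toy_cache { unsigned int cur, floor; };  /* max_nr_stripes, min_nr_stripes */

    unsigned long toy_count(const struct toy_cache *c)
    {
        return c->cur < c->floor ? 0 : c->cur - c->floor;   /* "unlikely, but not impossible" */
    }

    unsigned long toy_scan(struct toy_cache *c, unsigned long nr_to_scan)
    {
        unsigned long freed = 0;

        while (freed < nr_to_scan && c->cur > c->floor) {
            c->cur--;            /* stands in for drop_one_stripe() */
            freed++;
        }
        return freed;            /* the real scan returns SHRINK_STOP if nothing could go */
    }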
@@ -5879,7 +6418,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
5879 else 6418 else
5880 conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks; 6419 conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks;
5881 max_disks = max(conf->raid_disks, conf->previous_raid_disks); 6420 max_disks = max(conf->raid_disks, conf->previous_raid_disks);
5882 conf->scribble_len = scribble_len(max_disks);
5883 6421
5884 conf->disks = kzalloc(max_disks * sizeof(struct disk_info), 6422 conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
5885 GFP_KERNEL); 6423 GFP_KERNEL);
@@ -5907,6 +6445,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
5907 INIT_LIST_HEAD(conf->temp_inactive_list + i); 6445 INIT_LIST_HEAD(conf->temp_inactive_list + i);
5908 6446
5909 conf->level = mddev->new_level; 6447 conf->level = mddev->new_level;
6448 conf->chunk_sectors = mddev->new_chunk_sectors;
5910 if (raid5_alloc_percpu(conf) != 0) 6449 if (raid5_alloc_percpu(conf) != 0)
5911 goto abort; 6450 goto abort;
5912 6451
@@ -5939,12 +6478,17 @@ static struct r5conf *setup_conf(struct mddev *mddev)
5939 conf->fullsync = 1; 6478 conf->fullsync = 1;
5940 } 6479 }
5941 6480
5942 conf->chunk_sectors = mddev->new_chunk_sectors;
5943 conf->level = mddev->new_level; 6481 conf->level = mddev->new_level;
5944 if (conf->level == 6) 6482 if (conf->level == 6) {
5945 conf->max_degraded = 2; 6483 conf->max_degraded = 2;
5946 else 6484 if (raid6_call.xor_syndrome)
6485 conf->rmw_level = PARITY_ENABLE_RMW;
6486 else
6487 conf->rmw_level = PARITY_DISABLE_RMW;
6488 } else {
5947 conf->max_degraded = 1; 6489 conf->max_degraded = 1;
6490 conf->rmw_level = PARITY_ENABLE_RMW;
6491 }
5948 conf->algorithm = mddev->new_layout; 6492 conf->algorithm = mddev->new_layout;
5949 conf->reshape_progress = mddev->reshape_position; 6493 conf->reshape_progress = mddev->reshape_position;
5950 if (conf->reshape_progress != MaxSector) { 6494 if (conf->reshape_progress != MaxSector) {
@@ -5952,10 +6496,11 @@ static struct r5conf *setup_conf(struct mddev *mddev)
5952 conf->prev_algo = mddev->layout; 6496 conf->prev_algo = mddev->layout;
5953 } 6497 }
5954 6498
5955 memory = conf->max_nr_stripes * (sizeof(struct stripe_head) + 6499 conf->min_nr_stripes = NR_STRIPES;
6500 memory = conf->min_nr_stripes * (sizeof(struct stripe_head) +
5956 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; 6501 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
5957 atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS); 6502 atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS);
5958 if (grow_stripes(conf, NR_STRIPES)) { 6503 if (grow_stripes(conf, conf->min_nr_stripes)) {
5959 printk(KERN_ERR 6504 printk(KERN_ERR
5960 "md/raid:%s: couldn't allocate %dkB for buffers\n", 6505 "md/raid:%s: couldn't allocate %dkB for buffers\n",
5961 mdname(mddev), memory); 6506 mdname(mddev), memory);
@@ -5963,6 +6508,17 @@ static struct r5conf *setup_conf(struct mddev *mddev)
5963 } else 6508 } else
5964 printk(KERN_INFO "md/raid:%s: allocated %dkB\n", 6509 printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
5965 mdname(mddev), memory); 6510 mdname(mddev), memory);
6511 /*
6512 * Losing a stripe head costs more than the time to refill it,
6513 * it reduces the queue depth and so can hurt throughput.
6514 * So set it rather large, scaled by number of devices.
6515 */
6516 conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
6517 conf->shrinker.scan_objects = raid5_cache_scan;
6518 conf->shrinker.count_objects = raid5_cache_count;
6519 conf->shrinker.batch = 128;
6520 conf->shrinker.flags = 0;
6521 register_shrinker(&conf->shrinker);
5966 6522
5967 sprintf(pers_name, "raid%d", mddev->new_level); 6523 sprintf(pers_name, "raid%d", mddev->new_level);
5968 conf->thread = md_register_thread(raid5d, mddev, pers_name); 6524 conf->thread = md_register_thread(raid5d, mddev, pers_name);
@@ -6604,9 +7160,9 @@ static int check_stripe_cache(struct mddev *mddev)
6604 */ 7160 */
6605 struct r5conf *conf = mddev->private; 7161 struct r5conf *conf = mddev->private;
6606 if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4 7162 if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4
6607 > conf->max_nr_stripes || 7163 > conf->min_nr_stripes ||
6608 ((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4 7164 ((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4
6609 > conf->max_nr_stripes) { 7165 > conf->min_nr_stripes) {
6610 printk(KERN_WARNING "md/raid:%s: reshape: not enough stripes. Needed %lu\n", 7166 printk(KERN_WARNING "md/raid:%s: reshape: not enough stripes. Needed %lu\n",
6611 mdname(mddev), 7167 mdname(mddev),
6612 ((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9) 7168 ((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 983e18a83db1..7dc0dd86074b 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -210,11 +210,19 @@ struct stripe_head {
210 atomic_t count; /* nr of active thread/requests */ 210 atomic_t count; /* nr of active thread/requests */
211 int bm_seq; /* sequence number for bitmap flushes */ 211 int bm_seq; /* sequence number for bitmap flushes */
212 int disks; /* disks in stripe */ 212 int disks; /* disks in stripe */
213 int overwrite_disks; /* total overwrite disks in stripe,
214 * this is only checked when stripe
215 * has STRIPE_BATCH_READY
216 */
213 enum check_states check_state; 217 enum check_states check_state;
214 enum reconstruct_states reconstruct_state; 218 enum reconstruct_states reconstruct_state;
215 spinlock_t stripe_lock; 219 spinlock_t stripe_lock;
216 int cpu; 220 int cpu;
217 struct r5worker_group *group; 221 struct r5worker_group *group;
222
223 struct stripe_head *batch_head; /* protected by stripe lock */
224 spinlock_t batch_lock; /* only header's lock is useful */
225 struct list_head batch_list; /* protected by head's batch lock*/
218 /** 226 /**
219 * struct stripe_operations 227 * struct stripe_operations
220 * @target - STRIPE_OP_COMPUTE_BLK target 228 * @target - STRIPE_OP_COMPUTE_BLK target
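The three fields added to struct stripe_head define the batch topology used by the raid5.c changes earlier in this diff: an unbatched stripe has batch_head == NULL, the head of a batch points batch_head at itself and keeps its members on batch_list, and only the head's batch_lock is meaningful. A toy model of that relationship, with invented names standing in for the real list machinery:

    struct toy_stripe {
        struct toy_stripe *batch_head;      /* NULL, or the batch's head stripe */
        struct toy_stripe *next_in_batch;   /* stand-in for the list_head batch_list */
    };

    int toy_is_batch_head(const struct toy_stripe *sh)
    {
        return sh->batch_head == sh;
    }

    int toy_is_batch_member(const struct toy_stripe *sh)
    {
        return sh->batch_head && sh->batch_head != sh;
    }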
@@ -327,8 +335,15 @@ enum {
327 STRIPE_ON_UNPLUG_LIST, 335 STRIPE_ON_UNPLUG_LIST,
328 STRIPE_DISCARD, 336 STRIPE_DISCARD,
329 STRIPE_ON_RELEASE_LIST, 337 STRIPE_ON_RELEASE_LIST,
338 STRIPE_BATCH_READY,
339 STRIPE_BATCH_ERR,
330}; 340};
331 341
342#define STRIPE_EXPAND_SYNC_FLAG \
343 ((1 << STRIPE_EXPAND_SOURCE) |\
344 (1 << STRIPE_EXPAND_READY) |\
345 (1 << STRIPE_EXPANDING) |\
346 (1 << STRIPE_SYNC_REQUESTED))
332/* 347/*
333 * Operation request flags 348 * Operation request flags
334 */ 349 */
@@ -340,6 +355,24 @@ enum {
340 STRIPE_OP_RECONSTRUCT, 355 STRIPE_OP_RECONSTRUCT,
341 STRIPE_OP_CHECK, 356 STRIPE_OP_CHECK,
342}; 357};
358
359/*
360 * RAID parity calculation preferences
361 */
362enum {
363 PARITY_DISABLE_RMW = 0,
364 PARITY_ENABLE_RMW,
365 PARITY_PREFER_RMW,
366};
367
368/*
369 * Pages requested from set_syndrome_sources()
370 */
371enum {
372 SYNDROME_SRC_ALL,
373 SYNDROME_SRC_WANT_DRAIN,
374 SYNDROME_SRC_WRITTEN,
375};
343/* 376/*
344 * Plugging: 377 * Plugging:
345 * 378 *
@@ -396,10 +429,11 @@ struct r5conf {
396 spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS]; 429 spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS];
397 struct mddev *mddev; 430 struct mddev *mddev;
398 int chunk_sectors; 431 int chunk_sectors;
399 int level, algorithm; 432 int level, algorithm, rmw_level;
400 int max_degraded; 433 int max_degraded;
401 int raid_disks; 434 int raid_disks;
402 int max_nr_stripes; 435 int max_nr_stripes;
436 int min_nr_stripes;
403 437
404 /* reshape_progress is the leading edge of a 'reshape' 438 /* reshape_progress is the leading edge of a 'reshape'
405 * It has value MaxSector when no reshape is happening 439 * It has value MaxSector when no reshape is happening
@@ -458,15 +492,11 @@ struct r5conf {
458 /* per cpu variables */ 492 /* per cpu variables */
459 struct raid5_percpu { 493 struct raid5_percpu {
460 struct page *spare_page; /* Used when checking P/Q in raid6 */ 494 struct page *spare_page; /* Used when checking P/Q in raid6 */
461 void *scribble; /* space for constructing buffer 495 struct flex_array *scribble; /* space for constructing buffer
462 * lists and performing address 496 * lists and performing address
463 * conversions 497 * conversions
464 */ 498 */
465 } __percpu *percpu; 499 } __percpu *percpu;
466 size_t scribble_len; /* size of scribble region must be
467 * associated with conf to handle
468 * cpu hotplug while reshaping
469 */
470#ifdef CONFIG_HOTPLUG_CPU 500#ifdef CONFIG_HOTPLUG_CPU
471 struct notifier_block cpu_notify; 501 struct notifier_block cpu_notify;
472#endif 502#endif
@@ -480,9 +510,19 @@ struct r5conf {
480 struct llist_head released_stripes; 510 struct llist_head released_stripes;
481 wait_queue_head_t wait_for_stripe; 511 wait_queue_head_t wait_for_stripe;
482 wait_queue_head_t wait_for_overlap; 512 wait_queue_head_t wait_for_overlap;
483 int inactive_blocked; /* release of inactive stripes blocked, 513 unsigned long cache_state;
484 * waiting for 25% to be free 514#define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked,
485 */ 515 * waiting for 25% to be free
516 */
517#define R5_ALLOC_MORE 2 /* It might help to allocate another
518 * stripe.
519 */
520#define R5_DID_ALLOC 4 /* A stripe was allocated, don't allocate
521 * more until at least one has been
522 * released. This avoids flooding
523 * the cache.
524 */
525 struct shrinker shrinker;
486 int pool_size; /* number of disks in stripeheads in pool */ 526 int pool_size; /* number of disks in stripeheads in pool */
487 spinlock_t device_lock; 527 spinlock_t device_lock;
488 struct disk_info *disks; 528 struct disk_info *disks;
@@ -497,6 +537,7 @@ struct r5conf {
497 int worker_cnt_per_group; 537 int worker_cnt_per_group;
498}; 538};
499 539
540
500/* 541/*
501 * Our supported algorithms 542 * Our supported algorithms
502 */ 543 */
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 179b38ffd351..388574ea38ed 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -60,12 +60,15 @@ struct dma_chan_ref {
60 * dependency chain 60 * dependency chain
61 * @ASYNC_TX_FENCE: specify that the next operation in the dependency 61 * @ASYNC_TX_FENCE: specify that the next operation in the dependency
62 * chain uses this operation's result as an input 62 * chain uses this operation's result as an input
63 * @ASYNC_TX_PQ_XOR_DST: do not overwrite the syndrome but XOR it with the
64 * input data. Required for rmw case.
63 */ 65 */
64enum async_tx_flags { 66enum async_tx_flags {
65 ASYNC_TX_XOR_ZERO_DST = (1 << 0), 67 ASYNC_TX_XOR_ZERO_DST = (1 << 0),
66 ASYNC_TX_XOR_DROP_DST = (1 << 1), 68 ASYNC_TX_XOR_DROP_DST = (1 << 1),
67 ASYNC_TX_ACK = (1 << 2), 69 ASYNC_TX_ACK = (1 << 2),
68 ASYNC_TX_FENCE = (1 << 3), 70 ASYNC_TX_FENCE = (1 << 3),
71 ASYNC_TX_PQ_XOR_DST = (1 << 4),
69}; 72};
70 73
71/** 74/**
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 73069cb6c54a..a7a06d1dcf9c 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -72,6 +72,7 @@ extern const char raid6_empty_zero_page[PAGE_SIZE];
72/* Routine choices */ 72/* Routine choices */
73struct raid6_calls { 73struct raid6_calls {
74 void (*gen_syndrome)(int, size_t, void **); 74 void (*gen_syndrome)(int, size_t, void **);
75 void (*xor_syndrome)(int, int, int, size_t, void **);
75 int (*valid)(void); /* Returns 1 if this routine set is usable */ 76 int (*valid)(void); /* Returns 1 if this routine set is usable */
76 const char *name; /* Name of this routine set */ 77 const char *name; /* Name of this routine set */
77 int prefer; /* Has special performance attribute */ 78 int prefer; /* Has special performance attribute */
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 49f4210d4394..2ae6131e69a5 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -78,6 +78,12 @@
78#define MD_DISK_ACTIVE 1 /* disk is running or spare disk */ 78#define MD_DISK_ACTIVE 1 /* disk is running or spare disk */
79#define MD_DISK_SYNC 2 /* disk is in sync with the raid set */ 79#define MD_DISK_SYNC 2 /* disk is in sync with the raid set */
80#define MD_DISK_REMOVED 3 /* disk is in sync with the raid set */ 80#define MD_DISK_REMOVED 3 /* disk is in sync with the raid set */
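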
81#define MD_DISK_CLUSTER_ADD 4 /* Initiate a disk add across the cluster
 82 * For clustered environments only.
83 */
84#define MD_DISK_CANDIDATE 5 /* disk is added as spare (local) until confirmed
 85 * For clustered environments only.
86 */
81 87
82#define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config. 88#define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config.
83 * read requests will only be sent here in 89 * read requests will only be sent here in
@@ -101,6 +107,7 @@ typedef struct mdp_device_descriptor_s {
101#define MD_SB_CLEAN 0 107#define MD_SB_CLEAN 0
102#define MD_SB_ERRORS 1 108#define MD_SB_ERRORS 1
103 109
110#define MD_SB_CLUSTERED 5 /* MD is clustered */
104#define MD_SB_BITMAP_PRESENT 8 /* bitmap may be present nearby */ 111#define MD_SB_BITMAP_PRESENT 8 /* bitmap may be present nearby */
105 112
106/* 113/*
diff --git a/include/uapi/linux/raid/md_u.h b/include/uapi/linux/raid/md_u.h
index 74e7c60c4716..1cb8aa6850b5 100644
--- a/include/uapi/linux/raid/md_u.h
+++ b/include/uapi/linux/raid/md_u.h
@@ -62,6 +62,7 @@
62#define STOP_ARRAY _IO (MD_MAJOR, 0x32) 62#define STOP_ARRAY _IO (MD_MAJOR, 0x32)
63#define STOP_ARRAY_RO _IO (MD_MAJOR, 0x33) 63#define STOP_ARRAY_RO _IO (MD_MAJOR, 0x33)
64#define RESTART_ARRAY_RW _IO (MD_MAJOR, 0x34) 64#define RESTART_ARRAY_RW _IO (MD_MAJOR, 0x34)
65#define CLUSTERED_DISK_NACK _IO (MD_MAJOR, 0x35)
65 66
66/* 63 partitions with the alternate major number (mdp) */ 67/* 63 partitions with the alternate major number (mdp) */
67#define MdpMinorShift 6 68#define MdpMinorShift 6
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index dbef2314901e..975c6e0434bd 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -131,11 +131,12 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void)
131static inline const struct raid6_calls *raid6_choose_gen( 131static inline const struct raid6_calls *raid6_choose_gen(
132 void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks) 132 void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
133{ 133{
134 unsigned long perf, bestperf, j0, j1; 134 unsigned long perf, bestgenperf, bestxorperf, j0, j1;
135 int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */
135 const struct raid6_calls *const *algo; 136 const struct raid6_calls *const *algo;
136 const struct raid6_calls *best; 137 const struct raid6_calls *best;
137 138
138 for (bestperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { 139 for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) {
139 if (!best || (*algo)->prefer >= best->prefer) { 140 if (!best || (*algo)->prefer >= best->prefer) {
140 if ((*algo)->valid && !(*algo)->valid()) 141 if ((*algo)->valid && !(*algo)->valid())
141 continue; 142 continue;
@@ -153,19 +154,45 @@ static inline const struct raid6_calls *raid6_choose_gen(
153 } 154 }
154 preempt_enable(); 155 preempt_enable();
155 156
156 if (perf > bestperf) { 157 if (perf > bestgenperf) {
157 bestperf = perf; 158 bestgenperf = perf;
158 best = *algo; 159 best = *algo;
159 } 160 }
160 pr_info("raid6: %-8s %5ld MB/s\n", (*algo)->name, 161 pr_info("raid6: %-8s gen() %5ld MB/s\n", (*algo)->name,
161 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 162 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2));
163
164 if (!(*algo)->xor_syndrome)
165 continue;
166
167 perf = 0;
168
169 preempt_disable();
170 j0 = jiffies;
171 while ((j1 = jiffies) == j0)
172 cpu_relax();
173 while (time_before(jiffies,
174 j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
175 (*algo)->xor_syndrome(disks, start, stop,
176 PAGE_SIZE, *dptrs);
177 perf++;
178 }
179 preempt_enable();
180
181 if (best == *algo)
182 bestxorperf = perf;
183
184 pr_info("raid6: %-8s xor() %5ld MB/s\n", (*algo)->name,
185 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1));
162 } 186 }
163 } 187 }
164 188
165 if (best) { 189 if (best) {
166 pr_info("raid6: using algorithm %s (%ld MB/s)\n", 190 pr_info("raid6: using algorithm %s gen() %ld MB/s\n",
167 best->name, 191 best->name,
168 (bestperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 192 (bestgenperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2));
193 if (best->xor_syndrome)
194 pr_info("raid6: .... xor() %ld MB/s, rmw enabled\n",
195 (bestxorperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1));
169 raid6_call = *best; 196 raid6_call = *best;
170 } else 197 } else
171 pr_err("raid6: Yikes! No algorithm found!\n"); 198 pr_err("raid6: Yikes! No algorithm found!\n");
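The selection loop now benchmarks xor_syndrome() as well, running it over the second half of the data disks (start = (disks>>1)-1, stop = disks-3). The printed throughput comes straight from the loop count: each gen() call walks the 64 KiB test buffer and the timing window is 2^RAID6_TIME_JIFFIES_LG2 jiffies, so, writing $L$ for that constant,

    $$\mathrm{MB/s} \approx \frac{\mathrm{perf}\cdot 2^{16}\cdot \mathrm{HZ}}{2^{20}\cdot 2^{L}} = (\mathrm{perf}\cdot \mathrm{HZ}) \gg (20 - 16 + L)$$

which is the expression used in the pr_info() calls. The xor() figures shift by one extra bit, presumably to account for each xor_syndrome() pass touching only about half of the data pages.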
diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 7cc12b532e95..bec27fce7501 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -119,6 +119,7 @@ int raid6_have_altivec(void)
119 119
120const struct raid6_calls raid6_altivec$# = { 120const struct raid6_calls raid6_altivec$# = {
121 raid6_altivec$#_gen_syndrome, 121 raid6_altivec$#_gen_syndrome,
122 NULL, /* XOR not yet implemented */
122 raid6_have_altivec, 123 raid6_have_altivec,
123 "altivecx$#", 124 "altivecx$#",
124 0 125 0
diff --git a/lib/raid6/avx2.c b/lib/raid6/avx2.c
index bc3b1dd436eb..76734004358d 100644
--- a/lib/raid6/avx2.c
+++ b/lib/raid6/avx2.c
@@ -89,6 +89,7 @@ static void raid6_avx21_gen_syndrome(int disks, size_t bytes, void **ptrs)
89 89
90const struct raid6_calls raid6_avx2x1 = { 90const struct raid6_calls raid6_avx2x1 = {
91 raid6_avx21_gen_syndrome, 91 raid6_avx21_gen_syndrome,
92 NULL, /* XOR not yet implemented */
92 raid6_have_avx2, 93 raid6_have_avx2,
93 "avx2x1", 94 "avx2x1",
94 1 /* Has cache hints */ 95 1 /* Has cache hints */
@@ -150,6 +151,7 @@ static void raid6_avx22_gen_syndrome(int disks, size_t bytes, void **ptrs)
150 151
151const struct raid6_calls raid6_avx2x2 = { 152const struct raid6_calls raid6_avx2x2 = {
152 raid6_avx22_gen_syndrome, 153 raid6_avx22_gen_syndrome,
154 NULL, /* XOR not yet implemented */
153 raid6_have_avx2, 155 raid6_have_avx2,
154 "avx2x2", 156 "avx2x2",
155 1 /* Has cache hints */ 157 1 /* Has cache hints */
@@ -242,6 +244,7 @@ static void raid6_avx24_gen_syndrome(int disks, size_t bytes, void **ptrs)
242 244
243const struct raid6_calls raid6_avx2x4 = { 245const struct raid6_calls raid6_avx2x4 = {
244 raid6_avx24_gen_syndrome, 246 raid6_avx24_gen_syndrome,
247 NULL, /* XOR not yet implemented */
245 raid6_have_avx2, 248 raid6_have_avx2,
246 "avx2x4", 249 "avx2x4",
247 1 /* Has cache hints */ 250 1 /* Has cache hints */
diff --git a/lib/raid6/int.uc b/lib/raid6/int.uc
index 5b50f8dfc5d2..558aeac9342a 100644
--- a/lib/raid6/int.uc
+++ b/lib/raid6/int.uc
@@ -107,9 +107,48 @@ static void raid6_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
107 } 107 }
108} 108}
109 109
110static void raid6_int$#_xor_syndrome(int disks, int start, int stop,
111 size_t bytes, void **ptrs)
112{
113 u8 **dptr = (u8 **)ptrs;
114 u8 *p, *q;
115 int d, z, z0;
116
117 unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
118
119 z0 = stop; /* P/Q right side optimization */
120 p = dptr[disks-2]; /* XOR parity */
121 q = dptr[disks-1]; /* RS syndrome */
122
123 for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
124 /* P/Q data pages */
125 wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
126 for ( z = z0-1 ; z >= start ; z-- ) {
127 wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
128 wp$$ ^= wd$$;
129 w2$$ = MASK(wq$$);
130 w1$$ = SHLBYTE(wq$$);
131 w2$$ &= NBYTES(0x1d);
132 w1$$ ^= w2$$;
133 wq$$ = w1$$ ^ wd$$;
134 }
135 /* P/Q left side optimization */
136 for ( z = start-1 ; z >= 0 ; z-- ) {
137 w2$$ = MASK(wq$$);
138 w1$$ = SHLBYTE(wq$$);
139 w2$$ &= NBYTES(0x1d);
140 wq$$ = w1$$ ^ w2$$;
141 }
142 *(unative_t *)&p[d+NSIZE*$$] ^= wp$$;
143 *(unative_t *)&q[d+NSIZE*$$] ^= wq$$;
144 }
145
146}
147
110const struct raid6_calls raid6_intx$# = { 148const struct raid6_calls raid6_intx$# = {
111 raid6_int$#_gen_syndrome, 149 raid6_int$#_gen_syndrome,
112 NULL, /* always valid */ 150 raid6_int$#_xor_syndrome,
151 NULL, /* always valid */
113 "int" NSTRING "x$#", 152 "int" NSTRING "x$#",
114 0 153 0
115}; 154};
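raid6_int$#_xor_syndrome() above is the generic-C version of the new partial-syndrome hook. Over the slot range [start, stop] it accumulates

    $$\Delta P=\bigoplus_{z=\mathrm{start}}^{\mathrm{stop}} D_z,\qquad \Delta Q=\bigoplus_{z=\mathrm{start}}^{\mathrm{stop}} g^{e(z)}\,D_z$$

where $e(z)$ is the generator exponent gen_syndrome() assigns to slot $z$, and XORs the two deltas into the existing P and Q pages; the second inner loop ("P/Q left side optimization") only keeps multiplying the running Q value by $g$ for the slots below start, so every $D_z$ ends up carrying the same weight a full syndrome generation would give it. Because XOR is an involution, calling xor_syndrome() once with the old contents of the range and once with the new contents cancels the unchanged contributions and leaves P and Q identical to a full regeneration over the updated data, which is exactly how the updated lib/raid6/test/test.c at the end of this diff exercises the hook.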
diff --git a/lib/raid6/mmx.c b/lib/raid6/mmx.c
index 590c71c9e200..b3b0e1fcd3af 100644
--- a/lib/raid6/mmx.c
+++ b/lib/raid6/mmx.c
@@ -76,6 +76,7 @@ static void raid6_mmx1_gen_syndrome(int disks, size_t bytes, void **ptrs)
76 76
77const struct raid6_calls raid6_mmxx1 = { 77const struct raid6_calls raid6_mmxx1 = {
78 raid6_mmx1_gen_syndrome, 78 raid6_mmx1_gen_syndrome,
79 NULL, /* XOR not yet implemented */
79 raid6_have_mmx, 80 raid6_have_mmx,
80 "mmxx1", 81 "mmxx1",
81 0 82 0
@@ -134,6 +135,7 @@ static void raid6_mmx2_gen_syndrome(int disks, size_t bytes, void **ptrs)
134 135
135const struct raid6_calls raid6_mmxx2 = { 136const struct raid6_calls raid6_mmxx2 = {
136 raid6_mmx2_gen_syndrome, 137 raid6_mmx2_gen_syndrome,
138 NULL, /* XOR not yet implemented */
137 raid6_have_mmx, 139 raid6_have_mmx,
138 "mmxx2", 140 "mmxx2",
139 0 141 0
diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
index 36ad4705df1a..d9ad6ee284f4 100644
--- a/lib/raid6/neon.c
+++ b/lib/raid6/neon.c
@@ -42,6 +42,7 @@
42 } \ 42 } \
43 struct raid6_calls const raid6_neonx ## _n = { \ 43 struct raid6_calls const raid6_neonx ## _n = { \
44 raid6_neon ## _n ## _gen_syndrome, \ 44 raid6_neon ## _n ## _gen_syndrome, \
45 NULL, /* XOR not yet implemented */ \
45 raid6_have_neon, \ 46 raid6_have_neon, \
46 "neonx" #_n, \ 47 "neonx" #_n, \
47 0 \ 48 0 \
diff --git a/lib/raid6/sse1.c b/lib/raid6/sse1.c
index f76297139445..9025b8ca9aa3 100644
--- a/lib/raid6/sse1.c
+++ b/lib/raid6/sse1.c
@@ -92,6 +92,7 @@ static void raid6_sse11_gen_syndrome(int disks, size_t bytes, void **ptrs)
92 92
93const struct raid6_calls raid6_sse1x1 = { 93const struct raid6_calls raid6_sse1x1 = {
94 raid6_sse11_gen_syndrome, 94 raid6_sse11_gen_syndrome,
95 NULL, /* XOR not yet implemented */
95 raid6_have_sse1_or_mmxext, 96 raid6_have_sse1_or_mmxext,
96 "sse1x1", 97 "sse1x1",
97 1 /* Has cache hints */ 98 1 /* Has cache hints */
@@ -154,6 +155,7 @@ static void raid6_sse12_gen_syndrome(int disks, size_t bytes, void **ptrs)
154 155
155const struct raid6_calls raid6_sse1x2 = { 156const struct raid6_calls raid6_sse1x2 = {
156 raid6_sse12_gen_syndrome, 157 raid6_sse12_gen_syndrome,
158 NULL, /* XOR not yet implemented */
157 raid6_have_sse1_or_mmxext, 159 raid6_have_sse1_or_mmxext,
158 "sse1x2", 160 "sse1x2",
159 1 /* Has cache hints */ 161 1 /* Has cache hints */
diff --git a/lib/raid6/sse2.c b/lib/raid6/sse2.c
index 85b82c85f28e..1d2276b007ee 100644
--- a/lib/raid6/sse2.c
+++ b/lib/raid6/sse2.c
@@ -88,8 +88,58 @@ static void raid6_sse21_gen_syndrome(int disks, size_t bytes, void **ptrs)
88 kernel_fpu_end(); 88 kernel_fpu_end();
89} 89}
90 90
91
92static void raid6_sse21_xor_syndrome(int disks, int start, int stop,
93 size_t bytes, void **ptrs)
94 {
95 u8 **dptr = (u8 **)ptrs;
96 u8 *p, *q;
97 int d, z, z0;
98
99 z0 = stop; /* P/Q right side optimization */
100 p = dptr[disks-2]; /* XOR parity */
101 q = dptr[disks-1]; /* RS syndrome */
102
103 kernel_fpu_begin();
104
105 asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0]));
106
107 for ( d = 0 ; d < bytes ; d += 16 ) {
108 asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
109 asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
110 asm volatile("pxor %xmm4,%xmm2");
111 /* P/Q data pages */
112 for ( z = z0-1 ; z >= start ; z-- ) {
113 asm volatile("pxor %xmm5,%xmm5");
114 asm volatile("pcmpgtb %xmm4,%xmm5");
115 asm volatile("paddb %xmm4,%xmm4");
116 asm volatile("pand %xmm0,%xmm5");
117 asm volatile("pxor %xmm5,%xmm4");
118 asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
119 asm volatile("pxor %xmm5,%xmm2");
120 asm volatile("pxor %xmm5,%xmm4");
121 }
122 /* P/Q left side optimization */
123 for ( z = start-1 ; z >= 0 ; z-- ) {
124 asm volatile("pxor %xmm5,%xmm5");
125 asm volatile("pcmpgtb %xmm4,%xmm5");
126 asm volatile("paddb %xmm4,%xmm4");
127 asm volatile("pand %xmm0,%xmm5");
128 asm volatile("pxor %xmm5,%xmm4");
129 }
130 asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
131 /* Don't use movntdq for r/w memory area < cache line */
132 asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
133 asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
134 }
135
136 asm volatile("sfence" : : : "memory");
137 kernel_fpu_end();
138}
139
91const struct raid6_calls raid6_sse2x1 = { 140const struct raid6_calls raid6_sse2x1 = {
92 raid6_sse21_gen_syndrome, 141 raid6_sse21_gen_syndrome,
142 raid6_sse21_xor_syndrome,
93 raid6_have_sse2, 143 raid6_have_sse2,
94 "sse2x1", 144 "sse2x1",
95 1 /* Has cache hints */ 145 1 /* Has cache hints */
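The SSE2 xor_syndrome() variants reuse the same four-instruction Galois-field doubling that gen_syndrome() already relied on: pcmpgtb against an all-zero register produces 0xff in every byte whose top bit is set, paddb shifts each byte left by one, pand masks in the 0x1d reduction constant and pxor applies it. A scalar reference in plain C (standalone, for illustration only; gf256_mul2() is not a kernel helper):

    #include <stdint.h>

    /* Multiply one GF(2^8) element by 2 modulo the RAID-6 polynomial 0x11d --
     * the scalar equivalent of the pcmpgtb/paddb/pand/pxor sequence above. */
    static inline uint8_t gf256_mul2(uint8_t a)
    {
        uint8_t carry = (a & 0x80) ? 0xff : 0x00;       /* pcmpgtb against zero */

        return (uint8_t)((a << 1) ^ (carry & 0x1d));    /* paddb, pand, pxor */
    }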
@@ -150,8 +200,76 @@ static void raid6_sse22_gen_syndrome(int disks, size_t bytes, void **ptrs)
150 kernel_fpu_end(); 200 kernel_fpu_end();
151} 201}
152 202
203 static void raid6_sse22_xor_syndrome(int disks, int start, int stop,
204 size_t bytes, void **ptrs)
205 {
206 u8 **dptr = (u8 **)ptrs;
207 u8 *p, *q;
208 int d, z, z0;
209
210 z0 = stop; /* P/Q right side optimization */
211 p = dptr[disks-2]; /* XOR parity */
212 q = dptr[disks-1]; /* RS syndrome */
213
214 kernel_fpu_begin();
215
216 asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0]));
217
218 for ( d = 0 ; d < bytes ; d += 32 ) {
219 asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
220 asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
221 asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
222 asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
223 asm volatile("pxor %xmm4,%xmm2");
224 asm volatile("pxor %xmm6,%xmm3");
225 /* P/Q data pages */
226 for ( z = z0-1 ; z >= start ; z-- ) {
227 asm volatile("pxor %xmm5,%xmm5");
228 asm volatile("pxor %xmm7,%xmm7");
229 asm volatile("pcmpgtb %xmm4,%xmm5");
230 asm volatile("pcmpgtb %xmm6,%xmm7");
231 asm volatile("paddb %xmm4,%xmm4");
232 asm volatile("paddb %xmm6,%xmm6");
233 asm volatile("pand %xmm0,%xmm5");
234 asm volatile("pand %xmm0,%xmm7");
235 asm volatile("pxor %xmm5,%xmm4");
236 asm volatile("pxor %xmm7,%xmm6");
237 asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
238 asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
239 asm volatile("pxor %xmm5,%xmm2");
240 asm volatile("pxor %xmm7,%xmm3");
241 asm volatile("pxor %xmm5,%xmm4");
242 asm volatile("pxor %xmm7,%xmm6");
243 }
244 /* P/Q left side optimization */
245 for ( z = start-1 ; z >= 0 ; z-- ) {
246 asm volatile("pxor %xmm5,%xmm5");
247 asm volatile("pxor %xmm7,%xmm7");
248 asm volatile("pcmpgtb %xmm4,%xmm5");
249 asm volatile("pcmpgtb %xmm6,%xmm7");
250 asm volatile("paddb %xmm4,%xmm4");
251 asm volatile("paddb %xmm6,%xmm6");
252 asm volatile("pand %xmm0,%xmm5");
253 asm volatile("pand %xmm0,%xmm7");
254 asm volatile("pxor %xmm5,%xmm4");
255 asm volatile("pxor %xmm7,%xmm6");
256 }
257 asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
258 asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
259 /* Don't use movntdq for r/w memory area < cache line */
260 asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
261 asm volatile("movdqa %%xmm6,%0" : "=m" (q[d+16]));
262 asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
263 asm volatile("movdqa %%xmm3,%0" : "=m" (p[d+16]));
264 }
265
266 asm volatile("sfence" : : : "memory");
267 kernel_fpu_end();
268 }
269
153const struct raid6_calls raid6_sse2x2 = { 270const struct raid6_calls raid6_sse2x2 = {
154 raid6_sse22_gen_syndrome, 271 raid6_sse22_gen_syndrome,
272 raid6_sse22_xor_syndrome,
155 raid6_have_sse2, 273 raid6_have_sse2,
156 "sse2x2", 274 "sse2x2",
157 1 /* Has cache hints */ 275 1 /* Has cache hints */
@@ -248,8 +366,117 @@ static void raid6_sse24_gen_syndrome(int disks, size_t bytes, void **ptrs)
248 kernel_fpu_end(); 366 kernel_fpu_end();
249} 367}
250 368
369 static void raid6_sse24_xor_syndrome(int disks, int start, int stop,
370 size_t bytes, void **ptrs)
371 {
372 u8 **dptr = (u8 **)ptrs;
373 u8 *p, *q;
374 int d, z, z0;
375
376 z0 = stop; /* P/Q right side optimization */
377 p = dptr[disks-2]; /* XOR parity */
378 q = dptr[disks-1]; /* RS syndrome */
379
380 kernel_fpu_begin();
381
382 asm volatile("movdqa %0,%%xmm0" :: "m" (raid6_sse_constants.x1d[0]));
383
384 for ( d = 0 ; d < bytes ; d += 64 ) {
385 asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
386 asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
387 asm volatile("movdqa %0,%%xmm12" :: "m" (dptr[z0][d+32]));
388 asm volatile("movdqa %0,%%xmm14" :: "m" (dptr[z0][d+48]));
389 asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
390 asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
391 asm volatile("movdqa %0,%%xmm10" : : "m" (p[d+32]));
392 asm volatile("movdqa %0,%%xmm11" : : "m" (p[d+48]));
393 asm volatile("pxor %xmm4,%xmm2");
394 asm volatile("pxor %xmm6,%xmm3");
395 asm volatile("pxor %xmm12,%xmm10");
396 asm volatile("pxor %xmm14,%xmm11");
397 /* P/Q data pages */
398 for ( z = z0-1 ; z >= start ; z-- ) {
399 asm volatile("prefetchnta %0" :: "m" (dptr[z][d]));
400 asm volatile("prefetchnta %0" :: "m" (dptr[z][d+32]));
401 asm volatile("pxor %xmm5,%xmm5");
402 asm volatile("pxor %xmm7,%xmm7");
403 asm volatile("pxor %xmm13,%xmm13");
404 asm volatile("pxor %xmm15,%xmm15");
405 asm volatile("pcmpgtb %xmm4,%xmm5");
406 asm volatile("pcmpgtb %xmm6,%xmm7");
407 asm volatile("pcmpgtb %xmm12,%xmm13");
408 asm volatile("pcmpgtb %xmm14,%xmm15");
409 asm volatile("paddb %xmm4,%xmm4");
410 asm volatile("paddb %xmm6,%xmm6");
411 asm volatile("paddb %xmm12,%xmm12");
412 asm volatile("paddb %xmm14,%xmm14");
413 asm volatile("pand %xmm0,%xmm5");
414 asm volatile("pand %xmm0,%xmm7");
415 asm volatile("pand %xmm0,%xmm13");
416 asm volatile("pand %xmm0,%xmm15");
417 asm volatile("pxor %xmm5,%xmm4");
418 asm volatile("pxor %xmm7,%xmm6");
419 asm volatile("pxor %xmm13,%xmm12");
420 asm volatile("pxor %xmm15,%xmm14");
421 asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
422 asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
423 asm volatile("movdqa %0,%%xmm13" :: "m" (dptr[z][d+32]));
424 asm volatile("movdqa %0,%%xmm15" :: "m" (dptr[z][d+48]));
425 asm volatile("pxor %xmm5,%xmm2");
426 asm volatile("pxor %xmm7,%xmm3");
427 asm volatile("pxor %xmm13,%xmm10");
428 asm volatile("pxor %xmm15,%xmm11");
429 asm volatile("pxor %xmm5,%xmm4");
430 asm volatile("pxor %xmm7,%xmm6");
431 asm volatile("pxor %xmm13,%xmm12");
432 asm volatile("pxor %xmm15,%xmm14");
433 }
434 asm volatile("prefetchnta %0" :: "m" (q[d]));
435 asm volatile("prefetchnta %0" :: "m" (q[d+32]));
436 /* P/Q left side optimization */
437 for ( z = start-1 ; z >= 0 ; z-- ) {
438 asm volatile("pxor %xmm5,%xmm5");
439 asm volatile("pxor %xmm7,%xmm7");
440 asm volatile("pxor %xmm13,%xmm13");
441 asm volatile("pxor %xmm15,%xmm15");
442 asm volatile("pcmpgtb %xmm4,%xmm5");
443 asm volatile("pcmpgtb %xmm6,%xmm7");
444 asm volatile("pcmpgtb %xmm12,%xmm13");
445 asm volatile("pcmpgtb %xmm14,%xmm15");
446 asm volatile("paddb %xmm4,%xmm4");
447 asm volatile("paddb %xmm6,%xmm6");
448 asm volatile("paddb %xmm12,%xmm12");
449 asm volatile("paddb %xmm14,%xmm14");
450 asm volatile("pand %xmm0,%xmm5");
451 asm volatile("pand %xmm0,%xmm7");
452 asm volatile("pand %xmm0,%xmm13");
453 asm volatile("pand %xmm0,%xmm15");
454 asm volatile("pxor %xmm5,%xmm4");
455 asm volatile("pxor %xmm7,%xmm6");
456 asm volatile("pxor %xmm13,%xmm12");
457 asm volatile("pxor %xmm15,%xmm14");
458 }
459 asm volatile("movntdq %%xmm2,%0" : "=m" (p[d]));
460 asm volatile("movntdq %%xmm3,%0" : "=m" (p[d+16]));
461 asm volatile("movntdq %%xmm10,%0" : "=m" (p[d+32]));
462 asm volatile("movntdq %%xmm11,%0" : "=m" (p[d+48]));
463 asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
464 asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
465 asm volatile("pxor %0,%%xmm12" : : "m" (q[d+32]));
466 asm volatile("pxor %0,%%xmm14" : : "m" (q[d+48]));
467 asm volatile("movntdq %%xmm4,%0" : "=m" (q[d]));
468 asm volatile("movntdq %%xmm6,%0" : "=m" (q[d+16]));
469 asm volatile("movntdq %%xmm12,%0" : "=m" (q[d+32]));
470 asm volatile("movntdq %%xmm14,%0" : "=m" (q[d+48]));
471 }
472 asm volatile("sfence" : : : "memory");
473 kernel_fpu_end();
474 }
475
476
251const struct raid6_calls raid6_sse2x4 = { 477const struct raid6_calls raid6_sse2x4 = {
252 raid6_sse24_gen_syndrome, 478 raid6_sse24_gen_syndrome,
479 raid6_sse24_xor_syndrome,
253 raid6_have_sse2, 480 raid6_have_sse2,
254 "sse2x4", 481 "sse2x4",
255 1 /* Has cache hints */ 482 1 /* Has cache hints */
diff --git a/lib/raid6/test/test.c b/lib/raid6/test/test.c
index 5a485b7a7d3c..3bebbabdb510 100644
--- a/lib/raid6/test/test.c
+++ b/lib/raid6/test/test.c
@@ -28,11 +28,11 @@ char *dataptrs[NDISKS];
28char data[NDISKS][PAGE_SIZE]; 28char data[NDISKS][PAGE_SIZE];
29char recovi[PAGE_SIZE], recovj[PAGE_SIZE]; 29char recovi[PAGE_SIZE], recovj[PAGE_SIZE];
30 30
31static void makedata(void) 31static void makedata(int start, int stop)
32{ 32{
33 int i, j; 33 int i, j;
34 34
35 for (i = 0; i < NDISKS; i++) { 35 for (i = start; i <= stop; i++) {
36 for (j = 0; j < PAGE_SIZE; j++) 36 for (j = 0; j < PAGE_SIZE; j++)
37 data[i][j] = rand(); 37 data[i][j] = rand();
38 38
@@ -91,34 +91,55 @@ int main(int argc, char *argv[])
91{ 91{
92 const struct raid6_calls *const *algo; 92 const struct raid6_calls *const *algo;
93 const struct raid6_recov_calls *const *ra; 93 const struct raid6_recov_calls *const *ra;
94 int i, j; 94 int i, j, p1, p2;
95 int err = 0; 95 int err = 0;
96 96
97 makedata(); 97 makedata(0, NDISKS-1);
98 98
99 for (ra = raid6_recov_algos; *ra; ra++) { 99 for (ra = raid6_recov_algos; *ra; ra++) {
100 if ((*ra)->valid && !(*ra)->valid()) 100 if ((*ra)->valid && !(*ra)->valid())
101 continue; 101 continue;
102
102 raid6_2data_recov = (*ra)->data2; 103 raid6_2data_recov = (*ra)->data2;
103 raid6_datap_recov = (*ra)->datap; 104 raid6_datap_recov = (*ra)->datap;
104 105
105 printf("using recovery %s\n", (*ra)->name); 106 printf("using recovery %s\n", (*ra)->name);
106 107
107 for (algo = raid6_algos; *algo; algo++) { 108 for (algo = raid6_algos; *algo; algo++) {
108 if (!(*algo)->valid || (*algo)->valid()) { 109 if ((*algo)->valid && !(*algo)->valid())
109 raid6_call = **algo; 110 continue;
111
112 raid6_call = **algo;
113
114 /* Nuke syndromes */
115 memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
116
117 /* Generate assumed good syndrome */
118 raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
119 (void **)&dataptrs);
120
121 for (i = 0; i < NDISKS-1; i++)
122 for (j = i+1; j < NDISKS; j++)
123 err += test_disks(i, j);
124
125 if (!raid6_call.xor_syndrome)
126 continue;
127
128 for (p1 = 0; p1 < NDISKS-2; p1++)
129 for (p2 = p1; p2 < NDISKS-2; p2++) {
110 130
111 /* Nuke syndromes */ 131 /* Simulate rmw run */
112 memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE); 132 raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
133 (void **)&dataptrs);
134 makedata(p1, p2);
135 raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
136 (void **)&dataptrs);
113 137
114 /* Generate assumed good syndrome */ 138 for (i = 0; i < NDISKS-1; i++)
115 raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, 139 for (j = i+1; j < NDISKS; j++)
116 (void **)&dataptrs); 140 err += test_disks(i, j);
141 }
117 142
118 for (i = 0; i < NDISKS-1; i++)
119 for (j = i+1; j < NDISKS; j++)
120 err += test_disks(i, j);
121 }
122 } 143 }
123 printf("\n"); 144 printf("\n");
124 } 145 }
diff --git a/lib/raid6/tilegx.uc b/lib/raid6/tilegx.uc
index e7c29459cbcd..2dd291a11264 100644
--- a/lib/raid6/tilegx.uc
+++ b/lib/raid6/tilegx.uc
@@ -80,6 +80,7 @@ void raid6_tilegx$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
80 80
81const struct raid6_calls raid6_tilegx$# = { 81const struct raid6_calls raid6_tilegx$# = {
82 raid6_tilegx$#_gen_syndrome, 82 raid6_tilegx$#_gen_syndrome,
83 NULL, /* XOR not yet implemented */
83 NULL, 84 NULL,
84 "tilegx$#", 85 "tilegx$#",
85 0 86 0