path: root/drivers/md
Commit message / Author / Age
...
| * dm raid1: use hold framework in do_failures  (Mikulas Patocka, 2009-12-10)

    Use the hold framework in do_failures. This patch doesn't change the bio
    processing logic, it just simplifies failure handling and avoids
    periodically polling the failures list.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Reviewed-by: Takahiro Yasui <tyasui@redhat.com>
    Tested-by: Takahiro Yasui <tyasui@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: add framework to hold bios during suspend  (Mikulas Patocka, 2009-12-10)

    Add a framework to delay bios until a suspend and then either resubmit
    them with DM_ENDIO_REQUEUE (if the suspend was noflush) or complete them
    with -EIO. I/O barrier support will use this.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Reviewed-by: Takahiro Yasui <tyasui@redhat.com>
    Tested-by: Takahiro Yasui <tyasui@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: report flush errors separately in status  (Mikulas Patocka, 2009-12-10)

    Report flush errors as 'F' instead of 'D' for log and mirror devices.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: implement mirror_flush  (Mikulas Patocka, 2009-12-10)

    Implement the flush callee. It uses dm_io to send a zero-size barrier
    synchronously and concurrently to all the mirror legs.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log: use flush callback fn  (Mikulas Patocka, 2009-12-10)

    Call the flush callback from the log. If the flush fails, we have no
    alternative but to mark the whole log as dirty. We also set the variable
    flush_failed to prevent any bits ever being marked as clean again.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log: add flush callback fn  (Mikulas Patocka, 2009-12-10)

    Introduce a callback pointer from the log to the dm-raid1 layer. Before
    some region is set as "in-sync", we need to flush the hardware cache on
    all the disks, but the log module doesn't have access to the mirror_set
    structure, so it will use this callback instead. The callback is unused
    so far; it will be used in further patches.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log: introduce flush_failed variable  (Mikulas Patocka, 2009-12-10)

    Introduce a "flush failed" variable. When a flush before clearing a bit
    in the log fails, we don't know anything about which regions are in sync
    and which are not, so we need to mark all regions as not-in-sync and set
    the variable "flush_failed" to prevent setting the in-sync bit in the
    future. A target reload is the only way to get out of this situation.
    The variable will be set in following patches.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log: add flush_header function  (Mikulas Patocka, 2009-12-10)

    Introduce flush_header and use it to flush the log device. Note that we
    don't have to flush if all the regions transition from "dirty" to
    "clean" state.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: split touched state into two  (Mikulas Patocka, 2009-12-10)

    Split the variable "touched" into two, "touched_dirtied" and
    "touched_cleaned", set when some region was dirtied or cleaned. This
    will be used to optimize flushes.

    After a transition from "dirty" to "clean" state we don't have to flush
    the hardware cache on the log device. After a transition from "clean" to
    "dirty" the cache must be flushed. Before a transition from "clean" to
    "dirty" state we don't have to flush all the raid legs. Before a
    transition from "dirty" to "clean" we must flush all the legs to make
    sure that they are really in sync.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: support flush  (Mikulas Patocka, 2009-12-10)

    Flush support for dm-raid1: when it receives an empty barrier, submit it
    to all the devices via dm-io.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm io: remove extra bi_io_vec region hack  (Mikulas Patocka, 2009-12-10)

    Remove the hack where we allocate an extra bi_io_vec to store additional
    private data. This hack prevents us from supporting barriers in dm-raid1
    without first making another little block layer change. Instead of doing
    that, this patch eliminates the bi_io_vec abuse by storing the region
    number directly in the low bits of bi_private.

    We need to store two things for each bio: the pointer to the main io
    structure and, if parallel writes were requested, an index indicating
    which of these writes this bio belongs to. There can be at most
    BITS_PER_LONG regions - 32 or 64.

    The index (region number) was stored in the last (hidden) bio vector and
    the pointer to struct io was stored in bi_private. This patch now aligns
    "struct io" on BITS_PER_LONG bytes and stores the region number in the
    freed low bits of bi_private.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
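    The low-bits packing this describes is easier to see in code. Below is a
    minimal user-space sketch of the idea, assuming an illustrative struct io
    and a MAX_REGIONS stand-in for BITS_PER_LONG; it is not the dm-io
    implementation, only the "alignment frees the low bits" technique the
    message describes.

        #include <assert.h>
        #include <stdalign.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define MAX_REGIONS 64          /* stand-in for BITS_PER_LONG */

        /* Aligning the structure to MAX_REGIONS bytes guarantees the low
         * log2(MAX_REGIONS) bits of any pointer to it are zero, so they
         * can carry a small region index alongside the pointer. */
        struct io {
            alignas(MAX_REGIONS) int error_bits;
        };

        static void *pack(struct io *io, unsigned region)
        {
            assert(region < MAX_REGIONS);
            return (void *)((uintptr_t)io | region);
        }

        static struct io *unpack_io(void *p)
        {
            return (struct io *)((uintptr_t)p & ~(uintptr_t)(MAX_REGIONS - 1));
        }

        static unsigned unpack_region(void *p)
        {
            return (unsigned)((uintptr_t)p & (MAX_REGIONS - 1));
        }

        int main(void)
        {
            struct io *io = aligned_alloc(MAX_REGIONS, sizeof(*io));
            void *cookie = pack(io, 13);   /* what would go into bi_private */

            printf("io=%p region=%u\n", (void *)unpack_io(cookie),
                   unpack_region(cookie));
            free(io);
            return 0;
        }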
| * dm io: use slab for struct io  (Mikulas Patocka, 2009-12-10)

    Allocate "struct io" from a slab. This patch changes dm-io, so that
    "struct io" is allocated from a slab cache. It used to be allocated with
    kmalloc. Allocating from a slab will be needed for the next patch,
    because it requires a special alignment of "struct io" and kmalloc
    cannot meet this alignment.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
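    For the alignment itself, a slab cache can be created with an explicit
    object alignment. A hedged kernel-style sketch with made-up names (the
    real cache setup in dm-io may differ in details):

        #include <linux/bitops.h>
        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/slab.h>

        struct io_example {              /* stand-in for dm-io's struct io */
            unsigned long error_bits;
        };

        static struct kmem_cache *io_cache;

        static int __init io_example_init(void)
        {
            /* The third argument is the object alignment: BITS_PER_LONG
             * bytes, so every object's address has enough zero low bits
             * to hold a region number (see the patch above). */
            io_cache = kmem_cache_create("io_example",
                                         sizeof(struct io_example),
                                         BITS_PER_LONG, 0, NULL);
            return io_cache ? 0 : -ENOMEM;
        }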
| * dm crypt: make wipe message also wipe essiv key  (Milan Broz, 2009-12-10)

    The "wipe key" message is used to wipe the volume key from memory
    temporarily, for example when suspending to RAM. But the initialisation
    vector in ESSIV mode is calculated from the hashed volume key, so the
    wipe message should wipe this IV key too and reinitialise it when the
    volume key is reinstated.

    This patch adds an IV wipe method called from a wipe message callback.
    ESSIV is then reinitialised using the init function added by the last
    patch.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: separate essiv allocation from initialisation  (Milan Broz, 2009-12-10)

    This patch separates the construction of the IV from its initialisation
    (for ESSIV it is a hash calculation based on the volume key).

    Constructor code now preallocates the hash tfm and salt array and saves
    them in a private IV structure. The next patch requires this to
    reinitialise the wiped IV without reallocating memory when resuming a
    suspended device.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: restructure essiv error path  (Milan Broz, 2009-12-10)

    Use kzfree for salt deallocation because it is derived from the volume
    key. Use a common error path in the ESSIV constructor.

    Required by a later patch which fixes the way key material is wiped from
    memory.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: move private iv fields to structs  (Milan Broz, 2009-12-10)

    Define private structures for the IV so it's easy to add further
    attributes in a following patch which fixes the way key material is
    wiped from memory. Also move the ESSIV destructor and remove the
    unnecessary 'status' operation.

    There are no functional changes in this patch.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: make wipe message also wipe tfm key  (Milan Broz, 2009-12-10)

    The "wipe key" message is used to wipe a volume key from memory
    temporarily, for example when suspending to RAM. There are two instances
    of the key in memory (inside the crypto tfm) but only one got wiped.
    This patch wipes them both.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: cope with chunk size larger than origin  (Mikulas Patocka, 2009-12-10)

    Under some special conditions the snapshot hash_size is calculated as
    zero. This patch instead sets a minimum value of 64, the same as for the
    pending exception table.

    rounddown_pow_of_two(0) is an undefined operation (it expands to shift
    by -1). init_exception_table with an argument of 0 would fail with
    -ENOMEM.

    The way to trigger the problem is to create a snapshot with a chunk size
    that is larger than the origin device.

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
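    A small user-space sketch of the fix described here. The sizing formula
    is only a placeholder, and rounddown_pow_of_two() is reimplemented for
    its defined (non-zero) inputs; the point is the clamp to 64 before
    rounding.

        #include <stdio.h>

        /* User-space stand-in; the kernel macro is undefined for 0. */
        static unsigned long rounddown_pow_of_two(unsigned long n)
        {
            unsigned long p = 1;

            while (p * 2 <= n)
                p *= 2;
            return p;
        }

        /* Clamp to a minimum of 64, so an origin smaller than one chunk
         * can no longer yield a zero-sized exception table. */
        static unsigned long snapshot_hash_size(unsigned long origin_sectors,
                                                unsigned long chunk_sectors)
        {
            unsigned long hash_size = origin_sectors / chunk_sectors / 16;

            if (hash_size < 64)
                hash_size = 64;
            return rounddown_pow_of_two(hash_size);
        }

        int main(void)
        {
            /* Origin smaller than one chunk: was 0 before the fix. */
            printf("%lu\n", snapshot_hash_size(128, 256));
            return 0;
        }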
| * dm snapshot: only take lock for statustype info not table  (Mikulas Patocka, 2009-12-10)

    Take snapshot lock only for STATUSTYPE_INFO, not STATUSTYPE_TABLE.

    Commit 4c6fff445d7aa753957856278d4d93bcad6e2c14
    (dm-snapshot-lock-snapshot-while-supplying-status.patch) introduced this
    use of the lock, but userspace applications using libdevmapper have been
    found to request STATUSTYPE_TABLE while the device is suspended and the
    lock is already held, leading to deadlock. Since the lock is not
    necessary in this case, don't try to take it.

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: sysfs add empty release function to avoid debug warning  (Milan Broz, 2009-12-10)

    This patch just removes an unnecessary warning:

        kobject: 'dm': does not have a release() function, it is broken and
        must be fixed.

    The kobject is embedded in the mapped device struct, so the code does
    not need to release memory explicitly here.

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm exception store: free tmp_store on persistent flag error  (Julia Lawall, 2009-12-10)

    Error handling code following a kmalloc should free the allocated data.

    Cc: stable@kernel.org
    Signed-off-by: Julia Lawall <julia@diku.dk>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: avoid _hash_lock deadlock  (Mikulas Patocka, 2009-12-10)

    Fix a reported deadlock if there are still unprocessed multipath events
    on a device that is being removed.

    _hash_lock is held during dev_remove while trying to send the
    outstanding events. Sending the events requests the _hash_lock again in
    dm_copy_name_and_uuid.

    This patch introduces a separate lock around regions that modify the
    link to the hash table (dm_set_mdptr) or the name or uuid so that
    dm_copy_name_and_uuid no longer needs _hash_lock.

    Additionally, dm_copy_name_and_uuid can only be called if md exists so
    we can drop the dm_get() and dm_put() which can lead to a BUG() while md
    is being freed.

    The deadlock:
     #0 [ffff8106298dfb48] schedule at ffffffff80063035
     #1 [ffff8106298dfc20] __down_read at ffffffff8006475d
     #2 [ffff8106298dfc60] dm_copy_name_and_uuid at ffffffff8824f740
     #3 [ffff8106298dfc90] dm_send_uevents at ffffffff88252685
     #4 [ffff8106298dfcd0] event_callback at ffffffff8824c678
     #5 [ffff8106298dfd00] dm_table_event at ffffffff8824dd01
     #6 [ffff8106298dfd10] __hash_remove at ffffffff882507ad
     #7 [ffff8106298dfd30] dev_remove at ffffffff88250865
     #8 [ffff8106298dfd60] ctl_ioctl at ffffffff88250d80
     #9 [ffff8106298dfee0] do_ioctl at ffffffff800418c4
    #10 [ffff8106298dff00] vfs_ioctl at ffffffff8002fab9
    #11 [ffff8106298dff40] sys_ioctl at ffffffff8004bdaf
    #12 [ffff8106298dff80] tracesys at ffffffff8005d28d (via system_call)

    Cc: stable@kernel.org
    Reported-by: guy keren <choo@actcom.co.il>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
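    The shape of the fix, sketched in user space with pthread rwlocks
    (illustrative globals and function names, not the dm ioctl code; compile
    with -pthread): the name and uuid get their own narrow lock, so the
    event path can copy them while the removal path holds the big hash lock.

        #include <pthread.h>
        #include <string.h>

        static pthread_rwlock_t hash_lock = PTHREAD_RWLOCK_INITIALIZER; /* table topology */
        static pthread_rwlock_t name_lock = PTHREAD_RWLOCK_INITIALIZER; /* name + uuid only */

        static char dev_name[128];
        static char dev_uuid[129];

        /* Rename path: still runs under hash_lock elsewhere, but also takes
         * the new narrow lock around the fields the event path reads. */
        static void set_name_and_uuid(const char *name, const char *uuid)
        {
            pthread_rwlock_wrlock(&name_lock);
            strncpy(dev_name, name, sizeof(dev_name) - 1);
            strncpy(dev_uuid, uuid, sizeof(dev_uuid) - 1);
            pthread_rwlock_unlock(&name_lock);
        }

        /* Event path: needs only name_lock now, so it no longer deadlocks
         * when its caller already holds hash_lock. */
        static void copy_name_and_uuid(char *name, char *uuid)
        {
            pthread_rwlock_rdlock(&name_lock);
            strcpy(name, dev_name);
            strcpy(uuid, dev_uuid);
            pthread_rwlock_unlock(&name_lock);
        }

        /* Removal path: holds hash_lock while sending outstanding events,
         * exactly the situation that used to deadlock. */
        static void remove_device_and_send_events(char *name, char *uuid)
        {
            pthread_rwlock_wrlock(&hash_lock);
            copy_name_and_uuid(name, uuid);
            pthread_rwlock_unlock(&hash_lock);
        }

        int main(void)
        {
            char name[128], uuid[129];

            set_name_and_uuid("example-dev", "EXAMPLE-uuid");
            remove_device_and_send_events(name, uuid);
            return 0;
        }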
* | drivers/md/md.c: use %pU to print UUIDs  (Joe Perches, 2009-12-15)

    Signed-off-by: Joe Perches <joe@perches.com>
    Cc: Neil Brown <neilb@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
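    For reference, %pU is printk's extended conversion for a 16-byte UUID
    buffer. A minimal sketch (not the actual md.c hunk):

        #include <linux/kernel.h>
        #include <linux/types.h>

        static void report_uuid(const u8 uuid[16])
        {
            /* Prints xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx instead of
             * needing sixteen separate %02x conversions. */
            pr_info("md: array uuid: %pU\n", uuid);
        }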
* | tree-wide: convert open calls to remove spaces to skip_spaces() lib function  (André Goddard Rosa, 2009-12-15)

    Makes use of skip_spaces() defined in lib/string.c for removing leading
    spaces from strings all over the tree.

    It decreases lib.a code size by 47 bytes and reuses the function
    tree-wide:

        text    data    bss     dec     hex     filename
        64688   584     592     65864   10148   (TOTALS-BEFORE)
        64641   584     592     65817   10119   (TOTALS-AFTER)

    Also, while at it, if we see (*str && isspace(*str)), we can be sure to
    remove the first condition (*str) as the second one (isspace(*str)) also
    evaluates to 0 whenever *str == 0, making it redundant. In other words,
    "a char equals zero is never a space".

    Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
    and found occurrences of this pattern on 3 more files:
        drivers/leds/led-class.c
        drivers/leds/ledtrig-timer.c
        drivers/video/output.c

        @@
        expression str;
        @@
        ( // ignore skip_spaces cases
        while (*str && isspace(*str)) { \(str++;\|++str;\) }
        |
        - *str && isspace(*str)
        )

    Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
    Cc: Julia Lawall <julia@diku.dk>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Jeff Dike <jdike@addtoit.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Richard Purdie <rpurdie@rpsys.net>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Kyle McMartin <kyle@mcmartin.ca>
    Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
    Cc: David Howells <dhowells@redhat.com>
    Cc: <linux-ext4@vger.kernel.org>
    Cc: Samuel Ortiz <samuel@sortiz.org>
    Cc: Patrick McHardy <kaber@trash.net>
    Cc: Takashi Iwai <tiwai@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
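    The pattern being converted, as a hedged kernel-style sketch (the caller
    is illustrative, not one of the real call sites):

        #include <linux/ctype.h>
        #include <linux/string.h>

        static char *parse_word(char *buf)
        {
            /* Open-coded form this series replaces; the *buf test is
             * redundant because isspace('\0') is already false:
             *
             *     while (*buf && isspace(*buf))
             *         buf++;
             */
            return skip_spaces(buf);
        }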
* | md: add 'recovery_start' per-device sysfs attribute  (Dan Williams, 2009-12-13)

    Enable external metadata arrays to manage rebuild checkpointing via a
    md/dev-XXX/recovery_start attribute which reflects rdev->recovery_offset.

    Also update resync_start_store to allow 'none' to be written, for
    consistency.

    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: rcu_read_lock() walk of mddev->disks in md_do_sync()  (Dan Williams, 2009-12-13)

    Other walks of this list are either under rcu_read_lock() or the list
    mutation lock (mddev_lock()). This protects against the improbable case
    of a disk being removed from the array at the start of md_do_sync().

    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | md: integrate spares into array at earliest opportunity.  (NeilBrown, 2009-12-13)

    As v1.x metadata can record that a member of the array is not completely
    recovered, it makes sense to record that a spare has become a regular
    member of the array at the earliest opportunity.

    So remove the tests on "recovery_offset > 0" in super_1_sync as they
    really aren't needed, and schedule a metadata update immediately after
    adding spares to a degraded array.

    This means that if a crash happens immediately after a recovery starts,
    the new device will be included in the array and recovery will continue
    from wherever it was up to. Previously this didn't happen unless
    recovery was at least 1/16 of the way through.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: move compat_ioctl handling into md.c  (Arnd Bergmann, 2009-12-13)

    The RAID ioctls are only implemented in md.c, so the handling for them
    should also be moved there from fs/compat_ioctl.c.

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Andre Noll <maan@systemlinux.org>
    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: revise Kconfig help for MD_MULTIPATH  (NeilBrown, 2009-12-13)

    Make it clear in the config message that MD_MULTIPATH is not under
    active development.

    Cc: Oren Held <orenhe@il.ibm.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: add MODULE_DESCRIPTION for all md related modules.  (NeilBrown, 2009-12-13)

    Suggested by Oren Held <orenhe@il.ibm.com>

    Signed-off-by: NeilBrown <neilb@suse.de>
* | raid: improve MD/raid10 handling of correctable read errors.  (Robert Becker, 2009-12-13)

    We've noticed severe lasting performance degradation of our raid arrays
    when we have drives that yield large amounts of media errors. The raid10
    module will queue each failed read for retry, and will also attempt to
    call fix_read_error() to perform the read recovery. Read recovery is
    performed while the array is frozen, so repeated recovery attempts can
    degrade the performance of the array for extended periods of time.

    With this patch I propose adding a per md device max number of corrected
    read attempts. Each rdev will maintain a count of read correction
    attempts in the rdev->read_errors field (not used currently for raid10).
    When we enter fix_read_error() we'll check to see when the last read
    error occurred, and divide the read error count by 2 for every hour
    since the last read error. If at that point our read error count exceeds
    the read error threshold, we'll fail the raid device.

    In addition in this patch I add sysfs nodes (get/set) for the per md
    max_read_errors attribute, the rdev->read_errors attribute, and added
    some printk's to indicate when fix_read_error fails to repair an rdev.

    For testing I used debugfs->fail_make_request to inject IO errors to the
    rdev while doing IO to the raid array.

    Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
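    A user-space sketch of the decay and threshold logic described above
    (the field names and exact bookkeeping are placeholders, not raid10.c):

        #include <stdbool.h>
        #include <stdio.h>
        #include <time.h>

        struct rdev_stats {
            unsigned int read_errors;    /* decayed count of corrected errors */
            time_t last_read_error;
        };

        /* Returns true when the device has accumulated too many recent
         * correctable errors and should be failed rather than repeatedly
         * freezing the array for recovery. */
        static bool record_read_error(struct rdev_stats *r,
                                      unsigned int max_read_errors, time_t now)
        {
            unsigned int hours = (unsigned int)((now - r->last_read_error) / 3600);

            /* Halve the count for every hour since the last error, so old,
             * sporadic errors stop counting against the device; cap the
             * shift so it stays within the width of read_errors. */
            if (hours >= 32)
                r->read_errors = 0;
            else
                r->read_errors >>= hours;

            r->read_errors++;
            r->last_read_error = now;

            return r->read_errors > max_read_errors;
        }

        int main(void)
        {
            struct rdev_stats r = { 0, 0 };
            time_t now = time(NULL);
            int i;

            for (i = 0; i < 25; i++)
                if (record_read_error(&r, 20, now))
                    printf("would fail device after %d errors\n", i + 1);
            return 0;
        }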
* | md/raid10: print more useful messages on device failure.  (Robert Becker, 2009-12-13)

    When we get a read error on a device in a RAID10, and attempting to
    repair the error fails, print more useful messages about why it failed.

    Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
* | md/bitmap: update dirty flag when bitmap bits are explicitly set.  (NeilBrown, 2009-12-13)

    There is a sysfs file which allows bits in the write-intent bitmap to be
    explicitly set - indicating that the block is thought to be 'dirty'.
    When this happens we should really set recovery_cp backwards to include
    the block to reflect this dirtiness.

    In particular, a 'resync' process will refuse to start if recovery_cp is
    beyond the end of the array, so this is needed to allow a resync to be
    triggered.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: Support write-intent bitmaps with externally managed metadata.  (NeilBrown, 2009-12-13)

    In this case, the metadata needs to not be in the same sector as the
    bitmap. md will not read/write any bitmap metadata. Config must be done
    via sysfs and when a recovery makes the array non-degraded again,
    writing 'true' to 'bitmap/can_clear' will allow bits in the bitmap to be
    cleared again.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md/bitmap: move setting of daemon_lastrun out of bitmap_read_sb  (NeilBrown, 2009-12-13)

    Setting daemon_lastrun really has nothing to do with reading the bitmap
    superblock, it just happens to be needed at the same time.
    bitmap_read_sb is about to become optional, so move that code out to
    after the call to bitmap_read_sb.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: support updating bitmap parameters via sysfs.  (NeilBrown, 2009-12-13)

    A new attribute directory 'bitmap' in 'md' is created which contains
    files for configuring the bitmap.

    'location' identifies where the bitmap is, either 'none', or 'file' or
    'sector offset from metadata'. Writing 'location' can create or remove a
    bitmap. Adding a 'file' bitmap this way is not yet supported.

    'chunksize' and 'time_base' must be set before 'location' can be set.
    'chunksize' can be set before creating a bitmap, but is currently always
    over-ridden by the bitmap superblock.

    'time_base' and 'backlog' can be updated at any time.

    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Andre Noll <maan@systemlinux.org>
* | md: factor out parsing of fixed-point numbers  (NeilBrown, 2009-12-13)

    safe_delay_store can parse fixed point numbers (for fractions of a
    second). We will want to do that for another sysfs file soon, so factor
    out the code.

    Signed-off-by: NeilBrown <neilb@suse.de>
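    An illustrative stand-in for a factored-out helper of this kind (the
    name and interface are placeholders, not the md.c function): parse a
    decimal string such as "1.25" into an integer scaled by 10^scale.

        #include <ctype.h>
        #include <stdio.h>

        static int parse_fixed_point(const char *buf, unsigned long *res,
                                     int scale)
        {
            unsigned long value = 0;
            int decimals = -1;           /* -1: no '.' seen yet */

            for (; *buf && !isspace((unsigned char)*buf); buf++) {
                if (*buf == '.' && decimals < 0) {
                    decimals = 0;
                    continue;
                }
                if (!isdigit((unsigned char)*buf))
                    return -1;
                value = value * 10 + (unsigned long)(*buf - '0');
                if (decimals >= 0)
                    decimals++;
                if (decimals > scale)
                    return -1;           /* more precision than we keep */
            }
            if (decimals < 0)
                decimals = 0;
            while (decimals++ < scale)
                value *= 10;
            *res = value;
            return 0;
        }

        int main(void)
        {
            unsigned long v;

            if (!parse_fixed_point("1.25", &v, 3))
                printf("%lu\n", v);      /* prints 1250 */
            return 0;
        }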
* | md: support bitmap offset appropriate for external-metadata arrays.  (NeilBrown, 2009-12-13)

    For md arrays where metadata is managed externally, the kernel does not
    know about a superblock, so the superblock offset is 0. If we want to
    have a write-intent-bitmap near the end of the devices of such an array,
    we should support a sector_t-sized offset. We need the offset to be
    possibly negative for when the bitmap is before the metadata, so use
    loff_t instead.

    Also add a sanity check that the bitmap does not overlap with data.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: remove needless setting of thread->timeout in raid10_quiesce  (NeilBrown, 2009-12-13)

    As bitmap_create and bitmap_destroy already set thread->timeout as
    appropriate, there is no need to do it in raid10_quiesce.

    There is a possible need to wake the thread after the timeout has been
    set low, but it is better to do that where the timeout is actually set
    low, in bitmap_create.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.  (NeilBrown, 2009-12-13)

    This removes a lot of multiplications by HZ.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: move offset, daemon_sleep and chunksize out of bitmap structure  (NeilBrown, 2009-12-13)

    ... and into bitmap_info. These are all configuration parameters that
    need to be set before the bitmap is created.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: collect bitmap-specific fields into one structure.  (NeilBrown, 2009-12-13)

    In preparation for making bitmap fields configurable via sysfs, start
    tidying up by making a single structure to contain the configuration
    fields.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md/raid1: add takeover support for raid5->raid1  (NeilBrown, 2009-12-13)

    A 2-device raid5 array can now be converted to raid1.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: add honouring of suspend_{lo,hi} to raid1.  (NeilBrown, 2009-12-13)

    This will allow us to stop writeout to portions of the array while they
    are resynced by someone else - e.g. another node in a cluster.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md/raid5: don't complete make_request on barrier until writes are scheduled  (NeilBrown, 2009-12-13)

    The post-barrier-flush is sent by md as soon as make_request on the
    barrier write completes. For raid5, the data might not be in the
    per-device queues yet. So for barrier requests, wait for any pre-reading
    to be done so that the request will be in the per-device queues.

    We use the 'preread_active' count to check that nothing is still in the
    preread phase, and delay the decrement of this count until after write
    requests have been submitted to the underlying devices.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: support barrier requests on all personalities.  (NeilBrown, 2009-12-13)

    Previously barriers were only supported on RAID1. This is because other
    levels require synchronisation across all devices and so needed a
    different approach. Here is that approach.

    When a barrier arrives, we send a zero-length barrier to every active
    device. When that completes - and if the original request was not empty
    - we submit the barrier request itself (with the barrier flag cleared)
    and then submit a fresh load of zero-length barriers.

    The barrier request itself is asynchronous, but any subsequent request
    will block until the barrier completes.

    The reason for clearing the barrier flag is that a barrier request is
    allowed to fail. If we pass a non-empty barrier through a striping raid
    level it is conceivable that part of it could succeed and part could
    fail. That would be way too hard to deal with. So if the first run of
    zero-length barriers succeeds, we assume all is sufficiently well that
    we send the request and ignore errors in the second run of barriers.

    RAID5 needs extra care as write requests may not have been submitted to
    the underlying devices yet. So we flush the stripe cache before
    proceeding with the barrier.

    Note that the second set of zero-length barriers are submitted
    immediately after the original request is submitted. Thus when a
    personality finds mddev->barrier to be set during make_request, it
    should not return from make_request until the corresponding per-device
    request(s) have been queued. That will be done in later patches.

    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Andre Noll <maan@systemlinux.org>
* | md: don't reset curr_resync_completed after an interrupted resync  (NeilBrown, 2009-12-13)

    If a resync/recovery/check/repair is interrupted for some reason, it can
    be useful to know exactly where it got up to. So in that case, do not
    clear curr_resync_completed. Initialise it when starting a
    resync/recovery/... instead.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: adjust resync_min usefully when resync aborts.  (NeilBrown, 2009-12-13)

    When a 'check' or 'repair' finishes we should clear resync_min so that a
    future check/repair will cover the whole array (by default). However if
    it is interrupted, we should update resync_min to where we got up to, so
    that when the check/repair continues it just does the remainder of the
    array.

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md: remove sparse warning: symbol XXX was not declared.  (NeilBrown, 2009-12-13)

    Signed-off-by: NeilBrown <neilb@suse.de>
* | md/raid5: remove some sparse warnings.  (NeilBrown, 2009-12-13)

    qd_idx is previously declared and given exactly the same value!

    Signed-off-by: NeilBrown <neilb@suse.de>