litmus-rt.git - The LITMUS^RT kernel.

	Commit message (Collapse)	Author	Age
*	md: Push down reconstruction log message to personality code.	Andre Noll	2009-06-17
\| \| \| \| \| \| \| \| \| \| \| \|	Currently, the md layer checks in analyze_sbs() if the raid level supports reconstruction (mddev->level >= 1) and if reconstruction is in progress (mddev->recovery_cp != MaxSector). Move that printk into the personality code of those raid levels that care (levels 1, 4, 5, 6, 10). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Convert mddev->new_chunk to sectors.	Andre Noll	2009-06-17
\| \| \| \| \| \| \| \| \| \|	A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Make mddev->chunk_size sector-based.	Andre Noll	2009-06-17
\| \| \| \| \| \| \| \| \| \| \| \| \|	This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove mddev_to_conf "helper" macro	NeilBrown	2009-06-16
\| \| \| \| \| \| \| \| \| \|	Having a macro just to cast a void* isn't really helpful. I would must rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>
*	block: Use accessor functions for queue limits	Martin K. Petersen	2009-05-22
\| \| \| \| \| \| \| \|	Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	block: move bio list helpers into bio.h	Christoph Hellwig	2009-04-15
\| \| \| \| \| \| \| \| \|	It's used by DM and MD and generally useful, so move the bio list helpers into bio.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	md/raid1: fix build breakage	Alexander Beregalov	2009-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix this build error: drivers/md/raid1.c: In function 'raid1_congested': drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared BDI_write_congested was changed in commit 1faa16d228 ("block: change the request allocation/congestion logic to be sync/async based") Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md/raid1 - don't assume newly allocated bvecs are initialised.	NeilBrown	2009-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit d3f761104b097738932afcc310fbbbbfb007ef92 newly allocated bvecs aren't initialised to NULL, so we have to be more careful about freeing a bio which only managed to get a few pages allocated to it. Otherwise the resync process crashes. This patch is appropriate for 2.6.29-stable. Cc: stable@kernel.org Cc: "Jens Axboe" <jens.axboe@oracle.com> Reported-by: Gabriele Tozzi <gabriele@tozzi.eu> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: 'array_size' sysfs attribute	Dan Williams	2009-03-31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow userspace to set the size of the array according to the following semantics: 1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0) a) If size is set before the array is running, do_md_run will fail if size is greater than the default size b) A reshape attempt that reduces the default size to less than the set array size should be blocked 2/ once userspace sets the size the kernel will not change it 3/ writing 'default' to this attribute returns control of the size to the kernel and reverts to the size reported by the personality Also, convert locations that need to know the default size from directly reading ->array_sectors to <pers>_size. Resync/reshape operations always follow the default size. Finally, fixup other locations that read a number of 1k-blocks from userspace to use strict_blocks_to_sectors() which checks for unsigned long long to sector_t overflow and blocks to sectors overflow. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
*	md: centralize ->array_sectors modifications	Dan Williams	2009-03-30
\| \| \| \| \| \| \| \| \|	Get personalities out of the business of directly modifying ->array_sectors. Lays groundwork to introduce policy on when ->array_sectors can be modified. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
*	md: add 'size' as a personality method	Dan Williams	2009-03-30
\| \| \| \| \| \| \| \| \| \| \| \| \|	In preparation for giving userspace control over ->array_sectors we need to be able to retrieve the 'default' size, and the 'anticipated' size when a reshape is requested. For personalities that do not reshape emit a warning if anything but the default size is requested. In the raid5 case we need to update ->previous_raid_disks to make the new 'default' size available. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
*	md: enable suspend/resume of md devices.	NeilBrown	2009-03-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To be able to change the 'level' of an md/raid array, we need to suspend the device so that no requests are active - then move some pointers around etc. The code already keeps counts of active requests and the ->quiesce function can be used to wait until those counts hit zero. However the quiesce function blocks new requests once they are all ready 'inside' the personality module, and that is too late if we want to replace the personality modules. So make all md requests come in through a common md_make_request function that keeps track of how many requests have entered the modules but may not yet be on the internal reference counts. Allow md_make_request to be blocked when we want to suspend the device, and make it possible to wait for all those in-transit requests to be added to internal lists so that ->quiesce can wait for them. There is still a problem that when a request completes, we drop the ref count inside the personality code so there is a short time between when the refcount hits zero, and when the personality code is no longer being used. The personality code never blocks (schedule or spinlock) between dropping the refcount and exiting the routine, so this should be safe (as put_module calls synchronize_sched() before unmapping the module code). Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Make mddev->size sector-based.	Andre Noll	2009-03-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move md_k.h from include/linux/raid/ to drivers/md/	NeilBrown	2009-03-30
\| \| \| \| \| \|	It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move lots of #include lines out of .h files and into .c	NeilBrown	2009-03-30
\| \| \| \| \| \| \| \| \| \|	This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move headers out of include/linux/raid/	Christoph Hellwig	2009-03-30
\| \| \| \| \| \| \| \| \| \|	Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: avoid races when stopping resync.	NeilBrown	2009-02-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There has been a race in raid10 and raid1 for a long time which has only recently started showing up due to a scheduler changed. When a sync_read request finishes, as soon as reschedule_retry is called, another thread can mark the resync request as having completed, so md_do_sync can finish, ->stop can be called, and ->conf can be freed. So using conf after reschedule_retry is not safe. Similarly, when finishing a sync_write, calling md_done_sync must be the last thing we do, as it allows a chain of events which will free conf and other data structures. The first of these requires action in raid10.c The second requires action in raid1.c and raid10.c Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Allow read error in a single drive raid1 to be passed up.	NeilBrown	2009-02-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a raid1 only has a single working device and gets a read error, we choose to simply return that error up to the filesystem (or whatever) rather than failing the whole array. However the codes doesn't quite do that. We attempt a readbalance which allocates the same drive, so we retry the read - indefinitely. Instead: If read_balance in the error case chooses the same drive that just failed, treat it as a failure and don't retry. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: don't retry recovery of raid1 that fails due to error on source drive.	NeilBrown	2009-01-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a raid1 has only one working drive and it has a sector which gives an error on read, then an attempt to recover onto a spare will fail, but as the single remaining drive is not removed from the array, the recovery will be immediately re-attempted, resulting in an infinite recovery loop. So detect this situation and don't retry recovery once an error on the lone remaining drive is detected. Allow recovery to be retried once every time a spare is added in case the problem wasn't actually a media error. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: use list_for_each_entry macro directly	Cheng Renquan	2009-01-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some calls to each_entry_safe don't really need a safe version, just a direct list_for_each_entry is enough, this could save a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, totally save many tmp vars; and only in the other situations that will call list_del to delete an entry, the safe version is used. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: build failure due to missing delay.h	Stephen Rothwell	2008-10-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Today's linux-next build (powerpc ppc64_defconfig) failed like this: drivers/md/raid1.c: In function 'sync_request': drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible' make[3]: * [drivers/md/raid1.o] Error 1 make[3]: * Waiting for unfinished jobs.... drivers/md/raid10.c: In function 'sync_request': drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid10.o] Error 1 drivers/md/md.c: In function 'md_do_sync': drivers/md/md.c:5915: error: implicit declaration of function 'msleep' Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove unnecessary #includes, #defines, and function declarations"). I added the following patch. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>
*	block: move stats from disk to part0	Tejun Heo	2008-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_() now updates part0 together if the specified partition is not part0. ie. part_stat_() are now essentially all_stat_(). {disk\|all}_stat_() are gone. part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc\|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	block: fix diskstats access	Tejun Heo	2008-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	block: raid fixups for removal of bi_hw_segments	Jens Axboe	2008-10-09
\| \| \| \|	Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	drop vmerge accounting	Mikulas Patocka	2008-10-09
\| \| \| \| \| \| \| \|	Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	md: Make mddev->array_size sector-based.	Andre Noll	2008-07-21
\| \| \| \| \| \| \| \| \|	This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: resolve external metadata handling deadlock in md_allow_write	Dan Williams	2008-06-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	md_allow_write() marks the metadata dirty while holding mddev->lock and then waits for the write to complete. For externally managed metadata this causes a deadlock as userspace needs to take the lock to communicate that the metadata update has completed. Change md_allow_write() in the 'external' case to start the 'mark active' operation and then return -EAGAIN. The expected side effects while waiting for userspace to write 'active' to 'array_state' are holding off reshape (code currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes these failures can be mitigated by coordinating with mdmon. md_write_start() still prevents writes from occurring until the metadata handler has had a chance to take action as it unconditionally waits for MD_CHANGE_CLEAN to be cleared. [neilb@suse.de: return -EAGAIN, try GFP_NOIO] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
*	rationalise return value for ->hot_add_disk method.	Neil Brown	2008-06-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For all array types but linear, ->hot_add_disk returns 1 on success, 0 on failure. For linear, it returns 0 on success and -errno on failure. This doesn't cause a functional problem because the ->hot_add_disk function of linear is used quite differently to the others. However it is confusing. So convert all to return 0 for success or -errno on failure and fix call sites to match. Signed-off-by: Neil Brown <neilb@suse.de>
*	Support adding a spare to a live md array with external metadata.	Neil Brown	2008-06-27
\| \| \| \| \| \| \|	i.e. extend the 'md/dev-XXX/slot' attribute so that you can tell a device to fill an vacant slot in an and md array. Signed-off-by: Neil Brown <neilb@suse.de>
*	md: restart recovery cleanly after device failure.	NeilBrown	2008-05-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we get any IO error during a recovery (rebuilding a spare), we abort the recovery and restart it. For RAID6 (and multi-drive RAID1) it may not be best to restart at the beginning: when multiple failures can be tolerated, the recovery may be able to continue and re-doing all that has already been done doesn't make sense. We already have the infrastructure to record where a recovery is up to and restart from there, but it is not being used properly. This is because: - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR, which causes the recovery not be be checkpointed. - We remove spares and then re-added them which loses important state information. The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't needed. If there is an error, the relevant drive will be marked as Faulty, and that is enough to ensure correct handling of the error. So we first remove MD_RECOVERY_ERR, changing some of the uses of it to MD_RECOVERY_INTR. Then we cause the attempt to remove a non-faulty device from an array to fail (unless recovery is impossible as the array is too degraded). Then when remove_and_add_spares attempts to remove the devices on which recovery can continue, it will fail, they will remain in place, and recovery will continue on them as desired. Issue: If we are halfway through rebuilding a spare and another drive fails, and a new spare is immediately available, do we want to: 1/ complete the current rebuild, then go back and rebuild the new spare or 2/ restart the rebuild from the start and rebuild both devices in parallel. Both options can be argued for. The code currently takes option 2 as a/ this requires least code change b/ this results in a minimally-degraded array in minimal time. Cc: "Eivind Sarto" <ivan@kasenna.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: raid1: Fix restoration of bio between failed read and write.	NeilBrown	2008-05-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When performing a "recovery" or "check" pass on a RAID1 array, we read from each device and possible, if there is a difference or a read error, write back to some devices. We use the same 'bio' for both read and write, resetting various fields between the two operations. We forgot to reset bv_offset and bv_len however. These are often left unchanged, but in the case where there is an IO error one or two sectors into a page, they are changed. This results in correctable errors not being corrected properly. It does not result in any data corruption. Cc: "Fairbanks, David" <David.Fairbanks@stratus.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: fix possible oops when removing a bitmap from an active array	NeilBrown	2008-05-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is possible to add a write-intent bitmap to an active array, or remove the bitmap that is there. When we do with the 'quiesce' the array, which causes make_request to block in "wait_barrier()". However we are sampling the value of "mddev->bitmap" before the wait_barrier call, and using it afterwards. This can result in using a bitmap structure that has been freed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Remove blkdev warning triggered by using md	Neil Brown	2008-05-14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As setting and clearing queue flags now requires that we hold a spinlock on the queue, and as blk_queue_stack_limits is called without that lock, get the lock inside blk_queue_stack_limits. For blk_queue_stack_limits to be able to find the right lock, each md personality needs to set q->queue_lock to point to the appropriate lock. Those personalities which didn't previously use a spin_lock, us q->__queue_lock. So always initialise that lock when allocated. With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no longer cause warnings as it will be clear that the proper lock is held. Thanks to Dan Williams for review and fixing the silly bugs. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alistair John Strachan <alistair@devzero.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Jacek Luczak <difrost.kernel@gmail.com> Cc: Prakash Punnoor <prakash@punnoor.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: support blocking writes to an array on device failure	Dan Williams	2008-04-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allows a userspace metadata handler to take action upon detecting a device failure. Based on an original patch by Neil Brown. Changes: -added blocked_wait waitqueue to rdev -don't qualify Blocked with Faulty always let userspace block writes -added md_wait_for_blocked_rdev to wait for the block device to be clear, if userspace misses the notification another one is sent every 5 seconds -set MD_RECOVERY_NEEDED after clearing "blocked" -kill DoBlock flag, just test mddev->external Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	raid: remove leading TAB on printk messages	Nick Andrew	2008-04-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MD drivers use one printk() call to print 2 log messages and the second line may be prefixed by a TAB character. It may also output a trailing space before newline. klogd (I think) turns the TAB character into the 2 characters '^I' when logging to a file. This looks ugly. Instead of a leading TAB to indicate continuation, prefix both output lines with 'raid:' or similar. Also remove any trailing space in the vicinity of the affected code and consistently end the sentences with a period. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: fix possible raid1/raid10 deadlock on read error during resync	NeilBrown	2008-03-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for another possible deadlock in raid1/raid10 error handing. If a read request returns an error while a resync is happening and a resync request is pending, the attempt to fix the error will block until the resync progresses, and the resync will block until the read request completes. Thus a deadlock. This patch fixes the problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: fix deadlock in md/raid1 and md/raid10 when handling a read error	NeilBrown	2008-03-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When handling a read error, we freeze the array to stop any other IO while attempting to over-write with correct data. This is done in the raid1d(raid10d) thread and must wait for all submitted IO to complete (except for requests that failed and are sitting in the retry queue - these are counted in ->nr_queue and will stay there during a freeze). However write requests need attention from raid1d as bitmap updates might be required. This can cause a deadlock as raid1 is waiting for requests to finish that themselves need attention from raid1d. So we create a new function 'flush_pending_writes' to give that attention, and call it in freeze_array to be sure that we aren't waiting on raid1d. Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: change ITERATE_RDEV to rdev_for_each	NeilBrown	2008-02-06
\| \| \| \| \| \| \| \| \|	As this is more in line with common practice in the kernel. Also swap the args around to be more like list_for_each. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: allow a maximum extent to be set for resyncing	NeilBrown	2008-02-06
\| \| \| \| \| \| \| \| \| \| \| \| \|	This allows userspace to control resync/reshape progress and synchronise it with other activities, such as shared access in a SAN, or backing up critical sections during a tricky reshape. Writing a number of sectors (which must be a multiple of the chunk size if such is meaningful) causes a resync to pause when it gets to that point. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: Update md bitmap during resync.	NeilBrown	2008-02-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently an md array with a write-intent bitmap does not updated that bitmap to reflect successful partial resync. Rather the entire bitmap is updated when the resync completes. This is because there is no guarentee that resync requests will complete in order, and tracking each request individually is unnecessarily burdensome. However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not notciably affect resync performance. [snitzer@gmail.com: export bitmap_cond_end_sync] Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Mike Snitzer" <snitzer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Add UNPLUG traces to all appropriate places	Alan D. Brunelle	2007-11-09
\| \| \| \| \| \| \| \|	Added blk_unplug interface, allowing all invocations of unplugs to result in a generated blktrace UNPLUG. Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Convert files to UTF-8 and some cleanups	Jan Engelhardt	2007-10-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Convert files to UTF-8. * Also correct some people's names (one example is Eißfeldt, which was found in a source file. Given that the author used an ß at all in a source file indicates that the real name has in fact a 'ß' and not an 'ss', which is commonly used as a substitute for 'ß' when limited to 7bit.) * Correct town names (Goettingen -> Göttingen) * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313) Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
*	md: make sure read errors are auto-corrected during a 'check' resync in raid1	NeilBrown	2007-10-17
\| \| \| \| \| \| \| \| \| \| \| \| \|	Whenever a read error is found, we should attempt to overwrite with correct data to 'fix' it. However when do a 'check' pass (which compares data blocks that are successfully read, but doesn't normally overwrite) we don't do that. We should. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	block: convert blkdev_issue_flush() to use empty barriers	Jens Axboe	2007-10-16
\| \| \| \| \| \| \|	Then we can get rid of ->issue_flush_fn() and all the driver private implementations of that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Drop 'size' argument from bio_endio and bi_end_io	NeilBrown	2007-10-10
\| \| \| \| \| \| \| \| \| \| \| \| \|	As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	md: correctly update sysfs when a raid1 is reshaped	NeilBrown	2007-08-22
\| \| \| \| \| \| \| \| \| \| \|	When a raid1 array is reshaped (number of drives changed), the list of devices is compacted, so that slots for missing devices are filled with working devices from later slots. This requires the "rd%d" symlinks in sysfs to be updated. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: make sure a re-add after a restart honours bitmap when resyncing	NeilBrown	2007-08-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad. If an array has a write-intent bitmap, and you remove a drive, then readd it, only the changed parts should be resynced. However after the above commit, this only works if the array has not been shut down and restarted. This is because it sets 'fullsync' at little more often than it should. This patch is more careful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[BLOCK] Get rid of request_queue_t typedef	Jens Axboe	2007-07-24
\| \| \| \| \| \| \| \| \|	Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweet of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	md: change bitmap_unplug and others to void functions	NeilBrown	2007-07-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bitmap_unplug only ever returns 0, so it may as well be void. Two callers try to print a message if it returns non-zero, but that message is already printed by bitmap_file_kick. write_page returns an error which is not consistently checked. It always causes BITMAP_WRITE_ERROR to be set on an error, and that can more conveniently be checked. When the return of write_page is checked, an error causes bitmap_file_kick to be called - so move that call into write_page - and protect against recursive calls into bitmap_file_kick. bitmap_update_sb returns an error that is never checked. So make these 'void' and be consistent about checking the bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	md: fix bug in error handling during raid1 repair	Mike Accetta	2007-06-16
\| \| \| \| \| \| \| \| \| \| \| \|	If raid1/repair (which reads all block and fixes any differences it finds) hits a read error, it doesn't reset the bio for writing before writing correct data back, so the read error isn't fixed, and the device probably gets a zero-length write which it might complain about. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>