aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* Merge tag 'md/3.19' of git://neil.brown.name/mdLinus Torvalds2014-12-14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from Neil Brown: "Three fixes for md. I did have a largish set of locking changes queued, but late testing showed they weren't quite as stable as I thought and while I fixed what I found, I decided it safer to delay them a release ... particularly as I'll be AFK for a few weeks. So expect a larger batch next time :-)" * tag 'md/3.19' of git://neil.brown.name/md: md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. md: fix semicolon.cocci warnings md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
| * md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.NeilBrown2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent change to md started the ->sync_thread from a asynchronously from a work_queue rather than synchronously. This means that there can be a small window between the time when MD_RECOVERY_RUNNING is set and when ->sync_thread is set. So code that checks ->sync_thread might now conclude that the thread has not been started and (because a lock is held) will not be started. That is no longer the case. Most of those places are best fixed by testing MD_RECOVERY_RUNNING as well. To make this completely reliable, we wake_up(&resync_wait) after clearing that flag as well as after clearing ->sync_thread. Other places are better served by flushing the relevant workqueue to ensure that that if the sync thread was starting, it has now started. This is particularly best if we are about to stop the sync thread. Fixes: ac05f256691fe427a3e84c19261adb0b67dd73c0 Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix semicolon.cocci warningskbuild test robot2014-12-03
| | | | | | | | | | | | | | | | | | | | | | drivers/md/md.c:7175:43-44: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.NeilBrown2014-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is critical that fetch_block() and handle_stripe_dirtying() are consistent in their analysis of what needs to be loaded. Otherwise raid5 can wait forever for a block that won't be loaded. Currently when writing to a RAID5 that is resyncing, to a location beyond the resync offset, handle_stripe_dirtying chooses a reconstruct-write cycle, but fetch_block() assumes a read-modify-write, and a lockup can happen. So treat that case just like RAID6, just as we do in handle_stripe_dirtying. RAID6 always does reconstruct-write. This bug was introduced when the behaviour of handle_stripe_dirtying was changed in 3.7, so the patch is suitable for any kernel since, though it will need careful merging for some versions. Cc: stable@vger.kernel.org (v3.7+) Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f Reported-by: Henry Cai <henryplusplus@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2014-12-14
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes (mainly Andy's TLS fixes), plus a cleanup" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tls: Disallow unusual TLS segments x86/tls: Validate TLS entries to protect espfix MAINTAINERS: Add me as x86 VDSO submaintainer x86/asm: Unify segment selector defines x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c file directly x86_64, switch_to(): Load TLS descriptors before switching DS and ES x86/mm: Use min() instead of min_t() in the e820 printout code x86/mm: Fix zone ranges boot printout x86/doc: Update documentation after file shuffling
| * | x86/tls: Disallow unusual TLS segmentsAndy Lutomirski2014-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users have no business installing custom code segments into the GDT, and segments that are not present but are otherwise valid are a historical source of interesting attacks. For completeness, block attempts to set the L bit. (Prior to this patch, the L bit would have been silently dropped.) This is an ABI break. I've checked glibc, musl, and Wine, and none of them look like they'll have any trouble. Note to stable maintainers: this is a hardening patch that fixes no known bugs. Given the possibility of ABI issues, this probably shouldn't be backported quickly. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org # optional Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: security@kernel.org <security@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/tls: Validate TLS entries to protect espfixAndy Lutomirski2014-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Installing a 16-bit RW data segment into the GDT defeats espfix. AFAICT this will not affect glibc, Wine, or dosemu at all. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: security@kernel.org <security@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | MAINTAINERS: Add me as x86 VDSO submaintainerAndy Lutomirski2014-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | Here goes... :) Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1042001e502f8e0deb0edfeeac209b68378650cf.1418430292.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/asm: Unify segment selector definesBorislav Petkov2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Those are identical on 32- and 64-bit, unify them. No functional change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418127959-29902-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c ↵Borislav Petkov2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | file directly Sometimes it is helpful to build a kernel compilation unit directly, i.e.: make .../<filename>.i in order to look at compiler output. Since asm-offsets_{32,64}.c are included by asm-offsets.c and building them directly doesn't evaluate the macros used (thus making the preprocessor output not very useful), error out when an attempt is made to build them. Issue a hint for the user to build asm-offsets.c instead. Suggested-by: Michael Matz <matz@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Michal Marek <mmarek@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418139917-12722-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86_64, switch_to(): Load TLS descriptors before switching DS and ESAndy Lutomirski2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise, if buggy user code points DS or ES into the TLS array, they would be corrupted after a context switch. This also significantly improves the comments and documents some gotchas in the code. Before this patch, the both tests below failed. With this patch, the es test passes, although the gsbase test still fails. ----- begin es test ----- /* * Copyright (c) 2014 Andy Lutomirski * GPL v2 */ static unsigned short GDT3(int idx) { return (idx << 3) | 3; } static int create_tls(int idx, unsigned int base) { struct user_desc desc = { .entry_number = idx, .base_addr = base, .limit = 0xfffff, .seg_32bit = 1, .contents = 0, /* Data, grow-up */ .read_exec_only = 0, .limit_in_pages = 1, .seg_not_present = 0, .useable = 0, }; if (syscall(SYS_set_thread_area, &desc) != 0) err(1, "set_thread_area"); return desc.entry_number; } int main() { int idx = create_tls(-1, 0); printf("Allocated GDT index %d\n", idx); unsigned short orig_es; asm volatile ("mov %%es,%0" : "=rm" (orig_es)); int errors = 0; int total = 1000; for (int i = 0; i < total; i++) { asm volatile ("mov %0,%%es" : : "rm" (GDT3(idx))); usleep(100); unsigned short es; asm volatile ("mov %%es,%0" : "=rm" (es)); asm volatile ("mov %0,%%es" : : "rm" (orig_es)); if (es != GDT3(idx)) { if (errors == 0) printf("[FAIL]\tES changed from 0x%hx to 0x%hx\n", GDT3(idx), es); errors++; } } if (errors) { printf("[FAIL]\tES was corrupted %d/%d times\n", errors, total); return 1; } else { printf("[OK]\tES was preserved\n"); return 0; } } ----- end es test ----- ----- begin gsbase test ----- /* * gsbase.c, a gsbase test * Copyright (c) 2014 Andy Lutomirski * GPL v2 */ static unsigned char *testptr, *testptr2; static unsigned char read_gs_testvals(void) { unsigned char ret; asm volatile ("movb %%gs:%1, %0" : "=r" (ret) : "m" (*testptr)); return ret; } int main() { int errors = 0; testptr = mmap((void *)0x200000000UL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0); if (testptr == MAP_FAILED) err(1, "mmap"); testptr2 = mmap((void *)0x300000000UL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0); if (testptr2 == MAP_FAILED) err(1, "mmap"); *testptr = 0; *testptr2 = 1; if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)testptr2 - (unsigned long)testptr) != 0) err(1, "ARCH_SET_GS"); usleep(100); if (read_gs_testvals() == 1) { printf("[OK]\tARCH_SET_GS worked\n"); } else { printf("[FAIL]\tARCH_SET_GS failed\n"); errors++; } asm volatile ("mov %0,%%gs" : : "r" (0)); if (read_gs_testvals() == 0) { printf("[OK]\tWriting 0 to gs worked\n"); } else { printf("[FAIL]\tWriting 0 to gs failed\n"); errors++; } usleep(100); if (read_gs_testvals() == 0) { printf("[OK]\tgsbase is still zero\n"); } else { printf("[FAIL]\tgsbase was corrupted\n"); errors++; } return errors == 0 ? 0 : 1; } ----- end gsbase test ----- Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: <stable@vger.kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/509d27c9fec78217691c3dad91cec87e1006b34a.1418075657.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/mm: Use min() instead of min_t() in the e820 printout codeXishi Qiu2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The type of "MAX_DMA_PFN" and "xXx_pfn" are both unsigned long now, so use min() instead of min_t(). Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Linux MM <linux-mm@kvack.org> Cc: <dave@sr71.net> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/5487AB3F.7050807@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/mm: Fix zone ranges boot printoutXishi Qiu2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the usual physical memory layout boot printout: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal [mem 0x100000000-0xc3fffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0xbf78ffff] [ 0.000000] node 0: [mem 0x100000000-0x63fffffff] [ 0.000000] node 1: [mem 0x640000000-0xc3fffffff] ... This is the log when we set "mem=2G" on the boot cmdline: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] // should be 0x7fffffff, right? [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] ... This patch fixes the printout, the following log shows the right ranges: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0x7fffffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] ... Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Linux MM <linux-mm@kvack.org> Cc: <dave@sr71.net> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/doc: Update documentation after file shufflingLuis R. Rodriguez2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While at it, also refer to the 32 bit entry file. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: linux-doc@vger.kernel.org Cc: bpoirier@suse.de Link: http://lkml.kernel.org/r/1418165684-6226-1-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2014-12-13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ...
| * | | NVMe: fix race condition in nvme_submit_sync_cmd()Jens Axboe2014-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a race between the schedule timing out and the command completing, we could have the task issuing the command exit nvme_submit_sync_cmd() while the irq is running sync_completion(). If that happens, we could be corrupting memory, since the stack that held 'cmdinfo' is no longer valid. Fix this by always calling nvme_abort_cmd_info(). Once that call completes, we know that we have either run sync_completion() if the completion came in, or that we will never run it since we now have special_completion() as the command callback handler. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | NVMe: fix retry/error logic in nvme_queue_rq()Jens Axboe2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic around retrying and erroring IO in nvme_queue_rq() is broken in a few ways: - If we fail allocating dma memory for a discard, we return retry. We have the 'iod' stored in ->special, but we free the 'iod'. - For a normal request, if we fail dma mapping of setting up prps, we have the same iod situation. Additionally, we haven't set the callback for the request yet, so we also potentially leak IOMMU resources. Get rid of the ->special 'iod' store. The retry is uncommon enough that it's not worth optimizing for or holding on to resources to attempt to speed it up. Additionally, it's usually best practice to free any request related resources when doing retries. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | NVMe: Fix FS mount issue (hot-remove followed by hot-add)Indraneel M2014-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After Hot-remove of a device with a mounted partition, when the device is hot-added again, the new node reappears as nvme0n1. Mounting this new node fails with the error: mount: mount /dev/nvme0n1p1 on /mnt failed: File exists. The old nodes's FS entries still exist and the kernel can't re-create procfs and sysfs entries for the new node with the same name. The patch fixes this issue. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Indraneel M <indraneel.m@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | Merge branch 'for-jens-3.19' of ↵Jens Axboe2014-12-10
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.19/drivers Konrad writes: These are two fixes for Xen blkfront. They harden how it deals with broken backends.
| | * | | xen/blkfront: remove redundant flush_opVitaly Kuznetsov2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | flush_op is unambiguously defined by feature_flush: REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE 0 -> 0 and thus can be removed. This is just a cleanup. The patch was suggested by Boris Ostrovsky. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen/blkfront: improve protection against issuing unsupported REQ_FUAVitaly Kuznetsov2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced in d11e61583 and was factored out into blkif_request_flush_valid() in 0f1ca65ee. However: 1) This check in incomplete. In case we negotiated to feature_flush = REQ_FLUSH and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request will still pass the check. 2) blkif_request_flush_valid() is misnamed. It is bool but returns true when the request is invalid. 3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that -EOPNOTSUPP is more appropriate here. Fix all of the above issues. This patch is based on the original patch by Laszlo Ersek and a comment by Jeff Moyer. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | NVMe: fix error return checking from blk_mq_alloc_request()Jens Axboe2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We return an error pointer or the request, not NULL. Half the call paths got it right, the others didn't. Fix those up. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | NVMe: fix freeing of wrong request in abort pathSam Bradshaw2014-12-10
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | We allocate 'abort_req', but free 'req' in case of an error submitting the IO. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | NVMe: Fix command setup on IO retryKeith Busch2014-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On retry, the req->special is pointing to an already setup IOD, but we still need to setup the command context and callback, otherwise you'll see false twice completed errors and leak requests. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | null_blk: boundary check queue_mode and irqmodeMatias Bjorling2014-11-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When either queue_mode or irq_mode parameter is set outside its boundaries, the driver will not complete requests. This stalls driver initialization when partitions are probed. Fix by setting out of bound values to the parameters default. Signed-off-by: Matias Bjørling <m@bjorling.me> Updated by me to have the parse+check in just one function. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block/rsxx: use generic io stats accounting functions to simplify io stat ↵Gu Zheng2014-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | md: use generic io stats accounting functions to simplify io stat accountingGu Zheng2014-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | drbd: use generic io stats accounting functions to simplify io stat accountingGu Zheng2014-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | md/bcache: use generic io stats accounting functions to simplify io stat ↵Gu Zheng2014-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Kent Overstreet <kmo@datera.io> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | Merge branch 'for-3.19/core' into for-3.19/driversJens Axboe2014-11-24
| |\ \ \
| * | | | NVMe: Update module version major numberKeith Busch2014-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's already near impossible to tell what bits someone is running based on a 'modinfo nvme', and I don't want to try guessing if someone is running blk-mq or bio-based. Let's make it obvious with the module version that the blk-mq conversion is a major change. Future bio-based versions can increment to 0.10 in a fork if revisions occur. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | NVMe: fail pci initialization if the device doesn't have any BARsJens Axboe2014-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PCI init of NVMe doesn't check for valid bars before proceeding to map and use BAR 0. If the device is hosed (or firmware is), then we should catch this case and give up early. This fixes a: [ 1662.035778] WARNING: CPU: 0 PID: 4 at arch/x86/mm/ioremap.c:63 __ioremap_check_ram+0xa7/0xc0() and later badness on such a device. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | NVMe: add ->exit_hctx() hookJens Axboe2014-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we do teardown and setup of the queue and block related parts of the driver, then we should clear nvmeq->hctx once we kill the hardware queue. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | NVMe: make setup work for devices that don't do INTxJens Axboe2014-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The setup/probe part currently relies on INTx being there and working, that's not always the case. For devices that don't advertise INTx, enable a single MSIx vector early on and disable it again before we ask for our full range of queue vecs. Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | Merge branch 'master' into for-3.19/driversJens Axboe2014-11-18
| |\ \ \ \
| * | | | | NVMe: enable IO stats by defaultJens Axboe2014-11-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before the blk-mq conversion they were on by default, we should not change behavior there. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | NVMe: nvme_submit_async_admin_req() must use atomic rq allocationJens Axboe2014-11-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are called for async event notification issues, and the nvmeq lock is already held. If we fail the request allocation, we'll just retry next time. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | NVMe: replace blk_put_request() with blk_mq_free_request()Jens Axboe2014-11-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No point in using blk_put_request(), since we know we are blk-mq. This only makes sense in core code where we could be dealing with either legacy or blk-mq drivers. Additionally, use blk_mq_free_hctx_request() for the request completion fast path, where we already know the mapping from request to hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | Merge branch 'for-3.19/core' into for-3.19/driversJens Axboe2014-11-17
| |\ \ \ \ \
| * | | | | | drbd: Remove an useless copy of kernel_setsockopt()Philipp Reisner2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Old backward-compat cruft Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: Fix state change in case of connection timeoutPhilipp Reisner2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A connection timeout affects all volumes of a resource! Under the following conditions: A resource with multiple volumes AND ko-count >=1 AND a write request triggers the timeout (ko-count * timeout) DRBD's internal state gets confused. That in turn may lead to very miss leading follow up failures. E.g. "BUG: scheduling while atomic" CC: stable@kernel.org # v3.17 Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: merge_bvec_fn: properly remap bvm->bi_bdevLars Ellenberg2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was not noticed for many years. Affects operation if md raid is used a backing device for DRBD. CC: stable@kernel.org # v3.2+ Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: fix resync throttling initializationLars Ellenberg2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If for some reason DRBD resync was the only activity on a backend device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is still initialization time, and keep throttling the resync. This patch explicitly initializes ->rs_last_events to the current backend event counters, and drops the rs_last_events == 0 from the throttle condition. Reported-by: Mikhail Sugakov <msugakov@amazon.de> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: fix race between role change and handshakePhilipp Reisner2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Symptoms: If DRBD was "cleanly shut down" (all in sync, both Secondary before disconnect, identical data generation uuids), and then one side was promoted *during* the next connection handshake, the role change could confuse the handshake. The Primary would get stuck in WFBitmapS, the Secondary would log unexpected cstate (Connected) in receive_bitmap and get stuck in WFBitmapT. Fix: The test in is_valid_soft_transition wrong. It works because the not allowed actions (promote/attach) do not touch the cstate. The previous condition failed to demand a cstate change in one clause. In order to avoid deadlocks give up the state_mutex while waiting for the transient state to go away. Conflicts: drbd/drbd_state.c drbd/drbd_state.h drbd/drbd_wrappers.h Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: Only use drbd_msg_put_info() in drbd_nl.cAndreas Gruenbacher2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid generic netlink calls in other parts of the code base. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | drbd: Minor cleanupsAndreas Gruenbacher2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | . Update comments . drbd_set_{in,out_of}_sync(): Remove unused parameters . Move common code into adm_del_resource() . Redefine ERR_MINOR_EXISTS -> ERR_MINOR_OR_VOLUME_EXISTS Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | NVMe: __nvme_submit_admin_cmd() can be statickbuild test robot2014-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | drivers/block/nvme-core.c:865:5: sparse: symbol '__nvme_submit_admin_cmd' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | NVMe: blk_mq_alloc_request() returns error pointersDan Carpenter2014-11-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently converted this to blk_mq but the error checks have to be updated to check for IS_ERR() instead of NULL. Fixes: a4aea5623d4a ('NVMe: Convert to blk-mq') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | NVMe: Convert to blk-mqMatias Bjørling2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This converts the NVMe driver to a blk-mq request-based driver. The NVMe driver is currently bio-based and implements queue logic within itself. By using blk-mq, a lot of these responsibilities can be moved and simplified. The patch is divided into the following blocks: * Per-command data and cmdid have been moved into the struct request field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id maintenance are now handled by blk-mq through the rq->tag field. * The logic for splitting bio's has been moved into the blk-mq layer. The driver instead notifies the block layer about limited gap support in SG lists. * blk-mq handles timeouts and is reimplemented within nvme_timeout(). This both includes abort handling and command cancelation. * Assignment of nvme queues to CPUs are replaced with the blk-mq version. The current blk-mq strategy is to assign the number of mapped queues and CPUs to provide synergy, while the nvme driver assign as many nvme hw queues as possible. This can be implemented in blk-mq if needed. * NVMe queues are merged with the tags structure of blk-mq. * blk-mq takes care of setup/teardown of nvme queues and guards invalid accesses. Therefore, RCU-usage for nvme queues can be removed. * IO tracing and accounting are handled by blk-mq and therefore removed. * Queue suspension logic is replaced with the logic from the block layer. Contributions in this patch from: Sam Bradshaw <sbradshaw@micron.com> Jens Axboe <axboe@fb.com> Keith Busch <keith.busch@intel.com> Robert Nelson <rlnelson@google.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Jens Axboe <axboe@fb.com> Updated for new ->queue_rq() prototype. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | NVMe: Do not over allocate for discard requestsKeith Busch2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Discard requests are often for very large ranges. The discard size is not representative of the data transfer size so we don't need to allocate for such a large prp list. This patch requests allocating only enough for the memory needed for the data transfer and saves a little over 8k of memory per max discard request. Signed-off-by: Keith Busch <keith.busch@intel.com> Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>