aboutsummaryrefslogtreecommitdiffstats
path: root/block/blk-core.c
Commit message (Collapse)AuthorAge
* Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-blockLinus Torvalds2014-12-13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...
| * blk-mq: Fix a use-after-freeBart Van Assche2014-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq users are allowed to free the memory request_queue.tag_set points at after blk_cleanup_queue() has finished but before blk_release_queue() has started. This can happen e.g. in the SCSI core. The SCSI core namely embeds the tag_set structure in a SCSI host structure. The SCSI host structure is freed by scsi_host_dev_release(). This function is called after blk_cleanup_queue() finished but can be called before blk_release_queue(). This means that it is not safe to access request_queue.tag_set from inside blk_release_queue(). Hence remove the blk_sync_queue() call from blk_release_queue(). This call is not necessary - outstanding requests must have finished before blk_release_queue() is called. Additionally, move the blk_mq_free_queue() call from blk_release_queue() to blk_cleanup_queue() to avoid that struct request_queue.tag_set gets accessed after it has been freed. This patch avoids that the following kernel oops can be triggered when deleting a SCSI host for which scsi-mq was enabled: Call Trace: [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff81245895>] blk_put_queue+0x15/0x20 [<ffffffff8125c409>] disk_release+0x99/0xd0 [<ffffffff8133d056>] device_release+0x36/0xb0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff8125a78a>] put_disk+0x1a/0x20 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0 [<ffffffff811d56a0>] blkdev_put+0x50/0x160 [<ffffffff81199eb4>] kill_block_super+0x44/0x70 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20 [<ffffffff8107252c>] task_work_run+0xac/0xe0 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0 [<ffffffff814d2c58>] int_signal+0x12/0x17 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'pm+acpi-3.19-rc1' of ↵Linus Torvalds2014-12-11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava)" * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM tools: cpupower: fix return checks for sysfs_get_idlestate_count() drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM leds: leds-gpio: Fix multiple instances registration without 'label' property iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros ...
| * | block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PMRafael J. Wysocki2014-12-03
| |/ | | | | | | | | | | | | | | | | | | | | | | | | After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks depending on CONFIG_PM_RUNTIME may now be changed to depend on CONFIG_PM. Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core. Reviewed-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* / scsi: add new scsi-command flag for tagged commandsChristoph Hellwig2014-11-12
|/ | | | | | | | | | | | | | | | | | | | | | | Currently scsi piggy backs on the block layer to define the concept of a tagged command. But we want to be able to have block-level host-wide tags assigned even for untagged commands like the initial INQUIRY, so add a new SCSI-level flag for commands that are tagged at the scsi level, so that even commands without that set can have tags assigned to them. Note that this alredy is the case for the blk-mq code path, and this just lets the old path catch up with it. We also set this flag based upon sdev->simple_tags instead of the block queue flag, so that it is entirely independent of the block layer tagging, and thus always correct even if a driver doesn't use block level tagging yet. Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and removing it forces them to look for the proper replacement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
* Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-blockLinus Torvalds2014-10-18
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...
| * block: include func name in __get_request printsRobert Elliott2014-10-13
| | | | | | | | | | | | | | | | | | | | | | In __get_request calls to printk_ratelimited, include the function name so the callbacks suppressed message matches the messages that are printed, and add "dev" before the device name so it matches other block layer messages. Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: make blk_update_request print prefix match ratelimited prefixRobert Elliott2014-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In blk_update_request, change the printk_ratelimited prefix from end_request to blk_update_request so it matches the name printed if rate limiting occurs. Old: [10234.933106] blk_update_request: 174 callbacks suppressed [10234.934940] end_request: critical target error, dev sdr, sector 16 [10234.949788] end_request: critical target error, dev sdr, sector 16 New: [16863.445173] blk_update_request: 398 callbacks suppressed [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176 [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888 [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456 Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: use bio_clone_fast() in blk_rq_prep_clone()Junichi Nomura2014-10-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Request cloning clones bios in the request to track the completion of each bio. For that purpose, we can use bio_clone_fast() instead of bio_clone() to avoid unnecessary allocation and copy of bvecs. This patch reduces memory footprint of request-based device-mapper (about 1-4KB for each request) and is a preparation for further reduction of memory usage by removing unused bvec mempool. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: misplaced rq_complete tracepointHannes Reinecke2014-10-01
| | | | | | | | | | | | | | | | | | | | The rq_complete tracepoint was never issued for empty requests, causing the resulting blktrace information to never show any completion for those request. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: support per-distpatch_queue flush machineryMing Lei2014-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch supports to run one single flush machinery for each blk-mq dispatch queue, so that: - current init_request and exit_request callbacks can cover flush request too, then the buggy copying way of initializing flush request's pdu can be fixed - flushing performance gets improved in case of multi hw-queue In fio sync write test over virtio-blk(4 hw queues, ioengine=sync, iodepth=64, numjobs=4, bs=4K), it is observed that througput gets increased a lot over my test environment: - throughput: +70% in case of virtio-blk over null_blk - throughput: +30% in case of virtio-blk over SSD image The multi virtqueue feature isn't merged to QEMU yet, and patches for the feature can be found in below tree: git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4 And simply passing 'num_queues=4 vectors=5' should be enough to enable multi queue(quad queue) feature for QEMU virtio-blk. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: introduce 'blk_mq_ctx' parameter to blk_get_flush_queueMing Lei2014-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(), so that this function can find the corresponding blk_flush_queue bound with current mq context since the flush queue will become per hw-queue. For legacy queue, the parameter can be simply 'NULL'. For multiqueue case, the parameter should be set as the context from which the related request is originated. With this context info, the hw queue and related flush queue can be found easily. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: remove blk_init_flush() and its pairMing Lei2014-09-25
| | | | | | | | | | | | | | | | Now mission of the two helpers is over, and just call blk_alloc_flush_queue() and blk_free_flush_queue() directly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: introduce blk_flush_queue to drive flush machineryMing Lei2014-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces 'struct blk_flush_queue' and puts all flush machinery related fields into this structure, so that - flush implementation details aren't exposed to driver - it is easy to convert to per dispatch-queue flush machinery This patch is basically a mechanical replacement. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: move flush initialization to blk_flush_initMing Lei2014-09-25
| | | | | | | | | | | | | | | | | | These fields are always used with the flush request, so initialize them together. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: introduce blk_init_flush and its pairMing Lei2014-09-25
| | | | | | | | | | | | | | | | | | | | | | | | These two temporary functions are introduced for holding flush initialization and de-initialization, so that we can introduce 'flush queue' easier in the following patch. And once 'flush queue' and its allocation/free functions are ready, they will be removed for sake of code readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge branch 'for-linus' into for-3.18/coreJens Axboe2014-09-11
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c
| * | block, bdi: an active gendisk always has a request_queue associated with itTejun Heo2014-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block,scsi: fixup blk_get_request dead queue scenariosJoe Lawrence2014-08-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The blk_get_request function may fail in low-memory conditions or during device removal (even if __GFP_WAIT is set). To distinguish between these errors, modify the blk_get_request call stack to return the appropriate ERR_PTR. Verify that all callers check the return status and consider IS_ERR instead of a simple NULL pointer check. For consistency, make a similar change to the blk_mq_alloc_request leg of blk_get_request. It may fail if the queue is dead, or the caller was unwilling to wait. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2014-10-07
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...
| * | Documentation: Docbook: Fix generated DocBook/kernel-api.xmlMasanari Iida2014-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | scsi-mq: fix requests that use a separate CDB bufferTony Battersby2014-08-22
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes code such as the following with scsi-mq enabled: rq = blk_get_request(...); blk_rq_set_block_pc(rq); rq->cmd = my_cmd_buffer; /* separate CDB buffer */ blk_execute_rq_nowait(...); Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for large CDBs only). Without this patch, scsi_mq_prep_fn() will set rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: decouble blk-mq freezing from generic bypassingTejun Heo2014-07-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq freezing is entangled with generic bypassing which bypasses blkcg and io scheduler and lets IO requests fall through the block layer to the drivers in FIFO order. This allows forward progress on IOs with the advanced features disabled so that those features can be configured or altered without worrying about stalling IO which may lead to deadlock through memory allocation. However, generic bypassing doesn't quite fit blk-mq. blk-mq currently doesn't make use of blkcg or ioscheds and it maps bypssing to freezing, which blocks request processing and drains all the in-flight ones. This causes problems as bypassing assumes that request processing is online. blk-mq works around this by conditionally allowing request processing for the problem case - during queue initialization. Another weirdity is that except for during queue cleanup, bypassing started on the generic side prevents blk-mq from processing new requests but doesn't drain the in-flight ones. This shouldn't break anything but again highlights that something isn't quite right here. The root cause is conflating blk-mq freezing and generic bypassing which are two different mechanisms. The only intersecting purpose that they serve is during queue cleanup. Let's properly separate blk-mq freezing from generic bypassing and simply use it where necessary. * request_queue->mq_freeze_depth is added and blk_mq_[un]freeze_queue() now operate on this counter instead of ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added but the counter is tested directly. This will be further updated by later changes. * blk_mq_drain_queue() is dropped and "__" prefix is dropped from blk_mq_freeze_queue(). Queue cleanup path now calls blk_mq_freeze_queue() directly. * blk_queue_enter()'s fast path condition is simplified to simply check @q->mq_freeze_depth. Previously, the condition was !blk_queue_dying(q) && (!blk_queue_bypass(q) || !blk_queue_init_done(q)) mq_freeze_depth is incremented right after dying is set and blk_queue_init_done() exception isn't necessary as blk-mq doesn't start frozen, which only leaves the blk_queue_bypass() test which can be replaced by @q->mq_freeze_depth test. This change simplifies the code and reduces confusion in the area. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block, blk-mq: draining can't be skipped even if bypass_depth was non-zeroTejun Heo2014-07-01
|/ | | | | | | | | | | | | | | | | | | | | | Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue() skip queue draining if bypass_depth was already above zero. The assumption is that the one which bumped the bypass_depth should have performed draining already; however, there's nothing which prevents a new instance of bypassing/freezing from starting before the previous one finishes draining. The current code may allow the later bypassing/freezing instances to complete while there still are in-flight requests which haven't finished draining. Fix it by draining regardless of bypass_depth. We still skip draining from blk_queue_bypass_start() while the queue is initializing to avoid introducing excessive delays during boot. INIT_DONE setting is moved above the initial blk_queue_bypass_end() so that bypassing attempts can't slip inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2014-06-19
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains: - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing. - Some cleanups from Christoph. - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me. - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me. - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias. - null_blk module parameter description fix from Mike" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: bitmap tag: fix races in bt_get() function blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt blk-mq: bitmap tag: fix races on shared ::wake_index fields block: blk_max_size_offset() should check ->max_sectors null_blk: fix softirq completions for queue_mode == 1 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue blk-mq: properly drain stopped queues block: remove WQ_POWER_EFFICIENT from kblockd null_blk: fix name and description of 'queue_mode' module parameter block: remove elv_abort_queue and blk_abort_flushes
| * block: remove WQ_POWER_EFFICIENT from kblockdMatias Bjørling2014-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq issues async requests through kblockd. To issue a work request on a specific CPU, kblockd_schedule_delayed_work_on is used. However, the specific CPU choice may not be honored, if the power_efficient option for workqueues is set. blk-mq requires that we have strict per-cpu scheduling, so it wont work properly if kblockd is marked POWER_EFFICIENT and power_efficient is set. Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior. This essentially reverts part of commit 695588f9454b, which added the WQ_POWER_EFFICIENT marker to kblockd. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds2014-06-15
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...
| * NVMe: Add tracepointsKeith Busch2014-05-05
| | | | | | | | | | | | | | | | Adding tracepoints for bio_complete and block_split into nvme to help with gathering IO info using blktrace and blkparse. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
* | block: add blk_rq_set_block_pc()Jens Axboe2014-06-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: remove 'magic' from struct blk_plugJens Axboe2014-05-29
| | | | | | | | | | | | | | | | I don't think we've ever caught any bugs with this, and there's the list poisoning for the plug lists to catch uninitialized cases. So remove the magic member and save 8 bytes in the struct. Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_requestChristoph Hellwig2014-05-28
| | | | | | | | | | | | | | | | Instead of having two almost identical copies of the same code just let the callers pass in the reserved flag directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: only allocate/free mq_usage_counter in blk-mqMing Lei2014-05-27
| | | | | | | | | | | | | | | | | | The percpu counter is only used for blk-mq, so move its allocation and free inside blk-mq, and don't allocate it for legacy queue device. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: Micro-optimize blk_queue_nomerges() checkRobert Elliott2014-05-20
| | | | | | | | | | | | | | | | | | In blk_mq_make_request(), do the blk_queue_nomerges() check outside the call to blk_attempt_plug_merge() to eliminate function call overhead when nomerges=2 (disabled) Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: allow changing of queue depth through sysfsJens Axboe2014-05-20
| | | | | | | | | | | | | | | | | | | | | | For request_fn based devices, the block layer exports a 'nr_requests' file through sysfs to allow adjusting of queue depth on the fly. Currently this returns -EINVAL for blk-mq, since it's not wired up. Wire this up for blk-mq, so that it now also always dynamic adjustments of the allowed queue depth for any given block device managed by blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: only calculate part_in_flight() onceJens Axboe2014-05-09
| | | | | | | | | | | | | | | | | | We first check if we have inflight IO, then retrieve that same number again. Usually this isn't that costly since the chance of having the data dirtied in between is small, but there's no reason for calling part_in_flight() twice. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: export blk_finish_requestChristoph Hellwig2014-04-16
| | | | | | | | | | | | | | | | This allows to mirror the blk-mq code flow for more a more readable I/O completion handler in SCSI. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: add blk_mq_delay_queueChristoph Hellwig2014-04-16
| | | | | | | | | | | | | | | | | | | | | | | | Add a blk-mq equivalent to blk_delay_queue so that the scsi layer can ask to be kicked again after a delay. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified by me to kill the unnecessary preempt disable/enable in the delayed workqueue handler. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: remove struct request buffer memberJens Axboe2014-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was used in the olden days, back when onions were proper yellow. Basically it mapped to the current buffer to be transferred. With highmem being added more than a decade ago, most drivers map pages out of a bio, and rq->buffer isn't pointing at anything valid. Convert old style drivers to just use bio_data(). For the discard payload use case, just reference the page in the bio. Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'v3.15-rc1' into for-3.16/coreJens Axboe2014-04-15
|\| | | | | | | | | We don't like this, but things have diverged with the blk-mq fixes in 3.15-rc1. So merge it in.
| * block: fix regression with block enabled taggingJens Axboe2014-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Martin reported that his test system would not boot with current git, it oopsed with this: BUG: unable to handle kernel paging request at ffff88046c6c9e80 IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246 Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013 Workqueue: events_unbound async_run_entry_fn task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000 RIP: 0010:[<ffffffff812971e0>] [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP: 0018:ffff880273d03a58 EFLAGS: 00010092 RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6 RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410 R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0 R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e FS: 0000000000000000(0000) GS:ffff880277b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0 Stack: ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000 ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62 ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8 Call Trace: [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510 [<ffffffff81293167>] __blk_run_queue+0x37/0x50 [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130 [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0 [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0 [<ffffffff813bba35>] scsi_execute+0xe5/0x180 [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110 [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod] [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0 [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod] [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod] [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140 [<ffffffff8106a1c5>] process_one_work+0x175/0x410 [<ffffffff8106b373>] worker_thread+0x123/0x400 [<ffffffff8106b250>] ? manage_workers+0x160/0x160 [<ffffffff8107104e>] kthread+0xce/0xf0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89 df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89 58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d RIP [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP <ffff880273d03a58> CR2: ffff88046c6c9e80 Martin bisected and found this to be the problem patch; commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f Author: Jan Kara <jack@suse.cz> Date: Mon Feb 24 16:39:54 2014 +0100 block: Stop abusing rq->csd.list in blk-softirq and the problem was immediately apparent. The patch states that it is safe to reuse queuelist at completion time, since it is no longer used. However, that is not true if a device is using block enabled tagging. If that is the case, then the queuelist is reused to keep track of busy tags. If a device also ended up using softirq completions, we'd reuse ->queuelist for the IPI handling while block tagging was still using it. Boom. Fix this by adding a new ipi_list list head, and share the memory used with the request hash table. The hash table is never used after the request is moved to the dispatch list, which happens long before any potential completion of the request. Add a new request bit for this, so we don't have cases that check rq->hash while it could potentially have been reused for the IPI completion. Reported-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERODuan Jiong2014-04-11
| | | | | | | | | | | | | | | | This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: add kblockd_schedule_delayed_work_on()Jens Axboe2014-04-09
| | | | | | | | | | | | | | | | | | Same function as kblockd_schedule_delayed_work(), but allow the caller to pass in a CPU that the work should be executed on. This just directly extends and maps into the workqueue API, and will be used to make the blk-mq mappings more strict. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: remove 'q' parameter from kblockd_schedule_*_work()Jens Axboe2014-04-09
|/ | | | | | The queue parameter is never used, just get rid of it. Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-02
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual rocket science -- mostly documentation and comment updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: sparse: fix comment doc: fix double words isdn: capi: fix "CAPI_VERSION" comment doc: DocBook: Fix typos in xml and template file Bluetooth: add module name for btwilink driver core: unexport static function create_syslog_header mmc: core: typo fix in printk specifier ARM: spear: clean up editing mistake net-sysfs: fix comment typo 'CONFIG_SYFS' doc: Insert MODULE_ in module-signing macros Documentation: update URL to hfsplus Technote 1150 gpio: update path to documentation ixgbe: Fix format string in ixgbe_fcoe. Kconfig: Remove useless "default N" lines user_namespace.c: Remove duplicated word in comment CREDITS: fix formatting treewide: Fix typo in Documentation/DocBook mm: Fix warning on make htmldocs caused by slab.c ata: ata-samsung_cf: cleanup in header file idr: remove unused prototype of idr_free()
| * Merge branch 'master' into for-nextJiri Kosina2014-02-20
| |\
| * | treewide: Fix typo in Documentation/DocBookMasanari Iida2014-02-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix spelling typo in Documentation/DocBook. It is because .html and .xml files are generated by make htmldocs, I have to fix a typo within the source files. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | Merge branch 'for-3.15/core' of git://git.kernel.dk/linux-blockLinus Torvalds2014-04-01
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer updates from Jens Axboe: "This is the pull request for the core block IO bits for the 3.15 kernel. It's a smaller round this time, it contains: - Various little blk-mq fixes and additions from Christoph and myself. - Cleanup of the IPI usage from the block layer, and associated helper code. From Frederic Weisbecker and Jan Kara. - Duplicate code cleanup in bio-integrity from Gu Zheng. This will give you a merge conflict, but that should be easy to resolve. - blk-mq notify spinlock fix for RT from Mike Galbraith. - A blktrace partial accounting bug fix from Roman Pen. - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li" * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits) blk-mq: add REQ_SYNC early rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock blk-mq: support partial I/O completions blk-mq: merge blk_mq_insert_request and blk_mq_run_request blk-mq: remove blk_mq_alloc_rq blk-mq: don't dump CPU -> hw queue map on driver load blk-mq: fix wrong usage of hctx->state vs hctx->flags blk-mq: allow blk_mq_init_commands() to return failure block: remove old blk_iopoll_enabled variable blktrace: fix accounting of partially completed requests smp: Rename __smp_call_function_single() to smp_call_function_single_async() smp: Remove wait argument from __smp_call_function_single() watchdog: Simplify a little the IPI call smp: Move __smp_call_function_single() below its safe version smp: Consolidate the various smp_call_function_single() declensions smp: Teach __smp_call_function_single() to check for offline cpus smp: Remove unused list_head from csd smp: Iterate functions through llist_for_each_entry_safe() block: Stop abusing rq->csd.list in blk-softirq block: Remove useless IPI struct initialization ...
| * | | blktrace: fix accounting of partially completed requestsRoman Pen2014-03-05
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | trace_block_rq_complete does not take into account that request can be partially completed, so we can get the following incorrect output of blkparser: C R 232 + 240 [0] C R 240 + 232 [0] C R 248 + 224 [0] C R 256 + 216 [0] but should be: C R 232 + 8 [0] C R 240 + 8 [0] C R 248 + 8 [0] C R 256 + 8 [0] Also, the whole output summary statistics of completed requests and final throughput will be incorrect. This patch takes into account real completion size of the request and fixes wrong completion accounting. Signed-off-by: Roman Pen <r.peniaev@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> CC: linux-kernel@vger.kernel.org Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* | | block: free q->flush_rq in blk_init_allocated_queue error pathsDave Jones2014-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but neglected to free that allocation on the error paths that follow. Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | block: fix q->flush_rq NULL pointer crash on dm-mpath flushMike Snitzer2014-03-08
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 1874198 ("blk-mq: rework flush sequencing logic") switched ->flush_rq from being an embedded member of the request_queue structure to being dynamically allocated in blk_init_queue_node(). Request-based DM multipath doesn't use blk_init_queue_node(), instead it uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because commit 1874198 placed the dynamic allocation of ->flush_rq in blk_init_queue_node() any flush issued to a dm-mpath device would crash with a NULL pointer, e.g.: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 PGD bb3c7067 PUD bb01d067 PMD 0 Oops: 0002 [#1] SMP ... CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10 ... task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000 RIP: 0010:[<ffffffff8125037e>] [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 RSP: 0018:ffff880079565c98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030 RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001 R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000 R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000 FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0 Stack: 0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47 ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000 ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8 Call Trace: [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0 [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210 [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320 [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190 [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330 [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod] [<ffffffff812530c0>] generic_make_request+0xc0/0x100 [<ffffffff81253173>] submit_bio+0x73/0x140 [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80 [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0 [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60 [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20 [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20 [<ffffffff811b81f1>] do_fsync+0x41/0x80 [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80 [<ffffffff811b8260>] SyS_fsync+0x10/0x20 [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b Fix this by moving the ->flush_rq allocation from blk_init_queue_node() to blk_init_allocated_queue(). blk_init_queue_node() also calls blk_init_allocated_queue() so this change is functionality equivalent for all blk_init_queue_node() callers. Reported-by: Hannes Reinecke <hare@suse.de> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>