aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flagJan Kara2015-12-07
| | | | | | | | | | | | | | | | | | When dioread_nolock mode is enabled, we grab i_data_sem in ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block() not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However holding i_data_sem over overwrite direct IO isn't needed these days. We have exclusion against truncate / hole punching because we increase i_dio_count under i_mutex in ext4_ext_direct_IO() so once ext4_file_write_iter() verifies blocks are allocated & written, they are guaranteed to stay so during the whole direct IO even after we drop i_mutex. So we can just remove this locking abuse and the no longer necessary EXT4_GET_BLOCKS_NO_LOCK flag. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: document lock orderingJan Kara2015-12-07
| | | | | | | | We have enough locks that it's probably worth documenting the lock ordering rules we have in ext4. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix races of writeback with punch hole and zero rangeJan Kara2015-12-07
| | | | | | | | | | | | | When doing delayed allocation, update of on-disk inode size is postponed until IO submission time. However hole punch or zero range fallocate calls can end up discarding the tail page cache page and thus on-disk inode size would never be properly updated. Make sure the on-disk inode size is updated before truncating page cache. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix races between buffered IO and collapse / insert rangeJan Kara2015-12-07
| | | | | | | | | | | | | | | | Current code implementing FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE is prone to races with buffered writes and page faults. If buffered write or write via mmap manages to squeeze between filemap_write_and_wait_range() and truncate_pagecache() in the fallocate implementations, the written data is simply discarded by truncate_pagecache() although it should have been shifted. Fix the problem by moving filemap_write_and_wait_range() call inside i_mutex and i_mmap_sem. That way we are protected against races with both buffered writes and page faults. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: move unlocked dio protection from ext4_alloc_file_blocks()Jan Kara2015-12-07
| | | | | | | | | | | | | | Currently ext4_alloc_file_blocks() was handling protection against unlocked DIO. However we now need to sometimes call it under i_mmap_sem and sometimes not and DIO protection ranks above it (although strictly speaking this cannot currently create any deadlocks). Also ext4_zero_range() was actually getting & releasing unlocked DIO protection twice in some cases. Luckily it didn't introduce any real bug but it was a land mine waiting to be stepped on. So move DIO protection out from ext4_alloc_file_blocks() into the two callsites. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix races between page faults and hole punchingJan Kara2015-12-07
| | | | | | | | | | | | | | | | | | | | | | Currently, page faults and hole punching are completely unsynchronized. This can result in page fault faulting in a page into a range that we are punching after truncate_pagecache_range() has been called and thus we can end up with a page mapped to disk blocks that will be shortly freed. Filesystem corruption will shortly follow. Note that the same race is avoided for truncate by checking page fault offset against i_size but there isn't similar mechanism available for punching holes. Fix the problem by creating new rw semaphore i_mmap_sem in inode and grab it for writing over truncate, hole punching, and other functions removing blocks from extent tree and for read over page faults. We cannot easily use i_data_sem for this since that ranks below transaction start and we need something ranking above it so that it can be held over the whole truncate / hole punching operation. Also remove various workarounds we had in the code to reduce race window when page fault could have created pages with stale mapping information. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2015-12-07
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings, some endian conversion problems with ext4 encryption, potential memory leaks after truncate in data=journal mode, and an ocfs2 regression caused by a jbd2 performance improvement" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix null committed data return in undo_access ext4: add "static" to ext4_seq_##name##_fops struct ext4: fix an endianness bug in ext4_encrypted_follow_link() ext4: fix an endianness bug in ext4_encrypted_zeroout() jbd2: Fix unreclaimed pages after truncate in data=journal mode ext4: Fix handling of extended tv_sec
| * jbd2: fix null committed data return in undo_accessJunxiao Bi2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | introduced jbd2_write_access_granted() to improve write|undo_access speed, but missed to check the status of b_committed_data which caused a kernel panic on ocfs2. [ 6538.405938] ------------[ cut here ]------------ [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400! [ 6538.406686] invalid opcode: 0000 [#1] SMP [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1 [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000 [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>] [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP: 0018:ffff880075b7b7f8 EFLAGS: 00010246 [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0 [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430 [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001 [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001 [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0 [ 6538.406686] FS: 00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000 [ 6538.406686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0 [ 6538.406686] Stack: [ 6538.406686] ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800 [ 6538.406686] ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d [ 6538.406686] ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000 [ 6538.406686] Call Trace: [ 6538.406686] [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2] [ 6538.406686] [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2] [ 6538.406686] [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2] [ 6538.406686] [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2] [ 6538.406686] [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2] [ 6538.406686] [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2] [ 6538.406686] [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2] [ 6538.406686] [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2] [ 6538.406686] [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2] [ 6538.406686] [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2] [ 6538.406686] [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2] [ 6538.406686] [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2] [ 6538.406686] [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2] [ 6538.406686] [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2] [ 6538.406686] [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50 [ 6538.406686] [<ffffffff812124b7>] notify_change+0x1d7/0x340 [ 6538.406686] [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80 [ 6538.406686] [<ffffffff811f5876>] do_truncate+0x66/0x90 [ 6538.406686] [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110 [ 6538.406686] [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120 [ 6538.406686] [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10 [ 6538.406686] [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 [ 6538.406686] RIP [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP <ffff880075b7b7f8> [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]--- [ 6538.694492] Kernel panic - not syncing: Fatal exception [ 6538.695484] Kernel Offset: disabled Fixes: de92c8caf16c("jbd2: speedup jbd2_journal_get_[write|undo]_access()") Cc: <stable@vger.kernel.org> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add "static" to ext4_seq_##name##_fops structXu Cang2015-11-26
| | | | | | | | | | | | to fix sparse warning, add static to ext4_seq_##name##_fops struct. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix an endianness bug in ext4_encrypted_follow_link()Al Viro2015-11-26
| | | | | | | | | | | | | | | | applying le32_to_cpu() to 16bit value is a bad idea... Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix an endianness bug in ext4_encrypted_zeroout()Al Viro2015-11-26
| | | | | | | | | | | | | | | | | | | | | | | | ex->ee_block is not host-endian (note that accesses of other fields of *ex right next to that line go through the helpers that do proper conversion from little-endian to host-endian; it might make sense to add similar for ->ee_block to avoid reintroducing that kind of bugs...) Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * jbd2: Fix unreclaimed pages after truncate in data=journal modeJan Kara2015-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ted and Namjae have reported that truncated pages don't get timely reclaimed after being truncated in data=journal mode. The following test triggers the issue easily: for (i = 0; i < 1000; i++) { pwrite(fd, buf, 1024*1024, 0); fsync(fd); fsync(fd); ftruncate(fd, 0); } The reason is that journal_unmap_buffer() finds that truncated buffers are not journalled (jh->b_transaction == NULL), they are part of checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have been already written out (!buffer_dirty(bh)). We clean such buffers but we leave them in the checkpoint list. Since checkpoint transaction holds a reference to the journal head, these buffers cannot be released until the checkpoint transaction is cleaned up. And at that point we don't call release_buffer_page() anymore so pages detached from mapping are lingering in the system waiting for reclaim to find them and free them. Fix the problem by removing buffers from transaction checkpoint lists when journal_unmap_buffer() finds out they don't have to be there anymore. Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com> Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: Fix handling of extended tv_secDavid Turner2015-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ext4, the bottom two bits of {a,c,m}time_extra are used to extend the {a,c,m}time fields, deferring the year 2038 problem to the year 2446. When decoding these extended fields, for times whose bottom 32 bits would represent a negative number, sign extension causes the 64-bit extended timestamp to be negative as well, which is not what's intended. This patch corrects that issue, so that the only negative {a,c,m}times are those between 1901 and 1970 (as per 32-bit signed timestamps). Some older kernels might have written pre-1970 dates with 1,1 in the extra bits. This patch treats those incorrectly-encoded dates as pre-1970, instead of post-2311, until kernel 4.20 is released. Hopefully by then e2fsck will have fixed up the bad data. Also add a comment explaining the encoding of ext4's extra {a,c,m}time bits. Signed-off-by: David Turner <novalis@novalis.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Mark Harris <mh8928@yahoo.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732 Cc: stable@vger.kernel.org
* | Linux 4.4-rc4Linus Torvalds2015-12-06
| |
* | staging/lustre: remove IOC_LIBCFS_PING_TEST ioctlJames Simmons2015-12-06
| | | | | | | | | | | | | | | | | | | | | | | | The ioctl IOC_LIBCFS_PING_TEST has not been used in ages. The recent nidstring changes which moved all the nidstring operations from libcfs to the LNet layer but this ioctl code was still using an nidstring operation that was causing a circular dependency loop between libcfs and LNet. Signed-off-by: James Simmons <jsimmons@infradead.org> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2015-12-06
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes (-stable fodder) + dead code removal after the overlayfs fix. I agree that it's better to separate from the fix part to make backporting easier, but IMO it's not worth delaying said dead code removal until the next window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't reset ->total_link_count on nested calls of vfs_path_lookup() ovl: get rid of the dead code left from broken (and disabled) optimizations ovl: fix permission checking for setattr
| * | Don't reset ->total_link_count on nested calls of vfs_path_lookup()Al Viro2015-12-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | we already zero it on outermost set_nameidata(), so initialization in path_init() is pointless and wrong. The same DoS exists on pre-4.2 kernels, but there a slightly different fix will be needed. Cc: stable@vger.kernel.org # v4.2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | ovl: get rid of the dead code left from broken (and disabled) optimizationsAl Viro2015-12-06
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | ovl: fix permission checking for setattrMiklos Szeredi2015-12-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Al Viro] The bug is in being too enthusiastic about optimizing ->setattr() away - instead of "copy verbatim with metadata" + "chmod/chown/utimes" (with the former being always safe and the latter failing in case of insufficient permissions) it tries to combine these two. Note that copyup itself will have to do ->setattr() anyway; _that_ is where the elevated capabilities are right. Having these two ->setattr() (one to set verbatim copy of metadata, another to do what overlayfs ->setattr() had been asked to do in the first place) combined is where it breaks. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2015-12-06
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "This updates contains the following changes: - Fix a signal handling regression in the bit wait functions. - Avoid false positive warnings in the wakeup path. - Initialize the scheduler root domain properly. - Handle gtime calculations in proc/$PID/stat proper. - Add more documentation for the barriers in try_to_wake_up(). - Fix a subtle race in try_to_wake_up() which might cause a task to be scheduled on two cpus - Compile static helper function only when it is used" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() sched/core: Better document the try_to_wake_up() barriers sched/cputime: Fix invalid gtime in proc sched/core: Clear the root_domain cpumasks in init_rootdomain() sched/core: Remove false-positive warning from wake_up_process() sched/wait: Fix signal handling in bit wait helpers sched/rt: Hide the push_irq_work_func() declaration
| * | | sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()Peter Zijlstra2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Oleg noticed that its possible to falsely observe p->on_cpu == 0 such that we'll prematurely continue with the wakeup and effectively run p on two CPUs at the same time. Even though the overlap is very limited; the task is in the middle of being scheduled out; it could still result in corruption of the scheduler data structures. CPU0 CPU1 set_current_state(...) <preempt_schedule> context_switch(X, Y) prepare_lock_switch(Y) Y->on_cpu = 1; finish_lock_switch(X) store_release(X->on_cpu, 0); try_to_wake_up(X) LOCK(p->pi_lock); t = X->on_cpu; // 0 context_switch(Y, X) prepare_lock_switch(X) X->on_cpu = 1; finish_lock_switch(Y) store_release(Y->on_cpu, 0); </preempt_schedule> schedule(); deactivate_task(X); X->on_rq = 0; if (X->on_rq) // false if (t) while (X->on_cpu) cpu_relax(); context_switch(X, ..) finish_lock_switch(X) store_release(X->on_cpu, 0); Avoid the load of X->on_cpu being hoisted over the X->on_rq load. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Better document the try_to_wake_up() barriersPeter Zijlstra2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/cputime: Fix invalid gtime in procHiroshi Shimamoto2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/stats shows invalid gtime when the thread is running in guest. When vtime accounting is not enabled, we cannot get a valid delta. The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap is only updated when vtime accounting is runtime enabled. This patch makes task_gtime() just return gtime without computing the buggy non-existing tickless delta when vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless delta. This way we fix the gtime value regression on machines not running nohz full. The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full. I ran and stop a busy loop in VM and see the gtime in host. Dump the 43rd field which shows the gtime in every second: # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done S 4348 R 7064566 R 7064766 R 7064967 R 7065168 S 4759 S 4759 During running busy loop, it returns large value. After applying this patch, we can see right gtime. # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done S 5338 R 5365 R 5465 R 5566 R 5666 S 5726 S 5726 Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Clear the root_domain cpumasks in init_rootdomain()Xunlei Pang2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, this may cause problems. For instance, When doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks(with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask memembers: dlo_mask, span, and online. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Remove false-positive warning from wake_up_process()Sasha Levin2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because wakeups can (fundamentally) be late, a task might not be in the expected state. Therefore testing against a task's state is racy, and can yield false positives. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: oleg@redhat.com Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task") Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/wait: Fix signal handling in bit wait helpersPeter Zijlstra2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vladimir reported getting RCU stall warnings and bisected it back to commit: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") That commit inadvertently reversed the calls to schedule() and signal_pending(), thereby not handling the case where the signal receives while we sleep. Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mark.rutland@arm.com Cc: neilb@suse.de Cc: oleg@redhat.com Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/rt: Hide the push_irq_work_func() declarationArnd Bergmann2015-11-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The push_irq_work_func() function is conditionally defined only when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the forward declaration remains visibile without HAVE_RT_PUSH_IPI, causing a gcc warning in ARM64 allnoconfig: kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function] This changes the code to use the same condition for both the declaration and the function definition, which gets rid of the warning. As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI thing after: 8053871d0f7f ("smp: Fix smp_call_function_single_async() locking") Until that is done, this patch can be used to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2015-12-06
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thoma Gleixner: "Another round of fixes for x86: - Move the initialization of the microcode driver to late_initcall to make sure everything that init function needs is available. - Make sure that lockdep knows about interrupts being off in the entry code before calling into c-code. - Undo the cpu hotplug init delay regression. - Use the proper conditionals in the mpx instruction decoder. - Fixup restart_syscall for x32 tasks. - Fix the hugepage regression on PAE kernels which was introduced with the latest PAT changes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/signal: Fix restart_syscall number for x32 tasks x86/mpx: Fix instruction decoder condition x86/mm: Fix regression with huge pages on PAE x86 smpboot: Re-enable init_udelay=0 by default on modern CPUs x86/entry/64: Fix irqflag tracing wrt context tracking x86/microcode: Initialize the driver late when facilities are up
| * | | | x86/signal: Fix restart_syscall number for x32 tasksDmitry V. Levin2015-12-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When restarting a syscall with regs->ax == -ERESTART_RESTARTBLOCK, regs->ax is assigned to a restart_syscall number. For x32 tasks, this syscall number must have __X32_SYSCALL_BIT set, otherwise it will be an x86_64 syscall number instead of a valid x32 syscall number. This issue has been there since the introduction of x32. Reported-by: strace/tests/restart_syscall.test Reported-and-tested-by: Elvira Khabirova <lineprinter0@gmail.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Elvira Khabirova <lineprinter0@gmail.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20151130215436.GA25996@altlinux.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86/mpx: Fix instruction decoder conditionDave Hansen2015-12-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MPX decodes instructions in order to tell which bounds register was violated. Part of this decoding involves looking at the "REX prefix" which is a special instrucion prefix used to retrofit support for new registers in to old instructions. The X86_REX_*() macros are defined to return actual bit values: #define X86_REX_R(rex) ((rex) & 4) *not* boolean values. However, the MPX code was checking for them like they were booleans. This might have led to us mis-decoding the "REX prefix" and giving false information out to userspace about bounds violations. X86_REX_B() actually is bit 1, so this is really only broken for the X86_REX_X() case. Fix the conditionals up to tolerate the non-boolean values. Fixes: fcc7ffd67991 "x86, mpx: Decode MPX instruction to get bound violation information" Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20151201003113.D800C1E0@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86/mm: Fix regression with huge pages on PAEKirill A. Shutemov2015-12-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent PAT patchset has caused issue on 32-bit PAE machines: page:eea45000 count:0 mapcount:-128 mapping: (null) index:0x0 flags: 0x40000000() page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) < 0) ------------[ cut here ]------------ kernel BUG at /home/build/linux-boris/mm/huge_memory.c:1485! invalid opcode: 0000 [#1] SMP [...] Call Trace: unmap_single_vma ? __wake_up unmap_vmas unmap_region do_munmap vm_munmap SyS_munmap do_fast_syscall_32 ? __do_page_fault sysenter_past_esp Code: ... EIP: [<c11bde80>] zap_huge_pmd+0x240/0x260 SS:ESP 0068:f6459d98 The problem is in pmd_pfn_mask() and pmd_flags_mask(). These helpers use PMD_PAGE_MASK to calculate resulting mask. PMD_PAGE_MASK is 'unsigned long', not 'unsigned long long' as phys_addr_t is on 32-bit PAE (ARCH_PHYS_ADDR_T_64BIT). As a result, the upper bits of resulting mask get truncated. pud_pfn_mask() and pud_flags_mask() aren't problematic since we don't have PUD page table level on 32-bit systems, but it's reasonable to keep them consistent with PMD counterpart. Introduce PHYSICAL_PMD_PAGE_MASK and PHYSICAL_PUD_PAGE_MASK in addition to existing PHYSICAL_PAGE_MASK and reworks helpers to use them. Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [ Fix -Woverflow warnings from the realmode code. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jürgen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: linux-mm <linux-mm@kvack.org> Fixes: f70abb0fc3da ("x86/asm: Fix pud/pmd interfaces to handle large PAT bit") Link: http://lkml.kernel.org/r/1448878233-11390-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86 smpboot: Re-enable init_udelay=0 by default on modern CPUsLen Brown2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f1ccd249319e allowed the cmdline "cpu_init_udelay=" to work with all values, including the default of 10000. But in setting the default of 10000, it over-rode the code that sets the delay 0 on modern processors. Also, tidy up use of INT/UINT. Fixes: f1ccd249319e "x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior" Reported-by: Shane <shrybman@teksavvy.com> Signed-off-by: Len Brown <len.brown@intel.com> Cc: dparsons@brightdsl.net Cc: stable@kernel.org Link: http://lkml.kernel.org/r/9082eb809ef40dad02db714759c7aaf618c518d4.1448232494.git.len.brown@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86/entry/64: Fix irqflag tracing wrt context trackingAndy Lutomirski2015-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Paolo pointed out that enter_from_user_mode could be called while irqflags were traced as though IRQs were on. In principle, this could confuse lockdep. It doesn't cause any problems that I've seen in any configuration, but if I build with CONFIG_DEBUG_LOCKDEP=y, enable a nohz_full CPU, and add code like: if (irqs_disabled()) { spin_lock(&something); spin_unlock(&something); } to the top of enter_from_user_mode, then lockdep will complain without this fix. It seems that lockdep's irqflags sanity checks are too weak to detect this bug without forcing the issue. This patch adds one byte to normal kernels, and it's IMO a bit ugly. I haven't spotted a better way to do this yet, though. The issue is that we can't do TRACE_IRQS_OFF until after SWAPGS (if needed), but we're also supposed to do it before calling C code. An alternative approach would be to call trace_hardirqs_off in enter_from_user_mode. That would be less code and would not bloat normal kernels at all, but it would be harder to see how the code worked. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/86237e362390dfa6fec12de4d75a238acb0ae787.1447361906.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86/microcode: Initialize the driver late when facilities are upBorislav Petkov2015-11-23
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running microcode_init() from setup_arch() is a bad idea because not even kmalloc() is ready at that point and the loader does all kinds of allocations and init/registration with various subsystems. Make it a late initcall when required facilities are initialized so that the microcode driver initialization can succeed too. Reported-and-tested-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151120112400.GC4028@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge tag 'scsi-fixes' of ↵Linus Torvalds2015-12-06
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is quite a bumper crop of fixes: three from Arnd correcting various build issues in some configurations, a lock recursion in qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a potential null deref in st, a revert of a bdi registration fix that turned out to cause even more problems, a set of fixes to allow people who only defined MPT2SAS to still work after the mpt2/mpt3sas merger and a couple of fixes for issues turned up by the hyper-v storvsc driver" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility Revert "scsi: Fix a bdi reregistration race" mpt3sas: Add dummy Kconfig option for backwards compatibility Fix a memory leak in scsi_host_dev_release() block/sd: Fix device-imposed transfer length limits scsi_debug: fix prevent_allow+verify regressions MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem. sd: Make discard granularity match logical block size when LBPRZ=1 scsi: hpsa: select CONFIG_SCSI_SAS_ATTR scsi: advansys needs ISA dma api for ISA support scsi_sysfs: protect against double execution of __scsi_remove_device() st: fix potential null pointer dereference. scsi: report 'INQUIRY result too short' once per host advansys: fix big-endian builds qla2xxx: Fix rwlock recursion hpsa: logical vs bitwise AND typo mvsas: don't allow negative timeouts mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag
| * \ \ \ Merge branch 'mkp-fixes' into fixesJames Bottomley2015-12-03
| |\ \ \ \
| | * | | | mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibilityJames Bottomley2015-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The non-PCI builds of the O day test project are failing: On Thu, 2015-12-03 at 05:02 +0800, kbuild test robot wrote: > warning: (SCSI_MPT2SAS) selects SCSI_MPT3SAS which has unmet direct > dependencies (SCSI_LOWLEVEL && PCI && SCSI) The problem is that select and depend don't interact because Kconfig doesn't have a SAT solver, so depend picks up dependencies and select does onward selects, but select doesn't pick up dependencies. To fix this, we need to add the correct dependencies to the MPT2SAS option like this. Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: b840c3627b6f4f856b333a14a72f8ed86da2f86c Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
| | * | | | mpt3sas: Add dummy Kconfig option for backwards compatibilityMartin K. Petersen2015-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mpt2sas driver was recently folded into mpt3sas to reduce code duplication. To avoid problems for people that only have CONFIG_SCSI_MPT2SAS in their .config we introduce a dummy option that will select MPT3SAS if MPT2SAS was previously enabled. This is a temporary measure and we will deprecate this config option in 4.6. Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: James Bottomley <James.Bottomley@hansenpartnership.com> CC: Ingo Molnar <mingo@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | Fix a memory leak in scsi_host_dev_release()Bart Van Assche2015-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid that kmemleak reports the following memory leak if a SCSI LLD calls scsi_host_alloc() and scsi_host_put() but neither scsi_host_add() nor scsi_host_remove(). The following shell command triggers that scenario: for ((i=0; i<2; i++)); do srp_daemon -oac | while read line; do echo $line >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target done done unreferenced object 0xffff88021b24a220 (size 8): comm "srp_daemon", pid 56421, jiffies 4295006762 (age 4240.750s) hex dump (first 8 bytes): 68 6f 73 74 35 38 00 a5 host58.. backtrace: [<ffffffff8151014a>] kmemleak_alloc+0x7a/0xc0 [<ffffffff81165c1e>] __kmalloc_track_caller+0xfe/0x160 [<ffffffff81260d2b>] kvasprintf+0x5b/0x90 [<ffffffff81260e2d>] kvasprintf_const+0x8d/0xb0 [<ffffffff81254b0c>] kobject_set_name_vargs+0x3c/0xa0 [<ffffffff81337e3c>] dev_set_name+0x3c/0x40 [<ffffffff81355757>] scsi_host_alloc+0x327/0x4b0 [<ffffffffa03edc8e>] srp_create_target+0x4e/0x8a0 [ib_srp] [<ffffffff8133778b>] dev_attr_store+0x1b/0x20 [<ffffffff811f27fa>] sysfs_kf_write+0x4a/0x60 [<ffffffff811f1e8e>] kernfs_fop_write+0x14e/0x180 [<ffffffff81176eef>] __vfs_write+0x2f/0xf0 [<ffffffff811771e4>] vfs_write+0xa4/0x100 [<ffffffff81177c64>] SyS_write+0x54/0xc0 [<ffffffff8151b257>] entry_SYSCALL_64_fastpath+0x12/0x6f Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: stable <stable@vger.kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | block/sd: Fix device-imposed transfer length limitsMartin K. Petersen2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests") had the unfortunate side-effect of removing an implicit clamp to BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer code. This caused problems for some SMR drives. Debugging this issue revealed a few problems with the existing infrastructure since the block layer didn't know how to deal with device-imposed limits, only limits set by the I/O controller. - Introduce a new queue limit, max_dev_sectors, which is used by the ULD to signal the maximum sectors for a REQ_TYPE_FS request. - Ensure that max_dev_sectors is correctly stacked and taken into account when overriding max_sectors through sysfs. - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer values for later processing. - In sd_revalidate() set the queue's max_dev_sectors based on the MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value is not reported, fall back to a cap based on the CDB TRANSFER LENGTH field size. - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits VPD--if reported and sane--to signal the preferred device transfer size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS. - blk_limits_max_hw_sectors() is no longer used and can be removed. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581 Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: sweeneygj@gmx.com Tested-by: Arzeets <anatol.pomozov@gmail.com> Tested-by: David Eisner <david.eisner@oriel.oxon.org> Tested-by: Mario Kicherer <dev@kicherer.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | scsi_debug: fix prevent_allow+verify regressionsDouglas Gilbert2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ruediger Meier observed a regression with the PREVENT ALLOW MEDIUM REMOVAL command in lk 3.19: http://www.spinics.net/lists/util-linux-ng/msg11448.html Inspection indicated the same regression with VERIFY(10). The patch is against lk 3.19.3 and also works with lk 4.3.0 . With this patch both commands are accepted and do nothing. ChangeLog: - fix the lk 3.19 regression so that the PREVENT ALLOW MEDIUM REMOVAL command is supported once again - same fix for VERIFY(10) Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Hannes Reinicke <hare@suse.de> Reviewed-by: Ewan Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.Martin K. Petersen2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | sd: Make discard granularity match logical block size when LBPRZ=1Martin K. Petersen2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A device may report an OPTIMAL UNMAP GRANULARITY and UNMAP GRANULARITY ALIGNMENT in the Block Limits VPD. These parameters describe the device's internal provisioning allocation units. By default the block layer will round and align any discard requests based on these limits. If a device reports LBPRZ=1 to guarantee zeroes after discard, however, it is imperative that the block layer does not leave out any parts of the requested block range. Otherwise the device can not do the required zeroing of any partial allocation units and this can lead to data corruption. Since the dm thinp personality relies on the block layer's current behavior and is unable to deal with partial discard blocks we work around the problem by setting the granularity to match the logical block size when LBPRZ is enabled. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | scsi: hpsa: select CONFIG_SCSI_SAS_ATTRArnd Bergmann2015-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hpsa driver recently started using the sas transport class, but it does not ensure that the corresponding code is actually built, which may lead to a link error: drivers/built-in.o: In function `hpsa_free_sas_phy': (.text+0x1ce874): undefined reference to `sas_port_delete_phy' (.text+0x1ce87c): undefined reference to `sas_phy_free' drivers/built-in.o: In function `hpsa_alloc_sas_port': (.text+0x1ceb9c): undefined reference to `sas_port_alloc_num' (.text+0x1ceba8): undefined reference to `sas_port_add' drivers/built-in.o: In function `hpsa_init': (.init.text+0x8838): undefined reference to `sas_attach_transport' (.init.text+0x8868): undefined reference to `sas_release_transport This adds 'select SCSI_SAS_ATTR' in the Kconfig entry, just like we do for all other drivers using those functions. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: d04e62b9d63a ("hpsa: add in sas transport class") Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Don Brace <don.brace@pmcs.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | scsi: advansys needs ISA dma api for ISA supportArnd Bergmann2015-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The advansys drvier uses the request_dma function that is used on ISA machines for the internal DMA controller, which causes build errors on platforms that have ISA slots but do not provide the ISA DMA API: drivers/scsi/advansys.c: In function 'advansys_board_found': drivers/scsi/advansys.c:11300:10: error: implicit declaration of function 'request_dma' [-Werror=implicit-function-declaration] The problem now showed up in ARM randconfig builds after commit 6571fb3f8b7f ("advansys: Update to version 3.5 and remove compilation warning") made it possible to build on platforms that have neither VIRT_TO_BUS nor ISA_DMA_API but that do have ISA. This adds the missing dependency. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | scsi_sysfs: protect against double execution of __scsi_remove_device()Vitaly Kuznetsov2015-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some host errors storvsc module tries to remove sdev by scheduling a job which does the following: sdev = scsi_device_lookup(wrk->host, 0, 0, wrk->lun); if (sdev) { scsi_remove_device(sdev); scsi_device_put(sdev); } While this code seems correct the following crash is observed: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC RIP: 0010:[<ffffffff81169979>] [<ffffffff81169979>] bdi_destroy+0x39/0x220 ... [<ffffffff814aecdc>] ? _raw_spin_unlock_irq+0x2c/0x40 [<ffffffff8127b7db>] blk_cleanup_queue+0x17b/0x270 [<ffffffffa00b54c4>] __scsi_remove_device+0x54/0xd0 [scsi_mod] [<ffffffffa00b556b>] scsi_remove_device+0x2b/0x40 [scsi_mod] [<ffffffffa00ec47d>] storvsc_remove_lun+0x3d/0x60 [hv_storvsc] [<ffffffff81080791>] process_one_work+0x1b1/0x530 ... The problem comes with the fact that many such jobs (for the same device) are being scheduled simultaneously. While scsi_remove_device() uses shost->scan_mutex and scsi_device_lookup() will fail for a device in SDEV_DEL state there is no protection against someone who did scsi_device_lookup() before we actually entered __scsi_remove_device(). So the whole scenario looks like that: two callers do simultaneous (or preemption happens) calls to scsi_device_lookup() ant these calls succeed for both of them, after that they try doing scsi_remove_device(). shost->scan_mutex only serializes their calls to __scsi_remove_device() and we end up doing the cleanup path twice. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | st: fix potential null pointer dereference.Maurizio Lombardi2015-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If cdev_add() returns an error, the code calls cdev_del() passing the STm->cdevs[rew] pointer as parameter; the problem is that the pointer has not been initialized yet. This patch fixes the problem by moving the STm->cdevs[rew] pointer initialization before the call to cdev_add(). It also sets STm->devs[rew] and STm->cdevs[rew] to NULL in case of failure. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | scsi: report 'INQUIRY result too short' once per hostVitaly Kuznetsov2015-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some host adapters (e.g. Hyper-V storvsc) are known for not respecting the SPC-2/3/4 requirement for 'INQUIRY data (see table ...) shall contain at least 36 bytes'. As a result we get tons on 'scsi 0:7:1:1: scsi scan: INQUIRY result too short (5), using 36' messages on console. This can be problematic for slow consoles. Introduce short_inquiry flag in struct Scsi_Host to print the message once per host. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | advansys: fix big-endian buildsArnd Bergmann2015-11-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Building the advansys driver in a big-endian configuration such as ARM allmodconfig shows a warning: drivers/scsi/advansys.c: In function 'adv_build_req': include/uapi/linux/byteorder/big_endian.h:32:26: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define __cpu_to_le32(x) ((__force __le32)__swab32((x))) drivers/scsi/advansys.c:7806:22: note: in expansion of macro 'cpu_to_le32' scsiqp->sense_len = cpu_to_le32(SCSI_SENSE_BUFFERSIZE); It turns out that the commit that introduced this used the cpu_to_le32() incorrectly on an 8-bit field, which results in the sense_len to always be set to zero, as the SCSI_SENSE_BUFFERSIZE value gets moved to upper byte of the 32-bit intermediate. This removes the cpu_to_le32() call to restore the original version. I found this only by looking at the compiler output and have not done a full review for possible further endianess bugs in the same driver. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 811ddc057aac ("advansys: use DMA-API for mapping sense buffer") Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| | * | | | qla2xxx: Fix rwlock recursionBart Van Assche2015-11-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following kernel bug: kernel:BUG: rwlock recursion on CPU#2, insmod/39333, ffff8803e998cb28 kernel: Call Trace: kernel: [<ffffffff812bce44>] dump_stack+0x48/0x64 kernel: [<ffffffff810a8047>] rwlock_bug+0x67/0x70 kernel: [<ffffffff810a833a>] do_raw_write_lock+0x8a/0xa0 kernel: [<ffffffff815f3033>] _raw_write_lock_irqsave+0x63/0x80 kernel: [<ffffffffa08087c8>] qla82xx_rd_32+0xe8/0x140 [qla2xxx] kernel: [<ffffffffa0808845>] qla82xx_crb_win_lock+0x25/0x60 [qla2xxx] kernel: [<ffffffffa0808976>] qla82xx_wr_32+0xf6/0x150 [qla2xxx] kernel: [<ffffffffa0808ac0>] qla82xx_disable_intrs+0x50/0x80 [qla2xxx] kernel: [<ffffffffa080630a>] qla82xx_reset_chip+0x1a/0x20 [qla2xxx] kernel: [<ffffffffa07d6ef2>] qla2x00_initialize_adapter+0x132/0x420 [qla2xxx] kernel: [<ffffffffa08087c8>] qla82xx_rd_32+0xe8/0x140 [qla2xxx] kernel: [<ffffffffa0808845>] qla82xx_crb_win_lock+0x25/0x60 [qla2xxx] kernel: [<ffffffffa0808976>] qla82xx_wr_32+0xf6/0x150 [qla2xxx] kernel: [<ffffffffa0808ac0>] qla82xx_disable_intrs+0x50/0x80 [qla2xxx] kernel: [<ffffffffa080630a>] qla82xx_reset_chip+0x1a/0x20 [qla2xxx] kernel: [<ffffffffa07d6ef2>] qla2x00_initialize_adapter+0x132/0x420 [qla2xxx] kernel: [<ffffffffa07c964e>] qla2x00_probe_one+0xefe/0x2130 [qla2xxx] kernel: [<ffffffff8130052c>] local_pci_probe+0x4c/0xa0 kernel: [<ffffffff81300603>] pci_call_probe+0x83/0xa0 kernel: [<ffffffff813008cf>] pci_device_probe+0x7f/0xb0 kernel: [<ffffffff813e2e83>] really_probe+0x133/0x390 kernel: [<ffffffff813e3139>] driver_probe_device+0x59/0xd0 kernel: [<ffffffff813e3251>] __driver_attach+0xa1/0xb0 kernel: [<ffffffff813e0cdd>] bus_for_each_dev+0x8d/0xb0 kernel: [<ffffffff813e28ee>] driver_attach+0x1e/0x20 kernel: [<ffffffff813e2252>] bus_add_driver+0x1d2/0x290 kernel: [<ffffffff813e3970>] driver_register+0x60/0xe0 kernel: [<ffffffff813009e4>] __pci_register_driver+0x64/0x70 kernel: [<ffffffffa04bc1cb>] qla2x00_module_init+0x1cb/0x21b [qla2xxx] kernel: [<ffffffff8100027d>] do_one_initcall+0xad/0x1c0 kernel: [<ffffffff810e2859>] do_init_module+0x69/0x210 kernel: [<ffffffff810e4e5c>] load_module+0x5cc/0x750 kernel: [<ffffffff810e5162>] SyS_init_module+0x92/0xc0 kernel: [<ffffffff815f37d7>] entry_SYSCALL_64_fastpath+0x12/0x6f Fixes: 8dfa4b5a9b44 ("qla2xxx: Fix sparse annotation") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reported-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Tested-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Chad Dupuis <chad.dupuis@qlogic.com> Cc: Giridhar Malavali <giridhar.malavali@qlogic.com> Cc: Xose Vazquez Perez <xose.vazquez@gmail.com> Cc: stable <stable@vger.kernel.org> # v4.3+ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>