aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* rcu: Use wrapper for lockdep assertsMatthew Wilcox2018-02-20
| | | | | | | | | | Commits c0b334c5bfa9 and ea9b0c8a26a2 introduced new sparse warnings by accessing rcu_node->lock directly and ignoring the __private marker. Introduce a new wrapper and use it. Also fix a similar problem in srcutree.c introduced by a3883df3935e. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Remove redundant nxttail index macro defineLiu, Changcheng2018-02-20
| | | | | | | | | | RCU's nxttail has been optimized to be a rcu_segcblist, which is a multi-tailed linked list with macros defined for the indexes for each tail. The indexes have been defined in linux/rcu_segcblist.h, so this commit removes the redundant definitions in kernel/rcu/tree.h. Signed-off-by: Liu Changcheng <changcheng.liu@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Consolidate rcu.h #ifdefsPaul E. McKenney2018-02-20
| | | | | | | | The kernel/rcu/rcu.h file has a pair of consecutive #ifdefs on CONFIG_TINY_RCU, so this commit consolidates them, thus saving a few lines of code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: More clearly identify grace-period kthread stack dumpPaul E. McKenney2018-02-20
| | | | | | | | | It is not always obvious that the stack dump from a starved grace-period kthread isn't instead that of a CPU stalling the current grace period. This commit therefore adds a pr_err() flagging these dumps. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Remove obsolete force-quiescent-state statistics for debugfsPaul E. McKenney2018-02-20
| | | | | | | | | | | | | The debugfs interface displayed statistics on RCU-pending checks but this interface has since been removed. This commit therefore removes the no-longer-used rcu_state structure's ->n_force_qs_lh and ->n_force_qs_ngp fields along with their updates. (Though the ->n_force_qs_ngp field was actually not used at all, embarrassingly enough.) If this information proves necessary in the future, the corresponding event traces will be added. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Remove obsolete __rcu_pending() statistics for debugfsPaul E. McKenney2018-02-20
| | | | | | | | | | | | | | | The debugfs interface displayed statistics on RCU-pending checks but this interface has since been removed. This commit therefore removes the no-longer-used rcu_data structure's ->n_rcu_pending, ->n_rp_core_needs_qs, ->n_rp_report_qs, ->n_rp_cb_ready, ->n_rp_cpu_needs_gp, ->n_rp_gp_completed, ->n_rp_gp_started, ->n_rp_nocb_defer_wakeup, and ->n_rp_need_nothing fields along with their updates. If this information proves necessary in the future, the corresponding event traces will be added. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Remove obsolete callback-invocation statistics for debugfsPaul E. McKenney2018-02-20
| | | | | | | | | | | | The debugfs interface displayed statistics on RCU callback invocation but this interface has since been removed. This commit therefore removes the no-longer-used rcu_data structure's ->n_cbs_invoked and ->n_nocbs_invoked fields along with their updates. If this information proves necessary in the future, the corresponding event traces will be added. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Remove obsolete boost statistics for debugfsPaul E. McKenney2018-02-20
| | | | | | | | | | | | The debugfs interface displayed statistics on RCU priority boosting, but this interface has since been removed. This commit therefore removes the no-longer-used rcu_data structure's ->n_tasks_boosted, ->n_exp_boosts, and ->n_exp_boosts and their updates. If this information proves necessary in the future, the corresponding event traces will be added. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Call touch_nmi_watchdog() while printing stall warningsTejun Heo2018-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When RCU stall warning triggers, it can print out a lot of messages while holding spinlocks. If the console device is slow (e.g. an actual or IPMI serial console), it may end up triggering NMI hard lockup watchdog like the following. *** CPU printking while holding RCU spinlock PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8" #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0 #1 [ffff881fff945e58] nmi_handle at ffffffff81020653 #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36 #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32 #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e [exception RIP: io_serial_in+21] RIP: ffffffff81630e55 RSP: ffff881fff943b88 RFLAGS: 00000002 RAX: 000000000000ca00 RBX: ffffffff8230e188 RCX: 0000000000000000 RDX: 00000000000002fd RSI: 0000000000000005 RDI: ffffffff8230e188 RBP: ffff881fff943bb0 R8: 0000000000000000 R9: ffffffff820cb3c4 R10: 0000000000000019 R11: 0000000000002000 R12: 00000000000026e1 R13: 0000000000000020 R14: ffffffff820cd398 R15: 0000000000000035 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 --- <NMI exception stack> --- #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55 #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00 #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691 #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2 #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55 #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5 #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127 #15 [ffff881fff943d70] printk at ffffffff8119a8a4 #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133 #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497 #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037 #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38 #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b *** CPU triggering the hardlockup watchdog PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35" #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874 #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795 #3 [ffff883fff105b08] panic at ffffffff8119a6ae #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866 #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4 #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265 #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f #9 [ffff883fff105e58] nmi_handle at ffffffff81020653 #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94 #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32 #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e [exception RIP: queued_spin_lock_slowpath+248] RIP: ffffffff810da958 RSP: ffff883fff103e68 RFLAGS: 00000046 RAX: 0000000000000000 RBX: 0000000000000046 RCX: 00000000006d0000 RDX: ffff883fff49a950 RSI: 0000000000d10101 RDI: ffffffff81e54300 RBP: ffff883fff103e80 R8: ffff883fff11a950 R9: 0000000000000000 R10: 000000000e5873ba R11: 000000000000010f R12: ffffffff81e54300 R13: 0000000000000000 R14: ffff88010f88c380 R15: ffffffff81e54300 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958 #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18 #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497 #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037 #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38 #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b --- <IRQ stack> --- Avoid spuriously triggering NMI hardlockup watchdog by touching it from the print functions. show_state_filter() shares the same problem and solution. v2: Relocate the comment to where it belongs. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: Fix CPU offload boot message when no CPUs are offloadedPaul E. McKenney2018-02-20
| | | | | | | | | | | | | | | In CONFIG_RCU_NOCB_CPU=y kernels, if the boot parameters indicate that none of the CPUs should in fact be offloaded, the following somewhat obtuse message appears: Offload RCU callbacks from CPUs: . This commit therefore makes the message at least grammatically correct in this case: Offload RCU callbacks from CPUs: (none) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* sched/isolation: Eliminate NO_HZ_FULL_ALLPaul E. McKenney2018-02-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6f1982fedd59 ("sched/isolation: Handle the nohz_full= parameter") broke CONFIG_NO_HZ_FULL_ALL=y kernels. This breakage is due to the code under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping functions. This means that rcutorture scenario TREE04 now emits RCU CPU stall warnings due to the RCU grace-period kthreads not being awakened at a time of their choosing, or perhaps even not at all: [ 27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3 [ 27.731423] rcu_bh I14936 9 2 0x80080000 [ 27.731435] Call Trace: [ 27.731440] __schedule+0x31a/0x6d0 [ 27.731442] schedule+0x31/0x80 [ 27.731446] schedule_timeout+0x15a/0x320 [ 27.731453] ? call_timer_fn+0x130/0x130 [ 27.731457] rcu_gp_kthread+0x66c/0xea0 [ 27.731458] ? rcu_gp_kthread+0x66c/0xea0 Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken, I hypothesize that no one is in fact using it, other than rcutorture. This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates rcutorture's config files to instead use the nohz_full= kernel parameter to put the desired CPUs into nohz_full mode. Fixes: 6f1982fedd59 ("sched/isolation: Handle the nohz_full= parameter") Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net>
* rcu: Remove unnecessary spinlock in rcu_boot_init_percpu_data()Lihao Liang2018-02-15
| | | | | | | | Since rcu_boot_init_percpu_data() is only called at boot time, there is no data race and spinlock is not needed. Signed-off-by: Lihao Liang <lianglihao@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds2018-02-11
| | | | | | | | | | | | | | | | | | | | | | | This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-02-10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Radim Krčmář: "ARM: - icache invalidation optimizations, improving VM startup time - support for forwarded level-triggered interrupts, improving performance for timers and passthrough platform devices - a small fix for power-management notifiers, and some cosmetic changes PPC: - add MMIO emulation for vector loads and stores - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without requiring the complex thread synchronization of older CPU versions - improve the handling of escalation interrupts with the XIVE interrupt controller - support decrement register migration - various cleanups and bugfixes. s390: - Cornelia Huck passed maintainership to Janosch Frank - exitless interrupts for emulated devices - cleanup of cpuflag handling - kvm_stat counter improvements - VSIE improvements - mm cleanup x86: - hypervisor part of SEV - UMIP, RDPID, and MSR_SMI_COUNT emulation - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512 features - show vcpu id in its anonymous inode name - many fixes and cleanups - per-VCPU MSR bitmaps (already merged through x86/pti branch) - stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)" * tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits) KVM: PPC: Book3S: Add MMIO emulation for VMX instructions KVM: PPC: Book3S HV: Branch inside feature section KVM: PPC: Book3S HV: Make HPT resizing work on POWER9 KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code KVM: PPC: Book3S PR: Fix broken select due to misspelling KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs() KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled KVM: PPC: Book3S HV: Drop locks before reading guest memory kvm: x86: remove efer_reload entry in kvm_vcpu_stat KVM: x86: AMD Processor Topology Information x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested kvm: embed vcpu id to dentry of vcpu anon inode kvm: Map PFN-type memory regions as writable (if possible) x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n KVM: arm/arm64: Fixup userspace irqchip static key optimization KVM: arm/arm64: Fix userspace_irqchip_in_use counting KVM: arm/arm64: Fix incorrect timer_is_pending logic MAINTAINERS: update KVM/s390 maintainers MAINTAINERS: add Halil as additional vfio-ccw maintainer MAINTAINERS: add David as a reviewer for KVM/s390 ...
| * Merge branch 'x86/hyperv' of ↵Radim Krčmář2018-02-01
| |\ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Topic branch for stable KVM clockource under Hyper-V. Thanks to Christoffer Dall for resolving the ARM conflict.
| * | kvm_config: add CONFIG_S390_GUESTChristian Borntraeger2018-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | make kvmconfig currently does not select CONFIG_S390_GUEST. Since the virtio-ccw transport depends on CONFIG_S390_GUEST, we want to add CONFIG_S390_GUEST to kvmconfig. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-02-09
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Make allocations less aggressive in x_tables, from Minchal Hocko. 2) Fix netfilter flowtable Kconfig deps, from Pablo Neira Ayuso. 3) Fix connection loss problems in rtlwifi, from Larry Finger. 4) Correct DRAM dump length for some chips in ath10k driver, from Yu Wang. 5) Fix ABORT handling in rxrpc, from David Howells. 6) Add SPDX tags to Sun networking drivers, from Shannon Nelson. 7) Some ipv6 onlink handling fixes, from David Ahern. 8) Netem packet scheduler interval calcualtion fix from Md. Islam. 9) Don't put crypto buffers on-stack in rxrpc, from David Howells. 10) Fix handling of error non-delivery status in netlink multicast delivery over multiple namespaces, from Nicolas Dichtel. 11) Missing xdp flush in tuntap driver, from Jason Wang. 12) Synchonize RDS protocol netns/module teardown with rds object management, from Sowini Varadhan. 13) Add nospec annotations to mpls, from Dan Williams. 14) Fix SKB truesize handling in TIPC, from Hoang Le. 15) Interrupt masking fixes in stammc from Niklas Cassel. 16) Don't allow ptr_ring objects to be sized outside of kmalloc's limits, from Jason Wang. 17) Don't allow SCTP chunks to be built which will have a length exceeding the chunk header's 16-bit length field, from Alexey Kodanev. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits) ibmvnic: Remove skb->protocol checks in ibmvnic_xmit bpf: fix rlimit in reuseport net selftest sctp: verify size of a new chunk in _sctp_make_chunk() s390/qeth: fix SETIP command handling s390/qeth: fix underestimated count of buffer elements ptr_ring: try vmalloc() when kmalloc() fails ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE net: stmmac: remove redundant enable of PMT irq net: stmmac: rename GMAC_INT_DEFAULT_MASK for dwmac4 net: stmmac: discard disabled flags in interrupt status register ibmvnic: Reset long term map ID counter tools/libbpf: handle issues with bpf ELF objects containing .eh_frames selftests/bpf: add selftest that use test_libbpf_open selftests/bpf: add test program for loading BPF ELF files tools/libbpf: improve the pr_debug statements to contain section numbers bpf: Sync kernel ABI header with tooling header for bpf_common.h net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT net: thunder: change q_len's type to handle max ring size tipc: fix skb truesize/datasize ratio control net/sched: cls_u32: fix cls_u32 on filter replace ...
| * \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2018-02-09
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2018-02-09 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Two fixes for BPF sockmap in order to break up circular map references from programs attached to sockmap, and detaching related sockets in case of socket close() event. For the latter we get rid of the smap_state_change() and plug into ULP infrastructure, which will later also be used for additional features anyway such as TX hooks. For the second issue, dependency chain is broken up via map release callback to free parse/verdict programs, all from John. 2) Fix a libbpf relocation issue that was found while implementing XDP support for Suricata project. Issue was that when clang was invoked with default target instead of bpf target, then various other e.g. debugging relevant sections are added to the ELF file that contained relocation entries pointing to non-BPF related sections which libbpf trips over instead of skipping them. Test cases for libbpf are added as well, from Jesper. 3) Various misc fixes for bpftool and one for libbpf: a small addition to libbpf to make sure it recognizes all standard section prefixes. Then, the Makefile in bpftool/Documentation is improved to explicitly check for rst2man being installed on the system as we otherwise risk installing empty man pages; the man page for bpftool-map is corrected and a set of missing bash completions added in order to avoid shipping bpftool where the completions are only partially working, from Quentin. 4) Fix applying the relocation to immediate load instructions in the nfp JIT which were missing a shift, from Jakub. 5) Two fixes for the BPF kernel selftests: handle CONFIG_BPF_JIT_ALWAYS_ON=y gracefully in test_bpf.ko module and mark them as FLAG_EXPECTED_FAIL in this case; and explicitly delete the veth devices in the two tests test_xdp_{meta,redirect}.sh before dismantling the netnses as when selftests are run in batch mode, then workqueue to handle destruction might not have finished yet and thus veth creation in next test under same dev name would fail, from Yonghong. 6) Fix test_kmod.sh to check the test_bpf.ko module path before performing an insmod, and fallback to modprobe. Especially the latter is useful when having a device under test that has the modules installed instead, from Naresh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | bpf: sockmap, fix leaking maps with attached but not detached progsJohn Fastabend2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a program is attached to a map we increment the program refcnt to ensure that the program is not removed while it is potentially being referenced from sockmap side. However, if this same program also references the map (this is a reasonably common pattern in my programs) then the verifier will also increment the maps refcnt from the verifier. This is to ensure the map doesn't get garbage collected while the program has a reference to it. So we are left in a state where the map holds the refcnt on the program stopping it from being removed and releasing the map refcnt. And vice versa the program holds a refcnt on the map stopping it from releasing the refcnt on the prog. All this is fine as long as users detach the program while the map fd is still around. But, if the user omits this detach command we are left with a dangling map we can no longer release. To resolve this when the map fd is released decrement the program references and remove any reference from the map to the program. This fixes the issue with possibly dangling map and creates a user side API constraint. That is, the map fd must be held open for programs to be attached to a map. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: sockmap, add sock close() hook to remove socksJohn Fastabend2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The selftests test_maps program was leaving dangling BPF sockmap programs around because not all psock elements were removed from the map. The elements in turn hold a reference on the BPF program they are attached to causing BPF programs to stay open even after test_maps has completed. The original intent was that sk_state_change() would be called when TCP socks went through TCP_CLOSE state. However, because socks may be in SOCK_DEAD state or the sock may be a listening socket the event is not always triggered. To resolve this use the ULP infrastructure and register our own proto close() handler. This fixes the above case. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* | | | | Merge tag 'trace-v4.16-rc1' of ↵Linus Torvalds2018-02-09
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Al Viro discovered some breakage with the parsing of the set_ftrace_filter as well as the removing of function probes. This fixes the code with Al's suggestions. I also added a few selftests to test the broken cases such that they wont happen again" * tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: selftests/ftrace: Add more tests for removing of function probes selftests/ftrace: Add some missing glob checks selftests/ftrace: Have reset_ftrace_filter handle multiple instances selftests/ftrace: Have reset_ftrace_filter handle modules tracing: Fix parsing of globs with a wildcard at the beginning ftrace: Remove incorrect setting of glob search field
| * | | | | tracing: Fix parsing of globs with a wildcard at the beginningSteven Rostedt (VMware)2018-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro reported: For substring - sure, but what about something like "*a*b" and "a*b"? AFAICS, filter_parse_regex() ends up with identical results in both cases - MATCH_GLOB and *search = "a*b". And no way for the caller to tell one from another. Testing this with the following: # cd /sys/kernel/tracing # echo '*raw*lock' > set_ftrace_filter bash: echo: write error: Invalid argument With this patch: # echo '*raw*lock' > set_ftrace_filter # cat set_ftrace_filter _raw_read_trylock _raw_write_trylock _raw_read_unlock _raw_spin_unlock _raw_write_unlock _raw_spin_trylock _raw_spin_lock _raw_write_lock _raw_read_lock Al recommended not setting the search buffer to skip the first '*' unless we know we are not using MATCH_GLOB. This implements his suggested logic. Link: http://lkml.kernel.org/r/20180127170748.GF13338@ZenIV.linux.org.uk Cc: stable@vger.kernel.org Fixes: 60f1d5e3bac44 ("ftrace: Support full glob matching") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Suggsted-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | | | ftrace: Remove incorrect setting of glob search fieldSteven Rostedt (VMware)2018-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __unregister_ftrace_function_probe() will incorrectly parse the glob filter because it resets the search variable that was setup by filter_parse_regex(). Al Viro reported this: After that call of filter_parse_regex() we could have func_g.search not equal to glob only if glob started with '!' or '*'. In the former case we would've buggered off with -EINVAL (not = 1). In the latter we would've set func_g.search equal to glob + 1, calculated the length of that thing in func_g.len and proceeded to reset func_g.search back to glob. Suppose the glob is e.g. *foo*. We end up with func_g.type = MATCH_MIDDLE_ONLY; func_g.len = 3; func_g.search = "*foo"; Feeding that to ftrace_match_record() will not do anything sane - we will be looking for names containing "*foo" (->len is ignored for that one). Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk Cc: stable@vger.kernel.org Fixes: 3ba009297149f ("ftrace: Introduce ftrace_glob structure") Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | | | | | Merge tag 'modules-for-v4.16' of ↵Linus Torvalds2018-02-07
|\ \ \ \ \ \ | |_|/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Minor code cleanups and MAINTAINERS update" * tag 'modules-for-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: modpost: Remove trailing semicolon ftrace/module: Move ftrace_release_mod() to ddebug_cleanup label MAINTAINERS: Remove from module & paravirt maintenance
| * | | | | ftrace/module: Move ftrace_release_mod() to ddebug_cleanup labelNamit Gupta2018-01-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ftrace_module_init happen after dynamic_debug_setup, it is desired that cleanup should be called after this label however in current implementation it is called in free module label,ie:even though ftrace in not initialized, from so many fail case ftrace_release_mod() will be called and unnecessary traverse the whole list. In below patch we moved ftrace_release_mod() from free_module label to ddebug_cleanup label. that is the best possible location, other solution is to make new label to ftrace_release_mod() but since ftrace_module_init() is not return with minimum changes it should be in ddebug_cleanup label. Signed-off-by: Namit Gupta <gupta.namit@samsung.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jessica Yu <jeyu@kernel.org>
* | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2018-02-07
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ...
| * | | | | | pipe: reject F_SETPIPE_SZ with size over UINT_MAXEric Biggers2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A pipe's size is represented as an 'unsigned int'. As expected, writing a value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to 32 bits, rather than failing with EINVAL as expected. (It *does* fail with EINVAL for values above (1 << 31) but <= UINT_MAX.) Fix this by moving the check against UINT_MAX into round_pipe_size() which is called in both cases. Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pipe, sysctl: remove pipe_proc_fn()Eric Biggers2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pipe_proc_fn() is no longer needed, as it only calls through to proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS stub for it. (The reason the ENOSYS stub isn't needed is that the pipe-max-size ctl_table entry is located directly in 'kern_table' rather than being registered separately. Therefore, the entry is already only defined when the kernel is built with sysctl support.) Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pipe, sysctl: drop 'min' parameter from pipe-max-size converterEric Biggers2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "pipe: buffer limits fixes and cleanups", v2. This series simplifies the sysctl handler for pipe-max-size and fixes another set of bugs related to the pipe buffer limits: - The root user wasn't allowed to exceed the limits when creating new pipes. - There was an off-by-one error when checking the limits, so a limit of N was actually treated as N - 1. - F_SETPIPE_SZ accepted values over UINT_MAX. - Reading the pipe buffer limits could be racy. This patch (of 7): Before validating the given value against pipe_min_size, do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the value up to pipe_min_size. Therefore, the second check against pipe_min_size is redundant. Remove it. Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/resource: iomem_is_exclusive can be booleanYaowei Bai2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make iomem_is_exclusive return bool due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/cpuset: current_cpuset_is_being_rebound can be booleanYaowei Bai2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make current_cpuset_is_being_rebound return bool due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | genirq: remove unneeded kallsyms includeSergey Senozhatsky2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The file was converted from print_symbol() to %pf some time ago in commit ef26f20cd117 ("genirq: Print threaded handler in spurious debug output"). kallsyms does not seem to be needed anymore. Link: http://lkml.kernel.org/r/20171208025616.16267-10-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | hrtimer: remove unneeded kallsyms includeSergey Senozhatsky2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hrtimer does not seem to use any of kallsyms functions/defines. Link: http://lkml.kernel.org/r/20171208025616.16267-9-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kcov: detect double association with a single taskDmitry Vyukov2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently KCOV_ENABLE does not check if the current task is already associated with another kcov descriptor. As the result it is possible to associate a single task with more than one kcov descriptor, which later leads to a memory leak of the old descriptor. This relation is really meant to be one-to-one (task has only one back link). Extend validation to detect such misuse. Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com Fixes: 5c9a8750a640 ("kernel: add kcov code coverage") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Shankara Pailoor <sp3485@columbia.edu> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: syzbot <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/relay.c: revert "kernel/relay.c: fix potential memory leak"Eric Biggers2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ba62bafe942b ("kernel/relay.c: fix potential memory leak"). This commit introduced a double free bug, because 'chan' is already freed by the line: kref_put(&chan->kref, relay_destroy_channel); This bug was found by syzkaller, using the BLKTRACESETUP ioctl. Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com Fixes: ba62bafe942b ("kernel/relay.c: fix potential memory leak") Signed-off-by: Eric Biggers <ebiggers@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Zhouyi Zhou <yizhouzhou@ict.ac.cn> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pids: introduce find_get_task_by_vpid() helperMike Rapoport2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several functions that do find_task_by_vpid() followed by get_task_struct(). We can use a helper function instead. Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | cpumask: make cpumask_size() return "unsigned int"Alexey Dobriyan2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPUmasks are never big enough to warrant 64-bit code. Space savings: add/remove: 0/0 grow/shrink: 1/4 up/down: 3/-17 (-14) Function old new delta sched_init_numa 1530 1533 +3 compat_sys_sched_setaffinity 160 159 -1 sys_sched_getaffinity 197 195 -2 sys_sched_setaffinity 183 176 -7 compat_sys_sched_getaffinity 179 172 -7 Link: http://lkml.kernel.org/r/20171204165531.GA8221@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/fork.c: add comment about usage of CLONE_FS flags and namespacesMarcos Paulo de Souza2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All other places that deals with namespaces have an explanation of why the restriction is there. The description added in this commit was based on commit e66eded8309e ("userns: Don't allow CLONE_NEWUSER | CLONE_FS"). Link: http://lkml.kernel.org/r/20171112151637.13258-1-marcos.souza.org@gmail.com Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/fork.c: check error and return earlyMarcos Paulo de Souza2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thus reducing one indentation level while maintaining the same rationale. Link: http://lkml.kernel.org/r/20171117002929.5155-1-marcos.souza.org@gmail.com Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/async.c: revert "async: simplify lowest_in_progress()"Rasmus Villemoes2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 92266d6ef60c ("async: simplify lowest_in_progress()") which was simply wrong: In the case where domain is NULL, we now use the wrong offsetof() in the list_first_entry macro, so we don't actually fetch the ->cookie value, but rather the eight bytes located sizeof(struct list_head) further into the struct async_entry. On 64 bit, that's the data member, while on 32 bit, that's a u64 built from func and data in some order. I think the bug happens to be harmless in practice: It obviously only affects callers which pass a NULL domain, and AFAICT the only such caller is async_synchronize_full() -> async_synchronize_full_domain(NULL) -> async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL) and the ASYNC_COOKIE_MAX means that in practice we end up waiting for the async_global_pending list to be empty - but it would break if somebody happened to pass (void*)-1 as the data element to async_schedule, and of course also if somebody ever does a async_synchronize_cookie_domain(, NULL) with a "finite" cookie value. Maybe the "harmless in practice" means this isn't -stable material. But I'm not completely confident my quick git grep'ing is enough, and there might be affected code in one of the earlier kernels that has since been removed, so I'll leave the decision to the stable guys. Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk Fixes: 92266d6ef60c "async: simplify lowest_in_progress()" Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Adam Wallis <awallis@codeaurora.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | Makefile: introduce CONFIG_CC_STACKPROTECTOR_AUTOKees Cook2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nearly all modern compilers support a stack-protector option, and nearly all modern distributions enable the kernel stack-protector, so enabling this by default in kernel builds would make sense. However, Kconfig does not have knowledge of available compiler features, so it isn't safe to force on, as this would unconditionally break builds for the compilers or architectures that don't have support. Instead, this introduces a new option, CONFIG_CC_STACKPROTECTOR_AUTO, which attempts to discover the best possible stack-protector available, and will allow builds to proceed even if the compiler doesn't support any stack-protector. This option is made the default so that kernels built with modern compilers will be protected-by-default against stack buffer overflows, avoiding things like the recent BlueBorne attack. Selection of a specific stack-protector option remains available, including disabling it. Additionally, tiny.config is adjusted to use CC_STACKPROTECTOR_NONE, since that's the option with the least code size (and it used to be the default, so we have to explicitly choose it there now). Link: http://lkml.kernel.org/r/1510076320-69931-4-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Laura Abbott <labbott@redhat.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2018-02-06
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - membarrier updates (Mathieu Desnoyers) - SMP balancing optimizations (Mel Gorman) - stats update optimizations (Peter Zijlstra) - RT scheduler race fixes (Steven Rostedt) - misc fixes and updates * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS sched/fair: Do not migrate if the prev_cpu is idle sched/fair: Restructure wake_affine*() to return a CPU id sched/fair: Remove unnecessary parameters from wake_affine_idle() sched/rt: Make update_curr_rt() more accurate sched/rt: Up the root domain ref count when passing it around via IPIs sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() sched/core: Optimize update_stats_*() sched/core: Optimize ttwu_stat() membarrier/selftest: Test private expedited sync core command membarrier/arm64: Provide core serializing command membarrier/x86: Provide core serializing command membarrier: Provide core serializing command, *_SYNC_CORE lockin/x86: Implement sync_core_before_usermode() locking: Introduce sync_core_before_usermode() membarrier/selftest: Test global expedited command membarrier: Provide GLOBAL_EXPEDITED command membarrier: Document scheduler barrier requirements powerpc, membarrier: Skip memory barrier in switch_mm() membarrier/selftest: Test private expedited command
| * \ \ \ \ \ \ Merge branch 'linus' into sched/urgent, to resolve conflictsIngo Molnar2018-02-06
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/arm64/kernel/entry.S arch/x86/Kconfig include/linux/sched/mm.h kernel/fork.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/fair: Use a recently used CPU as an idle candidate and the basis for SISMel Gorman2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The select_idle_sibling() (SIS) rewrite in commit: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()") ... replaced a domain iteration with a search that broadly speaking does a wrapped walk of the scheduler domain sharing a last-level-cache. While this had a number of improvements, one consequence is that two tasks that share a waker/wakee relationship push each other around a socket. Even though two tasks may be active, all cores are evenly used. This is great from a search perspective and spreads a load across individual cores, but it has adverse consequences for cpufreq. As each CPU has relatively low utilisation, cpufreq may decide the utilisation is too low to used a higher P-state and overall computation throughput suffers. While individual cpufreq and cpuidle drivers may compensate by artifically boosting P-state (at c0) or avoiding lower C-states (during idle), it does not help if hardware-based cpufreq (e.g. HWP) is used. This patch tracks a recently used CPU based on what CPU a task was running on when it last was a waker a CPU it was recently using when a task is a wakee. During SIS, the recently used CPU is used as a target if it's still allowed by the task and is idle. The benefit may be non-obvious so consider an example of two tasks communicating back and forth. Task A may be an application doing IO where task B is a kworker or kthread like journald. Task A may issue IO, wake B and B wakes up A on completion. With the existing scheme this may look like the following (potentially different IDs if SMT is in use but similar principal applies). A (cpu 0) wake B (wakes on cpu 1) B (cpu 1) wake A (wakes on cpu 2) A (cpu 2) wake B (wakes on cpu 3) etc. A careful reader may wonder why CPU 0 was not idle when B wakes A the first time and it's simply due to the fact that A can be rescheduled to another CPU and the pattern is that prev == target when B tries to wakeup A and the information about CPU 0 has been lost. With this patch, the pattern is more likely to be: A (cpu 0) wake B (wakes on cpu 1) B (cpu 1) wake A (wakes on cpu 0) A (cpu 0) wake B (wakes on cpu 1) etc i.e. two communicating casts are more likely to use just two cores instead of all available cores sharing a LLC. The most dramatic speedup was noticed on dbench using the XFS filesystem on UMA as clients interact heavily with workqueues in that configuration. Note that a similar speedup is not observed on ext4 as the wakeup pattern is different: 4.15.0-rc9 4.15.0-rc9 waprev-v1 biasancestor-v1 Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%) Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%) Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%) Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%) Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%) The results can be less dramatic on NUMA where automatic balancing interferes with the test. It's also known that network benchmarks running on localhost also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP and TCP depending on the machine). Hackbench also seens small improvements (6-11% depending on machine and thread count). The facebook schbench was also tested but in most cases showed little or no different to wakeup latencies. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/fair: Do not migrate if the prev_cpu is idleMel Gorman2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wake_affine_idle() prefers to move a task to the current CPU if the wakeup is due to an interrupt. The expectation is that the interrupt data is cache hot and relevant to the waking task as well as avoiding a search. However, there is no way to determine if there was cache hot data on the previous CPU that may exceed the interrupt data. Furthermore, round-robin delivery of interrupts can migrate tasks around a socket where each CPU is under-utilised. This can interact badly with cpufreq which makes decisions based on per-cpu data. It has been observed on machines with HWP that p-states are not boosted to their maximum levels even though the workload is latency and throughput sensitive. This patch uses the previous CPU for the task if it's idle and cache-affine with the current CPU even if the current CPU is idle due to the wakup being related to the interrupt. This reduces migrations at the cost of the interrupt data not being cache hot when the task wakes. A variety of workloads were tested on various machines and no adverse impact was noticed that was outside noise. dbench on ext4 on UMA showed roughly 10% reduction in the number of CPU migrations and it is a case where interrupts are frequent for IO competions. In most cases, the difference in performance is quite small but variability is often reduced. For example, this is the result for pgbench running on a UMA machine with different numbers of clients. 4.15.0-rc9 4.15.0-rc9 baseline waprev-v1 Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%) Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%) Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%) Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%) Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%) Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%) Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%) Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%) Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%) Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%) CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%) Note that the different in performance is marginal but for low utilisation, there is less variability. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/fair: Restructure wake_affine*() to return a CPU idMel Gorman2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a preparation patch that has wake_affine*() return a CPU ID instead of a boolean. The intent is to allow the wake_affine() helpers to be avoided if a decision is already made. This patch has no functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/fair: Remove unnecessary parameters from wake_affine_idle()Mel Gorman2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wake_affine_idle() takes parameters it never uses so clean it up. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/rt: Make update_curr_rt() more accurateWen Yang2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/rt: Up the root domain ref count when passing it around via IPIsSteven Rostedt (VMware)2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When issuing an IPI RT push, where an IPI is sent to each CPU that has more than one RT task scheduled on it, it references the root domain's rto_mask, that contains all the CPUs within the root domain that has more than one RT task in the runable state. The problem is, after the IPIs are initiated, the rq->lock is released. This means that the root domain that is associated to the run queue could be freed while the IPIs are going around. Add a sched_get_rd() and a sched_put_rd() that will increment and decrement the root domain's ref count respectively. This way when initiating the IPIs, the scheduler will up the root domain's ref count before releasing the rq->lock, ensuring that the root domain does not go away until the IPI round is complete. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()Steven Rostedt (VMware)2018-02-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the rto_push_irq_work_func() is called, it looks at the RT overloaded bitmask in the root domain via the runqueue (rq->rd). The problem is that during CPU up and down, nothing here stops rq->rd from changing between taking the rq->rd->rto_lock and releasing it. That means the lock that is released is not the same lock that was taken. Instead of using this_rq()->rd to get the root domain, as the irq work is part of the root domain, we can simply get the root domain from the irq work that is passed to the routine: container_of(work, struct root_domain, rto_push_work) This keeps the root domain consistent. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>