aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* sched: Make __update_entity_runnable_avg() fastPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __update_entity_runnable_avg forms the core of maintaining an entity's runnable load average. In this function we charge the accumulated run-time since last update and handle appropriate decay. In some cases, e.g. a waking task, this time interval may be much larger than our period unit. Fortunately we can exploit some properties of our series to perform decay for a blocked update in constant time and account the contribution for a running update in essentially-constant* time. [*]: For any running entity they should be performing updates at the tick which gives us a soft limit of 1 jiffy between updates, and we can compute up to a 32 jiffy update in a single pass. C program to generate the magic constants in the arrays: #include <math.h> #include <stdio.h> #define N 32 #define WMULT_SHIFT 32 const long WMULT_CONST = ((1UL << N) - 1); double y; long runnable_avg_yN_inv[N]; void calc_mult_inv() { int i; double yn = 0; printf("inverses\n"); for (i = 0; i < N; i++) { yn = (double)WMULT_CONST * pow(y, i); runnable_avg_yN_inv[i] = yn; printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]); } printf("\n"); } long mult_inv(long c, int n) { return (c * runnable_avg_yN_inv[n]) >> WMULT_SHIFT; } void calc_yn_sum(int n) { int i; double sum = 0, sum_fl = 0, diff = 0; /* * We take the floored sum to ensure the sum of partial sums is never * larger than the actual sum. */ printf("sum y^n\n"); printf(" %8s %8s %8s\n", "exact", "floor", "error"); for (i = 1; i <= n; i++) { sum = (y * sum + y * 1024); sum_fl = floor(y * sum_fl+ y * 1024); printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl, sum_fl - sum); } printf("\n"); } void calc_conv(long n) { long old_n; int i = -1; printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n"); do { old_n = n; n = mult_inv(n, 1) + 1024; i++; } while (n != old_n); printf("%d> %ld\n", i - 1, n); printf("\n"); } void main() { y = pow(0.5, 1/(double)N); calc_mult_inv(); calc_conv(1024); calc_yn_sum(N); } [ Compile with -lm ] Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.277808946@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Update_cfs_shares at period edgePaul Turner2012-10-24
| | | | | | | | | | | | Now that our measurement intervals are small (~1ms) we can amortize the posting of update_shares() to be about each period overflow. This is a large cost saving for frequently switching tasks. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.200772172@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Refactor update_shares_cpu() -> update_blocked_avgs()Paul Turner2012-10-24
| | | | | | | | | | | | | Now that running entities maintain their own load-averages the work we must do in update_shares() is largely restricted to the periodic decay of blocked entities. This allows us to be a little less pessimistic regarding our occupancy on rq->lock and the associated rq->clock updates required. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.133999170@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Replace update_shares weight distribution with per-entity computationPaul Turner2012-10-24
| | | | | | | | | | | | Now that the machinery in place is in place to compute contributed load in a bottom up fashion; replace the shares distribution code within update_shares() accordingly. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Maintain runnable averages across throttled periodsPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | With bandwidth control tracked entities may cease execution according to user specified bandwidth limits. Charging this time as either throttled or blocked however, is incorrect and would falsely skew in either direction. What we actually want is for any throttled periods to be "invisible" to load-tracking as they are removed from the system for that interval and contribute normally otherwise. Do this by moderating the progression of time to omit any periods in which the entity belonged to a throttled hierarchy. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Normalize tg load contributions against runnable timePaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | Entities of equal weight should receive equitable distribution of cpu time. This is challenging in the case of a task_group's shares as execution may be occurring on multiple cpus simultaneously. To handle this we divide up the shares into weights proportionate with the load on each cfs_rq. This does not however, account for the fact that the sum of the parts may be less than one cpu and so we need to normalize: load(tg) = min(runnable_avg(tg), 1) * tg->shares Where runnable_avg is the aggregate time in which the task_group had runnable children. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com>. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Compute load contribution by a group entityPaul Turner2012-10-24
| | | | | | | | | | | | | | Unlike task entities who have a fixed weight, group entities instead own a fraction of their parenting task_group's shares as their contributed weight. Compute this fraction so that we can correctly account hierarchies and shared entity nodes. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.855074415@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Aggregate total task_group loadPaul Turner2012-10-24
| | | | | | | | | | | | Maintain a global running sum of the average load seen on each cfs_rq belonging to each task group so that it may be used in calculating an appropriate shares:weight distribution. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Account for blocked load waking back upPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | When a running entity blocks we migrate its tracked load to cfs_rq->blocked_runnable_avg. In the sleep case this occurs while holding rq->lock and so is a natural transition. Wake-ups however, are potentially asynchronous in the presence of migration and so special care must be taken. We use an atomic counter to track such migrated load, taking care to match this with the previously introduced decay counters so that we don't migrate too much load. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.726077467@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Add an rq migration call-back to sched_classPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | Since we are now doing bottom up load accumulation we need explicit notification when a task has been re-parented so that the old hierarchy can be updated. Adds: migrate_task_rq(struct task_struct *p, int next_cpu) (The alternative is to do this out of __set_task_cpu, but it was suggested that this would be a cleaner encapsulation.) Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.660023400@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Maintain the load contribution of blocked entitiesPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are currently maintaining: runnable_load(cfs_rq) = \Sum task_load(t) For all running children t of cfs_rq. While this can be naturally updated for tasks in a runnable state (as they are scheduled); this does not account for the load contributed by blocked task entities. This can be solved by introducing a separate accounting for blocked load: blocked_load(cfs_rq) = \Sum runnable(b) * weight(b) Obviously we do not want to iterate over all blocked entities to account for their decay, we instead observe that: runnable_load(t) = \Sum p_i*y^i and that to account for an additional idle period we only need to compute: y*runnable_load(t). This means that we can compute all blocked entities at once by evaluating: blocked_load(cfs_rq)` = y * blocked_load(cfs_rq) Finally we maintain a decay counter so that when a sleeping entity re-awakens we can determine how much of its load should be removed from the blocked sum. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.585389902@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Aggregate load contributed by task entities on parenting cfs_rqPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | | For a given task t, we can compute its contribution to load as: task_load(t) = runnable_avg(t) * weight(t) On a parenting cfs_rq we can then aggregate: runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t Maintain this bottom up, with task entities adding their contributed load to the parenting cfs_rq sum. When a task entity's load changes we add the same delta to the maintained sum. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Maintain per-rq runnable averagesBen Segall2012-10-24
| | | | | | | | | | | Since runqueues do not have a corresponding sched_entity we instead embed a sched_avg structure directly. Signed-off-by: Ben Segall <bsegall@google.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.442637130@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Track the runnable average on a per-task entity basisPaul Turner2012-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | Instead of tracking averaging the load parented by a cfs_rq, we can track entity load directly. With the load for a given cfs_rq then being the sum of its children. To do this we represent the historical contribution to runnable average within each trailing 1024us of execution as the coefficients of a geometric series. We can express this for a given task t as: runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t) Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms and y is chosen such that y^k = 1/2. We currently choose k to be 32 which roughly translates to about a sched period. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'stable/for-linus-3.7-rc2-tag' of ↵Linus Torvalds2012-10-23
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull xen bug-fixes from Konrad Rzeszutek Wilk: - Fix mysterious SIGSEGV or SIGKILL in applications due to corrupting of the %eip when returning from a signal handler. - Fix various ARM compile issues after the merge fallout. - Continue on making more of the Xen generic code usable by ARM platform. - Fix SR-IOV passthrough to mirror multifunction PCI devices. - Fix various compile warnings. - Remove hypercalls that don't exist anymore. * tag 'stable/for-linus-3.7-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen: dbgp: Fix warning when CONFIG_PCI is not enabled. xen: arm: comment on why 64-bit xen_pfn_t is safe even on 32 bit xen: balloon: use correct type for frame_list xen/x86: don't corrupt %eip when returning from a signal handler xen: arm: make p2m operations NOPs xen: balloon: don't include e820.h xen: grant: use xen_pfn_t type for frame_list. xen: events: pirq_check_eoi_map is X86 specific xen: XENMEM_translate_gpfn_list was remove ages ago and is unused. xen: sysfs: fix build warning. xen: sysfs: include err.h for PTR_ERR etc xen: xenbus: quirk uses x86 specific cpuid xen PV passthru: assign SR-IOV virtual functions to separate virtual slots xen/xenbus: Fix compile warning. xen/x86: remove duplicated include from enlighten.c
| * xen: dbgp: Fix warning when CONFIG_PCI is not enabled.Ian Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | I saw this on ARM: linux/drivers/xen/dbgp.c:11:23: warning: unused variable 'ctrlr' [-Wunused-variable] Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * Merge commit 'v3.7-rc1' into stable/for-linus-3.7Konrad Rzeszutek Wilk2012-10-19
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * commit 'v3.7-rc1': (10892 commits) Linux 3.7-rc1 x86, boot: Explicitly include autoconf.h for hostprogs perf: Fix UAPI fallout ARM: config: make sure that platforms are ordered by option string ARM: config: sort select statements alphanumerically UAPI: (Scripted) Disintegrate include/linux/byteorder UAPI: (Scripted) Disintegrate include/linux UAPI: Unexport linux/blk_types.h UAPI: Unexport part of linux/ppp-comp.h perf: Handle new rbtree implementation procfs: don't need a PATH_MAX allocation to hold a string representation of an int vfs: embed struct filename inside of names_cache allocation if possible audit: make audit_inode take struct filename vfs: make path_openat take a struct filename pointer vfs: turn do_path_lookup into wrapper around struct filename variant audit: allow audit code to satisfy getname requests from its names_list vfs: define struct filename and have getname() return it btrfs: Fix compilation with user namespace support enabled userns: Fix posix_acl_file_xattr_userns gid conversion userns: Properly print bluetooth socket uids ...
| * | xen: arm: comment on why 64-bit xen_pfn_t is safe even on 32 bitIan Campbell2012-10-19
| | | | | | | | | | | | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: balloon: use correct type for frame_listIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | This is now a xen_pfn_t. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen/x86: don't corrupt %eip when returning from a signal handlerDavid Vrabel2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In 32 bit guests, if a userspace process has %eax == -ERESTARTSYS (-512) or -ERESTARTNOINTR (-513) when it is interrupted by an event /and/ the process has a pending signal then %eip (and %eax) are corrupted when returning to the main process after handling the signal. The application may then crash with SIGSEGV or a SIGILL or it may have subtly incorrect behaviour (depending on what instruction it returned to). The occurs because handle_signal() is incorrectly thinking that there is a system call that needs to restarted so it adjusts %eip and %eax to re-execute the system call instruction (even though user space had not done a system call). If %eax == -514 (-ERESTARTNOHAND (-514) or -ERESTART_RESTARTBLOCK (-516) then handle_signal() only corrupted %eax (by setting it to -EINTR). This may cause the application to crash or have incorrect behaviour. handle_signal() assumes that regs->orig_ax >= 0 means a system call so any kernel entry point that is not for a system call must push a negative value for orig_ax. For example, for physical interrupts on bare metal the inverse of the vector is pushed and page_fault() sets regs->orig_ax to -1, overwriting the hardware provided error code. xen_hypervisor_callback() was incorrectly pushing 0 for orig_ax instead of -1. Classic Xen kernels pushed %eax which works as %eax cannot be both non-negative and -RESTARTSYS (etc.), but using -1 is consistent with other non-system call entry points and avoids some of the tests in handle_signal(). There were similar bugs in xen_failsafe_callback() of both 32 and 64-bit guests. If the fault was corrected and the normal return path was used then 0 was incorrectly pushed as the value for orig_ax. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Cc: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: arm: make p2m operations NOPsIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes common code less ifdef-y and is consistent with PVHVM on x86. Also note that phys_to_machine_mapping_valid should take a pfn argument and make it do so. Add __set_phys_to_machine, make set_phys_to_machine a simple wrapper (on systems with non-nop implementations the outer one can allocate new p2m pages). Make __set_phys_to_machine check for identity mapping or invalid only. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: balloon: don't include e820.hIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | This breaks on !X86 and AFAICT is not required on X86 either. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: grant: use xen_pfn_t type for frame_list.Ian Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This correctly sizes it as 64 bit on ARM but leaves it as unsigned long on x86 (therefore no intended change on x86). The long and ulong guest handles are now unused (and a bit dangerous) so remove them. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: events: pirq_check_eoi_map is X86 specificIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On ARM I see: drivers/xen/events.c:280:13: warning: 'pirq_check_eoi_map' defined but not used [-Wunused-function] Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: XENMEM_translate_gpfn_list was remove ages ago and is unused.Ian Campbell2012-10-19
| | | | | | | | | | | | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: sysfs: fix build warning.Ian Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define PRI macros for xen_ulong_t and xen_pfn_t and use to fix: drivers/xen/sys-hypervisor.c:288:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'xen_ulong_t' [-Wformat] Ideally this would use PRIx64 on ARM but these (or equivalent) don't seem to be available in the kernel. Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: sysfs: include err.h for PTR_ERR etcIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes build error on ARM: drivers/xen/sys-hypervisor.c: In function 'uuid_show_fallback': drivers/xen/sys-hypervisor.c:127:2: error: implicit declaration of function 'IS_ERR' [-Werror=implicit-function-declaration] drivers/xen/sys-hypervisor.c:128:3: error: implicit declaration of function 'PTR_ERR' [-Werror=implicit-function-declaration] Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen: xenbus: quirk uses x86 specific cpuidIan Campbell2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This breaks on ARM. This quirk is not necessary on ARM because no hypervisors of that vintage exist for that architecture (port is too new). Signed-off-by: Ian Campbell <ian.campbell@citrix.com> [v1: Moved the ifdef inside the function per Jan Beulich suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen PV passthru: assign SR-IOV virtual functions to separate virtual slotsLaszlo Ersek2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VFs are reported as single-function devices in PCI_HEADER_TYPE, which causes pci_scan_slot() in the PV domU to skip all VFs beyond #0 in the pciback-provided slot. Avoid this by assigning each VF to a separate virtual slot. Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen/xenbus: Fix compile warning.Konrad Rzeszutek Wilk2012-10-19
| | | | | | | | | | | | | | | | | | | | | We were missing the 'void' on the parameter arguments. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | xen/x86: remove duplicated include from enlighten.cWei Yongjun2012-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove duplicated include. dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) CC: stable@vger.kernel.org Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | alpha: separate thread-synchronous flagsAl Viro2012-10-23
| | | | | | | | | | | | | | | | | | | | | ... and fix the race in updating unaligned control ones Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'kvm-3.7-2' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2012-10-23
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Avi Kivity: "KVM updates for 3.7-rc2" * tag 'kvm-3.7-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM guest: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT KVM: apic: fix LDR calculation in x2apic mode KVM: MMU: fix release noslot pfn
| * | | KVM guest: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENTSasha Levin2012-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM_PV_REASON_PAGE_NOT_PRESENT kicks cpu out of idleness, but we haven't marked that spot as an exit from idleness. Not doing so can cause RCU warnings such as: [ 732.788386] =============================== [ 732.789803] [ INFO: suspicious RCU usage. ] [ 732.790032] 3.7.0-rc1-next-20121019-sasha-00002-g6d8d02d-dirty #63 Tainted: G W [ 732.790032] ------------------------------- [ 732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally while idle! [ 732.790032] [ 732.790032] other info that might help us debug this: [ 732.790032] [ 732.790032] [ 732.790032] RCU used illegally from idle CPU! [ 732.790032] rcu_scheduler_active = 1, debug_locks = 1 [ 732.790032] RCU used illegally from extended quiescent state! [ 732.790032] 2 locks held by trinity-child31/8252: [ 732.790032] #0: (&rq->lock){-.-.-.}, at: [<ffffffff83a67528>] __schedule+0x178/0x8f0 [ 732.790032] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff81152bde>] cpuacct_charge+0xe/0x200 [ 732.790032] [ 732.790032] stack backtrace: [ 732.790032] Pid: 8252, comm: trinity-child31 Tainted: G W 3.7.0-rc1-next-20121019-sasha-00002-g6d8d02d-dirty #63 [ 732.790032] Call Trace: [ 732.790032] [<ffffffff8118266b>] lockdep_rcu_suspicious+0x10b/0x120 [ 732.790032] [<ffffffff81152c60>] cpuacct_charge+0x90/0x200 [ 732.790032] [<ffffffff81152bde>] ? cpuacct_charge+0xe/0x200 [ 732.790032] [<ffffffff81158093>] update_curr+0x1a3/0x270 [ 732.790032] [<ffffffff81158a6a>] dequeue_entity+0x2a/0x210 [ 732.790032] [<ffffffff81158ea5>] dequeue_task_fair+0x45/0x130 [ 732.790032] [<ffffffff8114ae29>] dequeue_task+0x89/0xa0 [ 732.790032] [<ffffffff8114bb9e>] deactivate_task+0x1e/0x20 [ 732.790032] [<ffffffff83a67c29>] __schedule+0x879/0x8f0 [ 732.790032] [<ffffffff8117e20d>] ? trace_hardirqs_off+0xd/0x10 [ 732.790032] [<ffffffff810a37a5>] ? kvm_async_pf_task_wait+0x1d5/0x2b0 [ 732.790032] [<ffffffff83a67cf5>] schedule+0x55/0x60 [ 732.790032] [<ffffffff810a37c4>] kvm_async_pf_task_wait+0x1f4/0x2b0 [ 732.790032] [<ffffffff81139e50>] ? abort_exclusive_wait+0xb0/0xb0 [ 732.790032] [<ffffffff81139c25>] ? prepare_to_wait+0x25/0x90 [ 732.790032] [<ffffffff810a3a66>] do_async_page_fault+0x56/0xa0 [ 732.790032] [<ffffffff83a6a6e8>] async_page_fault+0x28/0x30 Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Gleb Natapov <gleb@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * | | KVM: apic: fix LDR calculation in x2apic modeGleb Natapov2012-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * | | KVM: MMU: fix release noslot pfnXiao Guangrong2012-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can not directly call kvm_release_pfn_clean to release the pfn since we can meet noslot pfn which is used to cache mmio info into spte Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>
* | | | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2012-10-23
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Most of these are uprobes race fixes from Oleg, and their preparatory cleanups. (It's larger than what I'd normally send for an -rc kernel, but they looked significant enough to not delay them.) There's also an oprofile fix and an uncore PMU fix." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) perf/x86: Disable uncore on virtualized CPUs oprofile, x86: Fix wrapping bug in op_x86_get_ctrl() ring-buffer: Check for uninitialized cpu buffer before resizing uprobes: Fix the racy uprobe->flags manipulation uprobes: Fix prepare_uprobe() race with itself uprobes: Introduce prepare_uprobe() uprobes: Fix handle_swbp() vs unregister() + register() race uprobes: Do not delete uprobe if uprobe_unregister() fails uprobes: Don't return success if alloc_uprobe() fails uprobes/x86: Only rep+nop can be emulated correctly uprobes: Simplify is_swbp_at_addr(), remove stale comments uprobes: Kill set_orig_insn()->is_swbp_at_addr() uprobes: Introduce copy_opcode(), kill read_opcode() uprobes: Kill set_swbp()->is_swbp_at_addr() uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC uprobes: Change write_opcode() to use FOLL_FORCE uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume() uprobes: Kill UTASK_BP_HIT state uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp() ...
| * \ \ \ Merge branch 'tip/perf/urgent' of ↵Ingo Molnar2012-10-21
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull ftrace ring-buffer resizing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | ring-buffer: Check for uninitialized cpu buffer before resizingVaibhav Nagarnaik2012-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With a system where, num_present_cpus < num_possible_cpus, even if all CPUs are online, non-present CPUs don't have per_cpu buffers allocated. If per_cpu/<cpu>/buffer_size_kb is modified for such a CPU, it can cause a panic due to NULL dereference in ring_buffer_resize(). To fix this, resize operation is allowed only if the per-cpu buffer has been initialized. Link: http://lkml.kernel.org/r/1349912427-6486-1-git-send-email-vnagarnaik@google.com Cc: stable@vger.kernel.org # 3.5+ Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | | Merge branch 'uprobes/core' of ↵Ingo Molnar2012-10-21
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/urgent Pull various uprobes bugfixes from Oleg Nesterov - mostly race and failure path fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | | uprobes: Fix the racy uprobe->flags manipulationOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple threads can manipulate uprobe->flags, this is obviously unsafe. For example mmap can set UPROBE_COPY_INSN while register tries to set UPROBE_RUN_HANDLER, the latter can also race with can_skip_sstep() which clears UPROBE_SKIP_SSTEP. Change this code to use bitops. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Fix prepare_uprobe() race with itselfOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | install_breakpoint() is called under mm->mmap_sem, this protects set_swbp() but not prepare_uprobe(). Two or more different tasks can call install_breakpoint()->prepare_uprobe() at the same time, this leads to numerous problems if UPROBE_COPY_INSN is not set. Just for example, the second copy_insn() can corrupt the already analyzed/fixuped uprobe->arch.insn and race with handle_swbp(). This patch simply adds uprobe->copy_mutex to serialize this code. We could probably reuse ->consumer_rwsem, but this would mean that consumer->handler() can not use mm->mmap_sem, not good. Note: this is another temporary ugly hack until we move this logic into uprobe_register(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Introduce prepare_uprobe()Oleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparation. Extract the copy_insn/arch_uprobe_analyze_insn code from install_breakpoint() into the new helper, prepare_uprobe(). And move uprobe->flags defines from uprobes.h to uprobes.c, nobody else can use them anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Fix handle_swbp() vs unregister() + register() raceOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Strictly speaking this race was added by me in 56bb4cf6. However I think that this bug is just another indication that we should move copy_insn/uprobe_analyze_insn code from install_breakpoint() to uprobe_register(), there are a lot of other reasons for that. Until then, add a hack to close the race. A task can hit uprobe U1, but before it calls find_uprobe() this uprobe can be unregistered *AND* another uprobe U2 can be added to uprobes_tree at the same inode/offset. In this case handle_swbp() will use the not-fully-initialized U2, in particular its arch.insn for xol. Add the additional !UPROBE_COPY_INSN check into handle_swbp(), if this flag is not set we simply restart as if the new uprobe was not inserted yet. This is not very nice, we need barriers, but we will remove this hack when we change uprobe_register(). Note: with or without this patch install_breakpoint() can race with itself, yet another reson to kill UPROBE_COPY_INSN altogether. And even the usage of uprobe->flags is not safe. See the next patches. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Do not delete uprobe if uprobe_unregister() failsOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | delete_uprobe() must not be called if register_for_each_vma(false) fails to remove all breakpoints, __uprobe_unregister() is correct. The problem is that register_for_each_vma(false) always returns 0 and thus this logic does not work. 1. Change verify_opcode() to return 0 rather than -EINVAL when unregister detects the !is_swbp insn, we can treat this case as success and currently unregister paths ignore the error code anyway. 2. Change remove_breakpoint() to propagate the error code from write_opcode(). 3. Change register_for_each_vma(is_register => false) to remove as much breakpoints as possible but return non-zero if remove_breakpoint() fails at least once. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Don't return success if alloc_uprobe() failsOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If alloc_uprobe() fails uprobe_register() should return ENOMEM, not 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes/x86: Only rep+nop can be emulated correctlyOleg Nesterov2012-10-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __skip_sstep() correctly detects the "nontrivial" nop insns, but since it doesn't update regs->ip we can not really skip "0x0f 0x1f | 0x0f 0x19 | 0x87 0xc0", the probed application is killed by SIGILL'ed handle_swbp(). Remove these additional checks. If we want to implement this correctly we need to know the full insn length to update ->ip. rep* + nop is fine even without updating ->ip. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Simplify is_swbp_at_addr(), remove stale commentsOleg Nesterov2012-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the previous change is_swbp_at_addr() is always called with current->mm. Remove this check and move it close to its single caller. Also, remove the obsolete comment about is_swbp_at_addr() and uprobe_state.count. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Kill set_orig_insn()->is_swbp_at_addr()Oleg Nesterov2012-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike set_swbp(), set_orig_insn()->is_swbp_at_addr() makes sense, although it can't prevent all confusions. But the usage of is_swbp_at_addr() is equally confusing, and it adds the extra get_user_pages() we can avoid. This patch removes set_orig_insn()->is_swbp_at_addr() but changes write_opcode() to do the necessary checks before replace_page(). Perhaps it also makes sense to ensure PAGE_MAPPING_ANON in unregister case. find_active_uprobe() becomes the only user of is_swbp_at_addr(), we can change its semantics. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
| | * | | | | uprobes: Introduce copy_opcode(), kill read_opcode()Oleg Nesterov2012-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No functional changes, preparations. 1. Extract the kmap-and-memcpy code from read_opcode() into the new trivial helper, copy_opcode(). The next patch will add another user. 2. read_opcode() becomes really trivial, fold it into its single caller, is_swbp_at_addr(). 3. Remove "auprobe" argument from write_opcode(), it is not used since f403072c6. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>