aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* Merge branch 'kvm-updates/2.6.28' of ↵Linus Torvalds2008-10-28
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: KVM: ia64: Makefile fix for forcing to re-generate asm-offsets.h KVM: Future-proof device assignment ABI KVM: ia64: Fix halt emulation logic KVM: Fix guest shared interrupt with in-kernel irqchip KVM: MMU: sync root on paravirt TLB flush
| * KVM: ia64: Makefile fix for forcing to re-generate asm-offsets.hXiantao Zhang2008-10-28
| | | | | | | | | | | | | | To avoid using stale asm-offsets.h. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * KVM: Future-proof device assignment ABIAvi Kivity2008-10-28
| | | | | | | | | | | | Reserve some space so we can add more data. Signed-off-by: Avi Kivity <avi@qumranet.com>
| * KVM: ia64: Fix halt emulation logicXiantao Zhang2008-10-28
| | | | | | | | | | | | | | | | | | | | Common halt logic was changed by x86 and did not update ia64. This patch updates halt for ia64. Fixes a regression causing guests to hang with more than 2 vcpus. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * KVM: Fix guest shared interrupt with in-kernel irqchipSheng Yang2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every call of kvm_set_irq() should offer an irq_source_id, which is allocated by kvm_request_irq_source_id(). Based on irq_source_id, we identify the irq source and implement logical OR for shared level interrupts. The allocated irq_source_id can be freed by kvm_free_irq_source_id(). Currently, we support at most sizeof(unsigned long) different irq sources. [Amit: - rebase to kvm.git HEAD - move definition of KVM_USERSPACE_IRQ_SOURCE_ID to common file - move kvm_request_irq_source_id to the update_irq ioctl] [Xiantao: - Add kvm/ia64 stuff and make it work for kvm/ia64 guests] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * KVM: MMU: sync root on paravirt TLB flushMarcelo Tosatti2008-10-28
| | | | | | | | | | | | | | | | The pvmmu TLB flush handler should request a root sync, similarly to a native read-write CR3. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* | Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2008-10-28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix irqs on/off ip tracing lockdep: minor fix for debug_show_all_locks() x86: restore the old swiotlb alloc_coherent behavior x86: use GFP_DMA for 24bit coherent_dma_mask swiotlb: remove panic for alloc_coherent failure xen: compilation fix of drivers/xen/events.c on IA64 xen: portability clean up and some minor clean up for xencomm.c xen: don't reload cr3 on suspend kernel/resource: fix reserve_region_with_split() section mismatch printk: remove unused code from kernel/printk.c
| * | lockdep: fix irqs on/off ip tracingHeiko Carstens2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix lockdep lock-api-caller output when irqsoff tracing is enabled 81d68a96 "ftrace: trace irq disabled critical timings" added wrappers around trace_hardirqs_on/off_caller. However these functions use __builtin_return_address(0) to figure out which function actually disabled or enabled irqs. The result is that we save the ips of trace_hardirqs_on/off instead of the real caller. Not very helpful. However since the patch from Steven the ip already gets passed. So use that and get rid of __builtin_return_address(0) in these two functions. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | lockdep: minor fix for debug_show_all_locks()qinghuang feng2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we failed to get tasklist_lock eventually (count equals 0), we should only print " ignoring it.\n", and not print " locked it.\n" needlessly. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | x86: restore the old swiotlb alloc_coherent behaviorFUJITA Tomonori2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This restores the old swiotlb alloc_coherent behavior (before the alloc_coherent rewrite): http://lkml.org/lkml/2008/8/12/200 The old alloc_coherent avoids GFP_DMA allocation first and if the allocated address is not fit for the device's coherent_dma_mask, then dma_alloc_coherent does GFP_DMA allocation. If it fails, alloc_coherent calls swiotlb_alloc_coherent (in short, we rarely used swiotlb_alloc_coherent). After the alloc_coherent rewrite, dma_alloc_coherent (include/asm-x86/dma-mapping.h) directly calls swiotlb_alloc_coherent. It means that we possibly can't handle a device having dma_masks > 24bit < 32bits since swiotlb_alloc_coherent doesn't have the above GFP_DMA retry mechanism. This patch fixes x86's swiotlb alloc_coherent to use the GFP_DMA retry mechanism, which dma_generic_alloc_coherent() provides now (pci-nommu.c and GART IOMMU driver also use dma_generic_alloc_coherent). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | x86: use GFP_DMA for 24bit coherent_dma_maskFUJITA Tomonori2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dma_alloc_coherent (include/asm-x86/dma-mapping.h) avoids GFP_DMA allocation first and if the allocated address is not fit for the device's coherent_dma_mask, then dma_alloc_coherent does GFP_DMA allocation. This is because dma_alloc_coherent avoids precious GFP_DMA zone if possible. This is also how the old dma_alloc_coherent (arch/x86/kernel/pci-dma.c) works. However, if the coherent_dma_mask of a device is 24bit, there is no point to go into the above GFP_DMA retry mechanism. We had better use GFP_DMA in the first place. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | swiotlb: remove panic for alloc_coherent failureFUJITA Tomonori2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | swiotlb_alloc_coherent calls panic() when allocated swiotlb pages is not fit for a device's dma mask. However, alloc_coherent failure is not a disaster at all. AFAIK, none of other x86 and IA64 IOMMU implementations don't crash in case of alloc_coherent failure. There are some drivers that don't check alloc_coherent failure but not many (about ten and I've already started to fix some of them). alloc_coherent returns NULL in case of failure so it's likely that these guilty drivers crash immediately. So swiotlb doesn't need to call panic() just for them. Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | xen: compilation fix of drivers/xen/events.c on IA64Isaku Yamahata2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | use set_xen_guest_handle() instead of direct assigning. > linux-2.6/drivers/xen/events.c: In function 'xen_poll_irq': > linux-2.6/drivers/xen/events.c:757: error: incompatible types in assignment > make[4]: *** [drivers/xen/events.o] Error 1 Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | xen: portability clean up and some minor clean up for xencomm.cIsaku Yamahata2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clean up of xencomm.c. is_phys_contiguous() is arch dependent function that depends on how virtual memory are laid out. So split out the function into arch specific code. Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com>
| * | xen: don't reload cr3 on suspendJeremy Fitzhardinge2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | It isn't necessary, and it makes the code needlessly non-portable. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | kernel/resource: fix reserve_region_with_split() section mismatchPaul Mundt2008-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup, small kernel text size reduction, no functionality changed reserve_region_with_split() calls in to __reserve_region_with_split(), which is an __init function. The only caller of reserve_region_with_split() is an __init function, so make it __init too. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | printk: remove unused code from kernel/printk.croel kluin2008-10-23
| | | | | | | | | | | | | | | | | | both log_buf_copy() and log_buf_len are unused. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds2008-10-28
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq: make variable static
| * | | irq: make variable staticroel kluin2008-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This variable is only used in the source file, so make it static. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2008-10-28
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix documentation reference for sched_min_granularity_ns sched: virtual time buddy preemption sched: re-instate vruntime based wakeup preemption sched: weaken sync hint sched: more accurate min_vruntime accounting sched: fix a find_busiest_group buglet sched: add CONFIG_SMP consistency
| * | | | sched: fix documentation reference for sched_min_granularity_nsJiri Kosina2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: documentation fix sched-design-CFS.txt wrongly references sched_granularity_ns sysctl, as its name in fact is sched_min_granularity_ns. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | sched: virtual time buddy preemptionPeter Zijlstra2008-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we moved wakeup preemption back to virtual time, it makes sense to move the buddy stuff back as well. The purpose of the buddy scheduling is to allow a quickly scheduling pair of tasks to run away from the group as far as a regular busy task would be allowed under wakeup preemption. This has the advantage that the pair can ping-pong for a while, enjoying cache-hotness. Without buddy scheduling other tasks would interleave destroying the cache. Also, it saves a word in cfs_rq. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | sched: re-instate vruntime based wakeup preemptionPeter Zijlstra2008-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The advantage is that vruntime based wakeup preemption has a better conceptual model. Here wakeup_gran = 0 means: preempt when 'fair'. Therefore wakeup_gran is the granularity of unfairness we allow in order to make progress. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | sched: weaken sync hintMike Galbraith2008-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mysql+oltp and pgsql+oltp peaks are still shifted right. The below puts the peaks back to 1 client/server pair per core. Use the avg_overlap information to weaken the sync hint. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | sched: more accurate min_vruntime accountingPeter Zijlstra2008-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mike noticed the current min_vruntime tracking can go wrong and skip the current task. If the only remaining task in the tree is a nice 19 task with huge vruntime, new tasks will be inserted too far to the right too, causing some interactibity issues. min_vruntime can only change due to the leftmost entry disappearing (dequeue_entity()), or by the leftmost entry being incremented past the next entry, which elects a new leftmost (__update_curr()) Due to the current entry not being part of the actual tree, we have to compare the leftmost tree entry with the current entry, and take the leftmost of these two. So create a update_min_vruntime() function that takes computes the leftmost vruntime in the system (either tree of current) and increases the cfs_rq->min_vruntime if the computed value is larger than the previously found min_vruntime. And call this from the two sites we've identified that can change min_vruntime. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | sched: fix a find_busiest_group bugletPeter Zijlstra2008-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In one of the group load balancer patches: commit 408ed066b11cf9ee4536573b4269ee3613bd735e Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Jun 27 13:41:28 2008 +0200 Subject: sched: hierarchical load vs find_busiest_group The following change: - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >= + if (max_load - this_load + 2*busiest_load_per_task >= busiest_load_per_task * imbn) { made the condition always true, because imbn is [1,2]. Therefore, remove the 2*, and give the it a fair chance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | Merge commit 'v2.6.28-rc1' into sched/urgentIngo Molnar2008-10-24
| |\ \ \ \
| * | | | | sched: add CONFIG_SMP consistencyLi Zefan2008-10-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a patch from Henrik Austad did this: >> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is >> not set. Peter observed: > While a proper cleanup, could you do it by re-arranging the methods so > as to not create an additional ifdef? Do not declare select_task_rq and some other methods as part of sched_class when CONFIG_SMP is not set. Also gather those methods to avoid CONFIG_SMP mess. Idea-by: Henrik Austad <henrik.austad@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Henrik Austad <henrik@austad.us> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2008-10-28
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, memory hotplug: remove wrong -1 in calling init_memory_mapping() x86: keep the /proc/meminfo page count correct x86/uv: memory allocation at initialization xen: fix Xen domU boot with batched mprotect
| * | | | | | x86, memory hotplug: remove wrong -1 in calling init_memory_mapping()Shaohua Li2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix crash with memory hotplug Shuahua Li found: | I just did some experiments on a desktop for memory hotplug and this bug | triggered a crash in my test. | | Yinghai's suggestion also fixed the bug. We don't need to round it, just remove that extra -1 Signed-off-by: Yinghai <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | x86: keep the /proc/meminfo page count correctYinghai Lu2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: get correct page count in /proc/meminfo found page count in /proc/meminfo is nor correct on 1G system in VirtualBox 2.0.4 # cat /proc/meminfo MemTotal: 1017508 kB MemFree: 822700 kB Buffers: 1456 kB Cached: 26632 kB SwapCached: 0 kB ... Hugepagesize: 2048 kB DirectMap4k: 4032 kB DirectMap2M: 18446744073709549568 kB with this patch get: ... DirectMap4k: 4032 kB DirectMap2M: 1044480 kB which is consistent to kernel_page_tables ---[ Low Kernel Mapping ]--- 0xffff880000000000-0xffff880000001000 4K RW PCD GLB x pte 0xffff880000001000-0xffff88000009f000 632K RW GLB x pte 0xffff88000009f000-0xffff8800000a0000 4K RW PCD GLB x pte 0xffff8800000a0000-0xffff880000200000 1408K RW GLB x pte 0xffff880000200000-0xffff88003fe00000 1020M RW PSE GLB x pmd 0xffff88003fe00000-0xffff88003fff0000 1984K RW GLB NX pte 0xffff88003fff0000-0xffff880040000000 64K pte 0xffff880040000000-0xffff888000000000 511G pud 0xffff888000000000-0xffffc20000000000 58880G pgd Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | x86/uv: memory allocation at initializationCliff Wickman2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: on SGI UV platforms, fix boot crash UV initialization is currently called too late to call alloc_bootmem_pages(). The current sequence is: start_kernel() mem_init() free_all_bootmem() <--- discard of bootmem rest_init() kernel_init() smp_prepare_cpus() native_smp_prepare_cpus() uv_system_init() <--- uses alloc_bootmem_pages() It should be calling kmalloc(). Signed-off-by: Cliff Wickman <cpw@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | xen: fix Xen domU boot with batched mprotectChris Lalancette2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix guest kernel boot crash on certain configs Recent i686 2.6.27 kernels with a certain amount of memory (between 736 and 855MB) have a problem booting under a hypervisor that supports batched mprotect (this includes the RHEL-5 Xen hypervisor as well as any 3.3 or later Xen hypervisor). The problem ends up being that xen_ptep_modify_prot_commit() is using virt_to_machine to calculate which pfn to update. However, this only works for pages that are in the p2m list, and the pages coming from change_pte_range() in mm/mprotect.c are kmap_atomic pages. Because of this, we can run into the situation where the lookup in the p2m table returns an INVALID_MFN, which we then try to pass to the hypervisor, which then (correctly) denies the request to a totally bogus pfn. The right thing to do is to use arbitrary_virt_to_machine, so that we can be sure we are modifying the right pfn. This unfortunately introduces a performance penalty because of a full page-table-walk, but we can avoid that penalty for pages in the p2m list by checking if virt_addr_valid is true, and if so, just doing the lookup in the p2m table. The attached patch implements this, and allows my 2.6.27 i686 based guest with 768MB of memory to boot on a RHEL-5 hypervisor again. Thanks to Jeremy for the suggestions about how to fix this particular issue. Signed-off-by: Chris Lalancette <clalance@redhat.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Chris Lalancette <clalance@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6Linus Torvalds2008-10-28
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: [S390] s390: Fix build for !CONFIG_S390_GUEST + CONFIG_VIRTIO_CONSOLE [S390] No more 4kb stacks. [S390] Change default IPL method to IPL_VM. [S390] tape: disable interrupts in tape_open and tape_release [S390] appldata: unsigned ops->size cannot be negative [S390] tape block: complete request with correct locking [S390] Fix sysdev class file creation. [S390] pgtables: Fix race in enable_sie vs. page table ops [S390] qdio: remove incorrect memset [S390] qdio: prevent double qdio shutdown in case of I/O errors
| * | | | | | | [S390] s390: Fix build for !CONFIG_S390_GUEST + CONFIG_VIRTIO_CONSOLEChristian Borntraeger2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The s390 kernel does not compile if virtio console is enabled, but guest support is disabled: LD .tmp_vmlinux1 arch/s390/kernel/built-in.o: In function `setup_arch': /space/linux-2.5/arch/s390/kernel/setup.c:773: undefined reference to `s390_virtio_console_init' The fix is related to commit 99e65c92f2bbf84f43766a8bf701e36817d62822 Author: Christian Borntraeger <borntraeger@de.ibm.com> Date: Fri Jul 25 15:50:04 2008 +0200 KVM: s390: Fix guest kconfig Which changed the build process to build kvm_virtio.c only if CONFIG_S390_GUEST is set. We must ifdef the prototype in the header file accordingly. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] No more 4kb stacks.Heiko Carstens2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We got a stack overflow with a small stack configuration on a 32 bit system. It just looks like as 4kb isn't enough and too dangerous. So lets get rid of 4kb stacks on 32 bit. But one thing I completely dislike about the call trace below is that just for debugging or tracing purposes sprintf gets called (cio_start_key): /* process condition code */ sprintf(dbf_txt, "ccode:%d", ccode); CIO_TRACE_EVENT(4, dbf_txt); But maybe its just me who thinks that this could be done better. <4>Kernel stack overflow. <4>Modules linked in: dm_multipath sunrpc bonding qeth_l2 dm_mod qeth ccwgroup vmur <4>CPU: 1 Not tainted 2.6.27-30.x.20081015-s390default #1 <4>Process httpd (pid: 3807, task: 20ae2df8, ksp: 1666fb78) <4>Krnl PSW : 040c0000 8027098a (number+0xe/0x348) <4> R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 <4>Krnl GPRS: 00d43318 0027097c 1666f277 9666f270 <4> 00000000 00000000 0000000a ffffffff <4> 9666f270 1666f228 1666f277 1666f098 <4> 00000002 80270982 80271016 1666f098 <4>Krnl Code: 8027097e: f0340dd0a7f1 srp 3536(4,%r0),2033(%r10),4 <4> 80270984: 0f00 clcl %r0,%r0 <4> 80270986: a7840001 brc 8,80270988 <4> >8027098a: 18ef lr %r14,%r15 <4> 8027098c: a7faff68 ahi %r15,-152 <4> 80270990: 18bf lr %r11,%r15 <4> 80270992: 18a2 lr %r10,%r2 <4> 80270994: 1893 lr %r9,%r3 Modified calltrace with annotated stackframe size of each function: stackframe size | 0 304 vsnprintf+850 [0x271016] 1 72 sprintf+74 [0x271522] 2 56 cio_start_key+262 [0x2d4c16] 3 56 ccw_device_start_key+222 [0x2dfe92] 4 56 ccw_device_start+40 [0x2dff28] 5 48 raw3215_start_io+104 [0x30b0f8] 6 56 raw3215_write+494 [0x30ba0a] 7 40 con3215_write+68 [0x30bafc] 8 40 __call_console_drivers+146 [0x12b0fa] 9 32 _call_console_drivers+102 [0x12b192] 10 64 release_console_sem+268 [0x12b614] 11 168 vprintk+462 [0x12bca6] 12 72 printk+68 [0x12bfd0] 13 256 __print_symbol+50 [0x15a882] 14 56 __show_trace+162 [0x103d06] 15 32 show_trace+224 [0x103e70] 16 48 show_stack+152 [0x103f20] 17 56 dump_stack+126 [0x104612] 18 96 __alloc_pages_internal+592 [0x175004] 19 80 cache_alloc_refill+776 [0x196f3c] 20 40 __kmalloc+258 [0x1972ae] 21 40 __alloc_skb+94 [0x328086] 22 32 pskb_copy+50 [0x328252] 23 32 skb_realloc_headroom+110 [0x328a72] 24 104 qeth_l2_hard_start_xmit+378 [0x7803bfde] 25 56 dev_hard_start_xmit+450 [0x32ef6e] 26 56 __qdisc_run+390 [0x3425d6] 27 48 dev_queue_xmit+410 [0x331e06] 28 40 ip_finish_output+308 [0x354ac8] 29 56 ip_output+218 [0x355b6e] 30 24 ip_local_out+56 [0x354584] 31 120 ip_queue_xmit+300 [0x355cec] 32 96 tcp_transmit_skb+812 [0x367da8] 33 40 tcp_push_one+158 [0x369fda] 34 112 tcp_sendmsg+852 [0x35d5a0] 35 240 sock_sendmsg+164 [0x32035c] 36 56 kernel_sendmsg+86 [0x32064a] 37 88 sock_no_sendpage+98 [0x322b22] 38 104 tcp_sendpage+70 [0x35cc1e] 39 48 sock_sendpage+74 [0x31eb66] 40 64 pipe_to_sendpage+102 [0x1c4b2e] 41 64 __splice_from_pipe+120 [0x1c5340] 42 72 splice_from_pipe+90 [0x1c57e6] 43 56 generic_splice_sendpage+38 [0x1c5832] 44 48 do_splice_from+104 [0x1c4c38] 45 48 direct_splice_actor+52 [0x1c4c88] 46 80 splice_direct_to_actor+180 [0x1c4f80] 47 72 do_splice_direct+70 [0x1c5112] 48 64 do_sendfile+360 [0x19de18] 49 72 sys_sendfile64+126 [0x19df32] 50 336 sysc_do_restart+18 [0x111a1a] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] Change default IPL method to IPL_VM.Heiko Carstens2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allyesconfig and allmodconfig built kernels have a tape IPL record. A the vmreader record makes much more sense, since hardly anybody will ever IPL a kernel from tape. So change the default. As I side effect I can test these kernels without fiddling around with the kernel config ;) Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] tape: disable interrupts in tape_open and tape_releaseFrank Munzert2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Get tape device lock with interrupts disabled. Otherwise lockdep will issue a warning similar to: ================================= [ INFO: inconsistent lock state ] 2.6.27 #1 --------------------------------- inconsistent {in-hardirq-W} -> {hardirq-on-W} usage. vol_id/2903 [HC0[0]:SC0[0]:HE1:SE1] takes: (sch->lock){++..}, at: [<000003e00004c7a2>] tape_open+0x42/0x1a4 [tape] {in-hardirq-W} state was registered at: [<000000000007ce5c>] __lock_acquire+0x894/0xa74 [<000000000007d0ce>] lock_acquire+0x92/0xb8 [<0000000000345154>] _spin_lock+0x5c/0x9c [<0000000000202264>] do_IRQ+0x124/0x1f0 [<0000000000026610>] io_return+0x0/0x8 irq event stamp: 847 hardirqs last enabled at (847): [<000000000007aca6>] trace_hardirqs_on+0x2a/0x38 hardirqs last disabled at (846): [<0000000000076ca2>] trace_hardirqs_off+0x2a/0x38 softirqs last enabled at (0): [<000000000004909e>] copy_process+0x43e/0x11f4 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: 1 lock held by vol_id/2903: #0: (&bdev->bd_mutex){--..}, at: [<000000000010e0f4>] do_open+0x78/0x358 stack backtrace: CPU: 1 Not tainted 2.6.27 #1}, Process vol_id (pid: 2903, task: 000000003d4c0000, ksp: 000000003d4e3b10) 0400000000000000 000000003d4e3830 0000000000000002 0000000000000000 000000003d4e38d0 000000003d4e3848 000000003d4e3848 00000000000168a8 0000000000000000 000000003d4e3b10 0000000000000000 0000000000000000 000000003d4e3830 000000000000000c 000000003d4e3830 000000003d4e38a0 000000000034aa98 00000000000168a8 000000003d4e3830 000000003d4e3880 Call Trace: ([<000000000001681c>] show_trace+0x138/0x158) [<0000000000016902>] show_stack+0xc6/0xf8 [<00000000000170d4>] dump_stack+0xb0/0xc0 [<0000000000078810>] print_usage_bug+0x1e8/0x228 [<000000000007a71c>] mark_lock+0xb14/0xd24 [<000000000007cd5a>] __lock_acquire+0x792/0xa74 [<000000000007d0ce>] lock_acquire+0x92/0xb8 [<0000000000345154>] _spin_lock+0x5c/0x9c [<000003e00004c7a2>] tape_open+0x42/0x1a4 [tape] [<000003e00005185c>] tapeblock_open+0x98/0xd0 [tape] Signed-off-by: Frank Munzert <munzert@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] appldata: unsigned ops->size cannot be negativeRoel Kluin2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | unsigned ops->size cannot be negative Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] tape block: complete request with correct lockingFrank Munzert2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __blk_end_request must be called with request queue lock held. We need to use blk_end_request rather than __blk_end_request. Signed-off-by: Frank Munzert <munzert@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] Fix sysdev class file creation.Heiko Carstens2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use sysdev_class_create_file() to create create sysdev class attributes instead of sysfs_create_file(). Using sysfs_create_file() wasn't a very good idea since the show and store functions have a different amount of parameters for sysfs files and sysdev class files. In particular the pointer to the buffer is the last argument and therefore accesses to random memory regions happened. Still worked surprisingly well until we got a kernel panic. Cc: stable@kernel.org Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] pgtables: Fix race in enable_sie vs. page table opsChristian Borntraeger2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current enable_sie code sets the mm->context.pgstes bit to tell dup_mm that the new mm should have extended page tables. This bit is also used by the s390 specific page table primitives to decide about the page table layout - which means context.pgstes has two meanings. This can cause any kind of bugs. For example - e.g. shrink_zone can call ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young will test for context.pgstes. Since enable_sie changed that value of the old struct mm without changing the page table layout ptep_clear_flush_young will do the wrong thing. The solution is to split pgstes into two bits - one for the allocation - one for the current state Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] qdio: remove incorrect memsetJan Glauber2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the memset since zeroing the string is not needed and use snprintf instead of sprintf. Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | | | | | | [S390] qdio: prevent double qdio shutdown in case of I/O errorsJan Glauber2008-10-28
| | |_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of I/O errors on a qdio subchannel qdio_shutdown may be called twice by the qdio driver and by zfcp. Remove the superfluous shutdown from qdio and let the upper layer driver handle the error condition. Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | | | | | | Merge branch 'upstream-linus' of ↵Linus Torvalds2008-10-28
|\ \ \ \ \ \ \ | |/ / / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: libata: ahci enclosure management bit mask libata: ahci enclosure management led sync pata_ninja32: suspend/resume support libata: Fix LBA48 on pata_it821x RAID volumes. libata: clear saved xfer_mode and ncq_enabled on device detach sata_sil24: configure max read request size to 4k libata: add missing kernel-doc libata: fix device iteration bugs ahci: Add support for Promise PDC42819 ata: Switch all my stuff to a common address
| * | | | | | libata: ahci enclosure management bit maskDavid Milburn2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enclosure management bit mask definitions. Signed-off-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
| * | | | | | libata: ahci enclosure management led syncDavid Milburn2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Synchronize ahci_sw_activity and ahci_sw_activity_blink with ata_port lock. Signed-off-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
| * | | | | | pata_ninja32: suspend/resume supportAlan Cox2008-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I had assumed that the standard recovery would be sufficient for this hardware but it isn't. Fix up the other registers on resume as needed. See bug #11735 Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
| * | | | | | libata: Fix LBA48 on pata_it821x RAID volumes.Ondrej Zary2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [http://lkml.org/lkml/2008/10/18/82] Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
| * | | | | | libata: clear saved xfer_mode and ncq_enabled on device detachTejun Heo2008-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | libata EH saves xfer_mode and ncq_enabled at start to later set DUBIOUS_XFER flag if it has changed. These values need to be cleared on device detach such that hot device swap doesn't accidentally miss DUBIOUS_XFER. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>