aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAge
* ext4: implement writeback livelock avoidance using page taggingEric Sandeen2010-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is analogous to Jan Kara's commit, f446daaea9d4a420d16c606f755f3689dcb2d0ce mm: implement writeback livelock avoidance using page tagging but since we forked write_cache_pages, we need to reimplement it there (and in ext4_da_writepages, since range_cyclic handling was moved to there) If you start a large buffered IO to a file, and then set fsync after it, you'll find that fsync does not complete until the other IO stops. If you continue re-dirtying the file (say, putting dd with conv=notrunc in a loop), when fsync finally completes (after all IO is done), it reports via tracing that it has written many more pages than the file contains; in other words it has synced and re-synced pages in the file multiple times. This then leads to problems with our writeback_index update, since it advances it by pages written, and essentially sets writeback_index off the end of the file... With the following patch, we only sync as much as was dirty at the time of the sync. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* fs: Add FITRIM ioctlLukas Czerner2010-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds an filesystem independent ioctl to allow implementation of file system batched discard support. I takes fstrim_range structure as an argument. fstrim_range is definec in the include/fs.h and its definition is as follows. struct fstrim_range { start; len; minlen; } start - first Byte to trim len - number of Bytes to trim from start minlen - minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This will be rounded up to fs block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: start = 0; len = ULLONG_MAX; minlen = 0; So it will trim the whole file system at one run. After the FITRIM is done, the number of actually discarded Bytes is stored in fstrim_range.len to give the user better insight on how much storage space has been really released for wear-leveling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: don't use ext4_allocation_contexts for tracingEric Sandeen2010-10-27
| | | | | | | | | | | | | | | | | | | Many tracepoints were populating an ext4_allocation_context to pass in, but this requires a slab allocation even when tracepoints are off. In fact, 4 of 5 of these allocations were only for tracing. In addition, we were only using a small fraction of the 144 bytes of this structure for this purpose. We can do away with all these alloc/frees of the ac and simply pass in the bits we care about, instead. I tested this by turning on tracing and running through xfstests on x86_64. I did not actually do anything with the trace output, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix oops in trace_ext4_mb_release_group_paEric Sandeen2010-10-27
| | | | | | | | | | | | | | | | | | | | | | Our QA reported an oops in the ext4_mb_release_group_pa tracing, and Josef Bacik pointed out that it was because we may have a non-null but uninitialized ac_inode in the allocation context. I can reproduce it when running xfstests with ext4 tracepoints on, on a CONFIG_SLAB_DEBUG kernel. We call trace_ext4_mb_release_group_pa from 2 places, ext4_mb_discard_group_preallocations and ext4_mb_discard_lg_preallocations In both cases we allocate an ac as a container just for tracing (!) and never fill in the ac_inode. There's no reason to be assigning, testing, or printing it as far as I can see, so just remove it from the tracepoint. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Add helper function for blkdev_issue_zeroout (sb_issue_discard)Lukas Czerner2010-10-27
| | | | | | | | This is done the same way as helper sb_issue_discard for blkdev_issue_discard. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: avoid null dereference in trace_ext4_mballoc_discardWen Congyang2010-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ac->inode is set to null in function ext4_mb_release_group_pa(), and then trace_ext4_mballoc_discard(ac) is called, the kernel will panic. BUG: unable to handle kernel NULL pointer dereference at 000000a4 IP: [<f87e1714>] ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4] *pdpt = 0000000000abd001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Pid: 550, comm: flush-8:16 Not tainted 2.6.36-rc1 #1 SE7320EP2/Altos G530 EIP: 0060:[<f87e1714>] EFLAGS: 00010206 CPU: 1 EIP is at ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4] EAX: f32ac840 EBX: f3f1cf88 ECX: f32ac840 EDX: 00000000 ESI: f32ac83c EDI: f880b9d8 EBP: 00000000 ESP: f4b77ae4 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process flush-8:16 (pid: 550, ti=f4b76000 task=f613e540 task.ti=f4b76000) Call Trace: [<f87f5ac1>] ? ext4_mb_release_group_pa+0x121/0x150 [ext4] [<f87f8356>] ? ext4_mb_discard_group_preallocations+0x336/0x400 [ext4] [<f87fb7f1>] ? ext4_mb_new_blocks+0x3d1/0x4f0 [ext4] [<c05a6c5b>] ? __make_request+0x10b/0x440 [<f87f1fb4>] ? ext4_ext_map_blocks+0x1334/0x1980 [ext4] [<c04ac78a>] ? rb_reserve_next_event+0xaa/0x3b0 [<f87d18d6>] ? ext4_map_blocks+0xd6/0x1d0 [ext4] [<f87d2da7>] ? mpage_da_map_blocks+0xc7/0x8a0 [ext4] [<c04c8a68>] ? find_get_pages_tag+0x38/0x110 [<c04d23a5>] ? __pagevec_release+0x15/0x20 [<f87d3ca5>] ? ext4_da_writepages+0x2b5/0x5d0 [ext4] [<c04cfbe0>] ? __writepage+0x0/0x30 [<c04d0e34>] ? do_writepages+0x14/0x30 [<c0526600>] ? writeback_single_inode+0xa0/0x240 [<c0526971>] ? writeback_sb_inodes+0xc1/0x180 [<c0526ab8>] ? writeback_inodes_wb+0x88/0x140 [<c0526d7b>] ? wb_writeback+0x20b/0x320 [<c045aca7>] ? lock_timer_base+0x27/0x50 [<c0526fe0>] ? wb_do_writeback+0x150/0x190 [<c05270a8>] ? bdi_writeback_thread+0x88/0x1f0 [<c043b680>] ? complete+0x40/0x60 [<c0527020>] ? bdi_writeback_thread+0x0/0x1f0 [<c0469474>] ? kthread+0x74/0x80 [<c0469400>] ? kthread+0x0/0x80 [<c040a23e>] ? kernel_thread_helper+0x6/0x10 Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd2: Fix I/O hang in jbd2_journal_release_jbd_inodeBrian King2010-10-27
| | | | | | | | | | | | | | | | | | | | | This fixes a hang seen in jbd2_journal_release_jbd_inode on a lot of Power 6 systems running with ext4. When we get in the hung state, all I/O to the disk in question gets blocked where we stay indefinitely. Looking at the task list, I can see we are stuck in jbd2_journal_release_jbd_inode waiting on a wake up. I added some debug code to detect this scenario and dump additional data if we were stuck in jbd2_journal_release_jbd_inode for longer than 30 minutes. When it hit, I was able to see that i_flags was 0, suggesting we missed the wake up. This patch changes i_flags to be an unsigned long, uses bit operators to access it, and adds barriers around the accesses. Prior to applying this patch, we were regularly hitting this hang on numerous systems in our test environment. After applying the patch, the hangs no longer occur. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-09-28
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) tcp: Fix >4GB writes on 64-bit. net/9p: Mount only matching virtio channels de2104x: fix ethtool tproxy: check for transparent flag in ip_route_newports ipv6: add IPv6 to neighbour table overflow warning tcp: fix TSO FACK loss marking in tcp_mark_head_lost 3c59x: fix regression from patch "Add ethtool WOL support" ipv6: add a missing unregister_pernet_subsys call s390: use free_netdev(netdev) instead of kfree() sgiseeq: use free_netdev(netdev) instead of kfree() rionet: use free_netdev(netdev) instead of kfree() ibm_newemac: use free_netdev(netdev) instead of kfree() smsc911x: Add MODULE_ALIAS() net: reset skb queue mapping when rx'ing over tunnel br2684: fix scheduling while atomic de2104x: fix TP link detection de2104x: fix power management de2104x: disable autonegotiation on broken hardware net: fix a lockdep splat e1000e: 82579 do not gate auto config of PHY by hardware during nominal use ...
| * tcp: Fix >4GB writes on 64-bit.David S. Miller2010-09-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes kernel bugzilla #16603 tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write zero bytes, for example. There is also the problem higher up of how verify_iovec() works. It wants to prevent the total length from looking like an error return value. However it does this using 'int', but syscalls return 'long' (and thus signed 64-bit on 64-bit machines). So it could trigger false-positives on 64-bit as written. So fix it to use 'long'. Reported-by: Olaf Bonorden <bono@onlinehome.de> Reported-by: Daniel Büse <dbuese@gmx.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tproxy: check for transparent flag in ip_route_newportsUlrich Weber2010-09-27
| | | | | | | | | | | | | | as done in ip_route_connect() Signed-off-by: Ulrich Weber <uweber@astaro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: add a missing unregister_pernet_subsys callNeil Horman2010-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up a missing exit path in the ipv6 module init routines. In addrconf_init we call ipv6_addr_label_init which calls register_pernet_subsys for the ipv6_addr_label_ops structure. But if module loading fails, or if the ipv6 module is removed, there is no corresponding unregister_pernet_subsys call, which leaves a now-bogus address on the pernet_list, leading to oopses in subsequent registrations. This patch cleans up both the failed load path and the unload path. Tested by myself with good results. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> include/net/addrconf.h | 1 + net/ipv6/addrconf.c | 11 ++++++++--- net/ipv6/addrlabel.c | 5 +++++ 3 files changed, 14 insertions(+), 3 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: reset skb queue mapping when rx'ing over tunnelTom Herbert2010-09-26
| | | | | | | | | | | | | | | | | | | | Reset queue mapping when an skb is reentering the stack via a tunnel. On second pass, the queue mapping from the original device is no longer valid. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move "struct net" declaration inside the __KERNEL__ macro guardOllie Wild2010-09-22
| | | | | | | | | | | | | | | | | | | | | | | | This patch reduces namespace pollution by moving the "struct net" declaration out of the userspace-facing portion of linux/netlink.h. It has no impact on the kernel. (This came up because we have several C++ applications which use "net" as a namespace name.) Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * xfrm: Allow different selector family in temporary stateThomas Egerer2010-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The family parameter xfrm_state_find is used to find a state matching a certain policy. This value is set to the template's family (encap_family) right before xfrm_state_find is called. The family parameter is however also used to construct a temporary state in xfrm_state_find itself which is wrong for inter-family scenarios because it produces a selector for the wrong family. Since this selector is included in the xfrm_user_acquire structure, user space programs misinterpret IPv6 addresses as IPv4 and vice versa. This patch splits up the original init_tempsel function into a part that initializes the selector respectively the props and id of the temporary state, to allow for differing ip address families whithin the state. Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2010-09-27
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86/amd-iommu: Fix rounding-bug in __unmap_single x86/amd-iommu: Work around S3 BIOS bug x86/amd-iommu: Set iommu configuration flags in enable-loop x86, setup: Fix earlyprintk=serial,0x3f8,115200 x86, setup: Fix earlyprintk=serial,ttyS0,115200
| * \ Merge branch 'amd-iommu/2.6.36' of ↵Ingo Molnar2010-09-24
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-iommu into x86/urgent
| | * | x86/amd-iommu: Work around S3 BIOS bugJoerg Roedel2010-09-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a workaround for an IOMMU BIOS problem to the AMD IOMMU driver. The result of the bug is that the IOMMU does not execute commands anymore when the system comes out of the S3 state resulting in system failure. The bug in the BIOS is that is does not restore certain hardware specific registers correctly. This workaround reads out the contents of these registers at boot time and restores them on resume from S3. The workaround is limited to the specific IOMMU chipset where this problem occurs. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
* | | | arm: fix "arm: fix pci_set_consistent_dma_mask for dmabounce devices"FUJITA Tomonori2010-09-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes the regression caused by the commit 6fee48cd330c68 ("dma-mapping: arm: use generic pci_set_dma_mask and pci_set_consistent_dma_mask"). ARM needs to clip the dma coherent mask for dmabounce devices. This restores the old trick. Note that strictly speaking, the DMA API doesn't allow architectures to do such but I'm not sure it's worth adding the new API to set the dma mask that allows architectures to clip it. Reported-by: Krzysztof Halasa <khc@pm.waw.pl> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fs: {lock,unlock}_flocks() stubs to prepare for BKL removalSage Weil2010-09-21
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The lock structs are currently protected by the BKL, but are accessed by code in fs/locks.c and misc file system and DLM code. These stubs will allow all users to switch to the new interface before the implementation is changed to a spinlock. Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-09-19
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits) dca: disable dca on IOAT ver.3.0 multiple-IOH platforms netpoll: Disable IRQ around RCU dereference in netpoll_rx sctp: Do not reset the packet during sctp_packet_config(). net/llc: storing negative error codes in unsigned short MAINTAINERS: move atlx discussions to netdev drivers/net/cxgb3/cxgb3_main.c: prevent reading uninitialized stack memory drivers/net/eql.c: prevent reading uninitialized stack memory drivers/net/usb/hso.c: prevent reading uninitialized memory xfrm: dont assume rcu_read_lock in xfrm_output_one() r8169: Handle rxfifo errors on 8168 chips 3c59x: Remove atomic context inside vortex_{set|get}_wol tcp: Prevent overzealous packetization by SWS logic. net: RPS needs to depend upon USE_GENERIC_SMP_HELPERS phylib: fix PAL state machine restart on resume net: use rcu_barrier() in rollback_registered_many bonding: correctly process non-linear skbs ipv4: enable getsockopt() for IP_NODEFRAG ipv4: force_igmp_version ignored when a IGMPv3 query received ppp: potential NULL dereference in ppp_mp_explode() net/llc: make opt unsigned in llc_ui_setsockopt() ...
| * | netpoll: Disable IRQ around RCU dereference in netpoll_rxHerbert Xu2010-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot use rcu_dereference_bh safely in netpoll_rx as we may be called with IRQs disabled. We could however simply disable IRQs as that too causes BH to be disabled and is safe in either case. Thanks to John Linville for discovering this bug and providing a patch. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: Prevent overzealous packetization by SWS logic.Alexey Kuznetsov2010-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If peer uses tiny MSS (say, 75 bytes) and similarly tiny advertised window, the SWS logic will packetize to half the MSS unnecessarily. This causes problems with some embedded devices. However for large MSS devices we do want to half-MSS packetize otherwise we never get enough packets into the pipe for things like fast retransmit and recovery to work. Be careful also to handle the case where MSS > window, otherwise we'll never send until the probe timer. Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2010-09-16
|\ \ \ | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add documentation
| * | | workqueue: add documentationTejun Heo2010-09-13
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update copyright notice and add Documentation/workqueue.txt. Randy Dunlap, Dave Chinner: misc fixes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-By: Florian Mickler <florian@mickler.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Dave Chinner <david@fromorbit.com>
* | | Merge branch 'drm-fixes' of ↵Linus Torvalds2010-09-16
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6 * 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: drm/radeon/kms: only warn on mipmap size checks in r600 cs checker (v2) drm/radeon/kms: force legacy pll algo for RV620 LVDS drm: fix race between driver loading and userspace open. drm: Use a nondestructive mode for output detect when polling (v2) drm/radeon/kms: fix the colorbuffer CS checker for r300-r500 drm/radeon/kms: increase lockup detection interval to 10 sec for r100-r500 drm/radeon/kms/evergreen: fix backend setup drm: Use a nondestructive mode for output detect when polling drm/radeon: add some missing copyright headers drm: Only decouple the old_fb from the crtc is we call mode_set* drm/radeon/kms: don't enable underscan with interlaced modes drm/radeon/kms: add connector table for Mac x800 drm/radeon/kms: fix regression in RMX code (v2) drm: Fix regression in disable polling e58f637
| * | | drm: Use a nondestructive mode for output detect when polling (v2)Chris Wilson2010-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v2: Julien Cristau pointed out that @nondestructive results in double-negatives and confusion when trying to interpret the parameter, so use @force instead. Much easier to type as well. ;-) And fix the miscompilation of vmgfx reported by Sedat Dilek. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: stable@kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
| * | | drm: Use a nondestructive mode for output detect when pollingChris Wilson2010-09-13
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Destructive load-detection is very expensive and due to failings elsewhere can trigger system wide stalls of up to 600ms. A simple first step to correcting this is not to invoke such an expensive and destructive load-detection operation automatically. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29536 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16265 Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: stable@kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
* | | Merge ssh://master.kernel.org/home/hpa/tree/secLinus Torvalds2010-09-14
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | * ssh://master.kernel.org/home/hpa/tree/sec: x86-64, compat: Retruncate rax after ia32 syscall entry tracing x86-64, compat: Test %rax for the syscall number, not %eax compat: Make compat_alloc_user_space() incorporate the access_ok()
| * | | compat: Make compat_alloc_user_space() incorporate the access_ok()H. Peter Anvin2010-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | compat_alloc_user_space() expects the caller to independently call access_ok() to verify the returned area. A missing call could introduce problems on some architectures. This patch incorporates the access_ok() check into compat_alloc_user_space() and also adds a sanity check on the length. The existing compat_alloc_user_space() implementations are renamed arch_compat_alloc_user_space() and are used as part of the implementation of the new global function. This patch assumes NULL will cause __get_user()/__put_user() to either fail or access userspace on all architectures. This should be followed by checking the return value of compat_access_user_space() for NULL in the callers, at which time the access_ok() in the callers can also be removed. Reported-by: Ben Hawkes <hawkes@sota.gen.nz> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: <stable@kernel.org>
* | | | Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds2010-09-14
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies statfs() gives ESTALE error NFS: Fix a typo in nfs_sockaddr_match_ipaddr6 sunrpc: increase MAX_HASHTABLE_BITS to 14 gss:spkm3 miss returning error to caller when import security context gss:krb5 miss returning error to caller when import security context Remove incorrect do_vfs_lock message SUNRPC: cleanup state-machine ordering SUNRPC: Fix a race in rpc_info_open SUNRPC: Fix race corrupting rpc upcall Fix null dereference in call_allocate
| * | | SUNRPC: Fix a race in rpc_info_openTrond Myklebust2010-09-12
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race between rpc_info_open and rpc_release_client() in that nothing stops a process from opening the file after the clnt->cl_kref goes to zero. Fix this by using atomic_inc_unless_zero()... Reported-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
* | | Merge branch 'for_linus' of ↵Linus Torvalds2010-09-13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: dquot: do full inode dirty in allocating space
| * | | dquot: do full inode dirty in allocating spaceShaohua Li2010-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alex Shi found a regression when doing ffsb test. The test has several threads, and each thread creates a small file, write to it and then delete it. ffsb reports about 20% regression and Alex bisected it to 43d2932d88e4. The test will call __mark_inode_dirty 3 times. without this commit, we only take inode_lock one time, while with it, we take the lock 3 times with flags ( I_DIRTY_SYNC,I_DIRTY_PAGES,I_DIRTY). Perf shows the lock contention increased too much. Below proposed patch fixes it. fs is allocating blocks, which usually means file writes and the inode will be dirtied soon. We fully dirty the inode to reduce some inode_lock contention in several calls of __mark_inode_dirty. Jan Kara: Added comment. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | | | Merge branch 'next-spi' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds2010-09-13
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'next-spi' of git://git.secretlab.ca/git/linux-2.6: spi/pl022: move probe call to subsys_initcall() powerpc/5200: mpc52xx_uart.c: Add of_node_put to avoid memory leak spi/pl022: fix APB pclk power regression on U300 spi/spi_s3c64xx: Warn if PIO transfers time out spi/s3c64xx: Fix incorrect reuse of 'val' local variable. spi/s3c64xx: Fix compilation warning spi/dw_spi: clean the cs_control code spi/dw_spi: Allow interrupt sharing spi/spi_s3c64xx: Increase dead reckoning time in wait_for_xfer() spi/spi_s3c64xx: Move to subsys_initcall() spi: free children in spi_unregister_master, not siblings gpiolib: Add 'struct gpio_chip' forward declaration for !GPIOLIB case of: Fix missing includes - ll_temac spi/spi_s3c64xx: Staticise non-exported functions spi/spi_s3c64xx: Make probe more robust against missing board config
| * | | spi/dw_spi: clean the cs_control codeFeng Tang2010-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 052dc7c45i "spi/dw_spi: conditional transfer mode change" introduced cs_control code, which has a bug by using bit offset for spi mode to set transfer mode in control register. Also it forces devices who don't need cs_control to re-configure the control registers for each spi transfer. This patch will fix them Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | gpiolib: Add 'struct gpio_chip' forward declaration for !GPIOLIB caseAnton Vorontsov2010-09-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_GPIOLIB=n, the 'struct gpio_chip' is not declared, so the following pops up on PowerPC: cc1: warnings being treated as errors In file included from arch/powerpc/platforms/52xx/mpc52xx_common.c:19: include/linux/of_gpio.h:74: warning: 'struct gpio_chip' declared inside parameter list include/linux/of_gpio.h:74: warning: its scope is only this definition or declaration, which is probably not what you want include/linux/of_gpio.h:75: warning: 'struct gpio_chip' declared inside parameter list make[2]: *** [arch/powerpc/platforms/52xx/mpc52xx_common.o] Error 1 This patch fixes the issue by providing the proper forward declaration. Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-09-11
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (28 commits) ipheth: remove incorrect devtype to WWAN MAINTAINERS: Add CAIF sctp: fix test for end of loop KS8851: Correct RX packet allocation udp: add rehash on connect() net: blackhole route should always be recalculated ipv4: Suppress lockdep-RCU false positive in FIB trie (3) niu: Fix kernel buffer overflow for ETHTOOL_GRXCLSRLALL ipvs: fix active FTP gro: Re-fix different skb headrooms via-velocity: Turn scatter-gather support back off. ipv4: Fix reverse path filtering with multipath routing. UNIX: Do not loop forever at unix_autobind(). PATCH: b44 Handle RX FIFO overflow better (simplified) irda: off by one 3c59x: Fix deadlock in vortex_error() netfilter: discard overlapping IPv6 fragment ipv6: discard overlapping fragment net: fix tx queue selection for bridged devices implementing select_queue bonding: Fix jiffies overflow problems (again) ... Fix up trivial conflicts due to the same cgroup API thinko fix going through both Andrew and the networking tree. However, there were small differences between the two, with Andrew's version generally being the nicer one, and the one I merged first. So pick that one. Conflicts in: include/linux/cgroup.h and kernel/cgroup.c
| * | | Merge branch 'vhost-net' of ↵David S. Miller2010-09-10
| |\ \ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
| | * | | cgroups: fix API thinkoMichael S. Tsirkin2010-09-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_attach_task_current_cg API that have upstream is backwards: we really need an API to attach to the cgroups from another process A to the current one. In our case (vhost), a priveledged user wants to attach it's task to cgroups from a less priveledged one, the API makes us run it in the other task's context, and this fails. So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com>
| * | | | Merge branch 'master' of ↵David S. Miller2010-09-09
| |\ \ \ \ | | | |_|/ | | |/| | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
| * | | | udp: add rehash on connect()Eric Dumazet2010-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation) added a secondary hash on UDP, hashed on (local addr, local port). Problem is that following sequence : fd = socket(...) connect(fd, &remote, ...) not only selects remote end point (address and port), but also sets local address, while UDP stack stored in secondary hash table the socket while its local address was INADDR_ANY (or ipv6 equivalent) Sequence is : - autobind() : choose a random local port, insert socket in hash tables [while local address is INADDR_ANY] - connect() : set remote address and port, change local address to IP given by a route lookup. When an incoming UDP frame comes, if more than 10 sockets are found in primary hash table, we switch to secondary table, and fail to find socket because its local address changed. One solution to this problem is to rehash datagram socket if needed. We add a new rehash(struct socket *) method in "struct proto", and implement this method for UDP v4 & v6, using a common helper. This rehashing only takes care of secondary hash table, since primary hash (based on local port only) is not changed. Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | ipvs: fix active FTPJulian Anastasov2010-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Do not create expectation when forwarding the PORT command to avoid blocking the connection. The problem is that nf_conntrack_ftp.c:help() tries to create the same expectation later in POST_ROUTING and drops the packet with "dropping packet" message after failure in nf_ct_expect_related. - Change ip_vs_update_conntrack to alter the conntrack for related connections from real server. If we do not alter the reply in this direction the next packet from client sent to vport 20 comes as NEW connection. We alter it but may be some collision happens for both conntracks and the second conntrack gets destroyed immediately. The connection stucks too. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | cls_cgroup: Fix rcu lockdep warningLi Zefan2010-09-03
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave reported an rcu lockdep warning on 2.6.35.4 kernel task->cgroups and task->cgroups->subsys[i] are protected by RCU. So we avoid accessing invalid pointers here. This might happen, for example, when you are deref-ing those pointers while someone move @task from one cgroup to another. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-09-10
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Range check cpu in blk_cpu_to_group scatterlist: prevent invalid free when alloc fails writeback: Fix lost wake-up shutting down writeback thread writeback: do not lose wakeup events when forking bdi threads cciss: fix reporting of max queue depth since init block: switch s390 tape_block and mg_disk to elevator_change() block: add function call to switch the IO scheduler from a driver fs/bio-integrity.c: return -ENOMEM on kmalloc failure bio-integrity.c: remove dependency on __GFP_NOFAIL BLOCK: fix bio.bi_rw handling block: put dev->kobj in blk_register_queue fail path cciss: handle allocation failure cfq-iosched: Documentation help for new tunables cfq-iosched: blktrace print per slice sector stats cfq-iosched: Implement tunable group_idle cfq-iosched: Do group share accounting in IOPS when slice_idle=0 cfq-iosched: Do not idle if slice_idle=0 cciss: disable doorbell reset on reset_devices blkio: Fix return code for mkdir calls
| * | | | block: add function call to switch the IO scheduler from a driverJens Axboe2010-08-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently drivers must do an elevator_exit() + elevator_init() to switch IO schedulers. There are a few problems with this: - Since commit 1abec4fdbb142e3ccb6ce99832fae42129134a96, elevator_init() requires a zeroed out q->elevator pointer. The two existing in-kernel users don't do that. - It will only work at initialization time, since using the above two-staged construct does not properly quisce the queue. So add elevator_change() which takes care of this, and convert the elv_iosched_store() sysfs interface to use this helper as well. Reported-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reported-by: Kevin Vigor <kevin@vigor.nu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | | | | Merge branch 'upstream-linus' of ↵Linus Torvalds2010-09-09
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: libata-sff: Reenable Port Multiplier after libata-sff remodeling. libata: skip EH autopsy and recovery during suspend ahci: AHCI and RAID mode SATA patch for Intel Patsburg DeviceIDs ata_piix: IDE Mode SATA patch for Intel Patsburg DeviceIDs libata,pata_via: revert ata_wait_idle() removal from ata_sff/via_tf_load() ahci: fix hang on failed softreset pata_artop: Fix device ID parity check
| * | | | | libata-sff: Reenable Port Multiplier after libata-sff remodeling.Gwendal Grignou2010-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Keep track of the link on the which the current request is in progress. It allows support of links behind port multiplier. Not all libata-sff is PMP compliant. Code for native BMDMA controller does not take in accound PMP. Tested on Marvell 7042 and Sil7526. Signed-off-by: Gwendal Grignou <gwendal@google.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
| * | | | | libata: skip EH autopsy and recovery during suspendTejun Heo2010-09-09
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some mysterious reason, certain hardware reacts badly to usual EH actions while the system is going for suspend. As the devices won't be needed until the system is resumed, ask EH to skip usual autopsy and recovery and proceed directly to suspend. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Stephan Diestelhorst <stephan.diestelhorst@amd.com> Cc: stable@kernel.org Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
* | | | | mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory ↵Christoph Lameter2010-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | is low and kswapd is awake Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than number of real free page in buddy, the VM can allocate pages below min watermark, at worst reducing the real number of pages to zero. Even if the OOM killer kills some victim for freeing memory, it may not free memory if the exit path requires a new page resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | swap: discard while swapping only if SWAP_FLAG_DISCARDHugh Dickins2010-09-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs show a shift since I last tested in December: in part because of firmware updates, in part because of the necessary move from barriers to awaiting completion at the block layer. While discard at swapon still shows as slightly beneficial on both, discarding 1MB swap cluster when allocating is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV). Surrender: discard as presently implemented is more hindrance than help for swap; but might prove useful on other devices, or with improvements. So continue to do the discard at swapon, but make discard while swapping conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using only the lower 16 bits of int flags). We can add a --discard or -d to swapon(8), and a "discard" to swap in /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nigel Cunningham <nigel@tuxonice.net> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>