aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux
Commit message (Collapse)AuthorAge
* capabilities: remove task_ns_* functionsEric Paris2012-01-05
| | | | | | | | | | | | | | task_ in the front of a function, in the security subsystem anyway, means to me at least, that we are operating with that task as the subject of the security decision. In this case what it means is that we are using current as the subject but we use the task to get the right namespace. Who in the world would ever realize that's what task_ns_capability means just by the name? This patch eliminates the task_ns functions entirely and uses the has_ns_capability function instead. This means we explicitly open code the ns in question in the caller. I think it makes the caller a LOT more clear what is going on. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
* capabilites: introduce new has_ns_capabilities_noauditEric Paris2012-01-05
| | | | | | | | | | For consistency in interfaces, introduce a new interface called has_ns_capabilities_noaudit. It checks if the given task has the given capability in the given namespace. Use this new function by has_capabilities_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
* capabilities: remove all _real_ interfacesEric Paris2012-01-05
| | | | | | | | | The name security_real_capable and security_real_capable_noaudit just don't make much sense to me. Convert them to use security_capable and security_capable_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
* capabilities: introduce security_capable_noauditEric Paris2012-01-05
| | | | | | | | | | | | Exactly like security_capable except don't audit any denials. This is for places where the kernel may make decisions about what to do if a task has a given capability, but which failing that capability is not a sign of a security policy violation. An example is checking if a task has CAP_SYS_ADMIN to lower it's likelyhood of being killed by the oom killer. This check is not a security violation if it is denied. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
* capabilities: reverse arguments to security_capableEric Paris2012-01-05
| | | | | | | | | | security_capable takes ns, cred, cap. But the LSM capable() hook takes cred, ns, cap. The capability helper functions also take cred, ns, cap. Rather than flip argument order just to flip it back, leave them alone. Heck, this should be a little faster since argument will be in the right place! Signed-off-by: Eric Paris <eparis@redhat.com>
* capabilities: remove the task from capable LSM hook entirelyEric Paris2012-01-05
| | | | | | | | | | | | | | | | | | | The capabilities framework is based around credentials, not necessarily the current task. Yet we still passed the current task down into LSMs from the security_capable() LSM hook as if it was a meaningful portion of the security decision. This patch removes the 'generic' passing of current and instead forces individual LSMs to use current explicitly if they think it is appropriate. In our case those LSMs are SELinux and AppArmor. I believe the AppArmor use of current is incorrect, but that is wholely unrelated to this patch. This patch does not change what AppArmor does, it just makes it clear in the AppArmor code that it is doing it. The SELinux code still uses current in it's audit message, which may also be wrong and needs further investigation. Again this is NOT a change, it may have always been wrong, this patch just makes it clear what is happening. Signed-off-by: Eric Paris <eparis@redhat.com>
* Merge branch 'for-linus' of http://people.redhat.com/agk/git/linux-dmLinus Torvalds2011-10-06
|\ | | | | | | | | | | | | | | * 'for-linus' of http://people.redhat.com/agk/git/linux-dm: dm crypt: always disable discard_zeroes_data dm: raid fix write_mostly arg validation dm table: avoid crash if integrity profile changes dm: flakey fix corrupt_bio_byte error path
| * dm crypt: always disable discard_zeroes_dataMilan Broz2011-09-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If optional discard support in dm-crypt is enabled, discards requests bypass the crypt queue and blocks of the underlying device are discarded. For the read path, discarded blocks are handled the same as normal ciphertext blocks, thus decrypted. So if the underlying device announces discarded regions return zeroes, dm-crypt must disable this flag because after decryption there is just random noise instead of zeroes. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | Merge git://github.com/davem330/netLinus Torvalds2011-10-04
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://github.com/davem330/net: pch_gbe: Fixed the issue on which a network freezes pch_gbe: Fixed the issue on which PC was frozen when link was downed. make PACKET_STATISTICS getsockopt report consistently between ring and non-ring net: xen-netback: correctly restart Tx after a VM restore/migrate bonding: properly stop queuing work when requested can bcm: fix incomplete tx_setup fix RDSRDMA: Fix cleanup of rds_iw_mr_pool net: Documentation: Fix type of variables ibmveth: Fix oops on request_irq failure ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket cxgb4: Fix EEH on IBM P7IOC can bcm: fix tx_setup off-by-one errors MAINTAINERS: tehuti: Alexander Indenbaum's address bounces dp83640: reduce driver noise ptp: fix L2 event message recognition
| * | ptp: fix L2 event message recognitionRichard Cochran2011-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IEEE 1588 standard defines two kinds of messages, event and general messages. Event messages require time stamping, and general do not. When using UDP transport, two separate ports are used for the two message types. The BPF designed to recognize event messages incorrectly classifies L2 general messages as event messages. This commit fixes the issue by extending the filter to check the message type field for L2 PTP packets. Event messages are be distinguished from general messages by testing the "general" bit. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: <stable@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | PCI: Disable MPS configuration by defaultJon Mason2011-10-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the ability to disable PCI-E MPS turning and using the BIOS configured MPS defaults. Due to the number of issues recently discovered on some x86 chipsets, make this the default behavior. Also, add the option for peer to peer DMA MPS configuration. Peer to peer DMA is outside the scope of this patch, but MPS configuration could prevent it from working by having the MPS on one root port different than the MPS on another. To work around this, simply make the system wide MPS the smallest possible value (128B). Signed-off-by: Jon Mason <mason@myri.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | |
| \ \
| \ \
| \ \
*---. \ \ Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and ↵Linus Torvalds2011-10-01
|\ \ \ \ \ | |_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip * 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: irq: Fix check for already initialized irq_domain in irq_domain_add irq: Add declaration of irq_domain_simple_ops to irqdomain.h * 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: x86/rtc: Don't recursively acquire rtc_lock * 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: posix-cpu-timers: Cure SMP wobbles sched: Fix up wchan borkage sched/rt: Migrate equal priority tasks to available CPUs
| | | * | posix-cpu-timers: Cure SMP wobblesPeter Zijlstra2011-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David reported: Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar. Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group. I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries). For example: [davem@boricha build-x86_64-linux]$ ./test process: before(0.001221967) after(0.498624371) diff(497402404) thread: before(0.000081692) after(0.498316431) diff(498234739) self: before(0.001223521) after(0.001240219) diff(16698) [davem@boricha build-x86_64-linux]$ The diff of 'process' should always be >= the diff of 'thread'. I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones. --- #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <fcntl.h> #include <string.h> #include <errno.h> #include <pthread.h> static pthread_barrier_t barrier; static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) __asm__ __volatile__("" : : : "memory"); return NULL; } int main(void) { clockid_t process_clock, my_thread_clock, th_clock; struct timespec process_before, process_after; struct timespec me_before, me_after; struct timespec th_before, th_after; struct timespec sleeptime; unsigned long diff; pthread_t th; int err; err = clock_getcpuclockid(0, &process_clock); if (err) return 1; err = pthread_getcpuclockid(pthread_self(), &my_thread_clock); if (err) return 1; pthread_barrier_init(&barrier, NULL, 2); err = pthread_create(&th, NULL, chew_cpu, NULL); if (err) return 1; err = pthread_getcpuclockid(th, &th_clock); if (err) return 1; pthread_barrier_wait(&barrier); err = clock_gettime(process_clock, &process_before); if (err) return 1; err = clock_gettime(my_thread_clock, &me_before); if (err) return 1; err = clock_gettime(th_clock, &th_before); if (err) return 1; sleeptime.tv_sec = 0; sleeptime.tv_nsec = 500000000; nanosleep(&sleeptime, NULL); err = clock_gettime(th_clock, &th_after); if (err) return 1; err = clock_gettime(my_thread_clock, &me_after); if (err) return 1; err = clock_gettime(process_clock, &process_after); if (err) return 1; diff = process_after.tv_nsec - process_before.tv_nsec; printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", process_before.tv_sec, process_before.tv_nsec, process_after.tv_sec, process_after.tv_nsec, diff); diff = th_after.tv_nsec - th_before.tv_nsec; printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", th_before.tv_sec, th_before.tv_nsec, th_after.tv_sec, th_after.tv_nsec, diff); diff = me_after.tv_nsec - me_before.tv_nsec; printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", me_before.tv_sec, me_before.tv_nsec, me_after.tv_sec, me_after.tv_nsec, diff); return 0; } This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks. This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime(). Aside of that it makes the function safe on 32 bit systems. The old code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a 64bit value and could be changed on another cpu at the same time. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | irq: Add declaration of irq_domain_simple_ops to irqdomain.hRob Herring2011-09-20
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | irq_domain_simple_ops is exported, but is not declared in irqdomain.h, so add it. Signed-off-by: Rob Herring <rob.herring@calxeda.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: marc.zyngier@arm.com Cc: thomas.abraham@linaro.org Cc: jamie@jamieiles.com Cc: b-cousson@ti.com Cc: shawn.guo@linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: devicetree-discuss@lists.ozlabs.org Link: http://lkml.kernel.org/r/1316017900-19918-2-git-send-email-robherring2@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | vfs: remove LOOKUP_NO_AUTOMOUNT flagLinus Torvalds2011-09-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | That flag no longer makes sense, since we don't look up automount points as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT handling was buggy to begin with: it would avoid automounting even for cases where we really *needed* to do the automount handling, and could return ENOENT for autofs entries that hadn't been instantiated yet. With our new non-eager automount semantics, one discussion has been about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the newfstatat() and fstatat64() system calls), but it's probably not worth it: you can always force at least directory automounting by simply adding the final '/' to the filename, which works for *all* of the stat family system calls, old and new. So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a result of our bad default behavior. Acked-by: Ian Kent <raven@themaw.net> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vfs pathname lookup: Add LOOKUP_AUTOMOUNT flagLinus Torvalds2011-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we've now turned around and made LOOKUP_FOLLOW *not* force an automount, we want to add the ability to force an automount event on lookup even if we don't happen to have one of the other flags that force it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..) Most cases will never want to use this, since you'd normally want to delay automounting as long as possible, which usually implies LOOKUP_OPEN (when we open a file or directory, we really cannot avoid the automount any more). But Trond argued sufficiently forcefully that at a minimum bind mounting a file and quotactl will want to force the automount lookup. Some other cases (like nfs_follow_remote_path()) could use it too, although LOOKUP_DIRECTORY would work there as well. This commit just adds the flag and logic, no users yet, though. It also doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and was made irrelevant by the same change that made us not follow on LOOKUP_FOLLOW. Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ian Kent <raven@themaw.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds2011-09-22
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: [S390] kvm: extension capability for new address space layout [S390] kvm: fix address mode switching
| * | | [S390] kvm: extension capability for new address space layoutChristian Borntraeger2011-09-20
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address spaces for kvm guest images) changed kvm on s390 to use a separate address space for kvm guests. We can now put KVM guests anywhere in the user address mode with a size up to 8PB - as long as the memory is 1MB-aligned. This change was done without KVM extension capability bit. The change was added after 3.0, but we still have a chance to add a feature bit before 3.1 (keeping the releases in a sane state). We use number 71 to avoid collisions with other pending kvm patches as requested by Alexander Graf. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Avi Kivity <avi@redhat.com> Cc: Alexander Graf <agraf@suse.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2011-09-21
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-block: floppy: use del_timer_sync() in init cleanup blk-cgroup: be able to remove the record of unplugged device block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request mm: Add comment explaining task state setting in bdi_forker_thread() mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread() block: simplify force plug flush code a little bit block: change force plug flush call order block: Fix queue_flag update when rq_affinity goes from 2 to 1 block: separate priority boosting from REQ_META block: remove READ_META and WRITE_META xen-blkback: fixed indentation and comments xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
| * | block: simplify force plug flush code a little bitShaohua Li2011-08-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleaning up the code a little bit. attempt_plug_merge() traverses the plug list anyway, we can do the request counting there, so stack size is reduced a little bit. The motivation here is I suspect if we should count the requests for each queue (task could handle multiple disks in the meantime), but my test doesn't show it's worthy doing. If somebody proves we should do it, below change will make that more easier. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | block: separate priority boosting from REQ_METAChristoph Hellwig2011-08-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule, and lave REQ_META purely for marking requests as metadata in blktrace. All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | block: remove READ_META and WRITE_METAChristoph Hellwig2011-08-23
| | | | | | | | | | | | | | | | | | | | | | | | Replace all occurnanced of the undocumented READ_META with READ | REQ_META and remove the unused WRITE_META define. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | | Merge branch 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6Linus Torvalds2011-09-18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6: mfd: Fix omap-usb-host build failure mfd: Make omap-usb-host TLL mode work again mfd: Set MAX8997 irq pointer mfd: Fix initialisation of tps65910 interrupts mfd: Check for twl4030-madc NULL pointer mfd: Copy the device pointer to the twl4030-madc structure mfd: Rename wm8350 static gpio_set_debounce() mfd: Fix value of WM8994_CONFIGURE_GPIO
| * | | mfd: Fix value of WM8994_CONFIGURE_GPIOMark Brown2011-09-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This needs to be an out of band value for the register and on this device registers are 16 bit so we must shift left one to the 17th bit. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: stable@kernel.org Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* | | | Merge git://github.com/davem330/netLinus Torvalds2011-09-18
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://github.com/davem330/net: (62 commits) ipv6: don't use inetpeer to store metrics for routes. can: ti_hecc: include linux/io.h IRDA: Fix global type conflicts in net/irda/irsysctl.c v2 net: Handle different key sizes between address families in flow cache net: Align AF-specific flowi structs to long ipv4: Fix fib_info->fib_metrics leak caif: fix a potential NULL dereference sctp: deal with multiple COOKIE_ECHO chunks ibmveth: Fix checksum offload failure handling ibmveth: Checksum offload is always disabled ibmveth: Fix issue with DMA mapping failure ibmveth: Fix DMA unmap error pch_gbe: support ML7831 IOH pch_gbe: added the process of FIFO over run error pch_gbe: fixed the issue which receives an unnecessary packet. sfc: Use 64-bit writes for TX push where possible Revert "sfc: Use write-combining to reduce TX latency" and follow-ups bnx2x: Fix ethtool advertisement bnx2x: Fix 578xx link LED bnx2x: Fix XMAC loopback test ...
| * | | | Merge branch 'master' of ../netdev/David S. Miller2011-09-16
| |\| | |
| * | | | net: copy userspace buffers on device forwardingMichael S. Tsirkin2011-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dev_forward_skb loops an skb back into host networking stack which might hang on the memory indefinitely. In particular, this can happen in macvtap in bridged mode. Copy the userspace fragments to avoid blocking the sender in that case. As this patch makes skb_copy_ubufs extern now, I also added some documentation and made it clear the SKBTX_DEV_ZEROCOPY flag automatically instead of doing it in all callers. This can be made into a separate patch if people feel it's worth it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | tcp: Change possible SYN flooding messagesEric Dumazet2011-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "Possible SYN flooding on port xxxx " messages can fill logs on servers. Change logic to log the message only once per listener, and add two new SNMP counters to track : TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client TCPReqQFullDrop : number of times a SYN request was dropped because syncookies were not enabled. Based on a prior patch from Tom Herbert, and suggestions from David. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | drivers/gpio/gpio-generic.c: fix build errorsRussell King2011-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Building a kernel with hotplug disabled results in a link failure: `bgpio_remove' referenced in section `___ksymtab_gpl+bgpio_remove' of drivers/built-in.o: defined in discarded section `.devexit.text' of drivers/built-in.o This is because of bgpio_remove() is exported. It is illegal to export symbols which are discarded either at link time or as part of an init/exit section. Fix this by dropping the __devexit attributation from bgpio_remove(). Also drop the __devinit attributation from bgpio_init(). Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | memcg: Revert "memcg: add memory.vmscan_stat"Johannes Weiner2011-09-14
| |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add memory.vmscan_stat"). The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | regulator: fix kernel-doc warning in consumer.hRandy Dunlap2011-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc warning about internal/private data by marking it as "private:" so that kernel-doc will ignore it. Warning(include/linux/regulator/consumer.h:128): No description found for parameter 'ret' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | perf events: Fix slow and broken cgroup context switch codeStephane Eranian2011-08-29
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current cgroup context switch code was incorrect leading to bogus counts. Furthermore, as soon as there was an active cgroup event on a CPU, the context switch cost on that CPU would increase by a significant amount as demonstrated by a simple ping/pong example: $ ./pong Both processes pinned to CPU1, running for 10s 10684.51 ctxsw/s Now start a cgroup perf stat: $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100 $ ./pong Both processes pinned to CPU1, running for 10s 6674.61 ctxsw/s That's a 37% penalty. Note that pong is not even in the monitored cgroup. The results shown by perf stat are bogus: $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100 Performance counter stats for 'sleep 100': CPU1 <not counted> cycles test CPU1 16,984,189,138 cycles # 0.000 GHz The second 'cycles' event should report a count @ CPU clock (here 2.4GHz) as it is counting across all cgroups. The patch below fixes the bogus accounting and bypasses any cgroup switches in case the outgoing and incoming tasks are in the same cgroup. With this patch the same test now yields: $ ./pong Both processes pinned to CPU1, running for 10s 10775.30 ctxsw/s Start perf stat with cgroup: $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10 Run pong outside the cgroup: $ /pong Both processes pinned to CPU1, running for 10s 10687.80 ctxsw/s The penalty is now less than 2%. And the results for perf stat are correct: $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10 Performance counter stats for 'sleep 10': CPU1 <not counted> cycles test # 0.000 GHz CPU1 23,933,981,448 cycles # 0.000 GHz Now perf stat reports the correct counts for for the non cgroup event. If we run pong inside the cgroup, then we also get the correct counts: $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10 Performance counter stats for 'sleep 10': CPU1 22,297,726,205 cycles test # 0.000 GHz CPU1 23,933,981,448 cycles # 0.000 GHz 10.001457237 seconds time elapsed Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | All Arch: remove linkage for sys_nfsservctl system callNeilBrown2011-08-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | The nfsservctl system call is now gone, so we should remove all linkage for it. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'tty-linus' of ↵Linus Torvalds2011-08-26
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6 * 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: omap-serial: Allow IXON and IXOFF to be disabled. TTY: serial, document ignoring of uart->ops->startup error TTY: pty, fix pty counting 8250: Fix race condition in serial8250_backup_timeout(). serial/8250_pci: delete duplicate data definition 8250_pci: add support for Rosewill RC-305 4x serial port card tty: Add "spi:" prefix for spi modalias atmel_serial: fix atmel_default_console_device serial: 8250_pnp: add Intermec CV60 touchscreen device drivers/serial/ucc_uart.c: Fix compiler warning pch_uart: Set PCIe bus number using probe parameter serial: samsung: Fix build error
| * | | TTY: pty, fix pty countingJiri Slaby2011-08-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tty_operations->remove is normally called like: queue_release_one_tty ->tty_shutdown ->tty_driver_remove_tty ->tty_operations->remove However tty_shutdown() is called from queue_release_one_tty() only if tty_operations->shutdown is NULL. But for pty, it is not. pty_unix98_shutdown() is used there as ->shutdown. So tty_operations->remove of pty (i.e. pty_unix98_remove()) is never called. This results in invalid pty_count. I.e. what can be seen in /proc/sys/kernel/pty/nr. I see this was already reported at: https://lkml.org/lkml/2009/11/5/370 But it was not fixed since then. This patch is kind of a hackish way. The problem lies in ->install. We allocate there another tty (so-called tty->link). So ->install is called once, but ->remove twice, for both tty and tty->link. The fix here is to count both tty and tty->link and divide the count by 2 for user. And to have ->remove called, let's make tty_driver_remove_tty() global and call that from pty_unix98_shutdown() (tty_operations->shutdown). While at it, let's document that when ->shutdown is defined, tty_shutdown() is not called. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* | | | Merge branch 'driver-core-linus' of ↵Linus Torvalds2011-08-26
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: drivers:misc: ti-st: fix unexpected UART close drivers:misc: ti-st: free skb on firmware download drivers:misc: ti-st: wait for completion at fail drivers:misc: ti-st: reinit completion before send drivers:misc: ti-st: fail-safe on wrong pkt type drivers:misc: ti-st: reinit completion on ver read drivers:misc:ti-st: platform hooks for chip states drivers:misc: ti-st: avoid a misleading dbg msg base/devres.c: quiet sparse noise about context imbalance pti: add missing CONFIG_PCI dependency drivers/base/devtmpfs.c: correct annotation of `setup_done' driver core: fix kernel-doc warning in platform.c firmware: fix google/gsmi.c build warning
| * | | | drivers:misc:ti-st: platform hooks for chip statesPavan Savoy2011-08-22
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Certain platform specific or Host-WiLink Interface specific actions would be required to be taken when the chip is being enabled and after the chip is disabled such as configuration of the mux modes for the GPIO of host connected to the nshutdown of the chip or relinquishing UART after the chip is disabled. Similar actions can also be taken when the chip is in deep sleep or when the chip is awake. Performance enhancements such as configuring the host to run faster when chip is awake and slower when chip is asleep can also be made here. Signed-off-by: Pavan Savoy <pavan_savoy@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* | | | backlight: add a callback 'notify_after' for backlight controlDilan Lee2011-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need a callback to do some things after pwm_enable, pwm_disable and pwm_config. Signed-off-by: Dilan Lee <dilee@nvidia.com> Reviewed-by: Robert Morell <rmorell@nvidia.com> Reviewed-by: Arun Murthy <arun.murthy@stericsson.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | rapidio: fix use of non-compatible registersAlexandre Bounine2011-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace/remove use of RIO v.1.2 registers/bits that are not forward-compatible with newer versions of RapidIO specification. RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR, Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR. Use of removed (since RIO v.1.3) register bits affects users of currently available 1.3 and 2.x compliant devices who may use not so recent kernel versions. Removing checks for unsupported bits makes corresponding routines compatible with all versions of RapidIO specification. Therefore, backporting makes stable kernel versions compliant with RIO v.1.3 and later as well. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Chul Kim <chul.kim@idt.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | MAINTAINERS: Evgeniy has movedEvgeniy Polyakov2011-08-25
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | lockdep: Add helper function for dir vs file i_mutex annotationJosh Boyer2011-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Purely in-memory filesystems do not use the inode hash as the dcache tells us if an entry already exists. As a result, they do not call unlock_new_inode, and thus directory inodes do not get put into a different lockdep class for i_sem. We need the different lockdep classes, because the locking order for i_mutex is different for directory inodes and regular inodes. Directory inodes can do "readdir()", which takes i_mutex *before* possibly taking mm->mmap_sem (due to a page fault while copying the directory entry to user space). In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem before accessing i_mutex. The two cases can never happen for the same inode, so no real deadlock can occur, but without the different lockdep classes, lockdep cannot understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this can lead to false positives from lockdep like below: find/645 is trying to acquire lock: (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac but task is already holding lock: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>] vfs_readdir+0x5b/0xb4 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}: [<ffffffff8108ac26>] lock_acquire+0xbf/0x103 [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361 [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45 [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110 [<ffffffff81111557>] mmap_region+0x258/0x432 [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306 [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a [<ffffffff8100c858>] sys_mmap+0x22/0x24 [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b -> #0 (&mm->mmap_sem){++++++}: [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7 [<ffffffff8108ac26>] lock_acquire+0xbf/0x103 [<ffffffff81109541>] might_fault+0x89/0xac [<ffffffff81149cff>] filldir+0x6f/0xc7 [<ffffffff811586ea>] dcache_readdir+0x67/0x205 [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4 [<ffffffff8114a073>] sys_getdents+0x7e/0xd1 [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b This patch moves the directory vs file lockdep annotation into a helper function that can be called by in-memory filesystems and has hugetlbfs call it. Signed-off-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'urgent-for-linus' of ↵Linus Torvalds2011-08-25
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: squeeze max-pause area and drop pass-good area
| * | | | squeeze max-pause area and drop pass-good areaWu Fengguang2011-08-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the pass-good area introduced in ffd1f609ab10 ("writeback: introduce max-pause and pass-good dirty limits") and make the max-pause area smaller and safe. This fixes ~30% performance regression in the ext3 data=writeback fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes. Using deadline scheduler also has a regression, but not that big as CFQ, so this suggests we have some write starvation. The test logs show that - the disks are sometimes under utilized - global dirty pages sometimes rush high to the pass-good area for several hundred seconds, while in the mean time some bdi dirty pages drop to very low value (bdi_dirty << bdi_thresh). Then suddenly the global dirty pages dropped under global dirty threshold and bdi_dirty rush very high (for example, 2 times higher than bdi_thresh). During which time balance_dirty_pages() is not called at all. So the problems are 1) The random writes progress so slow that they break the assumption of the max-pause logic that "8 pages per 200ms is typically more than enough to curb heavy dirtiers". 2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others. 3) The higher max-pause/pass-good thresholds somehow leads to the bad swing of dirty pages. The fix is to allow the task to slightly dirty over task_bdi_thresh, but no way to exceed bdi_dirty and/or global dirty_thresh. Tests show that it fixed the JBOD regression completely (both behavior and performance), while still being able to cut down large pause times in balance_dirty_pages() for single-disk cases. Reported-by: Li Shaohua <shaohua.li@intel.com> Tested-by: Li Shaohua <shaohua.li@intel.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* | | | | Add a personality to report 2.6.x version numbersAndi Kleen2011-08-25
| |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I ran into a couple of programs which broke with the new Linux 3.0 version. Some of those were binary only. I tried to use LD_PRELOAD to work around it, but it was quite difficult and in one case impossible because of a mix of 32bit and 64bit executables. For example, all kind of management software from HP doesnt work, unless we pretend to run a 2.6 kernel. $ uname -a Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux $ hpacucli ctrl all show Error: No controllers detected. $ rpm -qf /usr/sbin/hpacucli hpacucli-8.75-12.0 Another notable case is that Python now reports "linux3" from sys.platform(); which in turn can break things that were checking sys.platform() == "linux2": https://bugzilla.mozilla.org/show_bug.cgi?id=664564 It seems pretty clear to me though it's a bug in the apps that are using '==' instead of .startswith(), but this allows us to unbreak broken programs. This patch adds a UNAME26 personality that makes the kernel report a 2.6.40+x version number instead. The x is the x in 3.x. I know this is somewhat ugly, but I didn't find a better workaround, and compatibility to existing programs is important. Some programs also read /proc/sys/kernel/osrelease. This can be worked around in user space with mount --bind (and a mount namespace) To use: wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c gcc -o uname26 uname26.c ./uname26 program Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-08-24
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message fuse: mark pages accessed when written to fuse: delete dead .write_begin and .write_end aops fuse: fix flock fuse: fix non-ANSI void function notation
| * | | fuse: fix flockMiklos Szeredi2011-08-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a number of issues with supporing flock locks over existing POSIX locking infrastructure: - it's not backward compatible, passing flock(2) calls to userspace unconditionally (if userspace sets FUSE_POSIX_LOCKS) - it doesn't cater for the fact that flock locks are automatically unlocked on file release - it doesn't take into account the fact that flock exclusive locks (write locks) don't need an fd opened for write. The last one invalidates the original premise of the patch that flock locks can be emulated with POSIX locks. This patch fixes the first two issues. The last one needs to be fixed in userspace if the filesystem assumed that a write lock will happen only on a file operned for write (as in the case of the current fuse library). Reported-by: Sebastian Pipping <webmaster@hartwork.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2011-08-19
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits) Revert "cfq: Remove special treatment for metadata rqs." block: fix flush machinery for stacking drivers with differring flush flags block: improve rq_affinity placement blktrace: add FLUSH/FUA support Move some REQ flags to the common bio/request area allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH xen/blkback: Make description more obvious. cfq-iosched: Add documentation about idling block: Make rq_affinity = 1 work as expected block: swim3: fix unterminated of_device_id table block/genhd.c: remove useless cast in diskstats_show() drivers/cdrom/cdrom.c: relax check on dvd manufacturer value drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse bsg-lib: add module.h include cfq-iosched: Reduce linked group count upon group destruction blk-throttle: correctly determine sync bio loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices loop: add management interface for on-demand device allocation loop: replace linked list of allocated devices with an idr index ...
| * | | | block: fix flush machinery for stacking drivers with differring flush flagsJeff Moyer2011-08-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement FLUSH/FUA to support merge, introduced a performance regression when running any sort of fsyncing workload using dm-multipath and certain storage (in our case, an HP EVA). The test I ran was fs_mark, and it dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out that dm-multipath always advertised flush+fua support, and passed commands on down the stack, where those flags used to get stripped off. The above commit changed that behavior: static inline struct request *__elv_next_request(struct request_queue *q) { struct request *rq; while (1) { - while (!list_empty(&q->queue_head)) { + if (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) || - (rq->cmd_flags & REQ_FLUSH_SEQ)) - return rq; - rq = blk_do_flush(q, rq); - if (rq) - return rq; + return rq; } Note that previously, a command would come in here, have REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush: struct request *blk_do_flush(struct request_queue *q, struct request *rq) { unsigned int fflags = q->flush_flags; /* may change, cache it */ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA; bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH); bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA); unsigned skip = 0; ... if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) { rq->cmd_flags &= ~REQ_FLUSH; if (!has_fua) rq->cmd_flags &= ~REQ_FUA; return rq; } So, the flush machinery was bypassed in such cases (q->flush_flags == 0 && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)). Now, however, we don't get into the flush machinery at all. Instead, __elv_next_request just hands a request with flush and fua bits set to the scsi_request_fn, even if the underlying request_queue does not support flush or fua. The agreed upon approach is to fix the flush machinery to allow stacking. While this isn't used in practice (since there is only one request-based dm target, and that target will now reflect the flush flags of the underlying device), it does future-proof the solution, and make it function as designed. In order to make this work, I had to add a field to the struct request, inside the flush structure (to store the original req->end_io). Shaohua had suggested overloading the union with rb_node and completion_data, but the completion data is used by device mapper and can also be used by other drivers. So, I didn't see a way around the additional field. I tested this patch on an HP EVA with both ext4 and xfs, and it recovers the lost performance. Comments and other testers, as always, are appreciated. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | | blktrace: add FLUSH/FUA supportNamhyung Kim2011-08-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or FUA follows WRITE, use the same 'F' flag for both cases and distinguish them by their (relative) position. The end results look like (other flags might be shown also): - WRITE: W - WRITE_FLUSH: FW - WRITE_FUA: WF - WRITE_FLUSH_FUA: FWF Note that we reuse TC_BARRIER due to lack of bit space of act_mask so that the older versions of blktrace tools will report flush requests as barriers from now on. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | | Move some REQ flags to the common bio/request areaMatthew Wilcox2011-08-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | REQ_SECURE, REQ_FLUSH and REQ_FUA may all be set on a bio as well as on a request, so relocate them to the shared part of the enum. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>