aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* Merge git://git.infradead.org/users/eparis/auditLinus Torvalds2014-04-12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit updates from Eric Paris. * git://git.infradead.org/users/eparis/audit: (28 commits) AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range audit: do not cast audit_rule_data pointers pointlesly AUDIT: Allow login in non-init namespaces audit: define audit_is_compat in kernel internal header kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c sched: declare pid_alive as inline audit: use uapi/linux/audit.h for AUDIT_ARCH declarations syscall_get_arch: remove useless function arguments audit: remove stray newline from audit_log_execve_info() audit_panic() call audit: remove stray newlines from audit_log_lost messages audit: include subject in login records audit: remove superfluous new- prefix in AUDIT_LOGIN messages audit: allow user processes to log from another PID namespace audit: anchor all pid references in the initial pid namespace audit: convert PPIDs to the inital PID namespace. pid: get pid_t ppid of task in init_pid_ns audit: rename the misleading audit_get_context() to audit_take_context() audit: Add generic compat syscall support audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL ...
| * AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERICChris Metcalf2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems with CONFIG_COMPAT we introduced the new requirement that audit_classify_compat_syscall() exists. This wasn't true for everything (apparently not for "tilegx", which I know less that nothing about.) Instead of wrapping the preprocessor optomization with CONFIG_COMPAT we should have used the new CONFIG_AUDIT_COMPAT_GENERIC. This patch uses that config option to make sure only arches which intend to implement this have the requirement. This works fine for tilegx according to Chris Metcalf Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: renumber AUDIT_FEATURE_CHANGE into the 1300 rangeEric Paris2014-04-02
| | | | | | | | | | | | | | | | | | 1000-1099 is for configuring things. So auditd ignored such messages. This is about actually logging what was configured. Move it into the range for such types of messages. Reported-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: do not cast audit_rule_data pointers pointleslyEric Paris2014-04-02
| | | | | | | | | | | | | | | | | | For some sort of legacy support audit_rule is a subset of (and first entry in) audit_rule_data. We don't actually need or use audit_rule. We just do a cast from one to the other for no gain what so ever. Stop the crazy casting. Signed-off-by: Eric Paris <eparis@redhat.com>
| * AUDIT: Allow login in non-init namespacesEric Paris2014-03-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It its possible to configure your PAM stack to refuse login if audit messages (about the login) were unable to be sent. This is common in many distros and thus normal configuration of many containers. The PAM modules determine if audit is enabled/disabled in the kernel based on the return value from sending an audit message on the netlink socket. If userspace gets back ECONNREFUSED it believes audit is disabled in the kernel. If it gets any other error else it refuses to let the login proceed. Just about ever since the introduction of namespaces the kernel audit subsystem has returned EPERM if the task sending a message was not in the init user or pid namespace. So many forms of containers have never worked if audit was enabled in the kernel. BUT if the container was not in net_init then the kernel network code would send ECONNREFUSED (instead of the audit code sending EPERM). Thus by pure accident/dumb luck/bug if an admin configured the PAM stack to reject all logins that didn't talk to audit, but then ran the login untility in the non-init_net namespace, it would work!! Clearly this was a bug, but it is a bug some people expected. With the introduction of network namespace support in 3.14-rc1 the two bugs stopped cancelling each other out. Now, containers in the non-init_net namespace refused to let users log in (just like PAM was configfured!) Obviously some people were not happy that what used to let users log in, now didn't! This fix is kinda hacky. We return ECONNREFUSED for all non-init relevant namespaces. That means that not only will the old broken non-init_net setups continue to work, now the broken non-init_pid or non-init_user setups will 'work'. They don't really work, since audit isn't logging things. But it's what most users want. In 3.15 we should have patches to support not only the non-init_net (3.14) namespace but also the non-init_pid and non-init_user namespace. So all will be right in the world. This just opens the doors wide open on 3.14 and hopefully makes users happy, if not the audit system... Reported-by: Andre Tomt <andre@tomt.net> Reported-by: Adam Richter <adam_richter2004@yahoo.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: kernel/audit.c
| * audit: define audit_is_compat in kernel internal headerEric Paris2014-03-24
| | | | | | | | | | | | | | | | We were exposing a function based on kernel config options to userspace. This is wrong. Move it to the audit internal header. Suggested-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * kernel: Use RCU_INIT_POINTER(x, NULL) in audit.cMonam Agarwal2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * sched: declare pid_alive as inlineRichard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | We accidentally declared pid_alive without any extern/inline connotation. Some platforms were fine with this, some like ia64 and mips were very angry. If the function is inline, the prototype should be inline! on ia64: include/linux/sched.h:1718: warning: 'pid_alive' declared inline after being called Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: use uapi/linux/audit.h for AUDIT_ARCH declarationsEric Paris2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | The syscall.h headers were including linux/audit.h but really only needed the uapi/linux/audit.h to get the requisite defines. Switch to the uapi headers. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org
| * syscall_get_arch: remove useless function argumentsEric Paris2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every caller of syscall_get_arch() uses current for the task and no implementors of the function need args. So just get rid of both of those things. Admittedly, since these are inline functions we aren't wasting stack space, but it just makes the prototypes better. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org Cc: linux390@de.ibm.com Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-arch@vger.kernel.org
| * audit: remove stray newline from audit_log_execve_info() audit_panic() callJoe Perches2014-03-20
| | | | | | | | | | | | There's an unnecessary use of a \n in audit_panic. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: remove stray newlines from audit_log_lost messagesJosh Boyer2014-03-20
| | | | | | | | | | | | | | | | | | | | | | Calling audit_log_lost with a \n in the format string leads to extra newlines in dmesg. That function will eventually call audit_panic which uses pr_err with an explicit \n included. Just make these calls match the others that lack \n. Reported-by: Jonathan Kamens <jik@kamens.brookline.ma.us> Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: include subject in login recordsEric Paris2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | The login uid change record does not include the selinux context of the task logging in. Add that information. (Updated from 2011-01: RHBZ:670328 -- RGB) Reported-by: Steve Grubb <sgrubb@redhat.com> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: remove superfluous new- prefix in AUDIT_LOGIN messagesRichard Guy Briggs2014-03-20
| | | | | | | | | | | | The new- prefix on ses and auid are un-necessary and break ausearch. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: allow user processes to log from another PID namespaceRichard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | Still only permit the audit logging daemon and control to operate from the initial PID namespace, but allow processes to log from another PID namespace. Cc: "Eric W. Biederman" <ebiederm@xmission.com> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: anchor all pid references in the initial pid namespaceRichard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | Store and log all PIDs with reference to the initial PID namespace and use the access functions task_pid_nr() and task_tgid_nr() for task->pid and task->tgid. Cc: "Eric W. Biederman" <ebiederm@xmission.com> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: convert PPIDs to the inital PID namespace.Richard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sys_getppid() returns the parent pid of the current process in its own pid namespace. Since audit filters are based in the init pid namespace, a process could avoid a filter or trigger an unintended one by being in an alternate pid namespace or log meaningless information. Switch to task_ppid_nr() for PPIDs to anchor all audit filters in the init_pid_ns. (informed by ebiederman's 6c621b7e) Cc: stable@vger.kernel.org Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * pid: get pid_t ppid of task in init_pid_nsRichard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup of the PPID (real_parent's pid_t) of a process, including rcu locking, in the arbitrary and init_pid_ns. This provides an alternative to sys_getppid(), which is relative to the child process' pid namespace. (informed by ebiederman's 6c621b7e) Cc: stable@vger.kernel.org Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: rename the misleading audit_get_context() to audit_take_context()Richard Guy Briggs2014-03-20
| | | | | | | | | | | | | | | | | | | | "get" usually implies incrementing a refcount into a structure to indicate a reference being held by another part of code. Change this function name to indicate it is in fact being taken from it, returning the value while clearing it in the supplying structure. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: Add generic compat syscall supportAKASHI Takahiro2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | lib/audit.c provides a generic function for auditing system calls. This patch extends it for compat syscall support on bi-architectures (32/64-bit) by adding lib/compat_audit.c. What is required to support this feature are: * add asm/unistd32.h for compat system call names * select CONFIG_AUDIT_ARCH_COMPAT_GENERIC Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALLAKASHI Takahiro2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently AUDITSYSCALL has a long list of architecture depencency: depends on AUDIT && (X86 || PARISC || PPC || S390 || IA64 || UML || SPARC64 || SUPERH || (ARM && AEABI && !OABI_COMPAT) || ALPHA) The purpose of this patch is to replace it with HAVE_ARCH_AUDITSYSCALL for simplicity. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> (arm) Acked-by: Richard Guy Briggs <rgb@redhat.com> (audit) Acked-by: Matt Turner <mattst88@gmail.com> (alpha) Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Eric Paris <eparis@redhat.com>
| * alpha: Enable system-call auditing support.蔡正龙2014-03-20
| | | | | | | | | | Signed-off-by: Zhenglong.cai <zhenglong.cai@cs2c.com.cn> Signed-off-by: Matt Turner <mattst88@gmail.com>
| * audit: Send replies in the proper network namespace.Eric W. Biederman2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | In perverse cases of file descriptor passing the current network namespace of a process and the network namespace of a socket used by that socket may differ. Therefore use the network namespace of the appropiate socket to ensure replies always go to the appropiate socket. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: Use struct net not pid_t to remember the network namespce to reply inEric W. Biederman2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While reading through 3.14-rc1 I found a pretty siginficant mishandling of network namespaces in the recent audit changes. In struct audit_netlink_list and audit_reply add a reference to the network namespace of the caller and remove the userspace pid of the caller. This cleanly remembers the callers network namespace, and removes a huge class of races and nasty failure modes that can occur when attempting to relook up the callers network namespace from a pid_t (including the caller's network namespace changing, pid wraparound, and the pid simply not being present). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * audit: Audit proc/<pid>/cmdline aka proctitleWilliam Roberts2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During an audit event, cache and print the value of the process's proctitle value (proc/<pid>/cmdline). This is useful in situations where processes are started via fork'd virtual machines where the comm field is incorrect. Often times, setting the comm field still is insufficient as the comm width is not very wide and most virtual machine "package names" do not fit. Also, during execution, many threads have their comm field set as well. By tying it back to the global cmdline value for the process, audit records will be more complete in systems with these properties. An example of where this is useful and applicable is in the realm of Android. With Android, their is no fork/exec for VM instances. The bare, preloaded Dalvik VM listens for a fork and specialize request. When this request comes in, the VM forks, and the loads the specific application (specializing). This was done to take advantage of COW and to not require a load of basic packages by the VM on very app spawn. When this spawn occurs, the package name is set via setproctitle() and shows up in procfs. Many of these package names are longer then 16 bytes, the historical width of task->comm. Having the cmdline in the audit records will couple the application back to the record directly. Also, on my Debian development box, some audit records were more useful then what was printed under comm. The cached proctitle is tied to the life-cycle of the audit_context structure and is built on demand. Proctitle is controllable by userspace, and thus should not be trusted. It is meant as an aid to assist in debugging. The proctitle event is emitted during syscall audits, and can be filtered with auditctl. Example: type=AVC msg=audit(1391217013.924:386): avc: denied { getattr } for pid=1971 comm="mkdir" name="/" dev="selinuxfs" ino=1 scontext=system_u:system_r:consolekit_t:s0-s0:c0.c255 tcontext=system_u:object_r:security_t:s0 tclass=filesystem type=SYSCALL msg=audit(1391217013.924:386): arch=c000003e syscall=137 success=yes exit=0 a0=7f019dfc8bd7 a1=7fffa6aed2c0 a2=fffffffffff4bd25 a3=7fffa6aed050 items=0 ppid=1967 pid=1971 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mkdir" exe="/bin/mkdir" subj=system_u:system_r:consolekit_t:s0-s0:c0.c255 key=(null) type=UNKNOWN[1327] msg=audit(1391217013.924:386): proctitle=6D6B646972002D70002F7661722F72756E2F636F6E736F6C65 Acked-by: Steve Grubb <sgrubb@redhat.com> (wrt record formating) Signed-off-by: William Roberts <wroberts@tresys.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * proc: Update get proc_pid_cmdline() to use mm.h helpersWilliam Roberts2014-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | Re-factor proc_pid_cmdline() to use get_cmdline() helper from mm.h. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: William Roberts <wroberts@tresys.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * mm: Create utility function for accessing a tasks commandline valueWilliam Roberts2014-03-07
| | | | | | | | | | | | | | | | | | | | | | | | introduce get_cmdline() for retreiving the value of a processes proc/self/cmdline value. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: William Roberts <wroberts@tresys.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * capabilities: add descriptions for AUDIT_CONTROL and AUDIT_WRITERichard Guy Briggs2014-03-07
| | | | | | | | | | | | Fill in missing descriptions for AUDIT_CONTROL and AUDIT_WRITE definitions. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * audit: Use more current logging style againRichard Guy Briggs2014-03-07
| | | | | | | | | | | | | | | | Add pr_fmt to prefix "audit: " to output Convert printk(KERN_<LEVEL> to pr_<level> Coalesce formats Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * Merge tag 'v3.13' into for-3.15Eric Paris2014-03-07
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux 3.13 Conflicts: include/net/xfrm.h Simple merge where v3.13 removed 'extern' from definitions and the audit tree did s/u32/unsigned int/ to the same definitions.
* | \ Merge branch 'async-scsi-resume' of ↵Linus Torvalds2014-04-11
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci Pull async SCSI resume support from Dan Williams: "Allow disks and other devices to resume in parallel. This provides a tangible speed up for a non-esoteric use case (laptop resume): https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach" * 'async-scsi-resume' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci: scsi: async sd resume
| * | | scsi: async sd resumeDan Williams2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. There is a concern that async resume could trigger PSU overload. In the enterprise, storage enclosures enforce staggered spin-up regardless of what the kernel does making async scanning safe by default. Outside of that context a user can disable asynchronous scanning via a kernel command line or CONFIG_SCSI_SCAN_ASYNC. Honor that setting when deciding whether to do resume asynchronously. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown <len.brown@intel.com> Cc: Phillip Susi <psusi@ubuntu.com> [alan: bug fix and clean up suggestion] Acked-by: Alan Stern <stern@rowland.harvard.edu> Suggested-by: Todd Brandt <todd.e.brandt@linux.intel.com> [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | | Merge tag 'md/3.15' of git://neil.brown.name/mdLinus Torvalds2014-04-11
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from Neil Brown: "Just a few md patches for the 3.15 merge window. Not much happening in md/raid at the moment. Just a few bug fixes (one for -stable) and a couple of performance tweaks" * tag 'md/3.15' of git://neil.brown.name/md: raid5: get_active_stripe avoids device_lock raid5: make_request does less prepare wait md: avoid oops on unload if some process is in poll or select. md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails. md/bitmap: don't abuse i_writecount for bitmap files.
| * | | | raid5: get_active_stripe avoids device_lockShaohua Li2014-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For sequential workload (or request size big workload), get_active_stripe can find cached stripe. In this case, we always hold device_lock, which exposes a lot of lock contention for such workload. If stripe count isn't 0, we don't need hold the lock actually, since we just increase its count. And this is the hot code path for such workload. Unfortunately we must delete the BUG_ON. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | raid5: make_request does less prepare waitShaohua Li2014-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a lot of contention for sequential workload (or big request size workload). For such workload, each bio includes several stripes. So we can just do prepare_to_wait/finish_wait once for the whold bio instead of every stripe. This reduces the lock contention completely for such workload. Random workload might have the similar lock contention too, but I didn't see it yet, maybe because my stroage is still not fast enough. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | md: avoid oops on unload if some process is in poll or select.NeilBrown2014-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If md-mod is unloaded while some process is in poll() or select(), then that process maintains a pointer to md_event_waiters, and when the try to unlink from that list, they will oops. The procfs infrastructure ensures that ->poll won't be called after remove_proc_entry, but doesn't provide a wait_queue_head for us to use, and the waitqueue code doesn't provide a way to remove all listeners from a waitqueue. So we need to: 1/ make sure no further references to md_event_waiters are taken (by setting md_unloading) 2/ wake up all processes currently waiting, and 3/ wait until all those processes have disconnected from our wait_queue_head. Reported-by: "majianpeng" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation ↵NeilBrown2014-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fails. When performing a user-request check/repair (MD_RECOVERY_REQUEST is set) on a raid1, we allocate multiple bios each with their own set of pages. If the page allocations for one bio fails, we currently do *not* free the pages allocated for the previous bios, nor do we free the bio itself. This patch frees all the already-allocate pages, and makes sure that all the bios are freed as well. This bug can cause a memory leak which can ultimately OOM a machine. It was introduced in 3.10-rc1. Fixes: a07876064a0b73ab5ef1ebcf14b1cf0231c07858 Cc: Kent Overstreet <koverstreet@google.com> Cc: stable@vger.kernel.org (3.10+) Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | | md/bitmap: don't abuse i_writecount for bitmap files.NeilBrown2014-04-08
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md bitmap code currently tries to use i_writecount to stop any other process from writing to out bitmap file. But that is really an abuse and has bit-rotted so locking is all wrong. So discard that - root should be allowed to shoot self in foot. Still use it in a much less intrusive way to stop the same file being used as bitmap on two different array, and apply other checks to ensure the file is at least vaguely usable for bitmap storage (is regular, is open for write. Support for ->bmap is already checked elsewhere). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds2014-04-11
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe driver updates from Matthew Wilcox: "Various updates to the NVMe driver. The most user-visible change is that drive hotplugging now works and CPU hotplug while an NVMe drive is installed should also work better" * git://git.infradead.org/users/willy/linux-nvme: NVMe: Retry failed commands with non-fatal errors NVMe: Add getgeo to block ops NVMe: Start-stop nvme_thread during device add-remove. NVMe: Make I/O timeout a module parameter NVMe: CPU hot plug notification NVMe: per-cpu io queues NVMe: Replace DEFINE_PCI_DEVICE_TABLE NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds NVMe: IOCTL path RCU protect queue access NVMe: RCU protected access to io queues NVMe: Initialize device reference count earlier NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions
| * | | | NVMe: Retry failed commands with non-fatal errorsKeith Busch2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For commands returned with failed status, queue these for resubmission and continue retrying them until success or for a limited amount of time. The final timeout was arbitrarily chosen so requests can't be retried indefinitely. Since these are requeued on the nvmeq that submitted the command, the callbacks have to take an nvmeq instead of an nvme_dev as a parameter so that we can use the locked queue to append the iod to retry later. The nvme_iod conviently can be used to track how long we've been trying to successfully complete an iod request. The nvme_iod also provides the nvme prp dma mappings, so I had to move a few things around so we can keep those mappings. Signed-off-by: Keith Busch <keith.busch@intel.com> [fixed checkpatch issue with long line] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Add getgeo to block opsKeith Busch2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some programs require HDIO_GETGEO work, which requires we implement getgeo. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Start-stop nvme_thread during device add-remove.Dan McLeran2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Done to ensure nvme_thread is not running when there are no devices to poll. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Make I/O timeout a module parameterKeith Busch2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Increase the default timeout to 30 seconds to match SCSI. Signed-off-by: Keith Busch <keith.busch@intel.com> [use byte instead of ushort] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: CPU hot plug notificationKeith Busch2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Registers with hot cpu notification to rebalance, and potentially allocate additional, io queues. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: per-cpu io queuesKeith Busch2014-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device's IO queues are associated with CPUs, so we can use a per-cpu variable to map the a qid to a cpu. This provides a convienient way to optimally assign queues to multiple cpus when the device supports fewer queues than the host has cpus. The previous implementation may have assigned these poorly in these situations. This patch addresses this by sharing queues among cpus that are "close" together and should have a lower lock contention penalty. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Replace DEFINE_PCI_DEVICE_TABLEMatthew Wilcox2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Checkpatch has started warning against using DEFINE_PCI_DEVICE_TABLE, so replace it. Also update the copyright date and bump the module version number to 0.9. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmdsKeith Busch2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dev->max_hw_sectors may be zero to indicate the device has no limit on the number of sectors. nvme_trans_do_nvme_io() should use the software limit, since this is guaranteed to be non-zero. Reported-by: Mundu <mundu2510@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: IOCTL path RCU protect queue accessKeith Busch2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds rcu protected access to a queue in the nvme IOCTL path to fix potential races between a surprise removal and queue usage in nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent the nvme_queue from freeing while this path is executing so it can't sleep, and so this path will no longer wait for a available command id should they all be in use at the time a passthrough IOCTL request is received. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: RCU protected access to io queuesKeith Busch2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds rcu protected access to nvme_queue to fix a race between a surprise removal freeing the queue and a thread with open reference on a NVMe block device using that queue. The queues do not need to be rcu protected during the initialization or shutdown parts, so I've added a helper function for raw deferencing to get around the sparse errors. There is still a hole in the IOCTL path for the same problem, which is fixed in a subsequent patch. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | | NVMe: Initialize device reference count earlierKeith Busch2014-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an NVMe device becomes ready but fails to create IO queues, the driver creates a character device handle so the device can be managed. The device reference count needs to be initialized before creating the character device. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>