aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAge
* [PATCH] IPC namespace - utilsKirill Korotaev2006-10-02
| | | | | | | | | | | | | | | This patch adds basic IPC namespace functionality to IPC utils: - init_ipc_ns - copy/clone/unshare/free IPC ns - /proc preparations Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] IPC namespace coreKirill Korotaev2006-10-02
| | | | | | | | | | | | | | | | | | | | | This patch set allows to unshare IPCs and have a private set of IPC objects (sem, shm, msg) inside namespace. Basically, it is another building block of containers functionality. This patch implements core IPC namespace changes: - ipc_namespace structure - new config option CONFIG_IPC_NS - adds CLONE_NEWIPC flag - unshare support [clg@fr.ibm.com: small fix for unshare of ipc namespace] [akpm@osdl.org: build fix] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: implement CLONE_NEWUTS flagSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | | Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare. [clg@fr.ibm.com: IPC unshare fix] [bunk@stusta.de: cleanup] Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: remove system_utsnameSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | The system_utsname isn't needed now that kernel/sysctl.c is fixed. Nuke it. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: implement utsname namespacesSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | | | | This patch defines the uts namespace and some manipulators. Adds the uts namespace to task_struct, and initializes a system-wide init namespace. It leaves a #define for system_utsname so sysctl will compile. This define will be removed in a separate patch. [akpm@osdl.org: build fix, cleanup] Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: use init_utsname when appropriateSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | | | | | | | In some places, particularly drivers and __init code, the init utsns is the appropriate one to use. This patch replaces those with a the init_utsname helper. Changes: Removed several uses of init_utsname(). Hope I picked all the right ones in net/ipv4/ipconfig.c. These are now changed to utsname() (the per-process namespace utsname) in the previous patch (2/7) [akpm@osdl.org: CIFS fix] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: switch to using uts namespacesSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | | | | | | | Replace references to system_utsname to the per-process uts namespace where appropriate. This includes things like uname. Changes: Per Eric Biederman's comments, use the per-process uts namespace for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c [jdike@addtoit.com: UML fix] [clg@fr.ibm.com: cleanup] [akpm@osdl.org: build fix] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: utsname: introduce temporary helpersSerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | Define utsname() and init_utsname() which return &system_utsname. Users of system_utsname will be changed to use these helpers, after which system_utsname will disappear. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: incorporate fs namespace into nsproxySerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | This moves the mount namespace into the nsproxy. The mount namespace count now refers to the number of nsproxies point to it, rather than the number of tasks. As a result, the unshare_namespace() function in kernel/fork.c no longer checks whether it is being shared. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] namespaces: add nsproxySerge E. Hallyn2006-10-02
| | | | | | | | | | | | | | | | | | | This patch adds a nsproxy structure to the task struct. Later patches will move the fs namespace pointer into this structure, and introduce a new utsname namespace into the nsproxy. The vserver and openvz functionality, then, would be implemented in large part by virtualizing/isolating more and more resources into namespaces, each contained in the nsproxy. [akpm@osdl.org: build fix] Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] nfsd: lockdep annotationPeter Zijlstra2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | while doing a kernel make modules_install install over an NFS mount. ============================================= [ INFO: possible recursive locking detected ] --------------------------------------------- nfsd/9550 is trying to acquire lock: (&inode->i_mutex){--..}, at: [<c034c845>] mutex_lock+0x1c/0x1f but task is already holding lock: (&inode->i_mutex){--..}, at: [<c034c845>] mutex_lock+0x1c/0x1f other info that might help us debug this: 2 locks held by nfsd/9550: #0: (hash_sem){..--}, at: [<cc895223>] exp_readlock+0xd/0xf [nfsd] #1: (&inode->i_mutex){--..}, at: [<c034c845>] mutex_lock+0x1c/0x1f stack backtrace: [<c0103508>] show_trace_log_lvl+0x58/0x152 [<c0103b8b>] show_trace+0xd/0x10 [<c0103c2f>] dump_stack+0x19/0x1b [<c012aa57>] __lock_acquire+0x77a/0x9a3 [<c012af4a>] lock_acquire+0x60/0x80 [<c034c6c2>] __mutex_lock_slowpath+0xa7/0x20e [<c034c845>] mutex_lock+0x1c/0x1f [<c0162edc>] vfs_unlink+0x34/0x8a [<cc891d98>] nfsd_unlink+0x18f/0x1e2 [nfsd] [<cc89884f>] nfsd3_proc_remove+0x95/0xa2 [nfsd] [<cc88f0d4>] nfsd_dispatch+0xc0/0x178 [nfsd] [<c033e84d>] svc_process+0x3a5/0x5ed [<cc88f5ba>] nfsd+0x1a7/0x305 [nfsd] [<c0101005>] kernel_thread_helper+0x5/0xb DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb Leftover inexact backtrace: [<c0103b8b>] show_trace+0xd/0x10 [<c0103c2f>] dump_stack+0x19/0x1b [<c012aa57>] __lock_acquire+0x77a/0x9a3 [<c012af4a>] lock_acquire+0x60/0x80 [<c034c6c2>] __mutex_lock_slowpath+0xa7/0x20e [<c034c845>] mutex_lock+0x1c/0x1f [<c0162edc>] vfs_unlink+0x34/0x8a [<cc891d98>] nfsd_unlink+0x18f/0x1e2 [nfsd] [<cc89884f>] nfsd3_proc_remove+0x95/0xa2 [nfsd] [<cc88f0d4>] nfsd_dispatch+0xc0/0x178 [nfsd] [<c033e84d>] svc_process+0x3a5/0x5ed [<cc88f5ba>] nfsd+0x1a7/0x305 [nfsd] [<c0101005>] kernel_thread_helper+0x5/0xb ============================================= [ INFO: possible recursive locking detected ] --------------------------------------------- nfsd/9580 is trying to acquire lock: (&inode->i_mutex){--..}, at: [<c034cc1d>] mutex_lock+0x1c/0x1f but task is already holding lock: (&inode->i_mutex){--..}, at: [<c034cc1d>] mutex_lock+0x1c/0x1f other info that might help us debug this: 2 locks held by nfsd/9580: #0: (hash_sem){..--}, at: [<cc89522b>] exp_readlock+0xd/0xf [nfsd] #1: (&inode->i_mutex){--..}, at: [<c034cc1d>] mutex_lock+0x1c/0x1f stack backtrace: [<c0103508>] show_trace_log_lvl+0x58/0x152 [<c0103b8b>] show_trace+0xd/0x10 [<c0103c2f>] dump_stack+0x19/0x1b [<c012aa63>] __lock_acquire+0x77a/0x9a3 [<c012af56>] lock_acquire+0x60/0x80 [<c034ca9a>] __mutex_lock_slowpath+0xa7/0x20e [<c034cc1d>] mutex_lock+0x1c/0x1f [<cc892ad1>] nfsd_setattr+0x2c8/0x499 [nfsd] [<cc893ede>] nfsd_create_v3+0x31b/0x4ac [nfsd] [<cc8984a1>] nfsd3_proc_create+0x128/0x138 [nfsd] [<cc88f0d4>] nfsd_dispatch+0xc0/0x178 [nfsd] [<c033ec1d>] svc_process+0x3a5/0x5ed [<cc88f5ba>] nfsd+0x1a7/0x305 [nfsd] [<c0101005>] kernel_thread_helper+0x5/0xb DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb Leftover inexact backtrace: [<c0103b8b>] show_trace+0xd/0x10 [<c0103c2f>] dump_stack+0x19/0x1b [<c012aa63>] __lock_acquire+0x77a/0x9a3 [<c012af56>] lock_acquire+0x60/0x80 [<c034ca9a>] __mutex_lock_slowpath+0xa7/0x20e [<c034cc1d>] mutex_lock+0x1c/0x1f [<cc892ad1>] nfsd_setattr+0x2c8/0x499 [nfsd] [<cc893ede>] nfsd_create_v3+0x31b/0x4ac [nfsd] [<cc8984a1>] nfsd3_proc_create+0x128/0x138 [nfsd] [<cc88f0d4>] nfsd_dispatch+0xc0/0x178 [nfsd] [<c033ec1d>] svc_process+0x3a5/0x5ed [<cc88f5ba>] nfsd+0x1a7/0x305 [nfsd] [<c0101005>] kernel_thread_helper+0x5/0xb Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Neil Brown <neilb@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: make rpc threads pools numa awareGreg Banks2006-10-02
| | | | | | | | | | | | | | | | | | | Actually implement multiple pools. On NUMA machines, allocate a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single global pool. Enqueue sockets on the svc_pool corresponding to the CPU on which the socket bh is run (i.e. the NIC interrupt CPU). Threads have their cpu mask set to limit them to the CPUs in the svc_pool that owns them. This is the patch that allows an Altix to scale NFS traffic linearly beyond 4 CPUs and 4 NICs. Incorporates changes and feedback from Neil Brown, Trond Myklebust, and Christoph Hellwig. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: add svc_set_num_threadsGreg Banks2006-10-02
| | | | | | | | | | | | | Currently knfsd keeps its own list of all nfsd threads in nfssvc.c; add a new way of managing the list of all threads in a svc_serv. Add svc_create_pooled() to allow creation of a svc_serv whose threads are managed by the sunrpc code. Add svc_set_num_threads() to manage the number of threads in a service, either per-pool or globally across the service. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: add svc_getGreg Banks2006-10-02
| | | | | | | | | | add svc_get() for those occasions when we need to temporarily bump up svc_serv->sv_nrthreads as a pseudo refcount. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: split svc_serv into poolsGreg Banks2006-10-02
| | | | | | | | | | | | | | | | | Split out the list of idle threads and pending sockets from svc_serv into a new svc_pool structure, and allocate a fixed number (in this patch, 1) of pools per svc_serv. The new structure contains a lock which takes over several of the duties of svc_serv->sv_lock, which is now relegated to protecting only sv_tempsocks, sv_permsocks, and sv_tmpcnt in svc_serv. The point is to move the hottest fields out of svc_serv and into svc_pool, allowing a following patch to arrange for a svc_pool per NUMA node or per CPU. This is a major step towards making the NFS server NUMA-friendly. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: convert sk_reserved to atomic_tGreg Banks2006-10-02
| | | | | | | | | | | Convert the svc_sock->sk_reserved variable from an int protected by svc_serv->sv_lock, to an atomic. This reduces (by 1) the number of places we need to take the (effectively global) svc_serv->sv_lock. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: use new lock for svc_sock deferred listGreg Banks2006-10-02
| | | | | | | | | | | Protect the svc_sock->sk_deferred list with a new lock svc_sock->sk_defer_lock instead of svc_serv->sv_lock. Using the more fine-grained lock reduces the number of places we need to take the svc_serv lock. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: convert sk_inuse to atomic_tGreg Banks2006-10-02
| | | | | | | | | | | Convert the svc_sock->sk_inuse counter from an int protected by svc_serv->sv_lock, to an atomic. This reduces the number of places we need to take the (effectively global) svc_serv->sv_lock. Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: move tempsock aging to a timerGreg Banks2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following are 11 patches from Greg Banks which combine to make knfsd more Numa-aware. They reduce hitting on 'global' data structures, and create some data-structures that can be node-local. knfsd threads are bound to a particular node, and the thread to handle a new request is chosen from the threads that are attach to the node that received the interrupt. The distribution of threads across nodes can be controlled by a new file in the 'nfsd' filesystem, though the default approach of an even spread is probably fine for most sites. Some (old) numbers that show the efficacy of these patches: N == number of NICs == number of CPUs == nmber of clients. Number of NUMA nodes == N/2 N Throughput, MiB/s CPU usage, % (max=N*100) Before After Before After --- ------ ---- ----- ----- 4 312 435 350 228 6 500 656 501 418 8 562 804 690 589 This patch: Move the aging of RPC/TCP connection sockets from the main svc_recv() loop to a timer which uses a mark-and-sweep algorithm every 6 minutes. This reduces the amount of work that needs to be done in the main RPC loop and the length of time we need to hold the (effectively global) svc_serv->sv_lock. [akpm@osdl.org: cleanup] Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: Drop 'serv' option to svc_recv and svc_processNeilBrown2006-10-02
| | | | | | | | | | | It isn't needed as it is available in rqstp->rq_server, and dropping it allows some local vars to be dropped. [akpm@osdl.org: build fix] Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: allow sockets to be passed to nfsd via 'portlist'NeilBrown2006-10-02
| | | | | | | | | | | | Userspace should create and bind a socket (but not connectted) and write the 'fd' to portlist. This will cause the nfs server to listen on that socket. To close a socket, the name of the socket - as read from 'portlist' can be written to 'portlist' with a preceding '-'. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: define new nfsdfs file: portlist - contains list of portsNeilBrown2006-10-02
| | | | | | | | | | | | | | | | | This file will list all ports that nfsd has open. Default when TCP enabled will be ipv4 udp 0.0.0.0 2049 ipv4 tcp 0.0.0.0 2049 Later, the list of ports will be settable. 'portlist' chosen rather than 'ports', to avoid unnecessary confusion with non-mainline patches which created 'ports' with different semantics. [akpm@osdl.org: cleanups, build fix] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: remove nfsd_versbits as intermediate storage for desired versionsNeilBrown2006-10-02
| | | | | | | | | | | | | | | | | | | | | | We have an array 'nfsd_version' which lists the available versions of nfsd, and 'nfsd_versions' (poor choice there :-() which lists the currently active versions. Then we have a bitmap - nfsd_versbits which says which versions are wanted. The bits in this bitset cause content to be copied from nfsd_version to nfsd_versions when nfsd starts. This patch removes nfsd_versbits and moves information directly from nfsd_version to nfsd_versions when requests for version changes arrive. Note that this doesn't make it possible to change versions while the server is running. This is because serv->sv_xdrsize is calculated when a service is created, and used when threads are created, and xdrsize depends on the active versions. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: be more selective in which sockets lockd listens onNeilBrown2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently lockd listens on UDP always, and TCP if CONFIG_NFSD_TCP is set. However as lockd performs services of the client as well, this is a problem. If CONFIG_NfSD_TCP is not set, and a tcp mount is used, the server will not be able to call back to lockd. So: - add an option to lockd_up saying which protocol is needed - Always open sockets for which an explicit port was given, otherwise only open a socket of the type required - Change nfsd to do one lockd_up per socket rather than one per thread. This - removes the dependancy on CONFIG_NFSD_TCP - means that lockd may open sockets other than at startup - means that lockd will *not* listen on UDP if the only mounts are TCP mount (and nfsd hasn't started). The latter is the only one that concerns me at all - I don't know if this might be a problem with some servers. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] knfsd: add a callback for when last rpc thread finishesNeilBrown2006-10-02
| | | | | | | | | | | | nfsd has some cleanup that it wants to do when the last thread exits, and there will shortly be some more. So collect this all into one place and define a callback for an rpc service to call when the service is about to be destroyed. [akpm@osdl.org: cleanups, build fix] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpumask: add highest_possible_node_idGreg Banks2006-10-02
| | | | | | | | | | | cpumask: add highest_possible_node_id(), analogous to highest_possible_processor_id(). [pj@sgi.com: fix typo] Signed-off-by: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kretprobe spinlock deadlock patchbibo,mao2006-10-02
| | | | | | | | | | | | kprobe_flush_task() possibly calls kfree function during holding kretprobe_lock spinlock, if kfree function is probed by kretprobe that will incur spinlock deadlock. This patch moves kfree function out scope of kretprobe_lock. Signed-off-by: bibo, mao <bibo.mao@intel.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Add regs_return_value() helperAnanth N Mavinakayanahalli2006-10-02
| | | | | | | | | | | | | Add the regs_return_value() macro to extract the return value in an architecture agnostic manner, given the pt_regs. Other architecture maintainers may want to add similar helpers. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kprobes: handle symbol resolution when <module:.symbol> is specifiedAnanth N Mavinakayanahalli2006-10-02
| | | | | | | | | | | | kallsyms_lookup_name() allows for <module:symbol> style specification for looking up symbol addresses. Handle the case where the user specifies <module:.symbol> on powerpc, given that 64-bit powerpc uses function descriptors. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Kprobes: Make kprobe modules more portableAnanth N Mavinakayanahalli2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In an effort to make kprobe modules more portable, here is a patch that: o Introduces the "symbol_name" field to struct kprobe. The symbol->address resolution now happens in the kernel in an architecture agnostic manner. 64-bit powerpc users no longer have to specify the ".symbols" o Introduces the "offset" field to struct kprobe to allow a user to specify an offset into a symbol. o The legacy mechanism of specifying the kprobe.addr is still supported. However, if both the kprobe.addr and kprobe.symbol_name are specified, probe registration fails with an -EINVAL. o The symbol resolution code uses kallsyms_lookup_name(). So CONFIG_KPROBES now depends on CONFIG_KALLSYMS o Apparantly kprobe modules were the only legitimate out-of-tree user of the kallsyms_lookup_name() EXPORT. Now that the symbol resolution happens in-kernel, remove the EXPORT as suggested by Christoph Hellwig o Modify tcp_probe.c that uses the kprobe interface so as to make it work on multiple platforms (in its earlier form, the code wouldn't work, say, on powerpc) Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] usb: fixup usb so it uses struct pidEric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | The problem with remembering a user space process by its pid is that it is possible that the process will exit, pid wrap around will occur. Converting to a struct pid avoid that problem, and paves the way for implementing a pid namespace. Also since usb is the only user of kill_proc_info_as_uid rename kill_proc_info_as_uid to kill_pid_info_as_uid and have the new version take a struct pid. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Define struct pspaceSukadev Bhattiprolu2006-10-02
| | | | | | | | | | | | | | | | | | | Define a per-container pid space object. And create one instance of this object, init_pspace, to define the entire pid space. Subsequent patches will provide/use interfaces to create/destroy pid spaces. Its a subset/rework of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285 . Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Kirill Korotaev <dev@sw.ru> Cc: Andrey Savochkin <saw@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Move pidmap to pspace.hSukadev Bhattiprolu2006-10-02
| | | | | | | | | | | | | | | | Move struct pidmap and PIDMAP_ENTRIES to a new file, include/linux/pspace.h where it will be used in subsequent patches to define pid spaces. Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285 [akpm@osdl.org: cleanups] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pid: simplify pid iteratorsOleg Nesterov2006-10-02
| | | | | | | | | | | | I think it is hardly possible to read the current do_each_task_pid(). The new version is much simpler and makes the code smaller. Only the do_each_task_pid change is tested, the do_each_pid_task isn't. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] const struct tty_operationsJeff Dike2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | As part of an SMP cleanliness pass over UML, I consted a bunch of structures in order to not have to document their locking. One of these structures was a struct tty_operations. In order to const it in UML without introducing compiler complaints, the declaration of tty_set_operations needs to be changed, and then all of its callers need to be fixed. This patch declares all struct tty_operations in the tree as const. In all cases, they are static and used only as input to tty_set_operations. As an extra check, I ran an i386 allyesconfig build which produced no extra warnings. 53 drivers are affected. I checked the history of a bunch of them, and in most cases, there have been only a handful of maintenance changes in the last six months. serial_core.c was the busiest one that I looked at. Signed-off-by: Jeff Dike <jdike@addtoit.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] file: modify struct fown_struct to use a struct pidEric W. Biederman2006-10-02
| | | | | | | | | | File handles can be requested to send sigio and sigurg to processes. By tracking the destination processes using struct pid instead of pid_t we make the interface safe from all potential pid wrap around problems. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vt: Make vt_pid a struct pid (making it pid wrap around safe).Eric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | I took a good hard look at the locking and it appears the locking on vt_pid is the console semaphore. Every modified path is called under the console semaphore except reset_vc when it is called from fn_SAK or do_SAK both of which appear to be in interrupt context. In addition I need to be careful because in the presence of an oops the console_sem may be arbitrarily dropped. Which leads me to conclude the current locking is inadequate for my needs. Given the weird cases we could hit because of oops printing instead of introducing an extra spin lock to protect the data and keep the pid to signal and the signal to send in sync, I have opted to use xchg on just the struct pid * pointer instead. Due to console_sem we will stay in sync between vt_pid and vt_mode except for a small window during a SAK, or oops handling. SAK handling should kill any user space process that care, and oops handling we are broken anyway. Besides the worst that can happen is that I try to send the wrong signal. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] vt: rework the console spawning variablesEric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | This is such a rare path it took me a while to figure out how to test this after soring out the locking. This patch does several things. - The variables used are moved into a structure and declared in vt_kern.h - A spinlock is added so we don't have SMP races updating the values. - Instead of raw pid_t value a struct_pid is used to guard against pid wrap around issues, if the daemon to spawn a new console dies. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pid: implement pid_nrEric W. Biederman2006-10-02
| | | | | | | | | | | | | As we stop storing pid_t's and move to storing struct pid *. We need a way to get the pid_t from the struct pid to report to user space what we have stored. Having a clean well defined way to do this is especially important as we move to multiple pid spaces as may need to report a different value to the caller depending on which pid space the caller is in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pid: implement signal functions that take a struct pid *Eric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | | | | | | Currently the signal functions all either take a task or a pid_t argument. This patch implements variants that take a struct pid *. After all of the users have been update it is my intention to remove the variants that take a pid_t as using pid_t can be more work (an extra hash table lookup) and difficult to get right in the presence of multiple pid namespaces. There are two kinds of functions introduced in this patch. The are the general use functions kill_pgrp and kill_pid which take a priv argument that is ultimately used to create the appropriate siginfo information, Then there are _kill_pgrp_info, kill_pgrp_info, kill_pid_info the internal implementation helpers that take an explicit siginfo. The distinction is made because filling out an explcit siginfo is tricky, and will be even more tricky when pid namespaces are introduced. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pid: add do_each_pid_taskEric W. Biederman2006-10-02
| | | | | | | | | | | | | To avoid pid rollover confusion the kernel needs to work with struct pid * instead of pid_t. Currently there is not an iterator that walks through all of the tasks of a given pid type starting with a struct pid. This prevents us replacing some pid_t instances with struct pid. So this patch adds do_each_pid_task which walks through the set of task for a given pid type starting with a struct pid. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] pid: implement access helpers for a tacks various process groupsEric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the last round of cleaning up the pid hash table a more general struct pid was introduced, that can be referenced counted. With the more general struct pid most if not all places where we store a pid_t we can now store a struct pid * and remove the need for a hash table lookup, and avoid any possible problems with pid roll over. Looking forward to the pid namespaces struct pid * gives us an absolute form a pid so we can compare and use them without caring which pid namespace we are in. This patchset introduces the infrastructure needed to use struct pid instead of pid_t, and then it goes on to convert two different kernel users that currently store a pid_t value. There are a lot more places to go but this is enough to get the basic idea. Before we can merge a pid namespace patch all of the kernel pid_t users need to be examined. Those that deal with user space processes need to be converted to using a struct pid *. Those that deal with kernel processes need to converted to using the kthread api. A rare few that only use their current processes pid values get to be left alone. This patch: task_session returns the struct pid of a tasks session. task_pgrp returns the struct pid of a tasks process group. task_tgid returns the struct pid of a tasks thread group. task_pid returns the struct pid of a tasks process id. These can be used to avoid unnecessary hash table lookups, and to implement safe pid comparisions in the face of a pid namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] proc: modify proc_pident_lookup to be completely table drivenEric W. Biederman2006-10-02
| | | | | | | | | | | Currently proc_pident_lookup gets the names and types from a table and then has a huge switch statement to get the inode and file operations it needs. That is silly and is becoming increasingly hard to maintain so I just put all of the information in the table. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] proc: readdir race fix (take 3)Eric W. Biederman2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem: An opendir, readdir, closedir sequence can fail to report process ids that are continually in use throughout the sequence of system calls. For this race to trigger the process that proc_pid_readdir stops at must exit before readdir is called again. This can cause ps to fail to report processes, and it is in violation of posix guarantees and normal application expectations with respect to readdir. Currently there is no way to work around this problem in user space short of providing a gargantuan buffer to user space so the directory read all happens in on system call. This patch implements the normal directory semantics for proc, that guarantee that a directory entry that is neither created nor destroyed while reading the directory entry will be returned. For directory that are either created or destroyed during the readdir you may or may not see them. Furthermore you may seek to a directory offset you have previously seen. These are the guarantee that ext[23] provides and that posix requires, and more importantly that user space expects. Plus it is a simple semantic to implement reliable service. It is just a matter of calling readdir a second time if you are wondering if something new has show up. These better semantics are implemented by scanning through the pids in numerical order and by making the file offset a pid plus a fixed offset. The pid scan happens on the pid bitmap, which when you look at it is remarkably efficient for a brute force algorithm. Given that a typical cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There are only 40 cache lines for the entire 32K pid space. A typical system will have 100 pids or more so this is actually fewer cache lines we have to look at to scan a linked list, and the worst case of having to scan the entire pid bitmap is pretty reasonable. If we need something more efficient we can go to a more efficient data structure for indexing the pids, but for now what we have should be sufficient. In addition this takes no additional locks and is actually less code than what we are doing now. Also another very subtle bug in this area has been fixed. It is possible to catch a task in the middle of de_thread where a thread is assuming the thread of it's thread group leader. This patch carefully handles that case so if we hit it we don't fail to return the pid, that is undergoing the de_thread dance. Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for providing the first fix, pointing this out and working on it. [oleg@tv-sign.ru: fix it] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jean Delvare <jdelvare@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] list module taint flags in Oops/panicRandy Dunlap2006-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | When listing loaded modules during an oops or panic, also list each module's Tainted flags if non-zero (P: Proprietary or F: Forced load only). If a module is did not taint the kernel, it is just listed like usbcore but if it did taint the kernel, it is listed like wizmodem(PF) Example: [ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [ 3260.121729] [<ffffffff8804c099>] :dump_test:proc_dump_test+0x99/0xc8 [ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0 [ 3260.121748] Oops: 0002 [1] SMP [ 3260.121753] CPU 1 [ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp [ 3260.121785] Pid: 5556, comm: bash Tainted: P 2.6.18-git10 #1 [Alternatively, I can look into listing tainted flags with 'lsmod', but that won't help in oopsen/panics so much.] [akpm@osdl.org: cleanup] Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] LIB: add gen_pool_destroy()Steve Wise2006-10-02
| | | | | | | | | | | Modules using the genpool allocator need to be able to destroy the data structure when unloading. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Dean Nelson <dcn@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Some config.h removalsZachary Amsden2006-10-01
| | | | | | | | | | | | | During tracking down a PAE compile failure, I found that config.h was being included in a bunch of places in i386 code. It is no longer necessary, so drop it. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] paravirt: update pte hookZachary Amsden2006-10-01
| | | | | | | | | | | | | | | | | | | Add a pte_update_hook which notifies about pte changes that have been made without using the set_pte / clear_pte interfaces. This allows shadow mode hypervisors which do not trap on page table access to maintain synchronized shadows. It also turns out, there was one pte update in PAE mode that wasn't using any accessor interface at all for setting NX protection. Considering it is PAE specific, and the accessor is i386 specific, I didn't want to add a generic encapsulation of this behavior yet. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] paravirt: remove set pte atomicZachary Amsden2006-10-01
| | | | | | | | | | | | | | | | | Now that ptep_establish has a definition in PAE i386 3-level paging code, the only paging model which is insane enough to have multi-word hardware PTEs which are not efficient to set atomically, we can remove the ghost of set_pte_atomic from other architectures which falesly duplicated it, and remove all knowledge of it from the generic pgtable code. set_pte_atomic is now a private pte operator which is specific to i386 Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] paravirt: optimize ptep establish for paeZachary Amsden2006-10-01
| | | | | | | | | | | | | | | | | The ptep_establish macro is only used on user-level PTEs, for P->P mapping changes. Since these always happen under protection of the pagetable lock, the strong synchronization of a 64-bit cmpxchg is not needed, in fact, not even a lock prefix needs to be used. We can simply instead clear the P-bit, followed by a normal set. The write ordering is still important to avoid the possibility of the TLB snooping a partially written PTE and getting a bad mapping installed. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>