aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAge
* kill boilerplate around posix_acl_chmod_masq()Al Viro2011-07-25
| | | | | | | | | new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified clone or with NULL if that has failed; returns 0 or -ve on error. All callers of posix_acl_chmod_masq() switched to that - they'd been doing exactly the same thing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* VFS : mount lock scalability for internal mountsTim Chen2011-07-24
| | | | | | | | | | | | | | | | | | | For a number of file systems that don't have a mount point (e.g. sockfs and pipefs), they are not marked as long term. Therefore in mntput_no_expire, all locks in vfs_mount lock are taken instead of just local cpu's lock to aggregate reference counts when we release reference to file objects. In fact, only local lock need to have been taken to update ref counts as these file systems are in no danger of going away until we are ready to unregister them. The attached patch marks file systems using kern_mount without mount point as long term. The contentions of vfs_mount lock is now eliminated. Before un-registering such file system, kern_unmount should be called to remove the long term flag and make the mount point ready to be freed. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2011-07-22
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.
| * mm/truncate.c: fix build for CONFIG_BLOCK not enabledRandy Dunlap2011-07-22
| | | | | | | | | | | | | | | | | | | | Fix build error when CONFIG_BLOCK is not enabled by providing a stub inode_dio_wait() function. mm/truncate.c:612: error: implicit declaration of function 'inode_dio_wait' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs:update the NOTE of the file_operations structureWanlong Gao2011-07-20
| | | | | | | | | | | | | | | | | | Big kernel lock had been removed and setlease now use the lock_flocks() to hold a special spin lock file_lock_lock by Matthew. So just remove the out-of-date NOTE. Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: add SEEK_HOLE and SEEK_DATA flagsJosef Bacik2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out using fiemap in things like cp cause more problems than it solves, so lets try and give userspace an interface that doesn't suck. We need to match solaris here, and the definitions are *o* If /whence/ is SEEK_HOLE, the offset of the start of the next hole greater than or equal to the supplied offset is returned. The definition of a hole is provided near the end of the DESCRIPTION. *o* If /whence/ is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset. So in the generic case the entire file is data and there is a virtual hole at the end. That means we will just return i_size for SEEK_HOLE and will return the same offset for SEEK_DATA. This is how Solaris does it so we have to do it the same way. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: seq_file - add event counter to simplify poll() supportKay Sievers2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: simplify the blockdev_direct_IO prototypeChristoph Hellwig2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Simple filesystems always pass inode->i_sb_bdev as the block device argument, and never need a end_io handler. Let's simply things for them and for my grepping activity by dropping these arguments. The only thing not falling into that scheme is ext4, which passes and end_io handler without needing special flags (yet), but given how messy the direct I/O code there is use of __blockdev_direct_IO in one instead of two out of three cases isn't going to make a large difference anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * rw_semaphore: remove up/down_read_non_ownerChristoph Hellwig2011-07-20
| | | | | | | | | | | | | | Now that the last users is gone these can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: kill i_alloc_semChristoph Hellwig2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * anonfd: fix missing declarationTomasz Stanislawski2011-07-20
| | | | | | | | | | | | | | | | The forward declaration of struct file_operations is added to avoid compilation warnings. Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * superblock: add filesystem shrinker operationsDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we have a per-superblock shrinker implementation, we can add a filesystem specific callout to it to allow filesystem internal caches to be shrunk by the superblock shrinker. Rather than perpetuate the multipurpose shrinker callback API (i.e. nr_to_scan == 0 meaning "tell me how many objects freeable in the cache), two operations will be added. The first will return the number of objects that are freeable, the second is the actual shrinker call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * superblock: introduce per-sb cache shrinker infrastructureDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * inode: move to per-sb LRU locksDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | | | With the inode LRUs moving to per-sb structures, there is no longer a need for a global inode_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesytsems completely from each other. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * inode: Make unused inode LRU per superblockDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclaimation schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb propotioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * vmscan: add customisable shrinker batch sizeDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | For shrinkers that have their own cond_resched* calls, having shrink_slab break the work down into small batches is not paticularly efficient. Add a custom batchsize field to the struct shrinker so that shrinkers can use a larger batch size if they desire. A value of zero (uninitialised) means "use the default", so behaviour is unchanged by this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * vmscan: add shrink_slab tracepointsDave Chinner2011-07-20
| | | | | | | | | | | | | | | | | | It is impossible to understand what the shrinkers are actually doing without instrumenting the code, so add a some tracepoints to allow insight to be gained. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * btrfs: kill magical embedded struct superblockAl Viro2011-07-20
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * switch vfs_path_lookup() to struct pathAl Viro2011-07-20
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill lookup_create()Al Viro2011-07-20
| | | | | | | | | | | | folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * make sure that nsproxy_cache is initialized early enoughAl Viro2011-07-20
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helpers: kern_path_create/user_path_createAl Viro2011-07-20
| | | | | | | | | | | | | | combination of kern_path_parent() and lookup_create(). Does *not* expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill LOOKUP_CONTINUEAl Viro2011-07-20
| | | | | | | | | | | | LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * nfs_open_context doesn't need struct path eitherAl Viro2011-07-20
| | | | | | | | | | | | just dentry, please... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill IPERM_FLAG_RCUAl Viro2011-07-20
| | | | | | | | | | | | not used anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to exec_permission()Al Viro2011-07-20
| | | | | | | | | | | | | | pass mask instead; kill security_inode_exec_permission() since we can use security_inode_permission() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to ->inode_permission()Al Viro2011-07-20
| | | | | | | | | | | | pass that via mask instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to ->permission()Al Viro2011-07-20
| | | | | | | | | | | | not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to generic_permission()Al Viro2011-07-20
| | | | | | | | | | | | | | redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to ->check_acl()Al Viro2011-07-20
| | | | | | | | | | | | not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: MAY_NOT_BLOCKAl Viro2011-07-20
| | | | | | | | | | | | Duplicate the flags argument into mask bitmap. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill check_acl callback of generic_permission()Al Viro2011-07-20
| | | | | | | | | | | | | | its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * lockless get_write_access/deny_write_accessAl Viro2011-07-20
| | | | | | | | | | | | new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill file_permission() completelyAl Viro2011-07-20
| | | | | | | | | | | | convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handlingAl Viro2011-07-20
| | | | | | | | | | | | | | | | new helper: would_dump(bprm, file). Checks if we are allowed to read the file and if we are not - sets ENFORCE_NODUMP. Exported, used in places that previously open-coded the same logics. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helper: iterate_supers_type()Al Viro2011-07-20
| | | | | | | | | | | | | | | | Call the given function for all superblocks of given type. Function gets a superblock (with s_umount locked shared) and (void *) argument supplied by caller of iterator. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: add a DCACHE_NEED_LOOKUP flag for d_flagsJosef Bacik2011-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'x86-vdso-for-linus' of ↵Linus Torvalds2011-07-22
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-64, vdso: Do not allocate memory for the vDSO clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option x86, vdso: Drop now wrong comment Document the vDSO and add a reference parser ia64: Replace clocksource.fsys_mmio with generic arch data x86-64: Move vread_tsc and vread_hpet into the vDSO clocksource: Replace vread with generic arch data x86-64: Add --no-undefined to vDSO build x86-64: Allow alternative patching in the vDSO x86: Make alternative instruction pointers relative x86-64: Improve vsyscall emulation CS and RIP handling x86-64: Emulate legacy vsyscalls x86-64: Fill unused parts of the vsyscall page with 0xcc x86-64: Remove vsyscall number 3 (venosys) x86-64: Map the HPET NX x86-64: Remove kernel.vsyscall64 sysctl x86-64: Give vvars their own page x86-64: Document some of entry_64.S x86-64: Fix alignment of jiffies variable
| * | clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG optionH. Peter Anvin2011-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The machinery for __ARCH_HAS_CLOCKSOURCE_DATA assumed a file in asm-generic would be the default for architectures without their own file in asm/, but that is not how it works. Replace it with a Kconfig option instead. Link: http://lkml.kernel.org/r/4E288AA6.7090804@zytor.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tony Luck <tony.luck@intel.com>
| * | ia64: Replace clocksource.fsys_mmio with generic arch dataAndy Lutomirski2011-07-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that clocksource.archdata is available, use it for ia64-specific code. Cc: Clemens Ladisch <clemens@ladisch.de> Cc: linux-ia64@vger.kernel.org Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/d31de0ee0842a0e322fb6441571c2b0adb323fa2.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | clocksource: Replace vread with generic arch dataAndy Lutomirski2011-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vread field was bloating struct clocksource everywhere except x86_64, and I want to change the way this works on x86_64, so let's split it out into per-arch data. Cc: x86@kernel.org Cc: Clemens Ladisch <clemens@ladisch.de> Cc: linux-ia64@vger.kernel.org Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/3ae5ec76a168eaaae63f08a2a1060b91aa0b7759.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | x86-64: Emulate legacy vsyscallsAndy Lutomirski2011-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a fair amount of code in the vsyscall page. It contains a syscall instruction (in the gettimeofday fallback) and who knows what will happen if an exploit jumps into the middle of some other code. Reduce the risk by replacing the vsyscalls with short magic incantations that cause the kernel to emulate the real vsyscalls. These incantations are useless if entered in the middle. This causes vsyscalls to be a little more expensive than real syscalls. Fortunately sensible programs don't use them. The only exception is time() which is still called by glibc through the vsyscall - but calling time() millions of times per second is not sensible. glibc has this fixed in the development tree. This patch is not perfect: the vread_tsc and vread_hpet functions are still at a fixed address. Fixing that might involve making alternative patching work in the vDSO. Signed-off-by: Andy Lutomirski <luto@mit.edu> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jesper Juhl <jj@chaosbits.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Jan Beulich <JBeulich@novell.com> Cc: richard -rw- weinberger <richard.weinberger@gmail.com> Cc: Mikael Pettersson <mikpe@it.uu.se> Cc: Andi Kleen <andi@firstfloor.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: Valdis.Kletnieks@vt.edu Cc: pageexec@freemail.hu Link: http://lkml.kernel.org/r/e64e1b3c64858820d12c48fa739efbd1485e79d5.1307292171.git.luto@mit.edu [ Removed the CONFIG option - it's simpler to just do it unconditionally. Tidied up the code as well. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'x86-numa-for-linus' of ↵Linus Torvalds2011-07-22
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-numa-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, numa: Implement pfn -> nid mapping granularity check x86, mm: s/PAGES_PER_ELEMENT/PAGES_PER_SECTION/
| * | | x86, numa: Implement pfn -> nid mapping granularity checkTejun Heo2011-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use sections array to map pfn to nid which is limited in granularity. If NUMA nodes are laid out such that the mapping cannot be accurate, boot will fail triggering BUG_ON() in mminit_verify_page_links(). On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been granular enough until commit 2706a0bf7b (x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This led to the following BUG_ON(). On node 0 totalpages: 2096615 DMA zone: 32 pages used for memmap DMA zone: 0 pages reserved DMA zone: 3927 pages, LIFO batch:0 Normal zone: 1740 pages used for memmap Normal zone: 220978 pages, LIFO batch:31 HighMem zone: 16405 pages used for memmap HighMem zone: 1853533 pages, LIFO batch:31 BUG: Int 6: CR2 (null) EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc EBX f2400000 EDX 00000006 ECX (null) EAX 00000001 err (null) EIP c16209aa CS 00000060 flg 00010002 Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000 (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null) f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17 Call Trace: [<c136b1e5>] ? early_fault+0x2e/0x2e [<c16209aa>] ? mminit_verify_page_links+0x12/0x42 [<c1620613>] ? memmap_init_zone+0xaf/0x10c [<c1620929>] ? free_area_init_node+0x2b9/0x2e3 [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451 [<c1601d80>] ? paging_init+0x112/0x118 [<c15f578d>] ? setup_arch+0x791/0x82f [<c15f43d9>] ? start_kernel+0x6a/0x257 This patch implements node_map_pfn_alignment() which determines maximum internode alignment and update numa_register_memblks() to reject NUMA configuration if alignment exceeds the pfn -> nid mapping granularity of the memory model as determined by PAGES_PER_SECTION. This makes the problematic machine boot w/ flatmem by rejecting the NUMA config and provides protection against crazy NUMA configurations. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com> Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: Conny Seidel <conny.seidel@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | | | Merge branch 'x86-mtrr-for-linus' of ↵Linus Torvalds2011-07-22
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mtrr: Use pci_dev->revision x86, mtrr: use stop_machine APIs for doing MTRR rendezvous stop_machine: implement stop_machine_from_inactive_cpu() stop_machine: reorganize stop_cpus() implementation x86, mtrr: lock stop machine during MTRR rendezvous sequence
| * | | | x86, mtrr: use stop_machine APIs for doing MTRR rendezvousSuresh Siddha2011-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MTRR rendezvous sequence is not implemened using stop_machine() before, as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). Now that we have a new stop_machine_from_inactive_cpu() API, use it for rendezvous during mtrr init of a logical processor that is coming online. For the rest (runtime MTRR modification, system boot, resume paths), use stop_machine() to implement the rendezvous sequence. This will consolidate and cleanup the code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182057.076997177@sbsiddha-MOBL3.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | stop_machine: implement stop_machine_from_inactive_cpu()Tejun Heo2011-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, mtrr wants stop_machine functionality while a CPU is being brought up. As stop_machine() requires the calling CPU to be active, mtrr implements its own stop_machine using stop_one_cpu() on each online CPU. This doesn't only unnecessarily duplicate complex logic but also introduces a possibility of deadlock when it races against the generic stop_machine(). This patch implements stop_machine_from_inactive_cpu() to serve such use cases. Its functionality is basically the same as stop_machine(); however, it should be called from a CPU which isn't active and doesn't depend on working scheduling on the calling CPU. This is achieved by using busy loops for synchronization and open-coding stop_cpus queuing and waiting with direct invocation of fn() for local CPU inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110623182056.982526827@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | | x86, mtrr: lock stop machine during MTRR rendezvous sequenceSuresh Siddha2011-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially happen in parallel with another system wide rendezvous using stop_machine(). This can lead to deadlock (The order in which works are queued can be different on different cpu's. Some cpu's will be running the first rendezvous handler and others will be running the second rendezvous handler. Each set waiting for the other set to join for the system wide rendezvous, leading to a deadlock). MTRR rendezvous sequence is not implemented using stop_machine() as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). stop_machine() works with only online cpus. For now, take the stop_machine mutex in the MTRR rendezvous sequence that gets called from an online cpu (here we are in the process context and can potentially sleep while taking the mutex). And the MTRR rendezvous that gets triggered during cpu online doesn't need to take this stop_machine lock (as the stop_machine() already ensures that there is no cpu hotplug going on in parallel by doing get_online_cpus()) TBD: Pursue a cleaner solution of extending the stop_machine() infrastructure to handle the case where the calling cpu is still not online and use this for MTRR rendezvous sequence. fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008 Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com Cc: stable@kernel.org # 2.6.35+, backport a week or two after this gets more testing in mainline Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* | | | | Merge branch 'x86-efi-for-linus' of ↵Linus Torvalds2011-07-22
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, efi: Properly pre-initialize table pointers x86, efi: Add infrastructure for UEFI 2.0 runtime services x86, efi: Fix argument types for SetVariable()