aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* um: Add support for CONFIG_STACKTRACEDaniel Walter2014-10-13
| | | | | | | Add stacktrace support for User Mode Linux Signed-off-by: Daniel Walter <dwalter@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
* um: ubd: Fix for processes stuck in D state foreverThorsten Knabe2014-10-13
| | | | | | | | | | | | | | Starting with Linux 3.12 processes get stuck in D state forever in UserModeLinux under sync heavy workloads. This bug was introduced by commit 805f11a0d5 (um: ubd: Add REQ_FLUSH suppport). Fix bug by adding a check if FLUSH request was successfully submitted to the I/O thread and keeping the FLUSH request on the request queue on submission failures. Fixes: 805f11a0d5 (um: ubd: Add REQ_FLUSH suppport) Signed-off-by: Thorsten Knabe <linux@thorsten-knabe.de> Cc: stable@kernel.org # >= 3.12 Signed-off-by: Richard Weinberger <richard@nod.at>
* um: delete unnecessary bootmem struct page arrayHonggang Li2014-10-13
| | | | | | | | | | | | | 1) uml kernel bootmem managed through bootmem_data->node_bootmem_map, not the struct page array, so the array is unnecessary. 2) the bootmem struct page array has been pointed by a *local* pointer, struct page *map, in init_maps function. The array can be accessed only in init_maps's scope. As a result, uml kernel wastes about 1% of total memory. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
* um: remove csum_partial_copy_generic_i386 to clean up exception tableHonggang Li2014-10-13
| | | | | | | | | | | arch/x86/um/checksum_32.S had been copy & paste from x86. When build x86 uml, csum_partial_copy_generic_i386 mess up the exception table. In fact, exception table dose not work in uml kernel. And csum_partial_copy_generic_i386 never been called. So, delete it. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
* Merge branch 'for-3.18' of ↵Linus Torvalds2014-10-10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are... - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix 6ae833c7fe0c ("percpu: fix how @gfp is interpreted by the percpu allocator") just now. - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly. - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging. There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...
| * percpu: fix how @gfp is interpreted by the percpu allocatorTejun Heo2014-10-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When @gfp is specified, the percpu allocator is interested in whether it contains all of GFP_KERNEL or not. If it does, the normal allocation path is taken; otherwise, the atomic allocation path. Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp contains any part of GFP_KERNEL. Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of "!(gfp & GFP_KERNEL)" to decide whether the allocation should be atomic or not. Signed-off-by: Tejun Heo <tj@kernel.org>
| * blk-mq, percpu_ref: start q->mq_usage_counter in atomic modeTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take above ten seconds when scsi-mq is used. [ 0.949892] scsi host0: Virtio SCSI HBA [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz <stall> [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0 [ 16.194099] osd: LOADED open-osd 0.2.1 [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA This is also the reason why request_queues start in bypass mode which is ended on blk_register_queue() as shutting down a fully functional queue also involves a RCU grace period and the queues for non-existent SCSI devices never reach registration. blk-mq basically needs to do the same thing - start the mq in a degraded mode which is faster to shut down and then make it fully functional only after the queue reaches registration. percpu_ref recently grew facilities to force atomic operation until explicitly switched to percpu mode, which can be used for this purpose. This patch makes blk-mq initialize q->mq_usage_counter in atomic mode and switch it to percpu mode only once blk_register_queue() is reached. Note that this issue was previously worked around by 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") for v3.17. The temp fix was reverted in preparation of adding persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe""). This patch and the prerequisite percpu_ref changes will be merged during v3.18 devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count") Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * percpu_ref: make INIT_ATOMIC and switch_to_atomic() stickyTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, a percpu_ref which is initialized with PERPCU_REF_INIT_ATOMIC or switched to atomic mode via switch_to_atomic() automatically reverts to percpu mode on the first percpu_ref_reinit(). This makes the atomic mode difficult to use for cases where a percpu_ref is used as a persistent on/off switch which may be cycled multiple times. This patch makes such atomic state sticky so that it survives through kill/reinit cycles. After this patch, atomic state is cleared only by an explicit percpu_ref_switch_to_percpu() call. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * percpu_ref: add PERCPU_REF_INIT_* flagsTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * percpu_ref: decouple switching to percpu mode and reinitTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref has treated the dropping of the base reference and switching to atomic mode as an integral operation; however, there's nothing inherent tying the two together. The use cases for percpu_ref have been expanding continuously. While the current init/kill/reinit/exit model can cover a lot, the coupling of kill/reinit with atomic/percpu mode switching is turning out to be too restrictive for use cases where many percpu_refs are created and destroyed back-to-back with only some of them reaching extended operation. The coupling also makes implementing always-atomic debug mode difficult. This patch separates out percpu mode switching into percpu_ref_switch_to_percpu() and reimplements percpu_ref_reinit() on top of it. * DEAD still requires ATOMIC. A dead ref can't be switched to percpu mode w/o going through reinit. v2: __percpu_ref_switch_to_percpu() was missing static. Fixed. Reported by Fengguang aka kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kbuild test robot <fengguang.wu@intel.com>
| * percpu_ref: decouple switching to atomic mode and killingTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref has treated the dropping of the base reference and switching to atomic mode as an integral operation; however, there's nothing inherent tying the two together. The use cases for percpu_ref have been expanding continuously. While the current init/kill/reinit/exit model can cover a lot, the coupling of kill/reinit with atomic/percpu mode switching is turning out to be too restrictive for use cases where many percpu_refs are created and destroyed back-to-back with only some of them reaching extended operation. The coupling also makes implementing always-atomic debug mode difficult. This patch separates out atomic mode switching into percpu_ref_switch_to_atomic() and reimplements percpu_ref_kill_and_confirm() on top of it. * The handling of __PERCPU_REF_ATOMIC and __PERCPU_REF_DEAD is now differentiated. Among get/put operations, percpu_ref_tryget_live() is the only one which cares about DEAD. * percpu_ref_switch_to_atomic() can be called multiple times on the same ref. This means that multiple @confirm_switch may get queued up which we can't do reliably without extra memory area. This is handled by making the later invocation synchronously wait for the completion of the previous one. This isn't particularly desirable but such synchronous waits shouldn't happen in most cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * percpu_ref: add PCPU_REF_DEADTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref will be restructured so that percpu/atomic mode switching and reference killing are dedoupled. In preparation, add PCPU_REF_DEAD and PCPU_REF_ATOMIC_DEAD which is OR of ATOMIC and DEAD. For now, ATOMIC and DEAD are changed together and all PCPU_REF_ATOMIC uses are converted to PCPU_REF_ATOMIC_DEAD without causing any behavior changes. percpu_ref_init() now specifies an explicit alignment when allocating the percpu counters so that the pointer has enough unused low bits to accomodate the flags. Note that one flag was fine as min alignment for percpu memory is 2 bytes but two flags are already too many for the natural alignment of unsigned longs on archs like cris and m68k. v2: The original patch had BUILD_BUG_ON() which triggers if unsigned long's alignment isn't enough to accomodate the flags, which triggered on cris and m64k. percpu_ref_init() updated to specify the required alignment explicitly. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: kbuild test robot <fengguang.wu@intel.com>
| * percpu_ref: rename things to prepare for decoupling percpu/atomic mode switchTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref will be restructured so that percpu/atomic mode switching and reference killing are dedoupled. In preparation, do the following renames. * percpu_ref->confirm_kill -> percpu_ref->confirm_switch * __PERCPU_REF_DEAD -> __PERCPU_REF_ATOMIC * __percpu_ref_alive() -> __ref_is_percpu() This patch is pure rename and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
| * percpu_ref: replace pcpu_ prefix with percpu_Tejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref uses pcpu_ prefix for internal stuff and percpu_ for externally visible ones. This is the same convention used in the percpu allocator implementation. It works fine there but percpu_ref doesn't have too much internal-only stuff and scattered usages of pcpu_ prefix are confusing than helpful. This patch replaces all pcpu_ prefixes with percpu_. This is pure rename and there's no functional change. Note that PCPU_REF_DEAD is renamed to __PERCPU_REF_DEAD to signify that the flag is internal. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
| * percpu_ref: minor code and comment updatesTejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | * Some comments became stale. Updated. * percpu_ref_tryget() unnecessarily initializes @ret. Removed. * A blank line removed from percpu_ref_kill_rcu(). * Explicit function name in a WARN format string replaced with __func__. * WARN_ON() in percpu_ref_reinit() converted to WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
| * percpu_ref: relocate percpu_ref_reinit()Tejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | percpu_ref is gonna go through restructuring. Move percpu_ref_reinit() after percpu_ref_kill_and_confirm(). This will make later changes easier to follow and result in cleaner organization. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com>
| * Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during ↵Tejun Heo2014-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | probe" This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which was a temporary fix for SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
| * Merge branch 'for-linus' of ↵Tejun Heo2014-09-24
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
| * | Revert "percpu: free percpu allocation info for uniprocessor system"Guenter Roeck2014-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system"). The commit causes a hang with a crisv32 image. This may be an architecture problem, but at least for now the revert is necessary to be able to boot a crisv32 image. Cc: Tejun Heo <tj@kernel.org> Cc: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system") Cc: stable@vger.kernel.org # Please don't apply 3189eddbcafc
| * | percpu-refcount: make percpu_ref based on longs instead of intsTejun Heo2014-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref is currently based on ints and the number of refs it can cover is (1 << 31). This makes it impossible to use a percpu_ref to count memory objects or pages on 64bit machines as it may overflow. This forces those users to somehow aggregate the references before contributing to the percpu_ref which is often cumbersome and sometimes challenging to get the same level of performance as using the percpu_ref directly. While using ints for the percpu counters makes them pack tighter on 64bit machines, the possible gain from using ints instead of longs is extremely small compared to the overall gain from per-cpu operation. This patch makes percpu_ref based on longs so that it can be used to directly count memory objects or pages. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * | percpu-refcount: improve WARN messagesTejun Heo2014-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu_ref's WARN messages can be a lot more helpful by indicating who's the culprit. Make them report the release function that the offending percpu-refcount is associated with. This should make it a lot easier to track down the reported invalid refcnting operations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com>
| * | percpu: fix locking regression in the failure path of pcpu_alloc()Tejun Heo2014-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | While updating locking, b38d08f3181c ("percpu: restructure locking") broke pcpu_create_chunk() creation path in pcpu_alloc(). It returns without releasing pcpu_alloc_mutex. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Julia Lawall <julia.lawall@lip6.fr>
| * | percpu-refcount: add @gfp to percpu_ref_init()Tejun Heo2014-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>
| * | proportions: add @gfp to init functionsTejun Heo2014-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Percpu allocator now supports allocation mask. Add @gfp to [flex_]proportions init functions so that !GFP_KERNEL allocation masks can be used with them too. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org>
| * | percpu_counter: add @gfp to percpu_counter_init()Tejun Heo2014-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Percpu allocator now supports allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too. We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: "David S. Miller" <davem@davemloft.net> Cc: x86@kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>
| * | percpu_counter: make percpu_counters_lock irq-safeTejun Heo2014-09-07
| | | | | | | | | | | | | | | | | | | | | | | | percpu_counter is scheduled to grow @gfp support to allow atomic initialization. This patch makes percpu_counters_lock irq-safe so that it can be safely used from atomic contexts. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: implement asynchronous chunk populationTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The percpu allocator now supports atomic allocations by only allocating from already populated areas but the mechanism to ensure that there's adequate amount of populated areas was missing. This patch expands pcpu_balance_work so that in addition to freeing excess free chunks it also populates chunks to maintain an adequate level of populated areas. pcpu_alloc() schedules pcpu_balance_work if the amount of free populated areas is too low or after an atomic allocation failure. * PERPCU_DYNAMIC_RESERVE is increased by two pages to account for PCPU_EMPTY_POP_PAGES_LOW. * pcpu_async_enabled is added to gate both async jobs - chunk->map_extend_work and pcpu_balance_work - so that we don't end up scheduling them while the needed subsystems aren't up yet. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: rename pcpu_reclaim_work to pcpu_balance_workTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pcpu_reclaim_work will also be used to populate chunks asynchronously. Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is renamed to pcpu_balance_workfn() and some of its local variables are renamed too. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populatedTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pcpu_nr_empty_pop_pages counts the number of empty populated pages across all chunks and chunk->nr_populated counts the number of populated pages in a chunk. Both will be used to implement pre/async population for atomic allocations. pcpu_chunk_[de]populated() are added to update chunk->populated, chunk->nr_populated and pcpu_nr_empty_pop_pages together. All successful chunk [de]populations should be followed by the corresponding pcpu_chunk_[de]populated() calls. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: make sure chunk->map array has available spaceTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An allocation attempt may require extending chunk->map array which requires GFP_KERNEL context which isn't available for atomic allocations. This patch ensures that chunk->map array usually keeps some amount of available space by directly allocating buffer space during GFP_KERNEL allocations and scheduling async extension during atomic ones. This should make atomic allocation failures from map space exhaustion rare. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: implement [__]alloc_percpu_gfp()Tejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that pcpu_alloc_area() can allocate only from populated areas, it's easy to add atomic allocation support to [__]alloc_percpu(). Update pcpu_alloc() so that it accepts @gfp and skips all the blocking operations and allocates only from the populated areas if @gfp doesn't contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp() are added. While this means that atomic allocations are possible, this isn't complete yet as there's no mechanism to ensure that certain amount of populated areas is kept available and atomic allocations may keep failing under certain conditions. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: indent the population block in pcpu_alloc()Tejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The next patch will conditionalize the population block in pcpu_alloc() which will end up making a rather large indentation change obfuscating the actual logic change. This patch puts the block under "if (true)" so that the next patch can avoid indentation changes. The defintions of the local variables which are used only in the block are moved into the block. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: make pcpu_alloc_area() capable of allocating only from populated areasTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update pcpu_alloc_area() so that it can skip unpopulated areas if the new parameter @pop_only is true. This is implemented by a new function, pcpu_fit_in_area(), which determines the amount of head padding considering the alignment and populated state. @pop_only is currently always false but this will be used to implement atomic allocation. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: restructure lockingTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At first, the percpu allocator required a sleepable context for both alloc and free paths and used pcpu_alloc_mutex to protect everything. Later, pcpu_lock was introduced to protect the index data structure so that the free path can be invoked from atomic contexts. The conversion only updated what's necessary and left most of the allocation path under pcpu_alloc_mutex. The percpu allocator is planned to add support for atomic allocation and this patch restructures locking so that the coverage of pcpu_alloc_mutex is further reduced. * pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new chunk and populating the allocated area. Everything else is now protected soley by pcpu_lock. After this change, multiple instances of pcpu_extend_area_map() may race but the function already implements sufficient synchronization using pcpu_lock. This also allows multiple allocators to arrive at new chunk creation. To avoid creating multiple empty chunks back-to-back, a new chunk is created iff there is no other empty chunk after grabbing pcpu_alloc_mutex. * pcpu_lock is now held while modifying chunk->populated bitmap. After this, all data structures are protected by pcpu_lock. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: make percpu-km set chunk->populated bitmap properlyTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu-km instantiates the whole chunk on creation and doesn't make use of chunk->populated bitmap and leaves it as zero. While this currently doesn't cause any problem, the inconsistency makes it difficult to build further logic on top of chunk->populated. This patch makes percpu-km fill chunk->populated on creation so that the bitmap is always consistent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Christoph Lameter <cl@linux.com>
| * | percpu: move region iterations out of pcpu_[de]populate_chunk()Tejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, pcpu_[de]populate_chunk() were called with the range which may contain multiple target regions in it and pcpu_[de]populate_chunk() iterated over the regions. This has the benefit of batching up cache flushes for all the regions; however, we're planning to add more bookkeeping logic around [de]population to support atomic allocations and this delegation of iterations gets in the way. This patch moves the region iterations out of pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and pcpu_reclaim() - so that we can later add logic to track more states around them. This change may make cache and tlb flushes more frequent but multi-region [de]populations are rare anyway and if this actually becomes a problem, it's not difficult to factor out cache flushes as separate callbacks which are directly invoked from percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: move common parts out of pcpu_[de]populate_chunk()Tejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu-vm and percpu-km implement separate versions of pcpu_[de]populate_chunk() and some part which is or should be common are currently in the specific implementations. Make the following changes. * Allocate area clearing is moved from the pcpu_populate_chunk() implementations to pcpu_alloc(). This makes percpu-km's version noop. * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved to their respective callers so that they are applied to percpu-km too. This doesn't make any meaningful difference as both functions are noop for percpu-km; however, this is more consistent and will help implementing atomic allocation support. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: remove @may_alloc from pcpu_get_pages()Tejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pcpu_get_pages() creates the temp pages array if not already allocated and returns the pointer to it. As the function is called from both [de]population paths and depopulation can only happen after at least one successful population, the param doesn't make any difference - the allocation will always happen on the population path anyway. Remove @may_alloc from pcpu_get_pages(). Also, add an lockdep assertion pcpu_alloc_mutex instead of vaguely stating that the exclusion is the caller's responsibility. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | percpu: remove the usage of separate populated bitmap in percpu-vmTejun Heo2014-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | percpu-vm uses pcpu_get_pages_and_bitmap() to acquire temp pages array and populated bitmap and uses the two during [de]population. The temp bitmap is used only to build the new bitmap that is copied to chunk->populated after the operation succeeds; however, the new bitmap can be trivially set after success without using the temp bitmap. This patch removes the temp populated bitmap usage from percpu-vm.c. * pcpu_get_pages_and_bitmap() is renamed to pcpu_get_pages() and no longer hands out the temp bitmap. * @populated arugment is dropped from all the related functions. @populated updates in pcpu_[un]map_pages() are dropped. * Two loops in pcpu_map_pages() are merged. * pcpu_[de]populated_chunk() modify chunk->populated bitmap directly from @page_start and @page_end after success. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Christoph Lameter <cl@linux.com>
* | | Merge branch 'for-3.18' of ↵Linus Torvalds2014-10-10
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting. Just a handful of cleanup patches" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: remove redundant variable in cgroup_mount()" cgroup: remove redundant variable in cgroup_mount() cgroup: fix missing unlock in cgroup_release_agent() cgroup: remove CGRP_RELEASABLE flag perf/cgroup: Remove perf_put_cgroup() cgroup: remove redundant check in cgroup_ino() cpuset: simplify proc_cpuset_show() cgroup: simplify proc_cgroup_show() cgroup: use a per-cgroup work for release agent cgroup: remove bogus comments cgroup: remove redundant code in cgroup_rmdir() cgroup: remove some useless forward declarations cgroup: fix a typo in comment.
| * | | Revert "cgroup: remove redundant variable in cgroup_mount()"Zefan Li2014-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 0c7bf3e8cab7900e17ce7f97104c39927d835469. If there are child cgroups in the cgroupfs and then we umount it, the superblock will be destroyed but the cgroup_root will be kept around. When we mount it again, cgroup_mount() will find this cgroup_root and allocate a new sb for it. So with this commit we will be trapped in a dead loop in the case described above, because kernfs_pin_sb() keeps returning NULL. Currently I don't see how we can avoid using both pinned_sb and new_sb, so just revert it. Cc: Al Viro <viro@ZenIV.linux.org.uk> Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: remove redundant variable in cgroup_mount()Zefan Li2014-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both pinned_sb and new_sb indicate if a new superblock is needed, so we can just remove new_sb. Note now we must check if kernfs_tryget_sb() returns NULL, because when it returns NULL, kernfs_mount() may still re-use an existing superblock, which is just allocated by another concurent mount. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: fix missing unlock in cgroup_release_agent()Zefan Li2014-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch 971ff4935538: "cgroup: use a per-cgroup work for release agent" from Sep 18, 2014, leads to the following static checker warning: kernel/cgroup.c:5310 cgroup_release_agent() warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: remove CGRP_RELEASABLE flagZefan Li2014-09-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We call put_css_set() after setting CGRP_RELEASABLE flag in cgroup_task_migrate(), but in other places we call it without setting the flag. I don't see the necessity of this flag. Moreover once the flag is set, it will never be cleared, unless writing to the notify_on_release control file, so it can be quite confusing if we look at the output of debug.releasable. # mount -t cgroup -o debug xxx /cgroup # mkdir /cgroup/child # cat /cgroup/child/debug.releasable 0 <-- shows 0 though the cgroup is empty # echo $$ > /cgroup/child/tasks # cat /cgroup/child/debug.releasable 0 # echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks # cat /proc/child/debug.releasable 1 <-- shows 1 though the cgroup is not empty This patch removes the flag, and now debug.releasable shows if the cgroup is empty or not. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | perf/cgroup: Remove perf_put_cgroup()Zefan Li2014-09-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5a17f543ed68 ("cgroup: improve css_from_dir() into css_tryget_from_dir()") removed perf_tryget_cgroup(), so let's also remove perf_put_cgroup(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: remove redundant check in cgroup_ino()Zefan Li2014-09-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After we implemented default unified hierarchy, cgrp->kn can never be NULL. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cpuset: simplify proc_cpuset_show()Zefan Li2014-09-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the ONE macro instead of REG, and we can simplify proc_cpuset_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: simplify proc_cgroup_show()Zefan Li2014-09-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the ONE macro instead of REG, and we can simplify proc_cgroup_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: use a per-cgroup work for release agentZefan Li2014-09-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using a global work to schedule release agent on removable cgroups, we change to use a per-cgroup work to do this, which makes the code much simpler. v2: use a dedicated work instead of reusing css->destroy_work. (Tejun) Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: remove bogus commentsLi Zefan2014-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We never grab cgroup mutex in fork and exit paths no matter whether notify_on_release is set or not. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>