| Commit message (Collapse) | Author | Age |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
under heavy memory pressure, and Eric Whitney's bigalloc fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
ext4: fix suboptimal seek_{data,hole} extents traversial
ext4: ext4_inline_data_fiemap should respect callers argument
ext4: prevent fsreentrance deadlock for inline_data
ext4: forbid journal_async_commit in data=ordered mode
jbd2: remove unnecessary NULL check before iput()
ext4: Remove an unnecessary check for NULL before iput()
ext4: remove unneeded code in ext4_unlink
ext4: don't count external journal blocks as overhead
ext4: remove never taken branch from ext4_ext_shift_path_extents()
ext4: create nojournal_checksum mount option
ext4: update comments regarding ext4_delete_inode()
ext4: cleanup GFP flags inside resize path
ext4: introduce aging to extent status tree
ext4: cleanup flag definitions for extent status tree
ext4: limit number of scanned extents in status tree shrinker
ext4: move handling of list of shrinkable inodes into extent status code
ext4: change LRU to round-robin in extent status tree shrinker
ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
ext4: fix block reservation for bigalloc filesystems
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"
Call Trace:
[<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
[<ffffffff81145260>] pagecache_get_page+0x65/0x15a
[<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
[<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
[<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
[<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
[<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
[<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
[<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
[<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
[<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
[<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
[<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
[<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
[<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
[<ffffffff81185656>] ? file_start_write+0x2a/0x2c
[<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
[<ffffffff811858ce>] new_sync_write+0x8a/0xb2
[<ffffffff81186e7b>] vfs_write+0xb5/0x14d
[<ffffffff81186ffb>] SyS_write+0x5c/0x8c
[<ffffffff816f2529>] system_call_fastpath+0x12/0x17
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.
TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);
Original report: https://lkml.org/lkml/2014/10/16/620
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0. Also fix incorrect
extent length determination.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin(). grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.
To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| | |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This was fixed for ext3 with:
e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs
but was never fixed for ext4.
With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
path[depth].p_hdr can never be NULL for a path passed to us (and even if
it could, EXT_LAST_EXTENT() would make something != NULL from it). So
just remove the branch.
Coverity-id: 1196498
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| | |
ext4_delete_inode() has been renamed for a long time, update
comments for this.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:
ioctl
->ext4_group_add
->journal_start
->ext4_setup_new_descs
->ext4_mb_add_groupinfo -> GFP_KERNEL
->ext4_flex_group_add
->ext4_update_super
->ext4_calculate_overhead -> GFP_KERNEL
->journal_stop
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Introduce a simple aging to extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. Shrinker then
skips entries with referenced bit set and clears the bit. Thus
frequently used extents have higher chances of staying in memory.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently flags for extent status tree are defined twice, once shifted
and once without a being shifted. Consolidate these definitions into one
place and make some computations automatic to make adding flags less
error prone. Compiler should be clever enough to figure out these are
constants and generate the same code.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).
Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.
Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.
Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.
We replace the lru ordering with a simple round-robin. After that we
never need to keep a lru list. That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently extent status tree doesn't cache extent hole when a write
looks up in extent tree to make sure whether a block has been allocated
or not. In this case, we don't put extent hole in extent cache because
later this extent might be removed and a new delayed extent might be
added back. But it will cause a defect when we do a lot of writes. If
we don't put extent hole in extent cache, the following writes also need
to access extent tree to look at whether or not a block has been
allocated. It brings a cache miss. This commit fixes this defect.
Also if the inode doesn't have any extent, this extent hole will be
cached as well.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
For bigalloc filesystems we have to check whether newly requested inode
block isn't already part of a cluster for which we already have delayed
allocation reservation. This check happens in ext4_ext_map_blocks() and
that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
ext4_da_map_blocks() finds in extent cache information about the block,
we don't call into ext4_ext_map_blocks() and thus we always end up
getting new reservation even if the space for cluster is already
reserved. This results in overreservation and premature ENOSPC reports.
Fix the problem by checking for existing cluster reservation already in
ext4_da_map_blocks(). That simplifies the logic and actually allows us
to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ext4_ext_remove_space() can incorrectly free a partial_cluster if
EAGAIN is encountered while truncating or punching. Extent removal
should be retried in this case.
It also fails to free a partial cluster when the punched region begins
at the start of a file on that unaligned cluster and where the entire
file has not been punched. Remove the requirement that all blocks in
the file must have been freed in order to free the partial cluster.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add some casts and rearrange a few statements for improved readability.
Some code can also be simplified and made more readable if we set
partial_cluster to 0 rather than to a negative value when we can tell
we've hit the left edge of the punched region.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The fix in commit ad6599ab3ac9 ("ext4: fix premature freeing of
partial clusters split across leaf blocks"), intended to avoid
dereferencing an invalid extent pointer when determining whether a
partial cluster should be freed, wasn't quite good enough. Assure that
at least one extent remains at the start of the leaf once the hole has
been punched. Otherwise, the pointer to the extent to the right of the
hole will be invalid and a partial cluster will be incorrectly freed.
Set partial_cluster to 0 when we can tell we've hit the left edge of
the punched region within the leaf. This prevents incorrect freeing
of a partial cluster when ext4_ext_rm_leaf is called one last time
during extent tree traversal after the punched region has been removed.
Adjust comments to reflect code changes and a correction. Remove a bit
of dead code.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The partial_cluster variable is not always initialized correctly when
hole punching on bigalloc file systems. Although commit c06344939422
("ext4: fix partial cluster handling for bigalloc file systems")
addressed the case where the right edge of the punched region and the
next extent to its right were within the same leaf, it didn't handle
the case where the next extent to its right is in the next leaf. This
causes xfstest generic/300 to fail.
Fix this by replacing the code in c0634493922 with a more general
solution that can continue the search for the first cluster to the
right of the punched region into the next leaf if present. If found,
partial_cluster is initialized to this cluster's negative value.
There's no need to determine if that cluster is actually shared; we
simply record it so its blocks won't be freed in the event it does
happen to be shared.
Also, minimize the burden on non-bigalloc file systems with some minor
code simplification.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302
Unfortunetly there is nothing we can do if some other task holds BH's
refenrence. So we must return EBUSY in this case. But we can try
kicking the journal to see if the other task releases the bh reference
after the commit is complete. Also decrease false positives by
properly checking for ENOSPC and retrying the allocation after kicking
the journal --- which is done by ext4_should_retry_alloc().
[ Modified by tytso to properly check for ENOSPC. ]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup update from Tejun Heo:
"cpuset got simplified a bit. cgroup core got a fix on unified
hierarchy and grew some effective css related interfaces which will be
used for blkio support for writeback IO traffic which is currently
being worked on"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: implement cgroup_get_e_css()
cgroup: add cgroup_subsys->css_e_css_changed()
cgroup: add cgroup_subsys->css_released()
cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cpuset: lock vs unlock typo
cpuset: simplify cpuset_node_allowed API
cpuset: convert callback_mutex to a spinlock
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Implement cgroup_get_e_css() which finds and gets the effective css
for the specified cgroup and subsystem combination. This function
always returns a valid pinned css. This will be used by cgroup
writeback support.
While at it, add comment to cgroup_e_css() to explain why that
function is different from cgroup_get_e_css() and has to test
cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a new cgroup_subsys operatoin ->css_e_css_changed(). This is
invoked if any of the effective csses seen from the css's cgroup may
have changed. This will be used to implement cgroup writeback
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a new cgroup subsys callback css_released(). This is called when
the reference count of the css (cgroup_subsys_state) reaches zero
before RCU scheduling free.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared
asynchronously. If cgroup_subtree_control_write() is requested to
enable the subsystem again before the entry is cleared, it has to wait
for the previous offlining to finish and clear the @cgrp->subsys[]
entry before trying to enable the subsystem again.
This is currently done while verifying the input enable / disable
parameters. This used to be correct but f63070d350e3 ("cgroup: make
interface files visible iff enabled on cgroup->subtree_control")
breaks it. The commit is one of the commits implementing subsystem
dependency.
Through subsystem dependency, some subsystems may be enabled and
disabled implicitly in addition to the explicitly requested ones. The
actual subsystems to be enabled and disabled are determined during
@css_enable/disable calculation. The current offline wait logic skips
the ones which are already implicitly enabled and then waits for
subsystems in @enable; however, this misses the subsystems which may
be implicitly enabled through dependency from @enable. If such
implicitly subsystem hasn't yet finished offlining yet, the function
ends up trying to create a css when its @cgrp->subsys[] slot is
already occupied triggering BUG_ON() in init_and_link_css().
Fix it by moving the wait logic after @css_enable is calculated and
waiting for all the subsystems in @css_enable. This fixes the above
bug as the mask contains all subsystems which are to be enabled
including the ones enabled through dependencies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f63070d350e3 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control")
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Make cgroup_subtree_control_write() first calculate new
subtree_control (new_sc), child_subsys_mask (new_ss) and
css_enable/disable masks before applying them to the cgroup. Also,
store the original subtree_control (old_sc) and child_subsys_mask
(old_ss) and use them to restore the orignal state after failure.
This patch shouldn't cause any behavior changes. This prepares for a
fix for a bug in the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
cgroup_refresh_child_subsys_mask()
cgroup_refresh_child_subsys_mask() calculates and updates the
effective @cgrp->child_subsys_maks according to the current
@cgrp->subtree_control. Separate out the calculation part into
cgroup_calc_child_subsys_mask(). This will be used to fix a bug in
the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This will deadlock instead of unlocking.
Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.
Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
rework cpuset_zone_allowed api"). Before, we had the only version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.
However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.
So let's simplify the API back to the single check.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.
Converting the callback_mutex into a spinlock will let us call
cpuset_zone_allowed_softwall from atomic context. This, in turn, makes
it possible to simplify the code by merging the hardwall and asoftwall
checks into the same function, which is the business of the next patch.
Suggested-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata changes from Tejun Heo:
"The only interesting piece is the support for shingled drives. The
changes in libata layer are minimal. All it does is identifying the
new class of device and report upwards accordingly"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: Remove FIXME comment in atapi_request_sense()
sata_rcar: Document deprecated "renesas,rcar-sata"
sata_rcar: Add clocks to sata_rcar bindings
ahci_sunxi: Make AHCI_HFLAG_NO_PMP flag configurable with a module option
libata-scsi: Update SATL for ZAC drives
libata: Implement ATA_DEV_ZAC
libsas: use ata_dev_classify()
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Remove the FIXME comment in atapi_request_sense() asking whether
memset of sense buffer is necessary. The buffer may be partially or
fully filled by the device. We want it to be cleared.
tj: Updated description.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit e67adb4e669db834 ("sata_rcar: Add R-Car Gen2 SATA PHY support")
deprecated "renesas,rcar-sata" in favor of "renesas,sata-r8a7779", but
the deprecated value was never documented in the binding documentation,
while it is still in active use.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Now that the clocks are available in the R-Car Gen2 DT,
add clocks property description to the sata_rcar bindings.
The clocks have been tested on r8a7791 so we use that
as an example of the R-Car SATA node.
Signed-off-by: Valentine Barshak <valentine.barshak@cogentembedded.com>
[geert: Reworded clocks property]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The use of the AHCI_HFLAG_NO_PMP flag is something which we inherited from
the Allwinner android kernel sources, and I've always wanted to test if this
is really necessary.
So recently I've bought a sata port multiplexer, and I've given this a test
spin on both A10 and A20 devices, and it seems to work fine:
[ 2.154456] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.161092] ata1.15: Port Multiplier 1.2, 0x197b:0x0325 r0, 5 ports, feat 0x5/0xf
[ 2.175511] ata1.00: hard resetting link
[ 2.524929] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 2.531430] ata1.01: hard resetting link
[ 2.974465] ata1.01: link resume succeeded after 1 retries
[ 3.094932] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3.101431] ata1.02: hard resetting link
[ 4.174466] ata1.02: failed to resume link (SControl 0)
[ 4.180065] ata1.02: SATA link down (SStatus 0 SControl 0)
(and the same for links 3 and 4)
Once the NO_PMP flag is removed it correctly sees the 2 disks which I've
attached, and I can mount and use them just fine.
Unfortunately when I then directly attached a disk to the sata port on the
sunxi SoC, and booted a kernel without the AHCI_HFLAG_NO_PMP flag, it would
not recognize that disk.
It turns out that the sata controller in the sunxi SoCs fails to handle
soft-resets issued to directly attached disks, and when pmp support is
enabled the kernel will always issue a soft-reset.
So add a module parameter to enable pmp usage, and default this to off, so
that directly attached disks keep working normally.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
ZAC (zoned-access command) drives translate into ZBC (Zoned block
command) device type for SCSI. So implement the correct mappings
into libata-scsi and update the SCSI command set versions.
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add new ATA device type for ZAC devices.
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use the ata device class from libata in libsas instead of checking
the supported command set and switch to using ata_dev_classify()
instead of our own method.
Cc: Tejun Heo <tj@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull workqueue update from Tejun Heo:
"Work items which may be involved in memory reclaim path may be
executed by the rescuer under memory pressure. When a rescuer gets
activated, it processes whatever are on the pending list and then goes
back to sleep until the manager kicks it again which involves 100ms
delay.
This is problematic for self-requeueing work items or the ones running
on ordered workqueues as there always is only one work item on the
pending list when the rescuer kicks in. The execution of that work
item produces more to execute but the rescuer won't see them until
after the said 100ms has passed, so such workqueues would only execute
one work item every 100ms under prolonged memory pressure, which BTW
may be being prolonged due to the slow execution.
Neil wrote up a patch which fixes this issue by keeping the rescuer
working as long as the target workqueue is busy but doesn't have
enough workers"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: allow rescuer thread to do more work.
workqueue: invert the order between pool->lock and wq_mayday_lock
workqueue: cosmetic update in rescuer_thread()
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.
In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.
The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.
The rescuer will be woken again in 100ms to handle another max_requests
requests.
I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests. When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).
With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance. This allows
all requests to be handled in a timely fashion.
We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.
I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread. In that context
it significantly improves performance. A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.
Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.
tj: Inverted the lock order between wq_mayday_lock and pool->lock with
a preceding patch and simplified this patch. Added comment and
updated changelog accordingly. Dongsu spotted missing get_pwq()
in the simplified code.
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Currently, pool->lock nests inside pool->lock. There's no inherent
reason for this order. The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.
This nesting order turns out to complicate things with the planned
rescuer_thread() update. Let's invert them. This doesn't cause any
behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Dongsu Park <dongsu.park@profitbricks.com>
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience. There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly. Replace it with the local variable.
This patch causes no functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
"Nothing interesting. A patch to convert the remaining __get_cpu_var()
users, another to fix non-critical off-by-one in an assertion and a
cosmetic conversion to lockless_dereference() in percpu-ref.
The back-merge from mainline is to receive lockless_dereference()"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Replace smp_read_barrier_depends() with lockless_dereference()
percpu: Convert remaining __get_cpu_var uses in 3.18-rcX
percpu: off by one in BUG_ON()
|