aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAge
* ubifs: Implement encrypt/decrypt for all IORichard Weinberger2016-12-12
| | | | | | Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Constify struct inode pointer in ubifs_crypt_is_encrypted()Richard Weinberger2016-12-12
| | | | | | ...and provide a non const variant for fscrypto Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Introduce new data node field, compr_sizeRichard Weinberger2016-12-12
| | | | | | | | | | | | When data of a data node is compressed and encrypted we need to store the size of the compressed data because before encryption we may have to add padding bytes. For the new field we consume the last two padding bytes in struct ubifs_data_node. Two bytes are fine because the data length is at most 4096. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Enforce crypto policy in mmapRichard Weinberger2016-12-12
| | | | | | | We need this extra check in mmap because a process could gain an already opened fd. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Massage assert in ubifs_xattr_set() wrt. fscryptoRichard Weinberger2016-12-12
| | | | | | | | | When we're creating a new inode in UBIFS the inode is not yet exposed and fscrypto calls ubifs_xattr_set() without holding the inode mutex. This is okay but ubifs_xattr_set() has to know about this. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Preload crypto context in ->lookup()Richard Weinberger2016-12-12
| | | | | | ...and mark the dentry as encrypted. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Enforce crypto policy in ->link and ->renameRichard Weinberger2016-12-12
| | | | | | | | When a file is moved or linked into another directory its current crypto policy has to be compatible with the target policy. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Implement file open operationRichard Weinberger2016-12-12
| | | | | | | | We need ->open() for files to load the crypto key. If the no key is present and the file is encrypted, refuse to open. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Implement directory open operationRichard Weinberger2016-12-12
| | | | | | | | We need the ->open() hook to load the crypto context which is needed for all crypto operations within that directory. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Massage ubifs_listxattr() for encryption contextRichard Weinberger2016-12-12
| | | | | | | We have to make sure that we don't expose our internal crypto context to userspace. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Add skeleton for fscryptoRichard Weinberger2016-12-12
| | | | | | | This is the first building block to provide file level encryption on UBIFS. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Define UBIFS crypto context xattrRichard Weinberger2016-12-12
| | | | | | | Like ext4 UBIFS will store the crypto context in a xattr attribute. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Export xattr get and set functionsRichard Weinberger2016-12-12
| | | | | | For fscrypto we need this function outside of xattr.c. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Export ubifs_check_dir_empty()Richard Weinberger2016-12-12
| | | | | | | | fscrypto will need this function too. Also get struct ubifs_info from the provided inode. Not all callers will have a reference to struct ubifs_info. Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Remove some dead codeChristophe Jaillet2016-12-12
| | | | | | | | 'ubifs_fast_find_freeable()' can not return an error pointer, so this test can be removed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Use dirty_writeback_interval value for wbuf timerRafał Miłecki2016-12-12
| | | | | | | | | | | | | | | | | | | | | | | | | Right now wbuf timer has hardcoded timeouts and there is no place for manual adjustments. Some projects / cases many need that though. Few file systems allow doing that by respecting dirty_writeback_interval that can be set using sysctl (dirty_writeback_centisecs). Lowering dirty_writeback_interval could be some way of dealing with user space apps lacking proper fsyncs. This is definitely *not* a perfect solution but we don't have ideal (user space) world. There were already advanced discussions on this matter, mostly when ext4 was introduced and it wasn't behaving as ext3. Anyway, the final decision was to add some hacks to the ext4, as trying to fix whole user space or adding new API was pointless. We can't (and shouldn't?) just follow ext4. We can't e.g. sync on close as this would cause too many commits and flash wearing. On the other hand we still should allow some trade-off between -o sync and default wbuf timeout. Respecting dirty_writeback_interval should allow some sane cutomizations if used warily. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Richard Weinberger <richard@nod.at>
* ubifs: Drop softlimit and delta fields from struct ubifs_wbufRafał Miłecki2016-12-12
| | | | | | | | | | | Values of these fields are set during init and never modified. They are used (read) in a single function only. There isn't really any reason to keep them in a struct. It only makes struct just a bit bigger without any visible gain. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Richard Weinberger <richard@nod.at>
* fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FLDavid Gstir2016-12-11
| | | | | | | | ... to better explain its purpose after introducing in-place encryption without bounce buffer. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Delay bounce page pool allocation until neededDavid Gstir2016-12-11
| | | | | | | | | | Since fscrypt users can now indicated if fscrypt_encrypt_page() should use a bounce page, we can delay the bounce page pool initialization util it is really needed. That is until fscrypt_operations has no FS_CFLG_OWN_PAGES flag set. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()David Gstir2016-12-11
| | | | | | | | | | Rename the FS_CFLG_INPLACE_ENCRYPTION flag to FS_CFLG_OWN_PAGES which, when set, indicates that the fs uses pages under its own control as opposed to writeback pages which require locking and a bounce buffer for encryption. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()David Gstir2016-12-11
| | | | | | | | | | - Improve documentation - Add BUG_ON(len == 0) to avoid accidental switch of offs and len parameters - Improve variable names for readability Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Never allocate fscrypt_ctx on in-place encryptionDavid Gstir2016-12-11
| | | | | | | | | | | In case of in-place encryption fscrypt_ctx was allocated but never released. Since we don't need it for in-place encryption, we skip allocating it. Fixes: 1c7dcf69eea3 ("fscrypt: Add in-place encryption mode") Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Use correct index in decrypt path.David Gstir2016-12-11
| | | | | | | | | | Actually use the fs-provided index instead of always using page->index which is only set for page-cache pages. Fixes: 9c4bb8a3a9b4 ("fscrypt: Let fs select encryption index/tweak") Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: move the policy flags and encryption mode definitions to uapi headerTheodore Ts'o2016-12-11
| | | | | | | | These constants are part of the UAPI, so they belong in include/uapi/linux/fs.h instead of include/linux/fscrypto.h Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
* fscrypt: move non-public structures and constants to fscrypt_private.hTheodore Ts'o2016-12-11
| | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
* fscrypt: unexport fscrypt_initialize()Theodore Ts'o2016-12-11
| | | | | | | | The fscrypt_initalize() function isn't used outside fs/crypto, so there's no point making it be an exported symbol. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
* fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()Theodore Ts'o2016-12-11
| | | | | | | | | | To avoid namespace collisions, rename get_crypt_info() to fscrypt_get_crypt_info(). The function is only used inside the fs/crypto directory, so declare it in the new header file, fscrypt_private.h. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com>
* fscrypto: move ioctl processing more fully into common codeEric Biggers2016-12-11
| | | | | | | | | | | | Multiple bugs were recently fixed in the "set encryption policy" ioctl. To make it clear that fscrypt_process_policy() and fscrypt_get_policy() implement ioctls and therefore their implementations must take standard security and correctness precautions, rename them to fscrypt_ioctl_set_policy() and fscrypt_ioctl_get_policy(). Make the latter take in a struct file * to make it consistent with the former. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypto: remove unneeded Kconfig dependenciesEric Biggers2016-12-11
| | | | | | | | | | | | SHA256 and ENCRYPTED_KEYS are not needed. CTR shouldn't be needed either, but I left it for now because it was intentionally added by commit 71dea01ea2ed ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled"). So it sounds like there may be a dependency problem elsewhere, which I have not been able to identify specifically, that must be solved before CTR can be removed. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patchesEric Biggers2016-12-11
| | | | | | | | | | | | The filesystem level encryption support, currently used by ext4 and f2fs and proposed for ubifs, does not yet have a dedicated mailing list. Since no mailing lists were specified in MAINTAINERS, get_maintainer.pl only recommended to send patches directly to the maintainers and to linux-kernel. This patch adds linux-fsdevel as the preferred mailing list for fscrypto patches for the time being. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypto: don't use on-stack buffer for key derivationEric Biggers2016-11-13
| | | | | | | | | | | | | | | | With the new (in 4.9) option to use a virtually-mapped stack (CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for the scatterlist crypto API because they may not be directly mappable to struct page. get_crypt_info() was using a stack buffer to hold the output from the encryption operation used to derive the per-file key. Fix it by using a heap buffer. This bug could most easily be observed in a CONFIG_DEBUG_SG kernel because this allowed the BUG in sg_set_buf() to be triggered. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypto: don't use on-stack buffer for filename encryptionEric Biggers2016-11-13
| | | | | | | | | | | | | | | | | With the new (in 4.9) option to use a virtually-mapped stack (CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for the scatterlist crypto API because they may not be directly mappable to struct page. For short filenames, fname_encrypt() was encrypting a stack buffer holding the padded filename. Fix it by encrypting the filename in-place in the output buffer, thereby making the temporary buffer unnecessary. This bug could most easily be observed in a CONFIG_DEBUG_SG kernel because this allowed the BUG in sg_set_buf() to be triggered. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Let fs select encryption index/tweakDavid Gstir2016-11-13
| | | | | | | | | | | Avoid re-use of page index as tweak for AES-XTS when multiple parts of same page are encrypted. This will happen on multiple (partial) calls of fscrypt_encrypt_page on same page. page->index is only valid for writeback pages. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Constify struct inode pointerDavid Gstir2016-11-13
| | | | | | | | | Some filesystems, such as UBIFS, maintain a const pointer for struct inode. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Enable partial page encryptionDavid Gstir2016-11-13
| | | | | | | | | Not all filesystems work on full pages, thus we should allow them to hand partial pages to fscrypt for en/decryption. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pagesDavid Gstir2016-11-13
| | | | | | | | | | Some filesystem might pass pages which do not have page->mapping->host set to the encrypted inode. We want the caller to explicitly pass the corresponding inode. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: Add in-place encryption modeDavid Gstir2016-11-13
| | | | | | | | | | | ext4 and f2fs require a bounce page when encrypting pages. However, not all filesystems will need that (eg. UBIFS). This is handled via a flag on fscrypt_operations where a fs implementation can select in-place encryption over using a bounce page (which is the default). Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Linux 4.9-rc4Linus Torvalds2016-11-05
|
* Merge branch 'i2c/for-current' of ↵Linus Torvalds2016-11-05
|\ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fix from Wolfram Sang: "A bugfix for the I2C core fixing a (rare) race condition" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: core: fix NULL pointer dereference under race condition
| * i2c: core: fix NULL pointer dereference under race conditionVladimir Zapolskiy2016-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Race condition between registering an I2C device driver and deregistering an I2C adapter device which is assumed to manage that I2C device may lead to a NULL pointer dereference due to the uninitialized list head of driver clients. The root cause of the issue is that the I2C bus may know about the registered device driver and thus it is matched by bus_for_each_drv(), but the list of clients is not initialized and commonly it is NULL, because I2C device drivers define struct i2c_driver as static and clients field is expected to be initialized by I2C core: i2c_register_driver() i2c_del_adapter() driver_register() ... bus_add_driver() ... ... bus_for_each_drv(..., __process_removed_adapter) ... i2c_do_del_adapter() ... list_for_each_entry_safe(..., &driver->clients, ...) INIT_LIST_HEAD(&driver->clients); To solve the problem it is sufficient to do clients list head initialization before calling driver_register(). The problem was found while using an I2C device driver with a sluggish registration routine on a bus provided by a physically detachable I2C master controller, but practically the oops may be reproduced under the race between arbitraty I2C device driver registration and managing I2C bus device removal e.g. by unbinding the latter over sysfs: % echo 21a4000.i2c > /sys/bus/platform/drivers/imx-i2c/unbind Unable to handle kernel NULL pointer dereference at virtual address 00000000 Internal error: Oops: 17 [#1] SMP ARM CPU: 2 PID: 533 Comm: sh Not tainted 4.9.0-rc3+ #61 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) task: e5ada400 task.stack: e4936000 PC is at i2c_do_del_adapter+0x20/0xcc LR is at __process_removed_adapter+0x14/0x1c Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 35bd004a DAC: 00000051 Process sh (pid: 533, stack limit = 0xe4936210) Stack: (0xe4937d28 to 0xe4938000) Backtrace: [<c0667be0>] (i2c_do_del_adapter) from [<c0667cc0>] (__process_removed_adapter+0x14/0x1c) [<c0667cac>] (__process_removed_adapter) from [<c0516998>] (bus_for_each_drv+0x6c/0xa0) [<c051692c>] (bus_for_each_drv) from [<c06685ec>] (i2c_del_adapter+0xbc/0x284) [<c0668530>] (i2c_del_adapter) from [<bf0110ec>] (i2c_imx_remove+0x44/0x164 [i2c_imx]) [<bf0110a8>] (i2c_imx_remove [i2c_imx]) from [<c051a838>] (platform_drv_remove+0x2c/0x44) [<c051a80c>] (platform_drv_remove) from [<c05183d8>] (__device_release_driver+0x90/0x12c) [<c0518348>] (__device_release_driver) from [<c051849c>] (device_release_driver+0x28/0x34) [<c0518474>] (device_release_driver) from [<c0517150>] (unbind_store+0x80/0x104) [<c05170d0>] (unbind_store) from [<c0516520>] (drv_attr_store+0x28/0x34) [<c05164f8>] (drv_attr_store) from [<c0298acc>] (sysfs_kf_write+0x50/0x54) [<c0298a7c>] (sysfs_kf_write) from [<c029801c>] (kernfs_fop_write+0x100/0x214) [<c0297f1c>] (kernfs_fop_write) from [<c0220130>] (__vfs_write+0x34/0x120) [<c02200fc>] (__vfs_write) from [<c0221088>] (vfs_write+0xa8/0x170) [<c0220fe0>] (vfs_write) from [<c0221e74>] (SyS_write+0x4c/0xa8) [<c0221e28>] (SyS_write) from [<c0108a20>] (ret_fast_syscall+0x0/0x1c) Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
| |
| \
*-. \ Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of ↵Linus Torvalds2016-11-05
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stack vmap fixups from Thomas Gleixner: "Two small patches related to sched_show_task(): - make sure to hold a reference on the task stack while accessing it - remove the thread_saved_pc printout .. and add a sanity check into release_task_stack() to catch problems with task stack references" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Remove pointless printout in sched_show_task() sched/core: Fix oops in sched_show_task() * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fork: Add task stack refcounting sanity check and prevent premature task stack freeing
| | * | fork: Add task stack refcounting sanity check and prevent premature task ↵Andy Lutomirski2016-11-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stack freeing If something goes wrong with task stack refcounting and a stack refcount hits zero too early, warn and leak it rather than potentially freeing it early (and silently). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Remove pointless printout in sched_show_task()Linus Torvalds2016-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In sched_show_task() we print out a useless hex number, not even a symbol, and there's a big question mark whether this even makes sense anyway, I suspect we should just remove it all. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Fix oops in sched_show_task()Tetsuo Handa2016-11-03
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread remains in the task list after its stack pointer was already set to NULL. Therefore, thread_saved_pc() and stack_not_used() in sched_show_task() will trigger NULL pointer dereference if an attempt to dump such thread's traces (e.g. SysRq-t, khungtaskd) is made. Since show_stack() in sched_show_task() calls try_get_task_stack() and sched_show_task() is called from interrupt context, calling try_get_task_stack() from sched_show_task() will be safe as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2016-11-05
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MD fixes from Shaohua Li: "There are several bug fixes queued: - fix raid5-cache recovery bugs - fix discard IO error handling for raid1/10 - fix array sync writes bogus position to superblock - fix IO error handling for raid array with external metadata" * tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: be careful not lot leak internal curr_resync value into metadata. -- (all) raid1: handle read error also in readonly mode raid5-cache: correct condition for empty metadata write md: report 'write_pending' state when array in sync md/raid5: write an empty meta-block when creating log super-block md/raid5: initialize next_checkpoint field before use RAID10: ignore discard error RAID1: ignore discard error
| * | | md: be careful not lot leak internal curr_resync value into metadata. -- (all)NeilBrown2016-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid1: handle read error also in readonly modeTomasz Majchrzak2016-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If write is the first operation on a disk and it happens not to be aligned to page size, block layer sends read request first. If read operation fails, the disk is set as failed as no attempt to fix the error is made because array is in auto-readonly mode. Similarily, the disk is set as failed for read-only array. Take the same approach as in raid10. Don't fail the disk if array is in readonly or auto-readonly mode. Try to redirect the request first and if unsuccessful, return a read error. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | raid5-cache: correct condition for empty metadata writeShaohua Li2016-10-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md: report 'write_pending' state when array in syncTomasz Majchrzak2016-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit 070dc6dd7103 ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * | | md/raid5: write an empty meta-block when creating log super-blockZhengyuan Liu2016-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>