author Linus Torvalds <torvalds@linux-foundation.org> 2016-07-29 18:54:19 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2016-07-29 18:54:19 -0400
commit a867d7349e94b6409b08629886a819f802377e91
tree cf26734d638bbeee4e8f1ec58161933a55b922e2 /fs
parent 601f887d6105ddd28dc569a1504595bdf8df8a5b
parent aeaa4a79ff6a5ed912b7362f206cf8576fca538b
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns vfs updates from Eric Biederman:

 "This tree contains some very long awaited work on generalizing the
  user namespace support for mounting filesystems to include
  filesystems with a backing store. The real world target is fuse but
  the goal is to update the vfs to allow any filesystem to be
  supported. This patchset is based on a lot of code review and
  testing to approach that goal.

  While looking at what is needed to support the fuse filesystem it
  became clear that there were things like xattrs for security modules
  that needed special treatment, and that the resolution of those
  concerns would not be fuse specific. Sorting out these general
  issues made most sense at the generic level, where the right people
  could be drawn into the conversation, and the issues could be solved
  for everyone.

  At a high level this patchset does a couple of simple things:

   - Add a user namespace owner (s_user_ns) to struct super_block.

   - Teach the vfs to handle filesystem uids and gids not mapping into
     kuids and kgids and being reported as INVALID_UID and INVALID_GID
     in vfs data structures.

  By assigning a user namespace owner, filesystems that are mounted
  with only user namespace privilege can be detected. This allows
  security modules and the like to know which mounts may not be
  trusted. This also allows the set of uids and gids that are
  communicated to the filesystem to be capped at the set of kuids and
  kgids that are in the owning user namespace of the filesystem.

  One of the crazier corner cases this handles is the case of inodes
  whose i_uid or i_gid are not mapped into the vfs. Most of the code
  simply doesn't care but it is easy to confuse the inode writeback
  path so no operation that could cause an inode write-back is
  permitted for such inodes (aka only reads are allowed).

  This set of changes starts out by cleaning up the code paths
  involved in user namespace permitted mounts. Then when things are
  clean enough it adds code that cleanly sets s_user_ns. Then
  additional restrictions are added that are possible now that the
  filesystem superblock contains owner information.

  These changes should not affect anyone in practice, but there are
  some parts of these restrictions that are changes in behavior.

   - Andy's restriction on suid executables that does not honor the
     suid bit when the path is from another mount namespace (think
     /proc/[pid]/fd/) or when the filesystem was mounted by a less
     privileged user.

   - The replacement of the user namespace implicit setting of
     MNT_NODEV with implicitly setting SB_I_NODEV on the filesystem
     superblock instead.

     Using SB_I_NODEV is a stronger form that happens to make this
     state user invisible. The user visibility can be managed but it
     caused problems when it was introduced from applications
     reasonably expecting mount flags to be what they were set to.

  There is a little bit of work remaining before it is safe to support
  mounting filesystems with backing store in user namespaces, beyond
  what is in this set of changes.

   - Verifying the mounter has permission to read/write the block
     device during mount.

   - Teaching the integrity modules IMA and EVM to handle filesystems
     mounted with only user namespace root and to reduce trust in
     their security xattrs accordingly.

   - Capturing the mounter's credentials and using that for permission
     checks in d_automount and the like. (Given that overlayfs already
     does this, and we need the work in d_automount, it makes sense to
     generalize this case).

  Furthermore there are a few changes that are on the wishlist:

   - Get all filesystems supporting posix acls using the generic posix
     acls so that posix_acl_fix_xattr_from_user and
     posix_acl_fix_xattr_to_user may be removed. [Maintainability]

   - Reducing the permission checks in places such as remount to allow
     the superblock owner to perform them.

   - Allowing the superblock owner to chown files with unmapped uids
     and gids to something that is mapped so the files may be treated
     normally.

  I am not considering even obvious relaxations of permission checks
  until it is clear there are no more corner cases that need to be
  locked down and handled generically.

  Many thanks to Seth Forshee who kept this code alive, and put up
  with me rewriting substantial portions of what he did to handle more
  corner cases, and for his diligent testing and reviewing of my
  changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
  fs: Call d_automount with the filesystems creds
  fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  evm: Translate user/group ids relative to s_user_ns when computing HMAC
  dquot: For now explicitly don't support filesystems outside of init_user_ns
  quota: Handle quota data stored in s_user_ns in quota_setxquota
  quota: Ensure qids map to the filesystem
  vfs: Don't create inodes with a uid or gid unknown to the vfs
  vfs: Don't modify inodes with a uid or gid unknown to the vfs
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Check for invalid i_uid in may_follow_link()
  vfs: Verify acls are valid within superblock's s_user_ns.
  userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  fs: Refuse uid/gid changes which don't map into s_user_ns
  selinux: Add support for unprivileged mounts from user namespaces
  Smack: Handle labels consistently in untrusted mounts
  Smack: Add support for unprivileged mounts from user namespaces
  fs: Treat foreign mounts as nosuid
  fs: Limit file caps to the user namespace of the super block
  userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
  userns: Remove implicit MNT_NODEV fragility.
  ...
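The mapping rule the message describes can be illustrated with a small userspace model. This is only a sketch and not kernel code: `model_extent`, `model_userns`, `model_make_kid`, `model_from_kid` and `model_kid_has_mapping` are hypothetical names, and the model assumes a single level of user namespace whose map is a plain list of extents. An id that falls outside every extent comes back as an invalid sentinel, in the spirit of the INVALID_UID and INVALID_GID reporting mentioned in the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel for "no mapping"; a stand-in for INVALID_UID / INVALID_GID. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* One contiguous run of ids (hypothetical layout for this sketch). */
struct model_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first kernel-wide id the run maps to */
	uint32_t count;       /* number of ids in the run */
};

struct model_userns {
	const struct model_extent *extents;
	size_t nr_extents;
};

/* Namespace-local id -> kernel-wide id, or MODEL_INVALID_ID if unmapped. */
static uint32_t model_make_kid(const struct model_userns *ns, uint32_t id)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return MODEL_INVALID_ID;
}

/* Kernel-wide id -> namespace-local id, or MODEL_INVALID_ID if unmapped. */
static uint32_t model_from_kid(const struct model_userns *ns, uint32_t kid)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return MODEL_INVALID_ID;
}

/* True when the kernel-wide id can be represented inside the namespace. */
static bool model_kid_has_mapping(const struct model_userns *ns, uint32_t kid)
{
	return model_from_kid(ns, kid) != MODEL_INVALID_ID;
}
```

With a single extent mapping namespace ids 0..65535 onto kernel ids starting at 100000, namespace id 1000 becomes kernel id 101000, while kernel id 0 has no representation inside the namespace. A filesystem owned by such a namespace could not store an inode owned by kernel id 0, which is the situation the new checks in this merge refuse.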
Diffstat (limited to 'fs')
-rw-r--r--  fs/9p/acl.c          2
-rw-r--r--  fs/attr.c           19
-rw-r--r--  fs/block_dev.c       2
-rw-r--r--  fs/devpts/inode.c    3
-rw-r--r--  fs/exec.c            2
-rw-r--r--  fs/inode.c           7
-rw-r--r--  fs/kernfs/mount.c    5
-rw-r--r--  fs/namei.c          55
-rw-r--r--  fs/namespace.c      99
-rw-r--r--  fs/nfsd/nfsctl.c    13
-rw-r--r--  fs/posix_acl.c       8
-rw-r--r--  fs/proc/inode.c     15
-rw-r--r--  fs/proc/internal.h   3
-rw-r--r--  fs/proc/root.c      61
-rw-r--r--  fs/quota/dquot.c     8
-rw-r--r--  fs/quota/quota.c    14
-rw-r--r--  fs/super.c          69
-rw-r--r--  fs/sysfs/mount.c     5
-rw-r--r--  fs/xattr.c           7
19 files changed, 242 insertions, 155 deletions
diff --git a/fs/9p/acl.c b/fs/9p/acl.c
index 0576eaeb60b9..5b6a1743ea17 100644
--- a/fs/9p/acl.c
+++ b/fs/9p/acl.c
@@ -266,7 +266,7 @@ static int v9fs_xattr_set_acl(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 		else if (acl) {
-			retval = posix_acl_valid(acl);
+			retval = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (retval)
 				goto err_out;
 		}
diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -255,6 +255,25 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
 		return 0;
 
+	/*
+	 * Verify that uid/gid changes are valid in the target
+	 * namespace of the superblock.
+	 */
+	if (ia_valid & ATTR_UID &&
+	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+		return -EOVERFLOW;
+	if (ia_valid & ATTR_GID &&
+	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+		return -EOVERFLOW;
+
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
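The order of the four checks added to notify_change() can be easier to follow as a standalone decision function. The following is a userspace sketch under stated assumptions: `model_setattr`, `model_check_setattr` and the `MODEL_ATTR_*` bit values are hypothetical, and the booleans stand in for the results of kuid_has_mapping(), kgid_has_mapping(), uid_valid() and gid_valid().

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative bit values; not taken from the kernel headers. */
#define MODEL_ATTR_UID 0x1u
#define MODEL_ATTR_GID 0x2u

/* Inputs to the decision, precomputed by the caller (hypothetical type). */
struct model_setattr {
	unsigned int ia_valid; /* which attributes the caller wants to change */
	bool new_uid_mapped;   /* requested uid maps into the superblock's namespace */
	bool new_gid_mapped;   /* requested gid maps into the superblock's namespace */
	bool cur_uid_valid;    /* inode's current uid is known to the vfs */
	bool cur_gid_valid;    /* inode's current gid is known to the vfs */
};

/* Returns 0 if the change may proceed, -EOVERFLOW otherwise. */
static int model_check_setattr(const struct model_setattr *req)
{
	/* A requested owner must be representable on the filesystem. */
	if ((req->ia_valid & MODEL_ATTR_UID) && !req->new_uid_mapped)
		return -EOVERFLOW;
	if ((req->ia_valid & MODEL_ATTR_GID) && !req->new_gid_mapped)
		return -EOVERFLOW;

	/* An unknown owner blocks every change that does not replace it. */
	if (!(req->ia_valid & MODEL_ATTR_UID) && !req->cur_uid_valid)
		return -EOVERFLOW;
	if (!(req->ia_valid & MODEL_ATTR_GID) && !req->cur_gid_valid)
		return -EOVERFLOW;

	return 0;
}
```

One consequence visible in the model: on an inode whose uid and gid are both unknown, a request that sets only the uid is still refused, because the gid check fails. Both ids have to be replaced in the same request for the change to go through.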
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5cbd5391667e..ada42cf42d06 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1846,7 +1846,7 @@ struct block_device *lookup_bdev(const char *pathname)
 	if (!S_ISBLK(inode->i_mode))
 		goto fail;
 	error = -EACCES;
-	if (path.mnt->mnt_flags & MNT_NODEV)
+	if (!may_open_dev(&path))
 		goto fail;
 	error = -ENOMEM;
 	bdev = bd_acquire(inode);
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 37c134a132c7..d116453b0276 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -396,6 +396,7 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
 {
 	struct inode *inode;
 
+	s->s_iflags &= ~SB_I_NODEV;
 	s->s_blocksize = 1024;
 	s->s_blocksize_bits = 10;
 	s->s_magic = DEVPTS_SUPER_MAGIC;
@@ -480,7 +481,7 @@ static struct file_system_type devpts_fs_type = {
 	.name = "devpts",
 	.mount = devpts_mount,
 	.kill_sb = devpts_kill_sb,
-	.fs_flags = FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 /*
diff --git a/fs/exec.c b/fs/exec.c
index 887c1c955df8..ca239fc86d8d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1411,7 +1411,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 	bprm->cred->euid = current_euid();
 	bprm->cred->egid = current_egid();
 
-	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+	if (!mnt_may_suid(bprm->file->f_path.mnt))
 		return;
 
 	if (task_no_new_privs(current))
diff --git a/fs/inode.c b/fs/inode.c
index e171f7b5f9e4..9cef4e16aeda 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1619,6 +1619,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 63534f5f9073..b3d73ad52b22 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -152,6 +152,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	struct dentry *root;
 
 	info->sb = sb;
+	/* Userspace would break if executables or devices appear on sysfs */
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
 	sb->s_blocksize = PAGE_SIZE;
 	sb->s_blocksize_bits = PAGE_SHIFT;
 	sb->s_magic = magic;
@@ -241,7 +243,8 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 	info->root = root;
 	info->ns = ns;
 
-	sb = sget(fs_type, kernfs_test_super, kernfs_set_super, flags, info);
+	sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
+			 &init_user_ns, info);
 	if (IS_ERR(sb) || sb->s_fs_info != info)
 		kfree(info);
 	if (IS_ERR(sb))
diff --git a/fs/namei.c b/fs/namei.c
index 68a896c804b7..c386a329ab20 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -36,6 +36,7 @@
 #include <linux/posix_acl.h>
 #include <linux/hash.h>
 #include <linux/bitops.h>
+#include <linux/init_task.h>
 #include <asm/uaccess.h>
 
 #include "internal.h"
@@ -410,6 +411,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -901,6 +910,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +926,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
@@ -1089,6 +1100,7 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 			    bool *need_mntput)
 {
 	struct vfsmount *mnt;
+	const struct cred *old_cred;
 	int err;
 
 	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
@@ -1110,11 +1122,16 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 	    path->dentry->d_inode)
 		return -EISDIR;
 
+	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
+		return -EACCES;
+
 	nd->total_link_count++;
 	if (nd->total_link_count >= 40)
 		return -ELOOP;
 
+	old_cred = override_creds(&init_cred);
 	mnt = path->dentry->d_op->d_automount(path);
+	revert_creds(old_cred);
 	if (IS_ERR(mnt)) {
 		/*
 		 * The filesystem is allowed to return -EISDIR here to indicate
@@ -2741,10 +2758,11 @@ EXPORT_SYMBOL(__check_sticky);
  * c. have CAP_FOWNER capability
  * 6. If the victim is append-only or immutable we can't do antyhing with
  * links pointing to it.
- * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- * 9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ * 7. If the victim has an unknown uid or gid we can't change the inode.
+ * 8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ * 9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  * nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2766,7 +2784,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -2787,16 +2805,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  * 1. We can't do it if child already exists (open has special treatment for
  * this case, but since we are inlined it's OK)
  * 2. We can't do it if dir is read-only (done in permission())
- * 3. We should have write and exec permissions on dir
- * 4. We can't do it if dir is immutable (done in permission())
+ * 3. We can't do it if the fs can't represent the fsuid or fsgid.
+ * 4. We should have write and exec permissions on dir
+ * 5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
@@ -2865,6 +2889,12 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 }
 EXPORT_SYMBOL(vfs_create);
 
+bool may_open_dev(const struct path *path)
+{
+	return !(path->mnt->mnt_flags & MNT_NODEV) &&
+		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
+}
+
 static int may_open(struct path *path, int acc_mode, int flag)
 {
 	struct dentry *dentry = path->dentry;
@@ -2883,7 +2913,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
 		break;
 	case S_IFBLK:
 	case S_IFCHR:
-		if (path->mnt->mnt_flags & MNT_NODEV)
+		if (!may_open_dev(path))
 			return -EACCES;
 		/*FALLTHRU*/
 	case S_IFIFO:
@@ -4135,6 +4165,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/namespace.c b/fs/namespace.c
index 419f746d851d..7bb2cda3bfef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2186,13 +2186,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	}
 	if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
 	    !(mnt_flags & MNT_NODEV)) {
-		/* Was the nodev implicitly added in mount? */
-		if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
-		    !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
-			mnt_flags |= MNT_NODEV;
-		} else {
-			return -EPERM;
-		}
+		return -EPERM;
 	}
 	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
 	    !(mnt_flags & MNT_NOSUID)) {
@@ -2376,7 +2370,7 @@ unlock:
 	return err;
 }
 
-static bool fs_fully_visible(struct file_system_type *fs_type, int *new_mnt_flags);
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
 
 /*
  * create a new mount for userspace and request it to be added into the
@@ -2386,7 +2380,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 			int mnt_flags, const char *name, void *data)
 {
 	struct file_system_type *type;
-	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
 	struct vfsmount *mnt;
 	int err;
 
@@ -2397,26 +2390,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 	if (!type)
 		return -ENODEV;
 
-	if (user_ns != &init_user_ns) {
-		if (!(type->fs_flags & FS_USERNS_MOUNT)) {
-			put_filesystem(type);
-			return -EPERM;
-		}
-		/* Only in special cases allow devices from mounts
-		 * created outside the initial user namespace.
-		 */
-		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
-			flags |= MS_NODEV;
-			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
-		}
-		if (type->fs_flags & FS_USERNS_VISIBLE) {
-			if (!fs_fully_visible(type, &mnt_flags)) {
-				put_filesystem(type);
-				return -EPERM;
-			}
-		}
-	}
-
 	mnt = vfs_kern_mount(type, flags, name, data);
 	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
 	    !mnt->mnt_sb->s_subtype)
@@ -2426,6 +2399,11 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
+	if (mount_too_revealing(mnt, &mnt_flags)) {
+		mntput(mnt);
+		return -EPERM;
+	}
+
 	err = do_add_mount(real_mount(mnt), path, mnt_flags);
 	if (err)
 		mntput(mnt);
@@ -3217,22 +3195,19 @@ bool current_chrooted(void)
 	return chrooted;
 }
 
-static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
+static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new,
+				int *new_mnt_flags)
 {
-	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
 	int new_flags = *new_mnt_flags;
 	struct mount *mnt;
 	bool visible = false;
 
-	if (unlikely(!ns))
-		return false;
-
 	down_read(&namespace_sem);
 	list_for_each_entry(mnt, &ns->list, mnt_list) {
 		struct mount *child;
 		int mnt_flags;
 
-		if (mnt->mnt.mnt_sb->s_type != type)
+		if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type)
 			continue;
 
 		/* This mount is not fully visible if it's root directory
@@ -3241,12 +3216,8 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
-		/* Read the mount flags and filter out flags that
-		 * may safely be ignored.
-		 */
+		/* A local view of the mount flags */
 		mnt_flags = mnt->mnt.mnt_flags;
-		if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
-			mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
 
 		/* Don't miss readonly hidden in the superblock flags */
 		if (mnt->mnt.mnt_sb->s_flags & MS_RDONLY)
@@ -3258,15 +3229,6 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		if ((mnt_flags & MNT_LOCK_READONLY) &&
 		    !(new_flags & MNT_READONLY))
 			continue;
-		if ((mnt_flags & MNT_LOCK_NODEV) &&
-		    !(new_flags & MNT_NODEV))
-			continue;
-		if ((mnt_flags & MNT_LOCK_NOSUID) &&
-		    !(new_flags & MNT_NOSUID))
-			continue;
-		if ((mnt_flags & MNT_LOCK_NOEXEC) &&
-		    !(new_flags & MNT_NOEXEC))
-			continue;
 		if ((mnt_flags & MNT_LOCK_ATIME) &&
 		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
@@ -3286,9 +3248,6 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
-					       MNT_LOCK_NODEV | \
-					       MNT_LOCK_NOSUID | \
-					       MNT_LOCK_NOEXEC | \
 					       MNT_LOCK_ATIME);
 		visible = true;
 		goto found;
@@ -3299,6 +3258,42 @@ found:
 	return visible;
 }
 
+static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags)
+{
+	const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
+	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+	unsigned long s_iflags;
+
+	if (ns->user_ns == &init_user_ns)
+		return false;
+
+	/* Can this filesystem be too revealing? */
+	s_iflags = mnt->mnt_sb->s_iflags;
+	if (!(s_iflags & SB_I_USERNS_VISIBLE))
+		return false;
+
+	if ((s_iflags & required_iflags) != required_iflags) {
+		WARN_ONCE(1, "Expected s_iflags to contain 0x%lx\n",
+			  required_iflags);
+		return true;
+	}
+
+	return !mnt_already_visible(ns, mnt, new_mnt_flags);
+}
+
+bool mnt_may_suid(struct vfsmount *mnt)
+{
+	/*
+	 * Foreign mounts (accessed via fchdir or through /proc
+	 * symlinks) are always treated as if they are nosuid. This
+	 * prevents namespaces from trusting potentially unsafe
+	 * suid/sgid bits, file caps, or security labels that originate
+	 * in other namespaces.
+	 */
+	return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
+	       current_in_userns(mnt->mnt_sb->s_user_ns);
+}
+
 static struct ns_common *mntns_get(struct task_struct *task)
 {
 	struct ns_common *ns = NULL;
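The decision made by the new mount_too_revealing() helper can be restated as a pure function of three inputs. This is a userspace sketch, not the kernel implementation: `model_mount_too_revealing` and the `MODEL_SB_I_*` values are hypothetical, the warning the kernel emits is omitted, and the `already_visible` argument stands in for the result of mnt_already_visible().

```c
#include <stdbool.h>

/* Illustrative flag values; the kernel's definitions may differ. */
#define MODEL_SB_I_NOEXEC         0x1ul
#define MODEL_SB_I_NODEV          0x2ul
#define MODEL_SB_I_USERNS_VISIBLE 0x4ul

/*
 * Returns true when the new mount must be refused.
 *
 * mntns_owned_by_init_userns: the caller's mount namespace belongs to the
 *                             initial user namespace
 * s_iflags:                   internal flags of the new superblock
 * already_visible:            an equivalent, fully visible mount of the same
 *                             filesystem type already exists in the namespace
 */
static bool model_mount_too_revealing(bool mntns_owned_by_init_userns,
				      unsigned long s_iflags,
				      bool already_visible)
{
	const unsigned long required = MODEL_SB_I_NOEXEC | MODEL_SB_I_NODEV;

	/* Fully privileged mounters are never restricted here. */
	if (mntns_owned_by_init_userns)
		return false;

	/* Only filesystems that opt in can be too revealing. */
	if (!(s_iflags & MODEL_SB_I_USERNS_VISIBLE))
		return false;

	/* Opt-in filesystems are expected to forbid exec and devices. */
	if ((s_iflags & required) != required)
		return true;

	return !already_visible;
}
```

In this model a filesystem that sets the visibility flag together with both required flags (as proc_fill_super does in this merge) is accepted from a non-initial namespace only when an equivalent mount is already visible there.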
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index e7787777620e..65ad0165a94f 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1151,20 +1151,15 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 #endif
 		/* last one */ {""}
 	};
-	struct net *net = data;
-	int ret;
-
-	ret = simple_fill_super(sb, 0x6e667364, nfsd_files);
-	if (ret)
-		return ret;
-	sb->s_fs_info = get_net(net);
-	return 0;
+	get_net(sb->s_fs_info);
+	return simple_fill_super(sb, 0x6e667364, nfsd_files);
 }
 
 static struct dentry *nfsd_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_ns(fs_type, flags, current->nsproxy->net_ns, nfsd_fill_super);
+	struct net *net = current->nsproxy->net_ns;
+	return mount_ns(fs_type, flags, data, net, net->user_ns, nfsd_fill_super);
 }
 
 static void nfsd_umount(struct super_block *sb)
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index edc452c2a563..59d47ab0791a 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -205,7 +205,7 @@ posix_acl_clone(const struct posix_acl *acl, gfp_t flags)
  * Check if an acl is valid. Returns 0 if it is, or -E... otherwise.
  */
 int
-posix_acl_valid(const struct posix_acl *acl)
+posix_acl_valid(struct user_namespace *user_ns, const struct posix_acl *acl)
 {
 	const struct posix_acl_entry *pa, *pe;
 	int state = ACL_USER_OBJ;
@@ -225,7 +225,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_USER:
 				if (state != ACL_USER)
 					return -EINVAL;
-				if (!uid_valid(pa->e_uid))
+				if (!kuid_has_mapping(user_ns, pa->e_uid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -240,7 +240,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_GROUP:
 				if (state != ACL_GROUP)
 					return -EINVAL;
-				if (!gid_valid(pa->e_gid))
+				if (!kgid_has_mapping(user_ns, pa->e_gid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -834,7 +834,7 @@ set_posix_acl(struct inode *inode, int type, struct posix_acl *acl)
 		return -EPERM;
 
 	if (acl) {
-		int ret = posix_acl_valid(acl);
+		int ret = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 		if (ret)
 			return ret;
 	}
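The change to posix_acl_valid() replaces "is this id valid at all" with "does this id map into the superblock's user namespace". A rough userspace sketch of that id check follows. All names here (`model_acl_entry`, `model_id_range`, `model_acl_ids_valid`) are hypothetical, the model uses one contiguous id range for both users and groups as a simplifying assumption, and it leaves out the entry ordering and mask rules that the real function also enforces.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Entry kinds relevant to the sketch (hypothetical enum). */
enum model_acl_tag {
	MODEL_ACL_USER_OBJ,
	MODEL_ACL_USER,
	MODEL_ACL_GROUP,
	MODEL_ACL_OTHER
};

struct model_acl_entry {
	enum model_acl_tag tag;
	uint32_t kid; /* kernel-wide id; only meaningful for USER and GROUP */
};

/* Kernel-wide ids representable on the filesystem (simplified to one run). */
struct model_id_range {
	uint32_t lower_first;
	uint32_t count;
};

static bool model_range_has(const struct model_id_range *r, uint32_t kid)
{
	return kid >= r->lower_first && kid - r->lower_first < r->count;
}

/* Returns 0 if every named user/group entry is representable, else -EINVAL. */
static int model_acl_ids_valid(const struct model_id_range *r,
			       const struct model_acl_entry *entries, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bool named = entries[i].tag == MODEL_ACL_USER ||
			     entries[i].tag == MODEL_ACL_GROUP;

		if (named && !model_range_has(r, entries[i].kid))
			return -EINVAL;
	}
	return 0;
}
```

Entries without an explicit id (owner and other) are skipped by the model, which matches the hunks above in that only the ACL_USER and ACL_GROUP cases gained a namespace check.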
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 42305ddcbaa0..c1b72388e571 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -457,17 +457,30 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 	return inode;
 }
 
-int proc_fill_super(struct super_block *s)
+int proc_fill_super(struct super_block *s, void *data, int silent)
 {
+	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
 	struct inode *root_inode;
 	int ret;
 
+	if (!proc_parse_options(data, ns))
+		return -EINVAL;
+
+	/* User space would break if executables or devices appear on proc */
+	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
 	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
 	s->s_blocksize = 1024;
 	s->s_blocksize_bits = 10;
 	s->s_magic = PROC_SUPER_MAGIC;
 	s->s_op = &proc_sops;
 	s->s_time_gran = 1;
+
+	/*
+	 * procfs isn't actually a stacking filesystem; however, there is
+	 * too much magic going on inside it to permit stacking things on
+	 * top of it
+	 */
+	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
 
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa2781095bd1..7931c558c192 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -212,7 +212,7 @@ extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
-extern int proc_fill_super(struct super_block *);
+extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
 
 /*
@@ -268,6 +268,7 @@ static inline void proc_tty_init(void) {}
  * root.c
  */
 extern struct proc_dir_entry proc_root;
+extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 06702783bf40..8d3e484055a6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,21 +23,6 @@
 
 #include "internal.h"
 
-static int proc_test_super(struct super_block *sb, void *data)
-{
-	return sb->s_fs_info == data;
-}
-
-static int proc_set_super(struct super_block *sb, void *data)
-{
-	int err = set_anon_super(sb, NULL);
-	if (!err) {
-		struct pid_namespace *ns = (struct pid_namespace *)data;
-		sb->s_fs_info = get_pid_ns(ns);
-	}
-	return err;
-}
-
 enum {
 	Opt_gid, Opt_hidepid, Opt_err,
 };
@@ -48,7 +33,7 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
-static int proc_parse_options(char *options, struct pid_namespace *pid)
+int proc_parse_options(char *options, struct pid_namespace *pid)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -100,52 +85,16 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	int err;
-	struct super_block *sb;
 	struct pid_namespace *ns;
-	char *options;
 
 	if (flags & MS_KERNMOUNT) {
-		ns = (struct pid_namespace *)data;
-		options = NULL;
+		ns = data;
+		data = NULL;
 	} else {
 		ns = task_active_pid_ns(current);
-		options = data;
-
-		/* Does the mounter have privilege over the pid namespace? */
-		if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
-			return ERR_PTR(-EPERM);
-	}
-
-	sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
-	if (IS_ERR(sb))
-		return ERR_CAST(sb);
-
-	/*
-	 * procfs isn't actually a stacking filesystem; however, there is
-	 * too much magic going on inside it to permit stacking things on
-	 * top of it
-	 */
-	sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
-	if (!proc_parse_options(options, ns)) {
-		deactivate_locked_super(sb);
-		return ERR_PTR(-EINVAL);
-	}
-
-	if (!sb->s_root) {
-		err = proc_fill_super(sb);
-		if (err) {
-			deactivate_locked_super(sb);
-			return ERR_PTR(err);
-		}
-
-		sb->s_flags |= MS_ACTIVE;
-		/* User space would break if executables appear on proc */
-		sb->s_iflags |= SB_I_NOEXEC;
 	}
 
-	return dget(sb->s_root);
+	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
 }
 
 static void proc_kill_sb(struct super_block *sb)
@@ -165,7 +114,7 @@ static struct file_system_type proc_fs_type = {
 	.name = "proc",
 	.mount = proc_mount,
 	.kill_sb = proc_kill_sb,
-	.fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 void __init proc_root_init(void)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b1322dd9d136..1bfac28b7e7d 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
 	unsigned int hashent = hashfn(sb, qid);
 	struct dquot *dquot, *empty = NULL;
 
+	if (!qid_has_mapping(sb->s_user_ns, qid))
+		return ERR_PTR(-EINVAL);
+
 	if (!sb_has_quota_active(sb, qid.type))
 		return ERR_PTR(-ESRCH);
 we_slept:
@@ -2268,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EINVAL;
 		goto out_fmt;
 	}
+	/* Filesystems outside of init_user_ns not yet supported */
+	if (sb->s_user_ns != &init_user_ns) {
+		error = -EINVAL;
+		goto out_fmt;
+	}
 	/* Usage always has to be set... */
 	if (!(flags & DQUOT_USAGE_ENABLED)) {
 		error = -EINVAL;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
 	if (ret)
@@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
 	if (ret)
@@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	copy_from_if_dqblk(&fdq, &idq);
 	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
@@ -581,10 +581,10 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
@@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
 	if (ret)
@@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
 	if (ret)
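The quota hunks above all follow one pattern: the id supplied by the caller is converted with `make_kqid(current_user_ns(), ...)` and is then required to map into `sb->s_user_ns`. The sketch below models that two-namespace round trip in plain userspace C. It is an assumption-laden illustration, not the kernel implementation: the `toy_` types, the single-extent maps and the helper names are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* Hypothetical id-map extent: namespace ids [first, first + count) map to
 * kernel-wide ids [lower_first, lower_first + count). */
struct toy_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct toy_user_ns {
	const struct toy_extent *extents;
	size_t nr_extents;
};

/* Namespace-local id -> kernel-wide id (the role make_kqid() plays). */
uint32_t toy_make_kid(const struct toy_user_ns *ns, uint32_t id)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct toy_extent *e = &ns->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return TOY_INVALID_ID;
}

/* Kernel-wide id -> namespace-local id (the role from_kqid() plays). */
uint32_t toy_from_kid(const struct toy_user_ns *ns, uint32_t kid)
{
	assert(ns != NULL);
	if (kid == TOY_INVALID_ID)
		return TOY_INVALID_ID;
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct toy_extent *e = &ns->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return TOY_INVALID_ID;
}

/* Shape of the patched entry check: translate from the caller's namespace,
 * then insist that the result is representable in the filesystem's one. */
int toy_quota_check(const struct toy_user_ns *caller_ns,
		    const struct toy_user_ns *sb_ns, uint32_t id)
{
	uint32_t kid = toy_make_kid(caller_ns, id);

	if (toy_from_kid(sb_ns, kid) == TOY_INVALID_ID)
		return -EINVAL;
	return 0;
}
```

The same model also suggests why `quota_setxquota()` now compares `from_kqid(sb->s_user_ns, qid)` with 0: "id 0" is taken relative to the filesystem's owning namespace rather than relative to `init_user_ns`.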
diff --git a/fs/super.c b/fs/super.c
index 5806ffd45563..c2ff475c1711 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -33,6 +33,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/user_namespace.h>
 #include "internal.h"
 
 
@@ -165,6 +166,7 @@ static void destroy_super(struct super_block *s)
 	list_lru_destroy(&s->s_inode_lru);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
+	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
 	call_rcu(&s->rcu, destroy_super_rcu);
@@ -174,11 +176,13 @@ static void destroy_super(struct super_block *s)
  * alloc_super - create new superblock
  * @type: filesystem type superblock should belong to
  * @flags: the mount flags
+ * @user_ns: User namespace for the super_block
  *
  * Allocates and initializes a new &struct super_block. alloc_super()
  * returns a pointer new superblock or %NULL if allocation had failed.
  */
-static struct super_block *alloc_super(struct file_system_type *type, int flags)
+static struct super_block *alloc_super(struct file_system_type *type, int flags,
+				       struct user_namespace *user_ns)
 {
 	struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
 	static const struct super_operations default_op;
@@ -188,6 +192,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		return NULL;
 
 	INIT_LIST_HEAD(&s->s_mounts);
+	s->s_user_ns = get_user_ns(user_ns);
 
 	if (security_sb_alloc(s))
 		goto fail;
@@ -201,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	init_waitqueue_head(&s->s_writers.wait_unfrozen);
 	s->s_bdi = &noop_backing_dev_info;
 	s->s_flags = flags;
+	if (s->s_user_ns != &init_user_ns)
+		s->s_iflags |= SB_I_NODEV;
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
@@ -445,29 +452,42 @@ void generic_shutdown_super(struct super_block *sb)
 EXPORT_SYMBOL(generic_shutdown_super);
 
 /**
- * sget - find or create a superblock
+ * sget_userns - find or create a superblock
  * @type: filesystem type superblock should belong to
  * @test: comparison callback
  * @set: setup callback
  * @flags: mount flags
+ * @user_ns: User namespace for the super_block
  * @data: argument to each of them
  */
-struct super_block *sget(struct file_system_type *type,
+struct super_block *sget_userns(struct file_system_type *type,
 			int (*test)(struct super_block *,void *),
 			int (*set)(struct super_block *,void *),
-			int flags,
+			int flags, struct user_namespace *user_ns,
 			void *data)
 {
 	struct super_block *s = NULL;
 	struct super_block *old;
 	int err;
 
+	if (!(flags & MS_KERNMOUNT) &&
+	    !(type->fs_flags & FS_USERNS_MOUNT) &&
+	    !capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
 retry:
 	spin_lock(&sb_lock);
 	if (test) {
 		hlist_for_each_entry(old, &type->fs_supers, s_instances) {
 			if (!test(old, data))
 				continue;
+			if (user_ns != old->s_user_ns) {
+				spin_unlock(&sb_lock);
+				if (s) {
+					up_write(&s->s_umount);
+					destroy_super(s);
+				}
+				return ERR_PTR(-EBUSY);
+			}
 			if (!grab_super(old))
 				goto retry;
 			if (s) {
@@ -480,7 +500,7 @@ retry:
 	}
 	if (!s) {
 		spin_unlock(&sb_lock);
-		s = alloc_super(type, flags);
+		s = alloc_super(type, flags, user_ns);
 		if (!s)
 			return ERR_PTR(-ENOMEM);
 		goto retry;
@@ -503,6 +523,31 @@ retry:
 	return s;
 }
 
+EXPORT_SYMBOL(sget_userns);
+
+/**
+ * sget - find or create a superblock
+ * @type: filesystem type superblock should belong to
+ * @test: comparison callback
+ * @set: setup callback
+ * @flags: mount flags
+ * @data: argument to each of them
+ */
+struct super_block *sget(struct file_system_type *type,
+			int (*test)(struct super_block *,void *),
+			int (*set)(struct super_block *,void *),
+			int flags,
+			void *data)
+{
+	struct user_namespace *user_ns = current_user_ns();
+
+	/* Ensure the requestor has permissions over the target filesystem */
+	if (!(flags & MS_KERNMOUNT) && !ns_capable(user_ns, CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	return sget_userns(type, test, set, flags, user_ns, data);
+}
+
 EXPORT_SYMBOL(sget);
 
 void drop_super(struct super_block *sb)
@@ -920,12 +965,20 @@ static int ns_set_super(struct super_block *sb, void *data)
 	return set_anon_super(sb, NULL);
 }
 
-struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
-	void *data, int (*fill_super)(struct super_block *, void *, int))
+struct dentry *mount_ns(struct file_system_type *fs_type,
+	int flags, void *data, void *ns, struct user_namespace *user_ns,
+	int (*fill_super)(struct super_block *, void *, int))
 {
 	struct super_block *sb;
 
-	sb = sget(fs_type, ns_test_super, ns_set_super, flags, data);
+	/* Don't allow mounting unless the caller has CAP_SYS_ADMIN
+	 * over the namespace.
+	 */
+	if (!(flags & MS_KERNMOUNT) && !ns_capable(user_ns, CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	sb = sget_userns(fs_type, ns_test_super, ns_set_super, flags,
+			 user_ns, ns);
 	if (IS_ERR(sb))
 		return ERR_CAST(sb);
 
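The fs/super.c hunks above add two decisions to superblock lookup: an up-front permission gate, and a refusal (`-EBUSY`) to share an existing superblock whose owning user namespace differs from the one requested. The following is a simplified userspace sketch of just those two decisions. Locking, reference counting and allocation are deliberately left out, and the `toy_` names and the flat array of superblocks are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced superblock: a match key plus its owner. */
struct toy_sb {
	const void *key;      /* what the test() callback would compare */
	const void *user_ns;  /* stands in for s_user_ns */
};

/* Mirrors the condition at the top of sget_userns(): proceed for kernel
 * mounts, for filesystem types flagged FS_USERNS_MOUNT, or for callers
 * with global CAP_SYS_ADMIN. */
bool toy_may_create(bool kernmount, bool fs_userns_mount, bool has_cap_sys_admin)
{
	return kernmount || fs_userns_mount || has_cap_sys_admin;
}

/*
 * Returns 0 with *out set to a reusable superblock, 0 with *out == NULL when
 * nothing matched (the real code would allocate a new one at that point), or
 * -EBUSY when a match exists but belongs to a different user namespace.
 */
int toy_lookup_super(const struct toy_sb *sbs, size_t n, const void *key,
		     const void *user_ns, const struct toy_sb **out)
{
	assert(out != NULL);
	*out = NULL;
	for (size_t i = 0; i < n; i++) {
		if (sbs[i].key != key)
			continue;
		if (sbs[i].user_ns != user_ns)
			return -EBUSY;
		*out = &sbs[i];
		return 0;
	}
	return 0;
}
```

Read this way, the `-EBUSY` path keeps one mounter from silently inheriting a superblock that was set up under another namespace's id mappings.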
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index f3db82071cfb..20b8f82e115b 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -41,8 +41,7 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 	if (IS_ERR(root) || !new_sb)
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
 	else if (new_sb)
-		/* Userspace would break if executables appear on sysfs */
-		root->d_sb->s_iflags |= SB_I_NOEXEC;
+		root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
 
 	return root;
 }
@@ -59,7 +58,7 @@ static struct file_system_type sysfs_fs_type = {
 	.name = "sysfs",
 	.mount = sysfs_mount,
 	.kill_sb = sysfs_kill_sb,
-	.fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 int __init sysfs_init(void)
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*