aboutsummaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAge
* CIFS: Implement cifs_strict_readv (try #4)Pavel Shilovsky2011-01-20
| | | | | | | | | | Read from the cache if we have at least Level II oplock - otherwise read from the server. Add cifs_user_readv to let the client read into iovec buffers. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* CIFS: Implement cifs_file_strict_mmap (try #2)Pavel Shilovsky2011-01-20
| | | | | | | | Invalidate inode mapping if we don't have at least Level II oplock. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* CIFS: Implement cifs_strict_fsyncPavel Shilovsky2011-01-20
| | | | | | | | | | | Invalidate inode mapping if we don't have at least Level II oplock in cifs_strict_fsync. Also remove filemap_write_and_wait call from cifs_fsync because it is previously called from vfs_fsync_range. Add file operations' structures for strict cache mode. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* CIFS: Make cifsFileInfo_put work with strict cache modePavel Shilovsky2011-01-20
| | | | | | | | | | On strict cache mode when we close the last file handle of the inode we should set invalid_mapping flag on this inode to prevent data coherency problem when we open it again but it has been modified on the server. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: mangle existing header for SMB_COM_NT_CANCELJeff Layton2011-01-20
| | | | | | | | | | | | | | | The NT_CANCEL command looks just like the original command, except for a few small differences. The send_nt_cancel function however currently takes a tcon, which we don't have in SendReceive and SendReceive2. Instead of "respinning" the entire header for an NT_CANCEL, just mangle the existing header by replacing just the fields we need. This means we don't need a tcon and allows us to call it from other places. Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: remove code for setting timeouts on requestsJeff Layton2011-01-20
| | | | | | | | | | Since we don't time out individual requests anymore, remove the code that we used to use for setting timeouts on different requests. Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* [CIFS] cifs: reconnect unresponsive serversSteve French2011-01-20
| | | | | | | | | | | | | | | | If the server isn't responding to echoes, we don't want to leave tasks hung waiting for it to reply. At that point, we'll want to reconnect so that soft mounts can return an error to userspace quickly. If the client hasn't received a reply after a specified number of echo intervals, assume that the transport is down and attempt to reconnect the socket. The number of echo_intervals to wait before attempting to reconnect is tunable via a module parameter. Setting it to 0, means that the client will never attempt to reconnect. The default is 5. Signed-off-by: Jeff Layton <jlayton@redhat.com>
* cifs: set up recurring workqueue job to do SMB echo requestsJeff Layton2011-01-20
| | | | | | Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: add ability to send an echo requestJeff Layton2011-01-20
| | | | | | Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: add cifs_call_asyncJeff Layton2011-01-20
| | | | | | | | | Add a function that will send a request, and set up the mid for an async reply. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: allow for different handling of received responseJeff Layton2011-01-20
| | | | | | | | | | | | | | | | | | | | | | In order to incorporate async requests, we need to allow for a more general way to do things on receive, rather than just waking up a process. Turn the task pointer in the mid_q_entry into a callback function and a generic data pointer. When a response comes in, or the socket is reconnected, cifsd can call the callback function in order to wake up the process. The default is to just wake up the current process which should mean no change in behavior for existing code. Also, clean up the locking in cifs_reconnect. There doesn't seem to be any need to hold both the srv_mutex and GlobalMid_Lock when walking the list of mids. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: clean up sync_mid_resultJeff Layton2011-01-20
| | | | | | | | Make it use a switch statement based on the value of the midStatus. If the resp_buf is set, then MID_RESPONSE_RECEIVED is too. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: don't reconnect server when we don't get a responseJeff Layton2011-01-20
| | | | | | | | | | | | | We only want to force a reconnect to the server under very limited and specific circumstances. Now that we have processes waiting indefinitely for responses, we shouldn't reach this point unless a reconnect is already in process. Thus, there's no reason to re-mark the server for reconnect here. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: wait indefinitely for responsesJeff Layton2011-01-20
| | | | | | | | | | | | | | | | | | | | The client should not be timing out on individual SMB requests. Too much of the state between client and server is tied to the state of the socket. If we time out requests and issue spurious disconnects then that comprimises data integrity. Instead of doing this complicated dance where we try to decide how long to wait for a response for particular requests, have the client instead wait indefinitely for a response. Also, use a TASK_KILLABLE sleep here so that fatal signals will break out of this waiting. Later patches will add support for detecting dead peers and forcing reconnects based on that. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: Use mask of ACEs for SID Everyone to calculate all three permissions ↵Shirish Pargaonkar2011-01-19
| | | | | | | | | | | | | | user, group, and other If a DACL has entries for ACEs for SID Everyone and Authenticated Users, factor in mask in respective entries during calculation of permissions for all three, user, group, and other. http://technet.microsoft.com/en-us/library/bb463216.aspx Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: Fix regression during share-level security mounts (Repost)Shirish Pargaonkar2011-01-19
| | | | | | | | | | | | | NTLM response length was changed to 16 bytes instead of 24 bytes that are sent in Tree Connection Request during share-level security share mounts. Revert it back to 24 bytes. Reported-and-Tested-by: Grzegorz Ozanski <grzegorz.ozanski@intel.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Acked-by: Suresh Jayaraman <sjayaraman@suse.de> Cc: stable@kernel.org Signed-off-by: Steve French <sfrench@us.ibm.com>
* [CIFS] Update cifs version numberSteve French2011-01-19
| | | | Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: move mid result processing into common functionJeff Layton2011-01-19
| | | | | | | Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: move locked sections out of DeleteMidQEntry and AllocMidQEntryJeff Layton2011-01-19
| | | | | | | | | | | | | | In later patches, we're going to need to have finer-grained control over the addition and removal of these structs from the pending_mid_q and we'll need to be able to call the destructor while holding the spinlock. Move the locked sections out of both routines and into the callers. Fix up current callers of DeleteMidQEntry to call a new routine that dequeues the entry and then destroys it. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: clean up accesses to midCountJeff Layton2011-01-19
| | | | | | | | | | | | It's an atomic_t and the code accesses the "counter" field in it directly instead of using atomic_read(). It also is sometimes accessed under a spinlock and sometimes not. Move it out of the spinlock since we don't need belt-and-suspenders for something that's just informational. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: make wait_for_free_request take a TCP_Server_Info pointerJeff Layton2011-01-19
| | | | | | | | | The cifsSesInfo pointer is only used to get at the server. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: no need to mark smb_ses_list as cifs_demultiplex_thread is exitingJeff Layton2011-01-19
| | | | | | | | | | | The TCP_Server_Info is refcounted and every SMB session holds a reference to it. Thus, smb_ses_list is always going to be empty when cifsd is coming down. This is dead code. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* cifs: don't fail writepages on -EAGAIN errorsJeff Layton2011-01-19
| | | | | | | | | | | | | | | | | | | | If CIFSSMBWrite2 returns -EAGAIN, then the error should be considered temporary. CIFS should retry the write instead of setting an error on the mapping and returning. For WB_SYNC_ALL, just retry the write immediately. In the WB_SYNC_NONE case, call redirty_page_for_writeback on all of the pages that didn't get written out and then move on. Also, fix up the handling of a short write with a successful return code. MS-CIFS says that 0 bytes_written means ENOSPC or EFBIG. It doesn't mention what a short, but non-zero write means, so for now treat it as we would an -EAGAIN return. Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* CIFS: Fix oplock break handling (try #2)Pavel Shilovsky2011-01-19
| | | | | | | | | | | | | | | When we get oplock break notification we should set the appropriate value of OplockLevel field in oplock break acknowledge according to the oplock level held by the client in this time. As we only can have level II oplock or no oplock in the case of oplock break, we should be aware only about clientCanCacheRead field in cifsInodeInfo structure. Also fix bug connected with wrong interpretation of OplockLevel field during oplock break notification processing. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>
* autofs4: clean ->d_release() and autofs4_free_ino() upAl Viro2011-01-18
| | | | | | | | The latter is called only when both ino and dentry are about to be freed, so cleaning ->d_fsdata and ->dentry is pointless. Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: split autofs4_init_ino()Al Viro2011-01-18
| | | | | | | | | split init_ino into new_ino and clean_ino; the former is what used to be init_ino(NULL, sbi), the latter is for cases where we passed non-NULL ino. Lose unused arguments. Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: mkdir and symlink always get a dentry that had passed lookupAl Viro2011-01-18
| | | | | | | ... so ->d_fsdata will have been set up before we get there Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: autofs4_get_inode() doesn't need autofs_info * argument anymoreAl Viro2011-01-18
| | | | | Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: kill ->size in autofs_infoAl Viro2011-01-18
| | | | | | | | | | It's used only to pass the length of symlink body to autofs4_get_inode() in autofs4_dir_symlink(). We can bloody well set inode->i_size in autofs4_dir_symlink() directly and be done with that. Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: pass mode to autofs4_get_inode() explicitlyAl Viro2011-01-18
| | | | | | | | | In all cases we'd set inf->mode to know value just before passing it to autofs4_get_inode(). That kills the need to store it in autofs_info and pass it to autofs_init_ino() Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: autofs4_mkroot() is not different from autofs4_init_ino()Al Viro2011-01-18
| | | | | | | | Kill it. Mind you, it's been an obfuscated call of autofs4_init_ino() ever since 2.3.99pre6-4... Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4: keep symlink body in inode->i_privateAl Viro2011-01-18
| | | | | | | | gets rid of all ->free()/->u.symlink machinery in autofs; we simply keep symlink bodies in inode->i_private and free them in ->evict_inode(). Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4 - fix debug print in autofs4_lookup()Ian Kent2011-01-18
| | | | | | | oz_mode isn't defined any more, use autofs4_oz_mode(sbi) instead. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs - fix dentry ref count in do_lookup()Ian Kent2011-01-18
| | | | | | | | | | | There is a ref count problem in fs/namei.c:do_lookup(). When walking in ref-walk mode, if follow_managed() returns a fail we need to drop dentry and possibly vfsmount. Clean up properly, as we do in the other caller of follow_managed(). Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* autofs4 - fix get_next_positive_dentry()Ian Kent2011-01-18
| | | | | | | | | | | | | | The initialization condition in fs/autofs4/expire.c:get_next_positive_dentry() appears to be incorrect. If prev == NULL I believe that root should be returned. Further down, at the current dentry check for it being simple_positive() it looks like the d_lock for dentry p should be dropped instead of dentry ret, otherwise when p is assinged to ret we end up with no lock on p and a lost lock on ret, which leads to a deadlock. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2011-01-17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits) Btrfs: forced readonly mounts on errors btrfs: Require CAP_SYS_ADMIN for filesystem rebalance Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check btrfs: Fix memory leak in btrfs_read_fs_root_no_radix() btrfs: check NULL or not btrfs: Don't pass NULL ptr to func that may deref it. btrfs: mount failure return value fix btrfs: Mem leak in btrfs_get_acl() btrfs: fix wrong free space information of btrfs btrfs: make the chunk allocator utilize the devices better btrfs: restructure find_free_dev_extent() btrfs: fix wrong calculation of stripe size btrfs: try to reclaim some space when chunk allocation fails btrfs: fix wrong data space statistics fs/btrfs: Fix build of ctree Btrfs: fix off by one while setting block groups readonly Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls Btrfs: Add readonly snapshots support Btrfs: Refactor btrfs_ioctl_snap_create() btrfs: Extract duplicate decompress code ...
| * Btrfs: forced readonly mounts on errorsliubo2011-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch comes from "Forced readonly mounts on errors" ideas. As we know, this is the first step in being more fault tolerant of disk corruptions instead of just using BUG() statements. The major content: - add a framework for generating errors that should result in filesystems going readonly. - keep FS state in disk super block. - make sure that all of resource will be freed and released at umount time. - make sure that fter FS is forced readonly on error, there will be no more disk change before FS is corrected. For this, we should stop write operation. After this patch is applied, the conversion from BUG() to such a framework can happen incrementally. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: Require CAP_SYS_ADMIN for filesystem rebalanceBen Hutchings2011-01-16
| | | | | | | | | | | | | | | | | | | | Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire filesystem and may run uninterruptibly for a long time. This does not seem to be something that an unprivileged user should be able to do. Reported-by: Aron Xu <happyaron.xu@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_checkJosef Bacik2011-01-16
| | | | | | | | | | | | | | | | | | | | If we run low on space we could get a bunch of warnings out of btrfs_block_rsv_check, but this is mostly just called via the transaction code to see if we need to end the transaction, it expects to see failures, so let's not WARN and freak everybody out for no reason. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()Tsutomu Itoh2011-01-16
| | | | | | | | | | | | | | | | In btrfs_read_fs_root_no_radix(), 'root' is not freed if btrfs_search_slot() returns error. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: check NULL or notTsutomu Itoh2011-01-16
| | | | | | | | | | | | | | Should check if functions returns NULL or not. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: Don't pass NULL ptr to func that may deref it.Jesper Juhl2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, In fs/btrfs/inode.c::fixup_tree_root_location() we have this code: ... if (!path) { err = -ENOMEM; goto out; } ... out: btrfs_free_path(path); return err; btrfs_free_path() passes its argument on to other functions and some of them end up dereferencing the pointer. In the code above that pointer is clearly NULL, so btrfs_free_path() will eventually cause a NULL dereference. There are many ways to cut this cake (fix the bug). The one I chose was to make btrfs_free_path() deal gracefully with NULL pointers. If you disagree, feel free to come up with an alternative patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: mount failure return value fixDave Young2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I happened to pass swap partition as root partition in cmdline, then kernel panic and tell me about "Cannot open root device". It is not correct, in fact it is a fs type mismatch instead of 'no device'. Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL. The logic in init/do_mounts.c: for (p = fs_names; *p; p += strlen(p)+1) { int err = do_mount_root(name, p, flags, root_mount_data); switch (err) { case 0: goto out; case -EACCES: flags |= MS_RDONLY; goto retry; case -EINVAL: continue; } print "Cannot open root device" panic } SO fs type after btrfs will have no chance to mount Here fix the return value as -EINVAL Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: Mem leak in btrfs_get_acl()Jesper Juhl2011-01-16
| | | | | | | | | | | | | | | | | | It seems to me that we leak the memory allocated to 'value' in btrfs_get_acl() if the call to posix_acl_from_xattr() fails. Here's a patch that attempts to correct that problem. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: fix wrong free space information of btrfsMiao Xie2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we store data by raid profile in btrfs with two or more different size disks, df command shows there is some free space in the filesystem, but the user can not write any data in fact, df command shows the wrong free space information of btrfs. # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10 # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 28.00KB devid 1 size 5.01GB used 2.03GB path /dev/sda9 devid 2 size 10.00GB used 2.01GB path /dev/sda10 # btrfs device scan /dev/sda9 /dev/sda10 # mount /dev/sda9 /mnt # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999 (fill the filesystem) # sync # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 3.99GB devid 1 size 5.01GB used 5.01GB path /dev/sda9 devid 2 size 10.00GB used 4.99GB path /dev/sda10 It is because btrfs cannot allocate chunks when one of the pairing disks has no space, the free space on the other disks can not be used for ever, and should be subtracted from the total space, but btrfs doesn't subtract this space from the total. It is strange to the user. This patch fixes it by calcing the free space that can be used to allocate chunks. Implementation: 1. get all the devices free space, and align them by stripe length. 2. sort the devices by the free space. 3. check the free space of the devices, 3.1. if it is not zero, and then check the number of the devices that has more free space than this device, if the number of the devices is beyond the min stripe number, the free space can be used, and add into total free space. if the number of the devices is below the min stripe number, we can not use the free space, the check ends. 3.2. if the free space is zero, check the next devices, goto 3.1 This implementation is just likely fake chunk allocation. After appling this patch, df can show correct space information: # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 0 100% /mnt Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: make the chunk allocator utilize the devices betterMiao Xie2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch, we change the handling method when we can not get enough free extents with default size. Implementation: 1. Look up the suitable free extent on each device and keep the search result. If not find a suitable free extent, keep the max free extent 2. If we get enough suitable free extents with default size, chunk allocation succeeds. 3. If we can not get enough free extents, but the number of the extent with default size is >= min_stripes, we just change the mapping information (reduce the number of stripes in the extent map), and chunk allocation succeeds. 4. If the number of the extent with default size is < min_stripes, sort the devices by its max free extent's size descending 5. Use the size of the max free extent on the (num_stripes - 1)th device as the stripe size to allocate the device space By this way, the chunk allocator can allocate chunks as large as possible when the devices' space is not enough and make full use of the devices. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: restructure find_free_dev_extent()Miao Xie2011-01-16
| | | | | | | | | | | | | | | | | | - make it return the start position and length of the max free space when it can not find a suitable free space. - make it more readability Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: fix wrong calculation of stripe sizeMiao Xie2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two tiny problem: - One is When we check the chunk size is greater than the max chunk size or not, we should take mirrors into account, but the original code didn't. - The other is btrfs shouldn't use the size of the residual free space as the length of of a dup chunk when doing chunk allocation. It is because the device space that a dup chunk needs is twice as large as the chunk size, if we use the size of the residual free space as the length of a dup chunk, we can not get enough free space. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: try to reclaim some space when chunk allocation failsMiao Xie2011-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot write data into files when when there is tiny space in the filesystem. Reproduce steps: # mkfs.btrfs /dev/sda1 # mount /dev/sda1 /mnt # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999 (fill the filesystem) # umount /mnt # mount /dev/sda1 /mnt # rm -f /mnt/tmpfile0 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1 (failed with nospec) But if we do the last step again, we can write data successfully. The reason of the problem is that btrfs didn't try to commit the current transaction and reclaim some space when chunk allocation failed. This patch fixes it by committing the current transaction to reclaim some space when chunk allocation fails. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: fix wrong data space statisticsMiao Xie2011-01-16
| | | | | | | | | | | | | | | | | | Josef has implemented mixed data/metadata chunks, we must add those chunks' space just like data chunks. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>