aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/vfs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/vfs.txt')
-rw-r--r--Documentation/filesystems/vfs.txt217
1 files changed, 195 insertions, 22 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index e56e842847d3..0fcbd74efd2f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -230,10 +230,15 @@ only called from a process context (i.e. not from an interrupt handler
230or bottom half). 230or bottom half).
231 231
232 alloc_inode: this method is called by inode_alloc() to allocate memory 232 alloc_inode: this method is called by inode_alloc() to allocate memory
233 for struct inode and initialize it. 233 for struct inode and initialize it. If this function is not
234 defined, a simple 'struct inode' is allocated. Normally
235 alloc_inode will be used to allocate a larger structure which
236 contains a 'struct inode' embedded within it.
234 237
235 destroy_inode: this method is called by destroy_inode() to release 238 destroy_inode: this method is called by destroy_inode() to release
236 resources allocated for struct inode. 239 resources allocated for struct inode. It is only required if
240 ->alloc_inode was defined and simply undoes anything done by
241 ->alloc_inode.
237 242
238 read_inode: this method is called to read a specific inode from the 243 read_inode: this method is called to read a specific inode from the
239 mounted filesystem. The i_ino member in the struct inode is 244 mounted filesystem. The i_ino member in the struct inode is
@@ -443,14 +448,81 @@ otherwise noted.
443The Address Space Object 448The Address Space Object
444======================== 449========================
445 450
446The address space object is used to identify pages in the page cache. 451The address space object is used to group and manage pages in the page
447 452cache. It can be used to keep track of the pages in a file (or
453anything else) and also track the mapping of sections of the file into
454process address spaces.
455
456There are a number of distinct yet related services that an
457address-space can provide. These include communicating memory
458pressure, page lookup by address, and keeping track of pages tagged as
459Dirty or Writeback.
460
461The first can be used independantly to the others. The vm can try to
462either write dirty pages in order to clean them, or release clean
463pages in order to reuse them. To do this it can call the ->writepage
464method on dirty pages, and ->releasepage on clean pages with
465PagePrivate set. Clean pages without PagePrivate and with no external
466references will be released without notice being given to the
467address_space.
468
469To achieve this functionality, pages need to be placed on an lru with
470lru_cache_add and mark_page_active needs to be called whenever the
471page is used.
472
473Pages are normally kept in a radix tree index by ->index. This tree
474maintains information about the PG_Dirty and PG_Writeback status of
475each page, so that pages with either of these flags can be found
476quickly.
477
478The Dirty tag is primarily used by mpage_writepages - the default
479->writepages method. It uses the tag to find dirty pages to call
480->writepage on. If mpage_writepages is not used (i.e. the address
481provides it's own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
482almost unused. write_inode_now and sync_inode do use it (through
483__sync_single_inode) to check if ->writepages has been successful in
484writing out the whole address_space.
485
486The Writeback tag is used by filemap*wait* and sync_page* functions,
487though wait_on_page_writeback_range, to wait for all writeback to
488complete. While waiting ->sync_page (if defined) will be called on
489each page that is found to require writeback
490
491An address_space handler may attach extra information to a page,
492typically using the 'private' field in the 'struct page'. If such
493information is attached, the PG_Private flag should be set. This will
494cause various mm routines to make extra calls into the address_space
495handler to deal with that data.
496
497An address space acts as an intermediate between storage and
498application. Data is read into the address space a whole page at a
499time, and provided to the application either by copying of the page,
500or by memory-mapping the page.
501Data is written into the address space by the application, and then
502written-back to storage typically in whole pages, however the
503address_space has finner control of write sizes.
504
505The read process essentially only requires 'readpage'. The write
506process is more complicated and uses prepare_write/commit_write or
507set_page_dirty to write data into the address_space, and writepage,
508sync_page, and writepages to writeback data to storage.
509
510Adding and removing pages to/from an address_space is protected by the
511inode's i_mutex.
512
513When data is written to a page, the PG_Dirty flag should be set. It
514typically remains set until writepage asks for it to be written. This
515should clear PG_Dirty and set PG_Writeback. It can be actually
516written at any point after PG_Dirty is clear. Once it is known to be
517safe, PG_Writeback is cleared.
518
519Writeback makes use of a writeback_control structure...
448 520
449struct address_space_operations 521struct address_space_operations
450------------------------------- 522-------------------------------
451 523
452This describes how the VFS can manipulate mapping of a file to page cache in 524This describes how the VFS can manipulate mapping of a file to page cache in
453your filesystem. As of kernel 2.6.13, the following members are defined: 525your filesystem. As of kernel 2.6.16, the following members are defined:
454 526
455struct address_space_operations { 527struct address_space_operations {
456 int (*writepage)(struct page *page, struct writeback_control *wbc); 528 int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -469,47 +541,148 @@ struct address_space_operations {
469 loff_t offset, unsigned long nr_segs); 541 loff_t offset, unsigned long nr_segs);
470 struct page* (*get_xip_page)(struct address_space *, sector_t, 542 struct page* (*get_xip_page)(struct address_space *, sector_t,
471 int); 543 int);
544 /* migrate the contents of a page to the specified target */
545 int (*migratepage) (struct page *, struct page *);
472}; 546};
473 547
474 writepage: called by the VM write a dirty page to backing store. 548 writepage: called by the VM to write a dirty page to backing store.
549 This may happen for data integrity reason (i.e. 'sync'), or
550 to free up memory (flush). The difference can be seen in
551 wbc->sync_mode.
552 The PG_Dirty flag has been cleared and PageLocked is true.
553 writepage should start writeout, should set PG_Writeback,
554 and should make sure the page is unlocked, either synchronously
555 or asynchronously when the write operation completes.
556
557 If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
558 try too hard if there are problems, and may choose to write out a
559 different page from the mapping if that would be more
560 appropriate. If it chooses not to start writeout, it should
561 return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
562 calling ->writepage on that page.
563
564 See the file "Locking" for more details.
475 565
476 readpage: called by the VM to read a page from backing store. 566 readpage: called by the VM to read a page from backing store.
567 The page will be Locked when readpage is called, and should be
568 unlocked and marked uptodate once the read completes.
569 If ->readpage discovers that it needs to unlock the page for
570 some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
571 In this case, the page will be re-located, re-locked and if
572 that all succeeds, ->readpage will be called again.
477 573
478 sync_page: called by the VM to notify the backing store to perform all 574 sync_page: called by the VM to notify the backing store to perform all
479 queued I/O operations for a page. I/O operations for other pages 575 queued I/O operations for a page. I/O operations for other pages
480 associated with this address_space object may also be performed. 576 associated with this address_space object may also be performed.
481 577
578 This function is optional and is called only for pages with
579 PG_Writeback set while waiting for the writeback to complete.
580
482 writepages: called by the VM to write out pages associated with the 581 writepages: called by the VM to write out pages associated with the
483 address_space object. 582 address_space object. If WBC_SYNC_ALL, then the
583 writeback_control will specify a range of pages that must be
584 written out. If WBC_SYNC_NONE, then a nr_to_write is given
585 and that many pages should be written if possible.
586 If no ->writepages is given, then mpage_writepages is used
587 instead. This will choose pages from the addresspace that are
588 tagged as DIRTY and will pass them to ->writepage.
484 589
485 set_page_dirty: called by the VM to set a page dirty. 590 set_page_dirty: called by the VM to set a page dirty.
591 This is particularly needed if an address space attaches
592 private data to a page, and that data needs to be updated when
593 a page is dirtied. This is called, for example, when a memory
594 mapped page gets modified.
595 If defined, it should set the PageDirty flag, and the
596 PAGECACHE_TAG_DIRTY tag in the radix tree.
486 597
487 readpages: called by the VM to read pages associated with the address_space 598 readpages: called by the VM to read pages associated with the address_space
488 object. 599 object. This is essentially just a vector version of
600 readpage. Instead of just one page, several pages are
601 requested.
602 readpages is only used for readahead, so read errors are
603 ignored. If anything goes wrong, feel free to give up.
489 604
490 prepare_write: called by the generic write path in VM to set up a write 605 prepare_write: called by the generic write path in VM to set up a write
491 request for a page. 606 request for a page. This indicates to the address space that
492 607 the given range of bytes are about to be written. The
493 commit_write: called by the generic write path in VM to write page to 608 address_space should check that the write will be able to
494 its backing store. 609 complete, by allocating space if necessary and doing any other
610 internal house keeping. If the write will update parts of
611 any basic-blocks on storage, then those blocks should be
612 pre-read (if they haven't been read already) so that the
613 updated blocks can be written out properly.
614 The page will be locked. If prepare_write wants to unlock the
615 page it, like readpage, may do so and return
616 AOP_TRUNCATED_PAGE.
617 In this case the prepare_write will be retried one the lock is
618 regained.
619
620 commit_write: If prepare_write succeeds, new data will be copied
621 into the page and then commit_write will be called. It will
622 typically update the size of the file (if appropriate) and
623 mark the inode as dirty, and do any other related housekeeping
624 operations. It should avoid returning an error if possible -
625 errors should have been handled by prepare_write.
495 626
496 bmap: called by the VFS to map a logical block offset within object to 627 bmap: called by the VFS to map a logical block offset within object to
497 physical block number. This method is use by for the legacy FIBMAP 628 physical block number. This method is used by for the FIBMAP
498 ioctl. Other uses are discouraged. 629 ioctl and for working with swap-files. To be able to swap to
499 630 a file, the file must have as stable mapping to a block
500 invalidatepage: called by the VM on truncate to disassociate a page from its 631 device. The swap system does not go through the filesystem
501 address_space mapping. 632 but instead uses bmap to find out where the blocks in the file
502 633 are and uses those addresses directly.
503 releasepage: called by the VFS to release filesystem specific metadata from 634
504 a page. 635
505 636 invalidatepage: If a page has PagePrivate set, then invalidatepage
506 direct_IO: called by the VM for direct I/O writes and reads. 637 will be called when part or all of the page is to be removed
638 from the address space. This generally corresponds either a
639 truncation or a complete invalidation of the address space
640 (in the latter case 'offset' will always be 0).
641 Any private data associated with the page should be updated
642 to reflect this truncation. If offset is 0, then
643 the private data should be released, because the page
644 must be able to be completely discarded. This may be done by
645 calling the ->releasepage function, but in this case the
646 release MUST succeed.
647
648 releasepage: releasepage is called on PagePrivate pages to indicate
649 that the page should be freed if possible. ->releasepage
650 should remove any private data from the page and clear the
651 PagePrivate flag. It may also remove the page from the
652 address_space. If this fails for some reason, it may indicate
653 failure with a 0 return value.
654 This is used in two distinct though related cases. The first
655 is when the VM finds a clean page with no active users and
656 wants to make it a free page. If ->releasepage succeeds, the
657 page will be removed from the address_space and become free.
658
659 The second case if when a request has been made to invalidate
660 some or all pages in an address_space. This can happen
661 through the fadvice(POSIX_FADV_DONTNEED) system call or by the
662 filesystem explicitly requesting it as nfs and 9fs do (when
663 they believe the cache may be out of date with storage) by
664 calling invalidate_inode_pages2().
665 If the filesystem makes such a call, and needs to be certain
666 that all pages are invalidated, then it's releasepage will
667 need to ensure this. Possibly it can clear the PageUptodate
668 bit if it cannot free private data yet.
669
670 direct_IO: called by the generic read/write routines to perform
671 direct_IO - that is IO requests which bypass the page cache
672 and tranfer data directly between the storage and the
673 application's address space.
507 674
508 get_xip_page: called by the VM to translate a block number to a page. 675 get_xip_page: called by the VM to translate a block number to a page.
509 The page is valid until the corresponding filesystem is unmounted. 676 The page is valid until the corresponding filesystem is unmounted.
510 Filesystems that want to use execute-in-place (XIP) need to implement 677 Filesystems that want to use execute-in-place (XIP) need to implement
511 it. An example implementation can be found in fs/ext2/xip.c. 678 it. An example implementation can be found in fs/ext2/xip.c.
512 679
680 migrate_page: This is used to compact the physical memory usage.
681 If the VM wants to relocate a page (maybe off a memory card
682 that is signalling imminent failure) it will pass a new page
683 and an old page to this function. migrate_page should
684 transfer any private data across and update any references
685 that it has to the page.
513 686
514The File Object 687The File Object
515=============== 688===============