diff options
-rw-r--r-- | Documentation/filesystems/vfs.txt | 217 |
1 files changed, 195 insertions, 22 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index e56e842847d3..0fcbd74efd2f 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -230,10 +230,15 @@ only called from a process context (i.e. not from an interrupt handler | |||
230 | or bottom half). | 230 | or bottom half). |
231 | 231 | ||
232 | alloc_inode: this method is called by inode_alloc() to allocate memory | 232 | alloc_inode: this method is called by inode_alloc() to allocate memory |
233 | for struct inode and initialize it. | 233 | for struct inode and initialize it. If this function is not |
234 | defined, a simple 'struct inode' is allocated. Normally | ||
235 | alloc_inode will be used to allocate a larger structure which | ||
236 | contains a 'struct inode' embedded within it. | ||
234 | 237 | ||
235 | destroy_inode: this method is called by destroy_inode() to release | 238 | destroy_inode: this method is called by destroy_inode() to release |
236 | resources allocated for struct inode. | 239 | resources allocated for struct inode. It is only required if |
240 | ->alloc_inode was defined and simply undoes anything done by | ||
241 | ->alloc_inode. | ||
237 | 242 | ||
238 | read_inode: this method is called to read a specific inode from the | 243 | read_inode: this method is called to read a specific inode from the |
239 | mounted filesystem. The i_ino member in the struct inode is | 244 | mounted filesystem. The i_ino member in the struct inode is |
@@ -443,14 +448,81 @@ otherwise noted. | |||
443 | The Address Space Object | 448 | The Address Space Object |
444 | ======================== | 449 | ======================== |
445 | 450 | ||
446 | The address space object is used to identify pages in the page cache. | 451 | The address space object is used to group and manage pages in the page |
447 | 452 | cache. It can be used to keep track of the pages in a file (or | |
453 | anything else) and also track the mapping of sections of the file into | ||
454 | process address spaces. | ||
455 | |||
456 | There are a number of distinct yet related services that an | ||
457 | address-space can provide. These include communicating memory | ||
458 | pressure, page lookup by address, and keeping track of pages tagged as | ||
459 | Dirty or Writeback. | ||
460 | |||
461 | The first can be used independantly to the others. The vm can try to | ||
462 | either write dirty pages in order to clean them, or release clean | ||
463 | pages in order to reuse them. To do this it can call the ->writepage | ||
464 | method on dirty pages, and ->releasepage on clean pages with | ||
465 | PagePrivate set. Clean pages without PagePrivate and with no external | ||
466 | references will be released without notice being given to the | ||
467 | address_space. | ||
468 | |||
469 | To achieve this functionality, pages need to be placed on an lru with | ||
470 | lru_cache_add and mark_page_active needs to be called whenever the | ||
471 | page is used. | ||
472 | |||
473 | Pages are normally kept in a radix tree index by ->index. This tree | ||
474 | maintains information about the PG_Dirty and PG_Writeback status of | ||
475 | each page, so that pages with either of these flags can be found | ||
476 | quickly. | ||
477 | |||
478 | The Dirty tag is primarily used by mpage_writepages - the default | ||
479 | ->writepages method. It uses the tag to find dirty pages to call | ||
480 | ->writepage on. If mpage_writepages is not used (i.e. the address | ||
481 | provides it's own ->writepages) , the PAGECACHE_TAG_DIRTY tag is | ||
482 | almost unused. write_inode_now and sync_inode do use it (through | ||
483 | __sync_single_inode) to check if ->writepages has been successful in | ||
484 | writing out the whole address_space. | ||
485 | |||
486 | The Writeback tag is used by filemap*wait* and sync_page* functions, | ||
487 | though wait_on_page_writeback_range, to wait for all writeback to | ||
488 | complete. While waiting ->sync_page (if defined) will be called on | ||
489 | each page that is found to require writeback | ||
490 | |||
491 | An address_space handler may attach extra information to a page, | ||
492 | typically using the 'private' field in the 'struct page'. If such | ||
493 | information is attached, the PG_Private flag should be set. This will | ||
494 | cause various mm routines to make extra calls into the address_space | ||
495 | handler to deal with that data. | ||
496 | |||
497 | An address space acts as an intermediate between storage and | ||
498 | application. Data is read into the address space a whole page at a | ||
499 | time, and provided to the application either by copying of the page, | ||
500 | or by memory-mapping the page. | ||
501 | Data is written into the address space by the application, and then | ||
502 | written-back to storage typically in whole pages, however the | ||
503 | address_space has finner control of write sizes. | ||
504 | |||
505 | The read process essentially only requires 'readpage'. The write | ||
506 | process is more complicated and uses prepare_write/commit_write or | ||
507 | set_page_dirty to write data into the address_space, and writepage, | ||
508 | sync_page, and writepages to writeback data to storage. | ||
509 | |||
510 | Adding and removing pages to/from an address_space is protected by the | ||
511 | inode's i_mutex. | ||
512 | |||
513 | When data is written to a page, the PG_Dirty flag should be set. It | ||
514 | typically remains set until writepage asks for it to be written. This | ||
515 | should clear PG_Dirty and set PG_Writeback. It can be actually | ||
516 | written at any point after PG_Dirty is clear. Once it is known to be | ||
517 | safe, PG_Writeback is cleared. | ||
518 | |||
519 | Writeback makes use of a writeback_control structure... | ||
448 | 520 | ||
449 | struct address_space_operations | 521 | struct address_space_operations |
450 | ------------------------------- | 522 | ------------------------------- |
451 | 523 | ||
452 | This describes how the VFS can manipulate mapping of a file to page cache in | 524 | This describes how the VFS can manipulate mapping of a file to page cache in |
453 | your filesystem. As of kernel 2.6.13, the following members are defined: | 525 | your filesystem. As of kernel 2.6.16, the following members are defined: |
454 | 526 | ||
455 | struct address_space_operations { | 527 | struct address_space_operations { |
456 | int (*writepage)(struct page *page, struct writeback_control *wbc); | 528 | int (*writepage)(struct page *page, struct writeback_control *wbc); |
@@ -469,47 +541,148 @@ struct address_space_operations { | |||
469 | loff_t offset, unsigned long nr_segs); | 541 | loff_t offset, unsigned long nr_segs); |
470 | struct page* (*get_xip_page)(struct address_space *, sector_t, | 542 | struct page* (*get_xip_page)(struct address_space *, sector_t, |
471 | int); | 543 | int); |
544 | /* migrate the contents of a page to the specified target */ | ||
545 | int (*migratepage) (struct page *, struct page *); | ||
472 | }; | 546 | }; |
473 | 547 | ||
474 | writepage: called by the VM write a dirty page to backing store. | 548 | writepage: called by the VM to write a dirty page to backing store. |
549 | This may happen for data integrity reason (i.e. 'sync'), or | ||
550 | to free up memory (flush). The difference can be seen in | ||
551 | wbc->sync_mode. | ||
552 | The PG_Dirty flag has been cleared and PageLocked is true. | ||
553 | writepage should start writeout, should set PG_Writeback, | ||
554 | and should make sure the page is unlocked, either synchronously | ||
555 | or asynchronously when the write operation completes. | ||
556 | |||
557 | If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to | ||
558 | try too hard if there are problems, and may choose to write out a | ||
559 | different page from the mapping if that would be more | ||
560 | appropriate. If it chooses not to start writeout, it should | ||
561 | return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep | ||
562 | calling ->writepage on that page. | ||
563 | |||
564 | See the file "Locking" for more details. | ||
475 | 565 | ||
476 | readpage: called by the VM to read a page from backing store. | 566 | readpage: called by the VM to read a page from backing store. |
567 | The page will be Locked when readpage is called, and should be | ||
568 | unlocked and marked uptodate once the read completes. | ||
569 | If ->readpage discovers that it needs to unlock the page for | ||
570 | some reason, it can do so, and then return AOP_TRUNCATED_PAGE. | ||
571 | In this case, the page will be re-located, re-locked and if | ||
572 | that all succeeds, ->readpage will be called again. | ||
477 | 573 | ||
478 | sync_page: called by the VM to notify the backing store to perform all | 574 | sync_page: called by the VM to notify the backing store to perform all |
479 | queued I/O operations for a page. I/O operations for other pages | 575 | queued I/O operations for a page. I/O operations for other pages |
480 | associated with this address_space object may also be performed. | 576 | associated with this address_space object may also be performed. |
481 | 577 | ||
578 | This function is optional and is called only for pages with | ||
579 | PG_Writeback set while waiting for the writeback to complete. | ||
580 | |||
482 | writepages: called by the VM to write out pages associated with the | 581 | writepages: called by the VM to write out pages associated with the |
483 | address_space object. | 582 | address_space object. If WBC_SYNC_ALL, then the |
583 | writeback_control will specify a range of pages that must be | ||
584 | written out. If WBC_SYNC_NONE, then a nr_to_write is given | ||
585 | and that many pages should be written if possible. | ||
586 | If no ->writepages is given, then mpage_writepages is used | ||
587 | instead. This will choose pages from the addresspace that are | ||
588 | tagged as DIRTY and will pass them to ->writepage. | ||
484 | 589 | ||
485 | set_page_dirty: called by the VM to set a page dirty. | 590 | set_page_dirty: called by the VM to set a page dirty. |
591 | This is particularly needed if an address space attaches | ||
592 | private data to a page, and that data needs to be updated when | ||
593 | a page is dirtied. This is called, for example, when a memory | ||
594 | mapped page gets modified. | ||
595 | If defined, it should set the PageDirty flag, and the | ||
596 | PAGECACHE_TAG_DIRTY tag in the radix tree. | ||
486 | 597 | ||
487 | readpages: called by the VM to read pages associated with the address_space | 598 | readpages: called by the VM to read pages associated with the address_space |
488 | object. | 599 | object. This is essentially just a vector version of |
600 | readpage. Instead of just one page, several pages are | ||
601 | requested. | ||
602 | readpages is only used for readahead, so read errors are | ||
603 | ignored. If anything goes wrong, feel free to give up. | ||
489 | 604 | ||
490 | prepare_write: called by the generic write path in VM to set up a write | 605 | prepare_write: called by the generic write path in VM to set up a write |
491 | request for a page. | 606 | request for a page. This indicates to the address space that |
492 | 607 | the given range of bytes are about to be written. The | |
493 | commit_write: called by the generic write path in VM to write page to | 608 | address_space should check that the write will be able to |
494 | its backing store. | 609 | complete, by allocating space if necessary and doing any other |
610 | internal house keeping. If the write will update parts of | ||
611 | any basic-blocks on storage, then those blocks should be | ||
612 | pre-read (if they haven't been read already) so that the | ||
613 | updated blocks can be written out properly. | ||
614 | The page will be locked. If prepare_write wants to unlock the | ||
615 | page it, like readpage, may do so and return | ||
616 | AOP_TRUNCATED_PAGE. | ||
617 | In this case the prepare_write will be retried one the lock is | ||
618 | regained. | ||
619 | |||
620 | commit_write: If prepare_write succeeds, new data will be copied | ||
621 | into the page and then commit_write will be called. It will | ||
622 | typically update the size of the file (if appropriate) and | ||
623 | mark the inode as dirty, and do any other related housekeeping | ||
624 | operations. It should avoid returning an error if possible - | ||
625 | errors should have been handled by prepare_write. | ||
495 | 626 | ||
496 | bmap: called by the VFS to map a logical block offset within object to | 627 | bmap: called by the VFS to map a logical block offset within object to |
497 | physical block number. This method is use by for the legacy FIBMAP | 628 | physical block number. This method is used by for the FIBMAP |
498 | ioctl. Other uses are discouraged. | 629 | ioctl and for working with swap-files. To be able to swap to |
499 | 630 | a file, the file must have as stable mapping to a block | |
500 | invalidatepage: called by the VM on truncate to disassociate a page from its | 631 | device. The swap system does not go through the filesystem |
501 | address_space mapping. | 632 | but instead uses bmap to find out where the blocks in the file |
502 | 633 | are and uses those addresses directly. | |
503 | releasepage: called by the VFS to release filesystem specific metadata from | 634 | |
504 | a page. | 635 | |
505 | 636 | invalidatepage: If a page has PagePrivate set, then invalidatepage | |
506 | direct_IO: called by the VM for direct I/O writes and reads. | 637 | will be called when part or all of the page is to be removed |
638 | from the address space. This generally corresponds either a | ||
639 | truncation or a complete invalidation of the address space | ||
640 | (in the latter case 'offset' will always be 0). | ||
641 | Any private data associated with the page should be updated | ||
642 | to reflect this truncation. If offset is 0, then | ||
643 | the private data should be released, because the page | ||
644 | must be able to be completely discarded. This may be done by | ||
645 | calling the ->releasepage function, but in this case the | ||
646 | release MUST succeed. | ||
647 | |||
648 | releasepage: releasepage is called on PagePrivate pages to indicate | ||
649 | that the page should be freed if possible. ->releasepage | ||
650 | should remove any private data from the page and clear the | ||
651 | PagePrivate flag. It may also remove the page from the | ||
652 | address_space. If this fails for some reason, it may indicate | ||
653 | failure with a 0 return value. | ||
654 | This is used in two distinct though related cases. The first | ||
655 | is when the VM finds a clean page with no active users and | ||
656 | wants to make it a free page. If ->releasepage succeeds, the | ||
657 | page will be removed from the address_space and become free. | ||
658 | |||
659 | The second case if when a request has been made to invalidate | ||
660 | some or all pages in an address_space. This can happen | ||
661 | through the fadvice(POSIX_FADV_DONTNEED) system call or by the | ||
662 | filesystem explicitly requesting it as nfs and 9fs do (when | ||
663 | they believe the cache may be out of date with storage) by | ||
664 | calling invalidate_inode_pages2(). | ||
665 | If the filesystem makes such a call, and needs to be certain | ||
666 | that all pages are invalidated, then it's releasepage will | ||
667 | need to ensure this. Possibly it can clear the PageUptodate | ||
668 | bit if it cannot free private data yet. | ||
669 | |||
670 | direct_IO: called by the generic read/write routines to perform | ||
671 | direct_IO - that is IO requests which bypass the page cache | ||
672 | and tranfer data directly between the storage and the | ||
673 | application's address space. | ||
507 | 674 | ||
508 | get_xip_page: called by the VM to translate a block number to a page. | 675 | get_xip_page: called by the VM to translate a block number to a page. |
509 | The page is valid until the corresponding filesystem is unmounted. | 676 | The page is valid until the corresponding filesystem is unmounted. |
510 | Filesystems that want to use execute-in-place (XIP) need to implement | 677 | Filesystems that want to use execute-in-place (XIP) need to implement |
511 | it. An example implementation can be found in fs/ext2/xip.c. | 678 | it. An example implementation can be found in fs/ext2/xip.c. |
512 | 679 | ||
680 | migrate_page: This is used to compact the physical memory usage. | ||
681 | If the VM wants to relocate a page (maybe off a memory card | ||
682 | that is signalling imminent failure) it will pass a new page | ||
683 | and an old page to this function. migrate_page should | ||
684 | transfer any private data across and update any references | ||
685 | that it has to the page. | ||
513 | 686 | ||
514 | The File Object | 687 | The File Object |
515 | =============== | 688 | =============== |