| -rw-r--r-- | Documentation/filesystems/vfs.txt | 217 |
1 files changed, 195 insertions, 22 deletions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index e56e842847d3..0fcbd74efd2f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
| @@ -230,10 +230,15 @@ only called from a process context (i.e. not from an interrupt handler | |||
| 230 | or bottom half). | 230 | or bottom half). |
| 231 | 231 | ||
| 232 | alloc_inode: this method is called by inode_alloc() to allocate memory | 232 | alloc_inode: this method is called by inode_alloc() to allocate memory |
| 233 | for struct inode and initialize it. | 233 | for struct inode and initialize it. If this function is not |
| 234 | defined, a simple 'struct inode' is allocated. Normally | ||
| 235 | alloc_inode will be used to allocate a larger structure which | ||
| 236 | contains a 'struct inode' embedded within it. | ||
| 234 | 237 | ||
| 235 | destroy_inode: this method is called by destroy_inode() to release | 238 | destroy_inode: this method is called by destroy_inode() to release |
| 236 | resources allocated for struct inode. | 239 | resources allocated for struct inode. It is only required if |
| 240 | ->alloc_inode was defined and simply undoes anything done by | ||
| 241 | ->alloc_inode. | ||
| 237 | 242 | ||
| 238 | read_inode: this method is called to read a specific inode from the | 243 | read_inode: this method is called to read a specific inode from the |
| 239 | mounted filesystem. The i_ino member in the struct inode is | 244 | mounted filesystem. The i_ino member in the struct inode is |
| @@ -443,14 +448,81 @@ otherwise noted. | |||
| 443 | The Address Space Object | 448 | The Address Space Object |
| 444 | ======================== | 449 | ======================== |
| 445 | 450 | ||
| 446 | The address space object is used to identify pages in the page cache. | 451 | The address space object is used to group and manage pages in the page |
| 447 | 452 | cache. It can be used to keep track of the pages in a file (or | |
| 453 | anything else) and also track the mapping of sections of the file into | ||
| 454 | process address spaces. | ||
| 455 | |||
| 456 | There are a number of distinct yet related services that an | ||
| 457 | address-space can provide. These include communicating memory | ||
| 458 | pressure, page lookup by address, and keeping track of pages tagged as | ||
| 459 | Dirty or Writeback. | ||
| 460 | |||
| 461 | The first can be used independently of the others. The VM can try to | ||
| 462 | either write dirty pages in order to clean them, or release clean | ||
| 463 | pages in order to reuse them. To do this it can call the ->writepage | ||
| 464 | method on dirty pages, and ->releasepage on clean pages with | ||
| 465 | PagePrivate set. Clean pages without PagePrivate and with no external | ||
| 466 | references will be released without notice being given to the | ||
| 467 | address_space. | ||
| 468 | |||
| 469 | To achieve this functionality, pages need to be placed on an lru with | ||
| 470 | lru_cache_add and mark_page_active needs to be called whenever the | ||
| 471 | page is used. | ||
| 472 | |||
| 473 | Pages are normally kept in a radix tree indexed by ->index. This tree | ||
| 474 | maintains information about the PG_Dirty and PG_Writeback status of | ||
| 475 | each page, so that pages with either of these flags can be found | ||
| 476 | quickly. | ||
| 477 | |||
| 478 | The Dirty tag is primarily used by mpage_writepages - the default | ||
| 479 | ->writepages method. It uses the tag to find dirty pages to call | ||
| 480 | ->writepage on. If mpage_writepages is not used (i.e. the address_space | ||
| 481 | provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is | ||
| 482 | almost unused. write_inode_now and sync_inode do use it (through | ||
| 483 | __sync_single_inode) to check if ->writepages has been successful in | ||
| 484 | writing out the whole address_space. | ||
| 485 | |||
| 486 | The Writeback tag is used by filemap*wait* and sync_page* functions, | ||
| 487 | through wait_on_page_writeback_range, to wait for all writeback to | ||
| 488 | complete. While waiting, ->sync_page (if defined) will be called on | ||
| 489 | each page that is found to require writeback. | ||
| 490 | |||
| 491 | An address_space handler may attach extra information to a page, | ||
| 492 | typically using the 'private' field in the 'struct page'. If such | ||
| 493 | information is attached, the PG_Private flag should be set. This will | ||
| 494 | cause various mm routines to make extra calls into the address_space | ||
| 495 | handler to deal with that data. | ||
| 496 | |||
| 497 | An address space acts as an intermediary between storage and | ||
| 498 | application. Data is read into the address space a whole page at a | ||
| 499 | time, and provided to the application either by copying the page, | ||
| 500 | or by memory-mapping the page. | ||
| 501 | Data is written into the address space by the application, and then | ||
| 502 | written back to storage, typically in whole pages; however, the | ||
| 503 | address_space has finer control of write sizes. | ||
| 504 | |||
| 505 | The read process essentially only requires 'readpage'. The write | ||
| 506 | process is more complicated and uses prepare_write/commit_write or | ||
| 507 | set_page_dirty to write data into the address_space, and writepage, | ||
| 508 | sync_page, and writepages to write data back to storage. | ||
| 509 | |||
| 510 | Adding and removing pages to/from an address_space is protected by the | ||
| 511 | inode's i_mutex. | ||
| 512 | |||
| 513 | When data is written to a page, the PG_Dirty flag should be set. It | ||
| 514 | typically remains set until writepage asks for the page to be written. | ||
| 515 | That request should clear PG_Dirty and set PG_Writeback. The page can | ||
| 516 | actually be written at any point after PG_Dirty is cleared. Once the | ||
| 517 | data is known to be safe, PG_Writeback is cleared. | ||
| 518 | |||
| 519 | Writeback makes use of a writeback_control structure... | ||
| 448 | 520 | ||
| 449 | struct address_space_operations | 521 | struct address_space_operations |
| 450 | ------------------------------- | 522 | ------------------------------- |
| 451 | 523 | ||
| 452 | This describes how the VFS can manipulate mapping of a file to page cache in | 524 | This describes how the VFS can manipulate mapping of a file to page cache in |
| 453 | your filesystem. As of kernel 2.6.13, the following members are defined: | 525 | your filesystem. As of kernel 2.6.16, the following members are defined: |
| 454 | 526 | ||
| 455 | struct address_space_operations { | 527 | struct address_space_operations { |
| 456 | int (*writepage)(struct page *page, struct writeback_control *wbc); | 528 | int (*writepage)(struct page *page, struct writeback_control *wbc); |
| @@ -469,47 +541,148 @@ struct address_space_operations { | |||
| 469 | loff_t offset, unsigned long nr_segs); | 541 | loff_t offset, unsigned long nr_segs); |
| 470 | struct page* (*get_xip_page)(struct address_space *, sector_t, | 542 | struct page* (*get_xip_page)(struct address_space *, sector_t, |
| 471 | int); | 543 | int); |
| 544 | /* migrate the contents of a page to the specified target */ | ||
| 545 | int (*migratepage) (struct page *, struct page *); | ||
| 472 | }; | 546 | }; |
| 473 | 547 | ||
| 474 | writepage: called by the VM write a dirty page to backing store. | 548 | writepage: called by the VM to write a dirty page to backing store. |
| 549 | This may happen for data integrity reasons (i.e. 'sync'), or | ||
| 550 | to free up memory (flush). The difference can be seen in | ||
| 551 | wbc->sync_mode. | ||
| 552 | The PG_Dirty flag has been cleared and PageLocked is true. | ||
| 553 | writepage should start writeout, should set PG_Writeback, | ||
| 554 | and should make sure the page is unlocked, either synchronously | ||
| 555 | or asynchronously when the write operation completes. | ||
| 556 | |||
| 557 | If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to | ||
| 558 | try too hard if there are problems, and may choose to write out a | ||
| 559 | different page from the mapping if that would be more | ||
| 560 | appropriate. If it chooses not to start writeout, it should | ||
| 561 | return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep | ||
| 562 | calling ->writepage on that page. | ||
| 563 | |||
| 564 | See the file "Locking" for more details. | ||
| 475 | 565 | ||
| 476 | readpage: called by the VM to read a page from backing store. | 566 | readpage: called by the VM to read a page from backing store. |
| 567 | The page will be Locked when readpage is called, and should be | ||
| 568 | unlocked and marked uptodate once the read completes. | ||
| 569 | If ->readpage discovers that it needs to unlock the page for | ||
| 570 | some reason, it can do so, and then return AOP_TRUNCATED_PAGE. | ||
| 571 | In this case, the page will be re-located, re-locked and if | ||
| 572 | that all succeeds, ->readpage will be called again. | ||
| 477 | 573 | ||
| 478 | sync_page: called by the VM to notify the backing store to perform all | 574 | sync_page: called by the VM to notify the backing store to perform all |
| 479 | queued I/O operations for a page. I/O operations for other pages | 575 | queued I/O operations for a page. I/O operations for other pages |
| 480 | associated with this address_space object may also be performed. | 576 | associated with this address_space object may also be performed. |
| 481 | 577 | ||
| 578 | This function is optional and is called only for pages with | ||
| 579 | PG_Writeback set while waiting for the writeback to complete. | ||
| 580 | |||
| 482 | writepages: called by the VM to write out pages associated with the | 581 | writepages: called by the VM to write out pages associated with the |
| 483 | address_space object. | 582 | address_space object. If wbc->sync_mode is WB_SYNC_ALL, then the |
| 583 | writeback_control will specify a range of pages that must be | ||
| 584 | written out. If it is WB_SYNC_NONE, then a nr_to_write is given | ||
| 585 | and that many pages should be written if possible. | ||
| 586 | If no ->writepages is given, then mpage_writepages is used | ||
| 587 | instead. This will choose pages from the address space that are | ||
| 588 | tagged as DIRTY and will pass them to ->writepage. | ||
| 484 | 589 | ||
| 485 | set_page_dirty: called by the VM to set a page dirty. | 590 | set_page_dirty: called by the VM to set a page dirty. |
| 591 | This is particularly needed if an address space attaches | ||
| 592 | private data to a page, and that data needs to be updated when | ||
| 593 | a page is dirtied. This is called, for example, when a memory | ||
| 594 | mapped page gets modified. | ||
| 595 | If defined, it should set the PageDirty flag, and the | ||
| 596 | PAGECACHE_TAG_DIRTY tag in the radix tree. | ||
| 486 | 597 | ||
| 487 | readpages: called by the VM to read pages associated with the address_space | 598 | readpages: called by the VM to read pages associated with the address_space |
| 488 | object. | 599 | object. This is essentially just a vector version of |
| 600 | readpage. Instead of just one page, several pages are | ||
| 601 | requested. | ||
| 602 | readpages is only used for readahead, so read errors are | ||
| 603 | ignored. If anything goes wrong, feel free to give up. | ||
| 489 | 604 | ||
| 490 | prepare_write: called by the generic write path in VM to set up a write | 605 | prepare_write: called by the generic write path in VM to set up a write |
| 491 | request for a page. | 606 | request for a page. This indicates to the address space that |
| 492 | 607 | the given range of bytes is about to be written. The | |
| 493 | commit_write: called by the generic write path in VM to write page to | 608 | address_space should check that the write will be able to |
| 494 | its backing store. | 609 | complete, by allocating space if necessary and doing any other |
| 610 | internal housekeeping. If the write will update parts of | ||
| 611 | any basic-blocks on storage, then those blocks should be | ||
| 612 | pre-read (if they haven't been read already) so that the | ||
| 613 | updated blocks can be written out properly. | ||
| 614 | The page will be locked. If prepare_write wants to unlock the | ||
| 615 | page it, like readpage, may do so and return | ||
| 616 | AOP_TRUNCATED_PAGE. | ||
| 617 | In this case the prepare_write will be retried once the lock is | ||
| 618 | regained. | ||
| 619 | |||
| 620 | commit_write: If prepare_write succeeds, new data will be copied | ||
| 621 | into the page and then commit_write will be called. It will | ||
| 622 | typically update the size of the file (if appropriate) and | ||
| 623 | mark the inode as dirty, and do any other related housekeeping | ||
| 624 | operations. It should avoid returning an error if possible - | ||
| 625 | errors should have been handled by prepare_write. | ||
| 495 | 626 | ||
| 496 | bmap: called by the VFS to map a logical block offset within object to | 627 | bmap: called by the VFS to map a logical block offset within object to |
| 497 | physical block number. This method is use by for the legacy FIBMAP | 628 | physical block number. This method is used for the FIBMAP |
| 498 | ioctl. Other uses are discouraged. | 629 | ioctl and for working with swap-files. To be able to swap to |
| 499 | 630 | a file, the file must have a stable mapping to a block | |
| 500 | invalidatepage: called by the VM on truncate to disassociate a page from its | 631 | device. The swap system does not go through the filesystem |
| 501 | address_space mapping. | 632 | but instead uses bmap to find out where the blocks in the file |
| 502 | 633 | are and uses those addresses directly. | |
| 503 | releasepage: called by the VFS to release filesystem specific metadata from | 634 | |
| 504 | a page. | 635 | |
| 505 | 636 | invalidatepage: If a page has PagePrivate set, then invalidatepage | |
| 506 | direct_IO: called by the VM for direct I/O writes and reads. | 637 | will be called when part or all of the page is to be removed |
| 638 | from the address space. This generally corresponds to either a | ||
| 639 | truncation or a complete invalidation of the address space | ||
| 640 | (in the latter case 'offset' will always be 0). | ||
| 641 | Any private data associated with the page should be updated | ||
| 642 | to reflect this truncation. If offset is 0, then | ||
| 643 | the private data should be released, because the page | ||
| 644 | must be able to be completely discarded. This may be done by | ||
| 645 | calling the ->releasepage function, but in this case the | ||
| 646 | release MUST succeed. | ||
| 647 | |||
| 648 | releasepage: releasepage is called on PagePrivate pages to indicate | ||
| 649 | that the page should be freed if possible. ->releasepage | ||
| 650 | should remove any private data from the page and clear the | ||
| 651 | PagePrivate flag. It may also remove the page from the | ||
| 652 | address_space. If this fails for some reason, it may indicate | ||
| 653 | failure with a 0 return value. | ||
| 654 | This is used in two distinct though related cases. The first | ||
| 655 | is when the VM finds a clean page with no active users and | ||
| 656 | wants to make it a free page. If ->releasepage succeeds, the | ||
| 657 | page will be removed from the address_space and become free. | ||
| 658 | |||
| 659 | The second case is when a request has been made to invalidate | ||
| 660 | some or all pages in an address_space. This can happen | ||
| 661 | through the fadvise(POSIX_FADV_DONTNEED) system call or by the | ||
| 662 | filesystem explicitly requesting it as nfs and 9fs do (when | ||
| 663 | they believe the cache may be out of date with storage) by | ||
| 664 | calling invalidate_inode_pages2(). | ||
| 665 | If the filesystem makes such a call, and needs to be certain | ||
| 666 | that all pages are invalidated, then its releasepage will | ||
| 667 | need to ensure this. Possibly it can clear the PageUptodate | ||
| 668 | bit if it cannot free private data yet. | ||
| 669 | |||
| 670 | direct_IO: called by the generic read/write routines to perform | ||
| 671 | direct_IO - that is IO requests which bypass the page cache | ||
| 672 | and transfer data directly between the storage and the | ||
| 673 | application's address space. | ||
| 507 | 674 | ||
| 508 | get_xip_page: called by the VM to translate a block number to a page. | 675 | get_xip_page: called by the VM to translate a block number to a page. |
| 509 | The page is valid until the corresponding filesystem is unmounted. | 676 | The page is valid until the corresponding filesystem is unmounted. |
| 510 | Filesystems that want to use execute-in-place (XIP) need to implement | 677 | Filesystems that want to use execute-in-place (XIP) need to implement |
| 511 | it. An example implementation can be found in fs/ext2/xip.c. | 678 | it. An example implementation can be found in fs/ext2/xip.c. |
| 512 | 679 | ||
| 680 | migratepage: This is used to compact the physical memory usage. | ||
| 681 | If the VM wants to relocate a page (maybe off a memory card | ||
| 682 | that is signalling imminent failure) it will pass a new page | ||
| 683 | and an old page to this function. migratepage should | ||
| 684 | transfer any private data across and update any references | ||
| 685 | that it has to the page. | ||
| 513 | 686 | ||
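The PG_Dirty / PG_Writeback life cycle described in the Address Space Object
text above can be illustrated with a small userspace model. This is only a
sketch under stated assumptions: struct model_page and the model_* helpers are
hypothetical names invented for illustration. They are not kernel APIs and do
not touch a real struct page; the two booleans merely stand in for the two
flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two page flags; not the kernel's struct page. */
struct model_page {
	bool dirty;      /* models PG_Dirty */
	bool writeback;  /* models PG_Writeback */
};

/* Data has been written into the page: the dirty flag is set. */
static void model_write_data(struct model_page *p)
{
	p->dirty = true;
}

/*
 * Models the point at which the page is asked to be written out:
 * the dirty flag is cleared and the writeback flag is set.
 * Returns false if the page was not dirty, so there is nothing to start.
 */
static bool model_start_writeout(struct model_page *p)
{
	if (!p->dirty)
		return false;
	p->dirty = false;
	p->writeback = true;
	return true;
}

/* Models I/O completion: the data is known to be safe, writeback is cleared. */
static void model_end_writeout(struct model_page *p)
{
	p->writeback = false;
}
```

In this model a page that is written to again while writeback is in flight
ends up with both flags set, and it is still dirty after writeback finishes,
so a later writeout pass would pick it up again.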
| 514 | The File Object | 687 | The File Object |
| 515 | =============== | 688 | =============== |
