author | Greg KH <greg@press.(none)> | 2005-10-28 13:13:16 -0400 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@suse.de> | 2005-10-28 13:13:16 -0400 |
commit | 6fbfddcb52d8d9fa2cd209f5ac2a1c87497d55b5 (patch) | |
tree | c0414e89678fcef7ce3493e048d855bde781ae8d /Documentation | |
parent | 1a222bca26ca691e83be1b08f5e96ae96d0d8cae (diff) | |
parent | 27d1097d39509494706eaa2620ef3b1e780a3224 (diff) |
Merge ../bleed-2.6
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/DocBook/libata.tmpl | 1072 | ||||
-rw-r--r-- | Documentation/block/biodoc.txt | 113 | ||||
-rw-r--r-- | Documentation/networking/bonding.txt | 5 |
3 files changed, 1127 insertions, 63 deletions
diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl index 375ae760dc1e..d260d92089ad 100644 --- a/Documentation/DocBook/libata.tmpl +++ b/Documentation/DocBook/libata.tmpl | |||
@@ -415,6 +415,362 @@ and other resources, etc. | |||
415 | </sect1> | 415 | </sect1> |
416 | </chapter> | 416 | </chapter> |
417 | 417 | ||
418 | <chapter id="libataEH"> | ||
419 | <title>Error handling</title> | ||
420 | |||
421 | <para> | ||
422 | This chapter describes how errors are handled under libata. | ||
423 | Readers are advised to read SCSI EH | ||
424 | (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first. | ||
425 | </para> | ||
426 | |||
427 | <sect1><title>Origins of commands</title> | ||
428 | <para> | ||
429 | In libata, a command is represented with struct ata_queued_cmd | ||
430 | or qc. qc's are preallocated during port initialization and | ||
431 | repetitively used for command executions. Currently only one | ||
432 | qc is allocated per port, but the yet-to-be-merged NCQ branch | ||
433 | allocates one for each tag and maps each qc to NCQ tag 1-to-1. | ||
434 | </para> | ||
435 | <para> | ||
436 | libata commands can originate from two sources - libata itself | ||
437 | and SCSI midlayer. libata internal commands are used for | ||
438 | initialization and error handling. All normal blk requests | ||
439 | and commands for SCSI emulation are passed as SCSI commands | ||
440 | through queuecommand callback of SCSI host template. | ||
441 | </para> | ||
442 | </sect1> | ||
443 | |||
444 | <sect1><title>How commands are issued</title> | ||
445 | |||
446 | <variablelist> | ||
447 | |||
448 | <varlistentry><term>Internal commands</term> | ||
449 | <listitem> | ||
450 | <para> | ||
451 | First, qc is allocated and initialized using | ||
452 | ata_qc_new_init(). Although ata_qc_new_init() doesn't | ||
453 | implement any wait or retry mechanism when qc is not | ||
454 | available, internal commands are currently issued only during | ||
455 | initialization and error recovery, so no other command is | ||
456 | active and allocation is guaranteed to succeed. | ||
457 | </para> | ||
458 | <para> | ||
459 | Once allocated, the qc's taskfile is initialized for the command to | ||
460 | be executed. qc currently has two mechanisms to notify | ||
461 | completion. One is via qc->complete_fn() callback and the | ||
462 | other is completion qc->waiting. qc->complete_fn() callback | ||
463 | is the asynchronous path used by normal SCSI translated | ||
464 | commands and qc->waiting is the synchronous (issuer sleeps in | ||
465 | process context) path used by internal commands. | ||
466 | </para> | ||
467 | <para> | ||
468 | Once initialization is complete, host_set lock is acquired | ||
469 | and the qc is issued. | ||
470 | </para> | ||
471 | </listitem> | ||
472 | </varlistentry> | ||
473 | |||
474 | <varlistentry><term>SCSI commands</term> | ||
475 | <listitem> | ||
476 | <para> | ||
477 | All libata drivers use ata_scsi_queuecmd() as | ||
478 | hostt->queuecommand callback. scmds can either be simulated | ||
479 | or translated. No qc is involved in processing a simulated | ||
480 | scmd. The result is computed right away and the scmd is | ||
481 | completed. | ||
482 | </para> | ||
483 | <para> | ||
484 | For a translated scmd, ata_qc_new_init() is invoked to | ||
485 | allocate a qc and the scmd is translated into the qc. SCSI | ||
486 | midlayer's completion notification function pointer is stored | ||
487 | into qc->scsidone. | ||
488 | </para> | ||
489 | <para> | ||
490 | qc->complete_fn() callback is used for completion | ||
491 | notification. ATA commands use ata_scsi_qc_complete() while | ||
492 | ATAPI commands use atapi_qc_complete(). Both functions end up | ||
493 | calling qc->scsidone to notify upper layer when the qc is | ||
494 | finished. After translation is completed, the qc is issued | ||
495 | with ata_qc_issue(). | ||
496 | </para> | ||
497 | <para> | ||
498 | Note that SCSI midlayer invokes hostt->queuecommand while | ||
499 | holding host_set lock, so all above occur while holding | ||
500 | host_set lock. | ||
501 | </para> | ||
502 | </listitem> | ||
503 | </varlistentry> | ||
504 | |||
505 | </variablelist> | ||
506 | </sect1> | ||
507 | |||
508 | <sect1><title>How commands are processed</title> | ||
509 | <para> | ||
510 | Depending on which protocol and which controller are used, | ||
511 | commands are processed differently. For the purpose of | ||
512 | discussion, a controller which uses taskfile interface and all | ||
513 | standard callbacks is assumed. | ||
514 | </para> | ||
515 | <para> | ||
516 | Currently 6 ATA command protocols are used. They can be | ||
517 | sorted into the following four categories according to how | ||
518 | they are processed. | ||
519 | </para> | ||
520 | |||
521 | <variablelist> | ||
522 | <varlistentry><term>ATA NO DATA or DMA</term> | ||
523 | <listitem> | ||
524 | <para> | ||
525 | ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. | ||
526 | These types of commands don't require any software | ||
527 | intervention once issued. Device will raise interrupt on | ||
528 | completion. | ||
529 | </para> | ||
530 | </listitem> | ||
531 | </varlistentry> | ||
532 | |||
533 | <varlistentry><term>ATA PIO</term> | ||
534 | <listitem> | ||
535 | <para> | ||
536 | ATA_PROT_PIO is in this category. libata currently | ||
537 | implements PIO with polling. ATA_NIEN bit is set to turn | ||
538 | off interrupt and pio_task on ata_wq performs polling and | ||
539 | IO. | ||
540 | </para> | ||
541 | </listitem> | ||
542 | </varlistentry> | ||
543 | |||
544 | <varlistentry><term>ATAPI NODATA or DMA</term> | ||
545 | <listitem> | ||
546 | <para> | ||
547 | ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this | ||
548 | category. packet_task is used to poll BSY bit after | ||
549 | issuing PACKET command. Once BSY is turned off by the | ||
550 | device, packet_task transfers CDB and hands off processing | ||
551 | to interrupt handler. | ||
552 | </para> | ||
553 | </listitem> | ||
554 | </varlistentry> | ||
555 | |||
556 | <varlistentry><term>ATAPI PIO</term> | ||
557 | <listitem> | ||
558 | <para> | ||
559 | ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set | ||
560 | and, as in ATAPI NODATA or DMA, packet_task submits cdb. | ||
561 | However, after submitting cdb, further processing (data | ||
562 | transfer) is handed off to pio_task. | ||
563 | </para> | ||
564 | </listitem> | ||
565 | </varlistentry> | ||
566 | </variablelist> | ||
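The four categories above can be sketched as a small lookup. This is only an illustration: the enumerators below mirror the six protocol names in the text but are not the kernel's definitions, and the helper names (prot_categorize(), prot_sets_nien(), prot_uses_packet_task()) are hypothetical.

```c
/*
 * Sketch only: these enumerators mirror the six protocol names used in
 * the text. They are not the kernel's definitions and their values are
 * arbitrary.
 */
enum prot {
	PROT_NODATA,
	PROT_DMA,
	PROT_PIO,
	PROT_ATAPI_NODATA,
	PROT_ATAPI_DMA,
	PROT_ATAPI,		/* ATAPI PIO */
};

enum prot_category {
	CAT_ATA_NODATA_OR_DMA,
	CAT_ATA_PIO,
	CAT_ATAPI_NODATA_OR_DMA,
	CAT_ATAPI_PIO,
	CAT_UNKNOWN,
};

/* Map a protocol to the processing category described in the text. */
enum prot_category prot_categorize(enum prot p)
{
	switch (p) {
	case PROT_NODATA:
	case PROT_DMA:
		return CAT_ATA_NODATA_OR_DMA;
	case PROT_PIO:
		return CAT_ATA_PIO;
	case PROT_ATAPI_NODATA:
	case PROT_ATAPI_DMA:
		return CAT_ATAPI_NODATA_OR_DMA;
	case PROT_ATAPI:
		return CAT_ATAPI_PIO;
	}
	return CAT_UNKNOWN;
}

/* The text says ATA_NIEN is set for the two PIO categories. */
int prot_sets_nien(enum prot p)
{
	enum prot_category c = prot_categorize(p);

	return c == CAT_ATA_PIO || c == CAT_ATAPI_PIO;
}

/* packet_task submits the CDB for both ATAPI categories. */
int prot_uses_packet_task(enum prot p)
{
	enum prot_category c = prot_categorize(p);

	return c == CAT_ATAPI_NODATA_OR_DMA || c == CAT_ATAPI_PIO;
}
```

Keeping the category as an explicit value makes it easy to see which protocols share a code path; a real driver would more likely switch on the protocol directly.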
567 | </sect1> | ||
568 | |||
569 | <sect1><title>How commands are completed</title> | ||
570 | <para> | ||
571 | Once issued, all qc's are either completed with | ||
572 | ata_qc_complete() or time out. For commands which are handled | ||
573 | by interrupts, ata_host_intr() invokes ata_qc_complete(), and, | ||
574 | for PIO tasks, pio_task invokes ata_qc_complete(). In error | ||
575 | cases, packet_task may also complete commands. | ||
576 | </para> | ||
577 | <para> | ||
578 | ata_qc_complete() does the following. | ||
579 | </para> | ||
580 | |||
581 | <orderedlist> | ||
582 | |||
583 | <listitem> | ||
584 | <para> | ||
585 | DMA memory is unmapped. | ||
586 | </para> | ||
587 | </listitem> | ||
588 | |||
589 | <listitem> | ||
590 | <para> | ||
591 | ATA_QCFLAG_ACTIVE is cleared from qc->flags. | ||
592 | </para> | ||
593 | </listitem> | ||
594 | |||
595 | <listitem> | ||
596 | <para> | ||
597 | qc->complete_fn() callback is invoked. If the return value of | ||
598 | the callback is not zero, completion is short-circuited and | ||
599 | ata_qc_complete() returns. | ||
600 | </para> | ||
601 | </listitem> | ||
602 | |||
603 | <listitem> | ||
604 | <para> | ||
605 | __ata_qc_complete() is called, which does | ||
606 | <orderedlist> | ||
607 | |||
608 | <listitem> | ||
609 | <para> | ||
610 | qc->flags is cleared to zero. | ||
611 | </para> | ||
612 | </listitem> | ||
613 | |||
614 | <listitem> | ||
615 | <para> | ||
616 | ap->active_tag and qc->tag are poisoned. | ||
617 | </para> | ||
618 | </listitem> | ||
619 | |||
620 | <listitem> | ||
621 | <para> | ||
622 | qc->waiting is cleared & completed (in that order). | ||
623 | </para> | ||
624 | </listitem> | ||
625 | |||
626 | <listitem> | ||
627 | <para> | ||
628 | qc is deallocated by clearing appropriate bit in ap->qactive. | ||
629 | </para> | ||
630 | </listitem> | ||
631 | |||
632 | </orderedlist> | ||
633 | </para> | ||
634 | </listitem> | ||
635 | |||
636 | </orderedlist> | ||
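The completion steps listed above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the libata code: struct model_qc, struct model_port, TAG_POISON and the two sample callbacks are hypothetical stand-ins, and the completion object behind qc->waiting is reduced to an int flag.

```c
#include <stddef.h>

#define QCFLAG_ACTIVE	0x1u	/* stand-in for ATA_QCFLAG_ACTIVE */
#define TAG_POISON	0xfau	/* arbitrary poison value for this sketch */

struct model_port {
	unsigned long qactive;		/* one bit per allocated qc */
	unsigned int active_tag;
};

struct model_qc {
	struct model_port *ap;
	unsigned int flags;
	unsigned int tag;
	int dma_mapped;
	int *waiting;			/* models the completion; may be NULL */
	int (*complete_fn)(struct model_qc *qc);
};

/* Models __ata_qc_complete(): the four nested steps, in order. */
void model_qc_free(struct model_qc *qc)
{
	unsigned int tag = qc->tag;
	int *waiting = qc->waiting;

	qc->flags = 0;
	qc->ap->active_tag = TAG_POISON;
	qc->tag = TAG_POISON;
	qc->waiting = NULL;		/* cleared first ... */
	if (waiting)
		*waiting = 1;		/* ... then completed */
	qc->ap->qactive &= ~(1UL << tag);
}

/* Models ata_qc_complete(); returns 1 if completion was short-circuited. */
int model_qc_complete(struct model_qc *qc)
{
	qc->dma_mapped = 0;			/* step 1: unmap DMA memory */
	qc->flags &= ~QCFLAG_ACTIVE;		/* step 2 */
	if (qc->complete_fn && qc->complete_fn(qc) != 0)
		return 1;			/* step 3: qc stays allocated */
	model_qc_free(qc);			/* step 4 */
	return 0;
}

/* Sample callbacks: an ordinary completion and a short-circuiting one. */
int complete_normal(struct model_qc *qc)
{
	(void)qc;
	return 0;
}

int complete_short_circuit(struct model_qc *qc)
{
	(void)qc;
	return 1;
}
```

With complete_short_circuit() installed, the qc's bit in qactive and its tag survive the call, which is the half-completed state that the failed ATAPI path relies on.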
637 | |||
638 | <para> | ||
639 | So, it basically notifies upper layer and deallocates qc. One | ||
640 | exception is short-circuit path in #3 which is used by | ||
641 | atapi_qc_complete(). | ||
642 | </para> | ||
643 | <para> | ||
644 | For all non-ATAPI commands, whether it fails or not, almost | ||
645 | the same code path is taken and very little error handling | ||
646 | takes place. A qc is completed with success status if it | ||
647 | succeeded, with failed status otherwise. | ||
648 | </para> | ||
649 | <para> | ||
650 | However, failed ATAPI commands require more handling as | ||
651 | REQUEST SENSE is needed to acquire sense data. If an ATAPI | ||
652 | command fails, ata_qc_complete() is invoked with error status, | ||
653 | which in turn invokes atapi_qc_complete() via | ||
654 | qc->complete_fn() callback. | ||
655 | </para> | ||
656 | <para> | ||
657 | This makes atapi_qc_complete() set scmd->result to | ||
658 | SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As | ||
659 | the sense data is empty but scmd->result is CHECK CONDITION, | ||
660 | SCSI midlayer will invoke EH for the scmd, and returning 1 | ||
661 | makes ata_qc_complete() return without deallocating the qc. | ||
662 | This leads us to ata_scsi_error() with partially completed qc. | ||
663 | </para> | ||
664 | |||
665 | </sect1> | ||
666 | |||
667 | <sect1><title>ata_scsi_error()</title> | ||
668 | <para> | ||
669 | ata_scsi_error() is the current hostt->eh_strategy_handler() | ||
670 | for libata. As discussed above, this will be entered in two | ||
671 | cases - timeout and ATAPI error completion. This function | ||
672 | calls low level libata driver's eng_timeout() callback, the | ||
673 | standard callback for which is ata_eng_timeout(). It checks | ||
674 | if a qc is active and calls ata_qc_timeout() on the qc if so. | ||
675 | Actual error handling occurs in ata_qc_timeout(). | ||
676 | </para> | ||
677 | <para> | ||
678 | If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and | ||
679 | completes the qc. Note that as we're currently in EH, we | ||
680 | cannot call scsi_done. As described in SCSI EH doc, a | ||
681 | recovered scmd should be either retried with | ||
682 | scsi_queue_insert() or finished with scsi_finish_command(). | ||
683 | Here, we override qc->scsidone with scsi_finish_command() and | ||
684 | call ata_qc_complete(). | ||
685 | </para> | ||
686 | <para> | ||
687 | If EH is invoked due to a failed ATAPI qc, the qc here is | ||
688 | completed but not deallocated. The purpose of this | ||
689 | half-completion is to use the qc as place holder to make EH | ||
690 | code reach this place. This is a bit hackish, but it works. | ||
691 | </para> | ||
692 | <para> | ||
693 | Once control reaches here, the qc is deallocated by invoking | ||
694 | __ata_qc_complete() explicitly. Then, internal qc for REQUEST | ||
695 | SENSE is issued. Once sense data is acquired, scmd is | ||
696 | finished by directly invoking scsi_finish_command() on the | ||
697 | scmd. Note that as we already have completed and deallocated | ||
698 | the qc which was associated with the scmd, we don't need | ||
699 | to/cannot call ata_qc_complete() again. | ||
700 | </para> | ||
701 | |||
702 | </sect1> | ||
703 | |||
704 | <sect1><title>Problems with the current EH</title> | ||
705 | |||
706 | <itemizedlist> | ||
707 | |||
708 | <listitem> | ||
709 | <para> | ||
710 | Error representation is too crude. Currently any and all | ||
711 | error conditions are represented with ATA STATUS and ERROR | ||
712 | registers. Errors which aren't ATA device errors are treated | ||
713 | as ATA device errors by setting ATA_ERR bit. Better error | ||
714 | descriptor which can properly represent ATA and other | ||
715 | errors/exceptions is needed. | ||
716 | </para> | ||
717 | </listitem> | ||
718 | |||
719 | <listitem> | ||
720 | <para> | ||
721 | When handling timeouts, no action is taken to make device | ||
722 | forget about the timed out command and ready for new commands. | ||
723 | </para> | ||
724 | </listitem> | ||
725 | |||
726 | <listitem> | ||
727 | <para> | ||
728 | EH handling via ata_scsi_error() is not properly protected | ||
729 | from usual command processing. On EH entrance, the device is | ||
730 | not in quiescent state. Timed out commands may succeed or | ||
731 | fail any time. pio_task and packet_task may still be running. | ||
732 | </para> | ||
733 | </listitem> | ||
734 | |||
735 | <listitem> | ||
736 | <para> | ||
737 | Too weak error recovery. Devices / controllers causing HSM | ||
738 | mismatch errors and other errors quite often require reset to | ||
739 | return to known state. Also, advanced error handling is | ||
740 | necessary to support features like NCQ and hotplug. | ||
741 | </para> | ||
742 | </listitem> | ||
743 | |||
744 | <listitem> | ||
745 | <para> | ||
746 | ATA errors are directly handled in the interrupt handler and | ||
747 | PIO errors in pio_task. This is problematic for advanced | ||
748 | error handling for the following reasons. | ||
749 | </para> | ||
750 | <para> | ||
751 | First, advanced error handling often requires context and | ||
752 | internal qc execution. | ||
753 | </para> | ||
754 | <para> | ||
755 | Second, even a simple failure (say, CRC error) needs | ||
756 | information gathering and could trigger complex error handling | ||
757 | (say, resetting & reconfiguring). Having multiple code | ||
758 | paths to gather information, enter EH and trigger actions | ||
759 | makes life painful. | ||
760 | </para> | ||
761 | <para> | ||
762 | Third, scattered EH code makes implementing low level drivers | ||
763 | difficult. Low level drivers override libata callbacks. If | ||
764 | EH is scattered over several places, each affected callback | ||
765 | should perform its part of error handling. This can be error | ||
766 | prone and painful. | ||
767 | </para> | ||
768 | </listitem> | ||
769 | |||
770 | </itemizedlist> | ||
771 | </sect1> | ||
772 | </chapter> | ||
773 | |||
418 | <chapter id="libataExt"> | 774 | <chapter id="libataExt"> |
419 | <title>libata Library</title> | 775 | <title>libata Library</title> |
420 | !Edrivers/scsi/libata-core.c | 776 | !Edrivers/scsi/libata-core.c |
@@ -431,6 +787,722 @@ and other resources, etc. | |||
431 | !Idrivers/scsi/libata-scsi.c | 787 | !Idrivers/scsi/libata-scsi.c |
432 | </chapter> | 788 | </chapter> |
433 | 789 | ||
790 | <chapter id="ataExceptions"> | ||
791 | <title>ATA errors & exceptions</title> | ||
792 | |||
793 | <para> | ||
794 | This chapter tries to identify what error/exception conditions exist | ||
795 | for ATA/ATAPI devices and describe how they should be handled in | ||
796 | implementation-neutral way. | ||
797 | </para> | ||
798 | |||
799 | <para> | ||
800 | The term 'error' is used to describe conditions where either an | ||
801 | explicit error condition is reported from device or a command has | ||
802 | timed out. | ||
803 | </para> | ||
804 | |||
805 | <para> | ||
806 | The term 'exception' is either used to describe exceptional | ||
807 | conditions which are not errors (say, power or hotplug events), or | ||
808 | to describe both errors and non-error exceptional conditions. Where | ||
809 | explicit distinction between error and exception is necessary, the | ||
810 | term 'non-error exception' is used. | ||
811 | </para> | ||
812 | |||
813 | <sect1 id="excat"> | ||
814 | <title>Exception categories</title> | ||
815 | <para> | ||
816 | Exceptions are described primarily with respect to legacy | ||
817 | taskfile + bus master IDE interface. If a controller provides | ||
818 | other better mechanism for error reporting, mapping those into | ||
819 | categories described below shouldn't be difficult. | ||
820 | </para> | ||
821 | |||
822 | <para> | ||
823 | In the following sections, two recovery actions - reset and | ||
824 | reconfiguring transport - are mentioned. These are described | ||
825 | further in <xref linkend="exrec"/>. | ||
826 | </para> | ||
827 | |||
828 | <sect2 id="excatHSMviolation"> | ||
829 | <title>HSM violation</title> | ||
830 | <para> | ||
831 | This error is indicated when STATUS value doesn't match HSM | ||
832 | requirement while issuing or executing any ATA/ATAPI command. | ||
833 | </para> | ||
834 | |||
835 | <itemizedlist> | ||
836 | <title>Examples</title> | ||
837 | |||
838 | <listitem> | ||
839 | <para> | ||
840 | ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying | ||
841 | to issue a command. | ||
842 | </para> | ||
843 | </listitem> | ||
844 | |||
845 | <listitem> | ||
846 | <para> | ||
847 | !BSY && !DRQ during PIO data transfer. | ||
848 | </para> | ||
849 | </listitem> | ||
850 | |||
851 | <listitem> | ||
852 | <para> | ||
853 | DRQ on command completion. | ||
854 | </para> | ||
855 | </listitem> | ||
856 | |||
857 | <listitem> | ||
858 | <para> | ||
859 | !BSY && ERR after CDB transfer starts but before the | ||
860 | last byte of CDB is transferred. ATA/ATAPI standard states | ||
861 | that "The device shall not terminate the PACKET command | ||
862 | with an error before the last byte of the command packet has | ||
863 | been written" in the error outputs description of PACKET | ||
864 | command and the state diagram doesn't include such | ||
865 | transitions. | ||
866 | </para> | ||
867 | </listitem> | ||
868 | |||
869 | </itemizedlist> | ||
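The first three examples above reduce to simple tests on the STATUS value. The sketch below uses the standard ATA STATUS bit positions (BSY 0x80, DRDY 0x40, DRQ 0x08); the macro and function names are hypothetical and chosen for this illustration only.

```c
#include <stdint.h>

/* ATA STATUS register bits (names local to this sketch) */
#define ST_BSY	0x80
#define ST_DRDY	0x40
#define ST_DRQ	0x08

/* Issuing a command requires !BSY && DRDY && !DRQ. */
int hsm_ok_to_issue(uint8_t status)
{
	return (status & (ST_BSY | ST_DRDY | ST_DRQ)) == ST_DRDY;
}

/* During PIO data transfer, !BSY && !DRQ is a violation. */
int hsm_pio_violation(uint8_t status)
{
	return (status & (ST_BSY | ST_DRQ)) == 0;
}

/* DRQ still set when the command is supposed to have completed. */
int hsm_completion_violation(uint8_t status)
{
	return (status & ST_DRQ) != 0;
}
```

For instance, a STATUS of 0x50 (DRDY plus a bit this sketch ignores) passes hsm_ok_to_issue(), while 0x58 does not because DRQ is set.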
870 | |||
871 | <para> | ||
872 | In these cases, HSM is violated and not much information | ||
873 | regarding the error can be acquired from STATUS or ERROR | ||
874 | register. IOW, this error can be anything - driver bug, | ||
875 | faulty device, controller and/or cable. | ||
876 | </para> | ||
877 | |||
878 | <para> | ||
879 | As HSM is violated, reset is necessary to restore known state. | ||
880 | Reconfiguring transport for lower speed might be helpful too | ||
881 | as transmission errors sometimes cause this kind of error. | ||
882 | </para> | ||
883 | </sect2> | ||
884 | |||
885 | <sect2 id="excatDevErr"> | ||
886 | <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title> | ||
887 | |||
888 | <para> | ||
889 | These are errors detected and reported by ATA/ATAPI devices | ||
890 | indicating device problems. For this type of errors, STATUS | ||
891 | and ERROR register values are valid and describe error | ||
892 | condition. Note that some of ATA bus errors are detected by | ||
893 | ATA/ATAPI devices and reported using the same mechanism as | ||
894 | device errors. Those cases are described later in this | ||
895 | section. | ||
896 | </para> | ||
897 | |||
898 | <para> | ||
899 | For ATA commands, this type of errors are indicated by !BSY | ||
900 | && ERR during command execution and on completion. | ||
901 | </para> | ||
902 | |||
903 | <para>For ATAPI commands,</para> | ||
904 | |||
905 | <itemizedlist> | ||
906 | |||
907 | <listitem> | ||
908 | <para> | ||
909 | !BSY && ERR && ABRT right after issuing PACKET | ||
910 | indicates that PACKET command is not supported and falls in | ||
911 | this category. | ||
912 | </para> | ||
913 | </listitem> | ||
914 | |||
915 | <listitem> | ||
916 | <para> | ||
917 | !BSY && ERR(==CHK) && !ABRT after the last | ||
918 | byte of CDB is transferred indicates CHECK CONDITION and | ||
919 | doesn't fall in this category. | ||
920 | </para> | ||
921 | </listitem> | ||
922 | |||
923 | <listitem> | ||
924 | <para> | ||
925 | !BSY && ERR(==CHK) && ABRT after the last byte | ||
926 | of CDB is transferred *probably* indicates CHECK CONDITION and | ||
927 | doesn't fall in this category. | ||
928 | </para> | ||
929 | </listitem> | ||
930 | |||
931 | </itemizedlist> | ||
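The ATAPI cases above, together with the PACKET example from the HSM violation section, depend on how far the CDB transfer has progressed. A rough sketch follows; the phase and verdict enums and packet_classify() are hypothetical names, and the combination the text does not cover (ERR without ABRT right after issuing PACKET) is reported as unspecified rather than guessed.

```c
#include <stdint.h>

#define ST_BSY	0x80	/* STATUS: busy */
#define ST_ERR	0x01	/* STATUS: ERR, called CHK for PACKET */
#define ER_ABRT	0x04	/* ERROR: command aborted */

enum packet_phase {
	PACKET_ISSUED,		/* PACKET written, no CDB byte sent yet */
	PACKET_CDB_PARTIAL,	/* CDB transfer started but not finished */
	PACKET_CDB_DONE,	/* last byte of CDB transferred */
};

enum packet_verdict {
	PACKET_NO_ERROR_REPORTED,
	PACKET_DEVICE_ERROR,	/* PACKET command not supported */
	PACKET_HSM_VIOLATION,
	PACKET_CHECK_CONDITION,	/* definite or probable */
	PACKET_UNSPECIFIED,	/* not covered by the text */
};

enum packet_verdict packet_classify(enum packet_phase phase,
				    uint8_t status, uint8_t error)
{
	/* Every case in the text requires !BSY && ERR. */
	if ((status & ST_BSY) || !(status & ST_ERR))
		return PACKET_NO_ERROR_REPORTED;

	switch (phase) {
	case PACKET_ISSUED:
		return (error & ER_ABRT) ? PACKET_DEVICE_ERROR
					 : PACKET_UNSPECIFIED;
	case PACKET_CDB_PARTIAL:
		return PACKET_HSM_VIOLATION;
	case PACKET_CDB_DONE:
		/* With or without ABRT, treated as CHECK CONDITION. */
		return PACKET_CHECK_CONDITION;
	}
	return PACKET_UNSPECIFIED;
}
```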
932 | |||
933 | <para> | ||
934 | Of the errors detected as above, the following are not ATA/ATAPI | ||
935 | device errors but ATA bus errors and should be handled | ||
936 | according to <xref linkend="excatATAbusErr"/>. | ||
937 | </para> | ||
938 | |||
939 | <variablelist> | ||
940 | |||
941 | <varlistentry> | ||
942 | <term>CRC error during data transfer</term> | ||
943 | <listitem> | ||
944 | <para> | ||
945 | This is indicated by ICRC bit in the ERROR register and | ||
946 | means that corruption occurred during data transfer. Up to | ||
947 | ATA/ATAPI-7, the standard specifies that this bit is only | ||
948 | applicable to UDMA transfers but ATA/ATAPI-8 draft revision | ||
949 | 1f says that the bit may be applicable to multiword DMA and | ||
950 | PIO. | ||
951 | </para> | ||
952 | </listitem> | ||
953 | </varlistentry> | ||
954 | |||
955 | <varlistentry> | ||
956 | <term>ABRT error during data transfer or on completion</term> | ||
957 | <listitem> | ||
958 | <para> | ||
959 | Up to ATA/ATAPI-7, the standard specifies that ABRT could be | ||
960 | set on ICRC errors and on cases where a device is not able | ||
961 | to complete a command. Combined with the fact that MWDMA | ||
962 | and PIO transfer errors aren't allowed to use ICRC bit up to | ||
963 | ATA/ATAPI-7, it seems to imply that ABRT bit alone could | ||
964 | indicate transfer errors. | ||
965 | </para> | ||
966 | <para> | ||
967 | However, ATA/ATAPI-8 draft revision 1f removes the part | ||
968 | that ICRC errors can turn on ABRT. So, this is kind of | ||
969 | gray area. Some heuristics are needed here. | ||
970 | </para> | ||
971 | </listitem> | ||
972 | </varlistentry> | ||
973 | |||
974 | </variablelist> | ||
975 | |||
976 | <para> | ||
977 | ATA/ATAPI device errors can be further categorized as follows. | ||
978 | </para> | ||
979 | |||
980 | <variablelist> | ||
981 | |||
982 | <varlistentry> | ||
983 | <term>Media errors</term> | ||
984 | <listitem> | ||
985 | <para> | ||
986 | This is indicated by UNC bit in the ERROR register. ATA | ||
987 | devices report UNC error only after a certain number of | ||
988 | retries cannot recover the data, so there's nothing much | ||
989 | else to do other than notifying upper layer. | ||
990 | </para> | ||
991 | <para> | ||
992 | READ and WRITE commands report CHS or LBA of the first | ||
993 | failed sector but ATA/ATAPI standard specifies that the | ||
994 | amount of transferred data on error completion is | ||
995 | indeterminate, so we cannot assume that sectors preceding | ||
996 | the failed sector have been transferred and thus cannot | ||
997 | complete those sectors successfully as SCSI does. | ||
998 | </para> | ||
999 | </listitem> | ||
1000 | </varlistentry> | ||
1001 | |||
1002 | <varlistentry> | ||
1003 | <term>Media changed / media change requested error</term> | ||
1004 | <listitem> | ||
1005 | <para> | ||
1006 | <<TODO: fill here>> | ||
1007 | </para> | ||
1008 | </listitem> | ||
1009 | </varlistentry> | ||
1010 | |||
1011 | <varlistentry><term>Address error</term> | ||
1012 | <listitem> | ||
1013 | <para> | ||
1014 | This is indicated by IDNF bit in the ERROR register. | ||
1015 | Report to upper layer. | ||
1016 | </para> | ||
1017 | </listitem> | ||
1018 | </varlistentry> | ||
1019 | |||
1020 | <varlistentry><term>Other errors</term> | ||
1021 | <listitem> | ||
1022 | <para> | ||
1023 | This can be invalid command or parameter indicated by ABRT | ||
1024 | ERROR bit or some other error condition. Note that ABRT | ||
1025 | bit can indicate a lot of things including ICRC and Address | ||
1026 | errors. Heuristics needed. | ||
1027 | </para> | ||
1028 | </listitem> | ||
1029 | </varlistentry> | ||
1030 | |||
1031 | </variablelist> | ||
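A first-pass decoder for the ERROR register might look like the sketch below. The bit positions (ICRC 0x80, UNC 0x40, IDNF 0x10, ABRT 0x04) are the standard ATA ones; the enum, the function name and, in particular, the order in which bits are tested are assumptions of this sketch. The text leaves ABRT ambiguous, so it is returned as its own kind for later heuristics.

```c
#include <stdint.h>

/* ATA ERROR register bits (names local to this sketch) */
#define ER_ICRC	0x80
#define ER_UNC	0x40
#define ER_IDNF	0x10
#define ER_ABRT	0x04

enum dev_err_kind {
	ERR_KIND_BUS_CRC,	/* handle as an ATA bus error */
	ERR_KIND_MEDIA,		/* notify upper layer */
	ERR_KIND_ADDRESS,	/* report to upper layer */
	ERR_KIND_ABRT,		/* ambiguous, heuristics needed */
	ERR_KIND_OTHER,
};

/*
 * Classify an ERROR value read after !BSY && ERR. The precedence
 * (ICRC, then UNC, then IDNF, then ABRT) is an assumption made for
 * this sketch; the text does not define one.
 */
enum dev_err_kind classify_error_reg(uint8_t error)
{
	if (error & ER_ICRC)
		return ERR_KIND_BUS_CRC;
	if (error & ER_UNC)
		return ERR_KIND_MEDIA;
	if (error & ER_IDNF)
		return ERR_KIND_ADDRESS;
	if (error & ER_ABRT)
		return ERR_KIND_ABRT;
	return ERR_KIND_OTHER;
}
```

Testing the more specific bits before ABRT means an error with both ICRC and ABRT set is treated as a bus error, which matches the pre-ATA/ATAPI-8 behaviour described above.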
1032 | |||
1033 | <para> | ||
1034 | Depending on commands, not all STATUS/ERROR bits are | ||
1035 | applicable. These non-applicable bits are marked with | ||
1036 | "na" in the output descriptions but up to ATA/ATAPI-7 | ||
1037 | no definition of "na" can be found. However, | ||
1038 | ATA/ATAPI-8 draft revision 1f describes "N/A" as | ||
1039 | follows. | ||
1040 | </para> | ||
1041 | |||
1042 | <blockquote> | ||
1043 | <variablelist> | ||
1044 | <varlistentry><term>3.2.3.3a N/A</term> | ||
1045 | <listitem> | ||
1046 | <para> | ||
1047 | A keyword the indicates a field has no defined value in | ||
1048 | this standard and should not be checked by the host or | ||
1049 | device. N/A fields should be cleared to zero. | ||
1050 | </para> | ||
1051 | </listitem> | ||
1052 | </varlistentry> | ||
1053 | </variablelist> | ||
1054 | </blockquote> | ||
1055 | |||
1056 | <para> | ||
1057 | So, it seems reasonable to assume that "na" bits are | ||
1058 | cleared to zero by devices and thus need no explicit masking. | ||
1059 | </para> | ||
1060 | |||
1061 | </sect2> | ||
1062 | |||
1063 | <sect2 id="excatATAPIcc"> | ||
1064 | <title>ATAPI device CHECK CONDITION</title> | ||
1065 | |||
1066 | <para> | ||
1067 | ATAPI device CHECK CONDITION error is indicated by set CHK bit | ||
1068 | (ERR bit) in the STATUS register after the last byte of CDB is | ||
1069 | transferred for a PACKET command. For this kind of errors, | ||
1070 | sense data should be acquired to gather information regarding | ||
1071 | the errors. REQUEST SENSE packet command should be used to | ||
1072 | acquire sense data. | ||
1073 | </para> | ||
1074 | |||
1075 | <para> | ||
1076 | Once sense data is acquired, this type of errors can be | ||
1077 | handled similarly to other SCSI errors. Note that sense data | ||
1078 | may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR | ||
1079 | && ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such | ||
1080 | cases, the error should be considered as an ATA bus error and | ||
1081 | handled according to <xref linkend="excatATAbusErr"/>. | ||
1082 | </para> | ||
1083 | |||
1084 | </sect2> | ||
1085 | |||
1086 | <sect2 id="excatNCQerr"> | ||
1087 | <title>ATA device error (NCQ)</title> | ||
1088 | |||
1089 | <para> | ||
1090 | NCQ command error is indicated by cleared BSY and set ERR bit | ||
1091 | during NCQ command phase (one or more NCQ commands | ||
1092 | outstanding). Although STATUS and ERROR registers will | ||
1093 | contain valid values describing the error, READ LOG EXT is | ||
1094 | required to clear the error condition, determine which command | ||
1095 | has failed and acquire more information. | ||
1096 | </para> | ||
1097 | |||
1098 | <para> | ||
1099 | READ LOG EXT Log Page 10h reports which tag has failed and | ||
1100 | taskfile register values describing the error. With this | ||
1101 | information the failed command can be handled as a normal ATA | ||
1102 | command error as in <xref linkend="excatDevErr"/> and all | ||
1103 | other in-flight commands must be retried. Note that this | ||
1104 | retry should not be counted - it's likely that commands | ||
1105 | retried this way would have completed normally if it were not | ||
1106 | for the failed command. | ||
1107 | </para> | ||
1108 | |||
1109 | <para> | ||
1110 | Note that ATA bus errors can be reported as ATA device NCQ | ||
1111 | errors. This should be handled as described in <xref | ||
1112 | linkend="excatATAbusErr"/>. | ||
1113 | </para> | ||
1114 | |||
1115 | <para> | ||
1116 | If READ LOG EXT Log Page 10h fails or reports NQ, we're | ||
1117 | thoroughly screwed. This condition should be treated | ||
1118 | according to <xref linkend="excatHSMviolation"/>. | ||
1119 | </para> | ||
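The NCQ recovery split described in this section can be sketched with two bitmasks. Everything named here is hypothetical: struct ncq_plan, ncq_plan_recovery() and its parameters stand in for whatever a real EH would derive from READ LOG EXT Log Page 10h. Treating a reported tag that is not outstanding as a failure is an additional assumption of this sketch.

```c
#include <stdint.h>

struct ncq_plan {
	uint32_t failed;	/* handle as a normal ATA command error */
	uint32_t retry;		/* reissue without counting the retry */
};

/*
 * outstanding: bitmask of in-flight NCQ tags (bit n set means tag n)
 * log_ok:      nonzero if READ LOG EXT itself succeeded
 * nq:          nonzero if the log page reported NQ
 * tag:         failed tag reported by the log page
 *
 * Returns 0 and fills *plan on success. Returns -1 when the situation
 * should be treated as an HSM violation instead.
 */
int ncq_plan_recovery(uint32_t outstanding, int log_ok, int nq,
		      unsigned int tag, struct ncq_plan *plan)
{
	uint32_t bit;

	if (!log_ok || nq || tag >= 32)
		return -1;

	bit = (uint32_t)1 << tag;
	if (!(outstanding & bit))
		return -1;	/* assumption: bogus tag, give up */

	plan->failed = bit;
	plan->retry = outstanding & ~bit;
	return 0;
}
```

With tags 0, 1 and 3 outstanding and tag 1 reported as failed, the plan is to fail tag 1 and retry tags 0 and 3.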
1120 | |||
1121 | </sect2> | ||
1122 | |||
1123 | <sect2 id="excatATAbusErr"> | ||
1124 | <title>ATA bus error</title> | ||
1125 | |||
1126 | <para> | ||
1127 | ATA bus error means that data corruption occurred during | ||
1128 | transmission over ATA bus (SATA or PATA). This type of errors | ||
1129 | can be indicated by | ||
1130 | </para> | ||
1131 | |||
1132 | <itemizedlist> | ||
1133 | |||
1134 | <listitem> | ||
1135 | <para> | ||
1136 | ICRC or ABRT error as described in <xref linkend="excatDevErr"/>. | ||
1137 | </para> | ||
1138 | </listitem> | ||
1139 | |||
1140 | <listitem> | ||
1141 | <para> | ||
1142 | Controller-specific error completion with error information | ||
1143 | indicating transmission error. | ||
1144 | </para> | ||
1145 | </listitem> | ||
1146 | |||
1147 | <listitem> | ||
1148 | <para> | ||
1149 | On some controllers, command timeout. In this case, there may | ||
1150 | be a mechanism to determine that the timeout is due to | ||
1151 | transmission error. | ||
1152 | </para> | ||
1153 | </listitem> | ||
1154 | |||
1155 | <listitem> | ||
1156 | <para> | ||
1157 | Unknown/random errors, timeouts and all sorts of weirdities. | ||
1158 | </para> | ||
1159 | </listitem> | ||
1160 | |||
1161 | </itemizedlist> | ||
1162 | |||
1163 | <para> | ||
1164 | As described above, transmission errors can cause wide variety | ||
1165 | of symptoms ranging from device ICRC error to random device | ||
1166 | lockup, and, for many cases, there is no way to tell if an | ||
1167 | error condition is due to transmission error or not; | ||
1168 | therefore, it's necessary to employ some kind of heuristic | ||
1169 | when dealing with errors and timeouts. For example, | ||
1170 | encountering repetitive ABRT errors for known supported | ||
1171 | command is likely to indicate ATA bus error. | ||
1172 | </para> | ||
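One possible shape for the ABRT heuristic mentioned above is a per-device counter of consecutive aborts on commands known to be supported. This is only a sketch: the threshold of 3, struct abrt_tracker and abrt_tracker_update() are all assumptions made for illustration, not something the text or libata prescribes.

```c
#define ABRT_SUSPECT_THRESHOLD	3	/* arbitrary value for this sketch */

struct abrt_tracker {
	unsigned int consecutive;	/* ABRTs in a row on supported commands */
};

/*
 * Record the outcome of one command. Returns 1 once enough consecutive
 * ABRT errors on known-supported commands have been seen to suspect an
 * ATA bus error, 0 otherwise.
 */
int abrt_tracker_update(struct abrt_tracker *t, int cmd_known_supported,
			int aborted)
{
	if (!aborted) {
		t->consecutive = 0;	/* any success breaks the streak */
		return 0;
	}
	if (!cmd_known_supported)
		return 0;	/* ABRT may just mean "unsupported command" */

	t->consecutive++;
	return t->consecutive >= ABRT_SUSPECT_THRESHOLD;
}
```

A caller that gets 1 back could then consider lowering the transfer speed; how aggressively to do so is a policy decision outside this sketch.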
1173 | |||
1174 | <para> | ||
1175 | Once it's determined that ATA bus errors have possibly | ||
1176 | occurred, lowering ATA bus transmission speed is one of | ||
1177 | actions which may alleviate the problem. See <xref | ||
1178 | linkend="exrecReconf"/> for more information. | ||
1179 | </para> | ||
1180 | |||
1181 | </sect2> | ||
1182 | |||
1183 | <sect2 id="excatPCIbusErr"> | ||
1184 | <title>PCI bus error</title> | ||
1185 | |||
1186 | <para> | ||
1187 | Data corruption or other failures during transmission over PCI | ||
1188 | (or other system bus). For standard BMDMA, this is indicated | ||
1189 | by Error bit in the BMDMA Status register. This type of | ||
1190 | errors must be logged as it indicates something is very wrong | ||
1191 | with the system. Resetting host controller is recommended. | ||
1192 | </para> | ||
1193 | |||
1194 | </sect2> | ||
1195 | |||
1196 | <sect2 id="excatLateCompletion"> | ||
1197 | <title>Late completion</title> | ||
1198 | |||
1199 | <para> | ||
1200 | This occurs when timeout occurs and the timeout handler finds | ||
1201 | out that the timed out command has completed successfully or | ||
1202 | with error. This is usually caused by lost interrupts. This | ||
1203 | type of errors must be logged. Resetting host controller is | ||
1204 | recommended. | ||
1205 | </para> | ||
1206 | |||
1207 | </sect2> | ||
1208 | |||
1209 | <sect2 id="excatUnknown"> | ||
1210 | <title>Unknown error (timeout)</title> | ||
1211 | |||
1212 | <para> | ||
1213 | This is when timeout occurs and the command is still | ||
1214 | processing or the host and device are in unknown state. When | ||
1215 | this occurs, HSM could be in any valid or invalid state. To | ||
1216 | bring the device to known state and make it forget about the | ||
1217 | timed out command, resetting is necessary. The timed out | ||
1218 | command may be retried. | ||
1219 | </para> | ||
1220 | |||
1221 | <para> | ||
1222 | Timeouts can also be caused by transmission errors. Refer to | ||
1223 | <xref linkend="excatATAbusErr"/> for more details. | ||
1224 | </para> | ||
1225 | |||
1226 | </sect2> | ||
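
The difference between the late completion and unknown error (timeout) categories described above can be illustrated with a small C sketch. It maps what a timeout handler observed to the recovery hints given in the text. Every name here (`eh_timeout_obs`, `eh_timeout_plan`, `eh_plan_for_timeout`) is hypothetical and chosen for illustration only; none of them is a libata type or function, and how a real driver detects that the command has completed is controller specific.

```c
#include <stdbool.h>

/* Hypothetical observation made by a timeout handler (not a libata type). */
struct eh_timeout_obs {
    bool cmd_completed;     /* command turned out to be finished, ok or error */
};

enum eh_timeout_class {
    EH_LATE_COMPLETION,     /* finished, but completion was noticed too late */
    EH_UNKNOWN_TIMEOUT,     /* still processing, or state is unknown */
};

/* Hypothetical summary of the recovery hints given in the text. */
struct eh_timeout_plan {
    enum eh_timeout_class cls;
    bool must_log;          /* text says this must be logged */
    bool reset_required;    /* reset is necessary */
    bool reset_recommended; /* reset is recommended but not mandatory */
    bool may_retry;         /* timed out command may be retried */
};

struct eh_timeout_plan eh_plan_for_timeout(const struct eh_timeout_obs *obs)
{
    struct eh_timeout_plan plan = { 0 };

    if (obs->cmd_completed) {
        /* Late completion: usually lost interrupts. Log it and
         * consider resetting the host controller. */
        plan.cls = EH_LATE_COMPLETION;
        plan.must_log = true;
        plan.reset_recommended = true;
    } else {
        /* Unknown error: the HSM could be in any state, so reset is
         * needed to make the device forget the command. The text
         * does not require logging here, so must_log stays false. */
        plan.cls = EH_UNKNOWN_TIMEOUT;
        plan.reset_required = true;
        plan.may_retry = true;
    }
    return plan;
}
```

A caller would fill `eh_timeout_obs` from whatever completion check its controller allows and then act on the returned plan.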
1227 | |||
1228 | <sect2 id="excatHoplugPM"> | ||
1229 | <title>Hotplug and power management exceptions</title> | ||
1230 | |||
1231 | <para> | ||
1232 | <<TODO: fill here>> | ||
1233 | </para> | ||
1234 | |||
1235 | </sect2> | ||
1236 | |||
1237 | </sect1> | ||
1238 | |||
1239 | <sect1 id="exrec"> | ||
1240 | <title>EH recovery actions</title> | ||
1241 | |||
1242 | <para> | ||
1243 | This section discusses several important recovery actions. | ||
1244 | </para> | ||
1245 | |||
1246 | <sect2 id="exrecClr"> | ||
1247 | <title>Clearing error condition</title> | ||
1248 | |||
1249 | <para> | ||
1250 | Many controllers require their error registers to be cleared by | ||
1251 | the error handler. Different controllers may have different | ||
1252 | requirements. | ||
1253 | </para> | ||
1254 | |||
1255 | <para> | ||
1256 | For SATA, it's strongly recommended to clear at least SError | ||
1257 | register during error handling. | ||
1258 | </para> | ||
1259 | </sect2> | ||
1260 | |||
1261 | <sect2 id="exrecRst"> | ||
1262 | <title>Reset</title> | ||
1263 | |||
1264 | <para> | ||
1265 | During EH, resetting is necessary in the following cases. | ||
1266 | </para> | ||
1267 | |||
1268 | <itemizedlist> | ||
1269 | |||
1270 | <listitem> | ||
1271 | <para> | ||
1272 | HSM is in unknown or invalid state | ||
1273 | </para> | ||
1274 | </listitem> | ||
1275 | |||
1276 | <listitem> | ||
1277 | <para> | ||
1278 | HBA is in unknown or invalid state | ||
1279 | </para> | ||
1280 | </listitem> | ||
1281 | |||
1282 | <listitem> | ||
1283 | <para> | ||
1284 | EH needs to make HBA/device forget about in-flight commands | ||
1285 | </para> | ||
1286 | </listitem> | ||
1287 | |||
1288 | <listitem> | ||
1289 | <para> | ||
1290 | HBA/device behaves weirdly | ||
1291 | </para> | ||
1292 | </listitem> | ||
1293 | |||
1294 | </itemizedlist> | ||
1295 | |||
1296 | <para> | ||
1297 | Resetting during EH might be a good idea regardless of error | ||
1298 | condition to improve EH robustness. Whether to reset both or | ||
1299 | either one of HBA and device depends on situation but the | ||
1300 | following scheme is recommended. | ||
1301 | </para> | ||
1302 | |||
1303 | <itemizedlist> | ||
1304 | |||
1305 | <listitem> | ||
1306 | <para> | ||
1307 | When it's known that HBA is in ready state but ATA/ATAPI | ||
1308 | device is in unknown state, reset only device. | ||
1309 | </para> | ||
1310 | </listitem> | ||
1311 | |||
1312 | <listitem> | ||
1313 | <para> | ||
1314 | If HBA is in unknown state, reset both HBA and device. | ||
1315 | </para> | ||
1316 | </listitem> | ||
1317 | |||
1318 | </itemizedlist> | ||
1319 | |||
1320 | <para> | ||
1321 | HBA resetting is implementation specific. For a controller | ||
1322 | complying with taskfile/BMDMA PCI IDE, stopping active DMA | ||
1323 | transaction may be sufficient iff BMDMA state is the only HBA | ||
1324 | context. But even mostly taskfile/BMDMA PCI IDE complying | ||
1325 | controllers may have implementation specific requirements and | ||
1326 | mechanism to reset themselves. This must be addressed by | ||
1327 | specific drivers. | ||
1328 | </para> | ||
1329 | |||
1330 | <para> | ||
1331 | OTOH, ATA/ATAPI standard describes in detail ways to reset | ||
1332 | ATA/ATAPI devices. | ||
1333 | </para> | ||
1334 | |||
1335 | <variablelist> | ||
1336 | |||
1337 | <varlistentry><term>PATA hardware reset</term> | ||
1338 | <listitem> | ||
1339 | <para> | ||
1340 | This is hardware initiated device reset signalled with | ||
1341 | asserted PATA RESET- signal. There is no standard way to | ||
1342 | initiate hardware reset from software although some | ||
1343 | hardware provides registers that allow driver to directly | ||
1344 | tweak the RESET- signal. | ||
1345 | </para> | ||
1346 | </listitem> | ||
1347 | </varlistentry> | ||
1348 | |||
1349 | <varlistentry><term>Software reset</term> | ||
1350 | <listitem> | ||
1351 | <para> | ||
1352 | This is achieved by turning CONTROL SRST bit on for at | ||
1353 | least 5us. Both PATA and SATA support it but, in case of | ||
1354 | SATA, this may require controller-specific support as the | ||
1355 | second Register FIS to clear SRST should be transmitted | ||
1356 | while BSY bit is still set. Note that on PATA, this resets | ||
1357 | both master and slave devices on a channel. | ||
1358 | </para> | ||
1359 | </listitem> | ||
1360 | </varlistentry> | ||
1361 | |||
1362 | <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term> | ||
1363 | <listitem> | ||
1364 | <para> | ||
1365 | Although the ATA/ATAPI standard doesn't describe it exactly, | ||
1366 | EDD implies some level of resetting, possibly similar to | ||
1367 | software reset. Host-side EDD protocol can be handled | ||
1368 | with normal command processing and most SATA controllers | ||
1369 | should be able to handle EDD's just like other commands. | ||
1370 | As in software reset, EDD affects both devices on a PATA | ||
1371 | bus. | ||
1372 | </para> | ||
1373 | <para> | ||
1374 | Although EDD does reset devices, this doesn't suit error | ||
1375 | handling as EDD cannot be issued while BSY is set and it's | ||
1376 | unclear how it will act when device is in unknown/weird | ||
1377 | state. | ||
1378 | </para> | ||
1379 | </listitem> | ||
1380 | </varlistentry> | ||
1381 | |||
1382 | <varlistentry><term>ATAPI DEVICE RESET command</term> | ||
1383 | <listitem> | ||
1384 | <para> | ||
1385 | This is very similar to software reset except that reset | ||
1386 | can be restricted to the selected device without affecting | ||
1387 | the other device sharing the cable. | ||
1388 | </para> | ||
1389 | </listitem> | ||
1390 | </varlistentry> | ||
1391 | |||
1392 | <varlistentry><term>SATA phy reset</term> | ||
1393 | <listitem> | ||
1394 | <para> | ||
1395 | This is the preferred way of resetting a SATA device. In | ||
1396 | effect, it's identical to PATA hardware reset. Note that | ||
1397 | this can be done with the standard SCR Control register. | ||
1398 | As such, it's usually easier to implement than software | ||
1399 | reset. | ||
1400 | </para> | ||
1401 | </listitem> | ||
1402 | </varlistentry> | ||
1403 | |||
1404 | </variablelist> | ||
1405 | |||
1406 | <para> | ||
1407 | One more thing to consider when resetting devices is that | ||
1408 | resetting clears certain configuration parameters and they | ||
1409 | need to be set to their previous or newly adjusted values | ||
1410 | after reset. | ||
1411 | </para> | ||
1412 | |||
1413 | <para> | ||
1414 | The affected parameters are: | ||
1415 | </para> | ||
1416 | |||
1417 | <itemizedlist> | ||
1418 | |||
1419 | <listitem> | ||
1420 | <para> | ||
1421 | CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used) | ||
1422 | </para> | ||
1423 | </listitem> | ||
1424 | |||
1425 | <listitem> | ||
1426 | <para> | ||
1427 | Parameters set with SET FEATURES including transfer mode setting | ||
1428 | </para> | ||
1429 | </listitem> | ||
1430 | |||
1431 | <listitem> | ||
1432 | <para> | ||
1433 | Block count set with SET MULTIPLE MODE | ||
1434 | </para> | ||
1435 | </listitem> | ||
1436 | |||
1437 | <listitem> | ||
1438 | <para> | ||
1439 | Other parameters (SET MAX, MEDIA LOCK...) | ||
1440 | </para> | ||
1441 | </listitem> | ||
1442 | |||
1443 | </itemizedlist> | ||
1444 | |||
1445 | <para> | ||
1446 | ATA/ATAPI standard specifies that some parameters must be | ||
1447 | maintained across hardware or software reset, but doesn't | ||
1448 | strictly specify all of them. Always reconfiguring needed | ||
1449 | parameters after reset is required for robustness. Note that | ||
1450 | this also applies when resuming from deep sleep (power-off). | ||
1451 | </para> | ||
1452 | |||
1453 | <para> | ||
1454 | Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / | ||
1455 | IDENTIFY PACKET DEVICE is issued after any configuration | ||
1456 | parameter is updated or a hardware reset and the result used | ||
1457 | for further operation. OS driver is required to implement | ||
1458 | revalidation mechanism to support this. | ||
1459 | </para> | ||
1460 | |||
1461 | </sect2> | ||
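
The reset scheme recommended in the Reset section (reset only the device when the HBA is known to be ready, reset both when the HBA state is unknown) can be sketched in C as follows. The type and function names are hypothetical and are not libata interfaces. Treating "make the device forget in-flight commands" as a device-only reset when the HBA is ready is an assumption made for this sketch.

```c
#include <stdbool.h>

/* Hypothetical reset scopes (not libata constants). */
enum eh_reset_scope {
    EH_RESET_NONE,
    EH_RESET_DEVICE,            /* reset only the ATA/ATAPI device */
    EH_RESET_HBA_AND_DEVICE,    /* reset both HBA and device */
};

/* Hypothetical view of what EH knows when it starts recovery. */
struct eh_reset_obs {
    bool hba_known_ready;       /* HBA is known to be in ready state */
    bool dev_state_known;       /* device HSM is in a known, valid state */
    bool forget_inflight;       /* EH must drop in-flight commands */
};

enum eh_reset_scope eh_pick_reset_scope(const struct eh_reset_obs *obs)
{
    /* Unknown HBA state: nothing can be trusted, reset both. */
    if (!obs->hba_known_ready)
        return EH_RESET_HBA_AND_DEVICE;

    /* HBA is fine; reset the device if its state is unknown or if
     * it has to forget in-flight commands (assumption, see above). */
    if (!obs->dev_state_known || obs->forget_inflight)
        return EH_RESET_DEVICE;

    return EH_RESET_NONE;
}
```

Since the text notes that resetting regardless of error condition might improve robustness, a driver could equally replace the final `EH_RESET_NONE` with `EH_RESET_DEVICE`. Whatever scope is chosen, the parameters listed above need to be reconfigured and the device revalidated afterwards.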
1462 | |||
1463 | <sect2 id="exrecReconf"> | ||
1464 | <title>Reconfigure transport</title> | ||
1465 | |||
1466 | <para> | ||
1467 | For both PATA and SATA, a lot of corners are cut for cheap | ||
1468 | connectors, cables or controllers and it's quite common to see | ||
1469 | high transmission error rate. This can be mitigated by | ||
1470 | lowering transmission speed. | ||
1471 | </para> | ||
1472 | |||
1473 | <para> | ||
1474 | The following is a possible scheme Jeff Garzik suggested. | ||
1475 | </para> | ||
1476 | |||
1477 | <blockquote> | ||
1478 | <para> | ||
1479 | If more than $N (3?) transmission errors happen in 15 minutes, | ||
1480 | </para> | ||
1481 | <itemizedlist> | ||
1482 | <listitem> | ||
1483 | <para> | ||
1484 | if SATA, decrease SATA PHY speed. if speed cannot be decreased, | ||
1485 | </para> | ||
1486 | </listitem> | ||
1487 | <listitem> | ||
1488 | <para> | ||
1489 | decrease UDMA xfer speed. if at UDMA0, switch to PIO4, | ||
1490 | </para> | ||
1491 | </listitem> | ||
1492 | <listitem> | ||
1493 | <para> | ||
1494 | decrease PIO xfer speed. if at PIO3, complain, but continue | ||
1495 | </para> | ||
1496 | </listitem> | ||
1497 | </itemizedlist> | ||
1498 | </blockquote> | ||
1499 | |||
1500 | </sect2> | ||
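
The speed-down scheme quoted in the Reconfigure transport section can be read as two small pieces: a counter that detects "more than N transmission errors in 15 minutes", and a step function that lowers the speed by one notch. The C sketch below is one possible reading, not libata code. All names are hypothetical, N is fixed at the suggested value of 3, timestamps are plain seconds, and the SATA PHY speed is modelled as a generation number where 1 is the lowest.

```c
#include <stdbool.h>

#define EH_ERR_LIMIT        3           /* the "$N (3?)" from the quote */
#define EH_ERR_WINDOW_SECS  (15 * 60)   /* 15 minutes */

/* Ring of the most recent EH_ERR_LIMIT + 1 error timestamps. */
struct eh_err_window {
    long stamps[EH_ERR_LIMIT + 1];
    int count;      /* number of valid entries, saturates at ring size */
    int next;       /* slot that the next timestamp overwrites */
};

/* Record an error at time 'now' (seconds). Returns true when more
 * than EH_ERR_LIMIT errors fall within the last 15 minutes. */
bool eh_err_window_record(struct eh_err_window *w, long now)
{
    w->stamps[w->next] = now;
    w->next = (w->next + 1) % (EH_ERR_LIMIT + 1);
    if (w->count < EH_ERR_LIMIT + 1)
        w->count++;
    if (w->count < EH_ERR_LIMIT + 1)
        return false;
    /* Ring is full, so w->next now indexes the oldest timestamp. */
    return now - w->stamps[w->next] <= EH_ERR_WINDOW_SECS;
}

enum eh_xfer_class { EH_XFER_UDMA, EH_XFER_PIO };

/* Hypothetical description of the current transport settings. */
struct eh_link_speed {
    bool is_sata;
    int sata_gen;               /* 1 = lowest PHY speed */
    enum eh_xfer_class cls;
    int mode;                   /* UDMA or PIO mode number */
};

/* Lower the speed by one step. Returns false when already at the
 * floor (PIO3), where the scheme says to complain but continue. */
bool eh_speed_down(struct eh_link_speed *s)
{
    if (s->is_sata && s->sata_gen > 1) {
        s->sata_gen--;          /* decrease SATA PHY speed first */
        return true;
    }
    if (s->cls == EH_XFER_UDMA) {
        if (s->mode > 0) {
            s->mode--;          /* decrease UDMA transfer speed */
        } else {
            s->cls = EH_XFER_PIO;   /* at UDMA0, switch to PIO4 */
            s->mode = 4;
        }
        return true;
    }
    if (s->mode > 3) {
        s->mode--;              /* decrease PIO transfer speed */
        return true;
    }
    return false;               /* at PIO3: nothing left to lower */
}
```

An error handler would call `eh_err_window_record()` for each transmission error and call `eh_speed_down()` when it returns true. Whether the window should be cleared after a speed change is not stated in the quoted scheme and is left out here.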
1501 | |||
1502 | </sect1> | ||
1503 | |||
1504 | </chapter> | ||
1505 | |||
434 | <chapter id="PiixInt"> | 1506 | <chapter id="PiixInt"> |
435 | <title>ata_piix Internals</title> | 1507 | <title>ata_piix Internals</title> |
436 | !Idrivers/scsi/ata_piix.c | 1508 | !Idrivers/scsi/ata_piix.c |
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 6dd274d7e1cf..2d65c2182161 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt | |||
@@ -906,9 +906,20 @@ Aside: | |||
906 | 906 | ||
907 | 907 | ||
908 | 4. The I/O scheduler | 908 | 4. The I/O scheduler |
909 | I/O schedulers are now per queue. They should be runtime switchable and modular | 909 | I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch |
910 | but aren't yet. Jens has most bits to do this, but the sysfs implementation is | 910 | queue and specific I/O schedulers. Unless stated otherwise, elevator is used |
911 | missing. | 911 | to refer to both parts and I/O scheduler to specific I/O schedulers. |
912 | |||
913 | Block layer implements generic dispatch queue in ll_rw_blk.c and elevator.c. | ||
914 | The generic dispatch queue is responsible for properly ordering barrier | ||
915 | requests, requeueing, handling non-fs requests and all other subtleties. | ||
916 | |||
917 | Specific I/O schedulers are responsible for ordering normal filesystem | ||
918 | requests. They can also choose to delay certain requests to improve | ||
919 | throughput or whatever purpose. As the plural form indicates, there are | ||
920 | multiple I/O schedulers. They can be built as modules but at least one should | ||
| 921 | be built inside the kernel. Each queue can choose a different one and can also | ||
922 | change to another one dynamically. | ||
912 | 923 | ||
913 | A block layer call to the i/o scheduler follows the convention elv_xxx(). This | 924 | A block layer call to the i/o scheduler follows the convention elv_xxx(). This |
914 | calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, | 925 | calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, |
@@ -921,44 +932,36 @@ keeping work. | |||
921 | The functions an elevator may implement are: (* are mandatory) | 932 | The functions an elevator may implement are: (* are mandatory) |
922 | elevator_merge_fn called to query requests for merge with a bio | 933 | elevator_merge_fn called to query requests for merge with a bio |
923 | 934 | ||
924 | elevator_merge_req_fn " " " with another request | 935 | elevator_merge_req_fn called when two requests get merged. the one |
| 936 | which gets merged into the other one will never | ||
| 937 | be seen by the I/O scheduler again. IOW, after | ||
938 | being merged, the request is gone. | ||
925 | 939 | ||
926 | elevator_merged_fn called when a request in the scheduler has been | 940 | elevator_merged_fn called when a request in the scheduler has been |
927 | involved in a merge. It is used in the deadline | 941 | involved in a merge. It is used in the deadline |
928 | scheduler for example, to reposition the request | 942 | scheduler for example, to reposition the request |
929 | if its sorting order has changed. | 943 | if its sorting order has changed. |
930 | 944 | ||
931 | *elevator_next_req_fn returns the next scheduled request, or NULL | 945 | elevator_dispatch_fn fills the dispatch queue with ready requests. |
932 | if there are none (or none are ready). | 946 | I/O schedulers are free to postpone requests by |
947 | not filling the dispatch queue unless @force | ||
948 | is non-zero. Once dispatched, I/O schedulers | ||
949 | are not allowed to manipulate the requests - | ||
950 | they belong to generic dispatch queue. | ||
933 | 951 | ||
934 | *elevator_add_req_fn called to add a new request into the scheduler | 952 | elevator_add_req_fn called to add a new request into the scheduler |
935 | 953 | ||
936 | elevator_queue_empty_fn returns true if the merge queue is empty. | 954 | elevator_queue_empty_fn returns true if the merge queue is empty. |
937 | Drivers shouldn't use this, but rather check | 955 | Drivers shouldn't use this, but rather check |
938 | if elv_next_request is NULL (without losing the | 956 | if elv_next_request is NULL (without losing the |
939 | request if one exists!) | 957 | request if one exists!) |
940 | 958 | ||
941 | elevator_remove_req_fn This is called when a driver claims ownership of | ||
942 | the target request - it now belongs to the | ||
943 | driver. It must not be modified or merged. | ||
944 | Drivers must not lose the request! A subsequent | ||
945 | call of elevator_next_req_fn must return the | ||
946 | _next_ request. | ||
947 | |||
948 | elevator_requeue_req_fn called to add a request to the scheduler. This | ||
950 | is used when the request has already been | ||
950 | returned by elv_next_request, but hasn't | ||
951 | completed. If this is not implemented then | ||
952 | elevator_add_req_fn is called instead. | ||
953 | |||
954 | elevator_former_req_fn | 959 | elevator_former_req_fn |
955 | elevator_latter_req_fn These return the request before or after the | 960 | elevator_latter_req_fn These return the request before or after the |
956 | one specified in disk sort order. Used by the | 961 | one specified in disk sort order. Used by the |
957 | block layer to find merge possibilities. | 962 | block layer to find merge possibilities. |
958 | 963 | ||
959 | elevator_completed_req_fn called when a request is completed. This might | 964 | elevator_completed_req_fn called when a request is completed. |
960 | come about due to being merged with another or | ||
961 | when the device completes the request. | ||
962 | 965 | ||
963 | elevator_may_queue_fn returns true if the scheduler wants to allow the | 966 | elevator_may_queue_fn returns true if the scheduler wants to allow the |
964 | current context to queue a new request even if | 967 | current context to queue a new request even if |
@@ -967,13 +970,33 @@ elevator_may_queue_fn returns true if the scheduler wants to allow the | |||
967 | 970 | ||
968 | elevator_set_req_fn | 971 | elevator_set_req_fn |
969 | elevator_put_req_fn Must be used to allocate and free any elevator | 972 | elevator_put_req_fn Must be used to allocate and free any elevator |
970 | specific storate for a request. | 973 | specific storage for a request. |
974 | |||
975 | elevator_activate_req_fn Called when device driver first sees a request. | ||
976 | I/O schedulers can use this callback to | ||
977 | determine when actual execution of a request | ||
978 | starts. | ||
979 | elevator_deactivate_req_fn Called when device driver decides to delay | ||
980 | a request by requeueing it. | ||
971 | 981 | ||
972 | elevator_init_fn | 982 | elevator_init_fn |
973 | elevator_exit_fn Allocate and free any elevator specific storage | 983 | elevator_exit_fn Allocate and free any elevator specific storage |
974 | for a queue. | 984 | for a queue. |
975 | 985 | ||
976 | 4.2 I/O scheduler implementation | 986 | 4.2 Request flows seen by I/O schedulers |
| 987 | All requests seen by I/O schedulers strictly follow one of the following three | ||
988 | flows. | ||
989 | |||
990 | set_req_fn -> | ||
991 | |||
992 | i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn -> | ||
993 | (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn | ||
994 | ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn | ||
995 | iii. [none] | ||
996 | |||
997 | -> put_req_fn | ||
998 | |||
999 | 4.3 I/O scheduler implementation | ||
977 | The generic i/o scheduler algorithm attempts to sort/merge/batch requests for | 1000 | The generic i/o scheduler algorithm attempts to sort/merge/batch requests for |
978 | optimal disk scan and request servicing performance (based on generic | 1001 | optimal disk scan and request servicing performance (based on generic |
979 | principles and device capabilities), optimized for: | 1002 | principles and device capabilities), optimized for: |
@@ -993,18 +1016,7 @@ request in sort order to prevent binary tree lookups. | |||
993 | This arrangement is not a generic block layer characteristic however, so | 1016 | This arrangement is not a generic block layer characteristic however, so |
994 | elevators may implement queues as they please. | 1017 | elevators may implement queues as they please. |
995 | 1018 | ||
996 | ii. Last merge hint | 1019 | ii. Merge hash |
997 | The last merge hint is part of the generic queue layer. I/O schedulers must do | ||
998 | some management on it. For the most part, the most important thing is to make | ||
999 | sure q->last_merge is cleared (set to NULL) when the request on it is no longer | ||
1000 | a candidate for merging (for example if it has been sent to the driver). | ||
1001 | |||
1002 | The last merge performed is cached as a hint for the subsequent request. If | ||
1003 | sequential data is being submitted, the hint is used to perform merges without | ||
1004 | any scanning. This is not sufficient when there are multiple processes doing | ||
1005 | I/O though, so a "merge hash" is used by some schedulers. | ||
1006 | |||
1007 | iii. Merge hash | ||
1008 | AS and deadline use a hash table indexed by the last sector of a request. This | 1020 | AS and deadline use a hash table indexed by the last sector of a request. This |
1009 | enables merging code to quickly look up "back merge" candidates, even when | 1021 | enables merging code to quickly look up "back merge" candidates, even when |
1010 | multiple I/O streams are being performed at once on one disk. | 1022 | multiple I/O streams are being performed at once on one disk. |
@@ -1013,29 +1025,8 @@ multiple I/O streams are being performed at once on one disk. | |||
1013 | are far less common than "back merges" due to the nature of most I/O patterns. | 1025 | are far less common than "back merges" due to the nature of most I/O patterns. |
1014 | Front merges are handled by the binary trees in AS and deadline schedulers. | 1026 | Front merges are handled by the binary trees in AS and deadline schedulers. |
1015 | 1027 | ||
1016 | iv. Handling barrier cases | 1028 | iii. Plugging the queue to batch requests in anticipation of opportunities for |
1017 | A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered | 1029 | merge/sort optimizations |
1018 | around. That is, they must be processed after all older requests, and before | ||
1019 | any newer ones. This includes merges! | ||
1020 | |||
1021 | In AS and deadline schedulers, barriers have the effect of flushing the reorder | ||
1022 | queue. The performance cost of this will vary from nothing to a lot depending | ||
1023 | on i/o patterns and device characteristics. Obviously they won't improve | ||
1024 | performance, so their use should be kept to a minimum. | ||
1025 | |||
1026 | v. Handling insertion position directives | ||
1027 | A request may be inserted with a position directive. The directives are one of | ||
1028 | ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. | ||
1029 | |||
1030 | ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. | ||
1031 | ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue. | ||
1032 | ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and | ||
1033 | overrides the ordering requested by any previous barriers. In practice this is | ||
1034 | harmless and required, because it is used for SCSI requeueing. This does not | ||
1035 | require flushing the reorder queue, so does not impose a performance penalty. | ||
1036 | |||
1037 | vi. Plugging the queue to batch requests in anticipation of opportunities for | ||
1038 | merge/sort optimizations | ||
1039 | 1030 | ||
1040 | This is just the same as in 2.4 so far, though per-device unplugging | 1031 | This is just the same as in 2.4 so far, though per-device unplugging |
1041 | support is anticipated for 2.5. Also with a priority-based i/o scheduler, | 1032 | support is anticipated for 2.5. Also with a priority-based i/o scheduler, |
@@ -1069,7 +1060,7 @@ Aside: | |||
1069 | blk_kick_queue() to unplug a specific queue (right away ?) | 1060 | blk_kick_queue() to unplug a specific queue (right away ?) |
1070 | or optionally, all queues, is in the plan. | 1061 | or optionally, all queues, is in the plan. |
1071 | 1062 | ||
1072 | 4.3 I/O contexts | 1063 | 4.4 I/O contexts |
1073 | I/O contexts provide a dynamically allocated per process data area. They may | 1064 | I/O contexts provide a dynamically allocated per process data area. They may |
1074 | be used in I/O schedulers, and in the block layer (could be used for IO statis, | 1065 | be used in I/O schedulers, and in the block layer (could be used for IO statis, |
1075 | priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and | 1066 | priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and |
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index a55f0f95b171..b0fe41da007b 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt | |||
@@ -777,7 +777,7 @@ doing so is the same as described in the "Configuring Multiple Bonds | |||
777 | Manually" section, below. | 777 | Manually" section, below. |
778 | 778 | ||
779 | NOTE: It has been observed that some Red Hat supplied kernels | 779 | NOTE: It has been observed that some Red Hat supplied kernels |
780 | are apparently unable to rename modules at load time (the "-obonding1" | 780 | are apparently unable to rename modules at load time (the "-o bond1" |
781 | part). Attempts to pass that option to modprobe will produce an | 781 | part). Attempts to pass that option to modprobe will produce an |
782 | "Operation not permitted" error. This has been reported on some | 782 | "Operation not permitted" error. This has been reported on some |
783 | Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels | 783 | Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels |
@@ -883,7 +883,8 @@ the above does not work, and the second bonding instance never sees | |||
883 | its options. In that case, the second options line can be substituted | 883 | its options. In that case, the second options line can be substituted |
884 | as follows: | 884 | as follows: |
885 | 885 | ||
886 | install bonding1 /sbin/modprobe bonding -obond1 mode=balance-alb miimon=50 | 886 | install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ |
887 | mode=balance-alb miimon=50 | ||
887 | 888 | ||
888 | This may be repeated any number of times, specifying a new and | 889 | This may be repeated any number of times, specifying a new and |
889 | unique name in place of bond1 for each subsequent instance. | 890 | unique name in place of bond1 for each subsequent instance. |