| author | Benjamin Herrenschmidt <benh@kernel.crashing.org> | 2008-07-15 01:44:51 -0400 |
|---|---|---|
| committer | Benjamin Herrenschmidt <benh@kernel.crashing.org> | 2008-07-15 01:44:51 -0400 |
| commit | 43d2548bb2ef7e6d753f91468a746784041e522d (patch) | |
| tree | 77d13fcd48fd998393abb825ec36e2b732684a73 /Documentation | |
| parent | 585583d95c5660973bc0cf64add517b040acd8a4 (diff) | |
| parent | 85082fd7cbe3173198aac0eb5e85ab1edcc6352c (diff) | |
Merge commit '85082fd7cbe3173198aac0eb5e85ab1edcc6352c' into test-build
Manual fixup of:
arch/powerpc/Kconfig
Diffstat (limited to 'Documentation')
26 files changed, 825 insertions, 121 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 4bd9ea539129..44f52a4f5903 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
| @@ -26,3 +26,37 @@ Description: | |||
| 26 | I/O statistics of partition <part>. The format is the | 26 | I/O statistics of partition <part>. The format is the |
| 27 | same as the above-written /sys/block/<disk>/stat | 27 | same as the above-written /sys/block/<disk>/stat |
| 28 | format. | 28 | format. |
| 29 | |||
| 30 | |||
| 31 | What: /sys/block/<disk>/integrity/format | ||
| 32 | Date: June 2008 | ||
| 33 | Contact: Martin K. Petersen <martin.petersen@oracle.com> | ||
| 34 | Description: | ||
| 35 | Metadata format for integrity capable block device. | ||
| 36 | E.g. T10-DIF-TYPE1-CRC. | ||
| 37 | |||
| 38 | |||
| 39 | What: /sys/block/<disk>/integrity/read_verify | ||
| 40 | Date: June 2008 | ||
| 41 | Contact: Martin K. Petersen <martin.petersen@oracle.com> | ||
| 42 | Description: | ||
| 43 | Indicates whether the block layer should verify the | ||
| 44 | integrity of read requests serviced by devices that | ||
| 45 | support sending integrity metadata. | ||
| 46 | |||
| 47 | |||
| 48 | What: /sys/block/<disk>/integrity/tag_size | ||
| 49 | Date: June 2008 | ||
| 50 | Contact: Martin K. Petersen <martin.petersen@oracle.com> | ||
| 51 | Description: | ||
| 52 | Number of bytes of integrity tag space available per | ||
| 53 | 512 bytes of data. | ||
| 54 | |||
| 55 | |||
| 56 | What: /sys/block/<disk>/integrity/write_generate | ||
| 57 | Date: June 2008 | ||
| 58 | Contact: Martin K. Petersen <martin.petersen@oracle.com> | ||
| 59 | Description: | ||
| 60 | Indicates whether the block layer should automatically | ||
| 61 | generate checksums for write requests bound for | ||
| 62 | devices that support receiving integrity metadata. | ||
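The integrity attributes added above are ordinary sysfs text files, so a userspace tool can read them with plain file I/O. The following is a minimal sketch in C, not part of the patch. The helper names integrity_attr_path() and read_integrity_attr() are hypothetical, and the sketch assumes that numeric attributes such as tag_size, read_verify and write_generate hold a single non-negative decimal number; the ABI text does not spell out the value format. The format attribute is a string (e.g. T10-DIF-TYPE1-CRC) and would need a separate string reader.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build "/sys/block/<disk>/integrity/<attr>" in buf.
 * Returns 0 on success, -1 if the path does not fit in the buffer.
 */
int integrity_attr_path(char *buf, size_t len,
                        const char *disk, const char *attr)
{
    int n = snprintf(buf, len, "/sys/block/%s/integrity/%s", disk, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read a numeric integrity attribute.
 * Assumes the file holds one non-negative decimal number.
 * Returns the value, or -1 if the file is missing or cannot be parsed.
 */
long read_integrity_attr(const char *disk, const char *attr)
{
    char path[256];
    char line[64];
    char *end;
    FILE *f;
    long value;

    if (integrity_attr_path(path, sizeof(path), disk, attr) != 0)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;              /* no integrity profile, or no such disk */

    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    value = strtol(line, &end, 10);
    if (end == line || value < 0)
        return -1;              /* not a number */

    return value;
}
```

A caller might use read_integrity_attr("sda", "tag_size") and treat -1 as "no integrity support on this device"; the disk name is only an example.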
diff --git a/Documentation/ABI/testing/sysfs-bus-css b/Documentation/ABI/testing/sysfs-bus-css
new file mode 100644
index 000000000000..b585ec258a08
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-css
| @@ -0,0 +1,35 @@ | |||
| 1 | What: /sys/bus/css/devices/.../type | ||
| 2 | Date: March 2008 | ||
| 3 | Contact: Cornelia Huck <cornelia.huck@de.ibm.com> | ||
| 4 | linux-s390@vger.kernel.org | ||
| 5 | Description: Contains the subchannel type, as reported by the hardware. | ||
| 6 | This attribute is present for all subchannel types. | ||
| 7 | |||
| 8 | What: /sys/bus/css/devices/.../modalias | ||
| 9 | Date: March 2008 | ||
| 10 | Contact: Cornelia Huck <cornelia.huck@de.ibm.com> | ||
| 11 | linux-s390@vger.kernel.org | ||
| 12 | Description: Contains the module alias as reported with uevents. | ||
| 13 | It is of the format css:t<type> and present for all | ||
| 14 | subchannel types. | ||
| 15 | |||
| 16 | What: /sys/bus/css/drivers/io_subchannel/.../chpids | ||
| 17 | Date: December 2002 | ||
| 18 | Contact: Cornelia Huck <cornelia.huck@de.ibm.com> | ||
| 19 | linux-s390@vger.kernel.org | ||
| 20 | Description: Contains the ids of the channel paths used by this | ||
| 21 | subchannel, as reported by the channel subsystem | ||
| 22 | during subchannel recognition. | ||
| 23 | Note: This is an I/O-subchannel specific attribute. | ||
| 24 | Users: s390-tools, HAL | ||
| 25 | |||
| 26 | What: /sys/bus/css/drivers/io_subchannel/.../pimpampom | ||
| 27 | Date: December 2002 | ||
| 28 | Contact: Cornelia Huck <cornelia.huck@de.ibm.com> | ||
| 29 | linux-s390@vger.kernel.org | ||
| 30 | Description: Contains the PIM/PAM/POM values, as reported by the | ||
| 31 | channel subsystem when last queried by the common I/O | ||
| 32 | layer (this implies that this attribute is not necessarily | ||
| 33 | in sync with the values current in the channel subsystem). | ||
| 34 | Note: This is an I/O-subchannel specific attribute. | ||
| 35 | Users: s390-tools, HAL | ||
diff --git a/Documentation/ABI/testing/sysfs-firmware-memmap b/Documentation/ABI/testing/sysfs-firmware-memmap
new file mode 100644
index 000000000000..0d99ee6ae02e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-memmap
| @@ -0,0 +1,71 @@ | |||
| 1 | What: /sys/firmware/memmap/ | ||
| 2 | Date: June 2008 | ||
| 3 | Contact: Bernhard Walle <bwalle@suse.de> | ||
| 4 | Description: | ||
| 5 | On all platforms, the firmware provides a memory map which the | ||
| 6 | kernel reads. The resources from that memory map are registered | ||
| 7 | in the kernel resource tree and exposed to userspace via | ||
| 8 | /proc/iomem (together with other resources). | ||
| 9 | |||
| 10 | However, on most architectures that firmware-provided memory | ||
| 11 | map is modified afterwards by the kernel itself, either because | ||
| 12 | the kernel merges that memory map with other information or | ||
| 13 | just because the user overwrites that memory map via command | ||
| 14 | line. | ||
| 15 | |||
| 16 | kexec needs the raw firmware-provided memory map to setup the | ||
| 17 | parameter segment of the kernel that should be booted with | ||
| 18 | kexec. Also, the raw memory map is useful for debugging. For | ||
| 19 | that reason, /sys/firmware/memmap is an interface that provides | ||
| 20 | the raw memory map to userspace. | ||
| 21 | |||
| 22 | The structure is as follows: Under /sys/firmware/memmap there | ||
| 23 | are subdirectories with the number of the entry as their name: | ||
| 24 | |||
| 25 | /sys/firmware/memmap/0 | ||
| 26 | /sys/firmware/memmap/1 | ||
| 27 | /sys/firmware/memmap/2 | ||
| 28 | /sys/firmware/memmap/3 | ||
| 29 | ... | ||
| 30 | |||
| 31 | The maximum depends on the number of memory map entries provided | ||
| 32 | by the firmware. The order is just the order that the firmware | ||
| 33 | provides. | ||
| 34 | |||
| 35 | Each directory contains three files: | ||
| 36 | |||
| 37 | start : The start address (as hexadecimal number with the | ||
| 38 | '0x' prefix). | ||
| 39 | end : The end address, inclusive (regardless whether the | ||
| 40 | firmware provides inclusive or exclusive ranges). | ||
| 41 | type : Type of the entry as string. See below for a list of | ||
| 42 | valid types. | ||
| 43 | |||
| 44 | So, for example: | ||
| 45 | |||
| 46 | /sys/firmware/memmap/0/start | ||
| 47 | /sys/firmware/memmap/0/end | ||
| 48 | /sys/firmware/memmap/0/type | ||
| 49 | /sys/firmware/memmap/1/start | ||
| 50 | ... | ||
| 51 | |||
| 52 | Currently the following types exist: | ||
| 53 | |||
| 54 | - System RAM | ||
| 55 | - ACPI Tables | ||
| 56 | - ACPI Non-volatile Storage | ||
| 57 | - reserved | ||
| 58 | |||
| 59 | The following shell snippet can be used to display that memory | ||
| 60 | map in a human-readable format: | ||
| 61 | |||
| 62 | -------------------- 8< ---------------------------------------- | ||
| 63 | #!/bin/bash | ||
| 64 | cd /sys/firmware/memmap | ||
| 65 | for dir in * ; do | ||
| 66 | start=$(cat $dir/start) | ||
| 67 | end=$(cat $dir/end) | ||
| 68 | type=$(cat $dir/type) | ||
| 69 | printf "%016x-%016x (%s)\n" $start $[ $end +1] "$type" | ||
| 70 | done | ||
| 71 | -------------------- >8 ---------------------------------------- | ||
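The formatting step of the shell snippet above can also be written in C, which may be convenient for tools such as kexec that already parse these files. This is a sketch under stated assumptions rather than code from the patch: format_memmap_entry() is a hypothetical name, and the caller is assumed to have read the start, end and type files into strings and stripped the trailing newline from type. As in the shell version, the inclusive end address is printed as an exclusive one (end + 1).

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: format one /sys/firmware/memmap entry.
 *
 * start, end: contents of the "start" and "end" files, i.e. hexadecimal
 *             numbers with a "0x" prefix (end is inclusive).
 * type:       contents of the "type" file, without the trailing newline.
 *
 * Writes "<start>-<end+1> (<type>)" into out, mirroring the printf line
 * of the shell snippet. Returns 0 on success, -1 if out is too small.
 */
int format_memmap_entry(char *out, size_t len,
                        const char *start, const char *end,
                        const char *type)
{
    /* Base 16 makes strtoull() accept the optional "0x" prefix. */
    unsigned long long s = strtoull(start, NULL, 16);
    unsigned long long e = strtoull(end, NULL, 16);
    int n;

    n = snprintf(out, len, "%016llx-%016llx (%s)", s, e + 1, type);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A full tool would loop over the numbered subdirectories of /sys/firmware/memmap and call this helper once per entry.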
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
new file mode 100644
index 000000000000..e9dc8d86adc7
--- /dev/null
+++ b/Documentation/block/data-integrity.txt
| @@ -0,0 +1,327 @@ | |||
| 1 | ---------------------------------------------------------------------- | ||
| 2 | 1. INTRODUCTION | ||
| 3 | |||
| 4 | Modern filesystems feature checksumming of data and metadata to | ||
| 5 | protect against data corruption. However, the detection of the | ||
| 6 | corruption is done at read time which could potentially be months | ||
| 7 | after the data was written. At that point the original data that the | ||
| 8 | application tried to write is most likely lost. | ||
| 9 | |||
| 10 | The solution is to ensure that the disk is actually storing what the | ||
| 11 | application meant it to. Recent additions to both the SCSI family | ||
| 12 | protocols (SBC Data Integrity Field, SCC protection proposal) as well | ||
| 13 | as SATA/T13 (External Path Protection) try to remedy this by adding | ||
| 14 | support for appending integrity metadata to an I/O. The integrity | ||
| 15 | metadata (or protection information in SCSI terminology) includes a | ||
| 16 | checksum for each sector as well as an incrementing counter that | ||
| 17 | ensures the individual sectors are written in the right order. And | ||
| 18 | for some protection schemes also that the I/O is written to the right | ||
| 19 | place on disk. | ||
| 20 | |||
| 21 | Current storage controllers and devices implement various protective | ||
| 22 | measures, for instance checksumming and scrubbing. But these | ||
| 23 | technologies are working in their own isolated domains or at best | ||
| 24 | between adjacent nodes in the I/O path. The interesting thing about | ||
| 25 | DIF and the other integrity extensions is that the protection format | ||
| 26 | is well defined and every node in the I/O path can verify the | ||
| 27 | integrity of the I/O and reject it if corruption is detected. This | ||
| 28 | allows not only corruption prevention but also isolation of the point | ||
| 29 | of failure. | ||
| 30 | |||
| 31 | ---------------------------------------------------------------------- | ||
| 32 | 2. THE DATA INTEGRITY EXTENSIONS | ||
| 33 | |||
| 34 | As written, the protocol extensions only protect the path between | ||
| 35 | controller and storage device. However, many controllers actually | ||
| 36 | allow the operating system to interact with the integrity metadata | ||
| 37 | (IMD). We have been working with several FC/SAS HBA vendors to enable | ||
| 38 | the protection information to be transferred to and from their | ||
| 39 | controllers. | ||
| 40 | |||
| 41 | The SCSI Data Integrity Field works by appending 8 bytes of protection | ||
| 42 | information to each sector. The data + integrity metadata is stored | ||
| 43 | in 520 byte sectors on disk. Data + IMD are interleaved when | ||
| 44 | transferred between the controller and target. The T13 proposal is | ||
| 45 | similar. | ||
| 46 | |||
| 47 | Because it is highly inconvenient for operating systems to deal with | ||
| 48 | 520 (and 4104) byte sectors, we approached several HBA vendors and | ||
| 49 | encouraged them to allow separation of the data and integrity metadata | ||
| 50 | scatter-gather lists. | ||
| 51 | |||
| 52 | The controller will interleave the buffers on write and split them on | ||
| 53 | read. This means that Linux can DMA the data buffers to and from | ||
| 54 | host memory without changes to the page cache. | ||
| 55 | |||
| 56 | Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs | ||
| 57 | is somewhat heavy to compute in software. Benchmarks found that | ||
| 58 | calculating this checksum had a significant impact on system | ||
| 59 | performance for a number of workloads. Some controllers allow a | ||
| 60 | lighter-weight checksum to be used when interfacing with the operating | ||
| 61 | system. Emulex, for instance, supports the TCP/IP checksum instead. | ||
| 62 | The IP checksum received from the OS is converted to the 16-bit CRC | ||
| 63 | when writing and vice versa. This allows the integrity metadata to be | ||
| 64 | generated by Linux or the application at very low cost (comparable to | ||
| 65 | software RAID5). | ||
| 66 | |||
| 67 | The IP checksum is weaker than the CRC in terms of detecting bit | ||
| 68 | errors. However, the strength is really in the separation of the data | ||
| 69 | buffers and the integrity metadata. These two distinct buffers must | ||
| 70 | match up for an I/O to complete. | ||
| 71 | |||
| 72 | The separation of the data and integrity metadata buffers as well as | ||
| 73 | the choice in checksums is referred to as the Data Integrity | ||
| 74 | Extensions. As these extensions are outside the scope of the protocol | ||
| 75 | bodies (T10, T13), Oracle and its partners are trying to standardize | ||
| 76 | them within the Storage Networking Industry Association. | ||
| 77 | |||
| 78 | ---------------------------------------------------------------------- | ||
| 79 | 3. KERNEL CHANGES | ||
| 80 | |||
| 81 | The data integrity framework in Linux enables protection information | ||
| 82 | to be pinned to I/Os and sent to/received from controllers that | ||
| 83 | support it. | ||
| 84 | |||
| 85 | The advantage to the integrity extensions in SCSI and SATA is that | ||
| 86 | they enable us to protect the entire path from application to storage | ||
| 87 | device. However, at the same time this is also the biggest | ||
| 88 | disadvantage. It means that the protection information must be in a | ||
| 89 | format that can be understood by the disk. | ||
| 90 | |||
| 91 | Generally Linux/POSIX applications are agnostic to the intricacies of | ||
| 92 | the storage devices they are accessing. The virtual filesystem switch | ||
| 93 | and the block layer make things like hardware sector size and | ||
| 94 | transport protocols completely transparent to the application. | ||
| 95 | |||
| 96 | However, this level of detail is required when preparing the | ||
| 97 | protection information to send to a disk. Consequently, the very | ||
| 98 | concept of an end-to-end protection scheme is a layering violation. | ||
| 99 | It is completely unreasonable for an application to be aware whether | ||
| 100 | it is accessing a SCSI or SATA disk. | ||
| 101 | |||
| 102 | The data integrity support implemented in Linux attempts to hide this | ||
| 103 | from the application. As far as the application (and to some extent | ||
| 104 | the kernel) is concerned, the integrity metadata is opaque information | ||
| 105 | that's attached to the I/O. | ||
| 106 | |||
| 107 | The current implementation allows the block layer to automatically | ||
| 108 | generate the protection information for any I/O. Eventually the | ||
| 109 | intent is to move the integrity metadata calculation to userspace for | ||
| 110 | user data. Metadata and other I/O that originates within the kernel | ||
| 111 | will still use the automatic generation interface. | ||
| 112 | |||
| 113 | Some storage devices allow each hardware sector to be tagged with a | ||
| 114 | 16-bit value. The owner of this tag space is the owner of the block | ||
| 115 | device. I.e. the filesystem in most cases. The filesystem can use | ||
| 116 | this extra space to tag sectors as they see fit. Because the tag | ||
| 117 | space is limited, the block interface allows tagging bigger chunks by | ||
| 118 | way of interleaving. This way, 8*16 bits of information can be | ||
| 119 | attached to a typical 4KB filesystem block. | ||
| 120 | |||
| 121 | This also means that applications such as fsck and mkfs will need | ||
| 122 | access to manipulate the tags from user space. A passthrough | ||
| 123 | interface for this is being worked on. | ||
| 124 | |||
| 125 | |||
| 126 | ---------------------------------------------------------------------- | ||
| 127 | 4. BLOCK LAYER IMPLEMENTATION DETAILS | ||
| 128 | |||
| 129 | 4.1 BIO | ||
| 130 | |||
| 131 | The data integrity patches add a new field to struct bio when | ||
| 132 | CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer | ||
| 133 | to a struct bip which contains the bio integrity payload. Essentially | ||
| 134 | a bip is a trimmed down struct bio which holds a bio_vec containing | ||
| 135 | the integrity metadata and the required housekeeping information (bvec | ||
| 136 | pool, vector count, etc.) | ||
| 137 | |||
| 138 | A kernel subsystem can enable data integrity protection on a bio by | ||
| 139 | calling bio_integrity_alloc(bio). This will allocate and attach the | ||
| 140 | bip to the bio. | ||
| 141 | |||
| 142 | Individual pages containing integrity metadata can subsequently be | ||
| 143 | attached using bio_integrity_add_page(). | ||
| 144 | |||
| 145 | bio_free() will automatically free the bip. | ||
| 146 | |||
| 147 | |||
| 148 | 4.2 BLOCK DEVICE | ||
| 149 | |||
| 150 | Because the format of the protection data is tied to the physical | ||
| 151 | disk, each block device has been extended with a block integrity | ||
| 152 | profile (struct blk_integrity). This optional profile is registered | ||
| 153 | with the block layer using blk_integrity_register(). | ||
| 154 | |||
| 155 | The profile contains callback functions for generating and verifying | ||
| 156 | the protection data, as well as getting and setting application tags. | ||
| 157 | The profile also contains a few constants to aid in completing, | ||
| 158 | merging and splitting the integrity metadata. | ||
| 159 | |||
| 160 | Layered block devices will need to pick a profile that's appropriate | ||
| 161 | for all subdevices. blk_integrity_compare() can help with that. DM | ||
| 162 | and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 | ||
| 163 | will require extra work due to the application tag. | ||
| 164 | |||
| 165 | |||
| 166 | ---------------------------------------------------------------------- | ||
| 167 | 5.0 BLOCK LAYER INTEGRITY API | ||
| 168 | |||
| 169 | 5.1 NORMAL FILESYSTEM | ||
| 170 | |||
| 171 | The normal filesystem is unaware that the underlying block device | ||
| 172 | is capable of sending/receiving integrity metadata. The IMD will | ||
| 173 | be automatically generated by the block layer at submit_bio() time | ||
| 174 | in case of a WRITE. A READ request will cause the I/O integrity | ||
| 175 | to be verified upon completion. | ||
| 176 | |||
| 177 | IMD generation and verification can be toggled using the | ||
| 178 | |||
| 179 | /sys/block/<bdev>/integrity/write_generate | ||
| 180 | |||
| 181 | and | ||
| 182 | |||
| 183 | /sys/block/<bdev>/integrity/read_verify | ||
| 184 | |||
| 185 | flags. | ||
| 186 | |||
| 187 | |||
| 188 | 5.2 INTEGRITY-AWARE FILESYSTEM | ||
| 189 | |||
| 190 | A filesystem that is integrity-aware can prepare I/Os with IMD | ||
| 191 | attached. It can also use the application tag space if this is | ||
| 192 | supported by the block device. | ||
| 193 | |||
| 194 | |||
| 195 | int bdev_integrity_enabled(block_device, int rw); | ||
| 196 | |||
| 197 | bdev_integrity_enabled() will return 1 if the block device | ||
| 198 | supports integrity metadata transfer for the data direction | ||
| 199 | specified in 'rw'. | ||
| 200 | |||
| 201 | bdev_integrity_enabled() honors the write_generate and | ||
| 202 | read_verify flags in sysfs and will respond accordingly. | ||
| 203 | |||
| 204 | |||
| 205 | int bio_integrity_prep(bio); | ||
| 206 | |||
| 207 | To generate IMD for WRITE and to set up buffers for READ, the | ||
| 208 | filesystem must call bio_integrity_prep(bio). | ||
| 209 | |||
| 210 | Prior to calling this function, the bio data direction and start | ||
| 211 | sector must be set, and the bio should have all data pages | ||
| 212 | added. It is up to the caller to ensure that the bio does not | ||
| 213 | change while I/O is in progress. | ||
| 214 | |||
| 215 | bio_integrity_prep() should only be called if | ||
| 216 | bio_integrity_enabled() returned 1. | ||
| 217 | |||
| 218 | |||
| 219 | int bio_integrity_tag_size(bio); | ||
| 220 | |||
| 221 | If the filesystem wants to use the application tag space it will | ||
| 222 | first have to find out how much storage space is available. | ||
| 223 | Because tag space is generally limited (usually 2 bytes per | ||
| 224 | sector regardless of sector size), the integrity framework | ||
| 225 | supports interleaving the information between the sectors in an | ||
| 226 | I/O. | ||
| 227 | |||
| 228 | Filesystems can call bio_integrity_tag_size(bio) to find out how | ||
| 229 | many bytes of storage are available for that particular bio. | ||
| 230 | |||
| 231 | Another option is bdev_get_tag_size(block_device) which will | ||
| 232 | return the number of available bytes per hardware sector. | ||
| 233 | |||
| 234 | |||
| 235 | int bio_integrity_set_tag(bio, void *tag_buf, len); | ||
| 236 | |||
| 237 | After a successful return from bio_integrity_prep(), | ||
| 238 | bio_integrity_set_tag() can be used to attach an opaque tag | ||
| 239 | buffer to a bio. Obviously this only makes sense if the I/O is | ||
| 240 | a WRITE. | ||
| 241 | |||
| 242 | |||
| 243 | int bio_integrity_get_tag(bio, void *tag_buf, len); | ||
| 244 | |||
| 245 | Similarly, at READ I/O completion time the filesystem can | ||
| 246 | retrieve the tag buffer using bio_integrity_get_tag(). | ||
| 247 | |||
| 248 | |||
| 249 | 5.3 PASSING EXISTING INTEGRITY METADATA | ||
| 250 | |||
| 251 | Filesystems that either generate their own integrity metadata or | ||
| 252 | are capable of transferring IMD from user space can use the | ||
| 253 | following calls: | ||
| 254 | |||
| 255 | |||
| 256 | struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages); | ||
| 257 | |||
| 258 | Allocates the bio integrity payload and hangs it off of the bio. | ||
| 259 | nr_pages indicates how many pages of protection data need to be | ||
| 260 | stored in the integrity bio_vec list (similar to bio_alloc()). | ||
| 261 | |||
| 262 | The integrity payload will be freed at bio_free() time. | ||
| 263 | |||
| 264 | |||
| 265 | int bio_integrity_add_page(bio, page, len, offset); | ||
| 266 | |||
| 267 | Attaches a page containing integrity metadata to an existing | ||
| 268 | bio. The bio must have an existing bip, | ||
| 269 | i.e. bio_integrity_alloc() must have been called. For a WRITE, | ||
| 270 | the integrity metadata in the pages must be in a format | ||
| 271 | understood by the target device with the notable exception that | ||
| 272 | the sector numbers will be remapped as the request traverses the | ||
| 273 | I/O stack. This implies that the pages added using this call | ||
| 274 | will be modified during I/O! The first reference tag in the | ||
| 275 | integrity metadata must have a value of bip->bip_sector. | ||
| 276 | |||
| 277 | Pages can be added using bio_integrity_add_page() as long as | ||
| 278 | there is room in the bip bio_vec array (nr_pages). | ||
| 279 | |||
| 280 | Upon completion of a READ operation, the attached pages will | ||
| 281 | contain the integrity metadata received from the storage device. | ||
| 282 | It is up to the receiver to process them and verify data | ||
| 283 | integrity upon completion. | ||
| 284 | |||
| 285 | |||
| 286 | 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY | ||
| 287 | METADATA | ||
| 288 | |||
| 289 | To enable integrity exchange on a block device the gendisk must be | ||
| 290 | registered as capable: | ||
| 291 | |||
| 292 | int blk_integrity_register(gendisk, blk_integrity); | ||
| 293 | |||
| 294 | The blk_integrity struct is a template and should contain the | ||
| 295 | following: | ||
| 296 | |||
| 297 | static struct blk_integrity my_profile = { | ||
| 298 | .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", | ||
| 299 | .generate_fn = my_generate_fn, | ||
| 300 | .verify_fn = my_verify_fn, | ||
| 301 | .get_tag_fn = my_get_tag_fn, | ||
| 302 | .set_tag_fn = my_set_tag_fn, | ||
| 303 | .tuple_size = sizeof(struct my_tuple_size), | ||
| 304 | .tag_size = <tag bytes per hw sector>, | ||
| 305 | }; | ||
| 306 | |||
| 307 | 'name' is a text string which will be visible in sysfs. This is | ||
| 308 | part of the userland API so choose it carefully and never change | ||
| 309 | it. The format is standards body-type-variant. | ||
| 310 | E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. | ||
| 311 | |||
| 312 | 'generate_fn' generates appropriate integrity metadata (for WRITE). | ||
| 313 | |||
| 314 | 'verify_fn' verifies that the data buffer matches the integrity | ||
| 315 | metadata. | ||
| 316 | |||
| 317 | 'tuple_size' must be set to match the size of the integrity | ||
| 318 | metadata per sector. I.e. 8 for DIF and EPP. | ||
| 319 | |||
| 320 | 'tag_size' must be set to identify how many bytes of tag space | ||
| 321 | are available per hardware sector. For DIF this is either 2 or | ||
| 322 | 0 depending on the value of the Control Mode Page ATO bit. | ||
| 323 | |||
| 324 | See 5.2 for a description of get_tag_fn and set_tag_fn. | ||
| 325 | |||
| 326 | ---------------------------------------------------------------------- | ||
| 327 | 2007-12-24 Martin K. Petersen <martin.petersen@oracle.com> | ||
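The size arithmetic used throughout data-integrity.txt (520- and 4104-byte sectors, 8 bytes of metadata per sector, 8*16 bits of tag space per 4KB block) and the lighter-weight IP checksum it mentions can be illustrated in userspace C. This is a sketch for illustration only and not kernel code: all function names here are hypothetical, and ip_checksum() is the generic RFC 1071 Internet checksum over big-endian 16-bit words. The document does not say exactly which form of the checksum a given controller expects, so treat that part as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of protection information per sector ("8 for DIF and EPP"). */
#define IMD_TUPLE_SIZE 8u

/* On-disk sector size once the tuple is appended: 512 -> 520, 4096 -> 4104. */
unsigned int protected_sector_size(unsigned int data_bytes)
{
    return data_bytes + IMD_TUPLE_SIZE;
}

/* Bytes of integrity metadata that accompany an I/O of io_bytes. */
size_t imd_bytes(size_t io_bytes, unsigned int sector_size)
{
    return (io_bytes / sector_size) * IMD_TUPLE_SIZE;
}

/*
 * Application tag space available to an I/O when the per-sector tag
 * (tag_size bytes, usually 2) is interleaved across all of its sectors.
 */
size_t tag_space_bytes(size_t io_bytes, unsigned int sector_size,
                       unsigned int tag_size)
{
    return (io_bytes / sector_size) * tag_size;
}

/*
 * RFC 1071 Internet checksum: one's-complement sum of 16-bit big-endian
 * words, then complemented. The 32-bit accumulator is ample for
 * sector-sized buffers (it cannot overflow below 128 KB of input).
 */
uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    if (len & 1)                    /* odd trailing byte, zero padded */
        sum += (uint32_t)data[len - 1] << 8;

    while (sum >> 16)               /* fold the carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For a typical 4KB filesystem block on 512-byte sectors, imd_bytes() gives 64 bytes of metadata and tag_space_bytes() with a 2-byte tag gives 16 bytes, which matches the 8*16 bits quoted in section 3.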
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index 13e4bf054c38..77d3faa1a611 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
| @@ -2,8 +2,11 @@ | |||
| 2 | ======================== | 2 | ======================== |
| 3 | 3 | ||
| 4 | Copyright 2008 Red Hat Inc. | 4 | Copyright 2008 Red Hat Inc. |
| 5 | Author: Steven Rostedt <srostedt@redhat.com> | 5 | Author: Steven Rostedt <srostedt@redhat.com> |
| 6 | License: The GNU Free Documentation License, Version 1.2 | ||
| 7 | Reviewers: Elias Oltmanns and Randy Dunlap | ||
| 6 | 8 | ||
| 9 | Written for: 2.6.26-rc8 linux-2.6-tip.git tip/tracing/ftrace branch | ||
| 7 | 10 | ||
| 8 | Introduction | 11 | Introduction |
| 9 | ------------ | 12 | ------------ |
| @@ -46,7 +49,7 @@ of ftrace. Here is a list of some of the key files: | |||
| 46 | that is configured. | 49 | that is configured. |
| 47 | 50 | ||
| 48 | available_tracers : This holds the different types of tracers that | 51 | available_tracers : This holds the different types of tracers that |
| 49 | has been compiled into the kernel. The tracers | 52 | have been compiled into the kernel. The tracers |
| 50 | listed here can be configured by echoing in their | 53 | listed here can be configured by echoing in their |
| 51 | name into current_tracer. | 54 | name into current_tracer. |
| 52 | 55 | ||
| @@ -90,11 +93,13 @@ of ftrace. Here is a list of some of the key files: | |||
| 90 | trace_entries : This sets or displays the number of trace | 93 | trace_entries : This sets or displays the number of trace |
| 91 | entries each CPU buffer can hold. The tracer buffers | 94 | entries each CPU buffer can hold. The tracer buffers |
| 92 | are the same size for each CPU, so care must be | 95 | are the same size for each CPU, so care must be |
| 93 | taken when modifying the trace_entries. The number | 96 | taken when modifying the trace_entries. The trace |
| 94 | of actually entries will be the number given | 97 | buffers are allocated in pages (blocks of memory that |
| 95 | times the number of possible CPUS. The buffers | 98 | the kernel uses for allocation, usually 4 KB in size). |
| 96 | are saved as individual pages, and the actual entries | 99 | Since each entry is smaller than a page, if the last |
| 97 | will always be rounded up to entries per page. | 100 | allocated page has room for more entries than were |
| 101 | requested, the rest of the page is used to allocate | ||
| 102 | entries. | ||
| 98 | 103 | ||
| 99 | This can only be updated when the current_tracer | 104 | This can only be updated when the current_tracer |
| 100 | is set to "none". | 105 | is set to "none". |
| @@ -114,13 +119,13 @@ of ftrace. Here is a list of some of the key files: | |||
| 114 | in performance. This also has a side effect of | 119 | in performance. This also has a side effect of |
| 115 | enabling or disabling specific functions to be | 120 | enabling or disabling specific functions to be |
| 116 | traced. Echoing in names of functions into this | 121 | traced. Echoing in names of functions into this |
| 117 | file will limit the trace to only those files. | 122 | file will limit the trace to only these functions. |
| 118 | 123 | ||
| 119 | set_ftrace_notrace: This has the opposite effect that | 124 | set_ftrace_notrace: This has the opposite effect that |
| 120 | set_ftrace_filter has. Any function that is added | 125 | set_ftrace_filter has. Any function that is added |
| 121 | here will not be traced. If a function exists | 126 | here will not be traced. If a function exists |
| 122 | in both set_ftrace_filter and set_ftrace_notrace | 127 | in both set_ftrace_filter and set_ftrace_notrace, |
| 123 | the function will _not_ bet traced. | 128 | the function will _not_ be traced. |
| 124 | 129 | ||
| 125 | available_filter_functions : When a function is encountered the first | 130 | available_filter_functions : When a function is encountered the first |
| 126 | time by the dynamic tracer, it is recorded and | 131 | time by the dynamic tracer, it is recorded and |
| @@ -138,7 +143,7 @@ Here are the list of current tracers that can be configured. | |||
| 138 | 143 | ||
| 139 | ftrace - function tracer that uses mcount to trace all functions. | 144 | ftrace - function tracer that uses mcount to trace all functions. |
| 140 | It is possible to filter out which functions that are | 145 | It is possible to filter out which functions that are |
| 141 | traced when dynamic ftrace is configured in. | 146 | to be traced when dynamic ftrace is configured in. |
| 142 | 147 | ||
| 143 | sched_switch - traces the context switches between tasks. | 148 | sched_switch - traces the context switches between tasks. |
| 144 | 149 | ||
| @@ -297,13 +302,13 @@ explains which is which. | |||
| 297 | 302 | ||
| 298 | The above is mostly meaningful for kernel developers. | 303 | The above is mostly meaningful for kernel developers. |
| 299 | 304 | ||
| 300 | time: This differs from the trace output where as the trace output | 305 | time: This differs from the trace file output. The trace file output |
| 301 | contained a absolute timestamp. This timestamp is relative | 306 | included an absolute timestamp. The timestamp used by the |
| 302 | to the start of the first entry in the the trace. | 307 | latency_trace file is relative to the start of the trace. |
| 303 | 308 | ||
| 304 | delay: This is just to help catch your eye a bit better. And | 309 | delay: This is just to help catch your eye a bit better. And |
| 305 | needs to be fixed to be only relative to the same CPU. | 310 | needs to be fixed to be only relative to the same CPU. |
| 306 | The marks is determined by the difference between this | 311 | The marks are determined by the difference between this |
| 307 | current trace and the next trace. | 312 | current trace and the next trace. |
| 308 | '!' - greater than preempt_mark_thresh (default 100) | 313 | '!' - greater than preempt_mark_thresh (default 100) |
| 309 | '+' - greater than 1 microsecond | 314 | '+' - greater than 1 microsecond |
| @@ -322,13 +327,13 @@ output. To see what is available, simply cat the file: | |||
| 322 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ | 327 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ |
| 323 | noblock nostacktrace nosched-tree | 328 | noblock nostacktrace nosched-tree |
| 324 | 329 | ||
| 325 | To disable one of the options, echo in the option appended with "no". | 330 | To disable one of the options, echo in the option prepended with "no". |
| 326 | 331 | ||
| 327 | echo noprint-parent > /debug/tracing/iter_ctrl | 332 | echo noprint-parent > /debug/tracing/iter_ctrl |
| 328 | 333 | ||
| 329 | To enable an option, leave off the "no". | 334 | To enable an option, leave off the "no". |
| 330 | 335 | ||
| 331 | echo sym-offest > /debug/tracing/iter_ctrl | 336 | echo sym-offset > /debug/tracing/iter_ctrl |
| 332 | 337 | ||
| 333 | Here are the available options: | 338 | Here are the available options: |
| 334 | 339 | ||
| @@ -344,7 +349,7 @@ Here are the available options: | |||
| 344 | 349 | ||
| 345 | sym-offset - Display not only the function name, but also the offset | 350 | sym-offset - Display not only the function name, but also the offset |
| 346 | in the function. For example, instead of seeing just | 351 | in the function. For example, instead of seeing just |
| 347 | "ktime_get" you will see "ktime_get+0xb/0x20" | 352 | "ktime_get", you will see "ktime_get+0xb/0x20". |
| 348 | 353 | ||
| 349 | sym-offset: | 354 | sym-offset: |
| 350 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 | 355 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 |
| @@ -364,7 +369,7 @@ Here are the available options: | |||
| 364 | user applications that can translate the raw numbers better than | 369 | user applications that can translate the raw numbers better than |
| 365 | having it done in the kernel. | 370 | having it done in the kernel. |
| 366 | 371 | ||
| 367 | hex - similar to raw, but the numbers will be in a hexadecimal format. | 372 | hex - Similar to raw, but the numbers will be in a hexadecimal format. |
| 368 | 373 | ||
| 369 | bin - This will print out the formats in raw binary. | 374 | bin - This will print out the formats in raw binary. |
| 370 | 375 | ||
| @@ -381,7 +386,7 @@ sched_switch | |||
| 381 | ------------ | 386 | ------------ |
| 382 | 387 | ||
| 383 | This tracer simply records schedule switches. Here's an example | 388 | This tracer simply records schedule switches. Here's an example |
| 384 | on how to implement it. | 389 | of how to use it. |
| 385 | 390 | ||
| 386 | # echo sched_switch > /debug/tracing/current_tracer | 391 | # echo sched_switch > /debug/tracing/current_tracer |
| 387 | # echo 1 > /debug/tracing/tracing_enabled | 392 | # echo 1 > /debug/tracing/tracing_enabled |
| @@ -470,7 +475,7 @@ interrupt from triggering or the mouse interrupt from letting the | |||
| 470 | kernel know of a new mouse event. The result is a latency with the | 475 | kernel know of a new mouse event. The result is a latency with the |
| 471 | reaction time. | 476 | reaction time. |
| 472 | 477 | ||
| 473 | The irqsoff tracer tracks the time interrupts are disabled and when | 478 | The irqsoff tracer tracks the time interrupts are disabled to the time |
| 474 | they are re-enabled. When a new maximum latency is hit, it saves off | 479 | they are re-enabled. When a new maximum latency is hit, it saves off |
| 475 | the trace so that it may be retrieved at a later time. Every time a | 480 | the trace so that it may be retrieved at a later time. Every time a |
| 476 | new maximum is reached, the old saved trace is discarded and the new | 481 | new maximum is reached, the old saved trace is discarded and the new |
| @@ -519,7 +524,7 @@ The difference between the 6 and the displayed timestamp 7us is | |||
| 519 | because the clock must have incremented between the time of recording | 524 | because the clock must have incremented between the time of recording |
| 520 | the max latency and recording the function that had that latency. | 525 | the max latency and recording the function that had that latency. |
| 521 | 526 | ||
| 522 | Note the above had ftrace_enabled not set. If we set the ftrace_enabled | 527 | Note the above had ftrace_enabled not set. If we set the ftrace_enabled, |
| 523 | we get a much larger output: | 528 | we get a much larger output: |
| 524 | 529 | ||
| 525 | # tracer: irqsoff | 530 | # tracer: irqsoff |
| @@ -570,21 +575,21 @@ vim:ft=help | |||
| 570 | 575 | ||
| 571 | 576 | ||
| 572 | Here we traced a 50 microsecond latency. But we also see all the | 577 | Here we traced a 50 microsecond latency. But we also see all the |
| 573 | functions that were called during that time. Note that enabling | 578 | functions that were called during that time. Note that by enabling |
| 574 | function tracing we endure an added overhead. This overhead may | 579 | function tracing, we endure an added overhead. This overhead may |
| 575 | extend the latency times. But never the less, this trace has provided | 580 | extend the latency times. But nevertheless, this trace has provided |
| 576 | some very helpful debugging. | 581 | some very helpful debugging information. |
| 577 | 582 | ||
| 578 | 583 | ||
| 579 | preemptoff | 584 | preemptoff |
| 580 | ---------- | 585 | ---------- |
| 581 | 586 | ||
| 582 | When preemption is disabled we may be able to receive interrupts but | 587 | When preemption is disabled, we may be able to receive interrupts but |
| 583 | the task can not be preempted and a higher priority task must wait | 588 | the task cannot be preempted and a higher priority task must wait |
| 584 | for preemption to be enabled again before it can preempt a lower | 589 | for preemption to be enabled again before it can preempt a lower |
| 585 | priority task. | 590 | priority task. |
| 586 | 591 | ||
| 587 | The preemptoff tracer traces the places that disables preemption. | 592 | The preemptoff tracer traces the places that disable preemption. |
| 588 | Like the irqsoff, it records the maximum latency that preemption | 593 | Like the irqsoff, it records the maximum latency that preemption |
| 589 | was disabled. The control of preemptoff is much like the irqsoff. | 594 | was disabled. The control of preemptoff is much like the irqsoff. |
| 590 | 595 | ||
| @@ -696,7 +701,7 @@ Notice that the __do_softirq when called doesn't have a preempt_count. | |||
| 696 | It may seem that we missed a preempt enabled. What really happened | 701 | It may seem that we missed a preempt enabled. What really happened |
| 697 | is that the preempt count is held on the thread's stack and we | 702 | is that the preempt count is held on the thread's stack and we |
| 698 | switched to the softirq stack (4K stacks in effect). The code | 703 | switched to the softirq stack (4K stacks in effect). The code |
| 699 | does not copy the preempt count, but because interrupts are disabled | 704 | does not copy the preempt count, but because interrupts are disabled, |
| 700 | we don't need to worry about it. Having a tracer like this is good | 705 | we don't need to worry about it. Having a tracer like this is good |
| 701 | to let people know what really happens inside the kernel. | 706 | to let people know what really happens inside the kernel. |
| 702 | 707 | ||
| @@ -732,7 +737,7 @@ To record this time, use the preemptirqsoff tracer. | |||
| 732 | 737 | ||
| 733 | Again, using this trace is much like the irqsoff and preemptoff tracers. | 738 | Again, using this trace is much like the irqsoff and preemptoff tracers. |
| 734 | 739 | ||
| 735 | # echo preemptoff > /debug/tracing/current_tracer | 740 | # echo preemptirqsoff > /debug/tracing/current_tracer |
| 736 | # echo 0 > /debug/tracing/tracing_max_latency | 741 | # echo 0 > /debug/tracing/tracing_max_latency |
| 737 | # echo 1 > /debug/tracing/tracing_enabled | 742 | # echo 1 > /debug/tracing/tracing_enabled |
| 738 | # ls -ltr | 743 | # ls -ltr |
| @@ -862,9 +867,9 @@ This is a very interesting trace. It started with the preemption of | |||
| 862 | the ls task. We see that the task had the "need_resched" bit set | 867 | the ls task. We see that the task had the "need_resched" bit set |
| 863 | with the 'N' in the trace. Interrupts are disabled in the spin_lock | 868 | with the 'N' in the trace. Interrupts are disabled in the spin_lock |
| 864 | and the trace started. We see that a schedule took place to run | 869 | and the trace started. We see that a schedule took place to run |
| 865 | sshd. When the interrupts were enabled we took an interrupt. | 870 | sshd. When the interrupts were enabled, we took an interrupt. |
| 866 | On return of the interrupt the softirq ran. We took another interrupt | 871 | On return from the interrupt handler, the softirq ran. We took another |
| 867 | while running the softirq as we see with the capital 'H'. | 872 | interrupt while running the softirq as we see with the capital 'H'. |
| 868 | 873 | ||
| 869 | 874 | ||
| 870 | wakeup | 875 | wakeup |
| @@ -876,9 +881,9 @@ time it executes. This is also known as "schedule latency". | |||
| 876 | I stress the point that this is about RT tasks. It is also important | 881 | I stress the point that this is about RT tasks. It is also important |
| 877 | to know the scheduling latency of non-RT tasks, but the average | 882 | to know the scheduling latency of non-RT tasks, but the average |
| 878 | schedule latency is better for non-RT tasks. Tools like | 883 | schedule latency is better for non-RT tasks. Tools like |
| 879 | LatencyTop is more appropriate for such measurements. | 884 | LatencyTop are more appropriate for such measurements. |
| 880 | 885 | ||
| 881 | Real-Time environments is interested in the worst case latency. | 886 | Real-Time environments are interested in the worst case latency. |
| 882 | That is the longest latency it takes for something to happen, and | 887 | That is the longest latency it takes for something to happen, and |
| 883 | not the average. We can have a very fast scheduler that may only | 888 | not the average. We can have a very fast scheduler that may only |
| 884 | have a large latency once in a while, but that would not work well | 889 | have a large latency once in a while, but that would not work well |
| @@ -889,8 +894,8 @@ tasks that are unpredictable will overwrite the worst case latency | |||
| 889 | of RT tasks. | 894 | of RT tasks. |
| 890 | 895 | ||
| 891 | Since this tracer only deals with RT tasks, we will run this slightly | 896 | Since this tracer only deals with RT tasks, we will run this slightly |
| 892 | different than we did with the previous tracers. Instead of performing | 897 | differently than we did with the previous tracers. Instead of performing |
| 893 | an 'ls' we will run 'sleep 1' under 'chrt' which changes the | 898 | an 'ls', we will run 'sleep 1' under 'chrt' which changes the |
| 894 | priority of the task. | 899 | priority of the task. |
| 895 | 900 | ||
| 896 | # echo wakeup > /debug/tracing/current_tracer | 901 | # echo wakeup > /debug/tracing/current_tracer |
| @@ -924,9 +929,9 @@ wakeup latency trace v1.1.5 on 2.6.26-rc8 | |||
| 924 | vim:ft=help | 929 | vim:ft=help |
| 925 | 930 | ||
| 926 | 931 | ||
| 927 | Running this on an idle system we see that it only took 4 microseconds | 932 | Running this on an idle system, we see that it only took 4 microseconds |
| 928 | to perform the task switch. Note, since the trace marker in the | 933 | to perform the task switch. Note, since the trace marker in the |
| 929 | schedule is before the actual "switch" we stop the tracing when | 934 | schedule is before the actual "switch", we stop the tracing when |
| 930 | the recorded task is about to schedule in. This may change if | 935 | the recorded task is about to schedule in. This may change if |
| 931 | we add a new marker at the end of the scheduler. | 936 | we add a new marker at the end of the scheduler. |
| 932 | 937 | ||
| @@ -992,12 +997,15 @@ ksoftirq-7 1d..4 50us : schedule (__cond_resched) | |||
| 992 | 997 | ||
| 993 | The interrupt went off while running ksoftirqd. This task runs at | 998 | The interrupt went off while running ksoftirqd. This task runs at |
| 994 | SCHED_OTHER. Why didn't we see the 'N' set early? This may be | 999 | SCHED_OTHER. Why didn't we see the 'N' set early? This may be |
| 995 | a harmless bug with x86_32 and 4K stacks. The need_reched() function | 1000 | a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks |
| 996 | that tests if we need to reschedule looks on the actual stack. | 1001 | configured, the interrupt and softirq run with their own stack. |
| 997 | Where as the setting of the NEED_RESCHED bit happens on the | 1002 | Some information is held on the top of the task's stack (need_resched |
| 998 | task's stack. But because we are in a hard interrupt, the test | 1003 | and preempt_count are both stored there). The setting of the NEED_RESCHED |
| 999 | is with the interrupts stack which has that to be false. We don't | 1004 | bit is done directly to the task's stack, but the reading of the |
| 1000 | see the 'N' until we switch back to the task's stack. | 1005 | NEED_RESCHED is done by looking at the current stack, which in this case |
| 1006 | is the stack for the hard interrupt. This hides the fact that NEED_RESCHED | ||
| 1007 | has been set. We don't see the 'N' until we switch back to the task's | ||
| 1008 | assigned stack. | ||
| 1001 | 1009 | ||
| 1002 | ftrace | 1010 | ftrace |
| 1003 | ------ | 1011 | ------ |
| @@ -1067,10 +1075,10 @@ this works is the mcount function call (placed at the start of | |||
| 1067 | every kernel function, produced by the -pg switch in gcc), starts | 1075 | every kernel function, produced by the -pg switch in gcc), starts |
| 1068 | off pointing to a simple return. | 1076 | off pointing to a simple return. |
| 1069 | 1077 | ||
| 1070 | When dynamic ftrace is initialized, it calls kstop_machine to make it | 1078 | When dynamic ftrace is initialized, it calls kstop_machine to make |
| 1071 | act like a uniprocessor so that it can freely modify code without | 1079 | the machine act like a uniprocessor so that it can freely modify code |
| 1072 | worrying about other processors executing that same code. At | 1080 | without worrying about other processors executing that same code. At |
| 1073 | initialization, the mcount calls are change to call a "record_ip" | 1081 | initialization, the mcount calls are changed to call a "record_ip" |
| 1074 | function. After this, the first time a kernel function is called, | 1082 | function. After this, the first time a kernel function is called, |
| 1075 | it has the calling address saved in a hash table. | 1083 | it has the calling address saved in a hash table. |
| 1076 | 1084 | ||
| @@ -1085,8 +1093,8 @@ traced, is that we can now selectively choose which functions we | |||
| 1085 | want to trace and which ones we want the mcount calls to remain as | 1093 | want to trace and which ones we want the mcount calls to remain as |
| 1086 | nops. | 1094 | nops. |
| 1087 | 1095 | ||
| 1088 | Two files that contain to the enabling and disabling of recorded | 1096 | Two files are used, one for enabling and one for disabling the tracing |
| 1089 | functions are: | 1097 | of recorded functions. They are: |
| 1090 | 1098 | ||
| 1091 | set_ftrace_filter | 1099 | set_ftrace_filter |
| 1092 | 1100 | ||
| @@ -1094,7 +1102,7 @@ and | |||
| 1094 | 1102 | ||
| 1095 | set_ftrace_notrace | 1103 | set_ftrace_notrace |
| 1096 | 1104 | ||
| 1097 | A list of available functions that you can add to this files is listed | 1105 | A list of available functions that you can add to these files is listed |
| 1098 | in: | 1106 | in: |
| 1099 | 1107 | ||
| 1100 | available_filter_functions | 1108 | available_filter_functions |
| @@ -1133,9 +1141,9 @@ sys_nanosleep | |||
| 1133 | 1141 | ||
| 1134 | 1142 | ||
| 1135 | Perhaps this isn't enough. The filters also allow simple wild cards. | 1143 | Perhaps this isn't enough. The filters also allow simple wild cards. |
| 1136 | Only the following is currently available | 1144 | Only the following are currently available |
| 1137 | 1145 | ||
| 1138 | <match>* - will match functions that begins with <match> | 1146 | <match>* - will match functions that begin with <match> |
| 1139 | *<match> - will match functions that end with <match> | 1147 | *<match> - will match functions that end with <match> |
| 1140 | *<match>* - will match functions that have <match> in it | 1148 | *<match>* - will match functions that have <match> in it |
| 1141 | 1149 | ||
| @@ -1187,7 +1195,7 @@ This is because the '>' and '>>' act just like they do in bash. | |||
| 1187 | To rewrite the filters, use '>' | 1195 | To rewrite the filters, use '>' |
| 1188 | To append to the filters, use '>>' | 1196 | To append to the filters, use '>>' |
| 1189 | 1197 | ||
| 1190 | To clear out a filter so that all functions will be recorded again. | 1198 | To clear out a filter so that all functions will be recorded again: |
| 1191 | 1199 | ||
| 1192 | # echo > /debug/tracing/set_ftrace_filter | 1200 | # echo > /debug/tracing/set_ftrace_filter |
| 1193 | # cat /debug/tracing/set_ftrace_filter | 1201 | # cat /debug/tracing/set_ftrace_filter |
| @@ -1246,8 +1254,8 @@ ftraced | |||
| 1246 | 1254 | ||
| 1247 | As mentioned above, when dynamic ftrace is configured in, a kernel | 1255 | As mentioned above, when dynamic ftrace is configured in, a kernel |
| 1248 | thread wakes up once a second and checks to see if there are mcount | 1256 | thread wakes up once a second and checks to see if there are mcount |
| 1249 | calls that need to be converted into nops. If there is not, then | 1257 | calls that need to be converted into nops. If there are not any, then |
| 1250 | it simply goes back to sleep. But if there is, it will call | 1258 | it simply goes back to sleep. But if there are some, it will call |
| 1251 | kstop_machine to convert the calls to nops. | 1259 | kstop_machine to convert the calls to nops. |
| 1252 | 1260 | ||
| 1253 | There may be a case that you do not want this added latency. | 1261 | There may be a case that you do not want this added latency. |
| @@ -1262,8 +1270,8 @@ mcount calls to nops. Remember that there's a large overhead | |||
| 1262 | to calling mcount. Without this kernel thread, that overhead will | 1270 | to calling mcount. Without this kernel thread, that overhead will |
| 1263 | exist. | 1271 | exist. |
| 1264 | 1272 | ||
| 1265 | Any write to the ftraced_enabled file will cause the kstop_machine | 1273 | If there are recorded calls to mcount, any write to the ftraced_enabled |
| 1266 | to run if there are recorded calls to mcount. This means that a | 1274 | file will cause the kstop_machine to run. This means that a |
| 1267 | user can manually perform the updates when they want to by simply | 1275 | user can manually perform the updates when they want to by simply |
| 1268 | echoing a '0' into the ftraced_enabled file. | 1276 | echoing a '0' into the ftraced_enabled file. |
| 1269 | 1277 | ||
| @@ -1315,7 +1323,7 @@ trace entries | |||
| 1315 | 1323 | ||
| 1316 | Having too much or not enough data can be troublesome in diagnosing | 1324 | Having too much or not enough data can be troublesome in diagnosing |
| 1317 | some issue in the kernel. The file trace_entries is used to modify | 1325 | some issue in the kernel. The file trace_entries is used to modify |
| 1318 | the size of the internal trace buffers. The numbers listed | 1326 | the size of the internal trace buffers. The number listed |
| 1319 | is the number of entries that can be recorded per CPU. To know | 1327 | is the number of entries that can be recorded per CPU. To know |
| 1320 | the full size, multiply the number of possible CPUS with the | 1328 | the full size, multiply the number of possible CPUS with the |
| 1321 | number of entries. | 1329 | number of entries. |
| @@ -1323,7 +1331,7 @@ number of entries. | |||
| 1323 | # cat /debug/tracing/trace_entries | 1331 | # cat /debug/tracing/trace_entries |
| 1324 | 65620 | 1332 | 65620 |
| 1325 | 1333 | ||
| 1326 | Note, to modify this you must have tracing fulling disabled. To do that, | 1334 | Note, to modify this, you must have tracing completely disabled. To do that, |
| 1327 | echo "none" into the current_tracer. | 1335 | echo "none" into the current_tracer. |
| 1328 | 1336 | ||
| 1329 | # echo none > /debug/tracing/current_tracer | 1337 | # echo none > /debug/tracing/current_tracer |
| @@ -1344,7 +1352,7 @@ it will add them. | |||
| 1344 | This shows us that 85 entries can fit on a single page. | 1352 | This shows us that 85 entries can fit on a single page. |
| 1345 | 1353 | ||
| 1346 | The number of pages that will be allocated is a percentage of available | 1354 | The number of pages that will be allocated is a percentage of available |
| 1347 | memory. Allocating too much will produces an error. | 1355 | memory. Allocating too much will produce an error. |
| 1348 | 1356 | ||
| 1349 | # echo 1000000000000 > /debug/tracing/trace_entries | 1357 | # echo 1000000000000 > /debug/tracing/trace_entries |
| 1350 | -bash: echo: write error: Cannot allocate memory | 1358 | -bash: echo: write error: Cannot allocate memory |
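
The trace_entries hunks above describe a per-CPU entry count that is held in whole pages, with 85 entries fitting on one page on the machine shown. As a rough illustration of that arithmetic (not the kernel's code), the sketch below rounds a request up to a whole number of pages and multiplies by the CPU count. The function names, the `ENTRIES_PER_PAGE` constant, and the round-up rule are assumptions drawn from the sample output; the real per-page count depends on the entry size and page size.

```python
# Illustrative model only; names and the rounding rule are assumptions.
ENTRIES_PER_PAGE = 85  # value reported in the example above; not universal


def round_up_entries(requested, per_page=ENTRIES_PER_PAGE):
    """Round a requested per-CPU entry count up to whole pages."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    pages = -(-requested // per_page)  # ceiling division
    return pages * per_page


def total_entries(per_cpu_entries, possible_cpus):
    """Full buffer size: per-CPU entries times the number of possible CPUs."""
    return per_cpu_entries * possible_cpus
```

Under these assumptions a request for a single entry yields 85, and the 65620 shown above is exactly 772 pages of 85 entries.
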
diff --git a/Documentation/ioctl-number.txt b/Documentation/ioctl-number.txt index 240ce7a56c40..3bb5f466a90d 100644 --- a/Documentation/ioctl-number.txt +++ b/Documentation/ioctl-number.txt | |||
| @@ -117,6 +117,7 @@ Code Seq# Include File Comments | |||
| 117 | <mailto:natalia@nikhefk.nikhef.nl> | 117 | <mailto:natalia@nikhefk.nikhef.nl> |
| 118 | 'c' 00-7F linux/comstats.h conflict! | 118 | 'c' 00-7F linux/comstats.h conflict! |
| 119 | 'c' 00-7F linux/coda.h conflict! | 119 | 'c' 00-7F linux/coda.h conflict! |
| 120 | 'c' 80-9F asm-s390/chsc.h | ||
| 120 | 'd' 00-FF linux/char/drm/drm/h conflict! | 121 | 'd' 00-FF linux/char/drm/drm/h conflict! |
| 121 | 'd' 00-DF linux/video_decoder.h conflict! | 122 | 'd' 00-DF linux/video_decoder.h conflict! |
| 122 | 'd' F0-FF linux/digi1.h | 123 | 'd' F0-FF linux/digi1.h |
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index b8e52c0355d3..9691c7f5166c 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt | |||
| @@ -109,7 +109,7 @@ There are two possible methods of using Kdump. | |||
| 109 | 2) Or use the system kernel binary itself as dump-capture kernel and there is | 109 | 2) Or use the system kernel binary itself as dump-capture kernel and there is |
| 110 | no need to build a separate dump-capture kernel. This is possible | 110 | no need to build a separate dump-capture kernel. This is possible |
| 111 | only with the architectures which support a relocatable kernel. As | 111 | only with the architectures which support a relocatable kernel. As |
| 112 | of today i386 and ia64 architectures support relocatable kernel. | 112 | of today, i386, x86_64 and ia64 architectures support relocatable kernel. |
| 113 | 113 | ||
| 114 | Building a relocatable kernel is advantageous from the point of view that | 114 | Building a relocatable kernel is advantageous from the point of view that |
| 115 | one does not have to build a second kernel for capturing the dump. But | 115 | one does not have to build a second kernel for capturing the dump. But |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b52f47d588b4..b3a5aad7e629 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
| @@ -271,6 +271,17 @@ and is between 256 and 4096 characters. It is defined in the file | |||
| 271 | aic79xx= [HW,SCSI] | 271 | aic79xx= [HW,SCSI] |
| 272 | See Documentation/scsi/aic79xx.txt. | 272 | See Documentation/scsi/aic79xx.txt. |
| 273 | 273 | ||
| 274 | amd_iommu= [HW,X86-64] | ||
| 275 | Pass parameters to the AMD IOMMU driver in the system. | ||
| 276 | Possible values are: | ||
| 277 | isolate - enable device isolation (each device, as far | ||
| 278 | as possible, will get its own protection | ||
| 279 | domain) | ||
| 280 | amd_iommu_size= [HW,X86-64] | ||
| 281 | Define the size of the aperture for the AMD IOMMU | ||
| 282 | driver. Possible values are: | ||
| 283 | '32M', '64M' (default), '128M', '256M', '512M', '1G' | ||
| 284 | |||
| 274 | amijoy.map= [HW,JOY] Amiga joystick support | 285 | amijoy.map= [HW,JOY] Amiga joystick support |
| 275 | Map of devices attached to JOY0DAT and JOY1DAT | 286 | Map of devices attached to JOY0DAT and JOY1DAT |
| 276 | Format: <a>,<b> | 287 | Format: <a>,<b> |
| @@ -599,6 +610,29 @@ and is between 256 and 4096 characters. It is defined in the file | |||
| 599 | See drivers/char/README.epca and | 610 | See drivers/char/README.epca and |
| 600 | Documentation/digiepca.txt. | 611 | Documentation/digiepca.txt. |
| 601 | 612 | ||
| 613 | disable_mtrr_cleanup [X86] | ||
| 614 | enable_mtrr_cleanup [X86] | ||
| 615 | The kernel tries to adjust MTRR layout from continuous | ||
| 616 | to discrete, to make X server driver able to add WB | ||
| 617 | entry later. This parameter enables/disables that. | ||
| 618 | |||
| 619 | mtrr_chunk_size=nn[KMG] [X86] | ||
| 620 | Used for mtrr cleanup. It is the largest continuous chunk | ||
| 621 | that could hold holes, aka UC entries. | ||
| 622 | |||
| 623 | mtrr_gran_size=nn[KMG] [X86] | ||
| 624 | Used for mtrr cleanup. It is granularity of mtrr block. | ||
| 625 | Default is 1. | ||
| 626 | Large value could prevent small alignment from | ||
| 627 | using up MTRRs. | ||
| 628 | |||
| 629 | mtrr_spare_reg_nr=n [X86] | ||
| 630 | Format: <integer> | ||
| 631 | Range: 0,7 : spare reg number | ||
| 632 | Default : 1 | ||
| 633 | Used for mtrr cleanup. It is the number of spare mtrr entries. | ||
| 634 | Set to 2 or more if your graphics card needs more. | ||
| 635 | |||
| 602 | disable_mtrr_trim [X86, Intel and AMD only] | 636 | disable_mtrr_trim [X86, Intel and AMD only] |
| 603 | By default the kernel will trim any uncacheable | 637 | By default the kernel will trim any uncacheable |
| 604 | memory out of your available memory pool based on | 638 | memory out of your available memory pool based on |
| @@ -1208,6 +1242,11 @@ and is between 256 and 4096 characters. It is defined in the file | |||
| 1208 | mtdparts= [MTD] | 1242 | mtdparts= [MTD] |
| 1209 | See drivers/mtd/cmdlinepart.c. | 1243 | See drivers/mtd/cmdlinepart.c. |
| 1210 | 1244 | ||
| 1245 | mtdset= [ARM] | ||
| 1246 | ARM/S3C2412 JIVE boot control | ||
| 1247 | |||
| 1248 | See arch/arm/mach-s3c2412/mach-jive.c | ||
| 1249 | |||
| 1211 | mtouchusb.raw_coordinates= | 1250 | mtouchusb.raw_coordinates= |
| 1212 | [HW] Make the MicroTouch USB driver use raw coordinates | 1251 | [HW] Make the MicroTouch USB driver use raw coordinates |
| 1213 | ('y', default) or cooked coordinates ('n') | 1252 | ('y', default) or cooked coordinates ('n') |
| @@ -2116,6 +2155,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
| 2116 | usbhid.mousepoll= | 2155 | usbhid.mousepoll= |
| 2117 | [USBHID] The interval which mice are to be polled at. | 2156 | [USBHID] The interval which mice are to be polled at. |
| 2118 | 2157 | ||
| 2158 | add_efi_memmap [EFI; x86-32,X86-64] Include EFI memory map in | ||
| 2159 | kernel's map of available physical RAM. | ||
| 2160 | |||
| 2119 | vdso= [X86-32,SH,x86-64] | 2161 | vdso= [X86-32,SH,x86-64] |
| 2120 | vdso=2: enable compat VDSO (default with COMPAT_VDSO) | 2162 | vdso=2: enable compat VDSO (default with COMPAT_VDSO) |
| 2121 | vdso=1: enable VDSO (default) | 2163 | vdso=1: enable VDSO (default) |
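
Several of the parameters added above, such as mtrr_chunk_size and mtrr_gran_size, take a value of the form nn[KMG]. A minimal sketch of how such a value maps to bytes, assuming the binary multiples (1K = 1024) that kernel size parameters conventionally use. `parse_size` is a hypothetical helper written for illustration, not a kernel function.

```python
# Hypothetical helper for illustration; not part of the kernel.
_SHIFT = {"": 0, "K": 10, "M": 20, "G": 30}  # assumed binary multiples


def parse_size(text):
    """Convert a string such as '64M' or '4096' to a byte count."""
    text = text.strip()
    suffix = text[-1:].upper() if text[-1:].isalpha() else ""
    digits = text[:-1] if suffix else text
    if suffix not in _SHIFT or not digits.isdigit():
        raise ValueError("expected nn[KMG], got %r" % text)
    return int(digits) << _SHIFT[suffix]
```

With this reading, the amd_iommu_size default of '64M' corresponds to 67108864 bytes.
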
diff --git a/Documentation/nmi_watchdog.txt b/Documentation/nmi_watchdog.txt index 757c729ee42e..90aa4531cb67 100644 --- a/Documentation/nmi_watchdog.txt +++ b/Documentation/nmi_watchdog.txt | |||
| @@ -10,7 +10,7 @@ us to generate 'watchdog NMI interrupts'. (NMI: Non Maskable Interrupt | |||
| 10 | which get executed even if the system is otherwise locked up hard). | 10 | which get executed even if the system is otherwise locked up hard). |
| 11 | This can be used to debug hard kernel lockups. By executing periodic | 11 | This can be used to debug hard kernel lockups. By executing periodic |
| 12 | NMI interrupts, the kernel can monitor whether any CPU has locked up, | 12 | NMI interrupts, the kernel can monitor whether any CPU has locked up, |
| 13 | and print out debugging messages if so. | 13 | and print out debugging messages if so. |
| 14 | 14 | ||
| 15 | In order to use the NMI watchdog, you need to have APIC support in your | 15 | In order to use the NMI watchdog, you need to have APIC support in your |
| 16 | kernel. For SMP kernels, APIC support gets compiled in automatically. For | 16 | kernel. For SMP kernels, APIC support gets compiled in automatically. For |
| @@ -22,8 +22,7 @@ CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain | |||
| 22 | kernel debugging options, such as Kernel Stack Meter or Kernel Tracer, | 22 | kernel debugging options, such as Kernel Stack Meter or Kernel Tracer, |
| 23 | may implicitly disable the NMI watchdog.] | 23 | may implicitly disable the NMI watchdog.] |
| 24 | 24 | ||
| 25 | For x86-64, the needed APIC is always compiled in, and the NMI watchdog is | 25 | For x86-64, the needed APIC is always compiled in. |
| 26 | always enabled with I/O-APIC mode (nmi_watchdog=1). | ||
| 27 | 26 | ||
| 28 | Using local APIC (nmi_watchdog=2) needs the first performance register, so | 27 | Using local APIC (nmi_watchdog=2) needs the first performance register, so |
| 29 | you can't use it for other purposes (such as high precision performance | 28 | you can't use it for other purposes (such as high precision performance |
| @@ -63,16 +62,15 @@ when the system is idle), but if your system locks up on anything but the | |||
| 63 | "hlt", then you are out of luck -- the event will not happen at all and the | 62 | "hlt", then you are out of luck -- the event will not happen at all and the |
| 64 | watchdog won't trigger. This is a shortcoming of the local APIC watchdog | 63 | watchdog won't trigger. This is a shortcoming of the local APIC watchdog |
| 65 | -- unfortunately there is no "clock ticks" event that would work all the | 64 | -- unfortunately there is no "clock ticks" event that would work all the |
| 66 | time. The I/O APIC watchdog is driven externally and has no such shortcoming. | 65 | time. The I/O APIC watchdog is driven externally and has no such shortcoming. |
| 67 | But its NMI frequency is much higher, resulting in a more significant hit | 66 | But its NMI frequency is much higher, resulting in a more significant hit |
| 68 | to the overall system performance. | 67 | to the overall system performance. |
| 69 | 68 | ||
| 70 | NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default, | 69 | On x86 nmi_watchdog is disabled by default so you have to enable it with |
| 71 | you have to enable it with a boot time parameter. Prior to 2.4.2-ac18 | 70 | a boot time parameter. |
| 72 | the NMI-oopser is enabled unconditionally on x86 SMP boxes. | ||
| 73 | 71 | ||
| 74 | On x86-64 the NMI oopser is on by default. On 64bit Intel CPUs | 72 | NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally |
| 75 | it uses IO-APIC by default and on AMD it uses local APIC. | 73 | on x86 SMP boxes. |
| 76 | 74 | ||
| 77 | [ feel free to send bug reports, suggestions and patches to | 75 | [ feel free to send bug reports, suggestions and patches to |
| 78 | Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing | 76 | Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing |
diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt index a9e990ab980f..373ceacc367e 100644 --- a/Documentation/scheduler/sched-domains.txt +++ b/Documentation/scheduler/sched-domains.txt | |||
| @@ -61,10 +61,7 @@ builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your | |||
| 61 | arch_init_sched_domains function. This function will attach domains to all | 61 | arch_init_sched_domains function. This function will attach domains to all |
| 62 | CPUs using cpu_attach_domain. | 62 | CPUs using cpu_attach_domain. |
| 63 | 63 | ||
| 64 | Implementors should change the line | 64 | The sched-domains debugging infrastructure can be enabled by enabling |
| 65 | #undef SCHED_DOMAIN_DEBUG | 65 | CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains |
| 66 | to | ||
| 67 | #define SCHED_DOMAIN_DEBUG | ||
| 68 | in kernel/sched.c as this enables an error checking parse of the sched domains | ||
| 69 | which should catch most possible errors (described above). It also prints out | 66 | which should catch most possible errors (described above). It also prints out |
| 70 | the domain structure in a visual format. | 67 | the domain structure in a visual format. |
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 14f901f639ee..3ef339f491e0 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt | |||
| @@ -51,9 +51,9 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = | |||
| 51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time | 51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time |
| 52 | of 0.00015s. | 52 | of 0.00015s. |
| 53 | 53 | ||
| 54 | The remaining CPU time will be used for user input and other tass. Because | 54 | The remaining CPU time will be used for user input and other tasks. Because |
| 55 | realtime tasks have explicitly allocated the CPU time they need to perform | 55 | realtime tasks have explicitly allocated the CPU time they need to perform |
| 56 | their tasks, buffer underruns in the graphocs or audio can be eliminated. | 56 | their tasks, buffer underruns in the graphics or audio can be eliminated. |
| 57 | 57 | ||
| 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still | 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
| 59 | lack an EDF scheduler to make non-uniform periods usable. | 59 | lack an EDF scheduler to make non-uniform periods usable. |
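
The arithmetic in the sched-rt-group.txt hunk above (a 0.005s period with about 3% CPU need gives a 0.00015s run time) can be checked with a short sketch. The helper name and the choice of integer microseconds are illustrative assumptions, not part of any kernel interface.

```python
# Illustrative sketch only: rt_runtime_us() is a hypothetical helper, not a
# kernel or library API. It reproduces the period/run-time arithmetic from
# the text using integer microseconds to avoid floating-point rounding.

def rt_runtime_us(period_us: int, cpu_percent: int) -> int:
    """Return the run time (in microseconds) a group needs per period."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if not 0 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 0 and 100")
    return period_us * cpu_percent // 100


# The example from the text: period 0.005s (5000 us), about 3% CPU time.
period_us = 5000
runtime_us = rt_runtime_us(period_us, 3)
print(f"period={period_us / 1e6:.5f}s runtime={runtime_us / 1e6:.5f}s")
```

Working in whole microseconds keeps the result exact; multiplying 0.03 by 0.005 directly in floating point may leave a small rounding residue.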
diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 0bbee38acd26..72aff61e7315 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt | |||
| @@ -753,8 +753,11 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. | |||
| 753 | 753 | ||
| 754 | [Multiple options for each card instance] | 754 | [Multiple options for each card instance] |
| 755 | model - force the model name | 755 | model - force the model name |
| 756 | position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size) | 756 | position_fix - Fix DMA pointer (0 = auto, 1 = use LPIB, 2 = POSBUF) |
| 757 | probe_mask - Bitmask to probe codecs (default = -1, meaning all slots) | 757 | probe_mask - Bitmask to probe codecs (default = -1, meaning all slots) |
| 758 | bdl_pos_adj - Specifies the DMA IRQ timing delay in samples. | ||
| 759 | Passing -1 will make the driver to choose the appropriate | ||
| 760 | value based on the controller chip. | ||
| 758 | 761 | ||
| 759 | [Single (global) options] | 762 | [Single (global) options] |
| 760 | single_cmd - Use single immediate commands to communicate with | 763 | single_cmd - Use single immediate commands to communicate with |
| @@ -845,7 +848,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. | |||
| 845 | ALC269 | 848 | ALC269 |
| 846 | basic Basic preset | 849 | basic Basic preset |
| 847 | 850 | ||
| 848 | ALC662 | 851 | ALC662/663 |
| 849 | 3stack-dig 3-stack (2-channel) with SPDIF | 852 | 3stack-dig 3-stack (2-channel) with SPDIF |
| 850 | 3stack-6ch 3-stack (6-channel) | 853 | 3stack-6ch 3-stack (6-channel) |
| 851 | 3stack-6ch-dig 3-stack (6-channel) with SPDIF | 854 | 3stack-6ch-dig 3-stack (6-channel) with SPDIF |
| @@ -853,6 +856,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. | |||
| 853 | lenovo-101e Lenovo laptop | 856 | lenovo-101e Lenovo laptop |
| 854 | eeepc-p701 ASUS Eeepc P701 | 857 | eeepc-p701 ASUS Eeepc P701 |
| 855 | eeepc-ep20 ASUS Eeepc EP20 | 858 | eeepc-ep20 ASUS Eeepc EP20 |
| 859 | m51va ASUS M51VA | ||
| 860 | g71v ASUS G71V | ||
| 861 | h13 ASUS H13 | ||
| 862 | g50v ASUS G50V | ||
| 856 | auto auto-config reading BIOS (default) | 863 | auto auto-config reading BIOS (default) |
| 857 | 864 | ||
| 858 | ALC882/885 | 865 | ALC882/885 |
| @@ -1091,7 +1098,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. | |||
| 1091 | This occurs when the access to non-existing or non-working codec slot | 1098 | This occurs when the access to non-existing or non-working codec slot |
| 1092 | (likely a modem one) causes a stall of the communication via HD-audio | 1099 | (likely a modem one) causes a stall of the communication via HD-audio |
| 1093 | bus. You can see which codec slots are probed by enabling | 1100 | bus. You can see which codec slots are probed by enabling |
| 1094 | CONFIG_SND_DEBUG_DETECT, or simply from the file name of the codec | 1101 | CONFIG_SND_DEBUG_VERBOSE, or simply from the file name of the codec |
| 1095 | proc files. Then limit the slots to probe by probe_mask option. | 1102 | proc files. Then limit the slots to probe by probe_mask option. |
| 1096 | For example, probe_mask=1 means to probe only the first slot, and | 1103 | For example, probe_mask=1 means to probe only the first slot, and |
| 1097 | probe_mask=4 means only the third slot. | 1104 | probe_mask=4 means only the third slot. |
| @@ -2267,6 +2274,10 @@ case above again, the first two slots are already reserved. If any | |||
| 2267 | other driver (e.g. snd-usb-audio) is loaded before snd-interwave or | 2274 | other driver (e.g. snd-usb-audio) is loaded before snd-interwave or |
| 2268 | snd-ens1371, it will be assigned to the third or later slot. | 2275 | snd-ens1371, it will be assigned to the third or later slot. |
| 2269 | 2276 | ||
| 2277 | When a module name is given with '!', the slot will be given for any | ||
| 2278 | modules but that name. For example, "slots=!snd-pcsp" will reserve | ||
| 2279 | the first slot for any modules but snd-pcsp. | ||
| 2280 | |||
| 2270 | 2281 | ||
| 2271 | ALSA PCM devices to OSS devices mapping | 2282 | ALSA PCM devices to OSS devices mapping |
| 2272 | ======================================= | 2283 | ======================================= |
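
The probe_mask description in the ALSA-Configuration.txt hunk above (probe_mask=1 selects the first slot, probe_mask=4 the third, and -1 means all slots) is ordinary bitmask decoding. A minimal sketch follows; the helper name and the upper bound of 8 slots are assumptions made for illustration and are not taken from the driver.

```python
# Illustrative sketch only: probed_slots() is a hypothetical helper, and
# max_slots=8 is an assumed upper bound chosen for the example, not a value
# taken from the driver.
from typing import List


def probed_slots(probe_mask: int, max_slots: int = 8) -> List[int]:
    """Return the zero-based codec slot numbers selected by probe_mask."""
    if probe_mask == -1:
        # Default value: every slot is probed.
        return list(range(max_slots))
    return [slot for slot in range(max_slots) if probe_mask & (1 << slot)]


for mask in (1, 4, 5, -1):
    print(f"probe_mask={mask}: slots {probed_slots(mask)}")
```

Slot numbers here are zero-based, so the "first slot" in the text is slot 0 and the "third slot" is slot 2.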
diff --git a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl index b03df4d4795c..e13c4e67029f 100644 --- a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl +++ b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl | |||
| @@ -6127,8 +6127,8 @@ struct _snd_pcm_runtime { | |||
| 6127 | 6127 | ||
| 6128 | <para> | 6128 | <para> |
| 6129 | <function>snd_printdd()</function> is compiled in only when | 6129 | <function>snd_printdd()</function> is compiled in only when |
| 6130 | <constant>CONFIG_SND_DEBUG_DETECT</constant> is set. Please note | 6130 | <constant>CONFIG_SND_DEBUG_VERBOSE</constant> is set. Please note |
| 6131 | that <constant>DEBUG_DETECT</constant> is not set as default | 6131 | that <constant>CONFIG_SND_DEBUG_VERBOSE</constant> is not set as default |
| 6132 | even if you configure the alsa-driver with | 6132 | even if you configure the alsa-driver with |
| 6133 | <option>--with-debug=full</option> option. You need to give | 6133 | <option>--with-debug=full</option> option. You need to give |
| 6134 | explicitly <option>--with-debug=detect</option> option instead. | 6134 | explicitly <option>--with-debug=detect</option> option instead. |
diff --git a/Documentation/tracers/mmiotrace.txt b/Documentation/tracers/mmiotrace.txt new file mode 100644 index 000000000000..a4afb560a45b --- /dev/null +++ b/Documentation/tracers/mmiotrace.txt | |||
| @@ -0,0 +1,164 @@ | |||
| 1 | In-kernel memory-mapped I/O tracing | ||
| 2 | |||
| 3 | |||
| 4 | Home page and links to optional user space tools: | ||
| 5 | |||
| 6 | http://nouveau.freedesktop.org/wiki/MmioTrace | ||
| 7 | |||
| 8 | MMIO tracing was originally developed by Intel around 2003 for their Fault | ||
| 9 | Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel, | ||
| 10 | Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau | ||
| 11 | project in mind. Since then many people have contributed. | ||
| 12 | |||
| 13 | Mmiotrace was built for reverse engineering any memory-mapped IO device with | ||
| 14 | the Nouveau project as the first real user. Only x86 and x86_64 architectures | ||
| 15 | are supported. | ||
| 16 | |||
| 17 | Out-of-tree mmiotrace was originally modified for mainline inclusion and | ||
| 18 | ftrace framework by Pekka Paalanen <pq@iki.fi>. | ||
| 19 | |||
| 20 | |||
| 21 | Preparation | ||
| 22 | ----------- | ||
| 23 | |||
| 24 | Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is | ||
| 25 | disabled by default, so it is safe to have this set to yes. SMP systems are | ||
| 26 | supported, but tracing is unreliable and may miss events if more than one CPU | ||
| 27 | is on-line, therefore mmiotrace takes all but one CPU off-line during run-time | ||
| 28 | activation. You can re-enable CPUs by hand, but you have been warned, there | ||
| 29 | is no way to automatically detect if you are losing events due to CPUs racing. | ||
| 30 | |||
| 31 | |||
| 32 | Usage Quick Reference | ||
| 33 | --------------------- | ||
| 34 | |||
| 35 | $ mount -t debugfs debugfs /debug | ||
| 36 | $ echo mmiotrace > /debug/tracing/current_tracer | ||
| 37 | $ cat /debug/tracing/trace_pipe > mydump.txt & | ||
| 38 | Start X or whatever. | ||
| 39 | $ echo "X is up" > /debug/tracing/marker | ||
| 40 | $ echo none > /debug/tracing/current_tracer | ||
| 41 | Check for lost events. | ||
| 42 | |||
| 43 | |||
| 44 | Usage | ||
| 45 | ----- | ||
| 46 | |||
| 47 | Make sure debugfs is mounted to /debug. If not, (requires root privileges) | ||
| 48 | $ mount -t debugfs debugfs /debug | ||
| 49 | |||
| 50 | Check that the driver you are about to trace is not loaded. | ||
| 51 | |||
| 52 | Activate mmiotrace (requires root privileges): | ||
| 53 | $ echo mmiotrace > /debug/tracing/current_tracer | ||
| 54 | |||
| 55 | Start storing the trace: | ||
| 56 | $ cat /debug/tracing/trace_pipe > mydump.txt & | ||
| 57 | The 'cat' process should stay running (sleeping) in the background. | ||
| 58 | |||
| 59 | Load the driver you want to trace and use it. Mmiotrace will only catch MMIO | ||
| 60 | accesses to areas that are ioremapped while mmiotrace is active. | ||
| 61 | |||
| 62 | [Unimplemented feature:] | ||
| 63 | During tracing you can place comments (markers) into the trace by | ||
| 64 | $ echo "X is up" > /debug/tracing/marker | ||
| 65 | This makes it easier to see which part of the (huge) trace corresponds to | ||
| 66 | which action. It is recommended to place descriptive markers about what you | ||
| 67 | do. | ||
| 68 | |||
| 69 | Shut down mmiotrace (requires root privileges): | ||
| 70 | $ echo none > /debug/tracing/current_tracer | ||
| 71 | The 'cat' process exits. If it does not, kill it by issuing 'fg' command and | ||
| 72 | pressing ctrl+c. | ||
| 73 | |||
| 74 | Check that mmiotrace did not lose events due to a buffer filling up. Either | ||
| 75 | $ grep -i lost mydump.txt | ||
| 76 | which tells you exactly how many events were lost, or use | ||
| 77 | $ dmesg | ||
| 78 | to view your kernel log and look for "mmiotrace has lost events" warning. If | ||
| 79 | events were lost, the trace is incomplete. You should enlarge the buffers and | ||
| 80 | try again. Buffers are enlarged by first seeing how large the current buffers | ||
| 81 | are: | ||
| 82 | $ cat /debug/tracing/trace_entries | ||
| 83 | gives you a number. Approximately double this number and write it back, for | ||
| 84 | instance: | ||
| 85 | $ echo 128000 > /debug/tracing/trace_entries | ||
| 86 | Then start again from the top. | ||
| 87 | |||
| 88 | If you are doing a trace for a driver project, e.g. Nouveau, you should also | ||
| 89 | do the following before sending your results: | ||
| 90 | $ lspci -vvv > lspci.txt | ||
| 91 | $ dmesg > dmesg.txt | ||
| 92 | $ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt | ||
| 93 | and then send the .tar.gz file. The trace compresses considerably. Replace | ||
| 94 | "pciid" and "nick" with the PCI ID or model name of your piece of hardware | ||
| 95 | under investigation and your nick name. | ||
| 96 | |||
| 97 | |||
| 98 | How Mmiotrace Works | ||
| 99 | ------------------- | ||
| 100 | |||
| 101 | Access to hardware IO-memory is gained by mapping addresses from PCI bus by | ||
| 102 | calling one of the ioremap_*() functions. Mmiotrace is hooked into the | ||
| 103 | __ioremap() function and gets called whenever a mapping is created. Mapping is | ||
| 104 | an event that is recorded into the trace log. Note, that ISA range mappings | ||
| 105 | are not caught, since the mapping always exists and is returned directly. | ||
| 106 | |||
| 107 | MMIO accesses are recorded via page faults. Just before __ioremap() returns, | ||
| 108 | the mapped pages are marked as not present. Any access to the pages causes a | ||
| 109 | fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace | ||
| 110 | marks the page present, sets TF flag to achieve single stepping and exits the | ||
| 111 | fault handler. The instruction that faulted is executed and debug trap is | ||
| 112 | entered. Here mmiotrace again marks the page as not present. The instruction | ||
| 113 | is decoded to get the type of operation (read/write), data width and the value | ||
| 114 | read or written. These are stored to the trace log. | ||
| 115 | |||
| 116 | Setting the page present in the page fault handler has a race condition on SMP | ||
| 117 | machines. During the single stepping other CPUs may run freely on that page | ||
| 118 | and events can be missed without a notice. Re-enabling other CPUs during | ||
| 119 | tracing is discouraged. | ||
| 120 | |||
| 121 | |||
| 122 | Trace Log Format | ||
| 123 | ---------------- | ||
| 124 | |||
| 125 | The raw log is text and easily filtered with e.g. grep and awk. One record is | ||
| 126 | one line in the log. A record starts with a keyword, followed by keyword | ||
| 127 | dependant arguments. Arguments are separated by a space, or continue until the | ||
| 128 | end of line. The format for version 20070824 is as follows: | ||
| 129 | |||
| 130 | Explanation Keyword Space separated arguments | ||
| 131 | --------------------------------------------------------------------------- | ||
| 132 | |||
| 133 | read event R width, timestamp, map id, physical, value, PC, PID | ||
| 134 | write event W width, timestamp, map id, physical, value, PC, PID | ||
| 135 | ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID | ||
| 136 | iounmap event UNMAP timestamp, map id, PC, PID | ||
| 137 | marker MARK timestamp, text | ||
| 138 | version VERSION the string "20070824" | ||
| 139 | info for reader LSPCI one line from lspci -v | ||
| 140 | PCI address map PCIDEV space separated /proc/bus/pci/devices data | ||
| 141 | unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID | ||
| 142 | |||
| 143 | Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual | ||
| 144 | is a kernel virtual address. Width is the data width in bytes and value is the | ||
| 145 | data value. Map id is an arbitrary id number identifying the mapping that was | ||
| 146 | used in an operation. PC is the program counter and PID is process id. PC is | ||
| 147 | zero if it is not recorded. PID is always zero as tracing MMIO accesses | ||
| 148 | originating in user space memory is not yet supported. | ||
| 149 | |||
| 150 | For instance, the following awk filter will pass all 32-bit writes that target | ||
| 151 | physical addresses in the range [0xfb73ce40, 0xfb800000[ | ||
| 152 | |||
| 153 | $ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 && | ||
| 154 | adr < 0xfb800000) print; }' | ||
| 155 | |||
| 156 | |||
| 157 | Tools for Developers | ||
| 158 | -------------------- | ||
| 159 | |||
| 160 | The user space tools include utilities for: | ||
| 161 | - replacing numeric addresses and values with hardware register names | ||
| 162 | - replaying MMIO logs, i.e., re-executing the recorded writes | ||
| 163 | |||
| 164 | |||
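
The awk filter in the "Trace Log Format" section of mmiotrace.txt above can also be expressed in Python. This is a sketch under stated assumptions: the field order comes from the format table, but the helper name and the sample records are invented for the example, and addresses are assumed to be written in hexadecimal (the awk version relies on strtonum(), which suggests a 0x prefix).

```python
# Illustrative sketch only: the helper name and the sample records are made
# up for this example; only the field order comes from the format table.
from typing import Iterable, Iterator

LOW = 0xFB73CE40   # inclusive lower bound used in the awk example
HIGH = 0xFB800000  # exclusive upper bound used in the awk example


def filter_writes(lines: Iterable[str], width: int = 4,
                  low: int = LOW, high: int = HIGH) -> Iterator[str]:
    """Yield W records of the given width with physical address in [low, high)."""
    for line in lines:
        fields = line.split()
        # Field order: W width timestamp map_id physical value PC PID
        if len(fields) < 5 or fields[0] != "W":
            continue
        try:
            rec_width = int(fields[1])
            physical = int(fields[4], 16)  # accepts an optional 0x prefix
        except ValueError:
            continue  # skip malformed records rather than abort
        if rec_width == width and low <= physical < high:
            yield line.rstrip("\n")


# Hypothetical records, written to follow the documented field order.
sample = [
    "W 4 12.000001 1 0xfb73ce40 0x1 0x0 0",  # in range, 32-bit write
    "W 2 12.000002 1 0xfb73ce44 0x1 0x0 0",  # wrong width
    "R 4 12.000003 1 0xfb73ce48 0x0 0x0 0",  # read, not a write
    "W 4 12.000004 1 0xfb800000 0x1 0x0 0",  # at the excluded upper bound
]
for record in filter_writes(sample):
    print(record)
```

Unlike the awk pattern /W 4 /, which matches that text anywhere on a line, this version compares the keyword and width fields exactly, so a MARK line whose text happens to contain "W 4 " is not passed through.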
diff --git a/Documentation/i386/IO-APIC.txt b/Documentation/x86/i386/IO-APIC.txt index 30b4c714fbe1..30b4c714fbe1 100644 --- a/Documentation/i386/IO-APIC.txt +++ b/Documentation/x86/i386/IO-APIC.txt | |||
diff --git a/Documentation/i386/boot.txt b/Documentation/x86/i386/boot.txt index 95ad15c3b01f..147bfe511cdd 100644 --- a/Documentation/i386/boot.txt +++ b/Documentation/x86/i386/boot.txt | |||
| @@ -1,17 +1,14 @@ | |||
| 1 | THE LINUX/I386 BOOT PROTOCOL | 1 | THE LINUX/x86 BOOT PROTOCOL |
| 2 | ---------------------------- | 2 | --------------------------- |
| 3 | 3 | ||
| 4 | H. Peter Anvin <hpa@zytor.com> | 4 | On the x86 platform, the Linux kernel uses a rather complicated boot |
| 5 | Last update 2007-05-23 | ||
| 6 | |||
| 7 | On the i386 platform, the Linux kernel uses a rather complicated boot | ||
| 8 | convention. This has evolved partially due to historical aspects, as | 5 | convention. This has evolved partially due to historical aspects, as |
| 9 | well as the desire in the early days to have the kernel itself be a | 6 | well as the desire in the early days to have the kernel itself be a |
| 10 | bootable image, the complicated PC memory model and due to changed | 7 | bootable image, the complicated PC memory model and due to changed |
| 11 | expectations in the PC industry caused by the effective demise of | 8 | expectations in the PC industry caused by the effective demise of |
| 12 | real-mode DOS as a mainstream operating system. | 9 | real-mode DOS as a mainstream operating system. |
| 13 | 10 | ||
| 14 | Currently, the following versions of the Linux/i386 boot protocol exist. | 11 | Currently, the following versions of the Linux/x86 boot protocol exist. |
| 15 | 12 | ||
| 16 | Old kernels: zImage/Image support only. Some very early kernels | 13 | Old kernels: zImage/Image support only. Some very early kernels |
| 17 | may not even support a command line. | 14 | may not even support a command line. |
| @@ -372,10 +369,17 @@ Protocol: 2.00+ | |||
| 372 | - If 0, the protected-mode code is loaded at 0x10000. | 369 | - If 0, the protected-mode code is loaded at 0x10000. |
| 373 | - If 1, the protected-mode code is loaded at 0x100000. | 370 | - If 1, the protected-mode code is loaded at 0x100000. |
| 374 | 371 | ||
| 372 | Bit 5 (write): QUIET_FLAG | ||
| 373 | - If 0, print early messages. | ||
| 374 | - If 1, suppress early messages. | ||
| 375 | This requests to the kernel (decompressor and early | ||
| 376 | kernel) to not write early messages that require | ||
| 377 | accessing the display hardware directly. | ||
| 378 | |||
| 375 | Bit 6 (write): KEEP_SEGMENTS | 379 | Bit 6 (write): KEEP_SEGMENTS |
| 376 | Protocol: 2.07+ | 380 | Protocol: 2.07+ |
| 377 | - if 0, reload the segment registers in the 32bit entry point. | 381 | - If 0, reload the segment registers in the 32bit entry point. |
| 378 | - if 1, do not reload the segment registers in the 32bit entry point. | 382 | - If 1, do not reload the segment registers in the 32bit entry point. |
| 379 | Assume that %cs %ds %ss %es are all set to flat segments with | 383 | Assume that %cs %ds %ss %es are all set to flat segments with |
| 380 | a base of 0 (or the equivalent for their environment). | 384 | a base of 0 (or the equivalent for their environment). |
| 381 | 385 | ||
| @@ -504,7 +508,7 @@ Protocol: 2.06+ | |||
| 504 | maximum size was 255. | 508 | maximum size was 255. |
| 505 | 509 | ||
| 506 | Field name: hardware_subarch | 510 | Field name: hardware_subarch |
| 507 | Type: write | 511 | Type: write (optional, defaults to x86/PC) |
| 508 | Offset/size: 0x23c/4 | 512 | Offset/size: 0x23c/4 |
| 509 | Protocol: 2.07+ | 513 | Protocol: 2.07+ |
| 510 | 514 | ||
| @@ -520,11 +524,13 @@ Protocol: 2.07+ | |||
| 520 | 0x00000002 Xen | 524 | 0x00000002 Xen |
| 521 | 525 | ||
| 522 | Field name: hardware_subarch_data | 526 | Field name: hardware_subarch_data |
| 523 | Type: write | 527 | Type: write (subarch-dependent) |
| 524 | Offset/size: 0x240/8 | 528 | Offset/size: 0x240/8 |
| 525 | Protocol: 2.07+ | 529 | Protocol: 2.07+ |
| 526 | 530 | ||
| 527 | A pointer to data that is specific to hardware subarch | 531 | A pointer to data that is specific to hardware subarch |
| 532 | This field is currently unused for the default x86/PC environment, | ||
| 533 | do not modify. | ||
| 528 | 534 | ||
| 529 | Field name: payload_offset | 535 | Field name: payload_offset |
| 530 | Type: read | 536 | Type: read |
| @@ -545,6 +551,34 @@ Protocol: 2.08+ | |||
| 545 | 551 | ||
| 546 | The length of the payload. | 552 | The length of the payload. |
| 547 | 553 | ||
| 554 | Field name: setup_data | ||
| 555 | Type: write (special) | ||
| 556 | Offset/size: 0x250/8 | ||
| 557 | Protocol: 2.09+ | ||
| 558 | |||
| 559 | The 64-bit physical pointer to NULL terminated single linked list of | ||
| 560 | struct setup_data. This is used to define a more extensible boot | ||
| 561 | parameters passing mechanism. The definition of struct setup_data is | ||
| 562 | as follow: | ||
| 563 | |||
| 564 | struct setup_data { | ||
| 565 | u64 next; | ||
| 566 | u32 type; | ||
| 567 | u32 len; | ||
| 568 | u8 data[0]; | ||
| 569 | }; | ||
| 570 | |||
| 571 | Where, the next is a 64-bit physical pointer to the next node of | ||
| 572 | linked list, the next field of the last node is 0; the type is used | ||
| 573 | to identify the contents of data; the len is the length of data | ||
| 574 | field; the data holds the real payload. | ||
| 575 | |||
| 576 | This list may be modified at a number of points during the bootup | ||
| 577 | process. Therefore, when modifying this list one should always make | ||
| 578 | sure to consider the case where the linked list already contains | ||
| 579 | entries. | ||
| 580 | |||
| 581 | |||
| 548 | **** THE IMAGE CHECKSUM | 582 | **** THE IMAGE CHECKSUM |
| 549 | 583 | ||
| 550 | From boot protocol version 2.08 onwards the CRC-32 is calculated over | 584 | From boot protocol version 2.08 onwards the CRC-32 is calculated over |
| @@ -553,6 +587,7 @@ initial remainder of 0xffffffff. The checksum is appended to the | |||
| 553 | file; therefore the CRC of the file up to the limit specified in the | 587 | file; therefore the CRC of the file up to the limit specified in the |
| 554 | syssize field of the header is always 0. | 588 | syssize field of the header is always 0. |
| 555 | 589 | ||
| 590 | |||
| 556 | **** THE KERNEL COMMAND LINE | 591 | **** THE KERNEL COMMAND LINE |
| 557 | 592 | ||
| 558 | The kernel command line has become an important way for the boot | 593 | The kernel command line has become an important way for the boot |
| @@ -584,28 +619,6 @@ command line is entered using the following protocol: | |||
| 584 | covered by setup_move_size, so you may need to adjust this | 619 | covered by setup_move_size, so you may need to adjust this |
| 585 | field. | 620 | field. |
| 586 | 621 | ||
| 587 | Field name: setup_data | ||
| 588 | Type: write (obligatory) | ||
| 589 | Offset/size: 0x250/8 | ||
| 590 | Protocol: 2.09+ | ||
| 591 | |||
| 592 | The 64-bit physical pointer to NULL terminated single linked list of | ||
| 593 | struct setup_data. This is used to define a more extensible boot | ||
| 594 | parameters passing mechanism. The definition of struct setup_data is | ||
| 595 | as follow: | ||
| 596 | |||
| 597 | struct setup_data { | ||
| 598 | u64 next; | ||
| 599 | u32 type; | ||
| 600 | u32 len; | ||
| 601 | u8 data[0]; | ||
| 602 | }; | ||
| 603 | |||
| 604 | Where, the next is a 64-bit physical pointer to the next node of | ||
| 605 | linked list, the next field of the last node is 0; the type is used | ||
| 606 | to identify the contents of data; the len is the length of data | ||
| 607 | field; the data holds the real payload. | ||
| 608 | |||
| 609 | 622 | ||
| 610 | **** MEMORY LAYOUT OF THE REAL-MODE CODE | 623 | **** MEMORY LAYOUT OF THE REAL-MODE CODE |
| 611 | 624 | ||
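
The setup_data field added to boot.txt above describes a singly linked list whose last node has a next field of 0, and warns that code adding entries must cope with a list that already has entries. The sketch below models that in Python. It is only an illustration: a bytearray stands in for physical memory, addresses are offsets into it, and the helper names and type values are hypothetical. Only the field layout (u64 next, u32 type, u32 len, then data) comes from the struct definition.

```python
# Illustrative sketch only: "memory" is a bytearray standing in for physical
# memory, addresses are offsets into it, and the helper names and type values
# are hypothetical. Only the field layout comes from struct setup_data.
import struct
from typing import Iterator, Tuple

HEADER = struct.Struct("<QII")  # u64 next, u32 type, u32 len (little-endian)


def walk(memory: bytearray, head: int) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (address, type, data) per node; a next field of 0 ends the list."""
    addr = head
    while addr != 0:
        nxt, typ, length = HEADER.unpack_from(memory, addr)
        start = addr + HEADER.size
        yield addr, typ, bytes(memory[start:start + length])
        addr = nxt


def append(memory: bytearray, head: int, addr: int, typ: int,
           data: bytes) -> int:
    """Write a node at addr, link it after the current tail, return the head."""
    HEADER.pack_into(memory, addr, 0, typ, len(data))
    start = addr + HEADER.size
    memory[start:start + len(data)] = data
    if head == 0:
        return addr  # empty list: the new node becomes the head
    tail = head
    for node_addr, _, _ in walk(memory, head):
        tail = node_addr
    # Patch only the tail's next field so existing entries are preserved.
    struct.pack_into("<Q", memory, tail, addr)
    return head


memory = bytearray(256)
head = 0
head = append(memory, head, 0x40, 1, b"first")
head = append(memory, head, 0x80, 2, b"second")
for addr, typ, data in walk(memory, head):
    print(hex(addr), typ, data)
```

Appending at the tail rather than overwriting the head pointer is one way to respect the caution in the text; prepending a node whose next field holds the old head would serve the same purpose.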
diff --git a/Documentation/i386/usb-legacy-support.txt b/Documentation/x86/i386/usb-legacy-support.txt index 1894cdfc69d9..1894cdfc69d9 100644 --- a/Documentation/i386/usb-legacy-support.txt +++ b/Documentation/x86/i386/usb-legacy-support.txt | |||
diff --git a/Documentation/i386/zero-page.txt b/Documentation/x86/i386/zero-page.txt index 169ad423a3d1..169ad423a3d1 100644 --- a/Documentation/i386/zero-page.txt +++ b/Documentation/x86/i386/zero-page.txt | |||
diff --git a/Documentation/x86_64/00-INDEX b/Documentation/x86/x86_64/00-INDEX index 92fc20ab5f0e..92fc20ab5f0e 100644 --- a/Documentation/x86_64/00-INDEX +++ b/Documentation/x86/x86_64/00-INDEX | |||
diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index b0c7b6c4abda..b0c7b6c4abda 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt | |||
diff --git a/Documentation/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec index 3c23e0587db3..3c23e0587db3 100644 --- a/Documentation/x86_64/cpu-hotplug-spec +++ b/Documentation/x86/x86_64/cpu-hotplug-spec | |||
diff --git a/Documentation/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets index d1a985c5b00a..d1a985c5b00a 100644 --- a/Documentation/x86_64/fake-numa-for-cpusets +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets | |||
diff --git a/Documentation/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks index 5ad65d51fb95..5ad65d51fb95 100644 --- a/Documentation/x86_64/kernel-stacks +++ b/Documentation/x86/x86_64/kernel-stacks | |||
diff --git a/Documentation/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck index a05e58e7b159..a05e58e7b159 100644 --- a/Documentation/x86_64/machinecheck +++ b/Documentation/x86/x86_64/machinecheck | |||
diff --git a/Documentation/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index b89b6d2bebfa..efce75097369 100644 --- a/Documentation/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt | |||
| @@ -11,9 +11,8 @@ ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole | |||
| 11 | ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space | 11 | ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space |
| 12 | ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB) | 12 | ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB) |
| 13 | ... unused hole ... | 13 | ... unused hole ... |
| 14 | ffffffff80000000 - ffffffff82800000 (=40 MB) kernel text mapping, from phys 0 | 14 | ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 |
| 15 | ... unused hole ... | 15 | ffffffffa0000000 - fffffffffff00000 (=1536 MB) module mapping space |
| 16 | ffffffff88000000 - fffffffffff00000 (=1919 MB) module mapping space | ||
| 17 | 16 | ||
| 18 | The direct mapping covers all memory in the system up to the highest | 17 | The direct mapping covers all memory in the system up to the highest |
| 19 | memory address (this means in some cases it can also include PCI memory | 18 | memory address (this means in some cases it can also include PCI memory |
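
The region sizes in the new mm.txt rows above can be checked from the addresses themselves. The sketch below is a plain arithmetic check with a hypothetical helper name, treating each row as a half-open range. By this arithmetic the kernel text mapping is exactly 512 MiB, while the module mapping space comes out at 1535 MiB, since the range ends 1 MiB below the top of the 64-bit address space; the 1536 MB in the table appears to be a rounded figure.

```python
# Illustrative sketch only: region_mib() is a hypothetical helper. Each range
# is treated as [start, end), matching how the new rows are written.

MIB = 1 << 20


def region_mib(start: int, end: int) -> float:
    """Return the size of [start, end) in MiB."""
    return (end - start) / MIB


regions = {
    "kernel text mapping": (0xFFFFFFFF80000000, 0xFFFFFFFFA0000000),
    "module mapping space": (0xFFFFFFFFA0000000, 0xFFFFFFFFFFF00000),
}
for name, (start, end) in regions.items():
    print(f"{name}: {region_mib(start, end):.0f} MiB")
```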
diff --git a/Documentation/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.txt index 7d77120a5184..a5e2b4fdb170 100644 --- a/Documentation/x86_64/uefi.txt +++ b/Documentation/x86/x86_64/uefi.txt | |||
| @@ -36,3 +36,7 @@ Mechanics: | |||
| 36 | services. | 36 | services. |
| 37 | noefi turn off all EFI runtime services | 37 | noefi turn off all EFI runtime services |
| 38 | reboot_type=k turn off EFI reboot runtime service | 38 | reboot_type=k turn off EFI reboot runtime service |
| 39 | - If the EFI memory map has additional entries not in the E820 map, | ||
| 40 | you can include those entries in the kernels memory map of available | ||
| 41 | physical RAM by using the following kernel command line parameter. | ||
| 42 | add_efi_memmap include EFI memory map of available physical RAM | ||
