Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/debugfs-kmemtrace | 71 | ||||
-rw-r--r-- | Documentation/filesystems/pohmelfs/design_notes.txt | 70 | ||||
-rw-r--r-- | Documentation/filesystems/pohmelfs/info.txt | 86 | ||||
-rw-r--r-- | Documentation/filesystems/pohmelfs/network_protocol.txt | 227 | ||||
-rw-r--r-- | Documentation/ftrace.txt | 1134 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 33 | ||||
-rw-r--r-- | Documentation/laptops/acer-wmi.txt | 10 | ||||
-rw-r--r-- | Documentation/laptops/thinkpad-acpi.txt | 144 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 2 | ||||
-rw-r--r-- | Documentation/tracepoints.txt | 21 | ||||
-rw-r--r-- | Documentation/vm/kmemtrace.txt | 126 |
11 files changed, 1499 insertions, 425 deletions
diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace new file mode 100644 index 000000000000..5e6a92a02d85 --- /dev/null +++ b/Documentation/ABI/testing/debugfs-kmemtrace | |||
@@ -0,0 +1,71 @@ | |||
1 | What: /sys/kernel/debug/kmemtrace/ | ||
2 | Date: July 2008 | ||
3 | Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> | ||
4 | Description: | ||
5 | |||
6 | In kmemtrace-enabled kernels, the following files are created: | ||
7 | |||
8 | /sys/kernel/debug/kmemtrace/ | ||
9 | cpu<n> (0400) Per-CPU tracing data, see below. (binary) | ||
10 | total_overruns (0400) Total number of bytes which were dropped from | ||
11 | cpu<n> files because of full buffer condition, | ||
12 | non-binary. (text) | ||
13 | abi_version (0400) Kernel's kmemtrace ABI version. (text) | ||
14 | |||
15 | Each per-CPU file should be read according to the relay interface. That is, | ||
16 | the reader should set affinity to that specific CPU and, as currently done by | ||
17 | the userspace application (though there are other methods), use poll() with | ||
18 | an infinite timeout before every read(). Otherwise, erroneous data may be | ||
19 | read. The binary data has the following _core_ format: | ||
20 | |||
21 | Event ID (1 byte) Unsigned integer, one of: | ||
22 | 0 - represents an allocation (KMEMTRACE_EVENT_ALLOC) | ||
23 | 1 - represents a freeing of previously allocated memory | ||
24 | (KMEMTRACE_EVENT_FREE) | ||
25 | Type ID (1 byte) Unsigned integer, one of: | ||
26 | 0 - this is a kmalloc() / kfree() | ||
27 | 1 - this is a kmem_cache_alloc() / kmem_cache_free() | ||
28 | 2 - this is a __get_free_pages() et al. | ||
29 | Event size (2 bytes) Unsigned integer representing the | ||
30 | size of this event. Used to extend | ||
31 | kmemtrace. Discard the bytes you | ||
32 | don't know about. | ||
33 | Sequence number (4 bytes) Signed integer used to reorder data | ||
34 | logged on SMP machines. Wraparound | ||
35 | must be taken into account, although | ||
36 | it is unlikely. | ||
37 | Caller address (8 bytes) Return address to the caller. | ||
38 | Pointer to mem (8 bytes) Pointer to target memory area. Can be | ||
39 | NULL, but not all such calls might be | ||
40 | recorded. | ||
41 | |||
42 | In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow: | ||
43 | |||
44 | Requested bytes (8 bytes) Total number of requested bytes, | ||
45 | unsigned, must not be zero. | ||
46 | Allocated bytes (8 bytes) Total number of actually allocated | ||
47 | bytes, unsigned, must not be lower | ||
48 | than requested bytes. | ||
49 | Requested flags (4 bytes) GFP flags supplied by the caller. | ||
50 | Target CPU (4 bytes) Signed integer, valid for event id 1. | ||
51 | If equal to -1, target CPU is the same | ||
52 | as origin CPU, but the reverse might | ||
53 | not be true. | ||
54 | |||
55 | The data is made available in the same endianness the machine has. | ||
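The field layout above can be sketched as a pair of C structures plus a small decoder. This is only an illustration, not the kmemtrace-user implementation: the names `kt_event_core`, `kt_event_alloc`, `kt_decode` and `kt_seq_before` are hypothetical, and the sketch assumes that "event size" counts the whole event including the core fields, and that the reader runs on a machine with the same endianness as the traced kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { KMEMTRACE_EVENT_ALLOC = 0, KMEMTRACE_EVENT_FREE = 1 };

/* Core fields shared by every event, in the documented order (24 bytes). */
struct kt_event_core {
	uint8_t  event_id;	/* KMEMTRACE_EVENT_ALLOC or KMEMTRACE_EVENT_FREE */
	uint8_t  type_id;	/* 0: kmalloc, 1: kmem_cache, 2: pages */
	uint16_t event_size;	/* assumed to cover the whole event */
	int32_t  seq;		/* sequence number, may wrap around */
	uint64_t call_site;	/* return address to the caller */
	uint64_t ptr;		/* pointer to the target memory area */
};

/* Extra fields that follow the core fields for allocation events (24 bytes). */
struct kt_event_alloc {
	uint64_t bytes_req;
	uint64_t bytes_alloc;
	uint32_t gfp_flags;
	int32_t  target_cpu;
};

_Static_assert(sizeof(struct kt_event_core) == 24, "unexpected padding");
_Static_assert(sizeof(struct kt_event_alloc) == 24, "unexpected padding");

/*
 * Decode one event from buf. Returns the number of bytes to skip to reach
 * the next event (the event size), or 0 if the buffer is too short or the
 * event is malformed. Bytes beyond the known fields are simply skipped.
 */
static size_t kt_decode(const unsigned char *buf, size_t len,
			struct kt_event_core *core,
			struct kt_event_alloc *alloc)
{
	if (len < sizeof(*core))
		return 0;
	memcpy(core, buf, sizeof(*core));	/* native endianness */
	if (core->event_size < sizeof(*core) || core->event_size > len)
		return 0;
	if (core->event_id == KMEMTRACE_EVENT_ALLOC) {
		if (core->event_size < sizeof(*core) + sizeof(*alloc))
			return 0;
		memcpy(alloc, buf + sizeof(*core), sizeof(*alloc));
	}
	return core->event_size;
}

/*
 * Wraparound-aware ordering: nonzero if sequence number a was logged
 * before b. Relies on the usual two's complement conversion.
 */
static int kt_seq_before(int32_t a, int32_t b)
{
	return (int32_t)((uint32_t)a - (uint32_t)b) < 0;
}
```

Returning the event size rather than a fixed structure size lets a reader skip fields added by a newer ABI version, which matches the "discard the bytes you don't know about" rule.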
56 | |||
57 | Other event ids and type ids may be defined and added. Other fields may be | ||
58 | added by increasing event size, but see below for details. | ||
59 | Every modification to the ABI, including new id definitions, is followed | ||
60 | by bumping the ABI version by one. | ||
61 | |||
62 | Adding new data to the packet (features) is done at the end of the mandatory | ||
63 | data: | ||
64 | Feature size (2 bytes) | ||
65 | Feature ID (1 byte) | ||
66 | Feature data (Feature size - 3 bytes) | ||
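A reader could step through such feature records roughly as follows. This is a minimal sketch under stated assumptions: `kt_walk_features` and `kt_feature_fn` are hypothetical names, the feature size is taken to include its own 3-byte header (as "Feature size - 3 bytes" suggests), and the size field is read in native endianness like the rest of the data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback type, invoked once per feature record. */
typedef void (*kt_feature_fn)(uint8_t id, const unsigned char *data,
			      size_t data_len, void *ctx);

/*
 * Walk the feature records stored in buf[0..len), i.e. the bytes that
 * follow the mandatory data of one event. Returns the number of features
 * visited, or -1 if a record is truncated or malformed.
 */
static int kt_walk_features(const unsigned char *buf, size_t len,
			    kt_feature_fn fn, void *ctx)
{
	int count = 0;

	while (len > 0) {
		uint16_t fsize;

		if (len < 3)
			return -1;	/* no room for size + ID */
		memcpy(&fsize, buf, sizeof(fsize));
		if (fsize < 3 || fsize > len)
			return -1;	/* size must cover the header and fit */
		if (fn)
			fn(buf[2], buf + 3, (size_t)fsize - 3, ctx);
		buf += fsize;
		len -= fsize;
		count++;
	}
	return count;
}
```

Unknown feature IDs can simply be ignored by the callback, since the size field is enough to reach the next record.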
67 | |||
68 | |||
69 | Users: | ||
70 | kmemtrace-user - git://repo.or.cz/kmemtrace-user.git | ||
71 | |||
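The read procedure from the description (poll() with an infinite timeout before every read()) might look like the sketch below. `kt_poll_read` is a hypothetical helper, not code from kmemtrace-user. Pinning the reader to the matching CPU, for example with sched_setaffinity(), is left to the caller.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd is readable (infinite timeout), then issue one read().
 * Returns the number of bytes read, or -1 on error. The caller is
 * expected to have set its CPU affinity to the CPU the file belongs to.
 */
static ssize_t kt_poll_read(int fd, void *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, -1);	/* -1: wait forever */
	} while (ret < 0 && errno == EINTR);
	if (ret < 0)
		return -1;

	return read(fd, buf, len);
}
```

A reader would call this in a loop on an open cpu<n> file and hand each chunk to its event decoder.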
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt new file mode 100644 index 000000000000..6d6db60d567d --- /dev/null +++ b/Documentation/filesystems/pohmelfs/design_notes.txt | |||
@@ -0,0 +1,70 @@ | |||
1 | POHMELFS: Parallel Optimized Host Message Exchange Layered File System. | ||
2 | |||
3 | Evgeniy Polyakov <zbr@ioremap.net> | ||
4 | |||
5 | Homepage: http://www.ioremap.net/projects/pohmelfs | ||
6 | |||
7 | POHMELFS first began as a network filesystem with coherent local data and | ||
8 | metadata caches but is now evolving into a parallel distributed filesystem. | ||
9 | |||
10 | Main features of this FS include: | ||
11 | * Locally coherent cache for data and metadata with (potentially) byte-range locks. | ||
12 | Since all Linux filesystems lock the whole inode during writing, algorithm | ||
13 | is very simple and does not use byte-ranges, although they are sent in | ||
14 | locking messages. | ||
15 | * Completely async processing of all events except creation of hard and symbolic | ||
16 | links, and rename events. | ||
17 | Object creation and data reading and writing are processed asynchronously. | ||
18 | * Flexible object architecture optimized for network processing. | ||
19 | Ability to create long paths to objects and remove arbitrarily huge | ||
20 | directories with a single network command. | ||
21 | (like removing the whole kernel tree via a single network command). | ||
22 | * Very high performance. | ||
23 | * Fast and scalable multithreaded userspace server. Being in userspace it works | ||
24 | with any underlying filesystem and still is much faster than async in-kernel NFS one. | ||
25 | * Client is able to switch between different servers (if one goes down, client | ||
26 | automatically reconnects to second and so on). | ||
27 | * Transactions support. Full failover for all operations. | ||
28 | Resending transactions to different servers on timeout or error. | ||
29 | * Read request (data read, directory listing, lookup requests) balancing between multiple servers. | ||
30 | * Write requests are replicated to multiple servers and completed only when all of them are acked. | ||
31 | * Ability to add and/or remove servers from the working set at run-time. | ||
32 | * Strong authentication and possible data encryption in network channel. | ||
33 | * Extended attributes support. | ||
34 | |||
35 | POHMELFS is based on transactions, which are potentially long-standing objects that live | ||
36 | in the client's memory. Each transaction contains all the information needed to process a given | ||
37 | command (or set of commands, which is frequently used during data writing: single transactions | ||
38 | can contain creation and data writing commands). Transactions are committed by all the servers | ||
39 | to which they are sent and, in case of failures, are eventually resent or dropped with an error. | ||
40 | For example, reading will return an error if no servers are available. | ||
41 | |||
42 | POHMELFS uses an asynchronous approach to data processing. Courtesy of transactions, it is | ||
43 | possible to detach replies from requests and, if the command requires data to be received, the | ||
44 | caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different | ||
45 | servers and async threads will pick up replies in parallel, find appropriate transactions in the | ||
46 | system and put the data where it belongs (like the page or inode cache). | ||
47 | |||
48 | The main feature of POHMELFS is its writeback data and metadata cache. | ||
49 | Only a few non-performance critical operations use the write-through cache and | ||
50 | are synchronous: hard and symbolic link creation, and object rename. Creation, | ||
51 | removal of objects and data writing are asynchronous and are sent to | ||
52 | the server during system writeback. Only one writer at a time is allowed for any | ||
53 | given inode, which is guarded by an appropriate locking protocol. | ||
54 | Because of this feature, POHMELFS is extremely fast at metadata intensive | ||
55 | workloads and can fully utilize the bandwidth to the servers when doing bulk | ||
56 | data transfers. | ||
57 | |||
58 | POHMELFS clients operate with a working set of servers and are capable of balancing read-only | ||
59 | operations (like lookups or directory listings) between them. | ||
60 | Administrators can add or remove servers from the set at run-time via special commands (described | ||
61 | in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers. | ||
62 | |||
63 | POHMELFS is capable of full data channel encryption and/or strong crypto hashing. | ||
64 | One can select any kernel supported cipher, encryption mode, hash type and operation mode | ||
65 | (hmac or digest). It is also possible to use both or neither (default). Crypto configuration | ||
66 | is checked during mount time and, if the server does not support it, appropriate capabilities | ||
67 | will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified). | ||
68 | Crypto performance heavily depends on the number of crypto threads, which asynchronously perform | ||
69 | crypto operations and send the resulting data to server or submit it up the stack. This number | ||
70 | can be controlled via a mount option. | ||
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt new file mode 100644 index 000000000000..4e3d50157083 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/info.txt | |||
@@ -0,0 +1,86 @@ | |||
1 | POHMELFS usage information. | ||
2 | |||
3 | Mount options: | ||
4 | idx=%u | ||
5 | Each mountpoint is associated with a special index via this option. | ||
6 | Administrator can add or remove servers from the given index, so all mounts, | ||
7 | which were attached to it, are updated. | ||
8 | Default is 0. | ||
9 | |||
10 | trans_scan_timeout=%u | ||
11 | This timeout, expressed in milliseconds, specifies how often transaction | ||
12 | trees are scanned for stale requests, which have to be resent or, if the number | ||
13 | of retries exceeds the specified limit, dropped with an error. | ||
14 | Default is 5 seconds. | ||
15 | |||
16 | drop_scan_timeout=%u | ||
17 | Internal timeout, expressed in milliseconds, which specifies how frequently | ||
18 | inodes marked to be dropped are freed. It also specifies how frequently | ||
19 | the system checks that servers have to be added or removed from current working set. | ||
20 | Default is 1 second. | ||
21 | |||
22 | wait_on_page_timeout=%u | ||
23 | Number of milliseconds to wait for reply from remote server for data reading command. | ||
24 | If this timeout is exceeded, reading returns an error. | ||
25 | Default is 5 seconds. | ||
26 | |||
27 | trans_retries=%u | ||
28 | This is the number of times that a transaction will be resent to a server that did | ||
29 | not answer for the last @trans_scan_timeout milliseconds. | ||
30 | When the number of resends exceeds this limit, the transaction is completed with error. | ||
31 | Default is 5 resends. | ||
32 | |||
33 | crypto_thread_num=%u | ||
34 | Number of crypto processing threads. Threads are used both for RX and TX traffic. | ||
35 | Default is 2, or no threads if crypto operations are not supported. | ||
36 | |||
37 | trans_max_pages=%u | ||
38 | Maximum number of pages in a single transaction. This parameter also controls | ||
39 | the number of pages allocated for crypto processing (each crypto thread has a | ||
40 | pool of pages, the number of which is equal to 'trans_max_pages'). | ||
41 | Default is 100 pages. | ||
42 | |||
43 | crypto_fail_unsupported | ||
44 | If specified, mount will fail if the server does not support requested crypto operations. | ||
45 | By default mount will disable non-matching crypto operations. | ||
46 | |||
47 | mcache_timeout=%u | ||
48 | Maximum number of milliseconds to wait for the mcache objects to be processed. | ||
49 | Mcache includes locks (given lock should be granted by server), attributes (they should be | ||
50 | fully received in the given timeframe). | ||
51 | Default is 5 seconds. | ||
52 | |||
53 | Usage examples. | ||
54 | |||
55 | Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx | ||
56 | with appropriate hash algorithm and key file and cipher algorithm, mode and key file: | ||
57 | $cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key | ||
58 | |||
59 | Mount filesystem with given index $idx to /mnt mountpoint. | ||
60 | Client will connect to all servers specified in the working set via previous command: | ||
61 | mount -t pohmel -o idx=$idx q /mnt | ||
62 | |||
63 | One can add or remove servers from working set after mounting too. | ||
64 | |||
65 | |||
66 | Server installation. | ||
67 | |||
68 | Create a server that listens on address 0.0.0.0, port 1025. | ||
69 | The working root directory is set to /mnt (note that the server chroots there, so you need appropriate | ||
70 | permissions). The server will negotiate hash/cipher with the client if the client requests it, | ||
71 | using the given key files. | ||
72 | Number of working threads is set to 10. | ||
73 | |||
74 | # ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key | ||
75 | |||
76 | -A 6 - listen on ipv6 address. Default: Disabled. | ||
77 | -r root - path to root directory. Default: /tmp. | ||
78 | -a addr - listen address. Default: 0.0.0.0. | ||
79 | -p port - listen port. Default: 1025. | ||
80 | -w workers - number of workers per connected client. Default: 1. | ||
81 | -K file - hash key size. Default: none. | ||
82 | -k file - cipher key size. Default: none. | ||
83 | -h - this help. | ||
84 | |||
85 | Number of worker threads specifies how many workers will be created for each client. | ||
86 | Bulk single-client transfers are usually better handled with a smaller number (like 1-3). | ||
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt new file mode 100644 index 000000000000..40ea6c295afb --- /dev/null +++ b/Documentation/filesystems/pohmelfs/network_protocol.txt | |||
@@ -0,0 +1,227 @@ | |||
1 | POHMELFS network protocol. | ||
2 | |||
3 | The basic structure used in network communication is the following command: | ||
4 | |||
5 | struct netfs_cmd | ||
6 | { | ||
7 | __u16 cmd; /* Command number */ | ||
8 | __u16 csize; /* Attached crypto information size */ | ||
9 | __u16 cpad; /* Attached padding size */ | ||
10 | __u16 ext; /* External flags */ | ||
11 | __u32 size; /* Size of the attached data */ | ||
12 | __u32 trans; /* Transaction id */ | ||
13 | __u64 id; /* Object ID to operate on. Used for feedback.*/ | ||
14 | __u64 start; /* Start of the object. */ | ||
15 | __u64 iv; /* IV sequence */ | ||
16 | __u8 data[0]; | ||
17 | }; | ||
18 | |||
19 | Commands can be embedded into transaction command (which in turn has own command), | ||
20 | so one can extend protocol as needed without breaking backward compatibility as long | ||
21 | as old commands are supported. All string lengths include tail 0 byte. | ||
22 | |||
23 | All commands are transferred over the network in big-endian. CPU endianness is used at the end peers. | ||
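As an illustration of the byte-order rule, a receiver could convert a header from its wire form to CPU order as sketched below. The names `netfs_cmd_hdr`, `NETFS_CMD_WIRE_SIZE`, `get_be` and `netfs_cmd_from_wire` are hypothetical, and the sketch assumes the header is laid out on the wire exactly in the field order of struct netfs_cmd with no padding (40 bytes).

```c
#include <stdint.h>

#define NETFS_CMD_WIRE_SIZE 40	/* assumed: 4*2 + 2*4 + 3*8 bytes, no padding */

/* CPU-order copy of the fixed part of struct netfs_cmd (without data[]). */
struct netfs_cmd_hdr {
	uint16_t cmd, csize, cpad, ext;
	uint32_t size, trans;
	uint64_t id, start, iv;
};

/* Read an n-byte big-endian unsigned integer starting at p. */
static uint64_t get_be(const unsigned char *p, int n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | *p++;
	return v;
}

/* Convert NETFS_CMD_WIRE_SIZE big-endian header bytes into CPU order. */
static void netfs_cmd_from_wire(const unsigned char *p, struct netfs_cmd_hdr *c)
{
	c->cmd   = (uint16_t)get_be(p + 0, 2);
	c->csize = (uint16_t)get_be(p + 2, 2);
	c->cpad  = (uint16_t)get_be(p + 4, 2);
	c->ext   = (uint16_t)get_be(p + 6, 2);
	c->size  = (uint32_t)get_be(p + 8, 4);
	c->trans = (uint32_t)get_be(p + 12, 4);
	c->id    = get_be(p + 16, 8);
	c->start = get_be(p + 24, 8);
	c->iv    = get_be(p + 32, 8);
}
```

Decoding byte by byte keeps the helper independent of the host's endianness and of any alignment of the receive buffer.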
24 | |||
25 | @cmd - command number, which specifies command to be processed. Following | ||
26 | commands are used currently: | ||
27 | |||
28 | NETFS_READDIR = 1, /* Read directory for given inode number */ | ||
29 | NETFS_READ_PAGE, /* Read data page from the server */ | ||
30 | NETFS_WRITE_PAGE, /* Write data page to the server */ | ||
31 | NETFS_CREATE, /* Create directory entry */ | ||
32 | NETFS_REMOVE, /* Remove directory entry */ | ||
33 | NETFS_LOOKUP, /* Lookup single object */ | ||
34 | NETFS_LINK, /* Create a link */ | ||
35 | NETFS_TRANS, /* Transaction */ | ||
36 | NETFS_OPEN, /* Open intent */ | ||
37 | NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */ | ||
38 | NETFS_PAGE_CACHE, /* Page cache invalidation message */ | ||
39 | NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */ | ||
40 | NETFS_RENAME, /* Rename object */ | ||
41 | NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */ | ||
42 | NETFS_LOCK, /* Distributed lock message */ | ||
43 | NETFS_XATTR_SET, /* Set extended attribute */ | ||
44 | NETFS_XATTR_GET, /* Get extended attribute */ | ||
45 | |||
46 | @ext - external flags. Used by different commands to specify some extra arguments | ||
47 | like partial size of the embedded objects or creation flags. | ||
48 | |||
49 | @size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached, | ||
50 | but size of the requested data is incorporated here. It does not include size of the command | ||
51 | header (struct netfs_cmd) itself. | ||
52 | |||
53 | @id - id of the object this command operates on. Each command can use it for own purpose. | ||
54 | |||
55 | @start - start of the object this command operates on. Each command can use it for own purpose. | ||
56 | |||
57 | @csize, @cpad - size and padding size of the (attached if needed) crypto information. | ||
58 | |||
59 | Command specifications. | ||
60 | |||
61 | @NETFS_READDIR | ||
62 | This command is used to sync content of the remote dir to the client. | ||
63 | |||
64 | @ext - length of the path to object. | ||
65 | @size - the same. | ||
66 | @id - local inode number of the directory to read. | ||
67 | @start - zero. | ||
68 | |||
69 | |||
70 | @NETFS_READ_PAGE | ||
71 | This command is used to read data from remote server. | ||
72 | Data size does not exceed local page cache size. | ||
73 | |||
74 | @id - inode number. | ||
75 | @start - first byte offset. | ||
76 | @size - number of bytes to read plus length of the path to object. | ||
77 | @ext - object path length. | ||
78 | |||
79 | |||
80 | @NETFS_CREATE | ||
81 | Used to create object. | ||
82 | It does not require that all directories on top of the object were | ||
83 | already created, it will create them automatically. Each object has | ||
84 | associated @netfs_path_entry data structure, which contains creation | ||
85 | mode (permissions and type) and length of the name as well as the name itself. | ||
86 | |||
87 | @start - 0 | ||
88 | @size - size of the all data structures needed to create a path | ||
89 | @id - local inode number | ||
90 | @ext - 0 | ||
91 | |||
92 | |||
93 | @NETFS_REMOVE | ||
94 | Used to remove object. | ||
95 | |||
96 | @ext - length of the path to object. | ||
97 | @size - the same. | ||
98 | @id - local inode number. | ||
99 | @start - zero. | ||
100 | |||
101 | |||
102 | @NETFS_LOOKUP | ||
103 | Lookup information about object on server. | ||
104 | |||
105 | @ext - length of the path to object. | ||
106 | @size - the same. | ||
107 | @id - local inode number of the directory to look object in. | ||
108 | @start - local inode number of the object to look at. | ||
109 | |||
110 | |||
111 | @NETFS_LINK | ||
112 | Create hard link or symlink. | ||
113 | Command is sent as "object_path|target_path". | ||
114 | |||
115 | @size - size of the above string. | ||
116 | @id - parent local inode number. | ||
117 | @start - 1 for symlink, 0 for hardlink. | ||
118 | @ext - size of the "object_path" above. | ||
119 | |||
120 | |||
121 | @NETFS_TRANS | ||
122 | Transaction header. | ||
123 | |||
124 | @size - incorporates all embedded command sizes including their header sizes. | ||
125 | @start - transaction generation number - unique id used to find transaction. | ||
126 | @ext - transaction flags. Unused at the moment. | ||
127 | @id - 0. | ||
128 | |||
129 | |||
130 | @NETFS_OPEN | ||
131 | Open intent for given transaction. | ||
132 | |||
133 | @id - local inode number. | ||
134 | @start - 0. | ||
135 | @size - path length to the object. | ||
136 | @ext - open flags (O_RDWR and so on). | ||
137 | |||
138 | |||
139 | @NETFS_INODE_INFO | ||
140 | Metadata update command. | ||
141 | It is sent to servers when attributes of the object are changed and received | ||
142 | when data or metadata were updated. It operates with the following structure: | ||
143 | |||
144 | struct netfs_inode_info | ||
145 | { | ||
146 | unsigned int mode; | ||
147 | unsigned int nlink; | ||
148 | unsigned int uid; | ||
149 | unsigned int gid; | ||
150 | unsigned int blocksize; | ||
151 | unsigned int padding; | ||
152 | __u64 ino; | ||
153 | __u64 blocks; | ||
154 | __u64 rdev; | ||
155 | __u64 size; | ||
156 | __u64 version; | ||
157 | }; | ||
158 | |||
159 | It effectively mirrors stat(2) returned data. | ||
160 | |||
161 | |||
162 | @ext - path length to the object. | ||
163 | @size - the same plus size of the netfs_inode_info structure. | ||
164 | @id - local inode number. | ||
165 | @start - 0. | ||
166 | |||
167 | |||
168 | @NETFS_PAGE_CACHE | ||
169 | Command is only received by clients. It contains information about | ||
170 | page to be marked as not up-to-date. | ||
171 | |||
172 | @id - client's inode number. | ||
173 | @start - last byte of the page to be invalidated. If it is not equal to | ||
174 | current inode size, it will be vmtruncated(). | ||
175 | @size - 0 | ||
176 | @ext - 0 | ||
177 | |||
178 | |||
179 | @NETFS_READ_PAGES | ||
180 | Used to read multiple contiguous pages in one go. | ||
181 | |||
182 | @start - first byte of the contiguous region to read. | ||
183 | @size - consists of two fields: the lower 8 bits represent the page cache shift | ||
184 | used by the client, and the remaining 3 bytes hold the number of pages. | ||
185 | @id - local inode number. | ||
186 | @ext - path length to the object. | ||
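The packing of @size for NETFS_READ_PAGES could be expressed with helpers like the ones below. This is a sketch with hypothetical names. It assumes the "other 3 bytes" are the upper 24 bits of the 32-bit field.

```c
#include <stdint.h>

/* Pack a page count (must fit in 24 bits) and the client's page shift. */
static uint32_t read_pages_pack(uint32_t nr_pages, uint8_t page_shift)
{
	return ((nr_pages & 0xffffff) << 8) | page_shift;
}

/* Number of pages requested: upper 24 bits of @size. */
static uint32_t read_pages_nr(uint32_t size)
{
	return size >> 8;
}

/* Page cache shift used by the client: lower 8 bits of @size. */
static uint8_t read_pages_shift(uint32_t size)
{
	return (uint8_t)(size & 0xff);
}
```

For example, a client with 4 KB pages (shift 12) asking for 16 pages would send a @size of 0x100c under these assumptions.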
187 | |||
188 | |||
189 | @NETFS_RENAME | ||
190 | Used to rename object. | ||
191 | Attached data is formed into following string: "old_path|new_path". | ||
192 | |||
193 | @id - local inode number. | ||
194 | @start - parent inode number. | ||
195 | @size - length of the above string. | ||
196 | @ext - length of the old path part. | ||
197 | |||
198 | |||
199 | @NETFS_CAPABILITIES | ||
200 | Used to exchange crypto capabilities with server. | ||
201 | If crypto capabilities are not supported by server, then client will disable it | ||
202 | or fail (if 'crypto_fail_unsupported' mount option was specified). | ||
203 | |||
204 | @id - superblock index. Used to specify crypto information for group of servers. | ||
205 | @size - size of the attached capabilities structure. | ||
206 | @start - 0. | ||
207 | @size - 0. | ||
208 | @scsize - 0. | ||
209 | |||
210 | @NETFS_LOCK | ||
211 | Used to send lock request/release messages. Although it sends byte range request | ||
212 | and is capable of flushing pages based on that, it is not used, since all Linux | ||
213 | filesystems lock the whole inode. | ||
214 | |||
215 | @id - lock generation number. | ||
216 | @start - start of the locked range. | ||
217 | @size - size of the locked range. | ||
218 | @ext - lock type: read/write. Not actually used. The 15th bit is used to determine | ||
219 | whether it is a lock request (1) or release (0). | ||
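Testing the request/release flag in @ext might look like this. The macro and function names are hypothetical, and the sketch assumes bits are numbered from zero, so that the flag is the top bit of the 16-bit @ext field.

```c
#include <stdint.h>

/* Assumed position of the request/release flag: bit 15 of @ext. */
#define NETFS_LOCK_REQUEST_BIT (1u << 15)

/* Nonzero if @ext describes a lock request, zero for a release. */
static int netfs_lock_is_request(uint16_t ext)
{
	return (ext & NETFS_LOCK_REQUEST_BIT) != 0;
}

/* Remaining bits of @ext: the (currently unused) lock type. */
static uint16_t netfs_lock_type(uint16_t ext)
{
	return (uint16_t)(ext & ~NETFS_LOCK_REQUEST_BIT);
}
```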
220 | |||
221 | @NETFS_XATTR_SET | ||
222 | @NETFS_XATTR_GET | ||
223 | Used to set/get extended attributes for given inode. | ||
224 | @id - attribute generation number or xattr setting type | ||
225 | @start - size of the attribute (request or attached) | ||
226 | @size - name length, path len and data size for given attribute | ||
227 | @ext - path length for given object | ||
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt index 803b1318b13d..fd9a3e693813 100644 --- a/Documentation/ftrace.txt +++ b/Documentation/ftrace.txt | |||
@@ -15,31 +15,31 @@ Introduction | |||
15 | 15 | ||
16 | Ftrace is an internal tracer designed to help out developers and | 16 | Ftrace is an internal tracer designed to help out developers and |
17 | designers of systems to find what is going on inside the kernel. | 17 | designers of systems to find what is going on inside the kernel. |
18 | It can be used for debugging or analyzing latencies and performance | 18 | It can be used for debugging or analyzing latencies and |
19 | issues that take place outside of user-space. | 19 | performance issues that take place outside of user-space. |
20 | 20 | ||
21 | Although ftrace is the function tracer, it also includes an | 21 | Although ftrace is the function tracer, it also includes an |
22 | infrastructure that allows for other types of tracing. Some of the | 22 | infrastructure that allows for other types of tracing. Some of |
23 | tracers that are currently in ftrace include a tracer to trace | 23 | the tracers that are currently in ftrace include a tracer to |
24 | context switches, the time it takes for a high priority task to | 24 | trace context switches, the time it takes for a high priority |
25 | run after it was woken up, the time interrupts are disabled, and | 25 | task to run after it was woken up, the time interrupts are |
26 | more (ftrace allows for tracer plugins, which means that the list of | 26 | disabled, and more (ftrace allows for tracer plugins, which |
27 | tracers can always grow). | 27 | means that the list of tracers can always grow). |
28 | 28 | ||
29 | 29 | ||
30 | The File System | 30 | The File System |
31 | --------------- | 31 | --------------- |
32 | 32 | ||
33 | Ftrace uses the debugfs file system to hold the control files as well | 33 | Ftrace uses the debugfs file system to hold the control files as |
34 | as the files to display output. | 34 | well as the files to display output. |
35 | 35 | ||
36 | To mount the debugfs system: | 36 | To mount the debugfs system: |
37 | 37 | ||
38 | # mkdir /debug | 38 | # mkdir /debug |
39 | # mount -t debugfs nodev /debug | 39 | # mount -t debugfs nodev /debug |
40 | 40 | ||
41 | (Note: it is more common to mount at /sys/kernel/debug, but for simplicity | 41 | ( Note: it is more common to mount at /sys/kernel/debug, but for |
42 | this document will use /debug) | 42 | simplicity this document will use /debug) |
43 | 43 | ||
44 | That's it! (assuming that you have ftrace configured into your kernel) | 44 | That's it! (assuming that you have ftrace configured into your kernel) |
45 | 45 | ||
@@ -50,90 +50,124 @@ of ftrace. Here is a list of some of the key files: | |||
50 | 50 | ||
51 | Note: all time values are in microseconds. | 51 | Note: all time values are in microseconds. |
52 | 52 | ||
53 | current_tracer: This is used to set or display the current tracer | 53 | current_tracer: |
54 | that is configured. | 54 | |
55 | 55 | This is used to set or display the current tracer | |
56 | available_tracers: This holds the different types of tracers that | 56 | that is configured. |
57 | have been compiled into the kernel. The tracers | 57 | |
58 | listed here can be configured by echoing their name | 58 | available_tracers: |
59 | into current_tracer. | 59 | |
60 | 60 | This holds the different types of tracers that | |
61 | tracing_enabled: This sets or displays whether the current_tracer | 61 | have been compiled into the kernel. The |
62 | is activated and tracing or not. Echo 0 into this | 62 | tracers listed here can be configured by |
63 | file to disable the tracer or 1 to enable it. | 63 | echoing their name into current_tracer. |
64 | 64 | ||
65 | trace: This file holds the output of the trace in a human readable | 65 | tracing_enabled: |
66 | format (described below). | 66 | |
67 | 67 | This sets or displays whether the current_tracer | |
68 | latency_trace: This file shows the same trace but the information | 68 | is activated and tracing or not. Echo 0 into this |
69 | is organized more to display possible latencies | 69 | file to disable the tracer or 1 to enable it. |
70 | in the system (described below). | 70 | |
71 | 71 | trace: | |
72 | trace_pipe: The output is the same as the "trace" file but this | 72 | |
73 | file is meant to be streamed with live tracing. | 73 | This file holds the output of the trace in a human |
74 | Reads from this file will block until new data | 74 | readable format (described below). |
75 | is retrieved. Unlike the "trace" and "latency_trace" | 75 | |
76 | files, this file is a consumer. This means reading | 76 | latency_trace: |
77 | from this file causes sequential reads to display | 77 | |
78 | more current data. Once data is read from this | 78 | This file shows the same trace but the information |
79 | file, it is consumed, and will not be read | 79 | is organized more to display possible latencies |
80 | again with a sequential read. The "trace" and | 80 | in the system (described below). |
81 | "latency_trace" files are static, and if the | 81 | |
82 | tracer is not adding more data, they will display | 82 | trace_pipe: |
83 | the same information every time they are read. | 83 | |
84 | 84 | The output is the same as the "trace" file but this | |
85 | trace_options: This file lets the user control the amount of data | 85 | file is meant to be streamed with live tracing. |
86 | that is displayed in one of the above output | 86 | Reads from this file will block until new data |
87 | files. | 87 | is retrieved. Unlike the "trace" and "latency_trace" |
88 | 88 | files, this file is a consumer. This means reading | |
89 | trace_max_latency: Some of the tracers record the max latency. | 89 | from this file causes sequential reads to display |
90 | For example, the time interrupts are disabled. | 90 | more current data. Once data is read from this |
91 | This time is saved in this file. The max trace | 91 | file, it is consumed, and will not be read |
92 | will also be stored, and displayed by either | 92 | again with a sequential read. The "trace" and |
93 | "trace" or "latency_trace". A new max trace will | 93 | "latency_trace" files are static, and if the |
94 | only be recorded if the latency is greater than | 94 | tracer is not adding more data, they will display |
95 | the value in this file. (in microseconds) | 95 | the same information every time they are read. |
96 | 96 | ||
97 | buffer_size_kb: This sets or displays the number of kilobytes each CPU | 97 | trace_options: |
98 | buffer can hold. The tracer buffers are the same size | 98 | |
99 | for each CPU. The displayed number is the size of the | 99 | This file lets the user control the amount of data |
100 | CPU buffer and not total size of all buffers. The | 100 | that is displayed in one of the above output |
101 | trace buffers are allocated in pages (blocks of memory | 101 | files. |
102 | that the kernel uses for allocation, usually 4 KB in size). | 102 | |
103 | If the last page allocated has room for more bytes | 103 | tracing_max_latency: |
104 | than requested, the rest of the page will be used, | 104 | |
105 | making the actual allocation bigger than requested. | 105 | Some of the tracers record the max latency. |
106 | (Note, the size may not be a multiple of the page size due | 106 | For example, the time interrupts are disabled. |
107 | to buffer managment overhead.) | 107 | This time is saved in this file. The max trace |
108 | 108 | will also be stored, and displayed by either | |
109 | This can only be updated when the current_tracer | 109 | "trace" or "latency_trace". A new max trace will |
110 | is set to "nop". | 110 | only be recorded if the latency is greater than |
111 | 111 | the value in this file. (in microseconds) | |
112 | tracing_cpumask: This is a mask that lets the user only trace | 112 | |
113 | on specified CPUS. The format is a hex string | 113 | buffer_size_kb: |
114 | representing the CPUS. | 114 | |
115 | 115 | This sets or displays the number of kilobytes each CPU | |
116 | set_ftrace_filter: When dynamic ftrace is configured in (see the | 116 | buffer can hold. The tracer buffers are the same size |
117 | section below "dynamic ftrace"), the code is dynamically | 117 | for each CPU. The displayed number is the size of the |
118 | modified (code text rewrite) to disable calling of the | 118 | CPU buffer and not total size of all buffers. The |
119 | function profiler (mcount). This lets tracing be configured | 119 | trace buffers are allocated in pages (blocks of memory |
120 | in with practically no overhead in performance. This also | 120 | that the kernel uses for allocation, usually 4 KB in size). |
121 | has a side effect of enabling or disabling specific functions | 121 | If the last page allocated has room for more bytes |
122 | to be traced. Echoing names of functions into this file | 122 | than requested, the rest of the page will be used, |
123 | will limit the trace to only those functions. | 123 | making the actual allocation bigger than requested. |
124 | 124 | ( Note, the size may not be a multiple of the page size | |
125 | set_ftrace_notrace: This has an effect opposite to that of | 125 | due to buffer management overhead. ) |
126 | set_ftrace_filter. Any function that is added here will not | 126 | |
127 | be traced. If a function exists in both set_ftrace_filter | 127 | This can only be updated when the current_tracer |
128 | and set_ftrace_notrace, the function will _not_ be traced. | 128 | is set to "nop". |
129 | 129 | ||
130 | set_ftrace_pid: Have the function tracer only trace a single thread. | 130 | tracing_cpumask: |
131 | 131 | ||
132 | available_filter_functions: This lists the functions that ftrace | 132 | This is a mask that lets the user only trace |
133 | has processed and can trace. These are the function | 133 | on specified CPUS. The format is a hex string |
134 | names that you can pass to "set_ftrace_filter" or | 134 | representing the CPUS. |
135 | "set_ftrace_notrace". (See the section "dynamic ftrace" | 135 | |
136 | below for more details.) | 136 | set_ftrace_filter: |
137 | |||
138 | When dynamic ftrace is configured in (see the | ||
139 | section below "dynamic ftrace"), the code is dynamically | ||
140 | modified (code text rewrite) to disable calling of the | ||
141 | function profiler (mcount). This lets tracing be configured | ||
142 | in with practically no overhead in performance. This also | ||
143 | has a side effect of enabling or disabling specific functions | ||
144 | to be traced. Echoing names of functions into this file | ||
145 | will limit the trace to only those functions. | ||
146 | |||
147 | set_ftrace_notrace: | ||
148 | |||
149 | This has an effect opposite to that of | ||
150 | set_ftrace_filter. Any function that is added here will not | ||
151 | be traced. If a function exists in both set_ftrace_filter | ||
152 | and set_ftrace_notrace, the function will _not_ be traced. | ||
153 | |||
154 | set_ftrace_pid: | ||
155 | |||
156 | Have the function tracer only trace a single thread. | ||
157 | |||
158 | set_graph_function: | ||
159 | |||
160 | Set a "trigger" function where tracing should start | ||
161 | with the function graph tracer (See the section | ||
162 | "dynamic ftrace" for more details). | ||
163 | |||
164 | available_filter_functions: | ||
165 | |||
166 | This lists the functions that ftrace | ||
167 | has processed and can trace. These are the function | ||
168 | names that you can pass to "set_ftrace_filter" or | ||
169 | "set_ftrace_notrace". (See the section "dynamic ftrace" | ||
170 | below for more details.) | ||
137 | 171 | ||
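As an illustration of the tracing_cpumask description above, the following Python sketch converts between a list of CPU numbers and a hex mask string. The helper names are hypothetical, and the sketch assumes the conventional encoding in which bit N of the mask stands for CPU N; it does not touch the debugfs file itself.

```python
def cpus_to_mask(cpus):
    """Return a hex string with one bit set per CPU number in `cpus`.

    Assumption: bit N of the mask corresponds to CPU N.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the sorted CPU numbers whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

Under that assumption, the string produced by cpus_to_mask([0, 2]) is what one would echo into tracing_cpumask to restrict tracing to CPUs 0 and 2.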
138 | 172 | ||
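The interaction between set_ftrace_filter and set_ftrace_notrace can be modelled with a small Python sketch. The function name is hypothetical, and the sketch assumes that an empty filter means "no restriction"; it only models the rule that notrace wins when a function appears in both files.

```python
def is_traced(func, filter_set, notrace_set):
    """Model whether `func` would be traced.

    Assumptions: an empty filter_set places no limit on what is traced,
    and set_ftrace_notrace takes precedence over set_ftrace_filter.
    """
    if func in notrace_set:
        # Listed in set_ftrace_notrace: never traced, even if filtered in.
        return False
    # Either no filter is active, or the function was explicitly selected.
    return not filter_set or func in filter_set
```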
139 | The Tracers | 173 | The Tracers |
@@ -141,36 +175,66 @@ The Tracers | |||
141 | 175 | ||
142 | Here is the list of current tracers that may be configured. | 176 | Here is the list of current tracers that may be configured. |
143 | 177 | ||
144 | function - function tracer that uses mcount to trace all functions. | 178 | "function" |
179 | |||
180 | Function call tracer to trace all kernel functions. | ||
181 | |||
182 | "function_graph_tracer" | ||
183 | |||
184 | Similar to the function tracer except that the | ||
185 | function tracer probes the functions on their entry | ||
186 | whereas the function graph tracer traces on both entry | ||
187 | and exit of the functions. It then provides the ability | ||
188 | to draw a graph of function calls similar to C code | ||
189 | source. | ||
145 | 190 | ||
146 | sched_switch - traces the context switches between tasks. | 191 | "sched_switch" |
147 | 192 | ||
148 | irqsoff - traces the areas that disable interrupts and saves | 193 | Traces the context switches and wakeups between tasks. |
149 | the trace with the longest max latency. | ||
150 | See tracing_max_latency. When a new max is recorded, | ||
151 | it replaces the old trace. It is best to view this | ||
152 | trace via the latency_trace file. | ||
153 | 194 | ||
154 | preemptoff - Similar to irqsoff but traces and records the amount of | 195 | "irqsoff" |
155 | time for which preemption is disabled. | ||
156 | 196 | ||
157 | preemptirqsoff - Similar to irqsoff and preemptoff, but traces and | 197 | Traces the areas that disable interrupts and saves |
158 | records the largest time for which irqs and/or preemption | 198 | the trace with the longest max latency. |
159 | is disabled. | 199 | See tracing_max_latency. When a new max is recorded, |
200 | it replaces the old trace. It is best to view this | ||
201 | trace via the latency_trace file. | ||
160 | 202 | ||
161 | wakeup - Traces and records the max latency that it takes for | 203 | "preemptoff" |
162 | the highest priority task to get scheduled after | ||
163 | it has been woken up. | ||
164 | 204 | ||
165 | nop - This is not a tracer. To remove all tracers from tracing | 205 | Similar to irqsoff but traces and records the amount of |
166 | simply echo "nop" into current_tracer. | 206 | time for which preemption is disabled. |
207 | |||
208 | "preemptirqsoff" | ||
209 | |||
210 | Similar to irqsoff and preemptoff, but traces and | ||
211 | records the largest time for which irqs and/or preemption | ||
212 | is disabled. | ||
213 | |||
214 | "wakeup" | ||
215 | |||
216 | Traces and records the max latency that it takes for | ||
217 | the highest priority task to get scheduled after | ||
218 | it has been woken up. | ||
219 | |||
220 | "hw-branch-tracer" | ||
221 | |||
222 | Uses the BTS CPU feature on x86 CPUs to trace all | ||
223 | branches executed. | ||
224 | |||
225 | "nop" | ||
226 | |||
227 | This is the "trace nothing" tracer. To remove all | ||
228 | tracers from tracing simply echo "nop" into | ||
229 | current_tracer. | ||
167 | 230 | ||
168 | 231 | ||
169 | Examples of using the tracer | 232 | Examples of using the tracer |
170 | ---------------------------- | 233 | ---------------------------- |
171 | 234 | ||
172 | Here are typical examples of using the tracers when controlling them only | 235 | Here are typical examples of using the tracers when controlling |
173 | with the debugfs interface (without using any user-land utilities). | 236 | them only with the debugfs interface (without using any |
237 | user-land utilities). | ||
174 | 238 | ||
175 | Output format: | 239 | Output format: |
176 | -------------- | 240 | -------------- |
@@ -187,16 +251,16 @@ Here is an example of the output format of the file "trace" | |||
187 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput | 251 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput |
188 | -------- | 252 | -------- |
189 | 253 | ||
190 | A header is printed with the tracer name that is represented by the trace. | 254 | A header is printed with the tracer name that is represented by |
191 | In this case the tracer is "function". Then a header showing the format. Task | 255 | the trace. In this case the tracer is "function". Then a header |
192 | name "bash", the task PID "4251", the CPU that it was running on | 256 | showing the format. Task name "bash", the task PID "4251", the |
193 | "01", the timestamp in <secs>.<usecs> format, the function name that was | 257 | CPU that it was running on "01", the timestamp in <secs>.<usecs> |
194 | traced "path_put" and the parent function that called this function | 258 | format, the function name that was traced "path_put" and the |
195 | "path_walk". The timestamp is the time at which the function was | 259 | parent function that called this function "path_walk". The |
196 | entered. | 260 | timestamp is the time at which the function was entered. |
197 | 261 | ||
198 | The sched_switch tracer also includes tracing of task wakeups and | 262 | The sched_switch tracer also includes tracing of task wakeups |
199 | context switches. | 263 | and context switches. |
200 | 264 | ||
201 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S | 265 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S |
202 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S | 266 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S |
@@ -205,8 +269,8 @@ context switches. | |||
205 | kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R | 269 | kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R |
206 | ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R | 270 | ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R |
207 | 271 | ||
208 | Wake ups are represented by a "+" and the context switches are shown as | 272 | Wake ups are represented by a "+" and the context switches are |
209 | "==>". The format is: | 273 | shown as "==>". The format is: |
210 | 274 | ||
211 | Context switches: | 275 | Context switches: |
212 | 276 | ||
@@ -220,19 +284,20 @@ Wake ups are represented by a "+" and the context switches are shown as | |||
220 | 284 | ||
221 | <pid>:<prio>:<state> + <pid>:<prio>:<state> | 285 | <pid>:<prio>:<state> + <pid>:<prio>:<state> |
222 | 286 | ||
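The record layout described above can be picked apart with a regular expression. This Python sketch is based only on the sample lines shown in this section; the function name and the keys of the returned dictionary are hypothetical, and other kernels may print the records differently.

```python
import re

# task-pid [cpu] timestamp: pid:prio:state (+ or ==>) pid:prio:state
_RECORD = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+"
    r"(?P<a_pid>\d+):(?P<a_prio>\d+):(?P<a_state>\S+)\s+"
    r"(?P<op>\+|==>)\s+"
    r"(?P<b_pid>\d+):(?P<b_prio>\d+):(?P<b_state>\S+)\s*$"
)


def parse_sched_record(line):
    """Parse one sched_switch output line; return None if it does not match."""
    m = _RECORD.match(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "timestamp": m.group("ts"),  # kept as text: <secs>.<usecs>
        "event": "wakeup" if m.group("op") == "+" else "switch",
        "first": (int(m.group("a_pid")), int(m.group("a_prio")), m.group("a_state")),
        "second": (int(m.group("b_pid")), int(m.group("b_prio")), m.group("b_state")),
    }
```

For a wake up, "first" is the current task and "second" the task being woken; for a context switch, "first" is the previous task and "second" the next one.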
223 | The prio is the internal kernel priority, which is the inverse of the | 287 | The prio is the internal kernel priority, which is the inverse |
224 | priority that is usually displayed by user-space tools. Zero represents | 288 | of the priority that is usually displayed by user-space tools. |
225 | the highest priority (99). Prio 100 starts the "nice" priorities with | 289 | Zero represents the highest priority (99). Prio 100 starts the |
226 | 100 being equal to nice -20 and 139 being nice 19. The prio "140" is | 290 | "nice" priorities with 100 being equal to nice -20 and 139 being |
227 | reserved for the idle task which is the lowest priority thread (pid 0). | 291 | nice 19. The prio "140" is reserved for the idle task which is |
292 | the lowest priority thread (pid 0). | ||
228 | 293 | ||
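The priority mapping in the paragraph above is simple arithmetic, sketched here in Python. The function name and the returned tuple layout are hypothetical; the ranges are taken directly from the text (0 to 99 real-time, 100 to 139 nice, 140 idle).

```python
def describe_prio(prio):
    """Map an internal kernel priority to its user-space meaning.

    Returns ("rt", user_rt_priority), ("nice", nice_value) or ("idle", None).
    """
    if 0 <= prio <= 99:
        # Kernel prio 0 is the highest real-time priority (99).
        return ("rt", 99 - prio)
    if 100 <= prio <= 139:
        # Kernel prio 100 is nice -20 and 139 is nice 19.
        return ("nice", prio - 120)
    if prio == 140:
        # Reserved for the idle task (pid 0).
        return ("idle", None)
    raise ValueError("kernel priority out of range: %d" % prio)
```

For instance, the prio 115 seen in the sample output above falls in the nice range.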
229 | 294 | ||
230 | Latency trace format | 295 | Latency trace format |
231 | -------------------- | 296 | -------------------- |
232 | 297 | ||
233 | For traces that display latency times, the latency_trace file gives | 298 | For traces that display latency times, the latency_trace file |
234 | somewhat more information to see why a latency happened. Here is a typical | 299 | gives somewhat more information to see why a latency happened. |
235 | trace. | 300 | Here is a typical trace. |
236 | 301 | ||
237 | # tracer: irqsoff | 302 | # tracer: irqsoff |
238 | # | 303 | # |
@@ -259,20 +324,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
259 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) | 324 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) |
260 | 325 | ||
261 | 326 | ||
327 | This shows that the current tracer is "irqsoff" tracing the time | ||
328 | for which interrupts were disabled. It gives the trace version | ||
329 | and the version of the kernel upon which this was executed on | ||
330 | (2.6.26-rc8). Then it displays the max latency in microsecs (97 | ||
331 | us). The number of trace entries displayed and the total number | ||
332 | recorded (both are three: #3/3). The type of preemption that was | ||
333 | used (PREEMPT). VP, KP, SP, and HP are always zero and are | ||
334 | reserved for later use. #P is the number of online CPUS (#P:2). | ||
262 | 335 | ||
263 | This shows that the current tracer is "irqsoff" tracing the time for which | 336 | The task is the process that was running when the latency |
264 | interrupts were disabled. It gives the trace version and the version | 337 | occurred. (swapper pid: 0). |
265 | of the kernel upon which this was executed on (2.6.26-rc8). Then it displays | ||
266 | the max latency in microsecs (97 us). The number of trace entries displayed | ||
267 | and the total number recorded (both are three: #3/3). The type of | ||
268 | preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero | ||
269 | and are reserved for later use. #P is the number of online CPUS (#P:2). | ||
270 | |||
271 | The task is the process that was running when the latency occurred. | ||
272 | (swapper pid: 0). | ||
273 | 338 | ||
274 | The start and stop (the functions in which the interrupts were disabled and | 339 | The start and stop (the functions in which the interrupts were |
275 | enabled respectively) that caused the latencies: | 340 | disabled and enabled respectively) that caused the latencies: |
276 | 341 | ||
277 | apic_timer_interrupt is where the interrupts were disabled. | 342 | apic_timer_interrupt is where the interrupts were disabled. |
278 | do_softirq is where they were enabled again. | 343 | do_softirq is where they were enabled again. |
@@ -308,12 +373,12 @@ The above is mostly meaningful for kernel developers. | |||
308 | latency_trace file is relative to the start of the trace. | 373 | latency_trace file is relative to the start of the trace. |
309 | 374 | ||
310 | delay: This is just to help catch your eye a bit better. And | 375 | delay: This is just to help catch your eye a bit better. And |
311 | needs to be fixed to be only relative to the same CPU. | 376 | needs to be fixed to be only relative to the same CPU. |
312 | The marks are determined by the difference between this | 377 | The marks are determined by the difference between this |
313 | current trace and the next trace. | 378 | current trace and the next trace. |
314 | '!' - greater than preempt_mark_thresh (default 100) | 379 | '!' - greater than preempt_mark_thresh (default 100) |
315 | '+' - greater than 1 microsecond | 380 | '+' - greater than 1 microsecond |
316 | ' ' - less than or equal to 1 microsecond. | 381 | ' ' - less than or equal to 1 microsecond. |
317 | 382 | ||
318 | The rest is the same as the 'trace' file. | 383 | The rest is the same as the 'trace' file. |
319 | 384 | ||
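The rule for choosing the delay mark can be written as a short Python sketch. The function name is hypothetical, and the sketch assumes that preempt_mark_thresh is expressed in microseconds, which the text does not state explicitly.

```python
def delay_mark(delta_us, preempt_mark_thresh=100):
    """Return the delay mark for the gap between this trace and the next.

    Assumption: both delta_us and preempt_mark_thresh are in microseconds.
    """
    if delta_us > preempt_mark_thresh:
        return "!"
    if delta_us > 1:
        return "+"
    return " "
```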
@@ -321,14 +386,15 @@ The above is mostly meaningful for kernel developers. | |||
321 | trace_options | 386 | trace_options |
322 | ------------- | 387 | ------------- |
323 | 388 | ||
324 | The trace_options file is used to control what gets printed in the trace | 389 | The trace_options file is used to control what gets printed in |
325 | output. To see what is available, simply cat the file: | 390 | the trace output. To see what is available, simply cat the file: |
326 | 391 | ||
327 | cat /debug/tracing/trace_options | 392 | cat /debug/tracing/trace_options |
328 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ | 393 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ |
329 | noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj | 394 | noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj |
330 | 395 | ||
331 | To disable one of the options, echo in the option prepended with "no". | 396 | To disable one of the options, echo in the option prepended with |
397 | "no". | ||
332 | 398 | ||
333 | echo noprint-parent > /debug/tracing/trace_options | 399 | echo noprint-parent > /debug/tracing/trace_options |
334 | 400 | ||
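The "no" prefix convention used by trace_options can be sketched in Python as follows. Both helper names are hypothetical, and the parser assumes that no option name itself begins with "no", which holds for the options listed in this section but is not guaranteed in general.

```python
def parse_trace_options(text):
    """Turn the output of `cat trace_options` into {option: enabled}.

    Assumption: a leading "no" always means "disabled".
    """
    options = {}
    # The sample output wraps with a trailing backslash; drop it.
    for word in text.replace("\\", " ").split():
        if word.startswith("no"):
            options[word[2:]] = False
        else:
            options[word] = True
    return options


def option_word(name, enabled):
    """Return the word to echo into trace_options for the desired state."""
    return name if enabled else "no" + name
```

option_word("print-parent", False) yields the "noprint-parent" string used in the echo command above.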
@@ -338,8 +404,8 @@ To enable an option, leave off the "no". | |||
338 | 404 | ||
339 | Here are the available options: | 405 | Here are the available options: |
340 | 406 | ||
341 | print-parent - On function traces, display the calling function | 407 | print-parent - On function traces, display the calling (parent) |
342 | as well as the function being traced. | 408 | function as well as the function being traced. |
343 | 409 | ||
344 | print-parent: | 410 | print-parent: |
345 | bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul | 411 | bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul |
@@ -348,15 +414,16 @@ Here are the available options: | |||
348 | bash-4000 [01] 1477.606694: simple_strtoul | 414 | bash-4000 [01] 1477.606694: simple_strtoul |
349 | 415 | ||
350 | 416 | ||
351 | sym-offset - Display not only the function name, but also the offset | 417 | sym-offset - Display not only the function name, but also the |
352 | in the function. For example, instead of seeing just | 418 | offset in the function. For example, instead of |
353 | "ktime_get", you will see "ktime_get+0xb/0x20". | 419 | seeing just "ktime_get", you will see |
420 | "ktime_get+0xb/0x20". | ||
354 | 421 | ||
355 | sym-offset: | 422 | sym-offset: |
356 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 | 423 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 |
357 | 424 | ||
358 | sym-addr - this will also display the function address as well as | 425 | sym-addr - this will also display the function address as well |
359 | the function name. | 426 | as the function name. |
360 | 427 | ||
361 | sym-addr: | 428 | sym-addr: |
362 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> | 429 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> |
@@ -366,35 +433,41 @@ Here are the available options: | |||
366 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ | 433 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ |
367 | (+0.000ms): simple_strtoul (strict_strtoul) | 434 | (+0.000ms): simple_strtoul (strict_strtoul) |
368 | 435 | ||
369 | raw - This will display raw numbers. This option is best for use with | 436 | raw - This will display raw numbers. This option is best for |
370 | user applications that can translate the raw numbers better than | 437 | use with user applications that can translate the raw |
371 | having it done in the kernel. | 438 | numbers better than having it done in the kernel. |
372 | 439 | ||
373 | hex - Similar to raw, but the numbers will be in a hexadecimal format. | 440 | hex - Similar to raw, but the numbers will be in a hexadecimal |
441 | format. | ||
374 | 442 | ||
375 | bin - This will print out the formats in raw binary. | 443 | bin - This will print out the formats in raw binary. |
376 | 444 | ||
377 | block - TBD (needs update) | 445 | block - TBD (needs update) |
378 | 446 | ||
379 | stacktrace - This is one of the options that changes the trace itself. | 447 | stacktrace - This is one of the options that changes the trace |
380 | When a trace is recorded, so is the stack of functions. | 448 | itself. When a trace is recorded, so is the stack |
381 | This allows for back traces of trace sites. | 449 | of functions. This allows for back traces of |
450 | trace sites. | ||
382 | 451 | ||
383 | userstacktrace - This option changes the trace. | 452 | userstacktrace - This option changes the trace. It records a |
384 | It records a stacktrace of the current userspace thread. | 453 | stacktrace of the current userspace thread. |
385 | 454 | ||
386 | sym-userobj - when user stacktraces are enabled, look up which object the | 455 | sym-userobj - when user stacktraces are enabled, look up which |
387 | address belongs to, and print a relative address | 456 | object the address belongs to, and print a |
388 | This is especially useful when ASLR is on, otherwise you don't | 457 | relative address. This is especially useful when |
389 | get a chance to resolve the address to object/file/line after the app is no | 458 | ASLR is on, otherwise you don't get a chance to |
390 | longer running | 459 | resolve the address to object/file/line after |
460 | the app is no longer running | ||
391 | 461 | ||
392 | The lookup is performed when you read trace,trace_pipe,latency_trace. Example: | 462 | The lookup is performed when you read |
463 | trace,trace_pipe,latency_trace. Example: | ||
393 | 464 | ||
394 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 | 465 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 |
395 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | 466 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] |
396 | 467 | ||
397 | sched-tree - TBD (any users??) | 468 | sched-tree - trace all tasks that are on the runqueue, at |
469 | every scheduling event. Will add overhead if | ||
470 | there's a lot of tasks running at once. | ||
398 | 471 | ||
399 | 472 | ||
400 | sched_switch | 473 | sched_switch |
@@ -431,18 +504,19 @@ of how to use it. | |||
431 | [...] | 504 | [...] |
432 | 505 | ||
433 | 506 | ||
434 | As we have discussed previously about this format, the header shows | 507 | As we have discussed previously about this format, the header |
435 | the name of the trace and points to the options. The "FUNCTION" | 508 | shows the name of the trace and points to the options. The |
436 | is a misnomer since here it represents the wake ups and context | 509 | "FUNCTION" is a misnomer since here it represents the wake ups |
437 | switches. | 510 | and context switches. |
438 | 511 | ||
439 | The sched_switch file only lists the wake ups (represented with '+') | 512 | The sched_switch file only lists the wake ups (represented with |
440 | and context switches ('==>') with the previous task or current task | 513 | '+') and context switches ('==>') with the previous task or |
441 | first followed by the next task or task waking up. The format for both | 514 | current task first followed by the next task or task waking up. |
442 | of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO | 515 | The format for both of these is PID:KERNEL-PRIO:TASK-STATE. |
443 | is the inverse of the actual priority with zero (0) being the highest | 516 | Remember that the KERNEL-PRIO is the inverse of the actual |
444 | priority and the nice values starting at 100 (nice -20). Below is | 517 | priority with zero (0) being the highest priority and the nice |
445 | a quick chart to map the kernel priority to user land priorities. | 518 | values starting at 100 (nice -20). Below is a quick chart to map |
519 | the kernel priority to user land priorities. | ||
446 | 520 | ||
447 | Kernel priority: 0 to 99 ==> user RT priority 99 to 0 | 521 | Kernel priority: 0 to 99 ==> user RT priority 99 to 0 |
448 | Kernel priority: 100 to 139 ==> user nice -20 to 19 | 522 | Kernel priority: 100 to 139 ==> user nice -20 to 19 |
@@ -463,10 +537,10 @@ The task states are: | |||
463 | ftrace_enabled | 537 | ftrace_enabled |
464 | -------------- | 538 | -------------- |
465 | 539 | ||
466 | The following tracers (listed below) give different output depending | 540 | The following tracers (listed below) give different output |
467 | on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled, | 541 | depending on whether or not the sysctl ftrace_enabled is set. To |
468 | one can either use the sysctl function or set it via the proc | 542 | set ftrace_enabled, one can either use the sysctl function or |
469 | file system interface. | 543 | set it via the proc file system interface. |
470 | 544 | ||
471 | sysctl kernel.ftrace_enabled=1 | 545 | sysctl kernel.ftrace_enabled=1 |
472 | 546 | ||
@@ -474,12 +548,12 @@ file system interface. | |||
474 | 548 | ||
475 | echo 1 > /proc/sys/kernel/ftrace_enabled | 549 | echo 1 > /proc/sys/kernel/ftrace_enabled |
476 | 550 | ||
477 | To disable ftrace_enabled simply replace the '1' with '0' in | 551 | To disable ftrace_enabled simply replace the '1' with '0' in the |
478 | the above commands. | 552 | above commands. |
479 | 553 | ||
480 | When ftrace_enabled is set the tracers will also record the functions | 554 | When ftrace_enabled is set the tracers will also record the |
481 | that are within the trace. The descriptions of the tracers | 555 | functions that are within the trace. The descriptions of the |
482 | will also show an example with ftrace enabled. | 556 | tracers will also show an example with ftrace enabled. |
483 | 557 | ||
484 | 558 | ||
485 | irqsoff | 559 | irqsoff |
@@ -487,17 +561,18 @@ irqsoff | |||
487 | 561 | ||
488 | When interrupts are disabled, the CPU can not react to any other | 562 | When interrupts are disabled, the CPU can not react to any other |
489 | external event (besides NMIs and SMIs). This prevents the timer | 563 | external event (besides NMIs and SMIs). This prevents the timer |
490 | interrupt from triggering or the mouse interrupt from letting the | 564 | interrupt from triggering or the mouse interrupt from letting |
491 | kernel know of a new mouse event. The result is a latency with the | 565 | the kernel know of a new mouse event. The result is a latency |
492 | reaction time. | 566 | with the reaction time. |
493 | 567 | ||
494 | The irqsoff tracer tracks the time for which interrupts are disabled. | 568 | The irqsoff tracer tracks the time for which interrupts are |
495 | When a new maximum latency is hit, the tracer saves the trace leading up | 569 | disabled. When a new maximum latency is hit, the tracer saves |
496 | to that latency point so that every time a new maximum is reached, the old | 570 | the trace leading up to that latency point so that every time a |
497 | saved trace is discarded and the new trace is saved. | 571 | new maximum is reached, the old saved trace is discarded and the |
572 | new trace is saved. | ||
498 | 573 | ||
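The "keep only the worst case" behaviour described above can be modelled in a few lines of Python. This is a toy model with hypothetical names, not the kernel's implementation; the reset method mirrors writing 0 into tracing_max_latency, and the sketch assumes that a reset leaves the previously saved trace in place until a new maximum arrives.

```python
class MaxLatencyTracker:
    """Toy model of a max-latency tracer such as irqsoff."""

    def __init__(self):
        self.max_latency_us = 0
        self.saved_trace = None

    def record(self, latency_us, trace):
        """Save `trace` only if it beats the current maximum.

        Returns True when a new maximum was recorded.
        """
        if latency_us > self.max_latency_us:
            self.max_latency_us = latency_us
            self.saved_trace = trace  # the old saved trace is discarded
            return True
        return False

    def reset(self):
        """Model of `echo 0 > tracing_max_latency`."""
        self.max_latency_us = 0
```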
499 | To reset the maximum, echo 0 into tracing_max_latency. Here is an | 574 | To reset the maximum, echo 0 into tracing_max_latency. Here is |
500 | example: | 575 | an example: |
501 | 576 | ||
502 | # echo irqsoff > /debug/tracing/current_tracer | 577 | # echo irqsoff > /debug/tracing/current_tracer |
503 | # echo 0 > /debug/tracing/tracing_max_latency | 578 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -532,10 +607,11 @@ irqsoff latency trace v1.1.5 on 2.6.26 | |||
532 | 607 | ||
533 | 608 | ||
534 | Here we see that we had a latency of 12 microsecs (which is | 609 | Here we see that we had a latency of 12 microsecs (which is |
535 | very good). The _write_lock_irq in sys_setpgid disabled interrupts. | 610 | very good). The _write_lock_irq in sys_setpgid disabled |
536 | The difference between the 12 and the displayed timestamp 14us occurred | 611 | interrupts. The difference between the 12 and the displayed |
537 | because the clock was incremented between the time of recording the max | 612 | timestamp 14us occurred because the clock was incremented |
538 | latency and the time of recording the function that had that latency. | 613 | between the time of recording the max latency and the time of |
614 | recording the function that had that latency. | ||
539 | 615 | ||
540 | Note the above example had ftrace_enabled not set. If we set the | 616 | Note the above example had ftrace_enabled not set. If we set the |
541 | ftrace_enabled, we get a much larger output: | 617 | ftrace_enabled, we get a much larger output: |
@@ -586,24 +662,24 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
586 | 662 | ||
587 | 663 | ||
588 | Here we traced a 50 microsecond latency. But we also see all the | 664 | Here we traced a 50 microsecond latency. But we also see all the |
589 | functions that were called during that time. Note that by enabling | 665 | functions that were called during that time. Note that by |
590 | function tracing, we incur an added overhead. This overhead may | 666 | enabling function tracing, we incur an added overhead. This |
591 | extend the latency times. But nevertheless, this trace has provided | 667 | overhead may extend the latency times. But nevertheless, this |
592 | some very helpful debugging information. | 668 | trace has provided some very helpful debugging information. |
593 | 669 | ||
594 | 670 | ||
595 | preemptoff | 671 | preemptoff |
596 | ---------- | 672 | ---------- |
597 | 673 | ||
598 | When preemption is disabled, we may be able to receive interrupts but | 674 | When preemption is disabled, we may be able to receive |
599 | the task cannot be preempted and a higher priority task must wait | 675 | interrupts but the task cannot be preempted and a higher |
600 | for preemption to be enabled again before it can preempt a lower | 676 | priority task must wait for preemption to be enabled again |
601 | priority task. | 677 | before it can preempt a lower priority task. |
602 | 678 | ||
603 | The preemptoff tracer traces the places that disable preemption. | 679 | The preemptoff tracer traces the places that disable preemption. |
604 | Like the irqsoff tracer, it records the maximum latency for which preemption | 680 | Like the irqsoff tracer, it records the maximum latency for |
605 | was disabled. The control of preemptoff tracer is much like the irqsoff | 681 | which preemption was disabled. The control of preemptoff tracer |
606 | tracer. | 682 | is much like the irqsoff tracer. |
607 | 683 | ||
608 | # echo preemptoff > /debug/tracing/current_tracer | 684 | # echo preemptoff > /debug/tracing/current_tracer |
609 | # echo 0 > /debug/tracing/tracing_max_latency | 685 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -637,11 +713,12 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8 | |||
637 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) | 713 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) |
638 | 714 | ||
639 | 715 | ||
640 | This has some more changes. Preemption was disabled when an interrupt | 716 | This has some more changes. Preemption was disabled when an |
641 | came in (notice the 'h'), and was enabled while doing a softirq. | 717 | interrupt came in (notice the 'h'), and was enabled while doing |
642 | (notice the 's'). But we also see that interrupts have been disabled | 718 | a softirq. (notice the 's'). But we also see that interrupts |
643 | when entering the preempt off section and leaving it (the 'd'). | 719 | have been disabled when entering the preempt off section and |
644 | We do not know if interrupts were enabled in the mean time. | 720 | leaving it (the 'd'). We do not know if interrupts were enabled |
721 | in the mean time. | ||
645 | 722 | ||
646 | # tracer: preemptoff | 723 | # tracer: preemptoff |
647 | # | 724 | # |
@@ -700,28 +777,30 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8 | |||
700 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) | 777 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) |
701 | 778 | ||
702 | 779 | ||
703 | The above is an example of the preemptoff trace with ftrace_enabled | 780 | The above is an example of the preemptoff trace with |
704 | set. Here we see that interrupts were disabled the entire time. | 781 | ftrace_enabled set. Here we see that interrupts were disabled |
705 | The irq_enter code lets us know that we entered an interrupt 'h'. | 782 | the entire time. The irq_enter code lets us know that we entered |
706 | Before that, the functions being traced still show that it is not | 783 | an interrupt 'h'. Before that, the functions being traced still |
707 | in an interrupt, but we can see from the functions themselves that | 784 | show that it is not in an interrupt, but we can see from the |
708 | this is not the case. | 785 | functions themselves that this is not the case. |
709 | 786 | ||
710 | Notice that __do_softirq when called does not have a preempt_count. | 787 | Notice that __do_softirq when called does not have a |
711 | It may seem that we missed a preempt enabling. What really happened | 788 | preempt_count. It may seem that we missed a preempt enabling. |
712 | is that the preempt count is held on the thread's stack and we | 789 | What really happened is that the preempt count is held on the |
713 | switched to the softirq stack (4K stacks in effect). The code | 790 | thread's stack and we switched to the softirq stack (4K stacks |
714 | does not copy the preempt count, but because interrupts are disabled, | 791 | in effect). The code does not copy the preempt count, but |
715 | we do not need to worry about it. Having a tracer like this is good | 792 | because interrupts are disabled, we do not need to worry about |
716 | for letting people know what really happens inside the kernel. | 793 | it. Having a tracer like this is good for letting people know |
794 | what really happens inside the kernel. | ||
717 | 795 | ||
718 | 796 | ||
719 | preemptirqsoff | 797 | preemptirqsoff |
720 | -------------- | 798 | -------------- |
721 | 799 | ||
722 | Knowing the locations that have interrupts disabled or preemption | 800 | Knowing the locations that have interrupts disabled or |
723 | disabled for the longest times is helpful. But sometimes we would | 801 | preemption disabled for the longest times is helpful. But |
724 | like to know when either preemption and/or interrupts are disabled. | 802 | sometimes we would like to know when either preemption and/or |
803 | interrupts are disabled. | ||
725 | 804 | ||
726 | Consider the following code: | 805 | Consider the following code: |
727 | 806 | ||
@@ -741,11 +820,13 @@ The preemptoff tracer will record the total length of | |||
741 | call_function_with_irqs_and_preemption_off() and | 820 | call_function_with_irqs_and_preemption_off() and |
742 | call_function_with_preemption_off(). | 821 | call_function_with_preemption_off(). |
743 | 822 | ||
744 | But neither will trace the time that interrupts and/or preemption | 823 | But neither will trace the time that interrupts and/or |
745 | is disabled. This total time is the time that we can not schedule. | 824 | preemption is disabled. This total time is the time that we can |
746 | To record this time, use the preemptirqsoff tracer. | 825 | not schedule. To record this time, use the preemptirqsoff |
826 | tracer. | ||
747 | 827 | ||
748 | Again, using this trace is much like the irqsoff and preemptoff tracers. | 828 | Again, using this trace is much like the irqsoff and preemptoff |
829 | tracers. | ||
749 | 830 | ||
750 | # echo preemptirqsoff > /debug/tracing/current_tracer | 831 | # echo preemptirqsoff > /debug/tracing/current_tracer |
751 | # echo 0 > /debug/tracing/tracing_max_latency | 832 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -781,9 +862,10 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
781 | 862 | ||
782 | 863 | ||
783 | The trace_hardirqs_off_thunk is called from assembly on x86 when | 864 | The trace_hardirqs_off_thunk is called from assembly on x86 when |
784 | interrupts are disabled in the assembly code. Without the function | 865 | interrupts are disabled in the assembly code. Without the |
785 | tracing, we do not know if interrupts were enabled within the preemption | 866 | function tracing, we do not know if interrupts were enabled |
786 | points. We do see that it started with preemption enabled. | 867 | within the preemption points. We do see that it started with |
868 | preemption enabled. | ||
787 | 869 | ||
788 | Here is a trace with ftrace_enabled set: | 870 | Here is a trace with ftrace_enabled set: |
789 | 871 | ||
@@ -871,40 +953,42 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
871 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) | 953 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) |
872 | 954 | ||
873 | 955 | ||
874 | This is a very interesting trace. It started with the preemption of | 956 | This is a very interesting trace. It started with the preemption |
875 | the ls task. We see that the task had the "need_resched" bit set | 957 | of the ls task. We see that the task had the "need_resched" bit |
876 | via the 'N' in the trace. Interrupts were disabled before the spin_lock | 958 | set via the 'N' in the trace. Interrupts were disabled before |
877 | at the beginning of the trace. We see that a schedule took place to run | 959 | the spin_lock at the beginning of the trace. We see that a |
878 | sshd. When the interrupts were enabled, we took an interrupt. | 960 | schedule took place to run sshd. When the interrupts were |
879 | On return from the interrupt handler, the softirq ran. We took another | 961 | enabled, we took an interrupt. On return from the interrupt |
880 | interrupt while running the softirq as we see from the capital 'H'. | 962 | handler, the softirq ran. We took another interrupt while |
963 | running the softirq as we see from the capital 'H'. | ||
881 | 964 | ||
882 | 965 | ||
883 | wakeup | 966 | wakeup |
884 | ------ | 967 | ------ |
885 | 968 | ||
886 | In a Real-Time environment it is very important to know the wakeup | 969 | In a Real-Time environment it is very important to know the |
887 | time it takes for the highest priority task that is woken up to the | 970 | wakeup time it takes for the highest priority task that is woken |
888 | time that it executes. This is also known as "schedule latency". | 971 | up to the time that it executes. This is also known as "schedule |
889 | I stress the point that this is about RT tasks. It is also important | 972 | latency". I stress the point that this is about RT tasks. It is |
890 | to know the scheduling latency of non-RT tasks, but the average | 973 | also important to know the scheduling latency of non-RT tasks, |
891 | schedule latency is better for non-RT tasks. Tools like | 974 | but the average schedule latency is better for non-RT tasks. |
892 | LatencyTop are more appropriate for such measurements. | 975 | Tools like LatencyTop are more appropriate for such |
976 | measurements. | ||
893 | 977 | ||
894 | Real-Time environments are interested in the worst case latency. | 978 | Real-Time environments are interested in the worst case latency. |
895 | That is the longest latency it takes for something to happen, and | 979 | That is the longest latency it takes for something to happen, |
896 | not the average. We can have a very fast scheduler that may only | 980 | and not the average. We can have a very fast scheduler that may |
897 | have a large latency once in a while, but that would not work well | 981 | only have a large latency once in a while, but that would not |
898 | with Real-Time tasks. The wakeup tracer was designed to record | 982 | work well with Real-Time tasks. The wakeup tracer was designed |
899 | the worst case wakeups of RT tasks. Non-RT tasks are not recorded | 983 | to record the worst case wakeups of RT tasks. Non-RT tasks are |
900 | because the tracer only records one worst case and tracing non-RT | 984 | not recorded because the tracer only records one worst case and |
901 | tasks that are unpredictable will overwrite the worst case latency | 985 | tracing non-RT tasks that are unpredictable will overwrite the |
902 | of RT tasks. | 986 | worst case latency of RT tasks. |
903 | 987 | ||
904 | Since this tracer only deals with RT tasks, we will run this slightly | 988 | Since this tracer only deals with RT tasks, we will run this |
905 | differently than we did with the previous tracers. Instead of performing | 989 | slightly differently than we did with the previous tracers. |
906 | an 'ls', we will run 'sleep 1' under 'chrt' which changes the | 990 | Instead of performing an 'ls', we will run 'sleep 1' under |
907 | priority of the task. | 991 | 'chrt' which changes the priority of the task. |
908 | 992 | ||
909 | # echo wakeup > /debug/tracing/current_tracer | 993 | # echo wakeup > /debug/tracing/current_tracer |
910 | # echo 0 > /debug/tracing/tracing_max_latency | 994 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -934,17 +1018,16 @@ wakeup latency trace v1.1.5 on 2.6.26-rc8 | |||
934 | <idle>-0 1d..4 4us : schedule (cpu_idle) | 1018 | <idle>-0 1d..4 4us : schedule (cpu_idle) |
935 | 1019 | ||
936 | 1020 | ||
1021 | Running this on an idle system, we see that it only took 4 | ||
1022 | microseconds to perform the task switch. Note, since the trace | ||
1023 | marker in the schedule is before the actual "switch", we stop | ||
1024 | the tracing when the recorded task is about to schedule in. This | ||
1025 | may change if we add a new marker at the end of the scheduler. | ||
937 | 1026 | ||
938 | Running this on an idle system, we see that it only took 4 microseconds | 1027 | Notice that the recorded task is 'sleep' with the PID of 4901 |
939 | to perform the task switch. Note, since the trace marker in the | 1028 | and it has an rt_prio of 5. This priority is user-space priority |
940 | schedule is before the actual "switch", we stop the tracing when | 1029 | and not the internal kernel priority. The policy is 1 for |
941 | the recorded task is about to schedule in. This may change if | 1030 | SCHED_FIFO and 2 for SCHED_RR. |
942 | we add a new marker at the end of the scheduler. | ||
943 | |||
944 | Notice that the recorded task is 'sleep' with the PID of 4901 and it | ||
945 | has an rt_prio of 5. This priority is user-space priority and not | ||
946 | the internal kernel priority. The policy is 1 for SCHED_FIFO and 2 | ||
947 | for SCHED_RR. | ||
948 | 1031 | ||
949 | Doing the same with chrt -r 5 and ftrace_enabled set. | 1032 | Doing the same with chrt -r 5 and ftrace_enabled set. |
950 | 1033 | ||
@@ -1001,24 +1084,25 @@ ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) | |||
1001 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) | 1084 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) |
1002 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) | 1085 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) |
1003 | 1086 | ||
1004 | The interrupt went off while running ksoftirqd. This task runs at | 1087 | The interrupt went off while running ksoftirqd. This task runs |
1005 | SCHED_OTHER. Why did we not see the 'N' set early? This may be | 1088 | at SCHED_OTHER. Why did we not see the 'N' set early? This may |
1006 | a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks | 1089 | be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K |
1007 | configured, the interrupt and softirq run with their own stack. | 1090 | stacks configured, the interrupt and softirq run with their own |
1008 | Some information is held on the top of the task's stack (need_resched | 1091 | stack. Some information is held on the top of the task's stack |
1009 | and preempt_count are both stored there). The setting of the NEED_RESCHED | 1092 | (need_resched and preempt_count are both stored there). The |
1010 | bit is done directly to the task's stack, but the reading of the | 1093 | setting of the NEED_RESCHED bit is done directly to the task's |
1011 | NEED_RESCHED is done by looking at the current stack, which in this case | 1094 | stack, but the reading of the NEED_RESCHED is done by looking at |
1012 | is the stack for the hard interrupt. This hides the fact that NEED_RESCHED | 1095 | the current stack, which in this case is the stack for the hard |
1013 | has been set. We do not see the 'N' until we switch back to the task's | 1096 | interrupt. This hides the fact that NEED_RESCHED has been set. |
1097 | We do not see the 'N' until we switch back to the task's | ||
1014 | assigned stack. | 1098 | assigned stack. |
1015 | 1099 | ||
1016 | function | 1100 | function |
1017 | -------- | 1101 | -------- |
1018 | 1102 | ||
1019 | This tracer is the function tracer. Enabling the function tracer | 1103 | This tracer is the function tracer. Enabling the function tracer |
1020 | can be done from the debug file system. Make sure the ftrace_enabled is | 1104 | can be done from the debug file system. Make sure the |
1021 | set; otherwise this tracer is a nop. | 1105 | ftrace_enabled is set; otherwise this tracer is a nop. |
1022 | 1106 | ||
1023 | # sysctl kernel.ftrace_enabled=1 | 1107 | # sysctl kernel.ftrace_enabled=1 |
1024 | # echo function > /debug/tracing/current_tracer | 1108 | # echo function > /debug/tracing/current_tracer |
@@ -1048,14 +1132,15 @@ set; otherwise this tracer is a nop. | |||
1048 | [...] | 1132 | [...] |
1049 | 1133 | ||
1050 | 1134 | ||
1051 | Note: function tracer uses ring buffers to store the above entries. | 1135 | Note: function tracer uses ring buffers to store the above |
1052 | The newest data may overwrite the oldest data. Sometimes using echo to | 1136 | entries. The newest data may overwrite the oldest data. |
1053 | stop the trace is not sufficient because the tracing could have overwritten | 1137 | Sometimes using echo to stop the trace is not sufficient because |
1054 | the data that you wanted to record. For this reason, it is sometimes better to | 1138 | the tracing could have overwritten the data that you wanted to |
1055 | disable tracing directly from a program. This allows you to stop the | 1139 | record. For this reason, it is sometimes better to disable |
1056 | tracing at the point that you hit the part that you are interested in. | 1140 | tracing directly from a program. This allows you to stop the |
1057 | To disable the tracing directly from a C program, something like the following | 1141 | tracing at the point that you hit the part that you are |
1058 | code snippet can be used: | 1142 | interested in. To disable the tracing directly from a C program, |
1143 | something like the following code snippet can be used: | ||
1059 | 1144 | ||
1060 | int trace_fd; | 1145 | int trace_fd; |
1061 | [...] | 1146 | [...] |
@@ -1070,10 +1155,10 @@ int main(int argc, char *argv[]) { | |||
1070 | } | 1155 | } |
1071 | 1156 | ||
1072 | Note: Here we hard coded the path name. The debugfs mount is not | 1157 | Note: Here we hard coded the path name. The debugfs mount is not |
1073 | guaranteed to be at /debug (and is more commonly at /sys/kernel/debug). | 1158 | guaranteed to be at /debug (and is more commonly at |
1074 | For simple one time traces, the above is sufficient. For anything else, | 1159 | /sys/kernel/debug). For simple one time traces, the above is |
1075 | a search through /proc/mounts may be needed to find where the debugfs | 1160 | sufficient. For anything else, a search through /proc/mounts may |
1076 | file-system is mounted. | 1161 | be needed to find where the debugfs file-system is mounted. |
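
The /proc/mounts search mentioned above could look roughly like the
following. This is only a sketch: the helper names mount_point_of() and
find_debugfs() are made up for illustration, and it assumes the usual
"device mountpoint fstype options ..." line layout of /proc/mounts with
no escaped spaces in the mount point.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/mounts line ("device mountpoint fstype options ...")
 * and copy the mount point into out if the filesystem type matches.
 * Returns 1 on a match, 0 otherwise. (Hypothetical helper.)
 */
int mount_point_of(const char *line, const char *fstype,
                   char *out, size_t outlen)
{
    char dir[256], type[64];

    /* Skip the device field, then read mount point and type. */
    if (sscanf(line, "%*s %255s %63s", dir, type) != 2)
        return 0;
    if (strcmp(type, fstype) != 0 || strlen(dir) >= outlen)
        return 0;
    strcpy(out, dir);
    return 1;
}

/*
 * Scan /proc/mounts for the first debugfs entry.
 * Returns 1 and fills out if one is found, 0 otherwise.
 * (Hypothetical helper.)
 */
int find_debugfs(char *out, size_t outlen)
{
    char line[512];
    int found = 0;
    FILE *fp = fopen("/proc/mounts", "r");

    if (!fp)
        return 0;
    while (!found && fgets(line, sizeof(line), fp))
        found = mount_point_of(line, "debugfs", out, outlen);
    fclose(fp);
    return found;
}
```

A program would call find_debugfs() once at startup and append
"/tracing/tracing_enabled" (or whichever file it needs) to the returned
path instead of hard coding /debug.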
1077 | 1162 | ||
1078 | 1163 | ||
1079 | Single thread tracing | 1164 | Single thread tracing |
@@ -1152,49 +1237,297 @@ int main (int argc, char **argv) | |||
1152 | return 0; | 1237 | return 0; |
1153 | } | 1238 | } |
1154 | 1239 | ||
1240 | |||
1241 | hw-branch-tracer (x86 only) | ||
1242 | --------------------------- | ||
1243 | |||
1244 | This tracer uses the x86 last branch tracing hardware feature to | ||
1245 | collect a branch trace on all cpus with relatively low overhead. | ||
1246 | |||
1247 | The tracer uses a fixed-size circular buffer per cpu and only | ||
1248 | traces ring 0 branches. The trace file dumps that buffer in the | ||
1249 | following format: | ||
1250 | |||
1251 | # tracer: hw-branch-tracer | ||
1252 | # | ||
1253 | # CPU# TO <- FROM | ||
1254 | 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 | ||
1255 | 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a | ||
1256 | 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf | ||
1257 | 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf | ||
1258 | 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a | ||
1259 | 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf | ||
1260 | |||
1261 | |||
1262 | The tracer may be used to dump the trace for the oops'ing cpu on | ||
1263 | a kernel oops into the system log. To enable this, | ||
1264 | ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one | ||
1265 | can either use the sysctl function or set it via the proc system | ||
1266 | interface. | ||
1267 | |||
1268 | sysctl kernel.ftrace_dump_on_oops=1 | ||
1269 | |||
1270 | or | ||
1271 | |||
1272 | echo 1 > /proc/sys/kernel/ftrace_dump_on_oops | ||
1273 | |||
1274 | |||
1275 | Here's an example of such a dump after a null pointer | ||
1276 | dereference in a kernel module: | ||
1277 | |||
1278 | [57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 | ||
1279 | [57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1280 | [57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 | ||
1281 | [57848.106019] Oops: 0002 [#1] SMP | ||
1282 | [57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus | ||
1283 | [57848.106019] Dumping ftrace buffer: | ||
1284 | [57848.106019] --------------------------------- | ||
1285 | [...] | ||
1286 | [57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 | ||
1287 | [57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 | ||
1288 | [57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 | ||
1289 | [57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 | ||
1290 | [57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 | ||
1291 | [57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] | ||
1292 | [57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 | ||
1293 | [57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b | ||
1294 | [57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 | ||
1295 | [57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 | ||
1296 | [57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 | ||
1297 | [...] | ||
1298 | [57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 | ||
1299 | [57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 | ||
1300 | [57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 | ||
1301 | [57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 | ||
1302 | [...] | ||
1303 | [57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 | ||
1304 | [57848.106019] --------------------------------- | ||
1305 | [57848.106019] CPU 0 | ||
1306 | [57848.106019] Modules linked in: oops | ||
1307 | [57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 | ||
1308 | [57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1309 | [57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 | ||
1310 | [...] | ||
1311 | |||
1312 | |||
1313 | function graph tracer | ||
1314 | --------------------- | ||
1315 | |||
1316 | This tracer is similar to the function tracer except that it | ||
1317 | probes a function on its entry and its exit. This is done by | ||
1318 | using a dynamically allocated stack of return addresses in each | ||
1319 | task_struct. On function entry the tracer overwrites the return | ||
1320 | address of each function traced to set a custom probe. Thus the | ||
1321 | original return address is stored on the stack of return addresses | ||
1322 | in the task_struct. | ||
1323 | |||
1324 | Probing on both ends of a function leads to special features | ||
1325 | such as: | ||
1326 | |||
1327 | - measuring a function's execution time | ||
1328 | - having a reliable call stack to draw a function call graph | ||
1329 | |||
1330 | This tracer is useful in several situations: | ||
1331 | |||
1332 | - you want to find the reason for strange kernel behavior and | ||
1333 | need to see what happens in detail in any area (or in | ||
1334 | specific ones). | ||
1335 | |||
1336 | - you are experiencing weird latencies but it's difficult to | ||
1337 | find their origin. | ||
1338 | |||
1339 | - you want to find quickly which path is taken by a specific | ||
1340 | function | ||
1341 | |||
1342 | - you just want to peek inside a working kernel and want to see | ||
1343 | what happens there. | ||
1344 | |||
1345 | # tracer: function_graph | ||
1346 | # | ||
1347 | # CPU DURATION FUNCTION CALLS | ||
1348 | # | | | | | | | | ||
1349 | |||
1350 | 0) | sys_open() { | ||
1351 | 0) | do_sys_open() { | ||
1352 | 0) | getname() { | ||
1353 | 0) | kmem_cache_alloc() { | ||
1354 | 0) 1.382 us | __might_sleep(); | ||
1355 | 0) 2.478 us | } | ||
1356 | 0) | strncpy_from_user() { | ||
1357 | 0) | might_fault() { | ||
1358 | 0) 1.389 us | __might_sleep(); | ||
1359 | 0) 2.553 us | } | ||
1360 | 0) 3.807 us | } | ||
1361 | 0) 7.876 us | } | ||
1362 | 0) | alloc_fd() { | ||
1363 | 0) 0.668 us | _spin_lock(); | ||
1364 | 0) 0.570 us | expand_files(); | ||
1365 | 0) 0.586 us | _spin_unlock(); | ||
1366 | |||
1367 | |||
1368 | There are several columns that can be dynamically | ||
1369 | enabled/disabled. You can use every combination of options you | ||
1370 | want, depending on your needs. | ||
1371 | |||
1372 | - The cpu number on which the function executed is enabled by | ||
1373 | default. It is sometimes better to only trace one cpu (see | ||
1374 | tracing_cpumask file) or you might sometimes see unordered | ||
1375 | function calls when the trace switches between cpus. | ||
1376 | |||
1377 | hide: echo nofuncgraph-cpu > /debug/tracing/trace_options | ||
1378 | show: echo funcgraph-cpu > /debug/tracing/trace_options | ||
1379 | |||
1380 | - The duration (function's time of execution) is displayed on | ||
1381 | the closing bracket line of a function or on the same line | ||
1382 | as the current function in case of a leaf one. It is enabled | ||
1383 | by default. | ||
1384 | |||
1385 | hide: echo nofuncgraph-duration > /debug/tracing/trace_options | ||
1386 | show: echo funcgraph-duration > /debug/tracing/trace_options | ||
1387 | |||
1388 | - The overhead field precedes the duration field when a | ||
1389 | duration threshold is reached. | ||
1390 | |||
1391 | hide: echo nofuncgraph-overhead > /debug/tracing/trace_options | ||
1392 | show: echo funcgraph-overhead > /debug/tracing/trace_options | ||
1393 | depends on: funcgraph-duration | ||
1394 | |||
1395 | ie: | ||
1396 | |||
1397 | 0) | up_write() { | ||
1398 | 0) 0.646 us | _spin_lock_irqsave(); | ||
1399 | 0) 0.684 us | _spin_unlock_irqrestore(); | ||
1400 | 0) 3.123 us | } | ||
1401 | 0) 0.548 us | fput(); | ||
1402 | 0) + 58.628 us | } | ||
1403 | |||
1404 | [...] | ||
1405 | |||
1406 | 0) | putname() { | ||
1407 | 0) | kmem_cache_free() { | ||
1408 | 0) 0.518 us | __phys_addr(); | ||
1409 | 0) 1.757 us | } | ||
1410 | 0) 2.861 us | } | ||
1411 | 0) ! 115.305 us | } | ||
1412 | 0) ! 116.402 us | } | ||
1413 | |||
1414 | + means that the function exceeded 10 usecs. | ||
1415 | ! means that the function exceeded 100 usecs. | ||
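
The marker rule above can be expressed as a small sketch. The function
names overhead_mark() and format_duration() are invented for this
illustration and are not the kernel's own code; the sketch also assumes
that "exceeded" means strictly greater than the threshold.

```c
#include <stdio.h>

/*
 * Return the overhead marker for a duration given in microseconds:
 * '!' above 100 usecs, '+' above 10 usecs, ' ' otherwise.
 * (Hypothetical helper, thresholds taken from the text above.)
 */
char overhead_mark(double usecs)
{
    if (usecs > 100.0)
        return '!';
    if (usecs > 10.0)
        return '+';
    return ' ';
}

/* Format a duration column in the style of the examples above. */
int format_duration(char *buf, size_t len, double usecs)
{
    return snprintf(buf, len, "%c %.3f us", overhead_mark(usecs), usecs);
}
```

With the durations from the example trace, 58.628 gets a '+', 115.305
gets a '!' and 3.123 gets no marker.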
1416 | |||
1417 | |||
1418 | - The task/pid field displays the thread cmdline and pid which | ||
1419 | executed the function. It is disabled by default. | ||
1420 | |||
1421 | hide: echo nofuncgraph-proc > /debug/tracing/trace_options | ||
1422 | show: echo funcgraph-proc > /debug/tracing/trace_options | ||
1423 | |||
1424 | ie: | ||
1425 | |||
1426 | # tracer: function_graph | ||
1427 | # | ||
1428 | # CPU TASK/PID DURATION FUNCTION CALLS | ||
1429 | # | | | | | | | | | | ||
1430 | 0) sh-4802 | | d_free() { | ||
1431 | 0) sh-4802 | | call_rcu() { | ||
1432 | 0) sh-4802 | | __call_rcu() { | ||
1433 | 0) sh-4802 | 0.616 us | rcu_process_gp_end(); | ||
1434 | 0) sh-4802 | 0.586 us | check_for_new_grace_period(); | ||
1435 | 0) sh-4802 | 2.899 us | } | ||
1436 | 0) sh-4802 | 4.040 us | } | ||
1437 | 0) sh-4802 | 5.151 us | } | ||
1438 | 0) sh-4802 | + 49.370 us | } | ||
1439 | |||
1440 | |||
1441 | - The absolute time field is an absolute timestamp given by the | ||
1442 | system clock since it started. A snapshot of this time is | ||
1443 | given on each entry/exit of functions. | ||
1444 | |||
1445 | hide: echo nofuncgraph-abstime > /debug/tracing/trace_options | ||
1446 | show: echo funcgraph-abstime > /debug/tracing/trace_options | ||
1447 | |||
1448 | ie: | ||
1449 | |||
1450 | # | ||
1451 | # TIME CPU DURATION FUNCTION CALLS | ||
1452 | # | | | | | | | | | ||
1453 | 360.774522 | 1) 0.541 us | } | ||
1454 | 360.774522 | 1) 4.663 us | } | ||
1455 | 360.774523 | 1) 0.541 us | __wake_up_bit(); | ||
1456 | 360.774524 | 1) 6.796 us | } | ||
1457 | 360.774524 | 1) 7.952 us | } | ||
1458 | 360.774525 | 1) 9.063 us | } | ||
1459 | 360.774525 | 1) 0.615 us | journal_mark_dirty(); | ||
1460 | 360.774527 | 1) 0.578 us | __brelse(); | ||
1461 | 360.774528 | 1) | reiserfs_prepare_for_journal() { | ||
1462 | 360.774528 | 1) | unlock_buffer() { | ||
1463 | 360.774529 | 1) | wake_up_bit() { | ||
1464 | 360.774529 | 1) | bit_waitqueue() { | ||
1465 | 360.774530 | 1) 0.594 us | __phys_addr(); | ||
1466 | |||
1467 | |||
1468 | You can put some comments on specific functions by using | ||
1469 | trace_printk(). For example, if you want to put a comment inside | ||
1470 | the __might_sleep() function, you just have to include | ||
1471 | <linux/ftrace.h> and call trace_printk() inside __might_sleep(): | ||
1472 | |||
1473 | trace_printk("I'm a comment!\n") | ||
1474 | |||
1475 | will produce: | ||
1476 | |||
1477 | 1) | __might_sleep() { | ||
1478 | 1) | /* I'm a comment! */ | ||
1479 | 1) 1.449 us | } | ||
1480 | |||
1481 | |||
1482 | You might find other useful features for this tracer in the | ||
1483 | following "dynamic ftrace" section such as tracing only specific | ||
1484 | functions or tasks. | ||
1485 | |||
1155 | dynamic ftrace | 1486 | dynamic ftrace |
1156 | -------------- | 1487 | -------------- |
1157 | 1488 | ||
1158 | If CONFIG_DYNAMIC_FTRACE is set, the system will run with | 1489 | If CONFIG_DYNAMIC_FTRACE is set, the system will run with |
1159 | virtually no overhead when function tracing is disabled. The way | 1490 | virtually no overhead when function tracing is disabled. The way |
1160 | this works is the mcount function call (placed at the start of | 1491 | this works is the mcount function call (placed at the start of |
1161 | every kernel function, produced by the -pg switch in gcc), starts | 1492 | every kernel function, produced by the -pg switch in gcc), |
1162 | off pointing to a simple return. (Enabling FTRACE will include the | 1493 | starts off pointing to a simple return. (Enabling FTRACE will |
1163 | -pg switch in the compiling of the kernel.) | 1494 | include the -pg switch in the compiling of the kernel.) |
1164 | 1495 | ||
1165 | At compile time every C file object is run through the | 1496 | At compile time every C file object is run through the |
1166 | recordmcount.pl script (located in the scripts directory). This | 1497 | recordmcount.pl script (located in the scripts directory). This |
1167 | script will process the C object using objdump to find all the | 1498 | script will process the C object using objdump to find all the |
1168 | locations in the .text section that call mcount. (Note, only | 1499 | locations in the .text section that call mcount. (Note, only the |
1169 | the .text section is processed, since processing other sections | 1500 | .text section is processed, since processing other sections like |
1170 | like .init.text may cause races due to those sections being freed). | 1501 | .init.text may cause races due to those sections being freed). |
1171 | 1502 | ||
1172 | A new section called "__mcount_loc" is created that holds references | 1503 | A new section called "__mcount_loc" is created that holds |
1173 | to all the mcount call sites in the .text section. This section is | 1504 | references to all the mcount call sites in the .text section. |
1174 | compiled back into the original object. The final linker will add | 1505 | This section is compiled back into the original object. The |
1175 | all these references into a single table. | 1506 | final linker will add all these references into a single table. |
1176 | 1507 | ||
1177 | On boot up, before SMP is initialized, the dynamic ftrace code | 1508 | On boot up, before SMP is initialized, the dynamic ftrace code |
1178 | scans this table and updates all the locations into nops. It also | 1509 | scans this table and updates all the locations into nops. It |
1179 | records the locations, which are added to the available_filter_functions | 1510 | also records the locations, which are added to the |
1180 | list. Modules are processed as they are loaded and before they are | 1511 | available_filter_functions list. Modules are processed as they |
1181 | executed. When a module is unloaded, it also removes its functions from | 1512 | are loaded and before they are executed. When a module is |
1182 | the ftrace function list. This is automatic in the module unload | 1513 | unloaded, it also removes its functions from the ftrace function |
1183 | code, and the module author does not need to worry about it. | 1514 | list. This is automatic in the module unload code, and the |
1184 | 1515 | module author does not need to worry about it. | |
1185 | When tracing is enabled, kstop_machine is called to prevent races | 1516 | |
1186 | with the CPUS executing code being modified (which can cause the | 1517 | When tracing is enabled, kstop_machine is called to prevent |
1187 | CPU to do undesirable things), and the nops are patched back | 1518 | races with the CPUS executing code being modified (which can |
1188 | to calls. But this time, they do not call mcount (which is just | 1519 | cause the CPU to do undesirable things), and the nops are |
1189 | a function stub). They now call into the ftrace infrastructure. | 1520 | patched back to calls. But this time, they do not call mcount |
1521 | (which is just a function stub). They now call into the ftrace | ||
1522 | infrastructure. | ||
1190 | 1523 | ||
1191 | One special side-effect to the recording of the functions being | 1524 | One special side-effect to the recording of the functions being |
1192 | traced is that we can now selectively choose which functions we | 1525 | traced is that we can now selectively choose which functions we |
1193 | wish to trace and which ones we want the mcount calls to remain as | 1526 | wish to trace and which ones we want the mcount calls to remain |
1194 | nops. | 1527 | as nops. |
1195 | 1528 | ||
1196 | Two files are used, one for enabling and one for disabling the tracing | 1529 | Two files are used, one for enabling and one for disabling the |
1197 | of specified functions. They are: | 1530 | tracing of specified functions. They are: |
1198 | 1531 | ||
1199 | set_ftrace_filter | 1532 | set_ftrace_filter |
1200 | 1533 | ||
@@ -1202,8 +1535,8 @@ and | |||
1202 | 1535 | ||
1203 | set_ftrace_notrace | 1536 | set_ftrace_notrace |
1204 | 1537 | ||
1205 | A list of available functions that you can add to these files is listed | 1538 | A list of available functions that you can add to these files is |
1206 | in: | 1539 | listed in: |
1207 | 1540 | ||
1208 | available_filter_functions | 1541 | available_filter_functions |
1209 | 1542 | ||
@@ -1240,8 +1573,8 @@ hrtimer_interrupt | |||
1240 | sys_nanosleep | 1573 | sys_nanosleep |
1241 | 1574 | ||
1242 | 1575 | ||
1243 | Perhaps this is not enough. The filters also allow simple wild cards. | 1576 | Perhaps this is not enough. The filters also allow simple wild |
1244 | Only the following are currently available | 1577 | cards. Only the following are currently available |
1245 | 1578 | ||
1246 | <match>* - will match functions that begin with <match> | 1579 | <match>* - will match functions that begin with <match> |
1247 | *<match> - will match functions that end with <match> | 1580 | *<match> - will match functions that end with <match> |
@@ -1251,9 +1584,9 @@ These are the only wild cards which are supported. | |||
1251 | 1584 | ||
1252 | <match>*<match> will not work. | 1585 | <match>*<match> will not work. |
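
To make the wild card rules concrete, here is a userspace sketch of how
such a pattern could be matched against a function name. The function
filter_match() is hypothetical and is not the kernel's implementation.
The "*<match>*" form is assumed from the '*preempt*' and '*lock*'
examples used elsewhere in this document.

```c
#include <string.h>

/*
 * Match a function name against one filter pattern.
 * Supported forms: "name", "name*", "*name", "*name*".
 * A '*' anywhere else is treated literally, so "a*b" never
 * matches a real function name.
 * Returns 1 on a match, 0 otherwise. (Hypothetical helper.)
 */
int filter_match(const char *pattern, const char *func)
{
    size_t plen = strlen(pattern);
    size_t flen = strlen(func);
    int lead = plen > 0 && pattern[0] == '*';
    int trail = plen > (size_t)lead && pattern[plen - 1] == '*';
    size_t clen = plen - lead - trail;
    char core[128];

    if (clen >= sizeof(core))
        return 0;
    /* Copy the pattern without its leading/trailing '*'. */
    memcpy(core, pattern + lead, clen);
    core[clen] = '\0';

    if (lead && trail)                      /* *match* */
        return strstr(func, core) != NULL;
    if (trail)                              /* match*  */
        return strncmp(func, core, clen) == 0;
    if (lead)                               /* *match  */
        return flen >= clen &&
               strcmp(func + flen - clen, core) == 0;
    return strcmp(func, core) == 0;         /* exact   */
}
```

For instance, filter_match("hrtimer_*", "hrtimer_interrupt") matches,
while a pattern with a '*' in the middle falls through to the exact
comparison and does not match.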
1253 | 1586 | ||
1254 | Note: It is better to use quotes to enclose the wild cards, otherwise | 1587 | Note: It is better to use quotes to enclose the wild cards, |
1255 | the shell may expand the parameters into names of files in the local | 1588 | otherwise the shell may expand the parameters into names |
1256 | directory. | 1589 | of files in the local directory. |
1257 | 1590 | ||
1258 | # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter | 1591 | # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter |
1259 | 1592 | ||
@@ -1299,7 +1632,8 @@ This is because the '>' and '>>' act just like they do in bash. | |||
1299 | To rewrite the filters, use '>' | 1632 | To rewrite the filters, use '>' |
1300 | To append to the filters, use '>>' | 1633 | To append to the filters, use '>>' |
1301 | 1634 | ||
1302 | To clear out a filter so that all functions will be recorded again: | 1635 | To clear out a filter so that all functions will be recorded |
1636 | again: | ||
1303 | 1637 | ||
1304 | # echo > /debug/tracing/set_ftrace_filter | 1638 | # echo > /debug/tracing/set_ftrace_filter |
1305 | # cat /debug/tracing/set_ftrace_filter | 1639 | # cat /debug/tracing/set_ftrace_filter |
@@ -1331,7 +1665,8 @@ hrtimer_get_res | |||
1331 | hrtimer_init_sleeper | 1665 | hrtimer_init_sleeper |
1332 | 1666 | ||
1333 | 1667 | ||
1334 | The set_ftrace_notrace prevents those functions from being traced. | 1668 | The set_ftrace_notrace prevents those functions from being |
1669 | traced. | ||
1335 | 1670 | ||
1336 | # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace | 1671 | # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace |
1337 | 1672 | ||
@@ -1353,13 +1688,75 @@ Produces: | |||
1353 | 1688 | ||
1354 | We can see that there's no more lock or preempt tracing. | 1689 | We can see that there's no more lock or preempt tracing. |
1355 | 1690 | ||
1691 | |||
1692 | Dynamic ftrace with the function graph tracer | ||
1693 | --------------------------------------------- | ||
1694 | |||
1695 | Although what has been explained above concerns both the | ||
1696 | function tracer and the function-graph-tracer, there are some | ||
1697 | special features only available in the function-graph tracer. | ||
1698 | |||
1699 | If you want to trace only one function and all of its children, | ||
1700 | you just have to echo its name into set_graph_function: | ||
1701 | |||
1702 | echo __do_fault > set_graph_function | ||
1703 | |||
1704 | will produce the following "expanded" trace of the __do_fault() | ||
1705 | function: | ||
1706 | |||
1707 | 0) | __do_fault() { | ||
1708 | 0) | filemap_fault() { | ||
1709 | 0) | find_lock_page() { | ||
1710 | 0) 0.804 us | find_get_page(); | ||
1711 | 0) | __might_sleep() { | ||
1712 | 0) 1.329 us | } | ||
1713 | 0) 3.904 us | } | ||
1714 | 0) 4.979 us | } | ||
1715 | 0) 0.653 us | _spin_lock(); | ||
1716 | 0) 0.578 us | page_add_file_rmap(); | ||
1717 | 0) 0.525 us | native_set_pte_at(); | ||
1718 | 0) 0.585 us | _spin_unlock(); | ||
1719 | 0) | unlock_page() { | ||
1720 | 0) 0.541 us | page_waitqueue(); | ||
1721 | 0) 0.639 us | __wake_up_bit(); | ||
1722 | 0) 2.786 us | } | ||
1723 | 0) + 14.237 us | } | ||
1724 | 0) | __do_fault() { | ||
1725 | 0) | filemap_fault() { | ||
1726 | 0) | find_lock_page() { | ||
1727 | 0) 0.698 us | find_get_page(); | ||
1728 | 0) | __might_sleep() { | ||
1729 | 0) 1.412 us | } | ||
1730 | 0) 3.950 us | } | ||
1731 | 0) 5.098 us | } | ||
1732 | 0) 0.631 us | _spin_lock(); | ||
1733 | 0) 0.571 us | page_add_file_rmap(); | ||
1734 | 0) 0.526 us | native_set_pte_at(); | ||
1735 | 0) 0.586 us | _spin_unlock(); | ||
1736 | 0) | unlock_page() { | ||
1737 | 0) 0.533 us | page_waitqueue(); | ||
1738 | 0) 0.638 us | __wake_up_bit(); | ||
1739 | 0) 2.793 us | } | ||
1740 | 0) + 14.012 us | } | ||
1741 | |||
1742 | You can also expand several functions at once: | ||
1743 | |||
1744 | echo sys_open > set_graph_function | ||
1745 | echo sys_close >> set_graph_function | ||
1746 | |||
1747 | Now if you want to go back to trace all functions you can clear | ||
1748 | this special filter via: | ||
1749 | |||
1750 | echo > set_graph_function | ||
1751 | |||
1752 | |||
1356 | trace_pipe | 1753 | trace_pipe |
1357 | ---------- | 1754 | ---------- |
1358 | 1755 | ||
1359 | The trace_pipe outputs the same content as the trace file, but the effect | 1756 | The trace_pipe outputs the same content as the trace file, but |
1360 | on the tracing is different. Every read from trace_pipe is consumed. | 1757 | the effect on the tracing is different. Every read from |
1361 | This means that subsequent reads will be different. The trace | 1758 | trace_pipe is consumed. This means that subsequent reads will be |
1362 | is live. | 1759 | different. The trace is live. |
1363 | 1760 | ||
1364 | # echo function > /debug/tracing/current_tracer | 1761 | # echo function > /debug/tracing/current_tracer |
1365 | # cat /debug/tracing/trace_pipe > /tmp/trace.out & | 1762 | # cat /debug/tracing/trace_pipe > /tmp/trace.out & |
@@ -1387,38 +1784,45 @@ is live. | |||
1387 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up | 1784 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up |
1388 | 1785 | ||
1389 | 1786 | ||
1390 | Note, reading the trace_pipe file will block until more input is added. | 1787 | Note, reading the trace_pipe file will block until more input is |
1391 | By changing the tracer, trace_pipe will issue an EOF. We needed | 1788 | added. By changing the tracer, trace_pipe will issue an EOF. We |
1392 | to set the function tracer _before_ we "cat" the trace_pipe file. | 1789 | needed to set the function tracer _before_ we "cat" the |
1790 | trace_pipe file. | ||
1393 | 1791 | ||
1394 | 1792 | ||
1395 | trace entries | 1793 | trace entries |
1396 | ------------- | 1794 | ------------- |
1397 | 1795 | ||
1398 | Having too much or not enough data can be troublesome in diagnosing | 1796 | Having too much or not enough data can be troublesome in |
1399 | an issue in the kernel. The file buffer_size_kb is used to modify | 1797 | diagnosing an issue in the kernel. The file buffer_size_kb is |
1400 | the size of the internal trace buffers. The number listed | 1798 | used to modify the size of the internal trace buffers. The |
1401 | is the number of entries that can be recorded per CPU. To know | 1799 | number listed is the number of entries that can be recorded per |
1402 | the full size, multiply the number of possible CPUs with the | 1800 | CPU. To know the full size, multiply the number of possible CPUs |
1403 | number of entries. | 1801 | with the number of entries. |
1404 | 1802 | ||
1405 | # cat /debug/tracing/buffer_size_kb | 1803 | # cat /debug/tracing/buffer_size_kb |
1406 | 1408 (units kilobytes) | 1804 | 1408 (units kilobytes) |
1407 | 1805 | ||
1408 | Note, to modify this, you must have tracing completely disabled. To do that, | 1806 | Note, to modify this, you must have tracing completely disabled. |
1409 | echo "nop" into the current_tracer. If the current_tracer is not set | 1807 | To do that, echo "nop" into the current_tracer. If the |
1410 | to "nop", an EINVAL error will be returned. | 1808 | current_tracer is not set to "nop", an EINVAL error will be |
1809 | returned. | ||
1411 | 1810 | ||
1412 | # echo nop > /debug/tracing/current_tracer | 1811 | # echo nop > /debug/tracing/current_tracer |
1413 | # echo 10000 > /debug/tracing/buffer_size_kb | 1812 | # echo 10000 > /debug/tracing/buffer_size_kb |
1414 | # cat /debug/tracing/buffer_size_kb | 1813 | # cat /debug/tracing/buffer_size_kb |
1415 | 10000 (units kilobytes) | 1814 | 10000 (units kilobytes) |
1416 | 1815 | ||
1417 | The number of pages which will be allocated is limited to a percentage | 1816 | The number of pages which will be allocated is limited to a |
1418 | of available memory. Allocating too much will produce an error. | 1817 | percentage of available memory. Allocating too much will produce |
1818 | an error. | ||
1419 | 1819 | ||
1420 | # echo 1000000000000 > /debug/tracing/buffer_size_kb | 1820 | # echo 1000000000000 > /debug/tracing/buffer_size_kb |
1421 | -bash: echo: write error: Cannot allocate memory | 1821 | -bash: echo: write error: Cannot allocate memory |
1422 | # cat /debug/tracing/buffer_size_kb | 1822 | # cat /debug/tracing/buffer_size_kb |
1423 | 85 | 1823 | 85 |
1424 | 1824 | ||
1825 | ----------- | ||
1826 | |||
1827 | More details can be found in the source code, in the | ||
1828 | kernel/trace/*.c files. | ||
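The wildcard rules in the ftrace.txt hunks above (a leading '*', a trailing '*', or both, but never a '*' in the middle) can be checked before a pattern is written to set_ftrace_filter or set_ftrace_notrace. The following is only a sketch: classify_filter is a hypothetical helper, not part of ftrace, and treating a lone '*' as unsupported is a choice of this sketch rather than documented behaviour.

```shell
#!/bin/sh
# classify_filter PATTERN
# Hypothetical helper: reports how a filter pattern would be read under
# the wildcard rules described in ftrace.txt. Prints one of:
#   exact, prefix, suffix, contains, unsupported
classify_filter() {
    pat=$1
    lead=no
    trail=no

    # Strip at most one leading '*' and one trailing '*'.
    case "$pat" in \**) lead=yes; pat=${pat#\*} ;; esac
    case "$pat" in *\*) trail=yes; pat=${pat%\*} ;; esac

    # Anything left must be a non-empty name with no further '*',
    # otherwise it is the <match>*<match> form (or an empty match).
    case "$pat" in
        ''|*\**) echo unsupported; return 1 ;;
    esac

    case "$lead$trail" in
        nono)   echo exact ;;      # plain function name
        noyes)  echo prefix ;;     # <match>*
        yesno)  echo suffix ;;     # *<match>
        yesyes) echo contains ;;   # *<match>*
    esac
}

# Quote the patterns so the shell does not expand them into file names.
for p in 'hrtimer_*' '*preempt*' 'hr*timer'; do
    printf '%s: %s\n' "$p" "$(classify_filter "$p")"
done
```

A script could call this helper first and only echo the pattern into the filter file when the result is not "unsupported".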
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 421920897a37..2895ce29dea5 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -50,6 +50,7 @@ parameter is applicable: | |||
50 | ISAPNP ISA PnP code is enabled. | 50 | ISAPNP ISA PnP code is enabled. |
51 | ISDN Appropriate ISDN support is enabled. | 51 | ISDN Appropriate ISDN support is enabled. |
52 | JOY Appropriate joystick support is enabled. | 52 | JOY Appropriate joystick support is enabled. |
53 | KMEMTRACE kmemtrace is enabled. | ||
53 | LIBATA Libata driver is enabled | 54 | LIBATA Libata driver is enabled |
54 | LP Printer support is enabled. | 55 | LP Printer support is enabled. |
55 | LOOP Loopback device support is enabled. | 56 | LOOP Loopback device support is enabled. |
@@ -259,6 +260,22 @@ and is between 256 and 4096 characters. It is defined in the file | |||
259 | to assume that this machine's pmtimer latches its value | 260 | to assume that this machine's pmtimer latches its value |
260 | and always returns good values. | 261 | and always returns good values. |
261 | 262 | ||
263 | acpi_enforce_resources= [ACPI] | ||
264 | { strict | lax | no } | ||
265 | Check for resource conflicts between native drivers | ||
266 | and ACPI OperationRegions (SystemIO and SystemMemory | ||
267 | only). IO ports and memory declared in ACPI might be | ||
268 | used by the ACPI subsystem in arbitrary AML code and | ||
269 | can interfere with legacy drivers. | ||
270 | strict (default): access to resources claimed by ACPI | ||
271 | is denied; legacy drivers trying to access reserved | ||
272 | resources will fail to bind to device using them. | ||
273 | lax: access to resources claimed by ACPI is allowed; | ||
274 | legacy drivers trying to access reserved resources | ||
275 | will bind successfully but a warning message is logged. | ||
276 | no: ACPI OperationRegions are not marked as reserved, | ||
277 | no further checks are performed. | ||
278 | |||
262 | agp= [AGP] | 279 | agp= [AGP] |
263 | { off | try_unsupported } | 280 | { off | try_unsupported } |
264 | off: disable AGP support | 281 | off: disable AGP support |
@@ -617,6 +634,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
617 | 634 | ||
618 | debug_objects [KNL] Enable object debugging | 635 | debug_objects [KNL] Enable object debugging |
619 | 636 | ||
637 | no_debug_objects | ||
638 | [KNL] Disable object debugging | ||
639 | |||
620 | debugpat [X86] Enable PAT debugging | 640 | debugpat [X86] Enable PAT debugging |
621 | 641 | ||
622 | decnet.addr= [HW,NET] | 642 | decnet.addr= [HW,NET] |
@@ -1078,6 +1098,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1078 | use the HighMem zone if it exists, and the Normal | 1098 | use the HighMem zone if it exists, and the Normal |
1079 | zone if it does not. | 1099 | zone if it does not. |
1080 | 1100 | ||
1101 | kmemtrace.enable= [KNL,KMEMTRACE] Format: { yes | no } | ||
1102 | Controls whether kmemtrace is enabled | ||
1103 | at boot-time. | ||
1104 | |||
1105 | kmemtrace.subbufs=n [KNL,KMEMTRACE] Overrides the number of | ||
1106 | subbufs kmemtrace's relay channel has. Set this | ||
1107 | higher than default (KMEMTRACE_N_SUBBUFS in code) if | ||
1108 | you experience buffer overruns. | ||
1109 | |||
1081 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter | 1110 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter |
1082 | is similar to kernelcore except it specifies the | 1111 | is similar to kernelcore except it specifies the |
1083 | amount of memory used for migratable allocations. | 1112 | amount of memory used for migratable allocations. |
@@ -1546,6 +1575,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1546 | Valid arguments: on, off | 1575 | Valid arguments: on, off |
1547 | Default: on | 1576 | Default: on |
1548 | 1577 | ||
1578 | noiotrap [SH] Disables trapped I/O port accesses. | ||
1579 | |||
1549 | noirqdebug [X86-32] Disables the code which attempts to detect and | 1580 | noirqdebug [X86-32] Disables the code which attempts to detect and |
1550 | disable unhandled interrupt sources. | 1581 | disable unhandled interrupt sources. |
1551 | 1582 | ||
@@ -2364,6 +2395,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
2364 | 2395 | ||
2365 | tp720= [HW,PS2] | 2396 | tp720= [HW,PS2] |
2366 | 2397 | ||
2398 | trace_buf_size=nn[KMG] [ftrace] will set tracing buffer size. | ||
2399 | |||
2367 | trix= [HW,OSS] MediaTrix AudioTrix Pro | 2400 | trix= [HW,OSS] MediaTrix AudioTrix Pro |
2368 | Format: | 2401 | Format: |
2369 | <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> | 2402 | <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> |
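Several parameters in this file, including the new trace_buf_size=nn[KMG], take a number with an optional size suffix. As a rough illustration, and assuming the usual kernel convention that K, M and G are binary multiples (1024, 1024^2, 1024^3), the value can be expanded to bytes as below. to_bytes is a hypothetical helper that handles plain decimal numbers only; it is not the kernel's parser.

```shell
#!/bin/sh
# to_bytes SIZE
# Hypothetical helper: expands "nn", "nnK", "nnM" or "nnG" to a byte
# count, assuming binary multiples. Only decimal digits are accepted.
to_bytes() {
    # Drop one optional trailing suffix letter to get the numeric part.
    num=${1%[KMGkmg]}

    # Reject empty input and anything that is not purely digits.
    case "$num" in
        ''|*[!0-9]*) echo "invalid size: $1" >&2; return 1 ;;
    esac

    case "$1" in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Example: what a boot option such as trace_buf_size=2M would request
# under the assumption above.
to_bytes 2M
```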
diff --git a/Documentation/laptops/acer-wmi.txt b/Documentation/laptops/acer-wmi.txt index 2b3a6b5260bf..5ee2a02b3b40 100644 --- a/Documentation/laptops/acer-wmi.txt +++ b/Documentation/laptops/acer-wmi.txt | |||
@@ -1,9 +1,9 @@ | |||
1 | Acer Laptop WMI Extras Driver | 1 | Acer Laptop WMI Extras Driver |
2 | http://code.google.com/p/aceracpi | 2 | http://code.google.com/p/aceracpi |
3 | Version 0.2 | 3 | Version 0.3 |
4 | 18th August 2008 | 4 | 4th April 2009 |
5 | 5 | ||
6 | Copyright 2007-2008 Carlos Corbacho <carlos@strangeworlds.co.uk> | 6 | Copyright 2007-2009 Carlos Corbacho <carlos@strangeworlds.co.uk> |
7 | 7 | ||
8 | acer-wmi is a driver to allow you to control various parts of your Acer laptop | 8 | acer-wmi is a driver to allow you to control various parts of your Acer laptop |
9 | hardware under Linux which are exposed via ACPI-WMI. | 9 | hardware under Linux which are exposed via ACPI-WMI. |
@@ -36,6 +36,10 @@ not possible in kernel space from a 64 bit OS. | |||
36 | Supported Hardware | 36 | Supported Hardware |
37 | ****************** | 37 | ****************** |
38 | 38 | ||
39 | NOTE: The Acer Aspire One is not supported hardware. It cannot work with | ||
40 | acer-wmi until Acer fix their ACPI-WMI implementation on them, so has been | ||
41 | blacklisted until that happens. | ||
42 | |||
39 | Please see the website for the current list of known working hardware: | 43 | Please see the website for the current list of known working hardware: |
40 | 44 | ||
41 | http://code.google.com/p/aceracpi/wiki/SupportedHardware | 45 | http://code.google.com/p/aceracpi/wiki/SupportedHardware |
diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 41bc99fa1884..3d7650768bb5 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt | |||
@@ -20,7 +20,8 @@ moved to the drivers/misc tree and renamed to thinkpad-acpi for kernel | |||
20 | kernel 2.6.29 and release 0.22. | 20 | kernel 2.6.29 and release 0.22. |
21 | 21 | ||
22 | The driver is named "thinkpad-acpi". In some places, like module | 22 | The driver is named "thinkpad-acpi". In some places, like module |
23 | names, "thinkpad_acpi" is used because of userspace issues. | 23 | names and log messages, "thinkpad_acpi" is used because of userspace |
24 | issues. | ||
24 | 25 | ||
25 | "tpacpi" is used as a shorthand where "thinkpad-acpi" would be too | 26 | "tpacpi" is used as a shorthand where "thinkpad-acpi" would be too |
26 | long due to length limitations on some Linux kernel versions. | 27 | long due to length limitations on some Linux kernel versions. |
@@ -37,7 +38,7 @@ detailed description): | |||
37 | - ThinkLight on and off | 38 | - ThinkLight on and off |
38 | - limited docking and undocking | 39 | - limited docking and undocking |
39 | - UltraBay eject | 40 | - UltraBay eject |
40 | - CMOS control | 41 | - CMOS/UCMS control |
41 | - LED control | 42 | - LED control |
42 | - ACPI sounds | 43 | - ACPI sounds |
43 | - temperature sensors | 44 | - temperature sensors |
@@ -46,6 +47,7 @@ detailed description): | |||
46 | - Volume control | 47 | - Volume control |
47 | - Fan control and monitoring: fan speed, fan enable/disable | 48 | - Fan control and monitoring: fan speed, fan enable/disable |
48 | - WAN enable and disable | 49 | - WAN enable and disable |
50 | - UWB enable and disable | ||
49 | 51 | ||
50 | A compatibility table by model and feature is maintained on the web | 52 | A compatibility table by model and feature is maintained on the web |
51 | site, http://ibm-acpi.sf.net/. I appreciate any success or failure | 53 | site, http://ibm-acpi.sf.net/. I appreciate any success or failure |
@@ -53,7 +55,7 @@ reports, especially if they add to or correct the compatibility table. | |||
53 | Please include the following information in your report: | 55 | Please include the following information in your report: |
54 | 56 | ||
55 | - ThinkPad model name | 57 | - ThinkPad model name |
56 | - a copy of your DSDT, from /proc/acpi/dsdt | 58 | - a copy of your ACPI tables, using the "acpidump" utility |
57 | - a copy of the output of dmidecode, with serial numbers | 59 | - a copy of the output of dmidecode, with serial numbers |
58 | and UUIDs masked off | 60 | and UUIDs masked off |
59 | - which driver features work and which don't | 61 | - which driver features work and which don't |
@@ -66,17 +68,18 @@ Installation | |||
66 | ------------ | 68 | ------------ |
67 | 69 | ||
68 | If you are compiling this driver as included in the Linux kernel | 70 | If you are compiling this driver as included in the Linux kernel |
69 | sources, simply enable the CONFIG_THINKPAD_ACPI option, and optionally | 71 | sources, look for the CONFIG_THINKPAD_ACPI Kconfig option. |
70 | enable the CONFIG_THINKPAD_ACPI_BAY option if you want the | 72 | It is located on the menu path: "Device Drivers" -> "X86 Platform |
71 | thinkpad-specific bay functionality. | 73 | Specific Device Drivers" -> "ThinkPad ACPI Laptop Extras". |
74 | |||
72 | 75 | ||
73 | Features | 76 | Features |
74 | -------- | 77 | -------- |
75 | 78 | ||
76 | The driver exports two different interfaces to userspace, which can be | 79 | The driver exports two different interfaces to userspace, which can be |
77 | used to access the features it provides. One is a legacy procfs-based | 80 | used to access the features it provides. One is a legacy procfs-based |
78 | interface, which will be removed at some time in the distant future. | 81 | interface, which will be removed at some time in the future. The other |
79 | The other is a new sysfs-based interface which is not complete yet. | 82 | is a new sysfs-based interface which is not complete yet. |
80 | 83 | ||
81 | The procfs interface creates the /proc/acpi/ibm directory. There is a | 84 | The procfs interface creates the /proc/acpi/ibm directory. There is a |
82 | file under that directory for each feature it supports. The procfs | 85 | file under that directory for each feature it supports. The procfs |
@@ -111,15 +114,17 @@ The version of thinkpad-acpi's sysfs interface is exported by the driver | |||
111 | as a driver attribute (see below). | 114 | as a driver attribute (see below). |
112 | 115 | ||
113 | Sysfs driver attributes are on the driver's sysfs attribute space, | 116 | Sysfs driver attributes are on the driver's sysfs attribute space, |
114 | for 2.6.23 this is /sys/bus/platform/drivers/thinkpad_acpi/ and | 117 | for 2.6.23+ this is /sys/bus/platform/drivers/thinkpad_acpi/ and |
115 | /sys/bus/platform/drivers/thinkpad_hwmon/ | 118 | /sys/bus/platform/drivers/thinkpad_hwmon/ |
116 | 119 | ||
117 | Sysfs device attributes are on the thinkpad_acpi device sysfs attribute | 120 | Sysfs device attributes are on the thinkpad_acpi device sysfs attribute |
118 | space, for 2.6.23 this is /sys/devices/platform/thinkpad_acpi/. | 121 | space, for 2.6.23+ this is /sys/devices/platform/thinkpad_acpi/. |
119 | 122 | ||
120 | Sysfs device attributes for the sensors and fan are on the | 123 | Sysfs device attributes for the sensors and fan are on the |
121 | thinkpad_hwmon device's sysfs attribute space, but you should locate it | 124 | thinkpad_hwmon device's sysfs attribute space, but you should locate it |
122 | looking for a hwmon device with the name attribute of "thinkpad". | 125 | looking for a hwmon device with the name attribute of "thinkpad", or |
126 | better yet, through libsensors. | ||
127 | |||
123 | 128 | ||
124 | Driver version | 129 | Driver version |
125 | -------------- | 130 | -------------- |
@@ -129,6 +134,7 @@ sysfs driver attribute: version | |||
129 | 134 | ||
130 | The driver name and version. No commands can be written to this file. | 135 | The driver name and version. No commands can be written to this file. |
131 | 136 | ||
137 | |||
132 | Sysfs interface version | 138 | Sysfs interface version |
133 | ----------------------- | 139 | ----------------------- |
134 | 140 | ||
@@ -160,6 +166,7 @@ expect that an attribute might not be there, and deal with it properly | |||
160 | (an attribute not being there *is* a valid way to make it clear that a | 166 | (an attribute not being there *is* a valid way to make it clear that a |
161 | feature is not available in sysfs). | 167 | feature is not available in sysfs). |
162 | 168 | ||
169 | |||
163 | Hot keys | 170 | Hot keys |
164 | -------- | 171 | -------- |
165 | 172 | ||
@@ -172,17 +179,14 @@ system. Enabling the hotkey functionality of thinkpad-acpi signals the | |||
172 | firmware that such a driver is present, and modifies how the ThinkPad | 179 | firmware that such a driver is present, and modifies how the ThinkPad |
173 | firmware will behave in many situations. | 180 | firmware will behave in many situations. |
174 | 181 | ||
175 | The driver enables the hot key feature automatically when loaded. The | 182 | The driver enables the HKEY ("hot key") event reporting automatically |
176 | feature can later be disabled and enabled back at runtime. The driver | 183 | when loaded, and disables it when it is removed. |
177 | will also restore the hot key feature to its previous state and mask | ||
178 | when it is unloaded. | ||
179 | 184 | ||
180 | When the hotkey feature is enabled and the hot key mask is set (see | 185 | The driver will report HKEY events in the following format: |
181 | below), the driver will report HKEY events in the following format: | ||
182 | 186 | ||
183 | ibm/hotkey HKEY 00000080 0000xxxx | 187 | ibm/hotkey HKEY 00000080 0000xxxx |
184 | 188 | ||
185 | Some of these events refer to hot key presses, but not all. | 189 | Some of these events refer to hot key presses, but not all of them. |
186 | 190 | ||
187 | The driver will generate events over the input layer for hot keys and | 191 | The driver will generate events over the input layer for hot keys and |
188 | radio switches, and over the ACPI netlink layer for other events. The | 192 | radio switches, and over the ACPI netlink layer for other events. The |
@@ -214,13 +218,17 @@ procfs notes: | |||
214 | 218 | ||
215 | The following commands can be written to the /proc/acpi/ibm/hotkey file: | 219 | The following commands can be written to the /proc/acpi/ibm/hotkey file: |
216 | 220 | ||
217 | echo enable > /proc/acpi/ibm/hotkey -- enable the hot keys feature | ||
218 | echo disable > /proc/acpi/ibm/hotkey -- disable the hot keys feature | ||
219 | echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys | 221 | echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys |
220 | echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys | 222 | echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys |
221 | ... any other 8-hex-digit mask ... | 223 | ... any other 8-hex-digit mask ... |
222 | echo reset > /proc/acpi/ibm/hotkey -- restore the original mask | 224 | echo reset > /proc/acpi/ibm/hotkey -- restore the original mask |
223 | 225 | ||
226 | The following commands have been deprecated and will cause the kernel | ||
227 | to log a warning: | ||
228 | |||
229 | echo enable > /proc/acpi/ibm/hotkey -- does nothing | ||
230 | echo disable > /proc/acpi/ibm/hotkey -- returns an error | ||
231 | |||
224 | The procfs interface does not support NVRAM polling control. So as to | 232 | The procfs interface does not support NVRAM polling control. So as to |
225 | maintain maximum bug-to-bug compatibility, it does not report any masks, | 233 | maintain maximum bug-to-bug compatibility, it does not report any masks, |
226 | nor does it allow one to manipulate the hot key mask when the firmware | 234 | nor does it allow one to manipulate the hot key mask when the firmware |
@@ -229,12 +237,9 @@ does not support masks at all, even if NVRAM polling is in use. | |||
229 | sysfs notes: | 237 | sysfs notes: |
230 | 238 | ||
231 | hotkey_bios_enabled: | 239 | hotkey_bios_enabled: |
232 | Returns the status of the hot keys feature when | 240 | DEPRECATED, WILL BE REMOVED SOON. |
233 | thinkpad-acpi was loaded. Upon module unload, the hot | ||
234 | key feature status will be restored to this value. | ||
235 | 241 | ||
236 | 0: hot keys were disabled | 242 | Returns 0. |
237 | 1: hot keys were enabled (unusual) | ||
238 | 243 | ||
239 | hotkey_bios_mask: | 244 | hotkey_bios_mask: |
240 | Returns the hot keys mask when thinkpad-acpi was loaded. | 245 | Returns the hot keys mask when thinkpad-acpi was loaded. |
@@ -242,13 +247,10 @@ sysfs notes: | |||
242 | to this value. | 247 | to this value. |
243 | 248 | ||
244 | hotkey_enable: | 249 | hotkey_enable: |
245 | Enables/disables the hot keys feature in the ACPI | 250 | DEPRECATED, WILL BE REMOVED SOON. |
246 | firmware, and reports current status of the hot keys | ||
247 | feature. Has no effect on the NVRAM hot key polling | ||
248 | functionality. | ||
249 | 251 | ||
250 | 0: disables the hot keys feature / feature disabled | 252 | 0: returns -EPERM |
251 | 1: enables the hot keys feature / feature enabled | 253 | 1: does nothing |
252 | 254 | ||
253 | hotkey_mask: | 255 | hotkey_mask: |
254 | bit mask to enable driver-handling (and depending on | 256 | bit mask to enable driver-handling (and depending on |
@@ -618,6 +620,7 @@ For Lenovo models *with* ACPI backlight control: | |||
618 | and map them to KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN. Process | 620 | and map them to KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN. Process |
619 | these keys on userspace somehow (e.g. by calling xbacklight). | 621 | these keys on userspace somehow (e.g. by calling xbacklight). |
620 | 622 | ||
623 | |||
621 | Bluetooth | 624 | Bluetooth |
622 | --------- | 625 | --------- |
623 | 626 | ||
@@ -628,6 +631,9 @@ sysfs rfkill class: switch "tpacpi_bluetooth_sw" | |||
628 | This feature shows the presence and current state of a ThinkPad | 631 | This feature shows the presence and current state of a ThinkPad |
629 | Bluetooth device in the internal ThinkPad CDC slot. | 632 | Bluetooth device in the internal ThinkPad CDC slot. |
630 | 633 | ||
634 | If the ThinkPad supports it, the Bluetooth state is stored in NVRAM, | ||
635 | so it is kept across reboots and power-off. | ||
636 | |||
631 | Procfs notes: | 637 | Procfs notes: |
632 | 638 | ||
633 | If Bluetooth is installed, the following commands can be used: | 639 | If Bluetooth is installed, the following commands can be used: |
@@ -652,6 +658,7 @@ Sysfs notes: | |||
652 | rfkill controller switch "tpacpi_bluetooth_sw": refer to | 658 | rfkill controller switch "tpacpi_bluetooth_sw": refer to |
653 | Documentation/rfkill.txt for details. | 659 | Documentation/rfkill.txt for details. |
654 | 660 | ||
661 | |||
655 | Video output control -- /proc/acpi/ibm/video | 662 | Video output control -- /proc/acpi/ibm/video |
656 | -------------------------------------------- | 663 | -------------------------------------------- |
657 | 664 | ||
@@ -693,11 +700,8 @@ Fn-F7 from working. This also disables the video output switching | |||
693 | features of this driver, as it uses the same ACPI methods as | 700 | features of this driver, as it uses the same ACPI methods as |
694 | Fn-F7. Video switching on the console should still work. | 701 | Fn-F7. Video switching on the console should still work. |
695 | 702 | ||
696 | UPDATE: There's now a patch for the X.org Radeon driver which | 703 | UPDATE: refer to https://bugs.freedesktop.org/show_bug.cgi?id=2000 |
697 | addresses this issue. Some people are reporting success with the patch | ||
698 | while others are still having problems. For more information: | ||
699 | 704 | ||
700 | https://bugs.freedesktop.org/show_bug.cgi?id=2000 | ||
701 | 705 | ||
702 | ThinkLight control | 706 | ThinkLight control |
703 | ------------------ | 707 | ------------------ |
@@ -720,10 +724,11 @@ The ThinkLight sysfs interface is documented by the LED class | |||
720 | documentation, in Documentation/leds-class.txt. The ThinkLight LED name | 724 | documentation, in Documentation/leds-class.txt. The ThinkLight LED name |
721 | is "tpacpi::thinklight". | 725 | is "tpacpi::thinklight". |
722 | 726 | ||
723 | Due to limitations in the sysfs LED class, if the status of the thinklight | 727 | Due to limitations in the sysfs LED class, if the status of the ThinkLight |
724 | cannot be read or if it is unknown, thinkpad-acpi will report it as "off". | 728 | cannot be read or if it is unknown, thinkpad-acpi will report it as "off". |
725 | It is impossible to know if the status returned through sysfs is valid. | 729 | It is impossible to know if the status returned through sysfs is valid. |
726 | 730 | ||
731 | |||
727 | Docking / undocking -- /proc/acpi/ibm/dock | 732 | Docking / undocking -- /proc/acpi/ibm/dock |
728 | ------------------------------------------ | 733 | ------------------------------------------ |
729 | 734 | ||
@@ -784,6 +789,7 @@ the only docking stations currently supported are the X-series | |||
784 | UltraBase docks and "dumb" port replicators like the Mini Dock (the | 789 | UltraBase docks and "dumb" port replicators like the Mini Dock (the |
785 | latter don't need any ACPI support, actually). | 790 | latter don't need any ACPI support, actually). |
786 | 791 | ||
792 | |||
787 | UltraBay eject -- /proc/acpi/ibm/bay | 793 | UltraBay eject -- /proc/acpi/ibm/bay |
788 | ------------------------------------ | 794 | ------------------------------------ |
789 | 795 | ||
@@ -847,8 +853,9 @@ supported. Use "eject2" instead of "eject" for the second bay. | |||
847 | Note: the UltraBay eject support on the 600e/x, A22p and A3x is | 853 | Note: the UltraBay eject support on the 600e/x, A22p and A3x is |
848 | EXPERIMENTAL and may not work as expected. USE WITH CAUTION! | 854 | EXPERIMENTAL and may not work as expected. USE WITH CAUTION! |
849 | 855 | ||
850 | CMOS control | 856 | |
851 | ------------ | 857 | CMOS/UCMS control |
858 | ----------------- | ||
852 | 859 | ||
853 | procfs: /proc/acpi/ibm/cmos | 860 | procfs: /proc/acpi/ibm/cmos |
854 | sysfs device attribute: cmos_command | 861 | sysfs device attribute: cmos_command |
@@ -882,6 +889,7 @@ The cmos command interface is prone to firmware split-brain problems, as | |||
882 | in newer ThinkPads it is just a compatibility layer. Do not use it, it is | 889 | in newer ThinkPads it is just a compatibility layer. Do not use it, it is |
883 | exported just as a debug tool. | 890 | exported just as a debug tool. |
884 | 891 | ||
892 | |||
885 | LED control | 893 | LED control |
886 | ----------- | 894 | ----------- |
887 | 895 | ||
@@ -893,6 +901,17 @@ some older ThinkPad models, it is possible to query the status of the | |||
893 | LED indicators as well. Newer ThinkPads cannot query the real status | 901 | LED indicators as well. Newer ThinkPads cannot query the real status |
894 | of the LED indicators. | 902 | of the LED indicators. |
895 | 903 | ||
904 | Because misuse of the LEDs could induce an unaware user to perform | ||
905 | dangerous actions (like undocking or ejecting a bay device while the | ||
906 | buses are still active), or mask an important alarm (such as a nearly | ||
907 | empty battery, or a broken battery), access to most LEDs is | ||
908 | restricted. | ||
909 | |||
910 | Unrestricted access to all LEDs requires that thinkpad-acpi be | ||
911 | compiled with the CONFIG_THINKPAD_ACPI_UNSAFE_LEDS option enabled. | ||
912 | Distributions must never enable this option. Individual users that | ||
913 | are aware of the consequences are welcome to enable it. | ||
914 | |||
896 | procfs notes: | 915 | procfs notes: |
897 | 916 | ||
898 | The available commands are: | 917 | The available commands are: |
@@ -939,6 +958,7 @@ ThinkPad indicator LED should blink in hardware accelerated mode, use the | |||
939 | "timer" trigger, and leave the delay_on and delay_off parameters set to | 958 | "timer" trigger, and leave the delay_on and delay_off parameters set to |
940 | zero (to request hardware acceleration autodetection). | 959 | zero (to request hardware acceleration autodetection). |
941 | 960 | ||
961 | |||
942 | ACPI sounds -- /proc/acpi/ibm/beep | 962 | ACPI sounds -- /proc/acpi/ibm/beep |
943 | ---------------------------------- | 963 | ---------------------------------- |
944 | 964 | ||
@@ -968,6 +988,7 @@ X40: | |||
968 | 16 - one medium-pitched beep repeating constantly, stop with 17 | 988 | 16 - one medium-pitched beep repeating constantly, stop with 17 |
969 | 17 - stop 16 | 989 | 17 - stop 16 |
970 | 990 | ||
991 | |||
971 | Temperature sensors | 992 | Temperature sensors |
972 | ------------------- | 993 | ------------------- |
973 | 994 | ||
@@ -1115,6 +1136,7 @@ registers contain the current battery capacity, etc. If you experiment | |||
1115 | with this, do send me your results (including some complete dumps with | 1136 | with this, do send me your results (including some complete dumps with |
1116 | a description of the conditions when they were taken.) | 1137 | a description of the conditions when they were taken.) |
1117 | 1138 | ||
1139 | |||
1118 | LCD brightness control | 1140 | LCD brightness control |
1119 | ---------------------- | 1141 | ---------------------- |
1120 | 1142 | ||
@@ -1124,10 +1146,9 @@ sysfs backlight device "thinkpad_screen" | |||
1124 | This feature allows software control of the LCD brightness on ThinkPad | 1146 | This feature allows software control of the LCD brightness on ThinkPad |
1125 | models which don't have a hardware brightness slider. | 1147 | models which don't have a hardware brightness slider. |
1126 | 1148 | ||
1127 | It has some limitations: the LCD backlight cannot be actually turned on or | 1149 | It has some limitations: the LCD backlight cannot be actually turned |
1128 | off by this interface, and in many ThinkPad models, the "dim while on | 1150 | on or off by this interface, it just controls the backlight brightness |
1129 | battery" functionality will be enabled by the BIOS when this interface is | 1151 | level. |
1130 | used, and cannot be controlled. | ||
1131 | 1152 | ||
1132 | On IBM (and some of the earlier Lenovo) ThinkPads, the backlight control | 1153 | On IBM (and some of the earlier Lenovo) ThinkPads, the backlight control |
1133 | has eight brightness levels, ranging from 0 to 7. Some of the levels | 1154 | has eight brightness levels, ranging from 0 to 7. Some of the levels |
@@ -1136,10 +1157,15 @@ display backlight brightness control methods have 16 levels, ranging | |||
1136 | from 0 to 15. | 1157 | from 0 to 15. |
1137 | 1158 | ||
1138 | There are two interfaces to the firmware for direct brightness control, | 1159 | There are two interfaces to the firmware for direct brightness control, |
1139 | EC and CMOS. To select which one should be used, use the | 1160 | EC and UCMS (or CMOS). To select which one should be used, use the |
1140 | brightness_mode module parameter: brightness_mode=1 selects EC mode, | 1161 | brightness_mode module parameter: brightness_mode=1 selects EC mode, |
1141 | brightness_mode=2 selects CMOS mode, brightness_mode=3 selects both EC | 1162 | brightness_mode=2 selects UCMS mode, brightness_mode=3 selects EC |
1142 | and CMOS. The driver tries to auto-detect which interface to use. | 1163 | mode with NVRAM backing (so that brightness changes are remembered |
1164 | across shutdown/reboot). | ||
1165 | |||
1166 | The driver tries to select which interface to use from a table of | ||
1167 | defaults for each ThinkPad model. If it makes a wrong choice, please | ||
1168 | report this as a bug, so that we can fix it. | ||
1143 | 1169 | ||
1144 | When display backlight brightness controls are available through the | 1170 | When display backlight brightness controls are available through the |
1145 | standard ACPI interface, it is best to use it instead of this direct | 1171 | standard ACPI interface, it is best to use it instead of this direct |
@@ -1201,6 +1227,7 @@ WARNING: | |||
1201 | and maybe reduce the life of the backlight lamps by needlessly kicking | 1227 | and maybe reduce the life of the backlight lamps by needlessly kicking |
1202 | its level up and down at every change. | 1228 | its level up and down at every change. |
1203 | 1229 | ||
1230 | |||
1204 | Volume control -- /proc/acpi/ibm/volume | 1231 | Volume control -- /proc/acpi/ibm/volume |
1205 | --------------------------------------- | 1232 | --------------------------------------- |
1206 | 1233 | ||
@@ -1217,6 +1244,11 @@ distinct. The unmute the volume after the mute command, use either the | |||
1217 | up or down command (the level command will not unmute the volume). | 1244 | up or down command (the level command will not unmute the volume). |
1218 | The current volume level and mute state is shown in the file. | 1245 | The current volume level and mute state is shown in the file. |
1219 | 1246 | ||
1247 | The ALSA mixer interface to this feature is still missing, but patches | ||
1248 | to add it exist. That problem should be addressed in the not so | ||
1249 | distant future. | ||
1250 | |||
1251 | |||
1220 | Fan control and monitoring: fan speed, fan enable/disable | 1252 | Fan control and monitoring: fan speed, fan enable/disable |
1221 | --------------------------------------------------------- | 1253 | --------------------------------------------------------- |
1222 | 1254 | ||
@@ -1383,8 +1415,11 @@ procfs: /proc/acpi/ibm/wan | |||
1383 | sysfs device attribute: wwan_enable (deprecated) | 1415 | sysfs device attribute: wwan_enable (deprecated) |
1384 | sysfs rfkill class: switch "tpacpi_wwan_sw" | 1416 | sysfs rfkill class: switch "tpacpi_wwan_sw" |
1385 | 1417 | ||
1386 | This feature shows the presence and current state of a W-WAN (Sierra | 1418 | This feature shows the presence and current state of the built-in |
1387 | Wireless EV-DO) device. | 1419 | Wireless WAN device. |
1420 | |||
1421 | If the ThinkPad supports it, the WWAN state is stored in NVRAM, | ||
1422 | so it is kept across reboots and power-off. | ||
1388 | 1423 | ||
1389 | It was tested on a Lenovo ThinkPad X60. It should probably work on other | 1424 | It was tested on a Lenovo ThinkPad X60. It should probably work on other |
1390 | ThinkPad models which come with this module installed. | 1425 | ThinkPad models which come with this module installed. |
@@ -1413,6 +1448,7 @@ Sysfs notes: | |||
1413 | rfkill controller switch "tpacpi_wwan_sw": refer to | 1448 | rfkill controller switch "tpacpi_wwan_sw": refer to |
1414 | Documentation/rfkill.txt for details. | 1449 | Documentation/rfkill.txt for details. |
1415 | 1450 | ||
1451 | |||
1416 | EXPERIMENTAL: UWB | 1452 | EXPERIMENTAL: UWB |
1417 | ----------------- | 1453 | ----------------- |
1418 | 1454 | ||
@@ -1431,6 +1467,7 @@ Sysfs notes: | |||
1431 | rfkill controller switch "tpacpi_uwb_sw": refer to | 1467 | rfkill controller switch "tpacpi_uwb_sw": refer to |
1432 | Documentation/rfkill.txt for details. | 1468 | Documentation/rfkill.txt for details. |
1433 | 1469 | ||
1470 | |||
1434 | Multiple Commands, Module Parameters | 1471 | Multiple Commands, Module Parameters |
1435 | ------------------------------------ | 1472 | ------------------------------------ |
1436 | 1473 | ||
@@ -1445,6 +1482,7 @@ for example: | |||
1445 | 1482 | ||
1446 | modprobe thinkpad_acpi hotkey=enable,0xffff video=auto_disable | 1483 | modprobe thinkpad_acpi hotkey=enable,0xffff video=auto_disable |
1447 | 1484 | ||
1485 | |||
1448 | Enabling debugging output | 1486 | Enabling debugging output |
1449 | ------------------------- | 1487 | ------------------------- |
1450 | 1488 | ||
@@ -1457,8 +1495,15 @@ will enable all debugging output classes. It takes a bitmask, so | |||
1457 | to enable more than one output class, just add their values. | 1495 | to enable more than one output class, just add their values. |
1458 | 1496 | ||
1459 | Debug bitmask Description | 1497 | Debug bitmask Description |
1498 | 0x8000 Disclose PID of userspace programs | ||
1499 | accessing some functions of the driver | ||
1460 | 0x0001 Initialization and probing | 1500 | 0x0001 Initialization and probing |
1461 | 0x0002 Removal | 1501 | 0x0002 Removal |
1502 | 0x0004 RF Transmitter control (RFKILL) | ||
1503 | (bluetooth, WWAN, UWB...) | ||
1504 | 0x0008 HKEY event interface, hotkeys | ||
1505 | 0x0010 Fan control | ||
1506 | 0x0020 Backlight brightness | ||
1462 | 1507 | ||
1463 | There is also a kernel build option to enable more debugging | 1508 | There is also a kernel build option to enable more debugging |
1464 | information, which may be necessary to debug driver problems. | 1509 | information, which may be necessary to debug driver problems. |
@@ -1467,6 +1512,7 @@ The level of debugging information output by the driver can be changed | |||
1467 | at runtime through sysfs, using the driver attribute debug_level. The | 1512 | at runtime through sysfs, using the driver attribute debug_level. The |
1468 | attribute takes the same bitmask as the debug module parameter above. | 1513 | attribute takes the same bitmask as the debug module parameter above. |
1469 | 1514 | ||
1515 | |||
1470 | Force loading of module | 1516 | Force loading of module |
1471 | ----------------------- | 1517 | ----------------------- |
1472 | 1518 | ||
@@ -1505,3 +1551,7 @@ Sysfs interface changelog: | |||
1505 | 1551 | ||
1506 | 0x020200: Add poll()/select() support to the following attributes: | 1552 | 0x020200: Add poll()/select() support to the following attributes: |
1507 | hotkey_radio_sw, wakeup_hotunplug_complete, wakeup_reason | 1553 | hotkey_radio_sw, wakeup_hotunplug_complete, wakeup_reason |
1554 | |||
1555 | 0x020300: hotkey enable/disable support removed, attributes | ||
1556 | hotkey_bios_enabled and hotkey_enable deprecated and | ||
1557 | marked for removal. | ||
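The debug bitmask table in the thinkpad-acpi.txt hunk above says that several output classes are enabled by adding their values. A minimal sketch of that arithmetic follows, using the 0x0004 (RFKILL) and 0x0010 (fan control) classes from the table. The variable names are hypothetical and chosen only for this illustration; the modprobe line is left as a comment because it needs root and real ThinkPad hardware.

```shell
# Debug class values copied from the bitmask table in thinkpad-acpi.txt.
# The variable names are illustrative only; they are not defined by the driver.
TPACPI_DBG_RFKILL=0x0004   # RF transmitter control (bluetooth, WWAN, UWB...)
TPACPI_DBG_FAN=0x0010      # fan control

# OR the classes together (the same as adding them, since the bits are distinct).
mask=$(printf '0x%04x' $(( $TPACPI_DBG_RFKILL | $TPACPI_DBG_FAN )))
echo "debug mask: $mask"

# Load-time use, as root:
#   modprobe thinkpad_acpi debug=$mask
```

The same value can be written at runtime to the debug_level driver attribute, which the text says takes the same bitmask as the module parameter.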
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index afa2946892da..cf42b820ff9d 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -115,6 +115,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
115 | 115 | ||
116 | 'x' - Used by xmon interface on ppc/powerpc platforms. | 116 | 'x' - Used by xmon interface on ppc/powerpc platforms. |
117 | 117 | ||
118 | 'z' - Dump the ftrace buffer | ||
119 | |||
118 | '0'-'9' - Sets the console log level, controlling which kernel messages | 120 | '0'-'9' - Sets the console log level, controlling which kernel messages |
119 | will be printed to your console. ('0', for example would make | 121 | will be printed to your console. ('0', for example would make |
120 | it so that only emergency messages like PANICs or OOPSes would | 122 | it so that only emergency messages like PANICs or OOPSes would |
diff --git a/Documentation/tracepoints.txt b/Documentation/tracepoints.txt index 6f0a044f5b5e..c0e1ceed75a4 100644 --- a/Documentation/tracepoints.txt +++ b/Documentation/tracepoints.txt | |||
@@ -45,8 +45,8 @@ In include/trace/subsys.h : | |||
45 | #include <linux/tracepoint.h> | 45 | #include <linux/tracepoint.h> |
46 | 46 | ||
47 | DECLARE_TRACE(subsys_eventname, | 47 | DECLARE_TRACE(subsys_eventname, |
48 | TPPROTO(int firstarg, struct task_struct *p), | 48 | TP_PROTO(int firstarg, struct task_struct *p), |
49 | TPARGS(firstarg, p)); | 49 | TP_ARGS(firstarg, p)); |
50 | 50 | ||
51 | In subsys/file.c (where the tracing statement must be added) : | 51 | In subsys/file.c (where the tracing statement must be added) : |
52 | 52 | ||
@@ -66,10 +66,10 @@ Where : | |||
66 | - subsys is the name of your subsystem. | 66 | - subsys is the name of your subsystem. |
67 | - eventname is the name of the event to trace. | 67 | - eventname is the name of the event to trace. |
68 | 68 | ||
69 | - TPPROTO(int firstarg, struct task_struct *p) is the prototype of the | 69 | - TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the |
70 | function called by this tracepoint. | 70 | function called by this tracepoint. |
71 | 71 | ||
72 | - TPARGS(firstarg, p) are the parameters names, same as found in the | 72 | - TP_ARGS(firstarg, p) are the parameters names, same as found in the |
73 | prototype. | 73 | prototype. |
74 | 74 | ||
75 | Connecting a function (probe) to a tracepoint is done by providing a | 75 | Connecting a function (probe) to a tracepoint is done by providing a |
@@ -103,13 +103,14 @@ used to export the defined tracepoints. | |||
103 | 103 | ||
104 | * Probe / tracepoint example | 104 | * Probe / tracepoint example |
105 | 105 | ||
106 | See the example provided in samples/tracepoints/src | 106 | See the example provided in samples/tracepoints |
107 | 107 | ||
108 | Compile them with your kernel. | 108 | Compile them with your kernel. They are built during 'make' (not |
109 | 'make modules') when CONFIG_SAMPLE_TRACEPOINTS=m. | ||
109 | 110 | ||
110 | Run, as root : | 111 | Run, as root : |
111 | modprobe tracepoint-example (insmod order is not important) | 112 | modprobe tracepoint-sample (insmod order is not important) |
112 | modprobe tracepoint-probe-example | 113 | modprobe tracepoint-probe-sample |
113 | cat /proc/tracepoint-example (returns an expected error) | 114 | cat /proc/tracepoint-sample (returns an expected error) |
114 | rmmod tracepoint-example tracepoint-probe-example | 115 | rmmod tracepoint-sample tracepoint-probe-sample |
115 | dmesg | 116 | dmesg |
diff --git a/Documentation/vm/kmemtrace.txt b/Documentation/vm/kmemtrace.txt new file mode 100644 index 000000000000..a956d9b7f943 --- /dev/null +++ b/Documentation/vm/kmemtrace.txt | |||
@@ -0,0 +1,126 @@ | |||
1 | kmemtrace - Kernel Memory Tracer | ||
2 | |||
3 | by Eduard - Gabriel Munteanu | ||
4 | <eduard.munteanu@linux360.ro> | ||
5 | |||
6 | I. Introduction | ||
7 | =============== | ||
8 | |||
9 | kmemtrace helps kernel developers figure out two things: | ||
10 | 1) how different allocators (SLAB, SLUB etc.) perform | ||
11 | 2) how kernel code allocates memory and how much | ||
12 | |||
13 | To do this, we trace every allocation and export information to the userspace | ||
14 | through the relay interface. We export things such as the number of requested | ||
15 | bytes, the number of bytes actually allocated (i.e. including internal | ||
16 | fragmentation), whether this is a slab allocation or a plain kmalloc() and so | ||
17 | on. | ||
18 | |||
19 | The actual analysis is performed by a userspace tool (see section III for | ||
20 | details on where to get it from). It logs the data exported by the kernel, | ||
21 | processes it and (as of writing this) can provide the following information: | ||
22 | - the total amount of memory allocated and fragmentation per call-site | ||
23 | - the amount of memory allocated and fragmentation per allocation | ||
24 | - total memory allocated and fragmentation in the collected dataset | ||
25 | - number of cross-CPU allocations and frees (makes sense in NUMA environments) | ||
26 | |||
27 | Moreover, it can potentially find inconsistent and erroneous behavior in | ||
28 | kernel code, such as using slab free functions on kmalloc'ed memory or | ||
29 | allocating less memory than requested (but not truly failed allocations). | ||
30 | |||
31 | kmemtrace also makes provisions for tracing on one arch and analysing the | ||
32 | data on another. | ||
33 | |||
34 | II. Design and goals | ||
35 | ==================== | ||
36 | |||
37 | kmemtrace was designed to handle rather large amounts of data. Thus, it uses | ||
38 | the relay interface to export whatever is logged to userspace, which then | ||
39 | stores it. Analysis and reporting are done asynchronously, that is, after the | ||
40 | data is collected and stored. By design, it allows one to log and analyse | ||
41 | on different machines and different arches. | ||
42 | |||
43 | As of writing this, the ABI is not considered stable, though it might not | ||
44 | change much. However, no guarantees are made about compatibility yet. When | ||
45 | deemed stable, the ABI should still allow easy extension while maintaining | ||
46 | backward compatibility. This is described further in Documentation/ABI. | ||
47 | |||
48 | Summary of design goals: | ||
49 | - allow logging and analysis to be done across different machines | ||
50 | - be fast and anticipate usage in high-load environments (*) | ||
51 | - be reasonably extensible | ||
52 | - make it possible for GNU/Linux distributions to have kmemtrace | ||
53 | included in their repositories | ||
54 | |||
55 | (*) - one of the reasons Pekka Enberg's original userspace data analysis | ||
56 | tool's code was rewritten from Perl to C (although this is more than a | ||
57 | simple conversion) | ||
58 | |||
59 | |||
60 | III. Quick usage guide | ||
61 | ====================== | ||
62 | |||
63 | 1) Get a kernel that supports kmemtrace and build it accordingly (i.e. enable | ||
64 | CONFIG_KMEMTRACE). | ||
65 | |||
66 | 2) Get the userspace tool and build it: | ||
67 | $ git clone git://repo.or.cz/kmemtrace-user.git # current repository | ||
68 | $ cd kmemtrace-user/ | ||
69 | $ ./autogen.sh | ||
70 | $ ./configure | ||
71 | $ make | ||
72 | |||
73 | 3) Boot the kmemtrace-enabled kernel if you haven't, preferably in the | ||
74 | 'single' runlevel (so that relay buffers don't fill up easily), and run | ||
75 | kmemtrace: | ||
76 | # Note: '$' is a root prompt here, not an unprivileged user prompt. | ||
77 | $ mount -t debugfs none /sys/kernel/debug | ||
78 | $ mount -t proc none /proc | ||
79 | $ cd path/to/kmemtrace-user/ | ||
80 | $ ./kmemtraced | ||
81 | Wait a bit, then stop it with CTRL+C. | ||
82 | $ cat /sys/kernel/debug/kmemtrace/total_overruns # Check if we didn't | ||
83 | # overrun, should | ||
84 | # be zero. | ||
85 | $ (Optionally) [Run kmemtrace_check separately on each cpu[0-9]*.out file to | ||
86 | check its correctness] | ||
87 | $ ./kmemtrace-report | ||
88 | |||
89 | Now you should have a nice and short summary of how the allocator performs. | ||
90 | |||
91 | IV. FAQ and known issues | ||
92 | ======================== | ||
93 | |||
94 | Q: 'cat /sys/kernel/debug/kmemtrace/total_overruns' is non-zero, how do I fix | ||
95 | this? Should I worry? | ||
96 | A: If it's non-zero, this affects kmemtrace's accuracy, depending on how | ||
97 | large the number is. You can fix it by supplying a higher | ||
98 | 'kmemtrace.subbufs=N' kernel parameter. | ||
99 | --- | ||
100 | |||
101 | Q: kmemtrace_check reports errors, how do I fix this? Should I worry? | ||
102 | A: This is a bug and should be reported. It can occur for a variety of | ||
103 | reasons: | ||
104 | - possible bugs in relay code | ||
105 | - possible misuse of relay by kmemtrace | ||
106 | - timestamps being collected out of order | ||
107 | Or you may fix it yourself and send us a patch. | ||
108 | --- | ||
109 | |||
110 | Q: kmemtrace_report shows many errors, how do I fix this? Should I worry? | ||
111 | A: This is a known issue and I'm working on it. These might be true errors | ||
112 | in kernel code, which may have inconsistent behavior (e.g. allocating memory | ||
113 | with kmem_cache_alloc() and freeing it with kfree()). Pekka Enberg pointed | ||
114 | out this behavior may work with SLAB, but may fail with other allocators. | ||
115 | |||
116 | It may also be due to lack of tracing in some unusual allocator functions. | ||
117 | |||
118 | We don't want bug reports regarding this issue yet. | ||
119 | --- | ||
120 | |||
121 | V. See also | ||
122 | =========== | ||
123 | |||
124 | Documentation/kernel-parameters.txt | ||
125 | Documentation/ABI/testing/debugfs-kmemtrace | ||
126 | |||
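The quick usage guide and the FAQ in kmemtrace.txt both rely on reading total_overruns and expecting zero. A minimal sketch of that check as a shell function follows. The function name check_overruns and its return codes are assumptions made for this illustration and are not part of kmemtrace or its userspace tool; the default path is the one given in the documentation.

```shell
# check_overruns [FILE]
# Hypothetical helper: succeeds only when FILE contains the value 0.
# FILE defaults to the debugfs path named in kmemtrace.txt.
check_overruns() {
    file=${1:-/sys/kernel/debug/kmemtrace/total_overruns}

    if [ ! -r "$file" ]; then
        echo "cannot read $file (is debugfs mounted, and are you root?)" >&2
        return 2
    fi

    overruns=$(cat "$file")
    if [ "$overruns" = "0" ]; then
        echo "no overruns"
        return 0
    fi

    # Non-zero means trace data was dropped, so results are less accurate.
    echo "overruns: $overruns; consider a higher kmemtrace.subbufs=N"
    return 1
}
```

Running it with no argument after stopping kmemtraced reads the real debugfs file; passing a file name makes it easy to try the function on a saved copy.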