diff options
Diffstat (limited to 'Documentation/accounting')
-rw-r--r-- | Documentation/accounting/delay-accounting.txt | 110 | ||||
-rw-r--r-- | Documentation/accounting/getdelays.c | 396 | ||||
-rw-r--r-- | Documentation/accounting/taskstats.txt | 181 |
3 files changed, 687 insertions, 0 deletions
diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.txt new file mode 100644 index 000000000000..be215e58423b --- /dev/null +++ b/Documentation/accounting/delay-accounting.txt | |||
@@ -0,0 +1,110 @@ | |||
1 | Delay accounting | ||
2 | ---------------- | ||
3 | |||
4 | Tasks encounter delays in execution when they wait | ||
5 | for some kernel resource to become available e.g. a | ||
6 | runnable task may wait for a free CPU to run on. | ||
7 | |||
8 | The per-task delay accounting functionality measures | ||
9 | the delays experienced by a task while | ||
10 | |||
11 | a) waiting for a CPU (while being runnable) | ||
12 | b) completion of synchronous block I/O initiated by the task | ||
13 | c) swapping in pages | ||
14 | |||
15 | and makes these statistics available to userspace through | ||
16 | the taskstats interface. | ||
17 | |||
18 | Such delays provide feedback for setting a task's cpu priority, | ||
19 | io priority and rss limit values appropriately. Long delays for | ||
20 | important tasks could be a trigger for raising its corresponding priority. | ||
21 | |||
22 | The functionality, through its use of the taskstats interface, also provides | ||
23 | delay statistics aggregated for all tasks (or threads) belonging to a | ||
24 | thread group (corresponding to a traditional Unix process). This is a commonly | ||
25 | needed aggregation that is more efficiently done by the kernel. | ||
26 | |||
27 | Userspace utilities, particularly resource management applications, can also | ||
28 | aggregate delay statistics into arbitrary groups. To enable this, delay | ||
29 | statistics of a task are available both during its lifetime as well as on its | ||
30 | exit, ensuring continuous and complete monitoring can be done. | ||
31 | |||
32 | |||
33 | Interface | ||
34 | --------- | ||
35 | |||
36 | Delay accounting uses the taskstats interface which is described | ||
37 | in detail in a separate document in this directory. Taskstats returns a | ||
38 | generic data structure to userspace corresponding to per-pid and per-tgid | ||
39 | statistics. The delay accounting functionality populates specific fields of | ||
40 | this structure. See | ||
41 | include/linux/taskstats.h | ||
42 | for a description of the fields pertaining to delay accounting. | ||
43 | It will generally be in the form of counters returning the cumulative | ||
44 | delay seen for cpu, sync block I/O, swapin etc. | ||
45 | |||
46 | Taking the difference of two successive readings of a given | ||
47 | counter (say cpu_delay_total) for a task will give the delay | ||
48 | experienced by the task waiting for the corresponding resource | ||
49 | in that interval. | ||
50 | |||
51 | When a task exits, records containing the per-task statistics | ||
52 | are sent to userspace without requiring a command. If it is the last exiting | ||
53 | task of a thread group, the per-tgid statistics are also sent. More details | ||
54 | are given in the taskstats interface description. | ||
55 | |||
56 | The getdelays.c userspace utility in this directory allows simple commands to | ||
57 | be run and the corresponding delay statistics to be displayed. It also serves | ||
58 | as an example of using the taskstats interface. | ||
59 | |||
60 | Usage | ||
61 | ----- | ||
62 | |||
63 | Compile the kernel with | ||
64 | CONFIG_TASK_DELAY_ACCT=y | ||
65 | CONFIG_TASKSTATS=y | ||
66 | |||
67 | Enable the accounting at boot time by adding | ||
68 | the following to the kernel boot options | ||
69 | delayacct | ||
70 | |||
71 | and after the system has booted up, use a utility | ||
72 | similar to getdelays.c to access the delays | ||
73 | seen by a given task or a task group (tgid). | ||
74 | The utility also allows a given command to be | ||
75 | executed and the corresponding delays to be | ||
76 | seen. | ||
77 | |||
78 | General format of the getdelays command | ||
79 | |||
80 | getdelays [-t tgid] [-p pid] [-c cmd...] | ||
81 | |||
82 | |||
83 | Get delays, since system boot, for pid 10 | ||
84 | # ./getdelays -p 10 | ||
85 | (output similar to next case) | ||
86 | |||
87 | Get sum of delays, since system boot, for all pids with tgid 5 | ||
88 | # ./getdelays -t 5 | ||
89 | |||
90 | |||
91 | CPU count real total virtual total delay total | ||
92 | 7876 92005750 100000000 24001500 | ||
93 | IO count delay total | ||
94 | 0 0 | ||
95 | MEM count delay total | ||
96 | 0 0 | ||
97 | |||
98 | Get delays seen in executing a given simple command | ||
99 | # ./getdelays -c ls / | ||
100 | |||
101 | bin data1 data3 data5 dev home media opt root srv sys usr | ||
102 | boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var | ||
103 | |||
104 | |||
105 | CPU count real total virtual total delay total | ||
106 | 6 4000250 4000000 0 | ||
107 | IO count delay total | ||
108 | 0 0 | ||
109 | MEM count delay total | ||
110 | 0 0 | ||
diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c new file mode 100644 index 000000000000..795ca3911cc5 --- /dev/null +++ b/Documentation/accounting/getdelays.c | |||
@@ -0,0 +1,396 @@ | |||
1 | /* getdelays.c | ||
2 | * | ||
3 | * Utility to get per-pid and per-tgid delay accounting statistics | ||
4 | * Also illustrates usage of the taskstats interface | ||
5 | * | ||
6 | * Copyright (C) Shailabh Nagar, IBM Corp. 2005 | ||
7 | * Copyright (C) Balbir Singh, IBM Corp. 2006 | ||
8 | * Copyright (c) Jay Lan, SGI. 2006 | ||
9 | * | ||
10 | */ | ||
11 | |||
12 | #include <stdio.h> | ||
13 | #include <stdlib.h> | ||
14 | #include <errno.h> | ||
15 | #include <unistd.h> | ||
16 | #include <poll.h> | ||
17 | #include <string.h> | ||
18 | #include <fcntl.h> | ||
19 | #include <sys/types.h> | ||
20 | #include <sys/stat.h> | ||
21 | #include <sys/socket.h> | ||
22 | #include <sys/types.h> | ||
23 | #include <signal.h> | ||
24 | |||
25 | #include <linux/genetlink.h> | ||
26 | #include <linux/taskstats.h> | ||
27 | |||
28 | /* | ||
29 | * Generic macros for dealing with netlink sockets. Might be duplicated | ||
30 | * elsewhere. It is recommended that commercial grade applications use | ||
31 | * libnl or libnetlink and use the interfaces provided by the library | ||
32 | */ | ||
33 | #define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN)) | ||
34 | #define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN) | ||
35 | #define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN)) | ||
36 | #define NLA_PAYLOAD(len) (len - NLA_HDRLEN) | ||
37 | |||
38 | #define err(code, fmt, arg...) do { printf(fmt, ##arg); exit(code); } while (0) | ||
39 | int done = 0; | ||
40 | int rcvbufsz=0; | ||
41 | |||
42 | char name[100]; | ||
43 | int dbg=0, print_delays=0; | ||
44 | __u64 stime, utime; | ||
45 | #define PRINTF(fmt, arg...) { \ | ||
46 | if (dbg) { \ | ||
47 | printf(fmt, ##arg); \ | ||
48 | } \ | ||
49 | } | ||
50 | |||
51 | /* Maximum size of response requested or message sent */ | ||
52 | #define MAX_MSG_SIZE 256 | ||
53 | /* Maximum number of cpus expected to be specified in a cpumask */ | ||
54 | #define MAX_CPUS 32 | ||
55 | /* Maximum length of pathname to log file */ | ||
56 | #define MAX_FILENAME 256 | ||
57 | |||
58 | struct msgtemplate { | ||
59 | struct nlmsghdr n; | ||
60 | struct genlmsghdr g; | ||
61 | char buf[MAX_MSG_SIZE]; | ||
62 | }; | ||
63 | |||
64 | char cpumask[100+6*MAX_CPUS]; | ||
65 | |||
66 | /* | ||
67 | * Create a raw netlink socket and bind | ||
68 | */ | ||
69 | static int create_nl_socket(int protocol) | ||
70 | { | ||
71 | int fd; | ||
72 | struct sockaddr_nl local; | ||
73 | |||
74 | fd = socket(AF_NETLINK, SOCK_RAW, protocol); | ||
75 | if (fd < 0) | ||
76 | return -1; | ||
77 | |||
78 | if (rcvbufsz) | ||
79 | if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, | ||
80 | &rcvbufsz, sizeof(rcvbufsz)) < 0) { | ||
81 | printf("Unable to set socket rcv buf size to %d\n", | ||
82 | rcvbufsz); | ||
83 | return -1; | ||
84 | } | ||
85 | |||
86 | memset(&local, 0, sizeof(local)); | ||
87 | local.nl_family = AF_NETLINK; | ||
88 | |||
89 | if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) | ||
90 | goto error; | ||
91 | |||
92 | return fd; | ||
93 | error: | ||
94 | close(fd); | ||
95 | return -1; | ||
96 | } | ||
97 | |||
98 | |||
99 | int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid, | ||
100 | __u8 genl_cmd, __u16 nla_type, | ||
101 | void *nla_data, int nla_len) | ||
102 | { | ||
103 | struct nlattr *na; | ||
104 | struct sockaddr_nl nladdr; | ||
105 | int r, buflen; | ||
106 | char *buf; | ||
107 | |||
108 | struct msgtemplate msg; | ||
109 | |||
110 | msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); | ||
111 | msg.n.nlmsg_type = nlmsg_type; | ||
112 | msg.n.nlmsg_flags = NLM_F_REQUEST; | ||
113 | msg.n.nlmsg_seq = 0; | ||
114 | msg.n.nlmsg_pid = nlmsg_pid; | ||
115 | msg.g.cmd = genl_cmd; | ||
116 | msg.g.version = 0x1; | ||
117 | na = (struct nlattr *) GENLMSG_DATA(&msg); | ||
118 | na->nla_type = nla_type; | ||
119 | na->nla_len = nla_len + 1 + NLA_HDRLEN; | ||
120 | memcpy(NLA_DATA(na), nla_data, nla_len); | ||
121 | msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len); | ||
122 | |||
123 | buf = (char *) &msg; | ||
124 | buflen = msg.n.nlmsg_len ; | ||
125 | memset(&nladdr, 0, sizeof(nladdr)); | ||
126 | nladdr.nl_family = AF_NETLINK; | ||
127 | while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr *) &nladdr, | ||
128 | sizeof(nladdr))) < buflen) { | ||
129 | if (r > 0) { | ||
130 | buf += r; | ||
131 | buflen -= r; | ||
132 | } else if (errno != EAGAIN) | ||
133 | return -1; | ||
134 | } | ||
135 | return 0; | ||
136 | } | ||
137 | |||
138 | |||
139 | /* | ||
140 | * Probe the controller in genetlink to find the family id | ||
141 | * for the TASKSTATS family | ||
142 | */ | ||
143 | int get_family_id(int sd) | ||
144 | { | ||
145 | struct { | ||
146 | struct nlmsghdr n; | ||
147 | struct genlmsghdr g; | ||
148 | char buf[256]; | ||
149 | } ans; | ||
150 | |||
151 | int id, rc; | ||
152 | struct nlattr *na; | ||
153 | int rep_len; | ||
154 | |||
155 | strcpy(name, TASKSTATS_GENL_NAME); | ||
156 | rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY, | ||
157 | CTRL_ATTR_FAMILY_NAME, (void *)name, | ||
158 | strlen(TASKSTATS_GENL_NAME)+1); | ||
159 | |||
160 | rep_len = recv(sd, &ans, sizeof(ans), 0); | ||
161 | if (ans.n.nlmsg_type == NLMSG_ERROR || | ||
162 | (rep_len < 0) || !NLMSG_OK((&ans.n), rep_len)) | ||
163 | return 0; | ||
164 | |||
165 | na = (struct nlattr *) GENLMSG_DATA(&ans); | ||
166 | na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len)); | ||
167 | if (na->nla_type == CTRL_ATTR_FAMILY_ID) { | ||
168 | id = *(__u16 *) NLA_DATA(na); | ||
169 | } | ||
170 | return id; | ||
171 | } | ||
172 | |||
173 | void print_delayacct(struct taskstats *t) | ||
174 | { | ||
175 | printf("\n\nCPU %15s%15s%15s%15s\n" | ||
176 | " %15llu%15llu%15llu%15llu\n" | ||
177 | "IO %15s%15s\n" | ||
178 | " %15llu%15llu\n" | ||
179 | "MEM %15s%15s\n" | ||
180 | " %15llu%15llu\n\n", | ||
181 | "count", "real total", "virtual total", "delay total", | ||
182 | t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total, | ||
183 | t->cpu_delay_total, | ||
184 | "count", "delay total", | ||
185 | t->blkio_count, t->blkio_delay_total, | ||
186 | "count", "delay total", t->swapin_count, t->swapin_delay_total); | ||
187 | } | ||
188 | |||
189 | int main(int argc, char *argv[]) | ||
190 | { | ||
191 | int c, rc, rep_len, aggr_len, len2, cmd_type; | ||
192 | __u16 id; | ||
193 | __u32 mypid; | ||
194 | |||
195 | struct nlattr *na; | ||
196 | int nl_sd = -1; | ||
197 | int len = 0; | ||
198 | pid_t tid = 0; | ||
199 | pid_t rtid = 0; | ||
200 | |||
201 | int fd = 0; | ||
202 | int count = 0; | ||
203 | int write_file = 0; | ||
204 | int maskset = 0; | ||
205 | char logfile[128]; | ||
206 | int loop = 0; | ||
207 | |||
208 | struct msgtemplate msg; | ||
209 | |||
210 | while (1) { | ||
211 | c = getopt(argc, argv, "dw:r:m:t:p:v:l"); | ||
212 | if (c < 0) | ||
213 | break; | ||
214 | |||
215 | switch (c) { | ||
216 | case 'd': | ||
217 | printf("print delayacct stats ON\n"); | ||
218 | print_delays = 1; | ||
219 | break; | ||
220 | case 'w': | ||
221 | strncpy(logfile, optarg, MAX_FILENAME); | ||
222 | printf("write to file %s\n", logfile); | ||
223 | write_file = 1; | ||
224 | break; | ||
225 | case 'r': | ||
226 | rcvbufsz = atoi(optarg); | ||
227 | printf("receive buf size %d\n", rcvbufsz); | ||
228 | if (rcvbufsz < 0) | ||
229 | err(1, "Invalid rcv buf size\n"); | ||
230 | break; | ||
231 | case 'm': | ||
232 | strncpy(cpumask, optarg, sizeof(cpumask)); | ||
233 | maskset = 1; | ||
234 | printf("cpumask %s maskset %d\n", cpumask, maskset); | ||
235 | break; | ||
236 | case 't': | ||
237 | tid = atoi(optarg); | ||
238 | if (!tid) | ||
239 | err(1, "Invalid tgid\n"); | ||
240 | cmd_type = TASKSTATS_CMD_ATTR_TGID; | ||
241 | print_delays = 1; | ||
242 | break; | ||
243 | case 'p': | ||
244 | tid = atoi(optarg); | ||
245 | if (!tid) | ||
246 | err(1, "Invalid pid\n"); | ||
247 | cmd_type = TASKSTATS_CMD_ATTR_PID; | ||
248 | print_delays = 1; | ||
249 | break; | ||
250 | case 'v': | ||
251 | printf("debug on\n"); | ||
252 | dbg = 1; | ||
253 | break; | ||
254 | case 'l': | ||
255 | printf("listen forever\n"); | ||
256 | loop = 1; | ||
257 | break; | ||
258 | default: | ||
259 | printf("Unknown option %d\n", c); | ||
260 | exit(-1); | ||
261 | } | ||
262 | } | ||
263 | |||
264 | if (write_file) { | ||
265 | fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC, | ||
266 | S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); | ||
267 | if (fd == -1) { | ||
268 | perror("Cannot open output file\n"); | ||
269 | exit(1); | ||
270 | } | ||
271 | } | ||
272 | |||
273 | if ((nl_sd = create_nl_socket(NETLINK_GENERIC)) < 0) | ||
274 | err(1, "error creating Netlink socket\n"); | ||
275 | |||
276 | |||
277 | mypid = getpid(); | ||
278 | id = get_family_id(nl_sd); | ||
279 | if (!id) { | ||
280 | printf("Error getting family id, errno %d", errno); | ||
281 | goto err; | ||
282 | } | ||
283 | PRINTF("family id %d\n", id); | ||
284 | |||
285 | if (maskset) { | ||
286 | rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, | ||
287 | TASKSTATS_CMD_ATTR_REGISTER_CPUMASK, | ||
288 | &cpumask, sizeof(cpumask)); | ||
289 | PRINTF("Sent register cpumask, retval %d\n", rc); | ||
290 | if (rc < 0) { | ||
291 | printf("error sending register cpumask\n"); | ||
292 | goto err; | ||
293 | } | ||
294 | } | ||
295 | |||
296 | if (tid) { | ||
297 | rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, | ||
298 | cmd_type, &tid, sizeof(__u32)); | ||
299 | PRINTF("Sent pid/tgid, retval %d\n", rc); | ||
300 | if (rc < 0) { | ||
301 | printf("error sending tid/tgid cmd\n"); | ||
302 | goto done; | ||
303 | } | ||
304 | } | ||
305 | |||
306 | do { | ||
307 | int i; | ||
308 | |||
309 | rep_len = recv(nl_sd, &msg, sizeof(msg), 0); | ||
310 | PRINTF("received %d bytes\n", rep_len); | ||
311 | |||
312 | if (rep_len < 0) { | ||
313 | printf("nonfatal reply error: errno %d\n", errno); | ||
314 | continue; | ||
315 | } | ||
316 | if (msg.n.nlmsg_type == NLMSG_ERROR || | ||
317 | !NLMSG_OK((&msg.n), rep_len)) { | ||
318 | printf("fatal reply error, errno %d\n", errno); | ||
319 | goto done; | ||
320 | } | ||
321 | |||
322 | PRINTF("nlmsghdr size=%d, nlmsg_len=%d, rep_len=%d\n", | ||
323 | sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len); | ||
324 | |||
325 | |||
326 | rep_len = GENLMSG_PAYLOAD(&msg.n); | ||
327 | |||
328 | na = (struct nlattr *) GENLMSG_DATA(&msg); | ||
329 | len = 0; | ||
330 | i = 0; | ||
331 | while (len < rep_len) { | ||
332 | len += NLA_ALIGN(na->nla_len); | ||
333 | switch (na->nla_type) { | ||
334 | case TASKSTATS_TYPE_AGGR_TGID: | ||
335 | /* Fall through */ | ||
336 | case TASKSTATS_TYPE_AGGR_PID: | ||
337 | aggr_len = NLA_PAYLOAD(na->nla_len); | ||
338 | len2 = 0; | ||
339 | /* For nested attributes, na follows */ | ||
340 | na = (struct nlattr *) NLA_DATA(na); | ||
341 | done = 0; | ||
342 | while (len2 < aggr_len) { | ||
343 | switch (na->nla_type) { | ||
344 | case TASKSTATS_TYPE_PID: | ||
345 | rtid = *(int *) NLA_DATA(na); | ||
346 | if (print_delays) | ||
347 | printf("PID\t%d\n", rtid); | ||
348 | break; | ||
349 | case TASKSTATS_TYPE_TGID: | ||
350 | rtid = *(int *) NLA_DATA(na); | ||
351 | if (print_delays) | ||
352 | printf("TGID\t%d\n", rtid); | ||
353 | break; | ||
354 | case TASKSTATS_TYPE_STATS: | ||
355 | count++; | ||
356 | if (print_delays) | ||
357 | print_delayacct((struct taskstats *) NLA_DATA(na)); | ||
358 | if (fd) { | ||
359 | if (write(fd, NLA_DATA(na), na->nla_len) < 0) { | ||
360 | err(1,"write error\n"); | ||
361 | } | ||
362 | } | ||
363 | if (!loop) | ||
364 | goto done; | ||
365 | break; | ||
366 | default: | ||
367 | printf("Unknown nested nla_type %d\n", na->nla_type); | ||
368 | break; | ||
369 | } | ||
370 | len2 += NLA_ALIGN(na->nla_len); | ||
371 | na = (struct nlattr *) ((char *) na + len2); | ||
372 | } | ||
373 | break; | ||
374 | |||
375 | default: | ||
376 | printf("Unknown nla_type %d\n", na->nla_type); | ||
377 | break; | ||
378 | } | ||
379 | na = (struct nlattr *) (GENLMSG_DATA(&msg) + len); | ||
380 | } | ||
381 | } while (loop); | ||
382 | done: | ||
383 | if (maskset) { | ||
384 | rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET, | ||
385 | TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK, | ||
386 | &cpumask, sizeof(cpumask)); | ||
387 | printf("Sent deregister mask, retval %d\n", rc); | ||
388 | if (rc < 0) | ||
389 | err(rc, "error sending deregister cpumask\n"); | ||
390 | } | ||
391 | err: | ||
392 | close(nl_sd); | ||
393 | if (fd) | ||
394 | close(fd); | ||
395 | return 0; | ||
396 | } | ||
diff --git a/Documentation/accounting/taskstats.txt b/Documentation/accounting/taskstats.txt new file mode 100644 index 000000000000..92ebf29e9041 --- /dev/null +++ b/Documentation/accounting/taskstats.txt | |||
@@ -0,0 +1,181 @@ | |||
1 | Per-task statistics interface | ||
2 | ----------------------------- | ||
3 | |||
4 | |||
5 | Taskstats is a netlink-based interface for sending per-task and | ||
6 | per-process statistics from the kernel to userspace. | ||
7 | |||
8 | Taskstats was designed for the following benefits: | ||
9 | |||
10 | - efficiently provide statistics during lifetime of a task and on its exit | ||
11 | - unified interface for multiple accounting subsystems | ||
12 | - extensibility for use by future accounting patches | ||
13 | |||
14 | Terminology | ||
15 | ----------- | ||
16 | |||
17 | "pid", "tid" and "task" are used interchangeably and refer to the standard | ||
18 | Linux task defined by struct task_struct. per-pid stats are the same as | ||
19 | per-task stats. | ||
20 | |||
21 | "tgid", "process" and "thread group" are used interchangeably and refer to the | ||
22 | tasks that share an mm_struct i.e. the traditional Unix process. Despite the | ||
23 | use of tgid, there is no special treatment for the task that is thread group | ||
24 | leader - a process is deemed alive as long as it has any task belonging to it. | ||
25 | |||
26 | Usage | ||
27 | ----- | ||
28 | |||
29 | To get statistics during a task's lifetime, userspace opens a unicast netlink | ||
30 | socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid. | ||
31 | The response contains statistics for a task (if pid is specified) or the sum of | ||
32 | statistics for all tasks of the process (if tgid is specified). | ||
33 | |||
34 | To obtain statistics for tasks which are exiting, the userspace listener | ||
35 | sends a register command and specifies a cpumask. Whenever a task exits on | ||
36 | one of the cpus in the cpumask, its per-pid statistics are sent to the | ||
37 | registered listener. Using cpumasks allows the data received by one listener | ||
38 | to be limited and assists in flow control over the netlink interface and is | ||
39 | explained in more detail below. | ||
40 | |||
41 | If the exiting task is the last thread exiting its thread group, | ||
42 | an additional record containing the per-tgid stats is also sent to userspace. | ||
43 | The latter contains the sum of per-pid stats for all threads in the thread | ||
44 | group, both past and present. | ||
45 | |||
46 | getdelays.c is a simple utility demonstrating usage of the taskstats interface | ||
47 | for reporting delay accounting statistics. Users can register cpumasks, | ||
48 | send commands and process responses, listen for per-tid/tgid exit data, | ||
49 | write the data received to a file and do basic flow control by increasing | ||
50 | receive buffer sizes. | ||
51 | |||
52 | Interface | ||
53 | --------- | ||
54 | |||
55 | The user-kernel interface is encapsulated in include/linux/taskstats.h | ||
56 | |||
57 | To avoid this documentation becoming obsolete as the interface evolves, only | ||
58 | an outline of the current version is given. taskstats.h always overrides the | ||
59 | description here. | ||
60 | |||
61 | struct taskstats is the common accounting structure for both per-pid and | ||
62 | per-tgid data. It is versioned and can be extended by each accounting subsystem | ||
63 | that is added to the kernel. The fields and their semantics are defined in the | ||
64 | taskstats.h file. | ||
65 | |||
66 | The data exchanged between user and kernel space is a netlink message belonging | ||
67 | to the NETLINK_GENERIC family and using the netlink attributes interface. | ||
68 | The messages are in the format | ||
69 | |||
70 | +----------+- - -+-------------+-------------------+ | ||
71 | | nlmsghdr | Pad | genlmsghdr | taskstats payload | | ||
72 | +----------+- - -+-------------+-------------------+ | ||
73 | |||
74 | |||
75 | The taskstats payload is one of the following three kinds: | ||
76 | |||
77 | 1. Commands: Sent from user to kernel. Commands to get data on | ||
78 | a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID, | ||
79 | containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes | ||
80 | the task/process for which userspace wants statistics. | ||
81 | |||
82 | Commands to register/deregister interest in exit data from a set of cpus | ||
83 | consist of one attribute, of type | ||
84 | TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the | ||
85 | attribute payload. The cpumask is specified as an ascii string of | ||
86 | comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8 | ||
87 | the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest | ||
88 | in cpus before closing the listening socket, the kernel cleans up its interest | ||
89 | set over time. However, for the sake of efficiency, an explicit deregistration | ||
90 | is advisable. | ||
91 | |||
92 | 2. Response for a command: sent from the kernel in response to a userspace | ||
93 | command. The payload is a series of three attributes of type: | ||
94 | |||
95 | a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates | ||
96 | a pid/tgid will be followed by some stats. | ||
97 | |||
98 | b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats | ||
99 | is being returned. | ||
100 | |||
101 | c) TASKSTATS_TYPE_STATS: attribute with a struct taskstsats as payload. The | ||
102 | same structure is used for both per-pid and per-tgid stats. | ||
103 | |||
104 | 3. New message sent by kernel whenever a task exits. The payload consists of a | ||
105 | series of attributes of the following type: | ||
106 | |||
107 | a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats | ||
108 | b) TASKSTATS_TYPE_PID: contains exiting task's pid | ||
109 | c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats | ||
110 | d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats | ||
111 | e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs | ||
112 | f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process | ||
113 | |||
114 | |||
115 | per-tgid stats | ||
116 | -------------- | ||
117 | |||
118 | Taskstats provides per-process stats, in addition to per-task stats, since | ||
119 | resource management is often done at a process granularity and aggregating task | ||
120 | stats in userspace alone is inefficient and potentially inaccurate (due to lack | ||
121 | of atomicity). | ||
122 | |||
123 | However, maintaining per-process, in addition to per-task stats, within the | ||
124 | kernel has space and time overheads. To address this, the taskstats code | ||
125 | accumalates each exiting task's statistics into a process-wide data structure. | ||
126 | When the last task of a process exits, the process level data accumalated also | ||
127 | gets sent to userspace (along with the per-task data). | ||
128 | |||
129 | When a user queries to get per-tgid data, the sum of all other live threads in | ||
130 | the group is added up and added to the accumalated total for previously exited | ||
131 | threads of the same thread group. | ||
132 | |||
133 | Extending taskstats | ||
134 | ------------------- | ||
135 | |||
136 | There are two ways to extend the taskstats interface to export more | ||
137 | per-task/process stats as patches to collect them get added to the kernel | ||
138 | in future: | ||
139 | |||
140 | 1. Adding more fields to the end of the existing struct taskstats. Backward | ||
141 | compatibility is ensured by the version number within the | ||
142 | structure. Userspace will use only the fields of the struct that correspond | ||
143 | to the version its using. | ||
144 | |||
145 | 2. Defining separate statistic structs and using the netlink attributes | ||
146 | interface to return them. Since userspace processes each netlink attribute | ||
147 | independently, it can always ignore attributes whose type it does not | ||
148 | understand (because it is using an older version of the interface). | ||
149 | |||
150 | |||
151 | Choosing between 1. and 2. is a matter of trading off flexibility and | ||
152 | overhead. If only a few fields need to be added, then 1. is the preferable | ||
153 | path since the kernel and userspace don't need to incur the overhead of | ||
154 | processing new netlink attributes. But if the new fields expand the existing | ||
155 | struct too much, requiring disparate userspace accounting utilities to | ||
156 | unnecessarily receive large structures whose fields are of no interest, then | ||
157 | extending the attributes structure would be worthwhile. | ||
158 | |||
159 | Flow control for taskstats | ||
160 | -------------------------- | ||
161 | |||
162 | When the rate of task exits becomes large, a listener may not be able to keep | ||
163 | up with the kernel's rate of sending per-tid/tgid exit data leading to data | ||
164 | loss. This possibility gets compounded when the taskstats structure gets | ||
165 | extended and the number of cpus grows large. | ||
166 | |||
167 | To avoid losing statistics, userspace should do one or more of the following: | ||
168 | |||
169 | - increase the receive buffer sizes for the netlink sockets opened by | ||
170 | listeners to receive exit data. | ||
171 | |||
172 | - create more listeners and reduce the number of cpus being listened to by | ||
173 | each listener. In the extreme case, there could be one listener for each cpu. | ||
174 | Users may also consider setting the cpu affinity of the listener to the subset | ||
175 | of cpus to which it listens, especially if they are listening to just one cpu. | ||
176 | |||
177 | Despite these measures, if the userspace receives ENOBUFS error messages | ||
178 | indicated overflow of receive buffers, it should take measures to handle the | ||
179 | loss of data. | ||
180 | |||
181 | ---- | ||