diff options
Diffstat (limited to 'Documentation/accounting/taskstats.txt')
-rw-r--r-- | Documentation/accounting/taskstats.txt | 181 |
1 files changed, 181 insertions, 0 deletions
diff --git a/Documentation/accounting/taskstats.txt b/Documentation/accounting/taskstats.txt new file mode 100644 index 000000000000..92ebf29e9041 --- /dev/null +++ b/Documentation/accounting/taskstats.txt | |||
@@ -0,0 +1,181 @@ | |||
1 | Per-task statistics interface | ||
2 | ----------------------------- | ||
3 | |||
4 | |||
5 | Taskstats is a netlink-based interface for sending per-task and | ||
6 | per-process statistics from the kernel to userspace. | ||
7 | |||
8 | Taskstats was designed for the following benefits: | ||
9 | |||
10 | - efficiently provide statistics during lifetime of a task and on its exit | ||
11 | - unified interface for multiple accounting subsystems | ||
12 | - extensibility for use by future accounting patches | ||
13 | |||
14 | Terminology | ||
15 | ----------- | ||
16 | |||
17 | "pid", "tid" and "task" are used interchangeably and refer to the standard | ||
18 | Linux task defined by struct task_struct. per-pid stats are the same as | ||
19 | per-task stats. | ||
20 | |||
21 | "tgid", "process" and "thread group" are used interchangeably and refer to the | ||
22 | tasks that share an mm_struct i.e. the traditional Unix process. Despite the | ||
23 | use of tgid, there is no special treatment for the task that is thread group | ||
24 | leader - a process is deemed alive as long as it has any task belonging to it. | ||
25 | |||
26 | Usage | ||
27 | ----- | ||
28 | |||
29 | To get statistics during a task's lifetime, userspace opens a unicast netlink | ||
30 | socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid. | ||
31 | The response contains statistics for a task (if pid is specified) or the sum of | ||
32 | statistics for all tasks of the process (if tgid is specified). | ||
33 | |||
34 | To obtain statistics for tasks which are exiting, the userspace listener | ||
35 | sends a register command and specifies a cpumask. Whenever a task exits on | ||
36 | one of the cpus in the cpumask, its per-pid statistics are sent to the | ||
37 | registered listener. Using cpumasks allows the data received by one listener | ||
38 | to be limited and assists in flow control over the netlink interface and is | ||
39 | explained in more detail below. | ||
40 | |||
41 | If the exiting task is the last thread exiting its thread group, | ||
42 | an additional record containing the per-tgid stats is also sent to userspace. | ||
43 | The latter contains the sum of per-pid stats for all threads in the thread | ||
44 | group, both past and present. | ||
45 | |||
46 | getdelays.c is a simple utility demonstrating usage of the taskstats interface | ||
47 | for reporting delay accounting statistics. Users can register cpumasks, | ||
48 | send commands and process responses, listen for per-tid/tgid exit data, | ||
49 | write the data received to a file and do basic flow control by increasing | ||
50 | receive buffer sizes. | ||
51 | |||
52 | Interface | ||
53 | --------- | ||
54 | |||
55 | The user-kernel interface is encapsulated in include/linux/taskstats.h | ||
56 | |||
57 | To avoid this documentation becoming obsolete as the interface evolves, only | ||
58 | an outline of the current version is given. taskstats.h always overrides the | ||
59 | description here. | ||
60 | |||
61 | struct taskstats is the common accounting structure for both per-pid and | ||
62 | per-tgid data. It is versioned and can be extended by each accounting subsystem | ||
63 | that is added to the kernel. The fields and their semantics are defined in the | ||
64 | taskstats.h file. | ||
65 | |||
66 | The data exchanged between user and kernel space is a netlink message belonging | ||
67 | to the NETLINK_GENERIC family and using the netlink attributes interface. | ||
68 | The messages are in the format | ||
69 | |||
70 | +----------+- - -+-------------+-------------------+ | ||
71 | | nlmsghdr | Pad | genlmsghdr | taskstats payload | | ||
72 | +----------+- - -+-------------+-------------------+ | ||
73 | |||
74 | |||
75 | The taskstats payload is one of the following three kinds: | ||
76 | |||
77 | 1. Commands: Sent from user to kernel. Commands to get data on | ||
78 | a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID, | ||
79 | containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes | ||
80 | the task/process for which userspace wants statistics. | ||
81 | |||
82 | Commands to register/deregister interest in exit data from a set of cpus | ||
83 | consist of one attribute, of type | ||
84 | TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the | ||
85 | attribute payload. The cpumask is specified as an ascii string of | ||
86 | comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8 | ||
87 | the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest | ||
88 | in cpus before closing the listening socket, the kernel cleans up its interest | ||
89 | set over time. However, for the sake of efficiency, an explicit deregistration | ||
90 | is advisable. | ||
91 | |||
92 | 2. Response for a command: sent from the kernel in response to a userspace | ||
93 | command. The payload is a series of three attributes of type: | ||
94 | |||
95 | a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates | ||
96 | a pid/tgid will be followed by some stats. | ||
97 | |||
98 | b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats | ||
99 | is being returned. | ||
100 | |||
101 | c) TASKSTATS_TYPE_STATS: attribute with a struct taskstsats as payload. The | ||
102 | same structure is used for both per-pid and per-tgid stats. | ||
103 | |||
104 | 3. New message sent by kernel whenever a task exits. The payload consists of a | ||
105 | series of attributes of the following type: | ||
106 | |||
107 | a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats | ||
108 | b) TASKSTATS_TYPE_PID: contains exiting task's pid | ||
109 | c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats | ||
110 | d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats | ||
111 | e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs | ||
112 | f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process | ||
113 | |||
114 | |||
115 | per-tgid stats | ||
116 | -------------- | ||
117 | |||
118 | Taskstats provides per-process stats, in addition to per-task stats, since | ||
119 | resource management is often done at a process granularity and aggregating task | ||
120 | stats in userspace alone is inefficient and potentially inaccurate (due to lack | ||
121 | of atomicity). | ||
122 | |||
123 | However, maintaining per-process, in addition to per-task stats, within the | ||
124 | kernel has space and time overheads. To address this, the taskstats code | ||
125 | accumalates each exiting task's statistics into a process-wide data structure. | ||
126 | When the last task of a process exits, the process level data accumalated also | ||
127 | gets sent to userspace (along with the per-task data). | ||
128 | |||
129 | When a user queries to get per-tgid data, the sum of all other live threads in | ||
130 | the group is added up and added to the accumalated total for previously exited | ||
131 | threads of the same thread group. | ||
132 | |||
133 | Extending taskstats | ||
134 | ------------------- | ||
135 | |||
136 | There are two ways to extend the taskstats interface to export more | ||
137 | per-task/process stats as patches to collect them get added to the kernel | ||
138 | in future: | ||
139 | |||
140 | 1. Adding more fields to the end of the existing struct taskstats. Backward | ||
141 | compatibility is ensured by the version number within the | ||
142 | structure. Userspace will use only the fields of the struct that correspond | ||
143 | to the version its using. | ||
144 | |||
145 | 2. Defining separate statistic structs and using the netlink attributes | ||
146 | interface to return them. Since userspace processes each netlink attribute | ||
147 | independently, it can always ignore attributes whose type it does not | ||
148 | understand (because it is using an older version of the interface). | ||
149 | |||
150 | |||
151 | Choosing between 1. and 2. is a matter of trading off flexibility and | ||
152 | overhead. If only a few fields need to be added, then 1. is the preferable | ||
153 | path since the kernel and userspace don't need to incur the overhead of | ||
154 | processing new netlink attributes. But if the new fields expand the existing | ||
155 | struct too much, requiring disparate userspace accounting utilities to | ||
156 | unnecessarily receive large structures whose fields are of no interest, then | ||
157 | extending the attributes structure would be worthwhile. | ||
158 | |||
159 | Flow control for taskstats | ||
160 | -------------------------- | ||
161 | |||
162 | When the rate of task exits becomes large, a listener may not be able to keep | ||
163 | up with the kernel's rate of sending per-tid/tgid exit data leading to data | ||
164 | loss. This possibility gets compounded when the taskstats structure gets | ||
165 | extended and the number of cpus grows large. | ||
166 | |||
167 | To avoid losing statistics, userspace should do one or more of the following: | ||
168 | |||
169 | - increase the receive buffer sizes for the netlink sockets opened by | ||
170 | listeners to receive exit data. | ||
171 | |||
172 | - create more listeners and reduce the number of cpus being listened to by | ||
173 | each listener. In the extreme case, there could be one listener for each cpu. | ||
174 | Users may also consider setting the cpu affinity of the listener to the subset | ||
175 | of cpus to which it listens, especially if they are listening to just one cpu. | ||
176 | |||
177 | Despite these measures, if the userspace receives ENOBUFS error messages | ||
178 | indicated overflow of receive buffers, it should take measures to handle the | ||
179 | loss of data. | ||
180 | |||
181 | ---- | ||