Diffstat (limited to 'Documentation/ia64/fsys.txt')
-rw-r--r-- | Documentation/ia64/fsys.txt | 286 |
1 files changed, 286 insertions, 0 deletions
diff --git a/Documentation/ia64/fsys.txt b/Documentation/ia64/fsys.txt
new file mode 100644
index 000000000000..28da181f9966
--- /dev/null
+++ b/Documentation/ia64/fsys.txt
@@ -0,0 +1,286 @@
1 | -*-Mode: outline-*- | ||
2 | |||
3 | Light-weight System Calls for IA-64 | ||
4 | ----------------------------------- | ||
5 | |||
6 | Started: 13-Jan-2003 | ||
7 | Last update: 27-Sep-2003 | ||
8 | |||
9 | David Mosberger-Tang | ||
10 | <davidm@hpl.hp.com> | ||
11 | |||
12 | Using the "epc" instruction effectively introduces a new mode of | ||
13 | execution to the ia64 linux kernel. We call this mode the | ||
14 | "fsys-mode". To recap, the normal states of execution are: | ||
15 | |||
16 | - kernel mode: | ||
17 | Both the register stack and the memory stack have been | ||
18 | switched over to kernel memory. The user-level state is saved | ||
19 | in a pt_regs structure at the top of the kernel memory stack. | ||
20 | |||
21 | - user mode: | ||
22 | Both the register stack and the memory stack are in | ||
23 | user memory. The user-level state is contained in the | ||
24 | CPU registers. | ||
25 | |||
26 | - bank 0 interruption-handling mode: | ||
27 | This is the non-interruptible state which all | ||
28 | interruption-handlers start execution in. The user-level | ||
29 | state remains in the CPU registers and some kernel state may | ||
30 | be stored in bank 0 of registers r16-r31. | ||
31 | |||
32 | In contrast, fsys-mode has the following special properties: | ||
33 | |||
34 | - execution is at privilege level 0 (most-privileged) | ||
35 | |||
36 | - CPU registers may contain a mixture of user-level and kernel-level | ||
37 | state (it is the responsibility of the kernel to ensure that no | ||
38 | security-sensitive kernel-level state is leaked back to | ||
39 | user-level) | ||
40 | |||
41 | - execution is interruptible and preemptible (an fsys-mode handler | ||
42 | can disable interrupts and avoid all other interruption-sources | ||
43 | to avoid preemption) | ||
44 | |||
45 | - neither the memory-stack nor the register-stack can be trusted while | ||
46 | in fsys-mode (they point to the user-level stacks, which may | ||
47 | be invalid, or completely bogus addresses) | ||
48 | |||
49 | In summary, fsys-mode is much more similar to running in user-mode | ||
50 | than it is to running in kernel-mode. Of course, given that the | ||
51 | privilege level is at level 0, this means that fsys-mode requires some | ||
52 | care (see below). | ||
53 | |||
54 | |||
55 | * How to tell fsys-mode | ||
56 | |||
57 | Linux operates in fsys-mode when (a) the privilege level is 0 (most | ||
58 | privileged) and (b) the stacks have NOT been switched to kernel memory | ||
59 | yet. For convenience, the header file <asm-ia64/ptrace.h> provides | ||
60 | three macros: | ||
61 | |||
62 | user_mode(regs) | ||
63 | user_stack(task,regs) | ||
64 | fsys_mode(task,regs) | ||
65 | |||
66 | The "regs" argument is a pointer to a pt_regs structure. The "task" | ||
67 | argument is a pointer to the task structure to which the "regs" | ||
68 | pointer belongs. user_mode() returns TRUE if the CPU state pointed | ||
69 | to by "regs" was executing in user mode (privilege level 3). | ||
70 | user_stack() returns TRUE if the state pointed to by "regs" was | ||
71 | executing on the user-level stack(s). Finally, fsys_mode() returns | ||
72 | TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. | ||
73 | The fsys_mode() macro is equivalent to the expression: | ||
74 | |||
75 | !user_mode(regs) && user_stack(task,regs) | ||
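The relationship between the three macros can be illustrated with a small
stand-alone C model. This is only a sketch: "struct model_regs" and the
model_*() helpers are hypothetical names invented for illustration, and
the real macros in <asm-ia64/ptrace.h> work on pt_regs and the task
structure rather than on a simplified record like this one.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-in for the saved CPU state.  The real
 * definitions live in <asm-ia64/ptrace.h>; nothing here mirrors the
 * actual pt_regs layout.
 */
struct model_regs {
	unsigned int cpl;	/* privilege level of the saved state, 0..3 */
	bool on_user_stacks;	/* stacks NOT yet switched to kernel memory */
};

/* Models user_mode(regs): the state was executing at privilege level 3. */
static bool model_user_mode(const struct model_regs *regs)
{
	return regs->cpl == 3;
}

/* Models user_stack(task,regs): still running on the user-level stack(s). */
static bool model_user_stack(const struct model_regs *regs)
{
	return regs->on_user_stacks;
}

/* Models fsys_mode(task,regs): privileged, but still on the user stacks. */
static bool model_fsys_mode(const struct model_regs *regs)
{
	return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, a state with cpl 0 and on_user_stacks set is the only
combination reported as fsys-mode. Plain user mode (cpl 3) and full
kernel mode (cpl 0 with the stacks switched) both yield false.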
76 | |||
77 | * How to write an fsyscall handler | ||
78 | |||
79 | The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers | ||
80 | (fsyscall_table). This table contains one entry for each system call. | ||
81 | By default, a system call is handled by fsys_fallback_syscall(). This | ||
82 | routine takes care of entering (full) kernel mode and calling the | ||
83 | normal Linux system call handler. For performance-critical system | ||
84 | calls, it is possible to write a hand-tuned fsyscall-handler. For | ||
85 | example, fsys.S contains fsys_getpid(), which is a hand-tuned version | ||
86 | of the getpid() system call. | ||
87 | |||
88 | The entry and exit-state of an fsyscall handler is as follows: | ||
89 | |||
90 | ** Machine state on entry to fsyscall handler: | ||
91 | |||
92 | - r10 = 0 | ||
93 | - r11 = saved ar.pfs (a user-level value) | ||
94 | - r15 = system call number | ||
95 | - r16 = "current" task pointer (in normal kernel-mode, this is in r13) | ||
96 | - r32-r39 = system call arguments | ||
97 | - b6 = return address (a user-level value) | ||
98 | - ar.pfs = previous frame-state (a user-level value) | ||
99 | - PSR.be = cleared to zero (i.e., little-endian byte order is in effect) | ||
100 | - all other registers may contain values passed in from user-mode | ||
101 | |||
102 | ** Required machine state on exit from fsyscall handler: | ||
103 | |||
104 | - r11 = saved ar.pfs (as passed into the fsyscall handler) | ||
105 | - r15 = system call number (as passed into the fsyscall handler) | ||
106 | - r32-r39 = system call arguments (as passed into the fsyscall handler) | ||
107 | - b6 = return address (as passed into the fsyscall handler) | ||
108 | - ar.pfs = previous frame-state (as passed into the fsyscall handler) | ||
109 | |||
110 | Fsyscall handlers can execute with very little overhead, but with that | ||
111 | speed comes a set of restrictions: | ||
112 | |||
113 | o Fsyscall-handlers MUST check for any pending work in the flags | ||
114 | member of the thread-info structure and if any of the | ||
115 | TIF_ALLWORK_MASK flags are set, the handler needs to fall back on | ||
116 | doing a full system call (by calling fsys_fallback_syscall). | ||
117 | |||
118 | o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, | ||
119 | r15, b6, and ar.pfs) because they will be needed in case of a | ||
120 | system call restart. Of course, all "preserved" registers also | ||
121 | must be preserved, in accordance with the normal calling conventions. | ||
122 | |||
123 | o Fsyscall-handlers MUST check argument registers for containing a | ||
124 | NaT value before using them in any way that could trigger a | ||
125 | NaT-consumption fault. If a system call argument is found to | ||
126 | contain a NaT value, an fsyscall-handler may return immediately | ||
127 | with r8=EINVAL, r10=-1. | ||
128 | |||
129 | o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform | ||
130 | any other operation that would trigger mandatory RSE | ||
131 | (register-stack engine) traffic. | ||
132 | |||
133 | o Fsyscall-handlers MUST NOT write to any stacked registers because | ||
134 | it is not safe to assume that user-level called a handler with the | ||
135 | proper number of arguments. | ||
136 | |||
137 | o Fsyscall-handlers need to be careful when accessing per-CPU variables: | ||
138 | unless proper safe-guards are taken (e.g., interruptions are avoided), | ||
139 | execution may be pre-empted and resumed on another CPU at any given | ||
140 | time. | ||
141 | |||
142 | o Fsyscall-handlers must be careful not to leak sensitive kernel | ||
143 | information back to user-level. In particular, before returning to | ||
144 | user-level, care needs to be taken to clear any scratch registers | ||
145 | that could contain sensitive information (note that the current | ||
146 | task pointer is not considered sensitive: it's already exposed | ||
147 | through ar.k6). | ||
148 | |||
149 | o Fsyscall-handlers MUST NOT access user-memory without first | ||
150 | validating access-permission (this can be done typically via | ||
151 | probe.r.fault and/or probe.w.fault) and without guarding against | ||
152 | memory access exceptions (this can be done with the EX() macros | ||
153 | defined by asmmacro.h). | ||
154 | |||
155 | The above restrictions may seem draconian, but remember that it's | ||
156 | possible to trade off some of the restrictions by paying a slightly | ||
157 | higher overhead. For example, if an fsyscall-handler could benefit | ||
158 | from the shadow register bank, it could temporarily disable PSR.i and | ||
159 | PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as | ||
160 | needed. In other words, following the above rules yields extremely | ||
161 | fast system call execution (while fully preserving system call | ||
162 | semantics), but there is also a lot of flexibility in handling more | ||
163 | complicated cases. | ||
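The first checks a handler has to make (pending work, then NaT arguments)
can be sketched as a decision function in C. This is a model only: real
handlers are written in ia64 assembly in fsys.S. The names
MODEL_TIF_ALLWORK_MASK, struct fsys_entry and fsys_prologue() are
hypothetical, the mask value is made up, and the order of the two checks
is an assumption, since the rules above do not prescribe one.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value; the real TIF_ALLWORK_MASK is defined by the kernel. */
#define MODEL_TIF_ALLWORK_MASK 0x1fu

enum fsys_action {
	FSYS_FAST_PATH,		/* safe to run the hand-tuned handler body */
	FSYS_FALLBACK,		/* branch to fsys_fallback_syscall instead */
	FSYS_RETURN_EINVAL	/* return at once with r8=EINVAL, r10=-1 */
};

struct fsys_entry {
	uint32_t thread_flags;	/* models the thread-info flags member */
	bool arg_is_nat;	/* models a NaT bit on an argument the handler reads */
};

struct fsys_result {
	enum fsys_action action;
	long r8;		/* error code, meaningful only for FSYS_RETURN_EINVAL */
	long r10;		/* -1 signals an error return to user-level */
};

static struct fsys_result fsys_prologue(const struct fsys_entry *e)
{
	struct fsys_result res = { FSYS_FAST_PATH, 0, 0 };

	/* Any pending work forces a full system call. */
	if (e->thread_flags & MODEL_TIF_ALLWORK_MASK) {
		res.action = FSYS_FALLBACK;
		return res;
	}

	/* Never consume a NaT argument: fail with EINVAL instead. */
	if (e->arg_is_nat) {
		res.action = FSYS_RETURN_EINVAL;
		res.r8 = EINVAL;
		res.r10 = -1;
		return res;
	}

	return res;
}
```

The model deliberately leaves out the remaining rules (no "alloc", no
writes to stacked registers, clearing scratch registers, probing user
memory), which constrain how the handler body is written rather than
whether it may run.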
164 | |||
165 | * Signal handling | ||
166 | |||
167 | The delivery of (asynchronous) signals must be delayed until fsys-mode | ||
168 | is exited. This is accomplished with the help of the lower-privilege | ||
169 | transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() | ||
170 | checks whether the interrupted task was in fsys-mode and, if so, sets | ||
171 | PSR.lp and returns immediately. When fsys-mode is exited via the | ||
172 | "br.ret" instruction that lowers the privilege level, a trap will | ||
173 | occur. The trap handler clears PSR.lp again and returns immediately. | ||
174 | The kernel exit path then checks for and delivers any pending signals. | ||
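The deferral sequence described above can be modelled as a tiny state
machine in C. This is a sketch for illustration only: "struct model_task"
and both functions are hypothetical names, and the code merely mimics the
ordering of events, not the real logic in arch/ia64/kernel/process.c.

```c
#include <stdbool.h>

/* Hypothetical model of the pieces of task state involved. */
struct model_task {
	bool in_fsys_mode;	/* interrupted while in fsys-mode */
	bool psr_lp;		/* models PSR.lp (lower-privilege transfer trap) */
	bool signal_pending;	/* an asynchronous signal awaits delivery */
	int signals_delivered;	/* number of signals delivered so far */
};

/* Models the role of do_notify_resume_user(). */
static void model_notify_resume(struct model_task *t)
{
	if (t->in_fsys_mode) {
		/* Too early: arm the trap and return immediately. */
		t->psr_lp = true;
		return;
	}
	if (t->signal_pending) {
		t->signal_pending = false;
		t->signals_delivered++;
	}
}

/* Models the "br.ret" that lowers the privilege level on fsys exit. */
static void model_fsys_return(struct model_task *t)
{
	t->in_fsys_mode = false;
	if (t->psr_lp) {
		/* The trap handler clears PSR.lp again ... */
		t->psr_lp = false;
		/* ... and the kernel exit path delivers pending signals. */
		model_notify_resume(t);
	}
}
```

Calling model_notify_resume() on a task that is in fsys-mode with a
signal pending only sets psr_lp. The signal is counted as delivered once
model_fsys_return() has run.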
175 | |||
176 | * PSR Handling | ||
177 | |||
178 | The "epc" instruction doesn't change the contents of PSR at all. This | ||
179 | is in contrast to a regular interruption, which clears almost all | ||
180 | bits. Because of that, some care needs to be taken to ensure things | ||
181 | work as expected. The following discussion describes how each PSR bit | ||
182 | is handled. | ||
183 | |||
184 | PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used | ||
185 | to ensure the CPU is in little-endian mode before the first | ||
186 | load/store instruction is executed. PSR.be is normally NOT | ||
187 | restored upon return from an fsys-mode handler. In other | ||
188 | words, user-level code must not rely on PSR.be being preserved | ||
189 | across a system call. | ||
190 | PSR.up Unchanged. | ||
191 | PSR.ac Unchanged. | ||
192 | PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers! | ||
193 | PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers! | ||
194 | PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed. | ||
195 | PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed. | ||
196 | PSR.pk Unchanged. | ||
197 | PSR.dt Unchanged. | ||
198 | PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers! | ||
199 | PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers! | ||
200 | PSR.sp Unchanged. | ||
201 | PSR.pp Unchanged. | ||
202 | PSR.di Unchanged. | ||
203 | PSR.si Unchanged. | ||
204 | PSR.db Unchanged. The kernel prevents user-level from setting a hardware | ||
205 | breakpoint that triggers at any privilege level other than 3 (user-mode). | ||
206 | PSR.lp Unchanged. | ||
207 | PSR.tb Lazy redirect. If a taken-branch trap occurs while in | ||
208 | fsys-mode, the trap-handler modifies the saved machine state | ||
209 | such that execution resumes in the gate page at | ||
210 | syscall_via_break(), with privilege level 3. Note: the | ||
211 | taken-branch trap would occur on the branch invoking the | ||
212 | fsyscall-handler, at which point, by definition, a syscall | ||
213 | restart is still safe. If the system call number is invalid, | ||
214 | the fsys-mode handler will return directly to user-level. This | ||
215 | return will trigger a taken-branch trap, but since the trap is | ||
216 | taken _after_ restoring the privilege level, the CPU has already | ||
217 | left fsys-mode, so no special treatment is needed. | ||
218 | PSR.rt Unchanged. | ||
219 | PSR.cpl Cleared to 0. | ||
220 | PSR.is Unchanged (guaranteed to be 0 on entry to the gate page). | ||
221 | PSR.mc Unchanged. | ||
222 | PSR.it Unchanged (guaranteed to be 1). | ||
223 | PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit. | ||
224 | PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit. | ||
225 | PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit. | ||
226 | PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to | ||
227 | be taken. The trap handler then modifies the saved machine | ||
228 | state such that execution resumes in the gate page at | ||
229 | syscall_via_break(), with privilege level 3. | ||
230 | PSR.ri Unchanged. | ||
231 | PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode | ||
232 | handler performed a speculative load that gets NaTted. If so, this | ||
233 | would be the normal & expected behavior, so no special treatment is | ||
234 | needed. | ||
235 | PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed. | ||
236 | Doing so requires clearing PSR.i and PSR.ic as well. | ||
237 | PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit. | ||
238 | |||
239 | * Using fast system calls | ||
240 | |||
241 | To use fast system calls, userspace applications simply need to call | ||
242 | __kernel_syscall_via_epc(). For example: | ||
243 | |||
244 | -- example fgettimeofday() call -- | ||
245 | -- fgettimeofday.S -- | ||
246 | |||
247 | #include <asm/asmmacro.h> | ||
248 | |||
249 | GLOBAL_ENTRY(fgettimeofday) | ||
250 | .prologue | ||
251 | .save ar.pfs, r11 | ||
252 | mov r11 = ar.pfs | ||
253 | .body | ||
254 | |||
255 | movl r2 = 0xa000000000020660;; // gate address (64-bit immediate needs movl) | ||
256 | // found by inspection of System.map for the | ||
257 | // __kernel_syscall_via_epc() function. See | ||
258 | // below for how to do this for real. | ||
259 | |||
260 | mov b7 = r2 | ||
261 | mov r15 = 1087 // gettimeofday syscall | ||
262 | ;; | ||
263 | br.call.sptk.many b6 = b7 | ||
264 | ;; | ||
265 | |||
266 | .restore sp | ||
267 | |||
268 | mov ar.pfs = r11 | ||
269 | br.ret.sptk.many rp;; // return to caller | ||
270 | END(fgettimeofday) | ||
271 | |||
272 | -- end fgettimeofday.S -- | ||
273 | |||
274 | In reality, the gate address is obtained from two extra values that | ||
275 | are passed via the ELF auxiliary vector (include/asm-ia64/elf.h): | ||
276 | |||
277 | o AT_SYSINFO : is the address of __kernel_syscall_via_epc() | ||
278 | o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO | ||
279 | |||
280 | The ELF DSO is a pre-linked library that is mapped in by the kernel at | ||
281 | the gate page. It is a proper ELF shared object so, with a dynamic | ||
282 | loader that recognises the library, you should be able to make calls to | ||
283 | the exported functions within it as with any other shared library. | ||
284 | AT_SYSINFO points into the kernel DSO at the | ||
285 | __kernel_syscall_via_epc() function for historical reasons (it was | ||
286 | used before the kernel DSO) and as a convenience. | ||
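As a rough sketch of how userspace might read the two auxiliary-vector
values, the C fragment below uses getauxval() from <sys/auxv.h>. Note the
assumptions: getauxval() is a C library facility that is newer than this
document, and "struct gate_info" and read_gate_info() are hypothetical
names. A dynamic loader of this era would instead walk the auxiliary
vector it receives at process startup.

```c
#include <elf.h>	/* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>	/* getauxval() */

/* Hypothetical container for the two values described above. */
struct gate_info {
	unsigned long sysinfo;		/* address of __kernel_syscall_via_epc() */
	unsigned long sysinfo_ehdr;	/* address of the kernel gate ELF DSO */
};

/*
 * Look up both entries.  getauxval() returns 0 for an entry that is
 * absent from the auxiliary vector, so callers must check for 0 before
 * using either address.
 */
static struct gate_info read_gate_info(void)
{
	struct gate_info g;

	g.sysinfo = getauxval(AT_SYSINFO);
	g.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
	return g;
}
```

On ia64, the sysinfo value would replace the hard-coded gate address in
the fgettimeofday example. On other architectures, one or both entries
may simply be absent and read back as 0.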