author | Andy Lutomirski <luto@MIT.EDU> | 2011-06-05 13:50:18 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2011-06-05 15:30:32 -0400 |
commit | 8b4777a4b50cb0c84c1152eac85d24415fb6ff7d (patch) | |
tree | 6cbec89eb4afbbf4873efb4b0c8afcb6375dee73 /Documentation/x86 | |
parent | 6879eb2deed7171a81b2f904c9ad14b9648689a7 (diff) |
x86-64: Document some of entry_64.S
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/fc134867cc550977cc996866129e11a16ba0f9ea.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'Documentation/x86')
-rw-r--r-- | Documentation/x86/entry_64.txt | 98 |
1 files changed, 98 insertions, 0 deletions
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 000000000000..7869f14d055c
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,98 @@
1 | This file documents some of the kernel entries in | ||
2 | arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from | ||
3 | an email from Ingo Molnar: | ||
4 | |||
5 | http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> | ||
6 | |||
7 | The x86 architecture has quite a few different ways to jump into | ||
8 | kernel code. Most of these entry points are registered in | ||
9 | arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S | ||
10 | and arch/x86/ia32/ia32entry.S. | ||
11 | |||
12 | The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h. | ||
13 | |||
14 | Some of these entries are: | ||
15 | |||
16 | - system_call: syscall instruction from 64-bit code. | ||
17 | |||
18 | - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall | ||
19 | either way. | ||
20 | |||
21 | - ia32_cstar_target, ia32_sysenter_target: syscall and sysenter | ||
22 | from 32-bit code. | ||
23 | |||
24 | - interrupt: An array of entries. Every IDT vector that doesn't | ||
25 | explicitly point somewhere else gets set to the corresponding | ||
26 | value in interrupt. These point to a whole array of | ||
27 | magically-generated functions that make their way to do_IRQ with | ||
28 | the interrupt number as a parameter. | ||
29 | |||
30 | - emulate_vsyscall: int 0xcc, a special non-ABI entry used by | ||
31 | vsyscall emulation. | ||
32 | |||
33 | - APIC interrupts: Various special-purpose interrupts for things | ||
34 | like TLB shootdown. | ||
35 | |||
36 | - Architecturally-defined exceptions like divide_error. | ||
37 | |||
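The "interrupt" array in the list above can be pictured as a table of tiny stubs, each of which hands its own vector number to one common handler. What follows is only a user-space sketch of that shape in C. The names do_irq_demo, DEFINE_STUB and interrupt_demo are hypothetical; the real stubs are generated in assembly in entry_64.S and are not written like this.

```c
#include <stddef.h>

/* Hypothetical stand-in for do_IRQ: records which vector was delivered. */
static unsigned int last_vector_seen;

static void do_irq_demo(unsigned int vector)
{
	last_vector_seen = vector;
}

/* Generate one stub per vector; each stub knows only its own number. */
#define DEFINE_STUB(n) \
	static void stub_##n(void) { do_irq_demo(n); }

DEFINE_STUB(0)
DEFINE_STUB(1)
DEFINE_STUB(2)
DEFINE_STUB(3)

/* The table that the (imaginary) IDT entries would point into. */
static void (*const interrupt_demo[])(void) = {
	stub_0, stub_1, stub_2, stub_3,
};

static const size_t nr_demo_vectors =
	sizeof(interrupt_demo) / sizeof(interrupt_demo[0]);
```

Calling interrupt_demo[n]() reaches the common handler with n as its argument, which is the only property of the real array this sketch tries to capture.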
38 | There are a few complexities here. The different x86-64 entries | ||
39 | have different calling conventions. The syscall and sysenter | ||
40 | instructions have their own peculiar calling conventions. Some of | ||
41 | the IDT entries push an error code onto the stack; others don't. | ||
42 | IDT entries using the IST alternative stack mechanism need their own | ||
43 | magic to get the stack frames right. (You can find some | ||
44 | documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, | ||
45 | Volume 3, Chapter 6.) | ||
46 | |||
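To make the error-code point concrete, here is a small C sketch of which exception vectors have the CPU push an error code. The function names are hypothetical, and the list covers only the long-standing architectural exceptions; newer processors define additional ones, so the manuals cited above remain the authority.

```c
#include <stdbool.h>

/* True if the CPU pushes an error code when delivering this vector. */
static bool vector_pushes_error_code(unsigned int vector)
{
	switch (vector) {
	case 8:   /* #DF double fault */
	case 10:  /* #TS invalid TSS */
	case 11:  /* #NP segment not present */
	case 12:  /* #SS stack fault */
	case 13:  /* #GP general protection */
	case 14:  /* #PF page fault */
	case 17:  /* #AC alignment check */
		return true;
	default:
		return false;
	}
}

/* Number of quadwords the hardware pushes on entry in 64-bit mode. */
static unsigned int hw_frame_quadwords(unsigned int vector)
{
	/* SS, RSP, RFLAGS, CS and RIP are always pushed in 64-bit mode. */
	return 5 + (vector_pushes_error_code(vector) ? 1u : 0u);
}
```

Because the frame size differs by one quadword between the two groups, an entry stub has to know which group its vector belongs to before it can locate the saved CS and RIP.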
47 | Dealing with the swapgs instruction is especially tricky. Swapgs | ||
48 | toggles whether gs is the kernel gs or the user gs. The swapgs | ||
49 | instruction is rather fragile: it must nest perfectly and only to a | ||
50 | depth of one. It should be used exactly once when entering the | ||
51 | kernel from user mode and exactly once when returning to | ||
52 | user-space. If we mess that up even slightly, we crash. | ||
53 | |||
54 | So when we have a secondary entry, already in kernel mode, we *must | ||
55 | not* use SWAPGS blindly - nor must we forget to do a SWAPGS when GS | ||
56 | has not been switched yet. | ||
57 | |||
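The nesting rule is easier to see in a toy model. SWAPGS exchanges the active GS base (MSR_GS_BASE) with the inactive one (MSR_KERNEL_GS_BASE), so two swaps in a row put the user value back. The C sketch below is an illustration only: struct gs_state, swapgs_model and gs_is_kernel are hypothetical names, and the model assumes kernel addresses sit in the upper half of the address space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the two per-CPU registers that SWAPGS exchanges. */
struct gs_state {
	uint64_t gs_base;        /* active base, MSR_GS_BASE */
	uint64_t kernel_gs_base; /* inactive base, MSR_KERNEL_GS_BASE */
};

/* Model of SWAPGS: exchange the active and inactive base values. */
static void swapgs_model(struct gs_state *cpu)
{
	uint64_t tmp = cpu->gs_base;

	cpu->gs_base = cpu->kernel_gs_base;
	cpu->kernel_gs_base = tmp;
}

/* Kernel addresses live in the upper half, so bit 63 is set. */
static bool gs_is_kernel(const struct gs_state *cpu)
{
	return (cpu->gs_base >> 63) != 0;
}
```

In this model, a secondary entry that swaps blindly while already in the kernel ends up running kernel code with the user GS base, which is the crash scenario described above.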
58 | Now, there's a secondary complication: there's a cheap way to test | ||
59 | which mode the CPU is in and an expensive way. | ||
60 | |||
61 | The cheap way is to pick this info off the entry frame on the kernel | ||
62 | stack, from the CS of the ptregs area of the kernel stack: | ||
63 | |||
64 | xorl %ebx,%ebx | ||
65 | testl $3,CS+8(%rsp) | ||
66 | je error_kernelspace | ||
67 | SWAPGS | ||
68 | |||
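Expressed in C, the cheap check amounts to looking at the low two bits of the saved CS, which hold the privilege level the CPU was running at when the entry happened. This is a sketch with a hypothetical function name. The selector values mentioned in the note are the ones commonly used by x86-64 Linux and are an assumption here, not something the text above states.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirror of "testl $3,CS+8(%rsp); je error_kernelspace; SWAPGS":
 * a non-zero privilege level in the saved CS means we came from user
 * mode, so GS has not been swapped yet and SWAPGS is needed.
 */
static bool swapgs_needed_cheap(uint64_t saved_cs)
{
	return (saved_cs & 3) != 0;
}
```

With a typical kernel code selector such as 0x10 the function returns false, and with typical user selectors such as 0x33 or 0x23 it returns true.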
69 | The expensive (paranoid) way is to read back the MSR_GS_BASE value | ||
70 | (which is what SWAPGS modifies): | ||
71 | |||
72 | movl $1,%ebx | ||
73 | movl $MSR_GS_BASE,%ecx | ||
74 | rdmsr | ||
75 | testl %edx,%edx | ||
76 | js 1f /* negative -> in kernel */ | ||
77 | SWAPGS | ||
78 | xorl %ebx,%ebx | ||
79 | 1: ret | ||
80 | |||
81 | and the whole paranoid/non-paranoid macro complexity is about whether | ||
82 | to suffer that RDMSR cost. | ||
83 | |||
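The paranoid sequence can be read as the following C model. It is a sketch only: struct gs_state and paranoid_entry_model are hypothetical names, the MSR is modelled as a plain field, and the return value plays the role of %ebx in the assembly (1 when no SWAPGS was performed, 0 when one was).

```c
#include <stdint.h>

/* Toy model of the two per-CPU registers that SWAPGS exchanges. */
struct gs_state {
	uint64_t gs_base;        /* active base, MSR_GS_BASE */
	uint64_t kernel_gs_base; /* inactive base, MSR_KERNEL_GS_BASE */
};

/* Returns the value the assembly leaves in %ebx. */
static int paranoid_entry_model(struct gs_state *cpu)
{
	int ebx = 1;                                   /* movl $1,%ebx */
	uint32_t edx = (uint32_t)(cpu->gs_base >> 32); /* rdmsr: high half */
	uint64_t tmp;

	if (edx & 0x80000000u)                         /* js 1f: kernel GS */
		return ebx;

	tmp = cpu->gs_base;                            /* SWAPGS */
	cpu->gs_base = cpu->kernel_gs_base;
	cpu->kernel_gs_base = tmp;

	ebx = 0;                                       /* xorl %ebx,%ebx */
	return ebx;
}
```

The sign test works in this model because a kernel GS base is an upper-half address, so the top bit of the high 32 bits is set; a user GS base has that bit clear.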
84 | If we are at an interrupt or user-trap/gate-alike boundary then we can | ||
85 | use the faster check: the stack will be a reliable indicator of | ||
86 | whether SWAPGS was already done. If we see that we are a secondary | ||
87 | entry interrupting kernel mode execution, then we know that the GS | ||
88 | base has already been switched. If it says that we interrupted | ||
89 | user-space execution then we must do the SWAPGS. | ||
90 | |||
91 | But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, | ||
92 | which might have triggered right after a normal entry wrote CS to the | ||
93 | stack but before we executed SWAPGS, then the only safe way to check | ||
94 | for GS is the slower method: the RDMSR. | ||
95 | |||
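The race can be replayed in a few lines of C. The scenario assumed here: a normal entry from user mode has had its frame pushed, but SWAPGS has not run yet when an NMI arrives. The NMI interrupted kernel code, so its own frame carries a kernel CS even though the active GS base is still the user one. All names and selector values below are hypothetical illustrations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the active GS base matters for this scenario. */
struct cpu_model {
	uint64_t gs_base; /* active base, MSR_GS_BASE */
};

/* Cheap check: trust the CS saved in the entry frame. */
static bool cheap_check_wants_swapgs(uint64_t saved_cs)
{
	return (saved_cs & 3) != 0;
}

/* Paranoid check: look at the GS base itself (bit 63 clear = user). */
static bool msr_check_wants_swapgs(const struct cpu_model *cpu)
{
	return (cpu->gs_base >> 63) == 0;
}
```

In the window described above, the cheap check sees a kernel CS and concludes that no SWAPGS is needed, while the MSR-based check sees a user GS base and correctly asks for one. That disagreement is why only the slower method is safe in such contexts.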
96 | So we try only to mark those entry methods 'paranoid' that absolutely | ||
97 | need the more expensive check for the GS base - and we generate all | ||
98 | 'normal' entry points with the regular (faster) entry macros. | ||