author | Linus Torvalds <torvalds@linux-foundation.org> | 2011-07-22 20:05:15 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2011-07-22 20:05:15 -0400 |
commit | 8e204874db000928e37199c2db82b7eb8966cc3c (patch) | |
tree | eae66035cb761c3c5a79e98b92280b5156bc01ef /Documentation/x86 | |
parent | 3e0b8df79ddb8955d2cce5e858972a9cfe763384 (diff) | |
parent | aafade242ff24fac3aabf61c7861dfa44a3c2445 (diff) |
Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, vdso: Do not allocate memory for the vDSO
clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option
x86, vdso: Drop now wrong comment
Document the vDSO and add a reference parser
ia64: Replace clocksource.fsys_mmio with generic arch data
x86-64: Move vread_tsc and vread_hpet into the vDSO
clocksource: Replace vread with generic arch data
x86-64: Add --no-undefined to vDSO build
x86-64: Allow alternative patching in the vDSO
x86: Make alternative instruction pointers relative
x86-64: Improve vsyscall emulation CS and RIP handling
x86-64: Emulate legacy vsyscalls
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Map the HPET NX
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Give vvars their own page
x86-64: Document some of entry_64.S
x86-64: Fix alignment of jiffies variable
Diffstat (limited to 'Documentation/x86')
-rw-r--r-- | Documentation/x86/entry_64.txt | 98 |
1 files changed, 98 insertions, 0 deletions
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 000000000000..7869f14d055c
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,98 @@
1 | This file documents some of the kernel entries in | ||
2 | arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from | ||
3 | an email from Ingo Molnar: | ||
4 | |||
5 | http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> | ||
6 | |||
7 | The x86 architecture has quite a few different ways to jump into | ||
8 | kernel code. Most of these entry points are registered in | ||
9 | arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S | ||
10 | and arch/x86/ia32/ia32entry.S. | ||
11 | |||
12 | The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h. | ||
13 | |||
14 | Some of these entries are: | ||
15 | |||
16 | - system_call: syscall instruction from 64-bit code. | ||
17 | |||
18 | - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall | ||
19 | either way. | ||
20 | |||
21 | - ia32_cstar, ia32_sysenter: syscall and sysenter from 32-bit | ||
22 | code | ||
23 | |||
24 | - interrupt: An array of entries. Every IDT vector that doesn't | ||
25 | explicitly point somewhere else gets set to the corresponding | ||
26 | value in interrupts. These point to a whole array of | ||
27 | magically-generated functions that make their way to do_IRQ with | ||
28 | the interrupt number as a parameter. | ||
29 | |||
30 | - emulate_vsyscall: int 0xcc, a special non-ABI entry used by | ||
31 | vsyscall emulation. | ||
32 | |||
33 | - APIC interrupts: Various special-purpose interrupts for things | ||
34 | like TLB shootdown. | ||
35 | |||
36 | - Architecturally-defined exceptions like divide_error. | ||
37 | |||
38 | There are a few complexities here. The different x86-64 entries | ||
39 | have different calling conventions. The syscall and sysenter | ||
40 | instructions have their own peculiar calling conventions. Some of | ||
41 | the IDT entries push an error code onto the stack; others don't. | ||
42 | IDT entries using the IST alternative stack mechanism need their own | ||
43 | magic to get the stack frames right. (You can find some | ||
44 | documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, | ||
45 | Volume 3, Chapter 6.) | ||
46 | |||
47 | Dealing with the swapgs instruction is especially tricky. Swapgs | ||
48 | toggles whether gs is the kernel gs or the user gs. The swapgs | ||
49 | instruction is rather fragile: it must nest perfectly and only in | ||
50 | single depth, it should only be used if entering from user mode to | ||
51 | kernel mode and then when returning to user-space, and precisely | ||
52 | so. If we mess that up even slightly, we crash. | ||
53 | |||
54 | So when we have a secondary entry, already in kernel mode, we *must | ||
55 | not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's | ||
56 | not switched/swapped yet. | ||
57 | |||
58 | Now, there's a secondary complication: there's a cheap way to test | ||
59 | which mode the CPU is in and an expensive way. | ||
60 | |||
61 | The cheap way is to pick this info off the entry frame on the kernel | ||
62 | stack, from the CS of the ptregs area of the kernel stack: | ||
63 | |||
64 | xorl %ebx,%ebx | ||
65 | testl $3,CS+8(%rsp) | ||
66 | je error_kernelspace | ||
67 | SWAPGS | ||
68 | |||
69 | The expensive (paranoid) way is to read back the MSR_GS_BASE value | ||
70 | (which is what SWAPGS modifies): | ||
71 | |||
72 | movl $1,%ebx | ||
73 | movl $MSR_GS_BASE,%ecx | ||
74 | rdmsr | ||
75 | testl %edx,%edx | ||
76 | js 1f /* negative -> in kernel */ | ||
77 | SWAPGS | ||
78 | xorl %ebx,%ebx | ||
79 | 1: ret | ||
80 | |||
81 | and the whole paranoid non-paranoid macro complexity is about whether | ||
82 | to suffer that RDMSR cost. | ||
83 | |||
84 | If we are at an interrupt or user-trap/gate-alike boundary then we can | ||
85 | use the faster check: the stack will be a reliable indicator of | ||
86 | whether SWAPGS was already done: if we see that we are a secondary | ||
87 | entry interrupting kernel mode execution, then we know that the GS | ||
88 | base has already been switched. If it says that we interrupted | ||
89 | user-space execution then we must do the SWAPGS. | ||
90 | |||
91 | But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, | ||
92 | which might have triggered right after a normal entry wrote CS to the | ||
93 | stack but before we executed SWAPGS, then the only safe way to check | ||
94 | for GS is the slower method: the RDMSR. | ||
95 | |||
96 | So we try only to mark those entry methods 'paranoid' that absolutely | ||
97 | need the more expensive check for the GS base - and we generate all | ||
98 | 'normal' entry points with the regular (faster) entry macros. | ||
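
The two checks described in the file above can be modelled in plain C to show why the cheap check is not enough in an NMI-style context. This is only a sketch under stated assumptions, not kernel code: `struct cpu_model`, the helper names, the two GS base addresses and the selector constants are all hypothetical and chosen for illustration. The only parts taken from the text are the tests themselves: the low two bits of the saved CS for the cheap check, and the sign of the high half of MSR_GS_BASE for the paranoid check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state SWAPGS touches; not a kernel structure. */
struct cpu_model {
    uint64_t gs_base;       /* active GS base (what reading MSR_GS_BASE returns) */
    uint64_t other_gs_base; /* inactive GS base that SWAPGS would switch in */
};

/* Illustrative values only: a lower-half (user) and an upper-half (kernel) address. */
#define USER_GS       0x00007f0000001000ULL
#define KERNEL_GS     0xffff880000010000ULL
#define USER_CS_SEL   0x33 /* assumed user code selector, RPL 3 */
#define KERNEL_CS_SEL 0x10 /* assumed kernel code selector, RPL 0 */

/* SWAPGS exchanges the active and inactive GS base. */
static void swapgs(struct cpu_model *cpu)
{
    uint64_t tmp = cpu->gs_base;
    cpu->gs_base = cpu->other_gs_base;
    cpu->other_gs_base = tmp;
}

/* Cheap check: "testl $3,CS+8(%rsp)" looks at the privilege bits of the saved CS. */
static bool cheap_came_from_user(uint16_t saved_cs)
{
    return (saved_cs & 3) != 0;
}

/* Paranoid check: "testl %edx,%edx; js" tests bit 63 of MSR_GS_BASE. */
static bool gs_is_kernel(const struct cpu_model *cpu)
{
    return (cpu->gs_base >> 63) != 0;
}

/* Both entry helpers return true when they executed SWAPGS (the %ebx == 0 case). */
static bool normal_entry(struct cpu_model *cpu, uint16_t saved_cs)
{
    if (!cheap_came_from_user(saved_cs))
        return false;
    swapgs(cpu);
    return true;
}

static bool paranoid_entry(struct cpu_model *cpu)
{
    if (gs_is_kernel(cpu))
        return false;
    swapgs(cpu);
    return true;
}

/* Exit path: undo SWAPGS only if this entry performed it. */
static void exit_to_interrupted(struct cpu_model *cpu, bool did_swapgs)
{
    if (did_swapgs) {
        assert(gs_is_kernel(cpu)); /* SWAPGS must nest exactly once */
        swapgs(cpu);
    }
}
```

To reproduce the race from the text in this model, start with `gs_base` still holding the user value (a normal entry has saved CS but not yet run SWAPGS) and deliver a nested entry whose saved CS is the kernel selector. `normal_entry()` then skips SWAPGS and leaves the user GS base active, while `paranoid_entry()` notices the lower-half GS base and swaps it.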