-rw-r--r-- | Documentation/x86_64/kernel-stacks | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/Documentation/x86_64/kernel-stacks b/Documentation/x86_64/kernel-stacks
new file mode 100644
index 000000000000..bddfddd466ab
--- /dev/null
+++ b/Documentation/x86_64/kernel-stacks
@@ -0,0 +1,99 @@
1 | Most of the text from Keith Owens, hacked by AK | ||
2 | |||
3 | x86_64 page size (PAGE_SIZE) is 4K. | ||
4 | |||
5 | Like all other architectures, x86_64 has a kernel stack for every | ||
6 | active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. | ||
7 | These stacks contain useful data as long as a thread is alive or a | ||
8 | zombie. While the thread is in user space the kernel stack is empty | ||
9 | except for the thread_info structure at the bottom. | ||
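
To make the layout above concrete, here is a minimal sketch of locating
the bottom of a thread stack, where thread_info lives. It assumes that
each thread stack is aligned to THREAD_SIZE, which the text does not
state. The struct and the helper names are hypothetical stand-ins, not
the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL			/* 4K, as stated above */
#define THREAD_SIZE	(2 * PAGE_SIZE)

/* Hypothetical stand-in; the real thread_info has different fields. */
struct thread_info_sketch {
	int cpu;
};

/*
 * Assumption: the stack is THREAD_SIZE aligned, so clearing the low
 * bits of any address inside it yields the lowest address of the stack.
 */
static inline uintptr_t stack_base_sketch(uintptr_t sp)
{
	return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}

/* Bytes still free between the stack pointer and thread_info. */
static inline uintptr_t stack_room_sketch(uintptr_t sp)
{
	uintptr_t lowest_usable = stack_base_sketch(sp) +
				  sizeof(struct thread_info_sketch);

	assert(sp >= lowest_usable);
	return sp - lowest_usable;
}
```

Because the stack grows down towards thread_info, stack_room_sketch()
shrinks as the kernel nests deeper into calls on that thread.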
10 | |||
11 | In addition to the per thread stacks, there are specialized stacks | ||
12 | associated with each cpu. These stacks are only used while the kernel | ||
13 | is in control on that cpu; when a cpu returns to user space the | ||
14 | specialized stacks contain no useful data. The main cpu stacks are: | ||
15 | |||
16 | * Interrupt stack. IRQSTACKSIZE | ||
17 | |||
18 | Used for external hardware interrupts. If this is the first external | ||
19 | hardware interrupt (i.e. not a nested hardware interrupt) then the | ||
20 | kernel switches from the current task to the interrupt stack. Like | ||
21 | the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS), | ||
22 | this gives more room for kernel interrupt processing without having | ||
23 | to increase the size of every per thread stack. | ||
24 | |||
25 | The interrupt stack is also used when processing a softirq. | ||
26 | |||
27 | Switching to the kernel interrupt stack is done by software based on a | ||
28 | per CPU interrupt nest counter. This is needed because x86-64 "IST" | ||
29 | hardware stacks cannot nest without races. | ||
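
The software switch described above can be sketched as follows. This is
an illustration only: the structure, the field names and the convention
that the counter starts at -1 are assumptions made for the example, not
the code in entry.S.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cpu state; names are illustrative only. */
struct cpu_irq_state_sketch {
	int irq_nest;		/* assumed to be -1 when no interrupt is active */
	bool on_irq_stack;
};

#define CPU_IRQ_STATE_INIT { .irq_nest = -1, .on_irq_stack = false }

/* Returns true if this is the outermost interrupt and must switch stacks. */
static bool irq_enter_sketch(struct cpu_irq_state_sketch *c)
{
	c->irq_nest++;
	if (c->irq_nest == 0) {
		c->on_irq_stack = true;	/* first interrupt: leave the task stack */
		return true;
	}
	return false;			/* nested: already on the interrupt stack */
}

/* Returns true if this exit goes back to the interrupted task stack. */
static bool irq_exit_sketch(struct cpu_irq_state_sketch *c)
{
	assert(c->irq_nest >= 0);
	c->irq_nest--;
	if (c->irq_nest < 0) {
		c->on_irq_stack = false;
		return true;
	}
	return false;
}
```

Only the outermost entry and exit change stacks; a nested interrupt keeps
pushing onto the interrupt stack it is already running on.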
30 | |||
31 | x86_64 also has a feature which is not available on i386, the ability | ||
32 | to automatically switch to a new stack for designated events such as | ||
33 | double fault or NMI, which makes it easier to handle these unusual | ||
34 | events on x86_64. This feature is called the Interrupt Stack Table | ||
35 | (IST). There can be up to 7 IST entries per cpu. The IST code is an | ||
36 | index into the Task State Segment (TSS). The IST entries in the TSS | ||
37 | point to dedicated stacks; each stack can be a different size. | ||
38 | |||
39 | An IST is selected by a non-zero value in the IST field of an | ||
40 | interrupt-gate descriptor. When an interrupt occurs and the hardware | ||
41 | loads such a descriptor, the hardware automatically sets the new stack | ||
42 | pointer based on the IST value, then invokes the interrupt handler. If | ||
43 | software wants to allow nested IST interrupts then the handler must | ||
44 | adjust the IST values on entry to and exit from the interrupt handler. | ||
45 | (This is occasionally done, e.g. for debug exceptions.) | ||
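
The selection rule can be modelled in a few lines of C. This is a
simplified sketch of what the hardware does, not kernel code: the
structures keep only the fields needed here, all names are hypothetical,
and the rsp0 fallback for entry from user space is added for
completeness even though the text above does not discuss it.

```c
#include <assert.h>
#include <stdint.h>

#define IST_ENTRIES 7	/* up to 7 IST entries per cpu */

/* Simplified TSS: only the stack pointers relevant to this example. */
struct tss_sketch {
	uint64_t rsp0;			/* used on entry from user space */
	uint64_t ist[IST_ENTRIES];	/* IST1..IST7 */
};

/* Simplified interrupt gate: just the IST field (3 bits in hardware). */
struct gate_sketch {
	unsigned int ist;		/* 0 means "do not use the IST" */
};

/* Stack pointer the cpu would load before invoking the handler. */
static uint64_t gate_stack_sketch(const struct tss_sketch *tss,
				  const struct gate_sketch *gate,
				  uint64_t current_rsp, int from_user)
{
	assert(gate->ist <= IST_ENTRIES);
	if (gate->ist != 0)
		return tss->ist[gate->ist - 1];	/* unconditional switch */
	return from_user ? tss->rsp0 : current_rsp;
}
```

A non-zero IST value always wins, regardless of what was running before.
That unconditional switch is also why a second event with the same IST
value would start at the same address as the first.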
46 | |||
47 | Events with different IST codes (i.e. with different stacks) can be | ||
48 | nested. For example, a debug interrupt can safely be interrupted by an | ||
49 | NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack | ||
50 | pointers on entry to and exit from all IST events, in theory allowing | ||
51 | IST events with the same code to be nested. However, in most cases the | ||
52 | stack size allocated to an IST assumes no nesting for the same code. | ||
53 | If that assumption is ever broken then the stacks will become corrupt. | ||
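
The adjustment on entry and exit can be sketched as below. This only
illustrates the idea; it is not the paranoidentry code. The slot
structure, the function names and the frame_size parameter are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one IST slot: the top-of-stack address in the TSS. */
struct ist_slot_sketch {
	uint64_t top;
};

/*
 * On entry, move the slot down so that a nested event with the same IST
 * code starts below the frame currently in use instead of on top of it.
 * Returns the stack top this handler itself is running on.
 */
static uint64_t ist_nest_enter_sketch(struct ist_slot_sketch *slot,
				      uint64_t frame_size)
{
	uint64_t in_use = slot->top;

	slot->top -= frame_size;
	return in_use;
}

/* On exit, restore the slot for the next event. */
static void ist_nest_exit_sketch(struct ist_slot_sketch *slot,
				 uint64_t frame_size)
{
	slot->top += frame_size;
}
```

This only works if the memory behind the slot is large enough for every
level of nesting. A stack sized for a single frame is overrun as soon as
a second event of the same code arrives, which is the corruption
described above.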
54 | |||
55 | The currently assigned IST stacks are: | ||
56 | |||
57 | * STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
58 | |||
59 | Used for interrupt 12 - Stack Fault Exception (#SS). | ||
60 | |||
61 | This allows recovery from invalid stack segments. It rarely | ||
62 | happens. | ||
63 | |||
64 | * DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
65 | |||
66 | Used for interrupt 8 - Double Fault Exception (#DF). | ||
67 | |||
68 | Invoked when handling one exception causes another exception. Happens | ||
69 | when the kernel is very confused (e.g. kernel stack pointer corrupt). | ||
70 | Using a separate stack allows the kernel to recover well enough in | ||
71 | many cases to still output an oops. | ||
72 | |||
73 | * NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
74 | |||
75 | Used for non-maskable interrupts (NMI). | ||
76 | |||
77 | NMI can be delivered at any time, including when the kernel is in the | ||
78 | middle of switching stacks. Using IST for NMI events avoids making | ||
79 | assumptions about the previous state of the kernel stack. | ||
80 | |||
81 | * DEBUG_STACK. DEBUG_STKSZ | ||
82 | |||
83 | Used for hardware debug interrupts (interrupt 1) and for software | ||
84 | debug interrupts (INT3). | ||
85 | |||
86 | When debugging a kernel, debug interrupts (both hardware and | ||
87 | software) can occur at any time. Using IST for these interrupts | ||
88 | avoids making assumptions about the previous state of the kernel | ||
89 | stack. | ||
90 | |||
91 | * MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
92 | |||
93 | Used for interrupt 18 - Machine Check Exception (#MC). | ||
94 | |||
95 | MCE can be delivered at any time, including when the kernel is in the | ||
96 | middle of switching stacks. Using IST for MCE events avoids making | ||
97 | assumptions about the previous state of the kernel stack. | ||
98 | |||
99 | For more details see the Intel IA32 or AMD AMD64 architecture manuals. | ||