| -rw-r--r-- | Documentation/exception.txt | 202 |
1 files changed, 101 insertions, 101 deletions
diff --git a/Documentation/exception.txt b/Documentation/exception.txt
index 2d5aded64247..32901aa36f0a 100644
--- a/Documentation/exception.txt
+++ b/Documentation/exception.txt
| @@ -1,123 +1,123 @@ | |||
| 1 | Kernel level exception handling in Linux 2.1.8 | 1 | Kernel level exception handling in Linux |
| 2 | Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> | 2 | Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> |
| 3 | 3 | ||
| 4 | When a process runs in kernel mode, it often has to access user | 4 | When a process runs in kernel mode, it often has to access user |
| 5 | mode memory whose address has been passed by an untrusted program. | 5 | mode memory whose address has been passed by an untrusted program. |
| 6 | To protect itself the kernel has to verify this address. | 6 | To protect itself the kernel has to verify this address. |
| 7 | 7 | ||
| 8 | In older versions of Linux this was done with the | 8 | In older versions of Linux this was done with the |
| 9 | int verify_area(int type, const void * addr, unsigned long size) | 9 | int verify_area(int type, const void * addr, unsigned long size) |
| 10 | function (which has since been replaced by access_ok()). | 10 | function (which has since been replaced by access_ok()). |
| 11 | 11 | ||
| 12 | This function verified that the memory area starting at address | 12 | This function verified that the memory area starting at address |
| 13 | 'addr' and of size 'size' was accessible for the operation specified | 13 | 'addr' and of size 'size' was accessible for the operation specified |
| 14 | in type (read or write). To do this, verify_area had to look up the | 14 | in type (read or write). To do this, verify_area had to look up the |
| 15 | virtual memory area (vma) that contained the address addr. In the | 15 | virtual memory area (vma) that contained the address addr. In the |
| 16 | normal case (correctly working program), this test was successful. | 16 | normal case (correctly working program), this test was successful. |
| 17 | It only failed for a few buggy programs. In some kernel profiling | 17 | It only failed for a few buggy programs. In some kernel profiling |
| 18 | tests, this normally unneeded verification used up a considerable | 18 | tests, this normally unneeded verification used up a considerable |
| 19 | amount of time. | 19 | amount of time. |
| 20 | 20 | ||
| 21 | To overcome this situation, Linus decided to let the virtual memory | 21 | To overcome this situation, Linus decided to let the virtual memory |
| 22 | hardware present in every Linux-capable CPU handle this test. | 22 | hardware present in every Linux-capable CPU handle this test. |
| 23 | 23 | ||
| 24 | How does this work? | 24 | How does this work? |
| 25 | 25 | ||
| 26 | Whenever the kernel tries to access an address that is currently not | 26 | Whenever the kernel tries to access an address that is currently not |
| 27 | accessible, the CPU generates a page fault exception and calls the | 27 | accessible, the CPU generates a page fault exception and calls the |
| 28 | page fault handler | 28 | page fault handler |
| 29 | 29 | ||
| 30 | void do_page_fault(struct pt_regs *regs, unsigned long error_code) | 30 | void do_page_fault(struct pt_regs *regs, unsigned long error_code) |
| 31 | 31 | ||
| 32 | in arch/i386/mm/fault.c. The parameters on the stack are set up by | 32 | in arch/x86/mm/fault.c. The parameters on the stack are set up by |
| 33 | the low level assembly glue in arch/i386/kernel/entry.S. The parameter | 33 | the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter |
| 34 | regs is a pointer to the saved registers on the stack, error_code | 34 | regs is a pointer to the saved registers on the stack, error_code |
| 35 | contains a reason code for the exception. | 35 | contains a reason code for the exception. |
| 36 | 36 | ||
| 37 | do_page_fault first obtains the unaccessible address from the CPU | 37 | do_page_fault first obtains the unaccessible address from the CPU |
| 38 | control register CR2. If the address is within the virtual address | 38 | control register CR2. If the address is within the virtual address |
| 39 | space of the process, the fault probably occurred, because the page | 39 | space of the process, the fault probably occurred, because the page |
| 40 | was not swapped in, write protected or something similar. However, | 40 | was not swapped in, write protected or something similar. However, |
| 41 | we are interested in the other case: the address is not valid, there | 41 | we are interested in the other case: the address is not valid, there |
| 42 | is no vma that contains this address. In this case, the kernel jumps | 42 | is no vma that contains this address. In this case, the kernel jumps |
| 43 | to the bad_area label. | 43 | to the bad_area label. |
| 44 | 44 | ||
| 45 | There it uses the address of the instruction that caused the exception | 45 | There it uses the address of the instruction that caused the exception |
| 46 | (i.e. regs->eip) to find an address where the execution can continue | 46 | (i.e. regs->eip) to find an address where the execution can continue |
| 47 | (fixup). If this search is successful, the fault handler modifies the | 47 | (fixup). If this search is successful, the fault handler modifies the |
| 48 | return address (again regs->eip) and returns. The execution will | 48 | return address (again regs->eip) and returns. The execution will |
| 49 | continue at the address in fixup. | 49 | continue at the address in fixup. |
| 50 | 50 | ||
| 51 | Where does fixup point to? | 51 | Where does fixup point to? |
| 52 | 52 | ||
| 53 | Since we jump to the contents of fixup, fixup obviously points | 53 | Since we jump to the contents of fixup, fixup obviously points |
| 54 | to executable code. This code is hidden inside the user access macros. | 54 | to executable code. This code is hidden inside the user access macros. |
| 55 | I have picked the get_user macro defined in include/asm/uaccess.h as an | 55 | I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h |
| 56 | example. The definition is somewhat hard to follow, so let's peek at | 56 | as an example. The definition is somewhat hard to follow, so let's peek at |
| 57 | the code generated by the preprocessor and the compiler. I selected | 57 | the code generated by the preprocessor and the compiler. I selected |
| 58 | the get_user call in drivers/char/console.c for a detailed examination. | 58 | the get_user call in drivers/char/sysrq.c for a detailed examination. |
| 59 | 59 | ||
| 60 | The original code in console.c line 1405: | 60 | The original code in sysrq.c line 587: |
| 61 | get_user(c, buf); | 61 | get_user(c, buf); |
| 62 | 62 | ||
| 63 | The preprocessor output (edited to become somewhat readable): | 63 | The preprocessor output (edited to become somewhat readable): |
| 64 | 64 | ||
| 65 | ( | 65 | ( |
| 66 | { | 66 | { |
| 67 | long __gu_err = - 14 , __gu_val = 0; | 67 | long __gu_err = - 14 , __gu_val = 0; |
| 68 | const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); | 68 | const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); |
| 69 | if (((((0 + current_set[0])->tss.segment) == 0x18 ) || | 69 | if (((((0 + current_set[0])->tss.segment) == 0x18 ) || |
| 70 | (((sizeof(*(buf))) <= 0xC0000000UL) && | 70 | (((sizeof(*(buf))) <= 0xC0000000UL) && |
| 71 | ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) | 71 | ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) |
| 72 | do { | 72 | do { |
| 73 | __gu_err = 0; | 73 | __gu_err = 0; |
| 74 | switch ((sizeof(*(buf)))) { | 74 | switch ((sizeof(*(buf)))) { |
| 75 | case 1: | 75 | case 1: |
| 76 | __asm__ __volatile__( | 76 | __asm__ __volatile__( |
| 77 | "1: mov" "b" " %2,%" "b" "1\n" | 77 | "1: mov" "b" " %2,%" "b" "1\n" |
| 78 | "2:\n" | 78 | "2:\n" |
| 79 | ".section .fixup,\"ax\"\n" | 79 | ".section .fixup,\"ax\"\n" |
| 80 | "3: movl %3,%0\n" | 80 | "3: movl %3,%0\n" |
| 81 | " xor" "b" " %" "b" "1,%" "b" "1\n" | 81 | " xor" "b" " %" "b" "1,%" "b" "1\n" |
| 82 | " jmp 2b\n" | 82 | " jmp 2b\n" |
| 83 | ".section __ex_table,\"a\"\n" | 83 | ".section __ex_table,\"a\"\n" |
| 84 | " .align 4\n" | 84 | " .align 4\n" |
| 85 | " .long 1b,3b\n" | 85 | " .long 1b,3b\n" |
| 86 | ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) | 86 | ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) |
| 87 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; | 87 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; |
| 88 | break; | 88 | break; |
| 89 | case 2: | 89 | case 2: |
| 90 | __asm__ __volatile__( | 90 | __asm__ __volatile__( |
| 91 | "1: mov" "w" " %2,%" "w" "1\n" | 91 | "1: mov" "w" " %2,%" "w" "1\n" |
| 92 | "2:\n" | 92 | "2:\n" |
| 93 | ".section .fixup,\"ax\"\n" | 93 | ".section .fixup,\"ax\"\n" |
| 94 | "3: movl %3,%0\n" | 94 | "3: movl %3,%0\n" |
| 95 | " xor" "w" " %" "w" "1,%" "w" "1\n" | 95 | " xor" "w" " %" "w" "1,%" "w" "1\n" |
| 96 | " jmp 2b\n" | 96 | " jmp 2b\n" |
| 97 | ".section __ex_table,\"a\"\n" | 97 | ".section __ex_table,\"a\"\n" |
| 98 | " .align 4\n" | 98 | " .align 4\n" |
| 99 | " .long 1b,3b\n" | 99 | " .long 1b,3b\n" |
| 100 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) | 100 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) |
| 101 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); | 101 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); |
| 102 | break; | 102 | break; |
| 103 | case 4: | 103 | case 4: |
| 104 | __asm__ __volatile__( | 104 | __asm__ __volatile__( |
| 105 | "1: mov" "l" " %2,%" "" "1\n" | 105 | "1: mov" "l" " %2,%" "" "1\n" |
| 106 | "2:\n" | 106 | "2:\n" |
| 107 | ".section .fixup,\"ax\"\n" | 107 | ".section .fixup,\"ax\"\n" |
| 108 | "3: movl %3,%0\n" | 108 | "3: movl %3,%0\n" |
| 109 | " xor" "l" " %" "" "1,%" "" "1\n" | 109 | " xor" "l" " %" "" "1,%" "" "1\n" |
| 110 | " jmp 2b\n" | 110 | " jmp 2b\n" |
| 111 | ".section __ex_table,\"a\"\n" | 111 | ".section __ex_table,\"a\"\n" |
| 112 | " .align 4\n" " .long 1b,3b\n" | 112 | " .align 4\n" " .long 1b,3b\n" |
| 113 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) | 113 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) |
| 114 | ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); | 114 | ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); |
| 115 | break; | 115 | break; |
| 116 | default: | 116 | default: |
| 117 | (__gu_val) = __get_user_bad(); | 117 | (__gu_val) = __get_user_bad(); |
| 118 | } | 118 | } |
| 119 | } while (0) ; | 119 | } while (0) ; |
| 120 | ((c)) = (__typeof__(*((buf))))__gu_val; | 120 | ((c)) = (__typeof__(*((buf))))__gu_val; |
| 121 | __gu_err; | 121 | __gu_err; |
| 122 | } | 122 | } |
| 123 | ); | 123 | ); |
| @@ -127,12 +127,12 @@ see what code gcc generates: | |||
| 127 | 127 | ||
| 128 | > xorl %edx,%edx | 128 | > xorl %edx,%edx |
| 129 | > movl current_set,%eax | 129 | > movl current_set,%eax |
| 130 | > cmpl $24,788(%eax) | 130 | > cmpl $24,788(%eax) |
| 131 | > je .L1424 | 131 | > je .L1424 |
| 132 | > cmpl $-1073741825,64(%esp) | 132 | > cmpl $-1073741825,64(%esp) |
| 133 | > ja .L1423 | 133 | > ja .L1423 |
| 134 | > .L1424: | 134 | > .L1424: |
| 135 | > movl %edx,%eax | 135 | > movl %edx,%eax |
| 136 | > movl 64(%esp),%ebx | 136 | > movl 64(%esp),%ebx |
| 137 | > #APP | 137 | > #APP |
| 138 | > 1: movb (%ebx),%dl /* this is the actual user access */ | 138 | > 1: movb (%ebx),%dl /* this is the actual user access */ |
| @@ -149,17 +149,17 @@ see what code gcc generates: | |||
| 149 | > .L1423: | 149 | > .L1423: |
| 150 | > movzbl %dl,%esi | 150 | > movzbl %dl,%esi |
| 151 | 151 | ||
| 152 | The optimizer does a good job and gives us something we can actually | 152 | The optimizer does a good job and gives us something we can actually |
| 153 | understand. Can we? The actual user access is quite obvious. Thanks | 153 | understand. Can we? The actual user access is quite obvious. Thanks |
| 154 | to the unified address space we can just access the address in user | 154 | to the unified address space we can just access the address in user |
| 155 | memory. But what does the .section stuff do????? | 155 | memory. But what does the .section stuff do????? |
| 156 | 156 | ||
| 157 | To understand this we have to look at the final kernel: | 157 | To understand this we have to look at the final kernel: |
| 158 | 158 | ||
| 159 | > objdump --section-headers vmlinux | 159 | > objdump --section-headers vmlinux |
| 160 | > | 160 | > |
| 161 | > vmlinux: file format elf32-i386 | 161 | > vmlinux: file format elf32-i386 |
| 162 | > | 162 | > |
| 163 | > Sections: | 163 | > Sections: |
| 164 | > Idx Name Size VMA LMA File off Algn | 164 | > Idx Name Size VMA LMA File off Algn |
| 165 | > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 | 165 | > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 |
| @@ -198,18 +198,18 @@ final kernel executable: | |||
| 198 | 198 | ||
| 199 | The whole user memory access is reduced to 10 x86 machine instructions. | 199 | The whole user memory access is reduced to 10 x86 machine instructions. |
| 200 | The instructions bracketed in the .section directives are no longer | 200 | The instructions bracketed in the .section directives are no longer |
| 201 | in the normal execution path. They are located in a different section | 201 | in the normal execution path. They are located in a different section |
| 202 | of the executable file: | 202 | of the executable file: |
| 203 | 203 | ||
| 204 | > objdump --disassemble --section=.fixup vmlinux | 204 | > objdump --disassemble --section=.fixup vmlinux |
| 205 | > | 205 | > |
| 206 | > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax | 206 | > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax |
| 207 | > c0199ffa <.fixup+10ba> xorb %dl,%dl | 207 | > c0199ffa <.fixup+10ba> xorb %dl,%dl |
| 208 | > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> | 208 | > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> |
| 209 | 209 | ||
| 210 | And finally: | 210 | And finally: |
| 211 | > objdump --full-contents --section=__ex_table vmlinux | 211 | > objdump --full-contents --section=__ex_table vmlinux |
| 212 | > | 212 | > |
| 213 | > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ | 213 | > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ |
| 214 | > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ | 214 | > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ |
| 215 | > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ | 215 | > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ |
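The dump above prints raw bytes in memory order, and i386 is little-endian, so each 32-bit word has to be byte-swapped before it reads as an address. A minimal sketch of that decoding follows; the helper name le32_from_hex is hypothetical and not part of the kernel or of objdump.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: turn 8 hex digits printed in memory order
 * (as objdump --full-contents shows them) into the 32-bit value a
 * little-endian CPU would load from those four bytes. */
static uint32_t le32_from_hex(const char *hex)
{
	uint32_t value = 0;

	assert(hex != NULL);
	for (int i = 0; i < 4; i++) {
		unsigned int byte = 0;

		/* Read two hex digits, i.e. one byte of the dump. */
		sscanf(hex + 2 * i, "%2x", &byte);
		/* The first byte in memory is the least significant one. */
		value |= (uint32_t)byte << (8 * i);
	}
	return value;
}
```

With this helper, le32_from_hex("a5e717c0") gives 0xc017e7a5 and le32_from_hex("f59f19c0") gives 0xc0199ff5, the third and fourth words of the row at c01aa7d4.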
| @@ -235,8 +235,8 @@ sections in the ELF object file. So the instructions | |||
| 235 | ended up in the .fixup section of the object file and the addresses | 235 | ended up in the .fixup section of the object file and the addresses |
| 236 | .long 1b,3b | 236 | .long 1b,3b |
| 237 | ended up in the __ex_table section of the object file. 1b and 3b | 237 | ended up in the __ex_table section of the object file. 1b and 3b |
| 238 | are local labels. The local label 1b (1b stands for next label 1 | 238 | are local labels. The local label 1b (1b stands for next label 1 |
| 239 | backward) is the address of the instruction that might fault, i.e. | 239 | backward) is the address of the instruction that might fault, i.e. |
| 240 | in our case the address of the label 1 is c017e7a5: | 240 | in our case the address of the label 1 is c017e7a5: |
| 241 | the original assembly code: > 1: movb (%ebx),%dl | 241 | the original assembly code: > 1: movb (%ebx),%dl |
| 242 | and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl | 242 | and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl |
| @@ -254,7 +254,7 @@ The assembly code | |||
| 254 | becomes the value pair | 254 | becomes the value pair |
| 255 | > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ | 255 | > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ |
| 256 | ^this is ^this is | 256 | ^this is ^this is |
| 257 | 1b 3b | 257 | 1b 3b |
| 258 | c017e7a5,c0199ff5 in the exception table of the kernel. | 258 | c017e7a5,c0199ff5 in the exception table of the kernel. |
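Conceptually, the exception table is an array of (faulting instruction, fixup) address pairs that is searched by the address of the faulting instruction. The following is a simplified sketch of such a lookup over pairs decoded from the __ex_table dump. The names ex_entry and find_fixup are made up for illustration, and a plain linear scan stands in for whatever search the kernel actually performs.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout: one entry per instruction that may fault. */
struct ex_entry {
	uint32_t insn;	/* address of the instruction that may fault (label 1) */
	uint32_t fixup;	/* address to resume at after a fault (label 3) */
};

/* Pairs taken from the __ex_table dump, byte-swapped to address order. */
static const struct ex_entry ex_table[] = {
	{ 0xc017c2f6u, 0xc0199fe9u },
	{ 0xc017e7a5u, 0xc0199ff5u },
	{ 0xc0180a08u, 0xc019a001u },
	{ 0xc0180a0au, 0xc019a004u },
};

/* Return the fixup address for a faulting eip, or 0 if there is none. */
static uint32_t find_fixup(uint32_t eip)
{
	for (size_t i = 0; i < sizeof(ex_table) / sizeof(ex_table[0]); i++) {
		if (ex_table[i].insn == eip)
			return ex_table[i].fixup;
	}
	return 0;
}
```

Here find_fixup(0xc017e7a5) returns 0xc0199ff5, the fixup address paired with the faulting movb, while an address with no table entry returns 0.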
| 259 | 259 | ||
| 260 | So, what actually happens if a fault from kernel mode with no suitable | 260 | So, what actually happens if a fault from kernel mode with no suitable |
| @@ -266,9 +266,9 @@ vma occurs? | |||
| 266 | 3.) CPU calls do_page_fault | 266 | 3.) CPU calls do_page_fault |
| 267 | 4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5); | 267 | 4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5); |
| 268 | 5.) search_exception_table looks up the address c017e7a5 in the | 268 | 5.) search_exception_table looks up the address c017e7a5 in the |
| 269 | exception table (i.e. the contents of the ELF section __ex_table) | 269 | exception table (i.e. the contents of the ELF section __ex_table) |
| 270 | and returns the address of the associated fault handle code c0199ff5. | 270 | and returns the address of the associated fault handle code c0199ff5. |
| 271 | 6.) do_page_fault modifies its own return address to point to the fault | 271 | 6.) do_page_fault modifies its own return address to point to the fault |
| 272 | handle code and returns. | 272 | handle code and returns. |
| 273 | 7.) execution continues in the fault handling code. | 273 | 7.) execution continues in the fault handling code. |
| 274 | 8.) 8a) EAX becomes -EFAULT (== -14) | 274 | 8.) 8a) EAX becomes -EFAULT (== -14) |
