Diffstat (limited to 'Documentation/vm/hwpoison.txt')
| -rw-r--r-- | Documentation/vm/hwpoison.txt | 136 |
1 files changed, 136 insertions, 0 deletions
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt new file mode 100644 index 000000000000..3ffadf8da61f --- /dev/null +++ b/Documentation/vm/hwpoison.txt | |||
| @@ -0,0 +1,136 @@ | |||
| 1 | What is hwpoison? | ||
| 2 | |||
| 3 | Upcoming Intel CPUs have support for recovering from some memory errors | ||
| 4 | (``MCA recovery''). This requires the OS to declare a page "poisoned", | ||
| 5 | kill the processes associated with it and avoid using it in the future. | ||
| 6 | |||
| 7 | This patchkit implements the necessary infrastructure in the VM. | ||
| 8 | |||
| 9 | To quote the overview comment: | ||
| 10 | |||
| 11 | * High level machine check handler. Handles pages reported by the | ||
| 12 | * hardware as being corrupted usually due to a 2bit ECC memory or cache | ||
| 13 | * failure. | ||
| 14 | * | ||
| 15 | * This focusses on pages detected as corrupted in the background. | ||
| 16 | * When the current CPU tries to consume corruption the currently | ||
| 17 | * running process can just be killed directly instead. This implies | ||
| 18 | * that if the error cannot be handled for some reason it's safe to | ||
| 19 | * just ignore it because no corruption has been consumed yet. Instead | ||
| 20 | * when that happens another machine check will happen. | ||
| 21 | * | ||
| 22 | * Handles page cache pages in various states. The tricky part | ||
| 23 | * here is that we can access any page asynchronous to other VM | ||
| 24 | * users, because memory failures could happen anytime and anywhere, | ||
| 25 | * possibly violating some of their assumptions. This is why this code | ||
| 26 | * has to be extremely careful. Generally it tries to use normal locking | ||
| 27 | * rules, as in get the standard locks, even if that means the | ||
| 28 | * error handling takes potentially a long time. | ||
| 29 | * | ||
| 30 | * Some of the operations here are somewhat inefficient and have non | ||
| 31 | * linear algorithmic complexity, because the data structures have not | ||
| 32 | * been optimized for this case. This is in particular the case | ||
| 33 | * for the mapping from a vma to a process. Since this case is expected | ||
| 34 | * to be rare we hope we can get away with this. | ||
| 35 | |||
| 36 | The code consists of the high level handler in mm/memory-failure.c, | ||
| 37 | a new page poison bit and various checks in the VM to handle poisoned | ||
| 38 | pages. | ||
| 39 | |||
| 40 | The main target right now is KVM guests, but it works for all kinds | ||
| 41 | of applications. KVM support requires a recent qemu-kvm release. | ||
| 42 | |||
| 43 | For the KVM use there was a need for a new signal type so that | ||
| 44 | KVM can inject the machine check into the guest with the proper | ||
| 45 | address. This in theory allows other applications to handle | ||
| 46 | memory failures too. The expectation is that nearly all applications | ||
| 47 | won't do that, but some very specialized ones might. | ||
| 48 | |||
| 49 | --- | ||
| 50 | |||
| 51 | There are two (actually three) modes memory failure recovery can be in: | ||
| 52 | |||
| 53 | vm.memory_failure_recovery sysctl set to zero: | ||
| 54 | All memory failures cause a panic. Do not attempt recovery. | ||
| 55 | (on x86 this can also be affected by the tolerant level of the | ||
| 56 | MCE subsystem) | ||
| 57 | |||
| 58 | early kill | ||
| 59 | (can be controlled globally and per process) | ||
| 60 | Send SIGBUS to the application as soon as the error is detected. | ||
| 61 | This allows applications that can handle memory errors gracefully | ||
| 62 | (e.g. by dropping the affected object) to do so. | ||
| 63 | This is the mode used by KVM qemu. | ||
| 64 | |||
| 65 | late kill | ||
| 66 | Send SIGBUS when the application runs into the corrupted page. | ||
| 67 | This is best for memory error unaware applications and is the default. | ||
| 68 | Note that some pages are always handled as late kill. | ||
| 69 | |||
| 70 | --- | ||
| 71 | |||
| 72 | User control: | ||
| 73 | |||
| 74 | vm.memory_failure_recovery | ||
| 75 | See sysctl.txt | ||
| 76 | |||
| 77 | vm.memory_failure_early_kill | ||
| 78 | Enable early kill mode globally | ||
| 79 | |||
| 80 | PR_MCE_KILL | ||
| 81 | Set early/late kill mode/revert to system default | ||
| 82 | arg1: PR_MCE_KILL_CLEAR: Revert to system default | ||
| 83 | arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode | ||
| 84 | PR_MCE_KILL_EARLY: Early kill | ||
| 85 | PR_MCE_KILL_LATE: Late kill | ||
| 86 | PR_MCE_KILL_DEFAULT: Use system global default | ||
| 87 | PR_MCE_KILL_GET | ||
| 88 | return current mode | ||
| 89 | |||
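
The PR_MCE_KILL interface described above can be exercised from user space roughly as follows. This is a sketch under the assumption that the libc headers expose the PR_MCE_KILL constants through <sys/prctl.h>; the wrapper names mce_kill_set_early, mce_kill_clear and mce_kill_get are hypothetical.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Request early kill for the calling thread.
 * Returns 0 on success, -1 with errno set on failure. */
int mce_kill_set_early(void)
{
    /* Unused prctl arguments must be zero. */
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
                 (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/* Revert to the system default policy. */
int mce_kill_clear(void)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
                 0UL, 0UL, 0UL);
}

/* Return the current mode (PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or
 * PR_MCE_KILL_DEFAULT), or -1 on failure. */
int mce_kill_get(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

A program that wants early notification would call mce_kill_set_early() once at startup, before installing its SIGBUS handler.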
| 90 | |||
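
The two sysctls can be inspected programmatically through procfs. The sketch below assumes the usual mapping of vm.<name> to /proc/sys/vm/<name>; read_vm_sysctl is a hypothetical helper. It fails cleanly on kernels built without memory failure support, where the files do not exist.

```c
#include <assert.h>
#include <stdio.h>

/* Read an integer sysctl from /proc/sys/vm/<name> into *value.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_vm_sysctl(const char *name, int *value)
{
    char path[256];
    FILE *f;
    int ok;

    if (snprintf(path, sizeof(path), "/proc/sys/vm/%s", name)
            >= (int)sizeof(path))
        return -1; /* name too long for the buffer */

    f = fopen(path, "r");
    if (!f)
        return -1;

    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, read_vm_sysctl("memory_failure_early_kill", &v) stores the global early kill setting in v when the call succeeds.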
| 91 | --- | ||
| 92 | |||
| 93 | Testing: | ||
| 94 | |||
| 95 | madvise(MADV_HWPOISON, ....) | ||
| 96 | (as root) | ||
| 97 | Poison a page in the process for testing | ||
| 98 | |||
| 99 | |||
| 100 | hwpoison-inject module through debugfs | ||
| 101 | /sys/debug/hwpoison/corrupt-pfn | ||
| 102 | |||
| 103 | Inject hwpoison fault at PFN echoed into this file | ||
| 104 | |||
| 105 | |||
| 106 | Architecture specific MCE injector | ||
| 107 | |||
| 108 | x86 has mce-inject, mce-test | ||
| 109 | |||
| 110 | Some portable hwpoison test programs in mce-test, see below. | ||
| 111 | |||
| 112 | --- | ||
| 113 | |||
| 114 | References: | ||
| 115 | |||
| 116 | http://halobates.de/mce-lc09-2.pdf | ||
| 117 | Overview presentation from LinuxCon 09 | ||
| 118 | |||
| 119 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | ||
| 120 | Test suite (hwpoison specific portable tests in tsrc) | ||
| 121 | |||
| 122 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | ||
| 123 | x86 specific injector | ||
| 124 | |||
| 125 | |||
| 126 | --- | ||
| 127 | |||
| 128 | Limitations: | ||
| 129 | |||
| 130 | - Not all page types are supported and never will be. Most kernel internal | ||
| 131 | objects cannot be recovered, only LRU pages for now. | ||
| 132 | - Right now hugepage support is missing. | ||
| 133 | |||
| 134 | --- | ||
| 135 | Andi Kleen, Oct 2009 | ||
| 136 | |||
