x86: move x86-specific documentation into Documentation/x86

The current organization of the x86 documentation makes it appear as if the "i386" documentation doesn't apply to x86-64, which it does. Thus, move that documentation into Documentation/x86, and move the x86-64-specific stuff into Documentation/x86/x86_64, with the eventual goal of moving stuff that isn't actually 64-bit specific back into Documentation/x86.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
author: H. Peter Anvin <hpa@zytor.com> 2008-05-30 20:19:03 -0400
committer: H. Peter Anvin <hpa@zytor.com> 2008-05-30 20:19:03 -0400
commit: 23deb06821442506615f34bd92ccd6a2422629d7 (patch)
tree: 5e95dba1471007a161e19844fab2d60d422f5423 /Documentation/x86/x86_64
parent: 4039feb5bae72a5fed9ba6bc1a9cfd8dfe0a8613 (diff)
8 files changed, 660 insertions, 0 deletions
diff --git a/Documentation/x86/x86_64/00-INDEX b/Documentation/x86/x86_64/00-INDEX
new file mode 100644
index 000000000000..92fc20ab5f0e
--- /dev/null
+++ b/Documentation/x86/x86_64/00-INDEX
@@ -0,0 +1,16 @@
+00-INDEX
+        - This file
+boot-options.txt
+        - AMD64-specific boot options.
+cpu-hotplug-spec
+        - Firmware support for CPU hotplug under Linux/x86-64
+fake-numa-for-cpusets
+        - Using numa=fake and CPUSets for Resource Management
+kernel-stacks
+        - Context-specific per-processor interrupt stacks.
+machinecheck
+        - Configurable sysfs parameters for the x86-64 machine check code.
+mm.txt
+        - Memory layout of x86-64 (4 level page tables, 46 bits physical).
+uefi.txt
+        - Booting Linux via Unified Extensible Firmware Interface.
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
new file mode 100644
index 000000000000..b0c7b6c4abda
--- /dev/null
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -0,0 +1,314 @@
+AMD64 specific boot options
+There are many others (usually documented in driver documentation), but
+only the AMD64 specific ones are listed here.
+Machine check
+   mce=off disable machine check
+   mce=bootlog Enable logging of machine checks left over from booting.
+               Disabled by default on AMD because some BIOSes leave bogus ones.
+               If your BIOS doesn't do that, it's a good idea to enable it anyway
+               to make sure you log even machine check events that result
+               in a reboot. On Intel systems it is enabled by default.
+   mce=nobootlog
+                Disable boot machine check logging.
+   mce=tolerancelevel (number)
+                0: always panic on uncorrected errors, log corrected errors
+                1: panic or SIGBUS on uncorrected errors, log corrected errors
+                2: SIGBUS or log uncorrected errors, log corrected errors
+                3: never panic or SIGBUS, log all errors (for testing only)
+                Default is 1
+                Can also be set using sysfs, which is preferable.
+   nomce (for compatibility with i386): same as mce=off
+   Everything else is in sysfs now.
+APICs
+   apic          Use IO-APIC. Default
+   noapic        Don't use the IO-APIC.
+   disableapic   Don't use the local APIC
+   nolapic       Don't use the local APIC (alias for i386 compatibility)
+   pirq=...      See Documentation/i386/IO-APIC.txt
+   noapictimer   Don't set up the APIC timer
+   no_timer_check Don't check the IO-APIC timer. This can work around
+                 problems with incorrect timer initialization on some boards.
+   apicmaintimer Run time keeping from the local APIC timer instead
+                 of using the PIT/HPET interrupt for this. This is useful
+                 when the PIT/HPET interrupts are unreliable.
+   noapicmaintimer  Don't do time keeping using the APIC timer.
+                 Useful when this option was auto selected, but doesn't work.
+   apicpmtimer
+                 Do APIC timer calibration using the pmtimer. Implies
+                 apicmaintimer. Useful when your PIT timer is totally
+                 broken.
+   disable_8254_timer / enable_8254_timer
+                 Disable or enable interrupt 0 timer routing over the 8254 in
+                 addition to over the IO-APIC. The kernel tries to set a sensible
+                 default.
+Early Console
+   syntax: earlyprintk=vga
+           earlyprintk=serial[,ttySn[,baudrate]]
+   The early console is useful when the kernel crashes before the
+   normal console is initialized. It is not enabled by
+   default because it has some cosmetic problems.
+   Append ,keep to not disable it when the real console takes over.
+   Only vga or serial at a time, not both.
+   Currently only ttyS0 and ttyS1 are supported.
+   Interaction with the standard serial driver is not very good.
+   The VGA output is eventually overwritten by the real console.
+Timing
+  notsc
+        Don't use the CPU time stamp counter to read the wall time.
+        This can be used to work around timing problems on multiprocessor
+        systems whose CPUs are not properly synchronized.
+  report_lost_ticks
+        Report when timer interrupts are lost because some code turned off
+        interrupts for too long.
+  nmi_watchdog=NUMBER[,panic]
+        NUMBER can be:
+        0 don't use an NMI watchdog
+        1 use the IO-APIC timer for the NMI watchdog
+        2 use the local APIC for the NMI watchdog using a performance
+          counter. Note that this will use one performance counter and the
+          local APIC's performance vector.
+        When panic is specified, panic when an NMI watchdog timeout occurs.
+        This is useful when you use a panic=... timeout and need the box
+        quickly up again.
+  nohpet
+        Don't use the HPET timer.
+Idle loop
+  idle=poll
+        Don't do power saving in the idle loop using HLT, but poll for a
+        rescheduling event. This will make the CPUs eat a lot more power, but
+        may be useful to get slightly better performance in multiprocessor
+        benchmarks. It also makes some profiling using performance counters
+        more accurate.
+        Please note that on systems with MONITOR/MWAIT support (like Intel
+        EM64T CPUs) this option has no performance advantage over the normal
+        idle loop. It may also interact badly with hyperthreading.
+Rebooting
+   reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
+   bios   Use the CPU reboot vector for warm reset
+   warm   Don't set the cold reboot flag
+   cold   Set the cold reboot flag
+   triple Force a triple fault (init)
+   kbd    Use the keyboard controller. cold reset (default)
+   acpi   Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
+          ACPI reset does not work, the reboot path attempts the reset using
+          the keyboard controller.
+   efi    Use efi reset_system runtime service. If EFI is not configured or the
+          EFI reset does not work, the reboot path attempts the reset using
+          the keyboard controller.
+   Using warm reset will be much faster especially on big memory
+   systems because the BIOS will not go through the memory check.
+   Disadvantage is that not all hardware will be completely reinitialized
+   on reboot so there may be boot problems on some systems.
+   reboot=force
+   Don't stop other CPUs on reboot. This can make reboot more reliable
+   in some cases.
+Non Executable Mappings
+  noexec=on|off
+  on      Enable (default)
+  off     Disable
+SMP
+  additional_cpus=NUM Allow NUM more CPUs for hotplug
+                 (defaults are specified by the BIOS, see Documentation/x86/x86_64/cpu-hotplug-spec)
+NUMA
+  numa=off      Only set up a single NUMA node spanning all memory.
+  numa=noacpi   Don't parse the SRAT table for NUMA setup
+  numa=fake=CMDLINE
+                If a number, fakes CMDLINE nodes and ignores NUMA setup of the
+                actual machine.  Otherwise, system memory is configured
+                depending on the sizes and coefficients listed.  For example:
+                        numa=fake=2*512,1024,4*256,*128
+                gives two 512M nodes, a 1024M node, four 256M nodes, and the
+                rest split into 128M chunks.  If the last character of CMDLINE
+                is a *, the remaining memory is divided up equally among its
+                coefficient:
+                        numa=fake=2*512,2*
+                gives two 512M nodes and the rest split into two nodes.
+                Otherwise, the remaining system RAM is allocated to an
+                additional node.
+  numa=hotadd=percent
+                Only allow hotadd memory to preallocate page structures up to
+                percent of already available memory.
+                numa=hotadd=0 will disable hotadd memory.
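Note: the numa=fake=CMDLINE rules described above can be sketched as a small parser. This is a minimal illustration in Python, not the kernel's implementation. The function name fake_numa_layout, the even split for a plain number, and the treatment of a leftover smaller than one chunk are assumptions, since the text does not specify them.

```python
def fake_numa_layout(cmdline, total_mb):
    """Return a list of fake node sizes in MB for a numa=fake= value."""
    if cmdline.isdigit():
        # Plain number: assumed here to mean an even split into that many nodes.
        count = int(cmdline)
        return [total_mb // count] * count

    nodes = []
    remaining = total_mb
    for term in filter(None, cmdline.split(",")):
        if term.startswith("*"):
            # "*128": carve the rest of memory into 128M chunks.
            chunk = int(term[1:])
            nodes += [chunk] * (remaining // chunk)
            remaining %= chunk
        elif term.endswith("*"):
            # "2*": divide the rest of memory equally among two nodes.
            count = int(term[:-1])
            nodes += [remaining // count] * count
            remaining = 0
        else:
            # "4*256" (coefficient and size) or "1024" (bare size).
            count, _, size = term.rpartition("*")
            for _ in range(int(count or 1)):
                take = min(int(size), remaining)
                if take == 0:
                    break
                nodes.append(take)
                remaining -= take
    if remaining:
        # Whatever is left over goes to one additional node.
        nodes.append(remaining)
    return nodes
```

With a hypothetical 4096M machine, fake_numa_layout("2*512,2*", 4096) yields two 512M nodes followed by two equal nodes covering the rest, matching the second example in the text.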
+ACPI
+  acpi=off      Don't enable ACPI
+  acpi=ht       Use ACPI boot table parsing, but don't enable ACPI
+                interpreter
+  acpi=force    Force ACPI on (currently not needed)
+  acpi=strict   Disable out of spec ACPI workarounds.
+  acpi_sci={edge,level,high,low}  Set up ACPI SCI interrupt.
+  acpi=noirq    Don't route interrupts
+PCI
+  pci=off       Don't use PCI
+  pci=conf1     Use conf1 access.
+  pci=conf2     Use conf2 access.
+  pci=rom       Assign ROMs.
+  pci=assign-busses    Assign busses
+  pci=irqmask=MASK             Set PCI interrupt mask to MASK
+  pci=lastbus=NUMBER           Scan up to NUMBER busses, no matter what the mptable says.
+  pci=noacpi            Don't use ACPI to set up PCI interrupt routing.
+IOMMU (input/output memory management unit)
+ Currently four x86-64 PCI-DMA mapping implementations exist:
+   1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+      (e.g. because you have < 3 GB memory).
+      Kernel boot message: "PCI-DMA: Disabling IOMMU"
+   2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
+      Kernel boot message: "PCI-DMA: using GART IOMMU"
+   3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+      e.g. if there is no hardware IOMMU in the system and it is needed because
+      you have >3GB memory or told the kernel to use it (iommu=soft).
+      Kernel boot message: "PCI-DMA: Using software bounce buffering
+      for IO (SWIOTLB)"
+   4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
+      pSeries and xSeries servers. This hardware IOMMU supports DMA address
+      mapping with memory protection, etc.
+      Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
+ iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]]
+        [,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
+        [,noaperture][,calgary]
+  General iommu options:
+    off                Don't initialize and use any kind of IOMMU.
+    noforce            Don't force hardware IOMMU usage when it is not needed.
+                       (default).
+    force              Force the use of the hardware IOMMU even when it is
+                       not actually needed (e.g. because < 3 GB memory).
+    soft               Use software bounce buffering (SWIOTLB) (default for
+                       Intel machines). This can be used to prevent the usage
+                       of an available hardware IOMMU.
+  iommu options only relevant to the AMD GART hardware IOMMU:
+    <size>             Set the size of the remapping area in bytes.
+    allowed            Override iommu off workarounds for specific chipsets.
+    fullflush          Flush IOMMU on each allocation (default).
+    nofullflush        Don't use IOMMU fullflush.
+    leak               Turn on simple iommu leak tracing (only when
+                       CONFIG_IOMMU_LEAK is on). Default number of leak pages
+                       is 20.
+    memaper[=<order>]  Allocate an own aperture over RAM with size 32MB<<order.
+                       (default: order=1, i.e. 64MB)
+    merge              Do scatter-gather (SG) merging. Implies "force"
+                       (experimental).
+    nomerge            Don't do scatter-gather (SG) merging.
+    noaperture         Ask the IOMMU not to touch the aperture for AGP.
+    forcesac           Force single-address cycle (SAC) mode for masks <40bits
+                       (experimental).
+    noagp              Don't initialize the AGP driver and use full aperture.
+    allowdac           Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
+                       DAC is used with 32-bit PCI to push a 64-bit address in
+                       two cycles. When off, all DMA above 4GB is forced through
+                       an IOMMU or software bounce buffering.
+    nodac              Forbid DAC mode, i.e. DMA >4GB.
+    panic              Always panic when IOMMU overflows.
+    calgary            Use the Calgary IOMMU if it is available
+  iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
+  implementation:
+    swiotlb=<pages>[,force]
+    <pages>            Prereserve that many 128K pages for the software IO
+                       bounce buffering.
+    force              Force all IO through the software TLB.
+  Settings for the IBM Calgary hardware IOMMU currently found in IBM
+  pSeries and xSeries machines:
+    calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
+    calgary=[translate_empty_slots]
+    calgary=[disable=<PCI bus number>]
+    panic              Always panic when IOMMU overflows
+    64k,...,8M - Set the size of each PCI slot's translation table
+    when using the Calgary IOMMU. This is the size of the translation
+    table itself in main memory. The smallest table, 64k, covers an IO
+    space of 32MB; the largest, 8MB table, can cover an IO space of
+    4GB. Normally the kernel will make the right choice by itself.
+    translate_empty_slots - Enable translation even on slots that have
+    no devices attached to them, in case a device will be hotplugged
+    in the future.
+    disable=<PCI bus number> - Disable translation on a given PHB. For
+    example, the built-in graphics adapter resides on the first bridge
+    (PCI bus number 0); if translation (isolation) is enabled on this
+    bridge, X servers that access the hardware directly from user
+    space might stop working. Use this option if you have devices that
+    are accessed from userspace directly on some PCI host bridge.
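Note: the size arithmetic in the IOMMU options above can be checked with a short sketch. The helper names are hypothetical. The factor of 512 for Calgary is derived only from the two data points the text gives (64k covers 32MB, 8M covers 4GB); applying it to the intermediate table sizes is an assumption.

```python
KB = 1 << 10
MB = 1 << 20
GB = 1 << 30

def gart_aperture_bytes(order=1):
    # memaper[=<order>]: aperture size is 32MB << order (default order=1).
    return (32 * MB) << order

def swiotlb_bytes(pages):
    # swiotlb=<pages>: the text counts the reservation in 128K pages.
    return pages * 128 * KB

def calgary_io_space_bytes(table_bytes):
    # Both documented data points give IO space = 512 * table size.
    return table_bytes * 512

if __name__ == "__main__":
    print(gart_aperture_bytes() // MB, "MB default GART aperture")
    print(calgary_io_space_bytes(8 * MB) // GB, "GB IO space for an 8M table")
```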
+Debugging
+  oops=panic    Always panic on oopses. Default is to just kill the process,
+                but there is a small probability of deadlocking the machine.
+                This will also cause panics on machine check exceptions.
+                Useful together with panic=30 to trigger a reboot.
+  kstack=N      Print N words from the kernel stack in oops dumps.
+  pagefaulttrace  Dump all page faults. Only useful for extreme debugging
+                and will create a lot of output.
+  call_trace=[old|both|newfallback|new]
+                old: use old inexact backtracer
+                new: use new exact dwarf2 unwinder
+                both: print entries from both
+                newfallback: use new unwinder but fall back to old if it gets
+                        stuck (default)
+Miscellaneous
+        nogbpages
+                Do not use GB pages for kernel direct mappings.
+        gbpages
+                Use GB pages for kernel direct mappings.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec
new file mode 100644
index 000000000000..3c23e0587db3
--- /dev/null
+++ b/Documentation/x86/x86_64/cpu-hotplug-spec
@@ -0,0 +1,21 @@
+Firmware support for CPU hotplug under Linux/x86-64
+---------------------------------------------------
+Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
+know in advance of boot time the maximum number of CPUs that could be plugged
+into the system. ACPI 3.0 currently has no official way to supply
+this information from the firmware to the operating system.
+In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
+ACPI 3.0 specification).  ACPI already has the concept of disabled LAPIC
+objects by setting the Enabled bit in the LAPIC object to zero.
+For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
+CPU is already available in the MADT. If the CPU is not available yet
+it should have its LAPIC Enabled bit set to 0. Linux will use the number
+of disabled LAPICs to compute the maximum number of future CPUs.
+In the worst case the user can override this choice using a command line
+option (additional_cpus=...), but it is recommended to supply the correct
+number (or a reasonable approximation of it, erring towards more rather than
+fewer) in the MADT to avoid manual configuration.
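Note: the counting rule above can be sketched as follows. This is an illustration only. The MADT is modeled as a plain list of LAPIC Enabled bits, possible_cpus is a hypothetical name, and treating additional_cpus as a replacement for the disabled-LAPIC count is an assumption based on the wording "override this choice".

```python
def possible_cpus(lapic_enabled_bits, additional_cpus=None):
    """Estimate the maximum CPU count from MADT LAPIC Enabled bits."""
    present = sum(1 for bit in lapic_enabled_bits if bit)
    # Disabled LAPIC objects stand for CPUs that may be hotplugged later.
    disabled = len(lapic_enabled_bits) - present
    hotpluggable = disabled if additional_cpus is None else additional_cpus
    return present + hotpluggable
```

For example, a MADT with two enabled and two disabled LAPIC objects gives possible_cpus([1, 1, 0, 0]) == 4.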
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets
new file mode 100644
index 000000000000..d1a985c5b00a
--- /dev/null
+++ b/Documentation/x86/x86_64/fake-numa-for-cpusets
@@ -0,0 +1,66 @@
+Using numa=fake and CPUSets for Resource Management
+Written by David Rientjes <rientjes@cs.washington.edu>
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management.  Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks.  This is a way of limiting the
+amount of system memory that is available to a certain class of tasks.
+For more information on the features of cpusets, see Documentation/cpusets.txt.
+There are a number of different configurations you can use for your needs.  For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,".  This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets.  As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+        Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+        Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+        Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+        Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+        ...
+        On node 0 totalpages: 130975
+        On node 1 totalpages: 131072
+        On node 2 totalpages: 131072
+        On node 3 totalpages: 131072
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets:
+        [root@xroads /]# mkdir exampleset
+        [root@xroads /]# mount -t cpuset none exampleset
+        [root@xroads /]# mkdir exampleset/ddset
+        [root@xroads /]# cd exampleset/ddset
+        [root@xroads /exampleset/ddset]# echo 0-1 > cpus
+        [root@xroads /exampleset/ddset]# echo 0-1 > mems
+Now this cpuset, 'ddset', will only be allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems:
+        [root@xroads /exampleset/ddset]# echo $$ > tasks
+        [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
+        [1] 13425
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+                                Unrestricted    Restricted
+        MemTotal:               3091900 kB      3091900 kB
+        MemFree:                  42113 kB      1513236 kB
+This allows for coarse memory management for the tasks you assign to particular
+cpusets.  Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
new file mode 100644
index 000000000000..5ad65d51fb95
--- /dev/null
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -0,0 +1,99 @@
+Most of the text from Keith Owens, hacked by AK
+x86_64 page size (PAGE_SIZE) is 4K.
+Like all other architectures, x86_64 has a kernel stack for every
+active thread.  These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
+These stacks contain useful data as long as a thread is alive or a
+zombie. While the thread is in user space the kernel stack is empty
+except for the thread_info structure at the bottom.
+In addition to the per thread stacks, there are specialized stacks
+associated with each CPU.  These stacks are only used while the kernel
+is in control on that CPU; when a CPU returns to user space the
+specialized stacks contain no useful data.  The main CPU stacks are:
+* Interrupt stack.  IRQSTACKSIZE
+  Used for external hardware interrupts.  If this is the first external
+  hardware interrupt (i.e. not a nested hardware interrupt) then the
+  kernel switches from the current task to the interrupt stack.  Like
+  the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS),
+  this gives more room for kernel interrupt processing without having
+  to increase the size of every per thread stack.
+  The interrupt stack is also used when processing a softirq.
+Switching to the kernel interrupt stack is done by software based on a
+per CPU interrupt nest counter. This is needed because x86-64 "IST"
+hardware stacks cannot nest without races.
+x86_64 also has a feature which is not available on i386, the ability
+to automatically switch to a new stack for designated events such as
+double fault or NMI, which makes it easier to handle these unusual
+events on x86_64.  This feature is called the Interrupt Stack Table
+(IST).  There can be up to 7 IST entries per CPU. The IST code is an
+index into the Task State Segment (TSS). The IST entries in the TSS
+point to dedicated stacks; each stack can be a different size.
+An IST is selected by a non-zero value in the IST field of an
+interrupt-gate descriptor.  When an interrupt occurs and the hardware
+loads such a descriptor, the hardware automatically sets the new stack
+pointer based on the IST value, then invokes the interrupt handler.  If
+software wants to allow nested IST interrupts then the handler must
+adjust the IST values on entry to and exit from the interrupt handler.
+(This is occasionally done, e.g. for debug exceptions.)
+Events with different IST codes (i.e. with different stacks) can be
+nested.  For example, a debug interrupt can safely be interrupted by an
+NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
+pointers on entry to and exit from all IST events, in theory allowing
+IST events with the same code to be nested.  However in most cases, the
+stack size allocated to an IST assumes no nesting for the same code.
+If that assumption is ever broken then the stacks will become corrupt.
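Note: the IST selection rule above can be modeled in a few lines. This is a simplified sketch, not kernel code. The name select_stack_top and the list standing in for the TSS are hypothetical, and the IST=0 case is reduced to "keep the current stack", ignoring the other stack-switch rules the hardware applies.

```python
IST_SLOTS = 7  # the text allows up to 7 IST entries per CPU

def select_stack_top(ist_field, tss_ist, current_sp):
    """Return the stack pointer the handler starts on."""
    if ist_field == 0:
        # No IST requested: simplified here to staying on the current stack.
        return current_sp
    if not 1 <= ist_field <= IST_SLOTS:
        raise ValueError("IST field must be 0..7")
    # A non-zero IST field indexes the IST entries held in the TSS.
    return tss_ist[ist_field - 1]

if __name__ == "__main__":
    tss_ist = [0x1000 * (i + 1) for i in range(IST_SLOTS)]  # made-up stack tops
    first = select_stack_top(3, tss_ist, current_sp=0x9000)
    # A second event with the same IST code arrives while the first handler
    # is running a little below the top of that stack:
    nested = select_stack_top(3, tss_ist, current_sp=first - 0x40)
    print(hex(first), hex(nested))
```

Both calls return the same stack top, which is why same-code nesting overwrites the first handler's frame unless the handler adjusts the IST value on entry and exit.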
+The currently assigned IST stacks are:
+* STACKFAULT_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+  Used for interrupt 12 - Stack Fault Exception (#SS).
+  This allows the CPU to recover from invalid stack segments. Rarely
+  happens.
+* DOUBLEFAULT_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+  Used for interrupt 8 - Double Fault Exception (#DF).
+  Invoked when handling one exception causes another exception. Happens
+  when the kernel is very confused (e.g. kernel stack pointer corrupt).
+  Using a separate stack allows the kernel to recover from it well enough
+  in many cases to still output an oops.
+* NMI_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+  Used for non-maskable interrupts (NMI).
+  NMI can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for NMI events avoids making
+  assumptions about the previous state of the kernel stack.
+* DEBUG_STACK.  DEBUG_STKSZ
+  Used for hardware debug interrupts (interrupt 1) and for software
+  debug interrupts (INT3).
+  When debugging a kernel, debug interrupts (both hardware and
+  software) can occur at any time.  Using IST for these interrupts
+  avoids making assumptions about the previous state of the kernel
+  stack.
+* MCE_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+  Used for interrupt 18 - Machine Check Exception (#MC).
+  MCE can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for MCE events avoids making
+  assumptions about the previous state of the kernel stack.
+For more details see the Intel IA32 or AMD AMD64 architecture manuals.
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
new file mode 100644
index 000000000000..a05e58e7b159
--- /dev/null
+++ b/Documentation/x86/x86_64/machinecheck
@@ -0,0 +1,77 @@
+Configurable sysfs parameters for the x86-64 machine check code.
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevent is CPU specific.
+mcelog knows how to decode them.
+When you see the "Machine check errors logged" message in the system
+log then mcelog should run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number)
+The directory contains some configurable entries:
+Entries:
+bankNctl
+(N bank number)
+        64bit Hex bitmask enabling/disabling specific subevents for bank N
+        When a bit in the bitmask is zero then the respective
+        subevent will not be reported.
+        By default all events are enabled.
+        Note that the BIOS maintains another mask to disable specific events
+        per bank.  This is not visible here.
+The following entries appear for each CPU, but they are truly shared
+between all CPUs.
+check_interval
+        How often to poll for corrected machine check errors, in seconds
+        (Note output is hexadecimal). Default 5 minutes.  When the poller
+        finds MCEs it triggers an exponential speedup (poll more often) on
+        the polling interval.  When the poller stops finding MCEs, it
+        triggers an exponential backoff (poll less often) on the polling
+        interval. The check_interval variable is both the initial and
+        maximum polling interval.
+tolerant
+        Tolerance level. When a machine check exception occurs for a non
+        corrected machine check the kernel can take different actions.
+        Since machine check exceptions can happen any time it is sometimes
+        risky for the kernel to kill a process because it defies
+        normal kernel locking rules. The tolerance level configures
+        how hard the kernel tries to recover even at some risk of
+        deadlock.  Higher tolerant values trade potentially better uptime
+        with the risk of a crash or even corruption (for tolerant >= 3).
+        0: always panic on uncorrected errors, log corrected errors
+        1: panic or SIGBUS on uncorrected errors, log corrected errors
+        2: SIGBUS or log uncorrected errors, log corrected errors
+        3: never panic or SIGBUS, log all errors (for testing only)
+        Default: 1
+        Note this only makes a difference if the CPU allows recovery
+        from a machine check exception. Current x86 CPUs generally do not.
+trigger
+        Program to run when a machine check event is detected.
+        This is an alternative to running mcelog regularly from cron
+        and allows events to be detected faster.
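Note: the check_interval behaviour described above can be sketched as below. This is an illustration under stated assumptions, not the kernel's poller. The function names, the factor of two, and the one-second floor are hypothetical; the text only says the change is exponential and that check_interval is both the initial and the maximum interval.

```python
def parse_check_interval(sysfs_text):
    # The sysfs file reports the interval in hexadecimal seconds.
    return int(sysfs_text.strip(), 16)

def next_poll_interval(current, found_mce, check_interval, floor=1):
    """Return the next polling interval in seconds."""
    if found_mce:
        # Speed up: poll more often while errors keep appearing.
        return max(current // 2, floor)
    # Back off: poll less often, but never beyond check_interval.
    return min(current * 2, check_interval)
```

The documented default of 5 minutes is 300 seconds, which reads back from sysfs as 12c.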
+TBD document entries for AMD threshold interrupt configuration
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+For more details about the architecture
+see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
new file mode 100644
index 000000000000..b89b6d2bebfa
--- /dev/null
+++ b/Documentation/x86/x86_64/mm.txt
@@ -0,0 +1,29 @@
+<previous description obsolete, deleted>
+Virtual memory map with 4 level page tables:
+0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
+hole caused by [48:63] sign extension
+ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
+ffff810000000000 - ffffc0ffffffffff (=46 bits) direct mapping of all phys. memory
+ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole
+ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space
+ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB)
+... unused hole ...
+ffffffff80000000 - ffffffff82800000 (=40 MB)   kernel text mapping, from phys 0
+... unused hole ...
+ffffffff88000000 - fffffffffff00000 (=1919 MB) module mapping space
+The direct mapping covers all memory in the system up to the highest
+memory address (this means in some cases it can also include PCI memory
+holes).
+vmalloc space is lazily synchronized into the different PML4 pages of
+the processes using the page fault handler, with init_level4_pgt as
+reference.
+Current X86-64 implementations only support 40 bits of address space,
+but we support up to 46 bits. This expands into MBZ space in the page tables.
+-Andi Kleen, Jul 2004
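Note: the bit widths and sizes quoted in the memory map above can be recomputed from the address ranges. The sketch below is only a consistency check on the table; the helper names are hypothetical, and is_canonical assumes 48-bit virtual addresses as implied by the "[48:63] sign extension" line.

```python
REGIONS = [  # (name, first address, last address), taken from the table
    ("user space",         0x0000000000000000, 0x00007fffffffffff),
    ("guard hole",         0xffff800000000000, 0xffff80ffffffffff),
    ("direct mapping",     0xffff810000000000, 0xffffc0ffffffffff),
    ("hole",               0xffffc10000000000, 0xffffc1ffffffffff),
    ("vmalloc/ioremap",    0xffffc20000000000, 0xffffe1ffffffffff),
    ("virtual memory map", 0xffffe20000000000, 0xffffe2ffffffffff),
]

def region_bits(first, last):
    """Return log2 of the region size; the size must be a power of two."""
    size = last - first + 1
    if size & (size - 1):
        raise ValueError("region size is not a power of two")
    return size.bit_length() - 1

def span_mb(start, end):
    # For the two ranges whose upper bound is written as an end address.
    return (end - start) >> 20

def is_canonical(addr):
    # Bits 47..63 must all be equal (sign extension of bit 47).
    return (addr >> 47) in (0, 0x1ffff)

if __name__ == "__main__":
    for name, first, last in REGIONS:
        print(f"{name:20s} {region_bits(first, last)} bits")
    print(span_mb(0xffffffff80000000, 0xffffffff82800000), "MB kernel text")
    print(span_mb(0xffffffff88000000, 0xfffffffffff00000), "MB modules")
```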
diff --git a/Documentation/x86/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.txt
new file mode 100644
index 000000000000..7d77120a5184
--- /dev/null
+++ b/Documentation/x86/x86_64/uefi.txt
@@ -0,0 +1,38 @@
+General note on [U]EFI x86_64 support
+-------------------------------------
+The terms EFI and UEFI are used interchangeably in this document.
+Although the tools below are _not_ needed for building the kernel,
+the needed bootloader support and associated tools for x86_64 platforms
+with EFI firmware and specifications are listed below.
+1. UEFI specification:  http://www.uefi.org
+2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
+   support. Elilo with x86_64 support can be used.
+3. x86_64 platform with EFI/UEFI firmware.
+Mechanics:
+---------
+- Build the kernel with the following configuration.
+        CONFIG_FB_EFI=y
+        CONFIG_FRAMEBUFFER_CONSOLE=y
+  If EFI runtime services are expected, the following configuration should
+  be selected.
+        CONFIG_EFI=y
+        CONFIG_EFI_VARS=y or m          # optional
+- Create a VFAT partition on the disk
+- Copy the following to the VFAT partition:
+        elilo bootloader with x86_64 support, elilo configuration file,
+        kernel image built in first step and corresponding
+        initrd. Instructions on building elilo  and its dependencies
+        can be found in the elilo sourceforge project.
+- Boot to EFI shell and invoke elilo choosing the kernel image built
+  in first step.
+- If some or all EFI runtime services don't work, you can try the following
+  kernel command line parameters to turn off some or all EFI runtime
+  services.
+        noefi           turn off all EFI runtime services
+        reboot_type=k   turn off EFI reboot runtime service