6 files changed, 268 insertions, 81 deletions
diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt
new file mode 100644
index 000000000000..c2321903aa09
--- /dev/null
+++ b/Documentation/Intel-IOMMU.txt
@@ -0,0 +1,115 @@
+Linux IOMMU Support
+===================
+The architecture spec can be obtained from the below location.
+http://www.intel.com/technology/virtualization/
+This guide gives a quick cheat sheet for some basic understanding.
+Some Keywords
+DMAR - DMA remapping
+DRHD - DMA Remapping Hardware Unit Definition
+RMRR - Reserved Memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - I/O Virtual Address
+Basic stuff
+-----------
+ACPI enumerates the DMA engines in the platform and lists the device scope
+relationships between PCI devices and the DMA engine that controls them.
+What is RMRR?
+-------------
+There are some devices the BIOS controls, e.g. USB devices used to perform
+PS2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence the BIOS uses RMRR to specify these regions along
+with the devices that need to access them. The OS is expected to set up
+unity (1-1) mappings of these regions for those devices.
+How is IOVA generated?
+----------------------
+Well behaved drivers call pci_map_*() before sending a command to a device
+that needs to perform DMA. Once DMA is completed and the mapping is no longer
+required, the driver calls pci_unmap_*() to unmap the region.
+The Intel IOMMU driver allocates a virtual address per domain. Each PCIE
+device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address with all devices under the p2p bridge due to
+transaction id aliasing for p2p bridges.
+IOVA generation is pretty generic. We use the same technique as vmalloc(),
+but these are not global address spaces; they are separate for each domain.
+Different DMA engines may support different numbers of domains.
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+Graphics Problems?
+------------------
+If you encounter issues with graphics devices, you can try adding the
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+If it happens to be a PCI device included in the INCLUDE_ALL Engine,
+then try enabling CONFIG_DMAR_GFX_WA to set up a 1-1 map. We hear
+graphics drivers may be in the process of using DMA APIs in the near
+future, and at that time this option can be removed.
+Some exceptions to IOVA
+-----------------------
+Interrupt ranges (0xfee00000 - 0xfeefffff) are not address translated.
+The same is true for peer to peer transactions. Hence we reserve the
+addresses in the PCI MMIO ranges so they are not allocated as IOVAs.
+Fault reporting
+---------------
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and the device that caused the fault are printed on the console.
+See below for a sample.
+Boot Message Sample
+-------------------
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+When DMAR is being processed and initialized by ACPI, the DMAR locations
+and any RMRRs processed are printed.
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+When DMAR is enabled for use, you will notice:
+PCI-DMA: Using DMAR IOMMU
+Fault reporting sample
+----------------------
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+TBD
+----
+- For compatibility testing, could use a unity map domain for all devices:
+  just provide a 1-1 map of all useful memory under a single domain.
+- API for paravirt ops for abstracting functionality for VMM folks.
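Note on the "Some exceptions to IOVA" section of Intel-IOMMU.txt above: the reservation it describes amounts to an overlap test between a candidate IOVA range and the untranslated interrupt range. The following is only a minimal userspace sketch of that test, not the driver's code. The macro and function names are hypothetical; only the two range bounds come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt address range quoted in the text; it is not address translated. */
#define INTR_RANGE_START 0xfee00000ULL
#define INTR_RANGE_END   0xfeefffffULL

/*
 * Hypothetical helper: returns true if the inclusive range [start, end]
 * overlaps the interrupt range and so must not be handed out as an IOVA.
 */
static bool iova_overlaps_intr_range(uint64_t start, uint64_t end)
{
    return start <= INTR_RANGE_END && end >= INTR_RANGE_START;
}
```

An allocator built along these lines would skip any candidate range for which the helper returns true. The same test would apply to each PCI MMIO window that is reserved.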
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting
index 31047e0fe14b..87019d2b5981 100644
--- a/Documentation/filesystems/Exporting
+++ b/Documentation/filesystems/Exporting
@@ -2,9 +2,12 @@
 Making Filesystems Exportable
 =============================
-Most filesystem operations require a dentry (or two) as a starting
+Overview
+--------
+All filesystem operations require a dentry (or two) as a starting
 point.  Local applications have a reference-counted hold on suitable
-dentrys via open file descriptors or cwd/root.  However remote
+dentries via open file descriptors or cwd/root.  However remote
 applications that access a filesystem via a remote filesystem protocol
 such as NFS may not be able to hold such a reference, and so need a
 different way to refer to a particular dentry.  As the alternative
@@ -13,14 +16,14 @@ server-reboot (among other things, though these tend to be the most
 problematic), there is no simple answer like 'filename'.
 The mechanism discussed here allows each filesystem implementation to
-specify how to generate an opaque (out side of the filesystem) byte
+specify how to generate an opaque (outside of the filesystem) byte
 string for any dentry, and how to find an appropriate dentry for any
 given opaque byte string.
 This byte string will be called a "filehandle fragment" as it
 corresponds to part of an NFS filehandle.
 A filesystem which supports the mapping between filehandle fragments
-and dentrys will be termed "exportable".
+and dentries will be termed "exportable".
@@ -89,11 +92,9 @@ For a filesystem to be exportable it must:
   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
-      Typically the ->lookup routine will end:
+      Typically the ->lookup routine will end with a:
-                if (inode)
-                        return d_splice(inode, dentry);
+                return d_splice_alias(inode, dentry);
-                d_add(dentry, inode);
-                return NULL;
        }
@@ -101,67 +102,39 @@ For a filesystem to be exportable it must:
  A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
 super_block.  This field must point to a "struct export_operations"
-struct which could potentially be full of NULLs, though normally at
+struct which has the following members:
-least get_parent will be set.
+  encode_fh (optional)
- The primary operations are decode_fh and encode_fh.  
+    Takes a dentry and creates a filehandle fragment which can later be used
-decode_fh takes a filehandle fragment and tries to find or create a
+    to find or create a dentry for the same object.  The default
-dentry for the object referred to by the filehandle.
+    implementation creates a filehandle fragment that encodes a 32bit inode
-encode_fh takes a dentry and creates a filehandle fragment which can
+    and generation number for the inode encoded, and if necessary the
-later be used to find/create a dentry for the same object.
+    same information for the parent.
-decode_fh will probably make use of "find_exported_dentry".
+  fh_to_dentry (mandatory)
-This function lives in the "exportfs" module which a filesystem does
+    Given a filehandle fragment, this should find the implied object and
-not need unless it is being exported.  So rather that calling
+    create a dentry for it (possibly with d_alloc_anon).
-find_exported_dentry directly, each filesystem should call it through
-the find_exported_dentry pointer in it's export_operations table.
+  fh_to_parent (optional but strongly recommended)
-This field is set correctly by the exporting agent (e.g. nfsd) when a
+    Given a filehandle fragment, this should find the parent of the
-filesystem is exported, and before any export operations are called.
+    implied object and create a dentry for it (possibly with d_alloc_anon).
+    May fail if the filehandle fragment is too small.
-find_exported_dentry needs three support functions from the
-filesystem:
+  get_parent (optional but strongly recommended)
-  get_name.  When given a parent dentry and a child dentry, this
+    When given a dentry for a directory, this should return a dentry for
-    should find a name in the directory identified by the parent
+    the parent.  Quite possibly the parent dentry will have been allocated
-    dentry, which leads to the object identified by the child dentry.
+    by d_alloc_anon.  The default get_parent function just returns an error
-    If no get_name function is supplied, a default implementation is
+    so any filehandle lookup that requires finding a parent will fail.
-    provided which uses vfs_readdir to find potential names, and
+    ->lookup("..") is *not* used as a default as it can leave ".." entries
-    matches inode numbers to find the correct match.
+    in the dcache which are too messy to work with.
-  get_parent.  When given a dentry for a directory, this should return 
+  get_name (optional)
-    a dentry for the parent.  Quite possibly the parent dentry will
+    When given a parent dentry and a child dentry, this should find a name
-    have been allocated by d_alloc_anon.  
+    in the directory identified by the parent dentry, which leads to the
-    The default get_parent function just returns an error so any
+    object identified by the child dentry.  If no get_name function is
-    filehandle lookup that requires finding a parent will fail.
+    supplied, a default implementation is provided which uses vfs_readdir
-    ->lookup("..") is *not* used as a default as it can leave ".."
+    to find potential names, and matches inode numbers to find the correct
-    entries in the dcache which are too messy to work with.
+    match.
-  get_dentry.  When given an opaque datum, this should find the
-    implied object and create a dentry for it (possibly with
-    d_alloc_anon). 
-    The opaque datum is whatever is passed down by the decode_fh
-    function, and is often simply a fragment of the filehandle
-    fragment.
-    decode_fh passes two datums through find_exported_dentry.  One that 
-    should be used to identify the target object, and one that can be
-    used to identify the object's parent, should that be necessary.
-    The default get_dentry function assumes that the datum contains an
-    inode number and a generation number, and it attempts to get the
-    inode using "iget" and check it's validity by matching the
-    generation number.  A filesystem should only depend on the default
-    if iget can safely be used this way.
-If decode_fh and/or encode_fh are left as NULL, then default
-implementations are used.  These defaults are suitable for ext2 and 
-extremely similar filesystems (like ext3).
-The default encode_fh creates a filehandle fragment from the inode
-number and generation number of the target together with the inode
-number and generation number of the parent (if the parent is
-required).
-The default decode_fh extract the target and parent datums from the
-filehandle assuming the format used by the default encode_fh and
-passed them to find_exported_dentry.
 A filehandle fragment consists of an array of 1 or more 4byte words,
@@ -172,5 +145,3 @@ generated by encode_fh, in which case it will have been padded with
 nuls.  Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how
 it should be interpreted.
- 
diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
index 35985b34d5a6..2f75e750e4f5 100644
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -168,6 +168,8 @@ Offset	Proto	Name		Meaning
 0234/1  2.05+   relocatable_kernel Whether kernel is relocatable or not
 0235/3  N/A     pad2            Unused
 0238/4  2.06+   cmdline_size    Maximum size of the kernel command line
+023C/4  2.07+   hardware_subarch Hardware subarchitecture
+0240/8  2.07+   hardware_subarch_data Subarchitecture-specific data
 (1) For backwards compatibility, if the setup_sects field contains 0, the
    real value is 4.
@@ -204,7 +206,7 @@ boot loaders can ignore those fields.
 The byte order of all fields is littleendian (this is x86, after all.)
-Field name:     setup_secs
+Field name:     setup_sects
 Type:           read
 Offset/size:    0x1f1/1
 Protocol:       ALL
@@ -356,6 +358,13 @@ Protocol:	2.00+
        - If 0, the protected-mode code is loaded at 0x10000.
        - If 1, the protected-mode code is loaded at 0x100000.
+  Bit 6 (write): KEEP_SEGMENTS
+        Protocol: 2.07+
+        - If 0, reload the segment registers in the 32-bit entry point.
+        - If 1, do not reload the segment registers in the 32-bit entry point.
+                Assume that %cs, %ds, %ss and %es are all set to flat segments
+                with a base of 0 (or the equivalent for their environment).
  Bit 7 (write): CAN_USE_HEAP
        Set this bit to 1 to indicate that the value entered in the
        heap_end_ptr is valid.  If this field is clear, some setup code
@@ -480,6 +489,29 @@ Protocol:	2.06+
  cmdline_size characters. With protocol version 2.05 and earlier, the
  maximum size was 255.
+Field name:     hardware_subarch
+Type:           write
+Offset/size:    0x23c/4
+Protocol:       2.07+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers need to be done differently.
+  This field allows the bootloader to inform the kernel that we are in
+  one of those environments.
+  0x00000000    The default x86/PC environment
+  0x00000001    lguest
+  0x00000002    Xen
+Field name:     hardware_subarch_data
+Type:           write
+Offset/size:    0x240/8
+Protocol:       2.07+
+  A pointer to data that is specific to the hardware subarch.
 **** THE KERNEL COMMAND LINE
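Note on the hardware_subarch fields added in the boot.txt hunk above: a rough userspace sketch of how the two fields could be decoded from a raw copy of the setup header, using the offsets and sizes from the table (0x23c/4 and 0x240/8) and the little-endian byte order the document states. The function and macro names are hypothetical, and the buffer is assumed to start at offset 0 of the boot sector. A real consumer would also check that the protocol version is 2.07 or later before trusting these fields, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets taken from the field table in boot.txt. */
#define HDR_HARDWARE_SUBARCH      0x23c  /* 4 bytes */
#define HDR_HARDWARE_SUBARCH_DATA 0x240  /* 8 bytes */

/* Read an nbytes-wide little-endian integer (nbytes <= 8). */
static uint64_t read_le(const uint8_t *p, size_t nbytes)
{
    uint64_t v = 0;
    size_t i;

    for (i = 0; i < nbytes; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/*
 * Hypothetical helper: extract both fields from a header image.
 * Returns 0 on success, -1 if the buffer is too short to hold them.
 */
static int boot_hdr_read_subarch(const uint8_t *hdr, size_t len,
                                 uint32_t *subarch, uint64_t *data)
{
    if (len < HDR_HARDWARE_SUBARCH_DATA + 8)
        return -1;
    *subarch = (uint32_t)read_le(hdr + HDR_HARDWARE_SUBARCH, 4);
    *data = read_le(hdr + HDR_HARDWARE_SUBARCH_DATA, 8);
    return 0;
}

/* Map the values listed in the document to names. */
static const char *subarch_name(uint32_t subarch)
{
    switch (subarch) {
    case 0x00000000: return "default x86/PC";
    case 0x00000001: return "lguest";
    case 0x00000002: return "Xen";
    default:         return "unknown";
    }
}
```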
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 6166e2d7da76..7a7753321a26 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -519,17 +519,17 @@ more details, with real examples.
        to the user why it stops.
    cc-cross-prefix
-        cc-cross-prefix is used to check if there exist a $(CC) in path with
+        cc-cross-prefix is used to check if there exists a $(CC) in path with
        one of the listed prefixes. The first prefix where there exist a
        prefix$(CC) in the PATH is returned - and if no prefix$(CC) is found
        then nothing is returned.
        Additional prefixes are separated by a single space in the
        call of cc-cross-prefix.
-        This functionality is usefull for architecture Makefile that try
+        This functionality is useful for architecture Makefiles that try
-        to set CROSS_COMPILE to well know values but may have several
+        to set CROSS_COMPILE to well-known values but may have several
        values to select between.
-        It is recommended only to try to set CROSS_COMPILE is it is a cross
+        It is recommended only to try to set CROSS_COMPILE if it is a cross
-        build (host arch is different from target arch). And is CROSS_COMPILE
+        build (host arch is different from target arch). And if CROSS_COMPILE
        is already set then leave it with the old value.
        Example:
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6accd360da73..b2361667839f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -772,6 +772,23 @@ and is between 256 and 4096 characters. It is defined in the file
        inttest=        [IA64]
+        intel_iommu=    [DMAR] Intel IOMMU driver (DMAR) option
+                off
+                        Disable the Intel IOMMU driver.
+                igfx_off [Default Off]
+                        By default, gfx is mapped as a normal device. If a
+                        gfx device has a dedicated DMAR unit, this option
+                        bypasses that unit by not enabling DMAR on it. In
+                        this case, the gfx device will use physical addresses
+                        for DMA.
+                forcedac [x86_64]
+                        With this option the iommu will not first look for
+                        an io virtual address below 32 bits, forcing dual
+                        address cycle on the pci bus for cards supporting
+                        greater than 32 bit addressing. The default is to
+                        look for a translation below 32 bits and, if none is
+                        available, to look in the higher range.
        io7=            [HW] IO7 for Marvel based alpha systems
                        See comment before marvel_specify_io7 in
                        arch/alpha/kernel/core_marvel.c.
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 5fbcc22c98e9..168117bd6ee8 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -2,7 +2,8 @@
 Memory Hotplug
 ==============
-Last Updated: Jul 28 2007
+Created:                                        Jul 28 2007
+Add description of notifier of memory hotplug   Oct 11 2007
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -24,7 +25,8 @@ be changed often.
  6.1 Memory offline and ZONE_MOVABLE
  6.2. How to offline memory
 7. Physical memory remove
-8. Future Work List
+8. Memory hotplug event notifier
+9. Future Work List
 Note(1): x86_64's has special implementation for memory hotplug.
         This text does not describe it.
@@ -307,8 +309,58 @@ Need more implementation yet....
 - Notification completion of remove works by OS to firmware.
 - Guard from remove if not yet.
+--------------------------------
+8. Memory hotplug event notifier
+--------------------------------
+Memory hotplug has an event notifier. There are 6 types of notification.
+MEMORY_GOING_ONLINE
+  Generated before new memory becomes available in order to be able to
+  prepare subsystems to handle memory. The page allocator is still unable
+  to allocate from the new memory.
+MEMORY_CANCEL_ONLINE
+  Generated if MEMORY_GOING_ONLINE fails.
+MEMORY_ONLINE
+  Generated when memory has been successfully brought online. The callback
+  may allocate pages from the new memory.
+MEMORY_GOING_OFFLINE
+  Generated to begin the process of offlining memory. Allocations are no
+  longer possible from the memory but some of the memory to be offlined
+  is still in use. The callback can be used to free memory known to a
+  subsystem from the indicated memory section.
+MEMORY_CANCEL_OFFLINE
+  Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
+  the section that we attempted to offline.
+MEMORY_OFFLINE
+  Generated after offlining memory is complete.
+A callback routine can be registered by
+  hotplug_memory_notifier(callback_func, priority)
+The second argument of the callback function (action) is one of the event
+types above. The third argument is a pointer to a struct memory_notify.
+struct memory_notify {
+       unsigned long start_pfn;
+       unsigned long nr_pages;
+       int status_change_nid;
+};
+start_pfn is the start pfn of the online/offline memory.
+nr_pages is the number of pages of the online/offline memory.
+status_change_nid is set to the node id when N_HIGH_MEMORY of the nodemask is
+(or will be) set/cleared, i.e. when a new (memoryless) node gets memory by
+online or a node loses all of its memory. If this is -1, then the nodemask
+status is not changed.
+If status_change_nid >= 0, the callback should create/discard structures for
+the node if necessary.
 --------------
-8. Future Work
+9. Future Work
 --------------
  - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
    sysctl or new control file.
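Note on section 8 of memory-hotplug.txt above: the callback contract it describes (action as the second argument, a struct memory_notify pointer as the third, per-node structures created or discarded when status_change_nid >= 0) can be sketched roughly as follows. This is a userspace mock. The enum values, NOTIFY_OK, MAX_NODES, the opaque struct notifier_block and the node_ready[] table are stand-ins defined here so the sketch compiles outside the kernel, and they are not the kernel's definitions. Only the event names and the struct layout come from the text.

```c
#include <stddef.h>

/* Stand-in event codes; the real values live in the kernel headers. */
enum {
    MEMORY_GOING_ONLINE,
    MEMORY_CANCEL_ONLINE,
    MEMORY_ONLINE,
    MEMORY_GOING_OFFLINE,
    MEMORY_CANCEL_OFFLINE,
    MEMORY_OFFLINE
};
#define NOTIFY_OK 1      /* stand-in return code */
#define MAX_NODES 8      /* hypothetical limit for this sketch */

struct notifier_block;   /* opaque in this sketch */

struct memory_notify {
    unsigned long start_pfn;
    unsigned long nr_pages;
    int status_change_nid;
};

/* Hypothetical per-node state owned by the subsystem. */
static int node_ready[MAX_NODES];

static int example_memory_callback(struct notifier_block *self,
                                   unsigned long action, void *arg)
{
    struct memory_notify *mn = arg;
    int nid = mn->status_change_nid;

    (void)self;

    /* -1 means the nodemask status does not change: nothing to do. */
    if (nid < 0 || nid >= MAX_NODES)
        return NOTIFY_OK;

    switch (action) {
    case MEMORY_GOING_ONLINE:
        node_ready[nid] = 1;    /* create structures for the new node */
        break;
    case MEMORY_CANCEL_ONLINE:  /* onlining failed: undo the setup */
    case MEMORY_OFFLINE:        /* node has lost all of its memory */
        node_ready[nid] = 0;
        break;
    default:
        break;
    }
    return NOTIFY_OK;
}
```

In the kernel such a callback would be registered with hotplug_memory_notifier(callback_func, priority) as the text describes. Registration is left out here because it cannot run in userspace.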