-rw-r--r--  Documentation/core-api/atomic_ops.rst                                      |  13
-rw-r--r--  Documentation/memory-barriers.txt                                          |  17
-rw-r--r--  Documentation/translations/ko_KR/memory-barriers.txt                       |  50
-rw-r--r--  MAINTAINERS                                                                |   3
-rw-r--r--  arch/arm64/include/asm/spinlock.h                                          |   5
-rw-r--r--  arch/x86/include/asm/qspinlock.h                                           |  21
-rw-r--r--  arch/x86/include/asm/qspinlock_paravirt.h                                  |   3
-rw-r--r--  include/asm-generic/atomic-long.h                                          |  19
-rw-r--r--  include/asm-generic/barrier.h                                              |  27
-rw-r--r--  include/asm-generic/qspinlock.h                                            |   4
-rw-r--r--  include/asm-generic/qspinlock_types.h                                      |  32
-rw-r--r--  include/linux/atomic.h                                                     |   2
-rw-r--r--  include/linux/delayacct.h                                                  |   2
-rw-r--r--  include/linux/mutex.h                                                      |   3
-rw-r--r--  include/linux/spinlock.h                                                   |  18
-rw-r--r--  kernel/delayacct.c                                                         |  17
-rw-r--r--  kernel/locking/lockdep.c                                                   |  70
-rw-r--r--  kernel/locking/mcs_spinlock.h                                              |  10
-rw-r--r--  kernel/locking/mutex.c                                                     |   3
-rw-r--r--  kernel/locking/qspinlock.c                                                 | 247
-rw-r--r--  kernel/locking/qspinlock_paravirt.h                                        |  49
-rw-r--r--  kernel/locking/qspinlock_stat.h                                            |   9
-rw-r--r--  kernel/locking/rwsem-xadd.c                                                |  25
-rw-r--r--  kernel/stop_machine.c                                                      |  24
-rw-r--r--  tools/memory-model/Documentation/cheatsheet.txt                            |   7
-rw-r--r--  tools/memory-model/Documentation/explanation.txt                           | 221
-rw-r--r--  tools/memory-model/Documentation/references.txt                            |  17
-rw-r--r--  tools/memory-model/README                                                  |   2
-rw-r--r--  tools/memory-model/linux-kernel.bell                                       |   4
-rw-r--r--  tools/memory-model/linux-kernel.cat                                        |  41
-rw-r--r--  tools/memory-model/linux-kernel.def                                        |  34
-rw-r--r--  tools/memory-model/litmus-tests/.gitignore                                 |   1
-rw-r--r--  tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus           |   2
-rw-r--r--  tools/memory-model/litmus-tests/MP+polockmbonce+poacquiresilsil.litmus     |  35
-rw-r--r--  tools/memory-model/litmus-tests/MP+polockonce+poacquiresilsil.litmus       |  34
-rw-r--r--  tools/memory-model/litmus-tests/README                                     |  19
-rw-r--r--  tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus  |   4
-rw-r--r--  tools/memory-model/lock.cat                                                | 107
-rw-r--r--  tools/memory-model/scripts/checkalllitmus.sh                               |  73
-rw-r--r--  tools/memory-model/scripts/checklitmus.sh                                  |  86
 40 files changed, 875 insertions(+), 485 deletions(-)
diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst
index fce929144ccd..2e7165f86f55 100644
--- a/Documentation/core-api/atomic_ops.rst
+++ b/Documentation/core-api/atomic_ops.rst
@@ -111,7 +111,6 @@ If the compiler can prove that do_something() does not store to the
111variable a, then the compiler is within its rights transforming this to 111variable a, then the compiler is within its rights transforming this to
112the following:: 112the following::
113 113
114 tmp = a;
115 if (a > 0) 114 if (a > 0)
116 for (;;) 115 for (;;)
117 do_something(); 116 do_something();
@@ -119,7 +118,7 @@ the following::
119If you don't want the compiler to do this (and you probably don't), then 118If you don't want the compiler to do this (and you probably don't), then
120you should use something like the following:: 119you should use something like the following::
121 120
122 while (READ_ONCE(a) < 0) 121 while (READ_ONCE(a) > 0)
123 do_something(); 122 do_something();
124 123
125Alternatively, you could place a barrier() call in the loop. 124Alternatively, you could place a barrier() call in the loop.
@@ -467,10 +466,12 @@ Like the above, except that these routines return a boolean which
467indicates whether the changed bit was set _BEFORE_ the atomic bit 466indicates whether the changed bit was set _BEFORE_ the atomic bit
468operation. 467operation.
469 468
470WARNING! It is incredibly important that the value be a boolean, 469
471ie. "0" or "1". Do not try to be fancy and save a few instructions by 470.. warning::
472declaring the above to return "long" and just returning something like 471 It is incredibly important that the value be a boolean, ie. "0" or "1".
473"old_val & mask" because that will not work. 472 Do not try to be fancy and save a few instructions by declaring the
473 above to return "long" and just returning something like "old_val &
474 mask" because that will not work.
474 475
475For one thing, this return value gets truncated to int in many code 476For one thing, this return value gets truncated to int in many code
476paths using these interfaces, so on 64-bit if the bit is set in the 477paths using these interfaces, so on 64-bit if the bit is set in the
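The hunk above is the canonical illustration of why READ_ONCE() matters in busy-wait loops. As a rough sketch (the variable "a" and do_something() are placeholders taken from the document, not kernel APIs), the difference is:

    #include <linux/compiler.h>

    static int a;
    extern void do_something(void);

    static void wait_buggy(void)
    {
            /* The compiler may hoist the load of "a" out of the loop. */
            while (a > 0)
                    do_something();
    }

    static void wait_fixed(void)
    {
            /* READ_ONCE() forces a fresh load of "a" on every iteration. */
            while (READ_ONCE(a) > 0)
                    do_something();
    }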
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 33b8bc9573f8..a02d6bbfc9d0 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1920,9 +1920,6 @@ There are some more advanced barrier functions:
1920 /* assign ownership */ 1920 /* assign ownership */
1921 desc->status = DEVICE_OWN; 1921 desc->status = DEVICE_OWN;
1922 1922
1923 /* force memory to sync before notifying device via MMIO */
1924 wmb();
1925
1926 /* notify device of new descriptors */ 1923 /* notify device of new descriptors */
1927 writel(DESC_NOTIFY, doorbell); 1924 writel(DESC_NOTIFY, doorbell);
1928 } 1925 }
@@ -1930,11 +1927,15 @@ There are some more advanced barrier functions:
1930 The dma_rmb() allows us guarantee the device has released ownership 1927 The dma_rmb() allows us guarantee the device has released ownership
1931 before we read the data from the descriptor, and the dma_wmb() allows 1928 before we read the data from the descriptor, and the dma_wmb() allows
1932 us to guarantee the data is written to the descriptor before the device 1929 us to guarantee the data is written to the descriptor before the device
1933 can see it now has ownership. The wmb() is needed to guarantee that the 1930 can see it now has ownership. Note that, when using writel(), a prior
1934 cache coherent memory writes have completed before attempting a write to 1931 wmb() is not needed to guarantee that the cache coherent memory writes
1935 the cache incoherent MMIO region. 1932 have completed before writing to the MMIO region. The cheaper
1936 1933 writel_relaxed() does not provide this guarantee and must not be used
1937 See Documentation/DMA-API.txt for more information on consistent memory. 1934 here.
1935
1936 See the subsection "Kernel I/O barrier effects" for more information on
1937 relaxed I/O accessors and the Documentation/DMA-API.txt file for more
1938 information on consistent memory.
1938 1939
1939 1940
1940MMIO WRITE BARRIER 1941MMIO WRITE BARRIER
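For reference, a minimal sketch of the descriptor hand-off pattern the updated text describes; struct my_desc and the DEVICE_OWN/DESC_NOTIFY values are placeholders modelled on the example above, not a real driver:

    #include <linux/types.h>
    #include <linux/io.h>
    #include <asm/barrier.h>

    #define DEVICE_OWN      0x80000000
    #define DESC_NOTIFY     0x1

    struct my_desc {
            u32 addr;
            u32 status;
    };

    static void give_to_device(struct my_desc *desc, void __iomem *doorbell)
    {
            desc->addr = 0x1000;            /* payload the device will read */

            /* order the payload write before the ownership transfer */
            dma_wmb();
            desc->status = DEVICE_OWN;

            /*
             * writel() already orders the coherent-memory writes above
             * before the MMIO doorbell write, so no extra wmb() is needed;
             * writel_relaxed() would not provide that guarantee.
             */
            writel(DESC_NOTIFY, doorbell);
    }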
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
index 2ec5fe0c9cf4..921739d00f69 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -36,6 +36,9 @@ Documentation/memory-barriers.txt
36부분도 있고, 의도하진 않았지만 사람에 의해 쓰였다보니 불완전한 부분도 있습니다. 36부분도 있고, 의도하진 않았지만 사람에 의해 쓰였다보니 불완전한 부분도 있습니다.
37이 문서는 리눅스에서 제공하는 다양한 메모리 배리어들을 사용하기 위한 37이 문서는 리눅스에서 제공하는 다양한 메모리 배리어들을 사용하기 위한
38안내서입니다만, 뭔가 이상하다 싶으면 (그런게 많을 겁니다) 질문을 부탁드립니다. 38안내서입니다만, 뭔가 이상하다 싶으면 (그런게 많을 겁니다) 질문을 부탁드립니다.
39일부 이상한 점들은 공식적인 메모리 일관성 모델과 tools/memory-model/ 에 있는
40관련 문서를 참고해서 해결될 수 있을 겁니다. 그러나, 이 메모리 모델조차도 그
41관리자들의 의견의 집합으로 봐야지, 절대 옳은 예언자로 신봉해선 안될 겁니다.
39 42
40다시 말하지만, 이 문서는 리눅스가 하드웨어에 기대하는 사항에 대한 명세서가 43다시 말하지만, 이 문서는 리눅스가 하드웨어에 기대하는 사항에 대한 명세서가
41아닙니다. 44아닙니다.
@@ -77,7 +80,7 @@ Documentation/memory-barriers.txt
77 80
78 - 메모리 배리어의 종류. 81 - 메모리 배리어의 종류.
79 - 메모리 배리어에 대해 가정해선 안될 것. 82 - 메모리 배리어에 대해 가정해선 안될 것.
80 - 데이터 의존성 배리어. 83 - 데이터 의존성 배리어 (역사적).
81 - 컨트롤 의존성. 84 - 컨트롤 의존성.
82 - SMP 배리어 짝맞추기. 85 - SMP 배리어 짝맞추기.
83 - 메모리 배리어 시퀀스의 예. 86 - 메모리 배리어 시퀀스의 예.
@@ -255,17 +258,20 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니
255 (*) 어떤 CPU 든, 의존성이 존재하는 메모리 액세스들은 해당 CPU 자신에게 258 (*) 어떤 CPU 든, 의존성이 존재하는 메모리 액세스들은 해당 CPU 자신에게
256 있어서는 순서대로 메모리 시스템에 수행 요청됩니다. 즉, 다음에 대해서: 259 있어서는 순서대로 메모리 시스템에 수행 요청됩니다. 즉, 다음에 대해서:
257 260
258 Q = READ_ONCE(P); smp_read_barrier_depends(); D = READ_ONCE(*Q); 261 Q = READ_ONCE(P); D = READ_ONCE(*Q);
259 262
260 CPU 는 다음과 같은 메모리 오퍼레이션 시퀀스를 수행 요청합니다: 263 CPU 는 다음과 같은 메모리 오퍼레이션 시퀀스를 수행 요청합니다:
261 264
262 Q = LOAD P, D = LOAD *Q 265 Q = LOAD P, D = LOAD *Q
263 266
264 그리고 그 시퀀스 내에서의 순서는 항상 지켜집니다. 대부분의 시스템에서 267 그리고 그 시퀀스 내에서의 순서는 항상 지켜집니다. 하지만, DEC Alpha 에서
265 smp_read_barrier_depends() 는 아무일도 안하지만 DEC Alpha 에서는 268 READ_ONCE() 는 메모리 배리어 명령도 내게 되어 있어서, DEC Alpha CPU 는
266 명시적으로 사용되어야 합니다. 보통의 경우에는 smp_read_barrier_depends() 269 다음과 같은 메모리 오퍼레이션들을 내놓게 됩니다:
267 를 직접 사용하는 대신 rcu_dereference() 같은 것들을 사용해야 함을 270
268 알아두세요. 271 Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER
272
273 DEC Alpha 에서 수행되든 아니든, READ_ONCE() 는 컴파일러로부터의 악영향
274 또한 제거합니다.
269 275
270 (*) 특정 CPU 내에서 겹치는 영역의 메모리에 행해지는 로드와 스토어 들은 해당 276 (*) 특정 CPU 내에서 겹치는 영역의 메모리에 행해지는 로드와 스토어 들은 해당
271 CPU 안에서는 순서가 바뀌지 않은 것으로 보여집니다. 즉, 다음에 대해서: 277 CPU 안에서는 순서가 바뀌지 않은 것으로 보여집니다. 즉, 다음에 대해서:
@@ -421,8 +427,8 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니
421 데이터 의존성 배리어는 읽기 배리어의 보다 완화된 형태입니다. 두개의 로드 427 데이터 의존성 배리어는 읽기 배리어의 보다 완화된 형태입니다. 두개의 로드
422 오퍼레이션이 있고 두번째 것이 첫번째 것의 결과에 의존하고 있을 때(예: 428 오퍼레이션이 있고 두번째 것이 첫번째 것의 결과에 의존하고 있을 때(예:
423 두번째 로드가 참조할 주소를 첫번째 로드가 읽는 경우), 두번째 로드가 읽어올 429 두번째 로드가 참조할 주소를 첫번째 로드가 읽는 경우), 두번째 로드가 읽어올
 424 데이터는 첫번째 로드에 의해 그 주소가 얻어진 뒤에 업데이트 되어 있음을 430 데이터는 첫번째 로드에 의해 그 주소가 얻어진 뒤에 업데이트 되도록 하기
 425 보장하기 위해서 데이터 의존성 배리어가 필요할 수 있습니다. 431 위해서 데이터 의존성 배리어가 필요할 수 있습니다.
426 432
427 데이터 의존성 배리어는 상호 의존적인 로드 오퍼레이션들 사이의 부분적 순서 433 데이터 의존성 배리어는 상호 의존적인 로드 오퍼레이션들 사이의 부분적 순서
428 세우기입니다; 스토어 오퍼레이션들이나 독립적인 로드들, 또는 중복되는 434 세우기입니다; 스토어 오퍼레이션들이나 독립적인 로드들, 또는 중복되는
@@ -570,8 +576,14 @@ ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE
570 Documentation/DMA-API.txt 576 Documentation/DMA-API.txt
571 577
572 578
573데이터 의존성 배리어 579데이터 의존성 배리어 (역사적)
574-------------------- 580-----------------------------
581
582리눅스 커널 v4.15 기준으로, smp_read_barrier_depends() 가 READ_ONCE() 에
583추가되었는데, 이는 이 섹션에 주의를 기울여야 하는 사람들은 DEC Alpha 아키텍쳐
584전용 코드를 만드는 사람들과 READ_ONCE() 자체를 만드는 사람들 뿐임을 의미합니다.
585그런 분들을 위해, 그리고 역사에 관심 있는 분들을 위해, 여기 데이터 의존성
586배리어에 대한 이야기를 적습니다.
575 587
576데이터 의존성 배리어의 사용에 있어 지켜야 하는 사항들은 약간 미묘하고, 데이터 588데이터 의존성 배리어의 사용에 있어 지켜야 하는 사항들은 약간 미묘하고, 데이터
577의존성 배리어가 사용되어야 하는 상황도 항상 명백하지는 않습니다. 설명을 위해 589의존성 배리어가 사용되어야 하는 상황도 항상 명백하지는 않습니다. 설명을 위해
@@ -1787,7 +1799,7 @@ CPU 메모리 배리어
1787 범용 mb() smp_mb() 1799 범용 mb() smp_mb()
1788 쓰기 wmb() smp_wmb() 1800 쓰기 wmb() smp_wmb()
1789 읽기 rmb() smp_rmb() 1801 읽기 rmb() smp_rmb()
1790 데이터 의존성 read_barrier_depends() smp_read_barrier_depends() 1802 데이터 의존성 READ_ONCE()
1791 1803
1792 1804
1793데이터 의존성 배리어를 제외한 모든 메모리 배리어는 컴파일러 배리어를 1805데이터 의존성 배리어를 제외한 모든 메모리 배리어는 컴파일러 배리어를
@@ -2796,8 +2808,9 @@ CPU 2 는 C/D 를 갖습니다)가 병렬로 연결되어 있는 시스템을
2796 2808
2797 2809
2798여기에 개입하기 위해선, 데이터 의존성 배리어나 읽기 배리어를 로드 오퍼레이션들 2810여기에 개입하기 위해선, 데이터 의존성 배리어나 읽기 배리어를 로드 오퍼레이션들
2799사이에 넣어야 합니다. 이렇게 함으로써 캐시가 다음 요청을 처리하기 전에 일관성 2811사이에 넣어야 합니다 (v4.15 부터는 READ_ONCE() 매크로에 의해 무조건적으로
2800큐를 처리하도록 강제하게 됩니다. 2812그렇게 됩니다). 이렇게 함으로써 캐시가 다음 요청을 처리하기 전에 일관성 큐를
2813처리하도록 강제하게 됩니다.
2801 2814
2802 CPU 1 CPU 2 COMMENT 2815 CPU 1 CPU 2 COMMENT
2803 =============== =============== ======================================= 2816 =============== =============== =======================================
@@ -2826,7 +2839,10 @@ CPU 2 는 C/D 를 갖습니다)가 병렬로 연결되어 있는 시스템을
2826다른 CPU 들도 분할된 캐시를 가지고 있을 수 있지만, 그런 CPU 들은 평범한 메모리 2839다른 CPU 들도 분할된 캐시를 가지고 있을 수 있지만, 그런 CPU 들은 평범한 메모리
2827액세스를 위해서도 이 분할된 캐시들 사이의 조정을 해야만 합니다. Alpha 는 가장 2840액세스를 위해서도 이 분할된 캐시들 사이의 조정을 해야만 합니다. Alpha 는 가장
2828약한 메모리 순서 시맨틱 (semantic) 을 선택함으로써 메모리 배리어가 명시적으로 2841약한 메모리 순서 시맨틱 (semantic) 을 선택함으로써 메모리 배리어가 명시적으로
2829사용되지 않았을 때에는 그런 조정이 필요하지 않게 했습니다. 2842사용되지 않았을 때에는 그런 조정이 필요하지 않게 했으며, 이는 Alpha 가 당시에
2843더 높은 CPU 클락 속도를 가질 수 있게 했습니다. 하지만, (다시 말하건대, v4.15
2844이후부터는) Alpha 아키텍쳐 전용 코드와 READ_ONCE() 매크로 내부에서를 제외하고는
2845smp_read_barrier_depends() 가 사용되지 않아야 함을 알아두시기 바랍니다.
2830 2846
2831 2847
2832캐시 일관성 VS DMA 2848캐시 일관성 VS DMA
@@ -2988,7 +3004,9 @@ Alpha CPU 의 일부 버전은 분할된 데이터 캐시를 가지고 있어서
2988메모리 일관성 시스템과 함께 두개의 캐시를 동기화 시켜서, 포인터 변경과 새로운 3004메모리 일관성 시스템과 함께 두개의 캐시를 동기화 시켜서, 포인터 변경과 새로운
2989데이터의 발견을 올바른 순서로 일어나게 하기 때문입니다. 3005데이터의 발견을 올바른 순서로 일어나게 하기 때문입니다.
2990 3006
2991리눅스 커널의 메모리 배리어 모델은 Alpha 에 기초해서 정의되었습니다. 3007리눅스 커널의 메모리 배리어 모델은 Alpha 에 기초해서 정의되었습니다만, v4.15
3008부터는 리눅스 커널이 READ_ONCE() 내에 smp_read_barrier_depends() 를 추가해서
3009Alpha 의 메모리 모델로의 영향력이 크게 줄어들긴 했습니다.
2992 3010
2993위의 "캐시 일관성" 서브섹션을 참고하세요. 3011위의 "캐시 일관성" 서브섹션을 참고하세요.
2994 3012
diff --git a/MAINTAINERS b/MAINTAINERS
index fdf15f3de473..aa635837a6af 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8212,7 +8212,7 @@ F: drivers/misc/lkdtm/*
8212 8212
8213LINUX KERNEL MEMORY CONSISTENCY MODEL (LKMM) 8213LINUX KERNEL MEMORY CONSISTENCY MODEL (LKMM)
8214M: Alan Stern <stern@rowland.harvard.edu> 8214M: Alan Stern <stern@rowland.harvard.edu>
8215M: Andrea Parri <parri.andrea@gmail.com> 8215M: Andrea Parri <andrea.parri@amarulasolutions.com>
8216M: Will Deacon <will.deacon@arm.com> 8216M: Will Deacon <will.deacon@arm.com>
8217M: Peter Zijlstra <peterz@infradead.org> 8217M: Peter Zijlstra <peterz@infradead.org>
8218M: Boqun Feng <boqun.feng@gmail.com> 8218M: Boqun Feng <boqun.feng@gmail.com>
@@ -8319,6 +8319,7 @@ F: Documentation/admin-guide/LSM/LoadPin.rst
8319LOCKING PRIMITIVES 8319LOCKING PRIMITIVES
8320M: Peter Zijlstra <peterz@infradead.org> 8320M: Peter Zijlstra <peterz@infradead.org>
8321M: Ingo Molnar <mingo@redhat.com> 8321M: Ingo Molnar <mingo@redhat.com>
8322M: Will Deacon <will.deacon@arm.com>
8322L: linux-kernel@vger.kernel.org 8323L: linux-kernel@vger.kernel.org
8323T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core 8324T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
8324S: Maintained 8325S: Maintained
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index ebdae15d665d..26c5bd7d88d8 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -122,11 +122,6 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
122 122
123static inline int arch_spin_is_locked(arch_spinlock_t *lock) 123static inline int arch_spin_is_locked(arch_spinlock_t *lock)
124{ 124{
125 /*
126 * Ensure prior spin_lock operations to other locks have completed
127 * on this CPU before we test whether "lock" is locked.
128 */
129 smp_mb(); /* ^^^ */
130 return !arch_spin_value_unlocked(READ_ONCE(*lock)); 125 return !arch_spin_value_unlocked(READ_ONCE(*lock));
131} 126}
132 127
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5e16b5d40d32..3e70bed8a978 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,6 +7,14 @@
7#include <asm-generic/qspinlock_types.h> 7#include <asm-generic/qspinlock_types.h>
8#include <asm/paravirt.h> 8#include <asm/paravirt.h>
9 9
10#define _Q_PENDING_LOOPS (1 << 9)
11
12#ifdef CONFIG_PARAVIRT_SPINLOCKS
13extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
14extern void __pv_init_lock_hash(void);
15extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
16extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
17
10#define queued_spin_unlock queued_spin_unlock 18#define queued_spin_unlock queued_spin_unlock
11/** 19/**
12 * queued_spin_unlock - release a queued spinlock 20 * queued_spin_unlock - release a queued spinlock
@@ -16,15 +24,9 @@
16 */ 24 */
17static inline void native_queued_spin_unlock(struct qspinlock *lock) 25static inline void native_queued_spin_unlock(struct qspinlock *lock)
18{ 26{
19 smp_store_release((u8 *)lock, 0); 27 smp_store_release(&lock->locked, 0);
20} 28}
21 29
22#ifdef CONFIG_PARAVIRT_SPINLOCKS
23extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
24extern void __pv_init_lock_hash(void);
25extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
26extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
27
28static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) 30static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
29{ 31{
30 pv_queued_spin_lock_slowpath(lock, val); 32 pv_queued_spin_lock_slowpath(lock, val);
@@ -40,11 +42,6 @@ static inline bool vcpu_is_preempted(long cpu)
40{ 42{
41 return pv_vcpu_is_preempted(cpu); 43 return pv_vcpu_is_preempted(cpu);
42} 44}
43#else
44static inline void queued_spin_unlock(struct qspinlock *lock)
45{
46 native_queued_spin_unlock(lock);
47}
48#endif 45#endif
49 46
50#ifdef CONFIG_PARAVIRT 47#ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 923307ea11c7..9ef5ee03d2d7 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
22 * 22 *
23 * void __pv_queued_spin_unlock(struct qspinlock *lock) 23 * void __pv_queued_spin_unlock(struct qspinlock *lock)
24 * { 24 * {
25 * struct __qspinlock *l = (void *)lock; 25 * u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
26 * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
27 * 26 *
28 * if (likely(lockval == _Q_LOCKED_VAL)) 27 * if (likely(lockval == _Q_LOCKED_VAL))
29 * return; 28 * return;
diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index 34a028a7bcc5..87d14476edc2 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -25,6 +25,7 @@ typedef atomic64_t atomic_long_t;
25 25
26#define ATOMIC_LONG_INIT(i) ATOMIC64_INIT(i) 26#define ATOMIC_LONG_INIT(i) ATOMIC64_INIT(i)
27#define ATOMIC_LONG_PFX(x) atomic64 ## x 27#define ATOMIC_LONG_PFX(x) atomic64 ## x
28#define ATOMIC_LONG_TYPE s64
28 29
29#else 30#else
30 31
@@ -32,6 +33,7 @@ typedef atomic_t atomic_long_t;
32 33
33#define ATOMIC_LONG_INIT(i) ATOMIC_INIT(i) 34#define ATOMIC_LONG_INIT(i) ATOMIC_INIT(i)
34#define ATOMIC_LONG_PFX(x) atomic ## x 35#define ATOMIC_LONG_PFX(x) atomic ## x
36#define ATOMIC_LONG_TYPE int
35 37
36#endif 38#endif
37 39
@@ -90,6 +92,21 @@ ATOMIC_LONG_ADD_SUB_OP(sub, _release)
90#define atomic_long_cmpxchg(l, old, new) \ 92#define atomic_long_cmpxchg(l, old, new) \
91 (ATOMIC_LONG_PFX(_cmpxchg)((ATOMIC_LONG_PFX(_t) *)(l), (old), (new))) 93 (ATOMIC_LONG_PFX(_cmpxchg)((ATOMIC_LONG_PFX(_t) *)(l), (old), (new)))
92 94
95
96#define atomic_long_try_cmpxchg_relaxed(l, old, new) \
97 (ATOMIC_LONG_PFX(_try_cmpxchg_relaxed)((ATOMIC_LONG_PFX(_t) *)(l), \
98 (ATOMIC_LONG_TYPE *)(old), (ATOMIC_LONG_TYPE)(new)))
99#define atomic_long_try_cmpxchg_acquire(l, old, new) \
100 (ATOMIC_LONG_PFX(_try_cmpxchg_acquire)((ATOMIC_LONG_PFX(_t) *)(l), \
101 (ATOMIC_LONG_TYPE *)(old), (ATOMIC_LONG_TYPE)(new)))
102#define atomic_long_try_cmpxchg_release(l, old, new) \
103 (ATOMIC_LONG_PFX(_try_cmpxchg_release)((ATOMIC_LONG_PFX(_t) *)(l), \
104 (ATOMIC_LONG_TYPE *)(old), (ATOMIC_LONG_TYPE)(new)))
105#define atomic_long_try_cmpxchg(l, old, new) \
106 (ATOMIC_LONG_PFX(_try_cmpxchg)((ATOMIC_LONG_PFX(_t) *)(l), \
107 (ATOMIC_LONG_TYPE *)(old), (ATOMIC_LONG_TYPE)(new)))
108
109
93#define atomic_long_xchg_relaxed(v, new) \ 110#define atomic_long_xchg_relaxed(v, new) \
94 (ATOMIC_LONG_PFX(_xchg_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (new))) 111 (ATOMIC_LONG_PFX(_xchg_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (new)))
95#define atomic_long_xchg_acquire(v, new) \ 112#define atomic_long_xchg_acquire(v, new) \
@@ -244,6 +261,8 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
244#define atomic_long_inc_not_zero(l) \ 261#define atomic_long_inc_not_zero(l) \
245 ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l)) 262 ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l))
246 263
264#define atomic_long_cond_read_relaxed(v, c) \
265 ATOMIC_LONG_PFX(_cond_read_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (c))
247#define atomic_long_cond_read_acquire(v, c) \ 266#define atomic_long_cond_read_acquire(v, c) \
248 ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c)) 267 ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c))
249 268
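As a hedged sketch of how the new atomic_long_try_cmpxchg*() wrappers are used (the function name and bit manipulation are invented for illustration): on failure the observed value is written back through the old-value pointer, so retry loops need no explicit re-read:

    #include <linux/atomic.h>

    static void set_flag_bits(atomic_long_t *v, long bits)
    {
            long old = atomic_long_read(v);

            do {
                    /* on failure, "old" now holds the value just observed */
            } while (!atomic_long_try_cmpxchg_relaxed(v, &old, old | bits));
    }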
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 29458bbb2fa0..2cafdbb9ae4c 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,18 +221,17 @@ do { \
221#endif 221#endif
222 222
223/** 223/**
224 * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering 224 * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
225 * @ptr: pointer to the variable to wait on 225 * @ptr: pointer to the variable to wait on
226 * @cond: boolean expression to wait for 226 * @cond: boolean expression to wait for
227 * 227 *
228 * Equivalent to using smp_load_acquire() on the condition variable but employs 228 * Equivalent to using READ_ONCE() on the condition variable.
229 * the control dependency of the wait to reduce the barrier on many platforms.
230 * 229 *
231 * Due to C lacking lambda expressions we load the value of *ptr into a 230 * Due to C lacking lambda expressions we load the value of *ptr into a
232 * pre-named variable @VAL to be used in @cond. 231 * pre-named variable @VAL to be used in @cond.
233 */ 232 */
234#ifndef smp_cond_load_acquire 233#ifndef smp_cond_load_relaxed
235#define smp_cond_load_acquire(ptr, cond_expr) ({ \ 234#define smp_cond_load_relaxed(ptr, cond_expr) ({ \
236 typeof(ptr) __PTR = (ptr); \ 235 typeof(ptr) __PTR = (ptr); \
237 typeof(*ptr) VAL; \ 236 typeof(*ptr) VAL; \
238 for (;;) { \ 237 for (;;) { \
@@ -241,10 +240,26 @@ do { \
241 break; \ 240 break; \
242 cpu_relax(); \ 241 cpu_relax(); \
243 } \ 242 } \
244 smp_acquire__after_ctrl_dep(); \
245 VAL; \ 243 VAL; \
246}) 244})
247#endif 245#endif
248 246
247/**
248 * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
249 * @ptr: pointer to the variable to wait on
250 * @cond: boolean expression to wait for
251 *
252 * Equivalent to using smp_load_acquire() on the condition variable but employs
253 * the control dependency of the wait to reduce the barrier on many platforms.
254 */
255#ifndef smp_cond_load_acquire
256#define smp_cond_load_acquire(ptr, cond_expr) ({ \
257 typeof(*ptr) _val; \
258 _val = smp_cond_load_relaxed(ptr, cond_expr); \
259 smp_acquire__after_ctrl_dep(); \
260 _val; \
261})
262#endif
263
249#endif /* !__ASSEMBLY__ */ 264#endif /* !__ASSEMBLY__ */
250#endif /* __ASM_GENERIC_BARRIER_H */ 265#endif /* __ASM_GENERIC_BARRIER_H */
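A small usage sketch (the flag/data variables are placeholders): with this change smp_cond_load_acquire() is simply smp_cond_load_relaxed() followed by smp_acquire__after_ctrl_dep(), so the relaxed form is the right tool when no ordering against later accesses is required:

    #include <asm/barrier.h>

    static int flag, data;

    static int wait_for_data(void)
    {
            /*
             * ACQUIRE: the read of "data" below cannot be reordered before
             * the load that observes flag == 1.
             */
            smp_cond_load_acquire(&flag, VAL == 1);
            return data;
    }

    static void wait_until_nonzero(int *p)
    {
            /* pure spin-wait; no ordering against later accesses implied */
            smp_cond_load_relaxed(p, VAL != 0);
    }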
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index b37b4ad7eb94..9cc457597ddf 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -26,7 +26,6 @@
26 * @lock: Pointer to queued spinlock structure 26 * @lock: Pointer to queued spinlock structure
27 * Return: 1 if it is locked, 0 otherwise 27 * Return: 1 if it is locked, 0 otherwise
28 */ 28 */
29#ifndef queued_spin_is_locked
30static __always_inline int queued_spin_is_locked(struct qspinlock *lock) 29static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
31{ 30{
32 /* 31 /*
@@ -35,7 +34,6 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
35 */ 34 */
36 return atomic_read(&lock->val); 35 return atomic_read(&lock->val);
37} 36}
38#endif
39 37
40/** 38/**
41 * queued_spin_value_unlocked - is the spinlock structure unlocked? 39 * queued_spin_value_unlocked - is the spinlock structure unlocked?
@@ -100,7 +98,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
100 /* 98 /*
101 * unlock() needs release semantics: 99 * unlock() needs release semantics:
102 */ 100 */
103 (void)atomic_sub_return_release(_Q_LOCKED_VAL, &lock->val); 101 smp_store_release(&lock->locked, 0);
104} 102}
105#endif 103#endif
106 104
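The unlock change above relies on the lock holder exclusively owning the locked byte: releasing the lock is a single store-release that pairs with the acquire performed when the lock is taken. A hedged sketch of that pairing at the spinlock API level (demo_lock/demo_counter are illustrative; the comment on unlock applies to architectures using qspinlocks):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);
    static int demo_counter;

    static void demo_update(void)
    {
            spin_lock(&demo_lock);          /* ACQUIRE on the lock word */
            demo_counter++;
            spin_unlock(&demo_lock);        /* RELEASE; with qspinlocks, a byte store to lock->locked */
    }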
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0c4956..0763f065b975 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -29,13 +29,41 @@
29#endif 29#endif
30 30
31typedef struct qspinlock { 31typedef struct qspinlock {
32 atomic_t val; 32 union {
33 atomic_t val;
34
35 /*
36 * By using the whole 2nd least significant byte for the
37 * pending bit, we can allow better optimization of the lock
38 * acquisition for the pending bit holder.
39 */
40#ifdef __LITTLE_ENDIAN
41 struct {
42 u8 locked;
43 u8 pending;
44 };
45 struct {
46 u16 locked_pending;
47 u16 tail;
48 };
49#else
50 struct {
51 u16 tail;
52 u16 locked_pending;
53 };
54 struct {
55 u8 reserved[2];
56 u8 pending;
57 u8 locked;
58 };
59#endif
60 };
33} arch_spinlock_t; 61} arch_spinlock_t;
34 62
35/* 63/*
36 * Initializier 64 * Initializier
37 */ 65 */
38#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) } 66#define __ARCH_SPIN_LOCK_UNLOCKED { .val = ATOMIC_INIT(0) }
39 67
40/* 68/*
41 * Bitfields in the atomic value: 69 * Bitfields in the atomic value:
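A hedged sketch of what the new union buys on a little-endian configuration with _Q_PENDING_BITS == 8: the pending and locked bytes (and the locked_pending halfword) alias the low bits of the atomic val, so the slow path can use plain byte or halfword stores instead of atomic read-modify-write operations, mirroring the helpers reworked in kernel/locking/qspinlock.c below:

    /* little-endian sketch; the helper name is illustrative */
    static void pending_to_locked(struct qspinlock *lock)
    {
            /* set the pending bit: same bits as atomic_or(_Q_PENDING_VAL, &lock->val) */
            WRITE_ONCE(lock->pending, 1);

            /* clear pending and take the lock with one 16-bit store */
            WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
    }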
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8b276fd9a127..01ce3997cb42 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
654} 654}
655#endif 655#endif
656 656
657#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
657#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c)) 658#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
658 659
659#ifdef CONFIG_GENERIC_ATOMIC64 660#ifdef CONFIG_GENERIC_ATOMIC64
@@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
1075} 1076}
1076#endif 1077#endif
1077 1078
1079#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
1078#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c)) 1080#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
1079 1081
1080#include <asm-generic/atomic-long.h> 1082#include <asm-generic/atomic-long.h>
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 5e335b6203f4..e6c0448ebcc7 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -29,7 +29,7 @@
29 29
30#ifdef CONFIG_TASK_DELAY_ACCT 30#ifdef CONFIG_TASK_DELAY_ACCT
31struct task_delay_info { 31struct task_delay_info {
32 spinlock_t lock; 32 raw_spinlock_t lock;
33 unsigned int flags; /* Private per-task flags */ 33 unsigned int flags; /* Private per-task flags */
34 34
35 /* For each stat XXX, add following, aligned appropriately 35 /* For each stat XXX, add following, aligned appropriately
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 14bc0d5d0ee5..3093dd162424 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -146,9 +146,6 @@ extern void __mutex_init(struct mutex *lock, const char *name,
146 */ 146 */
147static inline bool mutex_is_locked(struct mutex *lock) 147static inline bool mutex_is_locked(struct mutex *lock)
148{ 148{
149 /*
150 * XXX think about spin_is_locked
151 */
152 return __mutex_owner(lock) != NULL; 149 return __mutex_owner(lock) != NULL;
153} 150}
154 151
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 4894d322d258..1e8a46435838 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -380,6 +380,24 @@ static __always_inline int spin_trylock_irq(spinlock_t *lock)
380 raw_spin_trylock_irqsave(spinlock_check(lock), flags); \ 380 raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
381}) 381})
382 382
383/**
384 * spin_is_locked() - Check whether a spinlock is locked.
385 * @lock: Pointer to the spinlock.
386 *
387 * This function is NOT required to provide any memory ordering
388 * guarantees; it could be used for debugging purposes or, when
389 * additional synchronization is needed, accompanied with other
390 * constructs (memory barriers) enforcing the synchronization.
391 *
392 * Returns: 1 if @lock is locked, 0 otherwise.
393 *
394 * Note that the function only tells you that the spinlock is
395 * seen to be locked, not that it is locked on your CPU.
396 *
397 * Further, on CONFIG_SMP=n builds with CONFIG_DEBUG_SPINLOCK=n,
398 * the return value is always 0 (see include/linux/spinlock_up.h).
399 * Therefore you should not rely heavily on the return value.
400 */
383static __always_inline int spin_is_locked(spinlock_t *lock) 401static __always_inline int spin_is_locked(spinlock_t *lock)
384{ 402{
385 return raw_spin_is_locked(&lock->rlock); 403 return raw_spin_is_locked(&lock->rlock);
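Given the caveats in the new kernel-doc, spin_is_locked() is best treated as a debugging aid; a hedged sketch (struct foo is hypothetical, and lockdep_assert_held() is usually the better assertion):

    struct foo {
            spinlock_t lock;
            int counter;
    };

    static void foo_update(struct foo *f)
    {
            /*
             * Debug-only check: it implies no ordering, and on CONFIG_SMP=n
             * builds without CONFIG_DEBUG_SPINLOCK it always reads as
             * unlocked, hence the IS_ENABLED() guard.
             */
            if (IS_ENABLED(CONFIG_SMP))
                    WARN_ON_ONCE(!spin_is_locked(&f->lock));

            f->counter++;
    }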
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index e2764d767f18..ca8ac2824f0b 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -44,23 +44,24 @@ void __delayacct_tsk_init(struct task_struct *tsk)
44{ 44{
45 tsk->delays = kmem_cache_zalloc(delayacct_cache, GFP_KERNEL); 45 tsk->delays = kmem_cache_zalloc(delayacct_cache, GFP_KERNEL);
46 if (tsk->delays) 46 if (tsk->delays)
47 spin_lock_init(&tsk->delays->lock); 47 raw_spin_lock_init(&tsk->delays->lock);
48} 48}
49 49
50/* 50/*
51 * Finish delay accounting for a statistic using its timestamps (@start), 51 * Finish delay accounting for a statistic using its timestamps (@start),
52 * accumalator (@total) and @count 52 * accumalator (@total) and @count
53 */ 53 */
54static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count) 54static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total,
55 u32 *count)
55{ 56{
56 s64 ns = ktime_get_ns() - *start; 57 s64 ns = ktime_get_ns() - *start;
57 unsigned long flags; 58 unsigned long flags;
58 59
59 if (ns > 0) { 60 if (ns > 0) {
60 spin_lock_irqsave(lock, flags); 61 raw_spin_lock_irqsave(lock, flags);
61 *total += ns; 62 *total += ns;
62 (*count)++; 63 (*count)++;
63 spin_unlock_irqrestore(lock, flags); 64 raw_spin_unlock_irqrestore(lock, flags);
64 } 65 }
65} 66}
66 67
@@ -127,7 +128,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
127 128
128 /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */ 129 /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */
129 130
130 spin_lock_irqsave(&tsk->delays->lock, flags); 131 raw_spin_lock_irqsave(&tsk->delays->lock, flags);
131 tmp = d->blkio_delay_total + tsk->delays->blkio_delay; 132 tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
132 d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp; 133 d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp;
133 tmp = d->swapin_delay_total + tsk->delays->swapin_delay; 134 tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
@@ -137,7 +138,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
137 d->blkio_count += tsk->delays->blkio_count; 138 d->blkio_count += tsk->delays->blkio_count;
138 d->swapin_count += tsk->delays->swapin_count; 139 d->swapin_count += tsk->delays->swapin_count;
139 d->freepages_count += tsk->delays->freepages_count; 140 d->freepages_count += tsk->delays->freepages_count;
140 spin_unlock_irqrestore(&tsk->delays->lock, flags); 141 raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
141 142
142 return 0; 143 return 0;
143} 144}
@@ -147,10 +148,10 @@ __u64 __delayacct_blkio_ticks(struct task_struct *tsk)
147 __u64 ret; 148 __u64 ret;
148 unsigned long flags; 149 unsigned long flags;
149 150
150 spin_lock_irqsave(&tsk->delays->lock, flags); 151 raw_spin_lock_irqsave(&tsk->delays->lock, flags);
151 ret = nsec_to_clock_t(tsk->delays->blkio_delay + 152 ret = nsec_to_clock_t(tsk->delays->blkio_delay +
152 tsk->delays->swapin_delay); 153 tsk->delays->swapin_delay);
153 spin_unlock_irqrestore(&tsk->delays->lock, flags); 154 raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
154 return ret; 155 return ret;
155} 156}
156 157
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 023386338269..edcac5de7ebc 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -561,20 +561,24 @@ static void print_lock(struct held_lock *hlock)
561 printk(KERN_CONT ", at: %pS\n", (void *)hlock->acquire_ip); 561 printk(KERN_CONT ", at: %pS\n", (void *)hlock->acquire_ip);
562} 562}
563 563
564static void lockdep_print_held_locks(struct task_struct *curr) 564static void lockdep_print_held_locks(struct task_struct *p)
565{ 565{
566 int i, depth = curr->lockdep_depth; 566 int i, depth = READ_ONCE(p->lockdep_depth);
567 567
568 if (!depth) { 568 if (!depth)
569 printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr)); 569 printk("no locks held by %s/%d.\n", p->comm, task_pid_nr(p));
570 else
571 printk("%d lock%s held by %s/%d:\n", depth,
572 depth > 1 ? "s" : "", p->comm, task_pid_nr(p));
573 /*
574 * It's not reliable to print a task's held locks if it's not sleeping
575 * and it's not the current task.
576 */
577 if (p->state == TASK_RUNNING && p != current)
570 return; 578 return;
571 }
572 printk("%d lock%s held by %s/%d:\n",
573 depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
574
575 for (i = 0; i < depth; i++) { 579 for (i = 0; i < depth; i++) {
576 printk(" #%d: ", i); 580 printk(" #%d: ", i);
577 print_lock(curr->held_locks + i); 581 print_lock(p->held_locks + i);
578 } 582 }
579} 583}
580 584
@@ -4451,8 +4455,6 @@ EXPORT_SYMBOL_GPL(debug_check_no_locks_held);
4451void debug_show_all_locks(void) 4455void debug_show_all_locks(void)
4452{ 4456{
4453 struct task_struct *g, *p; 4457 struct task_struct *g, *p;
4454 int count = 10;
4455 int unlock = 1;
4456 4458
4457 if (unlikely(!debug_locks)) { 4459 if (unlikely(!debug_locks)) {
4458 pr_warn("INFO: lockdep is turned off.\n"); 4460 pr_warn("INFO: lockdep is turned off.\n");
@@ -4460,50 +4462,18 @@ void debug_show_all_locks(void)
4460 } 4462 }
4461 pr_warn("\nShowing all locks held in the system:\n"); 4463 pr_warn("\nShowing all locks held in the system:\n");
4462 4464
4463 /* 4465 rcu_read_lock();
4464 * Here we try to get the tasklist_lock as hard as possible, 4466 for_each_process_thread(g, p) {
4465 * if not successful after 2 seconds we ignore it (but keep 4467 if (!p->lockdep_depth)
4466 * trying). This is to enable a debug printout even if a
4467 * tasklist_lock-holding task deadlocks or crashes.
4468 */
4469retry:
4470 if (!read_trylock(&tasklist_lock)) {
4471 if (count == 10)
4472 pr_warn("hm, tasklist_lock locked, retrying... ");
4473 if (count) {
4474 count--;
4475 pr_cont(" #%d", 10-count);
4476 mdelay(200);
4477 goto retry;
4478 }
4479 pr_cont(" ignoring it.\n");
4480 unlock = 0;
4481 } else {
4482 if (count != 10)
4483 pr_cont(" locked it.\n");
4484 }
4485
4486 do_each_thread(g, p) {
4487 /*
4488 * It's not reliable to print a task's held locks
4489 * if it's not sleeping (or if it's not the current
4490 * task):
4491 */
4492 if (p->state == TASK_RUNNING && p != current)
4493 continue; 4468 continue;
4494 if (p->lockdep_depth) 4469 lockdep_print_held_locks(p);
4495 lockdep_print_held_locks(p);
4496 if (!unlock)
4497 if (read_trylock(&tasklist_lock))
4498 unlock = 1;
4499 touch_nmi_watchdog(); 4470 touch_nmi_watchdog();
4500 } while_each_thread(g, p); 4471 touch_all_softlockup_watchdogs();
4472 }
4473 rcu_read_unlock();
4501 4474
4502 pr_warn("\n"); 4475 pr_warn("\n");
4503 pr_warn("=============================================\n\n"); 4476 pr_warn("=============================================\n\n");
4504
4505 if (unlock)
4506 read_unlock(&tasklist_lock);
4507} 4477}
4508EXPORT_SYMBOL_GPL(debug_show_all_locks); 4478EXPORT_SYMBOL_GPL(debug_show_all_locks);
4509#endif 4479#endif
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index f046b7ce9dd6..5e10153b4d3c 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -23,13 +23,15 @@ struct mcs_spinlock {
23 23
24#ifndef arch_mcs_spin_lock_contended 24#ifndef arch_mcs_spin_lock_contended
25/* 25/*
26 * Using smp_load_acquire() provides a memory barrier that ensures 26 * Using smp_cond_load_acquire() provides the acquire semantics
27 * subsequent operations happen after the lock is acquired. 27 * required so that subsequent operations happen after the
28 * lock is acquired. Additionally, some architectures such as
29 * ARM64 would like to do spin-waiting instead of purely
30 * spinning, and smp_cond_load_acquire() provides that behavior.
28 */ 31 */
29#define arch_mcs_spin_lock_contended(l) \ 32#define arch_mcs_spin_lock_contended(l) \
30do { \ 33do { \
31 while (!(smp_load_acquire(l))) \ 34 smp_cond_load_acquire(l, VAL); \
32 cpu_relax(); \
33} while (0) 35} while (0)
34#endif 36#endif
35 37
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2048359f33d2..f44f658ae629 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -139,8 +139,9 @@ static inline bool __mutex_trylock(struct mutex *lock)
139static __always_inline bool __mutex_trylock_fast(struct mutex *lock) 139static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
140{ 140{
141 unsigned long curr = (unsigned long)current; 141 unsigned long curr = (unsigned long)current;
142 unsigned long zero = 0UL;
142 143
143 if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr)) 144 if (atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr))
144 return true; 145 return true;
145 146
146 return false; 147 return false;
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296245c5..bfaeb05123ff 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -12,11 +12,11 @@
12 * GNU General Public License for more details. 12 * GNU General Public License for more details.
13 * 13 *
14 * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P. 14 * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
15 * (C) Copyright 2013-2014 Red Hat, Inc. 15 * (C) Copyright 2013-2014,2018 Red Hat, Inc.
16 * (C) Copyright 2015 Intel Corp. 16 * (C) Copyright 2015 Intel Corp.
17 * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP 17 * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
18 * 18 *
19 * Authors: Waiman Long <waiman.long@hpe.com> 19 * Authors: Waiman Long <longman@redhat.com>
20 * Peter Zijlstra <peterz@infradead.org> 20 * Peter Zijlstra <peterz@infradead.org>
21 */ 21 */
22 22
@@ -33,6 +33,11 @@
33#include <asm/qspinlock.h> 33#include <asm/qspinlock.h>
34 34
35/* 35/*
36 * Include queued spinlock statistics code
37 */
38#include "qspinlock_stat.h"
39
40/*
36 * The basic principle of a queue-based spinlock can best be understood 41 * The basic principle of a queue-based spinlock can best be understood
37 * by studying a classic queue-based spinlock implementation called the 42 * by studying a classic queue-based spinlock implementation called the
38 * MCS lock. The paper below provides a good description for this kind 43 * MCS lock. The paper below provides a good description for this kind
@@ -77,6 +82,18 @@
77#endif 82#endif
78 83
79/* 84/*
85 * The pending bit spinning loop count.
86 * This heuristic is used to limit the number of lockword accesses
87 * made by atomic_cond_read_relaxed when waiting for the lock to
88 * transition out of the "== _Q_PENDING_VAL" state. We don't spin
89 * indefinitely because there's no guarantee that we'll make forward
90 * progress.
91 */
92#ifndef _Q_PENDING_LOOPS
93#define _Q_PENDING_LOOPS 1
94#endif
95
96/*
80 * Per-CPU queue node structures; we can never have more than 4 nested 97 * Per-CPU queue node structures; we can never have more than 4 nested
81 * contexts: task, softirq, hardirq, nmi. 98 * contexts: task, softirq, hardirq, nmi.
82 * 99 *
@@ -114,41 +131,18 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
114 131
115#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) 132#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
116 133
117/* 134#if _Q_PENDING_BITS == 8
118 * By using the whole 2nd least significant byte for the pending bit, we 135/**
119 * can allow better optimization of the lock acquisition for the pending 136 * clear_pending - clear the pending bit.
120 * bit holder. 137 * @lock: Pointer to queued spinlock structure
121 * 138 *
122 * This internal structure is also used by the set_locked function which 139 * *,1,* -> *,0,*
123 * is not restricted to _Q_PENDING_BITS == 8.
124 */ 140 */
125struct __qspinlock { 141static __always_inline void clear_pending(struct qspinlock *lock)
126 union { 142{
127 atomic_t val; 143 WRITE_ONCE(lock->pending, 0);
128#ifdef __LITTLE_ENDIAN 144}
129 struct {
130 u8 locked;
131 u8 pending;
132 };
133 struct {
134 u16 locked_pending;
135 u16 tail;
136 };
137#else
138 struct {
139 u16 tail;
140 u16 locked_pending;
141 };
142 struct {
143 u8 reserved[2];
144 u8 pending;
145 u8 locked;
146 };
147#endif
148 };
149};
150 145
151#if _Q_PENDING_BITS == 8
152/** 146/**
153 * clear_pending_set_locked - take ownership and clear the pending bit. 147 * clear_pending_set_locked - take ownership and clear the pending bit.
154 * @lock: Pointer to queued spinlock structure 148 * @lock: Pointer to queued spinlock structure
@@ -159,9 +153,7 @@ struct __qspinlock {
159 */ 153 */
160static __always_inline void clear_pending_set_locked(struct qspinlock *lock) 154static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
161{ 155{
162 struct __qspinlock *l = (void *)lock; 156 WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
163
164 WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
165} 157}
166 158
167/* 159/*
@@ -176,19 +168,28 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
176 */ 168 */
177static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) 169static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
178{ 170{
179 struct __qspinlock *l = (void *)lock;
180
181 /* 171 /*
182 * Use release semantics to make sure that the MCS node is properly 172 * We can use relaxed semantics since the caller ensures that the
183 * initialized before changing the tail code. 173 * MCS node is properly initialized before updating the tail.
184 */ 174 */
185 return (u32)xchg_release(&l->tail, 175 return (u32)xchg_relaxed(&lock->tail,
186 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET; 176 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
187} 177}
188 178
189#else /* _Q_PENDING_BITS == 8 */ 179#else /* _Q_PENDING_BITS == 8 */
190 180
191/** 181/**
182 * clear_pending - clear the pending bit.
183 * @lock: Pointer to queued spinlock structure
184 *
185 * *,1,* -> *,0,*
186 */
187static __always_inline void clear_pending(struct qspinlock *lock)
188{
189 atomic_andnot(_Q_PENDING_VAL, &lock->val);
190}
191
192/**
192 * clear_pending_set_locked - take ownership and clear the pending bit. 193 * clear_pending_set_locked - take ownership and clear the pending bit.
193 * @lock: Pointer to queued spinlock structure 194 * @lock: Pointer to queued spinlock structure
194 * 195 *
@@ -216,10 +217,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
216 for (;;) { 217 for (;;) {
217 new = (val & _Q_LOCKED_PENDING_MASK) | tail; 218 new = (val & _Q_LOCKED_PENDING_MASK) | tail;
218 /* 219 /*
219 * Use release semantics to make sure that the MCS node is 220 * We can use relaxed semantics since the caller ensures that
220 * properly initialized before changing the tail code. 221 * the MCS node is properly initialized before updating the
222 * tail.
221 */ 223 */
222 old = atomic_cmpxchg_release(&lock->val, val, new); 224 old = atomic_cmpxchg_relaxed(&lock->val, val, new);
223 if (old == val) 225 if (old == val)
224 break; 226 break;
225 227
@@ -237,9 +239,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
237 */ 239 */
238static __always_inline void set_locked(struct qspinlock *lock) 240static __always_inline void set_locked(struct qspinlock *lock)
239{ 241{
240 struct __qspinlock *l = (void *)lock; 242 WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
241
242 WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
243} 243}
244 244
245 245
@@ -294,86 +294,83 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
294void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) 294void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
295{ 295{
296 struct mcs_spinlock *prev, *next, *node; 296 struct mcs_spinlock *prev, *next, *node;
297 u32 new, old, tail; 297 u32 old, tail;
298 int idx; 298 int idx;
299 299
300 BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); 300 BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
301 301
302 if (pv_enabled()) 302 if (pv_enabled())
303 goto queue; 303 goto pv_queue;
304 304
305 if (virt_spin_lock(lock)) 305 if (virt_spin_lock(lock))
306 return; 306 return;
307 307
308 /* 308 /*
309 * wait for in-progress pending->locked hand-overs 309 * Wait for in-progress pending->locked hand-overs with a bounded
310 * number of spins so that we guarantee forward progress.
310 * 311 *
311 * 0,1,0 -> 0,0,1 312 * 0,1,0 -> 0,0,1
312 */ 313 */
313 if (val == _Q_PENDING_VAL) { 314 if (val == _Q_PENDING_VAL) {
314 while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) 315 int cnt = _Q_PENDING_LOOPS;
315 cpu_relax(); 316 val = atomic_cond_read_relaxed(&lock->val,
317 (VAL != _Q_PENDING_VAL) || !cnt--);
316 } 318 }
317 319
318 /* 320 /*
321 * If we observe any contention; queue.
322 */
323 if (val & ~_Q_LOCKED_MASK)
324 goto queue;
325
326 /*
319 * trylock || pending 327 * trylock || pending
320 * 328 *
321 * 0,0,0 -> 0,0,1 ; trylock 329 * 0,0,0 -> 0,0,1 ; trylock
322 * 0,0,1 -> 0,1,1 ; pending 330 * 0,0,1 -> 0,1,1 ; pending
323 */ 331 */
324 for (;;) { 332 val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
333 if (!(val & ~_Q_LOCKED_MASK)) {
325 /* 334 /*
326 * If we observe any contention; queue. 335 * We're pending, wait for the owner to go away.
336 *
337 * *,1,1 -> *,1,0
338 *
339 * this wait loop must be a load-acquire such that we match the
340 * store-release that clears the locked bit and create lock
341 * sequentiality; this is because not all
342 * clear_pending_set_locked() implementations imply full
343 * barriers.
327 */ 344 */
328 if (val & ~_Q_LOCKED_MASK) 345 if (val & _Q_LOCKED_MASK) {
329 goto queue; 346 atomic_cond_read_acquire(&lock->val,
330 347 !(VAL & _Q_LOCKED_MASK));
331 new = _Q_LOCKED_VAL; 348 }
332 if (val == new)
333 new |= _Q_PENDING_VAL;
334 349
335 /* 350 /*
336 * Acquire semantic is required here as the function may 351 * take ownership and clear the pending bit.
337 * return immediately if the lock was free. 352 *
353 * *,1,0 -> *,0,1
338 */ 354 */
339 old = atomic_cmpxchg_acquire(&lock->val, val, new); 355 clear_pending_set_locked(lock);
340 if (old == val) 356 qstat_inc(qstat_lock_pending, true);
341 break;
342
343 val = old;
344 }
345
346 /*
347 * we won the trylock
348 */
349 if (new == _Q_LOCKED_VAL)
350 return; 357 return;
358 }
351 359
352 /* 360 /*
353 * we're pending, wait for the owner to go away. 361 * If pending was clear but there are waiters in the queue, then
354 * 362 * we need to undo our setting of pending before we queue ourselves.
355 * *,1,1 -> *,1,0
356 *
357 * this wait loop must be a load-acquire such that we match the
358 * store-release that clears the locked bit and create lock
359 * sequentiality; this is because not all clear_pending_set_locked()
360 * implementations imply full barriers.
361 */
362 smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
363
364 /*
365 * take ownership and clear the pending bit.
366 *
367 * *,1,0 -> *,0,1
368 */ 363 */
369 clear_pending_set_locked(lock); 364 if (!(val & _Q_PENDING_MASK))
370 return; 365 clear_pending(lock);
371 366
372 /* 367 /*
373 * End of pending bit optimistic spinning and beginning of MCS 368 * End of pending bit optimistic spinning and beginning of MCS
374 * queuing. 369 * queuing.
375 */ 370 */
376queue: 371queue:
372 qstat_inc(qstat_lock_slowpath, true);
373pv_queue:
377 node = this_cpu_ptr(&mcs_nodes[0]); 374 node = this_cpu_ptr(&mcs_nodes[0]);
378 idx = node->count++; 375 idx = node->count++;
379 tail = encode_tail(smp_processor_id(), idx); 376 tail = encode_tail(smp_processor_id(), idx);
@@ -400,12 +397,18 @@ queue:
400 goto release; 397 goto release;
401 398
402 /* 399 /*
400 * Ensure that the initialisation of @node is complete before we
401 * publish the updated tail via xchg_tail() and potentially link
402 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
403 */
404 smp_wmb();
405
406 /*
407 * Publish the updated tail.
403 * We have already touched the queueing cacheline; don't bother with 408 * We have already touched the queueing cacheline; don't bother with
404 * pending stuff. 409 * pending stuff.
405 * 410 *
406 * p,*,* -> n,*,* 411 * p,*,* -> n,*,*
407 *
408 * RELEASE, such that the stores to @node must be complete.
409 */ 412 */
410 old = xchg_tail(lock, tail); 413 old = xchg_tail(lock, tail);
411 next = NULL; 414 next = NULL;
@@ -417,14 +420,8 @@ queue:
417 if (old & _Q_TAIL_MASK) { 420 if (old & _Q_TAIL_MASK) {
418 prev = decode_tail(old); 421 prev = decode_tail(old);
419 422
420 /* 423 /* Link @node into the waitqueue. */
421 * We must ensure that the stores to @node are observed before 424 WRITE_ONCE(prev->next, node);
422 * the write to prev->next. The address dependency from
423 * xchg_tail is not sufficient to ensure this because the read
424 * component of xchg_tail is unordered with respect to the
425 * initialisation of @node.
426 */
427 smp_store_release(&prev->next, node);
428 425
429 pv_wait_node(node, prev); 426 pv_wait_node(node, prev);
430 arch_mcs_spin_lock_contended(&node->locked); 427 arch_mcs_spin_lock_contended(&node->locked);
@@ -453,8 +450,8 @@ queue:
453 * 450 *
454 * The PV pv_wait_head_or_lock function, if active, will acquire 451 * The PV pv_wait_head_or_lock function, if active, will acquire
455 * the lock and return a non-zero value. So we have to skip the 452 * the lock and return a non-zero value. So we have to skip the
456 * smp_cond_load_acquire() call. As the next PV queue head hasn't been 453 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
457 * designated yet, there is no way for the locked value to become 454 * been designated yet, there is no way for the locked value to become
458 * _Q_SLOW_VAL. So both the set_locked() and the 455 * _Q_SLOW_VAL. So both the set_locked() and the
459 * atomic_cmpxchg_relaxed() calls will be safe. 456 * atomic_cmpxchg_relaxed() calls will be safe.
460 * 457 *
@@ -464,44 +461,38 @@ queue:
464 if ((val = pv_wait_head_or_lock(lock, node))) 461 if ((val = pv_wait_head_or_lock(lock, node)))
465 goto locked; 462 goto locked;
466 463
467 val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK)); 464 val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
468 465
469locked: 466locked:
470 /* 467 /*
471 * claim the lock: 468 * claim the lock:
472 * 469 *
473 * n,0,0 -> 0,0,1 : lock, uncontended 470 * n,0,0 -> 0,0,1 : lock, uncontended
474 * *,0,0 -> *,0,1 : lock, contended 471 * *,*,0 -> *,*,1 : lock, contended
475 * 472 *
476 * If the queue head is the only one in the queue (lock value == tail), 473 * If the queue head is the only one in the queue (lock value == tail)
477 * clear the tail code and grab the lock. Otherwise, we only need 474 * and nobody is pending, clear the tail code and grab the lock.
478 * to grab the lock. 475 * Otherwise, we only need to grab the lock.
479 */ 476 */
480 for (;;) {
481 /* In the PV case we might already have _Q_LOCKED_VAL set */
482 if ((val & _Q_TAIL_MASK) != tail) {
483 set_locked(lock);
484 break;
485 }
486 /*
487 * The smp_cond_load_acquire() call above has provided the
488 * necessary acquire semantics required for locking. At most
489 * two iterations of this loop may be ran.
490 */
491 old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
492 if (old == val)
493 goto release; /* No contention */
494 477
495 val = old; 478 /*
496 } 479 * In the PV case we might already have _Q_LOCKED_VAL set.
480 *
481 * The atomic_cond_read_acquire() call above has provided the
482 * necessary acquire semantics required for locking.
483 */
484 if (((val & _Q_TAIL_MASK) == tail) &&
485 atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
486 goto release; /* No contention */
487
488 /* Either somebody is queued behind us or _Q_PENDING_VAL is set */
489 set_locked(lock);
497 490
498 /* 491 /*
499 * contended path; wait for next if not observed yet, release. 492 * contended path; wait for next if not observed yet, release.
500 */ 493 */
501 if (!next) { 494 if (!next)
502 while (!(next = READ_ONCE(node->next))) 495 next = smp_cond_load_relaxed(&node->next, (VAL));
503 cpu_relax();
504 }
505 496
506 arch_mcs_spin_unlock_contended(&next->locked); 497 arch_mcs_spin_unlock_contended(&next->locked);
507 pv_kick_node(lock, next); 498 pv_kick_node(lock, next);
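One detail worth pulling out of the reworked slow path above is the bounded pending-bit wait: atomic_cond_read_relaxed() re-evaluates its condition against the freshly loaded value in VAL, so folding a countdown into the expression bounds the spin without an extra loop. A hedged, stand-alone sketch of the same idiom (the names are illustrative):

    #include <linux/atomic.h>

    #define MAX_TRANSIENT_SPINS     (1 << 9)

    static u32 wait_out_transient(atomic_t *state, int transient_val)
    {
            int cnt = MAX_TRANSIENT_SPINS;

            /*
             * Stop when the state leaves the transient value or when the
             * budget runs out; short-circuit evaluation means cnt is only
             * consumed while the state is still transient.
             */
            return atomic_cond_read_relaxed(state,
                                            (VAL != transient_val) || !cnt--);
    }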
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee477765e6c..5a0cf5f9008c 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -56,11 +56,6 @@ struct pv_node {
56}; 56};
57 57
58/* 58/*
59 * Include queued spinlock statistics code
60 */
61#include "qspinlock_stat.h"
62
63/*
64 * Hybrid PV queued/unfair lock 59 * Hybrid PV queued/unfair lock
65 * 60 *
66 * By replacing the regular queued_spin_trylock() with the function below, 61 * By replacing the regular queued_spin_trylock() with the function below,
@@ -87,8 +82,6 @@ struct pv_node {
87#define queued_spin_trylock(l) pv_hybrid_queued_unfair_trylock(l) 82#define queued_spin_trylock(l) pv_hybrid_queued_unfair_trylock(l)
88static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) 83static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
89{ 84{
90 struct __qspinlock *l = (void *)lock;
91
92 /* 85 /*
93 * Stay in unfair lock mode as long as queued mode waiters are 86 * Stay in unfair lock mode as long as queued mode waiters are
94 * present in the MCS wait queue but the pending bit isn't set. 87 * present in the MCS wait queue but the pending bit isn't set.
@@ -97,7 +90,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
97 int val = atomic_read(&lock->val); 90 int val = atomic_read(&lock->val);
98 91
99 if (!(val & _Q_LOCKED_PENDING_MASK) && 92 if (!(val & _Q_LOCKED_PENDING_MASK) &&
100 (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) { 93 (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
101 qstat_inc(qstat_pv_lock_stealing, true); 94 qstat_inc(qstat_pv_lock_stealing, true);
102 return true; 95 return true;
103 } 96 }
@@ -117,16 +110,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
117#if _Q_PENDING_BITS == 8 110#if _Q_PENDING_BITS == 8
118static __always_inline void set_pending(struct qspinlock *lock) 111static __always_inline void set_pending(struct qspinlock *lock)
119{ 112{
120 struct __qspinlock *l = (void *)lock; 113 WRITE_ONCE(lock->pending, 1);
121
122 WRITE_ONCE(l->pending, 1);
123}
124
125static __always_inline void clear_pending(struct qspinlock *lock)
126{
127 struct __qspinlock *l = (void *)lock;
128
129 WRITE_ONCE(l->pending, 0);
130} 114}
131 115
132/* 116/*
@@ -136,10 +120,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
136 */ 120 */
137static __always_inline int trylock_clear_pending(struct qspinlock *lock) 121static __always_inline int trylock_clear_pending(struct qspinlock *lock)
138{ 122{
139 struct __qspinlock *l = (void *)lock; 123 return !READ_ONCE(lock->locked) &&
140 124 (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
141 return !READ_ONCE(l->locked) &&
142 (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
143 _Q_LOCKED_VAL) == _Q_PENDING_VAL); 125 _Q_LOCKED_VAL) == _Q_PENDING_VAL);
144} 126}
145#else /* _Q_PENDING_BITS == 8 */ 127#else /* _Q_PENDING_BITS == 8 */
@@ -148,11 +130,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
148 atomic_or(_Q_PENDING_VAL, &lock->val); 130 atomic_or(_Q_PENDING_VAL, &lock->val);
149} 131}
150 132
151static __always_inline void clear_pending(struct qspinlock *lock)
152{
153 atomic_andnot(_Q_PENDING_VAL, &lock->val);
154}
155
156static __always_inline int trylock_clear_pending(struct qspinlock *lock) 133static __always_inline int trylock_clear_pending(struct qspinlock *lock)
157{ 134{
158 int val = atomic_read(&lock->val); 135 int val = atomic_read(&lock->val);
@@ -384,7 +361,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
384static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node) 361static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
385{ 362{
386 struct pv_node *pn = (struct pv_node *)node; 363 struct pv_node *pn = (struct pv_node *)node;
387 struct __qspinlock *l = (void *)lock;
388 364
389 /* 365 /*
390 * If the vCPU is indeed halted, advance its state to match that of 366 * If the vCPU is indeed halted, advance its state to match that of
@@ -413,7 +389,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
413 * the hash table later on at unlock time, no atomic instruction is 389 * the hash table later on at unlock time, no atomic instruction is
414 * needed. 390 * needed.
415 */ 391 */
416 WRITE_ONCE(l->locked, _Q_SLOW_VAL); 392 WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
417 (void)pv_hash(lock, pn); 393 (void)pv_hash(lock, pn);
418} 394}
419 395
@@ -428,7 +404,6 @@ static u32
428pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) 404pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
429{ 405{
430 struct pv_node *pn = (struct pv_node *)node; 406 struct pv_node *pn = (struct pv_node *)node;
431 struct __qspinlock *l = (void *)lock;
432 struct qspinlock **lp = NULL; 407 struct qspinlock **lp = NULL;
433 int waitcnt = 0; 408 int waitcnt = 0;
434 int loop; 409 int loop;
@@ -443,7 +418,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
443 /* 418 /*
444 * Tracking # of slowpath locking operations 419 * Tracking # of slowpath locking operations
445 */ 420 */
446 qstat_inc(qstat_pv_lock_slowpath, true); 421 qstat_inc(qstat_lock_slowpath, true);
447 422
448 for (;; waitcnt++) { 423 for (;; waitcnt++) {
449 /* 424 /*
@@ -479,13 +454,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
479 * 454 *
480 * Matches the smp_rmb() in __pv_queued_spin_unlock(). 455 * Matches the smp_rmb() in __pv_queued_spin_unlock().
481 */ 456 */
482 if (xchg(&l->locked, _Q_SLOW_VAL) == 0) { 457 if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
483 /* 458 /*
484 * The lock was free and now we own the lock. 459 * The lock was free and now we own the lock.
485 * Change the lock value back to _Q_LOCKED_VAL 460 * Change the lock value back to _Q_LOCKED_VAL
486 * and unhash the table. 461 * and unhash the table.
487 */ 462 */
488 WRITE_ONCE(l->locked, _Q_LOCKED_VAL); 463 WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
489 WRITE_ONCE(*lp, NULL); 464 WRITE_ONCE(*lp, NULL);
490 goto gotlock; 465 goto gotlock;
491 } 466 }
@@ -493,7 +468,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
493 WRITE_ONCE(pn->state, vcpu_hashed); 468 WRITE_ONCE(pn->state, vcpu_hashed);
494 qstat_inc(qstat_pv_wait_head, true); 469 qstat_inc(qstat_pv_wait_head, true);
495 qstat_inc(qstat_pv_wait_again, waitcnt); 470 qstat_inc(qstat_pv_wait_again, waitcnt);
496 pv_wait(&l->locked, _Q_SLOW_VAL); 471 pv_wait(&lock->locked, _Q_SLOW_VAL);
497 472
498 /* 473 /*
499 * Because of lock stealing, the queue head vCPU may not be 474 * Because of lock stealing, the queue head vCPU may not be
@@ -518,7 +493,6 @@ gotlock:
518__visible void 493__visible void
519__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked) 494__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
520{ 495{
521 struct __qspinlock *l = (void *)lock;
522 struct pv_node *node; 496 struct pv_node *node;
523 497
524 if (unlikely(locked != _Q_SLOW_VAL)) { 498 if (unlikely(locked != _Q_SLOW_VAL)) {
@@ -547,7 +521,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
547 * Now that we have a reference to the (likely) blocked pv_node, 521 * Now that we have a reference to the (likely) blocked pv_node,
548 * release the lock. 522 * release the lock.
549 */ 523 */
550 smp_store_release(&l->locked, 0); 524 smp_store_release(&lock->locked, 0);
551 525
552 /* 526 /*
553 * At this point the memory pointed at by lock can be freed/reused, 527 * At this point the memory pointed at by lock can be freed/reused,
@@ -573,7 +547,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
573#ifndef __pv_queued_spin_unlock 547#ifndef __pv_queued_spin_unlock
574__visible void __pv_queued_spin_unlock(struct qspinlock *lock) 548__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
575{ 549{
576 struct __qspinlock *l = (void *)lock;
577 u8 locked; 550 u8 locked;
578 551
579 /* 552 /*
@@ -581,7 +554,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
581 * unhash. Otherwise it would be possible to have multiple @lock 554 * unhash. Otherwise it would be possible to have multiple @lock
582 * entries, which would be BAD. 555 * entries, which would be BAD.
583 */ 556 */
584 locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0); 557 locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
585 if (likely(locked == _Q_LOCKED_VAL)) 558 if (likely(locked == _Q_LOCKED_VAL))
586 return; 559 return;
587 560
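
The hunks above drop the old "struct __qspinlock *l = (void *)lock;" casts and operate on lock->locked, lock->pending and lock->locked_pending directly. That only works because struct qspinlock itself now exposes those byte and halfword views (see the include/asm-generic/qspinlock_types.h entry in the diffstat). Below is a minimal userspace sketch of the little-endian layout being assumed; the real definition also covers big-endian and the tail encoding:

        #include <assert.h>
        #include <stdint.h>

        /* Simplified little-endian sketch of the merged qspinlock layout. */
        struct qspinlock_sketch {
                union {
                        uint32_t val;                    /* whole lock word */
                        struct {
                                uint8_t locked;          /* bits  0- 7 */
                                uint8_t pending;         /* bits  8-15 */
                        };
                        struct {
                                uint16_t locked_pending; /* bits  0-15 */
                                uint16_t tail;           /* bits 16-31 */
                        };
                };
        };

        int main(void)
        {
                struct qspinlock_sketch lock = { .val = 0 };

                lock.locked = 1;        /* like WRITE_ONCE(lock->locked, _Q_LOCKED_VAL) */
                assert(lock.val == 0x00000001 && lock.locked_pending == 0x0001);

                lock.pending = 1;       /* like WRITE_ONCE(lock->pending, 1) */
                assert(lock.val == 0x00000101);
                return 0;
        }
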
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4a30ef63c607..6bd78c0740fc 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,13 +22,14 @@
22 * pv_kick_wake - # of vCPU kicks used for computing pv_latency_wake 22 * pv_kick_wake - # of vCPU kicks used for computing pv_latency_wake
23 * pv_latency_kick - average latency (ns) of vCPU kick operation 23 * pv_latency_kick - average latency (ns) of vCPU kick operation
24 * pv_latency_wake - average latency (ns) from vCPU kick to wakeup 24 * pv_latency_wake - average latency (ns) from vCPU kick to wakeup
25 * pv_lock_slowpath - # of locking operations via the slowpath
26 * pv_lock_stealing - # of lock stealing operations 25 * pv_lock_stealing - # of lock stealing operations
27 * pv_spurious_wakeup - # of spurious wakeups in non-head vCPUs 26 * pv_spurious_wakeup - # of spurious wakeups in non-head vCPUs
28 * pv_wait_again - # of wait's after a queue head vCPU kick 27 * pv_wait_again - # of wait's after a queue head vCPU kick
29 * pv_wait_early - # of early vCPU wait's 28 * pv_wait_early - # of early vCPU wait's
30 * pv_wait_head - # of vCPU wait's at the queue head 29 * pv_wait_head - # of vCPU wait's at the queue head
31 * pv_wait_node - # of vCPU wait's at a non-head queue node 30 * pv_wait_node - # of vCPU wait's at a non-head queue node
31 * lock_pending - # of locking operations via pending code
32 * lock_slowpath - # of locking operations via MCS lock queue
32 * 33 *
33 * Writing to the "reset_counters" file will reset all the above counter 34 * Writing to the "reset_counters" file will reset all the above counter
34 * values. 35 * values.
@@ -46,13 +47,14 @@ enum qlock_stats {
46 qstat_pv_kick_wake, 47 qstat_pv_kick_wake,
47 qstat_pv_latency_kick, 48 qstat_pv_latency_kick,
48 qstat_pv_latency_wake, 49 qstat_pv_latency_wake,
49 qstat_pv_lock_slowpath,
50 qstat_pv_lock_stealing, 50 qstat_pv_lock_stealing,
51 qstat_pv_spurious_wakeup, 51 qstat_pv_spurious_wakeup,
52 qstat_pv_wait_again, 52 qstat_pv_wait_again,
53 qstat_pv_wait_early, 53 qstat_pv_wait_early,
54 qstat_pv_wait_head, 54 qstat_pv_wait_head,
55 qstat_pv_wait_node, 55 qstat_pv_wait_node,
56 qstat_lock_pending,
57 qstat_lock_slowpath,
56 qstat_num, /* Total number of statistical counters */ 58 qstat_num, /* Total number of statistical counters */
57 qstat_reset_cnts = qstat_num, 59 qstat_reset_cnts = qstat_num,
58}; 60};
@@ -73,12 +75,13 @@ static const char * const qstat_names[qstat_num + 1] = {
73 [qstat_pv_spurious_wakeup] = "pv_spurious_wakeup", 75 [qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
74 [qstat_pv_latency_kick] = "pv_latency_kick", 76 [qstat_pv_latency_kick] = "pv_latency_kick",
75 [qstat_pv_latency_wake] = "pv_latency_wake", 77 [qstat_pv_latency_wake] = "pv_latency_wake",
76 [qstat_pv_lock_slowpath] = "pv_lock_slowpath",
77 [qstat_pv_lock_stealing] = "pv_lock_stealing", 78 [qstat_pv_lock_stealing] = "pv_lock_stealing",
78 [qstat_pv_wait_again] = "pv_wait_again", 79 [qstat_pv_wait_again] = "pv_wait_again",
79 [qstat_pv_wait_early] = "pv_wait_early", 80 [qstat_pv_wait_early] = "pv_wait_early",
80 [qstat_pv_wait_head] = "pv_wait_head", 81 [qstat_pv_wait_head] = "pv_wait_head",
81 [qstat_pv_wait_node] = "pv_wait_node", 82 [qstat_pv_wait_node] = "pv_wait_node",
83 [qstat_lock_pending] = "lock_pending",
84 [qstat_lock_slowpath] = "lock_slowpath",
82 [qstat_reset_cnts] = "reset_counters", 85 [qstat_reset_cnts] = "reset_counters",
83}; 86};
84 87
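
The counters listed above are simple per-CPU event counts exposed through debugfs; the qstat_inc() calls seen in the earlier hunks amount to a conditional per-CPU increment indexed by this enum. The following is a rough sketch rather than the kernel's exact helper (the real one also feeds the pv latency counters and compiles away when the statistics are configured out):

        /* Rough sketch only; not the kernel's exact qstat_inc(). */
        static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

        static inline void qstat_inc(enum qlock_stats stat, bool cond)
        {
                if (cond)
                        this_cpu_inc(qstats[stat]);
        }
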
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index a90336779375..3064c50e181e 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -347,6 +347,15 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
347 } 347 }
348} 348}
349 349
350static inline bool owner_on_cpu(struct task_struct *owner)
351{
352 /*
353 * As lock holder preemption issue, we both skip spinning if
354 * task is not on cpu or its cpu is preempted
355 */
356 return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
357}
358
350static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) 359static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
351{ 360{
352 struct task_struct *owner; 361 struct task_struct *owner;
@@ -359,17 +368,10 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
359 368
360 rcu_read_lock(); 369 rcu_read_lock();
361 owner = READ_ONCE(sem->owner); 370 owner = READ_ONCE(sem->owner);
362 if (!owner || !is_rwsem_owner_spinnable(owner)) { 371 if (owner) {
363 ret = !owner; /* !owner is spinnable */ 372 ret = is_rwsem_owner_spinnable(owner) &&
364 goto done; 373 owner_on_cpu(owner);
365 } 374 }
366
367 /*
368 * As lock holder preemption issue, we both skip spinning if task is not
369 * on cpu or its cpu is preempted
370 */
371 ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
372done:
373 rcu_read_unlock(); 375 rcu_read_unlock();
374 return ret; 376 return ret;
375} 377}
@@ -398,8 +400,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
398 * abort spinning when need_resched or owner is not running or 400 * abort spinning when need_resched or owner is not running or
399 * owner's cpu is preempted. 401 * owner's cpu is preempted.
400 */ 402 */
401 if (!owner->on_cpu || need_resched() || 403 if (need_resched() || !owner_on_cpu(owner)) {
402 vcpu_is_preempted(task_cpu(owner))) {
403 rcu_read_unlock(); 404 rcu_read_unlock();
404 return false; 405 return false;
405 } 406 }
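
The new owner_on_cpu() helper centralizes the "is the lock holder actually running" test used both by the pre-spin check and by the spin loop above. A condensed sketch of how the loop consumes it follows; the owner-validity handling and other details of the real rwsem_spin_on_owner() are elided, and the function name here is made up:

        /* Condensed sketch of the optimistic-spin loop after the cleanup. */
        static bool spin_on_owner_sketch(struct rw_semaphore *sem)
        {
                struct task_struct *owner;
                bool keep_spinning = true;

                rcu_read_lock();
                owner = READ_ONCE(sem->owner);
                while (owner && owner == READ_ONCE(sem->owner)) {
                        /* Stop if we should reschedule or the holder is off-CPU. */
                        if (need_resched() || !owner_on_cpu(owner)) {
                                keep_spinning = false;
                                break;
                        }
                        cpu_relax();
                }
                rcu_read_unlock();
                return keep_spinning;   /* true: no owner, or it changed; keep trying */
        }
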
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 64c0291b579c..f89014a2c238 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -37,7 +37,7 @@ struct cpu_stop_done {
37struct cpu_stopper { 37struct cpu_stopper {
38 struct task_struct *thread; 38 struct task_struct *thread;
39 39
40 spinlock_t lock; 40 raw_spinlock_t lock;
41 bool enabled; /* is this stopper enabled? */ 41 bool enabled; /* is this stopper enabled? */
42 struct list_head works; /* list of pending works */ 42 struct list_head works; /* list of pending works */
43 43
@@ -81,13 +81,13 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
81 unsigned long flags; 81 unsigned long flags;
82 bool enabled; 82 bool enabled;
83 83
84 spin_lock_irqsave(&stopper->lock, flags); 84 raw_spin_lock_irqsave(&stopper->lock, flags);
85 enabled = stopper->enabled; 85 enabled = stopper->enabled;
86 if (enabled) 86 if (enabled)
87 __cpu_stop_queue_work(stopper, work, &wakeq); 87 __cpu_stop_queue_work(stopper, work, &wakeq);
88 else if (work->done) 88 else if (work->done)
89 cpu_stop_signal_done(work->done); 89 cpu_stop_signal_done(work->done);
90 spin_unlock_irqrestore(&stopper->lock, flags); 90 raw_spin_unlock_irqrestore(&stopper->lock, flags);
91 91
92 wake_up_q(&wakeq); 92 wake_up_q(&wakeq);
93 93
@@ -237,8 +237,8 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
237 DEFINE_WAKE_Q(wakeq); 237 DEFINE_WAKE_Q(wakeq);
238 int err; 238 int err;
239retry: 239retry:
240 spin_lock_irq(&stopper1->lock); 240 raw_spin_lock_irq(&stopper1->lock);
241 spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING); 241 raw_spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
242 242
243 err = -ENOENT; 243 err = -ENOENT;
244 if (!stopper1->enabled || !stopper2->enabled) 244 if (!stopper1->enabled || !stopper2->enabled)
@@ -261,8 +261,8 @@ retry:
261 __cpu_stop_queue_work(stopper1, work1, &wakeq); 261 __cpu_stop_queue_work(stopper1, work1, &wakeq);
262 __cpu_stop_queue_work(stopper2, work2, &wakeq); 262 __cpu_stop_queue_work(stopper2, work2, &wakeq);
263unlock: 263unlock:
264 spin_unlock(&stopper2->lock); 264 raw_spin_unlock(&stopper2->lock);
265 spin_unlock_irq(&stopper1->lock); 265 raw_spin_unlock_irq(&stopper1->lock);
266 266
267 if (unlikely(err == -EDEADLK)) { 267 if (unlikely(err == -EDEADLK)) {
268 while (stop_cpus_in_progress) 268 while (stop_cpus_in_progress)
@@ -457,9 +457,9 @@ static int cpu_stop_should_run(unsigned int cpu)
457 unsigned long flags; 457 unsigned long flags;
458 int run; 458 int run;
459 459
460 spin_lock_irqsave(&stopper->lock, flags); 460 raw_spin_lock_irqsave(&stopper->lock, flags);
461 run = !list_empty(&stopper->works); 461 run = !list_empty(&stopper->works);
462 spin_unlock_irqrestore(&stopper->lock, flags); 462 raw_spin_unlock_irqrestore(&stopper->lock, flags);
463 return run; 463 return run;
464} 464}
465 465
@@ -470,13 +470,13 @@ static void cpu_stopper_thread(unsigned int cpu)
470 470
471repeat: 471repeat:
472 work = NULL; 472 work = NULL;
473 spin_lock_irq(&stopper->lock); 473 raw_spin_lock_irq(&stopper->lock);
474 if (!list_empty(&stopper->works)) { 474 if (!list_empty(&stopper->works)) {
475 work = list_first_entry(&stopper->works, 475 work = list_first_entry(&stopper->works,
476 struct cpu_stop_work, list); 476 struct cpu_stop_work, list);
477 list_del_init(&work->list); 477 list_del_init(&work->list);
478 } 478 }
479 spin_unlock_irq(&stopper->lock); 479 raw_spin_unlock_irq(&stopper->lock);
480 480
481 if (work) { 481 if (work) {
482 cpu_stop_fn_t fn = work->fn; 482 cpu_stop_fn_t fn = work->fn;
@@ -550,7 +550,7 @@ static int __init cpu_stop_init(void)
550 for_each_possible_cpu(cpu) { 550 for_each_possible_cpu(cpu) {
551 struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu); 551 struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
552 552
553 spin_lock_init(&stopper->lock); 553 raw_spin_lock_init(&stopper->lock);
554 INIT_LIST_HEAD(&stopper->works); 554 INIT_LIST_HEAD(&stopper->works);
555 } 555 }
556 556
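
Every stopper-lock site above follows the same shape after the conversion: take the raw lock with interrupts disabled, add or remove work, drop the lock, and only then wake the stopper thread. A minimal, hypothetical illustration of that pattern (the function name is invented; the body mirrors how __cpu_stop_queue_work() is used above):

        /* Hypothetical illustration of the queue-under-raw-lock, wake-outside pattern. */
        static void queue_stop_work_sketch(struct cpu_stopper *stopper,
                                           struct cpu_stop_work *work)
        {
                DEFINE_WAKE_Q(wakeq);
                unsigned long flags;

                raw_spin_lock_irqsave(&stopper->lock, flags);
                list_add_tail(&work->list, &stopper->works);
                wake_q_add(&wakeq, stopper->thread);
                raw_spin_unlock_irqrestore(&stopper->lock, flags);

                wake_up_q(&wakeq);      /* never wake while holding the raw lock */
        }
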
diff --git a/tools/memory-model/Documentation/cheatsheet.txt b/tools/memory-model/Documentation/cheatsheet.txt
index 956b1ae4aafb..33ba98d72b16 100644
--- a/tools/memory-model/Documentation/cheatsheet.txt
+++ b/tools/memory-model/Documentation/cheatsheet.txt
@@ -1,6 +1,6 @@
1 Prior Operation Subsequent Operation 1 Prior Operation Subsequent Operation
2 --------------- --------------------------- 2 --------------- ---------------------------
3 C Self R W RWM Self R W DR DW RMW SV 3 C Self R W RMW Self R W DR DW RMW SV
4 -- ---- - - --- ---- - - -- -- --- -- 4 -- ---- - - --- ---- - - -- -- --- --
5 5
6Store, e.g., WRITE_ONCE() Y Y 6Store, e.g., WRITE_ONCE() Y Y
@@ -14,7 +14,7 @@ smp_wmb() Y W Y Y W
14smp_mb() & synchronize_rcu() CP Y Y Y Y Y Y Y Y 14smp_mb() & synchronize_rcu() CP Y Y Y Y Y Y Y Y
15Successful full non-void RMW CP Y Y Y Y Y Y Y Y Y Y Y 15Successful full non-void RMW CP Y Y Y Y Y Y Y Y Y Y Y
16smp_mb__before_atomic() CP Y Y Y a a a a Y 16smp_mb__before_atomic() CP Y Y Y a a a a Y
17smp_mb__after_atomic() CP a a Y Y Y Y Y 17smp_mb__after_atomic() CP a a Y Y Y Y Y Y
18 18
19 19
20Key: C: Ordering is cumulative 20Key: C: Ordering is cumulative
@@ -26,4 +26,5 @@ Key: C: Ordering is cumulative
26 DR: Dependent read (address dependency) 26 DR: Dependent read (address dependency)
27 DW: Dependent write (address, data, or control dependency) 27 DW: Dependent write (address, data, or control dependency)
28 RMW: Atomic read-modify-write operation 28 RMW: Atomic read-modify-write operation
29 SV Same-variable access 29 SELF: Orders self, as opposed to accesses before and/or after
30 SV: Orders later accesses to the same variable
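
As the smp_mb__before_atomic()/smp_mb__after_atomic() rows indicate, a non-value-returning atomic operation provides no ordering by itself and only acts as a full ordering point when bracketed by these barriers. A small illustrative fragment, with made-up variable names:

        /* Illustrative only: turn a void RMW into a fully ordered operation. */
        static atomic_t refs = ATOMIC_INIT(0);
        static int data;

        static void publish_sketch(void)
        {
                WRITE_ONCE(data, 1);            /* A */
                smp_mb__before_atomic();        /* order A before the RMW */
                atomic_inc(&refs);              /* void RMW, unordered by itself */
                smp_mb__after_atomic();         /* order the RMW before B */
                WRITE_ONCE(data, 2);            /* B */
        }
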
diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index a727c82bd434..1b09f3175a1f 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -27,7 +27,7 @@ Explanation of the Linux-Kernel Memory Consistency Model
27 19. AND THEN THERE WAS ALPHA 27 19. AND THEN THERE WAS ALPHA
28 20. THE HAPPENS-BEFORE RELATION: hb 28 20. THE HAPPENS-BEFORE RELATION: hb
29 21. THE PROPAGATES-BEFORE RELATION: pb 29 21. THE PROPAGATES-BEFORE RELATION: pb
30 22. RCU RELATIONS: link, gp-link, rscs-link, and rcu-path 30 22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
31 23. ODDS AND ENDS 31 23. ODDS AND ENDS
32 32
33 33
@@ -1451,8 +1451,8 @@ they execute means that it cannot have cycles. This requirement is
1451the content of the LKMM's "propagation" axiom. 1451the content of the LKMM's "propagation" axiom.
1452 1452
1453 1453
1454RCU RELATIONS: link, gp-link, rscs-link, and rcu-path 1454RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
1455----------------------------------------------------- 1455----------------------------------------------------
1456 1456
1457RCU (Read-Copy-Update) is a powerful synchronization mechanism. It 1457RCU (Read-Copy-Update) is a powerful synchronization mechanism. It
1458rests on two concepts: grace periods and read-side critical sections. 1458rests on two concepts: grace periods and read-side critical sections.
@@ -1509,8 +1509,8 @@ y, which occurs before the end of the critical section, did not
1509propagate to P1 before the end of the grace period, violating the 1509propagate to P1 before the end of the grace period, violating the
1510Guarantee. 1510Guarantee.
1511 1511
1512In the kernel's implementations of RCU, the business about stores 1512In the kernel's implementations of RCU, the requirements for stores
1513propagating to every CPU is realized by placing strong fences at 1513to propagate to every CPU are fulfilled by placing strong fences at
1514suitable places in the RCU-related code. Thus, if a critical section 1514suitable places in the RCU-related code. Thus, if a critical section
1515starts before a grace period does then the critical section's CPU will 1515starts before a grace period does then the critical section's CPU will
1516execute an smp_mb() fence after the end of the critical section and 1516execute an smp_mb() fence after the end of the critical section and
@@ -1523,72 +1523,124 @@ executes.
1523What exactly do we mean by saying that a critical section "starts 1523What exactly do we mean by saying that a critical section "starts
1524before" or "ends after" a grace period? Some aspects of the meaning 1524before" or "ends after" a grace period? Some aspects of the meaning
1525are pretty obvious, as in the example above, but the details aren't 1525are pretty obvious, as in the example above, but the details aren't
1526entirely clear. The LKMM formalizes this notion by means of a 1526entirely clear. The LKMM formalizes this notion by means of the
1527relation with the unfortunately generic name "link". It is a very 1527rcu-link relation. rcu-link encompasses a very general notion of
1528general relation; among other things, X ->link Z includes cases where 1528"before": Among other things, X ->rcu-link Z includes cases where X
1529X happens-before or is equal to some event Y which is equal to or 1529happens-before or is equal to some event Y which is equal to or comes
1530comes before Z in the coherence order. Taking Y = Z, this says that 1530before Z in the coherence order. When Y = Z this says that X ->rfe Z
1531X ->rfe Z implies X ->link Z, and taking Y = X, it says that X ->fr Z 1531implies X ->rcu-link Z. In addition, when Y = X it says that X ->fr Z
1532and X ->co Z each imply X ->link Z. 1532and X ->co Z each imply X ->rcu-link Z.
1533 1533
1534The formal definition of the link relation is more than a little 1534The formal definition of the rcu-link relation is more than a little
1535obscure, and we won't give it here. It is closely related to the pb 1535obscure, and we won't give it here. It is closely related to the pb
1536relation, and the details don't matter unless you want to comb through 1536relation, and the details don't matter unless you want to comb through
1537a somewhat lengthy formal proof. Pretty much all you need to know 1537a somewhat lengthy formal proof. Pretty much all you need to know
1538about link is the information in the preceding paragraph. 1538about rcu-link is the information in the preceding paragraph.
1539 1539
1540The LKMM goes on to define the gp-link and rscs-link relations. They 1540The LKMM also defines the gp and rscs relations. They bring grace
1541bring grace periods and read-side critical sections into the picture, 1541periods and read-side critical sections into the picture, in the
1542in the following way: 1542following way:
1543 1543
1544 E ->gp-link F means there is a synchronize_rcu() fence event S 1544 E ->gp F means there is a synchronize_rcu() fence event S such
1545 and an event X such that E ->po S, either S ->po X or S = X, 1545 that E ->po S and either S ->po F or S = F. In simple terms,
1546 and X ->link F. In other words, E and F are connected by a 1546 there is a grace period po-between E and F.
1547 grace period followed by an instance of link. 1547
1548 1548 E ->rscs F means there is a critical section delimited by an
1549 E ->rscs-link F means there is a critical section delimited by 1549 rcu_read_lock() fence L and an rcu_read_unlock() fence U, such
1550 an rcu_read_lock() fence L and an rcu_read_unlock() fence U, 1550 that E ->po U and either L ->po F or L = F. You can think of
1551 and an event X such that E ->po U, either L ->po X or L = X, 1551 this as saying that E and F are in the same critical section
1552 and X ->link F. Roughly speaking, this says that some event 1552 (in fact, it also allows E to be po-before the start of the
1553 in the same critical section as E is connected by link to F. 1553 critical section and F to be po-after the end).
1554 1554
1555If we think of the link relation as standing for an extended "before", 1555If we think of the rcu-link relation as standing for an extended
1556then E ->gp-link F says that E executes before a grace period which 1556"before", then X ->gp Y ->rcu-link Z says that X executes before a
1557ends before F executes. (In fact it says more than this, because it 1557grace period which ends before Z executes. (In fact it covers more
1558includes cases where E executes before a grace period and some store 1558than this, because it also includes cases where X executes before a
1559propagates to F's CPU before F executes and doesn't propagate to some 1559grace period and some store propagates to Z's CPU before Z executes
1560other CPU until after the grace period ends.) Similarly, 1560but doesn't propagate to some other CPU until after the grace period
1561E ->rscs-link F says that E is part of (or before the start of) a 1561ends.) Similarly, X ->rscs Y ->rcu-link Z says that X is part of (or
1562critical section which starts before F executes. 1562before the start of) a critical section which starts before Z
1563executes.
1564
1565The LKMM goes on to define the rcu-fence relation as a sequence of gp
1566and rscs links separated by rcu-link links, in which the number of gp
1567links is >= the number of rscs links. For example:
1568
1569 X ->gp Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V
1570
1571would imply that X ->rcu-fence V, because this sequence contains two
1572gp links and only one rscs link. (It also implies that X ->rcu-fence T
1573and Z ->rcu-fence V.) On the other hand:
1574
1575 X ->rscs Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V
1576
1577does not imply X ->rcu-fence V, because the sequence contains only
1578one gp link but two rscs links.
1579
1580The rcu-fence relation is important because the Grace Period Guarantee
1581means that rcu-fence acts kind of like a strong fence. In particular,
1582if W is a write and we have W ->rcu-fence Z, the Guarantee says that W
1583will propagate to every CPU before Z executes.
1584
1585To prove this in full generality requires some intellectual effort.
1586We'll consider just a very simple case:
1587
1588 W ->gp X ->rcu-link Y ->rscs Z.
1589
1590This formula means that there is a grace period G and a critical
1591section C such that:
1592
1593 1. W is po-before G;
1594
1595 2. X is equal to or po-after G;
1596
1597 3. X comes "before" Y in some sense;
1598
1599 4. Y is po-before the end of C;
1600
1601 5. Z is equal to or po-after the start of C.
1602
1603From 2 - 4 we deduce that the grace period G ends before the critical
1604section C. Then the second part of the Grace Period Guarantee says
1605not only that G starts before C does, but also that W (which executes
1606on G's CPU before G starts) must propagate to every CPU before C
1607starts. In particular, W propagates to every CPU before Z executes
1608(or finishes executing, in the case where Z is equal to the
1609rcu_read_lock() fence event which starts C.) This sort of reasoning
1610can be expanded to handle all the situations covered by rcu-fence.
1611
1612Finally, the LKMM defines the RCU-before (rb) relation in terms of
1613rcu-fence. This is done in essentially the same way as the pb
1614relation was defined in terms of strong-fence. We will omit the
1615details; the end result is that E ->rb F implies E must execute before
1616F, just as E ->pb F does (and for much the same reasons).
1563 1617
1564Putting this all together, the LKMM expresses the Grace Period 1618Putting this all together, the LKMM expresses the Grace Period
1565Guarantee by requiring that there are no cycles consisting of gp-link 1619Guarantee by requiring that the rb relation does not contain a cycle.
1566and rscs-link connections in which the number of gp-link instances is 1620Equivalently, this "rcu" axiom requires that there are no events E and
1567>= the number of rscs-link instances. It does this by defining the 1621F with E ->rcu-link F ->rcu-fence E. Or to put it a third way, the
1568rcu-path relation to link events E and F whenever it is possible to 1622axiom requires that there are no cycles consisting of gp and rscs
1569pass from E to F by a sequence of gp-link and rscs-link connections 1623alternating with rcu-link, where the number of gp links is >= the
1570with at least as many of the former as the latter. The LKMM's "rcu" 1624number of rscs links.
1571axiom then says that there are no events E such that E ->rcu-path E. 1625
1572 1626Justifying the axiom isn't easy, but it is in fact a valid
1573Justifying this axiom takes some intellectual effort, but it is in 1627formalization of the Grace Period Guarantee. We won't attempt to go
1574fact a valid formalization of the Grace Period Guarantee. We won't 1628through the detailed argument, but the following analysis gives a
1575attempt to go through the detailed argument, but the following 1629taste of what is involved. Suppose we have a violation of the first
1576analysis gives a taste of what is involved. Suppose we have a 1630part of the Guarantee: A critical section starts before a grace
1577violation of the first part of the Guarantee: A critical section 1631period, and some store propagates to the critical section's CPU before
1578starts before a grace period, and some store propagates to the 1632the end of the critical section but doesn't propagate to some other
1579critical section's CPU before the end of the critical section but 1633CPU until after the end of the grace period.
1580doesn't propagate to some other CPU until after the end of the grace
1581period.
1582 1634
1583Putting symbols to these ideas, let L and U be the rcu_read_lock() and 1635Putting symbols to these ideas, let L and U be the rcu_read_lock() and
1584rcu_read_unlock() fence events delimiting the critical section in 1636rcu_read_unlock() fence events delimiting the critical section in
1585question, and let S be the synchronize_rcu() fence event for the grace 1637question, and let S be the synchronize_rcu() fence event for the grace
1586period. Saying that the critical section starts before S means there 1638period. Saying that the critical section starts before S means there
1587are events E and F where E is po-after L (which marks the start of the 1639are events E and F where E is po-after L (which marks the start of the
1588critical section), E is "before" F in the sense of the link relation, 1640critical section), E is "before" F in the sense of the rcu-link
1589and F is po-before the grace period S: 1641relation, and F is po-before the grace period S:
1590 1642
1591 L ->po E ->link F ->po S. 1643 L ->po E ->rcu-link F ->po S.
1592 1644
1593Let W be the store mentioned above, let Z come before the end of the 1645Let W be the store mentioned above, let Z come before the end of the
1594critical section and witness that W propagates to the critical 1646critical section and witness that W propagates to the critical
@@ -1600,16 +1652,19 @@ some event X which is po-after S. Symbolically, this amounts to:
1600 1652
1601The fr link from Y to W indicates that W has not propagated to Y's CPU 1653The fr link from Y to W indicates that W has not propagated to Y's CPU
1602at the time that Y executes. From this, it can be shown (see the 1654at the time that Y executes. From this, it can be shown (see the
1603discussion of the link relation earlier) that X and Z are connected by 1655discussion of the rcu-link relation earlier) that X and Z are related
1604link, yielding: 1656by rcu-link, yielding:
1657
1658 S ->po X ->rcu-link Z ->po U.
1659
1660The formulas say that S is po-between F and X, hence F ->gp X. They
1661also say that Z comes before the end of the critical section and E
1662comes after its start, hence Z ->rscs E. From all this we obtain:
1605 1663
1606 S ->po X ->link Z ->po U. 1664 F ->gp X ->rcu-link Z ->rscs E ->rcu-link F,
1607 1665
1608These formulas say that S is po-between F and X, hence F ->gp-link Z 1666a forbidden cycle. Thus the "rcu" axiom rules out this violation of
1609via X. They also say that Z comes before the end of the critical 1667the Grace Period Guarantee.
1610section and E comes after its start, hence Z ->rscs-link F via E. But
1611now we have a forbidden cycle: F ->gp-link Z ->rscs-link F. Thus the
1612"rcu" axiom rules out this violation of the Grace Period Guarantee.
1613 1668
1614For something a little more down-to-earth, let's see how the axiom 1669For something a little more down-to-earth, let's see how the axiom
1615works out in practice. Consider the RCU code example from above, this 1670works out in practice. Consider the RCU code example from above, this
@@ -1635,18 +1690,18 @@ time with statement labels added to the memory access instructions:
1635 } 1690 }
1636 1691
1637 1692
1638If r2 = 0 at the end then P0's store at X overwrites the value 1693If r2 = 0 at the end then P0's store at X overwrites the value that
1639that P1's load at Z reads from, so we have Z ->fre X and thus 1694P1's load at Z reads from, so we have Z ->fre X and thus Z ->rcu-link X.
1640Z ->link X. In addition, there is a synchronize_rcu() between Y and 1695In addition, there is a synchronize_rcu() between Y and Z, so therefore
1641Z, so therefore we have Y ->gp-link X. 1696we have Y ->gp Z.
1642 1697
1643If r1 = 1 at the end then P1's load at Y reads from P0's store at W, 1698If r1 = 1 at the end then P1's load at Y reads from P0's store at W,
1644so we have W ->link Y. In addition, W and X are in the same critical 1699so we have W ->rcu-link Y. In addition, W and X are in the same critical
1645section, so therefore we have X ->rscs-link Y. 1700section, so therefore we have X ->rscs W.
1646 1701
1647This gives us a cycle, Y ->gp-link X ->rscs-link Y, with one gp-link 1702Then X ->rscs W ->rcu-link Y ->gp Z ->rcu-link X is a forbidden cycle,
1648and one rscs-link, violating the "rcu" axiom. Hence the outcome is 1703violating the "rcu" axiom. Hence the outcome is not allowed by the
1649not allowed by the LKMM, as we would expect. 1704LKMM, as we would expect.
1650 1705
1651For contrast, let's see what can happen in a more complicated example: 1706For contrast, let's see what can happen in a more complicated example:
1652 1707
@@ -1682,15 +1737,11 @@ For contrast, let's see what can happen in a more complicated example:
1682 } 1737 }
1683 1738
1684If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows 1739If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
1685that W ->rscs-link Y via X, Y ->gp-link U via Z, and U ->rscs-link W 1740that W ->rscs X ->rcu-link Y ->gp Z ->rcu-link U ->rscs V ->rcu-link W.
1686via V. And just as before, this gives a cycle: 1741However this cycle is not forbidden, because the sequence of relations
1687 1742contains fewer instances of gp (one) than of rscs (two). Consequently
1688 W ->rscs-link Y ->gp-link U ->rscs-link W. 1743the outcome is allowed by the LKMM. The following instruction timing
1689 1744diagram shows how it might actually occur:
1690However, this cycle has fewer gp-link instances than rscs-link
1691instances, and consequently the outcome is not forbidden by the LKMM.
1692The following instruction timing diagram shows how it might actually
1693occur:
1694 1745
1695P0 P1 P2 1746P0 P1 P2
1696-------------------- -------------------- -------------------- 1747-------------------- -------------------- --------------------
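
The renamed relations are easiest to trace in the simple RCU example the text analyzes (stores W and X inside P0's critical section, loads Y and Z separated by synchronize_rcu() in P1). The following is a reconstruction in kernel-style C, not a copy of the file's own listing; the outcome r1 == 1 and r2 == 0 is the one the "rcu" axiom forbids:

        /* Reconstructed from the analysis above; labels W, X, Y, Z match the text. */
        int x, y;

        void P0(void)
        {
                rcu_read_lock();
                WRITE_ONCE(x, 1);       /* W: same critical section as X, so X ->rscs W */
                WRITE_ONCE(y, 1);       /* X */
                rcu_read_unlock();
        }

        void P1(void)
        {
                int r1, r2;

                r1 = READ_ONCE(x);      /* Y: r1 == 1 means W ->rcu-link Y */
                synchronize_rcu();      /*    and the grace period gives Y ->gp Z */
                r2 = READ_ONCE(y);      /* Z: r2 == 0 means Z ->rcu-link X */
        }

        /* X ->rscs W ->rcu-link Y ->gp Z ->rcu-link X is the forbidden cycle. */
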
diff --git a/tools/memory-model/Documentation/references.txt b/tools/memory-model/Documentation/references.txt
index ba2e34c2ec3f..b177f3e4a614 100644
--- a/tools/memory-model/Documentation/references.txt
+++ b/tools/memory-model/Documentation/references.txt
@@ -63,15 +63,22 @@ o Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis,
63 Principles of Programming Languages (POPL 2017). ACM, New York, 63 Principles of Programming Languages (POPL 2017). ACM, New York,
64 NY, USA, 429–442. 64 NY, USA, 429–442.
65 65
66o Christopher Pulte, Shaked Flur, Will Deacon, Jon French,
67 Susmit Sarkar, and Peter Sewell. 2018. "Simplifying ARM concurrency:
68 multicopy-atomic axiomatic and operational models for ARMv8". In
69 Proceedings of the ACM on Programming Languages, Volume 2, Issue
70 POPL, Article No. 19. ACM, New York, NY, USA.
71
66 72
67Linux-kernel memory model 73Linux-kernel memory model
68========================= 74=========================
69 75
70o Andrea Parri, Alan Stern, Luc Maranget, Paul E. McKenney, 76o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and
71 and Jade Alglave. 2017. "A formal model of 77 Alan Stern. 2018. "Frightening small children and disconcerting
72 Linux-kernel memory ordering - companion webpage". 78 grown-ups: Concurrency in the Linux kernel". In Proceedings of
73 http://moscova.inria.fr/∼maranget/cats7/linux/. (2017). [Online; 79 the 23rd International Conference on Architectural Support for
74 accessed 30-January-2017]. 80 Programming Languages and Operating Systems (ASPLOS 2018). ACM,
81 New York, NY, USA, 405-418. Webpage: http://diy.inria.fr/linux/.
75 82
76o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and 83o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and
77 Alan Stern. 2017. "A formal kernel memory-ordering model (part 1)" 84 Alan Stern. 2017. "A formal kernel memory-ordering model (part 1)"
diff --git a/tools/memory-model/README b/tools/memory-model/README
index 0b3a5f3c9ccd..734f7feaa5dc 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -20,7 +20,7 @@ that litmus test to be exercised within the Linux kernel.
20REQUIREMENTS 20REQUIREMENTS
21============ 21============
22 22
23Version 7.48 of the "herd7" and "klitmus7" tools must be downloaded 23Version 7.49 of the "herd7" and "klitmus7" tools must be downloaded
24separately: 24separately:
25 25
26 https://github.com/herd/herdtools7 26 https://github.com/herd/herdtools7
diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
index 432c7cf71b23..64f5740e0e75 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -5,10 +5,10 @@
5 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>, 5 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>,
6 * Andrea Parri <parri.andrea@gmail.com> 6 * Andrea Parri <parri.andrea@gmail.com>
7 * 7 *
8 * An earlier version of this file appears in the companion webpage for 8 * An earlier version of this file appeared in the companion webpage for
9 * "Frightening small children and disconcerting grown-ups: Concurrency 9 * "Frightening small children and disconcerting grown-ups: Concurrency
10 * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, 10 * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern,
11 * which is to appear in ASPLOS 2018. 11 * which appeared in ASPLOS 2018.
12 *) 12 *)
13 13
14"Linux-kernel memory consistency model" 14"Linux-kernel memory consistency model"
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index df97db03b6c2..59b5cbe6b624 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -5,10 +5,10 @@
5 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>, 5 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>,
6 * Andrea Parri <parri.andrea@gmail.com> 6 * Andrea Parri <parri.andrea@gmail.com>
7 * 7 *
8 * An earlier version of this file appears in the companion webpage for 8 * An earlier version of this file appeared in the companion webpage for
9 * "Frightening small children and disconcerting grown-ups: Concurrency 9 * "Frightening small children and disconcerting grown-ups: Concurrency
10 * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, 10 * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern,
11 * which is to appear in ASPLOS 2018. 11 * which appeared in ASPLOS 2018.
12 *) 12 *)
13 13
14"Linux-kernel memory consistency model" 14"Linux-kernel memory consistency model"
@@ -100,22 +100,29 @@ let rscs = po ; crit^-1 ; po?
100 * one but two non-rf relations, but only in conjunction with an RCU 100 * one but two non-rf relations, but only in conjunction with an RCU
101 * read-side critical section. 101 * read-side critical section.
102 *) 102 *)
103let link = hb* ; pb* ; prop 103let rcu-link = hb* ; pb* ; prop
104 104
105(* Chains that affect the RCU grace-period guarantee *) 105(*
106let gp-link = gp ; link 106 * Any sequence containing at least as many grace periods as RCU read-side
107let rscs-link = rscs ; link 107 * critical sections (joined by rcu-link) acts as a generalized strong fence.
108 *)
109let rec rcu-fence = gp |
110 (gp ; rcu-link ; rscs) |
111 (rscs ; rcu-link ; gp) |
112 (gp ; rcu-link ; rcu-fence ; rcu-link ; rscs) |
113 (rscs ; rcu-link ; rcu-fence ; rcu-link ; gp) |
114 (rcu-fence ; rcu-link ; rcu-fence)
115
116(* rb orders instructions just as pb does *)
117let rb = prop ; rcu-fence ; hb* ; pb*
118
119irreflexive rb as rcu
108 120
109(* 121(*
110 * A cycle containing at least as many grace periods as RCU read-side 122 * The happens-before, propagation, and rcu constraints are all
111 * critical sections is forbidden. 123 * expressions of temporal ordering. They could be replaced by
124 * a single constraint on an "executes-before" relation, xb:
125 *
126 * let xb = hb | pb | rb
127 * acyclic xb as executes-before
112 *) 128 *)
113let rec rcu-path =
114 gp-link |
115 (gp-link ; rscs-link) |
116 (rscs-link ; gp-link) |
117 (rcu-path ; rcu-path) |
118 (gp-link ; rcu-path ; rscs-link) |
119 (rscs-link ; rcu-path ; gp-link)
120
121irreflexive rcu-path as rcu
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
index 397e4e67e8c8..6fa3eb28d40b 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -1,9 +1,9 @@
1// SPDX-License-Identifier: GPL-2.0+ 1// SPDX-License-Identifier: GPL-2.0+
2// 2//
3// An earlier version of this file appears in the companion webpage for 3// An earlier version of this file appeared in the companion webpage for
4// "Frightening small children and disconcerting grown-ups: Concurrency 4// "Frightening small children and disconcerting grown-ups: Concurrency
5// in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, 5// in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern,
6// which is to appear in ASPLOS 2018. 6// which appeared in ASPLOS 2018.
7 7
8// ONCE 8// ONCE
9READ_ONCE(X) __load{once}(X) 9READ_ONCE(X) __load{once}(X)
@@ -14,14 +14,15 @@ smp_store_release(X,V) { __store{release}(*X,V); }
14smp_load_acquire(X) __load{acquire}(*X) 14smp_load_acquire(X) __load{acquire}(*X)
15rcu_assign_pointer(X,V) { __store{release}(X,V); } 15rcu_assign_pointer(X,V) { __store{release}(X,V); }
16rcu_dereference(X) __load{once}(X) 16rcu_dereference(X) __load{once}(X)
17smp_store_mb(X,V) { __store{once}(X,V); __fence{mb}; }
17 18
18// Fences 19// Fences
19smp_mb() { __fence{mb} ; } 20smp_mb() { __fence{mb}; }
20smp_rmb() { __fence{rmb} ; } 21smp_rmb() { __fence{rmb}; }
21smp_wmb() { __fence{wmb} ; } 22smp_wmb() { __fence{wmb}; }
22smp_mb__before_atomic() { __fence{before-atomic} ; } 23smp_mb__before_atomic() { __fence{before-atomic}; }
23smp_mb__after_atomic() { __fence{after-atomic} ; } 24smp_mb__after_atomic() { __fence{after-atomic}; }
24smp_mb__after_spinlock() { __fence{after-spinlock} ; } 25smp_mb__after_spinlock() { __fence{after-spinlock}; }
25 26
26// Exchange 27// Exchange
27xchg(X,V) __xchg{mb}(X,V) 28xchg(X,V) __xchg{mb}(X,V)
@@ -34,26 +35,27 @@ cmpxchg_acquire(X,V,W) __cmpxchg{acquire}(X,V,W)
34cmpxchg_release(X,V,W) __cmpxchg{release}(X,V,W) 35cmpxchg_release(X,V,W) __cmpxchg{release}(X,V,W)
35 36
36// Spinlocks 37// Spinlocks
37spin_lock(X) { __lock(X) ; } 38spin_lock(X) { __lock(X); }
38spin_unlock(X) { __unlock(X) ; } 39spin_unlock(X) { __unlock(X); }
39spin_trylock(X) __trylock(X) 40spin_trylock(X) __trylock(X)
41spin_is_locked(X) __islocked(X)
40 42
41// RCU 43// RCU
42rcu_read_lock() { __fence{rcu-lock}; } 44rcu_read_lock() { __fence{rcu-lock}; }
43rcu_read_unlock() { __fence{rcu-unlock};} 45rcu_read_unlock() { __fence{rcu-unlock}; }
44synchronize_rcu() { __fence{sync-rcu}; } 46synchronize_rcu() { __fence{sync-rcu}; }
45synchronize_rcu_expedited() { __fence{sync-rcu}; } 47synchronize_rcu_expedited() { __fence{sync-rcu}; }
46 48
47// Atomic 49// Atomic
48atomic_read(X) READ_ONCE(*X) 50atomic_read(X) READ_ONCE(*X)
49atomic_set(X,V) { WRITE_ONCE(*X,V) ; } 51atomic_set(X,V) { WRITE_ONCE(*X,V); }
50atomic_read_acquire(X) smp_load_acquire(X) 52atomic_read_acquire(X) smp_load_acquire(X)
51atomic_set_release(X,V) { smp_store_release(X,V); } 53atomic_set_release(X,V) { smp_store_release(X,V); }
52 54
53atomic_add(V,X) { __atomic_op(X,+,V) ; } 55atomic_add(V,X) { __atomic_op(X,+,V); }
54atomic_sub(V,X) { __atomic_op(X,-,V) ; } 56atomic_sub(V,X) { __atomic_op(X,-,V); }
55atomic_inc(X) { __atomic_op(X,+,1) ; } 57atomic_inc(X) { __atomic_op(X,+,1); }
56atomic_dec(X) { __atomic_op(X,-,1) ; } 58atomic_dec(X) { __atomic_op(X,-,1); }
57 59
58atomic_add_return(V,X) __atomic_op_return{mb}(X,+,V) 60atomic_add_return(V,X) __atomic_op_return{mb}(X,+,V)
59atomic_add_return_relaxed(V,X) __atomic_op_return{once}(X,+,V) 61atomic_add_return_relaxed(V,X) __atomic_op_return{once}(X,+,V)
diff --git a/tools/memory-model/litmus-tests/.gitignore b/tools/memory-model/litmus-tests/.gitignore
new file mode 100644
index 000000000000..6e2ddc54152f
--- /dev/null
+++ b/tools/memory-model/litmus-tests/.gitignore
@@ -0,0 +1 @@
*.litmus.out
diff --git a/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus b/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus
index 50d5db9ea983..98a3716efa37 100644
--- a/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus
+++ b/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus
@@ -7,7 +7,7 @@ C IRIW+mbonceonces+OnceOnce
7 * between each pair of reads. In other words, is smp_mb() sufficient to 7 * between each pair of reads. In other words, is smp_mb() sufficient to
8 * cause two different reading processes to agree on the order of a pair 8 * cause two different reading processes to agree on the order of a pair
9 * of writes, where each write is to a different variable by a different 9 * of writes, where each write is to a different variable by a different
10 * process? 10 * process? This litmus test exercises LKMM's "propagation" rule.
11 *) 11 *)
12 12
13{} 13{}
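
For readers without the litmus file at hand, the IRIW shape being referred to is roughly as follows (consult the file itself for the authoritative version). The propagation rule forbids the two readers from disagreeing about the order of the two independent writes:

        /* Rough C rendering of IRIW+mbonceonces+OnceOnce. */
        int x, y;

        void P0(void) { WRITE_ONCE(x, 1); }
        void P1(void) { WRITE_ONCE(y, 1); }

        void P2(void)
        {
                int r0, r1;

                r0 = READ_ONCE(x);
                smp_mb();
                r1 = READ_ONCE(y);      /* sees x first: r0 == 1, r1 == 0 */
        }

        void P3(void)
        {
                int r2, r3;

                r2 = READ_ONCE(y);
                smp_mb();
                r3 = READ_ONCE(x);      /* sees y first: r2 == 1, r3 == 0 */
        }

        /* All four register values holding at once is the forbidden outcome. */
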
diff --git a/tools/memory-model/litmus-tests/MP+polockmbonce+poacquiresilsil.litmus b/tools/memory-model/litmus-tests/MP+polockmbonce+poacquiresilsil.litmus
new file mode 100644
index 000000000000..50f4d62bbf0e
--- /dev/null
+++ b/tools/memory-model/litmus-tests/MP+polockmbonce+poacquiresilsil.litmus
@@ -0,0 +1,35 @@
1C MP+polockmbonce+poacquiresilsil
2
3(*
4 * Result: Never
5 *
6 * Do spinlocks combined with smp_mb__after_spinlock() provide order
7 * to outside observers using spin_is_locked() to sense the lock-held
8 * state, ordered by acquire? Note that when the first spin_is_locked()
9 * returns false and the second true, we know that the smp_load_acquire()
10 * executed before the lock was acquired (loosely speaking).
11 *)
12
13{
14}
15
16P0(spinlock_t *lo, int *x)
17{
18 spin_lock(lo);
19 smp_mb__after_spinlock();
20 WRITE_ONCE(*x, 1);
21 spin_unlock(lo);
22}
23
24P1(spinlock_t *lo, int *x)
25{
26 int r1;
27 int r2;
28 int r3;
29
30 r1 = smp_load_acquire(x);
31 r2 = spin_is_locked(lo);
32 r3 = spin_is_locked(lo);
33}
34
35exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1)
diff --git a/tools/memory-model/litmus-tests/MP+polockonce+poacquiresilsil.litmus b/tools/memory-model/litmus-tests/MP+polockonce+poacquiresilsil.litmus
new file mode 100644
index 000000000000..abf81e7a0895
--- /dev/null
+++ b/tools/memory-model/litmus-tests/MP+polockonce+poacquiresilsil.litmus
@@ -0,0 +1,34 @@
1C MP+polockonce+poacquiresilsil
2
3(*
4 * Result: Sometimes
5 *
6 * Do spinlocks provide order to outside observers using spin_is_locked()
7 * to sense the lock-held state, ordered by acquire? Note that when the
8 * first spin_is_locked() returns false and the second true, we know that
9 * the smp_load_acquire() executed before the lock was acquired (loosely
10 * speaking).
11 *)
12
13{
14}
15
16P0(spinlock_t *lo, int *x)
17{
18 spin_lock(lo);
19 WRITE_ONCE(*x, 1);
20 spin_unlock(lo);
21}
22
23P1(spinlock_t *lo, int *x)
24{
25 int r1;
26 int r2;
27 int r3;
28
29 r1 = smp_load_acquire(x);
30 r2 = spin_is_locked(lo);
31 r3 = spin_is_locked(lo);
32}
33
34exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1)
diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 04096fb8b8d9..17eb9a8c222d 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -23,7 +23,8 @@ IRIW+mbonceonces+OnceOnce.litmus
23 between each pair of reads. In other words, is smp_mb() 23 between each pair of reads. In other words, is smp_mb()
24 sufficient to cause two different reading processes to agree on 24 sufficient to cause two different reading processes to agree on
25 the order of a pair of writes, where each write is to a different 25 the order of a pair of writes, where each write is to a different
26 variable by a different process? 26 variable by a different process? This litmus test is forbidden
27 by LKMM's propagation rule.
27 28
28IRIW+poonceonces+OnceOnce.litmus 29IRIW+poonceonces+OnceOnce.litmus
29 Test of independent reads from independent writes with nothing 30 Test of independent reads from independent writes with nothing
@@ -63,6 +64,16 @@ LB+poonceonces.litmus
63MP+onceassign+derefonce.litmus 64MP+onceassign+derefonce.litmus
64 As below, but with rcu_assign_pointer() and an rcu_dereference(). 65 As below, but with rcu_assign_pointer() and an rcu_dereference().
65 66
67MP+polockmbonce+poacquiresilsil.litmus
68 Protect the access with a lock and an smp_mb__after_spinlock()
69 in one process, and use an acquire load followed by a pair of
70 spin_is_locked() calls in the other process.
71
72MP+polockonce+poacquiresilsil.litmus
73 Protect the access with a lock in one process, and use an
74 acquire load followed by a pair of spin_is_locked() calls
75 in the other process.
76
66MP+polocks.litmus 77MP+polocks.litmus
67 As below, but with the second access of the writer process 78 As below, but with the second access of the writer process
68 and the first access of reader process protected by a lock. 79 and the first access of reader process protected by a lock.
@@ -109,8 +120,10 @@ S+wmbonceonce+poacquireonce.litmus
109 120
110WRC+poonceonces+Once.litmus 121WRC+poonceonces+Once.litmus
111WRC+pooncerelease+rmbonceonce+Once.litmus 122WRC+pooncerelease+rmbonceonce+Once.litmus
112 These two are members of an extension of the MP litmus-test class 123 These two are members of an extension of the MP litmus-test
113 in which the first write is moved to a separate process. 124 class in which the first write is moved to a separate process.
125 The second is forbidden because smp_store_release() is
126 A-cumulative in LKMM.
114 127
115Z6.0+pooncelock+pooncelock+pombonce.litmus 128Z6.0+pooncelock+pooncelock+pombonce.litmus
116 Is the ordering provided by a spin_unlock() and a subsequent 129 Is the ordering provided by a spin_unlock() and a subsequent
diff --git a/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus b/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus
index 97fcbffde9a0..ad3448b941e6 100644
--- a/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus
+++ b/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus
@@ -5,7 +5,9 @@ C WRC+pooncerelease+rmbonceonce+Once
5 * 5 *
6 * This litmus test is an extension of the message-passing pattern, where 6 * This litmus test is an extension of the message-passing pattern, where
7 * the first write is moved to a separate process. Because it features 7 * the first write is moved to a separate process. Because it features
8 * a release and a read memory barrier, it should be forbidden. 8 * a release and a read memory barrier, it should be forbidden. More
9 * specifically, this litmus test is forbidden because smp_store_release()
10 * is A-cumulative in LKMM.
9 *) 11 *)
10 12
11{} 13{}
diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index ba4a4ec6d313..305ded17e741 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -4,46 +4,72 @@
4 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu> 4 * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>
5 *) 5 *)
6 6
7(* Generate coherence orders and handle lock operations *) 7(*
8 * Generate coherence orders and handle lock operations
9 *
10 * Warning: spin_is_locked() crashes herd7 versions strictly before 7.48.
11 * spin_is_locked() is functional from herd7 version 7.49.
12 *)
8 13
9include "cross.cat" 14include "cross.cat"
10 15
11(* From lock reads to their partner lock writes *)
12let lk-rmw = ([LKR] ; po-loc ; [LKW]) \ (po ; po)
13let rmw = rmw | lk-rmw
14
15(* 16(*
16 * A paired LKR must always see an unlocked value; spin_lock() calls nested 17 * The lock-related events generated by herd are as follows:
17 * inside a critical section (for the same lock) always deadlock. 18 *
19 * LKR Lock-Read: the read part of a spin_lock() or successful
20 * spin_trylock() read-modify-write event pair
21 * LKW Lock-Write: the write part of a spin_lock() or successful
22 * spin_trylock() RMW event pair
23 * UL Unlock: a spin_unlock() event
24 * LF Lock-Fail: a failed spin_trylock() event
25 * RL Read-Locked: a spin_is_locked() event which returns True
26 * RU Read-Unlocked: a spin_is_locked() event which returns False
27 *
28 * LKR and LKW events always come paired, like all RMW event sequences.
29 *
30 * LKR, LF, RL, and RU are read events; LKR has Acquire ordering.
31 * LKW and UL are write events; UL has Release ordering.
32 * LKW, LF, RL, and RU have no ordering properties.
18 *) 33 *)
19empty ([LKW] ; po-loc ; [domain(lk-rmw)]) \ (po-loc ; [UL] ; po-loc)
20 as lock-nest
21 34
22(* The litmus test is invalid if an LKW event is not part of an RMW pair *) 35(* Backward compatibility *)
23flag ~empty LKW \ range(lk-rmw) as unpaired-LKW 36let RL = try RL with emptyset
37let RU = try RU with emptyset
24 38
25(* This will be allowed if we implement spin_is_locked() *) 39(* Treat RL as a kind of LF: a read with no ordering properties *)
26flag ~empty LKR \ domain(lk-rmw) as unpaired-LKR 40let LF = LF | RL
27 41
28(* There should be no R or W accesses to spinlocks *) 42(* There should be no ordinary R or W accesses to spinlocks *)
29let ALL-LOCKS = LKR | LKW | UL | LF 43let ALL-LOCKS = LKR | LKW | UL | LF | RU
30flag ~empty [M \ IW] ; loc ; [ALL-LOCKS] as mixed-lock-accesses 44flag ~empty [M \ IW] ; loc ; [ALL-LOCKS] as mixed-lock-accesses
31 45
46(* Link Lock-Reads to their RMW-partner Lock-Writes *)
47let lk-rmw = ([LKR] ; po-loc ; [LKW]) \ (po ; po)
48let rmw = rmw | lk-rmw
49
50(* The litmus test is invalid if an LKR/LKW event is not part of an RMW pair *)
51flag ~empty LKW \ range(lk-rmw) as unpaired-LKW
52flag ~empty LKR \ domain(lk-rmw) as unpaired-LKR
53
54(*
55 * An LKR must always see an unlocked value; spin_lock() calls nested
56 * inside a critical section (for the same lock) always deadlock.
57 *)
58empty ([LKW] ; po-loc ; [LKR]) \ (po-loc ; [UL] ; po-loc) as lock-nest
59
32(* The final value of a spinlock should not be tested *) 60(* The final value of a spinlock should not be tested *)
33flag ~empty [FW] ; loc ; [ALL-LOCKS] as lock-final 61flag ~empty [FW] ; loc ; [ALL-LOCKS] as lock-final
34 62
35
36(* 63(*
37 * Put lock operations in their appropriate classes, but leave UL out of W 64 * Put lock operations in their appropriate classes, but leave UL out of W
38 * until after the co relation has been generated. 65 * until after the co relation has been generated.
39 *) 66 *)
40let R = R | LKR | LF 67let R = R | LKR | LF | RU
41let W = W | LKW 68let W = W | LKW
42 69
43let Release = Release | UL 70let Release = Release | UL
44let Acquire = Acquire | LKR 71let Acquire = Acquire | LKR
45 72
46
47(* Match LKW events to their corresponding UL events *) 73(* Match LKW events to their corresponding UL events *)
48let critical = ([LKW] ; po-loc ; [UL]) \ (po-loc ; [LKW | UL] ; po-loc) 74let critical = ([LKW] ; po-loc ; [UL]) \ (po-loc ; [LKW | UL] ; po-loc)
49 75
@@ -53,27 +79,48 @@ flag ~empty UL \ range(critical) as unmatched-unlock
53let UNMATCHED-LKW = LKW \ domain(critical) 79let UNMATCHED-LKW = LKW \ domain(critical)
54empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks 80empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
55 81
56
57(* rfi for LF events: link each LKW to the LF events in its critical section *) 82(* rfi for LF events: link each LKW to the LF events in its critical section *)
58let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc) 83let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
59 84
60(* rfe for LF events *) 85(* rfe for LF events *)
61let all-possible-rfe-lf = 86let all-possible-rfe-lf =
62 (* 87 (*
63 * Given an LF event r, compute the possible rfe edges for that event 88 * Given an LF event r, compute the possible rfe edges for that event
64 * (all those starting from LKW events in other threads), 89 * (all those starting from LKW events in other threads),
65 * and then convert that relation to a set of single-edge relations. 90 * and then convert that relation to a set of single-edge relations.
66 *) 91 *)
67 let possible-rfe-lf r = 92 let possible-rfe-lf r =
68 let pair-to-relation p = p ++ 0 93 let pair-to-relation p = p ++ 0
69 in map pair-to-relation ((LKW * {r}) & loc & ext) 94 in map pair-to-relation ((LKW * {r}) & loc & ext)
70 (* Do this for each LF event r that isn't in rfi-lf *) 95 (* Do this for each LF event r that isn't in rfi-lf *)
71 in map possible-rfe-lf (LF \ range(rfi-lf)) 96 in map possible-rfe-lf (LF \ range(rfi-lf))
72 97
73(* Generate all rf relations for LF events *) 98(* Generate all rf relations for LF events *)
74with rfe-lf from cross(all-possible-rfe-lf) 99with rfe-lf from cross(all-possible-rfe-lf)
75let rf = rf | rfi-lf | rfe-lf 100let rf-lf = rfe-lf | rfi-lf
101
102(*
103 * RU, i.e., spin_is_locked() returning False, is slightly different.
104 * We rely on the memory model to rule out cases where spin_is_locked()
105 * within one of the lock's critical sections returns False.
106 *)
107
108(* rfi for RU events: an RU may read from the last po-previous UL *)
109let rfi-ru = ([UL] ; po-loc ; [RU]) \ ([UL] ; po-loc ; [LKW] ; po-loc)
110
111(* rfe for RU events: an RU may read from an external UL or the initial write *)
112let all-possible-rfe-ru =
113 let possible-rfe-ru r =
114 let pair-to-relation p = p ++ 0
115 in map pair-to-relation (((UL | IW) * {r}) & loc & ext)
116 in map possible-rfe-ru RU
117
118(* Generate all rf relations for RU events *)
119with rfe-ru from cross(all-possible-rfe-ru)
120let rf-ru = rfe-ru | rfi-ru
76 121
122(* Final rf relation *)
123let rf = rf | rf-lf | rf-ru
77 124
78(* Generate all co relations, including LKW events but not UL *) 125(* Generate all co relations, including LKW events but not UL *)
79let co0 = co0 | ([IW] ; loc ; [LKW]) | 126let co0 = co0 | ([IW] ; loc ; [LKW]) |
diff --git a/tools/memory-model/scripts/checkalllitmus.sh b/tools/memory-model/scripts/checkalllitmus.sh
new file mode 100644
index 000000000000..af0aa15ab84e
--- /dev/null
+++ b/tools/memory-model/scripts/checkalllitmus.sh
@@ -0,0 +1,73 @@
1#!/bin/sh
2#
3# Run herd tests on all .litmus files in the specified directory (which
4# defaults to litmus-tests) and check each file's result against a "Result:"
5# comment within that litmus test. If the verification result does not
6# match that specified in the litmus test, this script prints an error
7# message prefixed with "^^^". It also outputs verification results to
 8# a file whose name is that of the corresponding litmus test, but with ".out"
9# appended.
10#
11# Usage:
12# sh checkalllitmus.sh [ directory ]
13#
14# The LINUX_HERD_OPTIONS environment variable may be used to specify
15# arguments to herd, whose default is defined by the checklitmus.sh script.
 16# Thus, one would normally run this in the directory containing the memory
 17# model, optionally specifying the directory containing the litmus tests to check.
18#
19# This script makes no attempt to run the litmus tests concurrently.
20#
21# This program is free software; you can redistribute it and/or modify
22# it under the terms of the GNU General Public License as published by
23# the Free Software Foundation; either version 2 of the License, or
24# (at your option) any later version.
25#
26# This program is distributed in the hope that it will be useful,
27# but WITHOUT ANY WARRANTY; without even the implied warranty of
28# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
29# GNU General Public License for more details.
30#
31# You should have received a copy of the GNU General Public License
32# along with this program; if not, you can access it online at
33# http://www.gnu.org/licenses/gpl-2.0.html.
34#
35# Copyright IBM Corporation, 2018
36#
37# Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
38
39litmusdir=${1-litmus-tests}
40if test -d "$litmusdir" -a -r "$litmusdir" -a -x "$litmusdir"
41then
42 :
43else
44 echo ' --- ' error: $litmusdir is not an accessible directory
45 exit 255
46fi
47
48# Find the checklitmus script. If it is not where we expect it, then
49# assume that the caller has the PATH environment variable set
50# appropriately.
51if test -x scripts/checklitmus.sh
52then
53 clscript=scripts/checklitmus.sh
54else
55 clscript=checklitmus.sh
56fi
57
58# Run the script on all the litmus tests in the specified directory
59ret=0
 60 for i in $litmusdir/*.litmus
61do
62 if ! $clscript $i
63 then
64 ret=1
65 fi
66done
67if test "$ret" -ne 0
68then
69 echo " ^^^ VERIFICATION MISMATCHES"
70else
 71 echo All litmus tests verified as expected.
72fi
73exit $ret
diff --git a/tools/memory-model/scripts/checklitmus.sh b/tools/memory-model/scripts/checklitmus.sh
new file mode 100644
index 000000000000..e2e477472844
--- /dev/null
+++ b/tools/memory-model/scripts/checklitmus.sh
@@ -0,0 +1,86 @@
1#!/bin/sh
2#
3# Run a herd test and check the result against a "Result:" comment within
4# the litmus test. If the verification result does not match that specified
5# in the litmus test, this script prints an error message prefixed with
6# "^^^" and exits with a non-zero status. It also outputs verification
7# results to a file whose name is that of the specified litmus test, but
8# with ".out" appended.
9#
10# Usage:
11# sh checklitmus.sh file.litmus
12#
13# The LINUX_HERD_OPTIONS environment variable may be used to specify
14# arguments to herd, which default to "-conf linux-kernel.cfg". Thus,
15# one would normally run this in the directory containing the memory model,
16# specifying the pathname of the litmus test to check.
17#
18# This program is free software; you can redistribute it and/or modify
19# it under the terms of the GNU General Public License as published by
20# the Free Software Foundation; either version 2 of the License, or
21# (at your option) any later version.
22#
23# This program is distributed in the hope that it will be useful,
24# but WITHOUT ANY WARRANTY; without even the implied warranty of
25# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
26# GNU General Public License for more details.
27#
28# You should have received a copy of the GNU General Public License
29# along with this program; if not, you can access it online at
30# http://www.gnu.org/licenses/gpl-2.0.html.
31#
32# Copyright IBM Corporation, 2018
33#
34# Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
35
36litmus=$1
37herdoptions=${LINUX_HERD_OPTIONS--conf linux-kernel.cfg}
38
39if test -f "$litmus" -a -r "$litmus"
40then
41 :
42else
43 echo ' --- ' error: \"$litmus\" is not a readable file
44 exit 255
45fi
46if grep -q '^ \* Result: ' $litmus
47then
48 outcome=`grep -m 1 '^ \* Result: ' $litmus | awk '{ print $3 }'`
49else
50 outcome=specified
51fi
52
53echo Herd options: $herdoptions > $litmus.out
54/usr/bin/time herd7 -o ~/tmp $herdoptions $litmus >> $litmus.out 2>&1
55grep "Herd options:" $litmus.out
56grep '^Observation' $litmus.out
57if grep -q '^Observation' $litmus.out
58then
59 :
60else
61 cat $litmus.out
62 echo ' ^^^ Verification error'
63 echo ' ^^^ Verification error' >> $litmus.out 2>&1
64 exit 255
65fi
66if test "$outcome" = DEADLOCK
67then
 68 # DEADLOCK tests must complete no executions: expect "Never 0 0"
69 if grep '^Observation' $litmus.out | grep -q 'Never 0 0$'
70 then
71 ret=0
72 else
73 echo " ^^^ Unexpected non-$outcome verification"
74 echo " ^^^ Unexpected non-$outcome verification" >> $litmus.out 2>&1
75 ret=1
76 fi
77elif grep '^Observation' $litmus.out | grep -q $outcome || test "$outcome" = Maybe
78then
79 ret=0
80else
81 echo " ^^^ Unexpected non-$outcome verification"
82 echo " ^^^ Unexpected non-$outcome verification" >> $litmus.out 2>&1
83 ret=1
84fi
85tail -2 $litmus.out | head -1
86exit $ret
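For reference, the "Result:" line that checklitmus.sh greps for sits in the header comment of each litmus test; the opening of a hypothetical test shows the expected format (the grep's third field, "Never" here, is the outcome compared against the word on herd's "Observation" line):

C sample-test

(*
 * Result: Never
 *
 * Hypothetical header shown only to illustrate the format checked by
 * checklitmus.sh; real tests go on to describe what they demonstrate.
 *)

checkalllitmus.sh simply runs this per-file check over every .litmus file in the chosen directory and prints a one-line summary. Note the DEADLOCK special case above: for such tests the script instead requires an Observation line ending in "Never 0 0", meaning herd found no completed executions either satisfying or violating the "exists" condition.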