author     Anton Blanchard <anton@samba.org>                 2011-12-07 15:11:45 -0500
committer  Benjamin Herrenschmidt <benh@kernel.crashing.org> 2011-12-18 22:40:40 -0500
commit     a66086b8197da8dc83b698642d5947ff850e708d (patch)
tree       5d05fbca4e687f5591c852c3a1dd3d80c373d307 /arch
parent     0766387bcf162ecd875b4eb5f44e3ef057a3329b (diff)
powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX
Implement a POWER7 optimised copy_to_user/copy_from_user using VMX. For
large aligned copies this new loop is over 10% faster, and for large
unaligned copies it is over 200% faster. If we take a fault we fall back
to the old version; this keeps things relatively simple and easy to
verify.

On POWER7 unaligned stores rarely slow down - they only flush when a
store crosses a 4KB page boundary, and this flush is handled completely
in hardware and should cost 20-30 cycles. Unaligned loads on the other
hand flush much more often - whenever crossing a 128 byte cache line, or
a 32 byte sector if either sector is an L1 miss. Given this, we really
want to get the loads aligned and not worry about the alignment of the
stores. Microbenchmarks confirm that this approach is much faster than
the current unaligned copy loop, which uses shifts and rotates to ensure
both loads and stores are aligned.

We also want to do the stores in cacheline aligned, cacheline sized
chunks. If the store queue is unable to merge an entire cacheline of
stores then the L2 cache will have to do a read/modify/write. Even
worse, we will serialise this with the stores in the next iteration of
the copy loop, since both iterations hit the same cacheline.

Based on this, the new loop does the following:

1 - 127 bytes
Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
boring and similar to how the current loop works.

128 - 4095 bytes
Get the source 8 byte aligned and use 8 byte loads and stores, one
cacheline at a time. We aren't doing the stores in cacheline aligned
chunks, so we will potentially serialise once per cacheline. Even so it
is much better than the loop we have today.

4096 bytes and up
If source and destination have the same alignment, get them both 16 byte
aligned, then get the destination cacheline aligned, and do cacheline
sized loads and stores using VMX. If source and destination do not have
the same alignment, get the destination cacheline aligned and use
permute to do aligned loads. In both cases the VMX loop should be
optimal: we always do aligned loads and stores, and always store in
cacheline aligned, cacheline sized chunks.

To be able to use VMX we must be careful about interrupts and sleeping.
We don't use the VMX loop when in an interrupt (which should be rare
anyway), and we wrap the VMX loop in pagefault_disable()/
pagefault_enable(), falling back to the existing copy_tofrom_user loop
if we do need to sleep.

The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:

    http://ozlabs.org/~anton/junkcode/copy_to_user.c

Since we are using VMX and there is a cost to saving and restoring the
user VMX state, there are two broad cases we need to benchmark:

- Best case: userspace never uses VMX
- Worst case: userspace always uses VMX

In reality a userspace process will sit somewhere between these two
extremes. Since we need to test both aligned and unaligned copies we end
up with four combinations. The point at which the VMX loop begins to win
is:

    0% VMX:    aligned 2048 bytes,   unaligned 2048 bytes
    100% VMX:  aligned 16384 bytes,  unaligned 8192 bytes

Considering this is a microbenchmark, the data is hot in cache, and the
VMX loop has better store queue merging properties, we set the
breakpoint to 4096 bytes, a little below the unaligned breakpoints.

Some future optimisations we can look at:

- Looking at the perf data, a significant part of the cost when a task
  is always using VMX is the extra exception we take to restore the VMX
  state. We should do something similar to the x86 optimisation that
  restores FPU state for heavy users, ie:

    /*
     * If the task has used fpu the last 5 timeslices, just do a full
     * restore of the math state immediately to avoid the trap; the
     * chances of needing FPU soon are obviously high now
     */
    preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

  and

    /*
     * fpu_counter contains the number of consecutive context switches
     * that the FPU is used. If this is over a threshold, the lazy fpu
     * saving becomes unlazy to save the trap. This is an unsigned char
     * so that after 256 times the counter wraps and the behavior turns
     * lazy again; this to deal with bursty apps that only use FPU for
     * a short time
     */

- We could create a paca bit to mirror the VMX enabled MSR bit and check
  that first, avoiding multiple calls to enable_kernel_altivec. That
  should help with iovec based system calls like readv.

- We could have two VMX breakpoints, one for when we know the user VMX
  state is already loaded into the registers and one for when it isn't.
  This could be a second bit in the paca so we can calculate the
  breakpoints quickly.

- One suggestion from Ben was to save and restore the VSX registers we
  use inline instead of using enable_kernel_altivec.

[BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
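As a rough C sketch of the size dispatch described above (illustrative
only; the real selection is done in assembly in __copy_tofrom_user_power7,
and copy_small(), copy_8byte_loop() and copy_vmx_loop() are hypothetical
stand-ins for the assembly paths):

    /* Hypothetical helpers standing in for the assembly copy loops. */
    unsigned long copy_small(void *to, const void *from, unsigned long n);
    unsigned long copy_8byte_loop(void *to, const void *from, unsigned long n);
    unsigned long copy_vmx_loop(void *to, const void *from, unsigned long n);
    int enter_vmx_copy(void);               /* added by this patch */

    static unsigned long power7_copy_dispatch(void *to, const void *from,
                                              unsigned long n)
    {
            if (n < 128)                    /* 1 - 127 bytes */
                    return copy_small(to, from, n);
            if (n < 4096)                   /* 128 - 4095 bytes */
                    return copy_8byte_loop(to, from, n);
            if (!enter_vmx_copy())          /* in an interrupt: no VMX */
                    return copy_8byte_loop(to, from, n);
            return copy_vmx_loop(to, from, n);  /* exits via exit_vmx_copy() */
    }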
Diffstat (limited to 'arch')
-rw-r--r--  arch/powerpc/include/asm/cputable.h      3
-rw-r--r--  arch/powerpc/lib/Makefile                4
-rw-r--r--  arch/powerpc/lib/copyuser_64.S           6
-rw-r--r--  arch/powerpc/lib/copyuser_power7.S     683
-rw-r--r--  arch/powerpc/lib/copyuser_power7_vmx.c  50
5 files changed, 744 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 7044233124ba..ad55a1ccb9fb 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -201,6 +201,7 @@ extern const char *powerpc_base_platform;
 #define CPU_FTR_POPCNTB			LONG_ASM_CONST(0x0400000000000000)
 #define CPU_FTR_POPCNTD			LONG_ASM_CONST(0x0800000000000000)
 #define CPU_FTR_ICSWX			LONG_ASM_CONST(0x1000000000000000)
+#define CPU_FTR_VMX_COPY		LONG_ASM_CONST(0x2000000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -425,7 +426,7 @@ extern const char *powerpc_base_platform;
 	    CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
 	    CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \
 	    CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-	    CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE)
+	    CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY)
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 166a6a0ad544..7735a2c2e6d9 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -16,13 +16,15 @@ obj-$(CONFIG_HAS_IOMEM) += devres.o
 
 obj-$(CONFIG_PPC64)	+= copypage_64.o copyuser_64.o \
 			   memcpy_64.o usercopy_64.o mem_64.o string.o \
-			   checksum_wrappers_64.o hweight_64.o
+			   checksum_wrappers_64.o hweight_64.o \
+			   copyuser_power7.o
 obj-$(CONFIG_XMON)	+= sstep.o ldstfp.o
 obj-$(CONFIG_KPROBES)	+= sstep.o ldstfp.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= sstep.o ldstfp.o
 
 ifeq ($(CONFIG_PPC64),y)
 obj-$(CONFIG_SMP)	+= locks.o
+obj-$(CONFIG_ALTIVEC)	+= copyuser_power7_vmx.o
 endif
 
 obj-$(CONFIG_PPC_LIB_RHEAP) += rheap.o
diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S
index 578b625d6a3c..773d38f90aaa 100644
--- a/arch/powerpc/lib/copyuser_64.S
+++ b/arch/powerpc/lib/copyuser_64.S
@@ -11,6 +11,12 @@
 
 	.align	7
 _GLOBAL(__copy_tofrom_user)
+BEGIN_FTR_SECTION
+	nop
+FTR_SECTION_ELSE
+	b	__copy_tofrom_user_power7
+ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
+_GLOBAL(__copy_tofrom_user_base)
 	/* first check for a whole page copy on a page boundary */
 	cmpldi	cr1,r5,16
 	cmpdi	cr6,r5,4096
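The BEGIN_FTR_SECTION/FTR_SECTION_ELSE block above is resolved once at
boot: the nop is kept on CPUs without CPU_FTR_VMX_COPY and the branch to
__copy_tofrom_user_power7 is patched in on CPUs that have it, so the hot
path carries no runtime test. A C-level analogue of that dispatch (a
sketch only; the kernel patches the branch rather than testing the
feature bit on every call) would be:

    unsigned long __copy_tofrom_user_power7(void __user *to,
                    const void __user *from, unsigned long n);
    unsigned long __copy_tofrom_user_base(void __user *to,
                    const void __user *from, unsigned long n);

    unsigned long __copy_tofrom_user(void __user *to, const void __user *from,
                                     unsigned long n)
    {
            if (cpu_has_feature(CPU_FTR_VMX_COPY))
                    return __copy_tofrom_user_power7(to, from, n);
            return __copy_tofrom_user_base(to, from, n);
    }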
diff --git a/arch/powerpc/lib/copyuser_power7.S b/arch/powerpc/lib/copyuser_power7.S
new file mode 100644
index 000000000000..497db7b23bb1
--- /dev/null
+++ b/arch/powerpc/lib/copyuser_power7.S
@@ -0,0 +1,683 @@
1/*
2 * This program is free software; you can redistribute it and/or modify
3 * it under the terms of the GNU General Public License as published by
4 * the Free Software Foundation; either version 2 of the License, or
5 * (at your option) any later version.
6 *
7 * This program is distributed in the hope that it will be useful,
8 * but WITHOUT ANY WARRANTY; without even the implied warranty of
9 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
10 * GNU General Public License for more details.
11 *
12 * You should have received a copy of the GNU General Public License
13 * along with this program; if not, write to the Free Software
14 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
15 *
16 * Copyright (C) IBM Corporation, 2011
17 *
18 * Author: Anton Blanchard <anton@au.ibm.com>
19 */
20#include <asm/ppc_asm.h>
21
22#define STACKFRAMESIZE 256
23#define STK_REG(i) (112 + ((i)-14)*8)
24
25 .macro err1
26100:
27 .section __ex_table,"a"
28 .align 3
29 .llong 100b,.Ldo_err1
30 .previous
31 .endm
32
33 .macro err2
34200:
35 .section __ex_table,"a"
36 .align 3
37 .llong 200b,.Ldo_err2
38 .previous
39 .endm
40
41#ifdef CONFIG_ALTIVEC
42 .macro err3
43300:
44 .section __ex_table,"a"
45 .align 3
46 .llong 300b,.Ldo_err3
47 .previous
48 .endm
49
50 .macro err4
51400:
52 .section __ex_table,"a"
53 .align 3
54 .llong 400b,.Ldo_err4
55 .previous
56 .endm
57
58
59.Ldo_err4:
60 ld r16,STK_REG(r16)(r1)
61 ld r15,STK_REG(r15)(r1)
62 ld r14,STK_REG(r14)(r1)
63.Ldo_err3:
64 bl .exit_vmx_copy
65 ld r0,STACKFRAMESIZE+16(r1)
66 mtlr r0
67 b .Lexit
68#endif /* CONFIG_ALTIVEC */
69
70.Ldo_err2:
71 ld r22,STK_REG(r22)(r1)
72 ld r21,STK_REG(r21)(r1)
73 ld r20,STK_REG(r20)(r1)
74 ld r19,STK_REG(r19)(r1)
75 ld r18,STK_REG(r18)(r1)
76 ld r17,STK_REG(r17)(r1)
77 ld r16,STK_REG(r16)(r1)
78 ld r15,STK_REG(r15)(r1)
79 ld r14,STK_REG(r14)(r1)
80.Lexit:
81 addi r1,r1,STACKFRAMESIZE
82.Ldo_err1:
83 ld r3,48(r1)
84 ld r4,56(r1)
85 ld r5,64(r1)
86 b __copy_tofrom_user_base
87
88
89_GLOBAL(__copy_tofrom_user_power7)
90#ifdef CONFIG_ALTIVEC
91 cmpldi r5,16
92 cmpldi cr1,r5,4096
93
94 std r3,48(r1)
95 std r4,56(r1)
96 std r5,64(r1)
97
98 blt .Lshort_copy
99 bgt cr1,.Lvmx_copy
100#else
101 cmpldi r5,16
102
103 std r3,48(r1)
104 std r4,56(r1)
105 std r5,64(r1)
106
107 blt .Lshort_copy
108#endif
109
110.Lnonvmx_copy:
111 /* Get the source 8B aligned */
112 neg r6,r4
113 mtocrf 0x01,r6
114 clrldi r6,r6,(64-3)
115
116 bf cr7*4+3,1f
117err1; lbz r0,0(r4)
118 addi r4,r4,1
119err1; stb r0,0(r3)
120 addi r3,r3,1
121
1221: bf cr7*4+2,2f
123err1; lhz r0,0(r4)
124 addi r4,r4,2
125err1; sth r0,0(r3)
126 addi r3,r3,2
127
1282: bf cr7*4+1,3f
129err1; lwz r0,0(r4)
130 addi r4,r4,4
131err1; stw r0,0(r3)
132 addi r3,r3,4
133
1343: sub r5,r5,r6
135 cmpldi r5,128
136 blt 5f
137
138 mflr r0
139 stdu r1,-STACKFRAMESIZE(r1)
140 std r14,STK_REG(r14)(r1)
141 std r15,STK_REG(r15)(r1)
142 std r16,STK_REG(r16)(r1)
143 std r17,STK_REG(r17)(r1)
144 std r18,STK_REG(r18)(r1)
145 std r19,STK_REG(r19)(r1)
146 std r20,STK_REG(r20)(r1)
147 std r21,STK_REG(r21)(r1)
148 std r22,STK_REG(r22)(r1)
149 std r0,STACKFRAMESIZE+16(r1)
150
151 srdi r6,r5,7
152 mtctr r6
153
154 /* Now do cacheline (128B) sized loads and stores. */
155 .align 5
1564:
157err2; ld r0,0(r4)
158err2; ld r6,8(r4)
159err2; ld r7,16(r4)
160err2; ld r8,24(r4)
161err2; ld r9,32(r4)
162err2; ld r10,40(r4)
163err2; ld r11,48(r4)
164err2; ld r12,56(r4)
165err2; ld r14,64(r4)
166err2; ld r15,72(r4)
167err2; ld r16,80(r4)
168err2; ld r17,88(r4)
169err2; ld r18,96(r4)
170err2; ld r19,104(r4)
171err2; ld r20,112(r4)
172err2; ld r21,120(r4)
173 addi r4,r4,128
174err2; std r0,0(r3)
175err2; std r6,8(r3)
176err2; std r7,16(r3)
177err2; std r8,24(r3)
178err2; std r9,32(r3)
179err2; std r10,40(r3)
180err2; std r11,48(r3)
181err2; std r12,56(r3)
182err2; std r14,64(r3)
183err2; std r15,72(r3)
184err2; std r16,80(r3)
185err2; std r17,88(r3)
186err2; std r18,96(r3)
187err2; std r19,104(r3)
188err2; std r20,112(r3)
189err2; std r21,120(r3)
190 addi r3,r3,128
191 bdnz 4b
192
193 clrldi r5,r5,(64-7)
194
195 ld r14,STK_REG(r14)(r1)
196 ld r15,STK_REG(r15)(r1)
197 ld r16,STK_REG(r16)(r1)
198 ld r17,STK_REG(r17)(r1)
199 ld r18,STK_REG(r18)(r1)
200 ld r19,STK_REG(r19)(r1)
201 ld r20,STK_REG(r20)(r1)
202 ld r21,STK_REG(r21)(r1)
203 ld r22,STK_REG(r22)(r1)
204 addi r1,r1,STACKFRAMESIZE
205
206 /* Up to 127B to go */
2075: srdi r6,r5,4
208 mtocrf 0x01,r6
209
2106: bf cr7*4+1,7f
211err1; ld r0,0(r4)
212err1; ld r6,8(r4)
213err1; ld r7,16(r4)
214err1; ld r8,24(r4)
215err1; ld r9,32(r4)
216err1; ld r10,40(r4)
217err1; ld r11,48(r4)
218err1; ld r12,56(r4)
219 addi r4,r4,64
220err1; std r0,0(r3)
221err1; std r6,8(r3)
222err1; std r7,16(r3)
223err1; std r8,24(r3)
224err1; std r9,32(r3)
225err1; std r10,40(r3)
226err1; std r11,48(r3)
227err1; std r12,56(r3)
228 addi r3,r3,64
229
230 /* Up to 63B to go */
2317: bf cr7*4+2,8f
232err1; ld r0,0(r4)
233err1; ld r6,8(r4)
234err1; ld r7,16(r4)
235err1; ld r8,24(r4)
236 addi r4,r4,32
237err1; std r0,0(r3)
238err1; std r6,8(r3)
239err1; std r7,16(r3)
240err1; std r8,24(r3)
241 addi r3,r3,32
242
243 /* Up to 31B to go */
2448: bf cr7*4+3,9f
245err1; ld r0,0(r4)
246err1; ld r6,8(r4)
247 addi r4,r4,16
248err1; std r0,0(r3)
249err1; std r6,8(r3)
250 addi r3,r3,16
251
2529: clrldi r5,r5,(64-4)
253
254 /* Up to 15B to go */
255.Lshort_copy:
256 mtocrf 0x01,r5
257 bf cr7*4+0,12f
258err1; lwz r0,0(r4) /* Less chance of a reject with word ops */
259err1; lwz r6,4(r4)
260 addi r4,r4,8
261err1; stw r0,0(r3)
262err1; stw r6,4(r3)
263 addi r3,r3,8
264
26512: bf cr7*4+1,13f
266err1; lwz r0,0(r4)
267 addi r4,r4,4
268err1; stw r0,0(r3)
269 addi r3,r3,4
270
27113: bf cr7*4+2,14f
272err1; lhz r0,0(r4)
273 addi r4,r4,2
274err1; sth r0,0(r3)
275 addi r3,r3,2
276
27714: bf cr7*4+3,15f
278err1; lbz r0,0(r4)
279err1; stb r0,0(r3)
280
28115: li r3,0
282 blr
283
284.Lunwind_stack_nonvmx_copy:
285 addi r1,r1,STACKFRAMESIZE
286 b .Lnonvmx_copy
287
288#ifdef CONFIG_ALTIVEC
289.Lvmx_copy:
290 mflr r0
291 std r0,16(r1)
292 stdu r1,-STACKFRAMESIZE(r1)
293 bl .enter_vmx_copy
294 cmpwi r3,0
295 ld r0,STACKFRAMESIZE+16(r1)
296 ld r3,STACKFRAMESIZE+48(r1)
297 ld r4,STACKFRAMESIZE+56(r1)
298 ld r5,STACKFRAMESIZE+64(r1)
299 mtlr r0
300
301 beq .Lunwind_stack_nonvmx_copy
302
303 /*
304 * If source and destination are not relatively aligned we use a
305 * slower permute loop.
306 */
307 xor r6,r4,r3
308 rldicl. r6,r6,0,(64-4)
309 bne .Lvmx_unaligned_copy
310
311 /* Get the destination 16B aligned */
312 neg r6,r3
313 mtocrf 0x01,r6
314 clrldi r6,r6,(64-4)
315
316 bf cr7*4+3,1f
317err3; lbz r0,0(r4)
318 addi r4,r4,1
319err3; stb r0,0(r3)
320 addi r3,r3,1
321
3221: bf cr7*4+2,2f
323err3; lhz r0,0(r4)
324 addi r4,r4,2
325err3; sth r0,0(r3)
326 addi r3,r3,2
327
3282: bf cr7*4+1,3f
329err3; lwz r0,0(r4)
330 addi r4,r4,4
331err3; stw r0,0(r3)
332 addi r3,r3,4
333
3343: bf cr7*4+0,4f
335err3; ld r0,0(r4)
336 addi r4,r4,8
337err3; std r0,0(r3)
338 addi r3,r3,8
339
3404: sub r5,r5,r6
341
 342 /* Get the destination 128B aligned */
343 neg r6,r3
344 srdi r7,r6,4
345 mtocrf 0x01,r7
346 clrldi r6,r6,(64-7)
347
348 li r9,16
349 li r10,32
350 li r11,48
351
352 bf cr7*4+3,5f
353err3; lvx vr1,r0,r4
354 addi r4,r4,16
355err3; stvx vr1,r0,r3
356 addi r3,r3,16
357
3585: bf cr7*4+2,6f
359err3; lvx vr1,r0,r4
360err3; lvx vr0,r4,r9
361 addi r4,r4,32
362err3; stvx vr1,r0,r3
363err3; stvx vr0,r3,r9
364 addi r3,r3,32
365
3666: bf cr7*4+1,7f
367err3; lvx vr3,r0,r4
368err3; lvx vr2,r4,r9
369err3; lvx vr1,r4,r10
370err3; lvx vr0,r4,r11
371 addi r4,r4,64
372err3; stvx vr3,r0,r3
373err3; stvx vr2,r3,r9
374err3; stvx vr1,r3,r10
375err3; stvx vr0,r3,r11
376 addi r3,r3,64
377
3787: sub r5,r5,r6
379 srdi r6,r5,7
380
381 std r14,STK_REG(r14)(r1)
382 std r15,STK_REG(r15)(r1)
383 std r16,STK_REG(r16)(r1)
384
385 li r12,64
386 li r14,80
387 li r15,96
388 li r16,112
389
390 mtctr r6
391
392 /*
393 * Now do cacheline sized loads and stores. By this stage the
394 * cacheline stores are also cacheline aligned.
395 */
396 .align 5
3978:
398err4; lvx vr7,r0,r4
399err4; lvx vr6,r4,r9
400err4; lvx vr5,r4,r10
401err4; lvx vr4,r4,r11
402err4; lvx vr3,r4,r12
403err4; lvx vr2,r4,r14
404err4; lvx vr1,r4,r15
405err4; lvx vr0,r4,r16
406 addi r4,r4,128
407err4; stvx vr7,r0,r3
408err4; stvx vr6,r3,r9
409err4; stvx vr5,r3,r10
410err4; stvx vr4,r3,r11
411err4; stvx vr3,r3,r12
412err4; stvx vr2,r3,r14
413err4; stvx vr1,r3,r15
414err4; stvx vr0,r3,r16
415 addi r3,r3,128
416 bdnz 8b
417
418 ld r14,STK_REG(r14)(r1)
419 ld r15,STK_REG(r15)(r1)
420 ld r16,STK_REG(r16)(r1)
421
422 /* Up to 127B to go */
423 clrldi r5,r5,(64-7)
424 srdi r6,r5,4
425 mtocrf 0x01,r6
426
427 bf cr7*4+1,9f
428err3; lvx vr3,r0,r4
429err3; lvx vr2,r4,r9
430err3; lvx vr1,r4,r10
431err3; lvx vr0,r4,r11
432 addi r4,r4,64
433err3; stvx vr3,r0,r3
434err3; stvx vr2,r3,r9
435err3; stvx vr1,r3,r10
436err3; stvx vr0,r3,r11
437 addi r3,r3,64
438
4399: bf cr7*4+2,10f
440err3; lvx vr1,r0,r4
441err3; lvx vr0,r4,r9
442 addi r4,r4,32
443err3; stvx vr1,r0,r3
444err3; stvx vr0,r3,r9
445 addi r3,r3,32
446
44710: bf cr7*4+3,11f
448err3; lvx vr1,r0,r4
449 addi r4,r4,16
450err3; stvx vr1,r0,r3
451 addi r3,r3,16
452
453 /* Up to 15B to go */
45411: clrldi r5,r5,(64-4)
455 mtocrf 0x01,r5
456 bf cr7*4+0,12f
457err3; ld r0,0(r4)
458 addi r4,r4,8
459err3; std r0,0(r3)
460 addi r3,r3,8
461
46212: bf cr7*4+1,13f
463err3; lwz r0,0(r4)
464 addi r4,r4,4
465err3; stw r0,0(r3)
466 addi r3,r3,4
467
46813: bf cr7*4+2,14f
469err3; lhz r0,0(r4)
470 addi r4,r4,2
471err3; sth r0,0(r3)
472 addi r3,r3,2
473
47414: bf cr7*4+3,15f
475err3; lbz r0,0(r4)
476err3; stb r0,0(r3)
477
47815: addi r1,r1,STACKFRAMESIZE
479 b .exit_vmx_copy /* tail call optimise */
480
481.Lvmx_unaligned_copy:
482 /* Get the destination 16B aligned */
483 neg r6,r3
484 mtocrf 0x01,r6
485 clrldi r6,r6,(64-4)
486
487 bf cr7*4+3,1f
488err3; lbz r0,0(r4)
489 addi r4,r4,1
490err3; stb r0,0(r3)
491 addi r3,r3,1
492
4931: bf cr7*4+2,2f
494err3; lhz r0,0(r4)
495 addi r4,r4,2
496err3; sth r0,0(r3)
497 addi r3,r3,2
498
4992: bf cr7*4+1,3f
500err3; lwz r0,0(r4)
501 addi r4,r4,4
502err3; stw r0,0(r3)
503 addi r3,r3,4
504
5053: bf cr7*4+0,4f
506err3; lwz r0,0(r4) /* Less chance of a reject with word ops */
507err3; lwz r7,4(r4)
508 addi r4,r4,8
509err3; stw r0,0(r3)
510err3; stw r7,4(r3)
511 addi r3,r3,8
512
5134: sub r5,r5,r6
514
 515 /* Get the destination 128B aligned */
516 neg r6,r3
517 srdi r7,r6,4
518 mtocrf 0x01,r7
519 clrldi r6,r6,(64-7)
520
521 li r9,16
522 li r10,32
523 li r11,48
524
525 lvsl vr16,0,r4 /* Setup permute control vector */
526err3; lvx vr0,0,r4
527 addi r4,r4,16
528
529 bf cr7*4+3,5f
530err3; lvx vr1,r0,r4
531 vperm vr8,vr0,vr1,vr16
532 addi r4,r4,16
533err3; stvx vr8,r0,r3
534 addi r3,r3,16
535 vor vr0,vr1,vr1
536
5375: bf cr7*4+2,6f
538err3; lvx vr1,r0,r4
539 vperm vr8,vr0,vr1,vr16
540err3; lvx vr0,r4,r9
541 vperm vr9,vr1,vr0,vr16
542 addi r4,r4,32
543err3; stvx vr8,r0,r3
544err3; stvx vr9,r3,r9
545 addi r3,r3,32
546
5476: bf cr7*4+1,7f
548err3; lvx vr3,r0,r4
549 vperm vr8,vr0,vr3,vr16
550err3; lvx vr2,r4,r9
551 vperm vr9,vr3,vr2,vr16
552err3; lvx vr1,r4,r10
553 vperm vr10,vr2,vr1,vr16
554err3; lvx vr0,r4,r11
555 vperm vr11,vr1,vr0,vr16
556 addi r4,r4,64
557err3; stvx vr8,r0,r3
558err3; stvx vr9,r3,r9
559err3; stvx vr10,r3,r10
560err3; stvx vr11,r3,r11
561 addi r3,r3,64
562
5637: sub r5,r5,r6
564 srdi r6,r5,7
565
566 std r14,STK_REG(r14)(r1)
567 std r15,STK_REG(r15)(r1)
568 std r16,STK_REG(r16)(r1)
569
570 li r12,64
571 li r14,80
572 li r15,96
573 li r16,112
574
575 mtctr r6
576
577 /*
578 * Now do cacheline sized loads and stores. By this stage the
579 * cacheline stores are also cacheline aligned.
580 */
581 .align 5
5828:
583err4; lvx vr7,r0,r4
584 vperm vr8,vr0,vr7,vr16
585err4; lvx vr6,r4,r9
586 vperm vr9,vr7,vr6,vr16
587err4; lvx vr5,r4,r10
588 vperm vr10,vr6,vr5,vr16
589err4; lvx vr4,r4,r11
590 vperm vr11,vr5,vr4,vr16
591err4; lvx vr3,r4,r12
592 vperm vr12,vr4,vr3,vr16
593err4; lvx vr2,r4,r14
594 vperm vr13,vr3,vr2,vr16
595err4; lvx vr1,r4,r15
596 vperm vr14,vr2,vr1,vr16
597err4; lvx vr0,r4,r16
598 vperm vr15,vr1,vr0,vr16
599 addi r4,r4,128
600err4; stvx vr8,r0,r3
601err4; stvx vr9,r3,r9
602err4; stvx vr10,r3,r10
603err4; stvx vr11,r3,r11
604err4; stvx vr12,r3,r12
605err4; stvx vr13,r3,r14
606err4; stvx vr14,r3,r15
607err4; stvx vr15,r3,r16
608 addi r3,r3,128
609 bdnz 8b
610
611 ld r14,STK_REG(r14)(r1)
612 ld r15,STK_REG(r15)(r1)
613 ld r16,STK_REG(r16)(r1)
614
615 /* Up to 127B to go */
616 clrldi r5,r5,(64-7)
617 srdi r6,r5,4
618 mtocrf 0x01,r6
619
620 bf cr7*4+1,9f
621err3; lvx vr3,r0,r4
622 vperm vr8,vr0,vr3,vr16
623err3; lvx vr2,r4,r9
624 vperm vr9,vr3,vr2,vr16
625err3; lvx vr1,r4,r10
626 vperm vr10,vr2,vr1,vr16
627err3; lvx vr0,r4,r11
628 vperm vr11,vr1,vr0,vr16
629 addi r4,r4,64
630err3; stvx vr8,r0,r3
631err3; stvx vr9,r3,r9
632err3; stvx vr10,r3,r10
633err3; stvx vr11,r3,r11
634 addi r3,r3,64
635
6369: bf cr7*4+2,10f
637err3; lvx vr1,r0,r4
638 vperm vr8,vr0,vr1,vr16
639err3; lvx vr0,r4,r9
640 vperm vr9,vr1,vr0,vr16
641 addi r4,r4,32
642err3; stvx vr8,r0,r3
643err3; stvx vr9,r3,r9
644 addi r3,r3,32
645
64610: bf cr7*4+3,11f
647err3; lvx vr1,r0,r4
648 vperm vr8,vr0,vr1,vr16
649 addi r4,r4,16
650err3; stvx vr8,r0,r3
651 addi r3,r3,16
652
653 /* Up to 15B to go */
65411: clrldi r5,r5,(64-4)
655 addi r4,r4,-16 /* Unwind the +16 load offset */
656 mtocrf 0x01,r5
657 bf cr7*4+0,12f
658err3; lwz r0,0(r4) /* Less chance of a reject with word ops */
659err3; lwz r6,4(r4)
660 addi r4,r4,8
661err3; stw r0,0(r3)
662err3; stw r6,4(r3)
663 addi r3,r3,8
664
66512: bf cr7*4+1,13f
666err3; lwz r0,0(r4)
667 addi r4,r4,4
668err3; stw r0,0(r3)
669 addi r3,r3,4
670
67113: bf cr7*4+2,14f
672err3; lhz r0,0(r4)
673 addi r4,r4,2
674err3; sth r0,0(r3)
675 addi r3,r3,2
676
67714: bf cr7*4+3,15f
678err3; lbz r0,0(r4)
679err3; stb r0,0(r3)
680
68115: addi r1,r1,STACKFRAMESIZE
682 b .exit_vmx_copy /* tail call optimise */
 683#endif /* CONFIG_ALTIVEC */
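A note on the unaligned path (.Lvmx_unaligned_copy) above: it never
issues an unaligned vector load. lvsl builds a permute control vector
from the low bits of the source address, each iteration keeps the
previous aligned lvx result in a register, and vperm splices two
adjacent aligned 16-byte loads into the unaligned 16 bytes that are then
written with aligned stvx. A scalar sketch of the same splicing idea
(illustration only, not code from the patch):

    #include <stdint.h>
    #include <string.h>

    /*
     * Scalar illustration of the lvsl/vperm trick: build an "unaligned"
     * 16-byte value from two aligned 16-byte loads.
     */
    static void load_unaligned_16(uint8_t dst[16], const uint8_t *src)
    {
            const uint8_t *base = (const uint8_t *)((uintptr_t)src & ~(uintptr_t)15);
            unsigned int shift = (uintptr_t)src & 15;   /* what lvsl encodes */
            uint8_t lo[16], hi[16];

            memcpy(lo, base, 16);           /* first aligned load (lvx) */
            memcpy(hi, base + 16, 16);      /* next aligned load (lvx) */

            memcpy(dst, lo + shift, 16 - shift);        /* tail of lo ... */
            memcpy(dst + 16 - shift, hi, shift);        /* ... head of hi */
    }

This is also why the tail of the file has the "Unwind the +16 load
offset" adjustment: the permute loop always keeps one aligned vector
loaded ahead of the bytes it has stored.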
diff --git a/arch/powerpc/lib/copyuser_power7_vmx.c b/arch/powerpc/lib/copyuser_power7_vmx.c
new file mode 100644
index 000000000000..6e1efadac48b
--- /dev/null
+++ b/arch/powerpc/lib/copyuser_power7_vmx.c
@@ -0,0 +1,50 @@
1/*
2 * This program is free software; you can redistribute it and/or modify
3 * it under the terms of the GNU General Public License as published by
4 * the Free Software Foundation; either version 2 of the License, or
5 * (at your option) any later version.
6 *
7 * This program is distributed in the hope that it will be useful,
8 * but WITHOUT ANY WARRANTY; without even the implied warranty of
9 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
10 * GNU General Public License for more details.
11 *
12 * You should have received a copy of the GNU General Public License
13 * along with this program; if not, write to the Free Software
14 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
15 *
16 * Copyright (C) IBM Corporation, 2011
17 *
18 * Authors: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
19 * Anton Blanchard <anton@au.ibm.com>
20 */
21#include <linux/uaccess.h>
22#include <linux/hardirq.h>
23
24int enter_vmx_copy(void)
25{
26 if (in_interrupt())
27 return 0;
28
29 /* This acts as preempt_disable() as well and will make
30 * enable_kernel_altivec(). We need to disable page faults
31 * as they can call schedule and thus make us lose the VMX
32 * context. So on page faults, we just fail which will cause
33 * a fallback to the normal non-vmx copy.
34 */
35 pagefault_disable();
36
37 enable_kernel_altivec();
38
39 return 1;
40}
41
42/*
43 * This function must return 0 because we tail call optimise when calling
44 * from __copy_tofrom_user_power7 which returns 0 on success.
45 */
46int exit_vmx_copy(void)
47{
48 pagefault_enable();
49 return 0;
50}
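Putting the two helpers together with the assembly: the VMX path is only
entered when enter_vmx_copy() succeeds, and it leaves either through a
tail call to exit_vmx_copy() (returning 0, i.e. nothing left to copy)
or, on a fault, through an __ex_table fixup that calls exit_vmx_copy()
and then retries with __copy_tofrom_user_base(). Roughly, as a sketch
(vmx_copy_loop() is a hypothetical stand-in for the assembly loop, and
__user annotations are dropped for brevity):

    unsigned long vmx_copy_loop(void *to, const void *from, unsigned long n);
    unsigned long __copy_tofrom_user_base(void *to, const void *from,
                                          unsigned long n);

    static unsigned long power7_vmx_copy(void *to, const void *from,
                                         unsigned long n)
    {
            if (!enter_vmx_copy())                  /* in_interrupt(): stay scalar */
                    return __copy_tofrom_user_base(to, from, n);

            /* Hypothetical: the real code handles faults via the
             * __ex_table fixup rather than a return value. */
            if (vmx_copy_loop(to, from, n) != 0) {
                    exit_vmx_copy();
                    return __copy_tofrom_user_base(to, from, n);
            }

            return exit_vmx_copy();                 /* 0: everything copied */
    }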