public inbox for libc-ports@sourceware.org
 help / color / mirror / Atom feed
* [patch, mips] More mips memcpy improvements
@ 2012-11-27 20:37 Steve Ellcey 
  2012-11-28  7:43 ` Maxim Kuvyrkov
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Ellcey  @ 2012-11-27 20:37 UTC (permalink / raw)
  To: libc-ports; +Cc: pinskia, maxim


I would like to take another crack at using the MIPS prepare-for-store
prefetch hint in memcpy.  The performance improvement in using it is 
substantial, but the problem of course is that you cannot prefetch anything
that is outside the area being copied.  I have addressed that problem by
having a variable MAX_PREFETCH_SIZE set to 128.  As long as no CPU has a
cache line greater than 128 bytes, this code should work fine.  It is still
optimized for a cache line of 32 bytes, but it will work for machines that
have a cache line of 64 or 128 bytes.

On a MIPS 74K board, using a test case that is nothing but memcpy's of various
sizes, I got a timing of 72.676 seconds for my earlier memcpy and a timing of
41.685 seconds with this new version.

To test for correctness, I added two macros, RETURN_FIRST_PREFETCH and
RETURN_LAST_PREFETCH, which change memcpy to return the first (last) address
being prefetched instead of the destination pointer.  I used these to verify
that my memcpy does not prefetch any address less than 128 bytes after the
beginning of the destination buffer or less than 128 bytes before the end of
the destination buffer.  I was also able to directly test this memcpy on a
Cavium system with 128 byte cache lines by running a test and verifying that
it didn't zero out any bytes outside the destination buffer.

This new memcpy also showed a performance improvement on the Cavium system
over my previous version, with my performance testcase going from 3.848
seconds to 3.597 seconds.  Not as impressive a gain as the 74K, but still
faster.

In addition to the prefetches I made a couple of other small improvements
such as reordering the partial loads in the unaligned loop copy case. The
idea is to separate the load left/load right instructions that load the
same register from each other because if they are one after the other it can
cause a 1 cycle delay due to a partial load having to both read and write
the destination register.

Andrew and Maxim, could you take a look at this new version and see if it
works OK for you?

Steve Ellcey
sellcey@mips.com


2012-11-27  Steve Ellcey  <sellcey@mips.com>

	* sysdeps/mips/memcpy.S: Change prefetch hint, reorder partial
	loads, set and use MAX_PREFETCH_SIZE.


diff --git a/ports/sysdeps/mips/memcpy.S b/ports/sysdeps/mips/memcpy.S
index abb07f9..31c00f1 100644
--- a/ports/sysdeps/mips/memcpy.S
+++ b/ports/sysdeps/mips/memcpy.S
@@ -26,12 +26,12 @@
 #include <regdef.h>
 #include <sys/asm.h>
 #define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD_STREAMED
-#define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
+#define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
 #elif _COMPILING_NEWLIB
 #include "machine/asm.h"
 #include "machine/regdef.h"
 #define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD_STREAMED
-#define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
+#define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
 #else
 #include <regdef.h>
 #include <sys/asm.h>
@@ -141,11 +141,11 @@
 #ifdef USE_DOUBLE
 # define PREFETCH_CHUNK 64
 # define PREFETCH_FOR_LOAD(chunk, reg) \
- pref PREFETCH_LOAD_HINT, (chunk)*32(reg); \
- pref PREFETCH_LOAD_HINT, ((chunk)+1)*32(reg)
+ pref PREFETCH_LOAD_HINT, (chunk)*64(reg); \
+ pref PREFETCH_LOAD_HINT, ((chunk)*64)+32(reg)
 # define PREFETCH_FOR_STORE(chunk, reg) \
- pref PREFETCH_STORE_HINT, (chunk)*32(reg); \
- pref PREFETCH_STORE_HINT, ((chunk)+1)*32(reg)
+ pref PREFETCH_STORE_HINT, (chunk)*64(reg); \
+ pref PREFETCH_STORE_HINT, ((chunk)*64)+32(reg)
 #else
 # define PREFETCH_CHUNK 32
 # define PREFETCH_FOR_LOAD(chunk, reg) \
@@ -153,7 +153,30 @@
 # define PREFETCH_FOR_STORE(chunk, reg) \
  pref PREFETCH_STORE_HINT, (chunk)*32(reg)
 #endif
-# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK)
+
+/* MAX_PREFETCH_SIZE is the maximum size of a prefetch, it must not be less
+ * than PREFETCH_CHUNK, the assumed size of each prefetch.  If the real size
+ * of a prefetch is greater than MAX_PREFETCH_SIZE and the PREPAREFORSTORE
+ * hint is used, the code will not work correctly.  If PREPAREFORSTORE is not
+ * used then MAX_PREFETCH_SIZE does not matter.  */
+#define MAX_PREFETCH_SIZE 128
+/* PREFETCH_LIMIT is set based on the fact that we never use an offset number
+ * greater than 5 on a STORE prefetch and that a single prefetch can never be
+ * larger than MAX_PREFETCH_SIZE.  We add the extra 32 when USE_DOUBLE is set
+ * because we actually do two prefetches in that case, one 32 bytes after the
+ * other.  */
+#ifdef USE_DOUBLE
+# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + 32 + MAX_PREFETCH_SIZE
+#else
+# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + MAX_PREFETCH_SIZE
+#endif
+#if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE) \
+    && ((PREFETCH_CHUNK * 4) < MAX_PREFETCH_SIZE)
+/* We cannot handle this because the initial prefetches may fetch bytes that
+ * are before the buffer being copied.  We start copies with an offset 
+ * of 4 so we avoid this situation when using PREPAREFORSTORE.  */
+#error "PREFETCH_CHUNK is too large and/or MAX_PREFETCH_SIZE is too small."
+#endif
 #else /* USE_PREFETCH not defined */
 # define PREFETCH_FOR_LOAD(offset, reg)
 # define PREFETCH_FOR_STORE(offset, reg)
@@ -258,7 +281,11 @@ L(memcpy):
  */
 	slti	t2,a2,(2 * NSIZE)
 	bne	t2,zero,L(lastb)
+#if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
+	move	v0,zero
+#else
 	move	v0,a0
+#endif
 /*
  * If src and dst have different alignments, go to L(unaligned), if they
  * have the same alignment (but are not actually aligned) do a partial
@@ -306,22 +333,46 @@ L(aligned):
 	PREFETCH_FOR_LOAD  (0, a1)
 	PREFETCH_FOR_LOAD  (1, a1)
 	PREFETCH_FOR_LOAD  (2, a1)
+	PREFETCH_FOR_LOAD  (3, a1)
+#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
 	PREFETCH_FOR_STORE (1, a0)
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	sltu	v1,t9,a0		/* If a0 > t9 don't use next prefetch */
-	bgtz	v1,L(loop16w)
+	PREFETCH_FOR_STORE (2, a0)
+	PREFETCH_FOR_STORE (3, a0)
+#endif
+#if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
+#if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
+	sltu    v1,t9,a0
+	bgtz    v1,L(skip_set)
 	nop
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
+L(skip_set):
+#else
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
+#endif
+#endif
+#if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH) \
+    && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*3)
+#ifdef USE_DOUBLE
+	PTR_ADDIU v0,v0,32
+#endif
 #endif
-	PREFETCH_FOR_STORE (2, a0)
 L(loop16w):
-	PREFETCH_FOR_LOAD  (3, a1)
 	C_LD	t0,UNIT(0)(a1)
 #if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	bgtz	v1,L(skip_pref30_96)
+	sltu	v1,t9,a0		/* If a0 > t9 don't use next prefetch */
+	bgtz	v1,L(skip_pref)
 #endif
 	C_LD	t1,UNIT(1)(a1)
-	PREFETCH_FOR_STORE (3, a0)
-L(skip_pref30_96):
+	PREFETCH_FOR_STORE (4, a0)
+	PREFETCH_FOR_STORE (5, a0)
+#if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH)
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*5)
+#ifdef USE_DOUBLE
+	PTR_ADDIU v0,v0,32
+#endif
+#endif
+L(skip_pref):
 	C_LD	REG2,UNIT(2)(a1)
 	C_LD	REG3,UNIT(3)(a1)
 	C_LD	REG4,UNIT(4)(a1)
@@ -340,12 +391,7 @@ L(skip_pref30_96):
 	C_ST	REG7,UNIT(7)(a0)
 
 	C_LD	t0,UNIT(8)(a1)
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	bgtz	v1,L(skip_pref30_128)
-#endif
 	C_LD	t1,UNIT(9)(a1)
-	PREFETCH_FOR_STORE (4, a0)
-L(skip_pref30_128):
 	C_LD	REG2,UNIT(10)(a1)
 	C_LD	REG3,UNIT(11)(a1)
 	C_LD	REG4,UNIT(12)(a1)
@@ -362,9 +408,6 @@ L(skip_pref30_128):
 	C_ST	REG6,UNIT(14)(a0)
 	C_ST	REG7,UNIT(15)(a0)
 	PTR_ADDIU a0,a0,UNIT(16)	/* adding 64/128 to dest */
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	sltu	v1,t9,a0
-#endif
 	bne	a0,a3,L(loop16w)
 	PTR_ADDIU a1,a1,UNIT(16)	/* adding 64/128 to src */
 	move	a2,t8
@@ -416,8 +459,8 @@ L(chk1w):
 /* copying in words (4-byte or 8-byte chunks) */
 L(wordCopy_loop):
 	C_LD	REG3,UNIT(0)(a1)
-	PTR_ADDIU a1,a1,UNIT(1)
 	PTR_ADDIU a0,a0,UNIT(1)
+	PTR_ADDIU a1,a1,UNIT(1)
 	bne	a0,a3,L(wordCopy_loop)
 	C_ST	REG3,UNIT(-1)(a0)
 
@@ -427,8 +470,8 @@ L(lastb):
 	PTR_ADDU a3,a0,a2	/* a3 is the last dst address */
 L(lastbloop):
 	lb	v1,0(a1)
-	PTR_ADDIU a1,a1,1
 	PTR_ADDIU a0,a0,1
+	PTR_ADDIU a1,a1,1
 	bne	a0,a3,L(lastbloop)
 	sb	v1,-1(a0)
 L(leave):
@@ -475,35 +518,46 @@ L(ua_chk16w):
 	PREFETCH_FOR_LOAD  (0, a1)
 	PREFETCH_FOR_LOAD  (1, a1)
 	PREFETCH_FOR_LOAD  (2, a1)
+#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
 	PREFETCH_FOR_STORE (1, a0)
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	sltu	v1,t9,a0
-	bgtz	v1,L(ua_loop16w)  /* skip prefetch for too short arrays */
+	PREFETCH_FOR_STORE (2, a0)
+	PREFETCH_FOR_STORE (3, a0)
+#endif
+#if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
+#if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
+	sltu    v1,t9,a0
+	bgtz    v1,L(ua_skip_set)
 	nop
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
+L(ua_skip_set):
+#else
+	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
+#endif
 #endif
-	PREFETCH_FOR_STORE (2, a0)
 L(ua_loop16w):
 	PREFETCH_FOR_LOAD  (3, a1)
 	C_LDHI	t0,UNIT(0)(a1)
-	C_LDLO	t0,UNITM1(1)(a1)
 	C_LDHI	t1,UNIT(1)(a1)
+	C_LDHI	REG2,UNIT(2)(a1)
 #if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	bgtz	v1,L(ua_skip_pref30_96)
+	sltu	v1,t9,a0
+	bgtz	v1,L(ua_skip_pref)
 #endif
+	C_LDHI	REG3,UNIT(3)(a1)
+	PREFETCH_FOR_STORE (4, a0)
+	PREFETCH_FOR_STORE (5, a0)
+L(ua_skip_pref):
+	C_LDHI	REG4,UNIT(4)(a1)
+	C_LDHI	REG5,UNIT(5)(a1)
+	C_LDHI	REG6,UNIT(6)(a1)
+	C_LDHI	REG7,UNIT(7)(a1)
+	C_LDLO	t0,UNITM1(1)(a1)
 	C_LDLO	t1,UNITM1(2)(a1)
-	PREFETCH_FOR_STORE (3, a0)
-L(ua_skip_pref30_96):
-	C_LDHI	REG2,UNIT(2)(a1)
 	C_LDLO	REG2,UNITM1(3)(a1)
-	C_LDHI	REG3,UNIT(3)(a1)
 	C_LDLO	REG3,UNITM1(4)(a1)
-	C_LDHI	REG4,UNIT(4)(a1)
 	C_LDLO	REG4,UNITM1(5)(a1)
-	C_LDHI	REG5,UNIT(5)(a1)
 	C_LDLO	REG5,UNITM1(6)(a1)
-	C_LDHI	REG6,UNIT(6)(a1)
 	C_LDLO	REG6,UNITM1(7)(a1)
-	C_LDHI	REG7,UNIT(7)(a1)
 	C_LDLO	REG7,UNITM1(8)(a1)
         PREFETCH_FOR_LOAD (4, a1)
 	C_ST	t0,UNIT(0)(a0)
@@ -515,25 +569,20 @@ L(ua_skip_pref30_96):
 	C_ST	REG6,UNIT(6)(a0)
 	C_ST	REG7,UNIT(7)(a0)
 	C_LDHI	t0,UNIT(8)(a1)
-	C_LDLO	t0,UNITM1(9)(a1)
 	C_LDHI	t1,UNIT(9)(a1)
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	bgtz	v1,L(ua_skip_pref30_128)
-#endif
-	C_LDLO	t1,UNITM1(10)(a1)
-	PREFETCH_FOR_STORE (4, a0)
-L(ua_skip_pref30_128):
 	C_LDHI	REG2,UNIT(10)(a1)
-	C_LDLO	REG2,UNITM1(11)(a1)
 	C_LDHI	REG3,UNIT(11)(a1)
-	C_LDLO	REG3,UNITM1(12)(a1)
 	C_LDHI	REG4,UNIT(12)(a1)
-	C_LDLO	REG4,UNITM1(13)(a1)
 	C_LDHI	REG5,UNIT(13)(a1)
-	C_LDLO	REG5,UNITM1(14)(a1)
 	C_LDHI	REG6,UNIT(14)(a1)
-	C_LDLO	REG6,UNITM1(15)(a1)
 	C_LDHI	REG7,UNIT(15)(a1)
+	C_LDLO	t0,UNITM1(9)(a1)
+	C_LDLO	t1,UNITM1(10)(a1)
+	C_LDLO	REG2,UNITM1(11)(a1)
+	C_LDLO	REG3,UNITM1(12)(a1)
+	C_LDLO	REG4,UNITM1(13)(a1)
+	C_LDLO	REG5,UNITM1(14)(a1)
+	C_LDLO	REG6,UNITM1(15)(a1)
 	C_LDLO	REG7,UNITM1(16)(a1)
         PREFETCH_FOR_LOAD (5, a1)
 	C_ST	t0,UNIT(8)(a0)
@@ -545,9 +594,6 @@ L(ua_skip_pref30_128):
 	C_ST	REG6,UNIT(14)(a0)
 	C_ST	REG7,UNIT(15)(a0)
 	PTR_ADDIU a0,a0,UNIT(16)	/* adding 64/128 to dest */
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-	sltu	v1,t9,a0
-#endif
 	bne	a0,a3,L(ua_loop16w)
 	PTR_ADDIU a1,a1,UNIT(16)	/* adding 64/128 to src */
 	move	a2,t8
@@ -564,20 +610,20 @@ L(ua_chkw):
 	beq	a2,t8,L(ua_chk1w) /* When a2=t8, no 32-byte chunk */
 	nop
 	C_LDHI	t0,UNIT(0)(a1)
-	C_LDLO	t0,UNITM1(1)(a1)
 	C_LDHI	t1,UNIT(1)(a1)
-	C_LDLO	t1,UNITM1(2)(a1)
 	C_LDHI	REG2,UNIT(2)(a1)
-	C_LDLO	REG2,UNITM1(3)(a1)
 	C_LDHI	REG3,UNIT(3)(a1)
-	C_LDLO	REG3,UNITM1(4)(a1)
 	C_LDHI	REG4,UNIT(4)(a1)
-	C_LDLO	REG4,UNITM1(5)(a1)
 	C_LDHI	REG5,UNIT(5)(a1)
-	C_LDLO	REG5,UNITM1(6)(a1)
 	C_LDHI	REG6,UNIT(6)(a1)
-	C_LDLO	REG6,UNITM1(7)(a1)
 	C_LDHI	REG7,UNIT(7)(a1)
+	C_LDLO	t0,UNITM1(1)(a1)
+	C_LDLO	t1,UNITM1(2)(a1)
+	C_LDLO	REG2,UNITM1(3)(a1)
+	C_LDLO	REG3,UNITM1(4)(a1)
+	C_LDLO	REG4,UNITM1(5)(a1)
+	C_LDLO	REG5,UNITM1(6)(a1)
+	C_LDLO	REG6,UNITM1(7)(a1)
 	C_LDLO	REG7,UNITM1(8)(a1)
 	PTR_ADDIU a1,a1,UNIT(8)
 	C_ST	t0,UNIT(0)(a0)
@@ -603,8 +649,8 @@ L(ua_chk1w):
 L(ua_wordCopy_loop):
 	C_LDHI	v1,UNIT(0)(a1)
 	C_LDLO	v1,UNITM1(1)(a1)
-	PTR_ADDIU a1,a1,UNIT(1)
 	PTR_ADDIU a0,a0,UNIT(1)
+	PTR_ADDIU a1,a1,UNIT(1)
 	bne	a0,a3,L(ua_wordCopy_loop)
 	C_ST	v1,UNIT(-1)(a0)
 
@@ -614,8 +660,8 @@ L(ua_smallCopy):
 	PTR_ADDU a3,a0,a2	/* a3 is the last dst address */
 L(ua_smallCopy_loop):
 	lb	v1,0(a1)
-	PTR_ADDIU a1,a1,1
 	PTR_ADDIU a0,a0,1
+	PTR_ADDIU a1,a1,1
 	bne	a0,a3,L(ua_smallCopy_loop)
 	sb	v1,-1(a0)
 
@@ -625,6 +671,8 @@ L(ua_smallCopy_loop):
 	.set	at
 	.set	reorder
 END(MEMCPY_NAME)
+#ifndef ANDROID_CHANGES
 #ifdef _LIBC
 libc_hidden_builtin_def (MEMCPY_NAME)
 #endif
+#endif

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch, mips] More mips memcpy improvements
  2012-11-27 20:37 [patch, mips] More mips memcpy improvements Steve Ellcey 
@ 2012-11-28  7:43 ` Maxim Kuvyrkov
  2012-11-28 14:46   ` Carlos O'Donell
  0 siblings, 1 reply; 6+ messages in thread
From: Maxim Kuvyrkov @ 2012-11-28  7:43 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: libc-ports, pinskia

On 28/11/2012, at 9:37 AM, Steve Ellcey wrote:

...
> Andrew and Maxim, could you take a look at this new version and see if it
> works OK for you?

Correctness-wise, the patch does not trigger any faults on the benchmark that I'm using, so that's good.  Performance-wise, I didn't see any significant difference.  It may be that XLP's cache doesn't benefit from using prepare-for-store hints.  In any case, the patch doesn't regress performance, so I'm OK with it.

Steve, did you run normal glibc testsuite on this patch with no failures?

Thanks,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch, mips] More mips memcpy improvements
  2012-11-28  7:43 ` Maxim Kuvyrkov
@ 2012-11-28 14:46   ` Carlos O'Donell
  2012-11-28 18:54     ` Steve Ellcey
  0 siblings, 1 reply; 6+ messages in thread
From: Carlos O'Donell @ 2012-11-28 14:46 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Steve Ellcey, libc-ports, pinskia

On Wed, Nov 28, 2012 at 2:43 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> On 28/11/2012, at 9:37 AM, Steve Ellcey wrote:
>
> ...
>> Andrew and Maxim, could you take a look at this new version and see if it
>> works OK for you?
>
> Correctness-wise, the patch does not trigger any faults on the benchmark that I'm using, so that's good.  Performance-wise, I didn't see any significant difference.  It may be that XLP's cache doesn't benefit from using prepare-for-store hints.  In any case, the patch doesn't regress performance, so I'm OK with it.
>
> Steve, did you run normal glibc testsuite on this patch with no failures?

I will note that I'm not comfortable pushing a patch like this so
close to the 2.17 freeze.

My preference would be that 2.17 goes out the door, then once the
freeze is over we put this patch into trunk and then everyone can test
it and ensure that what we get is continual incremental improvement.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch, mips] More mips memcpy improvements
  2012-11-28 14:46   ` Carlos O'Donell
@ 2012-11-28 18:54     ` Steve Ellcey
  2012-11-28 18:59       ` Maxim Kuvyrkov
  2012-11-28 19:03       ` Carlos O'Donell
  0 siblings, 2 replies; 6+ messages in thread
From: Steve Ellcey @ 2012-11-28 18:54 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Maxim Kuvyrkov, libc-ports, pinskia

On Wed, 2012-11-28 at 09:46 -0500, Carlos O'Donell wrote:
> On Wed, Nov 28, 2012 at 2:43 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> > On 28/11/2012, at 9:37 AM, Steve Ellcey wrote:
> >
> > ...
> >> Andrew and Maxim, could you take a look at this new version and see if it
> >> works OK for you?
> >
> > Correctness-wise, the patch does not trigger any faults on the benchmark that I'm using, so that's good.  Performance-wise, I didn't see any significant difference.  It may be that XLP's cache doesn't benefit from using prepare-for-store hints.  In any case, the patch doesn't regress performance, so I'm OK with it.
> >
> > Steve, did you run normal glibc testsuite on this patch with no failures?
> 
> I will note that I'm not comfortable pushing a patch like this so
> close to the 2.17 freeze.
> 
> My preference would be that 2.17 goes out the door, then once the
> freeze is over we put this patch into trunk and then everyone can test
> it and ensure that what we get is continual incremental improvement.
> 
> Cheers,
> Carlos.

I can wait to do the checkin if you want.  I would point out that the
current MIPS memcpy was only checked in a month ago so it doesn't have
much more testing than this version.  I haven't run the glibc testsuite
with this memcpy, I used a combination of memcpy specific tests I wrote
plus GCC testing for verification.  I will run the glibc testsuite
today, it hasn't been clean in the past for me (before I made any
changes) and I don't expect it to be so now but I will report on my
results when I have them.

Steve Ellcey
sellcey@mips.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch, mips] More mips memcpy improvements
  2012-11-28 18:54     ` Steve Ellcey
@ 2012-11-28 18:59       ` Maxim Kuvyrkov
  2012-11-28 19:03       ` Carlos O'Donell
  1 sibling, 0 replies; 6+ messages in thread
From: Maxim Kuvyrkov @ 2012-11-28 18:59 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: Carlos O'Donell, libc-ports, pinskia

On 29/11/2012, at 7:53 AM, Steve Ellcey wrote:

> On Wed, 2012-11-28 at 09:46 -0500, Carlos O'Donell wrote:
>> 
...
>> I will note that I'm not comfortable pushing a patch like this so
>> close to the 2.17 freeze.
>> 
>> My preference would be that 2.17 goes out the door, then once the
>> freeze is over we put this patch into trunk and then everyone can test
>> it and ensure that what we get is continual incremental improvement.

I second this.

>> 
>> Cheers,
>> Carlos.
> 
...
>   I haven't run the glibc testsuite
> with this memcpy, I used a combination of memcpy specific tests I wrote
> plus GCC testing for verification.  I will run the glibc testsuite
> today, it hasn't been clean in the past for me (before I made any
> changes) and I don't expect it to be so now but I will report on my
> results when I have them.

You don't need to have clean test results; as long as your patch doesn't cause any new failures, it is fine.  [Of course, this is assuming that the testsuite is functional, i.e., >80% tests PASS.]

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch, mips] More mips memcpy improvements
  2012-11-28 18:54     ` Steve Ellcey
  2012-11-28 18:59       ` Maxim Kuvyrkov
@ 2012-11-28 19:03       ` Carlos O'Donell
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos O'Donell @ 2012-11-28 19:03 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: Maxim Kuvyrkov, libc-ports, pinskia

On Wed, Nov 28, 2012 at 1:53 PM, Steve Ellcey <sellcey@mips.com> wrote:
> On Wed, 2012-11-28 at 09:46 -0500, Carlos O'Donell wrote:
>> On Wed, Nov 28, 2012 at 2:43 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>> > On 28/11/2012, at 9:37 AM, Steve Ellcey wrote:
>> >
>> > ...
>> >> Andrew and Maxim, could you take a look at this new version and see if it
>> >> works OK for you?
>> >
>> > Correctness-wise, the patch does not trigger any faults on the benchmark that I'm using, so that's good.  Performance-wise, I didn't see any significant difference.  It may be that XLP's cache doesn't benefit from using prepare-for-store hints.  In any case, the patch doesn't regress performance, so I'm OK with it.
>> >
>> > Steve, did you run normal glibc testsuite on this patch with no failures?
>>
>> I will note that I'm not comfortable pushing a patch like this so
>> close to the 2.17 freeze.
>>
>> My preference would be that 2.17 goes out the door, then once the
>> freeze is over we put this patch into trunk and then everyone can test
>> it and ensure that what we get is continual incremental improvement.
>>
>> Cheers,
>> Carlos.
>
> I can wait to do the checkin if you want.  I would point out that the
> current MIPS memcpy was only checked in a month ago so it doesn't have
> much more testing then this version.  I haven't run the glibc testsuite
> with this memcpy, I used a combination of memcpy specific tests I wrote
> plus GCC testing for verification.  I will run the glibc testsuite
> today, it hasn't been clean in the past for me (before I made any
> changes) and I don't expect it to be so now but I will report on my
> results when I have them.

Please wait to check this in. Thank you for your patience and
understanding with the upcoming freeze.

Please don't wait to keep talking about it and getting consensus and
testing done by everyone else interested in MIPS performance. Feel
free to keep your own feature branch and carry the patch there,
asking for testing and review.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-11-28 19:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-27 20:37 [patch, mips] More mips memcpy improvements Steve Ellcey 
2012-11-28  7:43 ` Maxim Kuvyrkov
2012-11-28 14:46   ` Carlos O'Donell
2012-11-28 18:54     ` Steve Ellcey
2012-11-28 18:59       ` Maxim Kuvyrkov
2012-11-28 19:03       ` Carlos O'Donell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).