public inbox for gcc-patches@gcc.gnu.org
* [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
@ 2018-06-19 14:09 Tamar Christina
  2018-06-19 21:23 ` James Greenhalgh
  2018-07-03 17:13 ` James Greenhalgh
  0 siblings, 2 replies; 5+ messages in thread
From: Tamar Christina @ 2018-06-19 14:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, james.greenhalgh, Richard.Earnshaw, Marcus.Shawcroft

[-- Attachment #1: Type: text/plain, Size: 1662 bytes --]

Hi All,

This changes the movmem code in AArch64 that copies data between 4 and 7
bytes to use the smallest possible mode capable of copying the remaining bytes.

This means that if we're copying 5 bytes we would issue an SImode and QImode
load instead of two SImode loads.
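
For a 5-byte copy this means an SImode load/store of bytes 0-3 plus a QImode
load/store of byte 4, instead of an SImode load/store of bytes 0-3 plus an
overlapping SImode load/store of bytes 1-4 (compare the two sequences below).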

This results in smaller memory accesses, but it also gives the mid-end a chance to
realise that it can CSE the loads in certain circumstances, e.g. when you have
something like

return foo;

where foo is a struct. This would be transformed by the mid-end into SSA form as

D.XXXX = foo;

return D.XXXX;

This movmem routine will handle the first copy, which is usually not needed:
for the 5-byte example the mid-end would do SImode and QImode stores into X0,
but unless the first copy is done in the same modes, it doesn't know that it
doesn't need those stores at all.
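
For reference, the fun5 referred to below is essentially the 5-byte case from
the new struct_cpy.c test, written out here without the FUN macro:

struct struct5 { char a, b, c, d, e; };
struct struct5 foo5 = { 'a', 'b', 'c', 'd', 'e' };

void fun5 ()
{
  /* A 5-byte block copy: with this patch an SImode load/store for bytes
     0-3 and a QImode load/store for byte 4.  */
  volatile struct struct5 var5 = foo5;
}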

This will generate

fun5:
	sub	sp, sp, #16
	adrp	x0, .LANCHOR0
	add	x1, x0, :lo12:.LANCHOR0
	ldr	w0, [x0, #:lo12:.LANCHOR0]
	str	w0, [sp, 8]
	ldrb	w0, [x1, 4]
	strb	w0, [sp, 12]
	add	sp, sp, 16
	ret

instead of

fun5:
        sub     sp, sp, #16
        adrp    x0, .LANCHOR0
        add     x1, x0, :lo12:.LANCHOR0
        ldr     w0, [x0, #:lo12:.LANCHOR0]
        str     w0, [sp, 8]
        ldr     w0, [x1, 1]
        str     w0, [sp, 9]
        add     sp, sp, 16
        ret

for a 5 byte copy.

Regtested on aarch64-none-elf and .. issues.

Ok for trunk?

Thanks,
Tamar

gcc/
2018-06-19  Tamar Christina  <tamar.christina@arm.com>

	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.

gcc/testsuite/
2018-06-19  Tamar Christina  <tamar.christina@arm.com>

	* gcc.target/aarch64/struct_cpy.c: New.

-- 

[-- Attachment #2: rb9112.patch --]
[-- Type: text/x-diff, Size: 8076 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
 bool
 aarch64_expand_movmem (rtx *operands)
 {
-  unsigned int n;
+  int n, mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
   rtx base;
+  machine_mode cur_mode = BLKmode, next_mode;
   bool speed_p = !optimize_function_for_size_p (cfun);
 
   /* When optimizing for size, give a better estimate of the length of a
-     memcpy call, but use the default otherwise.  */
-  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
+     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
+     will always require an even number of instructions to do now.  And each
+     operation requires both a load+store, so divide the max number by 2.  */
+  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
 
   /* We can't do anything smart if the amount to copy is not constant.  */
   if (!CONST_INT_P (operands[2]))
     return false;
 
-  n = UINTVAL (operands[2]);
+  n = INTVAL (operands[2]);
 
-  /* Try to keep the number of instructions low.  For cases below 16 bytes we
-     need to make at most two moves.  For cases above 16 bytes it will be one
-     move for each 16 byte chunk, then at most two additional moves.  */
-  if (((n / 16) + (n % 16 ? 2 : 0)) > max_instructions)
+  /* Try to keep the number of instructions low.  For all cases we will do at
+     most two moves for the residual amount, since we'll always overlap the
+     remainder.  */
+  if (((n / 16) + (n % 16 ? 2 : 0)) > max_num_moves)
     return false;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
@@ -16208,81 +16211,36 @@ aarch64_expand_movmem (rtx *operands)
   base = copy_to_mode_reg (Pmode, XEXP (src, 0));
   src = adjust_automodify_address (src, VOIDmode, base, 0);
 
-  /* Simple cases.  Copy 0-3 bytes, as (if applicable) a 2-byte, then a
-     1-byte chunk.  */
-  if (n < 4)
-    {
-      if (n >= 2)
-	{
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, HImode);
-	  n -= 2;
-	}
-
-      if (n == 1)
-	aarch64_copy_one_block_and_progress_pointers (&src, &dst, QImode);
-
-      return true;
-    }
+  /* Convert n to bits to make the rest of the code simpler.  */
+  n = n * BITS_PER_UNIT;
 
-  /* Copy 4-8 bytes.  First a 4-byte chunk, then (if applicable) a second
-     4-byte chunk, partially overlapping with the previously copied chunk.  */
-  if (n < 8)
+  while (n > 0)
     {
-      aarch64_copy_one_block_and_progress_pointers (&src, &dst, SImode);
-      n -= 4;
-      if (n > 0)
-	{
-	  int move = n - 4;
+      /* Find the largest mode in which to do the copy without over-reading
+	 or writing.  */
+      opt_scalar_int_mode mode_iter;
+      FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
+	if (GET_MODE_BITSIZE (mode_iter.require ()) <= n)
+	  cur_mode = mode_iter.require ();
 
-	  src = aarch64_move_pointer (src, move);
-	  dst = aarch64_move_pointer (dst, move);
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, SImode);
-	}
-      return true;
-    }
+      gcc_assert (cur_mode != BLKmode);
 
-  /* Copy more than 8 bytes.  Copy chunks of 16 bytes until we run out of
-     them, then (if applicable) an 8-byte chunk.  */
-  while (n >= 8)
-    {
-      if (n / 16)
-	{
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, TImode);
-	  n -= 16;
-	}
-      else
-	{
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, DImode);
-	  n -= 8;
-	}
-    }
+      mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
+      aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
 
-  /* Finish the final bytes of the copy.  We can always do this in one
-     instruction.  We either copy the exact amount we need, or partially
-     overlap with the previous chunk we copied and copy 8-bytes.  */
-  if (n == 0)
-    return true;
-  else if (n == 1)
-    aarch64_copy_one_block_and_progress_pointers (&src, &dst, QImode);
-  else if (n == 2)
-    aarch64_copy_one_block_and_progress_pointers (&src, &dst, HImode);
-  else if (n == 4)
-    aarch64_copy_one_block_and_progress_pointers (&src, &dst, SImode);
-  else
-    {
-      if (n == 3)
-	{
-	  src = aarch64_move_pointer (src, -1);
-	  dst = aarch64_move_pointer (dst, -1);
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, SImode);
-	}
-      else
-	{
-	  int move = n - 8;
+      n -= mode_bits;
 
-	  src = aarch64_move_pointer (src, move);
-	  dst = aarch64_move_pointer (dst, move);
-	  aarch64_copy_one_block_and_progress_pointers (&src, &dst, DImode);
+      /* Do certain trailing copies as overlapping if it's going to be
+	 cheaper.  i.e. less instructions to do so.  For instance doing a 15
+	 byte copy it's more efficient to do two overlapping 8 byte copies than
+	 8 + 6 + 1.  */
+      next_mode = smallest_mode_for_size (n, MODE_INT);
+      int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
+      if (n > 0 && n_bits > n && n_bits <= 8 * BITS_PER_UNIT)
+	{
+	  src = aarch64_move_pointer (src, (n - n_bits) / BITS_PER_UNIT);
+	  dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
+	  n = n_bits;
 	}
     }
 
diff --git a/gcc/testsuite/gcc.target/aarch64/struct_cpy.c b/gcc/testsuite/gcc.target/aarch64/struct_cpy.c
new file mode 100644
index 0000000000000000000000000000000000000000..4feb3ea990994e0de82b3d54f13ec073cfa7a335
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/struct_cpy.c
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+
+struct struct1 { char a;};
+struct struct2 { char a, b;};
+struct struct3 { char a, b, c; };
+struct struct4 { char a, b, c, d; };
+struct struct5 { char a, b, c, d, e; };
+struct struct6 { char a, b, c, d, e, f; };
+struct struct7 { char a, b, c, d, e, f, g; };
+struct struct8 { char a, b, c, d, e, f, g, h; };
+struct struct9 { char a, b, c, d, e, f, g, h, i; };
+struct struct10 { char a, b, c, d, e, f, g, h, i, j; };
+struct struct11 { char a, b, c, d, e, f, g, h, i, j, k; };
+struct struct12 { char a, b, c, d, e, f, g, h, i, j, k, l; };
+struct struct13 { char a, b, c, d, e, f, g, h, i, j, k, l, m; };
+struct struct14 { char a, b, c, d, e, f, g, h, i, j, k, l, m, n; };
+struct struct15 { char a, b, c, d, e, f, g, h, i, j, k, l, m, n, o; };
+struct struct16 { char a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p; };
+
+struct struct1 foo1 = {'1'};
+struct struct2 foo2 = { 'a', 'b'};
+struct struct3 foo3 = { 'A', 'B', 'C'};
+struct struct4 foo4 = {'1', '2', '3', '4'};
+struct struct5 foo5 = {'a', 'b', 'c', 'd', 'e'};
+struct struct6 foo6 = {'A', 'B', 'C', 'D', 'E', 'F'};
+struct struct7 foo7 = {'1', '2', '3', '4', '5', '6', '7'};
+struct struct8 foo8 = {'1', '2', '3', '4', '5', '6', '7', '8'};
+struct struct9 foo9 = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'};
+struct struct10 foo10 = {
+  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'};
+struct struct11 foo11 = {
+  '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B'};
+struct struct12 foo12 = {
+  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'};
+struct struct13 foo13 = {
+  'a','b','c','d','e','f','g','h','i','j','k','l','m'};
+struct struct14 foo14 = {
+  'a','b','c','d','e','f','g','h','i','j','k','l','m','n'};
+struct struct15 foo15 = {
+  'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o'};
+struct struct16 foo16 = {
+  'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p'};
+
+#define FUN(x) void fun##x ()\
+{ \
+  volatile struct struct##x var##x = foo##x; \
+}
+
+FUN(1);
+FUN(2);
+FUN(3);
+FUN(4);
+FUN(5);
+FUN(6);
+FUN(7);
+FUN(8);
+FUN(9);
+FUN(10);
+FUN(11);
+FUN(12);
+FUN(13);
+FUN(14);
+FUN(15);
+FUN(16);
+
+/* { dg-final { scan-assembler-times {ldr\s} 18 } } */
+/* { dg-final { scan-assembler-times {ldrb} 4 } } */
+/* { dg-final { scan-assembler-times {ldrh} 4 } } */
+/* { dg-final { scan-assembler-times {ldp} 1 } } */



* Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
  2018-06-19 14:09 [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes Tamar Christina
@ 2018-06-19 21:23 ` James Greenhalgh
  2018-06-20 13:13   ` Tamar Christina
  2018-07-03 17:13 ` James Greenhalgh
  1 sibling, 1 reply; 5+ messages in thread
From: James Greenhalgh @ 2018-06-19 21:23 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft

On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
> 
> This changes the movmem code in AArch64 that copies data between 4 and 7
> bytes to use the smallest possible mode capable of copying the remaining bytes.
> 
> This means that if we're copying 5 bytes we would issue an SImode and QImode
> load instead of two SImode loads.
> 
> This results in smaller memory accesses, but it also gives the mid-end a chance to
> realise that it can CSE the loads in certain circumstances, e.g. when you have
> something like
> 
> return foo;
> 
> where foo is a struct. This would be transformed by the mid-end into SSA form as
> 
> D.XXXX = foo;
> 
> return D.XXXX;
> 
> This movmem routine will handle the first copy, which is usually not needed:
> for the 5-byte example the mid-end would do SImode and QImode stores into X0,
> but unless the first copy is done in the same modes, it doesn't know that it
> doesn't need those stores at all.
> 
> This will generate
> 
> fun5:
> 	sub	sp, sp, #16
> 	adrp	x0, .LANCHOR0
> 	add	x1, x0, :lo12:.LANCHOR0
> 	ldr	w0, [x0, #:lo12:.LANCHOR0]
> 	str	w0, [sp, 8]
> 	ldrb	w0, [x1, 4]
> 	strb	w0, [sp, 12]
> 	add	sp, sp, 16
> 	ret
> 
> instead of
> 
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldr     w0, [x1, 1]
>         str     w0, [sp, 9]
>         add     sp, sp, 16
>         ret
> 
> for a 5 byte copy.

So... Given that I wrote the code to do the overlapping memmove back in 2014
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
why we want to achieve this transformation. I can see that we come out
about even for a 5 or 6 byte copy, but your patch would also force it for a
7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
probably weren't thought through that closely.

I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.

> Regtested on aarch64-none-elf and .. issues.

Do I draw a zero on the line on my screen with a sharpie or what ;-)

> Ok for trunk?

Justification and performance impact would be appreciated. Right now I'm not
convinced this is forward progress.

> gcc/
> 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> 
> gcc/testsuite/
> 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> 
> 	* gcc.target/aarch64/struct_cpy.c: New.
> 
> -- 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_movmem (rtx *operands)
>  {
> -  unsigned int n;
> +  int n, mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> +  machine_mode cur_mode = BLKmode, next_mode;
>    bool speed_p = !optimize_function_for_size_p (cfun);
>  
>    /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  */
> -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> +     will always require an even number of instructions to do now.  And each
> +     operation requires both a load+store, so divide the max number by 2.  */
> +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;

Why did we go from unsigned to signed int?

Thanks,
James


* Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
  2018-06-19 21:23 ` James Greenhalgh
@ 2018-06-20 13:13   ` Tamar Christina
  0 siblings, 0 replies; 5+ messages in thread
From: Tamar Christina @ 2018-06-20 13:13 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft

Hi James,

Many thanks for the review!

The 06/19/2018 22:23, James Greenhalgh wrote:
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
> > 
> > This changes the movmem code in AArch64 that copies data between 4 and 7
> > bytes to use the smallest possible mode capable of copying the remaining bytes.
> > 
> > for a 5 byte copy.
> 
> So... Given that I wrote the code to do the overlapping memmove back in 2014
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
> why we want to achieve this transformation. I can see that we come out
> about even for a 5 or 6 byte copy, but your patch would also force it for a
> 7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
> probably weren't thought through that closely.
> 

Actually, the 7-byte case stays the same.  This patch will generate, just as before:

	ldr	w1, [x0, 48]
	ldr	w0, [x0, 51]
	str	w1, [sp, 8]
	str	w0, [sp, 11]

This patch should never generate more loads or stores than it did before; in fact
it strictly generates fewer or the same number of loads/stores, just in less code.
The reason is how the algorithm works.

This can be proved by induction. My hypothesis is that this algorithm never issues
more than 2 loads and 2 stores for each residual copy of n % 32 bytes.

For the n - (n % 32) part, the code will, as before, use TImode loads and stores,
issuing as many as required to get to n < 31.

That leaves the cases of n < 31 bytes.

The cases n > 24 and n < 31 will now be done using overlapping TImode operations,
whereas before they would issue more operations. E.g. n = 25 will now overlap,
where before it would have done 16 + 8 + 1. Another example: n == 27 before would
have done 16 + 8 + 4 (where the 4 overlaps); now it will do 16 + 16, where the second
16 overlaps.

etc.

It does this by always doing the first copy using the largest mode that doesn't
over-read.  So for your 7-byte example this would be SImode, as before.  In the loop
it then selects the smallest mode that is larger than or equal to the remainder;
SImode is the smallest mode larger than 3 bytes, so it issues the same two SImode
operations as before.

I had described this in the comments but if this wasn't clear I can expand on it.
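
To expand a little: below is a rough stand-alone model of the size selection,
in bytes only (QImode/HImode/SImode/DImode/TImode = 1/2/4/8/16 bytes).  It is
just a sketch of the choices the loop makes, not the actual RTL expansion:

/* Rough model of the size selection in the new aarch64_expand_movmem loop:
   for a copy of n bytes, print the chunk sizes that would be copied.  */
#include <stdio.h>

static void
copy_sizes (int n)
{
  static const int sizes[] = { 1, 2, 4, 8, 16 };
  printf ("%2d bytes:", n);
  while (n > 0)
    {
      /* Largest mode that doesn't over-read the source.  */
      int cur = 1;
      for (int i = 0; i < 5; i++)
        if (sizes[i] <= n)
          cur = sizes[i];
      printf (" %d", cur);
      n -= cur;

      /* Overlap the trailing copy when the next larger mode would only
         over-read a little and is at most 8 bytes (DImode).  */
      int next = 16;
      for (int i = 4; i >= 0; i--)
        if (n > 0 && sizes[i] >= n)
          next = sizes[i];
      if (n > 0 && next > n && next <= 8)
        n = next;  /* The pointers are moved back by next - n bytes.  */
    }
  printf ("\n");
}

int
main (void)
{
  copy_sizes (5);   /* prints: 4 1  */
  copy_sizes (7);   /* prints: 4 4  (second copy overlaps by 1 byte)  */
  copy_sizes (15);  /* prints: 8 8  (second copy overlaps by 1 byte)  */
  return 0;
}

So for n = 5 this gives 4 + 1, for n = 7 it gives 4 + 4 (the same two SImode
operations as before), and for n = 15 it gives 8 + 8 with the second copy
overlapping by one byte.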

> I'd like to see a testcase which shows the advantages for CSE you describe
> in your introduction to the patch; I'm not sure your testcase below gives
> that.

This CSE relies on a mid-end change also being accepted. This would be
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html which was reverted due to
it not handling corner cases correctly (https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html).
I am performing regression testing on the new version and will submit it soon.

If we take a more useful example of a program than the ones in the testcase and cover letter:

struct struct5  fun5()
{
  return foo5;
}

where foo5 is a simple 5 byte struct.

after SSA this becomes

struct struct5  fun5()
{
  D.XXXX = foo5
  return D.XXXX;
}

This means two copies will be done: the first one into D.XXXX and the second one by the
return; one is handled by the mid-end and one by this back-end code.

Before this patch, at -O3 we generated 16 insns:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldrb    w2, [x1, 32]
        ldr     w1, [x1, 33]
        add     sp, sp, 16
        bfi     x0, x2, 0, 8
        ubfx    x3, x1, 8, 8
        bfi     x0, x1, 8, 8
        ubfx    x2, x1, 16, 8
        lsr     w1, w1, 24
        bfi     x0, x3, 16, 8
        bfi     x0, x2, 24, 8
        bfi     x0, x1, 32, 8
        ret

With this patch alone it is still 16 insns: no regression, but no improvement either:

fun5:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldr     w1, [x2, 32]
        ldrb    w2, [x2, 36]
        add     sp, sp, 16
        ubfx    x4, x1, 8, 8
        bfi     x0, x1, 0, 8
        ubfx    x3, x1, 16, 8
        lsr     w1, w1, 24
        bfi     x0, x4, 8, 8
        bfi     x0, x3, 16, 8
        bfi     x0, x1, 24, 8
        bfi     x0, x2, 32, 8
        ret

And with this patch and the mid-end patch this becomes 10 insns:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldr     w2, [x1, 32]
        ldrb    w1, [x1, 36]
        add     sp, sp, 16
        bfi     x0, x2, 0, 32
        bfi     x0, x1, 32, 8
        ret

With the mid-end patch alone and the existing implementation it would be 14 insns:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        sub     sp, sp, #16
        mov     x0, 0
        ldr     w2, [x1, 32]
        ldr     w1, [x1, 33]
        str     w2, [sp, 8]
        str     w1, [sp, 9]
        ldr     w2, [sp, 8]
        ldrb    w1, [sp, 12]
        add     sp, sp, 16
        bfi     x0, x2, 0, 32
        bfi     x0, x1, 32, 8
        ret


The differences become more pronounced with the larger copies; for instance, the 10-byte copy case saves 18 instructions.

Every function handled by this code path becomes smaller. I can provide the comparison data if you want to see it.

> 
> > Regtested on aarch64-none-elf and .. issues.
> 
> Do I draw a zero on the line on my screen with a sharpie or what ;-)

Sorry, seems I had forgotten to fill this in. My apologies.
This should be 0 regressions.

> 
> > Ok for trunk?
> 
> Justification and performance impact would be appreciated. Right now I'm not
> convinced this is forward progress.
> 
> > gcc/
> > 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> > 
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	* gcc.target/aarch64/struct_cpy.c: New.
> > 
> > -- 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >  bool
> >  aarch64_expand_movmem (rtx *operands)
> >  {
> > -  unsigned int n;
> > +  int n, mode_bits;
> >    rtx dst = operands[0];
> >    rtx src = operands[1];
> >    rtx base;
> > +  machine_mode cur_mode = BLKmode, next_mode;
> >    bool speed_p = !optimize_function_for_size_p (cfun);
> >  
> >    /* When optimizing for size, give a better estimate of the length of a
> > -     memcpy call, but use the default otherwise.  */
> > -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> > +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> > +     will always require an even number of instructions to do now.  And each
> > +     operation requires both a load+store, so divide the max number by 2.  */
> > +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
> 
> Why did we go from unsigned to signed int?

Because n is now signed. It made it easier to work with in the loop, and n will never be too large
to fit in a signed integer, as then this copy routine wouldn't be called.  So I wanted to avoid
casts back and forth between signed and unsigned; I don't believe we benefit from using unsigned
types here.

> 
> Thanks,
> James
> 

-- 


* Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
  2018-06-19 14:09 [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes Tamar Christina
  2018-06-19 21:23 ` James Greenhalgh
@ 2018-07-03 17:13 ` James Greenhalgh
  2018-07-06  7:53   ` Christophe Lyon
  1 sibling, 1 reply; 5+ messages in thread
From: James Greenhalgh @ 2018-07-03 17:13 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft

On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,

<snip>

OK.

Thanks,
James

> Thanks,
> Tamar
> 
> gcc/
> 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> 
> gcc/testsuite/
> 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> 
> 	* gcc.target/aarch64/struct_cpy.c: New.
> 
> -- 


* Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
  2018-07-03 17:13 ` James Greenhalgh
@ 2018-07-06  7:53   ` Christophe Lyon
  0 siblings, 0 replies; 5+ messages in thread
From: Christophe Lyon @ 2018-07-06  7:53 UTC (permalink / raw)
  To: James Greenhalgh
  Cc: Tamar Christina, gcc Patches, nd, Richard Earnshaw, Marcus Shawcroft

Hi Tamar,

On Tue, 3 Jul 2018 at 19:13, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
>
> <snip>
>
> OK.
>
> Thanks,
> James
>
> > Thanks,
> > Tamar
> >
> > gcc/
> > 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> >
> >       * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> >
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina  <tamar.christina@arm.com>
> >
> >       * gcc.target/aarch64/struct_cpy.c: New.
> >

The new test has a typo making it fail to compile:
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:49:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:50:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:51:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:52:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:53:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:54:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:55:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:56:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:57:7: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:58:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:59:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:60:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:61:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:62:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:63:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:64:8: error: ISO C does
not allow extra ';' outside of a function [-Wpedantic]

Would you mind fixing it?
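
One possible fix (just a sketch, not necessarily what will go in) is to drop
the trailing semicolons, since the FUN macro already expands to a complete
function definition, e.g. for the first case:

struct struct1 { char a; };
struct struct1 foo1 = { '1' };

#define FUN(x) void fun##x () \
{ \
  volatile struct struct##x var##x = foo##x; \
}

FUN (1)  /* No trailing ';': the macro body already ends with '}'.  */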

Thanks,

Christophe

> > --


