[PATCH] Reduce stack usage in sha512 (PR target/77308)

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] Reduce stack usage in sha512 (PR target/77308)
@ 2016-09-30  8:17 Bernd Edlinger
  2016-09-30  8:49 ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Bernd Edlinger @ 2016-09-30  8:17 UTC (permalink / raw)
  To: GCC Patches; +Cc: Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 602 bytes --]

Hi,

this patch mitigates the excessive stack usage on arm in code
that does lots of int64 shift ops like sha512.

It reduces the stack usage in that example from 4K to 2K while
less than 0.5K would be expected.

In all cases the additional set instructions are optimized later
so that this caused no code size increase, but just made
LRA's job a bit easier.

It does certainly not solve the problem completely but at least
improve the stability, in an area that I'd call security relevant.


Boot-strapped and reg-tested on arm-linux-gnueabihf.
Is it OK for trunk?


Thanks
Bernd.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308.diff --]
[-- Type: text/x-patch; name="patch-pr77308.diff", Size: 1165 bytes --]

2016-09-29  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.c (arm_emit_coreregs_64bit_shift): Clear the result
	register explicitly.

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 239624)
+++ gcc/config/arm/arm.c	(working copy)
@@ -29159,6 +29159,7 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant less than 32.  */
 	  rtx reverse_amount = GEN_INT (32 - INTVAL (amount));
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
 	  emit_insn (SET (out_down,
 			  ORR (REV_LSHIFT (code, in_up, reverse_amount),
@@ -29170,12 +29171,11 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant greater than 31.  */
 	  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
 	  if (code == ASHIFTRT)
 	    emit_insn (gen_ashrsi3 (out_up, in_up,
 				    GEN_INT (31)));
-	  else
-	    emit_insn (SET (out_up, const0_rtx));
 	}
     }
   else

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30  8:17 [PATCH] Reduce stack usage in sha512 (PR target/77308) Bernd Edlinger
@ 2016-09-30  8:49 ` Richard Biener
  2016-09-30  9:32   ` Eric Botcazou
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2016-09-30  8:49 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: GCC Patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

On Fri, Sep 30, 2016 at 9:48 AM, Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
> Hi,
>
> this patch mitigates the excessive stack usage on arm in code
> that does lots of int64 shift ops like sha512.
>
> It reduces the stack usage in that example from 4K to 2K while
> less than 0.5K would be expected.
>
> In all cases the additional set instructions are optimized later
> so that this caused no code size increase, but just made
> LRA's job a bit easier.
>
> It does certainly not solve the problem completely but at least
> improve the stability, in an area that I'd call security relevant.
>
>
> Boot-strapped and reg-tested on arm-linux-gnueabihf.
> Is it OK for trunk?

A comment before the SETs and a testcase would be nice.  IIRC
we do have stack size testcases via using -fstack-usage.

Richard.

>
> Thanks
> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30  8:49 ` Richard Biener
@ 2016-09-30  9:32   ` Eric Botcazou
  2016-09-30 10:27     ` Bernd Edlinger
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Botcazou @ 2016-09-30  9:32 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, Bernd Edlinger, Nick Clifton, Richard Earnshaw,
	Ramana Radhakrishnan

> A comment before the SETs and a testcase would be nice.  IIRC
> we do have stack size testcases via using -fstack-usage.

Or -Wstack-usage, which might be more appropriate here.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30  9:32   ` Eric Botcazou
@ 2016-09-30 10:27     ` Bernd Edlinger
  2016-09-30 13:45       ` Bernd Edlinger
  0 siblings, 1 reply; 16+ messages in thread
From: Bernd Edlinger @ 2016-09-30 10:27 UTC (permalink / raw)
  To: Eric Botcazou, Richard Biener
  Cc: gcc-patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

Eric Botcazou wrote:
>> A comment before the SETs and a testcase would be nice.  IIRC
>> we do have stack size testcases via using -fstack-usage.
>
>Or -Wstack-usage, which might be more appropriate here.

Yes.  good idea.  I was not aware that we already have that kind of tests.

When trying to write this test. I noticed, that I did not try -Os so far.
But for -Os the stack is still the unchanged 3500 bytes.

However for embedded targets I am often inclined to use -Os, and
would certainly not expect the stack to explode...

I see in arm.md there are places like

      /* If we're optimizing for size, we prefer the libgcc calls.  */
      if (optimize_function_for_size_p (cfun))
        FAIL;

      /* Expand operation using core-registers.
         'FAIL' would achieve the same thing, but this is a bit smarter.  */
      scratch1 = gen_reg_rtx (SImode);
      scratch2 = gen_reg_rtx (SImode);
      arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0], operands[1],
                                     operands[2], scratch1, scratch2);

.. that explains why this happens.  I think it would be better to
use the emit_coreregs for shift count >= 32, because these are
effectively 32-bit shifts.

Will try if that can be improved, and come back with the
results.

Thanks
Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30 10:27     ` Bernd Edlinger
@ 2016-09-30 13:45       ` Bernd Edlinger
  2016-10-06 16:19         ` [PING] [PATCH, ARM] " Bernd Edlinger
  2016-10-17 16:47         ` [PATCH] " Kyrill Tkachov
  0 siblings, 2 replies; 16+ messages in thread
From: Bernd Edlinger @ 2016-09-30 13:45 UTC (permalink / raw)
  To: Eric Botcazou, Richard Biener
  Cc: gcc-patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On 09/30/16 12:14, Bernd Edlinger wrote:
> Eric Botcazou wrote:
>>> A comment before the SETs and a testcase would be nice.  IIRC
>>> we do have stack size testcases via using -fstack-usage.
>>
>> Or -Wstack-usage, which might be more appropriate here.
>
> Yes.  good idea.  I was not aware that we already have that kind of tests.
>
> When trying to write this test. I noticed, that I did not try -Os so far.
> But for -Os the stack is still the unchanged 3500 bytes.
>
> However for embedded targets I am often inclined to use -Os, and
> would certainly not expect the stack to explode...
>
> I see in arm.md there are places like
>
>        /* If we're optimizing for size, we prefer the libgcc calls.  */
>        if (optimize_function_for_size_p (cfun))
>          FAIL;
>

Oh, yeah.  The comment is completely misleading.

If this pattern fails, expmed.c simply expands some
less efficient rtl, which also results in two shifts
and one or-op.  No libgcc calls at all.

So in simple cases without spilling the resulting
assembler is the same, regardless if this pattern
fails or not.  But the half-defined out registers
make a big difference when it has to be spilled.

>        /* Expand operation using core-registers.
>           'FAIL' would achieve the same thing, but this is a bit smarter.  */
>        scratch1 = gen_reg_rtx (SImode);
>        scratch2 = gen_reg_rtx (SImode);
>        arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0], operands[1],
>                                       operands[2], scratch1, scratch2);
>
>
> .. that explains why this happens.  I think it would be better to
> use the emit_coreregs for shift count >= 32, because these are
> effectively 32-bit shifts.
>
> Will try if that can be improved, and come back with the
> results.
>

The test case with -Os has 3520 bytes stack usage.
When only shift count >= 32 are handled we
have still 3000 bytes stack usage.
And when arm_emit_coreregs_64bit_shift is always
allowed to run, we have 2360 bytes stack usage.

Also for the code size it is better not to fail this
pattern.  So I propose to remove this exception in all
three expansions.

Here is an improved patch with the test case from the PR.
And a comment on the redundant SET why it is better to clear
the out register first.


Bootstrap and reg-testing on arm-linux-gnueabihf.
Is it OK for trunk?


Thanks
Bernd.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308.diff --]
[-- Type: text/x-patch; name="patch-pr77308.diff", Size: 10426 bytes --]

2016-09-29  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.c (arm_emit_coreregs_64bit_shift): Clear the result
	register explicitly.
	* config/arm/arm.md (ashldi3, ashrdi3, lshrdi3): Don't FAIL if
	optimizing for size.

testsuite:
2016-09-29  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308.c: New test.


Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 240645)
+++ gcc/config/arm/arm.c	(working copy)
@@ -29226,6 +29226,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant less than 32.  */
 	  rtx reverse_amount = GEN_INT (32 - INTVAL (amount));
 
+	  /* Clearing the out register in DImode first avoids lots
+	     of spilling and results in less stack usage.
+	     Later this redundant insn is completely removed.  */
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
 	  emit_insn (SET (out_down,
 			  ORR (REV_LSHIFT (code, in_up, reverse_amount),
@@ -29237,12 +29241,11 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant greater than 31.  */
 	  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
 	  if (code == ASHIFTRT)
 	    emit_insn (gen_ashrsi3 (out_up, in_up,
 				    GEN_INT (31)));
-	  else
-	    emit_insn (SET (out_up, const0_rtx));
 	}
     }
   else
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 240645)
+++ gcc/config/arm/arm.md	(working copy)
@@ -4016,10 +4016,6 @@
          cheaper to have the alternate code being generated than moving
          values to iwmmxt regs and back.  */
 
-      /* If we're optimizing for size, we prefer the libgcc calls.  */
-      if (optimize_function_for_size_p (cfun))
-	FAIL;
-
       /* Expand operation using core-registers.
 	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
       scratch1 = gen_reg_rtx (SImode);
@@ -4089,10 +4085,6 @@
          cheaper to have the alternate code being generated than moving
          values to iwmmxt regs and back.  */
 
-      /* If we're optimizing for size, we prefer the libgcc calls.  */
-      if (optimize_function_for_size_p (cfun))
-	FAIL;
-
       /* Expand operation using core-registers.
 	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
       scratch1 = gen_reg_rtx (SImode);
@@ -4159,10 +4151,6 @@
          cheaper to have the alternate code being generated than moving
          values to iwmmxt regs and back.  */
 
-      /* If we're optimizing for size, we prefer the libgcc calls.  */
-      if (optimize_function_for_size_p (cfun))
-	FAIL;
-
       /* Expand operation using core-registers.
 	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
       scratch1 = gen_reg_rtx (SImode);
Index: gcc/testsuite/gcc.target/arm/pr77308.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308.c	(working copy)
@@ -0,0 +1,164 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       (ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
+#define Sigma1(x)       (ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
+#define sigma0(x)       (ROTR((x),1)  ^ ROTR((x),8)  ^ ((x)>>7))
+#define sigma1(x)       (ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PING] [PATCH, ARM] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30 13:45       ` Bernd Edlinger
@ 2016-10-06 16:19         ` Bernd Edlinger
  2016-10-17 13:47           ` [PING**2] " Bernd Edlinger
  2016-10-17 16:47         ` [PATCH] " Kyrill Tkachov
  1 sibling, 1 reply; 16+ messages in thread
From: Bernd Edlinger @ 2016-10-06 16:19 UTC (permalink / raw)
  To: Eric Botcazou, Richard Biener
  Cc: gcc-patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

Hi,

I'd like to ping for this patch:
https://gcc.gnu.org/ml/gcc-patches/2016-09/msg02362.html

Thanks,
Bernd.


On 09/30/16 15:33, Bernd Edlinger wrote:
> On 09/30/16 12:14, Bernd Edlinger wrote:
>> Eric Botcazou wrote:
>>>> A comment before the SETs and a testcase would be nice.  IIRC
>>>> we do have stack size testcases via using -fstack-usage.
>>>
>>> Or -Wstack-usage, which might be more appropriate here.
>>
>> Yes.  good idea.  I was not aware that we already have that kind of
>> tests.
>>
>> When trying to write this test. I noticed, that I did not try -Os so far.
>> But for -Os the stack is still the unchanged 3500 bytes.
>>
>> However for embedded targets I am often inclined to use -Os, and
>> would certainly not expect the stack to explode...
>>
>> I see in arm.md there are places like
>>
>>        /* If we're optimizing for size, we prefer the libgcc calls.  */
>>        if (optimize_function_for_size_p (cfun))
>>          FAIL;
>>
>
> Oh, yeah.  The comment is completely misleading.
>
> If this pattern fails, expmed.c simply expands some
> less efficient rtl, which also results in two shifts
> and one or-op.  No libgcc calls at all.
>
> So in simple cases without spilling the resulting
> assembler is the same, regardless if this pattern
> fails or not.  But the half-defined out registers
> make a big difference when it has to be spilled.
>
>>        /* Expand operation using core-registers.
>>           'FAIL' would achieve the same thing, but this is a bit
>> smarter.  */
>>        scratch1 = gen_reg_rtx (SImode);
>>        scratch2 = gen_reg_rtx (SImode);
>>        arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0], operands[1],
>>                                       operands[2], scratch1, scratch2);
>>
>>
>> .. that explains why this happens.  I think it would be better to
>> use the emit_coreregs for shift count >= 32, because these are
>> effectively 32-bit shifts.
>>
>> Will try if that can be improved, and come back with the
>> results.
>>
>
> The test case with -Os has 3520 bytes stack usage.
> When only shift count >= 32 are handled we
> have still 3000 bytes stack usage.
> And when arm_emit_coreregs_64bit_shift is always
> allowed to run, we have 2360 bytes stack usage.
>
> Also for the code size it is better not to fail this
> pattern.  So I propose to remove this exception in all
> three expansions.
>
> Here is an improved patch with the test case from the PR.
> And a comment on the redundant SET why it is better to clear
> the out register first.
>
>
> Bootstrap and reg-testing on arm-linux-gnueabihf.
> Is it OK for trunk?
>
>
> Thanks
> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PING**2] [PATCH, ARM] Reduce stack usage in sha512 (PR target/77308)
  2016-10-06 16:19         ` [PING] [PATCH, ARM] " Bernd Edlinger
@ 2016-10-17 13:47           ` Bernd Edlinger
  0 siblings, 0 replies; 16+ messages in thread
From: Bernd Edlinger @ 2016-10-17 13:47 UTC (permalink / raw)
  To: Eric Botcazou, Richard Biener
  Cc: gcc-patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

Ping....

On 10/06/16 18:18, Bernd Edlinger wrote:
> Hi,
>
> I'd like to ping for this patch:
> https://gcc.gnu.org/ml/gcc-patches/2016-09/msg02362.html
>
> Thanks,
> Bernd.
>
>
> On 09/30/16 15:33, Bernd Edlinger wrote:
>> On 09/30/16 12:14, Bernd Edlinger wrote:
>>> Eric Botcazou wrote:
>>>>> A comment before the SETs and a testcase would be nice.  IIRC
>>>>> we do have stack size testcases via using -fstack-usage.
>>>>
>>>> Or -Wstack-usage, which might be more appropriate here.
>>>
>>> Yes.  good idea.  I was not aware that we already have that kind of
>>> tests.
>>>
>>> When trying to write this test. I noticed, that I did not try -Os so
>>> far.
>>> But for -Os the stack is still the unchanged 3500 bytes.
>>>
>>> However for embedded targets I am often inclined to use -Os, and
>>> would certainly not expect the stack to explode...
>>>
>>> I see in arm.md there are places like
>>>
>>>        /* If we're optimizing for size, we prefer the libgcc calls.  */
>>>        if (optimize_function_for_size_p (cfun))
>>>          FAIL;
>>>
>>
>> Oh, yeah.  The comment is completely misleading.
>>
>> If this pattern fails, expmed.c simply expands some
>> less efficient rtl, which also results in two shifts
>> and one or-op.  No libgcc calls at all.
>>
>> So in simple cases without spilling the resulting
>> assembler is the same, regardless if this pattern
>> fails or not.  But the half-defined out registers
>> make a big difference when it has to be spilled.
>>
>>>        /* Expand operation using core-registers.
>>>           'FAIL' would achieve the same thing, but this is a bit
>>> smarter.  */
>>>        scratch1 = gen_reg_rtx (SImode);
>>>        scratch2 = gen_reg_rtx (SImode);
>>>        arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0],
>>> operands[1],
>>>                                       operands[2], scratch1, scratch2);
>>>
>>>
>>> .. that explains why this happens.  I think it would be better to
>>> use the emit_coreregs for shift count >= 32, because these are
>>> effectively 32-bit shifts.
>>>
>>> Will try if that can be improved, and come back with the
>>> results.
>>>
>>
>> The test case with -Os has 3520 bytes stack usage.
>> When only shift count >= 32 are handled we
>> have still 3000 bytes stack usage.
>> And when arm_emit_coreregs_64bit_shift is always
>> allowed to run, we have 2360 bytes stack usage.
>>
>> Also for the code size it is better not to fail this
>> pattern.  So I propose to remove this exception in all
>> three expansions.
>>
>> Here is an improved patch with the test case from the PR.
>> And a comment on the redundant SET why it is better to clear
>> the out register first.
>>
>>
>> Bootstrap and reg-testing on arm-linux-gnueabihf.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-09-30 13:45       ` Bernd Edlinger
  2016-10-06 16:19         ` [PING] [PATCH, ARM] " Bernd Edlinger
@ 2016-10-17 16:47         ` Kyrill Tkachov
  2016-10-18  8:36           ` Christophe Lyon
  1 sibling, 1 reply; 16+ messages in thread
From: Kyrill Tkachov @ 2016-10-17 16:47 UTC (permalink / raw)
  To: Bernd Edlinger, Eric Botcazou, Richard Biener
  Cc: gcc-patches, Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan


On 30/09/16 14:34, Bernd Edlinger wrote:
> On 09/30/16 12:14, Bernd Edlinger wrote:
>> Eric Botcazou wrote:
>>>> A comment before the SETs and a testcase would be nice.  IIRC
>>>> we do have stack size testcases via using -fstack-usage.
>>> Or -Wstack-usage, which might be more appropriate here.
>> Yes.  good idea.  I was not aware that we already have that kind of tests.
>>
>> When trying to write this test. I noticed, that I did not try -Os so far.
>> But for -Os the stack is still the unchanged 3500 bytes.
>>
>> However for embedded targets I am often inclined to use -Os, and
>> would certainly not expect the stack to explode...
>>
>> I see in arm.md there are places like
>>
>>         /* If we're optimizing for size, we prefer the libgcc calls.  */
>>         if (optimize_function_for_size_p (cfun))
>>           FAIL;
>>
> Oh, yeah.  The comment is completely misleading.
>
> If this pattern fails, expmed.c simply expands some
> less efficient rtl, which also results in two shifts
> and one or-op.  No libgcc calls at all.
>
> So in simple cases without spilling the resulting
> assembler is the same, regardless if this pattern
> fails or not.  But the half-defined out registers
> make a big difference when it has to be spilled.
>
>>         /* Expand operation using core-registers.
>>            'FAIL' would achieve the same thing, but this is a bit smarter.  */
>>         scratch1 = gen_reg_rtx (SImode);
>>         scratch2 = gen_reg_rtx (SImode);
>>         arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0], operands[1],
>>                                        operands[2], scratch1, scratch2);
>>
>>
>> .. that explains why this happens.  I think it would be better to
>> use the emit_coreregs for shift count >= 32, because these are
>> effectively 32-bit shifts.
>>
>> Will try if that can be improved, and come back with the
>> results.
>>
> The test case with -Os has 3520 bytes stack usage.
> When only shift count >= 32 are handled we
> have still 3000 bytes stack usage.
> And when arm_emit_coreregs_64bit_shift is always
> allowed to run, we have 2360 bytes stack usage.
>
> Also for the code size it is better not to fail this
> pattern.  So I propose to remove this exception in all
> three expansions.
>
> Here is an improved patch with the test case from the PR.
> And a comment on the redundant SET why it is better to clear
> the out register first.
>
>
> Bootstrap and reg-testing on arm-linux-gnueabihf.
> Is it OK for trunk?

This looks ok to me.
Thanks,
Kyrill

>
> Thanks
> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-17 16:47         ` [PATCH] " Kyrill Tkachov
@ 2016-10-18  8:36           ` Christophe Lyon
  2016-10-18 14:45             ` Bernd Edlinger
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe Lyon @ 2016-10-18  8:36 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Bernd Edlinger, Eric Botcazou, Richard Biener, gcc-patches,
	Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

Hi,


On 17 October 2016 at 18:47, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>
> On 30/09/16 14:34, Bernd Edlinger wrote:
>>
>> On 09/30/16 12:14, Bernd Edlinger wrote:
>>>
>>> Eric Botcazou wrote:
>>>>>
>>>>> A comment before the SETs and a testcase would be nice.  IIRC
>>>>> we do have stack size testcases via using -fstack-usage.
>>>>
>>>> Or -Wstack-usage, which might be more appropriate here.
>>>
>>> Yes.  good idea.  I was not aware that we already have that kind of
>>> tests.
>>>
>>> When trying to write this test. I noticed, that I did not try -Os so far.
>>> But for -Os the stack is still the unchanged 3500 bytes.
>>>
>>> However for embedded targets I am often inclined to use -Os, and
>>> would certainly not expect the stack to explode...
>>>
>>> I see in arm.md there are places like
>>>
>>>         /* If we're optimizing for size, we prefer the libgcc calls.  */
>>>         if (optimize_function_for_size_p (cfun))
>>>           FAIL;
>>>
>> Oh, yeah.  The comment is completely misleading.
>>
>> If this pattern fails, expmed.c simply expands some
>> less efficient rtl, which also results in two shifts
>> and one or-op.  No libgcc calls at all.
>>
>> So in simple cases without spilling the resulting
>> assembler is the same, regardless if this pattern
>> fails or not.  But the half-defined out registers
>> make a big difference when it has to be spilled.
>>
>>>         /* Expand operation using core-registers.
>>>            'FAIL' would achieve the same thing, but this is a bit
>>> smarter.  */
>>>         scratch1 = gen_reg_rtx (SImode);
>>>         scratch2 = gen_reg_rtx (SImode);
>>>         arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0],
>>> operands[1],
>>>                                        operands[2], scratch1, scratch2);
>>>
>>>
>>> .. that explains why this happens.  I think it would be better to
>>> use the emit_coreregs for shift count >= 32, because these are
>>> effectively 32-bit shifts.
>>>
>>> Will try if that can be improved, and come back with the
>>> results.
>>>
>> The test case with -Os has 3520 bytes stack usage.
>> When only shift count >= 32 are handled we
>> have still 3000 bytes stack usage.
>> And when arm_emit_coreregs_64bit_shift is always
>> allowed to run, we have 2360 bytes stack usage.
>>
>> Also for the code size it is better not to fail this
>> pattern.  So I propose to remove this exception in all
>> three expansions.
>>
>> Here is an improved patch with the test case from the PR.
>> And a comment on the redundant SET why it is better to clear
>> the out register first.
>>
>>
>> Bootstrap and reg-testing on arm-linux-gnueabihf.
>> Is it OK for trunk?
>
>
> This looks ok to me.
> Thanks,
> Kyrill
>

I am seeing a lot of regressions since this patch was committed:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html

(you can click on "REGRESSED" to see the list of regressions, "sum"
and "log" to download
the corresponding .sum/.log)

Thanks,

Christophe

>>
>> Thanks
>> Bernd.
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-18  8:36           ` Christophe Lyon
@ 2016-10-18 14:45             ` Bernd Edlinger
  2016-10-18 15:35               ` Christophe Lyon
  0 siblings, 1 reply; 16+ messages in thread
From: Bernd Edlinger @ 2016-10-18 14:45 UTC (permalink / raw)
  To: Christophe Lyon, Kyrill Tkachov
  Cc: Eric Botcazou, Richard Biener, gcc-patches, Nick Clifton,
	Richard Earnshaw, Ramana Radhakrishnan

[-- Attachment #1: Type: text/plain, Size: 663 bytes --]

On 10/18/16 10:36, Christophe Lyon wrote:
>
> I am seeing a lot of regressions since this patch was committed:
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>
> (you can click on "REGRESSED" to see the list of regressions, "sum"
> and "log" to download
> the corresponding .sum/.log)
>
> Thanks,
>
> Christophe
>

Oh, sorry, I have completely missed that.
Unfortunately I have tested one of the good combinations.

I have not considered the case that in and out could be
the same register.  But that happens here.


This should solve it.

Can you give it a try?



Thanks
Bernd.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-arm-fixup.diff --]
[-- Type: text/x-patch; name="patch-arm-fixup.diff", Size: 1552 bytes --]

2016-10-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* config/arm/arm.c (arm_emit_coreregs_64bit_shift): Clear the result
	register only if "in" and "out" are different registers.

--- gcc/config/arm/arm.c.orig	2016-10-17 11:00:34.045673223 +0200
+++ gcc/config/arm/arm.c	2016-10-18 14:53:06.710101327 +0200
@@ -29218,8 +29218,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_
 
 	  /* Clearing the out register in DImode first avoids lots
 	     of spilling and results in less stack usage.
-	     Later this redundant insn is completely removed.  */
-	  emit_insn (SET (out, const0_rtx));
+	     Later this redundant insn is completely removed.
+	     Do that only if "in" and "out" are different registers.  */
+	  if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
+	    emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
 	  emit_insn (SET (out_down,
 			  ORR (REV_LSHIFT (code, in_up, reverse_amount),
@@ -29231,11 +29233,14 @@ arm_emit_coreregs_64bit_shift (enum rtx_
 	  /* Shifts by a constant greater than 31.  */
 	  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
-	  emit_insn (SET (out, const0_rtx));
+	  if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
+	    emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
 	  if (code == ASHIFTRT)
 	    emit_insn (gen_ashrsi3 (out_up, in_up,
 				    GEN_INT (31)));
+	  else
+	    emit_insn (SET (out_up, const0_rtx));
 	}
     }
   else

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-18 14:45             ` Bernd Edlinger
@ 2016-10-18 15:35               ` Christophe Lyon
  2016-10-19  6:55                 ` Christophe Lyon
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe Lyon @ 2016-10-18 15:35 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kyrill Tkachov, Eric Botcazou, Richard Biener, gcc-patches,
	Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> On 10/18/16 10:36, Christophe Lyon wrote:
>>
>> I am seeing a lot of regressions since this patch was committed:
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>
>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>> and "log" to download
>> the corresponding .sum/.log)
>>
>> Thanks,
>>
>> Christophe
>>
>
> Oh, sorry, I have completely missed that.
> Unfortunately I have tested one of the good combinations.
>
> I have not considered the case that in and out could be
> the same register.  But that happens here.
>
>
> This should solve it.
>
> Can you give it a try?
>

Sure, I have started the validations, it will take a few hours.

>
>
> Thanks
> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-18 15:35               ` Christophe Lyon
@ 2016-10-19  6:55                 ` Christophe Lyon
  2016-10-19  8:34                   ` Kyrill Tkachov
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe Lyon @ 2016-10-19  6:55 UTC (permalink / raw)
  To: Bernd Edlinger
  Cc: Kyrill Tkachov, Eric Botcazou, Richard Biener, gcc-patches,
	Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

On 18 October 2016 at 17:35, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>> On 10/18/16 10:36, Christophe Lyon wrote:
>>>
>>> I am seeing a lot of regressions since this patch was committed:
>>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>>
>>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>>> and "log" to download
>>> the corresponding .sum/.log)
>>>
>>> Thanks,
>>>
>>> Christophe
>>>
>>
>> Oh, sorry, I have completely missed that.
>> Unfortunately I have tested one of the good combinations.
>>
>> I have not considered the case that in and out could be
>> the same register.  But that happens here.
>>
>>
>> This should solve it.
>>
>> Can you give it a try?
>>
>
> Sure, I have started the validations, it will take a few hours.
>

It looks OK, thanks.

>>
>>
>> Thanks
>> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-19  6:55                 ` Christophe Lyon
@ 2016-10-19  8:34                   ` Kyrill Tkachov
  2016-10-19 10:44                     ` Christophe Lyon
  0 siblings, 1 reply; 16+ messages in thread
From: Kyrill Tkachov @ 2016-10-19  8:34 UTC (permalink / raw)
  To: Christophe Lyon, Bernd Edlinger
  Cc: Eric Botcazou, Richard Biener, gcc-patches, Nick Clifton,
	Richard Earnshaw, Ramana Radhakrishnan


On 19/10/16 07:55, Christophe Lyon wrote:
> On 18 October 2016 at 17:35, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>> On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>> On 10/18/16 10:36, Christophe Lyon wrote:
>>>> I am seeing a lot of regressions since this patch was committed:
>>>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>>>
>>>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>>>> and "log" to download
>>>> the corresponding .sum/.log)
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>> Oh, sorry, I have completely missed that.
>>> Unfortunately I have tested one of the good combinations.
>>>
>>> I have not considered the case that in and out could be
>>> the same register.  But that happens here.
>>>
>>>
>>> This should solve it.
>>>
>>> Can you give it a try?
>>>
>> Sure, I have started the validations, it will take a few hours.
>>
> It looks OK, thanks.

Ok. Thanks for testing Christophe.
Sorry for not catching it in review.
Kyrill

>>>
>>> Thanks
>>> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-19  8:34                   ` Kyrill Tkachov
@ 2016-10-19 10:44                     ` Christophe Lyon
  2016-10-19 16:06                       ` Bernd Edlinger
  0 siblings, 1 reply; 16+ messages in thread
From: Christophe Lyon @ 2016-10-19 10:44 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Bernd Edlinger, Eric Botcazou, Richard Biener, gcc-patches,
	Nick Clifton, Richard Earnshaw, Ramana Radhakrishnan

On 19 October 2016 at 10:34, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>
> On 19/10/16 07:55, Christophe Lyon wrote:
>>
>> On 18 October 2016 at 17:35, Christophe Lyon <christophe.lyon@linaro.org>
>> wrote:
>>>
>>> On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de>
>>> wrote:
>>>>
>>>> On 10/18/16 10:36, Christophe Lyon wrote:
>>>>>
>>>>> I am seeing a lot of regressions since this patch was committed:
>>>>>
>>>>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>>>>
>>>>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>>>>> and "log" to download
>>>>> the corresponding .sum/.log)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Christophe
>>>>>
>>>> Oh, sorry, I have completely missed that.
>>>> Unfortunately I have tested one of the good combinations.
>>>>
>>>> I have not considered the case that in and out could be
>>>> the same register.  But that happens here.
>>>>
>>>>
>>>> This should solve it.
>>>>
>>>> Can you give it a try?
>>>>
>>> Sure, I have started the validations, it will take a few hours.
>>>
>> It looks OK, thanks.
>
>
> Ok. Thanks for testing Christophe.
> Sorry for not catching it in review.
> Kyrill
>

Since it was painful to check that this 2nd patch fixed all the regressions
observed in all the configurations, I ran another validation with
the 2 patches combined, on top of r241272 to check the effect of the two
over the previous reference.

It turns out there is still a failure:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/241272-r241273-combined/report-build-info.html
gcc.c-torture/execute/pr34971.c   -O0  execution test
now fails on arm-none-linux-gnueabihf when using thumb mode, expect
when targeting cortex-a5.

Christophe

>>>>
>>>> Thanks
>>>> Bernd.
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-19 10:44                     ` Christophe Lyon
@ 2016-10-19 16:06                       ` Bernd Edlinger
  2016-10-19 16:47                         ` Kyrill Tkachov
  0 siblings, 1 reply; 16+ messages in thread
From: Bernd Edlinger @ 2016-10-19 16:06 UTC (permalink / raw)
  To: Christophe Lyon, Kyrill Tkachov
  Cc: Eric Botcazou, Richard Biener, gcc-patches, Nick Clifton,
	Richard Earnshaw, Ramana Radhakrishnan

On 10/19/16 12:44, Christophe Lyon wrote:
> On 19 October 2016 at 10:34, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> On 19/10/16 07:55, Christophe Lyon wrote:
>>>
>>> On 18 October 2016 at 17:35, Christophe Lyon <christophe.lyon@linaro.org>
>>> wrote:
>>>>
>>>> On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de>
>>>> wrote:
>>>>>
>>>>> On 10/18/16 10:36, Christophe Lyon wrote:
>>>>>>
>>>>>> I am seeing a lot of regressions since this patch was committed:
>>>>>>
>>>>>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>>>>>
>>>>>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>>>>>> and "log" to download
>>>>>> the corresponding .sum/.log)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>> Oh, sorry, I have completely missed that.
>>>>> Unfortunately I have tested one of the good combinations.
>>>>>
>>>>> I have not considered the case that in and out could be
>>>>> the same register.  But that happens here.
>>>>>
>>>>>
>>>>> This should solve it.
>>>>>
>>>>> Can you give it a try?
>>>>>
>>>> Sure, I have started the validations, it will take a few hours.
>>>>
>>> It looks OK, thanks.
>>
>>
>> Ok. Thanks for testing Christophe.
>> Sorry for not catching it in review.
>> Kyrill
>>
>
> Since it was painful to check that this 2nd patch fixed all the regressions
> observed in all the configurations, I ran another validation with
> the 2 patches combined, on top of r241272 to check the effect of the two
> over the previous reference.
>
> It turns out there is still a failure:
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/241272-r241273-combined/report-build-info.html
> gcc.c-torture/execute/pr34971.c   -O0  execution test
> now fails on arm-none-linux-gnueabihf when using thumb mode, expect
> when targeting cortex-a5.
>

Thanks Christophe, I looked in the test case,
and I think that is a pre-existing issue.

I can make the test case fail before my patch:

struct foo
{
   unsigned long long b:40;
} x;

extern void abort (void);

void test1(unsigned long long res)
{
   /* Build a rotate expression on a 40 bit argument.  */
   if ((x.b<<9) + (x.b>>31) != res)
     abort ();
}

int main()
{
   x.b = 0x0100000001;
   test1(0x0000000202);
   x.b = 0x0100000000;
   test1(0x0000000002);
   return 0;
}


fails with -mcpu=cortex-a9 -mthumb -mfpu=neon-fp16 -O0
even before my patch.

before reload we have this insn:
(insn 19 18 20 2 (parallel [
             (set (reg:DI 113 [ _4 ])
                 (lshiftrt:DI (reg:DI 112 [ _3 ])
                     (const_int 31 [0x1f])))
             (clobber (scratch:SI))
             (clobber (scratch:SI))
             (clobber (scratch:DI))
             (clobber (reg:CC 100 cc))
         ]) "pr34971.c":11 1232 {lshrdi3_neon}


which is replaced to this insn:
(insn 19 18 20 2 (parallel [
             (set (reg:DI 1 r1 [orig:113 _4 ] [113])
                 (lshiftrt:DI (reg:DI 0 r0 [orig:112 _3 ] [112])
                     (const_int 31 [0x1f])))
             (clobber (scratch:SI))
             (clobber (scratch:SI))
             (clobber (scratch:DI))
             (clobber (reg:CC 100 cc))
         ]) "pr34971.c":11 1232 {lshrdi3_neon}
      (nil))

And now that is not supposed to happen, when this code
is executed for shifts by a constant less than 32:

           emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
           emit_insn (SET (out_down,
                           ORR (REV_LSHIFT (code, in_up, reverse_amount),
                                out_down)));
           emit_insn (SET (out_up, SHIFT (code, in_up, amount)));


out_down may not be the same hard register than in_up for this to work.

I opened a new BZ for this:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

I am not sure if this is a LRA issue or if it can be handled in the
shift expansion.

Is the second patch good for trunk as it is?


Thanks
Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Reduce stack usage in sha512 (PR target/77308)
  2016-10-19 16:06                       ` Bernd Edlinger
@ 2016-10-19 16:47                         ` Kyrill Tkachov
  0 siblings, 0 replies; 16+ messages in thread
From: Kyrill Tkachov @ 2016-10-19 16:47 UTC (permalink / raw)
  To: Bernd Edlinger, Christophe Lyon
  Cc: Eric Botcazou, Richard Biener, gcc-patches, Nick Clifton,
	Richard Earnshaw, Ramana Radhakrishnan


On 19/10/16 17:06, Bernd Edlinger wrote:
> On 10/19/16 12:44, Christophe Lyon wrote:
>> On 19 October 2016 at 10:34, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>>> On 19/10/16 07:55, Christophe Lyon wrote:
>>>> On 18 October 2016 at 17:35, Christophe Lyon <christophe.lyon@linaro.org>
>>>> wrote:
>>>>> On 18 October 2016 at 16:45, Bernd Edlinger <bernd.edlinger@hotmail.de>
>>>>> wrote:
>>>>>> On 10/18/16 10:36, Christophe Lyon wrote:
>>>>>>> I am seeing a lot of regressions since this patch was committed:
>>>>>>>
>>>>>>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/241273/report-build-info.html
>>>>>>>
>>>>>>> (you can click on "REGRESSED" to see the list of regressions, "sum"
>>>>>>> and "log" to download
>>>>>>> the corresponding .sum/.log)
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Christophe
>>>>>>>
>>>>>> Oh, sorry, I have completely missed that.
>>>>>> Unfortunately I have tested one of the good combinations.
>>>>>>
>>>>>> I have not considered the case that in and out could be
>>>>>> the same register.  But that happens here.
>>>>>>
>>>>>>
>>>>>> This should solve it.
>>>>>>
>>>>>> Can you give it a try?
>>>>>>
>>>>> Sure, I have started the validations, it will take a few hours.
>>>>>
>>>> It looks OK, thanks.
>>>
>>> Ok. Thanks for testing Christophe.
>>> Sorry for not catching it in review.
>>> Kyrill
>>>
>> Since it was painful to check that this 2nd patch fixed all the regressions
>> observed in all the configurations, I ran another validation with
>> the 2 patches combined, on top of r241272 to check the effect of the two
>> over the previous reference.
>>
>> It turns out there is still a failure:
>> http://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/241272-r241273-combined/report-build-info.html
>> gcc.c-torture/execute/pr34971.c   -O0  execution test
>> now fails on arm-none-linux-gnueabihf when using thumb mode, expect
>> when targeting cortex-a5.
>>
> Thanks Christophe, I looked in the test case,
> and I think that is a pre-existing issue.
>
> I can make the test case fail before my patch:
>
> struct foo
> {
>     unsigned long long b:40;
> } x;
>
> extern void abort (void);
>
> void test1(unsigned long long res)
> {
>     /* Build a rotate expression on a 40 bit argument.  */
>     if ((x.b<<9) + (x.b>>31) != res)
>       abort ();
> }
>
> int main()
> {
>     x.b = 0x0100000001;
>     test1(0x0000000202);
>     x.b = 0x0100000000;
>     test1(0x0000000002);
>     return 0;
> }
>
>
> fails with -mcpu=cortex-a9 -mthumb -mfpu=neon-fp16 -O0
> even before my patch.
>
> before reload we have this insn:
> (insn 19 18 20 2 (parallel [
>               (set (reg:DI 113 [ _4 ])
>                   (lshiftrt:DI (reg:DI 112 [ _3 ])
>                       (const_int 31 [0x1f])))
>               (clobber (scratch:SI))
>               (clobber (scratch:SI))
>               (clobber (scratch:DI))
>               (clobber (reg:CC 100 cc))
>           ]) "pr34971.c":11 1232 {lshrdi3_neon}
>
>
> which is replaced to this insn:
> (insn 19 18 20 2 (parallel [
>               (set (reg:DI 1 r1 [orig:113 _4 ] [113])
>                   (lshiftrt:DI (reg:DI 0 r0 [orig:112 _3 ] [112])
>                       (const_int 31 [0x1f])))
>               (clobber (scratch:SI))
>               (clobber (scratch:SI))
>               (clobber (scratch:DI))
>               (clobber (reg:CC 100 cc))
>           ]) "pr34971.c":11 1232 {lshrdi3_neon}
>        (nil))
>
> And now that is not supposed to happen, when this code
> is executed for shifts by a constant less than 32:
>
>             emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
>             emit_insn (SET (out_down,
>                             ORR (REV_LSHIFT (code, in_up, reverse_amount),
>                                  out_down)));
>             emit_insn (SET (out_up, SHIFT (code, in_up, amount)));
>
>
> out_down may not be the same hard register than in_up for this to work.
>
> I opened a new BZ for this:
>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041
>
> I am not sure if this is a LRA issue or if it can be handled in the
> shift expansion.
>
> Is the second patch good for trunk as it is?

Thanks for the investigation and the bug report.
The patch is ok.
Kyrill

>
> Thanks
> Bernd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-10-19 16:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-30  8:17 [PATCH] Reduce stack usage in sha512 (PR target/77308) Bernd Edlinger
2016-09-30  8:49 ` Richard Biener
2016-09-30  9:32   ` Eric Botcazou
2016-09-30 10:27     ` Bernd Edlinger
2016-09-30 13:45       ` Bernd Edlinger
2016-10-06 16:19         ` [PING] [PATCH, ARM] " Bernd Edlinger
2016-10-17 13:47           ` [PING**2] " Bernd Edlinger
2016-10-17 16:47         ` [PATCH] " Kyrill Tkachov
2016-10-18  8:36           ` Christophe Lyon
2016-10-18 14:45             ` Bernd Edlinger
2016-10-18 15:35               ` Christophe Lyon
2016-10-19  6:55                 ` Christophe Lyon
2016-10-19  8:34                   ` Kyrill Tkachov
2016-10-19 10:44                     ` Christophe Lyon
2016-10-19 16:06                       ` Bernd Edlinger
2016-10-19 16:47                         ` Kyrill Tkachov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).