[PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
@ 2016-12-18 13:32 Bernd Edlinger
  2016-12-20 15:29 ` Wilco Dijkstra
  2017-04-29 20:09 ` Bernd Edlinger
  0 siblings, 2 replies; 9+ messages in thread
From: Bernd Edlinger @ 2016-12-18 13:32 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]

Hi,

this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
also at split1 except for TARGET_NEON and TARGET_IWMMXT.

In the new test case the stack is reduced to about 270 bytes, except
for neon and iwmmxt, where this does not change anything.

This patch depends on [1] and [2] before it can be applied.

Bootstrapped and reg-tested on arm-linux-gnueabihf.
Is it OK for trunk?


Thanks
Bernd.



[1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
[2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308-4.diff --]
[-- Type: text/x-patch; name="patch-pr77308-4.diff", Size: 9756 bytes --]

2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn,
	*arm_cmpdi_unsigned): Split early except for TARGET_NEON and
	TARGET_IWMMXT.

testsuite:
2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-2.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 243782)
+++ gcc/config/arm/arm.md	(working copy)
@@ -4689,7 +4689,7 @@
   "TARGET_32BIT"
   "#"	; rsbs %Q0, %Q1, #0; rsc %R0, %R1, #0	       (ARM)
 	; negs %Q0, %Q1    ; sbc %R0, %R1, %R1, lsl #1 (Thumb-2)
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7359,7 +7359,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7383,7 +7383,10 @@
         operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7397,7 +7400,7 @@
 
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
Index: gcc/testsuite/gcc.target/arm/pr77308-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	(working copy)
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x)       (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x)       (ROTR((x),1)  ^ ROTR((x),8)  ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x)       (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2016-12-18 13:32 [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) Bernd Edlinger
@ 2016-12-20 15:29 ` Wilco Dijkstra
  2016-12-20 18:52   ` Bernd Edlinger
  2017-04-29 20:09 ` Bernd Edlinger
  1 sibling, 1 reply; 9+ messages in thread
From: Wilco Dijkstra @ 2016-12-20 15:29 UTC (permalink / raw)
  To: Bernd Edlinger, gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd

Bernd Edlinger wrote:
> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>
> In the new test case the stack is reduced to about 270 bytes, except
> for neon and iwmmxt, where this does not change anything.

This looks odd:

-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);

Given this is an SI mode scratch, do we need the else part at all? It seems wrong
to ask for the low part of an SI mode operand...

Other than that it looks good to me, but I can't approve.

As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi*
patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI
mode compare with zero is always split into ORR during expand.

Wilco


    

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2016-12-20 15:29 ` Wilco Dijkstra
@ 2016-12-20 18:52   ` Bernd Edlinger
  2016-12-21 15:20     ` Wilco Dijkstra
  0 siblings, 1 reply; 9+ messages in thread
From: Bernd Edlinger @ 2016-12-20 18:52 UTC (permalink / raw)
  To: Wilco Dijkstra, gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd

On 12/20/16 16:09, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>
>> In the new test case the stack is reduced to about 270 bytes, except
>> for neon and iwmmxt, where this does not change anything.
>
> This looks odd:
>
> -    operands[2] = gen_lowpart (SImode, operands[2]);
> +    if (can_create_pseudo_p ())
> +      operands[2] = gen_reg_rtx (SImode);
> +    else
> +      operands[2] = gen_lowpart (SImode, operands[2]);
>
> Given this is an SI mode scratch, do we need the else part at all? It seems wrong
> to ask for the low part of an SI mode operand...
>

Yes, I think that is correct.

> Other than that it looks good to me, but I can't approve.
>
> As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi*
> patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI
> mode compare with zero is always split into ORR during expand.
>

I did not change anything for -mthumb -mfpu=neon for instance.
Do you think that iordi_notdi* is never used also for that
configuration?

And if the arm_cmpdi_zero is never expanded, isn't it already
unused before my patch?


Bernd.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2016-12-20 18:52   ` Bernd Edlinger
@ 2016-12-21 15:20     ` Wilco Dijkstra
  0 siblings, 0 replies; 9+ messages in thread
From: Wilco Dijkstra @ 2016-12-21 15:20 UTC (permalink / raw)
  To: Bernd Edlinger, gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd

Bernd Edlinger wrote:
On 12/20/16 16:09, Wilco Dijkstra wrote:
> > As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi*
> > patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI
>> mode compare with zero is always split into ORR during expand.
>
> I did not change anything for -mthumb -mfpu=neon for instance.
> Do you think that iordi_notdi* is never used also for that
> configuration?

With -mfpu=vfp or -msoft-float, these patterns cannot be used as logical operations are expanded before combine. Interestingly with -mfpu=neon ARM uses the orndi3_neon patterns (which are inefficient for ARM and probably should be disabled) but Thumb-2 uses the iordi_notdi patterns... So removing these reduces the number of patterns while we will still generate orn for Thumb-2.

> And if the arm_cmpdi_zero is never expanded, isn't it already
> unused before my patch?

It appears to be, so we don't need to fix it now. However when improving the expansion of comparisons it does trigger. For example x == 3 expands currently into 3 instructions:

	cmp	r1, #0
	itt	eq
	cmpeq	r0, #3

Tweaking arm_select_cc_mode uses arm_cmpdi_zero, and when expanded early we generate this:

	eor	r0, r0, #3
	orrs	r0, r0, r1

Using sub rather than eor would be even better of course.

Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2016-12-18 13:32 [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) Bernd Edlinger
  2016-12-20 15:29 ` Wilco Dijkstra
@ 2017-04-29 20:09 ` Bernd Edlinger
  2017-05-12 16:55   ` [PING**2] " Bernd Edlinger
  2017-09-04 16:20   ` Kyrill Tkachov
  1 sibling, 2 replies; 9+ messages in thread
From: Bernd Edlinger @ 2017-04-29 20:09 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra

[-- Attachment #1: Type: text/plain, Size: 707 bytes --]

Ping...

I attached the latest version of my patch.


Thanks
Bernd.

On 12/18/16 14:14, Bernd Edlinger wrote:
> Hi,
>
> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>
> In the new test case the stack is reduced to about 270 bytes, except
> for neon and iwmmxt, where this does not change anything.
>
> This patch depends on [1] and [2] before it can be applied.
>
> Bootstrapped and reg-tested on arm-linux-gnueabihf.
> Is it OK for trunk?
>
>
> Thanks
> Bernd.
>
>
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308-4.diff --]
[-- Type: text/x-patch; name="patch-pr77308-4.diff", Size: 9687 bytes --]

2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn,
	*arm_cmpdi_unsigned): Split early except for TARGET_NEON and
	TARGET_IWMMXT.

testsuite:
2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-2.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 243782)
+++ gcc/config/arm/arm.md	(working copy)
@@ -4689,7 +4689,7 @@
   "TARGET_32BIT"
   "#"	; rsbs %Q0, %Q1, #0; rsc %R0, %R1, #0	       (ARM)
 	; negs %Q0, %Q1    ; sbc %R0, %R1, %R1, lsl #1 (Thumb-2)
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7359,7 +7359,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7383,7 +7383,8 @@
         operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7397,7 +7398,7 @@
 
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
Index: gcc/testsuite/gcc.target/arm/pr77308-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	(working copy)
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x)       (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x)       (ROTR((x),1)  ^ ROTR((x),8)  ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x)       (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PING**2] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2017-04-29 20:09 ` Bernd Edlinger
@ 2017-05-12 16:55   ` Bernd Edlinger
  2017-06-01 16:02     ` [PING**3] " Bernd Edlinger
       [not found]     ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de>
  2017-09-04 16:20   ` Kyrill Tkachov
  1 sibling, 2 replies; 9+ messages in thread
From: Bernd Edlinger @ 2017-05-12 16:55 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra

Ping...

On 04/29/17 19:52, Bernd Edlinger wrote:
> Ping...
>
> I attached the latest version of my patch.
>
>
> Thanks
> Bernd.
>
> On 12/18/16 14:14, Bernd Edlinger wrote:
>> Hi,
>>
>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>
>> In the new test case the stack is reduced to about 270 bytes, except
>> for neon and iwmmxt, where this does not change anything.
>>
>> This patch depends on [1] and [2] before it can be applied.
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
>>
>>
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PING**3] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2017-05-12 16:55   ` [PING**2] " Bernd Edlinger
@ 2017-06-01 16:02     ` Bernd Edlinger
       [not found]     ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de>
  1 sibling, 0 replies; 9+ messages in thread
From: Bernd Edlinger @ 2017-06-01 16:02 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra

Ping...

On 05/12/17 18:50, Bernd Edlinger wrote:
> Ping...
> 
> On 04/29/17 19:52, Bernd Edlinger wrote:
>> Ping...
>>
>> I attached the latest version of my patch.
>>
>>
>> Thanks
>> Bernd.
>>
>> On 12/18/16 14:14, Bernd Edlinger wrote:
>>> Hi,
>>>
>>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>>
>>> In the new test case the stack is reduced to about 270 bytes, except
>>> for neon and iwmmxt, where this does not change anything.
>>>
>>> This patch depends on [1] and [2] before it can be applied.
>>>
>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>> Is it OK for trunk?
>>>
>>>
>>> Thanks
>>> Bernd.
>>>
>>>
>>>
>>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PING**4] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
       [not found]     ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de>
@ 2017-06-14 12:39       ` Bernd Edlinger
  0 siblings, 0 replies; 9+ messages in thread
From: Bernd Edlinger @ 2017-06-14 12:39 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra

Ping...

for this patch: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01568.html

On 06/01/17 18:02, Bernd Edlinger wrote:
> Ping...
> 
> On 05/12/17 18:50, Bernd Edlinger wrote:
>> Ping...
>>
>> On 04/29/17 19:52, Bernd Edlinger wrote:
>>> Ping...
>>>
>>> I attached the latest version of my patch.
>>>
>>>
>>> Thanks
>>> Bernd.
>>>
>>> On 12/18/16 14:14, Bernd Edlinger wrote:
>>>> Hi,
>>>>
>>>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>>>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>>>
>>>> In the new test case the stack is reduced to about 270 bytes, except
>>>> for neon and iwmmxt, where this does not change anything.
>>>>
>>>> This patch depends on [1] and [2] before it can be applied.
>>>>
>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>> Is it OK for trunk?
>>>>
>>>>
>>>> Thanks
>>>> Bernd.
>>>>
>>>>
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>>>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308)
  2017-04-29 20:09 ` Bernd Edlinger
  2017-05-12 16:55   ` [PING**2] " Bernd Edlinger
@ 2017-09-04 16:20   ` Kyrill Tkachov
  1 sibling, 0 replies; 9+ messages in thread
From: Kyrill Tkachov @ 2017-09-04 16:20 UTC (permalink / raw)
  To: Bernd Edlinger, gcc-patches
  Cc: Ramana Radhakrishnan, Richard Earnshaw, Wilco Dijkstra

Hi Bernd,

On 29/04/17 18:52, Bernd Edlinger wrote:
> Ping...
>
> I attached the latest version of my patch.
>
>
> Thanks
> Bernd.
>
> On 12/18/16 14:14, Bernd Edlinger wrote:
>> Hi,
>>
>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>
>> In the new test case the stack is reduced to about 270 bytes, except
>> for neon and iwmmxt, where this does not change anything.
>>
>> This patch depends on [1] and [2] before it can be applied.
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
>>
>>
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

2016-12-18  Bernd Edlinger<bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn,
	*arm_cmpdi_unsigned): Split early except for TARGET_NEON and
	TARGET_IWMMXT.

You're changing negdi2_insn rather than *arm_negdi2.

Ok with the fixed ChangeLog once the prerequisite is committed.

Thanks,
Kyrill


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-09-04 16:20 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-18 13:32 [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) Bernd Edlinger
2016-12-20 15:29 ` Wilco Dijkstra
2016-12-20 18:52   ` Bernd Edlinger
2016-12-21 15:20     ` Wilco Dijkstra
2017-04-29 20:09 ` Bernd Edlinger
2017-05-12 16:55   ` [PING**2] " Bernd Edlinger
2017-06-01 16:02     ` [PING**3] " Bernd Edlinger
     [not found]     ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de>
2017-06-14 12:39       ` [PING**4] " Bernd Edlinger
2017-09-04 16:20   ` Kyrill Tkachov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).