* [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) @ 2016-12-18 13:32 Bernd Edlinger 2016-12-20 15:29 ` Wilco Dijkstra 2017-04-29 20:09 ` Bernd Edlinger 0 siblings, 2 replies; 9+ messages in thread From: Bernd Edlinger @ 2016-12-18 13:32 UTC (permalink / raw) To: gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 555 bytes --] Hi, this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned also at split1 except for TARGET_NEON and TARGET_IWMMXT. In the new test case the stack is reduced to about 270 bytes, except for neon and iwmmxt, where this does not change anything. This patch depends on [1] and [2] before it can be applied. Bootstrapped and reg-tested on arm-linux-gnueabihf. Is it OK for trunk? Thanks Bernd. [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-4.diff --] [-- Type: text/x-patch; name="patch-pr77308-4.diff", Size: 9756 bytes --] 2016-12-18 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn, *arm_cmpdi_unsigned): Split early except for TARGET_NEON and TARGET_IWMMXT. testsuite: 2016-12-18 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * gcc.target/arm/pr77308-2.c: New test. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 243782) +++ gcc/config/arm/arm.md (working copy) @@ -4689,7 +4689,7 @@ "TARGET_32BIT" "#" ; rsbs %Q0, %Q1, #0; rsc %R0, %R1, #0 (ARM) ; negs %Q0, %Q1 ; sbc %R0, %R1, %R1, lsl #1 (Thumb-2) - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) @@ -7359,7 +7359,7 @@ (clobber (match_scratch:SI 2 "=r"))] "TARGET_32BIT" "#" ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 0) (match_dup 1))) (parallel [(set (reg:CC CC_REGNUM) @@ -7383,7 +7383,10 @@ operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]); } operands[1] = gen_lowpart (SImode, operands[1]); - operands[2] = gen_lowpart (SImode, operands[2]); + if (can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); + else + operands[2] = gen_lowpart (SImode, operands[2]); } [(set_attr "conds" "set") (set_attr "length" "8") @@ -7397,7 +7400,7 @@ "TARGET_32BIT" "#" ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 2) (match_dup 3))) (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0)) Index: gcc/testsuite/gcc.target/arm/pr77308-2.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr77308-2.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr77308-2.c (working copy) @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks. + It improves the test coverage of cmpdi and negdi2 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. 
*/ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x)) +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x)) +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x)) +#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? 
-(x) : (x)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] = PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 9+ messages in thread
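As background for the review discussion that follows: the three patterns touched by the patch all operate on DImode (64-bit) values, which 32-bit ARM has to carry in register pairs. The sketch below is illustrative only and is not part of the original mails; the instruction sequences in the comments are copied from the pattern comments in arm.md, and which pattern is actually selected depends on the comparison code and on the -mthumb/-mfpu options.

    /* Illustrative only: DImode operations that the splitters changed
       by this patch can match on a 32-bit ARM target.  */
    long long neg64 (long long x)
    {
      return -x;      /* negdi2_insn (*arm_negdi2):
                         ARM:     rsbs %Q0, %Q1, #0 ; rsc %R0, %R1, #0
                         Thumb-2: negs %Q0, %Q1 ; sbc %R0, %R1, %R1, lsl #1  */
    }

    int lt64 (long long a, long long b)
    {
      return a < b;   /* *arm_cmpdi_insn: cmp %Q0, %Q1 ; sbcs %2, %R0, %R1  */
    }

    int eq64 (unsigned long long a, unsigned long long b)
    {
      return a == b;  /* *arm_cmpdi_unsigned:
                         cmp %R0, %R1 ; it eq ; cmpeq %Q0, %Q1  */
    }

The point of splitting these at split1 rather than after reload is that the register allocator then sees independent SImode pieces instead of long-lived DImode pairs; per the comment in the new test case, letting the DImode insns survive to reload is what causes the large stack usage.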
* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2016-12-18 13:32 [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) Bernd Edlinger @ 2016-12-20 15:29 ` Wilco Dijkstra 2016-12-20 18:52 ` Bernd Edlinger 2017-04-29 20:09 ` Bernd Edlinger 1 sibling, 1 reply; 9+ messages in thread From: Wilco Dijkstra @ 2016-12-20 15:29 UTC (permalink / raw) To: Bernd Edlinger, gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd Bernd Edlinger wrote: > this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned > also at split1 except for TARGET_NEON and TARGET_IWMMXT. > > In the new test case the stack is reduced to about 270 bytes, except > for neon and iwmmxt, where this does not change anything. This looks odd: - operands[2] = gen_lowpart (SImode, operands[2]); + if (can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); + else + operands[2] = gen_lowpart (SImode, operands[2]); Given this is an SI mode scratch, do we need the else part at all? It seems wrong to ask for the low part of an SI mode operand... Other than that it looks good to me, but I can't approve. As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi* patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI mode compare with zero is always split into ORR during expand. Wilco ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2016-12-20 15:29 ` Wilco Dijkstra @ 2016-12-20 18:52 ` Bernd Edlinger 2016-12-21 15:20 ` Wilco Dijkstra 0 siblings, 1 reply; 9+ messages in thread From: Bernd Edlinger @ 2016-12-20 18:52 UTC (permalink / raw) To: Wilco Dijkstra, gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd On 12/20/16 16:09, Wilco Dijkstra wrote: > Bernd Edlinger wrote: >> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned >> also at split1 except for TARGET_NEON and TARGET_IWMMXT. >> >> In the new test case the stack is reduced to about 270 bytes, except >> for neon and iwmmxt, where this does not change anything. > > This looks odd: > > - operands[2] = gen_lowpart (SImode, operands[2]); > + if (can_create_pseudo_p ()) > + operands[2] = gen_reg_rtx (SImode); > + else > + operands[2] = gen_lowpart (SImode, operands[2]); > > Given this is an SI mode scratch, do we need the else part at all? It seems wrong > to ask for the low part of an SI mode operand... > Yes, I think that is correct. > Other than that it looks good to me, but I can't approve. > > As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi* > patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI > mode compare with zero is always split into ORR during expand. > I did not change anything for -mthumb -mfpu=neon for instance. Do you think that iordi_notdi* is never used also for that configuration? And if the arm_cmpdi_zero is never expanded, isn't it already unused before my patch? Bernd. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2016-12-20 18:52 ` Bernd Edlinger @ 2016-12-21 15:20 ` Wilco Dijkstra 0 siblings, 0 replies; 9+ messages in thread From: Wilco Dijkstra @ 2016-12-21 15:20 UTC (permalink / raw) To: Bernd Edlinger, gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, nd Bernd Edlinger wrote: On 12/20/16 16:09, Wilco Dijkstra wrote: > > As a result of your patches a few patterns are unused now. All the Thumb-2 iordi_notdi* > > patterns cannot be used anymore. Also I think arm_cmpdi_zero never gets used - a DI >> mode compare with zero is always split into ORR during expand. > > I did not change anything for -mthumb -mfpu=neon for instance. > Do you think that iordi_notdi* is never used also for that > configuration? With -mfpu=vfp or -msoft-float, these patterns cannot be used as logical operations are expanded before combine. Interestingly with -mfpu=neon ARM uses the orndi3_neon patterns (which are inefficient for ARM and probably should be disabled) but Thumb-2 uses the iordi_notdi patterns... So removing these reduces the number of patterns while we will still generate orn for Thumb-2. > And if the arm_cmpdi_zero is never expanded, isn't it already > unused before my patch? It appears to be, so we don't need to fix it now. However when improving the expansion of comparisons it does trigger. For example x == 3 expands currently into 3 instructions: cmp r1, #0 itt eq cmpeq r0, #3 Tweaking arm_select_cc_mode uses arm_cmpdi_zero, and when expanded early we generate this: eor r0, r0, #3 orrs r0, r0, r1 Using sub rather than eor would be even better of course. Wilco ^ permalink raw reply [flat|nested] 9+ messages in thread
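A source-level reproducer for the x == 3 example Wilco gives above might look as follows. This is illustrative only and not part of the original mail; the asm in the comment is copied from the mail, covers only the comparison itself, and whether the shorter sequence is produced depends on the compiler version and options.

    /* 64-bit compare against a small constant on 32-bit ARM.
       Current expansion of the comparison:
           cmp r1, #0 ; itt eq ; cmpeq r0, #3
       With arm_select_cc_mode tweaked so arm_cmpdi_zero is used and
       expanded early, as described above:
           eor r0, r0, #3 ; orrs r0, r0, r1  */
    int is_three (unsigned long long x)
    {
      return x == 3;
    }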
* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2016-12-18 13:32 [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) Bernd Edlinger 2016-12-20 15:29 ` Wilco Dijkstra @ 2017-04-29 20:09 ` Bernd Edlinger 2017-05-12 16:55 ` [PING**2] " Bernd Edlinger 2017-09-04 16:20 ` Kyrill Tkachov 1 sibling, 2 replies; 9+ messages in thread From: Bernd Edlinger @ 2017-04-29 20:09 UTC (permalink / raw) To: gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 707 bytes --] Ping... I attached the latest version of my patch. Thanks Bernd. On 12/18/16 14:14, Bernd Edlinger wrote: > Hi, > > this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned > also at split1 except for TARGET_NEON and TARGET_IWMMXT. > > In the new test case the stack is reduced to about 270 bytes, except > for neon and iwmmxt, where this does not change anything. > > This patch depends on [1] and [2] before it can be applied. > > Bootstrapped and reg-tested on arm-linux-gnueabihf. > Is it OK for trunk? > > > Thanks > Bernd. > > > > [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html > [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-4.diff --] [-- Type: text/x-patch; name="patch-pr77308-4.diff", Size: 9687 bytes --] 2016-12-18 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn, *arm_cmpdi_unsigned): Split early except for TARGET_NEON and TARGET_IWMMXT. testsuite: 2016-12-18 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * gcc.target/arm/pr77308-2.c: New test. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 243782) +++ gcc/config/arm/arm.md (working copy) @@ -4689,7 +4689,7 @@ "TARGET_32BIT" "#" ; rsbs %Q0, %Q1, #0; rsc %R0, %R1, #0 (ARM) ; negs %Q0, %Q1 ; sbc %R0, %R1, %R1, lsl #1 (Thumb-2) - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) @@ -7359,7 +7359,7 @@ (clobber (match_scratch:SI 2 "=r"))] "TARGET_32BIT" "#" ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 0) (match_dup 1))) (parallel [(set (reg:CC CC_REGNUM) @@ -7383,7 +7383,8 @@ operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]); } operands[1] = gen_lowpart (SImode, operands[1]); - operands[2] = gen_lowpart (SImode, operands[2]); + if (can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); } [(set_attr "conds" "set") (set_attr "length" "8") @@ -7397,7 +7398,7 @@ "TARGET_32BIT" "#" ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 2) (match_dup 3))) (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0)) Index: gcc/testsuite/gcc.target/arm/pr77308-2.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr77308-2.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr77308-2.c (working copy) @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a 
modified algorithm with 64bit cmp and neg at the Sigma-blocks. + It improves the test coverage of cmpdi and negdi2 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. */ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x)) +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x)) +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x)) +#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? 
-(x) : (x)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] = PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PING**2] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2017-04-29 20:09 ` Bernd Edlinger @ 2017-05-12 16:55 ` Bernd Edlinger 2017-06-01 16:02 ` [PING**3] " Bernd Edlinger [not found] ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de> 2017-09-04 16:20 ` Kyrill Tkachov 1 sibling, 2 replies; 9+ messages in thread From: Bernd Edlinger @ 2017-05-12 16:55 UTC (permalink / raw) To: gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra Ping... On 04/29/17 19:52, Bernd Edlinger wrote: > Ping... > > I attached the latest version of my patch. > > > Thanks > Bernd. > > On 12/18/16 14:14, Bernd Edlinger wrote: >> Hi, >> >> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned >> also at split1 except for TARGET_NEON and TARGET_IWMMXT. >> >> In the new test case the stack is reduced to about 270 bytes, except >> for neon and iwmmxt, where this does not change anything. >> >> This patch depends on [1] and [2] before it can be applied. >> >> Bootstrapped and reg-tested on arm-linux-gnueabihf. >> Is it OK for trunk? >> >> >> Thanks >> Bernd. >> >> >> >> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html >> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PING**3] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2017-05-12 16:55 ` [PING**2] " Bernd Edlinger @ 2017-06-01 16:02 ` Bernd Edlinger [not found] ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de> 1 sibling, 0 replies; 9+ messages in thread From: Bernd Edlinger @ 2017-06-01 16:02 UTC (permalink / raw) To: gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra Ping... On 05/12/17 18:50, Bernd Edlinger wrote: > Ping... > > On 04/29/17 19:52, Bernd Edlinger wrote: >> Ping... >> >> I attached the latest version of my patch. >> >> >> Thanks >> Bernd. >> >> On 12/18/16 14:14, Bernd Edlinger wrote: >>> Hi, >>> >>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned >>> also at split1 except for TARGET_NEON and TARGET_IWMMXT. >>> >>> In the new test case the stack is reduced to about 270 bytes, except >>> for neon and iwmmxt, where this does not change anything. >>> >>> This patch depends on [1] and [2] before it can be applied. >>> >>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>> Is it OK for trunk? >>> >>> >>> Thanks >>> Bernd. >>> >>> >>> >>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html >>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PING**4] [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) [not found] ` <d2793404-f5c5-fe88-cb16-51a38ffa05c0@hotmail.de> @ 2017-06-14 12:39 ` Bernd Edlinger 0 siblings, 0 replies; 9+ messages in thread From: Bernd Edlinger @ 2017-06-14 12:39 UTC (permalink / raw) To: gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Kyrill Tkachov, Wilco Dijkstra Ping... for this patch: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01568.html On 06/01/17 18:02, Bernd Edlinger wrote: > Ping... > > On 05/12/17 18:50, Bernd Edlinger wrote: >> Ping... >> >> On 04/29/17 19:52, Bernd Edlinger wrote: >>> Ping... >>> >>> I attached the latest version of my patch. >>> >>> >>> Thanks >>> Bernd. >>> >>> On 12/18/16 14:14, Bernd Edlinger wrote: >>>> Hi, >>>> >>>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned >>>> also at split1 except for TARGET_NEON and TARGET_IWMMXT. >>>> >>>> In the new test case the stack is reduced to about 270 bytes, except >>>> for neon and iwmmxt, where this does not change anything. >>>> >>>> This patch depends on [1] and [2] before it can be applied. >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>> Is it OK for trunk? >>>> >>>> >>>> Thanks >>>> Bernd. >>>> >>>> >>>> >>>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html >>>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) 2017-04-29 20:09 ` Bernd Edlinger 2017-05-12 16:55 ` [PING**2] " Bernd Edlinger @ 2017-09-04 16:20 ` Kyrill Tkachov 1 sibling, 0 replies; 9+ messages in thread From: Kyrill Tkachov @ 2017-09-04 16:20 UTC (permalink / raw) To: Bernd Edlinger, gcc-patches Cc: Ramana Radhakrishnan, Richard Earnshaw, Wilco Dijkstra Hi Bernd, On 29/04/17 18:52, Bernd Edlinger wrote: > Ping... > > I attached the latest version of my patch. > > > Thanks > Bernd. > > On 12/18/16 14:14, Bernd Edlinger wrote: >> Hi, >> >> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned >> also at split1 except for TARGET_NEON and TARGET_IWMMXT. >> >> In the new test case the stack is reduced to about 270 bytes, except >> for neon and iwmmxt, where this does not change anything. >> >> This patch depends on [1] and [2] before it can be applied. >> >> Bootstrapped and reg-tested on arm-linux-gnueabihf. >> Is it OK for trunk? >> >> >> Thanks >> Bernd. >> >> >> >> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html >> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html 2016-12-18 Bernd Edlinger<bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn, *arm_cmpdi_unsigned): Split early except for TARGET_NEON and TARGET_IWMMXT. You're changing negdi2_insn rather than *arm_negdi2. Ok with the fixed ChangeLog once the prerequisite is committed. Thanks, Kyrill ^ permalink raw reply [flat|nested] 9+ messages in thread