* [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) @ 2016-11-06 14:18 Bernd Edlinger 2016-11-25 11:30 ` Ramana Radhakrishnan 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-11-06 14:18 UTC (permalink / raw) To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 672 bytes --] Hi! This improves the stack usage on the sha512 test case for the case without hardware fpu and without iwmmxt by splitting all di-mode patterns right while expanding which is similar to what the shift-pattern does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as thumb1. It reduces the stack usage from 2300 to near optimal 272 bytes (!). Note this also splits many ldrd/strd instructions and therefore I will post a followup-patch that mitigates this effect by enabling the ldrd/strd peephole optimization after the necessary reg-testing. Bootstrapped and reg-tested on arm-linux-gnueabihf. Is it OK for trunk? Thanks Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-2.diff --] [-- Type: text/x-patch; name="patch-pr77308-2.diff", Size: 12213 bytes --] 2016-11-02 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for TARGET_HARD_FLOAT. (anddi3, iordi3, xordi3): Split while expanding except for TARGET_HARD_FLOAT and TARGET_IWMMXT. (one_cmpldi2): Split while expanding except for TARGET_HARD_FLOAT. (*one_cmpldi2_insn): Moved the body of one_cmpldi2 here. testsuite: 2016-11-02 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * gcc.target/arm/pr77308-1.c: New test. 
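The effect of splitting a DImode logical operation at expand time can be sketched in plain C (an illustration of the transform, not the actual RTL): each 64-bit bitwise operation becomes one SImode operation on the low words and one on the high words, so no 64-bit pseudo has to stay live across the whole SHA-512 round function.

```c
#include <stdint.h>

/* Illustration of what the new anddi3/iordi3/xordi3 expanders emit on
   soft-float targets: two independent 32-bit operations on the low and
   high halves (gen_lowpart / gen_highpart), instead of one DImode
   operation that survives until after reload.  */
uint64_t and64_as_two_si_ops(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a & (uint32_t)b;                 /* low halves  */
    uint32_t hi = (uint32_t)(a >> 32) & (uint32_t)(b >> 32); /* high halves */
    return ((uint64_t)hi << 32) | lo;
}
```

IOR, XOR, and the one_cmpldi2 NOT case follow the same shape with the operator swapped.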
Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 241686) +++ gcc/config/arm/arm.md (working copy) @@ -467,7 +467,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && reload_completed + "TARGET_32BIT && (!TARGET_HARD_FLOAT || reload_completed) && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) @@ -1271,7 +1271,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" - "&& reload_completed" + "&& (!TARGET_HARD_FLOAT || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) (match_dup 2))) (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))]) @@ -2257,7 +2257,24 @@ (and:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_inv_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_HARD_FLOAT && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (AND, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (AND, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. 
*/ + " ) (define_insn_and_split "*anddi3_insn" @@ -3130,7 +3147,24 @@ (ior:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_HARD_FLOAT && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (IOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (IOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " ) (define_insn_and_split "*iordi3_insn" @@ -3311,7 +3345,24 @@ (xor:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "arm_xordi_operand" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_HARD_FLOAT && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (XOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (XOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. 
*/ + " ) (define_insn_and_split "*xordi3_insn" @@ -5017,7 +5068,31 @@ "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE" "") -(define_insn_and_split "one_cmpldi2" +(define_expand "one_cmpldi2" + [(set (match_operand:DI 0 "s_register_operand" "") + (not:DI (match_operand:DI 1 "s_register_operand" "")))] + "TARGET_32BIT" + " + if (!TARGET_HARD_FLOAT) + { + rtx low = simplify_gen_unary (NOT, SImode, + gen_lowpart (SImode, operands[1]), + SImode); + rtx high = simplify_gen_unary (NOT, SImode, + gen_highpart_mode (SImode, DImode, + operands[1]), + SImode); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " +) + +(define_insn_and_split "*one_cmpldi2_insn" [(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w") (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))] "TARGET_32BIT" Index: gcc/testsuite/gcc.target/arm/pr77308-1.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr77308-1.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr77308-1.c (working copy) @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with bit-not "~" at the Sigma-blocks. + It improves the test coverage of one_cmpldi2 and subdi3 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. 
*/ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + 
U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39)) +#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41)) +#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7)) +#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] 
= PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 28+ messages in thread
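The B()/PULL64() macros in the test case assemble a 64-bit word from big-endian bytes; a direct C rendering (loop form, function name is mine) is:

```c
#include <stdint.h>

/* Loop form of the B()/PULL64() macros from the test case above: byte j
   of the input supplies bits (7-j)*8 .. (7-j)*8+7, i.e. a big-endian
   64-bit load done byte by byte.  */
uint64_t pull64(const unsigned char *p)
{
    uint64_t x = 0;
    for (int j = 0; j < 8; j++)
        x |= (uint64_t)p[j] << ((7 - j) * 8);
    return x;
}
```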
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-06 14:18 [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) Bernd Edlinger @ 2016-11-25 11:30 ` Ramana Radhakrishnan 2016-11-28 19:42 ` Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Ramana Radhakrishnan @ 2016-11-25 11:30 UTC (permalink / raw) To: Bernd Edlinger Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote: > Hi! > > This improves the stack usage on the sha512 test case for the case > without hardware fpu and without iwmmxt by splitting all di-mode > patterns right while expanding which is similar to what the shift-pattern > does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as > thumb1. > I would go further and do this in the absence of Neon, the VFP unit being there doesn't help with DImode operations i.e. we do not have 64 bit integer arithmetic instructions without Neon. The main reason why we have the DImode patterns split so late is to give a chance for folks who want to do 64 bit arithmetic in Neon a chance to make this work as well as support some of the 64 bit Neon intrinsics which IIRC map down to these instructions. Doing this just for soft-float doesn't improve the default case only. I don't usually test iwmmxt and I'm not sure who has the ability to do so, thus keeping this restriction for iwMMX is fine. > It reduces the stack usage from 2300 to near optimal 272 bytes (!). > > Note this also splits many ldrd/strd instructions and therefore I will > post a followup-patch that mitigates this effect by enabling the ldrd/strd > peephole optimization after the necessary reg-testing. > > > Bootstrapped and reg-tested on arm-linux-gnueabihf. What do you mean by arm-linux-gnueabihf - when folks say that I interpret it as --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 or (--with-fpu=neon). 
If you've really bootstrapped and regtested it on armhf, doesn't this patch as it stand have no effect there i.e. no change ? arm-linux-gnueabihf usually means to me someone has configured with --with-float=hard, so there are no regressions in the hard float ABI case, Ramana > Is it OK for trunk? > > > Thanks > Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
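For reference, the hard-float configuration Ramana describes corresponds to a configure invocation along these lines (paths and the exact flag set are illustrative, not taken from the thread):

```shell
# Illustrative only: a typical arm-linux-gnueabihf GCC configuration as
# discussed above; the --with-fpu=neon variant is the one where DImode
# patterns must stay intact so Neon can do 64-bit integer arithmetic.
../gcc/configure --target=arm-linux-gnueabihf \
  --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard
```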
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-25 11:30 ` Ramana Radhakrishnan @ 2016-11-28 19:42 ` Bernd Edlinger [not found] ` <VI1PR0802MB2621FFBFA3252B40E5978C9F838D0@VI1PR0802MB2621.eurprd08.prod.outlook.com> 2017-04-29 19:17 ` [PING**2] " Bernd Edlinger 0 siblings, 2 replies; 28+ messages in thread From: Bernd Edlinger @ 2016-11-28 19:42 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 3030 bytes --] On 11/25/16 12:30, Ramana Radhakrishnan wrote: > On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger > <bernd.edlinger@hotmail.de> wrote: >> Hi! >> >> This improves the stack usage on the sha512 test case for the case >> without hardware fpu and without iwmmxt by splitting all di-mode >> patterns right while expanding which is similar to what the shift-pattern >> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as >> thumb1. >> > > I would go further and do this in the absence of Neon, the VFP unit > being there doesn't help with DImode operations i.e. we do not have 64 > bit integer arithmetic instructions without Neon. The main reason why > we have the DImode patterns split so late is to give a chance for > folks who want to do 64 bit arithmetic in Neon a chance to make this > work as well as support some of the 64 bit Neon intrinsics which IIRC > map down to these instructions. Doing this just for soft-float doesn't > improve the default case only. I don't usually test iwmmxt and I'm not > sure who has the ability to do so, thus keeping this restriction for > iwMMX is fine. > > Yes I understand, thanks for pointing that out. I was not aware what iwmmxt exists at all, but I noticed that most 64bit expansions work completely different, and would break if we split the pattern early. I can however only look at the assembler outout for iwmmxt, and make sure that the stack usage does not get worse. 
Thus the new version of the patch keeps only thumb1, neon and iwmmxt as it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack for the test cases, and vfp and soft-float at around 270 bytes stack usage. >> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >> >> Note this also splits many ldrd/strd instructions and therefore I will >> post a followup-patch that mitigates this effect by enabling the ldrd/strd >> peephole optimization after the necessary reg-testing. >> >> >> Bootstrapped and reg-tested on arm-linux-gnueabihf. > > What do you mean by arm-linux-gnueabihf - when folks say that I > interpret it as --with-arch=armv7-a --with-float=hard > --with-fpu=vfpv3-d16 or (--with-fpu=neon). > > If you've really bootstrapped and regtested it on armhf, doesn't this > patch as it stand have no effect there i.e. no change ? > arm-linux-gnueabihf usually means to me someone has configured with > --with-float=hard, so there are no regressions in the hard float ABI > case, > I know it proves little. When I say arm-linux-gnueabihf I do in fact mean --enable-languages=all,ada,go,obj-c++ --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard. My main interest in the stack usage is of course not because of linux, but because of eCos where we have very small task stacks and in fact no fpu support by the O/S at all, so that patch is exactly what we need. Bootstrapped and reg-tested on arm-linux-gnueabihf Is it OK for trunk? Thanks Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-2.diff --] [-- Type: text/x-patch; name="patch-pr77308-2.diff", Size: 12490 bytes --] 2016-11-25 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for TARGET_NEON and TARGET_IWMMXT. (anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for TARGET_NEON and TARGET_IWMMXT. 
(*one_cmpldi2_insn): Moved the body of one_cmpldi2 here. testsuite: 2016-11-25 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * gcc.target/arm/pr77308-1.c: New test. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 242875) +++ gcc/config/arm/arm.md (working copy) @@ -467,7 +467,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && reload_completed + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) @@ -1272,7 +1272,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" - "&& reload_completed" + "&& (!TARGET_IWMMXT || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) (match_dup 2))) (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))]) @@ -2258,7 +2258,24 @@ (and:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_inv_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (AND, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (AND, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. 
*/ + " ) (define_insn_and_split "*anddi3_insn" @@ -3131,7 +3148,24 @@ (ior:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (IOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (IOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " ) (define_insn_and_split "*iordi3_insn" @@ -3312,7 +3346,24 @@ (xor:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "arm_xordi_operand" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (XOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (XOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. 
*/ + " ) (define_insn_and_split "*xordi3_insn" @@ -5022,7 +5073,31 @@ "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE" "") -(define_insn_and_split "one_cmpldi2" +(define_expand "one_cmpldi2" + [(set (match_operand:DI 0 "s_register_operand" "") + (not:DI (match_operand:DI 1 "s_register_operand" "")))] + "TARGET_32BIT" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_unary (NOT, SImode, + gen_lowpart (SImode, operands[1]), + SImode); + rtx high = simplify_gen_unary (NOT, SImode, + gen_highpart_mode (SImode, DImode, + operands[1]), + SImode); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " +) + +(define_insn_and_split "*one_cmpldi2_insn" [(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w") (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))] "TARGET_32BIT" Index: gcc/testsuite/gcc.target/arm/pr77308-1.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr77308-1.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr77308-1.c (working copy) @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with bit-not "~" at the Sigma-blocks. + It improves the test coverage of one_cmpldi2 and subdi3 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. 
*/ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + 
U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39)) +#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41)) +#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7)) +#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] 
= PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 28+ messages in thread

* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) [not found] ` <VI1PR0802MB2621FFBFA3252B40E5978C9F838D0@VI1PR0802MB2621.eurprd08.prod.outlook.com> @ 2016-11-29 21:37 ` Bernd Edlinger [not found] ` <AM5PR0802MB261038521472515DDE3E58DA838D0@AM5PR0802MB2610.eurprd08.prod.outlook.com> 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-11-29 21:37 UTC (permalink / raw) To: Wilco Dijkstra, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw On 11/29/16 16:06, Wilco Dijkstra wrote: > Bernd Edlinger wrote: > > - "TARGET_32BIT && reload_completed > + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) > && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" > > This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're > already excluding NEON. > Aehm, no. This would split the addi_neon insn before it is clear if the reload pass will assign a VFP register. With this change the stack usage with -mfpu=neon increases from 2300 to around 2600 bytes. > This patch expands ADD and SUB earlier, so shouldn't we do the same obvious > change for the similar instructions CMP and NEG? > Good question. I think the cmp and neg pattern are more complicated and do typically have a more complicated data flow than the other patterns. 
I tried to create a test case which expands cmpdi and negdi patterns as follows: --- pr77308-1.c 2016-11-25 17:53:20.379141465 +0100 +++ pr77308-2.c 2016-11-29 20:46:51.266948631 +0100 @@ -68,10 +68,10 @@ #define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) #define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) #define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) -#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39)) -#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41)) -#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7)) -#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39) == (x) ? -(x) : (x)) +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ROTR((x),41) < (x) ? -(x) : (x)) +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7) <= (x) ? ~(x) : (x)) +#define sigma1(x) ((long long)(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) < (long long)(x) ? -(x) : (x)) #define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) #define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) This expands *arm_negdi2, *arm_cmpdi_unsigned, *arm_cmpdi_insn. The stack usage is around 1900 bytes with previous patch, and 2300 bytes without. I tried to split *arm_negdi2 and *arm_cmpdi_unsined early, and it gives indeed smaller stack sizes in the test case above (~400 bytes). 
But when I make *arm_cmpdi_insn split early, it ICEs: --- arm.md.orig 2016-11-27 09:22:41.794790123 +0100 +++ arm.md 2016-11-29 21:51:51.438163078 +0100 @@ -7432,7 +7432,7 @@ (clobber (match_scratch:SI 2 "=r"))] "TARGET_32BIT" "#" ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 0) (match_dup 1))) (parallel [(set (reg:CC CC_REGNUM) ontop of the latest patch, I got: gcc -S -Os pr77308-2.c -fdump-rtl-all-verbose pr77308-2.c: In function 'sha512_block_data_order': pr77308-2.c:169:1: error: unrecognizable insn: } ^ (insn 4870 4869 1636 87 (set (scratch:SI) (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4) (subreg:SI (reg:DI 473 [ X$14 ]) 4)) (ltu:SI (reg:CC_C 100 cc) (const_int 0 [0])))) "pr77308-2.c":140 -1 (nil)) pr77308-2.c:169:1: internal compiler error: in extract_insn, at recog.c:2311 0xaf4cd8 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) ../../gcc-trunk/gcc/rtl-error.c:108 0xaf4d09 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*) ../../gcc-trunk/gcc/rtl-error.c:116 0xac74ef extract_insn(rtx_insn*) ../../gcc-trunk/gcc/recog.c:2311 0x122427a decompose_multiword_subregs ../../gcc-trunk/gcc/lower-subreg.c:1467 0x122550d execute ../../gcc-trunk/gcc/lower-subreg.c:1734 So it is certainly possible, but not really simple to improve the stack size even further. But I would prefer to do that in a separate patch. BTW: there are also negd2_compare, *negdi_extendsidi, *negdi_zero_extendsidi, *thumb2_negdi2. I think it would be a precondition to have test cases that exercise each of these patterns before we try to split these instructions. Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
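For readers unfamiliar with the borrow chain involved here: the subs/sbc pair that *arm_subdi3 splits into computes the high-word subtraction with a borrow from the low words, and the ICE above arises because there is no pattern for an sbcs whose result goes to a scratch. A C model of the two-instruction sequence (function name is mine):

```c
#include <stdint.h>

/* C model of the "subs %Q0,%Q1,%Q2 / sbc %R0,%R1,%R2" pair that
   *arm_subdi3 splits into: subtract the low words (subs sets the carry
   flag to the inverted borrow), then subtract the high words minus the
   borrow (sbc consumes the flag).  */
uint64_t sub64_subs_sbc(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, blo = (uint32_t)b;
    uint32_t lo = alo - blo;
    uint32_t borrow = alo < blo;   /* ARM carry flag is the inverse of this */
    uint32_t hi = (uint32_t)(a >> 32) - (uint32_t)(b >> 32) - borrow;
    return ((uint64_t)hi << 32) | lo;
}
```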
[parent not found: <AM5PR0802MB261038521472515DDE3E58DA838D0@AM5PR0802MB2610.eurprd08.prod.outlook.com>]
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) [not found] ` <AM5PR0802MB261038521472515DDE3E58DA838D0@AM5PR0802MB2610.eurprd08.prod.outlook.com> @ 2016-11-30 12:01 ` Wilco Dijkstra 2016-11-30 17:01 ` Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Wilco Dijkstra @ 2016-11-30 12:01 UTC (permalink / raw) To: Bernd Edlinger, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, nd Bernd Edlinger wrote: > On 11/29/16 16:06, Wilco Dijkstra wrote: > > Bernd Edlinger wrote: > > > > - "TARGET_32BIT && reload_completed > > + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) > > && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" > > > > This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're > > already excluding NEON. > > Aehm, no. This would split the addi_neon insn before it is clear > if the reload pass will assign a VFP register. Hmm that's strange... This instruction shouldn't be used to also split some random Neon pattern - for example arm_subdi3 doesn't do the same. To understand and reason about any of these complex patterns they should all work in the same way... > But when I make *arm_cmpdi_insn split early, it ICEs: (insn 4870 4869 1636 87 (set (scratch:SI) (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4) (subreg:SI (reg:DI 473 [ X$14 ]) 4)) (ltu:SI (reg:CC_C 100 cc) (const_int 0 [0])))) "pr77308-2.c":140 -1 (nil)) That's easy, we don't have a sbcs <scratch>, r1, r2 pattern. A quick workaround is to create a temporary for operand[2] (if before reload) so it will match the standard sbcs pattern, and then the split works fine. > So it is certainly possible, but not really simple to improve the > stack size even further. But I would prefer to do that in a > separate patch. Yes separate patches would be fine. However there is a lot of scope to improve this further. 
For example after your patch shifts and logical operations are expanded in expand, add/sub are in split1 after combine runs and everything else is split after reload. It doesn't make sense to split different operations at different times - it means you're still going to get the bad DImode subregs and miss lots of optimization opportunities due to the mix of partly split and partly not-yet-split operations. > BTW: there are also negd2_compare, *negdi_extendsidi, > *negdi_zero_extendsidi, *thumb2_negdi2. I have a patch to merge thumb2_negdi2 into arm_negdi2. For extends, if we split them at expand time, then none of the combined alu+extend patterns will be needed, and that will be a huge simplification. > I think it would be a precondition to have test cases that exercise > each of these patterns before we try to split these instructions. Agreed. Wilco ^ permalink raw reply [flat|nested] 28+ messages in thread
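The subs/sbc pair that the arm_subdi3 splitter emits (see the "subs\t%Q0, %Q1, %Q2\;sbc\t%R0, %R1, %R2" template quoted earlier in the thread) can be modeled in plain C. This is an illustrative sketch with a hypothetical helper name, not GCC code; the key point is that the ARM carry flag after a subtraction is the inverted borrow.

```c
#include <stdint.h>

/* Model of the two-instruction sequence "subs rlo; sbc rhi" that the
   arm_subdi3 splitter produces: the low-word subtraction sets the ARM
   carry flag (carry set = no borrow), and the high-word subtraction
   consumes it.  Hypothetical name, for illustration only.  */
static uint64_t
subdi3_model (uint64_t a, uint64_t b)
{
  uint32_t alo = (uint32_t) a, ahi = (uint32_t) (a >> 32);
  uint32_t blo = (uint32_t) b, bhi = (uint32_t) (b >> 32);
  uint32_t rlo = alo - blo;                 /* subs */
  unsigned carry = alo >= blo;              /* ARM C flag: inverted borrow */
  uint32_t rhi = ahi - bhi - (1u - carry);  /* sbc */
  return ((uint64_t) rhi << 32) | rlo;
}
```

Because the high half depends on the carry from the low half, add/sub cannot be split into two independent SImode operations the way the bitwise patterns can.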
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-30 12:01 ` Wilco Dijkstra @ 2016-11-30 17:01 ` Bernd Edlinger 2016-12-08 19:50 ` Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-11-30 17:01 UTC (permalink / raw) To: Wilco Dijkstra, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, nd

[-- Attachment #1: Type: text/plain, Size: 4647 bytes --]

On 11/30/16 13:01, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> On 11/29/16 16:06, Wilco Dijkstra wrote:
>>> Bernd Edlinger wrote:
>>>
>>> -   "TARGET_32BIT && reload_completed
>>> +   "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
>>>      && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>>>
>>> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
>>> already excluding NEON.
>>
>> Aehm, no.  This would split the addi_neon insn before it is clear
>> if the reload pass will assign a VFP register.
>
> Hmm that's strange... This instruction shouldn't be used to also split some random
> Neon pattern - for example arm_subdi3 doesn't do the same.  To understand and
> reason about any of these complex patterns they should all work in the same way...
>

I was a bit surprised as well when I saw that happen.  But subdi3 is different:

  "TARGET_32BIT && !TARGET_NEON"
  "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
  "&& reload_completed"

so this never splits anything if TARGET_NEON.  But adddi3 cannot expand if TARGET_NEON; its pattern simply looks exactly like adddi3_neon:

(define_insn_and_split "*arm_adddi3"
  [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
	(plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
		 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
   (clobber (reg:CC CC_REGNUM))]
  "TARGET_32BIT && !TARGET_NEON"
  "#"
  "TARGET_32BIT && reload_completed
   && !
(TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" (define_insn "adddi3_neon" [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?w,?&r,?&r,?&r") (plus:DI (match_operand:DI 1 "s_register_operand" "%w,0,0,w,r,0,r") (match_operand:DI 2 "arm_adddi_operand" "w,r,0,w,r,Dd,Dd"))) (clobber (reg:CC CC_REGNUM))] "TARGET_NEON" { switch (which_alternative) { case 0: /* fall through */ case 3: return "vadd.i64\t%P0, %P1, %P2"; case 1: return "#"; case 2: return "#"; case 4: return "#"; case 5: return "#"; case 6: return "#"; default: gcc_unreachable (); } Even the return "#" explicitly invokes the former pattern. So I think the author knew that, and did it on purpose. >> But when I make *arm_cmpdi_insn split early, it ICEs: > > (insn 4870 4869 1636 87 (set (scratch:SI) > (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4) > (subreg:SI (reg:DI 473 [ X$14 ]) 4)) > (ltu:SI (reg:CC_C 100 cc) > (const_int 0 [0])))) "pr77308-2.c":140 -1 > (nil)) > > That's easy, we don't have a sbcs <scratch>, r1, r2 pattern. A quick workaround is > to create a temporary for operand[2] (if before reload) so it will match the standard > sbcs pattern, and then the split works fine. > >> So it is certainly possible, but not really simple to improve the >> stack size even further. But I would prefer to do that in a >> separate patch. > > Yes separate patches would be fine. However there is a lot of scope to improve this > further. For example after your patch shifts and logical operations are expanded in > expand, add/sub are in split1 after combine runs and everything else is split after > reload. It doesn't make sense to split different operations at different times - it means > you're still going to get the bad DImode subregs and miss lots of optimization > opportunities due to the mix of partly split and partly not-yet-split operations. > Yes. I did the add/sub differently because it was more easy this way, and it was simply sufficient to make the existing test cases happy. 
Also, the biggest benefit was IIRC from the very early splitting of the anddi/iordi/xordi patterns, because they have completely separate data flow in the low and high parts. That is not the case for the arithmetic patterns, but they can still be optimized, preferably when a new test case is found that demonstrates an improvement.

I am not sure why the cmpdi patterns have an influence at all, because from the data flow you need all 64 bits of both sides. Nevertheless it is a fact: with the modified test case I get 264 bytes frame size, where it was 1920 before.

I attached the completely untested follow-up patch now, but I would like to post that one again for review, after I applied my current patch, which is still waiting for final review (please feel pinged!).

This is really exciting...

Thanks
Bernd.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308-4.diff --]
[-- Type: text/x-patch; name="patch-pr77308-4.diff", Size: 9716 bytes --]

--- gcc/config/arm/arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ gcc/config/arm/arm.md	2016-11-30 16:40:30.140532737 +0100
@@ -4738,7 +4738,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_ARM"
   "#"   ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
  [(parallel [(set (reg:CC CC_REGNUM)
		   (compare:CC (const_int 0) (match_dup 1)))
	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
  [(set (reg:CC CC_REGNUM)
        (compare:CC (match_dup 0) (match_dup 1)))
   (parallel [(set (reg:CC CC_REGNUM)
@@ -7456,7 +7456,10 @@
       operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
     }
   operands[1] = gen_lowpart (SImode, operands[1]);
-  operands[2] = gen_lowpart (SImode, operands[2]);
+  if
(can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); + else + operands[2] = gen_lowpart (SImode, operands[2]); } [(set_attr "conds" "set") (set_attr "length" "8") @@ -7470,7 +7473,7 @@ "TARGET_32BIT" "#" ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 2) (match_dup 3))) (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0)) --- gcc/config/arm/thumb2.md.orig 2016-11-30 16:57:44.760589624 +0100 +++ gcc/config/arm/thumb2.md 2016-11-30 16:58:05.310590754 +0100 @@ -132,7 +132,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_THUMB2" "#" ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1 - "&& reload_completed" + "&& (!TARGET_NEON || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) --- /dev/null 2016-11-30 15:23:46.779473644 +0100 +++ gcc/testsuite/gcc.target/arm/pr77308-2.c 2016-11-30 17:05:21.021614711 +0100 @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks. + It improves the test coverage of cmpdi and negdi2 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. 
*/ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + 
U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x)) +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x)) +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x)) +#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? 
-(x) : (x)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] = PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, 
h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 28+ messages in thread
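Bernd's remark earlier in this email — that anddi/iordi/xordi have completely separate data flow in the low and high parts — can be sketched in plain C. This is an illustrative model with a hypothetical name, not the GCC expander code:

```c
#include <stdint.h>

/* Unlike add/sub, the two halves of a 64-bit bitwise operation share
   no carry or flag, so each 32-bit half can be computed and register-
   allocated independently.  This is why anddi3/iordi3/xordi3 can be
   split already at expand time with the biggest stack-usage benefit.  */
static uint64_t
anddi3_model (uint64_t a, uint64_t b)
{
  uint32_t lo = (uint32_t) a & (uint32_t) b;                  /* low word  */
  uint32_t hi = (uint32_t) (a >> 32) & (uint32_t) (b >> 32); /* high word */
  return ((uint64_t) hi << 32) | lo;
}
```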
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-30 17:01 ` Bernd Edlinger @ 2016-12-08 19:50 ` Bernd Edlinger 2017-01-11 16:55 ` Richard Earnshaw (lists) 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-12-08 19:50 UTC (permalink / raw) To: Wilco Dijkstra, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, nd

[-- Attachment #1: Type: text/plain, Size: 3501 bytes --]

Hi Wilco,

On 11/30/16 18:01, Bernd Edlinger wrote:
> I attached the completely untested follow-up patch now, but I would
> like to post that one again for review, after I applied my current
> patch, which is still waiting for final review (please feel pinged!).
>
>
> This is really exciting...
>
>

When testing the follow-up patch I discovered a single regression in gcc.dg/fixed-point/convert-sat.c that was caused by a mis-compilation of the libgcc function __gnu_satfractdasq. I think it triggered a latent bug in the carryin_compare patterns.

Everything is as expected until reload. First, here is what is left over of a split cmpdi_insn, followed by a former cmpdi_unsigned, if the branch is not taken:
(insn 109 10 110 2 (set (reg:CC 100 cc)
        (compare:CC (reg:SI 0 r0 [orig:124 _10 ] [124])
            (const_int 0 [0]))) "../../../gcc-trunk/libgcc/fixed-bit.c":785 196 {*arm_cmpsi_insn}
     (nil))
(insn 110 109 13 2 (parallel [
            (set (reg:CC 100 cc)
                (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125])
                    (const_int -1 [0xffffffffffffffff])))
            (set (reg:SI 3 r3 [123])
                (minus:SI (plus:SI (reg:SI 1 r1 [orig:125 _10+4 ] [125])
                        (const_int -1 [0xffffffffffffffff]))
                    (ltu:SI (reg:CC_C 100 cc)
                        (const_int 0 [0]))))
        ]) "../../../gcc-trunk/libgcc/fixed-bit.c":785 32 {*subsi3_carryin_compare_const}
     (nil))
(jump_insn 13 110 31 2 (set (pc)
        (if_then_else (ge (reg:CC_NCV 100 cc)
                (const_int 0 [0]))
            (label_ref:SI 102)
            (pc))) "../../../gcc-trunk/libgcc/fixed-bit.c":785 204 {arm_cond_branch}
     (int_list:REG_BR_PROB 6400 (nil))

(note 31 13 97 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(note 97 31 114 3 NOTE_INSN_DELETED)
(insn 114 97 113 3 (set (reg:SI 2 r2 [orig:127+4 ] [127])
        (const_int -1 [0xffffffffffffffff])) "../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp}
     (expr_list:REG_EQUIV (const_int -1 [0xffffffffffffffff])
        (nil)))
(insn 113 114 107 3 (set (reg:SI 3 r3 [126])
        (const_int 2147483647 [0x7fffffff])) "../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp}
     (expr_list:REG_EQUIV (const_int 2147483647 [0x7fffffff])
        (nil)))
(insn 107 113 108 3 (set (reg:CC 100 cc)
        (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125])
            (reg:SI 2 r2 [orig:127+4 ] [127]))) "../../../gcc-trunk/libgcc/fixed-bit.c":831 196 {*arm_cmpsi_insn}
     (nil))

Note that the CC register is not really set as implied by insn 110, because the C flag depends on the comparison of r1 with 0xFFFFFFFF and on the carry flag from insn 109. Therefore in the postreload pass insn 107 appears to be unnecessary, as it should compute exactly the same CC flags as insn 110, i.e. not dependent on a previous CC flag. I think all carryin_compare patterns are wrong, because they do not describe the true value of the CC reg.
I think the CC reg actually depends on the left and right operands and on the CC-in value; it must be described as a computation in DI mode without overflow, as in my new version of the patch.

I attached an update of the follow-up patch, which is not yet adjusted to your pending negdi patch. Reg-testing is not yet done, but the mis-compilation in libgcc is fixed at least.

What do you think?

Thanks
Bernd.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-pr77308-5.diff --]
[-- Type: text/x-patch; name="patch-pr77308-5.diff", Size: 15481 bytes --]

2016-12-08  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (subdi3_compare1, subsi3_carryin_compare,
	subsi3_carryin_compare_const, negdi2_compare): Fix the CC reg
	dataflow.
	(*arm_negdi2, *arm_cmpdi_unsigned): Split early except for
	TARGET_NEON and TARGET_IWMMXT.
	(*arm_cmpdi_insn): Split early except for TARGET_NEON and
	TARGET_IWMMXT.  Fix the CC reg dataflow.
	* config/arm/thumb2.md (*thumb2_negdi2): Split early except for
	TARGET_NEON and TARGET_IWMMXT.

testsuite:
2016-12-08  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-2.c: New test.
--- gcc/config/arm/arm.md.orig 2016-12-08 16:01:43.290595127 +0100 +++ gcc/config/arm/arm.md 2016-12-08 19:04:22.251065848 +0100 @@ -1086,8 +1086,8 @@ }) (define_insn_and_split "subdi3_compare1" - [(set (reg:CC CC_REGNUM) - (compare:CC + [(set (reg:CC_NCV CC_REGNUM) + (compare:CC_NCV (match_operand:DI 1 "register_operand" "r") (match_operand:DI 2 "register_operand" "r"))) (set (match_operand:DI 0 "register_operand" "=&r") @@ -1098,10 +1098,15 @@ [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) (match_dup 2))) (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))]) - (parallel [(set (reg:CC CC_REGNUM) - (compare:CC (match_dup 4) (match_dup 5))) - (set (match_dup 3) (minus:SI (minus:SI (match_dup 4) (match_dup 5)) - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] + (parallel [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (zero_extend:DI (match_dup 4)) + (plus:DI + (zero_extend:DI (match_dup 5)) + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) + (set (match_dup 3) + (minus:SI (minus:SI (match_dup 4) (match_dup 5)) + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] { operands[3] = gen_highpart (SImode, operands[0]); operands[0] = gen_lowpart (SImode, operands[0]); @@ -1156,13 +1161,15 @@ ) (define_insn "*subsi3_carryin_compare" - [(set (reg:CC CC_REGNUM) - (compare:CC (match_operand:SI 1 "s_register_operand" "r") - (match_operand:SI 2 "s_register_operand" "r"))) + [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r")) + (plus:DI + (zero_extend:DI (match_operand:SI 2 "s_register_operand" "r")) + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) (set (match_operand:SI 0 "s_register_operand" "=r") - (minus:SI (minus:SI (match_dup 1) - (match_dup 2)) - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] + (minus:SI (minus:SI (match_dup 1) (match_dup 2)) + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] "TARGET_32BIT" "sbcs\\t%0, %1, %2" [(set_attr "conds" "set") @@ -1170,12 +1177,14 @@ ) (define_insn 
"*subsi3_carryin_compare_const" - [(set (reg:CC CC_REGNUM) - (compare:CC (match_operand:SI 1 "reg_or_int_operand" "r") - (match_operand:SI 2 "arm_not_operand" "K"))) + [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (zero_extend:DI (match_operand:SI 1 "reg_or_int_operand" "r")) + (plus:DI + (zero_extend:DI (match_operand:SI 2 "arm_not_operand" "K")) + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) (set (match_operand:SI 0 "s_register_operand" "=r") - (minus:SI (plus:SI (match_dup 1) - (match_dup 2)) + (minus:SI (plus:SI (match_dup 1) (match_dup 2)) (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] "TARGET_32BIT" "sbcs\\t%0, %1, #%B2" @@ -4684,8 +4693,8 @@ (define_insn_and_split "negdi2_compare" - [(set (reg:CC CC_REGNUM) - (compare:CC + [(set (reg:CC_NCV CC_REGNUM) + (compare:CC_NCV (const_int 0) (match_operand:DI 1 "register_operand" "0,r"))) (set (match_operand:DI 0 "register_operand" "=r,&r") @@ -4697,8 +4706,12 @@ (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) - (parallel [(set (reg:CC CC_REGNUM) - (compare:CC (const_int 0) (match_dup 3))) + (parallel [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (const_int 0) + (plus:DI + (zero_extend:DI (match_dup 3)) + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) (set (match_dup 2) (minus:SI (minus:SI (const_int 0) (match_dup 3)) @@ -4738,7 +4751,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_ARM" "#" ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) @@ -4756,12 +4769,14 @@ ) (define_insn "*negsi2_carryin_compare" - [(set (reg:CC CC_REGNUM) - (compare:CC (const_int 0) - (match_operand:SI 1 "s_register_operand" "r"))) + [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (const_int 0) + (plus:DI + (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r")) + (ltu:DI 
(reg:CC_C CC_REGNUM) (const_int 0))))) (set (match_operand:SI 0 "s_register_operand" "=r") - (minus:SI (minus:SI (const_int 0) - (match_dup 1)) + (minus:SI (minus:SI (const_int 0) (match_dup 1)) (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] "TARGET_ARM" "rscs\\t%0, %1, #0" @@ -7432,14 +7447,17 @@ (clobber (match_scratch:SI 2 "=r"))] "TARGET_32BIT" "#" ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) - (compare:CC (match_dup 0) (match_dup 1))) - (parallel [(set (reg:CC CC_REGNUM) - (compare:CC (match_dup 3) (match_dup 4))) - (set (match_dup 2) - (minus:SI (match_dup 5) - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] + (compare:CC (match_dup 0) (match_dup 1))) + (parallel [(set (reg:CC_C CC_REGNUM) + (compare:CC_C + (zero_extend:DI (match_dup 3)) + (plus:DI (zero_extend:DI (match_dup 4)) + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) + (set (match_dup 2) + (minus:SI (match_dup 5) + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] { operands[3] = gen_highpart (SImode, operands[0]); operands[0] = gen_lowpart (SImode, operands[0]); @@ -7456,7 +7474,10 @@ operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]); } operands[1] = gen_lowpart (SImode, operands[1]); - operands[2] = gen_lowpart (SImode, operands[2]); + if (can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); + else + operands[2] = gen_lowpart (SImode, operands[2]); } [(set_attr "conds" "set") (set_attr "length" "8") @@ -7470,7 +7491,7 @@ "TARGET_32BIT" "#" ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1" - "&& reload_completed" + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 2) (match_dup 3))) (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0)) --- gcc/config/arm/thumb2.md.orig 2016-12-08 16:00:59.017597265 +0100 +++ gcc/config/arm/thumb2.md 2016-12-08 16:02:38.591592456 +0100 @@ -132,7 +132,7 @@ (clobber (reg:CC CC_REGNUM))] 
"TARGET_THUMB2" "#" ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1 - "&& reload_completed" + "&& (!TARGET_NEON || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (const_int 0) (match_dup 1))) (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) --- /dev/null 2016-12-08 15:50:45.426271450 +0100 +++ gcc/testsuite/gcc.target/arm/pr77308-2.c 2016-12-08 16:02:38.591592456 +0100 @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks. + It improves the test coverage of cmpdi and negdi2 patterns. + Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. */ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + 
U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec), + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), + U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) +}; + +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x)) +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x)) +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x)) +#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? 
-(x) : (x)) +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) + +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ + h = Sigma0(a) + Maj(a,b,c); \ + d += T1; h += T1; } while (0) +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, + unsigned int num) +{ + const SHA_LONG64 *W = in; + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; + SHA_LONG64 X[16]; + int i; + + while (num--) { + + a = ctx->h[0]; + b = ctx->h[1]; + c = ctx->h[2]; + d = ctx->h[3]; + e = ctx->h[4]; + f = ctx->h[5]; + g = ctx->h[6]; + h = ctx->h[7]; + + T1 = X[0] = PULL64(W[0]); + ROUND_00_15(0, a, b, c, d, e, f, g, h); + T1 = X[1] = PULL64(W[1]); + ROUND_00_15(1, h, a, b, c, d, e, f, g); + T1 = X[2] = PULL64(W[2]); + ROUND_00_15(2, g, h, a, b, c, d, e, f); + T1 = X[3] = PULL64(W[3]); + ROUND_00_15(3, f, g, h, a, b, c, d, e); + T1 = X[4] = PULL64(W[4]); + ROUND_00_15(4, e, f, g, h, a, b, c, d); + T1 = X[5] = PULL64(W[5]); + ROUND_00_15(5, d, e, f, g, h, a, b, c); + T1 = X[6] = PULL64(W[6]); + ROUND_00_15(6, c, d, e, f, g, h, a, b); + T1 = X[7] = PULL64(W[7]); + ROUND_00_15(7, b, c, d, e, f, g, h, a); + T1 = X[8] = PULL64(W[8]); + ROUND_00_15(8, a, b, c, d, e, f, g, h); + T1 = X[9] = PULL64(W[9]); + ROUND_00_15(9, h, a, b, c, d, e, f, g); + T1 = X[10] = PULL64(W[10]); + ROUND_00_15(10, g, h, a, b, c, d, e, f); + T1 = X[11] = PULL64(W[11]); + ROUND_00_15(11, f, g, h, a, b, c, d, e); + T1 = X[12] = PULL64(W[12]); + ROUND_00_15(12, e, f, g, h, a, b, c, d); + T1 = X[13] = PULL64(W[13]); + ROUND_00_15(13, d, e, f, g, h, a, b, c); + T1 = X[14] = PULL64(W[14]); + ROUND_00_15(14, c, d, e, f, g, h, a, b); + T1 = X[15] = PULL64(W[15]); + ROUND_00_15(15, b, c, d, e, f, g, 
h, a); + + for (i = 16; i < 80; i += 16) { + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); + } + + ctx->h[0] += a; + ctx->h[1] += b; + ctx->h[2] += c; + ctx->h[3] += d; + ctx->h[4] += e; + ctx->h[5] += f; + ctx->h[6] += g; + ctx->h[7] += h; + + W += SHA_LBLOCK; + } +} ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-12-08 19:50 ` Bernd Edlinger @ 2017-01-11 16:55 ` Richard Earnshaw (lists) 2017-01-11 17:19 ` Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Richard Earnshaw (lists) @ 2017-01-11 16:55 UTC (permalink / raw) To: Bernd Edlinger, Wilco Dijkstra, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, nd On 08/12/16 19:50, Bernd Edlinger wrote: > Hi Wilco, > > On 11/30/16 18:01, Bernd Edlinger wrote: >> I attached the completely untested follow-up patch now, but I would >> like to post that one again for review, after I applied my current >> patch, which is still waiting for final review (please feel pinged!). >> >> >> This is really exciting... >> >> > > > When testing the follow-up patch I discovered a single regression > in gcc.dg/fixed-point/convert-sat.c that was caused by a mis-compilation > of the libgcc function __gnu_satfractdasq. > > I think it triggered a latent bug in the carryin_compare patterns. > > Everything is as expected until reload. First comes what is left over > of a split cmpdi_insn, followed by a former cmpdi_unsigned if the > branch is not taken. 
> > (insn 109 10 110 2 (set (reg:CC 100 cc) > (compare:CC (reg:SI 0 r0 [orig:124 _10 ] [124]) > (const_int 0 [0]))) > "../../../gcc-trunk/libgcc/fixed-bit.c":785 196 {*arm_cmpsi_insn} > (nil)) > (insn 110 109 13 2 (parallel [ > (set (reg:CC 100 cc) > (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125]) > (const_int -1 [0xffffffffffffffff]))) > (set (reg:SI 3 r3 [123]) > (minus:SI (plus:SI (reg:SI 1 r1 [orig:125 _10+4 ] [125]) > (const_int -1 [0xffffffffffffffff])) > (ltu:SI (reg:CC_C 100 cc) > (const_int 0 [0])))) > ]) "../../../gcc-trunk/libgcc/fixed-bit.c":785 32 > {*subsi3_carryin_compare_const} > (nil)) > (jump_insn 13 110 31 2 (set (pc) > (if_then_else (ge (reg:CC_NCV 100 cc) > (const_int 0 [0])) > (label_ref:SI 102) > (pc))) "../../../gcc-trunk/libgcc/fixed-bit.c":785 204 > {arm_cond_branch} > (int_list:REG_BR_PROB 6400 (nil)) > > (note 31 13 97 3 [bb 3] NOTE_INSN_BASIC_BLOCK) > (note 97 31 114 3 NOTE_INSN_DELETED) > (insn 114 97 113 3 (set (reg:SI 2 r2 [orig:127+4 ] [127]) > (const_int -1 [0xffffffffffffffff])) > "../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp} > (expr_list:REG_EQUIV (const_int -1 [0xffffffffffffffff]) > (nil))) > (insn 113 114 107 3 (set (reg:SI 3 r3 [126]) > (const_int 2147483647 [0x7fffffff])) > "../../../gcc-trunk/libgcc/fixed-bit.c":831 630 {*arm_movsi_vfp} > (expr_list:REG_EQUIV (const_int 2147483647 [0x7fffffff]) > (nil))) > (insn 107 113 108 3 (set (reg:CC 100 cc) > (compare:CC (reg:SI 1 r1 [orig:125 _10+4 ] [125]) > (reg:SI 2 r2 [orig:127+4 ] [127]))) > "../../../gcc-trunk/libgcc/fixed-bit.c":831 196 {*arm_cmpsi_insn} > (nil)) > > > Note that the CC register is not really set as implied by insn 110, > because the C flag depends on the comparison of r1, 0xFFFF and the > carry flag from insn 109. Therefore in the postreload pass the > insn 107 appears to be unnecessary, as it should compute > exactly the same CC flags as insn 110, i.e. not dependent on > the previous CC flags. 
I think all carryin_compare patterns are > wrong because they do not describe the true value of the CC reg. > > I think the CC reg actually depends on the left, right and CC-in > values; it must be described as a computation in DI mode without > overflow, as in my new version of the patch. > > I attached an update of the follow-up patch which is not yet adjusted > to your pending negdi patch. Reg-testing is not yet done, but the > mis-compilation in libgcc is fixed at least. > > What do you think? Sorry for the delay getting around to this. I just tried this patch and found that it doesn't apply. Furthermore, there's not enough context in the rejected hunks for me to be certain which patterns you're trying to fix up. Could you do an update please? R. > > > Thanks > Bernd. > > > patch-pr77308-5.diff > > > 2016-12-08 Bernd Edlinger <bernd.edlinger@hotmail.de> > > PR target/77308 > * config/arm/arm.md (subdi3_compare1, subsi3_carryin_compare, > subsi3_carryin_compare_const, negdi2_compare): Fix the CC reg dataflow. > (*arm_negdi2, *arm_cmpdi_unsigned): Split early except for > TARGET_NEON and TARGET_IWMMXT. > (*arm_cmpdi_insn): Split early except for > TARGET_NEON and TARGET_IWMMXT. Fix the CC reg dataflow. > * config/arm/thumb2.md (*thumb2_negdi2): Split early except for > TARGET_NEON and TARGET_IWMMXT. > > testsuite: > 2016-12-08 Bernd Edlinger <bernd.edlinger@hotmail.de> > > PR target/77308 > * gcc.target/arm/pr77308-2.c: New test. 
> > --- gcc/config/arm/arm.md.orig 2016-12-08 16:01:43.290595127 +0100 > +++ gcc/config/arm/arm.md 2016-12-08 19:04:22.251065848 +0100 > @@ -1086,8 +1086,8 @@ > }) > > (define_insn_and_split "subdi3_compare1" > - [(set (reg:CC CC_REGNUM) > - (compare:CC > + [(set (reg:CC_NCV CC_REGNUM) > + (compare:CC_NCV > (match_operand:DI 1 "register_operand" "r") > (match_operand:DI 2 "register_operand" "r"))) > (set (match_operand:DI 0 "register_operand" "=&r") > @@ -1098,10 +1098,15 @@ > [(parallel [(set (reg:CC CC_REGNUM) > (compare:CC (match_dup 1) (match_dup 2))) > (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))]) > - (parallel [(set (reg:CC CC_REGNUM) > - (compare:CC (match_dup 4) (match_dup 5))) > - (set (match_dup 3) (minus:SI (minus:SI (match_dup 4) (match_dup 5)) > - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] > + (parallel [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (zero_extend:DI (match_dup 4)) > + (plus:DI > + (zero_extend:DI (match_dup 5)) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > + (set (match_dup 3) > + (minus:SI (minus:SI (match_dup 4) (match_dup 5)) > + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] > { > operands[3] = gen_highpart (SImode, operands[0]); > operands[0] = gen_lowpart (SImode, operands[0]); > @@ -1156,13 +1161,15 @@ > ) > > (define_insn "*subsi3_carryin_compare" > - [(set (reg:CC CC_REGNUM) > - (compare:CC (match_operand:SI 1 "s_register_operand" "r") > - (match_operand:SI 2 "s_register_operand" "r"))) > + [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r")) > + (plus:DI > + (zero_extend:DI (match_operand:SI 2 "s_register_operand" "r")) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > (set (match_operand:SI 0 "s_register_operand" "=r") > - (minus:SI (minus:SI (match_dup 1) > - (match_dup 2)) > - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] > + (minus:SI (minus:SI (match_dup 1) (match_dup 2)) > + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] > 
"TARGET_32BIT" > "sbcs\\t%0, %1, %2" > [(set_attr "conds" "set") > @@ -1170,12 +1177,14 @@ > ) > > (define_insn "*subsi3_carryin_compare_const" > - [(set (reg:CC CC_REGNUM) > - (compare:CC (match_operand:SI 1 "reg_or_int_operand" "r") > - (match_operand:SI 2 "arm_not_operand" "K"))) > + [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (zero_extend:DI (match_operand:SI 1 "reg_or_int_operand" "r")) > + (plus:DI > + (zero_extend:DI (match_operand:SI 2 "arm_not_operand" "K")) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > (set (match_operand:SI 0 "s_register_operand" "=r") > - (minus:SI (plus:SI (match_dup 1) > - (match_dup 2)) > + (minus:SI (plus:SI (match_dup 1) (match_dup 2)) > (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] > "TARGET_32BIT" > "sbcs\\t%0, %1, #%B2" > @@ -4684,8 +4693,8 @@ > > > (define_insn_and_split "negdi2_compare" > - [(set (reg:CC CC_REGNUM) > - (compare:CC > + [(set (reg:CC_NCV CC_REGNUM) > + (compare:CC_NCV > (const_int 0) > (match_operand:DI 1 "register_operand" "0,r"))) > (set (match_operand:DI 0 "register_operand" "=r,&r") > @@ -4697,8 +4706,12 @@ > (compare:CC (const_int 0) (match_dup 1))) > (set (match_dup 0) (minus:SI (const_int 0) > (match_dup 1)))]) > - (parallel [(set (reg:CC CC_REGNUM) > - (compare:CC (const_int 0) (match_dup 3))) > + (parallel [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (const_int 0) > + (plus:DI > + (zero_extend:DI (match_dup 3)) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > (set (match_dup 2) > (minus:SI > (minus:SI (const_int 0) (match_dup 3)) > @@ -4738,7 +4751,7 @@ > (clobber (reg:CC CC_REGNUM))] > "TARGET_ARM" > "#" ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0" > - "&& reload_completed" > + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" > [(parallel [(set (reg:CC CC_REGNUM) > (compare:CC (const_int 0) (match_dup 1))) > (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) > @@ -4756,12 +4769,14 @@ > ) > > (define_insn "*negsi2_carryin_compare" > - [(set (reg:CC CC_REGNUM) > - 
(compare:CC (const_int 0) > - (match_operand:SI 1 "s_register_operand" "r"))) > + [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (const_int 0) > + (plus:DI > + (zero_extend:DI (match_operand:SI 1 "s_register_operand" "r")) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > (set (match_operand:SI 0 "s_register_operand" "=r") > - (minus:SI (minus:SI (const_int 0) > - (match_dup 1)) > + (minus:SI (minus:SI (const_int 0) (match_dup 1)) > (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))] > "TARGET_ARM" > "rscs\\t%0, %1, #0" > @@ -7432,14 +7447,17 @@ > (clobber (match_scratch:SI 2 "=r"))] > "TARGET_32BIT" > "#" ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1" > - "&& reload_completed" > + "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" > [(set (reg:CC CC_REGNUM) > - (compare:CC (match_dup 0) (match_dup 1))) > - (parallel [(set (reg:CC CC_REGNUM) > - (compare:CC (match_dup 3) (match_dup 4))) > - (set (match_dup 2) > - (minus:SI (match_dup 5) > - (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] > + (compare:CC (match_dup 0) (match_dup 1))) > + (parallel [(set (reg:CC_C CC_REGNUM) > + (compare:CC_C > + (zero_extend:DI (match_dup 3)) > + (plus:DI (zero_extend:DI (match_dup 4)) > + (ltu:DI (reg:CC_C CC_REGNUM) (const_int 0))))) > + (set (match_dup 2) > + (minus:SI (match_dup 5) > + (ltu:SI (reg:CC_C CC_REGNUM) (const_int 0))))])] > { > operands[3] = gen_highpart (SImode, operands[0]); > operands[0] = gen_lowpart (SImode, operands[0]); > @@ -7456,7 +7474,10 @@ > operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]); > } > operands[1] = gen_lowpart (SImode, operands[1]); > - operands[2] = gen_lowpart (SImode, operands[2]); > + if (can_create_pseudo_p ()) > + operands[2] = gen_reg_rtx (SImode); > + else > + operands[2] = gen_lowpart (SImode, operands[2]); > } > [(set_attr "conds" "set") > (set_attr "length" "8") > @@ -7470,7 +7491,7 @@ > > "TARGET_32BIT" > "#" ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1" > - "&& reload_completed" > + "&& ((!TARGET_NEON && 
!TARGET_IWMMXT) || reload_completed)" > [(set (reg:CC CC_REGNUM) > (compare:CC (match_dup 2) (match_dup 3))) > (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0)) > --- gcc/config/arm/thumb2.md.orig 2016-12-08 16:00:59.017597265 +0100 > +++ gcc/config/arm/thumb2.md 2016-12-08 16:02:38.591592456 +0100 > @@ -132,7 +132,7 @@ > (clobber (reg:CC CC_REGNUM))] > "TARGET_THUMB2" > "#" ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1 > - "&& reload_completed" > + "&& (!TARGET_NEON || reload_completed)" > [(parallel [(set (reg:CC CC_REGNUM) > (compare:CC (const_int 0) (match_dup 1))) > (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))]) > --- /dev/null 2016-12-08 15:50:45.426271450 +0100 > +++ gcc/testsuite/gcc.target/arm/pr77308-2.c 2016-12-08 16:02:38.591592456 +0100 > @@ -0,0 +1,169 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Os -Wstack-usage=2500" } */ > + > +/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks. > + It improves the test coverage of cmpdi and negdi2 patterns. > + Unlike the original test case these insns can reach the reload pass, > + which may result in large stack usage. 
*/ > + > +#define SHA_LONG64 unsigned long long > +#define U64(C) C##ULL > + > +#define SHA_LBLOCK 16 > +#define SHA512_CBLOCK (SHA_LBLOCK*8) > + > +typedef struct SHA512state_st { > + SHA_LONG64 h[8]; > + SHA_LONG64 Nl, Nh; > + union { > + SHA_LONG64 d[SHA_LBLOCK]; > + unsigned char p[SHA512_CBLOCK]; > + } u; > + unsigned int num, md_len; > +} SHA512_CTX; > + > +static const SHA_LONG64 K512[80] = { > + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), > + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), > + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), > + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), > + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), > + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), > + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), > + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), > + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), > + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), > + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), > + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), > + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), > + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), > + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), > + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), > + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), > + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), > + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), > + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), > + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), > + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), > + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), > + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), > + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), > + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), > + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), > + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), > + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), > + U64(0x84c87814a1f0ab72), 
U64(0x8cc702081a6439ec), > + U64(0x90befffa23631e28), U64(0xa4506cebde82bde9), > + U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b), > + U64(0xca273eceea26619c), U64(0xd186b8c721c0c207), > + U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178), > + U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6), > + U64(0x113f9804bef90dae), U64(0x1b710b35131c471b), > + U64(0x28db77f523047d84), U64(0x32caab7b40c72493), > + U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c), > + U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a), > + U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817) > +}; > + > +#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8)) > +#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7)) > +#define ROTR(x,s) (((x)>>s) | (x)<<(64-s)) > +#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x)) > +#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x)) > +#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x)) > +#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? 
-(x) : (x)) > +#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z))) > +#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z))) > + > +#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \ > + T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \ > + h = Sigma0(a) + Maj(a,b,c); \ > + d += T1; h += T1; } while (0) > +#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \ > + s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \ > + s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \ > + T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \ > + ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0) > +void sha512_block_data_order(SHA512_CTX *ctx, const void *in, > + unsigned int num) > +{ > + const SHA_LONG64 *W = in; > + SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1; > + SHA_LONG64 X[16]; > + int i; > + > + while (num--) { > + > + a = ctx->h[0]; > + b = ctx->h[1]; > + c = ctx->h[2]; > + d = ctx->h[3]; > + e = ctx->h[4]; > + f = ctx->h[5]; > + g = ctx->h[6]; > + h = ctx->h[7]; > + > + T1 = X[0] = PULL64(W[0]); > + ROUND_00_15(0, a, b, c, d, e, f, g, h); > + T1 = X[1] = PULL64(W[1]); > + ROUND_00_15(1, h, a, b, c, d, e, f, g); > + T1 = X[2] = PULL64(W[2]); > + ROUND_00_15(2, g, h, a, b, c, d, e, f); > + T1 = X[3] = PULL64(W[3]); > + ROUND_00_15(3, f, g, h, a, b, c, d, e); > + T1 = X[4] = PULL64(W[4]); > + ROUND_00_15(4, e, f, g, h, a, b, c, d); > + T1 = X[5] = PULL64(W[5]); > + ROUND_00_15(5, d, e, f, g, h, a, b, c); > + T1 = X[6] = PULL64(W[6]); > + ROUND_00_15(6, c, d, e, f, g, h, a, b); > + T1 = X[7] = PULL64(W[7]); > + ROUND_00_15(7, b, c, d, e, f, g, h, a); > + T1 = X[8] = PULL64(W[8]); > + ROUND_00_15(8, a, b, c, d, e, f, g, h); > + T1 = X[9] = PULL64(W[9]); > + ROUND_00_15(9, h, a, b, c, d, e, f, g); > + T1 = X[10] = PULL64(W[10]); > + ROUND_00_15(10, g, h, a, b, c, d, e, f); > + T1 = X[11] = PULL64(W[11]); > + ROUND_00_15(11, f, g, h, a, b, c, d, e); > + T1 = X[12] = PULL64(W[12]); > + ROUND_00_15(12, e, f, g, h, a, b, c, d); > + T1 = X[13] = PULL64(W[13]); > + ROUND_00_15(13, d, e, f, g, h, a, b, c); > + T1 = X[14] = 
PULL64(W[14]); > + ROUND_00_15(14, c, d, e, f, g, h, a, b); > + T1 = X[15] = PULL64(W[15]); > + ROUND_00_15(15, b, c, d, e, f, g, h, a); > + > + for (i = 16; i < 80; i += 16) { > + ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X); > + ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X); > + ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X); > + ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X); > + ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X); > + ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X); > + ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X); > + ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X); > + ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X); > + ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X); > + ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X); > + ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X); > + ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X); > + ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X); > + ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X); > + ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X); > + } > + > + ctx->h[0] += a; > + ctx->h[1] += b; > + ctx->h[2] += c; > + ctx->h[3] += d; > + ctx->h[4] += e; > + ctx->h[5] += f; > + ctx->h[6] += g; > + ctx->h[7] += h; > + > + W += SHA_LBLOCK; > + } > +} > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-01-11 16:55 ` Richard Earnshaw (lists) @ 2017-01-11 17:19 ` Bernd Edlinger 0 siblings, 0 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-01-11 17:19 UTC (permalink / raw) To: Richard Earnshaw (lists), Wilco Dijkstra, Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, nd On 01/11/17 17:55, Richard Earnshaw (lists) wrote: > > Sorry for the delay getting around to this. > > I just tried this patch and found that it doesn't apply. Furthermore, > there's not enough context in the rejected hunks for me to be certain > which patterns you're trying to fix up. > > Could you do an update please? > Sure, I just gave up pinging, as we are rather late in stage 3 already... So the current status is this: I have the invalid code issue here; it is independent of the optimization issues: [PATCH, ARM] correctly encode the CC reg data flow https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html Then I have the patch for splitting the most important 64bit patterns here: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html and the follow-up patch that triggered the invalid code here: [PATCH, ARM] Further improve stack usage in sha512, part 2 (PR 77308) https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01563.html In the last part I initially had this hunk: - operands[2] = gen_lowpart (SImode, operands[2]); + if (can_create_pseudo_p ()) + operands[2] = gen_reg_rtx (SImode); + else + operands[2] = gen_lowpart (SImode, operands[2]); Since Wilco pointed out that the else part is superfluous, I have already removed the gen_lowpart fallback locally. All three parts should apply to trunk; only the last part depends on both earlier patches. Thanks Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-28 19:42 ` Bernd Edlinger [not found] ` <VI1PR0802MB2621FFBFA3252B40E5978C9F838D0@VI1PR0802MB2621.eurprd08.prod.outlook.com> @ 2017-04-29 19:17 ` Bernd Edlinger 2017-05-12 16:51 ` [PING**3] " Bernd Edlinger 2017-09-04 14:52 ` [PING**2] " Kyrill Tkachov 1 sibling, 2 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-04-29 19:17 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 3455 bytes --] Ping... I attached a rebased version since there was a merge conflict in the xordi3 pattern, otherwise the patch is still identical. It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 early when the target has no neon or iwmmxt. Thanks Bernd. On 11/28/16 20:42, Bernd Edlinger wrote: > On 11/25/16 12:30, Ramana Radhakrishnan wrote: >> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >> <bernd.edlinger@hotmail.de> wrote: >>> Hi! >>> >>> This improves the stack usage on the sha512 test case for the case >>> without hardware fpu and without iwmmxt by splitting all di-mode >>> patterns right while expanding which is similar to what the >>> shift-pattern >>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as >>> thumb1. >>> >> >> I would go further and do this in the absence of Neon, the VFP unit >> being there doesn't help with DImode operations i.e. we do not have 64 >> bit integer arithmetic instructions without Neon. The main reason why >> we have the DImode patterns split so late is to give a chance for >> folks who want to do 64 bit arithmetic in Neon a chance to make this >> work as well as support some of the 64 bit Neon intrinsics which IIRC >> map down to these instructions. Doing this just for soft-float doesn't >> improve the default case only. 
I don't usually test iwmmxt and I'm not >> sure who has the ability to do so, thus keeping this restriction for >> iwMMX is fine. >> >> > > Yes I understand, thanks for pointing that out. > > I was not aware that iwmmxt exists at all, but I noticed that most > 64bit expansions work completely differently, and would break if we split > the pattern early. > > I can however only look at the assembler output for iwmmxt, and make > sure that the stack usage does not get worse. > > Thus the new version of the patch keeps only thumb1, neon and iwmmxt as > it is: around 1570 (thumb1), 2300 (neon) and 2200 (iwmmxt) bytes stack > for the test cases, and vfp and soft-float at around 270 bytes stack > usage. > >>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>> >>> Note this also splits many ldrd/strd instructions and therefore I will >>> post a followup-patch that mitigates this effect by enabling the >>> ldrd/strd >>> peephole optimization after the necessary reg-testing. >>> >>> >>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >> >> What do you mean by arm-linux-gnueabihf - when folks say that I >> interpret it as --with-arch=armv7-a --with-float=hard >> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >> >> If you've really bootstrapped and regtested it on armhf, doesn't this >> patch as it stands have no effect there i.e. no change ? >> arm-linux-gnueabihf usually means to me someone has configured with >> --with-float=hard, so there are no regressions in the hard float ABI >> case, >> > > I know it proves little. When I say arm-linux-gnueabihf > I do in fact mean --enable-languages=all,ada,go,obj-c++ > --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 > --with-float=hard. > > My main interest in the stack usage is of course not because of linux, > but because of eCos, where we have very small task stacks and in fact > no fpu support by the O/S at all, so that patch is exactly what we need. 
> > > Bootstrapped and reg-tested on arm-linux-gnueabihf > Is it OK for trunk? > > > Thanks > Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-2.diff --] [-- Type: text/x-patch; name="patch-pr77308-2.diff", Size: 12438 bytes --] 2017-04-26 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for TARGET_NEON and TARGET_IWMMXT. (anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for TARGET_NEON and TARGET_IWMMXT. (*one_cmpldi2_insn): Moved the body of one_cmpldi2 here. testsuite: 2017-04-26 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * gcc.target/arm/pr77308-1.c: New test. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 247222) +++ gcc/config/arm/arm.md (working copy) @@ -467,7 +467,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && reload_completed + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) && ! 
(TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) @@ -1273,7 +1273,7 @@ (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" - "&& reload_completed" + "&& (!TARGET_IWMMXT || reload_completed)" [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) (match_dup 2))) (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))]) @@ -2259,7 +2259,24 @@ (and:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_inv_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (AND, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (AND, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " ) (define_insn_and_split "*anddi3_insn" @@ -3132,7 +3149,24 @@ (ior:DI (match_operand:DI 1 "s_register_operand" "") (match_operand:DI 2 "neon_logic_op2" "")))] "TARGET_32BIT" - "" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (IOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (IOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " ) (define_insn_and_split "*iordi3_insn" @@ -3320,6 +3354,22 @@ no NEON instructions that take an immediate. 
*/ if (TARGET_IWMMXT && !REG_P (operands[2])) operands[2] = force_reg (DImode, operands[2]); + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_binary (XOR, SImode, + gen_lowpart (SImode, operands[1]), + gen_lowpart (SImode, operands[2])); + rtx high = simplify_gen_binary (XOR, SImode, + gen_highpart (SImode, operands[1]), + gen_highpart_mode (SImode, DImode, + operands[2])); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ } ) @@ -5031,7 +5081,31 @@ "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE" "") -(define_insn_and_split "one_cmpldi2" +(define_expand "one_cmpldi2" + [(set (match_operand:DI 0 "s_register_operand" "") + (not:DI (match_operand:DI 1 "s_register_operand" "")))] + "TARGET_32BIT" + " + if (!TARGET_NEON && !TARGET_IWMMXT) + { + rtx low = simplify_gen_unary (NOT, SImode, + gen_lowpart (SImode, operands[1]), + SImode); + rtx high = simplify_gen_unary (NOT, SImode, + gen_highpart_mode (SImode, DImode, + operands[1]), + SImode); + + emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low)); + emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high)); + + DONE; + } + /* Otherwise expand pattern as above. */ + " +) + +(define_insn_and_split "*one_cmpldi2_insn" [(set (match_operand:DI 0 "s_register_operand" "=w,&r,&r,?w") (not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))] "TARGET_32BIT" Index: gcc/testsuite/gcc.target/arm/pr77308-1.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr77308-1.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr77308-1.c (working copy) @@ -0,0 +1,169 @@ +/* { dg-do compile } */ +/* { dg-options "-Os -Wstack-usage=2500" } */ + +/* This is a modified algorithm with bit-not "~" at the Sigma-blocks. + It improves the test coverage of one_cmpldi2 and subdi3 patterns. 
+ Unlike the original test case these insns can reach the reload pass, + which may result in large stack usage. */ + +#define SHA_LONG64 unsigned long long +#define U64(C) C##ULL + +#define SHA_LBLOCK 16 +#define SHA512_CBLOCK (SHA_LBLOCK*8) + +typedef struct SHA512state_st { + SHA_LONG64 h[8]; + SHA_LONG64 Nl, Nh; + union { + SHA_LONG64 d[SHA_LBLOCK]; + unsigned char p[SHA512_CBLOCK]; + } u; + unsigned int num, md_len; +} SHA512_CTX; + +static const SHA_LONG64 K512[80] = { + U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd), + U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc), + U64(0x3956c25bf348b538), U64(0x59f111f1b605d019), + U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118), + U64(0xd807aa98a3030242), U64(0x12835b0145706fbe), + U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2), + U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1), + U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694), + U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3), + U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65), + U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483), + U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5), + U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210), + U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4), + U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725), + U64(0x06ca6351e003826f), U64(0x142929670a0e6e70), + U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926), + U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df), + U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8), + U64(0x81c2c92e47edaee6), U64(0x92722c851482353b), + U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001), + U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30), + U64(0xd192e819d6ef5218), U64(0xd69906245565a910), + U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8), + U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53), + U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8), + U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb), + U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3), + U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60), + 
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7))
+#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \
+    T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \
+    h = Sigma0(a) + Maj(a,b,c); \
+    d += T1; h += T1; } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \
+    s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \
+    s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \
+    T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \
+    ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0)
+
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                             unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PING**3] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-04-29 19:17 ` [PING**2] " Bernd Edlinger @ 2017-05-12 16:51 ` Bernd Edlinger 2017-06-01 16:01 ` [PING**4] " Bernd Edlinger [not found] ` <bd5e03b1-860f-dd16-2030-9ce0f9a94c7c@hotmail.de> 2017-09-04 14:52 ` [PING**2] " Kyrill Tkachov 1 sibling, 2 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-05-12 16:51 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra Ping... On 04/29/17 19:45, Bernd Edlinger wrote: > Ping... > > I attached a rebased version since there was a merge conflict in > the xordi3 pattern, otherwise the patch is still identical. > It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 > early when the target has no neon or iwmmxt. > > > Thanks > Bernd. > > > > On 11/28/16 20:42, Bernd Edlinger wrote: >> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>> <bernd.edlinger@hotmail.de> wrote: >>>> Hi! >>>> >>>> This improves the stack usage on the sha512 test case for the case >>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>> patterns right while expanding which is similar to what the >>>> shift-pattern >>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as >>>> well as >>>> thumb1. >>>> >>> >>> I would go further and do this in the absence of Neon, the VFP unit >>> being there doesn't help with DImode operations i.e. we do not have 64 >>> bit integer arithmetic instructions without Neon. The main reason why >>> we have the DImode patterns split so late is to give a chance for >>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>> map down to these instructions. Doing this just for soft-float doesn't >>> improve the default case only. 
I don't usually test iwmmxt and I'm not >>> sure who has the ability to do so, thus keeping this restriction for >>> iwMMX is fine. >>> >>> >> >> Yes I understand, thanks for pointing that out. >> >> I was not aware what iwmmxt exists at all, but I noticed that most >> 64bit expansions work completely different, and would break if we split >> the pattern early. >> >> I can however only look at the assembler outout for iwmmxt, and make >> sure that the stack usage does not get worse. >> >> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >> for the test cases, and vfp and soft-float at around 270 bytes stack >> usage. >> >>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>> >>>> Note this also splits many ldrd/strd instructions and therefore I will >>>> post a followup-patch that mitigates this effect by enabling the >>>> ldrd/strd >>>> peephole optimization after the necessary reg-testing. >>>> >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>> >>> What do you mean by arm-linux-gnueabihf - when folks say that I >>> interpret it as --with-arch=armv7-a --with-float=hard >>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>> >>> If you've really bootstrapped and regtested it on armhf, doesn't this >>> patch as it stand have no effect there i.e. no change ? >>> arm-linux-gnueabihf usually means to me someone has configured with >>> --with-float=hard, so there are no regressions in the hard float ABI >>> case, >>> >> >> I know it proves little. When I say arm-linux-gnueabihf >> I do in fact mean --enable-languages=all,ada,go,obj-c++ >> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >> --with-float=hard. >> >> My main interest in the stack usage is of course not because of linux, >> but because of eCos where we have very small task stacks and in fact >> no fpu support by the O/S at all, so that patch is exactly what we need. 
>> >> >> Bootstrapped and reg-tested on arm-linux-gnueabihf >> Is it OK for trunk? >> >> >> Thanks >> Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PING**4] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-05-12 16:51 ` [PING**3] " Bernd Edlinger @ 2017-06-01 16:01 ` Bernd Edlinger [not found] ` <bd5e03b1-860f-dd16-2030-9ce0f9a94c7c@hotmail.de> 1 sibling, 0 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-06-01 16:01 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra Ping... On 05/12/17 18:49, Bernd Edlinger wrote: > Ping... > > On 04/29/17 19:45, Bernd Edlinger wrote: >> Ping... >> >> I attached a rebased version since there was a merge conflict in >> the xordi3 pattern, otherwise the patch is still identical. >> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >> early when the target has no neon or iwmmxt. >> >> >> Thanks >> Bernd. >> >> >> >> On 11/28/16 20:42, Bernd Edlinger wrote: >>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>> <bernd.edlinger@hotmail.de> wrote: >>>>> Hi! >>>>> >>>>> This improves the stack usage on the sha512 test case for the case >>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>> patterns right while expanding which is similar to what the >>>>> shift-pattern >>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as >>>>> well as >>>>> thumb1. >>>>> >>>> >>>> I would go further and do this in the absence of Neon, the VFP unit >>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>> bit integer arithmetic instructions without Neon. The main reason why >>>> we have the DImode patterns split so late is to give a chance for >>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>> map down to these instructions. Doing this just for soft-float doesn't >>>> improve the default case only. 
I don't usually test iwmmxt and I'm not >>>> sure who has the ability to do so, thus keeping this restriction for >>>> iwMMX is fine. >>>> >>>> >>> >>> Yes I understand, thanks for pointing that out. >>> >>> I was not aware what iwmmxt exists at all, but I noticed that most >>> 64bit expansions work completely different, and would break if we split >>> the pattern early. >>> >>> I can however only look at the assembler outout for iwmmxt, and make >>> sure that the stack usage does not get worse. >>> >>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>> for the test cases, and vfp and soft-float at around 270 bytes stack >>> usage. >>> >>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>> >>>>> Note this also splits many ldrd/strd instructions and therefore I will >>>>> post a followup-patch that mitigates this effect by enabling the >>>>> ldrd/strd >>>>> peephole optimization after the necessary reg-testing. >>>>> >>>>> >>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>> >>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>> interpret it as --with-arch=armv7-a --with-float=hard >>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>> >>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>> patch as it stand have no effect there i.e. no change ? >>>> arm-linux-gnueabihf usually means to me someone has configured with >>>> --with-float=hard, so there are no regressions in the hard float ABI >>>> case, >>>> >>> >>> I know it proves little. When I say arm-linux-gnueabihf >>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>> --with-float=hard. 
>>> >>> My main interest in the stack usage is of course not because of linux, >>> but because of eCos where we have very small task stacks and in fact >>> no fpu support by the O/S at all, so that patch is exactly what we need. >>> >>> >>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>> Is it OK for trunk? >>> >>> >>> Thanks >>> Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <bd5e03b1-860f-dd16-2030-9ce0f9a94c7c@hotmail.de>]
* [PING**5] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) [not found] ` <bd5e03b1-860f-dd16-2030-9ce0f9a94c7c@hotmail.de> @ 2017-06-14 12:35 ` Bernd Edlinger [not found] ` <9a0fbb5d-9909-ef4d-6871-0cb4f7971bbb@hotmail.de> 1 sibling, 0 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-06-14 12:35 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra Ping... On 06/01/17 18:01, Bernd Edlinger wrote: > Ping... > > On 05/12/17 18:49, Bernd Edlinger wrote: >> Ping... >> >> On 04/29/17 19:45, Bernd Edlinger wrote: >>> Ping... >>> >>> I attached a rebased version since there was a merge conflict in >>> the xordi3 pattern, otherwise the patch is still identical. >>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >>> early when the target has no neon or iwmmxt. >>> >>> >>> Thanks >>> Bernd. >>> >>> >>> >>> On 11/28/16 20:42, Bernd Edlinger wrote: >>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>>> <bernd.edlinger@hotmail.de> wrote: >>>>>> Hi! >>>>>> >>>>>> This improves the stack usage on the sha512 test case for the case >>>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>>> patterns right while expanding which is similar to what the >>>>>> shift-pattern >>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as >>>>>> well as >>>>>> thumb1. >>>>>> >>>>> >>>>> I would go further and do this in the absence of Neon, the VFP unit >>>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>>> bit integer arithmetic instructions without Neon. The main reason why >>>>> we have the DImode patterns split so late is to give a chance for >>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>>> map down to these instructions. 
Doing this just for soft-float doesn't >>>>> improve the default case only. I don't usually test iwmmxt and I'm not >>>>> sure who has the ability to do so, thus keeping this restriction for >>>>> iwMMX is fine. >>>>> >>>>> >>>> >>>> Yes I understand, thanks for pointing that out. >>>> >>>> I was not aware what iwmmxt exists at all, but I noticed that most >>>> 64bit expansions work completely different, and would break if we split >>>> the pattern early. >>>> >>>> I can however only look at the assembler outout for iwmmxt, and make >>>> sure that the stack usage does not get worse. >>>> >>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>>> for the test cases, and vfp and soft-float at around 270 bytes stack >>>> usage. >>>> >>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>>> >>>>>> Note this also splits many ldrd/strd instructions and therefore I >>>>>> will >>>>>> post a followup-patch that mitigates this effect by enabling the >>>>>> ldrd/strd >>>>>> peephole optimization after the necessary reg-testing. >>>>>> >>>>>> >>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>>> >>>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>>> interpret it as --with-arch=armv7-a --with-float=hard >>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>>> >>>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>>> patch as it stand have no effect there i.e. no change ? >>>>> arm-linux-gnueabihf usually means to me someone has configured with >>>>> --with-float=hard, so there are no regressions in the hard float ABI >>>>> case, >>>>> >>>> >>>> I know it proves little. When I say arm-linux-gnueabihf >>>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>>> --with-float=hard. 
>>>> >>>> My main interest in the stack usage is of course not because of linux, >>>> but because of eCos where we have very small task stacks and in fact >>>> no fpu support by the O/S at all, so that patch is exactly what we >>>> need. >>>> >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>>> Is it OK for trunk? >>>> >>>> >>>> Thanks >>>> Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <9a0fbb5d-9909-ef4d-6871-0cb4f7971bbb@hotmail.de>]
* [PING**6] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) [not found] ` <9a0fbb5d-9909-ef4d-6871-0cb4f7971bbb@hotmail.de> @ 2017-07-05 18:14 ` Bernd Edlinger 0 siblings, 0 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-07-05 18:14 UTC (permalink / raw) To: Ramana Radhakrishnan Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, Wilco Dijkstra Ping... The latest version of this patch was here: https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01567.html Thanks Bernd. On 06/14/17 14:34, Bernd Edlinger wrote: > Ping... > > On 06/01/17 18:01, Bernd Edlinger wrote: >> Ping... >> >> On 05/12/17 18:49, Bernd Edlinger wrote: >>> Ping... >>> >>> On 04/29/17 19:45, Bernd Edlinger wrote: >>>> Ping... >>>> >>>> I attached a rebased version since there was a merge conflict in >>>> the xordi3 pattern, otherwise the patch is still identical. >>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >>>> early when the target has no neon or iwmmxt. >>>> >>>> >>>> Thanks >>>> Bernd. >>>> >>>> >>>> >>>> On 11/28/16 20:42, Bernd Edlinger wrote: >>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>>>> <bernd.edlinger@hotmail.de> wrote: >>>>>>> Hi! >>>>>>> >>>>>>> This improves the stack usage on the sha512 test case for the case >>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>>>> patterns right while expanding which is similar to what the >>>>>>> shift-pattern >>>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as >>>>>>> well as >>>>>>> thumb1. >>>>>>> >>>>>> >>>>>> I would go further and do this in the absence of Neon, the VFP unit >>>>>> being there doesn't help with DImode operations i.e. we do not >>>>>> have 64 >>>>>> bit integer arithmetic instructions without Neon. 
The main reason why >>>>>> we have the DImode patterns split so late is to give a chance for >>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>>>> map down to these instructions. Doing this just for soft-float >>>>>> doesn't >>>>>> improve the default case only. I don't usually test iwmmxt and I'm >>>>>> not >>>>>> sure who has the ability to do so, thus keeping this restriction for >>>>>> iwMMX is fine. >>>>>> >>>>>> >>>>> >>>>> Yes I understand, thanks for pointing that out. >>>>> >>>>> I was not aware what iwmmxt exists at all, but I noticed that most >>>>> 64bit expansions work completely different, and would break if we >>>>> split >>>>> the pattern early. >>>>> >>>>> I can however only look at the assembler outout for iwmmxt, and make >>>>> sure that the stack usage does not get worse. >>>>> >>>>> Thus the new version of the patch keeps only thumb1, neon and >>>>> iwmmxt as >>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>>>> for the test cases, and vfp and soft-float at around 270 bytes stack >>>>> usage. >>>>> >>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>>>> >>>>>>> Note this also splits many ldrd/strd instructions and therefore I >>>>>>> will >>>>>>> post a followup-patch that mitigates this effect by enabling the >>>>>>> ldrd/strd >>>>>>> peephole optimization after the necessary reg-testing. >>>>>>> >>>>>>> >>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>>>> >>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>>>> interpret it as --with-arch=armv7-a --with-float=hard >>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>>>> >>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>>>> patch as it stand have no effect there i.e. no change ? 
>>>>>> arm-linux-gnueabihf usually means to me someone has configured with >>>>>> --with-float=hard, so there are no regressions in the hard float ABI >>>>>> case, >>>>>> >>>>> >>>>> I know it proves little. When I say arm-linux-gnueabihf >>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>>>> --with-float=hard. >>>>> >>>>> My main interest in the stack usage is of course not because of linux, >>>>> but because of eCos where we have very small task stacks and in fact >>>>> no fpu support by the O/S at all, so that patch is exactly what we >>>>> need. >>>>> >>>>> >>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>>>> Is it OK for trunk? >>>>> >>>>> >>>>> Thanks >>>>> Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-04-29 19:17 ` [PING**2] " Bernd Edlinger 2017-05-12 16:51 ` [PING**3] " Bernd Edlinger @ 2017-09-04 14:52 ` Kyrill Tkachov 2017-09-05 8:47 ` Christophe Lyon 1 sibling, 1 reply; 28+ messages in thread From: Kyrill Tkachov @ 2017-09-04 14:52 UTC (permalink / raw) To: Bernd Edlinger, Ramana Radhakrishnan Cc: GCC Patches, Richard Earnshaw, Wilco Dijkstra On 29/04/17 18:45, Bernd Edlinger wrote: > Ping... > > I attached a rebased version since there was a merge conflict in > the xordi3 pattern, otherwise the patch is still identical. > It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 > early when the target has no neon or iwmmxt. > > > Thanks > Bernd. > > > > On 11/28/16 20:42, Bernd Edlinger wrote: >> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>> <bernd.edlinger@hotmail.de> wrote: >>>> Hi! >>>> >>>> This improves the stack usage on the sha512 test case for the case >>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>> patterns right while expanding which is similar to what the >>>> shift-pattern >>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well as >>>> thumb1. >>>> >>> I would go further and do this in the absence of Neon, the VFP unit >>> being there doesn't help with DImode operations i.e. we do not have 64 >>> bit integer arithmetic instructions without Neon. The main reason why >>> we have the DImode patterns split so late is to give a chance for >>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>> map down to these instructions. Doing this just for soft-float doesn't >>> improve the default case only. I don't usually test iwmmxt and I'm not >>> sure who has the ability to do so, thus keeping this restriction for >>> iwMMX is fine. 
>>> >>> >> Yes I understand, thanks for pointing that out. >> >> I was not aware what iwmmxt exists at all, but I noticed that most >> 64bit expansions work completely different, and would break if we split >> the pattern early. >> >> I can however only look at the assembler outout for iwmmxt, and make >> sure that the stack usage does not get worse. >> >> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >> for the test cases, and vfp and soft-float at around 270 bytes stack >> usage. >> >>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>> >>>> Note this also splits many ldrd/strd instructions and therefore I will >>>> post a followup-patch that mitigates this effect by enabling the >>>> ldrd/strd >>>> peephole optimization after the necessary reg-testing. >>>> >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>> What do you mean by arm-linux-gnueabihf - when folks say that I >>> interpret it as --with-arch=armv7-a --with-float=hard >>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>> >>> If you've really bootstrapped and regtested it on armhf, doesn't this >>> patch as it stand have no effect there i.e. no change ? >>> arm-linux-gnueabihf usually means to me someone has configured with >>> --with-float=hard, so there are no regressions in the hard float ABI >>> case, >>> >> I know it proves little. When I say arm-linux-gnueabihf >> I do in fact mean --enable-languages=all,ada,go,obj-c++ >> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >> --with-float=hard. >> >> My main interest in the stack usage is of course not because of linux, >> but because of eCos where we have very small task stacks and in fact >> no fpu support by the O/S at all, so that patch is exactly what we need. >> >> >> Bootstrapped and reg-tested on arm-linux-gnueabihf >> Is it OK for trunk? The code is ok. 
AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as the splits will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and TARGET_IWMMXT2 is irrelevant here). So this is ok for trunk. Thanks, and sorry again for the delay. Kyrill >> >> Thanks >> Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-04 14:52 ` [PING**2] " Kyrill Tkachov @ 2017-09-05 8:47 ` Christophe Lyon 2017-09-05 14:25 ` Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Christophe Lyon @ 2017-09-05 8:47 UTC (permalink / raw) To: Kyrill Tkachov Cc: Bernd Edlinger, Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, Wilco Dijkstra Hi Bernd, On 4 September 2017 at 16:52, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > On 29/04/17 18:45, Bernd Edlinger wrote: >> >> Ping... >> >> I attached a rebased version since there was a merge conflict in >> the xordi3 pattern, otherwise the patch is still identical. >> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >> early when the target has no neon or iwmmxt. >> >> >> Thanks >> Bernd. >> >> >> >> On 11/28/16 20:42, Bernd Edlinger wrote: >>> >>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>> >>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>> <bernd.edlinger@hotmail.de> wrote: >>>>> >>>>> Hi! >>>>> >>>>> This improves the stack usage on the sha512 test case for the case >>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>> patterns right while expanding which is similar to what the >>>>> shift-pattern >>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well >>>>> as >>>>> thumb1. >>>>> >>>> I would go further and do this in the absence of Neon, the VFP unit >>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>> bit integer arithmetic instructions without Neon. The main reason why >>>> we have the DImode patterns split so late is to give a chance for >>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>> map down to these instructions. Doing this just for soft-float doesn't >>>> improve the default case only. 
I don't usually test iwmmxt and I'm not >>>> sure who has the ability to do so, thus keeping this restriction for >>>> iwMMX is fine. >>>> >>>> >>> Yes I understand, thanks for pointing that out. >>> >>> I was not aware what iwmmxt exists at all, but I noticed that most >>> 64bit expansions work completely different, and would break if we split >>> the pattern early. >>> >>> I can however only look at the assembler outout for iwmmxt, and make >>> sure that the stack usage does not get worse. >>> >>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>> for the test cases, and vfp and soft-float at around 270 bytes stack >>> usage. >>> >>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>> >>>>> Note this also splits many ldrd/strd instructions and therefore I will >>>>> post a followup-patch that mitigates this effect by enabling the >>>>> ldrd/strd >>>>> peephole optimization after the necessary reg-testing. >>>>> >>>>> >>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>> >>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>> interpret it as --with-arch=armv7-a --with-float=hard >>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>> >>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>> patch as it stand have no effect there i.e. no change ? >>>> arm-linux-gnueabihf usually means to me someone has configured with >>>> --with-float=hard, so there are no regressions in the hard float ABI >>>> case, >>>> >>> I know it proves little. When I say arm-linux-gnueabihf >>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>> --with-float=hard. 
>>> >>> My main interest in the stack usage is of course not because of linux, >>> but because of eCos where we have very small task stacks and in fact >>> no fpu support by the O/S at all, so that patch is exactly what we need. >>> >>> >>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>> Is it OK for trunk? > > > The code is ok. > AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as > the splits > will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and > TARGET_IWMMXT2 > is irrelevant here). > > So this is ok for trunk. > Thanks, and sorry again for the delay. > Kyrill > This patch (r251663) causes a regression on armeb-none-linux-gnueabihf --with-mode arm --with-cpu cortex-a9 --with-fpu vfpv3-d16-fp16 FAIL: gcc.dg/vect/vect-singleton_1.c (internal compiler error) FAIL: gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects (internal compiler error) the test passes if gcc is configured --with-fpu neon-fp16 Christophe >>> >>> Thanks >>> Bernd. > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 8:47 ` Christophe Lyon @ 2017-09-05 14:25 ` Bernd Edlinger 2017-09-05 15:02 ` Wilco Dijkstra 2017-09-05 17:45 ` Kyrill Tkachov 0 siblings, 2 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-09-05 14:25 UTC (permalink / raw) To: Christophe Lyon, Kyrill Tkachov Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, Wilco Dijkstra [-- Attachment #1: Type: text/plain, Size: 6507 bytes --] Hi Christophe, On 09/05/17 10:45, Christophe Lyon wrote: > Hi Bernd, > > > On 4 September 2017 at 16:52, Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> >> On 29/04/17 18:45, Bernd Edlinger wrote: >>> >>> Ping... >>> >>> I attached a rebased version since there was a merge conflict in >>> the xordi3 pattern, otherwise the patch is still identical. >>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >>> early when the target has no neon or iwmmxt. >>> >>> >>> Thanks >>> Bernd. >>> >>> >>> >>> On 11/28/16 20:42, Bernd Edlinger wrote: >>>> >>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>>> >>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>>> <bernd.edlinger@hotmail.de> wrote: >>>>>> >>>>>> Hi! >>>>>> >>>>>> This improves the stack usage on the sha512 test case for the case >>>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>>> patterns right while expanding which is similar to what the >>>>>> shift-pattern >>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well >>>>>> as >>>>>> thumb1. >>>>>> >>>>> I would go further and do this in the absence of Neon, the VFP unit >>>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>>> bit integer arithmetic instructions without Neon. 
The main reason why >>>>> we have the DImode patterns split so late is to give a chance for >>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>>> map down to these instructions. Doing this just for soft-float doesn't >>>>> improve the default case only. I don't usually test iwmmxt and I'm not >>>>> sure who has the ability to do so, thus keeping this restriction for >>>>> iwMMX is fine. >>>>> >>>>> >>>> Yes I understand, thanks for pointing that out. >>>> >>>> I was not aware what iwmmxt exists at all, but I noticed that most >>>> 64bit expansions work completely different, and would break if we split >>>> the pattern early. >>>> >>>> I can however only look at the assembler outout for iwmmxt, and make >>>> sure that the stack usage does not get worse. >>>> >>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>>> for the test cases, and vfp and soft-float at around 270 bytes stack >>>> usage. >>>> >>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>>> >>>>>> Note this also splits many ldrd/strd instructions and therefore I will >>>>>> post a followup-patch that mitigates this effect by enabling the >>>>>> ldrd/strd >>>>>> peephole optimization after the necessary reg-testing. >>>>>> >>>>>> >>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>>> >>>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>>> interpret it as --with-arch=armv7-a --with-float=hard >>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>>> >>>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>>> patch as it stand have no effect there i.e. no change ? 
>>>>> arm-linux-gnueabihf usually means to me someone has configured with >>>>> --with-float=hard, so there are no regressions in the hard float ABI >>>>> case, >>>>> >>>> I know it proves little. When I say arm-linux-gnueabihf >>>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>>> --with-float=hard. >>>> >>>> My main interest in the stack usage is of course not because of linux, >>>> but because of eCos where we have very small task stacks and in fact >>>> no fpu support by the O/S at all, so that patch is exactly what we need. >>>> >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>>> Is it OK for trunk? >> >> >> The code is ok. >> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as >> the splits >> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and >> TARGET_IWMMXT2 >> is irrelevant here). >> >> So this is ok for trunk. >> Thanks, and sorry again for the delay. >> Kyrill >> > > This patch (r251663) causes a regression on armeb-none-linux-gnueabihf > --with-mode arm > --with-cpu cortex-a9 > --with-fpu vfpv3-d16-fp16 > FAIL: gcc.dg/vect/vect-singleton_1.c (internal compiler error) > FAIL: gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects > (internal compiler error) > > the test passes if gcc is configured --with-fpu neon-fp16 > Thank you very much for what you do! I am able to reproduce this. 
Combine creates an invalid insn out of these two insns: (insn 12 8 18 2 (parallel [ (set (reg:DI 122) (plus:DI (reg:DI 116 [ aD.5331 ]) (reg:DI 119 [ bD.5332 ]))) (clobber (reg:CC 100 cc)) ]) "vect-singleton_1.c":28 1 {*arm_adddi3} (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) (expr_list:REG_UNUSED (reg:CC 100 cc) (nil))))) (insn 18 12 19 2 (set (reg/i:DI 16 s0) (reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp} (expr_list:REG_DEAD (reg:DI 122) (nil))) => (insn 18 12 19 2 (parallel [ (set (reg/i:DI 16 s0) (plus:DI (reg:DI 116 [ aD.5331 ]) (reg:DI 119 [ bD.5332 ]))) (clobber (reg:CC 100 cc)) ]) "vect-singleton_1.c":28 1 {*arm_adddi3} (expr_list:REG_UNUSED (reg:CC 100 cc) (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) (nil))))) But after reload this is fixed again, so that is no issue for a splitter that runs always after reload. But now the splitter runs into an assertion in gen_highpart(op0). The same problem should exist with the subdi3 as well, but not with the other patterns in this patch. I think the right way to fix this is use a non-vfp register predicate to prevent combine to create the un-splittable instruction in the first place. See attached my proposed fix for this defect. Christophe, You could do me a favor, and test it in your environment as I have nothing comparable. Kyrill, is this patch OK after bootstrap / reg-testing? Thanks Bernd. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-6.diff --] [-- Type: text/x-patch; name="patch-pr77308-6.diff", Size: 2669 bytes --] 2017-09-05 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/predicates.md (s_register_operand_nv, arm_adddi_operand_nv): Create new non-vfp predicates. * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates. 
Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 251663) +++ gcc/config/arm/arm.md (working copy) @@ -457,14 +457,13 @@ ) (define_insn_and_split "*arm_adddi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r,&r,&r") - (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r") - (match_operand:DI 2 "arm_adddi_operand" "r, 0, r, Dd, Dd"))) + [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r,&r,&r") + (plus:DI (match_operand:DI 1 "s_register_operand_nv" "%0, 0, r, 0, r") + (match_operand:DI 2 "arm_adddi_operand_nv" "r, 0, r, Dd, Dd"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) - && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) (match_dup 1))) @@ -1263,9 +1262,9 @@ ) (define_insn_and_split "*arm_subdi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r") - (minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0") - (match_operand:DI 2 "s_register_operand" "r,0,0"))) + [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r") + (minus:DI (match_operand:DI 1 "s_register_operand_nv" "0,r,0") + (match_operand:DI 2 "s_register_operand_nv" "r,0,0"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" Index: gcc/config/arm/predicates.md =================================================================== --- gcc/config/arm/predicates.md (revision 251663) +++ gcc/config/arm/predicates.md (working copy) @@ -653,3 +653,11 @@ (ior (and (match_code "symbol_ref") (match_test "!arm_is_long_call_p (SYMBOL_REF_DECL (op))")) (match_operand 0 "s_register_operand"))) + +(define_predicate "s_register_operand_nv" + (and (match_operand 
0 "s_register_operand") + (not (match_operand 0 "vfp_hard_register_operand")))) + +(define_predicate "arm_adddi_operand_nv" + (and (match_operand 0 "arm_adddi_operand") + (not (match_operand 0 "vfp_hard_register_operand")))) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 14:25 ` Bernd Edlinger @ 2017-09-05 15:02 ` Wilco Dijkstra 2017-09-05 17:48 ` Bernd Edlinger 2017-09-05 17:45 ` Kyrill Tkachov 1 sibling, 1 reply; 28+ messages in thread From: Wilco Dijkstra @ 2017-09-05 15:02 UTC (permalink / raw) To: Bernd Edlinger, Christophe Lyon, Kyrill Tkachov Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd Bernd Edlinger wrote: > Combine creates an invalid insn out of these two insns: Yes it looks like a latent bug. We need to use arm_general_register_operand as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand. + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" The split condition for adddi3 now looks more accurate indeed, although we could remove the !TARGET_NEON from the split condition as this is always true given arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON". Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. Wilco ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 15:02 ` Wilco Dijkstra @ 2017-09-05 17:48 ` Bernd Edlinger 2017-09-05 17:53 ` Kyrill Tkachov 2017-09-05 21:28 ` Wilco Dijkstra 0 siblings, 2 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-09-05 17:48 UTC (permalink / raw) To: Wilco Dijkstra, Christophe Lyon, Kyrill Tkachov Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd [-- Attachment #1: Type: text/plain, Size: 1589 bytes --] On 09/05/17 17:02, Wilco Dijkstra wrote: > Bernd Edlinger wrote: > >> Combine creates an invalid insn out of these two insns: > > Yes it looks like a latent bug. We need to use arm_general_register_operand > as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate > s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand. > Thanks, attached is a patch following your suggestion. > + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" > > The split condition for adddi3 now looks more accurate indeed, although we could > remove the !TARGET_NEON from the split condition as this is always true given > arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON". > No, the split condition does not begin with "&& TARGET_32BIT...". Therefore the split is enabled in TARGET_NEON after reload_completed. And it is invoked from adddi3_neon for all alternatives without vfp registers: switch (which_alternative) { case 0: /* fall through */ case 3: return "vadd.i64\t%P0, %P1, %P2"; case 1: return "#"; case 2: return "#"; case 4: return "#"; case 5: return "#"; case 6: return "#"; > Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. > Yes, that pattern can be cleaned up in a follow-up patch. Note this splitter is invoked from bicdi3_neon as well. However I think anddi_notdi_di should be safe as long as it is enabled after reload_completed (which is probably a bug). Bernd. 
> Wilco > [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch-pr77308-6.diff --] [-- Type: text/x-patch; name="patch-pr77308-6.diff", Size: 2572 bytes --] 2017-09-05 Bernd Edlinger <bernd.edlinger@hotmail.de> PR target/77308 * config/arm/predicates.md (arm_general_adddi_operand): Create new non-vfp predicate. * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 251663) +++ gcc/config/arm/arm.md (working copy) @@ -457,14 +457,13 @@ ) (define_insn_and_split "*arm_adddi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r,&r,&r") - (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r") - (match_operand:DI 2 "arm_adddi_operand" "r, 0, r, Dd, Dd"))) + [(set (match_operand:DI 0 "arm_general_register_operand" "=&r,&r,&r,&r,&r") + (plus:DI (match_operand:DI 1 "arm_general_register_operand" "%0, 0, r, 0, r") + (match_operand:DI 2 "arm_general_adddi_operand" "r, 0, r, Dd, Dd"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) - && ! 
(TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) (match_dup 1))) @@ -1263,9 +1262,9 @@ ) (define_insn_and_split "*arm_subdi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r") - (minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0") - (match_operand:DI 2 "s_register_operand" "r,0,0"))) + [(set (match_operand:DI 0 "arm_general_register_operand" "=&r,&r,&r") + (minus:DI (match_operand:DI 1 "arm_general_register_operand" "0,r,0") + (match_operand:DI 2 "arm_general_register_operand" "r,0,0"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" Index: gcc/config/arm/predicates.md =================================================================== --- gcc/config/arm/predicates.md (revision 251663) +++ gcc/config/arm/predicates.md (working copy) @@ -82,6 +82,11 @@ || REGNO (op) >= FIRST_PSEUDO_REGISTER)); }) +(define_predicate "arm_general_adddi_operand" + (ior (match_operand 0 "arm_general_register_operand") + (and (match_code "const_int") + (match_test "const_ok_for_dimode_op (INTVAL (op), PLUS)")))) + (define_predicate "vfp_register_operand" (match_code "reg,subreg") { ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 17:48 ` Bernd Edlinger @ 2017-09-05 17:53 ` Kyrill Tkachov 2017-09-05 18:20 ` Christophe Lyon 2017-09-05 21:28 ` Wilco Dijkstra 1 sibling, 1 reply; 28+ messages in thread From: Kyrill Tkachov @ 2017-09-05 17:53 UTC (permalink / raw) To: Bernd Edlinger, Wilco Dijkstra, Christophe Lyon Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd On 05/09/17 18:48, Bernd Edlinger wrote: > On 09/05/17 17:02, Wilco Dijkstra wrote: >> Bernd Edlinger wrote: >> >>> Combine creates an invalid insn out of these two insns: >> Yes it looks like a latent bug. We need to use arm_general_register_operand >> as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate >> s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand. >> > Thanks, attached is a patch following your suggestion. > >> + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" >> >> The split condition for adddi3 now looks more accurate indeed, although we could >> remove the !TARGET_NEON from the split condition as this is always true given >> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON". >> > No, the split condition does not begin with "&& TARGET_32BIT...". > Therefore the split is enabled in TARGET_NEON after reload_completed. > And it is invoked from adddi3_neon for all alternatives without vfp > registers: > > switch (which_alternative) > { > case 0: /* fall through */ > case 3: return "vadd.i64\t%P0, %P1, %P2"; > case 1: return "#"; > case 2: return "#"; > case 4: return "#"; > case 5: return "#"; > case 6: return "#"; > > > >> Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. >> > Yes, that pattern can be cleaned up in a follow-up patch. > Note this splitter is invoked from bicdi3_neon as well. > However I think anddi_notdi_di should be safe as long as it is enabled > after reload_completed (which is probably a bug). 
> Thanks, that's what I had in mind in my other reply. This is ok if testing comes back ok. Kyrill > Bernd. > >> Wilco >> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 17:53 ` Kyrill Tkachov @ 2017-09-05 18:20 ` Christophe Lyon 2017-09-06 7:35 ` Christophe Lyon 0 siblings, 1 reply; 28+ messages in thread From: Christophe Lyon @ 2017-09-05 18:20 UTC (permalink / raw) To: Kyrill Tkachov Cc: Bernd Edlinger, Wilco Dijkstra, Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd On 5 September 2017 at 19:53, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > On 05/09/17 18:48, Bernd Edlinger wrote: >> >> On 09/05/17 17:02, Wilco Dijkstra wrote: >>> >>> Bernd Edlinger wrote: >>> >>>> Combine creates an invalid insn out of these two insns: >>> >>> Yes it looks like a latent bug. We need to use >>> arm_general_register_operand >>> as arm_adddi3/subdi3 only allow integer registers. You don't need a new >>> predicate >>> s_register_operand_nv. Also I'd prefer something like >>> arm_general_adddi_operand. >>> >> Thanks, attached is a patch following your suggestion. >> >>> + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || >>> reload_completed)" >>> >>> The split condition for adddi3 now looks more accurate indeed, although >>> we could >>> remove the !TARGET_NEON from the split condition as this is always true >>> given >>> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON". >>> >> No, the split condition does not begin with "&& TARGET_32BIT...". >> Therefore the split is enabled in TARGET_NEON after reload_completed. >> And it is invoked from adddi3_neon for all alternatives without vfp >> registers: >> >> switch (which_alternative) >> { >> case 0: /* fall through */ >> case 3: return "vadd.i64\t%P0, %P1, %P2"; >> case 1: return "#"; >> case 2: return "#"; >> case 4: return "#"; >> case 5: return "#"; >> case 6: return "#"; >> >> >> >>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the >>> same issue. >>> >> Yes, that pattern can be cleaned up in a follow-up patch. 
>> Note this splitter is invoked from bicdi3_neon as well. >> However I think anddi_notdi_di should be safe as long as it is enabled >> after reload_completed (which is probably a bug). >> > > Thanks, that's what I had in mind in my other reply. > This is ok if testing comes back ok. > I've submitted the patch for testing, I'll let you know about the results. Christophe > Kyrill > > >> Bernd. >> >>> Wilco >>> > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 18:20 ` Christophe Lyon @ 2017-09-06 7:35 ` Christophe Lyon 0 siblings, 0 replies; 28+ messages in thread From: Christophe Lyon @ 2017-09-06 7:35 UTC (permalink / raw) To: Kyrill Tkachov Cc: Bernd Edlinger, Wilco Dijkstra, Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd On 5 September 2017 at 20:20, Christophe Lyon <christophe.lyon@linaro.org> wrote: > On 5 September 2017 at 19:53, Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> >> On 05/09/17 18:48, Bernd Edlinger wrote: >>> >>> On 09/05/17 17:02, Wilco Dijkstra wrote: >>>> >>>> Bernd Edlinger wrote: >>>> >>>>> Combine creates an invalid insn out of these two insns: >>>> >>>> Yes it looks like a latent bug. We need to use >>>> arm_general_register_operand >>>> as arm_adddi3/subdi3 only allow integer registers. You don't need a new >>>> predicate >>>> s_register_operand_nv. Also I'd prefer something like >>>> arm_general_adddi_operand. >>>> >>> Thanks, attached is a patch following your suggestion. >>> >>>> + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || >>>> reload_completed)" >>>> >>>> The split condition for adddi3 now looks more accurate indeed, although >>>> we could >>>> remove the !TARGET_NEON from the split condition as this is always true >>>> given >>>> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON". >>>> >>> No, the split condition does not begin with "&& TARGET_32BIT...". >>> Therefore the split is enabled in TARGET_NEON after reload_completed. >>> And it is invoked from adddi3_neon for all alternatives without vfp >>> registers: >>> >>> switch (which_alternative) >>> { >>> case 0: /* fall through */ >>> case 3: return "vadd.i64\t%P0, %P1, %P2"; >>> case 1: return "#"; >>> case 2: return "#"; >>> case 4: return "#"; >>> case 5: return "#"; >>> case 6: return "#"; >>> >>> >>> >>>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the >>>> same issue. 
>>>> >>> Yes, that pattern can be cleaned up in a follow-up patch. >>> Note this splitter is invoked from bicdi3_neon as well. >>> However I think anddi_notdi_di should be safe as long as it is enabled >>> after reload_completed (which is probably a bug). >>> >> >> Thanks, that's what I had in mind in my other reply. >> This is ok if testing comes back ok. >> > > I've submitted the patch for testing, I'll let you know about the results. > I can confirm the last patch does fix the regression I reported, and causes no other regression. (The previous version of the fix worked, too.) Thanks for the prompt fix. Christophe > Christophe > >> Kyrill >> >> >>> Bernd. >>> >>>> Wilco >>>> >> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 17:48 ` Bernd Edlinger 2017-09-05 17:53 ` Kyrill Tkachov @ 2017-09-05 21:28 ` Wilco Dijkstra 2017-09-06 9:31 ` Bernd Edlinger 1 sibling, 1 reply; 28+ messages in thread From: Wilco Dijkstra @ 2017-09-05 21:28 UTC (permalink / raw) To: Bernd Edlinger, Christophe Lyon, Kyrill Tkachov Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd Bernd Edlinger wrote: > No, the split condition does not begin with "&& TARGET_32BIT...". > Therefore the split is enabled in TARGET_NEON after reload_completed. > And it is invoked from adddi3_neon for all alternatives without vfp > registers: Hmm that's a huge mess. I'd argue that any insn_and_split should only split its own instruction, never other instructions (especially if they are from different md files, which is extremely confusing). Otherwise we should use a separate split and explicitly list which instructions it splits. So then the next question is whether the neon_adddi3 still needs the arm_adddi3 splitter in some odd corner cases? > > Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. > Yes, that pattern can be cleaned up in a follow-up patch. And there are a lot more instructions that need the same treatment and split early (probably best at expand time). I noticed none of the zero/sign extends are split before regalloc, for example. > Note this splitter is invoked from bicdi3_neon as well. > However I think anddi_notdi_di should be safe as long as it is enabled > after reload_completed (which is probably a bug). Since we should be splitting and/bic early now I don't think you can get anddi_notdi anymore. So it could be removed completely, assuming Neon already does the right thing. It looks like we need to do a full pass over all DI mode instructions and clean up all the mess. Wilco ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 21:28 ` Wilco Dijkstra @ 2017-09-06 9:31 ` Bernd Edlinger 0 siblings, 0 replies; 28+ messages in thread From: Bernd Edlinger @ 2017-09-06 9:31 UTC (permalink / raw) To: Wilco Dijkstra, Christophe Lyon, Kyrill Tkachov Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, nd On 09/05/17 23:27, Wilco Dijkstra wrote: > Bernd Edlinger wrote: >> No, the split condition does not begin with "&& TARGET_32BIT...". >> Therefore the split is enabled in TARGET_NEON after reload_completed. >> And it is invoked from adddi3_neon for all alternatives without vfp >> registers: > > Hmm that's a huge mess. I'd argue that any insn_and_split should only split > its own instruction, never other instructions (especially if they are from > different md files, which is extremely confusing). Otherwise we should use > a separate split and explicitly list which instructions it splits. > This is literally a mine-field. > So then the next question is whether the neon_adddi3 still needs the > arm_adddi3 splitter in some odd corner cases? > Yes, I think so. While the *arm_adddi3 and adddi3_neon insns are mutually exclusive, they share the splitter. I don't know with which iwmmxt pattern the *arm_adddi3 splitter might interfere, therefore I don't know if the !TARGET_IWMMXT can be removed from the splitter condition. Other patterns have an iwmmxt twin that is not mutually exclusive, for instance *anddi_notdi_di, bicdi3_neon and iwmmxt_nanddi3. The bicdi3_neon insn duplicates the alternatives while iwmmxt does not. And nobody is able to test iwmmxt. >>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. > >> Yes, that pattern can be cleaned up in a follow-up patch. > > And there are a lot more instructions that need the same treatment and split > early (probably best at expand time). I noticed none of the zero/sign extends > are split before regalloc, for example.
> I did use the test cases as benchmarks to decide which way to go. It is quite easy to try different combinations of cpu and inspect the stack usage of neon, iwmmxt, and vfp etc. It may be due to interaction with other patterns, but the split at expand time did not produce the best results in every case. >> Note this splitter is invoked from bicdi3_neon as well. >> However I think anddi_notdi_di should be safe as long as it is enabled >> after reload_completed (which is probably a bug). > > Since we should be splitting and/bic early now I don't think you can get anddi_notdi > anymore. So it could be removed completely assuming Neon already does the right > thing. > > It looks like we need to do a full pass over all DI mode instructions and clean up > all the mess. > Yes, but in small steps, and using some benchmarks to make sure that each step does improve at least something. Bernd. > Wilco > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2017-09-05 14:25 ` Bernd Edlinger 2017-09-05 15:02 ` Wilco Dijkstra @ 2017-09-05 17:45 ` Kyrill Tkachov 1 sibling, 0 replies; 28+ messages in thread From: Kyrill Tkachov @ 2017-09-05 17:45 UTC (permalink / raw) To: Bernd Edlinger, Christophe Lyon Cc: Ramana Radhakrishnan, GCC Patches, Richard Earnshaw, Wilco Dijkstra Hi Bernd, On 05/09/17 15:25, Bernd Edlinger wrote: > Hi Christophe, > > On 09/05/17 10:45, Christophe Lyon wrote: >> Hi Bernd, >> >> >> On 4 September 2017 at 16:52, Kyrill Tkachov >> <kyrylo.tkachov@foss.arm.com> wrote: >>> On 29/04/17 18:45, Bernd Edlinger wrote: >>>> Ping... >>>> >>>> I attached a rebased version since there was a merge conflict in >>>> the xordi3 pattern, otherwise the patch is still identical. >>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >>>> early when the target has no neon or iwmmxt. >>>> >>>> >>>> Thanks >>>> Bernd. >>>> >>>> >>>> >>>> On 11/28/16 20:42, Bernd Edlinger wrote: >>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>>>> <bernd.edlinger@hotmail.de> wrote: >>>>>>> Hi! >>>>>>> >>>>>>> This improves the stack usage on the sha512 test case for the case >>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>>>> patterns right while expanding which is similar to what the >>>>>>> shift-pattern >>>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well >>>>>>> as >>>>>>> thumb1. >>>>>>> >>>>>> I would go further and do this in the absence of Neon, the VFP unit >>>>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>>>> bit integer arithmetic instructions without Neon. 
The main reason why >>>>>> we have the DImode patterns split so late is to give a chance for >>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>>>> map down to these instructions. Doing this just for soft-float doesn't >>>>>> improve the default case only. I don't usually test iwmmxt and I'm not >>>>>> sure who has the ability to do so, thus keeping this restriction for >>>>>> iwMMX is fine. >>>>>> >>>>>> >>>>> Yes I understand, thanks for pointing that out. >>>>> >>>>> I was not aware what iwmmxt exists at all, but I noticed that most >>>>> 64bit expansions work completely different, and would break if we split >>>>> the pattern early. >>>>> >>>>> I can however only look at the assembler outout for iwmmxt, and make >>>>> sure that the stack usage does not get worse. >>>>> >>>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>>>> for the test cases, and vfp and soft-float at around 270 bytes stack >>>>> usage. >>>>> >>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>>>> >>>>>>> Note this also splits many ldrd/strd instructions and therefore I will >>>>>>> post a followup-patch that mitigates this effect by enabling the >>>>>>> ldrd/strd >>>>>>> peephole optimization after the necessary reg-testing. >>>>>>> >>>>>>> >>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>>>> interpret it as --with-arch=armv7-a --with-float=hard >>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>>>> >>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>>>> patch as it stand have no effect there i.e. no change ? 
>>>>>> arm-linux-gnueabihf usually means to me someone has configured with >>>>>> --with-float=hard, so there are no regressions in the hard float ABI >>>>>> case, >>>>>> >>>>> I know it proves little. When I say arm-linux-gnueabihf >>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>>>> --with-float=hard. >>>>> >>>>> My main interest in the stack usage is of course not because of linux, >>>>> but because of eCos where we have very small task stacks and in fact >>>>> no fpu support by the O/S at all, so that patch is exactly what we need. >>>>> >>>>> >>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>>>> Is it OK for trunk? >>> >>> The code is ok. >>> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as >>> the splits >>> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and >>> TARGET_IWMMXT2 >>> is irrelevant here). >>> >>> So this is ok for trunk. >>> Thanks, and sorry again for the delay. >>> Kyrill >>> >> This patch (r251663) causes a regression on armeb-none-linux-gnueabihf >> --with-mode arm >> --with-cpu cortex-a9 >> --with-fpu vfpv3-d16-fp16 >> FAIL: gcc.dg/vect/vect-singleton_1.c (internal compiler error) >> FAIL: gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects >> (internal compiler error) >> >> the test passes if gcc is configured --with-fpu neon-fp16 >> > Thank you very much for what you do! > > I am able to reproduce this. 
> > Combine creates an invalid insn out of these two insns: > > (insn 12 8 18 2 (parallel [ > (set (reg:DI 122) > (plus:DI (reg:DI 116 [ aD.5331 ]) > (reg:DI 119 [ bD.5332 ]))) > (clobber (reg:CC 100 cc)) > ]) "vect-singleton_1.c":28 1 {*arm_adddi3} > (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) > (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) > (expr_list:REG_UNUSED (reg:CC 100 cc) > (nil))))) > (insn 18 12 19 2 (set (reg/i:DI 16 s0) > (reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp} > (expr_list:REG_DEAD (reg:DI 122) > (nil))) > > => > > (insn 18 12 19 2 (parallel [ > (set (reg/i:DI 16 s0) > (plus:DI (reg:DI 116 [ aD.5331 ]) > (reg:DI 119 [ bD.5332 ]))) > (clobber (reg:CC 100 cc)) > ]) "vect-singleton_1.c":28 1 {*arm_adddi3} > (expr_list:REG_UNUSED (reg:CC 100 cc) > (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) > (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) > (nil))))) > > > But after reload this is fixed again, so that is no issue for a > splitter that runs always after reload. > > But now the splitter runs into an assertion in gen_highpart(op0). > The same problem should exist with the subdi3 as well, but > not with the other patterns in this patch. > > I think the right way to fix this is use a non-vfp register > predicate to prevent combine to create the un-splittable > instruction in the first place. > > See attached my proposed fix for this defect. > > Christophe, You could do me a favor, and test it in your environment > as I have nothing comparable. > > > Kyrill, is this patch OK after bootstrap / reg-testing? > Thanks for tracking this down so quickly. Like Wilco said in his reply, you can just use arm_general_register_operand and for the adddi_operand predicate use a name like arm_general_adddi_operand (_nv associates to some with non-volatile or non-valid). I don't think anddi_notdi_di has the same issue as that one still requires splitting after reload in all cases and the registers are guaranteed to be core registers by then. 
So ok with using arm_general_register_operand and renaming the new adjusted adddi_operand-based predicate to arm_general_adddi_operand, assuming bootstrap and testing goes ok. Thanks, Kyrill ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PING] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) @ 2016-11-16 17:28 Bernd Edlinger 2016-11-24 14:24 ` [PING**2] " Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-11-16 17:28 UTC (permalink / raw) To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Ramana Radhakrishnan Hi, I'd like to ping for these two patches: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00523.html [PATCH, ARM] Enable ldrd/strd peephole rules unconditionally https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00830.html Thanks Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) 2016-11-16 17:28 [PING] " Bernd Edlinger @ 2016-11-24 14:24 ` Bernd Edlinger 2016-12-05 13:41 ` [PING] " Bernd Edlinger 0 siblings, 1 reply; 28+ messages in thread From: Bernd Edlinger @ 2016-11-24 14:24 UTC (permalink / raw) To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Ramana Radhakrishnan Hi, this patch is still waiting for review: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00523.html Thanks Bernd. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PING] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
  2016-11-24 14:24 ` [PING**2] " Bernd Edlinger
@ 2016-12-05 13:41 ` Bernd Edlinger
  2016-12-12  5:59 ` [PING**2] " Bernd Edlinger
  0 siblings, 1 reply; 28+ messages in thread

From: Bernd Edlinger @ 2016-12-05 13:41 UTC (permalink / raw)
To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Ramana Radhakrishnan

Hi,

this was the latest version of my patch:
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html

Thanks
Bernd.
* [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
  2016-12-05 13:41 ` [PING] " Bernd Edlinger
@ 2016-12-12  5:59 ` Bernd Edlinger
  2016-12-19  8:22 ` Bernd Edlinger
  0 siblings, 1 reply; 28+ messages in thread

From: Bernd Edlinger @ 2016-12-12 5:59 UTC (permalink / raw)
To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Ramana Radhakrishnan

Ping...

On 12/05/16 14:41, Bernd Edlinger wrote:
> Hi,
>
> this was the latest version of my patch:
> https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>
>
> Thanks
> Bernd.
* Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
  2016-12-12  5:59 ` [PING**2] " Bernd Edlinger
@ 2016-12-19  8:22 ` Bernd Edlinger
  0 siblings, 0 replies; 28+ messages in thread

From: Bernd Edlinger @ 2016-12-19 8:22 UTC (permalink / raw)
To: GCC Patches; +Cc: Kyrill Tkachov, Richard Earnshaw, Ramana Radhakrishnan

Ping...

On 12/12/16 06:59, Bernd Edlinger wrote:
> Ping...
>
> On 12/05/16 14:41, Bernd Edlinger wrote:
>> Hi,
>>
>> this was the latest version of my patch:
>> https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>>
>>
>> Thanks
>> Bernd.
end of thread, other threads:[~2017-09-06  9:31 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-06 14:18 [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) Bernd Edlinger
2016-11-25 11:30 ` Ramana Radhakrishnan
2016-11-28 19:42   ` Bernd Edlinger
     [not found]     ` <VI1PR0802MB2621FFBFA3252B40E5978C9F838D0@VI1PR0802MB2621.eurprd08.prod.outlook.com>
2016-11-29 21:37       ` Bernd Edlinger
     [not found]         ` <AM5PR0802MB261038521472515DDE3E58DA838D0@AM5PR0802MB2610.eurprd08.prod.outlook.com>
2016-11-30 12:01           ` Wilco Dijkstra
2016-11-30 17:01             ` Bernd Edlinger
2016-12-08 19:50               ` Bernd Edlinger
2017-01-11 16:55                 ` Richard Earnshaw (lists)
2017-01-11 17:19                   ` Bernd Edlinger
2017-04-29 19:17 ` [PING**2] " Bernd Edlinger
2017-05-12 16:51   ` [PING**3] " Bernd Edlinger
2017-06-01 16:01     ` [PING**4] " Bernd Edlinger
     [not found]       ` <bd5e03b1-860f-dd16-2030-9ce0f9a94c7c@hotmail.de>
2017-06-14 12:35         ` [PING**5] " Bernd Edlinger
     [not found]           ` <9a0fbb5d-9909-ef4d-6871-0cb4f7971bbb@hotmail.de>
2017-07-05 18:14             ` [PING**6] " Bernd Edlinger
2017-09-04 14:52               ` [PING**2] " Kyrill Tkachov
2017-09-05  8:47                 ` Christophe Lyon
2017-09-05 14:25                   ` Bernd Edlinger
2017-09-05 15:02                     ` Wilco Dijkstra
2017-09-05 17:48                       ` Bernd Edlinger
2017-09-05 17:53                         ` Kyrill Tkachov
2017-09-05 18:20                           ` Christophe Lyon
2017-09-06  7:35                             ` Christophe Lyon
2017-09-05 21:28                         ` Wilco Dijkstra
2017-09-06  9:31                           ` Bernd Edlinger
2017-09-05 17:45                     ` Kyrill Tkachov
2016-11-16 17:28 [PING] " Bernd Edlinger
2016-11-24 14:24 ` [PING**2] " Bernd Edlinger
2016-12-05 13:41   ` [PING] " Bernd Edlinger
2016-12-12  5:59     ` [PING**2] " Bernd Edlinger
2016-12-19  8:22       ` Bernd Edlinger