* [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate @ 2010-12-09 16:45 Andrew Stubbs 2011-01-27 2:09 ` Ramana Radhakrishnan 2011-01-27 19:22 ` Richard Earnshaw 0 siblings, 2 replies; 27+ messages in thread From: Andrew Stubbs @ 2010-12-09 16:45 UTC (permalink / raw) To: gcc-patches [-- Attachment #1: Type: text/plain, Size: 1033 bytes --] The attached patch fixes a bug that prevented smlalbb being generated, even though it was present in the machine description. The problem was that the MULT portion of the maddhidi4 pattern did not match the mulhisi3 pattern, and so combine could not build matching patterns. My patch simply adjusts the (internal) modes and extends in the maddhidi4 pattern so that they match. The externally visible modes remain unaltered. Test case: long long foolong (long long x, short *a, short *b) { return x + *a * *b; } Before, compiled with "-O2 -mcpu=cortex-a8 -mthumb": ldrh r2, [r2, #0] ldrh r3, [r3, #0] smulbb r3, r2, r3 adds r0, r0, r3 adc r1, r1, r3, asr #31 bx lr The compiler is forced to do the multiply and addition operation separately. After: ldrh r2, [r2, #0] ldrh r3, [r3, #0] smlalbb r0, r1, r2, r3 bx lr OK to commit, once stage 1 opens again? Andrew [-- Attachment #2: arm-maddhidi4.patch --] [-- Type: text/x-patch, Size: 1737 bytes --] 2010-12-09 Andrew Stubbs <ams@codesourcery.com> gcc/ * config/arm/arm.md (*maddhidi4): Make mult mode match mulhisi3 for combine. gcc/testsuite/ * gcc.target/arm/smlalbb.c: New file. 
--- src/gcc-mainline/gcc/config/arm/arm.md | 9 +++++---- .../gcc/testsuite/gcc.target/arm/smlalbb.c | 8 ++++++++ 2 files changed, 13 insertions(+), 4 deletions(-) create mode 100644 src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c diff --git a/src/gcc-mainline/gcc/config/arm/arm.md b/src/gcc-mainline/gcc/config/arm/arm.md index c816126..dc9da52 100644 --- a/src/gcc-mainline/gcc/config/arm/arm.md +++ b/src/gcc-mainline/gcc/config/arm/arm.md @@ -1814,10 +1814,11 @@ (define_insn "*maddhidi4" [(set (match_operand:DI 0 "s_register_operand" "=r") (plus:DI - (mult:DI (sign_extend:DI - (match_operand:HI 1 "s_register_operand" "%r")) - (sign_extend:DI - (match_operand:HI 2 "s_register_operand" "r"))) + (sign_extend:DI + (mult:SI (sign_extend:SI + (match_operand:HI 1 "s_register_operand" "%r")) + (sign_extend:SI + (match_operand:HI 2 "s_register_operand" "r")))) (match_operand:DI 3 "s_register_operand" "0")))] "TARGET_DSP_MULTIPLY" "smlalbb%?\\t%Q0, %R0, %1, %2" diff --git a/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c b/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c new file mode 100644 index 0000000..0e4dd70 --- /dev/null +++ b/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c @@ -0,0 +1,8 @@ +/* Ensure the smlalbb pattern works. */ +/* { dg-options "-O2 -march=armv7-a" } */ +/* { dg-final { scan-assembler "smlalbb" } } */ + +long long foolong (long long x, short *a, short *b) +{ + return x + *a * *b; +} ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2010-12-09 16:45 [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate Andrew Stubbs @ 2011-01-27 2:09 ` Ramana Radhakrishnan 2011-01-27 10:46 ` Andrew Stubbs 2011-01-27 19:22 ` Richard Earnshaw 1 sibling, 1 reply; 27+ messages in thread From: Ramana Radhakrishnan @ 2011-01-27 2:09 UTC (permalink / raw) To: Andrew Stubbs; +Cc: gcc-patches Hi Andrew, > > OK to commit, once stage 1 opens again? I can't approve or reject your patch but is there any reason why this shouldn't be covered under the same rules as the smulbb patch that you submitted around the same time ? http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00772.html Ramana ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-27 2:09 ` Ramana Radhakrishnan @ 2011-01-27 10:46 ` Andrew Stubbs 0 siblings, 0 replies; 27+ messages in thread From: Andrew Stubbs @ 2011-01-27 10:46 UTC (permalink / raw) To: Ramana Radhakrishnan; +Cc: gcc-patches On 27/01/11 01:32, Ramana Radhakrishnan wrote: > I can't approve or reject your patch but is there any reason why this > shouldn't be covered under the same rules as the smulbb patch that you > submitted around the same time ? > > http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00772.html Only that that was stage 3, and we're now in stage 4, I think? Otherwise it's exactly the same sort of issue. Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2010-12-09 16:45 [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate Andrew Stubbs 2011-01-27 2:09 ` Ramana Radhakrishnan @ 2011-01-27 19:22 ` Richard Earnshaw 2011-01-28 13:06 ` Andrew Stubbs 1 sibling, 1 reply; 27+ messages in thread From: Richard Earnshaw @ 2011-01-27 19:22 UTC (permalink / raw) To: Andrew Stubbs; +Cc: gcc-patches On Thu, 2010-12-09 at 16:19 +0000, Andrew Stubbs wrote: > The attached patch fixes a bug that prevented smlalbb being generated, > even though it was present in the machine description. > > The problem was that the MULT portion of the maddhidi4 pattern did not > match the mulhisi3 pattern, and so combine could not build matching > patterns. > > My patch simply adjusts the (internal) modes and extends in the > maddhidi4 pattern so that they match. The externally visible modes > remain unaltered. > > Test case: > > long long foolong (long long x, short *a, short *b) > { > return x + *a * *b; > } > > Before, compiled with "-O2 -mcpu=cortex-a8 -mthumb": > > ldrh r2, [r2, #0] > ldrh r3, [r3, #0] > smulbb r3, r2, r3 > adds r0, r0, r3 > adc r1, r1, r3, asr #31 > bx lr > > The compiler is forced to do the multiply and addition operation separately. > > After: > ldrh r2, [r2, #0] > ldrh r3, [r3, #0] > smlalbb r0, r1, r2, r3 > bx lr > > > OK to commit, once stage 1 opens again? > > Andrew I'm worried about this change. The proposed RTL might be what GCC generates for this particular case, but the pattern hardly looks canonical: a sign-extend of a multiply that contained sign-extended operands. Are you sure that fixing this particular case doesn't introduce a regression elsewhere? Alternatively, is there some part of the manual that describes the form you have as the correct canonical form? R. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-27 19:22 ` Richard Earnshaw @ 2011-01-28 13:06 ` Andrew Stubbs 2011-01-28 15:08 ` Richard Earnshaw 0 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-01-28 13:06 UTC (permalink / raw) To: Richard Earnshaw; +Cc: gcc-patches On 27/01/11 18:40, Richard Earnshaw wrote: > I'm worried about this change. The proposed RTL might be what GCC > generates for this particular case, but the pattern hardly looks > canonical: a sign-extend of a multiply that contained sign-extended > operands. > > Are you sure that fixing this particular case doesn't introduce a > regression elsewhere? > > Alternatively, is there some part of the manual that describes the form > you have as the correct canonical form? I don't think I said it was the canonical form (although the other similar smulbb patch was about that), but if it weren't in the canonical form then combine would not work (which it does with my patch), so I don't believe that's the problem here. Anyway, combine only canonicalizes the *order* of the expressions, not the modes, which is what I'm adjusting here. Or have I missed something? The problem here is that the expand pass used mulhisi3 which has the 'internal' modes set one way, and then combine tries to insert that into a multiply-and-accumulate form, but recog rejects it because the 'internal' modes of maddhidi4 are set a different way. Now, mulhisi3 is currently written using MULT:SI mode, and I can't see how this could be any other way - it's an SI mode insn. If we want to be able to insert that into maddhidi4, and we do, I don't see any way to do that if it uses a MULT:DI - it just isn't compatible. I have regression tested the patch for correctness, but I haven't proven that I haven't produced worse code in some other case. I could introduce a second pattern, say "*maddhidi4_2", so that both patterns match and then there could be no regressions, right?
However, I don't believe it's necessary. Andrew
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-28 13:06 ` Andrew Stubbs @ 2011-01-28 15:08 ` Richard Earnshaw 2011-01-28 15:55 ` Andrew Stubbs 2011-02-09 6:31 ` [patch][ARM] " Hans-Peter Nilsson 0 siblings, 2 replies; 27+ messages in thread From: Richard Earnshaw @ 2011-01-28 15:08 UTC (permalink / raw) To: Andrew Stubbs; +Cc: gcc-patches On Fri, 2011-01-28 at 10:07 +0000, Andrew Stubbs wrote: > On 27/01/11 18:40, Richard Earnshaw wrote: > > I'm worried about this change. The proposed RTL might be what GCC > > generates for this particular case, but the pattern hardly looks > > canonical: a sign-extend of a multiply that contained sign-extended > > operands. > > > > Are you sure that fixing this particular case doesn't introduce a > > regression elsewhere? > > > > Alternatively, is there some part of the manual that describes the form > > you have as the correct canonical form? > > I don't think I said it was the canonical form (although the other > similar smulbb patch was about that), but if it weren't in the canonical > form then combine would not work (which it does with my patch), so I > don't believe that's the problem here. > > Anyway, combine only canonicalizes the *order* or the expressions, not > the modes, which is what I'm adjusting here. Or have I missed something? > Canonical RTL is not just about the order of the operands its about not having multiple ways of representing the same operation. Order is part of that, but not the end of it. For example, the canonical form of (lt:SI (reg:SI) (const_int 0)) is (lshiftrt:SI (reg:SI) (const_int 31)) As many machines have an instruction for the latter but not the former. My question really, is about whether (sign_extend:DI (mult:SI (sign_extend:SI (reg:HI)) (sign_extend:SI (reg:HI))))) should be canonical, or whether (mult:DI (sign_extend:DI (reg:HI) (sign_extend:DI (reg:HI)) is a better expression of this operation. 
Ultimately this comes down to what is the correct thing to do in GCC. If its the latter, then simplify RTX may need some work to make it do the right thing, rather than tweaking the ARM back-end. > The problem here is that the expand pass used mulhisi3 which has the > 'internal' modes set one way, and then combine tries to insert that into > a multiply-and-accumulate form, but recog rejects it because the > 'internal' modes of maddhidi4 are set a different way. > > Now, mulhisi3 is currently written using MULT:SI mode, and I can't see > how this could be any other way - it's an SI mode insn. If we want to be > able to insert that into maddhidi4, and we do, I don't see any way to do > that if it uses a MULT:DI - it just isn't compatible. > > I have regression tested the patch for correctness, but I haven't proven > that I haven't produced worse code in some other case. > > I could introduce a second pattern, say "*maddhidi4_2", so that both > patterns match and then there could be no regressions, right? However, I > don't believe it's necessary. > So what happens to a variation of your testcase: long long foolong (long long x, short *a, short *b) { return x + (long long)*a * (long long)*b; } With your patch? This should generate identical code to your original test-case. R. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-28 15:08 ` Richard Earnshaw @ 2011-01-28 15:55 ` Andrew Stubbs 2011-01-28 16:01 ` Richard Earnshaw 2011-02-09 6:31 ` [patch][ARM] " Hans-Peter Nilsson 1 sibling, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-01-28 15:55 UTC (permalink / raw) To: Richard Earnshaw; +Cc: gcc-patches On 28/01/11 14:12, Richard Earnshaw wrote: > So what happens to a variation of your testcase: > > long long foolong (long long x, short *a, short *b) > { > return x + (long long)*a * (long long)*b; > } > > With your patch? This should generate identical code to your original > test-case. > The patch has no effect on that testcase - it's broken in some other way, I think, and the same with and without my patch: ldrsh r3, [r3, #0] ldrsh r2, [r2, #0] push {r4, r5} asrs r4, r3, #31 asrs r5, r2, #31 mul r4, r2, r4 mla r4, r3, r5, r4 umull r2, r3, r2, r3 adds r3, r4, r3 adds r0, r0, r2 adc r1, r1, r3 pop {r4, r5} bx lr Hmmm, that probably doesn't add anything useful to the discussion. :( I'll add that one to the todo list ... Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-28 15:55 ` Andrew Stubbs @ 2011-01-28 16:01 ` Richard Earnshaw 2011-03-25 16:27 ` Andrew Stubbs 0 siblings, 1 reply; 27+ messages in thread From: Richard Earnshaw @ 2011-01-28 16:01 UTC (permalink / raw) To: Andrew Stubbs; +Cc: gcc-patches On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote: > On 28/01/11 14:12, Richard Earnshaw wrote: > > So what happens to a variation of your testcase: > > > > long long foolong (long long x, short *a, short *b) > > { > > return x + (long long)*a * (long long)*b; > > } > > > > With your patch? This should generate identical code to your original > > test-case. > > > > The patch has no effect on that testcase - it's broken in some other > way, I think, and the same with and without my patch: > > ldrsh r3, [r3, #0] > ldrsh r2, [r2, #0] > push {r4, r5} > asrs r4, r3, #31 > asrs r5, r2, #31 > mul r4, r2, r4 > mla r4, r3, r5, r4 > umull r2, r3, r2, r3 > adds r3, r4, r3 > adds r0, r0, r2 > adc r1, r1, r3 > pop {r4, r5} > bx lr > > Hmmm, that probably doesn't add anything useful to the discussion. :( > > I'll add that one to the todo list ... > > Andrew > Ouch! I though that used to work :-( R. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-28 16:01 ` Richard Earnshaw @ 2011-03-25 16:27 ` Andrew Stubbs 2011-04-15 11:23 ` Andrew Stubbs 0 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-03-25 16:27 UTC (permalink / raw) To: Richard Earnshaw; +Cc: gcc-patches On 28/01/11 15:20, Richard Earnshaw wrote: > > On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote: >> On 28/01/11 14:12, Richard Earnshaw wrote: >>> So what happens to a variation of your testcase: >>> >>> long long foolong (long long x, short *a, short *b) >>> { >>> return x + (long long)*a * (long long)*b; >>> } >>> >>> With your patch? This should generate identical code to your original >>> test-case. >>> >> >> The patch has no effect on that testcase - it's broken in some other >> way, I think, and the same with and without my patch: >> >> ldrsh r3, [r3, #0] >> ldrsh r2, [r2, #0] >> push {r4, r5} >> asrs r4, r3, #31 >> asrs r5, r2, #31 >> mul r4, r2, r4 >> mla r4, r3, r5, r4 >> umull r2, r3, r2, r3 >> adds r3, r4, r3 >> adds r0, r0, r2 >> adc r1, r1, r3 >> pop {r4, r5} >> bx lr >> >> Hmmm, that probably doesn't add anything useful to the discussion. :( >> >> I'll add that one to the todo list ... >> >> Andrew >> > > Ouch! I though that used to work :-( I looked at this one again, but on a second inspection I'm not sure there's much wrong with it? When I wrote the above I thought that there was a 64-bit multiply instruction, but now I look more closely I see there isn't, hence the above. It does two 16-bit loads, sign-extends the inputs to 64-bit, does a 64-bit -> 64-bit multiply, and then adds 'x'. Can the umull/add/add/adc be optimized using umlal? It's too late on a Friday to workout what's going on with the carries .... Also, I don't fully understand why the compiler can't just disregard the casts and use maddhidi4? Isn't that mathematically equivalent in this case? When you say it used to work, what did it use to look like? 
Thanks Andrew
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-03-25 16:27 ` Andrew Stubbs @ 2011-04-15 11:23 ` Andrew Stubbs 2011-05-03 9:08 ` Bernd Schmidt 0 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-04-15 11:23 UTC (permalink / raw) Cc: Richard Earnshaw, gcc-patches Ping. On 25/03/11 16:19, Andrew Stubbs wrote: > On 28/01/11 15:20, Richard Earnshaw wrote: >> >> On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote: >>> On 28/01/11 14:12, Richard Earnshaw wrote: >>>> So what happens to a variation of your testcase: >>>> >>>> long long foolong (long long x, short *a, short *b) >>>> { >>>> return x + (long long)*a * (long long)*b; >>>> } >>>> >>>> With your patch? This should generate identical code to your original >>>> test-case. >>>> >>> >>> The patch has no effect on that testcase - it's broken in some other >>> way, I think, and the same with and without my patch: >>> >>> ldrsh r3, [r3, #0] >>> ldrsh r2, [r2, #0] >>> push {r4, r5} >>> asrs r4, r3, #31 >>> asrs r5, r2, #31 >>> mul r4, r2, r4 >>> mla r4, r3, r5, r4 >>> umull r2, r3, r2, r3 >>> adds r3, r4, r3 >>> adds r0, r0, r2 >>> adc r1, r1, r3 >>> pop {r4, r5} >>> bx lr >>> >>> Hmmm, that probably doesn't add anything useful to the discussion. :( >>> >>> I'll add that one to the todo list ... >>> >>> Andrew >>> >> >> Ouch! I though that used to work :-( > > > I looked at this one again, but on a second inspection I'm not sure > there's much wrong with it? > > When I wrote the above I thought that there was a 64-bit multiply > instruction, but now I look more closely I see there isn't, hence the > above. It does two 16-bit loads, sign-extends the inputs to 64-bit, does > a 64-bit -> 64-bit multiply, and then adds 'x'. > > Can the umull/add/add/adc be optimized using umlal? It's too late on a > Friday to workout what's going on with the carries .... > > Also, I don't fully understand why the compiler can't just disregard the > casts and use maddhidi4? 
Isn't that mathematically equivalent in this case? > > When you say it used to work, what did it use to look like? > > Thanks > > Andrew >
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-04-15 11:23 ` Andrew Stubbs @ 2011-05-03 9:08 ` Bernd Schmidt 2011-05-03 9:36 ` Andrew Stubbs ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Bernd Schmidt @ 2011-05-03 9:08 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Richard Earnshaw, gcc-patches [-- Attachment #1: Type: text/plain, Size: 798 bytes --] On 04/15/2011 12:54 PM, Andrew Stubbs wrote: > Ping. Trying to unblock this... I think the point is that both examples: long long foolong (long long x, short *a, short *b) { return x + (long long)*a * (long long)*b; } and long long foolong (long long x, short *a, short *b) { return x + *a * *b; } should produce the same code using the maddhidi pattern, and for that to happen, simplify_rtx must produce a canonical form of the 16->64 bit widening multiply. The form that already exists in arm.md looks better so we should pick that. I tried to fix it with the patch below, which unfortunately doesn't work since during combine we don't see the SIGN_EXTEND operations inside the MULT, but two shift operations instead. Maybe you can complete it from here? Bernd [-- Attachment #2: srtx-widen.diff --] [-- Type: text/plain, Size: 2105 bytes --] Index: gcc/simplify-rtx.c =================================================================== --- gcc/simplify-rtx.c (revision 173242) +++ gcc/simplify-rtx.c (working copy) @@ -1000,6 +1000,23 @@ simplify_unary_operation_1 (enum rtx_cod && GET_CODE (XEXP (XEXP (op, 0), 1)) == LABEL_REF) return XEXP (op, 0); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. 
*/ + if (GET_CODE (op) == MULT + && GET_CODE (XEXP (op, 0)) == SIGN_EXTEND + && GET_CODE (XEXP (op, 1)) == SIGN_EXTEND) + { + rtx op0 = XEXP (XEXP (op, 0), 0); + rtx op1 = XEXP (XEXP (op, 1), 0); + enum machine_mode op0_mode = GET_MODE (op0); + enum machine_mode op1_mode = GET_MODE (op1); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (SIGN_EXTEND, mode, + op0, op0_mode), + simplify_gen_unary (SIGN_EXTEND, mode, + op1, op1_mode)); + } + /* Check for a sign extension of a subreg of a promoted variable, where the promotion is sign-extended, and the target mode is the same as the variable's promotion. */ @@ -1071,6 +1088,23 @@ simplify_unary_operation_1 (enum rtx_cod && GET_MODE_SIZE (mode) <= GET_MODE_SIZE (GET_MODE (XEXP (op, 0)))) return rtl_hooks.gen_lowpart_no_emit (mode, op); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT + && GET_CODE (XEXP (op, 0)) == ZERO_EXTEND + && GET_CODE (XEXP (op, 1)) == ZERO_EXTEND) + { + rtx op0 = XEXP (XEXP (op, 0), 0); + rtx op1 = XEXP (XEXP (op, 1), 0); + enum machine_mode op0_mode = GET_MODE (op0); + enum machine_mode op1_mode = GET_MODE (op1); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (ZERO_EXTEND, mode, + op0, op0_mode), + simplify_gen_unary (ZERO_EXTEND, mode, + op1, op1_mode)); + } + /* (zero_extend:M (zero_extend:N <X>)) is (zero_extend:M <X>). */ if (GET_CODE (op) == ZERO_EXTEND) return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0), ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-03 9:08 ` Bernd Schmidt @ 2011-05-03 9:36 ` Andrew Stubbs 2011-05-18 15:32 ` [patch][combine] " Andrew Stubbs 2011-05-24 17:51 ` [patch][simplify-rtx] " Andrew Stubbs 2 siblings, 0 replies; 27+ messages in thread From: Andrew Stubbs @ 2011-05-03 9:36 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Richard Earnshaw, gcc-patches On 03/05/11 10:07, Bernd Schmidt wrote: > I tried to fix it with the patch below, which unfortunately doesn't work > since during combine we don't see the SIGN_EXTEND operations inside the > MULT, but two shift operations instead. Maybe you can complete it from here? I'll take a look. Thanks for the help! :) Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][combine] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-03 9:08 ` Bernd Schmidt 2011-05-03 9:36 ` Andrew Stubbs @ 2011-05-18 15:32 ` Andrew Stubbs 2011-05-19 13:06 ` Bernd Schmidt 2011-05-24 17:51 ` [patch][simplify-rtx] " Andrew Stubbs 2 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-05-18 15:32 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Richard Earnshaw, gcc-patches [-- Attachment #1: Type: text/plain, Size: 2094 bytes --] On 03/05/11 10:07, Bernd Schmidt wrote: > On 04/15/2011 12:54 PM, Andrew Stubbs wrote: >> Ping. > > Trying to unblock this... > > I think the point is that both examples: > > long long foolong (long long x, short *a, short *b) > { > return x + (long long)*a * (long long)*b; > } > > and > > > long long foolong (long long x, short *a, short *b) > { > return x + *a * *b; > } > > should produce the same code using the maddhidi pattern, and for that to > happen, simplify_rtx must produce a canonical form of the 16->64 bit > widening multiply. The form that already exists in arm.md looks better > so we should pick that. > > I tried to fix it with the patch below, which unfortunately doesn't work > since during combine we don't see the SIGN_EXTEND operations inside the > MULT, but two shift operations instead. Maybe you can complete it from here? This patch fixes my original testcase (the second one above), and so can be a complete replacement for that patch, if preferred? I did originally want to do it without exporting the function from combine.c (since nothing else appears to be exported like that), but it required moving too much code between files. The patch does not fix the first test case above because combine does not recognise the testcase as a HI -> DI multiply-and-add. It can't do it because it has already combined the HI load with a HI->DI sign-extend, and there's neither a multiply that supports mem arguments, nor a multiply with DI mode inputs. 
I could add define_and_split patterns to catch these cases, but that seems the wrong way to do it ... The real problem is that the widening-multiply tree optimization pass does not support multiplies that widen by more than one mode. As far as I can tell, ARM is the only backend that tries to do this, although cris has a comment to the effect that it would use it if it worked. I am investigating a way to fix this by introducing an optab matrix akin to the existing one for conversions, but that's another patch. Is this patch OK? (Assuming my tests pass.) Andrew [-- Attachment #2: canonical-mla.patch --] [-- Type: text/x-patch, Size: 3926 bytes --] 2011-05-18 Bernd Schmidt <bernds@codesourcery.com> Andrew Stubbs <ams@codesourcery.com> gcc/ * combine.c (make_compound_operation): Remove 'static'. * rtl.h (make_compound_operation): New prototype. * simplify-rtx.c (simplify_unary_operation_1): Create a new canonical form for widening multiplies. --- a/gcc/combine.c +++ b/gcc/combine.c @@ -427,7 +427,6 @@ static const_rtx expand_field_assignment (const_rtx); static rtx make_extraction (enum machine_mode, rtx, HOST_WIDE_INT, rtx, unsigned HOST_WIDE_INT, int, int, int); static rtx extract_left_shift (rtx, int); -static rtx make_compound_operation (rtx, enum rtx_code); static int get_pos_from_mask (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT *); static rtx canon_reg_for_combine (rtx, rtx); @@ -7516,7 +7515,7 @@ extract_left_shift (rtx x, int count) being kludges), it is MEM. When processing the arguments of a comparison or a COMPARE against zero, it is COMPARE. 
*/ -static rtx +rtx make_compound_operation (rtx x, enum rtx_code in_code) { enum rtx_code code = GET_CODE (x); --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -2372,6 +2372,7 @@ extern unsigned int extended_count (const_rtx, enum machine_mode, int); extern rtx remove_death (unsigned int, rtx); extern void dump_combine_stats (FILE *); extern void dump_combine_total_stats (FILE *); +extern rtx make_compound_operation (rtx, enum rtx_code); /* In cfgcleanup.c */ extern void delete_dead_jumptables (void); --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -989,6 +989,8 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) break; case SIGN_EXTEND: + op = make_compound_operation (op, SET); + /* (sign_extend (truncate (minus (label_ref L1) (label_ref L2)))) becomes just the MINUS if its mode is MODE. This allows folding switch statements on machines using casesi (such as @@ -1000,6 +1002,23 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_CODE (XEXP (XEXP (op, 0), 1)) == LABEL_REF) return XEXP (op, 0); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT + && GET_CODE (XEXP (op, 0)) == SIGN_EXTEND + && GET_CODE (XEXP (op, 1)) == SIGN_EXTEND) + { + rtx op0 = XEXP (XEXP (op, 0), 0); + rtx op1 = XEXP (XEXP (op, 1), 0); + enum machine_mode op0_mode = GET_MODE (op0); + enum machine_mode op1_mode = GET_MODE (op1); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (SIGN_EXTEND, mode, + op0, op0_mode), + simplify_gen_unary (SIGN_EXTEND, mode, + op1, op1_mode)); + } + /* Check for a sign extension of a subreg of a promoted variable, where the promotion is sign-extended, and the target mode is the same as the variable's promotion. 
*/ @@ -1071,6 +1090,23 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_MODE_SIZE (mode) <= GET_MODE_SIZE (GET_MODE (XEXP (op, 0)))) return rtl_hooks.gen_lowpart_no_emit (mode, op); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT + && GET_CODE (XEXP (op, 0)) == ZERO_EXTEND + && GET_CODE (XEXP (op, 1)) == ZERO_EXTEND) + { + rtx op0 = XEXP (XEXP (op, 0), 0); + rtx op1 = XEXP (XEXP (op, 1), 0); + enum machine_mode op0_mode = GET_MODE (op0); + enum machine_mode op1_mode = GET_MODE (op1); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (ZERO_EXTEND, mode, + op0, op0_mode), + simplify_gen_unary (ZERO_EXTEND, mode, + op1, op1_mode)); + } + /* (zero_extend:M (zero_extend:N <X>)) is (zero_extend:M <X>). */ if (GET_CODE (op) == ZERO_EXTEND) return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0), ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][combine] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-18 15:32 ` [patch][combine] " Andrew Stubbs @ 2011-05-19 13:06 ` Bernd Schmidt 0 siblings, 0 replies; 27+ messages in thread From: Bernd Schmidt @ 2011-05-19 13:06 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Richard Earnshaw, gcc-patches On 05/18/2011 11:55 AM, Andrew Stubbs wrote: > > case SIGN_EXTEND: > + op = make_compound_operation (op, SET); > + I would have expected that bit to go into the MULT case (doing it for both operands), so that the effect is limited to just making widening multiplies more obvious. Also, using make_compound_operation unchanged is probably not safe - it calls SUBST which I think is only valid during combine. I think you may have to break out the non-recursive part into a make_compound_operation_1 (probably to be placed into simplify-rtx.c). Bernd ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-03 9:08 ` Bernd Schmidt 2011-05-03 9:36 ` Andrew Stubbs 2011-05-18 15:32 ` [patch][combine] " Andrew Stubbs @ 2011-05-24 17:51 ` Andrew Stubbs 2011-05-24 21:00 ` Joseph S. Myers 2 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-05-24 17:51 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Richard Earnshaw, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1359 bytes --] On 03/05/11 10:07, Bernd Schmidt wrote: > I tried to fix it with the patch below, which unfortunately doesn't work > since during combine we don't see the SIGN_EXTEND operations inside the > MULT, but two shift operations instead. Maybe you can complete it from here? I've tried to make this patch go various ways, and I always find a test case that doesn't quite work. I think we're approaching it from the wrong angle. The problem is that widening multiplies are required to be defined as (mult (extend ..) (extend ..)), but when the combiner tries to build a widening multiply pattern from a regular multiply it naturally ends up as (extend (mult .. ..)). The result is that the patch Benrd posted made existing widening multiplies wider, but failed to convert regular multiplies to widening ones. I've created this new, simpler patch that converts (extend (mult a b)) into (mult (extend a) (extend b)) regardless of what 'a' and 'b' might be. (These are then simplified and superfluous extends removed, of course.) I find that this patch fixes all the testcases I have, and permitted me to add support for ARM smlalbt/smlaltb/smlaltt also (I'll post that in a separate patch). It does assume that the outer sign_extend/zero_extend indicates the inner extend types though, so I'm not sure if there's a problem there? OK? 
Andrew [-- Attachment #2: canonical-mul.patch --] [-- Type: text/x-patch, Size: 3143 bytes --] 2011-05-24 Bernd Schmidt <bernds@codesourcery.com> Andrew Stubbs <ams@codesourcery.com> gcc/ * simplify-rtx.c (simplify_unary_operation_1): Create a new canonical form for widening multiplies. * doc/md.texi (Canonicalization of Instructions): Document widening multiply canonicalization. gcc/testsuite/ * gcc.target/arm/mla-2.c: New test. --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5840,6 +5840,11 @@ Equality comparisons of a group of bits (usually a single bit) with zero will be written using @code{zero_extract} rather than the equivalent @code{and} or @code{sign_extract} operations. +@cindex @code{mult}, canonicalization of +@item +@code{(sign_extend:@var{m1} (mult:@var{m2} @var{x} @var{y}))} is converted +to @code{(mult:@var{m1} (sign_extend:@var{m1} @var{x}) (sign_extend:@var{m1} @var{y}))}, and likewise for @code{zero_extract}. + @end itemize Further canonicalization rules are defined in the function --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -1000,6 +1000,21 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_CODE (XEXP (XEXP (op, 0), 1)) == LABEL_REF) return XEXP (op, 0); + /* Convert (sign_extend (mult ..)) to a canonical widening + mulitplication (mult (sign_extend ..) (sign_extend ..)). */ + if (GET_CODE (op) == MULT && GET_MODE (op) < mode) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum machine_mode lhs_mode = GET_MODE (lhs); + enum machine_mode rhs_mode = GET_MODE (rhs); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (SIGN_EXTEND, mode, + lhs, lhs_mode), + simplify_gen_unary (SIGN_EXTEND, mode, + rhs, rhs_mode)); + } + /* Check for a sign extension of a subreg of a promoted variable, where the promotion is sign-extended, and the target mode is the same as the variable's promotion. 
*/ @@ -1071,6 +1086,21 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_MODE_SIZE (mode) <= GET_MODE_SIZE (GET_MODE (XEXP (op, 0)))) return rtl_hooks.gen_lowpart_no_emit (mode, op); + /* Convert (zero_extend (mult ..)) to a canonical widening + mulitplication (mult (zero_extend ..) (zero_extend ..)). */ + if (GET_CODE (op) == MULT && GET_MODE (op) < mode) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum machine_mode lhs_mode = GET_MODE (lhs); + enum machine_mode rhs_mode = GET_MODE (rhs); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (ZERO_EXTEND, mode, + lhs, lhs_mode), + simplify_gen_unary (ZERO_EXTEND, mode, + rhs, rhs_mode)); + } + /* (zero_extend:M (zero_extend:N <X>)) is (zero_extend:M <X>). */ if (GET_CODE (op) == ZERO_EXTEND) return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0), --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mla-2.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=armv7-a" } */ + +long long foolong (long long x, short *a, short *b) +{ + return x + *a * *b; +} + +/* { dg-final { scan-assembler "smlalbb" } } */ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-24 17:51 ` [patch][simplify-rtx] " Andrew Stubbs @ 2011-05-24 21:00 ` Joseph S. Myers 2011-05-25 9:47 ` Andrew Stubbs 2011-05-25 13:42 ` Andrew Stubbs 0 siblings, 2 replies; 27+ messages in thread From: Joseph S. Myers @ 2011-05-24 21:00 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On Tue, 24 May 2011, Andrew Stubbs wrote: > I've created this new, simpler patch that converts > > (extend (mult a b)) > > into > > (mult (extend a) (extend b)) > > regardless of what 'a' and 'b' might be. (These are then simplified and > superfluous extends removed, of course.) Are there some missing conditions here? The two aren't equivalent in general - (extend:SI (mult:HI a b)) multiplies the HImode values in HImode (with modulo arithmetic on overflow) before extending the possibly wrapped result to SImode. You'd need a and b themselves to be extended from narrower modes in such a way that if you interpret the extended values in the signedness of the outer extension, the result of the multiplication is exactly representable in the mode of the multiplication. (For example, if both values are extended from QImode, and all extensions have the same signedness, that would be OK. There are cases that are OK where not all extensions have the same signedness, e.g. (sign_extend:DI (mult:SI a b)) where a and b are zero-extended from HImode or QImode, at least one from QImode, though there the outer extension is equivalent to a zero-extension.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 27+ messages in thread
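Joseph's non-equivalence argument can be made concrete with a small model. The sketch below (hypothetical Python, not part of the original thread) emulates RTL's modulo semantics with masked arithmetic: multiplying in HImode and then sign-extending differs from the widening multiply as soon as the 16-bit product wraps, while operands that were themselves extended from QImode can never wrap a 16-bit multiply, so there the two forms agree.

```python
def sext(v, bits):
    """Interpret the low `bits` bits of v as a signed two's-complement value."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

# (sign_extend:SI (mult:HI a b)): multiply modulo 2**16, then widen.
a, b = 300, 300
narrow_then_widen = sext(a * b, 16)          # 90000 wraps to 24464
# (mult:SI (sign_extend:SI a) (sign_extend:SI b)): widen first, exact.
widen_then_mult = sext(a, 16) * sext(b, 16)  # 90000
assert narrow_then_widen != widen_then_mult

# If both operands are sign-extended from QImode, the 16-bit product is
# bounded by 128 * 128 = 16384 and never wraps, so the forms coincide.
for c in (-128, -100, 1, 127):
    for d in (-128, -3, 100, 127):
        p = sext(c, 8) * sext(d, 8)
        assert sext(p, 16) == p
```

This is exactly the overflow case the later revisions of the patch have to rule out before rewriting the RTL.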
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-24 21:00 ` Joseph S. Myers @ 2011-05-25 9:47 ` Andrew Stubbs 2011-05-25 12:34 ` Joseph S. Myers 2011-05-25 13:42 ` Andrew Stubbs 1 sibling, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-05-25 9:47 UTC (permalink / raw) To: Joseph S. Myers; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On 24/05/11 20:35, Joseph S. Myers wrote: > On Tue, 24 May 2011, Andrew Stubbs wrote: > >> I've created this new, simpler patch that converts >> >> (extend (mult a b)) >> >> into >> >> (mult (extend a) (extend b)) >> >> regardless of what 'a' and 'b' might be. (These are then simplified and >> superfluous extends removed, of course.) > > Are there some missing conditions here? The two aren't equivalent in > general - (extend:SI (mult:HI a b)) multiplies the HImode values in HImode > (with modulo arithmetic on overflow) before extending the possibly wrapped > result to SImode. You'd need a and b themselves to be extended from > narrower modes in such a way that if you interpret the extended values in > the signedness of the outer extension, the result of the multiplication is > exactly representable in the mode of the multiplication. (For example, if > both values are extended from QImode, and all extensions have the same > signedness, that would be OK. There are cases that are OK where not all > extensions have the same signedness, e.g. (sign_extend:DI (mult:SI a b)) > where a and b are zero-extended from HImode or QImode, at least one from > QImode, though there the outer extension is equivalent to a > zero-extension.) So, you're saying that promoting a regular multiply to a widening multiply isn't a valid transformation anyway? I suppose that does make sense. I knew something was too easy. OK, I'll go try again. :) Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-25 9:47 ` Andrew Stubbs @ 2011-05-25 12:34 ` Joseph S. Myers 0 siblings, 0 replies; 27+ messages in thread From: Joseph S. Myers @ 2011-05-25 12:34 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On Wed, 25 May 2011, Andrew Stubbs wrote: > So, you're saying that promoting a regular multiply to a widening multiply > isn't a valid transformation anyway? I suppose that does make sense. I knew In general, yes. RTL always has modulo semantics (except for division and remainder by -1); all optimizations based on undefinedness of overflow (in the absence of -fwrapv) happen at tree/GIMPLE level, where signed and unsigned types are still distinct. (So you could promote a regular multiply of signed types at GIMPLE level in the absence of -fwrapv/-ftrapv, but not at RTL level and not for unsigned types at GIMPLE level.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-24 21:00 ` Joseph S. Myers 2011-05-25 9:47 ` Andrew Stubbs @ 2011-05-25 13:42 ` Andrew Stubbs 2011-05-25 13:48 ` Joseph S. Myers 1 sibling, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-05-25 13:42 UTC (permalink / raw) To: Joseph S. Myers; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1690 bytes --] On 24/05/11 20:35, Joseph S. Myers wrote: > On Tue, 24 May 2011, Andrew Stubbs wrote: > >> I've created this new, simpler patch that converts >> >> (extend (mult a b)) >> >> into >> >> (mult (extend a) (extend b)) >> >> regardless of what 'a' and 'b' might be. (These are then simplified and >> superfluous extends removed, of course.) > > Are there some missing conditions here? The two aren't equivalent in > general - (extend:SI (mult:HI a b)) multiplies the HImode values in HImode > (with modulo arithmetic on overflow) before extending the possibly wrapped > result to SImode. Ok, I've now modified my patch to prevent it widening regular multiplies. It now converts (extend (mult (extend a) (extend b))) to (mult (newextend a) (newextend b)) But I also have it convert (extend (mult (shift a) (extend b))) to (mult (newextend (shift a)) (newextend b)) The latter case is to catch widening multiplies that extract a subreg using a shift. I don't understand why it doesn't just use subreg in the first place, but apparently it doesn't, and changing it to do that would no doubt break many existing machine descriptions. The latter case also happens to catch cases where an extend is represented by (ashiftrt (ashift x)), which is nice. I know that, potentially, not all shifted operands are going to be widening multiplies, but I *think* this should be safe because other random shift values are unlikely to match a real widening mult instruction (and if they do then the code would already be broken). 
If somebody knows a reason why this isn't safe then I think I'm going to need some help figuring out what conditions to use. OK? Andrew [-- Attachment #2: canonical-mul.patch --] [-- Type: text/x-patch, Size: 4484 bytes --] 2011-05-25 Bernd Schmidt <bernds@codesourcery.com> Andrew Stubbs <ams@codesourcery.com> gcc/ * simplify-rtx.c (simplify_unary_operation_1): Canonicalize widening multiplies. * doc/md.texi (Canonicalization of Instructions): Document widening multiply canonicalization. gcc/testsuite/ * gcc.target/arm/mla-2.c: New test. --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5840,6 +5840,21 @@ Equality comparisons of a group of bits (usually a single bit) with zero will be written using @code{zero_extract} rather than the equivalent @code{and} or @code{sign_extract} operations. +@cindex @code{mult}, canonicalization of +@item +@code{(sign_extend:@var{m1} (mult:@var{m2} (sign_extend:@var{m2} @var{x}) +(sign_extend:@var{m2} @var{y})))} is converted to @code{(mult:@var{m1} +(sign_extend:@var{m1} @var{x}) (sign_extend:@var{m1} @var{y}))}, and likewise +for @code{zero_extend}. + +@item +@code{(sign_extend:@var{m1} (mult:@var{m2} (ashiftrt:@var{m2} +@var{x} @var{s}) (sign_extend:@var{m2} @var{y})))} is converted to +@code{(mult:@var{m1} (sign_extend:@var{m1} (ashiftrt:@var{m2} @var{x} @var{s})) +(sign_extend:@var{m1} @var{y}))}, and likewise for patterns using @code{zero_extend} +and @code{lshiftrt}. If the second operand of @code{mult} is also a shift, +then that is extended also. + @end itemize Further canonicalization rules are defined in the function --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -1000,6 +1000,34 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_CODE (XEXP (XEXP (op, 0), 1)) == LABEL_REF) return XEXP (op, 0); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. 
*/ + if (GET_CODE (op) == MULT) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum rtx_code lcode = GET_CODE (lhs); + enum rtx_code rcode = GET_CODE (rhs); + + /* Widening multiplies usually extend both operands, but sometimes + they use a shift to extract a portion of a register. We assume + it is safe to widen all such operands because other examples + won't match real instructions. */ + if ((lcode == SIGN_EXTEND || lcode == ASHIFTRT) + && (rcode == SIGN_EXTEND || rcode == ASHIFTRT)) + { + enum machine_mode lmode = GET_MODE (lhs); + enum machine_mode rmode = GET_MODE (lhs); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (SIGN_EXTEND, + mode, + lhs, lmode), + simplify_gen_unary (SIGN_EXTEND, + mode, + rhs, rmode)); + } + } + /* Check for a sign extension of a subreg of a promoted variable, where the promotion is sign-extended, and the target mode is the same as the variable's promotion. */ @@ -1071,6 +1099,34 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_MODE_SIZE (mode) <= GET_MODE_SIZE (GET_MODE (XEXP (op, 0)))) return rtl_hooks.gen_lowpart_no_emit (mode, op); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum rtx_code lcode = GET_CODE (lhs); + enum rtx_code rcode = GET_CODE (rhs); + + /* Widening multiplies usually extend both operands, but sometimes + they use a shift to extract a portion of a register. We assume + it is safe to widen all such operands because other examples + won't match real instructions. 
*/ + if ((lcode == ZERO_EXTEND || lcode == LSHIFTRT) + && (rcode == ZERO_EXTEND || rcode == LSHIFTRT)) + { + enum machine_mode lmode = GET_MODE (lhs); + enum machine_mode rmode = GET_MODE (lhs); + return simplify_gen_binary (MULT, mode, + simplify_gen_unary (ZERO_EXTEND, + mode, + lhs, lmode), + simplify_gen_unary (ZERO_EXTEND, + mode, + rhs, rmode)); + } + } + /* (zero_extend:M (zero_extend:N <X>)) is (zero_extend:M <X>). */ if (GET_CODE (op) == ZERO_EXTEND) return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0), --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mla-2.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=armv7-a" } */ + +long long foolong (long long x, short *a, short *b) +{ + return x + *a * *b; +} + +/* { dg-final { scan-assembler "smlalbb" } } */ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-25 13:42 ` Andrew Stubbs @ 2011-05-25 13:48 ` Joseph S. Myers 2011-05-25 14:01 ` Andrew Stubbs 0 siblings, 1 reply; 27+ messages in thread From: Joseph S. Myers @ 2011-05-25 13:48 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On Wed, 25 May 2011, Andrew Stubbs wrote: > I know that, potentially, not all shifted operands are going to be widening > multiplies, but I *think* this should be safe because other random shift > values are unlikely to match a real widening mult instruction (and if they do > then the code would already be broken). If somebody knows a reason why this > isn't safe then I think I'm going to need some help figuring out what > conditions to use. Random supposition like that is not a sensible basis for modifying GCC. I haven't managed to produce an example of code demonstrating the problem, but that's probably because I'm not sufficiently familiar with all the RTL optimizers. Where is the guarantee that the inputs to these functions must represent real instructions, or that the outputs will only be used if they represent real instructions? Where are the assertions to ensure that wrong code is not quietly generated if this is not the case? Where is the documentation of what instruction patterns it is not permitted to put in .md files because they would violate the assumptions about what instructions you are permitted to represent in RTL? How have you checked there are no existing problematic instruction patterns? RTL has defined abstract semantics and RTL transformations should be ones that are valid in accordance with those semantics, with proper assertions if there are additional constraints on the input passed to a function. This means actually counting the numbers of variable bits in the operands to determine whether the multiplication could overflow. -- Joseph S. 
Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-25 13:48 ` Joseph S. Myers @ 2011-05-25 14:01 ` Andrew Stubbs 2011-05-25 14:27 ` Joseph S. Myers 0 siblings, 1 reply; 27+ messages in thread From: Andrew Stubbs @ 2011-05-25 14:01 UTC (permalink / raw) To: Joseph S. Myers; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On 25/05/11 14:19, Joseph S. Myers wrote: > RTL has defined abstract semantics and RTL transformations should be ones > that are valid in accordance with those semantics, with proper assertions > if there are additional constraints on the input passed to a function. > This means actually counting the numbers of variable bits in the operands > to determine whether the multiplication could overflow. Ok, fair enough, so how can I identify a valid subreg extraction that is defined in terms of shifts? The case that I care about is simple enough: (mult:SI (ashiftrt:SI (reg:SI rM) (const_int 16)) (sign_extend:SI (subreg:HI (reg:SI rN) 0))) I guess that's just equivalent to this: (mult:SI (sign_extend:SI (subreg:HI (reg:SI rM) 4))) (sign_extend:SI (subreg:HI (reg:SI rN) 0))) but it chooses not to represent it that way, which is less than helpful in this case. So I could just scan for that exact pattern, or perhaps look for shift sizes that are half the size of the register, or some such thing, but is that general enough? Or is it too general again? Is there anything else I've missed? Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
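The subreg-extraction equivalence Andrew describes can be checked numerically. The sketch below (hypothetical Python, not from the thread) models a 32-bit register: an arithmetic right shift by 16 yields the same value as sign-extending the top halfword, which is why combine's shift-based form and the subreg form denote the same operand.

```python
def sext(v, bits):
    """Interpret the low `bits` bits of v as a signed two's-complement value."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def ashiftrt(v, bits, shift):
    # (ashiftrt:SI x shift): Python's >> on a sign-extended int is already
    # an arithmetic shift.
    return sext(v, bits) >> shift

def top_half_extended(v):
    # (sign_extend:SI (subreg:HI x <top half>)): bits 16..31, sign-extended.
    return sext(v >> 16, 16)

for x in (0, 1, 0x12345678, 0x8001FFFF, 0xFFFF0000, 0x7FFF1234):
    assert ashiftrt(x, 32, 16) == top_half_extended(x)
```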
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-25 14:01 ` Andrew Stubbs @ 2011-05-25 14:27 ` Joseph S. Myers 2011-05-26 13:57 ` Andrew Stubbs 0 siblings, 1 reply; 27+ messages in thread From: Joseph S. Myers @ 2011-05-25 14:27 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches On Wed, 25 May 2011, Andrew Stubbs wrote: > On 25/05/11 14:19, Joseph S. Myers wrote: > > RTL has defined abstract semantics and RTL transformations should be ones > > that are valid in accordance with those semantics, with proper assertions > > if there are additional constraints on the input passed to a function. > > This means actually counting the numbers of variable bits in the operands > > to determine whether the multiplication could overflow. > > Ok, fair enough, so how can I identify a valid subreg extraction that is > defined in terms of shifts? The shift must be by a positive constant amount, strictly less than the precision (GET_MODE_PRECISION) of the mode (of the value being shifted). If that applies, the relevant number of bits is the precision of the mode minus the number of bits of the shift. For an extension, just take the number of bits in the inner mode. Add the two numbers of bits; if the result does not exceed the number of bits in the mode (of the operands and the multiplication) then the multiplication won't overflow. As in your patch, either all the operands must be sign-extensions / arithmetic shifts (and then the result is equivalent to a widening signed multiply), or all must be zero-extensions / logical shifts (and the result is a widening unsigned multiply). -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 27+ messages in thread
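Joseph's bit-counting rule is easy to restate as code. The sketch below (illustrative Python; names are not GCC's) computes, for each multiply operand, how many significant bits it can carry — the inner mode's precision for an extension, the mode precision minus the shift count for a constant right shift — and declares the widening rewrite safe only when the sum fits the multiplication's mode.

```python
def significant_bits(op, mode_bits):
    """Bits that can vary in one multiply operand.

    op is ('extend', inner_bits) for a sign/zero extension from an
    inner_bits-wide mode, or ('shift', amount) for a right shift by a
    constant amount."""
    kind, n = op
    if kind == 'extend':
        return n
    if kind == 'shift':
        # The shift must be by a positive constant strictly less than
        # the mode's precision.
        assert 0 < n < mode_bits
        return mode_bits - n
    raise ValueError(kind)

def widening_mult_safe(lhs, rhs, mode_bits):
    # Overflow is impossible iff the operands' significant bits together
    # fit in the multiplication's mode.
    return (significant_bits(lhs, mode_bits)
            + significant_bits(rhs, mode_bits)) <= mode_bits

# Two HImode values extended into an SImode multiply: 16 + 16 <= 32, safe.
assert widening_mult_safe(('extend', 16), ('extend', 16), 32)
# A shift by 16 also leaves 16 significant bits: still safe.
assert widening_mult_safe(('shift', 16), ('extend', 16), 32)
# A shift by only 8 leaves 24 bits: 24 + 16 > 32, could overflow.
assert not widening_mult_safe(('shift', 8), ('extend', 16), 32)
```

This is the check the final version of the patch performs with GET_MODE_PRECISION before calling simplify_gen_binary.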
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-25 14:27 ` Joseph S. Myers @ 2011-05-26 13:57 ` Andrew Stubbs 2011-05-26 14:47 ` Joseph S. Myers 2011-06-02 9:46 ` Richard Earnshaw 0 siblings, 2 replies; 27+ messages in thread From: Andrew Stubbs @ 2011-05-26 13:57 UTC (permalink / raw) To: Joseph S. Myers; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches, patches [-- Attachment #1: Type: text/plain, Size: 643 bytes --] On 25/05/11 14:47, Joseph S. Myers wrote: > The shift must be by a positive constant amount, strictly less than the > precision (GET_MODE_PRECISION) of the mode (of the value being shifted). > If that applies, the relevant number of bits is the precision of the mode > minus the number of bits of the shift. For an extension, just take the > number of bits in the inner mode. Add the two numbers of bits; if the > result does not exceed the number of bits in the mode (of the operands and > the multiplication) then the multiplication won't overflow. I believe the attached should implement what you describe. Is the patch OK now? Andrew [-- Attachment #2: canonical-mul.patch --] [-- Type: text/x-patch, Size: 5764 bytes --] 2011-05-26 Bernd Schmidt <bernds@codesourcery.com> Andrew Stubbs <ams@codesourcery.com> gcc/ * simplify-rtx.c (simplify_unary_operation_1): Canonicalize widening multiplies. * doc/md.texi (Canonicalization of Instructions): Document widening multiply canonicalization. gcc/testsuite/ * gcc.target/arm/mla-2.c: New test. --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5840,6 +5840,23 @@ Equality comparisons of a group of bits (usually a single bit) with zero will be written using @code{zero_extract} rather than the equivalent @code{and} or @code{sign_extract} operations. 
+@cindex @code{mult}, canonicalization of +@item +@code{(sign_extend:@var{m1} (mult:@var{m2} (sign_extend:@var{m2} @var{x}) +(sign_extend:@var{m2} @var{y})))} is converted to @code{(mult:@var{m1} +(sign_extend:@var{m1} @var{x}) (sign_extend:@var{m1} @var{y}))}, and likewise +for @code{zero_extend}. + +@item +@code{(sign_extend:@var{m1} (mult:@var{m2} (ashiftrt:@var{m2} +@var{x} @var{s}) (sign_extend:@var{m2} @var{y})))} is converted +to @code{(mult:@var{m1} (sign_extend:@var{m1} (ashiftrt:@var{m2} +@var{x} @var{s})) (sign_extend:@var{m1} @var{y}))}, and likewise for +patterns using @code{zero_extend} and @code{lshiftrt}. If the second +operand of @code{mult} is also a shift, then that is extended also. +This transformation is only applied when it can be proven that the +original operation had sufficient precision to prevent overflow. + @end itemize Further canonicalization rules are defined in the function --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -1000,6 +1000,48 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_CODE (XEXP (XEXP (op, 0), 1)) == LABEL_REF) return XEXP (op, 0); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum rtx_code lcode = GET_CODE (lhs); + enum rtx_code rcode = GET_CODE (rhs); + + /* Widening multiplies usually extend both operands, but sometimes + they use a shift to extract a portion of a register. */ + if ((lcode == SIGN_EXTEND + || (lcode == ASHIFTRT && CONST_INT_P (XEXP (lhs, 1)))) + && (rcode == SIGN_EXTEND + || (rcode == ASHIFTRT && CONST_INT_P (XEXP (rhs, 1))))) + { + enum machine_mode lmode = GET_MODE (lhs); + enum machine_mode rmode = GET_MODE (rhs); + int bits; + + if (lcode == ASHIFTRT) + /* Number of bits not shifted off the end. 
*/ + bits = GET_MODE_PRECISION (lmode) - INTVAL (XEXP (lhs, 1)); + else /* lcode == SIGN_EXTEND */ + /* Size of inner mode. */ + bits = GET_MODE_PRECISION (GET_MODE (XEXP (lhs, 0))); + + if (rcode == ASHIFTRT) + bits += GET_MODE_PRECISION (rmode) - INTVAL (XEXP (rhs, 1)); + else /* rcode == SIGN_EXTEND */ + bits += GET_MODE_PRECISION (GET_MODE (XEXP (rhs, 0))); + + /* We can only widen multiplies if the result is mathematiclly + equivalent. I.e. if overflow was impossible. */ + if (bits <= GET_MODE_PRECISION (GET_MODE (op))) + return simplify_gen_binary + (MULT, mode, + simplify_gen_unary (SIGN_EXTEND, mode, lhs, lmode), + simplify_gen_unary (SIGN_EXTEND, mode, rhs, rmode)); + } + } + /* Check for a sign extension of a subreg of a promoted variable, where the promotion is sign-extended, and the target mode is the same as the variable's promotion. */ @@ -1071,6 +1113,48 @@ simplify_unary_operation_1 (enum rtx_code code, enum machine_mode mode, rtx op) && GET_MODE_SIZE (mode) <= GET_MODE_SIZE (GET_MODE (XEXP (op, 0)))) return rtl_hooks.gen_lowpart_no_emit (mode, op); + /* Extending a widening multiplication should be canonicalized to + a wider widening multiplication. */ + if (GET_CODE (op) == MULT) + { + rtx lhs = XEXP (op, 0); + rtx rhs = XEXP (op, 1); + enum rtx_code lcode = GET_CODE (lhs); + enum rtx_code rcode = GET_CODE (rhs); + + /* Widening multiplies usually extend both operands, but sometimes + they use a shift to extract a portion of a register. */ + if ((lcode == ZERO_EXTEND + || (lcode == LSHIFTRT && CONST_INT_P (XEXP (lhs, 1)))) + && (rcode == ZERO_EXTEND + || (rcode == LSHIFTRT && CONST_INT_P (XEXP (rhs, 1))))) + { + enum machine_mode lmode = GET_MODE (lhs); + enum machine_mode rmode = GET_MODE (rhs); + int bits; + + if (lcode == LSHIFTRT) + /* Number of bits not shifted off the end. */ + bits = GET_MODE_PRECISION (lmode) - INTVAL (XEXP (lhs, 1)); + else /* lcode == ZERO_EXTEND */ + /* Size of inner mode. 
*/ + bits = GET_MODE_PRECISION (GET_MODE (XEXP (lhs, 0))); + + if (rcode == LSHIFTRT) + bits += GET_MODE_PRECISION (rmode) - INTVAL (XEXP (rhs, 1)); + else /* rcode == ZERO_EXTEND */ + bits += GET_MODE_PRECISION (GET_MODE (XEXP (rhs, 0))); + + /* We can only widen multiplies if the result is mathematiclly + equivalent. I.e. if overflow was impossible. */ + if (bits <= GET_MODE_PRECISION (GET_MODE (op))) + return simplify_gen_binary + (MULT, mode, + simplify_gen_unary (ZERO_EXTEND, mode, lhs, lmode), + simplify_gen_unary (ZERO_EXTEND, mode, rhs, rmode)); + } + } + /* (zero_extend:M (zero_extend:N <X>)) is (zero_extend:M <X>). */ if (GET_CODE (op) == ZERO_EXTEND) return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0), --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mla-2.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=armv7-a" } */ + +long long foolong (long long x, short *a, short *b) +{ + return x + *a * *b; +} + +/* { dg-final { scan-assembler "smlalbb" } } */ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-26 13:57 ` Andrew Stubbs @ 2011-05-26 14:47 ` Joseph S. Myers 2011-06-02 9:46 ` Richard Earnshaw 1 sibling, 0 replies; 27+ messages in thread From: Joseph S. Myers @ 2011-05-26 14:47 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Bernd Schmidt, Richard Earnshaw, gcc-patches, patches On Thu, 26 May 2011, Andrew Stubbs wrote: > On 25/05/11 14:47, Joseph S. Myers wrote: > > The shift must be by a positive constant amount, strictly less than the > > precision (GET_MODE_PRECISION) of the mode (of the value being shifted). > > If that applies, the relevant number of bits is the precision of the mode > > minus the number of bits of the shift. For an extension, just take the > > number of bits in the inner mode. Add the two numbers of bits; if the > > result does not exceed the number of bits in the mode (of the operands and > > the multiplication) then the multiplication won't overflow. > > I believe the attached should implement what you describe. > > Is the patch OK now? It appears to implement the logic I described, so I have no further comments on it (but can't review it). -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-05-26 13:57 ` Andrew Stubbs 2011-05-26 14:47 ` Joseph S. Myers @ 2011-06-02 9:46 ` Richard Earnshaw 2011-06-07 11:05 ` Andrew Stubbs 1 sibling, 1 reply; 27+ messages in thread From: Richard Earnshaw @ 2011-06-02 9:46 UTC (permalink / raw) To: Andrew Stubbs; +Cc: Joseph S. Myers, Bernd Schmidt, gcc-patches, patches On Thu, 2011-05-26 at 14:35 +0100, Andrew Stubbs wrote: > On 25/05/11 14:47, Joseph S. Myers wrote: > > The shift must be by a positive constant amount, strictly less than the > > precision (GET_MODE_PRECISION) of the mode (of the value being shifted). > > If that applies, the relevant number of bits is the precision of the mode > > minus the number of bits of the shift. For an extension, just take the > > number of bits in the inner mode. Add the two numbers of bits; if the > > result does not exceed the number of bits in the mode (of the operands and > > the multiplication) then the multiplication won't overflow. > > I believe the attached should implement what you describe. > > Is the patch OK now? > > Andrew OK. R. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][simplify-rtx] Fix 16-bit -> 64-bit multiply and accumulate 2011-06-02 9:46 ` Richard Earnshaw @ 2011-06-07 11:05 ` Andrew Stubbs 0 siblings, 0 replies; 27+ messages in thread From: Andrew Stubbs @ 2011-06-07 11:05 UTC (permalink / raw) To: Richard Earnshaw; +Cc: Joseph S. Myers, Bernd Schmidt, gcc-patches, patches On 02/06/11 10:46, Richard Earnshaw wrote: > OK. Committed, thanks. Andrew ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch][ARM] Fix 16-bit -> 64-bit multiply and accumulate 2011-01-28 15:08 ` Richard Earnshaw 2011-01-28 15:55 ` Andrew Stubbs @ 2011-02-09 6:31 ` Hans-Peter Nilsson 1 sibling, 0 replies; 27+ messages in thread From: Hans-Peter Nilsson @ 2011-02-09 6:31 UTC (permalink / raw) To: Richard Earnshaw; +Cc: gcc-patches On Fri, 28 Jan 2011, Richard Earnshaw wrote: > Canonical RTL is not just about the order of the operands its about not > having multiple ways of representing the same operation. Order is part > of that, but not the end of it. For example, the canonical form of > > (lt:SI (reg:SI) (const_int 0)) > > is > (lshiftrt:SI (reg:SI) (const_int 31)) I kind of doubt that's canon. At least it shouldn't be that way, except as a variant attempted by combine. If so, ISTM i'd be a complicating factor for the cmpM patterns of most ports. :) (Likewise unnamed variants for combine.) brgds, H-P ^ permalink raw reply [flat|nested] 27+ messages in thread
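Whatever the canonicalization status H-P questions here, the two forms Richard gives are value-equivalent for a 32-bit operand: a logical right shift by 31 exposes the sign bit, which is the 0/1 result of the less-than-zero comparison. A quick model (hypothetical Python, not from the thread):

```python
def sext(v, bits):
    """Interpret the low `bits` bits of v as a signed two's-complement value."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def lt_zero(x):
    # (lt:SI (reg:SI) (const_int 0)) taken as a 0/1 flag value.
    return 1 if sext(x, 32) < 0 else 0

def lshiftrt31(x):
    # (lshiftrt:SI (reg:SI) (const_int 31)): the sign bit lands in bit 0.
    return (x & 0xFFFFFFFF) >> 31

for x in (0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 12345, -1):
    assert lt_zero(x) == lshiftrt31(x)
```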