* [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-14 8:12 Kyrill Tkachov
From: Kyrill Tkachov @ 2015-04-14 8:12 UTC (permalink / raw)
To: GCC Patches; +Cc: rth
[-- Attachment #1: Type: text/plain, Size: 3289 bytes --]
Hi all,
The description of the relevant code at https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html is helpful for this...
The mult synthesis code in some places assumes that a shift operation can execute in parallel with the previous steps
of the algorithm on superscalar machines. However, that assumption is made in the wrong place, i.e. in the alg_add_factor
and alg_sub_factor steps. These two steps shift the previously accumulated subtotal and then add it to (or subtract it from) itself.
The comment says:
>> alg_add_factor total := total * coeff + total;
>> alg_sub_factor total := total * coeff - total;
This means there's a dependency and the shift part cannot execute in parallel with the calculation of the
subtotal. Looking at the code, the only place where a shift can execute in parallel with a subtotal
calculation is in the case of:
>> alg_add_t_m2 total := total + multiplicand * coeff;
>> alg_sub_t_m2 total := total - multiplicand * coeff;
But alg_add_t_m2 is only ever used with a shift of 0, so alg_sub_t_m2 is the only place where
it makes sense to make that assumption.
This patch fixes the cost calculations by moving the assumption that, in a shift+sub operation,
the shift can execute in parallel with the subtotal calculation to the alg_sub_t_m2 case and removing
it from the alg_add_factor and alg_sub_factor cases.
Is this a correct line of reasoning, or am I misunderstanding something in the code?
Of course the effect on codegen of this patch depends a lot on the rtx costs in the backend.
On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being exceeded in more cases and the
expansion code choosing instead to do a move-immediate and a mul instruction.
No regressions on SPEC2006 on a Cortex-A57.
For example, for code:
long f0 (int x, int y)
{
  return (long)x * 6L;
}

int f1(int x)
{
  return x * 10;
}

int f2(int x)
{
  return x * 100;
}

int f3(int x)
{
  return x * 20;
}

int f4(int x)
{
  return x * 25;
}

int f5(int x)
{
  return x * 11;
}
before this patch we generate:
f0:
sxtw x0, w0
lsl x1, x0, 3
sub x0, x1, x0, lsl 1
ret
f1:
lsl w1, w0, 3
add w0, w1, w0, lsl 1
ret
f2:
mov w1, 100
mul w0, w0, w1
ret
f3:
lsl w1, w0, 4
add w0, w1, w0, lsl 2
ret
f4:
lsl w1, w0, 5
sub w1, w1, w0, lsl 3
add w0, w1, w0
ret
f5:
lsl w1, w0, 4
sub w1, w1, w0, lsl 2
sub w0, w1, w0
ret
with this patch we generate:
f0:
mov w1, 6
smull x0, w0, w1
ret
f1:
add w0, w0, w0, lsl 2
lsl w0, w0, 1
ret
f2:
mov w1, 100
mul w0, w0, w1
ret
f3:
add w0, w0, w0, lsl 2
lsl w0, w0, 2
ret
f4:
mov w1, 25
mul w0, w0, w1
ret
f5:
mov w1, 11
mul w0, w0, w1
ret
Bootstrapped and tested on arm, aarch64, x86_64-linux.
Ok for trunk?
Thanks,
Kyrill
2015-04-14 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* expmed.c: (synth_mult): Only assume overlapping
shift with previous steps in alg_sub_t_m2 case.
[-- Attachment #2: expmed-shiftsub.patch --]
[-- Type: text/x-patch; name=expmed-shiftsub.patch, Size: 3843 bytes --]
commit dce3d3ba2e16a812348e4d8c4184c3779dc47c3d
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date: Thu Mar 12 17:38:20 2015 +0000
[expmed] Properly account for the cost and latency of shift+sub ops when synthesizing mults
diff --git a/gcc/expmed.c b/gcc/expmed.c
index 4fc35df..961b79e 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2653,14 +2653,28 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
m = exact_log2 (-orig_t + 1);
if (m >= 0 && m < maxm)
{
- op_cost = shiftsub1_cost (speed, mode, m);
+ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
+ /* If the target has a cheap shift-and-subtract insn use
+ that in preference to a shift insn followed by a sub insn.
+ Assume that the shift-and-sub is "atomic" with a latency
+ equal to it's cost, otherwise assume that on superscalar
+ hardware the shift may be executed concurrently with the
+ earlier steps in the algorithm. */
+ if (shiftsub1_cost (speed, mode, m) <= op_cost)
+ {
+ op_cost = shiftsub1_cost (speed, mode, m);
+ op_latency = op_cost;
+ }
+ else
+ op_latency = add_cost (speed, mode);
+
new_limit.cost = best_cost.cost - op_cost;
- new_limit.latency = best_cost.latency - op_cost;
+ new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m,
&new_limit, mode);
alg_in->cost.cost += op_cost;
- alg_in->cost.latency += op_cost;
+ alg_in->cost.latency += op_latency;
if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
{
best_cost = alg_in->cost;
@@ -2693,20 +2707,12 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_add_factor))
{
- /* If the target has a cheap shift-and-add instruction use
- that in preference to a shift insn followed by an add insn.
- Assume that the shift-and-add is "atomic" with a latency
- equal to its cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftadd_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftadd_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftadd_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftadd_cost (speed, mode, m);
+
+ op_latency = op_cost;
+
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
@@ -2731,20 +2737,11 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_sub_factor))
{
- /* If the target has a cheap shift-and-subtract insn use
- that in preference to a shift insn followed by a sub insn.
- Assume that the shift-and-sub is "atomic" with a latency
- equal to it's cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftsub0_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftsub0_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftsub0_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftsub0_cost (speed, mode, m);
+
+ op_latency = op_cost;
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-17 19:38 ` Jeff Law
From: Jeff Law @ 2015-04-17 19:38 UTC (permalink / raw)
To: Kyrill Tkachov, GCC Patches; +Cc: rth
On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
>
> Of course the effect on codegen of this patch depends a lot on the rtx
> costs in the backend.
> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
> exceeded in more cases and the
> expansion code choosing instead to do a move-immediate and a mul
> instruction.
> No regressions on SPEC2006 on a Cortex-A57.
>
> For example, for code:
> long f0 (int x, int y)
> {
> return (long)x * 6L;
> }
>
>
> int f1(int x)
> {
> return x * 10;
> }
>
> int f2(int x)
> {
> return x * 100;
> }
>
> int f3(int x)
> {
> return x * 20;
> }
>
> int f4(int x)
> {
> return x * 25;
> }
>
> int f5(int x)
> {
> return x * 11;
> }
Please turn this into a test for the testsuite. It's fine if the
test is specific to AArch64. You may need to break it into 6 individual
tests since what you want to check for in each one may be significantly
different. For example, f0, f4 and f5 you'd probably check for the
constant load & multiply instructions. Not sure how to best test for
what you want in f1-f3.
>
>
> Bootstrapped and tested on arm, aarch64, x86_64-linux.
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> 2015-04-14 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * expmed.c: (synth_mult): Only assume overlapping
> shift with previous steps in alg_sub_t_m2 case.
OK with a test for the testsuite.
jeff
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-20 11:09 ` Kyrill Tkachov
From: Kyrill Tkachov @ 2015-04-20 11:09 UTC (permalink / raw)
To: Jeff Law, GCC Patches; +Cc: rth
Hi Jeff,
On 17/04/15 20:38, Jeff Law wrote:
> On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
>> Of course the effect on codegen of this patch depends a lot on the rtx
>> costs in the backend.
>> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
>> exceeded in more cases and the
>> expansion code choosing instead to do a move-immediate and a mul
>> instruction.
>> No regressions on SPEC2006 on a Cortex-A57.
>>
>> For example, for code:
>> long f0 (int x, int y)
>> {
>> return (long)x * 6L;
>> }
>>
>>
>> int f1(int x)
>> {
>> return x * 10;
>> }
>>
>> int f2(int x)
>> {
>> return x * 100;
>> }
>>
>> int f3(int x)
>> {
>> return x * 20;
>> }
>>
>> int f4(int x)
>> {
>> return x * 25;
>> }
>>
>> int f5(int x)
>> {
>> return x * 11;
>> }
> Please turn this into a test for the testsuite. It's fine if this the
> test is specific to AArch64. You may need to break it into 6 individual
> tests since what you want to check for in each one may be significantly
> different. For example, f0, f4 and f5 you'd probably check for the
> constant load & multiply instructions. Not sure how to best test for
> what you want in f1-f3.
f1/f3 still end up synthesising the mult, but prefer a different
algorithm. I don't think the algorithm chosen in f1/f3 is worse or
better than what it was producing before, so I don't think there's
much point in testing for it. If you think it's really better to
test for something, I propose testing that only two instructions are
generated, and neither of them is a 'mul'. I'll repost a patch with
my proposed testcases for f0,f2,f4,f5.
>
>>
>> Bootstrapped and tested on arm, aarch64, x86_64-linux.
>> Ok for trunk?
>>
>> Thanks,
>> Kyrill
>>
>> 2015-04-14 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>
>> * expmed.c: (synth_mult): Only assume overlapping
>> shift with previous steps in alg_sub_t_m2 case.
> OK with a test for the testsuite.
Thanks for looking at this,
Kyrill
>
> jeff
>
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-20 15:06 ` Jeff Law
From: Jeff Law @ 2015-04-20 15:06 UTC (permalink / raw)
To: Kyrill Tkachov, GCC Patches; +Cc: rth
On 04/20/2015 05:09 AM, Kyrill Tkachov wrote:
> Hi Jeff,
>
> On 17/04/15 20:38, Jeff Law wrote:
>> On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
>>> Of course the effect on codegen of this patch depends a lot on the rtx
>>> costs in the backend.
>>> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
>>> exceeded in more cases and the
>>> expansion code choosing instead to do a move-immediate and a mul
>>> instruction.
>>> No regressions on SPEC2006 on a Cortex-A57.
>>>
>>> For example, for code:
>>> long f0 (int x, int y)
>>> {
>>> return (long)x * 6L;
>>> }
>>>
>>>
>>> int f1(int x)
>>> {
>>> return x * 10;
>>> }
>>>
>>> int f2(int x)
>>> {
>>> return x * 100;
>>> }
>>>
>>> int f3(int x)
>>> {
>>> return x * 20;
>>> }
>>>
>>> int f4(int x)
>>> {
>>> return x * 25;
>>> }
>>>
>>> int f5(int x)
>>> {
>>> return x * 11;
>>> }
>> Please turn this into a test for the testsuite. It's fine if this the
>> test is specific to AArch64. You may need to break it into 6 individual
>> tests since what you want to check for in each one may be significantly
>> different. For example, f0, f4 and f5 you'd probably check for the
>> constant load & multiply instructions. Not sure how to best test for
>> what you want in f1-f3.
>
> f1/f3 still end up synthesising the mult, but prefer a different
> algorithm. I don't think the algorithm chosen in f1/f3 is worse or
> better than what it was producing before, so I don't think there's
> much point in testing for it.
Yea, when I looked at the differences, it wasn't immediately clear if
there was a real improvement or not. The new sequences use a single
register, so one might claim they're marginally better due to that.
> If you think it's really better to
> test for something, I propose testing that only two instructions are
> generated, and neither of them are a 'mul'. I'll repost a patch with
> my proposed testcases for f0,f2,f4,f5.
Your call on f1/f3. IIRC f2 didn't change at all, so I'm not sure if
you need a test for that (perhaps you should make sure it continues to
use a mul rather than a synthesized sequence).
Pre-approved with whatever you decide on the testing side.
jeff
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-20 15:12 ` Kyrill Tkachov
From: Kyrill Tkachov @ 2015-04-20 15:12 UTC (permalink / raw)
To: Jeff Law, GCC Patches
Cc: rth, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh
[-- Attachment #1: Type: text/plain, Size: 3300 bytes --]
On 20/04/15 16:06, Jeff Law wrote:
> On 04/20/2015 05:09 AM, Kyrill Tkachov wrote:
>> Hi Jeff,
>>
>> On 17/04/15 20:38, Jeff Law wrote:
>>> On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
>>>> Of course the effect on codegen of this patch depends a lot on the rtx
>>>> costs in the backend.
>>>> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
>>>> exceeded in more cases and the
>>>> expansion code choosing instead to do a move-immediate and a mul
>>>> instruction.
>>>> No regressions on SPEC2006 on a Cortex-A57.
>>>>
>>>> For example, for code:
>>>> long f0 (int x, int y)
>>>> {
>>>> return (long)x * 6L;
>>>> }
>>>>
>>>>
>>>> int f1(int x)
>>>> {
>>>> return x * 10;
>>>> }
>>>>
>>>> int f2(int x)
>>>> {
>>>> return x * 100;
>>>> }
>>>>
>>>> int f3(int x)
>>>> {
>>>> return x * 20;
>>>> }
>>>>
>>>> int f4(int x)
>>>> {
>>>> return x * 25;
>>>> }
>>>>
>>>> int f5(int x)
>>>> {
>>>> return x * 11;
>>>> }
>>> Please turn this into a test for the testsuite. It's fine if this the
>>> test is specific to AArch64. You may need to break it into 6 individual
>>> tests since what you want to check for in each one may be significantly
>>> different. For example, f0, f4 and f5 you'd probably check for the
>>> constant load & multiply instructions. Not sure how to best test for
>>> what you want in f1-f3.
>> f1/f3 still end up synthesising the mult, but prefer a different
>> algorithm. I don't think the algorithm chosen in f1/f3 is worse or
>> better than what it was producing before, so I don't think there's
>> much point in testing for it.
> Yea, when I looked at the differences, it wasn't immediately clear if
> there was a real improvement or not. The new sequences use the single
> register, so one might claim they're marginally better due to that.
>
> If you think it's really better to
>> test for something, I propose testing that only two instructions are
>> generated, and neither of them are a 'mul'. I'll repost a patch with
>> my proposed testcases for f0,f2,f4,f5.
> Your call on f1/f3. IIRC f2 didn't change at all, so I'm not sure if
> you need a test for that (perhaps you should make sure it continues to
> use a mul rather than a synthesized sequence).
>
> Pre-approved with whatever you decide on the testing side.
Thanks,
I could've sworn I had sent this version out a couple hours ago.
My mail client has been playing up.
Here it is with 6 tests. For the tests corresponding to f1/f3 in my
example above I scan that we don't use the 'w1' reg.
I'll give the AArch64 maintainers a day or two to comment on the tests
before committing.
Thanks,
Kyrill
2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* expmed.c: (synth_mult): Only assume overlapping
shift with previous steps in alg_sub_t_m2 case.
2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/mult-synth_1.c: New test.
* gcc.target/aarch64/mult-synth_2.c: Likewise.
* gcc.target/aarch64/mult-synth_3.c: Likewise.
* gcc.target/aarch64/mult-synth_4.c: Likewise.
* gcc.target/aarch64/mult-synth_5.c: Likewise.
* gcc.target/aarch64/mult-synth_6.c: Likewise.
>
> jeff
>
[-- Attachment #2: expmed-shiftsub-tests.patch --]
[-- Type: text/x-patch; name=expmed-shiftsub-tests.patch, Size: 6847 bytes --]
commit 6dddd785f2839811c9f8e59e3e36f64cdb3a4ebe
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date: Thu Mar 12 17:38:20 2015 +0000
[expmed] Properly account for the cost and latency of shift+sub ops when synthesizing mults
diff --git a/gcc/expmed.c b/gcc/expmed.c
index f570612..e890a1e 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2664,14 +2664,28 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
m = exact_log2 (-orig_t + 1);
if (m >= 0 && m < maxm)
{
- op_cost = shiftsub1_cost (speed, mode, m);
+ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
+ /* If the target has a cheap shift-and-subtract insn use
+ that in preference to a shift insn followed by a sub insn.
+ Assume that the shift-and-sub is "atomic" with a latency
+ equal to it's cost, otherwise assume that on superscalar
+ hardware the shift may be executed concurrently with the
+ earlier steps in the algorithm. */
+ if (shiftsub1_cost (speed, mode, m) <= op_cost)
+ {
+ op_cost = shiftsub1_cost (speed, mode, m);
+ op_latency = op_cost;
+ }
+ else
+ op_latency = add_cost (speed, mode);
+
new_limit.cost = best_cost.cost - op_cost;
- new_limit.latency = best_cost.latency - op_cost;
+ new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m,
&new_limit, mode);
alg_in->cost.cost += op_cost;
- alg_in->cost.latency += op_cost;
+ alg_in->cost.latency += op_latency;
if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
{
best_cost = alg_in->cost;
@@ -2704,20 +2718,12 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_add_factor))
{
- /* If the target has a cheap shift-and-add instruction use
- that in preference to a shift insn followed by an add insn.
- Assume that the shift-and-add is "atomic" with a latency
- equal to its cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftadd_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftadd_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftadd_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftadd_cost (speed, mode, m);
+
+ op_latency = op_cost;
+
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
@@ -2742,20 +2748,11 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_sub_factor))
{
- /* If the target has a cheap shift-and-subtract insn use
- that in preference to a shift insn followed by a sub insn.
- Assume that the shift-and-sub is "atomic" with a latency
- equal to it's cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftsub0_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftsub0_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftsub0_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftsub0_cost (speed, mode, m);
+
+ op_latency = op_cost;
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
new file mode 100644
index 0000000..8b6ee70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 100;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
new file mode 100644
index 0000000..9278817
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 25;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
new file mode 100644
index 0000000..6898e78
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 11;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
new file mode 100644
index 0000000..1088a75
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+long
+foo (int x, int y)
+{
+ return (long)x * 6L;
+}
+
+/* { dg-final { scan-assembler "smull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
new file mode 100644
index 0000000..8cf3314
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 10;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
new file mode 100644
index 0000000..e941b72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 20;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-21 12:46 ` Marcus Shawcroft
From: Marcus Shawcroft @ 2015-04-21 12:46 UTC (permalink / raw)
To: Kyrill Tkachov; +Cc: GCC Patches
On 20 April 2015 at 16:12, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Thanks,
> I could've sworn I had sent this version out a couple hours ago.
> My mail client has been playing up.
>
> Here it is with 6 tests. For the tests corresponding to f1/f3 in my
> example above I scan that we don't use the 'w1' reg.
>
> I'll give the AArch64 maintainers to comment on the tests for a day or two
> before committing.
Using scan-assembler-times is more robust than scan-assembler.
Otherwise, OK by me.
/Marcus
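For illustration, the distinction Marcus draws (both directive forms appear in the patches in this thread): a plain scan-assembler passes if the pattern appears anywhere in the output, while scan-assembler-times additionally pins the count, so it also catches regressions that emit the instruction more than once.

```c
/* Passes as long as at least one matching mul appears.  */
/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */

/* Passes only if exactly one matching mul appears.  */
/* { dg-final { scan-assembler-times "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 1 } } */
```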
> Thanks,
> Kyrill
>
> 2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * expmed.c: (synth_mult): Only assume overlapping
> shift with previous steps in alg_sub_t_m2 case.
>
> 2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.target/aarch64/mult-synth_1.c: New test.
> * gcc.target/aarch64/mult-synth_2.c: Likewise.
> * gcc.target/aarch64/mult-synth_3.c: Likewise.
> * gcc.target/aarch64/mult-synth_4.c: Likewise.
> * gcc.target/aarch64/mult-synth_5.c: Likewise.
> * gcc.target/aarch64/mult-synth_6.c: Likewise.
>>
>>
>> jeff
>>
>
* Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults
@ 2015-04-21 12:58 ` Kyrill Tkachov
From: Kyrill Tkachov @ 2015-04-21 12:58 UTC (permalink / raw)
To: Marcus Shawcroft; +Cc: GCC Patches
[-- Attachment #1: Type: text/plain, Size: 1329 bytes --]
On 21/04/15 13:46, Marcus Shawcroft wrote:
> On 20 April 2015 at 16:12, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>
>> Thanks,
>> I could've sworn I had sent this version out a couple hours ago.
>> My mail client has been playing up.
>>
>> Here it is with 6 tests. For the tests corresponding to f1/f3 in my
>> example above I scan that we don't use the 'w1' reg.
>>
>> I'll give the AArch64 maintainers to comment on the tests for a day or two
>> before committing.
> Using scan-assembler-times is more robust than scan-assembler.
> Otherwise, OK by me.
> /Marcus
Thanks, I used scan-assembler-times for those tests.
Attached is what I committed with r222268.
Kyrill
>
>> Thanks,
>> Kyrill
>>
>> 2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>
>> * expmed.c: (synth_mult): Only assume overlapping
>> shift with previous steps in alg_sub_t_m2 case.
>>
>> 2015-04-20 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>
>> * gcc.target/aarch64/mult-synth_1.c: New test.
>> * gcc.target/aarch64/mult-synth_2.c: Likewise.
>> * gcc.target/aarch64/mult-synth_3.c: Likewise.
>> * gcc.target/aarch64/mult-synth_4.c: Likewise.
>> * gcc.target/aarch64/mult-synth_5.c: Likewise.
>> * gcc.target/aarch64/mult-synth_6.c: Likewise.
>>>
>>> jeff
>>>
[-- Attachment #2: expmed-synthmult-svn.patch --]
[-- Type: text/x-patch; name=expmed-synthmult-svn.patch, Size: 7849 bytes --]
Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog (revision 222266)
+++ gcc/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2015-04-21 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
+
+ * expmed.c: (synth_mult): Only assume overlapping
+ shift with previous steps in alg_sub_t_m2 case.
+
2015-04-21 Richard Biener <rguenther@suse.de>
PR tree-optimization/65788
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_2.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_2.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 25;
+}
+
+/* { dg-final { scan-assembler-times "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 1 } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_3.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_3.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 11;
+}
+
+/* { dg-final { scan-assembler-times "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 1 } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_4.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_4.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+long
+foo (int x, int y)
+{
+ return (long)x * 6L;
+}
+
+/* { dg-final { scan-assembler-times "smull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 1 } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_5.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_5.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 10;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_6.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_6.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 20;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/mult-synth_1.c (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/mult-synth_1.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 100;
+}
+
+/* { dg-final { scan-assembler-times "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 1 } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog (revision 222266)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,12 @@
+2015-04-21 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
+
+ * gcc.target/aarch64/mult-synth_1.c: New test.
+ * gcc.target/aarch64/mult-synth_2.c: Likewise.
+ * gcc.target/aarch64/mult-synth_3.c: Likewise.
+ * gcc.target/aarch64/mult-synth_4.c: Likewise.
+ * gcc.target/aarch64/mult-synth_5.c: Likewise.
+ * gcc.target/aarch64/mult-synth_6.c: Likewise.
+
2015-04-21 Tom de Vries <tom@codesourcery.com>
PR tree-optimization/65802
Index: gcc/expmed.c
===================================================================
--- gcc/expmed.c (revision 222266)
+++ gcc/expmed.c (working copy)
@@ -2664,14 +2664,28 @@
m = exact_log2 (-orig_t + 1);
if (m >= 0 && m < maxm)
{
- op_cost = shiftsub1_cost (speed, mode, m);
+ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
+ /* If the target has a cheap shift-and-subtract insn use
+ that in preference to a shift insn followed by a sub insn.
+ Assume that the shift-and-sub is "atomic" with a latency
+ equal to it's cost, otherwise assume that on superscalar
+ hardware the shift may be executed concurrently with the
+ earlier steps in the algorithm. */
+ if (shiftsub1_cost (speed, mode, m) <= op_cost)
+ {
+ op_cost = shiftsub1_cost (speed, mode, m);
+ op_latency = op_cost;
+ }
+ else
+ op_latency = add_cost (speed, mode);
+
new_limit.cost = best_cost.cost - op_cost;
- new_limit.latency = best_cost.latency - op_cost;
+ new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m,
&new_limit, mode);
alg_in->cost.cost += op_cost;
- alg_in->cost.latency += op_cost;
+ alg_in->cost.latency += op_latency;
if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
{
best_cost = alg_in->cost;
@@ -2704,21 +2718,13 @@
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_add_factor))
{
- /* If the target has a cheap shift-and-add instruction use
- that in preference to a shift insn followed by an add insn.
- Assume that the shift-and-add is "atomic" with a latency
- equal to its cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftadd_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftadd_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftadd_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftadd_cost (speed, mode, m);
+ op_latency = op_cost;
+
+
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, t / d, &new_limit, mode);
@@ -2742,21 +2748,12 @@
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_sub_factor))
{
- /* If the target has a cheap shift-and-subtract insn use
- that in preference to a shift insn followed by a sub insn.
- Assume that the shift-and-sub is "atomic" with a latency
- equal to it's cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftsub0_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftsub0_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftsub0_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftsub0_cost (speed, mode, m);
+ op_latency = op_cost;
+
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, t / d, &new_limit, mode);