* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
@ 2020-03-19 15:47 Wilco Dijkstra
2020-03-19 16:50 ` Richard Henderson
0 siblings, 1 reply; 4+ messages in thread
From: Wilco Dijkstra @ 2020-03-19 15:47 UTC (permalink / raw)
To: Richard Henderson, GCC Patches
Hi Richard,
Thanks for these patches - yes TI mode expansions can certainly be improved!
So looking at your expansions for signed compares, why not copy the optimal
sequence from 32-bit Arm?
Any compare can be done in at most 2 instructions:
void doit(void);
void f(long long a)
{
if (a <= 1)
doit();
}
f:
cmp r0, #2
sbcs r3, r1, #0
blt .L4
bx lr
.L4:
b doit
Cheers,
Wilco
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
2020-03-19 15:47 [PATCH 0/6] aarch64: Implement TImode comparisons Wilco Dijkstra
@ 2020-03-19 16:50 ` Richard Henderson
2020-03-19 23:33 ` Wilco Dijkstra
0 siblings, 1 reply; 4+ messages in thread
From: Richard Henderson @ 2020-03-19 16:50 UTC (permalink / raw)
To: Wilco Dijkstra, GCC Patches
On 3/19/20 8:47 AM, Wilco Dijkstra wrote:
> Hi Richard,
>
> Thanks for these patches - yes TI mode expansions can certainly be improved!
> So looking at your expansions for signed compares, why not copy the optimal
> sequence from 32-bit Arm?
>
> Any compare can be done in at most 2 instructions:
>
> void doit(void);
> void f(long long a)
> {
> if (a <= 1)
> doit();
> }
>
> f:
> cmp r0, #2
> sbcs r3, r1, #0
> blt .L4
Well, this one requires that you be able to add 1 to an input and for that
input to not overflow. But you're right that I should be using this sequence
for LT (not LE).
I'll have another look.
r~
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
2020-03-19 16:50 ` Richard Henderson
@ 2020-03-19 23:33 ` Wilco Dijkstra
0 siblings, 0 replies; 4+ messages in thread
From: Wilco Dijkstra @ 2020-03-19 23:33 UTC (permalink / raw)
To: Richard Henderson, GCC Patches
Hi Richard,
> Any compare can be done in at most 2 instructions:
>
> void doit(void);
> void f(long long a)
> {
> if (a <= 1)
> doit();
> }
>
> f:
> cmp r0, #2
> sbcs r3, r1, #0
> blt .L4
> Well, this one requires that you be able to add 1 to an input and for that
> input to not overflow. But you're right that I should be using this sequence
> for LT (not LE).
And for GE. For LE and GT swap operands and condition. You can safely
increment the immediate since small immediates that fit CMP have no
chance of overflowing and large immediates have to be split off anyway
(and then are treated like register variants).
Cheers,
Wilco
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH 0/6] aarch64: Implement TImode comparisons
@ 2020-03-19 6:47 Richard Henderson
0 siblings, 0 replies; 4+ messages in thread
From: Richard Henderson @ 2020-03-19 6:47 UTC (permalink / raw)
To: gcc-patches
Cc: richard.earnshaw, richard.sandiford, marcus.shawcroft, kyrylo.tkachov
This is attacking case 3 of PR 94174.
The existing ccmp optimization happens at the gimple level,
which means that rtl expansion of TImode stuff cannot take
advantage. But we can to even better than the existing
ccmp optimization.
This expansion is similar size to our current branchful
expansion, but all straight-line code. I will assume in
general that the branch predictor will work better with
fewer branches.
E.g.
- 10: b7f800a3 tbnz x3, #63, 24 <__subvti3+0x24>
- 14: eb02003f cmp x1, x2
- 18: 5400010c b.gt 38 <__subvti3+0x38>
- 1c: 54000140 b.eq 44 <__subvti3+0x44> // b.none
- 20: d65f03c0 ret
- 24: eb01005f cmp x2, x1
- 28: 5400008c b.gt 38 <__subvti3+0x38>
- 2c: 54ffffa1 b.ne 20 <__subvti3+0x20> // b.any
- 30: eb00009f cmp x4, x0
- 34: 54ffff69 b.ls 20 <__subvti3+0x20> // b.plast
- 38: a9bf7bfd stp x29, x30, [sp, #-16]!
- 3c: 910003fd mov x29, sp
- 40: 94000000 bl 0 <abort>
- 44: eb04001f cmp x0, x4
- 48: 54ffff88 b.hi 38 <__subvti3+0x38> // b.pmore
- 4c: d65f03c0 ret
+ 10: b7f800e3 tbnz x3, #63, 2c <__subvti3+0x2c>
+ 14: eb01005f cmp x2, x1
+ 18: 1a9fb7e2 cset w2, ge // ge = tcont
+ 1c: fa400080 ccmp x4, x0, #0x0, eq // eq = none
+ 20: 7a40a844 ccmp w2, #0x0, #0x4, ge // ge = tcont
+ 24: 540000e0 b.eq 40 <__subvti3+0x40> // b.none
+ 28: d65f03c0 ret
+ 2c: eb01005f cmp x2, x1
+ 30: 1a9fc7e2 cset w2, le
+ 34: fa400081 ccmp x4, x0, #0x1, eq // eq = none
+ 38: 7a40d844 ccmp w2, #0x0, #0x4, le
+ 3c: 54ffff60 b.eq 28 <__subvti3+0x28> // b.none
+ 40: a9bf7bfd stp x29, x30, [sp, #-16]!
+ 44: 910003fd mov x29, sp
+ 48: 94000000 bl 0 <abort>
So one less insn, but 2 branches instead of 6.
As for the specific case of the PR,
void test_int128(__int128 a, uint64_t l)
{
if ((__int128_t)a - l <= 1)
doit();
}
0: eb020000 subs x0, x0, x2
4: da1f0021 sbc x1, x1, xzr
8: f100003f cmp x1, #0x0
- c: 5400004d b.le 14 <test_int128+0x14>
- 10: d65f03c0 ret
- 14: 54000061 b.ne 20 <test_int128+0x20> // b.any
- 18: f100041f cmp x0, #0x1
- 1c: 54ffffa8 b.hi 10 <test_int128+0x10> // b.pmore
+ c: 1a9fc7e1 cset w1, le
+ 10: fa410801 ccmp x0, #0x1, #0x1, eq // eq = none
+ 14: 7a40d824 ccmp w1, #0x0, #0x4, le
+ 18: 54000041 b.ne 20 <test_int128+0x20> // b.any
+ 1c: d65f03c0 ret
20: 14000000 b 0 <doit>
r~
Richard Henderson (6):
aarch64: Add ucmp_*_carryinC patterns for all usub_*_carryinC
aarch64: Adjust result of aarch64_gen_compare_reg
aarch64: Accept 0 as first argument to compares
aarch64: Simplify @ccmp<cc_mode><mode> operands
aarch64: Improve nzcv argument to ccmp
aarch64: Implement TImode comparisons
gcc/config/aarch64/aarch64.c | 304 ++++++++++++++++------
gcc/config/aarch64/aarch64-simd.md | 18 +-
gcc/config/aarch64/aarch64-speculation.cc | 5 +-
gcc/config/aarch64/aarch64.md | 280 ++++++++++++++------
4 files changed, 429 insertions(+), 178 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-03-19 23:33 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-19 15:47 [PATCH 0/6] aarch64: Implement TImode comparisons Wilco Dijkstra
2020-03-19 16:50 ` Richard Henderson
2020-03-19 23:33 ` Wilco Dijkstra
-- strict thread matches above, loose matches on Subject: below --
2020-03-19 6:47 Richard Henderson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).