public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
@ 2020-03-19 15:47 Wilco Dijkstra
  2020-03-19 16:50 ` Richard Henderson
  0 siblings, 1 reply; 4+ messages in thread
From: Wilco Dijkstra @ 2020-03-19 15:47 UTC (permalink / raw)
  To: Richard Henderson, GCC Patches

Hi Richard,

Thanks for these patches - yes TI mode expansions can certainly be improved!
So looking at your expansions for signed compares, why not copy the optimal
sequence from 32-bit Arm?

Any compare can be done in at most 2 instructions:

void doit(void);
void f(long long a)
{
    if (a <= 1)
        doit();
}

f:
        cmp     r0, #2
        sbcs    r3, r1, #0
        blt     .L4
        bx      lr
.L4:
        b       doit
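
For reference, here is a rough C model of why the flags work out; the
lo/hi split, the borrow variable and the test values are only meant to
illustrate the idiom, not anything taken from GCC:

#include <stdint.h>
#include <stdio.h>

/* Model of "cmp r0, #2; sbcs r3, r1, #0; blt" for a <= 1, i.e. a < 2:
   cmp subtracts the low words and produces a borrow, sbcs subtracts
   the high words minus that borrow, and LT tests the sign of the
   exact high-word result.  */
static int lt_2(int64_t a)
{
    uint32_t lo = (uint32_t)a;
    int32_t  hi = (int32_t)(a >> 32);
    unsigned borrow = lo < 2;                    /* cmp  r0, #2     */
    int64_t  high = (int64_t)hi - 0 - borrow;    /* sbcs r3, r1, #0 */
    return high < 0;                             /* blt             */
}

int main(void)
{
    int64_t t[] = { INT64_MIN, -1, 0, 1, 2, 3, INT64_MAX };
    for (unsigned i = 0; i < sizeof t / sizeof t[0]; i++)
        printf("%lld <= 1 : %d\n", (long long)t[i], lt_2(t[i]));
    return 0;
}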

Cheers,
Wilco


* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
  2020-03-19 15:47 [PATCH 0/6] aarch64: Implement TImode comparisons Wilco Dijkstra
@ 2020-03-19 16:50 ` Richard Henderson
  2020-03-19 23:33   ` Wilco Dijkstra
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Henderson @ 2020-03-19 16:50 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches

On 3/19/20 8:47 AM, Wilco Dijkstra wrote:
> Hi Richard,
> 
> Thanks for these patches - yes TI mode expansions can certainly be improved!
> So looking at your expansions for signed compares, why not copy the optimal
> sequence from 32-bit Arm?
> 
> Any compare can be done in at most 2 instructions:
> 
> void doit(void);
> void f(long long a)
> {
>     if (a <= 1)
>         doit();
> }
> 
> f:
>         cmp     r0, #2
>         sbcs    r3, r1, #0
>         blt     .L4

Well, this one requires that you be able to add 1 to an input and for that
input to not overflow.  But you're right that I should be using this sequence
for LT (not LE).
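
To spell out the constraint (my own illustration, not from the patch):

/* The LE form relies on rewriting the comparison:
       a <= 1   becomes   a < 2
   so the immediate has to survive the +1.  That cannot work when the
   constant is already the maximum: "a <= 0x7fffffffffffffff" is always
   true and has no "a < N+1" form in the same type.  For LT (and GE)
   the immediate is used unchanged.  */
int le1(long long a) { return a <= 1; }
int lt2(long long a) { return a <  2; }   /* the same test as le1 */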

I'll have another look.


r~


* Re: [PATCH 0/6] aarch64: Implement TImode comparisons
  2020-03-19 16:50 ` Richard Henderson
@ 2020-03-19 23:33   ` Wilco Dijkstra
  0 siblings, 0 replies; 4+ messages in thread
From: Wilco Dijkstra @ 2020-03-19 23:33 UTC (permalink / raw)
  To: Richard Henderson, GCC Patches

Hi Richard,

> Any compare can be done in at most 2 instructions:
> 
> void doit(void);
> void f(long long a)
> {
>     if (a <= 1)
>         doit();
> }
> 
> f:
>         cmp     r0, #2
>         sbcs    r3, r1, #0
>         blt     .L4

> Well, this one requires that you be able to add 1 to an input and for that
> input to not overflow.  But you're right that I should be using this sequence
> for LT (not LE).

And for GE.  For LE and GT, swap the operands and the condition.  You
can safely increment the immediate, since small immediates that fit CMP
have no chance of overflowing, and large immediates have to be split
off anyway (and are then treated like the register variants).
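
A quick sanity check of the immediate rewrites (the value and immediate
lists below are mine, chosen so that N+1 never overflows, which is the
case being discussed):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* LE N -> LT N+1 and GT N -> GE N+1, so only LT/GE sequences are
       ever needed for compares against a small immediate.  */
    int64_t vals[] = { INT64_MIN, -2, -1, 0, 1, 2, 100, INT64_MAX };
    int64_t imms[] = { -5, -1, 0, 1, 7, 255 };
    for (unsigned i = 0; i < sizeof vals / sizeof vals[0]; i++)
        for (unsigned j = 0; j < sizeof imms / sizeof imms[0]; j++) {
            int64_t a = vals[i], n = imms[j];
            assert((a <= n) == (a <  n + 1));   /* LE N  ->  LT N+1 */
            assert((a >  n) == (a >= n + 1));   /* GT N  ->  GE N+1 */
        }
    return 0;
}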

Cheers,
Wilco


* [PATCH 0/6] aarch64: Implement TImode comparisons
@ 2020-03-19  6:47 Richard Henderson
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Henderson @ 2020-03-19  6:47 UTC (permalink / raw)
  To: gcc-patches
  Cc: richard.earnshaw, richard.sandiford, marcus.shawcroft, kyrylo.tkachov

This is attacking case 3 of PR 94174.

The existing ccmp optimization happens at the gimple level,
which means that rtl expansion of TImode stuff cannot take
advantage of it.  But we can do even better than the existing
ccmp optimization.

This expansion is similar in size to our current branchful
expansion, but is all straight-line code.  I will assume in
general that the branch predictor will work better with
fewer branches.

E.g.

-  10:  b7f800a3        tbnz    x3, #63, 24 <__subvti3+0x24>
-  14:  eb02003f        cmp     x1, x2
-  18:  5400010c        b.gt    38 <__subvti3+0x38>
-  1c:  54000140        b.eq    44 <__subvti3+0x44>  // b.none
-  20:  d65f03c0        ret
-  24:  eb01005f        cmp     x2, x1
-  28:  5400008c        b.gt    38 <__subvti3+0x38>
-  2c:  54ffffa1        b.ne    20 <__subvti3+0x20>  // b.any
-  30:  eb00009f        cmp     x4, x0
-  34:  54ffff69        b.ls    20 <__subvti3+0x20>  // b.plast
-  38:  a9bf7bfd        stp     x29, x30, [sp, #-16]!
-  3c:  910003fd        mov     x29, sp
-  40:  94000000        bl      0 <abort>
-  44:  eb04001f        cmp     x0, x4
-  48:  54ffff88        b.hi    38 <__subvti3+0x38>  // b.pmore
-  4c:  d65f03c0        ret

+  10:  b7f800e3        tbnz    x3, #63, 2c <__subvti3+0x2c>
+  14:  eb01005f        cmp     x2, x1
+  18:  1a9fb7e2        cset    w2, ge  // ge = tcont
+  1c:  fa400080        ccmp    x4, x0, #0x0, eq  // eq = none
+  20:  7a40a844        ccmp    w2, #0x0, #0x4, ge  // ge = tcont
+  24:  540000e0        b.eq    40 <__subvti3+0x40>  // b.none
+  28:  d65f03c0        ret
+  2c:  eb01005f        cmp     x2, x1
+  30:  1a9fc7e2        cset    w2, le
+  34:  fa400081        ccmp    x4, x0, #0x1, eq  // eq = none
+  38:  7a40d844        ccmp    w2, #0x0, #0x4, le
+  3c:  54ffff60        b.eq    28 <__subvti3+0x28>  // b.none
+  40:  a9bf7bfd        stp     x29, x30, [sp, #-16]!
+  44:  910003fd        mov     x29, sp
+  48:  94000000        bl      0 <abort>

So one less insn, but 2 branches instead of 6.
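
For reference, the identity that lets a TImode signed compare be
evaluated as straight-line code is the usual double-word decomposition;
the model below is hand-written for illustration (the ah/al naming is
mine), not what the patch actually emits:

#include <stdint.h>
#include <stdio.h>

/* Signed 128-bit a < b from 64-bit halves, with no branches: compare
   the high halves signed, the low halves unsigned, and combine with
   bitwise ops so that a single final test decides the branch.  */
static int ti_lt(int64_t ah, uint64_t al, int64_t bh, uint64_t bl)
{
    int hi_lt = ah < bh;
    int hi_eq = ah == bh;
    int lo_lt = al < bl;
    return hi_lt | (hi_eq & lo_lt);
}

int main(void)
{
    /* -1 < 1, and 2^64 > 1, viewed as 128-bit values.  */
    printf("%d %d\n", ti_lt(-1, ~(uint64_t)0, 0, 1), ti_lt(1, 0, 0, 1));
    return 0;
}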

As for the specific case of the PR,

#include <stdint.h>
void doit(void);

void test_int128(__int128 a, uint64_t l)
{
	if ((__int128_t)a - l <= 1)
		doit();
}

    0:  eb020000        subs    x0, x0, x2
    4:  da1f0021        sbc     x1, x1, xzr
    8:  f100003f        cmp     x1, #0x0
-   c:  5400004d        b.le    14 <test_int128+0x14>
-  10:  d65f03c0        ret
-  14:  54000061        b.ne    20 <test_int128+0x20>  // b.any
-  18:  f100041f        cmp     x0, #0x1
-  1c:  54ffffa8        b.hi    10 <test_int128+0x10>  // b.pmore
+   c:  1a9fc7e1        cset    w1, le
+  10:  fa410801        ccmp    x0, #0x1, #0x1, eq  // eq = none
+  14:  7a40d824        ccmp    w1, #0x0, #0x4, le
+  18:  54000041        b.ne    20 <test_int128+0x20>  // b.any
+  1c:  d65f03c0        ret
   20:  14000000        b       0 <doit>


r~


Richard Henderson (6):
  aarch64: Add ucmp_*_carryinC patterns for all usub_*_carryinC
  aarch64: Adjust result of aarch64_gen_compare_reg
  aarch64: Accept 0 as first argument to compares
  aarch64: Simplify @ccmp<cc_mode><mode> operands
  aarch64: Improve nzcv argument to ccmp
  aarch64: Implement TImode comparisons

 gcc/config/aarch64/aarch64.c              | 304 ++++++++++++++++------
 gcc/config/aarch64/aarch64-simd.md        |  18 +-
 gcc/config/aarch64/aarch64-speculation.cc |   5 +-
 gcc/config/aarch64/aarch64.md             | 280 ++++++++++++++------
 4 files changed, 429 insertions(+), 178 deletions(-)

-- 
2.20.1



