RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
@ 2023-06-27 19:32 Roger Sayle
  2023-06-28  3:22 ` Hongtao Liu
  0 siblings, 1 reply; 6+ messages in thread
From: Roger Sayle @ 2023-06-27 19:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Uros Bizjak', 'Hongtao Liu'

[-- Attachment #1: Type: text/plain, Size: 1715 bytes --]

Doh! Wrong patch...
Roger
--

From: Roger Sayle <roger@nextmovesoftware.com> 
Sent: 27 June 2023 20:28
To: 'gcc-patches@gcc.gnu.org' <gcc-patches@gcc.gnu.org>
Cc: 'Uros Bizjak' <ubizjak@gmail.com>; 'Hongtao Liu' <crazylht@gmail.com>
Subject: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector
equality.

Hi Uros,

Hopefully Hongtao will approve my patch to support SUBREG conversions
in STV https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
but for some of the examples described in the above post (and its test
case), I've also come up with an alternate/complementary/supplementary
fix of generating the PTEST during RTL expansion, rather than rely on
this being caught/optimized later during STV.

You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
appear last.  When I was writing this, I initially also added support
for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
support 256-bit OImode (which also explains why we don't have an OImode
to V1OImode scalar-to-vector pass).  Retaining this clause ordering
should minimize the lines changed if things change in future.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc (ix86_expand_int_compare): If
        testing a TImode SUBREG of a 128-bit vector register against
        zero, use a PTEST instruction instead of first moving it to
        to scalar registers.

Please let me know what you think.
Roger
--

[-- Attachment #2: patchic.txt --]
[-- Type: text/plain, Size: 1290 bytes --]

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 9a8d244..814d63b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -2958,9 +2958,26 @@ ix86_expand_int_compare (enum rtx_code code, rtx op0, rtx op1)
   cmpmode = SELECT_CC_MODE (code, op0, op1);
   flags = gen_rtx_REG (cmpmode, FLAGS_REG);

+  /* Attempt to use PTEST, if available, when testing vector modes for
+     equality/inequality against zero.  */
+  if (op1 == const0_rtx
+      && SUBREG_P (op0)
+      && cmpmode == CCZmode
+      && SUBREG_BYTE (op0) == 0
+      && REG_P (SUBREG_REG (op0))
+      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
+      && TARGET_SSE4_1
+      && GET_MODE (op0) == TImode
+      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
+    {
+      tmp = SUBREG_REG (op0);
+      tmp = gen_rtx_UNSPEC (CCZmode, gen_rtvec (2, tmp, tmp), UNSPEC_PTEST);
+    }
+  else
+    tmp = gen_rtx_COMPARE (cmpmode, op0, op1);
+
   /* This is very simple, but making the interface the same as in the
      FP case makes the rest of the code easier.  */
-  tmp = gen_rtx_COMPARE (cmpmode, op0, op1);
   emit_insn (gen_rtx_SET (flags, tmp));

   /* Return the test that should be put into the flags user, i.e.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
  2023-06-27 19:32 [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality Roger Sayle
@ 2023-06-28  3:22 ` Hongtao Liu
  2023-07-11 20:57   ` Roger Sayle
  0 siblings, 1 reply; 6+ messages in thread
From: Hongtao Liu @ 2023-06-28  3:22 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Uros Bizjak

On Wed, Jun 28, 2023 at 3:32 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Doh! Wrong patch...
> Roger
> --
>
> From: Roger Sayle <roger@nextmovesoftware.com>
> Sent: 27 June 2023 20:28
> To: 'gcc-patches@gcc.gnu.org' <gcc-patches@gcc.gnu.org>
> Cc: 'Uros Bizjak' <ubizjak@gmail.com>; 'Hongtao Liu' <crazylht@gmail.com>
> Subject: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector
> equality.
>
>
> Hi Uros,
>
> Hopefully Hongtao will approve my patch to support SUBREG conversions
> in STV https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
> but for some of the examples described in the above post (and its test
> case), I've also come up with an alternate/complementary/supplementary
> fix of generating the PTEST during RTL expansion, rather than rely on
> this being caught/optimized later during STV.
>
> You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> appear last.  When I was writing this, I initially also added support
> for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> support 256-bit OImode (which also explains why we don't have an OImode
> to V1OImode scalar-to-vector pass).  Retaining this clause ordering
> should minimize the lines changed if things change in future.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
>         testing a TImode SUBREG of a 128-bit vector register against
>         zero, use a PTEST instruction instead of first moving it to
>         to scalar registers.
>
>
> Please let me know what you think.
> Roger
> --
>

+  /* Attempt to use PTEST, if available, when testing vector modes for
+     equality/inequality against zero.  */
+  if (op1 == const0_rtx
+      && SUBREG_P (op0)
+      && cmpmode == CCZmode
+      && SUBREG_BYTE (op0) == 0
+      && REG_P (SUBREG_REG (op0))
Just register_operand (op0, TImode),
+      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
+      && TARGET_SSE4_1
+      && GET_MODE (op0) == TImode
+      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
+    {
+      tmp = SUBREG_REG (op0);
and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
I think RA can handle SUBREG correctly, no need for extra predicates.
+      tmp = gen_rtx_UNSPEC (CCZmode, gen_rtvec (2, tmp, tmp), UNSPEC_PTEST);
+    }
+  else
+    tmp = gen_rtx_COMPARE (cmpmode, op0, op1);



--
BR,
Hongtao

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
  2023-06-28  3:22 ` Hongtao Liu
@ 2023-07-11 20:57   ` Roger Sayle
  2023-07-12  0:44     ` Hongtao Liu
  0 siblings, 1 reply; 6+ messages in thread
From: Roger Sayle @ 2023-07-11 20:57 UTC (permalink / raw)
  To: 'Hongtao Liu'; +Cc: gcc-patches, 'Uros Bizjak'

> From: Hongtao Liu <crazylht@gmail.com>
> Sent: 28 June 2023 04:23
> > From: Roger Sayle <roger@nextmovesoftware.com>
> > Sent: 27 June 2023 20:28
> >
> > I've also come up with an alternate/complementary/supplementary
> > fix of generating the PTEST during RTL expansion, rather than rely on
> > this being caught/optimized later during STV.
> >
> > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > appear last.  When I was writing this, I initially also added support
> > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > support 256-bit OImode (which also explains why we don't have an
> > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > ordering should minimize the lines changed if things change in future.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> >         testing a TImode SUBREG of a 128-bit vector register against
> >         zero, use a PTEST instruction instead of first moving it to
> >         to scalar registers.
> >
> 
> +  /* Attempt to use PTEST, if available, when testing vector modes for
> +     equality/inequality against zero.  */  if (op1 == const0_rtx
> +      && SUBREG_P (op0)
> +      && cmpmode == CCZmode
> +      && SUBREG_BYTE (op0) == 0
> +      && REG_P (SUBREG_REG (op0))
> Just register_operand (op0, TImode),

I completely agree that in most circumstances, the early RTL optimizers
should use standard predicates, such as register_operand, that don't
distinguish between REG and SUBREG, allowing the choice (assignment)
to be left to register allocation (reload).

However in this case, unusually, the presence of the SUBREG, and treating
it differently from a REG is critical (in fact the reason for the patch).  x86_64
can very efficiently test whether a 128-bit value is zero, setting ZF, either
in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in
V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction.
There's no reason to prefer one form over the other.  A SUREG, however, that
moves the value from the scalar registers to a vector register, or from a vector
registers to scalar registers, requires two or three instructions, often reading
and writing values via memory, at a huge performance penalty.   Hence the
goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate
single-cycle test instruction for where the data is located.  Hence we want
to leave REG_P alone, but optimize (only) the SUBREG_P cases.
register_operand doesn't help with this.

Note this is counter to the usual advice.  Normally, a SUBREG between scalar
registers is cheap (in fact free) on x86, hence it safe for predicates to ignore
them prior to register allocation.  But another use of SUBREG, to represent
a VIEW_CONVERT_EXPR/transfer between processing units is closer to a
conversion, and a very expensive one (going via memory with different size
reads vs writes) at that.

> +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> +      && TARGET_SSE4_1
> +      && GET_MODE (op0) == TImode
> +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> +    {
> +      tmp = SUBREG_REG (op0);
> and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> I think RA can handle SUBREG correctly, no need for extra predicates.

Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
is forcing there to always be an inter-unit transfer/pipeline stall, when this is
idiom that we're trying to eliminate.

I should have repeated the motivating example from my original post at
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html

typedef long long __m128i __attribute__ ((__vector_size__ (16)));
int foo (__m128i x, __m128i y) {
  return (__int128)x == (__int128)y;
}

is currently generated as:
foo:    movaps  %xmm0, -40(%rsp)
        movq    -32(%rsp), %rdx
        movq    %xmm0, %rax
        movq    %xmm1, %rsi
        movaps  %xmm1, -24(%rsp)
        movq    -16(%rsp), %rcx
        xorq    %rsi, %rax
        xorq    %rcx, %rdx
        orq     %rdx, %rax
        sete    %al
        movzbl  %al, %eax
        ret

with this patch (to eliminate the interunit SUBREG) this becomes:

foo:    pxor    %xmm1, %xmm0
        xorl    %eax, %eax
        ptest   %xmm0, %xmm0
        sete    %al
        ret

Hopefully, this clarifies things a little.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
  2023-07-11 20:57   ` Roger Sayle
@ 2023-07-12  0:44     ` Hongtao Liu
  2023-07-12  7:29       ` Roger Sayle
  0 siblings, 1 reply; 6+ messages in thread
From: Hongtao Liu @ 2023-07-12  0:44 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Uros Bizjak

On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 28 June 2023 04:23
> > > From: Roger Sayle <roger@nextmovesoftware.com>
> > > Sent: 27 June 2023 20:28
> > >
> > > I've also come up with an alternate/complementary/supplementary
> > > fix of generating the PTEST during RTL expansion, rather than rely on
> > > this being caught/optimized later during STV.
> > >
> > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > > appear last.  When I was writing this, I initially also added support
> > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > > support 256-bit OImode (which also explains why we don't have an
> > > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > > ordering should minimize the lines changed if things change in future.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > >         testing a TImode SUBREG of a 128-bit vector register against
> > >         zero, use a PTEST instruction instead of first moving it to
> > >         to scalar registers.
> > >
> >
> > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > +     equality/inequality against zero.  */  if (op1 == const0_rtx
> > +      && SUBREG_P (op0)
> > +      && cmpmode == CCZmode
> > +      && SUBREG_BYTE (op0) == 0
> > +      && REG_P (SUBREG_REG (op0))
> > Just register_operand (op0, TImode),
>
> I completely agree that in most circumstances, the early RTL optimizers
> should use standard predicates, such as register_operand, that don't
> distinguish between REG and SUBREG, allowing the choice (assignment)
> to be left to register allocation (reload).
>
> However in this case, unusually, the presence of the SUBREG, and treating
> it differently from a REG is critical (in fact the reason for the patch).  x86_64
> can very efficiently test whether a 128-bit value is zero, setting ZF, either
> in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in
> V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction.
> There's no reason to prefer one form over the other.  A SUREG, however, that
> moves the value from the scalar registers to a vector register, or from a vector
> registers to scalar registers, requires two or three instructions, often reading
> and writing values via memory, at a huge performance penalty.   Hence the
> goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate
> single-cycle test instruction for where the data is located.  Hence we want
> to leave REG_P alone, but optimize (only) the SUBREG_P cases.
> register_operand doesn't help with this.
>
> Note this is counter to the usual advice.  Normally, a SUBREG between scalar
> registers is cheap (in fact free) on x86, hence it safe for predicates to ignore
> them prior to register allocation.  But another use of SUBREG, to represent
> a VIEW_CONVERT_EXPR/transfer between processing units is closer to a
> conversion, and a very expensive one (going via memory with different size
> reads vs writes) at that.
>
>
> > +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > +      && TARGET_SSE4_1
> > +      && GET_MODE (op0) == TImode
> > +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > +    {
> > +      tmp = SUBREG_REG (op0);
> > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > I think RA can handle SUBREG correctly, no need for extra predicates.
>
> Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> is forcing there to always be an inter-unit transfer/pipeline stall, when this is
> idiom that we're trying to eliminate.
>
> I should have repeated the motivating example from my original post at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
>
> typedef long long __m128i __attribute__ ((__vector_size__ (16)));
> int foo (__m128i x, __m128i y) {
>   return (__int128)x == (__int128)y;
> }
>
> is currently generated as:
> foo:    movaps  %xmm0, -40(%rsp)
>         movq    -32(%rsp), %rdx
>         movq    %xmm0, %rax
>         movq    %xmm1, %rsi
>         movaps  %xmm1, -24(%rsp)
>         movq    -16(%rsp), %rcx
>         xorq    %rsi, %rax
>         xorq    %rcx, %rdx
>         orq     %rdx, %rax
>         sete    %al
>         movzbl  %al, %eax
>         ret
>
> with this patch (to eliminate the interunit SUBREG) this becomes:
>
> foo:    pxor    %xmm1, %xmm0
>         xorl    %eax, %eax
>         ptest   %xmm0, %xmm0
>         sete    %al
>         ret
>
> Hopefully, this clarifies things a little.
Thanks for the explanation, the patch LGTM.
One curious question, is there any case SUBREG_BYTE != 0 when inner
and outer mode(TImode) have the same mode size(16 bytes)?

>
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
  2023-07-12  0:44     ` Hongtao Liu
@ 2023-07-12  7:29       ` Roger Sayle
  0 siblings, 0 replies; 6+ messages in thread
From: Roger Sayle @ 2023-07-12  7:29 UTC (permalink / raw)
  To: 'Hongtao Liu'; +Cc: gcc-patches, 'Uros Bizjak'


> From: Hongtao Liu <crazylht@gmail.com>
> Sent: 12 July 2023 01:45
> 
> On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <roger@nextmovesoftware.com>
> > > From: Hongtao Liu <crazylht@gmail.com>
> > > Sent: 28 June 2023 04:23
> > > > From: Roger Sayle <roger@nextmovesoftware.com>
> > > > Sent: 27 June 2023 20:28
> > > >
> > > > I've also come up with an alternate/complementary/supplementary
> > > > fix of generating the PTEST during RTL expansion, rather than rely
> > > > on this being caught/optimized later during STV.
> > > >
> > > > You may notice in this patch, the tests for TARGET_SSE4_1 and
> > > > TImode appear last.  When I was writing this, I initially also
> > > > added support for AVX VPTEST and OImode, before realizing that x86
> > > > doesn't (yet) support 256-bit OImode (which also explains why we
> > > > don't have an OImode to V1OImode scalar-to-vector pass).
> > > > Retaining this clause ordering should minimize the lines changed if things
> change in future.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > >         testing a TImode SUBREG of a 128-bit vector register against
> > > >         zero, use a PTEST instruction instead of first moving it to
> > > >         to scalar registers.
> > > >
> > >
> > > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > > +     equality/inequality against zero.  */  if (op1 == const0_rtx
> > > +      && SUBREG_P (op0)
> > > +      && cmpmode == CCZmode
> > > +      && SUBREG_BYTE (op0) == 0
> > > +      && REG_P (SUBREG_REG (op0))
> > > Just register_operand (op0, TImode),
> >
> > I completely agree that in most circumstances, the early RTL
> > optimizers should use standard predicates, such as register_operand,
> > that don't distinguish between REG and SUBREG, allowing the choice
> > (assignment) to be left to register allocation (reload).
> >
> > However in this case, unusually, the presence of the SUBREG, and
> > treating it differently from a REG is critical (in fact the reason for
> > the patch).  x86_64 can very efficiently test whether a 128-bit value
> > is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single
> > cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a single
> cycle/single instruction.
> > There's no reason to prefer one form over the other.  A SUREG,
> > however, that moves the value from the scalar registers to a vector
> > register, or from a vector registers to scalar registers, requires two or three
> instructions, often reading
> > and writing values via memory, at a huge performance penalty.   Hence the
> > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the
> > appropriate single-cycle test instruction for where the data is
> > located.  Hence we want to leave REG_P alone, but optimize (only) the
> SUBREG_P cases.
> > register_operand doesn't help with this.
> >
> > Note this is counter to the usual advice.  Normally, a SUBREG between
> > scalar registers is cheap (in fact free) on x86, hence it safe for
> > predicates to ignore them prior to register allocation.  But another
> > use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between
> > processing units is closer to a conversion, and a very expensive one
> > (going via memory with different size reads vs writes) at that.
> >
> >
> > > +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > > +      && TARGET_SSE4_1
> > > +      && GET_MODE (op0) == TImode
> > > +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > > +    {
> > > +      tmp = SUBREG_REG (op0);
> > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > > I think RA can handle SUBREG correctly, no need for extra predicates.
> >
> > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> > is forcing there to always be an inter-unit transfer/pipeline stall,
> > when this is idiom that we're trying to eliminate.
> >
> > I should have repeated the motivating example from my original post at
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
> >
> > typedef long long __m128i __attribute__ ((__vector_size__ (16))); int
> > foo (__m128i x, __m128i y) {
> >   return (__int128)x == (__int128)y;
> > }
> >
> > is currently generated as:
> > foo:    movaps  %xmm0, -40(%rsp)
> >         movq    -32(%rsp), %rdx
> >         movq    %xmm0, %rax
> >         movq    %xmm1, %rsi
> >         movaps  %xmm1, -24(%rsp)
> >         movq    -16(%rsp), %rcx
> >         xorq    %rsi, %rax
> >         xorq    %rcx, %rdx
> >         orq     %rdx, %rax
> >         sete    %al
> >         movzbl  %al, %eax
> >         ret
> >
> > with this patch (to eliminate the interunit SUBREG) this becomes:
> >
> > foo:    pxor    %xmm1, %xmm0
> >         xorl    %eax, %eax
> >         ptest   %xmm0, %xmm0
> >         sete    %al
> >         ret
> >
> > Hopefully, this clarifies things a little.
> Thanks for the explanation, the patch LGTM.

Thanks.

> One curious question, is there any case SUBREG_BYTE != 0 when inner
> and outer mode(TImode) have the same mode size(16 bytes)?

Great question. Some of this is tucked away in the e-mails quoted above.
In addition to setting the ZF flag by comparing a 128-bit vector with zero, x86's
ptest/vptest pattern can also be used for comparing 256-bit vectors with zero
(hypothetically) using OImode, and 512-bit vectors with zero, using XImode.
Hence mention of out mode (OImode) and mode size (32 bytes).

The hold-up is that there doesn't appear to be a portable way in C/C++ to
express "if (is_all_zero_p (vector))" without using backend specific builtins,
as vector comparisons return vector results [though if someone knows the
secret incantation, please share].  Casting to __int128 (TImode) works on
x86_64 [as in this patch], because TImode is supported, but unfortunately
OImode and XImode aren't (yet).

This is related to the interesting bug, that even with SSE, AVX or AVX512
support, it's not possible to define V1TImode with -m32, as the usual typedef:
typedef __int128 v1ti __attribute__ ((vector_size (16))
fails with the error __int128 is not supported on  this platform.  I've been
investigating whether the front-ends can peek at the vector_size attribute
to work-around the target's scalar constraints.  This might even allow for
V1OImode and V1XImode on x86_64.

But you're right, if I'm always testing that the scalar mode and the vector
mode are the same size, the SUBREG offset should (I believe) always be
zero (which on x86 corresponds to lowpart), if the SUBREG isn't paradoxical.
The apparent redundancy is for readability, defensive programming and
avoiding more expensive tests/clauses when the SUBREG_BYTE isn't zero.


Thanks again,
Roger
--


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
@ 2023-06-27 19:27 Roger Sayle
  0 siblings, 0 replies; 6+ messages in thread
From: Roger Sayle @ 2023-06-27 19:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Uros Bizjak', 'Hongtao Liu'


[-- Attachment #1.1: Type: text/plain, Size: 1393 bytes --]

 

Hi Uros,

 

Hopefully Hongtao will approve my patch to support SUBREG conversions

in STV https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html

but for some of the examples described in the above post (and its test

case), I've also come up with an alternate/complementary/supplementary

fix of generating the PTEST during RTL expansion, rather than rely on

this being caught/optimized later during STV.

 

You may notice in this patch, the tests for TARGET_SSE4_1 and TImode

appear last.  When I was writing this, I initially also added support

for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)

support 256-bit OImode (which also explains why we don't have an OImode

to V1OImode scalar-to-vector pass).  Retaining this clause ordering

should minimize the lines changed if things change in future.

 

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap

and make -k check, both with and without --target_board=unix{-m32}

with no new failures.  Ok for mainline?

 

 

2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>

 

gcc/ChangeLog

        * config/i386/i386-expand.cc (ix86_expand_int_compare): If

        testing a TImode SUBREG of a 128-bit vector register against

        zero, use a PTEST instruction instead of first moving it to

        to scalar registers.

 

 

Please let me know what you think.

Roger

--

 


[-- Attachment #2: patchvc.txt --]
[-- Type: text/plain, Size: 1078 bytes --]

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 4a3b07a..53bec08 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -631,7 +631,31 @@ general_scalar_chain::compute_convert_gain ()
 	    break;
 
 	  case COMPARE:
-	    /* Assume comparison cost is the same.  */
+	    if (XEXP (src, 1) != const0_rtx)
+	      {
+		/* cmp vs. pxor;pshufd;ptest.  */
+		igain += COSTS_N_INSNS (m - 3);
+	      }
+	    else if (GET_CODE (XEXP (src, 0)) != AND)
+	      {
+		/* test vs. pshufd;ptest.  */
+		igain += COSTS_N_INSNS (m - 2);
+	      }
+	    else if (GET_CODE (XEXP (XEXP (src, 0), 0)) != NOT)
+	      {
+		/* and;test vs. pshufd;ptest.  */
+		igain += COSTS_N_INSNS (2 * m - 2);
+	      }
+	    else if (TARGET_BMI)
+	      {
+		/* andn;test vs. pandn;pshufd;ptest.  */
+		igain += COSTS_N_INSNS (2 * m - 3);
+	      }
+	    else
+	      {
+		/* not;and;test vs. pandn;pshufd;ptest.  */
+		igain += COSTS_N_INSNS (3 * m - 3);
+	      }
 	    break;
 
 	  case CONST_INT:

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-07-12  7:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-27 19:32 [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality Roger Sayle
2023-06-28  3:22 ` Hongtao Liu
2023-07-11 20:57   ` Roger Sayle
2023-07-12  0:44     ` Hongtao Liu
2023-07-12  7:29       ` Roger Sayle
  -- strict thread matches above, loose matches on Subject: below --
2023-06-27 19:27 Roger Sayle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).