RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'Hongtao Liu'" <crazylht@gmail.com>
Cc: <gcc-patches@gcc.gnu.org>, "'Uros Bizjak'" <ubizjak@gmail.com>
Subject: RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
Date: Wed, 12 Jul 2023 08:29:13 +0100	[thread overview]
Message-ID: <002301d9b492$8ffa49f0$afeeddd0$@nextmovesoftware.com> (raw)
In-Reply-To: <CAMZc-by38AE_kbs4K2A6USpzTV+K8trr6w+RZOVMeG4Ac60JPQ@mail.gmail.com>


> From: Hongtao Liu <crazylht@gmail.com>
> Sent: 12 July 2023 01:45
> 
> On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <roger@nextmovesoftware.com>
> > > From: Hongtao Liu <crazylht@gmail.com>
> > > Sent: 28 June 2023 04:23
> > > > From: Roger Sayle <roger@nextmovesoftware.com>
> > > > Sent: 27 June 2023 20:28
> > > >
> > > > I've also come up with an alternate/complementary/supplementary
> > > > fix of generating the PTEST during RTL expansion, rather than rely
> > > > on this being caught/optimized later during STV.
> > > >
> > > > You may notice in this patch, the tests for TARGET_SSE4_1 and
> > > > TImode appear last.  When I was writing this, I initially also
> > > > added support for AVX VPTEST and OImode, before realizing that x86
> > > > doesn't (yet) support 256-bit OImode (which also explains why we
> > > > don't have an OImode to V1OImode scalar-to-vector pass).
> > > > Retaining this clause ordering should minimize the lines changed if things
> change in future.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > >         testing a TImode SUBREG of a 128-bit vector register against
> > > >         zero, use a PTEST instruction instead of first moving it to
> > > >         to scalar registers.
> > > >
> > >
> > > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > > +     equality/inequality against zero.  */  if (op1 == const0_rtx
> > > +      && SUBREG_P (op0)
> > > +      && cmpmode == CCZmode
> > > +      && SUBREG_BYTE (op0) == 0
> > > +      && REG_P (SUBREG_REG (op0))
> > > Just register_operand (op0, TImode),
> >
> > I completely agree that in most circumstances, the early RTL
> > optimizers should use standard predicates, such as register_operand,
> > that don't distinguish between REG and SUBREG, allowing the choice
> > (assignment) to be left to register allocation (reload).
> >
> > However in this case, unusually, the presence of the SUBREG, and
> > treating it differently from a REG is critical (in fact the reason for
> > the patch).  x86_64 can very efficiently test whether a 128-bit value
> > is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single
> > cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a single
> cycle/single instruction.
> > There's no reason to prefer one form over the other.  A SUREG,
> > however, that moves the value from the scalar registers to a vector
> > register, or from a vector registers to scalar registers, requires two or three
> instructions, often reading
> > and writing values via memory, at a huge performance penalty.   Hence the
> > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the
> > appropriate single-cycle test instruction for where the data is
> > located.  Hence we want to leave REG_P alone, but optimize (only) the
> SUBREG_P cases.
> > register_operand doesn't help with this.
> >
> > Note this is counter to the usual advice.  Normally, a SUBREG between
> > scalar registers is cheap (in fact free) on x86, hence it safe for
> > predicates to ignore them prior to register allocation.  But another
> > use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between
> > processing units is closer to a conversion, and a very expensive one
> > (going via memory with different size reads vs writes) at that.
> >
> >
> > > +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > > +      && TARGET_SSE4_1
> > > +      && GET_MODE (op0) == TImode
> > > +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > > +    {
> > > +      tmp = SUBREG_REG (op0);
> > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > > I think RA can handle SUBREG correctly, no need for extra predicates.
> >
> > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> > is forcing there to always be an inter-unit transfer/pipeline stall,
> > when this is idiom that we're trying to eliminate.
> >
> > I should have repeated the motivating example from my original post at
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
> >
> > typedef long long __m128i __attribute__ ((__vector_size__ (16))); int
> > foo (__m128i x, __m128i y) {
> >   return (__int128)x == (__int128)y;
> > }
> >
> > is currently generated as:
> > foo:    movaps  %xmm0, -40(%rsp)
> >         movq    -32(%rsp), %rdx
> >         movq    %xmm0, %rax
> >         movq    %xmm1, %rsi
> >         movaps  %xmm1, -24(%rsp)
> >         movq    -16(%rsp), %rcx
> >         xorq    %rsi, %rax
> >         xorq    %rcx, %rdx
> >         orq     %rdx, %rax
> >         sete    %al
> >         movzbl  %al, %eax
> >         ret
> >
> > with this patch (to eliminate the interunit SUBREG) this becomes:
> >
> > foo:    pxor    %xmm1, %xmm0
> >         xorl    %eax, %eax
> >         ptest   %xmm0, %xmm0
> >         sete    %al
> >         ret
> >
> > Hopefully, this clarifies things a little.
> Thanks for the explanation, the patch LGTM.

Thanks.

> One curious question, is there any case SUBREG_BYTE != 0 when inner
> and outer mode(TImode) have the same mode size(16 bytes)?

Great question. Some of this is tucked away in the e-mails quoted above.
In addition to setting the ZF flag by comparing a 128-bit vector with zero, x86's
ptest/vptest pattern can also be used for comparing 256-bit vectors with zero
(hypothetically) using OImode, and 512-bit vectors with zero, using XImode.
Hence mention of out mode (OImode) and mode size (32 bytes).

The hold-up is that there doesn't appear to be a portable way in C/C++ to
express "if (is_all_zero_p (vector))" without using backend specific builtins,
as vector comparisons return vector results [though if someone knows the
secret incantation, please share].  Casting to __int128 (TImode) works on
x86_64 [as in this patch], because TImode is supported, but unfortunately
OImode and XImode aren't (yet).

This is related to the interesting bug, that even with SSE, AVX or AVX512
support, it's not possible to define V1TImode with -m32, as the usual typedef:
typedef __int128 v1ti __attribute__ ((vector_size (16))
fails with the error __int128 is not supported on  this platform.  I've been
investigating whether the front-ends can peek at the vector_size attribute
to work-around the target's scalar constraints.  This might even allow for
V1OImode and V1XImode on x86_64.

But you're right, if I'm always testing that the scalar mode and the vector
mode are the same size, the SUBREG offset should (I believe) always be
zero (which on x86 corresponds to lowpart), if the SUBREG isn't paradoxical.
The apparent redundancy is for readability, defensive programming and
avoiding more expensive tests/clauses when the SUBREG_BYTE isn't zero.


Thanks again,
Roger
--

next prev parent reply	other threads:[~2023-07-12  7:29 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-27 19:32 Roger Sayle
2023-06-28  3:22 ` Hongtao Liu
2023-07-11 20:57   ` Roger Sayle
2023-07-12  0:44     ` Hongtao Liu
2023-07-12  7:29       ` Roger Sayle [this message]
  -- strict thread matches above, loose matches on Subject: below --
2023-06-27 19:27 Roger Sayle

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='002301d9b492$8ffa49f0$afeeddd0$@nextmovesoftware.com' \
    --to=roger@nextmovesoftware.com \
    --cc=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=ubizjak@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).