public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Hongtao Liu <crazylht@gmail.com>
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: gcc-patches@gcc.gnu.org, Uros Bizjak <ubizjak@gmail.com>
Subject: Re: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
Date: Wed, 12 Jul 2023 08:44:37 +0800	[thread overview]
Message-ID: <CAMZc-by38AE_kbs4K2A6USpzTV+K8trr6w+RZOVMeG4Ac60JPQ@mail.gmail.com> (raw)
In-Reply-To: <02d501d9b43a$4fbb7b50$ef3271f0$@nextmovesoftware.com>

On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 28 June 2023 04:23
> > > From: Roger Sayle <roger@nextmovesoftware.com>
> > > Sent: 27 June 2023 20:28
> > >
> > > I've also come up with an alternate/complementary/supplementary
> > > fix of generating the PTEST during RTL expansion, rather than rely on
> > > this being caught/optimized later during STV.
> > >
> > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > > appear last.  When I was writing this, I initially also added support
> > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > > support 256-bit OImode (which also explains why we don't have an
> > > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > > ordering should minimize the lines changed if things change in future.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-06-27  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > >         testing a TImode SUBREG of a 128-bit vector register against
> > >         zero, use a PTEST instruction instead of first moving it to
> > >         to scalar registers.
> > >
> >
> > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > +     equality/inequality against zero.  */  if (op1 == const0_rtx
> > +      && SUBREG_P (op0)
> > +      && cmpmode == CCZmode
> > +      && SUBREG_BYTE (op0) == 0
> > +      && REG_P (SUBREG_REG (op0))
> > Just register_operand (op0, TImode),
>
> I completely agree that in most circumstances, the early RTL optimizers
> should use standard predicates, such as register_operand, that don't
> distinguish between REG and SUBREG, allowing the choice (assignment)
> to be left to register allocation (reload).
>
> However in this case, unusually, the presence of the SUBREG, and treating
> it differently from a REG is critical (in fact the reason for the patch).  x86_64
> can very efficiently test whether a 128-bit value is zero, setting ZF, either
> in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in
> V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction.
> There's no reason to prefer one form over the other.  A SUREG, however, that
> moves the value from the scalar registers to a vector register, or from a vector
> registers to scalar registers, requires two or three instructions, often reading
> and writing values via memory, at a huge performance penalty.   Hence the
> goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate
> single-cycle test instruction for where the data is located.  Hence we want
> to leave REG_P alone, but optimize (only) the SUBREG_P cases.
> register_operand doesn't help with this.
>
> Note this is counter to the usual advice.  Normally, a SUBREG between scalar
> registers is cheap (in fact free) on x86, hence it safe for predicates to ignore
> them prior to register allocation.  But another use of SUBREG, to represent
> a VIEW_CONVERT_EXPR/transfer between processing units is closer to a
> conversion, and a very expensive one (going via memory with different size
> reads vs writes) at that.
>
>
> > +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > +      && TARGET_SSE4_1
> > +      && GET_MODE (op0) == TImode
> > +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > +    {
> > +      tmp = SUBREG_REG (op0);
> > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > I think RA can handle SUBREG correctly, no need for extra predicates.
>
> Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> is forcing there to always be an inter-unit transfer/pipeline stall, when this is
> idiom that we're trying to eliminate.
>
> I should have repeated the motivating example from my original post at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
>
> typedef long long __m128i __attribute__ ((__vector_size__ (16)));
> int foo (__m128i x, __m128i y) {
>   return (__int128)x == (__int128)y;
> }
>
> is currently generated as:
> foo:    movaps  %xmm0, -40(%rsp)
>         movq    -32(%rsp), %rdx
>         movq    %xmm0, %rax
>         movq    %xmm1, %rsi
>         movaps  %xmm1, -24(%rsp)
>         movq    -16(%rsp), %rcx
>         xorq    %rsi, %rax
>         xorq    %rcx, %rdx
>         orq     %rdx, %rax
>         sete    %al
>         movzbl  %al, %eax
>         ret
>
> with this patch (to eliminate the interunit SUBREG) this becomes:
>
> foo:    pxor    %xmm1, %xmm0
>         xorl    %eax, %eax
>         ptest   %xmm0, %xmm0
>         sete    %al
>         ret
>
> Hopefully, this clarifies things a little.
Thanks for the explanation, the patch LGTM.
One curious question, is there any case SUBREG_BYTE != 0 when inner
and outer mode(TImode) have the same mode size(16 bytes)?

>
>


-- 
BR,
Hongtao

  reply	other threads:[~2023-07-12  0:44 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-27 19:32 Roger Sayle
2023-06-28  3:22 ` Hongtao Liu
2023-07-11 20:57   ` Roger Sayle
2023-07-12  0:44     ` Hongtao Liu [this message]
2023-07-12  7:29       ` Roger Sayle
  -- strict thread matches above, loose matches on Subject: below --
2023-06-27 19:27 Roger Sayle

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAMZc-by38AE_kbs4K2A6USpzTV+K8trr6w+RZOVMeG4Ac60JPQ@mail.gmail.com \
    --to=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=roger@nextmovesoftware.com \
    --cc=ubizjak@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).