public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Hongtao Liu <crazylht@gmail.com>
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: gcc-patches@gcc.gnu.org, Uros Bizjak <ubizjak@gmail.com>
Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
Date: Mon, 8 Jan 2024 09:58:56 +0800	[thread overview]
Message-ID: <CAMZc-bzPzXNPujQKNNgsui9Y2R9zJSOxESafgpJi3daw2e72Cw@mail.gmail.com> (raw)
In-Reply-To: <01e401da40f3$3ac28c70$b047a550$@nextmovesoftware.com>

On Sun, Jan 7, 2024 at 6:53 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> Hi Hongtao,
>
> Many thanks for the review.  This revised patch implements several
> of your suggestions, specifically to use pshufd for V4SImode and
> punpcklqdq for V2DImode.  These changes are demonstrated by the
> examples below:
>
> typedef unsigned int v4si __attribute((vector_size(16)));
> typedef unsigned long long v2di __attribute((vector_size(16)));
>
> v4si foo() { return (v4si){1,1,1,1}; }
> v2di bar() { return (v2di){1,1}; }
>
> The previous version of my patch generated:
>
> foo:    movdqa  .LC0(%rip), %xmm0
>         ret
> bar:    movdqa  .LC1(%rip), %xmm0
>         ret
>
> with this revised version, -O2 generates:
>
> foo:    movl    $1, %eax
>         movd    %eax, %xmm0
>         pshufd  $0, %xmm0, %xmm0
>         ret
> bar:    movl    $1, %eax
>         movq    %rax, %xmm0
>         punpcklqdq      %xmm0, %xmm0
>         ret
>
> However, if it's OK with you, I'd prefer to allow this function to
> return false, safely falling back to emitting a vector load from
> the constant bool rather than ICEing from a gcc_assert.  For one
Sure, that makes sense.
> thing this isn't a unrecoverable correctness issue, but at worst
> a missed optimization.  The deeper reason is that this usefully
> provides a handle for tuning on different microarchitectures.
> On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC,
> the first form above may be preferable to the second.  Currently
> the start of ix86_convert_const_wide_int_to_broadcast disables
> broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an
> implementation doesn't reuire an inter unit move, such as a
> broadcast from memory.  I plan follow-up patches that benefit
> from this flexibility.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok.
>
> gcc/ChangeLog
>         PR target/112992
>         * config/i386/i386-expand.cc
>         (ix86_convert_const_wide_int_to_broadcast): Allow call to
>         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
>         (ix86_broadcast_from_constant): Revert recent change; Return a
>         suitable MEMREF independently of mode/target combinations.
>         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
>         to decide whether expansion is possible/preferrable.  Only try
>         forcing DImode constants to memory (and trying again) if calling
>         ix86_expand_vector_init_duplicate fails with an DImode immediate
>         constant.
>         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
>         V4SImode for suitable immediate constants.
>         <case E_V4DImode>: Try using V8SImode for suitable constants.
>         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
>         <case E_V2HImode>: Likewise.
>         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
>         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
>         <label widen>: Handle CONT_INTs via simplify_binary_operation.
>         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
>         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
>         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
>         (ix86_expand_vector_init): Move try using a broadcast for all_same
>         with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/auto-init-8.c: Update test case.
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/avx512fp16-13.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/pr100865-1.c: Likewise.
>         * gcc.target/i386/pr100865-10a.c: Likewise.
>         * gcc.target/i386/pr100865-10b.c: Likewise.
>         * gcc.target/i386/pr100865-2.c: Likewise.
>         * gcc.target/i386/pr100865-3.c: Likewise.
>         * gcc.target/i386/pr100865-4a.c: Likewise.
>         * gcc.target/i386/pr100865-4b.c: Likewise.
>         * gcc.target/i386/pr100865-5a.c: Likewise.
>         * gcc.target/i386/pr100865-5b.c: Likewise.
>         * gcc.target/i386/pr100865-9a.c: Likewise.
>         * gcc.target/i386/pr100865-9b.c: Likewise.
>         * gcc.target/i386/pr102021.c: Likewise.
>         * gcc.target/i386/pr90773-17.c: Likewise.
>
> Thanks in advance.
> Roger
> --
>
> > -----Original Message-----
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 02 January 2024 05:40
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak <ubizjak@gmail.com>
> > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of
> > constants.
> >
> > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > This patch resolves the second part of PR target/112992, building upon
> > > Hongtao Liu's solution to the first part.
> > >
> > > The issue addressed by this patch is that when initializing vectors by
> > > broadcasting integer constants, the compiler has the flexibility to
> > > select the most appropriate vector mode to perform the broadcast, as
> > > long as the resulting vector has an identical bit pattern.  For
> > > example, the following constants are all equivalent:
> > > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } V8HImode
> > > {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> > > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ...
> > > 0x01 } So instruction sequences that construct any of these can be
> > > used to construct the others (with a suitable cast/SUBREG).
> > >
> > > On x86_64, it turns out that broadcasts of SImode constants are
> > > preferred, as DImode constants often require a longer movabs
> > > instruction, and HImode and QImode broadcasts require multiple uops on some
> > architectures.
> > > Hence, SImode is always the equal shortest/fastest implementation.
> > >
> > > Examples of this improvement, can be seen in the testsuite.
> > >
> > > gcc.target/i386/pr102021.c
> > > Before:
> > >    0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
> > >    7:   00 0c 00
> > >    a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
> > >   10:   c3                      retq
> > >
> > > After:
> > >    0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
> > >    5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
> > >    b:   c3                      retq
> > >
> > > and
> > > gcc.target/i386/pr90773-17.c:
> > > Before:
> > >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> > >    7:   b8 0c 00 00 00          mov    $0xc,%eax
> > >    c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
> > >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> > >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> > >   1f:   c3                      retq
> > >
> > > After:
> > >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> > >    7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
> > >    c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
> > >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> > >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> > >   1f:   c3                      retq
> > >
> > > where according to Agner Fog's instruction tables broadcastd is
> > > slightly faster on some microarchitectures, for example Knight's Landing.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-12-21  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/112992
> > >         * config/i386/i386-expand.cc
> > >         (ix86_convert_const_wide_int_to_broadcast): Allow call to
> > >         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> > >         (ix86_broadcast_from_constant): Revert recent change; Return a
> > >         suitable MEMREF independently of mode/target combinations.
> > >         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> > >         to decide whether expansion is possible/preferrable.  Only try
> > >         forcing DImode constants to memory (and trying again) if calling
> > >         ix86_expand_vector_init_duplicate fails with an DImode immediate
> > >         constant.
> > >         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
> > >         V4SImode for suitable immediate constants.
> > >         <case E_V4DImode>: Try using V8SImode for suitable constants.
> > >         <case E_V4SImode>: Use constant pool for AVX without AVX2.
> > >         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
> > >         <case E_V2HImode>: Likewise.
> > >         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
> > >         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
> > >         <label widen>: Handle CONT_INTs via simplify_binary_operation.
> > >         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> > >         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
> > >         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
> > >         (ix86_expand_vector_init): Move try using a broadcast for all_same
> > >         with ix86_expand_vector_init_duplicate before using constant pool.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
> > >         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> > >         * gcc.target/i386/avx512fp16-13.c: Likewise.
> > >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> > >         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> > >         * gcc.target/i386/pr100865-10a.c: Likewise.
> > >         * gcc.target/i386/pr100865-10b.c: Likewise.
> > >         * gcc.target/i386/pr100865-11c.c: Likewise.
> > >         * gcc.target/i386/pr100865-12c.c: Likewise.
> > >         * gcc.target/i386/pr100865-2.c: Likewise.
> > >         * gcc.target/i386/pr100865-3.c: Likewise.
> > >         * gcc.target/i386/pr100865-4a.c: Likewise.
> > >         * gcc.target/i386/pr100865-4b.c: Likewise.
> > >         * gcc.target/i386/pr100865-5a.c: Likewise.
> > >         * gcc.target/i386/pr100865-5b.c: Likewise.
> > >         * gcc.target/i386/pr100865-9a.c: Likewise.
> > >         * gcc.target/i386/pr100865-9b.c: Likewise.
> > >         * gcc.target/i386/pr102021.c: Likewise.
> > >         * gcc.target/i386/pr90773-17.c: Likewise.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > >+    case E_V2DImode:
> > >+      if (CONST_INT_P (val))
> > >+        {
> > >+          int tmp = (int)INTVAL (val);
> > >+          if (tmp == (int)(INTVAL (val) >> 32))
> > >+            {
> > >+              rtx reg = gen_reg_rtx (V4SImode);
> > >+              ok = ix86_vector_duplicate_value (V4SImode, reg,
> > >+                                                GEN_INT (tmp));
> > >+              if (ok)
> > >+                {
> > >+                  emit_move_insn (target, gen_lowpart (V2DImode, reg));
> > >+                  return true;
> > >+                }
> > >+            }
> > >+          if (!TARGET_AVX)
> > >+            return false;
> > >+          if (!TARGET_AVX2)
> > >+            val = force_const_mem (DImode, val);
> >
> > We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so I
> > think there is no need to return false or force_const_mem here.
> >
> > >+    case E_V4SImode:
> > >+      if (CONST_INT_P (val))
> > >+        {
> > >+          if (!TARGET_AVX)
> > >+            return false;
> > >+          if (!TARGET_AVX2)
> > >+            val = force_const_mem (SImode, val);
> >
> > Ditto.
> >
> > >+        }
> > >+      return ix86_vector_duplicate_value (mode, target, val);
> >
> >
> >
> > >         x = gen_reg_rtx (wvmode);
> > >         ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
> > >-        gcc_assert (ok);
> > >+        if (!ok)
> > >+          return false;
> > Then we can still gcc_assert (ok) here.
> > >         emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
> > >-        return ok;
> > >+        return true;
> >
> >
> > >+    case E_V32QImode:
> > >+      if (CONST_INT_P (val))
> > >+        goto widen;
> > >+      /* FALLTHRU */
> > >+
> > >     case E_V16HFmode:
> > >    case E_V16BFmode:
> > >-    case E_V32QImode:
> > >       if (TARGET_AVX2)
> > >         return ix86_vector_duplicate_value (mode, target, val);
> > >       else
> > >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok,
> > machine_mode mode,
> > >           rtx x = gen_reg_rtx (hvmode);
> > >
> > >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >-          gcc_assert (ok);
> > >+          if (!ok)
> > >+            return false;
> > >
> > >           x = gen_rtx_VEC_CONCAT (mode, x, x);
> > >           emit_insn (gen_rtx_SET (target, x)); @@ -15941,7 +16003,8 @@
> > >ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> > >           rtx x = gen_reg_rtx (hvmode);
> > >
> > >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >-          gcc_assert (ok);
> > >+          if (!ok)
> > Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi
> > widen to vec_dupv8si.
> > >+            return false;
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

      reply	other threads:[~2024-01-08  1:50 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-22 10:25 Roger Sayle
2024-01-02  5:39 ` Hongtao Liu
2024-01-06 22:53   ` Roger Sayle
2024-01-08  1:58     ` Hongtao Liu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAMZc-bzPzXNPujQKNNgsui9Y2R9zJSOxESafgpJi3daw2e72Cw@mail.gmail.com \
    --to=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=roger@nextmovesoftware.com \
    --cc=ubizjak@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).