From: Hongtao Liu <crazylht@gmail.com>
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: gcc-patches@gcc.gnu.org, Uros Bizjak <ubizjak@gmail.com>
Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
Date: Mon, 8 Jan 2024 09:58:56 +0800 [thread overview]
Message-ID: <CAMZc-bzPzXNPujQKNNgsui9Y2R9zJSOxESafgpJi3daw2e72Cw@mail.gmail.com> (raw)
In-Reply-To: <01e401da40f3$3ac28c70$b047a550$@nextmovesoftware.com>
On Sun, Jan 7, 2024 at 6:53 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> Hi Hongtao,
>
> Many thanks for the review. This revised patch implements several
> of your suggestions, specifically to use pshufd for V4SImode and
> punpcklqdq for V2DImode. These changes are demonstrated by the
> examples below:
>
> typedef unsigned int v4si __attribute((vector_size(16)));
> typedef unsigned long long v2di __attribute((vector_size(16)));
>
> v4si foo() { return (v4si){1,1,1,1}; }
> v2di bar() { return (v2di){1,1}; }
>
> The previous version of my patch generated:
>
> foo: movdqa .LC0(%rip), %xmm0
> ret
> bar: movdqa .LC1(%rip), %xmm0
> ret
>
> with this revised version, -O2 generates:
>
> foo: movl $1, %eax
> movd %eax, %xmm0
> pshufd $0, %xmm0, %xmm0
> ret
> bar: movl $1, %eax
> movq %rax, %xmm0
> punpcklqdq %xmm0, %xmm0
> ret
>
> However, if it's OK with you, I'd prefer to allow this function to
> return false, safely falling back to emitting a vector load from
> the constant bool rather than ICEing from a gcc_assert. For one
Sure, that makes sense.
> thing this isn't a unrecoverable correctness issue, but at worst
> a missed optimization. The deeper reason is that this usefully
> provides a handle for tuning on different microarchitectures.
> On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC,
> the first form above may be preferable to the second. Currently
> the start of ix86_convert_const_wide_int_to_broadcast disables
> broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an
> implementation doesn't reuire an inter unit move, such as a
> broadcast from memory. I plan follow-up patches that benefit
> from this flexibility.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures. Ok for mainline?
Ok.
>
> gcc/ChangeLog
> PR target/112992
> * config/i386/i386-expand.cc
> (ix86_convert_const_wide_int_to_broadcast): Allow call to
> ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> (ix86_broadcast_from_constant): Revert recent change; Return a
> suitable MEMREF independently of mode/target combinations.
> (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> to decide whether expansion is possible/preferrable. Only try
> forcing DImode constants to memory (and trying again) if calling
> ix86_expand_vector_init_duplicate fails with an DImode immediate
> constant.
> (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
> V4SImode for suitable immediate constants.
> <case E_V4DImode>: Try using V8SImode for suitable constants.
> <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
> <case E_V2HImode>: Likewise.
> <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
> <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
> <label widen>: Handle CONT_INTs via simplify_binary_operation.
> Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
> <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
> (ix86_expand_vector_init): Move try using a broadcast for all_same
> with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/avx512fp16-13.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/pr100865-1.c: Likewise.
> * gcc.target/i386/pr100865-10a.c: Likewise.
> * gcc.target/i386/pr100865-10b.c: Likewise.
> * gcc.target/i386/pr100865-2.c: Likewise.
> * gcc.target/i386/pr100865-3.c: Likewise.
> * gcc.target/i386/pr100865-4a.c: Likewise.
> * gcc.target/i386/pr100865-4b.c: Likewise.
> * gcc.target/i386/pr100865-5a.c: Likewise.
> * gcc.target/i386/pr100865-5b.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr100865-9b.c: Likewise.
> * gcc.target/i386/pr102021.c: Likewise.
> * gcc.target/i386/pr90773-17.c: Likewise.
>
> Thanks in advance.
> Roger
> --
>
> > -----Original Message-----
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 02 January 2024 05:40
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak <ubizjak@gmail.com>
> > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of
> > constants.
> >
> > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > This patch resolves the second part of PR target/112992, building upon
> > > Hongtao Liu's solution to the first part.
> > >
> > > The issue addressed by this patch is that when initializing vectors by
> > > broadcasting integer constants, the compiler has the flexibility to
> > > select the most appropriate vector mode to perform the broadcast, as
> > > long as the resulting vector has an identical bit pattern. For
> > > example, the following constants are all equivalent:
> > > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } V8HImode
> > > {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> > > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ...
> > > 0x01 } So instruction sequences that construct any of these can be
> > > used to construct the others (with a suitable cast/SUBREG).
> > >
> > > On x86_64, it turns out that broadcasts of SImode constants are
> > > preferred, as DImode constants often require a longer movabs
> > > instruction, and HImode and QImode broadcasts require multiple uops on some
> > architectures.
> > > Hence, SImode is always the equal shortest/fastest implementation.
> > >
> > > Examples of this improvement, can be seen in the testsuite.
> > >
> > > gcc.target/i386/pr102021.c
> > > Before:
> > > 0: 48 b8 0c 00 0c 00 0c movabs $0xc000c000c000c,%rax
> > > 7: 00 0c 00
> > > a: 62 f2 fd 28 7c c0 vpbroadcastq %rax,%ymm0
> > > 10: c3 retq
> > >
> > > After:
> > > 0: b8 0c 00 0c 00 mov $0xc000c,%eax
> > > 5: 62 f2 7d 28 7c c0 vpbroadcastd %eax,%ymm0
> > > b: c3 retq
> > >
> > > and
> > > gcc.target/i386/pr90773-17.c:
> > > Before:
> > > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7>
> > > 7: b8 0c 00 00 00 mov $0xc,%eax
> > > c: 62 f2 7d 08 7a c0 vpbroadcastb %eax,%xmm0
> > > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx)
> > > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx)
> > > 1f: c3 retq
> > >
> > > After:
> > > 0: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 7 <foo+0x7>
> > > 7: b8 0c 0c 0c 0c mov $0xc0c0c0c,%eax
> > > c: 62 f2 7d 08 7c c0 vpbroadcastd %eax,%xmm0
> > > 12: 62 f1 7f 08 7f 02 vmovdqu8 %xmm0,(%rdx)
> > > 18: c7 42 0f 0c 0c 0c 0c movl $0xc0c0c0c,0xf(%rdx)
> > > 1f: c3 retq
> > >
> > > where according to Agner Fog's instruction tables broadcastd is
> > > slightly faster on some microarchitectures, for example Knight's Landing.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures. Ok for mainline?
> > >
> > >
> > > 2023-12-21 Roger Sayle <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > > PR target/112992
> > > * config/i386/i386-expand.cc
> > > (ix86_convert_const_wide_int_to_broadcast): Allow call to
> > > ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> > > (ix86_broadcast_from_constant): Revert recent change; Return a
> > > suitable MEMREF independently of mode/target combinations.
> > > (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> > > to decide whether expansion is possible/preferrable. Only try
> > > forcing DImode constants to memory (and trying again) if calling
> > > ix86_expand_vector_init_duplicate fails with an DImode immediate
> > > constant.
> > > (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
> > > V4SImode for suitable immediate constants.
> > > <case E_V4DImode>: Try using V8SImode for suitable constants.
> > > <case E_V4SImode>: Use constant pool for AVX without AVX2.
> > > <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
> > > <case E_V2HImode>: Likewise.
> > > <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
> > > <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
> > > <label widen>: Handle CONT_INTs via simplify_binary_operation.
> > > Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> > > <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
> > > <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
> > > (ix86_expand_vector_init): Move try using a broadcast for all_same
> > > with ix86_expand_vector_init_duplicate before using constant pool.
> > >
> > > gcc/testsuite/ChangeLog
> > > * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
> > > * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> > > * gcc.target/i386/avx512fp16-13.c: Likewise.
> > > * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> > > * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> > > * gcc.target/i386/pr100865-10a.c: Likewise.
> > > * gcc.target/i386/pr100865-10b.c: Likewise.
> > > * gcc.target/i386/pr100865-11c.c: Likewise.
> > > * gcc.target/i386/pr100865-12c.c: Likewise.
> > > * gcc.target/i386/pr100865-2.c: Likewise.
> > > * gcc.target/i386/pr100865-3.c: Likewise.
> > > * gcc.target/i386/pr100865-4a.c: Likewise.
> > > * gcc.target/i386/pr100865-4b.c: Likewise.
> > > * gcc.target/i386/pr100865-5a.c: Likewise.
> > > * gcc.target/i386/pr100865-5b.c: Likewise.
> > > * gcc.target/i386/pr100865-9a.c: Likewise.
> > > * gcc.target/i386/pr100865-9b.c: Likewise.
> > > * gcc.target/i386/pr102021.c: Likewise.
> > > * gcc.target/i386/pr90773-17.c: Likewise.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > >+ case E_V2DImode:
> > >+ if (CONST_INT_P (val))
> > >+ {
> > >+ int tmp = (int)INTVAL (val);
> > >+ if (tmp == (int)(INTVAL (val) >> 32))
> > >+ {
> > >+ rtx reg = gen_reg_rtx (V4SImode);
> > >+ ok = ix86_vector_duplicate_value (V4SImode, reg,
> > >+ GEN_INT (tmp));
> > >+ if (ok)
> > >+ {
> > >+ emit_move_insn (target, gen_lowpart (V2DImode, reg));
> > >+ return true;
> > >+ }
> > >+ }
> > >+ if (!TARGET_AVX)
> > >+ return false;
> > >+ if (!TARGET_AVX2)
> > >+ val = force_const_mem (DImode, val);
> >
> > We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so I
> > think there is no need to return false or force_const_mem here.
> >
> > >+ case E_V4SImode:
> > >+ if (CONST_INT_P (val))
> > >+ {
> > >+ if (!TARGET_AVX)
> > >+ return false;
> > >+ if (!TARGET_AVX2)
> > >+ val = force_const_mem (SImode, val);
> >
> > Ditto.
> >
> > >+ }
> > >+ return ix86_vector_duplicate_value (mode, target, val);
> >
> >
> >
> > > x = gen_reg_rtx (wvmode);
> > > ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
> > >- gcc_assert (ok);
> > >+ if (!ok)
> > >+ return false;
> > Then we can still gcc_assert (ok) here.
> > > emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
> > >- return ok;
> > >+ return true;
> >
> >
> > >+ case E_V32QImode:
> > >+ if (CONST_INT_P (val))
> > >+ goto widen;
> > >+ /* FALLTHRU */
> > >+
> > > case E_V16HFmode:
> > > case E_V16BFmode:
> > >- case E_V32QImode:
> > > if (TARGET_AVX2)
> > > return ix86_vector_duplicate_value (mode, target, val);
> > > else
> > >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok,
> > machine_mode mode,
> > > rtx x = gen_reg_rtx (hvmode);
> > >
> > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >- gcc_assert (ok);
> > >+ if (!ok)
> > >+ return false;
> > >
> > > x = gen_rtx_VEC_CONCAT (mode, x, x);
> > > emit_insn (gen_rtx_SET (target, x)); @@ -15941,7 +16003,8 @@
> > >ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> > > rtx x = gen_reg_rtx (hvmode);
> > >
> > > ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >- gcc_assert (ok);
> > >+ if (!ok)
> > Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi
> > widen to vec_dupv8si.
> > >+ return false;
> >
> > --
> > BR,
> > Hongtao
--
BR,
Hongtao
prev parent reply other threads:[~2024-01-08 1:50 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-22 10:25 Roger Sayle
2024-01-02 5:39 ` Hongtao Liu
2024-01-06 22:53 ` Roger Sayle
2024-01-08 1:58 ` Hongtao Liu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAMZc-bzPzXNPujQKNNgsui9Y2R9zJSOxESafgpJi3daw2e72Cw@mail.gmail.com \
--to=crazylht@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=roger@nextmovesoftware.com \
--cc=ubizjak@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).