public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX
@ 2023-07-20  7:35 liuhongt
  2023-07-20  8:10 ` Uros Bizjak
  0 siblings, 1 reply; 5+ messages in thread
From: liuhongt @ 2023-07-20  7:35 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak, hubicka

For Intel processors, after TARGET_AVX, vmovdqu is optimized to be as fast
as vlddqu, so UNSPEC_LDDQU can be removed to enable more optimizations.
Can someone confirm this with AMD folks?
If AMD doesn't like such optimization, I'll put my optimization under
micro-architecture tuning.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
If AMD also likes such an optimization, OK for trunk?

gcc/ChangeLog:

	* config/i386/sse.md (<sse3>_lddqu<avxsizesuffix>): Change to
	define_expand, expand as simple move when TARGET_AVX
	&& (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
	The original define_insn is renamed to
	..
	(<sse3>_lddqu<avxsizesuffix>): .. this.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/vlddqu_vinserti128.c: New test.
---
 gcc/config/i386/sse.md                            | 15 ++++++++++++++-
 .../gcc.target/i386/vlddqu_vinserti128.c          | 11 +++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 2d81347c7b6..d571a78f4c4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1835,7 +1835,20 @@ (define_peephole2
   [(set (match_dup 4) (match_dup 1))]
   "operands[4] = adjust_address (operands[0], V2DFmode, 0);")
 
-(define_insn "<sse3>_lddqu<avxsizesuffix>"
+(define_expand "<sse3>_lddqu<avxsizesuffix>"
+  [(set (match_operand:VI1 0 "register_operand")
+	(unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
+		    UNSPEC_LDDQU))]
+  "TARGET_SSE3"
+{
+  if (TARGET_AVX && (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
+    {
+      emit_move_insn (operands[0], operands[1]);
+      DONE;
+    }
+})
+
+(define_insn "*<sse3>_lddqu<avxsizesuffix>"
   [(set (match_operand:VI1 0 "register_operand" "=x")
 	(unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
 		    UNSPEC_LDDQU))]
diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
new file mode 100644
index 00000000000..29699a5fa7f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx2 -O2" } */
+/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
+/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
+
+#include <immintrin.h>
+__m256i foo(void *data) {
+    __m128i X1 = _mm_lddqu_si128((__m128i*)data);
+    __m256i V1 = _mm256_broadcastsi128_si256 (X1);
+    return V1;
+}
-- 
2.39.1.388.g2fc9e9ca3c


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX
  2023-07-20  7:35 [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX liuhongt
@ 2023-07-20  8:10 ` Uros Bizjak
  2023-07-20 23:50   ` Hongtao Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Uros Bizjak @ 2023-07-20  8:10 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, hubicka

On Thu, Jul 20, 2023 at 9:35 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast
> as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations.
> Can someone confirm this with AMD folks?
> If AMD doesn't like such optimization, I'll put my optimization under
> micro-architecture tuning.

The instruction is reachable only as __builtin_ia32_lddqu* (aka
_mm_lddqu_si*), so it was chosen by the programmer for a reason. I
think that in this case, the compiler should not be too smart and
change the instruction behind the programmer's back. The caveats are
also explained at length in the ISA manual.

Uros.

> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> If AMD also like such optimization, Ok for trunk?
>
> gcc/ChangeLog:
>
>         * config/i386/sse.md (<sse3>_lddqu<avxsizesuffix>): Change to
>         define_expand, expand as simple move when TARGET_AVX
>         && (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
>         The original define_insn is renamed to
>         ..
>         (<sse3>_lddqu<avxsizesuffix>): .. this.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/vlddqu_vinserti128.c: New test.
> ---
>  gcc/config/i386/sse.md                            | 15 ++++++++++++++-
>  .../gcc.target/i386/vlddqu_vinserti128.c          | 11 +++++++++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 2d81347c7b6..d571a78f4c4 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1835,7 +1835,20 @@ (define_peephole2
>    [(set (match_dup 4) (match_dup 1))]
>    "operands[4] = adjust_address (operands[0], V2DFmode, 0);")
>
> -(define_insn "<sse3>_lddqu<avxsizesuffix>"
> +(define_expand "<sse3>_lddqu<avxsizesuffix>"
> +  [(set (match_operand:VI1 0 "register_operand")
> +       (unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
> +                   UNSPEC_LDDQU))]
> +  "TARGET_SSE3"
> +{
> +  if (TARGET_AVX && (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
> +    {
> +      emit_move_insn (operands[0], operands[1]);
> +      DONE;
> +    }
> +})
> +
> +(define_insn "*<sse3>_lddqu<avxsizesuffix>"
>    [(set (match_operand:VI1 0 "register_operand" "=x")
>         (unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
>                     UNSPEC_LDDQU))]
> diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> new file mode 100644
> index 00000000000..29699a5fa7f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> +
> +#include <immintrin.h>
> +__m256i foo(void *data) {
> +    __m128i X1 = _mm_lddqu_si128((__m128i*)data);
> +    __m256i V1 = _mm256_broadcastsi128_si256 (X1);
> +    return V1;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX
  2023-07-20  8:10 ` Uros Bizjak
@ 2023-07-20 23:50   ` Hongtao Liu
  2023-08-02  1:31     ` [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128 liuhongt
  0 siblings, 1 reply; 5+ messages in thread
From: Hongtao Liu @ 2023-07-20 23:50 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: liuhongt, gcc-patches, hubicka

On Thu, Jul 20, 2023 at 4:11 PM Uros Bizjak via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 20, 2023 at 9:35 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast
> > as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations.
> > Can someone confirm this with AMD folks?
> > If AMD doesn't like such optimization, I'll put my optimization under
> > micro-architecture tuning.
>
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart and
> change the instruction behind the programmer's back. The caveats are
> also explained at length in the ISA manual.
fine.
>
> Uros.
>
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > If AMD also like such optimization, Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/sse.md (<sse3>_lddqu<avxsizesuffix>): Change to
> >         define_expand, expand as simple move when TARGET_AVX
> >         && (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
> >         The original define_insn is renamed to
> >         ..
> >         (<sse3>_lddqu<avxsizesuffix>): .. this.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/vlddqu_vinserti128.c: New test.
> > ---
> >  gcc/config/i386/sse.md                            | 15 ++++++++++++++-
> >  .../gcc.target/i386/vlddqu_vinserti128.c          | 11 +++++++++++
> >  2 files changed, 25 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 2d81347c7b6..d571a78f4c4 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1835,7 +1835,20 @@ (define_peephole2
> >    [(set (match_dup 4) (match_dup 1))]
> >    "operands[4] = adjust_address (operands[0], V2DFmode, 0);")
> >
> > -(define_insn "<sse3>_lddqu<avxsizesuffix>"
> > +(define_expand "<sse3>_lddqu<avxsizesuffix>"
> > +  [(set (match_operand:VI1 0 "register_operand")
> > +       (unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
> > +                   UNSPEC_LDDQU))]
> > +  "TARGET_SSE3"
> > +{
> > +  if (TARGET_AVX && (<MODE_SIZE> == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
> > +    {
> > +      emit_move_insn (operands[0], operands[1]);
> > +      DONE;
> > +    }
> > +})
> > +
> > +(define_insn "*<sse3>_lddqu<avxsizesuffix>"
> >    [(set (match_operand:VI1 0 "register_operand" "=x")
> >         (unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
> >                     UNSPEC_LDDQU))]
> > diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > new file mode 100644
> > index 00000000000..29699a5fa7f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx2 -O2" } */
> > +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> > +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> > +
> > +#include <immintrin.h>
> > +__m256i foo(void *data) {
> > +    __m128i X1 = _mm_lddqu_si128((__m128i*)data);
> > +    __m256i V1 = _mm256_broadcastsi128_si256 (X1);
> > +    return V1;
> > +}
> > --
> > 2.39.1.388.g2fc9e9ca3c
> >



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128
  2023-07-20 23:50   ` Hongtao Liu
@ 2023-08-02  1:31     ` liuhongt
  2023-08-02  5:47       ` Uros Bizjak
  0 siblings, 1 reply; 5+ messages in thread
From: liuhongt @ 2023-08-02  1:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak

In [1], I propose a patch to generate vmovdqu for all vlddqu intrinsics
after AVX2, it's rejected as
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart and
> change the instruction behind the programmer's back. The caveats are
> also explained at length in the ISA manual.

So the patch is more conservative, only optimize vlddqu + vinserti128
to vbroadcasti128.
vlddqu + vinserti128 will use the shuffle port in addition to the load port
compared to vbroadcasti128. From a latency perspective, vbroadcasti128 is no

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625122.html

Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	* config/i386/sse.md (*avx2_lddqu_inserti_to_bcasti): New
	pre_reload define_insn_and_split.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/vlddqu_vinserti128.c: New test.
---
 gcc/config/i386/sse.md                         | 18 ++++++++++++++++++
 .../gcc.target/i386/vlddqu_vinserti128.c       | 11 +++++++++++
 2 files changed, 29 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 2d81347c7b6..4bdd2b43ba7 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -26600,6 +26600,24 @@ (define_insn "avx2_vbroadcasti128_<mode>"
    (set_attr "prefix" "vex,evex,evex")
    (set_attr "mode" "OI")])
 
+;; optimize vlddqu + vinserti128 to vbroadcasti128, the former will use
+;; extra shuffle port in addition to load port than the latter.
+;; For latency perspective,vbroadcasti is no worse.
+(define_insn_and_split "avx2_lddqu_inserti_to_bcasti"
+  [(set (match_operand:V4DI 0 "register_operand" "=x,v,v")
+	(vec_concat:V4DI
+	  (subreg:V2DI
+	    (unspec:V16QI [(match_operand:V16QI 1 "memory_operand")]
+			  UNSPEC_LDDQU) 0)
+	  (subreg:V2DI (unspec:V16QI [(match_dup 1)]
+			  UNSPEC_LDDQU) 0)))]
+  "TARGET_AVX2 && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(vec_concat:V4DI (match_dup 1) (match_dup 1)))]
+  "operands[1] = adjust_address (operands[1], V2DImode, 0);")
+
 ;; Modes handled by AVX vec_dup patterns.
 (define_mode_iterator AVX_VEC_DUP_MODE
   [V8SI V8SF V4DI V4DF])
diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
new file mode 100644
index 00000000000..29699a5fa7f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx2 -O2" } */
+/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
+/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
+
+#include <immintrin.h>
+__m256i foo(void *data) {
+    __m128i X1 = _mm_lddqu_si128((__m128i*)data);
+    __m256i V1 = _mm256_broadcastsi128_si256 (X1);
+    return V1;
+}
-- 
2.39.1.388.g2fc9e9ca3c


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128
  2023-08-02  1:31     ` [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128 liuhongt
@ 2023-08-02  5:47       ` Uros Bizjak
  0 siblings, 0 replies; 5+ messages in thread
From: Uros Bizjak @ 2023-08-02  5:47 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches

On Wed, Aug 2, 2023 at 3:33 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> In [1], I propose a patch to generate vmovdqu for all vlddqu intrinsics
> after AVX2, it's rejected as
> > The instruction is reachable only as __builtin_ia32_lddqu* (aka
> > _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> > think that in this case, the compiler should not be too smart and
> > change the instruction behind the programmer's back. The caveats are
> > also explained at length in the ISA manual.
>
> So the patch is more conservative, only optimize vlddqu + vinserti128
> to vbroadcasti128.
> vlddqu + vinserti128 will use shuffle port in addition to load port
> comparing to vbroadcasti128, For latency perspective,vbroadcasti is no
> worse than vlddqu + vinserti128.
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625122.html
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         * config/i386/sse.md (*avx2_lddqu_inserti_to_bcasti): New
>         pre_reload define_insn_and_split.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/vlddqu_vinserti128.c: New test.

OK with a small change below.

Thanks,
Uros.

> ---
>  gcc/config/i386/sse.md                         | 18 ++++++++++++++++++
>  .../gcc.target/i386/vlddqu_vinserti128.c       | 11 +++++++++++
>  2 files changed, 29 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 2d81347c7b6..4bdd2b43ba7 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -26600,6 +26600,24 @@ (define_insn "avx2_vbroadcasti128_<mode>"
>     (set_attr "prefix" "vex,evex,evex")
>     (set_attr "mode" "OI")])
>
> +;; optimize vlddqu + vinserti128 to vbroadcasti128, the former will use
> +;; extra shuffle port in addition to load port than the latter.
> +;; For latency perspective,vbroadcasti is no worse.
> +(define_insn_and_split "avx2_lddqu_inserti_to_bcasti"
> +  [(set (match_operand:V4DI 0 "register_operand" "=x,v,v")
> +       (vec_concat:V4DI
> +         (subreg:V2DI
> +           (unspec:V16QI [(match_operand:V16QI 1 "memory_operand")]
> +                         UNSPEC_LDDQU) 0)
> +         (subreg:V2DI (unspec:V16QI [(match_dup 1)]
> +                         UNSPEC_LDDQU) 0)))]
> +  "TARGET_AVX2 && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +       (vec_concat:V4DI (match_dup 1) (match_dup 1)))]
> +  "operands[1] = adjust_address (operands[1], V2DImode, 0);")

No need to validate address before reload, adjust_address_nv can be used.

> +
>  ;; Modes handled by AVX vec_dup patterns.
>  (define_mode_iterator AVX_VEC_DUP_MODE
>    [V8SI V8SF V4DI V4DF])
> diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> new file mode 100644
> index 00000000000..29699a5fa7f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> +
> +#include <immintrin.h>
> +__m256i foo(void *data) {
> +    __m128i X1 = _mm_lddqu_si128((__m128i*)data);
> +    __m256i V1 = _mm256_broadcastsi128_si256 (X1);
> +    return V1;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-08-02  5:47 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20  7:35 [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX liuhongt
2023-07-20  8:10 ` Uros Bizjak
2023-07-20 23:50   ` Hongtao Liu
2023-08-02  1:31     ` [PATCH] Optimize vlddqu + inserti128 to vbroadcasti128 liuhongt
2023-08-02  5:47       ` Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).