[Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled

* [Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled
@ 2014-09-24  5:39 agner at agner dot org
  2014-09-24  7:41 ` [Bug target/63351] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: agner at agner dot org @ 2014-09-24  5:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

            Bug ID: 63351
           Summary: Optimization: contract broadcast intrinsics when
                    AVX512 is enabled
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: agner at agner dot org

The AVX512 instruction set allows instructions with broadcast, but there are no
corresponding intrinsic functions. The programmer has to write a broadcast
intrinsic followed by some other intrinsic and rely on the compiler to contract
this into a single instruction.

I would expect the optimizer to contract a broadcast intrinsic with any
subsequent intrinsic into a single instruction. For example:

// gcc -Ofast -mavx512f

#include "x86intrin.h"

void dummyz(__m512i a, __m512i b);

void broadcastz(__m512i a, int b) {
    // expect reduction to instruction with broadcast,
    // something like: vpaddd b, %zmm0, %zmm3 {1to16}
    __m512i bb = _mm512_set1_epi32(b);
    __m512i ab = _mm512_add_epi32(a,bb);
    __m512i cc = _mm512_set1_epi32(5);
    __m512i ac = _mm512_add_epi32(a,cc);
    dummyz(ab, ac);
}


This should actually be possible for smaller vector sizes as well when AVX512
is enabled:

void dummyx(__m128 a, __m128 b);

void broadcastx(__m128 a, float b) {
    // broadcasting should even be possible with smaller vectors
    __m128 bb = _mm_set1_ps(b);
    __m128 ab = _mm_add_ps(a,bb);
    __m128 cc = _mm_set1_ps(5.0);
    __m128 ac = _mm_add_ps(a,cc);
    dummyx(ab, ac);
}



* [Bug target/63351] Optimization: contract broadcast intrinsics when AVX512 is enabled
  2014-09-24  5:39 [Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled agner at agner dot org
@ 2014-09-24  7:41 ` rguenth at gcc dot gnu.org
  2014-09-25  5:51 ` agner at agner dot org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-09-24  7:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*, i?86-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-09-24
          Component|c                           |target
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Interesting - so AVX512 allows all(?) source operands to be scalars (in
vector registers)?

In theory combine should be able to handle this if the backend provides
proper patterns.  But I see that _mm512_set1_epi32 expands to sth like

;; _7 = __builtin_ia32_pbroadcastd512_gpr_mask (b_1(D), _6, -1);

(insn 7 6 8 (set (reg:SI 101)
        (reg/v:SI 99 [ b ])) ./include/avx512fintrin.h:3566 -1
     (nil))

(insn 8 7 9 (set (reg:V16SI 102)
        (subreg:V16SI (reg/v:V8DI 83 [ __Y ]) 0)) ./include/avx512fintrin.h:3566 -1
     (nil))

(insn 9 8 10 (set (reg:HI 103)
        (const_int -1 [0xffffffffffffffff])) ./include/avx512fintrin.h:3566 -1
     (nil))

(insn 10 9 11 (set (reg:V16SI 100)
        (vec_merge:V16SI (vec_duplicate:V16SI (reg:SI 101))
            (reg:V16SI 102)
            (reg:HI 103))) ./include/avx512fintrin.h:3566 -1
     (nil))

(insn 11 10 0 (set (reg:V16SI 85 [ D.15281 ])
        (reg:V16SI 100)) ./include/avx512fintrin.h:3566 -1
     (nil))

which looks really awkward - or even bogus (insn 10).  What's the semantics
of _mm512_set1_epi32?

It seems that all of the intrinsics expand to sth weird as the above
(the vec_merge), even _mm512_add_epi32.

I'm quite sure this doesn't make the combiner's job easier.



* [Bug target/63351] Optimization: contract broadcast intrinsics when AVX512 is enabled
  2014-09-24  5:39 [Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled agner at agner dot org
  2014-09-24  7:41 ` [Bug target/63351] " rguenth at gcc dot gnu.org
@ 2014-09-25  5:51 ` agner at agner dot org
  2014-09-25  6:32 ` kyukhin at gcc dot gnu.org
  2014-09-25  7:11 ` kyukhin at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: agner at agner dot org @ 2014-09-25  5:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

--- Comment #2 from Agner Fog <agner at agner dot org> ---
AVX512 allows all _memory_ source operands to broadcast from a scalar on almost
all vector instructions for 128-, 256- and 512-bit vectors with 32- or 64-bit
elements. See section 4.6.1 in "Intel® Architecture Instruction Set Extensions
Programming Reference"
https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf

This feature comes for free; there is no performance cost to broadcasting other
than making the instruction prefix longer for vector sizes smaller than 512.

This feature has no explicit support in intrinsic functions, so the only way to
utilize this excellent optimization opportunity without using assembly is to
contract broadcast intrinsics with subsequent instructions.

An obvious application is to store scalar constants as 32- or 64-bit constants
rather than as full vectors.

Often, it is not known to the programmer whether a variable is stored in memory
or in a register. If a scalar variable is already in a register then it is
better to use a broadcast instruction. If the scalar variable is in memory then
it is better to contract the broadcast into the vector instruction that uses
it, even if the broadcasted value is used multiple times.
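
To make the semantics concrete, here is a minimal scalar model in plain C of
what a 512-bit vpaddd computes with a full vector memory operand versus a
{1to16} broadcast memory operand. It is only a sketch of the arithmetic, not
intrinsics code and not a description of the hardware; the function names
add_epi32_full and add_epi32_bcst are hypothetical and exist only for this
illustration.

```c
#include <stdint.h>

#define LANES 16  /* 512 bits / 32-bit elements */

/* Model of vpaddd with a full 64-byte vector memory operand:
   the constant has to be stored as 16 separate elements. */
void add_epi32_full(int32_t dst[LANES], const int32_t a[LANES],
                    const int32_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}

/* Model of vpaddd with a {1to16} broadcast memory operand:
   a single 4-byte scalar is read and reused for every lane. */
void add_epi32_bcst(int32_t dst[LANES], const int32_t a[LANES],
                    const int32_t *b)
{
    uint32_t s = (uint32_t)*b;  /* the only load from the constant */
    for (int i = 0; i < LANES; i++)
        dst[i] = (int32_t)((uint32_t)a[i] + s);
}
```

Both functions produce the same result when every element of b in the first
form equals the scalar in the second form; the difference is that the
broadcast form needs 4 bytes of constant data instead of 64.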
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56253

--- Comment #13 from Agner Fog <agner at agner dot org> ---
Thank you. I agree that integer overflow should be well-defined when using
intrinsics.

Is it possible to do the same optimization with boolean vector intrinsics, such
as _mm_and_epi32 and _mm_or_ps to enable optimizations such as algebraic
reduction and constant propagation?



* [Bug target/63351] Optimization: contract broadcast intrinsics when AVX512 is enabled
  2014-09-24  5:39 [Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled agner at agner dot org
  2014-09-24  7:41 ` [Bug target/63351] " rguenth at gcc dot gnu.org
  2014-09-25  5:51 ` agner at agner dot org
@ 2014-09-25  6:32 ` kyukhin at gcc dot gnu.org
  2014-09-25  7:11 ` kyukhin at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: kyukhin at gcc dot gnu.org @ 2014-09-25  6:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

--- Comment #3 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
Hello,
For AVX-512F (zmm registers):
We have a patch which enables this, based on the
combiner machinery: a new subst which allows
`broadcasted' versions of patterns.
The combiner can then combine (load-bcst + actual insn)
into (actual insn w/ bcst-ed mem-op).

This patch generates embedded broadcasts for cases such as:
+/* { dg-options "-O3 -mavx512f" } */
+/* { dg-final { scan-assembler-times "vmulps\[ \\t\]+\[^\n\]*.*1to16.*%zmm\[0-9\]\[\\n\]" 1 } } */
+
+#define N 16
+
+float f1 (float *c1_p, float *c2_p)
+{
+
+  float a[N];
+  float b[N];
+  float c[N];
+  float c1 = *c1_p;
+  float c2 = *c2_p;
+  int i;
+
+  for (i = 0; i < N; i++)
+  {
+    a[i] = c1;
+    b[i] = c2;
+  }
+
+  for (i = 0; i < N; i++)
+  {
+    c[i] = a[i] * b[i];
+  }
+
+  return c[(int)(c1 + c2) % N];
+}

The patch has almost no impact on SPEC2006 (one of the reasons
is that the combiner does not work across basic blocks).

For AVX-512VL ([xy]mm registers):
Such an optimization should also be applicable once
all the new patterns have reached the trunk.



* [Bug target/63351] Optimization: contract broadcast intrinsics when AVX512 is enabled
  2014-09-24  5:39 [Bug c/63351] New: Optimization: contract broadcast intrinsics when AVX512 is enabled agner at agner dot org
                   ` (2 preceding siblings ...)
  2014-09-25  6:32 ` kyukhin at gcc dot gnu.org
@ 2014-09-25  7:11 ` kyukhin at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: kyukhin at gcc dot gnu.org @ 2014-09-25  7:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

--- Comment #4 from Kirill Yukhin <kyukhin at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> ;; _7 = __builtin_ia32_pbroadcastd512_gpr_mask (b_1(D), _6, -1);
> 
> (insn 7 6 8 (set (reg:SI 101)
>         (reg/v:SI 99 [ b ])) ./include/avx512fintrin.h:3566 -1
>      (nil))
> 
> (insn 8 7 9 (set (reg:V16SI 102)
>         (subreg:V16SI (reg/v:V8DI 83 [ __Y ]) 0)) ./include/avx512fintrin.h:3566 -1
>      (nil))
> 
> (insn 9 8 10 (set (reg:HI 103)
>         (const_int -1 [0xffffffffffffffff])) ./include/avx512fintrin.h:3566 -1
>      (nil))
> 
> (insn 10 9 11 (set (reg:V16SI 100)
>         (vec_merge:V16SI (vec_duplicate:V16SI (reg:SI 101))
>             (reg:V16SI 102)
>             (reg:HI 103))) ./include/avx512fintrin.h:3566 -1
>      (nil))
> 
> (insn 11 10 0 (set (reg:V16SI 85 [ D.15281 ])
>         (reg:V16SI 100)) ./include/avx512fintrin.h:3566 -1
>      (nil))
> 
> which looks really awkward - or even bogus (insn 10).  What's the semantics
> of _mm512_set1_epi32?

This was the generic approach we took when adding support for new built-ins.
The straightforward approach would add the following built-ins for almost
every new insn:
  res = op_built_in (x)
  res = op_built_in_mask (x, res, mask)
  res = op_built_in_mask_zero (x, mask)
resulting in up to 3 built-ins per new instruction (and embedded rounding is
also possible on top of that).

We decided to add a built-in for `op_built_in_mask' only, resulting in:
  res = op_built_in_mask (a, _mm512_undefined (), -1)
  res = op_built_in_mask (x, res, mask)
  res = op_built_in_mask (x, 0, mask)
and relying on optimizations to use the proper pattern for all 3 cases.
BTW, this is covered by tests. E.g. `__builtin_ia32_pbroadcastd512_gpr_mask'
is checked in `gcc.target/i386/avx512f-vpbroadcastd-1.c'.
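
As a rough illustration of why one masked built-in can cover all three forms,
here is a scalar model in plain C of the (vec_merge (vec_duplicate x) src mask)
pattern from insn 10. It is a sketch only: the names broadcastd_mask,
model_set1, model_mask_set1 and model_maskz_set1 are hypothetical and are not
GCC built-ins or intrinsics, and a zero-filled array stands in for
_mm512_undefined ().

```c
#include <stdint.h>

#define LANES 16  /* V16SI: sixteen 32-bit elements */

/* Scalar model of (vec_merge (vec_duplicate x) src mask):
   lane i takes the broadcast value when bit i of mask is set,
   otherwise it keeps the value from src. */
void broadcastd_mask(int32_t dst[LANES], int32_t x,
                     const int32_t src[LANES], uint16_t mask)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = ((mask >> i) & 1) ? x : src[i];
}

/* Unmasked form: with an all-ones mask the pass-through operand
   is never selected, so its contents do not matter. */
void model_set1(int32_t dst[LANES], int32_t x)
{
    int32_t undefined[LANES] = {0};  /* placeholder for an undefined vector */
    broadcastd_mask(dst, x, undefined, 0xFFFF);
}

/* Merge-masking form: unselected lanes keep the old value of res. */
void model_mask_set1(int32_t res[LANES], uint16_t mask, int32_t x)
{
    broadcastd_mask(res, x, res, mask);
}

/* Zero-masking form: unselected lanes become zero. */
void model_maskz_set1(int32_t dst[LANES], uint16_t mask, int32_t x)
{
    int32_t zero[LANES] = {0};
    broadcastd_mask(dst, x, zero, mask);
}
```

With the mask fixed at -1, the merge in the first form selects the duplicated
value in every lane, which is why the pattern can be reduced to a plain
vec_duplicate.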

If you compile it with `-O2', you can see that for:
   x = _mm512_set1_epi32 (z);

the following assembly is generated:
        movl    z(%rip), %eax   # 5     *movsi_internal/1       [length = 7]
        vpbroadcastd    %eax, %zmm0     # 9     *avx512f_vec_dup_gprv16si       [length = 6]
        vmovdqa64       %zmm0, x(%rip)  # 12    *movv8di_internal/3     [length = 11]

> It seems that all of the intrinsics expand to sth weird as the above
> (the vec_merge), even _mm512_add_epi32.
> 
> I'm quite sure this doesn't make the combiner's job easier.
Definitely.


