[Bug tree-optimization/102494] New: Failure to optimize out vector reduction properly especially when using OpenMP

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/102494] New: Failure to optimize out vector reduction properly especially when using OpenMP
From: gabravier at gmail dot com @ 2021-09-27  0:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

            Bug ID: 102494
           Summary: Failure to optimize out vector reduction properly
                    especially when using OpenMP
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

#include <stdint.h>
#include <stddef.h>

typedef int8_t simde_int8x8_t __attribute__((__vector_size__(8)));

int16_t
simde_vaddlv_s8(simde_int8x8_t a) {
    int16_t r = 0;

#pragma omp simd reduction(+:r)
    for (size_t i = 0 ; i < (sizeof(a) / sizeof(a[0])) ; i++) {
      r += a[i];
    }

    return r;
}

Compiled with -O3 -fopenmp-simd, this is the output on AMD64:

simde_vaddlv_s8(signed char __vector(8)):
        pxor    xmm1, xmm1
        movdqa  xmm2, xmm0
        pcmpgtb xmm1, xmm0
        punpcklbw       xmm0, xmm1
        punpcklbw       xmm2, xmm1
        pshufd  xmm0, xmm0, 78
        movq    QWORD PTR [rsp-24], xmm2
        movq    QWORD PTR [rsp-16], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp-24]
        psrldq  xmm0, 8
        paddw   xmm0, XMMWORD PTR [rsp-24]
        movdqa  xmm1, xmm0
        psrldq  xmm1, 4
        paddw   xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 2
        paddw   xmm0, xmm1
        pextrw  eax, xmm0, 0
        ret

This is what Clang manages:

simde_vaddlv_s8(signed char __vector(8)):
        punpcklbw       xmm0, xmm0              # xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        psraw   xmm0, 8
        pshufd  xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        paddw   xmm1, xmm0
        pshufd  xmm0, xmm1, 85                  # xmm0 = xmm1[1,1,1,1]
        paddw   xmm0, xmm1
        movdqa  xmm1, xmm0
        psrld   xmm1, 16
        paddw   xmm1, xmm0
        movd    eax, xmm1
        ret

Weirdly enough, removing the `#pragma omp simd reduction(+:r)` slightly improves
GCC's output to this:

simde_vaddlv_s8(signed char __vector(8)):
        pxor    xmm1, xmm1
        movdqa  xmm2, xmm0
        pcmpgtb xmm1, xmm0
        punpcklbw       xmm0, xmm1
        punpcklbw       xmm2, xmm1
        pshufd  xmm0, xmm0, 78
        paddw   xmm0, xmm2
        pextrw  edx, xmm0, 1
        pextrw  eax, xmm0, 0
        add     eax, edx
        pextrw  edx, xmm0, 2
        add     eax, edx
        pextrw  edx, xmm0, 3
        add     eax, edx
        ret

* [Bug tree-optimization/102494] Failure to optimize out vector reduction properly especially when using OpenMP
From: pinskia at gcc dot gnu.org @ 2021-09-27  1:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This works both with and without -fopenmp-simd on aarch64-linux-gnu, which has a
reduction addition instruction.

So this just looks like a matter of how reduction addition is handled for x86_64.

Also we have:
  MEM <vector(4) short int> [(short int *)&D.2916] = vect__21.35_111;
  MEM <vector(4) short int> [(short int *)&D.2916 + 8B] = vect__21.35_112;
  vect__24.24_88 = MEM <vector(8) short int> [(short int *)&D.2916];

* [Bug tree-optimization/102494] Failure to optimize out vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-09-27  3:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
It seems x86 doesn't support the optab reduc_plus_scal_v8hi yet.

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-09-27  5:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #2)
> It seems x86 doesn't support the optab reduc_plus_scal_v8hi yet.
The vectorizer does the work for the backend.

typedef short v8hi __attribute__((vector_size(16)));
short
foo1 (v8hi p, int n)
{
  short sum = 0;
  for (int i = 0; i != 8; i++)
    sum += p[i];
  return sum;
}

  # sum_21 = PHI <sum_9(3)>
  # vect_sum_9.26_5 = PHI <vect_sum_9.26_6(3)>
  _22 = (vector(8) unsigned short) vect_sum_9.26_5;
  _23 = VEC_PERM_EXPR <_22, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  _24 = _23 + _22;
  _25 = VEC_PERM_EXPR <_24, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  _26 = _25 + _24;
  _27 = VEC_PERM_EXPR <_26, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  _28 = _27 + _26;
  _28 = _27 + _26;
  stmp_sum_9.27_29 = BIT_FIELD_REF <_28, 16, 0>;


But for the case in this PR, the v8qi input is widened into two v4hi vectors,
and there is no vector reduction for v4hi.
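The shift-and-add pattern in the GIMPLE above can be sketched in portable C. This is only an illustration of the technique, not GCC's implementation: the names `LANES`, `shift_lanes_down` and `reduce_plus_v8hi` are made up for this sketch, and plain arrays stand in for vector registers. Each VEC_PERM_EXPR with a zero second operand acts as a whole-vector shift that pulls in zeros, and the additions are done in unsigned short, as in the dump.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };

/* Emulates VEC_PERM_EXPR <v, {0,...,0}, {shift, shift+1, ...}>:
   lane i takes v[i + shift], or 0 once the index runs into the zero vector. */
static void shift_lanes_down(const uint16_t in[LANES], uint16_t out[LANES],
                             size_t shift)
{
    for (size_t i = 0; i < LANES; i++)
        out[i] = (i + shift < LANES) ? in[i + shift] : 0;
}

/* Hypothetical helper: log2(LANES) shift+add steps, then read lane 0
   (the BIT_FIELD_REF <_28, 16, 0> in the dump). */
int16_t reduce_plus_v8hi(const int16_t p[LANES])
{
    uint16_t acc[LANES], shifted[LANES];

    /* Work in unsigned short so wraparound is well defined. */
    for (size_t i = 0; i < LANES; i++)
        acc[i] = (uint16_t)p[i];

    /* Shifts of 4, 2 and 1 lanes, matching the three permutes above. */
    for (size_t shift = LANES / 2; shift >= 1; shift /= 2) {
        shift_lanes_down(acc, shifted, shift);
        for (size_t i = 0; i < LANES; i++)
            acc[i] = (uint16_t)(acc[i] + shifted[i]);
    }
    return (int16_t)acc[0];
}
```

Only lane 0 is meaningful at the end; the other lanes hold partial sums that are simply discarded.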

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-09-27  5:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> v4hi.

We need to add (define_expand "reduc_plus_scal_v4hi", just like (define_expand
"reduc_plus_scal_v8qi" in mmx.md.

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-09-27  5:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #4)
> > 
> > But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> > v4hi.
> 
> We need add (define_expand "reduc_plus_scal_v4hi" just like (define_expand
> "reduc_plus_scal_v8qi" in mmx.md.

The same goes for reduc_{umax,umin,smax,smin}_scal_v4hi.

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: rguenth at gcc dot gnu.org @ 2021-09-27  8:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
The vectorizer looks for a way to "shift" the whole vector by either vec_shr
or a corresponding vec_perm with constant shuffle operands.  When the target
provides none of those you get element extracts and scalar adds.

So yes, the vectorizer does the work for you but only if you hand it the
pieces.

It could possibly use a larger vector, doing only the "tail" of its final
reduction (so try with v8hi instead of v4hi), but it's not really clear whether
such a strategy would be good in general.
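The two strategies mentioned in this comment can be contrasted with a small portable C sketch. It is an illustration under assumptions, not what the vectorizer emits: both function names are hypothetical, arrays stand in for vector registers, and the `upper_garbage` parameter exists only to show that the upper lanes of the wider vector never reach lane 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Fallback when the target offers no vec_shr / constant vec_perm:
   extract every lane and add in scalar code. */
int16_t reduce_v4hi_by_extracts(const int16_t v[4])
{
    uint16_t sum = 0;
    for (size_t i = 0; i < 4; i++)
        sum = (uint16_t)(sum + (uint16_t)v[i]);
    return (int16_t)sum;
}

/* "Tail" strategy: place the 4 lanes in the low half of an 8-lane vector
   and run only the last two shift+add steps (shift by 2, then by 1). */
int16_t reduce_v4hi_in_v8hi_tail(const int16_t v[4],
                                 const int16_t upper_garbage[4])
{
    uint16_t wide[8];

    for (size_t i = 0; i < 4; i++) {
        wide[i] = (uint16_t)v[i];
        wide[i + 4] = (uint16_t)upper_garbage[i]; /* don't-care lanes */
    }

    for (size_t shift = 2; shift >= 1; shift /= 2) {
        uint16_t shifted[8];
        for (size_t i = 0; i < 8; i++)
            shifted[i] = (i + shift < 8) ? wide[i + shift] : 0;
        for (size_t i = 0; i < 8; i++)
            wide[i] = (uint16_t)(wide[i] + shifted[i]);
    }
    return (int16_t)wide[0];
}
```

Lane 0 only ever combines lanes 0 to 3, so whatever sits in the upper half does not affect the result. Whether the wider operation is profitable on a given target is the open question raised above.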

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-09-28  6:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
After adding support for the v4hi reduction, the GIMPLE still does not look
optimal for the v8qi to v8hi conversion.

  vector(4) short int vect__21.36;
  vector(4) unsigned short vect__2.31;
  int16_t stmp_r_17.17;
  vector(8) short int vect__16.15;
  int16_t D.2229[8];
  vector(8) short int _50;
  vector(8) short int _51;
  vector(8) short int _52;
  vector(8) short int _53;
  vector(8) short int _54;
  vector(8) short int _55;

  <bb 2> [local count: 189214783]:
  vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
  vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
  vect__21.36_105 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_97);
  vect__21.36_106 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_98);
  MEM <vector(4) short int> [(short int *)&D.2229] = vect__21.36_105;
  MEM <vector(4) short int> [(short int *)&D.2229 + 8B] = vect__21.36_106;
  vect__16.15_47 = MEM <vector(8) short int> [(short int *)&D.2229];

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: rguenther at suse dot de @ 2021-09-28  7:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 28 Sep 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
> 
> --- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
> After supporting v4hi reduce, gimple seems not optimal to convert v8qi to v8hi.
> 
>   vector(4) short int vect__21.36;
>   vector(4) unsigned short vect__2.31;
>   int16_t stmp_r_17.17;
>   vector(8) short int vect__16.15;
>   int16_t D.2229[8];
>   vector(8) short int _50;
>   vector(8) short int _51;
>   vector(8) short int _52;
>   vector(8) short int _53;
>   vector(8) short int _54;
>   vector(8) short int _55;
>
>   <bb 2> [local count: 189214783]:
>   vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
>   vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
>   vect__21.36_105 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_97);
>   vect__21.36_106 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_98);
>   MEM <vector(4) short int> [(short int *)&D.2229] = vect__21.36_105;
>   MEM <vector(4) short int> [(short int *)&D.2229 + 8B] = vect__21.36_106;

So the above could possibly use a V8QI -> V8HI conversion, though the loop
vectorizer isn't good at producing those.  And of course the
appropriate conversion optab has to exist.

>   vect__16.15_47 = MEM <vector(8) short int> [(short int *)&D.2229];

Here there's a lack of "CSE" - I do have patches somewhere to turn this into

  vect__16.15_47 = { vect__21.36_105, vect__21.36_106 };

but I'm not sure that's going to be profitable (well, the code as-is
will take a store-to-load forwarding (STLF) hit).

There's also store-merging that could instead merge the stores
similarly (but then there's no CSE after store-merging so the load
would remain).
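As a rough source-level picture of the two shapes discussed here, the following sketch uses GCC's vector extensions. It is an assumption-laden illustration, not the patch mentioned above: the typedefs and function names are invented for this example, and the compiler is free to optimize either function differently from the GIMPLE in this PR.

```c
#include <string.h>

typedef short v4hi __attribute__((vector_size(8)));
typedef short v8hi __attribute__((vector_size(16)));

/* Shape of the GIMPLE as-is: two 8-byte stores into a temporary,
   followed by one 16-byte reload of the same memory. */
v8hi concat_through_memory(v4hi lo, v4hi hi)
{
    short tmp[8];
    v8hi out;

    memcpy(&tmp[0], &lo, sizeof lo);
    memcpy(&tmp[4], &hi, sizeof hi);
    memcpy(&out, tmp, sizeof out);
    return out;
}

/* Shape after the proposed transform: build the wide vector directly
   from the two halves, like { vect__21.36_105, vect__21.36_106 }. */
v8hi concat_as_constructor(v4hi lo, v4hi hi)
{
    v8hi out = { lo[0], lo[1], lo[2], lo[3], hi[0], hi[1], hi[2], hi[3] };
    return out;
}
```

Both functions return the same value; the difference is only whether the concatenation is expressed through memory or as a vector constructor.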

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: cvs-commit at gcc dot gnu.org @ 2021-10-08  2:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486

commit r12-4241-g77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486
Author: liuhongt <hongtao.liu@intel.com>
Date:   Tue Sep 28 12:55:10 2021 +0800

    Support reduc_{plus,smax,smin,umax,umin}_scal_v4hi.

    gcc/ChangeLog:

            PR target/102494
            * config/i386/i386-expand.c (emit_reduc_half): Handle V4HImode.
            * config/i386/mmx.md (reduc_plus_scal_v4hi): New.
            (reduc_<code>_scal_v4hi): New.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/mmx-reduce-op-1.c: New test.
            * gcc.target/i386/mmx-reduce-op-2.c: New test.

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: peter at cordes dot ca @ 2021-10-25 21:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #10 from Peter Cordes <peter at cordes dot ca> ---
Current trunk with -fopenmp is still not good: https://godbolt.org/z/b3jjhcvTa
It is still doing two separate sign extensions, plus two stores followed by a
wider reload (a store-forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
        push    rbp
        vpmovsxbw       xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw       xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
   ... then asm using byte-shifts

Including stuff like
   movdqa  xmm1, xmm0
   psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.

And ARM64 goes scalar.

----

Current trunk *without* -fopenmp produces decent asm
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3)
simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc  -O3 -march=skylake
simde_vaddlv_s8:
        vpmovsxbw       xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw       xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret


That's pretty good, but  VMOVD eax, xmm0  would be more efficient than  VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.
  ...
  vmovd  eax, xmm0
  rorx   edx, eax, 16
  add    eax, edx
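The scalar finish suggested above (vmovd, rorx by 16, add) can be modelled in C as follows. This is a sketch of the arithmetic only, with hypothetical names `rotr32` and `finish_reduction_scalar`; it assumes the low 32 bits of the vector hold the last two 16-bit partial sums, and it does not claim that a compiler will select RORX for it.

```c
#include <stdint.h>

/* Rotate right by n bits; n must be in 1..31 for this helper. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* low_dword models EAX after "vmovd eax, xmm0": two 16-bit partial sums,
   lane 0 in bits 0..15 and lane 1 in bits 16..31. */
int16_t finish_reduction_scalar(uint32_t low_dword)
{
    /* rorx edx, eax, 16 ; add eax, edx */
    uint32_t sum = low_dword + rotr32(low_dword, 16);

    /* Only the low 16 bits are the result; the high half is garbage. */
    return (int16_t)(uint16_t)sum;
}
```

The low half of the 32-bit add receives no carry-in, so bits 0..15 equal lane 0 plus lane 1 modulo 2^16, which is all a 16-bit return value needs.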

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: peter at cordes dot ca @ 2021-10-25 22:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #11 from Peter Cordes <peter at cordes dot ca> ---
Also, horizontal byte sums are generally best done with  VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
 vpxor    xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80) flip to unsigned 0..255 range
 vpxor    xmm1, xmm1
 vpsadbw  xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
 vmovd    eax, xmm0              # we only wanted the low half anyway
 sub      eax, 8 * 128      # subtract the bias we added earlier by flipping sign bits
 ret

This is so much shorter we'd still be ahead if we generated the vector constant
on the fly instead of loading it.  (3 instructions: vpcmpeqd same,same / vpabsb
/ vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

  ...
  vpsadbw ...

  vpshufd  xmm1, xmm0, 0xfe     # shuffle upper 64 bits to the bottom
  vpaddd   xmm0, xmm0, xmm1
  vmovd    eax, xmm0
  sub      eax, 16 * 128

Works efficiently with only SSE2.  Actually with AVX2, we should unpack the top
half with VPUNPCKHQDQ to save a byte (no immediate operand), since we don't need
PSHUFD copy-and-shuffle.

Or movd / pextrw / scalar add but that's more uops: pextrw is 2 on its own.
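The bias trick behind the VPSADBW sequence can be checked with a scalar model in portable C. This is a sketch of the arithmetic rather than of the instructions themselves: the function name is hypothetical, and a plain loop stands in for the sum of absolute differences against a zero vector.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: xor with 0x80, psadbw against zero, subtract 8 * 128. */
int16_t sum_s8x8_via_unsigned_bias(const int8_t a[8])
{
    uint32_t sad = 0;

    for (size_t i = 0; i < 8; i++) {
        /* Flipping the sign bit maps -128..127 onto 0..255, i.e. adds 128. */
        uint8_t biased = (uint8_t)((uint8_t)a[i] ^ 0x80u);

        /* |biased - 0| is just biased, so the SAD is a plain byte sum. */
        sad += biased;
    }

    /* Each of the 8 lanes contributed a bias of 128; remove it. */
    return (int16_t)((int32_t)sad - 8 * 128);
}
```

The unsigned sum is at most 8 * 255 = 2040, so the corrected result lies in -1024..1016 and always fits in int16_t.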

* [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
From: crazylht at gmail dot com @ 2021-10-26  8:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---

> That's pretty good, but  VMOVD eax, xmm0  would be more efficient than 
> VPEXTRW when we don't need to avoid high garbage (because it's a return
> value in this case). 
And TARGET_AVX512FP16 has vmovw.
