public inbox for gcc-bugs@sourceware.org
* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: pinskia at gcc dot gnu.org @ 2021-08-03  0:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: rguenth at gcc dot gnu.org @ 2021-08-03  6:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are now vectorizing the outer loop with the inner loop being unrolled.

If you add #pragma GCC unroll 0 to the inner loop we get comparatively good
code, but we still reduce to scalar four times.
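
A minimal sketch of that experiment, as a hypothetical 4x4 matrix-vector
product (the function name, types and layout here are guesses; the exact
testcase is in the PR, but the dumps below are consistent with a result of
type vector(4) float):

/* Hypothetical sketch, not the PR's exact testcase: a 4x4 matrix-vector
   product with the inner reduction loop kept rolled via the pragma.  */
typedef float vec4 __attribute__((vector_size(16)));

vec4 mat_vec_mul(const float mtx[4][4], const float vec[4])
{
  vec4 res;
  for (int i = 0; i < 4; i++)       /* outer loop: one result lane per row */
    {
      float sum = 0.0f;
#pragma GCC unroll 0                /* block unrolling of the reduction loop */
      for (int j = 0; j < 4; j++)   /* inner loop: dot product of row i */
        sum += mtx[i][j] * vec[j];
      res[i] = sum;
    }
  return res;
}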

If you add #pragma GCC unroll 4 to both loops we apply BB vectorization,
which expands the reductions in a suboptimal way - it now also detects the
reductions, but they are covered by the BB vectorization recognized for the
store of the reduction results.
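
The same hypothetical sketch with both loops annotated instead:

/* Hypothetical variant of the sketch above: request full unrolling of both
   loops, which routes the reductions through BB (SLP) vectorization.  */
#pragma GCC unroll 4
  for (int i = 0; i < 4; i++)
    {
      float sum = 0.0f;
#pragma GCC unroll 4
      for (int j = 0; j < 4; j++)
        sum += mtx[i][j] * vec[j];
      res[i] = sum;
    }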

Note haddp[sd] is slow.


* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: rguenth at gcc dot gnu.org @ 2021-08-03  7:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Just to quote, with unrolling of the inner loop disabled we get

  <bb 2> [local count: 53687093]:
  vect__1.11_14 = MEM <const vector(4) float> [(float *)mtx_12(D)];
  vect__2.14_15 = MEM <const vector(4) float> [(float *)vec_13(D)];
  vect__3.15_21 = vect__1.11_14 * vect__2.14_15;
  _37 = .REDUC_PLUS (vect__3.15_21);
  vectp_mtx.10_46 = mtx_12(D) + 32;
  vect__1.11_47 = MEM <const vector(4) float> [(float *)vectp_mtx.10_46];
  vect__3.15_49 = vect__2.14_15 * vect__1.11_47;
  _52 = .REDUC_PLUS (vect__3.15_49);
  vectp_mtx.10_61 = mtx_12(D) + 64;
  vect__1.11_62 = MEM <const vector(4) float> [(float *)vectp_mtx.10_61];
  vect__3.15_64 = vect__2.14_15 * vect__1.11_62;
  _67 = .REDUC_PLUS (vect__3.15_64);
  vectp_mtx.10_17 = mtx_12(D) + 96;
  vect__1.11_5 = MEM <const vector(4) float> [(float *)vectp_mtx.10_17];
  vect__3.15_30 = vect__1.11_5 * vect__2.14_15;
  _33 = .REDUC_PLUS (vect__3.15_30);

so four optimal inner-loop executions, followed by

  _27 = {_37, _52, _67, _33};
  MEM <vector(4) float> [(float *)&<retval>] = _27;

which is the BB-vectorized store of the reduction results.

This results in

        vmovaps (%rdx), %xmm1
        vmulps  (%rsi), %xmm1, %xmm0
        movq    %rdi, %rax
        vmovhlps        %xmm0, %xmm0, %xmm2
        vaddps  %xmm0, %xmm2, %xmm2
        vshufps $85, %xmm2, %xmm2, %xmm0
        vaddps  %xmm2, %xmm0, %xmm0
        vmulps  32(%rsi), %xmm1, %xmm2
        vmovhlps        %xmm2, %xmm2, %xmm3
        vaddps  %xmm2, %xmm3, %xmm3
        vshufps $85, %xmm3, %xmm3, %xmm2
        vaddps  %xmm3, %xmm2, %xmm2
        vmovaps %xmm2, %xmm3
        vmulps  64(%rsi), %xmm1, %xmm2
        vunpcklps       %xmm3, %xmm0, %xmm0
        vmulps  96(%rsi), %xmm1, %xmm1
        vmovhlps        %xmm2, %xmm2, %xmm4
        vaddps  %xmm2, %xmm4, %xmm4
        vshufps $85, %xmm4, %xmm4, %xmm2
        vaddps  %xmm4, %xmm2, %xmm2
        vmovhlps        %xmm1, %xmm1, %xmm4
        vaddps  %xmm1, %xmm4, %xmm4
        vshufps $85, %xmm4, %xmm4, %xmm1
        vaddps  %xmm4, %xmm1, %xmm1
        vunpcklps       %xmm1, %xmm2, %xmm2
        vmovlhps        %xmm2, %xmm0, %xmm0
        vmovaps %xmm0, (%rdi)
        ret

which I think is quite optimal - using hadd would likely be slower unless its
built-in permutation handling were cleverly re-used to replace some of the
shuffles.
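
For comparison, a hedged intrinsics sketch (not taken from the PR) of the two
horizontal-sum idioms in question - the shuffle/add sequence matching the
code emitted above, and the haddps-based variant; -msse3 is needed for
_mm_hadd_ps:

#include <pmmintrin.h>

/* movhlps/addps + shufps $85/addps - the pattern in the assembly above.  */
static inline float hsum_shuffle(__m128 v)
{
  __m128 hi   = _mm_movehl_ps(v, v);            /* lanes 2,3 -> lanes 0,1 */
  __m128 sum2 = _mm_add_ps(v, hi);              /* pairwise sums in lanes 0,1 */
  __m128 l1   = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
  return _mm_cvtss_f32(_mm_add_ss(sum2, l1));   /* final sum in lane 0 */
}

/* Two haddps - shorter to write, but haddps itself is slow on most uarchs.  */
static inline float hsum_hadd(__m128 v)
{
  v = _mm_hadd_ps(v, v);
  v = _mm_hadd_ps(v, v);
  return _mm_cvtss_f32(v);
}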

