public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not
@ 2021-12-21 23:29 hubicka at gcc dot gnu.org
  2021-12-22  1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-21 23:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

            Bug ID: 103797
           Summary: Clang vectorized LightPixel while GCC does not
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Clang vectorises divss in LightPixel while GCC does not (at -O3).  This seems
to account for a 17% difference in the rasterflood_svg benchmark of Firefox.

       │    0000000001864660 <mozilla::gfx::(anonymous namespace)::SpecularLightingSoftware::LightPixel(mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float> const&, mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float> const&, unsigned int)>:
       │    mozilla::gfx::(anonymous namespace)::SpecularLightingSoftware::LightPixel(mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float> const&, mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float> const&, unsigned int):
  0.05 │      push      %rbp
  0.07 │      mov       %rsp,%rbp
  0.71 │      xorps     %xmm6,%xmm6
  0.32 │      addss     %xmm6,%xmm4
       │      unpcklps  %xmm3,%xmm5
  0.78 │      movss     anon.5bcbce9b5eeaaf1a18a99b9a5b62e1ce.3.llvm.5306652999446557335+0x6d8,%xmm8
  0.01 │      addps     %xmm8,%xmm5
  1.47 │      movaps    %xmm4,%xmm9
       │      mulss     %xmm4,%xmm9
       │      movaps    %xmm5,%xmm7
  0.01 │      mulps     %xmm5,%xmm7
  3.35 │      movaps    %xmm7,%xmm3
       │      shufps    $0x55,%xmm7,%xmm3
  0.99 │      addss     %xmm9,%xmm3
  1.59 │      addss     %xmm7,%xmm3
  2.01 │      sqrtss    %xmm3,%xmm3
 11.43 │      divss     %xmm3,%xmm4
  6.76 │      shufps    $0x0,%xmm3,%xmm3
  0.01 │      divps     %xmm3,%xmm5
  2.58 │      mulss     %xmm1,%xmm4
  0.04 │      unpcklps  %xmm0,%xmm2
       │      mulps     %xmm5,%xmm2
  2.67 │      movaps    %xmm2,%xmm0
  0.04 │      shufps    $0x55,%xmm2,%xmm0
  2.11 │      addss     %xmm4,%xmm0
  1.87 │      addss     %xmm2,%xmm0
  2.82 │      cmpless   %xmm0,%xmm6
  2.20 │      andps     %xmm8,%xmm6
  1.05 │      mulss     %xmm0,%xmm6
  4.04 │      mulss     .str.6.llvm.231702015065810902+0x77,%xmm6
  3.14 │      cvttss2si %xmm6,%eax
  4.45 │      mov       0x8(%rdi),%ecx
  0.00 │      mov       0xc(%rdi),%edx
       │      movzwl    %ax,%eax
  1.10 │      test      %edx,%edx
       │    ↓ jle       92
       │88:   imul      %eax,%eax
  9.06 │      shr       $0xf,%eax
  3.12 │      dec       %edx
       │    ↑ jne       88
       │92:   shr       $0x8,%eax
  1.95 │      movzwl    0x10(%rdi,%rax,2),%eax
  6.48 │      imul      %eax,%ecx
  0.99 │      shr       $0x8,%ecx
  1.06 │      mov       %esi,%eax
  0.01 │      shr       $0x8,%eax
       │      mov       %esi,%edx
       │      shr       $0x10,%edx
  0.01 │      mov       $0xff,%edi
       │      and       %edi,%esi
  0.01 │      imul      %ecx,%esi
  3.32 │      shr       $0xf,%esi
  1.81 │      cmp       %edi,%esi
  0.04 │      cmovae    %edi,%esi
  1.99 │      and       %edi,%eax
  0.01 │      imul      %ecx,%eax
       │      shr       $0xf,%eax
  0.01 │      cmp       %edi,%eax
  0.28 │      cmovae    %edi,%eax
  0.96 │      and       %edi,%edx
       │      imul      %ecx,%edx
       │      shr       $0xf,%edx
  0.92 │      cmp       %edi,%edx
  0.85 │      cmovae    %edi,%edx
  1.00 │      cmp       %eax,%edx
  1.20 │      mov       %eax,%ecx
       │      cmova     %edx,%ecx
  2.17 │      cmp       %esi,%ecx
  1.15 │      cmovbe    %esi,%ecx
  1.79 │      shl       $0x18,%ecx
  1.17 │      shl       $0x10,%edx
       │      shl       $0x8,%eax
  0.03 │      or        %edx,%eax
  0.01 │      or        %esi,%eax
  0.14 │      or        %ecx,%eax
  0.72 │      pop       %rbp
  0.04 │    ← ret

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
@ 2021-12-22  1:16 ` pinskia at gcc dot gnu.org
  2021-12-22  8:46 ` marxin at gcc dot gnu.org
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-22  1:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Severity|normal                      |enhancement


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
  2021-12-22  1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
@ 2021-12-22  8:46 ` marxin at gcc dot gnu.org
  2021-12-22  9:14 ` hubicka at kam dot mff.cuni.cz
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22  8:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
     Ever confirmed|0                           |1
                 CC|                            |marxin at gcc dot gnu.org
   Last reconfirmed|                            |2021-12-22

--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
Can you please attach a reduced test-case?


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
  2021-12-22  1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
  2021-12-22  8:46 ` marxin at gcc dot gnu.org
@ 2021-12-22  9:14 ` hubicka at kam dot mff.cuni.cz
  2021-12-22  9:21 ` marxin at gcc dot gnu.org
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> Can you please attach a reduced test-case?
Do you know how to produce one with reasonable effort?  The
declarations are quite convoluted, but the function is well isolated and
easy to inspect in the full one...


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-12-22  9:14 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22  9:21 ` marxin at gcc dot gnu.org
  2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22  9:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to hubicka from comment #2)
> > Can you please attach a reduced test-case?
> Do you know how to produce one with a reasonable effort?

-E and remove not needed code.

> The
> declarations are quite convoluted, but the function is well isolated and
> easy to inspect in the full one...

Are we talking about:
https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
?

It should be possible to create a synthetic test that does the same (and lives
in a loop, right?).


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-12-22  9:21 ` marxin at gcc dot gnu.org
@ 2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
  2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> -E and remove not needed code.
> 
> > The
> > declarations are quite convoluted, but the function is well isolated and
> > easy to inspect in the full one...
> 
> Are we talking about:
> https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
> ?
Yes.
> 
> It should be possible to create a synthetic test that does the same (and lives
> in a loop, right?).

Well, I tried that for a while and got a bit lost (either the code got
vectorized by both gcc and clang or by neither).  There are more issues
where we have over 50% regression wrt the clang build in gfx code, so I
think I will first try to reproduce those locally and perf them to see
if there is a broader pattern here.

The relevant code is:

uint32_t mozilla::gfx::{anonymous}::SpecularLightingSoftware::LightPixel (struct SpecularLightingSoftware * const this, const struct Point3D & aNormal, const struct Point3D & aVectorToLight, uint32_t aColor)
{

  <bb 2> [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = _50 + 0.0;
  _52 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.x;
  _53 = _52 + 0.0;
  _80 = _53 * _53;
  _82 = _51 * _51;
  _83 = _80 + _82;
  _85 = _49 * _49;
  _86 = _83 + _85;
  if (_86 u>= 0.0)
    goto <bb 3>; [99.95%]
  else
    goto <bb 4>; [0.05%]

  <bb 3> [local count: 118052545]:
  _87 = .SQRT (_86);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 59055]:
  _29 = __builtin_sqrtf (_86);

  <bb 5> [local count: 118111600]:
  # _30 = PHI <_29(4), _87(3)>
  _88 = _53 / _30;
  _89 = _51 / _30;
  _90 = _49 / _30;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _88;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _89;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _90;
  _47 = _27 + _46;
  if (_47 >= 0.0)
    goto <bb 12>; [59.00%]
  else
    goto <bb 6>; [41.00%]


With -Ofast it gets a bit more streamlined:


  <bb 2> [local count: 118111600]:
  _48 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.z;
  _49 = _48 + 1.0e+0;
  _50 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.y;
  _51 = MEM[(const struct BasePoint3D *)aVectorToLight_25(D)].D.75826.D.75829.x;
  powmult_78 = _51 * _51;
  powmult_80 = _50 * _50;
  _81 = powmult_78 + powmult_80;
  powmult_83 = _49 * _49;
  _84 = _81 + powmult_83;
  _85 = __builtin_sqrtf (_84);
  _86 = _51 / _85;
  _87 = _50 / _85;
  _88 = _49 / _85;
  _41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
  _39 = _41 * _86;
  _37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
  _33 = _37 * _87;
  _27 = _33 + _39;
  _45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
  _46 = _45 * _88;
  _47 = _27 + _46;
  if (_47 >= 0.0)
    goto <bb 3>; [59.00%]
  else
    goto <bb 9>; [41.00%]

But I do not quite see in the slp dump why this is not considered for
vectorization.

I attach the dump.
Honza


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
  2021-12-22 11:30 ` marxin at gcc dot gnu.org
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Created attachment 52042
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52042&action=edit
b.slp1


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 11:30 ` marxin at gcc dot gnu.org
  2021-12-22 13:44 ` hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22 11:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
You may try exporting GIMPLE IL that can be consumed with -fgimple.


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-12-22 11:30 ` marxin at gcc dot gnu.org
@ 2021-12-22 13:44 ` hubicka at gcc dot gnu.org
  2021-12-22 14:30 ` pinskia at gcc dot gnu.org
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-22 13:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
OK, here is a completely fake testcase that does similar operations:

#include <math.h>
struct test {float x; float y; float z;} test;
float f;
void
t()
{
  float x = test.x;
  float y = test.y;
  float z = test.z;

  x = x * f;
  y = y * f;
  z = z * f;
  x = sqrt (x);
  y = sqrt (y);
  z = sqrt (z);
  x = x / f;
  y = y / f;
  z = z / f;
  test.x=x;
  test.y=y;
  test.z=z;
}

We seem to fail to vectorize it with:

t.c:20:9: missed:   op not supported by target.
t.c:17:5: missed:   not vectorized: relevant stmt not supported: x_15 = x_24 / f.0_1;

clang seems to use divps happily, so I am not sure why it is not supported.
Even funnier is that with -Ofast it is compiled into multiplication by the
reciprocal:

t:
.LFB0:
        .cfi_startproc
        movss   f(%rip), %xmm4
        movss   .LC0(%rip), %xmm2
        movss   test(%rip), %xmm0
        movss   test+4(%rip), %xmm3
        divss   %xmm4, %xmm2
        movss   test+8(%rip), %xmm1
        mulss   %xmm4, %xmm0
        mulss   %xmm4, %xmm3
        mulss   %xmm4, %xmm1
        sqrtss  %xmm0, %xmm0
        sqrtss  %xmm3, %xmm3
        sqrtss  %xmm1, %xmm1
        mulss   %xmm2, %xmm0
        mulss   %xmm2, %xmm3
        mulss   %xmm2, %xmm1
        unpcklps        %xmm3, %xmm0
        movlps  %xmm0, test(%rip)
        movss   %xmm1, test+8(%rip)
        ret


and rewriting it that way by hand:

#include <math.h>
struct test {float x; float y; float z;} test;
float f;
void
t()
{
  float x = test.x;
  float y = test.y;
  float z = test.z;
  float m = 1/f;

  x = x * f;
  y = y * f;
  z = z * f;
  x = sqrt (x);
  y = sqrt (y);
  z = sqrt (z);
  x = x * m;
  y = y * m;
  z = z * m;
  test.x=x;
  test.y=y;
  test.z=z;
}

gets the expected result:
t:
.LFB0:
        .cfi_startproc
        movss   f(%rip), %xmm0
        movq    test(%rip), %xmm1
        movaps  %xmm0, %xmm2
        shufps  $0xe0, %xmm2, %xmm2
        mulps   %xmm1, %xmm2
        movss   .LC0(%rip), %xmm1
        divss   %xmm0, %xmm1
        mulss   test+8(%rip), %xmm0
        sqrtps  %xmm2, %xmm2
        sqrtss  %xmm0, %xmm0
        movaps  %xmm1, %xmm3
        shufps  $0xe0, %xmm3, %xmm3
        mulss   %xmm0, %xmm1
        mulps   %xmm3, %xmm2
        movss   %xmm1, test+8(%rip)
        movlps  %xmm2, test(%rip)
        ret
        .cfi_endproc

With this in hand, however, I do not see SLP analyzing the divide in the
original code at all.


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-12-22 13:44 ` hubicka at gcc dot gnu.org
@ 2021-12-22 14:30 ` pinskia at gcc dot gnu.org
  2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-22 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #7)
> With this in hand, however, I do not see SLP analyzing the divide in the
> original code at all.

recip pass happens after vectorization ....
I don't know/understand why though.


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-12-22 14:30 ` pinskia at gcc dot gnu.org
@ 2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
  2021-12-22 19:34 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 14:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> recip pass happens after vectorization ....
> I don't know/understand why though.
Yep, I suppose we want to either special-case this in the vectorizer or run
the recip pass earlier...  I also wonder why the code is vectorized for pairs
of values with the third one computed separately, and why we don't use vectors
of length 4...


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 19:34 ` jakub at gcc dot gnu.org
  2021-12-22 20:29 ` hubicka at gcc dot gnu.org
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-12-22 19:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
At least on your short testcase clang doesn't use divps either.
We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
before using divps?

Another question is if we could teach SLP to vectorize even factors that are
not a power of 2; say, loads/stores (and with e.g. AVX512 almost everything)
could be done with masked loads/stores, most arithmetic could be done normally,
and we'd just need to watch what values we'll get in the extra elts and make
sure it doesn't generate exceptions etc.
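
To make the "extra elts" concern concrete, here is a small sketch of a group of three divisions done in one four-wide operation.  It assumes the GCC/Clang generic vector extension; `v4sf` and `div3_by_scalar` are illustrative names, not anything from GCC's sources.  The point is only that the padding lane is given harmless operands (0.0f / 1.0f) so it cannot raise an exception of its own.

```c
#include <string.h>

/* Generic vector of four floats (GCC/Clang vector extension).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Divide v[0..2] by f in place using one four-wide division.  */
void
div3_by_scalar (float v[3], float f)
{
  v4sf num = { 0.0f, 0.0f, 0.0f, 0.0f };
  /* The extra lane of the divisor is 1.0f rather than f, so the padding
     lane always computes 0.0f / 1.0f, even when f is zero.  */
  v4sf den = { f, f, f, 1.0f };

  memcpy (&num, v, 3 * sizeof (float));

  v4sf quot = num / den;
  memcpy (v, &quot, 3 * sizeof (float));
}
```

Padding the divisor with f as well would be fine for nonzero f, but for f == 0 the extra lane would compute 0/0 and raise an invalid-operation exception that the scalar code never raises.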


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2021-12-22 19:34 ` jakub at gcc dot gnu.org
@ 2021-12-22 20:29 ` hubicka at gcc dot gnu.org
  2021-12-23  8:12 ` ubizjak at gmail dot com
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-22 20:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Aha, I did not notice that we need special patterns (I expected this to be a
problem to solve in machine-independent code).  So I guess we have
 1) SLP should vectorize the 3 accesses with -ffast-math to only one vector
operation (as opposed to the one vector + one scalar it does now)
 2) we could add a divv2sf3 pattern which initializes the unused upper elts of
operand2 to 1.0f to avoid funny results
 3) we need to figure out why SLP vectorization is not even considered in the
original testcase (which I do not seem to be able to dig out with reasonable
effort in a way that it preserves original properties - to be vectorized by
clang and not vectorized by gcc)


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2021-12-22 20:29 ` hubicka at gcc dot gnu.org
@ 2021-12-23  8:12 ` ubizjak at gmail dot com
  2021-12-23  8:52 ` ubizjak at gmail dot com
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23  8:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #10)
> At least on your short testcase clang doesn't use divps either.
> We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> before using divps?

It could be done, but I was under the impression that the sequence to load 1.0f
into the topmost elements nullifies the benefit of the operation to divide two
elements.  However, if the missing pattern prevents longer vectorized chains,
this is not entirely true.

The division can be implemented in the same way as sse_cvtps2pi, but using
CONST1_RTX vector.


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2021-12-23  8:12 ` ubizjak at gmail dot com
@ 2021-12-23  8:52 ` ubizjak at gmail dot com
  2021-12-23  8:58 ` ubizjak at gmail dot com
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23  8:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 52051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52051&action=edit
Patch that implements v2sf division

Please try the attached patch, for the following testcase:

--cut here--
float a[2], b[2], r[2];

void bar (void)
{
  int i;

  for (i = 0; i < 2; i++)
    r[i] = a[i] / b[i];
}
--cut here--

the compiler generates:

        movq    b(%rip), %xmm1
        movq    a(%rip), %xmm0
        movhps  .LC0(%rip), %xmm1
        divps   %xmm1, %xmm0
        movlps  %xmm0, r(%rip)
        ret
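
Read instruction by instruction, the sequence above amounts to the following, written here with the GCC/Clang generic vector extension.  This is a hedged reading of the assembly, not the pattern from the patch; `v4sf` and `div_v2sf` are names invented for the illustration.

```c
#include <string.h>

/* Generic vector of four floats (GCC/Clang vector extension).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* r[0..1] = a[0..1] / b[0..1] using a single four-wide division.  */
void
div_v2sf (float r[2], const float a[2], const float b[2])
{
  v4sf va = { 0.0f, 0.0f, 0.0f, 0.0f };  /* movq: low half loaded, high half zero */
  v4sf vb = { 0.0f, 0.0f, 1.0f, 1.0f };  /* movhps: high half of the divisor is 1.0f */

  memcpy (&va, a, 2 * sizeof (float));
  memcpy (&vb, b, 2 * sizeof (float));

  v4sf vr = va / vb;                     /* divps: the high lanes compute 0 / 1 */
  memcpy (r, &vr, 2 * sizeof (float));   /* movlps: store the low half only */
}
```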


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2021-12-23  8:52 ` ubizjak at gmail dot com
@ 2021-12-23  8:58 ` ubizjak at gmail dot com
  2021-12-23  9:15 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23  8:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #14 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #13)
> Created attachment 52051 [details]
> Patch that implements v2sf division

This patch also enables vectorization of the testcase from Comment #7. Using
-ffast-math, it also generates a vectorized reciprocal:

        movss   f(%rip), %xmm4
        movss   test+8(%rip), %xmm3
        movq    test(%rip), %xmm2
        mulss   %xmm4, %xmm3
        movaps  %xmm4, %xmm0
        shufps  $0xe0, %xmm0, %xmm0
        mulps   %xmm0, %xmm2
        movhps  .LC0(%rip), %xmm0
-->     rcpps   %xmm0, %xmm1
        sqrtss  %xmm3, %xmm3
        mulps   %xmm1, %xmm0
        sqrtps  %xmm2, %xmm2
        divss   %xmm4, %xmm3
        movaps  %xmm2, %xmm5
        mulps   %xmm1, %xmm0
        addps   %xmm1, %xmm1
        subps   %xmm0, %xmm1
        mulps   %xmm1, %xmm5
        movlps  %xmm5, test(%rip)
        movss   %xmm3, test+8(%rip)
        ret
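
The instructions following rcpps look like one Newton-Raphson refinement of the reciprocal estimate, followed by the multiply with the dividend.  A scalar C sketch of that arithmetic is below; the estimate is passed in by the caller as a stand-in for the rcpps result, and both function names are made up for the illustration.

```c
/* One Newton-Raphson step for 1/d starting from the estimate e:
   2*e - d*e*e, in the same operation order as the mulps, mulps,
   addps, subps sequence.  */
float
refine_recip (float d, float e)
{
  return (e + e) - (d * e) * e;
}

/* x / d computed as x times the refined reciprocal of d;
   e0 stands in for the hardware estimate.  */
float
div_via_recip (float x, float d, float e0)
{
  return x * refine_recip (d, e0);
}
```

Each step roughly squares the relative error of the estimate, so an estimate that is off by 0.4% comes back accurate to about 0.002%.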


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2021-12-23  8:58 ` ubizjak at gmail dot com
@ 2021-12-23  9:15 ` jakub at gcc dot gnu.org
  2021-12-23  9:47 ` hubicka at kam dot mff.cuni.cz
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-12-23  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #12)
> (In reply to Jakub Jelinek from comment #10)
> > At least on your short testcase clang doesn't use divps either.
> > We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> > TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> > but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> > before using divps?
> 
> It could be done, but I was under the impression that the sequence to load 1.0f
> into the topmost elements nullifies the benefit of the operation to divide two

Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
division so that we would use it only if it is part of longer sequences?


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2021-12-23  9:15 ` jakub at gcc dot gnu.org
@ 2021-12-23  9:47 ` hubicka at kam dot mff.cuni.cz
  2021-12-23 11:16 ` ubizjak at gmail dot com
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-23  9:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > 
> > It could be done, but I was under the impression that the sequence to load 1.0f
> > into the topmost elements nullifies the benefit of the operation to divide two
> 
> Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
> division so that we would use it only if it is part of longer sequences?

I wonder how the hardware implements it.  If divps is of similar latency
as divss then I guess it is essentially always a win to load 1.0 to the
upper part, since it is a slow operation.  On the other hand, if divps is
about 4 times as slow as divss, then this may be harmful.

Agner Fog seems to be listing divss and divps with the same latencies.
For zen it is 10 cycles, which should be enough to do the setup.


* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2021-12-23  9:47 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-23 11:16 ` ubizjak at gmail dot com
  2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23 11:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #17 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to hubicka from comment #16)
> > > 
> > > It could be done, but I was under the impression that the sequence to load 1.0f
> > > into the topmost elements nullifies the benefit of the operation to divide two
> > 
> > Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
> > division so that we would use it only if it is part of longer sequences?
> 
> I wonder how the hardware implements it.  If divps is of similar latency
> as divss then I guess it is essentially always a win to load 1.0 to the
> upper part, since it is a slow operation.  On the other hand, if divps is
> about 4 times as slow as divss, then this may be harmful.
> 
> Agner Fog seems to be listing divss and divps with the same latencies.
> For zen it is 10 cycles, which should be enough to do the setup.

OK, I'll prepare and test a formal patch.

* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2021-12-23 11:16 ` ubizjak at gmail dot com
@ 2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
  2022-01-03 13:37 ` hubicka at gcc dot gnu.org
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-12-24 16:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

--- Comment #18 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:8f921393e339090566c1589d81009caa954de90d

commit r12-6113-g8f921393e339090566c1589d81009caa954de90d
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Fri Dec 24 17:09:36 2021 +0100

    i386: Add V2SFmode DIV insn pattern [PR95046, PR103797]

    Use V4SFmode "DIVPS X,Y" with [y0, y1, 1.0f, 1.0f] as a divisor
    to avoid division by zero.

    2021-12-24  Uroš Bizjak  <ubizjak@gmail.com>

    gcc/ChangeLog:

            PR target/95046
            PR target/103797
            * config/i386/mmx.md (divv2sf3): New instruction pattern.

    gcc/testsuite/ChangeLog:

            PR target/95046
            PR target/103797
            * gcc.target/i386/pr95046-1.c (test_div): Add.
            (dg-options): Add -mno-recip.
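
The idea in the commit message can be sketched with SSE intrinsics. This is
only an illustration under assumptions, not the code GCC generates: the
helper name div_v2sf is hypothetical, and the numerator's upper lanes are
set to 0.0f here for simplicity. The two-element divisor is widened to
[y0, y1, 1.0f, 1.0f] so that the unused upper lanes of the four-lane divps
never divide by zero.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ps, _mm_div_ps, _mm_storeu_ps */

/* Hypothetical helper (not GCC's implementation): divide x[0..1] by
   y[0..1] using one four-lane divps.  */
static void div_v2sf(const float x[2], const float y[2], float out[2])
{
    /* _mm_set_ps lists lanes from highest to lowest.  The upper two
       divisor lanes are padded with 1.0f so they cannot divide by zero.  */
    __m128 num = _mm_set_ps(0.0f, 0.0f, x[1], x[0]);
    __m128 den = _mm_set_ps(1.0f, 1.0f, y[1], y[0]);

    __m128 quot = _mm_div_ps(num, den);   /* divps */

    /* Only the low two lanes carry the V2SF result.  */
    float tmp[4];
    _mm_storeu_ps(tmp, quot);
    out[0] = tmp[0];
    out[1] = tmp[1];
}
```

Calling div_v2sf with x = {6.0f, 1.0f} and y = {3.0f, 4.0f} leaves the two
quotients in out; the results of the padded lanes are simply discarded.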

* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
@ 2022-01-03 13:37 ` hubicka at gcc dot gnu.org
  2022-01-04 13:16 ` rguenth at gcc dot gnu.org
  2022-01-07  6:39 ` pinskia at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2022-01-03 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=88602

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It turns out that there is yet another #ifdef __clang__ I missed in the gfx
library, so the vectorised code produced by clang is hand-written using the
extensions discussed in PR88602.

Sorry for the confusion. However, I think the simplified testcase is still
perfectly vectorisable and we should add the div patterns as discussed.

* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2022-01-03 13:37 ` hubicka at gcc dot gnu.org
@ 2022-01-04 13:16 ` rguenth at gcc dot gnu.org
  2022-01-07  6:39 ` pinskia at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
            Version|unknown                     |12.0
         Resolution|---                         |FIXED

--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
The division is now vectorized; your short testcase produces

t:
.LFB0:
        .cfi_startproc
        movss   f(%rip), %xmm4
        movss   test+8(%rip), %xmm3
        movq    test(%rip), %xmm0
        mulss   %xmm4, %xmm3
        movaps  %xmm4, %xmm1
        shufps  $0xe0, %xmm1, %xmm1
        mulps   %xmm1, %xmm0
        movhps  .LC0(%rip), %xmm1
        rcpps   %xmm1, %xmm2
        sqrtss  %xmm3, %xmm3
        mulps   %xmm2, %xmm1
        sqrtps  %xmm0, %xmm0
        divss   %xmm4, %xmm3
        mulps   %xmm2, %xmm1
        addps   %xmm2, %xmm2
        subps   %xmm1, %xmm2
        mulps   %xmm2, %xmm0
        movlps  %xmm0, test(%rip)
        movss   %xmm3, test+8(%rip)
        ret
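
One way to read the listing above: the vector division is not a divps but
an rcpps reciprocal estimate refined by one Newton-Raphson step
(x1 = 2*x0 - d*x0*x0), followed by a multiply. That is presumably because
the testcase was built with fast-math-style options that allow reciprocal
approximations. The sketch below mirrors that sequence with SSE intrinsics.
The helper names are hypothetical, and the assumption that .LC0 supplies
1.0f for the upper divisor lanes is taken from the commit description, not
from the listing itself.

```c
#include <xmmintrin.h>  /* SSE: _mm_rcp_ps, _mm_mul_ps, _mm_add_ps, _mm_sub_ps */

/* Approximate 1/d per lane: rcpps estimate plus one Newton-Raphson step,
   x1 = 2*x0 - d*x0*x0 (the rcpps/mulps/mulps/addps/subps run).  */
static __m128 recip_refined(__m128 d)
{
    __m128 x0 = _mm_rcp_ps(d);                       /* rcpps: rough estimate */
    __m128 t  = _mm_mul_ps(_mm_mul_ps(d, x0), x0);   /* mulps, mulps: d*x0*x0 */
    __m128 x2 = _mm_add_ps(x0, x0);                  /* addps: 2*x0           */
    return _mm_sub_ps(x2, t);                        /* subps: 2*x0 - d*x0^2  */
}

/* Hypothetical helper: n[0..1] / d[0..1] via the refined reciprocal.
   Upper divisor lanes are padded with 1.0f (assumed content of .LC0).  */
static void div_v2sf_recip(const float n[2], const float d[2], float out[2])
{
    __m128 num = _mm_set_ps(0.0f, 0.0f, n[1], n[0]);
    __m128 den = _mm_set_ps(1.0f, 1.0f, d[1], d[0]);

    __m128 quot = _mm_mul_ps(num, recip_refined(den));  /* final mulps */

    float tmp[4];
    _mm_storeu_ps(tmp, quot);
    out[0] = tmp[0];
    out[1] = tmp[1];
}
```

The result is close to, but not necessarily bit-identical with, a true
division, which would explain why the commit adds -mno-recip to the test's
dg-options.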

* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
  2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2022-01-04 13:16 ` rguenth at gcc dot gnu.org
@ 2022-01-07  6:39 ` pinskia at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-07  6:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0

Thread overview: 23+ messages
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
2021-12-22  1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
2021-12-22  8:46 ` marxin at gcc dot gnu.org
2021-12-22  9:14 ` hubicka at kam dot mff.cuni.cz
2021-12-22  9:21 ` marxin at gcc dot gnu.org
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:30 ` marxin at gcc dot gnu.org
2021-12-22 13:44 ` hubicka at gcc dot gnu.org
2021-12-22 14:30 ` pinskia at gcc dot gnu.org
2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
2021-12-22 19:34 ` jakub at gcc dot gnu.org
2021-12-22 20:29 ` hubicka at gcc dot gnu.org
2021-12-23  8:12 ` ubizjak at gmail dot com
2021-12-23  8:52 ` ubizjak at gmail dot com
2021-12-23  8:58 ` ubizjak at gmail dot com
2021-12-23  9:15 ` jakub at gcc dot gnu.org
2021-12-23  9:47 ` hubicka at kam dot mff.cuni.cz
2021-12-23 11:16 ` ubizjak at gmail dot com
2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
2022-01-03 13:37 ` hubicka at gcc dot gnu.org
2022-01-04 13:16 ` rguenth at gcc dot gnu.org
2022-01-07  6:39 ` pinskia at gcc dot gnu.org
