public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not
@ 2021-12-21 23:29 hubicka at gcc dot gnu.org
2021-12-22 1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
` (21 more replies)
0 siblings, 22 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-21 23:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Bug ID: 103797
Summary: Clang vectorized LightPixel while GCC does not
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
Clang vectorises divss in LightPixel while GCC does not (at -O3). This seems
to account for 17% difference in resteflood_svg benchmark of Firefox.
│ 0000000001864660 <mozilla::gfx::(anonymous
namespace)::SpecularLightingSoftware::LightPixel(mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits,
float> const&, mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float>
const&, unsigned int)>:
│ mozilla::gfx::(anonymous
namespace)::SpecularLightingSoftware::LightPixel(mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits,
float> const&, mozilla::gfx::Point3DTyped<mozilla::gfx::UnknownUnits, float>
const&, unsigned int):
0.05 │ push %rbp
0.07 │ mov %rsp,%rbp
0.71 │ xorps %xmm6,%xmm6
0.32 │ addss %xmm6,%xmm4
│ unpcklps %xmm3,%xmm5
0.78 │ movss
anon.5bcbce9b5eeaaf1a18a99b9a5b62e1ce.3.llvm.5306652999446557335+0x6d8,%xmm8
0.01 │ addps %xmm8,%xmm5
1.47 │ movaps %xmm4,%xmm9
│ mulss %xmm4,%xmm9
│ movaps %xmm5,%xmm7
0.01 │ mulps %xmm5,%xmm7
3.35 │ movaps %xmm7,%xmm3
│ shufps $0x55,%xmm7,%xmm3
0.99 │ addss %xmm9,%xmm3
1.59 │ addss %xmm7,%xmm3
2.01 │ sqrtss %xmm3,%xmm3
11.43 │ divss %xmm3,%xmm4
6.76 │ shufps $0x0,%xmm3,%xmm3
0.01 │ divps %xmm3,%xmm5
2.58 │ mulss %xmm1,%xmm4
0.04 │ unpcklps %xmm0,%xmm2
│ mulps %xmm5,%xmm2
2.67 │ movaps %xmm2,%xmm0
0.04 │ shufps $0x55,%xmm2,%xmm0
2.11 │ addss %xmm4,%xmm0
1.87 │ addss %xmm2,%xmm0
2.82 │ cmpless %xmm0,%xmm6
2.20 │ andps %xmm8,%xmm6
1.05 │ mulss %xmm0,%xmm6
4.04 │ mulss .str.6.llvm.231702015065810902+0x77,%xmm6
3.14 │ cvttss2si %xmm6,%eax
4.45 │ mov 0x8(%rdi),%ecx
0.00 │ mov 0xc(%rdi),%edx
│ movzwl %ax,%eax
1.10 │ test %edx,%edx
│ ↓ jle 92
│88: imul %eax,%eax
9.06 │ shr $0xf,%eax
3.12 │ dec %edx
│ ↑ jne 88
│92: shr $0x8,%eax
1.95 │ movzwl 0x10(%rdi,%rax,2),%eax
6.48 │ imul %eax,%ecx
0.99 │ shr $0x8,%ecx
1.06 │ mov %esi,%eax
0.01 │ shr $0x8,%eax
│ mov %esi,%edx
│ shr $0x10,%edx
0.01 │ mov $0xff,%edi
│ and %edi,%esi
0.01 │ imul %ecx,%esi
3.32 │ shr $0xf,%esi
1.81 │ cmp %edi,%esi ▒
0.04 │ cmovae %edi,%esi ▒
1.99 │ and %edi,%eax ▒
0.01 │ imul %ecx,%eax ▒
│ shr $0xf,%eax ▒
0.01 │ cmp %edi,%eax ▒
0.28 │ cmovae %edi,%eax ▒
0.96 │ and %edi,%edx ▒
│ imul %ecx,%edx ▒
│ shr $0xf,%edx ▒
0.92 │ cmp %edi,%edx ▒
0.85 │ cmovae %edi,%edx ▒
1.00 │ cmp %eax,%edx ▒
1.20 │ mov %eax,%ecx ▒
│ cmova %edx,%ecx ▒
2.17 │ cmp %esi,%ecx ▒
1.15 │ cmovbe %esi,%ecx ▒
1.79 │ shl $0x18,%ecx▒
1.17 │ shl $0x10,%edx▒
│ shl $0x8,%eax ▒
0.03 │ or %edx,%eax ▒
0.01 │ or %esi,%eax ▒
0.14 │ or %ecx,%eax ▒
0.72 │ pop %rbp ▒
0.04 │ ← ret
▒
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
@ 2021-12-22 1:16 ` pinskia at gcc dot gnu.org
2021-12-22 8:46 ` marxin at gcc dot gnu.org
` (20 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-22 1:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Severity|normal |enhancement
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
2021-12-22 1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
@ 2021-12-22 8:46 ` marxin at gcc dot gnu.org
2021-12-22 9:14 ` hubicka at kam dot mff.cuni.cz
` (19 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22 8:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |WAITING
Ever confirmed|0 |1
CC| |marxin at gcc dot gnu.org
Last reconfirmed| |2021-12-22
--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
Can you please attach a reduced test-case?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
2021-12-22 1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
2021-12-22 8:46 ` marxin at gcc dot gnu.org
@ 2021-12-22 9:14 ` hubicka at kam dot mff.cuni.cz
2021-12-22 9:21 ` marxin at gcc dot gnu.org
` (18 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 9:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #2 from hubicka at kam dot mff.cuni.cz ---
> Can you please attach a reduced test-case?
Do you know how to produce one with a reasonable effort? The
declaratoins are quite convoluted, but the function is well isolated and
easy to inspect from full one...
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (2 preceding siblings ...)
2021-12-22 9:14 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 9:21 ` marxin at gcc dot gnu.org
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
` (17 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22 9:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to hubicka from comment #2)
> > Can you please attach a reduced test-case?
> Do you know how to produce one with a reasonable effort?
-E and remove not needed code.
> The
> declaratoins are quite convoluted, but the function is well isolated and
> easy to inspect from full one...
Do we speak about:
https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
?
It should be possible creating a synthetical test that does the same (and lives
in a loop, right?).
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (3 preceding siblings ...)
2021-12-22 9:21 ` marxin at gcc dot gnu.org
@ 2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
` (16 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 11:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #4 from hubicka at kam dot mff.cuni.cz ---
> -E and remove not needed code.
>
> > The
> > declaratoins are quite convoluted, but the function is well isolated and
> > easy to inspect from full one...
>
> Do we speak about:
> https://github.com/mozilla/gecko-dev/blob/bd25b1ca76dd5d323ffc69557f6cf759ba76ba23/gfx/2d/FilterNodeSoftware.cpp#L3670-L3691
> ?
Yes.
>
> It should be possible creating a synthetical test that does the same (and lives
> in a loop, right?).
Well, I tried that for a while and got bit lost (either code got
vectorized by both gcc and clang or by neither). There are more issues
where we have over 50% regression wrt clang build at gfx code, so I
think I will first try to reproduce those locally and perf them to see
if there is more pattern here.
The releavant code is:
uint32_t mozilla::gfx::{anonymous}::SpecularLightingSoftware::LightPixel
(struct SpecularLightingSoftware * const this, const struct Point3D & aNormal,
const struct Point3D & aVectorToLight, uint32_t aColor)
{
<bb 2> [local count: 118111600]:
_48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
_49 = _48 + 1.0e+0;
_50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
_51 = _50 + 0.0;
_52 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
_53 = _52 + 0.0;
_80 = _53 * _53;
_82 = _51 * _51;
_83 = _80 + _82;
_85 = _49 * _49;
_86 = _83 + _85;
if (_86 u>= 0.0)
goto <bb 3>; [99.95%]
else
goto <bb 4>; [0.05%]
<bb 3> [local count: 118052545]:
_87 = .SQRT (_86);
goto <bb 5>; [100.00%]
<bb 4> [local count: 59055]:
_29 = __builtin_sqrtf (_86);
<bb 5> [local count: 118111600]:
# _30 = PHI <_29(4), _87(3)>
_88 = _53 / _30;
_89 = _51 / _30;
_90 = _49 / _30;
_41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
_39 = _41 * _88;
_37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
_33 = _37 * _89;
_27 = _33 + _39;
_45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
_46 = _45 * _90;
_47 = _27 + _46;
if (_47 >= 0.0)
goto <bb 12>; [59.00%]
else
goto <bb 6>; [41.00%]
With -Ofast it gets bit more streamlined:
<bb 2> [local count: 118111600]:
_48 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.z;
_49 = _48 + 1.0e+0;
_50 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.y;
_51 = MEM[(const struct BasePoint3D
*)aVectorToLight_25(D)].D.75826.D.75829.x;
powmult_78 = _51 * _51;
powmult_80 = _50 * _50;
_81 = powmult_78 + powmult_80;
powmult_83 = _49 * _49;
_84 = _81 + powmult_83;
_85 = __builtin_sqrtf (_84);
_86 = _51 / _85;
_87 = _50 / _85;
_88 = _49 / _85;
_41 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.x;
_39 = _41 * _86;
_37 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.y;
_33 = _37 * _87;
_27 = _33 + _39;
_45 = MEM[(const struct BasePoint3D *)aNormal_26(D)].D.75826.D.75829.z;
_46 = _45 * _88;
_47 = _27 + _46;
if (_47 >= 0.0)
goto <bb 3>; [59.00%]
else
goto <bb 9>; [41.00%]
But I do not quite see in the slp dump why this is not considered for
vectorization.
I attach the dump.
Honza
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (4 preceding siblings ...)
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:30 ` marxin at gcc dot gnu.org
` (15 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 11:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #5 from hubicka at kam dot mff.cuni.cz ---
Created attachment 52042
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52042&action=edit
b.slp1
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (5 preceding siblings ...)
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 11:30 ` marxin at gcc dot gnu.org
2021-12-22 13:44 ` hubicka at gcc dot gnu.org
` (14 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-22 11:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
You may try exporting GIMPLE IL that can be consumed with -fgimple.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (6 preceding siblings ...)
2021-12-22 11:30 ` marxin at gcc dot gnu.org
@ 2021-12-22 13:44 ` hubicka at gcc dot gnu.org
2021-12-22 14:30 ` pinskia at gcc dot gnu.org
` (13 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-22 13:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
OK, here is completely fake testcase that does similar operaitons:
#include <math.h>
struct test {float x; float y; float z;} test;
float f;
void
t()
{
float x = test.x;
float y = test.y;
float z = test.z;
x = x * f;
y = y * f;
z = z * f;
x = sqrt (x);
y = sqrt (y);
z = sqrt (z);
x = x / f;
y = y / f;
z = z / f;
test.x=x;
test.y=y;
test.z=z;
}
We seem to fail to vectorize it with:
t.c:20:9: missed: op not supported by target.
t.c:17:5: missed: not vectorized: relevant stmt not supported: x_15 = x_24 /
f.0_1;
clang seems to use divps happilly, so I am not sure why it is not supported.
Even more funny is that with -Ofast it is compiled into multiplication by
reciprocal:
t:
.LFB0:
.cfi_startproc
movss f(%rip), %xmm4
movss .LC0(%rip), %xmm2
movss test(%rip), %xmm0
movss test+4(%rip), %xmm3
divss %xmm4, %xmm2
movss test+8(%rip), %xmm1
mulss %xmm4, %xmm0
mulss %xmm4, %xmm3
mulss %xmm4, %xmm1
sqrtss %xmm0, %xmm0
sqrtss %xmm3, %xmm3
sqrtss %xmm1, %xmm1
mulss %xmm2, %xmm0
mulss %xmm2, %xmm3
mulss %xmm2, %xmm1
unpcklps %xmm3, %xmm0
movlps %xmm0, test(%rip)
movss %xmm1, test+8(%rip)
ret
and rewriting it that way by hand:
#include <math.h>
struct test {float x; float y; float z;} test;
float f;
void
t()
{
float x = test.x;
float y = test.y;
float z = test.z;
float m = 1/f;
x = x * f;
y = y * f;
z = z * f;
x = sqrt (x);
y = sqrt (y);
z = sqrt (z);
x = x * m;
y = y * m;
z = z * m;
test.x=x;
test.y=y;
test.z=z;
}
gets the expected result:
t:
.LFB0:
.cfi_startproc
movss f(%rip), %xmm0
movq test(%rip), %xmm1
movaps %xmm0, %xmm2
shufps $0xe0, %xmm2, %xmm2
mulps %xmm1, %xmm2
movss .LC0(%rip), %xmm1
divss %xmm0, %xmm1
mulss test+8(%rip), %xmm0
sqrtps %xmm2, %xmm2
sqrtss %xmm0, %xmm0
movaps %xmm1, %xmm3
shufps $0xe0, %xmm3, %xmm3
mulss %xmm0, %xmm1
mulps %xmm3, %xmm2
movss %xmm1, test+8(%rip)
movlps %xmm2, test(%rip)
ret
.cfi_endproc
Having this however I do not see slp analyzing the divide in the original code
at all.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (7 preceding siblings ...)
2021-12-22 13:44 ` hubicka at gcc dot gnu.org
@ 2021-12-22 14:30 ` pinskia at gcc dot gnu.org
2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
` (12 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-22 14:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #7)
> Having this however I do not see slp analyzing the divide in the original
> code at all.
recip pass happens after vectorization ....
I don't know/understand why though.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (8 preceding siblings ...)
2021-12-22 14:30 ` pinskia at gcc dot gnu.org
@ 2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
2021-12-22 19:34 ` jakub at gcc dot gnu.org
` (11 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-22 14:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> recip pass happens after vectorization ....
> I don't know/understand why though.
Yep, I suppose we want to either special case this in vectorizer or make
it earlier... I also wonder why the code is vectorized for pairs of
values and third one is computed separately and why we don't use vectors
of length 4...
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (9 preceding siblings ...)
2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-22 19:34 ` jakub at gcc dot gnu.org
2021-12-22 20:29 ` hubicka at gcc dot gnu.org
` (10 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-12-22 19:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org,
| |uros at gcc dot gnu.org
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
At least on your short testcase clang doesn't use divps either.
We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
before using divps?
Another question is if we could teach SLP to vectorize even factors not power
of 2, say loads/stores could be done (and with e.g. AVX512 almost everything)
could be done with masked loads/stores, most arithmetics could be done normally
and we'd just need to watch what values we'll get in the extra elts and make
sure it doesn't generate exceptions etc.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (10 preceding siblings ...)
2021-12-22 19:34 ` jakub at gcc dot gnu.org
@ 2021-12-22 20:29 ` hubicka at gcc dot gnu.org
2021-12-23 8:12 ` ubizjak at gmail dot com
` (9 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-22 20:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Aha, I did not noticed that we need special patterns (I extecpted this is
problem to solve in machine independent code). So I guess we have
1) SLP should vectorize the 3 accesses with -ffast-math to only one vector
operation (as opposed to one vector+one scalar it does now)
2) we could adddivv2sf3 pattern which initializes the elt 4 of the operand2 to
1.0f to avoid funny results
3) we need to figure out why SLP vectorization is not even considered in the
original testcase (which I do not seem to be able to dig out with reasonable
effort in a way that it preserves original properties - to be vectorized by
clang and not vectorized by gcc)
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (11 preceding siblings ...)
2021-12-22 20:29 ` hubicka at gcc dot gnu.org
@ 2021-12-23 8:12 ` ubizjak at gmail dot com
2021-12-23 8:52 ` ubizjak at gmail dot com
` (8 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23 8:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #10)
> At least on your short testcase clang doesn't use divps either.
> We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> before using divps?
It could be done, but I was under impression that the sequence to load 1.0f
into topmost elements nullifies the benefit of operation to divide two
elements. However, if the missing pattern prevents longer vectorized chains,
this is not entirely true.
The division can be implemented in the same way as sse_cvtps2pi, but using
CONST1_RTX vector.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (12 preceding siblings ...)
2021-12-23 8:12 ` ubizjak at gmail dot com
@ 2021-12-23 8:52 ` ubizjak at gmail dot com
2021-12-23 8:58 ` ubizjak at gmail dot com
` (7 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23 8:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 52051
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52051&action=edit
Patch that implements v2sf division
Please try the attached patch, for the following testcase:
--cut here--
float a[2], b[2], r[2];
void bar (void)
{
int i;
for (i = 0; i < 2; i++)
r[i] = a[i] / b[i];
}
--cut here--
the compiler generates:
movq b(%rip), %xmm1
movq a(%rip), %xmm0
movhps .LC0(%rip), %xmm1
divps %xmm1, %xmm0
movlps %xmm0, r(%rip)
ret
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (13 preceding siblings ...)
2021-12-23 8:52 ` ubizjak at gmail dot com
@ 2021-12-23 8:58 ` ubizjak at gmail dot com
2021-12-23 9:15 ` jakub at gcc dot gnu.org
` (6 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23 8:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #14 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #13)
> Created attachment 52051 [details]
> Patch that implements v2sf division
This patch also enables vectorization of the testcase from Comment #7. Using
-ffast-math, it also generates vectorized reciprocal:
movss f(%rip), %xmm4
movss test+8(%rip), %xmm3
movq test(%rip), %xmm2
mulss %xmm4, %xmm3
movaps %xmm4, %xmm0
shufps $0xe0, %xmm0, %xmm0
mulps %xmm0, %xmm2
movhps .LC0(%rip), %xmm0
--> rcpps %xmm0, %xmm1
sqrtss %xmm3, %xmm3
mulps %xmm1, %xmm0
sqrtps %xmm2, %xmm2
divss %xmm4, %xmm3
movaps %xmm2, %xmm5
mulps %xmm1, %xmm0
addps %xmm1, %xmm1
subps %xmm0, %xmm1
mulps %xmm1, %xmm5
movlps %xmm5, test(%rip)
movss %xmm3, test+8(%rip)
ret
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (14 preceding siblings ...)
2021-12-23 8:58 ` ubizjak at gmail dot com
@ 2021-12-23 9:15 ` jakub at gcc dot gnu.org
2021-12-23 9:47 ` hubicka at kam dot mff.cuni.cz
` (5 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-12-23 9:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #12)
> (In reply to Jakub Jelinek from comment #10)
> > At least on your short testcase clang doesn't use divps either.
> > We do support mulv2sf3, addv2sf3 etc. but not divv2sf3 I bet because with
> > TARGET_MMX_WITH_SSE it would divide by zero in the 3rd and 4th elts,
> > but perhaps we could insert 1.0f, 1.0f into those elements of the divisor
> > before using divps?
>
> It could be done, but I was under impression that the sequence to load 1.0f
> into topmost elements nullifies the benefit of operation to divide two
Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
division so that we would use it only if it is part of longer sequences?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (15 preceding siblings ...)
2021-12-23 9:15 ` jakub at gcc dot gnu.org
@ 2021-12-23 9:47 ` hubicka at kam dot mff.cuni.cz
2021-12-23 11:16 ` ubizjak at gmail dot com
` (4 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-23 9:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> >
> > It could be done, but I was under impression that the sequence to load 1.0f
> > into topmost elements nullifies the benefit of operation to divide two
>
> Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
> division so that we would use it only if it is part of longer sequences?
I wonder how the hardware implements it. If divps is of similar latency
as divss then I guess it is essentially always win to load 1.0 to the
upper part, since it is slow operation. On the other hand if divps is
about 4 times divss, then this may be harmful.
Agner Fog seems to be listing divss and divps with same latencies.
For zen it is 10 cycles which should be enough to do the setup.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (16 preceding siblings ...)
2021-12-23 9:47 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-23 11:16 ` ubizjak at gmail dot com
2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
` (3 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2021-12-23 11:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #17 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to hubicka from comment #16)
> > >
> > > It could be done, but I was under impression that the sequence to load 1.0f
> > > into topmost elements nullifies the benefit of operation to divide two
> >
> > Sure, so perhaps we should somewhat increase the vectorization cost of V2SFmode
> > division so that we would use it only if it is part of longer sequences?
>
> I wonder how the hardware implements it. If divps is of similar latency
> as divss then I guess it is essentially always win to load 1.0 to the
> upper part, since it is slow operation. On the other hand if divps is
> about 4 times divss, then this may be harmful.
>
> Agner Fog seems to be listing divss and divps with same latencies.
> For zen it is 10 cycles which should be enough to do the setup.
OK, I'll prepare and test a formal patch.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (17 preceding siblings ...)
2021-12-23 11:16 ` ubizjak at gmail dot com
@ 2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
2022-01-03 13:37 ` hubicka at gcc dot gnu.org
` (2 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-12-24 16:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
--- Comment #18 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:
https://gcc.gnu.org/g:8f921393e339090566c1589d81009caa954de90d
commit r12-6113-g8f921393e339090566c1589d81009caa954de90d
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Fri Dec 24 17:09:36 2021 +0100
i386: Add V2SFmode DIV insn pattern [PR95046, PR103797]
Use V4SFmode "DIVPS X,Y" with [y0, y1, 1.0f, 1.0f] as a divisor
to avoid division by zero.
2021-12-24 Uroš Bizjak <ubizjak@gmail.com>
gcc/ChangeLog:
PR target/95046
PR target/103797
* config/i386/mmx.md (divv2sf3): New instruction pattern.
gcc/testsuite/ChangeLog:
PR target/95046
PR target/103797
* gcc.target/i386/pr95046-1.c (test_div): Add.
(dg-options): Add -mno-recip.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (18 preceding siblings ...)
2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
@ 2022-01-03 13:37 ` hubicka at gcc dot gnu.org
2022-01-04 13:16 ` rguenth at gcc dot gnu.org
2022-01-07 6:39 ` pinskia at gcc dot gnu.org
21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2022-01-03 13:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://gcc.gnu.org/bugzill
| |a/show_bug.cgi?id=88602
--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It turns out that there is yeat another #ifdef __clang__ I missed in the gfx
library, so the vectorised code produced by clang is hand written in the
extensions discussed in PR88602.
Sorry for confusion. However I think the simplified testcase is still perfectly
vectorisable and we should add the div patterns as discussed.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (19 preceding siblings ...)
2022-01-03 13:37 ` hubicka at gcc dot gnu.org
@ 2022-01-04 13:16 ` rguenth at gcc dot gnu.org
2022-01-07 6:39 ` pinskia at gcc dot gnu.org
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Version|unknown |12.0
Resolution|--- |FIXED
--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
The division is now vectorized, your short testcase produces
t:
.LFB0:
.cfi_startproc
movss f(%rip), %xmm4
movss test+8(%rip), %xmm3
movq test(%rip), %xmm0
mulss %xmm4, %xmm3
movaps %xmm4, %xmm1
shufps $0xe0, %xmm1, %xmm1
mulps %xmm1, %xmm0
movhps .LC0(%rip), %xmm1
rcpps %xmm1, %xmm2
sqrtss %xmm3, %xmm3
mulps %xmm2, %xmm1
sqrtps %xmm0, %xmm0
divss %xmm4, %xmm3
mulps %xmm2, %xmm1
addps %xmm2, %xmm2
subps %xmm1, %xmm2
mulps %xmm2, %xmm0
movlps %xmm0, test(%rip)
movss %xmm3, test+8(%rip)
ret
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/103797] Clang vectorized LightPixel while GCC does not
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
` (20 preceding siblings ...)
2022-01-04 13:16 ` rguenth at gcc dot gnu.org
@ 2022-01-07 6:39 ` pinskia at gcc dot gnu.org
21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-07 6:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103797
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |12.0
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2022-01-07 6:39 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-21 23:29 [Bug tree-optimization/103797] New: Clang vectorized LightPixel while GCC does not hubicka at gcc dot gnu.org
2021-12-22 1:16 ` [Bug tree-optimization/103797] " pinskia at gcc dot gnu.org
2021-12-22 8:46 ` marxin at gcc dot gnu.org
2021-12-22 9:14 ` hubicka at kam dot mff.cuni.cz
2021-12-22 9:21 ` marxin at gcc dot gnu.org
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:08 ` hubicka at kam dot mff.cuni.cz
2021-12-22 11:30 ` marxin at gcc dot gnu.org
2021-12-22 13:44 ` hubicka at gcc dot gnu.org
2021-12-22 14:30 ` pinskia at gcc dot gnu.org
2021-12-22 14:59 ` hubicka at kam dot mff.cuni.cz
2021-12-22 19:34 ` jakub at gcc dot gnu.org
2021-12-22 20:29 ` hubicka at gcc dot gnu.org
2021-12-23 8:12 ` ubizjak at gmail dot com
2021-12-23 8:52 ` ubizjak at gmail dot com
2021-12-23 8:58 ` ubizjak at gmail dot com
2021-12-23 9:15 ` jakub at gcc dot gnu.org
2021-12-23 9:47 ` hubicka at kam dot mff.cuni.cz
2021-12-23 11:16 ` ubizjak at gmail dot com
2021-12-24 16:10 ` cvs-commit at gcc dot gnu.org
2022-01-03 13:37 ` hubicka at gcc dot gnu.org
2022-01-04 13:16 ` rguenth at gcc dot gnu.org
2022-01-07 6:39 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).