* [Bug tree-optimization/98854] [11 Regression] cray benchmark is about 15% slower since r11-4428-g4a369d199bf2f34e
2021-01-27 12:45 [Bug tree-optimization/98854] New: [11 Regression] cray benchmark is about 15% slower since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
@ 2021-01-27 12:46 ` marxin at gcc dot gnu.org
2021-01-27 13:13 ` rguenth at gcc dot gnu.org
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 12:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
One can see it here:
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=245.639.0&plot.1=171.639.0&
From: rguenth at gcc dot gnu.org @ 2021-01-27 13:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
Richard Biener <rguenth at gcc dot gnu.org> changed:
               What|Removed                      |Added
----------------------------------------------------------------------------
     Ever confirmed|0                            |1
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2021-01-27
   Target Milestone|---                          |11.0
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look.
From: rguenth at gcc dot gnu.org @ 2021-01-27 13:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, one can see it with BB vectorization enabled vs. disabled.
Bad:
Samples: 7K of event 'cycles:u', Event count (approx.): 7540324763
Overhead Samples Command Shared Object Symbol
53.11% 3711 a.out a.out [.] shade
25.39% 1774 a.out a.out [.] trace
18.16% 1271 a.out a.out [.] render_scanline
1.56% 109 a.out libm-2.26.so [.] __ieee754_pow_sse2
Good:
Samples: 6K of event 'cycles:u', Event count (approx.): 6673802579
Overhead Samples Command Shared Object Symbol
61.21% 3857 a.out a.out [.] shade
20.44% 1288 a.out a.out [.] trace
14.42% 912 a.out a.out [.] render_scanline
1.81% 114 a.out libm-2.26.so [.] __ieee754_pow_sse2
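As a rough cross-check of the two profiles above, the approximate event counts can be turned into a relative slowdown. This is only a sketch: `slowdown_percent` is a hypothetical helper, and the inputs are the approximate cycle counts reported by perf for this one run.

```c
#include <assert.h>

/* Hypothetical helper: relative slowdown of `bad` over `good`, in percent.
   Both arguments are event counts (here: cycles) from two profile runs.  */
double
slowdown_percent (double bad, double good)
{
  assert (good > 0.0);
  return (bad / good - 1.0) * 100.0;
}
```

Calling `slowdown_percent (7540324763.0, 6673802579.0)` gives roughly 13%, which is in the same range as the "about 15%" in the bug title. The per-symbol sample counts suggest the extra cycles land mostly in trace and render_scanline rather than in shade.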
With added -fwhole-program we have
c-ray-mt.c:624:18: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:372:13: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:372:13: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:432:9: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:656:7: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:656:7: optimized: basic block part vectorized using 32 byte vectors
c-ray-mt.c:265:23: optimized: basic block part vectorized using 32 byte vectors
The vectorization at :372 is the bad one, followed by the one at :656.
For the first we vectorize a store
<bb 26> [local count: 31445960]:
# nearest_obj_239 = PHI <nearest_obj_11(17), nearest_obj_11(25),
iter_363(24), nearest_obj_11(19), nearest_obj_11(18), iter_363(23)>
...
_816 = {nearest_sp_pos_x_lsm.258_78, nearest_sp_pos_y_lsm.259_174,
nearest_sp_pos_z_lsm.260_201, nearest_sp_normal_x_lsm.261_200};
_820 = {nearest_sp_normal_y_lsm.262_122, nearest_sp_normal_z_lsm.263_293,
nearest_sp_vref_x_lsm.264_124, nearest_sp_vref_y_lsm.265_148};
iter_231 = iter_363->next;
if (iter_231 != 0B)
goto <bb 33>; [89.00%]
else
goto <bb 27>; [11.00%]
<bb 33> [local count: 27986904]:
goto <bb 17>; [100.00%]
<bb 27> [local count: 3459055]:
# nearest_sp_dist_lsm.257_228 = PHI <nearest_sp_dist_lsm.257_66(26)>
# nearest_sp_pos_x_lsm.258_226 = PHI <nearest_sp_pos_x_lsm.258_78(26)>
# nearest_sp_normal_y_lsm.262_343 = PHI <nearest_sp_normal_y_lsm.262_122(26)>
# nearest_sp_vref_x_lsm.264_238 = PHI <nearest_sp_vref_x_lsm.264_124(26)>
# nearest_sp_vref_y_lsm.265_237 = PHI <nearest_sp_vref_y_lsm.265_148(26)>
# nearest_sp_vref_z_lsm.266_236 = PHI <nearest_sp_vref_z_lsm.266_152(26)>
# nearest_sp_pos_y_lsm.259_342 = PHI <nearest_sp_pos_y_lsm.259_174(26)>
# nearest_sp_normal_x_lsm.261_351 = PHI <nearest_sp_normal_x_lsm.261_200(26)>
# nearest_sp_pos_z_lsm.260_304 = PHI <nearest_sp_pos_z_lsm.260_201(26)>
# nearest_obj_197 = PHI <nearest_obj_239(26)>
# nearest_sp_normal_z_lsm.263_821 = PHI <nearest_sp_normal_z_lsm.263_293(26)>
# vect_nearest_sp_pos_x_lsm.258_226.268_815 = PHI <_816(26)>
# vect_nearest_sp_pos_x_lsm.258_226.268_814 = PHI <_820(26)>
nearest_sp.vref.z = nearest_sp_vref_z_lsm.266_236;
MEM <vector(4) double> [(double *)&nearest_sp] =
vect_nearest_sp_pos_x_lsm.258_226.268_815;
_812 = &nearest_sp.pos.x + 32;
MEM <vector(4) double> [(double *)_812] =
vect_nearest_sp_pos_x_lsm.258_226.268_814;
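but we insert the vector CTOR on a path that is executed more often than
its use: the CTORs _816 and _820 are built in <bb 26>, inside the loop,
while the vector stores that consume them sit in <bb 27>, after the loop
exit. Since there is no sinking pass after vectorization, nothing
fixes this up.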
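A minimal C sketch of the cost problem described above, not the code GCC generates. `struct vec2` and `pack` are hypothetical stand-ins for a two-lane vector and its CTOR, and `pack_count` exists only to make the difference countable. The first function builds the vector on every iteration although only the final value is stored; the second builds it once, next to the store.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-lane "vector"; stands in for a vector(2) double.  */
struct vec2 { double lane[2]; };

/* Counts how often a vector is built; for illustration only.  */
unsigned long pack_count;

/* Hypothetical stand-in for a vector CTOR {x, y}.  */
static struct vec2
pack (double x, double y)
{
  struct vec2 v = { { x, y } };
  pack_count++;
  return v;
}

/* Problematic shape: the CTOR sits on the loop path, its use after it.  */
void
sum_pack_in_loop (const double *in, size_t n, double out[2])
{
  double x = 0, y = 0;
  struct vec2 v = pack (x, y);
  for (size_t i = 0; i < n; i++)
    {
      x += in[i];
      y += 2 * in[i];
      v = pack (x, y);		/* built every iteration, used once */
    }
  out[0] = v.lane[0];
  out[1] = v.lane[1];
}

/* Desired shape: the CTOR is placed next to the store that uses it.  */
void
sum_pack_at_use (const double *in, size_t n, double out[2])
{
  double x = 0, y = 0;
  for (size_t i = 0; i < n; i++)
    {
      x += in[i];
      y += 2 * in[i];
    }
  struct vec2 v = pack (x, y);	/* built once, after the loop */
  out[0] = v.lane[0];
  out[1] = v.lane[1];
}
```

Both functions store the same two values; only the number of vector builds differs, and that count grows with the trip count in the first variant.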
From: rguenth at gcc dot gnu.org @ 2021-01-27 14:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Little bit convoluted testcase:
double a[1024];

int bar();
void foo (int n)
{
  double x = 0, y = 0;
  int i = 1023;
  do
    {
      x += a[i] + a[i+1];
      y += a[i] / a[i+1];
      if (bar ())
        break;
    }
  while (--i);
  a[0] = x;
  a[1] = y;
}

where we end up with the {x, y} vector CTOR inside the loop (and even
spill/reload it because of the call). The vectorized store is fed
only by PHI nodes:
t.c:16:8: note: Vectorizing SLP tree:
t.c:16:8: note: node 0x3b21ee0 (max_nunits=2, refcnt=1)
t.c:16:8: note: op template: a[0] = x_22;
t.c:16:8: note: stmt 0 a[0] = x_22;
t.c:16:8: note: stmt 1 a[1] = y_21;
t.c:16:8: note: children 0x3b21f68
t.c:16:8: note: node 0x3b21f68 (max_nunits=2, refcnt=1)
t.c:16:8: note: op template: x_22 = PHI <x_26(9), x_25(10)>
t.c:16:8: note: stmt 0 x_22 = PHI <x_26(9), x_25(10)>
t.c:16:8: note: stmt 1 y_21 = PHI <y_24(9), y_23(10)>
t.c:16:8: note: children 0x3b21ff0 0x3b22210
t.c:16:8: note: node 0x3b21ff0 (max_nunits=2, refcnt=1)
t.c:16:8: note: op template: x_26 = PHI <x_14(3)>
t.c:16:8: note: stmt 0 x_26 = PHI <x_14(3)>
t.c:16:8: note: stmt 1 y_24 = PHI <y_15(3)>
t.c:16:8: note: children 0x3b22320
t.c:16:8: note: node (external) 0x3b22320 (max_nunits=1, refcnt=1)
t.c:16:8: note: { x_14, y_15 }
t.c:16:8: note: node 0x3b22210 (max_nunits=2, refcnt=1)
t.c:16:8: note: op template: x_25 = PHI <x_14(4)>
t.c:16:8: note: stmt 0 x_25 = PHI <x_14(4)>
t.c:16:8: note: stmt 1 y_23 = PHI <y_15(4)>
t.c:16:8: note: children 0x3b223a8
t.c:16:8: note: node (external) 0x3b223a8 (max_nunits=1, refcnt=1)
t.c:16:8: note: { x_14, y_15 }
Fixing this issue fixes the slowdown. Testing a patch.
From: marxin at gcc dot gnu.org @ 2021-01-27 14:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> Little bit convoluted testcase:
>
> double a[1024];
>
> int bar();
> void foo (int n)
> {
> double x = 0, y = 0;
> int i = 1023;
> do
> {
> x += a[i] + a[i+1];
> y += a[i] / a[i+1];
> if (bar ())
> break;
> }
> while (--i);
> a[0] = x;
> a[1] = y;
> }
>
What compiler (ISA options) do you use in order to vectorize this?
From: rguenther at suse dot de @ 2021-01-27 14:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 27 Jan 2021, marxin at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
>
> --- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #4)
> > Little bit convoluted testcase:
> >
> > double a[1024];
> >
> > int bar();
> > void foo (int n)
> > {
> > double x = 0, y = 0;
> > int i = 1023;
> > do
> > {
> > x += a[i] + a[i+1];
> > y += a[i] / a[i+1];
> > if (bar ())
> > break;
> > }
> > while (--i);
> > a[0] = x;
> > a[1] = y;
> > }
> >
>
> What compiler (ISA options) do you use in order to vectorize this?
I used -O3 but -O2 -ftree-slp-vectorize also vectorizes it.
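For trying those flags on something that can also be executed, here is a sketch of a standalone variant of the testcase. Several details are assumptions and not part of the original: the `bar` stub and its `break_after` knob, the `fill` helper, dropping the unused parameter of `foo`, and starting at i = 1022 so that a[i+1] stays inside the array.

```c
#include <assert.h>

double a[1024];

/* Hypothetical knob: bar () returns nonzero on the break_after-th call.
   Left at 0, it does not ask for a break during one run of foo ().  */
int break_after;

/* Hypothetical stub for the external bar (); kept out of line so the
   loop still contains a real call.  */
__attribute__ ((noinline)) int
bar (void)
{
  return --break_after == 0;
}

void
foo (void)
{
  double x = 0, y = 0;
  int i = 1022;	/* assumption: 1022, not 1023, keeps a[i+1] in bounds */
  do
    {
      x += a[i] + a[i + 1];
      y += a[i] / a[i + 1];
      if (bar ())
	break;
    }
  while (--i);
  a[0] = x;
  a[1] = y;
}

/* Hypothetical helper: fill the array so the results are easy to predict.  */
void
fill (double v)
{
  assert (v != 0.0);	/* foo () divides by array elements */
  for (int k = 0; k < 1024; k++)
    a[k] = v;
}
```

Compiled with -O3, or with -O2 -ftree-slp-vectorize, and with -fopt-info-vec added, the "basic block part vectorized" notes should show up if the two stores get SLP vectorized. Whether that happens depends on the compiler revision, and a `main` that calls `fill` and `foo` is needed to run it.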
From: marxin at gcc dot gnu.org @ 2021-01-27 14:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #7 from Martin Liška <marxin at gcc dot gnu.org> ---
> I used -O3 but -O2 -ftree-slp-vectorize also vectorizes it.
I must be blind, but I see for the current master:
gcc pr98854.c -c -O2 -ftree-slp-vectorize -fdump-tree-optimized=/dev/stdout
foo (int n)
{
unsigned long ivtmp.8;
double y;
double x;
double _6;
double _8;
double _9;
double _11;
int _14;
void * _29;
unsigned long _31;
<bb 2>:
ivtmp.8_28 = (unsigned long) &MEM[(void *)&a + 8184B];
_31 = (unsigned long) &a;
<bb 3>:
# x_1 = PHI <0.0(2), x_10(5)>
# y_2 = PHI <0.0(2), y_12(5)>
# ivtmp.8_18 = PHI <ivtmp.8_28(2), ivtmp.8_27(5)>
_29 = (void *) ivtmp.8_18;
_6 = MEM[base: _29, offset: 0B];
_8 = MEM[base: _29, offset: 8B];
_9 = _6 + _8;
x_10 = _9 + x_1;
_11 = _6 / _8;
y_12 = _11 + y_2;
_14 = bar ();
if (_14 != 0)
goto <bb 4>;
else
goto <bb 5>;
<bb 4>:
a[0] = x_10;
a[1] = y_12;
return;
<bb 5>:
ivtmp.8_27 = ivtmp.8_18 - 8;
if (ivtmp.8_27 != _31)
goto <bb 3>;
else
goto <bb 4>;
}
From: rguenther at suse dot de @ 2021-01-27 15:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 27 Jan 2021, marxin at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
>
> --- Comment #7 from Martin Liška <marxin at gcc dot gnu.org> ---
> > I used -O3 but -O2 -ftree-slp-vectorize also vectorizes it.
>
> I must be blind, but I see for the current master:
>
> gcc pr98854.c -c -O2 -ftree-slp-vectorize -fdump-tree-optimized=/dev/stdout
>
> foo (int n)
> {
> unsigned long ivtmp.8;
> double y;
> double x;
> double _6;
> double _8;
> double _9;
> double _11;
> int _14;
> void * _29;
> unsigned long _31;
>
> <bb 2>:
> ivtmp.8_28 = (unsigned long) &MEM[(void *)&a + 8184B];
> _31 = (unsigned long) &a;
>
> <bb 3>:
> # x_1 = PHI <0.0(2), x_10(5)>
> # y_2 = PHI <0.0(2), y_12(5)>
> # ivtmp.8_18 = PHI <ivtmp.8_28(2), ivtmp.8_27(5)>
> _29 = (void *) ivtmp.8_18;
> _6 = MEM[base: _29, offset: 0B];
> _8 = MEM[base: _29, offset: 8B];
> _9 = _6 + _8;
> x_10 = _9 + x_1;
> _11 = _6 / _8;
> y_12 = _11 + y_2;
> _14 = bar ();
> if (_14 != 0)
> goto <bb 4>;
> else
> goto <bb 5>;
>
> <bb 4>:
> a[0] = x_10;
> a[1] = y_12;
> return;
>
> <bb 5>:
> ivtmp.8_27 = ivtmp.8_18 - 8;
> if (ivtmp.8_27 != _31)
> goto <bb 3>;
> else
> goto <bb 4>;
>
> }
Hmm, maybe my dev tree has related adjustments to SLP ... at least
the posted patch fixes the regression for me.
From: cvs-commit at gcc dot gnu.org @ 2021-01-27 16:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:c91db798ec65b3e55f2380ca1530ecb71544f1bb
commit r11-6934-gc91db798ec65b3e55f2380ca1530ecb71544f1bb
Author: Richard Biener <rguenther@suse.de>
Date: Wed Jan 27 15:20:58 2021 +0100
tree-optimization/98854 - avoid some PHI BB vectorization
This avoids cases of PHI node vectorization that just causes us
to insert vector CTORs inside loops for values only required
outside of the loop.
2021-01-27 Richard Biener <rguenther@suse.de>
PR tree-optimization/98854
* tree-vect-slp.c (vect_build_slp_tree_2): Also build
PHIs from scalars when the number of CTORs matches the
number of children.
* gcc.dg/vect/bb-slp-pr98854.c: New testcase.
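The rule stated in the ChangeLog entry can be modelled in a few lines of C. This is a toy model under stated assumptions, not the code in tree-vect-slp.c: `struct node`, `enum def_kind`, `count_vector_builds` and `build_phi_from_scalars_p` are all hypothetical names, and every external child is treated as one vector CTOR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not GCC's data structures.  */
enum def_kind { DEF_INTERNAL, DEF_EXTERNAL };

struct node
{
  bool is_phi;				/* node represents PHI statements */
  size_t n_children;
  const struct node *const *children;
  enum def_kind kind;			/* external: built from scalars */
};

/* Count the children that would need a vector CTOR.  In this model
   every external child counts as one CTOR.  */
static size_t
count_vector_builds (const struct node *n)
{
  size_t builds = 0;
  for (size_t i = 0; i < n->n_children; i++)
    if (n->children[i]->kind == DEF_EXTERNAL)
      builds++;
  return builds;
}

/* Model of the rule: a PHI node whose children are all vector CTORs
   is not worth vectorizing, so build it from scalars instead.  */
bool
build_phi_from_scalars_p (const struct node *n)
{
  assert (n != NULL);
  return (n->is_phi
	  && n->n_children > 0
	  && count_vector_builds (n) == n->n_children);
}
```

In the SLP tree dumped in comment #4, the PHI nodes at the bottom each have a single external child { x_14, y_15 }, so under this model they are built from scalars. That in turn makes both children of the PHI node above them external, and it is built from scalars as well.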
From: rguenth at gcc dot gnu.org @ 2021-01-27 16:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98854
Richard Biener <rguenth at gcc dot gnu.org> changed:
               What|Removed                      |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                     |RESOLVED
         Resolution|---                          |FIXED
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.