* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
2015-05-04 7:20 [Bug tree-optimization/66002] New: paq8p benchmark 50% slower than clang on sandybridge trippels at gcc dot gnu.org
@ 2015-05-04 12:47 ` rguenth at gcc dot gnu.org
2015-05-04 12:55 ` trippels at gcc dot gnu.org
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-05-04 12:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
gprof tells me
Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  s/call   s/call  name
 67.13     25.40    25.40   2097192    0.00     0.00  contextModel2()
 18.08     32.24     6.84  18874735    0.00     0.00  ContextMap::mix1(Mixer&, int, int, int, int)
  9.46     35.82     3.58   2097192    0.00     0.00  Mixer::p()
  2.72     36.85     1.03  14680344    0.00     0.00  APM1::p(int, int, int)
  0.53     37.05     0.20   2097192    0.00     0.00  dmcModel(Mixer&)
That is probably not too informative, because of inlining.
I wonder if you can run clang++ with vectorization disabled?
From: trippels at gcc dot gnu.org @ 2015-05-04 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #2 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
clang without -vectorize-loops -vectorize-slp:
./paq8p -4 file1.in 54.82s user 0.08s system 100% cpu 54.891 total
From: rguenth at gcc dot gnu.org @ 2015-05-04 13:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
high up in the profile are functions train() and dot_product(), also
ContextMap::mix1 and Mixer::p. But
void train(short *t, short *w, int n, int err) {
  n=(n+7)&-8;
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    if (wt>32767) wt=32767;
    w[i]=wt;
  }
}
seems to be the hottest function.
t.c:4:5: note: not vectorized: relevant stmt not supported: prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
t.c:4:5: note: bad operation or unsupported loop bound.
t.c:1:6: note: vectorized 0 loops in function.
<bb 5>:
# i_33 = PHI <0(4), i_28(7)>
_9 = (long unsigned int) i_33;
_10 = _9 * 2;
_12 = w_11(D) + _10;
_13 = *_12;
_14 = (int) _13;
_16 = t_15(D) + _10;
_17 = *_16;
_18 = (int) _17;
_20 = _18 * err_19(D);
_21 = _20 * 2;
_22 = _21 >> 16;
_23 = _22 + 1;
_24 = _23 >> 1;
wt_25 = _14 + _24;
pretmp_60 = (short int) wt_25;
_31 = (unsigned int) wt_25;
_53 = _31 + 32768;
prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
_32 = _53 <= 65535;
_52 = wt_25 < -32768;
_51 = _32 | _52;
prephitmp_59 = _51 ? prephitmp_61 : 32767;
*_12 = prephitmp_59;
i_28 = i_33 + 1;
if (n_7 > i_28)
goto <bb 7>;
else
goto <bb 6>;
<bb 7>:
goto <bb 5>;
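The `_53 <= 65535` test in the dump is a range check folded into a single unsigned comparison: `wt` is biased by 32768 in unsigned arithmetic, so any value outside [-32768, 32767] ends up above 65535. The following C sketch illustrates that equivalence and mirrors the two conditional expressions of the dump. The function names are made up for illustration and are not part of paq8p or GCC.

```c
#include <limits.h>
#include <stdbool.h>

/* Plain two-comparison range test for "fits in a 16-bit short". */
bool in_short_range_plain(int wt)
{
    return wt >= -32768 && wt <= 32767;
}

/* Folded form as in the dump: bias by 32768 in unsigned arithmetic and
   compare once. Values below -32768 wrap around to large unsigned
   numbers, so they fail the test too. */
bool in_short_range_folded(int wt)
{
    unsigned int biased = (unsigned int)wt + 32768u;    /* _53 */
    return biased <= 65535u;
}

/* Mirrors the two COND_EXPRs of the dump (hypothetical helper). */
short clamp_like_dump(int wt)
{
    bool in_range = in_short_range_folded(wt);          /* _32 */
    short first = in_range ? (short)wt : (short)-32768; /* prephitmp_61 */
    bool keep = in_range || wt < -32768;                /* _51 */
    return keep ? first : (short)32767;                 /* prephitmp_59 */
}

/* Counts inputs on which the two range tests disagree. */
int count_range_check_mismatches(void)
{
    int mismatches = 0;
    for (int wt = -200000; wt <= 200000; ++wt)
        if (in_short_range_plain(wt) != in_short_range_folded(wt))
            ++mismatches;
    if (in_short_range_plain(INT_MIN) != in_short_range_folded(INT_MIN))
        ++mismatches;
    if (in_short_range_plain(INT_MAX) != in_short_range_folded(INT_MAX))
        ++mismatches;
    return mismatches;
}
```

Calling `count_range_check_mismatches()` should return 0 if the folded comparison really is the same predicate as the two-sided test.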
From: rguenth at gcc dot gnu.org @ 2015-05-04 13:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2015-05-04
Blocks| |53947
Ever confirmed|0 |1
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
is
unsigned int <= 65535 ? short int : short int;
pushing the condition to a separate stmt might get us to support this
"narrowing" conversion.
Of course ifcvt does a pretty poor job on this as well...
We do vectorize
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    // if (wt>32767) wt=32767;
    w[i]=wt;
  }
as if (wt<-32768) wt=-32768; becomes a MAX_EXPR. Also if I change it to
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    else if (wt>32767) wt=32767;
    w[i]=wt;
  }
we vectorize it as MIN/MAX_EXPRs.
Maybe you can perform this source change manually and see what it does
to performance.
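For anyone trying the source change: the two-`if` clamp, the `else if` clamp and an explicit min/max clamp all compute the same value, so the rewrite only changes the shape the optimizers see. A minimal C sketch of the three forms follows. The helper names are hypothetical, and the arithmetic assumes that `>>` on a negative `int` is an arithmetic shift, as it is with GCC.

```c
/* Original form: two independent ifs. */
short clamp_two_ifs(int wt)
{
    if (wt < -32768) wt = -32768;
    if (wt > 32767) wt = 32767;
    return (short)wt;
}

/* Suggested source change: the second test becomes an else-if. */
short clamp_else_if(int wt)
{
    if (wt < -32768) wt = -32768;
    else if (wt > 32767) wt = 32767;
    return (short)wt;
}

/* Explicit MAX followed by MIN, the shape the MIN/MAX_EXPR form takes. */
short clamp_min_max(int wt)
{
    int lo = wt > -32768 ? wt : -32768;   /* MAX (wt, -32768) */
    int hi = lo < 32767 ? lo : 32767;     /* MIN (lo, 32767) */
    return (short)hi;
}

/* train() with the else-if clamp. As in the original, n is rounded up
   to a multiple of 8, so both arrays must be padded accordingly. */
void train_else_if(const short *t, short *w, int n, int err)
{
    n = (n + 7) & -8;
    for (int i = 0; i < n; ++i) {
        int wt = w[i] + ((((t[i] * err * 2) >> 16) + 1) >> 1);
        w[i] = clamp_else_if(wt);
    }
}
```

Whether a given compiler version vectorizes `train_else_if` is best checked from its vectorizer dump rather than assumed from this sketch.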
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
From: rguenth at gcc dot gnu.org @ 2015-05-04 14:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Status|NEW |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
VRP performs jump-threading to else-if style but phiopt doesn't handle the
min-max case with split PHIs
if (wt_25 < -32768)
goto <bb 5>;
else
goto <bb 4>;
<bb 4>:
if (wt_25 > 32767)
goto <bb 6>;
else
goto <bb 5>;
<bb 5>:
# wt_31 = PHI <wt_25(4), -32768(3)>
<bb 6>:
# wt_3 = PHI <wt_31(5), 32767(4)>
_26 = (short int) wt_3;
vs.
if (wt_24 < -32768)
goto <bb 6>;
else
goto <bb 4>;
<bb 4>:
if (wt_24 > 32767)
goto <bb 6>;
else
goto <bb 5>;
<bb 5>:
<bb 6>:
# wt_2 = PHI <-32768(3), wt_24(5), 32767(4)>
_25 = (short int) wt_2;
so it looks like phiopt "depends" on mergephi (I always wondered what pass
that is useful for...). Currently that pass runs right before VRP which
definitely does _not_ depend on it. I'd move it right before ifcombine
which is the first pass that might care.
From: trippels at gcc dot gnu.org @ 2015-05-04 14:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #6 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
>
> is
>
> unsigned int <= 65535 ? short int : short int;
>
> pushing the condition to a separate stmt might get us to support this
> "narrowing" conversion.
>
> Of course ifcvt does a pretty poor job on this as well...
>
> We do vectorize
>
> for (int i=0; i<n; ++i) {
> int wt=w[i]+((t[i]*err*2>>16)+1>>1);
> if (wt<-32768) wt=-32768;
> // if (wt>32767) wt=32767;
> w[i]=wt;
> }
>
> as if (wt<-32768) wt=-32768; becomes a MAX_EXPR. Also if I change it to
>
> for (int i=0; i<n; ++i) {
> int wt=w[i]+((t[i]*err*2>>16)+1>>1);
> if (wt<-32768) wt=-32768;
> else if (wt>32767) wt=32767;
> w[i]=wt;
> }
>
> we vectorize it as MIN/MAX_EXPRs.
>
> Maybe you can perform this source change manually and see what it does
> to performance.
With the "else" added gcc beats clang:
./paq8p -4 file1.in 24.81s user 0.10s system 100% cpu 24.902 total
From: rguenth at gcc dot gnu.org @ 2015-05-06 11:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00214.html
regresses
FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
(a real missed optimization - a redundant if remains)
and also
FAIL: gcc.dg/graphite/scop-dsyr2k.c scan-tree-dump-times graphite "number of SCoPs: 1" 1
FAIL: gcc.dg/graphite/scop-dsyrk.c scan-tree-dump-times graphite "number of SCoPs: 1" 1
for 32bits, not investigated yet.
So it seems for the first regression that VRP somehow depends on mergephi,
or at least jump threading as performed by VRP. IL difference before VRP:
@@ -20,16 +19,19 @@
<bb 4>:
if (bytes_11 < 0)
- goto <bb 7>;
+ goto <bb 6>;
else
goto <bb 8>;
<bb 5>:
toread_12 = toread_1 - bytes_11;
+ <bb 6>:
+ # toread_9 = PHI <toread_12(5), toread_1(4)>
+
<bb 7>:
- # toread_1 = PHI <toread_1(4), 4096(2), toread_12(5)>
- # bytes_2 = PHI <bytes_11(4), 1(2), bytes_11(5)>
+ # toread_1 = PHI <toread_9(6), 4096(2)>
+ # bytes_2 = PHI <bytes_11(6), 1(2)>
if (toread_1 != 0)
goto <bb 3>;
else
and then VRP gets
-fix_loop_structure: fixing up loops for function
-Disambiguating loop 1 with multiple latches
-Merged latch edges of loop 1
;; 2 loops found
;;
;; Loop 0
;; header 0, latch 1
;; depth 0, outer -1
-;; nodes: 0 1 2 3 4 5 7 11 8 9 10
+;; nodes: 0 1 2 3 4 5 6 7 8 9 10
;;
;; Loop 1
-;; header 11, latch 7
+;; header 7, latch 6
;; depth 1, outer 0
-;; nodes: 11 7 4 5 3
-;; 2 succs { 11 }
+;; nodes: 7 6 5 4 3
+;; 2 succs { 7 }
;; 3 succs { 4 5 }
-;; 4 succs { 7 8 }
-;; 5 succs { 7 }
-;; 7 succs { 11 }
-;; 11 succs { 3 8 }
+;; 4 succs { 6 8 }
+;; 5 succs { 6 }
+;; 6 succs { 7 }
+;; 7 succs { 3 8 }
;; 8 succs { 9 10 }
;; 9 succs { 10 }
;; 10 succs { 1 }
which might be already the whole story about this - it splits the merged PHI
again but in a different way, ending up with
- <bb 7>:
- # toread_9 = PHI <toread_15(12), toread_12(5)>
- # bytes_8 = PHI <bytes_16(12), bytes_19(5)>
- <bb 11>:
- # toread_1 = PHI <toread_9(7), 4096(2)>
- # bytes_2 = PHI <bytes_8(7), 1(2)>
instead of the following (without mergephi and re-splitting):
+ <bb 6>:
+ # toread_9 = PHI <toread_12(5), toread_8(11)>
+ <bb 7>:
+ # toread_1 = PHI <toread_9(6), 4096(2)>
+ # bytes_2 = PHI <bytes_11(6), 1(2)>
and as final result of VRP:
-bytes_2: ~[0, 0]
+bytes_2: VARYING
and that's the usual issue of VRP not inserting asserts at CFG merges
(it doesn't insert PHIs...). mergephi effectively inserting a PHI
for bytes_11 in BB 6 is pure luck :/
.optimized code difference:
foo ()
{
static char eof_reached = 0;
@@ -13,8 +15,8 @@
<bb 2>:
<bb 3>:
- # toread_22 = PHI <toread_9(6), 4096(2)>
- bytes_11 = bar (toread_22);
+ # toread_18 = PHI <toread_9(6), 4096(2)>
+ bytes_11 = bar (toread_18);
if (bytes_11 <= 0)
goto <bb 4>;
else
@@ -27,21 +29,26 @@
goto <bb 8>;
<bb 5>:
- toread_12 = toread_22 - bytes_11;
+ toread_12 = toread_18 - bytes_11;
<bb 6>:
- # toread_9 = PHI <toread_22(4), toread_12(5)>
+ # toread_9 = PHI <toread_12(5), toread_18(4)>
if (toread_9 != 0)
goto <bb 3>;
else
goto <bb 7>;
<bb 7>:
- return;
+ if (bytes_11 == 0)
+ goto <bb 8>;
+ else
+ goto <bb 9>;
<bb 8>:
eof_reached = 1;
- goto <bb 7>;
+
+ <bb 9>:
+ return;
}
I'm inclined to XFAIL the testcase, but ...
From: rguenth at gcc dot gnu.org @ 2015-05-06 12:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm testing adding a mergephi pass instead of moving the existing one.
From: rguenth at gcc dot gnu.org @ 2015-05-07 9:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Author: rguenth
Date: Thu May 7 09:52:38 2015
New Revision: 222873
URL: https://gcc.gnu.org/viewcvs?rev=222873&root=gcc&view=rev
Log:
2015-05-07 Richard Biener <rguenther@suse.de>
PR tree-optimization/66002
* passes.def: Schedule another pass_merge_phi after ifcombine, right
before phiopt.
* gcc.dg/vect/vect-125.c: New testcase.
Added:
trunk/gcc/testsuite/gcc.dg/vect/vect-125.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/passes.def
trunk/gcc/testsuite/ChangeLog
From: rguenth at gcc dot gnu.org @ 2015-05-07 9:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Should be fixed now.