public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/66002] New: paq8p benchmark 50% slower than clang on sandybridge
@ 2015-05-04 7:20 trippels at gcc dot gnu.org
2015-05-04 12:47 ` [Bug tree-optimization/66002] " rguenth at gcc dot gnu.org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-05-04 7:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Bug ID: 66002
Summary: paq8p benchmark 50% slower than clang on sandybridge
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: trippels at gcc dot gnu.org
Target Milestone: ---
Created attachment 35451
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35451&action=edit
testcase
On sandybridge I get:
trippels@gcc75 ~ % g++ -O3 -march=native paq8p.ii -o paq8p
trippels@gcc75 ~ % time ./paq8p -4 file1.in
Creating archive with 1 file(s)...
file1.in 262144 -> 262371
262144 -> 262400
Extracting 1 file(s) from archive -4
Comparing file1.in 262144 -> identical
./paq8p -4 file1.in 61.82s user 0.08s system 100% cpu 1:01.90 total
trippels@gcc75 ~ % clang++ -w -O3 -march=native paq8p.ii -o paq8p
trippels@gcc75 ~ % time ./paq8p -4 file1.in
...
./paq8p -4 file1.in 29.60s user 0.12s system 100% cpu 29.715 total
Intel compiler:
./paq8p -4 file1.in 22.00s user 0.09s system 99% cpu 22.092 total
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-04 12:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
gprof tells me
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 67.13     25.40    25.40  2097192     0.00     0.00  contextModel2()
 18.08     32.24     6.84 18874735     0.00     0.00  ContextMap::mix1(Mixer&, int, int, int, int)
  9.46     35.82     3.58  2097192     0.00     0.00  Mixer::p()
  2.72     36.85     1.03 14680344     0.00     0.00  APM1::p(int, int, int)
  0.53     37.05     0.20  2097192     0.00     0.00  dmcModel(Mixer&)
probably not too interesting (inlining).
I wonder if you can run clang++ with vectorization disabled?
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: trippels at gcc dot gnu.org @ 2015-05-04 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #2 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
clang without -vectorize-loops -vectorize-slp:
./paq8p -4 file1.in 54.82s user 0.08s system 100% cpu 54.891 total
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-04 13:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
high up in the profile are functions train() and dot_product(), also
ContextMap::mix1 and Mixer::p. But
void train(short *t, short *w, int n, int err) {
  n=(n+7)&-8;
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    if (wt>32767) wt=32767;
    w[i]=wt;
  }
}
seems to be the hottest function.
t.c:4:5: note: not vectorized: relevant stmt not supported: prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
t.c:4:5: note: bad operation or unsupported loop bound.
t.c:1:6: note: vectorized 0 loops in function.
  <bb 5>:
  # i_33 = PHI <0(4), i_28(7)>
  _9 = (long unsigned int) i_33;
  _10 = _9 * 2;
  _12 = w_11(D) + _10;
  _13 = *_12;
  _14 = (int) _13;
  _16 = t_15(D) + _10;
  _17 = *_16;
  _18 = (int) _17;
  _20 = _18 * err_19(D);
  _21 = _20 * 2;
  _22 = _21 >> 16;
  _23 = _22 + 1;
  _24 = _23 >> 1;
  wt_25 = _14 + _24;
  pretmp_60 = (short int) wt_25;
  _31 = (unsigned int) wt_25;
  _53 = _31 + 32768;
  prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
  _32 = _53 <= 65535;
  _52 = wt_25 < -32768;
  _51 = _32 | _52;
  prephitmp_59 = _51 ? prephitmp_61 : 32767;
  *_12 = prephitmp_59;
  i_28 = i_33 + 1;
  if (n_7 > i_28)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 7>:
  goto <bb 5>;
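
A note on the `_53 <= 65535` test in the dump above: the two signed comparisons from the source have been folded into a single unsigned range check by biasing the value with 32768. The sketch below illustrates why the two forms agree, assuming a 32-bit int. The function names are invented for this illustration; they come from neither paq8p nor GCC.

```c
#include <assert.h>
#include <stdbool.h>

/* Straightforward form: two signed comparisons. */
static bool in_short_range_signed(int wt)
{
    return wt >= -32768 && wt <= 32767;
}

/* Folded form, mirroring _31 and _53 in the dump: add 32768 in unsigned
   arithmetic so that the valid range [-32768, 32767] maps to [0, 65535].
   Assumes int is 32 bits wide. */
static bool in_short_range_biased(int wt)
{
    unsigned int biased = (unsigned int) wt + 32768u;
    return biased <= 65535u;
}

/* Count disagreements between the two forms over [lo, hi]. */
static int count_mismatches(int lo, int hi)
{
    int mismatches = 0;
    for (int wt = lo; wt <= hi; ++wt)
        if (in_short_range_signed(wt) != in_short_range_biased(wt))
            ++mismatches;
    return mismatches;
}

/* Usage: the forms agree on a window well beyond the short range. */
static void check_range_forms(void)
{
    assert(count_mismatches(-200000, 200000) == 0);
}
```

The vectorizer's complaint is not about this check as such but about the COND_EXPR that compares an unsigned int and selects between short int values, as the next comment spells out.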
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-04 13:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2015-05-04
Blocks| |53947
Ever confirmed|0 |1
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
is
unsigned int <= 65535 ? short int : short int;
pushing the condition to a separate stmt might get us to support this
"narrowing" conversion.
Of course ifcvt does a pretty poor job on this as well...
We do vectorize
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    // if (wt>32767) wt=32767;
    w[i]=wt;
  }
as if (wt<-32768) wt=-32768; becomes a MAX_EXPR. Also if I change it to
  for (int i=0; i<n; ++i) {
    int wt=w[i]+((t[i]*err*2>>16)+1>>1);
    if (wt<-32768) wt=-32768;
    else if (wt>32767) wt=32767;
    w[i]=wt;
  }
we vectorize it as MIN/MAX_EXPRs.
Maybe you can perform this source change manually and see what it does
to performance.
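
As an illustration of why the rewrite is safe, here is a minimal sketch showing that the two-if clamp, the else-if clamp, and an explicit max-then-min produce the same value for every input. The helper names are hypothetical, and the ternaries only stand in for what MAX_EXPR and MIN_EXPR compute; this is not how GCC represents them internally.

```c
#include <assert.h>

/* Original shape in train(): two independent ifs. */
static short clamp_two_ifs(int wt)
{
    if (wt < -32768) wt = -32768;
    if (wt > 32767) wt = 32767;
    return (short) wt;
}

/* Rewritten shape: else-if. */
static short clamp_else_if(int wt)
{
    if (wt < -32768) wt = -32768;
    else if (wt > 32767) wt = 32767;
    return (short) wt;
}

/* Explicit max followed by min. */
static short clamp_min_max(int wt)
{
    int lo = wt < -32768 ? -32768 : wt;   /* MAX (wt, -32768) */
    int hi = lo > 32767 ? 32767 : lo;     /* MIN (lo, 32767)  */
    return (short) hi;
}

/* Usage: all three agree over a window around the short range. */
static void check_clamps_agree(void)
{
    for (int wt = -100000; wt <= 100000; ++wt) {
        assert(clamp_two_ifs(wt) == clamp_else_if(wt));
        assert(clamp_two_ifs(wt) == clamp_min_max(wt));
    }
}
```

The bounds are disjoint, so the second test can never fire after the first one has clamped; that is what makes adding the "else" a pure no-op at the source level.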
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-04 14:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Status|NEW |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
VRP performs jump-threading to else-if style but phiopt doesn't handle the
min-max case with split PHIs
  if (wt_25 < -32768)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:
  if (wt_25 > 32767)
    goto <bb 6>;
  else
    goto <bb 5>;

  <bb 5>:
  # wt_31 = PHI <wt_25(4), -32768(3)>

  <bb 6>:
  # wt_3 = PHI <wt_31(5), 32767(4)>
  _26 = (short int) wt_3;
vs.
  if (wt_24 < -32768)
    goto <bb 6>;
  else
    goto <bb 4>;

  <bb 4>:
  if (wt_24 > 32767)
    goto <bb 6>;
  else
    goto <bb 5>;

  <bb 5>:

  <bb 6>:
  # wt_2 = PHI <-32768(3), wt_24(5), 32767(4)>
  _25 = (short int) wt_2;
so it looks like phiopt "depends" on mergephi (I always wondered what that
pass is useful for...). Currently mergephi runs right before VRP, which
definitely does _not_ depend on it. I'd move it to right before ifcombine,
which is the first pass that might care.
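
To make the two CFG shapes easier to compare, here is a hand-written C rendering of them using gotos, with labels named after the basic blocks in the dumps. This is only a sketch of the control flow, not GCC output; each assignment before a jump plays the role of one PHI argument on that edge, and the function names are made up.

```c
#include <assert.h>

/* First dump: two chained PHIs (bb 5 feeds bb 6). */
static short clamp_split_phi(int wt)
{
    int wt_31, wt_3;

    if (wt < -32768) { wt_31 = -32768; goto bb5; }   /* edge 3 -> 5 */
    if (wt > 32767)  { wt_3 = 32767;   goto bb6; }   /* edge 4 -> 6 */
    wt_31 = wt;                                      /* edge 4 -> 5 */
bb5:
    wt_3 = wt_31;                                    /* edge 5 -> 6 */
bb6:
    return (short) wt_3;
}

/* Second dump: one three-argument PHI in bb 6. */
static short clamp_merged_phi(int wt)
{
    int wt_2;

    if (wt < -32768) { wt_2 = -32768; goto bb6; }    /* edge 3 -> 6 */
    if (wt > 32767)  { wt_2 = 32767;  goto bb6; }    /* edge 4 -> 6 */
    wt_2 = wt;                                       /* edge 5 -> 6 */
bb6:
    return (short) wt_2;
}

/* Usage: both shapes compute the same clamp. */
static void check_shapes_agree(void)
{
    for (int wt = -100000; wt <= 100000; ++wt)
        assert(clamp_split_phi(wt) == clamp_merged_phi(wt));
}
```

Both functions compute the same result; the difference is purely in where the values are joined, which is what decides whether phiopt recognizes the min-max pattern.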
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: trippels at gcc dot gnu.org @ 2015-05-04 14:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #6 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> prephitmp_61 = _53 <= 65535 ? pretmp_60 : -32768;
>
> is
>
> unsigned int <= 65535 ? short int : short int;
>
> pushing the condition to a separate stmt might get us to support this
> "narrowing" conversion.
>
> Of course ifcvt does a pretty poor job on this as well...
>
> We do vectorize
>
> for (int i=0; i<n; ++i) {
> int wt=w[i]+((t[i]*err*2>>16)+1>>1);
> if (wt<-32768) wt=-32768;
> // if (wt>32767) wt=32767;
> w[i]=wt;
> }
>
> as if (wt<-32768) wt=-32768; becomes a MAX_EXPR. Also if I change it to
>
> for (int i=0; i<n; ++i) {
> int wt=w[i]+((t[i]*err*2>>16)+1>>1);
> if (wt<-32768) wt=-32768;
> else if (wt>32767) wt=32767;
> w[i]=wt;
> }
>
> we vectorize it as MIN/MAX_EXPRs.
>
> Maybe you can perform this source change manually and see what it does
> to performance.
With the "else" added gcc beats clang:
./paq8p -4 file1.in 24.81s user 0.10s system 100% cpu 24.902 total
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-06 11:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00214.html
regresses
FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
(a real missed optimization - a redundant if remains)
and also
FAIL: gcc.dg/graphite/scop-dsyr2k.c scan-tree-dump-times graphite "number of SCoPs: 1" 1
FAIL: gcc.dg/graphite/scop-dsyrk.c scan-tree-dump-times graphite "number of SCoPs: 1" 1
for 32bits, not investigated yet.
So it seems for the first regression that VRP somehow depends on mergephi,
or at least jump threading as performed by VRP. IL difference before VRP:
@@ -20,16 +19,19 @@
<bb 4>:
if (bytes_11 < 0)
- goto <bb 7>;
+ goto <bb 6>;
else
goto <bb 8>;
<bb 5>:
toread_12 = toread_1 - bytes_11;
+ <bb 6>:
+ # toread_9 = PHI <toread_12(5), toread_1(4)>
+
<bb 7>:
- # toread_1 = PHI <toread_1(4), 4096(2), toread_12(5)>
- # bytes_2 = PHI <bytes_11(4), 1(2), bytes_11(5)>
+ # toread_1 = PHI <toread_9(6), 4096(2)>
+ # bytes_2 = PHI <bytes_11(6), 1(2)>
if (toread_1 != 0)
goto <bb 3>;
else
and then VRP gets
-fix_loop_structure: fixing up loops for function
-Disambiguating loop 1 with multiple latches
-Merged latch edges of loop 1
;; 2 loops found
;;
;; Loop 0
;; header 0, latch 1
;; depth 0, outer -1
-;; nodes: 0 1 2 3 4 5 7 11 8 9 10
+;; nodes: 0 1 2 3 4 5 6 7 8 9 10
;;
;; Loop 1
-;; header 11, latch 7
+;; header 7, latch 6
;; depth 1, outer 0
-;; nodes: 11 7 4 5 3
-;; 2 succs { 11 }
+;; nodes: 7 6 5 4 3
+;; 2 succs { 7 }
;; 3 succs { 4 5 }
-;; 4 succs { 7 8 }
-;; 5 succs { 7 }
-;; 7 succs { 11 }
-;; 11 succs { 3 8 }
+;; 4 succs { 6 8 }
+;; 5 succs { 6 }
+;; 6 succs { 7 }
+;; 7 succs { 3 8 }
;; 8 succs { 9 10 }
;; 9 succs { 10 }
;; 10 succs { 1 }
which might be already the whole story about this - it splits the merged PHI
again but in a different way, ending up with
- <bb 7>:
- # toread_9 = PHI <toread_15(12), toread_12(5)>
- # bytes_8 = PHI <bytes_16(12), bytes_19(5)>
- <bb 11>:
- # toread_1 = PHI <toread_9(7), 4096(2)>
- # bytes_2 = PHI <bytes_8(7), 1(2)>
instead of the following (without mergephi and re-splitting):
+ <bb 6>:
+ # toread_9 = PHI <toread_12(5), toread_8(11)>
+ <bb 7>:
+ # toread_1 = PHI <toread_9(6), 4096(2)>
+ # bytes_2 = PHI <bytes_11(6), 1(2)>
and as final result of VRP:
-bytes_2: ~[0, 0]
+bytes_2: VARYING
and that's the usual issue of VRP not inserting asserts at CFG merges
(it doesn't insert PHIs...). mergephi effectively inserting a PHI
for bytes_11 in BB 6 is pure luck :/
.optimized code difference:
foo ()
{
static char eof_reached = 0;
@@ -13,8 +15,8 @@
<bb 2>:
<bb 3>:
- # toread_22 = PHI <toread_9(6), 4096(2)>
- bytes_11 = bar (toread_22);
+ # toread_18 = PHI <toread_9(6), 4096(2)>
+ bytes_11 = bar (toread_18);
if (bytes_11 <= 0)
goto <bb 4>;
else
@@ -27,21 +29,26 @@
goto <bb 8>;
<bb 5>:
- toread_12 = toread_22 - bytes_11;
+ toread_12 = toread_18 - bytes_11;
<bb 6>:
- # toread_9 = PHI <toread_22(4), toread_12(5)>
+ # toread_9 = PHI <toread_12(5), toread_18(4)>
if (toread_9 != 0)
goto <bb 3>;
else
goto <bb 7>;
<bb 7>:
- return;
+ if (bytes_11 == 0)
+ goto <bb 8>;
+ else
+ goto <bb 9>;
<bb 8>:
eof_reached = 1;
- goto <bb 7>;
+
+ <bb 9>:
+ return;
}
I'm inclined to XFAIL the testcase, but ...
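
For orientation, the IL above corresponds to a read loop of roughly the following shape. This is a reconstruction from the dumps in this comment, not the verbatim contents of pr21559.c. Two things are assumptions made so the sketch can be run: bar() is replaced by a hypothetical stub, and foo() returns the flag instead of storing it in a static char.

```c
#include <assert.h>

/* Hypothetical stand-in for the external bar(): hands out at most 1000
   bytes per call and reports one transient error (-1) on the second call. */
static int calls;

static int bar(int toread)
{
    ++calls;
    if (calls == 2)
        return -1;
    return toread < 1000 ? toread : 1000;
}

static int foo(void)
{
    int eof_reached = 0;
    int toread = 4096;
    int bytes = 1;

    while (toread != 0) {
        bytes = bar(toread);
        if (bytes <= 0) {
            if (bytes < 0)
                continue;   /* retry; toread is unchanged, so still nonzero */
            break;          /* bytes == 0: end of input */
        }
        toread -= bytes;
    }

    /* When the loop is left through its condition, bytes is never 0
       (the ~[0, 0] range VRP computes for bytes_2), so this test can only
       succeed on the break path.  That is the jump the testcase expects
       to be threaded. */
    if (bytes == 0)
        eof_reached = 1;
    return eof_reached;
}
```

With this stub the loop finishes by exhausting toread after six calls, so the final test is false; the "redundant if" in the regressed .optimized dump is exactly that test being kept on the path where its outcome is already known.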
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-06 12:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm testing adding a mergephi pass instead of moving the existing one.
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-07 9:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Author: rguenth
Date: Thu May 7 09:52:38 2015
New Revision: 222873
URL: https://gcc.gnu.org/viewcvs?rev=222873&root=gcc&view=rev
Log:
2015-05-07 Richard Biener <rguenther@suse.de>
PR tree-optimization/66002
* passes.def: Schedule another pass_merge_phi after ifcombine, right
before phiopt.
* gcc.dg/vect/vect-125.c: New testcase.
Added:
trunk/gcc/testsuite/gcc.dg/vect/vect-125.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/passes.def
trunk/gcc/testsuite/ChangeLog
* [Bug tree-optimization/66002] paq8p benchmark 50% slower than clang on sandybridge
From: rguenth at gcc dot gnu.org @ 2015-05-07 9:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Should be fixed now.