public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 13:06 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
Bug ID: 58529
Summary: Loop 30% faster with Intel than with GCC
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: burnus at gcc dot gnu.org
Host: x86-64-gnu-linux
Created attachment 30893
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30893&action=edit
Test file
The Intel icpc 13.1.1 compiler generates code which is 30% faster than GCC 4.9
for the following function (see test2.cc):
  the_bins_size = 0;
  for (int i = 0; i < arraylength; i++) {
    if (coordexist[i]) {
      the_bins[the_bins_size] = i;
      coordexist[i] = the_bins_size++;
    }
  }
GCC (-funroll-loops): real 0m2.493s  user 0m2.491s  sys 0m0.002s
GCC:                  real 0m1.494s  user 0m1.493s  sys 0m0.000s
ICC:                  real 0m1.160s  user 0m1.157s  sys 0m0.001s
The main function (test.main.cc) was compiled with g++. System used:
Intel(R) Xeon(R) CPU E5-2630, CentOS 6/x86-64-gnu-linux, glibc-2.12.
Compile commands:
g++ -march=native -fno-rtti -fno-exceptions -Ofast -std=c++
icpc -O3 -no-prec-div -xHost -fno-rtti -fno-exceptions
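
For readers without the attachments, the loop from this report can be sketched as a standalone function. The signature, the `int` element types, and the helper `occupied_indices` are assumptions made for illustration; the attached test file may declare things differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical wrapper around the loop from the report. The int element
// types and the signature are assumptions; the attachment may differ.
void compact_bins(int* coordexist, int* the_bins, int& the_bins_size,
                  int arraylength) {
    assert(arraylength >= 0);
    the_bins_size = 0;
    for (int i = 0; i < arraylength; i++) {
        if (coordexist[i]) {
            the_bins[the_bins_size] = i;      // record index of the occupied slot
            coordexist[i] = the_bins_size++;  // overwrite the flag with its rank
        }
    }
}

// Hypothetical convenience driver: returns the indices of occupied slots.
std::vector<int> occupied_indices(std::vector<int>& coordexist) {
    std::vector<int> bins(coordexist.size());
    int n = 0;
    compact_bins(coordexist.data(), bins.data(), n,
                 static_cast<int>(coordexist.size()));
    bins.resize(n);
    return bins;
}
```

With the input {0, 1, 1, 0, 1}, the occupied indices are 1, 2 and 4, and coordexist is rewritten to {0, 0, 1, 0, 2}. Each iteration may read and then write the running counter, so the loop carries a dependency that limits how freely a compiler can reorder it.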
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 13:07 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30894
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30894&action=edit
Main file (calls test file in a loop)
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 13:08 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30895
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30895&action=edit
Assembler generated by Intel's icpc for test.cc
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 14:04 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #3 from Marc Glisse <glisse at gcc dot gnu.org> ---
Does it help if you pass the_bins_size as int*restrict (and adapt the uses)? Or
use a local variable instead that you write at the end? Gcc has a notoriously
restricted view of what restrict means, compared to most other compilers (it is
a feature).
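
The two rewrites suggested here might look roughly like the sketch below. The function names and `int` types are hypothetical, and `__restrict__` is the GCC/Clang spelling, since ISO C++ has no `restrict` keyword. (The reporter's follow-up states that neither variant changed the timing.)

```cpp
#include <cassert>

// Variant 1: the counter is reached through a __restrict__ pointer, which
// promises the compiler that the arrays do not alias it. Names are hypothetical.
void compact_restrict(int* coordexist, int* the_bins,
                      int* __restrict__ the_bins_size, int arraylength) {
    assert(the_bins_size != nullptr);
    *the_bins_size = 0;
    for (int i = 0; i < arraylength; i++) {
        if (coordexist[i]) {
            the_bins[*the_bins_size] = i;
            coordexist[i] = (*the_bins_size)++;
        }
    }
}

// Variant 2: count in a local variable and store the result once at the end.
void compact_local(int* coordexist, int* the_bins, int* the_bins_size,
                   int arraylength) {
    assert(the_bins_size != nullptr);
    int n = 0;  // local counter cannot alias either array
    for (int i = 0; i < arraylength; i++) {
        if (coordexist[i]) {
            the_bins[n] = i;
            coordexist[i] = n++;
        }
    }
    *the_bins_size = n;
}
```

Both variants compute the same result as the original loop as long as the counter does not actually alias the arrays.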
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 14:59 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #3)
> Does it help if you pass the_bins_size as int*restrict (and adapt the uses)?
> Or use a local variable instead that you write at the end?
That doesn't have any effect.
Off topic:
> Gcc has a notoriously restricted view of what restrict means [...](it is a
> feature).
Others would call it a bug - and currently work on changing GCC's internal
representation for "restrict". (Cf. PR45586, PR58526 and some other bugs.)
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 17:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
I actually see GCC four times slower than icpc here (not just 30%), using the
same command lines. The asm produced by GCC contains tons of mov instructions.
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 19:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
Please ignore my last comment; I now see the same 30% difference. The rest must
have been a user error on my part.
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: hjl.tools at gmail dot com @ 2013-09-25 20:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> ---
Can you add "-funroll-loops --param max-unroll-times=7"?
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-26 6:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #8 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Can you add "-funroll-loops --param max-unroll-times=7"?
On Intel Core i5-3570 (glibc-2.18, openSUSE 13.1b1), I get with the attached
Intel .s file and today's GCC:
real 0m0.854s user 0m0.853s sys 0m0.001s ICC
real 0m1.096s user 0m1.095s sys 0m0.001s GCC
real 0m0.653s user 0m0.652s sys 0m0.002s GCC -funroll-loops
real 0m0.661s user 0m0.660s sys 0m0.000s ditto, max-unroll-times=7
I have to re-check why unrolling made it slower on that Xeon E5-2630 (comment
0) but faster on the i5.
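
To make concrete what an unroll factor means for this loop, here is a hand-written sketch unrolled by four. It is only an illustration with a hypothetical function name and a local counter; it is not the code that -funroll-loops actually generates, which depends on the GCC version and tuning.

```cpp
#include <cassert>

// Hand-unrolled (factor 4) sketch of the loop, followed by a remainder loop.
// Illustration only: the compiler's own unrolling may look quite different.
void compact_unrolled4(int* coordexist, int* the_bins, int& the_bins_size,
                       int arraylength) {
    assert(arraylength >= 0);
    int n = 0;
    int i = 0;
    for (; i + 4 <= arraylength; i += 4) {
        if (coordexist[i])     { the_bins[n] = i;     coordexist[i]     = n++; }
        if (coordexist[i + 1]) { the_bins[n] = i + 1; coordexist[i + 1] = n++; }
        if (coordexist[i + 2]) { the_bins[n] = i + 2; coordexist[i + 2] = n++; }
        if (coordexist[i + 3]) { the_bins[n] = i + 3; coordexist[i + 3] = n++; }
    }
    for (; i < arraylength; i++) {  // handle the 0 to 3 leftover elements
        if (coordexist[i]) { the_bins[n] = i; coordexist[i] = n++; }
    }
    the_bins_size = n;
}
```

Unrolling trades fewer loop-control branches for a larger loop body, so the best factor can differ between otherwise similar processors.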
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-09-26 7:26 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
Tobias Burnus <burnus at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
            Summary|Loop 30% faster with Intel  |GCC -funroll-loops 150%
                   |than with GCC               |slower with -march=native
                   |                            |on x86-64
--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Tobias Burnus from comment #8)
> I have to re-check why unrolling made it slower on that Xeon E5-2630
> (comment 0) but faster on the i5.
This seems to be a tuning problem. All timings below were taken on the Xeon
E5-2630, comparing the binary compiled with -march=native on the i5 against the
binary compiled with -march=native on the Xeon E5:
real 1.530s user 1.528s sys 0.000s i5, no unrolling
real 1.483s user 1.481s sys 0.000s Xeon, no unrolling
real 0.937s user 0.934s sys 0.002s i5, -funroll-loops
real 2.480s user 2.478s sys 0.000s Xeon, -funroll-loops
real 0.935s user 0.934s sys 0.000s Xeon, -funroll-loops max-unroll-times=7
The i5's -march=native expands into:
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i
The Xeon's -march=native expands into:
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx
Namely:
i5: -march=core-avx-i -mrdrnd -mf16c -mfsgsbase
--param l2-cache-size=6144 -mtune=core-avx-i
Xeon: -march=corei7-avx -mno-rdrnd -mno-f16c -mno-fsgsbase
--param l2-cache-size=15360 -mtune=corei7-avx
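
One way to see which of the differing ISA options a given compile actually enabled is to test GCC's predefined feature macros. The sketch below uses a hypothetical function name; the macros `__AVX__`, `__RDRND__`, `__F16C__` and `__FSGSBASE__` are the ones GCC defines for -mavx, -mrdrnd, -mf16c and -mfsgsbase. Note that the -mtune and --param settings are not visible this way.

```cpp
#include <cassert>
#include <string>

// Reports which of the ISA extensions that differ between the two
// -march=native expansions were enabled when this file was compiled.
// The function name is hypothetical; the macros are GCC's predefined ones.
std::string enabled_isa_flags() {
    std::string out;
#ifdef __AVX__
    out += "avx ";
#endif
#ifdef __RDRND__
    out += "rdrnd ";
#endif
#ifdef __F16C__
    out += "f16c ";
#endif
#ifdef __FSGSBASE__
    out += "fsgsbase ";
#endif
    assert(out.empty() || out.back() == ' ');  // every entry ends in a space
    return out;
}
```

Compiled with the i5 options listed above, this would be expected to report all four names; with the Xeon options, only avx. The tuning choice itself can be inspected with `g++ -march=native -Q --help=target`.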
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-09-26 7:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #10 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Playing around with --param max-unroll-times= gives:
0 real 0m1.499s user 0m1.497s sys 0m0.000s
1 real 0m1.492s user 0m1.490s sys 0m0.000s
2 real 0m1.138s user 0m1.137s sys 0m0.000s
3 real 0m1.146s user 0m1.144s sys 0m0.000s
4 real 0m0.932s user 0m0.930s sys 0m0.001s
5 real 0m0.955s user 0m0.953s sys 0m0.000s
6 real 0m0.934s user 0m0.933s sys 0m0.000s
7 real 0m0.934s user 0m0.932s sys 0m0.001s
8 real 0m2.480s user 0m2.477s sys 0m0.000s
9 real 0m2.481s user 0m2.479s sys 0m0.000s
10 real 0m2.482s user 0m2.478s sys 0m0.001s
11 real 0m2.480s user 0m2.477s sys 0m0.000s
12 real 0m2.477s user 0m2.474s sys 0m0.000s
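
The size of the cliff between max-unroll-times=7 and 8 can be put in numbers with a small helper; the function name is hypothetical and the inputs are the "real" timings from the table above.

```cpp
#include <cassert>

// Percentage by which `slow` exceeds `fast`, e.g. 2.0 s vs 1.0 s gives 100.
double percent_slower(double slow, double fast) {
    assert(fast > 0.0);
    return (slow / fast - 1.0) * 100.0;
}
```

For example, `percent_slower(2.480, 0.934)` compares a limit of 8 with a limit of 7 and comes out at roughly 165%, while `percent_slower(2.480, 1.499)` compares a limit of 8 with a limit of 0 and comes out at roughly 65%.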
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-10-11 7:51 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #11 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I wonder whether the approach of Teresa's patch might help with this issue, cf.
(first email) http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00123.html
(latest patch) http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00321.html
(latest email) http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01931.html
The mentioned follow-up patch is
http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00239.html