* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
@ 2013-09-25 13:07 ` burnus at gcc dot gnu.org
2013-09-25 13:08 ` burnus at gcc dot gnu.org
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-25 13:07 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30894
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30894&action=edit
Main file (calls test file in a loop)
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 13:08 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30895
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30895&action=edit
Assembler generated by Intel's icpc for test.cc
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 14:04 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #3 from Marc Glisse <glisse at gcc dot gnu.org> ---
Does it help if you pass the_bins_size as int*restrict (and adapt the uses)? Or
use a local variable instead that you write at the end? GCC has a notoriously
restricted view of what restrict means, compared to most other compilers (it is
a feature).
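
For readers without the attachments, the two suggestions above can be sketched roughly as follows. This is only an illustration: the real loop in test.cc is not quoted in this thread, so apart from the name the_bins_size everything here (function names, parameters, the filtering loop itself) is hypothetical. __restrict is the extension spelling accepted in C++ by GCC and icpc. Note that comment #4 reports neither change made a difference for the actual test case.

```cpp
#include <cassert>

// Hypothetical baseline: the counter is reached through a reference of the
// same type as the output array, so the compiler must assume that a store
// to the_bins[] could modify the_bins_size.
void fill_bins_ref(const int* in, int n, int limit,
                   int* the_bins, int& the_bins_size) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        if (in[i] < limit)
            the_bins[the_bins_size++] = in[i];
}

// Suggestion 1: pass the counter as a restrict-qualified pointer and adapt
// the uses, promising the compiler that the pointers do not alias.
void fill_bins_restrict(const int* __restrict in, int n, int limit,
                        int* __restrict the_bins,
                        int* __restrict the_bins_size) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        if (in[i] < limit)
            the_bins[(*the_bins_size)++] = in[i];
}

// Suggestion 2: count in a local variable, which cannot alias the arrays,
// and write the result back once after the loop.
void fill_bins_local(const int* in, int n, int limit,
                     int* the_bins, int& the_bins_size) {
    assert(n >= 0);
    int size = the_bins_size;
    for (int i = 0; i < n; ++i)
        if (in[i] < limit)
            the_bins[size++] = in[i];
    the_bins_size = size;
}
```

All three variants compute the same result; they differ only in what the compiler is allowed to assume about aliasing inside the loop.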
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-25 14:59 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #3)
> Does it help if you pass the_bins_size as int*restrict (and adapt the uses)?
> Or use a local variable instead that you write at the end?
That doesn't have any effect.
Off topic:
> GCC has a notoriously restricted view of what restrict means [...](it is a
> feature).
Others would call it a bug, and are currently working on changing GCC's
internal representation for "restrict" (cf. PR45586, PR58526 and some other
bugs).
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 17:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
I actually see GCC running 4 times slower than icpc here (not just 30%), using
the same command lines. The asm produced by GCC contains tons of mov
instructions.
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: glisse at gcc dot gnu.org @ 2013-09-25 19:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
Please ignore my last comment. I now see the same 30% difference; the rest must
have been a user error on my part.
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: hjl.tools at gmail dot com @ 2013-09-25 20:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> ---
Can you add "-funroll-loops --param max-unroll-times=7"?
* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
From: burnus at gcc dot gnu.org @ 2013-09-26 6:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #8 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Can you add "-funroll-loops --param max-unroll-times=7"?
On an Intel Core i5-3570 (glibc-2.18, openSUSE 13.1b1), with the attached
Intel .s file and today's GCC, I get:
real 0m0.854s  user 0m0.853s  sys 0m0.001s  ICC
real 0m1.096s  user 0m1.095s  sys 0m0.001s  GCC
real 0m0.653s  user 0m0.652s  sys 0m0.002s  GCC -funroll-loops
real 0m0.661s  user 0m0.660s  sys 0m0.000s  GCC -funroll-loops --param max-unroll-times=7
I have to re-check why unrolling made it slower on the Xeon E5-2630 (comment
0) but faster on the i5.
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-09-26 7:26 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
Tobias Burnus <burnus at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
            Summary|Loop 30% faster with Intel  |GCC -funroll-loops 150%
                   |than with GCC               |slower with -march=native
                   |                            |on x86-64
--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Tobias Burnus from comment #8)
> I have to re-check why unrolling made it slower on that Xeon E5-2630
> (comment 0) but faster on the i5.
Seems to be a tuning problem. All timings below were taken on the Xeon
E5-2630; they compare code compiled with -march=native on the i5 against code
compiled with -march=native on the Xeon E5:
real 1.530s  user 1.528s  sys 0.000s  i5, no unrolling
real 1.483s  user 1.481s  sys 0.000s  Xeon, no unrolling
real 0.937s  user 0.934s  sys 0.002s  i5, -funroll-loops
real 2.480s  user 2.478s  sys 0.000s  Xeon, -funroll-loops
real 0.935s  user 0.934s  sys 0.000s  Xeon, -funroll-loops --param max-unroll-times=7
The i5's -march=native expands into:
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i
The Xeon's -march=native expands into:
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx
Namely:
i5: -march=core-avx-i -mrdrnd -mf16c -mfsgsbase
--param l2-cache-size=6144 -mtune=core-avx-i
Xeon: -march=corei7-avx -mno-rdrnd -mno-f16c -mno-fsgsbase
--param l2-cache-size=15360 -mtune=corei7-avx
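
One way to double-check from inside a translation unit which of the differing ISA options a given build actually enabled is to test the predefined macros GCC sets for them. A minimal sketch, assuming GCC's x86 macros __AVX__, __RDRND__, __F16C__ and __FSGSBASE__; the struct and function names are made up for illustration. It does not cover the -mtune or --param l2-cache-size differences, which define no such macros.

```cpp
#include <cstdio>

// Hypothetical helper type: one flag per ISA option of interest.
struct IsaFlags {
    bool avx;
    bool rdrnd;
    bool f16c;
    bool fsgsbase;
};

// Reports which options were enabled when THIS file was compiled, based on
// the macros the compiler predefines for -mavx, -mrdrnd, -mf16c, -mfsgsbase.
constexpr IsaFlags compiled_isa_flags() {
    return IsaFlags{
#ifdef __AVX__
        true,
#else
        false,
#endif
#ifdef __RDRND__
        true,
#else
        false,
#endif
#ifdef __F16C__
        true,
#else
        false,
#endif
#ifdef __FSGSBASE__
        true
#else
        false
#endif
    };
}

// Number of the four tracked options that are enabled.
constexpr int enabled_count(IsaFlags f) {
    return int(f.avx) + int(f.rdrnd) + int(f.f16c) + int(f.fsgsbase);
}

// Prints one line per option, e.g. for pasting into a bug report.
void print_isa_flags() {
    constexpr IsaFlags f = compiled_isa_flags();
    std::printf("avx=%d rdrnd=%d f16c=%d fsgsbase=%d\n",
                int(f.avx), int(f.rdrnd), int(f.f16c), int(f.fsgsbase));
}
```

Built with the i5's option set above, the last three flags would be expected to be set; with the Xeon's set, only avx.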
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-09-26 7:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #10 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Playing around with --param max-unroll-times= gives (first column is the
parameter value):
 0  real 0m1.499s  user 0m1.497s  sys 0m0.000s
 1  real 0m1.492s  user 0m1.490s  sys 0m0.000s
 2  real 0m1.138s  user 0m1.137s  sys 0m0.000s
 3  real 0m1.146s  user 0m1.144s  sys 0m0.000s
 4  real 0m0.932s  user 0m0.930s  sys 0m0.001s
 5  real 0m0.955s  user 0m0.953s  sys 0m0.000s
 6  real 0m0.934s  user 0m0.933s  sys 0m0.000s
 7  real 0m0.934s  user 0m0.932s  sys 0m0.001s
 8  real 0m2.480s  user 0m2.477s  sys 0m0.000s
 9  real 0m2.481s  user 0m2.479s  sys 0m0.000s
10  real 0m2.482s  user 0m2.478s  sys 0m0.001s
11  real 0m2.480s  user 0m2.477s  sys 0m0.000s
12  real 0m2.477s  user 0m2.474s  sys 0m0.000s
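
Since the timings fall off a cliff between 7 and 8, one possible per-loop workaround is to cap the unroll factor in the source instead of globally with --param. The sketch below uses #pragma GCC unroll, which to my knowledge exists only in later GCC releases (GCC 8 onward) and was not available in the compiler tested here; the reduction loop and function names are hypothetical stand-ins, not the loop from test.cc. Whether this avoids the slowdown for the real test case is untested.

```cpp
#include <cassert>
#include <cstddef>

// Reference version: any unrolling is decided by -funroll-loops and the
// global --param max-unroll-times setting.
long sum_default(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Same loop, but requesting an unroll factor of 4 for this loop only.
// Compilers that do not know the pragma ignore it (possibly with a
// warning), so the function stays correct either way.
long sum_unroll4(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
#pragma GCC unroll 4
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

The factor 4 is picked from the table above, where values 4 through 7 all land near 0.93s; the pragma changes code generation only, never the result.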
* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
From: burnus at gcc dot gnu.org @ 2013-10-11 7:51 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529
--- Comment #11 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I wonder whether the approach of Teresa's patch might help with this issue, cf.
(first email) http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00123.html
(latest patch) http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00321.html
(latest email) http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01931.html
The mentioned follow-up patch is
http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00239.html