public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC
@ 2013-09-25 13:06 burnus at gcc dot gnu.org
  2013-09-25 13:07 ` [Bug middle-end/58529] " burnus at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-25 13:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

            Bug ID: 58529
           Summary: Loop 30% faster with Intel than with GCC
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: burnus at gcc dot gnu.org
              Host: x86-64-gnu-linux

Created attachment 30893
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30893&action=edit
Test file

The Intel icpc 13.1.1 compiler generates code which is 30% faster than the code
generated by GCC 4.9 for the following function (see test2.cc):

  the_bins_size = 0;
  for (int i = 0; i < arraylength; i++) {
    if (coordexist[i]) {
      the_bins[the_bins_size] = i;
      coordexist[i] = the_bins_size++;
    }
  }

GCC -funroll-loops: real 0m2.493s   user 0m2.491s  sys 0m0.002s
GCC:                real 0m1.494s   user 0m1.493s  sys 0m0.000s
ICC:                real 0m1.160s   user 0m1.157s  sys 0m0.001s

The main function (test.main.cc) was compiled with g++ in both cases. System used:
Intel(R) Xeon(R) CPU E5-2630, CentOS 6/x86-64-gnu-linux, glibc-2.12.

g++ -march=native -fno-rtti -fno-exceptions -Ofast -std=c++
icpc -O3 -no-prec-div -xHost -fno-rtti -fno-exceptions


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
@ 2013-09-25 13:07 ` burnus at gcc dot gnu.org
  2013-09-25 13:08 ` burnus at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-25 13:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30894
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30894&action=edit
Main file (calls test file in a loop)



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
  2013-09-25 13:07 ` [Bug middle-end/58529] " burnus at gcc dot gnu.org
@ 2013-09-25 13:08 ` burnus at gcc dot gnu.org
  2013-09-25 14:04 ` glisse at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-25 13:08 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Created attachment 30895
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30895&action=edit
Assembler generated by Intel's icpc for test.cc



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
  2013-09-25 13:07 ` [Bug middle-end/58529] " burnus at gcc dot gnu.org
  2013-09-25 13:08 ` burnus at gcc dot gnu.org
@ 2013-09-25 14:04 ` glisse at gcc dot gnu.org
  2013-09-25 14:59 ` burnus at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: glisse at gcc dot gnu.org @ 2013-09-25 14:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #3 from Marc Glisse <glisse at gcc dot gnu.org> ---
Does it help if you pass the_bins_size as int*restrict (and adapt the uses)? Or
if you use a local variable instead and write it back at the end? GCC has a
notoriously restricted view of what restrict means, compared to most other
compilers (it is a feature).



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2013-09-25 14:04 ` glisse at gcc dot gnu.org
@ 2013-09-25 14:59 ` burnus at gcc dot gnu.org
  2013-09-25 17:53 ` glisse at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-25 14:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #3)
> Does it help if you pass the_bins_size as int*restrict (and adapt the uses)?
> Or use a local variable instead that you write at the end?

That doesn't have any effect.

Off topic:
> Gcc has a notoriously restricted view of what restrict means [...](it is a
> feature).
Others would call it a bug, and are currently working on changing GCC's internal
representation for "restrict" (cf. PR45586, PR58526 and some other bugs).



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2013-09-25 14:59 ` burnus at gcc dot gnu.org
@ 2013-09-25 17:53 ` glisse at gcc dot gnu.org
  2013-09-25 19:35 ` glisse at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: glisse at gcc dot gnu.org @ 2013-09-25 17:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
I actually see GCC being 4 times slower (not just 30%) than icpc here, using the
same command lines. The asm produced by GCC contains tons of mov insns.



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2013-09-25 17:53 ` glisse at gcc dot gnu.org
@ 2013-09-25 19:35 ` glisse at gcc dot gnu.org
  2013-09-25 20:20 ` hjl.tools at gmail dot com
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: glisse at gcc dot gnu.org @ 2013-09-25 19:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
Please ignore my last comment. I now see the same 30% difference; the rest must
have been a user error on my part.



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2013-09-25 19:35 ` glisse at gcc dot gnu.org
@ 2013-09-25 20:20 ` hjl.tools at gmail dot com
  2013-09-26  6:00 ` burnus at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: hjl.tools at gmail dot com @ 2013-09-25 20:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> ---
Can you add "-funroll-loops --param max-unroll-times=7"?



* [Bug middle-end/58529] Loop 30% faster with Intel than with GCC
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2013-09-25 20:20 ` hjl.tools at gmail dot com
@ 2013-09-26  6:00 ` burnus at gcc dot gnu.org
  2013-09-26  7:26 ` [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64 burnus at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-26  6:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #8 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Can you add "-funroll-loops --param max-unroll-times=7"?

On an Intel Core i5-3570 (glibc-2.18, openSUSE 13.1b1), using the attached
Intel .s file and today's GCC, I get:

real    0m0.854s  user    0m0.853s  sys     0m0.001s  ICC
real    0m1.096s  user    0m1.095s  sys     0m0.001s  GCC
real    0m0.653s  user    0m0.652s  sys     0m0.002s  GCC -funroll-loops
real    0m0.661s  user    0m0.660s  sys     0m0.000s  ditto, max-unroll-times=7

I have to re-check why unrolling made it slower on that Xeon E5-2630 (comment
0) but faster on the i5.



* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2013-09-26  6:00 ` burnus at gcc dot gnu.org
@ 2013-09-26  7:26 ` burnus at gcc dot gnu.org
  2013-09-26  7:36 ` burnus at gcc dot gnu.org
  2013-10-11  7:51 ` burnus at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-26  7:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
            Summary|Loop 30% faster with Intel  |GCC -funroll-loops 150%
                   |than with GCC               |slower with -march=native
                   |                            |on x86-64

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Tobias Burnus from comment #8)
> I have to re-check why unrolling made it slower on that Xeon E5-2630
> (comment 0) but faster on the i5.

Seems to be a tuning problem. All timings below were taken on the Xeon E5-2630,
comparing code compiled with the i5's -march=native settings against code
compiled with the Xeon E5's -march=native settings:

real 1.530s  user 1.528s  sys 0.000s i5,   no unrolling
real 1.483s  user 1.481s  sys 0.000s Xeon, no unrolling
real 0.937s  user 0.934s  sys 0.002s i5,   -funroll-loops
real 2.480s  user 2.478s  sys 0.000s Xeon, -funroll-loops
real 0.935s  user 0.934s  sys 0.000s Xeon, -funroll-loops max-unroll-times=7

The i5's -march=native expands into:
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt  -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i

The Xeon's -march=native expands into:
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx

Namely, the two expansions differ only in:
i5:   -march=core-avx-i -mrdrnd    -mf16c    -mfsgsbase
      --param l2-cache-size=6144  -mtune=core-avx-i
Xeon: -march=corei7-avx -mno-rdrnd -mno-f16c -mno-fsgsbase
      --param l2-cache-size=15360 -mtune=corei7-avx
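
A comparison like the one above can be reproduced mechanically. The helper
below is a hypothetical sketch, not part of GCC: it splits two flag strings on
whitespace and lists the tokens that appear only in the first one.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a flag string on whitespace into a set of tokens.
std::set<std::string> tokens(const std::string &flags)
{
  std::istringstream in(flags);
  std::set<std::string> out;
  std::string tok;
  while (in >> tok)
    out.insert(tok);
  return out;
}

// Tokens present in `a` but not in `b`, in sorted order.
std::vector<std::string> only_in(const std::string &a, const std::string &b)
{
  const std::set<std::string> sa = tokens(a);
  const std::set<std::string> sb = tokens(b);
  std::vector<std::string> out;
  for (const std::string &t : sa)
    if (sb.count(t) == 0)
      out.push_back(t);
  return out;
}
```

Calling only_in() once in each direction with the two expansions pasted in as
strings gives both halves of the difference. Because the split is on
whitespace, "--param l2-cache-size=6144" shows up as the single token
"l2-cache-size=6144".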



* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2013-09-26  7:26 ` [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64 burnus at gcc dot gnu.org
@ 2013-09-26  7:36 ` burnus at gcc dot gnu.org
  2013-10-11  7:51 ` burnus at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-09-26  7:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #10 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Playing around with --param max-unroll-times=N gives (first column is N):

0        real    0m1.499s        user    0m1.497s        sys     0m0.000s
1        real    0m1.492s        user    0m1.490s        sys     0m0.000s
2        real    0m1.138s        user    0m1.137s        sys     0m0.000s
3        real    0m1.146s        user    0m1.144s        sys     0m0.000s
4        real    0m0.932s        user    0m0.930s        sys     0m0.001s
5        real    0m0.955s        user    0m0.953s        sys     0m0.000s
6        real    0m0.934s        user    0m0.933s        sys     0m0.000s
7        real    0m0.934s        user    0m0.932s        sys     0m0.001s
8        real    0m2.480s        user    0m2.477s        sys     0m0.000s
9        real    0m2.481s        user    0m2.479s        sys     0m0.000s
10       real    0m2.482s        user    0m2.478s        sys     0m0.001s
11       real    0m2.480s        user    0m2.477s        sys     0m0.000s
12       real    0m2.477s        user    0m2.474s        sys     0m0.000s



* [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
  2013-09-25 13:06 [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC burnus at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2013-09-26  7:36 ` burnus at gcc dot gnu.org
@ 2013-10-11  7:51 ` burnus at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: burnus at gcc dot gnu.org @ 2013-10-11  7:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

--- Comment #11 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I wonder whether the approach of Teresa's patch might help with this issue, cf.
  (first email)   http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00123.html
  (latest patch)  http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00321.html
  (latest email)  http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01931.html
The mentioned follow-up patch is
http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00239.html

