How to get GCC on par with ICC?

public inbox for gcc@gcc.gnu.org

* How to get GCC on par with ICC?
@ 2018-06-06 15:57 Paul Menzel
  2018-06-06 16:14 ` Joel Sherrill
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Paul Menzel @ 2018-06-06 15:57 UTC (permalink / raw)
  To: gcc

Dear GCC folks,

Some scientists in our organization still want to use the Intel
compiler because, they say, it produces faster code, which is then
executed on clusters. Some resources on the Web [1][2] confirm this. (I
am aware that it depends heavily on the actual program.)

My question is: is it realistic that GCC could catch up, so that the
scientists start to use it instead of Intel’s compiler? Or will Intel
developers always have the lead, because they have secret documentation
and direct contact with the processor designers?

If it is realistic, how can we get there? Would the program have to be
written first, and the compiler then optimized for it? Or are just more
GCC developers needed?

Kind regards,

Paul

[1]: https://colfaxresearch.com/compiler-comparison/
[2]: 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
@ 2018-06-06 16:14 ` Joel Sherrill
  2018-06-06 16:20   ` Paul Menzel
  2018-06-20 22:42   ` NightStrike
  2018-06-06 16:22 ` Bin.Cheng
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Joel Sherrill @ 2018-06-06 16:14 UTC (permalink / raw)
  To: Paul Menzel; +Cc: gcc

On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:

> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler,
> as they say, it produces faster code, which is then executed on clusters.
> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> heavily dependent on the actual program.)
>

Do they have specific examples where icc is better for them? Or can point
to specific GCC PRs which impact them?

GCC versions?

Are there specific CPU model variants of concern?

What flags are used to compile? Sometimes a bit of advice can produce
improvements.

Without specific examples, it is hard to set goals.


> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
>

For sure examples are needed so there are test cases to use for reference.

If you want anything improved in any free software project, sponsoring
developers is always a good thing. If you sponsor the right developers. :)

I'm not discouraging you. I'm just trying to turn this into something
actionable.

--joel sherrill


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 16:14 ` Joel Sherrill
@ 2018-06-06 16:20   ` Paul Menzel
  2018-06-20 22:42   ` NightStrike
  1 sibling, 0 replies; 20+ messages in thread
From: Paul Menzel @ 2018-06-06 16:20 UTC (permalink / raw)
  To: Joel Sherrill; +Cc: gcc

Dear Joel,


Thank you for your quick reply.


On 06/06/18 17:57, Joel Sherrill wrote:
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel wrote:

>> Some scientists in our organization still want to use the Intel compiler,
>> as they say, it produces faster code, which is then executed on clusters.
>> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
>> heavily dependent on the actual program.)
> 
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
> 
> GCC versions?
> 
> Are there specific CPU model variants of concern?
> 
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
> 
> Without specific examples, it is hard to set goals.

I could get such examples, but it will take some time, as they come from
other institutes.

The clusters use Intel processors exclusively. (Hopefully, that will
change.)

I also found an English version of the German Linux-Magazin article at
ADMIN Magazine [3]. The German article made a stronger statement: that
they use the Intel compilers for performance reasons.

>> My question is, is it realistic, that GCC could catch up and that the
>> scientists will start to use it over Intel’s compiler? Or will Intel
>> developers always have the lead, because they have secret documentation and
>> direct contact with the processor designers?
>>
>> If it is realistic, how can we get there? Would first the program be
>> written, and then the compiler be optimized for that? Or are just more GCC
>> developers needed?
> 
> For sure examples are needed so there are test cases to use for reference.
> 
> If you want anything improved in any free software project, sponsoring
> developers is always a good thing. If you sponsor the right developers. :)

That’s what I hoped for, but didn’t ask here. If you could point me to a
list of possible contractors, that would be great.

Please keep in mind that in my organization certain decisions are made
*very* slowly. I’ll try to get answers quickly, but procuring funding
might take longer (half a year or much longer).


Kind regards,

Paul


>> [1]: https://colfaxresearch.com/compiler-comparison/
>> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf
[3]: http://www.admin-magazine.com/HPC/Articles/Selecting-Compilers-for-a-Supercomputer
     "HPC Compilers"


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 16:14 ` Joel Sherrill
  2018-06-06 16:20   ` Paul Menzel
@ 2018-06-20 22:42   ` NightStrike
  2018-06-21  9:20     ` Richard Biener
  2018-06-22  0:48     ` Steve Ellcey
  1 sibling, 2 replies; 20+ messages in thread
From: NightStrike @ 2018-06-20 22:42 UTC (permalink / raw)
  To: joel; +Cc: Paul Menzel, gcc

On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill <joel@rtems.org> wrote:
>
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> > Dear GCC folks,
> >
> >
> > Some scientists in our organization still want to use the Intel compiler,
> > as they say, it produces faster code, which is then executed on clusters.
> > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > heavily dependent on the actual program.)
> >
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.

If I could perhaps jump in here for a moment...  Just today I hit upon
a series of small (in lines of code) loops that gcc can't vectorize,
and intel vectorizes like a madman.  They all involve a lot of heavy
use of std::vector<std::vector<float>>.  Comparisons were with gcc
8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
as SCHED_FIFO, mlockall, affinity set to its own core, and all
interrupts vectored off that core.  So, as close to not-noisy as
possible.

I was surprised at the results, but using each compiler's methods of
dumping vectorization info, intel wins on two points:

1) It actually vectorizes
2) Its vectorization output is much more easily readable

Options were:

gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native

vs:

icc -Ofast -std=gnu++14


So, not exactly exact, but pretty close.


So here's an example of a chunk of code (not very readable, sorry
about that) that intel can vectorize, and subsequently make about 50%
faster:

        std::size_t nLayers { input.nn.size() };
        //std::size_t ySize = std::max_element(input.nn.cbegin(),
        //        input.nn.cend(), [](auto a, auto b){ return a.size() < b.size();
        //        })->size();
        std::size_t ySize = 0;
        for (auto const & nn: input.nn)
                ySize = std::max(ySize, nn.size());

        float yNorm[ySize];
        for (auto & y: yNorm)
                y = 0.0f;
        for (std::size_t i = 0; i < xSize; ++i)
                yNorm[i] = xNorm[i];
        for (std::size_t layer = 0; layer < nLayers; ++layer) {
                auto & nn = input.nn[layer];
                auto & b = nn.back();
                float y[ySize];
                for (std::size_t i = 0; i < nn[0].size(); ++i) {
                        y[i] = b[i];
                        for (std::size_t j = 0; j < nn.size() - 1; ++j)
                                y[i] += nn.at(j).at(i) * yNorm[j];
                }
                for (std::size_t i = 0; i < ySize; ++i) {
                        if (layer < nLayers - 1)
                                y[i] = std::max(y[i], 0.0f);
                        yNorm[i] = y[i];
                }
        }


If I was better at godbolt, I could show the asm, but I'm not.  I'm
willing to learn, though.
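
For anyone who wants to turn the fragment above into something compilable,
here is a minimal sketch of one layer of that loop nest. The names
layer_nested and layer_flat are hypothetical, and so is the assumed layout
(a rectangular matrix whose last row holds the biases). The first function
keeps the nested std::vector with bounds-checked .at() as in the post; the
second does the same arithmetic on one contiguous row-major array with a
unit-stride inner loop, which is usually the easier shape for a vectorizer.
Whether GCC 8.1 vectorizes either one is not established here; building with
-O3 -fopt-info-vec-missed makes GCC report which loops it rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction: nn holds `rows` weight rows followed by one bias
// row (assumed layout), and `in` supplies one input value per weight row.
std::vector<float> layer_nested(const std::vector<std::vector<float>>& nn,
                                const std::vector<float>& in) {
    assert(!nn.empty() && in.size() + 1 >= nn.size());
    const std::vector<float>& b = nn.back();
    std::vector<float> y(nn[0].size());
    for (std::size_t i = 0; i < nn[0].size(); ++i) {
        y[i] = b[i];
        for (std::size_t j = 0; j < nn.size() - 1; ++j)
            y[i] += nn.at(j).at(i) * in[j];  // strided, bounds-checked access
    }
    return y;
}

// Same sums, accumulated in the same order per output element, but on a
// contiguous row-major array `w` of (rows + 1) * cols floats.
std::vector<float> layer_flat(const std::vector<float>& w, std::size_t rows,
                              std::size_t cols, const std::vector<float>& in) {
    assert(w.size() == (rows + 1) * cols && in.size() >= rows);
    // Start from the bias row, which is the last `cols` elements of w.
    std::vector<float> y(w.begin() + static_cast<std::ptrdiff_t>(rows * cols),
                         w.end());
    for (std::size_t j = 0; j < rows; ++j) {
        const float* row = w.data() + j * cols;
        const float x = in[j];
        for (std::size_t i = 0; i < cols; ++i)  // unit-stride inner loop
            y[i] += row[i] * x;
    }
    return y;
}
```

Both functions should return the same values for the same data, so the pair
can double as a correctness check when attaching a testcase to a bug report.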

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-20 22:42   ` NightStrike
@ 2018-06-21  9:20     ` Richard Biener
  2018-06-22  0:48     ` Steve Ellcey
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-21  9:20 UTC (permalink / raw)
  To: NightStrike; +Cc: joel, pmenzel+gcc.gnu.org, GCC Development

On Wed, Jun 20, 2018 at 11:12 PM NightStrike <nightstrike@gmail.com> wrote:
>
> On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill <joel@rtems.org> wrote:
> >
> > On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> > pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
> >
> > > Dear GCC folks,
> > >
> > >
> > > Some scientists in our organization still want to use the Intel compiler,
> > > as they say, it produces faster code, which is then executed on clusters.
> > > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > > heavily dependent on the actual program.)
> > >
> >
> > Do they have specific examples where icc is better for them? Or can point
> > to specific GCC PRs which impact them?
> >
> >
> > GCC versions?
> >
> > Are there specific CPU model variants of concern?
> >
> > What flags are used to compile? Some times a bit of advice can produce
> > improvements.
> >
> > Without specific examples, it is hard to set goals.
>
> If I could perhaps jump in here for a moment...  Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman.  They all involve a lot of heavy
> use of std::vector<std::vector<float>>.  Comparisons were with gcc

Ick - C++ ;)

> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core.  So, as close to not-noisy as
> possible.
>
> I was surprised at the results results, but using each compiler's methods of
> dumping vectorization info, intel wins on two points:
>
> 1) It actually vectorizes
> 2) It's vectorizing output is much more easily readable
>
> Options were:
>
> gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native
>
> vs:
>
> icc -Ofast -std=gnu++14
>
>
> So, not exactly exact, but pretty close.
>
>
> So here's an example of a chunk of code (not very readable, sorry
> about that) that intel can vectorize, and subsequently make about 50%
> faster:
>
>         std::size_t nLayers { input.nn.size() };
>         //std::size_t ySize = std::max_element(input.nn.cbegin(),
> input.nn.cend(), [](auto a, auto b){ return a.size() < b.size();
> })->size();
>         std::size_t ySize = 0;
>         for (auto const & nn: input.nn)
>                 ySize = std::max(ySize, nn.size());
>
>         float yNorm[ySize];
>         for (auto & y: yNorm)
>                 y = 0.0f;
>         for (std::size_t i = 0; i < xSize; ++i)
>                 yNorm[i] = xNorm[i];
>         for (std::size_t layer = 0; layer < nLayers; ++layer) {
>                 auto & nn = input.nn[layer];
>                 auto & b = nn.back();
>                 float y[ySize];
>                 for (std::size_t i = 0; i < nn[0].size(); ++i) {
>                         y[i] = b[i];
>                         for (std::size_t j = 0; j < nn.size() - 1; ++j)
>                                 y[i] += nn.at(j).at(i) * yNorm[j];
>                 }
>                 for (std::size_t i = 0; i < ySize; ++i) {
>                         if (layer < nLayers - 1)
>                                 y[i] = std::max(y[i], 0.0f);
>                         yNorm[i] = y[i];
>                 }
>         }
>
>
> If I was better at godbolt, I could show the asm, but I'm not.  I'm
> willing to learn, though.

A compilable testcase would be more useful - just file a bugzilla.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-20 22:42   ` NightStrike
  2018-06-21  9:20     ` Richard Biener
@ 2018-06-22  0:48     ` Steve Ellcey
  1 sibling, 0 replies; 20+ messages in thread
From: Steve Ellcey @ 2018-06-22  0:48 UTC (permalink / raw)
  To: NightStrike, joel; +Cc: Paul Menzel, gcc

On Wed, 2018-06-20 at 17:11 -0400, NightStrike wrote:
>
> If I could perhaps jump in here for a moment...  Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman.  They all involve a lot of heavy
> use of std::vector<std::vector<float>>.  Comparisons were with gcc
> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core.  So, as close to not-noisy as
> possible.

There are quite a number of bugzilla reports with examples where GCC
does not vectorize a loop.  I wonder if this example is related to PR
61247.

Steve Ellcey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
  2018-06-06 16:14 ` Joel Sherrill
@ 2018-06-06 16:22 ` Bin.Cheng
  2018-06-06 18:31 ` Dmitry Mikushin
  2018-06-07 10:06 ` Richard Biener
  3 siblings, 0 replies; 20+ messages in thread
From: Bin.Cheng @ 2018-06-06 16:22 UTC (permalink / raw)
  To: Paul Menzel; +Cc: GCC Development

On Wed, Jun 6, 2018 at 3:51 PM, Paul Menzel
<pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler, as
> they say, it produces faster code, which is then executed on clusters. Some
> resources on the Web [1][2] confirm this. (I am aware, that it’s heavily
> dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
There are developers actually working on performance optimization in
GCC, so you are not the only one :).  As an open source compiler we do
lack resources, so more developers are always good for the project.  As
Joel pointed out, typical/reduced workloads showing the performance gap
are very important for our developers as well as for attracting new
developers.  We can probably open a meta-bug for tracking if you have
many of these example workloads.

Thanks,
bin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
  2018-06-06 16:14 ` Joel Sherrill
  2018-06-06 16:22 ` Bin.Cheng
@ 2018-06-06 18:31 ` Dmitry Mikushin
  2018-06-06 21:10   ` Ryan Burn
  2018-06-06 22:43   ` Zan Lynx
  2018-06-07 10:06 ` Richard Biener
  3 siblings, 2 replies; 20+ messages in thread
From: Dmitry Mikushin @ 2018-06-06 18:31 UTC (permalink / raw)
  To: Paul Menzel; +Cc: GCC

Dear Paul,

The opinion you've mentioned is common in the scientific community. However,
on closer inspection it often turns out that the set of GCC compiler options
used simply does not correspond to the "fast" defaults of Intel. For instance,
"-O3" for Intel actually corresponds to (at least) "-O3 -ffast-math
-march=native" for GCC. Omitting "-ffast-math" obviously introduces a
significant performance gap.
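
One likely reason the flag matters this much: a floating-point sum can only
be vectorized if the additions may be reordered, and floating-point addition
is not associative, so without -ffast-math GCC has to keep the source order.
The sketch below (function names are hypothetical) contrasts the in-order sum
with a four-accumulator sum, which is roughly the reordering a vectorizer
would apply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order sum: the order written in the source, which the compiler must
// preserve unless it is allowed to reassociate floating-point operations.
float sum_in_order(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Four independent partial sums combined at the end, similar to what a
// 4-lane vectorized reduction computes.
float sum_four_lanes(const std::vector<float>& v) {
    assert(v.size() % 4 == 0);  // sketch only: no remainder loop
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < v.size(); i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += v[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For well-behaved data the two agree, but for the input {1e20f, 1.0f, -1e20f,
1.0f} the in-order sum is 1 while the four-lane sum is 0, which is exactly
the kind of difference "-ffast-math" gives the compiler permission to
introduce.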

Kind regards,
- Dmitry Mikushin | Applied Parallel Computing LLC |
https://parallel-computing.pro


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 18:31 ` Dmitry Mikushin
@ 2018-06-06 21:10   ` Ryan Burn
  2018-06-07 10:02     ` Richard Biener
  2018-06-06 22:43   ` Zan Lynx
  1 sibling, 1 reply; 20+ messages in thread
From: Ryan Burn @ 2018-06-06 21:10 UTC (permalink / raw)
  To: Dmitry Mikushin; +Cc: Paul Menzel, GCC

One case where ICC can generate much faster code sometimes is by using
the nontemporal pragma [https://software.intel.com/en-us/node/524559]
with loops.

AFAIK, there's no such equivalent pragma in gcc
[https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].

When I tried this simple example
https://github.com/rnburn/square_timing/blob/master/bench.cpp that
measures times for this loop:

void compute(const double* x, index_t N, double* y) {
  #pragma vector nontemporal
  for(index_t i=0; i<N; ++i) y[i] = x[i]*x[i];
}

 with and without nontemporal I got these times (N = 1,000,000)

Temporal     1,042,080
Non-Temporal 538,842

So running with the non-temporal pragma was nearly twice as fast.

An equivalent non-temporal pragma for GCC would, IMO, certainly be a
very good feature to add.
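
Until such a pragma exists, streaming stores can be requested by hand in GCC
through the SSE2 intrinsics from <immintrin.h>. What follows is a sketch
under assumptions, not an equivalent of the ICC pragma: compute_stream is a
hypothetical name, index_t from the benchmark is replaced by std::size_t, and
on targets without SSE2 the function falls back to the plain loop. No timings
are claimed for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// y[i] = x[i] * x[i], written with non-temporal (streaming) stores on SSE2.
void compute_stream(const double* x, std::size_t n, double* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    std::size_t i = 0;
#if defined(__SSE2__)
    // _mm_stream_pd requires a 16-byte aligned destination, so peel scalar
    // iterations until y + i is aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(y + i) % 16 != 0) {
        y[i] = x[i] * x[i];
        ++i;
    }
    for (; i + 2 <= n; i += 2) {
        const __m128d v = _mm_loadu_pd(x + i);   // unaligned load is fine
        _mm_stream_pd(y + i, _mm_mul_pd(v, v));  // store bypassing the cache
    }
    _mm_sfence();  // order the streaming stores before later stores
#endif
    for (; i < n; ++i)  // remainder, or the whole loop without SSE2
        y[i] = x[i] * x[i];
}
```

This is the hand-optimized route: it ties the source to one instruction set,
which is the cost a loop pragma would avoid.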

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 21:10   ` Ryan Burn
@ 2018-06-07 10:02     ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-07 10:02 UTC (permalink / raw)
  To: rnickb731; +Cc: dmitry, pmenzel+gcc.gnu.org, GCC Development

On Wed, Jun 6, 2018 at 8:31 PM Ryan Burn <rnickb731@gmail.com> wrote:
>
> One case where ICC can generate much faster code sometimes is by using
> the nontemporal pragma [https://software.intel.com/en-us/node/524559]
> with loops.
>
> AFAIK, there's no such equivalent pragma in gcc
> [https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].
>
> When I tried this simple example
> https://github.com/rnburn/square_timing/blob/master/bench.cpp that
> measures times for this loop:
>
> void compute(const double* x, index_t N, double* y) {
>   #pragma vector nontemporal
>   for(index_t i=0; i<N; ++i) y[i] = x[i]*x[i];
> }
>
>  with and without nontemporal I got these times (N = 1,000,000)
>
> Temporal     1,042,080
> Non-Temporal 538,842
>
> So running with the non-temporal pragma was nearly twice as fast.
>
> An equivalent non-temporal pragma for GCC would, IMO, certainly be a
> very good feature to add.

GCC has robust infrastructure for loop pragmas now; it's just that the set
of pragmas available isn't very big.  It would be interesting to know which
ICC ones people use regularly so we can support those in GCC as well.

Note that using #pragmas is very much hand-optimizing the code for the
compiler you use - something that is possible for GCC as well.
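
As an illustration of what is there today, a small sketch using two loop
pragmas GCC accepts: "#pragma GCC ivdep", which asserts that the loop has no
loop-carried dependences that would prevent vectorization, and "#pragma GCC
unroll n", available from GCC 8. The function names are hypothetical, and
other compilers may ignore these pragmas with a warning.

```cpp
#include <cassert>
#include <cstddef>

// out[i] = a[i] + k * b[i]. The ivdep pragma is a promise by the programmer
// that out does not overlap a or b in a way that creates a dependence.
void scale_add(float* out, const float* a, const float* b, float k,
               std::size_t n) {
    assert(n == 0 || (out != nullptr && a != nullptr && b != nullptr));
#pragma GCC ivdep
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + k * b[i];
}

// Fixed-length dot product with an explicit unroll request.
float dot8(const float* a, const float* b) {
    assert(a != nullptr && b != nullptr);
    float s = 0.0f;
#pragma GCC unroll 8
    for (int i = 0; i < 8; ++i)
        s += a[i] * b[i];
    return s;
}
```

Neither pragma changes the result of a correct program; they only hand the
optimizer information or a preference it would otherwise have to derive or
guess.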

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 18:31 ` Dmitry Mikushin
  2018-06-06 21:10   ` Ryan Burn
@ 2018-06-06 22:43   ` Zan Lynx
  2018-06-07  9:54     ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Zan Lynx @ 2018-06-06 22:43 UTC (permalink / raw)
  To: Paul Menzel; +Cc: gcc

On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> The opinion you've mentioned is common in scientific community. However, in
> more detail it often surfaces that the used set of GCC compiler options
> simply does not correspond to that "fast" version of Intel. For instance,
> when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> introduces significant performance gap.
> 

Please note that if your compute cluster uses different models of CPU,
be extremely careful with -march=native.

I've been bitten by it in VMs, several times. Unless you always run on
the same system that did the build, you are running a risk of illegal
instructions.

-- 
                Knowledge is Power -- Power Corrupts
                        Study Hard -- Be Evil

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 22:43   ` Zan Lynx
@ 2018-06-07  9:54     ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-07  9:54 UTC (permalink / raw)
  To: zlynx; +Cc: pmenzel+gcc.gnu.org, GCC Development

On Wed, Jun 6, 2018 at 11:10 PM Zan Lynx <zlynx@acm.org> wrote:
>
> On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
>
> Please note that if your compute cluster uses different models of CPU,
> be extremely careful with -march=native.
>
> I've been bitten by it in VMs, several times. Unless you always run on
> the same system that did the build, you are running a risk of illegal
> instructions.

Yes.  Note this is where ICC has an advantage because it supports
automagically doing runtime versioning based on the CPU instruction
set for vectorized loops.  We only support that in an awkward
explicit way (the manual talks about this in the 'Function Multiversioning'
section).

But in the end it's just a "detail" that can be worked around with
a little inconvenience ;)  (I've yet to see a heterogeneous cluster
where the instruction set differences make a performance difference
over choosing the lowest common one)
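
A sketch of what the explicit route can look like on x86 with GCC: one copy
of a kernel is compiled with __attribute__((target("avx2"))), another for the
baseline instruction set, and __builtin_cpu_supports picks between them at
run time. Note this is plain manual dispatch, not the mechanism described in
the 'Function Multiversioning' section of the manual; it is used here only to
keep the example free of dynamic-linker requirements. All function names and
the HAVE_X86_DISPATCH macro are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

// Baseline copy: compiled for whatever -march the translation unit uses.
static void square_baseline(const double* x, std::size_t n, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] * x[i];
}

#if HAVE_X86_DISPATCH
// Same loop, but the compiler may use AVX2 instructions inside this function.
__attribute__((target("avx2")))
static void square_avx2(const double* x, std::size_t n, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] * x[i];
}
#endif

// Entry point: choose the copy that the running CPU can execute.
void square(const double* x, std::size_t n, double* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) {
        square_avx2(x, n, y);
        return;
    }
#endif
    square_baseline(x, n, y);
}
```

A binary built this way with a conservative -march runs on the oldest node of
the cluster and still uses the wider instructions where they exist, at the
price of writing the dispatch by hand for each hot kernel.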

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
                   ` (2 preceding siblings ...)
  2018-06-06 18:31 ` Dmitry Mikushin
@ 2018-06-07 10:06 ` Richard Biener
  2018-06-08 22:08   ` Steve Ellcey
  3 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2018-06-07 10:06 UTC (permalink / raw)
  To: pmenzel+gcc.gnu.org; +Cc: GCC Development

On Wed, Jun 6, 2018 at 5:52 PM Paul Menzel
<pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel
> compiler, as they say, it produces faster code, which is then executed
> on clusters. Some resources on the Web [1][2] confirm this. (I am aware,
> that it’s heavily dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation
> and direct contact with the processor designers?

They will of course have an edge in timing when supporting a new architecture
because they have access to NDA material and hardware.  For example the
OSS community doesn't yet have access to any AVX512 capable machine
(speaking of the GNU compile-farm), and those are prohibitively expensive
for a private contributor.

Similar stories apply to the access to proprietary benchmarks or simply
having resources to continuously work with folks in HPC to make sure ICC
works great for their codes.

> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more
> GCC developers needed?

I think a big part of the story is perception and training.  This means that
for example a coherent and up-to-date source for information on how
to use GCC in a HPC environment (optimizing your code, recommended
compiler options, pitfalls to avoid, etc.) is desperately missing.

When we do our own comparisons of GCC vs. ICC on benchmarks
like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
(in fact it even trails in some benchmarks) unless you get to
"SPEC tricks" like data structure re-organization optimizations that
probably never apply in practice on real-world code (and people
should fix such things at the source level being pointed at them
via actually profiling their codes).

In my own experience, which dates back nearly 15 years now, ICC is
buggy (generates wrong-code / simulation results) and cannot compile
a "simple" C++ program ;)  This made me start working on GCC.

Note that the very best strength of GCC is the first-class high-quality
(insert more buzzwords here) support infrastructure if you actually
run into issues with the compiler!  Even when using paid ICC I never
got timely fixes (if at all) for wrong-code issues I reported to them!

I've separately replied to specific points in other posts where ICC has
an edge over GCC.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-07 10:06 ` Richard Biener
@ 2018-06-08 22:08   ` Steve Ellcey
  2018-06-09 15:32     ` Marc Glisse
  2018-06-11 14:50     ` Martin Jambor
  0 siblings, 2 replies; 20+ messages in thread
From: Steve Ellcey @ 2018-06-08 22:08 UTC (permalink / raw)
  To: Richard Biener, pmenzel+gcc.gnu.org; +Cc: GCC Development

On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>
> When we do our own comparisons of GCC vs. ICC on benchmarks
> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
> (in fact it even trails in some benchmarks) unless you get to
> "SPEC tricks" like data structure re-organization optimizations that
> probably never apply in practice on real-world code (and people
> should fix such things at the source level being pointed at them
> via actually profiling their codes).

Richard,

I was wondering if you have any more details about these comparisons
you have done that you can share?  Compiler versions, options used,
hardware, etc.  Also, were there any tests that stood out in terms of
icc outperforming GCC?

I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
for gcc.

The int rate numbers (running 1 copy only) were not too bad, GCC was
only about 2% slower and only 525.x264_r seemed way slower with GCC.
The fp rate numbers (again only 1 copy) showed a larger difference,
around 20%.  521.wrf_r was more than twice as slow when compiled with
GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
significant slowdowns when compiled with GCC vs. ICC.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-08 22:08   ` Steve Ellcey
@ 2018-06-09 15:32     ` Marc Glisse
  2018-06-11 14:50     ` Martin Jambor
  1 sibling, 0 replies; 20+ messages in thread
From: Marc Glisse @ 2018-06-09 15:32 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: Richard Biener, pmenzel+gcc.gnu.org, GCC Development

On Fri, 8 Jun 2018, Steve Ellcey wrote:

> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>>
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).
>
> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share?  Compiler versions, options used,
> hardware, etc.  Also, were there any tests that stood out in terms of
> icc outperforming GCC?
>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.

You should use -Ofast for gcc. As mentioned earlier in the discussion,
ICC has some equivalent of -ffast-math by default.

> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference,
> around 20%.  521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-08 22:08   ` Steve Ellcey
  2018-06-09 15:32     ` Marc Glisse
@ 2018-06-11 14:50     ` Martin Jambor
  2018-06-22 22:41       ` Szabolcs Nagy
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Jambor @ 2018-06-11 14:50 UTC (permalink / raw)
  To: sellcey, Richard Biener, pmenzel+gcc.gnu.org; +Cc: GCC Development

Hi Steve,

On Fri, Jun 08 2018, Steve Ellcey wrote:
> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>> 
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).
>
> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share?  Compiler versions, options used,
> hardware, etc.  Also, were there any tests that stood out in terms of
> icc outperforming GCC?

Mostly AMD Ryzen, GCC 8 vs ICC 18.  We were comparing a few combinations
of options.  When we compared ICC's and our -Ofast (with or without
native GCC march/mtune and a set of ICC options that hopefully generate
the best code for Ryzen), we found that without LTO/IPO, GCC is
actually slightly ahead of ICC on integer benchmarks (both SPEC 2006 and
2017).

Floating-point results were a more mixed bag (mostly because ICC
performed surprisingly poorly without IPO on a few) but at least on SPEC
2017, they were clearly better... with a caveat, see below my comment
about wrf.

With LTO/IPO, ICC can perform a few memory-reorg tricks that push them
quite a bit ahead of us but I'm not convinced they can perform these
transformations on much source code that happens not to be a well known
benchmark.  So I'd recommend always looking at non-IPO numbers too.

>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.

Please try with -Ofast too.  The main reason is that -O3 does not imply
-ffast-math and the performance gain from it is often very big (and I
suspect the 525.x264_r difference is because of that).  Alternatively,
if your own workloads require high-precision floating-point math, you
have to force ICC to use it to get a fair comparison.  -Ofast also turns
on -fno-protect-parens and -fstack-arrays that also help a few
benchmarks a lot, but note that you may need to set a large stack ulimit
for them not to crash (but ICC does the same thing, as far as we know).
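
The reason -ffast-math has to be an explicit opt-in is that it allows the compiler to reorder floating-point operations, and floating-point addition is not associative. A minimal C sketch (the function names are made up for illustration) that mimics by hand what a two-lane vectorized sum would compute, next to the strict left-to-right sum:

```c
#include <stddef.h>

/* Strict left-to-right sum: the evaluation order written in the source. */
double sum_in_order(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two interleaved partial sums, roughly what a 2-lane vectorized
   reduction computes: even elements in one lane, odd in the other. */
double sum_two_lanes(const double *x, size_t n)
{
    double lane0 = 0.0, lane1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += x[i];
        lane1 += x[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        lane0 += x[i];
    return lane0 + lane1;
}
```

For x = {1e17, 1.0, -1e17, 1.0}, assuming IEEE double arithmetic evaluated in double precision (as on x86-64 with SSE2), sum_in_order returns 1.0 and sum_two_lanes returns 2.0. Without -ffast-math (or at least -fassociative-math), GCC generally has to keep the first ordering, which is what prevents it from vectorizing such reductions at plain -O3.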

>
> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference, 
> around 20%.  521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.
>

Keep in mind that when discussing FP benchmarks, the used math library
can be (almost) as important as the compiler.  In the case of 481.wrf,
we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
performance is about 70% of ICC's.  When we just linked against AMD's
libm, we got to 83%. When we instructed GCC to generate calls to Intel's
SVML library and linked against it, we got to 91%.  Using both SVML and
AMD's libm, we achieved 93%.

That means that there likely still is 7% to be gained from more clever
optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
is perhaps the most extreme example but definitely not the only one.
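
For readers who want the arithmetic behind these percentages spelled out, here is a small C sketch. It assumes that "performance" means the inverse of run time, which the message does not state explicitly; the constant and function names are invented for illustration.

```c
/* Performance relative to ICC (ICC = 1.00), as quoted in the message above. */
static const double PERF_GLIBC_226         = 0.70;  /* GCC 8 + glibc 2.26 */
static const double PERF_AMD_LIBM          = 0.83;  /* linked against AMD's libm */
static const double PERF_SVML              = 0.91;  /* calls to Intel's SVML */
static const double PERF_SVML_AND_AMD_LIBM = 0.93;  /* both combined */

/* Speedup obtained by moving from one configuration to another. */
double speedup(double perf_from, double perf_to)
{
    return perf_to / perf_from;
}

/* Run time relative to ICC, assuming performance = 1 / run time. */
double relative_runtime(double perf)
{
    return 1.0 / perf;
}
```

Under that assumption, speedup(PERF_GLIBC_226, PERF_SVML_AND_AMD_LIBM) is about 1.33, so the math libraries alone account for roughly a third more throughput, while relative_runtime(PERF_GLIBC_226) is about 1.43, meaning the out-of-the-box build takes roughly 43% longer than ICC's.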

Martin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-11 14:50     ` Martin Jambor
@ 2018-06-22 22:41       ` Szabolcs Nagy
  0 siblings, 0 replies; 20+ messages in thread
From: Szabolcs Nagy @ 2018-06-22 22:41 UTC (permalink / raw)
  To: Martin Jambor, sellcey, Richard Biener, pmenzel+gcc.gnu.org
  Cc: nd, GCC Development

On 11/06/18 11:05, Martin Jambor wrote:
>> The int rate numbers (running 1 copy only) were not too bad, GCC was
>> only about 2% slower and only 525.x264_r seemed way slower with GCC.
>> The fp rate numbers (again only 1 copy) showed a larger difference,
>> around 20%.  521.wrf_r was more than twice as slow when compiled with
>> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
>> significant slowdowns when compiled with GCC vs. ICC.
>>
> 
> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler.  In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's.  When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%.  Using both SVML and
> AMD's libm, we achieved 93%.
> 

i think glibc 2.27 should outperform amd's libm on wrf
(since i upstreamed the single precision code from
https://github.com/ARM-software/optimized-routines/ )

the 83% -> 93% diff is because gcc fails to vectorize
math calls in fortran to libmvec calls.

> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
> is perhaps the most extreme example but definitely not the only one.

there is no longer a problem in gnu libm for the most
common single precision calls and if things go well
then glibc 2.28 will get double precision improvements
too.

but gcc has to learn how to use libmvec in fortran.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
@ 2018-06-15 11:48 Wilco Dijkstra
  2018-06-15 17:03 ` Jeff Law
  0 siblings, 1 reply; 20+ messages in thread
From: Wilco Dijkstra @ 2018-06-15 11:48 UTC (permalink / raw)
  To: mjambor; +Cc: gcc, nd, Steve Ellcey, Richard Biener, pmenzel+gcc.gnu.org

Martin wrote:

> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler.  In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's.  When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%.  Using both SVML and
> AMD's libm, we achieved 93%.
>
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
> is perhaps the most extreme example but definitely not the only one.

You really should retry with GLIBC 2.27 since several key math functions were
rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in huge
performance gains on all targets (eg. wrf improved over 50%).

I fixed several double precision functions in current GLIBC to avoid extremely bad
performance which had been complained about for years. There are more math
functions on the way, so the GNU libm will not only catch up, but become the fastest
math library available.

Wilco

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-15 11:48 Wilco Dijkstra
@ 2018-06-15 17:03 ` Jeff Law
  2018-06-15 18:01   ` Joseph Myers
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Law @ 2018-06-15 17:03 UTC (permalink / raw)
  To: Wilco Dijkstra, mjambor
  Cc: gcc, nd, Steve Ellcey, Richard Biener, pmenzel+gcc.gnu.org

On 06/15/2018 05:39 AM, Wilco Dijkstra wrote:
> Martin wrote:
> 
>> Keep in mind that when discussing FP benchmarks, the used math library
>> can be (almost) as important as the compiler.  In the case of 481.wrf,
>> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
>> performance is about 70% of ICC's.  When we just linked against AMD's
>> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
>> SVML library and linked against it, we got to 91%.  Using both SVML and
>> AMD's libm, we achieved 93%.
>>
>> That means that there likely still is 7% to be gained from more clever
>> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
>> is perhaps the most extreme example but definitely not the only one.
> 
> You really should retry with GLIBC 2.27 since several key math functions were
> rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in huge
> performance gains on all targets (eg. wrf improved over 50%).
> 
> I fixed several double precision functions in current GLIBC to avoid extremely bad
> performance which had been complained about for years. There are more math
> functions on the way, so the GNU libm will not only catch up, but become the fastest
> math library available.
And resolution on -fno-math-errno as the default.  Setting errno can be
more expensive than people realize.

Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How to get GCC on par with ICC?
  2018-06-15 17:03 ` Jeff Law
@ 2018-06-15 18:01   ` Joseph Myers
  0 siblings, 0 replies; 20+ messages in thread
From: Joseph Myers @ 2018-06-15 18:01 UTC (permalink / raw)
  To: Jeff Law
  Cc: Wilco Dijkstra, mjambor, gcc, nd, Steve Ellcey, Richard Biener,
	pmenzel+gcc.gnu.org

On Fri, 15 Jun 2018, Jeff Law wrote:

> And resolution on -fno-math-errno as the default.  Setting errno can be
> more expensive than people realize.

I don't think I saw any version of the -fno-math-errno patch proposal that 
included the testsuite updates I'd expect.  Certainly 
gcc.dg/torture/pr68264.c tests libm functions setting errno and would need 
to use -fmath-errno explicitly, but it seems likely there are other tests 
involving built-in functions that in fact only test what they're intended 
to test given -fmath-errno; tests using libm functions without explicit 
-ffast-math / -fmath-errno / -fno-math-errno would need review (and there 
should be new tests for optimizations that are only valid given 
-fno-math-errno).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-06-22 11:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
2018-06-06 16:14 ` Joel Sherrill
2018-06-06 16:20   ` Paul Menzel
2018-06-20 22:42   ` NightStrike
2018-06-21  9:20     ` Richard Biener
2018-06-22  0:48     ` Steve Ellcey
2018-06-06 16:22 ` Bin.Cheng
2018-06-06 18:31 ` Dmitry Mikushin
2018-06-06 21:10   ` Ryan Burn
2018-06-07 10:02     ` Richard Biener
2018-06-06 22:43   ` Zan Lynx
2018-06-07  9:54     ` Richard Biener
2018-06-07 10:06 ` Richard Biener
2018-06-08 22:08   ` Steve Ellcey
2018-06-09 15:32     ` Marc Glisse
2018-06-11 14:50     ` Martin Jambor
2018-06-22 22:41       ` Szabolcs Nagy
2018-06-15 11:48 Wilco Dijkstra
2018-06-15 17:03 ` Jeff Law
2018-06-15 18:01   ` Joseph Myers
