* How to get GCC on par with ICC?
@ 2018-06-06 15:57 Paul Menzel
2018-06-06 16:14 ` Joel Sherrill
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Paul Menzel @ 2018-06-06 15:57 UTC (permalink / raw)
To: gcc
Dear GCC folks,
Some scientists in our organization still want to use the Intel
compiler because, they say, it produces faster code, which is then
executed on clusters. Some resources on the Web [1][2] confirm this.
(I am aware that it depends heavily on the actual program.)
My question is: is it realistic that GCC could catch up, and that the
scientists would start to use it instead of Intel’s compiler? Or will Intel
developers always have the lead, because they have secret documentation
and direct contact with the processor designers?
If it is realistic, how can we get there? Would the program have to be
written first, and the compiler then optimized for it? Or are just more
GCC developers needed?
Kind regards,
Paul
[1]: https://colfaxresearch.com/compiler-comparison/
[2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf
* Re: How to get GCC on par with ICC?
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
@ 2018-06-06 16:14 ` Joel Sherrill
2018-06-06 16:20 ` Paul Menzel
2018-06-20 22:42 ` NightStrike
2018-06-06 16:22 ` Bin.Cheng
` (2 subsequent siblings)
3 siblings, 2 replies; 20+ messages in thread
From: Joel Sherrill @ 2018-06-06 16:14 UTC (permalink / raw)
To: Paul Menzel; +Cc: gcc
On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler,
> as they say, it produces faster code, which is then executed on clusters.
> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> heavily dependent on the actual program.)
>
Do they have specific examples where icc is better for them? Or can point
to specific GCC PRs which impact them?
GCC versions?
Are there specific CPU model variants of concern?
What flags are used to compile? Sometimes a bit of advice can produce
improvements.
Without specific examples, it is hard to set goals.
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
>
For sure examples are needed so there are test cases to use for reference.
If you want anything improved in any free software project, sponsoring
developers is always a good thing. If you sponsor the right developers. :)
I'm not discouraging you. I'm just trying to turn this into something
actionable.
--joel sherrill
* Re: How to get GCC on par with ICC?
2018-06-06 16:14 ` Joel Sherrill
@ 2018-06-06 16:20 ` Paul Menzel
2018-06-20 22:42 ` NightStrike
1 sibling, 0 replies; 20+ messages in thread
From: Paul Menzel @ 2018-06-06 16:20 UTC (permalink / raw)
To: Joel Sherrill; +Cc: gcc
Dear Joel,
Thank you for your quick reply.
On 06/06/18 17:57, Joel Sherrill wrote:
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel wrote:
>> Some scientists in our organization still want to use the Intel compiler,
>> as they say, it produces faster code, which is then executed on clusters.
>> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
>> heavily dependent on the actual program.)
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.
I could get such examples, but it will take some time, as it’s from
other institutes.
The clusters use exclusively Intel processors. (Hopefully, that will
change.)
I also found the article from the German Linux-Magazin in an English
version at the ADMIN Magazin [3]. The German article made a stronger
statement: that they use the Intel compilers for performance reasons.
>> My question is, is it realistic, that GCC could catch up and that the
>> scientists will start to use it over Intel’s compiler? Or will Intel
>> developers always have the lead, because they have secret documentation and
>> direct contact with the processor designers?
>>
>> If it is realistic, how can we get there? Would first the program be
>> written, and then the compiler be optimized for that? Or are just more GCC
>> developers needed?
>
> For sure examples are needed so there are test cases to use for reference.
>
> If you want anything improved in any free software project, sponsoring
> developers is always a good thing. If you sponsor the right developers. :)
That’s what I hoped for, but didn’t ask here. If you could point me to a
list of possible contractors, that would be great.
Please keep in mind, that in my organization certain decisions are made
*very* slowly. I’ll try to get answers quickly, but procuring finances
might take longer (half a year or much longer).
Kind regards,
Paul
>> [1]: https://colfaxresearch.com/compiler-comparison/
>> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf
[3]: http://www.admin-magazine.com/HPC/Articles/Selecting-Compilers-for-a-Supercomputer ("HPC Compilers")
* Re: How to get GCC on par with ICC?
2018-06-06 16:14 ` Joel Sherrill
2018-06-06 16:20 ` Paul Menzel
@ 2018-06-20 22:42 ` NightStrike
2018-06-21 9:20 ` Richard Biener
2018-06-22 0:48 ` Steve Ellcey
1 sibling, 2 replies; 20+ messages in thread
From: NightStrike @ 2018-06-20 22:42 UTC (permalink / raw)
To: joel; +Cc: Paul Menzel, gcc
On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill <joel@rtems.org> wrote:
>
> On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> > Dear GCC folks,
> >
> >
> > Some scientists in our organization still want to use the Intel compiler,
> > as they say, it produces faster code, which is then executed on clusters.
> > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > heavily dependent on the actual program.)
> >
>
> Do they have specific examples where icc is better for them? Or can point
> to specific GCC PRs which impact them?
>
>
> GCC versions?
>
> Are there specific CPU model variants of concern?
>
> What flags are used to compile? Some times a bit of advice can produce
> improvements.
>
> Without specific examples, it is hard to set goals.
If I could perhaps jump in here for a moment... Just today I hit upon
a series of small (in lines of code) loops that gcc can't vectorize,
and intel vectorizes like a madman. They all involve a lot of heavy
use of std::vector<std::vector<float>>. Comparisons were with gcc
8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
as SCHED_FIFO, mlockall, affinity set to its own core, and all
interrupts vectored off that core. So, as close to not-noisy as
possible.
I was surprised at the results, but using each compiler's methods of
dumping vectorization info, intel wins on two points:
1) It actually vectorizes
2) Its vectorization output is much more easily readable
Options were:
gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native
vs:
icc -Ofast -std=gnu++14
So, not exactly exact, but pretty close.
So here's an example of a chunk of code (not very readable, sorry
about that) that intel can vectorize, and subsequently make about 50%
faster:
    std::size_t nLayers { input.nn.size() };
    //std::size_t ySize = std::max_element(input.nn.cbegin(), input.nn.cend(),
    //    [](auto a, auto b){ return a.size() < b.size(); })->size();
    std::size_t ySize = 0;
    for (auto const & nn: input.nn)
        ySize = std::max(ySize, nn.size());
    float yNorm[ySize];
    for (auto & y: yNorm)
        y = 0.0f;
    for (std::size_t i = 0; i < xSize; ++i)
        yNorm[i] = xNorm[i];
    for (std::size_t layer = 0; layer < nLayers; ++layer) {
        auto & nn = input.nn[layer];
        auto & b = nn.back();
        float y[ySize];
        for (std::size_t i = 0; i < nn[0].size(); ++i) {
            y[i] = b[i];
            for (std::size_t j = 0; j < nn.size() - 1; ++j)
                y[i] += nn.at(j).at(i) * yNorm[j];
        }
        for (std::size_t i = 0; i < ySize; ++i) {
            if (layer < nLayers - 1)
                y[i] = std::max(y[i], 0.0f);
            yNorm[i] = y[i];
        }
    }
If I was better at godbolt, I could show the asm, but I'm not. I'm
willing to learn, though.
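A fragment like the one above is easier to act on as a self-contained
translation unit. The following is only a sketch of what that could look
like. The names Layer and forward, and the assumed layout (one matrix per
layer, weight rows first, bias as the last row), are hypothetical and
inferred from the fragment, not taken from the real program. The two inner
loops are also interchanged so that the innermost loop walks one row
contiguously; the order of additions per output is unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical layout: rows 0..n-2 hold weights (row j = weights leaving
// input j), the last row holds the bias of each output.
using Layer = std::vector<std::vector<float>>;

std::vector<float> forward(const std::vector<Layer>& layers,
                           const std::vector<float>& x) {
    std::vector<float> in = x;
    for (std::size_t l = 0; l < layers.size(); ++l) {
        const Layer& nn = layers[l];
        assert(!nn.empty());
        const std::size_t nIn = nn.size() - 1;
        const std::size_t nOut = nn[0].size();
        assert(in.size() >= nIn && nn.back().size() == nOut);

        // Start from the bias, then accumulate one input row at a time.
        std::vector<float> out(nn.back().begin(), nn.back().end());
        for (std::size_t j = 0; j < nIn; ++j) {
            const float* w = nn[j].data();
            const float xj = in[j];
            for (std::size_t i = 0; i < nOut; ++i)  // unit-stride inner loop
                out[i] += w[i] * xj;
        }
        // ReLU on every layer except the last, as in the fragment.
        if (l + 1 < layers.size())
            for (float& v : out)
                v = std::max(v, 0.0f);
        in = std::move(out);
    }
    return in;
}
```

Compiling such a file with gcc and -fopt-info-vec-missed prints the loops
the vectorizer gave up on and why, which is the kind of detail a bug report
needs. Whether GCC vectorizes this particular form has not been checked here.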
* Re: How to get GCC on par with ICC?
2018-06-20 22:42 ` NightStrike
@ 2018-06-21 9:20 ` Richard Biener
2018-06-22 0:48 ` Steve Ellcey
1 sibling, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-21 9:20 UTC (permalink / raw)
To: NightStrike; +Cc: joel, pmenzel+gcc.gnu.org, GCC Development
On Wed, Jun 20, 2018 at 11:12 PM NightStrike <nightstrike@gmail.com> wrote:
>
> On Wed, Jun 6, 2018 at 11:57 AM, Joel Sherrill <joel@rtems.org> wrote:
> >
> > On Wed, Jun 6, 2018 at 10:51 AM, Paul Menzel <
> > pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
> >
> > > Dear GCC folks,
> > >
> > >
> > > Some scientists in our organization still want to use the Intel compiler,
> > > as they say, it produces faster code, which is then executed on clusters.
> > > Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> > > heavily dependent on the actual program.)
> > >
> >
> > Do they have specific examples where icc is better for them? Or can point
> > to specific GCC PRs which impact them?
> >
> >
> > GCC versions?
> >
> > Are there specific CPU model variants of concern?
> >
> > What flags are used to compile? Some times a bit of advice can produce
> > improvements.
> >
> > Without specific examples, it is hard to set goals.
>
> If I could perhaps jump in here for a moment... Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman. They all involve a lot of heavy
> use of std::vector<std::vector<float>>. Comparisons were with gcc
Ick - C++ ;)
> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core. So, as close to not-noisy as
> possible.
>
> I was surprised at the results results, but using each compiler's methods of
> dumping vectorization info, intel wins on two points:
>
> 1) It actually vectorizes
> 2) It's vectorizing output is much more easily readable
>
> Options were:
>
> gcc -Wall -ggdb3 -std=gnu++17 -flto -Ofast -march=native
>
> vs:
>
> icc -Ofast -std=gnu++14
>
>
> So, not exactly exact, but pretty close.
>
>
> So here's an example of a chunk of code (not very readable, sorry
> about that) that intel can vectorize, and subsequently make about 50%
> faster:
>
> std::size_t nLayers { input.nn.size() };
> //std::size_t ySize = std::max_element(input.nn.cbegin(), input.nn.cend(),
> //    [](auto a, auto b){ return a.size() < b.size(); })->size();
> std::size_t ySize = 0;
> for (auto const & nn: input.nn)
> ySize = std::max(ySize, nn.size());
>
> float yNorm[ySize];
> for (auto & y: yNorm)
> y = 0.0f;
> for (std::size_t i = 0; i < xSize; ++i)
> yNorm[i] = xNorm[i];
> for (std::size_t layer = 0; layer < nLayers; ++layer) {
> auto & nn = input.nn[layer];
> auto & b = nn.back();
> float y[ySize];
> for (std::size_t i = 0; i < nn[0].size(); ++i) {
> y[i] = b[i];
> for (std::size_t j = 0; j < nn.size() - 1; ++j)
> y[i] += nn.at(j).at(i) * yNorm[j];
> }
> for (std::size_t i = 0; i < ySize; ++i) {
> if (layer < nLayers - 1)
> y[i] = std::max(y[i], 0.0f);
> yNorm[i] = y[i];
> }
> }
>
>
> If I was better at godbolt, I could show the asm, but I'm not. I'm
> willing to learn, though.
A compilable testcase would be more useful - just file a bugzilla.
Richard.
* Re: How to get GCC on par with ICC?
2018-06-20 22:42 ` NightStrike
2018-06-21 9:20 ` Richard Biener
@ 2018-06-22 0:48 ` Steve Ellcey
1 sibling, 0 replies; 20+ messages in thread
From: Steve Ellcey @ 2018-06-22 0:48 UTC (permalink / raw)
To: NightStrike, joel; +Cc: Paul Menzel, gcc
On Wed, 2018-06-20 at 17:11 -0400, NightStrike wrote:
>
> If I could perhaps jump in here for a moment... Just today I hit upon
> a series of small (in lines of code) loops that gcc can't vectorize,
> and intel vectorizes like a madman. They all involve a lot of heavy
> use of std::vector<std::vector<float>>. Comparisons were with gcc
> 8.1, intel 2018.u1, an AMD Opteron 6386 SE, with the program running
> as sched_FIFO, mlockall, affinity set to its own core, and all
> interrupts vectored off that core. So, as close to not-noisy as
> possible.
There are quite a number of bugzilla reports with examples where GCC
does not vectorize a loop. I wonder if this example is related to PR
61247.
Steve Ellcey
* Re: How to get GCC on par with ICC?
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
2018-06-06 16:14 ` Joel Sherrill
@ 2018-06-06 16:22 ` Bin.Cheng
2018-06-06 18:31 ` Dmitry Mikushin
2018-06-07 10:06 ` Richard Biener
3 siblings, 0 replies; 20+ messages in thread
From: Bin.Cheng @ 2018-06-06 16:22 UTC (permalink / raw)
To: Paul Menzel; +Cc: GCC Development
On Wed, Jun 6, 2018 at 3:51 PM, Paul Menzel
<pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel compiler, as
> they say, it produces faster code, which is then executed on clusters. Some
> resources on the Web [1][2] confirm this. (I am aware, that it’s heavily
> dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation and
> direct contact with the processor designers?
>
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more GCC
> developers needed?
There are developers actually working on performance optimization in
GCC, so you are not the only one :). As an open-source compiler we do
lack resources, so more developers are always good for the project. As
Joel pointed out, typical/reduced workloads showing the performance gap
are very important for our developers as well as for attracting new
developers. We can probably open a meta-bug for tracking if you have
many of these example workloads.
Thanks,
bin
* Re: How to get GCC on par with ICC?
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
2018-06-06 16:14 ` Joel Sherrill
2018-06-06 16:22 ` Bin.Cheng
@ 2018-06-06 18:31 ` Dmitry Mikushin
2018-06-06 21:10 ` Ryan Burn
2018-06-06 22:43 ` Zan Lynx
2018-06-07 10:06 ` Richard Biener
3 siblings, 2 replies; 20+ messages in thread
From: Dmitry Mikushin @ 2018-06-06 18:31 UTC (permalink / raw)
To: Paul Menzel; +Cc: GCC
Dear Paul,
The opinion you've mentioned is common in the scientific community. However,
on closer inspection it often turns out that the set of GCC compiler options
used simply does not correspond to the "fast" configuration of Intel. For
instance, "-O3" for Intel actually corresponds to (at least) "-O3
-ffast-math -march=native" for GCC. Omitting "-ffast-math" obviously
introduces a significant performance gap.
Kind regards,
- Dmitry Mikushin | Applied Parallel Computing LLC |
https://parallel-computing.pro
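One reason "-ffast-math" matters so much for vectorization is that
floating-point addition is not associative, so without permission to
reassociate, a compiler has to keep a sum in strict source order. The
sketch below is an illustration only; the function names are made up. It
contrasts an in-order sum with the four-partial-sums shape that a 4-lane
vectorized reduction typically takes.

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right sum: the order the source code prescribes.
float sum_in_order(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

// Four independent partial sums combined at the end. This reorders the
// additions, which is only allowed when reassociation is permitted.
float sum_four_lanes(const float* v, std::size_t n) {
    assert(n % 4 == 0);  // keep the sketch free of remainder handling
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += v[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For an input such as {1e8f, 1.0f, -1e8f, 1.0f}, the two functions can
return different values on an IEEE 754 target built without fast-math,
because the small terms are absorbed at different points. That difference
is what the strict default protects, and what "-ffast-math" gives up.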
* Re: How to get GCC on par with ICC?
2018-06-06 18:31 ` Dmitry Mikushin
@ 2018-06-06 21:10 ` Ryan Burn
2018-06-07 10:02 ` Richard Biener
2018-06-06 22:43 ` Zan Lynx
1 sibling, 1 reply; 20+ messages in thread
From: Ryan Burn @ 2018-06-06 21:10 UTC (permalink / raw)
To: Dmitry Mikushin; +Cc: Paul Menzel, GCC
One case where ICC can generate much faster code sometimes is by using
the nontemporal pragma [https://software.intel.com/en-us/node/524559]
with loops.
AFAIK, there's no such equivalent pragma in gcc
[https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].
When I tried this simple example
https://github.com/rnburn/square_timing/blob/master/bench.cpp that
measures times for this loop:
void compute(const double* x, index_t N, double* y) {
#pragma vector nontemporal
for(index_t i=0; i<N; ++i) y[i] = x[i]*x[i];
}
with and without nontemporal I got these times (N = 1,000,000)
Temporal 1,042,080
Non-Temporal 538,842
So running with the non-temporal pragma was nearly twice as fast.
An equivalent non-temporal pragma for GCC would, IMO, certainly be a
very good feature to add.
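Until such a pragma exists, non-temporal stores for the loop above can be
written by hand with the SSE2 intrinsics that GCC ships in <emmintrin.h>.
The following is a sketch under assumptions, not an equivalent of the ICC
pragma: the name square_stream is made up, the destination is assumed to be
16-byte aligned, and no timing is claimed for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// y[i] = x[i] * x[i], writing y with streaming (cache-bypassing) stores
// where SSE2 is available, and with ordinary stores otherwise.
void square_stream(const double* x, std::size_t n, double* y) {
#if defined(__SSE2__)
    // _mm_stream_pd requires a 16-byte aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(y) % 16 == 0);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);
        _mm_stream_pd(y + i, _mm_mul_pd(v, v));
    }
    for (; i < n; ++i)  // scalar tail for odd n
        y[i] = x[i] * x[i];
    _mm_sfence();  // order the streaming stores before later accesses
#else
    for (std::size_t i = 0; i < n; ++i)
        y[i] = x[i] * x[i];
#endif
}
```

This ties the source to one instruction set, which is exactly the
inconvenience a loop pragma would avoid.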
* Re: How to get GCC on par with ICC?
2018-06-06 21:10 ` Ryan Burn
@ 2018-06-07 10:02 ` Richard Biener
0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-07 10:02 UTC (permalink / raw)
To: rnickb731; +Cc: dmitry, pmenzel+gcc.gnu.org, GCC Development
On Wed, Jun 6, 2018 at 8:31 PM Ryan Burn <rnickb731@gmail.com> wrote:
>
> One case where ICC can generate much faster code sometimes is by using
> the nontemporal pragma [https://software.intel.com/en-us/node/524559]
> with loops.
>
> AFAIK, there's no such equivalent pragma in gcc
> [https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].
>
> When I tried this simple example
> https://github.com/rnburn/square_timing/blob/master/bench.cpp that
> measures times for this loop:
>
> void compute(const double* x, index_t N, double* y) {
> #pragma vector nontemporal
> for(index_t i=0; i<N; ++i) y[i] = x[i]*x[i];
> }
>
> with and without nontemporal I got these times (N = 1,000,000)
>
> Temporal 1,042,080
> Non-Temporal 538,842
>
> So running with the non-temporal pragma was nearly twice as fast.
>
> An equivalent non-temporal pragma for GCC would, IMO, certainly be a
> very good feature to add.
GCC has robust infrastructure for loop pragmas now; it is just that the set of
pragmas available isn't very big. It would be interesting to know which ICC ones
people use regularly so we can support those in GCC as well.
Note that using #pragmas is very much hand-optimizing the code for the compiler
you use - something that is possible for GCC as well.
Richard.
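As an illustration of a loop pragma GCC already has, "#pragma GCC ivdep"
tells the vectorizer to assume the following loop has no loop-carried
dependences. The sketch below uses a made-up function, shift_add, and
guards the pragma so other compilers simply skip it.

```cpp
#include <cassert>
#include <cstddef>

// a[i] = a[i + k] + 1 for i in [0, n). The compiler cannot know k, so it
// cannot rule out overlap on its own; the pragma passes on the caller's
// promise that the read and write ranges are independent.
void shift_add(float* a, std::size_t n, std::size_t k) {
    assert(k >= n);  // the promise: [0, n) and [k, k + n) do not overlap
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i + k] + 1.0f;
}
```

As with the ICC pragmas, the promise is the programmer's responsibility:
if it is false, the vectorized loop may compute something different from
the scalar one.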
* Re: How to get GCC on par with ICC?
2018-06-06 18:31 ` Dmitry Mikushin
2018-06-06 21:10 ` Ryan Burn
@ 2018-06-06 22:43 ` Zan Lynx
2018-06-07 9:54 ` Richard Biener
1 sibling, 1 reply; 20+ messages in thread
From: Zan Lynx @ 2018-06-06 22:43 UTC (permalink / raw)
To: Paul Menzel; +Cc: gcc
On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> The opinion you've mentioned is common in scientific community. However, in
> more detail it often surfaces that the used set of GCC compiler options
> simply does not correspond to that "fast" version of Intel. For instance,
> when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> introduces significant performance gap.
>
Please note that if your compute cluster uses different models of CPU,
be extremely careful with -march=native.
I've been bitten by it in VMs, several times. Unless you always run on
the same system that did the build, you are running a risk of illegal
instructions.
--
Knowledge is Power -- Power Corrupts
Study Hard -- Be Evil
* Re: How to get GCC on par with ICC?
2018-06-06 22:43 ` Zan Lynx
@ 2018-06-07 9:54 ` Richard Biener
0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2018-06-07 9:54 UTC (permalink / raw)
To: zlynx; +Cc: pmenzel+gcc.gnu.org, GCC Development
On Wed, Jun 6, 2018 at 11:10 PM Zan Lynx <zlynx@acm.org> wrote:
>
> On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
>
> Please note that if your compute cluster uses different models of CPU,
> be extremely careful with -march=native.
>
> I've been bitten by it in VMs, several times. Unless you always run on
> the same system that did the build, you are running a risk of illegal
> instructions.
Yes. Note this is where ICC has an advantage because it supports
automagically doing runtime versioning based on the CPU instruction
set for vectorized loops. We only support that in an awkward
explicit way (the manual talks about this in the 'Function Multiversioning'
section).
But in the end it's just a "detail" that can be worked around with
a little inconvenience ;) (I've yet to see a heterogeneous cluster
where the instruction set differences make a performance difference
over choosing the lowest common one)
Richard.
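A minimal sketch of the explicit route in GCC, using the target_clones
attribute described in the 'Function Multiversioning' part of the manual.
The function saxpy and the macro MULTIVERSIONED are made up for the
example, and the guard assumes an x86-64 glibc system, since the mechanism
relies on ifunc support.

```cpp
#include <cassert>
#include <cstddef>

// Checked after the includes so that __GLIBC__ is visible if defined.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && \
    defined(__linux__) && defined(__GLIBC__)
// GCC emits one clone per listed target plus a resolver that picks a
// clone at load time based on the CPU the program is running on.
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

MULTIVERSIONED
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

The rest of the program can then be built for the lowest common
instruction set of the cluster instead of with -march=native, and only
annotated hot functions get the wider variants.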
* Re: How to get GCC on par with ICC?
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
` (2 preceding siblings ...)
2018-06-06 18:31 ` Dmitry Mikushin
@ 2018-06-07 10:06 ` Richard Biener
2018-06-08 22:08 ` Steve Ellcey
3 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2018-06-07 10:06 UTC (permalink / raw)
To: pmenzel+gcc.gnu.org; +Cc: GCC Development
On Wed, Jun 6, 2018 at 5:52 PM Paul Menzel
<pmenzel+gcc.gnu.org@molgen.mpg.de> wrote:
>
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel
> compiler, as they say, it produces faster code, which is then executed
> on clusters. Some resources on the Web [1][2] confirm this. (I am aware,
> that it’s heavily dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation
> and direct contact with the processor designers?
They will of course have an edge in timing when supporting a new architecture
because they have access to NDA material and hardware. For example the
OSS community doesn't yet have access to any AVX512 capable machine
(speaking of the GNU compile-farm), and those are prohibitively expensive
for a private contributor.
Similar stories apply to the access to proprietary benchmarks or simply
having resources to continuously work with folks in HPC to make sure ICC
works great for their codes.
> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more
> GCC developers needed?
I think a big part of the story is perception and training. This means that
for example a coherent and up-to-date source for information on how
to use GCC in an HPC environment (optimizing your code, recommended
compiler options, pitfalls to avoid, etc.) is desperately missing.
When we do our own comparisons of GCC vs. ICC on benchmarks
like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
(in fact it even trails in some benchmarks) unless you get to
"SPEC tricks" like data structure re-organization optimizations that
probably never apply in practice on real-world code (and people
should fix such things at the source level being pointed at them
via actually profiling their codes).
In my own experience which dates back nearly 15 years now ICC is
buggy (generates wrong-code / simulation results) and cannot compile
a "simple" C++ program ;) This made me start working on GCC.
Note that the very best strength of GCC is the first-class high-quality
(insert more buzzwords here) support infrastructure if you actually
run into issues with the compiler! Even when using paid ICC I never
got timely fixes (if at all) for wrong-code issues I reported to them!
I've separately replied to specific points in other posts where ICC has
an edge over GCC.
Richard.
* Re: How to get GCC on par with ICC?
2018-06-07 10:06 ` Richard Biener
@ 2018-06-08 22:08 ` Steve Ellcey
2018-06-09 15:32 ` Marc Glisse
2018-06-11 14:50 ` Martin Jambor
0 siblings, 2 replies; 20+ messages in thread
From: Steve Ellcey @ 2018-06-08 22:08 UTC (permalink / raw)
To: Richard Biener, pmenzel+gcc.gnu.org; +Cc: GCC Development
On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>
> When we do our own comparisons of GCC vs. ICC on benchmarks
> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
> (in fact it even trails in some benchmarks) unless you get to
> "SPEC tricks" like data structure re-organization optimizations that
> probably never apply in practice on real-world code (and people
> should fix such things at the source level being pointed at them
> via actually profiling their codes).
Richard,
I was wondering if you have any more details about these comparisions
you have done that you can share? Â Compiler versions, options used,
hardware, etc  Also, were there any tests that stood out in terms of
icc outperforming GCC?
I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
for gcc.
The int rate numbers (running 1 copy only) were not too bad, GCC was
only about 2% slower and only 525.x264_r seemed way slower with GCC.
The fp rate numbers (again only 1 copy) showed a larger difference,Â
around 20%.  521.wrf_r was more than twice as slow when compiled with
GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
significant slowdowns when compiled with GCC vs. ICC.
Steve Ellcey
sellcey@cavium.com
* Re: How to get GCC on par with ICC?
2018-06-08 22:08 ` Steve Ellcey
@ 2018-06-09 15:32 ` Marc Glisse
2018-06-11 14:50 ` Martin Jambor
1 sibling, 0 replies; 20+ messages in thread
From: Marc Glisse @ 2018-06-09 15:32 UTC (permalink / raw)
To: Steve Ellcey; +Cc: Richard Biener, pmenzel+gcc.gnu.org, GCC Development
On Fri, 8 Jun 2018, Steve Ellcey wrote:
> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>>
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).
>
> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share? Compiler versions, options used,
> hardware, etc. Also, were there any tests that stood out in terms of
> icc outperforming GCC?
>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.
You should use -Ofast for gcc. As mentioned earlier in the discussion,
ICC has some equivalent of -ffast-math by default.
> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference,
> around 20%. 521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.
--
Marc Glisse
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: How to get GCC on par with ICC?
2018-06-08 22:08 ` Steve Ellcey
2018-06-09 15:32 ` Marc Glisse
@ 2018-06-11 14:50 ` Martin Jambor
2018-06-22 22:41 ` Szabolcs Nagy
1 sibling, 1 reply; 20+ messages in thread
From: Martin Jambor @ 2018-06-11 14:50 UTC (permalink / raw)
To: sellcey, Richard Biener, pmenzel+gcc.gnu.org; +Cc: GCC Development
Hi Steve,
On Fri, Jun 08 2018, Steve Ellcey wrote:
> On Thu, 2018-06-07 at 12:01 +0200, Richard Biener wrote:
>>
>> When we do our own comparisons of GCC vs. ICC on benchmarks
>> like SPEC CPU 2006/2017 ICC doesn't have a big lead over GCC
>> (in fact it even trails in some benchmarks) unless you get to
>> "SPEC tricks" like data structure re-organization optimizations that
>> probably never apply in practice on real-world code (and people
>> should fix such things at the source level being pointed at them
>> via actually profiling their codes).
>
> Richard,
>
> I was wondering if you have any more details about these comparisons
> you have done that you can share? Compiler versions, options used,
> hardware, etc. Also, were there any tests that stood out in terms of
> icc outperforming GCC?
Mostly AMD Ryzen, GCC 8 vs ICC 18. We were comparing a few combinations
of options. When we compared ICC's and our -Ofast (with or without
native GCC march/mtune and a set of ICC options that hopefully generate
the best code for Ryzen), we found out that without LTO/IPO, GCC is
actually slightly ahead of ICC on integer benchmarks (both SPEC 2006 and
2017).
Floating-point results were a more mixed bag (mostly because ICC
performed surprisingly poorly without IPO on a few) but at least on SPEC
2017, they were clearly better... with a caveat, see below my comment
about wrf.
With LTO/IPO, ICC can perform a few memory-reorg tricks that push them
quite a bit ahead of us but I'm not convinced they can perform these
transformations on much source code that happens not to be a well known
benchmark. So I'd recommend always looking at non-IPO numbers too.
>
> I did a compare of SPEC 2017 rate using GCC 8.* (pre release) and
> a recent ICC (2018.0.128?) on my desktop (Xeon CPU E5-1650 v4).
> I used '-xHost -O3' for icc and '-march=native -mtune=native -O3'
> for gcc.
Please try with -Ofast too. The main reason is that -O3 does not imply
-ffast-math and the performance gain from it is often very big (and I
suspect the 525.x264_r difference is because of that). Alternatively,
if your own workloads require high-precision floating-point math, you
have to force ICC to use it to get a fair comparison. -Ofast also turns
on -fno-protect-parens and -fstack-arrays that also help a few
benchmarks a lot but note that you may need to set a large stack ulimit
for them not to crash (but ICC does the same thing, as far as we know).
>
> The int rate numbers (running 1 copy only) were not too bad, GCC was
> only about 2% slower and only 525.x264_r seemed way slower with GCC.
> The fp rate numbers (again only 1 copy) showed a larger difference,
> around 20%. 521.wrf_r was more than twice as slow when compiled with
> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
> significant slowdowns when compiled with GCC vs. ICC.
>
Keep in mind that when discussing FP benchmarks, the used math library
can be (almost) as important as the compiler. In the case of 481.wrf,
we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
performance is about 70% of ICC's. When we just linked against AMD's
libm, we got to 83%. When we instructed GCC to generate calls to Intel's
SVML library and linked against it, we got to 91%. Using both SVML and
AMD's libm, we achieved 93%.
That means that there likely still is 7% to be gained from more clever
optimizations in GCC but the real problem is in GNU libm. And 481.wrf
is perhaps the most extreme example but definitely not the only one.
Martin
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: How to get GCC on par with ICC?
2018-06-11 14:50 ` Martin Jambor
@ 2018-06-22 22:41 ` Szabolcs Nagy
0 siblings, 0 replies; 20+ messages in thread
From: Szabolcs Nagy @ 2018-06-22 22:41 UTC (permalink / raw)
To: Martin Jambor, sellcey, Richard Biener, pmenzel+gcc.gnu.org
Cc: nd, GCC Development
On 11/06/18 11:05, Martin Jambor wrote:
>> The int rate numbers (running 1 copy only) were not too bad, GCC was
>> only about 2% slower and only 525.x264_r seemed way slower with GCC.
>> The fp rate numbers (again only 1 copy) showed a larger difference,
>> around 20%. 521.wrf_r was more than twice as slow when compiled with
>> GCC instead of ICC and 503.bwaves_r and 510.parest_r also showed
>> significant slowdowns when compiled with GCC vs. ICC.
>>
>
> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler. In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's. When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%. Using both SVML and
> AMD's libm, we achieved 93%.
>
i think glibc 2.27 should outperform amd's libm on wrf
(since i upstreamed the single precision code from
https://github.com/ARM-software/optimized-routines/ )
the 83% -> 93% diff is because gcc fails to vectorize
math calls in fortran to libmvec calls.
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm. And 481.wrf
> is perhaps the most extreme example but definitely not the only one.
there is no longer a problem in gnu libm for the most
common single precision calls and if things go well
then glibc 2.28 will get double precision improvements
too.
but gcc has to learn how to use libmvec in fortran.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: How to get GCC on par with ICC?
@ 2018-06-15 11:48 Wilco Dijkstra
2018-06-15 17:03 ` Jeff Law
0 siblings, 1 reply; 20+ messages in thread
From: Wilco Dijkstra @ 2018-06-15 11:48 UTC (permalink / raw)
To: mjambor; +Cc: gcc, nd, Steve Ellcey, Richard Biener, pmenzel+gcc.gnu.org
Martin wrote:
> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler. In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's. When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%. Using both SVML and
> AMD's libm, we achieved 93%.
>
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm. And 481.wrf
> is perhaps the most extreme example but definitely not the only one.
You really should retry with GLIBC 2.27 since several key math functions were
rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in huge
performance gains on all targets (eg. wrf improved over 50%).
I fixed several double precision functions in current GLIBC to avoid extremely bad
performance which had been complained about for years. There are more math
functions on the way, so the GNU libm will not only catch up, but become the fastest
math library available.
Wilco
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: How to get GCC on par with ICC?
2018-06-15 11:48 Wilco Dijkstra
@ 2018-06-15 17:03 ` Jeff Law
2018-06-15 18:01 ` Joseph Myers
0 siblings, 1 reply; 20+ messages in thread
From: Jeff Law @ 2018-06-15 17:03 UTC (permalink / raw)
To: Wilco Dijkstra, mjambor
Cc: gcc, nd, Steve Ellcey, Richard Biener, pmenzel+gcc.gnu.org
On 06/15/2018 05:39 AM, Wilco Dijkstra wrote:
> Martin wrote:
>
>> Keep in mind that when discussing FP benchmarks, the used math library
>> can be (almost) as important as the compiler. In the case of 481.wrf,
>> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
>> performance is about 70% of ICC's. When we just linked against AMD's
>> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
>> SVML library and linked against it, we got to 91%. Using both SVML and
>> AMD's libm, we achieved 93%.
>>
>> That means that there likely still is 7% to be gained from more clever
>> optimizations in GCC but the real problem is in GNU libm. And 481.wrf
>> is perhaps the most extreme example but definitely not the only one.
>
> You really should retry with GLIBC 2.27 since several key math functions were
> rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in huge
> performance gains on all targets (eg. wrf improved over 50%).
>
> I fixed several double precision functions in current GLIBC to avoid extremely bad
> performance which had been complained about for years. There are more math
> functions on the way, so the GNU libm will not only catch up, but become the fastest
> math library available.
And resolution on -fno-math-errno as the default. Setting errno can be
more expensive than people realize.
Jeff
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: How to get GCC on par with ICC?
2018-06-15 17:03 ` Jeff Law
@ 2018-06-15 18:01 ` Joseph Myers
0 siblings, 0 replies; 20+ messages in thread
From: Joseph Myers @ 2018-06-15 18:01 UTC (permalink / raw)
To: Jeff Law
Cc: Wilco Dijkstra, mjambor, gcc, nd, Steve Ellcey, Richard Biener,
pmenzel+gcc.gnu.org
On Fri, 15 Jun 2018, Jeff Law wrote:
> And resolution on -fno-math-errno as the default. Setting errno can be
> more expensive than people realize.
I don't think I saw any version of the -fno-math-errno patch proposal that
included the testsuite updates I'd expect. Certainly
gcc.dg/torture/pr68264.c tests libm functions setting errno and would need
to use -fmath-errno explicitly, but it seems likely there are other tests
involving built-in functions that in fact only test what they're intended
to test given -fmath-errno; tests using libm functions without explicit
-ffast-math / -fmath-errno / -fno-math-errno would need review (and there
should be new tests for optimizations that are only valid given
-fno-math-errno).
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2018-06-22 11:03 UTC | newest]
Thread overview: 20+ messages
2018-06-06 15:57 How to get GCC on par with ICC? Paul Menzel
2018-06-06 16:14 ` Joel Sherrill
2018-06-06 16:20 ` Paul Menzel
2018-06-20 22:42 ` NightStrike
2018-06-21 9:20 ` Richard Biener
2018-06-22 0:48 ` Steve Ellcey
2018-06-06 16:22 ` Bin.Cheng
2018-06-06 18:31 ` Dmitry Mikushin
2018-06-06 21:10 ` Ryan Burn
2018-06-07 10:02 ` Richard Biener
2018-06-06 22:43 ` Zan Lynx
2018-06-07 9:54 ` Richard Biener
2018-06-07 10:06 ` Richard Biener
2018-06-08 22:08 ` Steve Ellcey
2018-06-09 15:32 ` Marc Glisse
2018-06-11 14:50 ` Martin Jambor
2018-06-22 22:41 ` Szabolcs Nagy
2018-06-15 11:48 Wilco Dijkstra
2018-06-15 17:03 ` Jeff Law
2018-06-15 18:01 ` Joseph Myers