public inbox for gcc@gcc.gnu.org
* Parallelize the compilation using Threads
@ 2018-11-15 10:12 Giuliano Augusto Faulin Belinassi
  2018-11-15 11:44 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Giuliano Augusto Faulin Belinassi @ 2018-11-15 10:12 UTC (permalink / raw)
  To: gcc, Richard Biener, Kernel USP, Alfredo Goldman, Alfredo Goldman

As a brief introduction, I am a graduate student who got interested
in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
I am a newcomer to GCC, but I have already sent some patches, some of
which have been accepted [2].

I brought this subject up in IRC, but maybe here is a proper place to
discuss this topic.

From my point of view, parallelizing GCC itself will only speed up the
compilation of projects which have a big file that creates a
bottleneck in the whole project compilation (note: by big, I mean the
amount of code to generate). Additionally, I know that GCC must not
change the project layout, but from the software engineering perspective,
this may be a bad smell that indicates that the file should be broken
into smaller files. Finally, the Makefiles will take care of the
parallelization task.

My questions are:

 1. Is there any project whose compilation would be significantly improved
if GCC ran in parallel? Does anyone have data about something related
to that? How about the Linux kernel? If not, I can try to bring some.

 2. Did I correctly understand the goal of the parallelization? Can
anyone provide extra details to me?

I am willing to turn my master’s thesis toward this and also apply to GSoC
2019 if it proves fruitful.

[1] https://gcc.gnu.org/wiki/SummerOfCode
[2] https://patchwork.ozlabs.org/project/gcc/list/?submitter=74682


Thanks


* Re: Parallelize the compilation using Threads
  2018-11-15 10:12 Parallelize the compilation using Threads Giuliano Augusto Faulin Belinassi
@ 2018-11-15 11:44 ` Richard Biener
  2018-11-15 15:54   ` Jonathan Wakely
                     ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Richard Biener @ 2018-11-15 11:44 UTC (permalink / raw)
  To: Giuliano Augusto Faulin Belinassi
  Cc: GCC Development, kernel-usp, gold, alfredo.goldman

On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> As a brief introduction, I am a graduate student who got interested
> in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> I am a newcomer to GCC, but I have already sent some patches, some of
> which have been accepted [2].
>
> I brought this subject up in IRC, but maybe here is a proper place to
> discuss this topic.
>
> From my point of view, parallelizing GCC itself will only speed up the
> compilation of projects which have a big file that creates a
> bottleneck in the whole project compilation (note: by big, I mean the
> amount of code to generate).

That's true.  During GCC bootstrap there are some of those (see PR84402).

One way to improve parallelism is to use link-time optimization where
even single source files can be split up into multiple link-time units.  But
then there's the serial whole-program analysis part.
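
(Concretely, that is what you get today with e.g. -flto=8
-flto-partition=balanced: functions from a single source file can be
spread over several LTRANS partitions and optimized in parallel, while
the WPA stage itself stays serial.)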

> Additionally, I know that GCC must not
> change the project layout, but from the software engineering perspective,
> this may be a bad smell that indicates that the file should be broken
> into smaller files. Finally, the Makefiles will take care of the
> parallelization task.

What do you mean by GCC must not change the project layout?  GCC
happily re-orders functions and link-time optimization will reorder
TUs (well, linking may as well).

> My questions are:
>
>  1. Is there any project whose compilation would be significantly improved
> if GCC ran in parallel? Does anyone have data about something related
> to that? How about the Linux kernel? If not, I can try to bring some.

We do not have any data about this apart from experiments with
splitting up source files for PR84402.

>  2. Did I correctly understand the goal of the parallelization? Can
> anyone provide extra details to me?

You may want to search the mailing list archives since we had a
student application (later revoked) for the task with some discussion.

In my view (I proposed the thing) the most interesting parts are
getting GCC's global state documented and reduced.  The parallelization
itself is an interesting experiment but whether there will be any
substantial improvement for builds that can already benefit from make
parallelism remains a question.
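
To make "reduced" concrete, the usual refactoring looks roughly like the
following sketch (the names are made up for illustration, this is not
actual GCC code): file-scope state moves into a context object that each
worker thread can own.

/* Before: file-scope state shared by the whole process.  */
static int next_uid;

int
get_next_uid (void)
{
  return next_uid++;  /* racy once two threads hand out uids */
}

/* After: the state lives in an explicit context, one per worker
   thread (or the field becomes thread-local).  */
struct compile_context
{
  int next_uid;
};

int
ctx_next_uid (struct compile_context *ctx)
{
  return ctx->next_uid++;  /* unshared, so no locking needed */
}

Much of the work is classifying each global: read-only after
initialization, per-function scratch, or truly shared.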

> I am willing to turn my master’s thesis toward this and also apply to GSoC
> 2019 if it proves fruitful.
>
> [1] https://gcc.gnu.org/wiki/SummerOfCode
> [2] https://patchwork.ozlabs.org/project/gcc/list/?submitter=74682
>
>
> Thanks


* Re: Parallelize the compilation using Threads
  2018-11-15 11:44 ` Richard Biener
@ 2018-11-15 15:54   ` Jonathan Wakely
  2018-11-15 18:07   ` Jeff Law
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Jonathan Wakely @ 2018-11-15 15:54 UTC (permalink / raw)
  To: Richard Guenther
  Cc: giuliano.belinassi, gcc, kernel-usp, gold, alfredo.goldman

On Thu, 15 Nov 2018 at 10:29, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> <giuliano.belinassi@usp.br> wrote:
> > Additionally, I know that GCC must not
> > change the project layout, but from the software engineering perspective,
> > this may be a bad smell that indicates that the file should be broken
> > into smaller files. Finally, the Makefiles will take care of the
> > parallelization task.
>
> What do you mean by GCC must not change the project layout?

I think this is in response to a comment I made on IRC. Giuliano said
that if a project has a very large file that dominates the total build
time, the file should be split up into smaller pieces. I said  "GCC
can't restructure people's code. it can only try to compile it
faster". We weren't referring to code transformations in the compiler
like re-ordering functions, but physically refactoring the source
code.


* Re: Parallelize the compilation using Threads
  2018-11-15 11:44 ` Richard Biener
  2018-11-15 15:54   ` Jonathan Wakely
@ 2018-11-15 18:07   ` Jeff Law
  2018-11-15 18:36   ` Szabolcs Nagy
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Jeff Law @ 2018-11-15 18:07 UTC (permalink / raw)
  To: Richard Biener, Giuliano Augusto Faulin Belinassi
  Cc: GCC Development, kernel-usp, gold, alfredo.goldman

On 11/15/18 3:29 AM, Richard Biener wrote:
> 
>>  2. Did I correctly understand the goal of the parallelization? Can
>> anyone provide extra details to me?
> 
> You may want to search the mailing list archives since we had a
> student application (later revoked) for the task with some discussion.
> 
> In my view (I proposed the thing) the most interesting parts are
> getting GCC's global state documented and reduced.  The parallelization
> itself is an interesting experiment but whether there will be any
> substantial improvement for builds that can already benefit from make
> parallelism remains a question.
Agreed.  Driving down the amount of global state is good in and of
itself.  It's also a prerequisite for parallelizing GCC itself using
threads.

I suspect driving down global state probably isn't that interesting for
a master's thesis though :-)

jeff


* Re: Parallelize the compilation using Threads
  2018-11-15 11:44 ` Richard Biener
  2018-11-15 15:54   ` Jonathan Wakely
  2018-11-15 18:07   ` Jeff Law
@ 2018-11-15 18:36   ` Szabolcs Nagy
  2018-11-16 14:25   ` Martin Jambor
  2018-11-16 22:40   ` Giuliano Augusto Faulin Belinassi
  4 siblings, 0 replies; 20+ messages in thread
From: Szabolcs Nagy @ 2018-11-15 18:36 UTC (permalink / raw)
  To: Richard Biener, Giuliano Augusto Faulin Belinassi
  Cc: nd, GCC Development, kernel-usp, gold, alfredo.goldman

On 15/11/18 10:29, Richard Biener wrote:
> In my view (I proposed the thing) the most interesting parts are
> getting GCC's global state documented and reduced.  The parallelization
> itself is an interesting experiment but whether there will be any
> substantial improvement for builds that can already benefit from make
> parallelism remains a question.

in the common case (project with many small files, much more than
core count) i'd expect a regression:

if gcc itself tries to parallelize that introduces inter thread
synchronization and potential false sharing in gcc (e.g. malloc
locks) that does not exist with make parallelism (glibc can avoid
some atomic instructions when a process is single threaded).
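
e.g. a minimal stand-alone sketch of the contention in question (my
illustration, not gcc code; build with gcc -O2 -pthread and compare one
process running N threads against N single-threaded processes):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

enum { ITERS = 10000000 };

static void *volatile sink;  /* volatile so the pairs are not elided */

/* each worker hammers malloc/free; once a second thread exists glibc
   drops its single-threaded fast path and pays for atomics and arena
   locks, overhead that make-level parallelism never sees.  */
static void *
worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < ITERS; i++)
    {
      sink = malloc (64);
      free (sink);
    }
  return NULL;
}

int
main (int argc, char **argv)
{
  int n = argc > 1 ? atoi (argv[1]) : 4;
  pthread_t tid[64];
  if (n < 1 || n > 64)
    return 1;
  for (int i = 0; i < n; i++)
    pthread_create (&tid[i], NULL, worker, NULL);
  for (int i = 0; i < n; i++)
    pthread_join (tid[i], NULL);
  printf ("%d threads done\n", n);
  return 0;
}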


* Re: Parallelize the compilation using Threads
  2018-11-15 11:44 ` Richard Biener
                     ` (2 preceding siblings ...)
  2018-11-15 18:36   ` Szabolcs Nagy
@ 2018-11-16 14:25   ` Martin Jambor
  2018-11-16 22:40   ` Giuliano Augusto Faulin Belinassi
  4 siblings, 0 replies; 20+ messages in thread
From: Martin Jambor @ 2018-11-16 14:25 UTC (permalink / raw)
  To: Giuliano Augusto Faulin Belinassi
  Cc: GCC Development, kernel-usp, gold, alfredo.goldman

Hi Giuliano,

On Thu, Nov 15 2018, Richard Biener wrote:
> You may want to search the mailing list archives since we had a
> student application (later revoked) for the task with some discussion.

Specifically, the whole thread beginning with
https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html

Martin


* Re: Parallelize the compilation using Threads
  2018-11-15 11:44 ` Richard Biener
                     ` (3 preceding siblings ...)
  2018-11-16 14:25   ` Martin Jambor
@ 2018-11-16 22:40   ` Giuliano Augusto Faulin Belinassi
  2018-11-19 14:36     ` Richard Biener
  4 siblings, 1 reply; 20+ messages in thread
From: Giuliano Augusto Faulin Belinassi @ 2018-11-16 22:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, Kernel USP, Alfredo Goldman, Alfredo Goldman

Hi! Sorry for the late reply again :P

On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> <giuliano.belinassi@usp.br> wrote:
> >
> > As a brief introduction, I am a graduate student who got interested
> > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > I am a newcomer to GCC, but I have already sent some patches, some of
> > which have been accepted [2].
> >
> > I brought this subject up in IRC, but maybe here is a proper place to
> > discuss this topic.
> >
> > From my point of view, parallelizing GCC itself will only speed up the
> > compilation of projects which have a big file that creates a
> > bottleneck in the whole project compilation (note: by big, I mean the
> > amount of code to generate).
>
> That's true.  During GCC bootstrap there are some of those (see PR84402).
>

> One way to improve parallelism is to use link-time optimization where
> even single source files can be split up into multiple link-time units.  But
> then there's the serial whole-program analysis part.

Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
That is a lot of data :-)

It seems that 'phase opt and generate' is the most time-consuming
part. Is that the 'GIMPLE optimization pipeline' you were talking
about in this thread:
https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html

> > Additionally, I know that GCC must not
> > change the project layout, but from the software engineering perspective,
> > this may be a bad smell that indicates that the file should be broken
> > into smaller files. Finally, the Makefiles will take care of the
> > parallelization task.
>
> What do you mean by GCC must not change the project layout?  GCC
> happily re-orders functions and link-time optimization will reorder
> TUs (well, linking may as well).
>

That was a response to a comment made on IRC:

On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>I think this is in response to a comment I made on IRC. Giuliano said
>that if a project has a very large file that dominates the total build
>time, the file should be split up into smaller pieces. I said  "GCC
>can't restructure people's code. it can only try to compile it
>faster". We weren't referring to code transformations in the compiler
>like re-ordering functions, but physically refactoring the source
>code.

Yes. But from one of the attachments from PR84402, it seems that such
files exist in GCC:
https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440

> > My questions are:
> >
> >  1. Is there any project whose compilation would be significantly improved
> > if GCC ran in parallel? Does anyone have data about something related
> > to that? How about the Linux kernel? If not, I can try to bring some.
>
> We do not have any data about this apart from experiments with
> splitting up source files for PR84402.
>
> >  2. Did I correctly understand the goal of the parallelization? Can
> > anyone provide extra details to me?
>
> You may want to search the mailing list archives since we had a
> student application (later revoked) for the task with some discussion.
>
> In my view (I proposed the thing) the most interesting parts are
> > getting GCC's global state documented and reduced.  The parallelization
> itself is an interesting experiment but whether there will be any
> substantial improvement for builds that can already benefit from make
> parallelism remains a question.

While I agree that documenting GCC's global states is good for the
community and the development of GCC, I really don't think this is a good
motivation for parallelizing a compiler from a research standpoint.
There must be something or someone that could take advantage of the
fine-grained parallelism. But that data from PR84402 seems to have the
answer to it. :-)


On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
>
> On 15/11/18 10:29, Richard Biener wrote:
> > In my view (I proposed the thing) the most interesting parts are
> > getting GCC's global state documented and reduced.  The parallelization
> > itself is an interesting experiment but whether there will be any
> > substantial improvement for builds that can already benefit from make
> > parallelism remains a question.
>
> in the common case (project with many small files, much more than
> core count) i'd expect a regression:
>
> if gcc itself tries to parallelize that introduces inter thread
> synchronization and potential false sharing in gcc (e.g. malloc
> locks) that does not exist with make parallelism (glibc can avoid
> some atomic instructions when a process is single threaded).

That is what I am mostly worried about. Or that the most costly part is
not parallelizable at all. Also, I would expect a regression on very small
files, which could probably be avoided by implementing this feature
behind a flag?

On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
>
> Hi Giuliano,
>
> On Thu, Nov 15 2018, Richard Biener wrote:
> > You may want to search the mailing list archives since we had a
> > student application (later revoked) for the task with some discussion.
>
> Specifically, the whole thread beginning with
> https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
>
> Martin
>

Yes, I will research this carefully ;-)

Thank you


* Re: Parallelize the compilation using Threads
  2018-11-16 22:40   ` Giuliano Augusto Faulin Belinassi
@ 2018-11-19 14:36     ` Richard Biener
  2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
                         ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Richard Biener @ 2018-11-19 14:36 UTC (permalink / raw)
  To: Giuliano Augusto Faulin Belinassi
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi! Sorry for the late reply again :P
>
> On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > <giuliano.belinassi@usp.br> wrote:
> > >
> > > As a brief introduction, I am a graduate student who got interested
> > > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > > I am a newcomer to GCC, but I have already sent some patches, some of
> > > which have been accepted [2].
> > >
> > > I brought this subject up in IRC, but maybe here is a proper place to
> > > discuss this topic.
> > >
> > > From my point of view, parallelizing GCC itself will only speed up the
> > > compilation of projects which have a big file that creates a
> > > bottleneck in the whole project compilation (note: by big, I mean the
> > > amount of code to generate).
> >
> > That's true.  During GCC bootstrap there are some of those (see PR84402).
> >
>
> > One way to improve parallelism is to use link-time optimization where
> > even single source files can be split up into multiple link-time units.  But
> > then there's the serial whole-program analysis part.
>
> Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> That is a lot of data :-)
>
> It seems that 'phase opt and generate' is the most time-consuming
> part. Is that the 'GIMPLE optimization pipeline' you were talking
> about in this thread:
> https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html

It's everything that comes after the frontend parsing bits, thus this
includes in particular RTL optimization and early GIMPLE optimizations.

> > > Additionally, I know that GCC must not
> > > change the project layout, but from the software engineering perspective,
> > > this may be a bad smell that indicates that the file should be broken
> > > into smaller files. Finally, the Makefiles will take care of the
> > > parallelization task.
> >
> > What do you mean by GCC must not change the project layout?  GCC
> > happily re-orders functions and link-time optimization will reorder
> > TUs (well, linking may as well).
> >
>
> That was a response to a comment made on IRC:
>
> On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> >I think this is in response to a comment I made on IRC. Giuliano said
> >that if a project has a very large file that dominates the total build
> >time, the file should be split up into smaller pieces. I said  "GCC
> >can't restructure people's code. it can only try to compile it
> >faster". We weren't referring to code transformations in the compiler
> >like re-ordering functions, but physically refactoring the source
> >code.
>
> Yes. But from one of the attachments from PR84402, it seems that such
> files exist in GCC:
> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
>
> > > My questions are:
> > >
> > >  1. Is there any project whose compilation would be significantly improved
> > > if GCC ran in parallel? Does anyone have data about something related
> > > to that? How about the Linux kernel? If not, I can try to bring some.
> >
> > We do not have any data about this apart from experiments with
> > splitting up source files for PR84402.
> >
> > >  2. Did I correctly understand the goal of the parallelization? Can
> > > anyone provide extra details to me?
> >
> > You may want to search the mailing list archives since we had a
> > student application (later revoked) for the task with some discussion.
> >
> > In my view (I proposed the thing) the most interesting parts are
> > getting GCC's global state documented and reduced.  The parallelization
> > itself is an interesting experiment but whether there will be any
> > substantial improvement for builds that can already benefit from make
> > parallelism remains a question.
>
> While I agree that documenting GCC's global states is good for the
> community and the development of GCC, I really don't think this is a good
> motivation for parallelizing a compiler from a research standpoint.

True ;)  Note that my suggestions to the other GSoC student were
purely based on where it's easiest to experiment with parallelization
and not where it would be most beneficial.

> There must be something or someone that could take advantage of the
> fine-grained parallelism. But that data from PR84402 seems to have the
> answer to it. :-)
>
> On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> >
> > On 15/11/18 10:29, Richard Biener wrote:
> > > In my view (I proposed the thing) the most interesting parts are
> > > getting GCC's global state documented and reduced.  The parallelization
> > > itself is an interesting experiment but whether there will be any
> > > substantial improvement for builds that can already benefit from make
> > > parallelism remains a question.
> >
> > in the common case (project with many small files, much more than
> > core count) i'd expect a regression:
> >
> > if gcc itself tries to parallelize that introduces inter thread
> > synchronization and potential false sharing in gcc (e.g. malloc
> > locks) that does not exist with make parallelism (glibc can avoid
> > some atomic instructions when a process is single threaded).
>
> That is what I am mostly worried about. Or that the most costly part is
> not parallelizable at all. Also, I would expect a regression on very small
> files, which could probably be avoided by implementing this feature
> behind a flag?

I think the issue should be avoided by avoiding fine-grained parallelism,
which might be somewhat hard given there are core data structures that
are shared (the memory allocator for a start).

The other issue I am more worried about is that we probably have to
interact with make somehow so that we do not end up with 64 threads
when one does -j8 on an 8-core machine.  That's basically the same
issue we run into with -flto and its threaded WPA writeout or recursive
invocation of make.
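
There is prior art for that part at least: GNU make advertises its
jobserver through MAKEFLAGS (--jobserver-auth=R,W on recent versions,
--jobserver-fds=R,W on older ones), and a threaded cc1 could simply be
a client of it.  Roughly, as a sketch (MAKEFLAGS parsing and error
handling omitted; the process also keeps the one implicit job slot it
was started with):

#include <unistd.h>

static int jobserver_rfd = -1;  /* read end, parsed from MAKEFLAGS */
static int jobserver_wfd = -1;  /* write end */

/* Acquire a slot for one extra worker thread.  Every byte in the
   pipe is a licence to run one extra job; the read blocks until some
   other job finishes and puts its byte back.  */
static int
acquire_job_slot (char *token)
{
  if (jobserver_rfd < 0)
    return 0;  /* no jobserver around: do not spawn extra threads */
  return read (jobserver_rfd, token, 1) == 1;
}

/* Give the same byte back when the worker thread is done so sibling
   make jobs can use the slot.  */
static void
release_job_slot (char token)
{
  if (jobserver_wfd >= 0)
    write (jobserver_wfd, &token, 1);
}

That would keep -j8 a global budget across make and compiler threads,
much like -flto=jobserver already does for the LTRANS phase.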

>
> On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> >
> > Hi Giuliano,
> >
> > On Thu, Nov 15 2018, Richard Biener wrote:
> > > You may want to search the mailing list archives since we had a
> > > student application (later revoked) for the task with some discussion.
> >
> > Specifically, the whole thread beginning with
> > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> >
> > Martin
> >
>
> Yes, I will research this carefully ;-)
>
> Thank you


* Re: Parallelize the compilation using Threads
  2018-11-19 14:36     ` Richard Biener
@ 2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
  2018-12-13  8:12         ` Bin.Cheng
  2018-12-17 11:06         ` Richard Biener
  2019-02-07 14:14       ` Giuliano Belinassi
  2019-02-11 21:46       ` Giuliano Belinassi
  2 siblings, 2 replies; 20+ messages in thread
From: Giuliano Augusto Faulin Belinassi @ 2018-12-12 15:46 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, Kernel USP, Alfredo Goldman, Alfredo Goldman

Hi, I have some news. :-)

I replicated the Martin Liška experiment [1] on a 64-core machine for
gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
and I am excited to dive into this problem. As a result, I want to
propose a GSoC project on this issue, starting with something like:
    1- Systematically create a benchmark for easy information
gathering. Martin Liška already made the first version of it, but I
need to improve it.
    2- Find and document the global states (and try to reduce GCC's
global state as well).
    3- Define the parallelization strategy.
    4- First parallelization attempt.

I also proposed this issue as a research project to my advisor and he
supported me on this idea. So I can work for at least one year on
this, and other things related to it.

Would anyone be willing to mentor me on this?

[1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
[2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
[3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> <giuliano.belinassi@usp.br> wrote:
> >
> > Hi! Sorry for the late reply again :P
> >
> > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > <giuliano.belinassi@usp.br> wrote:
> > > >
> > > > As a brief introduction, I am a graduate student who got interested
> > > > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > > > I am a newcomer to GCC, but I have already sent some patches, some of
> > > > which have been accepted [2].
> > > >
> > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > discuss this topic.
> > > >
> > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > compilation of projects which have a big file that creates a
> > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > amount of code to generate).
> > >
> > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > >
> >
> > > One way to improve parallelism is to use link-time optimization where
> > > even single source files can be split up into multiple link-time units.  But
> > > then there's the serial whole-program analysis part.
> >
> > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > That is a lot of data :-)
> >
> > It seems that 'phase opt and generate' is the most time-consuming
> > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > about in this thread:
> > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
>
> It's everything that comes after the frontend parsing bits, thus this
> includes in particular RTL optimization and early GIMPLE optimizations.
>
> > > > Additionally, I know that GCC must not
> > > > change the project layout, but from the software engineering perspective,
> > > > this may be a bad smell that indicates that the file should be broken
> > > > into smaller files. Finally, the Makefiles will take care of the
> > > > parallelization task.
> > >
> > > What do you mean by GCC must not change the project layout?  GCC
> > > happily re-orders functions and link-time optimization will reorder
> > > TUs (well, linking may as well).
> > >
> >
> > That was a response to a comment made on IRC:
> >
> > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > >I think this is in response to a comment I made on IRC. Giuliano said
> > >that if a project has a very large file that dominates the total build
> > >time, the file should be split up into smaller pieces. I said  "GCC
> > >can't restructure people's code. it can only try to compile it
> > >faster". We weren't referring to code transformations in the compiler
> > >like re-ordering functions, but physically refactoring the source
> > >code.
> >
> > Yes. But from one of the attachments from PR84402, it seems that such
> > files exist in GCC:
> > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> >
> > > > My questions are:
> > > >
> > > >  1. Is there any project whose compilation would be significantly improved
> > > > if GCC ran in parallel? Does anyone have data about something related
> > > > to that? How about the Linux kernel? If not, I can try to bring some.
> > >
> > > We do not have any data about this apart from experiments with
> > > splitting up source files for PR84402.
> > >
> > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > anyone provide extra details to me?
> > >
> > > You may want to search the mailing list archives since we had a
> > > student application (later revoked) for the task with some discussion.
> > >
> > > In my view (I proposed the thing) the most interesting parts are
> > > getting GCC's global state documented and reduced.  The parallelization
> > > itself is an interesting experiment but whether there will be any
> > > substantial improvement for builds that can already benefit from make
> > > parallelism remains a question.
> >
> > While I agree that documenting GCC's global states is good for the
> > community and the development of GCC, I really don't think this is a good
> > motivation for parallelizing a compiler from a research standpoint.
>
> True ;)  Note that my suggestions to the other GSoC student were
> purely based on where it's easiest to experiment with parallelization
> and not where it would be most beneficial.
>
> > There must be something or someone that could take advantage of the
> > fine-grained parallelism. But that data from PR84402 seems to have the
> > answer to it. :-)
> >
> > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > >
> > > On 15/11/18 10:29, Richard Biener wrote:
> > > > In my view (I proposed the thing) the most interesting parts are
> > > > getting GCCs global state documented and reduced.  The parallelization
> > > > itself is an interesting experiment but whether there will be any
> > > > substantial improvement for builds that can already benefit from make
> > > > parallelism remains a question.
> > >
> > > in the common case (project with many small files, much more than
> > > core count) i'd expect a regression:
> > >
> > > if gcc itself tries to parallelize that introduces inter thread
> > > synchronization and potential false sharing in gcc (e.g. malloc
> > > locks) that does not exist with make parallelism (glibc can avoid
> > > some atomic instructions when a process is single threaded).
> >
> > That is what I am mostly worried about. Or that the most costly part is
> > not parallelizable at all. Also, I would expect a regression on very small
> > files, which could probably be avoided by implementing this feature
> > behind a flag?
>
> I think the issue should be avoided by avoiding fine-grained parallelism,
> which might be somewhat hard given there are core data structures that
> are shared (the memory allocator for a start).
>
> The other issue I am more worried about is that we probably have to
> interact with make somehow so that we do not end up with 64 threads
> when one does -j8 on an 8-core machine.  That's basically the same
> issue we run into with -flto and its threaded WPA writeout or recursive
> invocation of make.
>
> >
> > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > >
> > > Hi Giuliano,
> > >
> > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > You may want to search the mailing list archives since we had a
> > > > student application (later revoked) for the task with some discussion.
> > >
> > > Specifically, the whole thread beginning with
> > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > >
> > > Martin
> > >
> >
> > Yes, I will research this carefully ;-)
> >
> > Thank you


* Re: Parallelize the compilation using Threads
  2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
@ 2018-12-13  8:12         ` Bin.Cheng
  2018-12-14 14:15           ` Giuliano Belinassi
  2018-12-17 11:06         ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Bin.Cheng @ 2018-12-13  8:12 UTC (permalink / raw)
  To: giuliano.belinassi
  Cc: Richard Guenther, GCC Development, kernel-usp, gold, alfredo.goldman

On Wed, Dec 12, 2018 at 11:46 PM Giuliano Augusto Faulin Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi, I have some news. :-)
>
> I replicated the Martin Liška experiment [1] on a 64-core machine for
> gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
> and I am excited to dive into this problem. As a result, I want to
> propose a GSoC project on this issue, starting with something like:
>     1- Systematically create a benchmark for easy information
> gathering. Martin Liška already made the first version of it, but I
> need to improve it.
>     2- Find and document the global states (and try to reduce GCC's
> global state as well).
>     3- Define the parallelization strategy.
>     4- First parallelization attempt.
Hi Giuliano,

Thanks very much for working on this.  It could be very useful; for
example, one bottleneck we have is slow compilation of a big single
source file after intensively using distributed compilation.  Of
course, a good parallelization strategy is needed.

Thanks,
bin
>
> I also proposed this issue as a research project to my advisor and he
> supported me on this idea. So I can work for at least one year on
> this, and other things related to it.
>
> Would anyone be willing to mentor me on this?
>
> [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> > <giuliano.belinassi@usp.br> wrote:
> > >
> > > Hi! Sorry for the late reply again :P
> > >
> > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > > <giuliano.belinassi@usp.br> wrote:
> > > > >
> > > > > As a brief introduction, I am a graduate student who got interested
> > > > > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > > > > I am a newcomer to GCC, but I have already sent some patches, some of
> > > > > which have been accepted [2].
> > > > >
> > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > discuss this topic.
> > > > >
> > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > compilation of projects which have a big file that creates a
> > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > amount of code to generate).
> > > >
> > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > >
> > >
> > > > One way to improve parallelism is to use link-time optimization where
> > > > even single source files can be split up into multiple link-time units.  But
> > > > then there's the serial whole-program analysis part.
> > >
> > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > That is a lot of data :-)
> > >
> > > It seems that 'phase opt and generate' is the most time-consuming
> > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > about in this thread:
> > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> >
> > It's everything that comes after the frontend parsing bits, thus this
> > includes in particular RTL optimization and early GIMPLE optimizations.
> >
> > > > > Additionally, I know that GCC must not
> > > > > change the project layout, but from the software engineering perspective,
> > > > > this may be a bad smell that indicates that the file should be broken
> > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > parallelization task.
> > > >
> > > > What do you mean by GCC must not change the project layout?  GCC
> > > > happily re-orders functions and link-time optimization will reorder
> > > > TUs (well, linking may as well).
> > > >
> > >
> > > That was a response to a comment made on IRC:
> > >
> > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > >that if a project has a very large file that dominates the total build
> > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > >can't restructure people's code. it can only try to compile it
> > > >faster". We weren't referring to code transformations in the compiler
> > > >like re-ordering functions, but physically refactoring the source
> > > >code.
> > >
> > > Yes. But from one of the attachments from PR84402, it seems that such
> > > files exist in GCC:
> > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > >
> > > > > My questions are:
> > > > >
> > > > >  1. Is there any project whose compilation would be significantly improved
> > > > > if GCC ran in parallel? Does anyone have data about something related
> > > > > to that? How about the Linux kernel? If not, I can try to bring some.
> > > >
> > > > We do not have any data about this apart from experiments with
> > > > splitting up source files for PR84402.
> > > >
> > > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > > anyone provide extra details to me?
> > > >
> > > > You may want to search the mailing list archives since we had a
> > > > student application (later revoked) for the task with some discussion.
> > > >
> > > > In my view (I proposed the thing) the most interesting parts are
> > > > getting GCC's global state documented and reduced.  The parallelization
> > > > itself is an interesting experiment but whether there will be any
> > > > substantial improvement for builds that can already benefit from make
> > > > parallelism remains a question.
> > >
> > > While I agree that documenting GCC's global states is good for the
> > > community and the development of GCC, I really don't think this is a good
> > > motivation for parallelizing a compiler from a research standpoint.
> >
> > True ;)  Note that my suggestions to the other GSoC student were
> > purely based on where it's easiest to experiment with parallelization
> > and not where it would be most beneficial.
> >
> > > There must be something or someone that could take advantage of the
> > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > answer to it. :-)
> > >
> > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > > >
> > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > itself is an interesting experiment but whether there will be any
> > > > > substantial improvement for builds that can already benefit from make
> > > > > parallelism remains a question.
> > > >
> > > > in the common case (project with many small files, much more than
> > > > core count) i'd expect a regression:
> > > >
> > > > if gcc itself tries to parallelize that introduces inter thread
> > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > some atomic instructions when a process is single threaded).
> > >
> > > That is what I am mostly worried about. Or that the most costly part is
> > > not parallelizable at all. Also, I would expect a regression on very small
> > > files, which could probably be avoided by implementing this feature
> > > behind a flag?
> >
> > I think the issue should be avoided by avoiding fine-grained parallelism,
> > which might be somewhat hard given there are core data structures that
> > are shared (the memory allocator for a start).
> >
> > The other issue I am more worried about is that we probably have to
> > interact with make somehow so that we do not end up with 64 threads
> > when one does -j8 on an 8-core machine.  That's basically the same
> > issue we run into with -flto and its threaded WPA writeout or recursive
> > invocation of make.
> >
> > >
> > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > > >
> > > > Hi Giuliano,
> > > >
> > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > You may want to search the mailing list archives since we had a
> > > > > student application (later revoked) for the task with some discussion.
> > > >
> > > > Specifically, the whole thread beginning with
> > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > >
> > > > Martin
> > > >
> > >
> > > Yes, I will research this carefully ;-)
> > >
> > > Thank you


* Re: Parallelize the compilation using Threads
  2018-12-13  8:12         ` Bin.Cheng
@ 2018-12-14 14:15           ` Giuliano Belinassi
  0 siblings, 0 replies; 20+ messages in thread
From: Giuliano Belinassi @ 2018-12-14 14:15 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Richard Guenther, GCC Development, kernel-usp, gold, alfredo.goldman

Hi,

See comments inline.

On 12/13, Bin.Cheng wrote:
> On Wed, Dec 12, 2018 at 11:46 PM Giuliano Augusto Faulin Belinassi
> <giuliano.belinassi@usp.br> wrote:
> >
> > Hi, I have some news. :-)
> >
> > I replicated the Martin Liška experiment [1] on a 64-core machine for
> > gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
> > and I am excited to dive into this problem. As a result, I want to
> > propose a GSoC project on this issue, starting with something like:
> >     1- Systematically create a benchmark for easy information
> > gathering. Martin Liška already made the first version of it, but I
> > need to improve it.
> >     2- Find and document the global states (and try to reduce GCC's
> > global state as well).
> >     3- Define the parallelization strategy.
> >     4- First parallelization attempt.
> Hi Giuliano,
> 
> Thanks very much for working on this.  It could be very useful; for
> example, one bottleneck we have is slow compilation of a big single
> source file after intensively using distributed compilation.  Of
> course, a good parallelization strategy is needed.
> 

Interesting. How many lines does the generated file have? Does it use
C++ templates?

The generated gimple-match.c file, for example, has 98786 lines and
takes about 30s to compile.

> Thanks,
> bin
> >
> > I also proposed this issue as a research project to my advisor and he
> > supported me on this idea. So I can work for at least one year on
> > this, and other things related to it.
> >
> > Would anyone be willing to mentor me on this?
> >
> > [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> > [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> > On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> > > <giuliano.belinassi@usp.br> wrote:
> > > >
> > > > Hi! Sorry for the late reply again :P
> > > >
> > > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > > > <giuliano.belinassi@usp.br> wrote:
> > > > > >
> > > > > > As a brief introduction, I am a graduate student who got interested
> > > > > > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > > > > > I am a newcomer to GCC, but I have already sent some patches, some of
> > > > > > which have been accepted [2].
> > > > > >
> > > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > > discuss this topic.
> > > > > >
> > > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > > compilation of projects which have a big file that creates a
> > > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > > amount of code to generate).
> > > > >
> > > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > > >
> > > >
> > > > > One way to improve parallelism is to use link-time optimization where
> > > > > even single source files can be split up into multiple link-time units.  But
> > > > > then there's the serial whole-program analysis part.
> > > >
> > > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > > That is a lot of data :-)
> > > >
> > > > It seems that 'phase opt and generate' is the most time-consuming
> > > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > > about in this thread:
> > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> > >
> > > It's everything that comes after the frontend parsing bits, thus this
> > > includes in particular RTL optimization and early GIMPLE optimizations.
> > >
> > > > > > Additionally, I know that GCC must not
> > > > > > change the project layout, but from the software engineering perspective,
> > > > > > this may be a bad smell that indicates that the file should be broken
> > > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > > parallelization task.
> > > > >
> > > > > What do you mean by GCC must not change the project layout?  GCC
> > > > > happily re-orders functions and link-time optimization will reorder
> > > > > TUs (well, linking may as well).
> > > > >
> > > >
> > > > That was a response to a comment made on IRC:
> > > >
> > > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > > >that if a project has a very large file that dominates the total build
> > > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > > >can't restructure people's code. it can only try to compile it
> > > > >faster". We weren't referring to code transformations in the compiler
> > > > >like re-ordering functions, but physically refactoring the source
> > > > >code.
> > > >
> > > > Yes. But from one of the attachments from PR84402, it seems that such
> > > > files exist in GCC:
> > > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > >
> > > > > > My questions are:
> > > > > >
> > > > > >  1. Is there any project whose compilation would be significantly improved
> > > > > > if GCC ran in parallel? Does anyone have data about something related
> > > > > > to that? How about the Linux kernel? If not, I can try to bring some.
> > > > >
> > > > > We do not have any data about this apart from experiments with
> > > > > splitting up source files for PR84402.
> > > > >
> > > > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > > > anyone provide extra details to me?
> > > > >
> > > > > You may want to search the mailing list archives since we had a
> > > > > student application (later revoked) for the task with some discussion.
> > > > >
> > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > getting GCC's global state documented and reduced.  The parallelization
> > > > > itself is an interesting experiment but whether there will be any
> > > > > substantial improvement for builds that can already benefit from make
> > > > > parallelism remains a question.
> > > >
> > > > While I agree that documenting GCC's global states is good for the
> > > > community and the development of GCC, I really don't think this is a good
> > > > motivation for parallelizing a compiler from a research standpoint.
> > >
> > > True ;)  Note that my suggestions to the other GSoC student were
> > > purely based on where it's easiest to experiment with parallelization
> > > and not where it would be most beneficial.
> > >
> > > > There must be something or someone that could take advantage of the
> > > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > > answer to it. :-)
> > > >
> > > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > > > >
> > > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > getting GCC's global state documented and reduced.  The parallelization
> > > > > > itself is an interesting experiment but whether there will be any
> > > > > > substantial improvement for builds that can already benefit from make
> > > > > > parallelism remains a question.
> > > > >
> > > > > in the common case (project with many small files, much more than
> > > > > core count) i'd expect a regression:
> > > > >
> > > > > if gcc itself tries to parallelize that introduces inter thread
> > > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > > some atomic instructions when a process is single threaded).
> > > >
> > > > That is what I am mostly worried about. Or that the most costly part is
> > > > not parallelizable at all. Also, I would expect a regression on very small
> > > > files, which could probably be avoided by implementing this feature
> > > > behind a flag?
> > >
> > > I think the issue should be avoided by avoiding fine-grained parallelism,
> > > which might be somewhat hard given there are core data structures that
> > > are shared (the memory allocator for a start).
> > >
> > > The other issue I am more worried about is that we probably have to
> > > interact with make somehow so that we do not end up with 64 threads
> > > when one does -j8 on an 8-core machine.  That's basically the same
> > > issue we run into with -flto and its threaded WPA writeout or recursive
> > > invocation of make.
> > >
> > > >
> > > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > > > >
> > > > > Hi Giuliano,
> > > > >
> > > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > > You may want to search the mailing list archives since we had a
> > > > > > student application (later revoked) for the task with some discussion.
> > > > >
> > > > > Specifically, the whole thread beginning with
> > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > > >
> > > > > Martin
> > > > >
> > > >
> > > > Yes, I will research this carefully ;-)
> > > >
> > > > Thank you


* Re: Parallelize the compilation using Threads
  2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
  2018-12-13  8:12         ` Bin.Cheng
@ 2018-12-17 11:06         ` Richard Biener
  2019-01-14 11:42           ` Giuliano Belinassi
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2018-12-17 11:06 UTC (permalink / raw)
  To: Giuliano Augusto Faulin Belinassi
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

On Wed, Dec 12, 2018 at 4:46 PM Giuliano Augusto Faulin Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi, I have some news. :-)
>
> I replicated the Martin Liška experiment [1] on a 64-core machine for
> gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
> and I am excited to dive into this problem. As a result, I want to
> propose a GSoC project on this issue, starting with something like:
>     1- Systematically create a benchmark for easy information
> gathering. Martin Liška already made the first version of it, but I
> need to improve it.
>     2- Find and document the global states (and try to reduce GCC's
> global state as well).
>     3- Define the parallelization strategy.
>     4- First parallelization attempt.
>
> I also proposed this issue as a research project to my advisor and he
> supported me on this idea. So I can work for at least one year on
> this, and other things related to it.
>
> Would anyone be willing to mentor me on this?

As the one who initially suggested the project, I'm certainly willing
to mentor you on this.

Richard.

> [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> > <giuliano.belinassi@usp.br> wrote:
> > >
> > > Hi! Sorry for the late reply again :P
> > >
> > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > > <giuliano.belinassi@usp.br> wrote:
> > > > >
> > > > > As a brief introduction, I am a graduate student who got interested
> > > > > in the "Parallelize the compilation using threads" project (GSoC 2018 [1]).
> > > > > I am a newcomer to GCC, but I have already sent some patches, some of
> > > > > which have been accepted [2].
> > > > >
> > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > discuss this topic.
> > > > >
> > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > compilation of projects which have a big file that creates a
> > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > amount of code to generate).
> > > >
> > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > >
> > >
> > > > One way to improve parallelism is to use link-time optimization where
> > > > even single source files can be split up into multiple link-time units.  But
> > > > then there's the serial whole-program analysis part.
> > >
> > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > That is a lot of data :-)
> > >
> > > It seems that 'phase opt and generate' is the most time-consuming
> > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > about in this thread:
> > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> >
> > It's everything that comes after the frontend parsing bits, thus this
> > includes in particular RTL optimization and early GIMPLE optimizations.
> >
> > > > > Additionally, I know that GCC must not
> > > > > change the project layout, but from the software engineering perspective,
> > > > > this may be a bad smell that indicates that the file should be broken
> > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > parallelization task.
> > > >
> > > > What do you mean by GCC must not change the project layout?  GCC
> > > > happily re-orders functions and link-time optimization will reorder
> > > > TUs (well, linking may as well).
> > > >
> > >
> > > That was a response to a comment made on IRC:
> > >
> > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > >that if a project has a very large file that dominates the total build
> > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > >can't restructure people's code. it can only try to compile it
> > > >faster". We weren't referring to code transformations in the compiler
> > > >like re-ordering functions, but physically refactoring the source
> > > >code.
> > >
> > > Yes. But from one of the attachments from PR84402, it seems that such
> > > files exist in GCC:
> > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > >
> > > > > My questions are:
> > > > >
> > > > >  1. Is there any project whose compilation would be significantly improved
> > > > > if GCC ran in parallel? Does anyone have data about something related
> > > > > to that? How about the Linux kernel? If not, I can try to bring some.
> > > >
> > > > We do not have any data about this apart from experiments with
> > > > splitting up source files for PR84402.
> > > >
> > > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > > anyone provide extra details to me?
> > > >
> > > > You may want to search the mailing list archives since we had a
> > > > student application (later revoked) for the task with some discussion.
> > > >
> > > > In my view (I proposed the thing) the most interesting parts are
> > > > getting GCC's global state documented and reduced.  The parallelization
> > > > itself is an interesting experiment but whether there will be any
> > > > substantial improvement for builds that can already benefit from make
> > > > parallelism remains a question.
> > >
> > > While I agree that documenting GCC's global states is good for the
> > > community and the development of GCC, I really don't think this is a good
> > > motivation for parallelizing a compiler from a research standpoint.
> >
> > True ;)  Note that my suggestions to the other GSoC student were
> > purely based on where it's easiest to experiment with parallelization
> > and not where it would be most beneficial.
> >
> > > There must be something or someone that could take advantage of the
> > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > answer to it. :-)
> > >
> > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > > >
> > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > itself is an interesting experiment but whether there will be any
> > > > > substantial improvement for builds that can already benefit from make
> > > > > parallelism remains a question.
> > > >
> > > > in the common case (project with many small files, much more than
> > > > core count) i'd expect a regression:
> > > >
> > > > if gcc itself tries to parallelize that introduces inter thread
> > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > some atomic instructions when a process is single threaded).
> > >
> > > That is what I am mostly worried about. Or the most costly part is not
> > > parallelizable at all. Also, I would expect a regression on very small
> > > files, which probably could be avoided implementing this feature as a
> > > flag?
> >
> > I think the the issue should be avoided by avoiding fine-grained paralellism.
> > Which might be somewhat hard given there are core data structures that
> > are shared (the memory allocator for a start).
> >
> > The other issue I am more worried about is that we probably have to
> > interact with make somehow so that we do not end up with 64 threads
> > when one does -j8 on a 8 core machine.  That's basically the same
> > issue we run into with -flto and it's threaded WPA writeout or recursive
> > invocation of make.
> >
> > >
> > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > > >
> > > > Hi Giuliano,
> > > >
> > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > You may want to search the mailing list archives since we had a
> > > > > student application (later revoked) for the task with some discussion.
> > > >
> > > > Specifically, the whole thread beginning with
> > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > >
> > > > Martin
> > > >
> > >
> > > Yes, I will research this carefully ;-)
> > >
> > > Thank you

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2018-12-17 11:06         ` Richard Biener
@ 2019-01-14 11:42           ` Giuliano Belinassi
  2019-01-14 12:23             ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Giuliano Belinassi @ 2019-01-14 11:42 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman, Gregory.Mounie

Hi,

I am currently studying the GIMPLE IR documentation and thinking about
a way to easily gather the timing information. I was thinking about
adding a feature to gcc to show/dump the elapsed time spent on GIMPLE.
Does this make sense? Is this already implemented somewhere? What is a
good place to start?

Richard Biener: I would like to know what your nickname on IRC is :)

Thank you,
Giuliano.

On 12/17, Richard Biener wrote:
> On Wed, Dec 12, 2018 at 4:46 PM Giuliano Augusto Faulin Belinassi
> <giuliano.belinassi@usp.br> wrote:
> >
> > Hi, I have some news. :-)
> >
> > I replicated the Martin Liška experiment [1] on a 64-core machine for
> > gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
> > and I am excited to dive into this problem. As a result, I want to
> > propose a GSoC project on this issue, starting with something like:
> >     1- Systematically create a benchmark for easy information
> > gathering. Martin Liška already made the first version of it, but I
> > need to improve it.
> >     2- Find and document the global states (try to reduce gcc's
> > global states as well).
> >     4- First parallelization attempt.
> >
> > I also proposed this issue as a research project to my advisor and he
> > supported me on this idea. So I can work for at least one year on
> > this, and other things related to it.
> >
> > Would anyone be willing to mentor me on this?
> 
> As the one who initially suggested the project I'm certainly willing
> to mentor you on this.
> 
> Richard.
> 
> > [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> > [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2019-01-14 11:42           ` Giuliano Belinassi
@ 2019-01-14 12:23             ` Richard Biener
  2019-01-15 21:45               ` Giuliano Belinassi
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2019-01-14 12:23 UTC (permalink / raw)
  To: Giuliano Belinassi
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman, Gregory.Mounie

On Mon, Jan 14, 2019 at 12:41 PM Giuliano Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi,
>
> I am currently studying the GIMPLE IR documentation and thinking about
> a way to easily gather the timing information. I was thinking about
> adding a feature to gcc to show/dump the elapsed time spent on GIMPLE.
> Does this make sense? Is this already implemented somewhere? What is a
> good place to start?

There's -ftime-report which more-or-less tells you the time spent in the
individual passes.  I think there's no overall group to count GIMPLE
optimizers vs. RTL optimizers though.
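
For example (a minimal sketch; any translation unit will do, foo.c is
just a placeholder):

  gcc -O2 -ftime-report -c foo.c

prints the per-pass table at the end of the compilation.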

> Richard Biener: I would like to know what your nickname on IRC is :)

It's richi.

Richard.

> Thank you,
> Giuliano.
>
> On 12/17, Richard Biener wrote:
> > On Wed, Dec 12, 2018 at 4:46 PM Giuliano Augusto Faulin Belinassi
> > <giuliano.belinassi@usp.br> wrote:
> > >
> > > Hi, I have some news. :-)
> > >
> > > I replicated the Martin Liška experiment [1] on a 64-cores machine for
> > > gcc [2] and Linux kernel [3] (Linux kernel was fully parallelized),
> > > and I am excited to dive into this problem. As a result, I want to
> > > propose GSoC project on this issue, starting with something like:
> > >     1- Systematically create a benchmark for easily information
> > > gathering. Martin Liška already made the first version of it, but I
> > > need to improve it.
> > >     2- Find and document the global states (Try to reduce the gcc's
> > > global states as well).
> > >     3- Define the parallelization strategy.
> > >     4- First parallelization attempt.
> > >
> > > I also proposed this issue as a research project to my advisor and he
> > > supported me on this idea. So I can work for at least one year on
> > > this, and other things related to it.
> > >
> > > Would anyone be willing to mentor me on this?
> >
> > As the one who initially suggested the project I'm certainly willing
> > to mentor you on this.
> >
> > Richard.
> >
> > > [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> > > [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> > > On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> > > > <giuliano.belinassi@usp.br> wrote:
> > > > >
> > > > > Hi! Sorry for the late reply again :P
> > > > >
> > > > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > > > > <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > > > > <giuliano.belinassi@usp.br> wrote:
> > > > > > >
> > > > > > > As a brief introduction, I am a graduate student that got interested
> > > > > > >
> > > > > > > in the "Parallelize the compilation using threads"(GSoC 2018 [1]). I
> > > > > > > am a newcommer in GCC, but already have sent some patches, some of
> > > > > > > them have already been accepted [2].
> > > > > > >
> > > > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > > > discuss this topic.
> > > > > > >
> > > > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > > > compilation of projects which have a big file that creates a
> > > > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > > > amount of code to generate).
> > > > > >
> > > > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > > > >
> > > > >
> > > > > > One way to improve parallelism is to use link-time optimization where
> > > > > > even single source files can be split up into multiple link-time units.  But
> > > > > > then there's the serial whole-program analysis part.
> > > > >
> > > > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > > > That is a lot of data :-)
> > > > >
> > > > > It seems that 'phase opt and generate' is the most time-consuming
> > > > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > > > about in this thread:
> > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> > > >
> > > > It's everything that comes after the frontend parsing bits, thus this
> > > > includes in particular RTL optimization and early GIMPLE optimizations.
> > > >
> > > > > > > Additionally, I know that GCC must not
> > > > > > > change the project layout, but from the software engineering perspective,
> > > > > > > this may be a bad smell that indicates that the file should be broken
> > > > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > > > parallelization task.
> > > > > >
> > > > > > What do you mean by GCC must not change the project layout?  GCC
> > > > > > happily re-orders functions and link-time optimization will reorder
> > > > > > TUs (well, linking may as well).
> > > > > >
> > > > >
> > > > > That was a response to a comment made on IRC:
> > > > >
> > > > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > > > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > > > >that if a project has a very large file that dominates the total build
> > > > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > > > >can't restructure people's code. it can only try to compile it
> > > > > >faster". We weren't referring to code transformations in the compiler
> > > > > >like re-ordering functions, but physically refactoring the source
> > > > > >code.
> > > > >
> > > > > Yes. But from one of the attachments from PR84402, it seems that such
> > > > > files exist on GCC,
> > > > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > > >
> > > > > > > My questions are:
> > > > > > >
> > > > > > >  1. Is there any project compilation that will significantly be improved
> > > > > > > if GCC runs in parallel? Do someone has data about something related
> > > > > > > to that? How about the Linux Kernel? If not, I can try to bring some.
> > > > > >
> > > > > > We do not have any data about this apart from experiments with
> > > > > > splitting up source files for PR84402.
> > > > > >
> > > > > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > > > > anyone provide extra details to me?
> > > > > >
> > > > > > You may want to search the mailing list archives since we had a
> > > > > > student application (later revoked) for the task with some discussion.
> > > > > >
> > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > > itself is an interesting experiment but whether there will be any
> > > > > > substantial improvement for builds that can already benefit from make
> > > > > > parallelism remains a question.
> > > > >
> > > > > As I agree that documenting GCC's global states is good for the
> > > > > community and the development of GCC, I really don't think this a good
> > > > > motivation for parallelizing a compiler from a research standpoint.
> > > >
> > > > True ;)  Note that my suggestions to the other GSoC student were
> > > > purely based on where it's easiest to experiment with paralellization
> > > > and not where it would be most beneficial.
> > > >
> > > > > There must be something or someone that could take advantage of the
> > > > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > > > answer to it. :-)
> > > > >
> > > > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > > > > >
> > > > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > > getting GCCs global state documented and reduced.  The parallelization
> > > > > > > itself is an interesting experiment but whether there will be any
> > > > > > > substantial improvement for builds that can already benefit from make
> > > > > > > parallelism remains a question.
> > > > > >
> > > > > > in the common case (project with many small files, much more than
> > > > > > core count) i'd expect a regression:
> > > > > >
> > > > > > if gcc itself tries to parallelize that introduces inter thread
> > > > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > > > some atomic instructions when a process is single threaded).
> > > > >
> > > > > That is what I am mostly worried about. Or the most costly part is not
> > > > > parallelizable at all. Also, I would expect a regression on very small
> > > > > files, which probably could be avoided implementing this feature as a
> > > > > flag?
> > > >
> > > > I think the the issue should be avoided by avoiding fine-grained paralellism.
> > > > Which might be somewhat hard given there are core data structures that
> > > > are shared (the memory allocator for a start).
> > > >
> > > > The other issue I am more worried about is that we probably have to
> > > > interact with make somehow so that we do not end up with 64 threads
> > > > when one does -j8 on a 8 core machine.  That's basically the same
> > > > issue we run into with -flto and it's threaded WPA writeout or recursive
> > > > invocation of make.
> > > >
> > > > >
> > > > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > > > > >
> > > > > > Hi Giuliano,
> > > > > >
> > > > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > > > You may want to search the mailing list archives since we had a
> > > > > > > student application (later revoked) for the task with some discussion.
> > > > > >
> > > > > > Specifically, the whole thread beginning with
> > > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > > > >
> > > > > > Martin
> > > > > >
> > > > >
> > > > > Yes, I will research this carefully ;-)
> > > > >
> > > > > Thank you

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2019-01-14 12:23             ` Richard Biener
@ 2019-01-15 21:45               ` Giuliano Belinassi
  2019-01-16 12:44                 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Giuliano Belinassi @ 2019-01-15 21:45 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman, Gregory.Mounie

Hi

I've managed to compile gimple-match.c with -ftime-report, and "phase opt and
generate" seems to be what takes most of the compilation time. This is captured
by the "TV_PHASE_OPT_GEN" timevar, and all its occurrences seem to be in
toplev.c and lto.c. Any idea which of the parts that this variable
captures is the most costly? Also, is the percentage in the "GGC"
column the amount of time spent inside the Garbage Collector?

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.01 (  0%)   0.01 (  0%)   0.02 (  0%)    1473 kB (  0%)
 phase parsing                      :   3.74 (  4%)   1.43 ( 30%)   5.17 (  5%)  294287 kB ( 16%)
 phase lang. deferred               :   0.08 (  0%)   0.03 (  1%)   0.11 (  0%)    7582 kB (  0%)
 phase opt and generate             :  94.10 ( 95%)   3.26 ( 67%)  97.46 ( 93%) 1543477 kB ( 82%)
 phase last asm                     :   0.89 (  1%)   0.09 (  2%)   0.98 (  1%)   39802 kB (  2%)
 phase finalize                     :   0.00 (  0%)   0.01 (  0%)   0.50 (  0%)       0 kB (  0%)
 |name lookup                       :   0.42 (  0%)   0.12 (  2%)   0.46 (  0%)    6162 kB (  0%)
 |overload resolution               :   0.37 (  0%)   0.13 (  3%)   0.42 (  0%)   18172 kB (  1%)
 garbage collection                 :   2.99 (  3%)   0.03 (  1%)   3.02 (  3%)       0 kB (  0%)
 dump files                         :   0.11 (  0%)   0.01 (  0%)   0.16 (  0%)       0 kB (  0%)
 callgraph construction             :   0.35 (  0%)   0.01 (  0%)   0.24 (  0%)   61143 kB (  3%)
 callgraph optimization             :   0.21 (  0%)   0.01 (  0%)   0.17 (  0%)     175 kB (  0%)
 ipa function summary               :   0.12 (  0%)   0.00 (  0%)   0.14 (  0%)    2216 kB (  0%)
 ipa dead code removal              :   0.04 (  0%)   0.01 (  0%)   0.00 (  0%)       0 kB (  0%)
 ipa devirtualization               :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 ipa cp                             :   0.33 (  0%)   0.01 (  0%)   0.39 (  0%)    9073 kB (  0%)
 ipa inlining heuristics            :   0.48 (  0%)   0.00 (  0%)   0.48 (  0%)    6175 kB (  0%)
 ipa function splitting             :   0.10 (  0%)   0.01 (  0%)   0.07 (  0%)    9111 kB (  0%)
 ipa comdats                        :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 ipa various optimizations          :   0.03 (  0%)   0.03 (  1%)   0.01 (  0%)     480 kB (  0%)
 ipa reference                      :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa profile                        :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 ipa pure const                     :   0.13 (  0%)   0.00 (  0%)   0.12 (  0%)       8 kB (  0%)
 ipa icf                            :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)       6 kB (  0%)
 ipa SRA                            :   1.26 (  1%)   0.28 (  6%)   1.78 (  2%)  165814 kB (  9%)
 ipa free lang data                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 ipa free inline summary            :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 cfg construction                   :   0.09 (  0%)   0.00 (  0%)   0.09 (  0%)    7926 kB (  0%)
 cfg cleanup                        :   1.84 (  2%)   0.00 (  0%)   1.73 (  2%)   13673 kB (  1%)
 CFG verifier                       :   6.05 (  6%)   0.12 (  2%)   6.80 (  7%)       0 kB (  0%)
 trivially dead code                :   0.32 (  0%)   0.01 (  0%)   0.38 (  0%)       0 kB (  0%)
 df scan insns                      :   0.23 (  0%)   0.00 (  0%)   0.30 (  0%)      28 kB (  0%)
 df multiple defs                   :   0.13 (  0%)   0.00 (  0%)   0.20 (  0%)       0 kB (  0%)
 df reaching defs                   :   0.52 (  1%)   0.00 (  0%)   0.55 (  1%)       0 kB (  0%)
 df live regs                       :   2.70 (  3%)   0.02 (  0%)   3.08 (  3%)     425 kB (  0%)
 df live&initialized regs           :   1.28 (  1%)   0.00 (  0%)   1.13 (  1%)       0 kB (  0%)
 df must-initialized regs           :   0.14 (  0%)   0.00 (  0%)   0.16 (  0%)       0 kB (  0%)
 df use-def / def-use chains        :   0.32 (  0%)   0.00 (  0%)   0.26 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   0.96 (  1%)   0.01 (  0%)   0.89 (  1%)   11726 kB (  1%)
 register information               :   0.29 (  0%)   0.00 (  0%)   0.21 (  0%)       0 kB (  0%)
 alias analysis                     :   0.54 (  1%)   0.00 (  0%)   0.53 (  1%)   17487 kB (  1%)
 alias stmt walking                 :   1.10 (  1%)   0.08 (  2%)   1.22 (  1%)     118 kB (  0%)
 register scan                      :   0.08 (  0%)   0.01 (  0%)   0.08 (  0%)     118 kB (  0%)
 rebuild jump labels                :   0.12 (  0%)   0.01 (  0%)   0.11 (  0%)       0 kB (  0%)
 preprocessing                      :   0.29 (  0%)   0.43 (  9%)   0.65 (  1%)   37409 kB (  2%)
 parser (global)                    :   0.39 (  0%)   0.39 (  8%)   0.94 (  1%)   92661 kB (  5%)
 parser struct body                 :   0.07 (  0%)   0.00 (  0%)   0.08 (  0%)    6159 kB (  0%)
 parser enumerator list             :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)    3342 kB (  0%)
 parser function body               :   2.37 (  2%)   0.43 (  9%)   2.82 (  3%)  119124 kB (  6%)
 parser inl. func. body             :   0.18 (  0%)   0.05 (  1%)   0.16 (  0%)   10354 kB (  1%)
 parser inl. meth. body             :   0.04 (  0%)   0.01 (  0%)   0.03 (  0%)    2986 kB (  0%)
 template instantiation             :   0.17 (  0%)   0.08 (  2%)   0.26 (  0%)   15801 kB (  1%)
 constant expression evaluation     :   0.06 (  0%)   0.05 (  1%)   0.07 (  0%)     516 kB (  0%)
 early inlining heuristics          :   0.13 (  0%)   0.00 (  0%)   0.08 (  0%)   19547 kB (  1%)
 inline parameters                  :   0.14 (  0%)   0.01 (  0%)   0.22 (  0%)    3372 kB (  0%)
 integration                        :   1.00 (  1%)   0.23 (  5%)   1.22 (  1%)  132386 kB (  7%)
 tree gimplify                      :   0.36 (  0%)   0.02 (  0%)   0.31 (  0%)   63162 kB (  3%)
 tree eh                            :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)    4173 kB (  0%)
 tree CFG construction              :   0.07 (  0%)   0.00 (  0%)   0.07 (  0%)   20805 kB (  1%)
 tree CFG cleanup                   :   1.40 (  1%)   0.14 (  3%)   1.57 (  2%)    3995 kB (  0%)
 tree tail merge                    :   0.17 (  0%)   0.01 (  0%)   0.16 (  0%)    7251 kB (  0%)
 tree VRP                           :   1.94 (  2%)   0.08 (  2%)   1.83 (  2%)   40527 kB (  2%)
 tree Early VRP                     :   0.27 (  0%)   0.03 (  1%)   0.30 (  0%)    3298 kB (  0%)
 tree copy propagation              :   0.14 (  0%)   0.00 (  0%)   0.08 (  0%)     427 kB (  0%)
 tree PTA                           :   0.61 (  1%)   0.03 (  1%)   0.53 (  1%)    3861 kB (  0%)
 tree PHI insertion                 :   0.01 (  0%)   0.02 (  0%)   0.03 (  0%)    8529 kB (  0%)
 tree SSA rewrite                   :   0.23 (  0%)   0.03 (  1%)   0.43 (  0%)   24334 kB (  1%)
 tree SSA other                     :   0.10 (  0%)   0.01 (  0%)   0.10 (  0%)     538 kB (  0%)
 tree SSA incremental               :   0.79 (  1%)   0.07 (  1%)   0.88 (  1%)   11828 kB (  1%)
 tree operand scan                  :   1.33 (  1%)   0.30 (  6%)   1.51 (  1%)   56249 kB (  3%)
 dominator optimization             :   1.92 (  2%)   0.07 (  1%)   1.90 (  2%)   31786 kB (  2%)
 backwards jump threading           :   0.20 (  0%)   0.02 (  0%)   0.16 (  0%)    8676 kB (  0%)
 tree SRA                           :   0.17 (  0%)   0.01 (  0%)   0.09 (  0%)    6050 kB (  0%)
 isolate eroneous paths             :   0.01 (  0%)   0.00 (  0%)   0.04 (  0%)    1319 kB (  0%)
 tree CCP                           :   0.67 (  1%)   0.08 (  2%)   0.62 (  1%)    4190 kB (  0%)
 tree PHI const/copy prop           :   0.10 (  0%)   0.00 (  0%)   0.02 (  0%)     132 kB (  0%)
 tree split crit edges              :   0.12 (  0%)   0.00 (  0%)   0.15 (  0%)   10236 kB (  1%)
 tree reassociation                 :   0.14 (  0%)   0.00 (  0%)   0.08 (  0%)     168 kB (  0%)
 tree PRE                           :   0.74 (  1%)   0.04 (  1%)   0.76 (  1%)   16728 kB (  1%)
 tree FRE                           :   0.69 (  1%)   0.04 (  1%)   0.60 (  1%)    5370 kB (  0%)
 tree code sinking                  :   0.06 (  0%)   0.01 (  0%)   0.06 (  0%)    9670 kB (  1%)
 tree linearize phis                :   0.10 (  0%)   0.00 (  0%)   0.09 (  0%)     699 kB (  0%)
 tree backward propagate            :   0.03 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 tree forward propagate             :   0.52 (  1%)   0.04 (  1%)   0.48 (  0%)    3055 kB (  0%)
 tree phiprop                       :   0.05 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 tree conservative DCE              :   0.27 (  0%)   0.03 (  1%)   0.43 (  0%)    1557 kB (  0%)
 tree aggressive DCE                :   0.21 (  0%)   0.04 (  1%)   0.23 (  0%)    2565 kB (  0%)
 tree buildin call DCE              :   0.00 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 tree DSE                           :   0.18 (  0%)   0.01 (  0%)   0.18 (  0%)     274 kB (  0%)
 PHI merge                          :   0.07 (  0%)   0.00 (  0%)   0.06 (  0%)    3170 kB (  0%)
 tree loop optimization             :   0.00 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 loopless fn                        :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 tree loop invariant motion         :   0.03 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 tree canonical iv                  :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      58 kB (  0%)
 complete unrolling                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     361 kB (  0%)
 tree iv optimization               :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     128 kB (  0%)
 tree copy headers                  :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)     414 kB (  0%)
 tree SSA uncprop                   :   0.06 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
 tree NRV optimization              :   0.01 (  0%)   0.00 (  0%)   0.05 (  0%)      14 kB (  0%)
 tree SSA verifier                  :   8.44 (  9%)   0.26 (  5%)   8.77 (  8%)       0 kB (  0%)
 tree STMT verifier                 :  12.57 ( 13%)   0.35 (  7%)  13.03 ( 12%)       0 kB (  0%)
 tree switch conversion             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       5 kB (  0%)
 tree switch lowering               :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)    1194 kB (  0%)
 gimple CSE sin/cos                 :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 gimple widening/fma detection      :   0.06 (  0%)   0.00 (  0%)   0.03 (  0%)       2 kB (  0%)
 tree strlen optimization           :   0.03 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 callgraph verifier                 :   0.93 (  1%)   0.07 (  1%)   0.99 (  1%)       0 kB (  0%)
 dominance frontiers                :   0.14 (  0%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
 dominance computation              :   1.98 (  2%)   0.05 (  1%)   2.17 (  2%)       0 kB (  0%)
 control dependences                :   0.03 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 out of ssa                         :   0.11 (  0%)   0.00 (  0%)   0.11 (  0%)     253 kB (  0%)
 expand vars                        :   0.12 (  0%)   0.00 (  0%)   0.12 (  0%)    5803 kB (  0%)
 expand                             :   0.68 (  1%)   0.02 (  0%)   0.75 (  1%)  129150 kB (  7%)
 post expand cleanups               :   0.09 (  0%)   0.00 (  0%)   0.03 (  0%)    1400 kB (  0%)
 varconst                           :   0.01 (  0%)   0.01 (  0%)   0.01 (  0%)      13 kB (  0%)
 lower subreg                       :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      63 kB (  0%)
 forward prop                       :   0.32 (  0%)   0.01 (  0%)   0.34 (  0%)    7384 kB (  0%)
 CSE                                :   1.03 (  1%)   0.02 (  0%)   0.95 (  1%)    4656 kB (  0%)
 dead code elimination              :   0.23 (  0%)   0.00 (  0%)   0.22 (  0%)       0 kB (  0%)
 dead store elim1                   :   0.40 (  0%)   0.00 (  0%)   0.34 (  0%)    5665 kB (  0%)
 dead store elim2                   :   0.60 (  1%)   0.00 (  0%)   0.65 (  1%)    9079 kB (  0%)
 loop analysis                      :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 loop init                          :   1.31 (  1%)   0.05 (  1%)   1.64 (  2%)    5802 kB (  0%)
 loop invariant motion              :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      19 kB (  0%)
 loop fini                          :   0.02 (  0%)   0.01 (  0%)   0.04 (  0%)       0 kB (  0%)
 CPROP                              :   1.27 (  1%)   0.01 (  0%)   1.14 (  1%)   30881 kB (  2%)
 PRE                                :   0.61 (  1%)   0.00 (  0%)   0.59 (  1%)    1920 kB (  0%)
 CSE 2                              :   0.57 (  1%)   0.01 (  0%)   0.58 (  1%)    2822 kB (  0%)
 branch prediction                  :   0.08 (  0%)   0.01 (  0%)   0.10 (  0%)     887 kB (  0%)
 combiner                           :   1.15 (  1%)   0.00 (  0%)   1.28 (  1%)   35520 kB (  2%)
 if-conversion                      :   0.24 (  0%)   0.00 (  0%)   0.22 (  0%)    5851 kB (  0%)
 integrated RA                      :   2.29 (  2%)   0.03 (  1%)   2.37 (  2%)   54041 kB (  3%)
 LRA non-specific                   :   0.97 (  1%)   0.01 (  0%)   1.04 (  1%)    5294 kB (  0%)
 LRA virtuals elimination           :   0.44 (  0%)   0.00 (  0%)   0.39 (  0%)    6089 kB (  0%)
 LRA reload inheritance             :   0.17 (  0%)   0.00 (  0%)   0.27 (  0%)    5783 kB (  0%)
 LRA create live ranges             :   1.07 (  1%)   0.00 (  0%)   1.09 (  1%)    1004 kB (  0%)
 LRA hard reg assignment            :   0.11 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
 LRA rematerialization              :   0.20 (  0%)   0.00 (  0%)   0.20 (  0%)       0 kB (  0%)
 reload                             :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 reload CSE regs                    :   0.90 (  1%)   0.01 (  0%)   0.80 (  1%)   13780 kB (  1%)
 ree                                :   0.13 (  0%)   0.00 (  0%)   0.10 (  0%)     589 kB (  0%)
 thread pro- & epilogue             :   0.51 (  1%)   0.01 (  0%)   0.57 (  1%)    2328 kB (  0%)
 if-conversion 2                    :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)     319 kB (  0%)
 combine stack adjustments          :   0.04 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 peephole 2                         :   0.12 (  0%)   0.00 (  0%)   0.18 (  0%)    1242 kB (  0%)
 hard reg cprop                     :   0.57 (  1%)   0.00 (  0%)   0.49 (  0%)     189 kB (  0%)
 scheduling 2                       :   2.53 (  3%)   0.03 (  1%)   2.53 (  2%)    5740 kB (  0%)
 machine dep reorg                  :   0.08 (  0%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
 reorder blocks                     :   0.74 (  1%)   0.01 (  0%)   0.69 (  1%)    6926 kB (  0%)
 shorten branches                   :   0.20 (  0%)   0.00 (  0%)   0.16 (  0%)       0 kB (  0%)
 final                              :   0.85 (  1%)   0.01 (  0%)   0.97 (  1%)  115151 kB (  6%)
 symout                             :   1.17 (  1%)   0.11 (  2%)   1.25 (  1%)  202121 kB ( 11%)
 variable tracking                  :   0.77 (  1%)   0.01 (  0%)   0.81 (  1%)   45792 kB (  2%)
 var-tracking dataflow              :   1.30 (  1%)   0.01 (  0%)   1.24 (  1%)     926 kB (  0%)
 var-tracking emit                  :   1.43 (  1%)   0.01 (  0%)   1.42 (  1%)   57281 kB (  3%)
 tree if-combine                    :   0.06 (  0%)   0.00 (  0%)   0.02 (  0%)     417 kB (  0%)
 uninit var analysis                :   0.03 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 straight-line strength reduction   :   0.04 (  0%)   0.00 (  0%)   0.03 (  0%)     525 kB (  0%)
 store merging                      :   0.04 (  0%)   0.00 (  0%)   0.03 (  0%)     492 kB (  0%)
 initialize rtl                     :   0.01 (  0%)   0.00 (  0%)   0.04 (  0%)      12 kB (  0%)
 address lowering                   :   0.04 (  0%)   0.00 (  0%)   0.02 (  0%)       2 kB (  0%)
 early local passes                 :   0.02 (  0%)   0.01 (  0%)   0.00 (  0%)       0 kB (  0%)
 unaccounted optimizations          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 rest of compilation                :   1.29 (  1%)   0.01 (  0%)   1.11 (  1%)    5063 kB (  0%)
 remove unused locals               :   0.25 (  0%)   0.04 (  1%)   0.25 (  0%)      37 kB (  0%)
 address taken                      :   0.11 (  0%)   0.10 (  2%)   0.25 (  0%)       0 kB (  0%)
 verify loop closed                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 verify RTL sharing                 :   5.24 (  5%)   0.05 (  1%)   5.37 (  5%)       0 kB (  0%)
 rebuild frequencies                :   0.04 (  0%)   0.00 (  0%)   0.06 (  0%)     621 kB (  0%)
 repair loop structures             :   0.17 (  0%)   0.00 (  0%)   0.24 (  0%)       0 kB (  0%)
 TOTAL                              :  98.82          4.83        104.24        1886632 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

real    1m54.934s
user    1m48.938s
sys     0m5.196s


Thank you
Giuliano.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2019-01-15 21:45               ` Giuliano Belinassi
@ 2019-01-16 12:44                 ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2019-01-16 12:44 UTC (permalink / raw)
  To: Giuliano Belinassi
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman, Gregory.Mounie

On Tue, Jan 15, 2019 at 10:45 PM Giuliano Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi
>
> I've managed to compile gimple-match.c with -ftime-report, and "phase opt and
> generate" seems to be what takes most of the compilation time. This is captured
> by the "TV_PHASE_OPT_GEN" timevar, and all its occurrences seem to be in
> toplev.c and lto.c.

TV_PHASE_OPT_GEN covers nearly everything besides parsing.  Thus all stuff
below "phase *" is covered by one of the phases.

It would probably be nice to split up TV_PHASE_OPT_GEN into GIMPLE,
IPA and RTL optimization phases.
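
In case a sketch helps, such a split would roughly mean defining new
phase timevars and bracketing the corresponding parts of the pipeline
with push/pop pairs.  TV_PHASE_GIMPLE_OPT is a made-up name here, and
the exact spot for the push/pop would still need to be worked out:

  /* gcc/timevar.def: define the new phase timevar.  */
  DEFTIMEVAR (TV_PHASE_GIMPLE_OPT, "phase GIMPLE optimization")

  /* Then, in the code currently accounted to TV_PHASE_OPT_GEN
     (toplev.c and lto.c), bracket the GIMPLE part with:  */
  timevar_push (TV_PHASE_GIMPLE_OPT);
  /* ... run the GIMPLE optimization passes ...  */
  timevar_pop (TV_PHASE_GIMPLE_OPT);

Note that the "phase" rows are supposed to sum up to the total, so the
phase push/pop pairs must not overlap each other.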

> Any idea which of the parts that this variable captures is the most
> costly? Also, is the percentage in the "GGC" column the amount of
> time spent inside the Garbage Collector?

The percentage for the GGC column is the percentage of total GGC
memory, not time.
See timevar.c:print_row

The most costly part of opt-and-generate is the various verifiers.
See the note printed
at the bottom:

> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.

You can get a clearer picture when you configure GCC with
--enable-checking=release.
For a quick start, passing -fno-checking will already disable the most
costly bits.
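
For reference, that would look like (paths and foo.c are placeholders):

  # quick variant: keep the current compiler, just disable the checks
  gcc -O2 -fno-checking -ftime-report -c foo.c

  # or configure a release-checking build from scratch
  /path/to/gcc/configure --enable-checking=release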

Richard.

>
> Time variable                                   usr           sys          wall               GGC
>  phase setup                        :   0.01 (  0%)   0.01 (  0%)   0.02 (  0%)    1473 kB (  0%)
>  phase parsing                      :   3.74 (  4%)   1.43 ( 30%)   5.17 (  5%)  294287 kB ( 16%)
>  phase lang. deferred               :   0.08 (  0%)   0.03 (  1%)   0.11 (  0%)    7582 kB (  0%)
>  phase opt and generate             :  94.10 ( 95%)   3.26 ( 67%)  97.46 ( 93%) 1543477 kB ( 82%)
>  phase last asm                     :   0.89 (  1%)   0.09 (  2%)   0.98 (  1%)   39802 kB (  2%)
>  phase finalize                     :   0.00 (  0%)   0.01 (  0%)   0.50 (  0%)       0 kB (  0%)
>  |name lookup                       :   0.42 (  0%)   0.12 (  2%)   0.46 (  0%)    6162 kB (  0%)
>  |overload resolution               :   0.37 (  0%)   0.13 (  3%)   0.42 (  0%)   18172 kB (  1%)
>  garbage collection                 :   2.99 (  3%)   0.03 (  1%)   3.02 (  3%)       0 kB (  0%)
>  dump files                         :   0.11 (  0%)   0.01 (  0%)   0.16 (  0%)       0 kB (  0%)
>  callgraph construction             :   0.35 (  0%)   0.01 (  0%)   0.24 (  0%)   61143 kB (  3%)
>  callgraph optimization             :   0.21 (  0%)   0.01 (  0%)   0.17 (  0%)     175 kB (  0%)
>  ipa function summary               :   0.12 (  0%)   0.00 (  0%)   0.14 (  0%)    2216 kB (  0%)
>  ipa dead code removal              :   0.04 (  0%)   0.01 (  0%)   0.00 (  0%)       0 kB (  0%)
>  ipa devirtualization               :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  ipa cp                             :   0.33 (  0%)   0.01 (  0%)   0.39 (  0%)    9073 kB (  0%)
>  ipa inlining heuristics            :   0.48 (  0%)   0.00 (  0%)   0.48 (  0%)    6175 kB (  0%)
>  ipa function splitting             :   0.10 (  0%)   0.01 (  0%)   0.07 (  0%)    9111 kB (  0%)
>  ipa comdats                        :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
>  ipa various optimizations          :   0.03 (  0%)   0.03 (  1%)   0.01 (  0%)     480 kB (  0%)
>  ipa reference                      :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
>  ipa profile                        :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  ipa pure const                     :   0.13 (  0%)   0.00 (  0%)   0.12 (  0%)       8 kB (  0%)
>  ipa icf                            :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)       6 kB (  0%)
>  ipa SRA                            :   1.26 (  1%)   0.28 (  6%)   1.78 (  2%)  165814 kB (  9%)
>  ipa free lang data                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
>  ipa free inline summary            :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
>  cfg construction                   :   0.09 (  0%)   0.00 (  0%)   0.09 (  0%)    7926 kB (  0%)
>  cfg cleanup                        :   1.84 (  2%)   0.00 (  0%)   1.73 (  2%)   13673 kB (  1%)
>  CFG verifier                       :   6.05 (  6%)   0.12 (  2%)   6.80 (  7%)       0 kB (  0%)
>  trivially dead code                :   0.32 (  0%)   0.01 (  0%)   0.38 (  0%)       0 kB (  0%)
>  df scan insns                      :   0.23 (  0%)   0.00 (  0%)   0.30 (  0%)      28 kB (  0%)
>  df multiple defs                   :   0.13 (  0%)   0.00 (  0%)   0.20 (  0%)       0 kB (  0%)
>  df reaching defs                   :   0.52 (  1%)   0.00 (  0%)   0.55 (  1%)       0 kB (  0%)
>  df live regs                       :   2.70 (  3%)   0.02 (  0%)   3.08 (  3%)     425 kB (  0%)
>  df live&initialized regs           :   1.28 (  1%)   0.00 (  0%)   1.13 (  1%)       0 kB (  0%)
>  df must-initialized regs           :   0.14 (  0%)   0.00 (  0%)   0.16 (  0%)       0 kB (  0%)
>  df use-def / def-use chains        :   0.32 (  0%)   0.00 (  0%)   0.26 (  0%)       0 kB (  0%)
>  df reg dead/unused notes           :   0.96 (  1%)   0.01 (  0%)   0.89 (  1%)   11726 kB (  1%)
>  register information               :   0.29 (  0%)   0.00 (  0%)   0.21 (  0%)       0 kB (  0%)
>  alias analysis                     :   0.54 (  1%)   0.00 (  0%)   0.53 (  1%)   17487 kB (  1%)
>  alias stmt walking                 :   1.10 (  1%)   0.08 (  2%)   1.22 (  1%)     118 kB (  0%)
>  register scan                      :   0.08 (  0%)   0.01 (  0%)   0.08 (  0%)     118 kB (  0%)
>  rebuild jump labels                :   0.12 (  0%)   0.01 (  0%)   0.11 (  0%)       0 kB (  0%)
>  preprocessing                      :   0.29 (  0%)   0.43 (  9%)   0.65 (  1%)   37409 kB (  2%)
>  parser (global)                    :   0.39 (  0%)   0.39 (  8%)   0.94 (  1%)   92661 kB (  5%)
>  parser struct body                 :   0.07 (  0%)   0.00 (  0%)   0.08 (  0%)    6159 kB (  0%)
>  parser enumerator list             :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)    3342 kB (  0%)
>  parser function body               :   2.37 (  2%)   0.43 (  9%)   2.82 (  3%)  119124 kB (  6%)
>  parser inl. func. body             :   0.18 (  0%)   0.05 (  1%)   0.16 (  0%)   10354 kB (  1%)
>  parser inl. meth. body             :   0.04 (  0%)   0.01 (  0%)   0.03 (  0%)    2986 kB (  0%)
>  template instantiation             :   0.17 (  0%)   0.08 (  2%)   0.26 (  0%)   15801 kB (  1%)
>  constant expression evaluation     :   0.06 (  0%)   0.05 (  1%)   0.07 (  0%)     516 kB (  0%)
>  early inlining heuristics          :   0.13 (  0%)   0.00 (  0%)   0.08 (  0%)   19547 kB (  1%)
>  inline parameters                  :   0.14 (  0%)   0.01 (  0%)   0.22 (  0%)    3372 kB (  0%)
>  integration                        :   1.00 (  1%)   0.23 (  5%)   1.22 (  1%)  132386 kB (  7%)
>  tree gimplify                      :   0.36 (  0%)   0.02 (  0%)   0.31 (  0%)   63162 kB (  3%)
>  tree eh                            :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)    4173 kB (  0%)
>  tree CFG construction              :   0.07 (  0%)   0.00 (  0%)   0.07 (  0%)   20805 kB (  1%)
>  tree CFG cleanup                   :   1.40 (  1%)   0.14 (  3%)   1.57 (  2%)    3995 kB (  0%)
>  tree tail merge                    :   0.17 (  0%)   0.01 (  0%)   0.16 (  0%)    7251 kB (  0%)
>  tree VRP                           :   1.94 (  2%)   0.08 (  2%)   1.83 (  2%)   40527 kB (  2%)
>  tree Early VRP                     :   0.27 (  0%)   0.03 (  1%)   0.30 (  0%)    3298 kB (  0%)
>  tree copy propagation              :   0.14 (  0%)   0.00 (  0%)   0.08 (  0%)     427 kB (  0%)
>  tree PTA                           :   0.61 (  1%)   0.03 (  1%)   0.53 (  1%)    3861 kB (  0%)
>  tree PHI insertion                 :   0.01 (  0%)   0.02 (  0%)   0.03 (  0%)    8529 kB (  0%)
>  tree SSA rewrite                   :   0.23 (  0%)   0.03 (  1%)   0.43 (  0%)   24334 kB (  1%)
>  tree SSA other                     :   0.10 (  0%)   0.01 (  0%)   0.10 (  0%)     538 kB (  0%)
>  tree SSA incremental               :   0.79 (  1%)   0.07 (  1%)   0.88 (  1%)   11828 kB (  1%)
>  tree operand scan                  :   1.33 (  1%)   0.30 (  6%)   1.51 (  1%)   56249 kB (  3%)
>  dominator optimization             :   1.92 (  2%)   0.07 (  1%)   1.90 (  2%)   31786 kB (  2%)
>  backwards jump threading           :   0.20 (  0%)   0.02 (  0%)   0.16 (  0%)    8676 kB (  0%)
>  tree SRA                           :   0.17 (  0%)   0.01 (  0%)   0.09 (  0%)    6050 kB (  0%)
>  isolate eroneous paths             :   0.01 (  0%)   0.00 (  0%)   0.04 (  0%)    1319 kB (  0%)
>  tree CCP                           :   0.67 (  1%)   0.08 (  2%)   0.62 (  1%)    4190 kB (  0%)
>  tree PHI const/copy prop           :   0.10 (  0%)   0.00 (  0%)   0.02 (  0%)     132 kB (  0%)
>  tree split crit edges              :   0.12 (  0%)   0.00 (  0%)   0.15 (  0%)   10236 kB (  1%)
>  tree reassociation                 :   0.14 (  0%)   0.00 (  0%)   0.08 (  0%)     168 kB (  0%)
>  tree PRE                           :   0.74 (  1%)   0.04 (  1%)   0.76 (  1%)   16728 kB (  1%)
>  tree FRE                           :   0.69 (  1%)   0.04 (  1%)   0.60 (  1%)    5370 kB (  0%)
>  tree code sinking                  :   0.06 (  0%)   0.01 (  0%)   0.06 (  0%)    9670 kB (  1%)
>  tree linearize phis                :   0.10 (  0%)   0.00 (  0%)   0.09 (  0%)     699 kB (  0%)
>  tree backward propagate            :   0.03 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  tree forward propagate             :   0.52 (  1%)   0.04 (  1%)   0.48 (  0%)    3055 kB (  0%)
>  tree phiprop                       :   0.05 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  tree conservative DCE              :   0.27 (  0%)   0.03 (  1%)   0.43 (  0%)    1557 kB (  0%)
>  tree aggressive DCE                :   0.21 (  0%)   0.04 (  1%)   0.23 (  0%)    2565 kB (  0%)
>  tree buildin call DCE              :   0.00 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
>  tree DSE                           :   0.18 (  0%)   0.01 (  0%)   0.18 (  0%)     274 kB (  0%)
>  PHI merge                          :   0.07 (  0%)   0.00 (  0%)   0.06 (  0%)    3170 kB (  0%)
>  tree loop optimization             :   0.00 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
>  loopless fn                        :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  tree loop invariant motion         :   0.03 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
>  tree canonical iv                  :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      58 kB (  0%)
>  complete unrolling                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     361 kB (  0%)
>  tree iv optimization               :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     128 kB (  0%)
>  tree copy headers                  :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)     414 kB (  0%)
>  tree SSA uncprop                   :   0.06 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
>  tree NRV optimization              :   0.01 (  0%)   0.00 (  0%)   0.05 (  0%)      14 kB (  0%)
>  tree SSA verifier                  :   8.44 (  9%)   0.26 (  5%)   8.77 (  8%)       0 kB (  0%)
>  tree STMT verifier                 :  12.57 ( 13%)   0.35 (  7%)  13.03 ( 12%)       0 kB (  0%)
>  tree switch conversion             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       5 kB (  0%)
>  tree switch lowering               :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)    1194 kB (  0%)
>  gimple CSE sin/cos                 :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  gimple widening/fma detection      :   0.06 (  0%)   0.00 (  0%)   0.03 (  0%)       2 kB (  0%)
>  tree strlen optimization           :   0.03 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
>  callgraph verifier                 :   0.93 (  1%)   0.07 (  1%)   0.99 (  1%)       0 kB (  0%)
>  dominance frontiers                :   0.14 (  0%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
>  dominance computation              :   1.98 (  2%)   0.05 (  1%)   2.17 (  2%)       0 kB (  0%)
>  control dependences                :   0.03 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  out of ssa                         :   0.11 (  0%)   0.00 (  0%)   0.11 (  0%)     253 kB (  0%)
>  expand vars                        :   0.12 (  0%)   0.00 (  0%)   0.12 (  0%)    5803 kB (  0%)
>  expand                             :   0.68 (  1%)   0.02 (  0%)   0.75 (  1%)  129150 kB (  7%)
>  post expand cleanups               :   0.09 (  0%)   0.00 (  0%)   0.03 (  0%)    1400 kB (  0%)
>  varconst                           :   0.01 (  0%)   0.01 (  0%)   0.01 (  0%)      13 kB (  0%)
>  lower subreg                       :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      63 kB (  0%)
>  forward prop                       :   0.32 (  0%)   0.01 (  0%)   0.34 (  0%)    7384 kB (  0%)
>  CSE                                :   1.03 (  1%)   0.02 (  0%)   0.95 (  1%)    4656 kB (  0%)
>  dead code elimination              :   0.23 (  0%)   0.00 (  0%)   0.22 (  0%)       0 kB (  0%)
>  dead store elim1                   :   0.40 (  0%)   0.00 (  0%)   0.34 (  0%)    5665 kB (  0%)
>  dead store elim2                   :   0.60 (  1%)   0.00 (  0%)   0.65 (  1%)    9079 kB (  0%)
>  loop analysis                      :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
>  loop init                          :   1.31 (  1%)   0.05 (  1%)   1.64 (  2%)    5802 kB (  0%)
>  loop invariant motion              :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      19 kB (  0%)
>  loop fini                          :   0.02 (  0%)   0.01 (  0%)   0.04 (  0%)       0 kB (  0%)
>  CPROP                              :   1.27 (  1%)   0.01 (  0%)   1.14 (  1%)   30881 kB (  2%)
>  PRE                                :   0.61 (  1%)   0.00 (  0%)   0.59 (  1%)    1920 kB (  0%)
>  CSE 2                              :   0.57 (  1%)   0.01 (  0%)   0.58 (  1%)    2822 kB (  0%)
>  branch prediction                  :   0.08 (  0%)   0.01 (  0%)   0.10 (  0%)     887 kB (  0%)
>  combiner                           :   1.15 (  1%)   0.00 (  0%)   1.28 (  1%)   35520 kB (  2%)
>  if-conversion                      :   0.24 (  0%)   0.00 (  0%)   0.22 (  0%)    5851 kB (  0%)
>  integrated RA                      :   2.29 (  2%)   0.03 (  1%)   2.37 (  2%)   54041 kB (  3%)
>  LRA non-specific                   :   0.97 (  1%)   0.01 (  0%)   1.04 (  1%)    5294 kB (  0%)
>  LRA virtuals elimination           :   0.44 (  0%)   0.00 (  0%)   0.39 (  0%)    6089 kB (  0%)
>  LRA reload inheritance             :   0.17 (  0%)   0.00 (  0%)   0.27 (  0%)    5783 kB (  0%)
>  LRA create live ranges             :   1.07 (  1%)   0.00 (  0%)   1.09 (  1%)    1004 kB (  0%)
>  LRA hard reg assignment            :   0.11 (  0%)   0.00 (  0%)   0.09 (  0%)       0 kB (  0%)
>  LRA rematerialization              :   0.20 (  0%)   0.00 (  0%)   0.20 (  0%)       0 kB (  0%)
>  reload                             :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
>  reload CSE regs                    :   0.90 (  1%)   0.01 (  0%)   0.80 (  1%)   13780 kB (  1%)
>  ree                                :   0.13 (  0%)   0.00 (  0%)   0.10 (  0%)     589 kB (  0%)
>  thread pro- & epilogue             :   0.51 (  1%)   0.01 (  0%)   0.57 (  1%)    2328 kB (  0%)
>  if-conversion 2                    :   0.08 (  0%)   0.00 (  0%)   0.08 (  0%)     319 kB (  0%)
>  combine stack adjustments          :   0.04 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
>  peephole 2                         :   0.12 (  0%)   0.00 (  0%)   0.18 (  0%)    1242 kB (  0%)
>  hard reg cprop                     :   0.57 (  1%)   0.00 (  0%)   0.49 (  0%)     189 kB (  0%)
>  scheduling 2                       :   2.53 (  3%)   0.03 (  1%)   2.53 (  2%)    5740 kB (  0%)
>  machine dep reorg                  :   0.08 (  0%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
>  reorder blocks                     :   0.74 (  1%)   0.01 (  0%)   0.69 (  1%)    6926 kB (  0%)
>  shorten branches                   :   0.20 (  0%)   0.00 (  0%)   0.16 (  0%)       0 kB (  0%)
>  final                              :   0.85 (  1%)   0.01 (  0%)   0.97 (  1%)  115151 kB (  6%)
>  symout                             :   1.17 (  1%)   0.11 (  2%)   1.25 (  1%)  202121 kB ( 11%)
>  variable tracking                  :   0.77 (  1%)   0.01 (  0%)   0.81 (  1%)   45792 kB (  2%)
>  var-tracking dataflow              :   1.30 (  1%)   0.01 (  0%)   1.24 (  1%)     926 kB (  0%)
>  var-tracking emit                  :   1.43 (  1%)   0.01 (  0%)   1.42 (  1%)   57281 kB (  3%)
>  tree if-combine                    :   0.06 (  0%)   0.00 (  0%)   0.02 (  0%)     417 kB (  0%)
>  uninit var analysis                :   0.03 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
>  straight-line strength reduction   :   0.04 (  0%)   0.00 (  0%)   0.03 (  0%)     525 kB (  0%)
>  store merging                      :   0.04 (  0%)   0.00 (  0%)   0.03 (  0%)     492 kB (  0%)
>  initialize rtl                     :   0.01 (  0%)   0.00 (  0%)   0.04 (  0%)      12 kB (  0%)
>  address lowering                   :   0.04 (  0%)   0.00 (  0%)   0.02 (  0%)       2 kB (  0%)
>  early local passes                 :   0.02 (  0%)   0.01 (  0%)   0.00 (  0%)       0 kB (  0%)
>  unaccounted optimizations          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
>  rest of compilation                :   1.29 (  1%)   0.01 (  0%)   1.11 (  1%)    5063 kB (  0%)
>  remove unused locals               :   0.25 (  0%)   0.04 (  1%)   0.25 (  0%)      37 kB (  0%)
>  address taken                      :   0.11 (  0%)   0.10 (  2%)   0.25 (  0%)       0 kB (  0%)
>  verify loop closed                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
>  verify RTL sharing                 :   5.24 (  5%)   0.05 (  1%)   5.37 (  5%)       0 kB (  0%)
>  rebuild frequencies                :   0.04 (  0%)   0.00 (  0%)   0.06 (  0%)     621 kB (  0%)
>  repair loop structures             :   0.17 (  0%)   0.00 (  0%)   0.24 (  0%)       0 kB (  0%)
>  TOTAL                              :  98.82          4.83        104.24        1886632 kB
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.
>
> real    1m54.934s
> user    1m48.938s
> sys     0m5.196s
>
>
> Thank you
> Giuliano.
>
> On 01/14, Richard Biener wrote:
> > On Mon, Jan 14, 2019 at 12:41 PM Giuliano Belinassi
> > <giuliano.belinassi@usp.br> wrote:
> > >
> > > Hi,
> > >
> > > I am currently studying the GIMPLE IR documentation and thinking about a
> > > way to easily gather the timing information. I was thinking about
> > > adding this feature to gcc to show/dump the elapsed time spent on GIMPLE.
> > > Does this make sense? Is this already implemented somewhere? Where would
> > > be a good place to start?
> >
> > There's -ftime-report which more-or-less tells you the time spent in the
> > individual passes.  I think there's no overall group to count GIMPLE
> > optimizers vs. RTL optimizers though.
> >
> > > Richard Biener: I would like to know what your nickname on IRC is :)
> >
> > It's richi.
> >
> > Richard.
> >
> > > Thank you,
> > > Giuliano.
> > >
> > > On 12/17, Richard Biener wrote:
> > > > On Wed, Dec 12, 2018 at 4:46 PM Giuliano Augusto Faulin Belinassi
> > > > <giuliano.belinassi@usp.br> wrote:
> > > > >
> > > > > Hi, I have some news. :-)
> > > > >
> > > > > I replicated the Martin Liška experiment [1] on a 64-core machine for
> > > > > gcc [2] and the Linux kernel [3] (the Linux kernel was fully parallelized),
> > > > > and I am excited to dive into this problem. As a result, I want to
> > > > > propose GSoC project on this issue, starting with something like:
> > > > >     1- Systematically create a benchmark for easy information
> > > > > gathering. Martin Liška already made the first version of it, but I
> > > > > need to improve it.
> > > > >     2- Find and document the global states (try to reduce gcc's
> > > > > global states as well).
> > > > >     3- Define the parallelization strategy.
> > > > >     4- First parallelization attempt.
> > > > >
> > > > > I also proposed this issue as a research project to my advisor and he
> > > > > supported me on this idea. So I can work for at least one year on
> > > > > this, and other things related to it.
> > > > >
> > > > > Would anyone be willing to mentor me on this?
> > > >
> > > > As the one who initially suggested the project I'm certainly willing
> > > > to mentor you on this.
> > > >
> > > > Richard.
> > > >
> > > > > [1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > > > [2] https://www.ime.usp.br/~belinass/64cores-experiment.svg
> > > > > [3] https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
> > > > > On Mon, Nov 19, 2018 at 8:53 AM Richard Biener
> > > > > <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 16, 2018 at 8:00 PM Giuliano Augusto Faulin Belinassi
> > > > > > <giuliano.belinassi@usp.br> wrote:
> > > > > > >
> > > > > > > Hi! Sorry for the late reply again :P
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 8:29 AM Richard Biener
> > > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Nov 14, 2018 at 10:47 PM Giuliano Augusto Faulin Belinassi
> > > > > > > > <giuliano.belinassi@usp.br> wrote:
> > > > > > > > >
> > > > > > > > > As a brief introduction, I am a graduate student that got interested
> > > > > > > > >
> > > > > > > > > in the "Parallelize the compilation using threads"(GSoC 2018 [1]). I
> > > > > > > > > am a newcommer in GCC, but already have sent some patches, some of
> > > > > > > > > them have already been accepted [2].
> > > > > > > > >
> > > > > > > > > I brought this subject up in IRC, but maybe here is a proper place to
> > > > > > > > > discuss this topic.
> > > > > > > > >
> > > > > > > > > From my point of view, parallelizing GCC itself will only speed up the
> > > > > > > > > compilation of projects which have a big file that creates a
> > > > > > > > > bottleneck in the whole project compilation (note: by big, I mean the
> > > > > > > > > amount of code to generate).
> > > > > > > >
> > > > > > > > That's true.  During GCC bootstrap there are some of those (see PR84402).
> > > > > > > >
> > > > > > >
> > > > > > > > One way to improve parallelism is to use link-time optimization where
> > > > > > > > even single source files can be split up into multiple link-time units.  But
> > > > > > > > then there's the serial whole-program analysis part.
> > > > > > >
> > > > > > > Did you mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402 ?
> > > > > > > That is a lot of data :-)
> > > > > > >
> > > > > > > It seems that 'phase opt and generate' is the most time-consuming
> > > > > > > part. Is that the 'GIMPLE optimization pipeline' you were talking
> > > > > > > about in this thread:
> > > > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00202.html
> > > > > >
> > > > > > It's everything that comes after the frontend parsing bits, thus this
> > > > > > includes in particular RTL optimization and early GIMPLE optimizations.
> > > > > >
> > > > > > > > > Additionally, I know that GCC must not
> > > > > > > > > change the project layout, but from the software engineering perspective,
> > > > > > > > > this may be a bad smell that indicates that the file should be broken
> > > > > > > > > into smaller files. Finally, the Makefiles will take care of the
> > > > > > > > > parallelization task.
> > > > > > > >
> > > > > > > > What do you mean by GCC must not change the project layout?  GCC
> > > > > > > > happily re-orders functions and link-time optimization will reorder
> > > > > > > > TUs (well, linking may as well).
> > > > > > > >
> > > > > > >
> > > > > > > That was a response to a comment made on IRC:
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 9:44 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> > > > > > > >I think this is in response to a comment I made on IRC. Giuliano said
> > > > > > > >that if a project has a very large file that dominates the total build
> > > > > > > >time, the file should be split up into smaller pieces. I said  "GCC
> > > > > > > >can't restructure people's code. it can only try to compile it
> > > > > > > >faster". We weren't referring to code transformations in the compiler
> > > > > > > >like re-ordering functions, but physically refactoring the source
> > > > > > > >code.
> > > > > > >
> > > > > > > Yes. But from one of the attachments to PR84402, it seems that such
> > > > > > > files exist in GCC:
> > > > > > > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43440
> > > > > > >
> > > > > > > > > My questions are:
> > > > > > > > >
> > > > > > > > >  1. Is there any project compilation that will significantly be improved
> > > > > > > > > if GCC runs in parallel? Do someone has data about something related
> > > > > > > > > to that? How about the Linux Kernel? If not, I can try to bring some.
> > > > > > > >
> > > > > > > > We do not have any data about this apart from experiments with
> > > > > > > > splitting up source files for PR84402.
> > > > > > > >
> > > > > > > > >  2. Did I correctly understand the goal of the parallelization? Can
> > > > > > > > > anyone provide extra details to me?
> > > > > > > >
> > > > > > > > You may want to search the mailing list archives since we had a
> > > > > > > > student application (later revoked) for the task with some discussion.
> > > > > > > >
> > > > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > > > getting GCC's global state documented and reduced.  The parallelization
> > > > > > > > itself is an interesting experiment but whether there will be any
> > > > > > > > substantial improvement for builds that can already benefit from make
> > > > > > > > parallelism remains a question.
> > > > > > >
> > > > > > > While I agree that documenting GCC's global states is good for the
> > > > > > > community and the development of GCC, I really don't think this is a good
> > > > > > > motivation for parallelizing a compiler from a research standpoint.
> > > > > >
> > > > > > True ;)  Note that my suggestions to the other GSoC student were
> > > > > > purely based on where it's easiest to experiment with parallelization
> > > > > > and not where it would be most beneficial.
> > > > > >
> > > > > > > There must be something or someone that could take advantage of the
> > > > > > > fine-grained parallelism. But that data from PR84402 seems to have the
> > > > > > > answer to it. :-)
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 4:07 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
> > > > > > > >
> > > > > > > > On 15/11/18 10:29, Richard Biener wrote:
> > > > > > > > > In my view (I proposed the thing) the most interesting parts are
> > > > > > > > > getting GCC's global state documented and reduced.  The parallelization
> > > > > > > > > itself is an interesting experiment but whether there will be any
> > > > > > > > > substantial improvement for builds that can already benefit from make
> > > > > > > > > parallelism remains a question.
> > > > > > > >
> > > > > > > > in the common case (project with many small files, much more than
> > > > > > > > core count) i'd expect a regression:
> > > > > > > >
> > > > > > > > if gcc itself tries to parallelize that introduces inter thread
> > > > > > > > synchronization and potential false sharing in gcc (e.g. malloc
> > > > > > > > locks) that does not exist with make parallelism (glibc can avoid
> > > > > > > > some atomic instructions when a process is single threaded).
> > > > > > >
> > > > > > > That is what I am mostly worried about. That, or the most costly
> > > > > > > part is not parallelizable at all. Also, I would expect a regression
> > > > > > > on very small files, which could probably be avoided by putting this
> > > > > > > feature behind a flag?
> > > > > >
> > > > > > I think the issue should be avoided by avoiding fine-grained parallelism.
> > > > > > Which might be somewhat hard given there are core data structures that
> > > > > > are shared (the memory allocator for a start).
> > > > > >
> > > > > > The other issue I am more worried about is that we probably have to
> > > > > > interact with make somehow so that we do not end up with 64 threads
> > > > > > when one does -j8 on an 8-core machine.  That's basically the same
> > > > > > issue we run into with -flto and its threaded WPA writeout or recursive
> > > > > > invocation of make.
> > > > > >
> > > > > > >
> > > > > > > On Fri, Nov 16, 2018 at 11:05 AM Martin Jambor <mjambor@suse.cz> wrote:
> > > > > > > >
> > > > > > > > Hi Giuliano,
> > > > > > > >
> > > > > > > > On Thu, Nov 15 2018, Richard Biener wrote:
> > > > > > > > > You may want to search the mailing list archives since we had a
> > > > > > > > > student application (later revoked) for the task with some discussion.
> > > > > > > >
> > > > > > > > Specifically, the whole thread beginning with
> > > > > > > > https://gcc.gnu.org/ml/gcc/2018-03/msg00179.html
> > > > > > > >
> > > > > > > > Martin
> > > > > > > >
> > > > > > >
> > > > > > > Yes, I will research this carefully ;-)
> > > > > > >
> > > > > > > Thank you

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2018-11-19 14:36     ` Richard Biener
  2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
@ 2019-02-07 14:14       ` Giuliano Belinassi
  2019-02-11 21:46       ` Giuliano Belinassi
  2 siblings, 0 replies; 20+ messages in thread
From: Giuliano Belinassi @ 2019-02-07 14:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

Hi,

Since gimple-match.c takes so long to compile, I was wondering whether
it might be possible to reorder the compilation so that its compilation
is scheduled early in the dependency graph.

I've attached a graph showing what I mean, together with the
methodology, to PR84402 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402).

Maybe there is a simple change that can be made to the Makefile? Or
maybe a feature could be added to Make itself that records the elapsed
time for each file and computes a better schedule for the next build?
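
For instance, here is a minimal GNU Make sketch of the reordering idea
(the variable names and the link rule are purely illustrative, not
GCC's actual Makefile; under -jN, GNU Make starts prerequisites roughly
in listing order):

    # Illustrative fragment only: list the objects known to dominate
    # build time first, so "make -jN" starts compiling them first and
    # the critical path shrinks.  Names besides gimple-match.o are
    # placeholders.
    SLOW_OBJS = gimple-match.o insn-recog.o
    FAST_OBJS = $(filter-out $(SLOW_OBJS),$(ALL_OBJS))

    # Recipe lines must begin with a literal tab.
    cc1: $(SLOW_OBJS) $(FAST_OBJS)
            $(CXX) -o $@ $^ $(LIBS)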

Giuliano.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2018-11-19 14:36     ` Richard Biener
  2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
  2019-02-07 14:14       ` Giuliano Belinassi
@ 2019-02-11 21:46       ` Giuliano Belinassi
  2019-02-12 14:12         ` Richard Biener
  2 siblings, 1 reply; 20+ messages in thread
From: Giuliano Belinassi @ 2019-02-11 21:46 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

Hi,

I was just wondering which API I should use to spawn threads and control
their flow. Should I use OpenMP, pthreads, or something else?

My concern is that we might break compatibility with something. If we use
OpenMP, I'm afraid that we will break compatibility with compilers that do
not support it. On the other hand, if we use pthreads, we will break
compatibility with non-POSIX systems (Windows).
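
For concreteness, the OpenMP route would look roughly like this
(n_functions, funcs and compile_function are hypothetical placeholders,
and it would require -fopenmp support from the host compiler):

    /* Sketch only: hand each function to a worker thread.  */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_functions; i++)
      compile_function (funcs[i]);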

Giuliano.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2019-02-11 21:46       ` Giuliano Belinassi
@ 2019-02-12 14:12         ` Richard Biener
  2019-02-16  4:35           ` Oleg Endo
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2019-02-12 14:12 UTC (permalink / raw)
  To: Giuliano Belinassi; +Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

On Mon, Feb 11, 2019 at 10:46 PM Giuliano Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> Hi,
>
> I was just wondering which API I should use to spawn threads and control
> their flow. Should I use OpenMP, pthreads, or something else?
>
> My concern is that we might break compatibility with something. If we use
> OpenMP, I'm afraid that we will break compatibility with compilers that do
> not support it. On the other hand, if we use pthreads, we will break
> compatibility with non-POSIX systems (Windows).

I'm not sure we have a thread abstraction for the host - we do have
one for the target via libgcc's gthr.h, though.  For prototyping I'd resort
to this same interface and fix up the host != target case as needed.
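
Something like the following is the rough shape of it (a sketch only,
assuming a POSIX host where gthr.h forwards to pthreads; 'worker' just
stands in for whatever unit of work a thread would own):

    /* Hypothetical prototyping skeleton using libgcc's gthr.h.  */
    #include "gthr.h"

    static void *
    worker (void *arg)
    {
      /* ... per-thread compilation work would go here ...  */
      return arg;
    }

    static void
    spawn_and_join (int nthreads)
    {
      __gthread_t tid[16];
      if (nthreads > 16)
        nthreads = 16;
      for (int i = 0; i < nthreads; i++)
        __gthread_create (&tid[i], worker, 0);
      for (int i = 0; i < nthreads; i++)
        __gthread_join (tid[i], 0);
    }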

Richard.

>
> Giuliano.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Parallelize the compilation using Threads
  2019-02-12 14:12         ` Richard Biener
@ 2019-02-16  4:35           ` Oleg Endo
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Endo @ 2019-02-16  4:35 UTC (permalink / raw)
  To: Richard Biener, Giuliano Belinassi
  Cc: GCC Development, kernel-usp, gold, Alfredo Goldman

On Tue, 2019-02-12 at 15:12 +0100, Richard Biener wrote:
> On Mon, Feb 11, 2019 at 10:46 PM Giuliano Belinassi
> <giuliano.belinassi@usp.br> wrote:
> > 
> > Hi,
> > 
> > I was just wondering which API I should use to spawn threads and
> > control their flow. Should I use OpenMP, pthreads, or something else?
> > 
> > My concern is that we might break compatibility with something. If we
> > use OpenMP, I'm afraid that we will break compatibility with compilers
> > that do not support it. On the other hand, if we use pthreads, we will
> > break compatibility with non-POSIX systems (Windows).
> 
> I'm not sure we have a thread abstraction for the host - we do have
> one for the target via libgcc's gthr.h, though.  For prototyping I'd
> resort to this same interface and fix up the host != target case as
> needed.

Or maybe, in the year 2019, we could assume that most C++ compilers
used to compile GCC support C++11 and come with an adequate
<thread> implementation...  yeah, I know, sounds jacked :)
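
E.g. a minimal sketch of what plain <thread> would buy us (the work
items are placeholders; only standard C++11 is assumed):

    // Sketch only: fan placeholder work items out over the hardware
    // threads, joining in batches so we never oversubscribe.
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    void
    run_parallel (const std::vector<std::function<void ()>> &work)
    {
      unsigned n = std::max (1u, std::thread::hardware_concurrency ());
      std::vector<std::thread> pool;
      for (std::size_t i = 0; i < work.size (); ++i)
        {
          pool.emplace_back (work[i]);
          if (pool.size () == n)
            {
              for (std::thread &t : pool)
                t.join ();
              pool.clear ();
            }
        }
      for (std::thread &t : pool)
        t.join ();
    }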

Cheers,
Oleg

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-02-16  4:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-15 10:12 Parallelize the compilation using Threads Giuliano Augusto Faulin Belinassi
2018-11-15 11:44 ` Richard Biener
2018-11-15 15:54   ` Jonathan Wakely
2018-11-15 18:07   ` Jeff Law
2018-11-15 18:36   ` Szabolcs Nagy
2018-11-16 14:25   ` Martin Jambor
2018-11-16 22:40   ` Giuliano Augusto Faulin Belinassi
2018-11-19 14:36     ` Richard Biener
2018-12-12 15:46       ` Giuliano Augusto Faulin Belinassi
2018-12-13  8:12         ` Bin.Cheng
2018-12-14 14:15           ` Giuliano Belinassi
2018-12-17 11:06         ` Richard Biener
2019-01-14 11:42           ` Giuliano Belinassi
2019-01-14 12:23             ` Richard Biener
2019-01-15 21:45               ` Giuliano Belinassi
2019-01-16 12:44                 ` Richard Biener
2019-02-07 14:14       ` Giuliano Belinassi
2019-02-11 21:46       ` Giuliano Belinassi
2019-02-12 14:12         ` Richard Biener
2019-02-16  4:35           ` Oleg Endo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).