public inbox for gcc@gcc.gnu.org
* GSOC Question about the parallelization project
@ 2018-03-16 16:25 Sebastiaan Peters
  2018-03-19  9:20 ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-16 16:25 UTC (permalink / raw)
  To: gcc

Hello,

My name is Sebastiaan Peters, currently an undergrad compsci student from The Netherlands.

I'm interested in the project about parallelizing the compilation with threads.

My main background is in C#, but I have some experience with C, C++ and x86 assembly.
As for my knowledge of compilers, that is mostly limited to the first few chapters of the Dragon Book.
I have built GCC as a cross compiler for my (very small) toy OS.

Currently I have 3 questions:

1. Would my background knowledge be sufficient for this project?
2. What is currently the recommended reading for learning the gcc internals?
3. When it comes down to the different compiler phases,
which phase would benefit most from parallelization?

Kind regards,

Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-16 16:25 GSOC Question about the parallelization project Sebastiaan Peters
@ 2018-03-19  9:20 ` Richard Biener
  2018-03-19 15:28   ` Sebastiaan Peters
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-19  9:20 UTC (permalink / raw)
  To: Sebastiaan Peters; +Cc: gcc

On Fri, Mar 16, 2018 at 5:25 PM, Sebastiaan Peters
<SebPeters@outlook.com> wrote:
> Hello,
>
> My name is Sebastiaan Peters, currently an undergrad compsci student from The Netherlands.
>
> I'm interested in the project about parallelizing the compilation with threads.

Thanks for your interest in this project.

> My main background is in C#, but I have some experience with C, C++ and x86 assembly.
> As for my knowledge of compilers, that is mostly limited to the first few chapters of the Dragon Book.
> I have built GCC as a cross compiler for my (very small) toy OS.
>
> Currently I have 3 questions:
>
> 1. Would my background knowledge be sufficient for this project?

It's hard to tell.  It would surely be helpful to have some background in
how to manage shared state between threads.

> 2. What is currently the recommended reading for learning the gcc internals?

There's the internals manual; for this specific project, the source tree
organization and the Passes and Files of the Compiler sections might prove
useful.  Note that our internals documentation is often out of date, so the
best reference is always the source.

> 3. When it comes down to the different compiler phases,
> which phase would benefit most from parallelization?

I expect the only part that can be reasonably parallelized is the GIMPLE
optimization pipeline, as that has the least amount of shared state.  There
may be opportunities in the analysis stages of IPA passes as well, but we
do not spend a lot of time there.

Parallelizing the GIMPLE stage may not tackle the most expensive parts
of the compilation - if you look at non-optimizing builds what takes the
most time is lexing, parsing and register allocation.

Richard.

> Kind regards,
>
> Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-19  9:20 ` Richard Biener
@ 2018-03-19 15:28   ` Sebastiaan Peters
  2018-03-19 17:37     ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-19 15:28 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Thank you for your quick response.

Does the GIMPLE optimization pipeline include only the Tree SSA passes or also the RTL passes?

Are there currently other parts of the compiler that have been parallelized?

Kind regards,

Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-19 15:28   ` Sebastiaan Peters
@ 2018-03-19 17:37     ` Richard Biener
  2018-03-19 19:09       ` Sebastiaan Peters
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-19 17:37 UTC (permalink / raw)
  To: Sebastiaan Peters; +Cc: gcc

On March 19, 2018 4:27:58 PM GMT+01:00, Sebastiaan Peters <sebpeters@outlook.com> wrote:
>Thank you for your quick response.
>
>Does the GIMPLE optimization pipeline include only the Tree SSA passes
>or also the RTL passes?

Yes, it includes only the Tree SSA passes.  The RTL part of the pipeline
hasn't been audited to work with multiple functions in RTL form at the same time.

The only parallelized part of the compiler is the LTO bytecode write-out at
the WPA stage, which is done in a "fork-and-forget" mode.

The goal should be to extend TU-wise parallelism via make to function-wise
parallelism within GCC.

Richard. 

>Are there currently other parts of the compiler that have been
>parallelized?
>
>Kind regards,
>
>Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-19 17:37     ` Richard Biener
@ 2018-03-19 19:09       ` Sebastiaan Peters
  2018-03-19 20:55         ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-19 19:09 UTC (permalink / raw)
  To: Richard Biener, Sebastiaan Peters; +Cc: gcc

>The goal should be to extend TU-wise parallelism via make to function-wise parallelism within GCC.

Could you please elaborate more on this?
________________________________
From: Richard Biener <richard.guenther@gmail.com>
Sent: Monday, March 19, 2018 18:37
To: Sebastiaan Peters
Cc: gcc@gcc.gnu.org
Subject: Re: GSOC Question about the parallelization project

On March 19, 2018 4:27:58 PM GMT+01:00, Sebastiaan Peters <sebpeters@outlook.com> wrote:
>Thank you for your quick response.
>
>Does the GIMPLE optimization pipeline include only the Tree SSA passes
>or also the RTL passes?

Yes, it includes only the Tree SSA passes. The RTL part of the pipeline hasn't been audited to work with multiple functions in RTL form at the same time.

The only parallelized part of the compiler is the LTO bytecode write-out at the WPA stage, which is done in a "fork-and-forget" mode.

The goal should be to extend TU-wise parallelism via make to function-wise parallelism within GCC.

Richard.

>Are there currently other parts of the compiler that have been
>parallelized?
>
>Kind regards,
>
>Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-19 19:09       ` Sebastiaan Peters
@ 2018-03-19 20:55         ` Richard Biener
  2018-03-20 13:02           ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-19 20:55 UTC (permalink / raw)
  To: Sebastiaan Peters, Sebastiaan Peters; +Cc: gcc

On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe97@hotmail.com> wrote:
>>The goal should be to extend TU wise parallelism via make to function
>wise parallelism within GCC.
>
>Could you please elaborate more on this?

In the abstract sense you'd view the compilation process as separated into N stages, with each function being processed by each stage. You'd assign a thread to each stage and move the work items (the functions) across the set of threads, honoring constraints such as an IPA stage needing all functions to have completed the previous stage. That allows you to model the constraints due to shared state (like no pass operating on two functions at the same time) more easily than a model where you assign a thread to each function.

You'll figure that the easiest point in the pipeline to try this 'pipelining' is after IPA has completed and before RTL is generated.

Ideally the pipelining would start as early as the front ends finish parsing a function, and ideally we'd have multiple functions in the RTL pipeline.

The main obstacle will be the global state in the compiler, of which there is the least during the GIMPLE passes (mostly cfun and current_function_decl, plus globals in the individual passes, which are easiest dealt with by not allowing a single pass to run in multiple threads at the same time). TLS can be used for some of the global state, and of course some global data structures need locking.

Richard. 

>________________________________
>From: Richard Biener <richard.guenther@gmail.com>
>Sent: Monday, March 19, 2018 18:37
>To: Sebastiaan Peters
>Cc: gcc@gcc.gnu.org
>Subject: Re: GSOC Question about the parallelization project
>
>On March 19, 2018 4:27:58 PM GMT+01:00, Sebastiaan Peters
><sebpeters@outlook.com> wrote:
>>Thank you for your quick response.
>>
>>Does the GIMPLE optimization pipeline include only the Tree SSA passes
>>or also the RTL passes?
>
>Yes, it includes only the Tree SSA passes. The RTL part of the
>pipeline hasn't been audited to work with multiple functions in RTL
>form at the same time.
>
>The only parallelized part of the compiler is LTO byte code write-out
>at WPA stage which is done in a "fork-and-forget" mode.
>
>The goal should be to extend TU wise parallelism via make to function
>wise parallelism within GCC.
>
>Richard.
>
>>Are the currently other parts of the compiler that have been
>>parallelized?
>>
>>Kind regards,
>>
>>Sebastiaan Peters

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-19 20:55         ` Richard Biener
@ 2018-03-20 13:02           ` Richard Biener
  2018-03-20 14:49             ` David Malcolm
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-20 13:02 UTC (permalink / raw)
  To: Sebastiaan Peters, Sebastiaan Peters; +Cc: gcc

On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe97@hotmail.com> wrote:
>>>The goal should be to extend TU wise parallelism via make to function
>>wise parallelism within GCC.
>>
>>Could you please elaborate more on this?
>
> In the abstract sense you'd view the compilation process separated into N stages, each function being processed by each. You'd assign a thread to each stage and move the work items (the functions) across the set of threads honoring constraints such as an IPA stage needing all functions completed the previous stage. That allows you to easier model the constraints due to shared state (like no pass operating on two functions at the same time) compared to a model where you assign a thread to each function.
>
> You'll figure that the easiest point in the pipeline to try this 'pipelining' is after IPA has completed and until RTL is generated.
>
> Ideally the pipelining would start as early as the front ends finished parsing a function and ideally we'd have multiple functions in the RTL pipeline.
>
> The main obstacles will be the global state in the compiler of which there is the least during the GIMPLE passes (mostly cfun and current_function_decl plus globals in the individual passes which is easiest dealt with by not allowing a single pass to run at the same time in multiple threads). TLS can be used for some of the global state plus of course some global data structures need locking.

Oh, and just to mention - there are a few things that may block adoption in
the end, like whether builds are still reproducible (we allocate things like
DECL_UID from global pools, and doing that somewhat randomly because of
threading might - but need not - change code generation).  Or some diagnostics
will appear in non-deterministic order, or dump files are messed up (both
issues could be solved by code dealing with the issue, like buffering and
doing a re-play in program order).  I guess reproducibility is important when
it comes down to debugging code-generation issues - I'd prefer to debug gcc
when it doesn't run threaded, but if that doesn't reproduce an issue that's bad.

So the most important "milestone" of this project is to identify such issues and
document them somewhere.

Richard.

> Richard.
>
>>________________________________
>>From: Richard Biener <richard.guenther@gmail.com>
>>Sent: Monday, March 19, 2018 18:37
>>To: Sebastiaan Peters
>>Cc: gcc@gcc.gnu.org
>>Subject: Re: GSOC Question about the parallelization project
>>
>>On March 19, 2018 4:27:58 PM GMT+01:00, Sebastiaan Peters
>><sebpeters@outlook.com> wrote:
>>>Thank you for your quick response.
>>>
>>>Does the GIMPLE optimization pipeline include only the Tree SSA passes
>>>or also the RTL passes?
>>
>>Yes, it only includes only Tree SSA passes. The RTL part of the
>>pipeline hasn't been audited to work with multiple functions in RTL
>>Form in the same time.
>>
>>The only parallelized part of the compiler is LTO byte code write-out
>>at WPA stage which is done in a "fork-and-forget" mode.
>>
>>The goal should be to extend TU wise parallelism via make to function
>>wise parallelism within GCC.
>>
>>Richard.
>>
>>>Are the currently other parts of the compiler that have been
>>>parallelized?
>>>
>>>Kind regards,
>>>
>>>Sebastiaan Peters
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-20 13:02           ` Richard Biener
@ 2018-03-20 14:49             ` David Malcolm
  2018-03-20 15:07               ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: David Malcolm @ 2018-03-20 14:49 UTC (permalink / raw)
  To: Richard Biener, Sebastiaan Peters, Sebastiaan Peters; +Cc: gcc

On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
> > 7@hotmail.com> wrote:
> > > > The goal should be to extend TU wise parallelism via make to
> > > > function
> > > 
> > > wise parallelism within GCC.
> > > 
> > > Could you please elaborate more on this?
> > 
> > In the abstract sense you'd view the compilation process separated
> > into N stages, each function being processed by each. You'd assign
> > a thread to each stage and move the work items (the functions)
> > across the set of threads honoring constraints such as an IPA stage
> > needing all functions completed the previous stage. That allows you
> > to easier model the constraints due to shared state (like no pass
> > operating on two functions at the same time) compared to a model
> > where you assign a thread to each function.
> > 
> > You'll figure that the easiest point in the pipeline to try this
> > 'pipelining' is after IPA has completed and until RTL is generated.
> > 
> > Ideally the pipelining would start as early as the front ends
> > finished parsing a function and ideally we'd have multiple
> > functions in the RTL pipeline.
> > 
> > The main obstacles will be the global state in the compiler of
> > which there is the least during the GIMPLE passes (mostly cfun and
> > current_function_decl plus globals in the individual passes which
> > is easiest dealt with by not allowing a single pass to run at the
> > same time in multiple threads). TLS can be used for some of the
> > global state plus of course some global data structures need
> > locking.
> 
> Oh, and just to mention - there are a few things that may block
> adoption in the end
> like whether builds are still reproducible (we allocate things like
> DECL_UID from
> global pools and doing that somewhat randomly because of threading
> might - but not
> must - change code generation).  Or that some diagnostics will appear
> in
> non-deterministic order, or that dump files are messed up (both
> issues could be
> solved by code dealing with the issue, like buffering and doing a re-
> play in
> program order).  I guess reproducability is important when it comes
> down to
> debugging code-generation issues - I'd prefer to debug gcc when it
> doesn't run
> threaded but if that doesn't reproduce an issue that's bad.
> 
> So the most important "milestone" of this project is to identify such
> issues and
> document them somewhere.

One issue would be the garbage-collector: there are plenty of places in
GCC that have hidden assumptions that "a collection can't happen here"
(where we have temporaries that reference GC-managed objects, but which
aren't tracked by GC-roots).

I had some patches for that back in 2014 that I think I managed to drop
on the floor (sorry):
  https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
  https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
  https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html

The GC's allocator is used almost everywhere, and is probably not
thread-safe yet.

FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
it's five years out-of-date, but maybe is still relevant in places?
  https://dmalcolm.fedorapeople.org/gcc/global-state/
  https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
(I tackled this for libgccjit by instead introducing a mutex, a "big
compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
thread is calling into the rest of the compiler sources).

Hope this is helpful
Dave

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-20 14:49             ` David Malcolm
@ 2018-03-20 15:07               ` Richard Biener
  2018-03-21  8:51                 ` Sebastiaan Peters
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-20 15:07 UTC (permalink / raw)
  To: David Malcolm; +Cc: Sebastiaan Peters, Sebastiaan Peters, gcc

On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>> > 7@hotmail.com> wrote:
>> > > > The goal should be to extend TU wise parallelism via make to
>> > > > function
>> > >
>> > > wise parallelism within GCC.
>> > >
>> > > Could you please elaborate more on this?
>> >
>> > In the abstract sense you'd view the compilation process separated
>> > into N stages, each function being processed by each. You'd assign
>> > a thread to each stage and move the work items (the functions)
>> > across the set of threads honoring constraints such as an IPA stage
>> > needing all functions completed the previous stage. That allows you
>> > to easier model the constraints due to shared state (like no pass
>> > operating on two functions at the same time) compared to a model
>> > where you assign a thread to each function.
>> >
>> > You'll figure that the easiest point in the pipeline to try this
>> > 'pipelining' is after IPA has completed and until RTL is generated.
>> >
>> > Ideally the pipelining would start as early as the front ends
>> > finished parsing a function and ideally we'd have multiple
>> > functions in the RTL pipeline.
>> >
>> > The main obstacles will be the global state in the compiler of
>> > which there is the least during the GIMPLE passes (mostly cfun and
>> > current_function_decl plus globals in the individual passes which
>> > is easiest dealt with by not allowing a single pass to run at the
>> > same time in multiple threads). TLS can be used for some of the
>> > global state plus of course some global data structures need
>> > locking.
>>
>> Oh, and just to mention - there are a few things that may block
>> adoption in the end
>> like whether builds are still reproducible (we allocate things like
>> DECL_UID from
>> global pools and doing that somewhat randomly because of threading
>> might - but not
>> must - change code generation).  Or that some diagnostics will appear
>> in
>> non-deterministic order, or that dump files are messed up (both
>> issues could be
>> solved by code dealing with the issue, like buffering and doing a re-
>> play in
>> program order).  I guess reproducability is important when it comes
>> down to
>> debugging code-generation issues - I'd prefer to debug gcc when it
>> doesn't run
>> threaded but if that doesn't reproduce an issue that's bad.
>>
>> So the most important "milestone" of this project is to identify such
>> issues and
>> document them somewhere.
>
> One issue would be the garbage-collector: there are plenty of places in
> GCC that have hidden assumptions that "a collection can't happen here"
> (where we have temporaries that reference GC-managed objects, but which
> aren't tracked by GC-roots).
>
> I had some patches for that back in 2014 that I think I managed to drop
> on the floor (sorry):
>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>
> The GC's allocator is used almost everywhere, and is probably not
> thread-safe yet.

Yes.  There's also global tree modification, like chaining new
pointer types into TYPE_POINTER_TO and friends, so some
helpers in tree.c need to be guarded as well.

> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
> it's five years out-of-date, but maybe is still relevant in places?
>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
> (I tackled this for libgccjit by instead introducing a mutex, a "big
> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
> thread is calling into the rest of the compiler sources).
>
> Hope this is helpful
> Dave
>
> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-20 15:07               ` Richard Biener
@ 2018-03-21  8:51                 ` Sebastiaan Peters
  2018-03-21 10:34                   ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-21  8:51 UTC (permalink / raw)
  To: Richard Biener, David Malcolm; +Cc: Sebastiaan Peters, gcc

>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>> > 7@hotmail.com> wrote:
>>> > > > The goal should be to extend TU wise parallelism via make to
>>> > > > function
>>> > >
>>> > > wise parallelism within GCC.
>>> > >
>>> > > Could you please elaborate more on this?
>>> >
>>> > In the abstract sense you'd view the compilation process separated
>>> > into N stages, each function being processed by each. You'd assign
>>> > a thread to each stage and move the work items (the functions)
>>> > across the set of threads honoring constraints such as an IPA stage
>>> > needing all functions completed the previous stage. That allows you
>>> > to easier model the constraints due to shared state (like no pass
>>> > operating on two functions at the same time) compared to a model
>>> > where you assign a thread to each function.
>>> >
>>> > You'll figure that the easiest point in the pipeline to try this
>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>> >
>>> > Ideally the pipelining would start as early as the front ends
>>> > finished parsing a function and ideally we'd have multiple
>>> > functions in the RTL pipeline.
>>> >
>>> > The main obstacles will be the global state in the compiler of
>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>> > current_function_decl plus globals in the individual passes which
>>> > is easiest dealt with by not allowing a single pass to run at the
>>> > same time in multiple threads). TLS can be used for some of the
>>> > global state plus of course some global data structures need
>>> > locking.

This would mean that all the passes have to be individually analyzed for which
global state they use, and locked/scheduled accordingly?

If this is the case, is there any documentation that describes the prerequisites for each pass?
I have looked at the internal documentation, but it seems that all of this still has to be created?

As to how this would be implemented, it seems the easiest way would be to extend
the macros to accept a prerequisite pass. This would encourage more documentation,
since the dependencies become explicit instead of the current implicit ordering.

Assuming the dependencies for all the tree-ssa passes have to be individually analyzed,
currently I have this as my timeline:
    - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa passes (mid-may - early june)
    - Test for possible reproducibility issues for the binary/debug info (early june - late june)
    - Parallelize the rest of tree-ssa (late june - late july)
    - Update documentation and test again for reproducibility issues (late july - early august)

Would this be acceptable?

>>> Oh, and just to mention - there are a few things that may block
>>> adoption in the end
>>> like whether builds are still reproducible (we allocate things like
>>> DECL_UID from
>>> global pools and doing that somewhat randomly because of threading
>>> might - but not
>>> must - change code generation).  Or that some diagnostics will appear
>>> in
>>> non-deterministic order, or that dump files are messed up (both
>>> issues could be
>>> solved by code dealing with the issue, like buffering and doing a re-
>>> play in
>>> program order).  I guess reproducability is important when it comes
>>> down to
>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>> doesn't run
>>> threaded but if that doesn't reproduce an issue that's bad.
>>>
>>> So the most important "milestone" of this project is to identify such
>>> issues and
>>> document them somewhere.
>>
>> One issue would be the garbage-collector: there are plenty of places in
>> GCC that have hidden assumptions that "a collection can't happen here"
>> (where we have temporaries that reference GC-managed objects, but which
>> aren't tracked by GC-roots).
>>
>> I had some patches for that back in 2014 that I think I managed to drop
>> on the floor (sorry):
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html

Would there be a way to easily create a static analyzer to find these untracked temporaries?

A quick look at the registered passes makes me count ~135 tree-ssa passes,
so your code for analyzing which globals are referenced where might come in
handy when analyzing whether passes are easily parallelized.

>> The GC's allocator is used almost everywhere, and is probably not
>> thread-safe yet.
>Yes.  There's also global tree modification like chaining new
>pointer types into TYPE_POINTER_TO and friends so some
>helpers in tree.c need to be guarded as well.
>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>> it's five years out-of-date, but maybe is still relevant in places?
>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>> thread is calling into the rest of the compiler sources).
>>
>> Hope this is helpful
>> Dave
>>
>> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-21  8:51                 ` Sebastiaan Peters
@ 2018-03-21 10:34                   ` Richard Biener
  2018-03-21 19:04                     ` Sebastiaan Peters
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-21 10:34 UTC (permalink / raw)
  To: Sebastiaan Peters; +Cc: David Malcolm, Sebastiaan Peters, gcc

On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
<sebaspe97@hotmail.com> wrote:
>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>>> > 7@hotmail.com> wrote:
>>>> > > > The goal should be to extend TU wise parallelism via make to
>>>> > > > function
>>>> > >
>>>> > > wise parallelism within GCC.
>>>> > >
>>>> > > Could you please elaborate more on this?
>>>> >
>>>> > In the abstract sense you'd view the compilation process separated
>>>> > into N stages, each function being processed by each. You'd assign
>>>> > a thread to each stage and move the work items (the functions)
>>>> > across the set of threads honoring constraints such as an IPA stage
>>>> > needing all functions completed the previous stage. That allows you
>>>> > to easier model the constraints due to shared state (like no pass
>>>> > operating on two functions at the same time) compared to a model
>>>> > where you assign a thread to each function.
>>>> >
>>>> > You'll figure that the easiest point in the pipeline to try this
>>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>>> >
>>>> > Ideally the pipelining would start as early as the front ends
>>>> > finished parsing a function and ideally we'd have multiple
>>>> > functions in the RTL pipeline.
>>>> >
>>>> > The main obstacles will be the global state in the compiler of
>>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>>> > current_function_decl plus globals in the individual passes which
>>>> > is easiest dealt with by not allowing a single pass to run at the
>>>> > same time in multiple threads). TLS can be used for some of the
>>>> > global state plus of course some global data structures need
>>>> > locking.
>
> This would mean that all the passes have to be individually analyzed for
> which global state
> they use, and lock/schedule them accordingly?

Their private global state would be exempt, by ensuring that a pass never
runs twice at the same time.

The global state that remains should be the same for all passes we are talking
about (during the late GIMPLE optimization phase which I'd tackle).

> If this is the case, is there any documentation that describes the pre-reqs
> for each pass?
> I have looked at the internal documentation, however it seems that all of
> this still has to be created?

The prereqs are actually the same and not very well documented (if at all).
There's the global GC memory pool where we allocate statements and
stuff like that from (and luckily statements themselves are function-private).
Then there's global trees like types ('int') where modification needs to be
appropriately guarded.  Note that "modification" means for example
building a new type for the address of 'int', given that all the different
pointer types to 'int' are chained in a list rooted in the tree for 'int'.
That means (a few?) tree-building helpers need to be guarded with locks.
I don't have a great idea how to identify those apart from knowing them in
advance or running into races ... my gut feeling is that there's not a lot
to guard but I may be wrong ;)

> As to how this would be implemented, it seems the easiest way would be to
> extend the macros to
> accept a pre-req pass. This would encourage more documentation since the
> dependencies
> become explicit instead of the current implicit ordering.

Actually the order is quite explicit.  Maybe I now understand your question -
no, passes do not "communicate" with each other via global state; all such
state is per function, and the execution order of passes on a given function
is hard-coded in passes.def.

> Assuming the dependencies for the all the tree-ssa passes have to be
> individually analyzed.
> Currently I have this as my timeline:
>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
> passes (mid-may - early june)
>     - Test for possible reproducibility issues for the binary/debug info
> (early june - late june)
>     - Parallelize the rest of tree-ssa (late june - late july)
>     - Update documentation and test again for reproducibility issues (late
> july - early august)
>
> Would this be acceptable?

Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
understand what "Parallelize the rest of tree-ssa" means, though.  If
you want to tackle a tiny bit more, I'd rather include "RTL" by treating
the whole RTL part as a single "pass" (as said, only one function at a
time can be in RTL right now).

>>>> Oh, and just to mention - there are a few things that may block
>>>> adoption in the end
>>>> like whether builds are still reproducible (we allocate things like
>>>> DECL_UID from
>>>> global pools and doing that somewhat randomly because of threading
>>>> might - but not
>>>> must - change code generation).  Or that some diagnostics will appear
>>>> in
>>>> non-deterministic order, or that dump files are messed up (both
>>>> issues could be
>>>> solved by code dealing with the issue, like buffering and doing a re-
>>>> play in
>>>> program order).  I guess reproducability is important when it comes
>>>> down to
>>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>>> doesn't run
>>>> threaded but if that doesn't reproduce an issue that's bad.
>>>>
>>>> So the most important "milestone" of this project is to identify such
>>>> issues and
>>>> document them somewhere.
>>>
>>> One issue would be the garbage-collector: there are plenty of places in
>>> GCC that have hidden assumptions that "a collection can't happen here"
>>> (where we have temporaries that reference GC-managed objects, but which
>>> aren't tracked by GC-roots).
>>>
>>> I had some patches for that back in 2014 that I think I managed to drop
>>> on the floor (sorry):
>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>
> Would there be a way to easily create a static analyzer to find these
> untracked temporaries?

I don't think so.

> A quick look at registered passes makes me count ~135 tree-ssa passes,
> So your code on analyzing what globals are referenced where might come in
> handy while analyzing if passes are easily parallelized.

I think the "solution" to the garbage-collection issue is simply to keep
it serialized.  Collection currently happens on demand anyway, at certain
safe collection points, so the work scheduler can simply hold off
scheduling more work when the collector wants to run, waiting for
running jobs to complete.
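A deterministic toy model of that scheduling rule follows.  The class and
method names are invented for illustration (in GCC the actual collection
entry point is ggc_collect); the point is only the protocol: no new work
while a collection is pending, and the collection runs serially once all
in-flight jobs have drained.

```cpp
#include <cassert>

// Toy scheduler: GC stays serialized.  When the collector wants to run,
// dispatch stops, in-flight jobs drain, the collection runs alone, and
// only then does dispatch resume.
class work_scheduler
{
  int in_flight = 0;
  bool gc_wanted = false;
  int collections = 0;

public:
  // Returns true if a new job may start now.
  bool try_dispatch ()
  {
    if (gc_wanted)
      return false;  // hold off scheduling until after the collection
    ++in_flight;
    return true;
  }

  void request_collection ()
  {
    gc_wanted = true;
    maybe_collect ();
  }

  void job_done ()
  {
    --in_flight;
    maybe_collect ();
  }

  int collection_count () const { return collections; }

private:
  void maybe_collect ()
  {
    if (gc_wanted && in_flight == 0)
      {
        ++collections;  // the "safe collection point": runs serially
        gc_wanted = false;
      }
  }
};
```

A real implementation would add a mutex and condition variable around this
state; the single-threaded model above just makes the ordering explicit.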

Richard.

>>> The GC's allocator is used almost everywhere, and is probably not
>>> thread-safe yet.
>>Yes.  There's also global tree modification like chaining new
>>pointer types into TYPE_POINTER_TO and friends so some
>>helpers in tree.c need to be guarded as well.
>>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>>> it's five years out-of-date, but maybe is still relevant in places?
>>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>>> thread is calling into the rest of the compiler sources).
>>>
>>> Hope this is helpful
>>> Dave
>>>
>>> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-21 10:34                   ` Richard Biener
@ 2018-03-21 19:04                     ` Sebastiaan Peters
  2018-03-22 12:27                       ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-21 19:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: David Malcolm, Sebastiaan Peters, gcc

>On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
><sebaspe97@hotmail.com> wrote:
>>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
>>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>>>> > 7@hotmail.com> wrote:
>>>>> > > > The goal should be to extend TU wise parallelism via make to
>>>>> > > > function
>>>>> > >
>>>>> > > wise parallelism within GCC.
>>>>> > >
>>>>> > > Could you please elaborate more on this?
>>>>> >
>>>>> > In the abstract sense you'd view the compilation process separated
>>>>> > into N stages, each function being processed by each. You'd assign
>>>>> > a thread to each stage and move the work items (the functions)
>>>>> > across the set of threads honoring constraints such as an IPA stage
>>>>> > needing all functions completed the previous stage. That allows you
>>>>> > to easier model the constraints due to shared state (like no pass
>>>>> > operating on two functions at the same time) compared to a model
>>>>> > where you assign a thread to each function.
>>>>> >
>>>>> > You'll figure that the easiest point in the pipeline to try this
>>>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>>>> >
>>>>> > Ideally the pipelining would start as early as the front ends
>>>>> > finished parsing a function and ideally we'd have multiple
>>>>> > functions in the RTL pipeline.
>>>>> >
>>>>> > The main obstacles will be the global state in the compiler of
>>>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>>>> > current_function_decl plus globals in the individual passes which
>>>>> > is easiest dealt with by not allowing a single pass to run at the
>>>>> > same time in multiple threads). TLS can be used for some of the
>>>>> > global state plus of course some global data structures need
>>>>> > locking.
>>
>> This would mean that all the passes have to be individually analyzed for
>> which global state
>> they use, and lock/schedule them accordingly?
>
>Their private global state would be excempt by assuring that a pass never
>runs twice at the same time.
>
>The global state that remains should be the same for all passes we are talking
>about (during the late GIMPLE optimization phase which I'd tackle).
>
>> If this is the case, is there any documentation that describes the pre-reqs
>> for each pass?
>> I have looked at the internal documentation, however it seems that all of
>> this still has to be created?
>
>The prereqs are actually the same and not very well documented (if at all).
>There's the global GC memory pool where we allocate statements and
>stuff like that from (and luckyly statements themselves are function private).
>Then there's global trees like types ('int') where modification needs to be
>appropriately guarded.  Note that "modification" means for example
>building a new type for the address of 'int' given that all different
>pointer types
>to 'int' are chained in a list rooted in the tree for 'int'.  That
>means (a few?)
>tree building helpers need to be guarded with locks.  I don't have a great
>idea how to identify those apart from knowing them in advance or running
>into races ... my gut feeling is that there's not a lot to guard but I may
>be wrong ;)

What does it mean to be a node of a type tree?
Does it describe information about that type,
or does it keep a reference to where something
of that type has been declared?

>> As to how this would be implemented, it seems the easiest way would be to
>> extend the macros to
>> accept a pre-req pass. This would encourage more documentation since the
>> dependencies
>> become explicit instead of the current implicit ordering.
>
>Actually the order is quite explicit.  Maybe I now understand your
>question - no,
>passes do not "communicate" between each other via global state, all such
>state is per function and the execution order of passes on a given function
>is hard-coded in passes.def.
>
>> Assuming the dependencies for the all the tree-ssa passes have to be
>> individually analyzed.
>> Currently I have this as my timeline:
>>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
>> passes (mid-may - early june)
>>     - Test for possible reproducibility issues for the binary/debug info
>> (early june - late june)
>>     - Parallelize the rest of tree-ssa (late june - late july)
>>     - Update documentation and test again for reproducibility issues (late
>> july - early august)
>>
>> Would this be acceptable?
>
>Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
>understand what "Parallelize the rest of tree-ssa" means though.  If
>you want to tackle a tiny bit more I'd rather include "RTL" by treating
>the whole RTL part as a single "pass" (as said only one function can
>be in RTL right now).
>

I was under the assumption that passes had to be individually analysed
when I wrote that. The timeline is now updated to include some extra
time at the beginning to familiarize myself a bit more with the project,
and RTL has been added as an optional deliverable. :)

I added a draft of the proposal[0]; any feedback would be highly appreciated.

>>>>> Oh, and just to mention - there are a few things that may block
>>>>> adoption in the end
>>>>> like whether builds are still reproducible (we allocate things like
>>>>> DECL_UID from
>>>>> global pools and doing that somewhat randomly because of threading
>>>>> might - but not
>>>>> must - change code generation).  Or that some diagnostics will appear
>>>>> in
>>>>> non-deterministic order, or that dump files are messed up (both
>>>>> issues could be
>>>>> solved by code dealing with the issue, like buffering and doing a re-
>>>>> play in
>>>>> program order).  I guess reproducability is important when it comes
>>>>> down to
>>>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>>>> doesn't run
>>>>> threaded but if that doesn't reproduce an issue that's bad.
>>>>>
>>>>> So the most important "milestone" of this project is to identify such
>>>>> issues and
>>>>> document them somewhere.
>>>>
>>>> One issue would be the garbage-collector: there are plenty of places in
>>>> GCC that have hidden assumptions that "a collection can't happen here"
>>>> (where we have temporaries that reference GC-managed objects, but which
>>>> aren't tracked by GC-roots).
>>>>
>>>> I had some patches for that back in 2014 that I think I managed to drop
>>>> on the floor (sorry):
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>>
>> Would there be a way to easily create a static analyzer to find these
>> untracked temporaries?
>
>I don't think so.
>
>> A quick look at registered passes makes me count ~135 tree-ssa passes,
>> So your code on analyzing what globals are referenced where might come in
>> handy while analyzing if passes are easily parallelized.
>
>I think the "solution" to the garbage collecting issue is to simply keep
>that serialized.  It's currently done on-demand anyway at certain
>safe collection points so the work scheduler can simply hold off
>scheduling more work when the collector would want to run, waiting for
>running jobs to complete.
>
>Richard.
>
>>>> The GC's allocator is used almost everywhere, and is probably not
>>>> thread-safe yet.
>>>Yes.  There's also global tree modification like chaining new
>>>pointer types into TYPE_POINTER_TO and friends so some
>>>helpers in tree.c need to be guarded as well.
>>>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>>>> it's five years out-of-date, but maybe is still relevant in places?
>>>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>>>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>>>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>>>> thread is calling into the rest of the compiler sources).
>>>>
>>>> Hope this is helpful
>>>> Dave
>>>>
>>>> 

[0] https://docs.google.com/document/d/1YkKYI3J-pKFfHLdxljWVoJ89l2oVGibxslIW3ikluz8/edit?usp=sharing

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-21 19:04                     ` Sebastiaan Peters
@ 2018-03-22 12:27                       ` Richard Biener
  2018-03-23 15:05                         ` Sebastiaan Peters
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2018-03-22 12:27 UTC (permalink / raw)
  To: Sebastiaan Peters; +Cc: David Malcolm, Sebastiaan Peters, gcc

On Wed, Mar 21, 2018 at 8:04 PM, Sebastiaan Peters
<sebaspe97@hotmail.com> wrote:
>>On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
>><sebaspe97@hotmail.com> wrote:
>>>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>>>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>>>>> > 7@hotmail.com> wrote:
>>>>>> > > > The goal should be to extend TU wise parallelism via make to
>>>>>> > > > function
>>>>>> > >
>>>>>> > > wise parallelism within GCC.
>>>>>> > >
>>>>>> > > Could you please elaborate more on this?
>>>>>> >
>>>>>> > In the abstract sense you'd view the compilation process separated
>>>>>> > into N stages, each function being processed by each. You'd assign
>>>>>> > a thread to each stage and move the work items (the functions)
>>>>>> > across the set of threads honoring constraints such as an IPA stage
>>>>>> > needing all functions completed the previous stage. That allows you
>>>>>> > to easier model the constraints due to shared state (like no pass
>>>>>> > operating on two functions at the same time) compared to a model
>>>>>> > where you assign a thread to each function.
>>>>>> >
>>>>>> > You'll figure that the easiest point in the pipeline to try this
>>>>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>>>>> >
>>>>>> > Ideally the pipelining would start as early as the front ends
>>>>>> > finished parsing a function and ideally we'd have multiple
>>>>>> > functions in the RTL pipeline.
>>>>>> >
>>>>>> > The main obstacles will be the global state in the compiler of
>>>>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>>>>> > current_function_decl plus globals in the individual passes which
>>>>>> > is easiest dealt with by not allowing a single pass to run at the
>>>>>> > same time in multiple threads). TLS can be used for some of the
>>>>>> > global state plus of course some global data structures need
>>>>>> > locking.
>>>
>>> This would mean that all the passes have to be individually analyzed for
>>> which global state
>>> they use, and lock/schedule them accordingly?
>>
>>Their private global state would be excempt by assuring that a pass never
>>runs twice at the same time.
>>
>>The global state that remains should be the same for all passes we are talking
>>about (during the late GIMPLE optimization phase which I'd tackle).
>>
>>> If this is the case, is there any documentation that describes the pre-reqs
>>> for each pass?
>>> I have looked at the internal documentation, however it seems that all of
>>> this still has to be created?
>>
>>The prereqs are actually the same and not very well documented (if at all).
>>There's the global GC memory pool where we allocate statements and
>>stuff like that from (and luckyly statements themselves are function private).
>>Then there's global trees like types ('int') where modification needs to be
>>appropriately guarded.  Note that "modification" means for example
>>building a new type for the address of 'int' given that all different
>>pointer types
>>to 'int' are chained in a list rooted in the tree for 'int'.  That
>>means (a few?)
>>tree building helpers need to be guarded with locks.  I don't have a great
>>idea how to identify those apart from knowing them in advance or running
>>into races ... my gut feeling is that there's not a lot to guard but I may
>>be wrong ;)
>
> What does it mean to be a node of a type tree?
> Does it describe information about that type,
> or does it keep a reference to where something
> of that type has been declared?

On the GIMPLE level we have two main kinds of data objects: 'gimple'
(and derived classes) for statements, and 'tree' for operands, types and
declarations.  'tree' is also what GENERIC is represented in, the IL that
gets produced by the frontends and is lowered to GIMPLE by the gimplifier.

In GENERIC everything is a 'tree' (there's a class hierarchy of trees
implemented in C ways via unions): types are trees, and so are decls
and expressions.  You can look at tree.h (and tree-core.h for
implementation details) and, for example, see

#define TYPE_POINTER_TO(NODE) (TYPE_CHECK (NODE)->type_common.pointer_to)

where tree.c:build_pointer_type_for_mode does

  /* First, if we already have a type for pointers to TO_TYPE and it's
     the proper mode, use it.  */
  for (t = TYPE_POINTER_TO (to_type); t; t = TYPE_NEXT_PTR_TO (t))
    if (TYPE_MODE (t) == mode && TYPE_REF_CAN_ALIAS_ALL (t) == can_alias_all)
      return t;

so if a pass builds a new pointer type that doesn't already exist, we
modify this list.  This means an existing type tree is modified even if
the type itself isn't.
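To make the hazard concrete, here is a toy model of that unguarded
check-then-insert (the struct and helper names are simplified stand-ins for
TYPE_POINTER_TO / TYPE_NEXT_PTR_TO, not GCC's real code).  If two threads
each perform the lookup before either inserts, the chain ends up with
duplicate pointer types.

```cpp
#include <cassert>

// Toy tree node: each type keeps a chain of the pointer types built
// to it, mirroring the TYPE_POINTER_TO list discussed above.
struct type_node
{
  int mode;               // stand-in for TYPE_MODE
  type_node *pointer_to;  // head of the chained pointer-type list
  type_node *next_ptr_to; // next pointer type in that chain
};

// Unguarded lookup, fine in today's single-threaded compiler.
static type_node *
find_pointer_type (type_node *to_type, int mode)
{
  for (type_node *t = to_type->pointer_to; t; t = t->next_ptr_to)
    if (t->mode == mode)
      return t;
  return nullptr;
}

// Unguarded insert: prepends a new pointer type onto TO_TYPE's chain,
// mutating the pointed-to type's tree even though the type itself is
// unchanged.
static type_node *
chain_pointer_type (type_node *to_type, int mode)
{
  type_node *t = new type_node {mode, nullptr, to_type->pointer_to};
  to_type->pointer_to = t;
  return t;
}

static int
chain_length (type_node *to_type)
{
  int n = 0;
  for (type_node *t = to_type->pointer_to; t; t = t->next_ptr_to)
    ++n;
  return n;
}
```

Replaying the bad interleaving by hand (both threads look up, both miss,
both insert) shows the duplicate entry a lock would prevent.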

>>> As to how this would be implemented, it seems the easiest way would be to
>>> extend the macros to
>>> accept a pre-req pass. This would encourage more documentation since the
>>> dependencies
>>> become explicit instead of the current implicit ordering.
>>
>>Actually the order is quite explicit.  Maybe I now understand your
>>question - no,
>>passes do not "communicate" between each other via global state, all such
>>state is per function and the execution order of passes on a given function
>>is hard-coded in passes.def.
>>
>>> Assuming the dependencies for the all the tree-ssa passes have to be
>>> individually analyzed.
>>> Currently I have this as my timeline:
>>>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
>>> passes (mid-may - early june)
>>>     - Test for possible reproducibility issues for the binary/debug info
>>> (early june - late june)
>>>     - Parallelize the rest of tree-ssa (late june - late july)
>>>     - Update documentation and test again for reproducibility issues (late
>>> july - early august)
>>>
>>> Would this be acceptable?
>>
>>Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
>>understand what "Parallelize the rest of tree-ssa" means though.  If
>>you want to tackle a tiny bit more I'd rather include "RTL" by treating
>>the whole RTL part as a single "pass" (as said only one function can
>>be in RTL right now).
>>
>
> I was under the assumption that passes had to be indivdually analysed
> when I wrote that. The timeline is now updated to include some extra
> time in the beginning to familiarize myself a bit more with the project,
> and added the RTL as an optional deliverable:)
>
> I added a draft of the proposal[0], any feedback would be highly appreciated.

The proposal looks good to me!

Thanks,
Richard.

>>>>>> Oh, and just to mention - there are a few things that may block
>>>>>> adoption in the end
>>>>>> like whether builds are still reproducible (we allocate things like
>>>>>> DECL_UID from
>>>>>> global pools and doing that somewhat randomly because of threading
>>>>>> might - but not
>>>>>> must - change code generation).  Or that some diagnostics will appear
>>>>>> in
>>>>>> non-deterministic order, or that dump files are messed up (both
>>>>>> issues could be
>>>>>> solved by code dealing with the issue, like buffering and doing a re-
>>>>>> play in
>>>>>> program order).  I guess reproducability is important when it comes
>>>>>> down to
>>>>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>>>>> doesn't run
>>>>>> threaded but if that doesn't reproduce an issue that's bad.
>>>>>>
>>>>>> So the most important "milestone" of this project is to identify such
>>>>>> issues and
>>>>>> document them somewhere.
>>>>>
>>>>> One issue would be the garbage-collector: there are plenty of places in
>>>>> GCC that have hidden assumptions that "a collection can't happen here"
>>>>> (where we have temporaries that reference GC-managed objects, but which
>>>>> aren't tracked by GC-roots).
>>>>>
>>>>> I had some patches for that back in 2014 that I think I managed to drop
>>>>> on the floor (sorry):
>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>>>
>>> Would there be a way to easily create a static analyzer to find these
>>> untracked temporaries?
>>
>>I don't think so.
>>
>>> A quick look at registered passes makes me count ~135 tree-ssa passes,
>>> So your code on analyzing what globals are referenced where might come in
>>> handy while analyzing if passes are easily parallelized.
>>
>>I think the "solution" to the garbage collecting issue is to simply keep
>>that serialized.  It's currently done on-demand anyway at certain
>>safe collection points so the work scheduler can simply hold off
>>scheduling more work when the collector would want to run, waiting for
>>running jobs to complete.
>>
>>Richard.
>>
>>>>> The GC's allocator is used almost everywhere, and is probably not
>>>>> thread-safe yet.
>>>>Yes.  There's also global tree modification like chaining new
>>>>pointer types into TYPE_POINTER_TO and friends so some
>>>>helpers in tree.c need to be guarded as well.
>>>>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>>>>> it's five years out-of-date, but maybe is still relevant in places?
>>>>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>>>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>>>>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>>>>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>>>>> thread is calling into the rest of the compiler sources).
>>>>>
>>>>> Hope this is helpful
>>>>> Dave
>>>>>
>>>>>
>
> [0] https://docs.google.com/document/d/1YkKYI3J-pKFfHLdxljWVoJ89l2oVGibxslIW3ikluz8/edit?usp=sharing

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: GSOC Question about the parallelization project
  2018-03-22 12:27                       ` Richard Biener
@ 2018-03-23 15:05                         ` Sebastiaan Peters
  0 siblings, 0 replies; 14+ messages in thread
From: Sebastiaan Peters @ 2018-03-23 15:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: David Malcolm, Sebastiaan Peters, gcc

>On Wed, Mar 21, 2018 at 8:04 PM, Sebastiaan Peters
><sebaspe97@hotmail.com> wrote:
>>>On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
>>><sebaspe97@hotmail.com> wrote:
>>>>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>>>>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>>>>>> > 7@hotmail.com> wrote:
>>>>>>> > > > The goal should be to extend TU wise parallelism via make to
>>>>>>> > > > function
>>>>>>> > >
>>>>>>> > > wise parallelism within GCC.
>>>>>>> > >
>>>>>>> > > Could you please elaborate more on this?
>>>>>>> >
>>>>>>> > In the abstract sense you'd view the compilation process separated
>>>>>>> > into N stages, each function being processed by each. You'd assign
>>>>>>> > a thread to each stage and move the work items (the functions)
>>>>>>> > across the set of threads honoring constraints such as an IPA stage
>>>>>>> > needing all functions completed the previous stage. That allows you
>>>>>>> > to easier model the constraints due to shared state (like no pass
>>>>>>> > operating on two functions at the same time) compared to a model
>>>>>>> > where you assign a thread to each function.
>>>>>>> >
>>>>>>> > You'll figure that the easiest point in the pipeline to try this
>>>>>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>>>>>> >
>>>>>>> > Ideally the pipelining would start as early as the front ends
>>>>>>> > finished parsing a function and ideally we'd have multiple
>>>>>>> > functions in the RTL pipeline.
>>>>>>> >
>>>>>>> > The main obstacles will be the global state in the compiler of
>>>>>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>>>>>> > current_function_decl plus globals in the individual passes which
>>>>>>> > is easiest dealt with by not allowing a single pass to run at the
>>>>>>> > same time in multiple threads). TLS can be used for some of the
>>>>>>> > global state plus of course some global data structures need
>>>>>>> > locking.
>>>>
>>>> This would mean that all the passes have to be individually analyzed for
>>>> which global state
>>>> they use, and lock/schedule them accordingly?
>>>
>>>Their private global state would be excempt by assuring that a pass never
>>>runs twice at the same time.
>>>
>>>The global state that remains should be the same for all passes we are talking
>>>about (during the late GIMPLE optimization phase which I'd tackle).
>>>
>>>> If this is the case, is there any documentation that describes the pre-reqs
>>>> for each pass?
>>>> I have looked at the internal documentation, however it seems that all of
>>>> this still has to be created?
>>>
>>>The prereqs are actually the same and not very well documented (if at all).
>>>There's the global GC memory pool where we allocate statements and
>>>stuff like that from (and luckyly statements themselves are function private).
>>>Then there's global trees like types ('int') where modification needs to be
>>>appropriately guarded.  Note that "modification" means for example
>>>building a new type for the address of 'int' given that all different
>>>pointer types
>>>to 'int' are chained in a list rooted in the tree for 'int'.  That
>>>means (a few?)
>>>tree building helpers need to be guarded with locks.  I don't have a great
>>>idea how to identify those apart from knowing them in advance or running
>>>into races ... my gut feeling is that there's not a lot to guard but I may
>>>be wrong ;)
>>
>> What does it mean to be a node of a type tree?
>> Does it describe information about that type,
>> or does it keep a reference to where something
>> of that type has been declared?
>
>On the GIMPLE level we have two main kinds of data objects, one is
>'gimple' (and derived clases) for statements and 'tree' for operands,
>types, declarations.  'tree's is also what GENERIC is represented in, the
>IL that gets produced by the frontends which is lowered to GIMPLE by
>the gimplifier.
>
>On GENERIC everything is a 'tree' (there's a class hierarchy of trees
>implemented in C ways via unions), types are, and so are decls
>and expressions (in GENERIC).  You can look at tree.h (and tree-core.h
>for implementation details) and for example see
>
>#define TYPE_POINTER_TO(NODE) (TYPE_CHECK (NODE)->type_common.pointer_to)
>
>where tree.c:build_pointer_type_for_mode does
>
>  /* First, if we already have a type for pointers to TO_TYPE and it's
>     the proper mode, use it.  */
>  for (t = TYPE_POINTER_TO (to_type); t; t = TYPE_NEXT_PTR_TO (t))
>    if (TYPE_MODE (t) == mode && TYPE_REF_CAN_ALIAS_ALL (t) == can_alias_all)
>      return t;
>
>so if a pass builds a new pointer type that doesn't already exist we
>modify this list.
>This means an existing type tree is modified even if the type itself
>isn't modified.
>
>>>> As to how this would be implemented, it seems the easiest way would be to
>>>> extend the macros to
>>>> accept a pre-req pass. This would encourage more documentation since the
>>>> dependencies
>>>> become explicit instead of the current implicit ordering.
>>>
>>>Actually the order is quite explicit.  Maybe I now understand your
>>>question - no,
>>>passes do not "communicate" between each other via global state, all such
>>>state is per function and the execution order of passes on a given function
>>>is hard-coded in passes.def.
>>>
>>>> Assuming the dependencies for the all the tree-ssa passes have to be
>>>> individually analyzed.
>>>> Currently I have this as my timeline:
>>>>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
>>>> passes (mid-may - early june)
>>>>     - Test for possible reproducibility issues for the binary/debug info
>>>> (early june - late june)
>>>>     - Parallelize the rest of tree-ssa (late june - late july)
>>>>     - Update documentation and test again for reproducibility issues (late
>>>> july - early august)
>>>>
>>>> Would this be acceptable?
>>>
>>>Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
>>>understand what "Parallelize the rest of tree-ssa" means though.  If
>>>you want to tackle a tiny bit more I'd rather include "RTL" by treating
>>>the whole RTL part as a single "pass" (as said only one function can
>>>be in RTL right now).
>>>
>>
>> I was under the assumption that passes had to be indivdually analysed
>> when I wrote that. The timeline is now updated to include some extra
>> time in the beginning to familiarize myself a bit more with the project,
>> and added the RTL as an optional deliverable:)
>>
>> I added a draft of the proposal[0], any feedback would be highly appreciated.
>
>The poposal looks good to me!
>
>Thanks,
>Richard.

Awesome, I'll submit this in the coming days.

Thanks for all your help; it has helped tremendously.

I'm looking forward to (hopefully) working with you this summer.

Kind regards,

Sebastiaan Peters

>>>>>>> Oh, and just to mention - there are a few things that may block
>>>>>>> adoption in the end
>>>>>>> like whether builds are still reproducible (we allocate things like
>>>>>>> DECL_UID from
>>>>>>> global pools and doing that somewhat randomly because of threading
>>>>>>> might - but not
>>>>>>> must - change code generation).  Or that some diagnostics will appear
>>>>>>> in
>>>>>>> non-deterministic order, or that dump files are messed up (both
>>>>>>> issues could be
>>>>>>> solved by code dealing with the issue, like buffering and doing a re-
>>>>>>> play in
>>>>>>> program order).  I guess reproducability is important when it comes
>>>>>>> down to
>>>>>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>>>>>> doesn't run
>>>>>>> threaded but if that doesn't reproduce an issue that's bad.
>>>>>>>
>>>>>>> So the most important "milestone" of this project is to identify such
>>>>>>> issues and
>>>>>>> document them somewhere.
>>>>>>
>>>>>> One issue would be the garbage-collector: there are plenty of places in
>>>>>> GCC that have hidden assumptions that "a collection can't happen here"
>>>>>> (where we have temporaries that reference GC-managed objects, but which
>>>>>> aren't tracked by GC-roots).
>>>>>>
>>>>>> I had some patches for that back in 2014 that I think I managed to drop
>>>>>> on the floor (sorry):
>>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>>>>
>>>> Would there be a way to easily create a static analyzer to find these
>>>> untracked temporaries?
>>>
>>>I don't think so.
>>>
>>>> A quick look at registered passes makes me count ~135 tree-ssa passes,
>>>> So your code on analyzing what globals are referenced where might come in
>>>> handy while analyzing if passes are easily parallelized.
>>>
>>>I think the "solution" to the garbage collecting issue is to simply keep
>>>that serialized.  It's currently done on-demand anyway at certain
>>>safe collection points so the work scheduler can simply hold off
>>>scheduling more work when the collector would want to run, waiting for
>>>running jobs to complete.
>>>
>>>Richard.
>>>
>>>>>> The GC's allocator is used almost everywhere, and is probably not
>>>>>> thread-safe yet.
>>>>>Yes.  There's also global tree modification like chaining new
>>>>>pointer types into TYPE_POINTER_TO and friends so some
>>>>>helpers in tree.c need to be guarded as well.
>>>>>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>>>>>> it's five years out-of-date, but maybe is still relevant in places?
>>>>>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>>>>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>>>>>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>>>>>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>>>>>> thread is calling into the rest of the compiler sources).
>>>>>>
>>>>>> Hope this is helpful
>>>>>> Dave
>>>>>>
>>>>>>
>>
>> [0] https://docs.google.com/document/d/1YkKYI3J-pKFfHLdxljWVoJ89l2oVGibxslIW3ikluz8/edit?usp=sharing


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-03-23 15:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-16 16:25 GSOC Question about the parallelization project Sebastiaan Peters
2018-03-19  9:20 ` Richard Biener
2018-03-19 15:28   ` Sebastiaan Peters
2018-03-19 17:37     ` Richard Biener
2018-03-19 19:09       ` Sebastiaan Peters
2018-03-19 20:55         ` Richard Biener
2018-03-20 13:02           ` Richard Biener
2018-03-20 14:49             ` David Malcolm
2018-03-20 15:07               ` Richard Biener
2018-03-21  8:51                 ` Sebastiaan Peters
2018-03-21 10:34                   ` Richard Biener
2018-03-21 19:04                     ` Sebastiaan Peters
2018-03-22 12:27                       ` Richard Biener
2018-03-23 15:05                         ` Sebastiaan Peters

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).