public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug lto/116535] New: LTO partitioning vs. offloading compilation
@ 2024-08-29 14:00 tschwinge at gcc dot gnu.org
  2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2024-08-29 14:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

            Bug ID: 116535
           Summary: LTO partitioning vs. offloading compilation
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: openacc, openmp, wrong-code
          Severity: normal
          Priority: P3
         Component: lto
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: burnus at gcc dot gnu.org, hubicka at gcc dot gnu.org,
                    jakub at gcc dot gnu.org, rguenth at gcc dot gnu.org
  Target Milestone: ---

I ran into a curious issue re LTO partitioning vs. offloading compilation.  I
guess what happened was that, given sufficiently high 'make -j', and only
'check-target-libgomp' running (which limits itself to 19 parallel slots), for
the '-flto -flto-partition=max' test cases
'libgomp.oacc-c-c++-common/data-clauses-{kernels,parallel}-ipa-pta.c', there
was actual parallel LTO compilation/partitioning, and these test cases then
FAIL, for example:

    libgomp: Cannot map target functions or variables (expected 180 + 0 + 1,
have 11)

Running the compilation outside of 'make -j check', we see:

    lto-wrapper: warning: using serial compilation of 36 LTRANS jobs
    lto-wrapper: note: see the ‘-flto’ option documentation for more
information

..., and they execute successfully.  However, for example, with '-flto=3
-flto-partition=max':

    libgomp: Cannot map target functions or variables (expected 30 + 0 + 1,
have 11)

... where here '30' is 'N * 10' for '-flto=N'.  So the "expected" number is
wrong.

This reproduces down to GCC 10:

    libgomp: Cannot map target functions or variables (expected 30, have 10)

(..., so isn't related to the PR116361 LTO plugin changes.)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
@ 2024-08-30 18:35 ` burnus at gcc dot gnu.org
  2024-08-30 18:46 ` burnus at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-08-30 18:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I have problems reproducing it fully reliably – and my impression is that a
global variable is not atomically set.

The difference between -flto=1 and -flto=2 with -flto-partition=max is rather
small. In either case, 36 partitions (LTRANS jobs) exist - and the partition
writing is the same.

* * *

If I add an
  #undef HAVE_WORKING_FORK
to gcc/lto/lto.cc, it seems to work - while without, I get the error for:

~/projects/gcc-trunk-offload/bin/gcc -save-temps -fopenacc -flto=2
-foffload=nvptx-none -flto-partition=max 
*/*/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c

libgomp: Cannot map target functions or variables (expected 20 + 0 + 1, have
11)

* * *

With -flto=<N> (with some reasonable N), stream_out_partitions is called <N>
times, i.e. for the example above with N=3:
  stream_out_partitions_1 ("./a.ltrans35.o", 10, 0, 12)
  stream_out_partitions_1 ("./a.ltrans11.o", 10, 12, 24)
  stream_out_partitions_1 ("./a.ltrans23.o", 10, 24, 36)
while for -flto=1, it's
  stream_out_partitions_1 ("./a.ltrans35.o", 10, 0, 36)

That's the only difference.

* * *

For -flto=3 + looking for '.offload_var_table' in the
'a.ltrans<i>.ltrans.{s,o}' files, shows that is always in the first file for
each set of files writt lto_parallelism  i.e., for -flto=3, it is in file
'a.ltrans<i>.ltrans.{s,o}' for <i> = 0, 12, and 24, where the files <i> = 0,
..., 35 exist.

* * *




* * *

However, with
#undef 




Recall how the data is created:
--------------------------------------------------

The expected number is:
  void **host_func_table = ((void ***) host_table)[0];
  void **host_funcs_end  = ((void ***) host_table)[1];
...
  int num_funcs = host_funcs_end - host_func_table;
  int num_vars  = (host_vars_end - host_var_table) / 2;

where host_table = __OFFLOAD_TABLE__ and libgcc/offloadstuff.c has:

#elif defined CRT_TABLE

const void *const __OFFLOAD_TABLE__[]
  __attribute__ ((__visibility__ ("hidden"))) =
{
  &__offload_func_table, &__offload_funcs_end,
  &__offload_var_table, &__offload_vars_end,
  &__offload_ind_func_table, &__offload_ind_funcs_end,
};


Where the entries are formed by:

#define OFFLOAD_FUNC_TABLE_SECTION_NAME ".gnu.offload_funcs"
#define OFFLOAD_IND_FUNC_TABLE_SECTION_NAME ".gnu.offload_ind_funcs"
#define OFFLOAD_VAR_TABLE_SECTION_NAME ".gnu.offload_vars"

#ifdef CRT_BEGIN

const void *const __offload_func_table[0]
  __attribute__ ((__used__, visibility ("hidden"),
                  section (OFFLOAD_FUNC_TABLE_SECTION_NAME))) = { };
...

#elif defined CRT_END
const void *const __offload_funcs_end[0]
  __attribute__ ((__used__, visibility ("hidden"),
                  section (OFFLOAD_FUNC_TABLE_SECTION_NAME))) = { };

And during link time, from which the files
  crtoffload{begin,end,table}$(objext)
are produced and GNU_USER_TARGET_STARTFILE_SPEC / GNU_USER_TARGET_ENDFILE_SPEC
ensures that __offload_func_table comes first and __offload_funcs_end last in
the list.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
  2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
@ 2024-08-30 18:46 ` burnus at gcc dot gnu.org
  2024-09-02  7:14 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-08-30 18:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Namely, the following seems to be problematic if the code is run concurrently.

The FORK part is actually quite old r207515 (Feb 2014) as is the following code
with flag_wpa, which was added   in r217489 (Nov 2014), but admittedly slightly
before.

gcc/lto-cgraph.cc

void  
output_offload_tables (void)
{
  ...

  /* In WHOPR mode during the WPA stage the joint offload tables need to be
     streamed to one partition only.  That's why we free offload_funcs and
     offload_vars after the first call of output_offload_tables.  */
  if (flag_wpa)
    {
      vec_free (offload_funcs);
      vec_free (offload_vars);
      vec_free (offload_ind_funcs);
    }
}

which obviously fails if multiple forks process such that each forks writes it
once, before freeing the vector ...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
  2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
  2024-08-30 18:46 ` burnus at gcc dot gnu.org
@ 2024-09-02  7:14 ` rguenth at gcc dot gnu.org
  2024-09-02  9:28 ` burnus at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-02  7:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The forked processes may not write to any "global" data because forking makes
that data not global ... instead any such "global" data has to be computed
before forking.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-09-02  7:14 ` rguenth at gcc dot gnu.org
@ 2024-09-02  9:28 ` burnus at gcc dot gnu.org
  2024-09-02 10:23 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-09-02  9:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> The forked processes may not write to any "global" data because forking
> makes that data not global ... instead any such "global" data has to be
> computed before forking.

Well, the data is "computed" before - just a means is missing to propagate the
information whether to output the offload tables should be written or not is
missing. The current call stack is:

With WPA, the code is run as follows:

 output_offload_tables
  stream_out_partitions_1
   stream_out
    ipa_write_optimization_summaries
     write_lto
      lto_output
       output_offload_tables

* * *

I see two possibilities, the quick one:

(A)

--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -463,0 +464 @@ struct lto_symtab_encoder_d
+  bool output_offload_tables;

setting it - and propagating it through.

(B)
Change how we output the tables. Currently, those bypass normal WPA and IPA
handling. Making them visible should avoid the 'force_output = 1' for all table
entries (and only require it for the tables themselves) - and presumably fix
this issue as well as fixing PR 95622.

* * *

Thoughts?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-09-02  9:28 ` burnus at gcc dot gnu.org
@ 2024-09-02 10:23 ` rguenther at suse dot de
  2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenther at suse dot de @ 2024-09-02 10:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 2 Sep 2024, burnus at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
> 
> --- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > The forked processes may not write to any "global" data because forking
> > makes that data not global ... instead any such "global" data has to be
> > computed before forking.
> 
> Well, the data is "computed" before - just a means is missing to propagate the
> information whether to output the offload tables should be written or not is
> missing. The current call stack is:
> 
> With WPA, the code is run as follows:
> 
>  output_offload_tables
>   stream_out_partitions_1
>    stream_out
>     ipa_write_optimization_summaries
>      write_lto
>       lto_output
>        output_offload_tables
> 
> * * *
> 
> I see two possibilities, the quick one:
> 
> (A)
> 
> --- a/gcc/lto-streamer.h
> +++ b/gcc/lto-streamer.h
> @@ -463,0 +464 @@ struct lto_symtab_encoder_d
> +  bool output_offload_tables;
> 
> setting it - and propagating it through.

Or set a global flag - all forked processes should see that.

> (B)
> Change how we output the tables. Currently, those bypass normal WPA and IPA
> handling. Making them visible should avoid the 'force_output = 1' for all table
> entries (and only require it for the tables themselves) - and presumably fix
> this issue as well as fixing PR 95622.

Yeah, but the this sounds like a lot of work and we might want to fix
this on branches.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-09-02 10:23 ` rguenther at suse dot de
@ 2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
  2024-09-03 11:06 ` hubicka at ucw dot cz
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-09-03 10:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #6 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tobias Burnus <burnus@gcc.gnu.org>:

https://gcc.gnu.org/g:2fcccf21a34f92ea060b492c9b2aecb56cd5d167

commit r15-3413-g2fcccf21a34f92ea060b492c9b2aecb56cd5d167
Author: Tobias Burnus <tburnus@baylibre.com>
Date:   Tue Sep 3 12:02:23 2024 +0200

    LTO/WPA: Ensure that output_offload_tables only writes table once
[PR116535]

    When ltrans was written concurrently, e.g. via -flto=N (N > 1, assuming
    sufficient partiations, e.g., via -flto-partition=max),
output_offload_tables
    wrote the output tables once per fork.

            PR lto/116535

    gcc/ChangeLog:

            * lto-cgraph.cc (output_offload_tables): Remove offload_ frees.
            * lto-streamer-out.cc (lto_output): Make call to it depend on
            lto_get_out_decl_state ()->output_offload_tables_p.
            * lto-streamer.h (struct lto_out_decl_state): Add
            output_offload_tables_p field.
            * tree-pass.h (ipa_write_optimization_summaries): Add bool
argument.
            * passes.cc (ipa_write_summaries_1): Add bool
            output_offload_tables_p arg.
            (ipa_write_summaries): Update call.
            (ipa_write_optimization_summaries): Accept output_offload_tables_p.

    gcc/lto/ChangeLog:

            * lto.cc (stream_out): Update call to
            ipa_write_optimization_summaries to pass true for first partition.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
@ 2024-09-03 11:06 ` hubicka at ucw dot cz
  2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
  2024-09-03 13:22 ` burnus at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: hubicka at ucw dot cz @ 2024-09-03 11:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> ---
> void  
> output_offload_tables (void)
> {
>   ...
> 
>   /* In WHOPR mode during the WPA stage the joint offload tables need to be
>      streamed to one partition only.  That's why we free offload_funcs and
>      offload_vars after the first call of output_offload_tables.  */
>   if (flag_wpa)
>     {
>       vec_free (offload_funcs);
>       vec_free (offload_vars);
>       vec_free (offload_ind_funcs);
>     }
> }
> 
> which obviously fails if multiple forks process such that each forks writes it
> once, before freeing the vector ...

We treat first partition somewhat specially in other code too, so I
guess we could a test if the streamed partition is first one instad of
relying on free to happen.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-09-03 11:06 ` hubicka at ucw dot cz
@ 2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
  2024-09-03 13:22 ` burnus at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2024-09-03 12:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2024-09-03
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |burnus at gcc dot gnu.org

--- Comment #8 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
(In reply to GCC Commits from comment #6)
> commit r15-3413-g2fcccf21a34f92ea060b492c9b2aecb56cd5d167
>     LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]

I confirm this does resolve the issue, thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug lto/116535] LTO partitioning vs. offloading compilation
  2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
@ 2024-09-03 13:22 ` burnus at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-09-03 13:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #7)
> We treat first partition somewhat specially in other code too, so I
> guess we could a test if the streamed partition is first one instad of
> relying on free to happen.

The current new code (→ r15-3413-g2fcccf21a34 ) passes the information on via 
struct lto_out_decl_state.

If you think it could be done differently & is cleaner, I am happy to use the
'is first streamed partition' flag, but I didn't see it when I looked for one.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-09-03 13:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
2024-08-30 18:46 ` burnus at gcc dot gnu.org
2024-09-02  7:14 ` rguenth at gcc dot gnu.org
2024-09-02  9:28 ` burnus at gcc dot gnu.org
2024-09-02 10:23 ` rguenther at suse dot de
2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
2024-09-03 11:06 ` hubicka at ucw dot cz
2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
2024-09-03 13:22 ` burnus at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).