public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug lto/116535] New: LTO partitioning vs. offloading compilation
@ 2024-08-29 14:00 tschwinge at gcc dot gnu.org
2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2024-08-29 14:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
Bug ID: 116535
Summary: LTO partitioning vs. offloading compilation
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: openacc, openmp, wrong-code
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: tschwinge at gcc dot gnu.org
CC: burnus at gcc dot gnu.org, hubicka at gcc dot gnu.org,
jakub at gcc dot gnu.org, rguenth at gcc dot gnu.org
Target Milestone: ---
I ran into a curious issue re LTO partitioning vs. offloading compilation. I
guess what happened was that, given sufficiently high 'make -j', and only
'check-target-libgomp' running (which limits itself to 19 parallel slots), for
the '-flto -flto-partition=max' test cases
'libgomp.oacc-c-c++-common/data-clauses-{kernels,parallel}-ipa-pta.c', there
was actual parallel LTO compilation/partitioning, and these test cases then
FAIL, for example:
libgomp: Cannot map target functions or variables (expected 180 + 0 + 1,
have 11)
Running the compilation outside of 'make -j check', we see:
lto-wrapper: warning: using serial compilation of 36 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more
information
..., and they execute successfully. However, for example, with '-flto=3
-flto-partition=max':
libgomp: Cannot map target functions or variables (expected 30 + 0 + 1,
have 11)
... where here '30' is 'N * 10' for '-flto=N'. So the "expected" number is
wrong.
This reproduces down to GCC 10:
libgomp: Cannot map target functions or variables (expected 30, have 10)
(..., so isn't related to the PR116361 LTO plugin changes.)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
@ 2024-08-30 18:35 ` burnus at gcc dot gnu.org
2024-08-30 18:46 ` burnus at gcc dot gnu.org
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-08-30 18:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> ---
I have problems reproducing it fully reliably – and my impression is that a
global variable is not atomically set.
The difference between -flto=1 and -flto=2 with -flto-partition=max is rather
small. In either case, 36 partitions (LTRANS jobs) exist - and the partition
writing is the same.
* * *
If I add an
#undef HAVE_WORKING_FORK
to gcc/lto/lto.cc, it seems to work - while without, I get the error for:
~/projects/gcc-trunk-offload/bin/gcc -save-temps -fopenacc -flto=2
-foffload=nvptx-none -flto-partition=max
*/*/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c
libgomp: Cannot map target functions or variables (expected 20 + 0 + 1, have
11)
* * *
With -flto=<N> (with some reasonable N), stream_out_partitions is called <N>
times, i.e. for the example above with N=3:
stream_out_partitions_1 ("./a.ltrans35.o", 10, 0, 12)
stream_out_partitions_1 ("./a.ltrans11.o", 10, 12, 24)
stream_out_partitions_1 ("./a.ltrans23.o", 10, 24, 36)
while for -flto=1, it's
stream_out_partitions_1 ("./a.ltrans35.o", 10, 0, 36)
That's the only difference.
* * *
For -flto=3 + looking for '.offload_var_table' in the
'a.ltrans<i>.ltrans.{s,o}' files, shows that is always in the first file for
each set of files writt lto_parallelism i.e., for -flto=3, it is in file
'a.ltrans<i>.ltrans.{s,o}' for <i> = 0, 12, and 24, where the files <i> = 0,
..., 35 exist.
* * *
* * *
However, with
#undef
Recall how the data is created:
--------------------------------------------------
The expected number is:
void **host_func_table = ((void ***) host_table)[0];
void **host_funcs_end = ((void ***) host_table)[1];
...
int num_funcs = host_funcs_end - host_func_table;
int num_vars = (host_vars_end - host_var_table) / 2;
where host_table = __OFFLOAD_TABLE__ and libgcc/offloadstuff.c has:
#elif defined CRT_TABLE
const void *const __OFFLOAD_TABLE__[]
__attribute__ ((__visibility__ ("hidden"))) =
{
&__offload_func_table, &__offload_funcs_end,
&__offload_var_table, &__offload_vars_end,
&__offload_ind_func_table, &__offload_ind_funcs_end,
};
Where the entries are formed by:
#define OFFLOAD_FUNC_TABLE_SECTION_NAME ".gnu.offload_funcs"
#define OFFLOAD_IND_FUNC_TABLE_SECTION_NAME ".gnu.offload_ind_funcs"
#define OFFLOAD_VAR_TABLE_SECTION_NAME ".gnu.offload_vars"
#ifdef CRT_BEGIN
const void *const __offload_func_table[0]
__attribute__ ((__used__, visibility ("hidden"),
section (OFFLOAD_FUNC_TABLE_SECTION_NAME))) = { };
...
#elif defined CRT_END
const void *const __offload_funcs_end[0]
__attribute__ ((__used__, visibility ("hidden"),
section (OFFLOAD_FUNC_TABLE_SECTION_NAME))) = { };
And during link time, from which the files
crtoffload{begin,end,table}$(objext)
are produced and GNU_USER_TARGET_STARTFILE_SPEC / GNU_USER_TARGET_ENDFILE_SPEC
ensures that __offload_func_table comes first and __offload_funcs_end last in
the list.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
@ 2024-08-30 18:46 ` burnus at gcc dot gnu.org
2024-09-02 7:14 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-08-30 18:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Namely, the following seems to be problematic if the code is run concurrently.
The FORK part is actually quite old r207515 (Feb 2014) as is the following code
with flag_wpa, which was added in r217489 (Nov 2014), but admittedly slightly
before.
gcc/lto-cgraph.cc
void
output_offload_tables (void)
{
...
/* In WHOPR mode during the WPA stage the joint offload tables need to be
streamed to one partition only. That's why we free offload_funcs and
offload_vars after the first call of output_offload_tables. */
if (flag_wpa)
{
vec_free (offload_funcs);
vec_free (offload_vars);
vec_free (offload_ind_funcs);
}
}
which obviously fails if multiple forks process such that each forks writes it
once, before freeing the vector ...
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
2024-08-30 18:46 ` burnus at gcc dot gnu.org
@ 2024-09-02 7:14 ` rguenth at gcc dot gnu.org
2024-09-02 9:28 ` burnus at gcc dot gnu.org
` (5 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-02 7:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The forked processes may not write to any "global" data because forking makes
that data not global ... instead any such "global" data has to be computed
before forking.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (2 preceding siblings ...)
2024-09-02 7:14 ` rguenth at gcc dot gnu.org
@ 2024-09-02 9:28 ` burnus at gcc dot gnu.org
2024-09-02 10:23 ` rguenther at suse dot de
` (4 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-09-02 9:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> The forked processes may not write to any "global" data because forking
> makes that data not global ... instead any such "global" data has to be
> computed before forking.
Well, the data is "computed" before - just a means is missing to propagate the
information whether to output the offload tables should be written or not is
missing. The current call stack is:
With WPA, the code is run as follows:
output_offload_tables
stream_out_partitions_1
stream_out
ipa_write_optimization_summaries
write_lto
lto_output
output_offload_tables
* * *
I see two possibilities, the quick one:
(A)
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -463,0 +464 @@ struct lto_symtab_encoder_d
+ bool output_offload_tables;
setting it - and propagating it through.
(B)
Change how we output the tables. Currently, those bypass normal WPA and IPA
handling. Making them visible should avoid the 'force_output = 1' for all table
entries (and only require it for the tables themselves) - and presumably fix
this issue as well as fixing PR 95622.
* * *
Thoughts?
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (3 preceding siblings ...)
2024-09-02 9:28 ` burnus at gcc dot gnu.org
@ 2024-09-02 10:23 ` rguenther at suse dot de
2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
` (3 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: rguenther at suse dot de @ 2024-09-02 10:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 2 Sep 2024, burnus at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
>
> --- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > The forked processes may not write to any "global" data because forking
> > makes that data not global ... instead any such "global" data has to be
> > computed before forking.
>
> Well, the data is "computed" before - just a means is missing to propagate the
> information whether to output the offload tables should be written or not is
> missing. The current call stack is:
>
> With WPA, the code is run as follows:
>
> output_offload_tables
> stream_out_partitions_1
> stream_out
> ipa_write_optimization_summaries
> write_lto
> lto_output
> output_offload_tables
>
> * * *
>
> I see two possibilities, the quick one:
>
> (A)
>
> --- a/gcc/lto-streamer.h
> +++ b/gcc/lto-streamer.h
> @@ -463,0 +464 @@ struct lto_symtab_encoder_d
> + bool output_offload_tables;
>
> setting it - and propagating it through.
Or set a global flag - all forked processes should see that.
> (B)
> Change how we output the tables. Currently, those bypass normal WPA and IPA
> handling. Making them visible should avoid the 'force_output = 1' for all table
> entries (and only require it for the tables themselves) - and presumably fix
> this issue as well as fixing PR 95622.
Yeah, but the this sounds like a lot of work and we might want to fix
this on branches.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (4 preceding siblings ...)
2024-09-02 10:23 ` rguenther at suse dot de
@ 2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
2024-09-03 11:06 ` hubicka at ucw dot cz
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-09-03 10:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #6 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tobias Burnus <burnus@gcc.gnu.org>:
https://gcc.gnu.org/g:2fcccf21a34f92ea060b492c9b2aecb56cd5d167
commit r15-3413-g2fcccf21a34f92ea060b492c9b2aecb56cd5d167
Author: Tobias Burnus <tburnus@baylibre.com>
Date: Tue Sep 3 12:02:23 2024 +0200
LTO/WPA: Ensure that output_offload_tables only writes table once
[PR116535]
When ltrans was written concurrently, e.g. via -flto=N (N > 1, assuming
sufficient partiations, e.g., via -flto-partition=max),
output_offload_tables
wrote the output tables once per fork.
PR lto/116535
gcc/ChangeLog:
* lto-cgraph.cc (output_offload_tables): Remove offload_ frees.
* lto-streamer-out.cc (lto_output): Make call to it depend on
lto_get_out_decl_state ()->output_offload_tables_p.
* lto-streamer.h (struct lto_out_decl_state): Add
output_offload_tables_p field.
* tree-pass.h (ipa_write_optimization_summaries): Add bool
argument.
* passes.cc (ipa_write_summaries_1): Add bool
output_offload_tables_p arg.
(ipa_write_summaries): Update call.
(ipa_write_optimization_summaries): Accept output_offload_tables_p.
gcc/lto/ChangeLog:
* lto.cc (stream_out): Update call to
ipa_write_optimization_summaries to pass true for first partition.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (5 preceding siblings ...)
2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
@ 2024-09-03 11:06 ` hubicka at ucw dot cz
2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
2024-09-03 13:22 ` burnus at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: hubicka at ucw dot cz @ 2024-09-03 11:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> ---
> void
> output_offload_tables (void)
> {
> ...
>
> /* In WHOPR mode during the WPA stage the joint offload tables need to be
> streamed to one partition only. That's why we free offload_funcs and
> offload_vars after the first call of output_offload_tables. */
> if (flag_wpa)
> {
> vec_free (offload_funcs);
> vec_free (offload_vars);
> vec_free (offload_ind_funcs);
> }
> }
>
> which obviously fails if multiple forks process such that each forks writes it
> once, before freeing the vector ...
We treat first partition somewhat specially in other code too, so I
guess we could a test if the streamed partition is first one instad of
relying on free to happen.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (6 preceding siblings ...)
2024-09-03 11:06 ` hubicka at ucw dot cz
@ 2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
2024-09-03 13:22 ` burnus at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2024-09-03 12:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2024-09-03
Ever confirmed|0 |1
Assignee|unassigned at gcc dot gnu.org |burnus at gcc dot gnu.org
--- Comment #8 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
(In reply to GCC Commits from comment #6)
> commit r15-3413-g2fcccf21a34f92ea060b492c9b2aecb56cd5d167
> LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]
I confirm this does resolve the issue, thanks!
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug lto/116535] LTO partitioning vs. offloading compilation
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
` (7 preceding siblings ...)
2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
@ 2024-09-03 13:22 ` burnus at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: burnus at gcc dot gnu.org @ 2024-09-03 13:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #7)
> We treat first partition somewhat specially in other code too, so I
> guess we could a test if the streamed partition is first one instad of
> relying on free to happen.
The current new code (→ r15-3413-g2fcccf21a34 ) passes the information on via
struct lto_out_decl_state.
If you think it could be done differently & is cleaner, I am happy to use the
'is first streamed partition' flag, but I didn't see it when I looked for one.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-09-03 13:22 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-29 14:00 [Bug lto/116535] New: LTO partitioning vs. offloading compilation tschwinge at gcc dot gnu.org
2024-08-30 18:35 ` [Bug lto/116535] " burnus at gcc dot gnu.org
2024-08-30 18:46 ` burnus at gcc dot gnu.org
2024-09-02 7:14 ` rguenth at gcc dot gnu.org
2024-09-02 9:28 ` burnus at gcc dot gnu.org
2024-09-02 10:23 ` rguenther at suse dot de
2024-09-03 10:03 ` cvs-commit at gcc dot gnu.org
2024-09-03 11:06 ` hubicka at ucw dot cz
2024-09-03 12:48 ` tschwinge at gcc dot gnu.org
2024-09-03 13:22 ` burnus at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).