public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type
@ 2024-05-15 1:02 arthur.j.odwyer at gmail dot com
2024-05-15 6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: arthur.j.odwyer at gmail dot com @ 2024-05-15 1:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
Bug ID: 115097
Summary: Strange suboptimal codegen specifically at -O2 when
copying struct type
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: arthur.j.odwyer at gmail dot com
Target Milestone: ---
// https://godbolt.org/z/G7qG4vvWb (C++ version)
// https://godbolt.org/z/fT793cznT (C version)
struct A { int a; short b; };
A test1(A& a) { return a; }
A test2(A&& a) { return a; }
A test3(const A& a) { return a; }
A test4(const A&& a) { return a; }
At -O1, they all have the same perfect codegen:
test2(A&&):
mov rax, QWORD PTR [rdi]
ret
At -O2, all-but-one of them have weird suboptimal (but correct) codegen:
test2(A&&):
movzx edx, WORD PTR [rdi+4]
mov eax, DWORD PTR [rdi]
sal rdx, 32
or rax, rdx
ret
What's really weird is that it really is "all but one of them." The lexically
first function will have good codegen, and then the subsequent ones will have
the suboptimal codegen. You can comment out the good one and watch GCC pick
another one to make good. You can reorder the definitions and see GCC's
decision of which one to optimize will change.
This behavior dates all the way back to GCC 5. In GCC 4.9.4, -O2 didn't have
this weird behavior; all four functions would just be optimal all the time.
The same symptom reproduces on x86-64, ARM64, and RISC-V (32 *and* 64). The
RISC-V result seems to indicate it's not specifically limited to 64-bit.
It also reproduces in C, with pointers instead of references.
It seems to have something to do with the signature of the enclosing function,
which is super weird: https://godbolt.org/z/KPac7bvTM
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug tree-optimization/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
@ 2024-05-15 6:50 ` rguenth at gcc dot gnu.org
2024-05-15 7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15 6:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Version|unknown |15.0
Component|c |tree-optimization
CC| |jamborm at gcc dot gnu.org
Last reconfirmed| |2024-05-15
Status|UNCONFIRMED |NEW
Target| |x86_64-*-*
Keywords| |missed-optimization
Ever confirmed|0 |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. The IL difference is
struct A test1 (struct A & a)
{
struct A D.2842;
<bb 2> [local count: 1073741824]:
D.2842 = MEM[(const struct A &)a_2(D)];
return D.2842;
vs
struct A test2 (struct A & a)
{
struct A D.2873;
struct A retval.4;
<bb 2> [local count: 1073741824]:
D.2873 = MEM[(const struct A &)a_2(D)];
retval.4 = D.2873;
return retval.4;
so there's an additional aggregate copy. With -O2 SRA scalarizes that
copy and we're not able to elide the resulting code on RTL while without
the SRA we can handle this fine.
SRA makes test2 into
struct A test2 (struct A & a)
{
short int SR.12;
int SR.11;
struct A retval.4;
<bb 2> [local count: 1073741824]:
SR.11_3 = MEM[(const struct A &)a_2(D)].a;
SR.12_6 = MEM[(const struct A &)a_2(D)].b;
retval.4.a = SR.11_3;
retval.4.b = SR.12_6;
return retval.4;
The extra copy is introduced during gimplfication, the GENERIC looks the
same (but of course there's a hidden difference):
;; Function A test1(A&) (null)
;; enabled by -tree-original
<<cleanup_point return <retval> = TARGET_EXPR <D.2823, *(const struct A &)
a>>>;
;; Function A test2(A&&) (null)
;; enabled by -tree-original
<<cleanup_point return <retval> = TARGET_EXPR <D.2833, *(const struct A &)
a>>>;
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
2024-05-15 6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
@ 2024-05-15 7:05 ` rguenth at gcc dot gnu.org
2024-05-15 7:10 ` jakub at gcc dot gnu.org
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|tree-optimization |ipa
CC| |hubicka at gcc dot gnu.org,
| |rguenth at gcc dot gnu.org
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So actually it seems that the reason is ICF plus inlining:
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test2(A&&)/1
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test2O1A/1
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test3(const A&)/2
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test3RK1A/2
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test4(const A&&)/3
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test4OK1A/3
optimized: Inlined A test1(A&)/4 into A test2(A&&)/1 which now has time
4.000000 and size 5, net change of -1.
optimized: Inlined A test1(A&)/5 into A test3(const A&)/2 which now has time
4.000000 and size 5, net change of -1.
optimized: Inlined A test1(A&)/6 into A test4(const A&&)/3 which now has time
4.000000 and size 5, net change of -1.
for some reason we "optimize" the functions to the following in IPA ICF:
struct A test4 (const struct A & a)
{
struct A retval.6;
<bb 2> [local count: 1073741824]:
retval.6 = test1 (a_2(D)); [tail call]
return retval.6;
}
struct A test3 (const struct A & a)
{
struct A retval.5;
<bb 2> [local count: 1073741824]:
retval.5 = test1 (a_2(D)); [tail call]
return retval.5;
}
struct A test2 (struct A & a)
{
struct A retval.4;
<bb 2> [local count: 1073741824]:
retval.4 = test1 (a_2(D)); [tail call]
return retval.4;
}
and then we inline them back, introducing the extra copy. Why do we use
tail-calls here instead of aliases? Why do we lack cost modeling here?
Why do we inline back? It looks like a pointless exercise to me ...
With -fdisable-ipa-inline we get
_Z5test2O1A:
.LFB5:
.cfi_startproc
jmp _Z5test1R1A
so that's at least reasonable and what's expected I suppose. So one
could argue the bug is in the inliner and with introducing the extra
copy (IIRC there's a bug about this), but still.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
2024-05-15 6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
2024-05-15 7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
@ 2024-05-15 7:10 ` jakub at gcc dot gnu.org
2024-05-15 7:12 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-15 7:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> Why do we use
> tail-calls here instead of aliases?
The functions are exported, so we can't prove if some other TU won't do
return &test1 == &test2;
or similar.
> Why do we inline back?
But IMHO if we ICF optimize something, we shouldn't undo that, so should
disable the inlining across that edge. Or decide not to ICF optimize it.
Plus we have the similar problem with fnsplit, where if we split some function
and inline it back it causes undesirable debug info differences, ideally the
inlining of fnsplit into self should just restore the old body.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
` (2 preceding siblings ...)
2024-05-15 7:10 ` jakub at gcc dot gnu.org
@ 2024-05-15 7:12 ` pinskia at gcc dot gnu.org
2024-05-15 7:24 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-15 7:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |DUPLICATE
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So actually it seems that the reason is ICF plus inlining:
....
>, introducing the extra copy. Why do we use
> tail-calls here instead of aliases? Why do we lack cost modeling here?
> Why do we inline back? It looks like a pointless exercise to me ...
> With -fdisable-ipa-inline we get
>
> _Z5test2O1A:
> .LFB5:
> .cfi_startproc
> jmp _Z5test1R1A
>
> so that's at least reasonable and what's expected I suppose. So one
> could argue the bug is in the inliner and with introducing the extra
> copy (IIRC there's a bug about this), but still.
Which means this is dup of bug 96252
*** This bug has been marked as a duplicate of bug 96252 ***
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
` (3 preceding siblings ...)
2024-05-15 7:12 ` pinskia at gcc dot gnu.org
@ 2024-05-15 7:24 ` rguenth at gcc dot gnu.org
2024-05-15 7:34 ` rguenth at gcc dot gnu.org
2024-05-15 9:05 ` hubicka at ucw dot cz
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15 7:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase for the inliner issue:
struct A { int a; short b; };
A test(A& a) { return a; }
A test1(A& a) { return test(a); }
where the issue is that test() doesn't use DECL_RESULT for its return and thus
declare_return_variable setting up DECL_RESULT mapped to the call LHS doesn't
help.
And the reason why we don't use DECL_RESULT is that we have
<return_expr 0x7ffff6bb0d40
type <void_type 0x7ffff6a30f18 void VOID
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7ffff6a30f18
pointer_to_this <pointer_type 0x7ffff6a37000>>
side-effects
arg:0 <init_expr 0x7ffff6bb7fc8
type <record_type 0x7ffff6bbdbd0 A cxx-odr-p type_5 DI
size <integer_cst 0x7ffff6a29240 constant 64>
unit-size <integer_cst 0x7ffff6a29258 constant 8>
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7ffff6bbdbd0 fields <function_decl 0x7ffff6bd5000 __dt > context
<translation_unit_decl 0x7ffff6a18168 t2.ii>
full-name "struct A"
X() X(constX&) this=(X&) n_parents=0 use_template=0
interface-unknown
pointer_to_this <pointer_type 0x7ffff6bbde70> reference_to_this
<reference_type 0x7ffff6bbdd20> chain <type_decl 0x7ffff6a3ebe0 A>>
side-effects
arg:0 <result_decl 0x7ffff6bd1000 D.2798 type <record_type
0x7ffff6bbdbd0 A>
ignored DI t2.ii:2:1 size <integer_cst 0x7ffff6a29240 64> unit-size
<integer_cst 0x7ffff6a29258 8>
align:32 warn_if_not_align:0 context <function_decl 0x7ffff6bc0500
test>>
arg:1 <target_expr 0x7ffff6a09700 type <record_type 0x7ffff6bbdbd0 A>
side-effects tree_3 arg:0 <var_decl 0x7ffff6a19cf0 D.2823>
...
and gimplification changes the RESULT_DECL return because
/* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
Recall that aggregate_value_p is FALSE for any aggregate type that is
returned in registers. If we're returning values in registers, then
we don't want to extend the lifetime of the RESULT_DECL, particularly
across another call. In addition, for those aggregates for which
hard_function_value generates a PARALLEL, we'll die during normal
expansion of structure assignments; there's special code in expand_return
to handle this case that does not exist in expand_expr. */
but the code does something else, using a new temporary register
(here !aggregate_value_p). The following fixes that.
diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index b0ed58ed0f9..e4cf5438e39 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
to handle this case that does not exist in expand_expr. */
if (!result_decl)
result = NULL_TREE;
- else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)))
+ else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl))
+ || !is_gimple_reg_type (TREE_TYPE (result_decl)))
{
if (!poly_int_tree_p (DECL_SIZE (result_decl)))
{
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
` (4 preceding siblings ...)
2024-05-15 7:24 ` rguenth at gcc dot gnu.org
@ 2024-05-15 7:34 ` rguenth at gcc dot gnu.org
2024-05-15 9:05 ` hubicka at ucw dot cz
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15 7:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
[...]
> /* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
> Recall that aggregate_value_p is FALSE for any aggregate type that is
> returned in registers. If we're returning values in registers, then
> we don't want to extend the lifetime of the RESULT_DECL, particularly
> across another call. In addition, for those aggregates for which
> hard_function_value generates a PARALLEL, we'll die during normal
> expansion of structure assignments; there's special code in
> expand_return
> to handle this case that does not exist in expand_expr. */
>
> but the code does something else, using a new temporary register
> (here !aggregate_value_p). The following fixes that.
>
> diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
> index b0ed58ed0f9..e4cf5438e39 100644
> --- a/gcc/gimplify.cc
> +++ b/gcc/gimplify.cc
> @@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
> to handle this case that does not exist in expand_expr. */
> if (!result_decl)
> result = NULL_TREE;
> - else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl)))
> + else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl))
> + || !is_gimple_reg_type (TREE_TYPE (result_decl)))
> {
> if (!poly_int_tree_p (DECL_SIZE (result_decl)))
> {
That said, this likely breaks exactly what the comment says but this should
be then worked around during RTL expansion and not gimplification (but
of course historically that was the same time).
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
` (5 preceding siblings ...)
2024-05-15 7:34 ` rguenth at gcc dot gnu.org
@ 2024-05-15 9:05 ` hubicka at ucw dot cz
6 siblings, 0 replies; 8+ messages in thread
From: hubicka at ucw dot cz @ 2024-05-15 9:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097
--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> ---
> and then we inline them back, introducing the extra copy. Why do we use
> tail-calls here instead of aliases? Why do we lack cost modeling here?
Because the function is exported and we must keep addresses different.
Cost modeling is somewhat hard here, since it is not clear what will
help inliner and whether inliner will inline back.
For example if we icf two function calling third function. the third one
may become called once and we get better code by inlining it and eliding
offline copy rather than keeping the caller duplicated.
So the idea is to get code into as canonical form as possible (with no
obvious duplicates) and then let the inliner heuristics to decide what
should and what should not be duplicated in the final code.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-05-15 9:05 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-15 1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
2024-05-15 6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
2024-05-15 7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
2024-05-15 7:10 ` jakub at gcc dot gnu.org
2024-05-15 7:12 ` pinskia at gcc dot gnu.org
2024-05-15 7:24 ` rguenth at gcc dot gnu.org
2024-05-15 7:34 ` rguenth at gcc dot gnu.org
2024-05-15 9:05 ` hubicka at ucw dot cz
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).