[Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type
@ 2024-05-15  1:02 arthur.j.odwyer at gmail dot com
  2024-05-15  6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: arthur.j.odwyer at gmail dot com @ 2024-05-15  1:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

            Bug ID: 115097
           Summary: Strange suboptimal codegen specifically at -O2 when
                    copying struct type
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: arthur.j.odwyer at gmail dot com
  Target Milestone: ---

// https://godbolt.org/z/G7qG4vvWb (C++ version)
// https://godbolt.org/z/fT793cznT (C version)

struct A { int a; short b; };
A test1(A& a) { return a; }
A test2(A&& a) { return a; }
A test3(const A& a) { return a; }
A test4(const A&& a) { return a; }

At -O1, they all have the same perfect codegen:

    test2(A&&):
        mov     rax, QWORD PTR [rdi]
        ret

At -O2, all-but-one of them have weird suboptimal (but correct) codegen:

    test2(A&&):
        movzx   edx, WORD PTR [rdi+4]
        mov     eax, DWORD PTR [rdi]
        sal     rdx, 32
        or      rax, rdx
        ret

What's really weird is that it really is "all but one of them." The lexically
first function will have good codegen, and then the subsequent ones will have
the suboptimal codegen. You can comment out the good one and watch GCC pick
another one to make good. You can reorder the definitions and see GCC's
decision of which one to optimize will change.

This behavior dates all the way back to GCC 5. In GCC 4.9.4, -O2 didn't have
this weird behavior; all four functions would just be optimal all the time.

The same symptom reproduces on x86-64, ARM64, and RISC-V (32 *and* 64). The
RISC-V result seems to indicate it's not specifically limited to 64-bit.

It also reproduces in C, with pointers instead of references.
It seems to have something to do with the signature of the enclosing function,
which is super weird: https://godbolt.org/z/KPac7bvTM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
@ 2024-05-15  6:50 ` rguenth at gcc dot gnu.org
  2024-05-15  7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15  6:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |15.0
          Component|c                           |tree-optimization
                 CC|                            |jamborm at gcc dot gnu.org
   Last reconfirmed|                            |2024-05-15
             Status|UNCONFIRMED                 |NEW
             Target|                            |x86_64-*-*
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The IL difference is

struct A test1 (struct A & a)
{ 
  struct A D.2842;

  <bb 2> [local count: 1073741824]:
  D.2842 = MEM[(const struct A &)a_2(D)];
  return D.2842;

vs

struct A test2 (struct A & a)
{
  struct A D.2873;
  struct A retval.4;

  <bb 2> [local count: 1073741824]:
  D.2873 = MEM[(const struct A &)a_2(D)];
  retval.4 = D.2873;
  return retval.4;

so there's an additional aggregate copy.  With -O2 SRA scalarizes that
copy and we're not able to elide the resulting code on RTL while without
the SRA we can handle this fine.

SRA makes test2 into

struct A test2 (struct A & a)
{ 
  short int SR.12;
  int SR.11;
  struct A retval.4;

  <bb 2> [local count: 1073741824]:
  SR.11_3 = MEM[(const struct A &)a_2(D)].a;
  SR.12_6 = MEM[(const struct A &)a_2(D)].b;
  retval.4.a = SR.11_3;
  retval.4.b = SR.12_6;
  return retval.4;


The extra copy is introduced during gimplfication, the GENERIC looks the
same (but of course there's a hidden difference):

;; Function A test1(A&) (null)
;; enabled by -tree-original


<<cleanup_point return <retval> = TARGET_EXPR <D.2823, *(const struct A &)
a>>>;


;; Function A test2(A&&) (null)
;; enabled by -tree-original


<<cleanup_point return <retval> = TARGET_EXPR <D.2833, *(const struct A &)
a>>>;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
  2024-05-15  6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
@ 2024-05-15  7:05 ` rguenth at gcc dot gnu.org
  2024-05-15  7:10 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15  7:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |ipa
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So actually it seems that the reason is ICF plus inlining:

t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test2(A&&)/1
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test2O1A/1
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test3(const A&)/2
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test3RK1A/2
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test4(const A&&)/3
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test4OK1A/3
optimized:  Inlined A test1(A&)/4 into A test2(A&&)/1 which now has time
4.000000 and size 5, net change of -1.
optimized:  Inlined A test1(A&)/5 into A test3(const A&)/2 which now has time
4.000000 and size 5, net change of -1.
optimized:  Inlined A test1(A&)/6 into A test4(const A&&)/3 which now has time
4.000000 and size 5, net change of -1.

for some reason we "optimize" the functions to the following in IPA ICF:

struct A test4 (const struct A & a)
{
  struct A retval.6;

  <bb 2> [local count: 1073741824]:
  retval.6 = test1 (a_2(D)); [tail call]
  return retval.6;

}

struct A test3 (const struct A & a)
{
  struct A retval.5;

  <bb 2> [local count: 1073741824]:
  retval.5 = test1 (a_2(D)); [tail call]
  return retval.5;

}

struct A test2 (struct A & a)
{
  struct A retval.4;

  <bb 2> [local count: 1073741824]:
  retval.4 = test1 (a_2(D)); [tail call]
  return retval.4;

}

and then we inline them back, introducing the extra copy.  Why do we use
tail-calls here instead of aliases?  Why do we lack cost modeling here?
Why do we inline back?  It looks like a pointless exercise to me ...
With -fdisable-ipa-inline we get

_Z5test2O1A:
.LFB5:
        .cfi_startproc
        jmp     _Z5test1R1A

so that's at least reasonable and what's expected I suppose.  So one
could argue the bug is in the inliner and with introducing the extra
copy (IIRC there's a bug about this), but still.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
  2024-05-15  6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
  2024-05-15  7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
@ 2024-05-15  7:10 ` jakub at gcc dot gnu.org
  2024-05-15  7:12 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-15  7:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> Why do we use
> tail-calls here instead of aliases?

The functions are exported, so we can't prove if some other TU won't do
return &test1 == &test2;
or similar.

> Why do we inline back?

But IMHO if we ICF optimize something, we shouldn't undo that, so should
disable the inlining across that edge.  Or decide not to ICF optimize it.

Plus we have the similar problem with fnsplit, where if we split some function
and inline it back it causes undesirable debug info differences, ideally the
inlining of fnsplit into self should just restore the old body.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
                   ` (2 preceding siblings ...)
  2024-05-15  7:10 ` jakub at gcc dot gnu.org
@ 2024-05-15  7:12 ` pinskia at gcc dot gnu.org
  2024-05-15  7:24 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-15  7:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So actually it seems that the reason is ICF plus inlining:
....
>, introducing the extra copy.  Why do we use
> tail-calls here instead of aliases?  Why do we lack cost modeling here?
> Why do we inline back?  It looks like a pointless exercise to me ...
> With -fdisable-ipa-inline we get
> 
> _Z5test2O1A:
> .LFB5:
>         .cfi_startproc
>         jmp     _Z5test1R1A
> 
> so that's at least reasonable and what's expected I suppose.  So one
> could argue the bug is in the inliner and with introducing the extra
> copy (IIRC there's a bug about this), but still.

Which means this is dup of bug 96252

*** This bug has been marked as a duplicate of bug 96252 ***

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
                   ` (3 preceding siblings ...)
  2024-05-15  7:12 ` pinskia at gcc dot gnu.org
@ 2024-05-15  7:24 ` rguenth at gcc dot gnu.org
  2024-05-15  7:34 ` rguenth at gcc dot gnu.org
  2024-05-15  9:05 ` hubicka at ucw dot cz
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase for the inliner issue:

struct A { int a; short b; };
A test(A& a) { return a; }
A test1(A& a) { return test(a); }

where the issue is that test() doesn't use DECL_RESULT for its return and thus
declare_return_variable setting up DECL_RESULT mapped to the call LHS doesn't
help.

And the reason why we don't use DECL_RESULT is that we have

 <return_expr 0x7ffff6bb0d40
    type <void_type 0x7ffff6a30f18 void VOID
        align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7ffff6a30f18
        pointer_to_this <pointer_type 0x7ffff6a37000>>
    side-effects
    arg:0 <init_expr 0x7ffff6bb7fc8
        type <record_type 0x7ffff6bbdbd0 A cxx-odr-p type_5 DI
            size <integer_cst 0x7ffff6a29240 constant 64>
            unit-size <integer_cst 0x7ffff6a29258 constant 8>
            align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7ffff6bbdbd0 fields <function_decl 0x7ffff6bd5000 __dt > context
<translation_unit_decl 0x7ffff6a18168 t2.ii>
            full-name "struct A"
            X() X(constX&) this=(X&) n_parents=0 use_template=0
interface-unknown
            pointer_to_this <pointer_type 0x7ffff6bbde70> reference_to_this
<reference_type 0x7ffff6bbdd20> chain <type_decl 0x7ffff6a3ebe0 A>>
        side-effects
        arg:0 <result_decl 0x7ffff6bd1000 D.2798 type <record_type
0x7ffff6bbdbd0 A>
            ignored DI t2.ii:2:1 size <integer_cst 0x7ffff6a29240 64> unit-size
<integer_cst 0x7ffff6a29258 8>
            align:32 warn_if_not_align:0 context <function_decl 0x7ffff6bc0500
test>>
        arg:1 <target_expr 0x7ffff6a09700 type <record_type 0x7ffff6bbdbd0 A>
            side-effects tree_3 arg:0 <var_decl 0x7ffff6a19cf0 D.2823>
...

and gimplification changes the RESULT_DECL return because

  /* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
     Recall that aggregate_value_p is FALSE for any aggregate type that is
     returned in registers.  If we're returning values in registers, then
     we don't want to extend the lifetime of the RESULT_DECL, particularly
     across another call.  In addition, for those aggregates for which
     hard_function_value generates a PARALLEL, we'll die during normal
     expansion of structure assignments; there's special code in expand_return
     to handle this case that does not exist in expand_expr.  */

but the code does something else, using a new temporary register
(here !aggregate_value_p).  The following fixes that.

diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index b0ed58ed0f9..e4cf5438e39 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
      to handle this case that does not exist in expand_expr.  */
   if (!result_decl)
     result = NULL_TREE;
-  else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)))
+  else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl))
+          || !is_gimple_reg_type (TREE_TYPE (result_decl)))
     {
       if (!poly_int_tree_p (DECL_SIZE (result_decl)))
        {

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
                   ` (4 preceding siblings ...)
  2024-05-15  7:24 ` rguenth at gcc dot gnu.org
@ 2024-05-15  7:34 ` rguenth at gcc dot gnu.org
  2024-05-15  9:05 ` hubicka at ucw dot cz
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-15  7:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
[...]
>   /* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
>      Recall that aggregate_value_p is FALSE for any aggregate type that is
>      returned in registers.  If we're returning values in registers, then
>      we don't want to extend the lifetime of the RESULT_DECL, particularly
>      across another call.  In addition, for those aggregates for which
>      hard_function_value generates a PARALLEL, we'll die during normal
>      expansion of structure assignments; there's special code in
> expand_return
>      to handle this case that does not exist in expand_expr.  */
> 
> but the code does something else, using a new temporary register
> (here !aggregate_value_p).  The following fixes that.
> 
> diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
> index b0ed58ed0f9..e4cf5438e39 100644
> --- a/gcc/gimplify.cc
> +++ b/gcc/gimplify.cc
> @@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
>       to handle this case that does not exist in expand_expr.  */
>    if (!result_decl)
>      result = NULL_TREE;
> -  else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl)))
> +  else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl))
> +          || !is_gimple_reg_type (TREE_TYPE (result_decl)))
>      {
>        if (!poly_int_tree_p (DECL_SIZE (result_decl)))
>         {

That said, this likely breaks exactly what the comment says but this should
be then worked around during RTL expansion and not gimplification (but
of course historically that was the same time).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
  2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
                   ` (5 preceding siblings ...)
  2024-05-15  7:34 ` rguenth at gcc dot gnu.org
@ 2024-05-15  9:05 ` hubicka at ucw dot cz
  6 siblings, 0 replies; 8+ messages in thread
From: hubicka at ucw dot cz @ 2024-05-15  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> ---
> and then we inline them back, introducing the extra copy.  Why do we use
> tail-calls here instead of aliases?  Why do we lack cost modeling here?
Because the function is exported and we must keep addresses different.
Cost modeling is somewhat hard here, since it is not clear what will
help inliner and whether inliner will inline back.

For example if we icf two function calling third function. the third one
may become called once and we get better code by inlining it and eliding
offline copy rather than keeping the caller duplicated.

So the idea is to get code into as canonical form as possible (with no
obvious duplicates) and then let the inliner heuristics to decide what
should and what should not be duplicated in the final code.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-05-15  9:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-15  1:02 [Bug c/115097] New: Strange suboptimal codegen specifically at -O2 when copying struct type arthur.j.odwyer at gmail dot com
2024-05-15  6:50 ` [Bug tree-optimization/115097] " rguenth at gcc dot gnu.org
2024-05-15  7:05 ` [Bug ipa/115097] " rguenth at gcc dot gnu.org
2024-05-15  7:10 ` jakub at gcc dot gnu.org
2024-05-15  7:12 ` pinskia at gcc dot gnu.org
2024-05-15  7:24 ` rguenth at gcc dot gnu.org
2024-05-15  7:34 ` rguenth at gcc dot gnu.org
2024-05-15  9:05 ` hubicka at ucw dot cz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).