[Bug c++/113478] New: -Os does not inline single instruction function

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c++/113478] New: -Os does not inline single instruction function
@ 2024-01-18 12:56 jari.helaakoski at qt dot io
  2024-01-18 13:12 ` [Bug ipa/113478] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: jari.helaakoski at qt dot io @ 2024-01-18 12:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

            Bug ID: 113478
           Summary: -Os does not inline single instruction function
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jari.helaakoski at qt dot io
  Target Milestone: ---

GCC -Os option does not inline single instruction std::atomic function when
accessed via template class. Inlining works ok with -O2. Inlining is also
working if std::atomic is used directly. 

Test case and compilation result found from godbolt:
https://godbolt.org/z/nh758aT3K

Problem reproduces all GCC versions with x86 and aarch64 target and with and
without specific -march and -mtune options.

And here's test case pasted in case of godbolt is not accessable:
template <typename T>
class D {
    public:
#if 1
    inline T test() { return i.load(std::memory_order_relaxed); }
#else
    __attribute__((always_inline)) T test() { return
i.load(std::memory_order_relaxed); }
#endif
    std::atomic<T> i {0};
};

D<char> a1;
D<long> a2;
D<int> a3;
D<short> a4;
D<long long> a5;
D<wchar_t> a6;

extern void mod();

int main()
{
    int ret = 0;
    mod();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a1.test();
    ret |= a2.test();
    ret |= a3.test();
    ret |= a4.test();
    ret |= a5.test();
    ret |= a6.test();
    return ret;
}

Function that should be inlined
D<char>::test():
        ldrb    w0, [x0]
        ret

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug ipa/113478] -Os does not inline single instruction function
  2024-01-18 12:56 [Bug c++/113478] New: -Os does not inline single instruction function jari.helaakoski at qt dot io
@ 2024-01-18 13:12 ` rguenth at gcc dot gnu.org
  2024-01-19 10:51 ` hubicka at ucw dot cz
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-18 13:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-01-18
                 CC|                            |hubicka at gcc dot gnu.org
          Component|c++                         |ipa

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.C:35:19: note: Considering inline candidate T D<T>::test() [with T =
char]/231.
   Estimating body: T D<T>::test() [with T = char]/231
   Known to be false: not inlined, op0 changed
   size:4
   Estimating body: T D<T>::test() [with T = char]/231
   Known to be false: not inlined, op0 changed
   size:4
t.C:35:19: missed:   will not early inline: int main()/185->T D<T>::test()
[with T = char]/231, call is cold and code would grow by 1

which is because

Analyzing function body size: T D<T>::test() [with T = char]/231
                Accounting size:2.00, time:0.00 on new predicate exec:(not
inlined)

 BB 2 predicate:(true)
  _3 = &MEM[(const struct __atomic_base *)this_1(D)]._M_i;
                freq:1.00 size:  0 time:  0
  _4 = __atomic_load_1 (_3, 0);
                freq:1.00 size:  4 time: 13
  _5 = (char) _4;
                freq:1.00 size:  0 time:  0
  return _5;
                freq:1.00 size:  1 time:  2
                Will be eliminated by inlining
                Accounting size:1.00, time:2.00 on predicate exec:(not inlined)

so the __atomic_load_1 "call"s are not special-cased during size estimation
and we assume you'd get parameter setup and call.  Note some targets
might expand this to a call to libatomic (there's also -finline-atomics).

We might want to consider improving heuristics here.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug ipa/113478] -Os does not inline single instruction function
  2024-01-18 12:56 [Bug c++/113478] New: -Os does not inline single instruction function jari.helaakoski at qt dot io
  2024-01-18 13:12 ` [Bug ipa/113478] " rguenth at gcc dot gnu.org
@ 2024-01-19 10:51 ` hubicka at ucw dot cz
  2024-01-19 11:09 ` rguenth at gcc dot gnu.org
  2024-01-19 16:43 ` hubicka at ucw dot cz
  3 siblings, 0 replies; 5+ messages in thread
From: hubicka at ucw dot cz @ 2024-01-19 10:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

--- Comment #2 from Jan Hubicka <hubicka at ucw dot cz> ---
Probably is_inexpensive_bulitin_p should return true here?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug ipa/113478] -Os does not inline single instruction function
  2024-01-18 12:56 [Bug c++/113478] New: -Os does not inline single instruction function jari.helaakoski at qt dot io
  2024-01-18 13:12 ` [Bug ipa/113478] " rguenth at gcc dot gnu.org
  2024-01-19 10:51 ` hubicka at ucw dot cz
@ 2024-01-19 11:09 ` rguenth at gcc dot gnu.org
  2024-01-19 16:43 ` hubicka at ucw dot cz
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-19 11:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #2)
> Probably is_inexpensive_bulitin_p should return true here?

Possibly, at least when we know it doesn't expand to a libatomic call?  OTOH
even then a function just wrapping such call should probably be inlined,
so the question is whether the problem that

  _3 = &MEM[(const struct __atomic_base *)this_1(D)]._M_i;
                freq:1.00 size:  0 time:  0
  _4 = __atomic_load_1 (_3, 0);
                freq:1.00 size:  4 time: 13
  _5 = (char) _4;
                freq:1.00 size:  0 time:  0
  return _5;
                freq:1.00 size:  1 time:  2

is estimated as too big compared to the call calling the function
(OTOH a1.test () has no arguments while __atomic_load_1 has two).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug ipa/113478] -Os does not inline single instruction function
  2024-01-18 12:56 [Bug c++/113478] New: -Os does not inline single instruction function jari.helaakoski at qt dot io
                   ` (2 preceding siblings ...)
  2024-01-19 11:09 ` rguenth at gcc dot gnu.org
@ 2024-01-19 16:43 ` hubicka at ucw dot cz
  3 siblings, 0 replies; 5+ messages in thread
From: hubicka at ucw dot cz @ 2024-01-19 16:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113478

--- Comment #4 from Jan Hubicka <hubicka at ucw dot cz> ---
> Possibly, at least when we know it doesn't expand to a libatomic call?  OTOH
> even then a function just wrapping such call should probably be inlined,
> so the question is whether the problem that
> is estimated as too big compared to the call calling the function
> (OTOH a1.test () has no arguments while __atomic_load_1 has two).

If we really want to optimize for size, calling function with one
parameter is shorter then calling function with two parameters.  The
code size model takes into account when the offline copy of the function
will disappear and it also has some biass towards understanding that a
lot of comdat functions are not really shared across units.

The testcase calls function 15 times and I guess wrapper function on
most architectures is shorter than 15 load zero instructions...

We now have -Os and -Oz and two-level optimize_size predicates. We may
make this less restrictive with lower size optimization level. But when
optimizing for size and if __atomic_load was ordinary function call, I
think the decision is correct.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-01-19 16:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-18 12:56 [Bug c++/113478] New: -Os does not inline single instruction function jari.helaakoski at qt dot io
2024-01-18 13:12 ` [Bug ipa/113478] " rguenth at gcc dot gnu.org
2024-01-19 10:51 ` hubicka at ucw dot cz
2024-01-19 11:09 ` rguenth at gcc dot gnu.org
2024-01-19 16:43 ` hubicka at ucw dot cz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).