public inbox for libstdc++@gcc.gnu.org
 help / color / mirror / Atom feed
* Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?
@ 2020-08-21 12:33 Groke, Paul
  2020-08-23 19:54 ` Marc Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: Groke, Paul @ 2020-08-21 12:33 UTC (permalink / raw)
  To: libstdc++

Hi,

I've recently noticed that GCC is inlining vector::_M_realloc_insert into some of my functions that call vector::emplace_back. IMO that doesn't make a lot of sense - reallocation is usually so expensive that an extra function call doesn't really matter. And it bloats the code, which has two drawbacks: the binary gets bigger and the fast path gets slower (less efficient pre-fetching/more cache thrashing).

What do you think about marking such functions __attribute__ ((noinline))?

In our own code, we've seen measurable improvements (size & execution speed) by splitting the "slow path" out into helper functions and making those helper functions noinline.

(I've also noticed that there's no __builtin_expect for the fast path, but that's a different topic and IMO far less important.)

Regards,
Paul Groke

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a company registered in Linz whose registered office is at 4020 Linz, Austria, Am Fünfundzwanziger Turm 20

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?
  2020-08-21 12:33 Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert? Groke, Paul
@ 2020-08-23 19:54 ` Marc Glisse
  2020-08-24  9:19   ` Groke, Paul
  0 siblings, 1 reply; 5+ messages in thread
From: Marc Glisse @ 2020-08-23 19:54 UTC (permalink / raw)
  To: Groke, Paul; +Cc: libstdc++

On Fri, 21 Aug 2020, Groke, Paul via Libstdc++ wrote:

> I've recently noticed that GCC is inlining vector::_M_realloc_insert 
> into some of my functions that call vector::emplace_back.

Could you share which version of gcc you are testing with, what flags you 
are using, and ideally even some code to reproduce this? With current gcc, 
I don't see that happening very often.

> IMO that doesn't make a lot of sense - reallocation is usually so 
> expensive that an extra function call doesn't really matter. And it 
> bloats the code, which has two drawbacks: the binary gets bigger and the 
> fast path gets slower (less efficient pre-fetching/more cache 
> thrashing).
>
> What do you think about marking such functions __attribute__ ((noinline))?

That's a possibility, but I'd rather avoid it if possible. If someone 
inserts an element in a newly created vector, it does make sense to inline 
_M_realloc_insert, especially if we want any hope of making some small 
local vectors use the stack, or removing some unused small vectors.

Ideally, the inliner would already be clever enough not to inline the 
function except in special cases. It is rather large, calls other 
functions, is called with low probability (gcc guesses it as 17% on a 
simple example, using profile guided optimization would lower that number 
if the vectors are usually large), etc. This code isn't doing anything 
unusual (not like std::function or std::any), if the inliner behaves badly 
there, maybe a heuristic needs some tweaking.

> In our own code, we've seen measurable improvements (size & execution speed) by splitting the "slow path" out into helper functions and making those helper functions noinline.

Yes, that's a common strategy, and it would indeed be good to make sure 
that gcc does the right thing for std::vector.

> (I've also noticed that there's no __builtin_expect for the fast path, but that's a different topic and IMO far less important.)

Inlining takes probabilities into account, so this may be strongly 
related. However, gcc already guesses that the fast path is the most 
likely one, it isn't clear that making the probability of the slow path 
too low is a good idea.

Does profile-guided optimization help with your application?

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?
  2020-08-23 19:54 ` Marc Glisse
@ 2020-08-24  9:19   ` Groke, Paul
  2020-08-24 10:05     ` Marc Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: Groke, Paul @ 2020-08-24  9:19 UTC (permalink / raw)
  To: marc.glisse; +Cc: libstdc++

> Could you share which version of gcc you are testing with, what flags you are using, and ideally even some code to reproduce this? With current gcc, I don't see that happening very often.

I stumbled upon this while looking at the generated code with godbolt.org. I tried GCC 9.2 and 10.2. Flags were -std=c++17 -O2, architecture was amd64.

The code where I saw this was

https://godbolt.org/z/oG9EMM
// -------------------------------------------------

#include <vector>

struct Foo {
    Foo(int a, int b) : a(a), b(b) {}
    int a;
    int b;
};

class Bar {
public:
    void test(int a, int b);

private:
    std::vector<Foo> m_fooVec;
};

void Bar::test(int a, int b) {
    m_fooVec.emplace_back(a, b);
}

// -------------------------------------------------

> Ideally, the inliner would already be clever enough not to inline the function except in special cases.

After writing this mail, I found out that it only happens if the function is called just once in the translation unit. So I think it's not that big of a problem. I wish we could use LTCG/LTO with our project. Maybe I should give that a try, see how badly it will affect our build times. But since our link times are already kind-of high, I don't have high hopes.

Maybe it would also be worth grepping over our code to see where we have push_back/emplace_back/insert in header files. Those would have a high probability of creating just one call to a particular specialization of _M_realloc_insert per translation unit, but in multiple translation units.

> Does profile-guided optimization help with your application?

I guess it probably would, if we were able to do it right. It's a huge product though and I don't think we'll be able to use PGO in the near future (or at all). As far as I understand PGO heavily relies on the application used for profiling covering all common use cases in a realistic fashion. And that's something that we simply cannot do.

Anyway, I understand your reasoning. When I wrote my mail I didn't know yet that the behavior I was seeing was only triggered by the more aggressive inlining for functions that are called only once per TU. And I didn't think of some of the things you mentioned.

Just out of curiosity, going back to...

> If someone inserts an element in a newly created vector, it does make sense to inline _M_realloc_insert, especially if we want any hope of making some small local vectors use the stack, or removing some unused small vectors.

Is that something that we can realistically expect in the next few years? I've seen manual new/delete calls removed by the optimizer in very simple examples, but I've never seen that happen with any kind of container. Seems like as soon as std::allocator is involved, the calls will not be optimized out.

Regards,
Paul

-----Original Message-----
From: Marc Glisse <marc.glisse@inria.fr>
Sent: Sonntag, 23. August 2020 21:55
To: Groke, Paul <paul.groke@dynatrace.com>
Cc: libstdc++@gcc.gnu.org
Subject: Re: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?

On Fri, 21 Aug 2020, Groke, Paul via Libstdc++ wrote:

> I've recently noticed that GCC is inlining vector::_M_realloc_insert
> into some of my functions that call vector::emplace_back.

Could you share which version of gcc you are testing with, what flags you are using, and ideally even some code to reproduce this? With current gcc, I don't see that happening very often.

> IMO that doesn't make a lot of sense - reallocation is usually so
> expensive that an extra function call doesn't really matter. And it
> bloats the code, which has two drawbacks: the binary gets bigger and
> the fast path gets slower (less efficient pre-fetching/more cache
> thrashing).
>
> What do you think about marking such functions __attribute__ ((noinline))?

That's a possibility, but I'd rather avoid it if possible. If someone inserts an element in a newly created vector, it does make sense to inline _M_realloc_insert, especially if we want any hope of making some small local vectors use the stack, or removing some unused small vectors.

Ideally, the inliner would already be clever enough not to inline the function except in special cases. It is rather large, calls other functions, is called with low probability (gcc guesses it as 17% on a simple example, using profile guided optimization would lower that number if the vectors are usually large), etc. This code isn't doing anything unusual (not like std::function or std::any), if the inliner behaves badly there, maybe a heuristic needs some tweaking.

> In our own code, we've seen measurable improvements (size & execution speed) by splitting the "slow path" out into helper functions and making those helper functions noinline.

Yes, that's a common strategy, and it would indeed be good to make sure that gcc does the right thing for std::vector.

> (I've also noticed that there's no __builtin_expect for the fast path,
> but that's a different topic and IMO far less important.)

Inlining takes probabilities into account, so this may be strongly related. However, gcc already guesses that the fast path is the most likely one, it isn't clear that making the probability of the slow path too low is a good idea.

Does profile-guided optimization help with your application?

--
Marc Glisse

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a company registered in Linz whose registered office is at 4020 Linz, Austria, Am Fünfundzwanziger Turm 20

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?
  2020-08-24  9:19   ` Groke, Paul
@ 2020-08-24 10:05     ` Marc Glisse
  2020-08-24 10:55       ` Groke, Paul
  0 siblings, 1 reply; 5+ messages in thread
From: Marc Glisse @ 2020-08-24 10:05 UTC (permalink / raw)
  To: Groke, Paul; +Cc: libstdc++

On Mon, 24 Aug 2020, Groke, Paul wrote:

>> Could you share which version of gcc you are testing with, what flags you are using, and ideally even some code to reproduce this? With current gcc, I don't see that happening very often.
>
> I stumbled upon this while looking at the generated code with godbolt.org. I tried GCC 9.2 and 10.2. Flags were -std=c++17 -O2, architecture was amd64.
>
> The code where I saw this was
>
> https://godbolt.org/z/oG9EMM
> // -------------------------------------------------
>
> #include <vector>
>
> struct Foo {
>    Foo(int a, int b) : a(a), b(b) {}
>    int a;
>    int b;
> };
>
> class Bar {
> public:
>    void test(int a, int b);
>
> private:
>    std::vector<Foo> m_fooVec;
> };
>
> void Bar::test(int a, int b) {
>    m_fooVec.emplace_back(a, b);
> }

Thanks. I am a bit surprised it gets inlined even at -O2, while a similar 
example with just 1 int isn't inlined even at -O3.

>> Ideally, the inliner would already be clever enough not to inline the function except in special cases.
>
> After writing this mail, I found out that it only happens if the function is called just once in the translation unit.

That definitely affects inlining decisions, indeed.

> I wish we could use LTCG/LTO with our project. Maybe I should give that 
> a try, see how badly it will affect our build times. But since our link 
> times are already kind-of high, I don't have high hopes.

-flto=auto (or any variant that enables parallelism) doesn't necessarily 
increase the build time, sometimes it can even decrease it. It seems worth 
a try.

> Just out of curiosity, going back to...
>
>> If someone inserts an element in a newly created vector, it does make sense to inline _M_realloc_insert, especially if we want any hope of making some small local vectors use the stack, or removing some unused small vectors.
>
> Is that something that we can realistically expect in the next few 
> years? I've seen manual new/delete calls removed by the optimizer in 
> very simple examples, but I've never seen that happen with any kind of 
> container. Seems like as soon as std::allocator is involved, the calls 
> will not be optimized out.

   std::allocator<int> a;
   a.deallocate(a.allocate(42),42);

gets simplified to nothing.

An unused

   std::vector<int> a={42};

doesn't get optimized, we are left with something like

   int* p = operator new(4);
   *p = 42;
   operator delete(p, 4);

and don't remove the store in *p before delete as dead (whether we are 
allowed to do that in general is not obvious).

To enable more simplifications, I often add in my projects

   inline void* operator new(std::size_t n){return malloc(n);}
   inline void operator delete(void*p)noexcept{free(p);}
   inline void operator delete(void*p,std::size_t)noexcept{free(p);}
etc

(it is illegal, but I don't care)
and then the vector disappears completely.

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?
  2020-08-24 10:05     ` Marc Glisse
@ 2020-08-24 10:55       ` Groke, Paul
  0 siblings, 0 replies; 5+ messages in thread
From: Groke, Paul @ 2020-08-24 10:55 UTC (permalink / raw)
  To: Marc Glisse; +Cc: libstdc++

> -flto=auto (or any variant that enables parallelism) doesn't necessarily increase the build time, sometimes it can even decrease it. It seems worth a try.

Thanks, I think we may give that a try!

> Thanks. I am a bit surprised it gets inlined even at -O2, while a similar example with just 1 int isn't inlined even at -O3.

I also noticed that vector<int>::push_back doesn't get inlined fully. But I thought maybe it was some manual tuning for commonly used vector types. Didn't try with a struct with just one integer.

>    std::allocator<int> a;
>    a.deallocate(a.allocate(42),42);
>
> gets simplified to nothing.

Interesting. I tried

    std::allocator<char> a;
    char* p = a.allocate(3);
    a.deallocate(p, 3);

and that didn't get optimized out. But that was with GCC 9, GCC 10 does optimize it away.

>    inline void* operator new(std::size_t n){return malloc(n);}
>    inline void operator delete(void*p)noexcept{free(p);}
>    inline void operator delete(void*p,std::size_t)noexcept{free(p);}
> etc
>
> (it is illegal, but I don't care)
> and then the vector disappears completely.

I wanted to ask what's illegal about that, but then I found it myself: it's the inline. Hm. Too bad.
Would it be possible to make GCC understand that new/delete can be optimized away without resorting to such a hack? Sounds like a good candidate for an opt-in option to me.

-----Original Message-----
From: Marc Glisse <marc.glisse@inria.fr>
Sent: Montag, 24. August 2020 12:06
To: Groke, Paul <paul.groke@dynatrace.com>
Cc: libstdc++@gcc.gnu.org
Subject: RE: Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert?

On Mon, 24 Aug 2020, Groke, Paul wrote:

>> Could you share which version of gcc you are testing with, what flags you are using, and ideally even some code to reproduce this? With current gcc, I don't see that happening very often.
>
> I stumbled upon this while looking at the generated code with godbolt.org. I tried GCC 9.2 and 10.2. Flags were -std=c++17 -O2, architecture was amd64.
>
> The code where I saw this was
>
> https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgodb
> olt.org%2Fz%2FoG9EMM&amp;data=02%7C01%7Cpaul.groke%40dynatrace.com%7C9
> 6b9e453466547eb924908d848155026%7C70ebe3a35b30435d9d677716d74ca190%7C1
> %7C0%7C637338603664439688&amp;sdata=c5PmtIOCBS0DuUK0SWrb141H3tGF%2BXJy
> Nbv%2FBewUR%2Fs%3D&amp;reserved=0 //
> -------------------------------------------------
>
> #include <vector>
>
> struct Foo {
>    Foo(int a, int b) : a(a), b(b) {}
>    int a;
>    int b;
> };
>
> class Bar {
> public:
>    void test(int a, int b);
>
> private:
>    std::vector<Foo> m_fooVec;
> };
>
> void Bar::test(int a, int b) {
>    m_fooVec.emplace_back(a, b);
> }

Thanks. I am a bit surprised it gets inlined even at -O2, while a similar example with just 1 int isn't inlined even at -O3.

>> Ideally, the inliner would already be clever enough not to inline the function except in special cases.
>
> After writing this mail, I found out that it only happens if the function is called just once in the translation unit.

That definitely affects inlining decisions, indeed.

> I wish we could use LTCG/LTO with our project. Maybe I should give
> that a try, see how badly it will affect our build times. But since
> our link times are already kind-of high, I don't have high hopes.

-flto=auto (or any variant that enables parallelism) doesn't necessarily increase the build time, sometimes it can even decrease it. It seems worth a try.

> Just out of curiosity, going back to...
>
>> If someone inserts an element in a newly created vector, it does make sense to inline _M_realloc_insert, especially if we want any hope of making some small local vectors use the stack, or removing some unused small vectors.
>
> Is that something that we can realistically expect in the next few
> years? I've seen manual new/delete calls removed by the optimizer in
> very simple examples, but I've never seen that happen with any kind of
> container. Seems like as soon as std::allocator is involved, the calls
> will not be optimized out.

   std::allocator<int> a;
   a.deallocate(a.allocate(42),42);

gets simplified to nothing.

An unused

   std::vector<int> a={42};

doesn't get optimized, we are left with something like

   int* p = operator new(4);
   *p = 42;
   operator delete(p, 4);

and don't remove the store in *p before delete as dead (whether we are allowed to do that in general is not obvious).

To enable more simplifications, I often add in my projects

   inline void* operator new(std::size_t n){return malloc(n);}
   inline void operator delete(void*p)noexcept{free(p);}
   inline void operator delete(void*p,std::size_t)noexcept{free(p);}
etc

(it is illegal, but I don't care)
and then the vector disappears completely.

--
Marc Glisse

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a company registered in Linz whose registered office is at 4020 Linz, Austria, Am Fünfundzwanziger Turm 20

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-08-24 10:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21 12:33 Would it make sense to use __attribute__ ((noinline)) for functions like _M_realloc_insert? Groke, Paul
2020-08-23 19:54 ` Marc Glisse
2020-08-24  9:19   ` Groke, Paul
2020-08-24 10:05     ` Marc Glisse
2020-08-24 10:55       ` Groke, Paul

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).