public inbox for libc-alpha@sourceware.org
* posix_memalign performance regression in 2.38?
@ 2023-08-04  2:52 Xi Ruoyao
  2023-08-04 14:12 ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 12+ messages in thread
From: Xi Ruoyao @ 2023-08-04  2:52 UTC (permalink / raw)
  To: libc-alpha

Hi,

There seems to be a performance regression in posix_memalign in glibc 2.38:

$ cat t.c
#include <stdlib.h>
int main()
{
	void *buf;
	for (int i = 0; i < (1 << 16); i++)
		posix_memalign(&buf, 64, 16);
}
$ cc t.c
$ time ./a.out 

real	0m0.008s
user	0m0.005s
sys	0m0.003s
$ time ~/sources/lfs/glibc-2.38/build/testrun.sh ./a.out 

real	0m4.376s
user	0m4.369s
sys	0m0.007s

The behavior seems worse than quadratic: if I change "1 << 16" to "1 <<
17", I get:

$ time ~/sources/lfs/glibc-2.38/build/testrun.sh ./a.out 

real	0m28.597s
user	0m28.568s
sys	0m0.022s

I've not bisected this yet.

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: posix_memalign performance regression in 2.38?
  2023-08-04  2:52 posix_memalign performance regression in 2.38? Xi Ruoyao
@ 2023-08-04 14:12 ` Adhemerval Zanella Netto
  2023-08-07 19:49   ` DJ Delorie
  0 siblings, 1 reply; 12+ messages in thread
From: Adhemerval Zanella Netto @ 2023-08-04 14:12 UTC (permalink / raw)
  To: Xi Ruoyao, libc-alpha, DJ Delorie



On 03/08/23 23:52, Xi Ruoyao via Libc-alpha wrote:
> Hi,
> 
> There seems to be a performance regression in posix_memalign in glibc 2.38:
> 
> $ cat t.c
> #include <stdlib.h>
> int main()
> {
> 	void *buf;
> 	for (int i = 0; i < (1 << 16); i++)
> 		posix_memalign(&buf, 64, 16);
> }
> $ cc t.c
> $ time ./a.out 
> 
> real	0m0.008s
> user	0m0.005s
> sys	0m0.003s
> $ time ~/sources/lfs/glibc-2.38/build/testrun.sh ./a.out 
> 
> real	0m4.376s
> user	0m4.369s
> sys	0m0.007s
> 
> The behavior seems worse than quadratic: if I change "1 << 16" to "1 <<
> 17", I get:
> 
> $ time ~/sources/lfs/glibc-2.38/build/testrun.sh ./a.out 
> 
> real	0m28.597s
> user	0m28.568s
> sys	0m0.022s
> 
> I've not bisected this yet.
> 

It seems to be caused by 24cdd6c71debfd10a9f7cb217fe2a2c4c486ed6f, where
posix_memalign now calls chunk_ok_for_memalign (which takes most of the time).

DJ, any idea on how we can improve this?


* Re: posix_memalign performance regression in 2.38?
  2023-08-04 14:12 ` Adhemerval Zanella Netto
@ 2023-08-07 19:49   ` DJ Delorie
  2023-08-07 19:57     ` Sam James
  2023-08-07 19:58     ` Noah Goldstein
  0 siblings, 2 replies; 12+ messages in thread
From: DJ Delorie @ 2023-08-07 19:49 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: xry111, libc-alpha

Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> It seems to be caused by 24cdd6c71debfd10a9f7cb217fe2a2c4c486ed6f, where
> posix_memalign now calls chunk_ok_for_memalign (which takes most of the time).
>
> DJ, any idea on how we can improve this?

This was not unexpected, as we're adding more logic to the memalign
family of functions in order to use less memory.  We can do some
optimizations in this area but it will remain a balancing act between
speed and memory usage.

The old code was terrible wrt memory usage.



* Re: posix_memalign performance regression in 2.38?
  2023-08-07 19:49   ` DJ Delorie
@ 2023-08-07 19:57     ` Sam James
  2023-08-07 20:15       ` DJ Delorie
  2023-08-08  3:38       ` DJ Delorie
  2023-08-07 19:58     ` Noah Goldstein
  1 sibling, 2 replies; 12+ messages in thread
From: Sam James @ 2023-08-07 19:57 UTC (permalink / raw)
  To: DJ Delorie
  Cc: Adhemerval Zanella Netto, xry111, libc-alpha,
	Andreas K. Hüttel, Timo Rothenpieler


DJ Delorie via Libc-alpha <libc-alpha@sourceware.org> writes:

> Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
>> It seems to be caused by 24cdd6c71debfd10a9f7cb217fe2a2c4c486ed6f, where
>> posix_memalign now calls chunk_ok_for_memalign (which takes most of the time).
>>
>> DJ, any idea on how we can improve this?
>
> This was not unexpected, as we're adding more logic to the memalign
> family of functions in order to use less memory.  We can do some
> optimizations in this area but it will remain a balancing act between
> speed and memory usage.
>
> The old code was terrible wrt memory usage.

We've had a bunch of reports of this impacting mpv and ffmpeg as well,
so it's not contrived FWIW.

See e.g. https://github.com/mpv-player/mpv/issues/12076 but I've
also heard from the ffmpeg developers who hit it themselves too.

In that bug, Xi Ruoyao mentions a problem in Mesa's test suite as well.

The glibc bug is https://sourceware.org/bugzilla/show_bug.cgi?id=30723.


* Re: posix_memalign performance regression in 2.38?
  2023-08-07 19:49   ` DJ Delorie
  2023-08-07 19:57     ` Sam James
@ 2023-08-07 19:58     ` Noah Goldstein
  2023-08-07 20:07       ` DJ Delorie
  1 sibling, 1 reply; 12+ messages in thread
From: Noah Goldstein @ 2023-08-07 19:58 UTC (permalink / raw)
  To: DJ Delorie; +Cc: Adhemerval Zanella Netto, xry111, libc-alpha

On Mon, Aug 7, 2023 at 2:50 PM DJ Delorie via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> > It seems to be caused by 24cdd6c71debfd10a9f7cb217fe2a2c4c486ed6f, where
> > posix_memalign now calls chunk_ok_for_memalign (which takes most of the time).
> >
> > DJ, any idea on how we can improve this?
>
> This was not unexpected, as we're adding more logic to the memalign
> family of functions in order to use less memory.  We can do some
> optimizations in this area but it will remain a balancing act between
> speed and memory usage.
>
> The old code was terrible wrt memory usage.
>

How is the perf compared to malloc?


* Re: posix_memalign performance regression in 2.38?
  2023-08-07 19:58     ` Noah Goldstein
@ 2023-08-07 20:07       ` DJ Delorie
  0 siblings, 0 replies; 12+ messages in thread
From: DJ Delorie @ 2023-08-07 20:07 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: adhemerval.zanella, xry111, libc-alpha

Noah Goldstein <goldstein.w.n@gmail.com> writes:
>> The old code was terrible wrt memory usage.
>
> How is the perf compared to malloc?

The old code *always* used at least 2X the memory of an unaligned malloc,
and never reused freed chunks, even if they were originally allocated
with a memalign.



* Re: posix_memalign performance regression in 2.38?
  2023-08-07 19:57     ` Sam James
@ 2023-08-07 20:15       ` DJ Delorie
  2023-08-08  3:38       ` DJ Delorie
  1 sibling, 0 replies; 12+ messages in thread
From: DJ Delorie @ 2023-08-07 20:15 UTC (permalink / raw)
  To: Sam James; +Cc: adhemerval.zanella, xry111, libc-alpha, dilfridge, timo

Sam James <sam@gentoo.org> writes:
> We've had a bunch of reports of this impacting mpv and ffmpeg as well,
> so it's not contrived FWIW.

Hmm... that sounds like a bug, not a performance issue.  I'll try to
reproduce it locally.



* Re: posix_memalign performance regression in 2.38?
  2023-08-07 19:57     ` Sam James
  2023-08-07 20:15       ` DJ Delorie
@ 2023-08-08  3:38       ` DJ Delorie
  2023-08-08  8:08         ` Xi Ruoyao
  1 sibling, 1 reply; 12+ messages in thread
From: DJ Delorie @ 2023-08-08  3:38 UTC (permalink / raw)
  To: Sam James; +Cc: adhemerval.zanella, xry111, libc-alpha, dilfridge, timo


Reproduced.

In the case where I reproduced it, the most common problematic case was
an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
chunks were tested without finding a match.

The most common non-problematic case was a 64-byte-aligned request for
24 bytes.

There were a LOT of other size requests.  The smallest I saw was TWO
bytes.  WHY?  I'm tempted to not fix this, to teach developers to not
use posix_memalign() unless they REALLY need it ;-)



* Re: posix_memalign performance regression in 2.38?
  2023-08-08  3:38       ` DJ Delorie
@ 2023-08-08  8:08         ` Xi Ruoyao
  2023-08-08 15:08           ` DJ Delorie
  2023-08-09 10:47           ` Florian Weimer
  0 siblings, 2 replies; 12+ messages in thread
From: Xi Ruoyao @ 2023-08-08  8:08 UTC (permalink / raw)
  To: DJ Delorie, Sam James; +Cc: adhemerval.zanella, libc-alpha, dilfridge, timo

On Mon, 2023-08-07 at 23:38 -0400, DJ Delorie wrote:
> 
> Reproduced.
> 
> In the case where I reproduced it, the most common problematic case was
> an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
> chunks were tested without finding a match.
> 
> The most common non-problematic case was a 64-byte-aligned request for
> 24 bytes.
> 
> There were a LOT of other size requests.  The smallest I saw was TWO
> bytes.  WHY?  I'm tempted to not fix this, to teach developers to not
> use posix_memalign() unless they REALLY need it ;-)


Have you tested this?

$ cat t.c
#include <stdlib.h>
int main()
{
	void *buf;
	for (int i = 0; i < (1 << 16); i++)
		posix_memalign(&buf, 64, 64);
}

To me this is quite reasonable (if we just want many blocks, each of which
fits into a cache line), but it costs 17.7 seconds on my system.  Do you
think people should just avoid this?  If so, we at least need to document
the issue in the manual.

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University


* Re: posix_memalign performance regression in 2.38?
  2023-08-08  8:08         ` Xi Ruoyao
@ 2023-08-08 15:08           ` DJ Delorie
  2023-08-09 10:47           ` Florian Weimer
  1 sibling, 0 replies; 12+ messages in thread
From: DJ Delorie @ 2023-08-08 15:08 UTC (permalink / raw)
  To: Xi Ruoyao; +Cc: sam, adhemerval.zanella, libc-alpha, dilfridge, timo

Xi Ruoyao <xry111@xry111.site> writes:
> Have you tested this?
>
> $ cat t.c
> #include <stdlib.h>
> int main()
> {
> 	void *buf;
> 	for (int i = 0; i < (1 << 16); i++)
> 		posix_memalign(&buf, 64, 64);
> }
>
> To me this is quite reasonable (if we just want many blocks, each of which
> fits into a cache line), but it costs 17.7 seconds on my system.  Do you
> think people should just avoid this?  If so, we at least need to document
> the issue in the manual.

This is the worst possible way (at least, with glibc's malloc) to
allocate the blocks you want.  You should call mmap() once and break up
the memory it returns.  Note: this has *always* been the worst possible
way; the new code just makes "worst" worse.

We will, of course, consider the worst case scenario and try to optimize
or limit it.  You still should not write code like that.



* Re: posix_memalign performance regression in 2.38?
  2023-08-08  8:08         ` Xi Ruoyao
  2023-08-08 15:08           ` DJ Delorie
@ 2023-08-09 10:47           ` Florian Weimer
  2023-08-09 16:59             ` Florian Weimer
  1 sibling, 1 reply; 12+ messages in thread
From: Florian Weimer @ 2023-08-09 10:47 UTC (permalink / raw)
  To: Xi Ruoyao via Libc-alpha
  Cc: DJ Delorie, Sam James, Xi Ruoyao, adhemerval.zanella, dilfridge, timo

* Xi Ruoyao via Libc-alpha:

> On Mon, 2023-08-07 at 23:38 -0400, DJ Delorie wrote:
>> 
>> Reproduced.
>> 
>> In the case where I reproduced it, the most common problematic case was
>> an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
>> chunks were tested without finding a match.
>> 
>> The most common non-problematic case was a 64-byte-aligned request for
>> 24 bytes.
>> 
>> There were a LOT of other size requests.  The smallest I saw was TWO
>> bytes.  WHY?  I'm tempted to not fix this, to teach developers to not
>> use posix_memalign() unless they REALLY need it ;-)
>
>
> Have you tested this?
>
> $ cat t.c
> #include <stdlib.h>
> int main()
> {
> 	void *buf;
> 	for (int i = 0; i < (1 << 16); i++)
> 		posix_memalign(&buf, 64, 64);
> }
>
> To me this is quite reasonable (if we just want many blocks each can fit
> into a cache line), but this costs 17.7 seconds on my system.  Do you
> think people just should avoid this?  If so we at least need to document
> the issue in the manual.

This code doesn't work well for glibc malloc (and other dlmalloc-style
mallocs), and never has.  Even with glibc 2.37, it produces a heap
layout like this:

v: 64-byte allocation boundary (all characters are 8 byte wide)
U: available user data
u: unused userdata tail
m: glibc metadata
-: data available for allocation

   v       v       v       v       v       v       v       v
   UUUUUUUUum--------------UUUUUUUUum--------------UUUUUUUUum

This can be seen from the 192-byte increments in the pointers.  The gaps
are not wide enough for reuse, so that part is expected.

However, we should not produce these gaps because with a clean heap, we
split from the remainder, so we should produce this more compact layout
instead:

   v       v       v       v       v       v       v       v
   UUUUUUUUum------UUUUUUUUum------UUUUUUUUum------UUUUUUUUum

It seems to me that this doesn't happen because we call _int_free to
give back the unused memory, and _int_free will use tcache and fastbins,
so it does not make memory available for consolidation.  Eventually this
memory is flushed to the low-level allocator, but that's too late,
because by then we already have another allocation after 112 bytes that
blocks further consolidation.  And of course these 112-byte chunks are
all unsuitably aligned for reuse.

(Even the compact layout wastes 50% of memory, but at least it's better
than what any glibc version produces today.)

DJ, could you look at bypassing the start of _int_free for these
deallocations?  Once we do that, at least for the synthetic reproducer,
I expect the bin lists to remain short, so hunting for aligned chunks
(which still will not exist) will be fast.

Thanks,
Florian



* Re: posix_memalign performance regression in 2.38?
  2023-08-09 10:47           ` Florian Weimer
@ 2023-08-09 16:59             ` Florian Weimer
  0 siblings, 0 replies; 12+ messages in thread
From: Florian Weimer @ 2023-08-09 16:59 UTC (permalink / raw)
  To: Xi Ruoyao via Libc-alpha
  Cc: DJ Delorie, Sam James, Xi Ruoyao, adhemerval.zanella, dilfridge, timo

* Florian Weimer:

> * Xi Ruoyao via Libc-alpha:
>
>> On Mon, 2023-08-07 at 23:38 -0400, DJ Delorie wrote:
>>> 
>>> Reproduced.
>>> 
>>> In the case where I reproduced it, the most common problematic case was
>>> an allocation of 64-byte aligned chunks of 472 bytes, where 30 smallbin
>>> chunks were tested without finding a match.
>>> 
>>> The most common non-problematic case was a 64-byte-aligned request for
>>> 24 bytes.
>>> 
>>> There were a LOT of other size requests.  The smallest I saw was TWO
>>> bytes.  WHY?  I'm tempted to not fix this, to teach developers to not
>>> use posix_memalign() unless they REALLY need it ;-)
>>
>>
>> Have you tested this?
>>
>> $ cat t.c
>> #include <stdlib.h>
>> int main()
>> {
>> 	void *buf;
>> 	for (int i = 0; i < (1 << 16); i++)
>> 		posix_memalign(&buf, 64, 64);
>> }
>>
>> To me this is quite reasonable (if we just want many blocks, each of which
>> fits into a cache line), but it costs 17.7 seconds on my system.  Do you
>> think people should just avoid this?  If so, we at least need to document
>> the issue in the manual.
>
> This code doesn't work well for glibc malloc (and other dlmalloc-style
> mallocs), and never has.  Even with glibc 2.37, it produces a heap
> layout like this:
>
> v: 64-byte allocation boundary (all characters are 8 byte wide)
> U: available user data
> u: unused userdata tail
> m: glibc metadata
> -: data available for allocation
>
>    v       v       v       v       v       v       v       v
>    UUUUUUUUum--------------UUUUUUUUum--------------UUUUUUUUum
>
> This can be seen from the 192-byte increments in the pointers.  The gaps
> are not wide enough for reuse, so that part is expected.
>
> However, we should not produce these gaps because with a clean heap, we
> split from the remainder, so we should produce this more compact layout
> instead:
>
>    v       v       v       v       v       v       v       v
>    UUUUUUUUum------UUUUUUUUum------UUUUUUUUum------UUUUUUUUum
>
> It seems to me that this doesn't happen because we call _int_free to
> give back the unused memory, and _int_free will use tcache and fastbins,
> so it does not make memory available for consolidation.  Eventually this
> memory is flushed to the low-level allocator, but that's too late,
> because by then we already have another allocation after 112 bytes that
> blocks further consolidation.  And of course these 112-byte chunks are
> all unsuitably aligned for reuse.
>
> (Even the compact layout wastes 50% of memory, but at least it's better
> than what any glibc version produces today.)

There's a second issue that makes this loop really sensitive to initial
heap alignment.  In _int_memalign, we have this:

  /* Also give back spare room at the end */
  if (!chunk_is_mmapped (p))
    {
      size = chunksize (p);
      if ((unsigned long) (size) > (unsigned long) (nb + MINSIZE))
        {
          remainder_size = size - nb;
          remainder = chunk_at_offset (p, nb);
          set_head (remainder, remainder_size | PREV_INUSE |
                    (av != &main_arena ? NON_MAIN_ARENA : 0));
          set_head_size (p, nb);
          _int_free (av, remainder, 1);
        }
    }

The MINSIZE slack is necessary to avoid creating chunks of less than
MINSIZE bytes, which the allocator cannot deal with.  But it also
prevents merging the tail with a following chunk that is unused
(including the top chunk).

Could someone who can reproduce this with a non-synthetic program please
file a bug in Bugzilla?  I'm going to post a draft patch, too.

Thanks,
Florian



