public inbox for libc-help@sourceware.org
* expectations of glibc's (de)allocator
@ 2022-01-18 11:32 Dominik Csapak
  2022-01-18 12:45 ` Christian Hoff
  0 siblings, 1 reply; 3+ messages in thread
From: Dominik Csapak @ 2022-01-18 11:32 UTC (permalink / raw)
  To: libc-help; +Cc: Wolfgang Bumiller

Hi,

I am sorry in advance for the wall of text, but maybe this list can help
me, or at least shed some light on the issues we have been having with
memory (de)allocation.

The setup: we have a long-running daemon written in Rust for x86_64
Linux (specifically Debian, currently bullseye) that uses Rust's default
allocator, which AFAIK is glibc's malloc/free.

The daemon makes heavy use of the async Rust frameworks tokio and hyper.
Our problem is the following: during the lifetime of the daemon,
there are some memory-heavy operations (network traffic, disk I/O, etc.),
so it allocates quite a bit of memory, but at the end of such an
operation the memory is still allocated to the program
(we checked the RSS/resident memory with e.g. htop/ps),
and even letting it run for extended periods of time does not really
release the memory. (We had customers where it retained over 5 GiB
of memory while basically doing nothing.)

There are some things we tried, e.g. tuning the options mentioned in
mallopt(3) (a short sketch of how such tuning looks from Rust follows
below the list):

* calling malloc_trim(0) at the end of the program released
   the memory (see the reproducer at the end)

* changing M_TRIM_THRESHOLD did not change anything; the
   memory stayed allocated to the program (even setting it
   to 0 or 1 did not make a difference).
   This surprised us quite a bit, since the documentation reads
   as if this would release the memory sooner, and because
   malloc_trim also released it.

* setting the M_MMAP_THRESHOLD option to a very low value fixed
   the behaviour (not really surprising, though).

(Of course, changing the allocator altogether, e.g. to jemalloc or
musl, also changed the behaviour, but we'd like to avoid that if
possible.)
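
For reference, here is roughly how such tuning looks from Rust. This is
only a sketch of what we tried, not code from our daemon: mallopt(3) is
glibc-specific and declared manually, and the M_* values are the
constants from glibc's <malloc.h>.

// Sketch: tuning glibc's allocator from Rust via mallopt(3).
// The constants below are the values from glibc's <malloc.h>.
use std::os::raw::c_int;

const M_TRIM_THRESHOLD: c_int = -1;
const M_MMAP_THRESHOLD: c_int = -3;

extern "C" {
    fn mallopt(param: c_int, value: c_int) -> c_int;
}

fn tune_allocator() {
    unsafe {
        // Trim the heap as soon as any free memory accumulates on top of it
        // (this is the setting that did *not* help in our case)...
        mallopt(M_TRIM_THRESHOLD, 0);
        // ...and/or serve even small allocations via mmap() so they are
        // unmapped again on free (this "fixed" it, at a performance cost).
        mallopt(M_MMAP_THRESHOLD, 4096);
    }
}

Such a function would be called early in main(), before the first
memory-heavy operation.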

We have a small reproducer that triggers this behaviour too
(pasted at the end of this mail). It starts a large number
of async tasks, waits for them to finish, and then drops the
async runtime completely (at this point the program cannot really
have any memory in use, yet it still holds memory according to htop).
To debug, we added a 'malloc_trim' call at the end, which actually
releases the memory to the OS.

So from our side it looks like either

* we (or the frameworks) trigger some bad/worst-case memory
   allocation pattern. In that case it would be interesting to know
   how we could check/debug that and, if possible, how to fix
   it in our program, or

* glibc's allocator has some bug regarding releasing memory to the
   OS. While I personally doubt that, it is curious that tuning
   M_TRIM_THRESHOLD does not seem to do anything. It would also
   be interesting to know how to debug/check that, of course.

I hope this list is not the completely wrong place to ask; if it is,
just say so (and maybe point me in the right direction).

thanks
Dominik


---- below is the reproducer (note: it uses about 1.4 GiB of memory at peak) ----
// Requires the `tokio` crate (with the "rt-multi-thread" and "time"
// features, or simply "full") and the `libc` crate.
use std::io;
use std::time::Duration;
use tokio::task;

// glibc's malloc_trim(3), declared manually.
extern "C" {
    fn malloc_trim(pad: libc::size_t) -> i32;
}

async fn wait(_i: usize) {
    let delay_in_seconds = Duration::new(2, 0);
    tokio::time::sleep(delay_in_seconds).await;
}

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async move {
        // Spawn a large number of short-lived tasks to drive up allocations.
        let num = 1_000_000;
        for i in 0..num {
            task::spawn(async move {
                wait(i).await;
            });
        }

        // Sleep twice (4 seconds) so all spawned tasks can finish.
        wait(0).await;
        wait(0).await;
    });

    println!("all tasks should be finished");
    // Pause so RSS can be checked with htop/ps while the runtime still exists.
    let mut buffer = String::new();
    io::stdin().read_line(&mut buffer).expect("error");

    drop(rt);
    println!("dropped runtime");

    // Pause again: RSS is still high here, although nothing is in use anymore.
    let mut buffer = String::new();
    io::stdin().read_line(&mut buffer).expect("error");

    unsafe { malloc_trim(0) };
    println!("called malloc_trim");

    // Final pause: RSS should now have dropped.
    let mut buffer = String::new();
    io::stdin().read_line(&mut buffer).expect("error");
}



* Re: expectations of glibc's (de)allocator
  2022-01-18 11:32 expectations of glibc's (de)allocator Dominik Csapak
@ 2022-01-18 12:45 ` Christian Hoff
  2022-01-19 10:42   ` Dominik Csapak
  0 siblings, 1 reply; 3+ messages in thread
From: Christian Hoff @ 2022-01-18 12:45 UTC (permalink / raw)
  To: Dominik Csapak, libc-help; +Cc: Wolfgang Bumiller

Hello Dominik,

this looks a lot like the same issue I faced in December, about which I
also contacted the glibc mailing list. You can read the e-mail thread in
the mailing list archives here:
https://sourceware.org/pipermail/libc-help/2021-November/006048.html
I described two separate issues in that mail - the first issue is the
one you are also facing. See especially the very helpful reply from
Carlos O'Donell, which explains this behaviour.

Basically, this is a known problem in glibc. If there is a memory chunk
with a long lifetime at the top of the heap, it prevents the whole heap
from being trimmed. malloc_trim(), however, is able to reclaim the
memory, because it walks all the way down the heap. So, currently, the
only available solution for you is to call malloc_trim() or to use a
different memory allocator (e.g. jemalloc). There has been a discussion
on this list about whether glibc should call malloc_trim() internally
when a lot of free memory on the heap cannot be returned because a chunk
with a long lifetime keeps the heap from being trimmed, but I think no
one followed up on that.
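
To make the failure mode concrete, here is a minimal sketch of the
pattern (an illustration only, not a guaranteed reproduction - chunk
placement is up to the allocator): a small long-lived allocation made
after a burst of short-lived ones can end up near the top of the heap
and keep it from shrinking automatically, while malloc_trim() can still
hand the freed pages below it back to the kernel.

// Illustration of the "long-lived chunk on top of the heap" pattern.
// Assumes glibc malloc and the libc crate; malloc_trim(3) is declared
// manually, as in the reproducer from the previous mail.
extern "C" {
    fn malloc_trim(pad: libc::size_t) -> i32;
}

fn main() {
    // 1) A burst of short-lived allocations grows the heap.
    let burst: Vec<Vec<u8>> = (0..100_000).map(|_| vec![0u8; 4096]).collect();

    // 2) A small allocation made now is likely placed near the current top
    //    of the heap and lives for the rest of the program.
    let long_lived = vec![0u8; 64];

    // 3) Freeing the burst leaves a large free region *below* the
    //    long-lived chunk; automatic trimming only shrinks the heap from
    //    the top, so RSS stays high.
    drop(burst);

    // 4) malloc_trim() walks the whole heap and can return the free
    //    interior pages to the kernel.
    unsafe { malloc_trim(0) };

    println!("still holding {} byte(s)", long_lived.len());
}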

This issue becomes even worse if you run the memory-intensive
computations multiple times in different threads. In that case, the
memory usage of the program keeps growing far beyond the actual peak
memory your program needs. This is because the threads are assigned to
different arenas and each arena gets larger and larger over the lifetime
of the program.
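
One knob that is not discussed in this thread but is related: the number
of arenas can be capped (mallopt() with M_ARENA_MAX, the
MALLOC_ARENA_MAX environment variable, or the glibc.malloc.arena_max
tunable). A sketch, again with the constant value taken from glibc's
<malloc.h>:

// Sketch: capping the number of glibc malloc arenas so threads share
// fewer heaps. M_ARENA_MAX (-8) is the value from glibc's <malloc.h>.
use std::os::raw::c_int;

const M_ARENA_MAX: c_int = -8;

extern "C" {
    fn mallopt(param: c_int, value: c_int) -> c_int;
}

fn cap_arenas(max_arenas: c_int) {
    // Fewer arenas means less memory stranded per thread, but more lock
    // contention in allocation-heavy multi-threaded programs.
    unsafe { mallopt(M_ARENA_MAX, max_arenas) };
}

This does not fix the trimming problem itself, but it bounds the number
of arenas that can each hold on to free memory.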

In my opinion, glibc's memory allocator has an architectural issue.
Larger allocations shouldn't be served by the individual arenas, but by
one central allocation area - just as tcmalloc redirects allocations
larger than 256 KB to the backend. That way, previously free()'ed large
memory chunks can easily be reused by another thread, even though that
other thread is served by a different arena.

We are in the process of changing our software to use tcmalloc instead
of glibc malloc because of this issue. I also noticed that jemalloc does
a better job than tcmalloc at returning memory to the OS, so that may
also be an option.


Best regards,

    Christian

On 1/18/22 12:32 PM, Dominik Csapak wrote:
> [quoted message snipped]


* Re: expectations of glibc's (de)allocator
  2022-01-18 12:45 ` Christian Hoff
@ 2022-01-19 10:42   ` Dominik Csapak
  0 siblings, 0 replies; 3+ messages in thread
From: Dominik Csapak @ 2022-01-19 10:42 UTC (permalink / raw)
  To: Christian Hoff, libc-help; +Cc: Wolfgang Bumiller

On 1/18/22 13:45, Christian Hoff wrote:
> Hello Dominik,

Hi, and thanks for your answer.

> [quoted message snipped]

yes, reading the threads, it seems using glibc's allocator is not an
option for us, though I am wondering whether all sufficiently complex
programs will run into such issues.

In our case we use Rust + tokio + hyper, which are very popular Rust
frameworks for async code, and it seems that using them in a pretty
standard way already runs into these problems. We also have the
problem that we cannot really know what the peak memory usage of our
application will be, because it depends mostly on user behaviour
and setup. And since our application is not necessarily the only
program running, returning memory to the OS is also important.

Tuning the options from mallopt(3), we did see some changes,
but AFAICS these were always too aggressive (memory was released
back too often, which also hurts performance...).

So it seems we'll have to evaluate different allocators...
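
(For the record, switching the global allocator in Rust is only a few
lines. A minimal sketch using the tikv-jemallocator crate - one possible
choice, not something we have settled on:)

// Sketch: replacing the default (glibc) allocator for the whole process.
// Requires the tikv-jemallocator crate in Cargo.toml; jemallocator or a
// tcmalloc binding would be wired up the same way.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Every heap allocation in the program now goes through jemalloc.
    let v: Vec<u8> = vec![0u8; 1 << 20];
    println!("allocated {} bytes via jemalloc", v.len());
}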

Thanks

Best Regards,
Dominik

