* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
2020-11-28 13:58 [Bug malloc/26969] New: A common malloc pattern can make memory not given back to OS keyid.w at qq dot com
From: keyid.w at qq dot com @ 2020-11-28 13:59 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
keyid.w at qq dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P2 |P1
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: keyid.w at qq dot com @ 2020-11-29 3:24 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
keyid.w at qq dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|enhancement |minor
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: uwydoc at gmail dot com @ 2020-12-01 1:03 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
uwydoc <uwydoc at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |uwydoc at gmail dot com
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: carlos at redhat dot com @ 2020-12-01 2:51 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |carlos at redhat dot com
Status|UNCONFIRMED |RESOLVED
Resolution|--- |NOTABUG
--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
The glibc implementation of malloc is a heap-based allocator, and in that design
the heap must be logically freed back down in the order in which it was
originally allocated, or the heap will continue to grow to hold the maximum
working set of chunks for the application.
If you want to free back down to zero at the last deallocation, you must tune
the allocator by disabling fastbins and tcache.
For example:
- Allocate A
- Allocate B
- Allocate C
- Free A
- Free B
- Free C
Consider that A, B, and C are all the same size.
Until "Free C" happens, the entire stack is held at 3 objects deep.
This can happen because tcache or fastbins hold the most recently freed chunk
for re-use. There is nothing wrong with this strategy, because the C library
does not know, a priori, whether you will carry out this entire workload again.
The worst-case degenerate situation for tcache is a sequence of allocations
that causes tcache to always hold the top-of-heap chunks as in use. In a real
program those chunks are refilled into tcache much more randomly, via malloc
from the unsorted-bin or small-bin refill strategy, so tcache should not keep
the top of the heap from freeing down in those cases. It is only in synthetic
test cases like this that I think you see tcache being the blocker to freeing
down from the top of the heap.
If you need to free pages between workloads and while idle, you can call
malloc_trim() to release page-sized consolidated parts of the heaps.
If you need a minimal working set, then you need to turn off fastbins and
tcache.
One possible enhancement we can make is to split the heaps by pool sizes, and
that's something I've talked about a bit with DJ Delorie. As it stands though
that would be a distinct enhancement.
I'm marking this as RESOLVED/NOTABUG since the algorithm is working as intended
but doesn't meet the needs of your specific synthetic workload. If you have a
real, non-synthetic workload that exhibits problems, please open a bug and we
can talk about it, review performance, and capture an API trace.
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: keyid.w at qq dot com @ 2020-12-01 8:43 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
keyid.w at qq dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|RESOLVED |UNCONFIRMED
Resolution|NOTABUG |---
--- Comment #2 from keyid.w at qq dot com ---
(In reply to Carlos O'Donell from comment #1)
> [...]
Thanks for your reply! I did in fact hit this problem in a real workload. I
tried to simplify my code; the result is still a little complex and is in C++.
The code is attached at the end.
There is a thread queue that executes n tasks with m worker threads. Each task
stores some calculated (field, value) data into a map. In my real workload I
compute some double values from loaded data and store them in the map; the
calculation process is very complex, so I simplified it here. I think creating
the map's keys (short strings) is similar to malloc-ing small pieces, and
creating the map's values (large vectors) is similar to malloc-ing large
pieces. However, if I don't use the thread queue, the memory is released, so I
guess some allocation inside the STL machinery of the thread queue compounds
the result. In fact, I used gdb to inspect the contents of the tcache/fast
bins near the top of the heap and found they were probably allocated to
something in the STL. Also, if I comment out the 149th line ("return dp;") and
uncomment the 147th and 148th lines, the memory is released. I don't know why.
You can compile it with "g++ test.cpp -o test -lpthread" and run it with
"./test task_number thread_number".
#include <cstdio>
#include <cstdlib>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <future>
#include <list>
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using namespace std;

class TestClass {
 public:
  void DoSomething() {
    const int Count = 10000;
    map<string, vector<double>> values;
    for (int i = 0; i < Count; ++i) {
      vector<double> v(10000);
      values[std::to_string(i)] = v;
    }
  }
};

class MultiThreadWorkQueue {
 public:
  // Constructor.
  // cache_size is the maximum capacity of the result cache.
  // n_threads is the number of worker threads. If it is 0, it will be set to
  // a reasonable value based on the number of cores.
  // If cache_size is 0, it will be set to n_threads.
  MultiThreadWorkQueue(int cache_size, int n_threads)
      : cache_size_(cache_size), n_threads_(n_threads) {
    for (int i = 0; i < n_threads_; ++i) {
      workers_.push_back(
          std::thread(&MultiThreadWorkQueue::ProcessTasks, this));
    }
  }

  ~MultiThreadWorkQueue() { Abort(); }

  void Enqueue(std::function<TestClass*()>&& func) {
    {
      std::unique_lock<std::mutex> ul(tasks_mutex_);
      tasks_.emplace(std::forward<std::function<TestClass*()>>(func));
    }
    worker_cv_.notify_one();
  }

  // Gets the result from the next task in the queue. If it is still pending,
  // blocks the current thread until the result is available.
  //
  // Note that calling this after Abort() will crash.
  TestClass* Dequeue() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    dequeue_cv_.wait(ul, [this] { return aborted_ || returns_.size() > 0; });
    std::future<TestClass*> future = std::move(returns_.front());
    returns_.pop();
    ul.unlock();
    worker_cv_.notify_one();
    return future.get();
  }

  // Stops executing any new tasks and joins all the worker threads.
  void Abort() {
    {
      std::unique_lock<std::mutex> ul(tasks_mutex_);
      if (aborted_) {
        return;
      }
      aborted_ = true;
    }
    worker_cv_.notify_all();
    dequeue_cv_.notify_all();
    for (auto& thread : workers_) {
      thread.join();
    }
  }

  // Size = N(Enqueue) - N(Dequeue).
  size_t Size() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    return returns_.size() + tasks_.size();
  }

 private:
  void ProcessTasks() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    while (!aborted_) {
      worker_cv_.wait(ul, [this]() {
        return aborted_ || (tasks_.size() > 0 &&
                            returns_.size() < (size_t)cache_size_);
      });
      if (aborted_) {
        break;
      }
      std::packaged_task<TestClass*()> t;
      t.swap(tasks_.front());
      tasks_.pop();
      returns_.emplace(t.get_future());
      ul.unlock();
      dequeue_cv_.notify_one();
      t();
      ul.lock();
    }
  }

  std::mutex tasks_mutex_;
  std::atomic<bool> aborted_{false};
  int cache_size_;
  int n_threads_;
  std::condition_variable worker_cv_;
  std::condition_variable dequeue_cv_;
  std::queue<std::packaged_task<TestClass*()>> tasks_;
  std::queue<std::future<TestClass*>> returns_;
  std::list<std::thread> workers_;
};

int main(int argc, char** argv) {
  int n = atoi(argv[1]);
  int thread_num = atoi(argv[2]);
  auto CreateDP = [] {
    TestClass* dp = new TestClass;
    dp->DoSomething();
    // delete dp;
    // return nullptr;
    return dp;
  };
  printf("* before run, press enter to continue");
  fflush(stdout);
  std::getchar();
  if (thread_num > 0) {
    printf("Multi-thread\n");
    MultiThreadWorkQueue work_queue(10, thread_num);
    for (int i = 0; i < n; ++i) {
      work_queue.Enqueue(CreateDP);
    }
    for (int i = 0; i < n; ++i) {
      std::unique_ptr<TestClass> dp(work_queue.Dequeue());
    }
  } else {
    printf("Single-thread\n");
    for (int i = 0; i < n; ++i) {
      fflush(stdout);
      std::unique_ptr<TestClass> dp(CreateDP());
    }
  }
  printf("* after run, press enter to continue");
  fflush(stdout);
  std::getchar();
  return 0;
}
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: dimahabr at gmail dot com @ 2021-01-29 16:08 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
Dmitry <dimahabr at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |dimahabr at gmail dot com
--- Comment #3 from Dmitry <dimahabr at gmail dot com> ---
This is a really common problem, especially for long-running multithreaded
processes with a lot of arenas. Calling malloc_trim is not always an option if
a service uses C bindings and is written in a high-level language. Also,
malloc_trim would cause unnecessary overhead, since it trims and locks all
arenas. Additional confusion comes from the man page for malloc_trim, which
says it is sometimes called during free.
I was looking at the code and have the following suggestion for improvement:
the main idea is to call mtrim for the arena in _int_free. To amortize the
performance overhead, call it only if free is called for a chunk with size
greater than FASTBIN_CONSOLIDATION_THRESHOLD and we have a chance to free more
memory than, say, 3*TRIM_THRESHOLD. To estimate how much we can return to the
OS, we can add a one-bit flag to each chunk, set it to 0 after returning that
memory to the OS, and set it to 1 otherwise. This would not give a 100%
accurate result, but it should give a good estimate.
I'd be glad to work on this functionality if you think it makes sense; if you
have other suggestions, I'm also happy to discuss them. The main point is that
mtrim should sometimes be called during free, otherwise the current malloc is
unfortunately hard to use for long-running multithreaded workloads, since it
uses 2-3x more RSS than jemalloc or tcmalloc.
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: keyid.w at qq dot com @ 2021-02-01 8:52 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
--- Comment #4 from XY Wen <keyid.w at qq dot com> ---
(In reply to Dmitry from comment #3)
> [...]
I think your solution would be helpful in this situation. I have no better idea
right now (and I think it is hard to design a trade-off strategy, since it
would probably need lots of testing). But jemalloc may have different methods
of solving this problem that you could refer to, because I noticed that virtual
memory is also reduced when using jemalloc. By the way, tcmalloc doesn't do
better in my usage.
--
* [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
From: romash at rbbn dot com @ 2022-06-29 16:34 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=26969
"Romash, Cliff" <romash at rbbn dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |romash at rbbn dot com
--