From: "keyid.w at qq dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug malloc/26969] A common malloc pattern can make memory not given back to OS
Date: Tue, 01 Dec 2020 08:43:30 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=26969

keyid.w at qq dot com changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|RESOLVED    |UNCONFIRMED
         Resolution|NOTABUG     |---

--- Comment #2 from keyid.w at qq dot com ---
(In reply to Carlos O'Donell from comment #1)
> The glibc implementation of malloc is a heap-based allocator, and in that
> design the heap must be logically freed back down in the order in which it
> was originally allocated, or the heap will continue to grow to keep a
> maximum working set of chunks for the application.
>
> If you want to free back down to zero at the last deallocation, you must
> tune the allocator by disabling fastbins and tcache.
>
> For example:
> - Allocate A
> - Allocate B
> - Allocate C
> - Free A
> - Free B
> - Free C
>
> Consider that A, B and C are all the same size.
>
> Until "Free C" happens, the entire stack is held at 3 objects deep.
>
> This can happen because tcache or fastbins hold the most recently freed
> chunk for re-use. There is nothing wrong with this strategy: the C library
> does not know, a priori, whether you will carry out this entire workload
> again.
>
> The worst-case degenerate situation for tcache is a sequence of allocations
> which causes tcache to always hold the top-of-heap chunks as in-use. In a
> real program those chunks are refilled into the tcache much more randomly,
> via malloc, from the unsorted-bin or small-bin refill strategy. Thus tcache
> should not keep the top of the heap from freeing down in those cases. It is
> only in synthetic test cases like this that I think you see tcache being
> the blocker to freeing down from the top of the heap.
>
> If you need to free pages between workloads and while idle, you can call
> malloc_trim() to release page-sized consolidated parts of the heaps.
>
> If you need a minimal working set, then you need to turn off fastbins and
> tcache.
>
> One possible enhancement we can make is to split the heaps by pool sizes,
> and that is something I have talked about a bit with DJ Delorie. As it
> stands, though, that would be a distinct enhancement.
> I'm marking this as RESOLVED/NOTABUG since the algorithm is working as
> intended but doesn't meet your specific synthetic workload. If you have a
> real non-synthetic workload that exhibits problems, please open a bug and
> we can talk about it, review performance, and capture an API trace.

Thanks for your reply! I did face this problem in a real workload. I tried
to simplify my code, but the final version is still a little complex and is
in C++; it is attached at the end.

There is a thread queue that executes n tasks with m worker threads. Each
task stores some calculated (field, value) data into a map. In my real
workload I calculate double values from loaded data and store them in the
map; the calculation itself is very complex, so I have simplified it here.
I think creating the map's keys (short strings) corresponds to malloc-ing
small pieces, and creating the map's values (large vectors) corresponds to
malloc-ing large pieces. However, if I don't use the thread queue, the
memory is released, so I suspect that some allocation made internally by
the STL machinery in the thread queue compounds the problem. In fact, I
used gdb to inspect the tcache/fastbin chunks near the top of the heap and
found they were probably allocated to something in the STL. Also, if I
comment out the line "return dp;" in main() and un-comment the two lines
above it ("delete dp;" and "return nullptr;"), the memory is released. I
don't know why.

You can compile the test with "g++ test.cpp -o test -lpthread" and run it
with "./test task_number thread_number".
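For reference, here is a minimal sketch of the tuning suggested in comment
#1, as I understand it from the glibc manual (the tunable name and the
mallopt()/malloc_trim() calls below reflect my reading of the
documentation, not anything verified in this report). tcache can be
disabled with a tunable when starting the program, e.g.

  GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ./test 100 4

and fastbins can be disabled, and the heap trimmed between workloads, from
inside the program:

#include <malloc.h>  // mallopt, malloc_trim (glibc extensions)

void TuneAllocator() {
  // Setting M_MXFAST to 0 disables fastbins, so small freed chunks can
  // consolidate instead of being held for re-use.
  mallopt(M_MXFAST, 0);
}

void BetweenWorkloads() {
  // Release page-sized consolidated parts of the heaps back to the OS.
  malloc_trim(0);
}

The simplified test program follows.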
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <future>
#include <list>
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

using namespace std;

class TestClass {
 public:
  void DoSomething() {
    const int Count = 10000;
    map<string, vector<double>> values;
    for (int i = 0; i < Count; ++i) {
      vector<double> v(10000);
      values[std::to_string(i)] = v;
    }
  }
};

class MultiThreadWorkQueue {
 public:
  // Constructor.
  // cache_size is the maximum capacity of the result cache.
  // n_threads is the number of worker threads. If it is 0, it will be set
  // to a reasonable value based on the number of cores.
  // If cache_size is 0, it will be set to n_threads.
  MultiThreadWorkQueue(int cache_size, int n_threads)
      : cache_size_(cache_size), n_threads_(n_threads) {
    for (int i = 0; i < n_threads_; ++i) {
      workers_.push_back(
          std::thread(&MultiThreadWorkQueue::ProcessTasks, this));
    }
  }

  ~MultiThreadWorkQueue() { Abort(); }

  void Enqueue(std::function<TestClass*()>&& func) {
    {
      std::unique_lock<std::mutex> ul(tasks_mutex_);
      tasks_.emplace(std::forward<std::function<TestClass*()>>(func));
    }
    worker_cv_.notify_one();
  }

  // Gets the result of the next task in the queue. If it is still pending,
  // blocks the current thread and waits until the result is available.
  //
  // Note that if this is called after Abort(), it will crash.
  TestClass* Dequeue() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    dequeue_cv_.wait(ul, [this] { return aborted_ || returns_.size() > 0; });
    std::future<TestClass*> future = std::move(returns_.front());
    returns_.pop();
    ul.unlock();
    worker_cv_.notify_one();
    return future.get();
  }

  // Stop executing any new tasks and join all the worker threads.
  void Abort() {
    {
      std::unique_lock<std::mutex> ul(tasks_mutex_);
      if (aborted_) {
        return;
      } else {
        aborted_ = true;
      }
    }
    worker_cv_.notify_all();
    dequeue_cv_.notify_all();
    for (auto& thread : workers_) {
      thread.join();
    }
  }

  // Size = N(Enqueue) - N(Dequeue).
  size_t Size() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    return returns_.size() + tasks_.size();
  }

 private:
  void ProcessTasks() {
    std::unique_lock<std::mutex> ul(tasks_mutex_);
    while (!aborted_) {
      worker_cv_.wait(ul, [this]() {
        return aborted_ ||
               (tasks_.size() > 0 &&
                returns_.size() < static_cast<size_t>(cache_size_));
      });
      if (aborted_) {
        break;
      }
      std::packaged_task<TestClass*()> t;
      t.swap(tasks_.front());
      tasks_.pop();
      returns_.emplace(t.get_future());
      ul.unlock();
      dequeue_cv_.notify_one();
      t();
      ul.lock();
    }
  }

  std::mutex tasks_mutex_;
  std::atomic<bool> aborted_{false};
  int cache_size_;
  int n_threads_;
  std::condition_variable worker_cv_;
  std::condition_variable dequeue_cv_;
  std::queue<std::packaged_task<TestClass*()>> tasks_;
  std::queue<std::future<TestClass*>> returns_;
  std::list<std::thread> workers_;
};

int main(int argc, char** argv) {
  int n = atoi(argv[1]);
  int thread_num = atoi(argv[2]);

  auto CreateDP = [] {
    TestClass* dp = new TestClass;
    dp->DoSomething();
    // delete dp;
    // return nullptr;
    return dp;
  };

  printf("* before run, press enter to continue");
  fflush(stdout);
  std::getchar();

  if (thread_num > 0) {
    printf("Multi-thread\n");
    MultiThreadWorkQueue work_queue(10, thread_num);
    for (int i = 0; i < n; ++i) {
      work_queue.Enqueue(CreateDP);
    }
    for (int i = 0; i < n; ++i) {
      std::unique_ptr<TestClass> dp(work_queue.Dequeue());
    }
  } else {
    printf("Single-thread\n");
    for (int i = 0; i < n; ++i) {
      fflush(stdout);
      std::unique_ptr<TestClass> dp(CreateDP());
    }
  }

  printf("* after run, press enter to continue");
  fflush(stdout);
  std::getchar();
  return 0;
}

-- 
You are receiving this mail because:
You are on the CC list for the bug.