public inbox for glibc-bugs@sourceware.org
* [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications
@ 2010-02-08 20:23 rich at testardi dot com
  2010-02-09 15:28 ` [Bug libc/11261] " drepper at redhat dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-08 20:23 UTC (permalink / raw)
  To: glibc-bugs

malloc uses excessive memory for multi-threaded applications

The following program demonstrates malloc(3) consuming more than 600 megabytes 
of system memory even though the program never has more than 100 megabytes 
allocated at any given time.  This results from the use of thread-specific 
"preferred arenas" for memory allocations.

The program starts by having a number of threads contend over simple 
malloc/free calls, with no net memory allocation.  This establishes a 
preferred arena for each thread as a result of USE_ARENAS and PER_THREAD.  
Once the preferred arenas are established, the program has each thread, in 
turn, allocate 100 megabytes and then free all but 20 kilobytes, for a total 
net memory allocation of 200 kilobytes.  The resulting malloc_stats() show 600 
megabytes of allocated memory that cannot be returned to the system.

Over time, fragmentation of the heap can cause excessive paging even though 
actual memory allocation never exceeds system capacity.  With preferred arenas 
used this way, a multi-threaded program's memory usage is essentially 
unbounded (or rather, bounded by the number of threads times its actual memory 
usage).

The program run and source code are below, along with the glibc version from 
my RHEL5 system.  Thank you for your consideration.

[root@lab2-160 test_heap]# ./memx
creating 10 threads
allowing threads to contend to create preferred arenas
display preferred arenas
Arena 0:
system bytes     =     135168
in use bytes     =       2880
Arena 1:
system bytes     =     135168
in use bytes     =       2224
Arena 2:
system bytes     =     135168
in use bytes     =       2224
Arena 3:
system bytes     =     135168
in use bytes     =       2224
Arena 4:
system bytes     =     135168
in use bytes     =       2224
Arena 5:
system bytes     =     135168
in use bytes     =       2224
Total (incl. mmap):
system bytes     =     811008
in use bytes     =      14000
max mmap regions =          0
max mmap bytes   =          0
allowing threads to allocate 100MB each, sequentially in turn
thread 3 alloc 100MB
thread 3 free 100MB-20kB
thread 5 alloc 100MB
thread 5 free 100MB-20kB
thread 7 alloc 100MB
thread 7 free 100MB-20kB
thread 2 alloc 100MB
thread 2 free 100MB-20kB
thread 0 alloc 100MB
thread 0 free 100MB-20kB
thread 8 alloc 100MB
thread 8 free 100MB-20kB
thread 4 alloc 100MB
thread 4 free 100MB-20kB
thread 6 alloc 100MB
thread 6 free 100MB-20kB
thread 9 alloc 100MB
thread 9 free 100MB-20kB
thread 1 alloc 100MB
thread 1 free 100MB-20kB
Arena 0:
system bytes     =  100253696
in use bytes     =      40928
Arena 1:
system bytes     =  100184064
in use bytes     =      42352
Arena 2:
system bytes     =  100163584
in use bytes     =      22320
Arena 3:
system bytes     =  100163584
in use bytes     =      22320
Arena 4:
system bytes     =  100163584
in use bytes     =      22320
Arena 5:
system bytes     =  100204544
in use bytes     =      62384
Total (incl. mmap):
system bytes     =  601133056
in use bytes     =     212624
max mmap regions =          0
max mmap bytes   =          0
[root@lab2-160 test_heap]# rpm -q glibc
glibc-2.5-42.el5_4.2
glibc-2.5-42.el5_4.2
[root@lab2-160 test_heap]# 

====================================================================

[root@lab2-160 test_heap]# cat memx.c
// ****************************************************************************

#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <stdlib.h>
#include <malloc.h>    // malloc_stats()
#include <pthread.h>
#include <inttypes.h>
#include <time.h>      // nanosleep(), struct timespec

#define NTHREADS  10
#define NALLOCS  10000
#define ALLOCSIZE  10000

static volatile int go;
static volatile int die;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *ps[NALLOCS];  // allocations that are freed in turn by each thread
static void *pps1[NTHREADS];  // straggling allocations to prevent arena free
static void *pps2[NTHREADS];  // straggling allocations to prevent arena free

void
my_sleep(
    int ms
    )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread(
    void *context
    )
{
    int i;
    int rv;
    void *p;

    // first we spin to get our own arena
    while (go == 0) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }

    // then we give main a chance to print stats
    while (go == 1) {
        my_sleep(100);
    }
    assert(go == 2);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    printf("thread %d alloc 100MB\n", (int)(intptr_t)context);
    for (i = 0; i < NALLOCS; i++) {
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    printf("thread %d free 100MB-20kB\n", (int)(intptr_t)context);
    // N.B. we leave two allocations straggling
    pps1[(int)(intptr_t)context] = ps[0];
    for (i = 1; i < NALLOCS-1; i++) {
        free(ps[i]);
    }
    pps2[(int)(intptr_t)context] = ps[i];
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);

    return NULL;
}

int
main()
{
    int i;
    int rv;
    pthread_t thread;

    printf("creating %d threads\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }

    printf("allowing threads to contend to create preferred arenas\n");
    my_sleep(20000);

    printf("display preferred arenas\n");
    go = 1;
    my_sleep(1000);
    malloc_stats();

    printf("allowing threads to allocate 100MB each, sequentially in turn\n");
    go = 2;
    my_sleep(5000);
    malloc_stats();

    // free the stragglers
    for (i = 0; i < NTHREADS; i++) {
        free(pps1[i]);
        free(pps2[i]);
    }

    return 0;
}
[root@lab2-160 test_heap]#

-- 
           Summary: malloc uses excessive memory for multi-threaded
                    applications
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: drepper at redhat dot com
        ReportedBy: rich at testardi dot com
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
@ 2010-02-09 15:28 ` drepper at redhat dot com
  2010-02-09 16:02 ` rich at testardi dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: drepper at redhat dot com @ 2010-02-09 15:28 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From drepper at redhat dot com  2010-02-09 15:28 -------
You don't understand the difference between address space and allocated memory.
 The cost of large amounts of allocated address space is insignificant.

If you don't want it, control it using the MALLOC_ARENA_MAX and MALLOC_ARENA_TEST
envvars.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
  2010-02-09 15:28 ` [Bug libc/11261] " drepper at redhat dot com
@ 2010-02-09 16:02 ` rich at testardi dot com
  2010-02-10 13:10 ` rich at testardi dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-09 16:02 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From rich at testardi dot com  2010-02-09 16:01 -------
Actually, I totally understand the difference and that is why I mentioned the 
fragmentation of memory...  When each arena has just a few straggling 
allocations, the maximum *committed* RAM required for the program's *working 
set* using the thread-preferred arena model is, in fact, N times that required 
for a traditional model, where N is the number of threads.  This shows up in 
real-world thrashing that could actually be avoided.  Basically, if the 
program is doing small allocations, a small percentage of stragglers can pin 
the entire allocated space -- and the allocated space is, in fact, much larger 
than it needs to be (and larger than it is in other OS's).  But thank you for 
your time -- we all want the same thing here, a ever better Linux that is more 
suited to heavily threaded applications. :-)

-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
  2010-02-09 15:28 ` [Bug libc/11261] " drepper at redhat dot com
  2010-02-09 16:02 ` rich at testardi dot com
@ 2010-02-10 13:10 ` rich at testardi dot com
  2010-02-10 13:21 ` drepper at redhat dot com
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-10 13:10 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From rich at testardi dot com  2010-02-10 13:10 -------
Hi Ulrich,

I apologize in advance, and I want you to know I will not reopen this bug 
again, but I felt I had to show you a new test program that clearly 
demonstrates that "The cost of large amounts of allocated address space is 
insignificant" can be exceedingly untrue for heavily threaded systems using 
large amounts of memory.  In our product, we require 2x the RAM on Linux vs. 
other OS's because of this. :-(

I've reduced the problem to a program that runs fine when invoked with no 
options, but thrashes wildly with the "-x" option.  The only difference is 
that in the "-x" case the threads do some dummy malloc/frees up front to 
create thread-preferred arenas.

The program simply has a bunch of threads that, in turn (i.e., not 
concurrently), allocate a bunch of memory, and then free most (but not all!) 
of it.  The resulting allocations easily fit in RAM, even when fragmented.  It 
then attempts to memset the unfreed memory to 0.

The problem is that in the thread-preferred arena case, the fragmented 
allocations are now spread over 10x the virtual space, and when accessed, 
result in actual commitment of at least 2x the physical space -- enough to 
push us over the top of RAM and into thrashing.

As a result, without the -x option, the program's memset pass runs in two 
seconds or so on my system (8-way, 2 GHz, 12 GB RAM); with the -x option, it 
can take hundreds to thousands of seconds.

I know this sounds contrived, but it was in fact *derived* from a real-life 
problem.

All I am hoping to convey is that there are memory-intensive applications for 
which thread-preferred arenas actually hurt performance significantly.  
Furthermore, turning on MALLOC_PER_THREAD can have an even more devastating 
effect on these applications than the default behavior.  And unfortunately, 
neither MALLOC_ARENA_MAX nor MALLOC_ARENA_TEST can prevent the thread-
preferred arena proliferation.

The test run output without and with "-x" option are below; the source code is 
below that.

Thank you for your time.  As I said, I won't reopen this again, but I hope 
you'll consider giving applications like ours a "way out" of thread-preferred 
arenas in the future -- especially since our future looks even bleaker with 
MALLOC_PER_THREAD, and that is the direction you are moving (and for certain 
applications, MALLOC_PER_THREAD makes sense!).

Anyway, I've already written a small-block binned allocator that lives on top 
of mmap'd pages for us on Linux, so we're OK.  But I'd rather just use 
malloc(3).
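
For what it's worth, a rough sketch of the kind of allocator I mean is below -- 
a single size class carved out of mmap'd pages with a locked free list.  It is 
hypothetical and simplified (names and sizes made up), not our real 
implementation:

// sketch of a single-size-class bin allocator on top of mmap'd pages

#include <stddef.h>
#include <pthread.h>
#include <sys/mman.h>

#define BIN_BLOCK_SIZE   16384
#define BIN_CHUNK_BLOCKS 256           // blocks carved per mmap() call

struct bin_block { struct bin_block *next; };

static struct bin_block *bin_free_list;
static pthread_mutex_t bin_mutex = PTHREAD_MUTEX_INITIALIZER;

void *
bin_alloc(void)
{
    int i;
    char *chunk;
    struct bin_block *b;

    pthread_mutex_lock(&bin_mutex);
    if (bin_free_list == NULL) {
        // refill the bin straight from the kernel
        chunk = mmap(NULL, (size_t)BIN_BLOCK_SIZE * BIN_CHUNK_BLOCKS,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (chunk == MAP_FAILED) {
            pthread_mutex_unlock(&bin_mutex);
            return NULL;
        }
        for (i = 0; i < BIN_CHUNK_BLOCKS; i++) {
            b = (struct bin_block *)(chunk + (size_t)i * BIN_BLOCK_SIZE);
            b->next = bin_free_list;
            bin_free_list = b;
        }
    }
    b = bin_free_list;
    bin_free_list = b->next;
    pthread_mutex_unlock(&bin_mutex);
    return b;
}

void
bin_free(void *p)
{
    struct bin_block *b = p;

    pthread_mutex_lock(&bin_mutex);
    b->next = bin_free_list;
    bin_free_list = b;
    pthread_mutex_unlock(&bin_mutex);
}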

-- Rich

[root@lab2-160 test_heap]# ./memx2
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes     = 1557606400
in use bytes     =  743366944
Total (incl. mmap):
system bytes     = 1562529792
in use bytes     =  748290336
max mmap regions =          2
max mmap bytes   =    4923392
--- cat /proc/29565/status | grep -i vm ---
VmPeak:  9961304 kB
VmSize:  9951060 kB
VmLck:         0 kB
VmHWM:   2517656 kB
VmRSS:   2517656 kB
VmData:  9945304 kB
VmStk:        84 kB
VmExe:         8 kB
VmLib:      1532 kB
VmPTE:     19432 kB
--- accessing memory ---
--- done in 3 seconds ---


[root@lab2-160 test_heap]# ./memx2 -x
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- allowing threads to create preferred arenas ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes     = 1264455680
in use bytes     =  505209392
Arena 1:
system bytes     = 1344937984
in use bytes     =  653695200
Arena 2:
system bytes     = 1396580352
in use bytes     =  705338800
Arena 3:
system bytes     = 1195057152
in use bytes     =  503815408
Arena 4:
system bytes     = 1295818752
in use bytes     =  604577136
Arena 5:
system bytes     = 1094295552
in use bytes     =  403053744
Arena 6:
system bytes     = 1245437952
in use bytes     =  554196272
Arena 7:
system bytes     = 1144676352
in use bytes     =  453434608
Arena 8:
system bytes     = 1346199552
in use bytes     =  654958000
Total (incl. mmap):
system bytes     = 2742448128
in use bytes     =  748234656
max mmap regions =          2
max mmap bytes   =    4923392
--- cat /proc/29669/status | grep -i vm ---
VmPeak: 49213720 kB
VmSize: 49182988 kB
VmLck:         0 kB
VmHWM:  12052384 kB
VmRSS:  11861284 kB
VmData: 49177232 kB
VmStk:        84 kB
VmExe:         8 kB
VmLib:      1532 kB
VmPTE:     95452 kB
--- accessing memory ---
60 secs... 120 secs... 180 secs... 240 secs... 300 secs... 360 secs... 420 
secs... 480 secs... 540 secs... 600 secs... 660 secs... 720 secs... 780 secs...
--- done in 818 seconds ---
[root@lab2-160 test_heap]#


[root@lab2-160 test_heap]# cat memx2.c
// ****************************************************************************

#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>     // malloc_stats()
#include <pthread.h>
#include <inttypes.h>
#include <sys/types.h>  // uint
#include <time.h>       // nanosleep(), time()

#define NTHREADS  100
#define ALLOCSIZE  16384
#define STRAGGLERS  100

static uint cpus;
static uint pages;
static uint pagesize;

static uint nallocs;

static volatile int go;
static volatile int done;
static volatile int spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void **ps;  // allocations that are freed in turn by each thread
static int nps;
static void **ss;  // straggling allocations to prevent arena free
static int nss;

void
my_sleep(
    int ms
    )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread(
    void *context
    )
{
    int i;
    int n;
    int si;
    int rv;
    void *p;

    n = (int)(intptr_t)context;

    while (! go) {
        my_sleep(100);
    }

    // first we spin to get our own arena
    while (spin) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }

    my_sleep(1000);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    // N.B. we leave 1 of every STRAGGLERS allocations straggling
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        if (i%STRAGGLERS == 0) {
            si = nallocs/STRAGGLERS*n + i/STRAGGLERS;
            assert(si < nss);
            ss[si] = ps[i];
        } else {
            free(ps[i]);
        }
    }
    done++;
    printf("%d ", done);
    fflush(stdout);
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);

    return NULL;
}

int
main(int argc, char **argv)
{
    int i;
    int rv;
    time_t n;
    time_t t;
    time_t lt;
    pthread_t thread;
    char command[128];


    if (argc > 1) {
        if (! strcmp(argv[1], "-x")) {
            spin = 1;
            argc--;
            argv++;
        }
    }
    if (argc > 1) {
        printf("usage: memx2 [-x]\n");
        return 1;
    }

    cpus = sysconf(_SC_NPROCESSORS_CONF);
    pages = sysconf (_SC_PHYS_PAGES);
    pagesize = sysconf (_SC_PAGESIZE);
    printf("cpus = %d; pages = %d; pagesize = %d\n", cpus, pages, pagesize);

    nallocs = pages/10/STRAGGLERS*STRAGGLERS;
    assert(! (nallocs%STRAGGLERS));
    printf("nallocs = %d\n", nallocs);

    nps = nallocs;
    ps = malloc(nps*sizeof(*ps));
    assert(ps);
    nss = NTHREADS*nallocs/STRAGGLERS;
    ss = malloc(nss*sizeof(*ss));
    assert(ss);

    if (pagesize != 4096) {
        printf("WARNING -- this program expects 4096 byte pagesize!\n");
    }

    printf("--- creating %d threads ---\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }
    go = 1;

    if (spin) {
        printf("--- allowing threads to create preferred arenas ---\n");
        my_sleep(5000);
        spin = 0;
    }

    printf("--- waiting for threads to allocate memory ---\n");
    while (done != NTHREADS) {
        my_sleep(1000);
    }
    printf("\n");

    printf("--- malloc_stats() ---\n");
    malloc_stats();
    sprintf(command, "cat /proc/%d/status | grep -i vm", (int)getpid());
    printf("--- %s ---\n", command);
    (void)system(command);

    // access the stragglers
    printf("--- accessing memory ---\n");
    t = time(NULL);
    lt = t;
    for (i = 0; i < nss; i++) {
        memset(ss[i], 0, ALLOCSIZE);
        n = time(NULL);
        if (n-lt >= 60) {
            printf("%d secs... ", (int)(n-t));
            fflush(stdout);
            lt = n;
        }
    }
    if (lt != t) {
        printf("\n");
    }
    printf("--- done in %d seconds ---\n", (int)(time(NULL)-t));

    return 0;
}
[root@lab2-160 test_heap]#


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WONTFIX                     |


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
                   ` (2 preceding siblings ...)
  2010-02-10 13:10 ` rich at testardi dot com
@ 2010-02-10 13:21 ` drepper at redhat dot com
  2010-02-10 13:42 ` rich at testardi dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: drepper at redhat dot com @ 2010-02-10 13:21 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From drepper at redhat dot com  2010-02-10 13:21 -------
I already described what you can do to limit the number of memory pools.  Just
use it.  If you don't like envvars use the appropriate mallopt() calls (using
M_ARENA_MAX and M_ARENA_TEST).

No malloc implementation is optimal for all situations.  This is why there are
customization knobs.
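
For illustration, a hypothetical minimal sketch of that mallopt() route 
(M_ARENA_TEST and M_ARENA_MAX are not exported by this era's <malloc.h>, so 
the numeric values -7 and -8 from malloc.c are used; whether the knobs 
actually take effect on this glibc is what the follow-up comments below 
examine):

// minimal sketch: ask malloc to stay on a single arena, set up before
// any threads are created
#include <malloc.h>

#ifndef M_ARENA_TEST
#define M_ARENA_TEST  -7   // value from this era's malloc.c
#endif
#ifndef M_ARENA_MAX
#define M_ARENA_MAX   -8
#endif

int
main(void)
{
    mallopt(M_ARENA_TEST, 1);   // arenas created before the limit is checked
    mallopt(M_ARENA_MAX, 1);    // upper bound on the number of arenas
    // ... create threads and allocate as usual ...
    return 0;
}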

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
                   ` (3 preceding siblings ...)
  2010-02-10 13:21 ` drepper at redhat dot com
@ 2010-02-10 13:42 ` rich at testardi dot com
  2010-02-10 14:29 ` rich at testardi dot com
  2010-02-10 15:52 ` rich at testardi dot com
  6 siblings, 0 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-10 13:42 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From rich at testardi dot com  2010-02-10 13:41 -------
Hi Ulrich,

Agreed 100% no one size fits all...

Unfortunately, neither of the "tuning" settings, MALLOC_ARENA_MAX and 
MALLOC_ARENA_TEST, seems to work.  Neither do the mallopt() knobs M_ARENA_MAX 
and M_ARENA_TEST. :-(

Part of the problem seems to stem from the fact that the global "narenas" is 
only incremented if MALLOC_PER_THREAD/use_per_thread is true...

#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0)) {
    ++narenas;

    (void)mutex_unlock(&list_lock);
  }
#endif

So the tests of those other variables in reused_arena() never limit anything.  
And setting MALLOC_PER_THREAD makes our problem much worse.

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;

  ...

  if (narenas < narenas_limit)
    return NULL;

I also tried all combinations I could imagine of MALLOC_PER_THREAD and the 
other variables, to no avail.  I also did the same with mallopt(), verifying 
at the assembly level that we got all the right values into mp_. :-(

Specifically, I tried things like:

export MALLOC_PER_THREAD=1
export MALLOC_ARENA_MAX=1
export MALLOC_ARENA_TEST=1

and:

    rv = mallopt(-7, 1);  // M_ARENA_TEST
    printf("%d\n", rv);
    rv = mallopt(-8, 1);  // M_ARENA_MAX
    printf("%d\n", rv);

Anyway, thank you.  You've already pointed me in all of the right directions.  
If I did something completely brain-dead, above, feel free to tell me and save 
me another few days of work! :-)

-- Rich

-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
                   ` (4 preceding siblings ...)
  2010-02-10 13:42 ` rich at testardi dot com
@ 2010-02-10 14:29 ` rich at testardi dot com
  2010-02-10 15:52 ` rich at testardi dot com
  6 siblings, 0 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-10 14:29 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From rich at testardi dot com  2010-02-10 14:29 -------
And a comment for anyone else who might stumble this way...

I *can* reduce the total number of arenas to *2* (not low enough for our 
purposes) with the following sequence:

export MALLOC_PER_THREAD=1

    rv = mallopt(-7, 1);  // M_ARENA_TEST
    printf("%d\n", rv);
    rv = mallopt(-8, 1);  // M_ARENA_MAX
    printf("%d\n", rv);

*PLUS* I have to have a global pthread mutex around every malloc(3) and 
free(3) call -- I can't figure out from the code why this is required, but 
without it the number of arenas seems independent of the mallopt settings.
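
In other words, something like the following around every allocation (a 
minimal, hypothetical sketch -- not our real wrappers):

#include <stdlib.h>
#include <pthread.h>

static pthread_mutex_t alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

void *
locked_malloc(size_t size)
{
    void *p;

    // serialize every allocation through one global lock
    pthread_mutex_lock(&alloc_mutex);
    p = malloc(size);
    pthread_mutex_unlock(&alloc_mutex);
    return p;
}

void
locked_free(void *p)
{
    pthread_mutex_lock(&alloc_mutex);
    free(p);
    pthread_mutex_unlock(&alloc_mutex);
}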

I cannot get to *1* arena because a) mallopt() won't allow you to set 
arena_test to 0:

#ifdef PER_THREAD
  case M_ARENA_TEST:
    if (value > 0)
      mp_.arena_test = value;
    break;

  case M_ARENA_MAX:
    if (value > 0)
      mp_.arena_max = value;
    break;
#endif

And b) reused_arena() uses a "<=" here where a "<" would be needed:

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;




-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=11261


* [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
  2010-02-08 20:23 [Bug libc/11261] New: malloc uses excessive memory for multi-threaded applications rich at testardi dot com
                   ` (5 preceding siblings ...)
  2010-02-10 14:29 ` rich at testardi dot com
@ 2010-02-10 15:52 ` rich at testardi dot com
  6 siblings, 0 replies; 8+ messages in thread
From: rich at testardi dot com @ 2010-02-10 15:52 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From rich at testardi dot com  2010-02-10 15:52 -------
Last mail...

It turns out the arena_max and arena_test numbers are "fuzzy" (I am sure by 
design), since no lock is held here:

static mstate
internal_function
arena_get2(mstate a_tsd, size_t size)
{
  mstate a;
#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0)) {
    if ((a = get_free_list ()) == NULL
        && (a = reused_arena ()) == NULL)
      /* Nothing immediately available, so generate a new arena.  */
      a = _int_new_arena(size);
    return a;
  }
#endif

Therefore, if narenas is less than the limit tested for in reused_arena(), and 
N threads get into this code at once, narenas can end up N-1 *above* the 
limit.  The likelihood of this happening is proportional to the malloc arrival 
rate and the time spent in _int_new_arena().

This is exactly what I am seeing.

So if you can live with 2 arenas, the critical thing to do is to make sure 
narenas is exactly 2 before going heavily multi-threaded, and then it won't be 
able to go above 2; otherwise, it can sneak up to 2+N-1, where N is the number 
of threads contending for allocations.

If the ">=" in reused_arena() was changed to ">", then we could use this 
mechanism to limit narenas to exactly 1 right from the get-go.  That would be 
ideal for our kind of applications (that can't live with 2 arenas).
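
A hypothetical minimal sketch of that startup sequence (it assumes 
MALLOC_PER_THREAD=1 is set in the environment, relies on the glibc 2.5-era 
behavior quoted above, and may still need the global allocation lock described 
in the previous comment):

#include <stdlib.h>
#include <malloc.h>
#include <pthread.h>

static void *
warmup_thread(void *arg)
{
    // bind this thread to a second arena before the real workers exist,
    // so narenas is already exactly 2 when contention starts
    free(malloc(16));
    return arg;
}

int
main(void)
{
    pthread_t t;

    mallopt(-7, 1);   // M_ARENA_TEST
    mallopt(-8, 1);   // M_ARENA_MAX
    pthread_create(&t, NULL, warmup_thread, NULL);
    pthread_join(t, NULL);

    // ... now create the real worker threads ...
    return 0;
}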

-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=11261
