public inbox for libc-alpha@sourceware.org
* Possible inline malloc alternatives: bitmap
@ 2018-03-02  5:54 Ondřej Bílka
  2018-03-02  5:54 ` Ondřej Bílka
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-02  5:54 UTC (permalink / raw)
  To: libc-alpha

As I mentioned in another thread, I plan to make an improved malloc and friends.
The work will be split into several aspects.

This one is about an option to inline small-size allocations, which would
give the same performance as using a memory pool.

I planned to make this generic, but that is not the case now, as I
realized the possibility of using bit tricks which lead to a more
efficient free, because it does not need to determine the size of the
allocation.

The representation for small sizes is simple: for each 1024-byte page
there would be metadata with a 64-bit integer whose i-th bit is 1 if and
only if there is a free chunk at offset 16*i of the page. This turns
both allocation and free into simple bit manipulation.
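
As a small worked example of the encoding (all values made up):

/* 1024-byte page; the chunks at offsets 16*1 and 16*5 are free:
     map = (1UL << 1) | (1UL << 5)          -> 0x22
   malloc takes the lowest set bit:
     offset = 16 * __builtin_ctzl (map)     -> 16
     map &= map - 1                         -> 0x20
   free (page + 16) sets the bit back:
     map |= 1UL << ((16 / 16) % 64)         -> 0x22  */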

Sample code showing what the inlined fast path would look like is below.
To handle space usage, refill_buckets would set up the bitmap pages.

In free, the case that calls noninline_free when the bitmap is zero
serves to register that there are now some free chunks on that page.

For returning memory to a thread other than the one that created it, one
would use atomic operations with basically the same algorithm.


For medium sizes the same idea of recording frees in a bitmap could work
with the same code; with bit operations I could quickly merge all
adjacent free chunks into one and delete the old entries.

This part is at the idea stage; I don't know how effective it will be to
merge multiple adjacent frees. I could do it with the following bit
tricks:

You need three bitmaps:

allocated - 1 where an allocated chunk starts
old_free - 1 where a free chunk that is already in the data structure starts
new_free - 1 where a chunk starts that was free'd but isn't in the data
structure yet.
To handle an area, first construct a mask with ones up to the position
of the lowest 1 in new_free:

low_mask = new_free ^ (new_free - 1)

To find the lower end of the free area, evaluate:

(old_free & low_mask) > (allocated & low_mask)

If this is true then the area starts at the highest 1 bit of old_free &
low_mask; otherwise it starts at the lowest 1 bit of new_free. In the
first case we need to remove that old_free element from the data
structure.

Then it's easy to see that the free area ends at the position of the
lowest 1 in the formula

allocated & ~low_mask

Additional elements that need to be deleted from the data structure are
determined by the following, where high_end_mask has ones up to the end
position found above; we accumulate them into a bitmap to_delete and
handle them all together at the end:

to_delete |= old_free & high_end_mask & ~low_mask;

Then we mask off the handled bits of new_free and repeat while new_free
is nonzero.
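
Put together, the loop could look like this (untested; record_free_area
and delete_entries stand in for the real data-structure operations):

#include <stdint.h>

extern void record_free_area (int start, int end);   /* placeholder */
extern void delete_entries (uint64_t to_delete);     /* placeholder */

static void
merge_new_frees (uint64_t allocated, uint64_t old_free, uint64_t new_free)
{
  uint64_t to_delete = 0;
  while (new_free != 0)
    {
      /* Ones up to and including the lowest newly freed bit.  */
      uint64_t low_mask = new_free ^ (new_free - 1);
      int start;
      if ((old_free & low_mask) > (allocated & low_mask))
        {
          /* An old free chunk directly precedes: the area starts at the
             highest 1 of old_free & low_mask and that entry must go.  */
          start = 63 - __builtin_clzl (old_free & low_mask);
          to_delete |= 1UL << start;
        }
      else
        start = __builtin_ctzl (new_free);

      /* The area ends at the start of the next allocated chunk above.  */
      uint64_t above = allocated & ~low_mask;
      int end = above != 0 ? __builtin_ctzl (above) : 64;
      uint64_t high_end_mask = end == 64 ? ~0UL : (1UL << end) - 1;

      /* Old free chunks inside the merged area are superseded.  */
      to_delete |= old_free & high_end_mask & ~low_mask;

      record_free_area (start, end);

      /* Mask off the handled bits of new_free and repeat.  */
      new_free &= ~high_end_mask;
    }
  delete_entries (to_delete);
}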


#include <stdint.h>
#include <stddef.h>

/* thread_id, refill_buckets, big_alloc, noninline_free and
   atomic_compare_swap are assumed to be provided elsewhere.  */

struct page_map {
  int thread;
  uint64_t map;
  uint64_t atomic_map;
};

extern inline struct page_map *get_page_map (void *p)
{
  uintptr_t m = (uintptr_t) p;
  m = 1024 * (m / 1024) - 32;
  return (struct page_map *) m;
}

extern inline uintptr_t base_addr (struct page_map *p)
{
  return ((uintptr_t) p) + 32;
}

/* One bucket per 16-byte size class; (size + 15) / 16 can be 16,
   so 17 entries are needed.  */
struct page_map *buckets[256/16 + 1];

extern inline void *inl_alloc (size_t size)
{
  if (size < 256)
    {
      int bucket = (size + 15) / 16;
      uint64_t map = buckets[bucket]->map;
      /* Note that we handle the case of malloc(0) by always calling
         refill_buckets (0) with a suitably filled bucket[0].  */
      if (map == 0)
        return refill_buckets (size);
      uint64_t new_map = map & (map - 1);
      uintptr_t address = base_addr (buckets[bucket]) + 16 * __builtin_ctzl (map);
      buckets[bucket]->map = new_map;
      return (void *) address;
    }
  else
    return big_alloc (size);
}

extern inline void inl_free (void *p)
{
#ifndef PAGEMAP_NULL_VALID
  if (p == NULL) return;
#endif
  struct page_map *m = get_page_map (p);
  if (m->thread == thread_id && m->map != 0)
    m->map |= 1UL << ((((uintptr_t) p) / 16) % 64);
  else
    {
      uint64_t n = m->atomic_map;
      uint64_t nm = n | (1UL << ((((uintptr_t) p) / 16) % 64));
      if (n == 0 || !atomic_compare_swap (&(m->atomic_map), n, nm))
        noninline_free (p);
    }
}
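
Here atomic_compare_swap is assumed; a minimal version on top of the
GCC __atomic builtins, returning nonzero on success, could be:

static inline int
atomic_compare_swap (uint64_t *p, uint64_t expected, uint64_t desired)
{
  /* Succeeds, storing desired, only if *p still equals expected.  */
  return __atomic_compare_exchange_n (p, &expected, desired, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}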


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02  5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka
  2018-03-02  5:54 ` Ondřej Bílka
@ 2018-03-02 16:28 ` Carlos O'Donell
  2018-03-02 20:21   ` Paul Eggert
  2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
  2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka
  2 siblings, 2 replies; 19+ messages in thread
From: Carlos O'Donell @ 2018-03-02 16:28 UTC (permalink / raw)
  To: Ondřej Bílka, libc-alpha

On 03/01/2018 01:33 PM, Ondřej Bílka wrote:
> As I mentioned in another thread, I plan to make an improved malloc and friends.
> The work will be split into several aspects.

Great to hear that!

Please keep in mind that at Red Hat we are working on 3 relatively new
things for malloc:

* Trace functionality (part of our whole-system trace work)
* Replay functionality (our simulator)
* Heap dumping functionality

Only the heap dumper will be impacted by any work done by you, but this
is OK; all that we ask is that you consider including us in your work so
we can stay abreast of the changes you are thinking about and be able to
comment on how the functionality might impact accurate dumping.

> This one is about an option to inline small-size allocations, which would
> give the same performance as using a memory pool.

What you describe below is a page-based allocator with a bit-map of the
in-use allocations. I have no objection to such enhancements for glibc's
malloc. However, these enhancements, like the tcache ones, must be driven
by collected workloads which can be run by others to verify the work yields
performance benefits across their machines.

When we did the tcache work we collected dozens of workloads, ran them
through trace, and re-simulated them on various architectures to look at
the performance characteristics of tcache.

Similarly, any future work should be driven by workloads and performance
measurements against those workloads. We can provide the workload data for
the workloads we used for tcache if you need them.

It was the workloads, for example, that showed that even with tcache,
fastbins were *still* a win from a performance perspective.

The harder work that I face today is not about performance, though. The
harder work, and I would hope you could help us on this, is *limiting*
RSS consumption and improving the core algorithms used in malloc to be
able to limit RSS without an immense cost to performance.

Developers want to deploy memory-limited systems, and glibc's malloc does
not always consider RSS costs. Some issues around this are:

* fastbins prevent free'ing down the heap from the top.
* fastbins can grow unbounded.
* heaps can only be freed from top down (unless you call malloc_trim() which
  can find sub-pages in the heap to free back).

In summary:

Application mixing of chunks with disparate lifetimes results in fragmented
heaps for a heap-based allocator, while page-based allocators lose at most a
page to those long-lived chunks. However, lifetime information is not at all
available in the current API, and we need novel research into ways to mitigate
this issue. Segregating allocations of similar size classes seems to be the
best practice for this, and nothing stops glibc from taking the same approach,
e.g. multiple heaps for segregated size classes.

While I am not dissuading you from working on a page-based extension to glibc's
malloc, I think tackling RSS reduction, perhaps using page-based methods, is
where the community should focus their attention. It would yield the largest
net win for our users based on the bugs and feedback we've been getting.

Again, I also reiterate it must be driven by workloads, to create a data-driven
optimization process.

> I planned to make this generic, but that is not the case now, as I
> realized the possibility of using bit tricks which lead to a more
> efficient free, because it does not need to determine the size of the
> allocation.
> 
> The representation for small sizes is simple: for each 1024-byte page
> there would be metadata with a 64-bit integer whose i-th bit is 1 if and
> only if there is a free chunk at offset 16*i of the page. This turns
> both allocation and free into simple bit manipulation.
> 
> Sample code showing what the inlined fast path would look like is below.
> To handle space usage, refill_buckets would set up the bitmap pages.
> 
> In free, the case that calls noninline_free when the bitmap is zero
> serves to register that there are now some free chunks on that page.
> 
> For returning memory to a thread other than the one that created it, one
> would use atomic operations with basically the same algorithm.
> 
> For medium sizes the same idea of recording frees in a bitmap could work
> with the same code; with bit operations I could quickly merge all
> adjacent free chunks into one and delete the old entries.
> 
> This part is at the idea stage; I don't know how effective it will be to
> merge multiple adjacent frees. I could do it with the following bit
> tricks:
> 
> You need three bitmaps:
> 
> allocated - 1 where an allocated chunk starts
> old_free - 1 where a free chunk that is already in the data structure starts
> new_free - 1 where a chunk starts that was free'd but isn't in the data
> structure yet.
> 
> To handle an area, first construct a mask with ones up to the position
> of the lowest 1 in new_free:
> 
> low_mask = new_free ^ (new_free - 1)
> 
> To find the lower end of the free area, evaluate:
> 
> (old_free & low_mask) > (allocated & low_mask)
> 
> If this is true then the area starts at the highest 1 bit of old_free &
> low_mask; otherwise it starts at the lowest 1 bit of new_free. In the
> first case we need to remove that old_free element from the data
> structure.
> 
> Then it's easy to see that the free area ends at the position of the
> lowest 1 in the formula
> 
> allocated & ~low_mask
> 
> Additional elements that need to be deleted from the data structure are
> determined by the following, where high_end_mask has ones up to the end
> position found above; we accumulate them into a bitmap to_delete and
> handle them all together at the end:
> 
> to_delete |= old_free & high_end_mask & ~low_mask;
> 
> Then we mask off the handled bits of new_free and repeat while new_free
> is nonzero.
> 
> #include <stdint.h>
> #include <stddef.h>
> 
> /* thread_id, refill_buckets, big_alloc, noninline_free and
>    atomic_compare_swap are assumed to be provided elsewhere.  */
> 
> struct page_map {
>   int thread;
>   uint64_t map;
>   uint64_t atomic_map;
> };
> 
> extern inline struct page_map *get_page_map (void *p)
> {
>   uintptr_t m = (uintptr_t) p;
>   m = 1024 * (m / 1024) - 32;
>   return (struct page_map *) m;
> }
> 
> extern inline uintptr_t base_addr (struct page_map *p)
> {
>   return ((uintptr_t) p) + 32;
> }
> 
> /* One bucket per 16-byte size class; (size + 15) / 16 can be 16,
>    so 17 entries are needed.  */
> struct page_map *buckets[256/16 + 1];
> 
> extern inline void *inl_alloc (size_t size)
> {
>   if (size < 256)
>     {
>       int bucket = (size + 15) / 16;
>       uint64_t map = buckets[bucket]->map;
>       /* Note that we handle the case of malloc(0) by always calling
>          refill_buckets (0) with a suitably filled bucket[0].  */
>       if (map == 0)
>         return refill_buckets (size);
>       uint64_t new_map = map & (map - 1);
>       uintptr_t address = base_addr (buckets[bucket]) + 16 * __builtin_ctzl (map);
>       buckets[bucket]->map = new_map;
>       return (void *) address;
>     }
>   else
>     return big_alloc (size);
> }
> 
> extern inline void inl_free (void *p)
> {
> #ifndef PAGEMAP_NULL_VALID
>   if (p == NULL) return;
> #endif
>   struct page_map *m = get_page_map (p);
>   if (m->thread == thread_id && m->map != 0)
>     m->map |= 1UL << ((((uintptr_t) p) / 16) % 64);
>   else
>     {
>       uint64_t n = m->atomic_map;
>       uint64_t nm = n | (1UL << ((((uintptr_t) p) / 16) % 64));
>       if (n == 0 || !atomic_compare_swap (&(m->atomic_map), n, nm))
>         noninline_free (p);
>     }
> }

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02  5:54 ` Ondřej Bílka
@ 2018-03-02 18:20   ` DJ Delorie
  0 siblings, 0 replies; 19+ messages in thread
From: DJ Delorie @ 2018-03-02 18:20 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha


Note that there are existing malloc implementations that may already do
what you plan; it might be worth a little time to see whether they're a
better starting point than glibc's malloc.

Also, I have a corpus of workloads for benchmarking malloc here:

	http://www.delorie.com/malloc/

(note: that's my house network, please be gentle and kind when
downloading ;)

The simulator that runs those workloads is in glibc's dj/malloc branch
(malloc/trace_run.c).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 16:28 ` Carlos O'Donell
@ 2018-03-02 20:21   ` Paul Eggert
  2018-03-03 12:56     ` Ondřej Bílka
  2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
  1 sibling, 1 reply; 19+ messages in thread
From: Paul Eggert @ 2018-03-02 20:21 UTC (permalink / raw)
  To: Carlos O'Donell, Ondřej Bílka, libc-alpha

On 03/02/2018 08:27 AM, Carlos O'Donell wrote:
> lifetime information is not at all
> available in the current API, and we need novel research into ways to mitigate
> this issue

We do need research here. And it's not just lifetime info; it's also 
cache pollution. I looked for relevant recent work and found NightWatch:

Liao X, Guo R, Jin H. Enhancing the Malloc System with Pollution 
Awareness for Better Cache Performance. IEEE Trans Parallel Distributed 
Sys. 2017-03-01. 28(3):731-45. http://dx.doi.org/10.1109/TPDS.2016.2587644

NightWatch doesn't do what you want (it focuses on cache-aware malloc). 
Also, it requires Linux kernel mods. Still, it may be useful as a source 
of ideas.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 16:28 ` Carlos O'Donell
  2018-03-02 20:21   ` Paul Eggert
@ 2018-03-02 23:00   ` Ondřej Bílka
  2018-03-05 20:32     ` DJ Delorie
  1 sibling, 1 reply; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-02 23:00 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha

On Fri, Mar 02, 2018 at 08:27:50AM -0800, Carlos O'Donell wrote:
> On 03/01/2018 01:33 PM, Ondřej Bílka wrote:
> > As I mentioned in another thread, I plan to make an improved malloc and friends.
> > The work will be split into several aspects.
> 
> Great to hear that!
> 
> Please keep in mind that at Red Hat we are working on 3 relatively new
> things for malloc:
> 
> * Trace functionality (part of our whole-system trace work)
> * Replay functionality (our simulator)

These two are wasted work, as any relation to actual performance is
purely coincidental.

Running a program first with allocator A, then with allocator B, and
comparing total run time is the only way to get accurate results (unless
it is just a micro-optimization that doesn't change any allocated
address).

It doesn't measure the majority of performance factors, because they
happen outside of malloc and depend on the resulting cache layout, so
obviously it cannot capture performance.

A simple case to demonstrate this:

#include <stdlib.h>

int i, j, al;
int *ary;

/* Toy allocator: uncomment one of the variants to get a different
   layout.  */
int *tmalloc (int size)
{
//  al += 1;
//  al += 10;
//  al = rand () % 20000;
  al = rand () % 1000;
  return ary + al;
}

int main (void)
{
  al = 0;
  ary = malloc (100000000);
  int *x[1000];
  for (i = 0; i < 1000; i++)
    x[i] = tmalloc (4);
  for (j = 0; j < 1000000; j++)
    for (i = 0; i < 1000; i++)
      x[i][0] = 42;
  return 0;
}

Depending on which of the four layout variants this toy allocator uses,
the last case takes four times longer than the first. These cache
factors are ubiquitous.
For example, it could appear that keeping metadata outside the allocated
chunk is faster, because the prefetching it does for the application
significantly reduces overall latency.


> The harder work that I face today is not about performance, though. The
> harder work, and I would hope you could help us on this, is *limiting*
> RSS consumption and improving the core algorithms used in malloc to be
> able to limit RSS without an immense cost to performance.
> 
> Developers want to deploy memory-limited systems, and glibc's malloc does
> not always consider RSS costs. Some issues around this are:
> 
> * fastbins prevent free'ing down the heap from the top.
> * fastbins can grow unbounded.
> * heaps can only be freed from top down (unless you call malloc_trim() which
>   can find sub-pages in the heap to free back).
> 
The real problem with the first two is that the current code is a
convoluted mess.

For different representations of small allocations, it's easy to bound
space usage with a tunable for the maximal number of pages used by them.
When no page is available it will just allocate from the heap. This
would be easy to add with clean code.

As for the third problem, freeing pages: after thinking for a few hours
I realized that bitmaps are the only sane way to keep track of that (and
five other insane ways).

The idea is that we will do madvise(MADV_DONTNEED) on every page that
wasn't used for a specified amount of time. You need four bitmaps:

free - denotes a page that is completely free
cleared - a free page on which we already called madvise
live1, live2 - these two capture recent activity on a page; after each
specified time period we copy live1 into live2 and zero live1

Then the pages that need madvise are given by the formula: free & ~cleared & ~(live1 | live2)

Then there is the trick that for large allocations one should always
clear the upper half of the memory, because a buffer rarely uses its
whole area.
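
A minimal sketch of that aging scan, with one bit per page in each
bitmap and a made-up madvise_page helper:

#include <stdint.h>

struct page_group
{
  uint64_t free, cleared, live1, live2;
};

extern void madvise_page (struct page_group *g, int i); /* made up:
                    madvise (MADV_DONTNEED) on page i of the group */

/* Called once per aging period.  */
static void
age_pages (struct page_group *g)
{
  uint64_t idle = g->free & ~g->cleared & ~(g->live1 | g->live2);
  while (idle != 0)
    {
      int i = __builtin_ctzl (idle);
      madvise_page (g, i);
      g->cleared |= 1UL << i;
      idle &= idle - 1;   /* clear the lowest set bit */
    }
  /* Age the liveness history.  */
  g->live2 = g->live1;
  g->live1 = 0;
}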

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 20:21   ` Paul Eggert
@ 2018-03-03 12:56     ` Ondřej Bílka
  2018-03-03 20:24       ` Paul Eggert
  0 siblings, 1 reply; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-03 12:56 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha

On Fri, Mar 02, 2018 at 12:21:27PM -0800, Paul Eggert wrote:
> On 03/02/2018 08:27 AM, Carlos O'Donell wrote:
> >lifetime information is not at all
> >available in the current API, and we need novel research into ways to mitigate
> >this issue
> 
> We do need research here. And it's not just lifetime info; it's also
> cache pollution. I looked for relevant recent work and found
> NightWatch:
> 
> Liao X, Guo R, Jin H. Enhancing the Malloc System with Pollution
> Awareness for Better Cache Performance. IEEE Trans Parallel
> Distributed Sys. 2017-03-01. 28(3):731-45.
> http://dx.doi.org/10.1109/TPDS.2016.2587644
> 
> NightWatch doesn't do what you want (it focuses on cache-aware
> malloc). Also, it requires Linux kernel mods. Still, it may be
> useful as a source of ideas.

This is paywalled so I didn't read it. The abstract is too vague: it
doesn't say what exactly they do, and the claimed speedup looks like a
marketing trick, as it is too big considering that it applies only when
memory is managed with a lot of small allocations.

One problem is that it merges several different issues. For an
allocator, I know of the following issues and how I plan to handle them:

1. Cache-line sharing. Multiple small chunks could fit into the same
cache line; also, two longer chunks could share a cache line at their
borders.

Eliminating border sharing is relatively easy: it suffices that
everything in the main heap has alignment and capacity that are
multiples of a cache line. For small sizes, use pools of equal
capacities; this permits only limited border sharing when a pool is, for
example, of 1 1/2 cache-line capacity.

Sub-cache-line requests lead to two problems:

1a. Bad cache utilization. You want to avoid the situation where only a
small part of a cache line is frequently accessed (hot) while the rest
is rarely accessed (cold). You want hot data with hot data and cold with
cold.

Here the low-hanging fruit is that we currently use a terrible data
structure that fills half of the cache line with cold metadata about the
chunk.

Another good distinction here is whether data is short-lived or
long-lived. With long-lived data, complex profiling is required to find
out whether it is hot or cold. Short-lived data is hot by definition, as
it gets allocated, written to, and freed before it would be evicted from
cache. It is easy to check whether data is short-lived with the basic
profiling described below.

A bigger advantage of finding what is short-lived is in space
management, as long-lived small chunks pinning down a page are one of
the main problems; with all chunks short-lived, it suffices to avoid
allocating from a page for a while and you will have an empty page. A
longer explanation of how I plan to profile is at [1].

1b. Cross-talk, which happens when two cores write to different chunks
that happen to be on the same cache line. The cores must synchronize
state, which leads to worse performance.

A per-thread cache that returns a chunk to its original thread on free
is the best we can do without profiling. It looks like NightWatch does
that, and when a non-allocating thread writes to a chunk, it gives out a
cache line with only one chunk.

2. Cache layout of multi-cache-line chunks.
As there is no border sharing, the only two relevant factors are:
a. when the data in a chunk will first be accessed;
b. where in the cache hierarchy the chunk's cache lines are when free is
called.

Of these, b is more important, as for a you could probably assume that
the data will be written in short order.

Obviously, where a is short you want to keep handing out hot chunks
before they become cold.

For b, one could easily find that out by profiling: when free happens,
do reads of the chunk's memory and measure how long they took.

3. Spatial locality and prefetching.

For small allocations this is related to 1a: for subsequent allocations
you want to hand out chunks in the same cache line, as they are likely
to be accessed together.

Another consideration is to try to make subsequent allocations use a
continuous memory area, to maximize spatial locality. As a heuristic,
these tend to be accessed in the same order as they were allocated,
which allows the hardware prefetcher to do its job.


[1]
For this profiling one would reuse the inlining code to add information
about each call site; other means of identifying call sites are
significantly more time-intensive. Here the impact is minimal: for an
allocation it is just calculating tcb + atomic_malloc_site_offset.

For the first few allocations this would point to a data structure set
up to cause a call of refill_buckets, where one returns chunks using
another of the planned features, an allocation-specific free callback.

This works by letting one say that for chunks on a given page, a given
callback will be called instead of the normal free. This has nearly zero
effect on free performance, as the page metadata is set up to fall
through to the slow path before one needs to check whether there is a
callback.

That callback will keep statistics on how many of the allocated chunks
were freed.

On the 32nd allocation (for example) from a given site, we change the
offset or site data to allocate from the long-lived heap.

If the user did the equivalent of -fprofile-generate, some profiling (it
doesn't really matter what kind, as gcc and this just need the basics),
and then -fprofile-use, it would be more accurate, and recognizing
whether long-lived data is hot or cold becomes doable.
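
Roughly, the per-site bookkeeping could look like this (all names and
thresholds are made up for illustration; the text above decides on the
32nd allocation, while this sketch decides once 32 sampled chunks were
freed, which is simpler to show):

#include <stdint.h>

struct malloc_site
{
  uint32_t freed;       /* sampled chunks freed so far */
  uint32_t freed_fast;  /* of those, freed within the threshold */
  int long_lived;       /* decision once the samples are in */
};

extern uint32_t coarse_ms_clock;             /* made up: bumped every 1-2 ms */
extern uint32_t chunk_alloc_time (void *p);  /* made up: per-chunk stamp */
extern void default_free (void *p);          /* made up */

/* Hypothetical free callback installed on the pages that serve the
   first 32 allocations from a call site.  */
static void
site_free_callback (void *p, struct malloc_site *site)
{
  if (coarse_ms_clock - chunk_alloc_time (p) < 10)
    site->freed_fast++;
  if (++site->freed == 32)
    /* Mostly quick frees -> treat the site as short-lived.  */
    site->long_lived = site->freed_fast < 24;
  default_free (p);
}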


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 12:56     ` Ondřej Bílka
@ 2018-03-03 20:24       ` Paul Eggert
  2018-03-03 23:37         ` Carlos O'Donell
  2018-03-04 11:59         ` Per call-site malloc specialization Ondřej Bílka
  0 siblings, 2 replies; 19+ messages in thread
From: Paul Eggert @ 2018-03-03 20:24 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: Carlos O'Donell, libc-alpha

Ondřej Bílka wrote:

> This is paywalled so I didn't read it.

You can freely read an older NightWatch paper here:

Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating Lightweight and 
Transparent Cache Pollution Control into Dynamic Memory Allocation Systems. 
USENIX ATC. 2015. 307-18. 
https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf

Although NightWatch's source code is freely readable (it's on GitHub), it does 
not specify a software licence so I have not read it and don't recommend that 
you read it.

> One problem is that it merges several different issues.

Yes, a problem common to many systems papers.

I didn't quite follow your profiling proposals. As near as I can make out you'd 
like to log calls to malloc, free, etc. But for good results don't you also need 
to log memory accesses? Otherwise how will you know which objects are hot?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 20:24       ` Paul Eggert
@ 2018-03-03 23:37         ` Carlos O'Donell
  2018-03-03 23:40           ` Carlos O'Donell
  2018-03-04 15:51           ` Ondřej Bílka
  2018-03-04 11:59         ` Per call-site malloc specialization Ondřej Bílka
  1 sibling, 2 replies; 19+ messages in thread
From: Carlos O'Donell @ 2018-03-03 23:37 UTC (permalink / raw)
  To: Paul Eggert, Ondřej Bílka; +Cc: libc-alpha

On 03/03/2018 12:24 PM, Paul Eggert wrote:
> I didn't quite follow your profiling proposals. As near as I can make
> out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

We can already log malloc, free, etc. with the inlined tracing code we
have on DJ's branch, and we in turn simulate those traces again.

You only need to log memory accesses if you want perfect reconstruction
of the various memory subsystem effects. As you know this is obviously
very expensive and difficult if not impossible to do. Some suggestions
at LPC were provided though, including tracking heuristics or statistical
profiles of hits/misses and trying to recreate those profiles in the
page-touch heuristics used by the simulator (something it does when it
gets a new block of memory).

Today we mostly use the simulator to look at in-use RSS, RSS max,
and external and internal fragmentation, looking at trying to reduce
that as much as we can.

Actually using it for profile driven feedback is very hard, and that
kind of behaviour seems to be largely derived from first principles
in most academic work.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 23:37         ` Carlos O'Donell
@ 2018-03-03 23:40           ` Carlos O'Donell
  2018-03-04 15:51           ` Ondřej Bílka
  1 sibling, 0 replies; 19+ messages in thread
From: Carlos O'Donell @ 2018-03-03 23:40 UTC (permalink / raw)
  To: Paul Eggert, Ondřej Bílka; +Cc: libc-alpha

On 03/03/2018 03:37 PM, Carlos O'Donell wrote:
> On 03/03/2018 12:24 PM, Paul Eggert wrote:
>> I didn't quite follow your profiling proposals. As near as I can make
>> out you'd like to log calls to malloc, free, etc. But for good
>> results don't you also need to log memory accesses? Otherwise how
>> will you know which objects are hot?
> 
> We can already log malloc, free, etc. with the inlined tracing code we
> have on DJ's branch, and we in turn simulate those traces again.
> 
> You only need to log memory accesses if you want perfect reconstruction
> of the various memory subsystem effects. As you know this is obviously
> very expensive and difficult if not impossible to do. Some suggestions
> at LPC were provided though, including tracking heuristics or statistical
> profiles of hits/misses and trying to recreate those profiles in the
> page-touch heuristics used by the simulator (something it does when it
> gets a new block of memory).
> 
> Today we mostly use the simulator to look at in-use RSS, RSS max,
> and external and internal fragmentation, looking at trying to reduce
> that as much as we can.
> 
> Actually using it for profile driven feedback is very hard, and that
> kind of behaviour seems to be largely derived from first principles
> in most academic work.
> 

This is a link to the tcmalloc work which we are trying to coordinate with:
https://docs.google.com/document/d/1S8T9apA2juEC5sV2UVXUan4e4_029fTmoAEoX7MEB9s/edit#heading=h.ljaekvl39oaq

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Per call-site malloc specialization.
  2018-03-03 20:24       ` Paul Eggert
  2018-03-03 23:37         ` Carlos O'Donell
@ 2018-03-04 11:59         ` Ondřej Bílka
  1 sibling, 0 replies; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-04 11:59 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha



After thinking about this, I realized that profiling beyond short/long
lived isn't necessary to solve the problem of multiple chunks sharing a
cache line, as classification by call site captures all the factors, so
we could just split allocations according to that.

The allocator would have two levels; the upper one would work only on
whole cache lines.

The lower level would get 1-3-cache-line chunks from the upper level.

For short-lived sites the lower level uses a thread-local cache like the
one I proposed.

For long-lived sites chunks would be assigned to a pair (thread, caller)
which would service small requests and, when an entire chunk becomes
free, return it to the upper level.

This would give the highest amount of locality, unless you could connect
your computer to a crystal ball.

For that one would want to use a macro like the following to recognize
the fast path. Getting the caller from stack pointers would need a hash
table and is more complicated.

/* Using a GNU C statement expression so the macro yields a value.  */
#define malloc(x) \
  ({ static void *malloc_data = (void *) &initial_malloc_data;        \
     malloc_with_data (x, &malloc_data, static_hint | gcc_hint ()); })

Another thing is that gcc could pass information about some malloc
properties via gcc_hint above.

On Sat, Mar 03, 2018 at 12:24:20PM -0800, Paul Eggert wrote:
> Ondřej Bílka wrote:
> 
> >This is paywalled so I didn't read it.
> 
> You can freely read an older NightWatch paper here:
> 
> Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating
> Lightweight and Transparent Cache Pollution Control into Dynamic
> Memory Allocation Systems. USENIX ATC. 2015. 307-18. https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf
> 
> Although NightWatch's source code is freely readable (it's on
> GitHub), it does not specify a software licence so I have not read
> it and don't recommend that you read it.
>

I read it; it uses a different trick than the one I considered. Its
trick is to select physical pages that map to the same 4k of cache and
group them into a virtual page region.

Then it allocates chunks with bad cache locality from this region, to
let the rest use nearly the full cache.

Integrating these is the easy part; the hard part is a low-overhead
profiler.

Adding a flag to the kernel's mmap for that mapping shouldn't be that
difficult.

There would be a generic interface to pass information from the profiler
to the allocator, so writing the profiler is an independent part.

In a previous post I said that with training data one could use a
profiler and not care about overhead, because after recompiling, the
program will use a profiler that just sets things up based on an array
with the previous run's results.
 
> >One problem is that it merges several different issues.
> 
> Yes, a problem common to many systems papers.
> 
> I didn't quite follow your profiling proposals. As near as I can
> make out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

No, I described something completely different: how one could add a
zero-overhead profiler to a production allocator.

Here it was a simple heuristic: for each call site, mark the first 32
allocations and, depending on when they are free'd, estimate the
lifetime. With that, the call site decides whether it uses the heap for
short-lived or for long-lived allocations.

There wouldn't be a performance impact on allocations that weren't
marked.

It is something simple that could be good enough when the hot data is
short-lived.

For the sake of profiling classification, it would be easy to find the
cache properties of chunks larger than cache lines, as there shouldn't
be cache-line sharing.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 23:37         ` Carlos O'Donell
  2018-03-03 23:40           ` Carlos O'Donell
@ 2018-03-04 15:51           ` Ondřej Bílka
  1 sibling, 0 replies; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-04 15:51 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Paul Eggert, libc-alpha

On Sat, Mar 03, 2018 at 03:37:45PM -0800, Carlos O'Donell wrote:
> On 03/03/2018 12:24 PM, Paul Eggert wrote:
> > I didn't quite follow your profiling proposals. As near as I can make
> > out you'd like to log calls to malloc, free, etc. But for good
> > results don't you also need to log memory accesses? Otherwise how
> > will you know which objects are hot?
> 
> We can already log malloc, free, etc. with the inlined tracing code we
> have on DJ's branch, and we in turn simulate those traces again.
> 
> You only need to log memory accesses if you want perfect reconstruction
> of the various memory subsystem effects. As you know this is obviously
> very expensive and difficult if not impossible to do. Some suggestions
> at LPC were provided though, including tracking heuristics or statistical
> profiles of hits/misses and trying to recreate those profiles in the
> page-touch heuristics used by the simulator (something it does when it
> gets a new block of memory).
> 
> Today we mostly use the simulator to look at in-use RSS, RSS max,
> and external and internal fragmentation, looking at trying to reduce
> that as much as we can.
> 
> Actually using it for profile driven feedback is very hard, and that
> kind of behaviour seems to be largely derived from first principles
> in most academic work.
>
No; if you collect a bunch of data unrelated to profiling, you shouldn't
be surprised that you can't use it for profiling.

This is impossible to do with a trace; you can use a trace only for
memory usage. Logging itself changes the cache usage pattern, which
makes everything concerning the cache inaccurate.

Beyond that, for profiling you need to gather completely different
things; there is zero overlap.

Here knowing the caller is essential; you need to make decisions based
on it. Otherwise you would just find out that for one half you must do
A, for the other B, and not know which part is which.

Like I said, measuring lifetime is easy, as we need relatively low
precision. It would suffice to read a variable that is incremented every
1-2 ms; accuracy doesn't matter that much.

Likewise, cache properties at the time we do free are easy to get.
Knowing whether chunks are still hot a specified time (like 10 ms) after
allocation could be done by random sampling that marks some chunks; if
they weren't freed by then, we measure how long it takes to read data
from them.

The limitation here is that we can only do a certain number of reads
before they significantly change the cache.
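
A rough sketch of such a sampled read-timing check on x86 (the cycle
threshold is a guess, and a single rdtsc-timed read is noisy, so a real
profiler would need fences or averaging):

#include <x86intrin.h>

/* Roughly separates a cache hit from a memory access.  */
#define HOT_CYCLES 100

static int
chunk_is_hot (const void *p)
{
  unsigned long long t0 = __rdtsc ();
  (void) *(volatile const char *) p;   /* touch one cache line */
  unsigned long long t1 = __rdtsc ();
  return t1 - t0 < HOT_CYCLES;
}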

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
@ 2018-03-05 20:32     ` DJ Delorie
  0 siblings, 0 replies; 19+ messages in thread
From: DJ Delorie @ 2018-03-05 20:32 UTC (permalink / raw)
  To: Ondřej Bílka
  Cc: carlos, libc-alpha


> These two are wasted work

Please do not refer to someone else's efforts as "wasted" until you
understand why those efforts were made, and how they're being used.  The
trace/simulate code allowed us to benchmark a customer's application
where we had no access to said application, it ran in an environment
where it interacted with many other applications and systems, and over
the course of a few days generated a few terabytes of trace.  We were
able to use that trace to reproduce the performance issue and solve it.
We've also used the trace/simulate code to validate tcache's
performance, which was confirmed in real-world tests later.  They are
not "wasted" work even if they aren't the right tool for your particular
needs.

Of course, you also need to understand what the benchmark benchmarks.
The trace simulator doesn't benchmark printf or strcpy, just the malloc
API.  Likewise, as we have no way of capturing actual memory
read/writes, we have no way of simulating them.  This is why we have
separate microbenchmarks in glibc, although microbenchmarks are poor
reasons to change a general-purpose system allocator.  People care about
things like compilers, browsers, and databases, not microbenchmarks.  So
while you can write a microbenchmark that shows a 400% improvement in
malloc speed, whether that translates to real-world performance
improvements or not still needs to be determined.

I'm not trying to discourage you from improving malloc here, of course.
A change that improves a microbenchmark has a good chance of improving
whole system performance also.  We just need a way to *prove it* by
reliably (and automatically) comparing "apples to apples" at a level and
scale that's appropriate for a general-purpose system allocator like
glibc's malloc, and the trace simulator is the best tool we can come up
with at this time, to serve that purpose.  So please do submit patches
that improve malloc, but understand that we'll expect statistically
robust comparisons with both microbenchmarks, and the simulator[*]
running available workloads, to demonstrate that your patch doesn't have
unexpected negative impacts in applications our users care about.

[*] All workloads, with and without your patch, showing both time and
    rss usage, averaged across many runs, etc...

> as any relation to actual performance is purely coincidental.

*Any* benchmarks are statistical in nature, and how well they correspond
to real-world performance needs to be understood before using said
benchmarks.  That in no way means they're "coincidental".  Even running
the real app once is only statistically related to its performance on
average.

> Running a program first with allocator A, then with allocator B, and
> comparing total run time is the only way to get accurate results

Which we can do with the simulator, in as much as the workload is
representative of the application's performance.  Note I say
"representative".  Our experience with it is that it's quite close, so
we use it with some confidence that improvements seen in the simulator
will translate to improvements seen in the application.  But as I noted,
benchmarking a *real* program is also only "representative" of its
average performance.  And sometimes, it's not practical to run an
application with different allocators enough times, and consistently
enough, to get reliable numbers.

> The real problem with the first two is that the current code is a
> convoluted mess.

While I agree, it's generally a bad idea to start a discussion with "your
code is a mess" when you're talking to the people who (1) wrote that
code, and (2) will ultimately be reviewing your changes to it.  We try
our best to be fair, but we're only human.

> For different representations of small allocations, it's easy to bound
> space usage with a tunable for the maximal number of pages used by them.
> When no page is available it will just allocate from the heap. This
> would be easy to add with clean code.

I look forward to seeing your patch :-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* best-fit algorithm with bitmap heap
  2018-03-02  5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka
  2018-03-02  5:54 ` Ondřej Bílka
  2018-03-02 16:28 ` Carlos O'Donell
@ 2018-03-06 20:42 ` Ondřej Bílka
  2018-03-06 21:26   ` Ondřej Bílka
  2 siblings, 1 reply; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-06 20:42 UTC (permalink / raw)
  To: libc-alpha

On Thu, Mar 01, 2018 at 10:33:18PM +0100, Ondřej Bílka wrote:
> As I mentioned in another thread, I plan to make an improved malloc and friends.
> The work will be split into several aspects.
> 
I got an idea when I pondered merging chunks and deleting old data: you
don't have to delete them from the data structure.
It requires saving data about free chunks separately, plus data to check
whether the chunk information is up-to-date.
To limit space usage one would need a fixed-size buffer for these; when
it is full, one deletes all invalid entries from the buffer and
adds/removes elements from the bigger storage to make the buffer half
full.

I could do this quickly with bitmaps and some conditions, for the sake
of simplicity:
1. It would consider only allocations whose capacity is an integer
multiple (1-16) of the block size.
2. Data would be in 64-block pages where the first and last block are
reserved by the allocator.

As the data structure for a page, the start of an allocated chunk would
be denoted by a 1 in the corresponding bit of alloc_first, and its end by
the corresponding bit of alloc_last.
The reserved blocks are there to put a sentinel 1 at the end of the
bitmaps.
For faster operations the bits of alloc_first could be in reversed
order.

When one has an address and a capacity of b blocks, testing whether it
is a valid free area is easy: from the address we determine the header
and the number s of the starting block.
First we check that free space starts there with the formula
(alloc_last & (1<<(s-1))).
To test that it is free space until s+b, which should be the start of a
chunk, the formula should be something like this (maybe it's off by one
somewhere, I didn't test it):
((alloc_first>>s)<<(64-b)) == 1<<63

On free with offset s, one zeroes the allocated start bit with

alloc_first = alloc_first ^ (1<<s);

The start of the free region is just after the index of the highest 1
bit of alloc_first & ((1<<s) - 1). Here comes the comment that
alloc_first could have reversed bit order, as the lowest 1 bit can be
easier to compute.

The end bit of the allocation can be zeroed with easy bit operations
like the following:

ml = alloc_last & ~((1<<s)-1);
mask = ml ^ (ml - 1);
alloc_last = alloc_last ^ mask ^ (mask / 2);

The end of the free region is the last 1 bit of alloc_last & ~mask.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: best-fit algorithm with bitmap heap
  2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka
@ 2018-03-06 21:26   ` Ondřej Bílka
  0 siblings, 0 replies; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-06 21:26 UTC (permalink / raw)
  To: libc-alpha

Correction: I had initially discovered a better solution but forgot
about it when writing the previous post. The operations are a lot
simpler.

The description below is mostly the same, except that free is simpler.
I can do multiple frees with a free_bitmap that has a 1 at the start of
each chunk to be freed.
The bitmaps alloc_first and alloc_last, with a 1 at the start/end of
each allocated chunk, stay. Removing the bits corresponding to freed
chunks is actually simple with the following formulas:

alloc_first = alloc_first - free_bitmap;
alloc_last = alloc_last & (alloc_last - free_bitmap);

The endpoints (noninclusive) of the free areas that changed are given by
the following bitmap:

free_last = alloc_first & (alloc_first ^ (alloc_first - free_bitmap))

When processing these, you get the starts from the highest bit in
alloc_last & mask.

I forgot to add that cross-page allocation would be possible by adding
more cases, but I don't know whether slightly better space utilization
is worth the worse performance.

The check whether a block is valid could also be simplified, and we
could use the same formulas as those used to remove the bits. To
allocate at offset s with size l, we check and change the bitmaps as
follows:

if (((alloc_first ^ (~ (1<<s))) != alloc_first) &&
    alloc_last & (alloc_last - (1<<s)) == alloc_last & (~(1<<(s+l))))
  {
    alloc_first = alloc_first ^ (~ (1<<s));
    alloc_last = alloc_last & (~(1<<(s+l)));
    /* valid, do allocation */
  }


On Tue, Mar 06, 2018 at 09:42:43PM +0100, Ondřej Bílka wrote:
> On Thu, Mar 01, 2018 at 10:33:18PM +0100, Ondřej Bílka wrote:
> > As I mentioned in another thread, I plan to make an improved malloc and friends.
> > The work will be split into several aspects.
> > 
> I got an idea when I pondered merging chunks and deleting old data: you
> don't have to delete them from the data structure.
> It requires saving data about free chunks separately, plus data to check
> whether the chunk information is up-to-date.
> To limit space usage one would need a fixed-size buffer for these; when
> it is full, one deletes all invalid entries from the buffer and
> adds/removes elements from the bigger storage to make the buffer half
> full.
> 
> I could do this quickly with bitmaps and some conditions, for the sake
> of simplicity:
> 1. It would consider only allocations whose capacity is an integer
> multiple (1-16) of the block size.
> 2. Data would be in 64-block pages where the first and last block are
> reserved by the allocator.
> 
> As the data structure for a page, the start of an allocated chunk would
> be denoted by a 1 in the corresponding bit of alloc_first, and its end by
> the corresponding bit of alloc_last.
> The reserved blocks are there to put a sentinel 1 at the end of the
> bitmaps.
> For faster operations the bits of alloc_first could be in reversed
> order.
> 
> When one has an address and a capacity of b blocks, testing whether it
> is a valid free area is easy: from the address we determine the header
> and the number s of the starting block.
> First we check that free space starts there with the formula
> (alloc_last & (1<<(s-1))).
> To test that it is free space until s+b, which should be the start of a
> chunk, the formula should be something like this (maybe it's off by one
> somewhere, I didn't test it):
> ((alloc_first>>s)<<(64-b)) == 1<<63
> 
> On free with offset s, one zeroes the allocated start bit with
> 
> alloc_first = alloc_first ^ (1<<s);
> 
> The start of the free region is just after the index of the highest 1
> bit of alloc_first & ((1<<s) - 1). Here comes the comment that
> alloc_first could have reversed bit order, as the lowest 1 bit can be
> easier to compute.
> 
> The end bit of the allocation can be zeroed with easy bit operations
> like the following:
> 
> ml = alloc_last & ~((1<<s)-1);
> mask = ml ^ (ml - 1);
> alloc_last = alloc_last ^ mask ^ (mask / 2);
> 
> The end of the free region is the last 1 bit of alloc_last & ~mask.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: best-fit algorithm with bitmap heap
  2018-03-07 14:43     ` Wilco Dijkstra
@ 2018-03-07 17:56       ` Ondřej Bílka
  0 siblings, 0 replies; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-07 17:56 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha, nd

On Wed, Mar 07, 2018 at 02:43:19PM +0000, Wilco Dijkstra wrote:
> Ondřej Bílka wrote:
> > No, there is no searching and as I described it is really easy.
> 
> There is always a search here and that is completely unavoidable.
> 
> If you have a free list which contains multiple different sizes, you need
> to walk it until you get a block that fits. This could be slow if there are
> thousands of entries in that free list. If you use individual free lists for
> every supported allocation size (only practical for small sizes), and the
> free list for the requested size is empty, you now need to search all larger
> free lists until you find one which isn't empty (if you cache which sizes
> have free blocks then that's yet another data structure to keep up to date).
> 
> I think bitmaps are a good choice for allocation of small fixed size blocks,
> but for medium sized blocks it's less clear.
>
No, that is pretty easy to do if you know your data structures. I
thought I had also sent a mail on how I would do best fit, but I hadn't,
so I'll do that now.

If there are 64 bins, then one just needs a bitmap bin_map recording
which bins are nonempty. Getting the first nonempty bin at or above a
given bin is just a few instructions:

uint64_t bin_map;   /* bit i set iff bin i is nonempty */

int best_fit_bin (int bin)
{
  /* Assumes some bin at or above the requested one is nonempty.  */
  return __builtin_ctzl (bin_map & (-1UL << bin));
}

For 64*64 bins you need a two-level structure with 64 individual bitmaps
and one high bitmap recording which of the lower bitmaps are nonzero.
Same idea, only done in two levels:

uint64_t high_bin_map;   /* bit i set iff bin_map[i] != 0 */
uint64_t bin_map[64];

int best_fit_bin (int bin)
{
  int bin_hi = bin / 64;
  bin = bin % 64;
  if (bin_map[bin_hi] & (-1UL << bin))
    return 64 * bin_hi + __builtin_ctzl (bin_map[bin_hi] & (-1UL << bin));
  else
    {
      bin_hi = __builtin_ctzl (high_bin_map & (-1UL << (bin_hi + 1)));
      return 64 * bin_hi + __builtin_ctzl (bin_map[bin_hi]);
    }
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: best-fit algorithm with bitmap heap
  2018-03-07  7:28   ` Ondřej Bílka
@ 2018-03-07 14:43     ` Wilco Dijkstra
  2018-03-07 17:56       ` Ondřej Bílka
  0 siblings, 1 reply; 19+ messages in thread
From: Wilco Dijkstra @ 2018-03-07 14:43 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, nd

Ondřej Bílka wrote:
> No, there is no searching and as I described it is really easy.

There is always a search here and that is completely unavoidable.

If you have a free list which contains multiple different sizes, you need
to walk it until you get a block that fits. This could be slow if there are
thousands of entries in that free list. If you use individual free lists for
every supported allocation size (only practical for small sizes), and the
free list for the requested size is empty, you now need to search all larger
free lists until you find one which isn't empty (if you cache which sizes
have free blocks then that's yet another data structure to keep up to date).

I think bitmaps are a good choice for allocation of small fixed size blocks,
but for medium sized blocks it's less clear.

Wilco

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: best-fit algorithm with bitmap heap
  2018-03-06 23:25 ` Wilco Dijkstra
@ 2018-03-07  7:28   ` Ondřej Bílka
  2018-03-07 14:43     ` Wilco Dijkstra
  0 siblings, 1 reply; 19+ messages in thread
From: Ondřej Bílka @ 2018-03-07  7:28 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha, nd

On Tue, Mar 06, 2018 at 11:25:31PM +0000, Wilco Dijkstra wrote:
> Hi,
> 
> I think before going into the details it is worth considering whether it is even
> feasible to use a bitmap search for first-fit, let alone best-fit. What you seem
> to be saying is there will be lots of small pages for allocation, each of which
> has its own bitmap. If say we have 100MBytes worth of data allocated, you will
> need to start searching every single bitmap that has free space. That would imply
> adding a fast data structure that keeps track of all the free space across all
> the bitmaps. Then this needs to be kept up to date by both malloc and free, as
> well as supporting multiple threads. It seems to me this would become the key
> bottleneck - so it's essential to sort out the overall allocation strategy first.
> 
No, there is no searching and as I described it is really easy.

I could do this for all sizes by making these structures two-level;
larger sizes are above the mmap threshold.

For multiple frees done with a bitmap, one uses a simple trick: if on
free the page's free_bitmap is zero, add the page to a pending_pages
list (see the sketch below). When it is time to actually free, just
traverse pending_pages, do the frees, and reset the corresponding
free_bitmap to zero.
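
A sketch of that trick; struct bpage, the list handling and apply_frees
are made up for illustration:

#include <stdint.h>
#include <stddef.h>

struct bpage
{
  uint64_t free_bitmap;   /* 1 at the start of each pending-freed chunk */
  struct bpage *next_pending;
};

static struct bpage *pending_pages;

extern void apply_frees (struct bpage *pg, uint64_t bitmap);  /* made up */

static void
queue_free (struct bpage *pg, int block)
{
  if (pg->free_bitmap == 0)
    {
      /* First pending free on this page: remember the page.  */
      pg->next_pending = pending_pages;
      pending_pages = pg;
    }
  pg->free_bitmap |= 1UL << block;
}

static void
flush_pending_frees (void)
{
  for (struct bpage *pg = pending_pages; pg != NULL; pg = pg->next_pending)
    {
      apply_frees (pg, pg->free_bitmap);
      pg->free_bitmap = 0;
    }
  pending_pages = NULL;
}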

For a given capacity, best fit would just point you to an array with all
free pointers of that capacity. The idea here is that instead of
deleting a pointer from a linked list on free, we delete it lazily in
malloc with a few bit operations, by checking whether the free region
still starts and ends at the claimed points.

The problem with this is that we need to store those pointers separately
and, to limit space usage, delete invalid pointers when we exceed
capacity.

Without this idea, free is more complicated: it needs another bitmap for
free chunks that are in a doubly-linked list. The following formula
should identify those, excluding the start of a new free region, which
should be handled separately:

free_in_list & (alloc_first_updated ^ (alloc_first_updated - free_bitmap))

In my previous mail I wrote the check whether a chunk is valid
incorrectly; the one above should be correct.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: best-fit algorithm with bitmap heap
       [not found] <DB6PR0801MB2053B88AEF84C7772505072B83D90@DB6PR0801MB2053.eurprd08.prod.outlook.com>
@ 2018-03-06 23:25 ` Wilco Dijkstra
  2018-03-07  7:28   ` Ondřej Bílka
  0 siblings, 1 reply; 19+ messages in thread
From: Wilco Dijkstra @ 2018-03-06 23:25 UTC (permalink / raw)
  To: libc-alpha; +Cc: nd

Hi,

I think before going into the details it is worth considering whether it is even
feasible to use a bitmap search for first-fit, let alone best-fit. What you seem
to be saying is there will be lots of small pages for allocation, each of which
has its own bitmap. If say we have 100MBytes worth of data allocated, you will
need to start searching every single bitmap that has free space. That would imply
adding a fast data structure that keeps track of all the free space across all
the bitmaps. Then this needs to be kept up to date by both malloc and free, as
well as supporting multiple threads. It seems to me this would become the key
bottleneck - so it's essential to sort out the overall allocation strategy first.

Wilco

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-03-07 17:56 UTC | newest]

Thread overview: 19+ messages
2018-03-02  5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka
2018-03-02  5:54 ` Ondřej Bílka
2018-03-02 18:20   ` DJ Delorie
2018-03-02 16:28 ` Carlos O'Donell
2018-03-02 20:21   ` Paul Eggert
2018-03-03 12:56     ` Ondřej Bílka
2018-03-03 20:24       ` Paul Eggert
2018-03-03 23:37         ` Carlos O'Donell
2018-03-03 23:40           ` Carlos O'Donell
2018-03-04 15:51           ` Ondřej Bílka
2018-03-04 11:59         ` Per call-site malloc specialization Ondřej Bílka
2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
2018-03-05 20:32     ` DJ Delorie
2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka
2018-03-06 21:26   ` Ondřej Bílka
     [not found] <DB6PR0801MB2053B88AEF84C7772505072B83D90@DB6PR0801MB2053.eurprd08.prod.outlook.com>
2018-03-06 23:25 ` Wilco Dijkstra
2018-03-07  7:28   ` Ondřej Bílka
2018-03-07 14:43     ` Wilco Dijkstra
2018-03-07 17:56       ` Ondřej Bílka
