From: "Ondřej Bílka" <neleai@seznam.cz>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: Carlos O'Donell <carlos@redhat.com>, libc-alpha@sourceware.org
Subject: Per call-site malloc specialization.
Date: Sun, 04 Mar 2018 11:59:00 -0000	[thread overview]
Message-ID: <20180304115937.GA20647@domone> (raw)
In-Reply-To: <9c9121c0-a1cc-767c-497f-99156e4edae6@cs.ucla.edu>



After thinking about this I realized that profiling beyond short/long
lived isn't necessary to solve the problem of multiple chunks sharing a
cache line: classification by call site captures all the relevant
factors, so we could simply split allocations according to their call
site.

The allocator would have two levels; the upper level would work only on
whole cache lines.

The lower level would get chunks of 1-3 cache lines from the upper level.

For short-lived call sites the lower level would use a thread-local
cache like the one I proposed.

For long-lived sites, chunks would be assigned to a (thread, caller)
pair that services small requests; when an entire chunk becomes free it
is returned to the upper level.
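
A minimal sketch of the data structures this implies (all names and
field choices here are mine, not an actual implementation):

/* Upper level: hands out and takes back whole cache-line spans.  */
struct upper_level
{
  void *free_lines;            /* free list of cache-line-aligned spans */
};

/* Lower level: one instance per (thread, caller) pair for long-lived
   sites, or a plain thread-local cache for short-lived sites.  */
struct lower_level
{
  struct upper_level *upper;   /* source of 1-3 cache-line chunks */
  void *chunk;                 /* current chunk being carved up */
  size_t chunk_lines;          /* 1, 2 or 3 */
  size_t live;                 /* live allocations in chunk; when this
                                  drops to 0, return chunk upstream */
};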

This would give the highest amount of locality, unless you could
connect your computer to a crystal ball.

For that one would want to use a macro like the following to recognize
the fast path. Getting the caller from the stack would need a hash
table and is more complicated.

/* GNU C statement expression, so the macro stays usable as an
   expression and each call site gets its own static malloc_data.  */
#define malloc(x) \
  ({ static void *malloc_data = (void *) &initial_malloc_data;         \
     malloc_with_data ((x), &malloc_data, static_hint | gcc_hint ()); })

Another thing is that gcc could pass information about some malloc
properties through the gcc_hint above.
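
To make that concrete, the hints could be a small bitmask along these
lines (everything below is hypothetical):

/* Hypothetical hint bits; nothing like this exists in glibc today.  */
enum malloc_hint
{
  HINT_SHORT_LIVED = 1,        /* likely freed soon after allocation */
  HINT_LONG_LIVED  = 2,        /* likely to live for a long time */
  HINT_FIXED_SIZE  = 4,        /* call site always requests one size */
};

/* gcc_hint () would expand to whatever the compiler can prove,
   e.g. HINT_FIXED_SIZE when the argument is a compile-time constant.  */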

On Sat, Mar 03, 2018 at 12:24:20PM -0800, Paul Eggert wrote:
> Ondřej Bílka wrote:
> 
> >This is paywalled so I didn't read it.
> 
> You can freely read an older NightWatch paper here:
> 
> Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating
> Lightweight and Transparent Cache Pollution Control into Dynamic
> Memory Allocation Systems. USENIX ATC. 2015. 307-18. https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf
> 
> Although NightWatch's source code is freely readable (it's on
> GitHub), it does not specify a software licence so I have not read
> it and don't recommend that you read it.
>

I read it; it uses a different trick than the one I considered. Its
trick is to select physical pages that map to the same 4k of cache and
map them into one virtual page region.

Then it allocates chunks with bad cache locality from this region,
allowing the rest of the allocations to use a nearly full cache.
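
As a sketch of the page-coloring arithmetic involved (the cache
parameters are just examples):

/* Example: an 8 MiB, 16-way cache with 64-byte lines has
   8 MiB / 16 / 64 = 8192 sets.  With 4 KiB pages, the set-index
   bits above the page offset give the page its "color"; pages of
   the same color compete for the same slice of the cache.  */
#define LINE_SIZE  64
#define NUM_SETS   8192
#define PAGE_SIZE  4096
#define NUM_COLORS (LINE_SIZE * NUM_SETS / PAGE_SIZE)   /* 128 */

static unsigned int
page_color (unsigned long phys_addr)
{
  return (phys_addr / PAGE_SIZE) % NUM_COLORS;
}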

Integrating these is the easy part; the hard part is a low-overhead
profiler.

Adding a flag to the kernel's mmap for that mapping shouldn't be that
difficult.
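
Such a call might look like this (MAP_CACHE_LIMIT is made up; no such
flag exists today):

/* Hypothetical flag asking the kernel for pages confined to a few
   cache colors, so allocations here pollute only a small cache slice.  */
void *region = mmap (NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_CACHE_LIMIT,
                     -1, 0);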

There would be a generic interface to pass information from the
profiler to the allocator, so writing the profiler is an independent
task.

In a previous post I said that with training data one could use a
profiler and not care about its overhead, because after recompiling the
program would use a profiler that just sets everything up from an array
of results recorded by the previous run.
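
The interface could be as small as a per-call-site record that the
profiler fills in and the allocator reads; the replay "profiler" just
installs a table recorded by an earlier run. A sketch, with all names
invented:

/* Hypothetical profiler -> allocator interface.  */
struct site_profile
{
  void *call_site;           /* address identifying the call site */
  int lifetime_class;        /* 0 = short-lived, 1 = long-lived */
};

static const struct site_profile *site_table;
static size_t site_table_len;

/* Replay profiler: no measurement at run time, just install the
   table recorded by a previous profiled run.  */
void
profile_replay (const struct site_profile *table, size_t n)
{
  site_table = table;
  site_table_len = n;
}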
 
> >One problem is that it merges several different issues.
> 
> Yes, a problem common to many systems papers.
> 
> I didn't quite follow your profiling proposals. As near as I can
> make out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

No, I described something completely different: how one could add a
zero-overhead profiler to a production allocator.

Here it was a simple heuristic: for each call site, mark the first 32
allocations and estimate lifetime from when they are freed. With that,
the call site decides whether it uses the heap for short-lived or for
long-lived allocations.

There wouldn't be any performance impact on allocations that weren't
marked.

It is simple and could be good enough when hot data are short-lived.
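
A sketch of that per-site heuristic (the 32 samples are from above;
every threshold and name below is mine):

#include <stddef.h>
#include <time.h>

#define SAMPLES 32
#define LIFETIME_THRESHOLD 1.0   /* seconds; purely illustrative */

struct site_stats
{
  void *ptr[SAMPLES];        /* the first 32 allocations from this site */
  double born[SAMPLES];      /* their allocation times */
  unsigned int marked;       /* how many have been marked so far */
  unsigned int freed;        /* how many of those have been freed */
  unsigned int long_lived;   /* how many outlived the threshold */
  int is_long_lived_site;    /* final classification */
};

static double
now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void
on_alloc (struct site_stats *s, void *chunk)
{
  if (s->marked < SAMPLES)
    {
      s->ptr[s->marked] = chunk;
      s->born[s->marked] = now ();
      s->marked++;
    }
  /* Allocations after the first 32 are not tracked at all and
     pay no extra cost.  */
}

static void
on_free (struct site_stats *s, void *chunk)
{
  /* A real allocator would flag marked chunks in their headers
     instead of scanning; the scan keeps this sketch simple.  */
  for (unsigned int i = 0; i < s->marked; i++)
    if (s->ptr[i] == chunk)
      {
        if (now () - s->born[i] > LIFETIME_THRESHOLD)
          s->long_lived++;
        s->ptr[i] = NULL;
        if (++s->freed == SAMPLES)
          s->is_long_lived_site = s->long_lived > SAMPLES / 2;
        break;
      }
}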

For the sake of profiling classification, it would be easy to find the
cache properties of chunks larger than a cache line, as there shouldn't
be any cache-line sharing for them.

