Date: Sun, 04 Mar 2018 11:59:00 -0000
From: Ondřej Bílka
To: Paul Eggert
Cc: Carlos O'Donell, libc-alpha@sourceware.org
Subject: Per call-site malloc specialization.
Message-ID: <20180304115937.GA20647@domone>
In-Reply-To: <9c9121c0-a1cc-767c-497f-99156e4edae6@cs.ucla.edu>

After thinking about this I realized that profiling for short-lived
versus long-lived allocations isn't necessary to solve the problem of
multiple chunks sharing a cache line: classification by call site
already captures all the relevant factors, so we could simply split
allocations according to that.

The allocator would have two levels. The upper level would work only on
whole cache lines; the lower level would get chunks of 1-3 cache lines
from the upper level. For short-lived call sites the lower level would
use a thread-local cache like the one I proposed. For long-lived sites
a chunk would be assigned to a (thread, caller) pair, which would
service small requests and return the chunk to the upper level once it
was entirely free. This would give the highest amount of locality,
short of connecting your computer to a crystal ball.

To recognize the fast path one would want a macro like the following
(written as a GNU C statement expression so it stays usable where a
plain expression is expected). Getting the caller from return addresses
instead would need a hash table and is more complicated.

#define malloc(x) \
  ({ static void *malloc_data = (void *) &initial_malloc_data;          \
     malloc_with_data ((x), &malloc_data, static_hint | gcc_hint ()); })

Another thing is that gcc could pass information about some malloc
properties through the gcc_hint above.

On Sat, Mar 03, 2018 at 12:24:20PM -0800, Paul Eggert wrote:
> Ondřej Bílka wrote:
>
> > This is paywalled so I didn't read it.
>
> You can freely read an older NightWatch paper here:
>
> Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating
> Lightweight and Transparent Cache Pollution Control into Dynamic
> Memory Allocation Systems. USENIX ATC 2015, 307-18.
> https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf
>
> Although NightWatch's source code is freely readable (it's on
> GitHub), it does not specify a software licence so I have not read
> it and don't recommend that you read it.

I read it; it uses a different trick than the one I considered. It
selects physical pages that all map to the same 4k of cache and backs a
virtual page region with them, then allocates the chunks with bad cache
locality from this region so that the rest of the program can use a
nearly full cache. Integrating these is the easy part; the hard part is
a low-overhead profiler. Adding a flag to kernel mmap for such a
mapping shouldn't be that difficult; a sketch follows.
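For illustration only, a minimal sketch of what that interface could
look like from userspace. MAP_CACHE_POLLUTING is a hypothetical flag
invented here, not an existing Linux API, and the region size is
arbitrary:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical mmap flag (not an existing Linux flag): ask the kernel
   to back this mapping only with physical pages that fall into one
   small group of cache sets, NightWatch-style, so everything placed
   here competes for the same ~4k of cache instead of polluting the
   rest of it.  */
#define MAP_CACHE_POLLUTING 0x01000000

/* Region from which the allocator would serve chunks that were
   classified as having bad cache locality.  */
static void *polluting_region;

static int
init_polluting_region (size_t size)
{
  polluting_region = mmap (NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS
                           | MAP_CACHE_POLLUTING,
                           -1, 0);
  return polluting_region == MAP_FAILED ? -1 : 0;
}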
There would be a generic interface to pass information from the
profiler to the allocator, so writing the profiler is an independent
task. In a previous post I said that with training data one could use a
profiler without caring about its overhead: after recompiling, the
program would use a trivial "profiler" that just sets everything up
from an array of results from the previous run.

> > One problem is that it merges several different issues.
>
> Yes, a problem common to many systems papers.
>
> I didn't quite follow your profiling proposals. As near as I can
> make out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

No, I described something completely different: how one could add a
zero-overhead profiler to a production allocator. Here it was a simple
heuristic: for each call site, mark the first 32 allocations and
estimate their lifetime from when they are freed. Based on that, the
call site decides whether it uses the heap for short-lived or for
long-lived allocations. There would be no performance impact on
allocations that weren't marked. It is simple and could be good enough
when the hot data are short-lived. For the purpose of this
classification it would also be easy to determine the cache properties
of chunks larger than a cache line, since there shouldn't be any
cache-line sharing. A sketch of the sampling heuristic is below.
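For concreteness, a minimal sketch of that heuristic, assuming a
per-call-site record reachable through the malloc_data pointer from the
macro above; the names, the "cheap clock", and the lifetime threshold
are all made up for illustration:

#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_COUNT 32          /* mark the first 32 allocations */
#define SHORT_LIVED_TICKS 1024   /* hypothetical lifetime threshold */

/* One record per call site, reached through malloc_data.  */
struct site_data
{
  uint32_t sampled;      /* allocations marked so far */
  uint32_t freed;        /* marked allocations freed so far */
  uint32_t short_lived;  /* of those freed, how many died young */
  bool use_short_heap;   /* decision once sampling finishes; sites
                            whose marked chunks never die stay in the
                            long-lived heap by default */
};

/* Allocation fast path; 'now' is some cheap clock, e.g. a per-thread
   allocation counter.  Returns whether to mark this chunk, i.e. store
   'now' in its header.  Unmarked allocations pay nothing.  */
static bool
maybe_mark (struct site_data *site, uint64_t now)
{
  (void) now;
  if (site->sampled >= SAMPLE_COUNT)
    return false;
  site->sampled++;
  return true;
}

/* Called from free, for marked chunks only.  */
static void
record_free (struct site_data *site, uint64_t birth, uint64_t now)
{
  if (now - birth < SHORT_LIVED_TICKS)
    site->short_lived++;
  if (++site->freed == SAMPLE_COUNT)
    site->use_short_heap = site->short_lived > SAMPLE_COUNT / 2;
}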