Date: Sun, 04 Mar 2018 11:59:00 -0000
From: Ondřej Bílka
To: Paul Eggert
Cc: Carlos O'Donell, libc-alpha@sourceware.org
Subject: Per call-site malloc specialization.
Message-ID: <20180304115937.GA20647@domone>
In-Reply-To: <9c9121c0-a1cc-767c-497f-99156e4edae6@cs.ucla.edu>

After thinking about this I realized that profiling for short-lived
versus long-lived allocations isn't necessary to solve the problem of
multiple chunks sharing a cache line: classification by call site
already captures all the relevant factors, so we could simply split
allocations according to that.

The allocator would have two levels. The upper level would work only on
whole cache lines; the lower level would get chunks of 1-3 cache lines
from the upper level. For short-lived call sites the lower level would
use a thread-local cache like the one I proposed. For long-lived sites
a chunk would be assigned to a (thread, caller) pair, which would
service small requests and return the chunk to the upper level once it
was entirely free. This would give the highest amount of locality,
short of connecting your computer to a crystal ball.

To recognize the fast path one would want a macro like the following
(written as a GNU C statement expression so it stays usable where a
plain expression is expected). Getting the caller from return addresses
instead would need a hash table and is more complicated.

#define malloc(x) \
  ({ static void *malloc_data = (void *) &initial_malloc_data;          \
     malloc_with_data ((x), &malloc_data, static_hint | gcc_hint ()); })

Another thing is that gcc could pass information about some malloc
properties through the gcc_hint above.

On Sat, Mar 03, 2018 at 12:24:20PM -0800, Paul Eggert wrote:
> Ondřej Bílka wrote:
>
> > This is paywalled so I didn't read it.
>
> You can freely read an older NightWatch paper here:
>
> Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating
> Lightweight and Transparent Cache Pollution Control into Dynamic
> Memory Allocation Systems. USENIX ATC 2015, 307-18.
> https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf
>
> Although NightWatch's source code is freely readable (it's on
> GitHub), it does not specify a software licence so I have not read
> it and don't recommend that you read it.

I read it; it uses a different trick than the one I considered. It
selects physical pages that all map to the same 4k of cache and backs a
virtual page region with them, then allocates the chunks with bad cache
locality from this region so that the rest of the program can use a
nearly full cache. Integrating these is the easy part; the hard part is
a low-overhead profiler. Adding a flag to kernel mmap for such a
mapping shouldn't be that difficult; a sketch follows.
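For illustration only, a minimal sketch of what that interface could
look like from userspace. MAP_CACHE_POLLUTING is a hypothetical flag
invented here, not an existing Linux API, and the region size is
arbitrary:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical mmap flag (not an existing Linux flag): ask the kernel
   to back this mapping only with physical pages that fall into one
   small group of cache sets, NightWatch-style, so everything placed
   here competes for the same ~4k of cache instead of polluting the
   rest of it.  */
#define MAP_CACHE_POLLUTING 0x01000000

/* Region from which the allocator would serve chunks that were
   classified as having bad cache locality.  */
static void *polluting_region;

static int
init_polluting_region (size_t size)
{
  polluting_region = mmap (NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS
                           | MAP_CACHE_POLLUTING,
                           -1, 0);
  return polluting_region == MAP_FAILED ? -1 : 0;
}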
There would be a generic interface to pass information from the
profiler to the allocator, so writing the profiler is an independent
task. In a previous post I said that with training data one could use a
profiler without caring about its overhead: after recompiling, the
program would use a trivial "profiler" that just sets everything up
from an array of results from the previous run.

> > One problem is that it merges several different issues.
>
> Yes, a problem common to many systems papers.
>
> I didn't quite follow your profiling proposals. As near as I can
> make out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

No, I described something completely different: how one could add a
zero-overhead profiler to a production allocator. Here it was a simple
heuristic: for each call site, mark the first 32 allocations and
estimate their lifetime from when they are freed. Based on that, the
call site decides whether it uses the heap for short-lived or for
long-lived allocations. There would be no performance impact on
allocations that weren't marked. It is simple and could be good enough
when the hot data are short-lived. For the purpose of this
classification it would also be easy to determine the cache properties
of chunks larger than a cache line, since there shouldn't be any
cache-line sharing. A sketch of the sampling heuristic is below.
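For concreteness, a minimal sketch of that heuristic, assuming a
per-call-site record reachable through the malloc_data pointer from the
macro above; the names, the "cheap clock", and the lifetime threshold
are all made up for illustration:

#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_COUNT 32          /* mark the first 32 allocations */
#define SHORT_LIVED_TICKS 1024   /* hypothetical lifetime threshold */

/* One record per call site, reached through malloc_data.  */
struct site_data
{
  uint32_t sampled;      /* allocations marked so far */
  uint32_t freed;        /* marked allocations freed so far */
  uint32_t short_lived;  /* of those freed, how many died young */
  bool use_short_heap;   /* decision once sampling finishes; sites
                            whose marked chunks never die stay in the
                            long-lived heap by default */
};

/* Allocation fast path; 'now' is some cheap clock, e.g. a per-thread
   allocation counter.  Returns whether to mark this chunk, i.e. store
   'now' in its header.  Unmarked allocations pay nothing.  */
static bool
maybe_mark (struct site_data *site, uint64_t now)
{
  (void) now;
  if (site->sampled >= SAMPLE_COUNT)
    return false;
  site->sampled++;
  return true;
}

/* Called from free, for marked chunks only.  */
static void
record_free (struct site_data *site, uint64_t birth, uint64_t now)
{
  if (now - birth < SHORT_LIVED_TICKS)
    site->short_lived++;
  if (++site->freed == SAMPLE_COUNT)
    site->use_short_heap = site->short_lived > SAMPLE_COUNT / 2;
}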