* Possible inline malloc alternatives: bitmap
@ 2018-03-02  5:54 Ondřej Bílka
  2018-03-02  5:54 ` Ondřej Bílka
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-02  5:54 UTC (permalink / raw)
To: libc-alpha

Like I mentioned in another thread, I plan to make an improved
malloc+friends. It will be split into several aspects.

This one is about an option to inline small-size allocations, which
would give the same performance as using a memory pool.

I planned to make this generic but that isn't the case now, as I
realized the possibility of using bit tricks that lead to a more
effective free, as it doesn't need to determine the size of the
allocation.

Representation of small sizes is simple: for each 1024-byte page there
would be metadata with a 64-bit integer whose i-th bit is 1 if and only
if there is a free chunk at offset 16*i of the page, which turns both
allocation and free into simple bit manipulation.

Sample code for how the inline part would look is below. To handle
space usage one would set up the bitmap pages in refill_buckets.

For free, the case that calls noninline_free when the bitmap is zero
serves to register that there are some free chunks at that page.

For returning memory to a different thread than the one that created
it, one would use atomic operations with basically the same algorithm.

For medium sizes the same idea of recording frees in a bitmap could
work with the same code; with bit operations I could quickly merge all
adjacent free chunks into one and delete the old entries.

This is at the idea stage; I don't know how effective it will be to
merge multiple adjacent frees. I could do it with the following bit
tricks:

You need three bitmaps:

allocated - 1 where an allocated chunk starts
old_free  - 1 where a free chunk that is in the data structure starts
new_free  - 1 where a chunk that was free'd but isn't in the data
            structure yet starts

First construct a mask with ones up to the position of the lowest 1 in
new_free:

  low_mask = new_free ^ (new_free - 1)

To get the lower end, compare

  (old_free & low_mask) > (allocated & low_mask)

If this is true then the area starts at the highest 1 bit of
old_free & low_mask, otherwise it starts at the lowest bit of new_free.
In the first case we need to remove that old_free element from the data
structure.

Then it is easy to see that the free area ends at the position of the
lowest 1 in

  allocated & ~low_mask

Additional elements that need to be deleted from the data structure are
determined as follows (high_end_mask has ones below the end position);
we accumulate them into a bitmap to_delete to handle them all together
at the end:

  to_delete |= old_free & high_end_mask & ~low_mask;

Then we mask off the handled elements of new_free and repeat while
new_free is nonzero.

struct page_map {
  int thread;
  uint64_t map;
  uint64_t atomic_map;
};

extern inline struct page_map *get_page_map (void *p)
{
  size_t m = (size_t) p;
  m = 1024 * (m / 1024) - 32;
  return (struct page_map *) m;
}

extern inline size_t base_addr (struct page_map *p)
{
  return ((size_t) p) + 32;
}

struct page_map *buckets[256/16];

extern inline void *inl_alloc (size_t size)
{
  if (size < 256)
    {
      int bucket = (size + 15) / 16;
      uint64_t map = buckets[bucket]->map;
      /* Note that we handle the case of malloc(0) by always calling
         refill_buckets (0) with a suitably filled bucket[0].  */
      if (map == 0)
        return refill_buckets (size);
      uint64_t new_map = map & (map - 1);
      size_t address = base_addr (buckets[bucket])
                       + 16 * __builtin_ctzl (map);
      buckets[bucket]->map = new_map;
      return (void *) address;
    }
  else
    return big_alloc (size);
}

extern inline void inl_free (void *p)
{
#ifndef PAGEMAP_NULL_VALID
  if (p == NULL) return;
#endif
  struct page_map *m = get_page_map (p);
  if (m->thread == thread_id && m->map != 0)
    m->map |= 1ULL << ((((size_t) p) / 16) % 64);
  else
    {
      uint64_t n = m->atomic_map;
      uint64_t nm = n | (1ULL << ((((size_t) p) / 16) % 64));
      if (n == 0 || !atomic_compare_swap (&(m->atomic_map), n, nm))
        noninline_free (p);
    }
}

^ permalink raw reply	[flat|nested] 17+ messages in thread
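The coalescing loop described above is only prose in the proposal; a toy model of one pass, under my reading of the bit tricks, might look like the following. The function name, the returned run-start bitmap, and the convention that a swallowed old_free start lands in to_delete are all assumptions for illustration, not part of the proposal.

```c
#include <stdint.h>
#include <assert.h>

/* Toy model of the coalescing pass: bit i of each bitmap refers to the
   16-byte slot at offset 16*i of the page.
   allocated: slots where a live chunk starts.
   old_free:  slots where a free chunk already in the free list starts.
   new_free:  slots where a just-freed chunk starts.
   Returns a bitmap of merged-run start slots; *to_delete collects the
   old_free entries that must be dropped from the free list.  */
uint64_t
coalesce (uint64_t allocated, uint64_t old_free, uint64_t new_free,
          uint64_t *to_delete)
{
  uint64_t runs = 0;
  *to_delete = 0;
  while (new_free)
    {
      /* Ones up to and including the lowest new_free bit.  */
      uint64_t low_mask = new_free ^ (new_free - 1);
      uint64_t start;
      if ((old_free & low_mask) > (allocated & low_mask))
        {
          /* The run starts at the highest old_free bit below the freed
             chunk; that entry leaves the free list.  */
          start = 63 - __builtin_clzll (old_free & low_mask);
          *to_delete |= 1ULL << start;
        }
      else
        start = __builtin_ctzll (new_free);
      /* The run ends at the next allocated chunk above low_mask.  */
      uint64_t above = allocated & ~low_mask;
      uint64_t end = above ? __builtin_ctzll (above) : 64;
      uint64_t high_end_mask = (end == 64) ? ~0ULL : ((1ULL << end) - 1);
      /* Other free-list entries swallowed by the run.  */
      *to_delete |= old_free & high_end_mask & ~low_mask;
      runs |= 1ULL << start;
      /* Drop the new_free bits this run already covers.  */
      new_free &= ~high_end_mask;
    }
  return runs;
}
```

For example, with an allocated chunk at slot 0, an old free chunk at slot 2, a newly freed chunk at slot 4, and the next allocated chunk at slot 6, a single merged run starting at slot 2 results, and the old entry at slot 2 is scheduled for deletion.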
* Re: Possible inline malloc alternatives: bitmap
  2018-03-02  5:54 ` Ondřej Bílka
@ 2018-03-02 18:20   ` DJ Delorie
  0 siblings, 0 replies; 17+ messages in thread
From: DJ Delorie @ 2018-03-02 18:20 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha

Note that there are existing malloc implementations that may already do
what you plan, it might be worth a little time to see if they're a
better starting point than glibc's malloc.

Also, I have a corpus of workloads for benchmarking malloc here:

http://www.delorie.com/malloc/

(note: that's my house network, please be gentle and kind when
downloading ;)

The simulator that runs those workloads is in glibc's dj/malloc branch
(malloc/trace_run.c).

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap
  2018-03-02  5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka
  2018-03-02  5:54 ` Ondřej Bílka
@ 2018-03-02 16:28 ` Carlos O'Donell
  2018-03-02 20:21   ` Paul Eggert
  2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
  2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka
  2 siblings, 2 replies; 17+ messages in thread
From: Carlos O'Donell @ 2018-03-02 16:28 UTC (permalink / raw)
To: Ondřej Bílka, libc-alpha

On 03/01/2018 01:33 PM, Ondřej Bílka wrote:
> Like I mentioned in another thread, I plan to make an improved
> malloc+friends. It will be split into several aspects.

Great to hear that! Please keep in mind that at Red Hat we are working
on 3 relatively new things for malloc:

* Trace functionality (part of our whole-system trace work)
* Replay functionality (our simulator)
* Heap dumping functionality

Only the heap dumper will be impacted by any work done by you, but this
is OK; all that we ask is that you consider including us in your work so
we can stay abreast of the changes you are thinking about and be able to
comment on how the functionality might impact accurate dumping.

> This one is about an option to inline small-size allocations, which
> would give the same performance as using a memory pool.

What you describe below is a page-based allocator with a bitmap of the
in-use allocations.

I have no objection to such enhancements for glibc's malloc. However,
these enhancements, like the tcache ones, must be driven by collected
workloads which can be run by others to verify that the work yields
performance benefits across their machines.

When we did the tcache work we collected dozens of workloads, ran them
through trace, and re-simulated them on various architectures to look at
the performance characteristics of tcache. Similarly, any future work
should be driven by workloads and performance measurements against
those workloads.

We can provide the workload data for the workloads we used for tcache if
you need them. It was the workloads, for example, that showed that even
with tcache, fastbins was *still* a win from a performance perspective.

The harder work that I face today is not about performance though. The
harder work, and I would hope you could help us on this, is *limiting*
RSS consumption and improving the core algorithms used in malloc to be
able to limit RSS without an immense cost to performance. Developers
want to deploy memory-limited systems, and glibc's malloc does not
always consider RSS costs. Some issues around this are:

* fastbins prevent freeing down of the heap from the top.
* fastbins can grow unbounded.
* heaps can only be freed from the top down (unless you call
  malloc_trim() which can find sub-pages in the heap to free back).

In summary: application mixing of chunks of disparate lifetimes results
in fragmented heaps for a heap-based allocator, while page-based
allocators lose at most a page for those long-lived chunks. However,
lifetime information is not at all available in the current API, and we
need novel research into ways to mitigate this issue. Segregating
allocations of similar size classes seems to be the best practice for
this, and nothing stops glibc from doing the same approach, e.g.
multiple heaps for segregated size classes.

While I am not dissuading you from working on a page-based extension to
glibc's malloc, I think tackling RSS reduction, perhaps using page-based
methods, is where the community should focus their attention. It would
yield the largest net win for our users based on the bugs and feedback
we've been getting.

Again, I also reiterate that it must be driven by workloads, to create a
data-driven optimization process.

> I planned to make this generic but that isn't the case now, as I
> realized the possibility of using bit tricks that lead to a more
> effective free, as it doesn't need to determine the size of the
> allocation.
>
> Representation of small sizes is simple: for each 1024-byte page there
> would be metadata with a 64-bit integer whose i-th bit is 1 if and
> only if there is a free chunk at offset 16*i of the page, which turns
> both allocation and free into simple bit manipulation.
>
> Sample code for how the inline part would look is below. To handle
> space usage one would set up the bitmap pages in refill_buckets.
>
> For free, the case that calls noninline_free when the bitmap is zero
> serves to register that there are some free chunks at that page.
>
> For returning memory to a different thread than the one that created
> it, one would use atomic operations with basically the same algorithm.
>
> For medium sizes the same idea of recording frees in a bitmap could
> work with the same code; with bit operations I could quickly merge all
> adjacent free chunks into one and delete the old entries.
>
> This is at the idea stage; I don't know how effective it will be to
> merge multiple adjacent frees. I could do it with the following bit
> tricks:
>
> You need three bitmaps:
>
> allocated - 1 where an allocated chunk starts
> old_free  - 1 where a free chunk that is in the data structure starts
> new_free  - 1 where a chunk that was free'd but isn't in the data
>             structure yet starts
>
> First construct a mask with ones up to the position of the lowest 1 in
> new_free:
>
>   low_mask = new_free ^ (new_free - 1)
>
> To get the lower end, compare
>
>   (old_free & low_mask) > (allocated & low_mask)
>
> If this is true then the area starts at the highest 1 bit of
> old_free & low_mask, otherwise it starts at the lowest bit of
> new_free. In the first case we need to remove that old_free element
> from the data structure.
>
> Then it is easy to see that the free area ends at the position of the
> lowest 1 in
>
>   allocated & ~low_mask
>
> Additional elements that need to be deleted from the data structure
> are determined as follows (high_end_mask has ones below the end
> position); we accumulate them into a bitmap to_delete to handle them
> all together at the end:
>
>   to_delete |= old_free & high_end_mask & ~low_mask;
>
> Then we mask off the handled elements of new_free and repeat while
> new_free is nonzero.
>
> struct page_map {
>   int thread;
>   uint64_t map;
>   uint64_t atomic_map;
> };
>
> extern inline struct page_map *get_page_map (void *p)
> {
>   size_t m = (size_t) p;
>   m = 1024 * (m / 1024) - 32;
>   return (struct page_map *) m;
> }
>
> extern inline size_t base_addr (struct page_map *p)
> {
>   return ((size_t) p) + 32;
> }
>
> struct page_map *buckets[256/16];
>
> extern inline void *inl_alloc (size_t size)
> {
>   if (size < 256)
>     {
>       int bucket = (size + 15) / 16;
>       uint64_t map = buckets[bucket]->map;
>       /* Note that we handle the case of malloc(0) by always calling
>          refill_buckets (0) with a suitably filled bucket[0].  */
>       if (map == 0)
>         return refill_buckets (size);
>       uint64_t new_map = map & (map - 1);
>       size_t address = base_addr (buckets[bucket])
>                        + 16 * __builtin_ctzl (map);
>       buckets[bucket]->map = new_map;
>       return (void *) address;
>     }
>   else
>     return big_alloc (size);
> }
>
> extern inline void inl_free (void *p)
> {
> #ifndef PAGEMAP_NULL_VALID
>   if (p == NULL) return;
> #endif
>   struct page_map *m = get_page_map (p);
>   if (m->thread == thread_id && m->map != 0)
>     m->map |= 1ULL << ((((size_t) p) / 16) % 64);
>   else
>     {
>       uint64_t n = m->atomic_map;
>       uint64_t nm = n | (1ULL << ((((size_t) p) / 16) % 64));
>       if (n == 0 || !atomic_compare_swap (&(m->atomic_map), n, nm))
>         noninline_free (p);
>     }
> }

--
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 17+ messages in thread
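Carlos's suggestion of multiple heaps for segregated size classes could be sketched as follows. The class count, the cache-line rounding, and the convention that class 0 means "large heap" are illustrative assumptions, not glibc behavior:

```c
#include <stddef.h>
#include <assert.h>

/* Sketch of per-size-class heap selection: round every request up to a
   multiple of a 64-byte cache line and pick a heap per class, so chunks
   of disparate sizes (and often disparate lifetimes) never share a
   page.  */
enum { CACHE_LINE = 64, NCLASSES = 8 };

/* Smallest multiple of a cache line that holds SIZE bytes.  */
static inline size_t
round_capacity (size_t size)
{
  return (size + CACHE_LINE - 1) & ~(size_t) (CACHE_LINE - 1);
}

/* Heap index for a request: one heap per cache-line count, with 0
   standing in for the fallback large-allocation heap.  */
static inline unsigned
heap_index (size_t size)
{
  size_t lines = round_capacity (size) / CACHE_LINE;
  return lines < NCLASSES ? (unsigned) lines : 0;
}
```

A 100-byte request rounds to 128 bytes (2 cache lines) and lands in heap 2; anything needing 8 or more lines falls through to the large heap.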
* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 16:28 ` Carlos O'Donell
@ 2018-03-02 20:21   ` Paul Eggert
  2018-03-03 12:56     ` Ondřej Bílka
  2018-03-02 23:00   ` Possible inline malloc alternatives: bitmap Ondřej Bílka
  1 sibling, 1 reply; 17+ messages in thread
From: Paul Eggert @ 2018-03-02 20:21 UTC (permalink / raw)
To: Carlos O'Donell, Ondřej Bílka, libc-alpha

On 03/02/2018 08:27 AM, Carlos O'Donell wrote:
> lifetime information is not at all
> available in the current API, and we need novel research into ways to
> mitigate this issue

We do need research here. And it's not just lifetime info; it's also
cache pollution. I looked for relevant recent work and found NightWatch:

Liao X, Guo R, Jin H. Enhancing the Malloc System with Pollution
Awareness for Better Cache Performance. IEEE Trans Parallel Distributed
Sys. 2017-03-01. 28(3):731-45.
http://dx.doi.org/10.1109/TPDS.2016.2587644

NightWatch doesn't do what you want (it focuses on cache-aware malloc).
Also, it requires Linux kernel mods. Still, it may be useful as a source
of ideas.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap
  2018-03-02 20:21   ` Paul Eggert
@ 2018-03-03 12:56     ` Ondřej Bílka
  2018-03-03 20:24       ` Paul Eggert
  0 siblings, 1 reply; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-03 12:56 UTC (permalink / raw)
To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha

On Fri, Mar 02, 2018 at 12:21:27PM -0800, Paul Eggert wrote:
> On 03/02/2018 08:27 AM, Carlos O'Donell wrote:
> > lifetime information is not at all
> > available in the current API, and we need novel research into ways
> > to mitigate this issue
>
> We do need research here. And it's not just lifetime info; it's also
> cache pollution. I looked for relevant recent work and found
> NightWatch:
>
> Liao X, Guo R, Jin H. Enhancing the Malloc System with Pollution
> Awareness for Better Cache Performance. IEEE Trans Parallel
> Distributed Sys. 2017-03-01. 28(3):731-45.
> http://dx.doi.org/10.1109/TPDS.2016.2587644
>
> NightWatch doesn't do what you want (it focuses on cache-aware
> malloc). Also, it requires Linux kernel mods. Still, it may be
> useful as a source of ideas.

This is paywalled so I didn't read it. The abstract is too vague; it
doesn't say what exactly they do, and the claimed speedup looks like a
marketing trick, as it is too big considering that it applies only when
memory is managed with a lot of small allocations.

One problem is that it merges several different issues. For an
allocator I know of the following issues and how I plan to handle them:

1. Cache-line sharing.

Multiple small chunks could fit into the same cache line; also, two
longer chunks could share a cache line at their borders.

Eliminating border sharing is relatively easy: it suffices that
everything in the main heap has alignment and capacity that are
multiples of a cache line. For small sizes, use pools of equal
capacities; this allows less border sharing if a pool is, for example,
of 1 1/2 cache-line capacity.

Sub-cache-line requests lead to two problems:

1a. Bad cache utilization.

You want to avoid the situation where only a small part of a cache line
is frequently accessed (hot) while the rest is rarely accessed (cold).
You want hot data with hot data and cold with cold. A low-hanging fruit
here is that we currently use a terrible data structure that fills half
of a cache line with cold metadata about the chunk.

Another good distinction here is whether data is short-lived or
long-lived. With long-lived data it requires complex profiling to find
out whether it is hot or cold. Short-lived data is hot by definition: it
gets allocated, something is written there, and it is freed before the
time it would be evicted from cache. It is easy to check whether data is
short-lived by the basic profiling described below.

A bigger advantage of finding out what is short-lived is in space
management, as long-lived small chunks pinning down a page are one of
the main problems; if all chunks on a page are short-lived, it suffices
to avoid allocating from that page for a while and you will have an
empty page.

A longer explanation of how I plan to profile is at [1].

1b. Cross-talk, which happens when two cores write to different chunks
that happen to be on the same cache line. The cores must synchronize
state, which leads to worse performance.

A per-thread cache, returning a chunk to its original thread on free,
is the best we can do without profiling. It looks like NightWatch does
that, and when a non-allocator thread writes there it gives out a cache
line with only one chunk.

2. Cache layout of multi-cache-line chunks.

As there isn't border sharing, the only two relevant factors are:

a. when the data in a chunk will first be accessed, and
b. where in the cache hierarchy the chunk's cache lines are when free
   is called.

Of these, b. is more important, as in a. you could probably assume that
the data will be written in short order. Obviously you want to keep
giving out hot chunks before they become cold, where a. is short. For
b. one could easily find that out by profiling: when free happens, do
reads of the chunk's memory and measure how long they took.

3. Spatial locality and prefetching.

For small allocations this is related to 1a, as for subsequent
allocations you want to give out chunks in the same cache line, as they
are likely to be accessed together.

Another consideration is to try to make subsequent allocations use a
contiguous memory area to maximize spatial locality. As a heuristic,
these could be accessed in the same order as allocated, which allows
the hardware prefetcher to do its job.

[1] For that profiling, one would reuse the inlining code to add
information about each call site; other means of identifying a call
site are significantly more time-intensive. Here the impact is minimal;
for allocation it is just calculating tcb+atomic_malloc_site_offset.

For the first few allocations this would point to a data structure set
up to cause a call of refill_bucket, where one returns chunks with
another of the planned features: an allocation-specific free callback.
This works by saying that for chunks on a given page one will call a
given callback instead of the normal free. This has nearly zero effect
on free performance, as it sets up the page metadata to fall through to
the slow path before one needs to check if there is a callback.

That callback will keep statistics on how many of the allocated chunks
were freed. On the 32nd allocation (for example) from a given site, we
change the offset or site data to allocate from the long-lived heap.

If the user would do the equivalent of -fprofile-generate, some
profiling (it doesn't really matter what, as gcc and this just need the
real basics), -fprofile-use, then it would be more accurate, and
recognizing whether long-lived data is hot/cold is doable.

^ permalink raw reply	[flat|nested] 17+ messages in thread
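The per-call-site sampling heuristic sketched in [1] might look roughly like the following. The names, the 10 ms threshold, and the majority rule for the final decision are invented for illustration; the proposal only fixes the idea of classifying a site after its first 32 sampled allocations:

```c
#include <assert.h>

/* Toy per-call-site classifier: observe the lifetimes of the first
   SAMPLE_LIMIT sampled allocations of a site, then decide once whether
   the site should draw from the long-lived heap.  Thresholds and the
   majority rule are illustrative, not from the proposal.  */
enum { SAMPLE_LIMIT = 32, LONG_LIVED_MS = 10 };

typedef struct
{
  unsigned sampled;      /* sampled frees seen so far */
  unsigned long_lived;   /* of those, how many outlived the threshold */
  int decided;           /* 0 = undecided, 1 = long-lived, -1 = short */
} alloc_site;

/* Feed one observed chunk lifetime; returns 1 once the site has been
   classified as long-lived.  */
int
site_record_free (alloc_site *s, unsigned lifetime_ms)
{
  if (s->decided)
    return s->decided > 0;
  if (lifetime_ms >= LONG_LIVED_MS)
    s->long_lived++;
  if (++s->sampled == SAMPLE_LIMIT)
    s->decided = (2 * s->long_lived >= SAMPLE_LIMIT) ? 1 : -1;
  return s->decided > 0;
}
```

After the decision, unmarked allocations from the site pay nothing: the site data simply points at the short-lived or long-lived heap.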
* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 12:56     ` Ondřej Bílka
@ 2018-03-03 20:24       ` Paul Eggert
  2018-03-03 23:37         ` Carlos O'Donell
  2018-03-04 11:59         ` Per call-site malloc specialization Ondřej Bílka
  0 siblings, 2 replies; 17+ messages in thread
From: Paul Eggert @ 2018-03-03 20:24 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: Carlos O'Donell, libc-alpha

Ondřej Bílka wrote:
> This is paywalled so I didn't read it.

You can freely read an older NightWatch paper here:

Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating Lightweight
and Transparent Cache Pollution Control into Dynamic Memory Allocation
Systems. USENIX ATC. 2015. 307-18.
https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf

Although NightWatch's source code is freely readable (it's on GitHub),
it does not specify a software licence so I have not read it and don't
recommend that you read it.

> One problem is that it merges several different issues.

Yes, a problem common to many systems papers.

I didn't quite follow your profiling proposals. As near as I can make
out you'd like to log calls to malloc, free, etc. But for good results
don't you also need to log memory accesses? Otherwise how will you know
which objects are hot?

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 20:24       ` Paul Eggert
@ 2018-03-03 23:37         ` Carlos O'Donell
  2018-03-03 23:40           ` Carlos O'Donell
  2018-03-04 15:51           ` Ondřej Bílka
  1 sibling, 2 replies; 17+ messages in thread
From: Carlos O'Donell @ 2018-03-03 23:37 UTC (permalink / raw)
To: Paul Eggert, Ondřej Bílka; +Cc: libc-alpha

On 03/03/2018 12:24 PM, Paul Eggert wrote:
> I didn't quite follow your profiling proposals. As near as I can make
> out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

We can already log malloc, free, etc. with the inlined tracing code we
have on DJ's branch, and we in turn simulate those traces again.

You only need to log memory accesses if you want perfect reconstruction
of the various memory subsystem effects. As you know this is obviously
very expensive and difficult if not impossible to do. Some suggestions
at LPC were provided though, including tracking heuristics or
statistical profiles of hits/misses and trying to recreate those
profiles in the page-touch heuristics used by the simulator (something
it does when it gets a new block of memory).

Today we mostly use the simulator to look at in-use RSS, RSS max,
and external and internal fragmentation, looking at trying to reduce
that as much as we can.

Actually using it for profile driven feedback is very hard, and that
kind of behaviour seems to be largely derived from first principles
in most academic work.

--
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 23:37         ` Carlos O'Donell
@ 2018-03-03 23:40           ` Carlos O'Donell
  0 siblings, 0 replies; 17+ messages in thread
From: Carlos O'Donell @ 2018-03-03 23:40 UTC (permalink / raw)
To: Paul Eggert, Ondřej Bílka; +Cc: libc-alpha

On 03/03/2018 03:37 PM, Carlos O'Donell wrote:
> On 03/03/2018 12:24 PM, Paul Eggert wrote:
> > I didn't quite follow your profiling proposals. As near as I can
> > make out you'd like to log calls to malloc, free, etc. But for good
> > results don't you also need to log memory accesses? Otherwise how
> > will you know which objects are hot?
>
> We can already log malloc, free, etc. with the inlined tracing code we
> have on DJ's branch, and we in turn simulate those traces again.
>
> You only need to log memory accesses if you want perfect
> reconstruction of the various memory subsystem effects. As you know
> this is obviously very expensive and difficult if not impossible to
> do. Some suggestions at LPC were provided though, including tracking
> heuristics or statistical profiles of hits/misses and trying to
> recreate those profiles in the page-touch heuristics used by the
> simulator (something it does when it gets a new block of memory).
>
> Today we mostly use the simulator to look at in-use RSS, RSS max,
> and external and internal fragmentation, looking at trying to reduce
> that as much as we can.
>
> Actually using it for profile driven feedback is very hard, and that
> kind of behaviour seems to be largely derived from first principles
> in most academic work.

This is a link to the tcmalloc work which we are trying to coordinate
with:

https://docs.google.com/document/d/1S8T9apA2juEC5sV2UVXUan4e4_029fTmoAEoX7MEB9s/edit#heading=h.ljaekvl39oaq

--
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap
  2018-03-03 23:37         ` Carlos O'Donell
  2018-03-03 23:40           ` Carlos O'Donell
@ 2018-03-04 15:51           ` Ondřej Bílka
  1 sibling, 0 replies; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-04 15:51 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: Paul Eggert, libc-alpha

On Sat, Mar 03, 2018 at 03:37:45PM -0800, Carlos O'Donell wrote:
> On 03/03/2018 12:24 PM, Paul Eggert wrote:
> > I didn't quite follow your profiling proposals. As near as I can
> > make out you'd like to log calls to malloc, free, etc. But for good
> > results don't you also need to log memory accesses? Otherwise how
> > will you know which objects are hot?
>
> We can already log malloc, free, etc. with the inlined tracing code we
> have on DJ's branch, and we in turn simulate those traces again.
>
> You only need to log memory accesses if you want perfect
> reconstruction of the various memory subsystem effects. As you know
> this is obviously very expensive and difficult if not impossible to
> do. Some suggestions at LPC were provided though, including tracking
> heuristics or statistical profiles of hits/misses and trying to
> recreate those profiles in the page-touch heuristics used by the
> simulator (something it does when it gets a new block of memory).
>
> Today we mostly use the simulator to look at in-use RSS, RSS max,
> and external and internal fragmentation, looking at trying to reduce
> that as much as we can.
>
> Actually using it for profile driven feedback is very hard, and that
> kind of behaviour seems to be largely derived from first principles
> in most academic work.

No. If you collect a bunch of data unrelated to profiling, you can't be
surprised that you can't use it for profiling. This is impossible to do
with a trace; you can use a trace only for memory usage. Logging itself
changes the cache usage pattern, which makes everything concerning the
cache inaccurate.

For profiling you need to gather completely different things; there is
zero overlap. Here knowing the caller is essential, you need to make
decisions based on that. Otherwise you would just find out that for one
half you must do A, for another B, and not know which part is which.

Like I said, measuring lifetime is easy, as we need relatively low
precision. It would suffice to access a variable that is increased
every 1-2 ms; accuracy doesn't matter that much. Likewise, the cache
properties at the time we do a free are easy to get.

Knowing whether chunks are still hot a specified time (like 10 ms)
after allocation could be done by random sampling that marks some
chunks; if they weren't freed by then, we measure how long it takes to
read data from them. Here the limitation is that we can do only a
certain number of reads before they significantly change the cache.

^ permalink raw reply	[flat|nested] 17+ messages in thread
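The coarse lifetime clock described above (a variable bumped every 1-2 ms) can be modeled deterministically. In real use a timer would advance the counter; here clock_tick stands in for that timer, and all names are invented for illustration:

```c
#include <stdint.h>
#include <assert.h>

/* Coarse lifetime clock: malloc stamps the current tick into the
   sampled chunk's metadata, free subtracts.  In a real allocator a
   timer advances coarse_ticks every 1-2 ms; clock_tick models that.  */
static uint32_t coarse_ticks;

void
clock_tick (void)            /* timer callback stand-in */
{
  coarse_ticks++;
}

uint32_t
stamp_now (void)             /* called at (sampled) malloc */
{
  return coarse_ticks;
}

uint32_t
lifetime_ticks (uint32_t stamp)  /* called at free */
{
  return coarse_ticks - stamp;
}
```

The point is that a plain counter read at malloc and free is enough for the low-precision lifetime classification the heuristic needs; no per-access logging is involved.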
* Per call-site malloc specialization. 2018-03-03 20:24 ` Paul Eggert 2018-03-03 23:37 ` Carlos O'Donell @ 2018-03-04 11:59 ` Ondřej Bílka 1 sibling, 0 replies; 17+ messages in thread From: Ondřej Bílka @ 2018-03-04 11:59 UTC (permalink / raw) To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha After thinking about this I realized that profiling beside short/long lived isn't necessary to solver problem of multiple chunks sharing cache line as classification into call sites captures all factors so we could just split that according to that. Allocator would have two levels, upper would work only on whole cache lines. Lower level would get 1-3 cache-line chunks from upper level. For short-lived sites lower level uses thread-local cache like one I proposed. For long lived there will be assigned to pair (thread, caller) which would service small requests and when entire chunk was free then return it to the upper level. This would give highest amount of locality, unless you could connect your computer to crystal ball. For that one would want to use macro like following to recognize fast path. Getting caller from stack pointers would need hash table and is more complicated. #define malloc(x) \ static void *malloc_data = (void *) &initial_malloc_data; \ malloc_with_data (x, &malloc_data, static_hint | gcc_hint() ); Another thing is that gcc could pass information about some malloc properties with gcc_hint above. On Sat, Mar 03, 2018 at 12:24:20PM -0800, Paul Eggert wrote: > OndÅej BÃlka wrote: > > >This is paywalled so I didn't read it. > > You can freely read an older NightWatch paper here: > > Guo R, Liao X, Jin H, Yue J, Tan G. NightWatch: Integrating > Lightweight and Transparent Cache Pollution Control into Dynamic > Memory Allocation Systems. USENIX ATC. 2015. 307-18. 
> https://www.usenix.org/system/files/conference/atc15/atc15-paper-guo.pdf
>
> Although NightWatch's source code is freely readable (it's on
> GitHub), it does not specify a software licence so I have not read
> it and don't recommend that you read it.

Read it; it uses a different trick than the one I considered. Its trick is to map physical pages that all fall into the same 4k of cache into one virtual page region. It then allocates chunks with bad cache locality from this region, letting the rest of the program use nearly the full cache. Integrating these ideas is the easy part; the hard part is a low-overhead profiler. Adding a flag to the kernel's mmap for that kind of mapping shouldn't be that difficult.

There would be a generic interface to pass information from the profiler to the allocator, so writing the profiler is an independent part. In a previous post I said that with training data one could use a profiler and not care about overhead, because after recompiling, the runtime "profiler" would just set things up from an array holding the previous run's results.

> > One problem is that it merges several different issues.
>
> Yes, a problem common to many systems papers.
>
> I didn't quite follow your profiling proposals. As near as I can
> make out you'd like to log calls to malloc, free, etc. But for good
> results don't you also need to log memory accesses? Otherwise how
> will you know which objects are hot?

No, I described something completely different: how one could add a zero-overhead profiler to a production allocator. The heuristic was simple: for each call site, mark the first 32 allocations and, depending on when they are freed, estimate the lifetime. Based on that, the call site decides whether it uses the heap for short-lived or for long-lived allocations. There would be no performance impact on allocations that weren't marked. It is simple and could be good enough when hot data are short-lived.

For the purposes of this classification it would be easy to find cache properties of chunks larger than a cache line, since there shouldn't be any cache-line sharing.
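The "mark the first 32 allocations per call site" heuristic could look roughly like this. This is a sketch under assumptions: the struct layout, the tick-based lifetime unit, and the tunables are invented for illustration, not taken from any existing allocator.

```c
#include <assert.h>
#include <stdint.h>

/* Per-call-site profiling record for the proposed heuristic: sample
   the first 32 allocations, accumulate their observed lifetimes,
   then classify the site as short- or long-lived permanently. */
struct site_profile {
    uint32_t marked;       /* allocations sampled so far */
    uint32_t freed;        /* sampled allocations already freed */
    uint64_t lifetime_sum; /* total observed lifetime, in ticks */
    int      classified;   /* becomes 1 once 32 samples complete */
    int      short_lived;  /* valid once classified */
};

enum { SAMPLE_COUNT = 32, SHORT_TICKS = 10 };  /* assumed tunables */

/* Allocation path: only the first 32 requests from a site pay the
   marking cost; later allocations take the untouched fast path. */
static int should_mark(struct site_profile *s) {
    if (s->classified || s->marked >= SAMPLE_COUNT)
        return 0;
    s->marked++;
    return 1;
}

/* Free path, called only for marked chunks. */
static void record_free(struct site_profile *s, uint32_t lifetime_ticks) {
    s->lifetime_sum += lifetime_ticks;
    if (++s->freed == SAMPLE_COUNT) {
        s->classified = 1;
        s->short_lived = (s->lifetime_sum / SAMPLE_COUNT) < SHORT_TICKS;
    }
}
```

After classification, a site's decision is a single flag check, which matches the claim that unmarked allocations see no overhead.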
^ permalink raw reply [flat|nested] 17+ messages in thread
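The per-call-site macro shown in the message above is not valid C as written, since it places a declaration in expression context. A GNU C statement expression gives the intended shape; everything here is a sketch with assumed names (`malloc_with_data`, `initial_malloc_data`, the hint argument), and the toy body only records which site's data slot was passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static char initial_malloc_data[1];
static void **last_site;  /* records which site's slot was used */

/* Toy stand-in for the proposed entry point: a real version would
   dispatch on per-site profiling data stored in *site_data. */
static void *malloc_with_data(size_t size, void **site_data, int hints) {
    (void)hints;
    last_site = site_data;
    return malloc(size);
}

/* Each textual expansion gets its own static data slot, so the
   allocator can specialize per call site without unwinding the
   stack or hashing return addresses. */
#define site_malloc(x, hints)                                    \
  ({                                                             \
     static void *malloc_data = (void *) initial_malloc_data;    \
     malloc_with_data((x), &malloc_data, (hints));               \
  })
```

Two distinct expansions of `site_malloc` carry two distinct `malloc_data` statics, which is the property the proposal relies on.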
* Re: Possible inline malloc alternatives: bitmap 2018-03-02 16:28 ` Carlos O'Donell 2018-03-02 20:21 ` Paul Eggert @ 2018-03-02 23:00 ` Ondřej Bílka 2018-03-05 20:32 ` DJ Delorie 1 sibling, 1 reply; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-02 23:00 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: libc-alpha

On Fri, Mar 02, 2018 at 08:27:50AM -0800, Carlos O'Donell wrote:
> On 03/01/2018 01:33 PM, Ondřej Bílka wrote:
> > Like I mentioned in other thread I plan to make improved malloc+friends.
> > It will be split to several aspects.
>
> Great to hear that!
>
> Please keep in mind that at Red Hat we are working on 3 relatively new
> things for malloc:
>
> * Trace functionality (part of our whole-system trace work)
> * Replay functionality (our simulator)

These two are wasted work, as any relation to actual performance is purely coincidental. Running a program first with allocator A, then with allocator B, and comparing total run time is the only way to get accurate results (unless the change is just a micro-optimization that doesn't alter any allocated address). A trace misses the majority of performance factors, because they happen outside of malloc and depend on the resulting cache layout, so obviously it cannot measure performance.

A simple case to demonstrate this:

#include <stdlib.h>

int i, j, al;
int *ary;

int *tmalloc (int i)
{
  // al += 1;
  // al += 10;
  // al = rand () % 20000;
  al = rand () % 1000;
  return ary + al;
}

int main ()
{
  al = 0;
  ary = malloc (100000000);
  int *x[1000];
  for (i = 0; i < 1000; i++)
    x[i] = (int *) tmalloc (4);
  for (j = 0; j < 1000000; j++)
    for (i = 0; i < 1000; i++)
      x[i][0] = 42;
  return 0;
}

Depending on which of the four layouts this toy allocator produces, the program takes about four times longer in the last case than in the first. These cache factors are ubiquitous. For example, it could turn out that keeping metadata outside of the allocated chunk is faster, because the prefetching it causes for the application significantly reduces overall latency.
> The harder work that I face today is not about performance though, the harder
> work, and I would hope you could help us on this, is *limiting* RSS consumption
> and improving the core algorithms used in malloc to be able to limit RSS
> without an immense cost to performance.
>
> Developers want to deploy memory limited systems and glibc's malloc does not
> always consider RSS costs. Some issues around this are:
>
> * fastbins prevent free'ing down of heap from top.
> * fastbins can grow unbounded.
> * heaps can only be freed from top down (unless you call malloc_trim() which
>   can find sub-pages in the heap to free back).

The real problem with the first two is that the current code is a convoluted mess. With a different representation of small allocations it is easy to bound space usage with a tunable giving the maximal number of pages used for them; when no such page is available, the allocator simply falls back to the heap. This would be easy to add with clean code.

As for the third problem, freeing pages: after thinking about it for a few hours I realized that bitmaps are the only sane way to keep track of this (along with five other insane ways). The idea is to madvise(MADV_DONTNEED) every page that wasn't used for a specified amount of time. You need these bitmaps:

free    - marks pages that are completely free
cleared - marks free pages on which we already called madvise
live1   - these two capture recent activity on a page; after each
live2     time period we copy live1 to live2 and zero live1

Then the pages needing madvise are given by the formula:

  free & ~cleared & ~(live1 | live2)

There is also the trick that for large allocations one should always clear the upper half of the memory, because a buffer rarely uses its whole area.

^ permalink raw reply	[flat|nested] 17+ messages in thread
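The page-aging bitmaps described above can be sketched directly; one 64-bit word covers 64 pages. The struct name and helper functions are assumptions for illustration, while the bitmap meanings and the selection formula follow the post.

```c
#include <assert.h>
#include <stdint.h>

/* Per-64-page aging state, as described in the post: 'free_pages'
   marks completely free pages, 'cleared' marks pages we already
   madvise'd, and live1/live2 capture activity in the current and
   previous time window. */
struct page_age {
    uint64_t free_pages;
    uint64_t cleared;
    uint64_t live1, live2;
};

/* Pages to madvise(MADV_DONTNEED) now: free, not yet cleared, and
   untouched in the last two windows -- the post's formula
   free & ~cleared & ~(live1 | live2). */
static uint64_t pages_to_madvise(const struct page_age *a) {
    return a->free_pages & ~a->cleared & ~(a->live1 | a->live2);
}

/* At the end of each time period, live1 is copied to live2 and
   zeroed, so "recent" always spans the last two windows. */
static void rotate_window(struct page_age *a) {
    a->live2 = a->live1;
    a->live1 = 0;
}
```

One word of state per 64 pages keeps the bookkeeping cost of this scheme negligible next to the madvise calls themselves.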
* Re: Possible inline malloc alternatives: bitmap 2018-03-02 23:00 ` Possible inline malloc alternatives: bitmap Ondřej Bílka @ 2018-03-05 20:32 ` DJ Delorie 0 siblings, 0 replies; 17+ messages in thread
From: DJ Delorie @ 2018-03-05 20:32 UTC (permalink / raw)
To: Ondřej Bílka; Cc: carlos, libc-alpha

> These two are wasted work

Please do not refer to someone else's efforts as "wasted" until you understand why those efforts were made, and how they're being used.

The trace/simulate code allowed us to benchmark a customer's application where we had no access to said application; it ran in an environment where it interacted with many other applications and systems, and over the course of a few days generated a few terabytes of trace. We were able to use that trace to reproduce the performance issue and solve it. We've also used the trace/simulate code to validate tcache's performance, which was confirmed in real-world tests later. They are not "wasted" work even if they aren't the right tool for your particular needs.

Of course, you also need to understand what the benchmark benchmarks. The trace simulator doesn't benchmark printf or strcpy, just the malloc API. Likewise, as we have no way of capturing actual memory reads/writes, we have no way of simulating them. This is why we have separate microbenchmarks in glibc, although microbenchmarks are poor reasons to change a general-purpose system allocator. People care about things like compilers, browsers, and databases, not microbenchmarks. So while you can write a microbenchmark that shows a 400% improvement in malloc speed, whether that translates to real-world performance improvements or not still needs to be determined.

I'm not trying to discourage you from improving malloc here, of course. A change that improves a microbenchmark has a good chance of improving whole system performance also.
We just need a way to *prove it* by reliably (and automatically) comparing "apples to apples" at a level and scale that's appropriate for a general-purpose system allocator like glibc's malloc, and the trace simulator is the best tool we can come up with at this time, to serve that purpose.

So please do submit patches that improve malloc, but understand that we'll expect statistically robust comparisons with both microbenchmarks, and the simulator[*] running available workloads, to demonstrate that your patch doesn't have unexpected negative impacts in applications our users care about.

[*] All workloads, with and without your patch, showing both time and rss usage, averaged across many runs, etc...

> as any relation to actual performance is purely coincidental.

*Any* benchmarks are statistical in nature, and how well they correspond to real-world performance needs to be understood before using said benchmarks. That in no way means they're "coincidental". Even running the real app once is only statistically related to its performance on average.

> Running a program first with allocator A, then with allocator B and
> comparing total run time is the only way to get accurate results

Which we can do with the simulator, in as much as the workload is representative of the application's performance. Note I say "representative". Our experience with it is that it's quite close, so we use it with some confidence that improvements seen in the simulator will translate to improvements seen in the application.

But as I noted, benchmarking a *real* program is also only "representative" of its average performance. And sometimes, it's not practical to run an application with different allocators enough times, and consistently enough, to get reliable numbers.

> Real problem with first two that current code is convoluted mess.
While I agree, it's generally a bad idea to start a discussion with "your code is a mess" when you're talking to the people who (1) wrote that code, and (2) will ultimately be reviewing your changes to it. We try our best to be fair, but we're only human.

> For different representations of small allocations its easy to bound
> space usage with tunable of maximal number of pages for these. When no
> page is available it will just do allocation from heap. This would be
> easy to add with clean code.

I look forward to seeing your patch :-)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* best-fit algorithm with bitmap heap 2018-03-02 5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka 2018-03-02 5:54 ` Ondřej Bílka 2018-03-02 16:28 ` Carlos O'Donell @ 2018-03-06 20:42 ` Ondřej Bílka 2018-03-06 21:26 ` Ondřej Bílka 2 siblings, 1 reply; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-06 20:42 UTC (permalink / raw)
To: libc-alpha

On Thu, Mar 01, 2018 at 10:33:18PM +0100, Ondřej Bílka wrote:
> Like I mentioned in other thread I plan to make improved malloc+friends.
> It will be split to several aspects.
>

While pondering about merging chunks and deleting old data, I got the idea that you don't have to delete them from the data structure. It requires storing data about free chunks separately, plus data to check whether a chunk's information is up to date. To limit space usage one would use a fixed-size buffer for these; when it is full, one deletes all invalid entries from the buffer and adds/removes elements from the bigger storage to make the buffer half full.

I could do this quickly with bitmaps, with some restrictions for the sake of simplicity:

1. Only allocations whose capacity is a 1-16 integer multiple of the block size are considered.
2. Data lives in 64-block pages where the first and last block are reserved by the allocator.

As the data structure for a page, the start of an allocated chunk is denoted by a 1 in the corresponding bit of alloc_first, and its end by a 1 in the corresponding bit of alloc_last. The reserved blocks exist to put a sentinel 1 at the ends of the bitmaps. For faster operations, the bits of alloc_first could be stored in reversed order.

Given an address and a capacity of b blocks, testing whether it is a valid free area is easy: from the address we determine the header and the number s of the starting block.
First we check that free space starts there, with the formula

  alloc_last & (1 << (s-1))

To test that it is free space until s+b, which should be the start of a chunk, the formula should look like this (maybe it's off by one somewhere, I didn't test it):

  ((alloc_first >> s) << (64-b)) == 1 << 63

On free with offset s, one zeroes the allocated bits with

  alloc_first = alloc_first ^ (1 << s);

The start of the free region is just after the index of the highest 1 bit of

  alloc_first & ((1 << s) - 1)

Here again alloc_first could use reversed bit order, since the lowest 1 bit is easier to compute.

The end of the allocation can be zeroed with easy bit operations like the following:

  ml = alloc_last & (~((1 << s) - 1));
  mask = ml ^ (ml - 1);
  alloc_last = alloc_last ^ mask ^ (mask / 2);

The end of the free region is the last 1 bit of

  alloc_last & (~mask)

^ permalink raw reply	[flat|nested] 17+ messages in thread
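The exact-fit test in the message above can be made concrete. Since the post admits a possible off-by-one, this sketch uses one consistent variant of the formula, shifting by 63-b rather than 64-b so that the "next chunk starts at s+b" bit lands exactly on bit 63; everything else (the bitmap conventions, the sentinel blocks) follows the post.

```c
#include <assert.h>
#include <stdint.h>

/* Page bitmaps as described: bit i of alloc_first marks a chunk
   starting at block i, bit i of alloc_last marks a chunk ending at
   block i.  Blocks 0 and 63 are reserved, so a sentinel chunk end
   at block 0 and a sentinel chunk start at block 63 always exist
   (hence s >= 1 and s + b <= 63 below). */

/* Does a free run of exactly b blocks start at block s?  Some chunk
   must end at s-1, no chunk may start inside [s, s+b), and a chunk
   (possibly the sentinel) must start at s+b. */
static int exact_fit(uint64_t alloc_first, uint64_t alloc_last,
                     unsigned s, unsigned b) {
    if (!(alloc_last & (1ULL << (s - 1))))
        return 0;  /* block s is not the start of a free run */
    /* Keep bits s..s+b of alloc_first; the test passes iff only
       bit s+b (mapped to bit 63) is set among them. */
    return ((alloc_first >> s) << (63 - b)) == (1ULL << 63);
}
```

With the sentinels in place the shift-and-compare never needs a special case for a free run that reaches the end of the page.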
* Re: best-fit algorithm with bitmap heap 2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka @ 2018-03-06 21:26 ` Ondřej Bílka 0 siblings, 0 replies; 17+ messages in thread
From: Ondřej Bílka @ 2018-03-06 21:26 UTC (permalink / raw)
To: libc-alpha

A correction: I had initially found a better solution but forgot about it when writing the previous post. The operations are a lot simpler. The description below is mostly the same, except that free is simpler. I can do multiple frees at once with a free_bitmap that has a 1 at the start of each chunk to be freed. The bitmaps alloc_first and alloc_last, with a 1 at the start/end of each allocated chunk, stay as before.

Removing the bits corresponding to the freed chunks is simple with the following formulas:

  alloc_first = alloc_first - free_bitmap;
  alloc_last = alloc_last & (alloc_last - free_bitmap);

Endpoints (noninclusive) of the free areas that changed are given by the following bitmap:

  free_last = alloc_first & (alloc_first ^ (alloc_first - free_bitmap))

When processing these, you get the starts from the highest bit in alloc_last & mask.

I forgot to add that cross-page allocation is possible by adding more cases, but I don't know whether slightly better space utilization is worth the worse performance.

The check whether a block is valid could also be simplified, using the same formulas as those that remove bits. To allocate at offset s with size l, we check and change the bitmaps with the following:

  if (((alloc_first ^ (~ (1 << s))) != alloc_first)
      && (alloc_last & (alloc_last - (1 << s))) == (alloc_last & (~(1 << (s+l)))))
    {
      alloc_first = alloc_first ^ (~ (1 << s));
      alloc_last = alloc_last & (~(1 << (s+l)));
      /* valid, do allocation */
    }

On Tue, Mar 06, 2018 at 09:42:43PM +0100, Ondřej Bílka wrote:
> On Thu, Mar 01, 2018 at 10:33:18PM +0100, Ondřej Bílka wrote:
> > Like I mentioned in other thread I plan to make improved malloc+friends.
> > It will be split to several aspects.
> >
> I got idea when I pondered about merging chunks and deleting old data
> that you don't have to delete them from data structure.
> It requires to save data about free chunks separately and data to check
> if chunk information is up-to-date.
> To limit space usage one would need to use fixed size buffer with these,
> when its full one needs to delete all invalid entries from buffer, and
> add/remove elements from bigger storage to make buffer half full.
>
> I could do this quickly with bitmaps and some conditions for sake of simplicity.
> 1. It would consider only allocation where capacity is 1-16 integer multiple of block
> 2. Data would be in 64-block pages where first and last block are reserved by allocator.
>
> As data structure for page start of allocated chunk would be denoted by
> 1 in corresponding bit of alloc_first and end in corresponding bit of alloc_last.
> Reserved blocks are there to put sentinel 1 at end of bitmaps.
> For faster operations bits in alloc_first could be in reversed order.
>
> When one has address and capacity of b blocks testing if it is valid free area is easy,
> from address we determine header and number s of starting block.
> First we check that free space starts there with formula (alloc_last & (1<<(s-1)))
> To test that it is free space until s+b which should be start of chunk
> it should be formula like this(maybe its off by one somewhere, didn't test that)
> ((alloc_first>>s)<<(64-b)) == 1<<63
>
> On free with offset s one zeroes allocated bits with
>
> alloc_first = alloc_first ^ (1<<s);
>
> Start of free region is just after index of highest 1 bit of alloc_first & ((1<<s) - 1)
> Here comes comment that alloc_first could have reversed order as lowest
> 1 bit could be easier to compute
>
> End of allocation could be zeroes just with easy bit operations like
> following:
>
> ml = alloc_last & (~((1<<s)-1));
> mask = ml ^ (ml - 1);
> alloc_last = alloc_last ^ mask ^ (mask / 2);
>
> end of free region is last 1 bit of alloc_last & (~mask)

^ permalink raw reply	[flat|nested] 17+ messages in thread
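The batched-free formulas from the correction above can be sanity-checked mechanically. The sketch below wraps them in a function (the name and test page layout are invented); the key precondition is that every bit of free_bitmap is also set in alloc_first, so the subtraction never borrows across chunk boundaries.

```c
#include <assert.h>
#include <stdint.h>

/* Batched free via the corrected formulas: free_bitmap has a 1 at
   the start block of every chunk being freed (each such bit must
   also be set in alloc_first).  The subtraction clears exactly the
   start bits, and the borrow propagation in the second formula
   clears, for each freed start, the lowest end bit at or above it
   -- i.e. that chunk's end bit. */
static void batch_free(uint64_t *alloc_first, uint64_t *alloc_last,
                       uint64_t free_bitmap) {
    *alloc_first = *alloc_first - free_bitmap;
    *alloc_last  = *alloc_last & (*alloc_last - free_bitmap);
}
```

Freeing any number of chunks in one page thus costs one subtraction, one AND, and one more subtraction, regardless of how many chunks are freed.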
* Re: Possible inline malloc alternatives: bitmap @ 2018-03-04 14:07 Wilco Dijkstra 2018-03-05 18:57 ` Carlos O'Donell 0 siblings, 1 reply; 17+ messages in thread
From: Wilco Dijkstra @ 2018-03-04 14:07 UTC (permalink / raw)
To: eggert, Ondřej Bílka; +Cc: nd, Carlos O'Donell, libc-alpha

Paul Eggert wrote:

> You can freely read an older NightWatch paper here:
>
> Although NightWatch's source code is freely readable (it's on GitHub), it does not specify a software licence so I have not read it and don't recommend that you read it.
>
> > One problem is that it merges several different issues.
>
> Yes, a problem common to many systems papers.

Indeed, it is not relevant to the issues we are having with GLIBC malloc. The paper is about system-wide cache partitioning, which is very system specific and between processes.

> I didn't quite follow your profiling proposals. As near as I can make out you'd like to log calls to malloc, free, etc. But for good results don't you also need to log memory accesses? Otherwise how will you know which objects are hot?

I don't believe it is feasible to profile malloc more accurately, but neither is it necessary for a quick evaluation of malloc improvements like (de)allocation performance and fragmentation reduction. You always need to run the original benchmark to prove the gains are real.

Wilco

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Possible inline malloc alternatives: bitmap 2018-03-04 14:07 Possible inline malloc alternatives: bitmap Wilco Dijkstra @ 2018-03-05 18:57 ` Carlos O'Donell 0 siblings, 0 replies; 17+ messages in thread
From: Carlos O'Donell @ 2018-03-05 18:57 UTC (permalink / raw)
To: Wilco Dijkstra, eggert, Ondřej Bílka; +Cc: nd, libc-alpha

On 03/04/2018 06:07 AM, Wilco Dijkstra wrote:
> I don't believe it is feasible to profile malloc more accurately, but
> neither is it necessary for a quick evaluation of malloc improvements
> like (de)allocation performance and fragmentation reduction. You
> always need to run the original benchmark to prove the gains are
> real.

100% agreed.

--
Cheers, Carlos.

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2018-03-06 21:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-02  5:54 Possible inline malloc alternatives: bitmap Ondřej Bílka
2018-03-02  5:54 ` Ondřej Bílka
2018-03-02 18:20   ` DJ Delorie
2018-03-02 16:28 ` Carlos O'Donell
2018-03-02 20:21   ` Paul Eggert
2018-03-03 12:56     ` Ondřej Bílka
2018-03-03 20:24       ` Paul Eggert
2018-03-03 23:37         ` Carlos O'Donell
2018-03-03 23:40           ` Carlos O'Donell
2018-03-04 15:51             ` Ondřej Bílka
2018-03-04 11:59         ` Per call-site malloc specialization Ondřej Bílka
2018-03-02 23:00 ` Possible inline malloc alternatives: bitmap Ondřej Bílka
2018-03-05 20:32   ` DJ Delorie
2018-03-06 20:42 ` best-fit algorithm with bitmap heap Ondřej Bílka
2018-03-06 21:26   ` Ondřej Bílka
2018-03-04 14:07 Possible inline malloc alternatives: bitmap Wilco Dijkstra
2018-03-05 18:57 ` Carlos O'Donell