* page alignment for large malloc requests? From: Andreas Dilger @ 2020-10-03 20:31 UTC To: libc-alpha How easy/hard would it be for glibc malloc to automatically align larger allocations (e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that they are always properly aligned for O_DIRECT IO? I _thought_ that was already being done by default, but much to my surprise that was not the case. For improved IO efficiency, I was looking at whether it would be possible to transparently avoid doing a user->kernel data copy during large write() calls and just submitting the IO directly to underlying flash storage, but since the input buffers are not aligned properly, this isn't possible. I'm of course aware of posix_memalign(), but I was wondering about "normal" applications that are written by users that don't know anything about this, and just allocate memory and use it to submit IO. I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap would be very efficient for malloc, but I am not really familiar with the details. Cheers, Andreas
* Re: page alignment for large malloc requests? From: Carlos O'Donell @ 2020-10-04 19:38 UTC To: Andreas Dilger, libc-alpha On 10/3/20 4:31 PM, Andreas Dilger wrote: > How easy/hard would it be for glibc malloc to automatically align larger allocations > (e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that > they are always properly aligned for O_DIRECT IO? Could you define what you mean by "easy" or "hard"? It is *technically* easy to make such a change. The deeper question is: Is it worth the increased RSS cost and memory fragmentation to support an "everything is ready for O_DIRECT IO" use case? You would be increasing fragmentation by not allowing the allocator to pick a chunk that did not meet the alignment requirement. My opinion is that if the application needs to use O_DIRECT it should use aligned_alloc() or any of the suitable aligned allocation functions. > I _thought_ that was already being done by default, but much to my surprise that was > not the case. For improved IO efficiency, I was looking at whether it would be possible > to transparently avoid doing a user->kernel data copy during large write() calls and > just submitting the IO directly to underlying flash storage, but since the input buffers > are not aligned properly, this isn't possible. If you lower the malloc mmap threshold then you'll go straight to mmap for allocations and those will be page aligned, but you'll pay the performance cost for those allocations. It might be possible to create a tunable to raise the minimum chunk alignment to any arbitrary size, but you'd have a unique system with that setting set that high. 
> I'm of course aware of posix_memalign(), but I was wondering about "normal" applications > that are written by users that don't know anything about this, and just allocate memory > and use it to submit IO. Are you injecting O_DIRECT into the application open*() calls? Can you please explain a bit more about your use case? > I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap > would be very efficient for malloc, but I am not really familiar with the details. It is workload dependent. In general this will lead to huge fragmentation at the page size, and for 64KiB page architectures there is going to be a lot of fragmentation. -- Cheers, Carlos.
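For applications that can be changed, the aligned-allocation route recommended above might look like the following minimal sketch. It uses posix_memalign(), which is mentioned in the original question, and assumes that the system page size is an acceptable alignment for O_DIRECT on the target device (the actual requirement depends on the filesystem and block device). `alloc_page_aligned` is a hypothetical helper name, not a glibc function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `size` bytes aligned to the system page
 * size. Returns NULL on failure. Release the buffer with free().
 * Assumption: page alignment is sufficient for the intended O_DIRECT IO.
 */
static void *alloc_page_aligned(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page <= 0)
        return NULL;
    /* posix_memalign() returns 0 on success, an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;
    assert(((uintptr_t)buf % (uintptr_t)page) == 0);
    return buf;
}
```

A caller would use `alloc_page_aligned(len)` in place of `malloc(len)` for IO buffers and release the result with free() as usual.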
* Re: page alignment for large malloc requests? From: Florian Weimer @ 2020-10-05 7:30 UTC To: Carlos O'Donell via Libc-alpha; Cc: Andreas Dilger, Carlos O'Donell * Carlos O'Donell via Libc-alpha: > If you lower the malloc mmap threshold then you'll go straight to mmap > for allocations and those will be page aligned, but you'll pay the > performance cost for those allocations. This is not correct. Due to the glibc malloc header, the pointers returned to applications will NOT be page-aligned. Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
* Re: page alignment for large malloc requests? From: Carlos O'Donell @ 2020-10-06 1:52 UTC To: Florian Weimer, Carlos O'Donell via Libc-alpha; Cc: Andreas Dilger On 10/5/20 3:30 AM, Florian Weimer wrote: > * Carlos O'Donell via Libc-alpha: > >> If you lower the malloc mmap threshold then you'll go straight to mmap >> for allocations and those will be page aligned, but you'll pay the >> performance cost for those allocations. > > This is not correct. Due to the glibc malloc header, the pointers > returned to applications will NOT be page-aligned. You are absolutely right, I wasn't thinking straight here. You'd need two pages for this to work, and that's really bad fragmentation again. We would need to have a completely distinct allocator to make this work, and at that point we're back to the discussion we had earlier about arenas with "types" of memory [1]. -- Cheers, Carlos. [1] https://sourceware.org/pipermail/libc-alpha/2020-September/117864.html
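The effect of the malloc header on the returned pointer can be observed with a small probe. This is only a sketch: `page_offset` and `report_malloc_alignment` are hypothetical helper names, the request sizes are arbitrary, and the offsets printed depend on the malloc implementation and architecture, so no particular output is implied.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Byte offset of `p` within its page; 0 means the pointer is page-aligned. */
static size_t page_offset(const void *p, size_t page_size)
{
    /* page_size must be a non-zero power of two for the mask to be valid. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size_t)((uintptr_t)p & (page_size - 1));
}

/* Print the in-page offset of the pointer returned for a few large requests. */
static void report_malloc_alignment(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t sizes[] = { 4096, (size_t)1 << 20, (size_t)16 << 20 };

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        void *p = malloc(sizes[i]);
        if (p == NULL)
            continue;
        printf("malloc(%zu) -> offset %zu within page\n",
               sizes[i], page_offset(p, page));
        free(p);
    }
}
```

A non-zero offset for the large requests would be consistent with the explanation above: even when the underlying memory comes from a page-aligned mapping, the pointer handed to the application sits after the allocator's own header.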
* Re: page alignment for large malloc requests? From: Andreas Dilger @ 2020-10-05 21:19 UTC To: Carlos O'Donell, Florian Weimer; Cc: libc-alpha On Oct 4, 2020, at 1:38 PM, Carlos O'Donell <carlos@redhat.com> wrote: > > On 10/3/20 4:31 PM, Andreas Dilger wrote: >> How easy/hard would it be for glibc malloc to automatically align larger allocations >> (e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that >> they are always properly aligned for O_DIRECT IO? > > Could you define what you mean by "easy" or "hard?" I'm a kernel developer, and I'm not familiar with the glibc code at all, so I was hoping to get some idea from people knowledgeable in this code on whether it is possible for a glibc newbie, and whether it is practical and desirable to make such a change. Maybe even "someone wrote an old patch for doing this, but ...". > It is *technically* easy to make such a change. > > The deeper question is: Is it worth the increased RSS cost and memory fragmentation > to support an "everything is ready for O_DIRECT IO" use case? > > You would be increasing fragmentation by not allowing the allocator to pick a chunk > that did not meet the alignment requirement. I was hoping that adding a new allocation policy to glibc might be similar to allocating a new slab in the kernel, but with a different allocation policy. I've read about "arenas" in the mallinfo(3) and malloc_info(3), but I don't really have much idea of the details of "arenas", and reading StackOverflow and other postings such as https://heap-exploitation.dhavalkapil.com/diving_into_glibc_heap/bins_chunks are only really second-hand speculation. 
If allocations are PAGE_SIZE multiples and/or powers-of-two, then I'd _think_ having a separate "arena" (if I'm using it correctly) that keeps these allocations aligned on the PAGE_SIZE boundaries would not result in much overhead or fragmentation. These would not IMHO be mixed in with small or random-sized allocations to keep the allocations CPU/RAM efficient. >> I _thought_ that was already being done by default, but much to my surprise that was >> not the case. For improved IO efficiency, I was looking at whether it would be possible >> to transparently avoid doing a user->kernel data copy during large write() calls and >> just submitting the IO directly to underlying flash storage, but since the input buffers >> are not aligned properly, this isn't possible. > > If you lower the malloc mmap threshold then you'll go straight to mmap for allocations > and those will be page aligned, but you'll pay the performance cost for those allocations. > > It might be possible to create a tunable to raise the minimum chunk alignment to any > arbitrary size, but you'd have a unique system with that setting set that high. > >> I'm of course aware of posix_memalign(), but I was wondering about "normal" applications >> that are written by users that don't know anything about this, and just allocate memory >> and use it to submit IO. > > Are you injecting O_DIRECT into the application open*() calls? No, because we can't know in advance whether all IO on the fd will be properly aligned and sized for O_DIRECT. This would be decided on a per-write*() syscall basis, if it is large enough and aligned properly. We saw good improvements for some IO benchmarks with automatic O_DIRECTifying large write calls, but then realized that even large malloc() calls do not align on more than a 16-byte boundary (256 bytes on macOS because of alignment for vector ops). I'd assumed that large glibc allocations would be page-aligned, since kernel allocations of PAGE_SIZE multiples are aligned on a PAGE_SIZE boundary. 
> Can you please explain a bit more about your use case? > > My opinion is that if the application needs to use O_DIRECT it should use > aligned_alloc() or any of the suitable aligned allocation functions. My goal is to try and optimize application IO performance (with improved speed and/or reduced CPU usage) without having to rewrite every application and library. While it is possible to fix common tools like "dd" and "copy" and "tar", it is not practical to do this on a wider scale, and new code is continually being written by users that are not at all familiar with good IO, so that is a losing battle. >> I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap >> would be very efficient for malloc, but I am not really familiar with the details. > > It is workload dependent. In general this will lead to huge fragmentation at the page > size, and for 64KiB page architectures there is going to be a lot of fragmentation. Any mention of "4KB" is really meant to be PAGE_SIZE, since it would be the internal alignment requirements to avoid data copies anyway. On Oct 5, 2020, at 1:29 AM, Florian Weimer <fweimer@redhat.com> wrote: > > Andreas Dilger wrote: >> I'm of course aware of posix_memalign(), but I was wondering about >> "normal" applications that are written by users that don't know >> anything about this, and just allocate memory and use it to submit IO. > > posix_memalign is currently rather broken: If you free an allocation, in > many cases, glibc malloc cannot return the same allocation for a > subsequent posix_memalign request with the same parameters. I guess > this is a form of fragmentation. It can lead to poor memory utilization > with allocation patterns that happen in practice. > > We have an idea to fix this (by making posix_memalign slower > unfortunately), but the patch hasn't been upstreamed & reviewed in > months. This should give you an idea how difficult it is to make malloc > changes. That is unfortunate. 
I'd _think_ that segregating large/aligned malloc could be done efficiently, especially if there was a higher demand for them if they were enabled by default. That wouldn't affect the most common case of small malloc. Essentially, there would be a dedicated area for the "auto-posix_memalign()" behavior of malloc() with large allocations. Cheers, Andreas
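The per-write decision described earlier in this message ("large enough and aligned properly") could be expressed roughly as in the sketch below. `direct_write_candidate` is a hypothetical name, the alignment and minimum size are caller-supplied assumptions, and the file-offset check is an added assumption based on the usual O_DIRECT constraints rather than something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical predicate: could this write bypass the page cache?
 *   buf, len  - user buffer and length passed to write()
 *   file_off  - file offset of the write (assumed to need alignment too)
 *   align     - required alignment, a power of two (e.g. the page size)
 *   min_len   - smallest write considered "large enough"
 */
static bool direct_write_candidate(const void *buf, size_t len,
                                   uint64_t file_off, size_t align,
                                   size_t min_len)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    if (len < min_len)
        return false;                     /* too small to be worth it */
    if (((uintptr_t)buf & (align - 1)) != 0)
        return false;                     /* buffer address misaligned */
    if ((len & (align - 1)) != 0)
        return false;                     /* length not a multiple of align */
    if ((file_off & (uint64_t)(align - 1)) != 0)
        return false;                     /* file offset misaligned */
    return true;
}
```

With buffers from an ordinary malloc(), the address check is the one expected to fail, which is the problem this thread is about.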
* Re: page alignment for large malloc requests? From: DJ Delorie @ 2020-10-06 0:43 UTC To: Andreas Dilger; Cc: carlos, fweimer, libc-alpha Andreas Dilger <adilger@dilger.ca> writes: > I was hoping that adding a new allocation policy to glibc might be > similar to allocating a new slab in the kernel, but with a different > allocation policy. The fundamental problem with trying this is that the ancient APIs we use for memory allocation give almost no clues to us about what policies make sense. Most malloc implementations use heuristics and benchmarks to decide what the "right" thing to do is. Additionally, most standards-compliant programs have little need of alignment other than for base types (int, long, etc). > I've read about "arenas" in the mallinfo(3) and malloc_info(3), See https://sourceware.org/glibc/wiki/MallocInternals Arenas are just memory pools we can create, resize, and discard as needed. There's no correlation between arenas and policies. I've thought about doing something along those lines though, with each arena having a set of internal hooks for alloc/realloc/free and their own policies and optimizations. For example, a "page arena" would need to have metadata stored separately so that the allocations could be doled out one per page, without waste. Our current algorithm stores metadata between allocations, which would waste a whole page per page-aligned allocation. > That is unfortunate. I'd _think_ that segregating large/aligned malloc > could be done efficiently, especially if there was a higher demand for > them if they were enabled by default. That wouldn't affect the most > common case of small malloc. Essentially, there would be a dedicated > area for the "auto-posix_memalign()" behavior of malloc() with large > allocations. 
We segregate large allocations into mmap-per-alloc, but applications which need such large aligned allocations should call mmap directly anyway.
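A minimal sketch of the "call mmap directly" approach, assuming Linux-style anonymous mappings (`MAP_ANONYMOUS`). `map_buffer` and `unmap_buffer` are hypothetical names. Mappings returned by the kernel start on a page boundary, so no allocator header precedes the buffer.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: obtain `len` bytes of zero-filled, page-aligned
 * memory straight from the kernel. Returns NULL on failure.
 */
static void *map_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Release a buffer from map_buffer(); `len` must match the original request. */
static int unmap_buffer(void *p, size_t len)
{
    assert(p != NULL);
    return munmap(p, len);
}
```

The trade-off is that the caller has to track the length for munmap(), and each buffer costs at least one system call to create and one to release.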
* Re: page alignment for large malloc requests? From: Florian Weimer @ 2020-10-06 7:41 UTC To: Andreas Dilger; Cc: Carlos O'Donell, libc-alpha * Andreas Dilger: > No, because we can't know in advance whether all IO on the fd will > be properly aligned and sized for O_DIRECT. This would be decided > on a per-write*() syscall basis, if it is large enough and aligned > properly. We saw good improvements for some IO benchmarks with > automatic O_DIRECTifying large write calls, […] I do not think this would result in an overall benefit. Have you tweaked the kernel to evict the modified pages from the page cache after large writes, to fake the impact of the change? And run some typical tasks with that, like a kernel build? I expect that performance will not be great because most written data is read again after a fairly short time. Thanks, Florian
* Re: page alignment for large malloc requests? From: Andreas Dilger @ 2020-10-06 20:24 UTC To: Florian Weimer; Cc: Carlos O'Donell, libc-alpha On Oct 6, 2020, at 1:41 AM, Florian Weimer <fweimer@redhat.com> wrote: > > * Andreas Dilger: > >> No, because we can't know in advance whether all IO on the fd will >> be properly aligned and sized for O_DIRECT. This would be decided >> on a per-write*() syscall basis, if it is large enough and aligned >> properly. We saw good improvements for some IO benchmarks with >> automatic O_DIRECTifying large write calls, […] > > I do not think this would result in an overall benefit. Have you > tweaked the kernel to evict the modified pages from the page cache after > large writes, to fake the impact of the change? And run some typical > tasks with that, like a kernel build? > > I expect that performance will not be great because most written data is > read again after a fairly short time. That really depends heavily on the IO workload. While a kernel build is a common workload for kernel or GCC developers, there are lots of other workloads (e.g. machine learning, weather forecasting, fluid dynamics, CGI, video streaming, etc.) that read/write large chunks from/to disk. After some of the comments here I thought that larger allocations might be naturally aligned due to mmap, but I checked allocations up to 16MiB and none of them were page-aligned. Cheers, Andreas
* Re: page alignment for large malloc requests? From: Florian Weimer @ 2020-10-07 7:06 UTC To: Andreas Dilger; Cc: libc-alpha * Andreas Dilger: > On Oct 6, 2020, at 1:41 AM, Florian Weimer <fweimer@redhat.com> wrote: >> >> * Andreas Dilger: >> >>> No, because we can't know in advance whether all IO on the fd will >>> be properly aligned and sized for O_DIRECT. This would be decided >>> on a per-write*() syscall basis, if it is large enough and aligned >>> properly. We saw good improvements for some IO benchmarks with >>> automatic O_DIRECTifying large write calls, […] >> >> I do not think this would result in an overall benefit. Have you >> tweaked the kernel to evict the modified pages from the page cache after >> large writes, to fake the impact of the change? And run some typical >> tasks with that, like a kernel build? >> >> I expect that performance will not be great because most written data is >> read again after a fairly short time. > > That really depends heavily on the IO workload. While a kernel build > is a common workload for kernel or GCC developers, there are lots of other > workloads (e.g. machine learning, weather forecasting, fluid dynamics, > CGI, video streaming, etc.) that read/write large chunks from/to disk. Still it's unclear which of them actually benefit from forcing a write-read cycle to go through persistent storage for the read. The page cache experiment I suggested would allow you to determine that without having to change userspace at first. Thanks, Florian
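The experiment suggested in this message is a kernel change. A rough userspace approximation, useful only for a quick look at a single program, is to flush each large write and then ask the kernel to drop the written pages. The sketch below swaps in fdatasync() plus posix_fadvise(POSIX_FADV_DONTNEED) for that purpose. This is not the experiment as proposed: the advice is only a hint, and the forced flush changes write timing in ways a kernel-side eviction would not. `write_then_evict` is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes to `fd`, flush them to storage,
 * then advise the kernel that the file's cached pages are no longer needed.
 * Returns 0 on success or an error number on failure.
 */
static int write_then_evict(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            return errno;
        }
        p += n;
        len -= (size_t)n;
    }
    /* Dirty pages cannot be dropped, so flush them first. */
    if (fdatasync(fd) != 0)
        return errno;
    /* offset 0, len 0 covers the whole file; returns an error number. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

Reading the data back afterwards and timing it would give a first impression of how much a workload depends on recently written data still being in the page cache.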
* Re: page alignment for large malloc requests? From: Florian Weimer @ 2020-10-05 7:29 UTC To: Andreas Dilger; Cc: libc-alpha * Andreas Dilger: > How easy/hard would it be for glibc malloc to automatically align > larger allocations (e.g. say 4KB+ that are also multiples of 4KB) to a > page address boundary, so that they are always properly aligned for > O_DIRECT IO? The current data structures are not set up for this at all, so it's difficult to do this and retain good performance. > I'm of course aware of posix_memalign(), but I was wondering about > "normal" applications that are written by users that don't know > anything about this, and just allocate memory and use it to submit IO. posix_memalign is currently rather broken: If you free an allocation, in many cases, glibc malloc cannot return the same allocation for a subsequent posix_memalign request with the same parameters. I guess this is a form of fragmentation. It can lead to poor memory utilization with allocation patterns that happen in practice. We have an idea to fix this (by making posix_memalign slower unfortunately), but the patch hasn't been upstreamed & reviewed in months. This should give you an idea how difficult it is to make malloc changes. Thanks, Florian